<a href="https://colab.research.google.com/github/mtpradoc/BigQueryAPI/blob/main/05_AS_%26_WITH_Dataset_Chicago_Taxi_Trips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#Programmatically connect to BigQuery

To use a public dataset in BigQuery, we first need to authenticate programmatically to Google Cloud Platform (GCP).

##1. Authenticate to GCP

In [None]:
from google.colab import auth
auth.authenticate_user()

Let's specify which project_id we are going to use. It can be any project you have access to.

In [None]:
project_id = "hazel-env-310501"
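
If you would rather not hard-code the ID in the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `resolve_project_id` is a hypothetical name, and it assumes the ID is stored in `GOOGLE_CLOUD_PROJECT`, a variable name commonly used by Google Cloud tooling.

```python
import os


def resolve_project_id(default=None, env_var="GOOGLE_CLOUD_PROJECT"):
    """Return the project ID from the environment, or fall back to `default`.

    Hypothetical helper: the environment variable name is an assumption.
    """
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    if default is None:
        raise ValueError(f"Set {env_var} or pass a default project ID.")
    return default


# Falls back to the ID used in this notebook when the variable is not set.
project_id = resolve_project_id(default="hazel-env-310501")
```

The resulting `project_id` can then be passed to `bigquery.Client(project=project_id)` exactly as in the next step.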

##2. Connect to the BigQuery API

In [None]:
from google.cloud import bigquery

In [None]:
client = bigquery.Client(project=project_id)

##3. Access the Dataset (public or private)

Let's create a reference to the dataset we are going to work with, together with the project that hosts it.

In [None]:
# client.dataset() is deprecated in recent library versions; build the reference directly
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "chicago_taxi_trips")

dataset = client.get_dataset(dataset_ref)
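
The project and dataset names used in the reference are the same parts that appear in the SQL table path later on, written as `project.dataset.table` inside backticks. As a small sketch, the helper below (`qualified_table_id`, a hypothetical name that is not part of the BigQuery library) shows how that path is put together.

```python
def qualified_table_id(project, dataset, table):
    """Build a backtick-quoted `project.dataset.table` identifier for use in SQL.

    Hypothetical helper for illustration; not part of google-cloud-bigquery.
    """
    parts = (project, dataset, table)
    for part in parts:
        # Reject empty parts and stray backticks that would break the quoting.
        if not part or "`" in part:
            raise ValueError(f"Invalid identifier part: {part!r}")
    return "`" + ".".join(parts) + "`"


table_sql = qualified_table_id("bigquery-public-data", "chicago_taxi_trips", "taxi_trips")
print(table_sql)  # `bigquery-public-data.chicago_taxi_trips.taxi_trips`
```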

##4. List your tables

In [None]:
tables = list(client.list_tables(dataset))

for table in tables:
  print(table.table_id)

taxi_trips


##5. Check the table schema

In [None]:
table_ref = dataset_ref.table("taxi_trips")
table = client.get_table(table_ref)

table.schema

[SchemaField('unique_key', 'STRING', 'REQUIRED', 'Unique identifier for the trip.', ()),
 SchemaField('taxi_id', 'STRING', 'REQUIRED', 'A unique identifier for the taxi.', ()),
 SchemaField('trip_start_timestamp', 'TIMESTAMP', 'NULLABLE', 'When the trip started, rounded to the nearest 15 minutes.', ()),
 SchemaField('trip_end_timestamp', 'TIMESTAMP', 'NULLABLE', 'When the trip ended, rounded to the nearest 15 minutes.', ()),
 SchemaField('trip_seconds', 'INTEGER', 'NULLABLE', 'Time of the trip in seconds.', ()),
 SchemaField('trip_miles', 'FLOAT', 'NULLABLE', 'Distance of the trip in miles.', ()),
 SchemaField('pickup_census_tract', 'INTEGER', 'NULLABLE', 'The Census Tract where the trip began. For privacy, this Census Tract is not shown for some trips.', ()),
 SchemaField('dropoff_census_tract', 'INTEGER', 'NULLABLE', 'The Census Tract where the trip ended. For privacy, this Census Tract is not shown for some trips.', ()),
 SchemaField('pickup_community_area', 'INTEGER', 'NULLABLE', '

##6. Show your data in a dataframe

In [None]:
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,unique_key,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,pickup_location,dropoff_latitude,dropoff_longitude,dropoff_location
0,2b7543383dcf4eb2d88165f6366bfb162a80c80b,998375e02a53225a78f127e12aef428a79aa7f33ce9212...,2013-12-25 21:15:00+00:00,2013-12-25 21:15:00+00:00,120,0.0,,,,,4.05,0.0,0.0,5.0,9.05,Cash,Taxi Affiliation Services,,,,,,
1,6cd77ee8a16f8cec93e325fef7b03da4f969106d,8c508a77909d4e965c01698b799c7b25ab31d609051979...,2019-10-06 11:00:00+00:00,2019-10-06 11:00:00+00:00,240,0.9,,,,,5.5,2.0,0.0,0.0,7.5,Credit Card,Chicago Independents,,,,,,
2,6c670f5aa7c9e1a49f33690a12615d3c232e796b,a8078f80a679e11f94f21e3bc8e205025db5e17d1f204c...,2019-10-08 13:00:00+00:00,2019-10-08 13:15:00+00:00,300,1.2,,,,,6.25,1.0,0.0,0.0,7.25,Credit Card,"Taxicab Insurance Agency, LLC",,,,,,
3,6dec402bbc4010d8725bf53b0fe18bb319289a82,eb108801cfdcab102a686aa0772cbd99b03447a26b8907...,2019-10-09 15:45:00+00:00,2019-10-09 15:45:00+00:00,0,0.0,,,,,2.0,0.0,0.0,0.0,2.0,Cash,Taxi Affiliation Services,,,,,,
4,6c00353e69c88989b77a4eabfa9f1004b6a05a33,d95b99518116b5f943d75828e78c02e668ec6add7d28ba...,2019-10-19 06:30:00+00:00,2019-10-19 07:00:00+00:00,1860,14.4,,,,,38.25,10.0,0.0,0.0,48.25,Credit Card,Choice Taxi Association,,,,,,


##7. Explore your data

If the data is old, we should be careful before assuming it is still relevant to traffic patterns today. Write a query that counts the number of trips in each year.

Your results should have two columns:

* year - the year of the trips
* num_trips - the number of trips in that year
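
To make the expected shape of the result concrete, here is a minimal offline sketch of the same grouping in plain Python. It uses only the five start times from the preview rows shown earlier, so the counts say nothing about the full table; the variable names are illustrative.

```python
from collections import Counter
from datetime import datetime

# Start times of the five preview rows shown earlier in this notebook.
sample_starts = [
    datetime(2013, 12, 25, 21, 15),
    datetime(2019, 10, 6, 11, 0),
    datetime(2019, 10, 8, 13, 0),
    datetime(2019, 10, 9, 15, 45),
    datetime(2019, 10, 19, 6, 30),
]

# Plays the role of: EXTRACT(YEAR FROM ...) AS year, COUNT(1) AS num_trips ... GROUP BY year
trips_by_year = Counter(ts.year for ts in sample_starts)

# Plays the role of: ORDER BY year ASC
rows = sorted(trips_by_year.items())
for year, num_trips in rows:
    print(year, num_trips)
```

Each printed line corresponds to one row of the query result: a year and the number of trips that started in it.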

In [None]:
rides_per_year_query = """
                          SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS YEAR, 
                                 COUNT(1) AS num_trips
                          FROM
                                `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                          GROUP BY year
                          ORDER BY year ASC
                       """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_year_query_job = client.query(rides_per_year_query, job_config=safe_config)

rides_per_year_result = rides_per_year_query_job.to_dataframe()

print(rides_per_year_result.head())


   YEAR  num_trips
0  2013   27217716
1  2014   37395436
2  2015   32385875
3  2016   31759339
4  2017   24988003
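
The `maximum_bytes_billed=10**10` setting caps how much data the query may be billed for; a query that would exceed the cap fails instead of running. As a quick arithmetic check, the sketch below converts that limit into more familiar units (the helper name `describe_byte_limit` is hypothetical).

```python
def describe_byte_limit(limit_bytes):
    """Express a byte limit in decimal gigabytes and binary gibibytes."""
    gigabytes = limit_bytes / 10**9  # 1 GB = 10^9 bytes
    gibibytes = limit_bytes / 2**30  # 1 GiB = 2^30 bytes
    return gigabytes, gibibytes


gb, gib = describe_byte_limit(10**10)
print(f"{gb:.2f} GB, {gib:.2f} GiB")  # 10.00 GB, 9.31 GiB
```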


You'd like to take a closer look at rides from 2017.  Copy the query you used above in `rides_per_year_query` into the cell below for `rides_per_month_query`.  Then modify it in two ways:
1. Use a **WHERE** clause to limit the query to data from 2017.
2. Modify the query to extract the month rather than the year.
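
If you want to try the shape of this query without touching BigQuery, the sketch below runs an analogous query against an in-memory SQLite table. This is an illustration under stated assumptions: the rows are made up, and SQLite has no `EXTRACT`, so `strftime` stands in for it. The `WHERE`, `GROUP BY`, `ORDER BY` and `AS` parts follow the two modifications listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_trips (trip_start_timestamp TEXT)")

# Made-up rows for illustration only; they are not taken from the dataset.
conn.executemany(
    "INSERT INTO taxi_trips VALUES (?)",
    [
        ("2017-01-05 08:00:00",),
        ("2017-01-20 17:30:00",),
        ("2017-03-02 12:15:00",),
        ("2016-12-31 23:45:00",),  # filtered out by the WHERE clause
    ],
)

query = """
    SELECT CAST(strftime('%m', trip_start_timestamp) AS INTEGER) AS month,
           COUNT(1) AS num_trips
    FROM taxi_trips
    WHERE strftime('%Y', trip_start_timestamp) = '2017'
    GROUP BY month
    ORDER BY month ASC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2), (3, 1)]
conn.close()
```

The 2016 row is dropped before grouping, which is why only months from 2017 appear in the result.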

In [None]:
rides_per_month_query = """
                           SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month, 
                                  COUNT(1) AS num_trips
                           FROM
                                  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                           WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2017
                           GROUP BY month
                           ORDER BY month ASC
                        """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_job = client.query(rides_per_month_query, job_config=safe_config)

# Read the result of the monthly job (not the yearly one) into a dataframe
rides_per_month_result = rides_per_month_job.to_dataframe()

print(rides_per_month_result.head())
