<a href="https://colab.research.google.com/github/mtpradoc/BigQueryAPI/blob/main/05_AS_%26_WITH_Dataset_Chicago_Taxi_Trips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#Programmatically connect to BigQuery

To use a public dataset in BigQuery, we first need to authenticate programmatically to Google Cloud Platform (GCP).

##1. Authenticate to GCP

In [1]:
from google.colab import auth
auth.authenticate_user()

Let's specify which `project_id` we are going to use. It can be any GCP project you have access to; the queries we run are billed to this project.

In [2]:
project_id = "glossy-mason-326213"

##2. Connect to the BigQuery API

In [3]:
from google.cloud import bigquery

In [4]:
client = bigquery.Client(project=project_id)

##3. Access the Dataset (public or private)

Let's create a reference to the dataset we are going to work with, together with the project that hosts it.

In [5]:
serverdb = "bigquery-public-data"
db = "chicago_taxi_trips"

In [6]:
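# Build a reference to the dataset inside the project that hosts it
# (client.dataset() is deprecated in recent google-cloud-bigquery releases).
dataset_ref = bigquery.DatasetReference(serverdb, db)

# Fetch the dataset metadata through the API
dataset = client.get_dataset(dataset_ref)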

##4. List your tables

In [7]:
tables = list(client.list_tables(dataset))

for table in tables:
  print(table.table_id)

taxi_trips


##5. Check the table schema

In [8]:
table_ref = dataset_ref.table("taxi_trips")
table = client.get_table(table_ref)

table.schema

[SchemaField('unique_key', 'STRING', 'REQUIRED', 'Unique identifier for the trip.', ()),
 SchemaField('taxi_id', 'STRING', 'REQUIRED', 'A unique identifier for the taxi.', ()),
 SchemaField('trip_start_timestamp', 'TIMESTAMP', 'NULLABLE', 'When the trip started, rounded to the nearest 15 minutes.', ()),
 SchemaField('trip_end_timestamp', 'TIMESTAMP', 'NULLABLE', 'When the trip ended, rounded to the nearest 15 minutes.', ()),
 SchemaField('trip_seconds', 'INTEGER', 'NULLABLE', 'Time of the trip in seconds.', ()),
 SchemaField('trip_miles', 'FLOAT', 'NULLABLE', 'Distance of the trip in miles.', ()),
 SchemaField('pickup_census_tract', 'INTEGER', 'NULLABLE', 'The Census Tract where the trip began. For privacy, this Census Tract is not shown for some trips.', ()),
 SchemaField('dropoff_census_tract', 'INTEGER', 'NULLABLE', 'The Census Tract where the trip ended. For privacy, this Census Tract is not shown for some trips.', ()),
 SchemaField('pickup_community_area', 'INTEGER', 'NULLABLE', '

##5. Show your data in a dataframe

In [10]:
# Fetch the first 500 rows of the table and display them as a DataFrame
df = client.list_rows(table, max_results=500).to_dataframe()
df

Unnamed: 0,unique_key,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,pickup_location,dropoff_latitude,dropoff_longitude,dropoff_location
0,fc0c0cb1d6d7080a3f3d97c443baab6a81959d02,823e84fe16b6c574b4d68ea21879f760fcaa733b867e31...,2016-01-19 10:15:00+00:00,2016-01-19 10:15:00+00:00,457.0,1.7,,,,,6.60,2.00,,0.0,8.60,Credit Card,303 Taxi,,,,,,
1,9c2ce7b2be6a3c5446233532a8d92c91a8b10fd0,e4d62c5f130dee786b1d1515868b32e04df99df544b98e...,2017-01-26 09:30:00+00:00,2017-01-26 09:45:00+00:00,1260.0,12.6,,,,,32.75,0.00,0.0,5.0,37.75,Cash,Chicago Independents,,,,,,
2,2c7254540b7e9cb659d45c6893f3707dd349a546,01d4efe447227b73918ec67e1cf9f29be1575b44d257b6...,2013-01-03 16:15:00+00:00,2013-01-03 16:15:00+00:00,60.0,0.0,,,,,45.05,5.00,0.0,0.0,50.05,Credit Card,Taxi Affiliation Services,,,,,,
3,8527bf0132f1d4a8ffbd53a34ac646974444c872,c02e4b7d55d11b15841c4ff678e4057182b1b0f61830b0...,2013-01-08 13:00:00+00:00,2013-01-08 12:45:00+00:00,,0.0,,,,,8.45,3.00,0.0,0.0,11.45,Credit Card,Taxi Affiliation Services,,,,,,
4,435d7372da8158aaafd20f787c12265b5d36df14,e4d62c5f130dee786b1d1515868b32e04df99df544b98e...,2017-01-26 09:45:00+00:00,2017-01-26 09:45:00+00:00,0.0,0.0,,,,,55.00,11.00,0.0,0.0,66.50,Credit Card,Chicago Independents,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,8b873ab9932d9fe1dd0d8826b371dddf4a8ec2b0,6c7371eb968c9f935283892afa470cdd021be65b50266f...,2017-01-25 09:15:00+00:00,2017-01-25 09:30:00+00:00,540.0,1.4,,,,,7.75,2.00,0.0,0.0,10.25,Credit Card,City Service,,,,,,
496,47edaef5433bda15ac937abbe74dbe7f09cf7bd3,ad4e285cc063909f5c4550bbc3b71f1c6672b9967c4868...,2013-01-09 23:00:00+00:00,2013-01-09 23:00:00+00:00,60.0,0.0,,,,,68.95,10.00,0.0,0.0,78.95,Credit Card,Taxi Affiliation Services,,,,,,
497,fdc2dae39333cfb293ee03d8b24b2169f2f6484e,64375e2b26ec87f4da6e0c3b8887d949b8e91e9072a428...,2013-01-14 12:00:00+00:00,2013-01-14 12:00:00+00:00,,0.0,,,,,8.05,3.00,0.0,0.0,11.05,Credit Card,Taxi Affiliation Services,,,,,,
498,288c4e8a9337bd462fbe385900ae86e1ae80a834,14d1d504a12a6fc73613a86d9c6c6bdce328aa23d60df4...,2016-01-22 17:15:00+00:00,2016-01-22 17:30:00+00:00,854.0,4.4,,,,,13.60,2.04,,0.0,15.64,Credit Card,303 Taxi,,,,,,


##6. Explore your data

If the data is sufficiently old, we should be careful before assuming it is still relevant to traffic patterns today. Write a query that counts the number of trips in each year.

Your results should have two columns:

* year - the year of the trips
* num_trips - the number of trips in that year
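
To make the expected shape of the result concrete, here is a minimal pure-Python sketch of what a "count per year" grouping computes. It uses five start times copied from the sample rows shown earlier rather than the full table, and the names `sample_trip_starts` and `count_trips_per_year` are illustrative only; they are not part of the BigQuery API.

```python
from collections import Counter
from datetime import datetime

# A few trip start times copied from the sample rows displayed above.
sample_trip_starts = [
    datetime(2013, 1, 3, 16, 15),
    datetime(2013, 1, 8, 13, 0),
    datetime(2016, 1, 19, 10, 15),
    datetime(2017, 1, 26, 9, 30),
    datetime(2017, 1, 26, 9, 45),
]


def count_trips_per_year(trip_starts):
    """Return (year, num_trips) pairs sorted by year.

    This mirrors grouping by the extracted year, counting rows,
    and ordering the groups by year in ascending order.
    """
    counts = Counter(ts.year for ts in trip_starts)
    return sorted(counts.items())


for year, num_trips in count_trips_per_year(sample_trip_starts):
    print(year, num_trips)
```

Each tuple corresponds to one row of the two-column result described above: the year first, then the number of trips in that year.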

In [None]:
rides_per_year_query = """
                          SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS YEAR, 
                                 COUNT(1) AS num_trips
                          FROM
                                `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                          GROUP BY year
                          ORDER BY year ASC
                       """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_year_query_job = client.query(rides_per_year_query, job_config=safe_config)

rides_per_year_result = rides_per_year_query_job.to_dataframe()

print(rides_per_year_result.head())


   YEAR  num_trips
0  2013   27217716
1  2014   37395436
2  2015   32385875
3  2016   31759339
4  2017   24988003


You'd like to take a closer look at rides from 2017.  Copy the query you used above in `rides_per_year_query` into the cell below for `rides_per_month_query`.  Then modify it in two ways:
1. Use a **WHERE** clause to limit the query to data from 2017.
2. Modify the query to extract the month rather than the year.
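
The two modifications can be tried locally before spending BigQuery quota. The sketch below is a stand-in, not the BigQuery query itself: it runs against an in-memory SQLite table with four made-up rows, and because SQLite has no `EXTRACT`, it substitutes `strftime` for the year and month extraction. The structure (filter with **WHERE** first, then group by the extracted month) is the part that carries over.

```python
import sqlite3

# In-memory stand-in for the taxi_trips table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_trips (trip_start_timestamp TEXT)")
conn.executemany(
    "INSERT INTO taxi_trips VALUES (?)",
    [
        ("2016-12-31 23:45:00",),  # excluded by the WHERE clause
        ("2017-01-25 09:15:00",),
        ("2017-01-26 09:30:00",),
        ("2017-02-01 08:00:00",),
    ],
)

# strftime replaces BigQuery's EXTRACT(... FROM ...) in this SQLite sketch.
monthly_query = """
    SELECT CAST(strftime('%m', trip_start_timestamp) AS INTEGER) AS month,
           COUNT(1) AS num_trips
    FROM taxi_trips
    WHERE CAST(strftime('%Y', trip_start_timestamp) AS INTEGER) = 2017
    GROUP BY month
    ORDER BY month ASC
"""

monthly_counts = conn.execute(monthly_query).fetchall()
conn.close()

print(monthly_counts)
```

The 2016 row is dropped by the filter before any grouping happens, so only months from 2017 appear in the result.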

In [None]:
rides_per_month_query = """
                           SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month, 
                                  COUNT(1) AS num_trips
                           FROM
                                  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                           WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2017
                           GROUP BY month
                           ORDER BY month ASC
                        """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_job = client.query(rides_per_month_query, job_config=safe_config)

rides_per_month_result = rides_per_month_job.to_dataframe()

print(rides_per_month_result.head())
