<a href="https://colab.research.google.com/github/Frenz86/BigQueryAPI/blob/main/00_Access_Dataset_Chicago_News.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#Programmatically connect to BigQuery

To use the public openaq (Open Air Quality) dataset in BigQuery, we first need to authenticate programmatically to Google Cloud Platform (GCP).

##1. Authenticate to GCP

In [1]:
from google.colab import auth
auth.authenticate_user()
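
`google.colab` is only importable inside Colab, so the cell above fails elsewhere. Here is a minimal sketch of a guard that lets the same notebook run in other environments. It assumes Application Default Credentials are already configured outside Colab; `running_in_colab` and `authenticate` are hypothetical helper names, not part of any library.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ImportError:
        # Raised when the parent "google" package is not installed at all.
        return False


def authenticate() -> None:
    """Authenticate interactively in Colab; do nothing elsewhere."""
    if running_in_colab():
        from google.colab import auth

        auth.authenticate_user()
    # Assumption: outside Colab, credentials are already set up, for
    # example with `gcloud auth application-default login`.
```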

Next, specify the `project_id` to use. It can be any project you have access to; it is the project that query jobs run under and are billed to.

In [4]:
project_id = 'glossy-mason-326213'
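
Hard-coding the project ID ties the notebook to one account. One possible alternative, sketched below, reads it from the `GOOGLE_CLOUD_PROJECT` environment variable and falls back to the value used in this notebook. `resolve_project_id` is a hypothetical helper, and relying on that environment variable is an assumption about how your environment is configured.

```python
import os

DEFAULT_PROJECT_ID = "glossy-mason-326213"  # the project used in this notebook


def resolve_project_id(default: str = DEFAULT_PROJECT_ID) -> str:
    """Prefer GOOGLE_CLOUD_PROJECT when it is set and non-empty."""
    return os.environ.get("GOOGLE_CLOUD_PROJECT") or default


project_id = resolve_project_id()
```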

##2. Connect to the BigQuery API

In [5]:
from google.cloud import bigquery

In [6]:
client = bigquery.Client(project=project_id)

##3. Access the Dataset (public or private)

Let's build a reference to the dataset we are going to work with, together with the project that hosts it.

In [7]:
serverdb = "bigquery-public-data"
db = "openaq"

In [9]:
# client.dataset() is deprecated; build the reference directly instead.
dataset_ref = bigquery.DatasetReference(serverdb, db)

dataset = client.get_dataset(dataset_ref)
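
`get_dataset` and `get_table` also accept fully qualified string IDs, which avoids building reference objects by hand. The sketch below only assembles those strings; `qualified_id` is a hypothetical helper, and the commented lines show how the strings would be used with an authenticated client.

```python
def qualified_id(*parts: str) -> str:
    """Join project, dataset and (optionally) table names with dots."""
    if not parts or not all(parts):
        raise ValueError("every part of the ID must be a non-empty string")
    return ".".join(parts)


dataset_id = qualified_id("bigquery-public-data", "openaq")
table_id = qualified_id("bigquery-public-data", "openaq", "global_air_quality")

# With an authenticated client these strings can be passed directly:
#   dataset = client.get_dataset(dataset_id)
#   table = client.get_table(table_id)
```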

##4. List your tables

In [10]:
table_ref = dataset_ref.table("global_air_quality")
table = client.get_table(table_ref)
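
The cell above fetches a single table by name. To see which tables a dataset actually contains, `client.list_tables` can be iterated. Here is a minimal sketch; `list_table_ids` is a hypothetical helper name.

```python
def list_table_ids(client, dataset) -> list:
    """Return the sorted table names found in `dataset`.

    `client` is expected to be a bigquery.Client. Each item yielded by
    client.list_tables() exposes the table name as `.table_id`.
    """
    return sorted(item.table_id for item in client.list_tables(dataset))


# Usage with the objects defined in this notebook:
#   print(list_table_ids(client, dataset))
```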

##5. Check the table schema

In [11]:
table.schema

[SchemaField('location', 'STRING', 'NULLABLE', 'Location where data was measured', ()),
 SchemaField('city', 'STRING', 'NULLABLE', 'City containing location', ()),
 SchemaField('country', 'STRING', 'NULLABLE', 'Country containing measurement in 2 letter ISO code', ()),
 SchemaField('pollutant', 'STRING', 'NULLABLE', 'Name of the Pollutant being measured. Allowed values: PM25, PM10, SO2, NO2, O3, CO, BC', ()),
 SchemaField('value', 'FLOAT', 'NULLABLE', 'Latest measured value for the pollutant', ()),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'The datetime at which the pollutant was measured, in ISO 8601 format', ()),
 SchemaField('unit', 'STRING', 'NULLABLE', 'The unit the value was measured in coded by UCUM Code', ()),
 SchemaField('source_name', 'STRING', 'NULLABLE', 'Name of the source of the data', ()),
 SchemaField('latitude', 'FLOAT', 'NULLABLE', 'Latitude in decimal degrees. Precision >3 decimal points.', ()),
 SchemaField('longitude', 'FLOAT', 'NULLABLE', 'Longitude in d
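
The raw `SchemaField` repr is hard to scan. A small sketch that prints one line per column, using the `name`, `field_type` and `mode` attributes of each field; `describe_schema` and `print_schema` are hypothetical helper names.

```python
def describe_schema(schema):
    """Return (name, type, mode) tuples for each field in a table schema."""
    return [(field.name, field.field_type, field.mode) for field in schema]


def print_schema(schema) -> None:
    """Print the schema as aligned columns, one field per line."""
    for name, field_type, mode in describe_schema(schema):
        print(f"{name:<24} {field_type:<10} {mode}")


# Usage with the table fetched earlier:
#   print_schema(table.schema)
```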

##5. Show your data in a dataframe

In [13]:
df = client.list_rows(table, max_results=500).to_dataframe()
df

Unnamed: 0,location,city,country,pollutant,value,timestamp,unit,source_name,latitude,longitude,averaged_over_in_hours
0,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,co,910.00000,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.609220,0.25
1,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,no2,131.87000,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.609220,0.25
2,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,o3,15.57000,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.609220,0.25
3,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,pm25,45.62000,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.609220,0.25
4,"BTM Layout, Bengaluru - KSPCB",Bengaluru,IN,so2,4.49000,2018-02-22 03:00:00+00:00,µg/m³,CPCB,12.912811,77.609220,0.25
...,...,...,...,...,...,...,...,...,...,...,...
495,Zgorzelec - Bohaterów Getta,Zgorzelec,PL,bc,0.32719,2020-06-11 01:00:00+00:00,µg/m³,GIOS,51.150390,15.008175,
496,Zgorzelec - Bohaterów Getta,Zgorzelec,PL,co,274.63600,2019-12-31 22:00:00+00:00,µg/m³,GIOS,51.150390,15.008175,
497,Zgorzelec - Bohaterów Getta,Zgorzelec,PL,no2,7.39891,2019-12-31 22:00:00+00:00,µg/m³,GIOS,51.150390,15.008175,
498,Zgorzelec - Bohaterów Getta,Zgorzelec,PL,so2,5.56383,2019-12-31 22:00:00+00:00,µg/m³,GIOS,51.150390,15.008175,
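
`list_rows` downloads every column by default. When only a few are needed, its `selected_fields` argument takes a subset of the table schema. A sketch of choosing that subset by column name; `pick_fields` is a hypothetical helper.

```python
def pick_fields(schema, names):
    """Return the schema fields named in `names`, in schema order."""
    wanted = set(names)
    missing = wanted - {field.name for field in schema}
    if missing:
        raise KeyError(f"columns not in schema: {sorted(missing)}")
    return [field for field in schema if field.name in wanted]


# Usage with the table fetched earlier:
#   fields = pick_fields(table.schema, ["city", "country", "pollutant", "value"])
#   df = client.list_rows(table, selected_fields=fields, max_results=500).to_dataframe()
```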


##6. Explore your data

Which countries have reported pollution levels in units of "ppm"? In the code cell below, set first_query to an SQL query that pulls the appropriate entries from the country column.

In [14]:
first_query = """
              SELECT country
              FROM `bigquery-public-data.openaq.global_air_quality`
              WHERE unit = 'ppm'
              """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
first_query_job = client.query(first_query, job_config=safe_config)

first_results = first_query_job.to_dataframe()

print(first_results.head())

  country
0      US
1      US
2      US
3      US
4      US
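
The output repeats `US` because the query returns one row per matching measurement, not one row per country. A sketch of a variant that asks BigQuery for each country only once with `SELECT DISTINCT`; `distinct_query` and `run_query` are hypothetical names, and the byte cap mirrors the cell above.

```python
distinct_query = """
    SELECT DISTINCT country
    FROM `bigquery-public-data.openaq.global_air_quality`
    WHERE unit = 'ppm'
"""


def run_query(client, sql, max_bytes=10**10):
    """Run `sql` with a billing cap and return the result as a DataFrame."""
    # Imported here so this sketch loads even without the library installed.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return client.query(sql, job_config=config).to_dataframe()


# Usage with the authenticated client from this notebook:
#   countries = run_query(client, distinct_query)
```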


Which pollution levels were reported to be exactly 0?

Set zero_pollution_query to select all columns of the rows where the value column is 0.
Set zero_pollution_results to a pandas DataFrame containing the query results.

In [15]:
zero_pollution_query = """
                       SELECT *
                       FROM `bigquery-public-data.openaq.global_air_quality`
                       WHERE value = 0
                       """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(zero_pollution_query, job_config=safe_config)

zero_pollution_results = query_job.to_dataframe()

print(zero_pollution_results.head())

                                        location  ... averaged_over_in_hours
0                     Victoria Memorial - WBSPCB  ...                   0.25
1  Rabindra Bharati University, Kolkata - WBSPCB  ...                   0.25
2                   Zamość ul. Hrubieszowska 69A  ...                    NaN
3                               Końskie, MOBILNA  ...                    NaN
4                               Końskie, MOBILNA  ...                    NaN

[5 rows x 11 columns]
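
Both queries pass `maximum_bytes_billed=10**10`, which makes BigQuery fail the job instead of billing more than that many bytes. A bare exponent is easy to misread, so here is a small sketch that renders a cap in decimal units; `format_bytes` is a hypothetical helper.

```python
def format_bytes(n: int) -> str:
    """Render a byte count using decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit available.
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(format_bytes(10**10))  # the cap used in this notebook: 10.0 GB
```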


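The `head()` output for the zero-pollution query hides the middle columns behind `...`. To display all of them, raise the pandas `display.max_columns` option: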

In [16]:
import pandas as pd
pd.set_option('display.max_columns', 500)

In [17]:
# The DataFrame from In [15] is still in memory, so the query does not
# need to run again; printing it now picks up the new display option.
print(zero_pollution_results.head())

                                        location     city country pollutant  \
0                     Victoria Memorial - WBSPCB  Kolkata      IN      pm25   
1  Rabindra Bharati University, Kolkata - WBSPCB  Kolkata      IN       so2   
2                   Zamość ul. Hrubieszowska 69A   Zamość      PL       no2   
3                               Końskie, MOBILNA  Końskie      PL      pm10   
4                               Końskie, MOBILNA  Końskie      PL      pm25   

   value                 timestamp   unit source_name   latitude  longitude  \
0    0.0 2017-10-16 20:45:00+00:00  µg/m³        CPCB  22.572645  88.363890   
1    0.0 2017-10-28 14:30:00+00:00  µg/m³        CPCB  22.627874  88.380400   
2    0.0 2020-05-19 05:00:00+00:00  µg/m³        GIOS  50.716630  23.290247   
3    0.0 2018-12-21 13:00:00+00:00  µg/m³        GIOS  51.189526  20.408892   
4    0.0 2018-12-21 13:00:00+00:00  µg/m³        GIOS  51.189526  20.408892   

   averaged_over_in_hours  
0                    0