<a href="https://colab.research.google.com/github/mtpradoc/BigQueryAPI/blob/main/04_Order_By_Dataset_World_Bank_International_Education.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#Programmatically connect to BigQuery

To use the public dataset `world_bank_intl_education` in BigQuery, we first need to authenticate programmatically to Google Cloud Platform (GCP).

##1. Authenticate to GCP

In [None]:
from google.colab import auth
auth.authenticate_user()

Next, specify the `project_id` to use. It can be any Google Cloud project you have access to; queries run, and are billed, under this project.

In [None]:
project_id = 'hazel-env-310501'

##2. Connect to the BigQuery API

In [None]:
from google.cloud import bigquery

In [None]:
client = bigquery.Client(project=project_id)

##3. Access the Dataset (public or private)

Create a reference to the dataset we are going to work with, naming both the dataset and the project that hosts it.

In [None]:
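# client.dataset() is deprecated in recent google-cloud-bigquery releases, so build the reference directly
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "world_bank_intl_education")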

dataset = client.get_dataset(dataset_ref)
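
The project and dataset named here, together with a table name, make up the fully qualified `project.dataset.table` ID that the SQL queries later in this notebook use in their `FROM` clause. As a minimal sketch, the helper below (`qualified_table_id` is a hypothetical name, not part of the BigQuery client) assembles that string. Wrapping it in backticks is the safe choice because the project name contains hyphens.

```python
def qualified_table_id(project, dataset, table):
    """Join the three parts as `project.dataset.table`, wrapped in backticks for SQL."""
    return f"`{project}.{dataset}.{table}`"


# Names taken from this notebook; the helper itself is illustrative only.
table_id = qualified_table_id(
    "bigquery-public-data", "world_bank_intl_education", "international_education"
)
print(f"SELECT COUNT(1) FROM {table_id}")
```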

##4. List your tables

In [None]:
tables = list(client.list_tables(dataset))

for table in tables:
  print(table.table_id)

country_series_definitions
country_summary
international_education
series_summary


##5. Check the table schema

In [None]:
table_ref = dataset_ref.table("international_education")

table = client.get_table(table_ref)

table.schema

[SchemaField('country_name', 'STRING', 'NULLABLE', '', ()),
 SchemaField('country_code', 'STRING', 'NULLABLE', '', ()),
 SchemaField('indicator_name', 'STRING', 'NULLABLE', '', ()),
 SchemaField('indicator_code', 'STRING', 'NULLABLE', '', ()),
 SchemaField('value', 'FLOAT', 'NULLABLE', '', ()),
 SchemaField('year', 'INTEGER', 'NULLABLE', '', ())]
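
The raw `SchemaField` list is dense to read. As a minimal sketch, the helper below (`describe_schema`, a hypothetical name that is not part of the notebook) formats any sequence of objects exposing `name`, `field_type`, and `mode` attributes, which `SchemaField` does. The `SimpleNamespace` stand-ins are an assumption made so the snippet runs without a BigQuery connection; in the notebook you would pass `table.schema` instead.

```python
from types import SimpleNamespace


def describe_schema(schema):
    """Return one 'name: TYPE (MODE)' string per field."""
    return [f"{field.name}: {field.field_type} ({field.mode})" for field in schema]


# Stand-in objects mirroring two entries of the schema printed above.
example_schema = [
    SimpleNamespace(name="country_name", field_type="STRING", mode="NULLABLE"),
    SimpleNamespace(name="value", field_type="FLOAT", mode="NULLABLE"),
]

for line in describe_schema(example_schema):
    print(line)
```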

##6. Show your data in a dataframe

In [None]:
client.list_rows(table, max_results=5).to_dataframe()

| | country_name | country_code | indicator_name | indicator_code | value | year |
|---|---|---|---|---|---|---|
| 0 | Palau | PLW | Net enrolment rate, primary, both sexes (%) | SE.PRM.NENR | 80.01433 | 2016 |
| 1 | Turkmenistan | TKM | Population of the official age for pre-primary... | SP.PRE.TOTL.IN | 305404.0 | 2016 |
| 2 | Uzbekistan | UZB | Percentage of teachers in primary education wh... | SE.PRM.TCHR.FE.ZS | 90.76502 | 2016 |
| 3 | Uzbekistan | UZB | Population of the official age for pre-primary... | SP.PRE.TOTL.IN | 2440150.0 | 2016 |
| 4 | Georgia | GEO | Theoretical duration of lower secondary educat... | SE.SEC.DURS.LO | 3.0 | 2016 |


##7. Explore your data

The value in the indicator_code column describes what type of data is shown in a given row.

One interesting indicator code is SE.XPD.TOTL.GD.ZS, which corresponds to "Government expenditure on education as % of GDP (%)".

Which countries spend the largest fraction of GDP on education?

To answer this question, consider only the rows with indicator code SE.XPD.TOTL.GD.ZS, and write a query that returns, for each country in the dataset, the average of the value column over the years 2010 through 2017 (inclusive).

In [None]:
country_spend_pct_query = """
                            SELECT country_name, AVG(value) AS avg_ed_spending_pct
                            FROM `bigquery-public-data.world_bank_intl_education.international_education`
                            WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS' AND
                              year BETWEEN 2010 AND 2017
                            GROUP BY country_name
                            ORDER BY AVG(value) DESC
                            """
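# Cap the query at 10**10 bytes (10 GB) billed; BigQuery fails the job instead of charging beyond this limit
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)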
country_spend_pct_query_job = client.query(country_spend_pct_query, job_config=safe_config)

country_spending_results = country_spend_pct_query_job.to_dataframe()

print(country_spending_results.head())

            country_name  avg_ed_spending_pct
0                   Cuba            12.837270
1  Micronesia, Fed. Sts.            12.467750
2        Solomon Islands            10.001080
3                Moldova             8.372153
4                Namibia             8.349610
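
To check the logic of a query like this one without touching BigQuery, the same `WHERE` / `GROUP BY` / `AVG` / `ORDER BY` pattern can be tried on a tiny in-memory table. The sketch below is an illustration under stated assumptions: it uses Python's built-in `sqlite3` rather than BigQuery, and the countries "A" and "B" and their values are invented, not taken from the dataset.

```python
import sqlite3

# Invented rows for illustration: (country_name, indicator_code, value, year).
rows = [
    ("A", "SE.XPD.TOTL.GD.ZS", 4.0, 2010),
    ("A", "SE.XPD.TOTL.GD.ZS", 6.0, 2017),
    ("A", "SE.XPD.TOTL.GD.ZS", 9.0, 2009),  # outside 2010-2017, filtered out
    ("B", "SE.XPD.TOTL.GD.ZS", 7.0, 2012),
    ("B", "SP.POP.TOTL", 1000.0, 2012),     # different indicator, filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE international_education "
    "(country_name TEXT, indicator_code TEXT, value REAL, year INTEGER)"
)
conn.executemany("INSERT INTO international_education VALUES (?, ?, ?, ?)", rows)

toy_query = """
    SELECT country_name, AVG(value) AS avg_ed_spending_pct
    FROM international_education
    WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS'
      AND year BETWEEN 2010 AND 2017
    GROUP BY country_name
    ORDER BY avg_ed_spending_pct DESC
"""
toy_results = conn.execute(toy_query).fetchall()
print(toy_results)  # [('B', 7.0), ('A', 5.0)]
conn.close()
```

Country "A" averages only its 2010 and 2017 values because `BETWEEN` is inclusive at both ends and the 2009 row is excluded before grouping.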


The last question started by telling you to focus on rows with the code SE.XPD.TOTL.GD.ZS. But how would you find more interesting indicator codes to explore?

There are thousands of codes in the dataset, so reviewing them all would be time-consuming. Many codes are also available for only a few countries, so when browsing the options you might restrict yourself to codes that are reported by many countries.

Write a query below that selects the indicator code and indicator name for all codes with at least 175 rows in the year 2016.

In [None]:
code_count_query = """
                   SELECT indicator_code, indicator_name, COUNT(1) as num_rows
                   FROM `bigquery-public-data.world_bank_intl_education.international_education`
                   WHERE year = 2016
                   GROUP BY indicator_code, indicator_name
                   HAVING COUNT(1) >= 175
                   ORDER BY COUNT(1) DESC
                   """
# Same 10 GB billing cap as in the previous query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
code_count_query_job = client.query(code_count_query, job_config=safe_config)

code_count_results = code_count_query_job.to_dataframe()

print(code_count_results.head())

      indicator_code                   indicator_name  num_rows
0        SP.POP.GROW     Population growth (annual %)       232
1        SP.POP.TOTL                Population, total       232
2     IT.NET.USER.P2  Internet users (per 100 people)       223
3     SP.POP.0014.TO     Population, ages 0-14, total       213
4  SP.POP.TOTL.MA.ZS    Population, male (% of total)       213
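
The query above relies on `WHERE` filtering rows before grouping and `HAVING` filtering groups after counting. A minimal sketch of that ordering follows, again using Python's built-in `sqlite3` as a stand-in for BigQuery. The table `toy`, the codes `CODE.X`, `CODE.Y`, `CODE.Z`, and the threshold of 2 are all invented for illustration.

```python
import sqlite3

# Invented rows for illustration: (indicator_code, country_code, year).
rows = [
    ("CODE.X", "AAA", 2016),
    ("CODE.X", "BBB", 2016),
    ("CODE.X", "CCC", 2016),
    ("CODE.Y", "AAA", 2016),
    ("CODE.Y", "BBB", 2015),  # wrong year: dropped by WHERE before grouping
    ("CODE.Z", "AAA", 2016),
    ("CODE.Z", "BBB", 2016),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (indicator_code TEXT, country_code TEXT, year INTEGER)")
conn.executemany("INSERT INTO toy VALUES (?, ?, ?)", rows)

having_query = """
    SELECT indicator_code, COUNT(1) AS num_rows
    FROM toy
    WHERE year = 2016
    GROUP BY indicator_code
    HAVING COUNT(1) >= 2
    ORDER BY num_rows DESC
"""
kept_codes = conn.execute(having_query).fetchall()
print(kept_codes)  # [('CODE.X', 3), ('CODE.Z', 2)]
conn.close()
```

`CODE.Y` has two rows overall but only one in 2016, so it falls below the threshold once `WHERE` has removed the 2015 row.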
