In [26]:
import os 
import pandas as pd
from google.cloud import bigquery

In [27]:
# Set this to the full absolute path of your downloaded key 
# get the key form google console to be able to access bigquery which is hosted my google
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "E:\\c_extend\\Documents\\egerdrive-7433adb919ad.json"
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/mwaki/Documents/Documents/Credentials/egerdrive-7433adb919ad.json"

The World Bank has made tons of interesting education data available through BigQuery. Run the following cell to see the first few rows of the *international_education* table from the *world_bank_intl_education* dataset.

In [28]:
client = bigquery.Client()
dataset_ref = client.dataset('world_bank_intl_education',project = 'bigquery-public-data')
dataset = client.get_dataset(dataset_ref)
table_ref = dataset_ref.table('international_education')
table = client.get_table(table_ref)
client.list_rows(table,max_results=5).to_dataframe()

Unnamed: 0,country_name,country_code,indicator_name,indicator_code,value,year
0,Chad,TCD,"Enrolment in lower secondary education, both s...",UIS.E.2,321921.0,2012
1,Chad,TCD,"Enrolment in upper secondary education, both s...",UIS.E.3,68809.0,2006
2,Chad,TCD,"Enrolment in upper secondary education, both s...",UIS.E.3,30551.0,1999
3,Chad,TCD,"Enrolment in upper secondary education, both s...",UIS.E.3,79784.0,2007
4,Chad,TCD,"Repeaters in primary education, all grades, bo...",UIS.R.1,282699.0,2006


# Exercises

The value in the `indicator_code` column describes what type of data is shown in a given row.  

One interesting indicator code is `SE.XPD.TOTL.GD.ZS`, which corresponds to "Government expenditure on education as % of GDP (%)".

### 1) Government expenditure on education

Which countries spend the largest fraction of GDP on education?  

To answer this question, consider only the rows in the dataset corresponding to indicator code `SE.XPD.TOTL.GD.ZS`, and write a query that returns the average value in the `value` column for each country in the dataset between the years 2010-2017 (including 2010 and 2017 in the average). 

Requirements:
- Your results should have the country name rather than the country code. You will have one row for each country.
- The aggregate function for average is **AVG()**.  Use the name `avg_ed_spending_pct` for the column created by this aggregation.
- Order the results so the countries that spend the largest fraction of GDP on education show up first.

In case it's useful to see a sample query, here's a query you saw in the tutorial (using a different dataset):
```
# Query to find out the number of accidents for each day of the week
query = """
        SELECT COUNT(consecutive_number) AS num_accidents, 
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """
```

In [29]:
query = """
select avg(value) as avg_ed_spending_pct,country_name
 
from `bigquery-public-data.world_bank_intl_education.international_education`
where indicator_code = 'SE.XPD.TOTL.GD.ZS' AND (year >=2010 AND year <=2017 )
group by country_name
order by avg_ed_spending_pct desc
"""

In [30]:
query_job = client.query(query)
query_job.to_dataframe()



Unnamed: 0,avg_ed_spending_pct,country_name
0,12.837270,Cuba
1,12.467750,"Micronesia, Fed. Sts."
2,10.001080,Solomon Islands
3,8.372153,Moldova
4,8.349610,Namibia
...,...,...
152,1.706404,Cambodia
153,1.503760,West Bank and Gaza
154,1.409726,South Sudan
155,1.409606,Monaco


### 2) Identify interesting codes to explore

The last question started by telling you to focus on rows with the code `SE.XPD.TOTL.GD.ZS`. But how would you find more interesting indicator codes to explore?

There are 1000s of codes in the dataset, so it would be time consuming to review them all. But many codes are available for only a few countries. When browsing the options for different codes, you might restrict yourself to codes that are reported by many countries.

Write a query below that selects the indicator code and indicator name for all codes with at least 175 rows in the year 2016.

Requirements:
- You should have one row for each indicator code.
- The columns in your results should be called `indicator_code`, `indicator_name`, and `num_rows`.
- Only select codes with 175 or more rows in the raw database (exactly 175 rows would be included).
- To get both the `indicator_code` and `indicator_name` in your resulting DataFrame, you need to include both in your **SELECT** statement (in addition to a **COUNT()** aggregation). This requires you to include both in your **GROUP BY** clause.
- Order from results most frequent to least frequent.  

In [39]:
query2 = """
select indicator_code,indicator_name,count(1) as num_rows
from `bigquery-public-data.world_bank_intl_education.international_education`

group by indicator_code,indicator_name having count(1) >=175
order by count(1)
"""


In [40]:
query_job2 = client.query(query2)
query_job2.to_dataframe()



Unnamed: 0,indicator_code,indicator_name,num_rows
0,UIS.FGP.6,Percentage of graduates from tertiary ISCED 6 ...,180
1,UIS.G.6.F,Graduates from ISCED 6 programmes in tertiary ...,180
2,UIS.TRTP.4,Percentage of teachers in post-secondary non-t...,182
3,LO.TIMSS.SCI4.INT,TIMSS: Fourth grade students reaching the inte...,183
4,LO.TIMSS.SCI4.LOW,TIMSS: Fourth grade students reaching the low ...,183
...,...,...,...
2074,SP.POP.TOTL.FE.ZS,"Population, female (% of total)",10233
2075,SP.POP.TOTL.MA.ZS,"Population, male (% of total)",10233
2076,SP.POP.1564.TO.ZS,"Population, ages 15-64 (% of total)",10243
2077,SP.POP.GROW,Population growth (annual %),11149
