# Description

This week's homework assignment tests your ability to load and manipulate data with Pandas. <br/>
The goal is to develop some intuition for how to filter, arrange, and merge data. This will be useful for the next homework assignments.<br/>
Fill the empty cells with your code and submit a copy of this notebook to Moodle. <br/>
This homework counts for 1 point of your final grade.

Remember to rename the notebook to "H.<student_id>.ipynb", replacing <student_id> with your student ID. <br/>

In [1]:
import numpy as np
import scipy
import pandas as pd

## Download and Load the World Development Indicators data set

We will work with the World Development Indicators data set. <br/> 
This data set is available from the World Bank DataBank.<br/>
The very first step is to download the data to your computer. You can do this by running the following cell. <br/>
Alternatively, you can copy and paste the URL inside the .get() method into your browser.

In [None]:
# importing libraries
import requests, zipfile, io

# Note: this can take several minutes depending on your internet connection.
r = requests.get('http://databank.worldbank.org/data/download/WDI_csv.zip')
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

# Free the variables we used above.
del z
del r

The code above downloads a zip archive into memory and extracts its contents to the working folder, which by default is the location of this notebook on your computer. <br/>
The contents include multiple .csv files; however, we will work only with the file 'WDIData.csv'. <br/>

In the cell below, use Pandas to open the file "WDIData.csv" and save it to a variable called 'wdi'.<br/>
Note that you might need to specify the encoding option; in my case "ISO-8859-1" worked fine.<br/>
Find more information at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

In [2]:
wdi = pd.read_csv("WDIData.csv", encoding="ISO-8859-1")
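
As a minimal sketch of why the encoding option matters, the example below builds a tiny CSV in memory (toy content, not the real WDI file) encoded as ISO-8859-1 and reads it back. The names `raw` and `toy` are made up for this illustration.

```python
import io
import pandas as pd

# Toy CSV content (not the real WDI file), encoded as ISO-8859-1.
raw = "Country Name,Value\nCôte d'Ivoire,1\nCuraçao,2\n".encode("ISO-8859-1")

# Reading with the matching encoding recovers the accented characters.
toy = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(toy)
```

Reading the same bytes with the default UTF-8 decoding would typically fail with a decoding error, which is the usual sign that the encoding option is needed.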

## Download and Load the Penn World Table V9.0

We will additionally use data from the Penn World Table (PWT) v9.0. <br/> 
Again, run the following cell to download the data set, this time using the library urllib.

In [None]:
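# Import the submodule explicitly: 'import urllib' alone does not guarantee that urllib.request is loaded.
import urllib.request
urllib.request.urlretrieve("https://www.rug.nl/ggdc/docs/pwt90.xlsx", "pwt90.xlsx")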

In the following cell, open and read the file 'pwt90.xlsx' and save it into variable 'pwt'. <br/>
Remember that pandas has a function to read Excel files, and that we need to specify the sheet we want to load data from.

In [3]:
pwt = pd.read_excel("pwt90.xlsx", sheet_name="Data")

## Data Wrangling

Now that we have loaded our data into the variable 'wdi', we are ready to start playing with it. <br/>
Start by printing all column names in the cell below.

In [None]:
print(wdi.columns)

Next, list the values in the column 'Country Name'.<br/>
You will get a list with repeated values; delete all duplicates to ease your analysis. <br/>

Tip: see the method '.drop_duplicates()' https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html.

In [None]:
# List the distinct values of 'Country Name' (only the last expression of a cell is displayed).
wdi['Country Name'].drop_duplicates()
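
The difference between removing duplicates from a single column and from a whole DataFrame can be sketched on toy data. The frame `toy` and its values are invented for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["World", "World", "Portugal", "Portugal", "Spain"],
    "2002": [1.0, 2.0, 3.0, 4.0, 5.0],  # made-up values
})

# Series.drop_duplicates keeps the first occurrence and its original index label.
unique_names = toy["Country Name"].drop_duplicates()
print(unique_names)

# DataFrame.drop_duplicates with subset= keeps the whole first row per name.
first_rows = toy.drop_duplicates(subset=["Country Name"])
print(first_rows)
```

Note that the surviving rows keep their original index labels, so the index is no longer consecutive after the duplicates are removed.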

You might notice that while the bottom rows represent countries, the top rows represent aggregates of countries (e.g., world regions). <br/> However, we are only interested in country-level data, so we need to filter out all unnecessary rows.

Save all the values of the column 'Country Name' in the variable 'cnames'. <br/>
Delete all duplicate values.<br/>
Print the first 50 values in 'cnames' (remember you can use a slice here).

In [None]:
cnames = wdi['Country Name']
cnames = cnames.drop_duplicates()

In [None]:
# Positional slice: the first 50 values (positions 0 to 49).
cnames.iloc[:50]

You can verify that the first 48 values in 'cnames' do not correspond to countries but to aggregates.<br/>
In the next cell, filter out from 'wdi' the rows in which 'Country Name' represents an aggregate of countries.<br/>

Tip1: You can use the negation of .isin() to perform a boolean filter over the rows of the DataFrame; see an example at  https://erikrood.com/Python_References/rows_cols_python.html <br/>
Tip2: You can also perform this action by slicing out all unnecessary rows.

In [None]:
wdi = wdi.loc[~wdi['Country Name'].isin(cnames[0:47])]
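
The negated .isin() filter from Tip1 can be sketched on a small made-up frame. The list `aggregates` is a hypothetical stand-in for the aggregate names found in 'cnames'.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Arab World", "World", "Portugal", "Spain", "World"],
    "value": [1, 2, 3, 4, 5],  # made-up values
})
aggregates = ["Arab World", "World"]  # hypothetical list of aggregate names

# isin() returns a boolean mask; ~ inverts it, so the aggregate rows are dropped.
countries_only = toy.loc[~toy["Country Name"].isin(aggregates)]
print(countries_only)
```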

Reset the index of 'wdi'. You can use the method reset_index(); see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html.
<br/> Perform this operation in place.

In [None]:
wdi.reset_index(inplace = True)

Show that the index has been reset.

In [None]:
wdi

Note that when resetting the index, pandas inserts a new column at the beginning of the data frame, which holds the previous index values. <br/>
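
A small sketch of this behaviour, on an invented frame: by default reset_index() keeps the old index as a column, while the drop=True option discards it.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30]}, index=[5, 7, 9])

# Default: the old index is kept as a new first column named 'index'.
kept = toy.reset_index()
# drop=True discards the old index instead of keeping it as a column.
dropped = toy.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```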

## Indicator Codes and Indicator Name

Select the columns 'Indicator Name' and 'Indicator Code'.<br/> 
Delete all duplicates, then print the top 5 and bottom 5 values. <br/>
Tip: You should be able to do everything in a single line of code.

In [None]:
wdi[['Indicator Name','Indicator Code']].drop_duplicates().head()

In [None]:
wdi[['Indicator Name','Indicator Code']].drop_duplicates().tail()

Create a new DataFrame named 'indicators' made up of the columns 'Indicator Name' and 'Indicator Code'.<br/>
Delete all duplicated entries. <br/> 
Set the column 'Indicator Code' as the index of 'indicators'. <br/> 
The output should be a DataFrame with 1440 rows. <br/>
Try to perform all these steps in a single line of code.

In [None]:
indicators = wdi[['Indicator Name', 'Indicator Code']].drop_duplicates().set_index('Indicator Code')

The 'indicators' DataFrame can now operate as a dictionary. <br/> 
Given an 'Indicator Code' (key), it returns the associated 'Indicator Name' (value).<br/>
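
A minimal sketch of this dictionary-like lookup, using a two-row toy frame (`toy_indicators`, built here only for illustration) instead of the full 'indicators' DataFrame:

```python
import pandas as pd

# Toy stand-in for 'indicators', indexed by 'Indicator Code'.
toy_indicators = pd.DataFrame(
    {
        "Indicator Name": ["Population, total", "GDP (current US$)"],
        "Indicator Code": ["SP.POP.TOTL", "NY.GDP.MKTP.CD"],
    }
).set_index("Indicator Code")

# .loc[row_label, column_label] looks up the name for a given code.
name = toy_indicators.loc["SP.POP.TOTL", "Indicator Name"]
print(name)
```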

Using the 'indicators' DataFrame, find the 'Indicator Code' associated with the following observables:
1. 'Population': the total population of a country;
2. 'GDP': the GDP measured in current US dollars;
3. 'GINI index'.

Tip1: You can use the method STRING.str.contains('substring') to check whether a string contains a substring. Note that the match is case-sensitive.

In [1]:
indicators[indicators['Indicator Name'].str.contains('Population, total')]


In [None]:
pd.set_option('display.max_rows', 500)
indicators[indicators['Indicator Name'].str.contains("GDP")]

In [None]:
indicators[indicators['Indicator Name'].str.contains('GINI index')]
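
The case sensitivity mentioned in Tip1 can be sketched on a toy Series. The `case` and `regex` options of str.contains() are shown as possible aids here; the variable names are invented for this example.

```python
import pandas as pd

names = pd.Series(["GDP (current US$)", "Population, total", "GINI index"])

# Case-sensitive by default: 'gini' does not match 'GINI'.
strict = names[names.str.contains("gini")]
# case=False ignores case.
relaxed = names[names.str.contains("gini", case=False)]

# The pattern is a regular expression by default, so '$' means "end of string".
as_regex = names.str.contains("US$")
# regex=False treats the pattern as literal text.
as_literal = names.str.contains("US$", regex=False)

print(relaxed.tolist())
print(as_literal.tolist())
```

Passing regex=False is a convenient choice when the search text contains characters such as `$`, `(` or `)`, which appear in several indicator names.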

## Extracting and Cleaning data from WDI and PWT

From 'wdi' extract the columns 'Indicator Code', 'Country Code', and '2002'.
Save the output in variable 'wdi_sample'

Tip1: You should be able to perform all operations in a single line of code. <br/>
Tip2: Use the method .loc\[\] to extract a row with a specified index value, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html.

In [None]:
wdi_sample = wdi[['Indicator Code', 'Country Code', '2002']]

Select from 'wdi_sample' the rows associated with the Indicator Codes that you found above for the 'GINI index', 'GDP', and 'Population, total'.

In [None]:
print(wdi_sample.loc[wdi_sample['Indicator Code'].isin(['SI.POV.GINI','SP.POP.TOTL','NY.GDP.MKTP.CD'])])

In [None]:
wdi_sample = wdi_sample.loc[wdi_sample['Indicator Code'].isin(['SI.POV.GINI','SP.POP.TOTL','NY.GDP.MKTP.CD'])]

Create a pivot table, in which values are the column '2002', the index is the 'Country Code', and the columns are the Indicator Codes. <br/>

You can use the function pivot_table() from Pandas, see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html.

In [None]:
wdi_sample = wdi_sample.pivot_table(values='2002',index='Country Code', columns='Indicator Code')
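
To illustrate what the pivot does, here is a sketch on a small long-format frame with made-up values. The names `long` and `wide` are chosen for this example only.

```python
import pandas as pd

# Long format: one row per (country, indicator) pair; values are made up.
long = pd.DataFrame({
    "Country Code": ["PRT", "PRT", "ESP", "ESP"],
    "Indicator Code": ["SP.POP.TOTL", "NY.GDP.MKTP.CD", "SP.POP.TOTL", "NY.GDP.MKTP.CD"],
    "2002": [10.0, 100.0, 40.0, 700.0],
})

# Wide format: one row per country, one column per indicator code.
wide = long.pivot_table(values="2002", index="Country Code", columns="Indicator Code")
print(wide)
```

By default pivot_table() aggregates repeated (index, column) pairs with the mean. Here each pair occurs once, so the values are carried over unchanged.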

Rename the columns of 'wdi_sample' to 'GDP', 'GINI', and 'Population', matching each name to its indicator code.

In [None]:
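# Map each indicator code to a readable name explicitly, rather than relying on the column order.
wdi_sample = wdi_sample.rename(columns={'NY.GDP.MKTP.CD': 'GDP', 'SI.POV.GINI': 'GINI', 'SP.POP.TOTL': 'Population'})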

From 'pwt' select only the values of the year 2002. <br/>
Then, extract the columns 'countrycode' and 'hc' into a new variable 'pwt_sample'. <br/>
Rename 'countrycode' to 'Country Code', so that it matches the same column in 'wdi_sample'<br/>
Note that here 'hc' stands for the Human Capital Index.<br/>

In [None]:
pwt_sample = pwt.loc[pwt['year'] == 2002, ['countrycode', 'hc']].rename(columns={'countrycode': 'Country Code'})

Finally, create a new DataFrame named 'data' that contains the columns from 'wdi_sample' and 'pwt_sample', matched by 'Country Code'. Use the function pd.concat(), and make sure both DataFrames use 'Country Code' as their index.

In [None]:
data = pd.concat([pwt_sample.set_index('Country Code'), wdi_sample], axis = 1)
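
How pd.concat() matches rows by index label when axis=1 can be sketched with two toy frames. The frames `left` and `right` and their values are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame(
    {"hc": [2.5, 3.0]}, index=pd.Index(["PRT", "ESP"], name="Country Code")
)
right = pd.DataFrame(
    {"GDP": [700.0, 50.0]}, index=pd.Index(["ESP", "FRA"], name="Country Code")
)

# axis=1 aligns rows by index label; the default join='outer' keeps every
# label and fills the gaps with NaN.
combined = pd.concat([left, right], axis=1)
# join='inner' keeps only the labels present in both frames.
inner = pd.concat([left, right], axis=1, join="inner")

print(combined)
print(inner)
```

With the default outer join, countries that appear in only one of the two sources remain in the result with missing values, which matters when computing statistics later.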

Consider the data for the year 2002 that you have prepared above. Perform the necessary data manipulations to answer the following questions:

1. Which countries have a population of 10 million inhabitants +/- 1 million?
2. What are the average and the standard deviation of the GDP of the countries listed in 1?
3. What are the average and the standard deviation of the GDP of the countries NOT listed in 1?
4. Repeat points 2 and 3, but for the GDP per capita.
5. Which country has the highest Human Capital (hc in the PWT tables)?
6. Which country has the lowest Human Capital (hc in the PWT tables)?

Write the necessary code to obtain the answer to each question in a single cell. <br/>
Print the answer at the end of that cell.

In [None]:
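# Turn the country-code index into a regular column named 'Country'.
data = data.rename_axis('Country').reset_index()
# Population is measured in persons: 10 million +/- 1 million.
Q1 = data.loc[(data['Population'] >= 9_000_000) & (data['Population'] <= 11_000_000), 'Country']
print(Q1)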
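
As an alternative sketch for a range filter like the one in question 1, Series.between() expresses both bounds in one call (inclusive on both ends by default). The frame below uses made-up populations.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["AAA", "BBB", "CCC"],
    "Population": [8_500_000, 10_200_000, 11_000_000],  # made-up values
})

# between() is inclusive on both ends by default.
in_band = toy.loc[toy["Population"].between(9_000_000, 11_000_000), "Country"]
print(in_band)
```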

In [None]:
in_Q1 = data['Country'].isin(Q1)  # boolean mask: countries listed in question 1
mean_Q1 = data.loc[in_Q1, 'GDP'].mean()
std_Q1 = data.loc[in_Q1, 'GDP'].std()
print("The average is ", mean_Q1, "and the standard deviation is ", std_Q1)

In [None]:
not_in_Q1 = ~data['Country'].isin(Q1)  # boolean mask: countries NOT listed in question 1
mean_Q3 = data.loc[not_in_Q1, 'GDP'].mean()
std_Q3 = data.loc[not_in_Q1, 'GDP'].std()
print("The average is ", mean_Q3, "and the standard deviation is ", std_Q3)

In [None]:
gdp_per_capita = data['GDP'] / data['Population']
in_Q1 = data['Country'].isin(Q1)  # boolean mask: countries listed in question 1
mean_Q4_in, std_Q4_in = gdp_per_capita[in_Q1].mean(), gdp_per_capita[in_Q1].std()
mean_Q4_out, std_Q4_out = gdp_per_capita[~in_Q1].mean(), gdp_per_capita[~in_Q1].std()
print(mean_Q4_in, std_Q4_in, mean_Q4_out, std_Q4_out)

In [None]:
print(data[['Country']][data.hc == data.hc.max()])

In [None]:
print(data[['Country']][data.hc == data.hc.min()])
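
Another way to locate the extremes, sketched here on a toy frame with made-up values, is to use idxmax() and idxmin(), which return the index label of the largest and smallest value and skip missing entries.

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["AAA", "BBB", "CCC"], "hc": [1.5, None, 3.2]})

# idxmax/idxmin return the index label of the extreme value, skipping NaN.
highest = toy.loc[toy["hc"].idxmax(), "Country"]
lowest = toy.loc[toy["hc"].idxmin(), "Country"]
print(highest, lowest)
```

Unlike the boolean comparison with max() or min(), this returns a single country even when several rows tie for the extreme value.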