# PUBPOL 542 Computational Thinking for Governance Analytics
### Rebecca Xie
#### Winter 2020

## 1. Data Preprocessing: Scraping tables from the web

First, I will use pandas to obtain the desired table from the web. If you have not installed pandas, install it with the following command:

In [1]:
#!pip install pandas 

**Note:** Since I have already installed pandas, I used the "#" symbol to comment out the command. You *do not* need to include the "#" if you still have to install pandas.

In addition to pandas, you will need to install the following:

    - html5lib
    - beautifulsoup4 (imported as bs4)
    - lxml
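
Before scraping, it can help to confirm that these libraries are available. The following is a minimal sketch using only the Python standard library; the `required` list and the `status` dictionary are names chosen for this illustration, and the check only tests whether each module can be found, not whether its version is compatible.

```python
from importlib.util import find_spec

# Import names can differ from install names: beautifulsoup4 is imported as bs4.
required = ["pandas", "html5lib", "bs4", "lxml"]

# find_spec returns None when a top-level module cannot be found.
status = {name: find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any library reported as missing can be installed with `pip` in the same way as pandas.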

In [2]:
import pandas as pd

In this section, I will obtain data from a table listed on the CIA.gov website. The code below embeds the webpage in the notebook so that I can see the data we will be scraping. Additionally, by storing the link to the data in the variable ciaUE, I can reuse it to create a dataframe, or data table.

In [3]:
from IPython.display import IFrame  
ciaUE="https://www.cia.gov/the-world-factbook/field/unemployment-rate/country-comparison" 
IFrame(ciaUE, width=700, height=300)

Using the pandas command **read_html** below, I created a *list* of dataframes, one for each table on the page that matches the request. The command includes:

        1. The link to the webpage
        2. The position of the header
        3. The external library that will be used to extract the text (flavor)
        4. The attributes of the table

In [4]:
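UETables=pd.read_html(ciaUE,
                      header=0,      # the first row holds the column names
                      flavor='bs4',  # parse with BeautifulSoup
                      attrs={'class': 'content-table'})  # keep only tables with this class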

In [5]:
type(UETables)

list

In [6]:
UETables[0]

Unnamed: 0,Rank,Country,%,Date of Information
0,1,Cocos (Keeling) Islands,0.10,2011
1,2,Cambodia,0.30,2017 est.
2,3,Niger,0.30,2017 est.
3,4,Laos,0.70,2017 est.
4,5,Malta,0.78,2019 est.
...,...,...,...,...
213,215,Kenya,40.00,2013 est.
214,216,Haiti,40.60,2010 est.
215,217,Senegal,48.00,2007 est.
216,218,Syria,50.00,2017 est.


Next, we will store the first table in the list as a dataframe called **UETablesA** and make a copy of it, **UETablesB**, for editing.

In [7]:
UETablesA=UETables[0]

In [8]:
UETablesB=UETablesA.copy()
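
To illustrate why the copy matters, here is a small sketch with toy data standing in for the scraped table (the names `original` and `working` and the values are illustrative only). Changes made to the copy leave the original dataframe untouched.

```python
import pandas as pd

# Toy data standing in for the scraped table (values are illustrative).
original = pd.DataFrame({"Country": ["Niger", "Laos"], "%": [0.3, 0.7]})

# copy() returns an independent dataframe (a deep copy by default).
working = original.copy()
working.drop(columns=["%"], inplace=True)

print(list(original.columns))  # ['Country', '%']
print(list(working.columns))   # ['Country']
```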

## 2. Renaming Columns

First, I will drop the variables that I do not need for data analysis.

**Note:** The spelling and capitalization of each column name must be identical to what is listed in the dataframe.

In [9]:
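# Use the columns keyword; passing the axis positionally (the bare 1) fails in recent pandas versions
UETablesB.drop(columns=['Rank', 'Date of Information'], inplace=True)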

In [10]:
UETablesB

Unnamed: 0,Country,%
0,Cocos (Keeling) Islands,0.10
1,Cambodia,0.30
2,Niger,0.30
3,Laos,0.70
4,Malta,0.78
...,...,...
213,Kenya,40.00
214,Haiti,40.60
215,Senegal,48.00
216,Syria,50.00
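
The note above about spelling and capitalization can be checked with a small sketch. It uses toy data (the dataframe `toy` and the flag `mismatch_detected` are names invented for this illustration) and shows that a column name with the wrong capitalization raises a `KeyError`, while an exact match is dropped.

```python
import pandas as pd

# Toy data standing in for the scraped table (values are illustrative).
toy = pd.DataFrame({"Rank": [1, 2], "Country": ["Niger", "Laos"], "%": [0.3, 0.7]})

try:
    toy.drop(columns=["rank"])  # lowercase "rank" does not match "Rank"
    mismatch_detected = False
except KeyError:
    mismatch_detected = True

# An exact match works; without inplace=True, drop returns a new dataframe.
trimmed = toy.drop(columns=["Rank"])

print(mismatch_detected)      # True
print(list(trimmed.columns))  # ['Country', '%']
```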


Then I will specify the new variable names.

In [11]:
some=['country', 'percentunemployment']

After specifying the variable names, I created a dictionary that links each original variable name with its new variable name. Depending on the number of columns in your table, you may need to update the slice index to suit your individual situation.

In [12]:
dict(zip(UETablesB.columns[:3], some))

{'Country': 'country', '%': 'percentunemployment'}
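
The `[:3]` slice works here even though only two columns remain. The sketch below uses plain Python lists (the names `columns`, `new_names`, `mapping`, and `partial` are illustrative) to show the two behaviors involved: slicing past the end of a sequence is safe, and `zip` stops at the shorter input.

```python
columns = ["Country", "%"]  # the two columns left after dropping
new_names = ["country", "percentunemployment"]

# Slicing past the end is safe, so [:3] simply returns both columns.
mapping = dict(zip(columns[:3], new_names))
print(mapping)  # {'Country': 'country', '%': 'percentunemployment'}

# zip stops at the shorter input, so a missing new name silently
# leaves the last column out of the mapping.
partial = dict(zip(columns, ["country"]))
print(partial)  # {'Country': 'country'}
```

Because of the second behavior, it is worth checking that the list of new names is as long as the list of columns you intend to rename.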

In [13]:
UETablesB

Unnamed: 0,Country,%
0,Cocos (Keeling) Islands,0.10
1,Cambodia,0.30
2,Niger,0.30
3,Laos,0.70
4,Malta,0.78
...,...,...
213,Kenya,40.00
214,Haiti,40.60
215,Senegal,48.00
216,Syria,50.00


As you can see, simply creating a dictionary does not automatically rename the variables in the dataframe. The following code renames the variables by defining a *new* dataframe using the dictionary.

In [14]:
dataBecca = UETablesB.rename(columns=dict(zip(UETablesB.columns, some)))

In [15]:
dataBecca

Unnamed: 0,country,percentunemployment
0,Cocos (Keeling) Islands,0.10
1,Cambodia,0.30
2,Niger,0.30
3,Laos,0.70
4,Malta,0.78
...,...,...
213,Kenya,40.00
214,Haiti,40.60
215,Senegal,48.00
216,Syria,50.00
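
As a quick check of how `rename` behaves, the following sketch applies a mapping to toy data (the dataframe `toy` and its values are illustrative). It shows that `rename` returns a new dataframe and leaves the original column names in place.

```python
import pandas as pd

# Toy data standing in for the scraped table (values are illustrative).
toy = pd.DataFrame({"Country": ["Niger", "Laos"], "%": [0.3, 0.7]})
mapping = {"Country": "country", "%": "percentunemployment"}

# rename returns a new dataframe unless inplace=True is passed.
renamed = toy.rename(columns=mapping)

print(list(toy.columns))      # ['Country', '%']
print(list(renamed.columns))  # ['country', 'percentunemployment']
```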


The data is finally in the preferred format for analysis. Since I do not have any categorical values (a CSV file stores plain text and would not preserve categorical data types), I can save my data as a CSV file for use in SPSS, Stata, R, or another software system.

In [16]:
dataBecca.to_csv('BeccaData.csv', index=False)
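
To see what `index=False` does, here is a minimal sketch that writes toy data to in-memory buffers instead of files (the dataframe `toy` and the buffer names are illustrative). Reading the result back shows that, without `index=False`, the row index is saved as an extra unnamed column.

```python
import io
import pandas as pd

# Toy data standing in for the final dataframe (values are illustrative).
toy = pd.DataFrame({"country": ["Niger", "Laos"],
                    "percentunemployment": [0.3, 0.7]})

# With index=False, only the data columns are written.
clean_buffer = io.StringIO()
toy.to_csv(clean_buffer, index=False)
clean_buffer.seek(0)
restored = pd.read_csv(clean_buffer)

# With the default index=True, the row index becomes an extra column.
indexed_buffer = io.StringIO()
toy.to_csv(indexed_buffer)
indexed_buffer.seek(0)
with_index = pd.read_csv(indexed_buffer)

print(list(restored.columns))    # ['country', 'percentunemployment']
print(list(with_index.columns))  # ['Unnamed: 0', 'country', 'percentunemployment']
```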