# Add Information on Each Country's UN Education Index

## Purpose
Takes the .csv created in the 325-Get_Elevation_Range.ipynb notebook and adds UN data on each country's Education Index.


## Datasets
* .csv created in the 325-Get_Elevation_Range.ipynb
* UNEducationIndex.csv which contains the education index data

Imports necessary libraries

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os.path

Loads in Country Data

In [69]:
# Ensure the file exists
filepath = "../../data/prep/Countries/countries_325.csv"
if not os.path.exists(filepath):
    print("Missing dataset file")
else:
    df = pd.read_csv(filepath, encoding = "ISO-8859-1")
    print("File Read")

File Read


Prints the first 5 lines of the dataframe

In [70]:
df.head()

Unnamed: 0,Country,Year,Population,Males,Females,Life_Expectancy,GDP,Region,Elevation,Area_SqKM,...,Population_Density,CO2_Emissions,Methane_Emissions,Nitrous_Oxide_Emisions,Total_Emissions,Emmisions_per_Capita,Code,Lowest_Point,Highest_Point,Elevation_Range
0,Afghanistan,1960,8996351.0,4649361.0,4346990.0,32.337561,537777800.0,West and Central Asia,1884.71,646212.0,...,13.921671,414.371,,,414.371,,AFG,258.0,7492.0,7234.0
1,Afghanistan,1964,9731361.0,4996990.0,4734371.0,34.101902,800000000.0,West and Central Asia,1884.71,646212.0,...,15.059084,839.743,,,839.743,,AFG,258.0,7492.0,7234.0
2,Afghanistan,1968,10604346.0,5419182.0,5185164.0,35.832415,1373333000.0,West and Central Asia,1884.71,646212.0,...,16.410011,1224.778,,,1224.778,,AFG,258.0,7492.0,7234.0
3,Afghanistan,1972,11721940.0,5967987.0,5753953.0,37.620171,1595555000.0,West and Central Asia,1884.71,646212.0,...,18.139465,1532.806,9170.59,2530.158,13233.554,,AFG,258.0,7492.0,7234.0
4,Afghanistan,1976,12840299.0,6524577.0,6315722.0,39.58539,2555556000.0,West and Central Asia,1884.71,646212.0,...,19.870103,1987.514,10535.6,3265.633,15788.747,,AFG,258.0,7492.0,7234.0


The education index data is only available for years between 1980 and 2014

In [71]:
# Olympic years between 1980 and 2014
eduYear = [1980,1984,1988,1992,1996,2000,2004,2008,2012,1994,1998,2002,2006,2010,2014]

## Loading Education Index Data

Loads the .csv containing the education index data

In [72]:
# Ensure the file exists
filepath = "../../data/raw/UNData/UNEducationIndex.csv"
if not os.path.exists(filepath):
    print("Missing dataset file")
else:
    eduDF = pd.read_csv(filepath, encoding = "ISO-8859-1")
    print("File Read")

File Read


Outputs summary information about the dataframe containing the index data

In [73]:
eduDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 16 columns):
HDI Rank    194 non-null object
Country     200 non-null object
1980        195 non-null object
1985        195 non-null object
1990        195 non-null object
1995        195 non-null object
2000        195 non-null object
2005        195 non-null object
2006        195 non-null object
2007        195 non-null object
2008        195 non-null object
2009        195 non-null object
2010        195 non-null object
2011        195 non-null object
2012        195 non-null object
2013        195 non-null object
dtypes: object(16)
memory usage: 25.7+ KB


Changes all of the year columns within the dataframe to numeric values; entries that cannot be parsed become NaN

In [74]:
yearCols = ['1980','1985','1990','1995','2000','2005','2006','2007','2008','2009','2010','2011','2012','2013']
eduDF[yearCols] = eduDF[yearCols].apply(pd.to_numeric, errors='coerce')
eduDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 16 columns):
HDI Rank    194 non-null object
Country     200 non-null object
1980        129 non-null float64
1985        131 non-null float64
1990        142 non-null float64
1995        142 non-null float64
2000        158 non-null float64
2005        174 non-null float64
2006        173 non-null float64
2007        174 non-null float64
2008        175 non-null float64
2009        178 non-null float64
2010        187 non-null float64
2011        187 non-null float64
2012        187 non-null float64
2013        187 non-null float64
dtypes: float64(14), object(2)
memory usage: 25.7+ KB
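
The drop in non-null counts above (for example from 195 to 129 for 1980) comes from `errors='coerce'`, which turns any entry that cannot be parsed as a number into NaN. A minimal sketch of that behaviour, using made-up values and assuming `".."` as a hypothetical missing-data marker (the actual markers in the raw file are not shown here):

```python
import pandas as pd

# Made-up values; ".." is an assumed placeholder for missing data
raw = pd.Series(["0.541", "..", "0.321"])

# Unparseable strings become NaN instead of raising an error
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
```

Without `errors="coerce"`, the same call would raise an error on the first unparseable entry.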


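Changes the Country column within the dataframe into a string. In the pandas version used here, missing values become the literal string 'nan', which is why the non-null count for Country rises from 200 to 205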

In [75]:
eduDF['Country'] = eduDF['Country'].astype(str)
eduDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 16 columns):
HDI Rank    194 non-null object
Country     205 non-null object
1980        129 non-null float64
1985        131 non-null float64
1990        142 non-null float64
1995        142 non-null float64
2000        158 non-null float64
2005        174 non-null float64
2006        173 non-null float64
2007        174 non-null float64
2008        175 non-null float64
2009        178 non-null float64
2010        187 non-null float64
2011        187 non-null float64
2012        187 non-null float64
2013        187 non-null float64
dtypes: float64(14), object(2)
memory usage: 25.7+ KB


The years in the dataframe do not cover every Olympic year because the earlier columns (1980 to 2005) are at five-year increments. The years that are not included are listed below

In [76]:
for year in eduYear:
    # Column labels are strings, so compare against str(year)
    if str(year) not in eduDF.columns[2:]:
        print(year)

1984
1988
1992
1996
2004
1994
1998
2002
2014
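
The check above depends on the column labels being strings while `eduYear` holds integers. A minimal sketch with made-up lists (`wanted` and `available` are hypothetical names, not part of the notebook) shows the same comparison as a list comprehension:

```python
# Made-up stand-ins for eduYear and the dataframe's year columns
wanted = [1980, 1984, 1988]
available = ["1980", "1985", "1990"]  # column labels are strings

# Convert each year to a string before testing membership
missing = [year for year in wanted if str(year) not in available]
print(missing)
```

Comparing the integer `1980` directly against `"1980"` would report every year as missing.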


To get values for the years we require, we assume that a country's education index grows or declines at a constant rate throughout each five-year increment. Under this assumption we can estimate the values for the years that fall between the reported years
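
The constant-change assumption amounts to linear interpolation between two reported years. A minimal sketch of the idea, where `interpolate_year` and the toy dataframe are hypothetical and used only for illustration:

```python
import pandas as pd

def interpolate_year(frame, prev_col, next_col, target):
    """Linearly interpolate `target` between two known year columns.

    Hypothetical helper (not part of the notebook): assumes the index
    changes by the same amount every year between the two columns.
    """
    span = int(next_col) - int(prev_col)
    # Fraction of the interval that has elapsed at the target year
    weight = (int(target) - int(prev_col)) / span
    return frame[prev_col] + (frame[next_col] - frame[prev_col]) * weight

# Toy data, made up for illustration
toy = pd.DataFrame({"1980": [0.10, 0.50], "1985": [0.20, 0.40]})
toy["1984"] = interpolate_year(toy, "1980", "1985", "1984")
print(toy)
```

If either endpoint is NaN, the interpolated value is NaN as well, so countries without data at both ends of an interval stay missing.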

><b> 1984</b>

In [77]:
# 1984 is one year before 1985: step back one fifth of the 1980-1985 change
diff = (eduDF['1985'] - eduDF['1980']) / 5
eduDF['1984'] = eduDF['1985'] - diff

><b> 1988</b>

In [78]:
# 1988 is two years before 1990: step back two fifths of the 1985-1990 change
diff = (eduDF['1990'] - eduDF['1985']) / 5
eduDF['1988'] = eduDF['1990'] - (diff * 2)

><b> 1992</b><br>
><b> 1994</b>

In [79]:
# Yearly change between 1990 and 1995
diff = (eduDF['1995'] - eduDF['1990']) / 5
# 1992 is three years before 1995; 1994 is one year before 1995
eduDF['1992'] = eduDF['1995'] - (diff * 3)
eduDF['1994'] = eduDF['1995'] - diff

><b> 1996</b><br>
><b> 1998</b>

In [80]:
# Yearly change between 1995 and 2000
diff = (eduDF['2000'] - eduDF['1995']) / 5
# 1996 is one year after 1995; 1998 is two years before 2000
eduDF['1996'] = eduDF['1995'] + diff
eduDF['1998'] = eduDF['2000'] - (diff * 2)

><b> 2002</b><br>
><b> 2004</b>

In [81]:
# Yearly change between 2000 and 2005
diff = (eduDF['2005'] - eduDF['2000']) / 5
# 2002 is two years after 2000; 2004 is one year before 2005
eduDF['2002'] = eduDF['2000'] + (diff * 2)
eduDF['2004'] = eduDF['2005'] - diff

Now that we have established the Education Index values for the years we require, we can limit the education index dataframe to only those years

In [82]:
# eduYear ends with 2014, which has no column in eduDF, so drop the last entry
cols = ['Country'] + [str(year) for year in eduYear[:-1]]

We set the Country column as the index for now

In [83]:
eduDF = eduDF[cols]
eduDF = eduDF.set_index('Country')
eduDF.head()

Country,1980,1984,1988,1992,1996,2000,2004,2008,2012,1994,1998,2002,2006,2010
Afghanistan,0.0761,0.097687,0.114067,0.142949,0.185336,0.225522,0.283024,0.33216,0.365333,0.164509,0.205429,0.254273,0.308987,0.357
Albania,0.540589,0.530339,0.533423,0.53379,0.536048,0.565465,0.588829,0.599629,0.608519,0.530392,0.550757,0.577147,0.596323,0.601675
Algeria,0.321378,0.342018,0.368304,0.39918,0.438191,0.493489,0.548751,0.59808,0.642589,0.415971,0.46584,0.52112,0.570238,0.631478
Andorra,,,,,,,,,0.670287,,,,,0.670287
Angola,,,,,,0.298712,0.35149,0.410401,0.474212,,,0.325101,0.379923,0.440879


We now print the first 5 lines of our country dataframe to establish how we will join the two dataframes

In [84]:
df.head()

Unnamed: 0,Country,Year,Population,Males,Females,Life_Expectancy,GDP,Region,Elevation,Area_SqKM,...,Population_Density,CO2_Emissions,Methane_Emissions,Nitrous_Oxide_Emisions,Total_Emissions,Emmisions_per_Capita,Code,Lowest_Point,Highest_Point,Elevation_Range
0,Afghanistan,1960,8996351.0,4649361.0,4346990.0,32.337561,537777800.0,West and Central Asia,1884.71,646212.0,...,13.921671,414.371,,,414.371,,AFG,258.0,7492.0,7234.0
1,Afghanistan,1964,9731361.0,4996990.0,4734371.0,34.101902,800000000.0,West and Central Asia,1884.71,646212.0,...,15.059084,839.743,,,839.743,,AFG,258.0,7492.0,7234.0
2,Afghanistan,1968,10604346.0,5419182.0,5185164.0,35.832415,1373333000.0,West and Central Asia,1884.71,646212.0,...,16.410011,1224.778,,,1224.778,,AFG,258.0,7492.0,7234.0
3,Afghanistan,1972,11721940.0,5967987.0,5753953.0,37.620171,1595555000.0,West and Central Asia,1884.71,646212.0,...,18.139465,1532.806,9170.59,2530.158,13233.554,,AFG,258.0,7492.0,7234.0
4,Afghanistan,1976,12840299.0,6524577.0,6315722.0,39.58539,2555556000.0,West and Central Asia,1884.71,646212.0,...,19.870103,1987.514,10535.6,3265.633,15788.747,,AFG,258.0,7492.0,7234.0


## Joining the DataFrames

The loop below goes through every row of the country dataframe, inputting either the relevant education index value or NaN if no value exists

In [85]:
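edu = []
for _, row in df.iterrows():
    # Keep 'Korea, Rep.' intact; otherwise use only the part of the name before the first comma
    if 'Korea, Rep.' in row['Country']:
        country = row['Country']
    else:
        country = row['Country'].split(",")[0]
    year = row['Year']
    try:
        value = eduDF.loc[country][str(year)]
    except KeyError:
        # No education index for this country/year combination
        value = np.nan
    edu.append(value)
df['Education_Index'] = edu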
df.head(5)

Unnamed: 0,Country,Year,Population,Males,Females,Life_Expectancy,GDP,Region,Elevation,Area_SqKM,...,CO2_Emissions,Methane_Emissions,Nitrous_Oxide_Emisions,Total_Emissions,Emmisions_per_Capita,Code,Lowest_Point,Highest_Point,Elevation_Range,Education_Index
0,Afghanistan,1960,8996351.0,4649361.0,4346990.0,32.337561,537777800.0,West and Central Asia,1884.71,646212.0,...,414.371,,,414.371,,AFG,258.0,7492.0,7234.0,
1,Afghanistan,1964,9731361.0,4996990.0,4734371.0,34.101902,800000000.0,West and Central Asia,1884.71,646212.0,...,839.743,,,839.743,,AFG,258.0,7492.0,7234.0,
2,Afghanistan,1968,10604346.0,5419182.0,5185164.0,35.832415,1373333000.0,West and Central Asia,1884.71,646212.0,...,1224.778,,,1224.778,,AFG,258.0,7492.0,7234.0,
3,Afghanistan,1972,11721940.0,5967987.0,5753953.0,37.620171,1595555000.0,West and Central Asia,1884.71,646212.0,...,1532.806,9170.59,2530.158,13233.554,,AFG,258.0,7492.0,7234.0,
4,Afghanistan,1976,12840299.0,6524577.0,6315722.0,39.58539,2555556000.0,West and Central Asia,1884.71,646212.0,...,1987.514,10535.6,3265.633,15788.747,,AFG,258.0,7492.0,7234.0,
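
The row-by-row lookup above could also be expressed as a reshape followed by a merge. This is a minimal sketch of that alternative, not the notebook's method; the toy dataframes `countries` and `index_wide` and their values are made up, and the sketch skips the country-name handling done in the loop:

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (values are made up)
countries = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania"],
    "Year": [1976, 1980, 1980],
})
index_wide = pd.DataFrame(
    {"1980": [0.08, 0.54]},
    index=pd.Index(["Afghanistan", "Albania"], name="Country"),
)

# Reshape wide (one column per year) to long (one row per country-year)
index_long = index_wide.reset_index().melt(
    id_vars="Country", var_name="Year", value_name="Education_Index"
)
index_long["Year"] = index_long["Year"].astype(int)

# A left merge keeps every country row; unmatched rows get NaN
merged = countries.merge(index_long, on=["Country", "Year"], how="left")
print(merged)
```

The left merge leaves `Education_Index` as NaN for country-year pairs with no match, which mirrors the NaN fallback in the loop.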


#### Output

In [86]:
df.to_csv('../../data/prep/Countries/countries_350.csv', index=False)