# Project 2: Analyzing education data
*Pierre-Eloi Ragetly*

This project is part of the OpenClassrooms Data Scientist path.  
The main objective is to perform an Exploratory Data Analysis (EDA) for a fictional EdTech company called Academy.

Data will be downloaded from the World Bank website:
https://datacatalog.worldbank.org/dataset/education-statistics

In [1]:
import analysis.dataload as ld
import analysis.datavisu as vs

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Collection" data-toc-modified-id="Data-Collection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Collection</a></span></li><li><span><a href="#Data-Processing" data-toc-modified-id="Data-Processing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Processing</a></span></li><li><span><a href="#Feature-engineering" data-toc-modified-id="Feature-engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature engineering</a></span></li></ul></div>

## Data Collection

Let's start by downloading the data from the World Bank website.

In [2]:
url = 'http://databank.worldbank.org/data/download/Edstats_csv.zip'
list_csv = ld.download_data(url)

The extracted files are:
EdStatsSeries.csv
EdStatsFootNote.csv
EdStatsData.csv
EdStatsCountry-Series.csv
EdStatsCountry.csv
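
The `ld.download_data` helper comes from the project's own `analysis.dataload` module, which is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch using only the standard library. The function name `download_and_extract`, the `dest_dir` parameter and the choice to return only `.csv` paths are assumptions made for illustration, not the project's actual code.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def download_and_extract(url, dest_dir="data"):
    """Download a zip archive, extract it, and return the paths of the CSV files.

    Hypothetical stand-in for ld.download_data; the real helper may differ.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Read the whole archive into memory, then open it as a zip file.
    with urlopen(url) as response:
        payload = io.BytesIO(response.read())

    with zipfile.ZipFile(payload) as archive:
        archive.extractall(dest)
        names = archive.namelist()

    # Keep only the CSV files, in the order they appear in the archive.
    return [str(dest / name) for name in names if name.endswith(".csv")]
```

Under these assumptions, calling `download_and_extract(url)` with the URL above would return a list of CSV paths that can be indexed the same way as `list_csv`.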


Now we have to load the data with pandas before starting the analysis.

In [3]:
csv_file = list_csv[2]
data = ld.load_data(csv_file)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,


## Data Processing

Let's first see how many rows and columns the data contains. To do so, we use the `info()` method from pandas.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 886930 entries, 0 to 886929
Data columns (total 70 columns):
Country Name      886930 non-null object
Country Code      886930 non-null object
Indicator Name    886930 non-null object
Indicator Code    886930 non-null object
1970              72288 non-null float64
1971              35537 non-null float64
1972              35619 non-null float64
1973              35545 non-null float64
1974              35730 non-null float64
1975              87306 non-null float64
1976              37483 non-null float64
1977              37574 non-null float64
1978              37576 non-null float64
1979              36809 non-null float64
1980              89122 non-null float64
1981              38777 non-null float64
1982              37511 non-null float64
1983              38460 non-null float64
1984              38606 non-null float64
1985              90296 non-null float64
1986              39372 non-null float64
1987              38641 non-

This method displays several interesting values. At the top we can see how many rows and columns the dataframe contains:  
- 886,930 rows
- 70 columns

We can also notice that most column labels correspond to a specific year. There are both historical data (from 1970 to 2017, in 1-year steps) and forecast data (from 2020 to 2100, in 5-year steps).  
Each row holds a given indicator for a given country. Details about the different indicators can be found in the "EdStatsSeries.csv" file.
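
The column count can be checked with a little arithmetic: 48 yearly columns (1970 to 2017) plus 17 five-year columns (2020 to 2100) give 65 year columns, and the four name/code columns plus the trailing `Unnamed: 69` column bring the total to 70. The sketch below reproduces that split on a rebuilt list of labels. The helper name `split_year_columns` and the `last_observed_year` cutoff are assumptions introduced for illustration.

```python
def split_year_columns(columns, last_observed_year=2017):
    """Split column labels into historical years, projected years and the rest."""
    historical, projected, other = [], [], []
    for label in columns:
        if label.isdigit():
            # Purely numeric labels are years; the cutoff separates the two ranges.
            if int(label) <= last_observed_year:
                historical.append(label)
            else:
                projected.append(label)
        else:
            other.append(label)
    return historical, projected, other


# Labels rebuilt from the ranges described above (not read from the file).
id_cols = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]
labels = (
    id_cols
    + [str(year) for year in range(1970, 2018)]
    + [str(year) for year in range(2020, 2101, 5)]
    + ["Unnamed: 69"]
)

historical, projected, other = split_year_columns(labels)
print(len(historical), len(projected), len(other))  # 48 17 5
```

On the real dataframe, the same function could be applied to `list(data.columns)`.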

Many values are missing from this dataset. Indeed, except for the name and code columns, at least 80% of the values are missing.
Before looking at how many countries and indicators the dataset includes, let's check whether it contains any duplicate rows. For that we use the pandas `duplicated` method.


In [5]:
# display the number of duplicates with duplicated method.
col_labels = data.columns
n_duplicates = data.duplicated(col_labels[:4]).sum()
print("There are {} duplicates.".format(n_duplicates))

# display the number of countries
n_countries = data['Country Code'].unique().size
print("There are {} different countries.".format(n_countries))

# display the number of indicators
n_ind = data['Indicator Code'].unique().size
print("There are {} different indicators.".format(n_ind))

There are 0 duplicates.
There are 242 different countries.
There are 3665 different indicators.
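
To make the behaviour of `duplicated` with a column subset more concrete, here is a small sketch on a toy dataframe. The values are made up for illustration and are not taken from the World Bank data. A row counts as a duplicate when an earlier row has the same values in the key columns, whatever the other columns contain.

```python
import pandas as pd

# Toy data (invented values): the last row repeats the (country, indicator) pair
# of the third row, even though the 1970 value differs.
toy = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "FRA", "FRA"],
    "Indicator Code": ["UIS.NERA.2", "UIS.NERA.2.F", "UIS.NERA.2", "UIS.NERA.2"],
    "1970": [None, 1.0, 2.0, 3.0],
})

key_cols = ["Country Code", "Indicator Code"]

# duplicated() flags every repeat of a key after its first occurrence.
n_duplicates = int(toy.duplicated(subset=key_cols).sum())

# nunique() counts distinct non-null values, an alternative to unique().size.
n_countries = toy["Country Code"].nunique()
n_indicators = toy["Indicator Code"].nunique()

print(n_duplicates, n_countries, n_indicators)  # 1 2 2
```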


Although it does not include any duplicates, the dataset contains 3,665 indicators for 65 different years! That is a lot, and we cannot keep all these features for our analysis, especially since many values are missing. The next step is to select a few years and keep only the most relevant indicators.
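
One possible way to guide the choice of years is to rank the year columns by their share of missing values. In the `info()` output above, for instance, 1970 has 72,288 non-null values out of 886,930 rows, so roughly 92% of that column is missing. The sketch below shows the idea on a toy dataframe with invented values. Ranking by missing share is only an assumption about how the selection could be done, not necessarily the approach used in the rest of this project.

```python
import pandas as pd

# Toy data (invented values) with two year columns of different fill rates.
toy = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "FRA", "FRA"],
    "1970": [None, None, None, 1.0],
    "1975": [None, 2.0, None, 3.0],
})

# Year columns are the ones whose label is purely numeric.
year_cols = [col for col in toy.columns if col.isdigit()]

# isna().mean() gives the fraction of missing values per column;
# sorting puts the best-filled years first.
missing_share = toy[year_cols].isna().mean().sort_values()

print(missing_share)
```

On the real data, the same two lines could be applied to `data` to list the years with the fewest missing values.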

## Feature engineering