# Introduction  
This notebook focuses on cleaning and processing two datasets:
1. **Education Data**: Data provided by the UN on education enrollment.
2. **Public Expenditure Data**: Data on education expenditure, measured as a percentage of Gross National Income (GNI) and as a percentage of total government expenditure.

The goal of this notebook is to:
- Load and explore the data.
- Clean and preprocess the data.
- Perform basic analyses and visualization.
- Prepare the data for further exploration or modeling.

# Loading and Cleaning Data

In [87]:
import os
print(os.getcwd())


/home/salah/Github/data-mining-projects/un-education-data-analysis


In [88]:
import pandas as pd
import numpy as np

education_data = pd.read_csv('un-raw-data/education-data.csv', encoding='ISO-8859-1')
public_expenditure_data = pd.read_csv('un-raw-data/public-expenditure-on-education-and-access-to-computers.csv', encoding='ISO-8859-1')

In [89]:
education_data.info()
public_expenditure_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7531 entries, 0 to 7530
Data columns (total 7 columns):
 #   Column                                                                       Non-Null Count  Dtype 
---  ------                                                                       --------------  ----- 
 0   T07                                                                          7531 non-null   object
 1   Enrollment in primary, lower secondary and upper secondary education levels  7530 non-null   object
 2   Unnamed: 2                                                                   7531 non-null   object
 3   Unnamed: 3                                                                   7531 non-null   object
 4   Unnamed: 4                                                                   7531 non-null   object
 5   Unnamed: 5                                                                   654 non-null    object
 6   Unnamed: 6                                       

First, we see an issue with the columns: the header holds a table title and `Unnamed` placeholders instead of the real column names, which has to be fixed.

In [90]:
education_data.head()

Unnamed: 0,T07,"Enrollment in primary, lower secondary and upper secondary education levels",Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,Region/Country/Area,,Year,Series,Value,Footnotes,Source
1,1,"Total, all countries or areas",2005,Students enrolled in primary education (thousa...,678907,Estimate.,"United Nations Educational, Scientific and Cul..."
2,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (male),104.5,Estimate.,"United Nations Educational, Scientific and Cul..."
3,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (female),99.7,,"United Nations Educational, Scientific and Cul..."
4,1,"Total, all countries or areas",2005,Students enrolled in lower secondary education...,309665,,"United Nations Educational, Scientific and Cul..."


In [91]:
education_data_cleaning = education_data.drop(index=0)
education_data_cleaning.columns = ['Number', 'Region_Country_Area', 'Year', 'Series', 'Value', 'Footnotes', 'Source']
education_data_cleaning.reset_index(drop=True, inplace=True)
#education_data_cleaning.to_csv('cleaned_education_data.csv', index=False)
education_data_cleaning.head()

Unnamed: 0,Number,Region_Country_Area,Year,Series,Value,Footnotes,Source
0,1,"Total, all countries or areas",2005,Students enrolled in primary education (thousa...,678907.0,Estimate.,"United Nations Educational, Scientific and Cul..."
1,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (male),104.5,Estimate.,"United Nations Educational, Scientific and Cul..."
2,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (female),99.7,,"United Nations Educational, Scientific and Cul..."
3,1,"Total, all countries or areas",2005,Students enrolled in lower secondary education...,309665.0,,"United Nations Educational, Scientific and Cul..."
4,1,"Total, all countries or areas",2005,Gross enrollment ratio - Lower secondary level...,80.7,,"United Nations Educational, Scientific and Cul..."
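
As an alternative to dropping the first row by hand, the real header row can be selected while reading the file. The sketch below uses a small made-up sample (not the actual UN file) that mimics the two-line header seen above. The `thousands=','` argument is an assumption that large values in the raw file use comma separators.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of the raw export:
# a title row first, then the real column names on the second row.
raw_csv = io.StringIO(
    'T07,"Enrollment in primary, lower secondary and upper secondary education levels",,,,,\n'
    'Region/Country/Area,,Year,Series,Value,Footnotes,Source\n'
    '1,"Total, all countries or areas",2005,Students enrolled in primary education,"678,907",Estimate.,UNESCO\n'
    '1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (male),104.5,Estimate.,UNESCO\n'
)

# header=1 takes the column names from the second line and skips the title line.
# thousands=',' lets pandas parse values such as "678,907" as numbers.
sample = pd.read_csv(raw_csv, header=1, thousands=',')

# The second header cell is empty, so the columns are renamed explicitly
# to the names used in this notebook.
sample.columns = ['Number', 'Region_Country_Area', 'Year', 'Series',
                  'Value', 'Footnotes', 'Source']
print(sample.dtypes)
```

With this approach `Year` and `Value` arrive as numeric columns, so the later type conversion step may become unnecessary.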


In [92]:
# Drop the 'Source' column since all entries are identical and not important
print(education_data_cleaning.columns)
education_data_cleaning.drop(columns=['Source'], inplace=True)

Index(['Number', 'Region_Country_Area', 'Year', 'Series', 'Value', 'Footnotes',
       'Source'],
      dtype='object')


In [93]:
education_data_cleaning.info()
education_data_cleaning.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7530 entries, 0 to 7529
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Number               7530 non-null   object
 1   Region_Country_Area  7530 non-null   object
 2   Year                 7530 non-null   object
 3   Series               7530 non-null   object
 4   Value                7530 non-null   object
 5   Footnotes            653 non-null    object
dtypes: object(6)
memory usage: 353.1+ KB


Unnamed: 0,Number,Region_Country_Area,Year,Series,Value,Footnotes
0,1,"Total, all countries or areas",2005,Students enrolled in primary education (thousa...,678907.0,Estimate.
1,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (male),104.5,Estimate.
2,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (female),99.7,
3,1,"Total, all countries or areas",2005,Students enrolled in lower secondary education...,309665.0,
4,1,"Total, all countries or areas",2005,Gross enrollment ratio - Lower secondary level...,80.7,


### Changing data types

In [94]:
# Convert 'Year' to a number; entries that cannot be parsed become NaN
education_data_cleaning['Year'] = pd.to_numeric(education_data_cleaning['Year'], errors='coerce')
# Strip thousands separators from 'Value' before converting it to a number
education_data_cleaning['Value'] = education_data_cleaning['Value'].replace({',': ''}, regex=True)
education_data_cleaning['Value'] = pd.to_numeric(education_data_cleaning['Value'], errors='coerce')
education_data_cleaning.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7530 entries, 0 to 7529
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Number               7530 non-null   object 
 1   Region_Country_Area  7530 non-null   object 
 2   Year                 7530 non-null   int64  
 3   Series               7530 non-null   object 
 4   Value                7530 non-null   float64
 5   Footnotes            653 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 353.1+ KB
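
Because `errors='coerce'` silently turns unparseable entries into `NaN`, it can be worth checking which raw values were lost in the conversion. The following is a minimal sketch on made-up values (the `'n/a'` entry is hypothetical and does not come from the dataset):

```python
import pandas as pd

# Hypothetical raw values: a thousands separator, a plain number, a non-numeric entry.
raw_values = pd.Series(['678,907', '104.5', 'n/a'])

# Remove thousands separators, then convert; unparseable entries become NaN.
converted = pd.to_numeric(raw_values.str.replace(',', '', regex=False), errors='coerce')

# Entries that were present before the conversion but are NaN afterwards were coerced.
coerced_mask = raw_values.notna() & converted.isna()
print(raw_values[coerced_mask])
```

In the run above, `Value` keeps 7530 non-null entries after conversion, so no entry was coerced in this dataset.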


Check for duplicates

In [95]:
duplicate_rows = education_data_cleaning[education_data_cleaning.duplicated()]
print(duplicate_rows)

Empty DataFrame
Columns: [Number, Region_Country_Area, Year, Series, Value, Footnotes]
Index: []
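
`duplicated()` without arguments only flags rows that are identical in every column. A stricter check, sketched here on made-up rows, looks for repeated combinations of region, year and series even when the values differ. Treating these three columns as the identifying key is an assumption, not something the data documentation states.

```python
import pandas as pd

# Hypothetical rows: the first two share the same key but report different values.
df = pd.DataFrame({
    'Region_Country_Area': ['Albania', 'Albania', 'Algeria'],
    'Year': [2005, 2005, 2005],
    'Series': ['Gross enrollment ratio - Primary (male)'] * 3,
    'Value': [101.2, 101.5, 110.0],
})

# Assumed key: one observation per region, year and series.
key_columns = ['Region_Country_Area', 'Year', 'Series']

# keep=False marks every row of a repeated key, not just the later occurrences.
conflicting = df[df.duplicated(subset=key_columns, keep=False)]
print(conflicting)
```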


Check for missing data or incorrect entries

In [96]:
missing_data = education_data_cleaning.isnull().sum()
print(missing_data)

Number                    0
Region_Country_Area       0
Year                      0
Series                    0
Value                     0
Footnotes              6877
dtype: int64


In [97]:
unique_footnotes = education_data_cleaning['Footnotes'].unique()
print(unique_footnotes)

['Estimate.' nan]


In [98]:
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection:
# pandas warns that the chained inplace call acts on a copy and will stop working in pandas 3.0.
education_data_cleaning['Footnotes'] = education_data_cleaning['Footnotes'].fillna('Unknown')


In [99]:
education_data_cleaning.head(20)

Unnamed: 0,Number,Region_Country_Area,Year,Series,Value,Footnotes
0,1,"Total, all countries or areas",2005,Students enrolled in primary education (thousa...,678907.0,Estimate.
1,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (male),104.5,Estimate.
2,1,"Total, all countries or areas",2005,Gross enrollment ratio - Primary (female),99.7,Unknown
3,1,"Total, all countries or areas",2005,Students enrolled in lower secondary education...,309665.0,Unknown
4,1,"Total, all countries or areas",2005,Gross enrollment ratio - Lower secondary level...,80.7,Unknown
5,1,"Total, all countries or areas",2005,Gross enrollment ratio - Lower secondary level...,76.7,Unknown
6,1,"Total, all countries or areas",2005,Students enrolled in upper secondary education...,199767.0,Unknown
7,1,"Total, all countries or areas",2005,Gross enrollment ratio - Upper secondary level...,51.2,Unknown
8,1,"Total, all countries or areas",2005,Gross enrollment ratio - Upper secondary level...,48.3,Unknown
9,1,"Total, all countries or areas",2010,Students enrolled in primary education (thousa...,697253.0,Unknown


In [100]:
unique_region_country = education_data_cleaning['Region_Country_Area'].unique()
print(unique_region_country)
education_data_cleaning.to_csv("test.csv", index= False)

['Total, all countries or areas' 'Northern Africa' 'Sub-Saharan Africa'
 'Northern America' 'Latin America & the Caribbean' 'Central Asia'
 'Eastern Asia' 'South-eastern Asia' 'Southern Asia' 'Western Asia'
 'Europe' 'Oceania' 'Australia and New Zealand' 'Afghanistan' 'Albania'
 'Algeria' 'Andorra' 'Angola' 'Anguilla' 'Antigua and Barbuda' 'Argentina'
 'Armenia' 'Aruba' 'Australia' 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain'
 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin' 'Bermuda'
 'Bhutan' 'Bolivia (Plurin. State of)' 'Bosnia and Herzegovina' 'Botswana'
 'Brazil' 'British Virgin Islands' 'Brunei Darussalam' 'Bulgaria'
 'Burkina Faso' 'Burundi' 'Cabo Verde' 'Cambodia' 'Cameroon' 'Canada'
 'Cayman Islands' 'Central African Republic' 'Chad' 'Chile' 'China'
 'China, Hong Kong SAR' 'China, Macao SAR' 'Colombia' 'Comoros' 'Congo'
 'Cook Islands' 'Costa Rica' 'Côte d\x92Ivoire' 'Croatia' 'Cuba' 'Curaçao'
 'Cyprus' 'Czechia' "Dem. People's Rep. Korea" 'Dem. Rep. of the Congo'
 '

Some country names are incorrectly formatted. For example, `'Côte d\x92Ivoire'` contains the character `\x92` where an apostrophe should be.

TODO: change country names, find outliers (e.g. negative numbers), ...
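
A possible starting point for the TODO items, sketched on made-up rows. It assumes the stray `\x92` comes from a Windows-1252 right single quotation mark that was decoded as ISO-8859-1. If that holds, re-reading the file with `encoding='cp1252'` might avoid the problem at the source. The negative value is hypothetical and only illustrates the outlier check.

```python
import pandas as pd

# Hypothetical rows: one mis-decoded country name and one implausible negative value.
df = pd.DataFrame({
    'Region_Country_Area': ['Côte d\x92Ivoire', 'Albania', 'Albania'],
    'Value': [95.3, -4.0, 101.2],
})

# Replace the mis-decoded character with a plain apostrophe.
df['Region_Country_Area'] = df['Region_Country_Area'].str.replace('\x92', "'", regex=False)

# Enrollment counts and ratios should not be negative, so flag such rows for review.
negative_values = df[df['Value'] < 0]

print(df['Region_Country_Area'].unique())
print(negative_values)
```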