# <u>**ACADEMY**</u> : Educational Systems Data Analysis

<details open>
<summary><u style="font-size: 2rem">Summary: </u></summary>

* <a href="#author" data-toc-modified-id="author" >Author</a>
* <a href="#introduction" data-toc-modified-id="introduction" >Introduction</a>
* <a href="#functions" data-toc-modified-id="functions" >Functions</a>
* <a href="#import-necessary-libraries" data-toc-modified-id="import-necessary-libraries" >Import Necessary Libraries</a>
* <a href="#load-the-data" data-toc-modified-id="load-the-data" >Load the Data</a>
* <a href="#initial-data-exploration" data-toc-modified-id="initial-data-exploration">Initial Data Exploration</a>
    * <a href="#stats-data" data-toc-modified-id="stats-data">1 - EdStatsData</a>
    * <a href="#stats-country" data-toc-modified-id="stats-country">2 - EdStatsCountry</a>
    * <a href="#stats-country-series" data-toc-modified-id="stats-country-series">3 - EdStatsCountry-Series</a>
    * <a href="#stats-series" data-toc-modified-id="stats-series">4 - EdStatsSeries</a>
    * <a href="#stats-foot-note" data-toc-modified-id="stats-foot-note">5 - EdStatsFootNote</a>
    * <a href="#analysis-note" data-toc-modified-id="analysis-note"><u>Analysis note</u></a>
* <a href="#data-cleaning" data-toc-modified-id="data-cleaning">Data Cleaning</a>
    * <a href="#relevant indicators" data-toc-modified-id="relevant indicators">Relevant Indicators</a>
    * <a href="#filter-&-eliminate-unnecessary-columns" data-toc-modified-id="filter-&-eliminate-unnecessary-columns">Filter & Eliminate Unnecessary Columns</a>
        * <a href="#cleaning-stats-data" data-toc-modified-id="cleaning-stats-data">1 - EdStatsData</a>
        * <a href="#cleaning-stats-data" data-toc-modified-id="cleaning-stats-data">2 - EdStatsCountry</a>
        * <a href="#cleaning-stats-country-series" data-toc-modified-id="cleaning-stats-country-series">3 - EdStatsCountry-Series</a>
        * <a href="#cleaning-series" data-toc-modified-id="cleaning-series">4 - EdStatsSeries</a>
        * <a href="#cleaning-stats-foot-note" data-toc-modified-id="cleaning-stats-foot-note">5 - EdStatsFootNote</a>


</details>

<a id='author'></a>

## Author
> Mohamed Ali EL HAMECH 

- @MasterCodeDevelop (Github)
- E-mail : master.code.develop@gmail.com
- Affiliation : ACADEMY #OpenClassRooms
<p><img align="left" src="https://github-readme-stats.vercel.app/api/top-langs?username=mastercodedevelop&show_icons=true&locale=en&layout=compact" alt="mastercodedevelop" /></p>

<a id='introduction'></a>

## Introduction:
>In the rapidly evolving landscape of online education, ACADEMY, a burgeoning EdTech start-up, has been making significant strides by offering high-quality online training content tailored for high school and university students. As part of its strategic vision, ACADEMY is keenly exploring opportunities for international expansion, aiming to tap into markets with a robust educational framework and a potential clientele for its services.

_The objective of this analysis is twofold. Firstly, it seeks to delve into global educational data, sourced from the World Bank, to ascertain the viability and potential of various countries as prospective markets for ACADEMY. This dataset, curated by the "EdStats All Indicator Query" of the World Bank, boasts a comprehensive collection of over 4,000 international indicators. These indicators span a range of metrics, from access to education and graduation rates to insights about educators and educational expenditures._

_Secondly, this analysis aims to provide a clear, data-driven narrative that would aid ACADEMY's decision-makers in charting the company's international trajectory. By evaluating the quality of the dataset, understanding its breadth and depth, and extracting relevant insights, we hope to offer a roadmap that aligns with ACADEMY's mission and vision._

<a id='import-necessary-libraries'></a>

## Import Necessary Libraries
> Before you can work with the data, you need to import the necessary libraries.

In [1]:
import pandas as pd
import numpy as np

<a id='functions'></a>

## Functions

In [2]:
def title(text):
    """
    Format the given text with an underline.
    
    Parameters:
    - text (str): The input string to be formatted.
    
    Returns:
    - str: The formatted string with an underline with a line break.
    
    Example:
    --------
    >>> title("Hello World")
    '\x1B[4mHello World\x1B[0m \n'
    """
    
    # ANSI escape code '\x1B[4m' is used to start the underline effect
    # ANSI escape code '\x1B[0m' is used to reset the formatting
    return f"\x1B[4m{text}\x1B[0m \n"

In [3]:
def missing_values_table(df, percentage = 0):
    """
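    Return a DataFrame displaying the count and percentage of missing values for each column.
    
    Parameters:
    - df (DataFrame): The input DataFrame to check for missing values.
    - percentage (float, optional): Minimum missing percentage a column must reach to be listed. Default is 0.
    
    Returns:
    - DataFrame or None: A DataFrame with columns 'Missing Values' and 'Missing Percentages (%)', sorted in
      descending order of missing percentage, or None (after printing a message) when no column qualifies.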
    """
    
    # Calculate missing values and their percentages
    missing_values = df.isnull().sum()
    missing_percentages = (df.isnull().mean() * 100).round(2)  # rounding to 2 decimal places for clarity
    
    # Create a DataFrame to display missing data info
    missing_data = pd.DataFrame({
        'Missing Values': missing_values, 
        'Missing Percentages (%)': missing_percentages
    })
    
    # Filtering columns based on the given percentage and sorting by percentage
    missing_data = missing_data[(missing_data['Missing Values'] > 0) & (missing_data['Missing Percentages (%)'] >= percentage)]
    missing_data = missing_data.sort_values(by='Missing Percentages (%)', ascending=False)
    
    # Display if percentage is greater than 0
    if percentage > 0:
        print(title(f"Displaying columns with missing data >= {percentage}%:"))
        print(f"Total columns that have missing data >= {percentage}%: {missing_data.index.size}")
    else:
        print(title("Displaying all columns with missing data:"))
    
    if missing_data.empty:
        print("There are no columns with the specified percentage of missing data.")
    else:
        return missing_data

In [4]:
def display_duplicate_rows(df):
    """
    Display duplicate rows of a DataFrame.
    
    Parameters:
    - df (DataFrame): The input DataFrame to check for duplicate rows.
    
    Returns:
    - None: Prints the number of duplicate rows or a message indicating no duplicates.
    """
    
    # Find duplicate rows
    duplicated_rows = df[df.duplicated()]
    duplicated_rows_sum = duplicated_rows.shape[0]

    if duplicated_rows_sum > 0:
        # Use an f-string: concatenating a str with an int would raise a TypeError
        print(title(f"Total duplicate rows: {duplicated_rows_sum}"))
        print("Displaying duplicate rows:")
        display(duplicated_rows)  # This will display the DataFrame in a Jupyter Notebook environment
    else:
        print("There are no duplicate rows.")

In [5]:
def drop_columns(df, percentage=100):
    """
    Drop columns from a DataFrame based on a specified missing value percentage threshold.
    
    Parameters:
    - df (DataFrame): The input DataFrame from which columns will be dropped.
    - percentage (float, optional): The missing value percentage threshold. Columns with missing values 
      equal to or greater than this threshold will be dropped. Default is 100.
    
    Returns:
    - None: The function modifies the DataFrame in-place and prints the result of the operation.
    
    Example:
    --------
    >>> df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [np.nan, np.nan, np.nan]})
    >>> drop_columns(df)
    1 column dropped because it had 100% missing values: B
    """
    
    # Get the missing data information using the previously defined function
    # (it returns None when no column has missing values)
    missing_data = missing_values_table(df)
    
    # Identify columns to drop based on the given percentage threshold
    if missing_data is None:
        columns_to_drop = []
    else:
        columns_to_drop = missing_data[missing_data['Missing Percentages (%)'] >= percentage].index.tolist()
    
    # Drop the identified columns and print the result
    if columns_to_drop:
        df.drop(columns_to_drop, axis=1, inplace=True)
        single = len(columns_to_drop) == 1
        plural = '' if single else 's'
        pronoun = 'it' if single else 'they'
        print(f"{len(columns_to_drop)} column{plural} dropped because {pronoun} had {percentage}% missing values: {', '.join(map(str, columns_to_drop))}")
    else:
        print(f"No columns were dropped as none had {percentage}% missing values.")

<a id='load-the-data'></a>

## Load the Data
> Importing the dataset into the environment for analysis and exploration.

In [6]:
# Raw datasets as loaded from disk; these are kept unmodified for reference
COUNTRY_SERIES = pd.read_csv('./data/EdStatsCountry-Series.csv')
COUNTRY = pd.read_csv('./data/EdStatsCountry.csv')
DATA = pd.read_csv('./data/EdStatsData.csv')
FOOT_NOTE = pd.read_csv('./data/EdStatsFootNote.csv')
SERIES = pd.read_csv('./data/EdStatsSeries.csv')

# Working copies of the datasets; these are the ones we will modify
country_series = COUNTRY_SERIES.copy()
country = COUNTRY.copy()
data = DATA.copy()
foot_note = FOOT_NOTE.copy()
series = SERIES.copy()


## Initial Data Exploration <a id='initial-data-exploration'></a>
> Diving into the dataset to understand its structure, content, and characteristics.

### 1 - <u>EdStatsData :</u><a id='stats-data'></a>

#### 1.1 - Preview of the First and Last Rows of the Dataset
> A look at the first and last rows of the dataset, giving a quick sense of its structure.

In [7]:
print("First 5 rows of the dataset:")
DATA.head()

First 5 rows of the dataset:


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,,,...,,,,,,,,,,
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,...,,,,,,,,,,


In [8]:
print("Last 5 rows of the dataset:")
DATA.tail()

Last 5 rows of the dataset:


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
886925,Zimbabwe,ZWE,"Youth illiterate population, 15-24 years, male...",UIS.LP.AG15T24.M,,,,,,,...,,,,,,,,,,
886926,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, b...",SE.ADT.1524.LT.ZS,,,,,,,...,,,,,,,,,,
886927,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, f...",SE.ADT.1524.LT.FE.ZS,,,,,,,...,,,,,,,,,,
886928,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, g...",SE.ADT.1524.LT.FM.ZS,,,,,,,...,,,,,,,,,,
886929,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, m...",SE.ADT.1524.LT.MA.ZS,,,,,,,...,,,,,,,,,,


> <u>Note:</u> By examining the first and last 5 rows, we notice a large amount of missing data, marked as 'NaN'.

#### 1.2 - General information about the DataFrame

In [9]:
# This includes the data type, number of non-null values, etc.
DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 886930 entries, 0 to 886929
Data columns (total 70 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Country Name    886930 non-null  object 
 1   Country Code    886930 non-null  object 
 2   Indicator Name  886930 non-null  object 
 3   Indicator Code  886930 non-null  object 
 4   1970            72288 non-null   float64
 5   1971            35537 non-null   float64
 6   1972            35619 non-null   float64
 7   1973            35545 non-null   float64
 8   1974            35730 non-null   float64
 9   1975            87306 non-null   float64
 10  1976            37483 non-null   float64
 11  1977            37574 non-null   float64
 12  1978            37576 non-null   float64
 13  1979            36809 non-null   float64
 14  1980            89122 non-null   float64
 15  1981            38777 non-null   float64
 16  1982            37511 non-null   float64
 17  1983      

In [10]:
# Display data types of each column
print(DATA.dtypes)

Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
1970              float64
                   ...   
2085              float64
2090              float64
2095              float64
2100              float64
Unnamed: 69       float64
Length: 70, dtype: object


In [11]:
# Count columns by their data type
print(title("Number of columns by data type:"))
print(DATA.dtypes.value_counts())

Number of columns by data type:

float64    66
object      4
dtype: int64


> <u>Note :</u> The first 4 columns are of type object, indicating that they contain strings. The remaining 66 columns (the 65 year columns plus 'Unnamed: 69') are of type float64, meaning they hold numerical data.

#### 1.3 - Columns

In [12]:
# Display columns names
DATA.columns.tolist()

['Country Name',
 'Country Code',
 'Indicator Name',
 'Indicator Code',
 '1970',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2020',
 '2025',
 '2030',
 '2035',
 '2040',
 '2045',
 '2050',
 '2055',
 '2060',
 '2065',
 '2070',
 '2075',
 '2080',
 '2085',
 '2090',
 '2095',
 '2100',
 'Unnamed: 69']

> <u>Note :</u> The first 4 columns have descriptive names. From the fifth column onward, the column names are years in ascending order, from 1970 to 2100. From 1970 to 2017 the years increase by 1. There are no columns for 2018 and 2019, and from 2020 to 2100 the years increase in steps of 5. Lastly, there is a column named 'Unnamed: 69', which appears to be there by mistake.
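
The column layout described in this note can be checked programmatically. The snippet below is a minimal sketch on a hypothetical subset of the column labels (not the full `DATA.columns`); the names `historical`, `projections`, and `other_cols` are illustrative, and treating the post-2017 columns as projections is an assumption based only on their dates.

```python
import pandas as pd

# Toy stand-in for DATA.columns (hypothetical subset of the real labels)
columns = pd.Index(
    ["Country Name", "Country Code", "Indicator Name", "Indicator Code",
     "1970", "1971", "2016", "2017", "2020", "2025", "2100", "Unnamed: 69"]
)

# Labels made only of digits are treated as years
year_cols = [c for c in columns if c.isdigit()]
other_cols = [c for c in columns if not c.isdigit()]

# 2017 is the last annual column; later labels follow 5-year steps
historical = [c for c in year_cols if int(c) <= 2017]
projections = [c for c in year_cols if int(c) > 2017]

print("Annual columns:", historical)
print("5-year-step columns:", projections)
print("Non-year columns:", other_cols)
```

Splitting the labels this way makes it easy to treat the annual block, the 5-year block, and the stray 'Unnamed: 69' column separately during cleaning.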

#### 1.4 - Missing Values & Percentages
> Overview of missing data in columns, presented as counts & percentages

In [13]:
print("\nBoolean mask of missing values (True = missing):")
DATA.isnull()


Boolean mask of missing values (True = missing):


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,False,False,False,False,False,False,False,False,False,False,...,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886925,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
886926,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
886927,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
886928,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


In [14]:
missing_values_table(DATA)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 69,886930,100.00
2017,886787,99.98
2016,870470,98.14
1971,851393,95.99
1973,851385,95.99
...,...,...
2011,740918,83.54
2012,739666,83.40
2000,710254,80.08
2005,702822,79.24


In [15]:
# Display columns with missing data >= 70%
missing_values_table(DATA, 70)

Displaying columns with missing data >= 70%:

Total columns that have missing data >= 70%: 66


Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 69,886930,100.00
2017,886787,99.98
2016,870470,98.14
1971,851393,95.99
1973,851385,95.99
...,...,...
2011,740918,83.54
2012,739666,83.40
2000,710254,80.08
2005,702822,79.24


In [16]:
# Display columns with missing data >= 90%
missing_values_table(DATA, 90)

Displaying columns with missing data >= 90%:

Total columns that have missing data >= 90%: 45


Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 69,886930,100.0
2017,886787,99.98
2016,870470,98.14
1971,851393,95.99
1973,851385,95.99
1972,851311,95.98
1974,851200,95.97
1979,850121,95.85
1976,849447,95.77
1982,849419,95.77


> <u>Note :</u> Out of the 70 columns, a large majority show missing data: 45 columns have at least 90% missing data, and 66 have at least 70% missing. Only the 4 identifier columns are complete. Notably, the column 'Unnamed: 69' is entirely empty, with 100% missing data.
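
The threshold counts quoted in this note can be reproduced without the helper function. The following is a minimal sketch on a hypothetical toy DataFrame (the column names and values are invented for illustration), using the same ">= threshold" rule as `missing_values_table`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with 0%, 50%, 75% and 100% missing columns
toy = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "half": [1.0, np.nan, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
    "empty": [np.nan] * 4,
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Number of columns at or above each threshold
counts = {t: int((missing_pct >= t).sum()) for t in (50, 70, 100)}
print(counts)
```

Applied to `DATA` with thresholds 70 and 90, the same expression should give the totals printed by the cells above.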

#### 1.5 - Descriptive Statistics
> Displaying descriptive statistics to understand central tendencies, dispersion, and shape of the dataset's distribution.

In [17]:
DATA.describe()

Unnamed: 0,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
count,72288.0,35537.0,35619.0,35545.0,35730.0,87306.0,37483.0,37574.0,37576.0,36809.0,...,51436.0,51436.0,51436.0,51436.0,51436.0,51436.0,51436.0,51436.0,51436.0,0.0
mean,1974772000.0,4253638000.0,4592365000.0,5105006000.0,5401493000.0,2314288000.0,5731808000.0,6124437000.0,6671489000.0,7436724000.0,...,722.4868,727.129,728.3779,726.6484,722.8327,717.6899,711.3072,703.4274,694.0296,
std,121168700000.0,180481400000.0,191408300000.0,205917000000.0,211215000000.0,137505900000.0,221554600000.0,232548900000.0,247398600000.0,266095700000.0,...,22158.45,22879.9,23523.38,24081.49,24558.97,24965.87,25301.83,25560.69,25741.89,
min,-1.435564,-1.594625,-3.056522,-4.032582,-4.213563,-3.658569,-2.950945,-3.17487,-3.558749,-2.973612,...,-1.63,-1.44,-1.26,-1.09,-0.92,-0.78,-0.65,-0.55,-0.45,
25%,0.89,8.85321,9.24092,9.5952,9.861595,1.4,9.312615,9.519913,10.0,10.0,...,0.03,0.03,0.02,0.02,0.01,0.01,0.01,0.01,0.01,
50%,6.317724,63.1624,66.55139,69.69595,70.8776,9.67742,71.0159,71.33326,72.90512,75.10173,...,0.23,0.23,0.23,0.23,0.23,0.23,0.23,0.23,0.22,
75%,62.5125,56552.0,58636.5,62029.0,63836.75,78.54163,56828.0,57391.75,59404.25,64115.0,...,7.505,7.5,7.3,7.1,6.7225,6.08,5.4625,4.68,4.0325,
max,19039290000000.0,19864570000000.0,21009160000000.0,22383670000000.0,22829910000000.0,23006340000000.0,24241280000000.0,25213830000000.0,26221010000000.0,27308730000000.0,...,2951569.0,3070879.0,3169711.0,3246239.0,3301586.0,3337871.0,3354746.0,3351887.0,3330484.0,


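> <u>Note :</u> The summary statistics vary enormously within and across columns. For 1970, for example, the median is about 6.3 while the maximum is about 1.9 × 10^13, and the standard deviation far exceeds the mean. Such a spread suggests that the rows mix indicators expressed on very different scales (for example percentages alongside large absolute amounts), so these column-wide statistics should be read with caution.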
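
Because each row of the dataset belongs to a specific indicator, statistics are easier to interpret when computed per indicator rather than over a whole year column. The snippet below is a rough sketch on a hypothetical toy frame; the indicator codes `GDP` and `RATE` and their values are invented for illustration.

```python
import pandas as pd

# Toy frame (hypothetical indicator codes and values on two very different scales)
toy = pd.DataFrame({
    "Indicator Code": ["GDP", "GDP", "RATE", "RATE"],
    "1970": [1.0e12, 3.0e12, 40.0, 60.0],
})

# Pooled statistic: dominated by the large-scale indicator
pooled_mean = toy["1970"].mean()

# Per-indicator statistics keep each unit on its own scale
per_indicator = toy.groupby("Indicator Code")["1970"].agg(["mean", "min", "max"])

print(f"Pooled mean: {pooled_mean:.3g}")
print(per_indicator)
```

Grouping by `Indicator Code` before calling `describe()` or `agg()` is one way to avoid mixing units; it is offered here as an option, not as the method used later in this notebook.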

#### 1.6 - Display duplicate rows

In [18]:
display_duplicate_rows(DATA)

There are no duplicate rows.


> <u>Note :</u> We can observe that there are no duplicated rows.

#### 1.7 - Unique Values Check
> Checking unique values for categorical columns to understand the different categories present.

In [19]:
for column in DATA.select_dtypes(include=['object']).columns:
    print(f"Unique values for {column}: {DATA[column].nunique()}")
    print(DATA[column].unique())
    print("\n")

Unique values for Country Name: 242
['Arab World' 'East Asia & Pacific'
 'East Asia & Pacific (excluding high income)' 'Euro area'
 'Europe & Central Asia' 'Europe & Central Asia (excluding high income)'
 'European Union' 'Heavily indebted poor countries (HIPC)' 'High income'
 'Latin America & Caribbean'
 'Latin America & Caribbean (excluding high income)'
 'Least developed countries: UN classification' 'Low & middle income'
 'Low income' 'Lower middle income' 'Middle East & North Africa'
 'Middle East & North Africa (excluding high income)' 'Middle income'
 'North America' 'OECD members' 'South Asia' 'Sub-Saharan Africa'
 'Sub-Saharan Africa (excluding high income)' 'Upper middle income'
 'World' 'Afghanistan' 'Albania' 'Algeria' 'American Samoa' 'Andorra'
 'Angola' 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Aruba' 'Australia'
 'Austria' 'Azerbaijan' 'Bahamas, The' 'Bahrain' 'Bangladesh' 'Barbados'
 'Belarus' 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
 'Bosnia and Herze

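> <u>Note: </u> The 'Country Name' column holds 242 unique values that mix regional and income-group aggregates (e.g. 'Arab World', 'High income', 'World') with individual countries (e.g. 'Afghanistan', 'Albania').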
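
Since the 'Country Name' column mixes aggregates with individual countries, it can help to separate the two before comparing markets. The sketch below assumes a hand-built set of aggregate labels copied from the output above (it is not exhaustive) and a hypothetical toy frame; `AGGREGATES` and `countries_only` are illustrative names.

```python
import pandas as pd

# Aggregate labels copied from the printed output above (not an exhaustive list)
AGGREGATES = {"Arab World", "Euro area", "High income", "World"}

# Toy frame (hypothetical rows) standing in for DATA
toy = pd.DataFrame(
    {"Country Name": ["Arab World", "World", "Albania", "Zimbabwe", "Albania"]}
)

# Flag rows that belong to an aggregate rather than a single country
is_aggregate = toy["Country Name"].isin(AGGREGATES)

# Keep only individual countries, preserving first-seen order
countries_only = toy.loc[~is_aggregate, "Country Name"].unique().tolist()
print(countries_only)
```

A complete list of aggregates would have to be built from the dataset itself or from the country metadata file; the four names used here only illustrate the filtering step.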

#### 1.8 - Dataset Dimensions & Unique Entries Overview
> - the number of rows 
> - the number of columns
> - unique number of countries 
> - unique number of indicators

In [20]:
print(f"Number of rows: {DATA.shape[0]}")
print(f"Number of columns: {DATA.shape[1]}")
print(f"Number of unique countries: {DATA['Country Name'].nunique()}")
print(f"Number of unique indicators: {DATA['Indicator Name'].nunique()}")

Number of rows: 886930
Number of columns: 70
Number of unique countries: 242
Number of unique indicators: 3665


> <u>Note :</u> The dataset spans 886,930 rows and 70 columns and covers 242 unique country names (individual countries as well as regional and income-group aggregates) and 3,665 unique indicators. The row count equals 242 × 3,665, which is consistent with one row per country-indicator pair, with the yearly values spread across the columns. This breadth of countries and indicators offers a global perspective suitable for varied analyses.
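
The figures above multiply out exactly (242 × 3,665 = 886,930), which hints that the table holds one row per country-indicator pair. The snippet below is a minimal sketch of how that could be verified, shown on a hypothetical toy frame; `is_full_grid` is an illustrative name.

```python
import pandas as pd

# Toy frame (hypothetical rows): 2 countries x 2 indicators
toy = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "ZWE", "ZWE"],
    "Indicator Code": ["SE.PRM.TENR", "UIS.NERA.2", "SE.PRM.TENR", "UIS.NERA.2"],
})

n_countries = toy["Country Code"].nunique()
n_indicators = toy["Indicator Code"].nunique()

# Each (country, indicator) pair must appear at most once...
pairs_unique = not toy.duplicated(subset=["Country Code", "Indicator Code"]).any()

# ...and the row count must equal the size of the full grid
is_full_grid = pairs_unique and len(toy) == n_countries * n_indicators
print(f"Full country x indicator grid: {is_full_grid}")

# The same arithmetic for the figures reported in this section
print(242 * 3665)
```

Running the same check on `DATA` would confirm or refute the one-row-per-pair reading; it has not been run here.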

#### 1.9 - <u>Notes</u> : 

The dataset, consisting of 886,930 rows and 70 columns, covers 242 unique country names (countries and aggregates) and 3,665 unique indicators. The first four columns contain strings, while the subsequent ones, named after years, hold numerical data. The year columns are annual from 1970 to 2017, skip 2018 and 2019, and then advance in five-year steps from 2020 to 2100. Missing data is pervasive: every year column is at least 70% empty, and the 'Unnamed: 69' column is entirely empty. Values vary widely in scale, and no rows are duplicated. Despite the gaps, the dataset offers a broad basis for diverse analyses.

### 2 - <u>EdStatsCountry :</u><a id='stats-country'></a>

#### 2.1 - Columns 

In [21]:
# Display columns names
COUNTRY.columns.tolist()

['Country Code',
 'Short Name',
 'Table Name',
 'Long Name',
 '2-alpha code',
 'Currency Unit',
 'Special Notes',
 'Region',
 'Income Group',
 'WB-2 code',
 'National accounts base year',
 'National accounts reference year',
 'SNA price valuation',
 'Lending category',
 'Other groups',
 'System of National Accounts',
 'Alternative conversion factor',
 'PPP survey year',
 'Balance of Payments Manual in use',
 'External debt Reporting status',
 'System of trade',
 'Government Accounting concept',
 'IMF data dissemination standard',
 'Latest population census',
 'Latest household survey',
 'Source of most recent Income and expenditure data',
 'Vital registration complete',
 'Latest agricultural census',
 'Latest industrial data',
 'Latest trade data',
 'Latest water withdrawal data',
 'Unnamed: 31']

> <u>Notes :</u> The data represents metadata pertaining to various countries, encompassing codes, names, economic details, and census dates. The 'Unnamed: 31' column appears redundant and may contain errors or missing values.

#### 2.2 - Preview of the First and Last Rows of the Dataset

In [22]:
print('First 5 rows of the dataset:')
COUNTRY.head()

First 5 rows of the dataset:


Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,...,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,Unnamed: 31
0,ABW,Aruba,Aruba,Aruba,AW,Aruban florin,SNA data for 2000-2011 are updated from offici...,Latin America & Caribbean,High income: nonOECD,AW,...,,2010,,,Yes,,,2012.0,,
1,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2012.0,2000.0,
2,AGO,Angola,Angola,People's Republic of Angola,AO,Angolan kwanza,"April 2013 database update: Based on IMF data,...",Sub-Saharan Africa,Upper middle income,AO,...,General Data Dissemination System (GDDS),1970,"Malaria Indicator Survey (MIS), 2011","Integrated household survey (IHS), 2008",,2015,,,2005.0,
3,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,...,General Data Dissemination System (GDDS),2011,"Demographic and Health Survey (DHS), 2008/09",Living Standards Measurement Study Survey (LSM...,Yes,2012,2010.0,2012.0,2006.0,
4,AND,Andorra,Andorra,Principality of Andorra,AD,Euro,,Europe & Central Asia,High income: nonOECD,AD,...,,2011. Population figures compiled from adminis...,,,Yes,,,2006.0,,


In [23]:
print('Last 5 rows of the dataset:')
COUNTRY.tail()

Last 5 rows of the dataset:


Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,...,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,Unnamed: 31
236,XKX,Kosovo,Kosovo,Republic of Kosovo,,Euro,"Kosovo became a World Bank member on June 29, ...",Europe & Central Asia,Lower middle income,KV,...,General Data Dissemination System (GDDS),2011,,"Integrated household survey (IHS), 2011",,,,,,
237,YEM,Yemen,"Yemen, Rep.",Republic of Yemen,YE,Yemeni rial,Based on official government statistics and In...,Middle East & North Africa,Lower middle income,RY,...,General Data Dissemination System (GDDS),2004,"Demographic and Health Survey (DHS), 2013","Expenditure survey/budget survey (ES/BS), 2005",,,2006.0,2012.0,2005.0,
238,ZAF,South Africa,South Africa,Republic of South Africa,ZA,South African rand,Fiscal year end: March 31; reporting period fo...,Sub-Saharan Africa,Upper middle income,ZA,...,Special Data Dissemination Standard (SDDS),2011,"Demographic and Health Survey (DHS), 2003; Wor...","Expenditure survey/budget survey (ES/BS), 2010",,2007,2010.0,2012.0,2000.0,
239,ZMB,Zambia,Zambia,Republic of Zambia,ZM,New Zambian kwacha,National accounts data have rebased to reflect...,Sub-Saharan Africa,Lower middle income,ZM,...,General Data Dissemination System (GDDS),2010,"Demographic and Health Survey (DHS), 2013","Integrated household survey (IHS), 2010",,2010. Population and Housing Census.,,2011.0,2002.0,
240,ZWE,Zimbabwe,Zimbabwe,Republic of Zimbabwe,ZW,U.S. dollar,Fiscal year end: June 30; reporting period for...,Sub-Saharan Africa,Low income,ZW,...,General Data Dissemination System (GDDS),2012,"Demographic and Health Survey (DHS), 2010/11","Integrated household survey (IHS), 2011/12",,,,2012.0,2002.0,


> <u>Note:</u> The data includes various elements such as dates and region names. Missing values, denoted by 'NaN', are evident. The 'Unnamed: 31' column appears to be an oversight, as the previewed rows contain only missing values.

#### 2.3 - Missing Values & Percentages

In [24]:
print("\nBoolean mask of missing values (True = missing):")
COUNTRY.isnull()


Boolean mask of missing values (True = missing):


Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,...,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,Unnamed: 31
0,False,False,False,False,False,False,False,False,False,False,...,True,False,True,True,False,True,True,False,True,True
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,True,True,False,True
3,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,True,False,False,False,...,True,False,True,True,False,True,True,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
236,False,False,False,False,True,False,False,False,False,False,...,False,False,True,False,True,True,True,True,True,True
237,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,True
238,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,True
239,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,True


In [25]:
missing_values_table(COUNTRY)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 31,241,100.0
National accounts reference year,209,86.72
Alternative conversion factor,194,80.5
Other groups,183,75.93
Latest industrial data,134,55.6
Vital registration complete,130,53.94
External debt Reporting status,117,48.55
Latest household survey,100,41.49
Latest agricultural census,99,41.08
Lending category,97,40.25


In [26]:
# Display columns with missing data >= 70%
missing_values_table(COUNTRY, 70)

Displaying columns with missing data >= 70%:

Total columns that have missing data >= 70%: 4


Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 31,241,100.0
National accounts reference year,209,86.72
Alternative conversion factor,194,80.5
Other groups,183,75.93


> <u>Note :</u> Out of the 32 columns, 4 have at least 70% missing data. Notably, the 'Unnamed: 31' column is entirely devoid of data, with a 100% missing rate.

#### 2.4 - Display duplicate rows

In [27]:
display_duplicate_rows(COUNTRY)

There are no duplicate rows.


> <u>Note :</u> We can observe that there are no duplicated rows.

#### 2.5 - Dataset Dimensions & Unique Entries Overview

In [28]:
print(f"Number of rows: {COUNTRY.shape[0]}")
print(f"Number of columns: {COUNTRY.shape[1]}")
print(f"Number of unique Regions: {COUNTRY['Region'].nunique()}")


Number of rows: 241
Number of columns: 32
Number of unique Regions: 7


> <u>Note :</u> The dataset, spanning 241 rows and 32 columns, encompasses 7 unique regions.

#### 2.6 - <u>Notes</u> : 

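The dataset, spanning 241 rows and 32 columns, contains metadata about various countries and 7 unique regions. It features elements such as dates, region names, and economic information. Several columns have missing values, notably the 'Unnamed: 31' column, which is entirely empty. Additionally, some columns display unusual values, such as free text mixed with years in 'Latest population census' (e.g. '2011. Population figures compiled from adminis...') and in 'Latest agricultural census' (e.g. '2013/14').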
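
Columns that mix a year with free text can be normalised by keeping only the first four-digit number in each entry. The snippet below is a rough sketch on a hypothetical Series modelled on values seen in the preview above; whether the first four-digit run is always the intended year is an assumption that would need checking against the real column.

```python
import pandas as pd

# Toy Series modelled on values visible in the COUNTRY preview (truncation kept as shown)
census = pd.Series(
    ["2010", "1979", "2011. Population figures compiled from adminis...", "2013/14", None],
    name="Latest population census",
)

# Keep the first run of four digits in each entry; missing entries stay missing
year_text = census.str.extract(r"(\d{4})", expand=False)

# Convert to numbers so the years can be compared or plotted
year_num = pd.to_numeric(year_text)
print(year_num)
```

For a range such as '2013/14' this keeps the starting year only, which is a simplification.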

### 3 - <u>EdStatsCountry-Series :</u><a id='stats-country-series'></a>

#### 3.1 - Columns 

In [29]:
# Display columns names
COUNTRY_SERIES.columns.tolist()

['CountryCode', 'SeriesCode', 'DESCRIPTION', 'Unnamed: 3']

> <u>Notes:</u> The data represents metadata relating to various countries, encompassing country codes, series codes, and descriptions. The 'Unnamed: 3' column appears redundant and may contain errors or missing values.

#### 3.2 - Preview of the First and Last Rows of the Dataset

In [30]:
print('First 5 rows of the dataset:')
COUNTRY_SERIES.head()

First 5 rows of the dataset:


Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION,Unnamed: 3
0,ABW,SP.POP.TOTL,Data sources : United Nations World Population...,
1,ABW,SP.POP.GROW,Data sources: United Nations World Population ...,
2,AFG,SP.POP.GROW,Data sources: United Nations World Population ...,
3,AFG,NY.GDP.PCAP.PP.CD,Estimates are based on regression.,
4,AFG,SP.POP.TOTL,Data sources : United Nations World Population...,


In [31]:
print('Last 5 rows of the dataset:')
COUNTRY_SERIES.tail()

Last 5 rows of the dataset:


Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION,Unnamed: 3
608,ZAF,SP.POP.GROW,"Data sources : Statistics South Africa, United...",
609,ZMB,SP.POP.GROW,Data sources: United Nations World Population ...,
610,ZMB,SP.POP.TOTL,Data sources : United Nations World Population...,
611,ZWE,SP.POP.TOTL,Data sources : United Nations World Population...,
612,ZWE,SP.POP.GROW,Data sources: United Nations World Population ...,


> <u>Note:</u> In the rows previewed above, the 'Unnamed: 3' column contains only missing values ('NaN'), so it likely represents an error rather than real data.

#### 3.3 - Missing Values & Percentages

In [32]:
print("\nNumber of missing values for each column:")
COUNTRY_SERIES.isnull()


Number of missing values for each column:


Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION,Unnamed: 3
0,False,False,False,True
1,False,False,False,True
2,False,False,False,True
3,False,False,False,True
4,False,False,False,True
...,...,...,...,...
608,False,False,False,True
609,False,False,False,True
610,False,False,False,True
611,False,False,False,True


In [33]:
missing_values_table(COUNTRY_SERIES)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 3,613,100.0


> <u>Note :</u> Of the 4 columns, 'Unnamed: 3' is the only one with missing data, and it is entirely empty (100% missing).
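
The `isnull()` call in this subsection returns a boolean mask rather than counts, so the printed label does not match what is displayed. A per-column summary could be sketched as follows; the toy frame is a hypothetical stand-in for `COUNTRY_SERIES`, and the column labels simply mirror the table above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for COUNTRY_SERIES with an all-empty trailing column.
toy = pd.DataFrame({
    "CountryCode": ["ABW", "ABW", "AFG"],
    "SeriesCode": ["SP.POP.TOTL", "SP.POP.GROW", "SP.POP.GROW"],
    "DESCRIPTION": ["Data sources: UN", "Data sources: UN", None],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
})

missing_count = toy.isnull().sum()                       # missing cells per column
missing_percent = (toy.isnull().mean() * 100).round(2)   # share of missing cells in %

summary = pd.DataFrame({
    "Missing Values": missing_count,
    "Missing Percentages (%)": missing_percent,
})

# Keep only columns that actually have missing data, worst first.
summary = summary[summary["Missing Values"] > 0].sort_values(
    "Missing Values", ascending=False
)
print(summary)
```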

#### 3.4 - Display duplicate rows

In [34]:
display_duplicate_rows(COUNTRY_SERIES)

There are no duplicate rows.


> <u>Note :</u> We can observe that there are no duplicated rows.

#### 3.5 - Dataset Dimensions & Unique Entries Overview

In [35]:
print(f"Number of rows: {COUNTRY_SERIES.shape[0]}")
print(f"Number of columns {COUNTRY_SERIES.shape[1]}")
print(f"Number of unique Country Code: {COUNTRY_SERIES['CountryCode'].nunique()}")
print(f"Number of unique Series Code: {COUNTRY_SERIES['SeriesCode'].nunique()}")



Number of rows: 613
Number of columns 4
Number of unique Country Code: 211
Number of unique Series Code: 21


> <u>Note :</u> The dataset spans 613 rows and 4 columns, with 211 unique country codes and 21 unique series codes.

#### 3.6 - <u>Notes</u> : 

The dataset, spanning 613 rows and 4 columns, holds metadata for 211 unique country codes and 21 unique series codes. The 'Unnamed: 3' column is entirely filled with 'NaN' values; it appears to be an oversight and contains no useful data.
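
Columns labelled 'Unnamed: N' are how pandas names a header cell that is empty, which commonly happens when every CSV row ends with a trailing delimiter. Whether that is the cause here is an assumption, since the raw file is not shown. The sketch below reproduces the effect on a hypothetical in-memory CSV and filters such columns out by name:

```python
import io

import pandas as pd

# Hypothetical CSV text whose rows all end with a trailing comma.
csv_text = (
    "CountryCode,SeriesCode,DESCRIPTION,\n"
    "ABW,SP.POP.TOTL,Data sources: UN,\n"
    "AFG,SP.POP.GROW,Data sources: UN,\n"
)

raw = pd.read_csv(io.StringIO(csv_text))

# Boolean array marking the auto-named columns, then keep everything else.
unnamed = raw.columns.str.startswith("Unnamed")
clean = raw.loc[:, ~unnamed]

print(raw.columns.tolist())
print(clean.columns.tolist())
```

Filtering by name is only safe once the column has been confirmed empty, as the missing-values table does above.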

### 4 - <u>EdStatsSeries :</u><a id='stats-series'></a>

#### 4.1 - Columns 

In [36]:
# Display columns names
SERIES.columns.tolist()

['Series Code',
 'Topic',
 'Indicator Name',
 'Short definition',
 'Long definition',
 'Unit of measure',
 'Periodicity',
 'Base Period',
 'Other notes',
 'Aggregation method',
 'Limitations and exceptions',
 'Notes from original source',
 'General comments',
 'Source',
 'Statistical concept and methodology',
 'Development relevance',
 'Related source links',
 'Other web links',
 'Related indicators',
 'License Type',
 'Unnamed: 20']

> <u>Notes:</u> The 'Unnamed: 20' column appears redundant and may contain errors or missing values.

#### 4.2 - Preview of the First and Last Rows of the Dataset

In [37]:
print('First 5 rows of the dataset:')
SERIES.head()

First 5 rows of the dataset:


Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15-19 with...,Percentage of female population age 15-19 with...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,Percentage of population age 15-19 with no edu...,Percentage of population age 15-19 with no edu...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 15+ with n...,Percentage of female population age 15+ with n...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...,Percentage of population age 15+ with no educa...,Percentage of population age 15+ with no educa...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,Percentage of female population age 20-24 with...,Percentage of female population age 20-24 with...,,,,,,...,,,Robert J. Barro and Jong-Wha Lee: http://www.b...,,,,,,,


In [38]:
print('Last 5 rows of the dataset:')
SERIES.tail()

Last 5 rows of the dataset:


Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
3660,UIS.XUNIT.USCONST.3.FSGOV,Expenditures,Government expenditure per upper secondary stu...,,"Average total (current, capital and transfers)...",,,,,,...,,,UNESCO Institute for Statistics,,,,,,,
3661,UIS.XUNIT.USCONST.4.FSGOV,Expenditures,Government expenditure per post-secondary non-...,,"Average total (current, capital and transfers)...",,,,,,...,,,UNESCO Institute for Statistics,,,,,,,
3662,UIS.XUNIT.USCONST.56.FSGOV,Expenditures,Government expenditure per tertiary student (c...,,"Average total (current, capital and transfers)...",,,,,,...,,,UNESCO Institute for Statistics,,,,,,,
3663,XGDP.23.FSGOV.FDINSTADM.FFD,Expenditures,Government expenditure in secondary institutio...,"Total general (local, regional and central) go...","Total general (local, regional and central) go...",,,,Secondary,,...,,,UNESCO Institute for Statistics,,,,,,,
3664,XGDP.56.FSGOV.FDINSTADM.FFD,Expenditures,Government expenditure in tertiary institution...,"Total general (local, regional and central) go...","Total general (local, regional and central) go...",,,,Tertiary,,...,,,UNESCO Institute for Statistics,,,,,,,


> <u>Note:</u> We observe that several columns contain empty values.

#### 4.3 - Missing Values & Percentages

In [39]:
print("\nNumber of missing values for each column:")
SERIES.isnull()


Number of missing values for each column:


Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,...,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20
0,False,False,False,False,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
1,False,False,False,False,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
2,False,False,False,False,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
3,False,False,False,False,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
4,False,False,False,False,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3660,False,False,False,True,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
3661,False,False,False,True,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
3662,False,False,False,True,False,True,True,True,True,True,...,True,True,False,True,True,True,True,True,True,True
3663,False,False,False,False,False,True,True,True,False,True,...,True,True,False,True,True,True,True,True,True,True


In [40]:
missing_values_table(SERIES)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Unit of measure,3665,100.0
Notes from original source,3665,100.0
Other web links,3665,100.0
Related indicators,3665,100.0
License Type,3665,100.0
Unnamed: 20,3665,100.0
Development relevance,3662,99.92
Limitations and exceptions,3651,99.62
General comments,3651,99.62
Statistical concept and methodology,3642,99.37


In [41]:
# Display missing data > 80%
missing_values_table(SERIES, 80)

Displaying columns with missing data >= 80%:

Total columns that have missing data >= 80%: 15


Unnamed: 0,Missing Values,Missing Percentages (%)
Unit of measure,3665,100.0
Notes from original source,3665,100.0
Other web links,3665,100.0
Related indicators,3665,100.0
License Type,3665,100.0
Unnamed: 20,3665,100.0
Development relevance,3662,99.92
Limitations and exceptions,3651,99.62
General comments,3651,99.62
Statistical concept and methodology,3642,99.37


> <u>Note :</u> Out of the 21 columns, 15 have at least 80% missing data (the table above shows only the first 10). Notably, 6 of these columns are entirely devoid of data, with a 100% missing rate.
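
To list the fully empty columns by name rather than reading them off the table, one option is `isna().all()`. A minimal sketch on a hypothetical four-column excerpt (the real input would be `SERIES`):

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with two populated and two entirely empty columns.
toy = pd.DataFrame({
    "Series Code": ["BAR.NOED.1519.FE.ZS", "BAR.NOED.1519.ZS"],
    "Topic": ["Attainment", "Attainment"],
    "Unit of measure": [np.nan, np.nan],
    "Unnamed: 20": [np.nan, np.nan],
})

# isna().all() is True only for columns in which every cell is missing.
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)
```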

#### 4.4 - Display duplicate rows

In [42]:
display_duplicate_rows(SERIES)

There are no duplicate rows.


> <u>Note :</u> We can observe that there are no duplicated rows.

#### 4.5 - Dataset Dimensions & Unique Entries Overview

In [43]:
print(f"Number of rows: {SERIES.shape[0]}")
print(f"Number of columns {SERIES.shape[1]}")

Number of rows: 3665
Number of columns 21


> <u>Note :</u> The dataset spans 3665 rows and 21 columns.

#### 4.6 - <u>Notes</u> : 

The dataset spans 3665 rows and 21 columns. Several columns, including 'Unnamed: 20', appear redundant or entirely empty: 15 of the 21 columns have at least 80% missing data, and 6 of them are completely devoid of content.

### 5 - <u>EdStatsFootNote :</u><a id='stats-foot-note'></a>

#### 5.1 - Columns 

In [44]:
# Display columns names
FOOT_NOTE.columns.tolist()

['CountryCode', 'SeriesCode', 'Year', 'DESCRIPTION', 'Unnamed: 4']

> <u>Notes:</u> The 'Unnamed: 4' column appears redundant and may contain errors or missing values.

#### 5.2 - Preview of the First and Last Rows of the Dataset

In [45]:
print('First 5 rows of the dataset:')
FOOT_NOTE.head()

First 5 rows of the dataset:


Unnamed: 0,CountryCode,SeriesCode,Year,DESCRIPTION,Unnamed: 4
0,ABW,SE.PRE.ENRL.FE,YR2001,Country estimation.,
1,ABW,SE.TER.TCHR.FE,YR2005,Country estimation.,
2,ABW,SE.PRE.TCHR.FE,YR2000,Country estimation.,
3,ABW,SE.SEC.ENRL.GC,YR2004,Country estimation.,
4,ABW,SE.PRE.TCHR,YR2006,Country estimation.,


In [46]:
print('Last 5 rows of the dataset:')
FOOT_NOTE.tail()

Last 5 rows of the dataset:


Unnamed: 0,CountryCode,SeriesCode,Year,DESCRIPTION,Unnamed: 4
643633,ZWE,SH.DYN.MORT,YR2007,Uncertainty bound is 91.6 - 109.3,
643634,ZWE,SH.DYN.MORT,YR2014,Uncertainty bound is 54.3 - 76,
643635,ZWE,SH.DYN.MORT,YR2015,Uncertainty bound is 48.3 - 73.3,
643636,ZWE,SH.DYN.MORT,YR2017,5-year average value between 0s and 5s,
643637,ZWE,SP.POP.GROW,YR2017,5-year average value between 0s and 5s,


> <u>Note:</u> We note that the 'Unnamed: 4' column appears to be predominantly empty.

#### 5.3 - Missing Values & Percentages

In [47]:
print("\nNumber of missing values for each column:")
FOOT_NOTE.isnull()


Number of missing values for each column:


Unnamed: 0,CountryCode,SeriesCode,Year,DESCRIPTION,Unnamed: 4
0,False,False,False,False,True
1,False,False,False,False,True
2,False,False,False,False,True
3,False,False,False,False,True
4,False,False,False,False,True
...,...,...,...,...,...
643633,False,False,False,False,True
643634,False,False,False,False,True
643635,False,False,False,False,True
643636,False,False,False,False,True


In [48]:
missing_values_table(FOOT_NOTE)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 4,643638,100.0


> <u>Note :</u> We observe that only the 'Unnamed: 4' column is entirely devoid of data, with a 100% missing rate.

#### 5.4 - Display duplicate rows

In [49]:
display_duplicate_rows(FOOT_NOTE)

There are no duplicate rows.


> <u>Note :</u> We can observe that there are no duplicated rows.

#### 5.5 - Dataset Dimensions & Unique Entries Overview

In [50]:
print(f"Number of rows: {FOOT_NOTE.shape[0]}")
print(f"Number of columns {FOOT_NOTE.shape[1]}")

Number of rows: 643638
Number of columns 5


> <u>Note :</u> The dataset spans 643638 rows and 5 columns.

#### 5.6 - <u>Notes</u> : 

The dataset spans 643,638 rows and 5 columns. According to the missing-values table, 'Unnamed: 4' is the only column with missing data, and it is completely empty (100% missing).

<a id='analysis-note'></a>

### <u>Analysis note:</u>

#### 1. EdStatsData
- **Country Name**: Name of the country.
- **Country Code**: Unique code of the country.
- **Indicator Name**: Name of the indicator.
- **Indicator Code**: Unique code of the indicator.
- **Years**: Years corresponding to each data point.

#### 2. EdStatsCountry
- **Country Code**: Unique code of the country.
- **Table Name**: Country name as per ISO standards.
- **Region**: Geographical region of the country.
- **Income Group**: Country's income category.

#### 3. EdStatsCountry-Series
- **Country Code**: Unique code of the country.
- **Series Code**: Unique code of the indicator.
- **Description**: Data source of the indicator.

#### 4. EdStatsSeries
- **Series Code**: Unique code of the indicator.
- **Topic**: Subject or theme of the indicator.
- **Indicator Name**: Name of the indicator.

#### 5. EdStatsFootNote
Contains similar information to EdStatsCountry-Series (country code, series code, description), plus the year of the indicator and a footnote describing that data point.


#### General Observations
- No duplicate rows were found across all files.
- Several columns have a significant number of missing values, and some are 100% empty. These columns should be removed during the data cleaning phase.


#### Conclusion
Based on this information, the most relevant columns for our Ed-Tech study are: Country Name, Indicator Code, Country Code, Region, Income Group, Series Code, and Years.

## Data Cleaning <a id='data-cleaning'></a>
> Address any issues in the dataset to ensure its quality and reliability. This involves handling missing values, duplicates, and any inconsistencies.

### Relevant Indicators <a id='relevant indicators'></a>

To access the courses offered, potential clients must have a computer and an internet connection. Our target audience is students in upper secondary school and tertiary education, specifically those aged 15 to 24.

Here's a breakdown of the indicators:
- **UIS.E.3**: This represents the total number of students enrolled in both public and private upper secondary education institutions, irrespective of their age.

- **IT.CMP.PCMP.P2**: This indicates the number of personal computers per 100 people. Personal computers are standalone devices intended for individual use.

- **IT.NET.USER.P2**: This represents the number of internet users per 100 people. Internet users are defined as individuals who have accessed the internet (from any location) in the past 3 months. The internet can be accessed via various devices such as computers, mobile phones, PDAs, gaming consoles, digital TVs, etc.

- **SE.TER.ENRL**: This is the total number of students enrolled in both public and private tertiary education institutions.

- **SE.XPD.TOTL.GD.ZS**: This indicates the total expenditure (current, capital, and transfers) by local, regional, and central governments on education, expressed as a percentage of the GDP.

- **SP.POP.1524.TO.UN**: This represents the total population aged between 15 and 24.

In conclusion, we have identified 6 key indicators that will be instrumental for our analysis.


In [51]:
# List of relevant indicators for the analysis
relevant_indicators = [
    'SE.TER.ENRL',        # Total students enrolled in tertiary education
    'UIS.E.3',            # Total students enrolled in upper secondary education
    'IT.NET.USER.P2',     # Internet users per 100 people
    'IT.CMP.PCMP.P2',     # Personal computers per 100 people
    'SE.XPD.TOTL.GD.ZS',  # Government expenditure on education as % of GDP
    'SP.POP.1524.TO.UN'   # Total population aged between 15 and 24
]
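
Before filtering, it can help to confirm that every chosen code exists in the series metadata, since a typo in a code would silently drop that indicator. The sketch below assumes a hypothetical excerpt of the metadata (`series_meta`) in place of the real `SERIES` frame, so three codes are deliberately reported as unknown:

```python
import pandas as pd

relevant_indicators = [
    'SE.TER.ENRL',
    'UIS.E.3',
    'IT.NET.USER.P2',
    'IT.CMP.PCMP.P2',
    'SE.XPD.TOTL.GD.ZS',
    'SP.POP.1524.TO.UN',
]

# Hypothetical excerpt of the metadata; the real frame would be SERIES.
series_meta = pd.DataFrame({
    "Series Code": ["SE.TER.ENRL", "UIS.E.3", "IT.NET.USER.P2"],
})

# Codes requested for the analysis but absent from the metadata.
known_codes = set(series_meta["Series Code"])
missing_codes = [code for code in relevant_indicators if code not in known_codes]
print(missing_codes)
```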

### Filter & Eliminate Unnecessary Columns <a id='filter-&-eliminate-unnecessary-columns'></a>

#### 1 - <u>EdStatsData</u> <a id='cleaning-stats-data'></a>

In [52]:
# Filter the data DataFrame based on the relevant indicators
data = data[data['Indicator Code'].isin(relevant_indicators)]

In [53]:
# Display data
data

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
1204,Arab World,ARB,"Enrolment in tertiary education, all programme...",SE.TER.ENRL,706416.125,733981.25,794759.0,871347.3125,957383.125,1066646.75,...,,,,,,,,,,
1214,Arab World,ARB,"Enrolment in upper secondary education, both s...",UIS.E.3,,,,,,,...,,,,,,,,,,
1260,Arab World,ARB,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,,,,,,,...,,,,,,,,,,
1375,Arab World,ARB,Internet users (per 100 people),IT.NET.USER.P2,,,,,,,...,,,,,,,,,,
2084,Arab World,ARB,Personal computers (per 100 people),IT.CMP.PCMP.P2,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
884479,Zimbabwe,ZWE,"Enrolment in upper secondary education, both s...",UIS.E.3,,,,,,,...,,,,,,,,,,
884525,Zimbabwe,ZWE,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,,,,,,,...,,,,,,,,,,
884640,Zimbabwe,ZWE,Internet users (per 100 people),IT.NET.USER.P2,,,,,,,...,,,,,,,,,,
885349,Zimbabwe,ZWE,Personal computers (per 100 people),IT.CMP.PCMP.P2,,,,,,,...,,,,,,,,,,


In [54]:
# Display columns with missing data >= 70%
missing_values_table(data, 70)

Displaying columns with missing data >= 70%:

Total columns that have missing data >= 70%: 40


Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 69,1452,100.0
2060,1452,100.0
2025,1452,100.0
2030,1452,100.0
2035,1452,100.0
2040,1452,100.0
2045,1452,100.0
2050,1452,100.0
2055,1452,100.0
2065,1452,100.0


In [55]:
# Remove columns with at least 70% missing values
drop_columns(data, 70)

Displaying all columns with missing data:

40 columns dropped because it they had 70% missing values: Unnamed: 69, 2060, 2025, 2030, 2035, 2040, 2045, 2050, 2055, 2065, 2017, 2070, 2075, 2080, 2085, 2090, 2095, 2100, 2020, 1970, 1974, 1972, 1973, 1976, 1977, 1978, 1971, 1975, 1979, 1983, 1987, 1985, 1984, 1980, 1982, 1981, 1986, 2016, 1988, 1989
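
The definition of `drop_columns` is not shown in this section. As a rough illustration only, a threshold-based drop that returns a new frame instead of modifying its input could look like the sketch below; `drop_sparse_columns` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return a copy of df without columns whose missing share is >= threshold (%)."""
    missing_percent = df.isna().mean() * 100
    to_drop = missing_percent[missing_percent >= threshold].index
    return df.drop(columns=to_drop)


# Hypothetical toy frame standing in for the filtered `data`.
toy = pd.DataFrame({
    "Indicator Code": ["SE.TER.ENRL", "UIS.E.3", "IT.NET.USER.P2", "IT.CMP.PCMP.P2"],
    "1970": [706416.125, np.nan, np.nan, np.nan],  # 75% missing
    "2010": [1.0, 2.0, np.nan, 4.0],               # 25% missing
    "Unnamed: 69": [np.nan] * 4,                   # 100% missing
})

cleaned = drop_sparse_columns(toy, 70)
print(cleaned.columns.tolist())
```

Returning a new frame avoids mutating a filtered slice of the original data, which is where pandas may otherwise raise a `SettingWithCopyWarning`.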


In [56]:
# Display the remaining columns that still have missing data
missing_values_table(data)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
1991,943,64.94
1992,921,63.43
1993,894,61.57
2015,880,60.61
1994,835,57.51
1990,762,52.48
1995,750,51.65
1997,742,51.1
1996,711,48.97
2014,667,45.94


In [57]:
print(title("After Filtering Out Irrelevant Data:"))
print(f"- Number of rows: {data.shape[0]}")
print(f"- Number of columns: {data.shape[1]}")

After Filtering Out Irrelevant Data:

- Number of rows: 1452
- Number of columns: 30
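
With six indicators and 1452 rows, the filtered table is consistent with 242 entities carrying six rows each (1452 / 6 = 242), although the notebook does not verify this. A minimal sketch of such a check, on a hypothetical toy frame rather than the real `data`:

```python
import pandas as pd

# Hypothetical toy frame standing in for the filtered `data`.
toy = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "ARB", "ZWE", "ZWE"],
    "Indicator Code": [
        "SE.TER.ENRL", "UIS.E.3", "IT.NET.USER.P2",
        "SE.TER.ENRL", "UIS.E.3",
    ],
})
expected = 3  # indicators every entity should carry in this toy case

# Count distinct indicators per entity and list the entities that fall short.
per_entity = toy.groupby("Country Code")["Indicator Code"].nunique()
incomplete = per_entity[per_entity < expected].index.tolist()
print(incomplete)
```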


#### 2 - <u>EdStatsCountry</u> <a id='cleaning-stats-country'></a>

In [58]:
# Display columns names
COUNTRY.columns.tolist()

['Country Code',
 'Short Name',
 'Table Name',
 'Long Name',
 '2-alpha code',
 'Currency Unit',
 'Special Notes',
 'Region',
 'Income Group',
 'WB-2 code',
 'National accounts base year',
 'National accounts reference year',
 'SNA price valuation',
 'Lending category',
 'Other groups',
 'System of National Accounts',
 'Alternative conversion factor',
 'PPP survey year',
 'Balance of Payments Manual in use',
 'External debt Reporting status',
 'System of trade',
 'Government Accounting concept',
 'IMF data dissemination standard',
 'Latest population census',
 'Latest household survey',
 'Source of most recent Income and expenditure data',
 'Vital registration complete',
 'Latest agricultural census',
 'Latest industrial data',
 'Latest trade data',
 'Latest water withdrawal data',
 'Unnamed: 31']

In [59]:
# Keep only relevant columns
country = country[['Country Code', 'Short Name', 'Table Name', 'Long Name']]

In [60]:
# Display table of 'country'
country

Unnamed: 0,Country Code,Short Name,Table Name,Long Name
0,ABW,Aruba,Aruba,Aruba
1,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan
2,AGO,Angola,Angola,People's Republic of Angola
3,ALB,Albania,Albania,Republic of Albania
4,AND,Andorra,Andorra,Principality of Andorra
...,...,...,...,...
236,XKX,Kosovo,Kosovo,Republic of Kosovo
237,YEM,Yemen,"Yemen, Rep.",Republic of Yemen
238,ZAF,South Africa,South Africa,Republic of South Africa
239,ZMB,Zambia,Zambia,Republic of Zambia


In [61]:
missing_values_table(country)

Displaying all columns with missing data:

There are no columns with the specified percentage of missing data.


In [62]:
print(title("After Filtering Out Irrelevant Data:"))
print(f"- Number of rows: {country.shape[0]}")
print(f"- Number of columns: {country.shape[1]}")

After Filtering Out Irrelevant Data:

- Number of rows: 241
- Number of columns: 4


In [63]:
country_series

Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION,Unnamed: 3
0,ABW,SP.POP.TOTL,Data sources : United Nations World Population...,
1,ABW,SP.POP.GROW,Data sources: United Nations World Population ...,
2,AFG,SP.POP.GROW,Data sources: United Nations World Population ...,
3,AFG,NY.GDP.PCAP.PP.CD,Estimates are based on regression.,
4,AFG,SP.POP.TOTL,Data sources : United Nations World Population...,
...,...,...,...,...
608,ZAF,SP.POP.GROW,"Data sources : Statistics South Africa, United...",
609,ZMB,SP.POP.GROW,Data sources: United Nations World Population ...,
610,ZMB,SP.POP.TOTL,Data sources : United Nations World Population...,
611,ZWE,SP.POP.TOTL,Data sources : United Nations World Population...,


#### 3 - <u>EdStatsCountrySeries</u> <a id='cleaning-stats-country-series'></a>

In [64]:
# Display columns names
country_series.columns.tolist()

['CountryCode', 'SeriesCode', 'DESCRIPTION', 'Unnamed: 3']

In [65]:
missing_values_table(country_series)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 3,613,100.0


In [66]:
# Drop columns that are 100% missing
drop_columns(country_series, 100)

Displaying all columns with missing data:

1 column dropped because it had 100% missing values: Unnamed: 3


In [67]:
# Display the table after dropping the empty column
country_series

Unnamed: 0,CountryCode,SeriesCode,DESCRIPTION
0,ABW,SP.POP.TOTL,Data sources : United Nations World Population...
1,ABW,SP.POP.GROW,Data sources: United Nations World Population ...
2,AFG,SP.POP.GROW,Data sources: United Nations World Population ...
3,AFG,NY.GDP.PCAP.PP.CD,Estimates are based on regression.
4,AFG,SP.POP.TOTL,Data sources : United Nations World Population...
...,...,...,...
608,ZAF,SP.POP.GROW,"Data sources : Statistics South Africa, United..."
609,ZMB,SP.POP.GROW,Data sources: United Nations World Population ...
610,ZMB,SP.POP.TOTL,Data sources : United Nations World Population...
611,ZWE,SP.POP.TOTL,Data sources : United Nations World Population...


#### 4 - <u>EdStatsSeries</u> <a id='cleaning-stats-series'></a>

In [68]:
# Display columns names
series.columns.tolist()

['Series Code',
 'Topic',
 'Indicator Name',
 'Short definition',
 'Long definition',
 'Unit of measure',
 'Periodicity',
 'Base Period',
 'Other notes',
 'Aggregation method',
 'Limitations and exceptions',
 'Notes from original source',
 'General comments',
 'Source',
 'Statistical concept and methodology',
 'Development relevance',
 'Related source links',
 'Other web links',
 'Related indicators',
 'License Type',
 'Unnamed: 20']

In [69]:
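# Keep the identifying columns plus four metadata columns
# (the missing-values table below shows these four are entirely empty)
series = series[[
    'Series Code', 'Topic', 'Indicator Name',
    'Notes from original source', 'Other web links',
    'Related indicators', 'License Type',
]]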
series

Unnamed: 0,Series Code,Topic,Indicator Name,Notes from original source,Other web links,Related indicators,License Type
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,,,,
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...,,,,
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,,,,
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...,,,,
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...,,,,
...,...,...,...,...,...,...,...
3660,UIS.XUNIT.USCONST.3.FSGOV,Expenditures,Government expenditure per upper secondary stu...,,,,
3661,UIS.XUNIT.USCONST.4.FSGOV,Expenditures,Government expenditure per post-secondary non-...,,,,
3662,UIS.XUNIT.USCONST.56.FSGOV,Expenditures,Government expenditure per tertiary student (c...,,,,
3663,XGDP.23.FSGOV.FDINSTADM.FFD,Expenditures,Government expenditure in secondary institutio...,,,,


In [70]:
missing_values_table(series)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Notes from original source,3665,100.0
Other web links,3665,100.0
Related indicators,3665,100.0
License Type,3665,100.0


In [71]:
# Drop columns that are 100% missing
drop_columns(series, 100)

Displaying all columns with missing data:

4 columns dropped because it they had 100% missing values: Notes from original source, Other web links, Related indicators, License Type


In [72]:
# Display the table after dropping the empty columns
series

Unnamed: 0,Series Code,Topic,Indicator Name
0,BAR.NOED.1519.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...
1,BAR.NOED.1519.ZS,Attainment,Barro-Lee: Percentage of population age 15-19 ...
2,BAR.NOED.15UP.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...
3,BAR.NOED.15UP.ZS,Attainment,Barro-Lee: Percentage of population age 15+ wi...
4,BAR.NOED.2024.FE.ZS,Attainment,Barro-Lee: Percentage of female population age...
...,...,...,...
3660,UIS.XUNIT.USCONST.3.FSGOV,Expenditures,Government expenditure per upper secondary stu...
3661,UIS.XUNIT.USCONST.4.FSGOV,Expenditures,Government expenditure per post-secondary non-...
3662,UIS.XUNIT.USCONST.56.FSGOV,Expenditures,Government expenditure per tertiary student (c...
3663,XGDP.23.FSGOV.FDINSTADM.FFD,Expenditures,Government expenditure in secondary institutio...


#### 5 - <u>EdStatsFootNote</u> <a id='cleaning-stats-foot-note'></a>

In [73]:
# Display columns names
foot_note.columns.tolist()

['CountryCode', 'SeriesCode', 'Year', 'DESCRIPTION', 'Unnamed: 4']

In [74]:
missing_values_table(foot_note)

Displaying all columns with missing data:



Unnamed: 0,Missing Values,Missing Percentages (%)
Unnamed: 4,643638,100.0


In [75]:
# Drop columns that are 100% missing
drop_columns(foot_note, 100)

Displaying all columns with missing data:

1 column dropped because it had 100% missing values: Unnamed: 4


In [76]:
# Display the table after dropping the empty column
foot_note

Unnamed: 0,CountryCode,SeriesCode,Year,DESCRIPTION
0,ABW,SE.PRE.ENRL.FE,YR2001,Country estimation.
1,ABW,SE.TER.TCHR.FE,YR2005,Country estimation.
2,ABW,SE.PRE.TCHR.FE,YR2000,Country estimation.
3,ABW,SE.SEC.ENRL.GC,YR2004,Country estimation.
4,ABW,SE.PRE.TCHR,YR2006,Country estimation.
...,...,...,...,...
643633,ZWE,SH.DYN.MORT,YR2007,Uncertainty bound is 91.6 - 109.3
643634,ZWE,SH.DYN.MORT,YR2014,Uncertainty bound is 54.3 - 76
643635,ZWE,SH.DYN.MORT,YR2015,Uncertainty bound is 48.3 - 73.3
643636,ZWE,SH.DYN.MORT,YR2017,5-year average value between 0s and 5s
