### UNICEF Global database on school-age digital connectivity

<b>Definition:</b>	School-age digital connectivity data set – Percentage of children of school-attendance age (approximately 3-17 years old, depending on the country) who have an internet connection at home

<b>Unit of measure:</b>	Percentage

<b>Time frame for survey:</b>	Household survey data from 2010 onwards are used to calculate the indicator. For countries with multiple years of data, the most recent dataset is used.

<b>Data Definitions</b>

<b>ISO:</b>	Three-letter alphabetical codes from International Standard ISO 3166-1, assigned by the International Organization for Standardization (ISO). The latest version is available online at http://www.iso.org/iso/home/standards/country_codes.htm. (column A)

<b>Countries and areas:</b>	The UNICEF Global databases contain a set of 202 countries, as reported in the State of the World's Children Statistical Annex 2017 (column B)
	

<b>Region, Sub-region:</b>	UNICEF regions (column C) and UNICEF Sub-regions (column D)
EAP	East Asia and the Pacific
ECA	Europe and Central Asia
EECA	Eastern Europe and Central Asia
ESA	Eastern and Southern Africa
LAC	Latin America and the Caribbean
MENA	Middle East and North Africa
NA	North America
SA	South Asia
SSA	Sub-Saharan Africa
WCA	West and Central Africa

<b>Development regions:</b>	Economies are currently divided into four income groupings: low, lower-middle, upper-middle, and high. Income is measured using gross national income (GNI) per capita, in U.S. dollars, converted from local currency using the World Bank Atlas method (column E).
	
	
Disclaimer	
All reasonable precautions have been taken to verify the information in this database. In no event shall UNICEF be liable for damages arising from its use or interpretation	


In [45]:
# import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Exploratory Data Analysis (EDA)

In [46]:
# load the dataset
df = pd.read_csv('../education_dataset/Digital_Connectivity.csv')

In [47]:
# display the first 5 rows
df.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,DZA,Algeria,MENA,MENA,Upper middle income (UM),0.24,0.09,0.32,0.01,0.77,...,0.25,0.1,0.33,0.01,0.78,,,,,
1,AGO,Angola,SSA,ESA,Lower middle income (LM),0.17,0.02,0.24,0.0,0.62,...,0.19,0.02,0.27,0.0,0.62,0.24,0.02,0.33,0.0,0.69
2,ARG,Argentina,LAC,LAC,Upper middle income (UM),0.4,,,,,...,0.43,,,,,0.45,0.0,0.0,0.0,0.0
3,ARM,Armenia,ECA,EECA,Upper middle income (UM),0.81,0.71,0.88,0.47,0.99,...,0.81,0.71,0.89,0.44,0.99,0.85,0.78,0.91,0.54,1.0
4,BGD,Bangladesh,SA,SA,Lower middle income (LM),0.37,0.33,0.52,0.09,0.76,...,0.34,0.31,0.48,0.08,0.73,0.42,0.38,0.57,0.13,0.79


The percentages are stored as decimal fractions, which makes them incompatible with the other data frames. Multiplying by 100 converts them to whole-number percentages.
Note that the values are still percentages, just without the <b>%</b> symbol.

In [48]:
# first copy the dataframe 
dc = df.copy()

dc.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,DZA,Algeria,MENA,MENA,Upper middle income (UM),0.24,0.09,0.32,0.01,0.77,...,0.25,0.1,0.33,0.01,0.78,,,,,
1,AGO,Angola,SSA,ESA,Lower middle income (LM),0.17,0.02,0.24,0.0,0.62,...,0.19,0.02,0.27,0.0,0.62,0.24,0.02,0.33,0.0,0.69
2,ARG,Argentina,LAC,LAC,Upper middle income (UM),0.4,,,,,...,0.43,,,,,0.45,0.0,0.0,0.0,0.0
3,ARM,Armenia,ECA,EECA,Upper middle income (UM),0.81,0.71,0.88,0.47,0.99,...,0.81,0.71,0.89,0.44,0.99,0.85,0.78,0.91,0.54,1.0
4,BGD,Bangladesh,SA,SA,Lower middle income (LM),0.37,0.33,0.52,0.09,0.76,...,0.34,0.31,0.48,0.08,0.73,0.42,0.38,0.57,0.13,0.79


In [49]:
# multiply the numeric (percentage) variables by 100
levels = ['School_Age', 'Pre_Primary', 'Primary', 'Lower_Secondary', 'Upper_Secondary']
breakdowns = ['Total', 'Rural', 'Urban', 'Poorest', 'Richest']
# build the 25 column names, e.g. 'School_Age_Digital_Connectivity_Total'
pct_cols = [f'{level}_Digital_Connectivity_{breakdown}' for level in levels for breakdown in breakdowns]
dc[pct_cols] = dc[pct_cols] * 100
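
After scaling, a quick range check can confirm that every value is a valid percentage. The following is a minimal sketch on a toy frame, not part of the original analysis: the column names are borrowed from the dataset, but the values and the helper names (`toy`, `scaled`, `all_in_range`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy fractions standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    'School_Age_Digital_Connectivity_Total': [0.24, 0.17, 0.40],
    'School_Age_Digital_Connectivity_Rural': [0.09, 0.02, np.nan],
})

# Convert fractions to percentages; NaN stays NaN
scaled = toy * 100

# Every non-missing value should lie between 0 and 100
numeric = scaled.select_dtypes(include='number')
in_range = ((numeric >= 0) & (numeric <= 100)) | numeric.isna()
all_in_range = bool(in_range.all().all())
print(all_in_range)
```

Missing values are deliberately treated as passing the check here, since they are handled later by imputation.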

In [50]:
# check the first 5 rows
dc.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,DZA,Algeria,MENA,MENA,Upper middle income (UM),24.0,9.0,32.0,1.0,77.0,...,25.0,10.0,33.0,1.0,78.0,,,,,
1,AGO,Angola,SSA,ESA,Lower middle income (LM),17.0,2.0,24.0,0.0,62.0,...,19.0,2.0,27.0,0.0,62.0,24.0,2.0,33.0,0.0,69.0
2,ARG,Argentina,LAC,LAC,Upper middle income (UM),40.0,,,,,...,43.0,,,,,45.0,0.0,0.0,0.0,0.0
3,ARM,Armenia,ECA,EECA,Upper middle income (UM),81.0,71.0,88.0,47.0,99.0,...,81.0,71.0,89.0,44.0,99.0,85.0,78.0,91.0,54.0,100.0
4,BGD,Bangladesh,SA,SA,Lower middle income (LM),37.0,33.0,52.0,9.0,76.0,...,34.0,31.0,48.0,8.0,73.0,42.0,38.0,57.0,13.0,79.0


In [51]:
# check the data types 
dc.dtypes

ISO3                                             object
Countries and areas                              object
Region                                           object
Sub-region                                       object
Income Group                                     object
School_Age_Digital_Connectivity_Total           float64
School_Age_Digital_Connectivity_Rural           float64
School_Age_Digital_Connectivity_Urban           float64
School_Age_Digital_Connectivity_Poorest         float64
School_Age_Digital_Connectivity_Richest         float64
Pre_Primary_Digital_Connectivity_Total          float64
Pre_Primary_Digital_Connectivity_Rural          float64
Pre_Primary_Digital_Connectivity_Urban          float64
Pre_Primary_Digital_Connectivity_Poorest        float64
Pre_Primary_Digital_Connectivity_Richest        float64
Primary_Digital_Connectivity_Total              float64
Primary_Digital_Connectivity_Rural              float64
Primary_Digital_Connectivity_Urban              

In [52]:
# check the no. of missing values in each column
dc.isna().sum()

ISO3                                             0
Countries and areas                              0
Region                                           0
Sub-region                                       0
Income Group                                     0
School_Age_Digital_Connectivity_Total            0
School_Age_Digital_Connectivity_Rural           10
School_Age_Digital_Connectivity_Urban            7
School_Age_Digital_Connectivity_Poorest         17
School_Age_Digital_Connectivity_Richest         18
Pre_Primary_Digital_Connectivity_Total           0
Pre_Primary_Digital_Connectivity_Rural          10
Pre_Primary_Digital_Connectivity_Urban           7
Pre_Primary_Digital_Connectivity_Poorest        17
Pre_Primary_Digital_Connectivity_Richest        19
Primary_Digital_Connectivity_Total               1
Primary_Digital_Connectivity_Rural               1
Primary_Digital_Connectivity_Urban               1
Primary_Digital_Connectivity_Poorest             1
Primary_Digital_Connectivity_Ri
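
Raw counts are easier to compare when expressed as a share of rows. A minimal sketch, assuming a toy frame with shortened, hypothetical column names, of how the percentage of missing values per column could be computed:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names and illustrative values
toy = pd.DataFrame({
    'Total': [24.0, 17.0, 40.0, 81.0],
    'Rural': [9.0, 2.0, np.nan, 71.0],
    'Poorest': [1.0, np.nan, np.nan, 47.0],
})

# isna() gives booleans; their mean is the fraction of missing rows
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```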

In [53]:
# summarize the columns, non-null counts and data types
dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 30 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   ISO3                                          87 non-null     object 
 1   Countries and areas                           87 non-null     object 
 2   Region                                        87 non-null     object 
 3   Sub-region                                    87 non-null     object 
 4   Income Group                                  87 non-null     object 
 5   School_Age_Digital_Connectivity_Total         87 non-null     float64
 6   School_Age_Digital_Connectivity_Rural         77 non-null     float64
 7   School_Age_Digital_Connectivity_Urban         80 non-null     float64
 8   School_Age_Digital_Connectivity_Poorest       70 non-null     float64
 9   School_Age_Digital_Connectivity_Richest       69 non-null     float

In [54]:
# describe statistics
dc.describe()

Unnamed: 0,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,Pre_Primary_Digital_Connectivity_Total,Pre_Primary_Digital_Connectivity_Rural,Pre_Primary_Digital_Connectivity_Urban,Pre_Primary_Digital_Connectivity_Poorest,Pre_Primary_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
count,87.0,77.0,80.0,70.0,69.0,87.0,77.0,80.0,70.0,68.0,...,87.0,77.0,80.0,70.0,68.0,81.0,81.0,81.0,81.0,81.0
mean,35.632184,27.077922,43.9375,18.457143,62.565217,33.505747,25.558442,41.775,17.8,60.823529,...,36.183908,27.298701,44.6875,18.8,62.720588,39.864198,27.444444,44.024691,17.493827,53.962963
std,29.539928,28.396055,29.634629,25.887541,34.749726,29.145534,27.886751,29.584881,26.047115,35.03173,...,29.661473,28.588305,29.516609,25.893091,34.720997,31.436346,30.344275,32.634711,26.396554,39.656476
min,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,8.0,2.0,17.0,0.0,32.0,6.0,2.0,14.75,0.0,26.75,...,8.0,3.0,18.5,0.0,34.0,9.0,1.0,12.0,0.0,12.0
50%,31.0,13.0,44.0,4.0,76.0,27.0,12.0,39.5,3.0,72.5,...,31.0,13.0,46.0,4.0,74.0,38.0,13.0,42.0,2.0,64.0
75%,60.5,48.0,69.5,30.25,94.0,51.0,39.0,63.5,23.75,94.0,...,63.0,50.0,68.0,27.25,94.0,72.0,51.0,77.0,25.0,94.0
max,99.0,91.0,97.0,89.0,100.0,99.0,91.0,97.0,89.0,100.0,...,99.0,91.0,96.0,88.0,100.0,99.0,92.0,97.0,90.0,100.0


As we can see, there are missing values in the dataset. As discussed in the meeting, we will impute the missing data 
according to the economic status of each country. However, there is no 'Development Regions' column; there is an 'Income Group' column instead.
We will use this column as the economic characteristic that defines the strata for stratified imputation.

##### Definition: Stratified imputation involves dividing the data into strata (e.g., by region, income group, or development status) and imputing missing values within each stratum separately. This approach helps account for differences in connectivity rates across different groups and reduces bias.
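
To make the definition concrete, here is a minimal sketch of stratified median imputation on a toy frame. It is one possible implementation, not the one used in this notebook: the helper `impute_by_stratum`, the column `connectivity` and the shortened group labels are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: two strata, each with one missing value (illustrative only)
toy = pd.DataFrame({
    'Income Group': ['Low', 'Low', 'Low', 'High', 'High', 'High'],
    'connectivity': [2.0, np.nan, 6.0, 80.0, 90.0, np.nan],
})

def impute_by_stratum(frame, stratum_col, value_cols):
    """Fill NaN in value_cols with the median of the row's own stratum."""
    out = frame.copy()
    # transform() returns per-row medians aligned with the original index
    medians = out.groupby(stratum_col)[value_cols].transform('median')
    out[value_cols] = out[value_cols].fillna(medians)
    return out

imputed = impute_by_stratum(toy, 'Income Group', ['connectivity'])
print(imputed)
```

In this toy case the missing 'Low' value becomes 4.0 (the median of 2 and 6) and the missing 'High' value becomes 85.0 (the median of 80 and 90), so neither group's estimate is pulled toward the other.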

In [55]:
# identify the distinct income groups in the dataset 
print(dc['Income Group'].unique())

['Upper middle income (UM)' 'Lower middle income (LM)' 'High income (H)'
 'Low income (L)']


In [56]:
# group the countries based on their incomes
low_income_countries = dc[dc['Income Group'] == 'Low income (L)'].reset_index(drop = True)
lower_middle_income_countries = dc[dc['Income Group'] == 'Lower middle income (LM)'].reset_index(drop = True)
upper_middle_income_countries = dc[dc['Income Group'] == 'Upper middle income (UM)'].reset_index(drop = True)
high_income_countries = dc[dc['Income Group'] == 'High income (H)'].reset_index(drop = True)
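
As an alternative to one variable per group, the strata could be collected in a dictionary keyed by income group. This is a sketch on a toy frame with placeholder country names and illustrative values; the names `strata` and `sizes` are assumptions.

```python
import pandas as pd

# Toy frame reusing the dataset's column names; rows are illustrative
toy = pd.DataFrame({
    'Countries and areas': ['A', 'B', 'C', 'D'],
    'Income Group': ['Low income (L)', 'High income (H)',
                     'Low income (L)', 'Upper middle income (UM)'],
    'School_Age_Digital_Connectivity_Total': [4.0, 86.0, 1.0, 49.0],
})

# One data frame per income group, each with a fresh 0-based index
strata = {
    name: group.reset_index(drop=True)
    for name, group in toy.groupby('Income Group')
}

# Number of countries in each stratum
sizes = {name: len(group) for name, group in strata.items()}
print(sizes)
```

A dictionary makes it possible to loop over the strata rather than repeating the same steps for each income group.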

In [57]:
# check the strata
low_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,BEN,Benin,SSA,WCA,Low income (L),4.0,1.0,7.0,0.0,16.0,...,4.0,2.0,7.0,0.0,16.0,5.0,2.0,10.0,0.0,21.0
1,BFA,Burkina Faso,SSA,WCA,Low income (L),1.0,1.0,4.0,1.0,5.0,...,2.0,1.0,5.0,1.0,6.0,2.0,0.0,5.0,0.0,6.0
2,CAF,Central African Republic,SSA,WCA,Low income (L),4.0,1.0,9.0,0.0,16.0,...,5.0,1.0,11.0,0.0,18.0,6.0,1.0,12.0,0.0,19.0
3,TCD,Chad,SSA,WCA,Low income (L),2.0,1.0,8.0,0.0,9.0,...,2.0,1.0,8.0,0.0,9.0,3.0,1.0,12.0,0.0,12.0
4,COD,Democratic Republic of the Congo,SSA,WCA,Low income (L),1.0,0.0,2.0,0.0,3.0,...,1.0,0.0,2.0,0.0,2.0,1.0,0.0,3.0,0.0,3.0


In [58]:
lower_middle_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,AGO,Angola,SSA,ESA,Lower middle income (LM),17.0,2.0,24.0,0.0,62.0,...,19.0,2.0,27.0,0.0,62.0,24.0,2.0,33.0,0.0,69.0
1,BGD,Bangladesh,SA,SA,Lower middle income (LM),37.0,33.0,52.0,9.0,76.0,...,34.0,31.0,48.0,8.0,73.0,42.0,38.0,57.0,13.0,79.0
2,BOL,Bolivia (Plurinational State of),LAC,LAC,Lower middle income (LM),12.0,4.0,17.0,,,...,12.0,3.0,17.0,,,,,,,
3,CMR,Cameroon,SSA,WCA,Lower middle income (LM),5.0,0.0,10.0,0.0,24.0,...,5.0,0.0,10.0,0.0,23.0,7.0,0.0,12.0,0.0,28.0
4,CIV,Côte d'Ivoire,SSA,WCA,Lower middle income (LM),3.0,1.0,5.0,0.0,14.0,...,4.0,1.0,7.0,0.0,16.0,3.0,2.0,5.0,0.0,12.0


In [59]:
upper_middle_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,DZA,Algeria,MENA,MENA,Upper middle income (UM),24.0,9.0,32.0,1.0,77.0,...,25.0,10.0,33.0,1.0,78.0,,,,,
1,ARG,Argentina,LAC,LAC,Upper middle income (UM),40.0,,,,,...,43.0,,,,,45.0,0.0,0.0,0.0,0.0
2,ARM,Armenia,ECA,EECA,Upper middle income (UM),81.0,71.0,88.0,47.0,99.0,...,81.0,71.0,89.0,44.0,99.0,85.0,78.0,91.0,54.0,100.0
3,BIH,Bosnia and Herzegovina,ECA,EECA,Upper middle income (UM),59.0,51.0,76.0,8.0,96.0,...,61.0,52.0,79.0,11.0,95.0,70.0,65.0,81.0,13.0,98.0
4,BRA,Brazil,LAC,LAC,Upper middle income (UM),83.0,51.0,89.0,84.0,97.0,...,82.0,50.0,89.0,84.0,97.0,84.0,55.0,90.0,84.0,96.0


In [60]:
high_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,BRB,Barbados,LAC,LAC,High income (H),66.0,61.0,69.0,10.0,98.0,...,68.0,66.0,69.0,20.0,100.0,76.0,76.0,76.0,4.0,100.0
1,CHL,Chile,LAC,LAC,High income (H),86.0,70.0,89.0,73.0,99.0,...,87.0,71.0,90.0,75.0,99.0,87.0,72.0,90.0,75.0,98.0
2,JPN,Japan,EAP,EAP,High income (H),78.0,83.0,77.0,62.0,87.0,...,78.0,77.0,79.0,64.0,88.0,88.0,88.0,87.0,55.0,94.0
3,PAN,Panama,LAC,LAC,High income (H),31.0,10.0,44.0,4.0,94.0,...,33.0,11.0,48.0,4.0,92.0,38.0,15.0,51.0,8.0,97.0
4,TTO,Trinidad and Tobago,LAC,LAC,High income (H),44.0,34.0,52.0,3.0,96.0,...,47.0,38.0,54.0,4.0,93.0,51.0,41.0,60.0,7.0,97.0


In [61]:
# identify the no. of countries in each stratum
print('No. of low income countries: ', low_income_countries.shape[0])
print('No. of lower middle income countries: ', lower_middle_income_countries.shape[0])
print('No. of upper middle income countries: ', upper_middle_income_countries.shape[0])
print('No. of high income countries: ', high_income_countries.shape[0])
print("Total: ", low_income_countries.shape[0]+lower_middle_income_countries.shape[0]+upper_middle_income_countries.shape[0]+high_income_countries.shape[0])

No. of low income countries:  18
No. of lower middle income countries:  30
No. of upper middle income countries:  32
No. of high income countries:  7
Total:  87


Among the available imputation methods, I chose the median to fill the null values (NaN). Replacing null values with the mean or median of a column is a common imputation approach. "The technique, in this instance, replaces the null values with mean, rounded mean, or median values determined for that feature across the whole dataset (in our case, in each group). It is advised to utilize the median rather than the mean when your dataset has a significant number of outliers (Simplilearn, 2023)."

<b>Ref:</b> https://www.simplilearn.com/data-imputation-article
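
The point about outliers can be illustrated with a small, made-up series. This is a sketch with invented numbers, not values from the dataset:

```python
import pandas as pd

# Four small values and one outlier (illustrative numbers only)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

mean_value = values.mean()      # pulled upward by the outlier
median_value = values.median()  # middle value, unaffected by the outlier

print('mean:', mean_value)
print('median:', median_value)
```

Here the mean is 22.0 while the median is 3.0, so a mean-imputed value would sit far from most observations in the group.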

In [62]:
# identify the no. of missing values per column in the low_income_countries data frame 
low_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           2
School_Age_Digital_Connectivity_Urban           2
School_Age_Digital_Connectivity_Poorest         2
School_Age_Digital_Connectivity_Richest         2
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          2
Pre_Primary_Digital_Connectivity_Urban          2
Pre_Primary_Digital_Connectivity_Poorest        2
Pre_Primary_Digital_Connectivity_Richest        2
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


In [63]:
# identify the median of each column in the low_income_countries data frame
low_income_countries.median(numeric_only=True)

School_Age_Digital_Connectivity_Total            4.5
School_Age_Digital_Connectivity_Rural            2.5
School_Age_Digital_Connectivity_Urban           11.5
School_Age_Digital_Connectivity_Poorest          0.0
School_Age_Digital_Connectivity_Richest         19.5
Pre_Primary_Digital_Connectivity_Total           3.5
Pre_Primary_Digital_Connectivity_Rural           2.5
Pre_Primary_Digital_Connectivity_Urban           9.5
Pre_Primary_Digital_Connectivity_Poorest         0.0
Pre_Primary_Digital_Connectivity_Richest        19.0
Primary_Digital_Connectivity_Total               4.0
Primary_Digital_Connectivity_Rural               1.0
Primary_Digital_Connectivity_Urban               8.0
Primary_Digital_Connectivity_Poorest             0.0
Primary_Digital_Connectivity_Richest            14.5
Lower_Secondary_Digital_Connectivity_Total       6.0
Lower_Secondary_Digital_Connectivity_Rural       3.0
Lower_Secondary_Digital_Connectivity_Urban      14.0
Lower_Secondary_Digital_Connectivity_Poorest  

In [64]:
# median imputation 
low_income_countries = low_income_countries.fillna(low_income_countries.median(numeric_only=True))

In [65]:
# check the imputation
low_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,BEN,Benin,SSA,WCA,Low income (L),4.0,1.0,7.0,0.0,16.0,...,4.0,2.0,7.0,0.0,16.0,5.0,2.0,10.0,0.0,21.0
1,BFA,Burkina Faso,SSA,WCA,Low income (L),1.0,1.0,4.0,1.0,5.0,...,2.0,1.0,5.0,1.0,6.0,2.0,0.0,5.0,0.0,6.0
2,CAF,Central African Republic,SSA,WCA,Low income (L),4.0,1.0,9.0,0.0,16.0,...,5.0,1.0,11.0,0.0,18.0,6.0,1.0,12.0,0.0,19.0
3,TCD,Chad,SSA,WCA,Low income (L),2.0,1.0,8.0,0.0,9.0,...,2.0,1.0,8.0,0.0,9.0,3.0,1.0,12.0,0.0,12.0
4,COD,Democratic Republic of the Congo,SSA,WCA,Low income (L),1.0,0.0,2.0,0.0,3.0,...,1.0,0.0,2.0,0.0,2.0,1.0,0.0,3.0,0.0,3.0


In [66]:
# identify the no. of missing values per column in the low_income_countries data frame again
low_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           0
School_Age_Digital_Connectivity_Urban           0
School_Age_Digital_Connectivity_Poorest         0
School_Age_Digital_Connectivity_Richest         0
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          0
Pre_Primary_Digital_Connectivity_Urban          0
Pre_Primary_Digital_Connectivity_Poorest        0
Pre_Primary_Digital_Connectivity_Richest        0
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


In [67]:
# identify the no. of missing values per column in the lower_middle_income_countries data frame 
lower_middle_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           5
School_Age_Digital_Connectivity_Urban           2
School_Age_Digital_Connectivity_Poorest         8
School_Age_Digital_Connectivity_Richest         8
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          5
Pre_Primary_Digital_Connectivity_Urban          2
Pre_Primary_Digital_Connectivity_Poorest        8
Pre_Primary_Digital_Connectivity_Richest        8
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


In [68]:
# identify the median of each column in the lower_middle_income_countries data frame
lower_middle_income_countries.median(numeric_only=True)

School_Age_Digital_Connectivity_Total           17.0
School_Age_Digital_Connectivity_Rural            9.0
School_Age_Digital_Connectivity_Urban           25.0
School_Age_Digital_Connectivity_Poorest          1.0
School_Age_Digital_Connectivity_Richest         48.5
Pre_Primary_Digital_Connectivity_Total          15.0
Pre_Primary_Digital_Connectivity_Rural           8.0
Pre_Primary_Digital_Connectivity_Urban          23.0
Pre_Primary_Digital_Connectivity_Poorest         1.5
Pre_Primary_Digital_Connectivity_Richest        48.5
Primary_Digital_Connectivity_Total              16.0
Primary_Digital_Connectivity_Rural               4.0
Primary_Digital_Connectivity_Urban              22.0
Primary_Digital_Connectivity_Poorest             0.0
Primary_Digital_Connectivity_Richest            30.0
Lower_Secondary_Digital_Connectivity_Total      18.0
Lower_Secondary_Digital_Connectivity_Rural       9.0
Lower_Secondary_Digital_Connectivity_Urban      26.5
Lower_Secondary_Digital_Connectivity_Poorest  

In [69]:
# median imputation 
lower_middle_income_countries = lower_middle_income_countries.fillna(lower_middle_income_countries.median(numeric_only=True))

In [70]:
# check the imputation
lower_middle_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,AGO,Angola,SSA,ESA,Lower middle income (LM),17.0,2.0,24.0,0.0,62.0,...,19.0,2.0,27.0,0.0,62.0,24.0,2.0,33.0,0.0,69.0
1,BGD,Bangladesh,SA,SA,Lower middle income (LM),37.0,33.0,52.0,9.0,76.0,...,34.0,31.0,48.0,8.0,73.0,42.0,38.0,57.0,13.0,79.0
2,BOL,Bolivia (Plurinational State of),LAC,LAC,Lower middle income (LM),12.0,4.0,17.0,1.0,48.5,...,12.0,3.0,17.0,1.5,48.0,21.0,3.5,28.5,0.0,41.0
3,CMR,Cameroon,SSA,WCA,Lower middle income (LM),5.0,0.0,10.0,0.0,24.0,...,5.0,0.0,10.0,0.0,23.0,7.0,0.0,12.0,0.0,28.0
4,CIV,Côte d'Ivoire,SSA,WCA,Lower middle income (LM),3.0,1.0,5.0,0.0,14.0,...,4.0,1.0,7.0,0.0,16.0,3.0,2.0,5.0,0.0,12.0


In [71]:
# identify the no. of missing values per column in the lower_middle_income_countries data frame again
lower_middle_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           0
School_Age_Digital_Connectivity_Urban           0
School_Age_Digital_Connectivity_Poorest         0
School_Age_Digital_Connectivity_Richest         0
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          0
Pre_Primary_Digital_Connectivity_Urban          0
Pre_Primary_Digital_Connectivity_Poorest        0
Pre_Primary_Digital_Connectivity_Richest        0
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


In [72]:
# identify the no. of missing values per column in the upper_middle_income_countries data frame 
upper_middle_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           2
School_Age_Digital_Connectivity_Urban           2
School_Age_Digital_Connectivity_Poorest         6
School_Age_Digital_Connectivity_Richest         6
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          2
Pre_Primary_Digital_Connectivity_Urban          2
Pre_Primary_Digital_Connectivity_Poorest        6
Pre_Primary_Digital_Connectivity_Richest        7
Primary_Digital_Connectivity_Total              1
Primary_Digital_Connectivity_Rural              1
Primary_Digital_Connectivity_Urban              1
Primary_Digital_Connectivity_Poorest            1
Primary_Digital_Connectivity_Richest            1


In [73]:
# compute the median of each numeric column in the upper_middle_income_countries data frame
upper_middle_income_countries.median(numeric_only=True)

School_Age_Digital_Connectivity_Total           49.5
School_Age_Digital_Connectivity_Rural           49.0
School_Age_Digital_Connectivity_Urban           65.0
School_Age_Digital_Connectivity_Poorest         33.5
School_Age_Digital_Connectivity_Richest         95.5
Pre_Primary_Digital_Connectivity_Total          48.0
Pre_Primary_Digital_Connectivity_Rural          38.5
Pre_Primary_Digital_Connectivity_Urban          63.0
Pre_Primary_Digital_Connectivity_Poorest        28.0
Pre_Primary_Digital_Connectivity_Richest        95.0
Primary_Digital_Connectivity_Total              49.0
Primary_Digital_Connectivity_Rural              43.0
Primary_Digital_Connectivity_Urban              56.0
Primary_Digital_Connectivity_Poorest            21.0
Primary_Digital_Connectivity_Richest            91.0
Lower_Secondary_Digital_Connectivity_Total      50.0
Lower_Secondary_Digital_Connectivity_Rural      50.5
Lower_Secondary_Digital_Connectivity_Urban      62.5
Lower_Secondary_Digital_Connectivity_Poorest  

In [74]:
# median imputation: fill missing values in each numeric column with that column's median
upper_middle_income_countries = upper_middle_income_countries.fillna(upper_middle_income_countries.median(numeric_only=True))
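
The imputation step above can be sketched on a small example. This is only an illustration: the frame `toy` and its values are made up, and the column names `Total` and `Rural` are shortened stand-ins for the real indicator columns. It shows that `median(numeric_only=True)` returns one median per numeric column and that `fillna` with that Series fills each column from the entry with the matching label, leaving text columns untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "ISO3": ["AAA", "BBB", "CCC", "DDD"],
    "Total": [10.0, np.nan, 30.0, 50.0],
    "Rural": [np.nan, 4.0, 8.0, np.nan],
})

# Column-wise medians of the numeric columns only (NaN is skipped by default).
medians = toy.median(numeric_only=True)

# fillna with a Series fills each column using the entry whose label matches it.
toy_filled = toy.fillna(medians)

print(medians)
print(toy_filled)
```

Because the median is taken within one income group's data frame, each gap is filled with a value typical of that group rather than of the whole dataset.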

In [75]:
# check the imputation
upper_middle_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,DZA,Algeria,MENA,MENA,Upper middle income (UM),24.0,9.0,32.0,1.0,77.0,...,25.0,10.0,33.0,1.0,78.0,71.5,56.5,79.0,32.0,94.0
1,ARG,Argentina,LAC,LAC,Upper middle income (UM),40.0,49.0,65.0,33.5,95.5,...,43.0,50.5,62.5,34.5,97.0,45.0,0.0,0.0,0.0,0.0
2,ARM,Armenia,ECA,EECA,Upper middle income (UM),81.0,71.0,88.0,47.0,99.0,...,81.0,71.0,89.0,44.0,99.0,85.0,78.0,91.0,54.0,100.0
3,BIH,Bosnia and Herzegovina,ECA,EECA,Upper middle income (UM),59.0,51.0,76.0,8.0,96.0,...,61.0,52.0,79.0,11.0,95.0,70.0,65.0,81.0,13.0,98.0
4,BRA,Brazil,LAC,LAC,Upper middle income (UM),83.0,51.0,89.0,84.0,97.0,...,82.0,50.0,89.0,84.0,97.0,84.0,55.0,90.0,84.0,96.0


In [76]:
# count the missing values in each column of the upper_middle_income_countries data frame again
upper_middle_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           0
School_Age_Digital_Connectivity_Urban           0
School_Age_Digital_Connectivity_Poorest         0
School_Age_Digital_Connectivity_Richest         0
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          0
Pre_Primary_Digital_Connectivity_Urban          0
Pre_Primary_Digital_Connectivity_Poorest        0
Pre_Primary_Digital_Connectivity_Richest        0
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


In [77]:
# count the missing values in each column of the high_income_countries data frame
high_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           1
School_Age_Digital_Connectivity_Urban           1
School_Age_Digital_Connectivity_Poorest         1
School_Age_Digital_Connectivity_Richest         2
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          1
Pre_Primary_Digital_Connectivity_Urban          1
Pre_Primary_Digital_Connectivity_Poorest        1
Pre_Primary_Digital_Connectivity_Richest        2
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


In [78]:
# compute the median of each numeric column in the high_income_countries data frame
high_income_countries.median(numeric_only=True)

School_Age_Digital_Connectivity_Total           66.0
School_Age_Digital_Connectivity_Rural           54.0
School_Age_Digital_Connectivity_Urban           67.0
School_Age_Digital_Connectivity_Poorest         22.5
School_Age_Digital_Connectivity_Richest         96.0
Pre_Primary_Digital_Connectivity_Total          61.0
Pre_Primary_Digital_Connectivity_Rural          48.0
Pre_Primary_Digital_Connectivity_Urban          57.5
Pre_Primary_Digital_Connectivity_Poorest        10.0
Pre_Primary_Digital_Connectivity_Richest        96.0
Primary_Digital_Connectivity_Total              63.0
Primary_Digital_Connectivity_Rural              53.0
Primary_Digital_Connectivity_Urban              64.0
Primary_Digital_Connectivity_Poorest             9.0
Primary_Digital_Connectivity_Richest            95.0
Lower_Secondary_Digital_Connectivity_Total      68.0
Lower_Secondary_Digital_Connectivity_Rural      56.0
Lower_Secondary_Digital_Connectivity_Urban      68.5
Lower_Secondary_Digital_Connectivity_Poorest  

In [79]:
# median imputation: fill missing values in each numeric column with that column's median
high_income_countries = high_income_countries.fillna(high_income_countries.median(numeric_only=True))
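
The same per-group median imputation is repeated for every income group. As a possible alternative (not what this notebook does), the whole frame could be imputed in one pass with `groupby(...).transform("median")`. The sketch below assumes a made-up frame `toy` with a single numeric column `Total`; only the `Income Group` column name and its labels come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the numeric values are invented for illustration only.
toy = pd.DataFrame({
    "Income Group": ["Low income (L)"] * 3 + ["High income (H)"] * 3,
    "Total": [2.0, np.nan, 4.0, 60.0, 80.0, np.nan],
})

# Select the numeric columns that should be imputed.
numeric_cols = toy.select_dtypes(include="number").columns

# For every row, look up the median of its own income group (same shape as the input).
group_medians = toy.groupby("Income Group")[numeric_cols].transform("median")

# Fill each gap with the median of the row's own group.
toy[numeric_cols] = toy[numeric_cols].fillna(group_medians)

print(toy)
```

This avoids splitting the frame and concatenating it again, at the cost of making the individual group medians less visible for inspection.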

In [80]:
# check the imputation
high_income_countries.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,BRB,Barbados,LAC,LAC,High income (H),66.0,61.0,69.0,10.0,98.0,...,68.0,66.0,69.0,20.0,100.0,76.0,76.0,76.0,4.0,100.0
1,CHL,Chile,LAC,LAC,High income (H),86.0,70.0,89.0,73.0,99.0,...,87.0,71.0,90.0,75.0,99.0,87.0,72.0,90.0,75.0,98.0
2,JPN,Japan,EAP,EAP,High income (H),78.0,83.0,77.0,62.0,87.0,...,78.0,77.0,79.0,64.0,88.0,88.0,88.0,87.0,55.0,94.0
3,PAN,Panama,LAC,LAC,High income (H),31.0,10.0,44.0,4.0,94.0,...,33.0,11.0,48.0,4.0,92.0,38.0,15.0,51.0,8.0,97.0
4,TTO,Trinidad and Tobago,LAC,LAC,High income (H),44.0,34.0,52.0,3.0,96.0,...,47.0,38.0,54.0,4.0,93.0,51.0,41.0,60.0,7.0,97.0


In [81]:
# count the missing values in each column of the high_income_countries data frame again
high_income_countries.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           0
School_Age_Digital_Connectivity_Urban           0
School_Age_Digital_Connectivity_Poorest         0
School_Age_Digital_Connectivity_Richest         0
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          0
Pre_Primary_Digital_Connectivity_Urban          0
Pre_Primary_Digital_Connectivity_Poorest        0
Pre_Primary_Digital_Connectivity_Richest        0
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


In [82]:
# concatenate the four income-group data frames back into one full data frame
modified_df = pd.concat(
    [low_income_countries, lower_middle_income_countries,
     upper_middle_income_countries, high_income_countries],
    axis=0,
    ignore_index=True,
)
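
A minimal sketch of what `ignore_index=True` does in the concatenation above. The two small frames are hypothetical stand-ins for the income-group frames; their values are copied from rows displayed in this notebook.

```python
import pandas as pd

# Small stand-ins for two of the income-group frames.
low = pd.DataFrame({"ISO3": ["BEN", "BFA"], "Total": [4.0, 1.0]})
high = pd.DataFrame({"ISO3": ["BRB", "CHL"], "Total": [66.0, 86.0]})

# Without ignore_index, each frame keeps its own row labels, so labels repeat.
stacked = pd.concat([low, high], axis=0)

# With ignore_index=True, the result gets a fresh 0..n-1 index.
reindexed = pd.concat([low, high], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(reindexed.index.tolist())
```

A fresh index matters here because each income-group frame was numbered from 0, and duplicate labels would make label-based row selection ambiguous.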

In [83]:
# show the first 5 rows
modified_df.head(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,BEN,Benin,SSA,WCA,Low income (L),4.0,1.0,7.0,0.0,16.0,...,4.0,2.0,7.0,0.0,16.0,5.0,2.0,10.0,0.0,21.0
1,BFA,Burkina Faso,SSA,WCA,Low income (L),1.0,1.0,4.0,1.0,5.0,...,2.0,1.0,5.0,1.0,6.0,2.0,0.0,5.0,0.0,6.0
2,CAF,Central African Republic,SSA,WCA,Low income (L),4.0,1.0,9.0,0.0,16.0,...,5.0,1.0,11.0,0.0,18.0,6.0,1.0,12.0,0.0,19.0
3,TCD,Chad,SSA,WCA,Low income (L),2.0,1.0,8.0,0.0,9.0,...,2.0,1.0,8.0,0.0,9.0,3.0,1.0,12.0,0.0,12.0
4,COD,Democratic Republic of the Congo,SSA,WCA,Low income (L),1.0,0.0,2.0,0.0,3.0,...,1.0,0.0,2.0,0.0,2.0,1.0,0.0,3.0,0.0,3.0


In [84]:
# sort the dataframe alphabetically by country name and renumber the rows
modified_df = modified_df.sort_values(by=['Countries and areas'], ignore_index=True)
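
One detail of the sort above: `sort_values` compares strings by Unicode code point, so a name with an accented letter such as Côte d'Ivoire may not land where an alphabetical listing would put it. The sketch below is an optional refinement, not part of the original notebook. The helper `strip_accents` is hypothetical, and "Cuba" is used only as an illustrative name that may not appear in the dataset.

```python
import unicodedata
import pandas as pd

names = pd.DataFrame(
    {"Countries and areas": ["Cuba", "C\u00f4te d'Ivoire", "Chad"]}
)

def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks so 'ô' compares like 'o'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Default ordering: plain code-point comparison.
default_order = names.sort_values(by="Countries and areas", ignore_index=True)

# Accent-insensitive ordering via the `key` argument (available in pandas 1.1+).
folded_order = names.sort_values(
    by="Countries and areas",
    key=lambda s: s.map(strip_accents).str.lower(),
    ignore_index=True,
)

print(default_order["Countries and areas"].tolist())
print(folded_order["Countries and areas"].tolist())
```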

In [85]:
# display the dataset again
modified_df.head(10)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
0,DZA,Algeria,MENA,MENA,Upper middle income (UM),24.0,9.0,32.0,1.0,77.0,...,25.0,10.0,33.0,1.0,78.0,71.5,56.5,79.0,32.0,94.0
1,AGO,Angola,SSA,ESA,Lower middle income (LM),17.0,2.0,24.0,0.0,62.0,...,19.0,2.0,27.0,0.0,62.0,24.0,2.0,33.0,0.0,69.0
2,ARG,Argentina,LAC,LAC,Upper middle income (UM),40.0,49.0,65.0,33.5,95.5,...,43.0,50.5,62.5,34.5,97.0,45.0,0.0,0.0,0.0,0.0
3,ARM,Armenia,ECA,EECA,Upper middle income (UM),81.0,71.0,88.0,47.0,99.0,...,81.0,71.0,89.0,44.0,99.0,85.0,78.0,91.0,54.0,100.0
4,BGD,Bangladesh,SA,SA,Lower middle income (LM),37.0,33.0,52.0,9.0,76.0,...,34.0,31.0,48.0,8.0,73.0,42.0,38.0,57.0,13.0,79.0
5,BRB,Barbados,LAC,LAC,High income (H),66.0,61.0,69.0,10.0,98.0,...,68.0,66.0,69.0,20.0,100.0,76.0,76.0,76.0,4.0,100.0
6,BEN,Benin,SSA,WCA,Low income (L),4.0,1.0,7.0,0.0,16.0,...,4.0,2.0,7.0,0.0,16.0,5.0,2.0,10.0,0.0,21.0
7,BOL,Bolivia (Plurinational State of),LAC,LAC,Lower middle income (LM),12.0,4.0,17.0,1.0,48.5,...,12.0,3.0,17.0,1.5,48.0,21.0,3.5,28.5,0.0,41.0
8,BIH,Bosnia and Herzegovina,ECA,EECA,Upper middle income (UM),59.0,51.0,76.0,8.0,96.0,...,61.0,52.0,79.0,11.0,95.0,70.0,65.0,81.0,13.0,98.0
9,BRA,Brazil,LAC,LAC,Upper middle income (UM),83.0,51.0,89.0,84.0,97.0,...,82.0,50.0,89.0,84.0,97.0,84.0,55.0,90.0,84.0,96.0


In [86]:
# show the last 5 rows
modified_df.tail(5)

Unnamed: 0,ISO3,Countries and areas,Region,Sub-region,Income Group,School_Age_Digital_Connectivity_Total,School_Age_Digital_Connectivity_Rural,School_Age_Digital_Connectivity_Urban,School_Age_Digital_Connectivity_Poorest,School_Age_Digital_Connectivity_Richest,...,Lower_Secondary_Digital_Connectivity_Total,Lower_Secondary_Digital_Connectivity_Rural,Lower_Secondary_Digital_Connectivity_Urban,Lower_Secondary_Digital_Connectivity_Poorest,Lower_Secondary_Digital_Connectivity_Richest,Upper_Secondary_Digital_Connectivity_Total,Upper_Secondary_Digital_Connectivity_Rural,Upper_Secondary_Digital_Connectivity_Urban,Upper_Secondary_Digital_Connectivity_Poorest,Upper_Secondary_Digital_Connectivity_Richest
82,URY,Uruguay,LAC,LAC,High income (H),63.0,47.0,65.0,35.0,96.0,...,66.0,46.0,68.0,28.0,93.0,70.0,48.0,73.0,53.0,0.0
83,UZB,Uzbekistan,ECA,EECA,Lower middle income (LM),19.0,16.0,29.0,1.0,69.0,...,18.0,14.0,31.0,1.0,70.0,22.0,17.0,38.0,0.0,70.0
84,VNM,Viet Nam,EAP,EAP,Lower middle income (LM),62.0,9.0,62.0,1.0,48.5,...,62.0,9.0,62.0,1.5,48.0,67.0,0.0,67.0,0.0,0.0
85,ZMB,Zambia,SSA,ESA,Lower middle income (LM),6.0,2.0,13.0,0.0,28.0,...,8.0,2.0,16.0,0.0,31.0,7.0,3.0,14.0,0.0,28.0
86,ZWE,Zimbabwe,SSA,ESA,Lower middle income (LM),26.0,18.0,49.0,4.0,62.0,...,24.0,17.0,50.0,4.0,65.0,29.0,20.0,51.0,5.0,64.0


In [87]:
# check the missing values for the last time 
modified_df.isna().sum()

ISO3                                            0
Countries and areas                             0
Region                                          0
Sub-region                                      0
Income Group                                    0
School_Age_Digital_Connectivity_Total           0
School_Age_Digital_Connectivity_Rural           0
School_Age_Digital_Connectivity_Urban           0
School_Age_Digital_Connectivity_Poorest         0
School_Age_Digital_Connectivity_Richest         0
Pre_Primary_Digital_Connectivity_Total          0
Pre_Primary_Digital_Connectivity_Rural          0
Pre_Primary_Digital_Connectivity_Urban          0
Pre_Primary_Digital_Connectivity_Poorest        0
Pre_Primary_Digital_Connectivity_Richest        0
Primary_Digital_Connectivity_Total              0
Primary_Digital_Connectivity_Rural              0
Primary_Digital_Connectivity_Urban              0
Primary_Digital_Connectivity_Poorest            0
Primary_Digital_Connectivity_Richest            0


There are no missing values in the dataset. Now save the dataframe as a comma-separated values (CSV) file.

In [88]:
# save the dataframe 
modified_df.to_csv('Cleaned_Digital_Connectivity.csv', index = False)
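
A point worth checking when the saved file is read back: by default `pd.read_csv` treats the string "NA" as a missing value, and "NA" is also the UNICEF region code for North America. Whether the cleaned file contains such a row is not shown here, so the sketch below uses a made-up placeholder row (`XXX`, "Placeholder") alongside the Algeria row from the notebook, and writes to a temporary file rather than to `Cleaned_Digital_Connectivity.csv`.

```python
import os
import tempfile
import pandas as pd

# First row is from the notebook; the second is a hypothetical placeholder
# whose region code is the literal string "NA".
cleaned = pd.DataFrame({
    "ISO3": ["DZA", "XXX"],
    "Countries and areas": ["Algeria", "Placeholder"],
    "Region": ["MENA", "NA"],
    "School_Age_Digital_Connectivity_Total": [24.0, 50.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_check.csv")
    cleaned.to_csv(path, index=False)

    # Default parsing: the region code "NA" is read as a missing value.
    reloaded_default = pd.read_csv(path)

    # Treat only empty cells as missing, so "NA" survives as text.
    reloaded_safe = pd.read_csv(path, keep_default_na=False, na_values=[""])

print(reloaded_default["Region"].isna().sum())
print(reloaded_safe["Region"].tolist())
```

If the cleaned data does include North American rows, reading the file with `keep_default_na=False, na_values=[""]` keeps the missing-value count at zero after a round trip.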