# Analysis of out-of-school rates around the world

Education is one of the most important factors in lifting disadvantaged communities out of poverty. Unfortunately, access to education tends to be most limited in the areas that need it most. In many parts of the world there is also a long-standing gap between boys' and girls' access to education. Analysing school dropout rates is therefore essential to ensuring equal access and equal opportunity for all.

### Loading the libraries and the data

In [1]:
# Loading libraries 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import os

In [2]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [3]:
primary_df = df = pd.read_csv('/kaggle/input/out-of-school-rates-global-data/Primary.csv', encoding = 'latin1')
primary_df.head()

In [4]:
primary_df.columns

### Dealing with the null values

In [5]:
primary_df.isnull().sum()

In [6]:
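# Fill missing values with each numeric column's median (text columns are left untouched)
primary_df = primary_df.fillna(primary_df.median(numeric_only=True))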

In [7]:
primary_df.info()
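
As a minimal sketch of what median imputation does, the toy frame below (hypothetical values, not taken from the dataset) shows each numeric column being filled with its own median while the text column is skipped. The names `toy_df`, `medians` and `filled_df` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy_df = pd.DataFrame({
    'Region': ['SSA', 'SA', 'ECA', 'SSA'],
    'Total': [40.0, np.nan, 2.0, 20.0],
    'Female': [45.0, 10.0, np.nan, 22.0],
})

# Medians are computed for numeric columns only; 'Region' is skipped.
medians = toy_df.median(numeric_only=True)

# Passing a Series to fillna fills each column with its own median.
filled_df = toy_df.fillna(medians)
print(filled_df)
```

The median is less sensitive to extreme values than the mean, which may matter here because a few countries have very high rates.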

## Visual analysis

The data has three categories of regions: least developed, less developed and more developed. We can clearly see that most of the countries are in the less developed category.

In [8]:
parameter = 'Development Regions'
sns.countplot(x=parameter,data=primary_df)
plt.show()

Checking the distribution over geographical regions, we can see that most of the countries fall under ECA (Europe and Central Asia). SSA (Sub-Saharan Africa) comes a close second. SA (South Asia) has the lowest number of members. The other regions are LAC (Latin America and the Caribbean), EAP (East Asia and Pacific) and MENA (Middle East and North Africa).

In [9]:
parameter = 'Region'
sns.countplot(x=primary_df[parameter])
plt.show()

If we further divide the regions into sub-regions, we can see that LAC (Latin America and the Caribbean) has the most members, and SA (South Asia) has the fewest. The other sub-regions are EAP (East Asia and Pacific), WCA (West and Central Africa), ESA (Eastern and Southern Africa), MENA (Middle East and North Africa), EECA (Eastern Europe and Central Asia) and WE (Western Europe).

In [10]:
parameter = 'Sub-region'
sns.countplot(x=primary_df[parameter])
plt.show()
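
The bar heights in the count plots above can also be read off as numbers. The sketch below uses a hypothetical `toy_df` with made-up region labels, so the counts are not those of the real dataset.

```python
import pandas as pd

# Hypothetical region labels, for illustration only.
toy_df = pd.DataFrame({'Region': ['ECA', 'SSA', 'ECA', 'SA', 'SSA', 'ECA']})

# value_counts returns the number of rows per category, largest first.
region_counts = toy_df['Region'].value_counts()
print(region_counts)

# The largest and smallest categories can be looked up directly.
print('most members:', region_counts.idxmax())
print('fewest members:', region_counts.idxmin())
```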

Here we can see that many of the children in rural areas who dropped out of school were from the least developed countries. This could be attributed to the fact that most of the population in the least developed countries lives in rural areas.

In [11]:
parameters = ('Development Regions', 'Rural_Residence')
primary_df[[parameters[0],parameters[1]]].groupby([parameters[0]]).mean().plot.bar()
plt.show()
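
The bar chart above plots one mean per development region. A minimal sketch of the same aggregation as a table is shown below; the category labels and numbers in `toy_df` are assumptions made for illustration and may not match the spelling used in the dataset.

```python
import pandas as pd

# Hypothetical rows; the labels and rates are made up for illustration.
toy_df = pd.DataFrame({
    'Development Regions': ['Least Developed', 'Least Developed',
                            'Less Developed', 'More Developed'],
    'Rural_Residence': [30.0, 50.0, 10.0, 2.0],
})

# Mean rural rate per development region, highest first.
group_means = (
    toy_df.groupby('Development Regions')['Rural_Residence']
    .mean()
    .sort_values(ascending=False)
)
print(group_means)
```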

Here we look at the distribution of the total percentage of children who drop out of school. We can see that in most cases, less than 10% of children drop out. The distributions for male and female students are quite similar.

In [12]:
parameters = 'Total'
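# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(x=primary_df[parameters],kde=True,stat='density')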
plt.show()

In [13]:
parameters = 'Female'
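# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(x=primary_df[parameters],kde=True,stat='density')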
plt.show()

In [14]:
parameters = 'Male'
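# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(x=primary_df[parameters],kde=True,stat='density')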
plt.show()
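
A claim such as "most countries are below 10%" can be checked numerically rather than by eye. The sketch below does this on a hypothetical series of rates; `total` and `share_below_10` are illustrative names, and the values are not from the dataset.

```python
import pandas as pd

# Hypothetical dropout percentages, for illustration only.
total = pd.Series([2.0, 5.0, 8.0, 12.0, 40.0], name='Total')

# A boolean mask averaged over the rows gives the share of rows below 10%.
share_below_10 = (total < 10).mean()
print(f'share below 10%: {share_below_10:.0%}')

# describe() reports the median and quartiles alongside the mean.
print(total.describe())
```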

In the following figure, we can see the gulf between the least developed countries and the other countries in terms of poverty. Most of the dropouts in the least developed countries came from the poorest wealth quintile.  

In [15]:
parameters = ('Development Regions', 'Poorest_Wealth quintile')
primary_df[[parameters[0],parameters[1]]].groupby([parameters[0]]).mean().plot.bar()
plt.show()

In this count plot, we can see that the total percentage of dropouts is lower than 5% for the more developed countries. For less developed countries the percentage goes up to 25%, and for the least developed countries the highest percentage of dropouts goes up to 72%. The distribution is similar for male and female dropouts.

In [16]:
parameters = ('Development Regions', 'Total')
sns.countplot(x=parameters[1],hue=parameters[0],data=primary_df)
plt.xticks(rotation=90)
plt.show()

In [17]:
parameters = ('Development Regions', 'Female')
sns.countplot(x=parameters[1],hue=parameters[0],data=primary_df)
plt.xticks(rotation=90)
plt.show()

In [18]:
parameters = ('Development Regions', 'Male')
sns.countplot(x=parameters[1],hue=parameters[0],data=primary_df)
plt.xticks(rotation=90)
plt.show()

The following scatter plot is yet another illustration of how the more developed countries have single-digit dropout rates, whereas the least developed countries have high dropout rates for both males and females.

In [19]:
plt.figure(figsize=(12,6))
sns.scatterplot(x=primary_df['Female'],y=primary_df['Male'],hue=primary_df['Development Regions'])
plt.show()
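
Points far from the diagonal of the scatter plot are countries where the female and male rates differ. One way to quantify that, sketched here on hypothetical data, is a simple difference column; `Gender_gap` is an illustrative name that does not exist in the dataset.

```python
import pandas as pd

# Hypothetical rows; the labels and rates are made up for illustration.
toy_df = pd.DataFrame({
    'Development Regions': ['Least Developed', 'Least Developed',
                            'Less Developed', 'More Developed'],
    'Female': [50.0, 30.0, 8.0, 2.0],
    'Male': [40.0, 34.0, 7.0, 2.0],
})

# Positive values mean girls drop out more often than boys.
toy_df['Gender_gap'] = toy_df['Female'] - toy_df['Male']

# Average gap per development region.
gap_by_group = toy_df.groupby('Development Regions')['Gender_gap'].mean()
print(gap_by_group)
```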

Discretizing the numerical feature Total (total percentage of dropouts) into Low, Medium and High bands

In [20]:
old_feature_name = 'Total'
new_feature_name = 'Total_band'
primary_df[new_feature_name]='None'
primary_df.loc[primary_df[old_feature_name]<=15,new_feature_name]='Low'
primary_df.loc[(primary_df[old_feature_name]>15)&(primary_df[old_feature_name]<=30),new_feature_name]='Medium'
primary_df.loc[(primary_df[old_feature_name]>30),new_feature_name]='High'
primary_df[new_feature_name].value_counts().to_frame().style.background_gradient(cmap='summer')

Discretizing the numerical feature Female (percentage of female dropouts) into Low, Medium and High bands

In [21]:
old_feature_name = 'Female'
new_feature_name = 'Female_band'
primary_df[new_feature_name]='None'
primary_df.loc[primary_df[old_feature_name]<=15,new_feature_name]='Low'
primary_df.loc[(primary_df[old_feature_name]>15)&(primary_df[old_feature_name]<=30),new_feature_name]='Medium'
primary_df.loc[(primary_df[old_feature_name]>30),new_feature_name]='High'
primary_df[new_feature_name].value_counts().to_frame().style.background_gradient(cmap='summer')

Discretizing the numerical feature Male (percentage of male dropouts) into Low, Medium and High bands

In [22]:
old_feature_name = 'Male'
new_feature_name = 'Male_band'
primary_df[new_feature_name]='None'
primary_df.loc[primary_df[old_feature_name]<=15,new_feature_name]='Low'
primary_df.loc[(primary_df[old_feature_name]>15)&(primary_df[old_feature_name]<=30),new_feature_name]='Medium'
primary_df.loc[(primary_df[old_feature_name]>30),new_feature_name]='High'
primary_df[new_feature_name].value_counts().to_frame().style.background_gradient(cmap='summer')
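
The three banding cells above repeat the same thresholds (15 and 30). As an alternative sketch, not the approach used in this notebook, `pd.cut` can express the same bins in one call. The helper `to_band` and the frame `toy_df` are hypothetical names. One difference to be aware of: missing values stay missing here instead of being labelled 'None'.

```python
import numpy as np
import pandas as pd

def to_band(values):
    """Map percentages to Low (<=15), Medium (>15 and <=30) or High (>30)."""
    # Bins are right-closed by default: (-inf, 15], (15, 30], (30, inf).
    return pd.cut(values, bins=[-np.inf, 15, 30, np.inf],
                  labels=['Low', 'Medium', 'High'])

# Hypothetical rates chosen to hit the bin edges.
toy_df = pd.DataFrame({'Total': [3.0, 15.0, 15.1, 30.0, 72.0]})
toy_df['Total_band'] = to_band(toy_df['Total'])
print(toy_df)
```

The same helper could then be applied to the Female and Male columns without repeating the threshold logic.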

In the following table, we can see that in Sub-Saharan Africa there are 8 countries in which both males and females drop out of school at a high rate. In South Asia and Sub-Saharan Africa there are a few countries where males drop out of school at a higher rate than females. In Europe, East Asia and the Middle East, the dropout rate is low for both boys and girls.

In [23]:
pd.crosstab([primary_df.Male_band,primary_df.Female_band],[primary_df.Region],margins=True).style.background_gradient(cmap='summer_r')

Once again, we can see that Sub-Saharan Africa has the highest dropout rates among girls. Middle East and North Africa, and South Asia, each have one country with a medium dropout rate.

In [24]:
pd.crosstab([primary_df.Female_band],[primary_df.Region],margins=True).style.background_gradient(cmap='summer_r')
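
Raw counts in a crosstab are hard to compare when regions have different numbers of countries. A minimal sketch of column-normalised shares is shown below, using a hypothetical `toy_df`; the band assignments are made up and do not reflect the dataset.

```python
import pandas as pd

# Hypothetical band assignments, for illustration only.
toy_df = pd.DataFrame({
    'Region': ['SSA', 'SSA', 'SSA', 'SA', 'ECA', 'ECA'],
    'Female_band': ['High', 'High', 'Low', 'Medium', 'Low', 'Low'],
})

# normalize='columns' turns counts into the share of each band within a region.
band_share = pd.crosstab(toy_df['Female_band'], toy_df['Region'],
                         normalize='columns')
print(band_share.round(2))
```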

Checking the correlation between all of the numeric features, we can draw the following conclusions:
* The poorest of the population usually do not live in urban areas (correlation of 0.86, lower than the other values)
* The richest of the population usually do not live in rural areas (correlation of 0.86, lower than the other values)
* Economic factors seem to influence male and female dropout rates in the same way

In [25]:
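# Size the figure first, then correlate numeric columns only (text columns cannot be correlated)
plt.figure(figsize=(10,8))
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True,cmap='RdYlGn',linewidths=0.2)
plt.show()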
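
The heatmap is built from a pairwise Pearson correlation matrix. The sketch below, on a hypothetical `toy_df`, shows how that matrix can be computed for numeric columns only and how a single pair can be looked up; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows mixing one text column with numeric rates.
toy_df = pd.DataFrame({
    'Region': ['SSA', 'SSA', 'ECA', 'ECA'],
    'Female': [50.0, 30.0, 8.0, 2.0],
    'Male': [40.0, 34.0, 7.0, 2.0],
    'Total': [45.0, 32.0, 7.5, 2.0],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy_df.select_dtypes(include='number').corr()
print(corr.round(2))

# Look up a single pair from the matrix.
print('Female vs Male:', round(corr.loc['Female', 'Male'], 2))
```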