# Project 2

## Setup 

Download the data from the World Bank.

In [4]:
# download and update the data in the project
!./setup.sh

mkdir: cannot create directory ‘data’: File exists
--2024-09-26 14:44:43--  https://databank.worldbank.org/data/download/EdStats_CSV.zip
Resolving databank.worldbank.org (databank.worldbank.org)... 172.64.145.25, 104.18.42.231, 2606:4700:4400::6812:2ae7, ...
Connecting to databank.worldbank.org (databank.worldbank.org)|172.64.145.25|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://databankfiles.worldbank.org/public/ddpext_download/EdStats_CSV.zip [following]
--2024-09-26 14:44:44--  https://databankfiles.worldbank.org/public/ddpext_download/EdStats_CSV.zip
Resolving databankfiles.worldbank.org (databankfiles.worldbank.org)... 13.107.246.56, 2620:1ec:bdf::56
Connecting to databankfiles.worldbank.org (databankfiles.worldbank.org)|13.107.246.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38943514 (37M) [application/x-zip-compressed]
Saving to: ‘EdStats_CSV.zip’


2024-09-26 14:44:44 (59.6 MB/s) - ‘EdStats_CSV.

In [5]:
!pip install -r requirements.txt



In [6]:
import pandas as pd
import scipy.stats as stats
from scipy.stats import shapiro
from scipy.stats import kruskal
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.stats.api as sms

In [7]:
# load and clean the dataset
country_codes = pd.read_csv("./data/countries.csv")
df = pd.read_csv('./data/EdStatsData.csv')

Set up the Python environment

### Hypothesis 1 - Gender parity and drop-out rates

#### Hypothesis:

Countries with higher gender parity in enrolment (GPI) will have lower drop-out rates in both primary and secondary education

#### Indicators:
1. Adjusted net enrolment rate, gender parity index (primary, lower secondary)
2. Cumulative drop-out rate to the last grade (primary, lower secondary general)

#### Null and alternative Hypothesis:

- Null hypothesis (H0): There is no correlation between gender parity in enrolment (GPI) and drop-out rates
- Alternative hypothesis (H1): Higher gender parity in enrolment (GPI) is associated with lower drop-out rates

Pearson's correlation test is used to measure the strength and direction of the linear relationship between the two indicators.

Pearson's correlation coefficient r lies in the range [-1, 1], where:
- r = 1 indicates a perfect positive correlation: as GPI increases, drop-out rates increase
- r = -1 indicates a perfect negative correlation: as GPI increases, drop-out rates decrease
- r = 0 indicates no linear correlation between the two indicators

The p-value tells us whether the observed correlation is statistically significant:
- if the p-value is < 0.05, the correlation is statistically significant and we reject the null hypothesis
- if the p-value is ≥ 0.05, the correlation is not statistically significant and we fail to reject the null hypothesis
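
To make the two numbers concrete, here is a minimal sketch on synthetic data. The arrays `gpi` and `dropout` are made up for illustration and are not taken from EdStats. It shows that `scipy.stats.pearsonr` returns both the coefficient and the p-value, and how the 0.05 threshold is applied.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative data only: drop-out falls as GPI rises, plus noise
rng = np.random.default_rng(0)
gpi = rng.uniform(0.6, 1.2, size=200)
dropout = 60 - 40 * gpi + rng.normal(0, 5, size=200)

# pearsonr returns the correlation coefficient r and a two-sided p-value
r, p_value = stats.pearsonr(gpi, dropout)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"r = {r:.3f}, p-value = {p_value:.3g}, decision: {decision}")
```

The sign of `r` gives the direction of the relationship, while the p-value only says whether an association of that size is plausible under H0.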

In [8]:
hyp1_df = df[df["Country Code"].isin(country_codes["alpha-3"])]

hyp1_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
91625,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,7.05911,,...,,,,,,,,,,
91626,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,2.53138,,...,,,,,,,,,,
91627,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,0.22154,,...,,,,,,,,,,
91628,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,11.42652,,...,,,,,,,,,,
91629,Afghanistan,AFG,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,,,,,,,...,,,,,,,,,,


In [9]:
# define the Indicators from the dataset for the cleaning process of the data

# define the indicators for the GPI
gpi_indicators = [
    'Adjusted net enrolment rate, primary, gender parity index (GPI)',
    'Adjusted net enrolment rate, lower secondary, gender parity index (GPI)'
]

# define the indicators for the drop-out rate
drop_out_indicators = [
    'Cumulative drop-out rate to the last grade of primary education, both sexes (%)',
    'Cumulative drop-out rate to the last grade of lower secondary general education, both sexes (%)'
]


In [10]:
def categorize_gpi(gpi):
    if gpi < 1:
        return "Male"
    elif gpi == 1:
        return "equal"
    else:
        return "Female"
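
Because `gpi == 1` compares floats exactly, very few observations are likely to land in the "equal" group. A possible alternative, sketched below, treats a band around 1 as parity. The band 0.97 to 1.03 is an assumption here (a range often used to describe gender parity), and the function name `categorize_gpi_with_band` is hypothetical.

```python
import math

def categorize_gpi_with_band(gpi, low=0.97, high=1.03):
    """Label a GPI value, treating the closed band [low, high] as parity."""
    if math.isnan(gpi):
        return None  # leave missing values unlabeled
    if gpi < low:
        return "Male"  # same labels as categorize_gpi
    if gpi > high:
        return "Female"
    return "equal"

labels = [categorize_gpi_with_band(v) for v in (0.22154, 0.99, 1.0, 1.2)]
print(labels)
```

Widening or narrowing the band changes the group sizes, so the chosen limits should be stated alongside any result that depends on them.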

In [11]:
# Re-shape the data to prevent the Years from being presented as individual columns
df_melted = hyp1_df.melt(id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
                    var_name="Year", value_name="Value")

df_melted.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,Year,Value
0,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,1970,
1,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,1970,
2,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,1970,
3,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,1970,
4,Afghanistan,AFG,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,1970,
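
The reshaping step can be seen on a small scale in the sketch below. The table `wide` is made up and only mimics the layout of `EdStatsData.csv` (one column per year). Note that `melt` produces the year labels as strings, so any non-year column left among the value columns (such as the trailing `Unnamed: 69` column in the source file) would end up in `Year` as well; converting with `errors="coerce"` turns such labels into `NaN`.

```python
import pandas as pd

# Tiny made-up wide table: one column per year
wide = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "Indicator Code": ["X.GPI", "X.GPI"],
    "1970": [0.90, None],
    "1971": [0.95, 1.02],
})

# Wide to long: every (row, year column) pair becomes one row
long = wide.melt(
    id_vars=["Country Code", "Indicator Code"],
    var_name="Year",
    value_name="Value",
)

# Year labels are strings after melt; convert them if numeric years are needed
long["Year"] = pd.to_numeric(long["Year"], errors="coerce")
print(long)
```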


In [14]:


# Convert the data to a numeric value to work with it
# df_melted['Year'] = pd.to_numeric(df_melted['Year'], errors='coerce')

# Get the list Gpi indicator values and drop out values
gpi_df = df_melted[df_melted['Indicator Name'].isin(gpi_indicators)]
dropout_df = df_melted[df_melted['Indicator Name'].isin(drop_out_indicators)]

# Drop the null values to clean the data
# (.copy() makes independent frames, which avoids SettingWithCopyWarning below)
gpi_df_clean = gpi_df.dropna(subset=['Value']).copy()
dropout_df_clean = dropout_df.dropna(subset=['Value']).copy()

# convert the values to numeric to work with them
gpi_df_clean['Value'] = pd.to_numeric(gpi_df_clean['Value'], errors='coerce')
# replace the numeric GPI with its category label ("Male", "equal", "Female")
gpi_df_clean["Value"] = gpi_df_clean["Value"].apply(categorize_gpi)
dropout_df_clean['Value'] = pd.to_numeric(dropout_df_clean['Value'], errors='coerce')

# merge the dataset
hyp1_df = pd.merge(gpi_df_clean, dropout_df_clean, on=['Country Code', 'Year'], suffixes=('_GPI', '_Dropout'))

hyp1_df

Unnamed: 0,Country Name_GPI,Country Code,Indicator Name_GPI,Indicator Code_GPI,Year,Value_GPI,Country Name_Dropout,Indicator Name_Dropout,Indicator Code_Dropout,Value_Dropout
0,Argentina,ARG,"Adjusted net enrolment rate, primary, gender p...",UIS.NERA.1.GPI,1970,Female,Argentina,Cumulative drop-out rate to the last grade of ...,UIS.DR.2.GPV.T,20.250620
1,Argentina,ARG,"Adjusted net enrolment rate, primary, gender p...",UIS.NERA.1.GPI,1970,Female,Argentina,Cumulative drop-out rate to the last grade of ...,SE.PRM.DROP.ZS,36.379501
2,Botswana,BWA,"Adjusted net enrolment rate, primary, gender p...",UIS.NERA.1.GPI,1970,Female,Botswana,Cumulative drop-out rate to the last grade of ...,SE.PRM.DROP.ZS,50.345482
3,Costa Rica,CRI,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,1970,Female,Costa Rica,Cumulative drop-out rate to the last grade of ...,UIS.DR.2.GPV.T,51.214520
4,Costa Rica,CRI,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,1970,Female,Costa Rica,Cumulative drop-out rate to the last grade of ...,SE.PRM.DROP.ZS,28.816540
...,...,...,...,...,...,...,...,...,...,...
6456,Kazakhstan,KAZ,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,2015,Female,Kazakhstan,Cumulative drop-out rate to the last grade of ...,SE.PRM.DROP.ZS,0.968390
6457,Kazakhstan,KAZ,"Adjusted net enrolment rate, primary, gender p...",UIS.NERA.1.GPI,2015,Female,Kazakhstan,Cumulative drop-out rate to the last grade of ...,SE.PRM.DROP.ZS,0.968390
6458,Nepal,NPL,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,2015,Female,Nepal,Cumulative drop-out rate to the last grade of ...,SE.PRM.DROP.ZS,23.148970
6459,Nepal,NPL,"Adjusted net enrolment rate, primary, gender p...",UIS.NERA.1.GPI,2015,Male,Nepal,Cumulative drop-out rate to the last grade of ...,SE.PRM.DROP.ZS,23.148970
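
Merging on `Country Code` and `Year` alone pairs every GPI row with every drop-out row for that country and year, so a lower secondary GPI can be matched with a primary drop-out rate (as in the Kazakhstan rows above). One way to avoid this, sketched here on made-up data, is to derive an education level and include it in the merge keys. The `Level` column and the `level_of` helper are hypothetical and not part of the dataset.

```python
import pandas as pd

def level_of(indicator_name):
    """Hypothetical helper: derive the education level from the indicator name."""
    return "lower secondary" if "lower secondary" in indicator_name else "primary"

gpi = pd.DataFrame({
    "Country Code": ["AAA", "AAA"],
    "Year": ["2015", "2015"],
    "Indicator Name": [
        "Adjusted net enrolment rate, primary, gender parity index (GPI)",
        "Adjusted net enrolment rate, lower secondary, gender parity index (GPI)",
    ],
    "Value": [0.98, 1.04],
})
dropout = pd.DataFrame({
    "Country Code": ["AAA", "AAA"],
    "Year": ["2015", "2015"],
    "Indicator Name": [
        "Cumulative drop-out rate to the last grade of primary education, both sexes (%)",
        "Cumulative drop-out rate to the last grade of lower secondary general education, both sexes (%)",
    ],
    "Value": [12.5, 20.1],
})

# Add the level to both frames so it can serve as a merge key
for frame in (gpi, dropout):
    frame["Level"] = frame["Indicator Name"].map(level_of)

# Without the level key, every GPI row is paired with every drop-out row
naive = pd.merge(gpi, dropout, on=["Country Code", "Year"],
                 suffixes=("_GPI", "_Dropout"))
# With the level key, primary pairs with primary and secondary with secondary
by_level = pd.merge(gpi, dropout, on=["Country Code", "Year", "Level"],
                    suffixes=("_GPI", "_Dropout"))

print(len(naive), len(by_level))
```

In this toy case the naive merge yields four rows and the level-aware merge yields two, which matters because duplicated pairs inflate the sample size seen by the significance tests.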


In [55]:
# Prepare the two series for the correlation test

# GPI values (primary and lower secondary) and the matching drop-out rates
gpi_values = hyp1_df['Value_GPI']
drop_out_values = hyp1_df["Value_Dropout"]

In [17]:
anova_result = stats.f_oneway(
    hyp1_df[hyp1_df["Value_GPI"] != "equal"]["Value_Dropout"],
    hyp1_df[hyp1_df['Value_GPI'] == "equal"]["Value_Dropout"],
)

print(f"f-ANOVA Statistic: {anova_result.statistic}")
print(f"P-value: {anova_result.pvalue}")

f-ANOVA Statistic: 2.2426012958722157
P-value: 0.1343038193795568


In [56]:
# Test for a correlation

correlation, p_value = stats.pearsonr(gpi_values, drop_out_values)

print(f"Pearsons correlation coefficient: {correlation}")
print(f"P-value: {p_value}")

Pearsons correlation coefficient: -0.20755224109799786
P-value: 8.300701558340677e-64


In [57]:
if p_value < 0.05:
    print("There is a statistically significant correlation between GPI and Drop-out rate (Reject H0)")
else:
    print("There is no statistically significant correlation between GPI and Drop-out rate (Fail to reject H0)")

There is a statistically significant correlation between GPI and Drop-out rate (Reject H0)


In [16]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x=gpi_values, y=drop_out_values)
plt.title('Gender Parity Index (GPI) vs Drop-out Rates')
plt.xlabel('Gender Parity Index (GPI)')
plt.ylabel('Drop-out Rates')
plt.show()

NameError: name 'gpi_values' is not defined

<Figure size 1000x600 with 0 Axes>

### Interpretation of results:

Pearson correlation coefficient: -0.21

This is a weak negative correlation between **gender parity (GPI)** and **drop-out rates**. It suggests that as GPI increases (i.e. female enrolment rises relative to male enrolment), drop-out rates decrease slightly, but the effect is not strong.

P-value: 8.30e-64

The p-value is far below the usual threshold of 0.05, so the correlation is statistically significant. Even though the correlation is weak, a correlation of this size would be very unlikely to arise by chance if H0 were true.
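
One way to express how weak the relationship is: square the coefficient to get the share of variance in drop-out rates that a linear relationship with GPI would account for. A small sketch, using the coefficient printed by the correlation cell above:

```python
# Coefficient printed by the Pearson correlation cell
r = -0.20755224109799786

# r squared: share of variance accounted for by a linear fit
r_squared = r ** 2
print(f"Share of variance explained: {r_squared:.1%}")  # 4.3%
```

Roughly 4% of the variation is accounted for, which is consistent with a result that is statistically significant yet of limited practical size.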

### Conclusion:

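We reject the null hypothesis (H0): there is a statistically significant, though weak, negative association between GPI and drop-out rates. This is a correlation and does not by itself show that gender parity causes lower drop-out rates.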


## Data analysis

In [59]:
print("Descriptive Statistics for GPI")
print(hyp1_df['Value_GPI'].describe())

print("\nDescriptive Statistics for Drop-out Rates")
print(hyp1_df['Value_Dropout'].describe())

Descriptive Statistics for GPI
count    6461.000000
mean        0.976505
std         0.182833
min         0.190660
25%         0.962490
50%         1.001210
75%         1.025080
max         2.494290
Name: Value_GPI, dtype: float64

Descriptive Statistics for Drop-out Rates
count    6461.000000
mean       19.195747
std        17.929322
min         0.007320
25%         4.069590
50%        13.937510
75%        30.442551
max        92.459167
Name: Value_Dropout, dtype: float64


In [60]:
median_dropout = hyp1_df['Value_Dropout'].median()
high_dropout = hyp1_df[hyp1_df['Value_Dropout'] > median_dropout]
low_dropout = hyp1_df[hyp1_df['Value_Dropout'] <= median_dropout]

In [61]:
# Run an independent two-sample t-test comparing GPI between the high and low drop-out groups
t_stat, p_value_ttest = stats.ttest_ind(high_dropout['Value_GPI'], low_dropout['Value_GPI'])
print(f"T-test: t-statistic = {t_stat}, p-value = {p_value_ttest}")

T-test: t-statistic = -13.973484421264791, p-value = 9.757703722627095e-44
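
The standard t-test assumes roughly normal data with equal variances in both groups. If those assumptions are doubtful, two alternatives available in SciPy are Welch's t-test and the Mann-Whitney U test. The sketch below uses synthetic, skewed samples (not the EdStats data) purely to show the calls; the names `high_group` and `low_group` are placeholders.

```python
import numpy as np
from scipy import stats

# Synthetic, skewed samples standing in for the GPI values of the two groups
rng = np.random.default_rng(1)
high_group = rng.lognormal(mean=-0.10, sigma=0.25, size=300)
low_group = rng.lognormal(mean=0.00, sigma=0.10, size=300)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(high_group, low_group, equal_var=False)

# Mann-Whitney U test: rank-based, does not assume normality
mwu = stats.mannwhitneyu(high_group, low_group, alternative="two-sided")

print(f"Welch t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
print(f"Mann-Whitney U = {mwu.statistic:.0f}, p = {mwu.pvalue:.3g}")
```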


In [None]:
# do a violin plot of the groups

# tests - for each group
# normality -> Shapiro-Wilk test
# if normally distributed - test variances -> Levene's test
#                                          -> Bartlett's test


# Alternatives to ANOVA -> Kruskal-Wallis test
#                       -> Welch's ANOVA
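
The plan in the notes above can be sketched as a small decision helper. This is only an outline under stated assumptions: `choose_test` is a hypothetical name, the 0.05 threshold is the one used throughout this notebook, and the groups are synthetic. The helper only names the test to run; Welch's ANOVA is returned as a recommendation and is not executed here.

```python
import numpy as np
from scipy import stats

def choose_test(groups, alpha=0.05):
    """Hypothetical helper: pick a group-comparison test from assumption checks."""
    # Shapiro-Wilk per group: p < alpha suggests the group is not normal
    if any(stats.shapiro(g).pvalue < alpha for g in groups):
        return "kruskal-wallis"
    # Levene's test: p < alpha suggests the variances differ
    if stats.levene(*groups).pvalue < alpha:
        return "welch-anova"
    return "anova"

# Synthetic, clearly skewed groups for illustration
rng = np.random.default_rng(2)
skewed = [rng.exponential(scale=s, size=200) for s in (1.0, 1.5, 2.0)]

choice = choose_test(skewed)
print(choice)
if choice == "kruskal-wallis":
    result = stats.kruskal(*skewed)
    print(f"H = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```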

In [62]:
gpi_values = hyp1_df['Value_GPI'].dropna()  # Drop missing values
dropout_values = hyp1_df['Value_Dropout'].dropna()  # Drop missing values

shapiro_gpi_stat, shapiro_gpi_pvalue = shapiro(gpi_values)
print(f"Shapiro-Wilk Test for GPI: Statistic = {shapiro_gpi_stat}, p-value = {shapiro_gpi_pvalue}")


shapiro_dropout_stat, shapiro_dropout_pvalue = shapiro(dropout_values)
print(f"Shapiro-Wilk Test for Drop-out Rates: Statistic = {shapiro_dropout_stat}, p-value = {shapiro_dropout_pvalue}")

if shapiro_gpi_pvalue < 0.05:
    print("GPI data is not normally distributed (Reject H0)")
else:
    print("GPI data is normally distributed (Fail to reject H0)")

if shapiro_dropout_pvalue < 0.05:
    print("Drop-out Rate data is not normally distributed (Reject H0)")
else:
    print("Drop-out Rate data is normally distributed (Fail to reject H0)")

Shapiro-Wilk Test for GPI: Statistic = 0.760653422998363, p-value = 1.4212061742204283e-70
Shapiro-Wilk Test for Drop-out Rates: Statistic = 0.8849735339081437, p-value = 1.996512272431664e-56
GPI data is not normally distributed (Reject H0)
Drop-out Rate data is not normally distributed (Reject H0)
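
Since neither variable looks normally distributed, a rank-based correlation is a reasonable robustness check next to Pearson's r. The sketch below shows Spearman's rank correlation with `scipy.stats.spearmanr` on synthetic, skewed data; the arrays are made up and the result says nothing about the EdStats values.

```python
import numpy as np
from scipy import stats

# Synthetic, skewed data: drop-out decreases monotonically with GPI, plus noise
rng = np.random.default_rng(3)
gpi = rng.lognormal(mean=0.0, sigma=0.2, size=300)
dropout = 30 / gpi + rng.exponential(scale=5.0, size=300)

# Spearman's rho works on ranks, so it does not rely on normality
rho, p_value = stats.spearmanr(gpi, dropout)
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3g}")
```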




In [64]:
# Kruskal-Wallis test (non-parametric alternative to one-way ANOVA), grouped by country

country_groups_gpi = [group['Value_GPI'].dropna() for _, group in hyp1_df.groupby('Country Name')]
country_groups_dropout = [group['Value_Dropout'].dropna() for _, group in hyp1_df.groupby('Country Name')]

kruskal_gpi_stat, kruskal_gpi_pvalue = kruskal(*country_groups_gpi)
print(f"Kruskal-Wallis Test for GPI: Statistic = {kruskal_gpi_stat}, p-value = {kruskal_gpi_pvalue}")

kruskal_dropout_stat, kruskal_dropout_pvalue = kruskal(*country_groups_dropout)
print(f"Kruskal-Wallis Test for Drop-out Rates: Statistic = {kruskal_dropout_stat}, p-value = {kruskal_dropout_pvalue}")

if kruskal_gpi_pvalue < 0.05:
    print("There is a significant difference in GPI across countries (Reject H0)")
else:
    print("There is no significant difference in GPI across countries (Fail to reject H0)")

if kruskal_dropout_pvalue < 0.05:
    print("There is a significant difference in Drop-out Rates across countries (Reject H0)")
else:
    print("There is no significant difference in Drop-out Rates across countries (Fail to reject H0)")


Kruskal-Wallis Test for GPI: Statistic = 4161.903114806945, p-value = 0.0
Kruskal-Wallis Test for Drop-out Rates: Statistic = 4495.364696542772, p-value = 0.0
There is a significant difference in GPI across countries (Reject H0)
There is a significant difference in Drop-out Rates across countries (Reject H0)
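
With several thousand rows, a p-value of 0.0 says little about how large the differences between countries are. One commonly used effect size for the Kruskal-Wallis test is epsilon squared, H / (n - 1). A minimal sketch, assuming n is the 6461 rows reported by `describe()` and H is the GPI statistic printed above; the helper name `epsilon_squared` is hypothetical.

```python
def epsilon_squared(h_statistic, n_observations):
    """Epsilon-squared effect size for a Kruskal-Wallis H statistic."""
    return h_statistic / (n_observations - 1)

# H printed for GPI above; n assumed to be the row count from describe()
eps_gpi = epsilon_squared(4161.903114806945, 6461)
print(f"epsilon squared (GPI): {eps_gpi:.3f}")
```

The rows are repeated measurements per country and year, so this figure is best read as a rough indication of effect size.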


#### Interpretation of Kruskal-Wallis results



## Hypothesis 2 - Teacher training and student performance
### Hypothesis:
A higher percentage of trained teachers is positively correlated with student retention and negatively correlated with drop-out rates
### Indicators:
1. Percentage of newly recruited teachers trained (primary, secondary, by gender)
2. Cumulative drop-out rates

## Hypothesis 3 - School resources and Literacy
### Hypothesis:
Schools with better access to infrastructure (electricity, potable water, toilets) will have higher enrolment rates

### Indicators:
1. Percentage of schools with access to electricity, potable water, mixed-sex/single-sex toilets
2. Net enrolment rates (primary, secondary)

## Hypothesis 4 - Government Expenditure and Enrolment

### Hypothesis:
Higher government expenditure on education correlates with increased enrolment rates

### Indicators:
1. Expenditure as % of total in public institutions (primary, secondary)
2. Net enrolment rate (primary, secondary)
