### Question 2: Are women more likely to complete secondary education in some countries than others? In the coming years, what percentage of women overall and by country, do we expect to enroll in secondary education? What factors indicate whether or not a women completes secondary education?

In [1]:
#import packages

# general
import numpy as np
import pandas as pd
import time

# sklearn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SequentialFeatureSelector

# visualization
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
#load in the dataset
df = pd.read_csv('transformed_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Year,A woman can be head of household in the same way as a man (1=yes; 0=no),A woman can choose where to live in the same way as a man (1=yes; 0=no),A woman can get a job in the same way as a man (1=yes; 0=no),A woman can obtain a judgment of divorce in the same way as a man (1=yes; 0=no),A woman can open a bank account in the same way as a man (1=yes; 0=no),A woman can register a business in the same way as a man (1=yes; 0=no),A woman can sign a contract in the same way as a man (1=yes; 0=no),A woman can travel outside her home in the same way as a man (1=yes; 0=no),...,Country Name_Ukraine,Country Name_United Arab Emirates,Country Name_United Kingdom,Country Name_United States,Country Name_Uruguay,Country Name_Uzbekistan,"Country Name_Venezuela, RB",Country Name_Viet Nam,Country Name_West Bank and Gaza,Country Name_Zimbabwe
0,0,2000,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
1,1,2003,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
2,2,2011,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
3,3,2015,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
4,4,2016,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
#create X and y dataframes
X = df.drop(columns='School enrollment, secondary, female (% gross)')
y = df[['School enrollment, secondary, female (% gross)']]

In [4]:
#split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)
y_test

Unnamed: 0,"School enrollment, secondary, female (% gross)"
396,97.504059
227,99.455688
673,97.418770
702,97.418770
643,71.628517
...,...
35,97.418770
870,73.966942
503,97.572731
472,96.097511


In [5]:
#initialize sfs, use linear regression as the baseline model for this question
sfs = SequentialFeatureSelector(estimator = LinearRegression(),
                                n_features_to_select = "auto",
                                direction = 'forward',
                                scoring = 'neg_mean_squared_error',
                                cv = 10)

#fit the data to sfs
sfs = sfs.fit(X_train, y_train)

#retrieve the and print the names of the selected features
feature_names = np.array(df.columns.difference(['School enrollment, secondary, female (% gross)']))
selected_feature_names = feature_names[sfs.get_support()].tolist()
print("Selected features:", selected_feature_names)

# transform X_train and X_test to include only the selected features
X_train_selected = sfs.transform(X_train)
X_test_selected = sfs.transform(X_test)

# display the shape of transformed X_train_selected and X_test_selected
print("Transformed X_train shape:", X_train_selected.shape)
print("Transformed X_test shape:", X_test_selected.shape)

Selected features: ['Age dependency ratio (% of working-age population)', 'Country Name_Algeria', 'Country Name_Antigua and Barbuda', 'Country Name_Australia', 'Country Name_Austria', 'Country Name_Azerbaijan', 'Country Name_Bangladesh', 'Country Name_Belarus', 'Country Name_Belgium', 'Country Name_Belize', 'Country Name_Benin', 'Country Name_Brunei Darussalam', 'Country Name_Bulgaria', 'Country Name_Burkina Faso', 'Country Name_Cabo Verde', 'Country Name_Cambodia', 'Country Name_Chad', 'Country Name_Colombia', 'Country Name_Costa Rica', 'Country Name_Czechia', 'Country Name_Denmark', 'Country Name_Dominican Republic', 'Country Name_Ecuador', 'Country Name_El Salvador', 'Country Name_Estonia', 'Country Name_Eswatini', 'Country Name_Georgia', 'Country Name_Grenada', 'Country Name_Guatemala', 'Country Name_Guyana', 'Country Name_Honduras', 'Country Name_Hong Kong SAR, China', 'Country Name_Hungary', 'Country Name_India', 'Country Name_Indonesia', 'Country Name_Iraq', 'Country Name_Korea,

In [6]:
#create a linear regression model using the selected features

#initialize the model
lr_model = LinearRegression()

#fit the model
lr_model.fit(X_train_selected, y_train)

#making predictions on the training and test sets
y_pred_train = lr_model.predict(X_train_selected)
y_pred_test = lr_model.predict(X_test_selected)

#evaluate the model using MSE and R squared
mse_train = mean_squared_error(y_train,y_pred_train)
mse_test = mean_squared_error(y_test,y_pred_test)
r_sq_train = lr_model.score(X_train_selected, y_train)
r_sq_test = lr_model.score(X_test_selected, y_test)

#print the MSE and R_squared
print('MSE (Train): ', round(mse_train, 3))
print('MSE (Test): ', round(mse_test, 3))
print('R-Squared (Train): ', round(r_sq_train, 3))
print('R-Squared (Test): ', round(r_sq_test, 3))

MSE (Train):  78.115
MSE (Test):  139.234
R-Squared (Train):  0.876
R-Squared (Test):  0.752


> The initial model does fairly okay with an r-squared of 0.876 for the training data and 0.752 for the test data. Because the test r-squared is quite a bit lower than the train, there could be some concern of overfitting. Additionally, the MSE of the test data is much worse than it is for the training data. For a simple linear regression, this model does not perform too bad, but it will be interesting to see how more advanced models perform in comparison.

### References

1. Lab 3 Solutions