<a href="https://colab.research.google.com/github/JedRoundy/Machine_Learning_For_Economists/blob/main/midterm/midterm2023fall.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
np.random.seed(seed=484)
import warnings
warnings.filterwarnings('ignore')

# Midterm Exam

The exam is open book, open note, and open Google. However, you may not receive help from another person. All work, including coding, must be yours alone. Remember to turn in both the written portion and this coding portion. You can turn in the coding portion by submitting a shared link to your Colab notebook. To complete it, save a copy of this notebook in your own Google Drive, supply the Python code in the empty cells below, and execute the notebook. To get full credit, the completed notebook must run from top to bottom and produce the results asked for in the prompts below.

This portion of the exam will take you through the steps of the supervised machine learning process.

## 1. Figure out your question

The question you want to answer is: How does childbearing impact labor market outcomes for women? We can use machine learning to help answer this question by building a model that predicts how many children a woman gives birth to on the basis of her characteristics.

## 2. Obtain a labeled dataset

Import the Python library that is good for manipulating datasets:

In [2]:
import pandas as pd

Accompanying the exam materials are a spreadsheet of female survey respondents, 'femalelaborsupply.csv', and a text file, 'femalelaborsupplydefs.txt', that explains each variable in the spreadsheet. Read in the data from 'femalelaborsupply.csv', print the first few rows with the variable names, and print the number of observations and variables in the dataset:

In [3]:
fls = pd.read_csv('https://www.dropbox.com/s/r5ahpsb6kt63fw3/femalelaborsupply.csv?dl=1')

print(fls.head())
print('\n')
print(f'Number of observations: {len(fls)}')
print(f'Number of variables in dataset: {len(fls.columns)}')


   asex  aage  aqtrbrth  ageqk  asex2nd  aage2nd  ageq2nd  ageq3rd  kidcount  \
0     0     0         0     36        0        0       30      NaN         2   
1     0     0         0     23        0        0        9      NaN         2   
2     0     0         0     44        0        0       22      NaN         2   
3     0     0         0     24        0        0       12      NaN         2   
4     0     0         0     28        0        0       14      NaN         2   

   agem  ...  hourswm    incomed    incomem    faminc1    famincl     nonmomi  \
0    27  ...        0  33597.273      0.000  33597.273  10.422200  33597.2730   
1    25  ...       38        NaN  18273.307  21642.479   9.982413   3369.1726   
2    30  ...       40  20834.297  18903.059  43326.941  10.676530  24423.8830   
3    27  ...        0  30658.430      0.000  30658.430  10.330663  30658.4300   
4    35  ...        0  44450.000      0.000  44450.000  10.702120  44450.0000   

    nonmomil  qobm  const  msamp

Define a label (outcome) vector, $y_1$, to be the number of children the woman has; define another outcome vector, $y_2$, to be an indicator for having three or more children; and define a feature (regressor) matrix, $X$, to contain the mother's age, marital status, race, ethnicity, and education:

In [4]:
y1 = fls['kidcount']
y2 = fls['morekids']
# Take an explicit copy so that adding columns below does not write into a view of fls.
X = fls[['agem1', 'marital', 'blackm', 'hispm', 'othracem', 'educm']].copy()

# One 0/1 indicator column per marital-status code (codes 0-4).
X['Married w/ Spouse'] = (X['marital'] == 0).astype(int)
X['Married w/o Spouse'] = (X['marital'] == 1).astype(int)
X['Separated'] = (X['marital'] == 2).astype(int)
X['Divorced'] = (X['marital'] == 3).astype(int)
X['Widowed'] = (X['marital'] == 4).astype(int)


len(X['marital'])

394840
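
As an aside, indicator columns like the ones above can also be generated in one step with `pd.get_dummies`. The sketch below uses a hypothetical toy frame (`toy`) rather than the exam data, and it assumes the same code-to-label mapping as the cell above; check 'femalelaborsupplydefs.txt' for the actual coding.

```python
import pandas as pd

# Hypothetical toy data; the codes 0-4 mirror the mapping assumed in the cell above.
toy = pd.DataFrame({'marital': [0, 1, 2, 3, 4, 0]})

labels = {0: 'Married w/ Spouse', 1: 'Married w/o Spouse', 2: 'Separated',
          3: 'Divorced', 4: 'Widowed'}

# Map codes to labels, then expand into one 0/1 column per category.
dummies = pd.get_dummies(toy['marital'].map(labels)).astype(int)
toy = toy.join(dummies)
print(toy)
```

Each row has exactly one indicator equal to 1, which is a quick sanity check that no marital code was left unmapped.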

"Pre-process" your features, $X$, by standardizing them to have zero mean and unit variance. Hint: you may import a useful package to do this.

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
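
To confirm that standardization did what the prompt asks, it can help to inspect the column means and standard deviations afterward. This is a minimal sketch on a hypothetical random matrix (`X_toy`), not the exam features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature matrix: 1,000 rows, 3 columns on very different scales.
X_toy = rng.normal(loc=[30.0, 0.5, 12.0], scale=[5.0, 0.5, 3.0], size=(1000, 3))

scaler_toy = StandardScaler()
X_std = scaler_toy.fit_transform(X_toy)

print(X_std.mean(axis=0).round(6))  # each column mean is approximately 0
print(X_std.std(axis=0).round(6))   # each column standard deviation is approximately 1
```

The fitted scaler stores the original means and scales in `mean_` and `scale_`, which is what allows the same transformation to be reapplied to other rows later.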


## 3. Divide into training and test sets

Import the Python library that is good for randomly splitting datasets into training and test sets:

In [6]:
from sklearn.model_selection import train_test_split



Now create training and test feature matrices and training and test label vectors for $y_1$ and $y_2$:

In [7]:
X_train, X_test, y1_train, y1_test, y2_train, y2_test = train_test_split(X, y1, y2, random_state = 0)
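
Passing several arrays to `train_test_split` at once splits them on the same randomly chosen rows, so features and both label vectors stay aligned. The sketch below illustrates this on hypothetical toy arrays (`X_toy`, `y1_toy`, `y2_toy`), which are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, and two label vectors.
X_toy = np.arange(200).reshape(100, 2)   # row i is [2i, 2i + 1]
y1_toy = np.arange(100)                  # stands in for a count outcome
y2_toy = (y1_toy >= 50).astype(int)      # stands in for a binary indicator

X_tr, X_te, y1_tr, y1_te, y2_tr, y2_te = train_test_split(
    X_toy, y1_toy, y2_toy, random_state=0
)

# With the default test_size, 25% of rows go to the test set.
print(X_tr.shape, X_te.shape)
# The same rows are selected in every array, so labels stay aligned with features.
print(np.array_equal(X_te[:, 0] // 2, y1_te))
```

Fixing `random_state` makes the split reproducible from run to run, which matters for a notebook that has to run top to bottom and give the same results.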

## 4. Pick an appropriate method

Choose a method appropriate for classification and import its library:

In [8]:
from sklearn.ensemble import RandomForestClassifier

## 5 and 6. Choose regularization parameters via cross-validation on the training set and fit model on the whole training set using the cross-validated parameters

The outcome you should use in this part is $y_2$, the indicator for having at least three kids.

Search over a grid of values of the regularization parameters for the parameters that perform the best on the left-out folds:

In [None]:
from sklearn.model_selection import GridSearchCV
params = {'n_estimators': [100, 200, 300], 'max_depth': [6,7,8], 'min_samples_split': [10, 20, 30]}

rfc = RandomForestClassifier()
model = GridSearchCV(estimator = rfc, param_grid = params, cv = 4, verbose = 2)

model.fit(X_train, y2_train)

model.score(X_test, y2_test)

Fitting 4 folds for each of 27 candidates, totalling 108 fits
[CV] END max_depth=6, min_samples_split=10, n_estimators=100; total time=  11.4s
[CV] END max_depth=6, min_samples_split=10, n_estimators=100; total time=   6.0s
[CV] END max_depth=6, min_samples_split=10, n_estimators=100; total time=   7.1s
[CV] END max_depth=6, min_samples_split=10, n_estimators=100; total time=   6.4s
[CV] END max_depth=6, min_samples_split=10, n_estimators=200; total time=  13.5s
[CV] END max_depth=6, min_samples_split=10, n_estimators=200; total time=  13.7s
[CV] END max_depth=6, min_samples_split=10, n_estimators=200; total time=  13.4s
[CV] END max_depth=6, min_samples_split=10, n_estimators=200; total time=  13.0s
[CV] END max_depth=6, min_samples_split=10, n_estimators=300; total time=  19.8s
[CV] END max_depth=6, min_samples_split=10, n_estimators=300; total time=  20.4s
[CV] END max_depth=6, min_samples_split=10, n_estimators=300; total time=  20.2s
[CV] END max_depth=6, min_samples_split=10, n_e

## 7. Evaluate model by applying it to test set

Compute and print out the "score" of the model applied to the test set:

In [None]:
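print(model.best_params_)  # fitted attributes end with an underscore
print(model.score(X_test, y2_test))  # mean accuracy on the test set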
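
Steps 5 through 7 can be sketched end to end on synthetic data. The block below is a minimal illustration, not the exam solution: the dataset comes from `make_classification`, the grid is deliberately tiny, and names such as `search` and `X_toy` are hypothetical. It shows where the cross-validated parameters, the cross-validated score, and the test-set score live on a fitted `GridSearchCV` object.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the exam dataset.
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# A deliberately small grid so the sketch runs quickly.
grid = {'n_estimators': [20, 40], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=grid, cv=4)
search.fit(X_tr, y_tr)

print(search.best_params_)       # note the trailing underscore
print(search.best_score_)        # mean accuracy across the left-out folds
print(search.score(X_te, y_te))  # accuracy of the refit model on the test set
```

By default `GridSearchCV` refits the best parameter combination on the whole training set, so calling `score` or `predict` on the search object uses that refit model.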

## 8. Repeat 4-7 for $y_1$
Use a method appropriate for regression-style prediction to predict the number of children, not the probability of having at least three children.

Import the method's library, do cross validation to find tuning parameters, fit the model on the training data using the cross-validated tuning parameters, and compute (and report) the model's score on the test set:

In [None]:
from sklearn.ensemble import RandomForestRegressor

params = {'n_estimators': [100, 500, 1000, 2000], 'max_depth': [3,4,5,6,7,8], 'min_samples_split': [10, 20, 30]}

rfr = RandomForestRegressor()

model = GridSearchCV(estimator = rfr, param_grid = params, cv = 4)

model.fit(X_train, y1_train)

model.score(X_test, y1_test)
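
When interpreting the score reported for a regressor, keep in mind that scikit-learn regressors return the coefficient of determination, $R^2$, from `score`, not accuracy. The sketch below checks this on synthetic data from `make_regression`; the data and the names `reg` and `X_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the exam dataset.
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# For regressors, .score() returns R^2 on the supplied data.
print(reg.score(X_te, y_te))
print(r2_score(y_te, reg.predict(X_te)))
```

The two printed values agree, so a test score from a regression model should be read as the share of outcome variance explained rather than as a fraction of correct predictions.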

## 9. Apply the prediction models to new observations for which we have no labels

The spreadsheet 'newfemales.csv' contains information on two new women with identical characteristics, except that one is a high school graduate and the other has a bachelor's degree.

Read in the new observations' information, apply the models to predict each woman's probability of having at least three kids and her predicted number of kids, and print out the predictions. Hint: don't forget to apply the same pre-processing steps to the new observations as you did to your training and test observations. This means standardizing the new observations using the means and variances of your labeled dataset, not the means and variances of these two new observations.
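
The pattern described in the hint can be sketched as follows. Everything here is hypothetical: the labeled data are random numbers, the two new rows are typed in by hand rather than read from 'newfemales.csv', and the column names `age` and `educ` are placeholders. The point is only the order of operations: fit the scaler on labeled data, then call `transform` (not `fit`) on the new rows before predicting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical labeled data: age and years of education, plus two outcomes.
labeled = pd.DataFrame({
    'age': rng.integers(21, 36, size=500),
    'educ': rng.integers(8, 18, size=500),
})
kids = rng.integers(2, 6, size=500)       # made-up child counts
three_plus = (kids >= 3).astype(int)      # indicator for three or more

# Fit the scaler on the labeled data only.
scaler_toy = StandardScaler().fit(labeled)
X_labeled = scaler_toy.transform(labeled)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_labeled, three_plus)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_labeled, kids)

# Two hypothetical new rows that differ only in education (12 vs. 16 years).
new_rows = pd.DataFrame({'age': [30, 30], 'educ': [12, 16]})

# Reuse the fitted scaler: transform only, never refit on the new rows.
X_new = scaler_toy.transform(new_rows)

print(clf.predict_proba(X_new)[:, 1])  # estimated P(three or more children)
print(reg.predict(X_new))              # predicted number of children
```

Refitting a scaler on just the two new rows would map their education values to -1 and 1 and their shared age to 0, regardless of where those values sit in the labeled data, which is why the fitted `scaler_toy` is reused here.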