In [None]:
import numpy as np
np.random.seed(seed=484)
import warnings
warnings.filterwarnings('ignore')

# Midterm Exam

The exam is open book, open note, and open Google. However, you may not get outside help from another person: all work, including coding, must be yours alone. Remember to turn in both the written portion and this coding portion. You can turn in the coding portion by submitting a shared link to your Colab notebook. To complete it, save a copy of this notebook in your own Google Drive, supply the Python code in the empty cells below, and execute the notebook. To get full credit, the completed notebook must run from top to bottom and produce the results asked for in the prompts below.

This portion of the exam will take you through the steps of the supervised machine learning process.

## 1. Figure out your question

The question you want to answer is: How does childbearing impact labor market outcomes for women? We can use machine learning to help answer this question by building a model that predicts how many children a woman gives birth to on the basis of her characteristics.

## 2. Obtain a labeled dataset

Import the Python library that is good for manipulating datasets:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

Accompanying the exam materials are a spreadsheet of female survey respondents, 'femalelaborsupply.csv' and a text file, 'femalelaborsupplydefs.txt' that explains each variable in the spreadsheet. Read in the data in the spreadsheet 'femalelaborsupply.csv', print out the first few rows of data with the variable names, and print out the number of observations and variables in the dataset:

In [None]:
femlaborsupply=pd.read_csv('/content/drive/MyDrive/Econ 484/femalelaborsupply.csv')
print(femlaborsupply.head())
print("Shape: {}. The number of observations is {}. The number of features is {}.".format(femlaborsupply.shape, *femlaborsupply.shape))
# or
femlaborsupply.head(5)

   asex  aage  aqtrbrth  ageqk  ...   nonmomil  qobm  const  msample
0     0     0         0     36  ...  10.422200     2      1        1
1     0     0         0     23  ...   8.122422     3      1        0
2     0     0         0     44  ...  10.103316     4      1        1
3     0     0         0     24  ...  10.330663     3      1        1
4     0     0         0     28  ...  10.702120     1      1        1

[5 rows x 49 columns]
Shape: (394840, 49). The number of observations is 394840. The number of features is 49.


Unnamed: 0,asex,aage,aqtrbrth,ageqk,asex2nd,aage2nd,ageq2nd,ageq3rd,kidcount,agem,marital,yobm,multi2nd,boy1st,boy2nd,boys2,girls2,samesex,morekids,blackm,hispm,othracem,blackd,hispd,othraced,educm,hsgrad,hsormore,moreths,agem1,aged1,ageqm,agefstm,agefstd,weeksm1,weeksd1,workedm,workedd,hourswd,hourswm,incomed,incomem,faminc1,famincl,nonmomi,nonmomil,qobm,const,msample
0,0,0,0,36,0,0,30,,2,27,0,52,0,1,0,0,0,0,0,0,0,0,0,0,0,12,1,1,0,27,35.0,109,18,26.0,0,16.0,0,1,48.0,0,33597.273,0.0,33597.273,10.4222,33597.273,10.4222,2,1,1
1,0,0,0,23,0,0,9,,2,25,2,54,0,1,0,0,0,0,0,0,0,0,0,0,1,12,1,1,0,25,,100,19,,52,,1,1,,38,,18273.307,21642.479,9.982413,3369.1726,8.122422,3,1,0
2,0,0,0,44,0,0,22,,2,30,0,49,0,0,1,0,0,0,0,0,0,0,0,0,0,7,0,0,0,30,28.0,119,18,17.0,30,32.0,1,1,40.0,40,20834.297,18903.059,43326.941,10.67653,24423.883,10.103316,4,1,1
3,0,0,0,24,0,0,12,,2,27,0,52,0,1,0,0,0,0,0,0,0,0,0,0,0,12,1,1,0,27,30.0,108,21,24.0,0,52.0,0,1,40.0,0,30658.43,0.0,30658.43,10.330663,30658.43,10.330663,3,1,1
4,0,0,0,28,0,0,14,,2,35,0,45,0,1,0,0,0,0,0,1,0,0,1,0,0,14,0,1,1,35,36.0,138,27,29.0,0,52.0,0,1,40.0,0,44450.0,0.0,44450.0,10.70212,44450.0,10.70212,1,1,1


Define a label (outcome) vector, $y_1$, to be how many children the woman has, another outcome vector, $y_2$ to be an indicator for having three or more children, and define a feature (regressor) matrix, $X$, to contain the mother's age, marital status, race, ethnicity, and education:
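As a quick illustration of how the two labels relate, the sketch below builds both from a small made-up table. The toy values and the rule `kidcount >= 3` are assumptions taken from the wording of the prompt; in the real data the indicator is already provided as the `morekids` column.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the exam dataset
toy = pd.DataFrame({"kidcount": [2, 3, 2, 5], "agem": [27, 25, 30, 35]})

y1_toy = toy["kidcount"]                     # count label (regression target)
y2_toy = (toy["kidcount"] >= 3).astype(int)  # indicator label (classification target)

print(y1_toy.tolist())
print(y2_toy.tolist())
```

The count label suits a regression method, while the 0/1 indicator suits a classification method.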

In [None]:
y1 = femlaborsupply['kidcount'] # number of children the woman has in the household
y2 = femlaborsupply['morekids'] # indicator for having three or more children
X = femlaborsupply.loc[:,['agem','marital','blackm','hispm','othracem','educm']]

"Pre-process" your features, $X$, by standardizing them to have zero mean and unit variance. Hint: you may import a useful package to do this.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled.shape

(394840, 6)

## 3. Divide into training and test sets

Import the Python library that is good for randomly splitting datasets into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split

Now make training and test feature matrices and training and test label vectors:

In [None]:
# Split the standardized features together with the indicator label y2,
# which the classification steps below use.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y2, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier().fit(X_train,y_train)

In [None]:
from sklearn.model_selection import GridSearchCV
# min_samples_leaf must be at least 1, so 0 is not a valid grid value
param_grid = {"max_depth": [80,90,110], "min_samples_leaf": [1,2,5,8,10], "n_estimators": [100,200,300]}
rf_gs = GridSearchCV(RandomForestClassifier(), param_grid, scoring="f1", n_jobs=-1)
rf_gs.fit(X_train,y_train)
print(rf_gs.best_params_, rf_gs.best_score_, sep='\n')

{'max_depth': 110, 'min_samples_leaf': 10, 'n_estimators': 100}
0.3555204987710009


In [None]:
# Refit with the best parameters found by the grid search above
rf = RandomForestClassifier(max_depth = 110, min_samples_leaf = 10, n_estimators = 100).fit(X_train,y_train)

In [None]:
from sklearn import metrics
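y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1] # predicted probability of the positive class
print("Accuracy: ", metrics.accuracy_score(y_test,y_pred))
print("Auc_score: ", metrics.roc_auc_score(y_test,y_prob)) # AUC from probabilities, not hard 0/1 predictions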
print(rf.score(X_test,y_test))
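The difference between computing AUC from hard predictions and from predicted probabilities can be seen in a minimal sketch. It uses synthetic data from `make_classification`, which is an assumption for illustration and not the exam dataset; the names `X_demo`, `clf`, and so on are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (illustration only, not the exam data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

# Hard 0/1 predictions give the ROC curve a single threshold
auc_from_labels = roc_auc_score(yte, clf.predict(Xte))
# Probabilities let the ROC curve sweep over every threshold
auc_from_probs = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

print("AUC from hard labels:  ", round(auc_from_labels, 3))
print("AUC from probabilities:", round(auc_from_probs, 3))
```

Passing probabilities is the usual choice because AUC measures how well the model ranks observations, and hard labels discard most of that ranking.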

## 4. Pick an appropriate method

Choose a method appropriate for classification and import its library:

In [None]:
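# Lasso is a regression method; applied to the 0/1 label it acts as an
# L1-penalized linear probability model, so .score() reports R^2, not accuracy.
from sklearn.linear_model import Lasso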


## 5 and 6. Choose regularization parameters via cross-validation on the training set and fit model on the whole training set using the cross-validated parameters

The outcome you should use in this part is $y_2$, the indicator for having at least three kids.

Search over a grid of values of the regularization parameters for the parameters that perform the best on the left-out folds:

In [None]:
# Cross-validate the Lasso penalty (alpha) on the training set
param_grid = {"alpha": [.0001, .0005, .001, .002, .0022, .003, .004, .006, .008, .01, .012, .014, .016, .018, .02]}
lasso_gs = GridSearchCV(Lasso(max_iter=100000), param_grid, cv=5, return_train_score=True)
lasso_gs.fit(X_train, y_train)
print(lasso_gs.best_params_, lasso_gs.best_score_, sep='\n')
# Refit on the whole training set with the cross-validated alpha
lasso = Lasso(alpha=lasso_gs.best_params_["alpha"], max_iter=100000).fit(X_train, y_train)
y_pred = lasso.predict(X_test)
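Steps 5 and 6 can also be sketched without copying the chosen penalty by hand. With its default `refit=True`, `GridSearchCV` refits the best setting on all the data passed to `fit`. The example below uses synthetic data from `make_regression` and a hypothetical grid, so the numbers it prints say nothing about the exam dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}  # hypothetical grid of penalties
search = GridSearchCV(Lasso(max_iter=100000), grid, cv=5)
search.fit(X_demo, y_demo)

# best_estimator_ is already refit on all of X_demo with the winning alpha
best_lasso = search.best_estimator_
print(search.best_params_)
print(best_lasso.alpha == search.best_params_["alpha"])
```

Reading the penalty from `best_params_` or using `best_estimator_` directly avoids a mismatch between the value the search selected and the value typed into the final model.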


## 7. Evaluate model by applying it to test set

Compute and print out the "score" of the model applied to the test set:

In [None]:
print('Lasso score on test set: {:.4f}'.format(lasso.score(X_test,y_test)))
print('Lasso score on training set: {:.4f}'.format(lasso.score(X_train,y_train)))

## 8. Repeat 4-7 for $y_1$
Use a method appropriate for regression to predict the number of children, not the probability of having at least three children.

Import the method's library, do cross validation to find tuning parameters, fit the model on the training data using the cross-validated tuning parameters, and compute (and report) the model's score on the test set:
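For a regression method such as ridge, the same cross-validate, fit, and score sequence might look like the sketch below. The data are synthetic (`make_regression`) and the `alphas` grid is a hypothetical choice; neither comes from the exam materials.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)

alphas = np.logspace(-3, 3, 13)                          # hypothetical penalty grid
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(Xtr, ytr)    # step 5: cross-validation
ridge = Ridge(alpha=ridge_cv.alpha_).fit(Xtr, ytr)       # step 6: fit on training set

print("alpha:", ridge_cv.alpha_)
print("test R^2: {:.4f}".format(ridge.score(Xte, yte)))  # step 7: score on test set
```

Passing an explicit `alphas` grid matters because `RidgeCV` otherwise searches only a short default list of penalties.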

In [None]:
# step 4 The method chosen is ridge regression.

from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge

In [None]:
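# Step 5: cross-validate the penalty. The label for this part is the number of
# children (kidcount), so re-split the standardized features with that label.
X_train, X_test, y1_train, y1_test = train_test_split(X_scaled, femlaborsupply['kidcount'], random_state=42)
ridgecv = RidgeCV(cv=5).fit(X_train, y1_train)

In [None]:
# Step 6: fit the model on the whole training set with the cross-validated alpha
ridgecvalpha = Ridge(alpha=ridgecv.alpha_, max_iter=100000).fit(X_train, y1_train)

In [None]:
# Step 7: score the model on the test set
print('Ridge score on test set: {:.4f}'.format(ridgecvalpha.score(X_test, y1_test)))
print('Ridge score on training set: {:.4f}'.format(ridgecvalpha.score(X_train, y1_train)))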

## 9. Apply the prediction models to new observations for which we have no labels

The spreadsheet 'newfemales.csv' contains information on two new females, with identical characteristics, except one is a high school graduate, and the other has a bachelor's degree.

Read in the new observations' information and apply the models to predict the probability of each woman having at least three kids and the number of kids each woman will have, then print out the predictions. Hint: don't forget to apply the same pre-processing steps to the new observations as you did to your training and test observations. This means standardizing the new observations using the means and variances of your labeled dataset, not the means and variances of these two new observations.
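The hint about pre-processing can be illustrated with a minimal sketch. The labeled features and the two new rows below are made-up numbers (age and years of education), assumed only for demonstration; `scaler_demo`, `right`, and `wrong` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical labeled features: age and years of education
labeled = np.column_stack([rng.normal(30, 4, 1000), rng.normal(12, 2, 1000)])
# Two new observations, identical except for education (12 vs. 16 years)
new_obs = np.array([[28.0, 12.0], [28.0, 16.0]])

scaler_demo = StandardScaler().fit(labeled)
right = scaler_demo.transform(new_obs)           # uses the labeled data's means/variances
wrong = StandardScaler().fit_transform(new_obs)  # uses the two rows' own means/variances

print(right)
print(wrong)  # age collapses to 0 and education to -1 / +1
```

Standardizing the two rows by their own statistics erases the scale the model was trained on, so the predictions would no longer be comparable with those for the labeled data.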

In [None]:
newfemz = pd.read_csv("/content/drive/MyDrive/Econ 484/newfemales.csv")


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled.shape

In [None]:
from sklearn.model_selection import train_test_split

# Predict the number of children, so the label is kidcount
X_train, X_test, y_train, y_test = train_test_split(X_scaled, femlaborsupply['kidcount'], random_state=50)

In [None]:
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge

In [None]:
ridgecv= RidgeCV(cv=5).fit(X_train, np.ravel(y_train))


In [None]:
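# Standardize the new observations with the scaler fit on the labeled data.
# Assumes newfemales.csv contains the same feature columns as X.
femzscaled = scaler.transform(newfemz[X.columns])
y_pred = ridgecv.predict(femzscaled)
print("Predicted number of kids:", y_pred)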