### Ex 1.8.1
The goal of this exercise is to briefly explain and demonstrate sample splitting. We'll just pull down the data as in the example notebooks.

In [2]:
## Import libraries and data
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
import sys
from sklearn.base import BaseEstimator
import warnings

file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
df = pd.read_csv(file)

## Present Data
df.describe()

statistic,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
count,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0
mean,23.41041,2.970787,0.444466,0.023301,0.243883,0.278058,0.31767,0.137087,0.259612,0.296505,0.216117,0.227767,13.760583,3.018925,8.235867,25.118038,5310.737476,11.670874,6629.154951,13.316893
std,21.003016,0.570385,0.496955,0.150872,0.429465,0.448086,0.465616,0.343973,0.438464,0.456761,0.411635,0.419432,10.609465,4.000904,14.488962,53.530225,11874.35608,6.966684,5333.443992,5.701019
min,3.021978,1.105912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,370.0,2.0
25%,13.461538,2.599837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.25,0.125,0.0625,1740.0,5.0,4880.0,9.0
50%,19.230769,2.956512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,1.0,4040.0,13.0,7370.0,14.0
75%,27.777778,3.324236,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,21.0,4.41,9.261,19.4481,5610.0,17.0,8190.0,18.0
max,528.845673,6.270697,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,47.0,22.09,103.823,487.9681,100000.0,22.0,100000.0,22.0


### What is sample splitting and why do it?
To measure the degree of overfitting, and to gain information on how to design our models, we can reserve part of the dataset as a test set (often both a validation set and a test set). We fit the model on the larger portion of the data and then evaluate its performance on the held-back portion. When splitting smaller datasets, or datasets with particular sensitivities, it can be important to make sure that some or all subgroups are similarly distributed across the two samples.
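To illustrate the point about subgroups, here is a minimal sketch of a stratified split. It runs on a synthetic stand-in rather than the wage data: the frame `toy`, its values, and the choice of `ad` as the stratification column are assumptions made purely for illustration. The `stratify` and `random_state` arguments are standard options of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wage data (made-up values, not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lwage": rng.normal(3.0, 0.5, size=1000),
    "ad": (rng.random(1000) < 0.14).astype(int),  # a small subgroup, roughly 14%
})

# stratify=... keeps the share of the subgroup nearly identical in both samples
train_s, test_s = train_test_split(
    toy, test_size=0.2, stratify=toy["ad"], random_state=0
)

print("share in full data:", toy["ad"].mean())
print("share in train    :", train_s["ad"].mean())
print("share in test     :", test_s["ad"].mean())
```

Without `stratify`, the subgroup shares in the two samples can drift apart by chance, which matters most when the subgroup or the dataset is small.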

In [59]:
## Perform sample splitting
# No random_state is set, so the split (and the numbers below) change from run to run
train, test = train_test_split(df, test_size=0.2)

## Fit model to train data
# How models are set in statsmodels
'''
With sm.OLS we pass the outcome and the predictor columns directly.
The outcome to predict is log wages (lwage).
sm.OLS does not add an intercept on its own, so none is fitted explicitly here.
'''
# 1. Basic Model
# We'll be predicting solely using sex and educational status

# First let's fit and look at the in-sample results
predictors = ['sex', 'shs', 'hsg', 'scl', 'clg', 'ad']
mdl = sm.OLS(train['lwage'], train[predictors])
fit = mdl.fit()
isPred = fit.predict(exog=train[predictors])
MSE_train = np.mean((train['lwage'] - isPred)**2)
R2_train = 1. - MSE_train / np.var(train['lwage'])
print("MSE On Training Data: %s" % MSE_train)
print("R2 On Training Data: %s" % R2_train)
print()

# Now evaluate on the held-out test data
# Note: R2_test is normalized by the training-sample variance, matching the output shown
oosPred = fit.predict(exog=test[predictors])
MSE_test = np.mean((test['lwage'] - oosPred)**2)
R2_test = 1. - MSE_test / np.var(train['lwage'])
print("MSE On Test Data: %s" % MSE_test)
print("R2 On Test Data: %s" % R2_test)

MSE On Training Data: 0.2669935817537695
R2 On Training Data: 0.1671546586770526

MSE On Test Data: 0.3052463621765683
R2 On Test Data: 0.04783100393407558
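The cell above divides the test MSE by the variance of the training outcomes. A common alternative is to normalize by the variance of whichever sample is being evaluated. The helper below is a small sketch of that variant; the name `mse_and_r2` and the toy numbers are hypothetical and are not part of the original notebook.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2), normalizing by the variance of y_true itself."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    r2 = 1.0 - mse / np.var(y_true)
    return mse, r2

# Tiny hand-checkable example (made-up numbers)
y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y.copy()                     # predictions equal to the truth
constant = np.full_like(y, y.mean())   # always predict the sample mean

print(mse_and_r2(y, perfect))    # MSE is 0, so R^2 is 1
print(mse_and_r2(y, constant))   # MSE equals the variance of y, so R^2 is 0
```

With the notebook's variables this could be called as `mse_and_r2(test['lwage'], oosPred)`; the resulting R² would generally differ slightly from the value printed above because the denominator changes.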


### How about in an extreme case?
By giving the model more variables and letting it overfit more intensely, we can widen the gap between train and test performance.
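The next cell relies on the statsmodels formula interface, where `*` between two terms adds both terms plus their interaction. As a quick check of that behavior, the sketch below builds two models on toy data and lists the resulting design-matrix columns. The frame `toy` and its column names `y`, `a`, `b` are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with made-up column names (not from the wage dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["y", "a", "b"])

# exog_names lists the columns of the design matrix built from the formula
additive_cols = smf.ols("y ~ a + b", data=toy).exog_names
interacted_cols = smf.ols("y ~ a * b", data=toy).exog_names

print(additive_cols)    # intercept plus the two main effects
print(interacted_cols)  # the same columns plus the interaction a:b
```

Because each `*` multiplies out in this way, chaining several of them over groups of variables makes the number of fitted coefficients grow very quickly.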

In [99]:
## Perform sample splitting
# No random_state is set, so the split (and the numbers below) change from run to run
train, test = train_test_split(df, test_size=0.2)

## Fit model to train data
# How models are set in statsmodels
'''
When using statsmodels to fit linear predictors through the formula API, we give the model as a string.
The part of the string before the ~ separator is the outcome to predict; in our case we predict log wages.
The terms after it, separated by +, are the predictors; a * b adds a, b and their interaction a:b.
'''
# 2. Flexible Model
# We'll now be predicting with many more variables and their interactions
# First let's fit and look at the in-sample results
mdl = smf.ols("lwage ~ sex * ((shs + hsg + scl + clg + ad)*we*(exp1 + exp2 + exp3 + exp4)* (occ + occ2) + (ind + ind2))", train)
fit = mdl.fit()
isPred = fit.predict(train)
MSE_train = np.mean((train['lwage'] - isPred)**2)
R2_train = 1. - MSE_train / np.var(train['lwage'])
print("MSE On Training Data: %s" % MSE_train)
print("R2 On Training Data: %s" % R2_train)
print()

# Now evaluate on the held-out test data
# Note: R2_test is normalized by the training-sample variance, matching the output shown
oosPred = fit.predict(test)
MSE_test = np.mean((test['lwage'] - oosPred)**2)
R2_test = 1. - MSE_test / np.var(train['lwage'])
print("MSE On Test Data: %s" % MSE_test)
print("R2 On Test Data: %s" % R2_test)

MSE On Training Data: 0.2268287508712496
R2 On Training Data: 0.3116604702616701

MSE On Test Data: 2.3688362619162917
R2 On Test Data: -6.188522761297084
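The same pattern can be reproduced on synthetic data, where we know that only one regressor matters. The sketch below is an illustration under assumed settings (100 training rows, 100 test rows, up to 80 regressors, one true signal); it uses plain least squares via `np.linalg.lstsq` rather than statsmodels, and none of the numbers come from the wage data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_noise = 100, 100, 80

def simulate(n):
    # The outcome depends only on the first column; all other columns are pure noise
    X = rng.normal(size=(n, 1 + n_noise))
    y = X[:, 0] + rng.normal(size=n)
    return X, y

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

results = {}
for p in (1, 20, 80):  # number of columns handed to the regression
    beta, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
    mse_tr = np.mean((y_tr - X_tr[:, :p] @ beta) ** 2)
    mse_te = np.mean((y_te - X_te[:, :p] @ beta) ** 2)
    results[p] = (mse_tr, mse_te)
    print(f"p={p:3d}  train MSE={mse_tr:.3f}  test MSE={mse_te:.3f}")
```

Adding columns can only lower the training MSE, since the smaller models are nested in the larger ones, while the test MSE has no such guarantee. That asymmetry is what sample splitting is designed to reveal.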
