# <div align="center"> SPECIAL TOPICS III </div>
## <div align="center"> Data Science for Social Scientists  </div>
### <div align="center"> ECO 4199 </div>
#### <div align="center">Class 8 - Re-Sampling Methods and Model Selection </div>
<div align="center"> Jonathan Holmes, (he/him)</div>

# Review: Machine Learning (in a nutshell)

<div align="center">  $x \, \,$ --> $\, \, \hat{f}(x) \, \,$ --> $\, \, \hat{y}$ </div>

#### Our Goal: 
- Find the $\hat{f}$ that gives us the best predictions $\hat{y}$ of $y$. 

#### Content of the course: 
- Different ways you can design $\hat{f}$
- Different ways of figuring out if you have chosen the best $\hat{f}$




## Different ways to design $\hat{f}$

So far: 
1. Linear models (OLS)
    - Linear models with dummy variables
    - Linear models with interaction terms ($x*z$)
    - Linear models with polynomial terms ($x^2$, $x^3$)

2. Classification algorithms
    - Linear probability model
    - Logistic model

This class: 
- How to select which variables to use
    
    
Still coming: 
1. Lasso
2. Ridge Regression
3. (Maybe) Tree models
4. Deep learning

## Different ways of figuring out if we have chosen the "best" $\hat{f}$

So far: 
1. Statistics based on mean-squared errors $r^2$, $RSS$, $MSE$

2. For classification algorithms, True Positive rate and False Positive rate

This class: 
1. Adjusted $r^2$, AIC, BIC

2. Cross-validation

# Roadmap

- In the last lecture we saw the risk associated with overfitting. 
- Although we discussed the case of polynomial regressions, the risk of overfitting is common to all learning models
- This risk usually increases with the number of parameters
$$ Y = f(\mathbf{X}) + \varepsilon$$
- Regression and classification used tools that you were already familiar with (at least to some extent)
- But we only scratched the surface of what these functions may look like
- Today, we will keep using these tools with a different goal in mind
- This goal will require new tools...

## Dealing with overfitting
- You may think that the issue came from allowing for (undue) higher degrees in our model
- But the issue is broader than this.
- The data came from a model in which $Y$ was related to $X$ and $X^2$, not to $X^n$ for $n>2$
- In other words, we were using too many predictors

## RSS and MSE
- The training set MSE is generally an underestimate of the test MSE. 
    - Recall that MSE = RSS/n. 
- When we fit a model to the training data using least squares, we specifically estimate the regression coefficients such that the training RSS (but not the test RSS) is as small as possible. 
- In particular, the training error will decrease as more variables are included in the model, but the test error may not.
- Therefore, training set RSS and training set $R^2$ cannot be used to select from among a set of models with different numbers of variables.
- However, a number of techniques for adjusting the training error for the model size are available.
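
As a rough illustration of these points, the sketch below fits polynomials of increasing degree to synthetic data (not the course data) and reports the training RSS next to size-adjusted criteria. The data-generating process, the names `training_rss` and `summary`, and the AIC/BIC convention used (Gaussian likelihood, additive constants dropped) are assumptions made for this example; other texts scale these criteria differently.

```python
import numpy as np

# Synthetic data, assumed for illustration: the true relationship is quadratic
rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=1.0, size=n)

tss = float(np.sum((y - y.mean()) ** 2))  # total sum of squares


def training_rss(degree):
    """Least-squares fit of a polynomial of the given degree; returns the training RSS."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, ..., x^degree
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


summary = {}
for d in range(1, 6):
    rss = training_rss(d)
    k = d + 1  # number of estimated coefficients, intercept included
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    # One common convention (Gaussian likelihood, constants dropped)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    summary[d] = {"rss": rss, "r2": r2, "adj_r2": adj_r2, "aic": aic, "bic": bic}
    print(f"degree {d}: RSS={rss:8.2f}  R2={r2:.4f}  adj R2={adj_r2:.4f}  "
          f"AIC={aic:7.2f}  BIC={bic:7.2f}")
```

The training RSS can only fall as the degree grows, because each model nests the previous one. The adjusted criteria add a penalty for every extra coefficient, so they can get worse when a new term contributes little.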

## Auto.csv

- Let's use the <span style="color:orange;">Auto dataset</span> from [ISLR](https://www.statlearning.com/)
- Using this data we want to predict fuel consumption (miles per gallon _mpg_) based on _horsepower_
- This time we do not know what the true function is
    - But increasing marginal cost of power may apply here...
- So let's try a few polynomials in the relationship between these two variables

In [None]:
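import numpy as np
import pandas as pd

df=pd.read_csv("Auto.csv") # load data
# data cleaning: horsepower marks missing values with "?", so convert those to NaN
df['horsepower']=pd.to_numeric(df['horsepower'].replace("?",np.nan))
df.dropna(subset=['horsepower','mpg'],inplace=True) # drop rows missing either variable
df.reset_index(inplace=True) # renumber rows 0..n-1 (old labels are kept in an 'index' column)
df.info() # print column types and non-null counts
pd.concat([df.head(), df.tail()]) # show head and tail (DataFrame.append was removed in pandas 2.0)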

In [None]:
import statsmodels.formula.api as smf

# Regress mpg on a constant term and horsepower
results = smf.ols('mpg ~ horsepower', data=df).fit() # degree 1
# Inspect the results
print(results.summary())

In [None]:
models=['mpg ~ horsepower', 'mpg ~ horsepower + np.square(horsepower)','mpg ~ horsepower + np.square(horsepower)+np.power(horsepower,3)']
n=len(models)

R2=np.full(n, np.nan); degrees=np.full(n, np.nan)
beta_1=np.full(n, np.nan) ; beta_2=np.full(n, np.nan) ; beta_3=np.full(n, np.nan) 
p_1=np.full(n, np.nan) ; p_2=np.full(n, np.nan) ; p_3=np.full(n, np.nan)
for i,m in enumerate(models):
    results = smf.ols(m, data=df).fit()
    #print(results.summary()) # uncomment if you want details
    
    R2[i]=results.rsquared ; degrees[i]=i+1
    beta_1[i]=results.params['horsepower'] ; p_1[i]=results.pvalues['horsepower']
    if i>0:
        beta_2[i]=results.params['np.square(horsepower)'] ; p_2[i]=results.pvalues['np.square(horsepower)']
        if i>1:
            beta_3[i]=results.params['np.power(horsepower, 3)'] ; p_3[i]=results.pvalues['np.power(horsepower, 3)']
        
res=pd.DataFrame({'Degree':degrees, 
                  r'$R^2$': R2,
                  r'$\hat{\beta}_1$':beta_1, r'p-value $\hat{\beta}_1$':p_1,
                  r'$\hat{\beta}_2$':beta_2, r'p-value $\hat{\beta}_2$':p_2,
                  r'$\hat{\beta}_3$':beta_3, r'p-value $\hat{\beta}_3$':p_3})

res['Degree']=res['Degree'].astype(np.int8)
res

## Do we need higher-order polynomials?
- Is higher better?
    - Here we do not even need out of sample MSE since:
        - the p-value attached to the cubic term suggests it is not significant
        - the value of the parameter is very small
        - the $R^2$ barely changes when the cubic term is added
- We may still want to decide based on out of sample measures of the MSE

## The Validation Set Approach

- A simple way to learn about the out-of-sample performance of our model is to split the dataset in two:
    - a __training set__
    - a __validation set__, or hold-out set
- You fit the model on the training set, then compute the MSE on the validation set

- Here again we will use [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- We will split the data between train and test sets, 50/50

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import PolynomialFeatures

X_train, X_test, y_train, y_test = train_test_split(df['horsepower'], df['mpg'], test_size=.5, random_state=1706)

print(f"Shape of X_train: {X_train.shape}") ; print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}") ; print(f"Shape of y_test: {y_test.shape}")

polynomials=[1,2,3]
for n in polynomials: # n takes the values in the polynomial list (1,2,3)
    # preprocess the data
    polynomial_features =PolynomialFeatures(degree=n,include_bias=False) # initiate a class to get data of degree n
    X_train_new=polynomial_features.fit_transform(X_train.values.reshape(-1, 1)) # fit on the training data, get data of degree n
    X_test_new=polynomial_features.transform(X_test.values.reshape(-1, 1)) # reuse the fitted transformer on the test data

    # regression
    reg = LinearRegression() # initiate the regression class
    reg.fit(X_train_new,y_train) # fit the data

    # Out of Sample MSE:
    mse=mean_squared_error(y_test, reg.predict(X_test_new))
    print(f"Polynomial regression of degree {n} has an out of sample MSE score of {mse}")
    


## Sampling properties
- Remember when we talked about the sampling properties of OLS?
- The smaller the sample size, the less precise the estimate
    - What if our 50/50 split happens to train our model on a fairly unrepresentative sample?
    - Let's check this using different 50/50 splits for our models

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

polynomials=[1,2,3,4,5]
DFs=[] # empty list of dataframes
for p in polynomials:
    MSEs=[] # empty list of MSE scores
    for i in range(len(df)):
        # split the data (a different sample with each iteration, since the seed changes)
        X_train, X_test, y_train, y_test = train_test_split(df['horsepower'], df['mpg'], test_size=.5, random_state=i)

        # preprocess the data
        polynomial_features =PolynomialFeatures(degree=p,include_bias=False) # initiate a class to get data of degree p
        X_train_new=polynomial_features.fit_transform(X_train.values.reshape(-1, 1)) # fit on the training data, get data of degree p
        X_test_new=polynomial_features.transform(X_test.values.reshape(-1, 1)) # reuse the fitted transformer on the test data

        # regression
        reg = LinearRegression() # initiate the regression class
        reg.fit(X_train_new,y_train) # fit the data

        # Out of Sample MSE:
        mse=mean_squared_error(y_test, reg.predict(X_test_new))
        MSEs.append(mse)
    # store all MSEs for degree p in a dataframe and append to list of dataframes    
    DFs.append(pd.DataFrame({'Degree':p, 'MSE':MSEs}))

dd=pd.concat(DFs)    
# show the distribution of MSEs
fig, ax = plt.subplots(1,1, figsize=(8,8))

sns.kdeplot(data=dd, x='MSE', ax=ax, hue='Degree')
ax.set_xlabel("MSE Scores - Validation Set Approach")
plt.show()

##  Leave-one-out cross-validation (LOOCV) 
- Leave-one-out cross-validation is closely related to the validation set approach, but it attempts to address that method’s drawbacks.
- Like the validation set approach, _LOOCV_ involves splitting the set of observations into two parts. 
    - However, instead of creating two subsets of comparable size 
    - a single observation $(x_1, y_1)$ is used for the validation set
    - the remaining observations $\{(x_2, y_2), . . . , (x_n, y_n)\}$ make up the training set
- The statistical learning method is fit on the $n − 1$ training observations, and a prediction $\hat{y}_1$ is made for the excluded observation, using its value $x_1$

##  Leave-one-out cross-validation (LOOCV), continued
- Since $(x_1, y_1)$ was not used in the fitting process, the squared error on it is an out-of-sample error:
    - $MSE_{1} = (y_1-\hat{y}_1)^2$
- This provides an approximately __unbiased estimate for the test error__.
- But even though $MSE_1$ is unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation $(x_1, y_1)$.
- We can repeat the procedure by holding out $(x_2, y_2)$ instead, and so on up to $(x_n, y_n)$:
    - $MSE_{2} = (y_2-\hat{y}_2)^2$, ..., $MSE_{n} = (y_n-\hat{y}_n)^2$
- The LOOCV estimate for the test MSE is the average of these n test error estimates:
    - $MSE_{(LOOCV)} = \frac{1}{n} \sum_{i=1}^n MSE_{i}$


In [None]:
%%time
polynomials=[1,2,3,4,5]
DFs=[] # empty list of dataframes
for p in polynomials:
    MSEs=[] # empty list of squared errors, one per held-out observation
    for i in range(len(df)):
        # leave observation i out: train on all the others, test on observation i
        is_test = df.index==i
        X_train, y_train = df.loc[~is_test,'horsepower'], df.loc[~is_test,'mpg']
        X_test, y_test = df.loc[is_test,'horsepower'], df.loc[is_test,'mpg'].values[0]

        # preprocess the data
        polynomial_features =PolynomialFeatures(degree=p,include_bias=False) # initiate a class to get data of degree p
        X_train_new=polynomial_features.fit_transform(X_train.values.reshape(-1, 1)) # fit on the training data, get data of degree p
        X_test_new=polynomial_features.transform(X_test.values.reshape(-1, 1)) # reuse the fitted transformer on the held-out point

        # regression
        reg = LinearRegression() # initiate the regression class
        reg.fit(X_train_new,y_train) # fit the data

        # Out of Sample MSE (a single squared error, since one observation is held out):
        mse=np.square(y_test-reg.predict(X_test_new)[0])
        MSEs.append(mse)
    # store all MSEs for degree p in a dataframe and append to list of dataframes    
    DFs.append(pd.DataFrame({'Degree':p, 'MSE':MSEs}))

dd=pd.concat(DFs)    
display(dd.groupby('Degree').agg(Mean_MSE=('MSE','mean')))
# show the distribution of MSEs
fig, ax = plt.subplots(1,1, figsize=(8,8))
sns.kdeplot(data=dd, x='MSE', ax=ax, hue='Degree')
ax.set_xlabel("MSE Scores - LOOCV")
plt.show()
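
The hand-written loop above can also be expressed with scikit-learn's cross-validation helpers. The sketch below is one possible way to do it, not the method used in this lecture: it runs on synthetic stand-in data (the arrays `horsepower` and `mpg` are simulated here, not read from `Auto.csv`), and the pipeline and the `loocv_mse` dictionary are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Auto data (assumed for illustration only)
rng = np.random.default_rng(1706)
horsepower = rng.uniform(50, 200, size=80)
mpg = 60 - 0.45 * horsepower + 0.0012 * horsepower**2 + rng.normal(scale=4, size=80)
X = horsepower.reshape(-1, 1)

loocv_mse = {}
for p in [1, 2, 3]:
    # The pipeline refits the polynomial expansion and the regression on each training split
    model = make_pipeline(PolynomialFeatures(degree=p, include_bias=False),
                          LinearRegression())
    # scikit-learn reports the negative MSE, so flip the sign back
    scores = cross_val_score(model, X, mpg, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    loocv_mse[p] = -scores.mean()
    print(f"degree {p}: LOOCV estimate of test MSE = {loocv_mse[p]:.3f}")
```

Wrapping the preprocessing and the regression in a single pipeline means the transformer is always fitted on the training observations only, which is the same discipline the manual loop follows.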

## k-Fold Cross-Validation
- An alternative to LOOCV is k-fold cross validation. 
- This approach involves randomly dividing the set of observations into k groups, or __folds__, of approximately equal size. 
- The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. 
- The mean squared error, $MSE_1$, is then computed on the observations in the held-out fold. 
- This procedure is repeated k times; each time, a different group of observations is treated as a validation set.
- This process results in k estimates of the test error, $MSE_1$,$MSE_2$, . . . ,$MSE_k$. 
- The k-fold CV estimate is computed by averaging these values:
    - $MSE_{(k\text{-fold CV})} = \frac{1}{k} \sum_{i=1}^k MSE_{i}$

## k-Fold Cross-Validation, continued
- Note that LOOCV is a special case of k-fold CV in which k is set to equal n. 
- Standard practice is to use k = 5 or k = 10. 
- What is the advantage of using k = 5 or k = 10 rather than k = n? 
    - Computation time, which increases with the size of the dataset. 
    - LOOCV requires fitting the statistical learning method n times. 
        - This has the potential to be computationally expensive 
    - Performing 10-fold CV requires fitting the learning procedure only ten times

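To make the procedure and its cost concrete, here is a minimal NumPy sketch of k-fold CV for a polynomial least-squares fit. The helper `kfold_mse`, the synthetic data, and the choice of a quadratic model are assumptions made for this example. Setting `k` equal to the number of observations reproduces LOOCV.

```python
import numpy as np


def kfold_mse(x, y, degree, k, seed=0):
    """Return (k-fold CV estimate of the test MSE, number of model fits)."""
    rng = np.random.default_rng(seed)
    # Shuffle the observation indices, then cut them into k folds of near-equal size
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_mses = []
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False  # hold this fold out
        X_train = np.vander(x[train_mask], degree + 1, increasing=True)
        X_test = np.vander(x[test_idx], degree + 1, increasing=True)
        beta, *_ = np.linalg.lstsq(X_train, y[train_mask], rcond=None)
        fold_mses.append(np.mean((y[test_idx] - X_test @ beta) ** 2))  # MSE_i
    return float(np.mean(fold_mses)), len(folds)


# Synthetic data, assumed for illustration
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=50)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(size=50)

for k in (5, 10, len(x)):  # the last value is LOOCV
    cv, n_fits = kfold_mse(x, y, degree=2, k=k)
    print(f"k={k:2d}: {n_fits:2d} model fits, CV estimate of test MSE = {cv:.3f}")
```

With k = n every fold contains exactly one observation, so the shuffle no longer matters and the result does not depend on the seed.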
In [None]:
%%time
X = df['horsepower'].values
y = df['mpg'].values
polynomials=[1,2,3,4,5]
kDFs=[]
for k in [5,10]:
    DFs=[]
    # Split in k folds
    kf = KFold(n_splits=k, random_state=1706, shuffle=True)
    for p in polynomials:
        MSEs=[] # empty list of MSE scores, one per fold
        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]

            # preprocess the data
            polynomial_features =PolynomialFeatures(degree=p,include_bias=False) # initiate a class to get data of degree p
            X_train_new=polynomial_features.fit_transform(X_train.reshape(-1, 1)) # fit on the training folds, get data of degree p
            X_test_new=polynomial_features.transform(X_test.reshape(-1, 1)) # reuse the fitted transformer on the held-out fold

            # regression
            reg = LinearRegression() # initiate the regression class
            reg.fit(X_train_new,y_train) # fit the data

            # Out of Sample MSE:
            mse=mean_squared_error(y_test, reg.predict(X_test_new))
            MSEs.append(mse)
        # store all MSEs for degree p in a dataframe and append to list of dataframes    
        DFs.append(pd.DataFrame({'Degree':p, 'k-fold':k,'MSE':MSEs}))

    dd=pd.concat(DFs) 
    kDFs.append(dd)

dd=pd.concat(kDFs)
mse_change=dd.groupby(['Degree','k-fold']).agg(Mean_MSE=('MSE','mean'))

In [None]:
# show the distribution of MSEs
fig, axes = plt.subplots(2,2, figsize=(10,10))
sns.kdeplot(data=dd.loc[dd['k-fold']==5], x='MSE', ax=axes[0,0], hue='Degree')
axes[0,0].set_xlabel("MSE Scores - 5-fold Cross Validation",fontsize=12)
sns.kdeplot(data=dd.loc[dd['k-fold']==10], x='MSE', ax=axes[0,1], hue='Degree')
axes[0,1].set_xlabel("MSE Scores - 10-fold Cross Validation",fontsize=12)

gs = axes[1, 1].get_gridspec()
for ax in axes[1, :]:
    ax.remove()
axbottom = fig.add_subplot(gs[1, :])
sns.lineplot(data=mse_change.reset_index(),x='Degree', y='Mean_MSE',hue='k-fold',alpha=.7, ax=axbottom)
sns.scatterplot(data=mse_change.reset_index(),x='Degree', y='Mean_MSE',hue='k-fold',style='Degree', ax=axbottom,legend=False)
axbottom.set_xlabel("Degrees of the Polynomial Regression",fontsize=12)
axbottom.set_ylabel("Mean Squared Errors",fontsize=12)
axbottom.set_xticks(np.arange(1,len(polynomials)+1))
fig.tight_layout()

plt.show()

## Why do we use resampling methods?
- Cross-validation is one of several __resampling methods__ one can use to learn about predictive performance 
- Do we really care about the test MSE value we get from CV?
    - If you want to know the performance of a given statistical model, yes
        - e.g. given my model, how wrong should I expect my predictions to be "on average"?
- Sometimes you may only be interested in the location of the minimum point in the test MSE
    - If you want to compare the performance of different models 
        - Like we did today
    - The actual value of the test MSE is not important
    - What matters is which model performs best
        - within the same method using different levels of flexibility (like we did today)
        - or across methods

## Bias-Variance Trade-Off, 1/3
- LOOCV is a special case of k-fold CV (the case k = n) 
- We said that a reason to prefer k-fold with k < n has to do with computational cost
    - These costs are not to be neglected
    - But what if LOOCV performs better?
- It turns out there is another reason to use k-fold CV: __accuracy__
    - k-fold CV often gives a more accurate estimate of the test MSE than LOOCV

## Bias-Variance Trade-Off - Bias Reduction, 2/3 

- An issue with the _validation set_ approach is that it can overestimate the test MSE
    - You train your model on only half the data, which can generate a __bias__
- With LOOCV you use almost all of the data ($n-1$ observations) in your training
    - This means that LOOCV will give approximately __unbiased estimates__ of the test error
    - k-fold is somewhere in between the _validation set_ approach and _LOOCV_
- So if you care about lowest bias, LOOCV is the best method

## Bias-Variance Trade-Off - Variance Reduction, 3/3

- But we know that bias is not the only source of concern in an estimation procedure (wink, t-tests and p-values)
    - we must also consider the procedure’s variance. 
- LOOCV has __higher variance__ than k-fold CV with k < n. 
- Remember that, for both, you take averages of test squared errors
    - In LOOCV you average n errors from models fit on almost identical training sets
    - This implies that the individual LOOCV errors $MSE_{i}$ are highly correlated 
        - k-fold errors are less correlated: the overlap between training sets is smaller

- The mean of many highly correlated quantities has higher variance than the mean of many quantities that are not as highly correlated
- So the test error estimate resulting from LOOCV tends to have higher variance than the test error estimate resulting from k-fold CV.
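
The claim about means of correlated quantities can be made concrete under a simplifying assumption: suppose the n quantities each have variance $\sigma^2$ and every pair has the same correlation $\rho$. Then the variance of their mean is $\frac{\sigma^2}{n}\left(1 + (n-1)\rho\right)$. The sketch below computes this directly from the covariance matrix; the equal-correlation structure and the function name `variance_of_mean` are assumptions made for illustration, not a model of the actual cross-validation errors.

```python
import numpy as np


def variance_of_mean(n, rho, sigma2=1.0):
    """Variance of the average of n variables with common variance sigma2
    and common pairwise correlation rho, computed as w' Cov w with w = 1/n."""
    cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
    w = np.full(n, 1.0 / n)
    return float(w @ cov @ w)


n = 10
for rho in (0.0, 0.5, 0.9):
    print(f"rho={rho:.1f}: Var(mean of {n} terms) = {variance_of_mean(n, rho):.3f}")
```

With uncorrelated terms, averaging divides the variance by n. As $\rho$ approaches 1, averaging barely helps and the variance of the mean stays close to $\sigma^2$.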