# <div align="center"> SPECIAL TOPICS III </div>
## <div align="center"> Data Science for Social Scientists  </div>
### <div align="center"> ECO 4199 </div>
#### <div align="center">Class 8 - Re-Sampling Methods and Model Selection </div>
<div align="center"> Jonathan Holmes, (he/him)</div>

# Review: Machine Learning (in a nutshell)

<div align="center">  $x \, \,$ --> $\, \, \hat{f}(x) \, \,$ --> $\, \, \hat{y}$ </div>

#### Our Goal: 
- Find the $\hat{f}$ that gives us the best predictions $\hat{y}$ of $y$. 

#### Content of the course: 
- Different ways you can design $\hat{f}$
- Different ways of figuring out if you have chosen the best $\hat{f}$




## Review: Different ways to design $\hat{f}$

So far: 
1. Linear models (OLS)
    - Linear models with dummy variables
    - Linear models with interaction terms ($x*z$)
    - Linear models with polynomial terms ($x^2$, $x^3$)

2. Classification algorithms
    - Linear probability model
    - Logistic model
    
Still coming: 
1. Lasso
2. Ridge Regression
3. (Maybe) Tree models
4. Deep learning

## Review: Different ways of figuring out if we have chosen the "best" $\hat{f}$

So far: 
1. Statistics based on squared errors: $R^2$, $RSS$, $MSE$

2. For classification algorithms: True Positive rate and False Positive rate, ROC Curve

Still coming: 
1. Adjusted $R^2$, AIC, BIC

2. Cross-Validation


## Review: Bias-Variance Trade-Off


$$f(x) = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \beta_N X^N $$






How do we determine how many polynomial terms $N$ to include in the model? 

Bias-Variance Trade-Off:
- Higher N -> Lower Bias
- Higher N -> Higher Variance

GOAL: Minimize out-of-sample criterion (usually mean-squared error)
- Balance bias + variance
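
A minimal sketch of this trade-off, using made-up data rather than the course dataset: the cubic relationship, the noise level, and the sample sizes below are all assumptions chosen for illustration. It fits polynomials of increasing degree N and compares in-sample and out-of-sample MSE.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def simulate(n):
    """Draw n points from a made-up cubic relationship plus noise (illustrative only)."""
    x = rng.uniform(-1, 1, size=n)
    y = 1 + 2 * x - 3 * x**3 + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = simulate(40)    # small training sample
x_test, y_test = simulate(2000)    # large sample standing in for "out of sample"

degrees = range(1, 9)
train_mse, test_mse = [], []
for N in degrees:
    coefs = np.polyfit(x_train, y_train, deg=N)  # least-squares polynomial of degree N
    train_mse.append(np.mean((y_train - np.polyval(coefs, x_train)) ** 2))
    test_mse.append(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))

for N, tr, te in zip(degrees, train_mse, test_mse):
    print(f"N={N}: training MSE={tr:.3f}, test MSE={te:.3f}")
```

The training MSE can only fall as N grows, because each model nests the previous one. The test MSE has no such guarantee, which is why the goal is stated in terms of an out-of-sample criterion.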

## New terminology

$$f(x) = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \beta_P X^P $$

Parameter vs. Hyperparameter: 
- $\beta_0$ ... $\beta_P$ are __parameters__ 
- The number of polynomial terms $P$ is called a __tuning parameter__ or a __hyperparameter__

Parameters: Values that are estimated from the data when fitting a model $f$

Tuning parameter: Any parameter that is used to control the learning process
- Choosing them is a __model selection__ task
- You set tuning parameters separately from (and before) training the model
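
To make the distinction concrete, here is a small sketch on synthetic data (the data-generating process is an assumption made up for this example). The degree `P` is fixed by hand before fitting, while the coefficients are whatever least squares returns.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + x - 2 * x**2 + rng.normal(scale=0.3, size=50)  # made-up quadratic plus noise

P = 2                                  # hyperparameter: chosen by us, before fitting
betas = np.polyfit(x, y, deg=P)[::-1]  # parameters: estimated from the data, ordered beta_0..beta_P

print(f"P = {P} (set by hand)")
for j, b in enumerate(betas):
    print(f"beta_{j} = {b:.3f} (estimated)")
```

Changing `P` changes how many parameters there are to estimate, so the hyperparameter has to be settled first.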




## Auto.csv

- Let's use the <span style="color:orange;">Auto dataset</span> from [ISLR](https://www.statlearning.com/)
- Using this data we want to predict fuel consumption (miles per gallon _mpg_) based on _horsepower_
- Let's try a few polynomials in the relationship between these two variables

In [None]:
# Let me set my current directory using the %cd magic
%cd "~/Dropbox/_teaching/ECO4199/2023/Data-Science-for-Social-Scientists/Class 08 - Resampling Methods/"

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

In [None]:
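df=pd.read_csv("Auto.csv") # load data
df['horsepower']=pd.to_numeric(df['horsepower'].replace("?",np.nan)) # data cleaning: "?" marks missing values
df.dropna(subset=['horsepower','mpg'],inplace=True)
df.reset_index(drop=True,inplace=True) # re-number rows 0..n-1 without keeping the old index as a column
df.info() # prints the info directly (it returns None, so display() is not needed)
pd.concat([df.head(), df.tail()]) # show head and tail (DataFrame.append was removed in pandas 2.0)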

In [None]:
# Regress mpg on a constant term and horsepower
results = smf.ols('mpg ~ horsepower', data=df).fit() # degree 1
# Inspect the results
print(results.summary())

In [None]:
models=['mpg ~ horsepower', 'mpg ~ horsepower + np.square(horsepower)',
        'mpg ~ horsepower + np.square(horsepower)+np.power(horsepower,3)',
        'mpg ~ horsepower + np.square(horsepower)+np.power(horsepower,3)+np.power(horsepower,4)']
n=len(models)

R2=np.full(n, np.nan); degrees=np.full(n, np.nan)
aic=[];bic=[]; adj_R2_list=[]

beta_1=np.full(n, np.nan) ; beta_2=np.full(n, np.nan) ; beta_3=np.full(n, np.nan) ; beta_4=np.full(n,np.nan)
p_1=np.full(n, np.nan) ; p_2=np.full(n, np.nan) ; p_3=np.full(n, np.nan); p_4 = np.full(n, np.nan)
for i,m in enumerate(models):
    results = smf.ols(m, data=df).fit()
    #print(results.summary()) # uncomment if you want details
    
    R2[i]=results.rsquared ; degrees[i]=i+1
    adj_R2_list.append(results.rsquared_adj); aic.append(results.aic); bic.append(results.bic)
    
    beta_1[i]=results.params['horsepower'] ; p_1[i]=results.pvalues['horsepower']
    if i>0:
        beta_2[i]=results.params['np.square(horsepower)'] ; p_2[i]=results.pvalues['np.square(horsepower)']
    if i>1:
        beta_3[i]=results.params['np.power(horsepower, 3)'] ; p_3[i]=results.pvalues['np.power(horsepower, 3)']
    if i>2:
        beta_4[i]=results.params['np.power(horsepower, 4)'] ; p_4[i] = results.pvalues['np.power(horsepower, 4)']

res=pd.DataFrame({'Degree':degrees, 
                  r'$R^2$': R2,
                  r'$\hat{\beta}_1$':beta_1, r'p-value $\hat{\beta}_1$':p_1,
                  r'$\hat{\beta}_2$':beta_2, r'p-value $\hat{\beta}_2$':p_2,
                  r'$\hat{\beta}_3$':beta_3, r'p-value $\hat{\beta}_3$':p_3,
                  r'$\hat{\beta}_4$':beta_4, r'p-value $\hat{\beta}_4$':p_4})

res['Degree']=res['Degree'].astype(np.int8)
res

## How to determine the tuning parameter $p$? 

Recall: In-sample $R^2$ never decreases when we add more polynomial terms


### Different approaches: 
1. Economist approach: Use p-values and economic models/intuitions
2. Corrected in-sample statistics 
3. The validation set approach
4. Cross-validation


## In-class exercise: 

Q1: Using just the estimates and p-values, what level of polynomial is best? 


## 2: Corrected Statistics

Problem: In-sample RSS never goes up, and in-sample $R^2$ never goes down, when you add variables to the model
- These statistics do NOT balance variance and bias

Potential Solution: Corrected statistics
- These statistics "penalize" in-sample estimates when you add more variables to the model
- Higher penalty -> Less variance 
- Lower penalty -> Less bias

Three examples: 
- Adjusted $R^2$
- AIC
- BIC


## Adjusted $R^2$
$$\large \text{Adjusted R}^2 = 1- \frac{\frac{\text{RSS}}{n-p-1}}{\frac{TSS}{n-1}}$$

Definitions: 
- RSS: Residual sum of squares
- TSS: Total sum of squares
- Recall that the usual $R^2= 1- \frac{\text{RSS}}{\text{TSS}}$
- n: Number of observations
- p: Number of predictors in the model (not counting the intercept)

Notes: 
- Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{\text{RSS}}{n-p-1}$
    - RSS never increases as the number of variables in the model increases
    - But $\frac{\text{RSS}}{n-p-1}$ may increase or decrease, due to the presence of p in the denominator.
- The intuition behind the adjusted $R^2$ is that once all of the correct variables have been included in the model, adding additional variables just adds noise
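
The formula can be checked by hand. The sketch below is an illustration on synthetic data where the true relationship is linear (an assumption of this example, not the Auto data). It computes $R^2$ and adjusted $R^2$ for polynomial degrees 1 to 5 directly from RSS and TSS.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
x = rng.uniform(-1, 1, size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)  # the true relationship is linear

def r2_and_adjusted(p):
    """Fit a degree-p polynomial by least squares; return (R^2, adjusted R^2)."""
    X = np.vander(x, p + 1, increasing=True)     # columns: 1, x, ..., x^p
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    return r2, adj_r2

results = {p: r2_and_adjusted(p) for p in range(1, 6)}
for p, (r2, adj) in results.items():
    print(f"p={p}: R^2={r2:.4f}, adjusted R^2={adj:.4f}")
```

Plain $R^2$ never falls as p grows, while the adjusted version can move in either direction, depending on whether the drop in RSS outweighs the lost degree of freedom.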

## Akaike information criterion (AIC)

- For a fitted least squares model containing p predictors
$$\large \text{AIC}=\frac{1}{n\hat{\sigma}^2}(\text{RSS} + 2p\hat{\sigma}^2) = \frac{\text{RSS}}{n\hat{\sigma}^2} + \frac{2p}{n}$$
- Recall: MSE $= \frac{1}{n} \text{RSS}$


- $\hat{\sigma}^2$ is an estimate of the variance of the error $\varepsilon$ associated with each response measurement 
- $2p\hat{\sigma}^2$  is a statistical penalty on the training RSS
    - In order to account for the fact that training error tends to underestimate the test error
    - The penalty increases with the number of predictors $p$ to adjust for the corresponding decrease in training RSS
- Smaller values of AIC indicate better fit
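
As a rough illustration of the formula above, the sketch below computes AIC by hand for polynomial degrees 1 to 5 on synthetic data. Two things are assumptions of this example: the made-up quadratic data, and the choice to estimate $\hat{\sigma}^2$ from the largest model considered. The helper `rss_for_degree` is a hypothetical name, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(-1, 1, size=n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)  # made-up quadratic plus noise

def rss_for_degree(p):
    """Residual sum of squares of a degree-p least-squares polynomial fit."""
    X = np.vander(x, p + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

max_p = 5
# Assumption: sigma^2 is estimated once, from the largest model (RSS / residual degrees of freedom)
sigma2_hat = rss_for_degree(max_p) / (n - max_p - 1)

aic = {}
for p in range(1, max_p + 1):
    rss = rss_for_degree(p)
    aic[p] = (rss + 2 * p * sigma2_hat) / (n * sigma2_hat)
    print(f"p={p}: RSS={rss:.2f}, AIC={aic[p]:.4f}")

best_p = min(aic, key=aic.get)  # smaller AIC indicates better fit
print("Degree with the smallest AIC:", best_p)
```

Note that `statsmodels` reports `results.aic` from the log-likelihood, so its numbers will not match this formula. In both cases, smaller is better.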


## Bayesian information criterion (BIC)
$$\large \text{BIC}=\frac{1}{n\hat{\sigma}^2}(\text{RSS} + \log(n)p\hat{\sigma}^2)$$
- $n$ represents the number of observations
- Notice that BIC replaces the $2p\hat{\sigma}^2$ used by AIC with a $\log(n)p\hat{\sigma}^2$ term
- Since $\log(n) > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables compared with AIC 
- As before, smaller values of BIC indicate better fit
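
The claim about $\log(n) > 2$ is easy to verify numerically. This small sketch compares the multipliers on $p\hat{\sigma}^2$ in the two formulas for a few sample sizes. The function name is hypothetical, and the sample sizes are arbitrary.

```python
import math

def penalty_per_predictor(n):
    """Return the (AIC, BIC) multipliers on p * sigma_hat^2 in the formulas above."""
    return 2.0, math.log(n)  # natural logarithm

for n in [5, 7, 8, 100, 1000]:
    aic_pen, bic_pen = penalty_per_predictor(n)
    heavier = "BIC" if bic_pen > aic_pen else "AIC"
    print(f"n={n:>4}: AIC multiplier={aic_pen:.2f}, BIC multiplier={bic_pen:.2f}, heavier penalty: {heavier}")
```

The switch happens between n = 7 and n = 8, so for any realistic dataset BIC penalizes extra variables more than AIC does.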

In [None]:

dd = pd.DataFrame(data={'Degree':degrees, f'$R^2$': R2,f'Adjusted $R^2$':adj_R2_list, "AIC":aic, "BIC":bic})
dd['Degree']=dd['Degree'].astype(np.int8)
dd.head()
#dd=pd.concat(DFs)
#dd.head()


In [None]:

fig, axes = plt.subplots(2,2, figsize=(10,10),sharex=True)
axes = axes.ravel() # access axes with a single position instead of 2
for i, statistics in enumerate([f'$R^2$',f"Adjusted $R^2$","AIC","BIC"]):
    sns.scatterplot(x='Degree',y=statistics,data=dd,ax=axes[i], color='black',marker='.')
    sns.lineplot(x='Degree',y=statistics,data=dd,ax=axes[i], color='darkorange')
    
fig.suptitle("In-Sample Statistics")
fig.tight_layout()
plt.show()

## In-class exercise: 

Q2: According to the corrected in-sample statistics, how many polynomial terms should we include? 

Q3: Does each method select a different number of polynomial terms? 

## Advanced In-class exercise: 

Q4: If we did this with other datasets and other circumstances, do you think AIC, BIC, and Adjusted $R^2$ will always select the same number of parameters? 


Q5: Let's say you select P using AIC (Model #1) and then with BIC (Model #2). Which one will have a higher bias? Which one will have a higher variance? 

# How to set the tuning parameter $p$? 

### Different approaches: 
1. Economist approach: Use p-values and economic models/intuitions
2. Corrected in-sample statistics 
3. __The validation set approach__
4. __Cross-validation__



## 3: The Validation Set Approach

- A simple way to learn about the out-of-sample performance of our model is to split the dataset in two:
    - __training set__ 
    - __validation set__ or hold-out set
- You fit the model on the training set, then compute the MSE on the validation set

- Here again we will use [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- We will split the data between train and test sets, 50/50

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['horsepower'], df['mpg'], test_size=.5, random_state=1706)
    
print(f"Shape of X_train: {X_train.shape}") ; print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}") ; print(f"Shape of y_test: {y_test.shape}")

polynomials=[1,2,3,4]
for n in polynomials: # n takes the values in the polynomials list (1,2,3,4)
    # preprocess the data
    polynomial_features =PolynomialFeatures(degree=n,include_bias=False) # initiate a class to get data of degree n
    X_train_new=polynomial_features.fit_transform(X_train.values.reshape(-1, 1)) # fit on the training data and get data of degree n
    X_test_new=polynomial_features.transform(X_test.values.reshape(-1, 1)) # only transform the test data (never fit on it)

    
    # regression
    reg = LinearRegression() # initiate the regression class
    reg.fit(X_train_new,y_train) # fit the data
    
    # Out of Sample MSE:
    mse=mean_squared_error(y_test, reg.predict(X_test_new))
    print(f"Polynomial regression of degree {n} has an out of sample MSE score of {mse}")
    


## In-Class Exercise #3

Q6: Using the Validation set approach, which polynomial degree is best? 

Q7: If I re-ran this exercise, would I get the same answer? 

Q8: What would happen to bias and variance if I changed the size of the leave-out sample from .5 to .2? 


## Sampling properties
- Remember when we talked about the sampling properties of OLS?
- The smaller the sample size, the less precise the estimate
    - What if our 50/50 split happens to train our model on a fairly unrepresentative sample?
    - Let's check this using different 50/50 splits for our models

In [None]:
polynomials=[1,2,3,4,5]
DFs=[] # empty list of dataframes
for p in polynomials:
    MSEs=[] # empty list of MSE scores
    for i in range(len(df)): # one random 50/50 split per seed i
        # split the data (a different sample with each iteration)
        X_train, X_test, y_train, y_test = train_test_split(df['horsepower'], df['mpg'], test_size=.5, random_state=i)

        # preprocess the data
        polynomial_features =PolynomialFeatures(degree=p,include_bias=False) # initiate a class to get data of degree p
        X_train_new=polynomial_features.fit_transform(X_train.values.reshape(-1, 1)) # get data of degree p
        X_test_new=polynomial_features.transform(X_test.values.reshape(-1, 1)) # only transform the test data

        # regression
        reg = LinearRegression() # initiate the regression class
        reg.fit(X_train_new,y_train) # fit the data

        # Out of Sample MSE:
        mse=mean_squared_error(y_test, reg.predict(X_test_new))
        MSEs.append(mse)
    # store all MSEs for degree p in a dataframe and append to list of dataframes    
    DFs.append(pd.DataFrame({'Degree':p, 'MSE':MSEs}))

dd=pd.concat(DFs, ignore_index=True)
# show the distribution of MSEs
fig, ax = plt.subplots(1,1, figsize=(8,8))

sns.kdeplot(data=dd, ax=ax, x='MSE', hue='Degree')
ax.set_xlabel("MSE Scores - Validation Set Approach")
plt.show()



In [None]:
# Average Mean-Squared Error Across All Simulations
dd.groupby('Degree').agg('mean')

##  Leave-one-out cross-validation (LOOCV) 
- Leave-one-out cross-validation is closely related to the validation set approach, but it attempts to address that method’s drawbacks.
- Like the validation set approach, _LOOCV_ involves splitting the set of observations into two parts, BUT 
    - a single observation $(x_1, y_1)$ is used for the validation set
    - the remaining observations $\{(x_2, y_2), . . . , (x_n, y_n)\}$ make up the training set
- The statistical learning method is fit on the $n − 1$ training observations, and a prediction $\hat{y}_1$ is made for the excluded observation, using its value $x_1$

##  Leave-one-out cross-validation (LOOCV), continued
- Since $(x_1, y_1)$ was not used in the fitting process
    - $MSE_{1} = (y_1-\hat{y}_1)^2$
- This provides an approximately __unbiased estimate for the test error__.
- But even though $MSE_1$ is unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation $(x_1, y_1)$.
- We can repeat the procedure by holding out $(x_2, y_2)$ instead, and so on up to $(x_n, y_n)$:
    - $MSE_{2} = (y_2-\hat{y}_2)^2$, ..., $MSE_{n} = (y_n-\hat{y}_n)^2$
- The LOOCV estimate for the test MSE is the average of these n test error estimates:
    - $MSE_{(LOOCV)} = \frac{1}{n} \sum_{i=1}^n MSE_{i}$
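
The averaging formula can be sketched with NumPy alone. The example below uses made-up data and a degree-2 polynomial (both assumptions for illustration) and refits the model n times. As a cross-check it also uses a standard identity that holds for least squares only and is not covered in these notes: the leave-one-out residual equals the ordinary residual divided by $1 - h_i$, where $h_i$ is the leverage of observation $i$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(-1, 1, size=n)
y = 1 + 2 * x - x**2 + rng.normal(scale=0.4, size=n)  # made-up quadratic plus noise
X = np.vander(x, 3, increasing=True)                  # design matrix: 1, x, x^2

# Brute force: refit n times, each time leaving out observation i
mse_i = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                          # training set = everything except i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    mse_i[i] = (y[i] - X[i] @ beta) ** 2              # MSE_i = (y_i - yhat_i)^2
loocv_brute = mse_i.mean()

# Cross-check (least squares only): a single fit on all data plus the leverages h_i
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
Q, _ = np.linalg.qr(X)                                # h_i is the squared norm of row i of Q
h = np.sum(Q**2, axis=1)
loocv_shortcut = np.mean(((y - X @ beta_full) / (1 - h)) ** 2)

print(f"LOOCV by refitting n times: {loocv_brute:.6f}")
print(f"LOOCV via leverages:        {loocv_shortcut:.6f}")
```

The two numbers agree up to floating-point error. For models other than least squares no such shortcut is available in general, and the n refits are unavoidable.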

In [None]:
%%time
polynomials=[1,2,3,4,5]
DFs=[] # empty list of dataframes
for p in polynomials:
    MSEs=[] # empty list of MSE scores
    for i in range(len(df)):
        # split the data: observation i is the validation set, all others are the training set
        is_test = df.index==i
        X_train, X_test = df.loc[~is_test,'horsepower'], df.loc[is_test,'horsepower']
        y_train, y_test = df.loc[~is_test,'mpg'], df.loc[is_test,'mpg'].values[0]

        # preprocess the data
        polynomial_features =PolynomialFeatures(degree=p,include_bias=False) # initiate a class to get data of degree p
        X_train_new=polynomial_features.fit_transform(X_train.values.reshape(-1, 1)) # get data of degree p
        X_test_new=polynomial_features.transform(X_test.values.reshape(-1, 1)) # only transform the held-out observation

        # regression
        reg = LinearRegression() # initiate the regression class
        reg.fit(X_train_new,y_train) # fit the data

        # Out of Sample MSE:
        mse=np.square(y_test-reg.predict(X_test_new)[0]) # Out of Sample MSE
        MSEs.append(mse)
    # store all MSEs for degree p in a dataframe and append to list of dataframes    
    DFs.append(pd.DataFrame({'Degree':p, 'MSE':MSEs}))

dd=pd.concat(DFs, ignore_index=True)    
display(dd.groupby('Degree').agg(Mean_MSE=('MSE','mean')))
# show the distribution of MSEs
fig, ax = plt.subplots(1,1, figsize=(8,8))
sns.kdeplot(data=dd, x='MSE', ax=ax, hue='Degree')
ax.set_xlabel("MSE Scores - LOOCV")
plt.show()

## k-Fold Cross-Validation
- An alternative to LOOCV is k-fold cross validation. 
- This approach involves randomly dividing the set of observations into k groups, or __folds__, of approximately equal size. 
- The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. 
- The mean squared error, $MSE_1$, is then computed on the observations in the held-out fold. 
- This procedure is repeated k times; each time, a different group of observations is treated as a validation set.
- This process results in k estimates of the test error, $MSE_1$,$MSE_2$, . . . ,$MSE_k$. 
- The k-fold CV estimate is computed by averaging these values:
    - $MSE_{(k\text{-fold CV})} = \frac{1}{k} \sum_{i=1}^k MSE_{i}$
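
The steps above can be sketched without any machine learning library. The data, the choice k = 5, and the polynomial degree are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, degree = 100, 5, 2
x = rng.uniform(-1, 1, size=n)
y = 1 + 2 * x - x**2 + rng.normal(scale=0.4, size=n)  # made-up quadratic plus noise

indices = rng.permutation(n)          # shuffle the observations...
folds = np.array_split(indices, k)    # ...then cut them into k folds of roughly equal size

fold_mse = []
for j in range(k):
    test_idx = folds[j]                                            # fold j is the validation set
    train_idx = np.concatenate([folds[m] for m in range(k) if m != j])
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)     # fit on the other k-1 folds
    fold_mse.append(np.mean((y[test_idx] - np.polyval(coefs, x[test_idx])) ** 2))

cv_estimate = np.mean(fold_mse)       # average of MSE_1, ..., MSE_k
print("MSE per fold:", np.round(fold_mse, 3))
print(f"{k}-fold CV estimate of the test MSE: {cv_estimate:.3f}")
```

Every observation lands in exactly one validation fold, so each data point is used for validation once and for training k − 1 times.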

## k-Fold Cross-Validation, continued
- Note that LOOCV is a special case of k-fold CV in which k is set to equal n. 
- Standard practice is to use k = 5 or k = 10. 
- What is the advantage of using k = 5 or k = 10 rather than k = n? 
    - Computation time, which matters more as the dataset grows. 
    - LOOCV requires fitting the statistical learning method n times. 
        - This has the potential to be computationally expensive 
    - Performing 10-fold CV requires fitting the learning procedure only ten times
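
To see the difference in the number of fits, here is a sketch using scikit-learn's `cross_val_score` with a `Pipeline`. The synthetic data and the degree-2 model are assumptions for illustration. `cross_val_score` returns one score per held-out fold, so the length of its output equals the number of model fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
n = 120
X = rng.uniform(-1, 1, size=(n, 1))
y = 1 + 2 * X[:, 0] - X[:, 0] ** 2 + rng.normal(scale=0.4, size=n)  # made-up data

# The pipeline re-creates the polynomial features inside each training fold
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())

schemes = {
    "5-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "10-fold": KFold(n_splits=10, shuffle=True, random_state=0),
    "LOOCV": LeaveOneOut(),
}
n_fits = {}
for name, cv in schemes.items():
    # scikit-learn maximizes scores, so the MSE is reported with a negative sign
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    n_fits[name] = len(scores)  # one model fit per held-out fold
    print(f"{name}: {len(scores)} fits, CV estimate of test MSE = {-scores.mean():.3f}")
```

With n = 120, LOOCV needs 120 fits against 5 or 10 for k-fold. The gap grows linearly with the sample size.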

In [None]:
%%time
X = df['horsepower'].values
y = df['mpg'].values
polynomials=[1,2,3,4,5]
kDFs=[]
for k in [5,10]:
    DFs=[]
    # Split in k folds
    kf = KFold(n_splits=k, random_state=1706, shuffle=True)
    for p in polynomials:
        MSEs=[] # empty list of MSE scores
        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]

            # preprocess the data
            polynomial_features =PolynomialFeatures(degree=p,include_bias=False) # initiate a class to get data of degree p
            X_train_new=polynomial_features.fit_transform(X_train.reshape(-1, 1)) # get data of degree p
            X_test_new=polynomial_features.transform(X_test.reshape(-1, 1)) # only transform the held-out fold

            # regression
            reg = LinearRegression() # initiate the regression class
            reg.fit(X_train_new,y_train) # fit the data

            # Out of Sample MSE:
            mse=mean_squared_error(y_test, reg.predict(X_test_new))
            MSEs.append(mse)
        # store all MSEs for degree p in a dataframe and append to list of dataframes
        DFs.append(pd.DataFrame({'Degree':p, 'k-fold':k,'MSE':MSEs}))

    dd=pd.concat(DFs) 
    kDFs.append(dd)

dd=pd.concat(kDFs, ignore_index=True)
mse_change=dd.groupby(['Degree','k-fold']).agg(Mean_MSE=('MSE','mean'))

In [None]:
# show the distribution of MSEs
fig, axes = plt.subplots(2,2, figsize=(10,10))
sns.kdeplot(data=dd.loc[dd['k-fold']==5], x='MSE', ax=axes[0,0], hue='Degree')
axes[0,0].set_xlabel("MSE Scores - 5-fold Cross Validation",fontsize=12)
sns.kdeplot(data=dd.loc[dd['k-fold']==10], x='MSE', ax=axes[0,1], hue='Degree')
axes[0,1].set_xlabel("MSE Scores - 10-fold Cross Validation",fontsize=12)

gs = axes[1, 1].get_gridspec()
for ax in axes[1, :]:
    ax.remove()
axbottom = fig.add_subplot(gs[1, :])
sns.lineplot(data=mse_change.reset_index(),x='Degree', y='Mean_MSE',hue='k-fold',alpha=.7, ax=axbottom)
sns.scatterplot(data=mse_change.reset_index(),x='Degree', y='Mean_MSE',hue='k-fold',style='Degree', ax=axbottom,legend=False)
axbottom.set_xlabel("Degrees of the Polynomial Regression",fontsize=12)
axbottom.set_ylabel("Mean Squared Errors",fontsize=12)
axbottom.set_xticks(np.arange(1,len(polynomials)+1))
fig.tight_layout()

plt.show()

## Why do we use resampling methods?
- Cross-validation is one of several __resampling methods__ one can use to learn about predictive performance 
- Do we really care about the test MSE value we get from CV?
    - If you want to know the performance of a given statistical model, yes
        - e.g. Given my model, how wrong should I expect my predictions to be "on average"?
- Sometimes you may only be interested in the location of the minimum point in the test MSE
    - If you want to compare the performance of different models 
        - Like we did today
    - The actual value of the test MSE is not important
    - What matters is which model performs best
        - within the same method using different levels of flexibility (like we did today)
        - or across methods

## Bias-Variance Trade-Off, 1/3
- LOOCV is a special case of k-fold CV (the case k = n) 
- We said that one reason to prefer k-fold CV with k < n is computational cost
    - This is not to be neglected
    - But what if LOOCV performs better?
- It turns out there is another reason to use k-fold CV: __accuracy__
    - k-fold CV often gives a more accurate estimate of the true test MSE than LOOCV

## Bias-Variance Trade-Off - Bias Reduction, 2/3 

- An issue with the _validation set_ approach is that it can overestimate the test MSE
    - You train your model on only half the data, so it fits worse than a model trained on all of it; this generates a __bias__
- With LOOCV you use almost all the data (n-1 observations) in your training
    - This means that LOOCV will give approximately __unbiased estimates__ of the test error
    - k-fold is somewhere in between the _validation set_ approach and _LOOCV_
- So if you only care about the lowest bias, LOOCV is the best method

## Bias-Variance Trade-Off - Variance Reduction, 3/3

- But we know that bias is not the only source of concern in an estimating procedure (wink: t-tests and p-values)
    - We must also consider the procedure’s variance. 
- LOOCV has __higher variance__ than does k-fold CV with k < n. 
- Remember that, for both, you take averages of test squared errors
    - In LOOCV you average n squared errors from models fit on almost identical training datasets
    - This implies that the individual LOOCV errors $MSE_i$ are highly correlated 
        - k-fold errors are less correlated: the overlap between training sets is smaller

- The mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated
- The test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV.
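
The statement about means of correlated quantities can be made concrete with a stylized calculation. It assumes n quantities with a common variance $\sigma^2$ and a single pairwise correlation $\rho$, which is a simplification: actual cross-validation errors do not literally follow this structure. In that setting the variance of the mean is $\frac{\sigma^2}{n}\left(1 + (n-1)\rho\right)$, and the sketch computes it directly from the covariance matrix. The function name is hypothetical.

```python
import numpy as np

def variance_of_mean(n, sigma2, rho):
    """Variance of the average of n quantities with common variance sigma2 and pairwise correlation rho."""
    cov = sigma2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))  # equicorrelated covariance matrix
    w = np.full(n, 1 / n)                                           # weights of a simple average
    return w @ cov @ w

n, sigma2 = 10, 1.0
for rho in [0.0, 0.5, 0.9]:
    print(f"rho={rho}: Var(mean) = {variance_of_mean(n, sigma2, rho):.3f}")
```

With no correlation, averaging 10 quantities cuts the variance to a tenth. With a correlation of 0.9, averaging barely helps.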

## In-Class Exercise: 


Q9: As you increase the number of folds in k-fold cross-validation
- What happens to bias?
- What happens to variance? 