---
title: "Lab 6: Variable Selection and Regularization"
author: "James Compagno"
format: 
  html:
    code-fold: true
    code-line-numbers: true
    code-tools: true
    self-contained: true
execute:
  message: false
  echo: false
  eval: false
---

import pandas as pd
import numpy as np
import plotnine as p9
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet 
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

# Dataset: Baseball Players

In this lab, we will build a model that predicts a baseball player's salary in a given year.

This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.

**Format:** A data frame with 322 observations of major league players on the following 20 variables.

`AtBat` Number of times at bat in 1986 

`Hits` Number of hits in 1986 

`HmRun` Number of home runs in 1986

`Runs` Number of runs in 1986 

`RBI` Number of runs batted in in 1986 

`Walks` Number of walks in 1986 

`Years` Number of years in the major leagues 

`CAtBat` Number of times at bat during his career 

`CHits` Number of hits during his career 

`CHmRun` Number of home runs during his career 

`CRuns` Number of runs during his career 

`CRBI` Number of runs batted in during his career 

`CWalks` Number of walks during his career 

`League` A factor with levels A and N indicating player's league at the end of 1986 

`Division` A factor with levels E and W indicating player's division at the end of 1986 

`PutOuts` Number of put outs in 1986 

`Assists` Number of assists in 1986 

`Errors` Number of errors in 1986 

`Salary` 1987 annual salary on opening day in thousands of dollars 

`NewLeague` A factor with levels A and N indicating player's league at the beginning of 1987

You can download the dataset from [here](https://www.dropbox.com/s/boshaqfgdjiaxh4/Hitters.csv?dl=1).

A couple of notes about this lab:

1.  Although it isn't listed as a specific question, don't forget to clean your data at the beginning. How will you handle missing data? Are there any variables that need adjusting?

2.  There are a **lot** of variables in the dataset! You may want to use the `remainder = "passthrough"` trick in your column transformers, rather than typing out a ton of column names.

3.  Don't forget that in penalized regression, we **must** standardize our numeric variables.

4.  There is a lot of repetition in this lab. Think about ways to streamline your code - for example, you might consider writing simple functions to easily create pipelines.
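As a rough illustration of notes 1 through 4, here is a minimal sketch of one way to clean the data and wrap the preprocessing in a reusable function. The data frame `toy` is a tiny synthetic stand-in (not the real Hitters data), `make_pipeline_for` is a hypothetical helper name, and dropping rows with a missing `Salary` is only one of several defensible cleaning choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for Hitters.csv (NOT the real data), with one missing salary.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Hits": rng.integers(50, 200, size=8),
    "Years": rng.integers(1, 15, size=8),
    "League": ["A", "N"] * 4,
    "Salary": [500.0, np.nan, 750.0, 90.0, 320.0, 1100.0, 240.0, 610.0],
})

# One possible cleaning choice (an assumption): drop rows where the target is missing.
clean = toy.dropna(subset=["Salary"])
X = clean.drop(columns="Salary")
y = clean["Salary"]


def make_pipeline_for(model):
    """Hypothetical helper: dummify non-numeric columns, standardize numeric ones, then fit `model`."""
    preprocess = ColumnTransformer(
        [
            ("dummify", OneHotEncoder(handle_unknown="ignore"),
             make_column_selector(dtype_exclude=np.number)),
            ("standardize", StandardScaler(),
             make_column_selector(dtype_include=np.number)),
        ],
        # The two selectors already cover every column here, so this is only a
        # safety net; it matters when you list some columns by name instead.
        remainder="passthrough",
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])


# The same helper serves every model specification in the lab.
ols_pipeline = make_pipeline_for(LinearRegression()).fit(X, y)
ridge_pipeline = make_pipeline_for(Ridge(alpha=1.0)).fit(X, y)
```

Selecting columns by dtype rather than by name keeps the helper short, at the cost of relying on every column having been read in with a sensible type.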


# Part I: Different Model Specs

## A. Regression without regularization

1.  Create a pipeline that includes *all* the columns as predictors for `Salary`, and performs ordinary linear regression.

2.  Fit this pipeline to the full dataset, and interpret a few of the most important coefficients.

3.  Use cross-validation to estimate the MSE you would expect if you used this pipeline to predict 1989 salaries.

## B. Ridge regression

1.  Create a pipeline that includes *all* the columns as predictors for `Salary`, and performs ridge regression.

2.  Use cross-validation to **tune** the $\lambda$ hyperparameter.

3.  Fit the pipeline with your chosen $\lambda$ to the full dataset, and interpret a few of the most important coefficients.

4.  Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
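A minimal sketch of the tuning step, on synthetic data. Note that scikit-learn's `Ridge` calls the penalty strength `alpha`, which plays the role of $\lambda$ here; the grid of candidate values is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, numeric-only stand-in (NOT the real Hitters data).
rng = np.random.default_rng(2)
n = 150
X = pd.DataFrame(rng.normal(size=(n, 4)),
                 columns=["Hits", "Walks", "CRuns", "CRBI"])
y = 300 * X["Hits"] + 150 * X["CRuns"] + rng.normal(0, 100, n)

ridge = Pipeline([("standardize", StandardScaler()), ("model", Ridge())])

# "model__alpha" reaches into the pipeline step named "model".
lambdas = {"model__alpha": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(ridge, lambdas, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

best_lambda = grid.best_params_["model__alpha"]
cv_mse = -grid.best_score_  # undo scikit-learn's sign flip

# With the default refit=True, best_estimator_ is already refit on all rows.
best_coefs = pd.Series(
    grid.best_estimator_.named_steps["model"].coef_, index=X.columns
)
print(best_lambda, round(cv_mse, 1))
print(best_coefs)
```

If the chosen value sits at either end of the grid, it is usually worth widening the grid in that direction and tuning again.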

## C. Lasso Regression

1.  Create a pipeline that includes *all* the columns as predictors for `Salary`, and performs lasso regression.

2.  Use cross-validation to **tune** the $\lambda$ hyperparameter.

3.  Fit the pipeline with your chosen $\lambda$ to the full dataset, and interpret a few of the most important coefficients.

4.  Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
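The lasso version follows the same pattern; the sketch below, again on synthetic data with an assumed grid, also lists which coefficients the tuned model set exactly to zero. `max_iter` is raised only to make convergence warnings less likely.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real Hitters data): only two columns carry signal.
rng = np.random.default_rng(3)
n = 150
cols = ["Hits", "Walks", "CRuns", "CRBI", "Errors", "Assists"]
X = pd.DataFrame(rng.normal(size=(n, len(cols))), columns=cols)
y = 300 * X["Hits"] + 150 * X["CRuns"] + rng.normal(0, 100, n)

lasso = Pipeline([
    ("standardize", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])
lambdas = {"model__alpha": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(lasso, lambdas, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

coefs = pd.Series(grid.best_estimator_.named_steps["model"].coef_, index=cols)
dropped = coefs[coefs == 0].index.tolist()

print("Chosen lambda:", grid.best_params_["model__alpha"])
print("Cross-validated MSE:", round(-grid.best_score_, 1))
print("Variables removed by the lasso:", dropped)
```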

## D. Elastic Net

1.  Create a pipeline that includes *all* the columns as predictors for `Salary`, and performs elastic net regression.

2.  Use cross-validation to **tune** the $\lambda$ and $\alpha$ hyperparameters.

3.  Fit the pipeline with your chosen hyperparameters to the full dataset, and interpret a few of the most important coefficients.

4.  Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
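For the elastic net there are two hyperparameters to search over. In scikit-learn's `ElasticNet`, the overall penalty strength is `alpha` and the lasso/ridge mixing weight is `l1_ratio`; the sketch below assumes these correspond to this lab's $\lambda$ and $\alpha$ respectively. The data and both grids are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real Hitters data).
rng = np.random.default_rng(4)
n = 150
cols = ["Hits", "Walks", "CRuns", "CRBI"]
X = pd.DataFrame(rng.normal(size=(n, len(cols))), columns=cols)
y = 300 * X["Hits"] + 150 * X["CRuns"] + rng.normal(0, 100, n)

enet = Pipeline([
    ("standardize", StandardScaler()),
    ("model", ElasticNet(max_iter=10_000)),
])

# Assumed mapping: lab's lambda -> alpha, lab's alpha -> l1_ratio.
param_grid = {
    "model__alpha": [0.01, 0.1, 1, 10],
    "model__l1_ratio": [0.1, 0.5, 0.9],
}
grid = GridSearchCV(enet, param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)  # tries every (alpha, l1_ratio) pair

print("Best hyperparameters:", grid.best_params_)
print("Cross-validated MSE:", round(-grid.best_score_, 1))
```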

# Part II: Variable Selection

Based on the above results, decide on:

-   Which *numeric* variable is most important.

-   Which *five* numeric variables are most important.

-   Which *categorical* variable is most important.

For **each** of the four model specifications, compare the following possible feature sets:

1.  Using only the one best numeric variable.

2.  Using only the five best numeric variables.

3.  Using the five best numeric variables *and* their interactions with the one best categorical variable.

Report which combination of features and model performed best, based on the validation metric of MSE.

(Note: $\lambda$ and $\alpha$ must be re-tuned for each feature set.)
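One way to organize this comparison is a function that builds a pipeline from a list of columns, so each feature set is re-tuned in a loop. The sketch below does this for ridge only, on synthetic data. The "best" variables named here are arbitrary placeholders, not results, and `build` and `add_dummy_interactions` are hypothetical helpers. The interaction step assumes the single dummy column comes last in the preprocessed array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic stand-in (NOT the real Hitters data); placeholder "best" variables.
rng = np.random.default_rng(5)
n = 150
five = ["Hits", "Walks", "CRuns", "CRBI", "PutOuts"]
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=five)
X["Division"] = rng.choice(["E", "W"], n)
y = (100 * X["Hits"] + 40 * X["Walks"]
     + 60 * X["Hits"] * (X["Division"] == "W") + rng.normal(0, 30, n))


def add_dummy_interactions(Z):
    """Append (numeric column x dummy) products; the dummy must be the last column."""
    Z = np.asarray(Z, dtype=float)
    numeric, dummy = Z[:, :-1], Z[:, [-1]]
    return np.hstack([Z, numeric * dummy])


def build(numeric_cols, cat_col=None):
    """Hypothetical helper: ridge pipeline on the chosen columns only."""
    transformers = [("standardize", StandardScaler(), numeric_cols)]
    if cat_col is not None:
        transformers.append(("dummify", OneHotEncoder(drop="first"), [cat_col]))
    # sparse_threshold=0 forces a dense array; unlisted columns are dropped.
    steps = [("preprocess", ColumnTransformer(transformers, sparse_threshold=0))]
    if cat_col is not None:
        steps.append(("interact", FunctionTransformer(add_dummy_interactions)))
    steps.append(("model", Ridge()))
    return Pipeline(steps)


feature_sets = {
    "best one": build(["Hits"]),
    "best five": build(five),
    "five + interactions": build(five, "Division"),
}

fitted, results = {}, {}
for name, pipe in feature_sets.items():
    # Re-tune lambda separately for every feature set.
    grid = GridSearchCV(pipe, {"model__alpha": np.logspace(-3, 2, 6)},
                        cv=5, scoring="neg_mean_squared_error")
    fitted[name] = grid.fit(X, y)
    results[name] = -grid.best_score_

best = min(results, key=results.get)
print(pd.Series(results).round(1))
print("Lowest cross-validated MSE:", best)
```

Swapping `Ridge()` for another estimator, and the parameter grid to match, would extend the same loop to the other three model specifications.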

# Part III: Discussion

## A. Ridge

Compare your Ridge models with your ordinary regression models. How did your coefficients compare? Why does this make sense?
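To see what kind of difference to look for, here is a small synthetic illustration (not the Hitters data) with two nearly identical predictors, loosely mimicking how correlated career statistics can be. It prints the OLS coefficients alongside ridge coefficients for a few assumed penalty values and records the overall size of each coefficient vector.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic example: x2 is almost a copy of x1, and only x1 drives y.
rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))
y = 200 * x1 + rng.normal(0, 50, n)

# Treat OLS as the zero-penalty case and track the Euclidean norm of the coefficients.
ols_coef = LinearRegression().fit(X, y).coef_
norms = {0: np.linalg.norm(ols_coef)}
print("OLS       ", np.round(ols_coef, 1))

for lam in [1, 10, 100]:
    ridge_coef = Ridge(alpha=lam).fit(X, y).coef_
    norms[lam] = np.linalg.norm(ridge_coef)
    print(f"ridge {lam:>4}", np.round(ridge_coef, 1))

print({lam: round(size, 1) for lam, size in norms.items()})
```

The size of the ridge coefficient vector cannot grow as the penalty increases, so any comparison on the real data should show the same direction of change, even if individual coefficients move in less obvious ways.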

## B. LASSO

Compare your LASSO model in Part I with your three LASSO models in Part II. Did you get the same $\lambda$ results? Why does this make sense? Did you get the same MSEs? Why does this make sense?

## C. Elastic Net

Compare your MSEs for the Elastic Net models with those for the Ridge and LASSO models. Why does it make sense that Elastic Net always "wins"?

# Part IV: Final Model

Fit your final best pipeline on the full dataset, and summarize your results in a few short sentences and a plot.