# Chapter 6 - Linear Model Selection and Regularization

### [Lab 1: Subset Selection Methods](#lab1)
- [Lab 6.5.1 Best Subset Selection](#lab-6.5.1)
- [Lab 6.5.2 Forward and Backward Stepwise Selection](#lab-6.5.2)
- [Lab 6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation](#lab-6.5.3)

### [Lab 2: Ridge Regression and the Lasso](#lab2)
- [Lab 6.6.1 Ridge Regression](#lab-6.6.1)
- [Lab 6.6.2 The Lasso](#lab-6.6.2)

### [Lab 3: PCR and PLS Regression](#lab3)
- [Lab 6.7.1 Principal Components Regression](#lab-6.7.1)
- [Lab 6.7.2 Partial Least Squares](#lab-6.7.2)

### Imports and Configurations

In [1]:
# Use rpy2 for loading R datasets
from rpy2.robjects.packages import importr
from rpy2.robjects.packages import data as rdata
from rpy2.robjects import pandas2ri

# Math and data processing
import numpy as np
import scipy as sp
import pandas as pd

# StatsModels
import statsmodels.api as sm
import statsmodels.formula.api as smf

# scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold, cross_val_score
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error

# Visualization
from IPython.display import display
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
mpl.style.use('ggplot')

<a id='lab1'></a>

## Lab 1: Subset Selection Methods

<a id='lab-6.5.1'></a>

### Lab 6.5.1 Best Subset Selection

In [2]:
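# The Hitters dataset ships with the R ISLR package
islr = importr('ISLR')
hitters_rdf = rdata(islr).fetch('Hitters')['Hitters']
# Convert the R data frame to a pandas DataFrame.
# Note: ri2py is the rpy2 2.x name; rpy2 3.x renamed it to rpy2py.
hitters = pandas2ri.ri2py(hitters_rdf)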
display(hitters.head(5))

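| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -Andy Allanson | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
| -Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| -Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| -Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| -Andres Galarraga | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |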


In [3]:
print(hitters.info())

<class 'pandas.core.frame.DataFrame'>
Index: 322 entries, -Andy Allanson to -Willie Wilson
Data columns (total 20 columns):
AtBat        322 non-null int32
Hits         322 non-null int32
HmRun        322 non-null int32
Runs         322 non-null int32
RBI          322 non-null int32
Walks        322 non-null int32
Years        322 non-null int32
CAtBat       322 non-null int32
CHits        322 non-null int32
CHmRun       322 non-null int32
CRuns        322 non-null int32
CRBI         322 non-null int32
CWalks       322 non-null int32
League       322 non-null object
Division     322 non-null object
PutOuts      322 non-null int32
Assists      322 non-null int32
Errors       322 non-null int32
Salary       263 non-null float64
NewLeague    322 non-null object
dtypes: float64(1), int32(16), object(3)
memory usage: 32.7+ KB
None


In [4]:
print(hitters.Salary.isnull().sum())

59


In [5]:
# Drop the rows with a missing value (only Salary has any, 59 in total)
hitters.dropna(axis=0, inplace=True)

In [6]:
print(hitters.shape)
print(hitters.Salary.isnull().sum())

(263, 20)
0
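
The cells above only prepare the Hitters data. As a rough sketch of what best subset selection involves, the block below fits an ordinary least-squares model for every combination of predictors up to a given size and keeps the lowest-RSS model of each size. It runs on synthetic data rather than Hitters, and the helper name `best_subset`, the arrays `X` and `y`, and the size cap of 3 are assumptions made for illustration, not part of the original notebook.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, max_size):
    """Return {k: (rss, columns)} with the lowest-RSS OLS fit of each size k.

    Hypothetical helper for illustration; it enumerates every subset, so it
    is only practical for a small number of predictors.
    """
    n, p = X.shape
    best = {}
    for k in range(1, max_size + 1):
        for cols in combinations(range(p), k):
            # Design matrix: an intercept column plus the chosen predictors
            A = np.column_stack([np.ones(n), X[:, list(cols)]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if k not in best or rss < best[k][0]:
                best[k] = (rss, cols)
    return best


# Synthetic data (assumption): only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

best = best_subset(X, y, max_size=3)
for k, (rss, cols) in best.items():
    print(k, cols, round(rss, 3))
```

Training RSS can only fall as the subset grows, so it cannot be used to pick the model size; that choice needs a criterion such as Cp, BIC, adjusted R², or a validation/cross-validation error.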


<a id='lab-6.5.2'></a>

### Lab 6.5.2 Forward and Backward Stepwise Selection

<a id='lab-6.5.3'></a>

### Lab 6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation

<a id='lab2'></a>

## Lab 2: Ridge Regression and the Lasso

<a id='lab-6.6.1'></a>

### Lab 6.6.1 Ridge Regression
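
This section has no code yet. As a minimal, NumPy-only sketch of the idea, the block below standardizes the predictors, centers the response, and solves the penalized normal equations $(Z^\top Z + \lambda I)\beta = Z^\top (y - \bar{y})$ for several values of $\lambda$. The function name `ridge_fit`, the synthetic data, and the grid of penalties are assumptions for illustration; a fuller lab would more likely use the `Ridge` and `RidgeCV` estimators imported at the top of the notebook.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge fit on standardized predictors (hypothetical helper).

    Returns the unpenalized intercept and the coefficients on the
    standardized scale.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
    y_mean = y.mean()                         # the intercept is not penalized
    p = Z.shape[1]
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - y_mean))
    return y_mean, beta


# Synthetic data (assumption): five predictors, two of them irrelevant
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lams:
    _, beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(lam, np.round(beta, 3))
```

With `lam = 0` the fit reduces to ordinary least squares, and the coefficient norm shrinks toward zero as `lam` grows.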

<a id='lab-6.6.2'></a>

### Lab 6.6.2 The Lasso

<a id='lab3'></a>

## Lab 3: PCR and PLS Regression

<a id='lab-6.7.1'></a>

### Lab 6.7.1 Principal Components Regression

<a id='lab-6.7.2'></a>

### Lab 6.7.2 Partial Least Squares