# 4. Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Train/Test Split](#4.5_Train/Test_Split)
       * [4.5.1 Home and Away Variables](#4.5.1_Home_and_Away_Variables)
       * [4.5.2 X Variables](#4.5.2_X_Variables)     
       * [4.5.3 Train Test Split](#4.5.3_Train_Test_Split)     
  * [4.6 Linear Regression](#4.6_Linear_Regression)
       * [4.6.1 LR Metrics](#4.6.1_LR_Metrics) 
       * [4.6.2 Cross Validation of Linear Regression](#4.6.2_Cross_Validation_of_Linear_Regression) 
       * [4.6.3 Grid Search CV for LR](#4.6.3_Grid_Search_CV_for_LR) 
  * [4.7 Ridge Regression](#4.7_Ridge_Regression)
       * [4.7.1 RR Metrics](#4.7.1_RR_Metrics)  
       * [4.7.2 Cross Validation of Ridge Regression](#4.7.2_Cross_Validation_of_Ridge_Regression)
       * [4.7.3 Grid Search CV for RR](#4.7.3_Grid_Search_CV_for_RR)   
  * [4.8 Lasso Regression](#4.8_Lasso_Regression)
       * [4.8.1 Lasso Regression Metrics](#4.8.1_Lasso_Regression_Metrics) 
       * [4.8.2 Cross Validation of Lasso Regression](#4.8.2_Cross_Validation_of_Lasso_Regression)
       * [4.8.3 Grid Search CV for Lasso Regression](#4.8.3_Grid_Search_for_Lasso_Regression)
  * [4.9 Random Forest Model](#4.9_Random_Forest_Model)
       * [4.9.1 RF Metrics](#4.9.1_RF_Metrics)
       * [4.9.2 Cross Validation of RF](#4.9.2_Cross_Validation_of_RF)
       * [4.9.3 Grid Search CV for Random Forest](#4.9.3_Grid_Search_CV_for_Random_Forest)
  * [4.10 Model Metrics](#4.10_Model_Metrics)
  * [4.11 Summary](#4.11_Summary)


## 4.2 Introduction<a id='4.2_Introduction'></a>

In the last few steps of our data analysis we filled in the NaN values with the median or mean, or simply dropped the row altogether. We also found which variables had the greatest correlation, such as attempts and completions (.94). This will help us predict how many touchdowns a QB will throw in a game.

Now we will begin to create machine learning models with our QB data. Here we will compare four models to see which one best predicts how many touchdowns are thrown based on our variables. We will compare the models using RMSE, MAE and R2 score. Whichever one has the best scores is the one we will choose going forward to predict the y variable, touchdowns.

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV, Lasso
from sklearn.svm import SVC, SVR
from math import sqrt

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In [2]:
df = pd.read_csv('QB_stats_clean.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,qb,cmp,att,comp %,yds,td,int,rate,long,sack,game_points,ypa,ypc,td_per_cmp,td_per_att,loss_yds,home_away,year
0,0,Boomer EsiasonB. Esiason,25,38,65.8,237.0,0,0,82.9,20.0,2.0,13,6.2,9.5,0.0,0.0,11.0,away,1996
1,1,Jim HarbaughJ. Harbaugh,16,25,64.0,196.0,2,1,98.1,35.0,0.0,20,7.8,12.2,0.125,0.08,0.0,home,1996
2,2,Paul JustinP. Justin,5,8,62.5,53.0,0,0,81.8,30.0,1.0,20,6.6,10.6,0.0,0.0,11.0,home,1996
3,3,Jeff GeorgeJ. George,16,35,45.7,215.0,0,0,65.8,55.0,7.0,6,6.1,13.4,0.0,0.0,53.0,away,1996
4,4,Kerry CollinsK. Collins,17,31,54.8,198.0,2,0,95.9,30.0,4.0,29,6.4,11.6,0.118,0.065,12.0,home,1996


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13172 entries, 0 to 13171
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   13172 non-null  int64  
 1   qb           13172 non-null  object 
 2   cmp          13172 non-null  int64  
 3   att          13172 non-null  int64  
 4   comp %       13172 non-null  float64
 5   yds          13172 non-null  float64
 6   td           13172 non-null  int64  
 7   int          13172 non-null  int64  
 8   rate         13172 non-null  float64
 9   long         13172 non-null  float64
 10  sack         13172 non-null  float64
 11  game_points  13172 non-null  int64  
 12  ypa          13172 non-null  float64
 13  ypc          13172 non-null  float64
 14  td_per_cmp   13172 non-null  float64
 15  td_per_att   13172 non-null  float64
 16  loss_yds     13172 non-null  float64
 17  home_away    13172 non-null  object 
 18  year         13172 non-null  int64  
dtypes: f

Drop the "Unnamed: 0" column, which is just a leftover row index and carries no useful information. Let's also get rid of the qb column, since we want the model to be blind to who the QB is. This will reduce our own personal bias towards any particular QB.

In [4]:
df = df.drop(df.columns[0:2], axis=1)

In [5]:
df.shape

(13172, 17)
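
Dropping by position works here, but it would silently remove the wrong columns if the CSV layout ever changed. A minimal sketch of the same step done by column name, using a tiny made-up frame in place of `QB_stats_clean.csv` (the rows and the names `toy` and `toy_clean` are illustrative, not taken from the file):

```python
import pandas as pd

# Toy stand-in for the loaded data; values are illustrative only.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "qb": ["QB A", "QB B"],
    "cmp": [25, 16],
    "att": [38, 25],
    "td": [0, 2],
})

# Drop by name; pandas raises a KeyError if either column is missing,
# so a changed file layout fails loudly instead of dropping the wrong data.
toy_clean = toy.drop(columns=["Unnamed: 0", "qb"])

print(toy_clean.columns.tolist())
```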

In [6]:
#Let's make sure we dropped the columns
df.head()

Unnamed: 0,cmp,att,comp %,yds,td,int,rate,long,sack,game_points,ypa,ypc,td_per_cmp,td_per_att,loss_yds,home_away,year
0,25,38,65.8,237.0,0,0,82.9,20.0,2.0,13,6.2,9.5,0.0,0.0,11.0,away,1996
1,16,25,64.0,196.0,2,1,98.1,35.0,0.0,20,7.8,12.2,0.125,0.08,0.0,home,1996
2,5,8,62.5,53.0,0,0,81.8,30.0,1.0,20,6.6,10.6,0.0,0.0,11.0,home,1996
3,16,35,45.7,215.0,0,0,65.8,55.0,7.0,6,6.1,13.4,0.0,0.0,53.0,away,1996
4,17,31,54.8,198.0,2,0,95.9,30.0,4.0,29,6.4,11.6,0.118,0.065,12.0,home,1996


## 4.5 Train/Test Split<a id='4.5_Train/Test_Split'></a>

Now we will split the QB data into training and test sets. We set aside part of the data (the test set) to evaluate model performance on rows the model has not seen, which gives us an estimate of how it will perform on future data. Let's see what the size of the train/test split would be.

In [7]:
#Size of the 70% Train & 30% Test
len(df) * .7, len(df) * .3

(9220.4, 3951.6)
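
The split sizes above are fractional, but a split can only contain whole rows. As far as we know, `train_test_split` rounds a fractional test size up and gives the remainder to the training set, which agrees with the shapes reported in 4.5.3. A small sketch of that arithmetic (the variable names are ours):

```python
from math import ceil

n_samples = 13172   # rows in df, from df.info() above
test_size = 0.3

# A fractional test size is rounded up to a whole number of rows;
# the training set gets whatever is left.
n_test = ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```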

### 4.5.1 Home and Away Variables<a id='4.5.1_Home_and_Away_Variables'></a>

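Let's get numeric values for the "home_away" column using get_dummies. This creates two indicator columns, one for away and one for home, each holding 1 when the game matches and 0 otherwise.

In [8]:
#get dummy variables for the home_away column to make it numerical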
df= pd.get_dummies(df, columns=['home_away'])

Rename the columns to make it easier to read.

In [9]:
#inplace to make it permanent to the data frame
df.rename(columns={'home_away_away': 'away', 'home_away_home': 'home'}, inplace=True)

In [10]:
#Lets check and see what type the away and home column are
df['home'].dtype
df['away'].dtype

dtype('uint8')

In [11]:
#Lets change them to int
df['home']=df['home'].astype(int)
df['away']=df['away'].astype(int)
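
The encoding and the integer cast can also be done in one call. The sketch below assumes a pandas version whose `get_dummies` accepts the `dtype` argument, and uses a made-up four-row frame rather than the QB data:

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({"home_away": ["away", "home", "home", "away"]})

# dtype=int yields 0/1 integers directly, so no astype() pass is needed.
dummies = pd.get_dummies(toy, columns=["home_away"], dtype=int)
dummies = dummies.rename(columns={"home_away_away": "away",
                                  "home_away_home": "home"})

# The two indicators are mirror images: away is always 1 - home.
print((dummies["away"] + dummies["home"]).tolist())
```

Because the two columns always sum to 1, one of them is redundant; passing `drop_first=True` to `get_dummies` would keep a single indicator.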

### 4.5.2 X Variables<a id='4.5.2_X_Variables'></a>

Let's create our X variables, which are all of the columns except the td column.

In [12]:
#Get all the features minus td. We will predict td using all the other features.
features = ['cmp', 'att','comp %', 'yds', 'int', 'rate','long', 'sack', 'game_points', 'ypa', 'ypc', 'td_per_cmp', 'td_per_att', 'loss_yds', 'home', 'away', 'year']

In [13]:
df.head()

Unnamed: 0,cmp,att,comp %,yds,td,int,rate,long,sack,game_points,ypa,ypc,td_per_cmp,td_per_att,loss_yds,year,away,home
0,25,38,65.8,237.0,0,0,82.9,20.0,2.0,13,6.2,9.5,0.0,0.0,11.0,1996,1,0
1,16,25,64.0,196.0,2,1,98.1,35.0,0.0,20,7.8,12.2,0.125,0.08,0.0,1996,0,1
2,5,8,62.5,53.0,0,0,81.8,30.0,1.0,20,6.6,10.6,0.0,0.0,11.0,1996,0,1
3,16,35,45.7,215.0,0,0,65.8,55.0,7.0,6,6.1,13.4,0.0,0.0,53.0,1996,1,0
4,17,31,54.8,198.0,2,0,95.9,30.0,4.0,29,6.4,11.6,0.118,0.065,12.0,1996,0,1
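
One thing worth checking before modeling: `td_per_cmp` and `td_per_att` appear to be derived from `td` itself. In the five rows shown above, multiplying `td_per_cmp` by `cmp` (or `td_per_att` by `att`) and rounding recovers `td`. A small sketch of that check on those rows; whether it holds for the full file is an assumption that would need to be verified on `df`:

```python
# (cmp, att, td, td_per_cmp, td_per_att) copied from the df.head() output above
rows = [
    (25, 38, 0, 0.0,   0.0),
    (16, 25, 2, 0.125, 0.08),
    (5,   8, 0, 0.0,   0.0),
    (16, 35, 0, 0.0,   0.0),
    (17, 31, 2, 0.118, 0.065),
]

# Rebuild td from each ratio feature and compare with the true value.
from_cmp = [round(cmp_ * tpc) for cmp_, att, td, tpc, tpa in rows]
from_att = [round(att * tpa) for cmp_, att, td, tpc, tpa in rows]
actual = [td for _, _, td, _, _ in rows]

print(from_cmp == actual, from_att == actual)
```

If this holds across the dataset, a model given these columns can reconstruct the target almost exactly, which would inflate its scores. Flexible models can capture such a product far more easily than a linear model can, so it may be worth comparing results with those two columns removed from `features`.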


### 4.5.3 Train Test Split<a id='4.5.3_Train_Test_Split'></a>

Now we will do the train test split of the data with a test size of 30%.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(df[features], df['td'],test_size=0.3, 
                                                    random_state=47)

In [15]:
#Lets check how many rows and columns we have for the X variable
X_train.shape, X_test.shape

((9220, 17), (3952, 17))

In [16]:
#Also check the Y
y_train.shape, y_test.shape

((9220,), (3952,))

In [17]:
#Make sure we have numeric values for X train
X_train.dtypes

cmp              int64
att              int64
comp %         float64
yds            float64
int              int64
rate           float64
long           float64
sack           float64
game_points      int64
ypa            float64
ypc            float64
td_per_cmp     float64
td_per_att     float64
loss_yds       float64
home             int32
away             int32
year             int64
dtype: object

In [18]:
#Now check that we have numeric values for X test
X_test.dtypes

cmp              int64
att              int64
comp %         float64
yds            float64
int              int64
rate           float64
long           float64
sack           float64
game_points      int64
ypa            float64
ypc            float64
td_per_cmp     float64
td_per_att     float64
loss_yds       float64
home             int32
away             int32
year             int64
dtype: object

We now have all numeric features for our X Train/Test split!

## 4.6 Linear Regression<a id='4.6_Linear_Regression'></a>

Make a Pipeline for Linear Regression.

In [19]:
#Create the pipeline
lr_pipeline=make_pipeline(
    StandardScaler(), 
    LinearRegression())

In [20]:
#fit to the training data
lr_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In [21]:
#predict the y (touchdowns) using the test data
y_te_pred = lr_pipeline.predict(X_test)
print(y_te_pred)

[ 0.96449569  1.53131995  2.29045684 ...  0.91216184 -0.51769935
  1.10188415]


This is predicting how many TDs a QB threw utilizing the X variables from the unseen test data.

In [22]:
#predict the y (touchdowns) using the train data
y_tr_pred = lr_pipeline.predict(X_train)
print(y_tr_pred)

[0.15850306 0.94629965 0.45264682 ... 0.90742178 0.28048023 1.85117139]


This is predicting how many TDs a QB threw utilizing the X variables from the training data. This is data the model has already seen before.

### 4.6.1 LR Metrics<a id='4.6.1_LR_Metrics'></a>

First let's compute the test R2 score for Linear Regression to see how well the model fits the data.

#### R2 Score

In [23]:
r2_score(y_test, y_te_pred)

0.7773340771441586

The R2 score is decent but could be better. It measures the proportion of the variance in the dependent variable (touchdowns) that is explained by the independent variables (df[features]). This is the most important score for judging whether our model fits the data.

In [24]:
r2_score(y_train, y_tr_pred)

0.7908455260457911

The training R2 score is slightly higher than the test score. If the gap were substantial, it would mean that the model is overfitting the training data.

#### MAE

In [25]:
mean_absolute_error(y_test, y_te_pred)

0.37686733549094426

On average, we can expect a prediction to be off by about .377 touchdowns per game when applying this Linear Regression model to the test data.

#### RMSE

In [26]:
sqrt(mean_squared_error(y_test,y_te_pred))

0.5291923708849422

This RMSE means our predictions typically miss by about .529 touchdowns, over or under. Because the errors are squared before averaging, RMSE weighs large misses more heavily than MAE does.
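
To make the three metrics concrete, here is a minimal sketch that computes them by hand on four made-up predictions (the numbers are illustrative, not from the QB data). It uses the standard definitions: MAE is the mean absolute error, RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
from math import sqrt

y_true = [0, 2, 1, 3]           # made-up touchdown counts
y_pred = [0.5, 1.5, 1.0, 2.0]   # made-up model predictions

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n         # average absolute miss
rmse = sqrt(sum(e * e for e in errors) / n)   # penalizes large misses more

# R2 = 1 - (residual sum of squares / total sum of squares around the mean)
mean_y = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, round(rmse, 4), r2)
```

In this toy case the single miss of a full touchdown pulls the RMSE above the MAE, which is the same pattern we see in the model scores.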

### 4.6.2 Cross Validation of Linear Regression<a id='4.6.2_Cross_Validation_of_Linear_Regression'></a>

Let's run a 10-fold cross validation to see if our score is different.

In [27]:
cv_results_lr = cross_validate(lr_pipeline, X_train, y_train, cv=10)
cv_results_lr

{'fit_time': array([0.01499081, 0.01299167, 0.01399064, 0.01599789, 0.01799011,
        0.01998854, 0.01499104, 0.0129745 , 0.01700807, 0.01898932]),
 'score_time': array([0.00499678, 0.00298166, 0.00499654, 0.00499606, 0.00399375,
        0.00299835, 0.00199914, 0.00299788, 0.00498271, 0.00299811]),
 'test_score': array([0.81269061, 0.80562203, 0.73362744, 0.81383986, 0.75751313,
        0.79672061, 0.80324857, 0.68370744, 0.80905836, 0.80575595])}

Now we need to pull up just the CV score.

In [28]:
cv_scores_lr = cv_results_lr['test_score']
cv_scores_lr

array([0.81269061, 0.80562203, 0.73362744, 0.81383986, 0.75751313,
       0.79672061, 0.80324857, 0.68370744, 0.80905836, 0.80575595])

Let's compute the mean of these 10 different scores.

In [29]:
np.mean(cv_scores_lr), np.std(cv_scores_lr)

(0.7821784001734245, 0.041326757649802225)

The mean CV score of .782 is slightly higher than the R2 score of .777 for our test set. The standard deviation is .041, which is low.
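
As a quick sanity check, the mean and standard deviation can be reproduced from the ten fold scores printed above using only the standard library. The two-standard-deviation band is our own rough rule of thumb for how much a single score might vary, not something reported by scikit-learn:

```python
from statistics import fmean, pstdev

# The ten fold scores reported by cross_validate above.
cv_scores = [0.81269061, 0.80562203, 0.73362744, 0.81383986, 0.75751313,
             0.79672061, 0.80324857, 0.68370744, 0.80905836, 0.80575595]

mean_score = fmean(cv_scores)
std_score = pstdev(cv_scores)   # population std, the same default np.std uses

# Rough band of two standard deviations around the mean.
low, high = mean_score - 2 * std_score, mean_score + 2 * std_score
print(round(mean_score, 4), round(std_score, 4), round(low, 3), round(high, 3))
```

The test R2 of .777 sits comfortably inside that band, so the test score and the cross-validation scores tell a consistent story.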

### 4.6.3 Grid Search CV for LR<a id='4.6.3_Grid_Search_CV_for_LR'></a>

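Let's run a Grid Search over different hyperparameters to see if we get a different R2 score. Note that LinearRegression has no C parameter, so the grid below is actually searched over a support vector regressor (SVR) rather than over the linear regression pipeline.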

In [30]:
param_grid = {'C': [0.1, 1, 10, 100], 'max_iter':[10000]}             

I tried out different parameters in the param grid. Most of the time it still came up with the same R2 score in the end.

In [31]:
#Run the GridSearchCV()
lr_grid_cv = GridSearchCV(SVR(), param_grid, refit = True, verbose = 3,n_jobs=-1) 

In [32]:
#Fit the GridSearch() to the train data
lr_grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(estimator=SVR(), n_jobs=-1,
             param_grid={'C': [0.1, 1, 10, 100], 'max_iter': [10000]},
             verbose=3)

In [33]:
lr_grid_cv.best_params_

{'C': 100, 'max_iter': 10000}

These were the best parameters for the data.

In [34]:
lr_best_cv_results = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, cv=10)
lr_best_scores = lr_best_cv_results['test_score']
lr_best_scores

array([0.7062803 , 0.71636659, 0.74450861, 0.74671012, 0.75678309,
       0.73860765, 0.72582974, 0.73109968, 0.74288954, 0.72878372])

This is the array of test scores from the 10 cross-validation folds.

In [35]:
np.mean(lr_best_scores), np.std(lr_best_scores)

(0.7337859041363517, 0.014396308773709055)

Here we get the mean of the 10 CV R2 scores for the best estimator found by the Grid Search. The .733 is a little lower than the CV and test R2 scores we computed before. Keep in mind that this estimator is an SVR fitted on unscaled features, so the .733 describes that model and not the Linear Regression pipeline. The 10-fold CV mean of .782 remains the better estimate of how well Linear Regression explains the y variable (touchdowns) from the rest of our variables.
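
To grid-search the linear regression pipeline itself, the parameter names need to be prefixed with the pipeline step name. LinearRegression has very few hyperparameters; `fit_intercept` is used below purely as an example. This is a sketch on synthetic data from `make_regression`, standing in for `X_train` and `y_train`, and the names `X_demo`, `pipe` and `param_grid_lr` are ours:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (not the QB data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=47)

pipe = make_pipeline(StandardScaler(), LinearRegression())

# make_pipeline names each step after its lowercased class name, so
# parameters are addressed as "<step>__<parameter>".
param_grid_lr = {"linearregression__fit_intercept": [True, False]}

grid = GridSearchCV(pipe, param_grid_lr, cv=5, scoring="r2")
grid.fit(X_demo, y_demo)

print(grid.best_params_, round(grid.best_score_, 3))
```

Searching over the pipeline also means the scaler is refit inside every fold, so the cross-validation scores are not influenced by statistics from the held-out rows.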

## 4.7 Ridge Regression<a id='4.7_Ridge_Regression'></a>

In [36]:
#Create the pipeline
r_pipeline=make_pipeline(
    StandardScaler(), 
    Ridge(alpha=10))

After trying a few values for alpha, 10 gave the best score for the model.

In [37]:
#fit to the training data
r_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('ridge', Ridge(alpha=10))])

In [38]:
#predict the y (touchdowns) using the train and test data
y_tr_pred_r = r_pipeline.predict(X_train)
y_te_pred_r = r_pipeline.predict(X_test)

### 4.7.1 RR Metrics<a id='4.7.1_RR_Metrics'></a>

Let's check the most important score first, R2.

#### R2 Score

In [39]:
r2_score(y_train, y_tr_pred_r), r2_score(y_test, y_te_pred_r)

(0.7906719345075179, 0.7780171773899436)

This R2 score is OK but could be better for a model that is predicting the dependent variable, touchdowns. The training score is slightly higher once again.

#### MAE

In [40]:
mean_absolute_error(y_train, y_tr_pred_r), mean_absolute_error(y_test, y_te_pred_r)

(0.3728687781206778, 0.37961858717621005)

The mean absolute error for ridge regression is very close to that of our linear regression model. On average we can expect to be off by about .380 touchdowns per game on the test data with this model.

#### RMSE

In [41]:
sqrt(mean_squared_error(y_test,y_te_pred_r))

0.5283800123788575

Once again the RMSE is close to our last model. Our predictions typically miss by about .528 touchdowns, over or under.

In [42]:
sqrt(mean_squared_error(y_train,y_tr_pred_r))

0.5119318364171611

The training set has a lower RMSE (.512), which means the model predicts touchdowns slightly more accurately on data it has already seen.

### 4.7.2 Cross Validation of Ridge Regression<a id='4.7.2_Cross_Validation_of_Ridge_Regression'></a>

Let's do a 10-fold CV of Ridge Regression.

In [43]:
cv_results_rr = cross_validate(r_pipeline, X_train, y_train, cv=10)
cv_results_rr

{'fit_time': array([0.01998854, 0.01599646, 0.01699543, 0.01998806, 0.01999044,
        0.01199174, 0.01998878, 0.01399684, 0.02098536, 0.01900721]),
 'score_time': array([0.00300264, 0.00599337, 0.00299716, 0.00199699, 0.00399613,
        0.0049963 , 0.00299788, 0.00499463, 0.00499582, 0.00399804]),
 'test_score': array([0.81061265, 0.80404773, 0.73904116, 0.81154059, 0.75992797,
        0.79425532, 0.802458  , 0.69468228, 0.80862431, 0.80331953])}

In [44]:
cv_scores_rr = cv_results_rr['test_score']
cv_scores_rr

array([0.81061265, 0.80404773, 0.73904116, 0.81154059, 0.75992797,
       0.79425532, 0.802458  , 0.69468228, 0.80862431, 0.80331953])

In [45]:
np.mean(cv_scores_rr), np.std(cv_scores_rr)

(0.7828509535714215, 0.03722809495198848)

The CV mean score is just a little higher than the original R2 score.

### 4.7.3 Grid Search CV for RR<a id='4.7.3_Grid_Search_CV_for_RR'></a>

Here we will run a Grid Search on Ridge Regression. Let's try out some different alpha values to see what we get.

In [46]:
param_grid = {'alpha': [0.1, 1, 10, 100]}          

In [47]:
rr_grid_cv = GridSearchCV(Ridge(), param_grid, refit = True, verbose = 3,n_jobs=-1) 

In [48]:
rr_grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(estimator=Ridge(), n_jobs=-1,
             param_grid={'alpha': [0.1, 1, 10, 100]}, verbose=3)

In [49]:
rr_grid_cv.best_params_

{'alpha': 0.1}

The best parameter is an alpha of .1.

In [50]:
rr_best_cv_results = cross_validate(rr_grid_cv.best_estimator_, X_train, y_train, cv=10)
rr_best_scores = rr_best_cv_results['test_score']
rr_best_scores

array([0.81067923, 0.80402985, 0.73929009, 0.8116448 , 0.76011151,
       0.7943459 , 0.80239554, 0.694259  , 0.8081836 , 0.80356723])

These are the 10 CV scores for the best estimator found by the Grid Search.

In [51]:
np.mean(rr_best_scores), np.std(rr_best_scores)

(0.7828506756018052, 0.037283026133781955)

The mean of those scores is about .783 and the standard deviation is .0373. This score is essentially identical to the earlier CV mean (it is lower only in the seventh decimal place) and slightly higher than our original test R2 score of .778.

## 4.8 Lasso Regression<a id='4.8_Lasso_Regression'></a>

In [52]:
#Create the pipeline
l_pipeline=make_pipeline(
    StandardScaler(), 
    Lasso(alpha=.001, random_state=42, max_iter=1000))

This was the best model for Lasso Regression after changing the parameters a few times. In particular, I tried alpha values of .002, .003, 1, 10, 20, 35 and 100. The smallest penalty, an alpha of .001, gave the best model for our data.

In [53]:
#fit to the training data
l_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('lasso', Lasso(alpha=0.001, random_state=42))])

In [54]:
#predict the y (touchdowns) using the train and test data
y_tr_pred_l = l_pipeline.predict(X_train)
y_te_pred_l = l_pipeline.predict(X_test)

Let's look at the predictions for the y test.

In [55]:
y_te_pred_l

array([ 0.96656131,  1.5360834 ,  2.28769757, ...,  0.9185723 ,
       -0.53319295,  1.1021776 ])

The Lasso Regression model is working.

### 4.8.1 Lasso Regression Metrics<a id='4.8.1_Lasso_Regression_Metrics'></a>

#### R2 Score

In [56]:
r2_score(y_train, y_tr_pred_l), r2_score(y_test, y_te_pred_l)

(0.7906819550245227, 0.7779645798606276)

The Lasso Regression is comparable to the other models so far, with an R2 score of .778 on the test data. This is slightly higher than Linear Regression (.7773) and marginally lower than Ridge Regression (.7780).

#### MAE

In [57]:
mean_absolute_error(y_train, y_tr_pred_l), mean_absolute_error(y_test, y_te_pred_l)

(0.37192380704279043, 0.3785919709341438)

This MAE score is comparable to the other two models so far.

#### RMSE

In [58]:
sqrt(mean_squared_error(y_test,y_te_pred_l))

0.5284426069454146

This model is off by about .528 touchdowns, over or under, per prediction in the test data. The Lasso Regression is very close to the Ridge Regression for the RMSE.

In [59]:
sqrt(mean_squared_error(y_train,y_tr_pred_l))

0.5119195832034019

A slightly lower RMSE score on the training data. This means the model is better at predicting touchdowns in the training data.

### 4.8.2 Cross Validation of Lasso Regression<a id='4.8.2_Cross_Validation_of_Lasso_Regression'></a>

Let's try a 10-fold cross validation on Lasso Regression.

In [60]:
cv_results_l = cross_validate(l_pipeline, X_train, y_train, cv=10)
cv_results_l

{'fit_time': array([0.06495929, 0.06396317, 0.06596279, 0.05597091, 0.05596566,
        0.06296349, 0.06296611, 0.0629611 , 0.05996537, 0.06196809]),
 'score_time': array([0.00299859, 0.0049963 , 0.00199819, 0.0029974 , 0.0039978 ,
        0.00499749, 0.0049963 , 0.00399899, 0.00399733, 0.00299692]),
 'test_score': array([0.8117399 , 0.80489404, 0.73463088, 0.81250893, 0.75766884,
        0.79486866, 0.80352   , 0.69059874, 0.81013249, 0.80447323])}

In [61]:
cv_scores_l = cv_results_l['test_score']
cv_scores_l

array([0.8117399 , 0.80489404, 0.73463088, 0.81250893, 0.75766884,
       0.79486866, 0.80352   , 0.69059874, 0.81013249, 0.80447323])

In [62]:
np.mean(cv_scores_l), np.std(cv_scores_l)

(0.7825035687930466, 0.039309137165472095)

Here we get a mean of the 10 CV scores of .7825 and a std of .0393. This is higher than the original test R2 score of .778.

### 4.8.3 Grid Search CV for Lasso Regression<a id='4.8.3_Grid_Search_for_Lasso_Regression'></a> 

Now we will do a grid search on Lasso Regression. Let's try some different alpha values with a larger maximum number of iterations.

In [63]:
param_grid = {'alpha': [0.1, 1, 10, 100], 'max_iter':[10000]}

In [64]:
l_grid_cv = GridSearchCV(Lasso(), param_grid, refit = True, verbose = 3,n_jobs=-1) 

In [65]:
l_grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(estimator=Lasso(), n_jobs=-1,
             param_grid={'alpha': [0.1, 1, 10, 100], 'max_iter': [10000]},
             verbose=3)

In [66]:
l_grid_cv.best_params_

{'alpha': 0.1, 'max_iter': 10000}

The Grid Search found the best results with an alpha of .1 and a max_iter of 10,000. Note that .1 is the smallest alpha in the grid.

In [67]:
l_best_cv_results = cross_validate(l_grid_cv.best_estimator_, X_train, y_train, cv=10)
l_best_scores = l_best_cv_results['test_score']
l_best_scores

array([0.69301418, 0.70537618, 0.71904834, 0.72593044, 0.71497801,
       0.70683503, 0.69980357, 0.6949455 , 0.72641134, 0.70297613])

In [68]:
np.mean(l_best_scores), np.std(l_best_scores)

(0.7089318713802191, 0.011481331926527916)

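The mean and std from the Grid Search are both lower than the values we found for the cross validation of Lasso. A likely reason is that the smallest alpha in this grid (.1) is a much heavier penalty than the .001 used in the pipeline above, and that Lasso is searched here without the StandardScaler step.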
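
A grid that spans several orders of magnitude, including small penalties, gives the search a chance to find values like the .001 chosen by hand earlier. The sketch below uses synthetic data and a log-spaced grid from .001 to 10; the data, the grid bounds and the names `lasso_pipe` and `lasso_grid` are our own assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (not the QB data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=4,
                                 noise=5.0, random_state=47)

# Scaling stays inside the pipeline so every candidate sees scaled features.
lasso_pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))

# Five log-spaced penalties: 0.001, 0.01, 0.1, 1, 10.
alphas = np.logspace(-3, 1, 5)
lasso_grid = GridSearchCV(lasso_pipe, {"lasso__alpha": alphas}, cv=5)
lasso_grid.fit(X_demo, y_demo)

print(lasso_grid.best_params_)
```

If the best value lands on the edge of the grid, as .1 did above, that is usually a hint to extend the grid in that direction and search again.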

## 4.9 Random Forest Model<a id='4.9_Random_Forest_Model'></a>

In [69]:
#Create the pipeline
rf_pipeline=make_pipeline(
    StandardScaler(), 
    RandomForestRegressor(n_estimators = 10000,
                           random_state = 42,
                           min_samples_split = 10,
                           bootstrap = True))

In [70]:
#fit to the training data
rf_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(min_samples_split=10, n_estimators=10000,
                                       random_state=42))])

In [71]:
#predict the y (touchdowns) using the train and test data
y_tr_pred_rf = rf_pipeline.predict(X_train)
y_te_pred_rf = rf_pipeline.predict(X_test)

In [72]:
#Lets look at what RF predicts for each row
y_tr_pred_rf

array([0., 1., 0., ..., 1., 0., 2.])

This looks pretty good!

### 4.9.1 RF Metrics<a id='4.9.1_RF_Metrics'></a>

In [73]:
r2_score(y_train, y_tr_pred_rf), r2_score(y_test, y_te_pred_rf)

(0.9991705526741305, 0.9971020584351208)

Wow! This is the best model by far! An R2 of .9971 on the test data means the random forest model predicts new data very well. The training score is not much higher.

In [74]:
mean_absolute_error(y_train, y_tr_pred_rf), mean_absolute_error(y_test, y_te_pred_rf)

(0.0034843786707509923, 0.006537173162885364)

On average we can expect to be off by about .0065 of a touchdown for any given row of test data.

In [75]:
sqrt(mean_squared_error(y_test,y_te_pred_rf))

0.060371436260887

In [76]:
sqrt(mean_squared_error(y_train,y_tr_pred_rf))

0.03222499171236639

The RMSE is very good as well for the RF model. Our test predictions typically miss by only about .060 of a touchdown, over or under.
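
When a score is this close to perfect, it can help to look at which features the forest leans on. The sketch below is a hypothetical illustration on synthetic data, not the QB data: it builds a target, a ratio feature derived from that target, and an unrelated noise column, then reads `feature_importances_` from a small forest. The derived ratio should receive far more importance than the noise column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 400

cmp_ = rng.integers(5, 36, size=n)   # synthetic completions
td = rng.integers(0, 5, size=n)      # synthetic target
td_per_cmp = td / cmp_               # feature derived from the target
noise = rng.normal(size=n)           # feature unrelated to the target

X_syn = np.column_stack([cmp_, td_per_cmp, noise])
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_syn, td)

# Map each importance back to a readable feature name.
importances = dict(zip(["cmp", "td_per_cmp", "noise"],
                       rf_demo.feature_importances_))
print({name: round(value, 3) for name, value in importances.items()})
```

Running the same inspection on the fitted forest inside `rf_pipeline` (for example via `rf_pipeline[-1].feature_importances_`, assuming a scikit-learn version that supports pipeline indexing) would show whether a handful of columns account for most of the score.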

### 4.9.2 Cross Validation of RF<a id='4.9.2_Cross_Validation_of_RF'></a>

Now let's try a CV of Random Forest.

In [77]:
cv_results_rf = cross_validate(rf_pipeline, X_train, y_train, cv=10)
cv_results_rf

{'fit_time': array([388.70339704, 135.12183571, 137.74378371, 135.51948953,
        136.53145313, 137.38998008, 135.21757078, 125.26301837,
        121.81878233, 122.98504758]),
 'score_time': array([1.16499758, 1.1692152 , 1.14762163, 1.09387016, 1.15711069,
        1.05225754, 1.05475068, 1.03540492, 1.015414  , 1.02440929]),
 'test_score': array([0.99831267, 0.99648179, 0.99809721, 0.99797098, 0.99560623,
        0.99725179, 0.99744879, 0.99864914, 0.99981528, 0.99770187])}

In [78]:
cv_scores_rf = cv_results_rf['test_score']
cv_scores_rf

array([0.99831267, 0.99648179, 0.99809721, 0.99797098, 0.99560623,
       0.99725179, 0.99744879, 0.99864914, 0.99981528, 0.99770187])

The test scores look very high.

In [79]:
np.mean(cv_scores_rf), np.std(cv_scores_rf)

(0.9977335743674296, 0.0011001129205543254)

The mean and std are solid for a model. The mean CV score is just barely higher than the original R2 score.

### 4.9.3 Grid Search CV for Random Forest<a id='4.9.3_Grid_Search_CV_for_Random_Forest'></a> 

Let's try out the Grid Search for Random Forest using a few different numbers of trees and max depths.

In [80]:
param_grid = {'n_estimators':[100, 200, 500, 1000], 'max_depth':[10, 50, 100]}

In [81]:
rf_grid_cv = GridSearchCV(RandomForestRegressor(), param_grid, refit = True, verbose = 3,n_jobs=-1) 

In [82]:
rf_grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'max_depth': [10, 50, 100],
                         'n_estimators': [100, 200, 500, 1000]},
             verbose=3)

In [83]:
rf_grid_cv.best_params_

{'max_depth': 10, 'n_estimators': 200}

The Grid Search found that the best max depth was 10 with 200 trees.

In [84]:
rf_best_cv_results = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, cv=10)
rf_best_scores = rf_best_cv_results['test_score']
rf_best_scores

array([0.99849054, 0.99704228, 0.99802433, 0.9984509 , 0.99605966,
       0.99611866, 0.99759887, 0.99889723, 0.99990314, 0.99814022])

Looks like the best scores are all very high!

In [85]:
np.mean(rf_best_scores), np.std(rf_best_scores)

(0.9978725818862509, 0.0011450454106730575)

RF still looks like the best model after running the Grid Search. The Grid Search mean was just barely higher than our CV mean and original R2 score.

## 4.10 Model Metrics<a id='4.10_Model_Metrics'></a> 

In [105]:
data = {'R2 Score': [.7773,.7780,.7779,.9971],
        'MAE': [.3768,.3796,.3785,.0065],
        'RMSE': [.5291,.5283,.5284,.0603],
        'CV Mean Score': [.7821,.7828,.7825,.9977],
        'Grid Search CV Best Score': [.7337,.7828,.7089,.9978]}

In [106]:
# Creates pandas DataFrame.
df_metrics = pd.DataFrame(data, index=['Linear Regression',
                               'Ridge Regression',
                               'Lasso Regression',
                               'Random Forest Model'])

In [107]:
df_metrics

Unnamed: 0,R2 Score,MAE,RMSE,CV Mean Score,Grid Search CV Best Score
Linear Regression,0.7773,0.3768,0.5291,0.7821,0.7337
Ridge Regression,0.778,0.3796,0.5283,0.7828,0.7828
Lasso Regression,0.7779,0.3785,0.5284,0.7825,0.7089
Random Forest Model,0.9971,0.0065,0.0603,0.9977,0.9978
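
Typing the scores in by hand makes it easy for a digit to slip. A possible alternative is to build the table directly from the predictions. The sketch below assumes a helper of our own, `score_row`, and uses made-up targets and predictions in place of `y_test` and the `y_te_pred_*` arrays:

```python
from math import sqrt

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def score_row(y_true, y_pred):
    """Return the three test metrics used in this section as a dict."""
    return {
        "R2 Score": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": sqrt(mean_squared_error(y_true, y_pred)),
    }


# Made-up values standing in for y_test and each model's test predictions.
y_true = [0, 2, 1, 3]
preds = {
    "Model A": [0.5, 1.5, 1.0, 2.0],
    "Model B": [0.0, 2.0, 1.0, 3.0],
}

# One row per model, one column per metric.
table = pd.DataFrame.from_dict(
    {name: score_row(y_true, p) for name, p in preds.items()},
    orient="index",
).round(4)
print(table)
```

With the real notebook objects, `preds` would map each model name to its test predictions (for instance `y_te_pred`, `y_te_pred_r`, `y_te_pred_l` and `y_te_pred_rf`).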


## 4.11 Summary<a id='4.11_Summary'></a>

In this section we tried four different models to see which one best predicts the y variable in the test data. Our y variable is the number of touchdowns a QB will throw in any given game. The X variables are the rest of our data columns. These different models used the X variables to predict the number of touchdowns.

The Linear Regression, Lasso Regression and Ridge Regression models were very comparable. Their test R2 scores were all about .78, and all three had an MAE of about .38 and an RMSE of about .53 on the test data. The Linear Regression had a Cross Validation average score of .782 and a standard deviation of .0413. The Grid Search CV score reported for Linear Regression was lower at .733 with a std of .014, but that search was run on an SVR estimator rather than on the linear model, so it should not be read as a correction to the Linear Regression score. When we ran the Grid Search on Ridge Regression it scored .783. The Grid Search for Lasso scored much lower at .709, most likely because its alpha grid started at .1.

The best model was Random Forest. Its R2 score was nearly perfect at .997, with a very solid MAE of .0065 and RMSE of .060 on the test data. The CV average and Grid Search CV scores for Random Forest were also very high, at .9977 and .9978. Going forward we will use the Random Forest model since it is the most accurate on these metrics.