# 4. Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Train/Test Split](#4.5_Train/Test_Split)
       * [4.5.1 Home and Away Variables](#4.5.1_Home_and_Away_Variables)
       * [4.5.2 X Variables](#4.5.2_X_Variables)     
       * [4.5.3 Train Test Split](#4.5.3_Train_Test_Split)     
  * [4.6 Linear Regression](#4.6_Linear_Regression)
       * [4.6.1 LR Metrics](#4.6.1_LR_Metrics)    
  * [4.7 Ridge Regression](#4.7_Ridge_Regression)
       * [4.7.1 RR Metrics](#4.7.1_RR_Metrics)    
  * [4.8 Lasso Regression](#4.8_Lasso_Regression)
       * [4.8.1 Lasso Regression Metrics](#4.8.1_Lasso_Regression_Metrics) 
  * [4.9 Random Forest Model](#4.9_Random_Forest_Model)
       * [4.9.1 RF Metrics](#4.9.1_RF_Metrics)    
  * [4.10 Summary](#4.10_Summary)


## 4.2 Introduction<a id='4.2_Introduction'></a>

In the last few steps of our data analysis we filled in the NaN values with the median or mean, or simply dropped the row altogether. We also found which variables were most strongly correlated, such as attempts and completions (.94). This will help us predict how many touchdowns a QB will throw in a game.

Now we will begin to create machine learning models with our QB data. We will compare four models to see which one is best at predicting how many touchdowns are thrown, based on our variables. We will compare the models using RMSE, MAE and R2 score. Whichever model has the best scores is the one we will use going forward to predict the y variable, touchdowns.

## 4.3 Imports<a id='4.3_Imports'></a>

In [144]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge, RidgeCV, Lasso
from math import sqrt
import datetime

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In [145]:
df = pd.read_csv('QB_stats_clean.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,qb,cmp,att,comp %,yds,td,int,rate,long,sack,game_points,ypa,ypc,td_per_cmp,td_per_att,loss_yds,home_away,year
0,0,Boomer EsiasonB. Esiason,25,38,65.8,237.0,0,0,82.9,20.0,2.0,13,6.2,9.5,0.0,0.0,11.0,away,1996
1,1,Jim HarbaughJ. Harbaugh,16,25,64.0,196.0,2,1,98.1,35.0,0.0,20,7.8,12.2,0.125,0.08,0.0,home,1996
2,2,Paul JustinP. Justin,5,8,62.5,53.0,0,0,81.8,30.0,1.0,20,6.6,10.6,0.0,0.0,11.0,home,1996
3,3,Jeff GeorgeJ. George,16,35,45.7,215.0,0,0,65.8,55.0,7.0,6,6.1,13.4,0.0,0.0,53.0,away,1996
4,4,Kerry CollinsK. Collins,17,31,54.8,198.0,2,0,95.9,30.0,4.0,29,6.4,11.6,0.118,0.065,12.0,home,1996


In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13172 entries, 0 to 13171
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   13172 non-null  int64  
 1   qb           13172 non-null  object 
 2   cmp          13172 non-null  int64  
 3   att          13172 non-null  int64  
 4   comp %       13172 non-null  float64
 5   yds          13172 non-null  float64
 6   td           13172 non-null  int64  
 7   int          13172 non-null  int64  
 8   rate         13172 non-null  float64
 9   long         13172 non-null  float64
 10  sack         13172 non-null  float64
 11  game_points  13172 non-null  int64  
 12  ypa          13172 non-null  float64
 13  ypc          13172 non-null  float64
 14  td_per_cmp   13172 non-null  float64
 15  td_per_att   13172 non-null  float64
 16  loss_yds     13172 non-null  float64
 17  home_away    13172 non-null  object 
 18  year         13172 non-null  int64  
dtypes: f

The first two columns are not useful for modelling. "Unnamed: 0" is just a leftover row index and carries no information. Let's also get rid of the qb column, since we want the model to work no matter who the QB is. This reduces our own personal bias towards particular QBs.

In [147]:
#Drop the two columns by name so the intent is explicit
df = df.drop(columns=['Unnamed: 0', 'qb'])

In [148]:
df.shape

(13172, 17)

In [149]:
#Let's ensure we dropped the columns
df.head()

Unnamed: 0,cmp,att,comp %,yds,td,int,rate,long,sack,game_points,ypa,ypc,td_per_cmp,td_per_att,loss_yds,home_away,year
0,25,38,65.8,237.0,0,0,82.9,20.0,2.0,13,6.2,9.5,0.0,0.0,11.0,away,1996
1,16,25,64.0,196.0,2,1,98.1,35.0,0.0,20,7.8,12.2,0.125,0.08,0.0,home,1996
2,5,8,62.5,53.0,0,0,81.8,30.0,1.0,20,6.6,10.6,0.0,0.0,11.0,home,1996
3,16,35,45.7,215.0,0,0,65.8,55.0,7.0,6,6.1,13.4,0.0,0.0,53.0,away,1996
4,17,31,54.8,198.0,2,0,95.9,30.0,4.0,29,6.4,11.6,0.118,0.065,12.0,home,1996


## 4.5 Train/Test Split<a id='4.5_Train/Test_Split'></a>

Now we will split the QB data into training and test sets. We set aside part of the data (the test set) to evaluate model performance. A train/test split helps us estimate how the model will perform on data it has not seen. Let's see what the sizes of a 70/30 split would be.

In [150]:
#Size of the 70% Train & 30% Test
len(df) * .7, len(df) * .3

(9220.4, 3951.6)
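The fractional sizes above cannot be actual row counts. As a sketch of how they resolve, assuming scikit-learn rounds the test share up and gives the remainder to training, the arithmetic looks like this. `n_rows` is the row count reported by `df.info()`; the variable names are illustrative only.

```python
from math import ceil

n_rows = 13172      # number of rows reported by df.info()
test_size = 0.3     # fraction held out for testing

# Assumption: the test share is rounded up and training gets the remainder.
n_test = ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```

These counts agree with the `X_train.shape` and `X_test.shape` outputs shown in section 4.5.3.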

### 4.5.1 Home and Away Variables<a id='4.5.1_Home_and_Away_Variables'></a>

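Let's get numeric values for the "home_away" column using get_dummies. This creates two indicator columns, one for away and one for home, each holding 1 when the game matches that label and 0 otherwise.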

In [151]:
#get dummy variables for the home_away column to make it numerical
df= pd.get_dummies(df, columns=['home_away'])

Rename the columns to make it easier to read.

In [152]:
#inplace to make it permanent to the data frame
df.rename(columns={'home_away_away': 'away', 'home_away_home': 'home'}, inplace=True)

In [153]:
#Let's check the dtype of the home and away columns (only the last expression is displayed)
df['home'].dtype
df['away'].dtype

dtype('uint8')

In [154]:
#Let's change them to int
df['home']=df['home'].astype(int)
df['away']=df['away'].astype(int)
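A small sketch of what the encoding steps above do, using a made-up four-row frame (`demo` is hypothetical, not the QB data). Passing `dtype=int` to `pd.get_dummies` yields integer columns directly, which would make the separate `astype(int)` step unnecessary and avoids relying on the default dummy dtype, which differs between pandas versions.

```python
import pandas as pd

# Hypothetical miniature data set with the same column name as the QB data
demo = pd.DataFrame({"td": [0, 2, 0, 2],
                     "home_away": ["away", "home", "home", "away"]})

# One indicator column per category, stored as integers
encoded = pd.get_dummies(demo, columns=["home_away"], dtype=int)

# Shorter names, as in the notebook
encoded = encoded.rename(columns={"home_away_away": "away",
                                  "home_away_home": "home"})
print(encoded)
```

Because the two indicator columns always sum to 1, one of them is redundant for a linear model; keeping both, as this notebook does, is harmless for prediction but makes the individual coefficients harder to interpret.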

### 4.5.2 X Variables<a id='4.5.2_X_Variables'></a>

Let's create our X variables: all of the columns except the td column, which is our target.

In [155]:
#Get all the features except td. We will predict td using all the other features.
features = ['cmp', 'att','comp %', 'yds', 'int', 'rate','long', 'sack', 'game_points', 'ypa', 'ypc', 'td_per_cmp', 'td_per_att', 'loss_yds', 'home', 'away', 'year']

In [156]:
df.head()

Unnamed: 0,cmp,att,comp %,yds,td,int,rate,long,sack,game_points,ypa,ypc,td_per_cmp,td_per_att,loss_yds,year,away,home
0,25,38,65.8,237.0,0,0,82.9,20.0,2.0,13,6.2,9.5,0.0,0.0,11.0,1996,1,0
1,16,25,64.0,196.0,2,1,98.1,35.0,0.0,20,7.8,12.2,0.125,0.08,0.0,1996,0,1
2,5,8,62.5,53.0,0,0,81.8,30.0,1.0,20,6.6,10.6,0.0,0.0,11.0,1996,0,1
3,16,35,45.7,215.0,0,0,65.8,55.0,7.0,6,6.1,13.4,0.0,0.0,53.0,1996,1,0
4,17,31,54.8,198.0,2,0,95.9,30.0,4.0,29,6.4,11.6,0.118,0.065,12.0,1996,0,1


### 4.5.3 Train Test Split<a id='4.5.3_Train_Test_Split'></a>

Now we will do the train test split of the data with a test size of 30%.

In [157]:
X_train, X_test, y_train, y_test = train_test_split(df[features], df['td'],test_size=0.3, 
                                                    random_state=47)

In [158]:
#Let's check how many rows and columns we have for the X variables
X_train.shape, X_test.shape

((9220, 17), (3952, 17))

In [159]:
#Also check the Y
y_train.shape, y_test.shape

((9220,), (3952,))

In [160]:
#Make sure we have numeric values for X train
X_train.dtypes

cmp              int64
att              int64
comp %         float64
yds            float64
int              int64
rate           float64
long           float64
sack           float64
game_points      int64
ypa            float64
ypc            float64
td_per_cmp     float64
td_per_att     float64
loss_yds       float64
home             int32
away             int32
year             int64
dtype: object

In [161]:
#Now check that we have numeric values for X test
X_test.dtypes

cmp              int64
att              int64
comp %         float64
yds            float64
int              int64
rate           float64
long           float64
sack           float64
game_points      int64
ypa            float64
ypc            float64
td_per_cmp     float64
td_per_att     float64
loss_yds       float64
home             int32
away             int32
year             int64
dtype: object

We now have all numeric features for our X Train/Test split!

## 4.6 Linear Regression<a id='4.6_Linear_Regression'></a>

Make a Pipeline for Linear Regression.

In [162]:
#Create the pipeline
lr_pipeline=make_pipeline(
    StandardScaler(), 
    LinearRegression())

In [163]:
#fit to the training data
lr_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])
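For readers unfamiliar with the first pipeline step: `StandardScaler` learns each column's mean and standard deviation from the data passed to `fit`, then rescales values to z-scores. The sketch below reproduces that arithmetic in plain Python for a single column, using the five `att` values from the `df.head()` output as stand-in training values and one hypothetical new value (30). It is an illustration of the idea, not the library's implementation.

```python
from statistics import fmean, pstdev

att_train = [38, 25, 8, 35, 31]   # att values from df.head(), used as toy training data

mu = fmean(att_train)
sigma = pstdev(att_train)         # population standard deviation (ddof=0)

def standardize(value):
    """Rescale a value using the mean and std learned from the training data."""
    return (value - mu) / sigma

scaled_train = [standardize(v) for v in att_train]
scaled_new = standardize(30)      # hypothetical unseen value, scaled with training statistics

print(scaled_train, scaled_new)
```

Putting the scaler inside the pipeline means the test data is scaled with statistics learned from the training data only, so no information from the test set leaks into the fit.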

In [164]:
#predict the y (touchdowns) using the test data
y_te_pred = lr_pipeline.predict(X_test)
print(y_te_pred)

[ 0.96449569  1.53131995  2.29045684 ...  0.91216184 -0.51769935
  1.10188415]


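These are the predicted numbers of TDs a QB threw, using the X variables from the unseen test data. Because linear regression outputs continuous values, the predictions are not whole numbers and can even be negative.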

In [165]:
#predict the y (touchdowns) using the train data
y_tr_pred = lr_pipeline.predict(X_train)
print(y_tr_pred)

[0.15850306 0.94629965 0.45264682 ... 0.90742178 0.28048023 1.85117139]


These are the predicted numbers of TDs a QB threw, using the X variables from the training data. This is data the model has already seen.

### 4.6.1 LR Metrics<a id='4.6.1_LR_Metrics'></a>

First let's compute the test R2 score for Linear Regression, since we want to see how well the model fits the data.

#### R2 Score

In [166]:
r2_score(y_test, y_te_pred)

0.7773340771441586

The R2 score is decent but could be better. It tells us how much of the variance in the dependent variable (touchdowns) is explained by the independent variables (df[features]). This is the most important score for checking whether our model fits the data.
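As a reminder of what `r2_score` computes, here is a plain-Python sketch of the formula R2 = 1 - SS_res / SS_tot. The `td` values come from the `df.head()` output; the prediction lists are made up for illustration and are not model output.

```python
def r2(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total variation
    return 1 - ss_res / ss_tot

y_true = [0, 2, 0, 0, 2]                 # td values from df.head()
y_close = [0.2, 1.8, 0.1, 0.3, 1.9]      # hypothetical predictions near the truth
y_mean = [sum(y_true) / len(y_true)] * len(y_true)   # always predict the mean

print(r2(y_true, y_close), r2(y_true, y_mean))
```

A model that always predicts the mean scores 0, and a model that does worse than the mean scores below 0, which is a useful reference point when reading the scores in the following sections.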

In [167]:
r2_score(y_train, y_tr_pred)

0.7908455260457911

The training R2 score is slightly higher than the test score. If the gap were substantial, it would mean that the model is overfitting the training data.

#### MAE

In [168]:
mean_absolute_error(y_test, y_te_pred)

0.37686733549094426

Here we can expect our guess of the number of touchdowns a QB throws in a game to be off by about .377 on average, using the test data in this Linear Regression model.

#### RMSE

In [169]:
sqrt(mean_squared_error(y_test,y_te_pred))

0.5291923708849422

The RMSE is about .529 touchdowns. It measures the typical size of the error, whether we over- or under-guess, and it weights large misses more heavily than the MAE does.
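To make the difference between the two error metrics concrete, here is a plain-Python sketch of both formulas. The predictions are hypothetical (four small misses and one large one) and are chosen only to show why the RMSE comes out larger than the MAE.

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large misses count more."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [0, 2, 0, 0, 2]               # td values from df.head()
y_pred = [0.1, 1.9, 0.1, 0.1, 0.4]     # hypothetical: four small misses, one large

print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

The RMSE is never smaller than the MAE, and the gap widens when a few predictions are far off, which is consistent with the linear model's RMSE being higher than its MAE.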

## 4.7 Ridge Regression<a id='4.7_Ridge_Regression'></a>

In [170]:
#Create the pipeline
r_pipeline=make_pipeline(
    StandardScaler(), 
    Ridge(alpha=10))

After trying a few values of alpha by hand, this one gave the best score for the model.
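Instead of changing `alpha` by hand, the search could be automated with `GridSearchCV`, which is already imported. The following is a minimal sketch on synthetic data: the arrays `X_demo` and `y_demo` and the candidate grid are assumptions made for illustration, not the tuning procedure actually used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 5 features
rng = np.random.default_rng(47)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 0.0, -0.3]) + rng.normal(scale=0.5, size=200)

# Same pipeline shape as above; make_pipeline names the Ridge step "ridge"
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation, scored by (negated) RMSE
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print(best_alpha, -search.best_score_)
```

On the real data the search would be fitted on `X_train` and `y_train` only, so that the test set is not used for choosing `alpha`.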

In [171]:
#fit to the training data
r_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('ridge', Ridge(alpha=10))])

In [172]:
#predict the y (touchdowns) using the train and test data
y_tr_pred_r = r_pipeline.predict(X_train)
y_te_pred_r = r_pipeline.predict(X_test)

### 4.7.1 RR Metrics<a id='4.7.1_RR_Metrics'></a>

Let's check the most important score first, R2.

#### R2 Score

In [173]:
r2_score(y_train, y_tr_pred_r), r2_score(y_test, y_te_pred_r)

(0.7906719345075179, 0.7780171773899436)

This R2 score is OK but could be better for a model that is predicting the dependent variable, touchdowns. The training score is slightly higher once again.

#### MAE

In [174]:
mean_absolute_error(y_train, y_tr_pred_r), mean_absolute_error(y_test, y_te_pred_r)

(0.3728687781206778, 0.37961858717621005)

The mean absolute error for ridge regression is very close to that of our linear regression model. Here we can expect to be off by about .3796 on average when guessing the number of touchdowns a QB throws in a game, using the test data in this model.

#### RMSE

In [196]:
sqrt(mean_squared_error(y_test,y_te_pred_r))

0.5283800123788575

Once again the RMSE is close to our last model. On the test data our predictions are typically off by about .5284 touchdowns, over or under.

In [197]:
sqrt(mean_squared_error(y_train,y_tr_pred_r))

0.5119318364171611

The training RMSE is lower (.5119), which means the predictions are slightly closer to the actual number of touchdowns on the training set than on the test set.

## 4.8 Lasso Regression<a id='4.8_Lasso_Regression'></a>

In [291]:
#Create the pipeline
l_pipeline=make_pipeline(
    StandardScaler(), 
    Lasso(alpha=35, random_state=42, max_iter=1000))

This was the best Lasso Regression model after changing the parameters a few times.

In [292]:
#fit to the training data
l_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('lasso', Lasso(alpha=35, random_state=42))])

In [293]:
#predict the y (touchdowns) using the train and test data
y_tr_pred_l = l_pipeline.predict(X_train)
y_te_pred_l = l_pipeline.predict(X_test)

Let's look at the predictions for the y test.

In [294]:
y_te_pred_l

array([1.11908894, 1.11908894, 1.11908894, ..., 1.11908894, 1.11908894,
       1.11908894])

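Lasso Regression is predicting the same value every time. With alpha=35 the penalty is strong enough to shrink every coefficient to zero, so only the intercept (the mean number of touchdowns in the training data) is left.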
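A likely explanation for the constant prediction, sketched under the assumption that scikit-learn's `Lasso` minimises (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1: with standardized features, every coefficient is exactly zero once alpha reaches max_j |x_j . (y - mean(y))| / n, and that threshold can never exceed the standard deviation of y. The toy data below comes from the `df.head()` output; the names `alpha_max` and `std_y` are illustrative.

```python
from statistics import fmean, pstdev

def standardize(col):
    """Z-score a column using its own mean and population std."""
    mu, sigma = fmean(col), pstdev(col)
    return [(v - mu) / sigma for v in col]

# Toy data from df.head(): two features and the target
cmp_col = [25, 16, 5, 16, 17]
att_col = [38, 25, 8, 35, 31]
td = [0, 2, 0, 0, 2]

n = len(td)
td_mean = fmean(td)
td_centered = [t - td_mean for t in td]
columns = [standardize(cmp_col), standardize(att_col)]

# Smallest alpha at which all lasso coefficients are zero (under the assumed objective)
alpha_max = max(abs(sum(x * y for x, y in zip(col, td_centered))) / n for col in columns)

# Upper bound on alpha_max for standardized features
std_y = pstdev(td)

print(alpha_max, std_y)
```

In this notebook the standard deviation of `td` in the training data is about 1.12 (it equals the training RMSE of a constant mean prediction), so alpha=35 is far above the threshold. Much smaller values, for example well below 1, would be needed for Lasso to keep any features.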

### 4.8.1 Lasso Regression Metrics<a id='4.8.1_Lasso_Regression_Metrics'></a>

#### R2 Score

In [295]:
r2_score(y_train, y_tr_pred_l), r2_score(y_test, y_te_pred_l)

(0.0, -2.836666816996569e-07)

This is the lowest score of any model. With these settings Lasso Regression explains none of the variance in the dependent variable.

#### MAE

In [296]:
mean_absolute_error(y_train, y_tr_pred_l), mean_absolute_error(y_test, y_te_pred_l)

(0.8881615934425303, 0.8869918413587782)

This is the highest MAE so far out of all the models.

#### RMSE

In [259]:
sqrt(mean_squared_error(y_test,y_te_pred_l))

1.1214676687270984

By RMSE, this model is off by over 1 touchdown per prediction on the test data. This is the most inaccurate model.

In [297]:
sqrt(mean_squared_error(y_train,y_tr_pred_l))

1.1189183820410535

## 4.9 Random Forest Model<a id='4.9_Random_Forest_Model'></a>

In [298]:
#Create the pipeline
rf_pipeline=make_pipeline(
    StandardScaler(), 
    RandomForestRegressor(n_estimators = 10000,
                           random_state = 42,
                           min_samples_split = 10,
                           bootstrap = True))

In [299]:
#fit to the training data
rf_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(min_samples_split=10, n_estimators=10000,
                                       random_state=42))])

In [300]:
#predict the y (touchdowns) using the train and test data
y_tr_pred_rf = rf_pipeline.predict(X_train)
y_te_pred_rf = rf_pipeline.predict(X_test)

In [306]:
#Let's look at what RF predicts for each row
y_tr_pred_rf

array([0., 1., 0., ..., 1., 0., 2.])

This looks pretty good!

### 4.9.1 RF Metrics<a id='4.9.1_RF_Metrics'></a>

In [302]:
r2_score(y_train, y_tr_pred_rf), r2_score(y_test, y_te_pred_rf)

(0.9991705526741305, 0.9971020584351208)

Wow! This is the best model by far! A test R2 of .9971 means the random forest model predicts unseen data very well. The training score is not much higher.
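One thing worth checking before trusting a score this high, offered as a hypothesis rather than something this notebook has verified: `td_per_cmp` and `td_per_att` are ratios built from the target itself, so multiplying them by `cmp` or `att` recovers `td` almost exactly. The sketch below checks this on the five rows shown in the `df.head()` output.

```python
# (cmp, att, td, td_per_cmp, td_per_att) copied from the df.head() output
rows = [
    (25, 38, 0, 0.0,   0.0),
    (16, 25, 2, 0.125, 0.08),
    (5,  8,  0, 0.0,   0.0),
    (16, 35, 0, 0.0,   0.0),
    (17, 31, 2, 0.118, 0.065),
]

actual = [td for _, _, td, _, _ in rows]

# Rebuild the target from features that are supposed to predict it
from_cmp = [round(cmp_ * per_cmp) for cmp_, _, _, per_cmp, _ in rows]
from_att = [round(att * per_att) for _, att, _, _, per_att in rows]

print(actual, from_cmp, from_att)
```

If the same holds across the full data set, a tree-based model can learn this relationship directly, which would help explain the gap to the linear models. Refitting all four models without those two columns would show how much predictive power remains.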

In [303]:
mean_absolute_error(y_train, y_tr_pred_rf), mean_absolute_error(y_test, y_te_pred_rf)

(0.0034843786707509923, 0.006537173162885364)

On average we can expect to be off by about .0065 of a touchdown for any given row of test data.

In [304]:
sqrt(mean_squared_error(y_test,y_te_pred_rf))

0.060371436260887

In [305]:
sqrt(mean_squared_error(y_train,y_tr_pred_rf))

0.03222499171236639

The RMSE is very good as well for the RF model. On the test data our predictions are typically off by about .0604 of a touchdown, over or under.

## 4.10 Summary<a id='4.10_Summary'></a>

In this section we tried four different models to see which one best predicts the y variable on the test data. Our y variable is the number of touchdowns a QB throws in a given game. The X variables are the rest of our data columns. Each model used the X variables to predict the number of touchdowns.

Linear Regression and Ridge Regression performed very similarly. Their test R2 scores were both about .78, with an MAE of about .38 and an RMSE of about .53. Lasso Regression was the least accurate model, with an R2 score of essentially zero (very slightly negative on the test data). Its MAE and RMSE were higher than those of Linear and Ridge Regression.

The best model was Random Forest. Its R2 score was nearly perfect at .997, with a very solid MAE of .0065 and RMSE of .06 on the test data. Going forward we will use the Random Forest model, since it is the most accurate on all three metrics.
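For a side-by-side view, the test-set metrics reported in the sections above can be collected and ranked. The numbers below are copied from the cell outputs and rounded to four decimals; the dictionary layout is just one convenient way to hold them.

```python
# Test-set metrics copied from the outputs above (rounded to four decimals)
results = {
    "Linear Regression": {"r2": 0.7773, "mae": 0.3769, "rmse": 0.5292},
    "Ridge Regression":  {"r2": 0.7780, "mae": 0.3796, "rmse": 0.5284},
    "Lasso Regression":  {"r2": 0.0000, "mae": 0.8870, "rmse": 1.1215},  # R2 is about -2.8e-07
    "Random Forest":     {"r2": 0.9971, "mae": 0.0065, "rmse": 0.0604},
}

# Rank from lowest to highest RMSE (lower is better)
ranked = sorted(results, key=lambda name: results[name]["rmse"])

for name in ranked:
    m = results[name]
    print(f"{name:<18} R2={m['r2']:.4f}  MAE={m['mae']:.4f}  RMSE={m['rmse']:.4f}")
```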