<a href="https://colab.research.google.com/github/Chadschneider37/Data-Analytics-Projects/blob/main/NFL_PLAYOFF_CHANCES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NFL Offense vs. NFL Defense
Does having a better NFL offense or NFL defense give you a better shot at making the playoffs?

This notebook tests whether the old adage that defenses win championships is really true. To win a championship, a team must first make the playoffs, so making the playoffs is the outcome modeled here.

I gathered offensive and defensive rankings across statistical categories for the past 5 seasons (2017-2021) from lineups.com/nfl-team-rankings to determine whether a strong offense or a strong defense is the better predictor of making the playoffs, and therefore of having a chance at winning an NFL title.

The bottom of this notebook contains the results and the reasoning behind which side of the ball wins: offense or defense.

In [None]:
#Import all necessary libraries into colab
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import RFE

In [None]:
#Import Offensive Rankings Dataset
from google.colab import files
uploaded = files.upload()

Saving Offense Rankings.csv to Offense Rankings.csv


In [None]:
#Create Offense dataframe
df = pd.read_csv('Offense Rankings.csv')

#Print top 5 of dataframe
df.head()

Unnamed: 0,YEAR,MADE PLAYOFFS,PTS,FPTS,PLAYS,YDS,PASS YDS,PASS ATT,PASS COMP,PASS TD,...,TD,RZ ATT,RZ TD,RZ TD PERCENTAGE,FIRST DOWNS,3RD DOWN CONVERSION,4TH DOWN CONVERSION,INT,TO,SACKS ALLOW
0,2021,1,2,1,4,2,1,1,1,1,...,2,5,3,2,2,2,14,10,21,1
1,2021,1,4,4,5,3,4,2,2,6,...,4,5,5,14,1,1,1,11,28,4
2,2021,1,10,8,16,10,8,15,11,4,...,6,3,5,19,7,8,9,1,5,12
3,2021,1,3,6,3,5,9,5,7,7,...,4,1,1,7,4,3,20,25,26,2
4,2021,1,1,2,2,1,2,6,3,3,...,1,7,5,6,6,11,11,7,6,12


In [None]:
#Find null values, if any
df.isna().sum()

YEAR                   0
MADE PLAYOFFS          0
PTS                    0
FPTS                   0
PLAYS                  0
YDS                    0
PASS YDS               0
PASS ATT               0
PASS COMP              0
PASS TD                0
RUSH YDS               0
RUSH ATT               0
RUSH TD                0
TD                     0
RZ ATT                 0
RZ TD                  0
RZ TD PERCENTAGE       0
FIRST DOWNS            0
3RD DOWN CONVERSION    0
4TH DOWN CONVERSION    0
INT                    0
TO                     0
SACKS ALLOW            0
dtype: int64

In [None]:
#Find the shape of the dataset
df.shape

(160, 23)

In [None]:
#Find datatypes of the dataset
df.dtypes

YEAR                   int64
MADE PLAYOFFS          int64
PTS                    int64
FPTS                   int64
PLAYS                  int64
YDS                    int64
PASS YDS               int64
PASS ATT               int64
PASS COMP              int64
PASS TD                int64
RUSH YDS               int64
RUSH ATT               int64
RUSH TD                int64
TD                     int64
RZ ATT                 int64
RZ TD                  int64
RZ TD PERCENTAGE       int64
FIRST DOWNS            int64
3RD DOWN CONVERSION    int64
4TH DOWN CONVERSION    int64
INT                    int64
TO                     int64
SACKS ALLOW            int64
dtype: object

In [None]:
#Descriptive Statistics of the Dataset
df.describe()

Unnamed: 0,YEAR,MADE PLAYOFFS,PTS,FPTS,PLAYS,YDS,PASS YDS,PASS ATT,PASS COMP,PASS TD,...,TD,RZ ATT,RZ TD,RZ TD PERCENTAGE,FIRST DOWNS,3RD DOWN CONVERSION,4TH DOWN CONVERSION,INT,TO,SACKS ALLOW
count,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,...,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0
mean,2019.0,0.375,16.45625,16.375,16.4,16.48125,16.49375,16.40625,16.36875,15.88125,...,16.03125,16.05625,15.95,16.425,16.3875,16.3875,16.2875,15.41875,15.73125,16.075
std,1.418654,0.485643,9.283005,9.24492,9.22275,9.265458,9.271582,9.258209,9.297746,9.156482,...,9.250138,9.323162,9.327581,9.257022,9.234191,9.267504,9.260308,9.431295,9.27682,9.291607
min,2017.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,2018.0,0.0,8.75,8.75,8.75,8.75,8.75,8.75,8.75,8.0,...,7.0,8.0,7.75,8.0,8.75,8.0,8.75,7.0,7.0,8.0
50%,2019.0,0.0,16.5,16.0,16.5,16.5,16.5,16.0,16.0,16.0,...,16.0,16.0,16.0,16.5,16.5,16.0,16.0,14.0,16.0,16.0
75%,2020.0,1.0,24.25,24.0,24.25,24.0,24.25,24.0,24.25,24.0,...,24.0,24.0,24.0,24.25,24.25,24.25,24.25,23.0,23.25,24.25
max,2021.0,1.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,...,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0


In [None]:
# Select offensive predictors (Rankings based on items in the csv file) 
X = df[['PASS YDS','PASS COMP','PASS TD','RUSH YDS','RUSH TD']]

# Select offensive responses (Whether a team made the playoffs)
y = df['MADE PLAYOFFS']

#Split datasets into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
#Fit Data To Offensive Models

In [None]:
#Logistic Regression Pipeline
pipe_log_std = Pipeline([
('log_reg', LogisticRegression(random_state=0))])
pipe_log_std.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
log_scores = cross_val_score(pipe_log_std, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_log_std.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(log_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_log_std.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 80.36 %
The mean 10-fold CV accuracy is 80.23 %
The model performance against test set is 77.08 %


In [None]:
#Naive Bayes Model
pipe_nb = Pipeline([
('GaussianNB', GaussianNB())])
pipe_nb.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
nb_scores = cross_val_score(pipe_nb, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_nb.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(nb_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_nb.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 80.36 %
The mean 10-fold CV accuracy is 76.67 %
The model performance against test set is 77.08 %


In [None]:
#Lasso Logistic Regression Model
pipe_l1 = Pipeline([
('log_reg', LogisticRegression(random_state=0, penalty='l1', solver='liblinear'))])
pipe_l1.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
l1_scores = cross_val_score(pipe_l1, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_l1.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(l1_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_l1.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 81.25 %
The mean 10-fold CV accuracy is 77.5 %
The model performance against test set is 75.0 %


In [None]:
#Ridge Logistic Regression Model
pipe_l2 = Pipeline([
('log_reg', LogisticRegression(random_state=0, penalty='l2', solver='liblinear'))])
pipe_l2.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
l2_scores = cross_val_score(pipe_l2, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_l2.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(l2_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_l2.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 79.46 %
The mean 10-fold CV accuracy is 77.5 %
The model performance against test set is 81.25 %


In [None]:
#K-Nearest Neighbors with K = 5
pipe_knn = Pipeline([
('KNN', KNeighborsClassifier(n_neighbors=5))])
pipe_knn.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
knn_scores = cross_val_score(pipe_knn, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_knn.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(knn_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_knn.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 78.57 %
The mean 10-fold CV accuracy is 73.03 %
The model performance against test set is 75.0 %


In [None]:
#Support Vector Machine Model with Linear Kernel
pipe_svm_lr = Pipeline([
('SVC', SVC(kernel='linear'))])
pipe_svm_lr.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
svm_lr_scores = cross_val_score(pipe_svm_lr, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_svm_lr.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(svm_lr_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_svm_lr.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 82.14 %
The mean 10-fold CV accuracy is 80.23 %
The model performance against test set is 77.08 %


In [None]:
#Support Vector Machine Model with Polynomial Kernel
pipe_svm_poly = Pipeline([
('SVC', SVC(kernel='poly'))])
pipe_svm_poly.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
svm_poly_scores = cross_val_score(pipe_svm_poly, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_svm_poly.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(svm_poly_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_svm_poly.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 84.82 %
The mean 10-fold CV accuracy is 73.26 %
The model performance against test set is 72.92 %


In [None]:
#Support Vector Machine Model with Radial Basis Function Kernel
pipe_svm_rbf = Pipeline([
('SVC', SVC(kernel='rbf'))])
pipe_svm_rbf.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
svm_rbf_scores = cross_val_score(pipe_svm_rbf, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_svm_rbf.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(svm_rbf_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_svm_rbf.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 82.14 %
The mean 10-fold CV accuracy is 77.58 %
The model performance against test set is 75.0 %


In [None]:
#Create comparison dataframe with headers for training accuracy, 10-fold cross-validation mean accuracy, and test accuracy
compare_df = pd.DataFrame(columns = ['Train Accuracy', 'CV 10 Fold Mean','Test Accuracy'])

#Pair each fitted pipeline with its cross-validation scores
fitted_models = {
    'Logistic Regression': (pipe_log_std, log_scores),
    'Lasso Model': (pipe_l1, l1_scores),
    'Ridge Model': (pipe_l2, l2_scores),
    'Naive Bayes': (pipe_nb, nb_scores),
    'KNN': (pipe_knn, knn_scores),
    'SVM Linear': (pipe_svm_lr, svm_lr_scores),
    'SVM Polynomial': (pipe_svm_poly, svm_poly_scores),
    'SVM RBF': (pipe_svm_rbf, svm_rbf_scores),
}

#Add one row per model, rounding every percentage the same way
for name, (pipe, scores) in fitted_models.items():
    compare_df.loc[name,:] = [round(pipe.score(X_train, y_train)*100,2), round(scores.mean()*100,2), round(pipe.score(X_test, y_test)*100,2)]

#Print comparison dataframe
print(compare_df)

                    Train Accuracy CV 10 Fold Mean Test Accuracy
Logistic Regression          80.36           80.23         77.08
Lasso Model                  81.25            77.5          75.0
Ridge Model                  79.46            77.5         81.25
Naive Bayes                  80.36           76.67         77.08
KNN                          78.57           73.03          75.0
SVM Linear                   82.14           80.23         77.08
SVM Polynomial               84.82           73.26         72.92
SVM RBF                      82.14           77.58          75.0


In [None]:
#Load Defense Rankings Dataset

In [None]:
#Select files to import
from google.colab import files
uploaded = files.upload()

Saving Defense Rankings.csv to Defense Rankings.csv


In [None]:
#Create Defense dataframe
df = pd.read_csv('Defense Rankings.csv')

#Print top 5 of dataframe
df.head()

Unnamed: 0,YEAR,MADE PLAYOFFS,PTS,FPTS,SACKS,INT,TO,PTS ALLOW,PLAYS,YDS,...,RUSH YDS,RUSH ATT,RUSH TD,TD,RZ ATT,RZ TD,RZ TD PERCENTAGE,FIRST DOWNS,3RD DOWN CONVERSION,4TH DOWN CONVERSION
0,2021,1,9,3,11,3,3,1,2,1,...,13,16,26,1,3,3,6,1,1,8
1,2021,1,5,2,18,2,3,2,8,4,...,22,20,1,2,6,3,2,3,5,15
2,2021,0,22,13,18,15,21,3,9,8,...,15,15,1,4,1,2,3,5,28,4
3,2021,0,9,5,8,6,10,4,14,7,...,4,9,6,2,4,1,1,2,7,5
4,2021,1,9,6,6,8,5,5,22,13,...,3,1,5,6,19,12,10,17,12,9


In [None]:
# Select defensive predictors (Rankings based on items in the csv file) 
X = df[['PASS YDS','PASS COMP','PASS TD','RUSH YDS','RUSH TD']]

# Select defensive responses (Whether a team made the playoffs)
y = df['MADE PLAYOFFS']

#Split datasets into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
#Logistic Regression Pipeline
pipe_log_std = Pipeline([
('log_reg', LogisticRegression(random_state=0))])
pipe_log_std.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
log_scores = cross_val_score(pipe_log_std, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_log_std.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(log_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_log_std.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 72.32 %
The mean 10-fold CV accuracy is 72.35 %
The model performance against test set is 66.67 %


In [None]:
#Naive Bayes Model
pipe_nb = Pipeline([
('GaussianNB', GaussianNB())])
pipe_nb.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
nb_scores = cross_val_score(pipe_nb, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_nb.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(nb_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_nb.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 74.11 %
The mean 10-fold CV accuracy is 73.26 %
The model performance against test set is 64.58 %


In [None]:
#Lasso Logistic Regression Model
pipe_l1 = Pipeline([
('log_reg', LogisticRegression(random_state=0, penalty='l1', solver='liblinear'))])
pipe_l1.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
l1_scores = cross_val_score(pipe_l1, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_l1.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(l1_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_l1.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 71.43 %
The mean 10-fold CV accuracy is 66.97 %
The model performance against test set is 66.67 %


In [None]:
#Ridge Logistic Regression Model
pipe_l2 = Pipeline([
('log_reg', LogisticRegression(random_state=0, penalty='l2', solver='liblinear'))])
pipe_l2.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
l2_scores = cross_val_score(pipe_l2, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_l2.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(l2_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_l2.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 72.32 %
The mean 10-fold CV accuracy is 66.97 %
The model performance against test set is 64.58 %


In [None]:
#K-Nearest Neighbors with K = 5
pipe_knn = Pipeline([
('KNN', KNeighborsClassifier(n_neighbors=5))])
pipe_knn.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
knn_scores = cross_val_score(pipe_knn, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_knn.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(knn_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_knn.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 76.79 %
The mean 10-fold CV accuracy is 71.44 %
The model performance against test set is 54.17 %


In [None]:
#Support Vector Machine Model with Linear Kernel
pipe_svm_lr = Pipeline([
('SVC', SVC(kernel='linear'))])
pipe_svm_lr.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
svm_lr_scores = cross_val_score(pipe_svm_lr, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_svm_lr.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(svm_lr_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_svm_lr.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 74.11 %
The mean 10-fold CV accuracy is 70.53 %
The model performance against test set is 68.75 %


In [None]:
#Support Vector Machine Model with Polynomial Kernel
pipe_svm_poly = Pipeline([
('SVC', SVC(kernel='poly'))])
pipe_svm_poly.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
svm_poly_scores = cross_val_score(pipe_svm_poly, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_svm_poly.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(svm_poly_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_svm_poly.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 76.79 %
The mean 10-fold CV accuracy is 65.15 %
The model performance against test set is 60.42 %


In [None]:
#Support Vector Machine Model with Radial Basis Function Kernel
pipe_svm_rbf = Pipeline([
('SVC', SVC(kernel='rbf'))])
pipe_svm_rbf.fit(X_train, y_train)

#Print model accuracy against the training dataset, the 10-fold cross-validation mean accuracy, and accuracy against the test dataset
svm_rbf_scores = cross_val_score(pipe_svm_rbf, X_train, y_train, cv=10)
print('The accuracy of the model is', round(pipe_svm_rbf.score(X_train, y_train)*100,2),'%')
print('The mean 10-fold CV accuracy is', round(svm_rbf_scores.mean()*100,2),'%')
print('The model performance against test set is', round(pipe_svm_rbf.score(X_test, y_test)*100,2),'%')

The accuracy of the model is 77.68 %
The mean 10-fold CV accuracy is 66.97 %
The model performance against test set is 64.58 %


In [None]:
#Create comparison dataframe with headers for training accuracy, 10-fold cross-validation mean accuracy, and test accuracy
compare_df = pd.DataFrame(columns = ['Train Accuracy', 'CV 10 Fold Mean','Test Accuracy'])

#Pair each fitted pipeline with its cross-validation scores
fitted_models = {
    'Logistic Regression': (pipe_log_std, log_scores),
    'Lasso Model': (pipe_l1, l1_scores),
    'Ridge Model': (pipe_l2, l2_scores),
    'Naive Bayes': (pipe_nb, nb_scores),
    'KNN': (pipe_knn, knn_scores),
    'SVM Linear': (pipe_svm_lr, svm_lr_scores),
    'SVM Polynomial': (pipe_svm_poly, svm_poly_scores),
    'SVM RBF': (pipe_svm_rbf, svm_rbf_scores),
}

#Add one row per model, rounding every percentage the same way
for name, (pipe, scores) in fitted_models.items():
    compare_df.loc[name,:] = [round(pipe.score(X_train, y_train)*100,2), round(scores.mean()*100,2), round(pipe.score(X_test, y_test)*100,2)]

#Print comparison dataframe
print(compare_df)

                    Train Accuracy CV 10 Fold Mean Test Accuracy
Logistic Regression          72.32           72.35         66.67
Lasso Model                  71.43           66.97         66.67
Ridge Model                  72.32           66.97         64.58
Naive Bayes                  74.11           73.26         64.58
KNN                          76.79           71.44         54.17
SVM Linear                   74.11           70.53         68.75
SVM Polynomial               76.79           65.15         60.42
SVM RBF                      77.68           66.97         64.58


# Final Outcome

### How did you identify the model target and features?
I selected the same set of rankings for both offense and defense to compare apples to apples. The target is the binary `MADE PLAYOFFS` column, and the features are the rankings for Passing Yards, Passing Completions, Passing Touchdowns, Rushing Yards, and Rushing Touchdowns, which capture the breadth of how an offense and a defense collide.

### What steps did you take to prepare the data for modeling?
I decided to use the yearly rankings rather than the actual statistics. Every feature is then on the same 1-32 scale, so I did not have to standardize the dataset to model it.
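
For anyone starting from raw season totals instead of pre-ranked data, converting to per-season rankings could look roughly like the sketch below. The `TEAM` and `PASS YDS RAW` columns and all of the numbers are made up for illustration; they are not taken from the lineups.com data.

```python
import pandas as pd

# Hypothetical raw passing-yard totals for three teams over two seasons
# (illustrative numbers only, not real NFL data).
raw = pd.DataFrame({
    'YEAR': [2020, 2020, 2020, 2021, 2021, 2021],
    'TEAM': ['A', 'B', 'C', 'A', 'B', 'C'],
    'PASS YDS RAW': [4200, 3900, 4500, 4800, 4100, 4600],
})

# Rank within each season, where 1 = most yards. Ranking per season keeps
# every year on the same 1..N scale even if league-wide totals drift upward.
raw['PASS YDS'] = (
    raw.groupby('YEAR')['PASS YDS RAW']
       .rank(ascending=False, method='min')
       .astype(int)
)

print(raw)
```

Ranking inside each `YEAR` group, rather than across the whole table, is what makes a rank from one season comparable to the same rank in another.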

I calculated the number of nulls in the selected features and found none.

The response data was already encoded in the .csv files so no manipulation was necessary to run the models.

### Which models did you choose and why? How did you evaluate the model's performance?

Because this is a classification problem, I selected several classifiers: logistic regression (plain, Lasso, and Ridge), Naive Bayes, KNN, and Support Vector Machine models with linear, polynomial, and RBF kernels.

I evaluated model performance using the accuracy against the training dataset, the mean accuracy from a 10-fold cross-validation on the training dataset, and the accuracy against the test dataset.
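
The three measurements can be bundled into one helper, sketched below. The `evaluate` function is a hypothetical name introduced here, and the data is a synthetic stand-in from `make_classification`, not the rankings CSV. The sketch also reports the cross-validation standard deviation, which the notebook does not print.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate(model, X_train, X_test, y_train, y_test, cv=10):
    """Return train accuracy, CV mean/std, and test accuracy as percentages."""
    # cross_val_score fits clones of the model, so it is safe to run first.
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv)
    model.fit(X_train, y_train)
    return {
        'train': round(float(model.score(X_train, y_train)) * 100, 2),
        'cv_mean': round(float(cv_scores.mean()) * 100, 2),
        'cv_std': round(float(cv_scores.std()) * 100, 2),
        'test': round(float(model.score(X_test, y_test)) * 100, 2),
    }


# Synthetic stand-in shaped like the notebook's data: 160 rows, 5 features.
X, y = make_classification(n_samples=160, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

results = evaluate(LogisticRegression(random_state=0),
                   X_train, X_test, y_train, y_test)
print(results)
```

A large spread in `cv_std` relative to the gap between two models is a hint that the difference between them may not be meaningful on a dataset this small.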

### What were your findings?

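The linear SVM had the best average performance across the offensive and defensive model sets. It tied logistic regression for the highest offensive CV accuracy (80.23%) and had the highest defensive test accuracy (68.75%).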

Overall, the offensive models were better predictors of whether a team will make the playoffs than the defensive models: test accuracy ranged from 72.92% to 81.25% for offense versus 54.17% to 68.75% for defense.
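
One way to put these accuracy figures in context is a majority-class baseline. `MADE PLAYOFFS` averages 0.375 in the offense data, so always predicting that a team missed the playoffs would be right 62.5% of the time on the full dataset. A minimal sketch with scikit-learn's `DummyClassifier`, using a label vector built to mirror that 60/100 split rather than the actual CSV:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels built to mirror the 0.375 playoff rate reported by df.describe():
# 60 playoff seasons and 100 non-playoff seasons (order is arbitrary).
y = np.array([1] * 60 + [0] * 100)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y) * 100
print('Majority-class baseline accuracy:', round(baseline_accuracy, 2), '%')
```

The baseline on the 48-row test split depends on how many playoff teams landed in it, so for a fair comparison it would need to be recomputed on `y_test`.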

Based on this information, NFL general managers today should invest more in the offensive side of the team during the draft and free agency to improve their chances of making the playoffs.

### Closing thoughts...
Creating the .csv minimized the amount of EDA and data manipulation I had to do.

Using yearly rankings rather than actual output was a better approach because it standardized the data and accounted for the league shifting towards higher offensive output in recent years.

AdaBoost, gradient boosting, and decision tree models heavily overfit this smaller dataset.
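
The tree-based runs are not included in this notebook, but the kind of check that exposes overfitting can be sketched as follows: compare training accuracy with cross-validated accuracy for an unrestricted tree and a depth-limited one. The data here is synthetic (`make_classification` with some label noise), so the numbers it prints say nothing about the NFL rankings themselves.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in sized like the notebook's dataset (illustrative only).
X, y = make_classification(n_samples=160, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

train_scores = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_mean = cross_val_score(tree, X_train, y_train, cv=10).mean()
    train_acc = tree.fit(X_train, y_train).score(X_train, y_train)
    train_scores[depth] = train_acc
    print(f'max_depth={depth}: train={train_acc:.2%}, cv={cv_mean:.2%}')
```

A training accuracy far above the cross-validated accuracy is the usual sign that a model has memorized the training rows instead of learning a pattern that generalizes.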

My hunch was that offense would prevail, based on my knowledge of the league and how it has changed, but modeling a different timeframe or a different set of statistics may lead to a different conclusion.



