# CS329E Data Analytics Project

**Team Members:** *Bryce Holladay, Joshua Mathew, Austin Rinn, Eddie Castillo*

Using the techniques we learned in class, we attempted to predict the result of a National Football League (NFL) play from information available before the play begins, such as field position and time remaining in the game.

We built our models on [publicly available play-by-play data from the years 2013 through 2019](http://nflsavant.com/about.php). As inputs, the models take parameters such as time, down, yards to go, yard line, and offensive formation. The data includes several play-outcome flags that we tried to predict: touchdowns, interceptions, fumbles, and incomplete passes.

To fit the data into our models, we pre-processed it in several ways, including converting the game clock into a single linear time value and removing non-descriptive columns such as season year. The results of our models are shown below.

In [None]:
# Use this cell for any notes
# Rubric: https://utexas.instructure.com/courses/1275914/assignments/4897667
import pandas as pd, numpy as np
import warnings
warnings.simplefilter("ignore")
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import time

from sklearn import decomposition  
from sklearn.decomposition import PCA
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

%matplotlib inline

## Data Preprocessing
Data cleaning, data exploration, and feature engineering

In [None]:
#Read in data from csv
#For building purposes use one season to save processing time.
#For final runs we will switch to compiled data sheet with all seasons.
#Display initial data head

play_data = pd.read_csv('pbp-2019.csv')
play_data.head()

In [None]:
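#Convert time into a single value in seconds
#Display the original and converted columns side by side for comparison
play_data['AbsoluteTime'] = (play_data['Quarter']-1)*900 + play_data['Minute']*60 + play_data['Second']
play_data[['Quarter', 'Minute', 'Second', 'AbsoluteTime']].head()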
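
As a quick check of the conversion above, the sketch below applies the same formula to a few made-up rows. It also computes an elapsed-time variant under the assumption (not confirmed by anything in this notebook) that `Minute` and `Second` record the clock remaining in a 15-minute quarter. The `ElapsedTime` column and the toy values are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values) using the same column names as the play data
toy = pd.DataFrame({
    "Quarter": [1, 1, 2, 4],
    "Minute": [15, 7, 0, 2],
    "Second": [0, 30, 45, 5],
})

# Same formula as the cell above
toy["AbsoluteTime"] = (toy["Quarter"] - 1) * 900 + toy["Minute"] * 60 + toy["Second"]

# Assumption: Minute/Second are the time REMAINING in the quarter.
# Under that assumption, seconds elapsed since kickoff would be:
toy["ElapsedTime"] = toy["Quarter"] * 900 - (toy["Minute"] * 60 + toy["Second"])

print(toy)
```

If the assumption holds, `AbsoluteTime` does not increase steadily as a quarter is played, while `ElapsedTime` does. Which one is appropriate depends on how the source data defines the clock columns.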

In [None]:
#Convert GameDate into just the month to represent time of year
#Keep the text between the first two hyphens (assumes dates look like YYYY-MM-DD)
play_data['GameDate'] = play_data['GameDate'].str.extract(r"-(.*?)-", expand=False)

In [None]:
#rename returns a new DataFrame, so assign the result back
play_data = play_data.rename(columns={"GameDate": "GameMonth"})
play_data.head()
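
The month extraction and rename can also be sketched with pandas' datetime parsing. This is a minimal example on made-up dates, assuming `GameDate` is stored as `YYYY-MM-DD` strings. The notebook itself uses a regular expression instead.

```python
import pandas as pd

# Toy dates, assuming GameDate is stored as "YYYY-MM-DD" strings
toy = pd.DataFrame({"GameDate": ["2019-09-05", "2019-11-28", "2019-12-29"]})

# Parse once, then take the month as an integer
toy["GameDate"] = pd.to_datetime(toy["GameDate"], format="%Y-%m-%d").dt.month

# rename returns a new DataFrame, so the result has to be assigned back
toy = toy.rename(columns={"GameDate": "GameMonth"})

print(toy)
```

Parsing to an integer month also avoids carrying the month around as a string such as `"09"`.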

In [None]:
#Purge other data not needed

#Drop Data that has no effect or could mislead models
# No longer need Quarter, Minute, Seconds
# GameID has no effect on the play
# SeriesFirstDown has no description
# NextScore is 0 for every row. Has no effect.
play_data = play_data.drop(['Quarter', 'Minute', 'Second', 'GameId', 'Unnamed: 10', 'Unnamed: 12', 'Unnamed: 16', 'Unnamed: 17', 'SeriesFirstDown', 'NextScore', 'TeamWin', 'Description', 'OffenseTeam', 'DefenseTeam', 'SeasonYear'], axis=1)

# Combine RushDirection and PassType into one column describing the play type
# The original PlayType column is dropped because it carries the same information in less detail
play_data['RushDirection'] = play_data['RushDirection'].fillna('')
play_data['PassType'] = play_data['PassType'].fillna('')
play_data['PlayType2'] = play_data['RushDirection'] + play_data['PassType']

play_data = play_data.drop(['PlayType', 'PassType', 'RushDirection', 'YardLineDirection'], axis=1)
play_data.head(50)

In [None]:
#Count plays that have neither a rush direction nor a pass type
c = (play_data['PlayType2'] == '').sum()
print(c)
play_data.describe()

In [None]:
#Drop rows where it is not a rush/pass play
# Get names of indexes for which plays are not rush or pass
indexNames = play_data[(play_data['IsRush'] == 0) & (play_data['IsPass'] == 0)].index
 
# Delete these row indexes from dataFrame
play_data.drop(indexNames , inplace=True)
play_data.describe()

In [None]:
#This took care of most of the nulls. Dropping the rest is a small fraction of our data
# Get names of indexes for which plays are not specified
indexNames = play_data[play_data['PlayType2'] == ''].index
 
# Delete these row indexes from dataFrame
play_data.drop(indexNames , inplace=True)

#Confirm that no unspecified plays remain
c = (play_data['PlayType2'] == '').sum()
print(c)
play_data.describe()

In [None]:
#Label-encode the categorical data
from sklearn.preprocessing import LabelEncoder

# Create the label encoder
labelencoder = LabelEncoder()
# Assign an integer code to each category and store it in a new column
# (fit_transform refits the encoder for each column)
play_data['Formation_Code'] = labelencoder.fit_transform(play_data['Formation'])
play_data['PlayType_Code'] = labelencoder.fit_transform(play_data['PlayType2'])

play_data_encoded = play_data.drop(['Formation', 'PlayType2'], axis=1)
play_data_encoded
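
Label encoding assigns integers in sorted order, which gives the categories an order and spacing that distance-based models such as KNN may treat as meaningful. The sketch below contrasts it with one-hot encoding on a few illustrative formation names. The values are made up for the example, and one-hot encoding is shown as a possible alternative, not as what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative formation names (hypothetical sample, not read from the dataset)
toy = pd.DataFrame({"Formation": ["SHOTGUN", "UNDER CENTER", "NO HUDDLE", "SHOTGUN"]})

# Label encoding: one integer per category, assigned in sorted order
codes = LabelEncoder().fit_transform(toy["Formation"])
print(codes)

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(toy["Formation"], prefix="Formation", dtype=int)
print(one_hot)
```

Tree-based models are usually less sensitive to this choice than KNN or a neural network, so the trade-off is mainly extra columns against an artificial ordering.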

In [None]:
# To predict a touchdown, we must drop data that cannot be known prior to the play
play_data_isTD = play_data_encoded.drop(['Yards', 'IsIncomplete', 'IsSack', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
play_data_isTD.head()

In [None]:
# To predict an interception, we must drop data that cannot be known prior to the play
play_data_isINT = play_data_encoded.drop(['Yards', 'IsFumble', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsIncomplete', 'IsSack', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
play_data_isINT.head()

In [None]:
# To predict an incomplete pass, we must drop data that cannot be known prior to the play
play_data_isIC = play_data_encoded.drop(['Yards', 'IsFumble', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsSack', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)

# Incomplete only applies to passing plays, so drop all rows where IsRush = 1
play_data_isIC = play_data_isIC[play_data_isIC['IsRush'] != 1]

play_data_isIC.head()

In [None]:
# To predict a fumble, we must drop data that cannot be known prior to the play
play_data_isFum = play_data_encoded.drop(['Yards', 'IsIncomplete', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsSack', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
play_data_isFum.head()

In [None]:
#set features and labels

#for touchdowns
labels_TD = play_data_isTD['IsTouchdown']
features_TD = play_data_isTD.drop(['IsTouchdown'], axis=1)

#for interceptions
labels_Int = play_data_isINT['IsInterception']
features_Int = play_data_isINT.drop(['IsInterception'], axis=1)

#for fumbles
labels_Fum = play_data_isFum['IsFumble']
features_Fum = play_data_isFum.drop(['IsFumble'], axis=1)

#for incompletes
labels_IC = play_data_isIC['IsIncomplete']
features_IC = play_data_isIC.drop(['IsIncomplete'], axis=1)

## Data Analysis

### Decision Trees

In [16]:
def Decision_Tree_Football(features, labels):

    #Scale features
    ss = StandardScaler()
    features_scaled = ss.fit_transform(X=features)
    features_scaled = pd.DataFrame(features_scaled)

    #Split data into training and test data
    #Note: this split and the PCA output below are not used by the cross-validation that follows
    features_train, features_test, labels_train, labels_test = train_test_split(features_scaled, labels, test_size=.2)

    #Perform PCA, keeping enough components to explain 95% of the variance
    pca = PCA(n_components = 0.95, svd_solver='full')
    features_train_pca = pd.DataFrame(pca.fit_transform(features_train))
    features_test_pca = pd.DataFrame(pca.transform(features_test))

    #Perform 10-fold cross validation on an untuned decision tree (kept for reference, not returned)
    k_fold_tree = tree.DecisionTreeClassifier()
    cross_score = cross_val_score(k_fold_tree, features_scaled, labels, cv = 10)

    #Tune model with best parameters using GridSearch
    grid_tree = tree.DecisionTreeClassifier()
    grid_search = GridSearchCV(grid_tree, 
                          {'max_depth': [5,10,15,20,25],
                          'min_samples_leaf': [5,10,15,20],
                          'max_features': [2,4,6,8,10]},
                          cv = 10, scoring = 'accuracy')
    grid_search.fit(features_scaled, labels)
    best_params = grid_search.best_params_

    #Pass GridSearchCV into cross_val_score
    final_report = cross_val_score(grid_search, features_scaled, labels, cv = 10)
    avg_accuracy = final_report.mean()

    vals = [best_params, final_report, avg_accuracy]

    return vals
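
One way to make scaling and PCA part of the tuning is to place them in a `Pipeline` and hand that pipeline to `GridSearchCV`, so each fold fits the preprocessing on its own training portion. The sketch below does this on synthetic data from `make_classification` with a deliberately small grid. It is an illustration under those assumptions, not the configuration that produced the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the play features (not the NFL data)
X, y = make_classification(n_samples=300, n_features=12, n_informative=5, random_state=0)

# Scaling and PCA live inside the pipeline, so each CV fold refits them
# on its own training portion only
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Deliberately small grid so the sketch runs quickly
param_grid = {"tree__max_depth": [5, 10], "tree__min_samples_leaf": [5, 10]}
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")

# Nested cross-validation: the outer folds score the whole tuning procedure
scores = cross_val_score(grid_search, X, y, cv=3)
print(scores.mean())
```

The `tree__` prefix routes each grid entry to the pipeline step named `tree`, the same convention the neural network pipeline in this notebook uses with `mlpclassifier__`.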

In [30]:
#Decision Trees for Touchdowns

#Call Decision Tree method
TD_vals_DT = Decision_Tree_Football(features_TD, labels_TD)
TD_best_params = TD_vals_DT[0]
TD_DT_accuracy = TD_vals_DT[2]

print(TD_best_params)
print('Accuracy for predicting touchdowns using Decision Trees: ' + str(TD_DT_accuracy))

Accuracy for predicting touchdowns using Decision Trees: 0.9373757352318206


In [18]:
#Decision Trees for Interceptions

#Call Decision Tree method
Int_vals_DT = Decision_Tree_Football(features_Int, labels_Int)
Int_best_params = Int_vals_DT[0]
Int_DT_accuracy = Int_vals_DT[2]

print('Accuracy for predicting interceptions using Decision Trees: ' + str(Int_DT_accuracy))

Accuracy for predicting interceptions using Decision Trees: 0.9856508858383302


In [19]:
#Decision Trees for Fumbles

#Call Decision Tree method
Fum_vals_DT = Decision_Tree_Football(features_Fum, labels_Fum)
Fum_best_params = Fum_vals_DT[0]
Fum_DT_accuracy = Fum_vals_DT[2]

print('Accuracy for predicting fumbles using Decision Trees: ' + str(Fum_DT_accuracy))

Accuracy for predicting fumbles using Decision Trees: 0.9912311389431917


In [20]:
#Decision Trees for Incomplete Passes

#Call Decision Tree method
IC_vals_DT = Decision_Tree_Football(features_IC, labels_IC)
IC_best_params = IC_vals_DT[0]
IC_DT_accuracy = IC_vals_DT[2]

print('Accuracy for predicting incomplete passes using Decision Trees: ' + str(IC_DT_accuracy))

Accuracy for predicting incomplete passes using Decision Trees: 0.6626137418074025


### KNN

In [74]:
def KNN_Football(features, labels):
    
    #Scale features
    standard_scaler = StandardScaler()
    features_scaled = standard_scaler.fit_transform(X=features)
    features_scaled = pd.DataFrame(features_scaled)
    
    #Create KNeighborsClassifier()
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    knn_grid_search = GridSearchCV(KNeighborsClassifier(), param_grid = {'n_neighbors': numbers}, cv = 10, scoring = 'accuracy')
    knn_grid_search = knn_grid_search.fit(features_scaled, labels) 
    best_params = knn_grid_search.best_params_
    
    #Pass GridSearchCV into cross_val_score
    final_report = cross_val_score(knn_grid_search, features_scaled, labels, cv = 10)
    avg_accuracy = final_report.mean()
    return [best_params, final_report, avg_accuracy]

In [75]:
#KNN for Touchdowns

#Call KNN method
TD_vals_KNN = KNN_Football(features_TD, labels_TD)
TD_KNN_best_params = TD_vals_KNN[0]
TD_KNN_accuracy = TD_vals_KNN[2]

print('Accuracy for predicting touchdowns using K-Nearest Neighbors: ' + str(TD_KNN_accuracy))

Accuracy for predicting touchdowns using K-Nearest Neighbors: 0.9301682397830874


In [None]:
#KNN for Interceptions

#Call KNN method
Int_vals_KNN = KNN_Football(features_Int, labels_Int)
Int_KNN_best_params = Int_vals_KNN[0]
Int_KNN_accuracy = Int_vals_KNN[2]

print('Accuracy for predicting interceptions using K-Nearest Neighbors: ' + str(Int_KNN_accuracy))

In [None]:
#KNN for Fumbles

#Call KNN method
Fum_vals_KNN = KNN_Football(features_Fum, labels_Fum)
Fum_KNN_best_params = Fum_vals_KNN[0]
Fum_KNN_accuracy = Fum_vals_KNN[2]

print('Accuracy for predicting fumbles using K-Nearest Neighbors: ' + str(Fum_KNN_accuracy))

In [None]:
#KNN for Incomplete Passes

#Call KNN method
IC_vals_KNN = KNN_Football(features_IC, labels_IC)
IC_KNN_best_params = IC_vals_KNN[0]
IC_KNN_accuracy = IC_vals_KNN[2]

print('Accuracy for predicting incomplete passes using K-Nearest Neighbors: ' + str(IC_KNN_accuracy))

### Naive Bayes

In [None]:
def Naive_Bayes_Football(features, labels):
    
    #Gaussian Naive Bayes fits a mean and variance per feature for each class, so the features are passed in unscaled.
    gaussian_classifier = GaussianNB()
    cv_accuracy = cross_val_score(gaussian_classifier, features, labels, cv = 10).mean() 
    return cv_accuracy
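
The sketch below checks, on synthetic data, that `GaussianNB` gives essentially the same predictions with and without standardization. Agreement is expected to be near 1.0 rather than guaranteed exact, because the classifier's `var_smoothing` term depends on the largest feature variance. The data and the column scales are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic data with features deliberately placed on different scales
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X = X * np.array([1, 10, 100, 1, 10, 100])

X_scaled = StandardScaler().fit_transform(X)

# Fit the same model on raw and on standardized features
raw_pred = GaussianNB().fit(X, y).predict(X)
scaled_pred = GaussianNB().fit(X_scaled, y).predict(X_scaled)

# Fraction of samples where both versions predict the same class
agreement = (raw_pred == scaled_pred).mean()
print(agreement)
```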

In [None]:
#Naive Bayes for Touchdowns

#Call Naive Bayes method
print('Accuracy for predicting touchdowns using Naive Bayes: ' + str(Naive_Bayes_Football(features_TD, labels_TD)))

In [None]:
#Naive Bayes for Interceptions

#Call Naive Bayes method
print('Accuracy for predicting interceptions using Naive Bayes: ' + str(Naive_Bayes_Football(features_Int, labels_Int)))

In [None]:
#Naive Bayes for Fumbles

#Call Naive Bayes method
print('Accuracy for predicting fumbles using Naive Bayes: ' + str(Naive_Bayes_Football(features_Fum, labels_Fum)))

In [None]:
#Naive Bayes for Incomplete Passes

#Call Naive Bayes method
print('Accuracy for predicting incomplete passes using Naive Bayes: ' + str(Naive_Bayes_Football(features_IC, labels_IC)))

### Neural Net

In [33]:
def Neural_Net_Football(features, labels):

    #Create the scaler
    standard_scaler = StandardScaler()

    #Create the multi-layer perceptron
    mlp = MLPClassifier()
    neural_pipeline = Pipeline(steps = [('scaler', standard_scaler), ('mlpclassifier', mlp)])
    mlp_parameters = {'mlpclassifier__hidden_layer_sizes': [(10,)], # one hidden layer of 10 units
                      'mlpclassifier__activation': ['logistic', 'tanh', 'relu']}

    #Tune model with best parameters using GridSearch
    mlp_grid_search = GridSearchCV(neural_pipeline, param_grid = mlp_parameters, cv = 5, scoring = 'accuracy')
    mlp_grid_search.fit(features, labels)
    best_params = mlp_grid_search.best_params_

    #Pass GridSearchCV into cross_val_score
    final_report = cross_val_score(mlp_grid_search, features, labels, cv = 10)
    avg_accuracy = final_report.mean()

    return [best_params, final_report, avg_accuracy]

In [24]:
#Neural Nets for Touchdowns

#Call Neural Network method
TD_vals_NN = Neural_Net_Football(features_TD, labels_TD)
TD_NN_best_params = TD_vals_NN[0]
TD_NN_accuracy = TD_vals_NN[2]

print('Accuracy for predicting touchdowns using Neural Networks: ' + str(TD_NN_accuracy))

Accuracy for predicting touchdowns using Neural Networks: 0.9429194284968208


In [25]:
#Neural Nets for Interceptions

#Call Neural Network method
Int_vals_NN = Neural_Net_Football(features_Int, labels_Int)
Int_NN_best_params = Int_vals_NN[0]
Int_NN_accuracy = Int_vals_NN[2]

print('Accuracy for predicting interceptions using Neural Networks: ' + str(Int_NN_accuracy))

Accuracy for predicting interceptions using Neural Networks: 0.985338999495248


In [27]:
#Neural Nets for Fumbles

#Call Neural Network method
Fum_vals_NN = Neural_Net_Football(features_Fum, labels_Fum)
Fum_NN_best_params = Fum_vals_NN[0]
Fum_NN_accuracy = Fum_vals_NN[2]

print('Accuracy for predicting fumbles using Neural Networks: ' + str(Fum_NN_accuracy))

Accuracy for predicting fumbles using Neural Networks: 0.9912311389431917


In [29]:
#Neural Nets for Incomplete Passes

#Call Neural Network method
IC_vals_NN = Neural_Net_Football(features_IC, labels_IC)
IC_NN_best_params = IC_vals_NN[0]
IC_NN_accuracy = IC_vals_NN[2]

print('Accuracy for predicting incomplete passes using Neural Networks: ' + str(IC_NN_accuracy))

Accuracy for predicting incomplete passes using Neural Networks: 0.6720319090423775


### Ensembles

In [None]:
def Random_Forests_Football(features, labels):
    
    #Create the Random Forest Classifier
    random_forest_classifier = RandomForestClassifier()
    forest_parameters = {'max_depth': [5, 10, 15, 20, 25], 
                         'min_samples_leaf': [5, 10, 15, 20], 
                         'max_features': ['sqrt', 'log2']}
    
    #Tune model with best parameters using GridSearch
    forest_grid_search = GridSearchCV(random_forest_classifier, param_grid = forest_parameters, cv = 5, scoring = 'accuracy')
    forest_grid_search = forest_grid_search.fit(features, labels)
    best_params = forest_grid_search.best_params_

    #Pass GridSearchCV into cross_val_score
    final_report = cross_val_score(forest_grid_search, features, labels, cv = 10)
    avg_accuracy = final_report.mean()

    return [best_params, final_report, avg_accuracy]

def Ada_Boost_Football(features, labels):
    
    #Create the ADA Boost Classifier
    ada_boost_classifier = AdaBoostClassifier()
    ada_parameters = {'n_estimators': list(range(50, 151, 25))}
    
    #Tune model with best parameters using GridSearch
    ada_grid_search = GridSearchCV(ada_boost_classifier, param_grid = ada_parameters, cv = 5, scoring = 'accuracy')
    ada_grid_search = ada_grid_search.fit(features, labels)
    best_params = ada_grid_search.best_params_

    #Pass GridSearchCV into cross_val_score
    final_report = cross_val_score(ada_grid_search, features, labels, cv = 10)
    avg_accuracy = final_report.mean()
    
    return [best_params, final_report, avg_accuracy]
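
The model functions in this notebook share one pattern: grid-search the hyperparameters, then pass the search object to `cross_val_score`. A minimal sketch of that pattern as a single reusable helper follows. The name `tune_and_score`, its parameters, the synthetic data, and the small grids are all hypothetical and chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def tune_and_score(estimator, param_grid, features, labels, inner_cv=3, outer_cv=3):
    """Hypothetical helper: grid-search an estimator, then score the search with an outer CV."""
    grid_search = GridSearchCV(estimator, param_grid=param_grid, cv=inner_cv, scoring="accuracy")
    # Fit once on all data to report the best parameters
    best_params = grid_search.fit(features, labels).best_params_
    # The outer CV refits a clone of the search on each training fold
    fold_scores = cross_val_score(grid_search, features, labels, cv=outer_cv)
    return {"best_params": best_params, "fold_scores": fold_scores, "avg_accuracy": fold_scores.mean()}


# Synthetic stand-in data (not the NFL data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

results = {
    "KNN": tune_and_score(KNeighborsClassifier(), {"n_neighbors": [3, 5]}, X, y),
    "Random Forest": tune_and_score(
        RandomForestClassifier(n_estimators=20, random_state=0), {"max_depth": [3, 5]}, X, y
    ),
}
for name, result in results.items():
    print(name, result["best_params"], round(result["avg_accuracy"], 3))
```

Returning a dictionary instead of a positional list makes call sites read as `result["avg_accuracy"]` rather than `vals[2]`.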

In [None]:
#Random Forests and ADA Boost for Touchdowns

#Call Random Forests method
TD_vals_RF = Random_Forests_Football(features_TD, labels_TD)
TD_RF_best_params = TD_vals_RF[0]
TD_RF_accuracy = TD_vals_RF[2]

#Call ADA Boost method
TD_vals_AB = Ada_Boost_Football(features_TD, labels_TD)
TD_AB_best_params = TD_vals_AB[0]
TD_AB_accuracy = TD_vals_AB[2]

print('Accuracy for predicting touchdowns using Random Forests: ' + str(TD_RF_accuracy))
print('Accuracy for predicting touchdowns using ADA Boost: ' + str(TD_AB_accuracy))

In [None]:
#Random Forests and ADA Boost for Interceptions

#Call Random Forests method
Int_vals_RF = Random_Forests_Football(features_Int, labels_Int)
Int_RF_best_params = Int_vals_RF[0]
Int_RF_accuracy = Int_vals_RF[2]

#Call ADA Boost method
Int_vals_AB = Ada_Boost_Football(features_Int, labels_Int)
Int_AB_best_params = Int_vals_AB[0]
Int_AB_accuracy = Int_vals_AB[2]

print('Accuracy for predicting interceptions using Random Forests: ' + str(Int_RF_accuracy))
print('Accuracy for predicting interceptions using ADA Boost: ' + str(Int_AB_accuracy))

In [None]:
#Random Forests and ADA Boost for Fumbles

#Call Random Forests method
Fum_vals_RF = Random_Forests_Football(features_Fum, labels_Fum)
Fum_RF_best_params = Fum_vals_RF[0]
Fum_RF_accuracy = Fum_vals_RF[2]

#Call ADA Boost method
Fum_vals_AB = Ada_Boost_Football(features_Fum, labels_Fum)
Fum_AB_best_params = Fum_vals_AB[0]
Fum_AB_accuracy = Fum_vals_AB[2]

print('Accuracy for predicting fumbles using Random Forests: ' + str(Fum_RF_accuracy))
print('Accuracy for predicting fumbles using ADA Boost: ' + str(Fum_AB_accuracy))

In [None]:
#Random Forests and ADA Boost for Incomplete Passes

#Call Random Forests method
IC_vals_RF = Random_Forests_Football(features_IC, labels_IC)
IC_RF_best_params = IC_vals_RF[0]
IC_RF_accuracy = IC_vals_RF[2]

#Call ADA Boost method
IC_vals_AB = Ada_Boost_Football(features_IC, labels_IC)
IC_AB_best_params = IC_vals_AB[0]
IC_AB_accuracy = IC_vals_AB[2]

print('Accuracy for predicting incomplete passes using Random Forests: ' + str(IC_RF_accuracy))
print('Accuracy for predicting incomplete passes using ADA Boost: ' + str(IC_AB_accuracy))

## Model Analysis

In [None]:
#Compare accuracy scores and other metrics for our different models.
#How confident are we in the success rates of these various models?
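
Touchdowns, interceptions, and fumbles occur on only a small share of plays, so one way to put the accuracy scores in context is to compare them with a model that always predicts the majority class. The sketch below does this on synthetic imbalanced data and also reports balanced accuracy. The data, the 5% positive rate, and the choice of metrics are assumptions for illustration, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 5% positives (a stand-in, not the NFL data)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0)

models = {
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

summary = {}
for name, model in models.items():
    # Score each model with 5-fold CV on two metrics at once
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "balanced_accuracy"])
    summary[name] = {
        "accuracy": cv["test_accuracy"].mean(),
        "balanced_accuracy": cv["test_balanced_accuracy"].mean(),
    }
    print(name, summary[name])
```

A model whose accuracy is close to the baseline's has learned little beyond the class frequencies, even when that accuracy looks high.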

In [None]:
#Discuss which model was the best.

In [None]:
#Discuss data. What issues may have existed in the data?  What assumptions did we make? What could have made our data better?

In [None]:
#Discuss our project as a whole. How could we have improved project? How might this model be used in real world applications?