# CS329E Data Analytics Project

**Team Members:** *Bryce Holladay, Joshua Mathew, Austin Rinn, Eddie Castillo*

Using the techniques we learned in class, we attempted to predict the result of a National Football League (NFL) play from information available before the play begins, such as field position and time remaining in the game.

We used [publicly available play-by-play data from the years 2013 through 2019](http://nflsavant.com/about.php) to build our model. As inputs, our model takes time, down, yards to go, yard line, and offensive formation. The data includes several play outcomes that we tried to predict, including touchdowns, interceptions, sacks, first downs, yards, and penalties.

In order to fit the data into our model, we performed several actions to pre-process it, including reformatting time into a linear format and removing non-descriptive data like season year. The results of our model are shown below.

In [2]:
# Use this cell for any notes
# Rubric: https://utexas.instructure.com/courses/1275914/assignments/4897667
import pandas as pd, numpy as np

## Data Preprocessing
Data cleaning, data exploration, and feature engineering

In [162]:
#Read in data from csv, one file per season
#Combine the seasons into a single compiled data sheet
#Display initial data head in the next cell

df19 = pd.read_csv('pbp-2019.csv')
df18 = pd.read_csv('pbp-2018.csv')
df17 = pd.read_csv('pbp-2017.csv')
df16 = pd.read_csv('pbp-2016.csv')
df15 = pd.read_csv('pbp-2015.csv')
df14 = pd.read_csv('pbp-2014.csv')
df13 = pd.read_csv('pbp-2013.csv')
df13 = df13.drop(['Unnamed: 45', 'Unnamed: 46', 'Unnamed: 47'], axis=1)
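# Note: df13 is cleaned above but is not in this list, so the combined frame covers 2014 through 2019
frames = [df19, df18, df17, df16, df15, df14]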

df = pd.concat(frames, ignore_index=True)
df.to_csv('pbp.csv')



In [163]:
df.head()

Unnamed: 0,GameId,GameDate,Quarter,Minute,Second,OffenseTeam,DefenseTeam,Down,ToGo,YardLine,...,IsTwoPointConversion,IsTwoPointConversionSuccessful,RushDirection,YardLineFixed,YardLineDirection,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards
0,2019100605,2019-10-06,1,2,25,OAK,CHI,1,10,50,...,0,0,,50,OPP,0,,0,,0
1,2019100605,2019-10-06,1,1,45,OAK,CHI,2,9,51,...,0,0,RIGHT GUARD,49,OPP,0,,0,,0
2,2019101400,2019-10-14,1,10,34,DET,GB,1,10,84,...,0,0,RIGHT TACKLE,16,OPP,0,,0,,0
3,2019101400,2019-10-14,1,9,55,DET,GB,2,9,85,...,0,0,,15,OPP,0,,0,,0
4,2019101400,2019-10-14,1,9,10,DET,GB,3,3,91,...,0,0,,9,OPP,0,,0,,0


In [169]:
#Convert time into a single linear value in seconds (900 seconds per quarter)
df['AbsoluteTime'] = (df['Quarter']-1)*900 + df['Minute']*60 + df['Second'] 
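
As a quick check of the conversion above, here is a minimal sketch that applies the same formula to a few rows from the sample output. The helper name `absolute_time` is hypothetical and only used for illustration; it works on plain integers as well as on pandas columns.

```python
def absolute_time(quarter, minute, second):
    """Same formula as the notebook cell: 900 seconds per quarter."""
    return (quarter - 1) * 900 + minute * 60 + second

# Values taken from rows shown in the notebook output
print(absolute_time(1, 2, 25))   # 145
print(absolute_time(2, 1, 43))   # 1003
```

The two printed values match the `AbsoluteTime` column for rows 0 and 95 in the output further down.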


In [139]:
#Convert GameDate into just month to represent time of year
#import re
#pattern = "-(.*?)\-"
#for index in range(df.shape[0]):
#   df['GameDate'][index] = re.search(pattern, str(df['GameDate'][index])).group(1)

df['GameMonth'] = pd.DatetimeIndex(df['GameDate']).month
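
An equivalent way to extract the month, sketched here on two dates from the sample rows, is `pd.to_datetime` with the `.dt` accessor. This is only an alternative spelling under the assumption that `GameDate` holds ISO-style date strings, as it does in the output shown.

```python
import pandas as pd

# Two GameDate values copied from the sample output
dates = pd.Series(['2019-10-06', '2019-12-22'], name='GameDate')

# Parse the strings, then pull out the month number
months = pd.to_datetime(dates).dt.month
print(months.tolist())  # [10, 12]
```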

In [140]:
df.head(100)

Unnamed: 0,GameId,GameDate,Quarter,Minute,Second,OffenseTeam,DefenseTeam,Down,ToGo,YardLine,...,RushDirection,YardLineFixed,YardLineDirection,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards,AbsoluteTime,GameMonth
0,2019100605,2019-10-06,1,2,25,OAK,CHI,1,10,50,...,,50,OPP,0,,0,,0,145,10
1,2019100605,2019-10-06,1,1,45,OAK,CHI,2,9,51,...,RIGHT GUARD,49,OPP,0,,0,,0,105,10
2,2019101400,2019-10-14,1,10,34,DET,GB,1,10,84,...,RIGHT TACKLE,16,OPP,0,,0,,0,634,10
3,2019101400,2019-10-14,1,9,55,DET,GB,2,9,85,...,,15,OPP,0,,0,,0,595,10
4,2019101400,2019-10-14,1,9,10,DET,GB,3,3,91,...,,9,OPP,0,,0,,0,550,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2019122214,2019-12-22,2,1,43,KC,CHI,1,10,67,...,,33,OPP,0,,0,,0,1003,12
96,2019122214,2019-12-22,2,1,53,KC,CHI,1,10,62,...,,38,OPP,1,CHI,1,ILLEGAL USE OF HANDS,5,1013,12
97,2019122214,2019-12-22,2,2,0,KC,CHI,4,4,57,...,,43,OPP,1,CHI,1,RUNNING INTO THE KICKER,5,1020,12
98,2019122214,2019-12-22,2,4,5,KC,CHI,2,9,52,...,,48,OPP,0,KC,1,ILLEGAL BLOCK ABOVE THE WAIST,0,1145,12


##### Drop Data that has no effect or could mislead models

In [141]:
#Purge other data not needed
# No longer need Quarter, Minute, Seconds
# GameID has no effect on the play
# SeriesFirstDown has no description
# NextScore is 0 for every row. Has no effect.
df2 = df.drop(['Quarter', 'Minute', 'Second', 'GameDate', 'GameId', 'Unnamed: 10', 'Unnamed: 12', 'Unnamed: 16', 'Unnamed: 17', 'SeriesFirstDown', 'NextScore', 'TeamWin', 'Description', 'OffenseTeam', 'DefenseTeam', 'SeasonYear'], axis=1)
df2.head()
df2.describe()

Unnamed: 0,Down,ToGo,YardLine,Yards,IsRush,IsPass,IsIncomplete,IsTouchdown,IsSack,IsChallenge,...,IsFumble,IsPenalty,IsTwoPointConversion,IsTwoPointConversionSuccessful,YardLineFixed,IsPenaltyAccepted,IsNoPlay,PenaltyYards,AbsoluteTime,GameMonth
count,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,...,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0,270418.0
mean,1.675935,7.277992,44.738035,4.185535,0.290077,0.414229,0.149058,0.030416,0.028253,0.004985,...,0.013793,0.086174,0.002208,0.00108,26.693401,0.075228,0.055732,0.622851,1825.981103,10.361651
std,1.173453,4.949102,26.830167,8.281275,0.453798,0.492589,0.356146,0.171729,0.165694,0.070428,...,0.116633,0.280621,0.046934,0.032843,14.294987,0.263759,0.229404,2.632891,1043.497361,1.761597
min,0.0,0.0,0.0,-23.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,1.0,3.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,16.0,0.0,0.0,0.0,932.0,10.0
50%,2.0,9.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,28.0,0.0,0.0,0.0,1800.0,11.0
75%,2.0,10.0,65.0,6.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,38.0,0.0,0.0,0.0,2768.0,12.0
max,4.0,46.0,99.0,104.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,50.0,1.0,1.0,66.0,4500.0,12.0


##### Combine RushDirection and PassType in order to get 1 highly descriptive column for the play. Can delete some columns after this

In [142]:
# Combine RushDirection and PassType to get one column with play type
# The PlayType column is no longer needed because it carries the same information in less detail
df2['RushDirection'] = df2['RushDirection'].fillna('')
df2['PassType'] = df2['PassType'].fillna('')
df2['PlayType2'] = df2['RushDirection'] + df2['PassType']
df2 = df2.drop('PlayType', axis=1)
df2.shape

(270418, 31)
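
The concatenation above only yields a clean label when at most one of the two columns is filled in for a given row. A minimal sketch on a toy frame (the data below is made up for illustration) shows the combination and a sanity check for rows where both are populated:

```python
import pandas as pd

# Toy rows: one rush, one pass, one play with neither
toy = pd.DataFrame({
    'RushDirection': ['RIGHT GUARD', None, None],
    'PassType': [None, 'SHORT LEFT', None],
})

# Flag rows where both columns are set; these would be glued together
# without a separator by the concatenation below
both = toy['RushDirection'].notna() & toy['PassType'].notna()

toy['PlayType2'] = toy['RushDirection'].fillna('') + toy['PassType'].fillna('')
print(toy['PlayType2'].tolist())  # ['RIGHT GUARD', 'SHORT LEFT', '']
print(int(both.sum()))            # 0
```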

In [161]:
df2.PlayType2.unique()

array(['SHORT LEFT', 'RIGHT GUARD', 'RIGHT TACKLE', 'SHORT MIDDLE',
       'DEEP LEFT', '', 'LEFT END', 'SHORT RIGHT', 'LEFT TACKLE',
       'RIGHT END', 'LEFT GUARD', 'DEEP RIGHT', 'CENTER', 'DEEP MIDDLE',
       'MIDDLE. PENALTY', 'YARDS &', 'LEFT TO', 'MIDDLE TO', 'BACK TO',
       '(SHOTGUN) 10-THILL', 'INTERCEPTED BY', 'NOT LISTED', 'RIGHT TO',
       'RULING, AND', '(6:33) 11-ASMITH', '(4:03) (NO',
       '(6:41) (SHOTGUN)', 'INTENDED FOR', '(55-A.BROOKS) [53-NBOWMAN]',
       '(:15) (SHOTGUN)', 'KESSLER THROUGH', 'RIGHT (58-JHICKS)',
       '[33-E.GAINES]. LA-33-EGAINES', 'RIGHT. PENALTY',
       '[57-N.SPENCE]. PENALTY', '(94-C.LIUGET) [99-JBOSA]',
       '(6:44) (SHOTGUN)', 'PASS RULING,', '(13:19) 5-TTAYLOR', 'IN 119',
       '(10:14) 17-PRIVERS', '(10:01) (SHOTGUN)', '[20-C.GRAHAM]. THROWN',
       '[55-S.TULLOCH]. PENALTY', '(:21) 5-TBRIDGEWATER',
       '[31-M.ALEXANDER]. PENALTY', '(4:54) 2-JMANZIEL',
       '[58-V.MILLER]. THE', '(4:02) (SHOTGUN)', '(11:52) 11-ASMITH',
 

In [143]:
# PlayType was already dropped from df2, so no rename is needed here
df2 = df2.drop(['PassType', 'RushDirection', 'YardLineDirection'], axis=1)
df2.head(50)
df2.describe()
df2.to_csv('pbp.csv')
df2.shape

(270418, 28)

In [144]:
c = (df2['PlayType2'] == '').sum()
print(c)
df2.tail(50)
df2.shape
df2.to_csv('pbp.csv')

84500


##### Drop rows where it is not a rush/pass play

In [145]:
# Get names of indexes for which plays are not rush or pass
#indexNames = df2[(df2['IsRush'] == 0) & (df2['IsPass'] == 0)].index

# Delete these row indexes from dataFrame
#df3 = df2.drop(indexNames , inplace=True)
#df2.describe()

rows = (df2['IsRush'] == 0) & (df2['IsPass'] == 0)
indexNames = df2[rows].index
df3 = df2.drop(indexNames)

In [147]:
df3.shape

(190457, 28)
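
The same filter can also be written by keeping the rows we want instead of dropping the ones we do not. The sketch below uses a made-up three-row frame; the column names match the notebook, everything else is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'IsRush': [1, 0, 0],
    'IsPass': [0, 1, 0],
    'PlayType2': ['CENTER', 'DEEP LEFT', ''],
})

# Keep plays that are a rush or a pass
keep = (toy['IsRush'] == 1) | (toy['IsPass'] == 1)
filtered = toy[keep].copy()  # .copy() gives an independent frame for later edits
print(filtered.shape)  # (2, 3)
```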

##### This took care of most of the nulls (blank play types). Dropping the remaining 4,539 rows, about 2.4% of the 190,457 left, costs only a small fraction of our data

In [150]:
# Get names of indexes for which plays are not specified
# Build the mask from df3 itself so it lines up with df3's index
indexNames = df3[df3['PlayType2'] == ''].index
 
# Delete these row indexes from dataFrame
df3.drop(indexNames , inplace=True)

  


In [151]:
c = (df3['PlayType2'] == '').sum()
print(c)
df3.head(50)
df3.describe()

0


Unnamed: 0,Down,ToGo,YardLine,Yards,IsRush,IsPass,IsIncomplete,IsTouchdown,IsSack,IsChallenge,...,IsFumble,IsPenalty,IsTwoPointConversion,IsTwoPointConversionSuccessful,YardLineFixed,IsPenaltyAccepted,IsNoPlay,PenaltyYards,AbsoluteTime,GameMonth
count,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,...,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0,185918.0
mean,1.776262,8.648383,47.853683,6.176228,0.397573,0.602427,0.216805,0.041637,0.0,0.006245,...,0.009273,0.071456,0.0,0.0,28.941238,0.059209,0.045687,0.583833,1831.902882,10.358309
std,0.812076,3.971083,24.543002,9.089445,0.489398,0.489398,0.41207,0.199758,0.0,0.078776,...,0.095849,0.257586,0.0,0.0,12.78639,0.236016,0.208806,2.776748,1047.815479,1.766075
min,0.0,0.0,0.0,-23.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,1.0,6.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,20.0,0.0,0.0,0.0,938.0,10.0
50%,2.0,10.0,44.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,29.0,0.0,0.0,0.0,1814.0,11.0
75%,2.0,10.0,67.0,9.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,40.0,0.0,0.0,0.0,2776.0,12.0
max,4.0,46.0,99.0,104.0,1.0,1.0,1.0,1.0,0.0,1.0,...,1.0,1.0,0.0,0.0,50.0,1.0,1.0,66.0,4500.0,12.0


##### Check how many unique values there are in categorical data

In [171]:
df3.Formation.unique()
df3.PlayType2.unique()

array(['SHORT LEFT', 'RIGHT GUARD', 'RIGHT TACKLE', 'SHORT MIDDLE',
       'DEEP LEFT', 'LEFT END', 'SHORT RIGHT', 'LEFT TACKLE', 'RIGHT END',
       'LEFT GUARD', 'DEEP RIGHT', 'CENTER', 'DEEP MIDDLE',
       'MIDDLE. PENALTY', 'YARDS &', 'LEFT TO', 'MIDDLE TO', 'BACK TO',
       '(SHOTGUN) 10-THILL', 'INTERCEPTED BY', 'NOT LISTED', 'RIGHT TO',
       'RULING, AND', '(6:33) 11-ASMITH', '(4:03) (NO',
       '(6:41) (SHOTGUN)', 'INTENDED FOR', '(55-A.BROOKS) [53-NBOWMAN]',
       '(:15) (SHOTGUN)', 'KESSLER THROUGH', 'RIGHT (58-JHICKS)',
       '[33-E.GAINES]. LA-33-EGAINES', 'RIGHT. PENALTY',
       '[57-N.SPENCE]. PENALTY', '(94-C.LIUGET) [99-JBOSA]',
       '(6:44) (SHOTGUN)', 'PASS RULING,', '(13:19) 5-TTAYLOR', 'IN 119',
       '(10:14) 17-PRIVERS', '(10:01) (SHOTGUN)', '[20-C.GRAHAM]. THROWN',
       '[55-S.TULLOCH]. PENALTY', '(:21) 5-TBRIDGEWATER',
       '[31-M.ALEXANDER]. PENALTY', '(4:54) 2-JMANZIEL',
       '[58-V.MILLER]. THE', '(4:02) (SHOTGUN)', '(11:52) 11-ASMITH',
     

In [198]:
print(df3.PlayType2.unique())
uniqueVal = df3.PlayType2.unique()
#df.RushDirection.unique()
# Look at the row behind one of the odd values
init_rows = (df3['PlayType2'] == 'MIDDLE. PENALTY')
indexNames = df3[init_rows]
indexNames

['SHORT LEFT' 'RIGHT GUARD' 'RIGHT TACKLE' 'SHORT MIDDLE' 'DEEP LEFT'
 'LEFT END' 'SHORT RIGHT' 'LEFT TACKLE' 'RIGHT END' 'LEFT GUARD'
 'DEEP RIGHT' 'CENTER' 'DEEP MIDDLE' 'MIDDLE. PENALTY' 'YARDS &' 'LEFT TO'
 'MIDDLE TO' 'BACK TO' '(SHOTGUN) 10-THILL' 'INTERCEPTED BY' 'NOT LISTED'
 'RIGHT TO' 'RULING, AND' '(6:33) 11-ASMITH' '(4:03) (NO'
 '(6:41) (SHOTGUN)' 'INTENDED FOR' '(55-A.BROOKS) [53-NBOWMAN]'
 '(:15) (SHOTGUN)' 'KESSLER THROUGH' 'RIGHT (58-JHICKS)'
 '[33-E.GAINES]. LA-33-EGAINES' 'RIGHT. PENALTY' '[57-N.SPENCE]. PENALTY'
 '(94-C.LIUGET) [99-JBOSA]' '(6:44) (SHOTGUN)' 'PASS RULING,'
 '(13:19) 5-TTAYLOR' 'IN 119' '(10:14) 17-PRIVERS' '(10:01) (SHOTGUN)'
 '[20-C.GRAHAM]. THROWN' '[55-S.TULLOCH]. PENALTY' '(:21) 5-TBRIDGEWATER'
 '[31-M.ALEXANDER]. PENALTY' '(4:54) 2-JMANZIEL' '[58-V.MILLER]. THE'
 '(4:02) (SHOTGUN)' '(11:52) 11-ASMITH' '(6:01) 11-ASMITH'
 'INTERFERENCE, 2' '(5:04) (NO' '(:31) (SHOTGUN)' 'MIDDLE [99-LHOUSTON]'
 '(:38) (NO' 'LEFT [94-JTRATTOU]' '(6:01) 9-ADAVIS']


Unnamed: 0,Down,ToGo,YardLine,Yards,Formation,IsRush,IsPass,IsIncomplete,IsTouchdown,IsSack,...,IsTwoPointConversionSuccessful,YardLineFixed,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards,AbsoluteTime,GameMonth,PlayType2
15416,2,9,91,0,SHOTGUN,0,1,1,0,0,...,0,9,1,PHI,1,DEFENSIVE OFFSIDE,5,2717,9,MIDDLE. PENALTY


##### Lots of these seem like misplaced data that will cause bad noise

##### Count how many times each value occurs to see whether the odd ones are anomalies

In [188]:


for phrase in uniqueVal:
    rows = (df3['PlayType2'] == phrase)
    
    index = df3[rows].index
    print(phrase,': ',len(index))



SHORT LEFT :  32915
RIGHT GUARD :  9412
RIGHT TACKLE :  8895
SHORT MIDDLE :  21243
DEEP LEFT :  8352
LEFT END :  9150
SHORT RIGHT :  36153
LEFT TACKLE :  8840
RIGHT END :  8037
LEFT GUARD :  8985
DEEP RIGHT :  8659
CENTER :  20597
DEEP MIDDLE :  4536
MIDDLE. PENALTY :  1
YARDS & :  1
LEFT TO :  4
MIDDLE TO :  5
BACK TO :  3
(SHOTGUN) 10-THILL :  1
INTERCEPTED BY :  3
NOT LISTED :  10
RIGHT TO :  3
RULING, AND :  1
(6:33) 11-ASMITH :  1
(4:03) (NO :  1
(6:41) (SHOTGUN) :  1
INTENDED FOR :  16
(55-A.BROOKS) [53-NBOWMAN] :  1
(:15) (SHOTGUN) :  1
KESSLER THROUGH :  1
RIGHT (58-JHICKS) :  1
[33-E.GAINES]. LA-33-EGAINES :  1
RIGHT. PENALTY :  2
[57-N.SPENCE]. PENALTY :  1
(94-C.LIUGET) [99-JBOSA] :  1
(6:44) (SHOTGUN) :  1
PASS RULING, :  63
(13:19) 5-TTAYLOR :  1
IN 119 :  1
(10:14) 17-PRIVERS :  1
(10:01) (SHOTGUN) :  1
[20-C.GRAHAM]. THROWN :  1
[55-S.TULLOCH]. PENALTY :  1
(:21) 5-TBRIDGEWATER :  1
[31-M.ALEXANDER]. PENALTY :  1
(4:54) 2-JMANZIEL :  1
[58-V.MILLER]. THE :  1
(4:02) (SHO

##### Looking at these results, many of the phrases don't even describe the play. They seem to be fragments mistakenly taken from the description,
##### which was already dropped. Dropping these rows loses only a small fraction of the data

In [205]:
#df.RushDirection.unique()
drop_rows = (df3['PlayType2'] == 'MIDDLE. PENALTY')
indexNames = df3[drop_rows]
indexNames

for phrase in uniqueVal[13:]:
    rows = (df3['PlayType2'] == phrase)
    index = df3[rows]
    indexNames = pd.concat([indexNames, index])

In [216]:
#df4 = df3.drop(dropIndexes, inplace=True)
df4 = df3.drop(indexNames.index)
df3.shape[0]- df4.shape[0]
df4.PlayType2.unique()

array(['SHORT LEFT', 'RIGHT GUARD', 'RIGHT TACKLE', 'SHORT MIDDLE',
       'DEEP LEFT', 'LEFT END', 'SHORT RIGHT', 'LEFT TACKLE', 'RIGHT END',
       'LEFT GUARD', 'DEEP RIGHT', 'CENTER', 'DEEP MIDDLE'], dtype=object)
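
Slicing `uniqueVal[13:]` relies on the order in which values first appear in the data. A sketch of an order-independent alternative is to keep only the values whose count clears a threshold. The threshold `min_count` below is a hypothetical choice; among the counts shown above, the rarest legitimate play type has 4,536 rows and the most common stray value has 63, so any cutoff between those would separate them.

```python
import pandas as pd

# Toy column: two real play types plus two stray fragments
toy = pd.Series(
    ['SHORT LEFT'] * 5 + ['CENTER'] * 4 + ['YARDS &', 'INTENDED FOR'],
    name='PlayType2',
)

counts = toy.value_counts()
min_count = 3  # hypothetical threshold for this toy data
common = counts[counts >= min_count].index

# Keep only rows whose play type is common enough
kept = toy[toy.isin(common)]
print(sorted(kept.unique()))  # ['CENTER', 'SHORT LEFT']
print(len(toy) - len(kept))   # 2
```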

##### Label Encode the categorical data

In [221]:
#Label Encode
from sklearn.preprocessing import LabelEncoder
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
# (fit_transform refits the encoder each time, so classes_ reflects only the last column fitted)
df4['Formation_Code'] = labelencoder.fit_transform(df4['Formation'])
df4['PlayType_Code'] = labelencoder.fit_transform(df4['PlayType2'])


In [222]:
df_encoded = df4.drop(['Formation', 'PlayType2'], axis=1)
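
Once the text columns are dropped, the integer codes are hard to read back. The sketch below keeps a code-to-label mapping using pandas categoricals. Note that this is a swapped-in technique, not the notebook's `LabelEncoder`; both assign codes in sorted order of the labels, so the numbers should agree, but treat that as an assumption to verify on the real data.

```python
import pandas as pd

play_types = pd.Series(['SHORT LEFT', 'RIGHT GUARD', 'CENTER', 'SHORT LEFT'])

# Categories are inferred in sorted order: CENTER, RIGHT GUARD, SHORT LEFT
cat = play_types.astype('category')
codes = cat.cat.codes

# Lookup table from integer code back to the original label
mapping = dict(enumerate(cat.cat.categories))
print(codes.tolist())  # [2, 1, 0, 2]
print(mapping[2])      # SHORT LEFT
```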

In [223]:
df_encoded

Unnamed: 0,Down,ToGo,YardLine,Yards,IsRush,IsPass,IsIncomplete,IsTouchdown,IsSack,IsChallenge,...,YardLineFixed,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards,AbsoluteTime,GameMonth,Formation_Code,PlayType_Code
0,1,10,50,1,0,1,0,0,0,0,...,50,0,,0,,0,145,10,1,10
1,2,9,51,3,1,0,0,0,0,0,...,49,0,,0,,0,105,10,4,8
2,1,10,84,1,1,0,0,0,0,0,...,16,0,,0,,0,634,10,4,9
3,2,9,85,6,0,1,0,0,0,0,...,15,0,,0,,0,595,10,3,11
4,3,3,91,6,0,1,0,0,0,0,...,9,0,,0,,0,550,10,3,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270410,1,10,42,32,0,1,0,0,0,0,...,42,0,,0,,0,2582,12,4,1
270411,1,10,74,4,0,1,0,0,0,0,...,26,0,,0,,0,2554,12,3,10
270412,2,6,78,-2,0,1,0,0,0,0,...,22,0,,0,,0,2522,12,3,12
270413,3,8,76,0,0,1,1,0,0,0,...,24,0,,0,,0,2481,12,3,11


##### Drop Data that can only be known after a play. Including this data would be "cheating"

### Create dataset for predicting touchdowns

In [224]:
# To predict a touchdown, we must drop data that cannot be known prior to the play
df_isTD = df_encoded.drop(['Yards', 'IsIncomplete', 'IsSack', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
df_isTD.head()

Unnamed: 0,Down,ToGo,YardLine,IsRush,IsPass,IsTouchdown,IsNoPlay,AbsoluteTime,GameMonth,Formation_Code,PlayType_Code
0,1,10,50,0,1,0,0,145,10,1,10
1,2,9,51,1,0,0,0,105,10,4,8
2,1,10,84,1,0,0,0,634,10,4,9
3,2,9,85,0,1,0,0,595,10,3,11
4,3,3,91,0,1,0,0,550,10,3,11


### Create dataset for predicting sacks

In [225]:
# To predict a sack, we must drop data that cannot be known prior to the play
df_isSack = df_encoded.drop(['Yards', 'IsIncomplete', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
df_isSack.head()

Unnamed: 0,Down,ToGo,YardLine,IsRush,IsPass,IsSack,IsNoPlay,AbsoluteTime,GameMonth,Formation_Code,PlayType_Code
0,1,10,50,0,1,0,0,145,10,1,10
1,2,9,51,1,0,0,0,105,10,4,8
2,1,10,84,1,0,0,0,634,10,4,9
3,2,9,85,0,1,0,0,595,10,3,11
4,3,3,91,0,1,0,0,550,10,3,11


### Create dataset for predicting a fumble

In [226]:
# To predict a fumble, we must drop data that cannot be known prior to the play
df_isFum = df_encoded.drop(['Yards', 'IsIncomplete', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsSack', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
df_isFum.head()

Unnamed: 0,Down,ToGo,YardLine,IsRush,IsPass,IsFumble,IsNoPlay,AbsoluteTime,GameMonth,Formation_Code,PlayType_Code
0,1,10,50,0,1,0,0,145,10,1,10
1,2,9,51,1,0,0,0,105,10,4,8
2,1,10,84,1,0,0,0,634,10,4,9
3,2,9,85,0,1,0,0,595,10,3,11
4,3,3,91,0,1,0,0,550,10,3,11


### Create dataset for predicting an incomplete pass

In [227]:
# To predict an incomplete pass, we must drop data that cannot be known prior to the play
df_isIC = df_encoded.drop(['Yards', 'IsFumble', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsSack', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
df_isIC.head()

Unnamed: 0,Down,ToGo,YardLine,IsRush,IsPass,IsIncomplete,IsNoPlay,AbsoluteTime,GameMonth,Formation_Code,PlayType_Code
0,1,10,50,0,1,0,0,145,10,1,10
1,2,9,51,1,0,0,0,105,10,4,8
2,1,10,84,1,0,0,0,634,10,4,9
3,2,9,85,0,1,0,0,595,10,3,11
4,3,3,91,0,1,0,0,550,10,3,11


##### Incomplete only applies to passing plays. Must drop all rows where IsRush = 1

In [217]:
rows = df_isIC['IsRush'] == 1
indexNames = df_isIC[rows].index
df_isIC = df_isIC.drop(indexNames)

### Create dataset for predicting an interception


In [228]:
# To predict an interception, we must drop data that cannot be known prior to the play
df_isINT = df_encoded.drop(['Yards', 'IsFumble', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsIncomplete', 'IsSack', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
df_isINT.head()

Unnamed: 0,Down,ToGo,YardLine,IsRush,IsPass,IsInterception,IsNoPlay,AbsoluteTime,GameMonth,Formation_Code,PlayType_Code
0,1,10,50,0,1,0,0,145,10,1,10
1,2,9,51,1,0,0,0,105,10,4,8
2,1,10,84,1,0,0,0,634,10,4,9
3,2,9,85,0,1,0,0,595,10,3,11
4,3,3,91,0,1,0,0,550,10,3,11


### Create dataset for predicting Yards gained

In [229]:
# To predict yards gained, we must drop data that cannot be known prior to the play
df_yardsGain = df_encoded.drop(['IsIncomplete', 'IsFumble', 'IsTouchdown', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsSack', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam', 'PenaltyType', 'PenaltyYards', 'YardLineFixed'], axis=1)
df_yardsGain.head()

Unnamed: 0,Down,ToGo,YardLine,Yards,IsRush,IsPass,IsNoPlay,AbsoluteTime,GameMonth,Formation_Code,PlayType_Code
0,1,10,50,1,0,1,0,145,10,1,10
1,2,9,51,3,1,0,0,105,10,4,8
2,1,10,84,1,1,0,0,634,10,4,9
3,2,9,85,6,0,1,0,595,10,3,11
4,3,3,91,6,0,1,0,550,10,3,11
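
The six target frames above repeat the same drop list with one column left out each time. A minimal sketch of a helper that does this in one place follows; the names `DROP_UNLESS_TARGET` and `make_target_frame` are hypothetical, and the column list is copied from the drop lists in the cells above.

```python
import pandas as pd

# Every column dropped in the cells above; the target is kept back each time
DROP_UNLESS_TARGET = [
    'Yards', 'IsIncomplete', 'IsTouchdown', 'IsSack', 'IsChallenge',
    'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception',
    'IsFumble', 'IsPenalty', 'IsTwoPointConversion',
    'IsTwoPointConversionSuccessful', 'IsPenaltyAccepted', 'PenaltyTeam',
    'PenaltyType', 'PenaltyYards', 'YardLineFixed',
]

def make_target_frame(df, target):
    """Drop every listed column except the one we want to predict."""
    drop_cols = [c for c in DROP_UNLESS_TARGET if c != target and c in df.columns]
    return df.drop(columns=drop_cols)

# Toy frame with a handful of the real column names
toy = pd.DataFrame({'Down': [1], 'Yards': [4], 'IsTouchdown': [0], 'IsSack': [0]})
print(list(make_target_frame(toy, 'IsSack').columns))  # ['Down', 'IsSack']
```

With a helper like this, `make_target_frame(df_encoded, 'IsTouchdown')` would stand in for the hand-written drop list, which also avoids copy-paste slips between cells.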


In [22]:
#Separate labels from features
#Labels will most likely need to be converted into one column with casting as nothing=0, touchdown=1, interception=2, etc 
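
Because each target frame holds exactly one outcome column, one simple way to separate labels from features is sketched below on a toy frame. The variable names `X`, `y`, and `target` are illustrative, and this covers only the one-label-per-frame case, not the combined multi-class label mentioned in the comment above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Down': [1, 2, 3],
    'ToGo': [10, 9, 3],
    'IsTouchdown': [0, 0, 1],
})

target = 'IsTouchdown'
X = toy.drop(columns=[target])  # features known before the play
y = toy[target]                 # label to predict
print(list(X.columns))  # ['Down', 'ToGo']
print(y.tolist())       # [0, 0, 1]
```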

#### df_isTD - Use to predict if they will score a touchdown

#### df_isSack - Use to predict if there will be a sack

#### df_isFum - Use to predict if there will be a fumble

#### df_isIC - Use to predict if there will be an incomplete pass

#### df_isINT - Use to predict if there will be an interception

#### df_yardsGain - Use to predict yards gained

## Data Analysis

#### Decision Trees

In [24]:
#Perform Decision Trees (Assign 1)
#Report results, including accuracy scores and appropriate visuals

#### KNN

In [25]:
#Perform KNN (Assign 2)
#Report results, including accuracy scores and appropriate visuals

#### Naive-Bayes

In [26]:
#Perform Naive-Bayes (Assign 2)
#Report results, including accuracy scores and appropriate visuals

#### SVM

In [27]:
#Perform SVM (Assign 3)
#Report results, including accuracy scores and appropriate visuals

#### Neural Net

In [28]:
#Perform Neural Net (Assign 3)
#Report results, including accuracy scores and appropriate visuals

#### Ensembles

In [29]:
#Perform Ensembles (Assign 3)
#Report results, including accuracy scores and appropriate visuals

## Model Analysis

In [30]:
#Compare accuracy scores and other metrics for our different models.
#How confident are we in the success rates of these various models?

In [31]:
#Discuss which model was the best.

In [32]:
#Discuss data. What issues may have existed in the data?  What assumptions did we make? What could have made our data better?

In [33]:
#Discuss our project as a whole. How could we have improved project? How might this model be used in real world applications?