In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O 
pd.set_option('display.max_columns', 5000)

# pretty pictures
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# [NFL BDB](https://www.kaggle.com/c/nfl-big-data-bowl-2022/overview)

### Predicting Penalties on Punts and Kickoffs

This notebook is an exploration of the NFL Big Data Bowl data, along with some general-interest feature engineering and basic classification modelling. Its purpose isn't to produce a useful model; it's for my own data science interest and practice. 

The main focus here is feature engineering using four of the NFL's special teams datasets, and using the engineered data to build a model that can predict whether or not a punt or a kickoff will result in a penalty. Of course, most plays *don't* result in a penalty, so as shown below, the models that get built will all be very good at predicting the plays without penalties, but not so good at predicting the plays that *do* result in penalties. Also, the model can't predict *which* team will get a penalty. 

Like I said, this isn't meant to be a useful model (yet?).

[Part 1:](#part1) Data Exploration and Feature Engineering

[Part 2:](#modelling) Modelling

In [None]:
# get the data
scouting_data = pd.read_csv("../input/nfl-big-data-bowl-2022/PFFScoutingData.csv")
games = pd.read_csv("../input/nfl-big-data-bowl-2022/games.csv")
players = pd.read_csv("../input/nfl-big-data-bowl-2022/players.csv")
plays = pd.read_csv("../input/nfl-big-data-bowl-2022/plays.csv")

# Part 1: Data Exploration and Feature Engineering 
* [Games](#games-data)
* [Plays](#plays-data-nav)
* [Scouting](#scouting-data)

<a id="games-data"></a>
## Games Data
The Games dataset is the smallest dataset and very straightforward. 
* gameId is a primary key for joins
* having gameDate and the week number may be redundant
* gameTimeEastern will be bucketed into afternoon, evening, and night

In [None]:
games

In [None]:
games['gameTimeEastern'].value_counts()

In [None]:
def categorize_start_time(hour):
    """Bucket a start hour (Eastern, 24-hour clock) into afternoon, evening or night."""
    hour = int(hour)
    if hour < 16:
        return 'A'  # afternoon
    elif hour < 19:
        return 'E'  # evening
    else:
        return 'N'  # night

In [None]:
def process_games(games_df):
    # make a working copy
    df = games_df.copy()
    
    # add a startPeriod column, derived from the hour of the Eastern start time
    # (read from the working copy, not the global games DataFrame)
    df['startHour'] = df['gameTimeEastern'].str[0:2]
    df['startPeriod'] = df['startHour'].map(categorize_start_time)
        
    # remove the unnecessary ones
    df = df.drop(['startHour', 'gameTimeEastern'], axis=1)
    
    return df

In [None]:
process_games(games)

<a id="plays-data-nav"></a>
## Plays Data

Much more interesting. The field descriptions are from the [competition data page](https://www.kaggle.com/c/nfl-big-data-bowl-2022/data).

* **gameId**: Game identifier, unique (numeric)
* **playId**: Play identifier, not unique across games (numeric)
* **playDescription**: Description of play (text)
* **quarter**: Game quarter (numeric)
* **down**: Down (numeric)
* **yardsToGo**: Distance needed for a first down (numeric)
* **possessionTeam**: Team punting, placekicking or kicking off the ball (text)
* **specialTeamsPlayType**: Formation of play: Extra Point, Field Goal, Kickoff or Punt (text)
* **specialTeamsPlayResult**: Special Teams outcome of play dependent on play type: Blocked Kick Attempt, Blocked Punt, Downed, Fair Catch, Kick Attempt Good, Kick Attempt No Good, Kickoff Team Recovery, Muffed, Non-Special Teams Result, Out of Bounds, Return or Touchback (text)
* **kickerId**: nflId of placekicker, punter or kickoff specialist on play (numeric)
* **returnerId**: nflId(s) of returner(s) on play if there was a special teams return. Multiple returners on a play are separated by a ; (text)
* **kickBlockerId**: nflId of blocker of kick on play if there was a blocked field goal or blocked punt (numeric)
* **yardlineSide**: 3-letter team code corresponding to line-of-scrimmage (text)
* **yardlineNumber**: Yard line at line-of-scrimmage (numeric)
* **gameClock**: Time on clock of play (MM:SS)
* **penaltyCodes**: NFL categorization of the penalties that occurred on the play. Multiple penalties on a play are separated by a ; (text)
* **penaltyJerseyNumbers**: Jersey number and team code of the player committing each penalty. Multiple penalties on a play are separated by a ; (text)
* **penaltyYards**: yards gained by possessionTeam by penalty (numeric)
* **preSnapHomeScore**: Home score prior to the play (numeric)
* **preSnapVisitorScore**: Visiting team score prior to the play (numeric)
* **passResult**: Scrimmage outcome of the play if specialTeamsPlayResult is "Non-Special Teams Result" (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, R: Scramble, ' ': Designed Rush, text)
* **kickLength**: Kick length in air of kickoff, field goal or punt (numeric)
* **kickReturnYardage**: Yards gained by return team if there was a return on a kickoff or punt (numeric)
* **playResult**: Net yards gained by the kicking team, including penalty yardage (numeric)
* **absoluteYardlineNumber**: Location of ball downfield in tracking data coordinates (numeric)

In [None]:
plays

<a id="plays-data-nav"></a>
### Plays Data Feature Engineering

My goal here is to predict whether or not there will be a penalty, so first I'm going to add a penalty flag column (no pun intended). I'm also looking only at penalties on kickoffs and punts, so I'll filter out other plays.
I'm going to get rid of some of these columns:
* playId and gameId stay long enough to build the key used to join the other datasets
* penaltyYards, penaltyCodes, and penaltyJerseyNumbers have to go because they match up perfectly with the penaltyFlag column, which I'm trying to predict. This is to avoid leakage. 
* Kickoffs don't have a down and every single punt happened on fourth down, so I'll get rid of the down column
* I need to test whether absoluteYardlineNumber measures from one team's end or a stadium-specific end [(see below)](#yardline).
* I'm going to assume that the identity of the kicker or returner has no bearing on whether or not there was a penalty, and drop the kickerId and returnerId columns. This is justifiable because I will be including data about each specific kick. A brief examination [below](#kickers-returners) does show that some kickers and returners have been involved in a large number of plays with penalties. (Note: I'm going to explore which kickers and returners have been involved in the most plays with penalties. It's not done yet, but I am going to link to it [here]().)
* I'm going to use quarter as a proxy for the time and get rid of the gameClock column.
* There are so few non-NaN values in the passResult column that I'm just going to remove it.

In addition, the kickReturnYardage column has a bunch of NaN values, which I'll replace with 0. 

The [special teams result](#specialTeamResult) column seems to be a somewhat valuable predictor, so it will be kept and one-hot encoded. Some of the encoding of categorical features will need to be done after the dataset is split between kickoffs and punts, and some before. Click [here](#encoding) to jump to the encoding section.

Next, I'm going to split the data into kickoffs and punts, as there are differences between them that should be handled. Click [here](#split) to jump to this section.

In [None]:
def kick_punt_filter(specialTeamsPlayType):
    """Used in a map/lambda to get rid of records other than those referring to a kickoff or a punt"""
    
    if specialTeamsPlayType == "Kickoff" or specialTeamsPlayType=="Punt":
        return specialTeamsPlayType
    else:
        return False

In [None]:
def process_plays(plays_df):
    """Takes in the original plays DataFrame and returns the processed one."""
    
    df = plays_df.copy()
    
    # add penalty flag column
    df['penaltyFlag'] = df['penaltyCodes'].notnull().astype(int)
    
    # filter for kickoffs and punts
    df['kickType'] = df['specialTeamsPlayType'].map(lambda x: kick_punt_filter(x))
    df = df.drop('specialTeamsPlayType', axis=1)
    
    df = df[df.kickType != False]
    
    # add in the yardsFromEndzone column
    # the reasoning is described in the Yard Line section below
    on_own_side = df['yardlineSide'] == df['possessionTeam']  # boolean mask
    df['yardsFromEndzone'] = np.where(on_own_side, df['yardlineNumber'], 100 - df['yardlineNumber'])
    
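    # fill NaN's (assign the result back; inplace=True on a selected column is
    # deprecated and does not update df under pandas copy-on-write)
    df['kickReturnYardage'] = df['kickReturnYardage'].fillna(0)
    df['penaltyYards'] = df['penaltyYards'].fillna(0)
    df['penaltyJerseyNumbers'] = df['penaltyJerseyNumbers'].fillna('None')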
    
    # combine playId and gameId into a single column
    # this will be used later to join with the scouting data
    df['gameId'] = df['gameId'].astype(str)
    df['playId'] = df['playId'].astype(str)
    df['play_key'] = df[['gameId', 'playId']].agg('-'.join, axis=1)
    
    # drop unnecessary columns
    df = df.drop(['gameId',
                  'playId',
                  'penaltyYards',
                  'playDescription',
                  'absoluteYardlineNumber',
                  'down',
                  'gameClock',
                  'passResult',
                  'yardlineSide',
                  'yardlineNumber',
                  'kickerId',
                  'returnerId',
                  'kickBlockerId',
                  'penaltyCodes',
                  'penaltyJerseyNumbers'],
                 axis=1)
    
    
    return df

In [None]:
process_plays(plays)

<a id="yardline"></a>
#### Yard Line 

The next code block will test whether **absoluteYardlineNumber** is fixed from one team's end (e.g. yards from the home team's end zone) or fixed from one end of the stadium (e.g. yards from the south end zone, regardless of which team's that is). We'll have to make subset DataFrames from the relevant columns in the games and the plays df's and then join them on **gameId**. 

From inspecting the first few rows, it's clear that **absoluteYardlineNumber** measures from a fixed end of the stadium. In the first quarter of the Philadelphia-Atlanta game (seen below, gameId=201809600), **absoluteYardlineNumber** measures from Philadelphia's end. When the two teams switch sides in the second quarter, it measures from Atlanta's end. This means we can get rid of the **absoluteYardlineNumber** column and use only **yardlineSide** and **yardlineNumber**.

[Back to Plays Data Navigator](#plays-data-nav)


In [None]:
def test_yardline(plays, games):
    # relevant columns from the two dfs:
    yardlineTestDf1 = plays[['gameId','absoluteYardlineNumber', 'yardlineSide', 'yardlineNumber', 'quarter', 'possessionTeam']]
    yardlineTestDf2 = games[['gameId', 'homeTeamAbbr']]
    
    # join them
    yardlineTestDF = yardlineTestDf1.set_index('gameId').join(yardlineTestDf2.set_index('gameId'))
    
    return yardlineTestDF

In [None]:
test_yardline(plays, games).head(10)

Now that we've got that, we want to turn the yardlineSide and yardlineNumber features into a single one called yardsFromEndzone, which gives the distance from the line of scrimmage to the kicking team's own end zone. This is yardlineNumber if yardlineSide equals possessionTeam; otherwise it's yardlineNumber subtracted from 100, since the other team's 5 yard line and 25 yard line are 95 and 75 yards from one's own end zone.

This functionality is implemented within the process_plays() function above.
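
As a quick sanity check of the rule described above, here is a minimal sketch on a toy frame. The three rows are made-up values (not taken from plays.csv), and `toy` and `on_own_side` are names used only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values, not from plays.csv)
toy = pd.DataFrame({
    "possessionTeam": ["PHI", "PHI", "ATL"],
    "yardlineSide":   ["PHI", "ATL", "ATL"],
    "yardlineNumber": [25, 5, 40],
})

# True when the line of scrimmage is on the kicking team's own side of the field
on_own_side = toy["yardlineSide"] == toy["possessionTeam"]

# Own side: the yard line already is the distance from one's own end zone.
# Opponent's side: subtract the yard line from 100.
toy["yardsFromEndzone"] = np.where(on_own_side,
                                   toy["yardlineNumber"],
                                   100 - toy["yardlineNumber"])
print(toy)
```

The second row is the interesting one: Philadelphia kicking from Atlanta's 5 yard line ends up 95 yards from its own end zone.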

<a id="specialTeamResult"></a>
#### Special Teams Result
Next, I'm going to do some general exploring of other features and decide whether or not to keep them, based on whether there's any indication of those correlating with there being a penalty. The **specialTeamsResult** column definitely looks like it should be kept.

[Back to Plays Data Navigator](#plays-data-nav)

In [None]:
process_plays(plays)['specialTeamsResult'].value_counts()

In [None]:
plays_processed = process_plays(plays)  # process once and reuse

plt.figure(figsize=(20,14))
graph = sns.histplot(data=plays_processed, x='specialTeamsResult', hue='penaltyFlag', stat='count')
# label each bar with its count
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 0.3, height, ha="center")
plt.show()

In [None]:
process_plays(plays)

In [None]:
# only a few non-nan passResult
plays['passResult'].value_counts()

<a id=kickers-returners></a>
#### Kickers and Returners
What about keeping the kicker or returner? Let's sum up the number of penalties for each.

For some reason, a few kickers have a huge number of penalties resulting from their kicks. This will be explored elsewhere and I'll link to it here when it's done. A surface-level examination is below. For example, we can see that kicker 40113's kicks had 45 penalties; the average was 9.3 (red line) and the median was 8 (yellow line).

The situation for returners is similar.

For the time being, I'm going to assume the identity of the kicker or returner is irrelevant and remove it.

[Back to Plays Data Navigator](#plays-data-nav)

In [None]:
def whatsup_with_the_kickers(plays_df):
    """Counts plays with a penalty per kicker, plots the counts, and returns them with summary statistics."""
    
    df = plays_df.copy()
    
    # add penalty flag column
    df['penaltyFlag'] = df['penaltyCodes'].notnull().astype(int)
    
    kickers = df[['kickerId', 'penaltyFlag']]
    kickers = kickers.groupby(['kickerId']).sum().sort_values('penaltyFlag', ascending=False)
    kickers = kickers.reset_index()
    
    # plot the penalties for each kicker
    plt.figure(figsize=(14,10))
    ax = sns.barplot(x=kickers.index, y=kickers["penaltyFlag"])
    # horizontal lines for median, mean
    ax.axhline(kickers.describe().loc['mean','penaltyFlag'], color='red')
    ax.axhline(kickers.describe().loc['50%','penaltyFlag'], color='yellow')
    ax.set_xticklabels(kickers.kickerId)
    for item in ax.get_xticklabels(): item.set_rotation(90)
    plt.show()
    
    
    return kickers, kickers.describe()['penaltyFlag'], kickers.describe().loc['50%','penaltyFlag']

In [None]:
whatsup_with_the_kickers(plays)

<a id=encoding></a>
### Encoding
I'm going to define a general one-hot encoding function that takes in a DataFrame, the column to encode, and a prefix, and returns a new DataFrame with that column replaced by its encoded columns.

[Back to Plays Data Navigator](#plays-data-nav)
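
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame. The values are made up, and `toy` and `encoded` are names used only for this illustration; it uses the same `pd.get_dummies` and `pd.concat` calls as the function in this section.

```python
import pandas as pd

# Toy data (made-up values): one categorical column and the target
toy = pd.DataFrame({"quarter": [1, 2, 2, 4], "penaltyFlag": [0, 1, 0, 0]})

# One indicator column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(toy["quarter"], prefix="quarter")

# Swap the original column for its indicator columns
encoded = pd.concat([toy.drop(columns="quarter"), dummies], axis=1)
print(encoded.columns.tolist())
```

Note that only values present in the data get a column: there is no `quarter_3` here, which matters if the same encoding is later applied to a different subset of plays.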

In [None]:
def onehot_encode(df, column, prefix):
    df = df.copy()
    
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df,dummies],axis=1)
    df = df.drop(column, axis=1)
    
    return df

In [None]:
process_plays(plays)

In [None]:
plays_data = onehot_encode(process_plays(plays), 'possessionTeam', 'team')
plays_data

<a id="scouting-data"></a>
## Scouting Data

The third and final dataset we'll bring in is the PFF scouting data. Here are the field definitions from the NFL:

* **gameId**: Game identifier, unique (numeric)
* **playId**: Play identifier, not unique across games (numeric)
* **snapDetail**: On Punts, whether the snap was on target and if not, provides detail (H: High, L: Low, <: Left, >: Right, OK: Accurate Snap, text)
* **operationTime**: Timing from snap to kick on punt plays in seconds: (numeric)
* **hangTime**: Hangtime of player's punt or kickoff attempt in seconds. Timing is taken from impact with foot to impact with the ground or a player. (numeric)
* **kickType**: Kickoff or Punt Type (text).
* * Possible values for kickoff plays:
* * * D: Deep - your normal deep kick with decent hang time
* * * F: Flat - different than a Squib in that it will have some hang time and no roll but has a lower trajectory and hang time than a Deep kick off
* * * K: Free Kick - Kick after a safety
* * * O: Obvious Onside - score and situation dictates the need to regain possession. Also the hands team is on for the returning team
* * * P: Pooch kick - high for hangtime but not a lot of distance - usually targeting an upman
* * * Q: Squib - low-line drive kick that bounces or rolls considerably, with virtually no hang time
* * * S: Surprise Onside - accounting for score and situation an onsides kick that the returning team doesn’t expect. Hands teams probably aren't on the field
* * * B: Deep Direct OOB - Kickoff that is aimed deep (regular kickoff) that goes OOB directly (doesn't bounce)
* * Possible values for punt plays:
* * * N: Normal - standard punt style
* * * R: Rugby style punt
* * * A: Nose down or Aussie-style punts
* **kickDirectionIntended**: Intended kick direction from the kicking team's perspective - based on how coverage unit sets up and other factors (L: Left, R: Right, C: Center, text).
* **kickDirectionActual**: Actual kick direction from the kicking team's perspective (L: Left, R: Right, C: Center, text).
* **returnDirectionIntended**: The return direction the punt return or kick off return unit is set up for from the return team's perspective (L: Left, R: Right, C: Center, text).
* **returnDirectionActual**: Actual return direction from the return team's perspective (L: Left, R: Right, C: Center, text).
* **missedTacklers**: Jersey number and team code of player(s) charged with a missed tackle on the play. It will be reasonable to assume that he should have brought down the ball carrier and failed to do so. This situation does not have to entail contact, but it most frequently does. Missed tackles on a QB by a pass rusher are also included here. Multiple missed tacklers on a play are separated by a ; (text).
* **assistTacklers**: Jersey number and team code of player(s) assisting on the tackle. Multiple assist tacklers on a play are separated by a ; (text).
* **tacklers**: Jersey number and team code of player making the tackle (text).
* **kickoffReturnFormation**: 3 digit code indicating the number of players in the Front Wall, Mid Wall and Back Wall (text).
* **gunners**: Jersey number and team code of player(s) lined up as gunner on punt unit. Multiple gunners on a play are separated by a ; (text).
* **puntRushers**: Jersey number and team code of player(s) on the punt return unit with "Punt Rush" role for actively trying to block the punt. Does not include players crossing the line of scrimmage to engage in punt coverage players in a "Hold Up" role. Multiple punt rushers on a play are separated by a ; (text).
* **specialTeamsSafeties**: Jersey number and team code for player(s) with "Safety" roles on kickoff coverage and field goal/extra point block units - and those not actively advancing towards the line of scrimmage on the punt return unit. Multiple special teams safeties on a play are separated by a ; (text).
* **vises**: Jersey number and team code for player(s) with a "Vise" role on the punt return unit. Multiple vises on a play are separated by a ; (text).
* **kickContactType**: Detail on how a punt was fielded, or what happened when it wasn't fielded (text).
* * Possible values:
* * * BB: Bounced Backwards
* * * BC: Bobbled Catch from Air
* * * BF: Bounced Forwards
* * * BOG: Bobbled on Ground
* * * CC: Clean Catch from Air
* * * CFFG: Clean Field From Ground
* * * DEZ: Direct to Endzone
* * * ICC: Incidental Coverage Team Contact
* * * KTB: Kick Team Knocked Back
* * * KTC: Kick Team Catch
* * * KTF: Kick Team Knocked Forward
* * * MBC: Muffed by Contact with Non-Designated Returner
* * * MBDR: Muffed by Designated Returner
* * * OOB: Directly Out Of Bounds

In [None]:
scouting_data

### Scouting Data Feature Engineering
* As with the plays data, the gameId and the playId will be merged to make a primary key 
* I'm going to get rid of the data about the tacklers
* Gunners, rushers, safeties, and vises will be turned into numeric values (simply the number of each)
* The times will stay as they are
* All of the other fields are categorical and will be one-hot encoded

The plays that aren't kickoffs or punts will drop out when this data is joined onto the filtered plays data.
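
The player-role conversion can also be written without a row-by-row function. The following is a sketch of a vectorized alternative, not the approach this notebook uses; the jersey strings are made up in the "team code and number, separated by ;" style described in the field list.

```python
import numpy as np
import pandas as pd

# Made-up role strings; NaN means no players were recorded in that role
toy = pd.Series(["PHI 18; PHI 29", "ATL 83", np.nan])

# Number of players = number of semicolons + 1; missing values become 0
counts = (toy.str.count(";") + 1).fillna(0).astype(int)
print(counts.tolist())
```

`Series.str.count` leaves NaN as NaN, so the `fillna(0)` step is what turns "no players recorded" into a count of zero.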


In [None]:
def process_scouting(df):
    df = df.copy()
    
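    # copy kickType to typeOfKick so it doesn't collide with the kickType
    # column built in process_plays(); the original is dropped below
    df['typeOfKick'] = df['kickType']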
    
    # change the player roles columns to the number of players occupying that role
    df['gunners'] = df['gunners'].map(lambda x: number_players(x))
    df['puntRushers'] = df['puntRushers'].map(lambda x: number_players(x))
    df['specialTeamsSafeties'] = df['specialTeamsSafeties'].map(lambda x: number_players(x))
    df['vises'] = df['vises'].map(lambda x: number_players(x))
    
    # combine playId and gameId into a single column
    df['gameId'] = df['gameId'].astype(str)
    df['playId'] = df['playId'].astype(str)
    df['play_key'] = df[['gameId', 'playId']].agg('-'.join, axis=1)
    
    # drop unnecessary columns
    df = df.drop(['gameId',
                  'playId',
                  'missedTackler',
                  'assistTackler',
                  'tackler',
                  'kickType'],
                 axis=1)
    
    return df

In [None]:
def number_players(players):
    """Convert a ';'-separated string of players into the number of players listed."""
    # NaN (no players recorded) comes through as a float rather than a str
    if not isinstance(players, str):
        return 0
    # otherwise the count is the number of semicolons + 1
    return players.count(';') + 1


In [None]:
process_scouting(scouting_data)


## Merging and finalizing the two data sets 
Before we can go any further, the two datasets need to be merged. We get a 28-field dataset with nearly 14000 records.
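
Since playId is not unique across games, the join needs both identifiers. This notebook builds a string key for that; as an alternative sketch, pandas can also merge on the two columns directly. The frames and values below are made up for illustration, and `validate="one_to_one"` is an optional check added here, not something the notebook does.

```python
import pandas as pd

# Made-up stand-ins for the processed plays and scouting frames
plays_toy = pd.DataFrame({"gameId": [1, 1, 2],
                          "playId": [10, 20, 10],
                          "penaltyFlag": [0, 1, 0]})
scouting_toy = pd.DataFrame({"gameId": [1, 1, 2, 2],
                             "playId": [10, 20, 10, 30],
                             "hangTime": [4.1, 3.9, 4.4, 4.0]})

# Left merge keeps every row of plays_toy; validate raises if a key repeats
merged = plays_toy.merge(scouting_toy, on=["gameId", "playId"],
                         how="left", validate="one_to_one")
print(merged.shape)
```

The scouting row for game 2, play 30 has no match on the left side, so it does not appear in the result.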

In [None]:
full_data = process_plays(plays).set_index('play_key').join(process_scouting(scouting_data).set_index('play_key'))
full_data

<a id=split></a>
### Splitting kickoffs and punts
Some of the fields are only for kickoffs, and some only for punts. Here I'll split them up before we move on.

* The fields only for punts are yardsToGo, snapDetail, snapTime, operationTime, gunners, puntRushers, vises and kickContactType
* The field only for kickoffs is kickoffReturnFormation

In [None]:
kickoffs = full_data[full_data['kickType'] == 'Kickoff']
punts = full_data[full_data['kickType'] == 'Punt']

In [None]:
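# check that the punt-only columns are empty for kickoffs
# snapDetail, operationTime, snapTime and kickContactType should all show 0 non-null values;
# gunners, puntRushers and vises were converted to counts, so expect 0 values there rather than NaN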
for column in kickoffs.columns:
    print(column + ' - '+ str(kickoffs[column].count()))

In [None]:
for column in punts.columns:
    print(column + ' - '+ str(punts[column].count()))

<a id=kickoffs></a>
    
### Finishing up the kickoffs data

The last thing is to one-hot encode the categorical columns and drop the unnecessary ones.

In [None]:
kickoffs

In [None]:
def process_kickoffs(df):
    df = df.copy()
    
    # drop unnecessary columns
    df = df.drop(['snapDetail',
                  'snapTime',
                  'yardsToGo',
                  'operationTime',
                  'gunners',
                  'puntRushers',
                  'vises',
                  'kickContactType',
                  'kickType']
                , axis=1)
    
    # onehot encoding
    df = onehot_encode(df, 'quarter', 'quarter')
    df = onehot_encode(df, 'possessionTeam', 'team')
    df = onehot_encode(df, 'specialTeamsResult', 'result')
    df = onehot_encode(df, 'kickDirectionIntended', 'kickDirectionIntended')
    df = onehot_encode(df, 'kickDirectionActual', 'kickDirectionActual')
    df = onehot_encode(df, 'returnDirectionIntended', 'returnDirectionIntended')
    df = onehot_encode(df, 'returnDirectionActual', 'returnDirectionActual')
    df = onehot_encode(df, 'kickoffReturnFormation', 'kickoffReturnFormation')
    df = onehot_encode(df, 'typeOfKick', 'typeOfKick')
    
    # deal with the NaN's in the hangTime Column
    df['hangTime'] = df['hangTime'].fillna(value=df['hangTime'].mean())
    
    return df



In [None]:
process_kickoffs(kickoffs)

There are 620 NaN's in the hangTime field for kickoffs, which appear to be missing at random. They'll be filled with the average kickoff hang time.
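
Here is a minimal sketch of mean imputation on a toy series (made-up hang times, not the kickoff data), showing how to count the missing values and what the fill does.

```python
import numpy as np
import pandas as pd

# Made-up hang times in seconds, with two missing values
hang = pd.Series([4.0, np.nan, 3.8, 4.2, np.nan])

n_missing = int(hang.isna().sum())   # NaN's before filling
mean_hang = hang.mean()              # mean() skips NaN by default
filled = hang.fillna(mean_hang)

print(n_missing, filled.isna().sum())
```

Every filled row gets the same value, so the mean shows up as a spike in `value_counts()` afterwards.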

In [None]:
# isna().sum() counts the NaN's directly
# (equivalent to len(kickoffs) - kickoffs['hangTime'].count(), since count() counts the non-NaN's)
kickoffs['hangTime'].isna().sum()

In [None]:
# this should now be 7843
process_kickoffs(kickoffs)['hangTime'].count()

In [None]:
# 620 were filled with the average value
process_kickoffs(kickoffs)['hangTime'].value_counts()

My final kickoffs data is below:

In [None]:
process_kickoffs(kickoffs)

In [None]:
# quick null check
kickoffs_final = process_kickoffs(kickoffs)  # process once rather than once per column
for col in kickoffs_final.columns:
    print(col + ' - ' + str(kickoffs_final[col].isna().sum()))

<a id=punts></a>
    
### Finishing up the punts data

In [None]:
punts

In [None]:
def process_punts(df):
    df = df.copy()
    
    # drop unnecessary columns
    df = df.drop(['kickoffReturnFormation', 'kickType'], axis=1)
    
    # onehot encoding
    df = onehot_encode(df, 'quarter', 'quarter')
    df = onehot_encode(df, 'possessionTeam', 'team')
    df = onehot_encode(df, 'specialTeamsResult', 'result')
    df = onehot_encode(df, 'snapDetail', 'snap')
    df = onehot_encode(df, 'kickDirectionIntended', 'kickDirectionIntended')
    df = onehot_encode(df, 'kickDirectionActual', 'kickDirectionActual')
    df = onehot_encode(df, 'returnDirectionIntended', 'returnDirectionIntended')
    df = onehot_encode(df, 'returnDirectionActual', 'returnDirectionActual')
    df = onehot_encode(df, 'kickContactType', 'kickContactType')
    df = onehot_encode(df, 'typeOfKick', 'typeOfKick')
    
    # drop records with NaN's (dropna() removes a row if any column is NaN);
    # the hangTime NaN's are explained below
    df = df.dropna()
    
    return df

In [None]:
process_punts(punts)

The NaN values for the punts are a little trickier because some of them are simply missing data, but some are for punt plays for which there is no hangTime, such as blocked punts.

In [None]:
temp = punts[['specialTeamsResult','hangTime','penaltyFlag']]
temp

In [None]:
temp[temp['hangTime'].isna()]['penaltyFlag'].value_counts()

In [None]:
temp[temp['hangTime'].isna()]['specialTeamsResult'].value_counts()

However, there are only 118 such records, and only 6 have penalties, so I'm just going to delete those rows for the first go-through. A future version of this may deal with them more rigorously.

My final punts dataset is below:

In [None]:
process_punts(punts)

In [None]:
# quick null check
punts_final = process_punts(punts)  # process once rather than once per column
for col in punts_final.columns:
    print(col + ' - ' + str(punts_final[col].isna().sum()))

<a id=modelling></a>
# Part 2: Modelling

I'm going to put together some classification models that can use these two datasets to predict whether or not there will be a penalty on each play.

For both the kickoffs and punts, I'll first need to split the data into training and testing sets, then fit a standard scaler on the training set only and apply it to both.
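
The split-then-scale order can be sketched on synthetic data as follows. The features and labels are randomly generated, not the kickoff or punt data, and `stratify=y` is an assumption added here (it keeps the share of positives the same in both sets); the notebook's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features
y = (rng.random(200) < 0.1).astype(int)          # roughly 10% positives

# Split first so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13, stratify=y)

# Fit on the training rows only, then apply the same transform to both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
print(X_test_s.mean(axis=0).round(3), X_test_s.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 by construction; the test columns are only close to that, because they are scaled with the training set's statistics.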

### Kickoffs first

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
def process_for_modelling(df, target):
    df = df.copy()
    
    X = df.drop(target, axis=1)  # feature matrix
    y = df[target]   # target vector
    
    # split it 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
    
    # scale it 
    scaler = StandardScaler()
    scaler.fit(X_train) # fit only to the training data

    # apply to both, keeping the original row index so X and y stay aligned
    X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
    
    return X_train, X_test, y_train, y_test

In [None]:
k_X_train, k_X_test, k_y_train, k_y_test = process_for_modelling(process_kickoffs(kickoffs), 'penaltyFlag')

In [None]:
k_X_train

And we can see the scaler was applied, as the means are approximately 0 and the standard deviations approximately 1:

In [None]:
k_X_train.describe()

In [None]:
k_y_test.value_counts()

And with the scaled data, we'll do some modelling and see if this mess can predict anything useful:

In [None]:
# import models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, Ridge, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# import eval stuff
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix

In [None]:
# make a dictionary of the models to iterate through to train and test each one 
models = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    # this guy is currently throwing an error I can't figure out
    # "Logistic Regression": LogisticRegression(),  
    "Stochastic Gradient Descent Classifier": SGDClassifier(),
    "Support Vector Classifier": SVC(),
    "Linear Support Vector Classifier": LinearSVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier()         
         }

In [None]:
# train
for name, model in models.items():
    model.fit(k_X_train, k_y_train)
    print(name + " trained.")

In [None]:
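for name, model in models.items():
    # score() on a classifier returns mean accuracy, not R^2
    print(name + " Accuracy: {:.5f}".format(model.score(k_X_test, k_y_test)))
    # confusion_matrix expects (y_true, y_pred): rows are actual, columns are predicted
    print(confusion_matrix(k_y_test, model.predict(k_X_test)))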

### Evaluation
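All of the models achieved nearly 100% accuracy. Because most plays have no penalty, though, accuracy is dominated by the negative class, so per-class precision and recall say more about how well penalties are actually predicted.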
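
To see why accuracy on its own is a weak yardstick when penalties are rare, here is a sketch with synthetic labels (95 plays without a penalty, 5 with; not the notebook's data) and a baseline that always predicts the majority class. `DummyClassifier` is used purely as a reference point here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, imbalanced labels: 95 no-penalty plays and 5 penalty plays
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Always predicts the most frequent class (no penalty)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)
rec = recall_score(y, pred)  # recall on the penalty class
print(acc, rec)
```

The baseline is right on 95% of plays without ever identifying a penalty, so any real model should be judged against that kind of floor rather than against 50%.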


#### Closer look at the Decision Tree Classifier

In [None]:
from sklearn.tree import plot_tree

In [None]:
dTree = DecisionTreeClassifier()
dTree.fit(k_X_train, k_y_train)
pred_dt = dTree.predict(k_X_test)
print("Classification Report:\n")
print(classification_report(k_y_test, pred_dt))
print("\nConfusion Matrix:\n")
print(confusion_matrix(k_y_test, pred_dt))

##### Classification Report
The exact numbers here will change every time this notebook is run, but here is a summary for right now:

In the negative class (no penalty), 99% of the plays predicted to have no penalty really had none (precision = 0.99), and 97% of the plays without a penalty were identified as such (recall = 0.97).

In the positive class, 79% of the plays predicted to have a penalty actually had one (precision = 0.79), and 80% of the plays that did have a penalty were caught (recall = 0.80).

##### Confusion Matrix

The first row is the negative class: 1478 plays were correctly predicted *not to have penalties*, and 16 were incorrectly predicted as having penalties. The second row is the positive class: 15 plays with penalties were incorrectly predicted not to have them, and 60 were correctly predicted to have penalties.

This model is a lot better at making its predictions on the negative class.
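
As a worked check, the per-class metrics can be recomputed from the four counts quoted above, assuming the layout `confusion_matrix(y_true, y_pred)` produces (rows are actual classes, columns are predicted classes). Because the numbers change between runs, the results need not match the classification report figures exactly.

```python
import numpy as np

# Counts quoted in the text (rows = actual class, columns = predicted class)
cm = np.array([[1478, 16],
               [15,   60]])
tn, fp, fn, tp = cm.ravel()

precision_pos = tp / (tp + fp)   # of predicted penalties, how many were real
recall_pos = tp / (tp + fn)      # of real penalties, how many were caught
precision_neg = tn / (tn + fn)   # of predicted no-penalty plays, how many were right
recall_neg = tn / (tn + fp)      # of real no-penalty plays, how many were identified
accuracy = (tp + tn) / cm.sum()

print(round(precision_pos, 2), round(recall_pos, 2),
      round(precision_neg, 2), round(recall_neg, 2), round(accuracy, 2))
```

The gap between the positive-class and negative-class figures is the imbalance effect in numbers: overall accuracy stays high even though about one in five penalties is missed.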

In [None]:
#Create the tree plot

plt.figure(figsize=(20,14))
plot_tree(dTree,
           feature_names = k_X_test.columns, #Feature names
           class_names = ["0","1"], #Class names
           rounded = True,
           filled = True)

# leaving this in for now even though it's illegible
plt.savefig('dTree.png')
plt.show()

### Punts

In [None]:
p_X_train, p_X_test, p_y_train, p_y_test = process_for_modelling(process_punts(punts), 'penaltyFlag')

In [None]:
p_X_train

In [None]:
p_X_train.describe()

In [None]:
models = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    #"Logistic Regression": LogisticRegression(),
    "Stochastic Gradient Descent Classifier": SGDClassifier(),
    "Support Vector Classifier": SVC(),
    "Linear Support Vector Classifier": LinearSVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier()         
         }

In [None]:
# train
for name, model in models.items():
    model.fit(p_X_train, p_y_train)
    print(name + " trained.")

In [None]:
for name, model in models.items():
    # score() on a classifier returns mean accuracy, not R^2
    print(name + " Accuracy: {:.5f}".format(model.score(p_X_test, p_y_test)))