<a href="https://colab.research.google.com/github/HimalKarkal/NFL/blob/Plays-Analysis/Tackle_Probability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

The goal of this project is to build a machine learning model that estimates, for each defender at every frame of a play, the probability of making a positive tackle (a tackle, an assist, or a forced fumble) on the ball carrier. In doing so, I aim to learn which factors increase the odds of a tackle being made. This knowledge can then be translated into coaching and training strategies.

# Importing Data

In this section, I am importing all the CSV files from the NFL Big Data Bowl 2024 competition on Kaggle. Each file is loaded into a pandas dataframe.

In [1]:
#Importing pandas
import pandas as pd

In [2]:
#Importing datafiles

games = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/games.csv')
players = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/players.csv')
plays = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/plays.csv')
tackles = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tackles.csv')
track_1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_1.csv')
track_2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_2.csv')
track_3 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_3.csv')
track_4 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_4.csv')
track_5 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_5.csv')
track_6 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_6.csv')
track_7 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_7.csv')
track_8 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_8.csv')
track_9 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tracking_week_9.csv')

In [3]:
# Concatenating tracking data for all 9 weeks into a single dataframe called tracking

tracking = pd.concat([track_1, track_2, track_3, track_4, track_5, track_6, track_7, track_8, track_9], axis=0)
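
The nine reads and the concatenation above can also be written as a loop. The sketch below is one possible way to do that, not the code this notebook ran: `load_tracking` and its `data_dir` parameter are hypothetical names, and `ignore_index=True` is an added choice that gives the combined dataframe a fresh 0..n-1 index instead of repeating each week's row labels.

```python
from pathlib import Path

import pandas as pd


def load_tracking(data_dir, weeks=range(1, 10)):
    """Read tracking_week_<n>.csv for each week and stack them into one dataframe."""
    frames = [
        pd.read_csv(Path(data_dir) / f"tracking_week_{week}.csv")
        for week in weeks
    ]
    # ignore_index=True renumbers the rows 0..n-1 instead of keeping each week's labels
    return pd.concat(frames, axis=0, ignore_index=True)
```

With the folder used in this notebook, the call would look like `tracking = load_tracking('/content/drive/MyDrive/Colab Notebooks')`.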


# Data Cleaning

In this section, I am cleaning the imported data. This includes removing columns containing irrelevant data. I am also addressing null-values by either removing or modifying them.

## Players

In [4]:
players.head()

Unnamed: 0,nflId,height,weight,birthDate,collegeName,position,displayName
0,25511,6-4,225,1977-08-03,Michigan,QB,Tom Brady
1,29550,6-4,328,1982-01-22,Arkansas,T,Jason Peters
2,29851,6-2,225,1983-12-02,California,QB,Aaron Rodgers
3,30842,6-6,267,1984-05-19,UCLA,TE,Marcedes Lewis
4,33084,6-4,217,1985-05-17,Boston College,QB,Matt Ryan


In [5]:
#Dropping collegeName as it is unnecessary information

players = players.drop('collegeName', axis = 1)

In [6]:
#Converting heights to centimetres

def convert_height_to_cm(players):
    players['height'] = players['height'].str.split('-').apply(
        lambda x: int(x[0]) * 30.48 + int(x[1]) * 2.54)
    return players

players = convert_height_to_cm(players)

players.head()

Unnamed: 0,nflId,height,weight,birthDate,position,displayName
0,25511,193.04,225,1977-08-03,QB,Tom Brady
1,29550,193.04,328,1982-01-22,T,Jason Peters
2,29851,187.96,225,1983-12-02,QB,Aaron Rodgers
3,30842,198.12,267,1984-05-19,TE,Marcedes Lewis
4,33084,193.04,217,1985-05-17,QB,Matt Ryan
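
To make the arithmetic behind the conversion explicit, here is a small standalone sketch that converts a single height string. `height_to_cm` is a hypothetical helper written for illustration only; it uses the same factors as the cell above (30.48 cm per foot, 2.54 cm per inch).

```python
def height_to_cm(height):
    """Convert a 'feet-inches' string such as '6-4' to centimetres."""
    feet, inches = height.split("-")
    # 1 ft = 30.48 cm and 1 in = 2.54 cm
    return round(int(feet) * 30.48 + int(inches) * 2.54, 2)


print(height_to_cm("6-4"))  # 6 * 30.48 + 4 * 2.54 = 193.04
```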


In [7]:
#Converting weights from pounds to kilograms (using the approximation 1 kg = 2.2 lb)

players['weight'] = (players['weight'] / 2.2).round(2)
players.head()

Unnamed: 0,nflId,height,weight,birthDate,position,displayName
0,25511,193.04,102.27,1977-08-03,QB,Tom Brady
1,29550,193.04,149.09,1982-01-22,T,Jason Peters
2,29851,187.96,102.27,1983-12-02,QB,Aaron Rodgers
3,30842,198.12,121.36,1984-05-19,TE,Marcedes Lewis
4,33084,193.04,98.64,1985-05-17,QB,Matt Ryan
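
Dividing by 2.2 is a rounded conversion. If more precision is wanted, the pound is defined as exactly 0.45359237 kg. The sketch below compares the two; `pounds_to_kg` and `LB_TO_KG` are hypothetical names, and the rest of this notebook keeps the 2.2 approximation.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound in kilograms


def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms, rounded to 2 decimal places."""
    return round(pounds * LB_TO_KG, 2)


approximate = round(225 / 2.2, 2)  # the conversion used in the cell above
exact = pounds_to_kg(225)          # the same weight with the exact factor
print(approximate, exact)
```

For a 225 lb player the two values differ by roughly 0.2 kg, which is unlikely to matter for modelling but is worth knowing when comparing against other sources.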


In [8]:
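# Adding column called age to players (age reached in 2022, computed from the birth year only)

players['birthDate'] = pd.to_datetime(players['birthDate'])

# .dt here is the pandas datetime accessor, so the datetime module is not needed
players['age'] = 2022 - players['birthDate'].dt.year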

players.head()

Unnamed: 0,nflId,height,weight,birthDate,position,displayName,age
0,25511,193.04,102.27,1977-08-03,QB,Tom Brady,45.0
1,29550,193.04,149.09,1982-01-22,T,Jason Peters,40.0
2,29851,187.96,102.27,1983-12-02,QB,Aaron Rodgers,39.0
3,30842,198.12,121.36,1984-05-19,TE,Marcedes Lewis,38.0
4,33084,193.04,98.64,1985-05-17,QB,Matt Ryan,37.0
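
Subtracting birth years gives the age a player reaches during 2022, which can be one year more than his age on game day. A possible refinement is sketched below. It is an illustration only: `age_on` is a hypothetical helper, and the reference date 2022-09-08 is an assumption (the date of the first game in the tracking data).

```python
import pandas as pd


def age_on(birth_dates, reference):
    """Age in whole years on the reference date for a datetime Series."""
    ref = pd.Timestamp(reference)
    # True where the birthday has already occurred in the reference year
    had_birthday = (birth_dates.dt.month < ref.month) | (
        (birth_dates.dt.month == ref.month) & (birth_dates.dt.day <= ref.day)
    )
    return ref.year - birth_dates.dt.year - (~had_birthday).astype(int)


births = pd.to_datetime(pd.Series(["1977-08-03", "1983-12-02"]))
ages = age_on(births, "2022-09-08")
print(ages.tolist())
```

For the two birth dates above (Tom Brady and Aaron Rodgers in the players table), the year-difference method gives 45 and 39, while this version gives 45 and 38 because the December birthday had not yet occurred on the reference date.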


In [9]:
#Dropping birthDate as we now have player ages

players = players.drop('birthDate', axis = 1)
players.head()

Unnamed: 0,nflId,height,weight,position,displayName,age
0,25511,193.04,102.27,QB,Tom Brady,45.0
1,29550,193.04,149.09,T,Jason Peters,40.0
2,29851,187.96,102.27,QB,Aaron Rodgers,39.0
3,30842,198.12,121.36,TE,Marcedes Lewis,38.0
4,33084,193.04,98.64,QB,Matt Ryan,37.0


## Plays

In [10]:
# Get an idea of the structure of the plays dataframe

plays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12486 entries, 0 to 12485
Data columns (total 35 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   gameId                            12486 non-null  int64  
 1   playId                            12486 non-null  int64  
 2   ballCarrierId                     12486 non-null  int64  
 3   ballCarrierDisplayName            12486 non-null  object 
 4   playDescription                   12486 non-null  object 
 5   quarter                           12486 non-null  int64  
 6   down                              12486 non-null  int64  
 7   yardsToGo                         12486 non-null  int64  
 8   possessionTeam                    12486 non-null  object 
 9   defensiveTeam                     12486 non-null  object 
 10  yardlineSide                      12319 non-null  object 
 11  yardlineNumber                    12486 non-null  int64  
 12  game

In [11]:
# Removing all unnecessary columns from plays (.copy() so that later column assignments do not act on a slice)

plays = plays[['gameId', 'playId', 'ballCarrierId', 'ballCarrierDisplayName', 'possessionTeam', 'defensiveTeam']].copy()
plays.head()

Unnamed: 0,gameId,playId,ballCarrierId,ballCarrierDisplayName,possessionTeam,defensiveTeam
0,2022100908,3537,48723,Parker Hesse,ATL,TB
1,2022091103,3126,52457,Chase Claypool,PIT,CIN
2,2022091111,1148,42547,Darren Waller,LV,LAC
3,2022100212,2007,46461,Mike Boone,DEN,LV
4,2022091900,1372,47857,Devin Singletary,BUF,TEN


In [12]:
# Check that the changes have taken effect

plays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12486 entries, 0 to 12485
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   gameId                  12486 non-null  int64 
 1   playId                  12486 non-null  int64 
 2   ballCarrierId           12486 non-null  int64 
 3   ballCarrierDisplayName  12486 non-null  object
 4   possessionTeam          12486 non-null  object
 5   defensiveTeam           12486 non-null  object
dtypes: int64(3), object(3)
memory usage: 585.4+ KB


## Tackles

In [13]:
tackles.head()

Unnamed: 0,gameId,playId,nflId,tackle,assist,forcedFumble,pff_missedTackle
0,2022090800,101,42816,1,0,0,0
1,2022090800,393,46232,1,0,0,0
2,2022090800,486,40166,1,0,0,0
3,2022090800,646,47939,1,0,0,0
4,2022090800,818,40107,1,0,0,0


In [14]:
# Get an idea of the structure of the tackles dataframe

tackles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17426 entries, 0 to 17425
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gameId            17426 non-null  int64
 1   playId            17426 non-null  int64
 2   nflId             17426 non-null  int64
 3   tackle            17426 non-null  int64
 4   assist            17426 non-null  int64
 5   forcedFumble      17426 non-null  int64
 6   pff_missedTackle  17426 non-null  int64
dtypes: int64(7)
memory usage: 953.1 KB


## Tracking

In [15]:
# Get an idea of the structure of the tracking dataframe

tracking.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12187398 entries, 0 to 1150022
Data columns (total 17 columns):
 #   Column         Dtype  
---  ------         -----  
 0   gameId         int64  
 1   playId         int64  
 2   nflId          float64
 3   displayName    object 
 4   frameId        int64  
 5   time           object 
 6   jerseyNumber   float64
 7   club           object 
 8   playDirection  object 
 9   x              float64
 10  y              float64
 11  s              float64
 12  a              float64
 13  dis            float64
 14  o              float64
 15  dir            float64
 16  event          object 
dtypes: float64(9), int64(3), object(5)
memory usage: 1.6+ GB
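
The `info()` output reports more than 1.6 GB for tracking, which can be tight on a Colab instance. One common way to reduce this, sketched below on made-up toy data, is to store repeated strings as `category` and measurements as `float32`. This is an optional idea rather than a step this notebook performs, and `float32` trades some precision for space.

```python
import pandas as pd

# Toy stand-in for tracking: repeated team/direction strings and a speed column
toy = pd.DataFrame({
    "club": ["BUF", "LA"] * 1000,
    "playDirection": ["left", "right"] * 1000,
    "s": [1.62, 1.67] * 1000,
})

before = toy.memory_usage(deep=True).sum()

# Repeated strings become integer codes; 64-bit floats become 32-bit floats
slim = toy.astype({"club": "category", "playDirection": "category", "s": "float32"})

after = slim.memory_usage(deep=True).sum()
print(before, after)
```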


In [16]:
# Dropping jerseyNumber as it is unnecessary information

tracking = tracking.drop('jerseyNumber', axis = 1)

In [17]:
# Count the number of non-null values in each column of tracking.

print('The number of non-null values in each column of tracking is:')
print()
tracking.count()


The number of non-null values in each column of tracking is:



gameId           12187398
playId           12187398
nflId            11657338
displayName      12187398
frameId          12187398
time             12187398
club             12187398
playDirection    12187398
x                12187398
y                12187398
s                12187398
a                12187398
dis              12187398
o                11658511
dir              11658511
event             1102042
dtype: int64
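
The counts above show that nflId, o, and dir are missing in some rows and that event is mostly empty. To see which rows those are, the missing values can be counted and the affected rows inspected. The sketch below uses a made-up toy frame; the `football` row is an assumption about the likely cause (the ball being tracked as its own row without an nflId) and should be checked against the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tracking; all values are illustrative
toy = pd.DataFrame({
    "nflId": [35472.0, np.nan, 35472.0],
    "displayName": ["Rodger Saffold", "football", "Rodger Saffold"],
    "o": [231.74, np.nan, 230.98],
    "event": [None, "pass_arrived", None],
})

# Number of missing values per column (the complement of count())
missing = toy.isna().sum()

# Rows that have no player id
no_id = toy[toy["nflId"].isna()]

print(missing)
print(no_id["displayName"].unique())
```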

# Preprocessing

In this section, I am combining columns from multiple dataframes to create a single dataframe which I will use for feature engineering.

## Tackle Opportunity

Here I am creating a new column in the tackles dataframe, positiveTackle, that sums the tackle, assist, and forcedFumble values recorded for each defender on each play. I will then merge this dataframe to tracking on gameId, playId, and nflId. Rows in the resulting dataframe with NaN in positiveTackle will be set to 0, indicating that the player was not credited with a positive tackle on that play.

In [18]:
# Creating a new column in tackles that sums the value in the tackle, assist, and forcedFumble columns

tackles['positiveTackle'] = tackles[['tackle','assist', 'forcedFumble']].sum(axis=1).astype(int)
tackles.head()

Unnamed: 0,gameId,playId,nflId,tackle,assist,forcedFumble,pff_missedTackle,positiveTackle
0,2022090800,101,42816,1,0,0,0,1
1,2022090800,393,46232,1,0,0,0,1
2,2022090800,486,40166,1,0,0,0,1
3,2022090800,646,47939,1,0,0,0,1
4,2022090800,818,40107,1,0,0,0,1
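
Because positiveTackle is a sum, a defender credited with both a tackle and a forced fumble on the same play would get a value of 2. If a strict 0/1 label is wanted for modelling, one option is to flag whether any of the three columns is non-zero. The sketch below uses made-up rows and is a possible variant, not what the cell above does.

```python
import pandas as pd

# Toy rows: a tackle, an assist, and a tackle that also forced a fumble
toy_tackles = pd.DataFrame({
    "tackle": [1, 0, 1],
    "assist": [0, 1, 0],
    "forcedFumble": [0, 0, 1],
})
cols = ["tackle", "assist", "forcedFumble"]

# Sum of the three columns, as in the cell above
toy_tackles["summed"] = toy_tackles[cols].sum(axis=1)

# 1 if at least one of the three columns is non-zero, else 0
toy_tackles["binary"] = toy_tackles[cols].any(axis=1).astype(int)

print(toy_tackles)
```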


In [19]:
# Since I only need the positiveTackle column from this dataset, I am dropping all other columns except gameId, playId, and nflId.

tackles = tackles.drop(['tackle', 'assist', 'forcedFumble', 'pff_missedTackle'], axis = 1)
tackles.head()

Unnamed: 0,gameId,playId,nflId,positiveTackle
0,2022090800,101,42816,1
1,2022090800,393,46232,1
2,2022090800,486,40166,1
3,2022090800,646,47939,1
4,2022090800,818,40107,1


In [20]:
# Merging tackles to tracking

tracking = tracking.merge(tackles, how = 'left', on = ['gameId', 'playId', 'nflId'])
tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,s,a,dis,o,dir,event,positiveTackle
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,1.62,1.15,0.16,231.74,147.9,,
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,1.57,0.49,0.15,230.98,147.05,,
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,1.44,0.89,0.14,232.38,145.42,,
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,1.29,1.24,0.13,233.36,141.95,,


In [21]:
# Converting all NaNs in positiveTackle in tracking to 0s

tracking['positiveTackle'] = tracking['positiveTackle'].fillna(0).astype(int)
tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,s,a,dis,o,dir,event,positiveTackle
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,1.62,1.15,0.16,231.74,147.9,,0
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,0
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,1.57,0.49,0.15,230.98,147.05,,0
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,1.44,0.89,0.14,232.38,145.42,,0
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,1.29,1.24,0.13,233.36,141.95,,0
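
A left merge like this one silently duplicates tracking rows if the right-hand table ever holds more than one row per key. pandas can guard against that with the `validate` argument of `merge`, and `indicator=True` records which rows found a match. The sketch below shows both on made-up toy frames; it is an optional safety check, not part of the original pipeline.

```python
import pandas as pd

keys = ["gameId", "playId", "nflId"]

# Toy tracking: two frames for player 100 and one frame for player 200
toy_tracking = pd.DataFrame({
    "gameId": [1, 1, 1],
    "playId": [10, 10, 10],
    "nflId": [100.0, 100.0, 200.0],
    "frameId": [1, 2, 1],
})

# Toy tackles: only player 100 is credited on this play
toy_tackles = pd.DataFrame({
    "gameId": [1], "playId": [10], "nflId": [100], "positiveTackle": [1],
})

merged = toy_tracking.merge(
    toy_tackles,
    how="left",
    on=keys,
    validate="many_to_one",  # raises MergeError if tackles has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
merged["positiveTackle"] = merged["positiveTackle"].fillna(0).astype(int)

print(merged[["nflId", "frameId", "positiveTackle", "_merge"]])
```

Every frame of a credited player receives the same label, which is why the label is per player and play rather than per frame.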


## Ball Carrier

Here I am creating a new column in plays called ballCarrier. Since every row of plays identifies a ball carrier, this column will be filled with 1 for all rows.

I will then merge plays to tracking. This will help me identify who the ball carrier is for each play in the tracking dataframe.

In [22]:
# Creating a new column called ballCarrier in plays and filling it with 1s

plays['ballCarrier'] = 1

# Display plays dataframe

plays.head()

Unnamed: 0,gameId,playId,ballCarrierId,ballCarrierDisplayName,possessionTeam,defensiveTeam,ballCarrier
0,2022100908,3537,48723,Parker Hesse,ATL,TB,1
1,2022091103,3126,52457,Chase Claypool,PIT,CIN,1
2,2022091111,1148,42547,Darren Waller,LV,LAC,1
3,2022100212,2007,46461,Mike Boone,DEN,LV,1
4,2022091900,1372,47857,Devin Singletary,BUF,TEN,1


In [23]:
# Merging plays to tracking

tracking = tracking.merge(plays, how = 'left', left_on = ['gameId','playId', 'nflId'], right_on = ['gameId','playId', 'ballCarrierId'])

# Filling any NaNs in ballCarrier in the merged tracking dataframe to 0s

tracking['ballCarrier'] = tracking['ballCarrier'].fillna(0).astype(int)

# Display merged tracking dataframe

tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,...,dis,o,dir,event,positiveTackle,ballCarrierId,ballCarrierDisplayName,possessionTeam,defensiveTeam,ballCarrier
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,...,0.16,231.74,147.9,,0,,,,,0
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,...,0.17,230.98,148.53,pass_arrived,0,,,,,0
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,...,0.15,230.98,147.05,,0,,,,,0
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,...,0.14,232.38,145.42,,0,,,,,0
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,...,0.13,233.36,141.95,,0,,,,,0


In [24]:
# Dropping unnecessary columns from tracking.

tracking = tracking.drop(['ballCarrierId', 'ballCarrierDisplayName', 'possessionTeam', 'defensiveTeam'], axis = 1)
tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,s,a,dis,o,dir,event,positiveTackle,ballCarrier
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,1.62,1.15,0.16,231.74,147.9,,0,0
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,0,0
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,1.57,0.49,0.15,230.98,147.05,,0,0
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,1.44,0.89,0.14,232.38,145.42,,0,0
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,1.29,1.24,0.13,233.36,141.95,,0,0


## Defensive Team

Here I am creating a new column called **defense** in a copy of the plays dataframe. All rows of this column will hold 1s. I will then merge this copy to tracking and fill all NaNs with 0, so that defense is 1 for players on the defensive team and 0 otherwise.

In [25]:
# Copying plays (an explicit copy, so that plays itself is not modified)

plays_copy = plays.copy()

# Creating a column called defense in plays_copy and filling it with 1

plays_copy['defense'] = 1

# Dropping the ball-carrier columns (including ballCarrier) and possessionTeam from plays_copy,
# so that ballCarrier is neither merged twice nor used as a join key

plays_copy = plays_copy.drop(['ballCarrierId', 'ballCarrierDisplayName', 'possessionTeam', 'ballCarrier'], axis = 1)

# Left merging plays_copy to tracking on gameId, playId, and club

tracking = tracking.merge(plays_copy.rename(columns = {'defensiveTeam':'club'}), how = 'left', on = ['gameId', 'playId', 'club'])

# View merged tracking dataframe

tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,s,a,dis,o,dir,event,positiveTackle,ballCarrier,defense
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,1.62,1.15,0.16,231.74,147.9,,0,0,
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,0,0,
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,1.57,0.49,0.15,230.98,147.05,,0,0,
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,1.44,0.89,0.14,232.38,145.42,,0,0,
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,1.29,1.24,0.13,233.36,141.95,,0,0,


In [26]:
# Converting all NaNs in defense in tracking to 0s

tracking['defense'] = tracking['defense'].fillna(0).astype(int)
tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,s,a,dis,o,dir,event,positiveTackle,ballCarrier,defense
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,1.62,1.15,0.16,231.74,147.9,,0,0,0
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,0,0,0
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,1.57,0.49,0.15,230.98,147.05,,0,0,0
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,1.44,0.89,0.14,232.38,145.42,,0,0,0
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,1.29,1.24,0.13,233.36,141.95,,0,0,0
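
When `merge` is called without `on`, pandas joins on every column name the two dataframes share, so any extra shared column (such as ballCarrier) silently becomes a join key. Naming the keys explicitly avoids this. The sketch below builds the defense flag on made-up toy data; `defense_lookup` is a hypothetical name and the values are illustrative.

```python
import pandas as pd

# Toy tracking rows for one play: two BUF players (offense) and one LA player (defense)
toy_tracking = pd.DataFrame({
    "gameId": [1, 1, 1],
    "playId": [10, 10, 10],
    "club": ["BUF", "BUF", "LA"],
    "ballCarrier": [1, 0, 0],
})

# Toy plays row that also carries a ballCarrier column
toy_plays = pd.DataFrame({
    "gameId": [1], "playId": [10], "defensiveTeam": ["LA"], "ballCarrier": [1],
})

# Keep only the join keys, rename defensiveTeam to club, and add the flag
defense_lookup = (
    toy_plays[["gameId", "playId", "defensiveTeam"]]
    .rename(columns={"defensiveTeam": "club"})
    .assign(defense=1)
)

# Explicit keys: ballCarrier cannot take part in the join
flagged = toy_tracking.merge(defense_lookup, how="left", on=["gameId", "playId", "club"])
flagged["defense"] = flagged["defense"].fillna(0).astype(int)

print(flagged)
```

A quick sanity check on the real data would be that `tracking['defense'].sum()` is greater than zero and that no row has both ballCarrier and defense equal to 1.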


## Player Characteristics

In this section, I am merging the players dataframe to the tracking dataframe on nflId.

In [27]:
# Left merging players to tracking on nflId

tracking = tracking.merge(players.drop('displayName', axis = 1), how = 'left', on = 'nflId')

# View changes

tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,...,o,dir,event,positiveTackle,ballCarrier,defense,height,weight,position,age
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,...,231.74,147.9,,0,0,0,195.58,147.73,G,34.0
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,...,230.98,148.53,pass_arrived,0,0,0,195.58,147.73,G,34.0
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,...,230.98,147.05,,0,0,0,195.58,147.73,G,34.0
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,...,232.38,145.42,,0,0,0,195.58,147.73,G,34.0
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,...,233.36,141.95,,0,0,0,195.58,147.73,G,34.0


In [28]:
# Dropping event from tracking because I will not be using it

tracking = tracking.drop('event', axis = 1)
tracking.head()

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,club,playDirection,x,y,...,dis,o,dir,positiveTackle,ballCarrier,defense,height,weight,position,age
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,BUF,left,88.37,27.27,...,0.16,231.74,147.9,0,0,0,195.58,147.73,G,34.0
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,BUF,left,88.47,27.13,...,0.17,230.98,148.53,0,0,0,195.58,147.73,G,34.0
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,BUF,left,88.56,27.01,...,0.15,230.98,147.05,0,0,0,195.58,147.73,G,34.0
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,BUF,left,88.64,26.9,...,0.14,232.38,145.42,0,0,0,195.58,147.73,G,34.0
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,BUF,left,88.72,26.8,...,0.13,233.36,141.95,0,0,0,195.58,147.73,G,34.0


# Feature Engineering

In this section, I will use the data contained in tracking to engineer features to train my machine learning model on. The main features