In [1]:
import os
SNAPPY_notebook_path = os.path.join(os.path.abspath(""), "bench.ipynb")

In [2]:
import time
SNAPPY_start_time = time.perf_counter_ns()

In [3]:
%load_ext autotime

time: 39 µs (started: 2024-05-22 22:01:21 -04:00)


# Creating Player Stats Using Tracking Data: Snap Speed and Penetration 
An important use of player tracking data is "feature engineering": creating new features from the data available. In this notebook, I describe a simple set of metrics built from the player tracking information to create some "season averages" for players. Statistics like these can help explain a player's underlying talent. 

# Load packages

In [4]:
import os
# STEFANOS: Conditionally import Modin Pandas
import pandas as pd


time: 224 ms (started: 2024-05-22 22:01:21 -04:00)


# Load one week of player tracking data and PFF scouting data

In [5]:
data = pd.read_csv(os.path.abspath('') + '/input/nfl-big-data-bowl-2023/week1.csv')
scout = pd.read_csv(os.path.abspath('') + '/input/nfl-big-data-bowl-2023/pffScoutingData.csv')
plays = pd.read_csv(os.path.abspath('') + '/input/nfl-big-data-bowl-2023/plays.csv')
players = pd.read_csv(os.path.abspath('') + '/input/nfl-big-data-bowl-2023/players.csv')

# Let's merge these data into one set 
data = data.merge(scout, how='left')
data.shape

(1118122, 28)

time: 1.29 s (started: 2024-05-22 22:01:21 -04:00)


# -- STEFANOS -- Replicate Data

In fact, in this one, `data` seems to be big enough that I think it's ok to not replicate it.

In [6]:
# data.info()

time: 125 µs (started: 2024-05-22 22:01:23 -04:00)


# Create snap metrics 
Let's use the tracking data to create metrics based around the snap. How fast is a player moving 500ms after the snap? How far from the line of scrimmage are they? This can tell us how quickly a player gets off the line of scrimmage and how deep they are relative to it. 

In [7]:
# get ball snap indices 
_idxs = (data
         .loc[data['event']=='ball_snap', 
              'frameId']
         .index
         .values)

# to get 500ms of movement after snap, take the snap row plus the next 5 rows (each row is 100ms of info)
# note: this assumes each player's frames sit in consecutive rows, ordered by frameId
offsets = [(_idxs+k).tolist() for k in range(0,6)]
idxs = [item for sublist in offsets for item in sublist] # offsets is a list of lists, so flatten it

# filter for snap + 500ms of data only using our selected indices
_df = data.loc[idxs]

time: 86.4 ms (started: 2024-05-22 22:01:23 -04:00)


In [8]:
gid = 2021090900
pid = 97 
nid = 25511 
_df.loc[(_df['gameId']==gid) & (_df['playId']==pid) & (_df['nflId']==nid)]

Unnamed: 0,gameId,playId,nflId,frameId,time,jerseyNumber,team,playDirection,x,y,...,pff_hit,pff_hurry,pff_sack,pff_beatenByDefender,pff_hitAllowed,pff_hurryAllowed,pff_sackAllowed,pff_nflIdBlockedPlayer,pff_blockType,pff_backFieldBlock
5,2021090900,97,25511.0,6,2021-09-10T00:26:31.600,12.0,TB,right,37.64,24.26,...,,,,,,,,,,
6,2021090900,97,25511.0,7,2021-09-10T00:26:31.700,12.0,TB,right,37.56,24.26,...,,,,,,,,,,
7,2021090900,97,25511.0,8,2021-09-10T00:26:31.800,12.0,TB,right,37.47,24.25,...,,,,,,,,,,
8,2021090900,97,25511.0,9,2021-09-10T00:26:31.900,12.0,TB,right,37.38,24.24,...,,,,,,,,,,
9,2021090900,97,25511.0,10,2021-09-10T00:26:32.000,12.0,TB,right,37.27,24.23,...,,,,,,,,,,
10,2021090900,97,25511.0,11,2021-09-10T00:26:32.100,12.0,TB,right,37.14,24.22,...,,,,,,,,,,


time: 28.2 ms (started: 2024-05-22 22:01:23 -04:00)


In the above example, we can see there are only 6 rows for a player on a given play: the ball snap row and the 5 frames (500ms) after it. 
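The index arithmetic above relies on each player's frames being stored in consecutive rows. As a minimal sketch of an alternative that does not depend on row order, the same window can be selected by comparing `frameId` with the snap frame of each play. The toy dataframe, the `snapFrame` column, and the `snap_window` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy tracking data (hypothetical values): one play, two players, frames 1-10,
# with the ball snapped on frame 3.
toy = pd.DataFrame({
    "gameId": 1,
    "playId": 1,
    "nflId": [10] * 10 + [20] * 10,
    "frameId": list(range(1, 11)) * 2,
    "event": (["None"] * 2 + ["ball_snap"] + ["None"] * 7) * 2,
})

def snap_window(tracking, n_frames=5):
    """Keep the ball_snap frame plus the next n_frames frames of every play."""
    # One row per play holding the frame on which the ball was snapped.
    snaps = (tracking
             .loc[tracking["event"] == "ball_snap", ["gameId", "playId", "frameId"]]
             .drop_duplicates()
             .rename(columns={"frameId": "snapFrame"}))
    merged = tracking.merge(snaps, on=["gameId", "playId"])
    # Frames elapsed since the snap; between() is inclusive on both ends.
    offset = merged["frameId"] - merged["snapFrame"]
    return merged.loc[offset.between(0, n_frames)].reset_index(drop=True)

window = snap_window(toy)
```

With `n_frames=5`, each player keeps 6 rows per play, matching the window used in this notebook.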

In [9]:
# get line of scrimmage info to compute block/rush depth relative to LOS
_los = (data
        .loc[(data['team']=='football') & 
             (data['frameId']==1), 
             ['gameId', 'playId', 'x']]
        .rename(columns={'x':'los'}))

# merge LOS info back to subsetted data
_df = _df.merge(_los)

time: 261 ms (started: 2024-05-22 22:01:23 -04:00)


In [10]:
_df.loc[(_df['gameId']==gid) & (_df['playId']==pid) & (_df['nflId']==nid)]

Unnamed: 0,gameId,playId,nflId,frameId,time,jerseyNumber,team,playDirection,x,y,...,pff_hurry,pff_sack,pff_beatenByDefender,pff_hitAllowed,pff_hurryAllowed,pff_sackAllowed,pff_nflIdBlockedPlayer,pff_blockType,pff_backFieldBlock,los
0,2021090900,97,25511.0,6,2021-09-10T00:26:31.600,12.0,TB,right,37.64,24.26,...,,,,,,,,,,42.92
23,2021090900,97,25511.0,7,2021-09-10T00:26:31.700,12.0,TB,right,37.56,24.26,...,,,,,,,,,,42.92
46,2021090900,97,25511.0,8,2021-09-10T00:26:31.800,12.0,TB,right,37.47,24.25,...,,,,,,,,,,42.92
69,2021090900,97,25511.0,9,2021-09-10T00:26:31.900,12.0,TB,right,37.38,24.24,...,,,,,,,,,,42.92
92,2021090900,97,25511.0,10,2021-09-10T00:26:32.000,12.0,TB,right,37.27,24.23,...,,,,,,,,,,42.92
115,2021090900,97,25511.0,11,2021-09-10T00:26:32.100,12.0,TB,right,37.14,24.22,...,,,,,,,,,,42.92


time: 20.1 ms (started: 2024-05-22 22:01:23 -04:00)


The above cells take the line of scrimmage from the `x` location of the football in the first frame of the play. Alternatively, you could use the `absoluteYardlineNumber` column of the `plays.csv` dataset, which should hold the line of scrimmage as well. 

Using the same game-play-player example from before: if you scroll to the right, the last column in the dataframe is `los`, which stands for line of scrimmage. 
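The `plays.csv` alternative mentioned above could look roughly like the following sketch. The toy frames and their values are made up for illustration, and whether `absoluteYardlineNumber` agrees exactly with the football's tracked `x` is an assumption worth checking against the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the tracking subset and plays.csv.
toy_df = pd.DataFrame({
    "gameId": [1, 1, 1, 1],
    "playId": [97, 97, 137, 137],
    "nflId": [10, 20, 10, 20],
    "x": [37.6, 44.1, 61.0, 58.2],
})
toy_plays = pd.DataFrame({
    "gameId": [1, 1],
    "playId": [97, 137],
    "absoluteYardlineNumber": [43.0, 60.0],
})

# Build a per-play lookup with the same column name used in this notebook.
los_from_plays = (toy_plays
                  .loc[:, ["gameId", "playId", "absoluteYardlineNumber"]]
                  .rename(columns={"absoluteYardlineNumber": "los"}))

# validate="many_to_one" raises if a play appears more than once in the lookup.
toy_with_los = toy_df.merge(los_from_plays, on=["gameId", "playId"],
                            how="left", validate="many_to_one")
```

Naming the merge keys explicitly with `on=` avoids accidentally joining on any other column the two frames happen to share.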

In [11]:
# get difference from LOS for all frames and players 
_df['los_diff'] = _df['x'].sub(_df['los'])

# multiply by -1 for plays going the "left" direction 
# this is so pass block is monotonic in the same direction (as well as pass rush)
_df.loc[_df['playDirection']=='left', 'los_diff'] = _df.loc[_df['playDirection']=='left', 'los_diff'].mul(-1)

# merge onto play info to get possession team (could do this anywhere, i do it here for no real optimal reason)
_df = plays.loc[:, ['gameId', 'playId', 'possessionTeam']].merge(_df)

time: 47.9 ms (started: 2024-05-22 22:01:23 -04:00)


In [12]:
_df.loc[(_df['gameId']==gid) & (_df['playId']==pid) & (_df['nflId']==nid)]

Unnamed: 0,gameId,playId,possessionTeam,nflId,frameId,time,jerseyNumber,team,playDirection,x,...,pff_sack,pff_beatenByDefender,pff_hitAllowed,pff_hurryAllowed,pff_sackAllowed,pff_nflIdBlockedPlayer,pff_blockType,pff_backFieldBlock,los,los_diff
0,2021090900,97,TB,25511.0,6,2021-09-10T00:26:31.600,12.0,TB,right,37.64,...,,,,,,,,,42.92,-5.28
23,2021090900,97,TB,25511.0,7,2021-09-10T00:26:31.700,12.0,TB,right,37.56,...,,,,,,,,,42.92,-5.36
46,2021090900,97,TB,25511.0,8,2021-09-10T00:26:31.800,12.0,TB,right,37.47,...,,,,,,,,,42.92,-5.45
69,2021090900,97,TB,25511.0,9,2021-09-10T00:26:31.900,12.0,TB,right,37.38,...,,,,,,,,,42.92,-5.54
92,2021090900,97,TB,25511.0,10,2021-09-10T00:26:32.000,12.0,TB,right,37.27,...,,,,,,,,,42.92,-5.65
115,2021090900,97,TB,25511.0,11,2021-09-10T00:26:32.100,12.0,TB,right,37.14,...,,,,,,,,,42.92,-5.78


time: 20.4 ms (started: 2024-05-22 22:01:23 -04:00)


We create a difference-from-line-of-scrimmage metric, `los_diff`, and use the `playDirection` feature to normalize it: multiplying by `-1` on "left" plays makes every offense appear to move "right" and every defense "left". The choice of direction is arbitrary, so long as it is the same for all rows.  

In the same example as before, we can see we've merged in some play information (possession team), and the last column in the dataset is the `los_diff` feature. 
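To make the sign flip concrete, here is a small sketch on made-up rows. It computes the same quantity as the cell above, but with a sign vector from `np.where` instead of a masked assignment; the toy values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Two players on a "right" play (LOS at x=42) and two on a "left" play (LOS at x=58).
toy = pd.DataFrame({
    "playDirection": ["right", "right", "left", "left"],
    "x": [37.0, 45.0, 63.0, 55.0],
    "los": [42.0, 42.0, 58.0, 58.0],
})

# +1 for plays going right, -1 for plays going left.
sign = np.where(toy["playDirection"] == "left", -1.0, 1.0)
toy["los_diff"] = (toy["x"] - toy["los"]) * sign
# los_diff is now [-5.0, 3.0, -5.0, 3.0]
```

After the flip, a player 5 yards behind the line on the offensive side gets `-5.0` regardless of which way the play is going.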

In [13]:
# indicate if a player is on the possession team (1), the defensive team (0), or neither aka the football (-1)
_df['posTeam'] = 0
_df.loc[_df['possessionTeam']==_df['team'], 'posTeam'] = 1 
_df.loc[_df['team']=='football', 'posTeam'] = -1

# create initial snap speed dataframe (mean of speed and acceleration per player)
snap_speed = (_df
              .loc[:, ['nflId','s','a']]
              .groupby('nflId', 
                       as_index=False)
              .mean())

time: 17.2 ms (started: 2024-05-22 22:01:23 -04:00)


In [14]:
snap_speed.head()

Unnamed: 0,nflId,s,a
0,25511.0,0.652727,1.755189
1,28963.0,0.346731,1.195
2,29550.0,0.547222,1.138111
3,29851.0,0.605533,1.5994
4,30078.0,0.47125,1.575208


time: 3.12 ms (started: 2024-05-22 22:01:23 -04:00)


We take the temporary dataframe we have been working with and average the speed and acceleration data with a groupby. The result, with `nflId` and the average `s` and `a`, is the initial `snap_speed` dataframe. 
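As a sketch of the same aggregation on toy data, named aggregation can also carry a count of the frames behind each average, which helps spot players whose averages rest on very few rows. The `n_frames` column is an illustrative addition and is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up speed (s) and acceleration (a) readings for two players.
toy = pd.DataFrame({
    "nflId": [10, 10, 10, 20, 20, 20],
    "s": [0.0, 1.0, 2.0, 1.0, 1.0, 4.0],
    "a": [2.0, 2.0, 2.0, 0.0, 3.0, 3.0],
})

toy_snap_speed = (toy
                  .groupby("nflId", as_index=False)
                  .agg(s=("s", "mean"),          # average speed per player
                       a=("a", "mean"),          # average acceleration per player
                       n_frames=("s", "count"))) # non-null speed readings per player
```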

In [15]:
# depending on whether a player is on offense or defense, aggregate by maximum or minimum LOS difference, respectively. 
# e.g. if an o-lineman has a more negative LOS diff, they allow more pass rush penetration 
_off = _df.loc[_df['posTeam']==1, ['gameId', 'playId', 'nflId', 'los_diff']].groupby(['gameId', 'playId', 'nflId'], as_index=False).max()
_def = _df.loc[_df['posTeam']==0, ['gameId', 'playId', 'nflId', 'los_diff']].groupby(['gameId', 'playId', 'nflId'], as_index=False).min()
# stack the offense and defense rows (pd.concat is the public API; DataFrame._append is private)
los_diff = pd.concat([_off, _def])
los_diff = (los_diff
            .loc[:, ['nflId', 'los_diff']]
            .groupby('nflId', 
                     as_index=False)
            .mean())

# merge LOS diff data back onto snap speed
snap_speed = snap_speed.merge(los_diff)
snap_speed = snap_speed.rename(columns={'s':'snap_s', 'a':'snap_a', 'los_diff':'snap_los_diff'})


time: 28 ms (started: 2024-05-22 22:01:23 -04:00)


In [16]:
snap_speed.head()

Unnamed: 0,nflId,snap_s,snap_a,snap_los_diff
0,25511.0,0.652727,1.755189,-4.382955
1,28963.0,0.346731,1.195,-4.866154
2,29550.0,0.547222,1.138111,-1.565333
3,29851.0,0.605533,1.5994,-4.3044
4,30078.0,0.47125,1.575208,-4.475


time: 3.81 ms (started: 2024-05-22 22:01:23 -04:00)


Finally, we aggregate the `los_diff` data and merge that back onto the `snap_speed` dataframe. This gives us new features we can use to analyze player abilities.
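The two-stage aggregation (one value per player per play, then one average per player) can be traced on a tiny made-up example. The values are assumptions picked so the result is easy to verify by hand.

```python
import pandas as pd

# Player 10 is on offense (posTeam=1), player 20 on defense (posTeam=0); two plays each.
toy = pd.DataFrame({
    "gameId": 1,
    "playId": [1, 1, 1, 1, 2, 2, 2, 2],
    "nflId": [10, 10, 20, 20, 10, 10, 20, 20],
    "posTeam": [1, 1, 0, 0, 1, 1, 0, 0],
    "los_diff": [-1.0, -2.0, 1.0, 0.5, -3.0, -4.0, 2.0, -0.5],
})

keys = ["gameId", "playId", "nflId"]
# Stage 1: per play, max for offense and min for defense.
offense = toy.loc[toy["posTeam"] == 1].groupby(keys, as_index=False)["los_diff"].max()
defense = toy.loc[toy["posTeam"] == 0].groupby(keys, as_index=False)["los_diff"].min()

# Stage 2: stack both sides and average each player's per-play values.
per_play = pd.concat([offense, defense], ignore_index=True)
per_player = per_play.groupby("nflId", as_index=False)["los_diff"].mean()
# player 10: mean(-1.0, -3.0) = -2.0; player 20: mean(0.5, -0.5) = 0.0
```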

# Exploratory Data Analysis with `snap_speed` data

In [17]:
df_plt = players.loc[:, ['nflId', 'officialPosition', 'displayName']].merge(snap_speed)
# STEFANOS: Disable plotting
# sns.scatterplot(data=df_plt.loc[df_plt['officialPosition'].isin(['T','G','C','DT','NT','DE'])], x='snap_s', y='snap_los_diff', hue='officialPosition')
# plt.axhline(0, ls=':', lw=2, c='k')
# plt.legend(bbox_to_anchor=(1.02,1), loc=2)
# sns.despine()
# plt.show()

time: 2.65 ms (started: 2024-05-22 22:01:23 -04:00)


Using our `snap_speed` dataframe, we can visualize which defensive players get across the line of scrimmage fastest and which offensive players block closest to it. It should be no surprise that tackles (who line up off the line of scrimmage) are generally further from the line than centers (who line up nearly on it) after the ball is snapped. Some defensive ends seem to get across the line of scrimmage more often than others; perhaps they time the snap better (or simply cross the line without being called offside). 

We can also see there is no real correlation between snap speed and line-of-scrimmage depth for offensive players. There may be one on the defensive side (faster defensive players can penetrate deeper beyond the line of scrimmage). 

Let's take a look at the list of DEs, ordered by line of scrimmage difference.
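One way to put a number on these impressions is a Pearson correlation per side of the ball. The sketch below runs on made-up values, not the notebook's results; the `side` column and the `corr_by_group` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up per-player averages: no linear trend on offense, a perfect one on defense.
toy = pd.DataFrame({
    "side": ["offense"] * 4 + ["defense"] * 4,
    "snap_s": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "snap_los_diff": [-1.0, -2.0, -2.0, -1.0, 0.3, 0.1, -0.1, -0.3],
})

def corr_by_group(df, group_col, x, y):
    """Pearson correlation between columns x and y within each group."""
    results = {name: grp[x].corr(grp[y]) for name, grp in df.groupby(group_col)}
    return pd.Series(results, name="pearson_r")

r = corr_by_group(toy, "side", "snap_s", "snap_los_diff")
```

On the real data, `side` could be derived from `officialPosition` (for example T, G, C as offense and DT, NT, DE as defense), mirroring the positions in the disabled plot.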

In [18]:
df_plt.loc[(df_plt['officialPosition']=='DE') & (df_plt['snap_los_diff']<0)].sort_values('snap_los_diff')

Unnamed: 0,nflId,officialPosition,displayName,snap_s,snap_a,snap_los_diff
977,52556,DE,Alton Robinson,1.741667,2.258333,-0.454
722,47785,DE,Nick Bosa,1.275375,2.342,-0.35175
674,46255,DE,Jacob Martin,1.144938,2.267346,-0.318519
654,46199,DE,Josh Sweat,1.308551,2.224855,-0.225217
927,52462,DE,A.J. Epenesa,1.357083,2.34625,-0.19875
158,41249,DE,Dee Ford,1.416839,2.097356,-0.169655
18,35441,DE,Ndamukong Suh,1.255045,1.718378,-0.160811
506,44944,DE,Deatrich Wise,1.183889,2.136111,-0.155556
334,43308,DE,Shaq Lawson,1.407857,2.878571,-0.155
743,47809,DE,Montez Sweat,1.103434,1.807828,-0.144242


time: 6.42 ms (started: 2024-05-22 22:01:23 -04:00)


Several of the top players (Bosa, Martin, Ford, etc.) are regarded as good defensive ends (recent Pro Bowl/All-Pro teams). Perhaps the less well-known players on this list are underrated or undervalued. 

# Next steps 
This is a very simple way to aggregate over the player tracking data in order to create features that can help represent a player's underlying abilities. This was only in relation to half a second after snap -- you can create metrics based around any important moments you define in your dataset. Some examples: 
* How often do DEs break through being double teamed? 
* Does a guard get beat more often to his left or to his right? 
* Does weight/height correlate with overall distance traveled after first contact?  

Also, remember this is only for one week's worth of data -- it would make sense to loop over all 8 weeks and aggregate all 8 weeks if you want something more comprehensive! 
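A rough sketch of the multi-week loop follows, under the simplifying assumption that each weekly source has already been reduced to the snap-window rows. The `season_snap_speed` helper is hypothetical, and in-memory CSV text stands in for the weekly files so the example runs on its own.

```python
import io
import pandas as pd

def season_snap_speed(week_sources):
    """Pool rows from several weekly files, then average s and a per player."""
    frames = []
    for week, src in enumerate(week_sources, start=1):
        wk = pd.read_csv(src, usecols=["nflId", "s", "a"])
        wk["week"] = week  # keep track of where each row came from
        frames.append(wk)
    all_weeks = pd.concat(frames, ignore_index=True)
    # Average over the pooled rows rather than averaging weekly averages.
    return all_weeks.groupby("nflId", as_index=False)[["s", "a"]].mean()

# In-memory stand-ins for two weekly files (made-up values).
week1 = io.StringIO("nflId,s,a\n10,1.0,2.0\n10,3.0,2.0\n")
week2 = io.StringIO("nflId,s,a\n10,5.0,5.0\n20,2.0,1.0\n")
season = season_snap_speed([week1, week2])
```

Pooling rows before averaging matters when players have different numbers of rows per week: here player 10's pooled mean speed is 3.0, whereas the mean of the two weekly means (2.0 and 5.0) would be 3.5.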

In [19]:
SNAPPY_end_time = time.perf_counter_ns()
print("Total elapsed time:", (SNAPPY_end_time - SNAPPY_start_time) / (10 ** 9))

Total elapsed time: 2.108097779
time: 243 µs (started: 2024-05-22 22:01:23 -04:00)


# If you liked this notebook, please upvote! 
# Follow more Big Data Bowl live development on my stream: https://twitch.tv/nickwan_datasci 