# 🏈 NFL Game Data Analysis

## Overview
This Jupyter Notebook explores NFL game, play, player, tackle, and tracking data.  
It also describes each dataset and its columns.

## Objectives
- Load and explore the dataset. ✅
- Perform exploratory data analysis (EDA).
- Visualize key trends and patterns.
- Identify interesting insights based on game events.

In [1]:
import os

# List the contents of the data folder
data_folder = "../data"
contents = os.listdir(data_folder)
print(contents)

['games.csv', 'players.csv', 'plays.csv', 'tackles.csv', 'tracking_week_1.csv', 'tracking_week_2.csv', 'tracking_week_3.csv', 'tracking_week_4.csv', 'tracking_week_5.csv', 'tracking_week_6.csv', 'tracking_week_7.csv', 'tracking_week_8.csv', 'tracking_week_9.csv']
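
The listing shows nine weekly tracking files next to the four lookup tables. A minimal sketch of picking out the tracking files in week order, assuming the `tracking_week_<n>.csv` naming seen in the listing; `tracking_files` and `contents_sample` are hypothetical names, not part of the notebook:

```python
import re

def tracking_files(filenames):
    """Return tracking file names sorted by their week number."""
    pattern = re.compile(r"tracking_week_(\d+)\.csv$")
    matches = [(int(m.group(1)), name)
               for name in filenames
               if (m := pattern.match(name))]
    return [name for _, name in sorted(matches)]

# A few names from the listing, deliberately out of order
contents_sample = ["games.csv", "tracking_week_9.csv",
                   "tracking_week_1.csv", "tracking_week_2.csv"]
print(tracking_files(contents_sample))
```

Each returned name can be joined to `data_folder` and passed to `pd.read_csv`. Sorting on the parsed week number rather than on the raw string only matters if a two-digit week is ever added.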


In [None]:
import pandas as pd
import numpy as np

# Build file paths relative to data_folder
games_path = os.path.join(data_folder, 'games.csv')
plays_path = os.path.join(data_folder, 'plays.csv')
players_path = os.path.join(data_folder, 'players.csv')
week1_path = os.path.join(data_folder, 'tracking_week_1.csv')
tackles_path = os.path.join(data_folder, 'tackles.csv')

# Load the data
games = pd.read_csv(games_path)
plays = pd.read_csv(plays_path)
players = pd.read_csv(players_path)
# players = players.drop("displayName", axis=1)
week1 = pd.read_csv(week1_path)
tackles = pd.read_csv(tackles_path)



### Game data
The games.csv file contains the teams playing in each game. The key variable is gameId.
- **gameId**: Unique game identifier (numeric).
- **season**: Season in which the game took place (numeric).
- **week**: Week of the season when the game occurred (numeric).
- **gameDate**: Date of the game (format: MM/DD/YYYY).
- **gameTimeEastern**: Start time of the game in Eastern Time (format: HH:MM:SS).
- **homeTeamAbbr**: Abbreviation of the home team, two or three letters such as `LA` or `BUF` (text).
- **visitorTeamAbbr**: Abbreviation of the visiting team, two or three letters (text).
- **homeFinalScore**: Total points scored by the home team (numeric).
- **visitorFinalScore**: Total points scored by the visiting team (numeric).
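
`gameDate` and `gameTimeEastern` are stored as separate text columns. One way to combine them into a single timestamp is sketched here on two rows copied from the `games.head()` output rather than on the full file. `games_sample`, `kickoff`, and `homeMargin` are illustrative names, and the resulting timestamp is left timezone-naive:

```python
import pandas as pd

# Two rows copied from the games.head() output
games_sample = pd.DataFrame({
    "gameId": [2022090800, 2022091100],
    "gameDate": ["09/08/2022", "09/11/2022"],
    "gameTimeEastern": ["20:20:00", "13:00:00"],
    "homeFinalScore": [10, 26],
    "visitorFinalScore": [31, 27],
})

# Concatenate the date and time text, then parse with an explicit format
games_sample["kickoff"] = pd.to_datetime(
    games_sample["gameDate"] + " " + games_sample["gameTimeEastern"],
    format="%m/%d/%Y %H:%M:%S",
)

# A positive margin means the home team won
games_sample["homeMargin"] = (
    games_sample["homeFinalScore"] - games_sample["visitorFinalScore"]
)
print(games_sample[["gameId", "kickoff", "homeMargin"]])
```

Passing `format` explicitly avoids any ambiguity between month-first and day-first dates.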

In [34]:
# print(games.describe())
print(games.shape)
games.head()

(136, 9)


Unnamed: 0,gameId,season,week,gameDate,gameTimeEastern,homeTeamAbbr,visitorTeamAbbr,homeFinalScore,visitorFinalScore
0,2022090800,2022,1,09/08/2022,20:20:00,LA,BUF,10,31
1,2022091100,2022,1,09/11/2022,13:00:00,ATL,NO,26,27
2,2022091101,2022,1,09/11/2022,13:00:00,CAR,CLE,24,26
3,2022091102,2022,1,09/11/2022,13:00:00,CHI,SF,19,10
4,2022091103,2022,1,09/11/2022,13:00:00,CIN,PIT,20,23


### Play data
The plays.csv file contains play-level information from each game. The key variables are gameId and playId.

- **gameId**: Unique game identifier (numeric).
- **playId**: Play identifier, not unique across games (numeric).
- **ballCarrierId**: The `nflId` of the ball carrier (numeric).
- **ballCarrierDisplayName**: Name of the ball carrier (text).
- **playDescription**: Description of play (text).
- **quarter**: Game quarter (numeric).
- **down**: Down number (numeric).
- **yardsToGo**: Distance needed for a first down (numeric).
- **possessionTeam**: Offensive team abbreviation (text).
- **defensiveTeam**: Defensive team abbreviation (text).
- **yardlineSide**: Abbreviation of the team on whose side of the field the line of scrimmage lies (text).
- **yardlineNumber**: Yard line at the line of scrimmage (numeric).
- **gameClock**: Time remaining on the game clock in the quarter (format: `MM:SS`).
- **preSnapHomeScore**: Home team's score before the play (numeric).
- **preSnapVisitorScore**: Visitor team's score before the play (numeric).
- **passResult**: Outcome of a passing play (C: Complete, I: Incomplete, S: Sack, IN: Interception, R: Scramble) (text).
- **passLength**: Distance the ball traveled beyond the line of scrimmage (numeric).
- **penaltyYards**: Yards gained by offense due to a penalty (numeric).
- **prePenaltyPlayResult**: Net yards gained before penalty yardage (numeric).
- **playResult**: Net yards gained, including penalty yardage (numeric).
- **playNullifiedByPenalty**: Whether a penalty nullified the play (Y/N) (text).
- **absoluteYardlineNumber**: Distance from the end zone for possession team (numeric).
- **offenseFormation**: Formation used by the offense (text).
- **defendersInTheBox**: Number of defenders near the line of scrimmage (numeric).
- **passProbability**: Probability of a pass play (numeric).
- **preSnapHomeTeamWinProbability**: Home team win probability before play (numeric).
- **preSnapVisitorTeamWinProbability**: Visitor team win probability before play (numeric).
- **homeTeamWinProbabilityAdded**: Change in home team win probability (numeric).
- **visitorTeamWinProbilityAdded**: Change in visitor team win probability (numeric). The column name is misspelled in the data (`Probility`), so use this spelling when selecting it.
- **expectedPoints**: Expected points on this play (numeric).
- **expectedPointsAdded**: Change in expected points on this play (numeric).
- **foulName[i]**: Name of the i-th penalty committed during the play (text).
- **foulNFLId[i]**: `nflId` of the player who committed the i-th penalty (numeric).
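
Because `playId` is not unique across games, a play is only identified by the pair (`gameId`, `playId`). A small sketch of how that could be checked: the first three rows reuse keys from the `plays.head()` output, and the fourth row is hypothetical, added only to show a `playId` repeating in another game.

```python
import pandas as pd

plays_sample = pd.DataFrame({
    "gameId": [2022100908, 2022091103, 2022091111, 2022091103],
    # The last playId is hypothetical: it reuses 3537 in a different game
    "playId": [3537, 3126, 1148, 3537],
})

# playId alone repeats, but the composite key does not
playid_repeats = plays_sample["playId"].duplicated().any()
key_repeats = plays_sample.duplicated(subset=["gameId", "playId"]).any()
print(playid_repeats, key_repeats)
```

Running the same two checks on the full `plays` frame is a cheap way to confirm the key before any merge.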


In [30]:
# print(plays.describe())
print(plays.shape)
plays.head()

(12486, 35)


Unnamed: 0,gameId,playId,ballCarrierId,ballCarrierDisplayName,playDescription,quarter,down,yardsToGo,possessionTeam,defensiveTeam,...,preSnapHomeTeamWinProbability,preSnapVisitorTeamWinProbability,homeTeamWinProbabilityAdded,visitorTeamWinProbilityAdded,expectedPoints,expectedPointsAdded,foulName1,foulName2,foulNFLId1,foulNFLId2
0,2022100908,3537,48723,Parker Hesse,(7:52) (Shotgun) M.Mariota pass short middle t...,4,1,10,ATL,TB,...,0.976785,0.023215,-0.00611,0.00611,2.360609,0.981955,,,,
1,2022091103,3126,52457,Chase Claypool,(7:38) (Shotgun) C.Claypool right end to PIT 3...,4,1,10,PIT,CIN,...,0.160485,0.839515,-0.010865,0.010865,1.733344,-0.263424,,,,
2,2022091111,1148,42547,Darren Waller,(8:57) D.Carr pass short middle to D.Waller to...,2,2,5,LV,LAC,...,0.756661,0.243339,-0.037409,0.037409,1.312855,1.133666,,,,
3,2022100212,2007,46461,Mike Boone,(13:12) M.Boone left tackle to DEN 44 for 7 ya...,3,2,10,DEN,LV,...,0.620552,0.379448,-0.002451,0.002451,1.641006,-0.04358,,,,
4,2022091900,1372,47857,Devin Singletary,(8:33) D.Singletary right guard to TEN 32 for ...,2,1,10,BUF,TEN,...,0.83629,0.16371,0.001053,-0.001053,3.686428,-0.167903,,,,


### Player data 
The players.csv file contains player-level information for players who appear in any of the tracking data files. The key variable is nflId.
- **nflId**: Unique player identification number (numeric).
- **height**: Player height in feet-inches, e.g. `6-4` (text).
- **weight**: Player weight in pounds (numeric).
- **birthDate**: Date of birth (format: `YYYY-MM-DD`).
- **collegeName**: Player's college (text).
- **position**: Official player position (text).
- **displayName**: Player name (text).
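
Since `height` is stored as text, it needs converting before any numeric comparison. A minimal sketch, assuming every value follows the feet-inches pattern seen in the sample rows; `height_to_inches` is a hypothetical helper:

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-4' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

print(height_to_inches("6-4"))  # 76
```

On the full table this could be applied with `players["height"].map(height_to_inches)`.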

In [31]:
# print(players.describe())
print(players.shape)
players.head()

(1683, 7)


Unnamed: 0,nflId,height,weight,birthDate,collegeName,position,displayName
0,25511,6-4,225,1977-08-03,Michigan,QB,Tom Brady
1,29550,6-4,328,1982-01-22,Arkansas,T,Jason Peters
2,29851,6-2,225,1983-12-02,California,QB,Aaron Rodgers
3,30842,6-6,267,1984-05-19,UCLA,TE,Marcedes Lewis
4,33084,6-4,217,1985-05-17,Boston College,QB,Matt Ryan


### Tackles data
The tackles.csv file contains player-level tackle information for each game and play. The key variables are gameId, playId, and nflId.

- **gameId**: Unique game identifier (numeric).
- **playId**: Play identifier, not unique across games (numeric).
- **nflId**: Unique player identification number (numeric).
- **tackle**: Indicator if player made a tackle (binary).
- **assist**: Indicator if player assisted a tackle (binary).
- **forcedFumble**: Indicator if player forced a fumble (binary).
- **pff_missedTackle**: Indicator if player missed a tackle (binary).
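
Because the four indicators are binary, summing them per player gives tackle counts. A sketch on a tiny frame: the first three rows come from the `tackles.head()` output, and the fourth is a hypothetical assist added so that one player appears twice.

```python
import pandas as pd

tackles_sample = pd.DataFrame({
    "gameId": [2022090800, 2022090800, 2022090800, 2022090800],
    "playId": [101, 393, 486, 900],        # playId 900 is hypothetical
    "nflId": [42816, 46232, 40166, 42816],
    "tackle": [1, 1, 1, 0],
    "assist": [0, 0, 0, 1],                # hypothetical assist for 42816
    "forcedFumble": [0, 0, 0, 0],
    "pff_missedTackle": [0, 0, 0, 0],
})

# Sum each binary indicator per player
indicator_cols = ["tackle", "assist", "forcedFumble", "pff_missedTackle"]
totals = tackles_sample.groupby("nflId")[indicator_cols].sum()
print(totals)
```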

In [32]:
# print(tackles.describe())
print(tackles.shape)
tackles.head()

(17426, 7)


Unnamed: 0,gameId,playId,nflId,tackle,assist,forcedFumble,pff_missedTackle
0,2022090800,101,42816,1,0,0,0
1,2022090800,393,46232,1,0,0,0
2,2022090800,486,40166,1,0,0,0
3,2022090800,646,47939,1,0,0,0
4,2022090800,818,40107,1,0,0,0


### Tracking data
Files tracking_week_[week].csv contain player tracking data from week number [week]. The key variables are gameId, playId, and nflId.

- **gameId**: Unique game identifier (numeric).
- **playId**: Play identifier, not unique across games (numeric).
- **nflId**: Unique player identification number. If `NA`, the row corresponds to the ball (numeric).
- **displayName**: Player name (text).
- **frameId**: Frame identifier within the play (numeric).
- **time**: Timestamp of the frame, with fractional seconds (format: `YYYY-MM-DD HH:MM:SS.ffffff`).
- **jerseyNumber**: Player jersey number (numeric).
- **club**: Team abbreviation (text).
- **playDirection**: Direction the offense is moving (left or right) (text).
- **x**: Player position along the field's long axis (0-120 yards) (numeric).
- **y**: Player position along the field's short axis (0-53.3 yards) (numeric).
- **s**: Speed in yards per second (numeric).
- **a**: Acceleration in yards per second² (numeric).
- **dis**: Distance traveled from prior frame in yards (numeric).
- **o**: Player orientation (degrees, 0-360) (numeric).
- **dir**: Angle of player motion (degrees, 0-360) (numeric).
- **event**: Tagged play details (snap, pass, catch, tackle, etc.) (text).
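
Since `playDirection` can be left or right, positions from different plays are not directly comparable. A common preprocessing step, which this notebook does not perform and which is shown here only as an assumption-laden sketch, mirrors left-moving plays so the offense always moves toward increasing `x`. The field dimensions are taken from the column descriptions above; `to_rightward` is a hypothetical helper.

```python
FIELD_LENGTH = 120.0  # yards, x range from the column description
FIELD_WIDTH = 53.3    # yards, y range from the column description

def to_rightward(x, y, play_direction):
    """Mirror a position so the offense always moves toward increasing x."""
    if play_direction == "left":
        return FIELD_LENGTH - x, FIELD_WIDTH - y
    return x, y

# First row of the week1.head() output: x=88.37, y=27.27, playDirection='left'
x_std, y_std = to_rightward(88.37, 27.27, "left")
print(round(x_std, 2), round(y_std, 2))
```

If the angles `o` and `dir` are used as well, they would presumably need rotating by 180 degrees for the same plays.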

In [33]:
# Only week 1 is loaded here; the full tracking data spans 9 weeks.
# print(week1.describe())
print(week1.shape)
week1.head()

(1407439, 17)


Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,76.0,BUF,left,88.37,27.27,1.62,1.15,0.16,231.74,147.9,
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,76.0,BUF,left,88.47,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,76.0,BUF,left,88.56,27.01,1.57,0.49,0.15,230.98,147.05,
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,76.0,BUF,left,88.64,26.9,1.44,0.89,0.14,232.38,145.42,
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,76.0,BUF,left,88.72,26.8,1.29,1.24,0.13,233.36,141.95,


In [23]:
week1[(week1['gameId'] == 2022090800) & (week1['playId'] == 56) & (week1['nflId'] == 35472)]

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,76.0,BUF,left,88.37,27.27,1.62,1.15,0.16,231.74,147.9,
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,76.0,BUF,left,88.47,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,76.0,BUF,left,88.56,27.01,1.57,0.49,0.15,230.98,147.05,
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,76.0,BUF,left,88.64,26.9,1.44,0.89,0.14,232.38,145.42,
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,76.0,BUF,left,88.72,26.8,1.29,1.24,0.13,233.36,141.95,
5,2022090800,56,35472.0,Rodger Saffold,6,2022-09-08 20:24:05.700000,76.0,BUF,left,88.8,26.7,1.15,1.42,0.12,234.48,139.41,pass_outcome_caught
6,2022090800,56,35472.0,Rodger Saffold,7,2022-09-08 20:24:05.799999,76.0,BUF,left,88.87,26.64,0.93,1.69,0.09,235.77,134.32,
7,2022090800,56,35472.0,Rodger Saffold,8,2022-09-08 20:24:05.900000,76.0,BUF,left,88.91,26.59,0.68,1.74,0.07,240.0,131.01,
8,2022090800,56,35472.0,Rodger Saffold,9,2022-09-08 20:24:06.000000,76.0,BUF,left,88.94,26.57,0.42,1.74,0.04,243.56,122.29,
9,2022090800,56,35472.0,Rodger Saffold,10,2022-09-08 20:24:06.099999,76.0,BUF,left,88.95,26.58,0.14,1.83,0.01,246.07,85.87,


In [None]:

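# Join games -> plays -> tracking (week 1 only); inner joins drop plays without tracking rows
joined_all = pd.merge(games, plays, how="inner", on="gameId")
joined_all = pd.merge(joined_all, week1, how="inner", on=["gameId", "playId"])
# Left join so tracking rows without a tackle record are kept
joined_all = pd.merge(joined_all, tackles, how="left", on=["gameId", "playId", "nflId"])
# Left join on players to keep the football rows, whose nflId is missing.
# displayName exists in both week1 and players, so pandas suffixes it as _x/_y.
joined_all = pd.merge(joined_all, players, how="left", on="nflId")

# playId is not unique across games, so this filter can match more than one game
play_focus = 2184
focused_df = joined_all[joined_all.playId == play_focus]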
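
Two details of the join can be made explicit: the `displayName` column exists in both the tracking and player tables, and a single play is only pinned down by `gameId` together with `playId`. The sketch below uses a two-row stand-in for the tracking data, built from the week 1 sample rows. The ball row and its `"football"` label are assumptions, and `tracking_sample`, `players_sample`, `merged`, and `focused` are illustrative names.

```python
import pandas as pd

tracking_sample = pd.DataFrame({
    "gameId": [2022090800, 2022090800],
    "playId": [56, 56],
    "nflId": [35472.0, float("nan")],               # NaN marks the ball row
    "displayName": ["Rodger Saffold", "football"],  # "football" label is an assumption
    "frameId": [1, 1],
})
players_sample = pd.DataFrame({
    "nflId": [35472],
    "displayName": ["Rodger Saffold"],
})

merged = pd.merge(
    tracking_sample, players_sample,
    how="left", on="nflId",
    suffixes=("", "_player"),  # keep the tracking displayName unsuffixed
    validate="many_to_one",    # raise if players has duplicate nflId values
)

# Select one play unambiguously by filtering on both key columns
focused = merged[(merged["gameId"] == 2022090800) & (merged["playId"] == 56)]
print(focused[["nflId", "displayName", "displayName_player"]])
```

The ball row survives the left join with a missing `displayName_player`, which is the behaviour the left join on players is meant to preserve.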