![title](images/title.png)

# Project Proposal

#### State your hypothesis or research objective, explain the relevance or importance of your topic, and/or describe potential uses / applications of your findings, or what future investigations your work might inform.

The objective of this project is to predict NFL player movement while a pass is in the air. The target variables are the x and y coordinates of each player during a given frame.

This model could assist players and coaching staff in identifying positional tendencies, reaction times, and coverage weaknesses for both their own team and opponents. By accurately modeling player trajectories, teams could better anticipate on-field dynamics and improve strategic decision-making.
  
#### Identify potential data sources, with some preliminary summary statistics, and a statement on data quality and completeness.

- **Primary source**: Kaggle-provided dataset from the 2026 NFL Big Data Bowl

- **Possible expansions**: Team-level metadata such as Head Coach and Offensive/Defensive Coordinator

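- **Dataset summary**: ~4.9 million rows and 23 columns across the combined input files; ~560,000 rows and 25 columns after the inner merge with the output files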

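- **Data quality**: No null values were detected in the merged data. However, the merge between the input and output datasets still requires validation to confirm that all records align and that no duplicates inflate the frame count. An initial check found 2,510 output rows with no matching input row and no duplicate keys in the merged data.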
  
#### Describe your intended approach, listing any potentially useful techniques or types of modeling you intend to explore.

The project will begin with exploratory data analysis and data cleaning, followed by predictive modeling.
Planned techniques include:

- Multiple Linear Regression as a baseline model

- Random Forests, Neural Networks and Gradient Boosting Models to improve accuracy and capture nonlinear relationships

- Clustering methods (e.g., K-Means, DBSCAN, or HDBSCAN) to group plays by field position or play context before modeling

These models will be evaluated using metrics such as Root Mean Squared Error (RMSE) for coordinate prediction accuracy. The ultimate goal is to provide interpretable insights into how player movement dynamics vary across game situations.
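
As a rough illustration of the baseline step, the sketch below fits a multi-output linear regression by ordinary least squares and scores it with RMSE. It runs on synthetic data only; the feature layout and the helper names `fit_linear_baseline` and `predict_linear` are hypothetical stand-ins for whatever features the exploratory analysis ends up selecting.

```python
import numpy as np

def fit_linear_baseline(features, targets):
    """Fit ordinary least squares with an intercept; targets may have several columns."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef  # shape: (n_features + 1, n_targets)

def predict_linear(coef, features):
    """Apply the fitted coefficients, adding the intercept column back in."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

# Synthetic stand-in for four input features and two targets (x_target, y_target).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))
true_coef = np.array([[1.0, -0.5], [0.3, 0.8], [0.0, 0.2], [-0.4, 0.0]])
targets = 10.0 + features @ true_coef + rng.normal(scale=0.1, size=(200, 2))

coef = fit_linear_baseline(features, targets)
pred = predict_linear(coef, features)

# Averaging over every element of the (N, 2) error array divides by 2N.
rmse = float(np.sqrt(np.mean((targets - pred) ** 2)))
print(f"training RMSE on synthetic data: {rmse:.3f}")
```

The same fit-then-score pattern would carry over to the tree-based and neural models, with only the fitting step swapped out.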

# Introduction
#### (Copied from competition)
https://www.kaggle.com/competitions/nfl-big-data-bowl-2026-prediction/data

The downfield pass is the crown jewel of American sports. When the ball is in the air, anything can happen, like a touchdown, an interception, or a contested catch. The uncertainty and the importance of the outcome of these plays is what helps keep audiences on the edge of their seats.

The 2026 Big Data Bowl is designed to help the National Football League better understand player movement during the pass play, starting with when the ball is thrown and ending when the ball is either caught or ruled incomplete. For the offensive team, this means focusing on the targeted receiver, whose job is to move towards the ball landing location in order to complete a catch. For the defensive team, who could have several players moving towards the ball, their jobs are to both prevent the offensive player from making a catch, while also going for the ball themselves. This year's Big Data Bowl asks our fans to help track the movement of these players.

In the Prediction Competition of the Big Data Bowl, participants are tasked with predicting player movement with the ball in the air. Specifically, the NFL is sharing data before the ball is thrown (including the Next Gen Stats tracking data), and stopping the play the moment the quarterback releases the ball. In addition to the pre-pass tracking data, we are providing participants with which offensive player was targeted (i.e., the targeted receiver) and the landing location of the pass.

Using the information above, participants should generate prediction models for player movement during the frames when the ball is in the air. The most accurate algorithms will be those whose output most closely matches the eventual player movement of each player.

#### Competition specifics

- In the NFL's tracking data, there are 10 frames per second. As a result, if a ball is in the air for 2.5 seconds, there will be 25 frames of location data to predict.

- Quick passes (less than half a second), deflected passes, and throwaway passes are dropped from the competition.

- Evaluation for the training data is based on historical data. Evaluation for the leaderboard is based on data that hasn't happened yet. Specifically, we will be doing a live leaderboard covering the last five weeks of the 2025 NFL season.
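
To make the frame arithmetic concrete, here is a minimal sketch that converts air time to a frame count and extrapolates a player's position at constant velocity. This is not the competition's method, only a naive reference point. It assumes that `dir` is measured in degrees clockwise from the positive y-axis, which is not stated in the text above; the function names are hypothetical.

```python
import math

FRAMES_PER_SECOND = 10  # from the competition description

def frames_in_air(seconds):
    """Number of frames to predict for a pass that is airborne for `seconds`."""
    return round(seconds * FRAMES_PER_SECOND)

def constant_velocity_path(x, y, s, dir_deg, num_frames):
    """Extrapolate (x, y) per future frame, holding speed and heading fixed.

    Assumption: dir_deg is measured clockwise from the positive y-axis, so
    the x-component uses sin and the y-component uses cos.
    """
    theta = math.radians(dir_deg)
    vx, vy = s * math.sin(theta), s * math.cos(theta)
    dt = 1.0 / FRAMES_PER_SECOND
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, num_frames + 1)]

n = frames_in_air(2.5)
path = constant_velocity_path(x=50.0, y=20.0, s=5.0, dir_deg=90.0, num_frames=n)
print(n, path[-1])
```

With a 2.5 second pass this yields 25 frames, and a player moving at 5 yards/second along the long axis ends up 12.5 yards downfield.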

# Evaluation

Submissions are evaluated using the Root Mean Squared Error between the predicted and the observed target.

### Root Mean Squared Error (RMSE)

The Root Mean Squared Error (RMSE) for 2D coordinates is given by:

$$
RMSE = \sqrt{\frac{1}{2N} \sum_{i=1}^{N} \left[ (x_{true,i} - x_{pred,i})^2 + (y_{true,i} - y_{pred,i})^2 \right]}
$$
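
A direct translation of the formula into NumPy might look like the sketch below, where N is taken to be the number of predicted (x, y) points. The function name `rmse_2d` is hypothetical and this is not the official scoring code.

```python
import numpy as np

def rmse_2d(x_true, y_true, x_pred, y_pred):
    """RMSE over N (x, y) points, averaging over the 2N coordinate errors."""
    x_true, y_true = np.asarray(x_true, float), np.asarray(y_true, float)
    x_pred, y_pred = np.asarray(x_pred, float), np.asarray(y_pred, float)
    squared = (x_true - x_pred) ** 2 + (y_true - y_pred) ** 2
    return float(np.sqrt(squared.sum() / (2 * squared.size)))

# One point is off by (3, 4) and the other is exact: sqrt(25 / 4) = 2.5
print(rmse_2d([0.0, 1.0], [0.0, 1.0], [3.0, 1.0], [4.0, 1.0]))
```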



In [12]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [6]:
# Read each weekly input file, then concatenate once (cheaper than growing a DataFrame in a loop)
input_files = sorted(f for f in os.listdir('train') if f.startswith('input'))
df = pd.concat(
    [pd.read_csv(os.path.join('train', f)) for f in input_files],
    ignore_index=True,
)

df.shape

(4880579, 23)

In [13]:
df.head()

| | game_id | play_id | player_to_predict | nfl_id | frame_id | play_direction | absolute_yardline_number | player_name | player_height | player_weight | player_birth_date | player_position | player_side | player_role | x | y | s | a | dir | o | num_frames_output | ball_land_x | ball_land_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023090700 | 101 | False | 54527 | 1 | right | 42 | Bryan Cook | 6-1 | 210 | 1999-09-07 | FS | Defense | Defensive Coverage | 52.33 | 36.94 | 0.09 | 0.39 | 322.4 | 238.24 | 21 | 63.259998 | -0.22 |
| 1 | 2023090700 | 101 | False | 54527 | 2 | right | 42 | Bryan Cook | 6-1 | 210 | 1999-09-07 | FS | Defense | Defensive Coverage | 52.33 | 36.94 | 0.04 | 0.61 | 200.89 | 236.05 | 21 | 63.259998 | -0.22 |
| 2 | 2023090700 | 101 | False | 54527 | 3 | right | 42 | Bryan Cook | 6-1 | 210 | 1999-09-07 | FS | Defense | Defensive Coverage | 52.33 | 36.93 | 0.12 | 0.73 | 147.55 | 240.6 | 21 | 63.259998 | -0.22 |
| 3 | 2023090700 | 101 | False | 54527 | 4 | right | 42 | Bryan Cook | 6-1 | 210 | 1999-09-07 | FS | Defense | Defensive Coverage | 52.35 | 36.92 | 0.23 | 0.81 | 131.4 | 244.25 | 21 | 63.259998 | -0.22 |
| 4 | 2023090700 | 101 | False | 54527 | 5 | right | 42 | Bryan Cook | 6-1 | 210 | 1999-09-07 | FS | Defense | Defensive Coverage | 52.37 | 36.9 | 0.35 | 0.82 | 123.26 | 244.25 | 21 | 63.259998 | -0.22 |


In [7]:
# Read each weekly output file, then concatenate once (same pattern as the input files)
output_files = sorted(f for f in os.listdir('train') if f.startswith('output'))
output_df = pd.concat(
    [pd.read_csv(os.path.join('train', f)) for f in output_files],
    ignore_index=True,
)

output_df.shape

In [14]:
output_df.head()

| | game_id | play_id | nfl_id | frame_id | x | y |
|---|---|---|---|---|---|---|
| 0 | 2023090700 | 101 | 46137 | 1 | 56.22 | 17.28 |
| 1 | 2023090700 | 101 | 46137 | 2 | 56.63 | 16.88 |
| 2 | 2023090700 | 101 | 46137 | 3 | 57.06 | 16.46 |
| 3 | 2023090700 | 101 | 46137 | 4 | 57.48 | 16.02 |
| 4 | 2023090700 | 101 | 46137 | 5 | 57.91 | 15.56 |


In [15]:
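# Rename the output coordinates so they do not collide with the input x/y columns
output_df = output_df.rename(columns={'x': 'x_target', 'y': 'y_target'})

# Inner join on the four key columns. Caution: per the data dictionary, frame_id restarts at 1
# in each file type, so this pairs pre-throw frame k with post-throw frame k.
df = df.merge(output_df, on=["game_id", "play_id", "nfl_id", "frame_id"], how="inner")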

df.shape

(560426, 25)

In [16]:
df.head()

| | game_id | play_id | player_to_predict | nfl_id | frame_id | play_direction | absolute_yardline_number | player_name | player_height | player_weight | player_birth_date | player_position | player_side | player_role | x | y | s | a | dir | o | num_frames_output | ball_land_x | ball_land_y | x_target | y_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023090700 | 101 | True | 46137 | 1 | right | 42 | Justin Reid | 6-1 | 204 | 1997-02-15 | SS | Defense | Defensive Coverage | 51.32 | 20.69 | 0.31 | 0.49 | 79.43 | 267.68 | 21 | 63.259998 | -0.22 | 56.22 | 17.28 |
| 1 | 2023090700 | 101 | True | 46137 | 2 | right | 42 | Justin Reid | 6-1 | 204 | 1997-02-15 | SS | Defense | Defensive Coverage | 51.35 | 20.66 | 0.36 | 0.74 | 118.07 | 268.66 | 21 | 63.259998 | -0.22 | 56.63 | 16.88 |
| 2 | 2023090700 | 101 | True | 46137 | 3 | right | 42 | Justin Reid | 6-1 | 204 | 1997-02-15 | SS | Defense | Defensive Coverage | 51.39 | 20.63 | 0.44 | 0.76 | 130.89 | 269.78 | 21 | 63.259998 | -0.22 | 57.06 | 16.46 |
| 3 | 2023090700 | 101 | True | 46137 | 4 | right | 42 | Justin Reid | 6-1 | 204 | 1997-02-15 | SS | Defense | Defensive Coverage | 51.43 | 20.61 | 0.48 | 0.62 | 134.5 | 269.78 | 21 | 63.259998 | -0.22 | 57.48 | 16.02 |
| 4 | 2023090700 | 101 | True | 46137 | 5 | right | 42 | Justin Reid | 6-1 | 204 | 1997-02-15 | SS | Defense | Defensive Coverage | 51.48 | 20.58 | 0.54 | 0.44 | 129.79 | 269.06 | 21 | 63.259998 | -0.22 | 57.91 | 15.56 |


![dataset](images/dataset.png)

### Summary of Data
#### (Copied from competition)

This section provides an overview of each dataset in the **2026 NFL Big Data Bowl**, including key variables for joining and a description of all fields.  
The tracking data is provided by the **NFL Next Gen Stats** team.

---

### Files
#### `train/`
Contains both the **input** and **output** CSV files used for training.

---

#### **Input Files:** `input_2023_w[01-18].csv`
The input data contains tracking data **before the pass is thrown**.

| Variable | Description |
|-----------|--------------|
| `game_id` | Game identifier (unique, numeric) |
| `play_id` | Play identifier (not unique across games, numeric) |
| `player_to_predict` | Whether or not the x/y prediction for this player will be scored (boolean) |
| `nfl_id` | Player identifier (unique across players, numeric) |
| `frame_id` | Frame identifier for each play/type, starting at 1 for each game_id/play_id/file type (numeric) |
| `play_direction` | Direction that the offense is moving (left or right) |
| `absolute_yardline_number` | Distance from end zone for possession team (numeric) |
| `player_name` | Player name (text) |
| `player_height` | Player height (ft-in) |
| `player_weight` | Player weight (lbs) |
| `player_birth_date` | Birth date (yyyy-mm-dd) |
| `player_position` | Player’s position (role on the field) |
| `player_side` | Team player is on (Offense or Defense) |
| `player_role` | Role player has on play (Defensive Coverage, Targeted Receiver, Passer, or Other Route Runner) |
| `x` | Player position along the long axis of the field (0–120 yards) |
| `y` | Player position along the short axis of the field (0–53.3 yards) |
| `s` | Speed in yards/second |
| `a` | Acceleration in yards/second² |
| `o` | Orientation of player (degrees) |
| `dir` | Angle of player motion (degrees) |
| `num_frames_output` | Number of frames to predict in output data for the given game_id/play_id/nfl_id (numeric) |
| `ball_land_x` | Ball landing position along the long axis (0–120 yards) |
| `ball_land_y` | Ball landing position along the short axis (0–53.3 yards) |
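
Two of the fields above usually need light preprocessing before modeling: `player_height` is a `ft-in` string, and `play_direction` means the same route can appear mirrored. The sketch below shows one possible way to handle both. The helper names are hypothetical, and rotating left-moving plays by 180 degrees is an assumed normalization choice, not something the competition prescribes.

```python
FIELD_LENGTH = 120.0  # yards, long axis (from the data dictionary)
FIELD_WIDTH = 53.3    # yards, short axis (from the data dictionary)

def height_to_inches(height):
    """Convert a 'ft-in' string such as '6-1' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

def flip_to_right(x, y, dir_deg, play_direction):
    """Rotate a left-moving play by 180 degrees so the offense always moves right."""
    if play_direction == "right":
        return x, y, dir_deg
    # A half-turn about the field center mirrors both axes and reverses the heading.
    return FIELD_LENGTH - x, FIELD_WIDTH - y, (dir_deg + 180.0) % 360.0

print(height_to_inches("6-1"))
print(flip_to_right(52.33, 36.94, 322.4, "left"))
```

If this normalization were used, `o`, `ball_land_x`, and `ball_land_y` would need the same treatment, and predictions would have to be flipped back before scoring.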

---

#### **Output Files:** `output_2023_w[01-18].csv`
The output data contains tracking data **after the pass is thrown**.

| Variable | Description |
|-----------|--------------|
| `game_id` | Game identifier (unique, numeric) |
| `play_id` | Play identifier (not unique across games, numeric) |
| `nfl_id` | Player identifier (unique across players, numeric) |
| `frame_id` | Frame identifier for each play/type, starting at 1 for each game_id/play_id/file type (numeric) |
| `x` | Player position along the long axis of the field (target to predict) |
| `y` | Player position along the short axis of the field (target to predict) |

---

### Key Join Columns
The datasets can be merged using the following keys:

| Key | Description |
|------|-------------|
| `game_id` | Unique identifier for each game |
| `play_id` | Unique identifier for each play within a game |
| `nfl_id` | Unique player identifier |
| `frame_id` | Frame number representing the moment in the play |
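
As a small illustration of joining on these keys, the sketch below merges two toy frames and uses the `validate` and `indicator` options of `DataFrame.merge` to surface duplicate keys and unmatched rows. The values are made up and only mimic the key layout.

```python
import pandas as pd

KEYS = ["game_id", "play_id", "nfl_id", "frame_id"]

# Toy frames that mimic the key layout only; values are made up.
inputs = pd.DataFrame({
    "game_id": [1, 1, 1], "play_id": [10, 10, 10],
    "nfl_id": [7, 7, 7], "frame_id": [1, 2, 3],
    "x": [50.0, 50.1, 50.2],
})
outputs = pd.DataFrame({
    "game_id": [1, 1, 1, 1], "play_id": [10, 10, 10, 10],
    "nfl_id": [7, 7, 7, 7], "frame_id": [1, 2, 3, 4],
    "x_target": [55.0, 55.4, 55.8, 56.2],
})

# validate raises pandas.errors.MergeError if either side repeats a key;
# indicator adds a _merge column showing which side each row came from.
merged = inputs.merge(outputs, on=KEYS, how="outer",
                      validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())
```

Here the fourth output frame has no input counterpart, so it appears as `right_only`; an inner join would silently drop it.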

---


In [17]:
df.isna().sum()

game_id                     0
play_id                     0
player_to_predict           0
nfl_id                      0
frame_id                    0
play_direction              0
absolute_yardline_number    0
player_name                 0
player_height               0
player_weight               0
player_birth_date           0
player_position             0
player_side                 0
player_role                 0
x                           0
y                           0
s                           0
a                           0
dir                         0
o                           0
num_frames_output           0
ball_land_x                 0
ball_land_y                 0
x_target                    0
y_target                    0
dtype: int64

In [19]:
df.describe()

| | game_id | play_id | nfl_id | frame_id | absolute_yardline_number | player_weight | x | y | s | a | dir | o | num_frames_output | ball_land_x | ball_land_y | x_target | y_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 | 560426.0 |
| mean | 2023156000.0 | 2218.36949 | 49649.474284 | 7.749182 | 60.395003 | 208.459748 | 60.322071 | 26.726465 | 1.724788 | 1.760978 | 180.802075 | 181.090839 | 14.602608 | 60.293877 | 26.567883 | 60.311604 | 26.604846 |
| std | 202245.3 | 1246.748063 | 5087.692138 | 5.54651 | 23.094026 | 21.723023 | 23.305106 | 10.139143 | 1.790043 | 1.478373 | 100.430689 | 94.965759 | 6.659607 | 27.361703 | 16.482688 | 25.247203 | 13.428138 |
| min | 2023091000.0 | 54.0 | 30842.0 | 1.0 | 11.0 | 153.0 | 3.79 | 1.54 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | -5.26 | -3.91 | 0.02 | 0.33 |
| 25% | 2023101000.0 | 1183.0 | 45395.0 | 4.0 | 41.0 | 193.0 | 42.51 | 18.44 | 0.31 | 0.54 | 91.24 | 90.77 | 10.0 | 41.900002 | 11.24 | 43.08 | 14.92 |
| 50% | 2023111000.0 | 2204.0 | 52423.0 | 7.0 | 60.0 | 203.0 | 60.11 | 26.62 | 1.13 | 1.43 | 179.5 | 181.91 | 13.0 | 60.189999 | 26.389999 | 60.13 | 26.42 |
| 75% | 2023121000.0 | 3279.0 | 54496.0 | 10.0 | 79.0 | 220.0 | 77.99 | 35.07 | 2.6 | 2.69 | 270.61 | 271.06 | 18.0 | 78.900002 | 41.599998 | 77.34 | 38.33 |
| max | 2024011000.0 | 5258.0 | 56673.0 | 40.0 | 109.0 | 358.0 | 116.33 | 50.45 | 9.88 | 16.75 | 360.0 | 360.0 | 94.0 | 125.849998 | 57.330002 | 120.83 | 53.72 |
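
The summary shows `ball_land_x` ranging from -5.26 to about 125.85 and `ball_land_y` from -3.91 to about 57.33, which extends past the 0 to 120 and 0 to 53.3 yard ranges listed in the data dictionary. Whether these are legitimate out-of-bounds landings or data errors is not established here. A minimal sketch for counting such values follows; the helper name `count_out_of_bounds` is hypothetical and the toy rows are illustrative only.

```python
import pandas as pd

# Field bounds in yards, taken from the data dictionary.
BOUNDS = {
    "x": (0.0, 120.0), "ball_land_x": (0.0, 120.0),
    "y": (0.0, 53.3), "ball_land_y": (0.0, 53.3),
}

def count_out_of_bounds(frame, bounds=BOUNDS):
    """Return, per column, how many values fall outside the documented range.

    Note: missing values also count as out of range, since between() is False for NaN.
    """
    counts = {}
    for column, (low, high) in bounds.items():
        if column in frame.columns:
            counts[column] = int((~frame[column].between(low, high)).sum())
    return pd.Series(counts, dtype="int64")

# Toy rows assembled from values that appear in the outputs above.
toy = pd.DataFrame({
    "x": [52.33, 60.11],
    "y": [36.94, 26.62],
    "ball_land_x": [-5.26, 125.85],
    "ball_land_y": [-0.22, 57.33],
})
print(count_out_of_bounds(toy))
```

Running the same function on the merged `df` would give the actual counts per column, which could then inform whether to clip, keep, or drop those rows.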


In [20]:
keys = ["game_id", "play_id", "nfl_id", "frame_id"]
# Note: df is already the inner-merged frame, so these are the input keys that matched an output row.
input_keys = df[keys]
output_keys = output_df[keys]

# Left anti-join: keep output rows whose key has no match on the input side.
merged_keys = output_keys.merge(input_keys, on=keys, how="left", indicator=True)
missing_from_input = (merged_keys["_merge"] == "left_only").sum()
print("Rows in output not in input:", missing_from_input)

Rows in output not in input: 2510


In [21]:
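keys = ["game_id", "play_id", "nfl_id", "frame_id"]
# keep=False marks every row that shares its key with another row, not just the repeats
dupes = df[df.duplicated(subset=keys, keep=False)]
print("Total duplicate rows:", len(dupes))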

Total duplicate rows: 0


Series([], dtype: int64)