# NFL Big Data Bowl 2026 - Comprehensive EDA

Welcome to our step-by-step guide to tackling the NFL Big Data Bowl 2026 Prediction Competition. Our first objective is to build a solid foundation by performing a comprehensive Exploratory Data Analysis (EDA). A thorough EDA is the most critical step for understanding the data's nuances, identifying potential features, and uncovering insights that will guide our modeling strategy.

This notebook will focus exclusively on EDA. We will not build any models or create a submission file yet. Instead, we will systematically explore the dataset to understand:
- The structure and format of the input and output data.
- The distributions of key variables.
- Relationships between different features.
- The nature of player movement.
- Any potential data quality issues or quirks.

Let's begin!

# **Step 1: Setup and Initial Data Loading**

**Objective:**
Our first step is to set up our environment by importing the necessary libraries and loading all the weekly training data into two primary pandas DataFrames: one for the `input` data (features) and one for the `output` data (our targets). This will give us a complete view of the training set. We will then perform a quick inspection to understand the size, data types, and presence of any missing values.

**Why we are doing this:**
Before we can analyze anything, we need to have all the data in one place. The data is provided in weekly chunks, so we must combine them. The initial inspection (`.info()`, `.head()`, `.isnull().sum()`) is a fundamental sanity check to get a first impression of the dataset's structure and quality.

In [None]:
import pandas as pd
import numpy as np
import os
import glob
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set pandas display options for better viewing
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

# --- Define the base path to your data ---
BASE_DIR = '/kaggle/input/nfl-big-data-bowl-2026-prediction'

# --- Load all weekly input files ---
print("Loading input files...")
# sorted() makes the file order (and therefore the row order) reproducible across runs
input_files = sorted(glob.glob(os.path.join(BASE_DIR, 'train/input_*.csv')))
input_df = pd.concat([pd.read_csv(f) for f in input_files], ignore_index=True)
print(f"Loaded {len(input_files)} input files.")

In [None]:
# --- Load all weekly output files ---
print("\nLoading output files...")
# sorted() makes the file order (and therefore the row order) reproducible across runs
output_files = sorted(glob.glob(os.path.join(BASE_DIR, 'train/output_*.csv')))
output_df = pd.concat([pd.read_csv(f) for f in output_files], ignore_index=True)
print(f"Loaded {len(output_files)} output files.")

In [None]:
# --- Display basic information about the input DataFrame ---
print("\n" + "="*50)
print("--- Input DataFrame Info ---")
input_df.info(show_counts=True)
print("\n" + "="*50)

In [None]:
# --- Display basic information about the output DataFrame ---
print("\n--- Output DataFrame Info ---")
output_df.info(show_counts=True)
print("\n" + "="*50)

In [None]:
# --- Display the first few rows of the input DataFrame ---
print("\n--- Input DataFrame Head ---")
display(input_df.head())
print("\n" + "="*50)

In [None]:
# --- Check for missing values in the input DataFrame ---
print("\n--- Input DataFrame Missing Values ---")
print(input_df.isnull().sum())
print("\n" + "="*50)

In [None]:
# --- Check for missing values in the output DataFrame ---
print("\n--- Output DataFrame Missing Values ---")
print(output_df.isnull().sum())

## **Observations from Step 1**

1.  **Data Size:** We're working with a large dataset. The `input_df` has nearly 4.9 million rows, representing player snapshots before the pass. The `output_df` has about 563,000 rows, which are the target positions we need to predict.
2.  **No Missing Values:** This is great news! The data appears to be very clean, with no missing values in any of the columns in either DataFrame. This saves us a lot of time on data imputation.
3.  **Data Types Need Cleaning:** We have several columns with the `object` data type that should be converted for easier analysis and modeling:
    *   `player_height` is in a 'feet-inches' format (e.g., '6-2'). We need to convert this to a single numerical value like inches.
    *   `player_birth_date` is a string. We can use this to calculate the player's age at the time of the play, which could be a useful feature.
4.  **Key Columns:**
    *   `player_to_predict`: This boolean is critical. It tells us exactly which players' predictions will be scored. We'll need to focus our model on these players.
    *   `num_frames_output`: This tells us how many future frames (rows in `output_df`) we need to predict for each player in a given play. This value will vary from play to play.
    *   `play_direction`: This is a crucial piece of information. Player coordinates (`x`, `y`) and angles (`o`, `dir`) are given relative to the field. To compare different plays, we need to standardize them so that the offense is always moving in the same direction.
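
The relationship between `num_frames_output` and the rows of `output_df` can be checked directly. The sketch below is a minimal consistency check on made-up toy data, not on the real files. It assumes that only players with `player_to_predict == True` appear in the output data; the helper name `check_frame_counts` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for input_df / output_df (values are invented for illustration).
toy_input = pd.DataFrame({
    "game_id": [1, 1, 1, 1],
    "play_id": [10, 10, 10, 10],
    "nfl_id": [100, 100, 200, 200],
    "frame_id": [1, 2, 1, 2],
    "player_to_predict": [True, True, False, False],
    "num_frames_output": [3, 3, 3, 3],
})
toy_output = pd.DataFrame({
    "game_id": [1, 1, 1],
    "play_id": [10, 10, 10],
    "nfl_id": [100, 100, 100],
    "frame_id": [1, 2, 3],
    "x": [50.0, 50.5, 51.0],
    "y": [20.0, 20.2, 20.4],
})

keys = ["game_id", "play_id", "nfl_id"]

def check_frame_counts(input_df, output_df):
    """Return player-plays whose output row count differs from num_frames_output."""
    # ASSUMPTION: only players flagged player_to_predict have rows in output_df.
    expected = (
        input_df[input_df["player_to_predict"]]
        .drop_duplicates(subset=keys)[keys + ["num_frames_output"]]
    )
    actual = output_df.groupby(keys).size().rename("n_output_rows").reset_index()
    # An outer merge keeps player-plays that are missing on either side;
    # their NaN counts compare as unequal, so they are reported too.
    merged = expected.merge(actual, on=keys, how="outer")
    return merged[merged["num_frames_output"] != merged["n_output_rows"]]

mismatches = check_frame_counts(toy_input, toy_output)
print(f"{len(mismatches)} player-plays with mismatched frame counts")
```

Running the same helper on the real `input_df` and `output_df` would show whether the assumption holds before any modeling depends on it.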

# **Step 2: Data Cleaning and Feature Engineering**

**Objective:**
In this step, we will perform essential data cleaning and create some basic new features. Specifically, we will:
1.  Convert `player_height` from 'feet-inches' format to a numerical value (total inches).
2.  Calculate `player_age` using their `player_birth_date` and the `game_id`.
3.  Standardize the coordinate system. We will ensure that for every play, the offense is moving from left to right (towards a higher `x` value).

In [None]:
def clean_and_feature_engineer(df):
    """
    Applies cleaning and feature engineering steps to the input dataframe.
    """
    # 1. Convert player_height to inches
    # Example: '6-2' becomes 6*12 + 2 = 74 inches
    df['player_height_inches'] = df['player_height'].apply(lambda x: int(x.split('-')[0]) * 12 + int(x.split('-')[1]))

    # 2. Calculate player_age
    # Extract year from game_id and birth year from player_birth_date
    df['game_year'] = df['game_id'].astype(str).str[:4].astype(int)
    df['birth_year'] = pd.to_datetime(df['player_birth_date']).dt.year
    df['player_age'] = df['game_year'] - df['birth_year']

    # 3. Standardize play direction
    # If play_direction is 'left', we flip the x, y, dir, and o coordinates
    df_left = df[df['play_direction'] == 'left'].copy()

    # Flip x coordinate (120 yard field)
    df_left['x'] = 120.0 - df_left['x']
    df_left['ball_land_x'] = 120.0 - df_left['ball_land_x']

    # Flip y coordinate (53.3 yard field width)
    df_left['y'] = 53.3 - df_left['y']
    df_left['ball_land_y'] = 53.3 - df_left['ball_land_y']

    # Flip orientation and direction angles
    df_left['o'] = (df_left['o'] + 180) % 360
    df_left['dir'] = (df_left['dir'] + 180) % 360
    
    # Get the right-direction plays
    df_right = df[df['play_direction'] == 'right'].copy()
    
    # Concatenate back together
    processed_df = pd.concat([df_left, df_right], ignore_index=True)

    # Drop intermediate and original columns that are no longer needed
    processed_df = processed_df.drop(columns=[
        'player_height', 'player_birth_date', 'game_year', 'birth_year', 'play_direction'
    ])
    
    return processed_df
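
As an aside, the height and age steps can also be written without a per-row `lambda`. The following is a sketch on toy data rather than a replacement for the function above. It assumes that the first eight digits of `game_id` encode the game date as YYYYMMDD, which is a stronger assumption than the four-digit year used above and should be verified against the real data. The column name `player_age_years` is hypothetical.

```python
import pandas as pd

# Toy rows with invented values, using the same column names as the input data.
toy = pd.DataFrame({
    "game_id": [2023090700, 2023091001],
    "player_height": ["6-2", "5-11"],
    "player_birth_date": ["1995-09-17", "1999-01-03"],
})

# Height: split once into two integer columns instead of splitting twice per row.
parts = toy["player_height"].str.split("-", expand=True).astype(int)
toy["player_height_inches"] = parts[0] * 12 + parts[1]

# Age: ASSUMPTION - the first eight digits of game_id are the game date (YYYYMMDD).
game_date = pd.to_datetime(toy["game_id"].astype(str).str[:8], format="%Y%m%d")
birth_date = pd.to_datetime(toy["player_birth_date"])
toy["player_age_years"] = (game_date - birth_date).dt.days / 365.25

print(toy[["player_height_inches", "player_age_years"]])
```

A fractional age distinguishes players born early and late in the same year, which the year-difference version cannot do.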

In [None]:
print("Applying cleaning and feature engineering to input_df...")
input_df_processed = clean_and_feature_engineer(input_df)

# We also need to standardize the output_df for later analysis/merging
# We only need play_direction from input_df to do this.
# Let's merge it onto output_df first.
play_info = input_df[['game_id', 'play_id', 'play_direction']].drop_duplicates()
output_df_merged = pd.merge(output_df, play_info, on=['game_id', 'play_id'])

# Now apply the coordinate standardization to output_df
output_left = output_df_merged[output_df_merged['play_direction'] == 'left'].copy()
output_left['x'] = 120.0 - output_left['x']
output_left['y'] = 53.3 - output_left['y']

output_right = output_df_merged[output_df_merged['play_direction'] == 'right'].copy()

output_df_processed = pd.concat([output_left, output_right], ignore_index=True)
output_df_processed = output_df_processed.drop(columns=['play_direction'])


print("Processing complete.")
print("\n" + "="*50)

# --- Display info of the new processed input DataFrame ---
print("\n--- Processed Input DataFrame Info ---")
input_df_processed.info(show_counts=True)
print("\n" + "="*50)

# --- Display the first few rows of the processed input DataFrame ---
print("\n--- Processed Input DataFrame Head ---")
# Sort by the key columns: the concat in clean_and_feature_engineer placed all originally 'left' plays first
display(input_df_processed.sort_values(by=['game_id', 'play_id', 'nfl_id', 'frame_id']).head())
print("\n" + "="*50)

# --- Display the first few rows of the processed output DataFrame ---
print("\n--- Processed Output DataFrame Head ---")
display(output_df_processed.sort_values(by=['game_id', 'play_id', 'nfl_id', 'frame_id']).head())

## **Observations from Step 2**

1.  **Successful Transformation:** The code ran successfully. The original `player_height` and `player_birth_date` columns are gone, replaced by the numerical `player_height_inches` and `player_age`.
2.  **Standardized Coordinates:** The most important change is the standardization of coordinates. We no longer have the `play_direction` column because all plays are now represented as if the offense is moving from left (`x=0`) to right (`x=120`). This is a critical step that makes it much easier for a model to learn consistent movement patterns.
3.  **Data Type Check:** The `info()` output confirms our new columns are numerical (`int64`), and the memory usage is slightly reduced. The remaining `object` dtypes (`player_name`, `player_position`, etc.) are categorical and are fine for now.

We now have a clean, standardized dataset ready for analysis. The next logical step is to understand the characteristics of our data by looking at individual variables.
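
One way to gain confidence in the standardization is a round-trip test: flipping a play by 180 degrees twice should return the original values. The sketch below applies the same arithmetic as Step 2 to a two-row toy frame. The helper `flip_left_plays` and the constant names are hypothetical, and it uses a boolean mask rather than the split-and-concat approach above, so the row order is preserved.

```python
import numpy as np
import pandas as pd

FIELD_LENGTH = 120.0  # yards, the value used in Step 2
FIELD_WIDTH = 53.3    # yards, the value used in Step 2

def flip_left_plays(df):
    """Rotate 'left' plays by 180 degrees so every play runs toward higher x."""
    out = df.copy()
    left = out["play_direction"] == "left"
    out.loc[left, "x"] = FIELD_LENGTH - out.loc[left, "x"]
    out.loc[left, "y"] = FIELD_WIDTH - out.loc[left, "y"]
    out.loc[left, "dir"] = (out.loc[left, "dir"] + 180) % 360
    out.loc[left, "o"] = (out.loc[left, "o"] + 180) % 360
    return out

# Invented values: one 'left' play and one 'right' play.
toy = pd.DataFrame({
    "play_direction": ["left", "right"],
    "x": [80.0, 40.0],
    "y": [10.0, 10.0],
    "dir": [270.0, 90.0],
    "o": [260.0, 100.0],
})

flipped = flip_left_plays(toy)
round_trip = flip_left_plays(flipped)

cols = ["x", "y", "dir", "o"]
print(flipped)
print("round trip matches original:",
      np.allclose(round_trip[cols].to_numpy(), toy[cols].to_numpy()))
```

In the flipped frame the 'left' row moves to `x = 40` with `dir = 90`, so it now looks like a play heading toward higher `x`, while the 'right' row is untouched.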

# **Step 3: Univariate Analysis - Understanding the Distributions**

**Objective:**
Now that our data is clean, we will analyze the distributions of key individual variables (univariate analysis). This involves creating visualizations and summary statistics to understand the range, central tendency, and spread of our most important features. We will look at three groups of features:
1.  **Play-Level Features:** How long are the plays we need to predict (`num_frames_output`)?
2.  **Static Player Features:** What are the distributions of player age, height, weight, and their roles on the field?
3.  **Dynamic Player Features:** What are the typical speeds, accelerations, and angles of movement for players *before* the pass is thrown?

**Why we are doing this:**
This step is fundamental to building intuition about the dataset. It helps us answer basic questions like: "What is a typical pass air time?", "Are there outlier values in player speed or acceleration?", "Which player roles are most common?". This knowledge is crucial for feature engineering and for sanity-checking our model's predictions later. For example, if we know the maximum acceleration in the training data is 10 yards/s², a model predicting 25 would be highly suspect.
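
The sanity check described above can be sketched as follows. It computes the speed implied by consecutive predicted positions and flags steps that exceed a chosen limit. The function name, the toy trajectory, and the threshold are all illustrative assumptions; the frame rate of 10 frames per second is the one this dataset uses.

```python
import numpy as np

FRAME_RATE_HZ = 10.0  # tracking frames per second

def implied_speeds(xs, ys, frame_rate_hz=FRAME_RATE_HZ):
    """Speed in yards/s implied by each pair of consecutive positions."""
    step_lengths = np.hypot(np.diff(xs), np.diff(ys))
    return step_lengths * frame_rate_hz

# Hypothetical predicted trajectory; the last step jumps 3 yards in one frame.
pred_x = np.array([50.0, 50.8, 51.6, 54.6])
pred_y = np.array([20.0, 20.0, 20.0, 20.0])

MAX_PLAUSIBLE_SPEED = 13.0  # yards/s, an illustrative threshold, not a dataset constant
speeds = implied_speeds(pred_x, pred_y)
suspect = speeds > MAX_PLAUSIBLE_SPEED
print(suspect)
```

In practice the threshold would come from the summary statistics computed later in this step rather than from a hard-coded number.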

In [None]:
# --- Set up plotting style ---
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (20, 15)

# --- 1. Play-Level Analysis ---
# We only need one row per play to analyze play-level features
play_level_df = input_df_processed[['game_id', 'play_id', 'num_frames_output']].drop_duplicates()

plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
sns.histplot(play_level_df['num_frames_output'], bins=30, kde=True)
plt.title('Distribution of Prediction Length (num_frames_output)')
plt.xlabel('Number of Frames to Predict')
plt.ylabel('Frequency (Number of Plays)')

plt.subplot(1, 2, 2)
sns.boxplot(x=play_level_df['num_frames_output'])
plt.title('Boxplot of Prediction Length (num_frames_output)')
plt.xlabel('Number of Frames to Predict')
plt.show()

print("--- Summary Statistics for num_frames_output ---")
print(play_level_df['num_frames_output'].describe())
print("\n" + "="*50 + "\n")

In [None]:
# --- 2. Static Player-Level Analysis ---
# We only need one row per player for static features
player_level_df = input_df_processed[['nfl_id', 'player_age', 'player_height_inches', 'player_weight', 'player_position', 'player_role']].drop_duplicates(subset=['nfl_id'])

fig, axes = plt.subplots(2, 3, figsize=(20, 12))
fig.suptitle('Static Player Feature Distributions', fontsize=16)

# Player Age
sns.histplot(ax=axes[0, 0], data=player_level_df, x='player_age', bins=20, kde=True)
axes[0, 0].set_title('Distribution of Player Age')

# Player Height
sns.histplot(ax=axes[0, 1], data=player_level_df, x='player_height_inches', bins=20, kde=True)
axes[0, 1].set_title('Distribution of Player Height (inches)')

# Player Weight
sns.histplot(ax=axes[0, 2], data=player_level_df, x='player_weight', bins=20, kde=True)
axes[0, 2].set_title('Distribution of Player Weight (lbs)')

# Player Position (Top 15)
top_positions = player_level_df['player_position'].value_counts().nlargest(15).index
sns.countplot(ax=axes[1, 0], data=player_level_df, y='player_position', order=top_positions)
axes[1, 0].set_title('Top 15 Player Positions')
axes[1, 0].tick_params(axis='y', labelsize=8)


# Player Role
sns.countplot(ax=axes[1, 1], data=input_df_processed.drop_duplicates(subset=['game_id', 'play_id', 'nfl_id']), y='player_role')
axes[1, 1].set_title('Player Roles per Play')


# Player Side
sns.countplot(ax=axes[1, 2], data=input_df_processed.drop_duplicates(subset=['game_id', 'play_id', 'nfl_id']), y='player_side')
axes[1, 2].set_title('Player Side per Play')


plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

In [None]:
# --- 3. Dynamic Player-Level Analysis (using all frames) ---
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
fig.suptitle('Dynamic Player Feature Distributions (Pre-Pass)', fontsize=16)

# Speed (s)
sns.histplot(ax=axes[0], data=input_df_processed, x='s', bins=50)
axes[0].set_title('Distribution of Speed (s)')

# Acceleration (a)
sns.histplot(ax=axes[1], data=input_df_processed, x='a', bins=50)
axes[1].set_title('Distribution of Acceleration (a)')

# Direction (dir)
sns.histplot(ax=axes[2], data=input_df_processed, x='dir', bins=50)
axes[2].set_title('Distribution of Motion Angle (dir)')

# Orientation (o)
sns.histplot(ax=axes[3], data=input_df_processed, x='o', bins=50)
axes[3].set_title('Distribution of Orientation Angle (o)')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

print("--- Summary Statistics for Dynamic Features ---")
print(input_df_processed[['s', 'a', 'dir', 'o']].describe())

## **Observations from Step 3**

1.  **Prediction Length (`num_frames_output`):**
    *   The number of frames we need to predict is typically short. The vast majority of plays require predicting between 5 and 20 frames (0.5 to 2 seconds of game time, since it's 10 frames/sec).
    *   The average is around 11 frames. The distribution is right-skewed, with a few significant outliers where the ball is in the air for a very long time (up to 94 frames, or 9.4 seconds). These long plays are rare but could be challenging for a model.

2.  **Static Player Features:**
    *   **Age, Height, Weight:** These features look as expected for a population of athletes, although not all of them are bell-shaped. Age is roughly bell-shaped and peaks around 25-26. Height is bimodal (two peaks), likely reflecting the different body types of different positions (e.g., shorter cornerbacks vs. taller wide receivers/linemen). Weight is slightly right-skewed, with some very heavy players (likely offensive/defensive linemen).
    *   **Roles & Sides:** Defensive players (`Defensive Coverage`) are the most common role in the dataset, followed by non-targeted receivers (`Other Route Runner`). There are roughly equal numbers of offensive and defensive players per play, which makes sense. Crucially, the number of `Targeted Receiver` and `Passer` roles is much smaller, which is logical as there's only one of each per play. Our model will need to learn how these different roles behave.

3.  **Dynamic Player Features (Pre-Pass):**
    *   **Speed (`s`) & Acceleration (`a`):** Both distributions are heavily skewed towards zero. This is logical because the data includes frames where players are stationary at the line of scrimmage before the play starts. The maximum speed before the pass is ~12.5 yards/sec, and max acceleration is ~17.1 yards/sec². These are important physical limits to keep in mind.
    *   **Angles (`dir` & `o`):** These distributions are fascinating. They show clear peaks, indicating preferred directions of movement and orientation.
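        *   `dir` (direction of motion) has a strong peak around 90-100 degrees and a smaller peak around 270 degrees. Angles are measured clockwise from the positive y-axis, so 90 degrees points towards increasing `x` (downfield after our coordinate standardization) and 270 degrees points towards decreasing `x` (back towards the offense's own end zone).
        *   `o` (orientation) shows two massive peaks around 90 degrees and 270 degrees. Under the same convention, these are the two directions along the length of the field, not the sidelines (facing a sideline would appear near 0 or 180 degrees). A likely reading is that offensive players face downfield (about 90 degrees) while defenders face back towards them (about 270 degrees), as they do when lined up before the snap. The standardization we performed in Step 2 is what makes these patterns so clear.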
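
One caveat when summarizing `dir` and `o`: they are circular quantities, so the arithmetic mean reported by `.describe()` can be misleading near the 0/360 wrap-around. A minimal sketch of a circular mean is shown below; the function name is hypothetical and the two angles are invented.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    radians = np.radians(np.asarray(angles_deg, dtype=float))
    # Average the unit vectors, then convert the mean vector back to an angle.
    mean_sin = np.sin(radians).mean()
    mean_cos = np.cos(radians).mean()
    return np.degrees(np.arctan2(mean_sin, mean_cos)) % 360

angles = [350.0, 10.0]  # two headings that are only 20 degrees apart
arithmetic = np.mean(angles)
circular = circular_mean_deg(angles)
print("arithmetic mean:", arithmetic)  # 180.0, which points the opposite way
print("circular mean:", circular)      # within rounding error of 0 (equivalently 360)
```

The same idea applies to any aggregation of angles, for example an average heading per player role.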

We've analyzed the variables in isolation. Now, we need to understand how they relate to each other and, most importantly, how they relate to the target variables (`x`, `y` in the output data).

# **Step 4: Bivariate and Initial Target Analysis**

**Objective:**
In this step, we will start exploring the relationships between variables and get our first look at the target data. We will:
1.  Merge the input and output data to create a complete picture of a player's trajectory for a given play.
2.  Visualize some of these trajectories to get a feel for player movement.
3.  Calculate the target variables: `dx` (change in x) and `dy` (change in y) from the player's last known position. This is often easier to predict than the absolute final coordinates.
4.  Analyze the relationship between the initial state (speed, direction) and the final displacement.

**Why we are doing this:**
The goal of the competition is to predict future `x` and `y`. Simply looking at the input data isn't enough. We need to connect the "before" (input) with the "after" (output). By visualizing trajectories, we can see the patterns we're trying to model. By calculating `dx` and `dy`, we reframe the problem from "predict where they will be" to "predict how they will move from their current spot," which is often a more stable and intuitive target for a machine learning model.

In [None]:
# --- 1. Merge input and output data ---
# We need the last known position from the input data for each player on each play.
last_input_frame = input_df_processed.loc[input_df_processed.groupby(['game_id', 'play_id', 'nfl_id'])['frame_id'].idxmax()]

# Rename columns to avoid clashes after merging
last_input_frame = last_input_frame.rename(columns={
    'x': 'x_start', 'y': 'y_start',
    's': 's_start', 'a': 'a_start',
    'dir': 'dir_start', 'o': 'o_start',
    'frame_id': 'frame_id_start'
})

# Select necessary columns
last_input_frame_subset = last_input_frame[[
    'game_id', 'play_id', 'nfl_id', 'player_to_predict', 'player_role', 
    'x_start', 'y_start', 's_start', 'a_start', 'dir_start', 'o_start',
    'ball_land_x', 'ball_land_y'
]]

# Merge with the processed output data
# The output data contains the target x, y for each future frame
full_trajectory_df = pd.merge(
    output_df_processed, 
    last_input_frame_subset, 
    on=['game_id', 'play_id', 'nfl_id']
)

print("--- Merged Trajectory DataFrame Head ---")
display(full_trajectory_df.head())
print("\n" + "="*50 + "\n")

In [None]:
# --- 2. Visualize some sample trajectories ---
# Let's pick a few sample plays to visualize
sample_plays = full_trajectory_df[full_trajectory_df['player_to_predict'] == True][['game_id', 'play_id']].drop_duplicates().head(4)

print("--- Visualizing 4 Sample Player Trajectories ---")

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
fig.suptitle('Sample Player Trajectories (Post-Pass)', fontsize=16)

for i, (idx, play) in enumerate(sample_plays.iterrows()):
    game_id, play_id = play['game_id'], play['play_id']
    
    play_data = full_trajectory_df[(full_trajectory_df['game_id'] == game_id) & (full_trajectory_df['play_id'] == play_id) & (full_trajectory_df['player_to_predict'] == True)]
    
    sns.scatterplot(ax=axes[i], data=play_data, x='x', y='y', hue='player_role', style='nfl_id', s=100, palette='viridis')
    
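    # Plot starting positions: one row per player, and no `style` argument,
    # because a style mapping would replace the 'X' marker with its own markers
    start_data = play_data.drop_duplicates(subset=['nfl_id'])
    sns.scatterplot(ax=axes[i], data=start_data, x='x_start', y='y_start', hue='player_role', s=200, marker='X', legend=False, palette='viridis')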

    # Plot ball landing spot
    ball_land_pos = play_data[['ball_land_x', 'ball_land_y']].iloc[0]
    axes[i].scatter(ball_land_pos['ball_land_x'], ball_land_pos['ball_land_y'], color='red', marker='*', s=300, label='Ball Landing Spot')

    axes[i].set_title(f'GameID: {game_id}, PlayID: {play_id}')
    axes[i].set_xlabel('X Coordinate (Yards)')
    axes[i].set_ylabel('Y Coordinate (Yards)')
    axes[i].legend(loc='best', fontsize='small')
    axes[i].set_xlim(0, 120)
    axes[i].set_ylim(0, 53.3)
    axes[i].grid(True)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

In [None]:
# --- 3. Calculate displacement targets (dx, dy) ---
full_trajectory_df['dx'] = full_trajectory_df['x'] - full_trajectory_df['x_start']
full_trajectory_df['dy'] = full_trajectory_df['y'] - full_trajectory_df['y_start']


# --- 4. Analyze relationship between start state and final displacement ---
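# We'll look at the very last frame of the prediction for each player.
# .copy() gives an independent DataFrame, so the columns we add to it later
# are not written to a slice of full_trajectory_df.
final_frame_df = full_trajectory_df.loc[full_trajectory_df.groupby(['game_id', 'play_id', 'nfl_id'])['frame_id'].idxmax()].copy()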

print("\n--- Relationship between Initial State and Final Displacement ---")

fig, axes = plt.subplots(1, 3, figsize=(20, 6))
fig.suptitle('Initial State vs. Final Displacement', fontsize=16)

sns.scatterplot(ax=axes[0], data=final_frame_df, x='s_start', y='dx', alpha=0.3, hue='player_role')
axes[0].set_title('Initial Speed vs. Final X-Displacement')

sns.scatterplot(ax=axes[1], data=final_frame_df, x='dir_start', y='dx', alpha=0.3, hue='player_role')
axes[1].set_title('Initial Direction vs. Final X-Displacement')

# This is a bit more complex, let's plot dx vs dy and color by role
sns.scatterplot(ax=axes[2], data=final_frame_df, x='dx', y='dy', hue='player_role', alpha=0.5)
axes[2].set_title('Final Displacement (dx vs. dy) by Player Role')
axes[2].axhline(0, color='grey', linestyle='--')
axes[2].axvline(0, color='grey', linestyle='--')


plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

## **Observations from Step 4**

1.  **Trajectory Visualizations are Key:**
    *   In the four sample plays, players move towards the ball's landing spot. Both offensive receivers and defensive players adjust their paths to converge on the red star.
    *   This suggests that the most important features we can engineer will describe the relationship between a player's initial state and the ball's landing location.
    *   The paths are curved, not straight lines. This means a simple model that just extrapolates the initial velocity (`s_start`, `dir_start`) will be inaccurate. Models will need to account for changes in direction and acceleration.

2.  **Initial State Matters:**
    *   **Speed & Direction:** The scatter plots show a strong, intuitive relationship between a player's initial speed/direction and where they end up. Higher initial speed leads to greater displacement. An initial direction pointing downfield (around 90 degrees) leads to a large positive `dx`. These are definitely strong baseline features.
    *   **Role-Based Differences:** We can see subtle but clear differences in the displacement patterns between the `Targeted Receiver` and `Defensive Coverage` roles. Receivers (blue) appear to have a more pronounced and direct downfield movement (`dx`), while defenders (orange) have a wider, more reactive pattern. Any good model must account for the player's role.

3.  **Reframing the Target:**
    *   Predicting the displacement (`dx`, `dy`) seems more promising than predicting the absolute final coordinates (`x`, `y`). The displacement is directly influenced by the player's actions and the physics of their movement, making it a more natural target for a model.
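
To make the displacement framing concrete, the sketch below computes the `dx` and `dy` a player would have if they kept their last known speed and heading, then converts them back to absolute coordinates. This is the naive extrapolation mentioned above, shown only as a reference point. It assumes that output frame `k` occurs `k / 10` seconds after the last input frame and that `dir` is measured clockwise from the positive y-axis; the function name and the numbers are hypothetical.

```python
import numpy as np

FRAME_RATE_HZ = 10.0  # tracking frames per second

def constant_velocity_displacement(s_start, dir_start_deg, n_frames):
    """dx, dy for frames 1..n_frames if speed and heading never change."""
    # ASSUMPTION: output frame k is k / FRAME_RATE_HZ seconds after the last input frame.
    t = np.arange(1, n_frames + 1) / FRAME_RATE_HZ
    theta = np.radians(dir_start_deg)
    # With angles clockwise from +y, the x component uses sin and the y component uses cos.
    dx = s_start * np.sin(theta) * t
    dy = s_start * np.cos(theta) * t
    return dx, dy

# A player at (40, 25) moving at 8 yards/s straight downfield (dir = 90).
dx, dy = constant_velocity_displacement(s_start=8.0, dir_start_deg=90.0, n_frames=5)
x_pred = 40.0 + dx  # absolute coordinates are recovered by adding the start position
y_pred = 25.0 + dy
print(np.round(x_pred, 2), np.round(y_pred, 2))
```

Any model that predicts `dx` and `dy` can be converted to the required `x` and `y` in the same way, by adding `x_start` and `y_start`.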

Now, let's build on that crucial insight about the ball's landing spot. We need to create explicit features that quantify this relationship.

# **Step 5: Engineering Ball-Centric Features**

**Objective:**
Based on our observations, the single most important factor influencing a player's movement is the ball's landing spot. In this step, we will engineer features that explicitly describe the relationship between each player at the moment the pass is thrown and the target location of the ball. We will then analyze how these new features relate to player movement.

Specifically, we will create:
1.  `dist_to_ball`: The initial Euclidean distance from the player to the ball landing spot.
2.  `angle_to_ball`: The angle of the vector from the player's starting position to the ball's landing spot.
3.  `diff_dir_ball_angle`: The difference between the player's current direction of motion (`dir_start`) and the direct-line angle to the ball. This feature quantifies *how much a player needs to turn* to head directly for the ball.

**Why we are doing this:**
These features translate the core physics of the problem into numbers that a machine learning model can easily use. Instead of the model having to implicitly learn "oh, the player is at (x1, y1) and the ball is at (x2, y2), I should subtract them," we are giving it the answer directly. `dist_to_ball` tells the model *how far* the player needs to go, and `diff_dir_ball_angle` tells it *how much they need to change direction*. These are likely to be among the most powerful predictive features we can create.
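
Because the angle conventions are easy to get wrong, it helps to test the conversion on vectors whose answers are known before applying it to the full DataFrame. The sketch below does that for the four cardinal directions and for one invented player. The helper names `bearing_deg` and `turn_needed_deg` are hypothetical; the convention assumed is 0 degrees along the positive y-axis, increasing clockwise.

```python
import numpy as np

def bearing_deg(dx, dy):
    """Angle of the vector (dx, dy): 0 = +y, 90 = +x, increasing clockwise."""
    return (450.0 - np.degrees(np.arctan2(dy, dx))) % 360.0

def turn_needed_deg(dir_deg, target_deg):
    """Smallest absolute angle between two headings, in [0, 180]."""
    diff = np.abs(dir_deg - target_deg) % 360.0
    return np.minimum(diff, 360.0 - diff)

# Cardinal directions: +x, +y, -x, -y should map to 90, 0, 270, 180.
cardinal = [bearing_deg(1, 0), bearing_deg(0, 1), bearing_deg(-1, 0), bearing_deg(0, -1)]
print(np.round(cardinal, 6))

# Invented example: player at (50, 20), ball landing at (60, 30), moving with dir = 90.
vec_x, vec_y = 60.0 - 50.0, 30.0 - 20.0
dist_to_ball = np.hypot(vec_x, vec_y)
angle_to_ball = bearing_deg(vec_x, vec_y)
turn = turn_needed_deg(90.0, angle_to_ball)
print(round(dist_to_ball, 2), round(angle_to_ball, 2), round(turn, 2))
```

The wrap-around case matters as well: headings of 350 and 10 degrees are only 20 degrees apart, which `turn_needed_deg(350.0, 10.0)` reflects.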

In [None]:
# --- 1. Engineer Ball-Centric Features on our final_frame_df ---
# We'll use the final_frame_df from the previous step as it has one row per player-play
# and contains the final displacement (dx, dy).

# Calculate vector from player to ball
final_frame_df['vec_x_to_ball'] = final_frame_df['ball_land_x'] - final_frame_df['x_start']
final_frame_df['vec_y_to_ball'] = final_frame_df['ball_land_y'] - final_frame_df['y_start']

# Calculate distance to ball
final_frame_df['dist_to_ball'] = np.sqrt(final_frame_df['vec_x_to_ball']**2 + final_frame_df['vec_y_to_ball']**2)

# Calculate angle to ball. arctan2 gives quadrant-correct angles, here in degrees in (-180, 180],
# measured counter-clockwise from the positive X-axis.
final_frame_df['angle_to_ball'] = np.degrees(np.arctan2(final_frame_df['vec_y_to_ball'], final_frame_df['vec_x_to_ball']))
# Note: The NFL's 'dir' is clockwise from the positive Y-axis, so we convert to that convention:
# dir_aligned = (450 - arctan2_degrees) % 360, which lies in [0, 360).
final_frame_df['angle_to_ball'] = (450 - final_frame_df['angle_to_ball']) % 360


# Calculate the difference between player's motion direction and angle to ball
# This tells us how much they need to turn.
angle_diff = np.abs(final_frame_df['dir_start'] - final_frame_df['angle_to_ball'])
# Take the shorter way around the circle, so the result is in [0, 180]
final_frame_df['diff_dir_ball_angle'] = np.minimum(angle_diff, 360 - angle_diff)


print("--- Ball-Centric Features DataFrame Head ---")
display(final_frame_df[['player_role', 'dist_to_ball', 'angle_to_ball', 'dir_start', 'diff_dir_ball_angle']].head())
print("\n" + "="*50 + "\n")

In [None]:
# --- 2. Analyze the new features ---

print("--- Analyzing New Ball-Centric Features ---")

fig, axes = plt.subplots(2, 2, figsize=(18, 12))
axes = axes.flatten()
fig.suptitle('Analysis of Ball-Centric Features', fontsize=16)

# Distribution of dist_to_ball by player role
sns.histplot(ax=axes[0], data=final_frame_df, x='dist_to_ball', hue='player_role', multiple='stack', bins=40)
axes[0].set_title('Initial Distance to Ball Landing Spot')

# Distribution of diff_dir_ball_angle by player role
sns.histplot(ax=axes[1], data=final_frame_df, x='diff_dir_ball_angle', hue='player_role', multiple='stack', bins=40)
axes[1].set_title('Angle Difference (How much to turn)')

# How does the required turn angle affect final displacement?
sns.scatterplot(ax=axes[2], data=final_frame_df, x='diff_dir_ball_angle', y='dx', hue='player_role', alpha=0.3)
axes[2].set_title('Required Turn vs. Final X-Displacement')

# How does initial distance affect total displacement?
# Let's calculate total displacement magnitude
final_frame_df['total_displacement'] = np.sqrt(final_frame_df['dx']**2 + final_frame_df['dy']**2)
sns.scatterplot(ax=axes[3], data=final_frame_df, x='dist_to_ball', y='total_displacement', hue='player_role', alpha=0.3)
axes[3].set_title('Initial Distance to Ball vs. Total Displacement')
# Add a y=x line for reference. Points on this line have a net displacement equal to their initial distance from the landing spot.
axes[3].plot([0, 50], [0, 50], color='red', linestyle='--', label='y=x line')
axes[3].legend()


plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

## **Observations from Step 5**

1.  **Distance to Ball is a Primary Driver:**
    *   The `dist_to_ball` histogram shows that most players are within 20 yards of the ball's landing spot when the pass is thrown. Targeted receivers (blue) are, on average, closer to the ball than defenders (orange), which makes perfect sense.
    *   The "Initial Distance to Ball vs. Total Displacement" plot is the most telling chart yet. There is an almost linear relationship between the initial distance and the player's net displacement (the straight-line distance between their start and final positions, not the length of the path they ran). The points cluster around the red `y=x` line, which means players often cover *almost* the entire distance to the ball's landing spot. This suggests `dist_to_ball` is a top-tier feature.

2.  **The "Required Turn" is Crucial:**
    *   The `diff_dir_ball_angle` histogram shows that players (especially receivers) are often already moving in a direction that is closely aligned with the ball's path (a small angle difference). A large portion of players need to turn less than 45 degrees.
    *   The "Required Turn vs. Final X-Displacement" plot is very insightful. When the required turn is small (close to 0), the player's downfield displacement (`dx`) is high. As the required turn increases towards 180 degrees (meaning they are running in the opposite direction), their downfield displacement becomes much lower or even negative. This feature captures the player's momentum and the effort required to change it.

3.  **Role-Based Behavior is Confirmed Again:**
    *   In the displacement plot, Targeted Receivers (blue) consistently sit on or just below the `y=x` line. Their net displacement nearly matches their initial distance to the landing spot, which is consistent with running almost directly towards it.
    *   Defensive players (orange) are more scattered and often lie below the line. This implies their path is less direct; they might be reacting to the receiver, trying to cut off an angle, or simply not able to cover the distance as efficiently. The model must learn this different behavior.
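
The visual impression of points sitting on or below the `y=x` line can be turned into a number. The sketch below computes a ratio of net displacement to initial distance and summarizes it per role. The column name `closing_ratio` is hypothetical and the values are invented toy data, so the output says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values, using the column names from this step.
toy = pd.DataFrame({
    "player_role": ["Targeted Receiver", "Targeted Receiver",
                    "Defensive Coverage", "Defensive Coverage"],
    "dist_to_ball": [10.0, 20.0, 10.0, 20.0],
    "total_displacement": [9.0, 19.0, 5.0, 12.0],
})

# Ratio near 1 means the player covered about as much ground as separated them
# from the landing spot. Zero distances are replaced by NaN to avoid division by zero.
safe_dist = toy["dist_to_ball"].replace(0.0, np.nan)
toy["closing_ratio"] = toy["total_displacement"] / safe_dist

summary = toy.groupby("player_role")["closing_ratio"].median()
print(summary)
```

On the real `final_frame_df`, a per-role median of this ratio would give a single figure to compare against the scatter plot.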

This completes the exploratory work. We have cleaned the data, standardized the coordinate system, and engineered a set of physically intuitive features that appear to drive player movement.

The final step summarizes these findings and outlines the baseline model that we will build in the *next* notebook.

# **Step 6: EDA Summary and Next Steps**

**Objective:**
To formally conclude our Exploratory Data Analysis by summarizing our key findings and outlining a clear plan for our first modeling notebook.

**Why we are doing this:**
This section serves as the conclusion to our investigation and the bridge to our next phase: prediction. By summarizing what we've learned, we solidify our understanding and create a strategic roadmap for building an effective model.

### **Key Findings from this EDA:**

1.  **Data Quality:** The dataset is high-quality with no missing values, requiring only minor type conversions.
2.  **Coordinate Standardization is Mandatory:** Flipping the coordinates for `play_direction == 'left'` was a critical preprocessing step that made all plays comparable.
3.  **The Target:** Predicting displacement (`dx`, `dy`) from the player's last known position is a more intuitive and likely more stable target than predicting absolute coordinates (`x`, `y`).
4.  **Player Role is a Key Differentiator:** Offensive and defensive players, particularly the `Targeted Receiver`, exhibit distinct movement patterns. This must be used as a feature in any model.
5.  **Ball Landing Spot is Paramount:** A player's movement is overwhelmingly dictated by their spatial relationship to the ball's landing spot.
6.  **Most Powerful Features:** Our engineered features (`dist_to_ball`, `angle_to_ball`, `diff_dir_ball_angle`), combined with the player's initial state (`s_start`, `dir_start`, `player_role`), are likely to be the strongest predictors.

### **Plan for the First Modeling Notebook:**

1.  **Problem Framing:** Predict the `x` and `y` coordinates for each player for each future `frame_id`.
2.  **Feature Engineering:** Re-create the key features identified in this EDA notebook:
    *   Player age, height (in inches).
    *   Standardized coordinates.
    *   Features relative to the ball: `dist_to_ball`, `angle_to_ball`, `diff_dir_ball_angle`.
    *   Initial state features from the last input frame: `x_start`, `y_start`, `s_start`, `a_start`, `dir_start`.
3.  **Model Choice:** A simple, powerful, and fast gradient boosting model like LightGBM is an excellent starting point. We will train two separate models: one to predict `x` and one to predict `y`.
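4.  **Validation Strategy:** A simple train-validation split grouped by `game_id` (or by the `(game_id, play_id)` pair, since `play_id` alone is not a unique key), so that every frame of a play falls on the same side of the split and the model is evaluated on unseen plays.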
5.  **Baseline Model:** We will create a very simple physics-based baseline (e.g., "player continues at their last known speed and direction") to benchmark our machine learning model's performance against.
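
The validation idea in item 4 can be sketched with plain NumPy and pandas: hold out whole games so that no game contributes rows to both sides of the split. The function name `split_by_game`, its parameters, and the toy data are hypothetical, and this is one simple option, not the only reasonable scheme.

```python
import numpy as np
import pandas as pd

def split_by_game(df, valid_fraction=0.2, seed=0):
    """Hold out whole games so no game appears in both train and validation."""
    game_ids = df["game_id"].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(game_ids)
    n_valid = max(1, int(round(len(shuffled) * valid_fraction)))
    valid_games = shuffled[:n_valid]
    is_valid = df["game_id"].isin(valid_games)
    return df[~is_valid], df[is_valid]

# Toy frame: five games with two rows each (invented values).
toy = pd.DataFrame({
    "game_id": np.repeat([1, 2, 3, 4, 5], 2),
    "play_id": [10, 11] * 5,
})

train_df, valid_df = split_by_game(toy, valid_fraction=0.2, seed=0)
print(sorted(train_df["game_id"].unique()), sorted(valid_df["game_id"].unique()))
```

Splitting at the game level is stricter than splitting at the play level, because plays from the same game share players and conditions.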

This concludes our comprehensive EDA. We are now well-prepared to move on to the modeling stage.