# NFL Big Data Bowl 2025 Submission

## Track: Metric Track

### Team Information

- **Collaborators:** Mr Oscar Yanez-Feijoo.
- **Track:** Metric Track

## Executive Summary


Type: Supervised Learning.

Problem: Classification.

Objective: Predict categorical outcomes (e.g., passResult) based on pre-snap player tracking features.

This notebook uses NFL player tracking data to predict post-snap outcomes from pre-snap player behavior. Per-player features such as average speed, acceleration, and orientation are aggregated over the frames before the snap and used to predict whether the play ends in a completion, an incompletion, or another pass result.

The models are intended to help teams recognize pre-snap tendencies and support their decision-making.

---

## Methodology

### Data Loading and Preprocessing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

sns.set_style("whitegrid")
pd.set_option('display.max_columns', 100)

### Load Core Datasets

In [2]:
# Load core datasets
games = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2025/games.csv')
plays = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2025/plays.csv')
players = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2025/players.csv')

### Load and Process Tracking Data

In [3]:
# Load and process tracking data in chunks to avoid memory overload
tracking_files = [f"/kaggle/input/nfl-big-data-bowl-2025/tracking_week_{i}.csv" for i in range(1, 10)]

processed_chunks = []

for file in tracking_files:
    chunk = pd.read_csv(file)
    
    # Print column names for debugging
    print(f"Columns in file {file}:", list(chunk.columns))
    
    # Filter pre-snap data for the current chunk
    chunk_pre_snap = chunk[chunk['frameType'] == 'BEFORE_SNAP']
    
    # Add required features
    chunk_pre_snap = chunk_pre_snap.assign(
        max_speed=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['s'].transform('max'),
        speed_std=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['s'].transform('std'),
        acceleration_std=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['a'].transform('std'),
        direction_changes=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['dir'].transform(
            lambda x: x.diff().abs().gt(30).sum()
        ),
        orientation_std=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['o'].transform('std')
    )
    
    processed_chunks.append(chunk_pre_snap)

# Combine all processed chunks
tracking_pre_snap = pd.concat(processed_chunks, ignore_index=True)

Columns in file /kaggle/input/nfl-big-data-bowl-2025/tracking_week_1.csv: ['gameId', 'playId', 'nflId', 'displayName', 'frameId', 'frameType', 'time', 'jerseyNumber', 'club', 'playDirection', 'x', 'y', 's', 'a', 'dis', 'o', 'dir', 'event']
Columns in file /kaggle/input/nfl-big-data-bowl-2025/tracking_week_2.csv: ['gameId', 'playId', 'nflId', 'displayName', 'frameId', 'frameType', 'time', 'jerseyNumber', 'club', 'playDirection', 'x', 'y', 's', 'a', 'dis', 'o', 'dir', 'event']
Columns in file /kaggle/input/nfl-big-data-bowl-2025/tracking_week_3.csv: ['gameId', 'playId', 'nflId', 'displayName', 'frameId', 'frameType', 'time', 'jerseyNumber', 'club', 'playDirection', 'x', 'y', 's', 'a', 'dis', 'o', 'dir', 'event']
Columns in file /kaggle/input/nfl-big-data-bowl-2025/tracking_week_4.csv: ['gameId', 'playId', 'nflId', 'displayName', 'frameId', 'frameType', 'time', 'jerseyNumber', 'club', 'playDirection', 'x', 'y', 's', 'a', 'dis', 'o', 'dir', 'event']
Columns in file /kaggle/input/nfl-big-da

In [None]:
# Inspect a sample of the dataset
print("Sample rows of tracking_pre_snap:")
print(tracking_pre_snap.sample(5))  # Random 5 rows instead of head or tail

# Print dataset information with limited rows
print("\nDataset information:")
tracking_pre_snap.info()

print("\nStatistical summary of numeric columns (limited columns):")
print(tracking_pre_snap.describe(include=[np.number]).loc[:, :'s'])  # Modify to limit specific columns if needed

# Check for duplicate rows
duplicates = tracking_pre_snap.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")



### Merge Ball Position Data

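#### Attach Ball Position to Player Rows

In [None]:
# Ball rows (assumed to be labelled displayName == 'football') are joined back onto the
# player rows per frame, so the players' own 'x' and 'y' remain available.
is_ball = tracking_pre_snap['displayName'] == 'football'
ball = tracking_pre_snap.loc[is_ball, ['gameId', 'playId', 'frameId', 'x', 'y']]
ball = ball.rename(columns={'x': 'x_ball', 'y': 'y_ball'})
tracking_pre_snap = tracking_pre_snap[~is_ball].merge(ball, on=['gameId', 'playId', 'frameId'], how='left')
print("Ball position columns ('x_ball', 'y_ball') added to tracking_pre_snap.")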

#### Calculate Distance from Ball

In [None]:
if 'x_ball' in tracking_pre_snap.columns and 'y_ball' in tracking_pre_snap.columns:
    tracking_pre_snap = tracking_pre_snap.assign(
        distance_from_ball=np.sqrt(
            (tracking_pre_snap['x'] - tracking_pre_snap['x_ball'])**2 +
            (tracking_pre_snap['y'] - tracking_pre_snap['y_ball'])**2
        )
    )
else:
    print("Ball position columns are still missing. Skipping distance_from_ball calculation.")

#### Spread Statistics

In [None]:
spread_summary = tracking_pre_snap.groupby(['gameId', 'playId']).agg(
    horizontal_spread=('x', lambda x: x.max() - x.min()),
    vertical_spread=('y', lambda y: y.max() - y.min())
).reset_index()

### Optimize Data Types

In [None]:
# Optimize data types
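def optimize_data_types(df):
    """Downcast numeric columns to save memory without overflowing ID columns."""
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype('float32')
    for col in df.select_dtypes(include=['int64']).columns:
        # int16 cannot hold gameId or nflId values and would corrupt the merge keys;
        # let pandas pick the smallest integer type that fits every value
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df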

tracking_pre_snap = optimize_data_types(tracking_pre_snap)
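
The integer width matters for the ID columns used as merge keys. The sketch below uses made-up ID values of roughly the same magnitude as `gameId` and `nflId` (an assumption for illustration only) to show that a blind cast to `int16` wraps around, whereas `pd.to_numeric(..., downcast='integer')` picks the smallest type that still holds every value.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs, chosen only to resemble gameId / nflId / playId magnitudes
ids = pd.DataFrame({
    "gameId": np.array([2022090800, 2022091100], dtype="int64"),
    "nflId": np.array([35472, 54500], dtype="int64"),
    "playId": np.array([56, 4021], dtype="int64"),
})

# A blind NumPy cast silently wraps values that do not fit in 16 bits
wrapped = ids["gameId"].to_numpy().astype("int16")

# downcast='integer' chooses the smallest signed integer dtype that fits every value
safe = ids.apply(pd.to_numeric, downcast="integer")

print("int16 cast:", wrapped.tolist())
print("safe dtypes:", safe.dtypes.astype(str).to_dict())
```

Comparing the two printed lines shows that the wrapped values no longer match the originals, which is enough to make a later merge on `gameId` return no rows.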

### Aggregate Features


#### Pre-Snap Statistics


In [None]:
pre_snap_summary = tracking_pre_snap.groupby(['gameId', 'playId', 'nflId']).agg(
    avg_speed=('s', 'mean'),
    avg_acceleration=('a', 'mean'),
    total_distance=('dis', 'sum'),
    avg_orientation=('o', 'mean'),
    avg_direction=('dir', 'mean'),
    max_speed=('max_speed', 'max'),
    speed_std=('speed_std', 'mean'),
    acceleration_std=('acceleration_std', 'mean'),
    distance_from_ball=('distance_from_ball', 'mean'),
    direction_changes=('direction_changes', 'mean'),
    orientation_std=('orientation_std', 'mean')
).reset_index()

#### Team-Level Features

In [None]:
# The tracking files name the team column 'club' (see the column listing above), not 'team'
if 'club' in tracking_pre_snap.columns:
    team_summary = tracking_pre_snap.groupby(['gameId', 'playId', 'club']).agg(
        team_avg_speed=('s', 'mean'),
        team_max_speed=('s', 'max')
    ).reset_index()
else:
    print("Column 'club' is missing in the tracking data. Skipping team-level features.")

#### Merge Features

In [None]:
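merged_data = pd.merge(pre_snap_summary, plays, on=['gameId', 'playId'], how='inner')
merged_data = pd.merge(merged_data, players, on='nflId', how='left')
if 'team_summary' in locals():
    # Join each player to their own team's statistics. Matching on gameId and playId
    # alone would pair every player with both teams and duplicate the rows.
    team_col = 'club' if 'club' in team_summary.columns else 'team'
    player_team = tracking_pre_snap[['gameId', 'playId', 'nflId', team_col]].drop_duplicates()
    merged_data = pd.merge(merged_data, player_team, on=['gameId', 'playId', 'nflId'], how='left')
    merged_data = pd.merge(merged_data, team_summary, on=['gameId', 'playId', team_col], how='left')
merged_data = pd.merge(merged_data, spread_summary, on=['gameId', 'playId'], how='inner')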

merged_data = optimize_data_types(merged_data)
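
Joining a per-player table to a per-team table on the play keys alone pairs every player with both teams. A minimal sketch on toy data (all values invented; `club` is the team column name that appears in the tracking file column listing) contrasts that join with one that also matches on the player's own team.

```python
import pandas as pd

# Toy per-player summary for one play: two players, one per team (made-up values)
player_summary = pd.DataFrame({
    "gameId": [1, 1], "playId": [10, 10],
    "nflId": [100, 200], "club": ["KC", "BUF"],
    "avg_speed": [1.2, 0.8],
})

# Toy per-team summary for the same play
team_summary = pd.DataFrame({
    "gameId": [1, 1], "playId": [10, 10],
    "club": ["KC", "BUF"], "team_avg_speed": [1.1, 0.9],
})

# Joining on the play keys only yields one row per (player, team) pair
by_play = player_summary.merge(team_summary, on=["gameId", "playId"])

# Adding the team key keeps one row per player; validate= raises if that ever breaks
by_team = player_summary.merge(
    team_summary, on=["gameId", "playId", "club"], how="left", validate="many_to_one"
)

print(len(by_play), len(by_team))  # 4 2
```

Checking the row count before and after each merge is a cheap way to catch this kind of silent duplication.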

### Feature Engineering

Key features were engineered to capture pre-snap player tendencies:
- **Average Speed (`avg_speed`)**
- **Average Acceleration (`avg_acceleration`)**
- **Total Distance (`total_distance`)**
- **Average Orientation (`avg_orientation`)**
- **Average Direction (`avg_direction`)**
- **Maximum Speed (`max_speed`)**
- **Speed Variability (`speed_std`)**
- **Acceleration Variability (`acceleration_std`)**
- **Team Average Speed (`team_avg_speed`)**
- **Team Maximum Speed (`team_max_speed`)**
- **Player Distance from Ball (`distance_from_ball`)**
- **Horizontal Spread (`horizontal_spread`)**
- **Vertical Spread (`vertical_spread`)**
- **Direction Change Frequency (`direction_changes`)**
- **Orientation Variability (`orientation_std`)**
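
`dir` and `o` are angles in degrees, so a plain frame-to-frame difference treats a move from 359° to 1° as a 358° swing. One way to handle this, sketched here as an assumption rather than as the method used in this notebook, is to wrap each difference into the range [-180, 180) before applying the 30° threshold. The helper name `count_direction_changes` and the sample headings are illustrative.

```python
import pandas as pd

def count_direction_changes(angles: pd.Series, threshold: float = 30.0) -> int:
    """Count frame-to-frame heading changes larger than `threshold` degrees."""
    diff = angles.diff()
    # Wrap differences into [-180, 180) so that 359 -> 1 counts as a 2 degree change
    wrapped = (diff + 180.0) % 360.0 - 180.0
    return int((wrapped.abs() > threshold).sum())

# Made-up headings that cross the 0/360 boundary once and turn sharply once
headings = pd.Series([350.0, 359.0, 1.0, 45.0, 40.0])

naive = int(headings.diff().abs().gt(30).sum())
circular = count_direction_changes(headings)
print(naive, circular)  # 2 1
```

The naive count includes the boundary crossing as a direction change; the wrapped version counts only the genuine 44° turn.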

### Aggregate Features

In [None]:
# Consolidated version of the steps above: reprocess each tracking file, save the
# pre-snap rows to disk, then rebuild the features. The earlier rename of 'x'/'y' is
# dropped here because tracking_pre_snap is rebuilt from the saved chunks further down.

# Initialize a list to store file paths for processed chunks
processed_file_paths = []

for i, file in enumerate(tracking_files):
    print(f"Processing file: {file}")
    chunk = pd.read_csv(file)

    # Filter pre-snap data
    chunk_pre_snap = chunk[chunk['frameType'] == 'BEFORE_SNAP']

    # Add features
    chunk_pre_snap = chunk_pre_snap.assign(
        max_speed=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['s'].transform('max'),
        speed_std=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['s'].transform('std'),
        acceleration_std=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['a'].transform('std'),
        direction_changes=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['dir'].transform(
            lambda x: x.diff().abs().gt(30).sum()
        ),
        orientation_std=chunk_pre_snap.groupby(['gameId', 'playId', 'nflId'])['o'].transform('std')
    )

    # Save processed chunk to a temporary file
    output_file = f"processed_chunk_{i}.csv"
    chunk_pre_snap.to_csv(output_file, index=False)
    processed_file_paths.append(output_file)

    print(f"Processed chunk saved to {output_file}")

# Load processed files into a single DataFrame as needed
tracking_pre_snap = pd.concat((pd.read_csv(fp) for fp in processed_file_paths), ignore_index=True)

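# Attach the ball position to every player row, then compute the distance from the ball.
# Ball rows are assumed to be labelled displayName == 'football' in the tracking files.
is_ball = tracking_pre_snap['displayName'] == 'football'
ball = tracking_pre_snap.loc[is_ball, ['gameId', 'playId', 'frameId', 'x', 'y']].rename(
    columns={'x': 'x_ball', 'y': 'y_ball'}
)
tracking_pre_snap = tracking_pre_snap[~is_ball].merge(
    ball, on=['gameId', 'playId', 'frameId'], how='left'
)
tracking_pre_snap = tracking_pre_snap.assign(
    distance_from_ball=np.sqrt(
        (tracking_pre_snap['x'] - tracking_pre_snap['x_ball'])**2 +
        (tracking_pre_snap['y'] - tracking_pre_snap['y_ball'])**2
    )
)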

# The tracking files store each player's team in the 'club' column; expose it as 'team'
if 'team' not in tracking_pre_snap.columns:
    if 'club' in tracking_pre_snap.columns:
        tracking_pre_snap = tracking_pre_snap.rename(columns={'club': 'team'})
    else:
        print("Neither 'team' nor 'club' is present in the tracking data. Unable to compute team-level features.")
# Proceed with aggregation if 'team' is successfully added
if 'team' in tracking_pre_snap.columns:
    pre_snap_summary = tracking_pre_snap.groupby(['gameId', 'playId', 'nflId']).agg(
        avg_speed=('s', 'mean'),
        avg_acceleration=('a', 'mean'),
        total_distance=('dis', 'sum'),
        avg_orientation=('o', 'mean'),
        avg_direction=('dir', 'mean'),
        max_speed=('max_speed', 'max'),
        speed_std=('speed_std', 'mean'),
        acceleration_std=('acceleration_std', 'mean'),
        distance_from_ball=('distance_from_ball', 'mean'),
        direction_changes=('direction_changes', 'mean'),
        orientation_std=('orientation_std', 'mean')
    ).reset_index()

    team_summary = tracking_pre_snap.groupby(['gameId', 'playId', 'team']).agg(
        team_avg_speed=('s', 'mean'),
        team_max_speed=('s', 'max')
    ).reset_index()

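    merged_data = pd.merge(pre_snap_summary, plays, on=['gameId', 'playId'], how='inner')
    merged_data = pd.merge(merged_data, players, on='nflId', how='left')
    # Attach each player's own team before joining team_summary; joining on gameId and
    # playId alone would pair every player with both teams and duplicate the rows.
    player_team = tracking_pre_snap[['gameId', 'playId', 'nflId', 'team']].drop_duplicates()
    merged_data = pd.merge(merged_data, player_team, on=['gameId', 'playId', 'nflId'], how='left')
    merged_data = pd.merge(merged_data, team_summary, on=['gameId', 'playId', 'team'], how='left')
    merged_data = pd.merge(merged_data, spread_summary, on=['gameId', 'playId'], how='inner')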

    merged_data = optimize_data_types(merged_data)
else:
    print("Unable to add 'team' column. Skipping team-level aggregation.")



### Handling Imbalanced Data

In [None]:
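# Prepare data
features = ['avg_speed', 'avg_acceleration', 'total_distance', 'max_speed', 'speed_std', 
            'acceleration_std', 'distance_from_ball', 'team_avg_speed', 'team_max_speed']
target = 'passResult'

# Drop rows without a pass result (e.g. non-pass plays) or with missing feature values:
# SMOTE does not accept NaN, and a missing pass result is not a class to predict
model_data = merged_data.dropna(subset=features + [target]).copy()

# Encode target
label_encoder = LabelEncoder()
model_data[target] = label_encoder.fit_transform(model_data[target])

# Hold out a stratified test split before resampling. From here on, X and y refer to
# the held-out rows, so the accuracy figures below are not computed on training data.
X_train, X, y_train, y = train_test_split(
    model_data[features], model_data[target],
    test_size=0.2, stratify=model_data[target], random_state=42
)

# Apply SMOTE to the training split only, so no synthetic samples reach the evaluation set
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)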

### Model Training and Hyperparameter Tuning


#### Random Forest

In [None]:
# Random Forest with hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced', None]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, scoring='f1_weighted', cv=3)
grid_search.fit(X_resampled, y_resampled)

rf_model = grid_search.best_estimator_
rf_y_pred = rf_model.predict(X)
print("Random Forest Accuracy:", accuracy_score(y, rf_y_pred))

# Feature Importance Analysis
feature_importances = pd.Series(rf_model.feature_importances_, index=features).sort_values(ascending=False)
print("Feature Importances:")
print(feature_importances)

#### Gradient Boosting (XGBoost)

In [None]:
# use_label_encoder is deprecated in recent XGBoost releases and is no longer needed
xgb_model = XGBClassifier(eval_metric='mlogloss')
xgb_model.fit(X_resampled, y_resampled)
xgb_y_pred = xgb_model.predict(X)
print("XGBoost Accuracy:", accuracy_score(y, xgb_y_pred))

# Feature Importance Analysis
feature_importances_xgb = pd.Series(xgb_model.feature_importances_, index=features).sort_values(ascending=False)
print("XGBoost Feature Importances:")
print(feature_importances_xgb)

#### Support Vector Machine (SVM)

In [None]:
svm_model = SVC(kernel='rbf', probability=True)
svm_model.fit(X_resampled, y_resampled)
svm_y_pred = svm_model.predict(X)
print("SVM Accuracy:", accuracy_score(y, svm_y_pred))
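
RBF-kernel SVMs (and likewise kNN, logistic regression, and MLPs) are sensitive to feature scale, and the features here mix speeds, distances, and standard deviations. A common pattern, shown on synthetic data as a sketch rather than as part of the original pipeline, is to put a `StandardScaler` in front of the classifier with `make_pipeline` so that the scaling is learned from the training rows only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix: two columns on very different scales
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = (X_demo[:, 0] > 0).astype(int)  # label depends only on the small-scale column

# The scaler is fit inside the pipeline, so it only ever sees the rows passed to fit()
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_demo[:150], y_demo[:150])

demo_accuracy = scaled_svm.score(X_demo[150:], y_demo[150:])
print(f"Held-out accuracy on synthetic data: {demo_accuracy:.2f}")
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV` in place of the bare estimator.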

#### Neural Network

In [None]:
nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
nn_model.fit(X_resampled, y_resampled)
nn_y_pred = nn_model.predict(X)
print("Neural Network Accuracy:", accuracy_score(y, nn_y_pred))

#### Logistic Regression

In [None]:
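# Recent scikit-learn versions deprecate multi_class; the default lbfgs solver
# already fits a multinomial model for multiclass targets
lr_model = LogisticRegression(max_iter=500, random_state=42)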
lr_model.fit(X_resampled, y_resampled)
lr_y_pred = lr_model.predict(X)
print("Logistic Regression Accuracy:", accuracy_score(y, lr_y_pred))

#### k-Nearest Neighbors (kNN)

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_resampled, y_resampled)
knn_y_pred = knn_model.predict(X)
print("kNN Accuracy:", accuracy_score(y, knn_y_pred))

### Cross-Validation for All Models

In [None]:
# Cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_model, X_resampled, y_resampled, cv=5, scoring='f1_weighted')
print(f"Random Forest Cross-Validation F1 Scores: {rf_cv_scores}")
print(f"Random Forest Average F1 Score: {rf_cv_scores.mean():.4f}")

# Cross-validation for XGBoost
xgb_cv_scores = cross_val_score(xgb_model, X_resampled, y_resampled, cv=5, scoring='f1_weighted')
print(f"XGBoost Cross-Validation F1 Scores: {xgb_cv_scores}")
print(f"XGBoost Average F1 Score: {xgb_cv_scores.mean():.4f}")

# Cross-validation for SVM
svm_cv_scores = cross_val_score(svm_model, X_resampled, y_resampled, cv=5, scoring='f1_weighted')
print(f"SVM Cross-Validation F1 Scores: {svm_cv_scores}")
print(f"SVM Average F1 Score: {svm_cv_scores.mean():.4f}")

# Cross-validation for Neural Network
nn_cv_scores = cross_val_score(nn_model, X_resampled, y_resampled, cv=5, scoring='f1_weighted')
print(f"Neural Network Cross-Validation F1 Scores: {nn_cv_scores}")
print(f"Neural Network Average F1 Score: {nn_cv_scores.mean():.4f}")

# Cross-validation for Logistic Regression
lr_cv_scores = cross_val_score(lr_model, X_resampled, y_resampled, cv=5, scoring='f1_weighted')
print(f"Logistic Regression Cross-Validation F1 Scores: {lr_cv_scores}")
print(f"Logistic Regression Average F1 Score: {lr_cv_scores.mean():.4f}")
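
Each row of `merged_data` is one player on one play, while `passResult` is a property of the whole play, so an ordinary K-fold split can place players from the same play in both the training and validation folds. The following is a hedged sketch on synthetic data of how `GroupKFold` keeps all rows of a play together. The `play_key` array is a made-up stand-in for a combined `gameId`/`playId` identifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

n_plays, players_per_play = 60, 5
# One group id per play, repeated for every player row in that play
play_key = np.repeat(np.arange(n_plays), players_per_play)
# Play-level label copied onto each player row, as happens after the merge
y_demo = np.repeat(rng.integers(0, 2, n_plays), players_per_play)
X_demo = rng.normal(size=(n_plays * players_per_play, 4))

cv = GroupKFold(n_splits=5)

# Check that no play is split across training and validation rows
folds_are_disjoint = all(
    set(play_key[train_idx]).isdisjoint(play_key[val_idx])
    for train_idx, val_idx in cv.split(X_demo, y_demo, groups=play_key)
)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, groups=play_key, cv=cv, scoring="f1_weighted",
)
print(folds_are_disjoint, scores.round(3))
```

Because the synthetic features are pure noise, the scores themselves carry no meaning here; the point is only the grouping of rows by play.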

### Compare Models

In [None]:
model_accuracies = {
    'Random Forest': accuracy_score(y, rf_y_pred),
    'XGBoost': accuracy_score(y, xgb_y_pred),
    'SVM': accuracy_score(y, svm_y_pred),
    'Neural Network': accuracy_score(y, nn_y_pred),
    'Logistic Regression': accuracy_score(y, lr_y_pred),
    'kNN': accuracy_score(y, knn_y_pred)
}

for model, acc in model_accuracies.items():
    print(f"{model}: {acc:.4f}")

### Best Model Based on Cross-Validation

In [None]:
cv_scores = {
    'Random Forest': rf_cv_scores.mean(),
    'XGBoost': xgb_cv_scores.mean(),
    'SVM': svm_cv_scores.mean(),
    'Neural Network': nn_cv_scores.mean(),
    'Logistic Regression': lr_cv_scores.mean()
}

best_model_name = max(cv_scores, key=cv_scores.get)
print(f"Best model based on cross-validation: {best_model_name}")

## Results

### Visualization: Score Distribution

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(games['homeFinalScore'], kde=True, color='blue', label='Home Score', bins=20)
sns.histplot(games['visitorFinalScore'], kde=True, color='red', label='Visitor Score', bins=20)
plt.title("Score Distribution")
plt.legend()
plt.show()

### Model Evaluation


In [None]:
# Best model evaluation based on accuracy
best_model_name = max(model_accuracies, key=model_accuracies.get)
print(f"Best model: {best_model_name} with accuracy {model_accuracies[best_model_name]:.4f}")
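
Accuracy alone can look healthy on imbalanced classes while minority outcomes are rarely predicted. `classification_report` and `confusion_matrix` are already imported at the top of the notebook; the sketch below applies them to made-up labels (the class codes `C`, `I`, and `S` are illustrative) to show the per-class view.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true and predicted labels with one dominant class
y_true = ["C"] * 8 + ["I"] * 3 + ["S"] * 1
y_hat = ["C"] * 8 + ["C"] * 2 + ["I"] * 1 + ["C"] * 1

labels = ["C", "I", "S"]
acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat, labels=labels)

print("Accuracy:", acc)  # 0.75
print(cm)                # rows are true classes, columns are predicted classes
# zero_division=0 avoids warnings for classes that are never predicted
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))
```

In this toy case the accuracy is 0.75 even though the `S` class is never predicted, which the confusion matrix and per-class recall make visible.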

## Discussion

- **Key Insights:**
  - Average speed and acceleration before the snap appear to be among the stronger predictors of pass outcomes.
  - Total distance traveled before the snap (for example, from pre-snap motion) correlates with certain offensive strategies.
  - Cross-validation gives a more reliable estimate of model performance than a single fit, and the feature importances indicate which features are worth refining.

- **Limitations:**
  - Current analysis does not include positional context, which could improve model accuracy.
  - Handling imbalanced data remains a challenge despite using SMOTE.
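
One possible way to add positional context, sketched under the assumption that the player table exposes a `position` column, is to one-hot encode it and append the indicator columns to the feature matrix. The toy rows and position codes below are invented.

```python
import pandas as pd

# Toy stand-in for merged_data after joining the player table (values invented)
toy = pd.DataFrame({
    "avg_speed": [0.4, 1.9, 0.2],
    "position": ["QB", "WR", "CB"],
})

# One indicator column per position; dtype=float keeps the matrix numeric for scikit-learn
position_dummies = pd.get_dummies(toy["position"], prefix="pos", dtype=float)
X_with_position = pd.concat([toy[["avg_speed"]], position_dummies], axis=1)

print(X_with_position.columns.tolist())
# ['avg_speed', 'pos_CB', 'pos_QB', 'pos_WR']
```

Grouping positions into broader units (for example, offense versus defense) would be another option if the number of indicator columns becomes unwieldy.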


## Conclusion

This analysis suggests that pre-snap tracking metrics carry useful signal for predicting post-snap outcomes. Future work could include adding positional context, expanding the feature set, and experimenting with more advanced machine learning models.


## Appendix

### Detailed Code Implementation

In [None]:
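# Save model predictions for submission
# 'model' was never defined; the tuned Random Forest is used here, and any other
# fitted model from above can be swapped in
submission = X.copy()
submission['predicted_passResult'] = label_encoder.inverse_transform(rf_model.predict(X))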
submission.to_csv('submission.csv', index=False)
print("Submission saved as submission.csv")