# NFL Big Data Bowl 2026 - Submission Notebook

## Overview

This notebook provides the inference pipeline for the NFL Big Data Bowl 2026 Prediction Track. It loads a pre-trained Keras model to predict player trajectory coordinates (x, y) given contextual play information.

### Workflow
1. **Environment Setup**: Configure Protobuf and import all required libraries.
2. **Model Loading**: Lazy-load the trained Keras model on first inference call.
3. **Preprocessing**: Transform raw input data to match the feature engineering used during training.
4. **Inference**: Generate (x, y) coordinate predictions for each player in a play.
5. **Kaggle Integration**: Expose the `predict` function via the Kaggle NFL Inference Server.

### Key Compatibility Notes
- **Protobuf**: The `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` environment variable must be set to `'python'` *before* any TensorFlow imports to avoid version conflicts on Kaggle.
- **Feature Consistency**: The preprocessing logic **must** exactly match the training pipeline (`csv_to_keras_sequence.py`) to avoid train-serve skew.
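
The import-order constraint can be checked rather than assumed. A minimal sketch, assuming a hypothetical helper `configure_protobuf` (not part of this notebook) that sets the variable and reports whether TensorFlow had already been imported:

```python
import os
import sys

PROTOBUF_ENV_VAR = "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"


def configure_protobuf(loaded_modules=None):
    """Select the pure-Python Protobuf backend.

    Returns True if TensorFlow was not yet imported when the variable was
    set, and False if it was already loaded (the setting may come too late).
    """
    # `loaded_modules` is injectable for testing; it defaults to sys.modules.
    modules = sys.modules if loaded_modules is None else loaded_modules
    os.environ[PROTOBUF_ENV_VAR] = "python"
    return "tensorflow" not in modules


set_in_time = configure_protobuf()
if not set_in_time:
    print("Warning: TensorFlow was imported before the Protobuf variable was set.")
```

Calling such a helper at the very top of the notebook makes an accidental reordering of imports visible instead of silent.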

In [None]:
# CRITICAL: Set environment variables BEFORE any imports to fix Protobuf conflicts
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

# Import TensorFlow FIRST to avoid Protobuf version conflicts
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Now import other libraries
import sys
import pandas as pd
import polars as pl
import numpy as np
import zlib
# import kaggle_evaluation.nfl_inference_server

# =================================================================================================
# MODEL CONFIGURATION
# =================================================================================================
## 1. Configuration
'''
This section defines all configurable parameters for the inference pipeline. Modify these values to point to your trained model and normalization statistics.

| Parameter        | Description                                                               |
|------------------|---------------------------------------------------------------------------|
| `MODEL_PATH`     | Absolute path to the trained `.keras` model file.                         |
| `ID_COLUMNS`     | Columns to exclude from features (identifiers, not predictive signals).   |
| `MAX_SEQ_LENGTH` | Maximum input sequence length (must match training).                      |
| `MEAN` / `STD`   | Per-feature normalization statistics from training (required if training normalized its inputs). |
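
Before plugging real statistics into `MEAN` and `STD`, it may help to validate them. The sketch below is illustrative only: `validate_stats` is a hypothetical helper, the feature count of 18 is taken from the comments in this notebook, and replacing near-zero standard deviations with 1.0 is an assumed policy, not something the training pipeline is known to do.

```python
import numpy as np

N_FEATURES = 18  # feature count stated in this notebook's comments


def validate_stats(mean, std, n_features=N_FEATURES, eps=1e-8):
    """Return (mean, std) as float32 arrays of shape (n_features,).

    Near-zero standard deviations are replaced with 1.0 so that constant
    features do not cause division by zero (an assumed policy).
    """
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    if mean.shape != (n_features,) or std.shape != (n_features,):
        raise ValueError(
            f"Expected shape ({n_features},), got {mean.shape} and {std.shape}"
        )
    std = np.where(std < eps, 1.0, std).astype(np.float32)
    return mean, std


# Placeholder statistics; real values must come from the training data.
mean, std = validate_stats(np.zeros(18), np.ones(18))
```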
'''

MODEL_PATH = '/home/samer/Desktop/competitions/NFL_Big_Data_Bowl_2026_dev/trained_models/fine_tuned_encoder_60_epochs_submission_6.keras'

# ID columns to EXCLUDE (matching csv_to_keras_sequence.py)
ID_COLUMNS = ['game_id', 'play_id', 'nfl_id', 'frame_id', 'player_to_predict', 'time']

# Maximum sequence length (matching training data)
MAX_SEQ_LENGTH = 10

# Normalization Statistics
# IMPORTANT: Replace these with the actual mean and std from your training data!
# These must be numpy arrays of shape (18,) matching the feature columns.
MEAN = None 
STD = None

# Global Model Variable
model = None

'''
---
## 2. Model Management

The model is loaded lazily, only when the first prediction is requested. This
avoids unnecessary memory usage if the notebook cell is run but no
predictions are made (e.g., during development). The loaded model is cached
in a global variable to prevent redundant disk reads.
'''

def load_model_if_needed():
    """Lazily loads the Keras model into global memory.
    This function implements a singleton pattern for the model. It checks if
    the global `model` variable is None, and if so, loads the model from
    `MODEL_PATH`. Subsequent calls return instantly.
    This approach minimizes startup time and memory usage in development,
    while ensuring the model is ready before the first prediction.
    Raises:
        Exception: If the model file cannot be loaded (e.g., file not found,
            corrupted file, or incompatible TensorFlow version).
    Returns:
        tf.keras.Model: The loaded Keras model instance.
    """
    global model
    if model is None:
        print(f"Loading model from {MODEL_PATH}...")
        try:
            model = tf.keras.models.load_model(MODEL_PATH)
            print("Model loaded successfully.")
            print(f"Model expects input shape: {model.input_shape}")
        except Exception as e:
            print(f"Error loading model: {e}")
            # Bare `raise` re-raises the original exception with its traceback.
            raise
    return model


'''
## 3. Feature Preprocessing

These functions transform raw input data into the numerical format expected by the model. 

### Critical Requirement: Train-Serve Consistency
The preprocessing logic here **must be identical** to the logic used in `csv_to_keras_sequence.py` during training. Any discrepancy will cause a distribution shift between training and inference data, degrading model performance. 
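
One inexpensive guard against this kind of skew is to compare the feature columns derived at inference time with a list saved from training. The sketch below is an assumption-laden illustration: `check_feature_columns` is a hypothetical helper, and the column names `x`, `y`, and `s` are placeholders rather than the competition's actual schema.

```python
ID_COLUMNS = ["game_id", "play_id", "nfl_id", "frame_id", "player_to_predict", "time"]


def check_feature_columns(input_columns, expected_features):
    """Raise ValueError if inference features differ from the training list."""
    expected = list(expected_features)
    actual = [c for c in input_columns if c not in ID_COLUMNS]
    if actual != expected:
        missing = [c for c in expected if c not in actual]
        extra = [c for c in actual if c not in expected]
        same_set = sorted(actual) == sorted(expected)
        raise ValueError(
            f"Feature mismatch: missing={missing}, extra={extra}, "
            f"same_columns_different_order={same_set}"
        )
    return actual


# Placeholder columns: the order must match training, not just the names.
features = check_feature_columns(["game_id", "x", "y", "s"], ["x", "y", "s"])
```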

### Processing Pipeline
1. **Type Conversion**: Convert booleans, categorical strings, and dates to floats.
2. **String Hashing**: Strings that are neither a known category nor numeric are hashed with `zlib.adler32` (modulo 10000) for a deterministic numeric representation.
3. **Feature Selection**: Only non-ID columns are used as features.
4. **Sequence Padding**: Variable-length sequences are padded to `MAX_SEQ_LENGTH`.
5. **Normalization**: Features are standardized using pre-computed `MEAN` and `STD` (if provided).
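
The steps above can be sketched on a toy sequence. This is a simplified NumPy illustration, not the notebook's Polars implementation: `encode_string` and `pad_post` are hypothetical helpers, and the two-feature rows and statistics are made up for the example.

```python
import zlib

import numpy as np

MAX_SEQ_LENGTH = 10

CATEGORY_CODES = {
    "true": 1.0, "false": 0.0,
    "left": 0.0, "right": 1.0,
    "defense": 0.0, "offense": 1.0,
}


def encode_string(value):
    """Map a string to a float: known category, numeric parse, or hash."""
    lowered = value.lower()
    if lowered in CATEGORY_CODES:
        return CATEGORY_CODES[lowered]
    try:
        return float(lowered)
    except ValueError:
        # adler32 is stable across processes, unlike the built-in hash().
        return float(zlib.adler32(value.encode("utf-8")) % 10000)


def pad_post(seq, max_len=MAX_SEQ_LENGTH):
    """Truncate or zero-pad a (timesteps, features) array at the end."""
    seq = np.asarray(seq, dtype=np.float32)[:max_len]
    padded = np.zeros((max_len, seq.shape[1]), dtype=np.float32)
    padded[: len(seq)] = seq
    return padded


# Two timesteps, two made-up features per timestep.
rows = [
    [encode_string("Right"), encode_string("3.5")],
    [encode_string("left"), encode_string("WR")],
]
x = pad_post(rows)

# Made-up statistics, applied after padding as in `preprocess`.
mean = np.array([0.5, 0.0], dtype=np.float32)
std = np.array([0.5, 1.0], dtype=np.float32)
x_norm = (x - mean) / std
print(x.shape)  # (10, 2)
```

Because normalization runs after padding, padded rows become `-MEAN / STD` rather than zero. Whether that matches training depends on what `csv_to_keras_sequence.py` does.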
'''

def process_value(val):
    """Converts a single value to a float, matching training preprocessing.
    This function handles the type conversion for individual cell values
    from the input DataFrame. It is designed to exactly replicate the
    feature engineering performed in `csv_to_keras_sequence.py`.
    Conversion Rules:
        - None: 0.0
        - bool: True -> 1.0, False -> 0.0
        - int/float: Cast to float
        - str 'true'/'false': 1.0 / 0.0
        - str 'left'/'right': 0.0 / 1.0 (direction encoding)
        - str 'defense'/'offense': 0.0 / 1.0 (team side encoding)
        - str (numeric): Parsed as float
        - str (other): Hashed with zlib.adler32 modulo 10000, returned as a float
    Args:
        val: The value to convert. Can be any type.
    Returns:
        float: The numeric representation of the input value.
    """
    # Handle None/null values
    if val is None:
        return 0.0
    
    # Handle Booleans
    if isinstance(val, bool):
        return 1.0 if val else 0.0
    
    # Handle numeric types
    if isinstance(val, (int, float)):
        return float(val)
    
    # Handle string values
    if isinstance(val, str):
        val_lower = val.lower()
        
        # Booleans
        if val_lower == 'true':
            return 1.0
        if val_lower == 'false':
            return 0.0
        
        # Direction
        if val_lower == 'left':
            return 0.0
        if val_lower == 'right':
            return 1.0
        
        # Player Side
        if val_lower == 'defense':
            return 0.0
        if val_lower == 'offense':
            return 1.0
        
        # Try to parse as number
        try:
            return float(val_lower)
        except ValueError:
            # Hash the string using zlib.adler32 to match training
            return float(zlib.adler32(val.encode('utf-8')) % 10000)
    
    # Fallback: hash any other type
    return float(zlib.adler32(str(val).encode('utf-8')) % 10000)


def preprocess(test_df, test_input_df):
    """Transforms raw input DataFrames into model-ready feature tensors.
    This function replicates the full preprocessing pipeline from training,
    ensuring feature consistency between training and inference. It performs:
    1. DataFrame conversion (Pandas -> Polars if needed)
    2. Vectorized type casting and categorical encoding
    3. Per-player sequence extraction
    4. Zero-padding to `MAX_SEQ_LENGTH`
    5. Optional normalization with `MEAN` and `STD`
    Args:
        test_df (pl.DataFrame | pd.DataFrame): Metadata for the prediction
            request. Contains `game_id`, `play_id`, and `nfl_id` to identify
            which player's trajectory to predict.
        test_input_df (pl.DataFrame | pd.DataFrame): Context data for the play.
            Contains time-series features for all players in the play.
    Returns:
        np.ndarray: A 3D NumPy array of shape `(batch_size, MAX_SEQ_LENGTH, 18)`
            containing the preprocessed and padded feature sequences. `batch_size`
            equals the number of rows in `test_df`.
    """
    # Convert to Polars if needed
    if not isinstance(test_df, pl.DataFrame):
        test_df = pl.from_pandas(test_df.to_pandas() if hasattr(test_df, 'to_pandas') else test_df)
            
    if not isinstance(test_input_df, pl.DataFrame):
        test_input_df = pl.from_pandas(test_input_df.to_pandas() if hasattr(test_input_df, 'to_pandas') else test_input_df)
    
    # Get feature columns (all columns EXCEPT ID columns)
    all_columns = test_input_df.columns
    feature_cols = [col for col in all_columns if col not in ID_COLUMNS]
    
    # Process features using vectorized Polars operations
    expressions = []
    for col in feature_cols:
        if test_input_df[col].dtype == pl.Utf8:
            # Handle string columns
            expr = (
                pl.when(pl.col(col).str.to_lowercase() == "true").then(1.0)
                .when(pl.col(col).str.to_lowercase() == "false").then(0.0)
                .when(pl.col(col).str.to_lowercase() == "left").then(0.0)
                .when(pl.col(col).str.to_lowercase() == "right").then(1.0)
                .when(pl.col(col).str.to_lowercase() == "defense").then(0.0)
                .when(pl.col(col).str.to_lowercase() == "offense").then(1.0)
                .otherwise(
                    pl.col(col).cast(pl.Float64, strict=False).fill_null(
                        pl.col(col).map_elements(lambda x: float(zlib.adler32(x.encode('utf-8')) % 10000) if x else 0.0, return_dtype=pl.Float64)
                    )
                ).cast(pl.Float64).alias(col)
            )
            expressions.append(expr)
        else:
            # Numeric columns
            expressions.append(pl.col(col).cast(pl.Float64).fill_null(0.0).alias(col))
    
    # Apply all transformations
    test_input_df = test_input_df.with_columns(expressions)
    
    # Build sequences
    sequences = []
    
    for row in test_df.iter_rows(named=True):
        # Filter for this specific player
        player_data = test_input_df.filter(
            (pl.col('game_id') == row['game_id']) &
            (pl.col('play_id') == row['play_id']) &
            (pl.col('nfl_id') == row['nfl_id'])
        )
        
        # Filter for player_to_predict == True (matching training data)
        if 'player_to_predict' in test_input_df.columns:
            player_data = player_data.filter(
                (pl.col('player_to_predict') == 1.0) | 
                (pl.col('player_to_predict').cast(pl.Utf8).str.to_lowercase() == 'true')
            )
        
        if len(player_data) == 0:
            # Fallback: create zero sequence
            seq = np.zeros((1, len(feature_cols)), dtype=np.float32)
        else:
            # Sort by frame_id
            if 'frame_id' in player_data.columns:
                player_data = player_data.sort('frame_id')
            
            # Select ONLY feature columns (excludes ID columns)
            seq = player_data.select(feature_cols).to_numpy().astype(np.float32)
        
        sequences.append(seq)
    
    # Pad sequences to MAX_SEQ_LENGTH
    X_padded = pad_sequences(
        sequences,
        maxlen=MAX_SEQ_LENGTH,
        dtype='float32',
        padding='post',
        truncating='post',
        value=0.0
    )
    
    # Normalize if stats are available
    if MEAN is not None and STD is not None:
        X_padded = (X_padded - MEAN) / STD
    
    return X_padded


'''
## 4. Inference

The `predict` function is the main entry point called by the Kaggle evaluation server. It orchestrates the full inference flow:
1. Ensures the model is loaded.
2. Preprocesses the input DataFrames.
3. Runs the model forward pass.
4. Formats the output for submission.

### Output Handling
The model outputs a 3D tensor of shape `(batch_size, sequence_length, 2)`.
Since the competition expects a single (x, y) prediction per player, we
extract the prediction from the **last timestep** of each sequence.
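
The slicing can be checked on a fake output tensor. In this sketch, `to_xy` is a hypothetical helper that mirrors the shape handling in `predict`, and the array stands in for real model output:

```python
import numpy as np


def to_xy(predictions):
    """Reduce model output to one (x, y) pair per row."""
    predictions = np.asarray(predictions)
    if predictions.ndim == 3:
        # Keep only the last timestep of each sequence.
        predictions = predictions[:, -1, :]
    # Keep the first two columns as (x, y).
    return predictions[:, :2]


batch_size, time_steps = 3, 10
# Fake output: the value at [b, t] is (b, t), so the result is easy to inspect.
outputs = np.zeros((batch_size, time_steps, 2), dtype=np.float32)
for b in range(batch_size):
    for t in range(time_steps):
        outputs[b, t] = (b, t)

last_step = to_xy(outputs)
print(last_step.shape)  # (3, 2)
```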
'''


def predict(test_df, test_input_df):
    """Generates (x, y) trajectory predictions for a batch of players.
    This is the main entry point called by the Kaggle NFL Inference Server.
    It orchestrates model loading, preprocessing, and inference.
    The model outputs a sequence of predictions. To produce a single (x, y)
    prediction per player, the **last timestep** of each output sequence
    is selected.
    Args:
        test_df (pl.DataFrame | pd.DataFrame): Prediction request metadata.
            Each row specifies a unique (game_id, play_id, nfl_id) combination
            for which a prediction is required.
        test_input_df (pl.DataFrame | pd.DataFrame): Contextual tracking data
            for the play, including positions, velocities, and player attributes.
    Returns:
        pd.DataFrame: A DataFrame with two columns:
            - 'x': Predicted x-coordinate on the field.
            - 'y': Predicted y-coordinate on the field.
            The number of rows matches the number of rows in `test_df`.
    """
    load_model_if_needed()
    
    # Preprocess
    features = preprocess(test_df, test_input_df)
    
    # Run inference
    if len(features) > 32:
        predictions_xy = model.predict(features, batch_size=32, verbose=0)
    else:
        predictions_xy = model(features, training=False).numpy()
    
    # Handle 3D output (batch_size, time_steps, features)
    # The model returns a sequence of predictions, take the last timestep
    if len(predictions_xy.shape) == 3:
        predictions_xy = predictions_xy[:, -1, :]
    
    # Ensure we have exactly 2 features (x, y)
    if predictions_xy.shape[1] != 2:
        predictions_xy = predictions_xy[:, :2]  # Take first 2 columns
    
    # Format the predictions into the required DataFrame
    return pd.DataFrame(predictions_xy, columns=['x', 'y'])


# =================================================================================================
# INFERENCE SERVER (ENTRY POINT)
# =================================================================================================

'''
## 5. Kaggle Inference Server Integration

This section initializes the `NFLInferenceServer` provided by the
`kaggle_evaluation` package. 

- **Competition Rerun**: When the notebook runs in the Kaggle competition 
environment (`KAGGLE_IS_COMPETITION_RERUN` is set), the server listens for 
incoming prediction requests.
- **Local Testing**: For local development, the `run_local_gateway` method 
simulates the server using local data files.

> **Note**: On local machines without the `kaggle_evaluation` package 
installed, this section will raise an error. This is expected and does not 
affect model development.
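
The branch between the two modes can be exercised locally without the `kaggle_evaluation` package. In this sketch, `FakeInferenceServer` and `start` are hypothetical stand-ins; only the method names `serve` and `run_local_gateway` and the environment variable come from this notebook.

```python
import os


class FakeInferenceServer:
    """Hypothetical stand-in exposing the two methods used in this notebook."""

    def __init__(self, predict_fn):
        self.predict_fn = predict_fn
        self.calls = []

    def serve(self):
        self.calls.append("serve")

    def run_local_gateway(self, data_paths):
        self.calls.append(("run_local_gateway", data_paths))


def start(server, environ=None):
    """Serve on a competition rerun, otherwise run the local gateway."""
    environ = os.environ if environ is None else environ
    if environ.get("KAGGLE_IS_COMPETITION_RERUN"):
        server.serve()
    else:
        server.run_local_gateway(
            ("/kaggle/input/nfl-big-data-bowl-2026-prediction/",)
        )


rerun_server = FakeInferenceServer(predict_fn=None)
start(rerun_server, {"KAGGLE_IS_COMPETITION_RERUN": "1"})

local_server = FakeInferenceServer(predict_fn=None)
start(local_server, {})
```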
'''
# Requires the kaggle_evaluation package (see the note above).
import kaggle_evaluation.nfl_inference_server

inference_server = kaggle_evaluation.nfl_inference_server.NFLInferenceServer(predict)

if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    # For local testing
    inference_server.run_local_gateway((
        '/kaggle/input/nfl-big-data-bowl-2026-prediction/',
    ))