# Midterm 2: NFL Big Data Bowl 2026 with TabNet

In this notebook I work through the midterm using the dataset from the NFL Big Data Bowl 2026 competition. The idea is to take the raw competition data, prepare the variables we need, train a TabNet model to predict the player's final position (x and y) at the moment the ball arrives and, finally, generate the `submission.csv` file to upload to Kaggle.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/officialpytorchtabnet/pytorch_tabnet-3.1.0-py3-none-any.whl
/kaggle/input/officialpytorchtabnet/pytorch_tabnet-4.0-py3-none-any.whl
/kaggle/input/officialpytorchtabnet/pytorch_tabnet-3.1.1-py3-none-any.whl
/kaggle/input/officialpytorchtabnet/pytorch_tabnet-3.0.0-py3-none-any.whl
/kaggle/input/nfl-big-data-bowl-2026-prediction/test_input.csv
/kaggle/input/nfl-big-data-bowl-2026-prediction/test.csv
/kaggle/input/nfl-big-data-bowl-2026-prediction/kaggle_evaluation/nfl_inference_server.py
/kaggle/input/nfl-big-data-bowl-2026-prediction/kaggle_evaluation/nfl_gateway.py
/kaggle/input/nfl-big-data-bowl-2026-prediction/kaggle_evaluation/__init__.py
/kaggle/input/nfl-big-data-bowl-2026-prediction/kaggle_evaluation/core/templates.py
/kaggle/input/nfl-big-data-bowl-2026-prediction/kaggle_evaluation/core/base_gateway.py
/kaggle/input/nfl-big-data-bowl-2026-prediction/kaggle_evaluation/core/relay.py
/kaggle/input/nfl-big-data-bowl-2026-prediction/kaggle_evaluation/core/kaggle_evaluati

## Base functions: data, features, and metrics

This block sets up the functions used throughout the notebook. It contains the functions that load the competition data, the ones that normalize the play direction, the ones that build new variables (velocities, angles, distances, etc.), and the ones that convert the categorical columns to numbers. It also contains the evaluation functions I use later to measure how good the predictions are.
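
As a rough illustration of two of the steps described above (direction normalization and velocity components), here is a minimal sketch on a toy DataFrame. The column names `x`, `y`, `dir`, `s`, and `play_direction` match the ones used in this notebook. The helper name `flip_left_plays` and the toy values are hypothetical, and the angle convention (degrees counterclockwise from the x-axis) is an assumption that mirrors the notebook code rather than a statement about the competition data.

```python
import numpy as np
import pandas as pd

FIELD_LENGTH = 120.0  # yards, including both end zones
FIELD_WIDTH = 53.3    # yards

def flip_left_plays(df: pd.DataFrame) -> pd.DataFrame:
    """Rotate plays moving left by 180 degrees so every play moves right."""
    out = df.copy()
    left = out["play_direction"] == "left"
    out.loc[left, "x"] = FIELD_LENGTH - out.loc[left, "x"]
    out.loc[left, "y"] = FIELD_WIDTH - out.loc[left, "y"]
    out.loc[left, "dir"] = (out.loc[left, "dir"] + 180) % 360
    return out

# Hypothetical toy data: one play moving left, one moving right
toy = pd.DataFrame({
    "play_direction": ["left", "right"],
    "x": [100.0, 30.0],
    "y": [10.0, 20.0],
    "dir": [270.0, 90.0],
    "s": [5.0, 4.0],
})

flipped = flip_left_plays(toy)

# Velocity components under the assumed angle convention
flipped["vx"] = flipped["s"] * np.cos(np.radians(flipped["dir"]))
flipped["vy"] = flipped["s"] * np.sin(np.radians(flipped["dir"]))
print(flipped[["x", "y", "dir", "vx", "vy"]])
```

Only the row marked `left` changes: its coordinates are mirrored through the center of the field and its direction is rotated by 180 degrees, so both plays end up in the same frame of reference.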

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pickle
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
warnings.filterwarnings('ignore')

# ============================================================================
# GPU CONFIGURATION & OPTIMIZATION
# ============================================================================

def setup_gpu():
    """Configure GPU for optimal performance"""
    print("="*80)
    print("GPU CONFIGURATION")
    print("="*80)
    
    # Check available GPUs
    gpus = tf.config.list_physical_devices('GPU')
    print(f"\n🖥️  Available GPUs: {len(gpus)}")
    
    if gpus:
        try:
            # Enable memory growth (don't allocate all GPU memory at once)
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
                print(f"   ✓ GPU: {gpu.name} - Memory growth enabled")
            
            # Set GPU memory limit (optional - useful if sharing GPU)
            # tf.config.set_logical_device_configuration(
            #     gpus[0],
            #     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]  # 4GB
            # )
            
            # Use mixed precision for faster training
            policy = tf.keras.mixed_precision.Policy('mixed_float16')
            tf.keras.mixed_precision.set_global_policy(policy)
            print(f"   ✓ Mixed precision enabled: {policy.name}")
            
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(f"   ✓ Logical GPUs: {len(logical_gpus)}")
            
        except RuntimeError as e:
            print(f"   ⚠️  GPU configuration error: {e}")
    else:
        print("   ⚠️  No GPU found - using CPU (training will be slower)")
    
    # Set TensorFlow options for better performance
    tf.config.optimizer.set_jit(True)  # XLA compilation
    print("   ✓ XLA (Accelerated Linear Algebra) enabled")
    
    print(f"\n📊 TensorFlow version: {tf.__version__}")
    print(f"📊 Keras version: {keras.__version__}")
    
    return len(gpus) > 0

# ============================================================================
# CONFIGURATION
# ============================================================================

CONFIG = {
    'sequence_length': 10,
    'max_frames_to_predict': 15,
    'batch_size': 256,  # Larger batch for GPU
    'epochs': 100,
    'learning_rate': 0.001,
    'validation_split': 0.15,
    'use_gpu': True,
}

# ============================================================================
# EVALUATION METRICS
# ============================================================================

def calculate_rmse(y_true, y_pred):
    """Calculate Root Mean Squared Error"""
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    return rmse

def calculate_mae(y_true, y_pred):
    """Calculate Mean Absolute Error"""
    mae = np.mean(np.abs(y_true - y_pred))
    return mae

def calculate_euclidean_distance(y_true, y_pred):
    """Calculate Euclidean distance between predicted and actual positions"""
    distances = np.sqrt((y_true[:, 0] - y_pred[:, 0])**2 + 
                       (y_true[:, 1] - y_pred[:, 1])**2)
    return distances
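
# ----------------------------------------------------------------------------
# Illustrative check (an added sketch, not part of the original pipeline).
# It shows, on two hand-made points, what the per-row Euclidean distance
# defined above is expected to return. The helper name
# `_demo_euclidean_distance` and the toy values are hypothetical.
# ----------------------------------------------------------------------------
def _demo_euclidean_distance():
    import numpy as np  # local import keeps the sketch self-contained
    y_true = np.array([[0.0, 0.0], [10.0, 5.0]])
    y_pred = np.array([[3.0, 4.0], [10.0, 5.0]])
    # Same formula as calculate_euclidean_distance: sqrt(dx^2 + dy^2) per row
    return np.sqrt(((y_true - y_pred) ** 2).sum(axis=1))

# Example: _demo_euclidean_distance() returns array([5., 0.])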

def evaluate_predictions(y_true, y_pred, split_name="Validation"):
    """Comprehensive evaluation of predictions"""
    print("\n" + "="*80)
    print(f"📊 {split_name.upper()} SET EVALUATION")
    print("="*80)
    
    # Overall metrics
    x_rmse = calculate_rmse(y_true[:, 0], y_pred[:, 0])
    y_rmse = calculate_rmse(y_true[:, 1], y_pred[:, 1])
    
    x_mae = calculate_mae(y_true[:, 0], y_pred[:, 0])
    y_mae = calculate_mae(y_true[:, 1], y_pred[:, 1])
    
    # Euclidean distance
    distances = calculate_euclidean_distance(y_true, y_pred)
    mean_distance = np.mean(distances)
    median_distance = np.median(distances)
    
    print(f"\n🎯 POSITION ACCURACY:")
    print(f"   X-coordinate:")
    print(f"      RMSE: {x_rmse:.3f} yards")
    print(f"      MAE:  {x_mae:.3f} yards")
    
    print(f"\n   Y-coordinate:")
    print(f"      RMSE: {y_rmse:.3f} yards")
    print(f"      MAE:  {y_mae:.3f} yards")
    
    print(f"\n📏 EUCLIDEAN DISTANCE:")
    print(f"   Mean:   {mean_distance:.3f} yards")
    print(f"   Median: {median_distance:.3f} yards")
    print(f"   Std:    {np.std(distances):.3f} yards")
    print(f"   Min:    {np.min(distances):.3f} yards")
    print(f"   Max:    {np.max(distances):.3f} yards")
    
    # Percentiles
    print(f"\n📊 DISTANCE PERCENTILES:")
    for p in [25, 50, 75, 90, 95, 99]:
        print(f"   {p}th percentile: {np.percentile(distances, p):.3f} yards")
    
    # Accuracy buckets
    print(f"\n🎯 ACCURACY BUCKETS:")
    for threshold in [1, 2, 5, 10, 15, 20]:
        within = (distances <= threshold).sum()
        pct = 100 * within / len(distances)
        print(f"   Within {threshold:2d} yards: {within:6,} ({pct:5.2f}%)")
    
    metrics = {
        'x_rmse': x_rmse,
        'y_rmse': y_rmse,
        'x_mae': x_mae,
        'y_mae': y_mae,
        'mean_distance': mean_distance,
        'median_distance': median_distance,
        'distances': distances
    }
    
    return metrics

def plot_predictions(y_true, y_pred, split_name="Validation", save_path="predictions_plot.png"):
    """Visualize predictions vs actual"""
    
    fig = plt.figure(figsize=(20, 12))
    
    # 1. X predictions scatter
    ax1 = plt.subplot(2, 3, 1)
    ax1.scatter(y_true[:, 0], y_pred[:, 0], alpha=0.3, s=1)
    ax1.plot([0, 120], [0, 120], 'r--', linewidth=2)
    ax1.set_xlabel('Actual X (yards)', fontsize=12)
    ax1.set_ylabel('Predicted X (yards)', fontsize=12)
    ax1.set_title(f'{split_name} - X Coordinate', fontsize=14, fontweight='bold')
    ax1.grid(alpha=0.3)
    
    # 2. Y predictions scatter
    ax2 = plt.subplot(2, 3, 2)
    ax2.scatter(y_true[:, 1], y_pred[:, 1], alpha=0.3, s=1)
    ax2.plot([0, 53.3], [0, 53.3], 'r--', linewidth=2)
    ax2.set_xlabel('Actual Y (yards)', fontsize=12)
    ax2.set_ylabel('Predicted Y (yards)', fontsize=12)
    ax2.set_title(f'{split_name} - Y Coordinate', fontsize=14, fontweight='bold')
    ax2.grid(alpha=0.3)
    
    # 3. Error distribution
    ax3 = plt.subplot(2, 3, 3)
    distances = calculate_euclidean_distance(y_true, y_pred)
    ax3.hist(distances, bins=50, alpha=0.7, edgecolor='black')
    ax3.axvline(np.mean(distances), color='red', linestyle='--', 
                linewidth=2, label=f'Mean: {np.mean(distances):.2f}')
    ax3.set_xlabel('Euclidean Distance Error (yards)', fontsize=12)
    ax3.set_ylabel('Frequency', fontsize=12)
    ax3.set_title('Prediction Error Distribution', fontsize=14, fontweight='bold')
    ax3.legend()
    ax3.grid(alpha=0.3)
    
    # 4. X error distribution
    ax4 = plt.subplot(2, 3, 4)
    x_errors = y_true[:, 0] - y_pred[:, 0]
    ax4.hist(x_errors, bins=50, alpha=0.7, edgecolor='black', color='green')
    ax4.axvline(0, color='red', linestyle='--', linewidth=2)
    ax4.set_xlabel('X Error (yards)', fontsize=12)
    ax4.set_ylabel('Frequency', fontsize=12)
    ax4.set_title(f'X Error - Mean: {np.mean(x_errors):.3f}', fontsize=14, fontweight='bold')
    ax4.grid(alpha=0.3)
    
    # 5. Y error distribution
    ax5 = plt.subplot(2, 3, 5)
    y_errors = y_true[:, 1] - y_pred[:, 1]
    ax5.hist(y_errors, bins=50, alpha=0.7, edgecolor='black', color='orange')
    ax5.axvline(0, color='red', linestyle='--', linewidth=2)
    ax5.set_xlabel('Y Error (yards)', fontsize=12)
    ax5.set_ylabel('Frequency', fontsize=12)
    ax5.set_title(f'Y Error - Mean: {np.mean(y_errors):.3f}', fontsize=14, fontweight='bold')
    ax5.grid(alpha=0.3)
    
    # 6. Cumulative accuracy
    ax6 = plt.subplot(2, 3, 6)
    sorted_distances = np.sort(distances)
    cumulative = np.arange(1, len(sorted_distances) + 1) / len(sorted_distances) * 100
    ax6.plot(sorted_distances, cumulative, linewidth=2)
    ax6.set_xlabel('Distance Threshold (yards)', fontsize=12)
    ax6.set_ylabel('Cumulative % of Predictions', fontsize=12)
    ax6.set_title('Cumulative Accuracy Curve', fontsize=14, fontweight='bold')
    ax6.grid(alpha=0.3)
    
    # Add benchmarks
    for threshold in [5, 10, 15]:
        pct = (distances <= threshold).sum() / len(distances) * 100
        ax6.axvline(threshold, linestyle='--', alpha=0.5)
        ax6.text(threshold, pct, f'{pct:.1f}%', fontsize=10)
    
    plt.suptitle(f'{split_name} Set - Prediction Analysis', 
                 fontsize=16, fontweight='bold', y=0.995)
    plt.tight_layout()
    plt.savefig(save_path, dpi=150, bbox_inches='tight')
    print(f"\n✓ Saved plot: {save_path}")
    
    return fig

def plot_training_history(history, save_path="training_history.png"):
    """Plot training history"""
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Loss plot
    axes[0].plot(history.history['loss'], label='Training Loss', linewidth=2)
    axes[0].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
    axes[0].set_xlabel('Epoch', fontsize=12)
    axes[0].set_ylabel('Loss (MSE)', fontsize=12)
    axes[0].set_title('Training and Validation Loss', fontsize=14, fontweight='bold')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    # MAE plot
    axes[1].plot(history.history['mae'], label='Training MAE', linewidth=2)
    axes[1].plot(history.history['val_mae'], label='Validation MAE', linewidth=2)
    axes[1].set_xlabel('Epoch', fontsize=12)
    axes[1].set_ylabel('MAE (yards)', fontsize=12)
    axes[1].set_title('Training and Validation MAE', fontsize=14, fontweight='bold')
    axes[1].legend()
    axes[1].grid(alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(save_path, dpi=150, bbox_inches='tight')
    print(f"✓ Saved plot: {save_path}")
    
    return fig

# ============================================================================
# IMPORT FUNCTIONS FROM ORIGINAL CODE
# ============================================================================

def parse_height(height_str):
    if pd.isna(height_str):
        return np.nan
    try:
        feet, inches = map(int, str(height_str).split('-'))
        return feet * 12 + inches
    except (ValueError, TypeError):  # malformed height string, e.g. not "feet-inches"
        return np.nan

def calculate_age(birth_date, reference_date='2023-09-01'):
    try:
        birth = pd.to_datetime(birth_date)
        ref = pd.to_datetime(reference_date)
        return (ref - birth).days / 365.25
    except (ValueError, TypeError, OverflowError):  # unparseable or out-of-range date
        return np.nan

def load_training_data(data_path='/kaggle/input/nfl-big-data-bowl-2026-prediction/train'):
    print("\n" + "="*80)
    print("LOADING TRAINING DATA")
    print("="*80)
    
    all_data = []
    for week in range(1, 19):
        file_path = f'{data_path}/input_2023_w{week:02d}.csv'
        try:
            df = pd.read_csv(file_path)
            all_data.append(df)
            print(f"✓ Week {week:02d}: {len(df):,} rows | {df['play_id'].nunique():,} plays")
        except FileNotFoundError:
            print(f"✗ Week {week:02d}: File not found")
    
    train_df = pd.concat(all_data, ignore_index=True)
    print(f"\nTotal training data: {len(train_df):,} rows")
    print(f"Unique plays: {(train_df['game_id'].astype(str) + '_' + train_df['play_id'].astype(str)).nunique():,}")
    print(f"Players to predict: {train_df['player_to_predict'].sum():,}")
    
    return train_df

def load_test_data():
    print("\n" + "="*80)
    print("LOADING TEST DATA")
    print("="*80)
    
    test_input = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2026-prediction/test_input.csv')
    test_targets = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2026-prediction/test.csv')
    
    print(f"✓ Test input: {len(test_input):,} rows")
    print(f"✓ Test targets: {len(test_targets):,} predictions needed")
    
    return test_input, test_targets

def normalize_play_direction(df):
    df = df.copy()
    left_mask = df['play_direction'] == 'left'
    num_flipped = left_mask.sum()
    
    df.loc[left_mask, 'x'] = 120 - df.loc[left_mask, 'x']
    df.loc[left_mask, 'y'] = 53.3 - df.loc[left_mask, 'y']
    df.loc[left_mask, 'dir'] = (df.loc[left_mask, 'dir'] + 180) % 360
    df.loc[left_mask, 'o'] = (df.loc[left_mask, 'o'] + 180) % 360
    
    if 'ball_land_x' in df.columns:
        df.loc[left_mask, 'ball_land_x'] = 120 - df.loc[left_mask, 'ball_land_x']
        df.loc[left_mask, 'ball_land_y'] = 53.3 - df.loc[left_mask, 'ball_land_y']
    
    # left_mask counts tracking rows, not distinct plays
    print(f"   Normalized {num_flipped:,} rows from plays moving left → right")
    return df

def engineer_features(df):
    print("\n" + "="*80)
    print("FEATURE ENGINEERING")
    print("="*80)
    
    df = df.copy()
    
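    # NOTE: these formulas treat `dir` and `o` as degrees measured counterclockwise
    # from the +x axis. If the tracking data measures angles clockwise from the
    # +y axis (an assumption to verify against the competition's data description),
    # the sin and cos terms should be swapped.
    print("✓ Computing velocity components (vx, vy)")
    df['vx'] = df['s'] * np.cos(np.radians(df['dir']))
    df['vy'] = df['s'] * np.sin(np.radians(df['dir']))
    
    print("✓ Computing orientation components (ox, oy)")
    df['ox'] = np.cos(np.radians(df['o']))
    df['oy'] = np.sin(np.radians(df['o']))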
    
    if 'ball_land_x' in df.columns:
        print("✓ Computing ball landing features")
        df['dist_to_ball'] = np.sqrt(
            (df['x'] - df['ball_land_x'])**2 + 
            (df['y'] - df['ball_land_y'])**2
        )
        df['angle_to_ball'] = np.arctan2(
            df['ball_land_y'] - df['y'],
            df['ball_land_x'] - df['x']
        )
        df['vel_toward_ball'] = df['s'] * np.cos(np.radians(df['dir']) - df['angle_to_ball'])
    else:
        df['dist_to_ball'] = 0
        df['angle_to_ball'] = 0
        df['vel_toward_ball'] = 0
    
    print("✓ Computing field position features")
    df['dist_to_left_sideline'] = df['y']
    df['dist_to_right_sideline'] = 53.3 - df['y']
    df['dist_to_nearest_sideline'] = np.minimum(df['y'], 53.3 - df['y'])
    df['dist_to_endzone'] = 120 - df['x']
    
    print("✓ Processing player attributes")
    df['height_inches'] = df['player_height'].apply(parse_height)
    df['height_inches'] = df['height_inches'].fillna(df['height_inches'].median())
    
    df['player_age'] = df['player_birth_date'].apply(calculate_age)
    df['player_age'] = df['player_age'].fillna(df['player_age'].median())
    
    df['bmi'] = (df['player_weight'] * 703) / (df['height_inches'] ** 2)
    df['bmi'] = df['bmi'].fillna(df['bmi'].median())
    
    print("✓ Creating temporal features (lags, differences)")
    df = df.sort_values(['game_id', 'play_id', 'nfl_id', 'frame_id'])
    
    group_cols = ['game_id', 'play_id', 'nfl_id']
    for lag in [1, 2, 3]:
        for col in ['x', 'y', 's', 'a', 'vx', 'vy']:
            df[f'{col}_lag{lag}'] = df.groupby(group_cols)[col].shift(lag)
    
    df['speed_change'] = df.groupby(group_cols)['s'].diff()
    df['accel_change'] = df.groupby(group_cols)['a'].diff()
    df['dir_change'] = df.groupby(group_cols)['dir'].diff()
    
    df.loc[df['dir_change'] > 180, 'dir_change'] -= 360
    df.loc[df['dir_change'] < -180, 'dir_change'] += 360
    
    print("✓ Computing rolling statistics")
    for col in ['s', 'a']:
        df[f'{col}_roll_mean'] = df.groupby(group_cols)[col].transform(
            lambda x: x.rolling(window=3, min_periods=1).mean()
        )
        df[f'{col}_roll_std'] = df.groupby(group_cols)[col].transform(
            lambda x: x.rolling(window=3, min_periods=1).std()
        )
    
    # fillna(method=...) is deprecated in recent pandas; bfill()/ffill() are the equivalents
    df = df.bfill().ffill().fillna(0)
    
    print(f"\n📊 Features created: {len(df.columns)} total columns")
    
    return df

def encode_categorical(df, encoders=None):
    df = df.copy()
    categorical_cols = ['player_position', 'player_side', 'player_role']
    
    if encoders is None:
        encoders = {}
        for col in categorical_cols:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].astype(str))
            encoders[col] = le
        return df, encoders
    else:
        for col in categorical_cols:
            if col in encoders:
                df[col] = df[col].astype(str).map(
                    lambda x: x if x in encoders[col].classes_ else encoders[col].classes_[0]
                )
                df[col] = encoders[col].transform(df[col])
        return df
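
# ----------------------------------------------------------------------------
# Illustrative sketch (added; not part of the original pipeline): how the
# unseen-category fallback in encode_categorical behaves. Labels missing from
# the fitted encoder are replaced by its first class before transforming.
# The helper name `_demo_unseen_category` and the toy labels are hypothetical.
# ----------------------------------------------------------------------------
def _demo_unseen_category():
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder().fit(["CB", "QB", "WR"])   # classes_ is stored sorted
    known = set(le.classes_)
    test = pd.Series(["WR", "TE"])                # "TE" was never seen in training
    safe = test.map(lambda v: v if v in known else le.classes_[0])
    return le.transform(safe).tolist()

# Example: _demo_unseen_category() returns [2, 0] ("TE" falls back to "CB")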

def create_sequences(df, sequence_length=10, for_training=True):
    print("\n" + "="*80)
    print("CREATING SEQUENCES")
    print("="*80)
    
    sequence_features = [
        'x', 'y', 's', 'a', 'vx', 'vy', 'ox', 'oy', 'dir', 'o',
        'x_lag1', 'y_lag1', 's_lag1', 'a_lag1',
        'x_lag2', 'y_lag2', 's_lag2', 'a_lag2',
        'x_lag3', 'y_lag3', 's_lag3', 'a_lag3',
        'speed_change', 'accel_change', 'dir_change',
        's_roll_mean', 'a_roll_mean',
        'dist_to_left_sideline', 'dist_to_right_sideline', 'dist_to_nearest_sideline'
    ]
    
    static_features = [
        'player_position', 'player_side', 'player_role',
        'height_inches', 'player_weight', 'player_age', 'bmi',
        'absolute_yardline_number', 'dist_to_ball', 'angle_to_ball'
    ]
    
    sequences = []
    static_data = []
    targets = []
    metadata = []
    
    grouped = df.groupby(['game_id', 'play_id', 'nfl_id'])
    
    for (game_id, play_id, nfl_id), group in grouped:
        if for_training and not group['player_to_predict'].any():
            continue
        
        group = group.sort_values('frame_id')
        
        if len(group) < sequence_length:
            continue
        
        seq_data = group[sequence_features].iloc[-sequence_length:].values
        static = group[static_features].iloc[-1].values
        
        sequences.append(seq_data)
        static_data.append(static)
        
        if for_training and 'ball_land_x' in group.columns:
            target_x = group['ball_land_x'].iloc[-1]
            target_y = group['ball_land_y'].iloc[-1]
            targets.append([target_x, target_y])
        
        metadata.append({
            'game_id': game_id,
            'play_id': play_id,
            'nfl_id': nfl_id,
            'num_frames_output': group['num_frames_output'].iloc[-1] if 'num_frames_output' in group.columns else 0,
            'last_x': group['x'].iloc[-1],
            'last_y': group['y'].iloc[-1],
        })
    
    sequences = np.array(sequences, dtype=np.float32)
    static_data = np.array(static_data, dtype=np.float32)
    
    if for_training and len(targets) > 0:
        targets = np.array(targets, dtype=np.float32)
    else:
        targets = None
    
    print(f"✓ Created {len(sequences):,} sequences")
    print(f"✓ Sequence shape: {sequences.shape}")
    print(f"✓ Static shape: {static_data.shape}")
    if targets is not None:
        print(f"✓ Target shape: {targets.shape}")
    
    return sequences, static_data, targets, metadata

def build_model(sequence_shape, static_shape):
    print("\n" + "="*80)
    print("BUILDING MODEL")
    print("="*80)
    
    sequence_input = layers.Input(shape=sequence_shape, name='sequence_input')
    
    x = layers.LSTM(128, return_sequences=True)(sequence_input)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    
    x = layers.LSTM(64, return_sequences=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    
    static_input = layers.Input(shape=(static_shape,), name='static_input')
    s = layers.Dense(64, activation='relu')(static_input)
    s = layers.BatchNormalization()(s)
    s = layers.Dropout(0.2)(s)
    s = layers.Dense(32, activation='relu')(s)
    
    combined = layers.concatenate([x, s])
    
    z = layers.Dense(128, activation='relu')(combined)
    z = layers.BatchNormalization()(z)
    z = layers.Dropout(0.3)(z)
    
    z = layers.Dense(64, activation='relu')(z)
    z = layers.Dropout(0.2)(z)
    
    # For mixed precision, use float32 output
    output = layers.Dense(2, dtype='float32', name='position_output')(z)
    
    model = keras.Model(
        inputs=[sequence_input, static_input],
        outputs=output
    )
    
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=CONFIG['learning_rate']),
        loss='mse',
        metrics=['mae', 'mse']
    )
    
    model.summary()
    
    return model

def train_model(model, X_seq, X_static, y, validation_split=0.15):
    print("\n" + "="*80)
    print("TRAINING MODEL")
    print("="*80)
    
    callbacks = [
        keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=15,
            restore_best_weights=True,
            verbose=1
        ),
        keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=7,
            min_lr=1e-6,
            verbose=1
        ),
        keras.callbacks.ModelCheckpoint(
            'best_model.keras',
            monitor='val_loss',
            save_best_only=True,
            verbose=1
        )
    ]
    
    history = model.fit(
        [X_seq, X_static], y,
        batch_size=CONFIG['batch_size'],
        epochs=CONFIG['epochs'],
        validation_split=validation_split,
        callbacks=callbacks,
        verbose=1
    )
    
    return model, history

def create_submission(model, test_input, test_targets, metadata_lookup, scalers):
    print("\n" + "="*80)
    print("GENERATING PREDICTIONS")
    print("="*80)
    
    pred_dict = {}
    for meta, pred in zip(metadata_lookup, model.predict([test_input[0], test_input[1]], verbose=1)):
        key = (meta['game_id'], meta['play_id'], meta['nfl_id'])
        pred_dict[key] = {
            'x': pred[0],
            'y': pred[1],
            'last_x': meta['last_x'],
            'last_y': meta['last_y']
        }
    
    submissions = []
    for _, row in test_targets.iterrows():
        key = (row['game_id'], row['play_id'], row['nfl_id'])
        
        if key in pred_dict:
            x_pred = pred_dict[key]['x']
            y_pred = pred_dict[key]['y']
        else:
            x_pred = 60.0
            y_pred = 26.65
        
        submissions.append({
            'id': f"{row['game_id']}_{row['play_id']}_{row['nfl_id']}_{row['frame_id']}",
            'x': x_pred,
            'y': y_pred
        })
    
    submission_df = pd.DataFrame(submissions)
    submission_df.to_csv('submission.csv', index=False)
    
    print(f"✓ Submission created: {len(submission_df):,} predictions")
    print(f"✓ Saved to: submission.csv")
    
    return submission_df

# ============================================================================
# MAIN PIPELINE WITH EVALUATION
# ============================================================================

'''def main():
    start_time = datetime.now()
    
    print("\n" + "="*80)
    print(" NFL BIG DATA BOWL 2026 - ENHANCED PIPELINE WITH EVALUATION")
    print("="*80)
    
    # Setup GPU
    has_gpu = setup_gpu()
    
    # Load data
    train_df = load_training_data()
    test_input_df, test_targets_df = load_test_data()
    
    # Preprocess
    print("\n📍 Step 1: Normalizing play direction...")
    train_df = normalize_play_direction(train_df)
    test_input_df = normalize_play_direction(test_input_df)
    
    # Feature engineering
    print("\n📍 Step 2: Feature engineering...")
    train_df = engineer_features(train_df)
    test_input_df = engineer_features(test_input_df)
    
    # Encode categorical
    print("\n📍 Step 3: Encoding categorical variables...")
    train_df, encoders = encode_categorical(train_df)
    test_input_df = encode_categorical(test_input_df, encoders)
    
    # Create sequences
    print("\n📍 Step 4: Creating sequences...")
    X_seq_all, X_static_all, y_all, metadata_all = create_sequences(
        train_df, CONFIG['sequence_length'], for_training=True
    )
    
    X_seq_test, X_static_test, _, metadata_test = create_sequences(
        test_input_df, CONFIG['sequence_length'], for_training=False
    )
    
    # Split train/validation
    print("\n📍 Step 5: Splitting train/validation...")
    n_samples = len(X_seq_all)
    n_val = int(n_samples * CONFIG['validation_split'])
    
    # Random shuffle
    indices = np.random.permutation(n_samples)
    train_idx = indices[n_val:]
    val_idx = indices[:n_val]
    
    X_seq_train = X_seq_all[train_idx]
    X_static_train = X_static_all[train_idx]
    y_train = y_all[train_idx]
    
    X_seq_val = X_seq_all[val_idx]
    X_static_val = X_static_all[val_idx]
    y_val = y_all[val_idx]
    
    print(f"   Training samples: {len(X_seq_train):,}")
    print(f"   Validation samples: {len(X_seq_val):,}")
    
    # Scale features
    print("\n📍 Step 6: Scaling features...")
    scaler_seq = StandardScaler()
    scaler_static = StandardScaler()
    
    # Scale sequence features
    X_seq_train_flat = X_seq_train.reshape(-1, X_seq_train.shape[-1])
    X_seq_train_scaled = scaler_seq.fit_transform(X_seq_train_flat).reshape(X_seq_train.shape)
    
    X_seq_val_flat = X_seq_val.reshape(-1, X_seq_val.shape[-1])
    X_seq_val_scaled = scaler_seq.transform(X_seq_val_flat).reshape(X_seq_val.shape)
    
    # Scale static features
    X_static_train_scaled = scaler_static.fit_transform(X_static_train)
    X_static_val_scaled = scaler_static.transform(X_static_val)
    
    # Scale test features
    X_seq_test_flat = X_seq_test.reshape(-1, X_seq_test.shape[-1])
    X_seq_test_scaled = scaler_seq.transform(X_seq_test_flat).reshape(X_seq_test.shape)
    X_static_test_scaled = scaler_static.transform(X_static_test)
    
    # Build model
    print("\n📍 Step 7: Building model...")
    model = build_model(
        sequence_shape=(X_seq_train.shape[1], X_seq_train.shape[2]),
        static_shape=X_static_train.shape[1]
    )
    
    # Train model WITHOUT validation_split (we already split)
    print("\n📍 Step 8: Training model...")
    
    callbacks = [
        keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=15,
            restore_best_weights=True,
            verbose=1
        ),
        keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=7,
            min_lr=1e-6,
            verbose=1
        ),
        keras.callbacks.ModelCheckpoint(
            'best_model.keras',
            monitor='val_loss',
            save_best_only=True,
            verbose=1
        )
    ]
    
    history = model.fit(
        [X_seq_train_scaled, X_static_train_scaled], y_train,
        validation_data=([X_seq_val_scaled, X_static_val_scaled], y_val),
        batch_size=CONFIG['batch_size'],
        epochs=CONFIG['epochs'],
        callbacks=callbacks,
        verbose=1
    )
    
    # Plot training history
    print("\n📍 Step 9: Plotting training history...")
    plot_training_history(history, "training_history.png")
    
    # Evaluate on training set
    print("\n📍 Step 10: Evaluating on training set...")
    y_train_pred = model.predict([X_seq_train_scaled, X_static_train_scaled], verbose=0)
    train_metrics = evaluate_predictions(y_train, y_train_pred, "Training")
    plot_predictions(y_train, y_train_pred, "Training", "predictions_train.png")
    
    # Evaluate on validation set
    print("\n📍 Step 11: Evaluating on validation set...")
    y_val_pred = model.predict([X_seq_val_scaled, X_static_val_scaled], verbose=0)
    val_metrics = evaluate_predictions(y_val, y_val_pred, "Validation")
    plot_predictions(y_val, y_val_pred, "Validation", "predictions_val.png")
    
    # Save model and artifacts
    print("\n📍 Step 12: Saving model and artifacts...")
    model.save('nfl_model_final.keras')
    with open('scalers.pkl', 'wb') as f:
        pickle.dump({'seq': scaler_seq, 'static': scaler_static, 'encoders': encoders}, f)
    
    # Save metrics to file
    metrics_summary = {
        'training': {
            'x_rmse': float(train_metrics['x_rmse']),
            'y_rmse': float(train_metrics['y_rmse']),
            'x_mae': float(train_metrics['x_mae']),
            'y_mae': float(train_metrics['y_mae']),
            'mean_distance': float(train_metrics['mean_distance']),
            'median_distance': float(train_metrics['median_distance'])
        },
        'validation': {
            'x_rmse': float(val_metrics['x_rmse']),
            'y_rmse': float(val_metrics['y_rmse']),
            'x_mae': float(val_metrics['x_mae']),
            'y_mae': float(val_metrics['y_mae']),
            'mean_distance': float(val_metrics['mean_distance']),
            'median_distance': float(val_metrics['median_distance'])
        }
    }
    
    with open('metrics.pkl', 'wb') as f:
        pickle.dump(metrics_summary, f)
    
    # Create submission
    print("\n📍 Step 13: Creating submission...")
    submission = create_submission(
        model, 
        (X_seq_test_scaled, X_static_test_scaled),
        test_targets_df,
        metadata_test,
        {'seq': scaler_seq, 'static': scaler_static}
    )
    
    # Final summary
    end_time = datetime.now()
    duration = end_time - start_time
    
    print("\n" + "="*80)
    print("✅ PIPELINE COMPLETE!")
    print("="*80)
    
    print(f"\n⏱️  Total Time: {duration}")
    
    print(f"\n📁 Files created:")
    print(f"   • nfl_model_final.keras - Trained model")
    print(f"   • best_model.keras - Best model checkpoint")
    print(f"   • scalers.pkl - Feature scalers and encoders")
    print(f"   • metrics.pkl - Evaluation metrics")
    print(f"   • submission.csv - Final predictions ({len(submission):,} rows)")
    print(f"   • training_history.png - Training curves")
    print(f"   • predictions_train.png - Training set predictions")
    print(f"   • predictions_val.png - Validation set predictions")
    
    print(f"\n📊 FINAL RESULTS:")
    print(f"\n   Training Set:")
    print(f"      RMSE (X): {train_metrics['x_rmse']:.3f} yards")
    print(f"      RMSE (Y): {train_metrics['y_rmse']:.3f} yards")
    print(f"      Mean Distance Error: {train_metrics['mean_distance']:.3f} yards")
    
    print(f"\n   Validation Set:")
    print(f"      RMSE (X): {val_metrics['x_rmse']:.3f} yards")
    print(f"      RMSE (Y): {val_metrics['y_rmse']:.3f} yards")
    print(f"      Mean Distance Error: {val_metrics['mean_distance']:.3f} yards")
    
    print(f"\n🎯 Model Performance Summary:")
    within_5_val = (val_metrics['distances'] <= 5).sum() / len(val_metrics['distances']) * 100
    within_10_val = (val_metrics['distances'] <= 10).sum() / len(val_metrics['distances']) * 100
    print(f"   Predictions within 5 yards: {within_5_val:.1f}%")
    print(f"   Predictions within 10 yards: {within_10_val:.1f}%")
    
    print("\n" + "="*80)
    
    return model, history, submission, train_metrics, val_metrics'''

'''# ============================================================================
# RUN
# ============================================================================

if __name__ == "__main__":
    model, history, submission, train_metrics, val_metrics = main()
    '''





## Part 2 – Load and normalize the data

First we load the official competition data: the training set and the test files that Kaggle provides for prediction. Then we call the function that normalizes play direction, so that every play is oriented toward the same side of the field.

This keeps the model from having to learn the same pattern twice (once when the offense attacks to the right and once when it attacks to the left). In short, it tidies up the data before we start building features.
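
The idea can be illustrated with a minimal sketch. It is not the notebook's `normalize_play_direction`; the helper name `flip_left_plays`, the `"left"`/`"right"` labels, and the 120 x 53.3 yard field dimensions are assumptions made for the example. Plays moving left are rotated 180 degrees about the center of the field, and headings are rotated by the same amount.

```python
import numpy as np

FIELD_LENGTH = 120.0   # yards, including both end zones (assumed)
FIELD_WIDTH = 53.3     # yards (assumed)

def flip_left_plays(x, y, dir_deg, play_direction):
    """Hypothetical stand-in: mirror left-moving plays so all plays run left to right."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dir_deg = np.asarray(dir_deg, dtype=float)
    is_left = np.asarray(play_direction) == "left"

    # A 180-degree rotation about the field center maps (x, y) to (L - x, W - y)
    x_norm = np.where(is_left, FIELD_LENGTH - x, x)
    y_norm = np.where(is_left, FIELD_WIDTH - y, y)
    # Headings rotate by the same 180 degrees
    dir_norm = np.where(is_left, (dir_deg + 180.0) % 360.0, dir_deg)
    return x_norm, y_norm, dir_norm

# Two mirrored situations: same player state, opposite play directions
x_n, y_n, d_n = flip_left_plays(
    [30.0, 90.0], [10.0, 43.3], [90.0, 270.0], ["right", "left"]
)
print(x_n, y_n, d_n)
```

After the flip, both toy rows describe the same situation, which is the redundancy the normalization is meant to remove.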


In [3]:
# ===== PART 2 – Step 1: Load and normalize the raw data =====

# Load the full training set (every week available in the dataset)
train_df = load_training_data()

# Load test_input (test features) and test_targets (IDs + Kaggle truth for offline evaluation)
test_input_df, test_targets_df = load_test_data()

print("\nTamaños originales de los DataFrames:")
print(f" train_df:       {train_df.shape}")
print(f" test_input_df:  {test_input_df.shape}")
print(f" test_targets_df:{test_targets_df.shape}")

# Normalize play direction so every play runs left ➜ right
print("\nNormalizando dirección de las jugadas (play_direction)...")
train_df = normalize_play_direction(train_df)
test_input_df = normalize_play_direction(test_input_df)



LOADING TRAINING DATA
✓ Week 01: 285,714 rows | 748 plays
✓ Week 02: 288,586 rows | 777 plays
✓ Week 03: 297,757 rows | 823 plays
✓ Week 04: 272,475 rows | 710 plays
✓ Week 05: 254,779 rows | 677 plays
✓ Week 06: 270,676 rows | 715 plays
✓ Week 07: 233,597 rows | 646 plays
✓ Week 08: 281,011 rows | 765 plays
✓ Week 09: 252,796 rows | 656 plays
✓ Week 10: 260,372 rows | 673 plays
✓ Week 11: 243,413 rows | 657 plays
✓ Week 12: 294,940 rows | 755 plays
✓ Week 13: 233,755 rows | 622 plays
✓ Week 14: 279,972 rows | 738 plays
✓ Week 15: 281,820 rows | 702 plays
✓ Week 16: 316,417 rows | 822 plays
✓ Week 17: 277,582 rows | 734 plays
✓ Week 18: 254,917 rows | 686 plays

Total training data: 4,880,579 rows
 Unique plays: 14,108
Players to predict: 1,303,440

LOADING TEST DATA
✓ Test input: 49,753 rows
✓ Test targets: 5,837 predictions needed

Tamaños originales de los DataFrames:
 train_df:       (4880579, 23)
 test_input_df:  (49753, 23)
 test_targets_

### Create features and convert everything to numbers

The raw data still lacks information that is useful to the model. In this step we use the feature-engineering function to create variables that better describe the player and the play, such as features related to movement, field position, and the player's physical attributes.
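
As one illustration of a movement feature, speed and heading can be decomposed into velocity components. This is a sketch, not the notebook's `engineer_features`: the helper name `velocity_components` is hypothetical, and it assumes the heading is given in degrees measured clockwise from the positive `y` axis, which should be checked against the dataset's documentation.

```python
import numpy as np

def velocity_components(speed, dir_deg):
    """Split speed into (vx, vy), assuming dir_deg is clockwise from the +y axis."""
    s = np.asarray(speed, dtype=float)
    rad = np.deg2rad(np.asarray(dir_deg, dtype=float))
    vx = s * np.sin(rad)   # component along the length of the field
    vy = s * np.cos(rad)   # component across the width of the field
    return vx, vy

# A player moving straight downfield (90 degrees) and one moving straight toward +y (0 degrees)
vx, vy = velocity_components([5.0, 5.0], [90.0, 0.0])
print(np.round(vx, 3), np.round(vy, 3))
```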

Next, the categorical columns (position, side, role, etc.) are converted to numeric values. TabNet works with numeric tensors, so here we make sure the whole DataFrame ends up in a format the model can use directly.
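
A minimal sketch of the encoding step follows, assuming a plain integer mapping that is fitted on the training data and reused on test. The helpers `fit_encoder` and `apply_encoder` and the `-1` code for unseen categories are hypothetical choices for the example, not the notebook's `encode_categorical`.

```python
def fit_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def apply_encoder(values, mapping, unknown=-1):
    """Encode values with an existing mapping; unseen categories get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

train_positions = ["WR", "CB", "FS", "WR", "TE"]
test_positions = ["CB", "WR", "K"]   # "K" never appears in the training sample

# Fit on train only, then reuse the same mapping for test
position_mapping = fit_encoder(train_positions)
train_codes = apply_encoder(train_positions, position_mapping)
test_codes = apply_encoder(test_positions, position_mapping)

print(position_mapping)
print(train_codes, test_codes)
```

Reusing the training mapping on test keeps the same integer meaning the same category in both sets, which is why the notebook passes `encoders` back into the second call.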

In [4]:
# ===== PART 2 – Step 2: Feature engineering + categorical encoding =====

print("\nAplicando feature engineering a train y test...")

train_df_fe = engineer_features(train_df)
test_input_df_fe = engineer_features(test_input_df)

print("\nCodificando variables categóricas (player_position, player_side, player_role)...")
train_df_fe, encoders = encode_categorical(train_df_fe)
test_input_df_fe = encode_categorical(test_input_df_fe, encoders)

print("\nShapes después de feature engineering + encoding:")
print(f" train_df_fe:      {train_df_fe.shape}")
print(f" test_input_df_fe: {test_input_df_fe.shape}")



Aplicando feature engineering a train y test...

FEATURE ENGINEERING
✓ Computing velocity components (vx, vy)
✓ Computing orientation components (ox, oy)
✓ Computing ball landing features
✓ Computing field position features
✓ Processing player attributes
✓ Creating temporal features (lags, differences)
✓ Computing rolling statistics

📊 Features created: 62 total columns

FEATURE ENGINEERING
✓ Computing velocity components (vx, vy)
✓ Computing orientation components (ox, oy)
✓ Computing ball landing features
✓ Computing field position features
✓ Processing player attributes
✓ Creating temporal features (lags, differences)
✓ Computing rolling statistics

📊 Features created: 62 total columns

Codificando variables categóricas (player_position, player_side, player_role)...

Shapes después de feature engineering + encoding:
 train_df_fe:      (4880579, 62)
 test_input_df_fe: (49753, 62)


### Build sequences per player and play

The problem is not just a static snapshot of the player but how the player moves while the ball is in the air. So in this block we group the data by play and player and build a fixed-length sequence of frames for each one.

For the training set we also extract the target we want to predict: the player's final coordinates when the play ends. For the test set we do the same without targets, and we keep metadata that later lets us assemble the submission file correctly.
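
The grouping into fixed-length sequences can be sketched as follows. This is not the notebook's `create_sequences`: the helper `last_frames`, the toy `tracks` dictionary, and the choice to pad short sequences by repeating the first frame are all assumptions for illustration.

```python
import numpy as np

def last_frames(frames, sequence_length):
    """Return the last `sequence_length` frames, front-padded by repeating the first frame."""
    frames = np.asarray(frames, dtype=np.float32)
    if len(frames) >= sequence_length:
        return frames[-sequence_length:]
    pad = np.repeat(frames[:1], sequence_length - len(frames), axis=0)
    return np.vstack([pad, frames])

# Toy tracking data: (game_id, play_id, nfl_id) -> per-frame [x, y]
tracks = {
    (1, 10, 100): [[10.0, 20.0], [11.0, 20.5], [12.0, 21.0], [13.0, 21.5]],
    (1, 10, 200): [[40.0, 30.0], [39.0, 30.0]],
}

SEQ_LEN = 3
keys = sorted(tracks)
X_seq = np.stack([last_frames(tracks[k], SEQ_LEN) for k in keys])

# Metadata stays aligned row by row with X_seq, so predictions can be mapped back later
metadata = [{"game_id": g, "play_id": p, "nfl_id": n} for g, p, n in keys]

print(X_seq.shape)
```

Keeping `metadata` in the same order as the rows of `X_seq` is what makes it possible to attach each prediction to the right player when the submission is built.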

In [5]:
# ===== PART 2 – Step 3: Create sequences and convert them to tabular features =====

# Use the same sequence_length defined in CONFIG
SEQ_LEN = CONFIG['sequence_length']
print(f"\nUsando sequence_length = {SEQ_LEN} frames por muestra.")

# Create sequences for TRAIN (includes targets [ball_land_x, ball_land_y])
X_seq_all, X_static_all, y_all, metadata_all = create_sequences(
    train_df_fe,
    sequence_length=SEQ_LEN,
    for_training=True
)

# Create sequences for TEST (no targets)
X_seq_test, X_static_test, _, metadata_test = create_sequences(
    test_input_df_fe,
    sequence_length=SEQ_LEN,
    for_training=False
)

print("\nShapes brutas de secuencias y estáticos:")
print(f" X_seq_all:       {X_seq_all.shape}")
print(f" X_static_all:    {X_static_all.shape}")
print(f" y_all:           {None if y_all is None else y_all.shape}")
print(f" X_seq_test:      {X_seq_test.shape}")
print(f" X_static_test:   {X_static_test.shape}")
print(f" #metadata_all:   {len(metadata_all)}")
print(f" #metadata_test:  {len(metadata_test)}")



Usando sequence_length = 10 frames por muestra.

CREATING SEQUENCES
✓ Created 46,022 sequences
✓ Sequence shape: (46022, 10, 30)
✓ Static shape: (46022, 10)
✓ Target shape: (46022, 2)

CREATING SEQUENCES
✓ Created 1,758 sequences
✓ Sequence shape: (1758, 10, 30)
✓ Static shape: (1758, 10)

Shapes brutas de secuencias y estáticos:
 X_seq_all:       (46022, 10, 30)
 X_static_all:    (46022, 10)
 y_all:           (46022, 2)
 X_seq_test:      (1758, 10, 30)
 X_static_test:   (1758, 10)
 #metadata_all:   46022
 #metadata_test:  1758


### Avoid information leakage and switch to tabular format

Here we make two important adjustments. First, we drop from the static variables those computed from the ball's final position. Leaving those columns in would let the model see information it should not have in practice, which distorts training.

Second, we flatten the sequences into long vectors and join them with the "safe" static features. The result is a purely tabular matrix for training and another for test, which is exactly the kind of input TabNet expects.
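
A small sketch with toy shapes shows where each value lands after flattening. The array sizes and the "leaky" last static column are made up for the example; the point is the column layout produced by `reshape` with NumPy's default row-major order.

```python
import numpy as np

n_samples, seq_len, n_features = 2, 3, 4
X_seq = np.arange(n_samples * seq_len * n_features, dtype=np.float32)
X_seq = X_seq.reshape(n_samples, seq_len, n_features)
X_static = np.array([[1.0, 2.0, 99.0], [3.0, 4.0, 98.0]], dtype=np.float32)

keep_idx = [0, 1]   # drop the last static column, standing in for a leaky feature
X_flat = X_seq.reshape(n_samples, seq_len * n_features)
X_tab = np.hstack([X_flat, X_static[:, keep_idx]])

# With row-major order, frame t / feature f ends up in column t * n_features + f
t, f = 2, 1
print(X_flat[0, t * n_features + f] == X_seq[0, t, f])
print(X_tab.shape)
```

The same arithmetic explains the notebook's final width: 10 frames times 30 sequence features plus 8 static features gives 308 columns.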


In [6]:
# ===== PART 2 – Step 3b: Clean the static features and flatten the sequences =====

# 1) Drop the static columns that leak the label:
# static_features = [
#     0:'player_position', 1:'player_side', 2:'player_role',
#     3:'height_inches', 4:'player_weight', 5:'player_age', 6:'bmi',
#     7:'absolute_yardline_number', 8:'dist_to_ball', 9:'angle_to_ball'
# ]
static_keep_idx = [0, 1, 2, 3, 4, 5, 6, 7]  # keep everything up to 'absolute_yardline_number'

X_static_all_safe = X_static_all[:, static_keep_idx]
X_static_test_safe = X_static_test[:, static_keep_idx]

print("\nShapes de estáticos sin fuga de etiqueta:")
print(f" X_static_all_safe:  {X_static_all_safe.shape}")
print(f" X_static_test_safe: {X_static_test_safe.shape}")

# 2) Flatten the sequential part: (N, SEQ_LEN, n_seq_features) -> (N, SEQ_LEN * n_seq_features)
n_samples, seq_len, n_seq_features = X_seq_all.shape
X_seq_all_flat = X_seq_all.reshape(n_samples, seq_len * n_seq_features)

n_samples_test, seq_len_test, n_seq_features_test = X_seq_test.shape
X_seq_test_flat = X_seq_test.reshape(n_samples_test, seq_len_test * n_seq_features_test)

print("\nShapes de secuencias aplanadas:")
print(f" X_seq_all_flat:   {X_seq_all_flat.shape}")
print(f" X_seq_test_flat:  {X_seq_test_flat.shape}")

# 3) Concatenate the sequential part with the safe static part to form the final TabNet features
import numpy as np  # in case it is not already in the namespace

X_tabnet_all = np.hstack([X_seq_all_flat, X_static_all_safe])
X_tabnet_test = np.hstack([X_seq_test_flat, X_static_test_safe])

# The target is y_all (the player's arrival coordinates, [ball_land_x, ball_land_y])
y_tabnet_all = y_all

print("\nFeatures finales para TabNet:")
print(f" X_tabnet_all:  {X_tabnet_all.shape}")
print(f" y_tabnet_all:  {y_tabnet_all.shape}")
print(f" X_tabnet_test: {X_tabnet_test.shape}")



Shapes de estáticos sin fuga de etiqueta:
 X_static_all_safe:  (46022, 8)
 X_static_test_safe: (1758, 8)

Shapes de secuencias aplanadas:
 X_seq_all_flat:   (46022, 300)
 X_seq_test_flat:  (1758, 300)

Features finales para TabNet:
 X_tabnet_all:  (46022, 308)
 y_tabnet_all:  (46022, 2)
 X_tabnet_test: (1758, 308)


### Split training and validation

In this step we divide the data into two groups: one to train the model and one to validate it. The model only sees the training set while learning, and we hold the validation set out to check how well it generalizes to examples it has not seen before.

This split underlies every metric shown later, and it is also what Optuna uses to decide which hyperparameter combination works best.
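
What a shuffled hold-out split does can be sketched with NumPy alone. The helper `holdout_split` is hypothetical and does not reproduce the exact rows that `train_test_split` selects; it only illustrates the mechanics and the resulting sizes for 46,022 samples with a 20% validation share.

```python
import math
import numpy as np

def holdout_split(X, y, valid_fraction=0.2, seed=42):
    """Shuffle row indices once, then cut them into validation and training parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_valid = math.ceil(len(X) * valid_fraction)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], X[valid_idx], y[train_idx], y[valid_idx]

# Toy data with the same number of rows as the notebook's feature matrix
X_demo = np.arange(46022 * 2, dtype=np.float32).reshape(46022, 2)
y_demo = np.arange(46022)

X_tr, X_va, y_tr, y_va = holdout_split(X_demo, y_demo)
print(X_tr.shape, X_va.shape)
```

Every row ends up in exactly one of the two subsets, so the validation error is measured on rows the model never trained on.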


In [7]:
# ===== PART 2 – Step 4: Split train and validation for TabNet =====

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X_tabnet_all,
    y_tabnet_all,
    test_size=0.2,      # 20% for validation
    random_state=42
)

print("\nSplit train/valid para TabNet:")
print(f" X_train: {X_train.shape}")
print(f" y_train: {y_train.shape}")
print(f" X_valid: {X_valid.shape}")
print(f" y_valid: {y_valid.shape}")



Split train/valid para TabNet:
 X_train: (36817, 308)
 y_train: (36817, 2)
 X_valid: (9205, 308)
 y_valid: (9205, 2)


## Part 3 – Prepare the data for TabNet

Before creating the model, we convert the input and output matrices to `float32`. This is the data type TabNet expects (it is implemented in PyTorch), and it also saves some memory.

Apart from a small loss of precision, the cast does not change the information, only how it is stored internally.
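
The memory effect is easy to check on a toy array. The values below are random and only the column count (308) is borrowed from the notebook; the sketch compares the storage size before and after the cast and measures how far the values move.

```python
import numpy as np

# Random float64 matrix with the same number of columns as the TabNet features
X64 = np.random.default_rng(0).normal(size=(1000, 308))
X32 = X64.astype(np.float32)

# float32 uses 4 bytes per value instead of 8
print(X64.nbytes, X32.nbytes)

# Largest rounding error introduced by the cast
max_abs_diff = float(np.max(np.abs(X64 - X32)))
print(max_abs_diff)
```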

In [15]:
# ===== PART 3 – Step 0: Cast the matrices to float32 for TabNet =====

# TabNet (PyTorch version) works best with float32
X_train_tab = X_train.astype(np.float32)
y_train_tab = y_train.astype(np.float32)
X_valid_tab = X_valid.astype(np.float32)
y_valid_tab = y_valid.astype(np.float32)
X_test_tab  = X_tabnet_test.astype(np.float32)

print("Dtypes -> X_train_tab:", X_train_tab.dtype, " y_train_tab:", y_train_tab.dtype)
print("Shapes -> X_train_tab:", X_train_tab.shape, " X_valid_tab:", X_valid_tab.shape)


Dtypes -> X_train_tab: float32  y_train_tab: float32
Shapes -> X_train_tab: (36817, 308)  X_valid_tab: (9205, 308)


### Bring in TabNet and configure Optuna

This is where we get to the model itself. First we install and import TabNet and Optuna. TabNet is the model we use to predict `x` and `y` from the tabular features we built, and Optuna is the tool that tries different hyperparameter combinations so we do not have to do it by hand.

Then we define the `objective` function, which Optuna calls repeatedly. Each call builds a TabNet with a given set of parameters, trains it for a few epochs, and measures the validation error as the distance between the true and predicted positions. Optuna uses that value to decide which configurations to try next and which one is best.
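
The value returned to Optuna is a mean Euclidean distance. A minimal sketch of that metric is shown below; the helper `euclidean_distances` is a hypothetical stand-in for the notebook's `calculate_euclidean_distance`, whose definition is not shown in this part.

```python
import numpy as np

def euclidean_distances(y_true, y_pred):
    """Per-sample distance between true and predicted (x, y) positions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.sum((y_true - y_pred) ** 2, axis=1))

y_true = [[10.0, 20.0], [30.0, 40.0]]
y_pred = [[13.0, 24.0], [30.0, 45.0]]

dists = euclidean_distances(y_true, y_pred)
mean_dist = float(np.mean(dists))   # the quantity an objective would minimize
print(dists, mean_dist)
```

Unlike the per-coordinate RMSE that TabNet reports during training, this metric treats the error in `x` and `y` as a single distance on the field.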


In [21]:
# ===== PART 3 – Step 1: Install and import TabNet + Optuna =====

# Installation (only needed the first time; you can comment it out afterwards)
!pip install pytorch-tabnet optuna -q

import torch
from pytorch_tabnet.tab_model import TabNetRegressor
import optuna

# For reproducibility
torch.manual_seed(42)
np.random.seed(42)

In [None]:
'''
# ===== PART 3 – Step 2: Define the objective function for Optuna =====

# To keep this stage reasonably fast,
# use only a sample of X_train_tab / y_train_tab during the search.
MAX_TUNING_SAMPLES = 60000  # adjust this value later if needed

n_train = X_train_tab.shape[0]
tuning_size = min(MAX_TUNING_SAMPLES, n_train)

X_tune = X_train_tab[:tuning_size]
y_tune = y_train_tab[:tuning_size]

print(f"Usando {tuning_size} muestras de entrenamiento para la búsqueda de hiperparámetros.")

def objective(trial):
    # Hyperparameters to optimize
    n_steps = trial.suggest_int("n_steps", 3, 8)
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    lambda_sparse = trial.suggest_float("lambda_sparse", 1e-5, 1e-1, log=True)

    # Other TabNet parameters (adjust them if the instructor gave specific values)
    n_d = trial.suggest_int("n_d", 16, 64, step=16)  # width of the decision layers
    n_a = n_d  # width of the attention layers, same as n_d

    model = TabNetRegressor(
        n_d=n_d,
        n_a=n_a,
        n_steps=n_steps,
        gamma=1.5,
        lambda_sparse=lambda_sparse,
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=1e-3),
        mask_type="sparsemax",
        verbose=0
    )

    # Reduced training settings for development
    MAX_EPOCHS = 25      # raise this later
    BATCH_SIZE = 1024

    model.fit(
        X_tune, y_tune,
        eval_set=[(X_valid_tab, y_valid_tab)],
        eval_name=["valid"],
        eval_metric=["rmse"],
        max_epochs=MAX_EPOCHS,
        patience=5,
        batch_size=BATCH_SIZE,
        virtual_batch_size=128,
        num_workers=0,
        drop_last=False
    )

    # Predictions on the validation set
    y_pred_valid = model.predict(X_valid_tab)

    # Metric based on mean Euclidean distance (lower is better)
    dists = calculate_euclidean_distance(y_valid_tab, y_pred_valid)
    mean_dist = float(np.mean(dists))

    # Store the metric in the trial attributes so it can be inspected later
    trial.set_user_attr("mean_valid_distance", mean_dist)

    return mean_dist

'''

### Hyperparameter search

Here we run the Optuna study with a fixed number of trials. Each trial is a TabNet model with a different hyperparameter configuration. At the end, Optuna reports which values worked best according to the validation metric.

Those hyperparameters are saved and used to build the final model, which we train more thoroughly in the next part.


In [None]:
'''
# ===== PART 3 – Step 3: Run the hyperparameter search with Optuna =====

N_TRIALS = 20  # during development; increase later if needed

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=N_TRIALS)

print("\nMejores hiperparámetros encontrados:")
print(study.best_trial.params)
print("Mejor distancia euclidiana media en validación:",
      study.best_trial.user_attrs["mean_valid_distance"])

best_params = study.best_trial.params
'''

In [23]:
# ===== PART 3 – Step 3: Fixed TabNet hyperparameters =====

best_params = {
    "n_steps": 3,
    "lambda_sparse": 0.003694302625583284,
    "n_d": 48,
}

print("Usando hiperparámetros fijos obtenidos previamente con Optuna (fuera de este notebook):")
for k, v in best_params.items():
    print(f"  {k} = {v}")


Usando hiperparámetros fijos obtenidos previamente con Optuna (fuera de este notebook):
  n_steps = 3
  lambda_sparse = 0.003694302625583284
  n_d = 48


## Train the final model with the best parameters

With the values Optuna found we can now define the final TabNet model. We use those hyperparameters to create the model and then train it on the training data while continuing to monitor the error on the validation set.

The idea is that, rather than training an arbitrary configuration, we use one that makes sense for this problem and was selected precisely because it performed well on the validation data.


In [26]:
# ===== PART 4 – Step 1: Define the final TabNet model with the best hyperparameters =====

best_n_steps = best_params["n_steps"]
best_lambda_sparse = best_params["lambda_sparse"]
best_n_d = best_params["n_d"]
best_n_a = best_n_d  # n_a = n_d is the usual choice

print("Mejores hiperparámetros para el modelo final:")
print(f"  n_steps       = {best_n_steps}")
print(f"  lambda_sparse = {best_lambda_sparse}")
print(f"  n_d, n_a      = {best_n_d}")

final_model = TabNetRegressor(
    n_d=best_n_d,
    n_a=best_n_a,
    n_steps=best_n_steps,
    gamma=1.5,
    lambda_sparse=best_lambda_sparse,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-3),
    mask_type="sparsemax",
    verbose=1
)

Mejores hiperparámetros para el modelo final:
  n_steps       = 3
  lambda_sparse = 0.003694302625583284
  n_d, n_a      = 48


In [27]:
# ===== PART 4 – Step 2: Train the final model (up to 150 epochs, with early stopping) =====

MAX_EPOCHS_FINAL = 150
BATCH_SIZE_FINAL = 2048  # large batch to train faster

final_model.fit(
    X_train_tab, y_train_tab,
    eval_set=[(X_valid_tab, y_valid_tab)],
    eval_name=["valid"],
    eval_metric=["rmse"],
    max_epochs=MAX_EPOCHS_FINAL,
    patience=10,
    batch_size=BATCH_SIZE_FINAL,
    virtual_batch_size=256,
    num_workers=0,
    drop_last=False
)


epoch 0  | loss: 2849.19587| valid_rmse: 43.80984115600586|  0:00:01s
epoch 1  | loss: 2706.20674| valid_rmse: 42.120418548583984|  0:00:02s
epoch 2  | loss: 2560.20397| valid_rmse: 43.05683898925781|  0:00:03s
epoch 3  | loss: 2408.42646| valid_rmse: 42.162879943847656|  0:00:04s
epoch 4  | loss: 2243.985| valid_rmse: 41.3678092956543|  0:00:06s
epoch 5  | loss: 2075.42129| valid_rmse: 40.30149841308594|  0:00:07s
epoch 6  | loss: 1896.41846| valid_rmse: 38.74258041381836|  0:00:08s
epoch 7  | loss: 1701.75818| valid_rmse: 36.229949951171875|  0:00:09s
epoch 8  | loss: 1499.78464| valid_rmse: 33.75571823120117|  0:00:10s
epoch 9  | loss: 1291.50736| valid_rmse: 30.810359954833984|  0:00:11s
epoch 10 | loss: 1084.97862| valid_rmse: 27.621980667114258|  0:00:12s
epoch 11 | loss: 889.00873| valid_rmse: 24.670040130615234|  0:00:13s
epoch 12 | loss: 708.48612| valid_rmse: 22.109600067138672|  0:00:14s
epoch 13 | loss: 548.20943| valid_rmse: 19.55084991455078|  0:00:15s
epoch 14 | loss: 41

### Evaluate the model on validation

Once the final model is trained, we test it against the validation set. The evaluation function computes the error in `x` and `y` (RMSE and MAE) as well as the mean Euclidean distance between the true position and the prediction.

These metrics give a clear picture of how well TabNet is working before we use it to predict on the test set that goes to Kaggle.
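
How these numbers relate to each other can be seen on a toy example. The function `summarize_errors` is a hypothetical, reduced version of an evaluation helper, not the notebook's `evaluate_predictions`; it computes per-coordinate RMSE and MAE, the distance summary, and the share of predictions within a few yard thresholds.

```python
import numpy as np

def summarize_errors(y_true, y_pred, thresholds=(1, 5, 10)):
    """Per-axis RMSE/MAE plus distance statistics and accuracy buckets (in percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    dist = np.linalg.norm(err, axis=1)

    summary = {
        "x_rmse": float(np.sqrt(np.mean(err[:, 0] ** 2))),
        "y_rmse": float(np.sqrt(np.mean(err[:, 1] ** 2))),
        "x_mae": float(np.mean(np.abs(err[:, 0]))),
        "y_mae": float(np.mean(np.abs(err[:, 1]))),
        "mean_distance": float(dist.mean()),
        "median_distance": float(np.median(dist)),
    }
    for t in thresholds:
        summary[f"within_{t}"] = float(np.mean(dist <= t) * 100.0)
    return summary

# Toy data: four predictions that miss by 5, 1, 10 and 0 yards
y_true = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
y_pred = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [0.0, 0.0]]

summary = summarize_errors(y_true, y_pred)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The mean distance is pulled up by the single large miss while the median is not, which is the same pattern visible in the validation report, where the mean exceeds the median.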


In [28]:
# ===== PART 4 – Step 3: Evaluate the model on validation =====

y_pred_valid = final_model.predict(X_valid_tab)

metrics_valid = evaluate_predictions(
    y_valid_tab,
    y_pred_valid,
    split_name="Validation (TabNet)"
)

metrics_valid


📊 VALIDATION (TABNET) SET EVALUATION

🎯 POSITION ACCURACY:
   X-coordinate:
      RMSE: 6.786 yards
      MAE:  4.856 yards

   Y-coordinate:
      RMSE: 6.262 yards
      MAE:  4.505 yards

📏 EUCLIDEAN DISTANCE:
   Mean:   7.366 yards
   Median: 6.049 yards
   Std:    5.568 yards
   Min:    0.044 yards
   Max:    38.139 yards

📊 DISTANCE PERCENTILES:
   25th percentile: 3.195 yards
   50th percentile: 6.049 yards
   75th percentile: 10.024 yards
   90th percentile: 14.888 yards
   95th percentile: 18.372 yards
   99th percentile: 25.882 yards

🎯 ACCURACY BUCKETS:
   Within  1 yards:    359 ( 3.90%)
   Within  2 yards:  1,189 (12.92%)
   Within  5 yards:  3,847 (41.79%)
   Within 10 yards:  6,892 (74.87%)
   Within 15 yards:  8,306 (90.23%)
   Within 20 yards:  8,852 (96.17%)


{'x_rmse': 6.78568,
 'y_rmse': 6.262428,
 'x_mae': 4.8563957,
 'y_mae': 4.5050354,
 'mean_distance': 7.366203,
 'median_distance': 6.0489078,
 'distances': array([4.5301323, 4.8788433, 1.5224569, ..., 5.780381 , 1.5988973,
        6.916272 ], dtype=float32)}

### Test predictions and creating the Kaggle file

In this last stretch we use the trained model to predict `x` and `y` on the test set. From those predictions and each example's metadata, we build the `submission_tabnet.csv` file in the format Kaggle requires.

Each row of the file represents one player at a specific frame of a play, with its full identifier and the estimated coordinates. This is the file uploaded to the platform to obtain the competition score.
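
Because the model produces one `(x, y)` pair per player and play, while the file needs one row per frame, the same prediction is repeated across that player's frames. The sketch below uses made-up identifiers and values to show that lookup, including a fallback to the center of a 120 x 53.3 yard field when no prediction exists for a key.

```python
# One prediction per (game_id, play_id, nfl_id); toy identifiers and values
pred_by_player = {(1, 10, 100): (31.7, 16.5)}

# Rows to submit: one per frame; the last player has no prediction
target_rows = [
    (1, 10, 100, 1),
    (1, 10, 100, 2),
    (1, 10, 999, 1),
]

FIELD_CENTER = (60.0, 26.65)  # fallback: midpoint of a 120 x 53.3 yard field

submission_rows = []
for game_id, play_id, nfl_id, frame_id in target_rows:
    # Look up the player-level prediction and reuse it for every frame
    x, y = pred_by_player.get((game_id, play_id, nfl_id), FIELD_CENTER)
    submission_rows.append(
        {"id": f"{game_id}_{play_id}_{nfl_id}_{frame_id}", "x": x, "y": y}
    )

for row in submission_rows:
    print(row)
```

This is why consecutive frames of the same player show identical coordinates in the submission preview.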


In [29]:
# ===== PART 4 – Step 4: TabNet predictions on the test set =====

y_pred_test = final_model.predict(X_test_tab)

print("Shape de predicciones para test:", y_pred_test.shape)
print("Primeras 5 predicciones (x, y):")
print(y_pred_test[:5])

Shape de predicciones para test: (1758, 2)
Primeras 5 predicciones (x, y):
[[29.064602 19.565416]
 [40.05082  28.915821]
 [22.21519  26.289627]
 [28.086645 33.64907 ]
 [30.727045 33.807045]]


In [31]:
# ===== PART 4 – Step 5: Function to build the submission with TabNet =====

def create_submission_tabnet(y_pred_test, test_targets_df, metadata_test, filename="submission_tabnet.csv"):
    """
    Build the Kaggle submission file from the TabNet predictions.
    
    - y_pred_test: np.array of shape (N_test_samples, 2) with [x_pred, y_pred]
    - test_targets_df: DataFrame with columns game_id, play_id, nfl_id, frame_id
    - metadata_test: list of dicts with game_id, play_id, nfl_id (one per row of X_test_tab)
    """
    print("\n" + "="*80)
    print("CREANDO ARCHIVO DE SUBMISIÓN (TabNet)")
    print("="*80)

    # 1) Map each key to its prediction
    pred_dict = {}
    for meta, pred in zip(metadata_test, y_pred_test):
        key = (meta['game_id'], meta['play_id'], meta['nfl_id'])
        pred_dict[key] = {
            "x": float(pred[0]),
            "y": float(pred[1]),
        }

    # 2) Walk through test_targets_df and assign the matching prediction
    submissions = []
    for _, row in test_targets_df.iterrows():
        key = (row['game_id'], row['play_id'], row['nfl_id'])

        if key in pred_dict:
            x_pred = pred_dict[key]["x"]
            y_pred = pred_dict[key]["y"]
        else:
            # fallback if a combination is missing: center of a 120 x 53.3 yard field
            x_pred = 60.0
            y_pred = 26.65

        submissions.append({
            "id": f"{row['game_id']}_{row['play_id']}_{row['nfl_id']}_{row['frame_id']}",
            "x": x_pred,
            "y": y_pred,
        })

    submission_df = pd.DataFrame(submissions)
    submission_df.to_csv(filename, index=False)
    print(f"Archivo de submission guardado como: {filename}")
    print("\nVista previa de las primeras filas:")
    print(submission_df.head())

    return submission_df

In [32]:
# ===== PART 4 – Step 6: Generate submission_tabnet.csv for Kaggle =====

submission_df = create_submission_tabnet(
    y_pred_test=y_pred_test,
    test_targets_df=test_targets_df,
    metadata_test=metadata_test,
    filename="submission_tabnet.csv"
)


CREANDO ARCHIVO DE SUBMISIÓN (TabNet)
Archivo de submission guardado como: submission_tabnet.csv

Vista previa de las primeras filas:
                      id          x          y
0  2024120805_74_54586_1  31.701832  16.473963
1  2024120805_74_54586_2  31.701832  16.473963
2  2024120805_74_54586_3  31.701832  16.473963
3  2024120805_74_54586_4  31.701832  16.473963
4  2024120805_74_54586_5  31.701832  16.473963
