# NBA Spread Prediction Pipeline

**Production-ready blueprint for predicting NBA point spreads and ATS (Against The Spread) outcomes**

## Goal
- Predict the **point spread outcome** (expected margin = home_score - away_score)
- Predict **cover/no-cover** vs the betting spread
- Primary label: `margin_real = home_score - away_score`
- Derived label: `residual = margin_real - closing_spread` (positive => home covers)
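
As a quick illustration of the two labels, here is a minimal sketch on a hypothetical two-game sample. It assumes `closing_spread` is expressed as the market's expected home margin (positive means the home team is favored), which is the convention implied by `residual = margin_real - closing_spread`. The `home_covers` column name is made up for this example.

```python
import pandas as pd

# Hypothetical sample; closing_spread is the market's expected home margin.
sample = pd.DataFrame({
    "home_score": [110, 98],
    "away_score": [102, 105],
    "closing_spread": [5.5, -3.0],
})

# Primary label: realized home margin.
sample["margin_real"] = sample["home_score"] - sample["away_score"]
# Derived label: how far the result landed from the market line.
sample["residual"] = sample["margin_real"] - sample["closing_spread"]
# Home covers when it beats the line (residual > 0).
sample["home_covers"] = sample["residual"] > 0

print(sample)
```

In the first game the home team wins by 8 against a 5.5-point line and covers; in the second it loses by 7 when the line expected a 3-point loss, so it does not.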

## Pipeline Overview
1. Data Loading & Integration
2. Feature Engineering (ELO, recent form, rest days, injuries, market features)
3. Model Training (Baseline → Linear → LightGBM)
4. Time-Series Cross-Validation & Backtesting
5. Model Evaluation & Interpretation
6. Prediction Generation
7. Deployment-Ready Code


In [None]:
# Core dependencies
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Modeling
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import lightgbm as lgb
from joblib import dump, load

# Visualization & interpretability
import matplotlib.pyplot as plt
import seaborn as sns
try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("SHAP not available. Install with: pip install shap")

# Set random seeds for reproducibility
np.random.seed(42)

print("✓ Libraries loaded successfully")


## Part 1: Data Loading & Integration

Load data from the collection notebook or from saved CSV files.
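
The loader below expects at least `date`, `home_team_id`, `away_team_id`, `home_score`, and `away_score`. The following sketch shows that minimal schema on an in-memory CSV; the team codes and scores are invented for illustration, and real files may carry more columns.

```python
import io
import pandas as pd

REQUIRED_COLS = ["date", "home_team_id", "away_team_id", "home_score", "away_score"]

# Hypothetical CSV content, deliberately out of date order.
csv_text = """game_id,date,home_team_id,away_team_id,home_score,away_score
1,2024-01-02,BOS,NYK,112,104
2,2024-01-01,LAL,DEN,99,107
"""

games = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Same validation idea as the loader: report any missing required column.
missing = [c for c in REQUIRED_COLS if c not in games.columns]

# Chronological order matters for every later time-based feature.
games = games.sort_values("date").reset_index(drop=True)
games["margin"] = games["home_score"] - games["away_score"]

print(missing)
print(games)
```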


In [None]:
def load_game_data(file_path: str = None) -> pd.DataFrame:
    """
    Load game data from CSV or use data from collection notebook
    """
    if file_path:
        df = pd.read_csv(file_path, parse_dates=['date'])
    else:
        # Try to load from collection notebook output (check multiple possible files)
        for filename in ['nba_games.csv', 'nba_games_test.csv']:
            try:
                df = pd.read_csv(filename, parse_dates=['date'])
                print(f"✓ Loaded data from {filename}")
                break
            except FileNotFoundError:
                continue
        else:
            print("⚠️  No game data file found. Run the data collection notebook first.")
            print("   Or provide a CSV with columns: game_id, date, home_team_id, away_team_id,")
            print("   home_score, away_score, home_team_name, away_team_name")
            return pd.DataFrame()
    
    # Ensure required columns exist
    required_cols = ['date', 'home_team_id', 'away_team_id', 'home_score', 'away_score']
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        print(f"⚠️  Missing required columns: {missing}")
        return pd.DataFrame()
    
    df = df.sort_values('date').reset_index(drop=True)
    df['margin'] = df['home_score'] - df['away_score']
    
    return df

# Load data (modify path as needed)
games_df = load_game_data()

if len(games_df) > 0:
    print(f"✓ Loaded {len(games_df)} games")
    print(f"  Date range: {games_df['date'].min()} to {games_df['date'].max()}")
    print(f"\nSample data:")
    print(games_df[['date', 'home_team_name', 'away_team_name', 'home_score', 'away_score', 'margin']].head())
else:
    print("⚠️  No data loaded. Please run data collection first or provide data file.")


## Part 2: Feature Engineering

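This is the heart of the model. Features are built at several granularities: team strength (ELO ratings), recent form (rolling windows over the last 3, 5, and 10 games), schedule (rest days and back-to-backs), and game context (injuries, market lines, calendar).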
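
One of the features built in this part is an Elo rating. The sketch below isolates the logistic expected-score formula and the zero-sum update that the `ELORating` class relies on. The function names `expected_score` and `elo_update` are hypothetical helpers for this example, not part of the notebook.

```python
def expected_score(team_rating, opp_rating, home_advantage=0.0):
    """Logistic Elo expectation for the (home) team, on a 0-1 scale."""
    return 1.0 / (1.0 + 10 ** ((opp_rating - team_rating - home_advantage) / 400))


def elo_update(team_rating, opp_rating, actual, k_factor=20, home_advantage=0.0):
    """Return both new ratings; whatever one team gains, the other loses."""
    delta = k_factor * (actual - expected_score(team_rating, opp_rating, home_advantage))
    return team_rating + delta, opp_rating - delta


# Two evenly rated teams on a neutral floor: expectation is exactly 0.5.
even = expected_score(1500, 1500)

# The same matchup with 70 Elo points of home advantage tilts toward home.
with_home = expected_score(1500, 1500, home_advantage=70)

# A decisive home win (actual = 1.0) moves 10 points from away to home.
new_home, new_away = elo_update(1500, 1500, actual=1.0)

print(even, round(with_home, 3), new_home, new_away)
```

Because the update is zero-sum, the league-wide average rating stays at the initial value.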


In [None]:
class ELORating:
    """
    Simple ELO rating system for NBA teams
    Updates after each game using the result
    """
    def __init__(self, initial_rating=1500, k_factor=20):
        self.initial_rating = initial_rating
        self.k_factor = k_factor
        self.ratings = {}
    
    def get_rating(self, team_id, date=None):
        """Get team rating at a specific date (or current if date=None)"""
        if team_id not in self.ratings:
            return self.initial_rating
        if date is None:
            return self.ratings[team_id][-1][1]  # Latest rating
        # Find rating at or before date
        for d, rating in reversed(self.ratings[team_id]):
            if d <= date:
                return rating
        return self.initial_rating
    
    def update(self, team_id, opponent_id, margin, date, home_advantage=0):
        """
        Update ELO after a game
        margin = home_score - away_score (from home team's perspective)
        """
        if team_id not in self.ratings:
            self.ratings[team_id] = []
        if opponent_id not in self.ratings:
            self.ratings[opponent_id] = []
        
        # Get current ratings
        team_rating = self.get_rating(team_id, date)
        opp_rating = self.get_rating(opponent_id, date)
        
        # Expected score (sigmoid)
        expected = 1 / (1 + 10 ** ((opp_rating - team_rating - home_advantage) / 400))
        
        # Actual score: map margin linearly onto a 0-1 scale, saturating at +/-10 points
        # Margin of +5 = win by 5, treated as 0.75; +10 or more is capped at 1.0
        # Margin of -5 = loss by 5, treated as 0.25; -10 or less is capped at 0.0
        actual = 0.5 + np.clip(margin / 20, -0.5, 0.5)
        
        # Update ratings
        new_team_rating = team_rating + self.k_factor * (actual - expected)
        new_opp_rating = opp_rating + self.k_factor * (expected - actual)
        
        # Store with date
        if not self.ratings[team_id] or self.ratings[team_id][-1][0] < date:
            self.ratings[team_id].append((date, new_team_rating))
        else:
            self.ratings[team_id][-1] = (date, new_team_rating)
            
        if not self.ratings[opponent_id] or self.ratings[opponent_id][-1][0] < date:
            self.ratings[opponent_id].append((date, new_opp_rating))
        else:
            self.ratings[opponent_id][-1] = (date, new_opp_rating)

print("✓ ELO Rating class defined")


In [None]:
def calculate_rest_days(games_df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate rest days for each team before each game
    Returns DataFrame with rest_days_home and rest_days_away
    """
    df = games_df.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    
    # Track each team's most recent game date, regardless of whether it was
    # played at home or away
    last_game = {}
    
    rest_days_home = []
    rest_days_away = []
    is_back_to_back_home = []
    is_back_to_back_away = []
    
    for idx, row in df.iterrows():
        home_id = str(row['home_team_id'])
        away_id = str(row['away_team_id'])
        game_date = row['date']
        
        # Home team rest (calendar days since its previous game)
        if home_id in last_game:
            rest = (game_date - last_game[home_id]).days
            rest_days_home.append(rest)
            is_back_to_back_home.append(rest == 1)  # played the previous day
        else:
            rest_days_home.append(3)  # Default for first game
            is_back_to_back_home.append(False)
        
        # Away team rest (calendar days since its previous game)
        if away_id in last_game:
            rest = (game_date - last_game[away_id]).days
            rest_days_away.append(rest)
            is_back_to_back_away.append(rest == 1)  # played the previous day
        else:
            rest_days_away.append(3)  # Default for first game
            is_back_to_back_away.append(False)
        
        # Update last game dates only after both teams' features are recorded,
        # so each game sees the previous game's date, not its own
        last_game[home_id] = game_date
        last_game[away_id] = game_date
    
    df['rest_days_home'] = rest_days_home
    df['rest_days_away'] = rest_days_away
    df['rest_diff'] = df['rest_days_home'] - df['rest_days_away']
    df['is_home_back_to_back'] = is_back_to_back_home
    df['is_away_back_to_back'] = is_back_to_back_away
    
    return df

print("✓ Rest days calculation function defined")


In [None]:
def calculate_rolling_features(games_df: pd.DataFrame, windows=[3, 5, 10]) -> pd.DataFrame:
    """
    Calculate rolling window features for each team
    Features: avg margin, avg points for, avg points against, std of margin
    """
    df = games_df.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    
    # Initialize feature columns
    for window in windows:
        df[f'home_margin_avg_{window}'] = np.nan
        df[f'away_margin_avg_{window}'] = np.nan
        df[f'home_points_for_avg_{window}'] = np.nan
        df[f'away_points_for_avg_{window}'] = np.nan
        df[f'home_points_against_avg_{window}'] = np.nan
        df[f'away_points_against_avg_{window}'] = np.nan
        df[f'home_margin_std_{window}'] = np.nan
        df[f'away_margin_std_{window}'] = np.nan
    
    # Track game history per team
    team_history = {}
    
    for idx, row in df.iterrows():
        home_id = str(row['home_team_id'])
        away_id = str(row['away_team_id'])
        game_date = row['date']
        
        # Initialize team history if needed
        if home_id not in team_history:
            team_history[home_id] = []
        if away_id not in team_history:
            team_history[away_id] = []
        
        # Calculate rolling features for home team (using past games only)
        home_past_games = [g for g in team_history[home_id] if g['date'] < game_date]
        if len(home_past_games) > 0:
            home_margins = [g['margin'] for g in home_past_games]
            home_scores = [g['score'] for g in home_past_games]
            home_opp_scores = [g['opp_score'] for g in home_past_games]
            
            for window in windows:
                if len(home_past_games) >= window:
                    recent = home_past_games[-window:]
                    recent_margins = [g['margin'] for g in recent]
                    recent_scores = [g['score'] for g in recent]
                    recent_opp_scores = [g['opp_score'] for g in recent]
                    
                    df.loc[idx, f'home_margin_avg_{window}'] = np.mean(recent_margins)
                    df.loc[idx, f'home_points_for_avg_{window}'] = np.mean(recent_scores)
                    df.loc[idx, f'home_points_against_avg_{window}'] = np.mean(recent_opp_scores)
                    df.loc[idx, f'home_margin_std_{window}'] = np.std(recent_margins) if len(recent_margins) > 1 else 0
        
        # Calculate rolling features for away team (using past games only)
        away_past_games = [g for g in team_history[away_id] if g['date'] < game_date]
        if len(away_past_games) > 0:
            for window in windows:
                if len(away_past_games) >= window:
                    recent = away_past_games[-window:]
                    # team_history already stores every game from the team's own
                    # perspective, so no sign flip or score swap is needed here
                    recent_margins = [g['margin'] for g in recent]
                    recent_scores = [g['score'] for g in recent]
                    recent_opp_scores = [g['opp_score'] for g in recent]
                    
                    df.loc[idx, f'away_margin_avg_{window}'] = np.mean(recent_margins)
                    df.loc[idx, f'away_points_for_avg_{window}'] = np.mean(recent_scores)
                    df.loc[idx, f'away_points_against_avg_{window}'] = np.mean(recent_opp_scores)
                    df.loc[idx, f'away_margin_std_{window}'] = np.std(recent_margins) if len(recent_margins) > 1 else 0
        
        # Update team history AFTER calculating features (to avoid data leakage)
        # Store from home team's perspective
        team_history[home_id].append({
            'date': game_date,
            'margin': row['margin'],
            'score': row['home_score'],
            'opp_score': row['away_score']
        })
        # Store from away team's perspective (they were away)
        team_history[away_id].append({
            'date': game_date,
            'margin': -row['margin'],  # Negative from their perspective
            'score': row['away_score'],
            'opp_score': row['home_score']
        })
    
    return df

print("✓ Rolling features calculation function defined")


In [None]:
def build_features(games_df: pd.DataFrame, elo: ELORating = None, include_spread: bool = True, include_injuries: bool = True) -> pd.DataFrame:
    """
    Main feature engineering function
    Combines all feature types into a single DataFrame
    """
    if len(games_df) == 0:
        return pd.DataFrame()
    
    df = games_df.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    
    # Filter to completed games with scores
    df = df[(df['home_score'].notna()) & (df['away_score'].notna())].copy()
    
    if len(df) == 0:
        return pd.DataFrame()
    
    # 1. Calculate rest days
    df = calculate_rest_days(df)
    
    # 2. Calculate rolling features
    df = calculate_rolling_features(df, windows=[3, 5, 10])
    
    # 3. ELO ratings (if provided)
    if elo is None:
        # Initialize and build ELO from scratch
        elo = ELORating()
        for idx, row in df.iterrows():
            home_id = str(row['home_team_id'])
            away_id = str(row['away_team_id'])
            margin = row['margin']
            game_date = row['date']
            # Update ratings with this game's result; the pre-game ratings used
            # as features are read back from the rating history below
            elo.update(home_id, away_id, margin, game_date, home_advantage=70)  # 70 Elo points, about 2.8 points at 25 Elo per point
    
    # Add ELO features (using ratings BEFORE each game)
    elo_home_list = []
    elo_away_list = []
    for idx, row in df.iterrows():
        home_id = str(row['home_team_id'])
        away_id = str(row['away_team_id'])
        game_date = row['date']
        # Get rating just before this game
        elo_home_list.append(elo.get_rating(home_id, game_date - timedelta(days=1)))
        elo_away_list.append(elo.get_rating(away_id, game_date - timedelta(days=1)))
    
    df['elo_home'] = elo_home_list
    df['elo_away'] = elo_away_list
    df['elo_diff'] = df['elo_home'] - df['elo_away']
    
    # 4. Home court advantage (binary)
    df['home_court'] = 1  # All games in our data are home/away (not neutral)
    
    # 5. Temporal features
    df['day_of_week'] = df['date'].dt.dayofweek
    df['month'] = df['date'].dt.month
    df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
    
    # 6. Injury features (if available and enabled)
    if include_injuries:
        injury_cols = [c for c in df.columns if any(x in c for x in [
            'injury', 'players_out', 'star_out', 'players_dtd'
        ])]
        if injury_cols:
            print(f"✓ Using {len(injury_cols)} injury features from dataset")
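            # Create injury interaction features
            if 'injury_severity_diff' in df.columns:
                df['injury_x_elo_diff'] = df['injury_severity_diff'] * df['elo_diff']
                # spread_move is only assigned in step 8, so derive the line move
                # here when both spreads are present; otherwise treat it as 0
                if 'closing_spread' in df.columns and 'opening_spread' in df.columns:
                    line_move = df['closing_spread'] - df['opening_spread']
                else:
                    line_move = 0
                df['injury_beyond_spread'] = df['injury_severity_diff'] - (line_move * 0.5)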
        else:
            print("⚠️  No injury features found in dataset. Run add_injuries_to_pipeline.py first.")
            # Initialize with zeros
            injury_cols_default = [
                'players_out_home', 'players_out_away', 'players_dtd_home', 'players_dtd_away',
                'injury_severity_home', 'injury_severity_away', 'star_out_home', 'star_out_away',
                'total_injury_impact_home', 'total_injury_impact_away',
                'injury_severity_diff', 'players_out_diff', 'total_injury_impact_diff'
            ]
            for col in injury_cols_default:
                if col not in df.columns:
                    df[col] = 0.0
    
    # 7. Interaction features
    df['elo_diff_x_rest_diff'] = df['elo_diff'] * df['rest_diff']
    df['home_margin_avg_5_x_elo_home'] = df['home_margin_avg_5'] * df['elo_home']
    
    # 8. Market features (if available)
    # closing_spread is kept in expected-home-margin terms (positive => home favored),
    # the convention that residual = margin - closing_spread relies on
    if 'closing_spread' not in df.columns:
        if include_spread:
            # Simulate closing spread as ELO-based estimate (for demo)
            df['closing_spread'] = (df['elo_home'] - df['elo_away']) / 25  # Rough conversion
            print("⚠️  No closing_spread column found. Using ELO-based estimate.")
        else:
            df['closing_spread'] = 0
    
    if 'opening_spread' in df.columns:
        df['spread_move'] = df['closing_spread'] - df['opening_spread']
    else:
        df['spread_move'] = 0
    
    # 9. Target variables
    df['margin'] = df['home_score'] - df['away_score']
    df['residual'] = df['margin'] - df['closing_spread']
    # Pushes (residual == 0) are counted as no-cover
    df['cover'] = (df['residual'] > 0).astype(int)
    
    # 10. Fill NaN values in rolling features (early season games)
    rolling_cols = [c for c in df.columns if any(x in c for x in ['avg_', 'std_'])]
    for col in rolling_cols:
        df[col] = df[col].fillna(0)
    
    return df

print("✓ Feature engineering function defined")


## Part 3: Build Feature Dataset

Apply feature engineering to the game data.
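
The rolling features applied here only use games played strictly before the game being described. A compact way to see that rule is the vectorized sketch below, which shifts each team's history by one game before taking the rolling mean. It is an alternative illustration on hypothetical data in long (one row per team-game) format, not the notebook's own loop-based implementation.

```python
import pandas as pd

# Hypothetical team-level history: one row per team per game.
team_games = pd.DataFrame({
    "team_id": ["A", "A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-05", "2024-01-07",
        "2024-01-02", "2024-01-04", "2024-01-06",
    ]),
    "margin": [4, -2, 10, 6, -5, 3, 1],
}).sort_values(["team_id", "date"]).reset_index(drop=True)

# shift(1) drops the current game from its own window, so the feature for
# game N is built from games N-3..N-1 only (no leakage of the result).
team_games["margin_avg_3"] = (
    team_games.groupby("team_id")["margin"]
    .transform(lambda s: s.shift(1).rolling(3).mean())
)

print(team_games)
```

Only team A's fourth game has three prior games, so it is the only row with a value: the mean of 4, -2, and 10. Every other row stays NaN until enough history accumulates.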


In [None]:
# Build features
if len(games_df) > 0:
    # Check if injury features exist in dataset
    has_injuries = any('injury' in c.lower() or 'players_out' in c.lower() for c in games_df.columns)
    
    features_df = build_features(games_df, include_spread=True, include_injuries=has_injuries)
    
    if len(features_df) > 0:
        print(f"✓ Built features for {len(features_df)} games")
        
        # Check for multicollinearity if injuries are present
        injury_cols = [c for c in features_df.columns if any(x in c for x in ['injury', 'players_out', 'star_out'])]
        spread_cols = [c for c in features_df.columns if 'spread' in c.lower() or 'elo_diff' in c]
        
        if injury_cols and spread_cols:
            print(f"\n[Multicollinearity Check]")
            print(f"  Injury features: {len(injury_cols)}")
            print(f"  Spread-related features: {len(spread_cols)}")
            
            # Calculate correlations
            try:
                from injury_features import analyze_injury_correlations, check_multicollinearity
                corr_matrix = analyze_injury_correlations(features_df)
                if not corr_matrix.empty:
                    print(f"\n  Correlation Matrix (sample):")
                    print(corr_matrix.round(3).head())
                    
                    multicoll_results = check_multicollinearity(features_df, threshold=0.8)
                    if multicoll_results['warnings']:
                        print(f"\n  ⚠️  High correlations detected:")
                        for warning in multicoll_results['warnings'][:5]:  # Show first 5
                            print(f"    {warning}")
                    else:
                        print(f"\n  ✓ No high correlations (threshold: 0.8)")
            except ImportError:
                print("  (Multicollinearity analysis available via injury_features module)")
        
        feature_cols = [c for c in features_df.columns if c not in [
            'game_id', 'date', 'home_team_id', 'away_team_id', 'home_team_name', 
            'away_team_name', 'home_score', 'away_score', 'margin', 'residual', 
            'cover', 'closing_spread', 'opening_spread', 'spread_move', 'status',
            'completed', 'venue', 'attendance', 'name', 'short_name'
        ]]
        print(f"\nTotal feature columns: {len(feature_cols)}")
        
        print(f"\nSample features:")
        display_cols = ['date', 'home_team_name', 'away_team_name', 'elo_home', 'elo_away', 
                       'elo_diff', 'rest_diff', 'margin', 'closing_spread', 'residual']
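        if injury_cols:
            # Only show injury columns that actually exist, to avoid a KeyError
            display_cols.extend(c for c in ['injury_severity_diff', 'players_out_diff'] if c in features_df.columns)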
        print(features_df[display_cols].head(10))
    else:
        print("⚠️  No features generated. Check data quality.")
else:
    print("⚠️  No game data available. Please load data first.")
    features_df = pd.DataFrame()


## Part 4: Model Training Pipeline

Implement baseline → linear → LightGBM models with time-series cross-validation.
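
To make the time-series cross-validation concrete, this sketch prints the folds that scikit-learn's `TimeSeriesSplit` produces on eight hypothetical, chronologically sorted games. The array `X_demo` is a stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight hypothetical games, already sorted by date (row index = time order).
X_demo = np.arange(8).reshape(-1, 1)

folds = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X_demo):
    # The training window expands; validation games always come later.
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print(f"train={train_idx.tolist()} val={val_idx.tolist()}")
```

Each fold validates on games that occur after everything it trained on, which is why the feature DataFrame must be sorted by date before splitting.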


In [None]:
def prepare_model_data(features_df: pd.DataFrame, target: str = 'residual'):
    """
    Prepare data for modeling
    Returns: X (features), y (target), feature_names, metadata
    """
    if len(features_df) == 0:
        return None, None, [], pd.DataFrame()
    
    # Exclude non-feature columns
    exclude_cols = [
        'game_id', 'date', 'home_team_id', 'away_team_id', 'home_team_name', 
        'away_team_name', 'home_score', 'away_score', 'margin', 'residual', 
        'cover', 'closing_spread', 'opening_spread', 'spread_move', 'status',
        'completed', 'venue', 'attendance', 'name', 'short_name', 'point_differential'
    ]
    
    feature_cols = [c for c in features_df.columns if c not in exclude_cols]
    
    # Work on a copy so the caller's DataFrame is not modified
    data = features_df[feature_cols].copy()
    
    # Keep numeric and boolean columns; coerce object columns to numeric
    # (text that cannot be parsed becomes NaN and is zero-filled below)
    numeric_cols = []
    for col in feature_cols:
        if data[col].dtype == 'object':
            data[col] = pd.to_numeric(data[col], errors='coerce')
        if pd.api.types.is_numeric_dtype(data[col]):
            numeric_cols.append(col)
    
    feature_cols = numeric_cols
    
    # Create feature matrix (cast to float so boolean flags become 0/1)
    X = data[feature_cols].fillna(0).astype(float).values
    y = features_df[target].values
    
    # Metadata for tracking
    metadata = features_df[['date', 'home_team_name', 'away_team_name', 'margin', 'closing_spread', 'residual']].copy()
    
    return X, y, feature_cols, metadata

print("✓ Data preparation function defined")


In [None]:
def evaluate_model(y_true, y_pred, metadata=None, model_name=""):
    """
    Comprehensive model evaluation
    """
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    
    # Betting-relevant metrics (stay None when no spread metadata is supplied)
    cover_accuracy = None
    avg_edge = None
    if metadata is not None and 'closing_spread' in metadata.columns:
        # Predicted margin = predicted residual + closing spread
        pred_margin = y_pred + metadata['closing_spread'].values
        true_margin = metadata['margin'].values
        
        # Cover prediction accuracy
        pred_cover = (y_pred > 0).astype(int)
        true_cover = (metadata['residual'].values > 0).astype(int)
        cover_accuracy = (pred_cover == true_cover).mean()
        
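        # "Average edge" here is the mean signed error of the predicted residual
        # (prediction minus actual), i.e. a bias measure rather than a profit estimate
        avg_edge = np.mean(y_pred - metadata['residual'].values)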
        
        print(f"\n{'='*60}")
        print(f"{model_name} - Evaluation Metrics")
        print(f"{'='*60}")
        print(f"MAE (residual): {mae:.3f}")
        print(f"RMSE (residual): {rmse:.3f}")
        print(f"R² (residual): {r2:.3f}")
        print(f"\nBetting Metrics:")
        print(f"Cover Accuracy: {cover_accuracy:.1%}")
        print(f"Average Edge: {avg_edge:.3f}")
        print(f"MAE (margin): {mean_absolute_error(true_margin, pred_margin):.3f}")
    else:
        print(f"\n{model_name} - MAE: {mae:.3f}, RMSE: {rmse:.3f}, R²: {r2:.3f}")
    
    return {
        'mae': mae,
        'rmse': rmse,
        'r2': r2,
        'cover_accuracy': cover_accuracy if metadata is not None else None,
        'avg_edge': avg_edge if metadata is not None else None
    }

print("✓ Evaluation function defined")


### Model 1: Baseline (Closing Spread)
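
The baseline assumes the market line is the best available forecast, so its predicted residual is always zero and its MAE is just the mean absolute residual. A minimal sketch on hypothetical residuals:

```python
import numpy as np

# Hypothetical residuals (actual margin minus closing spread) for four games.
residual = np.array([2.5, -4.0, 1.0, -0.5])

# Baseline prediction: the closing spread is exactly right, so residual = 0.
baseline_pred = np.zeros_like(residual)

# With a zero prediction, MAE reduces to the mean absolute residual.
baseline_mae = np.mean(np.abs(residual - baseline_pred))
print(baseline_mae)
```

Any later model has to beat this number on held-out games to show it adds information beyond the line.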


In [None]:
# Baseline: predicted residual = 0 (i.e., margin = closing_spread)
if len(features_df) > 0:
    # Use closing spread as prediction (residual = 0)
    baseline_pred = np.zeros(len(features_df))
    baseline_metrics = evaluate_model(
        features_df['residual'].values, 
        baseline_pred, 
        features_df,
        "Baseline (Closing Spread)"
    )
else:
    print("⚠️  No data available for baseline model")


### Model 2: Linear Model (Ridge Regression)


In [None]:
def train_linear_model(X_train, y_train, X_val, y_val, alpha=1.0):
    """
    Train Ridge regression model
    """
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    
    model = Ridge(alpha=alpha)
    model.fit(X_train_scaled, y_train)
    
    pred_train = model.predict(X_train_scaled)
    pred_val = model.predict(X_val_scaled)
    
    return model, scaler, pred_train, pred_val

if len(features_df) > 0:
    # Prepare data
    X, y, feature_names, metadata = prepare_model_data(features_df, target='residual')
    
    if X is not None and len(X) > 100:
        # Time-series split
        tscv = TimeSeriesSplit(n_splits=3)
        linear_maes = []
        linear_models = []
        
        for train_idx, val_idx in tscv.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]
            
            model, scaler, _, pred_val = train_linear_model(X_train, y_train, X_val, y_val, alpha=1.0)
            mae = mean_absolute_error(y_val, pred_val)
            linear_maes.append(mae)
            linear_models.append((model, scaler))
        
        print(f"\nLinear Model CV MAE: {np.mean(linear_maes):.3f} ± {np.std(linear_maes):.3f}")
        
        # Refit the scaler and model on all available games for downstream use
        final_scaler = StandardScaler().fit(X)
        final_linear_model = Ridge(alpha=1.0).fit(final_scaler.transform(X), y)
        print("✓ Linear model trained")
    else:
        print("⚠️  Insufficient data for linear model")
        final_linear_model, final_scaler = None, None
else:
    print("⚠️  No data available")
    final_linear_model, final_scaler = None, None


### Model 3: LightGBM (Main Model)


In [None]:
def train_lightgbm(X_train, y_train, X_val, y_val, feature_names, params=None):
    """
    Train LightGBM model with early stopping
    """
    if params is None:
        params = {
            "objective": "regression",
            "metric": "mae",
            "boosting_type": "gbdt",
            "learning_rate": 0.05,
            "num_leaves": 31,
            "feature_fraction": 0.8,
            "bagging_fraction": 0.8,
            "bagging_freq": 5,
            "min_child_samples": 20,
            "verbose": -1,
            "seed": 42
        }
    
    dtrain = lgb.Dataset(X_train, label=y_train, feature_name=feature_names)
    dval = lgb.Dataset(X_val, label=y_val, reference=dtrain, feature_name=feature_names)
    
    model = lgb.train(
        params,
        dtrain,
        num_boost_round=1000,
        valid_sets=[dval],
        valid_names=['val'],
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)]
    )
    
    pred_train = model.predict(X_train, num_iteration=model.best_iteration)
    pred_val = model.predict(X_val, num_iteration=model.best_iteration)
    
    return model, pred_train, pred_val

if len(features_df) > 0 and X is not None and len(X) > 100:
    # Time-series cross-validation for LightGBM
    tscv = TimeSeriesSplit(n_splits=3)
    lgb_maes = []
    lgb_models = []
    lgb_feature_importances = []
    
    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        meta_val = metadata.iloc[val_idx]
        
        model, _, pred_val = train_lightgbm(X_train, y_train, X_val, y_val, feature_names)
        
        mae = mean_absolute_error(y_val, pred_val)
        lgb_maes.append(mae)
        lgb_models.append(model)
        lgb_feature_importances.append(model.feature_importance(importance_type='gain'))
        
        print(f"Fold {fold+1} MAE: {mae:.3f}")
    
    print(f"\nLightGBM CV MAE: {np.mean(lgb_maes):.3f} ± {np.std(lgb_maes):.3f}")
    
    # Retrain on full dataset
    # Use average best iteration from CV (at least one boosting round)
    avg_best_iter = max(1, int(np.mean([m.best_iteration for m in lgb_models])))
    
    final_lgb_model = lgb.train(
        {
            "objective": "regression",
            "metric": "mae",
            "boosting_type": "gbdt",
            "learning_rate": 0.05,
            "num_leaves": 31,
            "feature_fraction": 0.8,
            "bagging_fraction": 0.8,
            "bagging_freq": 5,
            "min_child_samples": 20,
            "verbose": -1,
            "seed": 42
        },
        lgb.Dataset(X, label=y, feature_name=feature_names),
        num_boost_round=avg_best_iter
    )
    
    print("✓ LightGBM model trained")
    
    # Feature importance
    feature_importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': final_lgb_model.feature_importance(importance_type='gain')
    }).sort_values('importance', ascending=False)
    
    print(f"\nTop 10 Most Important Features:")
    print(feature_importance_df.head(10))
    
else:
    print("⚠️  Insufficient data for LightGBM")
    final_lgb_model = None
    feature_importance_df = pd.DataFrame()


## Part 5: Model Evaluation & Backtesting
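
The walk-forward backtest in this part trains on every game before a cut-off and tests on the next block of games, then moves the cut-off forward. The sketch below reproduces only the window arithmetic; `walk_forward_windows` is a hypothetical helper written for this example.

```python
def walk_forward_windows(n_games, min_train_games=200, test_size=50):
    """Return (test_start, test_end) pairs; training data is rows [0, test_start)."""
    windows = []
    for test_start in range(min_train_games, n_games, test_size):
        test_end = min(test_start + test_size, n_games)  # last window may be short
        windows.append((test_start, test_end))
    return windows


# 330 hypothetical games: two full test windows and one partial window.
windows = walk_forward_windows(330)
for start, end in windows:
    print(f"train on games [0, {start}), test on games [{start}, {end})")
```

Since each training set ends where its test window begins, no test game is ever seen during fitting, provided the rows are in date order.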


In [None]:
def walk_forward_backtest(features_df, model_func, min_train_games=200, test_size=50):
    """
    Walk-forward backtesting: train on past, test on future
    Simulates real-world deployment
    """
    df = features_df.sort_values('date').reset_index(drop=True)
    
    results = []
    test_predictions = []
    
    # Start testing after minimum training games
    for test_start in range(min_train_games, len(df), test_size):
        test_end = min(test_start + test_size, len(df))
        
        train = df.iloc[:test_start]
        test = df.iloc[test_start:test_end]
        
        if len(test) == 0:
            break
        
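        # Prepare data (feature names come from this window, not from a global)
        X_train, y_train, fold_features, _ = prepare_model_data(train, target='residual')
        X_test, y_test, _, meta_test = prepare_model_data(test, target='residual')
        
        if X_train is None or X_test is None or len(X_train) < 50:
            continue
        
        # Hold out the most recent training games for early stopping so the
        # test window never influences the number of boosting rounds
        n_val = max(10, len(X_train) // 5)
        X_fit, y_fit = X_train[:-n_val], y_train[:-n_val]
        X_es, y_es = X_train[-n_val:], y_train[-n_val:]
        
        # Train model with the supplied training function
        try:
            model, _, _ = model_func(X_fit, y_fit, X_es, y_es, fold_features)
            pred_test = model.predict(X_test, num_iteration=model.best_iteration)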
            
            # Evaluate
            mae = mean_absolute_error(y_test, pred_test)
            rmse = np.sqrt(mean_squared_error(y_test, pred_test))
            
            # Betting metrics
            pred_margin = pred_test + meta_test['closing_spread'].values
            true_margin = meta_test['margin'].values
            pred_cover = (pred_test > 0).astype(int)
            true_cover = (meta_test['residual'].values > 0).astype(int)
            cover_acc = (pred_cover == true_cover).mean()
            
            results.append({
                'test_start': test_start,
                'test_end': test_end,
                'n_games': len(test),
                'mae': mae,
                'rmse': rmse,
                'cover_accuracy': cover_acc
            })
            
            # Store predictions
            for idx, (_, row) in enumerate(meta_test.iterrows()):
                test_predictions.append({
                    'date': row['date'],
                    'home_team': row['home_team_name'],
                    'away_team': row['away_team_name'],
                    'true_margin': row['margin'],
                    'closing_spread': row['closing_spread'],
                    'pred_residual': pred_test[idx],
                    'pred_margin': pred_margin[idx],
                    'true_cover': true_cover[idx],
                    'pred_cover': pred_cover[idx]
                })
            
            print(f"Backtest window {test_start}-{test_end}: MAE={mae:.3f}, Cover Acc={cover_acc:.1%}")
            
        except Exception as e:
            print(f"Error in backtest window {test_start}-{test_end}: {e}")
            continue
    
    return pd.DataFrame(results), pd.DataFrame(test_predictions)

if len(features_df) > 200:
    print("Running walk-forward backtest...")
    backtest_results, backtest_preds = walk_forward_backtest(features_df, train_lightgbm)
    
    if len(backtest_results) > 0:
        print(f"\n{'='*60}")
        print("Backtest Summary")
        print(f"{'='*60}")
        print(f"Total test windows: {len(backtest_results)}")
        print(f"Average MAE: {backtest_results['mae'].mean():.3f}")
        print(f"Average Cover Accuracy: {backtest_results['cover_accuracy'].mean():.1%}")
        print(f"\nBacktest Results DataFrame:")
        print(backtest_results)
else:
    print("⚠️  Insufficient data for backtesting (need >200 games)")
    backtest_results, backtest_preds = pd.DataFrame(), pd.DataFrame()
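
Cover accuracy alone does not say whether the backtest would have made money. The sketch below turns per-game predictions into a push-aware win/loss record and a flat-stake profit figure. It is an illustration, not part of the pipeline above: the function name `ats_summary`, the one-unit stake, and the -110 default price are all assumptions.

```python
def ats_summary(pred_residual, true_residual, american_odds=-110):
    """Push-aware cover accuracy and flat-stake profit (in units).

    Bets the home side when pred_residual > 0, otherwise the away side.
    american_odds is assumed negative (e.g. -110: risk 1.10 to win 1.00).
    """
    win_payout = 100.0 / abs(american_odds)  # profit per 1 unit staked
    wins = losses = pushes = 0
    for pred, actual in zip(pred_residual, true_residual):
        if actual == 0:
            pushes += 1  # stake returned, no profit or loss
        elif (pred > 0) == (actual > 0):
            wins += 1
        else:
            losses += 1
    decided = wins + losses
    profit = wins * win_payout - losses * 1.0
    return {
        "wins": wins,
        "losses": losses,
        "pushes": pushes,
        "cover_accuracy": wins / decided if decided else float("nan"),
        "profit_units": profit,
        "roi": profit / decided if decided else float("nan"),  # per decided bet
    }


# Toy inputs, not real results
summary = ats_summary(
    pred_residual=[2.5, -1.0, 3.0, -4.0, 1.5],
    true_residual=[4.0, 3.0, 0.0, -6.0, 7.0],
)
print(summary)
```

With the backtest output, the call would look like `ats_summary(backtest_preds['pred_residual'], backtest_preds['true_margin'] - backtest_preds['closing_spread'])`, since the residual is defined as margin minus closing spread.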


## Part 6: Model Interpretation (SHAP)


In [None]:
if SHAP_AVAILABLE and final_lgb_model is not None and X is not None:
    print("Computing SHAP values for model interpretation...")
    
    # Use a sample for SHAP (it can be slow)
    sample_size = min(100, len(X))
    sample_idx = np.random.choice(len(X), sample_size, replace=False)
    X_sample = X[sample_idx]
    
    explainer = shap.TreeExplainer(final_lgb_model)
    shap_values = explainer.shap_values(X_sample)
    
    print("✓ SHAP values computed")
    print("\nTo visualize SHAP plots, run:")
    print("  shap.summary_plot(shap_values, X_sample, feature_names=feature_names)")
    print("  shap.plots.waterfall(shap.Explanation(values=shap_values[0], base_values=float(np.ravel(explainer.expected_value)[0]), data=X_sample[0], feature_names=feature_names))")
else:
    if not SHAP_AVAILABLE:
        print("⚠️  SHAP not available. Install with: pip install shap")
    else:
        print("⚠️  Model or feature matrix not available for SHAP analysis")
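
When plotting is not convenient, a common way to summarize a SHAP matrix is the mean absolute SHAP value per feature. The sketch below assumes `shap_values` is a 2-D array of shape `(n_samples, n_features)`, which is the usual shape for a single-output regression model. The helper name `rank_features_by_shap` and the toy feature names are hypothetical.

```python
import numpy as np


def rank_features_by_shap(shap_values, feature_names, top_n=10):
    """Rank features by mean absolute SHAP value (global importance)."""
    shap_values = np.asarray(shap_values, dtype=float)
    if shap_values.ndim != 2 or shap_values.shape[1] != len(feature_names):
        raise ValueError("expected shape (n_samples, n_features)")
    mean_abs = np.abs(shap_values).mean(axis=0)
    order = np.argsort(mean_abs)[::-1][:top_n]  # largest first
    return [(feature_names[i], float(mean_abs[i])) for i in order]


# Toy SHAP matrix: 3 games x 3 hypothetical features
toy_shap = np.array([
    [1.0, -0.2, 0.0],
    [-2.0, 0.1, 0.5],
    [3.0, -0.3, -0.5],
])
toy_names = ["elo_diff", "rest_diff", "form_diff"]
ranking = rank_features_by_shap(toy_shap, toy_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In the notebook, the same call would take the `shap_values` and `feature_names` computed above.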


## Part 7: Generate Predictions for Future Games

Function to make predictions on new games.


In [None]:
def predict_future_games(future_games_df, historical_games_df, model, feature_names, elo=None):
    """
    Predict residuals for future games
    Requires: future_games_df with at least: date, home_team_id, away_team_id, closing_spread
    """
    # Combine historical + future for feature calculation
    combined = pd.concat([historical_games_df, future_games_df], ignore_index=True)
    combined = combined.sort_values('date').reset_index(drop=True)
    
    # Build features (this will use historical data for rolling features)
    combined_features = build_features(combined, elo=elo, include_spread=True)
    
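    # Extract only future games. The filter matches on date alone, so it assumes
    # no game in historical_games_df shares a date with a game in future_games_df.
    future_dates = set(future_games_df['date'])
    future_features = combined_features[combined_features['date'].isin(future_dates)].copy()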
    
    if len(future_features) == 0:
        return pd.DataFrame()
    
    # Prepare features
    X_future, _, _, meta_future = prepare_model_data(future_features, target='residual')
    
    if X_future is None:
        return pd.DataFrame()
    
    # Predict
    pred_residual = model.predict(X_future, num_iteration=model.best_iteration)
    # Use meta_future (row-aligned with X_future) in case prepare_model_data dropped rows
    closing_spread = meta_future['closing_spread'].values
    pred_margin = pred_residual + closing_spread
    
    # Create predictions DataFrame
    predictions = pd.DataFrame({
        'date': meta_future['date'].values,
        'home_team': meta_future['home_team_name'].values,
        'away_team': meta_future['away_team_name'].values,
        'closing_spread': closing_spread,
        'predicted_residual': pred_residual,
        'predicted_margin': pred_margin,
        'predicted_home_cover': (pred_residual > 0).astype(int),
        'confidence': np.abs(pred_residual)  # Edge size in points: a rough ranking signal, not a calibrated probability
    })
    
    return predictions.sort_values('date')

print("✓ Prediction function defined")
print("\nTo use:")
print("  predictions = predict_future_games(future_games, historical_games, final_lgb_model, feature_names)")
print("  # historical_games: the raw game-level DataFrame that build_features expects")
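
Before calling `predict_future_games`, it can help to check that the input has the minimum schema listed in the docstring. The sketch below is one way to do that. The helper name `validate_future_games` and the team IDs are hypothetical. Given `residual = margin - closing_spread`, the `closing_spread` column is presumably the market's expected home margin (positive when the home team is favored), so confirm the sign convention of your odds source.

```python
import pandas as pd

REQUIRED_COLUMNS = ["date", "home_team_id", "away_team_id", "closing_spread"]


def validate_future_games(future_games_df):
    """Check the minimum schema and normalize dtypes on a copy."""
    missing = [c for c in REQUIRED_COLUMNS if c not in future_games_df.columns]
    if missing:
        raise ValueError(f"future_games_df is missing columns: {missing}")
    out = future_games_df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["closing_spread"] = pd.to_numeric(out["closing_spread"], errors="raise")
    if out["closing_spread"].isna().any():
        raise ValueError("closing_spread contains missing values")
    return out


future_games = validate_future_games(pd.DataFrame({
    "date": ["2025-01-15", "2025-01-15"],
    "home_team_id": ["BOS", "DEN"],  # hypothetical IDs
    "away_team_id": ["NYK", "LAL"],
    "closing_spread": [4.5, "2.0"],  # mixed types are coerced to float
}))
print(future_games.dtypes)
```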


## Part 8: Save Models & Export Results


In [None]:
# Save models and feature names
import os

os.makedirs('models', exist_ok=True)
os.makedirs('data/processed', exist_ok=True)

if final_lgb_model is not None:
    # Save LightGBM model
    final_lgb_model.save_model('models/lgb_spread_model.txt')
    print("✓ Saved LightGBM model to models/lgb_spread_model.txt")
    
    # Save feature names
    with open('models/feature_names.txt', 'w') as f:
        f.write('\n'.join(feature_names))
    print("✓ Saved feature names to models/feature_names.txt")
    
    # Save feature importance
    if len(feature_importance_df) > 0:
        feature_importance_df.to_csv('models/feature_importance.csv', index=False)
        print("✓ Saved feature importance to models/feature_importance.csv")

# Save processed features
if len(features_df) > 0:
    features_df.to_csv('data/processed/features_with_targets.csv', index=False)
    print("✓ Saved processed features to data/processed/features_with_targets.csv")

# Save backtest results
if len(backtest_preds) > 0:
    backtest_preds.to_csv('data/processed/backtest_predictions.csv', index=False)
    print("✓ Saved backtest predictions to data/processed/backtest_predictions.csv")

print("\n✓ Export step finished (see the messages above for what was saved)")
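
A LightGBM booster that predicts on a NumPy array reads features by column position, so the saved feature list is only useful if its order is checked again at serving time. The sketch below shows a round trip through a text file with an order check. The helper names and the example features are hypothetical, and a temporary directory stands in for `models/`.

```python
import os
import tempfile


def save_feature_names(path, feature_names):
    with open(path, "w") as f:
        f.write("\n".join(feature_names))


def load_feature_names(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def check_feature_order(expected, actual):
    """Raise if the serving feature list differs from the training one."""
    if list(expected) != list(actual):
        raise ValueError("feature names or order differ from training")


names = ["elo_diff", "rest_diff", "closing_spread"]  # hypothetical features
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_names.txt")
    save_feature_names(path, names)
    reloaded = load_feature_names(path)

check_feature_order(names, reloaded)
print(reloaded)
```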


## Part 9: Production Deployment Code

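Example code for deploying the model in production: a model loader, a single-game prediction function, and a FastAPI endpoint. The blocks are commented-out templates to adapt, not code that runs as-is.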


In [None]:
# Example: Load model and make predictions (for production use)
# See production_predict.py example below:

# production_predict.py
# import pandas as pd
# import numpy as np
# import lightgbm as lgb
# from datetime import datetime
# 
# def load_production_model():
#     model = lgb.Booster(model_file='models/lgb_spread_model.txt')
#     with open('models/feature_names.txt', 'r') as f:
#         feature_names = [line.strip() for line in f.readlines()]
#     return model, feature_names
# 
# def predict_game(home_team_id, away_team_id, closing_spread, historical_data):
#     # Load model (in a real service, load once at startup instead of on every call)
#     model, feature_names = load_production_model()
#     
#     # Build a one-row feature matrix X_new for this game from historical_data,
#     # with columns in the same order as feature_names
#     # ... (use build_features function)
#     
#     # Predict (X_new comes from the feature step above)
#     pred_residual = float(model.predict(X_new)[0])
#     pred_margin = pred_residual + closing_spread
#     
#     return pred_margin, pred_residual

# Example FastAPI endpoint:
# from fastapi import FastAPI
# from pydantic import BaseModel
# 
# app = FastAPI()
# 
# class GamePrediction(BaseModel):
#     home_team_id: str
#     away_team_id: str
#     closing_spread: float
#     date: str
# 
# @app.post("/predict")
# def predict(game: GamePrediction):
#     # Load model, build features, predict
#     # load_historical_data() is a placeholder for your own data-access function
#     pred_margin, pred_residual = predict_game(
#         game.home_team_id, 
#         game.away_team_id, 
#         game.closing_spread,
#         load_historical_data()
#     )
#     # Cast to built-in types so the response is JSON-serializable
#     return {
#         "predicted_margin": float(pred_margin),
#         "predicted_residual": float(pred_residual),
#         "home_covers": bool(pred_residual > 0)
#     }

print("✓ Production deployment code template provided")
print("\nSee comments above for FastAPI example")


## Part 10: Monitoring & Model Drift Detection


In [None]:
def monitor_model_performance(predictions_df, actual_results_df, baseline_mae=None):
    """
    Monitor model performance over time
    Detects model drift and performance degradation
    """
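    # Normalize date dtypes on copies so the merge keys match
    predictions_df = predictions_df.copy()
    actual_results_df = actual_results_df.copy()
    predictions_df['date'] = pd.to_datetime(predictions_df['date'])
    actual_results_df['date'] = pd.to_datetime(actual_results_df['date'])
    
    # Merge predictions with actual results
    merged = predictions_df.merge(
        actual_results_df[['date', 'home_team', 'away_team', 'margin', 'residual']],
        on=['date', 'home_team', 'away_team'],
        how='inner'
    )
    
    if len(merged) == 0:
        return None
    
    # Calculate metrics by time window
    merged = merged.sort_values('date')
    
    # Rolling performance (last 20 games)
    window_size = 20
    metrics = []
    
    # The +1 lets the final window include the most recent game
    for i in range(window_size, len(merged) + 1):
        window = merged.iloc[i-window_size:i]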
        
        mae = mean_absolute_error(window['residual'], window['predicted_residual'])
        cover_acc = (window['predicted_home_cover'] == (window['residual'] > 0)).mean()
        
        metrics.append({
            'date': window.iloc[-1]['date'],
            'mae': mae,
            'cover_accuracy': cover_acc,
            'n_games': len(window)
        })
    
    metrics_df = pd.DataFrame(metrics)
    
    # Detect drift (MAE increases significantly)
    if baseline_mae and len(metrics_df) > 0:
        current_mae = metrics_df['mae'].iloc[-10:].mean()
        drift_threshold = baseline_mae * 1.15  # 15% increase
        
        if current_mae > drift_threshold:
            print(f"⚠️  MODEL DRIFT DETECTED!")
            print(f"   Baseline MAE: {baseline_mae:.3f}")
            print(f"   Current MAE: {current_mae:.3f}")
            print(f"   Increase: {(current_mae/baseline_mae - 1)*100:.1f}%")
            print(f"   Recommendation: Retrain model")
        else:
            print(f"✓ Model performance stable (MAE: {current_mae:.3f})")
    
    return metrics_df

print("✓ Model monitoring function defined")
print("\nUse this to track model performance over time and detect drift")
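
The rolling metrics can also be computed without an explicit loop. The sketch below is a vectorized alternative using pandas `rolling`. It assumes the merged frame has the same columns the monitoring function uses (`date`, `residual`, `predicted_residual`, `predicted_home_cover`). The function name `rolling_metrics` and the toy data are illustrative only.

```python
import pandas as pd


def rolling_metrics(merged, window_size=20):
    """Rolling MAE and cover accuracy over the last `window_size` games."""
    df = merged.sort_values("date").reset_index(drop=True)
    abs_err = (df["residual"] - df["predicted_residual"]).abs()
    hit = (df["predicted_home_cover"].astype(bool) == (df["residual"] > 0)).astype(float)
    out = pd.DataFrame({
        "date": df["date"],
        "mae": abs_err.rolling(window_size).mean(),
        "cover_accuracy": hit.rolling(window_size).mean(),
    })
    # Rows before the first full window are NaN and are dropped
    return out.dropna().reset_index(drop=True)


# Toy data, not real results
toy = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=5, freq="D"),
    "residual": [3.0, -2.0, 5.0, -1.0, 4.0],
    "predicted_residual": [1.0, 1.0, 2.0, -3.0, 0.0],
    "predicted_home_cover": [1, 1, 1, 0, 0],
})
rolled = rolling_metrics(toy, window_size=3)
print(rolled)
```

Each output row is dated by the last game in its window, so the most recent game is always covered by the final row.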


## Summary & Next Steps

### What We've Built:
1. ✅ **Feature Engineering Pipeline**: ELO ratings, rolling stats, rest days, temporal features
2. ✅ **Multiple Models**: Baseline, Linear (Ridge), LightGBM
3. ✅ **Time-Series Cross-Validation**: Proper walk-forward validation
4. ✅ **Backtesting Framework**: Simulates real-world deployment
5. ✅ **Model Interpretation**: Feature importance, SHAP support
6. ✅ **Production-Ready Code**: Model saving, prediction functions, monitoring

### Key Features:
- **Target**: Predicts `residual = margin - closing_spread` (betting edge)
- **No Data Leakage**: Features computed only from past games
- **Time-Aware**: Time-ordered validation keeps future games out of training, so reported metrics are free of look-ahead bias
- **Betting Metrics**: Cover accuracy, average edge, MAE on margin

### To Improve:
1. **Add Injury Data**: Integrate player availability/minutes lost
2. **Add Travel Data**: Calculate distance/time zones between cities
3. **Add Market Data**: Opening spreads, line movement, public betting %
4. **Advanced Features**: Play-by-play, lineup-based metrics, clutch performance
5. **Probabilistic Predictions**: Quantile regression for uncertainty
6. **Ensemble Models**: Stack ELO + LightGBM + Neural Network
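
For the probabilistic-predictions item, quantile forecasts are usually scored with the pinball (quantile) loss. LightGBM offers a `quantile` objective with an `alpha` parameter for training such models. The sketch below covers only the evaluation side, in plain Python. The forecasts are made-up constants, not model output.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss for a single quantile in (0, 1)."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be in (0, 1)")
    total = 0.0
    for actual, pred in zip(y_true, y_pred):
        diff = actual - pred
        # Under-prediction is weighted by q, over-prediction by (1 - q)
        total += max(quantile * diff, (quantile - 1.0) * diff)
    return total / len(y_true)


residuals = [4.0, -3.0, 0.5, -7.0]   # toy observed residuals
q10_pred = [-8.0, -8.0, -8.0, -8.0]  # hypothetical 10th-percentile forecasts
q90_pred = [8.0, 8.0, 8.0, 8.0]      # hypothetical 90th-percentile forecasts

loss_q10 = pinball_loss(residuals, q10_pred, 0.10)
loss_q90 = pinball_loss(residuals, q90_pred, 0.90)
# Share of games whose residual falls inside the 10%-90% band
coverage = sum(
    low <= y <= high for y, low, high in zip(residuals, q10_pred, q90_pred)
) / len(residuals)
print(loss_q10, loss_q90, coverage)
```

A well-calibrated 10%-90% band should contain roughly 80% of outcomes over a large sample, which gives a direct check on the uncertainty estimates.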

### Deployment Checklist:
- [ ] Set up daily data collection job
- [ ] Automate feature calculation pipeline
- [ ] Schedule weekly model retraining
- [ ] Deploy prediction API (FastAPI/Flask)
- [ ] Set up monitoring dashboard
- [ ] Implement alerting for model drift
- [ ] Backtest on historical data before live deployment

### Important Notes:
- ⚠️ **Gambling Laws**: Be aware of local regulations
- ⚠️ **Vigorish**: Account for bookmaker house edge (~4.5% on spreads at standard -110 odds, which puts the break-even cover rate near 52.4%)
- ⚠️ **Sample Size**: Early season predictions have higher variance
- ⚠️ **Data Quality**: Accurate injury/line data is critical
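
The vigorish note can be made concrete with a little arithmetic. At American odds of -110 a bettor risks 110 to win 100, so the break-even win rate is 110 / 210. The sketch below computes that rate and the expected return per unit staked. The function names are illustrative, and pushes are ignored.

```python
def break_even_rate(american_odds):
    """Win rate needed to break even at the given American odds."""
    if american_odds < 0:
        risk, win = float(-american_odds), 100.0
    else:
        risk, win = 100.0, float(american_odds)
    return risk / (risk + win)


def expected_roi(win_rate, american_odds):
    """Expected profit per unit staked, ignoring pushes."""
    if american_odds < 0:
        payout = 100.0 / -american_odds
    else:
        payout = american_odds / 100.0
    return win_rate * payout - (1.0 - win_rate)


print(f"Break-even at -110: {break_even_rate(-110):.4f}")
print(f"ROI at 50% wins:    {expected_roi(0.50, -110):.4f}")
print(f"ROI at 55% wins:    {expected_roi(0.55, -110):.4f}")
```

A coin-flip bettor loses about 4.5% per unit staked at -110, which is the house edge quoted above. A 55% cover rate returns about 5% per unit.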

### Files Generated:
- `models/lgb_spread_model.txt` - Trained LightGBM model
- `models/feature_names.txt` - Feature list
- `models/feature_importance.csv` - Feature rankings
- `data/processed/features_with_targets.csv` - Full feature dataset
- `data/processed/backtest_predictions.csv` - Per-game backtest predictions


## Appendix: Requirements File

Create a `requirements.txt` file with:
```
pandas>=1.5.0
numpy>=1.23.0
scikit-learn>=1.2.0
lightgbm>=3.3.0
matplotlib>=3.6.0
seaborn>=0.12.0
shap>=0.41.0
joblib>=1.2.0
requests>=2.28.0
```
