# NFL Home Win Ensemble Model

This notebook reproduces the ensemble training pipeline (RandomForest + GradientBoosting + LogisticRegression with soft voting) using the processed game-level dataset. It keeps the model in memory (no joblib artifacts).

Workflow:
- Load or build processed dataset (`data/processed/games_dataset.csv`).
- Prepare features (numeric + categorical) and build the preprocessing + ensemble pipeline.
- Train on 2019–2021, validate on 2022 (report accuracy, ROC AUC, log loss, and Brier score).
- Refit on 2019–2022 for the final in-memory model.

Notes:
- Ensure data is fetched under `data/raw/` (e.g., run the fetch script or `make fetch`).
- If the model is strictly for pregame predictions, consider excluding the recorded game weather (`temp`, `wind`), which may not be known before kickoff, to avoid leakage.


In [3]:
from __future__ import annotations

from pathlib import Path
from typing import List, Tuple

import numpy as np
import pandas as pd

# Sklearn
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss, brier_score_loss

DATA_DIR = Path('../data')
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'

ID_COLS = ['game_id', 'season', 'week', 'gameday', 'home_team', 'away_team']
TARGET = 'home_win'

print('Using DATA_DIR=', DATA_DIR.resolve())


Using DATA_DIR= /Users/aksharravichandran/Documents/GitHub/NFLAnalytics/data


## Build Processed Dataset (optional)

Builds a compact game-level table with engineered rolling features from the schedules under `data/raw/schedules/`. If the processed file already exists, the cell loads it instead of rebuilding; run the cell either way, because later cells rely on `df_proc`.
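
The rolling features are shifted by one game so that each row only sees results from earlier weeks. A minimal sketch of that idea on made-up margins (the `toy` frame and its values are illustrative, not taken from the schedules file):

```python
import pandas as pd

# Toy team-game rows; margins are invented for illustration.
toy = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B"],
    "season": [2021] * 6,
    "week": [1, 2, 3, 4, 1, 2],
    "margin": [7.0, -3.0, 10.0, 4.0, -7.0, 14.0],
})
toy = toy.sort_values(["team", "season", "week"]).reset_index(drop=True)

# shift(1) drops the current game from its own window, so the feature
# for week N is built only from weeks before N within the same season.
toy["roll_margin_mean_3"] = (
    toy.groupby(["team", "season"], sort=False)["margin"]
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
print(toy)
```

Each team's first game of a season has no history, so its rolling value is missing; that is why the week 1 rows in the preview below show empty rolling columns.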

In [4]:
def _read_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(path)

def load_schedules() -> pd.DataFrame:
    sched_path = RAW_DIR / 'schedules' / 'schedules_2019_2024.csv'
    if not sched_path.exists():
        raise FileNotFoundError(f'Missing schedules file at {sched_path}. Run the fetch script first.')
    df = _read_csv(sched_path)
    if 'gameday' in df.columns:
        df['gameday'] = pd.to_datetime(df['gameday'], errors='coerce')
    return df

def _long_games(sched: pd.DataFrame) -> pd.DataFrame:
    base_cols = [
        'game_id','season','game_type','week','gameday','weekday',
        'spread_line','total_line','div_game','roof','surface','temp','wind',
    ]
    home = sched.assign(
        team=sched['home_team'],
        opp=sched['away_team'],
        points_for=sched['home_score'],
        points_against=sched['away_score'],
        is_home=1,
        rest=sched.get('home_rest'),
        moneyline=sched.get('home_moneyline'),
    )[base_cols + ['team','opp','points_for','points_against','is_home','rest','moneyline']]
    away = sched.assign(
        team=sched['away_team'],
        opp=sched['home_team'],
        points_for=sched['away_score'],
        points_against=sched['home_score'],
        is_home=0,
        rest=sched.get('away_rest'),
        moneyline=sched.get('away_moneyline'),
    )[base_cols + ['team','opp','points_for','points_against','is_home','rest','moneyline']]
    long_df = pd.concat([home, away], ignore_index=True)
    long_df['margin'] = pd.to_numeric(long_df['points_for'], errors='coerce') - pd.to_numeric(long_df['points_against'], errors='coerce')
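    # Keep win missing when the score is missing; a bare comparison would record unplayed games as losses.
    long_df['win'] = (long_df['margin'] > 0).astype(float).where(long_df['margin'].notna())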
    long_df = long_df.sort_values(['team','season','week']).reset_index(drop=True)
    return long_df

def _rolling_features(long_df: pd.DataFrame) -> pd.DataFrame:
    feats = long_df.copy()
    grp = feats.groupby(['team','season'], sort=False)
    def add_roll(col: str, name: str, window: int, func: str = 'mean'):
        # transform returns a result aligned to feats' index, so no reliance on group order
        feats[f'{name}_{window}'] = grp[col].transform(
            lambda x: getattr(x.shift(1).rolling(window, min_periods=1), func)()
        )
    for w in (3, 5):
        add_roll('margin','roll_margin_mean', w, 'mean')
        add_roll('win','roll_win_rate', w, 'mean')
        add_roll('points_for','roll_pts_for_mean', w, 'mean')
        add_roll('points_against','roll_pts_against_mean', w, 'mean')
    # Season-to-date averages over prior games only (shifted by one game)
    feats['st_d_win_rate'] = grp['win'].transform(lambda x: x.shift(1).expanding(min_periods=1).mean())
    feats['st_d_margin_mean'] = grp['margin'].transform(lambda x: x.shift(1).expanding(min_periods=1).mean())
    for c in ['rest','moneyline','temp','wind']:
        if c in feats.columns:
            feats[c] = pd.to_numeric(feats[c], errors='coerce')
    return feats

def _wide_game_level(sched: pd.DataFrame, team_feats: pd.DataFrame) -> pd.DataFrame:
    key_cols = ['game_id','team']
    feat_cols = [
        'roll_margin_mean_3','roll_margin_mean_5',
        'roll_win_rate_3','roll_win_rate_5',
        'roll_pts_for_mean_3','roll_pts_for_mean_5',
        'roll_pts_against_mean_3','roll_pts_against_mean_5',
        'st_d_win_rate','st_d_margin_mean',
        'rest','moneyline',
    ]
    home_merge = team_feats[key_cols + feat_cols + ['temp','wind']].rename(columns={c: f'home_{c}' for c in feat_cols + ['temp','wind']}).rename(columns={'team':'home_team'})
    away_merge = team_feats[key_cols + feat_cols + ['temp','wind']].rename(columns={c: f'away_{c}' for c in feat_cols + ['temp','wind']}).rename(columns={'team':'away_team'})
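    # Caution: if sched already has home_rest/home_moneyline (or the away_* equivalents),
    # these merges suffix the duplicates (_x/_y), and build_dataset's keep_cols filter
    # then silently drops the unsuffixed names.
    games = sched.merge(home_merge, on=['game_id','home_team'], how='left')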
    games = games.merge(away_merge, on=['game_id','away_team'], how='left')
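    home_score = pd.to_numeric(games['home_score'], errors='coerce')
    away_score = pd.to_numeric(games['away_score'], errors='coerce')
    # Leave the label missing for unplayed games; NaN > NaN is False and would otherwise become 0.
    played = home_score.notna() & away_score.notna()
    games['home_win'] = (home_score > away_score).astype('Int64').where(played)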
    return games

def build_dataset(seasons: Tuple[int, int] = (2019, 2024)) -> pd.DataFrame:
    sched = load_schedules()
    for c in ['week','season','temp','wind','spread_line','total_line','home_rest','away_rest','home_moneyline','away_moneyline']:
        if c in sched.columns:
            sched[c] = pd.to_numeric(sched[c], errors='coerce')
    s0, s1 = seasons
    sched = sched[(sched['season'] >= s0) & (sched['season'] <= s1) & (sched['game_type'].isin(['REG']))].copy()
    long_df = _long_games(sched)
    team_feats = _rolling_features(long_df)
    game_lvl = _wide_game_level(sched, team_feats)
    lt_path = DATA_DIR / 'eda' / 'league_trend.csv'
    if lt_path.exists():
        lt = pd.read_csv(lt_path).rename(columns={'pass_rate':'league_pass_rate','epa_mean':'league_epa_mean'})
        lt = lt[['season','week','league_pass_rate','league_epa_mean']]
        game_lvl = game_lvl.merge(lt, on=['season','week'], how='left')
    keep_cols = [
        'game_id','season','week','gameday','weekday',
        'home_team','away_team','home_score','away_score','home_win',
        'spread_line','total_line','div_game','roof','surface',
        'home_rest','away_rest','home_moneyline','away_moneyline',
        'home_roll_margin_mean_3','home_roll_margin_mean_5',
        'home_roll_win_rate_3','home_roll_win_rate_5',
        'home_roll_pts_for_mean_3','home_roll_pts_for_mean_5',
        'home_roll_pts_against_mean_3','home_roll_pts_against_mean_5',
        'home_st_d_win_rate','home_st_d_margin_mean',
        'home_temp','home_wind',
        'away_roll_margin_mean_3','away_roll_margin_mean_5',
        'away_roll_win_rate_3','away_roll_win_rate_5',
        'away_roll_pts_for_mean_3','away_roll_pts_for_mean_5',
        'away_roll_pts_against_mean_3','away_roll_pts_against_mean_5',
        'away_st_d_win_rate','away_st_d_margin_mean',
        'away_temp','away_wind',
        'league_pass_rate','league_epa_mean',
    ]
    keep_cols = [c for c in keep_cols if c in game_lvl.columns]
    return game_lvl[keep_cols].copy()

processed_path = PROCESSED_DIR / 'games_dataset.csv'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
if not processed_path.exists():
    df_proc = build_dataset((2019, 2024))
    df_proc.to_csv(processed_path, index=False)
    print(f'Saved {len(df_proc):,} rows to {processed_path}')
else:
    df_proc = pd.read_csv(processed_path)
    print(f'Loaded {len(df_proc):,} rows from {processed_path}')

df_proc.head(3)


Loaded 1,599 rows from ../data/processed/games_dataset.csv


,game_id,season,week,gameday,weekday,home_team,away_team,home_score,away_score,home_win,...,away_roll_pts_for_mean_3,away_roll_pts_for_mean_5,away_roll_pts_against_mean_3,away_roll_pts_against_mean_5,away_st_d_win_rate,away_st_d_margin_mean,away_temp,away_wind,league_pass_rate,league_epa_mean
0,2019_01_GB_CHI,2019,1,2019-09-05,Thursday,CHI,GB,3.0,10.0,0,...,,,,,,,65.0,10.0,0.473666,0.027652
1,2019_01_LA_CAR,2019,1,2019-09-08,Sunday,CAR,LA,27.0,30.0,0,...,,,,,,,87.0,3.0,0.473666,0.027652
2,2019_01_TEN_CLE,2019,1,2019-09-08,Sunday,CLE,TEN,13.0,43.0,0,...,,,,,,,71.0,10.0,0.473666,0.027652
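
One thing worth checking when rebuilding the dataset: pandas does not overwrite a non-key column that exists on both sides of a merge; it suffixes both copies with `_x` and `_y`. The sketch below uses invented frames (`toy_sched`, `toy_feats`) to show how that could hide a column such as `home_rest` from a later name-based filter. The feature list printed later in this notebook contains no rest or moneyline columns, which would be consistent with this, but the raw schedule columns are not shown here, so treat it as something to verify rather than a confirmed diagnosis.

```python
import pandas as pd

# Invented stand-ins for the schedule table and the per-team feature table.
toy_sched = pd.DataFrame({
    "game_id": ["g1", "g2"],
    "home_team": ["CHI", "CAR"],
    "home_rest": [7, 7],
})
toy_feats = pd.DataFrame({
    "game_id": ["g1", "g2"],
    "home_team": ["CHI", "CAR"],
    "home_rest": [7, 7],
    "home_roll_win_rate_3": [0.33, 0.67],
})
keys = ["game_id", "home_team"]

# home_rest exists on both sides, so it comes back as home_rest_x / home_rest_y.
merged = toy_sched.merge(toy_feats, on=keys, how="left")
kept = [c for c in ["home_rest", "home_roll_win_rate_3"] if c in merged.columns]
print(list(merged.columns), kept)

# One possible guard: drop overlapping non-key columns from the right side first.
overlap = [c for c in toy_feats.columns if c in toy_sched.columns and c not in keys]
clean = toy_sched.merge(toy_feats.drop(columns=overlap), on=keys, how="left")
print(list(clean.columns))
```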


## Build Model Pipeline

Construct numeric/categorical transformers and a soft-voting ensemble. Optionally exclude weather variables to avoid potential leakage.
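
Soft voting with equal weights averages the base learners' predicted probabilities and thresholds the average, whereas hard voting counts each learner's class vote. A small pure-Python sketch with made-up probabilities (the numbers and the `soft_vote`/`hard_vote` helpers are illustrative, not part of scikit-learn):

```python
# Hypothetical home-win probabilities from three base learners for two games.
proba_by_model = {
    "rf": [0.70, 0.45],
    "gb": [0.60, 0.48],
    "lr": [0.50, 0.90],
}

def soft_vote(proba_by_model):
    """Average each game's probability across models (equal weights)."""
    per_game = zip(*proba_by_model.values())
    return [sum(col) / len(col) for col in per_game]

def hard_vote(proba_by_model, threshold=0.5):
    """Majority vote over each model's thresholded prediction."""
    per_game = zip(*proba_by_model.values())
    return [int(sum(p >= threshold for p in col) > len(col) / 2) for col in per_game]

ensemble_proba = soft_vote(proba_by_model)
soft_pred = [int(p >= 0.5) for p in ensemble_proba]
hard_pred = hard_vote(proba_by_model)
print(ensemble_proba, soft_pred, hard_pred)
```

In the second game one confident learner pulls the averaged probability above 0.5 even though two of the three learners lean the other way, which is the practical difference between the two voting modes.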

In [5]:
# Toggle to drop weather features if modeling pregame only
EXCLUDE_WEATHER = True

exclude = set(ID_COLS + [TARGET, 'home_score', 'away_score'])
all_cols = [c for c in df_proc.columns if c not in exclude]

if EXCLUDE_WEATHER:
    for c in ['home_temp','away_temp','home_wind','away_wind']:
        if c in all_cols:
            all_cols.remove(c)

categorical_features: List[str] = []
numeric_features: List[str] = []
for c in all_cols:
    if df_proc[c].dtype.kind in 'ifu':
        numeric_features.append(c)
    else:
        categorical_features.append(c)
# Ensure certain known categoricals are treated as such if present
for c in ['weekday','roof','surface','div_game']:
    if c in all_cols and c not in categorical_features:
        if c in numeric_features:
            numeric_features.remove(c)
        categorical_features.append(c)

def build_pipeline(numeric_features: List[str], categorical_features: List[str]) -> Pipeline:
    num_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median'))])
    cat_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    preprocessor = ColumnTransformer(
        transformers=[('num', num_transformer, numeric_features), ('cat', cat_transformer, categorical_features)]
    )
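    # Base learners: RF and LR use balanced class weights to guard against imbalance;
    # GradientBoostingClassifier has no class_weight parameter, so it is left unweighted.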
    rf = RandomForestClassifier(n_estimators=400, max_depth=None, random_state=42, n_jobs=-1, class_weight='balanced')
    gb = GradientBoostingClassifier(random_state=42)
    lr = LogisticRegression(max_iter=200, solver='lbfgs', class_weight='balanced')
    ensemble = VotingClassifier(estimators=[('rf', rf), ('gb', gb), ('lr', lr)], voting='soft', n_jobs=-1)
    pipe = Pipeline(steps=[('prep', preprocessor), ('model', ensemble)])
    return pipe

pipe = build_pipeline(numeric_features, categorical_features)
len(numeric_features), len(categorical_features)


(24, 4)
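
The numeric branch above imputes but does not scale, and the training output in this notebook shows lbfgs reaching `max_iter=200` without converging. Standardizing numeric inputs often helps lbfgs converge; the sketch below shows the idea on synthetic data only (the column values and the `scaled_lr` name are made up, and it is not a claim about how the real dataset will behave). In `build_pipeline`, the equivalent change would be adding a `StandardScaler` step after the imputer in `num_transformer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features on different scales.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "spread_line": rng.normal(0, 6, 300),
    "total_line": rng.normal(45, 4, 300),
})
X_toy.loc[::25, "total_line"] = np.nan  # a few missing values to impute
y_toy = (X_toy["spread_line"] + rng.normal(0, 6, 300) > 0).astype(int)

# Impute, then standardize, then fit the linear model.
scaled_lr = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=200, solver="lbfgs")),
])
scaled_lr.fit(X_toy, y_toy)

# n_iter_ reports how many lbfgs iterations were actually used.
print(scaled_lr.named_steps["lr"].n_iter_)
```

Tree-based learners are insensitive to monotonic rescaling of individual features, so scaling the shared numeric branch should mainly affect the logistic regression member.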

## Train / Validate

Train on 2019–2021 and validate on 2022, then refit on 2019–2022 for the final in-memory model.
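
Splitting by season keeps every validation game later in time than every training game. A minimal sketch of that split on a toy frame, using a hypothetical `season_split` helper that is not defined elsewhere in this notebook:

```python
import pandas as pd

def season_split(df, train_seasons, val_season, season_col="season"):
    """Hypothetical helper: inclusive training season range plus one validation season."""
    first, last = train_seasons
    train = df[df[season_col].between(first, last)].copy()
    val = df[df[season_col] == val_season].copy()
    return train, val

# Toy rows; labels are invented.
toy_games = pd.DataFrame({
    "season": [2019, 2020, 2021, 2022, 2022, 2023],
    "home_win": [1, 0, 1, 1, 0, 1],
})
toy_train, toy_val = season_split(toy_games, (2019, 2021), 2022)
print(len(toy_train), len(toy_val))
```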

In [6]:
def evaluate(y_true: np.ndarray, proba: np.ndarray):
    preds = (proba >= 0.5).astype(int)
    return {
        'accuracy': float(accuracy_score(y_true, preds)),
        'roc_auc': float(roc_auc_score(y_true, proba)),
        'log_loss': float(log_loss(y_true, np.vstack([1 - proba, proba]).T, labels=[0, 1])),
        'brier': float(brier_score_loss(y_true, proba)),
    }

# Train/val split
df_fit = df_proc[df_proc[TARGET].notna()].copy()
train_df = df_fit[(df_fit['season'] >= 2019) & (df_fit['season'] <= 2021)].copy()
val_df = df_fit[df_fit['season'] == 2022].copy()

X_train = train_df[all_cols]
y_train = train_df[TARGET].astype(int).values
X_val = val_df[all_cols]
y_val = val_df[TARGET].astype(int).values

# Fit on 2019-2021
pipe.fit(X_train, y_train)

# Report 2022 validation metrics
if len(X_val) > 0:
    val_proba = pipe.predict_proba(X_val)[:, 1]
    metrics = evaluate(y_val, val_proba)
    print('Validation (2022) metrics:')
    for k, v in metrics.items():
        print(f'  {k}: {v:.4f}')
else:
    print('No 2022 validation rows found.')

# Refit on 2019-2022 for final in-memory model
train_full = df_fit[(df_fit['season'] >= 2019) & (df_fit['season'] <= 2022)].copy()
X_full = train_full[all_cols]
y_full = train_full[TARGET].astype(int).values
pipe.fit(X_full, y_full)
print('Final model trained on seasons 2019–2022 and stored in variable `pipe`.')


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Validation (2022) metrics:
  accuracy: 0.6236
  roc_auc: 0.6859
  log_loss: 0.6313
  brier: 0.2212


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Final model trained on seasons 2019–2022 and stored in variable `pipe`.
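
For context on the validation numbers above, it helps to know what an uninformative forecast scores. The sketch below computes the Brier score and binary log loss by hand for a forecaster that always says 0.5; the labels are made up and the helper names are illustrative.

```python
import math

def brier_score(y_true, proba):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, proba)) / len(y_true)

def binary_log_loss(y_true, proba):
    """Mean negative log-likelihood of the observed outcomes."""
    total = sum(math.log(p) if y == 1 else math.log(1 - p) for y, p in zip(y_true, proba))
    return -total / len(y_true)

y_example = [1, 0, 1, 1, 0]           # invented outcomes
coin_flip = [0.5] * len(y_example)    # always predict 50%

print(brier_score(y_example, coin_flip))      # 0.25
print(binary_log_loss(y_example, coin_flip))  # ln(2), about 0.6931
```

An always-0.5 forecast scores 0.25 and about 0.6931 regardless of the labels, so the reported 0.2212 Brier score and 0.6313 log loss are both better than that baseline, though by a modest amount.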


## Inspect Pipeline and Features

Peek at the pipeline structure and expanded feature names (after one-hot encoding).

In [7]:
# Not echoed: Jupyter only displays a cell's last expression. Wrap in print() to see it.
pipe.named_steps

prep = pipe.named_steps['prep']
try:
    feat_names = prep.get_feature_names_out()
    print('Total transformed features:', len(feat_names))
    print('First 25:', feat_names[:25])
except Exception as e:
    print('Feature name introspection not available:', e)


Total transformed features: 44
First 25: ['num__spread_line' 'num__total_line' 'num__home_roll_margin_mean_3'
 'num__home_roll_margin_mean_5' 'num__home_roll_win_rate_3'
 'num__home_roll_win_rate_5' 'num__home_roll_pts_for_mean_3'
 'num__home_roll_pts_for_mean_5' 'num__home_roll_pts_against_mean_3'
 'num__home_roll_pts_against_mean_5' 'num__home_st_d_win_rate'
 'num__home_st_d_margin_mean' 'num__away_roll_margin_mean_3'
 'num__away_roll_margin_mean_5' 'num__away_roll_win_rate_3'
 'num__away_roll_win_rate_5' 'num__away_roll_pts_for_mean_3'
 'num__away_roll_pts_for_mean_5' 'num__away_roll_pts_against_mean_3'
 'num__away_roll_pts_against_mean_5' 'num__away_st_d_win_rate'
 'num__away_st_d_margin_mean' 'num__league_pass_rate'
 'num__league_epa_mean' 'cat__weekday_Friday']


## Example Inference

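Predict home-win probabilities for the first five labeled rows using the in-memory model (`pipe`). These rows come from the 2019 training data, so the output demonstrates the API rather than out-of-sample performance.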

In [8]:
X_any = df_proc[df_proc[TARGET].notna()].iloc[:5][all_cols]
pipe.predict_proba(X_any)[:, 1]


array([0.41874589, 0.26284928, 0.51061971, 0.25090728, 0.17958385])