# 02 – Model Benchmarks on NBA Processed Data

In this notebook, we:

- Load the processed, leak-free NBA games table.
- Rebuild the logistic regression baseline from the processed file.
- Add stronger models (Random Forest / Gradient Boosting).
- Compare performance across models using Accuracy and ROC AUC.

## 1: Load Processed Data

In [3]:
import sys 
from pathlib import Path

ROOT = Path("..").resolve()
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

import pandas as pd 
from src.paths import PROCESSED_DIR

games_model = pd.read_parquet(PROCESSED_DIR / "processed_games.parquet")
games_model.head()

|   | date | start_et | away_team | home_team | attend | arena | season | source_file | home_win | home_win_pct_10 | home_avg_pd_10 | home_season_win_pct | home_recent_win_pct_20g | home_days_rest | home_last_pd | away_win_pct_10 | away_avg_pd_10 | away_season_win_pct | away_recent_win_pct_20g | away_days_rest | away_last_pd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-10-29 | 7:00p | Memphis Grizzlies | Indiana Pacers | 18165.0 | Bankers Life Fieldhouse | 2015-16_NBA | oct.xls | 0 | 0.0 | -1.0 | 0.0 | 0.0 | 1.0 | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 1.0 | -1.0 |
| 1 | 2015-10-29 | 8:00p | Atlanta Hawks | New York Knicks | 19812.0 | Madison Square Garden (IV) | 2015-16_NBA | oct.xls | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 2.0 | -1.0 |
| 2 | 2015-10-29 | 10:30p | Dallas Mavericks | Los Angeles Clippers | 19218.0 | STAPLES Center | 2015-16_NBA | oct.xls | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 2015-10-30 | 10:30p | Portland Trail Blazers | Phoenix Suns | 18055.0 | Talking Stick Resort Arena | 2015-16_NBA | oct.xls | 1 | 0.0 | -1.0 | 0.0 | 0.0 | 2.0 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 |
| 4 | 2015-10-30 | 10:00p | Los Angeles Lakers | Sacramento Kings | 17391.0 | Sleep Train Arena | 2015-16_NBA | oct.xls | 1 | 0.0 | -1.0 | 0.0 | 0.0 | 2.0 | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 2.0 | -1.0 |


## 2: Rebuild Logistic Regression Baseline (from Processed Table)

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

feature_cols = [
    "home_win_pct_10", "away_win_pct_10",
    "home_avg_pd_10", "away_avg_pd_10",
    "home_season_win_pct", "away_season_win_pct",
    "home_recent_win_pct_20g", "away_recent_win_pct_20g",
    "home_days_rest", "away_days_rest",
    "home_last_pd", "away_last_pd",
]

X = games_model[feature_cols]
y = games_model["home_win"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=y,
)

log_reg = LogisticRegression(max_iter=1000, solver="lbfgs")
log_reg.fit(X_train, y_train)

preds_lr = log_reg.predict(X_test)
proba_lr = log_reg.predict_proba(X_test)[:, 1]

acc_lr = accuracy_score(y_test, preds_lr)
auc_lr = roc_auc_score(y_test, proba_lr)

acc_lr, auc_lr

(0.6259774109470027, 0.6634115545700912)

## 3: Random Forest Benchmark

In [5]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=1
)

rf.fit(X_train, y_train)

preds_rf = rf.predict(X_test)
proba_rf = rf.predict_proba(X_test)[:,1]

acc_rf = accuracy_score(y_test, preds_rf)
auc_rf = roc_auc_score(y_test, proba_rf)

acc_rf, auc_rf

(0.6172893136403128, 0.6538217541266322)

### **Random Forest Results**

- Random Forest (accuracy 0.617, AUC 0.654) lands slightly below logistic regression (accuracy 0.626, AUC 0.663) on both metrics, so it does **not outperform** the linear baseline.

- Since AUC is our most important metric for ranking win probability, **logistic regression remains the current best model**.

- This suggests that the engineered rolling, season, and rest features carry most of the predictive signal in a form that a simple linear model can already exploit, and that the added flexibility of the forest does not help on this split.
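
The gap between the two AUC values is small, so it may be worth checking how stable it is on this test set. The sketch below is one way to do that: a paired bootstrap that resamples test games with replacement and reports a percentile interval for the AUC difference. The helper name `bootstrap_auc_gap`, the number of resamples, and the synthetic demo arrays are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_gap(y_true, proba_a, proba_b, n_boot=1000, seed=0):
    """Percentile interval for AUC(a) - AUC(b) over paired bootstrap resamples."""
    y_true = np.asarray(y_true)
    proba_a = np.asarray(proba_a)
    proba_b = np.asarray(proba_b)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    gaps = []
    while len(gaps) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample games with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined if only one class was drawn
        gaps.append(
            roc_auc_score(y_true[idx], proba_a[idx])
            - roc_auc_score(y_true[idx], proba_b[idx])
        )
    return np.percentile(gaps, [2.5, 97.5])


# Synthetic stand-ins for y_test, proba_lr and proba_rf (NOT the notebook's data)
demo_rng = np.random.default_rng(42)
y_demo = demo_rng.integers(0, 2, size=500)
proba_demo_a = np.clip(0.5 + 0.2 * (y_demo - 0.5) + demo_rng.normal(0, 0.2, 500), 0, 1)
proba_demo_b = np.clip(0.5 + 0.2 * (y_demo - 0.5) + demo_rng.normal(0, 0.2, 500), 0, 1)

low, high = bootstrap_auc_gap(y_demo, proba_demo_a, proba_demo_b, n_boot=200)
print(f"95% interval for the AUC gap: [{low:.3f}, {high:.3f}]")
```

In the notebook itself the call would presumably be `bootstrap_auc_gap(y_test, proba_lr, proba_rf)`. If the resulting interval includes 0, this test set alone does not separate the two models, and the ranking above should be read as tentative.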

## 4: Gradient Boosting Benchmark 

In [6]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

xgb = XGBClassifier(
    n_estimators=600,
    max_depth=5,
    learning_rate=0.03,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42,
)

xgb.fit(X_train, y_train)

preds_xgb = xgb.predict(X_test)
proba_xgb = xgb.predict_proba(X_test)[:, 1]

acc_xgb = accuracy_score(y_test, preds_xgb)
auc_xgb = roc_auc_score(y_test, proba_xgb)

acc_xgb, auc_xgb

(0.6233709817549956, 0.6521137441488052)

## 5: Quick Baseline Model

In [9]:
import numpy as np 

# Baseline always picks the home team (1) to win 
baseline_preds = np.ones_like(y_test)

# Uninformative probabilities: a constant 0.5 for every game, so AUC is 0.5 by construction
baseline_proba = np.full_like(y_test, 0.5, dtype=float)

baseline_acc = accuracy_score(y_test, baseline_preds)
baseline_auc = roc_auc_score(y_test, baseline_proba)

baseline_acc, baseline_auc


(0.5699391833188532, 0.5)
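
The baseline AUC of exactly 0.5 follows from how AUC is defined: it is the probability that a randomly chosen home win receives a higher score than a randomly chosen home loss, with ties counted as half. A minimal pure-Python sketch of that definition is shown here; the function name `pairwise_auc` and the toy labels are illustrative assumptions, and the quadratic loop is meant for explanation rather than for large test sets, where `roc_auc_score` is the practical choice.

```python
def pairwise_auc(labels, scores):
    """AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels (1 = home win), not taken from the NBA data
labels = [1, 0, 1, 1, 0]

# Constant scores: every positive/negative pair is a tie
constant = pairwise_auc(labels, [0.5] * 5)

# Scores that rank every home win above every home loss
perfect = pairwise_auc(labels, [0.9, 0.2, 0.8, 0.7, 0.1])

print(constant, perfect)  # 0.5 1.0
```

Because a constant score ties every pair, the always-home baseline can only be compared with the other models on accuracy; its AUC says nothing about the home-win rate.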

In [10]:
# Create results table 
results = pd.DataFrame([
    {"model": "baseline_home_team", "acc": baseline_acc, "auc": baseline_auc},
    {"model": "log_reg_extended", "acc": acc_lr, "auc": auc_lr},
    {"model": "random_forest", "acc": acc_rf, "auc": auc_rf},
    {"model": "xgboost", "acc": acc_xgb, "auc": auc_xgb},
])

results.sort_values("auc", ascending=False)

|   | model | acc | auc |
|---|---|---|---|
| 1 | log_reg_extended | 0.625977 | 0.663412 |
| 2 | random_forest | 0.617289 | 0.653822 |
| 3 | xgboost | 0.623371 | 0.652114 |
| 0 | baseline_home_team | 0.569939 | 0.5 |
