# CBB Predictive Dashboard — ML Exploration Notebook

This notebook replicates the **data collection** and **training pipeline** of the CBB Predictive Dashboard.

You'll:
1. Collect historical game data
2. Explore feature distributions and correlations
3. Train a calibrated 2-model ensemble (Logistic Regression + XGBoost)
4. Evaluate with calibration curves, ROC-AUC, and Brier scores
5. Inspect feature importance
6. Run live inference to predict win probabilities

**Adapted from:**
- `dashboard/scripts/collect_historical_data.py`
- `dashboard/scripts/train_predictor.py`
- `dashboard/ai/predictor.py`

## 0. Setup & Install Dependencies

In [None]:
# Install required packages
!pip install cbbpy xgboost scikit-learn pandas numpy matplotlib seaborn joblib aiohttp scipy -q

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from datetime import datetime, timedelta
from collections import deque
import warnings

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import (
    accuracy_score, brier_score_loss, roc_auc_score, roc_curve,
    confusion_matrix, ConfusionMatrixDisplay
)
from xgboost import XGBClassifier, plot_importance

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All dependencies installed and imported successfully!")

## 1. Data Collection — Replicate `collect_historical_data.py`

This section fetches historical game data from cbbpy and constructs training snapshots.
Each snapshot captures the game state at a point in time during the game.
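
To make the snapshot idea concrete, here is a minimal sketch of how a play's clock and period could be turned into the time features. It assumes two 20-minute halves and an `MM:SS` clock string, and it treats an overtime period as having only its own clock left. `minutes_remaining` and `time_ratio` are hypothetical helpers for illustration, not part of cbbpy or the dashboard scripts.

```python
REGULATION_MINUTES = 40.0  # assumption: two 20-minute halves


def minutes_remaining(clock: str, period: int) -> int:
    """Whole minutes left for an 'MM:SS' clock in the given period."""
    mins = int(clock.split(":")[0])
    # Only the first half still has a full second half (20 minutes) ahead of it.
    return mins + 20 if period == 1 else mins


def time_ratio(clock: str, period: int) -> float:
    """Fraction of regulation still to play: 1.0 at tip-off, 0.0 at the buzzer."""
    return minutes_remaining(clock, period) / REGULATION_MINUTES


print(minutes_remaining("12:34", 1))  # 32
print(time_ratio("05:00", 2))         # 0.125
```

Truncating to whole minutes is what makes "one snapshot per minute" sampling possible: a new snapshot is taken whenever this value changes.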

In [None]:
async def collect_cbb_data(start_date: str, end_date: str, limit_games: int = 100):
    """
    Collect historical college basketball game data.
    
    Args:
        start_date: Start date (YYYY-MM-DD)
        end_date: End date (YYYY-MM-DD)
        limit_games: Max games to collect (to avoid long runtime)
    
    Returns:
        DataFrame with columns: game_id, home_team, away_team, score_diff, momentum,
                               strength_diff, period, mins_remaining, time_ratio, is_home_win
    """
    try:
        from cbbpy.py_ball import Play
    except ImportError:
        print("cbbpy not available. Using sample data instead.")
        return None
    
    print(f"Fetching data from {start_date} to {end_date}...")
    
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    
    all_snapshots = []
    games_collected = 0
    current = start
    
    while current <= end and games_collected < limit_games:
        date_str = current.strftime("%Y-%m-%d")
        print(f"  Processing {date_str}...", end="")
        
        try:
            # Use cbbpy to get games for this date
            import cbbpy
            games = cbbpy.get_games(date=date_str)
            
            if games is None or len(games) == 0:
                print(" (no games)")
                current += timedelta(days=1)
                continue
            
            for idx, game in games.iterrows():
                if games_collected >= limit_games:
                    break
                    
                # Only use completed games
                status = game.get('status', '') or game.get('game_status', '')
                if status != 'Final':
                    continue
                
                game_id = game.get('game_id', f"{date_str}_{idx}")
                home_team = game.get('home_team', 'Unknown')
                away_team = game.get('away_team', 'Unknown')
                home_score = int(game.get('home_score', 0) or 0)
                away_score = int(game.get('away_score', 0) or 0)
                
                is_home_win = 1 if home_score > away_score else 0
                
                try:
                    # Try to fetch play-by-play
                    pbp = cbbpy.get_pbp(game_id)
                    
                    if pbp is None or len(pbp) == 0:
                        continue
                    
                    # Create snapshots from PBP data
                    last_minute_sampled = -1
                    score_history = deque(maxlen=5)
                    
                    for _, play in pbp.iterrows():
                        try:
                            # Parse clock and period
                            clock = str(play.get('clock', '20:00') or '20:00')
                            period = int(play.get('period', 1) or 1)
                            
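                            # Convert clock to minutes remaining
                            parts = clock.split(":")
                            mins = int(parts[0]) if len(parts) > 0 else 0
                            # Only the first half has the full second half (20 min) still ahead;
                            # overtime periods (3+) must not get the extra 20 minutes
                            total_mins_remaining = mins + 20 if period == 1 else mins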
                            
                            # Score and momentum
                            score_home = int(play.get('home_score', 0) or 0)
                            score_away = int(play.get('away_score', 0) or 0)
                            current_diff = score_home - score_away
                            
                            # Sample roughly every minute
                            if total_mins_remaining != last_minute_sampled:
                                momentum = 0.0
                                if len(score_history) > 0:
                                    momentum = current_diff - score_history[0]
                                
                                all_snapshots.append({
                                    'game_id': str(game_id),
                                    'home_team': str(home_team),
                                    'away_team': str(away_team),
                                    'score_diff': float(current_diff),
                                    'momentum': float(momentum),
                                    'strength_diff': 0.0,  # Simplified for demo
                                    'period': float(period),
                                    'mins_remaining': float(total_mins_remaining),
                                    'time_ratio': float(total_mins_remaining / 40.0),
                                    'is_home_win': int(is_home_win)
                                })
                                
                                last_minute_sampled = total_mins_remaining
                                score_history.append(current_diff)
                        except Exception:
                            # Skip malformed plays (e.g. unparseable clock or score)
                            continue
                    
                    games_collected += 1
                except Exception as e:
                    pass
            
            print(f" ({games_collected} games so far)")
        except Exception as e:
            print(f" (error: {str(e)[:30]})")
        
        current += timedelta(days=1)
    
    if len(all_snapshots) == 0:
        return None
    
    df = pd.DataFrame(all_snapshots)
    print(f"\n✓ Collected {len(df)} snapshots from {games_collected} games")
    return df

# Attempt to collect real data
import asyncio
try:
    # Try fetching last 14 days of data
    end_date = datetime.now().strftime("%Y-%m-%d")
    start_date = (datetime.now() - timedelta(days=14)).strftime("%Y-%m-%d")
    
    df = await collect_cbb_data(start_date, end_date, limit_games=50)
except Exception as e:
    print(f"Real data collection failed: {e}")
    df = None

if df is None:
    print("\n⚠ Using synthetic training data instead...")
    # Generate synthetic data for demonstration
    np.random.seed(42)
    n_samples = 500

    score_diff = np.random.normal(0, 8, n_samples)
    strength_diff = np.random.normal(0, 5, n_samples)
    mins_remaining = np.random.uniform(0, 40, n_samples)
    # Toy outcome model (illustrative only): the home team wins more often
    # when it is ahead or stronger, so the models have a signal to learn
    p_home_win = 1.0 / (1.0 + np.exp(-(0.15 * score_diff + 0.05 * strength_diff)))

    df = pd.DataFrame({
        'game_id': [f'game_{i}' for i in range(n_samples)],
        'home_team': np.random.choice(['Duke', 'UNC', 'Kansas', 'UCLA', 'UK'], n_samples),
        'away_team': np.random.choice(['Duke', 'UNC', 'Kansas', 'UCLA', 'UK'], n_samples),
        'score_diff': score_diff,
        'momentum': np.random.normal(0, 3, n_samples),
        'strength_diff': strength_diff,
        # Keep the time features consistent with each other
        'period': np.where(mins_remaining > 20, 1.0, 2.0),
        'mins_remaining': mins_remaining,
        'time_ratio': mins_remaining / 40.0,
        'is_home_win': np.random.binomial(1, p_home_win)
    })
    print(f"✓ Generated {len(df)} synthetic snapshots")

print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nData types:")
print(df.dtypes)
print(f"\nMissing values:")
print(df.isnull().sum())
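
The `momentum` feature built in the collection loop is the current score differential minus the oldest of the last five sampled differentials. The sketch below reproduces that rule on a made-up series; `momentum_series` is a hypothetical helper written only for this illustration.

```python
from collections import deque


def momentum_series(diffs, window=5):
    """Momentum per sample: current diff minus the oldest diff in the window.

    The first sample has no history, so its momentum is 0.0.
    """
    history = deque(maxlen=window)
    out = []
    for diff in diffs:
        out.append(float(diff - history[0]) if history else 0.0)
        history.append(diff)  # appended after use, as in the collection loop
    return out


# Made-up score differentials (home minus away), one per sampled minute.
sampled_diffs = [0, 2, 5, 3, 3, 6, 8]
print(momentum_series(sampled_diffs))
# [0.0, 2.0, 5.0, 3.0, 3.0, 6.0, 6.0]
```

Because the deque has `maxlen=5`, the reference point only starts sliding forward once five samples have been seen, which is why the last value compares against 2 rather than 0.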

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Summary statistics
print("Dataset Summary Statistics:")
print(df.describe())
print(f"\nClass distribution (is_home_win):")
print(df['is_home_win'].value_counts())
print(f"Home win rate: {df['is_home_win'].mean():.2%}")

In [None]:
# Distribution of key features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Feature Distributions', fontsize=16, fontweight='bold')

features = ['score_diff', 'momentum', 'strength_diff', 'time_ratio', 'mins_remaining', 'period']

for idx, feature in enumerate(features):
    ax = axes[idx // 3, idx % 3]
    ax.hist(df[feature], bins=30, alpha=0.7, color='steelblue', edgecolor='black')
    ax.axvline(df[feature].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df[feature].mean():.2f}')
    ax.set_xlabel(feature, fontweight='bold')
    ax.set_ylabel('Frequency')
    ax.legend()
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()
print("✓ Feature distributions plotted")

In [None]:
# Correlation heatmap
corr_features = ['score_diff', 'momentum', 'strength_diff', 'time_ratio', 'mins_remaining', 'period', 'is_home_win']
corr_matrix = df[corr_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            cbar_kws={'label': 'Correlation'}, square=True)
plt.title('Feature Correlation Matrix', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()
print("✓ Correlation heatmap generated")
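
Each cell of the heatmap is a Pearson correlation coefficient. As a rough illustration of what `DataFrame.corr()` computes by default, the sketch below works one coefficient out by hand on made-up values. The numbers are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: score differential against the binary outcome.
score_diff = [-8, -3, 0, 4, 9, 12]
is_home_win = [0, 0, 1, 0, 1, 1]
print(f"r = {pearson(score_diff, is_home_win):.3f}")
```

With a binary target such as `is_home_win`, this coefficient only captures linear association, so a weak value in the heatmap does not rule out a useful nonlinear relationship.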

In [None]:
# Win rate by score_diff buckets
df['score_diff_bucket'] = pd.cut(df['score_diff'], bins=[-np.inf, -10, -5, 0, 5, 10, np.inf],
                                   labels=['<-10', '-10 to -5', '-5 to 0', '0 to 5', '5 to 10', '>10'])

win_by_diff = df.groupby('score_diff_bucket')['is_home_win'].agg(['mean', 'count']).reset_index()
win_by_diff.columns = ['Score Diff Bucket', 'Home Win Rate', 'Count']

fig, ax = plt.subplots(figsize=(12, 5))
bars = ax.bar(range(len(win_by_diff)), win_by_diff['Home Win Rate'], color='steelblue', alpha=0.7, edgecolor='black')

# Add count labels on bars
for i, (bar, count) in enumerate(zip(bars, win_by_diff['Count'])):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
           f'{height:.1%}\n(n={int(count)})',
           ha='center', va='bottom', fontweight='bold')

ax.set_xticks(range(len(win_by_diff)))
ax.set_xticklabels(win_by_diff['Score Diff Bucket'])
ax.set_ylabel('Home Win Rate', fontweight='bold')
ax.set_xlabel('Score Differential Bucket', fontweight='bold')
ax.set_title('Home Win Rate by Score Differential', fontweight='bold', fontsize=14)
ax.set_ylim(0, 1)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("Win Rate by Score Diff Bucket:")
print(win_by_diff.to_string(index=False))

In [None]:
# Momentum vs. outcome scatter
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Momentum vs Win
for outcome in [0, 1]:
    mask = df['is_home_win'] == outcome
    label = 'Home Win' if outcome == 1 else 'Home Loss'
    axes[0].scatter(df.loc[mask, 'momentum'], df.loc[mask, 'score_diff'], 
                    alpha=0.5, s=30, label=label)

axes[0].set_xlabel('Momentum', fontweight='bold')
axes[0].set_ylabel('Score Differential', fontweight='bold')
axes[0].set_title('Momentum vs Score Diff (colored by outcome)', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Time ratio vs win rate
df['time_ratio_bucket'] = pd.cut(df['time_ratio'], bins=5)
win_by_time = df.groupby('time_ratio_bucket', observed=True)['is_home_win'].agg(['mean', 'count']).reset_index()
time_labels = [f"{i.left:.2f}-{i.right:.2f}" for i in win_by_time['time_ratio_bucket']]

axes[1].bar(range(len(win_by_time)), win_by_time['mean'], color='coral', alpha=0.7, edgecolor='black')
axes[1].set_xticks(range(len(win_by_time)))
axes[1].set_xticklabels(time_labels, rotation=45)
axes[1].set_ylabel('Home Win Rate', fontweight='bold')
axes[1].set_xlabel('Time Ratio (0=End, 1=Start)', fontweight='bold')
axes[1].set_title('Home Win Rate by Game Progress', fontweight='bold')
axes[1].set_ylim(0, 1)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
print("✓ Momentum and time analysis plotted")

## 3. Train the Ensemble Model

Exactly replicates the training from `train_predictor.py`:
- Calibrated Logistic Regression (isotonic calibration)
- Calibrated XGBoost (isotonic calibration)
- 50/50 weighted average ensemble
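
Isotonic calibration fits a non-decreasing step function from raw model scores to observed outcomes. The sketch below shows the core idea with a pure-Python pool-adjacent-violators pass. It is a simplified illustration, not scikit-learn's implementation, which also handles ties, interpolation and cross-validation. `pav_fit` is a hypothetical name.

```python
def pav_fit(scores, labels):
    """Pool-adjacent-violators: non-decreasing fit of labels ordered by score.

    Returns the sorted scores and the fitted (calibrated) value for each one.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Each block is [sum of labels, number of points]; its mean is the fitted value.
    blocks = []
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the current one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [scores[i] for i in order], fitted


# Made-up raw scores and outcomes.
raw_scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
outcomes = [0, 1, 0, 0, 1, 1]
xs, calibrated = pav_fit(raw_scores, outcomes)
print([round(v, 3) for v in calibrated])
# [0.0, 0.333, 0.333, 0.333, 1.0, 1.0]
```

The three middle scores are pooled because their outcomes (1, 0, 0) would otherwise break monotonicity; the pooled value is their observed win rate.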

In [None]:
# Data preparation
print("Preparing data for training...")
df_clean = df.fillna(0)

# Features and target
features = ['score_diff', 'momentum', 'strength_diff', 'time_ratio', 'mins_remaining', 'period']
X = df_clean[features]
y = df_clean['is_home_win']

print(f"Features: {features}")
print(f"Target distribution: {y.value_counts().to_dict()}")
print(f"Positive class (home win): {y.mean():.2%}")

# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTraining set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
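
Because each real game contributes many snapshots, a row-level random split can place snapshots from the same game in both the training and test sets, which tends to inflate test metrics. One option is to split by `game_id` instead. The sketch below shows the idea with the standard library only; `split_by_game` is a hypothetical helper, and scikit-learn's `GroupShuffleSplit` provides the same kind of split.

```python
import random


def split_by_game(game_ids, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that no game_id appears on both sides."""
    unique_games = sorted(set(game_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_games)
    n_test = max(1, round(len(unique_games) * test_fraction))
    test_games = set(unique_games[:n_test])
    train_idx = [i for i, g in enumerate(game_ids) if g not in test_games]
    test_idx = [i for i, g in enumerate(game_ids) if g in test_games]
    return train_idx, test_idx


# Made-up snapshot rows: ten games with four snapshots each.
snapshot_game_ids = [f"game_{i // 4}" for i in range(40)]
train_idx, test_idx = split_by_game(snapshot_game_ids)
print(len(train_idx), len(test_idx))  # 32 8
```

With the notebook's data, the returned positions could be applied as `X.iloc[train_idx]` and `X.iloc[test_idx]`. In the synthetic fallback every row has its own `game_id`, so there the result is equivalent to an ordinary row-level split.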

In [None]:
# MODEL 1: Calibrated Logistic Regression
print("\n" + "="*60)
print("MODEL 1: Calibrated Logistic Regression")
print("="*60)

# Scale features for LR
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train base LR model
base_lr = LogisticRegression(random_state=42, max_iter=1000)

# Calibrate with isotonic regression (5-fold CV)
lr_model = CalibratedClassifierCV(base_lr, method='isotonic', cv=5)
lr_model.fit(X_train_scaled, y_train)

# Predictions
lr_probs = lr_model.predict_proba(X_test_scaled)[:, 1]
lr_preds = lr_model.predict(X_test_scaled)

# Metrics
lr_acc = accuracy_score(y_test, lr_preds)
lr_brier = brier_score_loss(y_test, lr_probs)
lr_auc = roc_auc_score(y_test, lr_probs)

print(f"Accuracy:    {lr_acc:.4f}")
print(f"Brier Score: {lr_brier:.4f} (lower is better)")
print(f"ROC-AUC:     {lr_auc:.4f}")

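# Feature coefficients
# CalibratedClassifierCV fits clones of base_lr, so base_lr itself stays unfitted;
# average the coefficients of the per-fold fitted clones instead
fold_coefs = [
    getattr(cc, 'estimator', getattr(cc, 'base_estimator', None)).coef_[0]
    for cc in lr_model.calibrated_classifiers_
]
print("\nFeature Coefficients (after scaling, averaged over CV folds):")
for feat, coef in zip(features, np.mean(fold_coefs, axis=0)):
    print(f"  {feat:20s}: {coef:+.4f}")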

In [None]:
# MODEL 2: Calibrated XGBoost
print("\n" + "="*60)
print("MODEL 2: Calibrated XGBoost")
print("="*60)

# Train base XGB (no scaling needed)
base_xgb = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss',  # use_label_encoder is deprecated in recent XGBoost and omitted here
    verbosity=0
)

# Calibrate with isotonic regression (5-fold CV)
xgb_model = CalibratedClassifierCV(base_xgb, method='isotonic', cv=5)
xgb_model.fit(X_train, y_train)

# Predictions
xgb_probs = xgb_model.predict_proba(X_test)[:, 1]
xgb_preds = xgb_model.predict(X_test)

# Metrics
xgb_acc = accuracy_score(y_test, xgb_preds)
xgb_brier = brier_score_loss(y_test, xgb_probs)
xgb_auc = roc_auc_score(y_test, xgb_probs)

print(f"Accuracy:    {xgb_acc:.4f}")
print(f"Brier Score: {xgb_brier:.4f} (lower is better)")
print(f"ROC-AUC:     {xgb_auc:.4f}")

In [None]:
# ENSEMBLE: Average of both models
print("\n" + "="*60)
print("ENSEMBLE: Averaged Predictions")
print("="*60)

ensemble_probs = (lr_probs + xgb_probs) / 2.0
ensemble_preds = (ensemble_probs > 0.5).astype(int)

ensemble_acc = accuracy_score(y_test, ensemble_preds)
ensemble_brier = brier_score_loss(y_test, ensemble_probs)
ensemble_auc = roc_auc_score(y_test, ensemble_probs)

print(f"Accuracy:    {ensemble_acc:.4f}")
print(f"Brier Score: {ensemble_brier:.4f} (lower is better)")
print(f"ROC-AUC:     {ensemble_auc:.4f}")

# Summary table
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'XGBoost', 'Ensemble (Avg)'],
    'Accuracy': [lr_acc, xgb_acc, ensemble_acc],
    'Brier Score': [lr_brier, xgb_brier, ensemble_brier],
    'ROC-AUC': [lr_auc, xgb_auc, ensemble_auc]
})
print(comparison_df.to_string(index=False))
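
The two quantities behind the comparison table are simple enough to compute by hand. The sketch below blends two probability lists with explicit weights and scores the result with the Brier score (mean squared error between predicted probability and the 0/1 outcome). The probabilities are made up for illustration, and both function names are hypothetical.

```python
def weighted_ensemble(lr_probs, xgb_probs, w_lr=0.5, w_xgb=0.5):
    """Blend two probability lists; weights are normalised so they sum to 1."""
    total = w_lr + w_xgb
    return [(w_lr * p + w_xgb * q) / total for p, q in zip(lr_probs, xgb_probs)]


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(y_true)


# Made-up probabilities and outcomes for four snapshots.
y = [1, 0, 1, 0]
lr = [0.80, 0.30, 0.60, 0.40]
xgb = [0.90, 0.10, 0.40, 0.20]

blend = weighted_ensemble(lr, xgb)
print([round(p, 2) for p in blend])      # [0.85, 0.2, 0.5, 0.3]
print(round(brier_score(y, blend), 4))   # 0.1006
print(brier_score(y, [0.5] * 4))         # 0.25, the score of always predicting 50%
```

The last line gives a useful reference point: a model that always outputs 50% scores exactly 0.25, so Brier scores are best read relative to that baseline.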

## 4. Model Evaluation — Calibration & Performance Curves

In [None]:
# ROC Curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve for all three models
fpr_lr, tpr_lr, _ = roc_curve(y_test, lr_probs)
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, xgb_probs)
fpr_ens, tpr_ens, _ = roc_curve(y_test, ensemble_probs)

axes[0].plot(fpr_lr, tpr_lr, label=f'LR (AUC={lr_auc:.4f})', linewidth=2)
axes[0].plot(fpr_xgb, tpr_xgb, label=f'XGB (AUC={xgb_auc:.4f})', linewidth=2)
axes[0].plot(fpr_ens, tpr_ens, label=f'Ensemble (AUC={ensemble_auc:.4f})', linewidth=2.5, color='darkgreen')
axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
axes[0].set_xlabel('False Positive Rate', fontweight='bold')
axes[0].set_ylabel('True Positive Rate', fontweight='bold')
axes[0].set_title('ROC Curves', fontweight='bold', fontsize=14)
axes[0].legend(loc='lower right')
axes[0].grid(alpha=0.3)

# Calibration Curves (Reliability Diagrams)
prob_true_lr, prob_pred_lr = calibration_curve(y_test, lr_probs, n_bins=10, strategy='uniform')
prob_true_xgb, prob_pred_xgb = calibration_curve(y_test, xgb_probs, n_bins=10, strategy='uniform')
prob_true_ens, prob_pred_ens = calibration_curve(y_test, ensemble_probs, n_bins=10, strategy='uniform')

axes[1].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Perfectly Calibrated')
axes[1].plot(prob_pred_lr, prob_true_lr, 'o-', label='LR (Calibrated)', linewidth=2, markersize=8)
axes[1].plot(prob_pred_xgb, prob_true_xgb, 's-', label='XGB (Calibrated)', linewidth=2, markersize=8)
axes[1].plot(prob_pred_ens, prob_true_ens, '^-', label='Ensemble', linewidth=2.5, color='darkgreen', markersize=8)
axes[1].set_xlabel('Mean Predicted Probability', fontweight='bold')
axes[1].set_ylabel('Fraction of Positives', fontweight='bold')
axes[1].set_title('Calibration Curves (Reliability Diagrams)', fontweight='bold', fontsize=14)
axes[1].set_xlim(-0.05, 1.05)
axes[1].set_ylim(-0.05, 1.05)
axes[1].legend(loc='upper left')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
print("✓ ROC and Calibration curves plotted")
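
A reliability diagram is built by grouping predictions into probability bins and comparing the mean prediction in each bin with the observed win rate. The sketch below does this with uniform-width bins, which is roughly what `calibration_curve(..., strategy='uniform')` computes. The data is made up and `reliability_bins` is a hypothetical helper.

```python
def reliability_bins(y_true, probs, n_bins=10):
    """Per non-empty uniform bin: (mean predicted prob, observed positive rate, count)."""
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, positives, count]
    for y, p in zip(y_true, probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in sums if n > 0]


# Made-up outcomes and predicted probabilities.
y = [0, 0, 1, 0, 1, 1, 1, 0]
p = [0.05, 0.15, 0.18, 0.45, 0.55, 0.58, 0.95, 0.92]
for mean_pred, frac_pos, n in reliability_bins(y, p, n_bins=5):
    print(f"pred={mean_pred:.2f}  observed={frac_pos:.2f}  n={n}")
```

Points where `observed` is far from `pred` sit away from the diagonal. Bins with a small `n` are noisy, so it helps to check the counts before reading much into a single point.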

In [None]:
# Confusion Matrices
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

models_info = [
    ('Logistic Regression', lr_preds, axes[0]),
    ('XGBoost', xgb_preds, axes[1]),
    ('Ensemble', ensemble_preds, axes[2])
]

for name, preds, ax in models_info:
    cm = confusion_matrix(y_test, preds)
    disp = ConfusionMatrixDisplay(cm, display_labels=['Away Win', 'Home Win'])
    disp.plot(ax=ax, cmap='Blues', values_format='d')
    ax.set_title(f'{name}', fontweight='bold')

plt.tight_layout()
plt.show()
print("✓ Confusion matrices plotted")
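
The four cells of each confusion matrix are enough to recover accuracy, precision and recall. The sketch below counts them by hand on made-up predictions, using the same layout as scikit-learn's `confusion_matrix` (rows are actual classes, columns are predicted classes, so the flattened order is tn, fp, fn, tp). `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating 1 as a home win and 0 as an away win."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up outcomes and predictions for eight snapshots.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted home wins, how many were right
recall = tp / (tp + fn)     # of actual home wins, how many were found
print(tn, fp, fn, tp)                 # 3 1 1 3
print(accuracy, precision, recall)    # 0.75 0.75 0.75
```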

## 5. Feature Importance

In [None]:
# Logistic Regression: Coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

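# LR coefficients
# base_lr is never fitted directly: CalibratedClassifierCV fits per-fold clones,
# so average the coefficients of those fitted clones
lr_coefs = np.mean(
    [getattr(cc, 'estimator', getattr(cc, 'base_estimator', None)).coef_[0]
     for cc in lr_model.calibrated_classifiers_],
    axis=0
)
coef_df = pd.DataFrame({'Feature': features, 'Coefficient': lr_coefs}).sort_values('Coefficient')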

colors = ['red' if x < 0 else 'green' for x in coef_df['Coefficient']]
axes[0].barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Coefficient Value', fontweight='bold')
axes[0].set_title('Logistic Regression Feature Coefficients', fontweight='bold', fontsize=12)
axes[0].grid(axis='x', alpha=0.3)

# Add values on bars
for i, (feat, coef) in enumerate(zip(coef_df['Feature'], coef_df['Coefficient'])):
    axes[0].text(coef, i, f' {coef:.4f}', va='center', ha='left' if coef > 0 else 'right', fontweight='bold')

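# XGBoost Feature Importance
# CalibratedClassifierCV fits clones, so base_xgb itself is unfitted;
# plot the fitted clone from the first calibration fold
first_fold = xgb_model.calibrated_classifiers_[0]
fitted_xgb = getattr(first_fold, 'estimator', getattr(first_fold, 'base_estimator', None))
plot_importance(fitted_xgb, ax=axes[1], importance_type='weight', height=0.6, title='XGBoost Feature Importance')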
axes[1].set_xlabel('Importance Score', fontweight='bold')
axes[1].set_title('XGBoost Feature Importance', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()
print("✓ Feature importance plotted")
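
Since the logistic regression is trained on standardized features, each coefficient can be read as an odds multiplier: a one-standard-deviation increase in a feature multiplies the odds of a home win by `exp(coef)` in the uncalibrated model. The sketch below illustrates this with hypothetical coefficient values, not results from this notebook. The isotonic calibration step remaps these probabilities afterwards, so the numbers are only a guide to direction and relative size.

```python
import math


def odds_ratio(coef: float) -> float:
    """Multiplicative change in odds for a +1 standard deviation feature change."""
    return math.exp(coef)


def shift_probability(p: float, coef: float) -> float:
    """Probability after a +1 standard deviation change, starting from probability p."""
    odds = p / (1.0 - p) * math.exp(coef)
    return odds / (1.0 + odds)


# Hypothetical standardized coefficients, chosen only for illustration.
example_coefs = {"score_diff": 1.2, "momentum": 0.1, "time_ratio": -0.05}
for name, coef in example_coefs.items():
    print(f"{name:12s} odds x{odds_ratio(coef):.2f}  "
          f"50% -> {shift_probability(0.5, coef):.1%}")
```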

## 6. Live Inference Demo

Test the model with various game states to see how it predicts win probability.

In [None]:
def predict_win_probability(game_state: dict) -> dict:
    """
    Make a prediction using the trained ensemble.
    
    Args:
        game_state: dict with keys: score_diff, momentum, strength_diff, time_ratio, mins_remaining, period
    
    Returns:
        dict with 'ensemble_prob', 'lr_prob', 'xgb_prob', 'prediction'
    """
    # Fill missing features with 0 without mutating the caller's dict
    state = {feat: game_state.get(feat, 0.0) for feat in features}
    
    # Prepare feature vector
    X_state = pd.DataFrame([state])[features]
    
    # LR prediction (needs scaling)
    X_state_scaled = scaler.transform(X_state)
    lr_prob = lr_model.predict_proba(X_state_scaled)[0, 1]
    
    # XGB prediction
    xgb_prob = xgb_model.predict_proba(X_state)[0, 1]
    
    # Ensemble
    ensemble_prob = (lr_prob + xgb_prob) / 2.0
    
    return {
        'lr_prob': lr_prob,
        'xgb_prob': xgb_prob,
        'ensemble_prob': ensemble_prob,
        'prediction': 'Home Win' if ensemble_prob > 0.5 else 'Away Win'
    }

# Test scenarios
scenarios = [
    {
        'name': 'Home team leading by 10 points, mid-game',
        'state': {'score_diff': 10, 'momentum': 2, 'strength_diff': 0, 'time_ratio': 0.5, 'mins_remaining': 20, 'period': 2}
    },
    {
        'name': 'Away team leading by 5, late game (5 mins left)',
        'state': {'score_diff': -5, 'momentum': -3, 'strength_diff': -2, 'time_ratio': 0.125, 'mins_remaining': 5, 'period': 2}
    },
    {
        'name': 'Tied game, very late (2 mins left)',
        'state': {'score_diff': 0, 'momentum': 1, 'strength_diff': 0, 'time_ratio': 0.05, 'mins_remaining': 2, 'period': 2}
    },
    {
        'name': 'Pre-game (no score yet)',
        'state': {'score_diff': 0, 'momentum': 0, 'strength_diff': 3, 'time_ratio': 1.0, 'mins_remaining': 40, 'period': 1}
    },
    {
        'name': 'Home blowout, early game',
        'state': {'score_diff': 15, 'momentum': 5, 'strength_diff': 2, 'time_ratio': 0.75, 'mins_remaining': 30, 'period': 1}
    }
]

print("\n" + "="*80)
print("LIVE INFERENCE DEMO")
print("="*80)

results = []
for scenario in scenarios:
    result = predict_win_probability(scenario['state'])
    results.append({
        'Scenario': scenario['name'],
        'LR Prob': f"{result['lr_prob']:.2%}",
        'XGB Prob': f"{result['xgb_prob']:.2%}",
        'Ensemble': f"{result['ensemble_prob']:.2%}",
        'Prediction': result['prediction']
    })

results_df = pd.DataFrame(results)
for idx, row in results_df.iterrows():
    print(f"\n{idx+1}. {row['Scenario']}")
    print(f"   LR: {row['LR Prob']:>7s} | XGB: {row['XGB Prob']:>7s} | Ensemble: {row['Ensemble']:>7s} → {row['Prediction']}")

In [None]:
# Visualize probability by score_diff at different game stages
score_diffs = np.linspace(-20, 20, 50)
time_ratios = [1.0, 0.75, 0.5, 0.25, 0.1]  # Fraction of regulation remaining
time_labels = ['Pre-game (40 min left)', 'Early (30 min left)', 'Mid (20 min left)', 'Late (10 min left)', 'Final (4 min left)']

fig, ax = plt.subplots(figsize=(12, 6))

colors = plt.cm.viridis(np.linspace(0, 1, len(time_ratios)))

for time_ratio, label, color in zip(time_ratios, time_labels, colors):
    probs = []
    for score_diff in score_diffs:
        state = {
            'score_diff': score_diff,
            'momentum': 0,
            'strength_diff': 0,
            'time_ratio': time_ratio,
            'mins_remaining': time_ratio * 40,
            'period': 1 if time_ratio > 0.5 else 2
        }
        result = predict_win_probability(state)
        probs.append(result['ensemble_prob'])
    
    ax.plot(score_diffs, probs, marker='o', markersize=4, linewidth=2.5, label=label, color=color)

ax.axhline(y=0.5, color='red', linestyle='--', linewidth=1, alpha=0.7, label='50% Win Prob')
ax.axvline(x=0, color='gray', linestyle=':', linewidth=1, alpha=0.5)
ax.set_xlabel('Score Differential (Home - Away)', fontweight='bold', fontsize=12)
ax.set_ylabel('Home Win Probability', fontweight='bold', fontsize=12)
ax.set_title('Win Probability Curves by Score Differential & Game Stage (Ensemble)', fontweight='bold', fontsize=14)
ax.set_ylim(-0.05, 1.05)
ax.grid(alpha=0.3)
ax.legend(loc='upper left', fontsize=10)

plt.tight_layout()
plt.show()

print("✓ Probability curves by game stage plotted")

## 7. Save the Trained Model Bundle

Create a joblib bundle that can be used in the dashboard or downloaded from Colab.

In [None]:
# Create and save the predictor bundle
bundle = {
    'lr_model': lr_model,
    'xgb_model': xgb_model,
    'scaler': scaler,
    'features': features,
    'weights': {'lr': 0.5, 'xgb': 0.5},
    'metadata': {
        'trained_at': pd.Timestamp.now().isoformat(),
        'features_used': features,
        'ensemble_brier_score': ensemble_brier,
        'ensemble_accuracy': ensemble_acc,
        'ensemble_auc': ensemble_auc,
        'training_samples': len(X_train),
        'test_samples': len(X_test)
    }
}

# Save locally in Colab
joblib.dump(bundle, 'cbb_predictor_bundle.joblib')

print("✓ Model bundle saved to 'cbb_predictor_bundle.joblib'")
print(f"\nBundle metadata:")
for key, value in bundle['metadata'].items():
    print(f"  {key}: {value}")
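
Before handing the bundle to another process, a quick structural check can catch a missing key or mismatched weights. The sketch below validates a dict with the same keys as the bundle above. `validate_bundle` is a hypothetical helper, and the demo uses placeholder objects rather than fitted models so that it runs on its own.

```python
REQUIRED_KEYS = ("lr_model", "xgb_model", "scaler", "features", "weights", "metadata")


def validate_bundle(bundle: dict) -> list:
    """Return a list of problems found; an empty list means the bundle looks usable."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in bundle]
    weights = bundle.get("weights", {})
    if set(weights) != {"lr", "xgb"}:
        problems.append("weights must have exactly the keys 'lr' and 'xgb'")
    elif abs(sum(weights.values()) - 1.0) > 1e-9:
        problems.append("weights must sum to 1")
    if not bundle.get("features"):
        problems.append("features list is empty")
    return problems


# Stand-in bundle with placeholder objects instead of fitted models.
demo_bundle = {
    "lr_model": object(),
    "xgb_model": object(),
    "scaler": object(),
    "features": ["score_diff", "momentum"],
    "weights": {"lr": 0.5, "xgb": 0.5},
    "metadata": {},
}
print(validate_bundle(demo_bundle))  # []
```

The same function could be called on the result of `joblib.load('cbb_predictor_bundle.joblib')` before using it for inference.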

In [None]:
# In Colab: Download the bundle
try:
    from google.colab import files
    print("Colab environment detected. Downloading bundle...")
    files.download('cbb_predictor_bundle.joblib')
    print("✓ Bundle downloaded!")
except ImportError:
    print("Not in Colab. Bundle saved locally as 'cbb_predictor_bundle.joblib'")

## 8. Summary & Key Takeaways

In [None]:
print("\n" + "="*80)
print("SUMMARY: CBB ML Exploration Complete")
print("="*80)

import os

# joblib has no dumps(); measure the file written by joblib.dump() instead
bundle_kb = os.path.getsize('cbb_predictor_bundle.joblib') / 1024

print(f"""
✓ DATASET
  - Total samples: {len(df)}
  - Features: {', '.join(features)}
  - Target: Home win (binary classification)
  - Class balance: {y.mean():.1%} home wins

✓ MODELS TRAINED
  - Logistic Regression (Calibrated, Isotonic)
  - XGBoost (Calibrated, Isotonic)
  - Ensemble: 50/50 weighted average

✓ PERFORMANCE (Test Set)
  Model                  Accuracy    Brier Score    ROC-AUC
  ─────────────────────  ──────────  ────────────  ─────────
  Logistic Regression    {lr_acc:7.2%}      {lr_brier:.4f}       {lr_auc:.4f}
  XGBoost                {xgb_acc:7.2%}      {xgb_brier:.4f}       {xgb_auc:.4f}
  Ensemble (Averaged)    {ensemble_acc:7.2%}      {ensemble_brier:.4f}       {ensemble_auc:.4f}

✓ CALIBRATION
  - Both models use isotonic calibration (5-fold CV)
  - Brier scores below 0.25 beat the no-skill baseline of always predicting 50%
  - Calibration curves show reliability (close to diagonal = well-calibrated)

✓ KEY FEATURES
  - score_diff: Current home-minus-away score margin
  - time_ratio: Fraction of regulation remaining
  - momentum: Recent change in score differential
  - strength_diff: Pre-game team strength
  - mins_remaining: Minutes left in the game
  - period: Game half (1st or 2nd)

✓ ARTIFACT
  - Saved: cbb_predictor_bundle.joblib
  - Size: ~{bundle_kb:.1f} KB
  - Contains: LR model, XGB model, scaler, features, weights, metadata

✓ NEXT STEPS
  1. Download bundle from Colab
  2. Replace dashboard/ai/predictor.py bundle
  3. Retrain with more recent data for better performance
  4. A/B test ensemble weights (currently 50/50)
  5. Monitor calibration curves in production
""")

print("="*80)
print("✓ Notebook Complete!")
print("="*80)