# Coach Change Prediction

## 1. Project Context
As outlined in the project overview, the goal of this notebook is to address **Core Objective: Coach Changes**. We aim to predict the set of teams that will change coaches at the end of the test season (Year 10).

## 2. Methodology
To achieve this, we will model the problem as a **Binary Classification** task:
* **Target ($y$):** `1` if a team changes its coach between the current season and the next; `0` otherwise.
* **Features ($X$):** We will utilize historical basketball data including team performance metrics, coaching history, and player aggregates.

We will follow this workflow:
1.  **Data Loading:** Ingest relational tables (`teams`, `coaches`, `players`, etc.).
2.  **Target Engineering:** Construct the labeled target variable by looking ahead one season.
3.  **Feature Engineering:** Aggregate player awards, coach tenure, and derive advanced metrics (Win %, Point Differential).
4.  **Modeling:** Train and evaluate multiple classifiers (Random Forest, Gradient Boosting, SVM, etc.).
5.  **Prediction:** Generate predictions for the final test season (Year 10).
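
The look-ahead target described in step 2 could be built along the following lines. This is a minimal sketch on toy data, not the notebook's implementation. It assumes that the highest `stint` in a team-season marks the coach who finished the season and the lowest marks the coach who started it. The column names `end_coach`, `next_start_coach`, and `changes_after_season` are hypothetical.

```python
import pandas as pd

# Toy coach records (hypothetical values); one row per coach stint in a team-season.
coaches = pd.DataFrame({
    "tmID":    ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "year":    [1,     2,     2,     3,     1,     2,     3],
    "stint":   [0,     1,     2,     0,     0,     0,     0],
    "coachID": ["c1",  "c1",  "c2",  "c2",  "c3",  "c3",  "c4"],
})

keys = ["tmID", "year"]
ordered = coaches.sort_values(keys + ["stint"])

# Assumption: the last stint is the coach who ended the season,
# and the first stint is the coach who began it.
season_end = (
    ordered.drop_duplicates(subset=keys, keep="last")[keys + ["coachID"]]
    .rename(columns={"coachID": "end_coach"})
)
season_start = (
    ordered.drop_duplicates(subset=keys, keep="first")[keys + ["coachID"]]
    .rename(columns={"coachID": "next_start_coach"})
)

# Re-key next season's starting coach onto the current season (year - 1).
next_start = season_start.assign(year=season_start["year"] - 1)

lookahead = season_end.merge(next_start, on=keys, how="left")

# Seasons without a following season cannot be labeled and are dropped.
labeled = lookahead.dropna(subset=["next_start_coach"]).copy()
labeled["changes_after_season"] = labeled["end_coach"] != labeled["next_start_coach"]

print(labeled)
```

In this toy table only team `BBB` in year 2 is labeled as a change, because a different coach starts year 3. The mid-season switch for `AAA` in year 2 is not counted, since the coach who ends year 2 also starts year 3.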

In [None]:
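import os

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

# Silence the debugger's file-validation warning. A bare Python assignment has
# no effect; the setting has to be an environment variable (and may need to be
# set before the kernel starts to be picked up).
os.environ["PYDEVD_DISABLE_FILE_VALIDATION"] = "1"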

### 2.1 Load Data Sources
We utilize the project's relational database consisting of teams, coaches, and player statistics.
* **`teams`:** Team performance per season.
* **`coaches`:** Records of coaches managing teams.
* **`teams_post`:** Post-season results.
* **`players_teams`:** Player performance stats.
* **`awards_players`:** Awards received by players.

In [None]:
teams_df = pd.read_csv("project_data/initial_data/teams.csv").sort_values(by=["year", "tmID"])
coaches_df = pd.read_csv("project_data/initial_data/coaches.csv").sort_values(by=["year", "tmID"])
teams_post_df = pd.read_csv("project_data/initial_data/teams_post.csv").sort_values(by=["year", "tmID"])
players_teams_df = pd.read_csv("project_data/initial_data/players_teams.csv").sort_values(by=["year", "tmID"])
awards_players_df = pd.read_csv("project_data/initial_data/awards_players.csv").sort_values(by=["year"])

print(f"Loaded {len(teams_df)} team-season records.")
display(teams_df.head())

print(f"Loaded {len(coaches_df)} coach records.")
display(coaches_df.head())

print(f"Loaded {len(teams_post_df)} post-season team records.")
display(teams_post_df.head())

print(f"Loaded {len(players_teams_df)} player records.")
display(players_teams_df.head())

print(f"Loaded {len(awards_players_df)} player award records.")
display(awards_players_df.head())

### 2.2 Target Variable Definition: `CoachChange`
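We need a target variable that indicates whether a team's coach was replaced in the *middle of the* season.

**Logic:**
1. Group the coach records by `year` and `tmID` and take the maximum `stint` for each team-season.
2. Set `CoachChange` to `True` when that maximum is greater than 0, which this notebook treats as a sign that more than one coach led the team during the season.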

In [None]:
target_df = coaches_df.groupby(['year', 'tmID'])['stint'].max().reset_index()
target_df['CoachChange'] = target_df['stint'] > 0

target_df = target_df[['year', 'tmID', 'CoachChange']]

print("\n--- Target Variable 'CoachChange' Created ---")
print(target_df[target_df['CoachChange'] == True])
print(f"\nTotal 'CoachChange = True' events: {int(target_df['CoachChange'].sum())}")

## 3. Feature Engineering & Data Assembly
To improve prediction accuracy, we need to move beyond raw stats. We will engineer features that reflect the "status" of the team and the coach:

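1.  **Player Talent:** We quantify the talent level by counting the number of individual awards (`awards_players`) a team's players won in a given season. A player who appears on several teams in one season is credited to the team of their last stint.
2.  **Coach and Post-season History:** We calculate `coach_tenure` (how many seasons the coach has been with the team, counting the current one) and attach the team's post-season record (`team_post_W`, `team_post_L`).
3.  **Data Merge:** We left-join these features onto the `teams` table to build a single analytical table (`final_df`), filling missing counts with 0 and missing tenure with 1.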

In [None]:
# --- 3. Engineer and Merge Feature Sets ---

# Player Awards
stats_single_team = (players_teams_df
                     .sort_values(['playerID', 'year', 'stint'])
                     .drop_duplicates(subset=['playerID', 'year'], keep='last')
                     [['playerID', 'year', 'tmID']]
)

awards_with_team_df = (
    awards_players_df
    .merge(stats_single_team, on=['playerID', 'year'])
)

awards_count_df = (
    awards_with_team_df
    .groupby(['tmID', 'year'])
    .size()
    .reset_index(name='num_player_awards')
)

# Teams Post
teams_post_features = teams_post_df[['tmID', 'year', 'W', 'L']].rename(columns={
    'W': 'team_post_W',
    'L': 'team_post_L'
})

# Coach Tenure
coaches_df = coaches_df.sort_values(['coachID', 'tmID', 'year'])
coaches_df['coach_tenure'] = coaches_df.groupby(['coachID', 'tmID']).cumcount() + 1

coach_tenure_df = coaches_df.groupby(['tmID', 'year'])['coach_tenure'].max().reset_index()

# Final DataFrame
final_df = teams_df.copy()

final_df = pd.merge(final_df, target_df, on=['tmID', 'year'], how='left')
final_df = pd.merge(final_df, teams_post_features, on=['tmID', 'year'], how='left')
final_df = pd.merge(final_df, awards_count_df, on=['tmID', 'year'], how='left')
final_df = pd.merge(final_df, coach_tenure_df, on=['tmID', 'year'], how='left')

# Clean-up
fill_zero_cols = [
    'coach_post_wins', 'coach_post_losses', 'team_post_W',
    'team_post_L', 'num_player_awards'
]

for col in fill_zero_cols:
    if col in final_df.columns:
        final_df[col] = final_df[col].fillna(0)

final_df['coach_tenure'] = final_df['coach_tenure'].fillna(1)

print("\n--- Final Assembled DataFrame ---")
print(final_df.head())
print(f"\nFinal DataFrame shape: {final_df.shape}")
print(f"Columns: {final_df.columns.to_list()}")
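
The assembly above relies on every feature table having at most one row per (`tmID`, `year`); otherwise a left merge silently duplicates team-season rows. A minimal sketch of how the `validate` argument of `pd.merge` could guard that assumption is shown below on toy frames. The data and the choice to sum duplicate rows are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

teams = pd.DataFrame({"tmID": ["AAA", "BBB"], "year": [1, 1], "won": [20, 12]})

# Feature table with an accidental duplicate key (hypothetical data).
post = pd.DataFrame({
    "tmID": ["AAA", "AAA"],
    "year": [1, 1],
    "team_post_W": [3, 2],
})

# validate="one_to_one" raises MergeError if either side has duplicate keys.
try:
    pd.merge(teams, post, on=["tmID", "year"], how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# Collapsing duplicates first (here by summing) makes the merge safe.
post_unique = post.groupby(["tmID", "year"], as_index=False)["team_post_W"].sum()
merged = pd.merge(
    teams, post_unique, on=["tmID", "year"], how="left", validate="one_to_one"
)
merged["team_post_W"] = merged["team_post_W"].fillna(0)

print(duplicate_detected)
print(merged)
```

The row count of `merged` equals the row count of `teams`, which is the property the later train/test split by year depends on.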

### 3.1 Derived Performance Metrics
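We convert raw counts into ratios and differences so that teams can be compared across seasons with different numbers of games played (`GP`):
* **Win Percentage:** $\frac{Wins}{Wins + Losses}$
* **Point Differential:** `o_pts` (Offensive Points) - `d_pts` (Defensive Points).
* **Win Percentage Change:** This season's win percentage minus the same team's win percentage in the previous season; a team's first season is compared against 0.5.
* **Talent Mismatch:** `num_player_awards` divided by (`win_pct` + 0.2), which is high for award-winning rosters with poor records.
* **Playoff Flag:** A binary indicator of whether the team qualified for the post-season (the `playoff` column, mapped from Y/N to 1/0 during feature selection).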
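
As a worked example of these metrics, the sketch below computes win percentage, point differential, and a year-over-year change in win percentage on three toy team-season rows. The numbers are hypothetical, and the first season of each team is left as `NaN` here rather than filled.

```python
import pandas as pd

# Toy team-season rows (hypothetical numbers), using the `teams` column names.
toy = pd.DataFrame({
    "tmID":  ["AAA", "AAA", "BBB"],
    "year":  [1, 2, 1],
    "won":   [24, 12, 17],
    "lost":  [8, 20, 17],
    "o_pts": [2400, 2200, 2300],
    "d_pts": [2250, 2350, 2300],
})

toy["win_pct"] = toy["won"] / (toy["won"] + toy["lost"])
toy["pt_diff"] = toy["o_pts"] - toy["d_pts"]

# Change versus the same team's previous season; NaN when no earlier season exists.
toy = toy.sort_values(["tmID", "year"])
toy["win_pct_change"] = toy["win_pct"] - toy.groupby("tmID")["win_pct"].shift(1)

print(toy[["tmID", "year", "win_pct", "pt_diff", "win_pct_change"]])
```

Team `AAA` goes from 24-8 (0.75) to 12-20 (0.375), so its change in year 2 is -0.375, the kind of drop the year-over-year feature is meant to capture.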

In [None]:
# 1. Win Pct & Diff (Standard)
final_df['win_pct'] = final_df['won'] / (final_df['won'] + final_df['lost'] + 1e-6)
final_df['pt_diff'] = final_df['o_pts'] - final_df['d_pts']

# 2. NEW: Year-over-Year Performance Drop ("The Cliff")
# Sort to ensure shift works correctly
final_df = final_df.sort_values(['tmID', 'year'])
final_df['prev_win_pct'] = final_df.groupby('tmID')['win_pct'].shift(1)
final_df['prev_win_pct'] = final_df['prev_win_pct'].fillna(0.5) # Fill first year with average
final_df['win_pct_change'] = final_df['win_pct'] - final_df['prev_win_pct']

# 3. NEW: Talent Mismatch ("Underachiever")
# High talent (awards) but low wins = High Pressure
# We add 0.2 to win_pct to avoid exploding numbers for very bad teams
final_df['talent_mismatch'] = final_df['num_player_awards'] / (final_df['win_pct'] + 0.2)

print("--- DataFrame with Engineered Features (sample) ---")
print(final_df[['tmID', 'year', 'win_pct', 'pt_diff', 'win_pct_change', 'talent_mismatch']].head())

In [None]:
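# Define target and features
train_df = final_df.dropna(subset=['CoachChange'])
# Cast to 0/1 so the target is numeric even if the left merge introduced NaN
# values (which turn the boolean column into object dtype)
y = train_df['CoachChange'].astype(int)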

# Define features to drop
features_to_drop = [
    # Targets/IDs
    'CoachChange', 'tmID', 'lgID', 'franchID', 'confID', 'divID', 'name', 'arena', 'coachID',
    
    # Raw stats replaced by derived metrics
    'year', 'won', 'lost', 'homeW', 'homeL', 'awayW', 'awayL', 'confW', 'confL', 'conf_win_pct',
    'o_pts', 'd_pts', 'pt_diff',
    'o_reb', 'd_reb', 'tmTRB', 'opptmTRB', 'tmORB', 'tmDRB', 'opptmORB', 'opptmDRB',
    'o_fta', 'o_3pa', 'd_fta', 'd_3pa', 'o_fgm', 'o_ftm', 'o_3pm', 'd_fgm', 'd_ftm', 'd_3pm',
    'min', 'made_playoffs', 'team_post_W', 'team_post_L', 'seeded',

    'o_fga', 'o_oreb', 'o_dreb', 'o_asts', 'o_pf', 'o_stl', 'o_to', 'o_blk',
    'd_fga', 'd_oreb', 'd_dreb', 'd_asts', 'd_pf', 'd_stl', 'd_to', 'd_blk',
    
    # -- Removed after Feature Correlation Matrix
    'firstRound', 'semis', 'finals',
]

# Create the feature DataFrame
X = train_df.drop(columns=features_to_drop, errors='ignore')

# Convert Y/N columns to 1/0 (columns already dropped above are skipped)
cols_to_map = ['playoff', 'firstRound', 'semis', 'finals']
for col in cols_to_map:
    if col in X.columns:
        X[col] = X[col].map({'Y': 1, 'N': 0})

# Handle missing values
X = X.fillna(0)

print("--- Final 'X' (Features) DataFrame ---")
print(f"Shape of X: {X.shape}")
print(f"Features: {X.columns.to_list()}\n")

print("--- Final 'y' (Target) Series ---")
print(f"Shape of y: {y.shape}")
print(f"Class distribution:\n{y.value_counts()}")

## 5. Exploratory Data Analysis (Correlation)
We check for multicollinearity among features. High absolute correlation (e.g., > 0.9) between features can make linear-model coefficients unstable and can split importance across the correlated features in tree models, which makes both harder to interpret.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# --- C. Check Correlation ---

# Calculate the correlation matrix
corr_matrix = X.corr().abs()

# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=False, cmap='Blues', fmt='.1f')
plt.title('Feature Correlation Matrix')
plt.show()

# List features that are highly correlated with at least one other feature.
# Keep only the upper triangle so each pair is counted once.
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with absolute correlation greater than 0.9
high_corr_features = [column for column in upper_tri.columns if any(upper_tri[column] > 0.9)]

if high_corr_features:
    print(f"\nWARNING: High correlation remaining in features: {high_corr_features}")
    print("Consider dropping one feature from each correlated pair.")
else:
    print("\nNo highly correlated (|r| > 0.9) features found. Ready for modeling.")
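
The check above names only one column from each correlated pair. A small sketch of how the pairs themselves could be listed, which makes it easier to decide which member to drop, is shown below. The toy data and the column names `feature_1`, `feature_2`, and `abs_corr` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_X = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,   # linear function of "a", so |r| is 1
    "b": rng.normal(size=200),   # independent noise
})

corr = toy_X.corr().abs()

# Boolean mask for the strict upper triangle: each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = corr.where(mask).stack().dropna().reset_index()
pairs.columns = ["feature_1", "feature_2", "abs_corr"]

high_pairs = pairs[pairs["abs_corr"] > 0.9].sort_values("abs_corr", ascending=False)
print(high_pairs)
```

Only the (`a`, `a_scaled`) pair exceeds the threshold here; the explicit `dropna()` keeps the result the same regardless of how a given pandas version treats masked entries in `stack()`.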

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer, classification_report
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.pipeline import Pipeline

def print_metrics(y_true, y_pred, model_name):
    """Prints common classification metrics and returns them as a dict."""
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    
    print(f"--- {model_name} ---")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1-Score: {f1:.4f}")
    return {"Model": model_name, "Accuracy": acc, "Precision": prec, "Recall": rec, "F1": f1}

In [None]:
import numpy as np
from sklearn.model_selection import GridSearchCV

class YearlyWalkForwardSplit:
    """
    Perform walk-forward validation based on an external year series (list/array/column).
    
    Train: All years prior to the current test year.
    Test:  The specific current test year.
    """
    def __init__(self, year_series):
        self.year_series = np.array(year_series)
        self.unique_years = np.sort(np.unique(self.year_series))
        
    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self.unique_years) - 1

    def split(self, X, y=None, groups=None):
        if len(X) != len(self.year_series):
            raise ValueError(f"Data length mismatch! X has {len(X)} rows, but year_series has {len(self.year_series)}.")

        for i in range(1, len(self.unique_years)):
            test_year = self.unique_years[i]
            
            # Train on everything strictly BEFORE the test year
            train_mask = self.year_series < test_year
            
            # Test on ONLY the current test year
            test_mask = self.year_series == test_year
            
            train_indices = np.flatnonzero(train_mask)
            test_indices = np.flatnonzero(test_mask)
            
            yield train_indices, test_indices
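
To make the fold layout concrete, the sketch below reproduces the same walk-forward rule as a plain generator and prints the folds for a toy year column. The function name `walk_forward_folds` and the `first_test_year` option are hypothetical; the option only illustrates how early folds with a very short training window could be skipped.

```python
import numpy as np

def walk_forward_folds(years, first_test_year=None):
    """Yield (test_year, train_idx, test_idx) with training on strictly earlier years.

    `first_test_year` is a hypothetical option: folds whose test year is
    earlier than it are skipped.
    """
    years = np.asarray(years)
    unique_years = np.sort(np.unique(years))
    for test_year in unique_years[1:]:
        if first_test_year is not None and test_year < first_test_year:
            continue
        train_idx = np.flatnonzero(years < test_year)
        test_idx = np.flatnonzero(years == test_year)
        yield int(test_year), train_idx, test_idx

toy_years = [1, 1, 2, 2, 3, 3, 4]

folds = list(walk_forward_folds(toy_years))
for test_year, train_idx, test_idx in folds:
    print(test_year, train_idx.tolist(), test_idx.tolist())
# 2 [0, 1] [2, 3]
# 3 [0, 1, 2, 3] [4, 5]
# 4 [0, 1, 2, 3, 4, 5] [6]

late_folds = list(walk_forward_folds(toy_years, first_test_year=3))
print([fold[0] for fold in late_folds])
# [3, 4]
```

The first fold trains on a single season. If that season contains only one class, some classifiers cannot be fitted on it, which is one reason to consider skipping the earliest folds.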

In [None]:
test_year = 10
starting_year = 3
random_state = 45

X_train = X[train_df['year'] < test_year]
y_train = y[train_df['year'] < test_year]

X_test = X[train_df['year'] == test_year]
y_test = y[train_df['year'] == test_year]

walk_forward_cv = YearlyWalkForwardSplit(train_df[train_df['year'] < test_year]['year'])
f1_scorer = make_scorer(f1_score, pos_label=1, zero_division=0) 

In [None]:
lr = LogisticRegression(solver='liblinear', max_iter=1000, random_state=42)

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', lr)
])

lr_params = {
    'lr__C': [0.01, 0.1, 1, 10, 100],
    'lr__penalty': ['l1', 'l2'],
    'lr__class_weight': [None, 'balanced']
}

lr_grid = GridSearchCV(
    estimator=lr_pipeline,
    param_grid=lr_params,
    scoring=f1_scorer,
    cv=walk_forward_cv,
    verbose=1,
    n_jobs=-1
)

lr_grid.fit(X_train, y_train)

# Report the best configuration
print(f"Best Hyperparameters: {lr_grid.best_params_}")
print(f"Best Cross-Validated F1 Score: {lr_grid.best_score_:.4f}")

In [None]:
from xgboost import XGBClassifier

xgb_params = {
  'n_estimators' : [100, 200, 500],
  'learning_rate' : [0.01, 0.05, 0.1],
  'max_depth' : [3, 4, 5, 6],
  'subsample' : [0.6, 0.8, 1.0],
  'scale_pos_weight' : [1, 10, 25],
}

xgb_classifier = XGBClassifier(random_state=42)

xgb_grid = GridSearchCV(
    estimator=xgb_classifier,
    param_grid=xgb_params,
    scoring=f1_scorer,
    cv=walk_forward_cv,
    verbose=1,
    n_jobs=-1
)

xgb_grid.fit(
  X_train,
  y_train
)

print(f"Best Hyperparameters: {xgb_grid.best_params_}")
print(f"Best Cross-Validated F1 Score: {xgb_grid.best_score_:.4f}")
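
The `scale_pos_weight` grid above uses fixed values. A common starting point for this parameter is the ratio of negative to positive training examples, so the grid could instead be centred on that ratio. The sketch below shows the arithmetic on toy labels; the labels and the grid construction are assumptions for illustration.

```python
import numpy as np

# Toy binary labels (hypothetical): 1 = coach change, 0 = no change.
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

n_pos = int((y_toy == 1).sum())
n_neg = int((y_toy == 0).sum())

# Ratio of negatives to positives: 8 / 2 = 4.0
imbalance_ratio = n_neg / n_pos

# Hypothetical grid: no reweighting, plus half, one, and two times the ratio.
scale_pos_weight_grid = sorted(
    {1.0, imbalance_ratio / 2, imbalance_ratio, imbalance_ratio * 2}
)
print(imbalance_ratio, scale_pos_weight_grid)
```

Computing the ratio from the training rows only keeps information about the test season out of the hyperparameter grid.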

In [None]:
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

mlp_params = {
  'mlp__hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
  'mlp__activation': ['relu', 'tanh'],
  'mlp__alpha': [0.0001, 0.001, 0.01],
  'mlp__learning_rate_init': [0.001, 0.01],
  'mlp__max_iter': [2000],
}

mlp_pipeline = Pipeline([
  ('scaler', StandardScaler()),
  ('mlp', MLPClassifier(random_state=42))
])

mlp_grid = GridSearchCV(
  mlp_pipeline,
  mlp_params,
  cv=walk_forward_cv,
  scoring=f1_scorer,  # same scorer as the other grids (zero_division=0)
  verbose=1,
  n_jobs=-1,
)

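# Note: this call requires a scikit-learn version whose MLPClassifier.fit
# accepts `sample_weight`; older versions raise a TypeError here.
mlp_grid.fit(
  X_train,
  y_train,
  mlp__sample_weight=sample_weights
)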

print(f"Best Hyperparameters: {mlp_grid.best_params_}")
print(f"Best Cross-Validated F1 Score: {mlp_grid.best_score_:.4f}")
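
For reference, the `'balanced'` weighting used for `sample_weights` assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic with NumPy on toy labels, so the effect on a rare positive class is visible; the labels are hypothetical.

```python
import numpy as np

# Toy labels (hypothetical): six negatives, two positives.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

classes, counts = np.unique(y_toy, return_counts=True)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
class_weights = len(y_toy) / (len(classes) * counts)
weight_by_class = dict(zip(classes.tolist(), class_weights.tolist()))

# One weight per sample, looked up from its class.
sample_weights_toy = np.array([weight_by_class[label] for label in y_toy.tolist()])

print(weight_by_class)
print(sample_weights_toy.sum())
```

Each positive receives weight 2.0 and each negative about 0.667, so both classes contribute the same total weight (4.0 each) to the loss.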

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, auc

best_estimators = {
    'Logistic Regression': lr_grid.best_estimator_,
    'XGBClassifier': xgb_grid.best_estimator_,
    'MLP': mlp_grid.best_estimator_,
}

def get_metrics_dict(y_true, y_pred, model_name, threshold=None, y_proba=None):
    metrics = {
        'Model': model_name,
        'Threshold': threshold if threshold is not None else 'Default',
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred, zero_division=0),
        'Recall': recall_score(y_true, y_pred, zero_division=0),
        'F1-Score': f1_score(y_true, y_pred, zero_division=0),
    }
    if y_proba is not None:
        try:
            metrics['ROC AUC'] = roc_auc_score(y_true, y_proba)
        except ValueError:
            metrics['ROC AUC'] = None
        
        # Calculate PR AUC
        precision, recall, _ = precision_recall_curve(y_true, y_proba)
        metrics['PR AUC'] = auc(recall, precision)

    return metrics

all_metrics = []

for name, estimator in best_estimators.items():
    if hasattr(estimator, 'predict_proba'):
        y_proba = estimator.predict_proba(X_test)[:, 1]
        thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
        for threshold in thresholds:
            y_pred_threshold = (y_proba >= threshold).astype(int)
            metrics = get_metrics_dict(y_test, y_pred_threshold, name, threshold, y_proba)
            all_metrics.append(metrics)
    else:
        # Fallback for estimators that don't have predict_proba (e.g., some SVMs)
        y_pred = estimator.predict(X_test)
        metrics = get_metrics_dict(y_test, y_pred, name)
        all_metrics.append(metrics)

metrics_df = pd.DataFrame(all_metrics)

# Apply the styling to the DataFrame, focusing on the score columns
styled_metrics_df = metrics_df.style.background_gradient(cmap='Greens', subset=['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC AUC', 'PR AUC'])
styled_metrics_df
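
The table above reports each threshold on the test season. If a single operating threshold is to be chosen, selecting it on validation folds rather than on the test season avoids tuning on the data used for the final report. The sketch below only illustrates how precision, recall, and F1 trade off as the threshold moves, using toy scores; the helper `precision_recall_f1` and the data are hypothetical.

```python
import numpy as np

def precision_recall_f1(y_true, y_score, threshold):
    """Return (precision, recall, f1) for predictions y_score >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score) >= threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels and predicted probabilities (hypothetical values).
y_true_toy = [0, 0, 0, 0, 1, 0, 1, 1]
y_score_toy = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.95]

sweep = {t: precision_recall_f1(y_true_toy, y_score_toy, t) for t in (0.3, 0.5, 0.9)}
for threshold, (precision, recall, f1) in sweep.items():
    print(f"{threshold:.1f}  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
# 0.3  P=0.600  R=1.000  F1=0.750
# 0.5  P=0.667  R=0.667  F1=0.667
# 0.9  P=1.000  R=0.333  F1=0.500
```

Lower thresholds catch every change at the cost of false alarms, while higher thresholds are precise but miss most changes. Which end is preferable depends on how the predicted set of teams will be scored.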