# March Madness Prediction V2

This notebook contains a refactored and improved version of the March Madness prediction model. Key improvements include:
1. Vectorized data preparation for faster execution.
2. Use of scikit-learn `Pipeline` for cleaner preprocessing.
3. Feature engineering (seed differentials, point differentials).
4. Component-based architecture using `shared.py`.

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import shared

%matplotlib inline

## 2. Load and Prepare Data

In [2]:
df_team = pd.read_csv("data/team_yearly_stats.csv")
df_ps_game = pd.read_csv("data/post_season_games.csv")

print(f"Team Stats Shape: {df_team.shape}")
print(f"Post Season Games Shape: {df_ps_game.shape}")

Team Stats Shape: (5300, 21)
Post Season Games Shape: (818, 7)


In [3]:
# Use the new vectorized merge function from shared.py
df_full = shared.get_team_stats_df_vectorized(df_team, df_ps_game)
print(f"Merged Data Shape: {df_full.shape}")
df_full.head()

AttributeError: module 'shared' has no attribute 'get_team_stats_df_vectorized'
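The call above fails because `shared` does not define `get_team_stats_df_vectorized`. As a rough illustration of what a vectorized merge of this kind could look like, here is a minimal sketch on toy data. It is not the `shared.py` implementation: the helper name `merge_team_stats` and the key columns `year`, `team`, `team_1`, and `team_2` are assumptions made for the example, and the real CSVs may be keyed differently.

```python
import pandas as pd

# Toy stand-ins for the two CSVs; the key column names are assumptions.
team_stats = pd.DataFrame({
    "year": [2023, 2023, 2023, 2023],
    "team": ["A", "B", "C", "D"],
    "srs": [20.1, 12.4, 8.3, 15.0],
    "sos": [9.5, 7.2, 3.1, 8.0],
})
games = pd.DataFrame({
    "year": [2023, 2023],
    "team_1": ["A", "C"],
    "team_2": ["B", "D"],
    "t1_win": [1, 0],
})


def merge_team_stats(games, team_stats):
    """Hypothetical helper: attach season stats to each game with two joins."""
    stat_cols = [c for c in team_stats.columns if c not in ("year", "team")]
    merged = games
    for side in ("1", "2"):
        # Rename "team" -> "team_1"/"team_2" and each stat -> "srs_1"/"srs_2", etc.
        renamed = team_stats.rename(
            columns={"team": f"team_{side}", **{c: f"{c}_{side}" for c in stat_cols}}
        )
        # A left join keeps every game even if a team's stats are missing.
        merged = merged.merge(renamed, on=["year", f"team_{side}"], how="left")
    return merged


merged = merge_team_stats(games, team_stats)
print(merged)
```

Two joins replace a per-row lookup loop, and the `_1`/`_2` suffixes match the column naming used in the feature engineering step.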

## 3. Feature Engineering

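We compute differentials between the two teams (seed, `srs`, `sos`, win percentage), along with each team's own scoring margin (`pt_diff_1`, `pt_diff_2`). Matchup differentials are often more predictive than either team's raw stats.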

In [None]:
def add_differentials(df):
    df = df.copy()
    df['seed_diff'] = df['team_1_seed'] - df['team_2_seed']
    df['pt_diff_1'] = df['pt_pg_1'] - df['opnt_pt_pg_1']
    df['pt_diff_2'] = df['pt_pg_2'] - df['opnt_pt_pg_2']
    df['srs_diff'] = df['srs_1'] - df['srs_2']
    df['sos_diff'] = df['sos_1'] - df['sos_2']
    df['win_pct_diff'] = df['wl_pct_1'] - df['wl_pct_2']
    return df

df_full = add_differentials(df_full)

# Add the matchup differentials to the base feature list.
# Note: pt_diff_1 and pt_diff_2 are computed above but are not added here.
features = shared.ps_feature_col_names + ['seed_diff', 'srs_diff', 'sos_diff', 'win_pct_diff']
target = 't1_win'
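One property of matchup differentials worth checking is that they flip sign when the two teams are swapped, so the same game described from either side carries the same information. The sketch below illustrates this on a single made-up game; `swap_teams` and `diffs` are hypothetical helpers written for this example only.

```python
import pandas as pd

# One made-up game: a 2 seed against a 7 seed.
game = pd.DataFrame({
    "team_1_seed": [2], "team_2_seed": [7],
    "srs_1": [18.5], "srs_2": [11.0],
})


def swap_teams(df):
    """Hypothetical helper: return the same game with team 1 and team 2 exchanged."""
    return df.rename(columns={
        "team_1_seed": "team_2_seed", "team_2_seed": "team_1_seed",
        "srs_1": "srs_2", "srs_2": "srs_1",
    })


def diffs(df):
    """Compute two of the differentials used in this notebook."""
    return pd.DataFrame({
        "seed_diff": df["team_1_seed"] - df["team_2_seed"],
        "srs_diff": df["srs_1"] - df["srs_2"],
    })


original = diffs(game)
swapped = diffs(swap_teams(game))
print(original)  # seed_diff = -5, srs_diff = 7.5
print(swapped)   # seed_diff = 5, srs_diff = -7.5
```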

## 4. Train/Test Split

In [None]:
X = df_full[features]
y = df_full[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set balance:\n{y_train.value_counts(normalize=True)}")

## 5. Modeling Pipeline

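We use a pipeline so that imputation and scaling are fit on the training data only and then applied identically at prediction time. Inside `GridSearchCV`, each fold's preprocessing is fit on that fold's training portion, which avoids leaking validation data into the preprocessing statistics.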

In [None]:
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Hyperparameter tuning for Logistic Regression
param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
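To make the leakage point concrete, the sketch below fits the same three-step pipeline on synthetic data and checks that the imputer's fill values equal the column means of the training split alone. The data from `make_classification` is a stand-in, not the tournament data, and the injected missing values are an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some missing values in the first column.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy[::10, 0] = np.nan
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# The fitted imputer stores per-column means computed from X_tr only.
imputer_means = pipe.named_steps["imputer"].statistics_
train_means = np.nanmean(X_tr, axis=0)
print(np.allclose(imputer_means, train_means))

# Test rows are transformed with those training statistics, never refit.
test_accuracy = pipe.score(X_te, y_te)
print(f"Held-out accuracy on toy data: {test_accuracy:.3f}")
```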

## 6. Evaluation

In [None]:
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
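The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The small example below, using made-up labels and scores, checks that `auc(fpr, tpr)`, `roc_auc_score`, and a direct pairwise comparison agree.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area computed from the ROC curve points, as in the cell above.
toy_fpr, toy_tpr, _ = roc_curve(y_true, y_score)
area = auc(toy_fpr, toy_tpr)

# Fraction of (positive, negative) pairs where the positive scores higher.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(area, roc_auc_score(y_true, y_score), pairwise)  # 0.75 0.75 0.75
```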

## 7. Feature Importance

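Let's see which features the model relies on most. Because the pipeline standardizes every feature, the absolute value of each logistic regression coefficient is a rough, comparable measure of influence; the sign (dropped here) indicates direction.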

In [None]:
coeffs = best_model.named_steps['classifier'].coef_[0]
feature_importance = pd.DataFrame({'feature': features, 'importance': np.abs(coeffs)})
feature_importance = feature_importance.sort_values(by='importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=feature_importance.head(20))
plt.title('Top 20 Features by Importance (Absolute Logistic Regression Coefficients)')
plt.show()
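Ranking by absolute value hides whether a feature pushes the prediction toward or away from a team 1 win. As one possible variation, the sketch below keeps the signed coefficients while still ordering by magnitude. It uses synthetic data with hypothetical feature names (`helps_team_1`, `hurts_team_1`, `noise`), not the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature raises the odds of the label, one lowers them.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "helps_team_1": rng.normal(size=n),
    "hurts_team_1": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
signal = toy["helps_team_1"] - toy["hurts_team_1"] + 0.5 * rng.normal(size=n)
label = (signal > 0).astype(int)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(toy, label)

# Keep the sign, but order rows by absolute magnitude.
coefs = pd.Series(model.named_steps["classifier"].coef_[0], index=toy.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Plotting `ranked` instead of the absolute values would show direction as well as strength.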

## 8. Build Your Bracket

We have trained our model; now experiment with it to produce your bracket!

In [None]:

shared.evaluate_tournament(df_team, best_model, features=features)
