# March Madness Prediction

## Overview

### Goal
Submissions are scored with the Brier score, so the goal is to minimize the Brier score between the predicted probabilities and the actual game outcomes. The Brier score measures the accuracy of probabilistic predictions; for binary outcomes it is the mean squared error of the predicted probabilities.

The Brier score can be thought of as a cost function: the average squared difference between the predicted probabilities and the actual outcomes.

$$
Brier = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2
$$

where $p_i$ is the predicted probability for game $i$, $o_i$ is the actual outcome (1 if the event occurred, 0 otherwise), and $N$ is the number of predictions being scored.

Therefore, minimizing the Brier score will result in a more accurate prediction.
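
As a quick illustration of the formula above, here is a minimal sketch that computes the Brier score by hand with NumPy. The helper name `brier_score` and the toy probabilities are made up for this example; they are not part of the competition data.

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return float(np.mean((p - o) ** 2))

# Toy data (illustrative only): three games, the event occurred in the first two.
outcomes = [1, 1, 0]
confident = [0.9, 0.8, 0.1]   # leans the right way on every game
hedged = [0.5, 0.5, 0.5]      # never commits

confident_score = brier_score(confident, outcomes)  # (0.01 + 0.04 + 0.01) / 3
hedged_score = brier_score(hedged, outcomes)        # 0.25 for any outcomes when p = 0.5
print(confident_score, hedged_score)
```

Always predicting 0.5 gives 0.25 regardless of the outcomes, which makes it a useful baseline to compare a model against.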


In [5]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.subplots as sp
import xgboost as xgb
import sklearn as sk
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import log_loss, mean_absolute_error, brier_score_loss

# Style
plt.style.use("dark_background")
px.defaults.template = 'plotly_dark'

## Load Data

Set up a data dictionary that stores one DataFrame per CSV file, keyed by file name, which makes the data easier to access. Not all files are used in the prediction process, but they are loaded for completeness.

Additionally, I create a sample submission DataFrame that will be populated with the model's predictions later on.


In [9]:
# Load Data
import os

# Load Data - Use forward slashes or os.path.join for cross-platform compatibility
path = '../data/*.csv'
try:
    csv_files = glob.glob(path)
    print(f"Found {len(csv_files)} CSV files")
    if not csv_files:
        # Try alternative path for Windows
        path = '../data\\*.csv'
        csv_files = glob.glob(path)
        print(f"Found {len(csv_files)} CSV files with Windows path")
    
    data = {}
    for p in csv_files:
        # Extract filename without directory or extension
        filename = os.path.splitext(os.path.basename(p))[0]
        print(f"Loading {filename}...")
        data[filename] = pd.read_csv(p, encoding='latin-1')
    
    print(f"Loaded data files: {list(data.keys())}")
    
except Exception as e:
    print(f"Error loading files: {e}")

# Create Teams Dataframe
teams = pd.concat([data['MTeams'], data['WTeams']])
teams_spelling = pd.concat([data['MTeamSpellings'], data['WTeamSpellings']])
teams_spelling = teams_spelling.groupby(by='TeamID', as_index=False)['TeamNameSpelling'].count()
teams_spelling.columns = ['TeamID', 'TeamNameCount']
teams = pd.merge(teams, teams_spelling, how='left', on=['TeamID'])

# Create Season Dataframes and S/T Flag
season_compact_results = pd.concat([data['MRegularSeasonCompactResults'], data['WRegularSeasonCompactResults']]).assign(ST='S')
season_detailed_results = pd.concat([data['MRegularSeasonDetailedResults'], data['WRegularSeasonDetailedResults']]).assign(ST='S')
tourney_compact_results = pd.concat([data['MNCAATourneyCompactResults'], data['WNCAATourneyCompactResults']]).assign(ST='T')
tourney_detailed_results = pd.concat([data['MNCAATourneyDetailedResults'], data['WNCAATourneyDetailedResults']]).assign(ST='T')

# Create Tourney Dataframes
slots = pd.concat([data['MNCAATourneySlots'], data['WNCAATourneySlots']])
seeds = pd.concat([data['MNCAATourneySeeds'], data['WNCAATourneySeeds']])
seeds['SeedValue'] = seeds['Seed'].str.extract(r'(\d+)').astype(int)
seeds_dict = {'_'.join(map(str,[int(k1),k2])):int(v[1:3]) for k1, v, k2 in seeds[['Season', 'Seed', 'TeamID']].values}
game_cities = pd.concat([data['MGameCities'], data['WGameCities']])
seasons = pd.concat([data['MSeasons'], data['WSeasons']])
cities = data['Cities']

# Create Sample Submission Dataframe
sub = data['SampleSubmissionStage1']
del data

# Seeds Dictionary: reuse the "Season_TeamID" -> seed number lookup built above
seeds = seeds_dict

Found 36 CSV files
Loading Cities...
Loading Conferences...
Loading MConferenceTourneyGames...
Loading MGameCities...
Loading MMasseyOrdinals...
Loading MNCAATourneyCompactResults...
Loading MNCAATourneyDetailedResults...
Loading MNCAATourneySeedRoundSlots...
Loading MNCAATourneySeeds...
Loading MNCAATourneySlots...
Loading MRegularSeasonCompactResults...
Loading MRegularSeasonDetailedResults...
Loading MSeasons...
Loading MSecondaryTourneyCompactResults...
Loading MSecondaryTourneyTeams...
Loading MTeamCoaches...
Loading MTeamConferences...
Loading MTeams...
Loading MTeamSpellings...
Loading SampleSubmissionStage1...
Loading SampleSubmissionStage2...
Loading SeedBenchmarkStage1...
Loading WConferenceTourneyGames...
Loading WGameCities...
Loading WNCAATourneyCompactResults...
Loading WNCAATourneyDetailedResults...
Loading WNCAATourneySeeds...
Loading WNCAATourneySlots...
Loading WRegularSeasonCompactResults...
Loading WRegularSeasonDetailedResults...
Loading WSeasons...
Loading WSecond
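
To make the seed handling in the loading cell concrete, here is a small sketch on toy data. It assumes seed strings have the form region letter, two-digit seed, and an optional play-in suffix (for example `W01` or `X16a`); the team IDs and the names `seeds_demo` and `seed_lookup` are hypothetical.

```python
import pandas as pd

# Toy seeds table (illustrative values only).
seeds_demo = pd.DataFrame({
    "Season": [2023, 2023, 2023],
    "Seed": ["W01", "X16a", "Z11b"],
    "TeamID": [1101, 1102, 1103],
})

# Pull out the numeric part of the seed; expand=False returns a Series.
seeds_demo["SeedValue"] = (
    seeds_demo["Seed"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Build a "Season_TeamID" -> seed number lookup, like the seeds dictionary above.
seed_lookup = {
    f"{season}_{team}": value
    for season, team, value in zip(
        seeds_demo["Season"], seeds_demo["TeamID"], seeds_demo["SeedValue"]
    )
}
print(seed_lookup)
```

Teams that are missing from the lookup can then be given a default seed with `.map(seed_lookup).fillna(0)`, which is what the feature engineering cells do.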

## Feature Engineering


Here we concatenate the regular season and tournament detailed results into a single DataFrame and add the features that will be used in the model.

These include identifiers that put each matchup in a canonical order (lower TeamID first), the tournament seeds, and derived features such as the score difference, home advantage, shooting percentages, and box-score differentials.

In [10]:
# Create Detailed Results Dataframe
all_detailed_results = pd.concat([season_detailed_results, tourney_detailed_results])
all_detailed_results.reset_index(drop=True, inplace=True)
all_detailed_results['WLoc'] = all_detailed_results['WLoc'].map({'A': 1, 'H': 2, 'N': 3})

# Add additional features to detailed results
all_detailed_results['ID'] = all_detailed_results.apply(lambda r: '_'.join(map(str, [r['Season']]+sorted([r['WTeamID'],r['LTeamID']]))), axis=1)
all_detailed_results['IDTeams'] = all_detailed_results.apply(lambda r: '_'.join(map(str, sorted([r['WTeamID'],r['LTeamID']]))), axis=1)
all_detailed_results['Team1'] = all_detailed_results.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[0], axis=1)
all_detailed_results['Team2'] = all_detailed_results.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[1], axis=1)
all_detailed_results['IDTeam1'] = all_detailed_results.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)
all_detailed_results['IDTeam2'] = all_detailed_results.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)

all_detailed_results['Team1Seed'] = all_detailed_results['IDTeam1'].map(seeds).fillna(0)
all_detailed_results['Team2Seed'] = all_detailed_results['IDTeam2'].map(seeds).fillna(0)

all_detailed_results['ScoreDiff'] = all_detailed_results['WScore'] - all_detailed_results['LScore']
all_detailed_results['Pred'] = all_detailed_results.apply(lambda r: 1. if sorted([r['WTeamID'],r['LTeamID']])[0]==r['WTeamID'] else 0., axis=1)
all_detailed_results['ScoreDiffNorm'] = all_detailed_results.apply(lambda r: r['ScoreDiff'] * -1 if r['Pred'] == 0. else r['ScoreDiff'], axis=1)
all_detailed_results['SeedDiff'] = all_detailed_results['Team1Seed'] - all_detailed_results['Team2Seed'] 
all_detailed_results = all_detailed_results.fillna(-1)

# Add derived features to detailed results (ScoreDiff was already computed above)
all_detailed_results['HomeAdvantage'] = (all_detailed_results['WLoc'] == 2).astype(int)

# Calculate shooting percentages (handling division by zero)
all_detailed_results['WFGPct'] = np.where(all_detailed_results['WFGA'] > 0, 
                                        all_detailed_results['WFGM'] / all_detailed_results['WFGA'], 0)
all_detailed_results['WFG3Pct'] = np.where(all_detailed_results['WFGA3'] > 0, 
                                        all_detailed_results['WFGM3'] / all_detailed_results['WFGA3'], 0)
all_detailed_results['WFTPct'] = np.where(all_detailed_results['WFTA'] > 0, 
                                        all_detailed_results['WFTM'] / all_detailed_results['WFTA'], 0)
all_detailed_results['LFGPct'] = np.where(all_detailed_results['LFGA'] > 0, 
                                        all_detailed_results['LFGM'] / all_detailed_results['LFGA'], 0)
all_detailed_results['LFG3Pct'] = np.where(all_detailed_results['LFGA3'] > 0, 
                                        all_detailed_results['LFGM3'] / all_detailed_results['LFGA3'], 0)
all_detailed_results['LFTPct'] = np.where(all_detailed_results['LFTA'] > 0, 
                                        all_detailed_results['LFTM'] / all_detailed_results['LFTA'], 0)

# Add statistical differences
all_detailed_results['ReboundDiff'] = (all_detailed_results['WOR'] + all_detailed_results['WDR']) - \
                                    (all_detailed_results['LOR'] + all_detailed_results['LDR'])
all_detailed_results['AssistDiff'] = all_detailed_results['WAst'] - all_detailed_results['LAst']
all_detailed_results['TurnoverDiff'] = all_detailed_results['WTO'] - all_detailed_results['LTO']
all_detailed_results['StealDiff'] = all_detailed_results['WStl'] - all_detailed_results['LStl']
all_detailed_results['BlockDiff'] = all_detailed_results['WBlk'] - all_detailed_results['LBlk']
all_detailed_results['FoulDiff'] = all_detailed_results['WPF'] - all_detailed_results['LPF']
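
The row-wise `apply` calls above can be slow on the full results table. As a hedged alternative, the sketch below builds the same canonical ordering (lower TeamID as `Team1`), the `Pred` target, and the signed score difference with vectorized operations. The toy frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy results (illustrative only): in the first game the higher TeamID won.
demo = pd.DataFrame({
    "Season": [2023, 2023],
    "WTeamID": [1200, 1105],
    "LTeamID": [1101, 1300],
    "WScore": [70, 65],
    "LScore": [60, 64],
})

# Canonical ordering: Team1 is always the lower TeamID.
demo["Team1"] = np.minimum(demo["WTeamID"], demo["LTeamID"])
demo["Team2"] = np.maximum(demo["WTeamID"], demo["LTeamID"])

# Target: 1.0 if Team1 (the lower ID) won, else 0.0.
demo["Pred"] = (demo["WTeamID"] == demo["Team1"]).astype(float)

# Score difference from Team1's point of view.
score_diff = demo["WScore"] - demo["LScore"]
demo["ScoreDiffNorm"] = np.where(demo["Pred"] == 1.0, score_diff, -score_diff)

# Matchup ID in the same Season_Team1_Team2 layout as the submission file.
demo["ID"] = (
    demo["Season"].astype(str) + "_"
    + demo["Team1"].astype(str) + "_"
    + demo["Team2"].astype(str)
)
print(demo[["ID", "Pred", "ScoreDiffNorm"]])
```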

Next we set up the data for the model. We group the detailed results by `IDTeams` (the matchup, regardless of season) and aggregate the box-score columns. We then derive the same key columns for the sample submission DataFrame and merge the aggregates into both the games and the submission rows.

In [11]:
c_score_col = ['NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl',
 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl',
 'LBlk', 'LPF']
c_score_agg = ['sum', 'mean', 'median', 'max', 'min', 'std', 'skew', 'nunique']
gb = all_detailed_results.groupby(by=['IDTeams']).agg({k: c_score_agg for k in c_score_col}).reset_index()
gb.columns = [''.join(c) + '_c_score' for c in gb.columns]

sub['WLoc'] = 3
sub['Season'] = sub['ID'].map(lambda x: x.split('_')[0]).astype(int)
sub['Team1'] = sub['ID'].map(lambda x: x.split('_')[1])
sub['Team2'] = sub['ID'].map(lambda x: x.split('_')[2])
sub['IDTeams'] = sub.apply(lambda r: '_'.join(map(str, [r['Team1'], r['Team2']])), axis=1)
sub['IDTeam1'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)
sub['IDTeam2'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)
sub['Team1Seed'] = sub['IDTeam1'].map(seeds).fillna(0)
sub['Team2Seed'] = sub['IDTeam2'].map(seeds).fillna(0)
sub['SeedDiff'] = sub['Team1Seed'] - sub['Team2Seed'] 
sub = sub.fillna(-1)

games = pd.merge(all_detailed_results, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')
sub = pd.merge(sub, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')

col = [c for c in games.columns if c not in ['ID', 'DayNum', 'ST', 'Team1', 'Team2', 'IDTeams', 'IDTeam1', 'IDTeam2', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'NumOT', 'Pred', 'ScoreDiff', 'ScoreDiffNorm', 'WLoc'] + c_score_col]
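
The column naming produced by the aggregation above can be hard to picture, so here is a small sketch on toy data. It uses only two statistics and two aggregations; the values are invented, and the names `demo` and `agg` are placeholders.

```python
import pandas as pd

# Toy box scores (illustrative only): one matchup played twice, one played once.
demo = pd.DataFrame({
    "IDTeams": ["1101_1200", "1101_1200", "1105_1300"],
    "WFGM": [25, 31, 28],
    "LFGM": [20, 22, 27],
})

# Aggregating with a dict of lists yields two-level columns such as ("WFGM", "mean").
agg = (
    demo.groupby("IDTeams")
    .agg({c: ["mean", "max"] for c in ["WFGM", "LFGM"]})
    .reset_index()
)

# Flatten the two levels into single names; the key becomes "IDTeams_c_score".
agg.columns = ["".join(c) + "_c_score" for c in agg.columns]
print(agg)
```

With the full aggregation list, matchups that were played only once get `NaN` for `std` and `skew`, which is one reason missing values have to be filled before training.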

## Model Training

Now we can begin training the model. We will use a simple XGBoost model to predict the outcome of the game. We will also use a simple imputer to fill in the missing values and a standard scaler to scale the data.
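
How the imputer and the scaler fit together can be sketched on toy data. The example below chains them with scikit-learn's `make_pipeline`, which is a convenience chosen for this sketch and not what the notebook itself does; the arrays are invented. The point is that both steps learn their statistics from the training data and reuse them on new rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy features (illustrative only) with one missing value in each array.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_new = np.array([[np.nan, 20.0]])

# Mean imputation followed by standardization.
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

X_train_scaled = prep.fit_transform(X_train)  # learns column means and scales
X_new_scaled = prep.transform(X_new)          # reuses the training statistics

# The missing value is replaced by the training mean (3.0), and 20.0 equals the
# imputed training mean of the second column, so both entries scale to 0.
print(X_new_scaled)
```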

In [None]:
from sklearn.metrics import brier_score_loss, log_loss

imputer = SimpleImputer(strategy='mean')  
scaler = StandardScaler()

X = games[col].fillna(-1)
missing_cols = set(col) - set(sub.columns)
for c in missing_cols:
    sub[c] = 0

X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)

# Fixed: Use XGBClassifier for probability prediction
model = xgb.XGBClassifier(
    n_estimators=3000,         # Reduced from 5000 for faster training
    learning_rate=0.05,        # Learning rate for gradient boosting
    max_depth=6,               # Control model complexity
    min_child_weight=3,        # Helps prevent overfitting
    subsample=0.8,             # Use 80% of data for each tree
    colsample_bytree=0.8,      # Use 80% of features for each tree
    objective='binary:logistic', # Proper binary classification objective
    random_state=42,
    n_jobs=-1,                 # Use all CPU cores
    eval_metric='logloss'      # Use log loss for evaluation
)

# Implement time-aware cross-validation
def time_aware_cv(model, X, y, seasons, n_splits=5):
    """
    Perform time-aware cross-validation where we train on earlier seasons
    and validate on later seasons to prevent data leakage.
    """
    unique_seasons = sorted(seasons.unique())
    cv_scores = []
    
    # Calculate split points for time-based CV
    season_splits = []
    split_size = len(unique_seasons) // (n_splits + 1)
    
    for i in range(n_splits):
        train_end_idx = split_size * (i + 2)  # End of training seasons
        val_start_idx = train_end_idx
        val_end_idx = min(train_end_idx + split_size, len(unique_seasons))
        
        if val_start_idx >= len(unique_seasons):
            break  # no seasons left to validate on
            
        train_seasons = unique_seasons[:train_end_idx]
        val_seasons = unique_seasons[val_start_idx:val_end_idx]
        
        season_splits.append((train_seasons, val_seasons))
    
    print(f"Time-aware CV with {len(season_splits)} splits:")
    
    for i, (train_seasons, val_seasons) in enumerate(season_splits):
        # Create train/val masks
        train_mask = seasons.isin(train_seasons)
        val_mask = seasons.isin(val_seasons)
        
        if train_mask.sum() == 0 or val_mask.sum() == 0:
            continue
            
        X_train_fold, X_val_fold = X[train_mask], X[val_mask]
        y_train_fold, y_val_fold = y[train_mask], y[val_mask]
        
        # Fit model on training fold
        model.fit(X_train_fold, y_train_fold)
        
        # Predict on validation fold
        y_pred_proba = model.predict_proba(X_val_fold)[:, 1]
        y_pred_proba = np.clip(y_pred_proba, 0.001, 0.999)
        
        # Calculate metrics
        brier = brier_score_loss(y_val_fold, y_pred_proba)
        logloss = log_loss(y_val_fold, y_pred_proba)
        
        cv_scores.append(brier)
        
        print(f"  Fold {i+1}: Train seasons {train_seasons[0]}-{train_seasons[-1]} | "
              f"Val seasons {val_seasons[0]}-{val_seasons[-1]} | "
              f"Brier: {brier:.4f} | LogLoss: {logloss:.4f}")
    
    return cv_scores

# Perform time-aware cross-validation
print("\nPerforming time-aware cross-validation...")
cv_brier_scores = time_aware_cv(model, X_scaled, games['Pred'], games['Season'])

print(f"\nTime-aware CV Results:")
print(f"Mean Brier Score: {np.mean(cv_brier_scores):.4f} ± {np.std(cv_brier_scores):.4f}")

# Train the final model on all data
print("\nTraining final model on all data...")
model.fit(X_scaled, games['Pred'])

# Get predictions with proper probability output
pred_proba = model.predict_proba(X_scaled)[:, 1]  # Get probability of class 1
pred_proba = np.clip(pred_proba, 0.001, 0.999)

print(f'\nFinal Model Performance on Training Data:')
print(f'Log Loss: {log_loss(games["Pred"], pred_proba):.4f}')
print(f'Mean Absolute Error: {mean_absolute_error(games["Pred"], pred_proba):.4f}')
print(f'Brier Score: {brier_score_loss(games["Pred"], pred_proba):.4f}')
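
The idea behind `time_aware_cv` can be sketched without any model: split the sorted seasons into blocks, then train on everything before a block and validate on that block. The helper `season_splits` below is a hypothetical, simplified variant (training starts with one block of seasons instead of two) and the season range is made up.

```python
def season_splits(seasons, n_splits=5):
    """Return (train_seasons, val_seasons) pairs with an expanding training window."""
    unique = sorted(set(seasons))
    step = len(unique) // (n_splits + 1)
    if step == 0:
        raise ValueError("not enough seasons for the requested number of splits")
    splits = []
    for i in range(1, n_splits + 1):
        train = unique[: step * i]              # every season before the block
        val = unique[step * i: step * (i + 1)]  # the next block of seasons
        if val:
            splits.append((train, val))
    return splits

# Toy example: 12 seasons and 3 splits give blocks of 3 seasons.
splits = season_splits(range(2003, 2015), n_splits=3)
for train, val in splits:
    print(f"train {train[0]}-{train[-1]} | val {val[0]}-{val[-1]}")
```

Because validation seasons always come after the training seasons, no fold is scored on games that predate its training data.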



### Old Model

Training main model...
Log Loss: 0.2801704227848632
Mean Absolute Error: 0.22647324770944763
Brier Score: 0.07499056544694568
Cross-validated MSE: 0.21594714085560499

In [14]:
# Fill in missing values for submission data
sub_X = sub[col].fillna(-1)

print("Shape of sub_X:", sub_X.shape)
print("Creating imputed data...")

# Transform using fitted imputer and scaler
sub_X_imputed = imputer.transform(sub_X)
print("Scaling data...")
sub_X_scaled = scaler.transform(sub_X_imputed)

print("Making predictions...")
# Use predict_proba for proper probability output with XGBClassifier
predictions = model.predict_proba(sub_X_scaled)[:, 1]  # Get probability of class 1
predictions = np.clip(predictions, 0.001, 0.999)
print("Prediction shape:", predictions.shape)

# Assign to dataframe
sub['Pred'] = predictions
print("Saving to CSV...")
sub[['ID', 'Pred']].to_csv('submission.csv', index=False)
print("Done!")

Shape of sub_X: (507108, 234)
Creating imputed data...


KeyboardInterrupt: 