# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict that probability that the smaller ``TeamID`` will win a given matchup. You will predict the probability for every possible matchup between every possible team over the past 4 years. You'll be given a sample submission file where the ```ID``` value indicates the year of the matchup as well as the identities of both teams within the matchup. For example, for an ```ID``` of ```2025_1101_1104``` you would need to predict the outcome of the matchup between ```TeamID 1101``` vs ```TeamID 1104``` during the ```2025``` tournament. Submitting a ```PRED``` of ```0.75``` indicates that you think that the probability of ```TeamID 1101``` winning that particular matchup is equal to ```0.75```.


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We can predict the winner of a match by considering the respective rankings of the opposing teams, only. Since the largest possible difference is 15 (which is #16 minus #1), we use a rudimentary formula that's 0.5 plus 0.03 times the difference in seeds, leading to a range of predictions spanning from 5% up to 95%. The stronger-seeded team (with a lower seed number from 1 to 16) will be the favorite and will have a prediction above 50%. 

# Starter Code

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import brier_score_loss

# ======================================================================
# 1. Data Loading & Processing
# ======================================================================

# Load available competition data
m_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv')
w_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv')
tourney_results = pd.concat([
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneyCompactResults.csv'),
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneyCompactResults.csv')
])
submission = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage1.csv')

# ======================================================================
# 2. Robust Seed Processing
# ======================================================================

def parse_seed(seed):
    """Extract numeric value from any seed format"""
    try:
        return int(''.join(filter(str.isdigit, str(seed))))
    except:
        return 16  # Default for missing/invalid seeds

# Process and combine seeds
seed_df = pd.concat([m_seed, w_seed])
seed_df['SeedValue'] = seed_df['Seed'].apply(parse_seed)
seed_map = seed_df.set_index(['Season', 'TeamID'])['SeedValue']

# ======================================================================
# 3. Training Data Preparation
# ======================================================================

# Create training examples from historical tournament results
train_data = []
for _, row in tourney_results.iterrows():
    season = row['Season']
    team1, team2 = sorted([row['WTeamID'], row['LTeamID']])
    outcome = 1 if row['WTeamID'] == team1 else 0
    
    try:
        seed_diff = seed_map.at[(season, team2)] - seed_map.at[(season, team1)]
    except KeyError:
        continue  # Skip missing team data
    
    train_data.append({
        'Season': season,
        'SeedDiff': seed_diff,
        'Outcome': outcome
    })

train_df = pd.DataFrame(train_data)

# ======================================================================
# 4. Model Training with Temporal Validation
# ======================================================================

X = train_df[['SeedDiff']]
y = train_df['Outcome']

# Initialize model
model = HistGradientBoostingClassifier(
    max_iter=200,
    early_stopping=True,
    random_state=42
)

# Time-based cross-validation
best_score = float('inf')
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    model.fit(X_train, y_train)
    val_preds = model.predict_proba(X_val)[:, 1]
    score = brier_score_loss(y_val, val_preds)
    
    if score < best_score:
        best_model = model
        best_score = score

# ======================================================================
# 5. Submission Processing & Prediction
# ======================================================================

# Split ID column with proper handling
split_df = (
    submission['ID']
    .str.split('_', expand=True)
    .iloc[:, :3]
    .rename(columns={0: 'Season', 1: 'Team1', 2: 'Team2'})
    .astype({'Season': 'int16', 'Team1': 'int16', 'Team2': 'int16'})
)
submission = pd.concat([submission, split_df], axis=1)

# Calculate seed differences
def get_seed_diff(row):
    try:
        return (
            seed_map.at[(row['Season'], row['Team2'])] -
            seed_map.at[(row['Season'], row['Team1'])]
        )
    except KeyError:
        return 0  # Neutral prediction for missing teams

submission['SeedDiff'] = submission.apply(get_seed_diff, axis=1)

# Generate predictions
submission['Pred'] = best_model.predict_proba(submission[['SeedDiff']])[:, 1]
submission['Pred'] = submission['Pred'].clip(0.05, 0.95)

# ======================================================================
# 6. Final Validation & Output
# ======================================================================

print("Model Validation Results:")
print(f"Best Validation Brier Score: {best_score:.5f}")
print(f"Submission Stats - Mean: {submission['Pred'].mean():.4f}")
print(f"Prediction Range: [{submission['Pred'].min():.2f}, {submission['Pred'].max():.2f}]")

submission[['ID', 'Pred']].to_csv('final_submission.csv', index=False)
print("\nSubmission file successfully created!")

Model Validation Results:
Best Validation Brier Score: 0.15627
Submission Stats - Mean: 0.5001
Prediction Range: [0.05, 0.95]

Submission file successfully created!
