# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict the probability that the team with the smaller `TeamID` will win a given matchup. You will predict this probability for every possible matchup between every pair of teams over the past 4 years. You'll be given a sample submission file in which the `ID` value encodes the year of the matchup and the identities of both teams. For example, for an `ID` of `2025_1101_1104` you would need to predict the outcome of the matchup between `TeamID 1101` and `TeamID 1104` during the `2025` tournament. Submitting a `Pred` of `0.75` indicates that you think the probability of `TeamID 1101` winning that matchup is `0.75`.
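
As a minimal sketch of the `ID` format described above, the helper below splits an `ID` into its season and two team IDs. The function name `parse_matchup_id` is hypothetical, and the check that the smaller `TeamID` comes first is an assumption based on the example `2025_1101_1104`.

```python
def parse_matchup_id(matchup_id: str) -> tuple[int, int, int]:
    """Split an ID like '2025_1101_1104' into (season, low_team_id, high_team_id)."""
    season, low_id, high_id = (int(part) for part in matchup_id.split("_"))
    # Assumption: the smaller TeamID is always listed first, as in the example ID.
    if low_id >= high_id:
        raise ValueError(f"expected the smaller TeamID first, got {matchup_id!r}")
    return season, low_id, high_id


season, low_id, high_id = parse_matchup_id("2025_1101_1104")
print(season, low_id, high_id)  # 2025 1101 1104
```

The predicted value for this row is then the probability that `low_id` (here `1101`) beats `high_id`.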


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We can predict the winner of a match from the tournament seeds of the two teams alone. Since the largest possible seed difference is 15 (#16 minus #1), a rudimentary formula of 0.5 plus 0.03 times the difference in seeds yields predictions ranging from 5% to 95%. The better-seeded team (the one with the lower seed number, from 1 to 16) is the favorite and gets a prediction above 50%. The starter code below keeps this seed difference as one feature, adds an Elo rating difference computed from regular-season results, and fits a gradient-boosted classifier on past tournament games in place of the fixed formula.
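
The seed-only formula can be sketched in a few lines. The function name `seed_baseline_pred` and its argument names are hypothetical; the sign convention (seed of the higher `TeamID` minus seed of the lower `TeamID`) is chosen so that the result is the win probability of the lower `TeamID`, matching the submission format.

```python
def seed_baseline_pred(seed_low_id: int, seed_high_id: int) -> float:
    """Probability that the lower-TeamID team wins, from seeds alone.

    seed_low_id:  seed (1-16) of the team with the smaller TeamID
    seed_high_id: seed (1-16) of the team with the larger TeamID
    """
    return 0.5 + 0.03 * (seed_high_id - seed_low_id)


# A #1 seed against a #16 seed: the largest possible advantage.
print(f"{seed_baseline_pred(1, 16):.2f}")  # 0.95
# Equal seeds: a coin flip.
print(f"{seed_baseline_pred(8, 8):.2f}")   # 0.50
# A #16 seed against a #1 seed: the largest possible disadvantage.
print(f"{seed_baseline_pred(16, 1):.2f}")  # 0.05
```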

# Starter Code

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score 
from sklearn.metrics import brier_score_loss
from tqdm import tqdm

# ======================================================================
# 1. Data Loading & Initial Processing
# ======================================================================

# Load all competition data
m_regular = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MRegularSeasonCompactResults.csv')
w_regular = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WRegularSeasonCompactResults.csv')
tourney_results = pd.concat([
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneyCompactResults.csv'),
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneyCompactResults.csv')
])
submission = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage1.csv')

# ======================================================================
# 2. Seed Data Processing
# ======================================================================

def parse_seed(seed):
    """Extract the numeric seed (1-16) from strings such as 'W01' or 'X16a'."""
    try:
        return int(''.join(filter(str.isdigit, str(seed))))
    except ValueError:  # no digits found, e.g. a missing seed
        return 16  # Default value for missing/invalid seeds

# Process seed information
seed_data = pd.concat([
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv'),
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv')
])
seed_data['SeedValue'] = seed_data['Seed'].apply(parse_seed)
seed_map = seed_data.set_index(['Season', 'TeamID'])['SeedValue'].to_dict()

# ======================================================================
# 3. Elo Rating System Implementation
# ======================================================================

class EloCalculator:
    def __init__(self, k=32, regress=0.2):
        self.k = k
        self.regress = regress
        self.ratings = {}
        
    def process_season(self, season_df):
        season_df = season_df.sort_values('DayNum')
        for _, row in season_df.iterrows():
            t1, t2, wloc = row['WTeamID'], row['LTeamID'], row['WLoc']
            margin = row['WScore'] - row['LScore']
            
            # Initialize ratings if needed
            for team in [t1, t2]:
                if team not in self.ratings:
                    self.ratings[team] = 1500
                    
            # Calculate Elo updates
            r1, r2 = self.ratings[t1], self.ratings[t2]
            q1 = 10 ** (r1 / 400)
            q2 = 10 ** (r2 / 400)
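            e1 = q1 / (q1 + q2)  # expected win probability of the winner (t1)
            # Non-standard update: the winner's "actual score" is 1 + margin/20
            # rather than 1, so larger winning margins move the ratings more.
            delta = self.k * ((1 + margin/20) - e1)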
            
            self.ratings[t1] += delta
            self.ratings[t2] -= delta
            
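            # Regression toward 1500. Note: this sits inside the game loop, so both
            # teams are pulled toward the mean after every game, not once per season.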
            for team in [t1, t2]:
                self.ratings[team] = (self.ratings[team] - 1500) * (1 - self.regress) + 1500

# Calculate Elo ratings for all teams
full_regular = pd.concat([m_regular, w_regular])
elo_system = EloCalculator()
for season in tqdm(full_regular['Season'].unique(), desc='Calculating Elo Ratings'):
    season_data = full_regular[full_regular['Season'] == season]
    elo_system.process_season(season_data)

# ======================================================================
# 4. Corrected Feature Engineering
# ======================================================================

def create_tourney_features(tourney_df):
    features = []
    elo_cache = {}
    
    for season in tourney_df['Season'].unique():
        season_games = tourney_df[tourney_df['Season'] == season]
        season_teams = set(season_games['WTeamID']).union(set(season_games['LTeamID']))
        
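        # Initialize season cache. Note: elo_system.ratings holds the ratings after
        # all seasons have been processed, so every tournament season is assigned the
        # same final ratings, which include information from later seasons.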
        for team in season_teams:
            elo_cache[team] = elo_system.ratings.get(team, 1500)
        
        for idx, row in season_games.sort_values('DayNum').iterrows():
            t1, t2 = sorted([row['WTeamID'], row['LTeamID']])
            
            # Keep only relevant features
            features.append({
                'EloDiff': elo_cache[t1] - elo_cache[t2],
                'SeedDiff': seed_map.get((season, t2), 16) - seed_map.get((season, t1), 16),
                'Outcome': 1 if row['WTeamID'] == t1 else 0
            })
                
    return pd.DataFrame(features)

tourney_features = create_tourney_features(tourney_results)

# ======================================================================
# 5. Model Training with Correct Features
# ======================================================================

X = tourney_features[['EloDiff', 'SeedDiff']]  # Only use predictive features
y = tourney_features['Outcome']

best_params = {
    'max_iter': 500,
    'learning_rate': 0.05,
    'max_depth': 7,
    'l2_regularization': 1.0,
    'early_stopping': True,
    'validation_fraction': 0.2
}

# Fit the final model on all tournament games (cross-validation is reported in section 7)
model = HistGradientBoostingClassifier(**best_params)
model.fit(X, y)

# ======================================================================
# 6. Correct Submission Generation
# ======================================================================

def generate_submission_features(sub_df):
    features = []
    # Parse ID column directly
    for id_str in sub_df['ID']:
        parts = id_str.split('_')
        season = int(parts[0])
        t1 = int(parts[1])
        t2 = int(parts[2])
        
        features.append({
            'EloDiff': elo_system.ratings.get(t1, 1500) - elo_system.ratings.get(t2, 1500),
            'SeedDiff': seed_map.get((season, t2), 16) - seed_map.get((season, t1), 16)
        })
    return pd.DataFrame(features)

sub_features = generate_submission_features(submission)
submission['Pred'] = model.predict_proba(sub_features)[:, 1]
submission['Pred'] = submission['Pred'].clip(0.03, 0.97)

# ======================================================================
# 7. Final Validation & Output
# ======================================================================

# Cross-validation score check
cv_scores = -cross_val_score(model, X, y, 
                            cv=TimeSeriesSplit(n_splits=5),
                            scoring='neg_brier_score')
print(f"Cross-Validation Brier Scores: {cv_scores}")
print(f"Mean CV Score: {np.mean(cv_scores):.5f}")

submission[['ID', 'Pred']].to_csv('submission.csv', index=False)

Calculating Elo Ratings: 100%|██████████| 41/41 [00:17<00:00,  2.33it/s]


Cross-Validation Brier Scores: [0.18601591 0.18006941 0.17932004 0.18396685 0.19315551]
Mean CV Score: 0.18451
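
To help read the scores above: the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. The sketch below is a plain-Python illustration with made-up outcomes (not competition data); `brier_score` is a hypothetical helper that computes the same quantity as `sklearn.metrics.brier_score_loss` for binary outcomes.

```python
def brier_score(outcomes, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(outcomes) != len(probs):
        raise ValueError("outcomes and probs must have the same length")
    return sum((p - y) ** 2 for y, p in zip(outcomes, probs)) / len(outcomes)


# Made-up outcomes for illustration only.
outcomes = [1, 0, 1, 1, 0]

# Always predicting 0.5 scores exactly 0.25, whatever the outcomes are.
print(brier_score(outcomes, [0.5] * len(outcomes)))  # 0.25

# Perfectly confident, correct predictions score 0.
print(brier_score(outcomes, [1.0, 0.0, 1.0, 1.0, 0.0]))  # 0.0
```

Against the 0.25 reference of a constant 0.5 prediction, the mean cross-validation score of about 0.185 printed above is an improvement.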
