# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict the probability that the team with the smaller ```TeamID``` will win a given matchup. You will predict this probability for every possible matchup between every pair of teams over the past 4 years. You'll be given a sample submission file where the ```ID``` value indicates the year of the matchup as well as the identities of both teams. For example, for an ```ID``` of ```2025_1101_1104``` you would need to predict the outcome of the matchup between ```TeamID 1101``` and ```TeamID 1104``` during the ```2025``` tournament. Submitting a ```Pred``` of ```0.75``` indicates that you think the probability of ```TeamID 1101``` winning that matchup is ```0.75```.
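
To make the ID format concrete, here is a minimal sketch of splitting an ID into its three parts. The helper name `parse_matchup_id` is hypothetical and is not part of the competition files; the sketch assumes every ID has exactly three underscore-separated integer fields.

```python
from typing import Tuple


def parse_matchup_id(matchup_id: str) -> Tuple[int, int, int]:
    """Split an ID such as '2025_1101_1104' into (season, lower TeamID, higher TeamID)."""
    season, team1, team2 = (int(part) for part in matchup_id.split("_"))
    return season, team1, team2


season, team1, team2 = parse_matchup_id("2025_1101_1104")
print(f"Season {season}: predict P(team {team1} beats team {team2})")
```

The prediction always refers to the first team in the ID, which is the one with the smaller TeamID.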


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We can predict the winner of a match by considering only the seeds of the two opposing teams. Since the largest possible seed difference is 15 (#16 minus #1), a rudimentary formula of 0.5 plus 0.03 times the seed difference yields predictions ranging from 5% to 95%. The stronger-seeded team (the one with the lower seed number, from 1 to 16) is the favorite and receives a prediction above 50%. The starter code in this notebook replaces the fixed 0.03 slope with a calibrated logistic regression fitted on historical seed differences, then clips its output to the same 5% to 95% range.
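
The rudimentary formula can be sketched in a few lines. This is an illustration only: the function name `seed_formula_pred` is hypothetical, and the sign convention (second team's seed minus first team's seed) is chosen so that a better-seeded first team gets a prediction above 0.5.

```python
def seed_formula_pred(seed1: int, seed2: int) -> float:
    """Win probability for team 1 using 0.5 + 0.03 * (seed difference)."""
    seed_diff = seed2 - seed1  # positive when team 1 has the better (lower) seed
    return 0.5 + 0.03 * seed_diff


# A #1 seed against a #16 seed, two equal seeds, and the reverse mismatch
for s1, s2 in [(1, 16), (8, 8), (16, 1)]:
    print(f"seed {s1} vs seed {s2}: {seed_formula_pred(s1, s2):.2f}")
```

With seeds limited to 1 through 16, the difference lies between -15 and 15, so the result stays within the 5% to 95% range without any clipping.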

# Starter Code

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# ==============================================
# 1. Data Loading & Processing
# ==============================================

# Load data
m_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv')
w_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv')
tourney_results = pd.concat([
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneyCompactResults.csv'),
    pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneyCompactResults.csv')
])
submission = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage1.csv')

# ==============================================
# 2. Correct Training Data Preparation
# ==============================================

def parse_seed(seed):
    """Extract the numeric seed from strings such as 'W01' or 'X16a'; missing seeds default to 16."""
    return int(''.join(filter(str.isdigit, str(seed)))) if pd.notnull(seed) else 16

# Process seeds
seed_df = pd.concat([m_seed, w_seed])
seed_df['SeedValue'] = seed_df['Seed'].apply(parse_seed)
seed_map = seed_df.set_index(['Season', 'TeamID'])['SeedValue']

# Prepare training data in submission format
train_data = []
for _, row in tourney_results.iterrows():
    season = row['Season']
    team1, team2 = sorted([row['WTeamID'], row['LTeamID']])
    outcome = 1 if row['WTeamID'] == team1 else 0
    
    try:
        seed1 = seed_map.at[(season, team1)]
        seed2 = seed_map.at[(season, team2)]
    except KeyError:
        continue
        
    train_data.append({
        'Season': season,
        'SeedDiff': seed2 - seed1,  # Team2 seed - Team1 seed
        'Outcome': outcome          # 1 if Team1 wins
    })

train_df = pd.DataFrame(train_data)

# ==============================================
# 3. Model Training
# ==============================================

# Features and target
X = train_df[['SeedDiff']]
y = train_df['Outcome']

# Calibrated logistic regression
model = CalibratedClassifierCV(
    LogisticRegression(),
    method='isotonic',
    cv=5
)
model.fit(X, y)

# ==============================================
# 4. Generate Predictions
# ==============================================

# Process submission data
split_cols = submission['ID'].str.split('_', expand=True).iloc[:, :3]
split_cols.columns = ['Season', 'Team1', 'Team2']
submission = submission.join(split_cols.astype(int))

# Look up seeds; teams without a tournament seed in that season default to 16
submission['Seed1'] = submission.apply(
    lambda x: seed_map.get((x['Season'], x['Team1']), 16), axis=1)
submission['Seed2'] = submission.apply(
    lambda x: seed_map.get((x['Season'], x['Team2']), 16), axis=1)

# Calculate features
submission['SeedDiff'] = submission['Seed2'] - submission['Seed1']

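# Predict probabilities (column 1 is P(Team1 wins), since Outcome == 1 means Team1 won)
submission['Pred'] = model.predict_proba(submission[['SeedDiff']])[:, 1]
# Clip to the 5%-95% range described in the overview
submission['Pred'] = submission['Pred'].clip(0.05, 0.95)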

# ==============================================
# 5. Final Output
# ==============================================

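# Summary statistics for the submitted predictions.
# Note: the "Brier Estimate" below is not a true Brier score, because no game
# outcomes are used. It is the mean squared deviation of the predictions from
# 0.5, i.e. a measure of how spread out the predictions are.
print("Prediction Statistics:")
print(f"Mean: {submission['Pred'].mean():.4f}")
print(f"Brier Estimate: {np.mean((submission['Pred'] - 0.5)**2):.5f}")
submission[['ID', 'Pred']].to_csv('corrected_submission.csv', index=False)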

Prediction Statistics:
Mean: 0.5245
Brier Estimate: 0.03199
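
The "Brier Estimate" printed above compares the predictions with 0.5 rather than with observed outcomes, so it describes the spread of the predictions and not their accuracy. As a rough sketch of what a Brier score against known outcomes looks like, assuming outcomes coded as 0 or 1 (the helper `brier_score` and the toy data are hypothetical and not taken from the competition files):

```python
import numpy as np


def brier_score(y_true, y_pred) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))


# Hypothetical toy data for illustration only
toy_outcomes = [1, 0, 1, 1]
toy_preds = [0.8, 0.3, 0.6, 0.5]
print(f"Brier score: {brier_score(toy_outcomes, toy_preds):.4f}")
```

In this notebook, one option would be to call it on the training data, for example `brier_score(y, model.predict_proba(X)[:, 1])`. That in-sample figure is likely to be optimistic, so a held-out season would give a fairer estimate.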
