# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict that probability that the smaller ``TeamID`` will win a given matchup. You will predict the probability for every possible matchup between every possible team over the past 4 years. You'll be given a sample submission file where the ```ID``` value indicates the year of the matchup as well as the identities of both teams within the matchup. For example, for an ```ID``` of ```2025_1101_1104``` you would need to predict the outcome of the matchup between ```TeamID 1101``` vs ```TeamID 1104``` during the ```2025``` tournament. Submitting a ```PRED``` of ```0.75``` indicates that you think that the probability of ```TeamID 1101``` winning that particular matchup is equal to ```0.75```.


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We can predict the winner of a match by considering the respective rankings of the opposing teams, only. Since the largest possible difference is 15 (which is #16 minus #1), we use a rudimentary formula that's 0.5 plus 0.03 times the difference in seeds, leading to a range of predictions spanning from 5% up to 95%. The stronger-seeded team (with a lower seed number from 1 to 16) will be the favorite and will have a prediction above 50%. 

# Starter Code

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# ==============================================
# 1. Data Loading with Error Handling
# ==============================================
try:
    # Load seed data
    m_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv')
    w_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv')
    
    # Load historical tournament results
    tourney_results = pd.concat([
        pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneyCompactResults.csv'),
        pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneyCompactResults.csv')
    ])
    
    # Load submission file
    submission = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage1.csv')

except FileNotFoundError as e:
    print(f"Critical file missing: {e}")
    exit()

# ==============================================
# 2. Robust Seed Processing
# ==============================================
def parse_seed(seed):
    """Extract numeric seed value from any format"""
    try:
        # Extract digits after first character
        digits = re.sub(r"^\D+", "", str(seed)).lstrip('0')
        return int(digits) if digits else 1
    except:
        return 16  # Default for invalid/missing seeds

# Process seeds
seed_df = pd.concat([m_seed, w_seed])
seed_df['SeedValue'] = seed_df['Seed'].apply(parse_seed)
seed_map = seed_df.set_index(['Season', 'TeamID'])['SeedValue']

# ==============================================
# 3. Historical Upset Analysis
# ==============================================
def calculate_upset_features(df):
    """Calculate historical upset rates by seed difference"""
    games = []
    for _, row in df.iterrows():
        season = row['Season']
        winner = row['WTeamID']
        loser = row['LTeamID']
        
        try:
            seed_win = seed_map.at[(season, winner)]
            seed_lose = seed_map.at[(season, loser)]
        except KeyError:
            continue
            
        diff = seed_lose - seed_win  # Positive = upset possible
        games.append({
            'SeedDiff': diff,
            'Upset': 1 if seed_win > seed_lose else 0
        })
    
    return pd.DataFrame(games)

# Calculate historical upset probabilities
historical_games = calculate_upset_features(tourney_results)
upset_rates = historical_games.groupby('SeedDiff')['Upset'].mean().reset_index()

# ==============================================
# 4. Submission Data Processing
# ==============================================
# Split ID column first
split_cols = submission['ID'].str.split('_', expand=True).iloc[:, :3]
split_cols.columns = ['Season', 'Team1', 'Team2']

# Convert dtypes after renaming
submission = submission.join(
    split_cols.astype({
        'Season': 'int16',
        'Team1': 'int16',
        'Team2': 'int16'
    })
)

# Calculate seed differences
def get_seed_diff(row):
    try:
        return (
            seed_map.at[(row['Season'], row['Team2'])] -  # Fixed here
            seed_map.at[(row['Season'], row['Team1'])]    # And here
        )
    except KeyError:
        return 0

submission['SeedDiff'] = submission.apply(get_seed_diff, axis=1)

# ==============================================
# 5. Model Training & Prediction
# ==============================================
# Prepare training data
X_train = upset_rates[['SeedDiff']]
y_train = upset_rates['Upset']

# Train calibrated model
model = CalibratedClassifierCV(
    LogisticRegression(),
    method='isotonic',
    cv=5
)
model.fit(X_train, y_train)

# Generate predictions
submission['Pred'] = model.predict_proba(submission[['SeedDiff']])[:, 1]
submission['Pred'] = submission['Pred'].clip(0.05, 0.95)

# ==============================================
# 6. Validation & Output
# ==============================================
print("\nAdvanced Model Validation:")
print(f"Average Prediction: {submission['Pred'].mean():.4f}")
print(f"Prediction Range: [{submission['Pred'].min():.2f}, {submission['Pred'].max():.2f}]")
print(f"Estimated Brier Score: {np.mean((submission['Pred'] - 0.5)**2):.5f}")

submission[['ID', 'Pred']].to_csv('optimized_submission.csv', index=False)
print("\nSubmission file successfully generated!")


Advanced Model Validation:
Average Prediction: 0.4860
Prediction Range: [0.05, 0.95]
Estimated Brier Score: 0.00357

Submission file successfully generated!
