# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict the probability that the team with the smaller `TeamID` will win a given matchup. You will predict this probability for every possible matchup between every possible pair of teams over the past 4 years. You'll be given a sample submission file in which the `ID` value encodes the year of the matchup and the identities of both teams. For example, for an `ID` of `2025_1101_1104` you would predict the outcome of the matchup between `TeamID 1101` and `TeamID 1104` in the `2025` tournament. Submitting a `Pred` of `0.75` indicates that you think `TeamID 1101` has a `0.75` probability of winning that matchup.
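
As a small illustration of the `ID` format described above, the sketch below splits an `ID` string into its season and the two team IDs. The helper name `parse_matchup_id` is hypothetical and not part of the competition materials.

```python
def parse_matchup_id(matchup_id: str) -> tuple[int, int, int]:
    """Split an ID such as '2025_1101_1104' into (season, team_1, team_2)."""
    season, team_1, team_2 = (int(part) for part in matchup_id.split("_"))
    return season, team_1, team_2


season, team_1, team_2 = parse_matchup_id("2025_1101_1104")
# The prediction for this row is the probability that team_1 (the smaller TeamID) wins.
print(season, team_1, team_2)
```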


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We predict the winner of a matchup from the tournament seeds of the two teams alone. Since the largest possible seed difference is 15 (#16 minus #1), we use a rudimentary formula: 0.5 plus 0.03 times the difference in seeds (the second team's seed minus the first team's seed), which yields predictions ranging from 5% to 95%. The better-seeded team (the one with the lower seed number, from 1 to 16) is the favorite, so the prediction is above 50% when the team with the smaller `TeamID` holds the better seed and below 50% when it does not.
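
To make the arithmetic concrete, here is a minimal sketch of the formula as a standalone function. The name `seed_diff_to_pred` is an illustrative assumption; the starter code applies the same formula column-wise with pandas.

```python
def seed_diff_to_pred(seed_1: int, seed_2: int) -> float:
    """Probability that team 1 (the smaller TeamID) wins, based on seeds alone."""
    pred = 0.5 + 0.03 * (seed_2 - seed_1)
    # Clip to the intended 5%-95% range, mirroring the clip step in the starter code.
    return min(max(pred, 0.05), 0.95)


# A few seed pairings: heavy favorite, near coin flip, equal seeds, heavy underdog.
for seed_1, seed_2 in [(1, 16), (8, 9), (5, 5), (16, 1)]:
    print(seed_1, seed_2, round(seed_diff_to_pred(seed_1, seed_2), 2))
```

Rounded to two decimals, a #1 seed facing a #16 seed comes out at 0.95, and the reverse pairing at 0.05.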

# Starter Code

In [1]:
import pandas as pd
import numpy as np
import re

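def parse_seed(seed):
    """Return the numeric part of a seed string (e.g. "W16a" -> 16); default to 16."""
    if isinstance(seed, str):
        digits = re.sub(r"\D", "", seed)  # drop the region letter and any play-in suffix
        return int(digits) if digits else 16
    return 16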

# ======================================================================
# 1. Load Seed Data & Add Gender Column
# ======================================================================
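# The first character of a seed such as "W01" is the bracket region (W, X, Y, Z),
# not the gender, so the gender is taken from the source file instead.
# Assumes WNCAATourneySeeds.csv sits next to the men's file in the competition data.
data_dir = '/kaggle/input/march-machine-learning-mania-2025'
seed_data = pd.concat([
    pd.read_csv(f'{data_dir}/MNCAATourneySeeds.csv').assign(Gender='M'),
    pd.read_csv(f'{data_dir}/WNCAATourneySeeds.csv').assign(Gender='W'),
], ignore_index=True)
seed_data['Seed'] = seed_data['Seed'].apply(parse_seed)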

# ======================================================================
# 2. Load Sample Submission for 2025 Matchups
# ======================================================================
submission = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage2.csv')

# Process ID column
sub_pairs = submission['ID'].str.split('_', expand=True)
sub_pairs.columns = ['Season', 'TeamID_1', 'TeamID_2']
sub_pairs['Season'] = sub_pairs['Season'].astype(int)
sub_pairs['TeamID_1'] = sub_pairs['TeamID_1'].astype(int)
sub_pairs['TeamID_2'] = sub_pairs['TeamID_2'].astype(int)
sub_pairs['Gender'] = np.where(sub_pairs['TeamID_1'].astype(str).str.startswith('1'), 'M', 'W')

# ======================================================================
# 3. Add Seed Information & Predictions
# ======================================================================
seed_lookup = seed_data.set_index(['Season', 'Gender', 'TeamID'])['Seed']

sub_pairs['Seed_1'] = sub_pairs.apply(
    lambda x: seed_lookup.get((x['Season'], x['Gender'], x['TeamID_1']), 16), axis=1
)
sub_pairs['Seed_2'] = sub_pairs.apply(
    lambda x: seed_lookup.get((x['Season'], x['Gender'], x['TeamID_2']), 16), axis=1
)

sub_pairs['Seed_Diff'] = sub_pairs['Seed_2'] - sub_pairs['Seed_1']
sub_pairs['Pred'] = 0.5 + 0.03 * sub_pairs['Seed_Diff']
sub_pairs['Pred'] = sub_pairs['Pred'].clip(0.05, 0.95)

# ======================================================================
# 4. Final Submission
# ======================================================================
submission['Pred'] = sub_pairs['Pred']
submission.to_csv('submission.csv', index=False)
print("Submission successful! 🏀")

Submission successful! 🏀
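
Before uploading, it can help to sanity-check the predictions. The sketch below is one possible check, not part of the original notebook: `check_predictions` is a hypothetical helper, and the toy rows are made-up values used only to show the call. A column in which every prediction is identical (for example, all 0.5) usually means the seed lookup found no matches.

```python
import pandas as pd


def check_predictions(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if the Pred column looks malformed."""
    if df["Pred"].isna().any():
        raise ValueError("Pred contains missing values")
    if not df["Pred"].between(0.05, 0.95).all():
        raise ValueError("Pred falls outside the clipped range [0.05, 0.95]")
    if df["Pred"].nunique() == 1:
        raise ValueError("every prediction is identical; the seed lookup may have failed")


# Toy rows for illustration only (not real predictions).
toy = pd.DataFrame({
    "ID": ["2025_1101_1104", "2025_1101_1112"],
    "Pred": [0.53, 0.41],
})
check_predictions(toy)  # passes silently when the column looks sane
```

In the notebook itself, the same call could be applied to `submission` just before `to_csv`.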
