# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict the probability that the team with the smaller ```TeamID``` will win a given matchup. You will predict this probability for every possible matchup between every possible pair of teams over the past 4 years. You'll be given a sample submission file where the ```ID``` value indicates the year of the matchup as well as the identities of both teams in the matchup. For example, for an ```ID``` of ```2025_1101_1104``` you would need to predict the outcome of the matchup between ```TeamID 1101``` and ```TeamID 1104``` during the ```2025``` tournament. Submitting a ```Pred``` of ```0.75``` indicates that you think the probability of ```TeamID 1101``` winning that particular matchup is ```0.75```.
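
As a small illustration of the ```ID``` format described above, the sketch below splits an ID into its season and two team IDs. The helper name `parse_matchup_id` is hypothetical (it is not part of the competition materials), and the sketch assumes the ID always has exactly three underscore-separated integer parts, as in the example.

```python
from typing import Tuple


def parse_matchup_id(matchup_id: str) -> Tuple[int, int, int]:
    """Split an ID of the form 'Season_Team1_Team2' into three integers.

    Hypothetical helper for illustration; assumes exactly three
    underscore-separated integer parts.
    """
    season, team1, team2 = (int(part) for part in matchup_id.split("_"))
    return season, team1, team2


season, team1, team2 = parse_matchup_id("2025_1101_1104")
print(season, team1, team2)  # 2025 1101 1104
```

In this example the prediction attached to the ID is the probability that team 1101, the smaller ```TeamID```, wins.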


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We can predict the winner of a matchup by considering only the tournament seeds of the two opposing teams. Since the largest possible seed difference is 15 (seed #16 minus seed #1), we use a rudimentary formula: 0.5 plus 0.03 times the difference in seeds (the opponent's seed minus the team's own seed). This yields predictions ranging from 5% to 95%. The stronger-seeded team (the one with the lower seed number, from 1 to 16) is the favorite and receives a prediction above 50%.
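
To make the arithmetic concrete, here is a minimal sketch of the formula as a standalone function. The name `seed_diff_pred` is hypothetical and used only for illustration; the constants 0.5 and 0.03 and the 5% to 95% bounds are the ones described above.

```python
def seed_diff_pred(seed1: int, seed2: int) -> float:
    """Probability that team 1 wins, using seed numbers only.

    seed1 is team 1's own seed, seed2 is the opponent's seed.
    Hypothetical helper illustrating the 0.5 + 0.03 * difference rule.
    """
    pred = 0.5 + 0.03 * (seed2 - seed1)
    # Keep the result inside the 5%-95% band
    return min(max(pred, 0.05), 0.95)


print(f"{seed_diff_pred(1, 16):.2f}")  # 0.95: a #1 seed facing a #16 seed
print(f"{seed_diff_pred(16, 1):.2f}")  # 0.05: the same game from the #16 side
print(f"{seed_diff_pred(8, 9):.2f}")   # 0.53: nearly even matchup
print(f"{seed_diff_pred(5, 5):.2f}")   # 0.50: equal seeds
```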

# Starter Code

In [1]:
import pandas as pd
import numpy as np
import re

# ---------------------------
# 1. Data Loading
# ---------------------------

# Load required files
m_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv')
w_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv')
submission = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage1.csv')

# ---------------------------
# 2. Robust Seed Processing
# ---------------------------

def parse_seed(seed):
    """Extract the numeric seed from a seed string such as 'W01' or 'X16a'."""
    match = re.search(r'\d+', str(seed))
    # Fall back to the weakest seed (16) when the value contains no digits
    return int(match.group()) if match else 16

# Combine and process seeds
seed_df = pd.concat([m_seed, w_seed], ignore_index=True)
seed_df['SeedValue'] = seed_df['Seed'].apply(parse_seed).astype('int16')

# ---------------------------
# 3. Submission Data Processing
# ---------------------------

# Split ID into components
split_cols = submission['ID'].str.split('_', expand=True).iloc[:, :3]
split_cols.columns = ['Season', 'Team1', 'Team2']

# Convert dtypes and merge back
submission = submission.join(
    split_cols.astype({'Season': 'int16', 'Team1': 'int16', 'Team2': 'int16'})
)

# ---------------------------
# 4. Seed Merging
# ---------------------------

# Keep only the merge keys and the numeric seed, so the two merges below
# do not each pull in the raw 'Seed' column (which would create Seed_x/Seed_y)
seed_lookup = seed_df[['Season', 'TeamID', 'SeedValue']]

# Attach each team's seed with left merges
submission = submission.merge(
    seed_lookup.rename(columns={'TeamID': 'Team1', 'SeedValue': 'Seed1'}),
    on=['Season', 'Team1'],
    how='left'
).merge(
    seed_lookup.rename(columns={'TeamID': 'Team2', 'SeedValue': 'Seed2'}),
    on=['Season', 'Team2'],
    how='left'
)

# Teams with no tournament seed for that season default to the weakest seed (16)
submission['Seed1'] = submission['Seed1'].fillna(16).astype('int16')
submission['Seed2'] = submission['Seed2'].fillna(16).astype('int16')

# ---------------------------
# 5. Prediction Calculation
# ---------------------------

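# Seed difference: positive when Team1 (the smaller TeamID) is the stronger seed
submission['SeedDiff'] = submission['Seed2'] - submission['Seed1']
# 0.5 + 0.03 per seed of difference, clipped to the 5%-95% range
submission['Pred'] = (0.5 + 0.03 * submission['SeedDiff']).clip(0.05, 0.95)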

# ---------------------------
# 6. Final Output
# ---------------------------

# Print summary statistics as a quick sanity check
print("Prediction Summary:")
print(f"Mean: {submission['Pred'].mean():.4f}")
print(f"Range: [{submission['Pred'].min():.2f}, {submission['Pred'].max():.2f}]")

# Save submission
submission[['ID', 'Pred']].to_csv('final_submission.csv', index=False)
print("\nSubmission file created successfully!")

Prediction Summary:
Mean: 0.5025
Range: [0.05, 0.95]

Submission file created successfully!
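
Beyond the summary printed above, it can help to check the structure of the submission before uploading it. The sketch below is one possible set of checks, not an official validator: the function name `check_submission` is hypothetical, and the rules it enforces (exactly the columns `ID` and `Pred`, unique IDs of the form `Season_Team1_Team2`, and probabilities between 0 and 1) are assumptions based on the format described at the top of this notebook.

```python
import pandas as pd


def check_submission(df: pd.DataFrame) -> None:
    """Raise AssertionError if df does not look like a valid submission.

    Hypothetical helper; the rules below are assumptions, not official checks.
    """
    assert list(df.columns) == ["ID", "Pred"], "expected columns ID and Pred"
    assert df["ID"].is_unique, "duplicate matchup IDs"
    assert df["ID"].str.fullmatch(r"\d{4}_\d+_\d+").all(), "malformed ID"
    # between() is False for NaN, so missing predictions are caught here too
    assert df["Pred"].between(0, 1).all(), "Pred missing or outside [0, 1]"


# Toy frame standing in for the real submission file
toy = pd.DataFrame(
    {"ID": ["2025_1101_1104", "2025_1101_1105"], "Pred": [0.75, 0.50]}
)
check_submission(toy)
print("toy submission passed all checks")
```

In the notebook itself, the same function could be applied to the saved file with `check_submission(pd.read_csv('final_submission.csv'))`.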
