# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict the probability that the team with the smaller `TeamID` will win a given matchup. You will predict this probability for every possible matchup between every possible pair of teams over the past 4 years. You'll be given a sample submission file in which the `ID` value encodes the year of the matchup and the identities of both teams. For example, for an `ID` of `2025_1101_1104` you would need to predict the outcome of the matchup between `TeamID 1101` and `TeamID 1104` during the `2025` tournament. Submitting a `Pred` of `0.75` indicates that you think `TeamID 1101` has a `0.75` probability of winning that matchup.


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We can predict the winner of a matchup by considering only the tournament seeds of the two teams. Since the largest possible seed difference is 15 (#16 minus #1), we use a rudimentary formula: 0.5 plus 0.03 times the difference in seeds, which yields predictions ranging from 5% to 95%. Because each prediction is for the team with the smaller TeamID, the difference is taken as the other team's seed minus that team's seed. The better-seeded team (the one with the lower seed number, from 1 to 16) is the favorite and receives a prediction above 50%.
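The formula described above can be sketched as a small function. The name `seed_diff_to_prob` and its `slope` parameter are illustrative and do not come from the competition materials; this is a minimal sketch assuming both seeds are integers from 1 to 16.

```python
def seed_diff_to_prob(seed_a: int, seed_b: int, slope: float = 0.03) -> float:
    """Estimate the probability that team A beats team B from seeds alone.

    Hypothetical helper: 0.5 plus `slope` times (B's seed minus A's seed).
    """
    prob = 0.5 + slope * (seed_b - seed_a)
    # Clip so floating-point drift cannot push the result outside 5%-95%
    return min(max(prob, 0.05), 0.95)


for a, b in [(1, 16), (8, 9), (5, 5), (16, 1)]:
    print(f"seed {a} vs seed {b}: {seed_diff_to_prob(a, b):.2f}")
```

Equal seeds give exactly 0.5, and swapping the two teams mirrors the prediction around 0.5.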

# Starter Code

## Step 1: Import Python packages

In [1]:
import numpy as np
import pandas as pd

## Step 2: Explore the data

In [2]:
w_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv')
m_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv')
seed_df = pd.concat([m_seed, w_seed], axis=0)  # Stack the men's and women's seeds into one table
submission_df = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage1.csv')


Team seeds are stored in the files `WNCAATourneySeeds.csv` (women) and `MNCAATourneySeeds.csv` (men).
- The "Season" column indicates the year
- The "Seed" column indicates the team's seed within its tournament region (W01 = seed 1 in region W)
- The "TeamID" column contains a unique identifier for every team
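To make the structure of a seed label concrete, here is a minimal parsing sketch. It assumes labels consist of a leading letter (W, X, Y, or Z), a two-digit seed number, and an optional `a`/`b` suffix used for play-in teams; the `parse_seed` helper and `SEED_PATTERN` are illustrative names, not part of the dataset or the notebook.

```python
import re

# Assumed label layout: one letter, two digits, optional play-in suffix
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")


def parse_seed(seed: str):
    """Split a seed label into (letter, seed number, is_play_in)."""
    match = SEED_PATTERN.match(seed.strip())
    if match is None:
        raise ValueError(f"Unrecognized seed label: {seed!r}")
    letter, number, suffix = match.groups()
    return letter, int(number), suffix != ""


print(parse_seed("W01"))   # ('W', 1, False)
print(parse_seed("X16a"))  # ('X', 16, True)
```

Raising on unrecognized labels, rather than silently substituting a default, makes unexpected formats visible during exploration.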

In [3]:
# Display first few rows to check structure
print(seed_df.head())
print(submission_df.head())


   Season Seed  TeamID
0    1985  W01    1207
1    1985  W02    1210
2    1985  W03    1228
3    1985  W04    1260
4    1985  W05    1374
               ID  Pred
0  2021_1101_1102   0.5
1  2021_1101_1103   0.5
2  2021_1101_1104   0.5
3  2021_1101_1105   0.5
4  2021_1101_1106   0.5


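As the output above shows, the sample submission file (`SampleSubmissionStage1.csv`) contains an "ID" column in the format year_teamID1_teamID2 and a "Pred" column, initialized to 0.5, that holds the predicted probability.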
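One way to take the ID format apart is a vectorized string split. The sketch below uses a tiny hand-made Series rather than the real file, and the column names `Season`, `TeamID1`, and `TeamID2` are chosen here for illustration.

```python
import pandas as pd

# Two example IDs in the year_teamID1_teamID2 format
ids = pd.Series(["2021_1101_1102", "2021_1101_1104"], name="ID")

# Split on underscores into three columns and convert them to integers
parts = ids.str.split("_", expand=True).astype(int)
parts.columns = ["Season", "TeamID1", "TeamID2"]
print(parts)
```

On a file with hundreds of thousands of rows, a vectorized split like this is typically faster than applying a Python function row by row.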


## Step 3: Extract game info and team rankings

In [5]:
def extract_game_info(id_str):
    """Split an ID such as '2021_1101_1102' into (year, TeamID1, TeamID2)."""
    year, tid1, tid2 = map(int, id_str.split('_'))
    return year, min(tid1, tid2), max(tid1, tid2)  # Ensure smaller TeamID is TeamID1


def extract_seed_value(seed_str):
    """Return the numeric part of a seed label (e.g., 'W01' becomes 1)."""
    try:
        return int(str(seed_str).strip()[1:])  # Drop the leading letter and parse the rest
    except ValueError:
        # Missing or unparsable seeds, including labels with a trailing letter, default to 16
        return 16


seed_df['SeedValue'] = seed_df['Seed'].fillna('Z16').apply(extract_seed_value)


## Step 4: Make your predictions

In [6]:
# Split each ID into Season, TeamID1, and TeamID2 columns
submission_df[['Season', 'TeamID1', 'TeamID2']] = submission_df['ID'].apply(extract_game_info).tolist()

# Build a lookup from (Season, TeamID) to the numeric seed
seed_map = seed_df.set_index(['Season', 'TeamID'])['SeedValue']

# Look up each team's seed; teams without a seed in that season default to 16
submission_df['SeedValue1'] = submission_df.set_index(['Season', 'TeamID1']).index.map(seed_map).fillna(16)
submission_df['SeedValue2'] = submission_df.set_index(['Season', 'TeamID2']).index.map(seed_map).fillna(16)

# Check the first few rows to confirm
print(submission_df.head())


               ID  Pred  Season  TeamID1  TeamID2  SeedValue1  SeedValue2
0  2021_1101_1102   0.5    2021     1101     1102        14.0        16.0
1  2021_1101_1103   0.5    2021     1101     1103        14.0        16.0
2  2021_1101_1104   0.5    2021     1101     1104        14.0         2.0
3  2021_1101_1105   0.5    2021     1101     1105        14.0        16.0
4  2021_1101_1106   0.5    2021     1101     1106        14.0        16.0
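The same seed lookup can also be written as a left merge, which some readers find easier to follow than mapping over a MultiIndex. The sketch below is an alternative, not the notebook's method; it uses a toy table built from the values shown in the output above, and `attach_seed` is a hypothetical helper.

```python
import pandas as pd

# Toy data taken from the output above: 1101 is seeded 14, 1104 is seeded 2
seeds = pd.DataFrame({"Season": [2021, 2021],
                      "TeamID": [1101, 1104],
                      "SeedValue": [14, 2]})
games = pd.DataFrame({"Season": [2021, 2021],
                      "TeamID1": [1101, 1101],
                      "TeamID2": [1102, 1104]})


def attach_seed(games, seeds, team_col, out_col, default=16):
    """Left-join seeds onto games for one team column; unseeded teams get `default`."""
    lookup = seeds.rename(columns={"TeamID": team_col, "SeedValue": out_col})
    merged = games.merge(lookup, on=["Season", team_col], how="left")
    merged[out_col] = merged[out_col].fillna(default).astype(int)
    return merged


games = attach_seed(games, seeds, "TeamID1", "SeedValue1")
games = attach_seed(games, seeds, "TeamID2", "SeedValue2")
print(games)
```

Casting back to `int` after `fillna` avoids the float seed values (such as `14.0`) that appear when missing entries are filled.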


In [7]:
# Linear seed-difference formula, clipped to the 5%-95% range
submission_df['Pred'] = (0.5 + 0.03 * (submission_df['SeedValue2'] - submission_df['SeedValue1'])).clip(0.05, 0.95)

# Summary statistics of the predictions
stats = submission_df['Pred'].describe()
print(stats)

count    507108.000000
mean          0.502152
std           0.147387
min           0.050000
25%           0.500000
50%           0.500000
75%           0.500000
max           0.950000
Name: Pred, dtype: float64


## Step 5: Understand the metric

We don't know the outcomes of the games, so as a placeholder let's assume that the first-listed team (the one with the smaller TeamID) won every single matchup. This is what we'll call our "true value". Next, we'll calculate the average squared difference between the probabilities in our submission and that placeholder ground truth. This quantity is known as the Brier score (https://en.wikipedia.org/wiki/Brier_score). Lower is better, and a constant prediction of 0.5 always scores 0.25.
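To see how the metric behaves, here is a minimal sketch on three made-up games. The outcomes and predictions are illustrative values only, and `brier_score` is a helper defined here rather than a competition utility.

```python
import numpy as np


def brier_score(preds, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    preds = np.asarray(preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((preds - outcomes) ** 2))


# Illustrative outcomes: 1 means the first-listed team won, 0 means it lost
outcomes = [1, 1, 0]

print(f"always 0.5:        {brier_score([0.5, 0.5, 0.5], outcomes):.4f}")
print(f"confident, right:  {brier_score([0.9, 0.8, 0.1], outcomes):.4f}")
print(f"confident, wrong:  {brier_score([0.1, 0.2, 0.9], outcomes):.4f}")
```

Confident predictions are rewarded when they are right and penalized heavily when they are wrong, which is why the starter formula keeps its predictions between 0.05 and 0.95.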

In [8]:
# Placeholder ground truth: assume the first-listed team won every matchup
true_values = np.ones(len(submission_df))

# Brier score: mean squared difference between predictions and outcomes
brier_score = np.mean((submission_df['Pred'] - true_values) ** 2)
print(f"Brier score against the placeholder outcomes: {brier_score:.4f}")

# Check for NaN or infinite values in predictions
if not np.isfinite(submission_df['Pred']).all():
    print("Warning: NaN or infinite values detected in predictions.")

## Step 6: Make your submission

In [9]:
submission_df[['ID', 'Pred']].to_csv('/kaggle/working/submission.csv', index=False)
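Before uploading, it can help to run a few basic checks on the file contents. The checks below are assumptions about what a well-formed submission looks like, based on the sample file shown earlier; `validate_submission` is a hypothetical helper, not an official validator.

```python
import pandas as pd


def validate_submission(df: pd.DataFrame) -> list:
    """Return a list of problems found in a submission table (empty if none)."""
    if list(df.columns) != ["ID", "Pred"]:
        return ["columns must be exactly ['ID', 'Pred']"]
    problems = []
    if df["ID"].duplicated().any():
        problems.append("duplicate ID values")
    # Assumed ID layout: four-digit year and two four-digit team IDs
    if not df["ID"].str.fullmatch(r"\d{4}_\d{4}_\d{4}").all():
        problems.append("ID not in year_teamID1_teamID2 format")
    if df["Pred"].isna().any():
        problems.append("missing predictions")
    elif not df["Pred"].between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems


good = pd.DataFrame({"ID": ["2021_1101_1102", "2021_1101_1104"], "Pred": [0.5, 0.14]})
bad = pd.DataFrame({"ID": ["2021_1101_1102", "2021_1101_1102"], "Pred": [0.5, 1.2]})
print(validate_submission(good))
print(validate_submission(bad))
```

In the notebook, the same function could be called on `submission_df[['ID', 'Pred']]` just before `to_csv`.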


As a final sanity check, we can evaluate the formula from the overview for a few seed pairings.

In [10]:
print(f"#1 seed vs #16 seed: {0.5 + 0.03*(16-1):.2f}")
print(f"#8 seed vs #9 seed: {0.5 + 0.03*(9-8):.2f}")
print(f"#16 seed vs #1 seed: {0.5 + 0.03*(1-16):.2f}")


#1 seed vs #16 seed: 0.95
#8 seed vs #9 seed: 0.53
#16 seed vs #1 seed: 0.05
