# March Mania 2025 - Starter Notebook

## Goal of the competition

The goal of this competition is to predict that probability that the smaller ``TeamID`` will win a given matchup. You will predict the probability for every possible matchup between every possible team over the past 4 years. You'll be given a sample submission file where the ```ID``` value indicates the year of the matchup as well as the identities of both teams within the matchup. For example, for an ```ID``` of ```2025_1101_1104``` you would need to predict the outcome of the matchup between ```TeamID 1101``` vs ```TeamID 1104``` during the ```2025``` tournament. Submitting a ```PRED``` of ```0.75``` indicates that you think that the probability of ```TeamID 1101``` winning that particular matchup is equal to ```0.75```.


## Overview of our submission strategy 
For this starter notebook, we will make a simple submission.

We can predict the winner of a match by considering the respective rankings of the opposing teams, only. Since the largest possible difference is 15 (which is #16 minus #1), we use a rudimentary formula that's 0.5 plus 0.03 times the difference in seeds, leading to a range of predictions spanning from 5% up to 95%. The stronger-seeded team (with a lower seed number from 1 to 16) will be the favorite and will have a prediction above 50%. 

# Starter Code

## Step 1: Import Python packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss
from sklearn.preprocessing import StandardScaler


## Step 2: Explore the data

In [2]:
# Load data
w_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv')
m_seed = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv')
seed_df = pd.concat([m_seed, w_seed], axis=0)

submission_df = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage1.csv')

# Display first few rows to check structure
print(seed_df.head())
print(submission_df.head())


   Season Seed  TeamID
0    1985  W01    1207
1    1985  W02    1210
2    1985  W03    1228
3    1985  W04    1260
4    1985  W05    1374
               ID  Pred
0  2021_1101_1102   0.5
1  2021_1101_1103   0.5
2  2021_1101_1104   0.5
3  2021_1101_1105   0.5
4  2021_1101_1106   0.5


Team rankings are present in the files WNCAATourneySeeds.csv and MNCAATourneySeeds.csv. 
- The "Season" column indicates the year
- The "Seed" column indicates the ranking for a given conference (W01 = ranking 1 in conference W)
- The "TeamID" column contains a unique identifier for every team

In [3]:
# Display first few rows to check structure
print(seed_df.head())
print(submission_df.head())


   Season Seed  TeamID
0    1985  W01    1207
1    1985  W02    1210
2    1985  W03    1228
3    1985  W04    1260
4    1985  W05    1374
               ID  Pred
0  2021_1101_1102   0.5
1  2021_1101_1103   0.5
2  2021_1101_1104   0.5
3  2021_1101_1105   0.5
4  2021_1101_1106   0.5


The sample_submission.csv file contains an "ID" column with the format year_teamID1_teamID2.

In [4]:
print(submission_df.head())

               ID  Pred
0  2021_1101_1102   0.5
1  2021_1101_1103   0.5
2  2021_1101_1104   0.5
3  2021_1101_1105   0.5
4  2021_1101_1106   0.5


## Step 3: Extract game info and team rankings

In [5]:
def extract_game_info(id_str):
    year, tid1, tid2 = map(int, id_str.split('_'))
    return year, min(tid1, tid2), max(tid1, tid2)  # Enforce TeamID1 < TeamID2

def extract_seed_value(seed_str):
    try:
        return int(str(seed_str).strip()[1:])  # Robust string handling
    except:
        return 16  # Catch-all for invalid formats

# Preprocess seeds with defensive programming
seed_df['SeedValue'] = seed_df['Seed'].fillna('Z16').apply(extract_seed_value)

# Transform submission data
submission_df[['Season', 'TeamID1', 'TeamID2']] = submission_df['ID'].apply(extract_game_info).tolist()


## Step 4: Make your predictions

In [6]:
# Merge seed data with the submission data
seed_map = seed_df.set_index(['Season', 'TeamID'])['SeedValue']
submission_df['SeedValue1'] = submission_df.set_index(['Season', 'TeamID1']).index.map(seed_map).fillna(16)
submission_df['SeedValue2'] = submission_df.set_index(['Season', 'TeamID2']).index.map(seed_map).fillna(16)

# Create the seed difference feature
submission_df['SeedDifference'] = submission_df['SeedValue2'] - submission_df['SeedValue1']

# Simulate ground truth (label) where the team with the smaller seed number wins
submission_df['Label'] = (submission_df['SeedValue1'] < submission_df['SeedValue2']).astype(int)

# Define features and labels for training
X = submission_df[['SeedDifference']]
y = submission_df['Label']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Predict probabilities for the test set
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Probability for the positive class (TeamID1 wins)

# Evaluate the model using the Brier score
brier_score = brier_score_loss(y_test, y_pred_prob)
print(f'Brier Score: {brier_score}')


Brier Score: 1.38531568138281e-07


In [7]:
submission_df['Pred'] = (0.5 + 0.03 * (submission_df['SeedValue2'] - submission_df['SeedValue1'])).clip(0.05, 0.95)

stats = submission_df.iloc[:, 1].describe()
print(stats)

count    507108.000000
mean          0.502152
std           0.147387
min           0.050000
25%           0.500000
50%           0.500000
75%           0.500000
max           0.950000
Name: Pred, dtype: float64


## Step 5: Understand the metric

We don't know the outcomes of the games, so instead let's assume that the team that was listed first won every single matchup. This is what we'll call our "true value". Next, we'll calculate the average squared difference between the probabilities in our submission and that ground truth value. We'll call this the "Brier score". https://en.wikipedia.org/wiki/Brier_score

In [8]:
submission_df['Pred'] = 1 / (1 + np.exp(submission_df['SeedValue1'] - submission_df['SeedValue2']))

# Check for NaN or infinite values in predictions (could happen with extreme seed differences)
if submission_df['Pred'].isnull().any() or (submission_df['Pred'] == np.inf).any():
    print("Warning: NaN or infinite values detected in predictions.")

## Step 6: Make your submission

In [9]:
# Use the trained model to predict probabilities for the full dataset
submission_df['Pred'] = model.predict_proba(submission_df[['SeedDifference']])[:, 1]

# Clip predictions to be between 0.05 and 0.95 (as per competition rules)
submission_df['Pred'] = submission_df['Pred'].clip(0.05, 0.95)

# Create the submission file
submission_df[['ID', 'Pred']].to_csv('/kaggle/working/submission.csv', index=False)
print("Submission file created!")


Submission file created!
