<a href="https://www.kaggle.com/code/robinamirbahar/march-mania-2025-robina?scriptVersionId=224685110" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🏀 March Mania 2025 

Hey everyone! 👋 It’s **Robina Mirbahar**, and I’m diving into **March Mania 2025**, a Kaggle competition where we predict the outcomes of **college basketball matchups**. In this notebook, I do that from seed rankings alone! 🎯  

The approach I’m using here is deliberately **simple**: it uses **seed rankings** to estimate **win probabilities**. It’s a **starter-friendly baseline** that can be improved with **more advanced machine learning techniques**.  

Let’s get started! 🚀  

## 📌 Overview
This notebook predicts the probability that a team wins a **March Madness** matchup from the difference between the two teams’ **seed rankings**, using a simple linear formula.

## 📂 Dataset Information
- **MNCAATourneySeeds.csv** → Men’s tournament seed rankings
- **SampleSubmissionStage2.csv** → Sample file for matchups in 2025

## 📌 Competition Understanding
**Goal:** Predict the probability that the team with the lower TeamID wins each NCAA matchup  
**Submission Format Example:**  
`2025_1101_1104 ➔ 0.75`  
*(Team 1101 has a 75% chance to beat Team 1104 in the 2025 tournament)*
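
To make the ID convention concrete, here is a minimal sketch that builds a submission ID from two teams. The helper name `make_matchup_id` is hypothetical (it is not part of the competition files); it only assumes the `Season_LowerTeamID_HigherTeamID` layout shown in the example above.

```python
def make_matchup_id(season: int, team_a: int, team_b: int) -> str:
    """Build an ID with the lower TeamID first (hypothetical helper)."""
    low, high = sorted((team_a, team_b))
    return f"{season}_{low}_{high}"


# The order of the arguments does not matter: the lower TeamID always comes first.
print(make_matchup_id(2025, 1104, 1101))
```

The prediction attached to an ID is the win probability of the first team in it, which is the one with the lower TeamID.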



## 📥 1️⃣ Import Required Libraries

In [1]:
# 📚 Essential Libraries for Data Processing
import pandas as pd
import numpy as np
import re  # For parsing seed values


✅ **Why are these needed?**  

- 🐼 **pandas** → Handling datasets efficiently 📊  
- 🔢 **numpy** → Performing numerical operations  
- 🔍 **re** → Extracting numeric seed values from text  



## 🏆 2️⃣ Helper Function: Parse Seed Values

In [2]:
def parse_seed(seed):
    """
    Extracts the numerical seed value from the seed string.
    Example: "W01" -> 1, "X16a" -> 16
    """
    if isinstance(seed, str):
        digits = re.sub(r"\D", "", seed)  # Remove non-numeric characters
        return int(digits) if digits else 16  # Default seed = 16 if missing
    return 16


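✅ **Why do this?**  

- The raw dataset stores seeds as text (e.g., "W01", "X16a"): a region letter (W, X, Y, or Z), a two-digit seed, and sometimes a play-in suffix (a or b).  
- We keep only the numeric part so the seeds can be used in calculations.  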
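
Here is a quick check of the helper on a few seed strings. The function is repeated so the snippet runs on its own, and the sample strings are illustrative, assuming the "region letter, two-digit seed, optional play-in suffix" pattern.

```python
import re


def parse_seed(seed):
    """Same logic as the helper above: keep the digits, default to 16."""
    if isinstance(seed, str):
        digits = re.sub(r"\D", "", seed)
        return int(digits) if digits else 16
    return 16


# Regular seeds, play-in seeds, an empty string, and a missing value.
samples = ["W01", "X16a", "Z11b", "", None]
parsed = {repr(s): parse_seed(s) for s in samples}
print(parsed)
```

Note that empty or missing seeds silently become 16, so a typo in the input looks the same as a genuine 16 seed.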

## 📊 3️⃣ Load & Process Seed Data


In [3]:
# 📥 Load Tournament Seed Data (men's file: every row is a men's team)
seed_data = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv')

# 🏀 Label Gender. The first letter of Seed is the region (W, X, Y, Z), not the gender,
# so the label comes from the file itself rather than from the Seed string.
seed_data['Gender'] = 'M'

# 🎯 Convert Seed Values to Numeric
seed_data['Seed'] = seed_data['Seed'].apply(parse_seed)  # Extract numerical seed values


✅ At this stage:

- The Seed column now contains numbers only.
- The Gender column labels every row as Men’s (M), because this file holds men’s seeds only.
- Women’s teams are absent from this table, so they fall back to the default seed of 16 in the lookup step (and women’s matchups therefore get a prediction of 0.5).
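
If women’s matchups should also get real seeds, one option is to load a second seed file and tag each file with its gender before concatenating. The sketch below uses tiny in-memory tables with made-up rows instead of the real files. The helper `load_seeds` is hypothetical, and the women’s file name `WNCAATourneySeeds.csv` mentioned in the comment is an assumption that should be checked against the competition data.

```python
import io

import pandas as pd

# Made-up rows in the same Season, Seed, TeamID layout as the seed file.
mens_csv = io.StringIO("Season,Seed,TeamID\n2025,W01,1101\n2025,X16a,1104\n")
womens_csv = io.StringIO("Season,Seed,TeamID\n2025,Y02,3101\n")


def load_seeds(source, gender):
    """Read one seed table and tag every row with the given gender."""
    df = pd.read_csv(source)
    df["Gender"] = gender  # taken from the file, not from the Seed string
    # Keep only the digits of the seed, e.g. "X16a" -> 16.
    df["Seed"] = df["Seed"].str.extract(r"(\d+)", expand=False).astype(int)
    return df


# With real data, the sources would be the men's file and (assumed name)
# WNCAATourneySeeds.csv from the competition input folder.
all_seeds = pd.concat(
    [load_seeds(mens_csv, "M"), load_seeds(womens_csv, "W")],
    ignore_index=True,
)
print(all_seeds)
```

A combined table like this can be indexed by Season, Gender, and TeamID in the same way as the single-file version.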

## 🏀 4️⃣ Extract Matchup Details

In [4]:
# 📥 Load Sample Submission Data
submission = pd.read_csv('/kaggle/input/march-machine-learning-mania-2025/SampleSubmissionStage2.csv')

# 🏀 Extract matchup details from ID column (e.g., 2025_1101_1104)
sub_pairs = submission['ID'].str.split('_', expand=True)
sub_pairs.columns = ['Season', 'TeamID_1', 'TeamID_2']

# 🔄 Convert values to integer type for easier processing
sub_pairs['Season'] = sub_pairs['Season'].astype(int)
sub_pairs['TeamID_1'] = sub_pairs['TeamID_1'].astype(int)
sub_pairs['TeamID_2'] = sub_pairs['TeamID_2'].astype(int)

# 🏀 Add Gender column (assuming TeamIDs starting with '1' are men's teams, all others women's)
sub_pairs['Gender'] = np.where(sub_pairs['TeamID_1'].astype(str).str.startswith('1'), 'M', 'W')


✅ **Now, we have:**  

- Separated **Season**, **TeamID_1**, and **TeamID_2** from the match ID.  
- Converted all values into a **clean numerical format**.  
- Assigned **gender labels** to ensure correct matching of teams.  
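
The same split can be tried on a tiny table before running it on the full file. This is a minimal sketch with two made-up IDs; the column names match the ones used above.

```python
import numpy as np
import pandas as pd

# Two made-up IDs: one men's matchup and one women's matchup.
toy = pd.DataFrame({"ID": ["2025_1101_1104", "2025_3101_3104"]})

# Split on "_" and convert all three parts to integers in one step.
pairs = toy["ID"].str.split("_", expand=True).astype(int)
pairs.columns = ["Season", "TeamID_1", "TeamID_2"]

# Same rule as above: a TeamID whose first digit is 1 is treated as men's.
pairs["Gender"] = np.where(
    pairs["TeamID_1"].astype(str).str.startswith("1"), "M", "W"
)
print(pairs)
```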


## 🔍 5️⃣ Assign Seed Values to Teams

In [5]:
# 🎯 Create a lookup table for seeds
seed_lookup = seed_data.set_index(['Season', 'Gender', 'TeamID'])['Seed']

# 🔎 Fetch Seed Values for Each Team
sub_pairs['Seed_1'] = sub_pairs.apply(
    lambda x: seed_lookup.get((x['Season'], x['Gender'], x['TeamID_1']), 16), axis=1
)
sub_pairs['Seed_2'] = sub_pairs.apply(
    lambda x: seed_lookup.get((x['Season'], x['Gender'], x['TeamID_2']), 16), axis=1
)

# 🔢 Compute Seed Difference
sub_pairs['Seed_Diff'] = sub_pairs['Seed_2'] - sub_pairs['Seed_1']


✅ At this stage:

- Every matchup has two seed rankings (Seed_1, Seed_2).
- Seed_Diff = Seed_2 - Seed_1, so a positive value means Team 1 holds the better (lower) seed and is more likely to win.
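
The row-by-row `apply` above is easy to read but can be slow on large frames. As a sketch of a vectorized alternative (a swapped-in technique, not this notebook’s method), a left `merge` followed by `fillna` gives the same default-to-16 behavior. The helper name `attach_seed` and the toy seeds are made up for illustration, and the Gender key is left out because the TeamIDs in the toy data are already unique.

```python
import pandas as pd

# Made-up seeds for two teams; team 1200 is deliberately missing.
seeds = pd.DataFrame(
    {"Season": [2025, 2025], "TeamID": [1101, 1104], "Seed": [3, 12]}
)
pairs = pd.DataFrame(
    {"Season": [2025, 2025], "TeamID_1": [1101, 1101], "TeamID_2": [1104, 1200]}
)


def attach_seed(pairs, seeds, team_col, seed_col, default=16):
    """Left-join seeds onto one team column; unseeded teams get the default."""
    renamed = seeds.rename(columns={"TeamID": team_col, "Seed": seed_col})
    merged = pairs.merge(renamed, on=["Season", team_col], how="left")
    merged[seed_col] = merged[seed_col].fillna(default).astype(int)
    return merged


pairs = attach_seed(pairs, seeds, "TeamID_1", "Seed_1")
pairs = attach_seed(pairs, seeds, "TeamID_2", "Seed_2")
pairs["Seed_Diff"] = pairs["Seed_2"] - pairs["Seed_1"]
print(pairs)
```

In the second row, team 1200 has no seed, so it receives 16 and the difference becomes 13.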

## 📊 6️⃣ Compute Win Probability

In [6]:
# 🎯 Use Seed Difference to Predict Win Probability
sub_pairs['Pred'] = 0.5 + (0.03 * sub_pairs['Seed_Diff'])

# 🎯 Clip Values Between 0.05 and 0.95
sub_pairs['Pred'] = sub_pairs['Pred'].clip(0.05, 0.95)


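✅ **How does this work?**  

- Seed_Diff = Seed_2 - Seed_1, and a lower seed number means a stronger team.  
- If Seed_Diff is positive → Team 1 is the favorite (Pred above 0.5).  
- If Seed_Diff is negative → Team 2 is the favorite (Pred below 0.5).  
- Each seed of difference shifts the win probability by 3 percentage points.  

✅ **Why clip values?**  

- To avoid extreme probabilities like 0% or 100%. With seeds from 1 to 16, the formula already stays between 0.05 and 0.95 (0.5 ± 0.03 × 15), so the clip acts as a safeguard.  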
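
To see the formula on concrete numbers, here is a small sketch that applies the same rule to a few seed pairs. The function name `seed_win_prob` is hypothetical; the constants 0.5, 0.03, 0.05, and 0.95 are the ones used in the cell above.

```python
def seed_win_prob(seed_1: int, seed_2: int) -> float:
    """Win probability of team 1 from the linear seed rule, clipped to [0.05, 0.95]."""
    raw = 0.5 + 0.03 * (seed_2 - seed_1)
    return min(max(raw, 0.05), 0.95)


# (seed of team 1, seed of team 2)
for s1, s2 in [(1, 16), (8, 9), (12, 5), (16, 16)]:
    print(f"seed {s1} vs seed {s2}: {seed_win_prob(s1, s2):.2f}")
```

A 1 seed against a 16 seed gets 0.95, an 8 against a 9 gets 0.53, a 12 against a 5 gets 0.29, and two teams with the same seed get 0.50.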

## 📤 7️⃣ Generate Submission File

In [7]:
# 📜 Save Predictions to Submission File
submission['Pred'] = sub_pairs['Pred']
submission.to_csv('submission.csv', index=False)

# ✅ Completion Message
print("✅ Submission file created successfully! 🏀")


✅ Submission file created successfully! 🏀


🎯 At this stage:

Predictions are saved in submission.csv.
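
Before uploading, a few sanity checks can catch common mistakes. This is a minimal sketch on a made-up table: the helper `check_submission` is hypothetical, and it assumes the submission has exactly the columns ID and Pred, which should be confirmed against the sample submission file.

```python
import pandas as pd


def check_submission(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ["ID", "Pred"], "expected columns ID and Pred"
    assert df["ID"].is_unique, "duplicate matchup IDs"
    assert df["Pred"].notna().all(), "missing predictions"
    assert df["Pred"].between(0, 1).all(), "predictions outside [0, 1]"


# Made-up rows in the same shape as the submission file.
toy_submission = pd.DataFrame(
    {"ID": ["2025_1101_1104", "2025_1101_1200"], "Pred": [0.77, 0.89]}
)
check_submission(toy_submission)
print("toy submission passed the checks")
```

On the real data, the same call would be `check_submission(submission)` right before `to_csv`.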

## 🎯 Summary of Steps  

| **Step**            | **Description**                                      |
|---------------------|--------------------------------------------------|
| 📥 **Load Data**    | Import seed data and sample submission.         |
| 🏀 **Extract Matchups** | Split match IDs into structured data.         |
| 🔍 **Assign Seeds**  | Match each team with their respective seed.     |
| 📊 **Compute Probabilities** | Use the formula to estimate win chances.  |
| 📤 **Generate Submission**  | Save predictions to a CSV file.           |
