# March Machine Learning Mania 2026

**Goal:** Predict win probabilities for every possible NCAA basketball tournament matchup (men's and women's), evaluated on log loss.

**Approach:**
1. Load and explore historical data
2. Engineer features (Elo ratings, efficiency metrics, seeds)
3. Build training dataset from historical tournament games
4. Train an XGBoost classifier and calibrate probabilities
5. Generate submission for all 2026 matchups

**Key design choices:**
- Men's (M) and Women's (W) data are processed with the same pipeline and combined for training
- Every tournament game is encoded with `team1 = lower TeamID`, outcome = 1 if team1 won
- Elo is updated game-by-game through each regular season; tournament seed is used as a strong prior signal

## 0. Setup

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import log_loss

# ── Data path ──────────────────────────────────────────────────────────────────
# On Kaggle this will be: /kaggle/input/march-machine-learning-mania-2026/
# Locally point to wherever you extracted the zip.
DATA_DIR = '/kaggle/input/march-machine-learning-mania-2026/' if os.path.exists('/kaggle') \
    else os.path.join(os.getcwd(), 'data')   # notebooks have no __file__, so resolve against the working directory

print(f'Data directory: {DATA_DIR}')
print(f'Files found: {len(os.listdir(DATA_DIR))}')

Data directory: /home/een/bhaskar/projects/kaggle/ncaa-2026/data
Files found: 35


## 1. Load Data

We load the key tables for both men's (M) and women's (W) tournaments.

In [2]:
def load(fname):
    return pd.read_csv(os.path.join(DATA_DIR, fname))

# ── Men's data ─────────────────────────────────────────────────────────────────
m_teams        = load('MTeams.csv')
m_seasons      = load('MSeasons.csv')
m_reg_compact  = load('MRegularSeasonCompactResults.csv')
m_tourney      = load('MNCAATourneyCompactResults.csv')
m_seeds        = load('MNCAATourneySeeds.csv')

# ── Women's data ───────────────────────────────────────────────────────────────
w_teams        = load('WTeams.csv')
w_seasons      = load('WSeasons.csv')
w_reg_compact  = load('WRegularSeasonCompactResults.csv')
w_tourney      = load('WNCAATourneyCompactResults.csv')
w_seeds        = load('WNCAATourneySeeds.csv')

# ── Submission template ────────────────────────────────────────────────────────
sample_sub = load('SampleSubmissionStage2.csv')

print('Men\'s regular season games:', len(m_reg_compact))
print('Men\'s tourney games:        ', len(m_tourney))
print('Women\'s regular season games:', len(w_reg_compact))
print('Women\'s tourney games:       ', len(w_tourney))
print('Submission rows:             ', len(sample_sub))

# Tag gender for combined pipeline
m_reg_compact['Gender'] = 'M'
m_tourney['Gender']     = 'M'
m_seeds['Gender']       = 'M'
w_reg_compact['Gender'] = 'W'
w_tourney['Gender']     = 'W'
w_seeds['Gender']       = 'W'

reg_all    = pd.concat([m_reg_compact, w_reg_compact], ignore_index=True)
tourney_all = pd.concat([m_tourney, w_tourney], ignore_index=True)
seeds_all  = pd.concat([m_seeds, w_seeds], ignore_index=True)

Men's regular season games: 196823
Men's tourney games:         2585
Women's regular season games: 140825
Women's tourney games:        1717
Submission rows:              132133


## 2. Feature Engineering

For each team and season we compute:
- **Elo rating** — game-by-game update through the regular season (K=20, home advantage=100)
- **Win %**, **avg points scored**, **avg points allowed**, **avg point differential**
- **Tournament seed** (numeric 1-16)

These become the team-level stats that we use to build matchup features.
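
The Elo update described above can be sketched for a single game. This is a minimal illustration using the listed parameters (K=20, home advantage=100) and the usual logistic expectation on a 400-point scale; the helper name `elo_update` and the example ratings are hypothetical and not part of the pipeline.

```python
# Illustrative single-game Elo update; names and ratings here are hypothetical.
ELO_K = 20       # update step size, as listed above
HOME_ADV = 100   # rating points credited to the home side when computing expectations

def expected_score(ra, rb):
    """Probability that a team rated `ra` beats a team rated `rb`."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def elo_update(r_winner, r_loser, winner_loc='N'):
    """Return (new_winner, new_loser) ratings after one game.

    winner_loc: 'H' if the winner played at home, 'A' if away, 'N' if neutral.
    """
    shift = {'H': HOME_ADV, 'A': -HOME_ADV}.get(winner_loc, 0)
    exp_win = expected_score(r_winner + shift, r_loser)
    delta = ELO_K * (1 - exp_win)   # an unexpected win earns a larger reward
    return r_winner + delta, r_loser - delta

# Two evenly rated teams: a road win moves the ratings more than a home win.
home_win = elo_update(1500, 1500, 'H')
road_win = elo_update(1500, 1500, 'A')
print(home_win, road_win)
```

The update is zero-sum: whatever the winner gains, the loser gives up, so the average rating across teams stays put within a season.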

In [3]:
# ── 2.1 Elo ratings ───────────────────────────────────────────────────────────
ELO_INIT   = 1500
ELO_K      = 20
HOME_ADV   = 100   # points added to home team's Elo for expectation calc
ELO_REVERT = 0.75  # fraction of last season's rating carried over; the remaining 25% reverts to ELO_INIT

def expected_score(ra, rb):
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def compute_elo(reg_df):
    """
    Returns a DataFrame indexed by (Gender, Season, TeamID) with the team's
    end-of-regular-season Elo rating.
    """
    elo = {}   # {(Gender, TeamID): current_elo}
    records = []  # (Gender, Season, TeamID, elo_end)

    for (gender, season), games in reg_df.sort_values(['Gender','Season','DayNum']).groupby(['Gender','Season']):
        # Season-start reversion toward mean
        for k in list(elo.keys()):
            if k[0] == gender:
                elo[k] = ELO_REVERT * elo[k] + (1 - ELO_REVERT) * ELO_INIT

        for _, g in games.iterrows():
            wid = (gender, g.WTeamID)
            lid = (gender, g.LTeamID)
            rw  = elo.get(wid, ELO_INIT)
            rl  = elo.get(lid, ELO_INIT)

            # Home advantage adjustment
            loc = g.WLoc
            if   loc == 'H': rw_adj = rw + HOME_ADV
            elif loc == 'A': rw_adj = rw - HOME_ADV
            else:            rw_adj = rw

            ew = expected_score(rw_adj, rl)
            elo[wid] = rw + ELO_K * (1 - ew)
            elo[lid] = rl + ELO_K * (0 - (1 - ew))

        # Snapshot end-of-season Elo for every team that played this season
        teams_this = set(games.WTeamID) | set(games.LTeamID)
        for tid in teams_this:
            records.append({
                'Gender': gender,
                'Season': season,
                'TeamID': tid,
                'Elo':    elo.get((gender, tid), ELO_INIT)
            })

    return pd.DataFrame(records)

elo_df = compute_elo(reg_all)
print('Elo records:', len(elo_df))
print(elo_df.query('Season==2025 and Gender=="M"').sort_values('Elo', ascending=False).head(10))

Elo records: 23604
      Gender  Season  TeamID          Elo
13137      M    2025    1222  1841.005440
13098      M    2025    1181  1789.364596
13041      M    2025    1120  1778.857442
13113      M    2025    1196  1771.366413
13096      M    2025    1179  1765.659670
13298      M    2025    1388  1747.671697
13307      M    2025    1397  1747.537250
13187      M    2025    1272  1743.956404
13027      M    2025    1104  1742.041244
13295      M    2025    1385  1741.740359


In [4]:
# ── 2.2 Efficiency metrics ────────────────────────────────────────────────────
def compute_efficiency(reg_df):
    """
    Per (Gender, Season, TeamID): games played, wins, mean points for and
    against, win percentage, and mean point differential.
    Vectorized: no row-by-row iteration.
    """
    # Winner perspective
    w_view = reg_df[['Gender','Season','WTeamID','WScore','LScore']].copy()
    w_view.columns = ['Gender','Season','TeamID','PtsFor','PtsAgainst']
    w_view['Win'] = 1

    # Loser perspective
    l_view = reg_df[['Gender','Season','LTeamID','LScore','WScore']].copy()
    l_view.columns = ['Gender','Season','TeamID','PtsFor','PtsAgainst']
    l_view['Win'] = 0

    df = pd.concat([w_view, l_view], ignore_index=True)
    agg = df.groupby(['Gender','Season','TeamID']).agg(
        Games      = ('Win', 'count'),
        Wins       = ('Win', 'sum'),
        PtsFor     = ('PtsFor', 'mean'),
        PtsAgainst = ('PtsAgainst', 'mean'),
    ).reset_index()
    agg['WinPct']  = agg['Wins'] / agg['Games']
    agg['PtsDiff'] = agg['PtsFor'] - agg['PtsAgainst']
    return agg

eff_df = compute_efficiency(reg_all)
print('Efficiency records:', len(eff_df))
print(eff_df.query('Season==2025 and Gender=="M"').sort_values('PtsDiff', ascending=False).head(5))

Efficiency records: 23604
      Gender  Season  TeamID  Games  Wins     PtsFor  PtsAgainst    WinPct  \
13098      M    2025    1181     34    31  82.705882   61.911765  0.911765   
13128      M    2025    1211     33    25  86.636364   69.636364  0.757576   
13113      M    2025    1196     34    30  85.411765   69.235294  0.882353   
13137      M    2025    1222     34    30  74.205882   58.470588  0.882353   
13378      M    2025    1471     32    28  77.937500   62.843750  0.875000   

         PtsDiff  
13098  20.794118  
13128  17.000000  
13113  16.176471  
13137  15.735294  
13378  15.093750  


In [5]:
# ── 2.3 Tournament seeds ──────────────────────────────────────────────────────
def parse_seed(seed_str):
    """Convert 'W01', 'X16a' etc. to integer 1-16."""
    # Strip region letter and play-in suffix (a/b)
    return int(''.join(filter(str.isdigit, seed_str)))

seeds_all['SeedNum'] = seeds_all['Seed'].apply(parse_seed)
print(seeds_all.head())

   Season Seed  TeamID Gender  SeedNum
0    1985  W01    1207      M        1
1    1985  W02    1210      M        2
2    1985  W03    1228      M        3
3    1985  W04    1260      M        4
4    1985  W05    1374      M        5


In [6]:
# ── 2.4 Merge all features into a single team-season stats table ──────────────
team_stats = elo_df.merge(eff_df, on=['Gender','Season','TeamID'], how='left') \
                   .merge(seeds_all[['Gender','Season','TeamID','SeedNum']],
                          on=['Gender','Season','TeamID'], how='left')

# Seed 0 = not seeded that season (regular-season-only teams).
# Caveat: 0 sorts ahead of a 1 seed, so SeedNum_diff is only meaningful when both teams are seeded.
team_stats['SeedNum'] = team_stats['SeedNum'].fillna(0)

print('Team-stats shape:', team_stats.shape)
print(team_stats.query('Season==2025 and Gender=="M" and SeedNum>0').head())

Team-stats shape: (23604, 11)
      Gender  Season  TeamID          Elo  Games  Wins     PtsFor  PtsAgainst  \
13026      M    2025    1103  1682.001274     32    26  83.968750   75.906250   
13027      M    2025    1104  1742.041244     33    25  91.121212   81.424242   
13029      M    2025    1106  1453.846713     33    18  72.181818   72.484848   
13032      M    2025    1110  1518.231857     32    20  67.625000   68.093750   
13034      M    2025    1112  1702.074743     34    22  81.735294   72.441176   

         WinPct   PtsDiff  SeedNum  
13026  0.812500  8.062500     13.0  
13027  0.757576  9.696970      2.0  
13029  0.545455 -0.303030     16.0  
13032  0.625000 -0.468750     16.0  
13034  0.647059  9.294118      4.0  


## 3. Build Training Dataset

Each historical tournament game becomes one training row:
- `team1` = team with the lower TeamID, `team2` = the other
- Features = element-wise differences: `team1_stat - team2_stat`
- Label `y = 1` if `team1` won

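Because TeamID order is unrelated to team strength, `team1` is the stronger team in some games and the weaker one in others, so the difference features take both signs and both outcomes are represented. Each game still appears only once, so the model is not constrained to return exactly complementary probabilities if the two teams were swapped.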
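
The encoding can be sketched on a single made-up matchup. The helper `encode_game`, the TeamIDs, and the stat values are hypothetical; the sketch only shows that the features depend on which team has the lower ID, while the label depends on who won.

```python
def encode_game(winner_id, loser_id, stats):
    """Encode one game as (t1, t2, diff_features, y) with t1 = lower TeamID.

    `stats` maps TeamID -> dict of team-level features (hypothetical values).
    """
    t1, t2 = sorted((winner_id, loser_id))
    y = 1 if winner_id == t1 else 0
    diffs = {f'{k}_diff': stats[t1][k] - stats[t2][k] for k in stats[t1]}
    return t1, t2, diffs, y

stats = {
    1101: {'Elo': 1450.0, 'SeedNum': 12},
    1234: {'Elo': 1700.0, 'SeedNum': 3},
}

# Same matchup, two possible outcomes: identical features, only y changes.
upset = encode_game(winner_id=1101, loser_id=1234, stats=stats)
chalk = encode_game(winner_id=1234, loser_id=1101, stats=stats)
print(upset)
print(chalk)
```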

In [7]:
FEATURE_COLS = ['Elo', 'WinPct', 'PtsFor', 'PtsAgainst', 'PtsDiff', 'SeedNum']

def make_features(t1_stats, t2_stats):
    """Return a dict of difference features (team1 - team2)."""
    feats = {}
    for col in FEATURE_COLS:
        v1 = t1_stats.get(col, np.nan)
        v2 = t2_stats.get(col, np.nan)
        feats[f'{col}_diff'] = v1 - v2
    return feats

# Index team_stats for fast lookup
ts_idx = team_stats.set_index(['Gender','Season','TeamID'])[FEATURE_COLS].to_dict('index')

def get_stats(gender, season, team_id):
    return ts_idx.get((gender, season, team_id), {c: np.nan for c in FEATURE_COLS})

# Build training rows
train_rows = []
for _, g in tourney_all.iterrows():
    t1, t2 = (g.WTeamID, g.LTeamID) if g.WTeamID < g.LTeamID else (g.LTeamID, g.WTeamID)
    y = 1 if g.WTeamID < g.LTeamID else 0   # 1 = lower ID won

    s1 = get_stats(g.Gender, g.Season, t1)
    s2 = get_stats(g.Gender, g.Season, t2)
    row = {'Gender': g.Gender, 'Season': g.Season, 'T1': t1, 'T2': t2, 'y': y}
    row.update(make_features(s1, s2))
    train_rows.append(row)

train_df = pd.DataFrame(train_rows).dropna()
print('Training rows:', len(train_df))
print(train_df.head())

Training rows: 4302
  Gender  Season    T1    T2  y    Elo_diff  WinPct_diff  PtsFor_diff  \
0      M    1985  1116  1234  1   13.090200    -0.030303    -4.400000   
1      M    1985  1120  1345  1  -13.359972    -0.059310     1.224828   
2      M    1985  1207  1250  1  224.378498     0.546616     9.982120   
3      M    1985  1229  1425  1   11.314189     0.062169     3.199735   
4      M    1985  1242  1325  1    6.282735     0.025926     8.477778   

   PtsAgainst_diff  PtsDiff_diff  SeedNum_diff  
0         2.430303     -6.830303           1.0  
1         1.335172     -0.110345           5.0  
2       -10.132822     20.114943         -15.0  
3         1.022487      2.177249           1.0  
4         7.400000      1.077778         -11.0  


## 4. Train Model

We train an XGBoost classifier on the historical tournament games and estimate out-of-sample log loss with 5-fold cross-validation. The final, calibrated model is fit on the full training set in the next section; calibration aligns the predicted probabilities with observed win frequencies rather than making them more extreme.
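
To make the metric concrete, here is a small sketch of binary log loss computed by hand with NumPy. The outcomes and probabilities are made up for illustration, and `binary_log_loss` is a hypothetical helper rather than the scikit-learn function used in the notebook.

```python
import numpy as np

def binary_log_loss(y_true, p_pred):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(-np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred)))

y = [1, 1, 0, 0]   # made-up outcomes

hedged   = binary_log_loss(y, [0.5, 0.5, 0.5, 0.5])    # always 50/50
good     = binary_log_loss(y, [0.8, 0.7, 0.3, 0.2])    # leaning the right way each time
one_miss = binary_log_loss(y, [0.8, 0.7, 0.3, 0.99])   # same, but one confident miss
print(f'{hedged:.4f} {good:.4f} {one_miss:.4f}')
```

A single confident miss is enough to push the score above the always-50/50 baseline, which is why the prediction step later clips probabilities away from 0 and 1.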

In [8]:
DIFF_COLS = [f'{c}_diff' for c in FEATURE_COLS]

X = train_df[DIFF_COLS].values
y = train_df['y'].values

# ── Base XGBoost model ────────────────────────────────────────────────────────
xgb = XGBClassifier(
    n_estimators      = 500,
    max_depth         = 3,
    learning_rate     = 0.05,
    subsample         = 0.8,
    colsample_bytree  = 0.8,
    min_child_weight  = 5,
    eval_metric       = 'logloss',
    random_state      = 42,
    n_jobs            = -1,
)

# Cross-validation log loss (stratified 5-fold)
cv_scores = cross_val_score(xgb, X, y, cv=5, scoring='neg_log_loss')
print(f'CV log loss: {-cv_scores.mean():.4f} ± {cv_scores.std():.4f}')

CV log loss: 0.5252 ± 0.0572


## 5. Calibration

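We apply Platt scaling (sigmoid calibration via `CalibratedClassifierCV`) so that predicted probabilities track observed win frequencies more closely. This matters because log loss penalizes overconfident misses heavily. With `cv=5`, the wrapper fits one XGBoost model per fold, calibrates each on its held-out fold, and averages the five calibrated predictions.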
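
The idea behind sigmoid calibration can be sketched with NumPy alone: fit a slope `a` and intercept `b` so that `sigmoid(a * score + b)` matches the outcomes. This is a simplified illustration on synthetic, deliberately overconfident scores; `fit_platt` is a hypothetical helper that uses plain gradient descent, and scikit-learn's own implementation differs in its fitting details.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss_np(y, p):
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_platt(scores, y, n_iter=5000, lr=0.1):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = sigmoid(a * scores + b)
        a -= lr * np.mean((p - y) * scores)   # d(loss)/da
        b -= lr * np.mean(p - y)              # d(loss)/db
    return a, b

# Synthetic data (assumption): outcomes follow `true_logit`, but the "model"
# reports scores three times too extreme, i.e. it is overconfident.
rng = np.random.default_rng(0)
true_logit = rng.normal(0.0, 1.0, size=2000)
y = (rng.random(2000) < sigmoid(true_logit)).astype(float)
raw_scores = 3.0 * true_logit

a, b = fit_platt(raw_scores, y)
before = log_loss_np(y, sigmoid(raw_scores))
after = log_loss_np(y, sigmoid(a * raw_scores + b))
print(f'a={a:.3f}, b={b:.3f}, log loss {before:.4f} -> {after:.4f}')
```

In this setup the fitted slope should come out well below 1, shrinking the overconfident scores back toward 0.5 and lowering log loss.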

In [9]:
# Calibrated classifier (5-fold cross-calibration)
calibrated_model = CalibratedClassifierCV(xgb, method='sigmoid', cv=5)
calibrated_model.fit(X, y)

# Final training log loss (in-sample, for reference)
train_preds = calibrated_model.predict_proba(X)[:, 1]
print(f'Train log loss (calibrated): {log_loss(y, train_preds):.4f}')

# Feature importance from one of the underlying base estimators
base_xgb = calibrated_model.calibrated_classifiers_[0].estimator
importances = pd.Series(base_xgb.feature_importances_, index=DIFF_COLS).sort_values(ascending=False)
print('\nFeature importances:')
print(importances)

Train log loss (calibrated): 0.4495

Feature importances:
SeedNum_diff       0.494973
Elo_diff           0.196290
PtsDiff_diff       0.109972
WinPct_diff        0.075604
PtsAgainst_diff    0.062282
PtsFor_diff        0.060878
dtype: float32


## 6. Generate 2026 Predictions

We parse all ~132K submission IDs into a DataFrame in one step, join the 2026 team stats for both teams via vectorized merges, compute the difference features column-wise, and call `predict_proba` once on the whole batch, with no per-row Python loops.
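
The parse-then-merge pattern can be sketched on a toy example. The IDs and ratings below are invented, and only a single `Elo` column is used to keep the sketch short.

```python
import pandas as pd

# Hypothetical IDs in the 'Season_T1_T2' format, plus made-up ratings.
ids = pd.Series(['2026_1101_1102', '2026_1101_1103', '2026_3101_3102'], name='ID')
stats = pd.DataFrame({
    'TeamID': [1101, 1102, 1103, 3101, 3102],
    'Elo':    [1500.0, 1620.0, 1480.0, 1550.0, 1400.0],
})

# Split every ID in one vectorized call instead of looping over rows.
parts = ids.str.split('_', expand=True).astype(int)
parts.columns = ['Season', 'T1', 'T2']
parts['ID'] = ids

# Attach each side's rating with two left merges, then take the difference.
merged = (
    parts
    .merge(stats.rename(columns={'TeamID': 'T1', 'Elo': 'Elo_T1'}), on='T1', how='left')
    .merge(stats.rename(columns={'TeamID': 'T2', 'Elo': 'Elo_T2'}), on='T2', how='left')
)
merged['Elo_diff'] = merged['Elo_T1'] - merged['Elo_T2']
print(merged[['ID', 'Elo_diff']])
```

Using `how='left'` keeps one output row per submission ID even when a team has no stats; such rows end up with NaN features, which is what a 0.5 fallback would need to catch.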

In [10]:
# ── 6. Generate 2026 Predictions (vectorized) ─────────────────────────────────
# Parse all submission IDs at once into a DataFrame
sub_df = sample_sub['ID'].str.split('_', expand=True)
sub_df.columns = ['Season', 'T1', 'T2']
sub_df[['Season','T1','T2']] = sub_df[['Season','T1','T2']].astype(int)
sub_df['ID'] = sample_sub['ID'].values
sub_df['Gender'] = np.where(sub_df['T1'] < 3000, 'M', 'W')

# Build 2026 team stats lookup table
stats_2026 = team_stats[team_stats['Season'] == 2026][['Gender','TeamID'] + FEATURE_COLS].copy()

# Join stats for T1 and T2
pred_df = sub_df.merge(
    stats_2026.rename(columns={c: f'{c}_T1' for c in FEATURE_COLS}).rename(columns={'TeamID': 'T1'}),
    on=['Gender','T1'], how='left'
).merge(
    stats_2026.rename(columns={c: f'{c}_T2' for c in FEATURE_COLS}).rename(columns={'TeamID': 'T2'}),
    on=['Gender','T2'], how='left'
)

# Compute difference features
for col in FEATURE_COLS:
    pred_df[f'{col}_diff'] = pred_df[f'{col}_T1'] - pred_df[f'{col}_T2']

# Identify rows with complete features vs. missing data
feat_matrix = pred_df[DIFF_COLS].values
has_data    = ~np.any(np.isnan(feat_matrix), axis=1)
missing     = int((~has_data).sum())

# Predict in one batch for all complete rows
preds = np.full(len(pred_df), 0.5)
if has_data.any():
    preds[has_data] = calibrated_model.predict_proba(feat_matrix[has_data])[:, 1]

# Clip to avoid log-loss explosion at boundaries
preds = np.clip(preds, 0.025, 0.975)

submission = pd.DataFrame({'ID': pred_df['ID'], 'Pred': preds})
print(f'Submission rows: {len(submission)}, fallback predictions (no data): {missing}')
print(submission.describe())

Submission rows: 132133, fallback predictions (no data): 0
                Pred
count  132133.000000
mean        0.537265
std         0.233497
min         0.108885
25%         0.341434
50%         0.542743
75%         0.739355
max         0.896291


## 7. Sanity Checks

In [11]:
# ── Distribution of predictions ───────────────────────────────────────────────
bins = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print('Prediction distribution:')
print(pd.cut(submission['Pred'], bins=bins).value_counts().sort_index())

# ── Ordering check ────────────────────────────────────────────────────────────
# The submission lists each pair once with t1 < t2, so no reversed ID exists to
# compare against; Pred(t2 beats t1) is implicitly 1 - Pred. Verify that
# convention directly instead of searching for reversed IDs.
id_parts = submission['ID'].str.split('_', expand=True).astype(int)
assert (id_parts[1] < id_parts[2]).all(), 'Found an ID with t1 >= t2!'

print(f'\nSymmetry note: submission uses lower-ID-first convention (IDs always t1 < t2).')
print(f'Pred range: [{submission["Pred"].min():.4f}, {submission["Pred"].max():.4f}]')

# Verify no NaNs or out-of-range values
assert submission['Pred'].between(0, 1).all(), 'Out-of-range predictions!'
assert submission['Pred'].notna().all(), 'NaN predictions!'
print('All predictions valid.')

Prediction distribution:
Pred
(0.0, 0.1]        0
(0.1, 0.2]    13621
(0.2, 0.3]    13826
(0.3, 0.4]    14385
(0.4, 0.5]    16635
(0.5, 0.6]    17801
(0.6, 0.7]    16794
(0.7, 0.8]    14813
(0.8, 0.9]    24258
(0.9, 1.0]        0
Name: count, dtype: int64

Symmetry note: submission uses lower-ID-first convention (IDs always t1 < t2).
Pred range: [0.1089, 0.8963]
All predictions valid.


## 8. Save Submission

In [12]:
out_path = 'submission.csv'
submission.to_csv(out_path, index=False)
print(f'Saved to {out_path}')
print(submission.head(10))

Saved to submission.csv
               ID      Pred
0  2026_1101_1102  0.795162
1  2026_1101_1103  0.175016
2  2026_1101_1104  0.181909
3  2026_1101_1105  0.713078
4  2026_1101_1106  0.456241
5  2026_1101_1107  0.443885
6  2026_1101_1108  0.738641
7  2026_1101_1110  0.440358
8  2026_1101_1111  0.361230
9  2026_1101_1112  0.208722


---

## Next Steps (improvements to explore)

| Area | Idea |
|---|---|
| Features | Add KenPom / Massey ordinals from `MMasseyOrdinals.csv` |
| Features | Use detailed box-score stats (FG%, turnovers, rebounds) from `*DetailedResults.csv` |
| Features | Coach tenure, conference strength |
| Model | LightGBM or ensemble of XGB + logistic regression |
| Calibration | Isotonic regression instead of Platt scaling |
| Seeds | Use 2026 seeds once bracket is announced (Stage 2) and re-train on seeded teams only |