## Background

 - [Strategy heatmaps for all submissions](https://www.kaggle.com/c/march-machine-learning-mania-2017/discussion/30333) (2017)
 - [March ML Mania 2023 Heatmap of Submissions](https://www.kaggle.com/code/jtrotman/march-ml-mania-2023-heatmap-of-submissions) (2023)
 

This notebook simulates the 2023 March Madness basketball tournaments (men's and women's) from a Kaggle competition submission. The links above show how a submission can be converted into a grid of pairwise win probabilities, which makes it easy to read off the probability for any game and so simulate the whole tournament.

We use the median of the experts' submissions as the prediction model, but any other submission file will do. We also compare the simulated tournament results with the forecast from FiveThirtyEight (538), a website well known for its statistical sports models.
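
As a minimal sketch of the grid idea described above, the toy example below builds a 3x3 probability grid from a made-up submission. The team IDs, the probabilities and the names `toy_sub` and `toy_grid` are assumptions for illustration only; the notebook's own `make_grid` does the real work further down.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-team submission: ID is "<year>_<lower TeamID>_<higher TeamID>",
# Pred is the probability that the lower-ID team wins.
toy_sub = pd.DataFrame({
    "ID": ["2023_1101_1102", "2023_1101_1103", "2023_1102_1103"],
    "Pred": [0.70, 0.55, 0.40],
})

# Split the ID into its three integer parts.
parts = toy_sub["ID"].str.split("_", expand=True).astype(int)
low_id, high_id = parts[1].to_numpy(), parts[2].to_numpy()

# Map each TeamID to a row/column position in the grid.
team_ids = np.unique(np.concatenate([low_id, high_id]))
position = {team: i for i, team in enumerate(team_ids)}
rows = np.array([position[t] for t in low_id])
cols = np.array([position[t] for t in high_id])

# toy_grid[a, b] holds P(team a beats team b); the mirror cell is the complement.
toy_grid = np.zeros((len(team_ids), len(team_ids)))
toy_grid[rows, cols] = toy_sub["Pred"].to_numpy()
toy_grid[cols, rows] = 1 - toy_sub["Pred"].to_numpy()

print(toy_grid)
```

Once the grid exists, the probability for any pairing is a single array lookup, which is what keeps the simulation loop cheap.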

Using the `TournamentSimulator` class below we can:

- Generate a bracket (comparing each match prediction against a fixed 0.5 instead of a random draw in *0..1*, so the team the submission favours always advances)
- Simulate the tournament from the start
- Simulate the tournament from a later stage, using known results together with the pre-tournament submission
- Compare the resulting distributions to fivethirtyeight's *rd1_win*, *rd2_win*, etc. columns
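
The modes listed above all come down to how a single game is decided. The sketch below is a simplified stand-in, not the class itself: `play_game` and `toy_probs` are hypothetical names, and writing a known result as probability 1.0 is an assumption about how the results files are encoded.

```python
import numpy as np

def play_game(probs, a, b, draw):
    """Return (winner, loser): a wins when the draw falls below P(a beats b)."""
    return (a, b) if draw < probs[a, b] else (b, a)

# Hypothetical 2-team grid: team 0 beats team 1 with probability 0.7.
toy_probs = np.array([[0.0, 0.7],
                      [0.3, 0.0]])

# Bracket mode: a fixed draw of 0.5 means the favourite always advances.
bracket_pick = play_game(toy_probs, 0, 1, draw=0.5)

# Simulation mode: a uniform draw in [0, 1) lets the underdog win sometimes.
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
wins_for_0 = sum(play_game(toy_probs, 0, 1, rng.random())[0] == 0
                 for _ in range(10_000))

# Known result: overwrite the cell with 1.0 (and its mirror with 0.0) so the
# actual winner advances in every simulation, since every draw is below 1.0.
toy_probs[0, 1], toy_probs[1, 0] = 1.0, 0.0
replayed = {play_game(toy_probs, 0, 1, rng.random()) for _ in range(1_000)}
```

Simulating from a later stage therefore needs no special code path: games already played become certainties in the grid and the same loop runs unchanged.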

Let's get started and see how well we can predict the outcome of the March Madness tournament!

## Contents

<div id="contents_list"></div>

In [1]:
import base64, io, re, os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import HTML, Markdown, display
from pathlib import Path
import seaborn as sns

In [2]:
plt.style.use('fivethirtyeight')

In [3]:
bd = Path("../input/march-machine-learning-mania-2023")

def read_seeds(WM):
    seeds = pd.read_csv(bd / f"{WM}NCAATourneySeeds.csv")
    teams = pd.read_csv(bd / f"{WM}Teams.csv", index_col="TeamID")
    spellings = pd.read_csv(bd / f"{WM}TeamSpellings.csv", encoding="cp1250", index_col="TeamNameSpelling")
    seeds = seeds.join(teams, on="TeamID")
    seeds["Label"] = seeds[["Seed", "TeamName"]].apply(" ".join, axis=1)
    return seeds, teams, spellings

MNCAATourneySeeds, MTeams, MTeamSpellings = read_seeds("M")
WNCAATourneySeeds, WTeams, WTeamSpellings = read_seeds("W")

column = 'pred' # relies on read_sub lowercasing the column names

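def read_sub(filename):
    """Read a submission and split its id column into Year, LTeamID and HTeamID."""
    sub_df = pd.read_csv(filename)
    sub_df.columns = sub_df.columns.str.lower()
    # ids have three underscore-separated integer parts: Year_LTeamID_HTeamID
    parts = sub_df.id.str.split("_")
    parts_df = pd.DataFrame.from_records(parts).astype(int)
    sub_df[["Year", "LTeamID", "HTeamID"]] = parts_df
    # assumes every row of the file belongs to the same season
    year = sub_df.Year.values[0]
    return sub_df, year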

In [4]:
!wget -q https://projects.fivethirtyeight.com/march-madness-api/2023/fivethirtyeight_ncaa_forecasts.csv
!wget -q https://storage.googleapis.com/kaggle-forum-message-attachments/2185061/18828/ExpertsMedianSubmission.csv
!wget -q https://storage.googleapis.com/kaggle-forum-message-attachments/2190790/18864/2023_03_21_09.26.17%20%20scoring%2096%20games%20total.csv
!wget -q https://storage.googleapis.com/kaggle-forum-message-attachments/2200334/18898/2023_03_28_09.13.20%20%20scoring%20120%20games%20total.csv

# Configuration

Fork the notebook and run with a different submission:

In [5]:
sub_name = "Experts"
sub, year = read_sub('ExpertsMedianSubmission.csv')

# e.g.

# sub_name = "It's That Time of the Year Again"
# sub, year = read_sub('../input/it-s-that-time-of-the-year-again/submission.csv')

# sub_name = "(Using 538) Time of the Year Again"
# sub, year = read_sub('../input/using-538-it-s-that-time-of-the-year-again/submission.csv')

# FiveThirtyEight

In [6]:
f38 = pd.read_csv('fivethirtyeight_ncaa_forecasts.csv')
f38.shape

In [7]:
def setup(df):
    df.loc[df.gender=='mens', 'kId'] = df[df.gender=='mens'].team_name.str.lower().map(MTeamSpellings.TeamID)
    df.loc[df.gender=='womens', 'kId'] = df[df.gender=='womens'].team_name.str.lower().map(WTeamSpellings.TeamID)
    df['kId'] = df['kId'].astype(int)
    df['seed_int'] = df.team_seed.str.replace(r'\D+', '', regex=True).astype(int)
    df['year'] = df.forecast_date.str[:4].astype(int)
    cols = ['rd1_win','rd2_win','rd3_win','rd4_win','rd5_win','rd6_win','rd7_win']
    df['stage'] = ((df[cols]>=1) @ (1 << np.arange(len(cols))))
    df['teams_alive'] = df.groupby(['gender', 'forecast_date']).team_alive.transform('sum')

setup(f38)

This helps simplify things, because different forecast dates have different numbers of teams left:

In [8]:
for key, subdf in f38.groupby(['gender', 'forecast_date']):
    print(key, len(subdf), subdf.team_alive.sum())

In [9]:
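def get_f38(src, gender, remaining_teams):
    """Return the rows of `src` for the forecast date with `remaining_teams` teams alive."""
    # iterate over the frame passed in rather than the global f38
    for key, subdf in src.groupby(['gender', 'forecast_date']):
        if key[0] == gender and subdf.team_alive.sum() == remaining_teams:
            return subdf.query('rd1_win>0').copy()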

In [10]:
# get_f38(f38, 'mens', 16)

# TournamentSimulator

In [11]:
%%writefile TournamentSimulator.py
import numpy as np  # lets the written file run on its own, outside this notebook

class TournamentSimulator:
    def __init__(self, probs, seeds, deterministic):
        self.winner = np.zeros(63, dtype=int)
        self.loser  = np.zeros(63, dtype=int)
        #self.path  = np.zeros((63, 2), dtype=int)
        self.matchup_counts = np.zeros((64, 64), dtype=int)
        self.round_counts = np.zeros((64), dtype=int)
        self.last_round_counts = np.zeros((64, 7), dtype=int)
        self.probs = probs
        self.seeds = seeds
        if deterministic:
            self.sampler = lambda n: np.full(n, 0.5) # use this to generate a bracket
        else:
            self.sampler = np.random.random
    #
    def _result(self, a, b):
        self.round_counts[a] += 1
        self.round_counts[b] += 1
        self.matchup_counts[a,b] += 1
        self.matchup_counts[b,a] += 1
        i = self.used
        p = self.samples[i]
        self.used += 1
        #self.path[i] = a, b
        #return (a,b) if self.rand() < self.probs[a,b] else (b,a)
        return (a,b) if p < self.probs[a,b] else (b,a)
    #
    def simR1(self):
        _result = self._result
        self.samples = self.sampler(63)
        self.used = 0
        self.round_counts.fill(0)
        winner, loser, seeds = self.winner, self.loser, self.seeds
        for region in range(4):
            o = region * 8
            r = region * 16
            winner[o+0], loser[o+0] = _result(seeds[r+0], seeds[r+15]) # W01, W16
            winner[o+1], loser[o+1] = _result(seeds[r+1], seeds[r+14]) # W02, W15
            winner[o+2], loser[o+2] = _result(seeds[r+2], seeds[r+13]) # W03, W14
            winner[o+3], loser[o+3] = _result(seeds[r+3], seeds[r+12]) # W04, W13
            winner[o+4], loser[o+4] = _result(seeds[r+4], seeds[r+11]) # W05, W12
            winner[o+5], loser[o+5] = _result(seeds[r+5], seeds[r+10]) # W06, W11
            winner[o+6], loser[o+6] = _result(seeds[r+6], seeds[r+9]) # W07, W10
            winner[o+7], loser[o+7] = _result(seeds[r+7], seeds[r+8]) # W08, W09
    #
    def simR2(self):
        _result = self._result
        winner, loser = self.winner, self.loser
        winner[32], loser[32] = _result(winner[0], winner[7]) # R1W1, R1W8
        winner[33], loser[33] = _result(winner[1], winner[6]) # R1W2, R1W7
        winner[34], loser[34] = _result(winner[2], winner[5]) # R1W3, R1W6
        winner[35], loser[35] = _result(winner[3], winner[4]) # R1W4, R1W5
        #
        winner[36], loser[36] = _result(winner[8], winner[15]) # R1X1, R1X8
        winner[37], loser[37] = _result(winner[9], winner[14]) # R1X2, R1X7
        winner[38], loser[38] = _result(winner[10], winner[13]) # R1X3, R1X6
        winner[39], loser[39] = _result(winner[11], winner[12]) # R1X4, R1X5
        #
        winner[40], loser[40] = _result(winner[16], winner[23]) # R1Y1, R1Y8
        winner[41], loser[41] = _result(winner[17], winner[22]) # R1Y2, R1Y7
        winner[42], loser[42] = _result(winner[18], winner[21]) # R1Y3, R1Y6
        winner[43], loser[43] = _result(winner[19], winner[20]) # R1Y4, R1Y5
        #
        winner[44], loser[44] = _result(winner[24], winner[31]) # R1Z1, R1Z8
        winner[45], loser[45] = _result(winner[25], winner[30]) # R1Z2, R1Z7
        winner[46], loser[46] = _result(winner[26], winner[29]) # R1Z3, R1Z6
        winner[47], loser[47] = _result(winner[27], winner[28]) # R1Z4, R1Z5
    #
    def simR3(self):
        _result = self._result
        winner, loser = self.winner, self.loser
        winner[48], loser[48] = _result(winner[32], winner[35]) # R2W1, R2W4
        winner[49], loser[49] = _result(winner[33], winner[34]) # R2W2, R2W3
        #
        winner[50], loser[50] = _result(winner[36], winner[39]) # R2X1, R2X4
        winner[51], loser[51] = _result(winner[37], winner[38]) # R2X2, R2X3
        #
        winner[52], loser[52] = _result(winner[40], winner[43]) # R2Y1, R2Y4
        winner[53], loser[53] = _result(winner[41], winner[42]) # R2Y2, R2Y3
        #
        winner[54], loser[54] = _result(winner[44], winner[47]) # R2Z1, R2Z4
        winner[55], loser[55] = _result(winner[45], winner[46]) # R2Z2, R2Z3
    #
    def simR4(self):
        _result = self._result
        winner, loser = self.winner, self.loser
        winner[56], loser[56] = _result(winner[48], winner[49]) # R3W1, R3W2
        winner[57], loser[57] = _result(winner[50], winner[51]) # R3X1, R3X2
        winner[58], loser[58] = _result(winner[52], winner[53]) # R3Y1, R3Y2
        winner[59], loser[59] = _result(winner[54], winner[55]) # R3Z1, R3Z2
    #
    def simR5(self):
        _result = self._result
        winner, loser = self.winner, self.loser
        winner[60], loser[60] = _result(winner[56], winner[57]) # R4W1, R4X1
        winner[61], loser[61] = _result(winner[58], winner[59]) # R4Y1, R4Z1
    #
    def simR6(self):
        _result = self._result
        self.winner[62], self.loser[62] = _result(self.winner[60], self.winner[61]) # R5WX, R5YZ
    #
    def simulate(self):
        self.simR1()
        self.simR2()
        self.simR3()
        self.simR4()
        self.simR5()
        self.simR6()
        winner = self.winner[-1]
        self.round_counts[winner] += 1 # round 7
        self.last_round_counts[self.seeds, self.round_counts-1] += 1
        return winner

In [12]:
%run -i TournamentSimulator.py

In [13]:
def make_grid(seeds, sub):
    _, team_idx = pd.factorize(seeds.TeamID)
    id2ind = {t: i for i, t in enumerate(team_idx)}
    sub["t1"] = sub.LTeamID.map(id2ind)
    sub["t2"] = sub.HTeamID.map(id2ind)
    sub_df = sub.dropna().copy()
    sub_df[['t1','t2']] = sub_df[['t1','t2']].astype(int)
    grid = np.zeros((len(seeds), len(seeds)))
    grid[sub_df.t1, sub_df.t2] = sub_df[column]
    grid[sub_df.t2, sub_df.t1] = 1 - sub_df[column]
    return grid, id2ind, team_idx

def update_grid(truth, id2ind, grid, verbose=True):
    truth["t1"] = truth.LTeamID.map(id2ind)
    truth["t2"] = truth.HTeamID.map(id2ind)
    truth_df = truth.dropna().copy()
    truth_df[['t1','t2']] = truth_df[['t1','t2']].astype(int)
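    # keep rows with a non-negative value (negative values presumably mark games without a result)
    truth_df = truth_df[truth_df[column] >= 0]
    if verbose:
        print(truth_df[column].value_counts())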
    grid[truth_df.t1, truth_df.t2] = truth_df[column]
    grid[truth_df.t2, truth_df.t1] = 1 - truth_df[column]

def add_label(df):
    df = df.copy()
    df.insert(0, "Label", seeds.set_index("TeamID").Label)
    return df

def convert_results_to_cumulative(df):
    # reverse the columns so the running sum turns "won exactly k" into "won at least k"
    df = df.iloc[:, ::-1].cumsum(axis=1)
    df = df.drop('won_0', axis=1)
    df = df.sort_values(list(df.columns), ascending=False)
    df = df.iloc[:, ::-1]
    return df

colors = {"W":"#3498db", "X":"#1abc9c", "Y":"#f39c12", "Z":"#9b59b6", }

def style_region(r):
    return 'background-color:' + colors[r[0]]

def style_df_basic(df):
    right_align_cols = df.columns.drop('Label')
    return (df.style.format(precision=0, thousands=',')
            .applymap(style_region, subset=['Label'])
            .set_properties(**{'text-align': 'right'}, subset=right_align_cols))

def style_df(df):
    return style_df_basic(df).background_gradient()

def plot_matchup_counts(ts, tag=''):
    fig, ax = plt.subplots(1, 2, figsize=(14, 7))  
    sns.heatmap((ts.matchup_counts), square=True, robust=True, ax=ax[0])
    sns.heatmap(np.log1p(ts.matchup_counts), square=True, ax=ax[1])
    ax[0].set_title(f"{tag} Match Counts")
    ax[1].set_title(f"{tag} Log Match Counts")
    plt.tight_layout()

In [14]:
sub_code = "M"
f38_type = "mens"

seeds = MNCAATourneySeeds.query("Season==" + str(year))
seeds = seeds[~seeds.Seed.isin('W16b X16a Y11a Z11b'.split())].copy() # lost in the First Four play-in games

In [15]:
grid, id2ind, team_idx = make_grid(seeds, sub)

In [16]:
Markdown(f"# [{sub_code}] {sub_name} *Bracket* Outcome")

In [17]:
ts = TournamentSimulator(grid, np.arange(64), deterministic=True)

for i in range(1):
    ts.simulate()

In [18]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = convert_results_to_cumulative(result_stats)
style_df(add_label(df))

In [19]:
Markdown(f"# [{sub_code}] {sub_name} Simulate From Start")

In [20]:
%%time
ts = TournamentSimulator(grid, np.arange(64), deterministic=False)

N_SIMS = 100_000

for i in range(N_SIMS):
    ts.simulate()

This truly is a heatmap: it shows how many times each matchup occurred across the simulations.

Reusing this guide from the [2017 tournament post](https://www.kaggle.com/c/march-machine-learning-mania-2017/discussion/30333), you can see the first round in red and the final game as the deep blue top-right (and mirrored bottom-left) quadrants.
First-round matches occur in every simulation, but later rounds feature rarer matches between whichever teams survived.

<img width=500 height=500 src="https://storage.googleapis.com/kaggle-forum-message-attachments/168920/6110/game-rounds-heatmap.png">
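
To see why first-round cells are the hottest, here is a minimal sketch on a hypothetical 4-team bracket with coin-flip probabilities. The names `toy_counts`, `play` and `n_runs` are made up for this illustration; the real counting happens in `TournamentSimulator._result`.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only to make the sketch repeatable

# Hypothetical 4-team bracket: semi-finals are 0 v 3 and 1 v 2, then a final.
toy_probs = np.full((4, 4), 0.5)
toy_counts = np.zeros((4, 4), dtype=int)
n_runs = 1_000

def play(a, b):
    # Record the pairing symmetrically, then draw the winner.
    toy_counts[a, b] += 1
    toy_counts[b, a] += 1
    return a if rng.random() < toy_probs[a, b] else b

for _ in range(n_runs):
    play(play(0, 3), play(1, 2))

print(toy_counts)
```

The two semi-final cells are hit in every run, while the four possible finals share the same number of hits between them, which is the pattern the guide image shows on a larger scale.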

In [21]:
plot_matchup_counts(ts)

In [22]:
forecast = get_f38(f38, f38_type, remaining_teams=64)

In [23]:
Markdown(f"# [{sub_code}] {sub_name} Round Win Counts")

These are simply counts of how many times each team won exactly that many matches. It is an interesting view: it shows, for example, that Houston have about the same chance of winning just 3 games as of winning the championship (all 6).

With 100,000 simulations, you can read the values as percentages by treating the thousands comma as a decimal point.
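
The comma trick is just a division by the number of runs. A tiny worked example, where `toy_count` is a hypothetical cell value rather than a real result:

```python
N_RUNS = 100_000     # same number of simulations as above
toy_count = 12_345   # hypothetical cell value, displayed as "12,345"

# count / runs gives a probability; times 100 gives a percentage
percent = toy_count / N_RUNS * 100
print(f"{toy_count:,} of {N_RUNS:,} runs = {percent:.3f}%")
```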

In [24]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = (result_stats)
df = df.sort_values(list(df.columns[::-1]), ascending=False)
style_df(add_label(df))

In [25]:
Markdown(f"# [{sub_code}] {sub_name} Cumulative Round Win Counts")

The *won_1*, *won_2*, etc. columns come from the simulations above; the *rd2_win*, etc. columns are fivethirtyeight's probabilities, scaled to the same range by multiplying by the number of simulations.
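
The step from "won exactly k games" to "won at least k games" is a reversed running sum. The sketch below uses made-up counts for a hypothetical 4-team, 2-round bracket over 10 runs; `exact` and `at_least` are illustrative names only.

```python
import pandas as pd

# won_k = number of runs in which the team won exactly k games (hypothetical data)
exact = pd.DataFrame(
    {"won_0": [3, 6, 5, 6], "won_1": [3, 3, 2, 2], "won_2": [4, 1, 3, 2]},
    index=["A", "B", "C", "D"],
)

# Reverse the columns, take a running sum, then restore the order:
# the result counts runs with AT LEAST k wins.
at_least = exact.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1].drop(columns="won_0")

print(at_least)
```

Team A, for instance, won at least one game in 3 + 4 = 7 of the 10 runs. These "at least k" counts are what the notebook lines up against the *rd{k+1}_win* columns.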

In [26]:
df = convert_results_to_cumulative(result_stats)
df = df.join(forecast[['kId','rd2_win','rd3_win','rd4_win','rd5_win','rd6_win','rd7_win']].set_index('kId')*N_SIMS)
style_df(add_label(df))

In [27]:
Markdown(f"# [{sub_code}] {sub_name} Plot 538 vs Sim")

In [28]:
fig, ax = plt.subplots(2, 3, figsize=(14,7))
for i, r in enumerate(range(1, 7)):
    df.plot.scatter(f'rd{r+1}_win', f'won_{r}', ax=ax.ravel()[i]);
plt.tight_layout();

In [29]:
Markdown(f"# [{sub_code}] {sub_name} Plot 538 vs Sim (2)")

The blue points are from simulations above and the red are fivethirtyeight's probabilities, scaled to the same range.

In [30]:
fig, ax = plt.subplots(2, 3, figsize=(14,7))
for i, r in enumerate(range(1, 7)):
    ax.ravel()[i].scatter(np.arange(len(df)), df[f'won_{r}'], alpha=.4);
    ax.ravel()[i].scatter(np.arange(len(df)), df[f'rd{r+1}_win'], alpha=.4);
    ax.ravel()[i].set_ylabel(f'R{r} count')
    ax.ravel()[i].set_xlabel('team')
# plt.title('Calibration');
plt.tight_layout();

In [31]:
stage_name = "Sweet 16"
remaining_teams = 16
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Onwards")

In [32]:
truth, _ = read_sub('2023_03_21_09.26.17  scoring 96 games total.csv')

In [33]:
# overrides entries in the 2D array 'grid' passed in with results of games
update_grid(truth, id2ind, grid)

In [34]:
%%time
ts = TournamentSimulator(grid, np.arange(64), deterministic=False)

for i in range(N_SIMS):
    ts.simulate()

In [35]:
plot_matchup_counts(ts, f'{stage_name} -')

In [36]:
forecast = get_f38(f38, f38_type, remaining_teams=remaining_teams)

In [37]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = convert_results_to_cumulative(result_stats)
df = df.join(forecast[['kId','rd2_win','rd3_win','rd4_win','rd5_win','rd6_win','rd7_win']].set_index('kId')*N_SIMS)
(style_df_basic(add_label(df).query('won_2>0'))
  .background_gradient(cmap='seismic', vmin=0, vmax=N_SIMS))

In [38]:
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Likelihood Ratio")

This is the ratio of how often the submission projects each team to reach the later stages to how often fivethirtyeight does; a value above 1 means the submission rates the team's chances more highly (compare with the table above for details).
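
As a small worked example of the ratio, with made-up teams and numbers (`toy` and `N_RUNS` are illustrative names, not part of the notebook):

```python
import pandas as pd

N_RUNS = 100_000

# Hypothetical values: simulated counts alongside forecast probabilities.
toy = pd.DataFrame(
    {"won_3": [40_000, 10_000], "rd4_win": [0.25, 0.20]},
    index=["Team A", "Team B"],
)

# Scale the forecast probability to a count, then divide.
toy["ratio_3"] = toy["won_3"] / (toy["rd4_win"] * N_RUNS)

print(toy)
```

Here Team A comes out at 1.6 (the simulation is more optimistic than the forecast) and Team B at 0.5 (less optimistic). A forecast probability of zero would make the ratio infinite or undefined, so such rows need care when reading the table.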

In [39]:
cols1 = ['won_3','won_4','won_5','won_6']
cols2 = ['rd4_win','rd5_win','rd6_win','rd7_win']
ratios = df[cols1].divide(df[cols2].values, axis=0)
ratios.columns = ratios.columns.str.replace('won_', 'ratio_')
rcols = list(ratios.columns)
ratios = df[cols1].astype(int).join(ratios)
(add_label(ratios).query('ratio_3>0').style.format(thousands=',', precision=2)
  .background_gradient(subset=cols1, cmap='seismic', vmin=0, vmax=N_SIMS)
  .background_gradient(subset=rcols)
  .applymap(style_region, subset=['Label']))

In [40]:
stage_name = "Final 4"
remaining_teams = 4
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Onwards")

In [41]:
truth, _ = read_sub('2023_03_28_09.13.20  scoring 120 games total.csv')

In [42]:
# overrides entries in the 2D array 'grid' passed in with results of games
update_grid(truth, id2ind, grid)

In [43]:
%%time
ts = TournamentSimulator(grid, np.arange(64), deterministic=False)

for i in range(N_SIMS):
    ts.simulate()

In [44]:
plot_matchup_counts(ts, f'{stage_name} -')

In [45]:
forecast = get_f38(f38, f38_type, remaining_teams=remaining_teams)

In [46]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = convert_results_to_cumulative(result_stats)
df = df.join(forecast[['kId','rd2_win','rd3_win','rd4_win','rd5_win','rd6_win','rd7_win']].set_index('kId')*N_SIMS)
(style_df_basic(add_label(df).query('won_4>0'))
  .background_gradient(cmap='seismic', vmin=0, vmax=N_SIMS))

In [47]:
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Likelihood Ratio")

This is the ratio of how often the submission projects each team to reach the later stages to how often fivethirtyeight does; a value above 1 means the submission rates the team's chances more highly (compare with the table above for details).

In [48]:
cols1 = ['won_5','won_6']
cols2 = ['rd6_win','rd7_win']
ratios = df[cols1].divide(df[cols2].values, axis=0)
ratios.columns = ratios.columns.str.replace('won_', 'ratio_')
rcols = list(ratios.columns)
ratios = df[cols1].astype(int).join(ratios)
(add_label(ratios).query('ratio_5>0').style.format(thousands=',', precision=2)
  .background_gradient(subset=cols1, cmap='seismic', vmin=0, vmax=N_SIMS)
  .background_gradient(subset=rcols)
  .applymap(style_region, subset=['Label']))

In [49]:
sub_code = "W"
f38_type = "womens"

seeds = WNCAATourneySeeds.query("Season==" + str(year))
seeds = seeds[~seeds.Seed.isin('W11a X16b Y16a Z11a'.split())].copy() # lost in the First Four play-in games

In [50]:
grid, id2ind, team_idx = make_grid(seeds, sub)

In [51]:
Markdown(f"# [{sub_code}] {sub_name} *Bracket* Outcome")

In [52]:
ts = TournamentSimulator(grid, np.arange(64), deterministic=True)

for i in range(1):
    ts.simulate()

In [53]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = convert_results_to_cumulative(result_stats)
style_df(add_label(df))

In [54]:
Markdown(f"# [{sub_code}] {sub_name} Simulate From Start")

In [55]:
%%time
ts = TournamentSimulator(grid, np.arange(64), deterministic=False)

N_SIMS = 100_000

for i in range(N_SIMS):
    ts.simulate()

In [56]:
plot_matchup_counts(ts)

In [57]:
forecast = get_f38(f38, f38_type, remaining_teams=64)

In [58]:
Markdown(f"# [{sub_code}] {sub_name} Round Win Counts")

These are simply counts of how many times each team won exactly that many matches.

In [59]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = (result_stats)
df = df.sort_values(list(df.columns[::-1]), ascending=False)
style_df(add_label(df))

In [60]:
Markdown(f"# [{sub_code}] {sub_name} Cumulative Round Win Counts")

The *won_1*, *won_2*, etc. columns come from the simulations above; the *rd2_win*, etc. columns are fivethirtyeight's probabilities, scaled to the same range by multiplying by the number of simulations.

In [61]:
df = convert_results_to_cumulative(result_stats)
df = df.join(forecast[['kId','rd2_win','rd3_win','rd4_win','rd5_win','rd6_win','rd7_win']].set_index('kId')*N_SIMS)
style_df(add_label(df))

In [62]:
Markdown(f"# [{sub_code}] {sub_name} Plot 538 vs Sim")

In [63]:
fig, ax = plt.subplots(2, 3, figsize=(14,7))
for i, r in enumerate(range(1, 7)):
    df.plot.scatter(f'rd{r+1}_win', f'won_{r}', ax=ax.ravel()[i]);
plt.tight_layout();

In [64]:
Markdown(f"# [{sub_code}] {sub_name} Plot 538 vs Sim (2)")

The blue points are from simulations above and the red are fivethirtyeight's probabilities, scaled to the same range.

In [65]:
fig, ax = plt.subplots(2, 3, figsize=(14,7))
for i, r in enumerate(range(1, 7)):
    ax.ravel()[i].scatter(np.arange(len(df)), df[f'won_{r}'], alpha=.4);
    ax.ravel()[i].scatter(np.arange(len(df)), df[f'rd{r+1}_win'], alpha=.4);
    ax.ravel()[i].set_ylabel(f'R{r} count')
    ax.ravel()[i].set_xlabel('team')
# plt.title('Calibration');
plt.tight_layout();

In [66]:
stage_name = "Sweet 16"
remaining_teams = 16
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Onwards")

In [67]:
truth, _ = read_sub('2023_03_21_09.26.17  scoring 96 games total.csv')

In [68]:
# overrides entries in the 2D array 'grid' passed in with results of games
update_grid(truth, id2ind, grid)

In [69]:
%%time
ts = TournamentSimulator(grid, np.arange(64), deterministic=False)

for i in range(N_SIMS):
    ts.simulate()

In [70]:
plot_matchup_counts(ts, f'{stage_name} -')

In [71]:
forecast = get_f38(f38, f38_type, remaining_teams=remaining_teams)

In [72]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = convert_results_to_cumulative(result_stats)
df = df.join(forecast[['kId','rd2_win','rd3_win','rd4_win','rd5_win','rd6_win','rd7_win']].set_index('kId')*N_SIMS)
(style_df_basic(add_label(df).query('won_2>0'))
  .background_gradient(cmap='seismic', vmin=0, vmax=N_SIMS))

In [73]:
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Likelihood Ratio")

This is the ratio of how often the submission projects each team to reach the later stages to how often fivethirtyeight does; a value above 1 means the submission rates the team's chances more highly (compare with the table above for details).

In [74]:
cols1 = ['won_3','won_4','won_5','won_6']
cols2 = ['rd4_win','rd5_win','rd6_win','rd7_win']
ratios = df[cols1].divide(df[cols2].values, axis=0)
ratios.columns = ratios.columns.str.replace('won_', 'ratio_')
rcols = list(ratios.columns)
ratios = df[cols1].astype(int).join(ratios)
(add_label(ratios).query('ratio_3>0').style.format(thousands=',', precision=2)
  .background_gradient(subset=cols1, cmap='seismic', vmin=0, vmax=N_SIMS)
  .background_gradient(subset=rcols)
  .applymap(style_region, subset=['Label']))

In [75]:
stage_name = "Final 4"
remaining_teams = 4
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Onwards")

In [76]:
truth, _ = read_sub('2023_03_28_09.13.20  scoring 120 games total.csv')

In [77]:
# overrides entries in the 2D array 'grid' passed in with results of games
update_grid(truth, id2ind, grid)

In [78]:
%%time
ts = TournamentSimulator(grid, np.arange(64), deterministic=False)

for i in range(N_SIMS):
    ts.simulate()

In [79]:
plot_matchup_counts(ts, f'{stage_name} -')

In [80]:
forecast = get_f38(f38, f38_type, remaining_teams=remaining_teams)

In [81]:
result_stats = pd.DataFrame(ts.last_round_counts, index=seeds.TeamID).add_prefix('won_')
df = convert_results_to_cumulative(result_stats)
df = df.join(forecast[['kId','rd2_win','rd3_win','rd4_win','rd5_win','rd6_win','rd7_win']].set_index('kId')*N_SIMS)
(style_df_basic(add_label(df).query('won_4>0'))
  .background_gradient(cmap='seismic', vmin=0, vmax=N_SIMS))

In [82]:
Markdown(f"# [{sub_code}] {sub_name} {stage_name} Likelihood Ratio")

This is the ratio of how often the submission projects each team to reach the later stages to how often fivethirtyeight does; a value above 1 means the submission rates the team's chances more highly (compare with the table above for details).

In [83]:
cols1 = ['won_5','won_6']
cols2 = ['rd6_win','rd7_win']
ratios = df[cols1].divide(df[cols2].values, axis=0)
ratios.columns = ratios.columns.str.replace('won_', 'ratio_')
rcols = list(ratios.columns)
ratios = df[cols1].astype(int).join(ratios)
(add_label(ratios).query('ratio_5>0').style.format(thousands=',', precision=2)
  .background_gradient(subset=cols1, cmap='seismic', vmin=0, vmax=N_SIMS)
  .background_gradient(subset=rcols)
  .applymap(style_region, subset=['Label']))

In [84]:
HTML(r"""
<script>
function style_headers(h) {
    for (i = 0; i<h.length; i++) {
        txt = h[i].textContent.toString();
        if (txt.indexOf("[M]") >= 0) {
            h[i].style.background = '#6C4E97';
        }
        else if (txt.indexOf("[W]") >= 0) {
            h[i].style.background = '#D4855A';
        }
        else {
            h[i].style.background = '#404040';
        }
        h[i].style.color = '#d0d0d0';
        h[i].style.padding = '15px';
        h[i].style.borderRadius = '15px'; 
    }
}
style_headers(document.getElementsByTagName("H1"));
style_headers(document.getElementsByTagName("H2"));
h = document.getElementsByTagName("H1");
src = '';
for (i = 0; i<h.length; i++) {
    t = h[i].textContent.toString();
    src += "<li><a href=#" + h[i]['id'] + ">" + t.replace('¶', '') + "</a>";
}
tag = document.getElementById("contents_list")
tag.innerHTML = "<ul>" + src + "</ul>"
</script>
""")