In [1]:
import numpy as np
import pandas as pd

from data import *
from models import *
from simulation import *

**Introduction**

Every March, Americans gather around their television sets for one of the most exciting playoffs in all of sports: the NCAA Division I Men's Basketball Tournament, better known as March Madness. The "Madness" part of the colloquial name comes from the large field and single-elimination structure of the tournament, in which lightly regarded teams often make deep runs, knocking off favorites as they advance. In the weeks leading up to the tournament each year, millions of fans try to predict the madness, picking winners for each of the tournament's 63+ games in the hope of becoming the first to fill out a perfect bracket and win a large sum.

The most obvious strategy for picking winners is to simply pick the higher-seeded team whenever possible, flipping a coin when two equally seeded teams go head-to-head. Not only is this not a very fun strategy, but history has also shown that the best brackets deviate from it, often quite dramatically. My goal with this project is to beat the boring-but-obvious method and, in the process, determine which factors and stats drive team success in March Madness, thus shedding a little light on the "Madness".

**Materials & Methods**

To determine which factors contribute to a team's success in March Madness, I ran variable selection on a logistic regression model that included a wide range of team statistics. The dependent variable for the regression was game outcome (either win or lose). The independent variables appear below, along with a short description of each. Note that each one represents the difference between one team's stat and its opponent's. For example, <i>SeedDif</i> is the difference between the two teams' tournament seeds, which can be negative:

In [2]:
for name, desc in zip(xVariables, xVariablesDesc):
    print(name, ': ', desc)

SeedDif :  Difference in seeds
RecordDif :  Difference in record
PtsPGDif :  Difference in points per game
PtsPGDifDif :  Difference in points per game differential
TrueShtPercDif :  Difference in true shooting percentage
ORPGDif :  Difference in offensive rebounds per game
DRPGDif :  Difference in defensive rebounds per game
AstPGDif :  Difference in assists per game
StlPGDif :  Difference in steals per game
BlkPGDif :  Difference in blocks per game
TOPGDif :  Difference in turnovers per game
ATRDif :  Difference in assist to turnover ratio
PFPGDif :  Difference in personal fouls per game
FTAPGDif :  Difference in free throw attempts per game
DefMetricDif :  Difference in defensive metric (combination of blk, stl, DR, and PF) per game
ConfAppDif :  Difference in conference appearances
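
To make the "difference" construction concrete, here is a minimal sketch of how one matchup row could be built from two teams' season stats. This is not the code from the `data` module; the function name `matchup_features` and the example stat keys are hypothetical and chosen only for illustration.

```python
def matchup_features(team, opponent, stats):
    """Build '<stat>Dif' features as team value minus opponent value.

    `team` and `opponent` are dicts of season stats (hypothetical layout).
    """
    return {f"{s}Dif": team[s] - opponent[s] for s in stats}


# Example: a 3 seed against an 11 seed (made-up numbers)
team = {"Seed": 3, "PtsPG": 78.5}
opponent = {"Seed": 11, "PtsPG": 71.0}

row = matchup_features(team, opponent, ["Seed", "PtsPG"])
print(row)  # {'SeedDif': -8, 'PtsPGDif': 7.5}
```

Swapping the two teams flips the sign of every feature, which is why a value such as SeedDif can be negative.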


The variable selection used AIC (the Akaike information criterion) to compare models, and ran until improvement was no longer significant. Because the train/test split is random and the sample is relatively small (1,115 games played from 2003 to 2019), I ran the variable selection 1,000 times and recorded the frequency with which each variable was selected.
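
The repeated-selection idea can be sketched as follows. This is an illustration under several assumptions, not the implementation of `logisticSelectMulti`: I assume greedy forward selection that stops as soon as no candidate lowers AIC, and I abstract the model fit behind a hypothetical callable that returns the log-likelihood for a given list of variables. The names `forward_select`, `selection_frequency`, and `make_fitter` are all made up for this sketch.

```python
import random
from collections import Counter


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def forward_select(candidates, fit_log_likelihood):
    """Greedy forward selection (assumed variant).

    `fit_log_likelihood(variables)` is a hypothetical callable that fits a
    logistic regression on `variables` and returns its log-likelihood.
    """
    selected = []
    remaining = list(candidates)
    best = aic(fit_log_likelihood(selected), 1)  # intercept-only model
    while remaining:
        # Score every one-variable extension (+2 = intercept + new variable)
        scores = [
            (aic(fit_log_likelihood(selected + [v]), len(selected) + 2), v)
            for v in remaining
        ]
        score, var = min(scores)
        if score >= best:  # no candidate improves AIC: stop
            break
        selected.append(var)
        remaining.remove(var)
        best = score
    return selected


def selection_frequency(candidates, make_fitter, n_runs, seed=0):
    """Repeat selection on fresh random splits and tally how often each variable is kept."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        # make_fitter(rng) is assumed to draw a new train split and
        # return a fit_log_likelihood callable for that split
        counts.update(forward_select(candidates, make_fitter(rng)))
    return {v: counts[v] / n_runs for v in candidates}
```

Each variable adds 2 to the AIC penalty, so it is kept only if it raises the log-likelihood by more than 1. Because every run sees a different split, a borderline variable is selected in some runs and dropped in others, which is what the selection frequency measures.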

Using the results of the variable selection, I created several models using combinations of the variables and compared their predictions for historical tournament games to the actual outcomes of those games. I then compared the success rate of the models to the success rate of the "higher-seed-always-wins" technique.
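
For reference, the seed-based baseline and the accuracy comparison can be sketched in a few lines. This is a simplified stand-in, not the `highSeedWins` or `tourneySimVsActual` functions from the `simulation` module; the game layout (`seed_a`, `seed_b`, `winner`) is a hypothetical one chosen for the example.

```python
import random


def high_seed_wins(seed_a, seed_b, rng=random):
    """Pick the better (numerically lower) seed; flip a coin on equal seeds."""
    if seed_a < seed_b:
        return "A"
    if seed_b < seed_a:
        return "B"
    return rng.choice(["A", "B"])


def accuracy(games, predict):
    """Fraction of games in which the predicted winner matches the actual one."""
    correct = sum(predict(g["seed_a"], g["seed_b"]) == g["winner"] for g in games)
    return correct / len(games)


# Made-up games: one upset (the 12 seed beats the 5 seed)
games = [
    {"seed_a": 1, "seed_b": 16, "winner": "A"},
    {"seed_a": 5, "seed_b": 12, "winner": "B"},
    {"seed_a": 8, "seed_b": 9, "winner": "A"},
    {"seed_a": 2, "seed_b": 7, "winner": "A"},
]
print(accuracy(games, high_seed_wins))  # 0.75
```

A regression-based predictor can be scored with the same `accuracy` helper by passing a different `predict` function, which keeps the comparison against the baseline like-for-like.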

**Results**

The (sorted) results of the variable selection runs appear below (the number next to each variable is the fraction of the 1,000 runs in which it was selected):

In [6]:
freq = logisticSelectMulti(yVariable, xVariables, logRegDF, 1000)

In [10]:
# Pair each frequency with its variable, then sort from most to least frequently selected
ranked = sorted(zip(freq, xVariables), reverse=True)

for f, var in ranked:
    print(var, f)

DRPGDif 0.58
StlPGDif 0.573
FTAPGDif 0.562
AstPGDif 0.554
TrueShtPercDif 0.543
PtsPGDif 0.541
DefMetricDif 0.526
ATRDif 0.519
PFPGDif 0.516
TOPGDif 0.515
BlkPGDif 0.509
ORPGDif 0.497
RecordDif 0.426
ConfAppDif 0.373
PtsPGDifDif 0.367
SeedDif 0.284


Using the results from the variable selection, as well as my own knowledge about the sport, I created several potential logistic regression models and tested each against the "better-seed-wins" strategy. The results appear below:

In [4]:
years = list(range(2003, 2020))  # tournaments from 2003 through 2019

highSeedWinsAcc = tourneySimVsActual(years, highSeedWins, masterCompact)
fullModelAcc = tourneySimVsActual(years, logRegPredictFull, masterDetailed)
fullModelAccJr = tourneySimVsActual(years, logRegPredictFullJr, masterDetailed)

print('Better seed wins (baseline): ', highSeedWinsAcc)
print('Full model (using all variables): ', fullModelAcc)
print('Best partial model (best result using fewest variables): ', fullModelAccJr)

Better seed wins (baseline):  0.6313901345291479
Full model (using all variables):  0.6556053811659193
Best partial model (best result using fewest variables):  0.6502242152466368


The winner-picking success rate of the baseline, seed-based model is ~63% (this moves around a little due to the randomness of the "flip a coin" tie-breaker). The full model, using all the variables from above, predicts games at 65.56% accuracy, a gain of about 2.4 percentage points over the baseline, or 27 more correct picks across the 1,115 games in this run. Arguably the most useful model is the best-performing partial model, which uses 12 of the 16 variables in the full model but retains almost all of its accuracy, at 65.02%. This model includes the following variables, which seem to be the most useful for predicting tournament success:

In [2]:
for i in chosenFeatures:
    print(i)

NumSeedDif
RecordDif
PtsPGDif
PtsPGDifDif
TrueShtPercDif
ORPGDif
DRPGDif
AstPGDif
StlPGDif
BlkPGDif
TOPGDif
ATRDif


With the most important statistical elements of team success identified, the "Madness" has receded a bit, and hopefully more enjoyment and understanding can be found in both predicting and watching the annual tournament as a result.