NCAA 2019
===

Two logistic regressions are ensembled to predict a March Madness bracket: the first is trained on team stats, and the second is trained on Vegas moneylines/spreads.

Underpinning
---

Before training the models, the necessary libraries and data are imported, helper methods are defined, and data is assembled into convenient data structures.

Import required libraries

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

Create helpers for importing CSV datasets and for translating team names into one standardized name across datasets.

In [3]:
def from_csv(path):
    csv = pd.read_csv(path)
    return np.asarray(csv)

translator = {
    "Albany": "Albany NY",
    "Massachusetts Lowell": "Massachusetts",
    "MD Baltimore Cty": "Maryland",
    "MD Baltimore County": "Maryland",
    "NC-Wilmington": "UNC Wilmington",
    "N.C. State": "NC State",
    "N.C. St": "NC State",
    "NC St": "NC State",
    "N Carolina": "North Carolina",
    "Southern Methodist": "SMU",
    "Monmouth-NJ": "Monmouth NJ",
    "Loyola-Maryland": "Loyola MD",
    "UMKC": "Missouri KC",
    "Kent St": "Kent",
    "LIU Brooklyn": "Long Island",
    "Wis.-Milwaukee": "WI Milwaukee",
    "Boston U": "Boston Univ",
    "Idaho State": "Idaho St",
    "Stephen F. Austin": "SF Austin",
    "Mount St Mary's": "Mt St Mary's",
    "St Mary's": "St Mary's CA",
    "Central Florida": "UCF",
    "Saint Louis": "St Louis",
    "Southern California": "USC",
    "WVirginia": "West Virginia",
    "Brigham Young": "BYU",
    "Western Kentucky": "WKU",
    "VCU": "VA Commonwealth",
    "Southern Illinois": "S Illinois",
    "St Joseph's": "St Joseph's PA",
    "S Carolina": "South Carolina",
    "Miami-Florida": "Miami FL",
    "Miami (OH)": "Miami OH",
    "N Iowa": "Northern Iowa",
    "Western Michigan": "W Michigan",
    "Louisiana St": "LSU",
    "UC Santa Barbara": "Santa Barbara",
    "Illinois-Chicago": "IL Chicago",
    "S Florida": "South Florida",
    "George Washington": "G Washington",
    "Middle Tennessee St": "MTSU",
    "Loyola Marymount": "Loy Marymount",
    "Texas Christian": "TCU",
    "Eastern Michigan": "E Michigan",
    "NC-Greensboro": "UNC Greensboro",
    "Texas-El Paso": "UTEP",
    "Green Bay": "WI Green Bay",
    "Eastern Washington": "E Washington",
    "Central Michigan": "C Michigan",
    "Cal St Fullerton": "CS Fullerton",
    "S Alabama": "South Alabama",
    "CSU Northridge": "CS Northridge",
    "Western Carolina": "W Carolina",
    "N Arizona": "Northern Arizona",
    "Eastern Illinois": "E Illinois",
    "Pennsylvania": "Penn",
    "Georgia Southern": "Ga Southern",
    "Eastern Kentucky": "E Kentucky",
    "N Texas": "North Texas",
    "Sacramento St": "CS Sacramento",
    "Tenn-Martin": "TN Martin",
    "E Carolina": "East Carolina",
    "Florida International": "Florida Intl",
    "Elon University": "Elon",
    "Louisiana-Lafayette": "Lafayette",
    "ULL": "Lafayette",
    "Florida Atlantic": "FL Atlantic",
    "Indiana - Purdue": "Purdue",
    "E Tennessee St": "ETSU",
    "Western Illinois": "W Illinois",
    "Texas-Arlington": "UT Arlington",
    "S Dakota": "South Dakota",
    "N Dakota": "North Dakota",
    "IUPU - Ft. Wayne": "Purdue Fort Wayne",
    "SIU - Edwardsville": "Edwardsville",
    "The Citadel": "Citadel",
    "No.Carolina A&T": "NC A&T",
    "Texas-San Antonio": "UT San Antonio",
    "Nebraska Omaha": "NE Omaha",
    "Coastal Carolina": "Coastal Car",
    "Little Rock": "Ark Little Rock",
    "Arkansas-Little Rock": "Ark Little Rock",
}
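
The lookup pattern used with this dictionary later in the notebook is `translator.get(name, name)`: a known alias is replaced, and any other name passes through unchanged. A minimal sketch of that behavior, using a hypothetical two-entry copy of the dictionary and a hypothetical `standardize` helper:

```python
# Hypothetical miniature of the translator dict, for illustration only.
sample_translator = {
    "N.C. State": "NC State",
    "Brigham Young": "BYU",
}

def standardize(name, mapping=sample_translator):
    """Return the standardized name, or the name unchanged if no alias exists."""
    return mapping.get(name, name)

print(standardize("N.C. State"))  # NC State
print(standardize("Gonzaga"))     # Gonzaga
```

Falling back to the input name means an unmapped team never raises here; a mismatch only surfaces later, when the name is missing from the stats lookup.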

Load in historical teams and map the Kaggle ID to a standardized team name

In [4]:
teams_csv = from_csv('./data/teams.csv')

team_ids = {}

for index in range(0, len(teams_csv)):
    team = teams_csv[index]
    team_id = team[0]
    team_name = team[1]
    team_ids[team_id] = team_name

# allows us to map a team id to its name: team_ids[1101] => Abilene Chr

Create a helpful intermediate data structure that allows us to quickly access stats for a given team in a given year

In [5]:
# team name, adj offensive efficiency, adj defensive efficiency, adj tempo, luck, adj margin of efficiency, 
# strength of schedule margin, avg opponent offense, avg opponent defense, non-conference, conf, year
kenpom_csv = from_csv("./data/kenpom.csv")

# transform conf to category
conference_ids = {}
unique_confs = np.unique(kenpom_csv[:, 10])
for index in range(0, len(unique_confs)):
    conference_ids[unique_confs[index]] = index

# year > team > stats
kenpom_team_stats_by_year = {}

for index in range(0, len(kenpom_csv)):
    team, offense, defense, tempo, luck, em, sos, oppoff, oppdef, nonconf, conf, year = kenpom_csv[index]
    
    if (not isinstance(kenpom_team_stats_by_year.get(year), dict)):
        kenpom_team_stats_by_year[year] = {}
    
    kenpom_team_stats_by_year[year][team] = [offense, defense, tempo, luck, em, sos, oppoff, oppdef, nonconf, conference_ids[conf]]
    
# kenpom_team_stats_by_year[2002]['Kent']
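
The year > team > stats nesting can also be built with `dict.setdefault`, which creates the inner dictionary the first time a year is seen. The sketch below uses made-up rows shortened to two stats, and the `lookup` helper is hypothetical; it only illustrates the shape of the structure.

```python
# Hypothetical rows in a (team, offense, defense, year) layout; values are made up.
rows = [
    ("Kent", 110.2, 95.1, 2002),
    ("Duke", 121.0, 90.3, 2002),
    ("Kent", 108.7, 97.4, 2003),
]

stats_by_year = {}
for team, offense, defense, year in rows:
    # setdefault returns the existing inner dict, or inserts and returns a new one
    stats_by_year.setdefault(year, {})[team] = [offense, defense]

def lookup(year, team):
    """Return the stats list, or None when the year or the team is missing."""
    return stats_by_year.get(year, {}).get(team)

print(lookup(2002, "Kent"))  # [110.2, 95.1]
print(lookup(2004, "Kent"))  # None
```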


Load in the full game history since 2003 and create a helper for testing against historical matchups

In [6]:
# Season, Day Num, Winning Team ID, Winning Score, Losing Team ID, Losing Score, Winner Location (Home/Away/Neu.),
# Num OT, WFGM (Field Goals made), WFGA (attempted), WFGM3 (three pointers made), WFGA3, WFTM (free throws made),
# WFTA (free throw attempted), WOR (offensive rebounds), WDR (defensive rebounds), WAst (assists), WTO (turnovers),
# WStl (steals), WBlk (blocks), WPF (personal fouls), LFGM, LFGA, LFGM3, LFGA3, LFTM, LFTA, LOR, LDR, LAst, LTO, LStl, 
# LBlk, LPF
raw_game_history = np.concatenate((
    from_csv('./data/results.csv'),
    from_csv('./data/prelim_2019_regular_season_detailed_results.csv'),
))

x_training_set = []
y_training_set = []

WIN = 1
LOSS = 0

for index in range(0, len(raw_game_history)):
    row = raw_game_history[index]
    year = row[0]
    winning_team_id = row[2]
    losing_team_id = row[4]
    winning_team_name = team_ids[winning_team_id]
    losing_team_name = team_ids[losing_team_id]
    winning_team_location_letter = row[6]

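    if winning_team_location_letter == 'H':
        wloc = 1
        lloc = 0
    elif winning_team_location_letter == 'N':
        wloc = 0.5
        lloc = 0.5
    elif winning_team_location_letter == 'A':
        wloc = 0
        lloc = 1
    else:
        # unknown location code: skip the row rather than reuse values from the previous game
        continue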

    if (not isinstance(kenpom_team_stats_by_year.get(year), dict)):
        continue
    if (not isinstance(kenpom_team_stats_by_year[year].get(winning_team_name), list)):
        continue
    if (not isinstance(kenpom_team_stats_by_year[year].get(losing_team_name), list)):
        continue
        
    winner_stats = kenpom_team_stats_by_year[year][winning_team_name]
    loser_stats = kenpom_team_stats_by_year[year][losing_team_name]
        
    x_training_set.append(np.concatenate((winner_stats, loser_stats, [wloc])))
    y_training_set.append(WIN)
    
    # Append the mirrored row (loser first, labeled LOSS) so the model does not learn that the first position always wins
    x_training_set.append(np.concatenate((loser_stats, winner_stats, [lloc])))
    y_training_set.append(LOSS)

X_train, X_test, y_train, y_test = train_test_split(x_training_set, y_training_set)
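
Each game therefore contributes two rows: one from the winner's perspective and one from the loser's. The sketch below isolates that step with made-up two-value stat lists; `mirrored_rows` and `LOCATION_VALUE` are hypothetical names, not part of the notebook.

```python
# Location of the game from the winner's perspective: home, neutral, away.
LOCATION_VALUE = {"H": 1.0, "N": 0.5, "A": 0.0}

def mirrored_rows(winner_stats, loser_stats, winner_location):
    """Return the two (features, label) rows produced by one game."""
    wloc = LOCATION_VALUE[winner_location]
    lloc = 1.0 - wloc  # the loser's location is the complement of the winner's
    win_row = (list(winner_stats) + list(loser_stats) + [wloc], 1)
    loss_row = (list(loser_stats) + list(winner_stats) + [lloc], 0)
    return [win_row, loss_row]

rows = mirrored_rows([110.0, 95.0], [102.0, 99.0], "H")
print(rows[0])  # ([110.0, 95.0, 102.0, 99.0, 1.0], 1)
print(rows[1])  # ([102.0, 99.0, 110.0, 95.0, 0.0], 0)
```

Because `train_test_split` shuffles rows independently, the two rows of one game can land on opposite sides of the split; whether that affects the reported scores is not examined here.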


Moneylines Model
---

A logistic regression is trained on historical Vegas spreads, and an accompanying linear regression is fit to approximate spreads for match-ups where the actual spread is unavailable (all games after the first round).

Train a logistic regression on historical spreads:

In [7]:
raw_moneylines = from_csv('./data/spreads.csv')

hyperparameters = {
    'solver': ['lbfgs'],
    'max_iter': [100, 500, 1000, 5000, 10000, 50000, 100000],
    'penalty': ['l2'],
    'C': np.logspace(0, 4, 10),
}

spreads = raw_moneylines[:, 0].reshape(-1, 1)
winloss = np.asarray(raw_moneylines[:, 1], dtype='int')

moneylines_X_train, moneylines_X_test, moneylines_y_train, moneylines_y_test = train_test_split(spreads, winloss)

spread_classifier = GridSearchCV(LogisticRegression(), hyperparameters, cv=5, verbose=0, error_score='raise')
spread_classifier.fit(moneylines_X_train, moneylines_y_train)

y_pred = spread_classifier.predict(moneylines_X_test)
accuracy = np.mean(y_pred == moneylines_y_test)
print("Test set score: {:.2f}".format(accuracy))

Test set score: 0.73
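
With a single feature, a logistic regression reduces to a sigmoid applied to a scaled and shifted spread. The sketch below shows that mapping with made-up coefficients; the `weight` and `intercept` values are assumptions for illustration, not the values fitted above, and the sign convention (a negative spread marks the favorite) is also assumed.

```python
import math

def win_probability(spread, weight=-0.15, intercept=0.0):
    """One-feature logistic model: P(win) = sigmoid(weight * spread + intercept).

    weight and intercept are made-up values, not fitted coefficients.
    """
    return 1.0 / (1.0 + math.exp(-(weight * spread + intercept)))

for spread in (-10, 0, 10):
    print(spread, round(win_probability(spread), 3))
```

With a zero intercept the curve is symmetric: a spread of 0 maps to 0.5, and opposite spreads map to complementary probabilities.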


The bracket must be submitted before the tournament starts, when Vegas spreads are available only for the first round, and selections cannot be changed after submission. Any model based on Vegas spreads therefore has to estimate the spreads of later games. To this end, a linear regression is fit to predict the spread of a match-up from the two teams' stats. Finally, the win/loss predictions made from the estimated point spreads are compared against the win/loss predictions made from the actual point spreads.

In [8]:
team_stats = []
stat_based_spreads = []

errors = {}

for index in range(0, len(raw_moneylines)):
    moneyline, winloss, home_team_name, away_team_name, raw_year = raw_moneylines[index]
    _, year = raw_year.split('-')
    
    home_team_name = translator.get(home_team_name, home_team_name)
    away_team_name = translator.get(away_team_name, away_team_name)
        
    try:
        home_stats = kenpom_team_stats_by_year[int(year)][home_team_name]
        away_stats = kenpom_team_stats_by_year[int(year)][away_team_name]
    
        if home_stats and away_stats:
            team_stats.append(np.concatenate((home_stats, away_stats)))
            stat_based_spreads.append(moneyline)
    except (KeyError, ValueError):
        # count teams whose names could not be matched, to find gaps in the translator
        errors[home_team_name] = errors.get(home_team_name, 0) + 1
        errors[away_team_name] = errors.get(away_team_name, 0) + 1

# print(sorted(errors.items(), key=lambda kv: -kv[1]))
estimator_X_train, estimator_X_test, estimator_y_train, estimator_y_test = train_test_split(team_stats, stat_based_spreads)

regressor = LinearRegression()
regressor.fit(estimator_X_train, estimator_y_train)

estimated_moneylines_test_pred = regressor.predict(estimator_X_test).reshape(-1, 1)
y_pred = spread_classifier.predict(estimated_moneylines_test_pred)
actual = spread_classifier.predict(np.asarray(estimator_y_test).reshape(-1, 1))

accuracy = np.mean(y_pred == actual)
print("Test set score: {:.2f}".format(accuracy))

Test set score: 0.81
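
Note what this score compares: the classifier's decision from the estimated spread against its decision from the actual spread, not against the game result. The sketch below separates the two measurements using made-up labels for five games; `agreement` is a hypothetical helper.

```python
def agreement(pred_a, pred_b):
    """Fraction of positions where two equal-length label lists match."""
    matches = sum(1 for a, b in zip(pred_a, pred_b) if a == b)
    return matches / len(pred_a)

# Made-up labels for five games
from_estimated_spread = [1, 0, 1, 1, 0]
from_actual_spread = [1, 0, 0, 1, 0]
true_outcomes = [1, 1, 0, 1, 0]

print(agreement(from_estimated_spread, from_actual_spread))  # 0.8
print(agreement(from_estimated_spread, true_outcomes))       # 0.6
```

The first number corresponds to the comparison made in the cell above; the second is accuracy against real outcomes, which is what the next cell measures.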



Test the accuracy of the model using estimated spreads only.

In [9]:
estimated_moneylines_test_pred = regressor.predict(np.asarray(X_test)[:, :-1]).reshape(-1, 1)
y_pred = spread_classifier.predict(estimated_moneylines_test_pred)

accuracy = np.mean(y_pred == y_test)
print("Test set score: {:.2f}".format(accuracy))

Test set score: 0.74


Team Stats Model
---

A logistic regression is trained on historical team stats to predict the win/loss outcome of a game.

In [10]:
hyperparameters = {
    'solver': ['lbfgs'],
    'max_iter': [1000, 5000],
    'penalty': ['l2'],
    'C': np.logspace(0, 4, 5),
}

kenpom_classifier = GridSearchCV(LogisticRegression(), hyperparameters, cv=5, verbose=0, error_score='raise')
kenpom_classifier.fit(X_train, y_train)

y_pred = kenpom_classifier.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print("Test set score: {:.2f}".format(accuracy))

Test set score: 0.78


Ensemble
---

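The two models are ensembled by taking a weighted average of their predicted win probabilities: a weight of 0.20 for the spread-based model and 0.80 for the team-stats model.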

In [11]:
predictions = []

MONEYLINE_WEIGHT = 0.20
KENPOM_WEIGHT = 0.80

for index in range(0, len(X_test)):
    row = np.asarray(X_test[index])
    estimated_spread = regressor.predict([row[:-1]])
    spread_based_pred = spread_classifier.predict_proba([estimated_spread])[0, 1]
    kenpom_pred = kenpom_classifier.predict_proba([row])[0, 1]
    win_probability = (MONEYLINE_WEIGHT * spread_based_pred) + (KENPOM_WEIGHT * kenpom_pred)
    pred = 1 if win_probability >= 0.5 else 0
    
    predictions.append(pred)

accuracy = np.mean(np.asarray(predictions) == y_test)
print("Test set score: {:.2f}".format(accuracy))

Test set score: 0.78
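
The blending step on its own is a weighted average followed by a 0.5 threshold. A minimal sketch with made-up input probabilities; `blend` and `pick` are hypothetical helpers, and the default weights mirror the constants in the cell above.

```python
def blend(spread_prob, kenpom_prob, moneyline_weight=0.20, kenpom_weight=0.80):
    """Weighted average of the two models' win probabilities."""
    return moneyline_weight * spread_prob + kenpom_weight * kenpom_prob

def pick(win_probability):
    """Turn a blended probability into a win (1) or loss (0) prediction."""
    return 1 if win_probability >= 0.5 else 0

# The two models disagree: the spread model leans loss, the stats model leans win.
probability = blend(0.40, 0.55)
print(round(probability, 2), pick(probability))  # 0.52 1
```

Because the weights sum to 1, the blended value stays between the two inputs, and with a 0.80 weight the team-stats model decides most disagreements.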


Output a CSV of all historical match predictions for upload to Kaggle

In [12]:
ncaa_matchups = from_csv('./data/matchups.csv')[:, 0]

kaggle_predictions = [['ID', 'Pred']]

NEUTRAL_LOCATION = 0.5

for index in range(0, len(ncaa_matchups)):
    match_id = ncaa_matchups[index]
    _year, team_a_id, team_b_id = match_id.split("_")
    year = int(_year)
    team_a_name = team_ids[int(team_a_id)]
    team_b_name = team_ids[int(team_b_id)]
    team_a_name = translator.get(team_a_name, team_a_name)
    team_b_name = translator.get(team_b_name, team_b_name)
    
    team_a_stats = kenpom_team_stats_by_year[year][team_a_name]
    team_b_stats = kenpom_team_stats_by_year[year][team_b_name]
    
    estimated_spread = regressor.predict([np.concatenate((team_a_stats, team_b_stats))])
    spread_based_pred = spread_classifier.predict_proba([estimated_spread])[0, 1]
    kenpom_pred = kenpom_classifier.predict_proba([np.concatenate((team_a_stats, team_b_stats, [NEUTRAL_LOCATION]))])[0, 1]
    win_probability = (MONEYLINE_WEIGHT * spread_based_pred) + (KENPOM_WEIGHT * kenpom_pred)
    
    kaggle_predictions.append([match_id, win_probability])
    
pd.DataFrame(data=kaggle_predictions).to_csv('./data/kaggle_predictions_for_stage_1_with_ensemble_model.csv', index=False, header=False)
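
Each submission row is keyed by an ID of the form `year_teamA_teamB`, and in the cell above `Pred` is the ensemble's probability that the first-listed team beats the second. A small sketch of the parsing step; `parse_match_id` is a hypothetical helper, and the second team ID in the example is made up.

```python
def parse_match_id(match_id):
    """Split an ID like '2019_1101_1102' into (year, team_a_id, team_b_id) integers."""
    year, team_a_id, team_b_id = match_id.split("_")
    return int(year), int(team_a_id), int(team_b_id)

print(parse_match_id("2019_1101_1102"))  # (2019, 1101, 1102)
```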

Output a CSV of potential 2019 bracket matchups for upload to Kaggle

In [13]:
ncaa_matchups = from_csv('./data/stage_2_sample_submission.csv')[:, 0]

kaggle_predictions = [['ID', 'Pred']]

NEUTRAL_LOCATION = 0.5

for index in range(0, len(ncaa_matchups)):
    match_id = ncaa_matchups[index]
    _year, team_a_id, team_b_id = match_id.split("_")
    year = int(_year)
    team_a_name = team_ids[int(team_a_id)]
    team_b_name = team_ids[int(team_b_id)]
    team_a_name = translator.get(team_a_name, team_a_name)
    team_b_name = translator.get(team_b_name, team_b_name)
    
    team_a_stats = kenpom_team_stats_by_year[year][team_a_name]
    team_b_stats = kenpom_team_stats_by_year[year][team_b_name]
    
    estimated_spread = regressor.predict([np.concatenate((team_a_stats, team_b_stats))])
    spread_based_pred = spread_classifier.predict_proba([estimated_spread])[0, 1]
    kenpom_pred = kenpom_classifier.predict_proba([np.concatenate((team_a_stats, team_b_stats, [NEUTRAL_LOCATION]))])[0, 1]
    win_probability = (MONEYLINE_WEIGHT * spread_based_pred) + (KENPOM_WEIGHT * kenpom_pred)
    
    kaggle_predictions.append([match_id, win_probability])
    
pd.DataFrame(data=kaggle_predictions).to_csv('./data/kaggle_predictions_for_stage_2_with_ensemble_model.csv', index=False, header=False)

Create a helper for filling out the office bracket

In [14]:
NEUTRAL_LOCATION = 0.5
def bracket_helper(team1, team2, actual_spread=None):
    team1_stats = kenpom_team_stats_by_year[2019][team1]
    team2_stats = kenpom_team_stats_by_year[2019][team2]
    
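    if actual_spread is not None:
        spread = actual_spread
    else:
        spread = regressor.predict([np.concatenate((team1_stats, team2_stats))])

    # predict_proba expects a 2-D array, whether the spread is a scalar or a one-element array
    spread = np.asarray(spread, dtype=float).reshape(1, 1)
    spread_based_pred = spread_classifier.predict_proba(spread)[0, 1]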
    kenpom_pred = kenpom_classifier.predict_proba([np.concatenate((team1_stats, team2_stats, [NEUTRAL_LOCATION]))])[0, 1]
    win_probability = (MONEYLINE_WEIGHT * spread_based_pred) + (KENPOM_WEIGHT * kenpom_pred)
    
    print("The probability that {} beats {} is {:.2f}".format(team1, team2, win_probability))
    
bracket_helper("Gonzaga", "Michigan")


The probability that Gonzaga beats Michigan is 0.57


Helper to see what team names are available for 2019

In [15]:
print(kenpom_team_stats_by_year[2019].keys())

dict_keys(['Virginia', 'Gonzaga', 'Duke', 'Michigan St', 'Michigan', 'North Carolina', 'Kentucky', 'Tennessee', 'Texas Tech', 'Purdue', 'Virginia Tech', 'Wisconsin', 'Auburn', 'Florida St', 'Houston', 'Iowa St', 'LSU', 'Louisville', 'Wofford', 'Mississippi St', 'Kansas', 'Buffalo', 'Kansas St', 'Maryland', 'Nevada', 'Villanova', 'Florida', 'Marquette', 'Texas', "St Mary's CA", 'Clemson', 'Cincinnati', 'Utah St', 'NC State', 'Syracuse', 'Iowa', 'VA Commonwealth', 'Oklahoma', 'Nebraska', 'Penn St', 'Baylor', 'Mississippi', 'Oregon', 'Indiana', 'Ohio St', 'UCF', 'Minnesota', 'TCU', 'Lipscomb', 'Belmont', 'New Mexico St', 'Murray St', 'Washington', 'Creighton', 'Arkansas', 'Seton Hall', 'Furman', 'Toledo', 'Alabama', 'Dayton', 'Memphis', 'Arizona St', 'Liberty', 'Colorado', 'Xavier', 'San Francisco', 'ETSU', 'Missouri', 'Butler', 'South Carolina', 'Fresno St', 'Miami FL', 'Northwestern', 'UC Irvine', 'Rutgers', 'Providence', 'Temple', 'UNC Greensboro', 'Northeastern', 'Vermont', "St John's