# March Machine Learning

## Overview
This notebook will load in the datasets.

## Dataset

#### RegularSeasonDetailedResults.csv and TourneyDetailedResults.csv

These file identifies the game-by-game results for 32 seasons of historical data, from 1985 to 2015. Each row in the file represents a single game played. From this file we use the following:

  * "season" - this is the year of the associated entry in seasons.csv
  * "wteam" - this identifies the id number of the team that won the game
  * "wscore" - this identifies the number of points scored by the winning team  
  * "lteam" - this identifies the id number of the team that lost the game
  * "lscore" - this identifies the number of points scored by the losing team
  * "wfgm" - field goals made
  * "wfga" - field goals attempted
  * "wfgm3" - three pointers made
  * "wfga3" - three pointers attempted
  * "wftm" - free throws made
  * "wfta" - free throws attempted
  * "wast" - assists
  * "wto" - turnovers
  
#### Seasons.csv
This file identifies the seeds for all teams in each NCAA tournament, for all seasons of historical data. Thus, there are between 64-68 rows for each year, depending on the bracket structure.

  * "season" - the year
  * "seed" - this is a 3/4-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, ..., 15, or 16) tells you the seed within the region. For play-in teams, there is a fourth character (a or b) to further distinguish the seeds, since teams that face each other in the play-in games will have the same first three characters. For example, the first record in the file is seed W01, which means we are looking at the #1 seed in the W region (which we can see from the "seasons.csv" file was the East region). This seed is also referenced in the "tourney_slots.csv" file that tells us which bracket slots face which other bracket slots in which rounds.
  * "team" - this identifies the id number of the team, as specified in the teams.csv file

#### Teams.csv

This file identifies the different college teams present in the dataset. Each team has a 4 digit id number
  
#### TourneySeeds.csv

This file identifies the seeds for all teams in each NCAA tournament, for all seasons of historical data. Thus, there are between 64-68 rows for each year, depending on the bracket structure.

  * "season" - the year
  * "seed" - this is a 3/4-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, ..., 15, or 16) tells you the seed within the region. For play-in teams, there is a fourth character (a or b) to further distinguish the seeds, since teams that face each other in the play-in games will have the same first three characters. For example, the first record in the file is seed W01, which means we are looking at the #1 seed in the W region (which we can see from the "seasons.csv" file was the East region). This seed is also referenced in the "tourney_slots.csv" file that tells us which bracket slots face which other bracket slots in which rounds.
  * "team" - this identifies the id number of the team, as specified in the teams.csv file

#### TourneySlots.csv

This file identifies the mechanism by which teams are paired against each other, depending upon their seeds. Because of the existence of play-in games for particular seed numbers, the pairings have small differences from year to year. If there were N teams in the tournament during a particular year, there were N-1 teams eliminated (leaving one champion) and therefore N-1 games played, as well as N-1 slots in the tournament bracket, and thus there will be N-1 records in this file for that season.

  * "season" - the year
  * "slot" - this uniquely identifies one of the tournament games. For play-in games, it is a three-character string identifying the seed fulfilled by the winning team, such as W16 or Z13. For regular tournament games, it is a four-character string, where the first two characters tell you which round the game is (R1, R2, R3, R4, R5, or R6) and the second two characters tell you the expected seed of the favored team. Thus the first row is R1W1, identifying the Round 1 game played in the W bracket, where the favored team is the 1 seed. As a further example, the R2W1 slot indicates the Round 2 game that would have the 1 seed from the W bracket, assuming that all favored teams have won up to that point. The slot names are different for the final two rounds, where R5WX identifies the national semifinal game between the winners of regions W and X, and R5YZ identifies the national semifinal game between the winners of regions Y and Z, and R6CH identifies the championship game. The "slot" value is used in other columns in order to represent the advancement and pairings of winners of previous games.
  * "strongseed" - this indicates the expected higher-seeded (lower seed number) team that plays in this game
  * "weakseed" - this indicates the expected weaker-seeded team that plays in this game

## Read and Process Data

### Open the datafiles and load into Pandas dataframes 

In [133]:

# Import necessary tools
import numpy as np
import pandas as pd
import pylab as pl
import random

# Read in the file URL.
season_results_file = 'Data/2017/RegularSeasonDetailedResults.csv'
tourney_results_file = 'Data/2017/TourneyDetailedResults.csv'
teams_file = 'Data/2017/Teams.csv'

results = pd.read_csv(season_results_file)
tourney_results = pd.read_csv(tourney_results_file)
teams = pd.read_csv(teams_file)

# Read in the files for bracket structure and seeding
tourney_slots_file = 'Data/2017/TourneySlots.csv'
tourney_seeds_file = 'Data/2017/TourneySeeds.csv'
tourney_slots = pd.read_csv(tourney_slots_file)
tourney_seeds = pd.read_csv(tourney_seeds_file)


### Select features to use and calculate season averages for each team

The `get_team_averages(teamid, year)` function will return the feature vector for a team. The feature vector will consist of what we use to train and test our model based on what we assume to be the most important aspects of a college basketball game. These statistics are calculated as an average over the entire regurlar season for the given year.
  
  * Field goal percentage
  * 3-point percentage 
  * Free-throw percentage 
  * Assists
  * Turnovers
  * Tournament seed

In [134]:

def get_team_averages(teamid, year):
    team_averages = dict()
    season_results = results.loc[results['Season'] == year]

    team_wins = season_results.loc[season_results['Wteam'] == teamid]
    team_losses = season_results.loc[season_results['Lteam'] == teamid]
    num_games = len(team_wins) + len(team_losses)
    if not num_games: return None
    percent_win = len(team_wins) / num_games
    percent_loss = len(team_losses) / num_games

    mean_win_results = team_wins.mean()
    mean_loss_results = team_losses.mean()
    
    # Arbitrary features for now
    if not len(team_wins):
        mean_win_results = mean_win_results.fillna(0)
        # Team
        field_goal_percentage = mean_loss_results['Wfgm']/mean_loss_results['Wfga']
        fg3pt_percentage = mean_loss_results['Wfgm3']/mean_loss_results['Wfga3']
        ft_percentage = mean_loss_results['Wftm']/mean_loss_results['Wfta']
        # Opp
        opp_field_goal_percentage = mean_loss_results['Lfgm']/mean_loss_results['Lfga']
        opp_fg3pt_percentage = mean_loss_results['Lfgm3']/mean_loss_results['Lfga3']
        opp_ft_percentage = mean_loss_results['Lftm']/mean_loss_results['Lfta']
    elif not len(team_losses):
        mean_loss_results = mean_loss_results.fillna(0)
        # Team
        field_goal_percentage = mean_win_results['Lfgm']/mean_win_results['Lfga']
        fg3pt_percentage = mean_win_results['Lfgm3']/mean_win_results['Lfga3']
        ft_percentage = mean_win_results['Lftm']/mean_win_results['Lfta']
        # Opp
        opp_field_goal_percentage = mean_win_results['Wfgm']/mean_win_results['Wfga']
        opp_fg3pt_percentage = mean_win_results['Wfgm3']/mean_win_results['Wfga3']
        opp_ft_percentage = mean_win_results['Wftm']/mean_win_results['Wfta']
    else:
        # Team
        field_goal_percentage = (mean_win_results['Wfgm']*percent_win)/(mean_win_results['Wfga']) + (mean_loss_results['Lfgm']*percent_loss)/(mean_loss_results['Lfga'])
        fg3pt_percentage = (mean_win_results['Wfgm3']*percent_win)/(mean_win_results['Wfga3']) + (mean_loss_results['Lfgm3']*percent_loss)/(mean_loss_results['Lfga3'])
        ft_percentage = (mean_win_results['Wftm']*percent_win)/(mean_win_results['Wfta']) + (mean_loss_results['Lftm']*percent_loss)/(mean_loss_results['Lfta'])
        # Opp
        opp_field_goal_percentage = (mean_win_results['Lfgm']*percent_win)/(mean_win_results['Lfga']) + (mean_loss_results['Wfgm']*percent_loss)/(mean_loss_results['Wfga'])
        opp_fg3pt_percentage = (mean_win_results['Lfgm3']*percent_win)/(mean_win_results['Lfga3']) + (mean_loss_results['Wfgm3']*percent_loss)/(mean_loss_results['Wfga3'])
        opp_ft_percentage = (mean_win_results['Lftm']*percent_win)/(mean_win_results['Lfta']) + (mean_loss_results['Wftm']*percent_loss)/(mean_loss_results['Wfta'])
    
    # Team
    assists = (mean_win_results['Wast']*percent_win) + (mean_loss_results['Last']*percent_loss)
    turnovers = (mean_win_results['Wto']*percent_win) + (mean_loss_results['Lto']*percent_loss)
    # Opp
    opp_assists = (mean_win_results['Last']*percent_win) + (mean_loss_results['Wast']*percent_loss)
    opp_turnovers = (mean_win_results['Lto']*percent_win) + (mean_loss_results['Wto']*percent_loss)
    
    # Tourney Seed
    tourney_seeds_year = tourney_seeds.loc[tourney_seeds['Season'] == year]
    tourney_seed = tourney_seeds_year.loc[tourney_seeds_year['Team'] == teamid]['Seed']
    if len(tourney_seed):
        tourney_seed = tourney_seed.tolist()[0][1:]
        try:
            tourney_seed = float(tourney_seed)
        except:
            tourney_seed = float(tourney_seed[:-1])
    else:
        tourney_seed = 20
    
    return [field_goal_percentage, fg3pt_percentage, ft_percentage, assists, turnovers, tourney_seed,
           opp_field_goal_percentage, opp_fg3pt_percentage, opp_ft_percentage, opp_assists, opp_turnovers]


### Training and testing set creation

The training and testing set work by calculating the feature vectors for every single matchup in the reuglar season from 2003 through 2016 and using that and the actual game outcome as training data. It then uses the historical tournament games for all of those years as test data in a similar manner. We have two methods for doing this: averages over the entire regular season, and statistics for individual games.

#### Average feature vectors

The average feature vectors were used for training and for the actual true predicitons. This is becuase the stat line for a particular game would make it very easy to predict the outcome of that particular game.

In [136]:

# Create training data set (2003 through 2016)
years = list(range(2003, 2016 + 1))

def create_training_set_averages():
    training_set_features = []
    training_set_class = []
    for year in years:
        team_stats = dict()
        for team in teams['Team_Id']:
            team_stats[team] = get_team_averages(team, year)
        games = results.loc[results['Season'] == year]
        for index, game in games.iterrows():
            win_team = team_stats[game['Wteam']]
            lose_team = team_stats[game['Lteam']]
            game_differential = [w - l for w, l in zip(win_team, lose_team)]
            training_set_features.append(game_differential)
            training_set_class.append(0) # Represents win
            game_differential = [l - w for w, l in zip(win_team, lose_team)]
            training_set_features.append(game_differential)
            training_set_class.append(1) # Represents loss
    return [np.asarray(training_set_features), np.asarray(training_set_class)]
        
training_set = create_training_set_averages()

In [137]:
# Create testing set (historical tournament data)

def create_testing_set_averages():
    testing_set_features = []
    testing_set_class = []
    for year in years:
        team_stats = dict()
        for team in teams['Team_Id']:
            team_stats[team] = get_team_averages(team, year)
        games = tourney_results.loc[tourney_results['Season'] == year]
        for index, game in games.iterrows():
            win_team = team_stats[game['Wteam']]
            lose_team = team_stats[game['Lteam']]
            game_differential = [w - l for w, l in zip(win_team, lose_team)]
            testing_set_features.append(game_differential)
            testing_set_class.append(0) # Represents win
    return [np.asarray(testing_set_features), np.asarray(testing_set_class)]
        
testing_set = create_testing_set_averages()

#### Game line statistics as feature vectors

While the average feature vectors make more sense for actually predicting games, it is useful to know how the model would do if it knows the actual statistics for the game. This would give us an idea of a sort of best case scenario or thoeretical expected max for our model.

In [153]:
def get_game_stats(game, team):
    # Team
    if game['Wfga']:
        field_goal_percentage = game['Wfgm']/game['Wfga']
    else:
        field_goal_percentage = 0
    if game['Wfga3']:
        fg3pt_percentage = game['Wfgm3']/game['Wfga3']
    else:
        fg3pt_percentage = 0
    if game['Wfta']:
        ft_percentage = game['Wftm']/game['Wfta']
    else:
        ft_percentage = 0
    # Opp
    if game['Lfga']:
        opp_field_goal_percentage = game['Lfgm']/game['Lfga']
    else:
        opp_field_goal_percentage = 0
    if game['Lfga3']:
        opp_fg3pt_percentage = game['Lfgm3']/game['Lfga3']
    else:
        opp_fg3pt_percentage = 0
    if game['Lfta']:
        opp_ft_percentage = game['Lftm']/game['Lfta']
    else:
        opp_ft_percentage = 0
    
    # Team
    assists = game['Wast']
    turnovers = game['Wto']
    # Opp
    opp_assists = game['Last']
    opp_turnovers = game['Lto']

    if team == 'Wteam':
        # Tourney Seed
        tourney_seeds_year = tourney_seeds.loc[tourney_seeds['Season'] == game['Season']]
        tourney_seed = tourney_seeds_year.loc[tourney_seeds_year['Team'] == game['Wteam']]['Seed']
        if len(tourney_seed):
            tourney_seed = tourney_seed.tolist()[0][1:]
            try:
                tourney_seed = float(tourney_seed)
            except:
                tourney_seed = float(tourney_seed[:-1])
        else:
            tourney_seed = 20
        return [field_goal_percentage, fg3pt_percentage, ft_percentage, assists, turnovers, tourney_seed,
               opp_field_goal_percentage, opp_fg3pt_percentage, opp_ft_percentage, opp_assists, opp_turnovers]
    elif team == 'Lteam':
        # Tourney Seed
        tourney_seeds_year = tourney_seeds.loc[tourney_seeds['Season'] == game['Season']]
        tourney_seed = tourney_seeds_year.loc[tourney_seeds_year['Team'] == game['Lteam']]['Seed']
        if len(tourney_seed):            
            tourney_seed = tourney_seed.tolist()[0][1:]
            try:
                tourney_seed = float(tourney_seed)
            except:
                tourney_seed = float(tourney_seed[:-1])
        else:
            tourney_seed = 20
        return [opp_field_goal_percentage, opp_fg3pt_percentage, opp_ft_percentage, opp_assists, opp_turnovers, tourney_seed,
               field_goal_percentage, fg3pt_percentage, ft_percentage, assists, turnovers]
    else: # Should never happen
        return None

In [160]:
# Create training data set (2003 through 2015)
years = list(range(2003, 2015 + 1))

def create_training_set_games():
    training_set_features = []
    training_set_class = []
    for year in years:
        games = results.loc[results['Season'] == year]
        for index, game in games.iterrows():
            win_team = get_game_stats(game, 'Wteam')
            lose_team = get_game_stats(game, 'Lteam')
            game_differential = [w - l for w, l in zip(win_team, lose_team)]
            training_set_features.append(game_differential)
            training_set_class.append(0) # Represents win
            game_differential = [l - w for w, l in zip(win_team, lose_team)]
            training_set_features.append(game_differential)
            training_set_class.append(1) # Represents loss
    return [np.asarray(training_set_features), np.asarray(training_set_class)]
        
training_set_games = create_training_set_games()

In [159]:
# Create testing set (historical tournament data)

def create_testing_set_games():
    testing_set_features = []
    testing_set_class = []
    for year in years:
        games = tourney_results.loc[tourney_results['Season'] == year]
        for index, game in games.iterrows():
            win_team = get_game_stats(game, 'Wteam')
            lose_team = get_game_stats(game, 'Lteam')
            game_differential = [w - l for w, l in zip(win_team, lose_team)]
            testing_set_features.append(game_differential)
            testing_set_class.append(0) # Represents win
    return [np.asarray(testing_set_features), np.asarray(testing_set_class)]
        
testing_set_games = create_testing_set_games()

In [9]:
# If using MLP, scale data since it is sensitive to this

def scale_sets():
    from sklearn.preprocessing import StandardScaler  
    scaler = StandardScaler()  
    # Fit only on training data
    scaler.fit(training_set[0])  
    training_set[0] = scaler.transform(training_set[0])  
    # Apply same transformation to test data
    testing_set[0] = scaler.transform(testing_set[0])


### Training and testing the model

We used three different classifiers and evaulated each of their performances. In order to select a different classifier, simply comment out the `classifier_model` line with the desired classifier and re-run the training data fit and then testing data test.

#### Train the desired model

In [177]:
# Train the model

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


#classifier_model = GaussianNB()
#classifier_model = LogisticRegression()
classifier_model = MLPClassifier(hidden_layer_sizes=(15, 5), random_state=True)


classifier_model.fit(training_set[0], training_set[1])

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(15, 5), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=True,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

#### Test the model against

Testing is done against all of the historical tournament games. The result is the percentage of games where the model was correct. This will give an overall idea of how well a model should be expected to perform.

In [176]:
# Test the accuracy of the model on historical tournament data

def test_model(classifier_model, testing_set):
    accuracy=[]
    predictions = classifier_model.predict(testing_set[0])

    predictions[predictions < .5] = 0
    predictions[predictions >= .5] = 1
    accuracy.append(np.mean(predictions == testing_set[1]))

    print("The accuracy is", sum(accuracy)/len(accuracy))
    
test_model(classifier_model, testing_set)

The accuracy is 0.673960612691


### Individual Game Prediction

This function can be used to predict any game in a given year! This will be used to predict the games of the tournament. It simply gets the team averages for the year and subtracts the difference to create the feature vector to predict games.

In [180]:
# Compare two teams average features to create new feature datastructure 

def predict(team1, team2, year):
    team1_averages = get_team_averages(team1, year)
    team2_averages = get_team_averages(team2, year)
    game_differential = [w - l for w, l in zip(team1_averages, team2_averages)]
    return classifier_model.predict_proba([game_differential])[0]

# For fun predict Notre Dame vs UNC this past year
# Team IDs can be found in the teams.csv file
print(predict(1323, 1314, 2017))

[ 0.45599637  0.54400363]


#### Bracket Prediciton

This function will predict a bracket for a given year. The output format is the round and game followed by the winning team and the chance they would win.

In [115]:
# Predict brackets!

def predict_bracket(year):
    year_tourney_slots = tourney_slots.loc[tourney_slots['Season'] == year]
    year_tourney_seeds = tourney_seeds.loc[tourney_seeds['Season'] == year]

    predictions = dict()
    for index, row in year_tourney_slots.iterrows():
        team1 = row['Strongseed']
        team2 = row['Weakseed']
        
        while team1 not in year_tourney_seeds['Seed'].tolist():
            team1 = predictions[team1][0]
        while team2 not in year_tourney_seeds['Seed'].tolist():
            team2 = predictions[team2][0]
            
        team1_id = year_tourney_seeds.loc[year_tourney_seeds['Seed'] == team1]['Team'].values[0]
        team2_id = year_tourney_seeds.loc[year_tourney_seeds['Seed'] == team2]['Team'].values[0]
        team1_name = teams.loc[teams['Team_Id'] == team1_id]['Team_Name'].values[0]
        team2_name = teams.loc[teams['Team_Id'] == team2_id]['Team_Name'].values[0]
        
        prediction = predict(team1_id, team2_id, year)
        
        monte_carlo_factor = random.uniform(0.9, 1.1)
        if prediction[0]*monte_carlo_factor >= 0.5:
            predictions[row['Slot']] = [team1, team1_name, prediction[0]]
        else:
            predictions[row['Slot']] = [team2, team2_name, prediction[1]]
    return predictions


### 2017 Bracket Predicted 

In [181]:

predict_bracket(2017)


{'R1W1': ['W01', 'Villanova', 0.89966138140662155],
 'R1W2': ['W02', 'Duke', 0.8345099562261189],
 'R1W3': ['W03', 'Baylor', 0.71571870929893355],
 'R1W4': ['W04', 'Florida', 0.70251950533055063],
 'R1W5': ['W05', 'Virginia', 0.81317805558077549],
 'R1W6': ['W06', 'SMU', 0.78551103057399185],
 'R1W7': ['W10', 'Marquette', 0.49215820662964849],
 'R1W8': ['W08', 'Wisconsin', 0.55328411264626565],
 'R1X1': ['X01', 'Gonzaga', 0.92553541252262717],
 'R1X2': ['X02', 'Arizona', 0.80883808581282024],
 'R1X3': ['X03', 'Florida St', 0.79229791140633388],
 'R1X4': ['X04', 'West Virginia', 0.78642318522983734],
 'R1X5': ['X05', 'Notre Dame', 0.67967452371500137],
 'R1X6': ['X06', 'Maryland', 0.72041455071806781],
 'R1X7': ['X07', "St Mary's CA", 0.75516604899918116],
 'R1X8': ['X08', 'Northwestern', 0.72583978755019807],
 'R1Y1': ['Y01', 'Kansas', 0.73187022085926601],
 'R1Y2': ['Y02', 'Louisville', 0.87023776851457013],
 'R1Y3': ['Y03', 'Oregon', 0.83850831280914639],
 'R1Y4': ['Y04', 'Purdue', 0