# March Machine Learning

## Overview
This notebook will load in the datasets.

## Dataset

Two years will be predicted: 2016 and 2017. Each year has a dataset consisting of the following files:
* RegularSeasonDetailedResults.csv
  * This file identifies the game-by-game results for 32 seasons of historical data, from 1985 to 2015. Each year, it includes all games played from daynum 0 through 132 (which by definition is "Selection Sunday," the day that tournament pairings are announced). Each row in the file represents a single game played.
  * "season" - this is the year of the associated entry in seasons.csv (the year in which the final tournament occurs)
  * "daynum" - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the "dayzero" date in the "seasons.csv" file. For example, the first game in the file was daynum=20. Combined with the fact from the "season.csv" file that day zero was 10/29/1984, that means the first game was played 20 days later, or 11/18/1984. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key. In order to accomplish this uniqueness, we had to adjust one game's date. In March 2008, the SEC postseason tournament had to reschedule one game (Georgia-Kentucky) to a subsequent day, so Georgia had to actually play two games on the same day. In order to enforce this uniqueness, we moved the game date for the Georgia-Kentucky game back to its original date.
  * "wteam" - this identifies the id number of the team that won the game, as listed in the "teams.csv" file. No matter whether the game was won by the home team or visiting team, "wteam" always identifies the winning team.
  * "wscore" - this identifies the number of points scored by the winning team.  
  * "lteam" - this identifies the id number of the team that lost the game.
  * "lscore" - this identifies the number of points scored by the losing team.
  * "numot" - this indicates the number of overtime periods in the game, an integer 0 or higher.
  * "wloc" - this identifies the "location" of the winning team. If the winning team was the home team, this value will be "H". If the winning team was the visiting team, this value will be "A". If it was played on a neutral court, then this value will be "N". Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the "@" sign is either listed with the winning team, the losing team, or neither team.
  * "wfgm" - field goals made
  * "wfga" - field goals attempted
  * "wfgm3" - three pointers made
  * "wfga3" - three pointers attempted
  * "wftm" - free throws made
  * "wfta" - free throws attempted
  * "wor" - offensive rebounds
  * "wdr" - defensive rebounds
  * "wast" - assists
  * "wto" - turnovers
  * "wstl" - steals
  * "wblk" - blocks
  * "wpf" - personal fouls
  
* Seasons.csv
  * This file identifies the seeds for all teams in each NCAA tournament, for all seasons of historical data. Thus, there are between 64-68 rows for each year, depending on the bracket structure.
  * "season" - the year
  * "seed" - this is a 3/4-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, ..., 15, or 16) tells you the seed within the region. For play-in teams, there is a fourth character (a or b) to further distinguish the seeds, since teams that face each other in the play-in games will have the same first three characters. For example, the first record in the file is seed W01, which means we are looking at the #1 seed in the W region (which we can see from the "seasons.csv" file was the East region). This seed is also referenced in the "tourney_slots.csv" file that tells us which bracket slots face which other bracket slots in which rounds.
  * "team" - this identifies the id number of the team, as specified in the teams.csv file

* Teams.csv
* TourneyDetailedResults.csv
* TourneySeeds.csv
* TourneySlots.csv

### 2016

In [182]:
# Load in 2016 data

# Import necessary tools
import numpy as np
import pandas as pd
import pylab as pl

# Read in the file URL.
season_results_file = 'Data/2017/RegularSeasonDetailedResults.csv'
tourney_results_file = 'Data/2017/TourneyDetailedResults.csv'
teams_file = 'Data/2017/Teams.csv'

results = pd.read_csv(season_results_file)
tourney_results = pd.read_csv(tourney_results_file)
teams = pd.read_csv(teams_file)


In [255]:
# Select features to use and calculate season averages for each team

def get_team_averages(teamid, year):
    team_averages = dict()
    season_results = results.loc[results['Season'] == year]

    team_wins = season_results.loc[season_results['Wteam'] == teamid]
    team_losses = season_results.loc[season_results['Lteam'] == teamid]
    num_games = len(team_wins) + len(team_losses)
    if not num_games: return None
    percent_win = len(team_wins) / num_games
    percent_loss = len(team_losses) / num_games

    mean_win_results = team_wins.mean()
    mean_loss_results = team_losses.mean()

    # TODO Calculate entropy to determine which features to select 
    # TODO PCA to reduce dimensionality 
    
    # Arbitrary features for now
    if not len(team_wins):
        mean_win_results = mean_win_results.fillna(0)
        # Team
        field_goal_percentage = mean_loss_results['Wfgm']/mean_loss_results['Wfga']
        fg3pt_percentage = mean_loss_results['Wfgm3']/mean_loss_results['Wfga3']
        ft_percentage = mean_loss_results['Wftm']/mean_loss_results['Wfta']
        # Opp
        opp_field_goal_percentage = mean_loss_results['Lfgm']/mean_loss_results['Lfga']
        opp_fg3pt_percentage = mean_loss_results['Lfgm3']/mean_loss_results['Lfga3']
        opp_ft_percentage = mean_loss_results['Lftm']/mean_loss_results['Lfta']
    elif not len(team_losses):
        mean_loss_results = mean_loss_results.fillna(0)
        # Team
        field_goal_percentage = mean_win_results['Lfgm']/mean_win_results['Lfga']
        fg3pt_percentage = mean_win_results['Lfgm3']/mean_win_results['Lfga3']
        ft_percentage = mean_win_results['Lftm']/mean_win_results['Lfta']
        # Opp
        opp_field_goal_percentage = mean_win_results['Wfgm']/mean_loss_results['Wfga']
        opp_fg3pt_percentage = mean_win_results['Wfgm3']/mean_loss_results['Wfga3']
        opp_ft_percentage = mean_win_results['Wftm']/mean_loss_results['Wfta']
    else:
        # Team
        field_goal_percentage = (mean_win_results['Wfgm']*percent_win)/(mean_win_results['Wfga']) + (mean_loss_results['Lfgm']*percent_loss)/(mean_loss_results['Lfga'])
        fg3pt_percentage = (mean_win_results['Wfgm3']*percent_win)/(mean_win_results['Wfga3']) + (mean_loss_results['Lfgm3']*percent_loss)/(mean_loss_results['Lfga3'])
        ft_percentage = (mean_win_results['Wftm']*percent_win)/(mean_win_results['Wfta']) + (mean_loss_results['Lftm']*percent_loss)/(mean_loss_results['Lfta'])
        # Opp
        opp_field_goal_percentage = (mean_win_results['Lfgm']*percent_win)/(mean_win_results['Lfga']) + (mean_loss_results['Wfgm']*percent_loss)/(mean_loss_results['Wfga'])
        opp_fg3pt_percentage = (mean_win_results['Lfgm3']*percent_win)/(mean_win_results['Lfga3']) + (mean_loss_results['Wfgm3']*percent_loss)/(mean_loss_results['Wfga3'])
        opp_ft_percentage = (mean_win_results['Lftm']*percent_win)/(mean_win_results['Lfta']) + (mean_loss_results['Wftm']*percent_loss)/(mean_loss_results['Wfta'])
    
    # Team
    assists = (mean_win_results['Wast']*percent_win) + (mean_loss_results['Last']*percent_loss)
    turnovers = (mean_win_results['Wto']*percent_win) + (mean_loss_results['Lto']*percent_loss)
    # Opp
    opp_assists = (mean_win_results['Last']*percent_win) + (mean_loss_results['Wast']*percent_loss)
    opp_turnovers = (mean_win_results['Lto']*percent_win) + (mean_loss_results['Wto']*percent_loss)
    
    # TODO Win percent home
    # TODO Win percent away
    
    return [field_goal_percentage, fg3pt_percentage, ft_percentage, assists, turnovers,
           opp_field_goal_percentage, opp_fg3pt_percentage, opp_ft_percentage, opp_assists, opp_turnovers]


In [256]:
# Create training data set (2003 through 2015)
years = list(range(2003, 2015 + 1))

def create_training_set():
    training_set_features = []
    training_set_class = []
    for year in years:
        team_stats = dict()
        for team in teams['Team_Id']:
            team_stats[team] = get_team_averages(team, year)
        games = results.loc[results['Season'] == year]
        for index, game in games.iterrows():
            win_team = team_stats[game['Wteam']]
            lose_team = team_stats[game['Lteam']]
            game_differential = [w - l for w, l in zip(win_team, lose_team)]
            training_set_features.append(game_differential)
            training_set_class.append(0) # 1 Represents win
            game_differential = [l - w for w, l in zip(win_team, lose_team)]
            training_set_features.append(game_differential)
            training_set_class.append(1) # 0 Represents loss
    return [np.asarray(training_set_features), np.asarray(training_set_class)]
        
training_set = create_training_set()

In [257]:
# Train the model

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

classifier_model = GaussianNB()

classifier_model.fit(training_set[0], training_set[1])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [254]:
# Compare two teams average features to create new feature datastructure 

def predict(team1, team2, year):
    team1_averages = get_team_averages(team1, year)
    team2_averages = get_team_averages(team2, year)
    game_differential = [w - l for w, l in zip(team1_averages, team2_averages)]
    return classifier_model.predict_proba([game_differential])[0]

print(predict(1323, 1452, 2017))


[ 0.75820503  0.24179497]


In [180]:
# Run the prediction for the model


### 2017

In [2]:
# Load in 2017 data
