# March Madness Prediction

## Overview

### Goal
Submissions are scored with the Brier score, so the goal is to minimize the Brier score between the predicted probabilities and the actual game outcomes. The Brier score measures the accuracy of probabilistic predictions; for binary outcomes such as these, it is the mean squared error of the predicted probabilities.

It can be thought of as a cost function: the average squared difference between the predicted probabilities and the actual outcomes.

$$
Brier = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2
$$

where $p_i$ is the predicted probability for prediction $i$, $o_i$ is the actual outcome (1 if the event occurred, 0 otherwise), and $N$ is the number of predictions being scored.

A lower Brier score therefore means more accurate predictions: 0 is a perfect score, and always predicting 0.5 scores 0.25.
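As a quick sanity check of the formula, here is a minimal sketch that computes the Brier score with NumPy. The helper name `brier_score` and the four example games are made up for illustration and are not part of the competition data.

```python
import numpy as np

def brier_score(predicted, actual):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((predicted - actual) ** 2))

# Hypothetical outcomes for four games (1 = the event occurred).
outcomes = [1, 0, 1, 1]

# Two hypothetical sets of predictions for the same games.
confident = [0.9, 0.2, 0.8, 0.6]   # leans toward the correct outcome each time
coin_flip = [0.5, 0.5, 0.5, 0.5]   # no information at all

confident_score = brier_score(confident, outcomes)
coin_flip_score = brier_score(coin_flip, outcomes)

print(confident_score)
print(coin_flip_score)
```

The predictions that lean toward the correct outcome score lower than the constant 0.5 guess, which is the behavior the metric rewards.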




## Import Libraries
- glob for locating the data files
- NumPy for numerical operations
- pandas for data manipulation
- Matplotlib, Seaborn, and Plotly for plotting
- scikit-learn for modeling



In [6]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.subplots as sp
import sklearn as sk
import sklearn.ensemble  # load the submodule explicitly so sk.ensemble is available


## Load Data

We want a baseline model that we can improve upon. To do this effectively, I will use a class to hold all the data and functions used along the way. This makes the prediction process easier to change and maintain.


In [68]:
class MarchMadnessPredictor:
    def __init__(self, data_dir):
        self.data_dir = data_dir
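        # Everything below is populated later by load_data() and create_model().
        self.data = None
        self.teams = None
        self.seeds = None
        self.games = None
        self.submission = None
        self.model = None
        self.calibration_model = None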

    def load_data(self):
        
        """
        Load every CSV file in data_dir into a dictionary keyed by file name
        (without directory or extension), e.g.
        self.data = {
            'MTeams': [DataFrame with men's teams data],
            'WTeams': [DataFrame with women's teams data],
            'MNCAATourneySeeds': [DataFrame with men's tournament seeds],
        }
        """

        files = glob.glob(self.data_dir + '*.csv')
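        # Key each DataFrame by its file name without directory or extension.
        # Normalizing '\\' to '/' first makes this work on Windows and POSIX paths alike.
        self.data = {
            file.replace('\\', '/').split('/')[-1].split('.')[0]: pd.read_csv(file, encoding='latin-1')
            for file in files
        }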

        teams = pd.concat([self.data['MTeams'], self.data['WTeams']])
        teams_spelling = pd.concat([self.data['MTeamSpellings'], self.data['WTeamSpellings']])
        teams_spelling = teams_spelling.groupby(by='TeamID', as_index=False)['TeamNameSpelling'].count()
        teams_spelling.columns = ['TeamID', 'TeamNameCount']
        self.teams = pd.merge(teams, teams_spelling, how='left', on=['TeamID'])
        #print(self.teams.head())

        season_compact_results = pd.concat([self.data['MRegularSeasonCompactResults'], self.data['WRegularSeasonCompactResults']]).assign(ST='S')
        season_detailed_results = pd.concat([self.data['MRegularSeasonDetailedResults'], self.data['WRegularSeasonDetailedResults']]).assign(ST='S')
        tourney_compact_results = pd.concat([self.data['MNCAATourneyCompactResults'], self.data['WNCAATourneyCompactResults']]).assign(ST='T')
        tourney_detailed_results = pd.concat([self.data['MNCAATourneyDetailedResults'], self.data['WNCAATourneyDetailedResults']]).assign(ST='T')

        seeds = pd.concat([self.data['MNCAATourneySeeds'], self.data['WNCAATourneySeeds']])
        self.seeds = seeds
        #print(self.seeds.head())

        seeds = seeds.groupby(by='TeamID', as_index=False)['Seed'].count()
        seeds.columns = ['TeamID', 'SeedCount']
        self.teams = pd.merge(self.teams, seeds, how='left', on=['TeamID'])
        #print(self.teams.head())

        self.submission = self.data['SampleSubmissionStage1']

        self.games = pd.concat([season_compact_results, tourney_compact_results])

    def create_model(self):
        self.model = sk.ensemble.RandomForestRegressor(
            n_estimators=235,
            random_state=42,
            max_depth=15,
            min_samples_split=2,
            max_features='sqrt',
            n_jobs=-1,
        )
        self.calibration_model = sk.ensemble.RandomForestRegressor(
            n_estimators=100,
            random_state=42,
            max_depth=10,
            n_jobs=-1,
        )




In [69]:
if __name__ == '__main__':
    data_dir = 'data/'
    predictor = MarchMadnessPredictor(data_dir)
    predictor.load_data()


   Season  DayNum  WTeamID  WScore  LTeamID  LScore WLoc  NumOT ST
0    1985      20     1228      81     1328      64    N      0  S
1    1985      25     1106      77     1354      70    H      0  S
2    1985      25     1112      63     1223      56    H      0  S
3    1985      25     1165      70     1432      54    H      0  S
4    1985      25     1192      86     1447      74    H      0  S
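
The `SeedCount` and `TeamNameCount` columns in `load_data` are both built with a group-count followed by a left merge. The sketch below reproduces that pattern on made-up tables (the team IDs, names, and seeds are invented for illustration) to show one consequence: teams with no matching rows end up with `NaN` after the merge. The final `fillna` step is an assumption about how one might handle that and is not in the original code.

```python
import pandas as pd

# Made-up stand-ins for the teams and seeds tables (not real competition data).
teams = pd.DataFrame({
    'TeamID': [1101, 1102, 1103],
    'TeamName': ['A', 'B', 'C'],
})
seeds = pd.DataFrame({
    'Season': [2021, 2022, 2022],
    'Seed': ['W01', 'X05', 'Y12'],
    'TeamID': [1101, 1101, 1103],
})

# Count how many seed rows each team has, as load_data does.
seed_counts = seeds.groupby(by='TeamID', as_index=False)['Seed'].count()
seed_counts.columns = ['TeamID', 'SeedCount']

# A left merge keeps every team; teams without seed rows get NaN.
merged = pd.merge(teams, seed_counts, how='left', on=['TeamID'])
missing_before_fill = int(merged['SeedCount'].isna().sum())

# Assumption, not in the original code: treat "never seeded" as a count of zero.
merged['SeedCount'] = merged['SeedCount'].fillna(0).astype(int)

print(merged)
```

Whether to fill the gaps with zero or leave them as `NaN` depends on how the downstream model treats missing values.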
