# March Madness Prediction

## Overview

### Goal
Submissions are scored with the Brier score, so the goal is to minimize the Brier score between the predicted probabilities and the actual game outcomes. The Brier score measures the accuracy of probabilistic predictions; here it is the mean squared error between each predicted probability and the corresponding outcome.

The Brier score can be thought of as a cost function: the average squared difference between the predicted probabilities and the actual outcomes.

$$
Brier = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2
$$

where $p_i$ is the predicted probability of the event, $o_i$ is the actual outcome (1 if the event occurred, 0 otherwise), and $N$ is the number of predictions being scored.

The score ranges from 0 (perfect predictions) to 1 (confidently wrong on every game), so a lower Brier score means more accurate predictions.
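
To make the formula concrete, here is a minimal sketch of the Brier score in NumPy. The function name `brier_score` and the sample probabilities are illustrative assumptions, not part of the competition code.

```python
import numpy as np


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    if probs.shape != outcomes.shape:
        raise ValueError("probs and outcomes must have the same shape")
    return float(np.mean((probs - outcomes) ** 2))


# Hypothetical predictions for three games and their actual results
example_probs = [0.9, 0.2, 0.6]
example_outcomes = [1, 0, 1]
print(f"{brier_score(example_probs, example_outcomes):.4f}")  # 0.0700

# Predicting 0.5 for every game always scores 0.25,
# which is a useful reference point for any baseline model
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25
```

Any model worth keeping should score below the 0.25 obtained by always predicting 0.5.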




## Import Libraries
- NumPy for numerical operations
- Pandas for data manipulation
- Matplotlib, Seaborn, and Plotly for plotting



In [6]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.subplots as sp

## Load Data

We want a baseline model that we can improve upon. To do this effectively, I will use a class to hold the data and the functions used throughout the process. This makes the prediction pipeline easier to maintain and extend.


In [42]:
from pathlib import Path


class MarchMadnessPredictor:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.data = None
        self.teams = None

    def load_data(self):
        """
        Load every CSV file in data_dir into a dictionary keyed by the
        file name without its extension, e.g.
        self.data = {
            'MTeams': [DataFrame with men's teams],
            'WTeams': [DataFrame with women's teams],
            'MTeamSpellings': [DataFrame with men's team name spellings]
        }
        """
        # Path.glob and Path.stem work on any OS and do not depend on
        # data_dir ending with a slash or on the path separator
        files = sorted(Path(self.data_dir).glob('*.csv'))
        self.data = {file.stem: pd.read_csv(file, encoding='latin-1') for file in files}

        # Combine the men's and women's teams into one table
        teams = pd.concat([self.data['MTeams'], self.data['WTeams']])

        # Count how many alternate spellings are recorded for each team
        teams_spelling = pd.concat([self.data['MTeamSpellings'], self.data['WTeamSpellings']])
        teams_spelling = teams_spelling.groupby(by='TeamID', as_index=False)['TeamNameSpelling'].count()
        teams_spelling.columns = ['TeamID', 'TeamNameCount']

        # A left merge keeps every team, even one with no recorded spellings
        self.teams = pd.merge(teams, teams_spelling, how='left', on=['TeamID'])

        print(self.teams)

In [43]:
if __name__ == '__main__':
    data_dir = 'data/'
    predictor = MarchMadnessPredictor(data_dir)
    predictor.load_data()


     TeamID        TeamName  FirstD1Season  LastD1Season  TeamNameCount
0      1101     Abilene Chr         2014.0        2025.0              3
1      1102       Air Force         1985.0        2025.0              2
2      1103           Akron         1985.0        2025.0              1
3      1104         Alabama         1985.0        2025.0              1
4      1105     Alabama A&M         2000.0        2025.0              2
..      ...             ...            ...           ...            ...
753    3476       Stonehill            NaN           NaN              1
754    3477  East Texas A&M            NaN           NaN              2
755    3478        Le Moyne            NaN           NaN              1
756    3479      Mercyhurst            NaN           NaN              1
757    3480    West Georgia            NaN           NaN              1

[758 rows x 5 columns]
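
The `TeamNameCount` column comes from the group-count-merge step in `load_data`. The sketch below repeats that step on a small toy table so the behavior is easy to check by hand. The spelling rows are hypothetical and only mimic the column names used above.

```python
import pandas as pd

# Hypothetical toy tables that mimic the columns used in load_data
teams = pd.DataFrame({
    'TeamID': [1101, 1102, 1103],
    'TeamName': ['Abilene Chr', 'Air Force', 'Akron'],
})
spellings = pd.DataFrame({
    'TeamID': [1101, 1101, 1102],
    'TeamNameSpelling': ['abilene chr', 'abilene christian', 'air force'],
})

# Count the spellings recorded per team, as load_data does
counts = spellings.groupby(by='TeamID', as_index=False)['TeamNameSpelling'].count()
counts.columns = ['TeamID', 'TeamNameCount']

# The left merge keeps every team; a team without spellings gets NaN
merged = pd.merge(teams, counts, how='left', on=['TeamID'])
print(merged)
```

In this toy data, team 1103 has no spelling rows, so its `TeamNameCount` is `NaN` after the left merge. If a zero is preferred there, `merged['TeamNameCount'].fillna(0)` is one option.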
