**March Madness Machine Learning 2021**

This project uses the dataset from Kaggle's 2021 March Madness competition. We implement logistic regression with scikit-learn to train a model and predict the winner of the 2021 NCAA Division I Men's Basketball Tournament.

In [6]:
#Importing python packages

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

In [7]:
#reading in data files

datapath_one = '../input/ncaam-march-mania-2021/MDataFiles_Stage1/'
datapath_two = '../input/ncaam-march-mania-2021/MDataFiles_Stage2/'



teams = pd.read_csv(datapath_one + 'MTeams.csv')
reg_szn = pd.read_csv(datapath_one + 'MRegularSeasonDetailedResults.csv')
conference_tourney = pd.read_csv(datapath_one + 'MConferenceTourneyGames.csv')
tourney_seeds = pd.read_csv(datapath_one + 'MNCAATourneySeeds.csv')
tourney_results = pd.read_csv(datapath_one + 'MNCAATourneyDetailedResults.csv')
dataset_two = pd.read_csv(datapath_two + 'MRegularSeasonDetailedResults.csv')
tourney_two = pd.read_csv(datapath_two + 'MNCAATourneyDetailedResults.csv')
twentyone_results = dataset_two[dataset_two.get('Season')==2021].reset_index().drop(columns=['index', 'Season'])
seeds = pd.read_csv(datapath_two + 'MNCAATourneySeeds.csv')
tourney_teams = seeds[seeds.get('Season')==2021]
reg_szn = reg_szn[(reg_szn.get('Season')>=2003) & (reg_szn.get('Season')<2020)]

In [8]:
seeds

In [9]:
#defining helper functions
def convert_seed(s):
    """
    converts the string s given by the seed of a particular
    team in the seeds DataFrame into the integer equivalent so
    that the seeding of a team can be accounted for in the model.
    """
    if s[1] == '0':
        return int(s[2])
    else:
        return int(s[1:3])
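
For illustration, Kaggle seed strings consist of a region letter, a two-digit seed, and, for play-in teams, a trailing letter (for example `W01` or `X16a`). The sketch below uses a hypothetical stand-alone helper, `seed_to_int`, that is assumed to be equivalent to `convert_seed`, to show the conversions one would expect.

```python
def seed_to_int(s):
    """Hypothetical stand-alone equivalent of convert_seed."""
    # Characters 1-2 hold the two-digit seed; int() drops a leading zero.
    return int(s[1:3])

# Example seed strings in the Kaggle format (region, seed, optional play-in letter).
examples = ['W01', 'X16a', 'Y11b', 'Z08']
converted = [seed_to_int(s) for s in examples]
print(converted)
```

Slicing `s[1:3]` covers both the leading-zero and the two-digit case, so the branch on `s[1] == '0'` is not strictly needed.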

The following visualization displays the number of tournament wins by seed from 2003 to 2019. Note that the data includes First Four wins, so the 11 and 16 seed play-in games are included.

In [10]:
#displays histogram
a = seeds[(seeds.get('Season')>=2003) & (seeds.get('Season')<2020)]
temp1 = tourney_results[tourney_results.get('Season')==2003]
temp2 = a[a.get('Season')==2003]
seed = temp2[temp2.get('TeamID')==1421].get('Seed')
arr = []
for i in range(2003, 2020):
    temp1 = tourney_results[tourney_results.get('Season')==i]
    temp2 = a[a.get('Season')==i]
    for team in temp1.get('WTeamID'):
        seed = temp2[temp2.get('TeamID')==team].get('Seed').iloc[0]
        arr.append(convert_seed(seed))

plt.hist(arr, bins=16)
plt.ylabel('Tourney wins (since 2003)')
plt.xlabel('Seed')
plt.title('Tourney wins by seed (since 2003)')
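
The same win counts can also be obtained without an explicit loop. The following is a minimal sketch on made-up data, assuming tables shaped like `tourney_results` and `seeds`; the names `toy_results`, `toy_seeds`, and `wins_by_seed` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for tourney_results and seeds (values are made up).
toy_results = pd.DataFrame({'Season': [2003, 2003, 2004],
                            'WTeamID': [1, 2, 1]})
toy_seeds = pd.DataFrame({'Season': [2003, 2003, 2004],
                          'TeamID': [1, 2, 1],
                          'Seed': ['W01', 'X16a', 'W02']})

# Attach each winner's seed for the matching season.
merged = toy_results.merge(toy_seeds, left_on=['Season', 'WTeamID'],
                           right_on=['Season', 'TeamID'])

# Characters 1-2 of the seed string hold the numeric seed.
merged['SeedNum'] = merged['Seed'].str[1:3].astype(int)

# Count wins per seed, ordered by seed number.
wins_by_seed = merged['SeedNum'].value_counts().sort_index()
print(wins_by_seed)
```

A series like `wins_by_seed` could then be drawn with `wins_by_seed.plot.bar()` so that each bar sits exactly on its seed number.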

In [11]:
#Unused features
Drop_cols = ['DayNum', 'LTeamID', 'NumOT', 'WStl', 'WBlk', 'WPF', 'LAst', 'LStl', 'LBlk', 'LPF', 'LFTA', 'LFTM']

**Data Wrangling**

Since the data is given by game and we want season averages, we must reshape it into a table of season averages. To do this, we aggregate each team's statistics separately over its wins and its losses, then average the combined totals over all games to train our model.
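
As a small illustration of the goal, the sketch below computes season averages on two made-up games. It uses a slightly different route than the cells in this notebook (stacking a winner's view and a loser's view of each game, then taking one grouped mean), and the names `games`, `per_team`, and `season_avg` are hypothetical.

```python
import pandas as pd

# Two made-up games between teams 10 and 20 in one season.
games = pd.DataFrame({'Season': [2019, 2019],
                      'WTeamID': [10, 20], 'WScore': [80, 70],
                      'LTeamID': [20, 10], 'LScore': [60, 65]})

# View each game from the winner's side and from the loser's side.
as_winner = games.rename(columns={'WTeamID': 'TeamID', 'WScore': 'Pts', 'LScore': 'OPts'})
as_loser = games.rename(columns={'LTeamID': 'TeamID', 'LScore': 'Pts', 'WScore': 'OPts'})

# Stack both views so every team has one row per game played.
cols = ['Season', 'TeamID', 'Pts', 'OPts']
per_team = pd.concat([as_winner[cols], as_loser[cols]])

# One grouped mean yields points scored and allowed per game.
season_avg = per_team.groupby(['Season', 'TeamID']).mean()
print(season_avg)
```

Because every team appears in the stacked table once per game, undefeated or winless teams need no special handling in this variant.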

In [12]:
#aggregate team stats by wins
w_updated = reg_szn.rename(columns={'WTeamID': 'TeamID'})
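#numeric_only skips the text column WLoc, which cannot be summed meaningfully
w_stat_totals = w_updated.groupby(['Season', 'TeamID']).sum(numeric_only=True)
w_stat_totals = w_stat_totals.drop(columns=Drop_cols)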
w_stat_totals = w_stat_totals.rename(columns={'WScore': 'Pts', 'LScore': 'OPts',
                                    'WFGM': 'FGM', 'WFGA': 'FGA', 'WFGM3': 'FGM3', 
                                    'WFGA3': 'FGA3', 'WOR': 'OR', 'WDR': 'DR', 'WAst': 'Ast',
                                    'WTO': 'TO', 'LFGM': 'OFGM', 'LFGA': 'OFGA',
                                    'LFGM3': 'OFGM3', 'LFGA3': 'OFGA3', 'WFTM': 'FTM',
                                    'WFTA': 'FTA', 'LOR': 'OOR', 'LDR': 'ODR', 'LTO': 'OTO'})

In [13]:
w_stat_totals

In [14]:
w_updated

In [15]:
#aggregate team stats by losses
updated = reg_szn.rename(columns={'LTeamID': 'TeamID'})
#numeric_only skips the text column WLoc, which cannot be summed meaningfully
l_stat_totals = updated.groupby(['Season', 'TeamID']).sum(numeric_only=True)
Drop_co = ['DayNum', 'WTeamID', 'NumOT', 'LStl', 'LBlk', 'LPF', 'WAst', 'WStl', 'WBlk', 'WPF', 'WFTA', 'WFTM']
l_stat_totals = l_stat_totals.drop(columns=Drop_co)
l_stat_totals = l_stat_totals.rename(columns={'WScore': 'OPts', 'LScore': 'Pts',
                                    'WFGM': 'OFGM', 'WFGA': 'OFGA', 'WFGM3': 'OFGM3', 
                                    'WFGA3': 'OFGA3', 'WOR': 'OOR', 'WDR': 'ODR', 'LAst': 'Ast',
                                    'WTO': 'OTO', 'LFGM': 'FGM', 'LFGA': 'FGA',
                                    'LFGM3': 'FGM3', 'LFGA3': 'FGA3', 'LFTM': 'FTM',
                                    'LFTA': 'FTA', 'LOR': 'OR', 'LDR': 'DR', 'LTO': 'TO'})
index1 = pd.MultiIndex.from_tuples([(2015, 1246)], names=["Season", "TeamID"])
index2 = pd.MultiIndex.from_tuples([(2014, 1455)], names=["Season", "TeamID"])
kentucky_tot = pd.DataFrame([[0]*l_stat_totals.shape[1]], columns = list(l_stat_totals.columns), index = index1)
wichita_tot = pd.DataFrame([[0]*l_stat_totals.shape[1]], columns = list(l_stat_totals.columns), index = index2)
#DataFrame.append was removed in pandas 2.0, so the zero rows are added with pd.concat
l_stat_totals = pd.concat([l_stat_totals, kentucky_tot, wichita_tot])

In [16]:
#combining winning and losing team stats for overall stats
stat_totals = (l_stat_totals + w_stat_totals).dropna()

After aggregating the data into a DataFrame of total season stats for every team from 2003 to 2019, we must also count the number of wins and losses for each team. To do this, we must account for teams with no wins or no losses in a particular season, since they are missing from one of the two tables and would otherwise leave the wins and losses tables with mismatched rows.
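
The mismatch arises because pandas aligns on the index when two tables are added, and a label present in only one of them produces a missing value. The toy sketch below shows this on made-up counts and also shows `add` with `fill_value=0`, an alternative to inserting zero rows by hand; the names `wins`, `losses`, `plain`, and `filled` are hypothetical.

```python
import pandas as pd

# Made-up game counts: team A never lost, so it is absent from `losses`.
wins = pd.Series({'A': 30, 'B': 20}, name='games')
losses = pd.Series({'B': 10}, name='games')

# Plain addition aligns on the index; A has no partner and becomes NaN.
plain = wins + losses

# fill_value=0 treats the missing side as zero before adding.
filled = wins.add(losses, fill_value=0)

print(plain)
print(filled)
```

With `fill_value=0`, team A keeps its 30 games instead of being lost to a later `dropna()`.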

In [17]:
#aggregating number of wins and losses by team
w_tot_games = w_updated.groupby(['Season', 'TeamID']).count()

l_tot_games = updated.groupby(['Season', 'TeamID']).count()
idx1 = pd.MultiIndex.from_tuples([(2015, 1246)], names=["Season", "TeamID"])
idx2 = pd.MultiIndex.from_tuples([(2014, 1455)], names=["Season", "TeamID"])
kentucky_num = pd.DataFrame([[0]*l_tot_games.shape[1]], columns = list(l_tot_games.columns), index = idx1)
#Kentucky was undefeated entering the tournament in 2015
wichita_num = pd.DataFrame([[0]*l_tot_games.shape[1]], columns = list(l_tot_games.columns), index = idx2)
#Wichita State was undefeated entering the tournament in 2014
Drops = ['DayNum', 'NumOT', 'WStl', 'WBlk', 'WPF', 'LAst', 'LStl', 'LBlk', 'LPF', 'LFTA', 'LFTM', 'TeamID', 'WTeamID', 'WLoc']
#DataFrame.append was removed in pandas 2.0, so the zero rows are added with pd.concat
l_tot_games = pd.concat([l_tot_games, kentucky_num, wichita_num])

In [18]:
Drops = ['DayNum', 'NumOT', 'WStl', 'WBlk', 'WPF', 'LAst', 'LStl', 'LBlk', 'LPF', 'LFTA', 'LFTM', 'WTeamID', 'WLoc', 'LTeamID']
games_count = l_tot_games + w_tot_games
games_count = games_count.rename(columns={'WScore': 'Pts', 'LScore': 'OPts',
                                    'WFGM': 'FGM', 'WFGA': 'FGA', 'WFGM3': 'FGM3', 
                                    'WFGA3': 'FGA3', 'WOR': 'OR', 'WDR': 'DR', 'WAst': 'Ast',
                                    'WTO': 'TO', 'LFGM': 'OFGM', 'LFGA': 'OFGA',
                                    'LFGM3': 'OFGM3', 'LFGA3': 'OFGA3', 'WFTM': 'FTM',
                                    'WFTA': 'FTA', 'LOR': 'OOR', 'LDR': 'ODR', 'LTO': 'OTO'})
games_count = games_count.drop(columns=Drops).dropna()
#drops the N/A rows that represent teams that finished a season with 0 wins, as such a team is assumed to have not made the tournament.

In [19]:
#creating averages of each team
averages = stat_totals/games_count

The pre-tournament season averages data for every team is displayed below.

In [20]:
averages

The team's field goal, three-point, and free throw percentages are computed, along with the field goal and three-point percentages of its opponents. The averages for field goal and free throw makes and attempts and for three-point attempts are then dropped from the model, as are the opponents' field goal makes and attempts and three-point attempts.

In [21]:
#dropping/adding features
averages['FG']=averages['FGM']/averages['FGA']
averages['3FG']=averages['FGM3']/averages['FGA3']
averages['FT']=averages['FTM']/averages['FTA']
averages['OFG']=averages['OFGM']/averages['OFGA']
averages['O3FG']=averages['OFGM3']/averages['OFGA3']
team_averages = averages.drop(columns=['FGA', 'FTM', 'FGM', 'FTA', 'FGA3', 'OFGM', 'OFGA', 'OFGA3'])
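
Dividing average makes by average attempts gives the same percentage as dividing season totals, because both averages share the same game count. The following worked example uses made-up numbers to illustrate this; all variable names are hypothetical.

```python
import math

# Made-up season totals for one team.
made_total, attempted_total, n_games = 750, 1600, 30

# Per-game averages, as stored in the averages table.
avg_made = made_total / n_games
avg_attempted = attempted_total / n_games

# The game count cancels, so both ratios describe the same percentage.
pct_from_averages = avg_made / avg_attempted
pct_from_totals = made_total / attempted_total

print(pct_from_averages, pct_from_totals)
```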

One final column we need is the variable we are predicting: which team wins the championship. It is found by taking the winner of the game played on the last day of the tournament, day 154.

In [22]:
c = tourney_results[tourney_results.get('DayNum')==154].set_index('Season').get('WTeamID')
#finding winners by year
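
A variant that does not hard-code the day number is sketched below on made-up data. It assumes that the championship game is the latest game recorded for each season; the names `toy_tourney`, `last_game`, and `champions` are hypothetical.

```python
import pandas as pd

# Made-up tournament results: two games per season.
toy_tourney = pd.DataFrame({'Season': [2018, 2018, 2019, 2019],
                            'DayNum': [152, 154, 152, 154],
                            'WTeamID': [5, 7, 9, 11]})

# Row label of the latest game in each season, then the rows themselves.
last_game = toy_tourney.loc[toy_tourney.groupby('Season')['DayNum'].idxmax()]

# Winner of that game, indexed by season.
champions = last_game.set_index('Season')['WTeamID']
print(champions)
```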

In [23]:
c

The final features added to this data set are the seed of each team, which reflects how good a group of experts judged the team to be over the course of the season, and the team's number of wins and losses. Only tournament teams are kept in the training data; teams that did not make the tournament are cut. Finally, the champion of the season (what we are trying to predict) is marked as 1 or 0, with 1 indicating that the team won the championship that year and 0 indicating that it did not.

In [25]:
#building one row per tournament team, then concatenating once
#(DataFrame.append was removed in pandas 2.0)
rows = []
for i in range(2003, 2020):
    temp = tourney_seeds[tourney_seeds.get('Season')==i]
    for team in temp.get('TeamID'):
        #1 if this team won the title that season, otherwise 0
        champion = int(team == c.loc[i])
        seed = temp[temp.get('TeamID')==team].get('Seed').iloc[0]
        n_wins = w_tot_games.loc[i, team].get('DayNum')
        n_losses = l_tot_games.loc[i, team].get('DayNum')
        indexes = pd.MultiIndex.from_tuples([(i, team)], names=["Season", "TeamID"])
        data = pd.DataFrame([[convert_seed(seed), n_wins, n_losses, champion]+ list(averages.loc[i, team])], 
                            columns=['Seed', 'W', 'L', 'won_title'] + list(averages.columns), index = indexes)
        rows.append(data)
final_data = pd.concat(rows)
final_data = final_data.drop(columns=['FGA', 'FGA3', 'FGM', 'FTA', 'FTM', 'OFGM', 'OFGA3', 'OFGA'])
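
The 1/0 champion label can also be built without a loop. This is a minimal sketch on made-up data, assuming a seeds-like table and a season-to-champion series like `c`; the names `toy_seeds` and `toy_champions` are hypothetical.

```python
import pandas as pd

# Made-up tournament field: two teams in each of two seasons.
toy_seeds = pd.DataFrame({'Season': [2018, 2018, 2019, 2019],
                          'TeamID': [5, 7, 9, 11]})

# Season -> champion TeamID (made-up values).
toy_champions = pd.Series({2018: 7, 2019: 11})

# Look up each row's season champion and compare it with the row's team.
season_champion = toy_seeds['Season'].map(toy_champions)
toy_seeds['won_title'] = (toy_seeds['TeamID'] == season_champion).astype(int)
print(toy_seeds)
```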

The final season statistics for every tourney team from 2003 to 2019 are displayed below.

In [26]:
final_data

**Prediction Data**

The data used for predictions (the 2021 college basketball season) is organized in the same way so that the model can predict the winner of the tournament.

In [49]:
Drop_cols = ['DayNum', 'LTeamID', 'NumOT', 'WStl', 'WBlk', 'WPF', 'LAst', 'LStl', 'LBlk', 'LPF', 'LFTA', 'LFTM']

In [32]:
#aggregate team total stats by wins
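#numeric_only skips the text column WLoc, which cannot be summed meaningfully
w_totals = twentyone_results.groupby('WTeamID').sum(numeric_only=True)
w_totals = w_totals.drop(columns=Drop_cols)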
w_totals = w_totals.rename(columns={'WScore': 'Pts', 'LScore': 'OPts',
                                    'WFGM': 'FGM', 'WFGA': 'FGA', 'WFGM3': 'FGM3', 
                                    'WFGA3': 'FGA3', 'WOR': 'OR', 'WDR': 'DR', 'WAst': 'Ast',
                                    'WTO': 'TO', 'LFGM': 'OFGM', 'LFGA': 'OFGA',
                                    'LFGM3': 'OFGM3', 'LFGA3': 'OFGA3', 'WFTM': 'FTM',
                                    'WFTA': 'FTA', 'LOR': 'OOR', 'LDR': 'ODR', 'LTO': 'OTO'})
w_totals

In [50]:
#aggregate team total stats by losses
twentyone_results = twentyone_results.rename(columns={'LTeamID': 'TeamID'})
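#numeric_only skips the text column WLoc, which cannot be summed meaningfully
l_totals = twentyone_results.groupby('TeamID').sum(numeric_only=True)
Drop_co = ['DayNum', 'WTeamID', 'NumOT', 'LStl', 'LBlk', 'LPF', 'WAst', 'WStl', 'WBlk', 'WPF', 'WFTA', 'WFTM']
l_totals = l_totals.drop(columns=Drop_co)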
l_totals = l_totals.rename(columns={'WScore': 'OPts', 'LScore': 'Pts',
                                    'WFGM': 'OFGM', 'WFGA': 'OFGA', 'WFGM3': 'OFGM3', 
                                    'WFGA3': 'OFGA3', 'WOR': 'OOR', 'WDR': 'ODR', 'LAst': 'Ast',
                                    'WTO': 'OTO', 'LFGM': 'FGM', 'LFGA': 'FGA',
                                    'LFGM3': 'FGM3', 'LFGA3': 'FGA3', 'LFTM': 'FTM',
                                    'LFTA': 'FTA', 'LOR': 'OR', 'LDR': 'DR', 'LTO': 'TO'})
gonzaga_tot = pd.DataFrame([[0]*l_totals.shape[1]], columns = list(l_totals.columns), index = [1211])
#undefeated Gonzaga will be added to balance the team losses and wins stats
l_totals = pd.concat([l_totals, gonzaga_tot])

In [51]:
#combining wins and losses stats for each team
totals = (l_totals + w_totals).dropna()

In [52]:
#counting number of wins and losses for each team
w_games = twentyone_results.groupby('WTeamID').count()
l_games = twentyone_results.groupby('TeamID').count()
gonzaga_num = pd.DataFrame([[0]*l_games.shape[1]], columns = list(l_games.columns), index = [1211])
l_games = pd.concat([l_games, gonzaga_num])
l_games

In [53]:
#computing season averages for each team
Drops = ['DayNum', 'NumOT', 'WStl', 'WBlk', 'WPF', 'LAst', 'LStl', 'LBlk', 'LPF', 'LFTA', 'LFTM', 'TeamID', 'WTeamID', 'WLoc']
total_games = l_games + w_games
total_games = total_games.rename(columns={'WScore': 'Pts', 'LScore': 'OPts',
                                    'WFGM': 'FGM', 'WFGA': 'FGA', 'WFGM3': 'FGM3', 
                                    'WFGA3': 'FGA3', 'WOR': 'OR', 'WDR': 'DR', 'WAst': 'Ast',
                                    'WTO': 'TO', 'LFGM': 'OFGM', 'LFGA': 'OFGA',
                                    'LFGM3': 'OFGM3', 'LFGA3': 'OFGA3', 'WFTM': 'FTM',
                                    'WFTA': 'FTA', 'LOR': 'OOR', 'LDR': 'ODR', 'LTO': 'OTO'})
total_games = total_games.drop(columns=Drops).dropna()

In [54]:
averages = totals/total_games
averages['FG']=averages['FGM']/averages['FGA']
averages['3FG']=averages['FGM3']/averages['FGA3']
averages['FT']=averages['FTM']/averages['FTA']
averages['OFG']=averages['OFGM']/averages['OFGA']
averages['O3FG']=averages['OFGM3']/averages['OFGA3']
team_averages = averages.drop(columns=['FGA', 'FTM', 'FGM', 'FTA', 'FGA3', 'OFGM', 'OFGA', 'OFGA3'])
team_averages

This combines the seed, win and loss counts, and season averages of every 2021 tournament team into the feature table used for prediction.

In [55]:
#building one row per 2021 tournament team, then concatenating once
#(DataFrame.append was removed in pandas 2.0)
rows = []
for team in tourney_teams.get('TeamID'):
    seed = tourney_teams[tourney_teams.get('TeamID')==team].get('Seed').iloc[0]
    n_wins = w_games.loc[team].get('DayNum')
    n_losses = l_games.loc[team].get('DayNum')
    data = pd.DataFrame([[convert_seed(seed), n_wins, n_losses]+ list(team_averages.loc[team])], columns=['Seed', 'W', 'L'] + list(team_averages.columns), index = [team])
    rows.append(data)
team_seeds = pd.concat(rows)
team_seeds

**Training Model**

The model uses logistic regression to estimate, from a team's seed, record, and season statistics, the probability that the team wins the championship.

In [60]:
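#Season identifies the row and is not a team statistic, so it is dropped together with the label;
#this also keeps the training columns in line with the 2021 prediction table
X = final_data.reset_index().set_index('TeamID').drop(columns = ['Season', 'won_title'])
y = final_data.reset_index().set_index('TeamID').won_title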
model = LogisticRegression(max_iter = 500)
model.fit(X, y)

In [61]:
y_predict = model.predict(X)

The predicted labels for the training data are displayed below.

In [62]:
pd.DataFrame({'Predicted': y_predict})
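
Since only one team per season carries the label 1, hard 0/1 predictions may be mostly or entirely 0, and ranking teams by predicted probability can be more informative. The sketch below shows this with `predict_proba` on synthetic data; it is not the notebook's fitted model, and the names `train`, `labels`, `clf`, `candidates`, and `ranking` as well as the labelling rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training set (made up): teams seeded 1 or 2 are labelled 1.
train = pd.DataFrame({'Seed': rng.integers(1, 17, size=200),
                      'W': rng.integers(15, 35, size=200)})
labels = (train['Seed'] <= 2).astype(int)

clf = LogisticRegression(max_iter=500).fit(train, labels)

# Hypothetical prediction table with the same columns as the training features.
candidates = pd.DataFrame({'Seed': [1, 8, 16], 'W': [30, 22, 18]},
                          index=['team_a', 'team_b', 'team_c'])

# Column 1 of predict_proba is the estimated probability of the positive class.
title_prob = pd.Series(clf.predict_proba(candidates)[:, 1], index=candidates.index)

# Highest estimated probability first.
ranking = title_prob.sort_values(ascending=False)
print(ranking)
```

Applied to this notebook, the same call pattern would take a table such as `team_seeds`, provided its columns match the training features in name and order.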