# Introduction

Continuing from part I, where we built the model, this notebook covers the methods used to make our predictions.

In [2]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns',99)
import warnings 
warnings.filterwarnings('ignore')

### Import the Files

The code below reformats the 2019 data into the same format used for our logistic regression data. Whereas part I used only data from 2008 to 2018, the datasets in part II include the 2019 regular season. The 2019 NCAA tournament statistics are absent because the tournament has not happened yet, so the process differs slightly from the one used in part I but is similar overall.

In [3]:
Basics = ['Teams.csv','Seasons.csv','NCAATourneySeeds.csv',
          'RegularSeasonCompactResults.csv','NCAATourneyCompactResults.csv']

BoxScores = ['RegularSeasonDetailedResults.csv','NCAATourneyDetailedResults.csv']

Rankings = ['MasseyOrdinals_thru_2019_day_128.csv']

Supplements = ['Conferences.csv','TeamConferences.csv']

Data = [Basics,BoxScores,Rankings,Supplements]

FileIn = []

for lists in Data:
    for csv in lists:
        print(f'File: {csv}, {lists.index(csv)}')
        Df = pd.read_csv(f'DataFiles - Stage 2/{csv}')
        FileIn.append(Df)

#Basics
Teams = FileIn[0]
Seasons = FileIn[1]
NCAASeeds = FileIn[2]
RSCompact = FileIn[3]
NCAACompact = FileIn[4]

#Box Scores
RSDetailed = FileIn[5]
NCAADetailed = FileIn[6]

#Rankings
MasseyRank = FileIn[7]

#Supplementary
Conferences = FileIn[8]
TeamConferences = FileIn[9]


File: Teams.csv, 0
File: Seasons.csv, 1
File: NCAATourneySeeds.csv, 2
File: RegularSeasonCompactResults.csv, 3
File: NCAATourneyCompactResults.csv, 4
File: RegularSeasonDetailedResults.csv, 0
File: NCAATourneyDetailedResults.csv, 1
File: MasseyOrdinals_thru_2019_day_128.csv, 0
File: Conferences.csv, 0
File: TeamConferences.csv, 1
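
The loading cell above relies on the position of each file in `FileIn`. As a hedged alternative, the sketch below keys each table by its file name instead, so a reordered file list cannot silently shift the assignments. The helper name `load_tables` and the throwaway demo files are assumptions made for illustration; they are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(folder, filenames):
    """Read each CSV in `filenames` from `folder` into a dict keyed by file stem."""
    folder = Path(folder)
    return {Path(name).stem: pd.read_csv(folder / name) for name in filenames}


# Tiny self-contained demo using throwaway files (not the real competition data).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'TeamID': [1101, 1113]}).to_csv(Path(tmp) / 'Teams.csv', index=False)
    pd.DataFrame({'Season': [2019]}).to_csv(Path(tmp) / 'Seasons.csv', index=False)
    tables = load_tables(tmp, ['Teams.csv', 'Seasons.csv'])

print(sorted(tables))          # table names come from the file names
print(tables['Teams'].shape)
```

With the real data, the same call would take the `DataFiles - Stage 2` folder and the file lists defined above, and `tables['Teams']` would replace `FileIn[0]`.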


### Format Data

We can ignore `NCAACompact` because the 2019 tournament has not been played yet, so it adds nothing to the 2019 stats.

In [4]:
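#Copy the filtered rows so the new columns below are not written to a slice of the original frame
RSCompact = RSCompact[RSCompact['Season'] == 2019].copy()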
RSCompact['WPointDiff'] = RSCompact['WScore'] - RSCompact['LScore']
RSCompact['LPointDiff'] = -RSCompact['WPointDiff']
RSCompact['Tournament'] = 'Regular Season'

NCAASeeds = NCAASeeds[NCAASeeds['Season'] == 2019]

# Add NCAA Seeds
Model = pd.merge(RSCompact,NCAASeeds,how='left',left_on=['WTeamID','Season'],right_on=['TeamID','Season'])
Model.drop(['TeamID'],axis = 1,inplace=True)
Model['WSeed'] = Model['Seed']
Model.drop(['Seed'],axis = 1, inplace = True)
Model.insert(3,'W_Seed',Model['WSeed'],allow_duplicates=False)
Model.drop(['WSeed'],axis=1,inplace = True)

Model = pd.merge(Model,NCAASeeds,how='left',left_on=['LTeamID','Season'],right_on=['TeamID','Season'])
Model.drop(['TeamID'],axis = 1,inplace=True)
Model['LSeed'] = Model['Seed']
Model.drop(['Seed'],axis = 1, inplace = True)
Model.insert(6,'L_Seed',Model['LSeed'],allow_duplicates=False)
Model.drop(['LSeed'],axis=1,inplace = True)

#Add TeamNames
Model2 = pd.merge(Model,Teams,how='left',left_on=['WTeamID'],right_on=['TeamID'])
Model2.drop(['TeamID','FirstD1Season','LastD1Season'],axis = 1,inplace=True)
Model2['WTeamName'] = Model2['TeamName']
Model2.insert(3,'W_TeamName',Model2['WTeamName'],allow_duplicates=False)
Model2.drop(['WTeamName','TeamName'],axis=1,inplace = True)

Model2 = pd.merge(Model2,Teams,how='left',left_on=['LTeamID'],right_on=['TeamID'])
Model2.drop(['TeamID','FirstD1Season','LastD1Season'],axis = 1,inplace=True)
Model2['LTeamName'] = Model2['TeamName']
Model2.insert(6,'L_TeamName',Model2['LTeamName'],allow_duplicates=False)
Model2.drop(['LTeamName','TeamName'],axis=1,inplace = True)

#Add Region Data
ModelDF = pd.merge(Model2,Seasons,how = 'left',on = 'Season')
ModelDF['W_Seed'] = ModelDF['W_Seed'].fillna(value='DNC')
ModelDF['L_Seed'] = ModelDF['L_Seed'].fillna(value='DNC')

#Add Detailed season data and reorganize the columns
RSDetailed['Tournament'] = 'Regular Season'
ModelDetailed = pd.merge(ModelDF,RSDetailed, how = 'inner')
Model_Detailed = ModelDetailed[[col for col in ModelDetailed if col not in ['DayZero','RegionW','RegionX','RegionY','RegionZ','Tournament']] 
              + ['DayZero','RegionW','RegionX','RegionY','RegionZ','Tournament']]

#Add Massey Ranks
#Take ranking of the latest day and apply to rest of dataset
MasseyRank = MasseyRank[MasseyRank['RankingDayNum'] == 128].drop(columns=['RankingDayNum'])
#Separate the rankings into columns in order to choose best systems
MasseyRank = MasseyRank.pivot_table('OrdinalRank', ['Season', 'TeamID'], 'SystemName')
#Choose the best rankings
Ratings = MasseyRank[['WOL','RTH','SAG','MOR']]

#Merge the model with ratings systems
ModelRanked = pd.merge(Model_Detailed,Ratings, how='left',left_on = ['WTeamID','Season']
                       ,right_on=['TeamID','Season'])

ModelRanked[['W_WOL','W_RTH','W_SAG','W_MOR']] = ModelRanked[['WOL','RTH','SAG','MOR']]
ModelRanked.drop(['WOL','RTH','SAG','MOR'],axis=1,inplace=True)
ModelRanked = pd.merge(ModelRanked,Ratings, how='left',left_on = ['LTeamID','Season']
                        ,right_on=['TeamID','Season'])
ModelRanked[['L_WOL','L_RTH','L_SAG','L_MOR']] = ModelRanked[['WOL','RTH','SAG','MOR']]
ModelRanked.drop(['WOL','RTH','SAG','MOR'],axis=1,inplace=True)



In 2019, only the 11th and 16th seeds in the W and X regions were decided by play-in games. The winners of those games take the 11th and 16th seeds in the main tournament.

In [5]:
#Add a column for seednums irrespective of region
numlist = ['01','02','03','04','05','06','07','08',
           '09','10','12','13','14','15']

playin_list11 = ['11a','11b']
playin_list16 = ['16a','16b']

for seed in numlist:
        ModelRanked.loc[ModelRanked.W_Seed == f'W{seed}','W_SeedNum'] = seed
        ModelRanked.loc[ModelRanked.W_Seed == f'X{seed}','W_SeedNum'] = seed
        ModelRanked.loc[ModelRanked.W_Seed == f'Y{seed}','W_SeedNum'] = seed
        ModelRanked.loc[ModelRanked.W_Seed == f'Z{seed}','W_SeedNum'] = seed
        
for seed in numlist:
        ModelRanked.loc[ModelRanked.L_Seed == f'W{seed}','L_SeedNum'] = seed
        ModelRanked.loc[ModelRanked.L_Seed == f'X{seed}','L_SeedNum'] = seed
        ModelRanked.loc[ModelRanked.L_Seed == f'Y{seed}','L_SeedNum'] = seed
        ModelRanked.loc[ModelRanked.L_Seed == f'Z{seed}','L_SeedNum'] = seed
        

for seed in playin_list11:
        ModelRanked.loc[ModelRanked.W_Seed == f'W{seed}','W_SeedNum'] = 11
        ModelRanked.loc[ModelRanked.W_Seed == f'X{seed}','W_SeedNum'] = 11
        ModelRanked.loc[ModelRanked.L_Seed == f'W{seed}','L_SeedNum'] = 11
        ModelRanked.loc[ModelRanked.L_Seed == f'X{seed}','L_SeedNum'] = 11
for seed in playin_list16:
        ModelRanked.loc[ModelRanked.W_Seed == f'W{seed}','W_SeedNum'] = 16
        ModelRanked.loc[ModelRanked.W_Seed == f'X{seed}','W_SeedNum'] = 16
        ModelRanked.loc[ModelRanked.L_Seed == f'W{seed}','L_SeedNum'] = 16
        ModelRanked.loc[ModelRanked.L_Seed == f'X{seed}','L_SeedNum'] = 16


#Teams that did not make the tournament get the placeholder seed number 18
ModelRanked['W_SeedNum'] = pd.to_numeric(ModelRanked['W_SeedNum'].fillna(18))
ModelRanked['L_SeedNum'] = pd.to_numeric(ModelRanked['L_SeedNum'].fillna(18))


#Create W_Region and L_Region column to enable one hot encoding
#we want to see the winner based on region
ModelRanked['W_Region'] = ModelRanked['W_Seed'].str.extract('(^[W-Z])',expand=False).fillna('DNC')
ModelRanked['L_Region'] = ModelRanked['L_Seed'].str.extract('(^[W-Z])',expand=False).fillna('DNC')

#Add Conferences
ConferenceDF = pd.merge(TeamConferences,Conferences)

ModelRankedv2 = pd.merge(ModelRanked,ConferenceDF,how='left',left_on=['WTeamID','Season'],right_on=['TeamID','Season'])
ModelRankedv2['W_Conf'] = ModelRankedv2['Description']
ModelRankedv2.drop(['TeamID','ConfAbbrev','Description'],axis=1,inplace=True)
ModelRankedv2 = pd.merge(ModelRankedv2,ConferenceDF,how='left',left_on=['LTeamID','Season'],right_on=['TeamID','Season'])
ModelRankedv2['L_Conf'] = ModelRankedv2['Description']
ModelRankedv2.drop(['TeamID','ConfAbbrev','Description'],axis=1,inplace=True)
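
The seed handling above can also be expressed as one small parsing function. The following is a minimal sketch, assuming seed labels always look like a region letter followed by two digits and an optional play-in suffix (`'W01'`, `'X11a'`). The function name `parse_seed` is hypothetical, and the placeholder values `'DNC'` and `18` mirror the ones used in the cell above.

```python
import pandas as pd


def parse_seed(seed):
    """Split a seed label such as 'X11a' into (region, seed number).

    Teams without a seed (NaN or the 'DNC' placeholder) get region 'DNC'
    and seed number 18, mirroring the placeholders used in the notebook.
    """
    if pd.isna(seed) or seed == 'DNC':
        return 'DNC', 18
    # First character is the region, the next two digits are the seed number;
    # a trailing 'a'/'b' marks a play-in team and is ignored here.
    return seed[0], int(seed[1:3])


seeds = pd.Series(['W01', 'X11a', 'Y15', 'W16b', 'DNC', None])
parsed = pd.DataFrame(seeds.map(parse_seed).tolist(), columns=['Region', 'SeedNum'])
print(parsed)
```

Because the suffix is simply ignored, this version does not need separate loops for the play-in seeds.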


In [6]:
#Add Advanced Stats

#Regular Seasons stats for games that a team won
ModelRankedv2['WPoss'] = ModelRankedv2.apply(lambda row: 0.96*row.WFGA - 0.96*row.WOR + 0.96*row.WTO + 0.96*0.475 * row.WFTA,axis=1)
ModelRankedv2['W_OffRtg'] = ModelRankedv2.apply(lambda row: (row.WScore * 100)/row.WPoss,axis=1 )
ModelRankedv2['W_DefRtg'] = ModelRankedv2.apply(lambda row: (row.LScore * 100)/row.WPoss,axis=1 )
ModelRankedv2['W_EFG'] = ModelRankedv2.apply(lambda row: (row.WFGM + 0.5 * row.WFGM3)/row.WFGA , axis = 1 )
ModelRankedv2['W_TS'] = ModelRankedv2.apply(lambda row: row.WScore / (2*(row.WFGA + (0.44*row.WFTA))), axis = 1)
#Regular Seasons stats for games that a team lost
ModelRankedv2['LPoss'] = ModelRankedv2.apply(lambda row: 0.96*row.LFGA - 0.96*row.LOR + 0.96*row.LTO + 0.96*0.475 * row.LFTA,axis=1)
ModelRankedv2['L_OffRtg'] = ModelRankedv2.apply(lambda row: (row.LScore * 100)/row.LPoss,axis=1 )
ModelRankedv2['L_DefRtg'] = ModelRankedv2.apply(lambda row: (row.WScore * 100)/row.LPoss,axis=1 )
ModelRankedv2['L_EFG'] = ModelRankedv2.apply(lambda row: (row.LFGM + 0.5 * row.LFGM3)/row.LFGA , axis = 1 )
ModelRankedv2['L_TS'] = ModelRankedv2.apply(lambda row: row.LScore / (2*(row.LFGA + (0.44*row.LFTA))), axis = 1)

#Team Impact Estimator
Wie = ModelRankedv2.apply(lambda row: row.WScore + row.WFGM + row.WFTM - row.WFGA - row.WFTA - row.WDR + (0.5*row.WOR)
                         + row.WAst + row.WStl + (0.5 * row.WBlk) - row.WPF - row.WTO,axis=1)
Lie = ModelRankedv2.apply(lambda row: row.LScore + row.LFGM + row.LFTM - row.LFGA - row.LFTA - row.LDR + (0.5*row.LOR)
                         + row.LAst + row.LStl + (0.5 * row.LBlk) - row.LPF - row.LTO,axis=1)
ModelRankedv2['WIE'] = Wie/(Wie+Lie) * 100
ModelRankedv2['LIE'] = Lie/(Wie+Lie) * 100

ModelRankedv2['W_OR_pct'] = ModelRankedv2.apply(lambda row: row.WOR / (row.WOR + row.LDR), axis=1)
ModelRankedv2['W_DR_pct'] = ModelRankedv2.apply(lambda row: row.WDR / (row.WDR + row.LOR), axis=1)
ModelRankedv2['W_REB_pct'] = ModelRankedv2.apply(lambda row: (row.W_OR_pct + row.W_DR_pct) / 2, axis=1)
ModelRankedv2['W_TO_poss'] = ModelRankedv2.apply(lambda row: row.WTO / row.WPoss, axis=1)
ModelRankedv2['W_FT_rate'] = ModelRankedv2.apply(lambda row: row.WFTM / row.WFGA, axis=1)
ModelRankedv2['W_AST_rtio'] = ModelRankedv2.apply(lambda row: row.WAst / (row.WFGA + 0.475*row.WFTA + row.WTO + row.WAst) * 100, axis=1)
ModelRankedv2['W_BLK_pct'] = ModelRankedv2.apply(lambda row: row.WBlk / row.LFGA, axis=1)
ModelRankedv2['W_STL_pct'] = ModelRankedv2.apply(lambda row: row.WStl / row.LPoss, axis=1)
ModelRankedv2['W_NetRtg'] = ModelRankedv2.apply(lambda row: row.W_OffRtg - row.W_DefRtg, axis=1)
ModelRankedv2['L_OR_pct'] = ModelRankedv2.apply(lambda row: row.LOR / (row.LOR + row.WDR), axis=1)
ModelRankedv2['L_DR_pct'] = ModelRankedv2.apply(lambda row: row.LDR / (row.LDR + row.WOR), axis=1)
ModelRankedv2['L_REB_pct'] = ModelRankedv2.apply(lambda row: (row.L_OR_pct + row.L_DR_pct) / 2, axis=1)
ModelRankedv2['L_TO_poss'] = ModelRankedv2.apply(lambda row: row.LTO / row.LPoss, axis=1)
ModelRankedv2['L_FT_rate'] = ModelRankedv2.apply(lambda row: row.LFTM / row.LFGA, axis=1)
ModelRankedv2['L_AST_rtio'] = ModelRankedv2.apply(lambda row: row.LAst / (row.LFGA + 0.475*row.LFTA + row.LTO + row.LAst) * 100, axis=1)
ModelRankedv2['L_BLK_pct'] = ModelRankedv2.apply(lambda row: row.LBlk / row.WFGA, axis=1)
ModelRankedv2['L_STL_pct'] = ModelRankedv2.apply(lambda row: row.LStl / row.WPoss, axis=1)
ModelRankedv2['L_NetRtg'] = ModelRankedv2.apply(lambda row: row.L_OffRtg - row.L_DefRtg, axis=1)

#Division by zero (e.g. when Wie + Lie == 0) yields inf instead of raising an error
#inf values distort the mean calculation
#Remove rows with inf values to prevent NaNs from appearing when calculating season averages
ModelRankedv2.drop(ModelRankedv2[ModelRankedv2['WIE'] == np.inf].index,inplace = True)
ModelRankedv2.drop(ModelRankedv2[ModelRankedv2['LIE'] == np.inf].index,inplace = True)

#One hot encode the winning and losing regions
ModelHot = pd.get_dummies(ModelRankedv2,columns = ['W_Region','L_Region'])
ModelHot.drop(ModelHot[ModelHot['WIE'].isnull() == True].index, inplace = True)
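
The row-wise `apply` calls above compute simple arithmetic, so the same formulas can be written on whole columns. The sketch below does this for a single made-up box score (the numbers are illustrative, not taken from the data) and also shows one way to neutralize `inf` values, as an assumption about how the division-by-zero rows could be handled without dropping them by hand.

```python
import numpy as np
import pandas as pd

# One made-up box score row using the column names from the detailed results file.
games = pd.DataFrame({
    'WScore': [80], 'LScore': [70],
    'WFGM': [30], 'WFGA': [60], 'WFGM3': [8],
    'WFTA': [20], 'WOR': [10], 'WTO': [12],
})

# Possession estimate used in the notebook: 0.96 * (FGA - OR + TO + 0.475 * FTA)
games['WPoss'] = 0.96 * (games['WFGA'] - games['WOR'] + games['WTO'] + 0.475 * games['WFTA'])
games['W_OffRtg'] = 100 * games['WScore'] / games['WPoss']
games['W_DefRtg'] = 100 * games['LScore'] / games['WPoss']
games['W_EFG'] = (games['WFGM'] + 0.5 * games['WFGM3']) / games['WFGA']
games['W_TS'] = games['WScore'] / (2 * (games['WFGA'] + 0.44 * games['WFTA']))

# Division by zero yields inf rather than raising; turning inf into NaN lets
# mean() skip those rows instead of returning inf.
games = games.replace([np.inf, -np.inf], np.nan)
print(games[['WPoss', 'W_OffRtg', 'W_EFG', 'W_TS']].round(3))
```

Column-wise arithmetic gives the same values as the `apply` version and is usually much faster on a full season of games.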

In [7]:
#Will not include ratings of POM to Mass due to NaN values

W_Stats = ['WScore','WPointDiff','WFGM','WFGA','WFGM3','WFGA3','WFTM','WFTA','WOR','WDR','WAst','WTO',
          'WStl','WBlk','WPF','W_WOL', 'W_RTH', 'W_SAG', 'W_MOR','WPoss','W_OffRtg','W_DefRtg','W_EFG','W_TS','WIE',
          'W_OR_pct','W_DR_pct','W_REB_pct','W_TO_poss','W_FT_rate','W_AST_rtio','W_BLK_pct','W_STL_pct','W_NetRtg']

#we need to do neg point diff for losing team
L_Stats = ['LScore','LPointDiff','LFGM','LFGA','LFGM3','LFGA3','LFTM','LFTA','LOR','LDR','LAst','LTO',
          'LStl','LBlk','LPF','LPoss','L_OffRtg','L_DefRtg','L_EFG','L_TS','LIE',
          'L_OR_pct','L_DR_pct','L_REB_pct','L_TO_poss','L_FT_rate','L_AST_rtio','L_BLK_pct','L_STL_pct','L_NetRtg']

Encodings = ['W_Region_DNC','W_Region_W', 'W_Region_X', 'W_Region_Y', 
             'W_Region_Z', 'L_Region_DNC','L_Region_W', 'L_Region_X', 'L_Region_Y', 'L_Region_Z']

In [8]:
#new dataframe for averages
ModelAvg = pd.DataFrame()

for i in W_Stats:
    ModelAvg[f'{i}']= ModelHot[f'{i}'].groupby([ModelHot['Season'],ModelHot['WTeamID']]).mean()
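#NOTE: the L stats are also grouped by WTeamID, so they average the opponents' numbers in games this team won.
#Group by ModelHot['LTeamID'] instead to average each team's numbers in its own losses.
#The outputs shown in this notebook were produced with the line as written.
for i in L_Stats:
    ModelAvg[f'{i}']= ModelHot[f'{i}'].groupby([ModelHot['Season'],ModelHot['WTeamID']]).mean()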

#Games won in season vs Games lost
ModelAvg['NumWins'] = ModelHot['WTeamID'].groupby([ModelHot['Season'],ModelHot['WTeamID']]).count()
ModelAvg['NumLoss'] = ModelHot['LTeamID'].groupby([ModelHot['Season'],ModelHot['LTeamID']]).count()
ModelAvg['NumLoss'] = ModelAvg['NumLoss'].fillna(0)

In [9]:
ModelAvg['WinPct'] = ModelAvg.NumWins / (ModelAvg.NumWins + ModelAvg.NumLoss)
ModelAvg.rename(columns={'W_WOL': 'WOL', 'W_RTH': 'RTH','W_SAG': 'SAG', 'W_MOR': 'MOR'}, inplace=True)

#let's do a weighted average of the season stats for games won and lost
Avg_Stats = ['Score','PointDiff','FGM','FGA','FGM3','FGA3','FTM','FTA','OR','DR','Ast','TO',
          'Stl','Blk','PF','Poss','_OffRtg','_DefRtg','_EFG','_TS','IE',
          '_OR_pct','_DR_pct','_REB_pct','_TO_poss','_FT_rate','_AST_rtio','_BLK_pct','_STL_pct','_NetRtg']
for i in Avg_Stats:
    ModelAvg[f'{i}'] = (ModelAvg[f'W{i}'] * ModelAvg['WinPct']) + (ModelAvg[f'L{i}'] * (1-ModelAvg['WinPct']))

ModelAvg.reset_index(inplace=True)
ModelAvg.rename(columns={'WTeamID': 'TeamID'}, inplace=True)
ModelAvg = pd.merge(ModelAvg,Teams,how='left')
ModelAvg = pd.merge(ModelAvg,ConferenceDF,how='left')
ModelAvg.drop(columns=['FirstD1Season','LastD1Season','ConfAbbrev'],axis=1,inplace=True)

#Reorganize the dataframe to keep only these columns of average stats
SeasonStatsCol = ['Season','TeamID','TeamName','Description','WOL','RTH','SAG','MOR','NumWins', 'NumLoss', 
                  'WinPct', 'Score', 'PointDiff', 'FGM','FGA', 'FGM3', 'FGA3', 'FTM', 
                  'FTA', 'OR', 'DR', 'Ast', 'TO', 'Stl','Blk', 'PF', 'Poss', '_OffRtg', 
                  '_DefRtg','_NetRtg', '_EFG', '_TS', 'IE','_OR_pct', '_DR_pct', '_REB_pct', 
                  '_TO_poss', '_FT_rate', '_AST_rtio','_BLK_pct', '_STL_pct']

SeasonStats = pd.DataFrame()
for i in SeasonStatsCol:
    SeasonStats[f'{i}'] = ModelAvg[f'{i}'] 
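
The weighting by `WinPct` deserves a short check. A minimal sketch with made-up scores for one team shows that weighting the average over wins by the win percentage, and the average over losses by its complement, reproduces the plain per-game average. This identity holds when the loss-side average is taken over the games that team itself lost.

```python
import pandas as pd

# Made-up scores for one team: three wins and one loss.
win_scores = pd.Series([80, 90, 70])
loss_scores = pd.Series([60])

num_wins, num_losses = len(win_scores), len(loss_scores)
win_pct = num_wins / (num_wins + num_losses)

# Weighted combination of the two averages, as in the cell above
weighted = win_scores.mean() * win_pct + loss_scores.mean() * (1 - win_pct)

# Plain average over every game the team played
overall = pd.concat([win_scores, loss_scores]).mean()

print(weighted, overall)
```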

In [10]:
#Assign the seed to each competing team in the tournament
#We do this to build our bracket
SeasonStats = pd.merge(SeasonStats,NCAASeeds)
SeasonStats.head()

Unnamed: 0,Season,TeamID,TeamName,Description,WOL,RTH,SAG,MOR,NumWins,NumLoss,WinPct,Score,PointDiff,FGM,FGA,FGM3,FGA3,FTM,FTA,OR,DR,Ast,TO,Stl,Blk,PF,Poss,_OffRtg,_DefRtg,_NetRtg,_EFG,_TS,IE,_OR_pct,_DR_pct,_REB_pct,_TO_poss,_FT_rate,_AST_rtio,_BLK_pct,_STL_pct,Seed
0,2019,1101,Abilene Chr,Southland Conference,116.0,156.0,171.0,172.0,22,6,0.785714,71.386364,6.363636,25.038961,54.574675,6.970779,18.305195,14.337662,19.915584,9.142857,23.876623,14.25,11.659091,7.282468,2.672078,18.62013,63.888779,111.979297,101.907135,10.072161,0.523101,0.564102,64.007055,0.275117,0.728724,0.50192,0.180511,0.26703,15.756479,0.051181,0.111873,Y15
1,2019,1113,Arizona St,Pacific-12 Conference,50.0,53.0,53.0,57.0,22,10,0.6875,78.153409,4.397727,26.622159,58.434659,7.713068,22.15625,17.196023,25.144886,11.227273,27.008523,14.144886,13.982955,6.272727,3.167614,20.801136,70.208795,111.537771,105.176073,6.361699,0.524836,0.564468,37.589592,0.309978,0.715854,0.512916,0.198601,0.297952,14.358986,0.054273,0.089556,X11a
2,2019,1120,Auburn,Southeastern Conference,25.0,19.0,11.0,11.0,25,9,0.735294,77.268235,8.056471,26.443529,58.407059,10.892941,28.563529,13.488235,18.705882,11.515294,22.32,14.307059,13.214118,8.817647,4.763529,17.538824,66.231529,116.583159,105.015542,11.567617,0.544964,0.579968,59.011816,0.326601,0.67952,0.50306,0.198859,0.237386,14.841612,0.089277,0.133729,Y05
3,2019,1124,Baylor,Big 12 Conference,46.0,40.0,38.0,43.0,19,12,0.612903,70.064516,2.971138,24.750424,56.55348,8.151104,22.480475,12.412564,18.11545,11.83871,23.456706,13.983022,12.337861,5.764007,4.865874,17.448217,63.031171,111.089852,106.350541,4.739311,0.508811,0.541998,78.423283,0.345284,0.68023,0.512757,0.195424,0.224075,15.165528,0.085629,0.091008,X09
4,2019,1125,Belmont,Ohio Valley Conference,34.0,71.0,59.0,50.0,25,5,0.833333,86.573333,11.146667,31.513333,62.713333,10.273333,26.26,13.273333,18.42,8.74,29.32,19.06,11.966667,6.56,3.8,15.546667,71.70192,121.115749,105.254693,15.861056,0.586635,0.612721,87.085636,0.253906,0.765089,0.509498,0.165775,0.219016,18.475214,0.059457,0.091077,W11a


In [11]:
SeasonStats.to_csv('SeasonStats_2019.csv',encoding = 'utf-8', index = False)

In [10]:
#Add One Hot Encoding to SeasonStats
ModelHot2 = ModelHot[['W_SeedNum', 'W_Region_DNC', 'W_Region_W', 'W_Region_X', 'W_Region_Y', 'W_Region_Z','WTeamID']].copy()

ModelHot2.rename(columns = {'W_SeedNum':'SeedNum','W_Region_DNC':'_Region_DNC', 'W_Region_W':'_Region_W', 'W_Region_X':'_Region_X', 
                        'W_Region_Y':'_Region_Y', 'W_Region_Z':'_Region_Z','WTeamID':'TeamID'},inplace=True)

In [11]:
#Add SeedNum and Region Information
SeasonStats = pd.merge(SeasonStats,ModelHot2,how='left',on='TeamID')
SeasonStats = SeasonStats.drop_duplicates()
SeasonStats = SeasonStats.reset_index()
SeasonStats.drop(['index'],axis=1,inplace=True)

In [13]:
#Finalized SeasonStats

#Make sure that there are 68 rows, because 68 teams compete in the tournament
#8 teams meet in the 4 play-in games, 60 go straight into the round of 64
print(SeasonStats.shape)
SeasonStats.head()

(68, 48)


Unnamed: 0,Season,TeamID,TeamName,Description,WOL,RTH,SAG,MOR,NumWins,NumLoss,WinPct,Score,PointDiff,FGM,FGA,FGM3,FGA3,FTM,FTA,OR,DR,Ast,TO,Stl,Blk,PF,Poss,_OffRtg,_DefRtg,_NetRtg,_EFG,_TS,IE,_OR_pct,_DR_pct,_REB_pct,_TO_poss,_FT_rate,_AST_rtio,_BLK_pct,_STL_pct,Seed,SeedNum,_Region_DNC,_Region_W,_Region_X,_Region_Y,_Region_Z
0,2019,1101,Abilene Chr,Southland Conference,116.0,156.0,171.0,172.0,22,6,0.785714,71.386364,6.363636,25.038961,54.574675,6.970779,18.305195,14.337662,19.915584,9.142857,23.876623,14.25,11.659091,7.282468,2.672078,18.62013,63.888779,111.979297,101.907135,10.072161,0.523101,0.564102,64.007055,0.275117,0.728724,0.50192,0.180511,0.26703,15.756479,0.051181,0.111873,Y15,15,0,0,0,1,0
1,2019,1113,Arizona St,Pacific-12 Conference,50.0,53.0,53.0,57.0,22,10,0.6875,78.153409,4.397727,26.622159,58.434659,7.713068,22.15625,17.196023,25.144886,11.227273,27.008523,14.144886,13.982955,6.272727,3.167614,20.801136,70.208795,111.537771,105.176073,6.361699,0.524836,0.564468,37.589592,0.309978,0.715854,0.512916,0.198601,0.297952,14.358986,0.054273,0.089556,X11a,11,0,0,1,0,0
2,2019,1120,Auburn,Southeastern Conference,25.0,19.0,11.0,11.0,25,9,0.735294,77.268235,8.056471,26.443529,58.407059,10.892941,28.563529,13.488235,18.705882,11.515294,22.32,14.307059,13.214118,8.817647,4.763529,17.538824,66.231529,116.583159,105.015542,11.567617,0.544964,0.579968,59.011816,0.326601,0.67952,0.50306,0.198859,0.237386,14.841612,0.089277,0.133729,Y05,5,0,0,0,1,0
3,2019,1124,Baylor,Big 12 Conference,46.0,40.0,38.0,43.0,19,12,0.612903,70.064516,2.971138,24.750424,56.55348,8.151104,22.480475,12.412564,18.11545,11.83871,23.456706,13.983022,12.337861,5.764007,4.865874,17.448217,63.031171,111.089852,106.350541,4.739311,0.508811,0.541998,78.423283,0.345284,0.68023,0.512757,0.195424,0.224075,15.165528,0.085629,0.091008,X09,9,0,0,1,0,0
4,2019,1125,Belmont,Ohio Valley Conference,34.0,71.0,59.0,50.0,25,5,0.833333,86.573333,11.146667,31.513333,62.713333,10.273333,26.26,13.273333,18.42,8.74,29.32,19.06,11.966667,6.56,3.8,15.546667,71.70192,121.115749,105.254693,15.861056,0.586635,0.612721,87.085636,0.253906,0.765089,0.509498,0.165775,0.219016,18.475214,0.059457,0.091077,W11a,11,0,1,0,0,0


### Tournament Format

The main issue in predicting the tournament was reseeding the teams for each round. Each win changes the winning team's seed label for the next round (e.g., the winner of slot W16 is then known as seed W16, and the winner of slot R1W1 is known as seed R1W1).

To get past this issue, I redefined the seeds by creating a new dataframe for each round of the tournament. The Slot column is the key that identifies each matchup, and it also serves as the seed identifier in the following rounds.

In [14]:
NCAATourneySlots = pd.read_csv('DataFiles - Stage 2/NCAATourneySlots.csv')
NCAATourneySlots = NCAATourneySlots[NCAATourneySlots['Season'] == 2019]
NCAATourneySlots.head()

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed
2184,2019,W11,W11a,W11b
2185,2019,W16,W16a,W16b
2186,2019,X11,X11a,X11b
2187,2019,X16,X16a,X16b
2188,2019,R1W1,W01,W16


In [15]:
slot_num_R1 = [1,2,3,4,5,6,7,8]
slot_num_R2 = [1,2,3,4]
slot_num_R3 = [1,2]

for seed in slot_num_R1:
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R1W{seed}','Round'] = 1
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R1X{seed}','Round'] = 1
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R1Y{seed}','Round'] = 1
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R1Z{seed}','Round'] = 1
    
for seed in slot_num_R2:
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R2W{seed}','Round'] = 2
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R2X{seed}','Round'] = 2
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R2Y{seed}','Round'] = 2
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R2Z{seed}','Round'] = 2

for seed in slot_num_R3:
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R3W{seed}','Round'] = 3
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R3X{seed}','Round'] = 3
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R3Y{seed}','Round'] = 3
    NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R3Z{seed}','Round'] = 3

NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R4W1','Round'] = 4
NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R4X1','Round'] = 4
NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R4Y1','Round'] = 4
NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R4Z1','Round'] = 4

NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R5WX','Round'] = 5
NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R5YZ','Round'] = 5

NCAATourneySlots.loc[NCAATourneySlots.Slot == f'R6CH','Round'] = 6

NCAATourneySlots.loc[NCAATourneySlots.Slot == f'W11','Round'] = 0
NCAATourneySlots.loc[NCAATourneySlots.Slot == f'W16','Round'] = 0
NCAATourneySlots.loc[NCAATourneySlots.Slot == f'X11','Round'] = 0
NCAATourneySlots.loc[NCAATourneySlots.Slot == f'X16','Round'] = 0
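
The round assignment above can be derived from the slot name itself. The sketch below is one possible shortcut, assuming every non-play-in slot starts with `R` followed by a single round digit (as in `R1W1` or `R6CH`), and treating everything else as a play-in slot with round 0. The small `slots` frame is made up for the demonstration.

```python
import pandas as pd

slots = pd.DataFrame({'Slot': ['W11', 'X16', 'R1W1', 'R2Z4', 'R4Y1', 'R5WX', 'R6CH']})

# Slots of the form 'R<digit>...' carry their round number; anything else
# (e.g. 'W11', 'X16') is a play-in slot, labelled round 0 as in the notebook.
round_num = slots['Slot'].str.extract(r'^R(\d)', expand=False)
slots['Round'] = pd.to_numeric(round_num).fillna(0).astype(int)

print(slots)
```

This also yields an integer `Round` column rather than the float values (`0.0`, `1.0`, ...) produced by assigning into a new column with `.loc`.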

I instantiated a new dataframe for each round.

In [197]:
#Divide the tournament data into Rounds and attach teams with their season stats
PlayIn = NCAATourneySlots[NCAATourneySlots['Round'] == 0]
Round1 = NCAATourneySlots[NCAATourneySlots['Round'] == 1]
Round2 = NCAATourneySlots[NCAATourneySlots['Round'] == 2]
Round3 = NCAATourneySlots[NCAATourneySlots['Round'] == 3]
Round4 = NCAATourneySlots[NCAATourneySlots['Round'] == 4]
Round5 = NCAATourneySlots[NCAATourneySlots['Round'] == 5]
Round6 = NCAATourneySlots[NCAATourneySlots['Round'] == 6]


The cells below format the data so that we can compute the difference in season averages between the two teams in each matchup.

In [17]:
#Copy so that later in-place edits to TeamSeeds do not act on a slice of SeasonStats
TeamSeeds = SeasonStats[['TeamName','Seed']].copy()

In [79]:
PlayIn_Matchup = pd.merge(PlayIn,TeamSeeds,left_on = ['StrongSeed'],right_on=['Seed'])
PlayIn_Matchup.drop(['Seed'],axis=1,inplace=True)
PlayIn_Matchup = pd.merge(PlayIn_Matchup,TeamSeeds,left_on = ['WeakSeed'],right_on=['Seed'])
PlayIn_Matchup.drop(['Seed'],axis=1,inplace=True)
PlayIn_Matchup.rename(columns = {'TeamName_x':'TeamName_1','TeamName_y':'TeamName_2'},inplace=True)

In [80]:
#Matchups for the play-in tournament
PlayIn_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2
0,2019,W11,W11a,W11b,0.0,Belmont,Temple
1,2019,W16,W16a,W16b,0.0,N Dakota St,NC Central
2,2019,X11,X11a,X11b,0.0,Arizona St,St John's
3,2019,X16,X16a,X16b,0.0,F Dickinson,Prairie View


The code below forms the baseline for computing the differences in season averages, seeds, and ranks between the competing teams in each slot.

In [24]:
rename_list = ['SeedNum','_Region_DNC', '_Region_W', '_Region_X', '_Region_Y', '_Region_Z',
                'WOL','RTH','SAG','MOR','NumWins', 'NumLoss', 
                  'WinPct', 'Score', 'PointDiff', 'FGM','FGA', 'FGM3', 'FGA3', 'FTM', 
                  'FTA', 'OR', 'DR', 'Ast', 'TO', 'Stl','Blk', 'PF', 'Poss', '_OffRtg', 
                  '_DefRtg','_NetRtg', '_EFG', '_TS', 'IE','_OR_pct', '_DR_pct', '_REB_pct', 
                  '_TO_poss', '_FT_rate', '_AST_rtio','_BLK_pct', '_STL_pct']

#Create a model that accounts for the differences in the values of each team
Stats1 = ['1SeedNum','1_Region_DNC', '1_Region_W', '1_Region_X', '1_Region_Y', '1_Region_Z',
       '1WOL', '1RTH', '1SAG', '1MOR', '1NumWins', '1NumLoss', '1WinPct',
       '1Score', '1PointDiff', '1FGM', '1FGA', '1FGM3', '1FGA3', '1FTM',
       '1FTA', '1OR', '1DR', '1Ast', '1TO', '1Stl', '1Blk', '1PF', '1Poss',
       '1_OffRtg', '1_DefRtg', '1_NetRtg', '1_EFG', '1_TS', '1IE', '1_OR_pct',
       '1_DR_pct', '1_REB_pct', '1_TO_poss', '1_FT_rate', '1_AST_rtio',
       '1_BLK_pct', '1_STL_pct']
Stats2 = ['2SeedNum','2_Region_DNC', '2_Region_W', '2_Region_X', '2_Region_Y', '2_Region_Z',
       '2WOL', '2RTH', '2SAG', '2MOR', '2NumWins', '2NumLoss', '2WinPct',
       '2Score', '2PointDiff', '2FGM', '2FGA', '2FGM3', '2FGA3', '2FTM',
       '2FTA', '2OR', '2DR', '2Ast', '2TO', '2Stl', '2Blk', '2PF', '2Poss',
       '2_OffRtg', '2_DefRtg', '2_NetRtg', '2_EFG', '2_TS', '2IE', '2_OR_pct',
       '2_DR_pct', '2_REB_pct', '2_TO_poss', '2_FT_rate', '2_AST_rtio',
       '2_BLK_pct', '2_STL_pct']
ColumnName = ['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 'Region_Z',
       'WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM',
       'FTA', 'OR', 'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss',
       '_OffRtg', '_DefRtg', '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct',
       '_DR_pct', '_REB_pct', '_TO_poss', '_FT_rate', '_AST_rtio',
       '_BLK_pct', '_STL_pct']

In [29]:
#merge the strong seed
PlayIn = pd.merge(PlayIn,SeasonStats,left_on = ['Season','StrongSeed'],right_on=['Season','Seed'])
#the merged name column is 'TeamName' (TeamName_1/TeamName_2 only exist in PlayIn_Matchup)
PlayIn.drop(['Seed','TeamID','TeamName','Description'],axis=1,inplace=True)

#prefix team 1's stats with '1'
PlayIn.rename(columns = {i: f'1{i}' for i in rename_list},inplace=True)

#merge the weak seed
PlayIn = pd.merge(PlayIn,SeasonStats,left_on = ['Season','WeakSeed'],right_on=['Season','Seed'])
PlayIn.drop(['Seed','TeamID','TeamName','Description'],axis=1,inplace=True)

#prefix team 2's stats with '2'
PlayIn.rename(columns = {i: f'2{i}' for i in rename_list},inplace=True)

#Create new dataframe and loop for new columns
PlayIn_Diff = pd.DataFrame()

for stat_1,stat_2,Col in zip(Stats1,Stats2,ColumnName):
    PlayIn_Diff[Col] = PlayIn[stat_1] - PlayIn[stat_2]

PlayIn_Diff = PlayIn_Diff[['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 
        'Region_Z','WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR',
       'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss', '_OffRtg', '_DefRtg',
       '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct', '_DR_pct', '_REB_pct',
       '_TO_poss', '_FT_rate', '_AST_rtio', '_BLK_pct', '_STL_pct']]

In [30]:
PlayIn_Diff

Unnamed: 0,SeedNum,Region_DNC,RegionW,Region_X,Region_Y,Region_Z,WOL,RTH,SAG,MOR,NumWins,NumLoss,WinPct,Score,PointDiff,FGM,FGA,FGM3,FGA3,FTM,FTA,OR,DR,Ast,TO,Stl,Blk,PF,Poss,_OffRtg,_DefRtg,_NetRtg,_EFG,_TS,IE,_OR_pct,_DR_pct,_REB_pct,_TO_poss,_FT_rate,_AST_rtio,_BLK_pct,_STL_pct
0,0,0,0,0,0,0,2.0,23.0,-6.0,-10.0,2,-4,0.114583,10.49317,7.551558,4.372029,2.438877,2.713551,3.678478,-0.964438,-1.156087,-0.875489,3.584946,4.132011,-0.162409,-1.373424,1.28913,-1.197899,2.498703,10.762131,0.232868,10.529263,0.071798,0.058427,58.367851,-0.010565,0.0432,0.016318,-0.007961,-0.022912,3.111626,0.017154,-0.022611
1,0,0,0,0,0,0,-105.0,-114.0,-105.0,-53.0,1,0,0.016129,2.418952,0.300403,0.019624,-0.126075,2.069892,2.806452,0.309812,-0.648118,-2.865054,2.053898,-3.081317,-2.752957,-0.364516,-0.696909,-0.575806,-0.308961,4.800998,4.107504,0.693494,0.023255,0.025486,-11.787559,-0.092032,0.090128,-0.000952,-0.043257,0.012925,-2.477824,-0.013171,-0.006037
2,0,0,0,0,0,0,-2.0,4.0,-8.0,-15.0,1,-2,0.051136,-0.275162,1.112013,-0.867018,-1.803436,-1.334551,-2.804789,2.793425,4.9977,2.872294,0.069129,-0.335633,1.324946,-1.532468,0.410038,2.762175,-0.937802,1.196478,-0.479229,1.675707,-0.007139,-0.004766,-5.599101,0.076082,-0.037638,0.019222,0.021083,0.053986,-0.43361,0.007981,-0.018515
3,0,0,0,0,0,0,30.0,-14.0,-13.0,-88.0,-3,1,-0.055718,1.380883,-1.404902,1.59561,0.394638,2.009961,1.170391,-3.8203,-8.472699,-1.05239,-1.357166,0.472234,-0.940371,-0.950542,0.881604,-5.392892,-3.377161,7.717666,9.530805,-1.813139,0.047125,0.042238,20.866076,0.006925,0.00294,0.004933,-0.003298,-0.063177,1.116473,0.012251,-0.009163
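
The differencing step can be illustrated on a much smaller table. The sketch below assumes season averages indexed by seed label and subtracts team 2 (the weak seed) from team 1 (the strong seed), which is the same orientation as `PlayIn_Diff`. The values are made up for illustration and the two-column `stats` frame stands in for the full `SeasonStats` table.

```python
import pandas as pd

# Made-up season averages keyed by seed label (values are illustrative only).
stats = pd.DataFrame({
    'Seed': ['W11a', 'W11b'],
    'WinPct': [0.80, 0.70],
    'Score': [86.0, 75.0],
}).set_index('Seed')

matchups = pd.DataFrame({'Slot': ['W11'], 'StrongSeed': ['W11a'], 'WeakSeed': ['W11b']})

# Look up each side's row, then subtract team 2 (weak seed) from team 1 (strong seed).
team1 = stats.loc[matchups['StrongSeed'].tolist()].reset_index(drop=True)
team2 = stats.loc[matchups['WeakSeed'].tolist()].reset_index(drop=True)
diff = team1 - team2
diff.insert(0, 'Slot', matchups['Slot'])

print(diff)
```

A positive value means team 1 has the higher season average for that statistic. For ranking columns such as `SAG`, lower is better, so a negative difference favors team 1.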


### Predictions

In [31]:
import pickle
import warnings

# sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23;
# fall back to it only on older installs
try:
    import joblib
except ImportError:
    from sklearn.externals import joblib

import numpy as np
import pandas as pd

# Machine Learning
from sklearn import (datasets, feature_selection, linear_model, metrics,
                     model_selection, pipeline)
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import (accuracy_score, classification_report,
                             mean_squared_error, precision_score, r2_score,
                             recall_score)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

pd.set_option('display.max_columns',99)
warnings.filterwarnings('ignore')

Because the logistic regression model had the highest accuracy, precision, and recall scores, I used the optimized version of that model to make my predictions.

In [53]:
# load the model from disk
logit_model = joblib.load('fitted_search.pkl')

In [54]:
#Determine winning team per matchup
logit_model.predict(PlayIn_Diff)

array([2, 1, 1, 1])

Each row of the predicted class probabilities gives the probability that team 1 wins (first column) and that team 2 wins (second column), given the differences between their season averages.

In [56]:
#Determine probability of falling into either class 1 or class 2
logit_model.predict_proba(PlayIn_Diff)

array([[0.41835779, 0.58164221],
       [0.82388219, 0.17611781],
       [0.50000937, 0.49999063],
       [0.51844965, 0.48155035]])
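
To make the link between the two outputs explicit, here is a small sketch that recovers the predicted classes from the probabilities printed above. It assumes the model's `classes_` attribute is `[1, 2]`, which matches the labels returned by `predict` but is not shown in the notebook.

```python
import numpy as np

# Class labels in the order scikit-learn reports them via `classes_`
# (assumed here to be [1, 2], matching the predictions printed above).
classes = np.array([1, 2])

# Probabilities copied from the play-in output above.
proba = np.array([
    [0.41835779, 0.58164221],
    [0.82388219, 0.17611781],
    [0.50000937, 0.49999063],
    [0.51844965, 0.48155035],
])

# Each row sums to 1, and the predicted label is the more probable class.
row_sums = proba.sum(axis=1)
predicted = classes[proba.argmax(axis=1)]
print(predicted)  # [2 1 1 1]
```

Note how close the third and fourth games are to a coin flip: the predicted class alone hides that, which is why the probabilities are kept alongside the winners.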

In [81]:
win_prob_playin = logit_model.predict_proba(PlayIn_Diff)
winners_playin = logit_model.predict(PlayIn_Diff)
PlayIn_Matchup['Winners'] = winners_playin
win_prob_playin = pd.DataFrame(win_prob_playin,columns = ['Team_1_Probability','Team_2_Probability'])
PlayIn_Matchup = pd.concat((PlayIn_Matchup,win_prob_playin),axis=1)

In [82]:
PlayIn_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2,Winners,Team_1_Probability,Team_2_Probability
0,2019,W11,W11a,W11b,0.0,Belmont,Temple,2,0.418358,0.581642
1,2019,W16,W16a,W16b,0.0,N Dakota St,NC Central,1,0.823882,0.176118
2,2019,X11,X11a,X11b,0.0,Arizona St,St John's,1,0.500009,0.499991
3,2019,X16,X16a,X16b,0.0,F Dickinson,Prairie View,1,0.51845,0.48155


The model predicted the following play-in winners, which take the 11th and 16th seeds in regions W and X:
- W11: Temple
- W16: N Dakota St
- X11: Arizona St
- X16: F Dickinson

The same logic is applied to each subsequent round of the tournament, ending with the championship game. In each table, Team 1 is listed on the left and Team 2 on the right.
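
Since the predict-and-attach step repeats in every round, it could be wrapped in one helper. The sketch below is only an illustration: `attach_predictions` and `StubModel` are hypothetical names, and the stub merely mimics the `predict` / `predict_proba` interface of the fitted model.

```python
import numpy as np
import pandas as pd


def attach_predictions(matchup, diff, model):
    """Return a copy of `matchup` with the winner and both win probabilities appended."""
    # Reset the index so rows line up by position with the model output.
    out = matchup.reset_index(drop=True).copy()
    proba = model.predict_proba(diff)
    out["Winners"] = model.predict(diff)
    out["Team_1_Probability"] = proba[:, 0]
    out["Team_2_Probability"] = proba[:, 1]
    return out


class StubModel:
    """Hypothetical stand-in for the fitted model: favors team 1 when the SeedNum difference is negative."""

    classes_ = np.array([1, 2])

    def predict_proba(self, X):
        p1 = 1.0 / (1.0 + np.exp(np.asarray(X["SeedNum"], dtype=float)))
        return np.column_stack([p1, 1.0 - p1])

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


matchup = pd.DataFrame({"TeamName_1": ["A", "C"], "TeamName_2": ["B", "D"]})
diff = pd.DataFrame({"SeedNum": [-3.0, 2.0]})
result = attach_predictions(matchup, diff, StubModel())
```

Assigning columns directly, rather than concatenating a second dataframe, avoids any dependence on the two frames sharing the same index.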

## Determine the Matchups for Round 1 of the Tournament

In [86]:
drop_list1 = ['W11a','W16b','X11b','X16b']
for seed in drop_list1:
    TeamSeeds.drop(TeamSeeds[TeamSeeds['Seed'] == seed].index,inplace = True)

In [95]:
#Dataframe with new seeding

new_seeds = ['W11b','W16a','X11a','X16a']
rename_seeds = ['W11','W16','X11','X16']

for new,rename in zip(new_seeds,rename_seeds):
    TeamSeeds.replace(to_replace= new, value = rename,inplace=True)
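
The two steps above (drop the play-in losers, then give the winners the plain seed label) can also be written without loops. This is a minimal sketch on a toy seed table; the dataframe `seeds` and the list `play_in_losers` are hypothetical stand-ins for `TeamSeeds` and `drop_list1`.

```python
import pandas as pd

# Toy seed table (hypothetical subset of the play-in seeds).
seeds = pd.DataFrame({
    "Seed": ["W01", "W11a", "W11b", "W16a", "W16b"],
    "TeamName": ["Duke", "Belmont", "Temple", "N Dakota St", "NC Central"],
})

play_in_losers = ["W11a", "W16b"]

# Keep every row whose seed is not a play-in loser ...
remaining = seeds[~seeds["Seed"].isin(play_in_losers)].copy()
# ... then strip the trailing a/b so the winners take the plain seed label.
remaining["Seed"] = remaining["Seed"].str.replace(r"[ab]$", "", regex=True)
print(remaining["Seed"].tolist())  # ['W01', 'W11', 'W16']
```

Restricting the replacement to the `Seed` column also avoids touching any other column that happens to contain the same string.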

In [98]:
Round1_Matchup = pd.merge(Round1,TeamSeeds,left_on = ['StrongSeed'],right_on=['Seed'])
Round1_Matchup.drop(['Seed'],axis=1,inplace=True)
Round1_Matchup = pd.merge(Round1_Matchup,TeamSeeds,left_on = ['WeakSeed'],right_on=['Seed'])
Round1_Matchup.drop(['Seed'],axis=1,inplace=True)
Round1_Matchup.rename(columns = {'TeamName_x':'TeamName_1','TeamName_y':'TeamName_2'},inplace=True)

In [100]:
Round1_Matchup.head()

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2
0,2019,R1W1,W01,W16,1.0,Duke,N Dakota St
1,2019,R1W2,W02,W15,1.0,Michigan St,Bradley
2,2019,R1W3,W03,W14,1.0,LSU,Yale
3,2019,R1W4,W04,W13,1.0,Virginia Tech,St Louis
4,2019,R1W5,W05,W12,1.0,Mississippi St,Liberty
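
Because the winners are later assigned to this table by row position, the matchup rows need to stay in a known order. One way to attach team names while keeping the slot order unchanged is a seed-to-name lookup, sketched below on toy data (the frames `slots` and `seeds` are hypothetical stand-ins for `Round1` and `TeamSeeds`).

```python
import pandas as pd

slots = pd.DataFrame({
    "Slot": ["R1W1", "R1W2"],
    "StrongSeed": ["W01", "W02"],
    "WeakSeed": ["W16", "W15"],
})
seeds = pd.DataFrame({
    "Seed": ["W15", "W01", "W16", "W02"],
    "TeamName": ["Bradley", "Duke", "N Dakota St", "Michigan St"],
})

# Seed -> team name lookup; verify_integrity raises if a seed appears twice.
seed_to_name = seeds.set_index("Seed", verify_integrity=True)["TeamName"]

# map() fills one value per existing row, so the slot order is untouched.
matchup = slots.copy()
matchup["TeamName_1"] = matchup["StrongSeed"].map(seed_to_name)
matchup["TeamName_2"] = matchup["WeakSeed"].map(seed_to_name)
```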


In [128]:
#We must rename the seeds in SeasonStats to align with our new seedings
SeasonStatsRound1 = SeasonStats.copy()

for new,rename in zip(new_seeds,rename_seeds):
    SeasonStatsRound1.replace(to_replace= new, value = rename,inplace=True)

In [131]:
#merge the strong seed
Round1 = pd.merge(Round1,SeasonStatsRound1,left_on = ['Season','StrongSeed'],right_on=['Season','Seed'])
Round1.drop(['Seed','TeamID','TeamName','Description'],axis=1,inplace=True)

for i in rename_list:
    Round1.rename(columns = {f'{i}':f'1{i}'},inplace=True)

#merge the weak seed
Round1 = pd.merge(Round1,SeasonStatsRound1,left_on = ['Season','WeakSeed'],right_on=['Season','Seed'])
Round1.drop(['Seed','TeamID','TeamName','Description'],axis=1,inplace=True)

for i in rename_list:
    Round1.rename(columns = {f'{i}':f'2{i}'},inplace=True)

#Create new dataframe and loop for new columns
Round1_Diff = pd.DataFrame()

for stat_1,stat_2,Col in zip(Stats1,Stats2,ColumnName):
    Round1_Diff[Col] = Round1[stat_1] - Round1[stat_2]

Round1_Diff = Round1_Diff[['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 
        'Region_Z','WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR',
       'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss', '_OffRtg', '_DefRtg',
       '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct', '_DR_pct', '_REB_pct',
       '_TO_poss', '_FT_rate', '_AST_rtio', '_BLK_pct', '_STL_pct']]
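
The differential features are simply team 1's season stats minus team 2's, column by column. The following sketch shows the same idea on a toy table with two features; the data values and the names `stats`, `matchups`, and `feature_columns` are hypothetical.

```python
import pandas as pd

# Hypothetical per-team season stats with two of the feature columns.
stats = pd.DataFrame(
    {"SeedNum": [1, 16, 2, 15], "WinPct": [0.85, 0.55, 0.80, 0.60]},
    index=["Duke", "N Dakota St", "Michigan St", "Bradley"],
)
matchups = pd.DataFrame({
    "TeamName_1": ["Duke", "Michigan St"],
    "TeamName_2": ["N Dakota St", "Bradley"],
})
feature_columns = ["SeedNum", "WinPct"]

# Look up each side's stats in matchup order, then subtract position by position.
team1 = stats.loc[matchups["TeamName_1"], feature_columns].to_numpy()
team2 = stats.loc[matchups["TeamName_2"], feature_columns].to_numpy()
diff = pd.DataFrame(team1 - team2, columns=feature_columns)
```

Keeping the feature list in a single variable such as `feature_columns` would also remove the need to repeat the long column list in every round, and it guarantees the columns reach the model in the same order each time.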

In [132]:
win_prob_Round1 = logit_model.predict_proba(Round1_Diff)
winners_Round1 = logit_model.predict(Round1_Diff)
Round1_Matchup['Winners'] = winners_Round1
win_prob_Round1 = pd.DataFrame(win_prob_Round1,columns = ['Team_1_Probability','Team_2_Probability'])
Round1_Matchup = pd.concat((Round1_Matchup,win_prob_Round1),axis=1)

In [133]:
Round1_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2,Winners,Team_1_Probability,Team_2_Probability
0,2019,R1W1,W01,W16,1.0,Duke,N Dakota St,1,0.986072,0.013928
1,2019,R1W2,W02,W15,1.0,Michigan St,Bradley,1,0.954047,0.045953
2,2019,R1W3,W03,W14,1.0,LSU,Yale,1,0.882739,0.117261
3,2019,R1W4,W04,W13,1.0,Virginia Tech,St Louis,1,0.869784,0.130216
4,2019,R1W5,W05,W12,1.0,Mississippi St,Liberty,1,0.768255,0.231745
5,2019,R1W6,W06,W11,1.0,Maryland,Temple,1,0.650594,0.349406
6,2019,R1W7,W07,W10,1.0,Louisville,Minnesota,1,0.53538,0.46462
7,2019,R1W8,W08,W09,1.0,VA Commonwealth,UCF,2,0.484688,0.515312
8,2019,R1X1,X01,X16,1.0,Gonzaga,F Dickinson,1,0.987002,0.012998
9,2019,R1X2,X02,X15,1.0,Michigan,Montana,1,0.930876,0.069124


## Round 2

In [158]:
#Teams predicted to lose as team 1, and the team 2 opponents that advance in their place
Team1_Losers = ['VA Commonwealth','Syracuse','Utah St','Mississippi']
Team2_Winners = ['UCF','Baylor','Washington','Oklahoma']

#Copy the slice so the in-place edits below act on an independent dataframe
Seeding_Round2 = Round1_Matchup[['Slot','TeamName_1']].copy()
Seeding_Round2.rename(columns={'TeamName_1':'TeamName'},inplace=True)

for loser,winner in zip(Team1_Losers,Team2_Winners):
    Seeding_Round2.replace(to_replace= loser, value = winner,inplace=True)

In [160]:
Seeding_Round2.head()

Unnamed: 0,Slot,TeamName
0,R1W1,Duke
1,R1W2,Michigan St
2,R1W3,LSU
3,R1W4,Virginia Tech
4,R1W5,Mississippi St
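
The advancing teams can also be read straight from the `Winners` column instead of being typed in by hand. Here is a minimal sketch using two rows copied from the Round 1 results; the variable names `round1` and `seeding` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two rows copied from the Round 1 results above.
round1 = pd.DataFrame({
    "Slot": ["R1W7", "R1W8"],
    "TeamName_1": ["Louisville", "VA Commonwealth"],
    "TeamName_2": ["Minnesota", "UCF"],
    "Winners": [1, 2],
})

# Pick the advancing team from the predicted class.
seeding = pd.DataFrame({
    "Slot": round1["Slot"],
    "TeamName": np.where(round1["Winners"] == 1, round1["TeamName_1"], round1["TeamName_2"]),
})
print(seeding["TeamName"].tolist())  # ['Louisville', 'UCF']
```

This form works unchanged in every round, including rounds where team 1 wins every game.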


In [198]:
Round2_Matchup = pd.merge(Round2,Seeding_Round2,left_on ='StrongSeed',right_on='Slot')
Round2_Matchup.drop('Slot_y',axis=1,inplace=True)
Round2_Matchup.rename(columns={'TeamName':'TeamName_1'},inplace=True)
Round2_Matchup = pd.merge(Round2_Matchup,Seeding_Round2,left_on ='WeakSeed',right_on='Slot')
Round2_Matchup.drop('Slot',axis=1,inplace=True)
Round2_Matchup.rename(columns={'Slot_x':'Slot','TeamName':'TeamName_2'},inplace=True)

In [199]:
#merge the strong seed
Round2 = pd.merge(Round2_Matchup,SeasonStatsRound1,left_on = ['Season','TeamName_1'],right_on=['Season','TeamName'])
Round2.drop(['Seed','TeamID','TeamName','TeamName_1','Description'],axis=1,inplace=True)

for i in rename_list:
    Round2.rename(columns = {f'{i}':f'1{i}'},inplace=True)

#merge the weak seed
Round2 = pd.merge(Round2,SeasonStatsRound1,left_on = ['Season','TeamName_2'],right_on=['Season','TeamName'])
Round2.drop(['Seed','TeamID','TeamName','TeamName_2','Description'],axis=1,inplace=True)

for i in rename_list:
    Round2.rename(columns = {f'{i}':f'2{i}'},inplace=True)

#Create new dataframe and loop for new columns
Round2_Diff = pd.DataFrame()

for stat_1,stat_2,Col in zip(Stats1,Stats2,ColumnName):
    Round2_Diff[Col] = Round2[stat_1] - Round2[stat_2]

Round2_Diff = Round2_Diff[['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 
        'Region_Z','WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR',
       'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss', '_OffRtg', '_DefRtg',
       '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct', '_DR_pct', '_REB_pct',
       '_TO_poss', '_FT_rate', '_AST_rtio', '_BLK_pct', '_STL_pct']]

In [200]:
win_prob_Round2 = logit_model.predict_proba(Round2_Diff)
winners_Round2 = logit_model.predict(Round2_Diff)
Round2_Matchup['Winners'] = winners_Round2
win_prob_Round2 = pd.DataFrame(win_prob_Round2,columns = ['Team_1_Probability','Team_2_Probability'])
Round2_Matchup = pd.concat((Round2_Matchup,win_prob_Round2),axis=1)

In [201]:
Round2_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2,Winners,Team_1_Probability,Team_2_Probability
0,2019,R2W1,R1W1,R1W8,2.0,Duke,UCF,1,0.733168,0.266832
1,2019,R2W2,R1W2,R1W7,2.0,Michigan St,Louisville,1,0.686582,0.313418
2,2019,R2W3,R1W3,R1W6,2.0,LSU,Maryland,1,0.605842,0.394158
3,2019,R2W4,R1W4,R1W5,2.0,Virginia Tech,Mississippi St,1,0.536051,0.463949
4,2019,R2X1,R1X1,R1X8,2.0,Gonzaga,Baylor,1,0.782846,0.217154
5,2019,R2X2,R1X2,R1X7,2.0,Michigan,Nevada,1,0.630121,0.369879
6,2019,R2X3,R1X3,R1X6,2.0,Texas Tech,Buffalo,1,0.551426,0.448574
7,2019,R2X4,R1X4,R1X5,2.0,Florida St,Marquette,1,0.563554,0.436446
8,2019,R2Y1,R1Y1,R1Y8,2.0,North Carolina,Washington,1,0.714086,0.285914
9,2019,R2Y2,R1Y2,R1Y7,2.0,Kentucky,Wofford,1,0.639016,0.360984


## Round 3 (Sweet Sixteen)

In [182]:
#Team 1 advances in every Round 2 slot, so no names need to be swapped
Seeding_Round3 = Round2_Matchup[['Slot','TeamName_1']].copy()
Seeding_Round3.rename(columns={'TeamName_1':'TeamName'},inplace=True)

In [184]:
Seeding_Round3.head()

Unnamed: 0,Slot,TeamName
0,R2W1,Duke
1,R2W2,Michigan St
2,R2W3,LSU
3,R2W4,Virginia Tech
4,R2X1,Gonzaga


In [186]:
Round3_Matchup = pd.merge(Round3,Seeding_Round3,left_on ='StrongSeed',right_on='Slot')
Round3_Matchup.drop('Slot_y',axis=1,inplace=True)
Round3_Matchup.rename(columns={'TeamName':'TeamName_1'},inplace=True)
Round3_Matchup = pd.merge(Round3_Matchup,Seeding_Round3,left_on ='WeakSeed',right_on='Slot')
Round3_Matchup.drop('Slot',axis=1,inplace=True)
Round3_Matchup.rename(columns={'Slot_x':'Slot','TeamName':'TeamName_2'},inplace=True)

In [188]:
#merge the strong seed
Round3 = pd.merge(Round3_Matchup,SeasonStatsRound1,left_on = ['Season','TeamName_1'],right_on=['Season','TeamName'])
Round3.drop(['Seed','TeamID','TeamName','TeamName_1','Description'],axis=1,inplace=True)

for i in rename_list:
    Round3.rename(columns = {f'{i}':f'1{i}'},inplace=True)

#merge the weak seed
Round3 = pd.merge(Round3,SeasonStatsRound1,left_on = ['Season','TeamName_2'],right_on=['Season','TeamName'])
Round3.drop(['Seed','TeamID','TeamName','TeamName_2','Description'],axis=1,inplace=True)

for i in rename_list:
    Round3.rename(columns = {f'{i}':f'2{i}'},inplace=True)

#Create new dataframe and loop for new columns
Round3_Diff = pd.DataFrame()

for stat_1,stat_2,Col in zip(Stats1,Stats2,ColumnName):
    Round3_Diff[Col] = Round3[stat_1] - Round3[stat_2]

Round3_Diff = Round3_Diff[['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 
        'Region_Z','WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR',
       'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss', '_OffRtg', '_DefRtg',
       '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct', '_DR_pct', '_REB_pct',
       '_TO_poss', '_FT_rate', '_AST_rtio', '_BLK_pct', '_STL_pct']]

In [192]:
win_prob_Round3 = logit_model.predict_proba(Round3_Diff)
winners_Round3 = logit_model.predict(Round3_Diff)
Round3_Matchup['Winners'] = winners_Round3
win_prob_Round3 = pd.DataFrame(win_prob_Round3,columns = ['Team_1_Probability','Team_2_Probability'])
Round3_Matchup = pd.concat((Round3_Matchup,win_prob_Round3),axis=1)

In [193]:
Round3_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2,Winners,Team_1_Probability,Team_2_Probability
0,2019,R3W1,R2W1,R2W4,3.0,Duke,Virginia Tech,1,0.622715,0.377285
1,2019,R3W2,R2W2,R2W3,3.0,Michigan St,LSU,1,0.51871,0.48129
2,2019,R3X1,R2X1,R2X4,3.0,Gonzaga,Florida St,1,0.595557,0.404443
3,2019,R3X2,R2X2,R2X3,3.0,Michigan,Texas Tech,1,0.546044,0.453956
4,2019,R3Y1,R2Y1,R2Y4,3.0,North Carolina,Kansas,1,0.602936,0.397064
5,2019,R3Y2,R2Y2,R2Y3,3.0,Kentucky,Houston,2,0.499599,0.500401
6,2019,R3Z1,R2Z1,R2Z4,3.0,Virginia,Kansas St,1,0.63975,0.36025
7,2019,R3Z2,R2Z2,R2Z3,3.0,Tennessee,Purdue,1,0.584593,0.415407


## Round 4 (Elite Eight)

In [202]:
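#Kentucky was predicted to lose to Houston, so Houston advances in its slot
Team1_Losers = ['Kentucky']
Team2_Winners = ['Houston']

#Copy the slice so the in-place edits below act on an independent dataframe
Seeding_Round4 = Round3_Matchup[['Slot','TeamName_1']].copy()
Seeding_Round4.rename(columns={'TeamName_1':'TeamName'},inplace=True)

for loser,winner in zip(Team1_Losers,Team2_Winners):
    Seeding_Round4.replace(to_replace= loser, value = winner,inplace=True)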

In [204]:
Seeding_Round4.tail()

Unnamed: 0,Slot,TeamName
3,R3X2,Michigan
4,R3Y1,North Carolina
5,R3Y2,Houston
6,R3Z1,Virginia
7,R3Z2,Tennessee


In [205]:
Round4_Matchup = pd.merge(Round4,Seeding_Round4,left_on ='StrongSeed',right_on='Slot')
Round4_Matchup.drop('Slot_y',axis=1,inplace=True)
Round4_Matchup.rename(columns={'TeamName':'TeamName_1'},inplace=True)
Round4_Matchup = pd.merge(Round4_Matchup,Seeding_Round4,left_on ='WeakSeed',right_on='Slot')
Round4_Matchup.drop('Slot',axis=1,inplace=True)
Round4_Matchup.rename(columns={'Slot_x':'Slot','TeamName':'TeamName_2'},inplace=True)

#merge the strong seed
Round4 = pd.merge(Round4_Matchup,SeasonStatsRound1,left_on = ['Season','TeamName_1'],right_on=['Season','TeamName'])
Round4.drop(['Seed','TeamID','TeamName','TeamName_1','Description'],axis=1,inplace=True)

for i in rename_list:
    Round4.rename(columns = {f'{i}':f'1{i}'},inplace=True)

#merge the weak seed
Round4 = pd.merge(Round4,SeasonStatsRound1,left_on = ['Season','TeamName_2'],right_on=['Season','TeamName'])
Round4.drop(['Seed','TeamID','TeamName','TeamName_2','Description'],axis=1,inplace=True)

for i in rename_list:
    Round4.rename(columns = {f'{i}':f'2{i}'},inplace=True)

#Create new dataframe and loop for new columns
Round4_Diff = pd.DataFrame()

for stat_1,stat_2,Col in zip(Stats1,Stats2,ColumnName):
    Round4_Diff[Col] = Round4[stat_1] - Round4[stat_2]

Round4_Diff = Round4_Diff[['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 
        'Region_Z','WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR',
       'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss', '_OffRtg', '_DefRtg',
       '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct', '_DR_pct', '_REB_pct',
       '_TO_poss', '_FT_rate', '_AST_rtio', '_BLK_pct', '_STL_pct']]


In [206]:
win_prob_Round4 = logit_model.predict_proba(Round4_Diff)
winners_Round4 = logit_model.predict(Round4_Diff)
Round4_Matchup['Winners'] = winners_Round4
win_prob_Round4 = pd.DataFrame(win_prob_Round4,columns = ['Team_1_Probability','Team_2_Probability'])
Round4_Matchup = pd.concat((Round4_Matchup,win_prob_Round4),axis=1)

In [207]:
Round4_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2,Winners,Team_1_Probability,Team_2_Probability
0,2019,R4W1,R3W1,R3W2,4.0,Duke,Michigan St,1,0.537325,0.462675
1,2019,R4X1,R3X1,R3X2,4.0,Gonzaga,Michigan,1,0.53291,0.46709
2,2019,R4Y1,R3Y1,R3Y2,4.0,North Carolina,Houston,1,0.511464,0.488536
3,2019,R4Z1,R3Z1,R3Z2,4.0,Virginia,Tennessee,1,0.539161,0.460839


## Round 5 (Final Four)

In [208]:
#Team 1 won all four Round 4 games, so TeamName_1 advances in every slot
Seeding_Round5 = Round4_Matchup[['Slot','TeamName_1']].copy()
Seeding_Round5.rename(columns={'TeamName_1':'TeamName'},inplace=True)

In [209]:
Seeding_Round5

Unnamed: 0,Slot,TeamName
0,R4W1,Duke
1,R4X1,Gonzaga
2,R4Y1,North Carolina
3,R4Z1,Virginia


In [210]:
Round5_Matchup = pd.merge(Round5,Seeding_Round5,left_on ='StrongSeed',right_on='Slot')
Round5_Matchup.drop('Slot_y',axis=1,inplace=True)
Round5_Matchup.rename(columns={'TeamName':'TeamName_1'},inplace=True)
Round5_Matchup = pd.merge(Round5_Matchup,Seeding_Round5,left_on ='WeakSeed',right_on='Slot')
Round5_Matchup.drop('Slot',axis=1,inplace=True)
Round5_Matchup.rename(columns={'Slot_x':'Slot','TeamName':'TeamName_2'},inplace=True)

#merge the strong seed
Round5 = pd.merge(Round5_Matchup,SeasonStatsRound1,left_on = ['Season','TeamName_1'],right_on=['Season','TeamName'])
Round5.drop(['Seed','TeamID','TeamName','TeamName_1','Description'],axis=1,inplace=True)

for i in rename_list:
    Round5.rename(columns = {f'{i}':f'1{i}'},inplace=True)

#merge the weak seed
Round5 = pd.merge(Round5,SeasonStatsRound1,left_on = ['Season','TeamName_2'],right_on=['Season','TeamName'])
Round5.drop(['Seed','TeamID','TeamName','TeamName_2','Description'],axis=1,inplace=True)

for i in rename_list:
    Round5.rename(columns = {f'{i}':f'2{i}'},inplace=True)

#Create new dataframe and loop for new columns
Round5_Diff = pd.DataFrame()

for stat_1,stat_2,Col in zip(Stats1,Stats2,ColumnName):
    Round5_Diff[Col] = Round5[stat_1] - Round5[stat_2]

Round5_Diff = Round5_Diff[['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 
        'Region_Z','WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR',
       'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss', '_OffRtg', '_DefRtg',
       '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct', '_DR_pct', '_REB_pct',
       '_TO_poss', '_FT_rate', '_AST_rtio', '_BLK_pct', '_STL_pct']]

In [211]:
win_prob_Round5 = logit_model.predict_proba(Round5_Diff)
winners_Round5 = logit_model.predict(Round5_Diff)
Round5_Matchup['Winners'] = winners_Round5
win_prob_Round5 = pd.DataFrame(win_prob_Round5,columns = ['Team_1_Probability','Team_2_Probability'])
Round5_Matchup = pd.concat((Round5_Matchup,win_prob_Round5),axis=1)

In [212]:
Round5_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2,Winners,Team_1_Probability,Team_2_Probability
0,2019,R5WX,R4W1,R4X1,5.0,Duke,Gonzaga,2,0.493036,0.506964
1,2019,R5YZ,R4Y1,R4Z1,5.0,North Carolina,Virginia,2,0.456983,0.543017


The model predicted that Gonzaga and Virginia would beat Duke and North Carolina, respectively, to advance to the championship game.

## Round 6 (Championship)

In [214]:
#Team 2 won both Round 5 games, so TeamName_2 advances in each slot
Seeding_Round6 = Round5_Matchup[['Slot','TeamName_2']].copy()
Seeding_Round6.rename(columns={'TeamName_2':'TeamName'},inplace=True)

In [215]:
Seeding_Round6

Unnamed: 0,Slot,TeamName
0,R5WX,Gonzaga
1,R5YZ,Virginia


In [216]:
Round6_Matchup = pd.merge(Round6,Seeding_Round6,left_on ='StrongSeed',right_on='Slot')
Round6_Matchup.drop('Slot_y',axis=1,inplace=True)
Round6_Matchup.rename(columns={'TeamName':'TeamName_1'},inplace=True)
Round6_Matchup = pd.merge(Round6_Matchup,Seeding_Round6,left_on ='WeakSeed',right_on='Slot')
Round6_Matchup.drop('Slot',axis=1,inplace=True)
Round6_Matchup.rename(columns={'Slot_x':'Slot','TeamName':'TeamName_2'},inplace=True)

#merge the strong seed
Round6 = pd.merge(Round6_Matchup,SeasonStatsRound1,left_on = ['Season','TeamName_1'],right_on=['Season','TeamName'])
Round6.drop(['Seed','TeamID','TeamName','TeamName_1','Description'],axis=1,inplace=True)

for i in rename_list:
    Round6.rename(columns = {f'{i}':f'1{i}'},inplace=True)

#merge the weak seed
Round6 = pd.merge(Round6,SeasonStatsRound1,left_on = ['Season','TeamName_2'],right_on=['Season','TeamName'])
Round6.drop(['Seed','TeamID','TeamName','TeamName_2','Description'],axis=1,inplace=True)

for i in rename_list:
    Round6.rename(columns = {f'{i}':f'2{i}'},inplace=True)

#Create new dataframe and loop for new columns
Round6_Diff = pd.DataFrame()

for stat_1,stat_2,Col in zip(Stats1,Stats2,ColumnName):
    Round6_Diff[Col] = Round6[stat_1] - Round6[stat_2]

Round6_Diff = Round6_Diff[['SeedNum','Region_DNC', 'RegionW', 'Region_X', 'Region_Y', 
        'Region_Z','WOL', 'RTH', 'SAG', 'MOR', 'NumWins', 'NumLoss', 'WinPct',
       'Score', 'PointDiff', 'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR',
       'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'Poss', '_OffRtg', '_DefRtg',
       '_NetRtg', '_EFG', '_TS', 'IE', '_OR_pct', '_DR_pct', '_REB_pct',
       '_TO_poss', '_FT_rate', '_AST_rtio', '_BLK_pct', '_STL_pct']]

In [217]:
win_prob_Round6 = logit_model.predict_proba(Round6_Diff)
winners_Round6 = logit_model.predict(Round6_Diff)
Round6_Matchup['Winners'] = winners_Round6
win_prob_Round6 = pd.DataFrame(win_prob_Round6,columns = ['Team_1_Probability','Team_2_Probability'])
Round6_Matchup = pd.concat((Round6_Matchup,win_prob_Round6),axis=1)

In [218]:
Round6_Matchup

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed,Round,TeamName_1,TeamName_2,Winners,Team_1_Probability,Team_2_Probability
0,2019,R6CH,R5WX,R5YZ,6.0,Gonzaga,Virginia,2,0.472055,0.527945


Our model predicted that Virginia would beat Gonzaga to win the 2019 NCAA Tournament.