**Introduction:**

One notable difference between the men's and women's March Madness tournaments is that the top 16 seeds in the women’s tournament host 3 other teams for rounds 1 and 2. In other words, seeds 1-4 in every region are guaranteed a home game in round 1 and, if they win, another home game in round 2. The goal of this kernel is to show how important it is to account for these home games in your model.
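
As a rough illustration of the hosting rule, the sketch below flags which side of a first- or second-round matchup would be at home. It assumes seed strings shaped like `W01` or `X16a` (a region letter followed by a two-digit seed); the function names are hypothetical and not part of the original kernel.

```python
def seed_number(seed: str) -> int:
    """Extract the numeric seed from a string such as 'W01' or 'X16a' (assumed format)."""
    return int(seed[1:3])


def hosts_early_rounds(seed: str) -> bool:
    """Seeds 1-4 in each region host rounds 1 and 2."""
    return seed_number(seed) <= 4


def home_side(seed_a: str, seed_b: str):
    """Return 'a' or 'b' for the hosting side of a round 1-2 game, or None if neither hosts."""
    if hosts_early_rounds(seed_a):
        return "a"
    if hosts_early_rounds(seed_b):
        return "b"
    return None


print(home_side("W03", "W14"))
print(home_side("X12", "X05"))
```

A flag like this can be computed before a game is played, so it could stand in for a location column when building predictions for future matchups.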

Some of the ideas for this kernel were inspired by this article, which is worth a read: https://audacityofhoops.blogspot.com/2010/04/opponent-adjusted-four-factors.html


Step 1: Build a simple model with raw, unweighted 4 Factor stats. Predict on the 2018 data and get a baseline log loss.

Step 2: Weight the 4 factor stats when a team is playing a home game. Rerun the model, and compare the log loss scores.


In [1]:
import numpy as np 
import pandas as pd
import os
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
#import data
in_path = '../input/'
RegularSeasonDetailedResults = pd.read_csv(in_path + 'wdatafiles/WRegularSeasonDetailedResults.csv')
NCAATourneyCompactResults = pd.read_csv(in_path + 'wdatafiles/WNCAATourneyCompactResults.csv')

RegularSeasonDetailedResults.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,WFGM3,WFGA3,WFTM,WFTA,WOR,WDR,WAst,WTO,WStl,WBlk,WPF,LFGM,LFGA,LFGM3,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2010,11,3103,63,3237,49,H,0,23,54,5,9,12,19,10,26,14,18,7,0,15,20,54,3,13,6,10,11,27,11,23,7,6,19
1,2010,11,3104,73,3399,68,N,0,26,62,5,12,16,28,16,31,15,20,5,2,25,25,63,4,21,14,27,14,26,7,20,4,2,27
2,2010,11,3110,71,3224,59,A,0,29,62,6,15,7,12,14,23,18,13,6,2,17,19,58,2,14,19,23,17,23,8,15,6,0,15
3,2010,11,3111,63,3267,58,A,0,27,52,4,11,5,9,6,40,14,27,5,10,18,18,74,6,26,16,25,22,22,15,11,14,5,14
4,2010,11,3119,74,3447,70,H,1,30,74,7,20,7,11,14,33,18,11,5,3,18,25,74,9,17,11,21,21,32,12,14,4,2,14


In [3]:
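def Get4FactorStats(df):
    # Adds per-game Four Factors columns for the winner (W*) and the loser (L*).
    # Note: the new columns are added to the DataFrame that is passed in.

    # Effective field goal percentage: a made three counts as 1.5 field goals.
    df['WEFG'] = (df['WFGM'] + (df['WFGM3']*0.5))/df['WFGA']
    df['LEFG'] = (df['LFGM'] + (df['LFGM3']*0.5))/df['LFGA']
    # Turnover rate. Note: the usual Four Factors definition is
    # TO / (FGA + 0.44*FTA + TO); as written, 0.44 is added as a constant
    # rather than multiplied by FTA.
    df['WTOV'] = df['WTO']/((df['WFGA'] + 0.44) + (df['WFTA']+df['WTO']))
    df['LTOV'] = df['LTO']/((df['LFGA'] + 0.44) + (df['LFTA']+df['LTO']))
    # Offensive rebounding rate: own offensive boards vs. the opponent's defensive boards.
    df['WORB'] = df['WOR']/(df['WOR'] + df['LDR'])
    df['LORB'] = df['LOR']/(df['LOR'] + df['WDR'])
    # Free throw attempt rate.
    df['WFTAR'] = df['WFTA']/(df['WFGA'])
    df['LFTAR'] = df['LFTA']/(df['LFGA'])

    return df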

#Get 4 factors on the game level 
reg_season_4_factor = Get4FactorStats(RegularSeasonDetailedResults)


#Split into two datasets of the winners and losers because you want every team to have a line for a single game.
reg_season_4_factor_w = reg_season_4_factor[['Season','WTeamID','DayNum','WEFG','WTOV','WORB','WFTAR']]


reg_season_4_factor_w = reg_season_4_factor_w.rename(columns = {
                                               'WEFG': 'EFG',
                                               'WTOV': 'TOV',
                                               'WORB': 'ORB',
                                               'WFTAR': 'FTAR',
                                               'WLoc': 'Loc',
                                               'WTeamID':'TeamID'})


reg_season_4_factor_l = reg_season_4_factor[['Season','LTeamID','DayNum','LEFG','LTOV','LORB','LFTAR']]


reg_season_4_factor_l = reg_season_4_factor_l.rename(columns = {
                                               'LEFG': 'EFG',
                                               'LTOV': 'TOV',
                                               'LORB': 'ORB',
                                               'LFTAR': 'FTAR',
                                               'LTeamID':'TeamID'})
#Set the data back together
reg_season_4_factor_set = (reg_season_4_factor_w, reg_season_4_factor_l)
reg_season_4_factor_set = pd.concat(reg_season_4_factor_set, ignore_index = True)
#drop_duplicates returns a new DataFrame, so assign the result back.
reg_season_4_factor_set = reg_season_4_factor_set.drop_duplicates()

#Group by so we get the avg 4 factor results at the season team level.
reg_season_4_factor_overall_avg =  reg_season_4_factor_set.groupby(['Season','TeamID'])[['EFG','TOV','ORB','FTAR']].mean().reset_index()

reg_season_4_factor_overall_avg.head()


Unnamed: 0,Season,TeamID,EFG,TOV,ORB,FTAR
0,2010,3102,0.407876,0.206943,0.345366,0.25772
1,2010,3103,0.441526,0.196685,0.391232,0.364156
2,2010,3104,0.436584,0.195814,0.352448,0.273776
3,2010,3105,0.437404,0.249214,0.384361,0.505554
4,2010,3106,0.377771,0.206727,0.389308,0.465138


Great! We now have the average 4 factor results for every season and team. Next we will organize the tournament data and then join these two datasets together.
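
To make the four statistics concrete, here is a minimal sketch that computes them for a single team in a single game. The function name is hypothetical. The turnover rate uses the commonly cited form TO / (FGA + 0.44 × FTA + TO), which differs from the expression inside `Get4FactorStats`, so the numbers will not match the kernel's TOV column exactly.

```python
def four_factors(fgm, fga, fgm3, fta, to, orb, opp_drb):
    """Four Factors for one team in one game, from raw box-score counts."""
    return {
        "EFG": (fgm + 0.5 * fgm3) / fga,      # effective field goal percentage
        "TOV": to / (fga + 0.44 * fta + to),  # turnovers per estimated possession
        "ORB": orb / (orb + opp_drb),         # share of available offensive rebounds
        "FTAR": fta / fga,                    # free throw attempts per field goal attempt
    }


# Winning team's line from the first row of the regular-season table above.
winner = four_factors(fgm=23, fga=54, fgm3=5, fta=19, to=18, orb=10, opp_drb=27)
for name, value in winner.items():
    print(f"{name}: {value:.3f}")
```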

In [4]:

#Get the outcome of the tournament games.
def NCAASetWinAndLoseTeamsRecords(NCAATourneyCompactResults):
    NCAA_res_w = NCAATourneyCompactResults.rename(columns = {'WTeamID': 'NCAA_TEAMID',
                                                           'LTeamID': 'NCAA_O_TEAMID', 
                                                           'WScore':'NCAA_SCORE',
                                                           'LScore':'NCAA_O_SCORE'
                                                             })
    NCAA_res_l = NCAATourneyCompactResults.rename(columns = {'LTeamID': 'NCAA_TEAMID',
                                                           'WTeamID': 'NCAA_O_TEAMID', 
                                                           'LScore':'NCAA_SCORE',
                                                           'WScore':'NCAA_O_SCORE'
                                                             })
        
    NCAA_Ses = (NCAA_res_w, NCAA_res_l)
    NCAA_Ses = pd.concat(NCAA_Ses, ignore_index = True,sort = True)
    NCAA_Ses ['OUTCOME'] = np.where(NCAA_Ses['NCAA_SCORE']>NCAA_Ses['NCAA_O_SCORE'], 1, 0)
    NCAA_Ses = NCAA_Ses[['Season','NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME']]
    return NCAA_Ses

Tourney_Results = NCAASetWinAndLoseTeamsRecords(NCAATourneyCompactResults)

Tourney_Results.head()

Unnamed: 0,Season,NCAA_TEAMID,NCAA_O_TEAMID,OUTCOME
0,1998,3104,3422,1
1,1998,3112,3365,1
2,1998,3163,3193,1
3,1998,3198,3266,1
4,1998,3203,3208,1
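
The function above stacks two renamed copies of the results so that every game appears once from each team's point of view. A minimal sketch of the same idea on two made-up games (the team IDs, scores, and shortened column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Two made-up games in the compact-results layout.
games = pd.DataFrame({
    "Season": [2018, 2018],
    "WTeamID": [3101, 3102],
    "LTeamID": [3201, 3202],
    "WScore": [70, 65],
    "LScore": [60, 64],
})

# Winner's point of view: the W* columns describe "our" team.
as_winner = games.rename(columns={"WTeamID": "TeamID", "LTeamID": "OppID",
                                  "WScore": "Score", "LScore": "OppScore"})
# Loser's point of view: the L* columns describe "our" team.
as_loser = games.rename(columns={"LTeamID": "TeamID", "WTeamID": "OppID",
                                 "LScore": "Score", "WScore": "OppScore"})

both = pd.concat([as_winner, as_loser], ignore_index=True, sort=True)
both["OUTCOME"] = np.where(both["Score"] > both["OppScore"], 1, 0)
print(both[["Season", "TeamID", "OppID", "OUTCOME"]])
```

Because each game contributes one win row and one loss row, the resulting target is balanced by construction.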


In [5]:
#merge 4 factor stats on tournament-outcome data.
Tourney_Results_1 = pd.merge(Tourney_Results, reg_season_4_factor_overall_avg, how='inner', 
                    left_on=['Season','NCAA_TEAMID'],
                    right_on=['Season','TeamID'] )



Tourney_Results_2 = pd.merge(Tourney_Results_1, reg_season_4_factor_overall_avg, how='inner', 
                    left_on=['Season','NCAA_O_TEAMID'],
                    right_on=['Season','TeamID'],suffixes = ['','_op'] )



Non_weighted_4_factors = Tourney_Results_2[['Season','NCAA_TEAMID','NCAA_O_TEAMID','OUTCOME','EFG','TOV','ORB','FTAR','EFG_op','TOV_op','ORB_op','FTAR_op']]

Non_weighted_4_factors.head()

Unnamed: 0,Season,NCAA_TEAMID,NCAA_O_TEAMID,OUTCOME,EFG,TOV,ORB,FTAR,EFG_op,TOV_op,ORB_op,FTAR_op
0,2010,3124,3201,1,0.484261,0.172037,0.364244,0.438309,0.492357,0.163805,0.359992,0.30205
1,2010,3124,3207,1,0.484261,0.172037,0.364244,0.438309,0.450856,0.168999,0.406329,0.327538
2,2010,3265,3207,0,0.478543,0.154016,0.307726,0.336385,0.450856,0.168999,0.406329,0.327538
3,2010,3124,3397,1,0.484261,0.172037,0.364244,0.438309,0.504445,0.158655,0.426177,0.296546
4,2010,3173,3397,0,0.465242,0.172238,0.381488,0.315887,0.504445,0.158655,0.426177,0.296546


Now we have a very basic data set for getting our baseline score. Let's train a model and see how it goes.

In [6]:

def print_score(m):
    # Relies on the module-level X_train, y_train, X_valid and y_valid defined below.
    print ("train log loss :", metrics.log_loss(y_train, m.predict_proba(X_train)))
    print ("test log loss :", metrics.log_loss(y_valid, m.predict_proba(X_valid)))

    if hasattr(m, 'oob_score_'): print ("oob_score : ", m.oob_score_)
    


train = Non_weighted_4_factors[Non_weighted_4_factors.Season <= 2017]
valid = Non_weighted_4_factors[Non_weighted_4_factors.Season == 2018]


X_train = train.drop(['Season', 'NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME'], axis=1)
y_train = train['OUTCOME']
X_valid = valid.drop(['Season', 'NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME'], axis=1)
y_valid = valid['OUTCOME']


m = RandomForestClassifier(n_estimators=500, n_jobs=-1, oob_score=True, random_state=0)
m.fit(X_train, y_train)
print_score(m)


train log loss : 0.16448612070378216
test log loss : 0.5708310327799656
oob_score :  0.6468253968253969


Remember the test log loss score. Now let's see what we can get after weighting these 4 factors when a team is playing at home.
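
For readers less familiar with the metric, here is a small self-contained sketch of binary log loss (lower is better). The helper name is hypothetical; it is meant to mirror what `metrics.log_loss` reports for a two-class problem, not to replace it. Always predicting 0.5 scores ln 2, about 0.693, which is a useful reference point for the numbers above.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
coin_flip = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])        # always predicting 50/50
confident_right = binary_log_loss(labels, [0.9, 0.1, 0.8, 0.2])  # confident and correct
confident_wrong = binary_log_loss(labels, [0.1, 0.9, 0.2, 0.8])  # confident and incorrect
print(coin_flip, confident_right, confident_wrong)
```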

Step 2:

Some of this will look familiar, but now we will be taking home/away into account.
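
The key step is turning `WLoc`, which describes the winner's location, into a location for each team. A minimal sketch on made-up rows, folding neutral-site games into "away" the same way the cell below does:

```python
import pandas as pd

# Three made-up games: winner at home, winner away, neutral site.
games = pd.DataFrame({
    "WTeamID": [3101, 3102, 3103],
    "LTeamID": [3201, 3202, 3203],
    "WLoc": ["H", "A", "N"],
})

# Location from each side's point of view; neutral sites count as away.
winner_loc = games["WLoc"].map({"H": "H", "A": "A", "N": "A"})
loser_loc = games["WLoc"].map({"H": "A", "A": "H", "N": "A"})

long_form = pd.concat([
    pd.DataFrame({"TeamID": games["WTeamID"], "Loc": winner_loc}),
    pd.DataFrame({"TeamID": games["LTeamID"], "Loc": loser_loc}),
], ignore_index=True)
print(long_form)
```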

In [7]:
#get 4 factors on the game level 
reg_season_h = Get4FactorStats(RegularSeasonDetailedResults)

#split into two datasets of the winners and losers because you want every
# team to have a line for a single game.  one game = two lines, one for each team
reg_season_h_w = reg_season_h[['Season','WTeamID','DayNum','WEFG','WTOV','WORB','WFTAR','WLoc']].copy()


#Treat neutral-site games as away games for the winning team.
reg_season_h_w.loc[reg_season_h_w.WLoc == 'N', 'WLoc'] = 'A'


reg_season_h_w = reg_season_h_w.rename(columns = {
                                               'WEFG': 'EFG',
                                               'WTOV': 'TOV',
                                               'WORB': 'ORB',
                                               'WFTAR': 'FTAR',
                                               'WLoc': 'Loc',
                                               'WTeamID':'TeamID'})


reg_season_h_l = reg_season_h[['Season','LTeamID','DayNum','LEFG','LTOV','LORB','LFTAR','WLoc']].copy()
#Home vs away is opposite for the LTeamID

reg_season_h_l.loc[reg_season_h_l.WLoc == 'H', 'Loc'] = 'A'
reg_season_h_l.loc[reg_season_h_l.WLoc == 'A', 'Loc'] = 'H'
reg_season_h_l.loc[reg_season_h_l.WLoc == 'N', 'Loc'] = 'A'


reg_season_h_l = reg_season_h_l.drop(['WLoc'], axis=1)

reg_season_h_l = reg_season_h_l.rename(columns = {
                                               'LEFG': 'EFG',
                                               'LTOV': 'TOV',
                                               'LORB': 'ORB',
                                               'LFTAR': 'FTAR',
                                               'LTeamID':'TeamID'})
#set the data back together
reg_season_4_factor_h_a = (reg_season_h_w, reg_season_h_l)
reg_season_4_factor_h_a = pd.concat(reg_season_4_factor_h_a, ignore_index = True)
#drop_duplicates returns a new DataFrame, so assign the result back.
reg_season_4_factor_h_a = reg_season_4_factor_h_a.drop_duplicates()

#group by for calculating home stats. 
reg_season_4_factor_home_avg =  reg_season_4_factor_h_a.groupby(['Season','TeamID','Loc'])[['EFG','TOV','ORB','FTAR']].mean().reset_index()

reg_season_4_factor_home_avg = reg_season_4_factor_home_avg.drop(reg_season_4_factor_home_avg[reg_season_4_factor_home_avg.Loc == 'A'].index)


reg_season_4_factor_home_avg = reg_season_4_factor_home_avg.rename(columns = {
                                               'EFG': 'EFG_home',
                                               'TOV': 'TOV_home',
                                               'ORB': 'ORB_home',
                                               'FTAR': 'FTAR_home'
                                               })
reg_season_4_factor_home_avg = reg_season_4_factor_home_avg.drop(['Loc'],axis=1)

#group by for calculating whole season stats
reg_season_4_factor_overall_avg =  reg_season_4_factor_h_a.groupby(['Season','TeamID'])[['EFG','TOV','ORB','FTAR']].mean().reset_index()

reg_season_4_factor_overall_avg = reg_season_4_factor_overall_avg.rename(columns = {
                                               'EFG': 'EFG_overall',
                                               'TOV': 'TOV_overall',
                                               'ORB': 'ORB_overall',
                                               'FTAR': 'FTAR_overall'
                                               })


reg_season_4_factor_before_calc = pd.merge(reg_season_4_factor_home_avg, reg_season_4_factor_overall_avg, how='inner', 
                    left_on=['Season','TeamID'],
                    right_on=['Season','TeamID'] )


#calc for the weights on the team season level. This calculation comes from the article mentioned in the intro.
reg_season_4_factor_before_calc ['EFG_weight'] = (reg_season_4_factor_before_calc ['EFG_home']/reg_season_4_factor_before_calc ['EFG_overall'])**0.5
reg_season_4_factor_before_calc ['TOV_weight'] = (reg_season_4_factor_before_calc ['TOV_home']/reg_season_4_factor_before_calc ['TOV_overall'])**0.5
reg_season_4_factor_before_calc ['ORB_weight'] = (reg_season_4_factor_before_calc ['ORB_home']/reg_season_4_factor_before_calc ['ORB_overall'])**0.5
reg_season_4_factor_before_calc ['FTAR_weight'] = (reg_season_4_factor_before_calc ['FTAR_home']/reg_season_4_factor_before_calc ['FTAR_overall'])**0.5



#dataset that will be joined to full data set.
#Note: this keeps the *_home averages; the *_weight columns calculated above are not selected.
home_weights = reg_season_4_factor_before_calc [['Season','TeamID','EFG_home','TOV_home','ORB_home','FTAR_home']]
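
As a worked example of the weight formula, the sketch below computes the square root of the home-to-overall ratio for one team. The helper name is hypothetical, and multiplying the season average by the weight is an assumed way of using it, not a step taken from this kernel.

```python
import math


def home_weight(home_avg: float, overall_avg: float) -> float:
    """Square root of the home-to-overall ratio, as in the *_weight columns above."""
    return math.sqrt(home_avg / overall_avg)


# EFG values for team 3124 in 2010, taken from the output tables in this kernel.
efg_overall = 0.484261
efg_home = 0.506918

weight = home_weight(efg_home, efg_overall)
adjusted_efg = efg_overall * weight  # assumed use of the weight
print(round(weight, 4), round(adjusted_efg, 4))
```

A weight above 1 means the team shot better at home than over the whole season. The square root dampens the adjustment, so the adjusted value lands between the overall and home averages.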



In [8]:
#merge weights in
Non_weighted_4_factors_with_weights = pd.merge(Non_weighted_4_factors,home_weights , how='inner', 
                    left_on=['Season','NCAA_TEAMID'],
                    right_on=['Season','TeamID'] )

Non_weighted_4_factors_with_weights = pd.merge(Non_weighted_4_factors_with_weights,home_weights , how='inner', 
                    left_on=['Season','NCAA_O_TEAMID'],
                    right_on=['Season','TeamID'], suffixes =['', '_op'] )

Non_weighted_4_factors_with_weights.head()

Unnamed: 0,Season,NCAA_TEAMID,NCAA_O_TEAMID,OUTCOME,EFG,TOV,ORB,FTAR,EFG_op,TOV_op,ORB_op,FTAR_op,TeamID,EFG_home,TOV_home,ORB_home,FTAR_home,TeamID_op,EFG_home_op,TOV_home_op,ORB_home_op,FTAR_home_op
0,2010,3124,3201,1,0.484261,0.172037,0.364244,0.438309,0.492357,0.163805,0.359992,0.30205,3124,0.506918,0.167221,0.40385,0.507862,3201,0.507463,0.165599,0.373166,0.289829
1,2010,3124,3207,1,0.484261,0.172037,0.364244,0.438309,0.450856,0.168999,0.406329,0.327538,3124,0.506918,0.167221,0.40385,0.507862,3207,0.462913,0.166008,0.39584,0.40098
2,2010,3265,3207,0,0.478543,0.154016,0.307726,0.336385,0.450856,0.168999,0.406329,0.327538,3265,0.482394,0.138403,0.302733,0.351364,3207,0.462913,0.166008,0.39584,0.40098
3,2010,3124,3397,1,0.484261,0.172037,0.364244,0.438309,0.504445,0.158655,0.426177,0.296546,3124,0.506918,0.167221,0.40385,0.507862,3397,0.516748,0.136557,0.431577,0.283453
4,2010,3173,3397,0,0.465242,0.172238,0.381488,0.315887,0.504445,0.158655,0.426177,0.296546,3173,0.493387,0.161403,0.343631,0.341311,3397,0.516748,0.136557,0.431577,0.283453


In [9]:
#find home/away.

NCAA_res_w = NCAATourneyCompactResults.rename(columns = {'WTeamID': 'NCAA_TEAMID',
                                                           'LTeamID': 'NCAA_O_TEAMID', 
                                                           'WScore':'NCAA_SCORE',
                                                           'LScore':'NCAA_O_SCORE'
                                                             })
NCAA_res_l = NCAATourneyCompactResults.rename(columns = {'LTeamID': 'NCAA_TEAMID',
                                                           'WTeamID': 'NCAA_O_TEAMID', 
                                                           'LScore':'NCAA_SCORE',
                                                           'WScore':'NCAA_O_SCORE'
                                                             })
        
NCAA_Ses = (NCAA_res_w, NCAA_res_l)
NCAA_Ses = pd.concat(NCAA_Ses, ignore_index = True,sort = True)
NCAA_Ses ['OUTCOME'] = np.where(NCAA_Ses['NCAA_SCORE']>NCAA_Ses['NCAA_O_SCORE'], 1, 0)
home_game = NCAA_Ses[['Season','NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME','WLoc']]

home_game.head()

Unnamed: 0,Season,NCAA_TEAMID,NCAA_O_TEAMID,OUTCOME,WLoc
0,1998,3104,3422,1,H
1,1998,3112,3365,1,H
2,1998,3163,3193,1,H
3,1998,3198,3266,1,H
4,1998,3203,3208,1,A


In [10]:


#merge WLoc in (Home,Away,Neutral) 

Weighted_4_factor = pd.merge(Non_weighted_4_factors_with_weights,home_game , how='inner', 
                    left_on=['Season','NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME'],
                    right_on=['Season','NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME'] )

#multiply the four factors by the team's home values (the *_home columns merged in above).
#WLoc is the winner's location, so WLoc == 'H' with OUTCOME == 1 means NCAA_TEAMID played at home.
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 1), 'EFG'] = Weighted_4_factor['EFG'] * Weighted_4_factor['EFG_home']
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 1), 'TOV'] = Weighted_4_factor['TOV'] * Weighted_4_factor['TOV_home']
#Note: as written, the next two lines assign EFG * ORB_home to ORB and TOV * FTAR_home to FTAR.
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 1), 'ORB'] = Weighted_4_factor['EFG'] * Weighted_4_factor['ORB_home']
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 1), 'FTAR'] = Weighted_4_factor['TOV'] * Weighted_4_factor['FTAR_home']

#WLoc == 'H' with OUTCOME == 0 means the opponent (the winner) played at home.
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 0), 'EFG_op'] = Weighted_4_factor['EFG_op'] * Weighted_4_factor['EFG_home_op']
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 0), 'TOV_op'] = Weighted_4_factor['TOV_op'] * Weighted_4_factor['TOV_home_op']
#Note: as written, the next two lines rescale EFG_op and TOV_op a second time; ORB_op and FTAR_op are left unchanged.
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 0), 'EFG_op'] = Weighted_4_factor['EFG_op'] * Weighted_4_factor['ORB_home_op']
Weighted_4_factor.loc[(Weighted_4_factor.WLoc == 'H') &(Weighted_4_factor.OUTCOME == 0), 'TOV_op'] = Weighted_4_factor['TOV_op'] * Weighted_4_factor['FTAR_home_op']



Weighted_4_factor = Weighted_4_factor.drop(['EFG_home', 'TOV_home', 
                 'ORB_home', 'FTAR_home', 'EFG_home_op', 'TOV_home_op',
                 'ORB_home_op', 'FTAR_home_op', 'WLoc', 'TeamID_op','TeamID'], axis=1)

Weighted_4_factor.head()

Unnamed: 0,Season,NCAA_TEAMID,NCAA_O_TEAMID,OUTCOME,EFG,TOV,ORB,FTAR,EFG_op,TOV_op,ORB_op,FTAR_op
0,2010,3124,3201,1,0.484261,0.172037,0.364244,0.438309,0.492357,0.163805,0.359992,0.30205
1,2010,3124,3207,1,0.484261,0.172037,0.364244,0.438309,0.450856,0.168999,0.406329,0.327538
2,2010,3265,3207,0,0.478543,0.154016,0.307726,0.336385,0.450856,0.168999,0.406329,0.327538
3,2010,3124,3397,1,0.484261,0.172037,0.364244,0.438309,0.504445,0.158655,0.426177,0.296546
4,2010,3173,3397,0,0.465242,0.172238,0.381488,0.315887,0.1125,0.006141,0.426177,0.296546
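
One possible way to structure the adjustment, sketched on made-up numbers: scale each stat by its own weight, and pick the home side from explicit flags rather than from the game result. The columns `team_is_home`, `opp_is_home`, and the `*_weight` values are hypothetical here; in practice the flags might come from seeding. This is an illustration under those assumptions, not the kernel's own procedure.

```python
import pandas as pd

# Made-up matchup rows with hypothetical home flags and weights.
rows = pd.DataFrame({
    "EFG": [0.48, 0.47], "ORB": [0.36, 0.38],
    "EFG_op": [0.49, 0.50], "ORB_op": [0.36, 0.43],
    "EFG_weight": [1.02, 1.01], "ORB_weight": [1.05, 0.98],
    "EFG_weight_op": [1.01, 1.03], "ORB_weight_op": [1.02, 1.01],
    "team_is_home": [True, False],
    "opp_is_home": [False, True],
})

adjusted = rows.copy()
for stat in ["EFG", "ORB"]:
    # Read from the untouched `rows` frame so no stat is ever scaled twice.
    adjusted.loc[rows["team_is_home"], stat] = rows[stat] * rows[f"{stat}_weight"]
    adjusted.loc[rows["opp_is_home"], f"{stat}_op"] = rows[f"{stat}_op"] * rows[f"{stat}_weight_op"]

print(adjusted[["EFG", "ORB", "EFG_op", "ORB_op"]])
```

Looping over stat names keeps each stat paired with its matching weight column, which avoids copy-and-paste slips when the same line is repeated for four statistics.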


Let's take our new weighted data set and run it through a model.

In [11]:
train = Weighted_4_factor[Weighted_4_factor.Season <= 2017]
valid = Weighted_4_factor[Weighted_4_factor.Season == 2018]


X_train = train.drop(['Season', 'NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME'], axis=1)
y_train = train['OUTCOME']
X_valid = valid.drop(['Season', 'NCAA_TEAMID', 'NCAA_O_TEAMID', 'OUTCOME'], axis=1)
y_valid = valid['OUTCOME']


m = RandomForestClassifier(n_estimators=500, n_jobs=-1, oob_score=True, random_state=0)
m.fit(X_train, y_train)
print_score(m)

train log loss : 0.13295250099287026
test log loss : 0.36085889123856907
oob_score :  0.6989795918367347


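Looks better to me! The test log loss dropped from about 0.571 to about 0.361, and the OOB score rose from about 0.647 to about 0.699. Pretty surprising how important home court advantage can be.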

Please feel free to provide feedback and ask any questions you might have. 