To predict the outcome of March Madness games, it is first important to identify the best teams. To do this I will put together a Value Over Average statistic for each team, built from every play of a season.

Methodology:

Points scored in a close game matter more than points scored in a blowout, and points scored in a close game with only a few seconds left matter more than points scored at the very beginning of a game. With these in mind, the value of a play needs to take into account both the score and the time remaining.

The baseline value of a play will be points scored:

    Free throw = 1
    2-pointer = 2
    3-pointer = 3

How to categorize defense?

    Blocks, steals, defensive rebounds can be used. Should the defensive team be rewarded for the other team missing? 
    
    There will be an OffenseType and DefenseType parameter for each value. 

Close games should result in higher values

    if deltaScore +- 5 add 10% ?
    
Time remaining should adjust this further. Down by 2 with 5 seconds and you hit a 3 should be extreme value. 
    Start this adjustment with 5 mins left in the game?
    Need to adjust for overtime
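
One possible way to handle the overtime note above, as a minimal sketch. It assumes that ElapsedSeconds keeps counting past the 40 minutes of regulation and that every overtime period lasts 5 minutes; the function name `seconds_remaining` and the two constants are made up for illustration and are not used elsewhere in this notebook.

```python
REGULATION_SECONDS = 40 * 60  # two 20-minute halves
OVERTIME_SECONDS = 5 * 60     # assumed length of each overtime period


def seconds_remaining(elapsed_seconds):
    """Seconds left in the current regulation game or overtime period."""
    if elapsed_seconds <= REGULATION_SECONDS:
        return REGULATION_SECONDS - elapsed_seconds

    into_overtime = elapsed_seconds - REGULATION_SECONDS
    # Ceiling division: number of overtime periods that have started so far.
    periods_started = -(-into_overtime // OVERTIME_SECONDS)
    return periods_started * OVERTIME_SECONDS - into_overtime


print(seconds_remaining(2395))  # 5 seconds left in regulation
print(seconds_remaining(2405))  # 295 seconds left in the first overtime
```

An event stamped at exactly 40*60 is treated here as the end of regulation rather than the start of overtime; that boundary choice is also an assumption.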

A miss should have an equally negative effect on a play's value in high-stakes situations:
    under 5 mins with a margin of 6 points?


Thoughts for other factors:

Blowing out bad teams is more important than close wins against good teams. 
Do Assists indicate more stable teams?
Does the % of points from jump shots indicate anything?


Adjustments:

Need to find ways to adjust for the opponent and for the conference the team plays in (the better conferences are simply better).

In [16]:
import pandas as pd

In [17]:
plays2015 = pd.read_csv('MEvents2015.csv')

In [18]:
plays2015.head()

Unnamed: 0,EventID,Season,DayNum,WTeamID,LTeamID,WFinalScore,LFinalScore,WCurrentScore,LCurrentScore,ElapsedSeconds,EventTeamID,EventPlayerID,EventType,EventSubType,X,Y,Area
0,1,2015,11,1103,1420,74,57,0,0,19,1103,100,miss3,unk,0,0,0
1,2,2015,11,1103,1420,74,57,0,0,19,1420,11784,reb,def,0,0,0
2,3,2015,11,1103,1420,74,57,0,0,27,1420,11789,made2,dunk,0,0,0
3,4,2015,11,1103,1420,74,57,0,0,27,1420,11803,assist,,0,0,0
4,5,2015,11,1103,1420,74,57,0,0,59,1103,87,made2,jump,0,0,0


In [19]:
event_types = plays2015['EventType'].unique()

In [20]:
event_types

array(['miss3', 'reb', 'made2', 'assist', 'turnover', 'steal', 'foul',
       'miss2', 'made3', 'timeout', 'sub', 'made1', 'miss1', 'block'],
      dtype=object)

In [21]:
def adjust_value(base_value, time_bonus, close_game, blown_out):
    #This can be adjusted, but basically it adds 10% to the value of plays made late in a game
    time_value = base_value * .1 if time_bonus else 0
    
    #Plays in a close game get a further 10%
    close_value = base_value * .1 if close_game else 0
    
    adjusted_value = base_value + time_value + close_value
    
    #If the team is being blown out offense doesn't matter
    if adjusted_value > 0 and blown_out:
        adjusted_value = 0
    
    return adjusted_value

def get_value(event):
    # Calculate Current Score Differential (WinningTeam - LosingTeam):
    deltaScore = event['WCurrentScore'] - event['LCurrentScore']
    
    #Calculate time remaining and whether the time bonus applies (5 mins or less left in regulation).
    #Overtime is not handled here: any ElapsedSeconds past 40*60 gives a negative time_remaining, which still earns the bonus.
    time_remaining = 40*60 - event['ElapsedSeconds']
    time_bonus = time_remaining <= 300
        
    #Find out if close game bonus is in effect. This will be deltaScore of 6 or less and 5 mins or less
    close_game = abs(deltaScore) <= 6 and time_bonus
        
    #Find out if team is getting blown out. Check if deltaScore is > 15 and if the team is the winning team or not
    #This will only apply to Offensive Value Additions? 
    if deltaScore > 15 and event['EventTeamID']==event['LTeamID']: #Check if the event team is the team getting blown out (the column is EventTeamID, not EventTeamId)
        blown_out = True
    else:
        blown_out = False
        
    #Whether the event is Offensive or Defensive is determined separately in get_value_type.
        
    #Determine base value:
    if event['EventType']=='made1':
        base_value = 1
    elif event['EventType']=='made2':
        base_value = 2
    elif event['EventType']=='made3':
        base_value = 3
    elif event['EventType']=='turnover':
        base_value = -1
    elif event['EventType']=='steal':
        base_value = -1
    elif event['EventType']=='block':
        base_value = -1
    elif (event['EventType'] == 'reb') & (event['EventSubType']=='def'):
        base_value = -1
    else:
        base_value = 0
        
    #Calculate adjusted value
    adjusted_value = adjust_value(base_value, time_bonus, close_game, blown_out)
    
    return adjusted_value

def get_value_type(event):
    #Determine if the event is Offensive or Defensive.
    is_def_rebound = event['EventType'] == 'reb' and event['EventSubType'] == 'def'
    if is_def_rebound or event['EventType'] in ('block', 'steal'):
        return 'Def'
    return 'Off'

#Want to add a GameID column to make grouping easier.
#To keep it simple it is built from the Season, DayNum, WTeamID and LTeamID
def make_gameid(event):
    year = event['Season']
    DayNum = event['DayNum']
    wteam = event['WTeamID']
    lteam = event['LTeamID']
    
    game_id = int(str(year)+str(DayNum)+str(wteam)+str(lteam))
    return game_id
        

In [22]:
test_game = plays2015.loc[(plays2015['WTeamID']==1103) & (plays2015['LTeamID']==1420)]

In [23]:
test_game.head(20)

Unnamed: 0,EventID,Season,DayNum,WTeamID,LTeamID,WFinalScore,LFinalScore,WCurrentScore,LCurrentScore,ElapsedSeconds,EventTeamID,EventPlayerID,EventType,EventSubType,X,Y,Area
0,1,2015,11,1103,1420,74,57,0,0,19,1103,100,miss3,unk,0,0,0
1,2,2015,11,1103,1420,74,57,0,0,19,1420,11784,reb,def,0,0,0
2,3,2015,11,1103,1420,74,57,0,0,27,1420,11789,made2,dunk,0,0,0
3,4,2015,11,1103,1420,74,57,0,0,27,1420,11803,assist,,0,0,0
4,5,2015,11,1103,1420,74,57,0,0,59,1103,87,made2,jump,0,0,0
5,6,2015,11,1103,1420,74,57,0,0,72,1420,11784,turnover,unk,0,0,0
6,7,2015,11,1103,1420,74,57,0,0,73,1103,107,steal,,0,0,0
7,8,2015,11,1103,1420,74,57,0,0,75,1420,11803,foul,unk,0,0,0
8,9,2015,11,1103,1420,74,57,0,0,94,1103,92,made2,jump,0,0,0
9,10,2015,11,1103,1420,74,57,0,0,101,1420,11789,made2,lay,0,0,0


In [24]:
test_game = test_game.copy() #Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
test_game['Value'] = test_game.apply(get_value,axis=1)
test_game['ValueType'] = test_game.apply(get_value_type, axis=1)


In [25]:
test_game.head(20)

Unnamed: 0,EventID,Season,DayNum,WTeamID,LTeamID,WFinalScore,LFinalScore,WCurrentScore,LCurrentScore,ElapsedSeconds,EventTeamID,EventPlayerID,EventType,EventSubType,X,Y,Area,Value,ValueType
0,1,2015,11,1103,1420,74,57,0,0,19,1103,100,miss3,unk,0,0,0,0.0,Off
1,2,2015,11,1103,1420,74,57,0,0,19,1420,11784,reb,def,0,0,0,-1.0,Def
2,3,2015,11,1103,1420,74,57,0,0,27,1420,11789,made2,dunk,0,0,0,2.0,Off
3,4,2015,11,1103,1420,74,57,0,0,27,1420,11803,assist,,0,0,0,0.0,Off
4,5,2015,11,1103,1420,74,57,0,0,59,1103,87,made2,jump,0,0,0,2.0,Off
5,6,2015,11,1103,1420,74,57,0,0,72,1420,11784,turnover,unk,0,0,0,-1.0,Off
6,7,2015,11,1103,1420,74,57,0,0,73,1103,107,steal,,0,0,0,-1.0,Def
7,8,2015,11,1103,1420,74,57,0,0,75,1420,11803,foul,unk,0,0,0,0.0,Off
8,9,2015,11,1103,1420,74,57,0,0,94,1103,92,made2,jump,0,0,0,2.0,Off
9,10,2015,11,1103,1420,74,57,0,0,101,1420,11789,made2,lay,0,0,0,2.0,Off


In [26]:
test_game.groupby(['EventTeamID','ValueType']).sum()

EventTeamID,ValueType,EventID,Season,DayNum,WTeamID,LTeamID,WFinalScore,LFinalScore,WCurrentScore,LCurrentScore,ElapsedSeconds,EventPlayerID,X,Y,Area,Value
1103,Def,8821,74555,407,40811,52540,2738,2109,0,0,48159,3345,0,0,0,-37.4
1103,Off,54353,451360,2464,247072,318080,16576,12768,0,0,293775,32245,0,0,0,63.2
1420,Def,8275,64480,352,35296,45440,2368,1824,0,0,44371,365508,0,0,0,-33.2
1420,Off,47867,392925,2145,215085,276900,14430,11115,0,0,257866,2121994,0,0,0,44.0


In [None]:
plays2015['Value'] = plays2015.apply(get_value,axis=1)
plays2015['ValueType'] = plays2015.apply(get_value_type, axis=1)
plays2015['GameID'] = plays2015.apply(make_gameid, axis=1)
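
Applying get_value row by row over a full season can be slow. As a rough sketch, the same rules (base points, +10% in the last 5 minutes, +10% more when the margin is 6 or less, and zeroing positive values for a team trailing by more than 15) can be written with column operations. The function name `vectorized_values` is made up for illustration, and the sketch assumes the column names shown in the event table above.

```python
import numpy as np
import pandas as pd


def vectorized_values(plays: pd.DataFrame) -> pd.Series:
    """Column-wise version of the per-play value rules (illustrative sketch)."""
    delta = plays['WCurrentScore'] - plays['LCurrentScore']
    time_bonus = (40 * 60 - plays['ElapsedSeconds']) <= 300
    close_game = (delta.abs() <= 6) & time_bonus
    blown_out = (delta > 15) & (plays['EventTeamID'] == plays['LTeamID'])

    event = plays['EventType']
    is_def_reb = (event == 'reb') & (plays['EventSubType'] == 'def')
    negative_play = event.isin(['turnover', 'steal', 'block']) | is_def_reb

    # Base value: points for makes, -1 for turnovers and defensive plays, else 0.
    base = np.select(
        [event == 'made1', event == 'made2', event == 'made3', negative_play],
        [1, 2, 3, -1],
        default=0,
    )
    base = pd.Series(base, index=plays.index, dtype=float)

    # Same arithmetic as adjust_value: base + 10% time bonus + 10% close-game bonus.
    value = base + base * 0.1 * time_bonus + base * 0.1 * close_game

    # Positive values do not count for a team that is being blown out.
    return value.mask(blown_out & (value > 0), 0.0)
```

Comparing the output with `plays2015.apply(get_value, axis=1)` on a single game would be a reasonable check before relying on it.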

In [None]:
teams2015 = plays2015.groupby(['EventTeamID','ValueType'])[['Value']].sum() #Only Value is meaningful to sum; summing IDs, scores and text columns is not

In [None]:
teams2015.tail(10)

In [None]:
teams = pd.read_csv('MTeams.csv')

In [None]:
teams.head()

In [None]:
team_ids = teams2015.index.get_level_values(0).unique()

In [None]:
team_ids = team_ids.drop(0)

In [None]:
team_ids

In [None]:
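def_value = []
off_value = []
total_value = []
for team in team_ids:
    #Look the values up by label rather than by position so a missing 'Def' or 'Off' row cannot be silently misread
    def_value.append(teams2015.loc[(team,'Def'),'Value'])
    off_value.append(teams2015.loc[(team,'Off'),'Value'])
    #Defensive values are negative, so subtracting them adds to the total
    total_value.append(off_value[-1] - def_value[-1])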
    

In [None]:
df_value = pd.DataFrame(zip(list(team_ids),def_value,off_value,total_value),columns=['TeamID','Def Value','Off Value','Total Value'])
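
The same Def / Off / Total table can also be built without a loop by pivoting the ValueType level into columns. This is only a sketch: the function name `team_value_table` is made up, and it assumes the input is indexed by (EventTeamID, ValueType) with a Value column, like the grouped totals above.

```python
import pandas as pd


def team_value_table(team_totals: pd.DataFrame) -> pd.DataFrame:
    """Pivot (EventTeamID, ValueType) totals into one row per team."""
    wide = team_totals['Value'].unstack('ValueType').fillna(0.0)
    wide = wide.rename(columns={'Def': 'Def Value', 'Off': 'Off Value'})
    # Defensive values are negative, so subtracting them adds to the total.
    wide['Total Value'] = wide['Off Value'] - wide['Def Value']
    wide.columns.name = None
    return wide.rename_axis('TeamID').reset_index()


# Toy totals for two made-up teams.
toy = pd.DataFrame({
    'EventTeamID': [1, 1, 2, 2],
    'ValueType': ['Def', 'Off', 'Def', 'Off'],
    'Value': [-3.0, 10.0, -1.0, 8.0],
})
print(team_value_table(toy.groupby(['EventTeamID', 'ValueType']).sum()))
```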

In [None]:
df_value.tail()

In [None]:
test_game['GameID'] = test_game.apply(make_gameid,axis=1)

In [None]:
test_game.head()

In [None]:
games2015 = plays2015.groupby(by=['GameID','EventTeamID','ValueType'])['Value'].sum()

In [None]:
games2015.head()

In [None]:
games2015[20151111031420,1103,'Def']

In [None]:
test_game_id = 20151111031420

In [None]:
games2015.index.get_level_values(0)

In [None]:
#Get the opponent teamID from just gameID
a = games2015.loc[20151111031420].index.get_level_values(0).unique()
a

In [None]:
b = [i for i in a if i != 1103][0] #This game is 1103 vs 1420, so filtering out 1103 leaves the opponent
b

In [None]:
games2015.loc[20151111031420,b,'Off']

In [None]:
team_ids

In [None]:
def get_opponent_avgs(team_id):
    game_ids = games2015[:,team_id,'Def'].index
    num_games = len(game_ids) #This will be useful
    
    opponent_off_value = []
    opponent_def_value = []
    
    for game in game_ids:
        opponent = [i for i in games2015.loc[game].index.get_level_values(0).unique() if i != team_id][0]
        opponent_off_value.append(games2015[game,opponent,'Off'])
        opponent_def_value.append(games2015[game,opponent,'Def'])
        
    opponent_avg_off_value = sum(opponent_off_value) / len(opponent_off_value)
    opponent_avg_def_value = sum(opponent_def_value) / len(opponent_def_value)
    
    df_info = pd.DataFrame([[team_id,num_games,opponent_avg_off_value,opponent_avg_def_value]],columns=['TeamID','Games Played','Opponents Avg Off Value','Opponents Avg Def Value'])
    
    return df_info

#DataFrame.append was removed in pandas 2.0, so build the per-team frames and concatenate them once
df_opponents_avgs = pd.concat([get_opponent_avgs(team) for team in team_ids],ignore_index=True)
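
As an alternative to looping over every game, the opponent averages can be sketched as a self-merge on GameID. The function name `opponent_averages` and the output column names are made up for illustration; the input is assumed to be a Series indexed by (GameID, EventTeamID, ValueType), like games2015. Unlike the loop above, this version counts every game a team appears in, even one where it has no 'Def' events.

```python
import pandas as pd


def opponent_averages(games: pd.Series) -> pd.DataFrame:
    """Average Off and Def value posted by each team's opponents."""
    # One row per (game, team) with separate Def and Off columns.
    wide = games.unstack('ValueType').reset_index()

    # Pair every team with the other team in the same game.
    paired = wide.merge(wide, on='GameID', suffixes=('', '_opp'))
    paired = paired[paired['EventTeamID'] != paired['EventTeamID_opp']]

    return paired.groupby('EventTeamID').agg(
        games_played=('GameID', 'nunique'),
        opp_avg_off=('Off_opp', 'mean'),
        opp_avg_def=('Def_opp', 'mean'),
    )


# Toy data: team 10 plays team 20 in game 1 and team 30 in game 2.
index = pd.MultiIndex.from_tuples(
    [(1, 10, 'Def'), (1, 10, 'Off'), (1, 20, 'Def'), (1, 20, 'Off'),
     (2, 10, 'Def'), (2, 10, 'Off'), (2, 30, 'Def'), (2, 30, 'Off')],
    names=['GameID', 'EventTeamID', 'ValueType'],
)
toy_games = pd.Series([-5.0, 20.0, -3.0, 15.0, -7.0, 30.0, -1.0, 25.0],
                      index=index, name='Value')
print(opponent_averages(toy_games))
```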

In [None]:
get_opponent_avgs(1390)

In [None]:
df_opponents_avgs.head()

In [None]:
df_value.head()

In [None]:
#Now need to adjust the value a team generated in a game to the average of the opponent

def adjusted_value(team_id):
    game_ids = games2015[:,team_id,'Def'].index
    num_games = len(game_ids) #This will be useful
    
    adjusted_off_value = []
    adjusted_def_value = []
    
    for game in game_ids:
        opponent = [i for i in games2015.loc[game].index.get_level_values(0).unique() if i != team_id][0] #This gives the opponent's id for the game
        opponent_row = df_opponents_avgs.loc[df_opponents_avgs['TeamID']==opponent].iloc[0] #Averages that teams post against this opponent
        
        adjusted_off_value.append(games2015[game,team_id,'Off'] - opponent_row['Opponents Avg Off Value'])
        adjusted_def_value.append(games2015[game,team_id,'Def'] - opponent_row['Opponents Avg Def Value'])

    adjusted_off_value = sum(adjusted_off_value) / len(adjusted_off_value)
    adjusted_def_value = sum(adjusted_def_value) / len(adjusted_def_value)
    
    total_adjusted_mean_value = adjusted_off_value - adjusted_def_value
    
    df_info = pd.DataFrame([[team_id,num_games,adjusted_off_value,adjusted_def_value,total_adjusted_mean_value]],columns=['TeamID','Number of Games','Average Adjusted Off Value','Average Adjusted Def Value','Total Average Value'])
    
    return df_info

#DataFrame.append was removed in pandas 2.0, so build the per-team frames and concatenate them once
df_avgs = pd.concat([adjusted_value(team) for team in team_ids],ignore_index=True)

In [None]:
df_avgs.head()

In [None]:
#Combine this with the teams df
teams.head()

In [None]:
test = teams.set_index('TeamID').join(df_avgs.set_index('TeamID'))

In [None]:
test.sort_values(by='Total Average Value',ascending=False).head(40)

In [None]:
data = teams.set_index('TeamID').join(df_value.set_index('TeamID'))
#Leaving the value columns as totals means the analysis rewards teams that played more games (e.g. conference tourney success)

In [None]:
data=data.join(df_avgs.set_index('TeamID'))
data.drop(['FirstD1Season','LastD1Season'],axis=1,inplace=True)
data.sort_values(by='Off Value',ascending=False).head(30)


In [None]:
#With the value stats completed (the parameters can still be adjusted), we can now begin compiling some season total stats: W, L, win diff, etc.

stats = pd.read_csv('MRegularSeasonDetailedResults.csv')

In [None]:
stats.head()

In [None]:
def get_year_stats(year):
    stats_year = stats.loc[stats['Season']==year]
    
    avg_win_diffs = []
    avg_game_diffs = []
    num_wins = []
    num_losses = []
    for team in team_ids:
        #Find the average WinDiff for each team
        team_wins = stats_year.loc[stats_year['WTeamID']==team]
        team_avg_win_diff = (team_wins['WScore'].sum() - team_wins['LScore'].sum()) / len(team_wins['WScore'])
        avg_win_diffs.append(round(team_avg_win_diff,2))
        
        #Get total average game diff
        
        team_losses = stats_year.loc[stats_year['LTeamID']==team]
        num_games = len(team_wins) + len(team_losses) #Average over every game played, not just the wins
        team_avg_game_diff = (team_wins['WScore'].sum() + team_losses['LScore'].sum() - team_wins['LScore'].sum() - team_losses['WScore'].sum()) / num_games
        avg_game_diffs.append(round(team_avg_game_diff,2))
        
        #Team counting stats
        num_wins.append(len(team_wins))
        num_losses.append(len(team_losses))
        
        
    df_info = pd.DataFrame(zip(team_ids, [year]*len(num_wins), num_wins, num_losses, avg_win_diffs, avg_game_diffs),columns=['TeamID','Year','Wins','Losses','Avg Win Diff','Avg Game Diff'])
        
        
    return df_info
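
A loop-free sketch of the same season summary: stack the winner and loser sides of each game into one long table, then group by team. The function name `season_summary` and its output column names are made up for illustration; only the Season, WTeamID, LTeamID, WScore and LScore columns used above are assumed.

```python
import pandas as pd


def season_summary(results: pd.DataFrame, year: int) -> pd.DataFrame:
    """Wins, losses and average scoring margin per team for one season."""
    season = results.loc[results['Season'] == year]

    # One row per team per game, with the margin from that team's point of view.
    winners = pd.DataFrame({'TeamID': season['WTeamID'],
                            'Margin': season['WScore'] - season['LScore'],
                            'Win': 1})
    losers = pd.DataFrame({'TeamID': season['LTeamID'],
                           'Margin': season['LScore'] - season['WScore'],
                           'Win': 0})
    long = pd.concat([winners, losers], ignore_index=True)

    summary = long.groupby('TeamID').agg(
        Wins=('Win', 'sum'),
        Games=('Win', 'size'),
        AvgGameDiff=('Margin', 'mean'),
    )
    summary['Losses'] = summary['Games'] - summary['Wins']
    return summary


# Toy results: two 2015 games between made-up teams 1 and 2, plus one 2014 game.
toy_results = pd.DataFrame({
    'Season': [2015, 2015, 2014],
    'WTeamID': [1, 2, 1],
    'LTeamID': [2, 1, 2],
    'WScore': [70, 60, 99],
    'LScore': [60, 55, 1],
})
print(season_summary(toy_results, 2015))
```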

In [None]:
df_win_diffs = get_year_stats(2015)

In [None]:
data=data.join(df_win_diffs.set_index('TeamID'))

In [None]:
data.sort_values(by='Avg Game Diff',ascending=False).head(40)

In [None]:
#Try to grab the Kenpom ratings. Note that for previous years these will include the tourney games

In [None]:
kenpom2015 = pd.read_csv('2015Kenpom.csv')

In [14]:
kenpom2015.head()

Unnamed: 0,TeamName,Kenpom Overall,Kenpom Tempo,Kenpom RankAdjOE,Kenpom RankAdjDE
0,Kentucky,1,251,5,2
1,Arizona,2,78,11,3
2,Wisconsin,3,347,1,30
3,Virginia,4,349,27,1
4,Villanova,5,181,4,13


In [15]:
# Check for any errors in the kenpom document
dict_fix = {} #Created once, outside the loop, so corrections are not reset on every name
for name in kenpom2015['TeamName']:
    try:
        team_id = data.loc[data['TeamName']==name.strip('.')].index[0]
    except IndexError:
        print('{} is not in the database'.format(name))
        correct = input('What is the correct spelling? ')
        dict_fix[name] = correct


NameError: name 'data' is not defined

In [None]:
data.loc[data['TeamName']=='Arizona']

In [None]:
kenpom2015.loc[kenpom2015['TeamName']=='Arizona']

In [None]:
kenpom2015.head()

In [None]:
def get_team_id(team):
    #Reuse the teams dataframe loaded earlier instead of re-reading MTeams.csv for every row
    try:
        return teams.loc[teams['TeamName']==team,'TeamID'].values[0]
    except IndexError:
        print(team) #Unmatched names are printed and end up with a TeamID of None

In [None]:
kenpom2015['TeamID'] = kenpom2015.apply(lambda team: get_team_id(team['TeamName'].strip('.')),axis=1)
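
A sketch of the same name-to-ID lookup done in one pass, with the unmatched names returned instead of printed. The function name `attach_team_ids` and the `fixes` dictionary of spelling corrections are made up for illustration, and the team IDs in the toy data are invented.

```python
import pandas as pd


def attach_team_ids(ratings, teams, fixes=None):
    """Return ratings with a TeamID column, plus the list of unmatched names."""
    fixes = fixes or {}
    # Strip trailing periods, then apply any manual spelling corrections.
    cleaned = ratings['TeamName'].str.strip('.').map(lambda n: fixes.get(n, n))

    name_to_id = teams.set_index('TeamName')['TeamID']
    out = ratings.copy()
    out['TeamID'] = cleaned.map(name_to_id)  # NaN where no match is found

    unmatched = out.loc[out['TeamID'].isna(), 'TeamName'].tolist()
    return out, unmatched


# Toy data with invented IDs.
toy_teams = pd.DataFrame({'TeamName': ['Kentucky', 'Arizona', 'St Example'],
                          'TeamID': [1, 2, 3]})
toy_ratings = pd.DataFrame({'TeamName': ['Kentucky', 'Arizona', 'Saint Example', 'Nowhere U']})

with_ids, missing = attach_team_ids(toy_ratings, toy_teams,
                                    fixes={'Saint Example': 'St Example'})
print(missing)
```

Returning the unmatched names makes it possible to build up the corrections dictionary without typing answers into input() on every run.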

In [None]:
kenpom2015.set_index('TeamID',inplace=True)

In [None]:
#Join Kenpom and data dataframes
df_all_2015 = data.join(kenpom2015.drop('TeamName',axis=1))

In [None]:
kenpom2015.head()


In [None]:
data.head()

In [None]:
df_all_2015.sort_values(by='Total Average Value',ascending=False).head(10)

df_all_2015 contains all the data for each team in one place for the 2015 season. The next step is to set up a method that generates all possible matchups in the 2015 tourney and simulates the outcomes.

In [None]:
df = pd.read_csv('MNCAATourneyCompactResults.csv')

In [None]:
df_2015 = df.loc[df['Season']==2015]

In [None]:
df_2015.head()

Need to get a list of all the teams in the 2015 tourney. (Will then repeat for 2016-2019)

In [None]:
teams_2015 = list(df_2015['WTeamID'].unique())
teams_2015 = teams_2015 + list(df_2015['LTeamID'].unique())
teams_2015 = list(set(teams_2015))

Now need to create a dataframe of matchup data to train a model on

In [None]:
import itertools

In [None]:
matchups = list(itertools.combinations(teams_2015,2))
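
A small illustration of what itertools.combinations produces here, using made-up team IDs. Each pair appears exactly once and in one orientation only, so a field of 68 teams gives 68*67/2 = 2278 possible matchups.

```python
import itertools
import math

example_teams = [1101, 1102, 1103]  # made-up IDs for illustration
example_pairs = list(itertools.combinations(example_teams, 2))
print(example_pairs)  # [(1101, 1102), (1101, 1103), (1102, 1103)]

# Number of unordered pairs in a 68-team field.
print(math.comb(68, 2))  # 2278
```

Because (A, B) is generated but (B, A) is not, which team ends up in the unprefixed columns and which in the Opp_ columns depends on the order of the team list.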

In [None]:
def get_stats(team_id, dataframe):
    team_data = dataframe.loc[team_id][['Number of Games','Average Adjusted Off Value','Average Adjusted Def Value','Total Average Value','Wins','Avg Win Diff','Kenpom Overall','Kenpom RankAdjOE','Kenpom RankAdjDE']]
    return team_data

def matchup_data_gen(teamID_1, teamID_2, dataframe):
    try:
        team_1_data = get_stats(teamID_1, dataframe).copy()
        team_1_data['TeamID'] = team_1_data.name

        team_2_data = get_stats(teamID_2, dataframe).copy()
        team_2_data['TeamID'] = team_2_data.name
        team_2_data = team_2_data.add_prefix('Opp_')

        #Series.append was removed in pandas 2.0; pd.concat does the same job
        matchup_row = pd.concat([team_1_data, team_2_data]).to_frame().T

        return matchup_row
    except AttributeError:
        print(teamID_1, teamID_2)


For each possible matchup generate a row for the dataframe

In [None]:
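#Keep every row instead of overwriting the result on each pass; matchups that failed return None and are skipped
matchup_rows = [matchup_data_gen(matchup[0],matchup[1],df_all_2015) for matchup in matchups]
df_matchups = pd.concat([row for row in matchup_rows if row is not None],ignore_index=True)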
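
The full matchup table can also be sketched without calling a function once per pair, by selecting the stats rows for each side of every pair in bulk. The function name `build_matchup_frame` is made up for illustration; it assumes a stats table indexed by TeamID (like df_all_2015) and raises a KeyError if any team is missing from it.

```python
import itertools
import pandas as pd


def build_matchup_frame(team_ids, stats, feature_cols):
    """One row per unordered pair of teams, with both teams' features side by side."""
    pairs = pd.DataFrame(
        list(itertools.combinations(sorted(team_ids), 2)),
        columns=['TeamID', 'Opp_TeamID'],
    )
    # Look up each side's features by label, then align the rows by position.
    left = stats.loc[pairs['TeamID'].tolist(), feature_cols].reset_index(drop=True)
    right = (stats.loc[pairs['Opp_TeamID'].tolist(), feature_cols]
             .add_prefix('Opp_')
             .reset_index(drop=True))
    return pd.concat([pairs, left, right], axis=1)


# Toy stats for three made-up teams.
toy_stats = pd.DataFrame({'Wins': [30, 25, 20]}, index=[10, 20, 30])
print(build_matchup_frame([30, 10, 20], toy_stats, ['Wins']))
```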

In [None]:
matchup_data_gen(matchups[0][0],matchups[0][1],df_all_2015)

In [None]:
data = matchup_data_gen(matchups[20][0],matchups[20][1],df_all_2015)

In [None]:
data

In [None]:
a = get_stats(1120,df_all_2015)

In [None]:
type(a)

In [None]:
df_all_2015.loc[1248]