# Regression Hyperparameter Optimization

In [2]:
# importing necessary packages
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

There are a handful of parameters and options that control how these elo ratings are built:
- K: the learning rate
- HOME_ADVANTAGE: how much of a home court advantage each team receives
- CONF_REGRESSION_FACTOR: how much a team's elo rating is regressed to their conference's mean elo
- D1_REGRESSION_FACTOR: how much a team's elo rating is regressed to the NCAA mean elo (which will always be 1500)
- CONF_TOURNAMENT (bool): whether a team's elo rating should be updated during their conference tournament
- NCAA_TOURNAMENT (bool): whether a team's elo rating should be updated during any NCAA tournament games
- OTHER_TOURNAMENT (bool): whether a team's elo rating should be updated during other postseason tournaments (NIT, CBI)
- EVAL_MODE: how to evaluate a model (currently using log-loss)
- EVAL_GAMES: which games are considered in evaluating the model

Theoretically, all of these parameters and options should be optimized jointly, but I can treat most of them as independent and optimize them separately. A few parameters can't be treated as independent:

- CONF_REGRESSION_FACTOR and D1_REGRESSION_FACTOR need to be considered together because they're dependent on each other. Each must be in the range [0,1], and their sum also has to be less than or equal to 1, since whatever weight is left over stays on the team's own rating. Intuitively, the larger one is, the smaller the other is.
- K should also be considered with the regression factors. Intuitively, these might correlate; how well we can regress each team's elo to their true rating affects how quickly we want our elo ratings to update.
- EVAL_GAMES will also affect how the other parameters should be set, but EVAL_GAMES should be set based on which games I want to use elo ratings to predict. For now, I'll evaluate all games; I want this model to predict who wins each game, not just postseason games.
- EVAL_MODE isn't something I want to optimize; choosing whichever metric makes the model look best is a good way to overfit.

HOME_ADVANTAGE and the TOURNAMENT parameters shouldn't really affect the other parameters. Thus, I'm going to only optimize three parameters here: K, CONF_REGRESSION_FACTOR, and D1_REGRESSION_FACTOR. 
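To make the constraint on the two regression factors concrete, here is a minimal sketch of the end-of-season regression as a weighted average of three terms: the team's own rating, its conference mean, and the D1 mean of 1500. The function name `regress_elo` and the example ratings are made up for illustration; only the weighting itself comes from the description above.

```python
D1_MEAN_ELO = 1500.0  # the NCAA-wide mean elo, fixed at 1500


def regress_elo(team_elo, conf_mean_elo, conf_factor, d1_factor):
    """Blend a team's elo with its conference mean and the D1 mean.

    Hypothetical helper for illustration only.
    """
    if not (0.0 <= conf_factor <= 1.0 and 0.0 <= d1_factor <= 1.0):
        raise ValueError("each regression factor must be in [0, 1]")
    if conf_factor + d1_factor > 1.0:
        raise ValueError("the regression factors must sum to at most 1")
    # Whatever weight the two factors leave over stays on the team's own rating
    own_weight = 1.0 - conf_factor - d1_factor
    return (own_weight * team_elo
            + conf_factor * conf_mean_elo
            + d1_factor * D1_MEAN_ELO)


# Made-up example: a 1700-rated team in a conference averaging 1600
example = regress_elo(1700.0, 1600.0, conf_factor=0.25, d1_factor=0.25)
print(example)
```

If the factors summed to more than 1, the weight on the team's own rating would go negative, which is why the pair has to be searched jointly rather than one at a time.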

In [6]:
# Setting constant parameters
DATA_PATH = "../../march-machine-learning-mania-2024-data"
HOME_ADVANTAGE = 100.

# Creating list of tuples being considered
PARAM_TUPLES = [] # Consists of tuples (K, CONF_REGRESSION_FACTOR, D1_REGRESSION_FACTOR, LOSS)
for i in range(32,64,4):
    for j in range(0,12,1):
        for k in range(0,12-j,1):
            PARAM_TUPLES.append( [i,j/20.,k/20., -1] )

This creates all 624 combinations of parameters such that:
- K is in range [32,60] and is a multiple of 4
- CONF_REGRESSION_FACTOR is in range [0,0.55] and is a multiple of 0.05
- D1_REGRESSION_FACTOR is in range [0,0.55] and is a multiple of 0.05
- CONF_REGRESSION_FACTOR + D1_REGRESSION_FACTOR is in range [0,0.55]
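As a quick sanity check on these bounds, the grid can be rebuilt with the same ranges as the loop above and inspected directly. This is only a sketch; `grid` and the summary variables are names introduced here, and the placeholder loss entry is left out.

```python
# Rebuild the same grid as the nested loops, without the placeholder loss
grid = [
    (k, j / 20.0, d / 20.0)
    for k in range(32, 64, 4)       # K: 32, 36, ..., 60
    for j in range(0, 12)           # conference factor: 0.00 to 0.55
    for d in range(0, 12 - j)       # D1 factor, capped so j + d <= 11
]

n_tuples = len(grid)
k_values = sorted({k for k, _, _ in grid})
max_conf = max(conf for _, conf, _ in grid)
max_d1 = max(d1 for _, _, d1 in grid)
max_sum = max(conf + d1 for _, conf, d1 in grid)

print(n_tuples, k_values, max_conf, max_d1, max_sum)
```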

For each of these tuples, I'm going to run the elo ratings on the entire set of data, calculate the mean log loss, and store it in the tuple. When this is complete, I will look for the set of hyperparameters which minimizes the log loss (the mean negative log-likelihood of the observed results). I expect that, if the loss were plotted over this grid of tuples, we should get a relatively smooth surface.

In [7]:
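# HELPER FUNCTIONS
def elo_pred(elo1, elo2):
    # Probability that the team rated elo1 beats the team rated elo2
    return 1. / (10. ** (-(elo1 - elo2) / 400.) + 1.)

def expected_margin(elo_diff):
    # Scaling term for the margin-of-victory multiplier;
    # it grows with the winner's elo advantage
    return 7.5 + 0.006 * elo_diff

def elo_update(w_elo, l_elo, margin, K):
    # Returns the pre-game win probability of the eventual winner and
    # the number of elo points to move from the loser to the winner
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return pred, update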
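To see how these helpers behave, the sketch below repeats them so it runs on its own and applies them to three made-up games with the same margin and K. The ratings, margin, and K value are illustrative assumptions, not values taken from the data.

```python
def elo_pred(elo1, elo2):
    return 1.0 / (10.0 ** (-(elo1 - elo2) / 400.0) + 1.0)


def expected_margin(elo_diff):
    return 7.5 + 0.006 * elo_diff


def elo_update(w_elo, l_elo, margin, K):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.0) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return pred, update


# Two evenly matched teams: the winner was a coin flip beforehand
even_pred, even_update = elo_update(1500.0, 1500.0, margin=10, K=48)

# A 200-point favorite wins by 10: the result was expected, so a small update
fav_pred, fav_update = elo_update(1600.0, 1400.0, margin=10, K=48)

# A 200-point underdog wins by 10: a surprise, so a large update
dog_pred, dog_update = elo_update(1400.0, 1600.0, margin=10, K=48)

print(even_pred, even_update)
print(fav_pred, fav_update)
print(dog_pred, dog_update)
```

The update shrinks for an expected result in two ways: `1 - pred` is smaller, and the multiplier is divided by a larger `expected_margin`.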

In [8]:
# Load the data once; none of it depends on the hyperparameters
rs = pd.read_csv(DATA_PATH + "/MRegularSeasonCompactResults.csv")
team_ids = set(rs.WTeamID).union(set(rs.LTeamID))

# Creating dict of conference affiliation by year
conf_df = pd.read_csv(DATA_PATH + "/MTeamConferences.csv")
conf_df['TeamYear'] = conf_df.Season.astype(str) + "_" + conf_df.TeamID.astype(str)
conf_dict = dict(zip(conf_df['TeamYear'], conf_df['ConfAbbrev']))
confs = set(conf_df['ConfAbbrev'])

for index in range(len(PARAM_TUPLES)):
    PARAM_TUPLE = PARAM_TUPLES[index]
    K = PARAM_TUPLE[0]
    CONF_REGRESSION_FACTOR = PARAM_TUPLE[1]
    D1_REGRESSION_FACTOR = PARAM_TUPLE[2]

    # Reset per-run state: empty conference lists and every team at 1500
    conf_elo_dict = {}
    for c in confs: conf_elo_dict[c] = []

    elo_dict = dict(zip(list(team_ids), [1500] * len(team_ids)))
    rs['margin'] = rs.WScore - rs.LScore

    preds = []
    w_elo = []
    l_elo = []

    season_conf_elos = []

    # Group the DataFrame by 'Season'
    grouped = rs.groupby('Season')

    # Iterate over each group
    for season, group in grouped:

        # Iterate over each game in the season
        for row in group.itertuples():
        
            # Get key data from current row
            w = row.WTeamID
            l = row.LTeamID
            margin = row.margin
            wloc = row.WLoc
            
            # Does either team get a home-court advantage?
            w_ad, l_ad = 0., 0.
            if wloc == "H":
                w_ad += HOME_ADVANTAGE
            elif wloc == "A":
                l_ad += HOME_ADVANTAGE
            
            # Get elo updates as a result of the game
            pred, update = elo_update(elo_dict[w] + w_ad,
                                    elo_dict[l] + l_ad, 
                                    margin, K)
            elo_dict[w] += update
            elo_dict[l] -= update
        
            # Save prediction and new Elos for each round
            preds.append(pred)
            w_elo.append(elo_dict[w])
            l_elo.append(elo_dict[l])
        
        # Assigning teams to conference
        for id in team_ids:
            if f"{season}_{id}" in conf_dict.keys(): conf_elo_dict[conf_dict[f"{season}_{id}"]].append(elo_dict[id]) # Only updating teams which are active
        
        # Calculating mean elo by conference
        for k, v in conf_elo_dict.items():
            if len(v) > 0: conf_elo_dict[k] = [sum(v)/len(v)] # Only modifying conferences with at least one team
    
        # Reverting each team's elo towards mean
        for id in team_ids:
            if f"{season}_{id}" in conf_dict.keys(): # Only updating teams which are D1
                this_conf = conf_dict[f"{season}_{id}"]
                this_conf_elo = conf_elo_dict[this_conf][0]
                elo_dict[id] = (1 - CONF_REGRESSION_FACTOR - D1_REGRESSION_FACTOR) * elo_dict[id] + CONF_REGRESSION_FACTOR * this_conf_elo + D1_REGRESSION_FACTOR * 1500
        
        # Clearing conference season ave elo ratings
        for k in conf_elo_dict.keys():
            if len(conf_elo_dict[k]) > 0: 
                season_conf_elos.append( (season, k, conf_elo_dict[k][0]) )
                conf_elo_dict[k] = []

    rs['w_elo'] = w_elo
    rs['l_elo'] = l_elo

    this_log_loss = np.mean(-np.log(preds))
    new_tuple = (K, CONF_REGRESSION_FACTOR, D1_REGRESSION_FACTOR, this_log_loss)
    PARAM_TUPLES[index] = new_tuple

    if (index + 1) % 50 == 0: print(f"Completed iteration {index+1}")




Completed iteration 50
Completed iteration 100
Completed iteration 150
Completed iteration 200
Completed iteration 250
Completed iteration 300
Completed iteration 350
Completed iteration 400
Completed iteration 450
Completed iteration 500
Completed iteration 550
Completed iteration 600
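The loss in the loop is computed as `np.mean(-np.log(preds))`. This matches the usual binary log loss because `elo_update` always returns the pre-game probability of the team that actually won, so every game has label 1. The sketch below checks that equivalence on made-up probabilities; `binary_log_loss` is an illustrative helper and not part of the notebook.

```python
import math


def binary_log_loss(labels, probs):
    """Mean binary cross-entropy, written out in full."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


# Made-up pre-game probabilities assigned to the eventual winner of four games
winner_probs = [0.9, 0.6, 0.35, 0.75]
labels = [1] * len(winner_probs)  # the "winner" always won

full = binary_log_loss(labels, winner_probs)
shortcut = sum(-math.log(p) for p in winner_probs) / len(winner_probs)

# A model that always predicts 0.5 scores log(2), a useful baseline
coin_flip = binary_log_loss([1, 1], [0.5, 0.5])

print(full, shortcut, coin_flip)
```

The coin-flip baseline of log(2), roughly 0.693, gives a reference point for judging the log-loss values reported in the results.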


In [13]:
# Saving the results to a csv file
param_df = pd.DataFrame(PARAM_TUPLES, columns=['K', 'CONF_REGRESSION_PARAM', 'D1_REGRESSION_PARAM', 'log_loss'])
param_df.to_csv("results/k_CONF-REGRESSION-PARAM_D1-REGRESSION-PARAM_results.csv", index=False)

Now, to analyze the results. I'm going to start by printing out the ten sets of parameters which achieved the lowest log loss. I expect that these will be bunched relatively tightly. That is, if (X,Y,Z) is the optimal set of hyperparameters, I'm also expecting to see (X+4,Y,Z), (X-4,Y,Z), (X,Y+.05,Z), etc. among the top sets of hyperparameters. I also want the optimal set of hyperparameters to sit away from the edges of the search range. It's okay if D1_REGRESSION_FACTOR or CONF_REGRESSION_FACTOR is 0, because that bound is set by the domain; it would be nonsensical to move each team away from the D1 or conference mean elo rating each year. If the optimal hyperparameter is at any of the other bounds, though, this indicates that my search range is too narrow and should be extended.

In [14]:
# Not the most efficient way to achieve these results but a simple way
param_df = pd.read_csv("results/k_CONF-REGRESSION-PARAM_D1-REGRESSION-PARAM_results.csv")
param_df = param_df.sort_values(by=['log_loss'], ascending=True)
param_df.head(10)

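|     | K | CONF_REGRESSION_PARAM | D1_REGRESSION_PARAM | log_loss |
| --- | --- | --- | --- | --- |
| 375 | 48 | 0.35 | 0.0 | 0.519821 |
| 458 | 52 | 0.40 | 0.0 | 0.519842 |
| 453 | 52 | 0.35 | 0.0 | 0.519844 |
| 380 | 48 | 0.40 | 0.0 | 0.519898 |
| 369 | 48 | 0.30 | 0.0 | 0.519970 |
| 462 | 52 | 0.45 | 0.0 | 0.520045 |
| 536 | 56 | 0.40 | 0.0 | 0.520046 |
| 447 | 52 | 0.30 | 0.0 | 0.520069 |
| 297 | 44 | 0.35 | 0.0 | 0.520093 |
| 531 | 56 | 0.35 | 0.0 | 0.520121 |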
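The boundary check described earlier can be sketched as a small function that reports which of the chosen search bounds a tuple sits on. The function name `boundary_flags` and its flag names are introduced here for illustration; the grid values mirror the ranges used to build the search, and the tuple checked is the best row from the table above.

```python
# Grid values used in the search
K_VALUES = list(range(32, 64, 4))
FACTOR_VALUES = [i / 20.0 for i in range(0, 12)]


def boundary_flags(k, conf_factor, d1_factor, tol=1e-9):
    """Report which chosen search bounds a parameter tuple sits on.

    The lower bound of 0 on each regression factor is imposed by the
    domain, so only bounds that were chosen for the search are flagged.
    """
    max_factor = max(FACTOR_VALUES)
    return {
        "K_at_min": k == min(K_VALUES),
        "K_at_max": k == max(K_VALUES),
        "conf_at_max": abs(conf_factor - max_factor) < tol,
        "d1_at_max": abs(d1_factor - max_factor) < tol,
        "sum_at_max": abs(conf_factor + d1_factor - max_factor) < tol,
    }


# Best row from the results table: K=48, conference factor 0.35, D1 factor 0
best = (48, 0.35, 0.0)
flags = boundary_flags(*best)
search_range_ok = not any(flags.values())

print(flags, search_range_ok)
```

If any flag came back true for the best tuple, the natural next step would be to widen that range and rerun the search.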


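This is a clear indicator of where the optimum lies: all ten of the best tuples have D1_REGRESSION_FACTOR set to 0, and the top results cluster around CONF_REGRESSION_FACTOR at 0.35 or 0.4 and K at 48 or 52. None of these values sits on a bound that I chose, so the search range looks adequate. I'll split the difference and make my optimal hyperparameter tuple (50, 0.375, 0); this exact tuple is an interpolation and was not itself evaluated in the grid.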