### ELO

The purpose of this notebook is to take all of the games played by each team and compute an Elo rating for each team.

Before we begin, we import the necessary packages and create our list of teams.

In [1]:
import pandas as pd
import pickle

In [2]:
teams = ["Bethel", "Goshen", "Grace", "HU", "IWU", "Marian", "MVNU", "SAU", "SFU", "Taylor"]

We load the list of conference games played by each team in 2019, originally obtained in the [Park_Factor](Park_Factor.ipynb) notebook.

In [3]:
with open('2019Schedule.pkl', 'rb') as f:
    tidy_conf = pickle.load(f)

Since we will be combining the data from all teams, we add a column for the team name and move it to the position directly after the date.

In [4]:
for i in range(len(teams)): #add column for team
    tidy_conf[i]["Team"] = teams[i]
    team = tidy_conf[i].pop("Team")
    tidy_conf[i].insert(1, team.name, team) #move team column to second

In [5]:
tidy_conf[2][:5]

Unnamed: 0,Date,Team,Opponent,Location,Score,Outcome,Opp_score
7,2019-03-08,Grace,Bethel (Ind.),N,13,W,6
8,2019-03-09,Grace,Bethel (Ind.),N,14,W,2
9,2019-03-09,Grace,Bethel (Ind.),N,3,W,1
10,2019-03-14,Grace,Taylor (Ind.),A,5,L,15
11,2019-03-16,Grace,Taylor (Ind.),A,2,L,10
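
The add-then-move step can also be written as a single call. The following is a minimal sketch, assuming a made-up two-row table (`toy` is hypothetical and is not the league data); it uses `DataFrame.insert` to place a constant column at a chosen position.

```python
import pandas as pd

# Hypothetical two-row table standing in for one team's schedule.
toy = pd.DataFrame({
    "Date": ["2019-03-08", "2019-03-09"],
    "Opponent": ["Bethel (Ind.)", "Bethel (Ind.)"],
})

# Insert a constant "Team" column at position 1, directly after "Date".
toy.insert(1, "Team", "Grace")

print(list(toy.columns))
```

One trade-off: `insert` raises a `ValueError` if the column already exists, so unlike the add-and-pop approach this version cannot be re-run on the same table.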


For notational purposes, we swap out the team names used by DakStats for the corresponding names in our `teams` list.

In [6]:
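# Map the DakStats opponent names to the short names used in `teams`.
name_map = {
    'Bethel (Ind.)' : 'Bethel',
    'Taylor (Ind.)' : 'Taylor',
    'Spring Arbor (Mich.)' : 'SAU',
    'Huntington (Ind.)' : 'HU',
    'St. Francis (Ind.)' : 'SFU',
    'Indiana Wesleyan' : 'IWU',
    'Mount Vernon Nazarene (Ohio)' : 'MVNU',
    'Marian (Ind.)' : 'Marian',
    'Goshen (Ind.)' : 'Goshen',
    'Grace (Ind.)' : 'Grace'
}
for df in tidy_conf:
    # Assign the result back to the column; calling replace(inplace=True) on
    # df.Opponent may act on a temporary copy in newer pandas versions.
    df["Opponent"] = df["Opponent"].replace(name_map)
tidy_conf[2][::3]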

Unnamed: 0,Date,Team,Opponent,Location,Score,Outcome,Opp_score
7,2019-03-08,Grace,Bethel,N,13,W,6
10,2019-03-14,Grace,Taylor,A,5,L,15
13,2019-03-23,Grace,SAU,A,2,L,4
18,2019-03-28,Grace,HU,H,12,W,2
23,2019-04-05,Grace,SFU,A,17,W,13
26,2019-04-09,Grace,IWU,H,16,L,20
30,2019-04-13,Grace,MVNU,A,5,L,10
33,2019-04-22,Grace,Marian,H,0,L,9
36,2019-04-26,Grace,Goshen,H,5,W,1


To isolate each game, we filter each team's table to only show their wins. Since each game has only one winner, when we combine all of the teams' tables, each game will only show up once.

In [7]:
conf_w = [df[df.Outcome.str.contains("W", regex = False)] for df in tidy_conf]
conf_w[2][:5]

Unnamed: 0,Date,Team,Opponent,Location,Score,Outcome,Opp_score
7,2019-03-08,Grace,Bethel,N,13,W,6
8,2019-03-09,Grace,Bethel,N,14,W,2
9,2019-03-09,Grace,Bethel,N,3,W,1
18,2019-03-28,Grace,HU,H,12,W,2
20,2019-04-01,Grace,HU,H,10,W,9
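
To see why filtering to wins leaves one row per game, here is a small sketch under stated assumptions: two hypothetical teams, `A` and `B`, play each other twice, and each team's schedule lists both games from its own side.

```python
import pandas as pd

# Hypothetical schedules: teams "A" and "B" play each other twice.
sched_a = pd.DataFrame({
    "Date": ["2019-03-08", "2019-03-09"],
    "Team": ["A", "A"],
    "Opponent": ["B", "B"],
    "Outcome": ["W", "L"],
})
sched_b = pd.DataFrame({
    "Date": ["2019-03-08", "2019-03-09"],
    "Team": ["B", "B"],
    "Opponent": ["A", "A"],
    "Outcome": ["L", "W"],
})

# Four rows in total across both schedules, but only two distinct games.
wins_only = [df[df["Outcome"] == "W"] for df in (sched_a, sched_b)]
games = pd.concat(wins_only).sort_values("Date")

print(len(games))  # one row per game
```

Note that this approach relies on every game having a recorded winner; a game recorded as a tie would not appear in any team's filtered table.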


We combine all of the tables and rename a few of the columns to better comprehend the data.

In [8]:
all_games = pd.concat(conf_w)
all_games = all_games.sort_values('Date')
all_games.rename(columns={
    'Team': 'Win_Tm',
    'Opponent': 'Lose_Tm',
    'Score': 'W_Score',
    'Opp_score': 'L_Score'}, 
    inplace=True)

del all_games["Outcome"]
all_games[::30]

Now that we have all our games, we create a function that updates each team's rating based on the game results. The formula was adapted from http://andr3w321.com/elo-ratings-part-1/. We include an additional modification for margin of victory: the rating change is multiplied by a factor that equals `1` for a one-run win and grows by `0.05` for each additional run of margin. We also incorporate homefield advantage by adding `37.85` points to the home team's rating when computing the expected result. Later we will show how we arrived at this bonus.

In [9]:
def gamePlayed(winTeam, loseTeam, marginOfVictory=1, winTeamLocation="N", k=20, tie=False): 
    if winTeamLocation == "H":
        rW = eloDict[winTeam] + 37.85 # get ratings
        rL = eloDict[loseTeam]
    elif winTeamLocation == "A":
        rW = eloDict[winTeam]
        rL = eloDict[loseTeam] + 37.85
    elif winTeamLocation == "N":
        rW = eloDict[winTeam]
        rL = eloDict[loseTeam]
    cW = 10 ** (rW/400)
    cL = 10 ** (rL/400)
    exp_winTeam = cW / float(cW + cL)
    exp_loseTeam = cL / float(cW + cL)
    if tie:
        s1 = 0.5
        s2 = 0.5
    else:
        s1 = 1
        s2 = 0
    if winTeamLocation == "H":
        new_rW = rW + k * (0.95 + 0.05*marginOfVictory) * (s1 - exp_winTeam) - 37.85
        new_rL = rL + k * (0.95 + 0.05*marginOfVictory) * (s2 - exp_loseTeam)
    elif winTeamLocation == "A":
        new_rW = rW + k * (0.95 + 0.05*marginOfVictory) * (s1 - exp_winTeam)
        new_rL = rL + k * (0.95 + 0.05*marginOfVictory) * (s2 - exp_loseTeam) - 37.85
    elif winTeamLocation == "N":
        new_rW = rW + k * (0.95 + 0.05*marginOfVictory) * (s1 - exp_winTeam)
        new_rL = rL + k * (0.95 + 0.05*marginOfVictory) * (s2 - exp_loseTeam)
    eloDict[winTeam] = new_rW
    eloDict[loseTeam] = new_rL
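
As a worked illustration of a single update, the sketch below restates the same arithmetic as a pure function that returns the new ratings instead of writing to `eloDict`. The names `elo_update`, `HOME_BONUS`, and `K` are hypothetical and exist only for this example.

```python
HOME_BONUS = 37.85  # home-field bonus used in this notebook
K = 20              # same default k as gamePlayed


def elo_update(r_win, r_lose, margin=1, win_location="N", k=K):
    """Return (new_winner_rating, new_loser_rating) without global state."""
    bonus_w = HOME_BONUS if win_location == "H" else 0.0
    bonus_l = HOME_BONUS if win_location == "A" else 0.0
    # Expected score of the winner, computed from bonus-adjusted ratings.
    q_w = 10 ** ((r_win + bonus_w) / 400)
    q_l = 10 ** ((r_lose + bonus_l) / 400)
    exp_w = q_w / (q_w + q_l)
    # Margin multiplier: 1.00 for a one-run win, +0.05 per extra run.
    mult = 0.95 + 0.05 * margin
    delta = k * mult * (1 - exp_w)
    # The loser's change is the mirror image of the winner's.
    return r_win + delta, r_lose - delta


# Two 1500-rated teams; the home team wins by three runs.
new_w, new_l = elo_update(1500, 1500, margin=3, win_location="H")
print(round(new_w, 2), round(new_l, 2))
```

In this case the home team was already favored, so it gains only about 9.8 points, and the loser gives up the same amount. The update is zero-sum, which keeps the league average at the baseline.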

Now we can create a dictionary and fill it with each team's name and a base rating of `1500`.

In [10]:
eloDict = {}
for team in teams:
    eloDict[team] = 1500

Iterating over each row in our table of 2019 conference games, we apply the `gamePlayed` function to update the ratings after each game. We then print out our dictionary of ratings at the conclusion of the season.

In [11]:
for _, game in all_games.iterrows():
    gamePlayed(game.Win_Tm, game.Lose_Tm, game.W_Score - game.L_Score, game.Location)

In [12]:
eloDict

{'Bethel': 1395.7822717560884,
 'Goshen': 1487.2907354386887,
 'Grace': 1420.7617309112625,
 'HU': 1598.2393021698197,
 'IWU': 1550.4590681682805,
 'Marian': 1550.2816649125655,
 'MVNU': 1600.2591623902965,
 'SAU': 1483.099880201794,
 'SFU': 1390.7852740777919,
 'Taylor': 1523.0409099734127}

Because these games were played two years ago and college baseball rosters turn over quickly, we regress our Elo ratings toward the original baseline. Each regression moves a rating one third of the way back to the baseline of 1500. We apply it twice, since two years have passed since the last games.

In [13]:
for i in range(2):
    for team in teams:
        eloDict[team] = eloDict[team] - ((eloDict[team] - 1500) * (1/3))

Now we print out our regressed Elo ratings. Notice that they are much closer to the baseline of 1500 than they were at the conclusion of the 2019 season.

In [14]:
eloDict

{'Bethel': 1453.6810096693728,
 'Goshen': 1494.3514379727505,
 'Grace': 1464.7829915161167,
 'HU': 1543.6619120754754,
 'IWU': 1522.4262525192357,
 'Marian': 1522.347406627807,
 'MVNU': 1544.5596277290208,
 'SAU': 1492.4888356452418,
 'SFU': 1451.4601218123519,
 'Taylor': 1510.2404044326279}
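
Applying a one-third regression twice leaves each team with (2/3)² = 4/9 of its original distance from the baseline. A minimal sketch of that closed form, assuming a hypothetical helper named `regress`:

```python
BASELINE = 1500


def regress(rating, years, fraction=1/3, baseline=BASELINE):
    """Pull a rating toward the baseline, once per elapsed year."""
    return baseline + (rating - baseline) * (1 - fraction) ** years


bethel_2019 = 1395.7822717560884  # Bethel's end-of-2019 rating printed above
print(regress(bethel_2019, years=2))
```

Up to floating-point rounding, this should agree with the regressed Bethel rating shown in the output above.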

To make predictions about games, we develop a function that returns the probability that the first team listed (`winner`) beats the second (`loser`). The function is constructed so that we can compare two actual teams, two independent ratings, or one of each. We incorporate homefield advantage into this function as well. Like the `gamePlayed` function, `expectGame` was adapted from the one provided at http://andr3w321.com/elo-ratings-part-1/.

In [15]:
def expectGame(winner, loser, winLocation="N"):
    """Return the probability that `winner` beats `loser`.

    Each of `winner` and `loser` may be a team name (looked up in eloDict)
    or a numeric rating. `winLocation` is "H", "A", or "N" from the
    perspective of `winner`.
    """
    if winLocation not in ("H", "A", "N"):
        raise ValueError("winLocation must be 'H', 'A', or 'N'")
    r1 = eloDict[winner] if isinstance(winner, str) else winner
    r2 = eloDict[loser] if isinstance(loser, str) else loser
    if winLocation == "H":
        r1 += 37.85  # `winner` is the home team
    elif winLocation == "A":
        r2 += 37.85  # `loser` is the home team
    d = r1 - r2
    p = 1 - 1 / (1 + 10 ** (d / 400.0))
    return p

Here is an example of the function in use.

In [16]:
expectGame("Grace", 1500, "H")

0.5037891068465891

Below is the code used to determine the size of the bonus given to the home team. In the [HomeField](HomeField.ipynb) notebook, we gathered Crossroads League data from the past 5 full seasons (excluding 2020) and found the overall win percentage of home teams to be approximately 0.55425. We then took our base function for predicting games, without the homefield advantage adjustment, and added rating points to one of two otherwise equal teams until its win probability matched the home teams' win percentage. To two decimal places, the bonus that gives the closest match is `37.85`.

In [17]:
#eloDict["team1"] = 1537.85
#eloDict["team2"] = 1500
#expectGame("team1", "team2")
#0.5542560584880326
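
The search described above can be cross-checked in closed form. Inverting p = 1 - 1 / (1 + 10^(d/400)) gives d = 400 · log10(p / (1 - p)). The sketch below applies this to the quoted home win percentage; `bonus_for_win_pct` is a hypothetical name, and since the 0.55425 figure is itself approximate, the result is only expected to agree to about two decimal places.

```python
import math


def bonus_for_win_pct(p):
    """Rating gap d at which 1 - 1 / (1 + 10 ** (d / 400)) equals p."""
    return 400 * math.log10(p / (1 - p))


home_win_pct = 0.55425  # approximate home win percentage quoted above
print(round(bonus_for_win_pct(home_win_pct), 2))
```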

Now, we load the games played in 2021 that were scraped in the [Get2021Games](Get2021Games.ipynb) notebook.

In [18]:
with open('2021games.pkl', 'rb') as f:
    games2021 = pickle.load(f)
games2021[-5:]

Unnamed: 0,Date,Win_Tm,Lose_Tm,Location,W_Score,L_Score
26,2021-03-20,Taylor,MVNU,H,4,1
24,2021-03-22,Marian,SAU,H,9,4
19,2021-03-22,SAU,Marian,A,12,11
27,2021-03-22,Taylor,MVNU,H,1,0
28,2021-03-22,Taylor,MVNU,H,9,2


We iterate over each game and adjust the ratings based on the results. We then print out the updated Elo dictionary and save it to a pickle file named `EloRatings.pkl`.

In [19]:
for _, game in games2021.iterrows():
    gamePlayed(game.Win_Tm, game.Lose_Tm, game.W_Score - game.L_Score, game.Location)

In [20]:
eloDict

{'Bethel': 1393.7860257348211,
 'Goshen': 1390.3152109015975,
 'Grace': 1434.6376645107598,
 'HU': 1567.6122210487417,
 'IWU': 1627.3883727527807,
 'Marian': 1561.5340998571191,
 'MVNU': 1457.8575811307321,
 'SAU': 1480.78035889379,
 'SFU': 1476.9665567950756,
 'Taylor': 1609.1219083745827}

In [21]:
with open('EloRatings.pkl', 'wb') as f:
    pickle.dump(eloDict, f)
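
A downstream notebook would typically reload the ratings and turn them into a ranking. The sketch below is one way to do that, under stated assumptions: it copies the final ratings printed above (rounded to two decimals for brevity) into a hypothetical `ratings` dictionary and round-trips them through pickle in memory rather than reading `EloRatings.pkl`.

```python
import pickle

# Final ratings as printed above, rounded to two decimals for brevity.
ratings = {
    "Bethel": 1393.79, "Goshen": 1390.32, "Grace": 1434.64, "HU": 1567.61,
    "IWU": 1627.39, "Marian": 1561.53, "MVNU": 1457.86, "SAU": 1480.78,
    "SFU": 1476.97, "Taylor": 1609.12,
}

# Round-trip through pickle in memory (no file needed for this sketch).
restored = pickle.loads(pickle.dumps(ratings))

# Sort team names from highest to lowest rating.
ranking = sorted(restored, key=restored.get, reverse=True)
for place, team in enumerate(ranking, start=1):
    print(place, team, restored[team])
```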