# Building the Elo Model

Now that we have prepared our data, it's time to start building our model. Before we do, we need to create some additional variables that we'll use in the model and in its evaluation. First, let's load our data and verify that everything is as we expect; it should look just like the output from our Data_Preparation_and_Cleaning notebook.

In [1]:
#short list of packages we'll be using
packages <- c("ggplot2","RColorBrewer","ggthemes","gridExtra","lubridate","tidyr","plyr","dplyr","reshape2","dummies","caTools","rpart","rpart.plot","rattle")

#install packages if they aren't already installed
#then load the packages
for (package in packages){
    if(!require(package,character.only=TRUE)) install.packages(package)
    library(package,character.only=TRUE,quietly=TRUE)
}

Loading required package: ggplot2
Loading required package: RColorBrewer
Loading required package: ggthemes
Loading required package: gridExtra
Loading required package: lubridate

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

Loading required package: tidyr
Loading required package: plyr

Attaching package: ‘plyr’

The following object is masked from ‘package:lubridate’:

    here

Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:lubridate’:

    intersect, setdiff, union

The following object is masked from ‘package:gridExtra’:

    combine

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: reshape2



In [2]:
df <- read.csv("data_prep_output.csv", as.is=TRUE, header=TRUE)
head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,spread_favorite,home_team_elo,away_team_elo,home_team_win_prob
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,1561.242,0.4847889
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,1535.438,0.4184365
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,1572.94,0.3124766
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,1452.964,0.7356794
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,1407.844,0.606149
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,1475.1,0.5518257


### The Elo Rating System

As mentioned previously, the FiveThirtyEight forecasts are all based upon the Elo rating system. This rating system was developed in order to rate chess players, but has since been adapted to be used in many head-to-head competitions (such as basketball, soccer, and of course American Football).

According to Wikipedia, "Elo rating is represented by a number which increases or decreases depending on the outcome of games between rated players. After every game, the winning player takes points from the losing one. The difference between the ratings of the winner and loser determines the total number of points gained or lost after a game." FiveThirtyEight uses Elo to assign ratings to each NFL team, with the average Elo being around 1500. Those ratings are then used to generate a probability of each team winning each of their games. FiveThirtyEight's model also takes into account the location of the game, giving a home field advantage Elo bump to the home teams.

The Elo system is a zero-sum game, meaning that any Elo won by a team is taken from another team. In this way, we calculate an Elo gain and Elo loss for each team for each game. FiveThirtyEight's model does this using three different factors:

- The K-Factor
- The Forecast Delta
- The Margin-of-Victory Multiplier

We need to calculate each of these and add them to our data in order to model our results. We are going to calculate everything for both the home and away team (although only one is strictly necessary), as this will allow us to test any models on home data or away data.

For a detailed description of how the Elo model works, check out: https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/

### The K-Factor

According to FiveThirtyEight, "all Elo systems come with a special multiplier called K that regulates how quickly the ratings change in response to new information." In this way, the K-Factor helps a team's Elo rating stay relatively stable in the face of a few losses. As those losses accumulate over time, a team's Elo rating will plummet but if a team is able to recover from a short losing streak they won't see as much lasting damage to their Elo rating. The original K-Factor was 10 for the original rating system established for chess.

This setup works well for sports like baseball and basketball, which play a lot of games, but for a sport like football, where teams play significantly fewer games, FiveThirtyEight has decided to raise the K-Factor to 20. This, as they put it, is "large enough that new results carry weight, but not so large that ratings bounce around each week." We would encourage you to play around with the K-Factor as you see fit (perhaps finding an even better one!).

In [3]:
k_factor <- 20

### The Forecast Delta

The forecast delta is the difference between the binary result of a game (codified as a win = 1, a loss = 0, and a tie = 0.5) and the pregame prediction generated by the Elo ratings. In order to calculate this, we'll first need the pregame win probability implied by the Elo ratings. The formula for the probability that a certain team (let's call them Team A) will win over another team (let's call them Team B) is governed by this equation:

<img src="../Images/Prob_TeamA_Win.JPG">
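
Written out in the same form as the code in the next cell, the probability is a logistic curve in the Elo difference. Here $EloDiff_A$ is assumed to mean Team A's Elo rating minus Team B's, after any home-field adjustment:

$$P(A) = \frac{1}{10^{-EloDiff_A / 400} + 1}$$

When the two teams are evenly matched ($EloDiff_A = 0$) this gives $P(A) = 0.5$, and the probabilities for the two teams always sum to 1.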

Because the equation is governed by the *EloDiff* we will need to calculate this and add it to our data. We'll do this for both the home and away team to get a probability of the home team or the away team winning. We'll also add in the home team advantage as described by FiveThirtyEight, which is an adjustment of 65 points to the home team's Elo rating.

In [4]:
#calculating elo_diffs
df$home_elo_diff <- (df$home_team_elo + 65) - df$away_team_elo
df$away_elo_diff <- df$away_team_elo - (df$home_team_elo + 65)

#calculating probabilities
df$home_win_prob <- 1 / (10^((-df$home_elo_diff)/400) + 1)
df$away_win_prob <- 1 / (10^((-df$away_elo_diff)/400) + 1)

head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,spread_favorite,home_team_elo,away_team_elo,home_team_win_prob,home_elo_diff,away_elo_diff,home_win_prob,away_win_prob
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,1561.242,0.4847889,-10.573,10.573,0.4847889,0.5152111
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,1535.438,0.4184365,-57.187,57.187,0.4184365,0.5815635
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,1572.94,0.3124766,-136.988,136.988,0.3124766,0.6875234
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,1452.964,0.7356794,177.823,-177.823,0.7356794,0.2643206
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,1407.844,0.606149,74.899,-74.899,0.606149,0.393851
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,1475.1,0.5518257,36.142,-36.142,0.5518257,0.4481743


You can see in the above dataframe that the home_team_win_prob column and home_win_prob are the same for the first few rows. The home_team_win_prob column was provided in the FiveThirtyEight data so we'll go ahead and get rid of it since we have calculated it (and verified that the formula is correct).
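
Rather than comparing rows by eye, the same check can be sketched programmatically. The snippet below is a minimal illustration that rebuilds three of the rows shown above by hand (the data frame `games` and the tolerance are assumptions for this example, not part of the notebook) and compares the recomputed probability against the provided one with `all.equal()`.

```r
# Three games copied from the output above (values rounded as displayed)
games <- data.frame(
    home_team_elo      = c(1485.669, 1413.251, 1565.787),
    away_team_elo      = c(1561.242, 1535.438, 1452.964),
    home_team_win_prob = c(0.4847889, 0.4184365, 0.7356794)
)

# Recompute the home win probability with the 65-point home-field bump
home_elo_diff <- (games$home_team_elo + 65) - games$away_team_elo
home_win_prob <- 1 / (10^(-home_elo_diff / 400) + 1)

# all.equal() compares within a tolerance; the loose tolerance allows for
# the rounding in the displayed Elo values
probs_match <- isTRUE(all.equal(home_win_prob, games$home_team_win_prob,
                                tolerance = 1e-4))
print(probs_match)
```

On the full data, the same comparison would use `df$home_win_prob` and `df$home_team_win_prob` before the latter is dropped.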

In [5]:
#remove the redundant column
df$home_team_win_prob <- NULL

Now that we have the win probabilities for both the home and away team, we need to calculate the outcome of each game in order to calculate our forecast delta. We'll do this really quick in the cell below.
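
As a small aside, the win/loss/tie coding can also be written without nested `ifelse()` calls. The sketch below uses made-up scores (the vectors `home_score` and `away_score` here are illustrative, not rows from our data) and relies on `sign()`, which returns 1, 0, or -1.

```r
# Illustrative scores: a home win, a home loss, and a tie
home_score <- c(27, 13, 20)
away_score <- c(23, 16, 20)

# sign() gives 1 / -1 / 0, which maps onto 1 / 0 / 0.5
home_result <- (sign(home_score - away_score) + 1) / 2
away_result <- 1 - home_result

print(home_result)
print(away_result)
```

Because `away_result` is always `1 - home_result`, the two columns are perfectly negatively correlated, which is what the check in the next cell looks for.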

In [6]:
#calculate a codified home_result and away_result column
df$home_result <- ifelse(df$home_score > df$away_score, 1, ifelse(df$home_score < df$away_score, 0, 0.5))
df$away_result <- ifelse(df$away_score > df$home_score, 1, ifelse(df$away_score < df$home_score, 0, 0.5))

#these should be perfectly, negatively correlated with each other
#so we can check that our ifelse()'s worked by checking that
print("Home Result and Away Result correlation:")
cor(df$home_result,df$away_result)
head(df)

[1] "Home Result and Away Result correlation:"


X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,spread_favorite,home_team_elo,away_team_elo,home_elo_diff,away_elo_diff,home_win_prob,away_win_prob,home_result,away_result
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,1561.242,-10.573,10.573,0.4847889,0.5152111,0,1
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,1535.438,-57.187,57.187,0.4184365,0.5815635,0,1
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,1572.94,-136.988,136.988,0.3124766,0.6875234,1,0
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,1452.964,177.823,-177.823,0.7356794,0.2643206,1,0
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,1407.844,74.899,-74.899,0.606149,0.393851,0,1
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,1475.1,36.142,-36.142,0.5518257,0.4481743,0,1


With this result mapped, we can now calculate our forecast delta for each game.

In [7]:
df$home_forecast_delta <- df$home_result - df$home_win_prob
df$away_forecast_delta <- df$away_result - df$away_win_prob
head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,spread_favorite,home_team_elo,away_team_elo,home_elo_diff,away_elo_diff,home_win_prob,away_win_prob,home_result,away_result,home_forecast_delta,away_forecast_delta
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,1561.242,-10.573,10.573,0.4847889,0.5152111,0,1,-0.4847889,0.4847889
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,1535.438,-57.187,57.187,0.4184365,0.5815635,0,1,-0.4184365,0.4184365
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,1572.94,-136.988,136.988,0.3124766,0.6875234,1,0,0.6875234,-0.6875234
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,1452.964,177.823,-177.823,0.7356794,0.2643206,1,0,0.2643206,-0.2643206
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,1407.844,74.899,-74.899,0.606149,0.393851,0,1,-0.606149,0.606149
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,1475.1,36.142,-36.142,0.5518257,0.4481743,0,1,-0.5518257,0.5518257


### Margin-of-Victory Multiplier

Beyond looking at just the probabilities of wins and losses, FiveThirtyEight also takes into account how much a team won or lost by. This is accounted for in what FiveThirtyEight calls the Margin of Victory (MoV) multiplier. The MoV is calculated using the following formula:

<img src="../Images/MOV_Multiplier.JPG">
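
Written out to match the code in the next cell, where $WinnerEloDiff$ is taken to be the winning team's Elo rating minus the losing team's, the multiplier is:

$$MoVMultiplier = \ln(WinnerPointDiff + 1) \times \frac{2.2}{WinnerEloDiff \times 0.001 + 2.2}$$

The first term grows with the margin but with diminishing returns; the second shrinks the multiplier when the winner was already the higher-rated team.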

We will create a column for both *MoVMultiplier* and *WinnerPointDiff* in the next cell.

In [8]:
#calculate winner_point_diff
df$winner_point_diff <- df$home_score - df$away_score

#calculate mov_multiplier
df$mov_multiplier <- ifelse(
    df$winner_point_diff > 0,
    ((log(abs(df$winner_point_diff)+1))* (2.2/((df$home_team_elo - df$away_team_elo)*0.001+2.2))),
    ((log(abs(df$winner_point_diff)+1))* (2.2/((df$away_team_elo - df$home_team_elo)*0.001+2.2)))
)

head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,spread_favorite,home_team_elo,⋯,home_elo_diff,away_elo_diff,home_win_prob,away_win_prob,home_result,away_result,home_forecast_delta,away_forecast_delta,winner_point_diff,mov_multiplier
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,⋯,-10.573,10.573,0.4847889,0.5152111,0,1,-0.4847889,0.4847889,-3,1.3402548
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,⋯,-57.187,57.187,0.4184365,0.5815635,0,1,-0.4184365,0.4184365,-6,1.8435218
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,⋯,-136.988,136.988,0.3124766,0.6875234,1,0,0.6875234,-0.6875234,3,1.5264411
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,⋯,177.823,-177.823,0.7356794,0.2643206,1,0,0.2643206,-0.2643206,4,1.5309271
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,⋯,74.899,-74.899,0.606149,0.393851,0,1,-0.606149,0.606149,-28,3.3825156
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,⋯,36.142,-36.142,0.5518257,0.4481743,0,1,-0.5518257,0.5518257,-1,0.6841727
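
To make the formula concrete, here is a hedged sketch that wraps it in a helper and reproduces one row of the table above. The function name `mov_multiplier_for` and its argument names are assumptions made for this example; the notebook itself computes the column directly on `df`.

```r
# Hypothetical helper: margin-of-victory multiplier for a single game
# winner_point_diff: winner's score minus loser's score (non-negative)
# winner_elo_diff:   winner's pregame Elo minus loser's pregame Elo
mov_multiplier_for <- function(winner_point_diff, winner_elo_diff) {
    log(winner_point_diff + 1) * (2.2 / (winner_elo_diff * 0.001 + 2.2))
}

# CAR (1370.952) beat BAL (1572.94) by 3 points
car_bal <- mov_multiplier_for(3, 1370.952 - 1572.94)
print(round(car_bal, 4))

# The same 3-point win by the higher-rated team earns a smaller multiplier
favorite_win <- mov_multiplier_for(3, 1572.94 - 1370.952)
print(favorite_win < car_bal)
```

The first value agrees with the 1.5264411 in the mov_multiplier column for that game to four decimal places.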


### Calculating Change in Elo

Now that we have our K-factor established, our forecast deltas generated, and our margin-of-victory multiplier calculated, we can find the Elo change for each team after each game by simply multiplying all of these values together.

The formula for the change in Elo (Delta Elo) is simply the K-Factor multiplied by the Margin-of-Victory (MoV) multiplier multiplied by the Forecast Delta, which, recall, is the actual result (1 for a win, 0 for a loss, and 0.5 for a tie) minus the predicted probability.

For any given team (we'll call them A), the formula looks like this:

$$\Delta Elo(A) = K \times MoV \times (Result(A) - Prob(A))$$

We will calculate this for each game in our data below.
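
As a rough illustration of the formula, and of how the K-Factor scales the result, the sketch below applies it to a single game using the values shown for CAR vs. BAL in the tables above. The helper `elo_change` and the alternative K values are assumptions for this example.

```r
# Hypothetical helper implementing K * MoV * forecast delta
elo_change <- function(k, mov, forecast_delta) {
    k * mov * forecast_delta
}

# CAR's upset of BAL: MoV multiplier and forecast deltas from the tables above
mov        <- 1.5264411
home_delta <- 0.6875234   # CAR won with a 31.2% pregame probability
away_delta <- -0.6875234  # BAL lost with a 68.8% pregame probability

home_change <- elo_change(20, mov, home_delta)
away_change <- elo_change(20, mov, away_delta)
print(c(home = home_change, away = away_change))

# Zero-sum: what one team gains, the other loses
print(isTRUE(all.equal(home_change + away_change, 0)))

# A larger K makes ratings react more strongly to the same result
print(sapply(c(10, 20, 40), elo_change, mov = mov, forecast_delta = home_delta))
```

Doubling K doubles the rating swing for the same game, which is why a larger K-Factor makes ratings more volatile from week to week.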

In [9]:
#calculate the elo change
df$home_elo_change <- k_factor * df$mov_multiplier * df$home_forecast_delta
df$away_elo_change <- k_factor * df$mov_multiplier * df$away_forecast_delta

#preview the elo changes (the new elos are calculated in the next cell)
head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,spread_favorite,home_team_elo,⋯,home_win_prob,away_win_prob,home_result,away_result,home_forecast_delta,away_forecast_delta,winner_point_diff,mov_multiplier,home_elo_change,away_elo_change
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,⋯,0.4847889,0.5152111,0,1,-0.4847889,0.4847889,-3,1.3402548,-12.994814,12.994814
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,⋯,0.4184365,0.5815635,0,1,-0.4184365,0.4184365,-6,1.8435218,-15.427938,15.427938
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,⋯,0.3124766,0.6875234,1,0,0.6875234,-0.6875234,3,1.5264411,20.989279,-20.989279
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,⋯,0.7356794,0.2643206,1,0,0.2643206,-0.2643206,4,1.5309271,8.093113,-8.093113
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,⋯,0.606149,0.393851,0,1,-0.606149,0.606149,-28,3.3825156,-41.006171,41.006171
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,⋯,0.5518257,0.4481743,0,1,-0.5518257,0.5518257,-1,0.6841727,-7.550882,7.550882


Now we can calculate a new, post-game Elo for each team.

In [10]:
#apply the elo change only whenever it is not a tie
#when it is a tie, elo's stay the same
df$home_elo_new <- ifelse(df$home_result==0.5,df$home_team_elo,df$home_team_elo+df$home_elo_change)

df$away_elo_new <- ifelse(df$away_result==0.5,df$away_team_elo,df$away_team_elo+df$away_elo_change)

head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,spread_favorite,home_team_elo,⋯,home_result,away_result,home_forecast_delta,away_forecast_delta,winner_point_diff,mov_multiplier,home_elo_change,away_elo_change,home_elo_new,away_elo_new
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,⋯,0,1,-0.4847889,0.4847889,-3,1.3402548,-12.994814,12.994814,1472.674,1574.237
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,⋯,0,1,-0.4184365,0.4184365,-6,1.8435218,-15.427938,15.427938,1397.823,1550.866
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,⋯,1,0,0.6875234,-0.6875234,3,1.5264411,20.989279,-20.989279,1391.941,1551.951
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,⋯,1,0,0.2643206,-0.2643206,4,1.5309271,8.093113,-8.093113,1573.88,1444.871
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,⋯,0,1,-0.606149,0.606149,-28,3.3825156,-41.006171,41.006171,1376.737,1448.85
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,⋯,0,1,-0.5518257,0.5518257,-1,0.6841727,-7.550882,7.550882,1438.691,1482.651


### Elo Spread

Another feature of the FiveThirtyEight Elo ratings, and one of the reasons that having the Vegas betting data is important, is that we can calculate a comparison to the spread using Elo. The formula for this is simply the elo_diff divided by 25. This is not important for our Elo calculations but could be useful in our model. Since the Vegas spread is always in terms of the point differential for the home team, we will only calculate the elo_spread for the home team as well.
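
For a sense of scale, the conversion can be sketched on a single game. The helper name `elo_to_spread` is an assumption for this example; the 25 points of Elo per point of spread comes from the text above.

```r
# Hypothetical helper: convert an Elo difference into a point spread
elo_to_spread <- function(elo_diff) {
    elo_diff / 25
}

# CHI hosted MIN: home Elo 1565.787 (+65 home-field bump) vs. away Elo 1452.964
chi_home_elo_diff <- (1565.787 + 65) - 1452.964
chi_elo_spread <- elo_to_spread(chi_home_elo_diff)
print(chi_elo_spread)
```

A positive value here means Elo favors the home team by about 7 points, whereas the Vegas line for the same game lists the favorite with a negative number (CHI -4.5), so the signs need care when comparing the two.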

In [11]:
#rename spread_favorite to vegas_spread
names(df)[names(df) == "spread_favorite"] <- "vegas_spread"
df$elo_spread <- df$home_elo_diff / 25
head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,vegas_spread,home_team_elo,⋯,away_result,home_forecast_delta,away_forecast_delta,winner_point_diff,mov_multiplier,home_elo_change,away_elo_change,home_elo_new,away_elo_new,elo_spread
1,2002-09-05,2002,NYG,SF,13,16,SF,-4.0,1485.669,⋯,1,-0.4847889,0.4847889,-3,1.3402548,-12.994814,12.994814,1472.674,1574.237,-0.42292
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,-3.0,1413.251,⋯,1,-0.4184365,0.4184365,-6,1.8435218,-15.427938,15.427938,1397.823,1550.866,-2.28748
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,⋯,0,0.6875234,-0.6875234,3,1.5264411,20.989279,-20.989279,1391.941,1551.951,-5.47952
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,⋯,0,0.2643206,-0.2643206,4,1.5309271,8.093113,-8.093113,1573.88,1444.871,7.11292
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,⋯,1,-0.606149,0.606149,-28,3.3825156,-41.006171,41.006171,1376.737,1448.85,2.99596
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,⋯,1,-0.5518257,0.5518257,-1,0.6841727,-7.550882,7.550882,1438.691,1482.651,1.44568


### Calculating Vegas Probabilities

Similar to how we can generate a probability using Elo ratings, we can do the same with our Vegas spread. Above we converted our home_elo_diff into a spread; here we'll run that process in reverse to calculate a probability from the Vegas spread, replacing elo_spread with vegas_spread.

Alternatively, it has been established that the closing spread (spread_favorite in the original data, renamed vegas_spread above) is highly correlated with a team's win probability. The following article provides a formula that lets us convert the Vegas spread directly into a win probability. We'll do that in the cell below as well.

Source: https://medium.com/the-intelligent-sports-wagerer/what-point-spreads-can-teach-you-about-implied-win-probabilities-a8bb3623d2c5
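
The two conversions can be sketched as standalone helpers. The function names below are hypothetical, and the sketch assumes one particular sign convention: `home_spread` is negative when the home team is favored. Under a different convention the sign inside each formula would flip.

```r
# Hypothetical helpers; home_spread < 0 means the home team is favored

# Method 1: turn the spread into an Elo difference (25 Elo per point),
# then apply the Elo win probability formula
spread_to_prob_elo <- function(home_spread) {
    implied_elo_diff <- -25 * home_spread
    1 / (10^(-implied_elo_diff / 400) + 1)
}

# Method 2: linear rule of 3 percentage points per point of spread,
# clamped to the valid probability range
spread_to_prob_linear <- function(home_spread) {
    pmin(pmax(0.5 - 0.03 * home_spread, 0), 1)
}

home_spread <- c(-4.5, 0, 4)
print(spread_to_prob_elo(home_spread))
print(spread_to_prob_linear(home_spread))
```

Both helpers return 0.5 for a pick'em and agree on which side is favored; they differ in how quickly the probability moves away from 0.5 as the spread grows.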

In [12]:
#convert away team spread favorites to positive spreads
df$vegas_spread <- ifelse(df$away_team_id == df$team_favorite_id,
                         df$vegas_spread*-1,
                         df$vegas_spread)

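#calculate vegas probability using method 1 (the Elo formula in reverse)
#note: with vegas_spread negative for a home favorite, this value falls below 0.5 when the home team is favored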
df$vegas_probA <- 1 / (10^((-25*df$vegas_spread)/400) + 1)

#calculate vegas probability using formula from medium article
#clamped to the [0, 1] range
df$vegas_probB <- pmin(pmax((-0.03*df$vegas_spread) + 0.5, 0), 1)

head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,vegas_spread,home_team_elo,⋯,away_forecast_delta,winner_point_diff,mov_multiplier,home_elo_change,away_elo_change,home_elo_new,away_elo_new,elo_spread,vegas_probA,vegas_probB
1,2002-09-05,2002,NYG,SF,13,16,SF,4.0,1485.669,⋯,0.4847889,-3,1.3402548,-12.994814,12.994814,1472.674,1574.237,-0.42292,0.640065,0.38
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,3.0,1413.251,⋯,0.4184365,-6,1.8435218,-15.427938,15.427938,1397.823,1550.866,-2.28748,0.6062878,0.41
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,⋯,-0.6875234,3,1.5264411,20.989279,-20.989279,1391.941,1551.951,-5.47952,0.5,0.5
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,⋯,-0.2643206,4,1.5309271,8.093113,-8.093113,1573.88,1444.871,7.11292,0.3435301,0.635
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,⋯,0.606149,-28,3.3825156,-41.006171,41.006171,1376.737,1448.85,2.99596,0.3937122,0.59
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,⋯,0.5518257,-1,0.6841727,-7.550882,7.550882,1438.691,1482.651,1.44568,0.4285369,0.56


### Our Predictions & Brier Scores

To generate our probabilities, we average our Vegas and home Elo probabilities. We'll use this average as our predictor for now while we work to develop a logistic regression model or a decision tree model to produce a probability for us. From here, we can evaluate our predictions, the Elo predictions, and the Vegas predictions using what is known as the Brier score. The Brier score is used by FiveThirtyEight in their NFL predictions game to score participants.

The Brier Score is defined as the squared difference between the forecast probability and the actual outcome of the event (codified as 1 for a positive outcome or 0 for a negative one), averaged over the number of forecasting instances. It is the same as the mean squared error of the forecast. The formula is:

<img src="../Images/BrierScore.PNG">
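
Written out, with $f_t$ the forecast probability for game $t$, $o_t$ the actual outcome (1 or 0), and $N$ the number of forecasts (symbol names chosen here for illustration), the score is:

$$BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2$$

Lower is better: a perfect forecaster scores 0, and always guessing 50% scores 0.25.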

The way FiveThirtyEight's scoring works is that they multiply the Brier Score for each prediction by 100 and then subtract the result from 25. This gives us a way to evaluate each of the different methods of prediction on a fairly human-readable scale.

Brier Score Source: https://en.wikipedia.org/wiki/Brier_score
Scoring Source: https://fivethirtyeight.com/features/how-to-play-our-nfl-predictions-game/
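
To see how the scale behaves, here is a small sketch of the per-game scoring. The helper names `brier` and `five38_points_for` are assumptions for this example; the 25 and 100 constants are the ones described above.

```r
# Hypothetical helpers for per-game scoring
brier <- function(prob, result) {
    (prob - result)^2
}
five38_points_for <- function(prob, result) {
    25 - 100 * brier(prob, result)
}

# Forecasts for a team that goes on to win (result = 1)
probs  <- c(0, 0.5, 0.7356794, 1)
points <- five38_points_for(probs, 1)
print(round(points, 2))
```

A coin-flip forecast earns nothing, a confident correct call earns up to 25 points, and a confident miss can cost as much as 75, so overconfidence is punished more heavily than caution is rewarded.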

In [13]:
str(df)

'data.frame':	3978 obs. of  28 variables:
 $ X                  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ schedule_date      : chr  "2002-09-05" "2002-09-08" "2002-09-08" "2002-09-08" ...
 $ schedule_season    : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
 $ home_team_id       : chr  "NYG" "BUF" "CAR" "CHI" ...
 $ away_team_id       : chr  "SF" "NYJ" "BAL" "MIN" ...
 $ home_score         : int  13 31 10 27 6 39 23 37 19 25 ...
 $ away_score         : int  16 37 7 23 34 40 16 34 10 28 ...
 $ team_favorite_id   : chr  "SF" "NYJ" "PICK" "CHI" ...
 $ vegas_spread       : num  4 3 0 -4.5 -3 -2 3 -7 8.5 3.5 ...
 $ home_team_elo      : num  1486 1413 1371 1566 1418 ...
 $ away_team_elo      : num  1561 1535 1573 1453 1408 ...
 $ home_elo_diff      : num  -10.6 -57.2 -137 177.8 74.9 ...
 $ away_elo_diff      : num  10.6 57.2 137 -177.8 -74.9 ...
 $ home_win_prob      : num  0.485 0.418 0.312 0.736 0.606 ...
 $ away_win_prob      : num  0.515 0.582 0.688 0.264 0.394 ...
 $ home_result  

In [14]:
#generate home probabilities for both vegas odds methods
df$group8_home_probA <- (df$home_win_prob + df$vegas_probA)/2
df$group8_home_probB <- (df$home_win_prob + df$vegas_probB)/2

#generate away probabilities
df$group8_away_probA <- abs(1 - df$group8_home_probA)
df$group8_away_probB <- abs(1 - df$group8_home_probB)

#evaluate brier scores for all probabilities
df$group8A_brier <- (df$group8_home_probA-df$home_result)^2
df$group8B_brier <- (df$group8_home_probB-df$home_result)^2
df$vegasA_brier <- (df$vegas_probA-df$home_result)^2
df$vegasB_brier <- (df$vegas_probB-df$home_result)^2
df$five38_brier <- (df$home_win_prob-df$home_result)^2

#evaluate points using FiveThirtyEight's point system
df$group8A_points <- 25 - (100*df$group8A_brier)
df$group8B_points <- 25 - (100*df$group8B_brier)
df$vegasA_points <- 25 - (100*df$vegasA_brier)
df$vegasB_points <- 25 - (100*df$vegasB_brier)
df$five38_points <- 25 - (100*df$five38_brier)

head(df)

X,schedule_date,schedule_season,home_team_id,away_team_id,home_score,away_score,team_favorite_id,vegas_spread,home_team_elo,⋯,group8A_brier,group8B_brier,vegasA_brier,vegasB_brier,five38_brier,group8A_points,group8B_points,vegasA_points,vegasB_points,five38_points
1,2002-09-05,2002,NYG,SF,13,16,SF,4.0,1485.669,⋯,0.3163241,0.18696497,0.4096832,0.1444,0.2350203,-6.632408812,6.303503,-15.96832,10.56,1.49797
2,2002-09-08,2002,BUF,NYJ,31,37,NYJ,3.0,1413.251,⋯,0.262515,0.17157678,0.3675849,0.1681,0.1750891,-1.251500813,7.842322,-11.758493,8.19,7.491086
3,2002-09-08,2002,CAR,BAL,10,7,PICK,0.0,1370.952,⋯,0.352553,0.35255296,0.25,0.25,0.4726884,-10.255296121,-10.255296,0.0,0.0,-22.268844
4,2002-09-08,2002,CHI,MIN,27,23,CHI,-4.5,1565.787,⋯,0.2119638,0.09901112,0.4309527,0.133225,0.0698654,3.803620176,15.098888,-18.095271,11.6775,18.01346
5,2002-09-08,2002,CIN,LAC,6,34,CIN,-3.0,1417.743,⋯,0.2499306,0.35769313,0.1550093,0.3481,0.3674166,0.006939073,-10.769313,9.499072,-9.81,-11.741665
6,2002-09-08,2002,CLE,KC,39,40,CLE,-2.0,1446.242,⋯,0.2402777,0.30903911,0.1836439,0.3136,0.3045116,0.972229442,-5.903911,6.635614,-6.36,-5.451162


# Results

Now let's aggregate all of the points and see how we did! We'll take this output and create some visualizations with it in our other notebooks.

In [15]:
#total the points for each method within each season
points_by_season <- df %>%
    group_by(schedule_season) %>%
    summarise(group8_total_points = sum(group8B_points),
              vegas_total_points = sum(vegasB_points),
              five38_total_points = sum(five38_points))

head(points_by_season)

schedule_season,group8_total_points,vegas_total_points,five38_total_points
2002,688.1402,653.0375,633.1252
2003,836.0192,881.6925,713.3039
2004,837.109,899.4975,655.7038
2005,1135.2946,1225.6525,963.5225
2006,575.6436,584.005,481.6456
2007,1155.7797,1257.025,938.1081


In [16]:
#write to csv that lands in the current directory
write.csv(points_by_season,file="elo_output.csv")