In [1]:
import numpy as np
import pandas as pd
import requests

We're going to be creating SRS ratings for the 2019 season, but once you've worked through this you should be able to reuse the same code to generate ratings for prior seasons quite easily. Remember, the two main data points that SRS is concerned with are scoring margin and strength of schedule (SOS). We'll be calculating the latter as part of the base SRS calculation. As for the former, score data is easily accessed via the CFBD API's /games endpoint. Let's use Python's requests library to load all games from the 2019 season into a pandas DataFrame object.

In [2]:
response = requests.get(
    "https://api.collegefootballdata.com/games",
    params={"year": 2019, "seasonType": "both"}
)

data = pd.read_json(response.text)
data.head()

Unnamed: 0,id,season,week,season_type,start_date,neutral_site,conference_game,attendance,venue_id,venue,home_team,home_conference,home_points,home_line_scores,home_post_win_prob,away_team,away_conference,away_points,away_line_scores,away_post_win_prob
0,401110723,2019,1,regular,2019-08-24T23:00:00.000Z,True,False,,4013,Camping World Stadium,Florida,SEC,24.0,"[7, 0, 10, 7]",0.905953,Miami,ACC,20.0,"[3, 10, 0, 7]",0.094047
1,401114164,2019,1,regular,2019-08-25T02:30:00.000Z,False,False,,3610,Aloha Stadium,Hawai'i,Mountain West,45.0,"[14, 14, 7, 10]",0.68863,Arizona,Pac-12,38.0,"[0, 21, 14, 3]",0.31137
2,401117855,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3892,Rentschler Field,Connecticut,American Athletic,24.0,"[7, 3, 14, 0]",0.728942,Wagner,,21.0,"[0, 0, 14, 7]",0.271058
3,401119254,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3700,Doyt Perry Stadium,Bowling Green,Mid-American,46.0,"[13, 17, 7, 9]",0.999979,Morgan State,,3.0,"[0, 3, 0, 0]",2.1e-05
4,401117854,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3854,Nippert Stadium,Cincinnati,American Athletic,24.0,"[7, 3, 7, 7]",0.996829,UCLA,Pac-12,14.0,"[0, 7, 7, 0]",0.003171


This only outputs the first five games in the data frame, but do you notice anything in those five games? The output includes games against non-FBS opponents. How should we treat such games? There are a few different options. For one, we can just include them, but then our system of equations grows much larger and includes teams that offer very few data points. Sports Reference takes the approach of tossing all non-FBS teams into a bucket and treating them as a single team. I don't really like that because not all FCS teams are the same. We're just going to exclude them. CFBD has an admittedly peculiar way of distinguishing FCS teams. Looking at the conference columns above, you'll see that all non-FBS teams have a null conference value, which Python represents as `None`. Let's go ahead and filter out any games where either team has no conference data and is thus a non-FBS school.

In [3]:
data = data[
    (pd.notna(data['home_conference'])) # keep games with an FBS home team
    & (pd.notna(data['away_conference'])) # keep games with an FBS away team
]

There's still one oddity with which we need to deal. This data could potentially include games that have not yet been played. As I'm writing this, there are still a couple of bowl games left in the 2019 season, including the National Championship game. Another admitted peculiarity of the CFBD API at the time of writing is that it does not have a flag for completed games. Instead, we need to filter out games lacking score data. I'm going to just include these filters with the ones in the snippet above.

In [4]:
data = data[
    (pd.notna(data['home_points'])) # drop games without a score (not yet played)
    & (pd.notna(data['away_points']))
    & (pd.notna(data['home_conference'])) # keep games with an FBS home team
    & (pd.notna(data['away_conference'])) # keep games with an FBS away team
]
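
As a sketch of what this filter does, here is the same idea on a tiny hand-made frame (the rows are made up for illustration and are not real API output). `dropna` with a `subset` is an equivalent, slightly more compact way to say "keep only the rows where all four columns are present."

```python
import numpy as np
import pandas as pd

# Illustrative rows: an FBS game, a game against a non-FBS team,
# and an FBS game that has not been played yet.
toy_games = pd.DataFrame({
    "home_team": ["Florida", "Connecticut", "LSU"],
    "home_conference": ["SEC", "American Athletic", "SEC"],
    "home_points": [24.0, 24.0, np.nan],
    "away_team": ["Miami", "Wagner", "Clemson"],
    "away_conference": ["ACC", None, "ACC"],
    "away_points": [20.0, 21.0, np.nan],
})

# Keep a row only if none of these columns is missing
required = ["home_points", "away_points", "home_conference", "away_conference"]
played_fbs = toy_games.dropna(subset=required)
print(played_fbs[["home_team", "away_team"]])
```

Only the first row should survive: the second has no away conference and the third has no score.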

Next thing I want to do is to find each team's average scoring margin. Notice how we have data frame columns for home points and away points. Let's create new columns for the home and away scoring margins. One question we need to answer, though, is how we deal with home field advantage. The SRS formula used by Sports Reference notably does not make any sort of home field adjustment, but I would still like to make such an adjustment. We'll set home field advantage at +2.5 points for the purposes of this article, but may go back and adjust. The code for these calculations looks like this:

In [5]:
data['home_spread'] = np.where(data['neutral_site'] == True, data['home_points'] - data['away_points'], (data['home_points'] - data['away_points'] - 2.5))
data['away_spread'] = -data['home_spread']
data.head()

Unnamed: 0,id,season,week,season_type,start_date,neutral_site,conference_game,attendance,venue_id,venue,...,home_points,home_line_scores,home_post_win_prob,away_team,away_conference,away_points,away_line_scores,away_post_win_prob,home_spread,away_spread
0,401110723,2019,1,regular,2019-08-24T23:00:00.000Z,True,False,,4013,Camping World Stadium,...,24.0,"[7, 0, 10, 7]",0.905953,Miami,ACC,20.0,"[3, 10, 0, 7]",0.094047,4.0,-4.0
1,401114164,2019,1,regular,2019-08-25T02:30:00.000Z,False,False,,3610,Aloha Stadium,...,45.0,"[14, 14, 7, 10]",0.68863,Arizona,Pac-12,38.0,"[0, 21, 14, 3]",0.31137,4.5,-4.5
4,401117854,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3854,Nippert Stadium,...,24.0,"[7, 3, 7, 7]",0.996829,UCLA,Pac-12,14.0,"[0, 7, 7, 0]",0.003171,7.5,-7.5
9,401114236,2019,1,regular,2019-08-30T00:00:00.000Z,False,False,,4729,Benson Field at Yulman Stadium,...,42.0,"[7, 21, 14, 0]",0.999668,Florida International,Conference USA,14.0,"[0, 7, 7, 0]",0.000332,25.5,-25.5
10,401111653,2019,1,regular,2019-08-30T00:00:00.000Z,False,True,,3836,Memorial Stadium,...,52.0,"[14, 14, 14, 10]",0.999976,Georgia Tech,ACC,14.0,"[0, 0, 7, 7]",2.4e-05,35.5,-35.5


What is going on above? First off, I am making this calculation conditional on whether the game is at a neutral site. If it is, I include no home field adjustment. Otherwise, I subtract 2.5 points from the home team's margin. The away team's margin is just the negative of the home team's.

We're going to do a little cleanup here. It would be nice if we could get our data into a format that makes our calculations a little bit easier. I'm thinking of something that just has the following columns:

* team name
* spread
* opponent

To do this, we'll essentially be converting each row into two, one for each of the teams, and then getting rid of all the columns that we don't care about. Go ahead and run the following code.

In [6]:
teams = pd.concat([
    data[['home_team', 'home_spread', 'away_team']].rename(columns={'home_team': 'team', 'home_spread': 'spread', 'away_team': 'opponent'}),
    data[['away_team', 'away_spread', 'home_team']].rename(columns={'away_team': 'team', 'away_spread': 'spread', 'home_team': 'opponent'})
])

teams.head()

Unnamed: 0,team,spread,opponent
0,Florida,4.0,Miami
1,Hawai'i,4.5,Arizona
4,Cincinnati,7.5,UCLA
9,Tulane,25.5,Florida International
10,Clemson,35.5,Georgia Tech


There's one more question we face. We already made one adjustment to account for home field advantage (HFA), but how should we handle scoring margin outliers? That is to say, how much do we want our ratings affected by blowouts? We can leave scoring margin uncapped, which would give teams credit for "style" points. For now, let's go ahead and cap margins at 28 points. This is another thing we can come back and adjust.

In [7]:
teams['spread'] = np.where(teams['spread'] > 28, 28, teams['spread']) # cap the upper bound scoring margin at +28 points
teams['spread'] = np.where(teams['spread'] < -28, -28, teams['spread']) # cap the lower bound scoring margin at -28 points
teams.head()

Unnamed: 0,team,spread,opponent
0,Florida,4.0,Miami
1,Hawai'i,4.5,Arizona
4,Cincinnati,7.5,UCLA
9,Tulane,25.5,Florida International
10,Clemson,28.0,Georgia Tech
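
For what it's worth, pandas also has a `clip` method that applies both bounds in one call. A minimal sketch on a made-up series (the `cap` and `toy_spreads` names are just for illustration):

```python
import pandas as pd

cap = 28  # same cap as above; change it to experiment
toy_spreads = pd.Series([4.0, 25.5, 35.5, -41.5])

# Values above +cap become +cap, values below -cap become -cap
capped = toy_spreads.clip(lower=-cap, upper=cap)
print(capped.tolist())
```

Keeping the cap in a single variable makes it harder to change one bound and forget the other.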


It's now time to calculate the average scoring margin for each team. We'll use the convenient grouping operators in pandas to group the teams data frame by team name.

In [8]:
spreads = teams.groupby('team').spread.mean()
spreads.head()

team
Air Force            11.833333
Akron               -21.125000
Alabama              20.333333
Appalachian State    15.307692
Arizona             -11.363636
Name: spread, dtype: float64

Before we get any further, I'd be remiss if I didn't give a shout out to Andrew Mauboussin, as a lot of what follows is based on a GitHub repo of his to which someone on Patreon had pointed me. Now that we have the data in the state we want, the final step is to perform the SRS calculations. Remember, we will be building a system of 130 equations with 130 unknown variables (so a 130 x 130 matrix). To once again quote Doug Drinen's primer, which was mainly concerned with the NFL:

> The idea is to define a system of 32 equations in 32 unknowns. The solution to that system will be collection of 32 numbers and those numbers will serve as the ratings of the 32 NFL teams. Define R_ind as Indianapolis' rating, R_pit as Pittsburgh's rankings, and so on. Those are the unknowns. The equations are:
>
> R_ind = 12.0 + (1/16) (R_bal + R_jax + R_cle + . . . . + R_ari)
> 
> R_pit = 8.2 + (1/16) (R_ten + R_hou + R_nwe + . . . . + R_det)
> 
> .
> 
> .
> 
> .
> 
> R_stl = -4.1 + (1/16) (R_sfo + R_ari + R_ten + . . . . + R_dal)
>
>
> One equation for each team. The number just after the equal sign is that team's average point margin.
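
In general form (the notation here is introduced for this article and is not taken from Drinen's primer), with $m_i$ as team $i$'s average scoring margin, $n_i$ as its number of games, and the sum running over its opponents, each equation reads:

$$R_i = m_i + \frac{1}{n_i} \sum_{j \in \text{opp}(i)} R_j$$

Moving the unknowns to the left-hand side gives the form a linear solver expects:

$$R_i - \frac{1}{n_i} \sum_{j \in \text{opp}(i)} R_j = m_i$$

So row $i$ of the coefficient matrix has a 1 in team $i$'s column and $-1/n_i$ in each opponent's column, while $m_i$ sits on the right-hand side.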

Extrapolate this to the 130 teams in FBS. By solving this system of equations, we will find each team's adjusted rating, which incorporates both scoring margin and strength of schedule. I once had a Linear Algebra professor tell me that his course would be the most important undergraduate math course I would ever take. I was dubious then, not so much now.

It's actually not so bad. To build up our system of equations, we are going to define two arrays. The first array will hold the coefficients for our system of equations and will have dimensions of 130 x 130. The second array will hold the right-hand side of each equation, which is each team's average scoring margin, and will thus be 130 x 1 (one value for each team). Go ahead and run this code.

In [9]:
# create empty lists for the coefficients and the right-hand side
terms = []
solutions = []

for team in spreads.keys():
    row = []
    # get a list of team opponents (one entry per game played)
    opps = list(teams[teams['team'] == team]['opponent'])

    for opp in spreads.keys():
        if opp == team:
            # coefficient for the team should be 1
            row.append(1)
        elif opp in opps:
            # coefficient for opponents should be -1 over the number of games played
            row.append(-1.0/len(opps))
        else:
            # teams not faced get a coefficient of 0
            row.append(0)

    terms.append(row)

    # average game spread on the other side of the equation
    solutions.append(spreads[team])
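
The nested loop is easy to follow, but the same kind of matrix can also be sketched with `pd.crosstab`. Treat the version below as an alternative rather than the method used in this article, because it differs in one respect: when two teams meet twice, the loop above adds that opponent's coefficient once (because of `opp in opps`) while `len(opps)` counts both games, whereas this version weights each opponent by the number of meetings. The toy data and the `toy_*`, `meetings`, and `games_played` names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented long-format data in the same shape as `teams`;
# A and B meet twice, as in a conference title game rematch.
toy = pd.DataFrame({
    "team":     ["A", "A", "A", "B", "B", "B", "C", "C"],
    "opponent": ["B", "B", "C", "A", "A", "C", "A", "B"],
    "spread":   [6.0, 2.0, 9.0, -6.0, -2.0, 3.0, -9.0, -3.0],
})
names = sorted(toy["team"].unique())

# meetings.loc[i, j] = number of games team i played against team j
meetings = (
    pd.crosstab(toy["team"], toy["opponent"])
    .reindex(index=names, columns=names, fill_value=0)
)
games_played = meetings.sum(axis=1)

# 1 on the diagonal, -(meetings / games played) everywhere else
toy_terms = np.eye(len(names)) - meetings.div(games_played, axis=0).to_numpy()

# Right-hand side: each team's average margin, in the same team order
toy_margins = toy.groupby("team")["spread"].mean().reindex(names).to_numpy()
print(toy_terms)
```

With this weighting, A's row gives B a coefficient of -2/3 and C a coefficient of -1/3, so the opponent coefficients in every row add up to -1.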

Okay, we have our system of equations built up, but how do we solve it? It seems like this would be pretty hard, but it's actually the easiest part of this whole guide. The NumPy library contains a linear algebra package that includes a convenient function for solving a system of linear equations. Let's go ahead and solve this system.

In [10]:
solutions = np.linalg.solve(np.array(terms), np.array(solutions))
solutions

array([  1.83152811, -38.99538672,  17.09317512,   2.59362714,
       -14.03554643,  -2.87607636, -14.85549722, -12.46345153,
       -13.89805687,  10.99729137,  -9.17777622, -12.8658184 ,
         4.60687885,   0.48070966, -10.20317539, -31.2534094 ,
        -8.3941979 ,  -4.3775469 , -13.24934   , -17.57644881,
         0.13316888,  16.58388205, -17.89139792, -10.44330409,
       -17.40032253, -27.9047799 ,  -8.8807763 , -22.5308818 ,
       -16.58680614,   9.94968073,  -0.23930534, -18.87627003,
        -7.81604163, -12.27979085,  11.20596687,  -8.53501417,
       -16.85838729, -16.96372389, -10.45136986,  -8.63212612,
        -9.11027461,  -1.8631479 ,   6.40004748,   3.9315234 ,
       -12.74943847,   0.85465581, -13.93984115,  -2.84208967,
        21.27509269, -14.25729279,  -0.8711436 , -15.49807108,
       -10.39369547,  -7.3870365 , -11.97106221, -12.56148814,
         5.15892546,  -6.65042944, -14.50281729,   8.06800547,
        -3.18464261, -14.97531103,   5.35530121,  -4.13

The `solutions` array isn't really something we can read, since it's all numbers and no names. Let's attach team names to this data using Python's built-in zip function. Better yet, let's also create a new pandas DataFrame with this information.

In [11]:
ratings = list(zip( spreads.keys(), solutions ))
srs = pd.DataFrame(ratings, columns=['team', 'rating'])
srs.head()

Unnamed: 0,team,rating
0,Air Force,1.831528
1,Akron,-38.995387
2,Alabama,17.093175
3,Appalachian State,2.593627
4,Arizona,-14.035546


Lastly, there is no ordering here. We have teams matched up with ratings, which is great, but let's order these teams by rating so that we can get some rankings.

In [12]:
rankings = srs.sort_values('rating', ascending=False).reset_index()[['team', 'rating']]
rankings.loc[:24]

Unnamed: 0,team,rating
0,LSU,21.275093
1,Ohio State,20.155456
2,Alabama,17.093175
3,Clemson,16.583882
4,Georgia,11.205967
5,Auburn,10.997291
6,Penn State,10.560605
7,Oregon,10.349621
8,Florida,9.949681
9,Oklahoma,9.645868


And that's all there is to it. Notice that LSU is ranked #1 at +21.3. This is saying that LSU is 21.3 points better than an average team on a neutral field. Meanwhile, Clemson comes in #4 at +16.6. By these ratings, our SRS model is saying that LSU is a 4.7 point favorite over Clemson on a neutral field (i.e. the National Championship game).
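
The head-to-head arithmetic is just the difference between two ratings. A small sketch using the two ratings from the table above; `neutral_spread` and `toy_rankings` are names made up for this example:

```python
import pandas as pd

# The two ratings as they appear in the rankings output
toy_rankings = pd.DataFrame({
    "team": ["LSU", "Clemson"],
    "rating": [21.275093, 16.583882],
})

def neutral_spread(ratings, favorite, underdog):
    """Rating difference between two teams on a neutral field."""
    by_team = ratings.set_index("team")["rating"]
    return by_team[favorite] - by_team[underdog]

lsu_vs_clemson = neutral_spread(toy_rankings, "LSU", "Clemson")
print(round(lsu_vs_clemson, 1))  # 21.275093 - 16.583882, about 4.7
```

For a game that is not at a neutral site, you would add the home field value (2.5 in this article) to the home team's side.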

Go back and tinker.

What happens to our ratings if we do any of the following?
* Adjust home field advantage up or down from 2.5
* Remove home field advantage adjustment completely
* Adjust the scoring margin cap up or down from 28
* Remove the scoring margin cap completely
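
To make that kind of tinkering less tedious, the adjustment steps can be wrapped in a function with the two knobs exposed as parameters. This is a sketch that assumes the same column names as the `data` frame above; `build_team_spreads`, `hfa`, and `cap` are names introduced here and are not part of the original code. Unlike the cells above, it also resets the row index.

```python
import numpy as np
import pandas as pd

def build_team_spreads(games, hfa=2.5, cap=28.0):
    """Return one row per team per game with an adjusted, capped margin.

    hfa: points subtracted from the home margin at non-neutral sites (0 disables it)
    cap: maximum absolute margin (None disables the cap)
    """
    margin = games["home_points"] - games["away_points"]
    home_spread = np.where(games["neutral_site"], margin, margin - hfa)
    home = pd.DataFrame({
        "team": games["home_team"].to_numpy(),
        "spread": home_spread,
        "opponent": games["away_team"].to_numpy(),
    })
    away = pd.DataFrame({
        "team": games["away_team"].to_numpy(),
        "spread": -home_spread,
        "opponent": games["home_team"].to_numpy(),
    })
    long = pd.concat([home, away], ignore_index=True)
    if cap is not None:
        long["spread"] = long["spread"].clip(lower=-cap, upper=cap)
    return long

# Two games taken from the output earlier in the article
toy_games = pd.DataFrame({
    "home_team": ["Florida", "Clemson"],
    "away_team": ["Miami", "Georgia Tech"],
    "home_points": [24.0, 52.0],
    "away_points": [20.0, 14.0],
    "neutral_site": [True, False],
})
default_spreads = build_team_spreads(toy_games)               # Clemson capped at 28.0
raw_spreads = build_team_spreads(toy_games, hfa=0, cap=None)  # Clemson at 38.0
print(default_spreads)
print(raw_spreads)
```

From there, each experiment is one call with different arguments, followed by the same grouping and solving steps as before.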