# NFL power ratings calculation

We read in the file that was created in the [GetLines notebook](GetLines).

In [11]:
import pandas as pd
pd.options.display.max_columns = 33 # Display all teams and HFA

Here is how the first few rows look.

In [42]:
df = pd.read_csv("data/spreads-2023-08-10.csv")
df.head()

| | home_team | away_team | spread_line | neutral |
|---|---|---|---|---|
| 0 | KC | DET | 6.5 | False |
| 1 | ATL | CAR | 3.0 | False |
| 2 | BAL | HOU | 9.5 | False |
| 3 | CLE | CIN | -1.0 | False |
| 4 | IND | JAX | -3.5 | False |


## Setting up 272 equations in 33 variables

This computation should look familiar from the [Introduction notebook](intro), where we were using only four games.

In [13]:
home_dummies = pd.get_dummies(df["home_team"], dtype="int")
away_dummies = pd.get_dummies(df["away_team"], dtype="int")
coefs = home_dummies - away_dummies

Notice how most of these entries are 0 (this is a "sparse" matrix, although I don't think we ever take advantage of that fact), with a `1` for the home team and a `-1` for the away team in each row.  For example, in the top row, we have a `1` in the "KC" column and a `-1` in the "DET" column.  (The code above does not add the home-field advantage column; that happens in the next step.  The "HFA" column visible in the output below apparently reflects the notebook cells having been run out of order.)

In [43]:
coefs.head()

Unnamed: 0,ARI,ATL,BAL,BUF,CAR,CHI,CIN,CLE,DAL,DEN,DET,GB,HOU,IND,JAX,KC,LA,LAC,LV,MIA,MIN,NE,NO,NYG,NYJ,PHI,PIT,SEA,SF,TB,TEN,WAS,HFA
0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,1,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,1,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
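
The subtraction `home_dummies - away_dummies` lines up columns by name, so it relies on every team appearing at least once as a home team and at least once as an away team.  That holds for a full season, but not necessarily for a small subset of games.  Here is a minimal sketch on a made-up three-game schedule (the teams "A", "B", "C" and the names `games`, `teams`, `toy_coefs` are hypothetical) that reindexes both dummy tables to a shared list of teams first.

```python
import pandas as pd

# Hypothetical schedule: "A" never plays away and "C" never plays at home.
games = pd.DataFrame({
    "home_team": ["A", "B", "A"],
    "away_team": ["B", "C", "C"],
})

# Every team that appears anywhere in the schedule, in sorted order.
teams = sorted(set(games["home_team"]) | set(games["away_team"]))

# Give both dummy tables the same columns, filling missing teams with 0.
home = pd.get_dummies(games["home_team"], dtype="int").reindex(columns=teams, fill_value=0)
away = pd.get_dummies(games["away_team"], dtype="int").reindex(columns=teams, fill_value=0)

# +1 for the home team, -1 for the away team, 0 elsewhere.
toy_coefs = home - away
print(toy_coefs)
```

Without the `reindex` step, the "A" and "C" columns of this toy example would contain missing values, because each of them is absent from one of the two dummy tables.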


Again, as in the [Introduction notebook](intro), we want a home-field advantage factor for almost all the rows, with the only exceptions being the rows corresponding to the 5 games being played at a neutral location.

In [44]:
coefs["HFA"] = (~df["neutral"]).astype(int)

Here are some reality checks.

* There are 272 games in the regular season, and 32 teams (plus one column for home-field advantage).

In [45]:
coefs.shape

(272, 33)

* NFC teams have 8 home games and 9 road games this season.  For Arizona, this means the "ARI" column contains 8 values of `1`, 9 values of `-1`, and a `0` for each of the 255 games that does not involve Arizona.

In [46]:
coefs["ARI"].value_counts()

ARI
 0    255
-1      9
 1      8
Name: count, dtype: int64

* Five games are being played at neutral locations.

In [47]:
coefs["HFA"].value_counts()

HFA
1    267
0      5
Name: count, dtype: int64
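
Checks like these can also be written as assertions, so that a malformed matrix fails loudly.  The sketch below runs on a hypothetical two-game matrix (`toy`, `neutral`, and `team_cols` are illustrative names, not part of this notebook); the same three assertions should apply to `coefs` and `df["neutral"]`.

```python
import pandas as pd

# Hypothetical design matrix: one regular game and one neutral-site game.
toy = pd.DataFrame({"A": [1, 0], "B": [-1, 1], "C": [0, -1], "HFA": [1, 0]})
neutral = pd.Series([False, True])

# All columns except the home-field advantage column.
team_cols = toy.columns.drop("HFA")

# Each game has exactly one home team (+1) and one away team (-1).
assert (toy[team_cols].sum(axis=1) == 0).all()
assert (toy[team_cols].abs().sum(axis=1) == 2).all()

# HFA is 1 exactly when the game is not at a neutral site.
assert (toy["HFA"] == (~neutral).astype(int)).all()
```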

## Finding the closest solution to these equations

Again, we follow the method used in the [Introduction notebook](intro) to find the "best" approximate solution to the given 272 equations, meaning the one with the smallest Mean Squared Error.  The main tool we use is the `LinearRegression` class defined by scikit-learn.
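
For reference, here is one way to write the problem down.  The notation is my own and is only a sketch: for game $i$, let $h(i)$ and $a(i)$ be the home and away teams, $s_i$ the value in the "spread_line" column, and $n_i$ the value in the "HFA" column (1 for a regular game, 0 for a neutral-site game).  The unknowns are the 32 power ratings $r_t$ and the home-field advantage $c$.

$$
\min_{r_1, \dots, r_{32},\, c} \; \frac{1}{272} \sum_{i=1}^{272} \bigl( r_{h(i)} - r_{a(i)} + c\, n_i - s_i \bigr)^2
$$

Each term inside the sum is the row of `coefs` for game $i$ applied to the unknowns, minus the spread for that game.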

In [19]:
from sklearn.linear_model import LinearRegression

Here we create an instance of (that is, we "instantiate") the `LinearRegression` class.  We specify `fit_intercept=False`, because we don't want to allow an additive constant (what would that mean in the context of power ratings anyway?).

In [20]:
reg = LinearRegression(fit_intercept=False)

The hard work is done by the `fit` method of `reg`.  We specify that the input variables are from the `coefs` DataFrame, and the desired outputs are in the "spread_line" column in `df`.

```{caution}
As mentioned in the [Introduction notebook](intro), the best-fit solution is not uniquely determined: adding the same constant to every team's power rating leaves every equation unchanged, because each equation only involves the difference between two teams.  (In fact, when I tried to do this same thing in R, it would not let me.)  A way to get around this would be to delete one of the team columns from our `coefs` DataFrame, which has the effect of forcing that team's power rating to be `0`.
```
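
To make the caution concrete, here is a minimal sketch on a made-up three-team example.  It uses NumPy's `np.linalg.lstsq` as a stand-in for `LinearRegression` (an assumption made to keep the example small); the matrix `X`, the spreads `y`, and all other names are hypothetical.  When several solutions fit equally well, `np.linalg.lstsq` returns the one with the smallest norm, which in this setup means the team ratings sum to zero.

```python
import numpy as np

# Hypothetical design matrix with columns A, B, C, HFA (three games).
X = np.array([
    [1, -1, 0, 1],
    [0, 1, -1, 1],
    [-1, 0, 1, 1],
], dtype=float)
y = np.array([3.0, 1.0, -1.0])  # made-up spreads

# Minimum-norm least-squares solution; X has rank 3 but 4 columns.
sol, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Shifting every team rating by the same constant gives identical predictions.
shifted = sol + np.array([10.0, 10.0, 10.0, 0.0])
assert np.allclose(X @ sol, X @ shifted)

# The minimum-norm solution centers the team ratings at zero.
assert abs(sol[:3].sum()) < 1e-9

# Alternative fix from the caution: drop team C, pinning its rating at 0.
X_drop = np.delete(X, 2, axis=1)
sol_drop, _, rank_drop, _ = np.linalg.lstsq(X_drop, y, rcond=None)

# Rating differences and the HFA value agree between the two approaches.
assert np.isclose(sol[0] - sol[1], sol_drop[0] - sol_drop[1])
assert np.isclose(sol[3], sol_drop[2])
```

Either convention leads to the same predicted spreads; only the baseline that the ratings are measured against changes.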

In [48]:
reg.fit(coefs, df["spread_line"])

As in the [Introduction notebook](intro), the resulting values are stored in the newly defined `coef_` attribute of `reg`.  Putting these together in a pandas Series is a convenient way to include both the numerical values as well as the corresponding column names like "HFA" and "KC".

In [49]:
ser = pd.Series(reg.coef_, index=coefs.columns)

Here is how the pandas Series `ser` looks.

In [50]:
ser

ARI   -5.515579
ATL   -2.205334
BAL    2.118708
BUF    4.444309
CAR   -2.865624
CHI   -1.913665
CIN    3.594643
CLE    0.658812
DAL    2.862485
DEN    0.551433
DET    0.972427
GB    -1.632132
HOU   -4.681461
IND   -4.112030
JAX    0.955583
KC     5.732833
LA    -2.751637
LAC    2.296995
LV    -1.598324
MIA    2.473332
MIN    0.122250
NE    -0.428057
NO    -0.541405
NYG   -0.725707
NYJ    2.380826
PHI    4.488439
PIT    0.070543
SEA    0.598628
SF     3.343375
TB    -4.058167
TEN   -2.737334
WAS   -1.899165
HFA    1.486125
dtype: float64
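
As a worked example of reading these numbers, the sketch below rebuilds the predicted spread for the first game in the data, Detroit at Kansas City, from three of the values printed above.  The helper `predicted_spread` and the small Series `ratings` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Values copied from the output above.
ratings = pd.Series({"KC": 5.732833, "DET": 0.972427, "HFA": 1.486125})

def predicted_spread(home, away, neutral=False):
    """Home rating minus away rating, plus HFA unless the site is neutral."""
    hfa = 0.0 if neutral else ratings["HFA"]
    return ratings[home] - ratings[away] + hfa

print(round(predicted_spread("KC", "DET"), 3))  # 6.247
```

The spread in the data for this game is 6.5, so the fitted ratings come close to that line without reproducing it exactly.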

## Visualizing the power ratings

Here is a quick example of using the Python library Altair to visualize these power ratings.

In [35]:
import altair as alt

Here we turn the pandas Series `ser` into a two-column DataFrame, with one column containing the team and the other column containing the power rating "PR".  We're not adding any new data here; we're just putting the data into the pandas DataFrame format that Altair expects.  We do remove the "HFA" value, since visualizing it next to the team power ratings seems a little strange.

In [31]:
df_pr = ser.drop("HFA").reset_index().rename({"index": "Team", 0: "PR"}, axis=1)

In [33]:
df_pr

| | Team | PR |
|---|---|---|
| 0 | ARI | -5.515579 |
| 1 | ATL | -2.205334 |
| 2 | BAL | 2.118708 |
| 3 | BUF | 4.444309 |
| 4 | CAR | -2.865624 |
| 5 | CHI | -1.913665 |
| 6 | CIN | 3.594643 |
| 7 | CLE | 0.658812 |
| 8 | DAL | 2.862485 |
| 9 | DEN | 0.551433 |
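
The `rename` call above works because `reset_index` names the two columns `"index"` and `0` when the Series and its index are unnamed.  As a sketch of an alternative spelling, one can name the index and the values directly; `toy_ser`, `via_rename`, and `via_names` below are hypothetical names, with values taken from the Series shown earlier.

```python
import pandas as pd

# A three-entry stand-in for the full Series of ratings.
toy_ser = pd.Series({"ARI": -5.515579, "KC": 5.732833, "HFA": 1.486125})

# Same steps as in the notebook: default column names, then rename.
via_rename = toy_ser.drop("HFA").reset_index().rename({"index": "Team", 0: "PR"}, axis=1)

# Alternative: name the index and the value column up front.
via_names = toy_ser.drop("HFA").rename_axis("Team").reset_index(name="PR")

# Both approaches produce the same two-column DataFrame.
assert via_rename.equals(via_names)
```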


Here we make a bar chart of the data.  We tell Altair that the team name should go along the x-axis and the Power Rating value should determine the heights of the bars.  The only customization we do is changing from the default color scheme to "tealblues".  (See [here](https://vega.github.io/vega/docs/schemes/) for other color scheme options.)

In [36]:
alt.Chart(df_pr).mark_bar().encode(
    x = "Team",
    y = "PR",
    color = alt.Color("PR").scale(scheme="tealblues")
)

Here is one more example.  We sort the bars from biggest power rating to smallest.  We also add a tooltip, so that if you hover your mouse over one of the bars, it will indicate the team and the power rating.

In [41]:
alt.Chart(df_pr).mark_bar().encode(
    x = alt.X("Team").sort("-y"),
    y = "PR",
    color = alt.Color("PR").scale(scheme="tealblues"),
    tooltip = ["Team", "PR"]
)