Run the following code to import the required packages:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
import matplotlib.pyplot as plt
%matplotlib inline

## Multiple Linear Regression with Categorical Explanatory Variables

So far, we've only considered numerical explanatory variables. What if we also consider categorical variables, like player position? Let's analyze the following fantasy football dataset created by merging ESPN projections located here:

https://www.espn.com/fantasy/football/story/_/page/17RanksPreseason200PPR/2017-fantasy-football-ppr-rankings-top-200

with how players actually performed that season:

In [None]:
df = pd.read_csv('data/football.csv', index_col=0)
df.head()

We can reduce our dataset to purely the numerical columns:

In [None]:
df = df.drop(columns=['player', 'positions'])
df.head()

Let's view a scatterplot of the data:

In [None]:
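x = df['2017 projected ranking']
y = df['2017 actual points scored']

# Fit a degree-1 polynomial (a straight line) and wrap it as a callable
fit = np.polyfit(x, y, 1)
fit_fn = np.poly1d(fit)

# Plot the raw points and the fitted line
plt.plot(x, y, '.', x, fit_fn(x))
plt.xlabel('projected ranking')
plt.ylabel('actual points scored')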

If we run a linear regression on these numerical columns alone, we get an R-squared value of only 16%:

In [None]:
X = df.drop(columns = ['2017 actual points scored'])
y = df['2017 actual points scored']

model = LinearRegression()
model.fit(X, y)
print('R-squared', model.score(X,y))
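
For a regression model, model.score returns R-squared, the fraction of the variance in y that the model explains. As a minimal sketch, the cell below recomputes that number by hand on synthetic stand-in data (the data and the names toy_model, ranking, and points are made up for illustration and are not the football dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the football dataset): a noisy decreasing trend
rng = np.random.default_rng(0)
ranking = np.arange(1, 101, dtype=float).reshape(-1, 1)
points = 250 - 0.6 * ranking.ravel() + rng.normal(0, 60, size=100)

toy_model = LinearRegression().fit(ranking, points)
predicted = toy_model.predict(ranking)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((points - predicted) ** 2)
ss_tot = np.sum((points - points.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('manual R-squared:', r2_manual)
print('model.score     :', toy_model.score(ranking, points))
```

The two printed values should agree, which is a quick way to check what the score is measuring.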

Let's see if we can do better by using categorical variables, too. Our goal will be to use ESPN rankings (numerical) AND player position (categorical) to predict their fantasy points.

Step 1: Create a new dataframe from the original dataframe that includes just the predictor variables: the ranking and positions columns.

In [None]:
df = pd.read_csv('data/football.csv', index_col=0)
input_data = df[['2017 projected ranking','positions']]
input_data.head()

Step 2: Create a one-hot matrix using pd.get_dummies, which turns the categorical column, positions, into one numerical indicator column per position:

In [None]:
one_hot = pd.get_dummies(df['positions'])
one_hot.head()
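
To see what one-hot encoding does in isolation, here is a small sketch on a hypothetical five-row dataframe (the toy values are made up; the real positions come from data/football.csv). Passing dtype=int is an optional choice that prints 0/1 instead of True/False:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding
toy = pd.DataFrame({'positions': ['QB', 'RB', 'WR', 'TE', 'RB']})

# One indicator column per distinct position, ordered alphabetically
toy_one_hot = pd.get_dummies(toy['positions'], dtype=int)
print(toy_one_hot)
```

Each row contains exactly one 1, in the column matching that player's position. The alphabetical column order (QB, RB, TE, WR) is also the order in which the position coefficients appear in a fitted model.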

Step 3: We can now merge the two dataframes. First drop the positions label column, since each position is now encoded numerically, and then join the one-hot columns:

In [None]:
input_data = input_data.drop('positions', axis = 1)
input_data = input_data.join(one_hot)
input_data.head()

We are now ready to create a linear regression model that predicts actual points scored:

In [None]:
X = input_data
y = df['2017 actual points scored']

model = LinearRegression()
model.fit(X, y)
print('R-squared', model.score(X,y))

Wow! Our R-Squared doubled when we added in position.

What exactly is our model?

In [None]:
print(model.intercept_)
print(model.coef_)

This means that:

$\text{predicted actual points scored} = 232.9 - 0.61\,(\text{projected ranking}) + 63.40\,(\text{QB}) - 47.91\,(\text{RB}) - 6.13\,(\text{TE}) - 9.34\,(\text{WR})$

This makes sense: being a QB should mean that you score more points, and a numerically larger ranking (for example, #200 instead of #1) should mean that you score fewer points, which is why the ranking coefficient is negative.
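
As a worked example, the sketch below plugs numbers into the equation above using its rounded coefficients. The helper predict_points is hypothetical and exists only for illustration; in practice model.predict does this work with the unrounded values:

```python
# Rounded coefficients copied from the equation above
INTERCEPT = 232.9
RANK_COEF = -0.61
POSITION_COEF = {'QB': 63.40, 'RB': -47.91, 'TE': -6.13, 'WR': -9.34}

def predict_points(projected_ranking, position):
    """Evaluate the fitted equation for one player (hypothetical helper)."""
    return INTERCEPT + RANK_COEF * projected_ranking + POSITION_COEF[position]

# Two players with the same projected ranking but different positions
qb_50 = predict_points(50, 'QB')
rb_50 = predict_points(50, 'RB')
print(round(qb_50, 2), round(rb_50, 2))
```

A QB ranked #50 is predicted to score about 265.8 points and an RB ranked #50 about 154.5, a gap of 63.40 + 47.91 = 111.31 points that comes entirely from the position terms.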

Let's visualize what is going on with this multiple linear regression by creating the following plot below:

In [None]:
quarterbacks = df[df['positions'] == 'QB']
runningbacks = df[df['positions'] == 'RB']
receivers = df[df['positions'] == 'WR']
tightends = df[df['positions'] == 'TE']

fig, ax = plt.subplots(figsize=(7,7))

ax.scatter(quarterbacks['2017 projected ranking'], quarterbacks['2017 actual points scored'], color='blue', label='qb')
ax.scatter(runningbacks['2017 projected ranking'], runningbacks['2017 actual points scored'], color='red', label='rb')
ax.scatter(receivers['2017 projected ranking'], receivers['2017 actual points scored'], color='yellow', label='wr')
ax.scatter(tightends['2017 projected ranking'], tightends['2017 actual points scored'], color='green', label='te')

ax.plot(X['2017 projected ranking'].values, model.predict(X), 'k.', label = 'predictions')
ax.set_xlabel('ESPN projected pre-season player ranking 2017')
ax.set_ylabel('Total ESPN Fantasy Points that player scored in 2017')
ax.set_title('Position matters!')
ax.legend(loc='best');

If you look closely, you should see four parallel black lines in the plot above: one regression line per position. They share the same slope (the ranking coefficient) and differ only in their intercepts (the position coefficients).
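
Because the model has a single coefficient for projected ranking plus a constant offset for each position, the per-position lines share one slope and differ only by a vertical shift. The following is a minimal sketch that checks this on synthetic stand-in data; the data, the offsets, and the helper line_for are all assumptions made for illustration, not the football dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the football dataset)
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    'ranking': rng.permutation(np.arange(1, n + 1)),
    'positions': rng.choice(['QB', 'RB', 'TE', 'WR'], size=n),
})
offsets = {'QB': 60, 'RB': -40, 'TE': -10, 'WR': -5}  # made-up position effects
toy['points'] = (230 - 0.6 * toy['ranking']
                 + toy['positions'].map(offsets)
                 + rng.normal(0, 20, size=n))

# Same recipe as in the steps above: ranking plus one-hot position columns
X_toy = toy[['ranking']].join(pd.get_dummies(toy['positions'], dtype=int))
toy_model = LinearRegression().fit(X_toy, toy['points'])

def line_for(position, rankings=(10, 150)):
    """Predict points for one position at two rankings (hypothetical helper)."""
    rows = pd.DataFrame(0, index=range(len(rankings)), columns=X_toy.columns)
    rows['ranking'] = list(rankings)
    rows[position] = 1
    return toy_model.predict(rows)

qb_line = line_for('QB')
rb_line = line_for('RB')
slope_qb = (qb_line[1] - qb_line[0]) / (150 - 10)
slope_rb = (rb_line[1] - rb_line[0]) / (150 - 10)
print('QB slope:', slope_qb, 'RB slope:', slope_rb)
print('QB - RB gap at each ranking:', qb_line - rb_line)
```

The two slopes should match, and the QB minus RB gap should be the same at both rankings. Fitting a different slope for each position would require adding interaction terms, which this model does not include.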

Finally, we can analyze who did best and who did worst.

We can add a column called "predicted points" to the dataframe, holding each player's prediction from the simple line of best fit (fit_fn, fitted on projected ranking alone in the scatterplot cell above), and then use that column to calculate the residuals (actual points minus predicted points). Sorting the dataframe by residual from lowest to highest shows the most overrated players.

In [None]:
df = pd.read_csv('data/football.csv', index_col=0)
df['predicted points'] = fit_fn(df['2017 projected ranking'])
df['residual'] = df['2017 actual points scored'] - df['predicted points']
df.sort_values(by = 'residual', ascending = True).head()

It makes sense that David Johnson was the most overrated player that year, as he was ranked #1 but unfortunately got injured early in the season.

In [None]:
# How could you see the most underrated players from 2017? Sort the residuals in descending order.

df.sort_values(by = 'residual', ascending = False).head()
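
To make the sign convention concrete, here is a small sketch with three hypothetical players (the names and numbers are invented). A negative residual means the player scored fewer points than predicted (overrated), and a positive residual means the player scored more (underrated):

```python
import pandas as pd

# Hypothetical players and values, used only to illustrate the sign convention
toy = pd.DataFrame({
    'player': ['A', 'B', 'C'],
    'actual': [50.0, 210.0, 300.0],
    'predicted': [230.0, 200.0, 180.0],
})

# residual = actual - predicted
toy['residual'] = toy['actual'] - toy['predicted']

# Lowest residual first: most overrated; highest residual first: most underrated
most_overrated = toy.sort_values(by='residual', ascending=True).iloc[0]['player']
most_underrated = toy.sort_values(by='residual', ascending=False).iloc[0]['player']
print(most_overrated, most_underrated)
```

Player A falls 180 points short of the prediction and sorts first when ascending, while player C beats the prediction by 120 points and sorts first when descending.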

In [None]:
# if you're a player of Fantasy Football ... creating a model that could select the best team might
# be an interesting project 
