I'll try to build a WR stat predictor in this notebook. First, let's load our full database.

In [1]:
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import numpy as np
from scipy.stats import pearsonr
import math


In [2]:
import pandas as pd
import pickle

samples=pd.read_csv('TrainingSamples.csv')
samples.drop('Unnamed: 0', axis=1, inplace=True)

with open("stat_order.pickle",'rb') as f:
    stat_order=pickle.load(f)

Next, let's remove the first year of football data (since the averages for that year will be unreliable) and extract all of the WR players.

In [None]:
WRsamples=samples[(samples['Position']=="WR") & (samples['Week']>100)].reset_index(drop=True)

In [None]:
len(WRsamples)

All of our lists were saved as strings in the CSV file, so let's convert them back to lists.

In [None]:
all_lists = ['Opp Avg Stats','Opp Stat Std','Opp Avg Stats v Team','Opp Stat Std v Team',
             'Opp Players','Player Avg Stats','Player Stat Std','Player Avg Stats v Opp',
             'Player Stat Std v Opp','Team Avg Stats','Team Stat Std','Team Avg v Opp',
             'Team Stat Std v Opp','Stat Outcome']

#Use "col" as the loop variable so we don't shadow the built-in list
for col in all_lists:
    WRsamples[col] = WRsamples[col].apply(eval)
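
Since `eval` will run whatever expression a string holds, a more cautious way to do the same conversion is `ast.literal_eval`. Here's a minimal sketch of that idea. The helper name `parse_stat_list` is hypothetical, and it assumes the column holds only numbers and bare `nan` tokens (which `literal_eval` rejects, so they are swapped for `None` first).

```python
import ast
import math
import re

# Matches a bare `nan` token, but not "nan" inside a longer word
_NAN_TOKEN = re.compile(r'\bnan\b')

def parse_stat_list(text):
    """Parse a string like '[1.0, nan, 3.5]' into a list of numbers (hypothetical helper)."""
    # literal_eval only accepts Python literals, so replace nan with None first
    raw = ast.literal_eval(_NAN_TOKEN.sub('None', text))
    # Turn the None placeholders back into float nan
    return [float('nan') if value is None else value for value in raw]

parsed = parse_stat_list('[1.0, nan, 3.5]')
print('%s %s %s' % (parsed[0], math.isnan(parsed[1]), parsed[2]))
```

This would only suit the numeric list columns; a column of player names would need its own handling.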

Next, let's remove any players that have "nan" stats.

In [None]:
WRsamples['No Player Stats'] = WRsamples.apply(lambda x: math.isnan(x['Player Avg Stats'][0]), axis = 1)

In [None]:
WRsamples=WRsamples[WRsamples['No Player Stats']==False].reset_index(drop=True)

In [None]:
len(WRsamples)

For now, also remove players with very low average targets.

In [None]:
tarHist=np.hstack(WRsamples['Player Avg Stats'].apply(lambda x: x[stat_order.index('receiving_tar')]))
plt.hist(tarHist, bins='auto')
plt.show()

In [None]:

WRsamples['High Receiving']=WRsamples['Player Avg Stats'].apply(lambda x: x[stat_order.index('receiving_tar')] > 4)
WRsamples=WRsamples[WRsamples['High Receiving']==True].reset_index(drop=True)
len(WRsamples)
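
The same filter can be written without the helper column. Here's a small sketch on made-up data: `toy_stat_order` and `toy` are hypothetical stand-ins for `stat_order` and `WRsamples`, and the threshold of 4 is the one used above.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's stat_order and WRsamples
toy_stat_order = ['receiving_yds', 'receiving_tar']
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Player Avg Stats': [[55.0, 6.5], [12.0, 1.5], [80.0, 9.0]],
})

# Pull the average-targets entry out of each per-row list
tar_idx = toy_stat_order.index('receiving_tar')
avg_targets = toy['Player Avg Stats'].apply(lambda stats: stats[tar_idx])

# Keep only the rows above the threshold, using the boolean Series as a mask
kept = toy[avg_targets > 4].reset_index(drop=True)
print(kept['Player'].tolist())
```

Only players A and C clear the threshold here, so player B is dropped.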


# Examining feature correlations for receiving yards

We'll need to predict each WR stat separately.  The relevant scoring stats are:

* Receiving Yards
* Receptions
* TD receptions
* 2pt Receiving Conversion

Technically, we should also try to predict rushing yards, punt return TDs, etc., but these are a good start.

Let's make some plots to see which features correlate with receiving yards. Here's a list of features that seem relevant:

* player average: receiving_yds
* player average: receiving_rec
* player average: receiving_tds
* player average: receiving_twoptm
* player average: receiving_tar
* player average: receiving_twopta
* player average: receiving_yac_yds
* team average: passing_yds
* team average: passing_tds
* team average: passing_twoptm
* team average: passing_int
* team average: passing_att
* team average: passing_cmp
* team average: passing_incmp
* team average: passing_cmp_air_yds
* team average: rushing_yds (maybe they pass less if they rush)
* team average: rushing_att
* opponent average: defense_sk
* opponent average: defense_int
* opponent average: defense_pass_def
* opponent average: defense_rushing_yds_allowed
* opponent average: defense_passing_yds_allowed
* opponent average: defense_rushing_tds_allowed
* opponent average: defense_passing_tds_allowed
* opponent average: defense_points_allowed
* Comparison: Team W/L to Opp W/L


This is a long list, but it's worth looking at all of the correlations here.

In [None]:
def CalcCorr(x_stat_type,y_stat):
    #x_stat_type is "Player" or "Team" or "Opp"
    #y_stat is a stat in stat_order, such as "receiving_yds"
    x_dfColumn = '%s Avg Stats' % x_stat_type

    all_x_stats =  WRsamples[x_dfColumn]
    all_y_stats =  WRsamples['Stat Outcome']
    y_stats = all_y_stats.apply(lambda x: x[stat_order.index(y_stat)])

    #Check every stat in stat_order and report those with |correlation| > 0.1
    for x_stat in stat_order:
        x_stats = all_x_stats.apply(lambda x: x[stat_order.index(x_stat)])
        corr = pearsonr(x_stats, y_stats)[0]
        if abs(corr) > 0.1:
            print('%s %s Correlation: %s' % (x_stat_type, x_stat, corr))


In [None]:
CalcCorr('Player','receiving_yds')
CalcCorr('Team','receiving_yds')    
CalcCorr('Opp','receiving_yds')
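
If the per-stat loop gets slow, one alternative is to expand the list column into a regular DataFrame and let pandas compute every correlation at once. This is only a sketch on made-up numbers: `toy_stat_order`, `avg_col`, and `outcome_col` are hypothetical stand-ins for `stat_order`, an `Avg Stats` column, and `Stat Outcome`.

```python
import pandas as pd

# Hypothetical stand-ins for stat_order and two list-valued columns
toy_stat_order = ['receiving_yds', 'receiving_tar', 'rushing_yds']
avg_col = pd.Series([[50.0, 6.0, 1.0], [20.0, 3.0, 4.0],
                     [90.0, 9.0, 2.0], [35.0, 5.0, 3.0]])
outcome_col = pd.Series([[60.0, 7.0, 0.0], [15.0, 2.0, 5.0],
                         [100.0, 10.0, 0.0], [30.0, 4.0, 2.0]])

# Expand each list column into one DataFrame column per stat
avg_df = pd.DataFrame(avg_col.tolist(), columns=toy_stat_order)
outcome_df = pd.DataFrame(outcome_col.tolist(), columns=toy_stat_order)

# Pearson correlation of every average stat against one outcome stat
corrs = avg_df.corrwith(outcome_df['receiving_yds'])

# Same 0.1 cut on the absolute correlation as in CalcCorr
strong = corrs[corrs.abs() > 0.1]
print(strong)
```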

It's pretty surprising that there is no correlation (above our 0.1 threshold) between the opponent's passing yards allowed and the player's receiving yards. Let's plot that to look into it a little further.

In [None]:
import numpy as np
from scipy.stats import gaussian_kde

def PlotCorr(x_stat,x_stat_type,y_stat):
    #x_stat is a stat in stat_order, such as "receiving_yds"
    #x_stat_type is "Player" or "Team" or "Opp"
    x_dfColumn = '%s Avg Stats' % x_stat_type
    
    all_x_stats =  WRsamples[x_dfColumn] 
    all_y_stats =  WRsamples['Stat Outcome'] 

    x_stats = all_x_stats.apply(lambda x: x[stat_order.index(x_stat)])
    y_stats = all_y_stats.apply(lambda x: x[stat_order.index(y_stat)])

    xy = np.vstack([x_stats,y_stats])
    z = gaussian_kde(xy)(xy)
    idx = z.argsort()
    x, y, z = x_stats[idx], y_stats[idx], z[idx]

    fig, ax = plt.subplots()
    ax.scatter(x, y, c=z, s=50, edgecolor='none')
    plt.show()

In [None]:
PlotCorr('defense_passing_yds_allowed','Opp','receiving_yds')

Interesting: there isn't a huge spread in the passing yards that teams allow, and there doesn't appear to be any correlation here.

In [None]:
PlotCorr('passing_yds','Team','receiving_yds')

For the team's passing yards, there are clearly two types of team: a high passing yards group and a low passing yards group. Surprisingly, the high passing yards group doesn't seem to have higher yards for each player. Instead, they are presumably passing to a wider range of players.

# Training a predictor

Okay, for now let's use the seven features that have abs(correlation) > 0.1 to train our predictor. We might add more features later based on our CV results.

First, we need to split our sample into training, CV, and testing sets.  Let's randomly order our dataframe, using the same seed so that we get the same order every time we run this code.

In [None]:
np.random.seed(42)
WRsamples=WRsamples.reindex(np.random.permutation(WRsamples.index))
WRsamples.reset_index(inplace=True)
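
For reference, pandas can do the seeded shuffle in one step with `DataFrame.sample`. Here's a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `WRsamples`). Passing `drop=True` to `reset_index` also avoids keeping the old index around as an extra column.

```python
import pandas as pd

# Hypothetical stand-in for WRsamples
toy = pd.DataFrame({'player': list('abcdef'), 'yds': [10, 20, 30, 40, 50, 60]})

# frac=1 samples every row, i.e. a full shuffle; random_state fixes the order
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# The same seed gives the same order on every run
print(shuffled.equals(shuffled_again))
```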

Now let's put half of the samples in training, a quarter in CV, and a quarter in test.

In [None]:
#Positional (iloc) slices are half-open, so no row lands in two sets
n_train = len(WRsamples) // 2
n_cv = len(WRsamples) // 4
WR_training = WRsamples.iloc[:n_train].copy()
WR_cv = WRsamples.iloc[n_train:n_train + n_cv].copy()
WR_test = WRsamples.iloc[n_train + n_cv:].copy()
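
The half / quarter / quarter split can also be done with scikit-learn's `train_test_split`, which shuffles and slices in one call. This is a sketch on a made-up frame (`toy` is a hypothetical stand-in for `WRsamples`), not the split used in the rest of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for WRsamples
toy = pd.DataFrame({'x': range(100)})

# First split off half for training, then halve the remainder into CV and test
train, rest = train_test_split(toy, test_size=0.5, random_state=42)
cv, test = train_test_split(rest, test_size=0.5, random_state=42)

print('%d %d %d' % (len(train), len(cv), len(test)))
```

With 100 rows this gives 50 training, 25 CV, and 25 test rows, and no row appears in more than one set.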

In [None]:
WR_training['Discard']=WR_training['Stat Outcome'].apply(lambda x: x[stat_order.index('receiving_yds')]==0)
WR_cv['Discard']=WR_cv['Stat Outcome'].apply(lambda x: x[stat_order.index('receiving_yds')]==0)

In [None]:
WR_training=WR_training[WR_training['Discard']==False]
WR_cv=WR_cv[WR_cv['Discard']==False]

Now let's train a simple prediction model before trying anything fancy. First we'll try a linear regression model, then a few other regression models, and then a neural network. As a reminder, here are the features that correlate with a player's receiving yard outcome:

In [None]:
player_feats=['receiving_yds','receiving_rec','receiving_tar','receiving_yac_yds']
team_feats=[]
opp_feats=[]

all_feats=[player_feats,team_feats,opp_feats]

In [None]:
#Extract the features we want
def getFeats(row,feats):
    #feats is [player_feats, team_feats, opp_feats]

    #Player feats
    feat_list = [row['Player Avg Stats'][stat_order.index(feat)] for feat in feats[0]]

    #Team feats
    feat_list = feat_list + [row['Team Avg Stats'][stat_order.index(feat)] for feat in feats[1]]

    #Opponent feats
    feat_list = feat_list + [row['Opp Avg Stats'][stat_order.index(feat)] for feat in feats[2]]

    return feat_list

In [None]:
X_train=pd.DataFrame(WR_training.apply(getFeats,args=[all_feats],axis=1).tolist())
X_cv=pd.DataFrame(WR_cv.apply(getFeats,args=[all_feats],axis=1).tolist())

In [None]:
#Extract feature names for column names
def getFeatNames(feats):
    feat_names=[]
    
    for feat in feats[0]:
        feat_names = feat_names + ['player_%s' % feat]
    for feat in feats[1]:
        feat_names = feat_names + ['team_%s' % feat]
    for feat in feats[2]:
        feat_names = feat_names + ['opp_%s' % feat]
    
    return feat_names

In [None]:
X_train.columns=getFeatNames(all_feats)
X_cv.columns=getFeatNames(all_feats)
X_train.head()

In [None]:
#Scale the features to help with fitting

#Get max and min of each feature from the full data set
def getExtremeOfFeat(feat,feat_type):
    #feat_type is "Player" or "Team" or "Opp"
    feat_list = WRsamples.apply(lambda x: x['%s Avg Stats' % feat_type][stat_order.index(feat)],axis=1).tolist()

    return (max(feat_list),min(feat_list))

def scaleFeat(featFrame,feats):
    
    #player feats
    for feat in feats[0]:
        (max_feat,min_feat)=getExtremeOfFeat(feat,'Player')
        featFrame['player_%s' % feat]= (featFrame['player_%s' % feat] - min_feat)/(max_feat-min_feat)
    
    #team feats
    for feat in feats[1]:
        (max_feat,min_feat)=getExtremeOfFeat(feat,'Team')
        featFrame['team_%s' % feat]= (featFrame['team_%s' % feat] - min_feat)/(max_feat-min_feat)
    
    #opp feats
    for feat in feats[2]:
        (max_feat,min_feat)=getExtremeOfFeat(feat,'Opp')
        featFrame['opp_%s' % feat]= (featFrame['opp_%s' % feat] - min_feat)/(max_feat-min_feat)
        
    return featFrame

In [None]:
scaleFeat(X_train,all_feats)
scaleFeat(X_cv,all_feats)
X_train.head()
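
The scaling above is min-max normalization, (x - min) / (max - min), with the extremes taken from all of `WRsamples`. A common variant takes the extremes from the training rows only, so that nothing about the CV or test rows leaks into the features. Here's a minimal sketch of that variant with scikit-learn's `MinMaxScaler`. The two toy frames are made-up numbers, not the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up feature values standing in for X_train and X_cv
train = pd.DataFrame({'player_receiving_yds': [10.0, 50.0, 90.0]})
cv = pd.DataFrame({'player_receiving_yds': [30.0, 110.0]})

# Learn min and max from the training rows only
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# Reuse the training min and max on the CV rows
cv_scaled = pd.DataFrame(scaler.transform(cv), columns=cv.columns)

print(train_scaled['player_receiving_yds'].tolist())
print(cv_scaled['player_receiving_yds'].tolist())
```

Note that CV values outside the training range end up outside [0, 1] (110 maps to 1.25 here), which is expected with this approach.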

In [None]:
#Get outcome for the stat we are predicting
def getStatRes(row,stat):
    return row['Stat Outcome'][stat_order.index(stat)]

In [None]:
Y_train=WR_training.apply(getStatRes,args=['receiving_yds'],axis=1)
Y_cv=WR_cv.apply(getStatRes,args=['receiving_yds'],axis=1)

In [None]:
#Train a Linear regression model
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train,Y_train)

#Check performance on CV data
plt.scatter(Y_cv, lm.predict(X_cv))
plt.xlabel("True Receiving Yards")
plt.ylabel("Predicted Receiving Yards")


#Calculate root mean squared error (the square root of the MSE)
mse = np.mean((Y_cv - lm.predict(X_cv)) **2)
print('RMSE: %s' % np.sqrt(mse))
print('Average Yardage: %s' % np.mean(Y_cv))
print('Correlation: %s' % (pearsonr(Y_cv,lm.predict(X_cv)),))
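
The same three numbers get printed for every model below, so they could be wrapped in one function. Here's a small sketch. `report_fit` is a hypothetical helper, and it uses `np.corrcoef` for the Pearson correlation rather than `pearsonr`, so it returns no p-value.

```python
import numpy as np

def report_fit(y_true, y_pred):
    """Return (rmse, mean_true, pearson_r) for a set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Root mean squared error, in the same units as the target (yards)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    # Off-diagonal entry of the 2x2 correlation matrix is Pearson's r
    pearson_r = np.corrcoef(y_true, y_pred)[0, 1]
    return rmse, y_true.mean(), pearson_r

# Made-up outcomes and predictions
rmse, mean_true, r = report_fit([10, 20, 30, 40], [12, 18, 33, 39])
print('RMSE: %.2f  Average: %.1f  Correlation: %.3f' % (rmse, mean_true, r))
```

Comparing the RMSE to the average yardage gives a rough sense of how large the typical error is relative to the thing being predicted.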

# Try Support Vector Regression

In [None]:
from sklearn.svm import SVR

#Three kernels to try; only one is fit below
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3, degree=2)

lm=svr_rbf
#lm=svr_lin
lm.fit(X_train,Y_train)


#Check performance on CV data
plt.scatter(Y_cv, lm.predict(X_cv))
plt.xlabel("True Receiving Yards")
plt.ylabel("Predicted Receiving Yards")


#Calculate root mean squared error (the square root of the MSE)
mse = np.mean((Y_cv - lm.predict(X_cv)) **2)
print('RMSE: %s' % np.sqrt(mse))
print('Average Yardage: %s' % np.mean(Y_cv))
print('Correlation: %s' % (pearsonr(Y_cv,lm.predict(X_cv)),))
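
The values `C=1e3` and `gamma=0.1` are fixed by hand above. One way to choose them is to fit a small grid and keep whichever pair gives the lowest RMSE on the CV set. Here's a sketch of that loop on synthetic data. The arrays and the grid values are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-ins for (X_train, Y_train) and (X_cv, Y_cv)
rng = np.random.RandomState(0)
X_tr = rng.rand(80, 2)
y_tr = 100 * X_tr[:, 0] + 5 * rng.randn(80)
X_val = rng.rand(40, 2)
y_val = 100 * X_val[:, 0] + 5 * rng.randn(40)

# Try each (C, gamma) pair and keep the one with the lowest validation RMSE
best = None
for C in [1.0, 1e2, 1e3]:
    for gamma in [0.01, 0.1, 1.0]:
        model = SVR(kernel='rbf', C=C, gamma=gamma).fit(X_tr, y_tr)
        rmse = np.sqrt(np.mean((y_val - model.predict(X_val)) ** 2))
        if best is None or rmse < best[0]:
            best = (rmse, C, gamma)

print('best RMSE %.2f with C=%s, gamma=%s' % best)
```

Since the CV set is used to pick the parameters here, the final error estimate should come from the untouched test set.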


# Try Bayesian Ridge Regression

In [None]:
from sklearn.linear_model import BayesianRidge

lm = BayesianRidge(compute_score=True)
lm.fit(X_train,Y_train)

#Check performance on CV data
plt.scatter(Y_cv, lm.predict(X_cv))
plt.xlabel("True Receiving Yards")
plt.ylabel("Predicted Receiving Yards")


#Calculate root mean squared error (the square root of the MSE)
mse = np.mean((Y_cv - lm.predict(X_cv)) **2)
print('RMSE: %s' % np.sqrt(mse))
print('Average Yardage: %s' % np.mean(Y_cv))
print('Correlation: %s' % (pearsonr(Y_cv,lm.predict(X_cv)),))


# Try Elastic Net

In [None]:
from sklearn.linear_model import ElasticNet

lm = ElasticNet(alpha=0.05,random_state=1)
lm.fit(X_train,Y_train)

#Check performance on CV data
plt.scatter(Y_cv, lm.predict(X_cv))
plt.xlabel("True Receiving Yards")
plt.ylabel("Predicted Receiving Yards")


#Calculate root mean squared error (the square root of the MSE)
mse = np.mean((Y_cv - lm.predict(X_cv)) **2)
print('RMSE: %s' % np.sqrt(mse))
print('Average Yardage: %s' % np.mean(Y_cv))
print('Correlation: %s' % (pearsonr(Y_cv,lm.predict(X_cv)),))



# Try Neural Network

In [None]:
#Receiving yards is a continuous target, so use the regressor rather than the classifier
from sklearn.neural_network import MLPRegressor

lm = MLPRegressor()

lm.fit(X_train,Y_train)


#Check performance on CV data
plt.scatter(Y_cv, lm.predict(X_cv))
plt.xlabel("True Receiving Yards")
plt.ylabel("Predicted Receiving Yards")


#Calculate root mean squared error (the square root of the MSE)
mse = np.mean((Y_cv - lm.predict(X_cv)) **2)
print('RMSE: %s' % np.sqrt(mse))
print('Average Yardage: %s' % np.mean(Y_cv))
print('Correlation: %s' % (pearsonr(Y_cv,lm.predict(X_cv)),))

