# First analysis

## Introduction

We are going to look at play-level and player-level data from 2009 to the present, taken from [nfldb](https://github.com/BurntSushi/nfldb), with the goal of estimating fantasy football scores from the features in the available data.  In particular, we want to answer:

- How accurately can we predict fantasy football scores for a given set of scoring rules?
- For a given fantasy football contest, can we estimate the distribution of scores for all players well?
- What is the expected value of a bet placed using an optimized machine learning algorithm trained on this data?

We will proceed as follows.  The data is stored in a PostgreSQL database with 8 separate tables; the full schema is available in PDF form [here](http://burntsushi.net/stuff/nfldb/nfldb.pdf).  We will write an SQL query against it to obtain statistics for players in positions relevant to fantasy football, including their own average stats over the previous several games, average stats for their team, and average stats for the opposing team.  To make things concrete we will use [FanDuel's scoring and lineup rules](https://www.fanduel.com/rulesandscoring), which require one quarterback, two running backs, three wide receivers, one tight end, one kicker, and one team defense.
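The lineup requirement described above can be written down as a small lookup table. This is only a sketch: the label `'DEF'` and the helper `is_valid_lineup` are names chosen here for illustration and do not come from nfldb or FanDuel.

```python
from collections import Counter

# Roster slots as listed in the text: one QB, two RB, three WR, one TE,
# one K and one team defense. 'DEF' is an illustrative label, not a
# position code taken from nfldb.
ROSTER_SLOTS = {'QB': 1, 'RB': 2, 'WR': 3, 'TE': 1, 'K': 1, 'DEF': 1}


def is_valid_lineup(lineup_positions):
    """Return True if the position labels fill every roster slot exactly."""
    return Counter(lineup_positions) == Counter(ROSTER_SLOTS)


lineup = ['QB', 'RB', 'RB', 'WR', 'WR', 'WR', 'TE', 'K', 'DEF']
print(is_valid_lineup(lineup))
```

A check like this only covers position counts; any other contest rules would need to be added separately.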

The SQL query will be used to populate two pandas `DataFrame` objects (one for all offensive players and one for team defenses, since the features differ between the two cases).  From there we will use scikit-learn to perform feature selection and/or dimensionality reduction on the data and to train and cross-validate a model for each position.  Since we have no real reason to care about the effect of individual features on the model (*i.e.* determining model coefficients), we are free to transform the data as we please and use whatever model has the best overall performance.

Since the ultimate goal is to see whether our gambling machine can expect to win money over the course of many bets, estimating the score of our own optimal lineup is not enough: it would be useful to obtain an estimated distribution of opponent scores as well.  One way to do this is to look at the results of past contests and collect the scores of the entrants.  If the scores of all participants can be obtained (which appears to be possible, though perhaps not easy at first glance), then we can simply apply the bootstrap to estimate quantities of interest.  If not, another option would be to use a parametric distribution as an approximation (*e.g.* a normal distribution).
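To make the bootstrap idea concrete, here is a minimal sketch. The opponent scores are synthetic placeholders (the mean of 110 and spread of 20 are arbitrary, not taken from any contest), and `bootstrap_statistic` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: synthetic scores standing in for real entrant scores,
# which have not been collected yet. The parameters are arbitrary.
opponent_scores = rng.normal(loc=110.0, scale=20.0, size=500)


def bootstrap_statistic(scores, statistic, n_resamples=2000, rng=None):
    """Resample scores with replacement and evaluate statistic on each resample."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores)
    n = len(scores)
    idx = rng.integers(0, n, size=(n_resamples, n))
    return np.array([statistic(scores[row]) for row in idx])


# Bootstrap distribution of the 90th-percentile opponent score
p90 = bootstrap_statistic(opponent_scores, lambda s: np.percentile(s, 90), rng=rng)
low, high = np.percentile(p90, [2.5, 97.5])
print("90th percentile: %.1f (95%% interval %.1f to %.1f)"
      % (np.percentile(opponent_scores, 90), low, high))

# Parametric alternative: summarize the scores by a normal distribution
mu, sigma = opponent_scores.mean(), opponent_scores.std(ddof=1)
print("normal approximation: mean %.1f, std %.1f" % (mu, sigma))
```

The same helper works for any statistic of interest, for example the score needed to finish in the top half of a contest.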

## Reading the data

In [1]:
# Run nfldb-update before this cell so the database is current
# (to be implemented here when ready).
from __future__ import print_function  # __future__ imports belong at the top of the cell

%matplotlib inline
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns

from sqlalchemy import create_engine

# Read the query text, closing the file handle when done
with open('players_query.sql', 'r') as sql_file:
    file_string = sql_file.read()
file_string = re.sub(r'--.*\n', '', file_string)   # remove SQL comments
file_string = re.sub(r'[\t\n]', ' ', file_string)  # replace tabs and newlines with spaces

engine = create_engine('postgresql://nfldb:nfldb@localhost:5432/nfldb')
df = pd.read_sql(file_string, engine)
df.head()

Unnamed: 0,gsis_id,full_name,position,start_time,season_year,week,fd_score,player_id,team,home_team,...,avg_passing_yds,avg_receiving_rec,avg_receiving_tar,avg_receiving_tds,avg_receiving_yac_yds,avg_receiving_yds,avg_rushing_att,avg_rushing_loss,avg_rushing_tds,avg_rushing_yds
0,2012010100,Matt Slauson,C,2012-01-01 13:00:00-05:00,2011,17,0.6,00-0026500,NYJ,MIA,...,229.7,22.6,39.0,1.7,101.6,229.7,26.4,0.0,0.3,111.3
1,2011091109,Patrick Peterson,CB,2011-09-11 16:15:00-04:00,2011,1,6.0,00-0027943,ARI,ARI,...,195.9,18.7,37.0,0.6,62.8,195.9,23.1,0.0,0.6,102.1
2,2015110104,Marcus Sherels,CB,2015-11-01 13:00:00-05:00,2015,8,6.0,00-0027547,MIN,CHI,...,238.6,20.2,31.0,1.4,124.8,238.6,29.0,0.0,0.0,144.2
3,2011112704,Patrick Peterson,CB,2011-11-27 13:00:00-05:00,2011,12,6.0,00-0027943,ARI,STL,...,196.4,16.6,31.9,1.1,74.0,196.4,24.5,0.0,0.3,135.0
4,2016100905,Marcus Sherels,CB,2016-10-09 13:00:00-04:00,2016,5,6.0,00-0027547,MIN,MIN,...,267.4,23.6,32.4,1.6,141.2,267.4,35.4,0.0,1.4,99.2


It appears there are players at positions we wouldn't expect to score fantasy points as offensive players.  To get a full list of the positions returned by the SQL query, we can use the `pd.Series.unique()` method on the `position` column:

In [2]:
df['position'].unique()

array(['C', 'CB', 'DB', 'DE', 'FB', 'FS', 'G', 'ILB', 'K', 'NT', 'OT', 'P',
       'QB', 'RB', 'SS', 'T', 'TE', 'WR'], dtype=object)

Upon closer inspection of the data, however, there are only a small number of defensive players, offensive linemen, and punters, and they scored small numbers of fantasy points, mostly due to trick plays or special teams touchdowns.  Looking at the positions, the only thing we need to do is change the label `'FB'` to `'RB'`:

In [2]:
df.loc[df.position == 'FB', 'position'] = 'RB'
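One way to check how much each position contributes is to count rows and average scores per position. The sketch below uses a toy frame with invented rows and only the two relevant column names from the query result, so the numbers mean nothing beyond illustrating the calls.

```python
import pandas as pd

# Toy data: the rows are invented; only the column names match the query result
toy = pd.DataFrame({
    'position': ['QB', 'RB', 'FB', 'CB', 'WR', 'WR', 'C'],
    'fd_score': [18.4, 12.1, 3.2, 6.0, 9.5, 14.0, 0.6],
})

# Row count and mean score per position, most frequent first
summary = toy.groupby('position')['fd_score'].agg(['count', 'mean'])
print(summary.sort_values('count', ascending=False))

# Same relabelling as in the cell above, then keep the fantasy-relevant positions
toy.loc[toy.position == 'FB', 'position'] = 'RB'
skill = toy[toy.position.isin(['QB', 'RB', 'WR', 'TE', 'K'])]
print(skill)
```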

In [3]:
from __future__ import print_function, division
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, learning_curve
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline

positions = ['QB', 'RB', 'WR', 'TE', 'K']
# Identifier and target columns that must not be used as features
metadata_cols = ['gsis_id', 'full_name', 'position', 'start_time', 'week', 'player_id', 'season_year',
                 'home_team', 'away_team', 'team', 'fd_score']


class PositionEstimator(object):
    """Grid-searched scaler -> PCA -> SVR pipeline for a single position."""

    def __init__(self, position, df):
        self.position = position
        self.df = df
        self.X = None
        self.y = None
        self.pipeline = None
        self.grid = None
        self.get_Xy()
        self.fit()

    def get_Xy(self):
        # Train only on rows for this position with a positive score
        mask = (self.df.position == self.position) & (self.df.fd_score > 0)
        self.y = self.df.loc[mask, 'fd_score']
        X = self.df.loc[mask, :].drop(metadata_cols, axis=1)
        self.X = X.fillna(0)

    def fit(self):
        n_comps = list(range(4, 21, 4))  # 4, 8, 12, 16, 20 principal components
        Cs = np.logspace(-1, 1, 4)       # SVR penalty values from 0.1 to 10
        k_fold = KFold(n_splits=3, shuffle=True)
        params = {'pca__n_components': n_comps, 'svr__C': Cs}
        self.pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('pca', PCA()),
            ('svr', SVR(kernel='rbf')),
        ])
        self.grid = GridSearchCV(self.pipeline, cv=k_fold, param_grid=params,
                                 scoring='neg_mean_squared_error', n_jobs=-1)
        self.grid.fit(self.X, self.y)
        

estimators = {pos: PositionEstimator(pos, df) for pos in positions}
    
#for position in positions:
    # change this to a Pipeline
#    X = df.loc[(df.position == position) & (df.fd_score > 0), :]
#    y = df.loc[(df.position == position) & (df.fd_score > 0), 'fd_score']
#    X = X.drop(metadata_cols, axis=1)
#    X = X.fillna(0)
    #X = X.loc[:, X.std() > .0001]
#    scaler = StandardScaler().fit(X)
#    X = scaler.transform(X)
#    pca = PCA(n_components=8).fit(X)
#    X = pca.transform(X)
#    k_fold = KFold(n_splits=5, shuffle=True)
#    Cs = np.logspace(-1, 1, 20)
#    svr = SVR(kernel='rbf')
#    svr_optimized = GridSearchCV(svr, param_grid=dict(C=Cs), scoring='neg_mean_absolute_error', n_jobs=-1, cv=k_fold)
#    svr_optimized.fit(X, y)
#    print(position + " best C = " + str(svr_optimized.best_params_['C']) + "\n")
#    print(position + " best score = " + str(svr_optimized.best_score_) + "\n")
    
    
    
    

In [4]:
for key in estimators:
    print(key)
    print(-1.0 * estimators[key].grid.best_score_)
    print(estimators[key].grid.best_params_)

K
13.7045874035
{'pca__n_components': 20, 'svr__C': 0.10000000000000001}
QB
53.4836510769
{'pca__n_components': 20, 'svr__C': 0.46415888336127786}
WR
45.8713302487
{'pca__n_components': 20, 'svr__C': 2.1544346900318834}
RB
45.8785700871
{'pca__n_components': 8, 'svr__C': 2.1544346900318834}
TE
29.5418707525
{'pca__n_components': 4, 'svr__C': 2.1544346900318834}
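Because the grid search uses `scoring='neg_mean_squared_error'`, the values printed above are mean squared errors in squared fantasy points. Taking the square root puts them back on the scale of the score itself. A small sketch, with the numbers copied from the output above:

```python
import numpy as np

# Cross-validated mean squared errors copied from the output above
cv_mse = {'K': 13.7045874035, 'QB': 53.4836510769, 'WR': 45.8713302487,
          'RB': 45.8785700871, 'TE': 29.5418707525}

# Root mean squared error per position, in fantasy points
cv_rmse = {pos: np.sqrt(mse) for pos, mse in cv_mse.items()}
for pos in sorted(cv_rmse, key=cv_rmse.get):
    print("%s: RMSE = %.2f fantasy points" % (pos, cv_rmse[pos]))
```

Since `KFold` shuffles without a fixed seed, these values will vary somewhat from run to run.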


In [13]:
#import nfldb

#db = nfldb.connect()
#cur_season, cur_year, cur_week = nfldb.current(db)
current_week = pd.read_csv('./FanDuel-NFL-2017-01-01-17431-players-list.csv')
current_week.columns = current_week.columns.str.replace(' ', '_')
current_week.head()

Unnamed: 0,Id,Position,First_Name,Nickname,Last_Name,FPPG,Played,Salary,Game,Team,Opponent,Injury_Indicator,Injury_Details
0,17431-28181,RB,Le'Veon,,Bell,23.324999,12,10000,CLE@PIT,PIT,CLE,,
1,17431-62497,RB,David,,Johnson,24.126666,15,9300,ARI@LA,ARI,LA,,
2,17431-6498,QB,Tom,,Brady,21.356363,11,9100,NE@MIA,NE,MIA,,
3,17431-6779,RB,LeSean,,McCoy,19.371429,14,9100,BUF@NYJ,BUF,NYJ,,
4,17431-31360,WR,Odell,,Beckham Jr.,15.946666,15,8800,NYG@WAS,NYG,WAS,,


In [18]:
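# Drop players on injured reserve ('IR') or ruled out ('O'); copy so that
# later column assignments do not act on a slice of the original frame
current_week = current_week[~current_week['Injury_Indicator'].isin(['IR', 'O'])].copy()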

In [20]:
current_week['Full_Name'] = current_week['First_Name'] + ' ' + current_week['Last_Name']

In [7]:
df.loc[df.season_year == 2016, 'week'].max()

16

In [25]:
df[(df['full_name'] == "Le'Veon Bell") & (df['season_year'] == 2016) & (df['week'] == 16)]

Unnamed: 0,gsis_id,full_name,position,start_time,season_year,week,fd_score,player_id,team,home_team,...,avg_passing_yds,avg_receiving_rec,avg_receiving_tar,avg_receiving_tds,avg_receiving_yac_yds,avg_receiving_yds,avg_rushing_att,avg_rushing_loss,avg_rushing_tds,avg_rushing_yds
6711,2016122500,Le'Veon Bell,RB,2016-12-25 16:30:00-05:00,2016,16,27.2,00-0030496,PIT,PIT,...,282.5,22.5,32.5,2.0,101.5,282.5,25.5,0.0,0.5,112.0


### Estimating the mean score for a week

In [11]:
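df['predicted_fd_score'] = np.nan
for key in estimators:
    mask = df.position == key
    # Select exactly the feature columns used in training (this also excludes
    # the new predicted_fd_score column) and keep the result of fillna
    X = df.loc[mask, estimators[key].X.columns].fillna(0)
    # Assign through .loc so the predictions are written to df, not to a copy
    df.loc[mask, 'predicted_fd_score'] = estimators[key].grid.predict(X)

# The original version of this cell called X.fillna(0) without keeping the
# result and failed with:
# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').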

In [20]:
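testdf = df.loc[df.position == 'QB', :]
# Restrict to the training feature columns. df now also carries
# predicted_fd_score, which is most likely why dropping only metadata_cols
# gave 71 columns against the 70 the scaler was fit on.
testX = testdf[estimators['QB'].X.columns].fillna(0)
testPred = estimators['QB'].grid.predict(testX)

# The original version, which dropped only metadata_cols, failed with:
# ValueError: operands could not be broadcast together with shapes (3471,71) (70,) (3471,71)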

In [18]:
print(testPred)

{'pca__n_components': 16, 'svr__C': 2.1544346900318834}
