# General Assembly DSI - Denver 2018
## Capstone Project - DFS Model
This is my capstone project for the fifth [Data Science Immersive](https://generalassemb.ly/education/data-science-immersive) cohort at General Assembly in 2018. I am developing a model to help optimize NFL lineups on the daily fantasy sports platforms [DraftKings](https://www.draftkings.com/) and [FanDuel](https://www.fanduel.com/).

### Problem Statement

Can we build a model that predicts a football player’s fantasy performance well enough to estimate their value, and can that model be combined with a daily fantasy strategy to turn a profit?

### Gathering NFL Defensive Data
I will be scraping [Fantasy.NFL.com](http://fantasy.nfl.com/research/pointsagainst?position=1&sort=pointsAgainst_pts&statCategory=pointsAgainst&statSeason=2010&statType=weekPointsAgainst&statWeek=2) to get the weekly defensive rankings against each position so I can tie a player's performance to the opponent they played.
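
The scraping cells below differ mainly in the `position` code in the URL (1 for QB, 2 for RB, 3 for WR, 4 for TE). As a minimal sketch, the URL could be built from its query parameters instead of one long format string. The helper name `points_against_url` and the `POSITION_CODES` mapping are my own illustration, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'http://fantasy.nfl.com/research/pointsagainst'
# position codes as they appear in the URLs used in this notebook
POSITION_CODES = {'QB': 1, 'RB': 2, 'WR': 3, 'TE': 4}

def points_against_url(position, year, week):
    """Build the weekly points-against URL for one position, season and week."""
    params = {
        'position': POSITION_CODES[position],
        'sort': 'pointsAgainst_pts',
        'statCategory': 'pointsAgainst',
        'statSeason': year,
        'statType': 'weekPointsAgainst',
        'statWeek': week,
    }
    # urlencode keeps the insertion order of the dict
    return '{}?{}'.format(BASE_URL, urlencode(params))

print(points_against_url('QB', 2010, 2))
```

With these inputs the helper reproduces the example link above.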

In [74]:
import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup

### Quarterbacks

In [None]:
# # markdown cell convert to code to run
# # collecting data for defense versus quarterback
# defenses = []

# # collecting data from years 2011 - 2017
# for year in range(2011,2018):
#     # getting weeks 1-17
#     for week in range(1,18):
#         url = 'http://fantasy.nfl.com/research/pointsagainst?position=1&sort=pointsAgainst_pts&statCategory=pointsAgainst&statSeason={}&statType=weekPointsAgainst&statWeek={}'.format(year, week)
#         res = requests.get(url)
        
#         # run if request is successful
#         if res.status_code == 200:
#             soup = BeautifulSoup(res.content, 'lxml')
#             # loop through by team (due to the HTML of the web page)
#             for team in range(1,33):
#                 team_num = 'team-{}'.format(team)
#                 # grab row of the team
#                 row = soup.find('tr', attrs = {'class': team_num})

#                 # some teams will be on a bye, in that case we want to bypass (no pun intended) the loop
#                 # the website still lists these teams, but it says their opponent is a "Bye"
#                 if row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.lower() == 'bye':
#                     continue

#                 # create an empty dictionary for the observation being scraped
#                 week_ranks = {}

#                 # the defensive stats I will collect
#                 week_ranks['Team'] = row.find('div', attrs={'class': 'c'}).text.replace(' Defensevs QB', '')
#                 week_ranks['Opponent'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.replace('@', '')
#                 week_ranks['Week'] = week
#                 week_ranks['Year'] = year
#                 week_ranks['Rank'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-rank'}).text
#                 week_ranks['Completions'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-passing_completions'}).text
#                 week_ranks['Attempts'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-passing_attempts'}).text
#                 week_ranks['Yards'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-passing_yards'}).text
#                 week_ranks['Interceptions'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-passing_interceptions'}).text
#                 week_ranks['Touchdowns'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-passing_tds'}).text

#                 defenses.append(week_ranks)
                
# #                 pd.DataFrame(defenses).to_csv('../data/Defense_vs_QB.csv', index = False)
#         else:
#             print("Oooooops, something went wrong! Status Code = {}".format(res.status_code))
#         time.sleep(1)
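
One fragile spot in the loop above is that `soup.find(...)` returns `None` when a row or span is absent, so the chained `.text` call would raise an `AttributeError`. A small guard is sketched below. The helper `stat_text` is hypothetical, and the HTML is a made-up fragment that only mimics the class names used in the scraper, not the real page markup.

```python
from bs4 import BeautifulSoup

def stat_text(row, stat_id, default=None):
    """Return the text of one stat span, or `default` if the row or span is missing."""
    if row is None:
        return default
    span = row.find('span', attrs={'class': 'pointsAgainstStatId-' + stat_id})
    return span.text if span is not None else default

# made-up HTML for illustration only
html = '''
<table>
  <tr class="team-1"><td>
    <span class="pointsAgainstStatId-opponent">@DEN</span>
    <span class="pointsAgainstStatId-passing_yards">250</span>
  </td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr', attrs={'class': 'team-1'})
missing_row = soup.find('tr', attrs={'class': 'team-3'})  # not in the fragment

print(stat_text(row, 'passing_yards'))
print(stat_text(missing_row, 'passing_yards'))
```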

In [79]:
# defense_vs_qb = pd.DataFrame(defenses)

In [153]:
defense_vs_qb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 10 columns):
Attempts         3584 non-null object
Completions      3584 non-null object
Interceptions    3584 non-null object
Opponent         3584 non-null object
Rank             3584 non-null int64
Team             3584 non-null object
Touchdowns       3584 non-null object
Week             3584 non-null int64
Yards            3584 non-null object
Year             3584 non-null int64
dtypes: int64(3), object(7)
memory usage: 280.1+ KB


> Every stat column is an `object` because the website shows a hyphen instead of a zero.

In [154]:
columns = [col for col in defense_vs_qb.columns if col not in ['Team', 'Rank', 'Opponent', 'Week', 'Year']]

for col in columns:
    defense_vs_qb[col] = defense_vs_qb[col].map(lambda x: x.replace('-', '0'))
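
One caveat: `str.replace('-', '0')` rewrites every hyphen in a string, so a negative value such as `'-3'` would become `'03'`. I have not checked whether negative values occur in this data, so treat the following as a precaution. It is a minimal sketch on a toy frame that replaces only cells that are exactly `'-'`.

```python
import pandas as pd

# toy data: a lone hyphen is the placeholder, '-3' is a real (negative) value
toy = pd.DataFrame({'Yards': ['250', '-', '-3'],
                    'Attempts': ['35', '-', '1']})

for col in ['Yards', 'Attempts']:
    # Series.replace matches whole cell values, not substrings
    toy[col] = toy[col].replace('-', '0').astype(int)

print(toy)
```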

In [155]:
defense_vs_qb[defense_vs_qb['Attempts'] == '0'].shape

(262, 10)

In [156]:
defense_vs_qb[defense_vs_qb['Attempts'] == '0']['Opponent'].value_counts()

LA     38
SF     13
SEA    11
CHI     9
WAS     9
DET     9
KC      8
PIT     8
BUF     8
CLE     8
ARI     8
DAL     8
PHI     7
MIA     7
NO      7
NE      7
DEN     7
CIN     7
IND     7
TB      7
GB      6
BAL     6
MIN     6
JAX     6
CAR     6
NYG     6
HOU     6
ATL     6
LAC     6
OAK     6
TEN     5
NYJ     4
Name: Opponent, dtype: int64

> So apparently there are 262 observations with no recorded `Attempts`, and every other stat column in those rows is zero as well. My theory is that something broke when the site updated its data for the Rams' move to LA, because LA shows up as the opponent in far more of these rows (38) than any other team. I know it's not ideal, but I'm just going to impute the column means for these rows.

In [157]:
# representing them as integer values
attempts_mean = int(round(defense_vs_qb[defense_vs_qb['Attempts'] != '0']['Attempts'].astype(int).mean(), 0))
completions_mean = int(round(defense_vs_qb[defense_vs_qb['Attempts'] != '0']['Completions'].astype(int).mean(), 0))
interceptions_mean = int(round(defense_vs_qb[defense_vs_qb['Attempts'] != '0']['Interceptions'].astype(int).mean(), 0))
touchdowns_mean = int(round(defense_vs_qb[defense_vs_qb['Attempts'] != '0']['Touchdowns'].astype(int).mean(), 0))
yards_mean = int(round(defense_vs_qb[defense_vs_qb['Attempts'] != '0']['Yards'].astype(int).mean(), 0))
print(attempts_mean,
      completions_mean,
      interceptions_mean,
      touchdowns_mean,
      yards_mean)

35 22 1 2 250


In [159]:
# imputing mean values into the rows with no recorded stats
for defense in defense_vs_qb[defense_vs_qb['Attempts'] == '0'].index:
    defense_vs_qb.loc[defense, 'Attempts'] = attempts_mean
    defense_vs_qb.loc[defense, 'Completions'] = completions_mean
    defense_vs_qb.loc[defense, 'Interceptions'] = interceptions_mean
    defense_vs_qb.loc[defense, 'Touchdowns'] = touchdowns_mean
    defense_vs_qb.loc[defense, 'Yards'] = yards_mean
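
The same imputation can be written with a boolean mask instead of a row-by-row loop. This is a sketch on toy data, assuming the hyphens have already been replaced with `'0'` strings as above. It also converts the columns to integers up front, which the original cell does not do.

```python
import pandas as pd

toy = pd.DataFrame({'Attempts': ['30', '0', '40'],
                    'Yards': ['200', '0', '300']})
stat_cols = ['Attempts', 'Yards']

# rows where nothing was recorded
missing = toy['Attempts'] == '0'

toy[stat_cols] = toy[stat_cols].astype(int)
# column means over the rows that do have data, rounded to whole numbers
means = toy.loc[~missing, stat_cols].mean().round().astype(int)

for col in stat_cols:
    toy.loc[missing, col] = means[col]

print(toy)
```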

> So right now the stat columns hold a mix of strings and integers, but since the hyphens have been replaced with zeros, I expect pandas to infer integer columns when I read the .csv in other notebooks. Let's check:

In [196]:
# defense_vs_qb.to_csv('../data/Defense_vs_QB.csv', index = False)

In [197]:
test = pd.read_csv('../data/Defense_vs_QB.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 10 columns):
Attempts         3584 non-null int64
Completions      3584 non-null int64
Interceptions    3584 non-null int64
Opponent         3584 non-null object
Rank             3584 non-null int64
Team             3584 non-null object
Touchdowns       3584 non-null int64
Week             3584 non-null int64
Yards            3584 non-null int64
Year             3584 non-null int64
dtypes: int64(8), object(2)
memory usage: 280.1+ KB


> BOOM!
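
The CSV round-trip works, but the dtypes could also be fixed in memory. A minimal sketch, assuming each stat column holds only integer-like strings and integers at this point:

```python
import pandas as pd

# toy frame with the same kind of mixed str/int object columns
toy = pd.DataFrame({'Team': ['DEN', 'KC'],
                    'Attempts': ['35', 30],
                    'Yards': ['250', 275]})
stat_cols = ['Attempts', 'Yards']

# convert each stat column to a numeric dtype
toy[stat_cols] = toy[stat_cols].apply(pd.to_numeric)

print(toy.dtypes)
```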

### Running Backs

In [169]:
# # markdown cell convert to code to run
# # collecting data for defense versus running backs
# defenses = []

# # collecting data from years 2011 - 2017
# for year in range(2011,2018):
#     # getting weeks 1-17
#     for week in range(1,18):
#         url = 'http://fantasy.nfl.com/research/pointsagainst?position=2&sort=pointsAgainst_pts&statCategory=pointsAgainst&statSeason={}&statType=weekPointsAgainst&statWeek={}'.format(year, week)
#         res = requests.get(url)
        
#         # run if request is successful
#         if res.status_code == 200:
#             soup = BeautifulSoup(res.content, 'lxml')
#             # loop through by team (due to the HTML of the web page)
#             for team in range(1,33):
#                 team_num = 'team-{}'.format(team)
#                 # grab row of the team
#                 row = soup.find('tr', attrs = {'class': team_num})

#                 # some teams will be on a bye, in that case we want to bypass (no pun intended) the loop
#                 # the website still lists these teams, but it says their opponent is a "Bye"
#                 if row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.lower() == 'bye':
#                     continue

#                 # create an empty dictionary for the observation being scraped
#                 week_ranks = {}

#                 # the defensive stats I will collect
#                 week_ranks['Team'] = row.find('div', attrs={'class': 'c'}).text.replace(' Defensevs RB', '')
#                 week_ranks['Opponent'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.replace('@', '')
#                 week_ranks['Week'] = week
#                 week_ranks['Year'] = year
#                 week_ranks['Rank'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-rank'}).text
#                 week_ranks['Attempts'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-rushing_attempts'}).text
#                 week_ranks['Yards'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-rushing_yards'}).text
#                 week_ranks['Touchdowns'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-rushing_tds'}).text

#                 defenses.append(week_ranks)
                
# #                 pd.DataFrame(defenses).to_csv('../data/Defense_vs_RB.csv', index = False)
#         else:
#             print("Oooooops, something went wrong! Status Code = {}".format(res.status_code))
#         time.sleep(1)

In [171]:
defense_vs_rb = pd.DataFrame(defenses)
defense_vs_rb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 8 columns):
Attempts      3584 non-null object
Opponent      3584 non-null object
Rank          3584 non-null object
Team          3584 non-null object
Touchdowns    3584 non-null object
Week          3584 non-null int64
Yards         3584 non-null object
Year          3584 non-null int64
dtypes: int64(2), object(6)
memory usage: 224.1+ KB


In [173]:
for col in ['Attempts', 'Touchdowns', 'Yards']:
    defense_vs_rb[col] = defense_vs_rb[col].map(lambda x: x.replace('-', '0'))

In [175]:
defense_vs_rb[defense_vs_rb['Attempts'] == '0'].shape

(262, 8)

In [179]:
defense_vs_rb[defense_vs_rb['Attempts'] == '0']['Opponent'].value_counts()

LA     38
SF     13
SEA    11
CHI     9
WAS     9
DET     9
KC      8
PIT     8
BUF     8
CLE     8
ARI     8
DAL     8
PHI     7
MIA     7
NO      7
NE      7
DEN     7
CIN     7
IND     7
TB      7
GB      6
BAL     6
MIN     6
JAX     6
CAR     6
NYG     6
HOU     6
ATL     6
LAC     6
OAK     6
TEN     5
NYJ     4
Name: Opponent, dtype: int64

> Same issue as before, and the same number of rows (262). Probably no coincidence.

In [180]:
rb_attempts_mean = int(round(defense_vs_rb[defense_vs_rb['Attempts'] != '0']['Attempts'].astype(int).mean(), 0))
rb_touchdowns_mean = int(round(defense_vs_rb[defense_vs_rb['Attempts'] != '0']['Touchdowns'].astype(int).mean(), 0))
rb_yards_mean = int(round(defense_vs_rb[defense_vs_rb['Attempts'] != '0']['Yards'].astype(int).mean(), 0))
print(rb_attempts_mean,
      rb_touchdowns_mean,
      rb_yards_mean)

23 1 95


In [181]:
for defense in defense_vs_rb[defense_vs_rb['Attempts'] == '0'].index:
    defense_vs_rb.loc[defense, 'Attempts'] = rb_attempts_mean
    defense_vs_rb.loc[defense, 'Touchdowns'] = rb_touchdowns_mean
    defense_vs_rb.loc[defense, 'Yards'] = rb_yards_mean

In [194]:
# defense_vs_rb.to_csv('../data/Defense_vs_RB.csv', index = False)

In [195]:
test = pd.read_csv('../data/Defense_vs_RB.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 8 columns):
Attempts      3584 non-null int64
Opponent      3584 non-null object
Rank          3584 non-null int64
Team          3584 non-null object
Touchdowns    3584 non-null int64
Week          3584 non-null int64
Yards         3584 non-null int64
Year          3584 non-null int64
dtypes: int64(6), object(2)
memory usage: 224.1+ KB


> BOOM!

### Wide Receivers

In [184]:
# # markdown cell convert to code to run
# # collecting data for defense versus receivers
# defenses = []

# # collecting data from years 2011 - 2017
# for year in range(2011,2018):
#     # getting weeks 1-17
#     for week in range(1,18):
#         url = 'http://fantasy.nfl.com/research/pointsagainst?position=3&sort=pointsAgainst_pts&statCategory=pointsAgainst&statSeason={}&statType=weekPointsAgainst&statWeek={}'.format(year, week)
#         res = requests.get(url)
        
#         # run if request is successful
#         if res.status_code == 200:
#             soup = BeautifulSoup(res.content, 'lxml')
#             # loop through by team (due to the HTML of the web page)
#             for team in range(1,33):
#                 team_num = 'team-{}'.format(team)
#                 # grab row of the team
#                 row = soup.find('tr', attrs = {'class': team_num})

#                 # some teams will be on a bye, in that case we want to bypass (no pun intended) the loop
#                 # the website still lists these teams, but it says their opponent is a "Bye"
#                 if row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.lower() == 'bye':
#                     continue

#                 # create an empty dictionary for the observation being scraped
#                 week_ranks = {}

#                 # the defensive stats I will collect
#                 week_ranks['Team'] = row.find('div', attrs={'class': 'c'}).text.replace(' Defensevs WR', '')
#                 week_ranks['Opponent'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.replace('@', '')
#                 week_ranks['Week'] = week
#                 week_ranks['Year'] = year
#                 week_ranks['Rank'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-rank'}).text
#                 week_ranks['Receptions'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_receptions'}).text
#                 week_ranks['Targets'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_targets'}).text
#                 week_ranks['Yards'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_yards'}).text
#                 week_ranks['Touchdowns'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_tds'}).text

#                 defenses.append(week_ranks)
                
# #                 pd.DataFrame(defenses).to_csv('../data/Defense_vs_WR.csv', index = False)
#         else:
#             print("Oooooops, something went wrong! Status Code = {}".format(res.status_code))
#         time.sleep(1)

In [187]:
defense_vs_wr = pd.DataFrame(defenses)
defense_vs_wr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 9 columns):
Opponent      3584 non-null object
Rank          3584 non-null object
Receptions    3584 non-null object
Targets       3584 non-null object
Team          3584 non-null object
Touchdowns    3584 non-null object
Week          3584 non-null int64
Yards         3584 non-null object
Year          3584 non-null int64
dtypes: int64(2), object(7)
memory usage: 252.1+ KB


In [188]:
for col in ['Receptions', 'Targets', 'Touchdowns', 'Yards']:
    defense_vs_wr[col] = defense_vs_wr[col].map(lambda x: x.replace('-', '0'))

In [189]:
defense_vs_wr[defense_vs_wr['Targets'] == '0'].shape

(262, 9)

In [190]:
defense_vs_wr[defense_vs_wr['Targets'] == '0']['Opponent'].value_counts()

LA     38
SF     13
SEA    11
CHI     9
WAS     9
DET     9
KC      8
PIT     8
BUF     8
CLE     8
ARI     8
DAL     8
PHI     7
MIA     7
NO      7
NE      7
DEN     7
CIN     7
IND     7
TB      7
GB      6
BAL     6
MIN     6
JAX     6
CAR     6
NYG     6
HOU     6
ATL     6
LAC     6
OAK     6
TEN     5
NYJ     4
Name: Opponent, dtype: int64

In [191]:
wr_receptions_mean = int(round(defense_vs_wr[defense_vs_wr['Targets'] != '0']['Receptions'].astype(int).mean(), 0))
wr_targets_mean = int(round(defense_vs_wr[defense_vs_wr['Targets'] != '0']['Targets'].astype(int).mean(), 0))
wr_touchdowns_mean = int(round(defense_vs_wr[defense_vs_wr['Targets'] != '0']['Touchdowns'].astype(int).mean(), 0))
wr_yards_mean = int(round(defense_vs_wr[defense_vs_wr['Targets'] != '0']['Yards'].astype(int).mean(), 0))
print(wr_receptions_mean, wr_targets_mean, wr_touchdowns_mean, wr_yards_mean)

12 20 1 158


In [193]:
for defense in defense_vs_wr[defense_vs_wr['Targets'] == '0'].index:
    defense_vs_wr.loc[defense, 'Receptions'] = wr_receptions_mean
    defense_vs_wr.loc[defense, 'Targets'] = wr_targets_mean
    defense_vs_wr.loc[defense, 'Touchdowns'] = wr_touchdowns_mean
    defense_vs_wr.loc[defense, 'Yards'] = wr_yards_mean

In [198]:
# defense_vs_wr.to_csv('../data/Defense_vs_WR.csv', index = False)

In [199]:
test = pd.read_csv('../data/Defense_vs_WR.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 9 columns):
Opponent      3584 non-null object
Rank          3584 non-null int64
Receptions    3584 non-null int64
Targets       3584 non-null int64
Team          3584 non-null object
Touchdowns    3584 non-null int64
Week          3584 non-null int64
Yards         3584 non-null int64
Year          3584 non-null int64
dtypes: int64(7), object(2)
memory usage: 252.1+ KB


> BOOM!

### Tight Ends

In [201]:
# # markdown cell convert to code to run
# # collecting data for defense versus tight ends
# defenses = []

# # collecting data from years 2011 - 2017
# for year in range(2011,2018):
#     # getting weeks 1-17
#     for week in range(1,18):
#         url = 'http://fantasy.nfl.com/research/pointsagainst?position=4&sort=pointsAgainst_pts&statCategory=pointsAgainst&statSeason={}&statType=weekPointsAgainst&statWeek={}'.format(year, week)
#         res = requests.get(url)
        
#         # run if request is successful
#         if res.status_code == 200:
#             soup = BeautifulSoup(res.content, 'lxml')
#             # loop through by team (due to the HTML of the web page)
#             for team in range(1,33):
#                 team_num = 'team-{}'.format(team)
#                 # grab row of the team
#                 row = soup.find('tr', attrs = {'class': team_num})

#                 # some teams will be on a bye, in that case we want to bypass (no pun intended) the loop
#                 # the website still lists these teams, but it says their opponent is a "Bye"
#                 if row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.lower() == 'bye':
#                     continue

#                 # create an empty dictionary for the observation being scraped
#                 week_ranks = {}

#                 # the defensive stats I will collect
#                 week_ranks['Team'] = row.find('div', attrs={'class': 'c'}).text.replace(' Defensevs WR', '')
#                 week_ranks['Opponent'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-opponent'}).text.replace('@', '')
#                 week_ranks['Week'] = week
#                 week_ranks['Year'] = year
#                 week_ranks['Rank'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-rank'}).text
#                 week_ranks['Receptions'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_receptions'}).text
#                 week_ranks['Targets'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_targets'}).text
#                 week_ranks['Yards'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_yards'}).text
#                 week_ranks['Touchdowns'] = row.find('span', attrs = {'class': 'pointsAgainstStatId-receiving_tds'}).text

#                 defenses.append(week_ranks)
                
# #                 pd.DataFrame(defenses).to_csv('../data/Defense_vs_TE.csv', index = False)
#         else:
#             print("Oooooops, something went wrong! Status Code = {}".format(res.status_code))
#         time.sleep(1)

In [202]:
defense_vs_te = pd.DataFrame(defenses)
defense_vs_te.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 9 columns):
Opponent      3584 non-null object
Rank          3584 non-null object
Receptions    3584 non-null object
Targets       3584 non-null object
Team          3584 non-null object
Touchdowns    3584 non-null object
Week          3584 non-null int64
Yards         3584 non-null object
Year          3584 non-null int64
dtypes: int64(2), object(7)
memory usage: 252.1+ KB


In [203]:
for col in ['Receptions', 'Targets', 'Touchdowns', 'Yards']:
    defense_vs_te[col] = defense_vs_te[col].map(lambda x: x.replace('-', '0'))

In [204]:
defense_vs_te[defense_vs_te['Targets'] == '0'].shape

(303, 9)

In [205]:
defense_vs_te[defense_vs_te['Targets'] == '0']['Opponent'].value_counts()

LA     38
SF     14
CHI    13
NYJ    13
KC     11
DET    11
SEA    11
WAS    10
DEN    10
BUF    10
ARI    10
TB      9
DAL     9
MIA     9
CLE     8
GB      8
IND     8
ATL     8
PIT     8
NYG     7
PHI     7
BAL     7
CIN     7
HOU     7
NE      7
NO      7
OAK     7
JAX     6
MIN     6
LAC     6
CAR     6
TEN     5
Name: Opponent, dtype: int64
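
The tight end count (303) is higher than the 262 seen for the other positions, which may mean some of these rows are real games in which a defense allowed no tight end targets rather than missing data. That is an assumption, not something the scrape confirms. One way to separate the two cases would be to cross-check against the rows already flagged as missing for quarterbacks. The sketch below uses toy frames and assumes the `Team` values match across the two frames, which in this notebook would require stripping the leftover text from the TE team names first.

```python
import pandas as pd

keys = ['Team', 'Year', 'Week']

# toy stand-ins for the QB and TE frames
qb = pd.DataFrame({'Team': ['DEN', 'KC', 'LA'], 'Year': [2016] * 3,
                   'Week': [1] * 3, 'Attempts': ['0', '35', '30']})
te = pd.DataFrame({'Team': ['DEN', 'KC', 'LA'], 'Year': [2016] * 3,
                   'Week': [1] * 3, 'Targets': ['0', '0', '7']})

qb_missing = qb.loc[qb['Attempts'] == '0', keys]
te_zero = te[te['Targets'] == '0']

# a TE zero row that is also missing for QBs is likely missing data;
# otherwise it may be a genuine zero
flagged = te_zero.merge(qb_missing, on=keys, how='left', indicator=True)
flagged['likely_missing'] = flagged['_merge'] == 'both'

print(flagged[keys + ['likely_missing']])
```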

In [206]:
te_receptions_mean = int(round(defense_vs_te[defense_vs_te['Targets'] != '0']['Receptions'].astype(int).mean(), 0))
te_targets_mean = int(round(defense_vs_te[defense_vs_te['Targets'] != '0']['Targets'].astype(int).mean(), 0))
te_touchdowns_mean = int(round(defense_vs_te[defense_vs_te['Targets'] != '0']['Touchdowns'].astype(int).mean(), 0))
te_yards_mean = int(round(defense_vs_te[defense_vs_te['Targets'] != '0']['Yards'].astype(int).mean(), 0))
print(te_receptions_mean, te_targets_mean, te_touchdowns_mean, te_yards_mean)

5 7 0 52


In [207]:
for defense in defense_vs_te[defense_vs_te['Targets'] == '0'].index:
    defense_vs_te.loc[defense, 'Receptions'] = te_receptions_mean
    defense_vs_te.loc[defense, 'Targets'] = te_targets_mean
    defense_vs_te.loc[defense, 'Touchdowns'] = te_touchdowns_mean
    defense_vs_te.loc[defense, 'Yards'] = te_yards_mean

In [208]:
# defense_vs_te.to_csv('../data/Defense_vs_TE.csv', index = False)

In [209]:
test = pd.read_csv('../data/Defense_vs_TE.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 9 columns):
Opponent      3584 non-null object
Rank          3584 non-null int64
Receptions    3584 non-null int64
Targets       3584 non-null int64
Team          3584 non-null object
Touchdowns    3584 non-null int64
Week          3584 non-null int64
Yards         3584 non-null int64
Year          3584 non-null int64
dtypes: int64(7), object(2)
memory usage: 252.1+ KB


> BOOM!

### Creating Rolling Averages for Statistics and Shifting Them

There are two things I need to do to this dataset before I can model with it. First, I need to create a rolling average for each statistic. Intuitively it makes far more sense to "trend" the data for each team, as it will better reflect how that team has been playing. **I am going to use a rolling window of 3**, meaning that the statistic at each observation will represent what that defense has done over its last 3 games leading up to a contest. Second, I need to shift the rolled values down by one game so that each observation only carries information from earlier games. Otherwise I would be using the statistics from a given week to predict a game that produced those statistics.
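
To make the two steps concrete, here is a small worked example for a single defense over four weeks. The yardage numbers are made up for illustration.

```python
import pandas as pd

# made-up passing yards allowed by one defense in weeks 1-4
yards = pd.Series([200, 300, 250, 350], index=[1, 2, 3, 4], name='Yards')

# step 1: rolling mean over up to 3 games (fewer at the start of the season)
rolled = yards.rolling(window=3, min_periods=1).mean()

# step 2: shift by one game so each week only sees earlier games
shifted = rolled.shift()

print(pd.DataFrame({'actual': yards, 'rolled': rolled, 'shifted': shifted}))
```

Week 4 ends up with 250, the average of weeks 1 to 3, and week 1 has no prior games, so it becomes `NaN`.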

In [361]:
defense_vs_qb = pd.read_csv('../data/Defense_vs_QB.csv')
defense_vs_rb = pd.read_csv('../data/Defense_vs_RB.csv')
defense_vs_wr = pd.read_csv('../data/Defense_vs_WR.csv')
defense_vs_te = pd.read_csv('../data/Defense_vs_TE.csv')

In [352]:
def roll_and_shift(df, features, window = 3, min_periods = 1):
    # groups by team and season so years don't leak into each other
    rolled_data = df.groupby(['Team', 'Year'])[features].rolling(window = window, min_periods = min_periods).mean()
    # requires additional groupby so you don't shift into other teams/seasons
    shifted_data = rolled_data.groupby(level = [0, 1]).shift()
    return shifted_data
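
A quick sanity check of the grouped version on toy data, sketched under the assumption that rows are already in week order within each team and season. The result is indexed by `(Team, Year, original row label)`, which is what the lookup in the next cell relies on.

```python
import pandas as pd

def roll_and_shift(df, features, window=3, min_periods=1):
    # same logic as the function above, repeated so this block runs on its own
    rolled = (df.groupby(['Team', 'Year'])[features]
                .rolling(window=window, min_periods=min_periods).mean())
    return rolled.groupby(level=[0, 1]).shift()

# made-up numbers, interleaved by week like the scraped data
toy = pd.DataFrame({'Team': ['DEN', 'KC', 'DEN', 'KC'],
                    'Year': [2017] * 4,
                    'Week': [1, 1, 2, 2],
                    'Yards': [200, 100, 300, 150]})

out = roll_and_shift(toy, ['Yards'])
print(out)
```

Each team's week 2 row holds only its own week 1 value, so nothing leaks between DEN and KC.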

In [356]:
def impute_rolling_stats(df, rolling_stats_df, features):
    """
    CHANGES DATAFRAME INPLACE
    The need for this function was born out of the fact that my dataframes are sorted by week and season.
    However, I want to roll the stats for each particular team, which means I cannot simply create rolling averages
    for every observation. That would mean I am mixing the performance of one team with that of another.
    """
    for index, row in df.iterrows():
        # create tuple matching indexes from rolling stats df
        rolling_index = (row['Team'], row['Year'], index)
        # replace df stats with rolling stats
        df.loc[index, features] = rolling_stats_df.loc[rolling_index, features]
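
The row-by-row loop works, but it can be slow on a few thousand rows. As an alternative sketch, the `Team` and `Year` levels can be dropped from the rolled result so its index matches the original row labels, and pandas then aligns the values on assignment. This is my own variation on toy data, not the approach used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({'Team': ['DEN', 'KC', 'DEN', 'KC'],
                    'Year': [2017] * 4,
                    'Week': [1, 1, 2, 2],
                    'Yards': [200, 100, 300, 150]})
features = ['Yards']

rolled = (toy.groupby(['Team', 'Year'])[features]
             .rolling(window=3, min_periods=1).mean()
             .groupby(level=[0, 1]).shift())

# drop the Team and Year levels so only the original row labels remain
aligned = rolled.reset_index(level=[0, 1], drop=True)

for col in features:
    # Series assignment aligns on the index, so row order does not matter
    toy[col] = aligned[col]

print(toy)
```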

In [370]:
# Quarterbacks
features = ['Attempts', 'Completions', 'Interceptions', 'Touchdowns', 'Yards']
rolling_qbs = roll_and_shift(defense_vs_qb, features = features)
impute_rolling_stats(defense_vs_qb, rolling_qbs, features = features)

In [363]:
# Running Backs
features = ['Attempts', 'Touchdowns', 'Yards']
rolling_rbs = roll_and_shift(defense_vs_rb, features = features)
impute_rolling_stats(defense_vs_rb, rolling_rbs, features = features)

In [375]:
# Receivers
features = ['Receptions', 'Targets', 'Touchdowns', 'Yards']
rolling_wrs = roll_and_shift(defense_vs_wr, features = features)
impute_rolling_stats(defense_vs_wr, rolling_wrs, features = features)

In [377]:
# Tight Ends
features = ['Receptions', 'Targets', 'Touchdowns', 'Yards']
rolling_tes = roll_and_shift(defense_vs_te, features = features)
impute_rolling_stats(defense_vs_te, rolling_tes, features = features)

In [381]:
# the TE scraper stripped ' Defensevs WR' instead of ' Defensevs TE', so remove the leftover text here
defense_vs_te['Team'] = defense_vs_te['Team'].map(lambda x: x.replace(' Defensevs TE', ''))

In [379]:
# Dropping each team's first game of every season (usually Week 1), which is NaN after the shift
defense_vs_qb.dropna(inplace = True)
defense_vs_rb.dropna(inplace = True)
defense_vs_wr.dropna(inplace = True)
defense_vs_te.dropna(inplace = True)
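
A bare `dropna` removes a row if any column is `NaN`. If the intent is only to drop rows without rolled stats, restricting the check to the feature columns is a little safer. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Team': ['DEN', 'DEN', 'KC'],
                    'Week': [1, 2, 1],
                    'Yards': [np.nan, 200.0, np.nan]})

# only drop rows where the rolled feature itself is missing
trimmed = toy.dropna(subset=['Yards']).reset_index(drop=True)

print(trimmed)
```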

In [383]:
# Export Data
defense_vs_qb.to_csv('../data/Defense_vs_QB.csv', index = False)
defense_vs_rb.to_csv('../data/Defense_vs_RB.csv', index = False)
defense_vs_wr.to_csv('../data/Defense_vs_WR.csv', index = False)
defense_vs_te.to_csv('../data/Defense_vs_TE.csv', index = False)
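
Note that this export overwrites the same files that were read at the start of this section, so re-running the section would roll data that has already been rolled. One option, sketched below, is to write the rolled data to a separate file. The `_rolled` suffix and the helper `export_rolled` are hypothetical names of my own, and the demo writes to a temporary directory rather than `../data/`.

```python
import os
import tempfile

import pandas as pd

def export_rolled(df, raw_path, suffix='_rolled'):
    """Write df next to raw_path with a suffix so the raw file is kept."""
    root, ext = os.path.splitext(raw_path)
    out_path = root + suffix + ext
    df.to_csv(out_path, index=False)
    return out_path

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, 'Defense_vs_QB.csv')
    toy = pd.DataFrame({'Team': ['DEN'], 'Yards': [250.0]})
    toy.to_csv(raw_path, index=False)       # stand-in for the cleaned file

    out_path = export_rolled(toy, raw_path)
    out_name = os.path.basename(out_path)
    both_exist = os.path.exists(raw_path) and os.path.exists(out_path)

print(out_name, both_exist)
```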