## Receiving Stats

This notebook combines and cleans all of the receiving datasets into a single yearly file to run models on.

In [1]:
import pandas as pd
import numpy as np

Starting off the project, I wrote a function to pull in a group of our data frames: 

* The input year's receiving stats 
* The receiving stats for the two years before our input year
* The list of drafted players from 2001 to 2022
* The list of player debuts from 2001 to 2021

These were then merged together. Using RegEx, I stripped the * (indicating that the player made the Pro Bowl) and + (indicating that the player was voted an All-Pro) from player names, and added columns to represent those values. After merging everything together, the function fills in some of the missing values and returns a much cleaner dataset.

In [2]:
def probowl_map(name):
    return 1 if '*' in name else 0

def allpro_map(name):
    return 1 if '+' in name else 0

def receiving_stats(year):
    df1 = pd.read_csv('../data/receiving/' + str(year) + '_receiving_original.csv')
    df2 = pd.read_csv('../data/receiving/' + str(year - 1) + '_receiving_basic_original.csv')
    df3 = pd.read_csv('../data/receiving/' + str(year - 2) + '_receiving_basic_original.csv')
    draft = pd.read_csv('../data/draft_final.csv')
    debut = pd.read_csv('../data/debut_final.csv')
    df1['ProBowl'] = df1['Player'].map(probowl_map)
    df1['AllPro'] = df1['Player'].map(allpro_map)
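    # Keep letters, periods, apostrophes, hyphens and spaces; the trailing * and + markers are dropped.
    # (The previous class used ' -\'' as a range, which excluded '-' and truncated hyphenated names.)
    df1['Player'] = df1['Player'].str.extract(r"([A-Za-z.' -]+ [A-Za-z.' -]+)", expand = False)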
    df1['Year'] = year
    simple2 = df2[['G', 'GS', 'Tgt', 'Rec', 'Yds', 'TD', 'Player-additional']].copy()
    simple2.rename(columns = {'G': 'G_-1_year', 'GS': 'GS_-1_year', 
                              'Tgt': 'Tgt_-1_year', 'Rec': 'Rec_-1_year', 
                              'Yds': 'Yds_-1_year', 'TD': 'TD_-1_year'}, inplace = True)
    simple3 = df3[['G', 'GS', 'Tgt', 'Rec', 'Yds', 'TD', 'Player-additional']].copy()
    simple3.rename(columns = {'G': 'G_-2_year', 'GS': 'GS_-2_year', 
                              'Tgt': 'Tgt_-2_year', 'Rec': 'Rec_-2_year', 
                              'Yds': 'Yds_-2_year', 'TD': 'TD_-2_year'}, inplace = True)
    df1 = df1.merge(simple2, how = 'left', on = 'Player-additional')
    df1 = df1.merge(simple3, how = 'left', on = 'Player-additional')
    draft.rename(columns = {'Pos': 'DPos'}, inplace = True)
    df1['Pos'] = df1['Pos'].map({'wr': 'WR', 'te': 'TE', 'rb': 'RB', 'qb': 'QB', 'fb': 'FB', 'lt': 'T', 'rt': 'T', 't': 'T',
                                 'WR': 'WR', 'TE': 'TE', 'RB': 'RB', 'QB': 'QB', 'FB': 'FB', 'LT': 'T', 'RT': 'T', 'T': 'T'})
    df1.drop(columns = 'Rk', inplace = True)
    df1 = df1.merge(draft, how = 'left', on = 'Player-additional')
    df1 = df1.merge(debut, how = 'left', on = 'Player-additional')
    df1['YrsPlayed'] = year - df1['RookieYear']
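    # Assign the filled columns back instead of calling fillna(inplace = True) on a column selection
    df1['Rec/Br'] = df1['Rec/Br'].fillna(0)
    df1['1D'] = df1['1D'].fillna(0)
    # Players missing from the draft list get round 8, pick 260
    df1['Rnd'] = df1['Rnd'].fillna(8)
    df1['Pick'] = df1['Pick'].fillna(260)
    # mean() skips missing values by default
    df1['Int'] = df1['Int'].fillna(round(df1['Int'].mean(), 0))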
    df1.drop(columns = ['Pos', 'DPos', 'DraftYear', 'RookieYear'], inplace = True)
    df1.rename(columns = {'RPos': 'Pos'}, inplace = True)
    df1 = df1[df1['Pos'].isin(['RB', 'WR', 'TE'])]
    df1.set_index('Player', inplace = True)
    return df1
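
As a quick check of the name-cleaning step described above, here is a minimal sketch on made-up names (the names and the `flags` frame are hypothetical, not rows from the dataset). It derives the two flag columns and then removes only the marker characters, which is one simple way to leave hyphens, apostrophes and suffixes untouched.

```python
import pandas as pd

# Hypothetical names in the source format: '*' = Pro Bowl, '+' = All-Pro.
names = pd.Series(['Alex Example*+', 'Sam Smith-Jones*', "D'Ante O'Neil+", 'Pat Doe Jr.'])

flags = pd.DataFrame({
    # regex=False treats '*' and '+' as literal characters.
    'ProBowl': names.str.contains('*', regex=False).astype(int),
    'AllPro': names.str.contains('+', regex=False).astype(int),
    # Strip only the marker characters, then any leftover whitespace.
    'Player': names.str.replace(r'[*+]', '', regex=True).str.strip(),
})

print(flags)
```

Running a check like this on a handful of awkward names is a cheap way to confirm that the pattern does not cut anyone's name short before the column becomes the index.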

In [3]:
rec2020 = receiving_stats(2020)
rec2020.head()

Player,Tm,Age,G,GS,Tgt,Rec,Yds,TD,1D,YBC,...,G_-2_year,GS_-2_year,Tgt_-2_year,Rec_-2_year,Yds_-2_year,TD_-2_year,Rnd,Pick,Pos,YrsPlayed
Stefon Diggs,BUF,27,16,15,166,127,1535,8,73.0,1071,...,15.0,14.0,149.0,102.0,1021.0,9.0,5.0,146.0,WR,5
Davante Adams,GNB,28,14,14,149,115,1374,18,73.0,777,...,15.0,15.0,169.0,111.0,1386.0,13.0,2.0,53.0,WR,6
DeAndre Hopkins,ARI,28,16,16,160,115,1407,6,75.0,873,...,16.0,16.0,163.0,115.0,1572.0,11.0,1.0,27.0,WR,7
Darren Waller,LVR,28,16,15,145,107,1196,9,69.0,624,...,4.0,0.0,6.0,6.0,75.0,0.0,6.0,204.0,TE,5
Travis Kelce,KAN,31,15,15,145,105,1416,11,79.0,829,...,16.0,16.0,150.0,103.0,1336.0,10.0,3.0,63.0,TE,7


The cell below builds a dataset of all receiving stats from 2018 through 2021 to use for data analysis and visualization.

In [4]:
rec2021 = receiving_stats(2021)
rec2020 = receiving_stats(2020)
rec2019 = receiving_stats(2019)
rec2018 = receiving_stats(2018)

eda = pd.concat([rec2021, rec2020, rec2019, rec2018])
print(eda.shape)
#eda.to_csv('../data/2018-2021_receiving.csv', index = True)

(1847, 40)


### Filling in Missing Stats for Rookies

Below is another function I created to fill in stats that are missing because rookie players were not in the league in previous years. For the year before a player entered the league, it fills in each stat with the average for rookies at the same position who were drafted in the same round between 2018 and 2021. For players who weren't in the league two years ago, it uses those same averages for the season two years prior. Still to be decided: whether to instead multiply those averages by 0.4 for the season two years prior, to assume growth over a player's final college season before entering the draft.

In [5]:
# Idea to use np.where from: https://stackoverflow.com/questions/10715519/conditionally-fill-column-values-based-on-another-columns-value-in-pandas

def rookie_stat_fill(df):
    rookie_stats = pd.read_csv('../data/rookie_stats.csv')
    for i in ['WR', 'TE', 'RB']:
        for j in range(1, 9):
            for k in ['G', 'GS', 'Yds', 'Tgt', 'Rec', 'TD']:
                df[k + '_-2_year'] = np.where((df['YrsPlayed'] == 1) & (df['Pos'] == i) & (df['Rnd'] == j),
                                        rookie_stats[(rookie_stats['RPos'] == i) & (rookie_stats['Rnd'] == j)][k],
                                        df[k + '_-2_year'])
                df[k + '_-1_year'] = np.where((df['YrsPlayed'] == 0) & (df['Pos'] == i) & (df['Rnd'] == j),
                                        rookie_stats[(rookie_stats['RPos'] == i) & (rookie_stats['Rnd'] == j)][k],
                                        df[k + '_-1_year'])
                df[k + '_-2_year'] = np.where((df['YrsPlayed'] == 0) & (df['Pos'] == i) & (df['Rnd'] == j),
                                        round(rookie_stats[(rookie_stats['RPos'] == i) & (rookie_stats['Rnd'] == j)][k],0),
                                        df[k + '_-2_year'])
    return df
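
To make the fill rules concrete, here is a small sketch on made-up data. The `rookie_avgs` table stands in for `rookie_stats.csv`, and the players and values are hypothetical. It also swaps the nested loops for a single merge-based lookup, which is an alternative I am assuming gives the same result when there is exactly one average per position and round.

```python
import numpy as np
import pandas as pd

# Made-up averages keyed by position and draft round (stand-in for rookie_stats.csv).
rookie_avgs = pd.DataFrame({
    'RPos': ['WR', 'WR', 'TE'],
    'Rnd': [1, 2, 1],
    'Yds': [650.4, 410.0, 300.0],
})

# Made-up players: a rookie, a second-year player, and a veteran.
players = pd.DataFrame({
    'Pos': ['WR', 'WR', 'TE'],
    'Rnd': [1, 2, 1],
    'YrsPlayed': [0, 1, 6],
    'Yds_-1_year': [np.nan, 500.0, 900.0],
    'Yds_-2_year': [np.nan, np.nan, 850.0],
}, index=['Rookie A', 'Sophomore B', 'Veteran C'])

# Look up each player's position/round average once; a left merge keeps the row order.
lookup = rookie_avgs.rename(columns={'RPos': 'Pos', 'Yds': 'Yds_avg'})
avg = players[['Pos', 'Rnd']].merge(lookup, how='left', on=['Pos', 'Rnd'])['Yds_avg'].to_numpy()

rookie = (players['YrsPlayed'] == 0).to_numpy()
second_year = (players['YrsPlayed'] == 1).to_numpy()

# Rookies get the average for both prior seasons (rounded for two seasons ago);
# second-year players get it only for two seasons ago.
players['Yds_-1_year'] = np.where(rookie, avg, players['Yds_-1_year'])
players['Yds_-2_year'] = np.where(rookie, np.round(avg), players['Yds_-2_year'])
players['Yds_-2_year'] = np.where(second_year, avg, players['Yds_-2_year'])

print(players)
```

The veteran's row is left untouched, which is the behaviour the function above relies on when it is applied to a full season.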

In [6]:
rec2020 = rookie_stat_fill(rec2020)
rec2020.head()

Player,Tm,Age,G,GS,Tgt,Rec,Yds,TD,1D,YBC,...,G_-2_year,GS_-2_year,Tgt_-2_year,Rec_-2_year,Yds_-2_year,TD_-2_year,Rnd,Pick,Pos,YrsPlayed
Stefon Diggs,BUF,27,16,15,166,127,1535,8,73.0,1071,...,15.0,14.0,149.0,102.0,1021.0,9.0,5.0,146.0,WR,5
Davante Adams,GNB,28,14,14,149,115,1374,18,73.0,777,...,15.0,15.0,169.0,111.0,1386.0,13.0,2.0,53.0,WR,6
DeAndre Hopkins,ARI,28,16,16,160,115,1407,6,75.0,873,...,16.0,16.0,163.0,115.0,1572.0,11.0,1.0,27.0,WR,7
Darren Waller,LVR,28,16,15,145,107,1196,9,69.0,624,...,4.0,0.0,6.0,6.0,75.0,0.0,6.0,204.0,TE,5
Travis Kelce,KAN,31,15,15,145,105,1416,11,79.0,829,...,16.0,16.0,150.0,103.0,1336.0,10.0,3.0,63.0,TE,7


### Dealing with Injured Players

The majority of missing values left in our dataset come from players who were injured over the course of the season. 

The cell below exports the list of players I'm assuming were injured, based on their stats and years played.

In [7]:
rec2020 = rookie_stat_fill(rec2020)
#rec2020[(rec2020['G_-1_year'].isnull() == True) | (rec2020['G_-2_year'].isnull() == True)].to_csv('../data/injured_players.csv', index = True)

Similar to our function for filling in stats for rookies, I created a function to fill in the missing stats for players who missed full seasons. For each stat, I summed the two non-missing years, divided by the total number of games played over those two years, then multiplied by 16 (the number of games in a season).

In [8]:
def injury_stat_fill(df):
    for i in ['G', 'GS', 'Yds', 'Tgt', 'Rec', 'TD']:
        # Missing last season but present two seasons ago: prorate the other two seasons to 16 games
        df[i + '_-1_year'] = np.where(df[i + '_-1_year'].isnull() & df[i + '_-2_year'].notnull(),
                                      round((df[i] + df[i + '_-2_year']) / (df['G'] + df['G_-2_year']) * 16, 0),
                                      df[i + '_-1_year'])
        # Missing two seasons ago but present last season: same proration using the other pair
        df[i + '_-2_year'] = np.where(df[i + '_-2_year'].isnull() & df[i + '_-1_year'].notnull(),
                                      round((df[i] + df[i + '_-1_year']) / (df['G'] + df['G_-1_year']) * 16, 0),
                                      df[i + '_-2_year'])
    return df
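
As a worked example of the proration, consider a made-up player (hypothetical numbers) with 600 yards in 12 games this season, no stats last season, and 1,000 yards in 16 games two seasons ago. The fill value is (600 + 1000) / (12 + 16) * 16, which rounds to 914. A minimal sketch of that single case:

```python
import numpy as np
import pandas as pd

# Made-up player who missed the previous season entirely.
df = pd.DataFrame({
    'G': [12], 'Yds': [600],
    'G_-1_year': [np.nan], 'Yds_-1_year': [np.nan],
    'G_-2_year': [16.0], 'Yds_-2_year': [1000.0],
}, index=['Injured D'])

# Only fill rows where last season is missing and two seasons ago is present.
missing_prev = df['Yds_-1_year'].isnull() & df['Yds_-2_year'].notnull()

# Per-game rate over the two known seasons, scaled to a 16-game season.
per_16 = ((df['Yds'] + df['Yds_-2_year']) / (df['G'] + df['G_-2_year']) * 16).round(0)

df['Yds_-1_year'] = np.where(missing_prev, per_16, df['Yds_-1_year'])
print(df['Yds_-1_year'])
```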

In [9]:
rec2020 = injury_stat_fill(rec2020)
rec2020.head()

Player,Tm,Age,G,GS,Tgt,Rec,Yds,TD,1D,YBC,...,G_-2_year,GS_-2_year,Tgt_-2_year,Rec_-2_year,Yds_-2_year,TD_-2_year,Rnd,Pick,Pos,YrsPlayed
Stefon Diggs,BUF,27,16,15,166,127,1535,8,73.0,1071,...,15.0,14.0,149.0,102.0,1021.0,9.0,5.0,146.0,WR,5
Davante Adams,GNB,28,14,14,149,115,1374,18,73.0,777,...,15.0,15.0,169.0,111.0,1386.0,13.0,2.0,53.0,WR,6
DeAndre Hopkins,ARI,28,16,16,160,115,1407,6,75.0,873,...,16.0,16.0,163.0,115.0,1572.0,11.0,1.0,27.0,WR,7
Darren Waller,LVR,28,16,15,145,107,1196,9,69.0,624,...,4.0,0.0,6.0,6.0,75.0,0.0,6.0,204.0,TE,5
Travis Kelce,KAN,31,15,15,145,105,1416,11,79.0,829,...,16.0,16.0,150.0,103.0,1336.0,10.0,3.0,63.0,TE,7


### Target Set

The columns below are the target set we want to predict. The function grabs the season after the input year, takes the targets, receptions, yards and touchdowns from that season, and appends them to our dataset so we can build a model to predict those values.

In [10]:
def create_target(year):
    final = receiving_stats(year)
    df_target = pd.read_csv('../data/receiving/' + str(year + 1) + '_receiving_basic_original.csv')
    df_target = df_target[['Tgt', 'Rec', 'Yds', 'TD', 'Player-additional']].copy()
    df_target.rename(columns = {'Tgt': 'Tgt_target', 'Rec': 'Rec_target', 
                                'Yds': 'Yds_target', 'TD': 'TD_target'}, inplace = True)
    # Reset the index before merging so the 'Player' index survives (https://stackoverflow.com/questions/11976503/how-to-keep-index-when-using-pandas-merge)
    # The inner merge keeps only players who also appear in the following season's file
    final = final.reset_index().merge(df_target, how = 'inner', on = 'Player-additional').set_index('Player')
    return final
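
One consequence of the inner merge is worth seeing on a toy case: a player with no row in the following season's file drops out of the training set entirely. The sketch below uses made-up players and IDs (`PlayA00` and `PlayB00` are hypothetical values for `Player-additional`).

```python
import pandas as pd

# Current-season rows, indexed by player name as in receiving_stats().
current = pd.DataFrame({
    'Player': ['Player A', 'Player B'],
    'Player-additional': ['PlayA00', 'PlayB00'],
    'Yds': [900, 300],
}).set_index('Player')

# Following-season file: Player B has no row here.
next_season = pd.DataFrame({
    'Player-additional': ['PlayA00'],
    'Yds': [1100],
}).rename(columns={'Yds': 'Yds_target'})

# reset_index() keeps 'Player' as a column through the merge; set_index() restores it.
labelled = (current.reset_index()
            .merge(next_season, how='inner', on='Player-additional')
            .set_index('Player'))

print(labelled)
```

Only Player A survives, so the training set is limited to players who recorded receiving stats in consecutive seasons.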

In [11]:
target2020 = create_target(2020)
target2019 = create_target(2019)
target2018 = create_target(2018)

target_final = pd.concat([target2020, target2019, target2018])

# Fill in rookie and injury stats with our fill functions:
target_final = rookie_stat_fill(target_final)
target_final = injury_stat_fill(target_final)

target_final.dropna(inplace = True)
print(f'Dataset shape: {target_final.shape}')

Dataset shape: (985, 44)


#### Export a dataset that's ready for modeling!

In [12]:
#target_final.to_csv('../data/receiving_train.csv', index = True)

## Create a Set to Make Predictions on

We need to create another dataset using the 2021 data in order to eventually make predictions. Because we don't yet have the data for the following season, the target columns won't be added, so we can just use our `receiving_stats` function for the year 2021.

In [13]:
rec2021 = receiving_stats(2021)
rec2021 = rookie_stat_fill(rec2021)
rec2021 = injury_stat_fill(rec2021)
rec2021.dropna(inplace = True)

#rec2021.to_csv('../data/receiving_test.csv', index = True)