<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Packages" data-toc-modified-id="Import-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Packages</a></span></li><li><span><a href="#Read-in-Data" data-toc-modified-id="Read-in-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read in Data</a></span></li><li><span><a href="#Physical-Characteristics" data-toc-modified-id="Physical-Characteristics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Physical Characteristics</a></span><ul class="toc-item"><li><span><a href="#Weight" data-toc-modified-id="Weight-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Weight</a></span></li><li><span><a href="#Height" data-toc-modified-id="Height-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Height</a></span></li></ul></li><li><span><a href="#Yards" data-toc-modified-id="Yards-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Yards</a></span><ul class="toc-item"><li><span><a href="#Location-on-the-Field" data-toc-modified-id="Location-on-the-Field-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Location on the Field</a></span></li><li><span><a href="#Distance-to-First-Down" data-toc-modified-id="Distance-to-First-Down-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Distance to First Down</a></span></li></ul></li><li><span><a href="#The-Success-of-a-Play" data-toc-modified-id="The-Success-of-a-Play-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>The Success of a Play</a></span></li><li><span><a href="#Offensive-Line" data-toc-modified-id="Offensive-Line-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Offensive Line</a></span><ul class="toc-item"><li><span><a href="#OL-Weight" data-toc-modified-id="OL-Weight-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>OL Weight</a></span></li><li><span><a href="#OL-Positions" data-toc-modified-id="OL-Positions-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>OL Positions</a></span></li></ul></li></ul></div>

# 2019 NFL Big Data Bowl - Part 2: Exploratory Data Analysis
This is the second notebook in a series where I partake in the 2019 Big Data Bowl competition on Kaggle. The first notebook dealt with cleaning the datasets to produce 3 different files:

- Each player's physical characteristics (height, weight, etc.)
- Each player's positions and speeds on each play
- Each play's properties (down, distance, yards rushed, etc.)

We'll be examining each of these files in order to get a deeper understanding of the data. In the end, the goal of exploratory data analysis is to show information that could be useful when completing the task at hand. In this case, we are ultimately trying to predict how many yards a given rushing play will go for, based on information at the time of handoff.

## Import Packages
For plotting all of our graphs, I'll be using a package called `altair`, a declarative plotting library built on top of Vega-Lite.

In [56]:
import numpy as np
import pandas as pd
import altair

import os
os.makedirs('./Altair JSONs/Dataframes', exist_ok=True)
os.makedirs('./Altair JSONs/Plots', exist_ok=True)

dfBaseUrl = 'https://raw.githubusercontent.com/MughilM/Data-Science-Projects/master/NFL%20NextGen/Altair%20JSONs/Dataframes'
dfLocalPath = './Altair JSONs/Dataframes'

# Custom theme that sets the width and height...
def sizeTheme(*args, **kwargs):
    font = 'Gill Sans MT'
    return {
        'config': {
            'view': {
                'width': 800,
                'height': 350
            },
            'title': {
                'font': font,
                'fontSize': 18
            },
            'axis': {
                'labelFont': font,
                'labelFontSize': 12,
                'titleFont': font,
                'titleFontSize': 14
            },
            'header': {
                'labelFont': font,
                'titleFont': font
            },
            'legend': {
                'labelFont': font,
                'labelFontSize': 12,
                'titleFont': font,
                'titleFontSize': 14
            }
        }
    }
altair.themes.register('sizeTheme', sizeTheme)
altair.themes.enable('sizeTheme')

ThemeRegistry.enable('sizeTheme')

## Read in Data
As stated before, we have 3 files we need to read in. We also set the data types as a dictionary we pass into each `read_csv` call.

In [2]:
# Player characteristics
pcdt = {
    'NflId': 'category',
    'DisplayName': 'str',
    'PlayerHeight': 'float32',
    'PlayerWeight': 'int',
    'PlayerCollegeName': 'str',
    'Position': 'category'
}
playerChara = pd.read_csv('./Data/playerData.csv', parse_dates=[4],
                         dtype=pcdt)
# Player positions
ppdt = {
    'GameId': 'category',
    'PlayId': 'category',
    'Team': 'str',
    'NflId': 'category'
}
playerPos = pd.read_csv('./Data/playerPositions.csv', dtype=ppdt)
# Play by play
# More columns here...
# Reason is none of our integer columns needs 64-bit precision...
# Use less memory if we can...
integerCols = ['Season', 'YardLine', 'Quarter', 'Down', 'Distance', 'HomeScoreBeforePlay', 
              'VisitorScoreBeforePlay', 'DefendersInTheBox', 'Yards', 'Week',
              'GameClockMinute', 'GameClockSecond', 'Half', 'RB', 'TE',
              'WR', 'OL', 'QB', 'DoO', 'DL', 'LB', 'DB', 'OoD']
catCols = ['GameId', 'PlayId', 'PossessionTeam', 'FieldPosition', 'NflIdRusher',
          'OffenseFormation', 'PlayDirection', 'HomeTeamAbbr', 'VisitorTeamAbbr',
          'Stadium', 'Location', 'Turf']
pbpdt = {}
for col in integerCols:
    pbpdt[col] = 'int32'
for col in catCols:
    pbpdt[col] = 'category'
pbpData = pd.read_csv('./Data/playByPlayData.csv', parse_dates=[15, 16], dtype=pbpdt)
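
The memory argument in the comments above can be checked on a small scale. The sketch below uses made-up columns (not the competition files) to compare 64-bit against 32-bit integers, and plain strings against the `category` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical column of 30,000 small integers, e.g. a down number.
downs = pd.Series(np.tile([1, 2, 3, 4], 7500))

# Bytes used by the values alone (index excluded).
as64 = downs.astype('int64').memory_usage(index=False)
as32 = downs.astype('int32').memory_usage(index=False)
print(as64, as32)

# A repeated string column stored as Python objects vs. as a category.
teams = pd.Series(['NE', 'KC', 'NYG'] * 10000)
asObject = teams.astype('object').memory_usage(index=False, deep=True)
asCategory = teams.astype('category').memory_usage(index=False, deep=True)
print(asObject, asCategory)
```

Halving the integer width halves the storage, and a categorical column keeps each distinct string once plus a small integer code per row, which is why it pays off for columns such as team abbreviations.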

## Physical Characteristics
For starters, we can examine how a player's physical characteristics affect the yards gained on a play.

### Weight
Let's begin with something simple: how do the weight and height of a rusher affect how many yards he gains? Running backs are typically on the shorter side, but then we have someone like Derrick Henry who just plows through people. One practical wrinkle is that the plotting package refuses, by default, to embed more than 5000 rows in a chart, so we can't simply hand it every single play (there are ~30k of them).

Since we are primarily concerned with the weights of players, one option is to put each player into a weight class and plot the average yards per rush for each class. This way, we are not unnecessarily throwing out plays either. The cells below take a different route around the limit: they write the data to a JSON file and point the chart at its URL, which is not subject to the row cap.
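
The weight-class idea can be sketched as follows. This is a minimal illustration on made-up rows, and the 20-lb class width is an arbitrary choice rather than anything taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged weight/yards frame.
rushes = pd.DataFrame({
    'PlayerWeight': [195, 205, 212, 228, 231, 247, 250, 310],
    'Yards':        [4,   7,   2,   3,   5,   1,   6,   0],
})

# 20-lb wide weight classes; the bin edges are an assumption.
bins = list(range(180, 341, 20))
rushes['WeightClass'] = pd.cut(rushes['PlayerWeight'], bins=bins)

# One row per observed weight class: average yards and number of rushes.
byClass = (
    rushes.groupby('WeightClass', observed=True)['Yards']
          .agg(AvgYards='mean', Rushes='count')
          .reset_index()
)
# Interval objects do not serialize to JSON, so keep a string label instead.
byClass['WeightClass'] = byClass['WeightClass'].astype(str)
print(byClass)
```

The result has one row per class, so it stays far below the row limit no matter how many plays go into the averages.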

In [57]:
# The local file is written purely for demonstration...
# The chart itself reads the file from GitHub; the later sections follow the same pattern...
WR_FN = 'weightsAndRushes.json'
WR_JSON = os.path.join(dfLocalPath, WR_FN)
if not os.path.exists(WR_JSON):
    # Grab yards and the rusher...
    rushes = pbpData[['NflIdRusher', 'Yards']]
    # Left join it with the weight information...
    weightsAndRushes = pd.merge(rushes, playerChara[['NflId', 'PlayerWeight', 'Position']], how='left', 
             left_on='NflIdRusher', right_on='NflId')
    # We only need the weight and yards...
    weightsAndRushes.drop(columns=['NflIdRusher', 'NflId'], inplace=True)
    weightsAndRushes.to_json(WR_JSON, orient='records')

In [58]:
WR_URL = f'{dfBaseUrl}/{WR_FN}'  # join URL parts with '/', not the OS path separator
chart = altair.Chart(WR_URL, title='Weight vs. Yards Gained by Position').mark_point(clip=True).encode(
    altair.X('PlayerWeight:Q',
             axis=altair.Axis(title='Weight'),
             scale=altair.Scale(zero=False)),
    y='Yards:Q',
    color='Position:N'
)
chart.save('./Altair JSONs/Plots/weightsAndRushes.json')
chart

As expected, we can see clear demarcations between the positions. The pinks are all running backs, while the lighter rushers are wide receivers. Past 250 lbs is a mixture of tight ends and offensive linemen on those one-off plays. Next, we'll look at height...

### Height
The code is the same as above, with `PlayerHeight` in place of `PlayerWeight`.

In [60]:
HR_FN = 'heightsAndRushes.json'
HR_JSON = os.path.join(dfLocalPath, HR_FN)
if not os.path.exists(HR_JSON):
    # Grab yards and the rusher...
    rushes = pbpData[['NflIdRusher', 'Yards']]
    # Left join it with the height information...
    heightsAndRushes = pd.merge(rushes, playerChara[['NflId', 'PlayerHeight', 'Position']], how='left', 
             left_on='NflIdRusher', right_on='NflId')
    # We only need the height, position, and yards...
    heightsAndRushes.drop(columns=['NflIdRusher', 'NflId'], inplace=True)
    heightsAndRushes.to_json(HR_JSON, orient='records')

HR_URL = f'{dfBaseUrl}/{HR_FN}'  # join URL parts with '/', not the OS path separator
chart = altair.Chart(HR_URL, title='Height vs. Yards Gained by Position').mark_point(clip=True).encode(
    altair.X('PlayerHeight:Q',
             axis=altair.Axis(title='Height'),
             scale=altair.Scale(zero=False)),
    y='Yards:Q',
    color='Position:N'
)
chart.save('./Altair JSONs/Plots/heightsAndRushes.json')
chart

The vertical lines appear because heights are always quoted to the nearest inch. There's a healthy bell curve of heights among running backs, though there appears to be a hard drop-off after 6 feet 3 inches, beyond which the rushers are tight ends and wide receivers. Just like in the weights graph, the dots get sparse as the yards increase. Let's examine the distribution of the yards gained a bit more.

## Yards
We can't just blindly graph yards, because the yards gained on a play are directly related to where you are on the field. For example, it's impossible to gain 75 yards when you are at midfield. At the very least, we should expect the yards to drop as you approach the opponent's end zone. Additionally, another variable that affects the yards gained is the down and distance the team is facing. Who cares about calling a 50-yard run when you need to gain 2 yards just to keep the ball? Thus, we'll create two graphs here...

### Location on the Field
To plot the location on the field, we can use the yard line of where the ball is. The problem is...

In [38]:
pbpData.YardLine.describe()

count    30611.000000
mean        28.058312
std         12.915614
min          1.000000
25%         19.000000
50%         28.000000
75%         39.000000
max         49.000000
Name: YardLine, dtype: float64

Notice the range is from 1 to 49, which means we also need to know which side of the field the team is on. Otherwise, there is no way to tell whether the team is on its own 3-yard line or the opponent's, for example. **In our data, the yard line is mirrored when the possession team is in its opponent's territory! This is also how it's reported in play-by-play. NE being on the KC 2-yard line means they are 2 yards away from scoring. However, being on the NE 2-yard line means they have 98 yards to go. We have to take this into account!**

In [39]:
# If the possession team is different than the
# field position, then we are on the opponent's side of the field...
if 'YardLine100' not in pbpData.columns:
    pbpData.insert(4, 'YardLine100', pbpData.YardLine.to_list())
    locs = pbpData.PossessionTeam != pbpData.FieldPosition
    pbpData.loc[locs, 'YardLine100'] = 100 - pbpData.loc[locs, 'YardLine100']
pbpData.head()

Unnamed: 0,GameId,PlayId,Season,YardLine,YardLine100,Quarter,PossessionTeam,Down,Distance,FieldPosition,...,TE,WR,OL,QB,DoO,DL,LB,DB,OoD,State
0,2017090700,20170907000118,2017,35,35,1,NE,3,2,NE,...,1,3,5,1,0,2,3,6,0,MA
1,2017090700,20170907000139,2017,43,43,1,NE,1,10,NE,...,1,3,5,1,0,2,3,6,0,MA
2,2017090700,20170907000189,2017,35,65,1,NE,1,10,KC,...,1,3,5,1,0,2,3,6,0,MA
3,2017090700,20170907000345,2017,2,98,1,NE,2,2,KC,...,2,0,6,1,0,4,4,3,0,MA
4,2017090700,20170907000395,2017,25,25,1,KC,1,10,KC,...,3,1,5,1,0,3,2,6,0,MA


Notice the 3rd and 4th rows, where the Patriots are on the Chiefs' side of the field. Now we can plot this column against the yards gained. We should see the points taper off as the yard line increases, i.e., as the offense gets closer to the end zone.

In [61]:
YL100_FN = 'YardLine100.json'
YL100_JSON = os.path.join(dfLocalPath, YL100_FN)
if not os.path.exists(YL100_JSON):
    pbpData[['YardLine100', 'Yards']].to_json(YL100_JSON, orient='records')

YL100_URL = f'{dfBaseUrl}/{YL100_FN}'  # join URL parts with '/', not the OS path separator
chart = altair.Chart(YL100_URL, title='Yards Gained vs. Field Position').mark_point(clip=True).encode(
    x=altair.X('YardLine100:Q',
              axis=altair.Axis(title='Yard Line (100)')),
    y='Yards:Q'
)
chart.save('./Altair JSONs/Plots/YardLine100.json')
chart

A couple of things about this plot. The hypotenuse of this triangular plot shows all the times a runner rushed for a touchdown from that position. Additionally, you may notice there are more points at the 25-yard line, and at the 40-yard line to some extent. The reason is that whenever there is a touchback on a kickoff, the team starts at its own 25-yard line. Also, a kickoff that goes out of bounds is a penalty on the kicking team, and the ball is spotted at the 40-yard line. The main takeaway is that **the field position of a rushing play isn't enough on its own, and other things must be taken into account.**
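
The hypotenuse has a simple form: a rush can gain at most the distance left to the opposing end zone, which is `100 - YardLine100`. The sketch below marks the plays that sit exactly on that line, using made-up rows rather than the real data.

```python
import pandas as pd

# Hypothetical plays; YardLine100 is measured from the offense's own goal line.
plays = pd.DataFrame({
    'YardLine100': [25, 60, 98, 40],
    'Yards':       [75, 12,  2, -3],
})

# The most a rush can gain is the distance left to the opposing end zone.
plays['YardsToGoal'] = 100 - plays['YardLine100']
# Points on the hypotenuse are exactly the plays that gained all of it.
plays['IsTouchdown'] = plays['Yards'] == plays['YardsToGoal']
print(plays)
```

A column like `YardsToGoal` (a name introduced here for illustration) also gives a natural upper bound for any predicted yardage.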

### Distance to First Down
The distance to a first down also matters. This mainly applies to short-yardage situations, typically less than 4 yards or so. If you need 2 yards to keep the drive alive, why design a long run when you can instead send in some big guys to push forward for the first down? Hopefully the next plot can show this.

Instead of showing raw counts of rushing plays per down and distance, we'll show **percentage**. The reason is that you get a 1st and 10 whenever a new set of downs is established, so raw counts would be dominated by plays with 10 yards to go. There will also be a drop-off at distances greater than 10 yards, because these can only occur through a loss on the previous play or through a penalty.

First, we'll look at **all the plays** and see how many yards they went for, then we'll look at the percentage of rushing plays as the distance increases.

In [63]:
DISTANCE_FN = 'distance.json'
DISTANCE_JSON = os.path.join(dfLocalPath, DISTANCE_FN)
if not os.path.exists(DISTANCE_JSON):
    pbpData[['Distance', 'Yards', 'Down']].to_json(DISTANCE_JSON, orient='records')

DISTANCE_URL = f'{dfBaseUrl}/{DISTANCE_FN}'  # join URL parts with '/', not the OS path separator
chart = altair.Chart(DISTANCE_URL, title='Distance to 1st Down vs. Yards Gained').mark_point(clip=True).encode(
    x='Distance:Q',
    y='Yards:Q',
    color='Down:N'
)
chart.save('./Altair JSONs/Plots/distance.json')
chart

Obviously, there's an overabundance of plays with 10 yards to go, because you get 1st and 10 whenever a new set of downs is established. Notice we also have some blue dots at 1st and 15 and 1st and 20. These are due to penalties: a false start, for example, costs five yards, and offensive holding costs ten. The other distances show a good mixture of downs.
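
One way to get the percentage view described at the start of this section is to normalize the counts within each down. The sketch below does this on a small made-up sample; the column name `PctOfDown` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical sample of rushing plays.
plays = pd.DataFrame({
    'Down':     [1, 1, 1, 1, 2, 2, 3, 3],
    'Distance': [10, 10, 10, 15, 3, 7, 1, 1],
})

# Count the plays at every (down, distance) combination.
counts = plays.groupby(['Down', 'Distance']).size().rename('Plays').reset_index()

# Express each count as a percentage of all plays on that down.
totalPerDown = counts.groupby('Down')['Plays'].transform('sum')
counts['PctOfDown'] = 100 * counts['Plays'] / totalPerDown
print(counts)
```

Because each down sums to 100, the 1st-and-10 spike no longer drowns out the other downs when they share an axis.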

## The Success of a Play
As you might have guessed from the previous plots, there are many variables that affect how many yards a play goes for. However, depending on the situation, the number of yards gained could have differing importance. For example, **gaining 3 yards on a 3rd and 2 is much more successful than gaining 3 yards on a 1st and 10**, because the former results in a first down to keep the drive alive. That isn't to say that you *have* to get a first down for a play to be successful. For example, gaining 5 yards on that 1st and 10 could be seen as successful, as it could make 2nd and 3rd down much easier.

Back in 1988, Pete Palmer, Bob Carroll, and John Thorn published a statistics book called *The Hidden Game of Football.* In it, they took apart the prevailing notions of football stats, especially quarterback rating, and developed their own method of breaking down plays. Almost all of their statistics revolve around **yards**. Their notion of a successful play is as follows:

> On first down, a play is considered a success if it gains 45 percent of needed yards; on second down, a play needs to gain 60 percent of needed yards; on third and fourth down, only by gaining a first down (or touchdown) is considered success.

Since then, their methods have been refined many times (see the reasoning behind Defense-adjusted Value Over Average (DVOA) by [Football Outsiders](https://www.footballoutsiders.com/info/methods)), but for our purposes, this success/failure breakdown should work reasonably well to measure how different aspects affect the yards gained, without worrying about the situational environment too much.

We will add this information to our play-by-play data, which requires only a few lines of code. Why didn't we do this in the data cleaning stage, you ask? Because a model might come up with a more complex way of accounting for the situation. In any case, we will write `"Success"` or `"Failure"` directly into a new `IsSuccess` column.

In [47]:
pbpData['IsSuccess'] = 'Failure'
# Thresholds of percentages for successes
thresholds = [0.45, 0.6, 1, 1]
for down, thresh in zip(range(1, 5), thresholds):
    pbpData.loc[(pbpData.Down == down) & (pbpData.Yards >= thresh * pbpData.Distance), 'IsSuccess'] = 'Success'

Even though the success of a play doesn't directly tell us the number of yards a play goes for (what we are predicting), it still gives a sense of the outcome of the plays, which can hopefully be used during training and prediction.
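
To make the thresholds concrete, here is the same rule applied to a handful of made-up plays. It is a sketch of the quoted definition, written with a lookup table instead of the loop above, and the rows are chosen so that each down has one play on either side of its cutoff.

```python
import pandas as pd

# Hypothetical plays: one success and one failure per down.
toy = pd.DataFrame({
    'Down':     [1, 1, 2, 2, 3, 3],
    'Distance': [10, 10, 10, 10, 2, 2],
    'Yards':    [5, 4, 7, 5, 3, 1],
})

# Fraction of the distance that must be gained on downs 1 through 4.
thresholds = {1: 0.45, 2: 0.6, 3: 1.0, 4: 1.0}
needed = toy['Down'].map(thresholds) * toy['Distance']

# Label each play by comparing the gain with the yards needed.
toy['IsSuccess'] = (toy['Yards'] >= needed).map({True: 'Success', False: 'Failure'})
print(toy)
```

On 1st and 10 the cutoff is 4.5 yards, so 5 yards succeeds and 4 does not; on 2nd and 10 it is 6 yards; on 3rd and 2 only a gain of 2 or more counts.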

## Offensive Line
Now, let's look at what is probably one of the more important aspects. How do the personnel groupings affect how many yards a play goes for (and consequently how successful a play is)? At first thought, it makes sense that a heavier offensive line leads to more successful plays, because the linemen are theoretically able to push everyone forward, which greatly affects whether the running back runs for a loss or a gain.

To capture this, we can take the **average weight** of the offensive linemen on each play and compare it across play outcomes. Averaging rather than totaling keeps plays with 6 or 7 linemen comparable to the usual 5. However, there's an important disclaimer for the plots that follow. The offensive linemen are not the only ones who might be blocking. You might have a fullback or a big tight end as well, so keep that in mind.

We're not limited to weight. We can also look at the linemen's speed, the direction they're moving, and their positions relative to the running back at handoff. On that last point, it seems logical that a running back who is "following his blocks" (in other words, running right behind his linemen) gains more yards.

### OL Weight
The most common setup is 5 offensive linemen. But just to check, let's see how many setups there were.

In [48]:
uniques, counts = np.unique(pbpData.OL, return_counts=True)
sortedIndices = np.argsort(counts)[::-1]
uniques[sortedIndices], counts[sortedIndices]

(array([5, 6, 7]), array([28451,  2097,    63], dtype=int64))

Okay, lucky for us, there are only 3 setups, with 5 to 7 offensive linemen. The 5-lineman setup is by far the most common, and the rest are so few that we don't have to do any special analysis with them. We can plot two histograms of the line's average weight, one for successful plays and one for failed plays. My hypothesis is that it won't be too interesting, because even a heavy line doesn't succeed on every rush. I'm expecting a lot of overlap.

In [49]:
# First, we need to gather the offensive linemen that
# took part in each play...
# Join our player characteristics
# with the player positions on the ID...
merged = playerPos.merge(right=playerChara[['NflId', 'PlayerWeight', 'Position']], how='left', on='NflId')
# We don't need positional info...
merged = merged[['PlayId', 'NflId', 'PlayerWeight', 'Position']]
# Grab the offensive line positions:
# Center (C), Guard (G), Offensive Guard (OG), Offensive Tackle (OT), Tackle (T)
offLine = merged[np.isin(merged.Position, ['C', 'G', 'OG', 'OT', 'T'])]
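# Now we have to group them by play id
# and take the average weight...
# Select PlayerWeight explicitly: recent pandas versions raise
# if mean() is applied to the non-numeric columns.
# Reset the index to move the PlayId to a column
playGroups = offLine.groupby(by='PlayId', observed=True)['PlayerWeight'].mean().reset_index()
playGroups.columns = ['PlayId', 'WeightAverage']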
playGroups.head()

Unnamed: 0,PlayId,WeightAverage
0,20170907000118,315.6
1,20170907000139,315.6
2,20170907000189,315.6
3,20170907000345,315.5
4,20170907000395,316.2


In [64]:
# Now merge with our play-by-play data...
OL_weightAvgs = pbpData[['PlayId', 'IsSuccess']].merge(right=playGroups, how='left', on='PlayId')
OL_WEIGHT_FN = 'olweight.json'
OL_WEIGHT_JSON = os.path.join(dfLocalPath, OL_WEIGHT_FN)
if not os.path.exists(OL_WEIGHT_JSON):
    OL_weightAvgs.to_json(OL_WEIGHT_JSON, orient='records')

# Plot them!
OL_WEIGHT_URL = f'{dfBaseUrl}/{OL_WEIGHT_FN}'  # join URL parts with '/', not the OS path separator
chart = altair.Chart(OL_WEIGHT_URL, title='OL Avg. Weight by Play Outcome').mark_area(
    opacity=0.5, 
    interpolate='step'
).encode(
    x=altair.X('WeightAverage:Q', bin=altair.Bin(step=2)),
    y=altair.Y('count()', stack=None, title='# of Plays'),
    color='IsSuccess:N'
)
chart.save('./Altair JSONs/Plots/olweight.json')
chart

Well, would you look at that. The shapes of the two histograms are almost identical; the Success one is just a bit shorter, since there were fewer successful plays than failed ones. The orange histogram was plotted first and the blue was plotted over it, which is why the Success histogram looks brownish. So the average weight of the line, on its own, doesn't appear to separate successful plays from failed ones. But what about the linemen's positions at the time of the handoff?
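
Overlaid counts make it hard to compare bins directly, because the two groups differ in size. A complementary check is the success rate within each weight bin. The sketch below uses made-up per-play averages in place of `OL_weightAvgs`, and the 4-lb bin width is an arbitrary choice.

```python
import pandas as pd

# Hypothetical per-play averages standing in for OL_weightAvgs.
toy = pd.DataFrame({
    'WeightAverage': [308.2, 309.5, 311.0, 311.8, 314.4, 315.6, 315.9, 318.3],
    'IsSuccess': ['Success', 'Failure', 'Failure', 'Success',
                  'Failure', 'Success', 'Failure', 'Failure'],
})

# Assign each play to a 4-lb bin (the width is an assumption).
toy['WeightBin'] = pd.cut(toy['WeightAverage'], bins=[308, 312, 316, 320])

# Fraction of successful plays, and number of plays, in each bin.
successRate = (
    toy.assign(Success=toy['IsSuccess'].eq('Success'))
       .groupby('WeightBin', observed=True)['Success']
       .agg(Rate='mean', Plays='count')
)
print(successRate)
```

If weight mattered, the rate would trend upward across bins. A flat rate on the real data would support the reading of the histograms above, though bins with few plays should be treated with caution.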

### OL Positions
The positions of the offensive linemen might matter much more in determining whether a running back breaks off a good run. Usually, it's a good idea for a running back to follow the blocks in front of him. If he strays from the designed direction, there's a good chance he won't get the yards he was supposed to. Remember that we have X and Y data for each player. However, raw coordinates won't tell the whole story. To quantify how closely a running back follows his blocks, **we take the running back's position as the origin and express the other players' positions relative to it.** We can then plot the **average relative position of the offensive line with respect to the RB for each play.** The goal is to see whether successful plays become more common when the RB is behind his blocks. My initial hypothesis is that the Y direction (up and down) will matter more than the X direction (left and right). This is when looking at the field top-down, with the end zones on the left and right sides.

In [77]:
# playerPos has our positions.
# Remember we want to find the average
# relative position wrt rusher of the offensive line.
# We need to do a couple things.
# First, the pbpData has the ID of the rusher,
# so merge the X and Y coords to that dataframe,
# Next, join the players' listed positions to playerPos, filter the OL,
# group by play, and average X and Y. Then subtract off rusher's coords.

# Merge to pbpData on play and rusher together...
mergedRushCoord = pd.merge(
    left=pbpData,
    right=playerPos[['PlayId', 'X', 'Y', 'NflId']],
    how='left',
    left_on=['PlayId', 'NflIdRusher'],
    right_on=['PlayId', 'NflId']
)
# Merge each player's listed position (from playerChara)
# onto playerPos, which has the X and Y coordinates...
mergedPosition = pd.merge(
    left=playerPos[['PlayId', 'X', 'Y', 'NflId']],
    right=playerChara[['NflId', 'Position']],
    how='left',
    on='NflId'
)
# Only x, y, and play id of o-linemen
olPos = mergedPosition.loc[np.isin(mergedPosition.Position, ['C', 'G', 'OG', 'OT', 'T'])]
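# Group by play, and average only the coordinates
# (recent pandas versions raise on the non-numeric columns)...
olAvgPosPlay = olPos.groupby('PlayId', observed=True)[['X', 'Y']].mean().reset_index()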
# Change column names to not cause confusion
olAvgPosPlay.columns = ['PlayId', 'OL_X', 'OL_Y']
# Join with mergedRushCoord and gather only
# columns we need...
rushOLPos = pd.merge(
    left=mergedRushCoord[['PlayId', 'IsSuccess', 'X', 'Y']],
    right=olAvgPosPlay,
    on='PlayId'
)
# To calculate relative wrt RB,
# we subtract X and Y from OL's
rushOLPos['Rel_X'] = rushOLPos.OL_X - rushOLPos.X
rushOLPos['Rel_Y'] = rushOLPos.OL_Y - rushOLPos.Y
rushOLPos.head()

Unnamed: 0,PlayId,IsSuccess,X,Y,OL_X,OL_Y,Rel_X,Rel_Y
0,20170907000118,Success,78.75,30.53,75.018,29.4,-3.732,-1.13
1,20170907000139,Failure,71.07,27.16,67.06,28.36,-4.01,1.2
2,20170907000189,Success,48.66,19.11,44.394,20.172,-4.266,1.062
3,20170907000345,Success,15.53,25.36,11.28,25.565,-4.25,0.205
4,20170907000395,Success,29.99,27.12,33.878,24.85,3.888,-2.27


Now let's show this information. Because we have two dimensions and two outcomes (success and failure), I'll show a separate plot for each direction, each overlaying the two outcomes.

In [81]:
RELXY_FN = 'relative_xy_ol.json'
RELXY_JSON = os.path.join(dfLocalPath, RELXY_FN)
if not os.path.exists(RELXY_JSON):
    rushOLPos.to_json(RELXY_JSON, orient='records')
chart = altair.Chart(RELXY_JSON).mark_area(
    opacity=0.5,
    interpolate='step'
).encode(
    x=altair.X('Rel_X:Q', bin=altair.Bin(step=0.5)),
    y=altair.Y('count()', stack=None),
    color='IsSuccess:N'
)
chart
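
One caveat when reading `Rel_X`: in the table above its sign flips between plays (negative for the four NE rushes, positive for the KC rush), which likely reflects the direction the offense is facing rather than any difference in blocking. A possible fix is to flip the sign by play direction. The sketch below assumes `PlayDirection` holds the values `'left'` and `'right'` and that it has been carried along into the frame; both are assumptions, and `Rel_X_Downfield` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; the 'left'/'right' values are an assumption about PlayDirection.
toy = pd.DataFrame({
    'PlayDirection': ['left', 'left', 'right'],
    'Rel_X': [-3.732, -4.01, 3.888],
})

# Flip the sign for plays heading left so that a positive value
# always means the line is downfield of (ahead of) the rusher.
sign = toy['PlayDirection'].map({'left': -1, 'right': 1})
toy['Rel_X_Downfield'] = toy['Rel_X'] * sign
print(toy)
```

With the direction folded in, the histogram should show a single cluster instead of two mirrored ones, which makes the Success and Failure shapes easier to compare.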