<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Packages" data-toc-modified-id="Import-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Packages</a></span></li><li><span><a href="#Read-in-Data" data-toc-modified-id="Read-in-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read in Data</a></span></li><li><span><a href="#Physical-Characteristics" data-toc-modified-id="Physical-Characteristics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Physical Characteristics</a></span><ul class="toc-item"><li><span><a href="#Weight" data-toc-modified-id="Weight-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Weight</a></span></li><li><span><a href="#Height" data-toc-modified-id="Height-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Height</a></span></li></ul></li><li><span><a href="#Yards" data-toc-modified-id="Yards-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Yards</a></span><ul class="toc-item"><li><span><a href="#Location-on-the-Field" data-toc-modified-id="Location-on-the-Field-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Location on the Field</a></span></li><li><span><a href="#Distance-to-First-Down" data-toc-modified-id="Distance-to-First-Down-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Distance to First Down</a></span></li></ul></li><li><span><a href="#Offensive-Line-Groupings" data-toc-modified-id="Offensive-Line-Groupings-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Offensive Line Groupings</a></span><ul class="toc-item"><li><span><a href="#Standard-Offensive-Groupings" data-toc-modified-id="Standard-Offensive-Groupings-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Standard Offensive Groupings</a></span></li></ul></li></ul></div>

# 2019 NFL Big Data Bowl - Part 2: Exploratory Data Analysis
This is the second notebook in a series where I partake in the 2019 Big Data Bowl competition on Kaggle. The first notebook dealt with cleaning the datasets to produce 3 different files:

- Each player's physical characteristics (height, weight, etc.)
- Each player's positions and speeds on each play
- Each play's properties (down, distance, yards rushed, etc.)

We'll be examining each of these files in order to get a deeper understanding of the data. In the end, the goal of exploratory data analysis is to show information that could be useful when completing the task at hand. In this case, we are ultimately trying to predict how many yards a given rushing play will go for, based on information at the time of handoff.

## Import Packages
For plotting all of our graphs, I'll be using a package called `altair`, a declarative plotting library built on top of Vega-Lite.

In [4]:
import numpy as np
import pandas as pd
import altair

import os
# To disable the max rows option
# altair.data_transformers.disable_max_rows()

## Read in Data
As stated before, we have 3 files we need to read in. We also set the data types as a dictionary we pass into each `read_csv` call.

In [2]:
# Player characteristics
pcdt = {
    'NflId': 'category',
    'DisplayName': 'str',
    'PlayerHeight': 'float32',
    'PlayerWeight': 'int',
    'PlayerCollegeName': 'str',
    'Position': 'category'
}
playerChara = pd.read_csv('./Data/playerData.csv', parse_dates=[4],
                         dtype=pcdt)
# Player positions
ppdt = {
    'GameId': 'category',
    'PlayId': 'category',
    'Team': 'str',
    'NflId': 'category'
}
playerPos = pd.read_csv('./Data/playerPositions.csv', dtype=ppdt)
# Play by play
# This file has many more columns, so the dtype dictionary is built in loops.
# None of our integer columns needs 64-bit precision,
# so we use 32-bit integers to save memory.
integerCols = ['Season', 'YardLine', 'Quarter', 'Down', 'Distance', 'HomeScoreBeforePlay', 
              'VisitorScoreBeforePlay', 'DefendersInTheBox', 'Yards', 'Week',
              'GameClockMinute', 'GameClockSecond', 'Half', 'RB', 'TE',
              'WR', 'OL', 'QB', 'DoO', 'DL', 'LB', 'DB', 'OoD']
catCols = ['GameId', 'PlayId', 'PossessionTeam', 'FieldPosition', 'NflIdRusher',
          'OffenseFormation', 'PlayDirection', 'HomeTeamAbbr', 'VisitorTeamAbbr',
          'Stadium', 'Location', 'Turf']
pbpdt = {}
for col in integerCols:
    pbpdt[col] = 'int32'
for col in catCols:
    pbpdt[col] = 'category'
pbpData = pd.read_csv('./Data/playByPlayData.csv', parse_dates=[15, 16], dtype=pbpdt)
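
To see why the narrower dtypes are worth the trouble, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not the competition files). It compares the memory footprint of pandas' default dtypes against `int32` and `category`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the play-by-play file; the values are made up.
n = 10_000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Down': rng.integers(1, 5, size=n),                        # small integers
    'PossessionTeam': rng.choice(['NE', 'KC', 'GB'], size=n),  # few distinct strings
})

# Same data with the narrower dtypes used in the dictionaries above.
slim = toy.astype({'Down': 'int32', 'PossessionTeam': 'category'})

# deep=True counts the bytes held by the Python strings as well.
before = toy.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
print(f'default dtypes: {before:,} bytes; narrowed: {after:,} bytes')
```

The exact numbers depend on the pandas version, but the narrowed frame should come out noticeably smaller, mostly because repeated team codes are stored once as categories.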

## Physical Characteristics
For starters, we can examine how a player's physical characteristics affect the yards gained on a play.

### Weight
Let's start with something simple: how do the weight and height of the players affect how many yards they go for? Running backs are typically on the shorter side, but then we have someone like Derrick Henry who just plows through people. The plotting package has a default limit of 5000 rows for data embedded in a chart, so we can't simply pass in every single play (there are ~30k of them).

Since we are primarily concerned about the weights of players, one way around the limit is to put each player into a weight class and plot the average yards per rush for each class. The code in this section instead writes the joined data to a JSON file and passes the file path to the chart, so Altair loads the rows by URL rather than embedding them. This way, we are not throwing out any plays either.
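
For reference, the weight-class idea can be sketched as follows. This is only an illustration on hypothetical rows: the `toy` frame, the 20 lb bin width, and the `WeightClass` column name are all assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for the joined weight/yards data.
toy = pd.DataFrame({
    'PlayerWeight': [195, 205, 212, 224, 231, 247, 255],
    'Yards':        [6,   4,   3,   5,   2,   1,   3],
})

# 20 lb wide classes; the bin edges are an arbitrary choice for illustration.
edges = list(range(180, 281, 20))   # 180, 200, ..., 280
toy['WeightClass'] = pd.cut(toy['PlayerWeight'], bins=edges)

# One row per class: average yards and how many rushes fed that average.
perClass = (toy.groupby('WeightClass', observed=True)['Yards']
               .agg(['mean', 'count'])
               .reset_index())
print(perClass)
```

The aggregated frame has one row per weight class, so it stays far below the row limit, at the cost of hiding the spread of individual rushes.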

In [8]:
# See if the JSON already exists for plotting...
WR_JSON = './Altair JSONs/weightsAndRushes.json'
if not os.path.exists(WR_JSON):
    # Grab yards and the rusher...
    rushes = pbpData[['NflIdRusher', 'Yards']]
    # Left join it with the weight information...
    weightsAndRushes = pd.merge(rushes, playerChara[['NflId', 'PlayerWeight', 'Position']], how='left', 
             left_on='NflIdRusher', right_on='NflId')
    # Drop the join keys; we only need the weight, position, and yards...
    weightsAndRushes.drop(columns=['NflIdRusher', 'NflId'], inplace=True)
    weightsAndRushes.to_json(WR_JSON, orient='records')
    weightsAndRushes.head()

In [9]:
altair.Chart(WR_JSON).mark_point(clip=True).encode(
    altair.X('PlayerWeight:Q',
             axis=altair.Axis(title='Weight'),
             scale=altair.Scale(zero=False)),
    y='Yards:Q',
    color='Position:N'
).properties(
    width=800, 
    height=350
)

As expected, we can see clear demarcations with respect to the positions. The pinks are all running backs, while the lighter players are wide receivers. Players past 250 lbs are a mixture of tight ends and offensive linemen used on those one-off plays. Next, we'll look at height...

### Height
The code is the same as above, with `PlayerHeight` in place of `PlayerWeight`.

In [10]:
HR_JSON = './Altair JSONs/heightsAndRushes.json'
if not os.path.exists(HR_JSON):
    # Grab yards and the rusher...
    rushes = pbpData[['NflIdRusher', 'Yards']]
    # Left join it with the height information...
    heightsAndRushes = pd.merge(rushes, playerChara[['NflId', 'PlayerHeight', 'Position']], how='left', 
             left_on='NflIdRusher', right_on='NflId')
    # Drop the join keys; we only need the height, position, and yards...
    heightsAndRushes.drop(columns=['NflIdRusher', 'NflId'], inplace=True)
    heightsAndRushes.to_json(HR_JSON, orient='records')

altair.Chart(HR_JSON).mark_point(clip=True).encode(
    altair.X('PlayerHeight:Q',
             axis=altair.Axis(title='Height'),
             scale=altair.Scale(zero=False)),
    y='Yards:Q',
    color='Position:N'
).properties(
    width=800, 
    height=350
)

The vertical lines appear because heights are always quoted to the inch. There's a healthy bell curve of heights for running backs, though it appears there's a hard drop-off after 6 foot 3 inches; the taller heights are occupied by tight ends and wide receivers. Just like the weights graph, the dots get sparse as the yards increase. Why don't we examine the distribution of the yards gained a bit more?

## Yards
We can't just blindly graph yards, because the yards gained on a play are directly related to where you are on the field. For example, it's impossible to gain 75 yards when you are at midfield. At the very least, we should expect the yards to drop as you approach the end of the field. Additionally, another variable that affects the yards gained is the down and distance the team faces. Who cares about calling a 50-yard run when you need to gain 2 yards just to keep the ball? Thus, we'll create two graphs here...

### Location on the Field
To plot the location on the field, we can use the yard line of where the ball is. The problem is...

In [17]:
pbpData.YardLine.describe()

count    30611.000000
mean        28.058312
std         12.915614
min          1.000000
25%         19.000000
50%         28.000000
75%         39.000000
max         49.000000
Name: YardLine, dtype: float64

Notice the range is from 1 to 49, which means we also need to know which side of the field the team is on. Otherwise, there is no way to figure out if the team is on its own 3-yard line or the opponent's, for example. **In our data, the yard line has actually been mirrored if the possession team is in their opponent's territory! This is also how it's reported during play-by-play. NE being on the KC 2-yard line means they are 2 yards away from scoring. However, them being on the NE 2-yard line means they have 98 yards to go. We have to take this into account!**
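
To make the mirroring concrete, here is a small sketch on toy rows (the rows themselves are made up). It compares the two team columns as plain strings, which is an assumption on my part: if both columns are read as `category` and their category sets differ, pandas can refuse to compare them directly, and casting to `str` sidesteps that.

```python
import numpy as np
import pandas as pd

# Toy plays; the team codes and yard lines are illustrative.
toy = pd.DataFrame({
    'PossessionTeam': ['NE', 'NE', 'KC'],
    'FieldPosition':  ['NE', 'KC', 'KC'],
    'YardLine':       [2,    2,    25],
})

# Compare as plain strings so the check does not depend on categorical dtypes.
inOpponentHalf = (toy['PossessionTeam'].astype(str)
                  != toy['FieldPosition'].astype(str))

# Own half: keep the yard line. Opponent's half: count up from the own goal line.
toy['YardLine100'] = np.where(inOpponentHalf, 100 - toy['YardLine'], toy['YardLine'])
print(toy)
```

With this convention, NE on its own 2-yard line stays at 2, while NE on the KC 2-yard line becomes 98.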

In [18]:
# If the possession team is different than the
# field position, then we are on the opponent's side of the field...
if 'YardLine100' not in pbpData.columns:
    pbpData.insert(4, 'YardLine100', pbpData.YardLine.to_list())
    locs = pbpData.PossessionTeam != pbpData.FieldPosition
    pbpData.loc[locs, 'YardLine100'] = 100 - pbpData.loc[locs, 'YardLine100']
pbpData.head()

Unnamed: 0,GameId,PlayId,Season,YardLine,YardLine100,Quarter,PossessionTeam,Down,Distance,FieldPosition,...,TE,WR,OL,QB,DoO,DL,LB,DB,OoD,State
0,2017090700,20170907000118,2017,35,35,1,NE,3,2,NE,...,1,3,5,1,0,2,3,6,0,MA
1,2017090700,20170907000139,2017,43,43,1,NE,1,10,NE,...,1,3,5,1,0,2,3,6,0,MA
2,2017090700,20170907000189,2017,35,65,1,NE,1,10,KC,...,1,3,5,1,0,2,3,6,0,MA
3,2017090700,20170907000345,2017,2,98,1,NE,2,2,KC,...,2,0,6,1,0,4,4,3,0,MA
4,2017090700,20170907000395,2017,25,25,1,KC,1,10,KC,...,3,1,5,1,0,3,2,6,0,MA


Notice the 3rd and 4th rows, where the Patriots are on the Chiefs' side of the field. Now we can plot this column with respect to the yards gained. We should see the points taper off as the yard line increases, i.e., as the offense gets closer to the end zone.

In [11]:
YL100_JSON = './Altair JSONs/YardLine100.json'
if not os.path.exists(YL100_JSON):
    pbpData[['YardLine100', 'Yards']].to_json(YL100_JSON, orient='records')

altair.Chart(YL100_JSON).mark_point(clip=True).encode(
    x=altair.X('YardLine100:Q',
              axis=altair.Axis(title='Yard Line (100)')),
    y='Yards:Q'
).properties(
    width=800,
    height=350,
    title='Yards gained vs. Field Position'
)

A couple of things about this plot. The hypotenuse of this triangular plot shows all the times when a runner rushed for a touchdown from that position. Additionally, you may notice there are more points on the 25-yard line, and the 40-yard line to some extent. The reason is that whenever there is a touchback on the kickoff, the team starts at their own 25-yard line. Also, when a kickoff goes out of bounds, that's a penalty on the kicking team, and the ball is spotted at the 40-yard line. The main takeaway here is that **solely using the field position of the rushing play isn't enough, and other things must be taken into account.**
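
The hypotenuse can also be written down directly: the most a run can gain is the distance left to the opponent's goal line. Below is a minimal sketch on toy rows, assuming a touchdown run is recorded as gaining exactly that remaining distance. The `YardsToGoal` and `IsTouchdown` columns are hypothetical names, not part of the data.

```python
import pandas as pd

# Toy plays; YardLine100 is the distance from the offense's own goal line.
toy = pd.DataFrame({
    'YardLine100': [25, 60, 98, 40],
    'Yards':       [75, 12, 2,  60],
})

# The most a run can gain is the distance left to the opponent's goal line.
toy['YardsToGoal'] = 100 - toy['YardLine100']

# Runs that used up all of the remaining field sit on the diagonal.
toy['IsTouchdown'] = toy['Yards'] == toy['YardsToGoal']
print(toy)
```

A column like `YardsToGoal` could also serve as an upper bound when predicting yards later on.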

### Distance to First Down
The distance to a first down also matters. This mainly applies to short-yardage situations, typically less than 4 yards or so. If you need 2 yards to keep the drive alive, why design for a long run when instead you can send in some big guys to push forward for the first down? Hopefully the next plot can show this.

Instead of showing raw counts of rushing plays per down and distance, we'll show **percentages**. The reason is that you get a 1st and 10 whenever a new set of downs is established, so raw counts would be dominated by plays with 10 yards to go. There will also be a drop-off at distances greater than 10 yards, because these can only occur through a loss on the previous play or through a penalty.

First, we'll look at **all the plays** and see how many yards they went for, then we'll look at the percentage of rushing plays as the distance increases.

In [12]:
DISTANCE_JSON = './Altair JSONs/distance.json'
if not os.path.exists(DISTANCE_JSON):
    pbpData[['Distance', 'Yards', 'Down']].to_json(DISTANCE_JSON, orient='records')

altair.Chart(DISTANCE_JSON).mark_point(clip=True).encode(
    x='Distance:Q',
    y='Yards:Q',
    color='Down:N'
).properties(
    width=800,
    height=350,
    title='Distance to First Down vs. Yards Gained'
).configure_legend(
    labelFontSize=16,
    titleFontSize=16
)

Obviously, there's an overabundance of plays with 10 yards to go, because you get 1st and 10 whenever a new set of downs is established. Notice we also have some blue dots at 1st and 15 and 1st and 20. These are due to penalties; a false start, for example, is a five-yard penalty. We can see a good mixture of downs at the other distances.
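
One possible reading of the percentage view is the share of each down's rushes that came at each distance. The sketch below uses `pd.crosstab` on toy rows to show that reading; the data is made up, and normalizing within each down is my assumption about what the percentages should be taken over.

```python
import pandas as pd

# Toy rushing plays: four on 1st down, two on 2nd, two on 3rd.
toy = pd.DataFrame({
    'Down':     [1, 1, 1, 1, 2, 2, 3, 3],
    'Distance': [10, 10, 10, 15, 7, 3, 1, 1],
})

# Rows are downs, columns are distances; normalize='index' makes each row sum to 1.
share = pd.crosstab(toy['Down'], toy['Distance'], normalize='index')
print((share * 100).round(1))
```

Normalizing per down keeps the pile of 1st-and-10 plays from drowning out the other downs.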

## Offensive Line Groupings
Now, let's look at what is probably one of the more important aspects: how do the personnel groupings affect how many yards a play goes for? Knowing how the plays really work, it would make sense that as the number of heavier guys increases, the yards gained get shorter. Why? Because the only reason you **wouldn't** have 5 offensive linemen is if you face a 3rd and 1, or a 4th and goal from the 2-yard line, where you would simply need to power your way through.

To capture this, we can take the **total weight** of the guys on the field and plot it against the yards gained. However, it's important that we differentiate between the standard groupings with 5 offensive linemen and those with extra linemen. We should also note how long the plays were, and whether or not they went for a first down.

### Standard Offensive Groupings
These are ones that have 5 offensive linemen. The most common lineup is 1 running back, 1 tight end, 3 wide receivers, 5 OL, and 1 quarterback.

In [22]:
uniques, counts = np.unique(pbpData.loc[:, ['TE', 'WR', 'OL', 'QB', 'DoO']], axis=0, return_counts=True)
sortedIndices = np.argsort(counts)[::-1]
counts[sortedIndices]

array([13921,  6957,  3213,  1774,  1573,   640,   584,   332,   214,
         202,   173,   155,   134,    97,    78,    78,    73,    65,
          48,    37,    33,    31,    31,    24,    20,    14,    14,
          10,     9,     9,     9,     8,     8,     7,     5,     4,
           3,     3,     3,     2,     2,     2,     2,     2,     2,
           1,     1,     1,     1,     1,     1], dtype=int64)
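
The sorted counts above are detached from the groupings they belong to. Here is a small sketch of pairing them back up, using the same `np.unique` call on a toy frame (the rows and the `Plays` column name are made up for illustration).

```python
import numpy as np
import pandas as pd

cols = ['TE', 'WR', 'OL', 'QB']
toy = pd.DataFrame(
    [[1, 3, 5, 1], [1, 3, 5, 1], [2, 2, 5, 1], [1, 3, 5, 1], [2, 0, 6, 1]],
    columns=cols,
)

# Unique rows and how often each one occurs.
uniques, counts = np.unique(toy[cols].to_numpy(), axis=0, return_counts=True)

# Indices that sort the counts from most to least frequent.
order = np.argsort(counts)[::-1]

# Put each grouping next to its count.
groupings = pd.DataFrame(uniques[order], columns=cols)
groupings['Plays'] = counts[order]
print(groupings)
```

Indexing `uniques` with the same sorted indices keeps every count next to its grouping.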

Okay, so we actually have numerous personnel groupings. The problem is that most of these have occurred fewer than 500 times, which isn't a large sample. In fact, 15 of them have occurred 3 times or fewer! Thus, the best way to show what we want is to take the total weight of the players on the field **for each play**, and then sort those totals into bins. Once they are binned, we can plot histograms of the yards gained for the plays in each bin.
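
A rough sketch of the total-weight step, under stated assumptions: the three toy frames below stand in for the player positions, player characteristics, and play-by-play files, and they already contain only offensive players (the real data would need a filter for that). The 100 lb bin width and the `TotalWeight` and `WeightBin` names are arbitrary choices.

```python
import pandas as pd

# Hypothetical offensive players on two plays (3 per play to keep it short).
positions = pd.DataFrame({
    'PlayId': ['p1', 'p1', 'p1', 'p2', 'p2', 'p2'],
    'NflId':  ['a',  'b',  'c',  'a',  'd',  'e'],
})
players = pd.DataFrame({
    'NflId':        ['a', 'b', 'c', 'd', 'e'],
    'PlayerWeight': [310, 220, 195, 325, 260],
})
plays = pd.DataFrame({'PlayId': ['p1', 'p2'], 'Yards': [8, 1]})

# Attach each player's weight, then add the weights up within every play.
totalWeight = (positions.merge(players, on='NflId', how='left')
                        .groupby('PlayId', as_index=False)['PlayerWeight'].sum()
                        .rename(columns={'PlayerWeight': 'TotalWeight'}))

# Bring in the yards and bin the totals; 100 lb bins are an arbitrary choice.
binned = plays.merge(totalWeight, on='PlayId')
binned['WeightBin'] = pd.cut(binned['TotalWeight'], bins=list(range(700, 1001, 100)))
print(binned)
```

Each play ends up with one total and one bin, which is the shape needed for a histogram of yards per bin.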

To account for where we are on the field, we should plot 