<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Packages" data-toc-modified-id="Import-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Packages</a></span></li><li><span><a href="#Read-in-Data" data-toc-modified-id="Read-in-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read in Data</a></span></li><li><span><a href="#Physical-Characteristics" data-toc-modified-id="Physical-Characteristics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Physical Characteristics</a></span><ul class="toc-item"><li><span><a href="#Weight" data-toc-modified-id="Weight-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Weight</a></span></li><li><span><a href="#Height" data-toc-modified-id="Height-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Height</a></span></li></ul></li><li><span><a href="#Yards" data-toc-modified-id="Yards-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Yards</a></span></li></ul></div>

# 2019 NFL Big Data Bowl - Part 2: Exploratory Data Analysis
This is the second notebook in a series where I partake in the 2019 Big Data Bowl competition on Kaggle. The first notebook dealt with cleaning the datasets to produce 3 different files:

- Each player's physical characteristics (height, weight, etc.)
- Each player's positions and speeds on each play
- Each play's properties (down, distance, yards rushed, etc.)

We'll be examining each of these files in order to get a deeper understanding of the data. In the end, the goal of exploratory data analysis is to show information that could be useful when completing the task at hand. In this case, we are ultimately trying to predict how many yards a given rushing play will go for, based on information at the time of handoff.

## Import Packages
For plotting all of our graphs, I'll be using `altair`, a declarative visualization package built on top of Vega-Lite.

In [1]:
import numpy as np
import pandas as pd
import altair
# Lift Altair's default 5000-row limit so larger DataFrames can be plotted
altair.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Read in Data
As stated before, we have 3 files we need to read in. We also set the data types as a dictionary we pass into each `read_csv` call.

In [2]:
# Player characteristics
pcdt = {
    'NflId': 'category',
    'DisplayName': 'str',
    'PlayerHeight': 'float32',
    'PlayerWeight': 'int',
    'PlayerCollegeName': 'str',
    'Position': 'category'
}
playerChara = pd.read_csv('./Data/playerData.csv', parse_dates=[4],
                         dtype=pcdt)
# Player positions
ppdt = {
    'GameId': 'category',
    'PlayId': 'category',
    'Team': 'str',
    'NflId': 'category'
}
playerPos = pd.read_csv('./Data/playerPositions.csv', dtype=ppdt)
# Play by play
# This file has many more columns, so we build the dtype dictionary in loops.
# None of our integer columns needs 64-bit precision,
# so we use int32 to save memory where we can.
integerCols = ['Season', 'YardLine', 'Quarter', 'Down', 'Distance', 'HomeScoreBeforePlay', 
              'VisitorScoreBeforePlay', 'DefendersInTheBox', 'Yards', 'Week',
              'GameClockMinute', 'GameClockSecond', 'Half', 'RB', 'TE',
              'WR', 'OL', 'QB', 'DoO', 'DL', 'LB', 'DB', 'OoD']
catCols = ['GameId', 'PlayId', 'PossessionTeam', 'FieldPosition', 'NflIdRusher',
          'OffenseFormation', 'PlayDirection', 'HomeTeamAbbr', 'VisitorTeamAbbr',
          'Stadium', 'Location', 'Turf']
pbpdt = {}
for col in integerCols:
    pbpdt[col] = 'int32'
for col in catCols:
    pbpdt[col] = 'category'
pbpData = pd.read_csv('./Data/playByPlayData.csv', parse_dates=[15, 16], dtype=pbpdt)
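
To see why the narrower data types are worth specifying, here is a minimal sketch that compares memory use with and without a `dtype` dictionary. The column names are borrowed from the play-by-play file, but the values are synthetic and `toy_dtypes` is a hypothetical name used only for this illustration.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the play-by-play file; all values are invented.
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    'Down': rng.integers(1, 5, size=n),
    'Yards': rng.integers(-5, 30, size=n),
    'PossessionTeam': rng.choice(['NE', 'KC', 'GB', 'SF'], size=n),
})
csv_text = toy.to_csv(index=False)

# Default parsing: integers are read as int64 and team names as plain strings.
default_df = pd.read_csv(io.StringIO(csv_text))

# Same data, read with explicit narrower dtypes.
toy_dtypes = {'Down': 'int32', 'Yards': 'int32', 'PossessionTeam': 'category'}
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)

# deep=True counts the memory held by the string values as well.
default_bytes = default_df.memory_usage(deep=True).sum()
compact_bytes = compact_df.memory_usage(deep=True).sum()
print(f'default: {default_bytes:,} bytes, compact: {compact_bytes:,} bytes')
```

The exact savings depend on the pandas version and on how many distinct values each categorical column has, so the printed numbers should be read as indicative only.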

## Physical Characteristics
For starters, we can examine how a player's physical characteristics affect the yards gained on a play.

### Weight
Let's look at something simple first: how does a rusher's weight relate to how many yards they gain? Running backs are typically on the shorter side, but then we have someone like Derrick Henry who just plows through people. By default, the plotting package refuses to plot more than 5000 rows, and there are ~30k plays, which is why that limit was disabled in the import cell.

Since we are primarily concerned with the weights of players, another option is to put each player into a weight class and plot the average yards per rush for each class. That stays under the row limit without throwing out any plays. The cells below plot every rush directly.
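
The weight-class idea could be sketched roughly as follows. This is only an illustration on invented data: the 20 lb class width, the `WeightClass` column, and the `toy_rushes` and `by_class` names are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy rushes; weights and yards are invented for illustration.
toy_rushes = pd.DataFrame({
    'PlayerWeight': [195, 205, 205, 210, 216, 228, 247, 255],
    'Yards': [12, 8, 3, 2, 7, 4, 1, 3],
})

# Hypothetical 20 lb weight classes: [180, 200), [200, 220), ..., [260, 280).
bin_edges = list(range(180, 281, 20))
toy_rushes['WeightClass'] = pd.cut(
    toy_rushes['PlayerWeight'], bins=bin_edges, right=False
)

# One row per occupied weight class: mean yards per rush and number of rushes.
by_class = (
    toy_rushes
    .groupby('WeightClass', observed=True)['Yards']
    .agg(AvgYards='mean', Rushes='count')
    .reset_index()
)
print(by_class)
```

Keeping the `Rushes` count alongside the average makes it easy to spot classes whose average rests on only a handful of plays. Before charting, the interval-valued `WeightClass` column would likely need converting to strings, for example with `astype(str)`.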

In [3]:
# Grab yards and the rusher...
rushes = pbpData[['NflIdRusher', 'Yards']]
# Left join it with the weight information...
weightsAndRushes = pd.merge(rushes, playerChara[['NflId', 'PlayerWeight', 'Position']], how='left', 
         left_on='NflIdRusher', right_on='NflId')
# Drop the join keys; we only need the yards, weight, and position...
weightsAndRushes.drop(columns=['NflIdRusher', 'NflId'], inplace=True)
weightsAndRushes.head()

|   | Yards | PlayerWeight | Position |
|---|-------|--------------|----------|
| 0 | 8 | 205 | RB |
| 1 | 3 | 205 | RB |
| 2 | 5 | 205 | RB |
| 3 | 2 | 210 | RB |
| 4 | 7 | 216 | RB |


In [4]:
altair.Chart(weightsAndRushes).mark_point(clip=True).encode(
    altair.X('PlayerWeight',
             axis=altair.Axis(title='Weight'),
             scale=altair.Scale(zero=False)),
    y='Yards',
    color='Position'
).properties(
    width=800, 
    height=350
)

As expected, the positions separate clearly by weight. The pink points are all running backs, while the lighter players are wide receivers. Past 250 lbs, the rushers are a mixture of tight ends and offensive linemen on one-off plays. Next, we'll look at height.

### Height
The code is the same as in the weight section, with `PlayerHeight` in place of `PlayerWeight`.

In [8]:
# Grab yards and the rusher...
rushes = pbpData[['NflIdRusher', 'Yards']]
# Left join it with the height information...
heightsAndRushes = pd.merge(rushes, playerChara[['NflId', 'PlayerHeight', 'Position']], how='left', 
         left_on='NflIdRusher', right_on='NflId')
# Drop the join keys; we only need the yards, height, and position...
heightsAndRushes.drop(columns=['NflIdRusher', 'NflId'], inplace=True)

altair.Chart(heightsAndRushes).mark_point(clip=True).encode(
    altair.X('PlayerHeight',
             axis=altair.Axis(title='Height'),
             scale=altair.Scale(zero=False)),
    y='Yards',
    color='Position'
).properties(
    width=800, 
    height=350
)

The points fall into vertical lines because heights are always quoted to the nearest inch. Running-back heights follow a healthy bell curve, though there appears to be a hard drop-off after 6 foot 3 inches; the taller heights are occupied by tight ends and wide receivers. Just like in the weight graph, the dots get sparse as the yards increase. Let's examine the distribution of the yards gained a bit more.

## Yards
We can't just blindly graph yards, because the yards gained on a play are directly tied to where you are on the field. For example, it's impossible to gain 75 yards when you are at midfield. At the very least, we should expect the yards to drop as you approach the ends of the field. Another variable that affects the yards gained is the down and distance the team is facing. Who cares about calling a 50-yard run when you need to gain 2 yards just to keep the ball? Thus, we'll create two graphs here...
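
The field-position cap on yards gained can be made concrete with a small helper. This is a sketch under stated assumptions: it assumes `YardLine` runs from 1 to 50 and that `FieldPosition` names the team whose half the ball is in, which may not match the cleaned files exactly. The function name `max_possible_gain` is hypothetical.

```python
def max_possible_gain(yard_line: int, field_position: str, possession_team: str) -> int:
    """Yards between the line of scrimmage and the opponent's goal line.

    Assumes yard_line is in 1..50 and field_position names the team whose
    half of the field the ball is in (an assumption about the encoding).
    """
    if yard_line == 50 or field_position == possession_team:
        # Own half (or midfield): the far goal line is 100 - yard_line away.
        return 100 - yard_line
    # Opponent's half: the goal line is yard_line yards away.
    return yard_line


# Team 'NE' has the ball in each case; the opponent 'KC' is a placeholder.
print(max_possible_gain(25, 'NE', 'NE'))  # own 25 -> 75
print(max_possible_gain(50, 'NE', 'NE'))  # midfield -> 50
print(max_possible_gain(25, 'KC', 'NE'))  # opponent's 25 -> 25
```

Comparing `Yards` against this cap, rather than looking at raw `Yards` alone, is one way to keep plays near the goal line from looking artificially short.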