
# 2019 NFL Big Data Bowl - Part 2: Exploratory Data Analysis
This is the second notebook in a series on my entry in the 2019 Big Data Bowl competition on Kaggle. The first notebook cleaned the raw datasets and produced three files:

- Each player's physical characteristics (height, weight, etc.)
- Each player's positions and speeds on each play
- Each play's properties (down, distance, yards rushed, etc.)

We'll examine each of these files to get a deeper understanding of the data. The goal of exploratory data analysis is to surface information that could be useful for the task at hand. Here, that task is predicting how many yards a given rushing play will gain, based on the information available at the time of handoff.

## Import Packages
For plotting all of our graphs, I'll be using `altair`, a declarative visualization package for Python built on Vega-Lite.

In [1]:
import numpy as np
import pandas as pd
import altair

## Read in Data
As stated above, there are three files to read in. For each one, we define the column data types in a dictionary and pass it to `read_csv` through the `dtype` argument.

In [37]:
# Player characteristics
pcdt = {
    'NflId': 'category',
    'DisplayName': 'str',
    'PlayerHeight': 'float32',
    'PlayerWeight': 'int',
    'PlayerCollegeName': 'str',
    'Position': 'category'
}
playerChara = pd.read_csv('./Data/playerData.csv', parse_dates=[4],
                         dtype=pcdt)
# Player positions
ppdt = {
    'GameId': 'category',
    'PlayId': 'category',
    'Team': 'str',
    'NflId': 'category'
}
playerPos = pd.read_csv('./Data/playerPositions.csv', dtype=ppdt)
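# Play by play
# This file has many more columns, so the dtype dictionary is built from two lists.
# None of the integer columns needs 64-bit precision, so we read them as int32
# to use less memory. Repetitive string columns are read as category.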
integerCols = ['Season', 'YardLine', 'Quarter', 'Down', 'Distance', 'HomeScoreBeforePlay', 
              'VisitorScoreBeforePlay', 'DefendersInTheBox', 'Yards', 'Week',
              'GameClockMinute', 'GameClockSecond', 'Half', 'RB', 'TE',
              'WR', 'OL', 'QB', 'DoO', 'DL', 'LB', 'DB', 'OoD']
catCols = ['GameId', 'PlayId', 'PossessionTeam', 'FieldPosition', 'NflIdRusher',
          'OffenseFormation', 'PlayDirection', 'HomeTeamAbbr', 'VisitorTeamAbbr',
          'Stadium', 'Location', 'Turf']
pbpdt = {}
for col in integerCols:
    pbpdt[col] = 'int32'
for col in catCols:
    pbpdt[col] = 'category'
pbpData = pd.read_csv('./Data/playByPlayData.csv', parse_dates=[15, 16], dtype=pbpdt)
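
To make the memory argument concrete, here is a minimal sketch that reads the same small CSV twice, once with pandas' default types and once with explicit `int32` and `category` types, and compares the footprint. The data is a made-up miniature of the play-by-play file (values and row count are assumptions for illustration), not the real `playByPlayData.csv`.

```python
import io

import pandas as pd

# Hypothetical miniature of the play-by-play file: made-up values and only
# a handful of the real column names.
csv_text = "Down,Distance,Yards,PossessionTeam\n" + "".join(
    f"{i % 4 + 1},{i % 10 + 1},{i % 7},{'NE' if i % 2 else 'KC'}\n"
    for i in range(1000)
)

# Default parsing: integers become int64 and team codes become strings.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes, in the spirit of the pbpdt dictionary above.
small_dtypes = {
    "Down": "int32",
    "Distance": "int32",
    "Yards": "int32",
    "PossessionTeam": "category",
}
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=small_dtypes)

# deep=True also counts the memory held by the string values themselves.
default_bytes = default_df.memory_usage(deep=True).sum()
typed_bytes = typed_df.memory_usage(deep=True).sum()
print(f"default dtypes:  {default_bytes:,} bytes")
print(f"explicit dtypes: {typed_bytes:,} bytes")
```

The saving from `category` depends on how often values repeat: a column with only a few distinct values (such as team abbreviations) shrinks a lot, while a column where nearly every value is unique may not shrink at all.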

In [38]:
pbpData.dtypes

GameId                               category
PlayId                               category
Season                                  int32
YardLine                                int32
Quarter                                 int32
PossessionTeam                       category
Down                                    int32
Distance                                int32
FieldPosition                        category
HomeScoreBeforePlay                     int32
VisitorScoreBeforePlay                  int32
NflIdRusher                          category
OffenseFormation                     category
DefendersInTheBox                       int32
PlayDirection                        category
TimeHandoff               datetime64[ns, UTC]
TimeSnap                  datetime64[ns, UTC]
Yards                                   int32
HomeTeamAbbr                         category
VisitorTeamAbbr                      category
Week                                    int32
Stadium                           

In [39]:
pbpData.head()

       GameId          PlayId  Season  YardLine  Quarter  PossessionTeam  Down  Distance  FieldPosition  HomeScoreBeforePlay  ...  RB  TE  WR  OL  QB  DoO  DL  LB  DB  OoD
0  2017090700  20170907000118    2017        35        1              NE     3         2             NE                    0  ...   1   1   3   5   1    0   2   3   6    0
1  2017090700  20170907000139    2017        43        1              NE     1        10             NE                    0  ...   1   1   3   5   1    0   2   3   6    0
2  2017090700  20170907000189    2017        35        1              NE     1        10             KC                    0  ...   1   1   3   5   1    0   2   3   6    0
3  2017090700  20170907000345    2017         2        1              NE     2         2             KC                    0  ...   2   2   0   6   1    0   4   4   3    0
4  2017090700  20170907000395    2017        25        1              KC     1        10             KC                    7  ...   1   3   1   5   1    0   3   2   6    0


In [43]:
np.unique(pbpData.Stadium, return_counts=True)

(array(['AT&T Stadium', 'Arrowhead Stadium', 'Azteca Stadium',
        'Bank of America Stadium', 'Broncos Stadium At Mile High',
        'Broncos Stadium at Mile High', 'CenturyField', 'CenturyLink',
        'CenturyLink Field', 'Dignity Health Sports Park',
        'Empower Field at Mile High', 'Estadio Azteca', 'EverBank Field',
        'Everbank Field', 'FedExField', 'FedexField',
        'First Energy Stadium', 'FirstEnergy', 'FirstEnergy Stadium',
        'FirstEnergyStadium', 'Ford Field', 'Gillette Stadium',
        'Hard Rock Stadium', 'Heinz Field', 'Lambeau Field',
        'Lambeau field', 'Levis Stadium', 'Lincoln Financial Field',
        'Los Angeles Memorial Coliesum', 'Los Angeles Memorial Coliseum',
        'Lucas Oil Stadium', 'M & T Bank Stadium', 'M&T Bank Stadium',
        'M&T Stadium', 'Mercedes-Benz Dome', 'Mercedes-Benz Stadium',
        'Mercedes-Benz Superdome', 'MetLife', 'MetLife Stadium',
        'Metlife Stadium', 'NRG', 'NRG Stadium', 'New Era Field',
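
The output above shows the same stadium under several spellings (for example `MetLife`, `MetLife Stadium` and `Metlife Stadium`). One possible way to consolidate them, sketched here on a small sample rather than on `pbpData` itself, is a hand-written mapping from each variant to a canonical name. The `stadium_fixes` dictionary and the choice of canonical spellings are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# A few of the spellings visible in the output above.
stadium = pd.Series(
    [
        "MetLife", "MetLife Stadium", "Metlife Stadium",
        "NRG", "NRG Stadium",
        "Lambeau Field", "Lambeau field",
        "M & T Bank Stadium", "M&T Bank Stadium",
    ],
    dtype="category",
)

# Hypothetical mapping: variant spelling -> canonical name.
stadium_fixes = {
    "MetLife": "MetLife Stadium",
    "Metlife Stadium": "MetLife Stadium",
    "NRG": "NRG Stadium",
    "Lambeau field": "Lambeau Field",
    "M & T Bank Stadium": "M&T Bank Stadium",
}

# Work on plain strings, apply the mapping, then convert back to category
# so that the old variant labels do not linger as unused categories.
clean_stadium = stadium.astype(str).replace(stadium_fixes).astype("category")
print(clean_stadium.value_counts())
```

An explicit dictionary is tedious to write, but every merge is visible and easy to review, which matters when two similar names might actually be different venues.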
  

In [44]:
np.unique(pbpData.Location, return_counts=True)

(array(['Arlington, TX', 'Arlington, Texas', 'Atlanta, GA',
        'Baltimore, MD', 'Baltimore, Maryland', 'Baltimore, Md.',
        'Carson, CA', 'Charlotte North Carolina', 'Charlotte, NC',
        'Charlotte, North Carolina', 'Chicago', 'Chicago, IL',
        'Chicago. IL', 'Cincinnati, OH', 'Cincinnati, Ohio', 'Cleveland',
        'Cleveland Ohio', 'Cleveland, OH', 'Cleveland, Ohio',
        'Cleveland,Ohio', 'Denver CO', 'Denver, CO', 'Detroit',
        'Detroit, MI', 'E. Rutherford, NJ', 'East Rutherford, N.J.',
        'East Rutherford, NJ', 'Foxborough, MA', 'Foxborough, Ma',
        'Glendale, AZ', 'Green Bay, WI', 'Houston, TX', 'Houston, Texas',
        'Indianapolis, Ind.', 'Jacksonville Florida', 'Jacksonville, FL',
        'Jacksonville, Fl', 'Jacksonville, Florida', 'Kansas City,  MO',
        'Kansas City, MO', 'Landover, MD', 'London', 'London, England',
        'Los Angeles, CA', 'Los Angeles, Calif.', 'Mexico City',
        'Mexico City, Mexico', 'Miami Gardens, FL'
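
The `Location` values differ mostly in punctuation, spacing and how the state is written (`Cleveland,Ohio`, `Cleveland Ohio`, `Cleveland, OH`). A rule-based cleanup can merge many of these without listing every variant. The sketch below is one possible approach under simplifying assumptions: the helper `normalize_location` and the `STATE_ABBR` lookup are hypothetical, the lookup covers only the states in the sample, and it treats the last word as the state, so multi-word states such as `North Carolina` or forms like `N.J.` would need extra handling.

```python
import re

import pandas as pd

# A few of the spellings visible in the output above.
location = pd.Series([
    "Kansas City,  MO", "Kansas City, MO",
    "Cleveland,Ohio", "Cleveland, Ohio", "Cleveland Ohio", "Cleveland, OH",
    "Baltimore, Md.", "Baltimore, Maryland", "Baltimore, MD",
])

# Hypothetical lookup, limited to the states that appear in this sample.
STATE_ABBR = {"ohio": "OH", "oh": "OH", "maryland": "MD", "md": "MD", "mo": "MO"}


def normalize_location(raw):
    """Return 'City, ST' when the last word is a known state, else the cleaned text."""
    # Replace periods and commas with spaces, then split on any whitespace.
    tokens = re.sub(r"[.,]", " ", raw).split()
    state = STATE_ABBR.get(tokens[-1].lower())
    if state is None:
        # Unrecognised endings (e.g. a bare "Chicago") are left for manual review.
        return " ".join(tokens)
    return " ".join(tokens[:-1]) + ", " + state


clean_location = location.map(normalize_location)
print(clean_location.value_counts())
```

Values the rules cannot resolve are returned with only the punctuation removed and whitespace collapsed, so they can still be reviewed and mapped by hand afterwards.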