<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Packages" data-toc-modified-id="Import-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Packages</a></span></li><li><span><a href="#Read-in-Data" data-toc-modified-id="Read-in-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read in Data</a></span></li></ul></div>

# 2019 NFL Big Data Bowl - Part 2: Exploratory Data Analysis
This is the second notebook in a series where I take part in the 2019 Big Data Bowl competition on Kaggle. The first notebook cleaned the datasets and produced three files:

- Each player's physical characteristics (height, weight, etc.)
- Each player's positions and speeds on each play
- Each play's properties (down, distance, yards rushed, etc.)

We'll examine each of these files to get a deeper understanding of the data. The goal of exploratory data analysis is to surface information that could be useful for the task at hand. In this case, we are ultimately trying to predict how many yards a given rushing play will gain, based on the information available at the time of handoff.

## Import Packages
For plotting all of our graphs, I'll be using `altair`, a declarative visualization package for Python built on top of Vega-Lite.

In [1]:
import numpy as np
import pandas as pd
import altair

## Read in Data
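As stated before, we have three files to read in. For the player characteristics and player positions files, we also define the column data types in a dictionary and pass it to `read_csv` through the `dtype` argument. The play-by-play file is read with the data types that pandas infers.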

In [26]:
# Player characteristics
pcdt = {
    'NflId': 'category',
    'DisplayName': 'str',
    'PlayerHeight': 'float32',
    'PlayerWeight': 'int',
    'PlayerCollegeName': 'str',
    'Position': 'category'
}
# parse_dates=[4] parses the column at position 4 (0-indexed) as dates
playerChara = pd.read_csv('./Data/playerData.csv', parse_dates=[4],
                         dtype=pcdt)
# Player positions
ppdt = {
    'GameId': 'category',
    'PlayId': 'category',
    'Team': 'str',
    'NflId': 'category'
}
playerPos = pd.read_csv('./Data/playerPositions.csv', dtype=ppdt)
# Play by play
pbpData = pd.read_csv('./Data/playByPlayData.csv')
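
To make the effect of the `dtype` dictionary easier to see, here is a minimal sketch that runs on a small in-memory CSV instead of the real files. The sample rows, the names `sample_csv`, `sample_dtypes`, and `sample`, and the reduced set of columns are all made up for illustration; only the pattern of passing a dictionary to `read_csv` comes from the cell above.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for playerData.csv (values are invented).
sample_csv = io.StringIO(
    "NflId,DisplayName,PlayerWeight,Position\n"
    "1001,Player A,212,RB\n"
    "1002,Player B,305,OG\n"
    "1003,Player C,198,RB\n"
)

# Same idea as pcdt above: map column names to the types we want.
sample_dtypes = {
    "NflId": "category",
    "DisplayName": "str",
    "PlayerWeight": "int",
    "Position": "category",
}

sample = pd.read_csv(sample_csv, dtype=sample_dtypes)

# Columns named in the dictionary get the requested types
# instead of whatever pandas would infer on its own.
print(sample.dtypes)

# A categorical column stores each distinct label once.
print(sorted(sample["Position"].cat.categories))
```

Identifier columns such as `NflId` look numeric but are never used in arithmetic, which is presumably why they are read as categories here rather than as integers.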

In [29]:
pbpData.dtypes

GameId                     int64
PlayId                     int64
Season                     int64
YardLine                   int64
Quarter                    int64
PossessionTeam            object
Down                       int64
Distance                   int64
FieldPosition             object
HomeScoreBeforePlay        int64
VisitorScoreBeforePlay     int64
NflIdRusher                int64
OffenseFormation          object
DefendersInTheBox          int64
PlayDirection             object
TimeHandoff               object
TimeSnap                  object
Yards                      int64
HomeTeamAbbr              object
VisitorTeamAbbr           object
Week                       int64
Stadium                   object
Location                  object
Turf                      object
GameClockMinute            int64
GameClockSecond            int64
Half                       int64
RB                         int64
TE                         int64
WR                         int64
OL        

In [30]:
pbpData.head()

Unnamed: 0,GameId,PlayId,Season,YardLine,Quarter,PossessionTeam,Down,Distance,FieldPosition,HomeScoreBeforePlay,...,RB,TE,WR,OL,QB,DoO,DL,LB,DB,OoD
0,2017090700,20170907000118,2017,35,1,NE,3,2,NE,0,...,1,1,3,5,1,0,2,3,6,0
1,2017090700,20170907000139,2017,43,1,NE,1,10,NE,0,...,1,1,3,5,1,0,2,3,6,0
2,2017090700,20170907000189,2017,35,1,NE,1,10,KC,0,...,1,1,3,5,1,0,2,3,6,0
3,2017090700,20170907000345,2017,2,1,NE,2,2,KC,0,...,2,2,0,6,1,0,4,4,3,0
4,2017090700,20170907000395,2017,25,1,KC,1,10,KC,7,...,1,3,1,5,1,0,3,2,6,0


In [18]:
playerChara.Position.cat.categories

Index(['C', 'CB', 'DB', 'DE', 'DL', 'DT', 'FB', 'FS', 'G', 'HB', 'ILB', 'LB',
       'MLB', 'NT', 'OG', 'OLB', 'OT', 'QB', 'RB', 'S', 'SAF', 'SS', 'T', 'TE',
       'WR'],
      dtype='object')
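
Several of the labels above look like they describe the same role under different names (for example `S` and `SAF`, or `G` and `OG`). As a rough sketch, the detailed labels could be collapsed into broader groups like the ones that appear as columns in the play-by-play data (`RB`, `TE`, `WR`, `OL`, `QB`, `DL`, `LB`, `DB`). The `position_groups` mapping below is my own assumption about how the labels line up, not something taken from the dataset, and the `positions` series is a made-up example.

```python
import pandas as pd

# Assumed mapping from detailed position labels to broad groups (illustrative only).
position_groups = {
    "C": "OL", "G": "OL", "OG": "OL", "OT": "OL", "T": "OL",
    "DE": "DL", "DL": "DL", "DT": "DL", "NT": "DL",
    "ILB": "LB", "LB": "LB", "MLB": "LB", "OLB": "LB",
    "CB": "DB", "DB": "DB", "FS": "DB", "S": "DB", "SAF": "DB", "SS": "DB",
    "FB": "RB", "HB": "RB", "RB": "RB",
    "QB": "QB", "TE": "TE", "WR": "WR",
}

# Hypothetical sample of position labels, standing in for playerChara.Position.
positions = pd.Series(["SAF", "S", "OG", "G", "HB"])

# Map each label to its group, then store the result as a categorical column.
groups = positions.map(position_groups).astype("category")
print(groups.tolist())
```

With the real data, the same `map` call could be applied to `playerChara.Position`; any label missing from the mapping would come back as `NaN`, which makes gaps in the mapping easy to spot.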