<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Packages" data-toc-modified-id="Import-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Packages</a></span></li><li><span><a href="#Import-Data" data-toc-modified-id="Import-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Data</a></span></li><li><span><a href="#Missing-Data-and-Data-Types" data-toc-modified-id="Missing-Data-and-Data-Types-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Missing Data and Data Types</a></span><ul class="toc-item"><li><span><a href="#Checking-on-NaNs" data-toc-modified-id="Checking-on-NaNs-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Checking on NaNs</a></span></li><li><span><a href="#Fixing-DataTypes" data-toc-modified-id="Fixing-DataTypes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Fixing DataTypes</a></span></li></ul></li></ul></div>

# 2019 NFL Big Data Bowl - Part 1: Data Cleaning
This set of notebooks will go through a past Kaggle competition known as the NFL Big Data Bowl. The data bowl this time focuses on **running plays**, and specifically, how far a running play will go. We've all watched the games on television, and during plays, once the running back gets the ball, we sometimes think "Oh that play is going nowhere" or "That's a touchdown". How are we making these decisions? Presumably, a standard armchair spectator is looking at the blocking that is taking place in the offensive line. These sorts of cues are what we hope to capture when we analyze the running plays. 

The notebooks will be arranged into parts that will take us through the various analyses that I perform. With any project, the data cleaning will always come first, to make sure the data can be effectively analyzed. This includes getting rid of duplicates, dropping missing data, splitting columns, and fixing data types. After this comes exploratory data analysis (EDA).

## Import Packages
This cell contains all the packages that we plan to use in this notebook. The `altair` package is a neat plotting package, similar to `matplotlib` and `plotly`.

In [1]:
import numpy as np
import pandas as pd
import altair as alt

## Import Data
Our training data is provided in a file called `train.csv`, which contains every column we will use.

In [2]:
allData = pd.read_csv('./Data/train.csv')
allData.head()


Unnamed: 0,GameId,PlayId,Team,X,Y,S,A,Dis,Orientation,Dir,...,Week,Stadium,Location,StadiumType,Turf,GameWeather,Temperature,Humidity,WindSpeed,WindDirection
0,2017090700,20170907000118,away,73.91,34.84,1.69,1.13,0.4,81.99,177.18,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
1,2017090700,20170907000118,away,74.67,32.64,0.42,1.35,0.01,27.61,198.7,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
2,2017090700,20170907000118,away,74.0,33.2,1.22,0.59,0.31,3.01,202.73,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
3,2017090700,20170907000118,away,71.46,27.7,0.42,0.54,0.02,359.77,105.64,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
4,2017090700,20170907000118,away,69.32,35.42,1.82,2.43,0.16,12.63,164.31,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW


In [16]:
print('Data Shape:', allData.shape)

Data Shape: (682154, 49)


There are a lot of variables and observations here, and the information in some columns may overlap with others. Because there are so many columns, [here](https://www.kaggle.com/c/nfl-big-data-bowl-2020/data) is an external link to a description of each one, along with the reference values for the positional information.

## Missing Data and Data Types
The first step in cleaning is to examine how much missing data we have. Depending on how important the column is, we can either drop any rows where that value is missing, or just simply drop the column if we are not going to use it. Because this came from a Kaggle competition, it's likely most of the values will already be present, but it doesn't hurt to double-check.

As for data types, `pandas` by default stores integer and float values in 64-bit types, i.e. `int64` and `float64`. Depending on the use case, that much range or precision may not be needed. Furthermore, it's common to treat some variables as categorical, such as the team or ID.
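As a rough illustration of the memory point, the sketch below builds a small toy frame and compares its footprint before and after downcasting. The column names are borrowed from the dataset, but the values are made up for the example and do not come from `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for two numeric columns (illustrative values only)
toy = pd.DataFrame({
    'Quarter': np.array([1, 2, 3, 4] * 250, dtype='int64'),
    'S': np.linspace(0.0, 9.0, 1000),  # float64 by default
})

before = toy.memory_usage(index=False).sum()

# Confirm the integer values fit in 32 bits before downcasting
fits_int32 = toy['Quarter'].max() <= np.iinfo(np.int32).max
slim = toy.astype({'Quarter': 'int32', 'S': 'float32'})

after = slim.memory_usage(index=False).sum()
print('fits in int32:', fits_int32)
print('bytes before:', before, 'bytes after:', after)
```

Going from 64-bit to 32-bit types halves the storage for these columns. Note that `float32` keeps only about 7 significant digits, which should be checked against the precision each column actually needs.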

### Checking on NaNs
To get a broad overview, we can count up the number of NaNs we have in each column, to see which ones are the biggest offenders.

In [34]:
nasums = allData.isna().sum()
for i, (name, value) in enumerate(nasums.items()):
    print('{:22}\t{}'.format(name, value), end='')
    if i % 4 == 3:
        print()
    else:
        print('\t', end='')

GameId                	0	PlayId                	0	Team                  	0	X                     	0
Y                     	0	S                     	0	A                     	0	Dis                   	0
Orientation           	23	Dir                   	28	NflId                 	0	DisplayName           	0
JerseyNumber          	0	Season                	0	YardLine              	0	Quarter               	0
GameClock             	0	PossessionTeam        	0	Down                  	0	Distance              	0
FieldPosition         	8602	HomeScoreBeforePlay   	0	VisitorScoreBeforePlay	0	NflIdRusher           	0
OffenseFormation      	88	OffensePersonnel      	0	DefendersInTheBox     	22	DefensePersonnel      	0
PlayDirection         	0	TimeHandoff           	0	TimeSnap              	0	Yards                 	0
PlayerHeight          	0	PlayerWeight          	0	PlayerBirthDate       	0	PlayerCollegeName     	0
Position              	0	HomeTeamAbbr          	0	VisitorTeamAbbr       	0	Week              

Looking at the stats, the data is mostly clean on the key points. The biggest offenders are columns pertaining to the weather during the game. We can always use more reliable external sources to obtain what the weather was that day, so these are not important to the data itself.
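With this many columns, it can be easier to list only the columns that actually contain NaNs. Here is a minimal sketch on a toy frame (the values are made up; on the real data, the same two lines would be applied to `allData`):

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only)
toy = pd.DataFrame({
    'Dir': [177.18, np.nan, 202.73, np.nan],
    'FieldPosition': ['NE', None, 'KC', 'NE'],
    'Yards': [8, 3, 5, 2],
})

na_counts = toy.isna().sum()
# Keep only columns with at least one missing value, largest count first
offenders = na_counts[na_counts > 0].sort_values(ascending=False)
print(offenders)
```

Columns without any missing values, like `Yards` here, drop out of the listing entirely.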

The **stadium type** tells us what the stadium environment is like. For example, is it outdoors or indoors, is it a dome, does it have a retractable roof? The only case where this would affect our prediction is if the stadium were outdoors and it was raining heavily enough to make players slip around (the 2011 "Monsoon Bowl" between the Jaguars and Panthers is a good example). However, this is a relative rarity that may happen once or twice a season, so training on these specific conditions will not be productive. Additionally, we still have the stadium and turf type, so even if we were to include the weather, the model would ideally learn that in some stadiums the weather has no effect.

The next variable is **field position**. This tells us which side of the field the possession team is on. I would say this is very important, especially since it determines how far you are from the endzone. Additionally, the **yard line** is listed as a number between 0 and 50, so we absolutely need the side to fully determine where the ball is. Therefore, we will remove all rows where we don't have the field position.
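To show why the side of the field is needed, here is a small sketch that combines the two columns into a single distance to the endzone. It assumes that `FieldPosition` holds the abbreviation of the team whose half the ball is in and that `YardLine` counts from the nearest goal line. The `YardsToEndzone` column is a hypothetical name introduced for this example, and the plays are made up.

```python
import numpy as np
import pandas as pd

# Toy plays (illustrative values only)
plays = pd.DataFrame({
    'PossessionTeam': ['NE', 'NE', 'KC'],
    'FieldPosition':  ['NE', 'KC', 'NE'],
    'YardLine':       [35, 35, 10],
})

# Assumption: when the ball is on the offense's own side, the offense still
# has to cross midfield, so the distance to the endzone is 100 - YardLine.
own_side = plays['FieldPosition'] == plays['PossessionTeam']
plays['YardsToEndzone'] = np.where(own_side,
                                   100 - plays['YardLine'],
                                   plays['YardLine'])
print(plays)
```

The first two toy plays share the same `YardLine` but end up 30 yards apart, which is exactly the ambiguity the field position resolves.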

The other columns with small numbers of missing values will also have their rows removed, as those are data points which are quite important with respect to the running game. For example, it's much easier to block with 3 tight ends instead of the base 1. Personnel packages also let defenses infer whether a run or a pass is coming.

In conclusion, we will flat out **drop** the weather and stadium type columns, and **remove the rows** where any remaining value, such as the field position, offensive formation, defenders in the box, orientation, or direction, is absent.

In [18]:
colsToDrop = ['StadiumType', 'GameWeather', 'Temperature', 'Humidity',
             'WindSpeed', 'WindDirection']
# Delete the columns in colsToDrop
# and the rows where there are NaNs
cutDownData = allData.drop(columns=colsToDrop).dropna()
print('Shape after dropping:', cutDownData.shape)

Shape after dropping: (673432, 43)


Looks like we lost about 8,700 records, which isn't too bad, considering we still have over 670 thousand records left.
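The cell above calls `dropna()` with no arguments, which removes a row if any remaining column is missing. If only certain columns should trigger a removal, `dropna` accepts a `subset` argument. Here is a small sketch on toy data (the values are made up, and the choice of required columns is only an example):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both a key column and an unused column
toy = pd.DataFrame({
    'FieldPosition': ['NE', None, 'KC', 'NE'],
    'Orientation': [81.99, 27.61, np.nan, 3.01],
    'GameWeather': [None, 'Clear', None, None],
})

# Drop the unused column first, then remove rows missing a key value
required = ['FieldPosition', 'Orientation']
cleaned = toy.drop(columns=['GameWeather']).dropna(subset=required)
print('Shape after dropping:', cleaned.shape)
```

Dropping the unused column before filtering rows matters: otherwise its NaNs would cause an argument-free `dropna()` to discard rows that are fine in every column we care about.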

### Fixing DataTypes
Another thing we might need to fix is data type mismatches. For example, some of these columns are stored as plain strings, such as the **Team** column. However, this column can only take one of two values, "home" or "away", so it's better to encode it as a categorical variable. The same is true for a number of other columns. Additionally, the default numeric types are 64-bit, and none of our values need that much range or precision, so we can also save a bit on memory by using 32-bit types.
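To make the saving from a categorical column concrete, here is a small sketch on made-up data that mimics a two-valued string column like **Team**:

```python
import pandas as pd

# A string column with only two distinct values, repeated many times
team = pd.Series(['home', 'away'] * 5000)
team_cat = team.astype('category')

# deep=True counts the actual string storage, not just the pointers
as_strings = team.memory_usage(index=False, deep=True)
as_category = team_cat.memory_usage(index=False, deep=True)
print('strings:', as_strings, 'bytes; category:', as_category, 'bytes')
```

A categorical column stores each distinct value once and keeps a small integer code per row, so the saving grows with the number of repeated values.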

In [48]:
integerCols = ['Season', 'YardLine', 'Quarter', 'Down', 'Distance', 
               'HomeScoreBeforePlay', 'VisitorScoreBeforePlay', 'DefendersInTheBox',
               'Yards', 'PlayerWeight', 'Week']
floatCols = ['X', 'Y', 'S', 'A', 'Dis', 'Orientation', 'Dir']
categoricalCols = ['GameId', 'PlayId', 'Team', 'NflId', 'JerseyNumber',
                  'PossessionTeam', 'FieldPosition', 'NflIdRusher',
                  'OffenseFormation', 'Position', 'HomeTeamAbbr',
                  'VisitorTeamAbbr', 'Stadium']
cutDownData[integerCols] = cutDownData[integerCols].astype('int32')
cutDownData[floatCols] = cutDownData[floatCols].astype('float32')
cutDownData[categoricalCols] = cutDownData[categoricalCols].astype('category')
cutDownData.dtypes

GameId                    category
PlayId                    category
Team                      category
X                          float32
Y                          float32
S                          float32
A                          float32
Dis                        float32
Orientation                float32
Dir                        float32
NflId                     category
DisplayName                 object
JerseyNumber              category
Season                       int32
YardLine                     int32
Quarter                      int32
GameClock                   object
PossessionTeam            category
Down                         int32
Distance                     int32
FieldPosition             category
HomeScoreBeforePlay          int32
VisitorScoreBeforePlay       int32
NflIdRusher               category
OffenseFormation          category
OffensePersonnel            object
DefendersInTheBox            int32
DefensePersonnel            object
PlayDirection       

We've taken care of all the data types that we could assign directly to the data. For the other columns, there needs to be some data munging and extraction to get them into a form that we can work with.