# Lab 1: Exploring NFL Play-By-Play Data

## Data Loading and Preprocessing

To begin, we load the data from a CSV file into a pandas DataFrame.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/data.csv') # read in the csv file

Let's take a cursory glance at the data to see what we're working with.

In [None]:
df.head()

Many of the columns are redundant or irrelevant for our purposes. For example, 'PassAttempt' is a binary attribute, but the 'PlayType' attribute already carries the same information: it is set to 'Pass' for a passing play.

We define a list of the columns we're not interested in and then delete them.

In [None]:
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'TimeUnder', 
                     'PosTeamScore', 'PassAttempt', 'RushAttempt', 
                     'DefTeamScore', 'Season', 'PlayAttempted']

#Iterate through and delete the columns we don't want
for col in columns_to_delete:
    if col in df:
        del df[col]
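
As an aside, the same guarded deletion can probably be written as a single call to `DataFrame.drop`. The sketch below uses a small made-up frame (`toy`, `unwanted`, and `trimmed` are illustrative names, not part of the lab data) and assumes we are happy to get a new frame back rather than modifying the original in place.

```python
import pandas as pd

# Toy frame standing in for the play-by-play data (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Season": [2015, 2015],
    "PlayType": ["Pass", "Run"],
})

# 'Date' is not a column of the toy frame; errors="ignore" skips it
# instead of raising a KeyError, which mirrors the `if col in df` check.
unwanted = ["Unnamed: 0", "Date", "Season"]
trimmed = toy.drop(columns=unwanted, errors="ignore")

print(list(trimmed.columns))
```

Unlike `del df[col]`, `drop` leaves `toy` untouched and returns a new frame, so the result has to be assigned back if the original name should point at the trimmed data.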

We can then list the remaining column names.

In [None]:
df.columns

As a temporary, simple fix, we replace missing values (NaN) with -1 so that we can cast columns to integers (instead of leaving them as objects).

In [None]:
df = df.replace(to_replace=np.nan,value=-1)
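
To see why the replacement is needed before casting, here is a minimal sketch on a made-up Series (`s`, `cast_failed`, and `filled` are illustrative names). NumPy integer types have no representation for NaN, so the cast is expected to fail until the missing values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # made-up values with one missing entry

# Casting a column that still contains NaN to a NumPy integer type fails.
try:
    s.astype(np.int64)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# After replacing NaN with the sentinel -1, the cast goes through.
filled = s.replace(to_replace=np.nan, value=-1).astype(np.int64)

print(cast_failed, filled.tolist())
```

One caveat with this sentinel: -1 is also a plausible real value for columns such as 'Yards.Gained' or 'ScoreDiff', so after the replacement a filled-in -1 can no longer be told apart from a genuine one. That is presumably why the replacement is described as temporary.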

At this point, many columns are encoded as objects or with excessively large data types.

In [None]:
df.info()

We define four groups of column names based on the types of features we're using.
Binary features are listed separately from the other categorical features so that they can be stored in less space.

In [None]:
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
                       'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
                       'ScoreDiff', 'AbsScoreDiff']

ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
categorical_features = df.columns.difference(continuous_features).difference(ordinal_features)
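
The chained `Index.difference` calls can be illustrated on a toy set of columns. Note that, as written above, the result still contains the binary columns, because only the continuous and ordinal names are subtracted. The final step below, which removes them as well, is a hypothetical extra and not something the lab itself does.

```python
import pandas as pd

# Empty toy frame; only the column names matter here.
toy = pd.DataFrame(columns=["TimeSecs", "down", "GoalToGo", "PlayType"])
continuous = ["TimeSecs"]
ordinal = ["down"]
binary = ["GoalToGo"]

# Same pattern as above: everything that is neither continuous nor ordinal.
categorical = toy.columns.difference(continuous).difference(ordinal)

# Hypothetical extra step: also exclude the binary columns.
non_binary_categorical = categorical.difference(binary)

print(list(categorical), list(non_binary_categorical))
```

The result of `difference` is a pandas `Index` rather than a plain list, and by default it comes back sorted.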

We then cast all of the columns to the appropriate underlying data types.

In [None]:
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)
df[binary_features] = df[binary_features].astype(np.int8)
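
The space saving from storing binary features as `int8` can be checked directly. This is a small sketch on a made-up column of 1,000 flags (`flags`, `wide`, and `narrow` are illustrative names), measuring only the data buffer and not the index.

```python
import numpy as np
import pandas as pd

# 1,000 made-up 0/1 flags stored with 64-bit integers.
flags = pd.Series(np.tile([0, 1], 500).astype(np.int64))

wide = flags.memory_usage(index=False)                     # 8 bytes per value
narrow = flags.astype(np.int8).memory_usage(index=False)   # 1 byte per value

print(wide, narrow)
```

Each value drops from eight bytes to one, so a binary column should take roughly an eighth of the space after the cast.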

Now all of the objects are encoded the way we'd like them to be.

In [None]:
df.info()

Now we can start to take a look at what's in each of our columns.

In [None]:
df.describe()
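
Keep in mind that, on a frame with mixed dtypes, `describe()` summarizes only the numeric columns by default. The sketch below uses a made-up two-column frame (`toy`, `numeric_summary`, and `full_summary` are illustrative names) to show the difference when `include="all"` is passed.

```python
import pandas as pd

# Made-up frame with one numeric and one text column.
toy = pd.DataFrame({
    "ydstogo": [10, 3, 7],
    "PlayType": ["Pass", "Run", "Pass"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # adds count/unique/top/freq for text

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

For the text column, the numeric statistics (mean, std, and so on) are filled with NaN in the combined summary, and vice versa for the numeric column.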