# **Data Cleaning**

#### Data cleaning is a key part of data science. 

In [3]:
import pandas as pd
import numpy as np

data = pd.read_csv('datasets/NFL Play by Play 2009-2016 (v3).csv', nrows=100000, dtype='unicode')

display(data.head())
print(data.shape)

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600,0,TEN,...,,0.4856749860969135,0.5143250139030865,0.5464328397122977,0.4535671602877023,0.4856749860969135,0.0607578536153842,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593,7,PIT,...,1.1460758055888012,0.5464328397122977,0.4535671602877023,0.5510878171614882,0.4489121828385117,0.5464328397122977,0.0046549774491905,-0.0322439869661942,0.0368989644153847,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556,37,PIT,...,,0.5510878171614882,0.4489121828385117,0.5107930906986009,0.4892069093013991,0.5510878171614882,-0.0402947264628873,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515,41,PIT,...,-5.031424582246548,0.5107930906986009,0.4892069093013991,0.4612171140114365,0.5387828859885635,0.5107930906986009,-0.0495759766871644,0.1066625552993268,-0.1562385319864913,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507,8,PIT,...,,0.4612171140114365,0.5387828859885635,0.5589294012224711,0.4410705987775288,0.4612171140114365,0.0977122872110346,,,2009


(100000, 102)


## **How many missing data points do we have?**

In [4]:

display(data.describe(include='all'))

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
count,100000,100000,100000,100000,84850,99938,100000,99938,99876,99864,...,38408,93659.0,93659.0,93241,93241,93647.0,98540,38402.0,38348,100000
unique,93,573,34,5,4,901,16,3963,119,34,...,32753,90549.0,90550.0,90652,90708,90425.0,91105,38368.0,32735,3
top,2011-01-02,2009100401,1,2,1,15:00,2,900,0,DET,...,0,0.4856749860969135,0.5143250139030865,1,0,0.4856749860969135,0,-0.0222869972404158,0,2010
freq,2772,218,4782,27597,33756,2628,10065,1106,12293,3355,...,5585,359.0,361.0,583,536,677.0,7056,6.0,5608,44751


In [7]:
# get the number of missing data points per column
missing_values_count = data.isnull().sum()
print(missing_values_count[0:20])


Date                 0
GameID               0
Drive                0
qtr                  0
down             15150
time                62
TimeUnder            0
TimeSecs            62
PlayTimeDiff       124
SideofField        136
yrdln              215
yrdline100         215
ydstogo              0
ydsnet               0
GoalToGo           215
FirstDown         7031
posteam           6364
DefensiveTeam     6364
desc                 1
PlayAttempted        0
dtype: int64


In [8]:
# how many total missing values do we have?
total_cells = np.prod(data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

24.928696078431372


#### Almost a **quarter** of the cells in this dataset are empty!
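
#### A per-column view is often more informative than a single overall percentage. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the NFL data) to show that the mean of `isnull()` gives the share of missing cells in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "down": [1.0, np.nan, 3.0, np.nan],
    "time": ["15:00", "14:53", None, "13:35"],
    "qtr": [1, 1, 1, 1],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
percent_missing_per_column = toy.isnull().mean() * 100
print(percent_missing_per_column.sort_values(ascending=False))
```

#### Sorting the result puts the worst-affected columns first, which helps when deciding which ones to investigate.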

## **Figure out why the data is missing**

### **Data Intuition**
#### It means looking at your data and trying to figure out why it is the way it is and how that will affect your analysis.
#### For dealing with missing values, you'll need to use your intuition to figure out why the value is missing.
#### One of the most important questions you can ask yourself to help figure this out is this:

 > ### **Is this value missing because it wasn't recorded or because it doesn't exist?**

#### If a value is missing because it **doesn't exist** (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as **NaN**. On the other hand, if a value is missing because it **wasn't recorded**, then you can try to guess what it might have been **based on the other values in that column and row.**

### **Tip:** This is a great place to read over the dataset documentation if you haven't already! If you're working with a dataset that you've gotten from another person, you can also try reaching out to them to get more information.
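
#### The distinction between "doesn't exist" and "wasn't recorded" can sometimes be checked against another column. Here is a minimal sketch using the children example; the `survey` frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: a height is only meaningful if there is a child
survey = pd.DataFrame({
    "num_children": [0, 2, 1, 0],
    "oldest_child_height_cm": [np.nan, 150.0, np.nan, np.nan],
})

is_missing = survey["oldest_child_height_cm"].isnull()

# Missing because the value does not exist: keep these as NaN
does_not_exist = is_missing & (survey["num_children"] == 0)

# Missing because it was not recorded: candidates for imputation
not_recorded = is_missing & (survey["num_children"] > 0)

print(does_not_exist.sum(), "values do not exist;", not_recorded.sum(), "were not recorded")
```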

## **Drop missing values.**

#### If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values.

In [9]:
# remove all the rows that contain a missing value
data.dropna()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season


#### It looks like that's removed all our data! This is because **every row in our dataset had at least one missing value**.
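
#### When `dropna()` removes everything, two of its parameters offer a gentler option: `subset` restricts the check to chosen columns, and `thresh` keeps rows with at least a given number of non-missing values. The sketch below assumes a small made-up frame (`toy`) in which, as in our dataset, every row has a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame in which every row has at least one gap
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, 3.0],
    "c": [np.nan, np.nan, np.nan],
})

# True when every row has at least one missing value, so dropna() leaves nothing
print(toy.isnull().any(axis=1).all())

# Drop rows only when column "a" is missing
rows_with_a = toy.dropna(subset=["a"])

# Keep rows that have at least 2 non-missing values
rows_mostly_complete = toy.dropna(thresh=2)

print(len(toy.dropna()), len(rows_with_a), len(rows_mostly_complete))
```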

### Removing all the **columns** that have at least one missing value instead

In [10]:
# remove all columns with at least one missing value
columns_with_na_dropped = data.dropna(axis=1)
columns_with_na_dropped.head()

Unnamed: 0,Date,GameID,Drive,qtr,TimeUnder,ydstogo,ydsnet,PlayAttempted,Yards.Gained,sp,...,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,ExPoint_Prob,TwoPoint_Prob,Season
0,2009-09-10,2009091000,1,1,15,0,0,1,39,0,...,0,,3,3,3,3,3,0,0,2009
1,2009-09-10,2009091000,1,1,15,10,5,1,5,0,...,0,,3,3,3,3,3,0,0,2009
2,2009-09-10,2009091000,1,1,15,5,2,1,-3,0,...,0,,3,3,3,3,3,0,0,2009
3,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,0,,3,3,3,3,3,0,0,2009
4,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,0,,3,3,3,3,3,0,0,2009


## **Showing the difference**

In [11]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns with na's dropped: 41


## **Filling in missing values automatically**

#### We can use pandas' **```fillna()```** method to fill in missing values in a dataframe for us. One option we have is to specify what we want the **NaN** values to be replaced with. Here, I'm saying that I would like to replace all the **NaN** values with **0**.

In [12]:
# replace all NA's with 0
subset_of_data = data.loc[:, 'EPA':'Season']
display(subset_of_data.fillna(0).head(10))


Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014473905391463,0.0,0.0,0.4856749860969135,0.5143250139030865,0.5464328397122977,0.4535671602877023,0.4856749860969135,0.0607578536153842,0.0,0.0,2009
1,0.0779071580166195,-1.0681686475721817,1.1460758055888012,0.5464328397122977,0.4535671602877023,0.5510878171614882,0.4489121828385117,0.5464328397122977,0.0046549774491905,-0.0322439869661942,0.0368989644153847,2009
2,-1.4027598314188183,0.0,0.0,0.5510878171614882,0.4489121828385117,0.5107930906986009,0.4892069093013991,0.5510878171614882,-0.0402947264628873,0.0,0.0,2009
3,-1.7125834938947782,3.3188410883517703,-5.031424582246548,0.5107930906986009,0.4892069093013991,0.4612171140114365,0.5387828859885635,0.5107930906986009,-0.0495759766871644,0.1066625552993268,-0.1562385319864913,2009
4,2.097795973997438,0.0,0.0,0.4612171140114365,0.5387828859885635,0.5589294012224711,0.4410705987775288,0.4612171140114365,0.0977122872110346,0.0,0.0,2009
5,-0.6963021337954107,0.0,0.0,0.5589294012224711,0.4410705987775288,0.5784532294322996,0.4215467705677004,0.4410705987775288,-0.0195238282098284,0.0,0.0,2009
6,-0.1791493790923768,-0.3430848653093639,0.163935486216987,0.5784532294322996,0.4215467705677004,0.5828805133034717,0.4171194866965283,0.4215467705677004,-0.004427283871172,-0.010456104535247,0.0060288206640749,2009
7,-1.119477208411666,0.0,0.0,0.5828805133034717,0.4171194866965283,0.617543685977305,0.382456314022695,0.4171194866965283,-0.0346631726738333,0.0,0.0,2009
8,-0.0213126223616981,0.0,0.0,0.617543685977305,0.382456314022695,0.5914892015469919,0.4085107984530081,0.382456314022695,0.0260544844303131,0.0,0.0,2009
9,-0.2152925258472589,-0.7568943841299216,0.5416018582826627,0.5914892015469919,0.4085107984530081,0.5854050485525447,0.4145949514474553,0.5914892015469919,-0.0060841529944472,-0.0245258002571672,0.01844164726272,2009


#### Replace each **NA** with the value that comes **directly after** it in the same column. Then replace all the remaining **NA's** with **0**.



In [13]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
subset_of_data.bfill(axis=0).fillna(0).head(10)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014473905391463,-1.0681686475721817,1.1460758055888012,0.4856749860969135,0.5143250139030865,0.5464328397122977,0.4535671602877023,0.4856749860969135,0.0607578536153842,-0.0322439869661942,0.0368989644153847,2009
1,0.0779071580166195,-1.0681686475721817,1.1460758055888012,0.5464328397122977,0.4535671602877023,0.5510878171614882,0.4489121828385117,0.5464328397122977,0.0046549774491905,-0.0322439869661942,0.0368989644153847,2009
2,-1.4027598314188183,3.3188410883517703,-5.031424582246548,0.5510878171614882,0.4489121828385117,0.5107930906986009,0.4892069093013991,0.5510878171614882,-0.0402947264628873,0.1066625552993268,-0.1562385319864913,2009
3,-1.7125834938947782,3.3188410883517703,-5.031424582246548,0.5107930906986009,0.4892069093013991,0.4612171140114365,0.5387828859885635,0.5107930906986009,-0.0495759766871644,0.1066625552993268,-0.1562385319864913,2009
4,2.097795973997438,-0.3430848653093639,0.163935486216987,0.4612171140114365,0.5387828859885635,0.5589294012224711,0.4410705987775288,0.4612171140114365,0.0977122872110346,-0.010456104535247,0.0060288206640749,2009
5,-0.6963021337954107,-0.3430848653093639,0.163935486216987,0.5589294012224711,0.4410705987775288,0.5784532294322996,0.4215467705677004,0.4410705987775288,-0.0195238282098284,-0.010456104535247,0.0060288206640749,2009
6,-0.1791493790923768,-0.3430848653093639,0.163935486216987,0.5784532294322996,0.4215467705677004,0.5828805133034717,0.4171194866965283,0.4215467705677004,-0.004427283871172,-0.010456104535247,0.0060288206640749,2009
7,-1.119477208411666,-0.7568943841299216,0.5416018582826627,0.5828805133034717,0.4171194866965283,0.617543685977305,0.382456314022695,0.4171194866965283,-0.0346631726738333,-0.0245258002571672,0.01844164726272,2009
8,-0.0213126223616981,-0.7568943841299216,0.5416018582826627,0.617543685977305,0.382456314022695,0.5914892015469919,0.4085107984530081,0.382456314022695,0.0260544844303131,-0.0245258002571672,0.01844164726272,2009
9,-0.2152925258472589,-0.7568943841299216,0.5416018582826627,0.5914892015469919,0.4085107984530081,0.5854050485525447,0.4145949514474553,0.5914892015469919,-0.0060841529944472,-0.0245258002571672,0.01844164726272,2009
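
#### In the output above, row 0 takes its `airEPA` from row 1, which is a different play, so whether back-filling is appropriate depends on the data. The sketch below, on a made-up series `s`, contrasts back-fill with forward-fill and shows why a final `fillna(0)` is still needed.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, middle, and end
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

back_filled = s.bfill()      # each gap takes the next valid value
forward_filled = s.ffill()   # each gap takes the previous valid value

# bfill cannot fill a trailing gap and ffill cannot fill a leading one,
# so a second fill is still needed for those cells
print(back_filled.fillna(0).tolist())
print(forward_filled.fillna(0).tolist())
```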


## **Filling missing data with a statistical value**

In [14]:
# Mean of the column with missing values counted as 0 (this pulls the mean toward 0)
display(subset_of_data['airEPA'].astype(np.float32).fillna(0).mean())

print('\n', 'Filling the Missing one With Mean Value', '\n')
# Filling the data with mean value
display(subset_of_data['airEPA'].fillna(subset_of_data['airEPA'].astype(np.float32).fillna(0).mean()).head(10))

0.19538628


 Filling the Missing one With Mean Value 



0               0.195386
1    -1.0681686475721817
2               0.195386
3     3.3188410883517703
4               0.195386
5               0.195386
6    -0.3430848653093639
7               0.195386
8               0.195386
9    -0.7568943841299216
Name: airEPA, dtype: object
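
#### The cell above counts every missing value as 0 when computing the mean, and the filled column has dtype `object` because the data was read as text. A possible alternative, sketched on a made-up series (`air_epa` is illustrative, not the real column), is to convert to numbers first and let `mean()` skip the missing entries.

```python
import pandas as pd

# Hypothetical column read as text, as happens with dtype='unicode'
air_epa = pd.Series(["-1.0", None, "3.0", None, "1.0"])

# Convert the strings to floats; missing entries become NaN
air_epa_numeric = pd.to_numeric(air_epa)

# Series.mean() skips NaN by default, so only observed values are averaged
observed_mean = air_epa_numeric.mean()

# Fill the gaps with the mean of the observed values
filled = air_epa_numeric.fillna(observed_mean)
print(observed_mean)
print(filled.tolist())
```

#### With this approach the filled column stays numeric, so later arithmetic on it behaves as expected.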

## **Using a model**

#### Basically, leverage the data that is not missing to learn what the missing variable might be. For example, **with height and weight data, you could use height to predict weight in the rows where weight is known, and then apply that model to the rows with missing weights to fill in your missing data.** A popular algorithm for this is **K-Nearest Neighbors.**
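
#### As a minimal sketch of the idea, the code below fills missing weights from height. It uses a straight-line fit with `np.polyfit` rather than K-Nearest Neighbors, purely to keep the example short; the `people` frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: weight is missing for the last two people
people = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0, 165.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0, np.nan, np.nan],
})

known = people["weight_kg"].notnull()

# Fit weight = slope * height + intercept on the rows where weight is known
slope, intercept = np.polyfit(
    people.loc[known, "height_cm"].to_numpy(),
    people.loc[known, "weight_kg"].to_numpy(),
    deg=1,
)

# Predict weight only for the rows where it is missing
predicted = slope * people.loc[~known, "height_cm"] + intercept
people.loc[~known, "weight_kg"] = predicted

print(people)
```

#### The same pattern (fit on complete rows, predict on incomplete ones) applies to other models; only the fitting step changes.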

## **Outliers**

## **Dealing with categorical data**

### **Label encoding**

#### Label encoding works by converting the **unique values to a numeric representation**. For example, if we have two categories **male and female**, we can categorize them as numbers:

1. female as 0
2. male as 1

#### Pandas provides an easy way to do this by using the **category** type.

In [15]:
import pandas as pd

# Create series with male and female values
non_categorical_series = pd.Series(['male', 'female', 'male', 'female'])

# Convert the text series to a categorical series
categorical_series = non_categorical_series.astype('category')

# printing the categorical series
print(categorical_series)

# Print the numeric codes for each value
print(categorical_series.cat.codes)

# Print the category names
print(categorical_series.cat.categories)

0      male
1    female
2      male
3    female
dtype: category
Categories (2, object): ['female', 'male']
0    1
1    0
2    1
3    0
dtype: int8
Index(['female', 'male'], dtype='object')


#### You can get the integer values by adding ```.cat.codes``` to the end of your category series.
#### You can get the category names by adding ```.cat.categories``` to the end of your category series.
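
#### By default pandas assigns codes in sorted category order, which is why `female` gets 0 in the output above. If a specific mapping is needed, one option is to pass an explicit category list, as in this sketch (the variable names are illustrative).

```python
import pandas as pd

genders = pd.Series(['male', 'female', 'male', 'female'])

# An explicit category list fixes the mapping: male -> 0, female -> 1
gender_type = pd.CategoricalDtype(categories=['male', 'female'])
encoded = genders.astype(gender_type)

print(encoded.cat.codes.tolist())

# Map the integer codes back to the original labels
decoded = encoded.cat.categories.take(encoded.cat.codes.to_numpy())
print(list(decoded))
```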

### **One-hot encoding**

#### One-hot encoding is similar but creates a new column for each category and **fills it with a 1 for each row with that value and zero otherwise.**

In [16]:
import pandas as pd

# Create series with male and female values
non_categorical_series = pd.Series(['male', 'female', 'male', 'female'])

# Create dummy or one-hot encoded variables
# dtype=int gives 0/1 columns (recent pandas versions return True/False by default)
print(pd.get_dummies(non_categorical_series, dtype=int))

   female  male
0       0     1
1       1     0
2       0     1
3       1     0
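
#### In practice the categorical values usually sit in one column of a larger dataframe. The sketch below, on a made-up frame `df`, encodes only that column via the `columns` parameter and shows `drop_first`, which removes one redundant indicator column.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female'],
    'age': [30, 25, 41, 37],
})

# Encode only the 'gender' column; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(df, columns=['gender'], dtype=int)
print(one_hot.columns.tolist())

# drop_first=True drops one indicator column: with two categories,
# the second column is fully determined by the first
compact = pd.get_dummies(df, columns=['gender'], dtype=int, drop_first=True)
print(compact)
```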
