# Guide for Data Cleaning

This tutorial walks through the whole data analysis process: data cleaning, data visualization, data preprocessing, modeling, and evaluation.

<big>First deal with missing values, outliers, and duplicates.  
Then convert data formats and process text data.</big>

The NFL play-by-play dataset is described as covering all games from 2009 through 2018 (week 15); the file loaded here is the 2009-2016 (v3) release.

In [22]:
import numpy as np
import pandas as pd

# low_memory=False avoids mixed-type inference when parsing a large file
df = pd.read_csv("E:\\Programming\\Dataset_prac\\NFL Play by Play 2009-2016 (v3).csv", low_memory=False)
print(df.shape)

df.head()

(362447, 102)


| | Date | GameID | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField | ... | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | NaN | 15:00 | 15 | 3600.0 | 0.0 | TEN | ... | NaN | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | NaN | NaN | 2009 |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT | ... | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT | ... | NaN | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | NaN | NaN | 2009 |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT | ... | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT | ... | NaN | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | NaN | NaN | 2009 |


# 1. Handling Missing Values
## 1. Check how many values are missing.
Use `isnull().sum()` or `isna().sum()` to count the missing values in each column.

In [8]:
df.isna().sum().head(10)

Date                0
GameID              0
Drive               0
qtr                 0
down            54218
time              188
TimeUnder           0
TimeSecs          188
PlayTimeDiff      374
SideofField       450
dtype: int64
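
Raw counts are easier to judge as a share of the whole table. The sketch below assumes a small hypothetical frame (`toy`, not the NFL data) and computes the percentage of cells that are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the NFL data
toy = pd.DataFrame({
    "down": [np.nan, 1.0, 2.0, 3.0],
    "time": ["15:00", "14:53", None, "13:35"],
    "qtr": [1, 1, 1, 1],
})

missing_per_column = toy.isna().sum()        # number of NaN/None values per column
total_cells = toy.shape[0] * toy.shape[1]    # rows * columns
percent_missing = missing_per_column.sum() / total_cells * 100

print(missing_per_column)
print(f"{percent_missing:.1f}% of all cells are missing")
```

A high percentage suggests looking at why values are missing before choosing between dropping and filling.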

## 2. Figure out why the data is missing
If a value is missing because it was not recorded, try to fill it in;  
if it is missing because no such value should exist, just drop it.

- `dropna(axis=0)` removes the rows (`axis=0`) or columns (`axis=1`) that contain missing data.  
After removing every column that contains a missing value, the dataframe is left with only 41 columns.

In [10]:
df_drop_cols=df.dropna(axis=1)
df_drop_cols.head()

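| | Date | GameID | Drive | qtr | TimeUnder | ydstogo | ydsnet | PlayAttempted | Yards.Gained | sp | ... | Timeout_Indicator | Timeout_Team | posteam_timeouts_pre | HomeTimeouts_Remaining_Pre | AwayTimeouts_Remaining_Pre | HomeTimeouts_Remaining_Post | AwayTimeouts_Remaining_Post | ExPoint_Prob | TwoPoint_Prob | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | 15 | 0 | 0 | 1 | 39 | 0 | ... | 0 | | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 15 | 10 | 5 | 1 | 5 | 0 | ... | 0 | | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 15 | 5 | 2 | 1 | -3 | 0 | ... | 0 | | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 14 | 8 | 2 | 1 | 0 | 0 | ... | 0 | | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 14 | 8 | 2 | 1 | 0 | 0 | ... | 0 | | 3 | 3 | 3 | 3 | 3 | 0.0 | 0.0 | 2009 |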
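
To make the difference between the two axes concrete, here is a minimal sketch on a hypothetical three-row frame (`toy` and its values are made up for illustration). It shows how much data each choice throws away.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only GameID is complete
toy = pd.DataFrame({
    "GameID": [1, 1, 2],
    "down": [np.nan, 1.0, 2.0],
    "airEPA": [np.nan, -1.07, np.nan],
})

rows_kept = toy.dropna(axis=0)   # drop every row that has at least one NaN
cols_kept = toy.dropna(axis=1)   # drop every column that has at least one NaN

print(rows_kept.shape)           # rows that survive, all columns kept
print(cols_kept.shape)           # all rows kept, only complete columns survive
print(list(cols_kept.columns))
```

Dropping rows keeps every feature but loses observations; dropping columns keeps every observation but loses features, so it is worth checking both shapes before deciding.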


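- Fill the missing data.  
`.fillna(value=None, method=None, axis=None, inplace=False)`  
`value` is what replaces the missing values: a number, a string, or a dict or Series giving one value per column.  
`method` fills from neighboring values instead: 'bfill' uses the next non-missing value and 'ffill' uses the last non-missing value. Recent pandas releases deprecate `method` in favor of calling `.bfill()` or `.ffill()` directly.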

In [12]:
# take a small slice of columns and rows so the output prints well
sub_df = df.loc[:, 'EPA':'Season'].head()
sub_df

| | EPA | airEPA | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.014474 | NaN | NaN | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | NaN | NaN | 2009 |
| 1 | 0.077907 | -1.068169 | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
| 2 | -1.40276 | NaN | NaN | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | NaN | NaN | 2009 |
| 3 | -1.712583 | 3.318841 | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
| 4 | 2.097796 | NaN | NaN | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | NaN | NaN | 2009 |


In [15]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NAs with 0
# (.bfill() is equivalent to fillna(method='bfill'), which newer pandas deprecates)
sub_df.bfill(axis=0).fillna(0)

| | EPA | airEPA | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.014474 | -1.068169 | 1.146076 | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | -0.032244 | 0.036899 | 2009 |
| 1 | 0.077907 | -1.068169 | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
| 2 | -1.40276 | 3.318841 | -5.031425 | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | 0.106663 | -0.156239 | 2009 |
| 3 | -1.712583 | 3.318841 | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
| 4 | 2.097796 | 0.0 | 0.0 | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | 0.0 | 0.0 | 2009 |
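
The direction of the fill decides which gaps are left over. The following sketch uses a hypothetical single-column frame (`gaps`) to compare backward fill, forward fill, and a constant fill; it relies on `DataFrame.bfill()` and `DataFrame.ffill()` rather than the `method` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a gap at the start, in the middle, and at the end
gaps = pd.DataFrame({"airEPA": [np.nan, -1.0, np.nan, 3.0, np.nan]})

back = gaps.bfill()      # each NaN takes the next valid value below it
fwd = gaps.ffill()       # each NaN takes the last valid value above it
const = gaps.fillna(0)   # every NaN becomes 0

print(pd.concat({"bfill": back, "ffill": fwd, "zero": const}, axis=1))
```

Backward fill cannot fill the last row and forward fill cannot fill the first row, which is why the cell above chains a final `fillna(0)`.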


## 3. Imputation
As mentioned above, dropping and filling null values are both useful ways to handle missing data.  
We can also use machine learning packages to perform the imputation.

- <big>An Extension To Imputation</big>  
Impute the missing values as before, then add an extra boolean column that records whether each value was originally missing (True/False).
<img src="attachment:image.png" alt="image" style="width:500px;">
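
A minimal sketch of this extension, assuming a hypothetical one-column frame and a made-up naming scheme (`airEPA_was_missing`) for the indicator column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({"airEPA": [np.nan, -1.0, np.nan, 3.0]})

# Record which rows were missing BEFORE filling them, otherwise the information is lost
toy["airEPA_was_missing"] = toy["airEPA"].isna()

# Then impute as usual; here the column mean is used as the fill value
toy["airEPA"] = toy["airEPA"].fillna(toy["airEPA"].mean())

print(toy)
```

The indicator lets a model learn from the fact that a value was missing, which can carry information of its own.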


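**SimpleImputer** is a sklearn class that fills in missing values column by column; by default it uses the mean of each column.  
`strategy`: the strategy used to fill the missing values. The available strategies are:
- "mean": fill missing values with the column mean.
- "median": fill missing values with the column median.
- "most_frequent": fill missing values with the most frequent value in the column.
- "constant": fill missing values with a specified constant. The `fill_value` parameter must be given as well.  

 `fill_value`: only used when strategy="constant"; the constant used to fill the missing values.  
 
 Note that imputation can lose the column names, because the imputer returns a plain array. Restore them with: `imputed_x.columns=ori_x.columns`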


In [48]:
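from sklearn.impute import SimpleImputer

# default strategy="mean": each NaN is replaced by the mean of its column
imputer = SimpleImputer()
# fit_transform returns a NumPy array, so the new DataFrame gets the default column name 0
imputed_df = pd.DataFrame(imputer.fit_transform(df[['airEPA']]))

print(df.airEPA.head(), '\n', imputed_df.head())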


0         NaN
1   -1.068169
2         NaN
3    3.318841
4         NaN
Name: airEPA, dtype: float64 
           0
0  0.526933
1 -1.068169
2  0.526933
3  3.318841
4  0.526933
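
The other strategies and the column-name fix described above can be sketched as follows. The frame `toy` and its values are hypothetical; only the `SimpleImputer` parameters `strategy` and `fill_value` come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one gap per column
toy = pd.DataFrame({
    "airEPA": [np.nan, -1.0, 3.0, 4.0],
    "yacEPA": [1.0, np.nan, -5.0, 2.0],
})

# Median imputation: the result is a plain array, so restore the labels afterwards
median_imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(median_imputer.fit_transform(toy))
imputed.columns = toy.columns
imputed.index = toy.index

# Constant imputation: fill_value is required to choose the constant
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
zero_filled = pd.DataFrame(constant_imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
print(zero_filled)
```

Restoring the index as well as the columns matters when the imputed frame is later joined back to the original data.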


# 2. Data Scaling and Normalization
- In scaling, you change the range of your data.
- In normalization, you change the shape of the distribution of your data.

## 1. Scaling
It means that you're transforming your data so that it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like **support vector machines (SVM)** or **k-nearest neighbors (KNN)**. With these algorithms, a change of "1" in any numeric feature is given the same importance.
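
As an illustration, the sketch below rescales two hypothetical features with very different ranges to 0-1. It assumes sklearn's `MinMaxScaler` as the scaling tool, which is one common choice and not necessarily the one this guide goes on to use.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales (e.g. a count and a yearly amount)
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 5000.0],
])

scaler = MinMaxScaler()              # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)   # per column: (x - min) / (max - min)

print(X_scaled)
```

After scaling, a step of 1 in the first column and a step of 2000 in the second both become 0.5, so a distance-based model no longer lets the larger-valued feature dominate.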