# Handling Missing Values

## 0. Modules

In [1]:
import numpy as np
import pandas as pd

## 1. Take a first look at the data

In [2]:
# NFL Play by Play 2009-2017 (v4).csv.zip

# read in all our data
nfl_data = pd.read_csv("NFL Play by Play 2009-2017 (v4).csv.zip")

# set seed for reproducibility
np.random.seed(0)


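The first thing to do when you get a new dataset is to look at part of it. This lets you confirm that everything was read in correctly
and gives you a feel for the state of the data. In this case, let's see whether there are any missing values, which will be represented as `NaN` or `None`.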

In [3]:
# look at the first five rows of the nfl_data file. 
# I can see a handful of missing data already!
nfl_data.head()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,...,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,...,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,...,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


We can see that there are missing values.

## 2. How many missing data points do we have?

OK, now that we know we do have some missing values, let's see how many there are in each column.

In [4]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]


Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

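That seems like a lot! To get a better sense of the scale of the problem, it helps to calculate what percentage of the values in the dataset are missing.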

In [5]:
# how many total missing values do we have?
# np.prod replaces np.product, which was removed in NumPy 2.0
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

27.66722370547874


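Wow, more than a quarter of the cells in this dataset are empty! In the next step, we'll take a closer look at some of the columns with missing values and try to figure out what might be going on with them.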
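Before picking columns to inspect, it can help to rank them by how much of each is missing. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the NFL data); the same two lines should apply to `nfl_data` unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for nfl_data; the values are invented for illustration.
toy = pd.DataFrame({
    "down": [np.nan, 1.0, 2.0, np.nan],
    "time": ["15:00", "14:53", None, "13:35"],
    "qtr": [1, 1, 1, 1],
})

# isnull() returns a boolean frame; the column-wise mean is the fraction missing.
percent_missing_by_column = toy.isnull().mean() * 100

# Sort so the columns with the most missing values come first.
percent_missing_by_column = percent_missing_by_column.sort_values(ascending=False)
print(percent_missing_by_column)
```

Taking the mean of the boolean mask avoids dividing by the row count by hand, and sorting puts the columns that need the most attention at the top.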

## 3. Figure out why the data is missing

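This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis".
It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. When dealing with missing values, you need to use your intuition to work out why the value is missing.
To help figure this out, there is one very important question you can ask yourself:

**Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try to guess what it might be.
These are values you probably want to keep as `NaN`. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row.
This is called **imputation**, and we'll learn how to do it next! :)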

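Let's work through an example. Looking at the number of missing values in the `nfl_data` dataframe, I notice that the `TimeSecs` column has a number of missing values in it: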

In [6]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

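By looking at [the documentation](https://www.kaggle.com/datasets/maxhorowitz/nflplaybyplay2009to2016), we can see that this column records the number of seconds left in the game when the play was made. This means these values are probably missing because they were not recorded,
rather than because they don't exist. So it makes sense for us to try to guess what they should be rather than just leaving them as `NA`.
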

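On the other hand, there are other fields, like `PenalizedTeam`, that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty on the play, it doesn't make sense to say which team was penalized.
For this column, it would make more sense to either leave it empty or to add a third value like "none" and use that to replace the `NA` values.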
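The two cases can be treated differently in code. The following is a minimal sketch on invented rows: only the column names come from the dataset, the `"none"` label is a placeholder, and linear interpolation is an assumed way of estimating an unrecorded time rather than the method this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the NFL dataset.
toy = pd.DataFrame({
    "TimeSecs": [3600.0, 3593.0, np.nan, 3515.0],
    "PenalizedTeam": [np.nan, "PIT", np.nan, np.nan],
})

# No penalty on the play: the value does not exist, so label it explicitly.
toy["PenalizedTeam"] = toy["PenalizedTeam"].fillna("none")

# Time remaining was not recorded: estimate it from the neighbouring plays.
# Linear interpolation is one assumed choice, not the only reasonable one.
toy["TimeSecs"] = toy["TimeSecs"].interpolate(method="linear")
print(toy)
```

The explicit label keeps "no penalty" distinguishable from a value that is missing by accident, while the interpolated time stays between the plays on either side of it.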

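**Tip: If you haven't read the dataset's documentation yet, this is a great time to do so! If you're working with a dataset you got from another person, you can also try reaching out to them for more information.
Doing so helps you understand the background and details of the data, and so deal with its problems more effectively.**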

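If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling its missing values. For the rest of this notebook,
we'll cover some "quick and dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.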

## 4. Drop missing values

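If you're in a hurry or don't have a reason to figure out why your values are missing, one option is to simply remove any rows or columns that contain missing values.
(**Note: I don't generally recommend this approach for important projects! It's usually worth taking the time to go through your data and look at each column with missing values one by one, to really get to know your dataset.**)

If you're sure you want to drop rows with missing values, pandas has a handy function, `dropna()`, to help you do this.
Let's try it out on our NFL dataset!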

In [7]:
# remove all the rows that contain a missing value
nfl_data.dropna()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season


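Oh dear, it looks like that removed all of our data! 😱 This is because every row in our dataset had at least one missing value.
Let's instead remove all the columns that have at least one missing value.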

In [8]:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

Unnamed: 0,Date,GameID,Drive,qtr,TimeUnder,ydstogo,ydsnet,PlayAttempted,Yards.Gained,sp,...,AwayTeam,Timeout_Indicator,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,ExPoint_Prob,TwoPoint_Prob,Season
0,2009-09-10,2009091000,1,1,15,0,0,1,39,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
1,2009-09-10,2009091000,1,1,15,10,5,1,5,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
2,2009-09-10,2009091000,1,1,15,5,2,1,-3,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
3,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
4,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009


In [9]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns with na's dropped: 37


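We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`s from our data. This cleans up the data,
but keep in mind that important information may have been thrown away with it.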
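Dropping every row or every column with a gap is the bluntest setting of `dropna()`. Its `subset` and `thresh` parameters allow a more selective cut. The sketch below runs on an invented frame (the values and the threshold of 3 are assumptions chosen for illustration).

```python
import numpy as np
import pandas as pd

# Invented values; the column names mimic the NFL dataset.
toy = pd.DataFrame({
    "GameID": [1, 1, 2, 2],
    "down": [np.nan, 1.0, 2.0, 3.0],
    "airEPA": [np.nan, np.nan, np.nan, 0.5],
})

# Drop a row only when a column we actually need is missing.
rows_with_down = toy.dropna(subset=["down"])

# Keep only the columns that have at least 3 non-missing values.
mostly_complete_columns = toy.dropna(axis=1, thresh=3)

print(rows_with_down.shape)
print(list(mostly_complete_columns.columns))
```

Here three of the four rows survive because only `down` is checked, and `airEPA` is the only column discarded because it has a single recorded value.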

## 5. Filling in missing values automatically

Another option is to try to fill in the missing values. For this next bit, I'm taking a small subset of the NFL data so that it prints out clearly.

In [10]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


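We can use the pandas `fillna()` function to fill in the missing values in a dataframe for us.
One option is to specify the value we want the `NaN` values to be replaced with.
Here, I'm saying that I would like to replace all the `NaN` values with 0.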

In [11]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,0.0,0.0,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,0.0,0.0,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,0.0,0.0,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.0,0.0,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009
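A single fill value is not always appropriate for every column. `fillna()` also accepts a dict mapping column names to fill values, so each column can get its own. The sketch below uses invented numbers, and filling `airEPA` with its column mean is an assumed strategy shown only for illustration.

```python
import numpy as np
import pandas as pd

# Invented values; the column names mimic the subset shown above.
toy = pd.DataFrame({
    "airEPA": [np.nan, -1.0, np.nan, 3.0],
    "yacEPA": [np.nan, 1.0, np.nan, -5.0],
})

# A dict maps each column to its own fill value:
# the column mean for airEPA (an assumed choice), a constant 0 for yacEPA.
filled = toy.fillna({"airEPA": toy["airEPA"].mean(), "yacEPA": 0})
print(filled)
```

The mean is computed from the recorded values only, because pandas skips `NaN` when aggregating by default.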


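I could also be a bit more savvy and replace each missing value with whatever value comes directly after it in the same column. (**This makes a lot of sense for datasets where the observations have some sort of logical order to them.**)
In pandas this is done with the `bfill()` method, which replaces each `NaN` with the next non-`NaN` value. Older code passes `method='bfill'` to `fillna()` instead, but recent pandas versions deprecate that argument.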

In [12]:
# replace all NA's with the value that comes directly after it in the same column, 
# then replace all the remaining na's with 0
subset_nfl_data = subset_nfl_data.bfill(axis=0).fillna(0)
subset_nfl_data

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,-1.068169,1.146076,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,-0.032244,0.036899,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,3.318841,-5.031425,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.106663,-0.156239,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009
