# Missing Data

In [4]:
#  Pandas fills in a missing point with a null or NaN by default.
#  We'll learn how we can use methods like dropna or fillna to drop or fill in those missing values.
import numpy as np
import pandas as pd

In [5]:
# For having gridlines

In [6]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px black solid !important;
    color: black !important;
}
</style>

In [7]:
d = {'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]} # using np.nan to signify missing value

In [8]:
df = pd.DataFrame(d)

In [9]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


* **Row 0 and column C have no missing values.**
* **dropna method - Drops the missing values from your dataset. Call it as a method on the DataFrame.**
* **By default, Pandas drops any row with one or more missing values.**
* **By default axis = 0, which means rows containing missing values are dropped; set axis = 1 to drop columns instead.**
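To make the points above concrete, here is a minimal sketch that counts the missing values per row and per column of the same DataFrame and then compares the two axis settings. The variable names (`missing_per_row`, `rows_kept`, and so on) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the notebook's df
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# isna() marks each missing cell with True; summing counts them
missing_per_row = df.isna().sum(axis=1)  # one count per row
missing_per_col = df.isna().sum(axis=0)  # one count per column

rows_kept = df.dropna()        # axis=0 (default): drop rows that contain NaN
cols_kept = df.dropna(axis=1)  # axis=1: drop columns that contain NaN

print(missing_per_row)
print(missing_per_col)
```

Only row 0 and column C have a count of zero, which is why they are the only ones that survive the respective `dropna` calls.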

In [10]:
df.dropna()  



Unnamed: 0,A,B,C
0,1.0,5.0,1


In [11]:
df.dropna(axis=1) # Drops columns with missing values. REMEMBER: rows are axis = 0 (the default);
# to drop columns we specify the parameter axis = 1

Unnamed: 0,C
0,1
1,2
2,3


# Threshold

In [12]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [15]:
df.dropna() # By default, drops every row that contains at least one NaN and keeps only the fully populated rows.

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [16]:
df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2


# Explanation df.dropna(thresh=2)

* Keeps row 1 because it has at least 2 non-NaN values (2.0 and 2).
* Keeps row 0 by the same logic, as all 3 of its values are non-NaN.
* Drops row 2 because it has only 1 non-NaN value (3).
* The thresh argument is an int: a row needs at least that many non-NaN values to avoid being dropped.
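One way to check this reasoning is to count the non-NaN values in each row by hand and compare the result with `dropna(thresh=2)`. This is a small sketch; `non_nan_per_row` and `kept_manually` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same data as the notebook's df
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Count the non-NaN values in each row
non_nan_per_row = df.notna().sum(axis=1)

# Keep rows with at least 2 non-NaN values, selected manually...
kept_manually = df[non_nan_per_row >= 2]
# ...and with the thresh argument
kept_by_thresh = df.dropna(thresh=2)

print(non_nan_per_row)
print(kept_by_thresh)
```

Both selections keep rows 0 and 1 and drop row 2.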

### fillna method

* We have already seen how to drop the missing values.
* Often it is better to fill in the missing values than to drop them.
* The fillna method fills in missing values (null or NaN) with a value you supply.

In [17]:
# Original DataFrame 
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [18]:
# DataFrame after using fillna
df.fillna(value = 'FILLED VALUE')

Unnamed: 0,A,B,C
0,1,5,1
1,2,FILLED VALUE,2
2,FILLED VALUE,FILLED VALUE,3


* Often we want to fill a NaN with the mean of the column it appears in:
 1. Select the column that contains the NaN.
 2. Call the fillna() method on it.
 3. For the value parameter of fillna, pass the same column selected in step 1.
 4. Call the mean() method on the column from step 3, so that value becomes the column's mean.
 5. For example, with a DataFrame called df, we fill column A's NaN with the mean of column A as shown below:



In [21]:
df['A'].fillna(value=df['A'].mean()) # Replaces every NaN in column A with the mean of the non-NaN values in A.

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
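The same idea can be applied to every column at once. The sketch below assumes all columns are numeric, as they are here: `df.mean()` returns one mean per column, and passing that Series to `fillna` fills each column's NaN with its own mean. The names `col_means` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the notebook's df
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# One mean per column, computed from the non-NaN values only
col_means = df.mean()

# fillna matches the Series index against the column names
filled = df.fillna(value=col_means)

print(filled)
```

Note that `fillna` returns a new DataFrame; `df` itself still contains its NaN values unless the result is assigned back.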

# There are multiple approaches and statistical methods for filling in missing values appropriately, but which one to use depends primarily on the nature of the data you are dealing with. How to select which approach to deploy? Go figure :)
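As a rough illustration of two such alternatives (not a recommendation for any particular dataset), the sketch below fills with the column median and with a forward fill, which carries the last valid value down each column. The names `median_filled` and `forward_filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the notebook's df
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Median of each column; less sensitive to outliers than the mean
median_filled = df.fillna(df.median())

# Forward fill: propagate the last valid value down each column
forward_filled = df.ffill()

print(median_filled)
print(forward_filled)
```

On this tiny DataFrame the median fill happens to match the mean fill, while the forward fill gives a different value for column A.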