#  Missing Data with Pandas

## Overview of Missing Data

   - Missing data occurs very often in datasets, and it happens for many different reasons. 
   
   - Handling missing data is part of cleaning and preparing the data for further analysis. 
   
   - Missing data has to be studied carefully in order to deal with it correctly.
   
   - There are three types of missing data:
       1. **MCAR**: Missing Completely At Random.
       2. **MAR**: Missing At Random.
       3. **MNAR**: Missing Not At Random.
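
The difference between the three types is easier to see on simulated data. The block below is only a rough sketch: the `toy` DataFrame, its `age` and `income` columns, and the cut-off values are all made up for illustration, and real missingness mechanisms are rarely this clean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "income": rng.normal(50_000, 15_000, size=200),
})

# MCAR: every income value has the same 20% chance of being missing
mcar = toy["income"].mask(rng.random(len(toy)) < 0.2)

# MAR: missingness depends on another, observed column (here: age)
mar = toy["income"].mask(toy["age"] > 60)

# MNAR: missingness depends on the unobserved value itself (here: high incomes)
mnar = toy["income"].mask(toy["income"] > 65_000)

# Share of missing values produced by each mechanism
print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```

In the MAR case the missing incomes can be explained by the observed `age` column, while in the MNAR case the reason for missingness is hidden inside the values that are missing.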

## Missing Data Representation in Pandas
- Missing data in Pandas is represented by __NaN__ (Not a Number). 

- Missing data can also be referred to as __NA__ (Not Available), as in the R language. 

- **None** is also treated as missing by Pandas (in numeric columns it is converted to __NaN__).
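
A quick way to see how None and NaN are handled is to build a small Series that contains both. This is a minimal sketch; the names `s` and `t` are only illustrative.

```python
import numpy as np
import pandas as pd

# None and np.nan in a numeric Series both end up as NaN
s = pd.Series([1, None, np.nan, 4])
print(s.dtype)            # float64
print(s.isna().tolist())  # [False, True, True, False]

# None in a column of strings is also reported as missing
t = pd.Series(["a", None])
print(t.isna().tolist())  # [False, True]
```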

## Setting up the environment 

In [1]:
import pandas as pd
import numpy as np
from random import choices, seed

In [2]:
df = pd.DataFrame({'A':[1,2, np.nan, 4],
                 'B':[10, 23, np.nan, np.nan],
                 'C':[5, 10, np.nan, 15]})
df

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,,,
3,4.0,,15.0


### Checking for Missing Values

- To check for missing data, use the __isnull__ method (__isna__ is an alias and gives the same result).

- __isnull__ returns **True** where a value is missing and **False** where it is not. 

- We often need the positions where the data is not missing; the __notnull__ method does this and is the inverse of __isnull__.

Here is the syntax:

```python
df.isnull()           # returns a DataFrame of booleans (True where missing)
df[df.notnull()]      # selects the non-missing values (the shape is kept)

# To check the docs (IPython/Jupyter only; use help(df.isnull) in plain Python)
df.isnull?
```
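
The relationships between these methods can be checked directly. The following is a small sketch on a made-up DataFrame called `demo` (not part of the tutorial data):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1, np.nan], "B": [np.nan, 2]})

# isna is an alias of isnull, so both give the same boolean DataFrame
same_as_isna = demo.isnull().equals(demo.isna())

# notnull is the element-wise inverse of isnull
inverse = demo.notnull().equals(~demo.isnull())

# Indexing with a boolean DataFrame keeps the shape; missing cells stay NaN
masked = demo[demo.notnull()]

print(same_as_isna, inverse, masked.shape)
```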

### Missing Data Checking Example

In [3]:
df.isnull()

Unnamed: 0,A,B,C
0,False,False,False
1,False,False,False
2,True,True,True
3,False,True,False


Indeed, the result is a DataFrame of booleans of the same shape. The third value of column A (index 2) is missing, which is why we see **True** there. The third and fourth values of column B are missing as well, and the third value of column C is also missing.

### Counting the missing data

  - Chaining the **sum** method after **isnull** gives the number of missing values in each column, because each **True** counts as one and each **False** as zero.

In [4]:
df.isnull().sum()

A    1
B    2
C    1
dtype: int64
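
The same chaining idea extends to rows, to a grand total, and to proportions. A minimal sketch, assuming the same values as df above (rebuilt here under the illustrative name `counts_df` so the block runs on its own):

```python
import numpy as np
import pandas as pd

counts_df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                          'B': [10, 23, np.nan, np.nan],
                          'C': [5, 10, np.nan, 15]})

# Number of missing values in each row
per_row = counts_df.isnull().sum(axis=1)

# Total number of missing values in the whole DataFrame
total = counts_df.isnull().sum().sum()

# Fraction of missing values per column (the mean of booleans is a proportion)
share = counts_df.isnull().mean()

print(per_row.tolist())  # [0, 0, 3, 1]
print(total)             # 4
print(share.tolist())    # [0.25, 0.5, 0.25]
```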

### Filtering Data 

- You can filter the rows of a DataFrame by whether a given variable (column) is missing. This works because indexing a DataFrame with a boolean Series keeps only the rows where the Series is True.

### Filtering where a variable has missing data

In [5]:
df[df['B'].isnull()]

Unnamed: 0,A,B,C
2,,,
3,4.0,,15.0


### Filtering where a variable does not have missing data

In [6]:
df[df['A'].notnull()]

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
3,4.0,,15.0
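
Boolean Series can also be combined before filtering. The sketch below assumes the same values as df and uses the illustrative name `frame`; it selects rows where A is present but B is missing, and then rows with at least one missing value in any column.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, np.nan, 4],
                      'B': [10, 23, np.nan, np.nan],
                      'C': [5, 10, np.nan, 15]})

# Combine two conditions with & (each condition is a boolean Series)
a_present_b_missing = frame[frame['A'].notnull() & frame['B'].isnull()]

# any(axis=1) is True for rows that have at least one missing value
any_missing = frame[frame.isnull().any(axis=1)]

print(a_present_b_missing.index.tolist())  # [3]
print(any_missing.index.tolist())          # [2, 3]
```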


## Handling missing values in DataFrame Objects 

### Dropping NaNs

- One of the techniques to deal with missing data points is dropping them. But you have to decide whether to drop rows or columns.

- Pandas has the __dropna__ method, which by default drops every row that contains at least one missing value. 

- __dropna__ has several options: 
  - **axis**: 0 (rows) by default, so any row that has a missing point will be dropped. We can set __axis to 1__ to drop columns instead.
  - **how**: determines how to drop rows or columns:
      - **any** (the default): a row or column that has one or more missing points will be dropped
      - **all**: a row or column whose values are all missing will be dropped
  - **thresh**: the minimum number of non-missing values required to keep the row or column.
  - **inplace**: False by default. Whether the original data will be changed.
  
The syntax:
```python
df.dropna()                 # default: drop every row that contains at least one NA
df.dropna(axis=1)           # drop columns that contain NAs
df.dropna(how="all")        # drop only the rows in which all values are NA
df.dropna(thresh=2)         # keep the rows that have at least 2 non-missing values

# For the docs (IPython/Jupyter only; use help(df.dropna) in plain Python)
df.dropna?
```
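
One way to understand these options is to rebuild them with boolean filtering. The block below is a sketch rather than a description of how Pandas implements dropna; the name `data` is illustrative and holds the same values as df.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1, 2, np.nan, 4],
                     'B': [10, 23, np.nan, np.nan],
                     'C': [5, 10, np.nan, 15]})

# how="any": keep rows where all values are present
any_manual = data[data.notna().all(axis=1)]

# how="all": keep rows where at least one value is present
all_manual = data[data.notna().any(axis=1)]

# thresh=2: keep rows with at least 2 non-missing values
thresh_manual = data[data.notna().sum(axis=1) >= 2]

print(data.dropna().equals(any_manual))
print(data.dropna(how="all").equals(all_manual))
print(data.dropna(thresh=2).equals(thresh_manual))
```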

### Removing Missing Values Example:

In [7]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0


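### Dropping columns that contain missing values

Every column of df contains at least one NaN, so all three columns are dropped and only the index remains.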

In [8]:
df.dropna(axis=1)

0
1
2
3


### Dropping rows or columns where all values are missing

In [9]:
df.dropna(how="all")

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
3,4.0,,15.0


In [10]:
df.dropna(how="all", axis = 1)

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,,,
3,4.0,,15.0


### Using Threshold

In [11]:
df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
3,4.0,,15.0


In [12]:
df.dropna(thresh=1)

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
3,4.0,,15.0


## Example 02 

In [13]:
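# No seed is set, so the random values will differ between runs.
# Casting to float lets the integer columns hold NaN without an upcasting warning.
df2 = pd.DataFrame(np.array(choices(range(99), k = 28)).reshape(7,4)).astype(float)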
df2.iloc[:4, 2] = np.nan
df2.iloc[:2, 3] = np.nan
df2.iloc[5] = np.nan
df2.iloc[:, 0] = np.nan

In [14]:
df2

Unnamed: 0,0,1,2,3
0,,8.0,,
1,,47.0,,
2,,44.0,,40.0
3,,84.0,,27.0
4,,10.0,3.0,46.0
5,,,,
6,,33.0,13.0,74.0


### 1. Dropping Rows with all NaN values

In [15]:
df2.dropna(how = 'all')

Unnamed: 0,0,1,2,3
0,,8.0,,
1,,47.0,,
2,,44.0,,40.0
3,,84.0,,27.0
4,,10.0,3.0,46.0
6,,33.0,13.0,74.0


### 2. Dropping columns with all NaN values

In [16]:
df2.dropna(how = 'all', axis=1)

Unnamed: 0,1,2,3
0,8.0,,
1,47.0,,
2,44.0,,40.0
3,84.0,,27.0
4,10.0,3.0,46.0
5,,,
6,33.0,13.0,74.0


### 3. Threshold 

In [17]:
df2.dropna(thresh = 2)

Unnamed: 0,0,1,2,3
2,,44.0,,40.0
3,,84.0,,27.0
4,,10.0,3.0,46.0
6,,33.0,13.0,74.0


In [18]:
df2.dropna(thresh = 3)

Unnamed: 0,0,1,2,3
4,,10.0,3.0,46.0
6,,33.0,13.0,74.0


In [19]:
df2.dropna(thresh = 2, axis = 1)

Unnamed: 0,1,2,3
0,8.0,,
1,47.0,,
2,44.0,,40.0
3,84.0,,27.0
4,10.0,3.0,46.0
5,,,
6,33.0,13.0,74.0


In [20]:
df2.dropna(thresh = 3, axis = 1)

Unnamed: 0,1,3
0,8.0,
1,47.0,
2,44.0,40.0
3,84.0,27.0
4,10.0,46.0
5,,
6,33.0,74.0


### Filling Missing Values

In [21]:
df.fillna(value=-999)

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,-999.0,-999.0,-999.0
3,4.0,-999.0,15.0


### The fillna Method and Its Options

- Filtering the data out may lead to data loss. A simple way to deal with missingness is to fill it in using different techniques. 

- To fill missing data, use the __fillna__ method.

- Calling __fillna__ on a DataFrame has several options:

  - **value**: a constant used to replace every NaN in the DataFrame. Other types are possible, such as a dict that maps each column to its own fill value.
  - **method**: one of the following (deprecated since pandas 2.1 in favor of the __ffill__ and __bfill__ methods):
      - 'ffill': fill forward, i.e. the last valid value is carried forward (same as 'pad')
      - 'bfill': fill backward, i.e. the next valid value is carried backward (same as 'backfill')
      - None: this is the default.
  - **axis**: with a method, values are propagated down each column by default. Set __axis to 1__ to propagate them across each row.
  - **inplace**: False by default. The original object is modified if __inplace__ is set to **True**.
  - **limit**: its meaning depends on whether a method is given:
      1. Method is specified: limit determines the maximum number of consecutive NaN values to forward/backward fill.
      2. Method is not specified: limit determines the maximum number of entries along the entire axis where NaNs will be filled (must be greater than zero if not None).

The Syntax:

```python
df.fillna(value=0)               # fill NAs with a constant (a scalar or a dict)
df.fillna(method='ffill')        # fill forward
df.fillna(method='bfill')        # fill backward
df.fillna(value=0, limit=1)      # fill at most `limit` NaNs per column

# for docs (IPython/Jupyter only; use help(df.fillna) in plain Python)
df.fillna?
```
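
Since pandas 2.1 the `method` argument of fillna is deprecated, so on recent versions forward and backward filling is usually written with the dedicated `ffill` and `bfill` methods. The sketch below shows these alternatives together with `limit`; the name `data` is illustrative and holds the same values as df.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1, 2, np.nan, 4],
                     'B': [10, 23, np.nan, np.nan],
                     'C': [5, 10, np.nan, 15]})

filled_fwd = data.ffill()          # same idea as fillna(method='ffill')
filled_bwd = data.bfill()          # same idea as fillna(method='bfill')

# limit=1: fill at most one consecutive NaN, so the last value of B stays NaN
limited = data.ffill(limit=1)

# Without a method, limit caps how many NaNs are filled per column
const_limited = data.fillna(0, limit=1)

print(filled_fwd['B'].tolist())          # [10.0, 23.0, 23.0, 23.0]
print(limited['B'].isna().tolist())      # [False, False, False, True]
```

Column B is the interesting one here because it ends with two consecutive NaNs: forward fill can reach both, backward fill can reach neither, and a limit of 1 reaches only the first.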

### Filling with a constant

In [22]:
df.fillna(-99)

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,-99.0,-99.0,-99.0
3,4.0,-99.0,15.0


### Filling forward

In [23]:
df.fillna(method = 'ffill')

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,2.0,23.0,10.0
3,4.0,23.0,15.0


### Filling Backward

In [24]:
df.fillna(method = 'bfill')

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,4.0,,15.0
3,4.0,,15.0


### Filling Backward with limit

In [25]:
df.fillna(method='bfill', limit =2)

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,4.0,,15.0
3,4.0,,15.0


### Filling Different Columns with Different Values

   By calling fillna with a dict, you can use a different fill value for each column:

In [26]:
df.fillna({'A': 0, 'B': -99, 'C': -1})

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,0.0,-99.0,-1.0
3,4.0,-99.0,15.0


### Filling and Making the Change Permanent
fillna returns a new object, but you can modify the existing object in-place:

In [27]:
_ = df.fillna(method = 'bfill', inplace= True)
_ = df.fillna(method = 'ffill', inplace = True)
df

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,4.0,23.0,15.0
3,4.0,23.0,15.0
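
An alternative to inplace is to assign the returned object to a name, which keeps the original data available for comparison. This is a minimal sketch under the assumption that the same values as df are used; the names `original` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({'A': [1, 2, np.nan, 4],
                         'B': [10, 23, np.nan, np.nan],
                         'C': [5, 10, np.nan, 15]})

# Backward fill first, then forward fill whatever is still missing
filled = original.bfill().ffill()

print(filled.isna().sum().sum())    # 0
print(original.isna().sum().sum())  # 4, the original is untouched
```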


### Filling with the mean or median value

  - With the __fillna__ method we can even do more sophisticated missing data imputation, using a statistic of each column such as the mean, median, max or min. 

In [28]:
df3 = pd.DataFrame({'A':[1,2, np.nan, 4],
                 'B':[10, 23, np.nan, np.nan],
                 'C':[5, 10, np.nan, 15]})
df3

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,,,
3,4.0,,15.0


In [29]:
df3.fillna({'A': df3['A'].mean(),
           'B':df3['B'].median(), 
           'C': df3['C'].min()})

Unnamed: 0,A,B,C
0,1.0,10.0,5.0
1,2.0,23.0,10.0
2,2.333333,16.5,5.0
3,4.0,16.5,15.0
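
When the same statistic is wanted for every column, the dict does not have to be written by hand: fillna also accepts a Series whose index holds the column names. A short sketch, assuming the same values as df3 (rebuilt here as the illustrative `scores`):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'A': [1, 2, np.nan, 4],
                       'B': [10, 23, np.nan, np.nan],
                       'C': [5, 10, np.nan, 15]})

# scores.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NaNs are skipped when computing it)
by_mean = scores.fillna(scores.mean())

print(by_mean.loc[3, 'B'])  # 16.5
print(by_mean.loc[2, 'C'])  # 10.0
```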


## Conclusion
This is a very basic introduction to handling missing data. Advanced techniques will be discussed in another tutorial.