# Handling Missing Data
---

## **Excel**

* Use filtering to find blank rows or other missing values

## **Python**

*Find missing values:*

    df.isnull()                # df.isnull().sum() for a count of missing in each column

<br>

*Replace missing values:*

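    df.fillna(value)           # replace nulls with a given value, e.g. df.fillna(0)
    df.bfill()                 # replace each null with the next non-null value
    df.ffill()                 # replace each null with the previous non-null value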

<br>

*Remove rows with missing values:*

    df.dropna()


<br><br>

### Load required packages and data
---

In [None]:
# Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Save GitHub file locations to variables
failed_bank_path = 'https://github.com/The-Calculated-Life/python_analysis_for_excel/blob/main/data/failed_banks.xlsx?raw=true'
bx_users_path = 'https://raw.githubusercontent.com/The-Calculated-Life/python_analysis_for_excel/main/data/bx_users.csv'

# Read the Excel and CSV files
bank_detail = pd.read_excel(failed_bank_path, sheet_name='detail')
bx_users = pd.read_csv(bx_users_path)

<br><br>
### Finding missing values
---



In [None]:
# Look at the first 5 rows of "bank_detail"
bank_detail.head()

| | CERT | FIN | CHARTER | ESTIMATED LOSS | ASSETS | DEPOSITS | RESOLUTION |
|---|---|---|---|---|---|---|---|
| 0 | 14361 | 10536.0 | COMMERCIAL | NaN | 152400 | 139526 | FAILURE |
| 1 | 18265 | 10535.0 | COMMERCIAL | NaN | 100879 | 95159 | FAILURE |
| 2 | 21111 | 10534.0 | COMMERCIAL | 2491.0 | 120574 | 111234 | FAILURE |
| 3 | 58112 | 10532.0 | COMMERCIAL | 4547.0 | 29726 | 26473 | FAILURE |
| 4 | 58317 | 10533.0 | OTHER | 2188.0 | 27119 | 26151 | FAILURE |


<br>

In [None]:
# Identify "cells" with nulls
bank_detail.isnull()

| | CERT | FIN | CHARTER | ESTIMATED LOSS | ASSETS | DEPOSITS | RESOLUTION |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | False | False |
| 1 | False | False | False | True | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 574 | False | False | False | False | False | False | False |
| 575 | False | False | False | False | False | False | False |
| 576 | False | False | False | False | False | False | False |
| 577 | False | False | False | False | False | False | False |


<br>

In [None]:
# Find the total number of null values by column
bank_detail.isnull().sum()

CERT               0
FIN               13
CHARTER            0
ESTIMATED LOSS    16
ASSETS             0
DEPOSITS           0
RESOLUTION         0
dtype: int64
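Raw counts are easier to judge next to the column length, so it can help to look at the share of missing values and at the affected rows themselves. The following is a minimal sketch on a made-up dataframe (`toy` and its columns are hypothetical and are not part of the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy = pd.DataFrame({
    "cert": [1, 2, 3, 4],
    "loss": [np.nan, 250.0, np.nan, 100.0],
})

# isnull() returns booleans; their mean is the fraction of missing values per column
missing_share = toy.isnull().mean()

# isnull().any(axis=1) flags rows that contain at least one null
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(missing_share)
print(rows_with_nulls)
```

The same two expressions should work unchanged on `bank_detail` or `bx_users`.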

<br>

<br><br>
**QUICK CHALLENGE #1:**

**Task: Write code which shows the number of missing values in each column of the `bx_users` dataframe**


In [None]:
# Your code for quick challenge #1 here
bx_users.isnull().sum()

user_id          0
location         0
age         110762
dtype: int64

<br><br>
### Replace missing data
---


*1. Replace with a string or number*

In [None]:
# Replace missing values in "ESTIMATED LOSS" with string "Missing"
bank_detail['ESTIMATED LOSS'].fillna('Missing')

0      Missing
1      Missing
2         2491
3         4547
4         2188
        ...   
574      14592
575       1363
576        617
577       1322
578      11574
Name: ESTIMATED LOSS, Length: 579, dtype: object

<br>

In [None]:
# Replace missing values in "ESTIMATED LOSS" with zero
bank_detail['ESTIMATED LOSS'].fillna(0)

0          0.0
1          0.0
2       2491.0
3       4547.0
4       2188.0
        ...   
574    14592.0
575     1363.0
576      617.0
577     1322.0
578    11574.0
Name: ESTIMATED LOSS, Length: 579, dtype: float64
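Note the `dtype` in the two outputs above: filling with a string turns the column into `object`, while filling with a number keeps it `float64`. The sketch below, on a hypothetical series named `loss`, illustrates that difference and also shows how filling with zero changes the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the bank data
loss = pd.Series([np.nan, 250.0, np.nan, 100.0])

filled_text = loss.fillna("Missing")   # mixes strings and floats
filled_zero = loss.fillna(0)           # stays numeric

print(filled_text.dtype)   # object
print(filled_zero.dtype)   # float64

# mean() skips nulls by default, so zero-filling pulls the average down
print(loss.mean())         # 175.0
print(filled_zero.mean())  # 87.5

# A text-filled column can be made numeric again; the strings become NaN
restored = pd.to_numeric(filled_text, errors="coerce")
```

A string placeholder is therefore better suited to display or export, while a numeric fill keeps the column usable for calculations.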

<br>

*2. Replace with mean or median of the column*

In [None]:
# Replace missing values in "ESTIMATED LOSS" with the column mean
bank_detail['ESTIMATED LOSS'].fillna(bank_detail['ESTIMATED LOSS'].mean())

0      130299.838366
1      130299.838366
2        2491.000000
3        4547.000000
4        2188.000000
           ...      
574     14592.000000
575      1363.000000
576       617.000000
577      1322.000000
578     11574.000000
Name: ESTIMATED LOSS, Length: 579, dtype: float64

130299.83836589698
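The median variant follows the same pattern as the mean. Here is a minimal sketch on a hypothetical series with one large outlier, which shows why the median is often the safer choice for skewed columns:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one large outlier (1000.0)
loss = pd.Series([np.nan, 10.0, 20.0, 30.0, 1000.0])

# mean() and median() both ignore nulls when computing the fill value
mean_filled = loss.fillna(loss.mean())      # fill value is 265.0
median_filled = loss.fillna(loss.median())  # fill value is 25.0

print(mean_filled[0], median_filled[0])
```

The outlier drags the mean far above the typical value, while the median stays close to the bulk of the data.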

<br>

*3. Fill using previous or next value*

In [None]:
# Use backfill: each missing value takes the next non-null value below it
bank_detail['ESTIMATED LOSS'].bfill()

0       2491.0
1       2491.0
2       2491.0
3       4547.0
4       2188.0
        ...   
574    14592.0
575     1363.0
576      617.0
577     1322.0
578    11574.0
Name: ESTIMATED LOSS, Length: 579, dtype: float64

<br>

In [None]:
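# Use forward fill, then count the nulls that remain
# (rows 0 and 1 have no previous value to copy, so they stay null)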
bank_detail['ESTIMATED LOSS'].ffill().isnull().sum()

2
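Forward fill cannot fill nulls at the start of a column, and backfill cannot fill nulls at the end. A small sketch on a hypothetical series shows both cases, plus one possible way to cover both ends by chaining the two calls:

```python
import numpy as np
import pandas as pd

# Hypothetical series with nulls at the start, in the middle, and at the end
s = pd.Series([np.nan, np.nan, 5.0, np.nan, 7.0, np.nan])

forward = s.ffill()        # leading nulls remain
backward = s.bfill()       # trailing null remains
both = s.ffill().bfill()   # forward fill first, then backfill what is left

print(forward.isnull().sum())   # 2
print(backward.isnull().sum())  # 1
print(both.tolist())            # [5.0, 5.0, 5.0, 5.0, 7.0, 7.0]
```

Whether chaining is appropriate depends on the data: both methods assume that neighboring rows are related, which usually only holds for ordered data such as time series.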

<br>
Each call above returns a new Series and leaves `bank_detail` unchanged. To keep the replacements, assign the result back to the original column:

In [None]:
# Save the replacements for the null values: fill with zero
bank_detail['ESTIMATED LOSS'] = bank_detail['ESTIMATED LOSS'].fillna(0)

<br>

In [None]:
# Check bank_detail for null values again
bank_detail.isnull().sum()

CERT               0
FIN               13
CHARTER            0
ESTIMATED LOSS     0
ASSETS             0
DEPOSITS           0
RESOLUTION         0
dtype: int64
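The assignment step matters because `fillna()` returns a new object rather than modifying the original. A minimal sketch on a hypothetical dataframe makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"loss": [np.nan, 250.0]})

# The result is computed but discarded, so toy still contains the null
toy["loss"].fillna(0)
nulls_before = toy["loss"].isnull().sum()

# Assigning the result back stores the filled values in the dataframe
toy["loss"] = toy["loss"].fillna(0)
nulls_after = toy["loss"].isnull().sum()

print(nulls_before, nulls_after)  # 1 0
```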

<br><br> 
### Removing missing values
---



In [None]:
# Drop every row that still has a missing value
# (only FIN has nulls left; those rows are not bank failures)
bank_detail = bank_detail.dropna()

<br>

In [None]:
# Check bank_detail for null values again
bank_detail.isnull().sum()

CERT              0
FIN               0
CHARTER           0
ESTIMATED LOSS    0
ASSETS            0
DEPOSITS          0
RESOLUTION        0
dtype: int64
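By default `dropna()` removes a row if any column is null, which can discard more data than intended. The sketch below uses a hypothetical dataframe to compare the default with the `subset` and `thresh` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in two columns
toy = pd.DataFrame({
    "cert": [1, 2, 3, 4],
    "fin": [10.0, np.nan, 30.0, np.nan],
    "loss": [np.nan, 250.0, 100.0, np.nan],
})

# Default: drop rows with a null in any column
any_null = toy.dropna()

# subset: only consider nulls in the listed columns
fin_only = toy.dropna(subset=["fin"])

# thresh: keep rows that have at least this many non-null values
at_least_two = toy.dropna(thresh=2)

print(len(any_null), len(fin_only), len(at_least_two))  # 1 2 3
```

Checking the row count before and after dropping is a quick way to confirm that only the intended rows were removed.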

<br><br>
**QUICK CHALLENGE #2:**

**Task: Write code which replaces the null age values with the *median* age in the `bx_users` dataframe**


In [None]:
# Your code for quick challenge #2 here:
bx_users['age'].fillna(bx_users['age'].median())

0         32.0
1         18.0
2         32.0
3         17.0
4         32.0
          ... 
278853    32.0
278854    50.0
278855    32.0
278856    32.0
278857    32.0
Name: age, Length: 278858, dtype: float64