## Handling Missing Data

This notebook walks you through methods for detecting, removing, and replacing missing data. It covers the following topics:

1. How pandas handles missing values - NaN, None and typecasting.
2. How to detect missing values.
3. How to fill/drop missing values.

In [None]:
import numpy as np
import pandas as pd
import random
import string

## Ways to represent Missing Values

1. Masked Boolean arrays: a separate Boolean array marks which entries are missing. This adds storage and computation overhead.
2. Sentinel values: a data-specific convention marks a missing entry, such as -99999, a reserved bit pattern, or the IEEE floating-point value NaN (Not a Number).

Of the two approaches, sentinel values are the more commonly used, and NaN is the usual sentinel for numeric data.
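
To make the trade-off concrete, here is a minimal sketch that stores the same data both ways. It assumes NumPy's `np.ma` module as the masked-array implementation. The placeholder `0` under the mask is an arbitrary choice for illustration.

```python
import numpy as np

# Approach 1: masked array. The data keeps its integer dtype, and a
# separate Boolean mask flags the missing entry (the 0 is a placeholder).
masked = np.ma.masked_array([10, 20, 30, 0, 40],
                            mask=[False, False, False, True, False])
masked_mean = masked.mean()        # the masked entry is ignored

# Approach 2: NaN sentinel. The missing entry lives inside the data itself,
# which forces the whole array to a floating-point dtype.
with_nan = np.array([10, 20, 30, np.nan, 40])
nan_mean = np.nanmean(with_nan)    # NaN-aware mean skips the sentinel

print(masked.dtype, masked_mean)
print(with_nan.dtype, nan_mean)
```

Both approaches give the same mean here. The difference is where the "missing" information is stored and what that does to the array's dtype.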

In [None]:
##an array that contains NaN is given a floating-point dtype
values = np.array([10, 20, 30, np.nan, 40])
values.dtype

In [None]:
##arithmetic with NaN: the result is always NaN
print(10 + np.nan)
print(0 * np.nan)


> NaN is specifically a floating-point value; there is no equivalent NaN value for integer, string, or Boolean data.
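
The consequences of NaN being a float can be sketched as follows. The variable names are illustrative. The example shows that NaN propagates through ordinary aggregations, never compares equal to itself, and cannot be stored in an integer array.

```python
import numpy as np

values = np.array([10, 20, 30, np.nan, 40])

# NaN propagates through ordinary aggregations...
plain_sum = values.sum()           # nan
# ...while the NaN-aware variants skip it.
safe_sum = np.nansum(values)       # 100.0

# NaN never compares equal to anything, including itself,
# so use np.isnan rather than == to test for it.
is_equal = (np.nan == np.nan)      # False
missing_mask = np.isnan(values)

# An integer array has no way to hold NaN: the assignment raises ValueError.
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    int_assignment_failed = False
except ValueError:
    int_assignment_failed = True

print(plain_sum, safe_sum, is_equal, int_assignment_failed)
```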


### Missing data in Pandas

This section shows how pandas interprets the two missing-value markers, `None` and `NaN`.

In [None]:
##creating a sample DataFrame that contains both None and np.nan
sample_dict = {'Scores': [10, None, 20, 30, np.nan, 33]}
df = pd.DataFrame(sample_dict)
df.head()

In [None]:
##the integer values were upcast to float64 because the column holds missing values
df.dtypes 
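
If the upcast to float is unwanted, one option is pandas' nullable integer dtype, requested with the string `"Int64"`. This is an addition to the approach shown above, not something the original cells use. A minimal sketch comparing the two behaviours:

```python
import numpy as np
import pandas as pd

# Default behaviour: None and np.nan both become NaN,
# and the column is upcast to float64.
default_scores = pd.Series([10, None, 20, 30, np.nan, 33])

# Opt-in nullable integer dtype: the values stay integers
# and the missing entries are stored as pd.NA.
nullable_scores = pd.Series([10, None, 20, 30, None, 33], dtype="Int64")

print(default_scores.dtype)
print(nullable_scores.dtype)
print(nullable_scores.isna().sum())
```

Either way, `isna()` and `isnull()` report the same entries as missing, so the detection methods in the next section work unchanged.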

### Detecting Null Values

In [None]:
##reading advertising revenue data from a file
df = pd.read_csv('../data/advertising.csv')
df.head()

In [None]:
##detecting null values: returns a Boolean DataFrame of the same shape
print(df.isnull())

##counting null values per column
print(df.isnull().sum())

In [None]:
##notnull() is the inverse of isnull(): True wherever a value is present
df.notnull()
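
The detection calls above can be combined into a quick missing-data summary. The sketch below uses a small hypothetical DataFrame, because the columns of `advertising.csv` are not shown here. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the advertising data (illustrative names and values).
toy = pd.DataFrame({
    "TV": [230.1, np.nan, 17.2, 151.5],
    "Radio": [37.8, 39.3, np.nan, np.nan],
    "Sales": [22.1, 10.4, 9.3, 18.5],
})

null_counts = toy.isnull().sum()                   # missing values per column
null_share = toy.isnull().mean()                   # fraction missing per column
rows_with_nulls = toy[toy.isnull().any(axis=1)]    # rows with at least one null

print(null_counts)
print(null_share)
print(rows_with_nulls)
```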

### Dropping Missing Values

In [None]:
##dropping all the rows with a null value
df.dropna()

In [None]:
##dropping the entire column if it contains null values
df.dropna(axis=1)

In [None]:
##dropping only the columns in which every value is null

##adding a column with all null values
df['Null Values'] = np.nan
print(df.head())

df.dropna(axis="columns", how='all')
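
`dropna` also accepts `subset` and `thresh` for finer control than all-or-nothing dropping. The sketch below uses a small hypothetical DataFrame with made-up column names. It also shows that `dropna` returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data (illustrative names and values).
toy = pd.DataFrame({
    "TV": [230.1, np.nan, 17.2, np.nan],
    "Radio": [37.8, 39.3, np.nan, np.nan],
    "Sales": [22.1, 10.4, 9.3, 18.5],
})

# Drop a row only when a specific column is null.
tv_present = toy.dropna(subset=["TV"])

# Keep rows that have at least two non-null values.
at_least_two = toy.dropna(thresh=2)

# The original frame is unchanged; assign the result to keep it.
print(len(toy), len(tv_present), len(at_least_two))
```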

### Filling Null Values

In [None]:
##filling null values with zeros
df.fillna(0)

In [None]:
##forward fill - propagating the previous valid value forward
##(fillna(method='ffill') is deprecated in recent pandas versions)
print(df.head())
df.ffill()

In [None]:
##backward fill - propagating the next valid value backwards
##(fillna(method='bfill') is deprecated in recent pandas versions)
print(df.head())
df.bfill()

In [None]:
##imputing missing values with each numeric column's mean
df.fillna(df.mean(numeric_only=True))

In [None]:
##imputing missing values with each numeric column's median
df.fillna(df.median(numeric_only=True))
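
To see how the fill strategies differ, it helps to run them on data small enough to check by hand. The sketch below uses a hypothetical Series and DataFrame with made-up values. The `limit` argument, which caps how many consecutive gaps are filled, is an addition not used in the cells above.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()            # carry the last valid value forward
backward = s.bfill()           # pull the next valid value backwards
limited = s.ffill(limit=1)     # fill at most one consecutive gap
mean_filled = s.fillna(s.mean())

# With mixed column types, restrict the statistic to numeric columns;
# non-numeric columns are left as they are.
toy = pd.DataFrame({"Spend": [1.0, np.nan, 3.0],
                    "Channel": ["tv", None, "radio"]})
filled = toy.fillna(toy.mean(numeric_only=True))

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
print(mean_filled.tolist())
print(filled)
```

Backward fill leaves the trailing gap empty because no later value exists to pull from. Forward fill behaves the same way for a leading gap.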