## Handling Missing Data

This notebook walks you through methods for detecting, removing, and replacing missing data. It covers the following topics:

1. How pandas handles missing values - NaN, None and typecasting.
2. How to detect missing values.
3. How to fill/drop missing values.

In [1]:
import numpy as np
import pandas as pd
import random
import string

## Ways to represent Missing Values

1. Masked Boolean arrays: a separate Boolean array flags which entries are missing. This adds storage and computation overhead.
2. Sentinel values: a data-specific convention marks a missing entry, such as -99999, a reserved bit pattern, or the IEEE floating-point value NaN (Not a Number).

Of the two approaches, sentinel values are the more common, and NaN is the usual choice of sentinel for numeric data.
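To make the two representations concrete, here is a minimal sketch that applies both to the same small array. The sentinel `-99999` and the variable names are illustrative assumptions, not part of any dataset used in this notebook.

```python
import numpy as np

# Hypothetical raw readings where -99999 marks a missing entry
raw = np.array([10.0, 20.0, -99999.0, 40.0])

# Option 1: keep the data as-is and carry a separate Boolean mask
mask = raw == -99999.0
masked = np.ma.masked_array(raw, mask=mask)
masked_mean = masked.mean()  # the masked entry is excluded

# Option 2: replace the sentinel with NaN and use NaN-aware functions
with_nan = np.where(mask, np.nan, raw)
nan_mean = np.nanmean(with_nan)  # NaN entries are skipped

print(masked_mean, nan_mean)
```

Both routes give the same mean here. The mask costs an extra array, while the NaN route requires remembering to use NaN-aware functions such as `np.nanmean`.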

In [2]:
## an array that contains NaN is stored with a floating-point dtype
values = np.array([10, 20, 30, np.nan, 40])
values.dtype

dtype('float64')

In [3]:
## any arithmetic with NaN yields NaN (the cell displays only the last expression)
10 + np.nan
0 * np.nan


nan

> NaN is specifically a floating-point value; there is no equivalent NaN value for integers, Booleans, or strings.
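The following sketch illustrates what this means in practice with NumPy: an integer array refuses NaN, NaN propagates through ordinary aggregations, and NaN never compares equal to itself. The variable names are only for illustration.

```python
import numpy as np

# An integer array has no slot for NaN, so the assignment fails
ints = np.array([10, 20, 30])
try:
    ints[0] = np.nan
    assigned = True
except ValueError:
    assigned = False

floats = np.array([10, 20, 30, np.nan, 40])
plain_sum = floats.sum()       # NaN propagates through the sum
safe_sum = np.nansum(floats)   # NaN-aware variant skips the missing entry

# NaN is not equal to anything, including itself, so test with np.isnan
nan_equals_itself = np.nan == np.nan
detected = np.isnan(floats)

print(assigned, plain_sum, safe_sum, nan_equals_itself)
```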


### Missing data in Pandas

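pandas treats `None` and `NaN` interchangeably as missing values in numeric data: `None` is converted to `NaN`, and the column is upcast to a floating-point dtype where needed.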

In [5]:
##creating a sample dataframe with None and nan objects
sample_dict = {'Scores': [10, None, 20, 30, np.nan, 33]}

##create a dataframe
df = pd.DataFrame(sample_dict)
df.head()

Unnamed: 0,Scores
0,10.0
1,
2,20.0
3,30.0
4,


In [6]:
##check data type
df.dtypes

Scores    float64
dtype: object
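The upcast is easier to see side by side. The sketch below compares an integer Series with and without a missing entry. It also shows the opt-in nullable `Int64` extension dtype, which is mentioned here as an aside and is not used elsewhere in this notebook.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # no missing values: stays integer
with_none = pd.Series([1, None, 3])  # None becomes NaN, column upcast to float

# Optional: the nullable extension dtype keeps integers and stores pd.NA
nullable = pd.Series([1, None, 3], dtype='Int64')

print(ints.dtype, with_none.dtype, nullable.dtype)
print(with_none.isna().tolist())
```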

### Detecting Null Values

In [7]:
##reading advertising revenue data from a file
df = pd.read_csv('../data/advertising.csv')
df.head()

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,,,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,,
4,180.8,,58.4,17.9


In [9]:
##detecting null values
print(df.isnull())
## summing up null values
print(df.isnull().sum())

        TV  Radio  Newspaper  Sales
0    False  False      False  False
1    False   True       True  False
2    False  False      False  False
3    False  False       True   True
4    False   True      False  False
..     ...    ...        ...    ...
195  False  False      False  False
196  False  False      False  False
197  False  False      False  False
198  False  False      False  False
199  False  False      False  False

[200 rows x 4 columns]
TV           0
Radio        6
Newspaper    5
Sales        3
dtype: int64
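Beyond raw counts, it is often useful to see the share of missing values per column and the rows that contain gaps. The sketch below uses a small stand-in frame built from the five rows displayed above. The name `ads` is an assumption, chosen to avoid overwriting `df`.

```python
import numpy as np
import pandas as pd

# Stand-in for the first five rows of the advertising data shown above
ads = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2, 151.5, 180.8],
    'Radio': [37.8, np.nan, 45.9, 41.3, np.nan],
    'Newspaper': [69.2, np.nan, 69.3, np.nan, 58.4],
    'Sales': [22.1, 10.4, 12.0, np.nan, 17.9],
})

missing_share = ads.isnull().mean()              # fraction missing per column
rows_with_gaps = ads[ads.isnull().any(axis=1)]   # rows with at least one NaN
non_null_counts = ads.notnull().sum()            # complement of isnull().sum()

print(missing_share)
print(rows_with_gaps)
```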


In [10]:
## when many values are missing, it helps to look at the complete rows only; dropna() returns just those rows
df.dropna()

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
2,17.2,45.9,69.3,12.0
5,8.7,48.9,75.0,7.2
7,120.2,19.6,11.6,13.2
9,199.8,2.6,21.2,15.6
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5


### Dropping Missing Values

In [None]:
## drop every row that contains at least one null value
df.dropna()

In [11]:
##dropping the entire column if it contains null values
df.dropna(axis=1)

Unnamed: 0,TV
0,230.1
1,44.5
2,17.2
3,151.5
4,180.8
...,...
195,38.2
196,94.2
197,177.0
198,283.6


In [12]:
## drop only the columns in which every value is null

##adding a column with all null values
df['Null Values'] = np.nan
print(df.head())

df.dropna(axis="columns", how='all')

      TV  Radio  Newspaper  Sales  Null Values
0  230.1   37.8       69.2   22.1          NaN
1   44.5    NaN        NaN   10.4          NaN
2   17.2   45.9       69.3   12.0          NaN
3  151.5   41.3        NaN    NaN          NaN
4  180.8    NaN       58.4   17.9          NaN


Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,,,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,,
4,180.8,,58.4,17.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
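`dropna` also accepts `subset` and `thresh`, which give finer control than dropping on any null. The sketch below is a minimal illustration on a stand-in frame built from the five rows shown earlier. The name `ads` is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in for the first five rows of the advertising data
ads = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2, 151.5, 180.8],
    'Radio': [37.8, np.nan, 45.9, 41.3, np.nan],
    'Newspaper': [69.2, np.nan, 69.3, np.nan, 58.4],
    'Sales': [22.1, 10.4, 12.0, np.nan, 17.9],
})

# Drop rows only when the target column is missing
has_sales = ads.dropna(subset=['Sales'])

# Keep rows that have at least three non-null values
mostly_complete = ads.dropna(thresh=3)

# Keep columns that have at least four non-null values
dense_columns = ads.dropna(axis=1, thresh=4)

# dropna returns a new frame; the original is left unchanged
print(len(ads), len(has_sales), len(mostly_complete), list(dense_columns.columns))
```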


### Filling Null Values

In [13]:
##filling null values with zeros
df.fillna(0)

Unnamed: 0,TV,Radio,Newspaper,Sales,Null Values
0,230.1,37.8,69.2,22.1,0.0
1,44.5,0.0,0.0,10.4,0.0
2,17.2,45.9,69.3,12.0,0.0
3,151.5,41.3,0.0,0.0,0.0
4,180.8,0.0,58.4,17.9,0.0
...,...,...,...,...,...
195,38.2,3.7,13.8,7.6,0.0
196,94.2,4.9,8.1,14.0,0.0
197,177.0,9.3,6.4,14.8,0.0
198,283.6,42.0,66.2,25.5,0.0
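Filling with zeros is simple, but it changes summary statistics because the zeros are counted as real observations. A minimal sketch, using the five Radio values displayed earlier as an illustrative Series:

```python
import numpy as np
import pandas as pd

radio = pd.Series([37.8, np.nan, 45.9, 41.3, np.nan])

mean_ignoring_gaps = radio.mean()              # NaN is skipped by default
mean_after_zero_fill = radio.fillna(0).mean()  # zeros pull the mean down

print(mean_ignoring_gaps, mean_after_zero_fill)
```

Zero is a reasonable fill value only when it is a meaningful reading for the column, for example when a missing entry really means no spend.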


In [14]:
## forward fill: propagate the last valid value forward into the gaps that follow it
print(df.head())

df.ffill()  # same result as fillna(method='ffill'), which is deprecated since pandas 2.1

      TV  Radio  Newspaper  Sales  Null Values
0  230.1   37.8       69.2   22.1          NaN
1   44.5    NaN        NaN   10.4          NaN
2   17.2   45.9       69.3   12.0          NaN
3  151.5   41.3        NaN    NaN          NaN
4  180.8    NaN       58.4   17.9          NaN


Unnamed: 0,TV,Radio,Newspaper,Sales,Null Values
0,230.1,37.8,69.2,22.1,
1,44.5,37.8,69.2,10.4,
2,17.2,45.9,69.3,12.0,
3,151.5,41.3,69.3,12.0,
4,180.8,41.3,58.4,17.9,
...,...,...,...,...,...
195,38.2,3.7,13.8,7.6,
196,94.2,4.9,8.1,14.0,
197,177.0,9.3,6.4,14.8,
198,283.6,42.0,66.2,25.5,
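Forward fill has two behaviors worth checking on a small example: it cannot fill a gap at the very start, and `limit` caps how far a value is carried. The sketch below also shows `bfill`, the backward counterpart. The Series values are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 37.8, np.nan, np.nan, 41.3])

forward = s.ffill()          # leading NaN stays: nothing precedes it
limited = s.ffill(limit=1)   # carry each value forward by at most one row
backward = s.bfill()         # pull the next valid value backward instead

print(forward.tolist())
print(limited.tolist())
print(backward.tolist())
```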


In [16]:
## impute missing values with each column's mean; the all-null column has no mean, so it stays NaN
print(df.head())

df.fillna(df.mean())

      TV  Radio  Newspaper  Sales  Null Values
0  230.1   37.8       69.2   22.1          NaN
1   44.5    NaN        NaN   10.4          NaN
2   17.2   45.9       69.3   12.0          NaN
3  151.5   41.3        NaN    NaN          NaN
4  180.8    NaN       58.4   17.9          NaN


Unnamed: 0,TV,Radio,Newspaper,Sales,Null Values
0,230.1,37.800000,69.200000,22.100000,
1,44.5,23.043814,30.310769,10.400000,
2,17.2,45.900000,69.300000,12.000000,
3,151.5,41.300000,30.310769,15.139594,
4,180.8,23.043814,58.400000,17.900000,
...,...,...,...,...,...
195,38.2,3.700000,13.800000,7.600000,
196,94.2,4.900000,8.100000,14.000000,
197,177.0,9.300000,6.400000,14.800000,
198,283.6,42.000000,66.200000,25.500000,


In [17]:
## impute missing values with each column's median
df.fillna(df.median())

Unnamed: 0,TV,Radio,Newspaper,Sales,Null Values
0,230.1,37.8,69.2,22.1,
1,44.5,22.0,25.6,10.4,
2,17.2,45.9,69.3,12.0,
3,151.5,41.3,25.6,16.0,
4,180.8,22.0,58.4,17.9,
...,...,...,...,...,...
195,38.2,3.7,13.8,7.6,
196,94.2,4.9,8.1,14.0,
197,177.0,9.3,6.4,14.8,
198,283.6,42.0,66.2,25.5,
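In practice a frame may hold non-numeric columns, and different columns may call for different statistics. The sketch below shows one way to handle both, assuming a hypothetical `Channel` text column that is not part of the advertising data. It uses `numeric_only=True` to restrict the means to numeric columns and passes a dict to `fillna` to choose a statistic per column.

```python
import numpy as np
import pandas as pd

ads = pd.DataFrame({
    'Radio': [37.8, np.nan, 45.9, 41.3, np.nan],
    'Sales': [22.1, 10.4, 12.0, np.nan, 17.9],
    'Channel': ['web', 'tv', None, 'tv', 'web'],  # hypothetical text column
    'Null Values': np.nan,                        # column with no valid values
})

# Means for numeric columns only; the text column is left out
means = ads.mean(numeric_only=True)
filled = ads.fillna(means)

# A different statistic per column, chosen explicitly
per_column = ads.fillna({
    'Radio': ads['Radio'].median(),
    'Sales': ads['Sales'].mean(),
})

print(filled)
print(per_column)
```

The all-null column stays NaN in both results because it has no mean or median to fill with, and the text column is untouched.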
