Title: Working with Missing Values
Slug: pandas/working-with-missing-values
Category: Pandas
Tags: DataFrame, dropna
Date: 2017-12-05
Modified: 2017-12-05

#### Import libraries

In [1]:
import numpy as np
import pandas as pd

#### Create data

In [2]:
index = ['Theresa', 'David', 'Gordon', 'Tony', 'John']
data = {
    'colour': [None, 'Blue', 'Red', 'Red', 'Blue'],
    'score1': [None, 5, 5, None, 5],
    'score2': [None, 3, 7, None, 7],
    'score3': [None, 5, 6, 9, None]
}

df = pd.DataFrame(data=data, index=index)
df

Unnamed: 0,colour,score1,score2,score3
Theresa,,,,
David,Blue,5.0,3.0,5.0
Gordon,Red,5.0,7.0,6.0
Tony,Red,,,9.0
John,Blue,5.0,7.0,


#### Finding missing data and filtering

In [3]:
df.isnull()

Unnamed: 0,colour,score1,score2,score3
Theresa,True,True,True,True
David,False,False,False,False
Gordon,False,False,False,False
Tony,False,True,True,False
John,False,False,False,True


In [4]:
df.loc[df["colour"].notnull()]

Unnamed: 0,colour,score1,score2,score3
David,Blue,5.0,3.0,5.0
Gordon,Red,5.0,7.0,6.0
Tony,Red,,,9.0
John,Blue,5.0,7.0,
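The boolean frame returned by `isnull()` can also be summarised rather than read cell by cell. The following is a minimal sketch, rebuilding the same `df` as above so it runs on its own; the variable names `missing_per_column` and `rows_with_missing` are only illustrative.

```python
import pandas as pd

index = ['Theresa', 'David', 'Gordon', 'Tony', 'John']
data = {
    'colour': [None, 'Blue', 'Red', 'Red', 'Blue'],
    'score1': [None, 5, 5, None, 5],
    'score2': [None, 3, 7, None, 7],
    'score3': [None, 5, 6, 9, None]
}
df = pd.DataFrame(data=data, index=index)

# True counts as 1 when summed, so this gives the number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Index labels of rows that have at least one missing value
rows_with_missing = df.index[df.isnull().any(axis=1)].tolist()
print(rows_with_missing)
```
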


#### Filling missing data

In [5]:
df.fillna("missing")

Unnamed: 0,colour,score1,score2,score3
Theresa,missing,missing,missing,missing
David,Blue,5,3,5
Gordon,Red,5,7,6
Tony,Red,missing,missing,9
John,Blue,5,7,missing


In [6]:
# fillna(method="ffill") is deprecated in recent pandas versions; ffill() is the equivalent call
df.ffill()

Unnamed: 0,colour,score1,score2,score3
Theresa,,,,
David,Blue,5.0,3.0,5.0
Gordon,Red,5.0,7.0,6.0
Tony,Red,5.0,7.0,9.0
John,Blue,5.0,7.0,9.0


In [7]:
filler = df[["score1", "score2", "score3"]].mean()
df.fillna(filler)

Unnamed: 0,colour,score1,score2,score3
Theresa,,5.0,5.666667,6.666667
David,Blue,5.0,3.0,5.0
Gordon,Red,5.0,7.0,6.0
Tony,Red,5.0,5.666667,9.0
John,Blue,5.0,7.0,6.666667
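`fillna` also accepts a dictionary, which allows a different fill value for each column. The sketch below rebuilds the same `df` so it is self-contained; the fill values `"Unknown"` and `0` are arbitrary choices for illustration, not recommendations.

```python
import pandas as pd

index = ['Theresa', 'David', 'Gordon', 'Tony', 'John']
data = {
    'colour': [None, 'Blue', 'Red', 'Red', 'Blue'],
    'score1': [None, 5, 5, None, 5],
    'score2': [None, 3, 7, None, 7],
    'score3': [None, 5, 6, 9, None]
}
df = pd.DataFrame(data=data, index=index)

# Keys are column names, values are the fill value for that column.
# Columns not listed (score2, score3) keep their missing values.
filled = df.fillna({"colour": "Unknown", "score1": 0})
print(filled)

# fillna returns a new DataFrame; the original still has its missing values
print(df["score1"].isnull().sum())
```
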


#### Dropping missing data

In [8]:
df.dropna(how="any")

Unnamed: 0,colour,score1,score2,score3
David,Blue,5.0,3.0,5.0
Gordon,Red,5.0,7.0,6.0


In [9]:
df.dropna(how="all")

Unnamed: 0,colour,score1,score2,score3
David,Blue,5.0,3.0,5.0
Gordon,Red,5.0,7.0,6.0
Tony,Red,,,9.0
John,Blue,5.0,7.0,
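Between the two extremes of `how="any"` and `how="all"`, `dropna` can be restricted to particular columns with `subset`, or told how many non-missing values a row needs with `thresh`. A minimal sketch, again rebuilding `df` so it runs on its own (the choice of `score3` and a threshold of 3 is only for illustration):

```python
import pandas as pd

index = ['Theresa', 'David', 'Gordon', 'Tony', 'John']
data = {
    'colour': [None, 'Blue', 'Red', 'Red', 'Blue'],
    'score1': [None, 5, 5, None, 5],
    'score2': [None, 3, 7, None, 7],
    'score3': [None, 5, 6, 9, None]
}
df = pd.DataFrame(data=data, index=index)

# Drop rows only when score3 is missing; other columns are ignored
has_score3 = df.dropna(subset=["score3"])
print(has_score3)

# Keep rows that have at least 3 non-missing values
at_least_three = df.dropna(thresh=3)
print(at_least_three)
```
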


#### Including missing data when counting values

In [10]:
df["score1"].value_counts(dropna=False)

 5.0    3
NaN     2
Name: score1, dtype: int64

#### Equality of missing data

When working with missing data, you will see `NaN` fairly often. This value comes from the NumPy library and is not the same as Python's built-in `None`: `NaN` is a float, whereas `None` is the only instance of `NoneType`, and the two do not compare equal.

In [11]:
type(None)

NoneType

In [12]:
type(np.nan)

float

In [13]:
None == np.nan

False
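A related point worth knowing: `NaN` does not compare equal to itself either, so `==` cannot be used to find missing values. The sketch below shows this alongside `pd.isna`, which treats both `None` and `NaN` as missing; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)   # False

# pd.isna recognises both None and NaN as missing
none_is_missing = pd.isna(None)
nan_is_missing = pd.isna(np.nan)
print(none_is_missing, nan_is_missing)   # True True

# In a numeric Series, None is converted to NaN on construction,
# so both entries are flagged the same way
s = pd.Series([1.0, None, np.nan])
missing_mask = s.isna()
print(missing_mask.tolist())   # [False, True, True]
```
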