Real-world data is rarely clean and homogeneous, and different data sources may indicate missing data in different ways. There are two general strategies for marking missing data:

 - Using a mask that globally indicates missing values. This adds a lot of overhead, since it requires allocating a second Boolean array for the mask
 - Choosing a sentinel value that represents a missing entry in the dataset. This reduces the range of valid values that can be represented and may require extra logic in CPU/GPU arithmetic
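The two strategies can be compared with a minimal sketch. It uses NumPy's masked arrays for the mask approach and NaN for the sentinel approach; the array contents and variable names are made up for illustration and are not part of the original notes.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Mask approach: a separate Boolean array flags the missing entries.
mask = np.array([False, True, False, False])
masked = np.ma.masked_array(values, mask=mask)
masked_sum = masked.sum()  # masked entries are skipped

# Sentinel approach: NaN is stored in place of the missing value.
with_sentinel = np.array([1.0, np.nan, 3.0, 4.0])
naive_sum = with_sentinel.sum()           # NaN propagates through arithmetic
nan_aware_sum = np.nansum(with_sentinel)  # NaN-aware aggregation ignores it

print(masked_sum, naive_sum, nan_aware_sum)
```

The mask costs an extra array of the same length, while the sentinel needs no extra storage but requires NaN-aware functions to get a meaningful aggregate.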

Pandas uses the "Not a Number" (NaN) sentinel value to represent missing data. The Python None object is also treated as a null value and is converted to NaN in numeric data.

pd.isnull() and pd.notnull() can be used to detect the presence of null data; each returns a Boolean mask over the data.
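As a small illustration of these masks (the Series contents here are an assumed example, not taken from the notes), the Boolean result can be used to count missing entries or to index the data directly:

```python
import numpy as np
import pandas as pd

# Hypothetical example data mixing valid values with NaN and None.
data = pd.Series([1, np.nan, 'hello', None])

null_mask = pd.isnull(data)        # True where the value is missing
present = data[pd.notnull(data)]   # Boolean mask used as an index
n_missing = int(null_mask.sum())   # True counts as 1 when summed

print(null_mask)
print(present)
print(n_missing)
```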

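dropna() is a method of Series and DataFrame objects that removes null values. On a DataFrame it drops, by default, every row that contains at least one null value.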

In [3]:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [4]:
df.dropna()

     0    1  2
1  2.0  3.0  5
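Dropping whole rows can discard a lot of valid data. The following sketch shows a few other dropna() options (axis, how, and thresh), rebuilding the same DataFrame so that it runs on its own; the extra all-null column 3 is added purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Drop columns (instead of rows) that contain any null value.
cols = df.dropna(axis='columns')

# Add a column made up entirely of nulls (illustrative only).
df[3] = np.nan

# how='all' drops only the columns in which every value is null.
all_null_dropped = df.dropna(axis='columns', how='all')

# thresh=3 keeps only the rows with at least 3 non-null values.
at_least_three = df.dropna(axis='rows', thresh=3)

print(cols)
print(all_null_dropped)
print(at_least_three)
```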


fillna() is a method of Series and DataFrame objects that returns a copy of the data with null values replaced by a specified value.

In [5]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [6]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
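A constant is not always the best replacement. As a hedged sketch of some alternatives, the same Series can be filled from neighbouring values with ffill() and bfill(), or with a statistic such as the mean; which choice is appropriate depends on the data, and the variable names below are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Forward-fill: propagate the previous valid value forward.
forward = data.ffill()

# Back-fill: propagate the next valid value backward.
backward = data.bfill()

# Fill with the mean of the non-null values (mean() skips NaN by default).
mean_filled = data.fillna(data.mean())

print(forward)
print(backward)
print(mean_filled)
```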