# Filtering Out Missing Data

There are several ways to filter out missing data. You can always do it by hand, but dropna is usually more convenient. On a Series, it returns a new Series containing only the non-null values and their index labels:

In [1]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from numpy import nan as na

In [2]:
data = Series([1, na, 2, 5, na, 7])

data

0    1.0
1    NaN
2    2.0
3    5.0
4    NaN
5    7.0
dtype: float64

In [3]:
data.dropna()

0    1.0
2    2.0
3    5.0
5    7.0
dtype: float64

Naturally, you could have computed this yourself by boolean indexing:

In [8]:
data[data.notna()]

0    1.0
2    2.0
3    5.0
5    7.0
dtype: float64
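
The equivalence between dropna and boolean indexing can be checked directly. The following is a minimal sketch that rebuilds the same Series; the variable names `dropped` and `masked` are illustrative only and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Same values as the Series used above.
data = pd.Series([1, np.nan, 2, 5, np.nan, 7])

dropped = data.dropna()        # built-in filter
masked = data[data.notna()]    # manual boolean mask

# Both approaches keep the same values and the same index labels.
print(dropped.equals(masked))

# dropna returns a new object; the original Series still has all six entries.
print(len(data), len(dropped))
```

Note that the surviving index labels (0, 2, 3, 5) are not renumbered, so positions and labels no longer coincide after filtering.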

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NA. By default, dropna drops any row containing a missing value:

In [11]:
data = DataFrame([[1., 6.5, 3.], [1., na, na],
                  [na, na, na], [na, 6.5, 3.]])

data

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


In [15]:
data.dropna()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |


Passing *how='all'* drops only the rows in which every value is NA:

In [18]:
data.dropna(how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
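
To make the difference between the default behaviour and how='all' concrete, here is a small sketch that expresses both as boolean masks. The names `frame`, `manual_any`, and `manual_all` are illustrative assumptions, and the masks are one way to reproduce the result, not a description of how pandas implements dropna internally.

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame used above.
frame = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

any_dropped = frame.dropna()            # default: how='any'
all_dropped = frame.dropna(how='all')

# Equivalent masks: keep a row if all values are present (how='any')
# or if at least one value is present (how='all').
manual_any = frame[frame.notna().all(axis=1)]
manual_all = frame[frame.notna().any(axis=1)]

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
```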


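Dropping columns works the same way; pass axis=1. No column in this frame is entirely NA, so nothing is removed here:

In [24]:
data.dropna(axis=1, how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |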
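
Since none of the columns in the example frame is entirely NA, the effect of axis=1 is easier to see with a column that is. The sketch below adds one; the extra column labelled 4 is a hypothetical addition for illustration and is not part of the original data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Hypothetical extra column in which every value is missing.
frame[4] = np.nan

# axis=1 applies the test to columns instead of rows.
cleaned = frame.dropna(axis=1, how='all')

print(frame.columns.tolist())    # includes the all-NA column 4
print(cleaned.columns.tolist())  # column 4 is gone; all rows remain
```

All four rows survive, including the all-NA row 2, because with axis=1 only columns are tested.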


A related way to filter out DataFrame rows tends to come up with time series data. Suppose you want to keep only the rows that contain at least a certain number of non-NA observations. You can specify this minimum with the thresh argument:

In [25]:
df = DataFrame(np.random.randn(7, 3))

df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.027164 | 1.291704 | -0.556247 |
| 1 | 1.654347 | 0.145568 | -1.290081 |
| 2 | 1.276144 | 0.777206 | 0.359808 |
| 3 | -0.646338 | -0.341887 | 0.332793 |
| 4 | 0.132837 | 0.915649 | 0.456672 |
| 5 | 0.78255 | -0.935399 | 0.059388 |
| 6 | -2.233196 | 0.556606 | 0.213768 |


In [40]:
df.iloc[:5, 1] = na
df.iloc[:2, 2] = na
df

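|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.027164 | NaN | NaN |
| 1 | 1.654347 | NaN | NaN |
| 2 | 1.276144 | NaN | 0.359808 |
| 3 | -0.646338 | NaN | 0.332793 |
| 4 | 0.132837 | NaN | 0.456672 |
| 5 | 0.78255 | -0.935399 | 0.059388 |
| 6 | -2.233196 | 0.556606 | 0.213768 |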


In [41]:
df.dropna(thresh=3)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 5 | 0.78255 | -0.935399 | 0.059388 |
| 6 | -2.233196 | 0.556606 | 0.213768 |
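
The thresh argument can be read as "keep a row only if it has at least this many non-NA values". The sketch below counts the non-NA values per row and compares a manual filter with dropna. The seeded generator and the names `non_na` and `manual` are assumptions made for reproducibility and illustration, so the random values differ from those shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:5, 1] = np.nan
df.iloc[:2, 2] = np.nan

# Number of non-NA values in each row.
non_na = df.notna().sum(axis=1)
print(non_na.tolist())

# thresh=3 keeps rows with at least three non-NA values.
kept = df.dropna(thresh=3)
manual = df[non_na >= 3]
print(kept.index.tolist())

# Lowering the threshold to 2 also keeps rows 2 through 4.
print(df.dropna(thresh=2).index.tolist())
```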
