In [4]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

Missing data is common in most data analysis applications. One of the goals in designing pandas was to make working with missing data as painless as possible.

pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating point and non-floating point arrays. It serves simply as a sentinel that is easy to detect:

In [5]:
string_data = Series(['kaggle', 'Raimondo', 'Elon Musk', 'artichoke', np.nan, 'Naruto'])

In [7]:
string_data

In [9]:
string_data.isnull()

In object arrays, the built-in Python None value is also handled as NA:

In [10]:
string_data[0] = None

In [11]:
string_data.isnull()

***NA handling methods***

**dropna:** Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.

**fillna:** Fill in missing data with a given value or with an interpolation method such as 'ffill' or 'bfill'.

**isnull:** Return an object of the same type containing boolean values that indicate which values are missing / NA.

**notnull:** Negation of isnull.
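
As a quick illustration of the four methods listed above, here is a minimal sketch on a small made-up Series. The name `s` and its values are assumptions chosen for this example and do not come from the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one NaN and one None, both treated as NA
s = pd.Series(['a', np.nan, None, 'd'])

mask = s.isnull()        # True where a value is missing
present = s.notnull()    # the element-wise negation of isnull
kept = s.dropna()        # only the non-missing entries, original labels kept
filled = s.fillna('?')   # missing entries replaced by a constant

print(mask.tolist())        # [False, True, True, False]
print(kept.index.tolist())  # [0, 3]
print(filled.tolist())      # ['a', '?', '?', 'd']
```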

# Filtering Out Missing Data

There are several ways to filter out missing data. Doing it by hand is always an option, but *dropna* is usually more convenient. Applied to a Series, it returns a new Series containing only the non-null values and their index labels:

In [12]:
from numpy import nan as NA

In [13]:
data = Series([6, NA, 8, NA, 12, NA])

In [14]:
# dropna works the same way on string_data, removing its NA entries
data.dropna()

Naturally, you could have calculated this yourself using boolean indexing:

In [16]:
data[data.notnull()]
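
To check that the two approaches agree, the following sketch compares them on the same values. The variable names `values`, `via_dropna` and `via_mask` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

values = pd.Series([6, np.nan, 8, np.nan, 12, np.nan])

via_dropna = values.dropna()
via_mask = values[values.notnull()]

# Both approaches keep the same values under the same index labels
print(via_dropna.equals(via_mask))  # True
print(via_dropna.index.tolist())    # [0, 2, 4]
```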

Things are a little more complicated with DataFrame objects. You may want to drop rows or columns that are entirely NA, or only those that contain at least one NA. By default, dropna drops any row that contains a missing value:

In [17]:
data = DataFrame([[6., 6.9, 4.], [2., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

In [20]:
data

In [18]:
cleaned = data.dropna()

In [21]:
cleaned

Passing how='all' will only drop rows that are all NA:

In [22]:
data.dropna(how='all')

Dropping columns in the same way is as simple as passing axis=1:

In [23]:
data[4] = NA

In [24]:
data

In [25]:
data.dropna(axis=1, how='all')
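
The three variants can be compared side by side by looking at which labels survive. This is a minimal sketch that rebuilds the same values under the assumed name `frame`, so it does not depend on the state of `data` above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[6., 6.9, 4.],
                      [2., np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.]])

any_rows = frame.dropna()            # default how='any': drop rows with at least one NA
all_rows = frame.dropna(how='all')   # drop only rows that are entirely NA

frame[4] = np.nan                    # add a column that is entirely NA
all_cols = frame.dropna(axis=1, how='all')  # drop only columns that are entirely NA

print(any_rows.index.tolist())    # [0]
print(all_rows.index.tolist())    # [0, 1, 3]
print(all_cols.columns.tolist())  # [0, 1, 2]
```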

A related way to filter out DataFrame rows comes up often with time series data. Suppose you want to keep only rows that contain at least a certain number of observations. You can express this with the thresh argument:

In [26]:
df = DataFrame(np.random.randn(7, 3))

In [27]:
df

In [29]:
df.iloc[:4, 1] = NA; df.iloc[:2, 2] = NA

In [30]:
df

In [36]:
# thresh=3: keep only rows with at least three non-NA values
df.dropna(thresh=3)
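
Because the frame above is random, the effect of thresh is easier to follow on fixed values. The sketch below uses a hypothetical frame (`frame`, with made-up columns `a`, `b`, `c`) whose rows contain 2, 0, 3 and 1 non-NA values respectively.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1., np.nan, 3., np.nan],
                      'b': [np.nan, np.nan, 6., 7.],
                      'c': [8., np.nan, 9., np.nan]})

# thresh=n keeps rows that have at least n non-NA values
print(frame.dropna(thresh=1).index.tolist())  # [0, 2, 3]
print(frame.dropna(thresh=2).index.tolist())  # [0, 2]
print(frame.dropna(thresh=3).index.tolist())  # [2]
```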

# Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in one of several ways. The fillna method is the workhorse for most of these tasks. When fillna is called with a constant, it replaces missing values with that value:

In [37]:
df.fillna(0)

Calling fillna with a dict lets you use a different fill value for each column:

In [40]:
df.fillna({1: 0.5, 2: -1})
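
The dict form is sketched below on a small fixed frame so the result can be read off directly. The frame and its column names `x`, `y`, `z` are assumptions made for this example. Columns that do not appear in the dict are left as they are.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1., np.nan],
                      'y': [np.nan, 2.],
                      'z': [np.nan, np.nan]})

# Keys are column labels, values are the fill value for that column
filled = frame.fillna({'x': 0.5, 'y': -1})

print(filled['x'].tolist())             # [1.0, 0.5]
print(filled['y'].tolist())             # [-1.0, 2.0]
print(filled['z'].isnull().tolist())    # [True, True]
```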

*fillna* returns a new object, but you can modify the existing object in place:

In [41]:
_ = df.fillna(0, inplace=True)

In [42]:
df
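
The difference between the two calling styles can be seen by counting the missing values before and after. This is a minimal sketch on a made-up frame; the names `frame`, `copy_filled`, `missing_before` and `missing_after` are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1., np.nan], 'b': [np.nan, 4.]})

# Without inplace, fillna returns a new object and leaves frame untouched
copy_filled = frame.fillna(0)
missing_before = int(frame.isnull().sum().sum())

# With inplace=True, frame itself is modified
frame.fillna(0, inplace=True)
missing_after = int(frame.isnull().sum().sum())

print(missing_before, missing_after)  # 2 0
```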

The same interpolation methods that are available for reindexing, such as forward filling, can also be used to fill missing data:

In [43]:
df = DataFrame(np.random.randn(6, 3))

In [44]:
df

In [45]:
df.iloc[2:, 1] = NA; df.iloc[4:, 2] = NA

In [46]:
df

In [47]:
# ffill() is equivalent to fillna(method='ffill'), which recent pandas releases deprecate
df.ffill()

In [52]:
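# limit=2: fill at most two consecutive NA values in each column
# (ffill() is equivalent to fillna(method='ffill'), which recent pandas releases deprecate)
df.ffill(limit=2)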
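
The effect of limit is easier to verify on fixed values. The sketch below uses a hypothetical frame and calls `ffill()`, which performs the same forward fill; recent pandas releases deprecate the `method` keyword of fillna in favor of it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1., np.nan, np.nan, np.nan],
                      'b': [2., 3., np.nan, np.nan]})

full = frame.ffill()            # carry the last valid value forward without limit
limited = frame.ffill(limit=2)  # fill at most two consecutive NAs per column

print(full['a'].tolist())               # [1.0, 1.0, 1.0, 1.0]
print(limited['a'].isnull().tolist())   # [False, False, False, True]
print(limited['b'].tolist())            # [2.0, 3.0, 3.0, 3.0]
```

Column `a` has three consecutive missing values, so with limit=2 the last one stays NA, while column `b` has only two and is filled completely.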

With a little creativity, you can do a lot of other things with fillna. For example, you might pass the mean or median value of a Series:

In [53]:
data = Series([2., NA, 4, NA, 9, NA, 3.])

In [54]:
data

In [55]:
data.fillna(data.mean())
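
For comparison, here is a small sketch that fills the same values once with the mean and once with the median. The names `obs`, `mean_filled` and `median_filled` are introduced for this example only.

```python
import numpy as np
import pandas as pd

obs = pd.Series([2., np.nan, 4., np.nan, 9., np.nan, 3.])

# mean() and median() skip NA values by default
mean_filled = obs.fillna(obs.mean())       # mean of 2, 4, 9, 3 is 4.5
median_filled = obs.fillna(obs.median())   # median of 2, 3, 4, 9 is 3.5

print(mean_filled.tolist())    # [2.0, 4.5, 4.0, 4.5, 9.0, 4.5, 3.0]
print(median_filled.tolist())  # [2.0, 3.5, 4.0, 3.5, 9.0, 3.5, 3.0]
```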

The same approach works for a DataFrame, where each column is filled with its own mean:

In [56]:
df = DataFrame(np.random.randn(7, 3))

In [57]:
df

In [58]:
df.iloc[2:, 1] = NA; df.iloc[4:, 2] = NA

In [59]:
df

In [60]:
df.fillna(df.mean())
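
Since `df.mean()` returns one value per column, passing it to fillna fills each column with its own mean. A minimal sketch on a fixed, made-up frame (names `frame` and `filled` are assumptions) shows this:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1., 3., np.nan],
                      'b': [np.nan, 10., 20.]})

# frame.mean() is a Series indexed by column label: a -> 2.0, b -> 15.0
filled = frame.fillna(frame.mean())

print(filled['a'].tolist())  # [1.0, 3.0, 2.0]
print(filled['b'].tolist())  # [15.0, 10.0, 20.0]
```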

***fillna function arguments***

**value:** Scalar value or dict-like object to use to fill missing values.

**method:** Interpolation method, such as 'ffill' or 'bfill'. Recent pandas releases deprecate this argument in favor of the separate ffill and bfill methods.

**axis:** Axis to fill on, default axis=0.

**inplace:** Modify the calling object in place instead of returning a new object.

**limit:** For forward and backward filling, maximum number of consecutive periods to fill.