# Missing Data

In [2]:
import numpy as np
import pandas as pd

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. In a survey, for example, a Salary field that is empty, holds the number 0, or holds an invalid value (such as a string) can all be considered "missing data". These ideas are loosely related to the values that Python considers "falsy":

In [3]:
falsy_values = (0, False, None, '', [], {})

Python treats every value above as "falsy", so any() over the whole tuple returns False:

In [4]:
any(falsy_values)

False
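As a small sketch of the same idea, each value can be checked on its own with bool(). The last lines use float('nan'), the standard floating-point "not a number" value, to illustrate that falsy and missing are not the same thing; the variable names are only illustrative.

```python
falsy_values = (0, False, None, '', [], {})

# Evaluate each value on its own in a boolean context
truthiness = [bool(value) for value in falsy_values]
print(truthiness)     # [False, False, False, False, False, False]

# Falsy and missing are different ideas: a NaN float is truthy
nan_is_truthy = bool(float('nan'))
print(nan_is_truthy)  # True
```

This is why a plain truthiness test is a poor way to detect missing values: it flags a legitimate 0 and lets NaN through.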

NumPy has a special value for missing numeric data, np.nan. NaN stands for "Not a Number":

In [5]:
np.nan

nan
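A few properties of np.nan are worth checking directly. The sketch below shows that it is an ordinary Python float, that it does not compare equal to itself, and that np.isnan is the reliable way to test for it; the variable names are illustrative.

```python
import numpy as np

# np.nan is a regular Python float
nan_is_float = isinstance(np.nan, float)
print(nan_is_float)       # True

# NaN never compares equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)  # False

# Use np.isnan to detect it instead of ==
detected = bool(np.isnan(np.nan))
print(detected)           # True
```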

The np.nan value behaves like a virus: any arithmetic operation that involves it returns np.nan:

In [6]:
3 + np.nan

nan

In [7]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [8]:
a.sum()

nan

In [9]:
a.mean()

nan

This is still better than using regular None values, which would have raised an exception in the previous examples:

In [10]:
3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

When an array is created with a float dtype, any None value is converted to np.nan:

In [11]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')

In [12]:
a

array([ 1.,  2.,  3., nan, nan,  4.])
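The conversion depends on the explicit float dtype. As a sketch of what happens without it, the example below builds an array containing None with no dtype argument; the names mixed and sum_failed are illustrative.

```python
import numpy as np

# No dtype given: NumPy falls back to a generic object array
mixed = np.array([1, 2, 3, None])
print(mixed.dtype)   # object

# Arithmetic then goes through plain Python addition and fails on None
try:
    mixed.sum()
    sum_failed = False
except TypeError:
    sum_failed = True
print(sum_failed)    # True
```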

As we said, np.nan is like a virus. If an array contains any nan value, aggregations over it also return nan, which may not be what you expect:

In [13]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [14]:
a.mean()

nan

In [15]:
a.sum()

nan
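NumPy also provides NaN-aware aggregations that skip missing entries. A minimal sketch using np.nansum and np.nanmean on the same array, with np.isnan used to count the missing values:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# Aggregations that ignore NaN entries
total = np.nansum(a)
average = np.nanmean(a)
print(total)          # 10.0
print(average)        # 2.5

# Count how many entries are missing
missing_count = int(np.isnan(a).sum())
print(missing_count)  # 2
```

Note that the mean is taken over the four valid values only, not over all six positions.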

NumPy also has a value for infinity, np.inf. Like np.nan, it is a float value rather than a separate type:

In [16]:
np.inf

inf

It also behaves like a virus, and undefined operations such as dividing infinity by infinity produce nan:

In [18]:
3 + np.inf

inf

In [19]:
np.inf / 3

inf

In [20]:
np.inf / np.inf

nan
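A few more cases follow the same floating-point rules. This sketch shows two other indeterminate forms that yield nan, plus negative infinity and the np.isinf check; the variable names are illustrative.

```python
import numpy as np

# Indeterminate forms produce nan
difference = np.inf - np.inf
product = np.inf * 0
print(difference)  # nan
print(product)     # nan

# Negative infinity is a distinct value
negative = -np.inf
print(negative)    # -inf

# np.isinf detects both positive and negative infinity
both_infinite = bool(np.isinf(np.inf)) and bool(np.isinf(negative))
print(both_infinite)  # True
```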

In [23]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)
b

array([ 1.,  2.,  3., inf, nan,  4.])

In [24]:
b.sum()

nan
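One way to get a usable result from such an array is to keep only the finite entries before aggregating. A minimal sketch using np.isfinite, which is False for both nan and inf; the names finite_mask and finite_values are illustrative.

```python
import numpy as np

b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)

# True only where the value is neither nan nor +/-inf
finite_mask = np.isfinite(b)
print(finite_mask)          # [ True  True  True False False  True]

# Boolean indexing keeps the finite values
finite_values = b[finite_mask]
print(finite_values)        # [1. 2. 3. 4.]
print(finite_values.sum())  # 10.0
```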