# Missing Data

In [3]:
import numpy as np
import pandas as pd

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a Salary field that is empty, contains the number 0, or holds an invalid value (a string, for example) can be considered "missing data". These concepts are related to the values that Python considers "falsy":

In [4]:
falsy_values = (0, False, None, '', [], {})

Python treats every value above as "falsy", so any() finds nothing truthy among them:



In [5]:
any(falsy_values)

False
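Falsy and missing are not the same thing, though: a 0 in a Salary field may be a real answer. As a minimal sketch, the hypothetical helper `is_missing` below (not part of the original notebook) assumes that only `None` and empty strings count as missing. The right definition depends on the dataset.

```python
# Hypothetical helper for illustration: what counts as "missing" is a
# domain decision, so here only None and empty strings qualify.
def is_missing(value):
    return value is None or value == ''

falsy_values = (0, False, None, '', [], {})

# Every value in the tuple is falsy...
all_falsy = all(not v for v in falsy_values)

# ...but only some of them are "missing" under this definition.
missing_flags = [is_missing(v) for v in falsy_values]

print(all_falsy)
print(missing_flags)
```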

NumPy has a special "nullable" value for numbers, np.nan. NaN stands for "Not a Number", and np.nan itself is a floating-point value:

In [6]:
np.nan

nan

The np.nan value is kind of a virus. Everything that it touches becomes np.nan:

In [7]:
3 + np.nan

nan

In [8]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [9]:
a.sum()

nan

In [10]:
a.mean()

nan

In [12]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')

In [13]:
a

array([ 1.,  2.,  3., nan, nan,  4.])
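The explicit dtype matters in the cell above. As a small sketch of the difference (the variable names are only illustrative): with a float dtype, None is converted to nan, while without it NumPy falls back to a generic object array.

```python
import numpy as np

# With an explicit float dtype, None is converted to nan.
as_float = np.array([1, 2, None], dtype=float)

# Without it, NumPy cannot pick a numeric type and uses object instead.
as_object = np.array([1, 2, None])

print(as_float.dtype, as_object.dtype)
```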

As we said, np.nan is like a virus. If an array contains any nan value, an aggregation over it, such as the mean or the sum, also returns nan:

In [15]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [28]:
a.mean()

nan

In [29]:
a.sum()

nan
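If the goal is to ignore the nan entries rather than propagate them, NumPy provides nan-aware reductions. The sketch below uses np.nansum and np.nanmean on the same array. Whether skipping missing values is appropriate depends on the analysis.

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# nan-aware reductions skip the nan entries instead of propagating them.
total = np.nansum(a)      # adds up only 1, 2, 3 and 4
average = np.nanmean(a)   # mean over the four non-nan values

print(total, average)
```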

NumPy also has a value that represents infinity, np.inf (like np.nan, it is a floating-point value):

In [30]:
np.inf

inf

It also behaves like a virus:

In [31]:
3 + np.inf

inf

In [ ]:
np.inf / 3

In [32]:
np.inf / np.inf

nan
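Infinity stays infinite under ordinary arithmetic, but indeterminate forms such as the division above collapse to nan. A short sketch of a few more cases (the variable names are only illustrative):

```python
import numpy as np

# Sign matters: there is a negative infinity as well.
neg = -np.inf

# Indeterminate forms have no meaningful result, so they produce nan.
difference = np.inf - np.inf
product = np.inf * 0

print(neg, difference, product)
```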

In [33]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)


In [34]:
b.sum()

nan

# Checking for nan or inf

There are two functions, np.isnan and np.isinf, that perform these checks:



In [35]:
np.isnan(np.nan)

True

In [36]:
np.isinf(np.inf)

True
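A dedicated function is needed because nan never compares equal to anything, not even to itself, so an equality test silently finds nothing. A minimal sketch of the pitfall:

```python
import numpy as np

# nan compares unequal to everything, including itself.
equal_to_itself = np.nan == np.nan

a = np.array([1, 2, np.nan])

# An equality test therefore misses the nan entry...
found_by_equality = (a == np.nan).any()

# ...while np.isnan detects it.
found_by_isnan = np.isnan(a).any()

print(equal_to_itself, found_by_equality, found_by_isnan)
```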


Both checks can be combined with np.isfinite, which returns True only for values that are neither nan nor infinite:

In [37]:
np.isfinite(np.nan), np.isfinite(np.inf)

(False, False)

In [38]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False,  True, False, False])

In [39]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False, False,  True, False])
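Because these functions return boolean arrays, they can be used to count or filter the problematic entries. The sketch below assumes that dropping nan and inf is acceptable for the task at hand, and the variable names are only illustrative.

```python
import numpy as np

b = np.array([1, 2, 3, np.nan, np.inf, 4])

# Summing a boolean mask counts its True entries.
n_nan = np.isnan(b).sum()
n_inf = np.isinf(b).sum()

# Boolean indexing with np.isfinite keeps only the ordinary numbers.
clean = b[np.isfinite(b)]
clean_sum = clean.sum()

print(n_nan, n_inf, clean, clean_sum)
```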