In [1]:
import numpy as np
import pandas as pd

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a _`Salary`_ field that is empty, holds the number 0, or holds an invalid value (a string, for example) can be considered "missing data". These concepts are related to the values that Python considers "falsy".

In [2]:
falsy_values = (0, False, None, '', [], {})

For Python, all the values above are considered "falsy", so `any` finds nothing truthy among them:

In [3]:
any(falsy_values)

False
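As a small illustrative sketch (the variable names `truthiness` and `nan_is_truthy` are only for this example), the truthiness of each element can be checked one by one. It also shows a detail worth keeping in mind: `np.nan` is a non-zero float, so Python treats it as truthy, which means a plain truthiness test will not detect it.

```python
import numpy as np

falsy_values = (0, False, None, '', [], {})

# bool() converts each value to its truth value; all of these are False.
truthiness = [bool(value) for value in falsy_values]

# NaN is a float that is not zero, so Python considers it truthy.
nan_is_truthy = bool(np.nan)

print(truthiness)
print(nan_is_truthy)
```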

NumPy has a special "nullable" value for numbers: `np.nan`, short for _NaN_ ("Not a Number"). The `np.nan` value behaves like a virus: any arithmetic operation it takes part in results in `np.nan`.

In [4]:
np.nan

nan

In [5]:
3 + np.nan

nan

In [6]:
a = np.array([1, 2, np.nan, 7])

In [7]:
a.sum()

nan

In [8]:
a.mean()

nan

`np.nan` is more convenient than a regular `None` value, which would have raised an exception in the previous examples.

In [9]:
3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In a numeric array created with a float dtype, `None` is replaced by `np.nan`.

In [10]:
a = np.array([1, 2, np.nan, None, 7], dtype='float')

In [11]:
a

array([ 1.,  2., nan, nan,  7.])
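The explicit dtype matters for this conversion. The sketch below (the names `as_object` and `as_float` are illustrative) compares the same list with and without `dtype=float`: without it, NumPy falls back to a generic object array and `None` stays as `None`.

```python
import numpy as np

# Without an explicit dtype, None forces a generic object array.
as_object = np.array([1, 2, None, 7])

# With dtype=float, None is converted to nan.
as_float = np.array([1, 2, None, 7], dtype=float)

print(as_object.dtype)
print(as_float.dtype)
print(as_float)
```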

NumPy also supports an "infinite" value, `np.inf`, which also behaves like a virus in most operations.

In [12]:
np.inf

inf

In [13]:
3 + np.inf

inf

In [14]:
np.inf / 3

inf

In [15]:
np.inf / np.inf

nan

In [16]:
b = np.array([2, 4, np.inf, np.nan, 7], dtype=float)  # the np.float alias was removed in NumPy 1.24

In [17]:
b.sum()

nan

### **Checking for `nan` or `inf`**

Two functions, `np.isnan` and `np.isinf`, check for NaN and infinite values respectively.

In [18]:
np.isnan(np.nan)

True

In [19]:
np.isinf(np.inf)

True

In [20]:
np.isnan(np.array([1, 4, np.nan, np.inf, 7]))

array([False, False,  True, False, False])

In [21]:
np.isinf(np.array([1, 4, np.nan, np.inf, 7]))

array([False, False, False,  True, False])
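A dedicated function is needed for NaN because NaN compares unequal to everything, including itself. The following sketch (with the illustrative names `equality_mask` and `isnan_mask`) contrasts an equality test with `np.isnan`.

```python
import numpy as np

values = np.array([1, 4, np.nan, np.inf, 7])

# Equality never matches NaN, because NaN is not equal to itself.
equality_mask = values == np.nan

# np.isnan is the reliable way to locate NaN entries.
isnan_mask = np.isnan(values)

print(equality_mask)
print(isnan_mask)
```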

Both checks can be combined with `np.isfinite`, which returns `True` only for values that are neither `nan` nor infinite.

In [22]:
np.isfinite(np.nan)

False

In [23]:
np.isfinite(np.inf)

False

In [24]:
np.isfinite(np.array([3, 4, np.inf, np.nan, 9]))

array([ True,  True, False, False,  True])
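To make the relationship explicit, here is a minimal sketch that builds the same mask by hand from `np.isnan` and `np.isinf` and compares it with `np.isfinite`. The array and the names `combined` and `finite` are made up for illustration; a negative infinity is included to show that it is covered too.

```python
import numpy as np

values = np.array([3, 4, np.inf, -np.inf, np.nan, 9])

# A value is finite when it is neither NaN nor (positive or negative) infinity.
combined = ~(np.isnan(values) | np.isinf(values))
finite = np.isfinite(values)

print(combined)
print(np.array_equal(combined, finite))
```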

Note: infinite values are not that common in practice.

### **Filtering nan values**

Whenever you perform an operation on a NumPy array that might contain missing values, you need to filter them out first to avoid `nan` propagation. We'll combine the previous `np.isnan` with boolean array indexing for this purpose.

In [25]:
a = np.array([1, 4, np.nan, np.nan, 7])

In [26]:
a[~np.isnan(a)]

array([1., 4., 7.])

Because `a` contains no infinite values, `np.isfinite` selects the same elements here:

In [27]:
a[np.isfinite(a)]

array([1., 4., 7.])
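The two filters only agree when the array has no infinite values. A short sketch with a hypothetical array `c` that contains both `nan` and `inf` shows where they differ:

```python
import numpy as np

c = np.array([1, 4, np.nan, np.inf, 7])

# Dropping only NaN keeps the infinite value.
without_nan = c[~np.isnan(c)]

# Keeping only finite values drops both NaN and inf.
only_finite = c[np.isfinite(c)]

print(without_nan)
print(only_finite)
```

Which filter is appropriate depends on whether infinite values are meaningful in your data.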

In [28]:
a[np.isfinite(a)].sum()

12.0

In [29]:
a[np.isfinite(a)].mean()

4.0
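As an alternative to filtering by hand, NumPy also provides NaN-aware aggregations such as `np.nansum` and `np.nanmean`, which skip `nan` entries. This is a brief sketch on the same data (the names `total` and `average` are illustrative); note that these functions ignore only `nan`, not infinite values.

```python
import numpy as np

a = np.array([1, 4, np.nan, np.nan, 7])

# NaN entries are skipped instead of propagating into the result.
total = np.nansum(a)
average = np.nanmean(a)

print(total)    # 12.0
print(average)  # 4.0
```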