![rmotr](https://i.imgur.com/jiPp4hj.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39117173-a433bf6a-46e6-11e8-8a40-b4d4d6422493.jpg"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Missing Data

![separator2](https://i.imgur.com/4gX5WFr.png)

## Hands on!

In [1]:
import numpy as np
import pandas as pd

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a _`Salary`_ field with an empty value, the number 0, or an invalid value (a string, for example) can be considered "missing data". These concepts are related to the values that Python considers "falsy":

In [None]:
falsy_values = (0, False, None, '', [], {})

Python treats every value above as "falsy", so `any()` returns `False`:

In [None]:
any(falsy_values)
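
"Falsy" and "missing" are related but not identical. As a small sketch (the `candidates` name is only illustrative), the snippet below checks the truthiness of each value individually and adds `np.nan` to the list. Note that `np.nan` is truthy, so truthiness alone cannot detect it:

```python
import numpy as np

# Illustrative tuple: the falsy values from above, plus np.nan
candidates = (0, False, None, '', [], {}, np.nan)

# bool() reports the truthiness Python assigns to each value
truthiness = [bool(value) for value in candidates]
print(truthiness)  # [False, False, False, False, False, False, True]
```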

NumPy has a special "nullable" value for numbers: `np.nan`, which stands for _NaN_, "Not a Number".

In [None]:
np.nan

The `np.nan` value is kind of a virus. Everything that it touches becomes `np.nan`:

In [None]:
3 + np.nan

In [None]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [None]:
a.sum()

In [None]:
a.mean()

This is better than regular `None` values, which in the previous examples would have raised an exception:

In [None]:
3 + None

In an array with a float dtype, the `None` value is converted to `np.nan`:

In [None]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')

In [None]:
a
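
The explicit float dtype matters for this conversion. As a minimal sketch of the difference (variable names are illustrative), without a dtype NumPy falls back to a generic `object` array that keeps `None` as-is, and numeric operations on it fail:

```python
import numpy as np

with_dtype = np.array([1, 2, None], dtype='float')
without_dtype = np.array([1, 2, None])

print(with_dtype.dtype)     # float64
print(without_dtype.dtype)  # object

# Summing the object array ends up adding an int and None
try:
    without_dtype.sum()
    sum_failed = False
except TypeError as error:
    sum_failed = True
    print('sum() failed:', error)
```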

As we said, `np.nan` is like a virus. If an array contains any `nan` value, aggregations such as `mean` and `sum` return `nan` as well:

In [None]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [None]:
a.mean()

In [None]:
a.sum()

NumPy also supports an "infinite" value, `np.inf`:

In [None]:
np.inf

It also behaves like a virus:

In [None]:
3 + np.inf

In [None]:
np.inf / 3

In [None]:
np.inf / np.inf

In [None]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)

In [None]:
b.sum()
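
Since the cells above show no output, here is a short sketch of what to expect. Arithmetic with `np.inf` generally stays infinite, but indeterminate forms such as `inf / inf` or `inf - inf` collapse to `nan` (the `inf - inf` case is an extra example, not taken from the cells above):

```python
import numpy as np

plus = 3 + np.inf          # inf
divided = np.inf / 3       # inf
ratio = np.inf / np.inf    # nan
diff = np.inf - np.inf     # nan
print(plus, divided, ratio, diff)

# When an array holds both inf and nan, the sum is nan
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)
total = b.sum()
print(total)               # nan
```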

![separator1](https://i.imgur.com/ZUWYTii.png)

### Checking for `nan` or `inf`

Two functions, `np.isnan` and `np.isinf`, perform these checks:

In [None]:
np.isnan(np.nan)

In [None]:
np.isinf(np.inf)

`np.isfinite` combines both checks: it returns `True` only for values that are neither `nan` nor infinite.

In [None]:
np.isfinite(np.nan), np.isfinite(np.inf)

All three functions also accept arrays as input and return boolean arrays as results:

In [None]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

In [None]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

In [None]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))
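
A dedicated function is needed because `nan` cannot be found with a regular equality comparison: under the IEEE 754 floating-point rules that NumPy follows, `nan` is not equal to anything, including itself. A minimal sketch, with illustrative variable names:

```python
import numpy as np

values = np.array([1, 2, np.nan])

# Comparing against nan with == is always False, even for the nan entry
equality_check = values == np.nan
isnan_check = np.isnan(values)

print(np.nan == np.nan)  # False
print(equality_check)    # [False False False]
print(isnan_check)       # [False False  True]
```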

_Note: It's not so common to find infinite values. From now on, we'll keep working with only `np.nan`_

![separator1](https://i.imgur.com/ZUWYTii.png)

### Filtering them out

Whenever you perform an operation on a NumPy array that might contain missing values, you'll need to filter them out first to avoid `nan` propagation. We'll combine the previous `np.isnan` function with boolean indexing for this purpose:

In [None]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [None]:
a[~np.isnan(a)]

Because this array has no infinite values, that is equivalent to:

In [None]:
a[np.isfinite(a)]
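
The two filters only agree when the array has no infinite values. As a sketch with a hypothetical array `c` that contains both `nan` and `inf`:

```python
import numpy as np

c = np.array([1, 2, np.inf, np.nan, 4])

not_nan = c[~np.isnan(c)]        # drops nan, keeps inf
finite_only = c[np.isfinite(c)]  # drops both nan and inf

print(not_nan)      # [ 1.  2. inf  4.]
print(finite_only)  # [1. 2. 4.]
```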

With the filtered array, all the operations can now be performed:

In [None]:
a[np.isfinite(a)].sum()

In [None]:
a[np.isfinite(a)].mean()
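
As an alternative sketch, NumPy also provides NaN-aware aggregations such as `np.nansum` and `np.nanmean`, which skip `nan` values without an explicit filtering step:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

total = np.nansum(a)     # sums only the non-nan values
average = np.nanmean(a)  # averages only the non-nan values

print(total)    # 10.0
print(average)  # 2.5
```

Both results match what the manual filter with `~np.isnan(a)` produces for this array.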

![separator2](https://i.imgur.com/4gX5WFr.png)