<a href="https://colab.research.google.com/github/CodeVerse-team/python-for-data-analysis-learning-libraries/blob/main/Missing%20Data%20(Data%20Cleaning%20-%201).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning


---

## Missing Data

In [2]:
import numpy as np
import pandas as pd

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a `Salary` field with an empty value, the number 0, or an invalid value (a string, for example) can be considered "missing data". These concepts are related to the values that Python considers "falsy":

In [3]:
falsy_values = (0, False, None, '', [], {})

For Python, all the values above are considered "falsy", so `any()` returns `False`:

In [4]:
any(falsy_values)

False
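To see this value by value, each element can be passed to `bool()`. The sketch below also shows that a truthiness check alone will not catch the `np.nan` value introduced next, because NaN is a non-zero float. The names `truthiness` and `nan_is_truthy` are illustrative only.

```python
import numpy as np

falsy_values = (0, False, None, '', [], {})

# bool() applies the same truthiness test that `if` and any() use
truthiness = [bool(value) for value in falsy_values]
print(truthiness)

# NaN is a non-zero float, so Python treats it as truthy
nan_is_truthy = bool(np.nan)
print(nan_is_truthy)
```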

NumPy has a special value for missing numeric data, `np.nan`. NaN stands for "Not a Number":

In [5]:
np.nan

nan

The `np.nan` value is kind of a virus: everything it touches becomes `np.nan`:

In [6]:
3 + np.nan

nan

In [7]:
a = np.array([1,2,3,  np.nan, np.nan, 4])

In [8]:
a.sum()

np.float64(nan)

In [9]:
a.mean()

np.float64(nan)

This is better than regular `None` values, which would have raised a `TypeError` in the previous examples:

In [11]:
# 3 + None
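The cell above is commented out because running it would stop execution. A minimal sketch of what happens when it does run, wrapped in `try`/`except` so the rest of the notebook keeps going (the variable `error_message` is illustrative):

```python
try:
    3 + None
except TypeError as exc:
    # int and NoneType have no defined addition, so Python raises TypeError
    error_message = str(exc)
    print(error_message)
```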

When a numeric dtype such as `float` is requested, the `None` value is converted to `np.nan`:

In [12]:
a = np.array([1,2,3, np.nan, None, 4], dtype = 'float')

In [13]:
a

array([ 1.,  2.,  3., nan, nan,  4.])

As we said, `np.nan` is like a virus. If an array contains any `nan` value, reductions such as `mean` or `sum` return `nan` instead of a usable number:

In [14]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [15]:
a.mean()

np.float64(nan)

In [16]:
a.sum()

np.float64(nan)

NumPy also supports an infinity value, `np.inf`:

In [17]:
np.inf

inf

It also behaves like a virus in most operations, although an indeterminate form such as `inf / inf` yields `nan`:

In [18]:
3 + np.inf

inf

In [19]:
np.inf / 3

inf

In [21]:
np.inf / np.inf

nan

In [23]:
b = np.array([1,2,3, np.inf, np.nan, 4], dtype = float)

In [24]:
b.sum()

np.float64(nan)

## Checking for `nan` or `inf`
There are two functions, `np.isnan` and `np.isinf`, that perform the desired checks:

In [25]:
np.isnan(np.nan)

np.True_

In [26]:
np.isinf(np.inf)

np.True_

In [27]:
np.isinf(np.nan)

np.False_
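A dedicated function is needed because NaN never compares equal to anything, including itself, so a plain `==` test silently finds nothing. A short sketch, using a hypothetical array `values`:

```python
import numpy as np

# IEEE 754 defines NaN as unequal to everything, including itself
equal_to_itself = np.nan == np.nan
print(equal_to_itself)

values = np.array([1.0, np.nan, 3.0])

# An elementwise == comparison therefore never flags the NaN
eq_mask = values == np.nan
# np.isnan is the reliable check
isnan_mask = np.isnan(values)
print(eq_mask, isnan_mask)
```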

The joint check (neither `nan` nor `inf`) can be performed with `np.isfinite`:

In [28]:
np.isfinite(np.nan), np.isfinite(np.inf)

(np.False_, np.False_)

`np.isnan`, `np.isinf`, and `np.isfinite` also accept arrays as inputs and return boolean arrays as results:

In [29]:
np.isnan(np.array([1,2,3,np.nan, np.inf, 4]))

array([False, False, False,  True, False, False])

In [30]:
np.isinf(np.array([1,2,3, np.nan, np.inf, 4]))

array([False, False, False, False,  True, False])

In [31]:
np.isfinite(np.array([1,2,3, np.nan, np.inf]))

array([ True,  True,  True, False, False])
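One way to read `np.isfinite` is as the negation of "is NaN or is infinite". The sketch below builds that mask by hand and compares it with `np.isfinite`. The array `values` and the names `manual_mask` and `finite_mask` are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, np.inf, -np.inf])

# Finite means "neither NaN nor +/- infinity"
manual_mask = ~(np.isnan(values) | np.isinf(values))
finite_mask = np.isfinite(values)

print(np.array_equal(manual_mask, finite_mask))
```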

NOTE: Infinite values are not that common. From now on, we'll keep working with only `np.nan`.

## Filtering them out

Whenever you perform an operation on a NumPy array that might contain missing values, you need to filter them out first to avoid `nan` propagation. We'll combine the previous `np.isnan` with boolean indexing for this purpose:

In [32]:
a = np.array([1,2,3, np.nan, np.nan, 4])

In [33]:
a[~np.isnan(a)]

array([1., 2., 3., 4.])

Because `a` contains no infinite values, this is equivalent to:

In [34]:
a[np.isfinite(a)]

array([1., 2., 3., 4.])
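When an array does contain `inf`, the two filters give different results: `~np.isnan` keeps the infinite entry, while `np.isfinite` drops it. A small sketch with a hypothetical array `c`:

```python
import numpy as np

c = np.array([1, 2, np.inf, np.nan, 4])

without_nan = c[~np.isnan(c)]    # removes NaN only, keeps inf
only_finite = c[np.isfinite(c)]  # removes both NaN and inf
print(without_nan, only_finite)
```

Which one to use depends on whether an infinite value is meaningful in the data or should be treated as missing too.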

With that result, all the operations can now be performed:

In [36]:
a[np.isfinite(a)].sum()

np.float64(10.0)

In [37]:
a[np.isfinite(a)].mean()

np.float64(2.5)
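As an alternative to filtering by hand, NumPy also provides NaN-aware reductions such as `np.nansum` and `np.nanmean`. The cells above do not use them, so treat this as an optional shortcut rather than the approach this notebook follows. Note that they skip `nan` only, not `inf`.

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

total = np.nansum(a)     # treats NaN entries as zero
average = np.nanmean(a)  # averages only the non-NaN entries
print(total, average)
```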