## **Missing data**
---

### Hands on!!

In [2]:
# import the libraries used in this section
import numpy as np
import pandas as pd

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a Salary field that is empty, holds the number 0, or holds an invalid value (a string, for example) can be considered "missing data". These concepts are related to the values that Python considers "falsy":

In [3]:
falsy_values = (0, False, None, '', [], {})

All the above values are considered falsy

In [4]:
# any(iterable) returns:
# True  → if at least one element is truthy
# False → if all elements are falsy
any(falsy_values)

False
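As a small illustration (the names `truthiness` and `nan_is_truthy` are made up for this sketch), each value can also be checked on its own with `bool()`. Note that `np.nan` itself is not falsy, so truthiness alone is not a reliable test for missing data:

```python
import numpy as np

falsy_values = (0, False, None, '', [], {})

# bool() applies the same truthiness test that `if` and any() use
truthiness = [bool(v) for v in falsy_values]
print(truthiness)        # every entry is False

# np.nan is a non-zero float, so Python treats it as truthy
nan_is_truthy = bool(np.nan)
print(nan_is_truthy)     # True
```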

NumPy has a special "nullable" value for numbers: np.nan. NaN stands for "Not a Number":

In [5]:
np.nan

nan

The np.nan value is kind of a virus. Any arithmetic operation that touches it returns np.nan:

In [6]:
3 + np.nan

nan

### **Understanding NaN**

- It is used in Pandas and NumPy to indicate missing values.
- It is not equal to any number, including itself (NaN != NaN).
- NaN is a float type by default.
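The properties listed above can be checked directly. This is only a minimal sketch; the variable name `x` is arbitrary:

```python
import numpy as np

x = np.nan

print(type(x))        # <class 'float'>
print(x == x)         # False: NaN never compares equal, even to itself
print(x != x)         # True
print(np.isnan(x))    # True: the reliable way to test for NaN
```

Because `x == np.nan` is always `False`, an equality comparison can never be used to find missing values; use `np.isnan` instead.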

In [7]:
# creating a numpy array using nan values
a = np.array([1, 2, 3, np.nan, np.nan, 4])
a

array([ 1.,  2.,  3., nan, nan,  4.])

In [8]:
# now performing sum operation 
a.sum()

nan

As expected, any arithmetic that involves nan returns nan.

In [9]:
# now performing mean()
a.mean()

nan

This is better than regular None values, which in the previous examples would have raised an exception:

| Feature                  | None                                      | NaN (`numpy.nan`)                            |
|--------------------------|------------------------------------------|----------------------------------------------|
| **Meaning**              | Absence of a value (Python null value)   | "Not a Number" (used for missing numerical data) |
| **Type**                 | `NoneType`                               | `float` (`numpy.nan` is a float)            |
| **Use Case**             | General-purpose missing values in Python | Missing values in numerical computations    |
| **Comparison (`==`)**    | `None == None` → ✅ `True`               | `np.nan == np.nan` → ❌ `False`             |
| **Check for Missing Values** | `x is None`                        | `pd.isna(x)`, `np.isnan(x)`                 |


### 🔥 Quick recap

- ✅ None is Python’s null value, used for general missing values.
- ✅ NaN (numpy.nan) is used in Pandas/NumPy for missing numerical data.
- ✅ NaN != NaN, while None == None.
- ✅ None is converted to NaN in Pandas when used in a numeric column.

##### 🚀 TL;DR: Use None for general-purpose missing values in Python and NaN for missing numerical data in Pandas/NumPy! 😊
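To make the last recap point concrete, here is a short sketch of how pandas stores `None` in numeric data. The Series `s` is an invented example, not data from this notebook:

```python
import numpy as np
import pandas as pd

# None in otherwise numeric data is stored as NaN, and the dtype becomes float64
s = pd.Series([1, 2, None, 4])
print(s.dtype)             # float64
print(s.isna().tolist())   # [False, False, True, False]

# pd.isna recognises both kinds of missing marker
print(pd.isna(None), pd.isna(np.nan))   # True True
```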

In [10]:
# now checking None value
# 3 + None 
# this raises a TypeError because Python cannot add an int and None

🔹 Why?
- None is not a number → It represents absence of a value, not 0 or any numeric type.
- Python doesn’t know how to add an int and NoneType → Unlike NaN, which is a float, None has no mathematical meaning.
- Python is strongly typed → It doesn’t automatically convert None to 0 or any other number.
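The contrast between the two behaviours can be seen side by side. In this sketch the exception is caught so the cell keeps running; `error_message` and `result` are names chosen for the example:

```python
import numpy as np

# Adding None to a number fails loudly ...
try:
    3 + None
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# ... while adding NaN succeeds quietly and returns NaN
result = 3 + np.nan
print(result)   # nan
```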

##### In an array created with a float dtype, the None value is replaced by np.nan:

In [11]:
# build the array with an explicit float dtype so that None becomes nan
a = np.array([1,2,3,np.nan,None,4],dtype = "float")

In [12]:
a

array([ 1.,  2.,  3., nan, nan,  4.])
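The explicit dtype matters here. As a hedged illustration (the names `values`, `as_float` and `as_object` are invented for this sketch), leaving the dtype out makes NumPy fall back to a generic object array in which `None` is kept as it is:

```python
import numpy as np

values = [1, 2, 3, np.nan, None, 4]

# With an explicit float dtype, None is converted to nan
as_float = np.array(values, dtype=float)
print(as_float.dtype)          # float64

# Without it, NumPy builds an object array and None stays None
as_object = np.array(values)
print(as_object.dtype)         # object
print(as_object[4] is None)    # True
```

Numeric methods such as `sum()` are not reliable on an object array that still contains `None`, so it is usually worth setting the dtype up front.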

As we said, np.nan is like a virus. If an array contains any nan value, aggregations such as sum() and mean() return nan:

In [13]:
a = np.array([1,2,3,np.nan,np.nan,4])
a

array([ 1.,  2.,  3., nan, nan,  4.])

In [14]:
a.sum()

nan

In [15]:
a.mean()

nan

NumPy also provides an infinity value, np.inf, which (like np.nan) is a float:

In [16]:
np.inf

inf

It also behaves like a virus:

### 1️⃣ Understanding np.inf
- np.inf stands for positive infinity (∞).
- -np.inf stands for negative infinity (-∞).
- It behaves like a number larger than any finite value in computations.
- It typically appears as the result of floating-point division by zero or overflow in NumPy.
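Where infinity comes from in practice depends on whether plain Python or NumPy does the division. The following is a sketch under that assumption; `python_raised` and `ratios` are names chosen for the example:

```python
import numpy as np

# Plain Python refuses to divide by zero
try:
    1.0 / 0.0
    python_raised = False
except ZeroDivisionError:
    python_raised = True

# NumPy arrays follow IEEE 754 instead: the result is inf, -inf or nan.
# errstate silences the RuntimeWarning NumPy would otherwise emit.
with np.errstate(divide="ignore", invalid="ignore"):
    ratios = np.array([1.0, -1.0, 0.0]) / 0.0

print(python_raised)   # True
print(ratios)          # [ inf -inf  nan]
```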

In [17]:
# now performing operation using np.inf
3 + np.inf

inf

In [18]:
np.inf/3

inf

In [19]:
np.inf / np.inf

nan

#### 1️⃣ Basic Operations with `np.inf`

| **Operation**     | **Result** | **Explanation** |
|------------------|-----------|----------------|
| `np.inf + 10`   | `inf`     | Adding a finite number to infinity is still infinity. |
| `np.inf - 100`  | `inf`     | Subtracting a finite number doesn't change infinity. |
| `-np.inf + 50`  | `-inf`    | Negative infinity remains negative. |
| `np.inf * 2`    | `inf`     | Infinity times any positive number is still infinity. |
| `np.inf * -3`   | `-inf`    | Multiplying by a negative flips the sign. |
| `np.inf / 2`    | `inf`     | Infinity divided by any positive number is still infinity. |
| `np.inf / -1`   | `-inf`    | Dividing by a negative flips the sign. |
| `1 / np.inf`    | `0.0`     | Anything divided by infinity approaches zero. |

---

#### 2️⃣ Special Cases Leading to `NaN`

| **Operation**      | **Result** | **Why?** |
|-------------------|-----------|----------|
| `np.inf - np.inf` | `nan`     | **Undefined**: Infinity minus infinity is ambiguous. |
| `np.inf / np.inf` | `nan`     | **Undefined**: Infinity divided by infinity is ambiguous. |
| `np.inf * 0`      | `nan`     | **Undefined**: Infinity times zero is ambiguous. |
| `0 / np.inf`      | `0.0`     | Not a NaN case: zero divided by infinity is well-defined (approaches zero). |
| `np.nan + np.inf` | `nan`     | NaN propagates: it represents an unknown value, and infinity cannot resolve it. |
| `np.nan - np.inf` | `nan`     | Same as above (infinity combined with NaN is still unknown). |
| `np.nan * np.inf` | `nan`     | Multiplication with NaN results in NaN. |
| `np.nan / np.inf` | `nan`     | NaN divided by any number remains NaN. |
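A few rows from the tables above can be reproduced in a single cell. This is just a quick check; the list name `indeterminate` is made up for the sketch:

```python
import numpy as np

# Finite operands leave infinity unchanged (apart from its sign)
print(np.inf + 10, np.inf * -3, 1 / np.inf)    # inf -inf 0.0

# Ambiguous combinations of infinity produce nan
indeterminate = [np.inf - np.inf, np.inf / np.inf, np.inf * 0]
print(indeterminate)                            # [nan, nan, nan]

# nan wins over inf in every arithmetic operation
print(np.nan + np.inf, np.nan * np.inf)         # nan nan
```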


In [20]:
# now applying this into np array
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)
b

array([ 1.,  2.,  3., inf, nan,  4.])

In [21]:
b.sum()

nan

### **Checking for nan or inf**

There are two functions: **np.isnan** and **np.isinf** that will perform the desired checks:

1️⃣ np.isnan() – Check for NaN (Not a Number)  
✅ Used to detect missing or undefined numerical values (NaN).

2️⃣ np.isinf() – Check for Infinity (inf, -inf)  
✅ Used to detect positive or negative infinity (np.inf, -np.inf).

#### 🔹 NumPy Functions for Checking Special Values

| **Function**      | **Checks for**       | **Returns `True` for**             |
|------------------|---------------------|----------------------------------|
| `np.isnan(x)`   | `NaN` values         | `np.nan`                        |
| `np.isinf(x)`   | Infinite values      | `np.inf`, `-np.inf`             |
| `np.isfinite(x)` | Finite numbers      | Everything except `NaN`, `inf`, `-inf` |


In [22]:
np.isnan(np.nan)

True

In [24]:
np.isinf(np.inf)

True

Both checks can be combined in a single call with np.isfinite.

**np.isfinite()** checks whether each element in an array is a finite number, meaning it is not NaN, inf, or -inf.

In [25]:
np.isfinite(np.nan) , np.isfinite(np.inf)

(False, False)

**np.isnan** and **np.isinf** also take arrays as inputs, and return boolean arrays as results:

In [26]:
np.isnan([1,2,3,np.nan,np.inf,4])

array([False, False, False,  True, False, False])

In [27]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False, False,  True, False])

In [28]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([ True,  True,  True, False, False,  True])
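The relationship between the three functions can be written out explicitly. In this sketch (`values` and `combined` are illustrative names), a value is finite exactly when it is neither nan nor infinite:

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, np.inf, 4])

# Combine the two checks by hand, then negate the result
combined = ~(np.isnan(values) | np.isinf(values))

print(combined)
print(np.array_equal(combined, np.isfinite(values)))   # True
```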

**Note: It's not so common to find infinite values. From now on, we'll keep working with only np.nan**

### **Filtering them out**

Whenever you're trying to perform an operation with a Numpy array and you know there might be missing values, you'll need to filter them out before proceeding, to avoid nan propagation. We'll use a combination of the previous np.isnan + boolean arrays for this purpose:

In [29]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [30]:
a[~np.isnan(a)]

array([1., 2., 3., 4.])

For this array, which contains no infinite values, that is equivalent to:

In [32]:
a[np.isfinite(a)]

array([1., 2., 3., 4.])
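The two filters only differ when infinite values are present. A small sketch with a hypothetical array `c` that contains both nan and inf shows the difference:

```python
import numpy as np

c = np.array([1, 2, np.inf, np.nan, 4])

only_nan_removed = c[~np.isnan(c)]   # inf is kept
finite_only = c[np.isfinite(c)]      # inf and nan are both dropped

print(only_nan_removed)   # [ 1.  2. inf  4.]
print(finite_only)        # [1. 2. 4.]
```

Which filter to use depends on whether infinity should count as a valid value in your data.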

With that result, all the operations can now be performed:

In [33]:
# now we can perform operations using finite values of a
a[np.isfinite(a)].sum()

10.0

In [35]:
a[np.isfinite(a)].mean()

2.5
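As an alternative to filtering by hand, NumPy also ships NaN-aware aggregations. This technique is not used in the cells above and is shown only as an option:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# nansum / nanmean skip nan entries instead of propagating them
print(np.nansum(a))    # 10.0
print(np.nanmean(a))   # 2.5
```

These functions ignore only nan; an `inf` in the array would still end up in the result.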