
Real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.

Schemes for handling missing data generally follow one of two strategies: using a *mask* that globally indicates missing values, or choosing a *sentinel value* that indicates a missing entry.

In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.
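
As a minimal sketch of the separate-mask idea, the example below pairs a data array with a Boolean array and then uses NumPy's masked-array module to bundle the two. The names `values` and `is_missing` and the numbers themselves are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical measurements; the entry at index 1 is a placeholder, not real data.
values = np.array([1.0, 0.0, 3.0, 4.0])

# Separate Boolean mask: True marks a missing entry.
is_missing = np.array([False, True, False, False])

# Aggregate only over the entries that are present.
valid_sum = values[~is_missing].sum()

# NumPy's masked arrays carry the data and the mask together.
masked = np.ma.masked_array(values, mask=is_missing)
masked_sum = masked.sum()

print(valid_sum, masked_sum, masked.count())
```

The cost of this approach is the extra Boolean array, which has to be stored and kept in sync with the data.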

In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.
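
The sentinel idea can be sketched in a few lines. Here `-9999` stands in for a data-specific convention, and the `readings` array is made up for illustration. The sketch shows how an unhandled sentinel distorts an aggregate, and how converting it to NaN makes the gap explicit.

```python
import numpy as np

SENTINEL = -9999  # hypothetical data-specific marker for "missing"
readings = np.array([12, SENTINEL, 7, 3])

# If the sentinel is not handled, it leaks into the result.
naive_mean = readings.mean()

# Integer arrays cannot hold NaN, so convert to float first.
as_float = readings.astype(float)
as_float[readings == SENTINEL] = np.nan

# NaN-aware aggregation skips the missing entry.
clean_mean = np.nanmean(as_float)

print(naive_mean, clean_mean)
```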

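* The first sentinel value used by Pandas is None, a Python singleton object. Because None is a Python object, it can only appear in arrays with dtype=object: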


In [1]:
import numpy as np
import pandas as pd

In [2]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [3]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
70.2 ms ± 8.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
1.03 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



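Operations on an object array run at the Python level, which is why the object-dtype sum in the timing above is far slower than the native integer one. The use of Python objects also means that aggregations like sum() or min() across an array containing None will generally raise an error: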

In [4]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

* The other missing data representation is NaN (Not a Number)

In [5]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

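NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. Regardless of the operation, the result of arithmetic with NaN is another NaN, so aggregations over the array do not raise an error, but their results are not useful: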

In [6]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)
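
When the goal is to ignore the missing entries rather than propagate them, NumPy provides NaN-aware counterparts of these aggregations. A brief sketch using the same array contents as `vals2`:

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

# Arithmetic with NaN propagates NaN.
propagated = 1 + np.nan

# NaN-aware versions of sum, min, and max skip the missing value.
nan_total = np.nansum(vals2)
nan_low = np.nanmin(vals2)
nan_high = np.nanmax(vals2)

print(nan_total, nan_low, nan_high)  # 8.0 1.0 4.0
```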

* NaN and None in Pandas

NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [7]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, setting a value in an integer Series to None upcasts the Series to floating point:

In [8]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [9]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64
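
If the upcast to float is undesirable, one option in newer Pandas releases is the nullable integer extension dtype, requested with the string `"Int64"` (capital I). The sketch below assumes a Pandas version that ships this dtype (1.0 or later). It is an alternative to the default behavior shown above, not a description of it.

```python
import pandas as pd

# Nullable integer dtype: missing entries are stored as pd.NA
# and the remaining values stay integers.
nullable = pd.Series([None, 1], dtype="Int64")

print(nullable.dtype)            # Int64
print(nullable.isna().tolist())  # [True, False]
print(nullable.sum())            # NA entries are skipped by default
```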

* Detecting null values

In [10]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [11]:
data[data.notnull()]

0        1
2    hello
dtype: object

In [12]:
# Dropping null values
data.dropna()

0        1
2    hello
dtype: object

In [13]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [14]:
df.dropna()

     0    1  2
1  2.0  3.0  5
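
By default, `dropna()` on a DataFrame drops every row that contains any null value. The following sketch shows a few other options of the same method (`axis`, `how`, and `thresh`), using the same values as `df` plus a hypothetical all-null column added only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Drop columns (instead of rows) that contain any null value.
cols_kept = df.dropna(axis='columns')

# Add a hypothetical column that is entirely null.
df_extra = df.copy()
df_extra[3] = np.nan

# how='all' drops only columns in which every value is null.
all_null_dropped = df_extra.dropna(axis='columns', how='all')

# thresh=3 keeps only rows with at least 3 non-null values.
rows_thresh = df_extra.dropna(thresh=3)

print(list(cols_kept.columns))         # [2]
print(list(all_null_dropped.columns))  # [0, 1, 2]
print(list(rows_thresh.index))         # [1]
```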


Rather than dropping null entries, you can fill them with the fillna() method, which returns a copy of the data with the missing values replaced:

In [15]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [16]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [17]:
# forward-fill: propagate the previous value forward (fillna(method='ffill') is deprecated in pandas 2.1+)
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [18]:
# back-fill: propagate the next value backward (fillna(method='bfill') is deprecated in pandas 2.1+)
data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
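
A few further variations on filling, sketched under the assumption of a reasonably recent Pandas version: filling with a statistic of the data, capping how far a forward-fill propagates with `limit`, and filling along the columns of a DataFrame with `axis`. The `gappy` Series is hypothetical and exists only to show the effect of `limit`.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Fill with a statistic of the data: the mean of the non-null values.
mean_filled = data.fillna(data.mean())

# Hypothetical series with two consecutive gaps.
gappy = pd.Series([1.0, np.nan, np.nan, 4.0])

# limit=1 fills at most one consecutive null per gap.
limited = gappy.ffill(limit=1)

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Forward-fill across each row instead of down each column.
row_filled = df.ffill(axis='columns')

print(mean_filled.tolist())  # [1.0, 2.0, 2.0, 2.0, 3.0]
print(limited.tolist())
print(row_filled)
```

A leading null has no previous value to copy, so it stays null after a forward-fill, as with the first entry of the last row of `row_filled`.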