#### Approaches to representing missing values

Two approaches are commonly used to represent missing values in tables or dataframes. The first uses a mask that flags the missing entries, whereas the second uses a datatype-specific sentinel value that stands in for a missing entry.

A mask can be either global or local. A global mask is a separate boolean array kept alongside each data array, whereas a local mask reserves a single bit in each element’s bit-wise representation.
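As a rough illustration of the global-mask idea, the sketch below uses NumPy's masked arrays, which pair a data array with a separate boolean array. The variable names and values are made up for the example; this shows the general technique, not how pandas stores its data.

```python
import numpy as np

# Data array plus a separate boolean array: True marks a missing entry
values = np.array([10, 20, 30, 40])
missing = np.array([False, True, False, False])

masked = np.ma.masked_array(values, mask=missing)

total = masked.sum()       # masked entries are skipped -> 80
n_valid = masked.count()   # number of unmasked entries -> 3
```

The data keeps its integer dtype because the missing-value information lives entirely in the mask, at the cost of storing one extra boolean per element.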

In the sentinel approach, by contrast, a datatype-specific sentinel value is defined. This can be either a conventional value chosen by best practice or a uniquely defined bit-wise representation.
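A minimal sketch of the sentinel idea for integer data follows. The value -9999 and the name `SENTINEL` are assumptions chosen for illustration; any value that cannot occur in the real data would do.

```python
import numpy as np

SENTINEL = -9999  # hypothetical convention for "missing" in this example

readings = np.array([12, SENTINEL, 15, 9])

# The sentinel has to be filtered out explicitly before computing anything
is_missing = readings == SENTINEL
mean_valid = readings[~is_missing].mean()   # 12.0

# Forgetting the filter silently corrupts the result
naive_mean = readings.mean()                # -2490.75
```

The sentinel needs no extra storage, but every computation has to know about it, which is the main trade-off against a mask.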

pandas uses two existing Python sentinels to denote missing values: the IEEE 754 floating-point value NaN (available as numpy.nan) and the Python singleton object None.
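The short sketch below shows both sentinels being recognised as missing, and None being converted to NaN when pandas builds a numeric Series. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64
s = pd.Series([1, None])
converted_dtype = s.dtype

# Both sentinels count as missing
both_missing = pd.isna(None) and pd.isna(np.nan)
flags = pd.isna(s).tolist()   # [False, True]
```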

#### So repeat 3 times fast: object==bad, float==good

NaN is used consistently as the placeholder for missing data in pandas, and consistency is good.

np.nan allows for vectorized operations because it is a float value, whereas None forces the object dtype, which gives up most of NumPy's efficiency.
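To make the difference concrete at the NumPy level, here is a small sketch with made-up array names. The float array supports vectorized arithmetic and NaN-aware reductions, while summing the object array fails because Python cannot add an int and None. pandas adds its own null handling on top of this, so a Series may behave more leniently than a raw array.

```python
import numpy as np

float_arr = np.array([1.0, np.nan])           # dtype float64
obj_arr = np.array([1, None], dtype=object)   # dtype object

doubled = float_arr * 2          # vectorized, NaN simply propagates
total = np.nansum(float_arr)     # NaN-aware reduction -> 1.0

# The object array falls back to Python-level addition: 1 + None
try:
    obj_arr.sum()
    object_sum_failed = False
except TypeError:
    object_sum_failed = True
```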

In [1]:
import pandas as pd
import numpy as np

In [4]:
# dtype=object is forced here; without it pandas converts None to NaN
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])

In [5]:
s_bad.dtype, s_good.dtype

(dtype('O'), dtype('float64'))

In [6]:
s_bad.sum(), s_good.sum()

(1, 1.0)

#### You should be using pd.isnull and pd.notnull to test for missing data (NaN).

np.nan does not compare equal to anything, not even to itself.

If a value is missing and shows up as NaN, do not test for it with == np.nan, because that comparison is always False.

In [8]:
np.nan == np.nan

False

In [9]:
np.isnan(np.nan)

True

In [10]:
pd.isnull(np.nan)

True
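One reason to prefer pd.isnull over np.isnan is that it also handles None and object data. The sketch below, with illustrative names, checks a mixed object Series element-wise and shows that np.isnan rejects None outright.

```python
import numpy as np
import pandas as pd

# pd.isnull treats None and NaN alike, even in an object Series
mixed = pd.Series([1, None, np.nan], dtype=object)
flags = pd.isnull(mixed).tolist()   # [False, True, True]

# np.isnan only understands numeric input and raises on None
try:
    np.isnan(None)
    isnan_accepts_none = True
except TypeError:
    isnan_accepts_none = False
```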

In [12]:
# filters nothing: every element, including NaN itself, is != np.nan

s = pd.Series([1, np.nan, 2])
s[s != np.nan]

0    1.0
1    NaN
2    2.0
dtype: float64

In [14]:
# filters out the null

s = pd.Series([1, np.nan, 2])
s[s.notnull()]

0    1.0
2    2.0
dtype: float64

In [16]:
# odd comparison: NaN is not equal to itself, so s == s is False only for nulls

s = pd.Series([1, np.nan, 2])
s[s == s]

0    1.0
2    2.0
dtype: float64

In [18]:
# dropna()

s = pd.Series([1, np.nan, 2])
s.dropna()

0    1.0
2    2.0
dtype: float64

In [19]:
df = pd.DataFrame([1, np.nan])
df

     0
0  1.0
1  NaN


In [22]:
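# cast to object first: on a float column, newer pandas versions keep NaN
# instead of storing None
df1 = df.astype(object).where(pd.notnull(df), None)

In [23]:
df1

      0
0   1.0
1  None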


In [24]:
# note: this inserts the string 'None', not the None object
df.astype(object).replace(np.nan, 'None')

      0
0   1.0
1  None
