### Missing Data Conventions

- **METHOD 1** Masking Approach
The mask might be an entirely separate Boolean array or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.

Trade-off: adds overhead in both storage and computation.
<br>
- **METHOD 2** Sentinel Approach
The sentinel value could be a data-specific convention, such as marking a missing integer with -9999 or some rare bit pattern, or a more global convention such as NaN, a special value that is part of the IEEE floating-point specification.

Trade-off: reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic.
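
The two conventions can be compared side by side in NumPy. The following is a minimal sketch, assuming a small made-up array in which -9999 plays the role of the data-specific sentinel; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical raw data in which -9999 marks a missing integer
raw = np.array([3, -9999, 7, -9999, 5])

# Masking approach: a separate Boolean array records which entries are missing
masked = np.ma.masked_array(raw, mask=(raw == -9999))
masked_mean = masked.mean()            # mean over the unmasked entries only

# Sentinel approach: convert to float and store NaN in place of missing entries
as_float = np.where(raw == -9999, np.nan, raw.astype(float))
sentinel_mean = np.nanmean(as_float)   # NaN-aware mean

print(masked_mean, sentinel_mean)
```

The masked array keeps the integer dtype but carries an extra Boolean array, while the NaN version needs no extra array but forces a floating-point dtype.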

### Missing Data in Pandas
Constraint in R:
<br> R reserves specific **bit patterns** in each individual data type to indicate nullness, but this approach becomes unwieldy in NumPy. While R has 4 basic data types, NumPy supports far more (for example, fourteen basic integer types), so reserving a bit pattern in every type would mean special-casing a large number of operations and might even require a fork of the NumPy package. In addition, sacrificing a bit to use as a mask significantly reduces the range of values a type can represent, especially for the smaller data types.


Constraint in NumPy:
<br> NumPy's masked arrays add overhead in storage, computation, and code maintenance.

What does Pandas use, then?
<br> **Sentinels**, in the form of two already existing Python null values: the special floating-point `NaN` value and the Python `None` object.

### `None` Missing data
`None` is a Python singleton object that is often used for missing data in Python code. Because it is a Python object, `None` cannot be used in an arbitrary NumPy/Pandas array, but only in arrays with data type `'object'`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [3]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
55.8 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
2.04 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



In [4]:
# Addition between None and integer is undefined
print(1+None)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

### `NaN` Missing numerical data
`NaN` is a special floating-point value that is part of the standard IEEE floating-point representation.

In [5]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array. This means that, unlike the object array from before, this array supports fast operations pushed into compiled code. You should be aware that NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:

In [6]:
print(1+np.nan)

nan


In [7]:
print(0*np.nan)

nan


In [8]:
# Ordinary aggregations also return nan
print(vals2.sum(), vals2.min(), vals2.max())

# Special aggregations to ignore missing values
print(np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2))

nan nan nan
8.0 1.0 4.0
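
Because NaN propagates through arithmetic, it also behaves unusually under comparison: it is not equal to anything, including itself. The sketch below, which reuses the values from `vals2` under a new illustrative name, shows why `np.isnan` rather than `==` is the way to locate NaN entries.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN compares unequal to everything, itself included
nan_equals_itself = (np.nan == np.nan)

# np.isnan builds a Boolean mask of the missing entries
missing = np.isnan(vals)

# Indexing with the inverted mask keeps only the valid values
clean_total = vals[~missing].sum()

print(nan_equals_itself, missing, clean_total)
```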


### `NaN` and `None` in Pandas

In [9]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [10]:
# Pandas upcasts an integer Series to float64 when a null value is assigned,
# converting None to NaN in the process
x = pd.Series(range(2), dtype=int)
print(x)

# Now note the data type after assigning None
x[0] = None
x

0    0
1    1
dtype: int32


0    NaN
1    1.0
dtype: float64
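
The upcasting depends on the original data type. As a small sketch (the variable names are illustrative), integers mixed with `None` become `float64` with `NaN`, whereas Booleans mixed with `None` fall back to `object` and keep `None` as is:

```python
import pandas as pd

int_with_none = pd.Series([1, None])       # integers plus a null
bool_with_none = pd.Series([True, None])   # Booleans plus a null

print(int_with_none.dtype)    # floating-point type, None stored as NaN
print(bool_with_none.dtype)   # object type, None kept as a Python object
```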

### Operation on Null values
Pandas provides several methods for detecting, removing, and replacing null values:
- `isnull()`: Generates a Boolean mask indicating the missing values
- `notnull()`: Opposite of `isnull()`
- `dropna()`: Returns a filtered version of the data
- `fillna()`: Returns a copy of the data with missing values filled or imputed

In [11]:
# Detecting null values
data = pd.Series([1, np.nan, 'hello', None])
print(data.isnull())

0    False
1     True
2    False
3     True
dtype: bool


In [12]:
# Using boolean mask to get non-null data
print(data[data.notnull()])

0        1
2    hello
dtype: object


In [13]:
# Dropping null values
data.dropna()

0        1
2    hello
dtype: object

In [14]:
# Dropping null values in a DataFrame
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
print(df)

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so `dropna()` gives a number of options for a DataFrame.

By default, `dropna()` will drop all rows in which any null value is present:

In [15]:
df.dropna()

     0    1  2
1  2.0  3.0  5


In [16]:
df.dropna(axis='columns') # Drops all columns containing null values

   2
0  2
1  5
2  6


In [20]:
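# how='all' drops only the columns (or rows) that are entirely null;
# thresh=3 keeps only the rows with at least 3 non-null values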
df[3] = np.nan

print(df.dropna(axis='columns', how='all'))
print("-------------------------------")
print(df.dropna(axis='rows', thresh=3))

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
-------------------------------
     0    1  2   3
1  2.0  3.0  5 NaN
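
To make the meaning of `thresh` concrete, the sketch below rebuilds the same DataFrame and counts the non-null values per row by hand. The names `non_null_per_row`, `manual`, and `via_thresh` are illustrative, and `axis=0` is used here as the numeric equivalent of `axis='rows'`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan   # a column that is entirely null

# Count the non-null entries in each row
non_null_per_row = df.notnull().sum(axis=1)

# Keeping rows with at least 3 non-null values by hand...
manual = df[non_null_per_row >= 3]

# ...should match what dropna does with thresh=3
via_thresh = df.dropna(axis=0, thresh=3)

print(non_null_per_row.tolist())
print(manual.equals(via_thresh))
```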


### Filling Null values
We can replace null values with a single value such as zero, or with some sort of imputation or interpolation from the good values. We could do this in place using the `isnull()` method as a mask, but because it is such a common operation, Pandas provides the `fillna()` method, which returns a copy of the array with the null values replaced.

In [21]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [22]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
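
For comparison, the same zero fill can be written with an `isnull()` mask. This is a minimal sketch; the names `filled_copy` and `in_place` are illustrative, and the copy is made only so that the original `data` stays untouched.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# fillna returns a new Series and leaves data unchanged
filled_copy = data.fillna(0)

# Mask-based assignment modifies the Series it is applied to
in_place = data.copy()
in_place[in_place.isnull()] = 0

print(filled_copy.equals(in_place))
```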

In [23]:
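# forward-fill: propagate the previous valid value forward
# (recent pandas versions deprecate the `method` argument in favor of data.ffill())
data.fillna(method='ffill')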

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [24]:
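# back-fill: propagate the next valid value backward
# (recent pandas versions deprecate the `method` argument in favor of data.bfill())
data.fillna(method='bfill')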

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
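
Beyond constant and directional fills, the imputation and interpolation mentioned earlier can be sketched as follows. Filling with the mean and using the default linear `interpolate()` are illustrative choices, not recommendations for any particular dataset.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Imputation: replace nulls with the mean of the valid values
mean_filled = data.fillna(data.mean())

# Interpolation: the default linear method treats entries as equally spaced
interpolated = data.interpolate()

print(mean_filled)
print(interpolated)
```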

In [25]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [26]:
df.fillna(method='ffill', axis=1) # Forward-fill along each row; a null with no previous value in its row stays null

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


---