
# Handling Missing Data

In [None]:
import numpy as np
import pandas as pd

Real-world data is rarely clean or homogeneous, and it often has missing values.

In Pandas, these missing values are variously referred to as NaN, null, or NA.

## Trade-offs in missing data conventions

Missing values in a table or DataFrame are generally indicated using one of two schemes:
1. Using a *mask* that globally indicates missing values
2. Choosing a *sentinel* value that indicates a missing entry

In the masking approach, the mask may be an entirely separate Boolean array, or it may involve appropriating one bit in the data representation to locally indicate the null status of a value.

In the sentinel approach, the sentinel value could be a data-specific convention, such as marking a missing integer with -9999 or some rare bit pattern, or it could be a more global convention, such as marking a missing floating-point value with NaN.

Neither approach comes without trade-offs. In masking, the separate mask requires allocating an additional Boolean array, which adds overhead in both storage and computation.

A sentinel value reduces the range of valid values that can be represented and may require extra (often non-optimized) logic in CPU and GPU arithmetic.
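The two schemes can be compared with a minimal NumPy sketch. The readings, the `-9999` sentinel, and the variable names are illustrative assumptions, not part of any library convention.

```python
import numpy as np

# Four integer readings; assume the third one is missing.
data = np.array([4, 7, 0, 9], dtype=np.int64)

# Mask approach: a separate Boolean array flags the missing entry.
mask = np.array([False, False, True, False])
masked_total = data[~mask].sum()

# Sentinel approach: a reserved value (hypothetical -9999) marks the gap.
SENTINEL = -9999
with_sentinel = np.array([4, 7, SENTINEL, 9], dtype=np.int64)
sentinel_total = with_sentinel[with_sentinel != SENTINEL].sum()

print(masked_total, sentinel_total)  # both sums skip the missing entry
print(mask.nbytes)                   # extra storage used by the mask
```

The mask costs one extra byte per element here, while the sentinel costs nothing in storage but means -9999 can never be a real reading.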

As in most cases where no universally optimal choice exists, different languages and systems use different conventions.

For example, the R language uses reserved bit patterns within each data type as sentinel values indicating missing data, while the SciDB system uses an extra byte attached to every cell to indicate an NA state.

## Missing Data in Pandas

The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.

Pandas chose to use sentinels for missing data, and further chose to use two existing Python null values: the special floating-point *NaN* value and the Python *None* object.

### None: Pythonic missing data

The first sentinel is `None`, a Python singleton object that is often used for missing data in Python code. <br>
Because it is a Python object, it cannot be used in an arbitrary NumPy/Pandas array, only in arrays with the *object* data type.

In [None]:
val=np.array([1,None,2,3])
val

array([1, None, 2, 3], dtype=object)

In [None]:
# aggregations like sum() and min() raise an error on an object array that contains None
val.min()

TypeError: ignored

Operations such as addition or comparison between an integer and None are undefined, so the aggregation raises a TypeError.
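As a small sketch of this behaviour, the snippet below confirms the object dtype and catches the error from `sum()` instead of letting it stop the notebook. The variable name `error_message` is only illustrative.

```python
import numpy as np

val = np.array([1, None, 2, 3])
print(val.dtype)  # object

# Summing adds an int to None internally, which Python does not allow.
try:
    val.sum()
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```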

### NaN: Missing numerical data

The other missing-data representation is *NaN* (short for "Not a Number"), a **special floating-point** value recognised by all systems that use the standard IEEE floating-point representation.

In [None]:
val1=np.array([1,np.nan,2,3])
val1

array([ 1., nan,  2.,  3.])

The `val1` array has type float64 rather than object, so it supports fast operations.<br>
Any arithmetic operation involving NaN produces another NaN.

In [None]:
1+np.nan

nan

In [None]:
0*np.nan

In [None]:
# aggregations do not raise an error, but the result is nan
val1.sum()

In [None]:
# NumPy provides aggregation functions that ignore nan values
np.nansum(val1)
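The following sketch collects the NaN behaviour described above in one place: NaN does not compare equal to itself, ordinary aggregates are "infected" by it, and the NaN-aware variants skip it. It reuses the same `val1` values as the cells above.

```python
import numpy as np

val1 = np.array([1, np.nan, 2, 3])

# NaN is never equal to anything, including itself, so use np.isnan to find it.
print(np.nan == np.nan)  # False
print(np.isnan(val1))    # [False  True False False]

# Ordinary aggregates propagate NaN.
print(val1.sum(), val1.min(), val1.max())  # nan nan nan

# NaN-aware aggregates ignore the missing entry.
print(np.nansum(val1), np.nanmin(val1), np.nanmax(val1))  # 6.0 1.0 3.0
```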

### NaN and None in Pandas

NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate.

In [None]:
pd.Series([1,np.nan,2,None])

For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present.<br>
For example, if we set a value in an integer array to `np.nan` or `None`, it will automatically be upcast to a floating-point type to accommodate the NA.

In [None]:
x=pd.Series(range(2),dtype=int)
x

In [None]:
x[0]=None
x

In addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value.
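The upcasting can also be seen at construction time. The sketch below compares the inferred dtypes with and without a missing value. The last line uses the nullable `"Int64"` extension dtype, which is not covered in these notes and is shown only as an assumption-labelled alternative that keeps integers alongside a missing marker.

```python
import pandas as pd

ints = pd.Series([1, 2])           # no missing values: integer dtype
with_missing = pd.Series([1, None])  # None becomes NaN, dtype becomes float64

print(ints.dtype)
print(with_missing.dtype)
print(with_missing.isna().tolist())

# Alternative outside these notes: a nullable integer dtype avoids the float upcast.
nullable = pd.Series([1, None], dtype="Int64")
print(nullable.dtype)
```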


**Summary** :-<br>
A sentinel value is a specific value that signals a special condition or state. For missing data, Pandas uses two sentinels: `None`, which can only live in object arrays, and NaN ("Not a Number"), a floating-point value that marks missing or undefined entries in numerical data. Pandas converts between the two where needed, so missing values can be detected and handled in a uniform way.<br>
Put simply, a sentinel is a special code that tells the program "this entry is different", and in Pandas NaN is the code that says "data is missing here".

## Operating on Null values

As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several methods for detecting, removing, and replacing null values.

They are:-
1. `isnull()` - returns a Boolean mask indicating missing values
2. `notnull()` - the opposite of `isnull()`
3. `fillna()` - returns a copy of the data with missing values filled or imputed
4. `dropna()` - returns a filtered version of the data with missing values removed

### Detecting null values

In [None]:
s=pd.Series([1,np.nan,4,None])
s

0    1.0
1    NaN
2    4.0
3    NaN
dtype: float64

In [None]:
s.isnull()

0    False
1     True
2    False
3     True
dtype: bool

The result is a Boolean mask, which can be used directly as an index into a Series or DataFrame.

In [None]:
s[s.notnull()]

0    1.0
2    4.0
dtype: float64

Both `isnull()` and `notnull()` produce similar Boolean results for DataFrames.
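For a DataFrame, the mask has the same shape as the data, and summing it gives a quick count of missing values per column. This is a minimal sketch using a small example frame. The names `null_mask` and `nulls_per_column` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Element-wise mask: True wherever a value is missing.
null_mask = df.isnull()
print(null_mask)

# True counts as 1, so the column sums are the number of nulls per column.
nulls_per_column = null_mask.sum()
print(nulls_per_column)
```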

### Dropping null values

`dropna()` removes NA values, and `fillna()` fills them in.

For a Series, the result is straightforward:

In [None]:
s.dropna()

0    1.0
2    4.0
dtype: float64

For a DataFrame, there are more options: we cannot drop single values, only entire rows or entire columns. By default, `dropna()` drops every row in which any null value is present.

In [None]:

df = pd.DataFrame([[1, np.nan, 2],
 [2, 3, 5],
 [np.nan, 4, 6]])

In [None]:
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [None]:
df.dropna()
#removes any row which has null value in it

     0    1  2
1  2.0  3.0  5


In [None]:
df.dropna(axis=1)
#removes columns which have null value

   2
0  2
1  5
2  6


But this drops some good data as well. We might instead want to drop only rows or columns that are entirely null, or that are mostly null. This can be specified through the `how` and `thresh` parameters.

The default is `how='any'`, which drops any row or column (depending on `axis`) that contains a null value.
With `how='all'`, only rows or columns that are entirely null are dropped.

In [None]:
df.dropna(axis='columns',how='any')

   2
0  2
1  5
2  6


In [None]:
df.dropna(how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


For finer-grained control, the `thresh` parameter specifies the minimum number of non-null values a row or column must have in order to be kept. With `thresh=2`, every row of `df` qualifies, so nothing is dropped.

In [None]:
df.dropna(thresh=2)

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
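On this small frame, `how='all'` and `thresh=2` leave everything in place, so their effect is easier to see with one extra column. The sketch below adds a hypothetical all-null column `3` to a copy of the same data and then applies both parameters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan  # illustrative column that is entirely null

# how='all' drops only the columns in which every value is null (column 3).
no_empty_cols = df.dropna(axis="columns", how="all")
print(no_empty_cols)

# thresh=3 keeps only the rows with at least three non-null values (row 1).
strict_rows = df.dropna(axis="rows", thresh=3)
print(strict_rows)
```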


### Filling null values

Instead of dropping null values, we can fill them in with a valid value, such as 0 or some imputed value.

In [None]:
s.fillna(0)

0    1.0
1    0.0
2    4.0
3    0.0
dtype: float64
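As one simple example of imputation, the sketch below fills the gaps with the mean of the available values. Mean imputation is only one possible choice and is used here purely for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 4, None])

# Series.mean() skips NaN by default, so this is the mean of 1.0 and 4.0.
mean_value = s.mean()
filled = s.fillna(mean_value)

print(mean_value)
print(filled)
```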

Forward fill - propagates the previous valid value forward into each null entry.

In [None]:
# forward fill; same result as the older s.fillna(method='ffill'), which newer pandas versions deprecate
s.ffill()

0    1.0
1    1.0
2    4.0
3    4.0
dtype: float64

Backward fill - propagates the next valid value backward into each null entry.

In [None]:
# backward fill; same result as the older s.fillna(method='bfill'), which newer pandas versions deprecate
s.bfill()

0    1.0
1    4.0
2    4.0
3    NaN
dtype: float64

The same methods can be used for DataFrames, where we can additionally specify the `axis` along which the fill takes place.

In [None]:
df.fillna(0)

     0    1  2
0  1.0  0.0  2
1  2.0  3.0  5
2  0.0  4.0  6


Forward fill

In [None]:
# fill down each column (axis=0); same result as the older df.fillna(method='ffill', axis=0)
df.ffill(axis=0)

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  2.0  4.0  6


Backward fill

In [None]:
# fill backward along each row (axis=1); same result as the older df.fillna(method='bfill', axis=1)
df.bfill(axis=1)

     0    1    2
0  1.0  2.0  2.0
1  2.0  3.0  5.0
2  4.0  4.0  6.0


If no previous value (for a forward fill) or next value (for a backward fill) is available, the NaN is left as it is.
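The edge behaviour is easy to check with a Series that has gaps at both ends. This sketch uses the `ffill()` and `bfill()` methods, and the example values are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, 4, np.nan])

# Forward fill: the leading NaN has no previous value, so it stays NaN.
forward = s.ffill()
print(forward)

# Backward fill: the trailing NaN has no next value, so it stays NaN.
backward = s.bfill()
print(backward)
```

Chaining the two, as in `s.ffill().bfill()`, is one way to cover both ends when that suits the data.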