# <font color=#14F278>Unit 6 - Missing Data</font>
---

## <font color=#14F278>1. Missing Data - Definition:</font>
Data comes in many shapes and forms and more often than not, missing data is one of the biggest issues we encounter in real life! This is why Pandas aims to be flexible with regards to handling it. But before we learn about the ways to handle missing data, let's take a look at some definitions:

- <font color=#14F278>**NaN (Not a Number)**</font> - a special floating-point value that marks cases of <font color=#14F278>**numerical invalidity**</font>, i.e. a missing or undefined number
- <font color=#14F278>**None (NoneType)**</font> - a built-in Python object, which refers to <font color=#14F278>**non-existent (empty)**</font> values

While __NaN__ is the default missing value marker, we need to be able to easily detect and handle missing values in data of different types - float, integer, boolean, etc. This is why we also consider __None__ to flag __missing, not available, or NA__ values!
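
The difference between the two markers can be seen in a short sketch. The Series names below (`numeric`, `mixed`) are made up for illustration; the snippet only assumes that NumPy and Pandas are installed.

```python
import numpy as np
import pandas as pd

# NaN is a float and never compares equal to anything, not even to itself
print(type(np.nan))        # <class 'float'>
print(np.nan == np.nan)    # False

# In a numeric Series, None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1, None, 3])
print(numeric.dtype)       # float64

# With an explicit object dtype, None is stored as-is,
# but isna() still flags both None and NaN as missing
mixed = pd.Series(['a', None, np.nan], dtype=object)
flags = mixed.isna().tolist()
print(flags)               # [False, True, True]
```

Because `NaN == NaN` is `False`, missing values should be detected with `isna()` rather than with an equality comparison.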

In [5]:
import pandas as pd
import numpy as np

---
### <font color=#14F278>1.1 Importance of Identifying and Handling Missing Data:</font>
There are countless reasons why Missing Data should be considered an <font color=#14F278>**enemy**</font> to any dataset or data analysis:

1. <font color=#14F278>**Missing Data can indicate:**</font>
- data collection errors
- calculation errors
- incomplete data collection or data implementation

2. <font color=#14F278>**Missing Data - risk:**</font>
- adverse impact on the quality of **predictive models** (e.g. Machine Learning Models)
- negative consequences to businesses and sectors from a regulatory standpoint


<font color=#FF8181>**Important:**</font> Analysis on incomplete Data (i.e., data with missing values) can lead to false conclusions and costly mistakes. Therefore, think of <font color=#FF8181>**incomplete data as inaccurate data!**</font>


In [6]:
# Write a function that constructs a DataFrame using the .reshape() method
# Note that our function assigns nan values to certain DataFrame cells
def make_df3():
    data = np.array(range(24)).reshape(-1,3)
    df   = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
    # cast the columns that will hold NaN to float first,
    # since newer pandas versions warn when NaN is written into an integer column
    df   = df.astype({'col1': float, 'col2': float})
    df.iloc[0,0] = np.nan
    df.iloc[0,1] = np.nan
    df.iloc[4,0] = np.nan
    df.iloc[6,0] = np.nan
    return df

In [7]:
# Create a demo DataFrame
df = make_df3()
display(df)

Unnamed: 0,col1,col2,col3
0,,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23


---
## <font color=#14F278>2. Identifying Missing Data:</font>

The first step in tackling missing data is to <font color=#14F278>**identify it**</font>. Here we use the demo DataFrame generated above, but all the steps are applicable to <font color=#14F278>**any data set of any size and shape**</font>:
- a good first step is to apply the `info()` method on your dataframe - it returns the number of **non-null (i.e. non-missing) values per column**


In [4]:
df.info() # identifies the non nulls

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    5 non-null      float64
 1   col2    7 non-null      float64
 2   col3    8 non-null      int32  
dtypes: float64(2), int32(1)
memory usage: 292.0 bytes


- alternatively, use the `isna()` method to create a <font color=#14F278>**boolean dataframe**</font> with the same dimensions as the original one
- all <font color=#14F278>**NULL**</font> values will be replaced with <font color=#14F278>**True**</font> and all <font color=#14F278>**non-NULL**</font> values with <font color=#14F278>**False**</font>:

In [5]:
is_missing = df.isna()
display(is_missing)

Unnamed: 0,col1,col2,col3
0,True,True,False
1,False,False,False
2,False,False,False
3,False,False,False
4,True,False,False
5,False,False,False
6,True,False,False
7,False,False,False


- apply `sum(axis=0)` on `is_missing` to find <font color=#14F278>**the number of NULL values per column**</font>
- apply `sum(axis=1)` on `is_missing` to find <font color=#14F278>**the number of NULL values per row**</font>
    - when identifying the missing data per row, it's worthwhile aggregating the output into something actionable
    - further apply `value_counts()` to `is_missing.sum(axis=1)` to find how many rows in total have 0, 1, 2, etc. missing data entries

In [6]:
missing_per_column = is_missing.sum(axis =0)
display(missing_per_column)

col1    3
col2    1
col3    0
dtype: int64

In [7]:
missing_per_row = is_missing.sum(axis = 1).value_counts()
display(missing_per_row)

0    5
1    2
2    1
Name: count, dtype: int64
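
On larger data sets, raw counts are often easier to judge as percentages. The following is a minimal sketch of that idea, not part of the original unit: `make_demo_df` is a hypothetical helper that re-creates the demo DataFrame, and `missing_share` and `rows_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

def make_demo_df():
    # Hypothetical helper: rebuilds the demo DataFrame used in this unit
    data = np.arange(24).reshape(-1, 3)
    df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
    df = df.astype({'col1': float, 'col2': float})
    df.loc[[0, 4, 6], 'col1'] = np.nan
    df.loc[0, 'col2'] = np.nan
    return df

df = make_demo_df()

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of missing values per column
missing_share = df.isna().mean() * 100
print(missing_share)

# Select the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing.index.tolist())   # [0, 4, 6]
```

For the demo data, `col1` is 37.5% missing, `col2` is 12.5% missing and `col3` is complete.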

<center>
    <div>
        <img src="..\images\missing_001.png"/>
    </div>
</center>


---
## <font color=#14F278>3. Handling Missing Data:</font>

Once identified, missing data can be handled in many ways, depending on the business case:
- in certain situations, it is okay <font color=#14F278>**to not handle missing data**</font>
- the second easiest solution is to <font color=#14F278>**drop the missing data**</font>
- alternatively, we can <font color=#14F278>**impute (fill in the blanks)**</font>
- or <font color=#14F278>**interpolate**</font>

---
### <font color=#14F278>3.1 Dropping:</font>
Dropping missing values refers to <font color=#14F278>**'getting rid of'**</font> any records containing a NULL - be it columns or rows in a DataFrame.

The ease of this method, however, comes at the expense of <font color=#FF8181>**losing important data points**</font> - often critical from a business point of view. Therefore, dropping must be conducted carefully and very selectively!

To perform dropping of values, we use the `dropna()` method:
- `dropna(axis = 1)` - drops <font color=#FF8181>**all columns with NULLs**</font>
- `dropna(axis = 0)` - drops <font color=#FF8181>**all rows with NULLs**</font>

<center>
    <div>
        <img src="..\images\missing_002.png"/>
    </div>
</center>


In [8]:
# Create two DataFrames using different axis values:

# df1 will contain only non-null columns of df - drops the columns that contain any nulls
df1 = df.dropna(axis = 1)
display(df1)

# df2 will contain only non-null rows of df - drops the rows that contain any nulls
df2 = df.dropna(axis = 0)
display(df2)

Unnamed: 0,col3
0,2
1,5
2,8
3,11
4,14
5,17
6,20
7,23


Unnamed: 0,col1,col2,col3
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
5,15.0,16.0,17
7,21.0,22.0,23
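
Since dropping should be done selectively, it may help to know that `dropna()` also accepts the `how`, `subset` and `thresh` parameters. The sketch below is an illustration under assumptions: `make_demo_df` is a hypothetical helper that re-creates the demo DataFrame, and `col4` is an artificial, completely empty column added only for this example.

```python
import numpy as np
import pandas as pd

def make_demo_df():
    # Hypothetical helper: rebuilds the demo DataFrame used in this unit
    data = np.arange(24).reshape(-1, 3)
    df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
    df = df.astype({'col1': float, 'col2': float})
    df.loc[[0, 4, 6], 'col1'] = np.nan
    df.loc[0, 'col2'] = np.nan
    return df

df = make_demo_df()
df['col4'] = np.nan   # artificial, fully empty column (illustration only)

# how='all' drops a column only if every value in it is missing
no_empty_cols = df.dropna(axis=1, how='all')
print(list(no_empty_cols.columns))    # ['col1', 'col2', 'col3']

# subset limits the NULL check to the listed columns
col2_complete = df.dropna(axis=0, subset=['col2'])
print(col2_complete.shape)            # (7, 4)

# thresh keeps rows with at least that many non-missing values
at_least_two = no_empty_cols.dropna(axis=0, thresh=2)
print(at_least_two.index.tolist())    # [1, 2, 3, 4, 5, 6, 7]
```

Compared with a plain `dropna(axis=0)`, which keeps only 5 of the 8 rows, the `thresh=2` variant keeps 7 rows and removes only row 0, where two of the three values are missing.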


---
### <font color=#14F278>3.2 Imputation:</font>
To <font color=#14F278>**impute**</font> means to <font color=#14F278>**assign (ascribe)**</font> a value to something, and in that sense, imputation simply means to <font color=#14F278>**fill in the blanks**</font> of a dataset.

- to impute, use the `fillna()` method
- it can fill in the blanks with a single static value or with a different static value per column


In [9]:
# filling in all blanks with a single value
df.fillna(9000)

Unnamed: 0,col1,col2,col3
0,9000.0,9000.0,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,9000.0,13.0,14
5,15.0,16.0,17
6,9000.0,19.0,20
7,21.0,22.0,23


In [10]:
# filling in a column's blanks with a single value using a mask
# note - here we use the .isnull() method and explicit indexing
mask = df['col1'].isnull()
df.loc[mask, 'col1'] = 9000
display(df)

Unnamed: 0,col1,col2,col3
0,9000.0,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,9000.0,13.0,14
5,15.0,16.0,17
6,9000.0,19.0,20
7,21.0,22.0,23


In [11]:
# filling in columns' blanks with different values
df = make_df3()
df.fillna({'col1':6666, 'col2': 33333})

Unnamed: 0,col1,col2,col3
0,6666.0,33333.0,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,6666.0,13.0,14
5,15.0,16.0,17
6,6666.0,19.0,20
7,21.0,22.0,23
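
Instead of hard-coded numbers, the fill values can also be derived from the data itself. The following is a sketch of mean imputation, which goes slightly beyond the static values shown above; `make_demo_df` is a hypothetical helper that re-creates the demo DataFrame, and whether the mean is a sensible fill value depends on the business case.

```python
import numpy as np
import pandas as pd

def make_demo_df():
    # Hypothetical helper: rebuilds the demo DataFrame used in this unit
    data = np.arange(24).reshape(-1, 3)
    df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
    df = df.astype({'col1': float, 'col2': float})
    df.loc[[0, 4, 6], 'col1'] = np.nan
    df.loc[0, 'col2'] = np.nan
    return df

df = make_demo_df()

# mean() skips NaN values by default, so each mean uses only the known values
col_means = df.mean()

# Passing a Series to fillna() fills each column with the value under its name
filled = df.fillna(col_means)
print(filled)
```

Here the blanks in `col1` are filled with the mean of its five known values and the blank in `col2` with the mean of its seven known values.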


---
### <font color=#14F278>3.3 Forward and Backward Fill:</font>
<font color=#14F278>**Forward Fill & Backward Fill**</font> both fill a gap with a neighbouring valid value from the same column. They are available as the `ffill()` and `bfill()` methods or, in older code, as the `method` parameter of `fillna()` (deprecated since pandas 2.1):
- <font color=#14F278>**Forward Fill**</font> propagates the last valid observation forward onto the missing data point. All missing values are thus replaced with the value above them in the corresponding column. If the missing value is the first element in a column, it will remain **NaN**.
    - use `ffill()` (older form: `fillna(method = 'ffill')`)
- <font color=#14F278>**Backward Fill**</font> works in the opposite way - it fills in a missing data field with the value beneath it in the corresponding column. If the missing value is the last element in the column, it will remain **NaN**.
    - use `bfill()` (older form: `fillna(method = 'bfill')`)


In [12]:
# Forward fill
df = make_df3()
print('Before:')
display(df)
print('---------------------')
print('After:')
display(df.ffill())  # same result as the deprecated df.fillna(method='ffill')

Before:


Unnamed: 0,col1,col2,col3
0,,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23


---------------------
After:



Unnamed: 0,col1,col2,col3
0,,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,9.0,13.0,14
5,15.0,16.0,17
6,15.0,19.0,20
7,21.0,22.0,23


In [13]:
# Backward fill
df = make_df3()
print('Before:')
display(df)
print('---------------------')
print('After:')
display(df.bfill())  # same result as the deprecated df.fillna(method='bfill')

Before:


Unnamed: 0,col1,col2,col3
0,,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23


---------------------
After:



Unnamed: 0,col1,col2,col3
0,3.0,4.0,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,15.0,13.0,14
5,15.0,16.0,17
6,21.0,19.0,20
7,21.0,22.0,23
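
Two practical variations are sketched below on a small, made-up Series `s`: limiting how far a value is propagated with the `limit` parameter, and chaining both directions so that a leading gap is filled as well. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# Plain forward fill: the leading NaN stays, as there is nothing above it
ffilled = s.ffill()
print(ffilled.tolist())    # [nan, 1.0, 1.0, 1.0, 1.0, 5.0]

# limit=1 fills at most one consecutive missing value per gap
limited = s.ffill(limit=1)
print(limited.tolist())    # [nan, 1.0, 1.0, nan, nan, 5.0]

# Forward fill followed by backward fill also covers the leading gap
both = s.ffill().bfill()
print(both.tolist())       # [1.0, 1.0, 1.0, 1.0, 1.0, 5.0]
```

A `limit` can be useful when an old observation should not be carried over long stretches of missing data.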


---
### <font color=#14F278>3.4 Interpolation:</font>

<font color=#14F278>**Interpolation**</font> is a method of finding a simple function from a given data set, which can then be used to derive data points in between the given ones. There are many interpolation methods, but we will consider the simplest one - <font color=#14F278>**Linear Interpolation**</font>.
- A <font color=#14F278>**Linear Interpolation**</font> takes the closest valid values on either side of a missing data field and places the missing value on the straight line between them. For a single missing value, this is the <font color=#14F278>**mid-point (average)**</font> of the two
- By default, missing values at the start of a column remain **NaN**, since there is no earlier value to interpolate from
- Use the `interpolate()` method

In [14]:
# Interpolation
df = make_df3()
print('Before:')
display(df)
print('---------------------')
print('After:')
display(df.interpolate())

Before:


Unnamed: 0,col1,col2,col3
0,,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,,13.0,14
5,15.0,16.0,17
6,,19.0,20
7,21.0,22.0,23


---------------------
After:


Unnamed: 0,col1,col2,col3
0,,,2
1,3.0,4.0,5
2,6.0,7.0,8
3,9.0,10.0,11
4,12.0,13.0,14
5,15.0,16.0,17
6,18.0,19.0,20
7,21.0,22.0,23


In [15]:
# when there are multiple adjacent blanks, interpolate() will take this into account and assign equidistant values
pd.Series([1,np.nan,np.nan,4]).interpolate()

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
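
The equidistant values can be reproduced by hand with the straight-line formula `y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)`, where `(x0, y0)` and `(x1, y1)` are the known points around the gap. The sketch below compares that manual calculation with `interpolate()` and also shows the `limit_direction` parameter for leading gaps. The names `manual` and `lead` are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Known points around the gap: position 0 -> 1.0 and position 3 -> 4.0
x0, y0 = 0, s.iloc[0]
x1, y1 = 3, s.iloc[3]

# Evaluate the straight line through both points at every position
manual = [y0 + (y1 - y0) * (x - x0) / (x1 - x0) for x in range(len(s))]
print(manual)                      # [1.0, 2.0, 3.0, 4.0]
print(s.interpolate().tolist())    # [1.0, 2.0, 3.0, 4.0]

# A leading gap is left untouched by default...
lead = pd.Series([np.nan, 3.0, np.nan, 9.0])
default_fill = lead.interpolate()
print(default_fill.tolist())       # [nan, 3.0, 6.0, 9.0]

# ...unless limit_direction='both' is passed; the leading value is then
# filled with the nearest valid value rather than extrapolated
both_fill = lead.interpolate(limit_direction='both')
print(both_fill.tolist())          # [3.0, 3.0, 6.0, 9.0]
```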

---
## <font color=#14F278> 4. Summary:</font>
- Handling missing data is important, as missing values can indicate incomplete data sets or calculation errors, and can negatively impact data analyses
- The two main ways to handle missing data are __Dropping Values__ and __Imputation (filling in blanks)__
- When handling missing data, always pick the method that is most adequate and relevant to your data set in order to minimise critical data loss

---
## <font color=#FF8181> 5. Concept Check: </font>

1. Suppose we have a DataFrame `df=pd.DataFrame({'col1':[1,2,np.nan, 4,5], 'col2':[6,7,8,9,10], 'col3':[np.nan, 12,13, np.nan,15]})`. Without running any code, determine:
- the shape of the output produced by `df.dropna(axis=1)`
- the shape of the output produced by `df.dropna(axis=0)`
2. Using the DataFrame from question 1 and without running any code, determine:
- the value of `df.loc[0,'col3']` after applying imputation via forward fill, `.ffill()` (older form: `.fillna(method = 'ffill')`)
- the value of `df.loc[2,'col1']` after applying imputation via backward fill, `.bfill()` (older form: `.fillna(method = 'bfill')`)

In [18]:
df=pd.DataFrame({'col1':[1,2,np.nan, 4,5], 'col2':[6,7,8,9,10], 'col3':[np.nan, 12,13, np.nan,15]})
df

#1 a) (5,1) - only col2 has no missing values
# b) (2,3) - only rows 1 and 4 are complete
#TODO

Unnamed: 0,col1,col2,col3
0,1.0,6,
1,2.0,7,12.0
2,,8,13.0
3,4.0,9,
4,5.0,10,15.0
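
Once you have written down your own answers, one possible way to check them is sketched below. The variable names are illustrative, and the snippet uses `ffill()` and `bfill()` directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5],
                   'col2': [6, 7, 8, 9, 10],
                   'col3': [np.nan, 12, 13, np.nan, 15]})

# Question 1: shapes after dropping columns / rows that contain NULLs
shape_cols = df.dropna(axis=1).shape
shape_rows = df.dropna(axis=0).shape
print(shape_cols, shape_rows)

# Question 2: single values after forward and backward fill
first_col3 = df.ffill().loc[0, 'col3']
third_col1 = df.bfill().loc[2, 'col1']
print(first_col3, third_col1)
```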
