<a href="https://colab.research.google.com/github/ChiomaUU/CMPT2400/blob/main/EDA_Lab_3_(Students).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3: Handling Missing Values in Your Data

Most machine learning models cannot work with missing values in a dataset (`NaN`s or placeholder strings such as `'-'` or `'_'`), so it is crucial to handle them before modelling. There are two ways to deal with missing data:
- deleting the rows that contain missing values;
- replacing the missing values using an appropriate technique.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px

In [None]:
# loading the 'Lab3_df1.csv' csv data
df1 = pd.read_csv('Lab3_df1.csv')

In [None]:
# preparing data, drop unnecessary columns
df1 = df1.drop(columns=['Unnamed: 0', 'index'])
df1.head()

# Lab Activity One: Handling Missing Data Without Time-Series Specific Methods

#### You will be dealing with missing values in the `'H'` and `'Mode'` columns. You will go through several ways of handling them. For each method, you will first create a temporary copy of the master dataframe (`df1`) so that the original dataset is not altered.

In [None]:
# Checking how many missing values are in the H column
df1["H"].isnull().sum()
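
The same check can be run on every column at once. The sketch below uses a small toy dataframe with made-up values (only the column names `'H'` and `'Mode'` are borrowed from the lab) and also shows that a placeholder string such as `'_'` is not counted as missing by `isna()`.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; not the contents of Lab3_df1.csv.
toy = pd.DataFrame({
    "H": [1.0, np.nan, 3.0, np.nan],
    "Mode": ["a", "b", "_", "a"],
})

missing_per_column = toy.isna().sum()      # NaN count for every column
missing_share_h = toy["H"].isna().mean()   # fraction of rows where H is NaN

print(missing_per_column)
print(missing_share_h)
```

Note that `'Mode'` reports zero missing values here, because `'_'` is an ordinary string as far as pandas is concerned.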

##### [A] Drop all the rows which contain `NaN` values in the `'H'` column. Be sure to only drop the rows with missing values in the `'H'` column and no other column. Remember to use the `df_clean` dataframe.

In [None]:
df_clean = df1.copy()
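
As a minimal sketch of column-specific dropping, the toy example below (hypothetical values, with a second column `'T'` invented for illustration) uses the `subset` argument of `dropna` so that only `NaN`s in `'H'` cause a row to be removed.

```python
import numpy as np
import pandas as pd

# Toy data; 'T' is a made-up second column used to show it is left alone.
toy = pd.DataFrame({
    "H": [1.0, np.nan, 3.0, 4.0],
    "T": [10.0, 20.0, np.nan, 40.0],
})

# Drop rows only when 'H' is NaN; NaNs in other columns are ignored.
toy_clean = toy.dropna(subset=["H"])
print(toy_clean)
```

The row whose `'T'` value is missing survives, which is the behaviour the exercise asks for.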


#### Instead of just deleting the rows, we can try to replace the missing values with a sensible value. Generally, the replacement value is up to the data analyst (you) to figure out. Having domain knowledge helps.

##### [A] Using `df_temp0` replace all the `NaN` values in the `'H'` column with zero ($0$).

In [None]:
df_temp0 = df1.copy()
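
A short sketch of constant filling on toy data (hypothetical values) is shown below. It fills only the `'H'` column and compares the column mean before and after, since filling with zero can shift summary statistics noticeably.

```python
import numpy as np
import pandas as pd

# Toy data; 'T' is a made-up column that should stay untouched.
toy = pd.DataFrame({"H": [1.0, np.nan, 3.0], "T": [np.nan, 2.0, 3.0]})

toy_zero = toy.copy()
toy_zero["H"] = toy_zero["H"].fillna(0)   # fill NaNs in 'H' only

mean_before = toy["H"].mean()      # NaNs are skipped by default
mean_after = toy_zero["H"].mean()  # the inserted zero now counts
print(mean_before, mean_after)
```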


##### [A] Using `df_temp_mean` replace all the `NaN` values in the `'H'` column with the *mean* of the `'H'` column.

In [None]:
df_temp_mean = df1.copy()
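
The sketch below illustrates mean filling on a toy series (hypothetical values). It also checks a known side effect: the mean is preserved, but the spread of the column shrinks because the filled values sit exactly at the centre.

```python
import numpy as np
import pandas as pd

# Toy 'H' column with one missing value.
h = pd.Series([2.0, np.nan, 4.0, 6.0], name="H")

h_mean = h.mean()              # computed from the non-missing values
h_filled = h.fillna(h_mean)

print(h_filled.tolist())
print(h.std(), h_filled.std())  # standard deviation before and after
```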


##### [A] Using `df_temp_median` replace all the `NaN` values in the `'H'` column with the *median* of the `'H'` column.

In [None]:
df_temp_median = df1.copy()
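
One reason to prefer the median is robustness to outliers. The toy series below (hypothetical values, including one deliberately extreme reading) shows how far apart the two candidate fill values can be.

```python
import numpy as np
import pandas as pd

# Toy 'H' column with one extreme value and one missing value.
h = pd.Series([1.0, 2.0, np.nan, 3.0, 100.0], name="H")

h_mean = h.mean()       # pulled upwards by the outlier
h_median = h.median()   # middle of the sorted non-missing values

h_filled = h.fillna(h_median)
print(h_mean, h_median)
print(h_filled.tolist())
```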


##### [A] Using `df_temp_mode` replace all the `NaN` values in the `'H'` column with the *mode* of the `'H'` column.
> Hint: Replacing with the mode is trickier than the two previous methods, because `.mode()` returns a Series (several values can tie) rather than a single number.

In [None]:
df_temp_mode = df1.copy()
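
The difficulty with the mode can be seen on a toy series (hypothetical values). Passing the result of `.mode()` straight to `fillna` aligns on the index and can leave `NaN`s in place, so this sketch selects the first mode explicitly.

```python
import numpy as np
import pandas as pd

# Toy 'H' column; 7.0 is the most frequent value.
h = pd.Series([5.0, 7.0, np.nan, 7.0, 5.0, 7.0], name="H")

modes = h.mode()             # a Series, here with a single entry at index 0
misaligned = h.fillna(modes) # fills by matching index labels, so index 2 stays NaN

first_mode = modes.iloc[0]   # pick one scalar (the smallest value if there is a tie)
h_filled = h.fillna(first_mode)

print(misaligned.isna().sum())
print(h_filled.tolist())
```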


#### Now, you will be dealing with missing values in the `'Mode'` column. This column contains categorical values, which are harder to replace with a sensible value than numeric ones (where the mean, median, or mode can be used). Common options are:
- Replace with the most frequent category (similar to the mode).
- Create a new category called 'missing' and replace the missing values with it.
- Replace with a category depending on another feature (requires further data analysis).
- Simply delete the rows with missing values.

Note: For our dataset, instead of containing `NaN`s the `'Mode'` column has `'_'` to indicate missing values.
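
When a placeholder string marks missing values, it helps to convert it to `NaN` first; otherwise the placeholder itself can turn out to be the most frequent "category". The sketch below uses toy data in which the labels `'auto'` and `'manual'` are invented for illustration and are not taken from the lab dataset.

```python
import pandas as pd

# Toy 'Mode' column; '_' marks a missing value, category labels are hypothetical.
toy = pd.DataFrame({"Mode": ["auto", "_", "manual", "auto", "_", "_"]})

naive_top = toy["Mode"].value_counts().idxmax()   # counts '_' like any other label

# Turn the placeholder into NaN, then fill with the most frequent real category.
mode_col = toy["Mode"].mask(toy["Mode"] == "_")
most_common = mode_col.mode().iloc[0]
mode_filled = mode_col.fillna(most_common)

print(naive_top, most_common)
print(mode_filled.tolist())
```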

In [None]:
df_temp = df1.copy()
df_temp['Mode'].unique()

##### [A] Using `df_temp`, replace the `'_'` values with the string `'missing'`.
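
A minimal sketch of the placeholder replacement on toy data follows. The category labels are hypothetical; `'semi_auto'` is included only to show that `Series.replace` with a plain string matches whole cell values, not substrings.

```python
import pandas as pd

# Toy 'Mode' column; labels are made up, '_' marks a missing value.
toy = pd.DataFrame({"Mode": ["auto", "_", "semi_auto", "_"]})

# Exact-match replacement: only cells equal to '_' are changed.
toy["Mode"] = toy["Mode"].replace("_", "missing")
print(toy["Mode"].tolist())
```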

# Lab Activity Two: Handling Missing Data With Time-Series Specific Methods

The three most popular methods for dealing with missing values in time-series data are listed below:
- **Last Observation Carried Forward (LOCF):** Replacing the missing value with the value in the previous cell
- **Next Observation Carried Backward (NOCB):** Replacing the missing value with the value in the next cell
- **Linear interpolation:** Replacing missing values with estimates computed from the known values on either side of the gap
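
The three methods above can be compared side by side on a tiny toy series (hypothetical values, assumed to be already in time order):

```python
import numpy as np
import pandas as pd

# Toy series with a two-cell gap between 1.0 and 4.0.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

locf = s.ffill()          # carry the last known value forward
nocb = s.bfill()          # carry the next known value backward
linear = s.interpolate()  # straight line between the neighbouring known values

print(pd.DataFrame({"original": s, "LOCF": locf, "NOCB": nocb, "linear": linear}))
```

LOCF and NOCB copy an existing value into the gap, while interpolation produces new intermediate values.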

##### Preparing the data

In [None]:
#Loading data
df2 = pd.read_csv('Lab3_df2.csv')

# Dropping unwanted column


# Change datatype of the date column


# Split the dataframe in two, depending on whether the device's site is Clinic1 or Clinic2
df2_1 = df2[df2["Site Name"] == 'Clinic1'].copy()
df2_2 = df2[df2["Site Name"] == 'Clinic2'].copy()

df2_1.reset_index(inplace=True, drop=True)
df2_2.reset_index(inplace=True, drop=True)

In [None]:
df2_1.head()

In [None]:
df2_2.head()

##### Checking for missing values in the `'T1'` column

##### [A] An important first step when using time-series specific methods is to sort the dataset by time (date in our case). Sort both `df2_1` and `df2_2` by the `'Inspection Date'` column.
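
As a sketch of the sorting step, the toy example below assumes the dates arrive as ISO-formatted strings (the actual format in `Lab3_df2.csv` is not shown here). It converts them to datetimes before sorting, then resets the index so row positions follow time order.

```python
import pandas as pd

# Toy data with hypothetical dates and readings, deliberately out of order.
toy = pd.DataFrame({
    "Inspection Date": ["2021-03-01", "2021-01-15", "2021-02-10"],
    "T1": [3.0, 1.0, 2.0],
})

toy["Inspection Date"] = pd.to_datetime(toy["Inspection Date"])
toy_sorted = toy.sort_values("Inspection Date").reset_index(drop=True)
print(toy_sorted)
```

Sorting matters because forward fill, backward fill, and interpolation all operate on row order, not on the date values themselves.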

#### Last Observation Carried Forward (LOCF)

##### [A] Using `df_temp1`, use LOCF to fill the `NaN` values of the entire `df2_1` dataset.

In [None]:
df_temp1 = df2_1.copy()
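
A minimal LOCF sketch on a toy dataframe (hypothetical values; `'T2'` is an invented second column) is given below. It uses `DataFrame.ffill()`, which applies to every column, and shows that a `NaN` in the first row has no earlier observation to carry forward.

```python
import numpy as np
import pandas as pd

# Toy data, assumed to be sorted by time already.
toy = pd.DataFrame({
    "T1": [np.nan, 20.0, np.nan, np.nan, 23.0],
    "T2": [5.0, np.nan, 7.0, np.nan, np.nan],
})

toy_locf = toy.ffill()   # forward fill, column by column
print(toy_locf)
```

Any leading `NaN` that remains needs a separate decision, for example a backward fill or dropping that row.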


#### Next Observation Carried Backward (NOCB)



##### [A] Using `df_temp2`, use NOCB to fill the `NaN` values of the entire `df2_2` dataset.

In [None]:
df_temp2 = df2_2.copy()
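
NOCB mirrors LOCF, so its blind spot is at the other end: a `NaN` in the last row has no later observation to copy. The toy sketch below (hypothetical values) shows this and one possible way to close the remaining gap by chaining a forward fill.

```python
import numpy as np
import pandas as pd

# Toy data, assumed to be sorted by time already.
toy = pd.DataFrame({"T1": [np.nan, 20.0, np.nan, 23.0, np.nan]})

toy_nocb = toy.bfill()            # backward fill; the trailing NaN stays
toy_both = toy.bfill().ffill()    # optional: forward fill whatever is left

print(toy_nocb["T1"].tolist())
print(toy_both["T1"].tolist())
```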



#### Linear interpolation



##### [A] Using `df_temp3`, use interpolation to fill the `NaN` values of the `'T1'` column.
> Hint: Set the index of the dataframe to the `'Inspection Date'` column and use `.interpolate(method='index')`. Reading the documentation for these functions will help you do this.

In [None]:
df_temp3 = df2_1.copy()
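
The reason for setting the date as the index can be illustrated with unevenly spaced toy dates (hypothetical values). With the default `method='linear'`, pandas treats rows as equally spaced; with `method='index'`, the estimate is weighted by the actual time gaps.

```python
import numpy as np
import pandas as pd

# Toy data: the missing reading is 1 day after the first and 3 days before the last.
toy = pd.DataFrame({
    "Inspection Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
    "T1": [10.0, np.nan, 18.0],
})

# Position-based: halfway between 10 and 18.
by_position = toy["T1"].interpolate(method="linear")

# Time-based: one quarter of the way from 10 to 18.
indexed = toy.set_index("Inspection Date")
by_time = indexed["T1"].interpolate(method="index")

print(by_position.iloc[1], by_time.iloc[1])
```

Here the position-based estimate is about 14, while the time-based estimate is about 12, because the missing reading is much closer in time to the first observation.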

