# Preprocessing Workflow


🎯 This exercise will take you through the preprocessing workflow. Step by step, feature by feature, you will investigate the dataset and make preprocessing decisions accordingly.

👇 Download the `ML_Houses_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv) and place it in the `data` folder. Then, run the code below to load the dataset and features you will be working with.

In [None]:
## YOUR CODE HERE

👉 Take the time to do a preliminary investigation of the features by reading the dataset description available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset_description.txt). Make sure to refer to it throughout the day.

# Duplicates

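ℹ️ Duplicates in datasets can cause data leakage: if the same row ends up in both the training set and the test set, the model is evaluated on data it has already seen. It is important to locate and remove any meaningless duplicates.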
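As a minimal sketch of how duplicated rows can be counted and dropped with pandas, here is an example on a small toy dataframe. The values and the names `toy`, `n_duplicates` and `deduplicated` are made up for illustration and are not part of the exercise.

```python
import pandas as pd

# Toy dataframe with made-up values: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1710, 1786],
    "BedroomAbvGr": [3, 3, 3, 3],
})

# duplicated() flags each row that repeats an earlier one
# (the first occurrence itself is not flagged).
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()
```

The same two calls should carry over to the real dataset, keeping in mind that only exact row-level repeats are detected this way.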

❓ How many duplicated rows are there in the dataset? Save your answer under variable name `duplicate_count`.

In [None]:
## YOUR CODE HERE

👇 Remove the duplicates from the dataset. Overwrite the dataframe `data`.

In [None]:
## YOUR CODE HERE

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('duplicates',
                         duplicates = duplicate_count,
                         dataset = data
)

result.write()
print(result.check())

# Missing data

👇 Print out the percentage of missing values for all columns of the dataframe.

In [None]:
## YOUR CODE HERE

## `GarageFinish`

👇 Investigate the missing values in `GarageFinish`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute with the column median using Sklearn's `SimpleImputer`
3. Preserve the NaNs and replace them with their actual meaning

Make changes effective in the dataframe `data`.

ℹ️ According to the dataset description, the missing values in `GarageFinish` represent a house having no garage. They need to be encoded as such.
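One possible way to encode missing values as their own category is sketched below on toy data. The category values and the `"NoGarage"` label are assumptions made for this illustration; any label that does not clash with an existing category would work.

```python
import numpy as np
import pandas as pd

# Toy column with made-up values; NaN stands for "no garage" here.
toy = pd.DataFrame({"GarageFinish": ["RFn", np.nan, "Unf", "Fin", np.nan]})

# "NoGarage" is a hypothetical label chosen for this sketch.
toy["GarageFinish"] = toy["GarageFinish"].fillna("NoGarage")
```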

In [None]:
## YOUR CODE HERE

## `RoofSurface`

👇 Investigate the missing values in `RoofSurface`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute with the column median using Sklearn's `SimpleImputer`
3. Preserve the NaNs and replace them with their actual meaning

Make changes effective in the dataframe `data`.

ℹ️ `RoofSurface` has a few missing values that can be imputed by the median value.
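A minimal sketch of median imputation with `SimpleImputer`, assuming a toy column with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with made-up values and two missing entries.
toy = pd.DataFrame({"RoofSurface": [1000.0, np.nan, 3000.0, 2000.0, np.nan]})

imputer = SimpleImputer(strategy="median")

# fit_transform expects 2D input (hence the double brackets) and
# returns a 2D array, which ravel() flattens back into one column.
toy["RoofSurface"] = imputer.fit_transform(toy[["RoofSurface"]]).ravel()

# The value learned during fit is stored in imputer.statistics_.
learned_median = imputer.statistics_[0]
```

The median is computed on the non-missing values only, so here both gaps are filled with 2000.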

In [None]:
## YOUR CODE HERE

👇 When you are done, print out the percentage of missing values for the entire dataframe.

In [None]:
## YOUR CODE HERE

⚠️ Be careful: not all missing values are represented as `np.nan`, and pandas' `isnull()` only detects `np.nan` (along with `None` and `NaT`) ⚠️
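To illustrate the warning above, here is a small sketch in which a placeholder string hides missing values from `isnull()`. The column name `Feature` and the `"?"` placeholder are hypothetical; inspect the actual values of each column to find out which placeholder, if any, a dataset uses.

```python
import numpy as np
import pandas as pd

# Toy column: one real NaN and two "?" placeholders (hypothetical).
toy = pd.DataFrame({"Feature": ["bricks", "?", "?", np.nan]})

# isnull() only sees the real NaN: 1 row out of 4.
visible_pct = toy["Feature"].isnull().mean() * 100

# value_counts(dropna=False) lists every distinct value, which
# makes suspicious placeholders such as "?" easy to spot.
counts = toy["Feature"].value_counts(dropna=False)

# Once the placeholder is turned into NaN, the real share appears.
cleaned = toy["Feature"].replace("?", np.nan)
actual_pct = cleaned.isnull().mean() * 100
```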

## `ChimneyStyle`

👇 Investigate the missing values in `ChimneyStyle`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute with the column median
3. Preserve the NaNs and replace them with their actual meaning

Make changes effective in the dataframe `data`.

ℹ️ `ChimneyStyle` has a lot of missing values. The description does not touch on what they represent. As such, it is better not to make any assumptions and to drop the column entirely.

In [None]:
## YOUR CODE HERE

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = data
)

result.write()
print(result.check())

# Scaling

## `RoofSurface`

👇 Investigate `RoofSurface` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scale
2. Robust Scale
3. MinMax Scale

Replace the original column with the transformed values.

In [None]:
## YOUR CODE HERE

ℹ️ Since `RoofSurface` does not seem to have a normal distribution, it is better to MinMax scale.
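A minimal sketch of MinMax scaling on a toy column with made-up values. `MinMaxScaler` maps the smallest value to 0 and the largest to 1:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy column with made-up values.
toy = pd.DataFrame({"RoofSurface": [1000.0, 2000.0, 3000.0, 5000.0]})

scaler = MinMaxScaler()

# Each value becomes (x - min) / (max - min).
toy["RoofSurface"] = scaler.fit_transform(toy[["RoofSurface"]]).ravel()
```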

In [None]:
## YOUR CODE HERE

## `GrLivArea`

👇 Investigate `GrLivArea` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scale
2. Robust Scale
3. MinMax Scale

Replace the original column with the transformed values.

In [None]:
## YOUR CODE HERE

ℹ️ `GrLivArea` has a normal distribution, and some outliers. It needs to be Robust scaled.
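A minimal sketch of Robust scaling on toy values that include one deliberate outlier. `RobustScaler` centers on the median and divides by the interquartile range, so a single extreme value has little effect on how the other rows are scaled:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy column with made-up values; 9000 plays the role of an outlier.
toy = pd.DataFrame({"GrLivArea": [1000.0, 1200.0, 1400.0, 1600.0, 9000.0]})

scaler = RobustScaler()

# Each value becomes (x - median) / IQR.
# Here the median is 1400 and the IQR is 1600 - 1200 = 400.
scaled = scaler.fit_transform(toy[["GrLivArea"]]).ravel()
toy["GrLivArea"] = scaled
```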

In [None]:
## YOUR CODE HERE

## `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`

👇 Investigate `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`. Then, choose one of the following scaling techniques:

1. MinMax Scale
2. Standard Scale
3. Robust Scale

Replace the original columns with the transformed values.

ℹ️ `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr` are ordinal features that can be MinMax scaled.

In [None]:
## YOUR CODE HERE

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = data
)

result.write()
print(result.check())

# Feature Engineering

## `GarageFinish`

👇 Investigate `GarageFinish` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Add the encoding to the dataframe as new column(s), and remove the original column.

ℹ️ `GarageFinish` is a multicategorical feature that must be one-hot encoded.
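One way to one-hot encode a multicategorical column is sketched below on toy data. The category values, including the `"NoGarage"` label, are assumptions for this illustration, and the new columns are simply named after the categories:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with made-up category values.
toy = pd.DataFrame({"GarageFinish": ["RFn", "Unf", "Fin", "NoGarage", "Unf"]})

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(toy[["GarageFinish"]]).toarray()

# encoder.categories_[0] holds the categories found, in sorted order.
one_hot = pd.DataFrame(encoded, columns=encoder.categories_[0], index=toy.index)

# Add the new columns and remove the original one.
toy = pd.concat([toy.drop(columns="GarageFinish"), one_hot], axis=1)
```

Each row ends up with exactly one column set to 1, the one matching its original category.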

In [None]:
## YOUR CODE HERE

## Encoding `CentralAir`

👇 Investigate `CentralAir` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Replace the original column with the encoding.

ℹ️ `CentralAir` is a binary categorical feature.
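A binary feature fits in a single 0/1 column. The sketch below uses `OrdinalEncoder` on toy data, assuming the two categories are `"Y"` and `"N"`. `OneHotEncoder(drop="if_binary")` would produce an equivalent single column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column, assuming the two categories are "Y" and "N".
toy = pd.DataFrame({"CentralAir": ["Y", "N", "Y", "Y"]})

encoder = OrdinalEncoder()

# Categories are sorted before being numbered, so "N" -> 0.0 and "Y" -> 1.0.
toy["CentralAir"] = encoder.fit_transform(toy[["CentralAir"]]).ravel()
```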

In [None]:
## YOUR CODE HERE

## `MoSold` - Cyclical engineering 

Data can be continuous, discrete, categorical, ordinal, but it can also be cyclical. Temporal data is a prime example of that: months, days, minutes. Such data needs specific preprocessing for Machine Learning models to understand and consider its cyclical nature.

Consider the feature `MoSold`, the month in which the house was sold. If left as is, a model would not understand that after 12 (December) comes 1 (January). It would only consider the values on a linear scale.

👇 Do your own investigation on how to preprocess cyclical features in Machine Learning. Then, transform `MoSold` according to your findings.

Replace the original column with the transformed values.

ℹ️ This [article](https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/) explains how to deal with cyclical features.
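A common approach to cyclical features is to project each value onto a circle with a sine and a cosine. The sketch below applies it to a toy `MoSold` column. The output column names `sin_MoSold` and `cos_MoSold` are hypothetical choices for this illustration.

```python
import numpy as np
import pandas as pd

# Toy column: December, January, February, June.
toy = pd.DataFrame({"MoSold": [12, 1, 2, 6]})

# Map the 12 months onto a full circle (2 * pi radians).
angle = 2 * np.pi * toy["MoSold"] / 12
toy["sin_MoSold"] = np.sin(angle)
toy["cos_MoSold"] = np.cos(angle)

# On the circle, December is as close to January as January is to February.
points = toy[["sin_MoSold", "cos_MoSold"]].to_numpy()
dec_to_jan = np.linalg.norm(points[0] - points[1])
jan_to_feb = np.linalg.norm(points[1] - points[2])

# Replace the original column with the two new ones.
toy = toy.drop(columns="MoSold")
```

Both the sine and the cosine are needed: with only one of them, two different months can share the same value.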

In [None]:
## YOUR CODE HERE

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding',
                         dataset = data
)

result.write()
print(result.check())

# Export the dataset

👇 Now that the dataset has been preprocessed, execute the code below to export it. You will keep working on it in the next exercise.

In [None]:
data.to_csv("../02-Feature-Selection/data/clean_dataset.csv", index=False)

# 🏁