# Preprocessing Workflow


🎯 This exercise will take you through the preprocessing workflow. Step by step, feature by feature, you will investigate the dataset and take preprocessing decisions accordingly.

👇 Download the `ML_Houses_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv) and save it in the `data` folder as `ML_Houses_dataset.csv`. Then, run the code below to load the dataset and features you will be working with.

In [10]:
import numpy as np

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv")

data = data[['GrLivArea','BedroomAbvGr','KitchenAbvGr', 'OverallCond','RoofSurface','GarageFinish','CentralAir','ChimneyStyle','MoSold','SalePrice']].copy()

data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,ChimneyStyle,MoSold,SalePrice
0,1710,3,1,5,1995.0,RFn,Y,bricks,2,208500
1,1262,3,1,8,874.0,RFn,Y,bricks,5,181500
2,1786,3,1,5,1593.0,RFn,Y,castiron,9,223500
3,1717,3,1,5,2566.0,Unf,Y,castiron,2,140000
4,2198,4,1,5,3130.0,RFn,Y,bricks,12,250000


👉 Take the time to do a preliminary investigation of the features by reading the dataset description available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset_description.txt). Make sure to refer to it throughout the day.

# Duplicates

ℹ️ Duplicates in datasets can cause data leakage. It is important to locate and remove any meaningless duplicates.
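ℹ️ As a minimal sketch of the idea, the toy dataframe below (hypothetical values, not taken from the dataset) contains one repeated row. `duplicated()` flags every repeat of an earlier row, and `drop_duplicates()` keeps only the first occurrence.

```python
import pandas as pd

# Toy frame (hypothetical values): the second and third rows are identical
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1262, 1786],
    "SalePrice": [208500, 181500, 181500, 223500],
})

# duplicated() marks each repeat of an earlier row as True
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```

If a duplicated house ended up in both the training and the test set, the model would be evaluated on a row it has already seen, which is the leakage mentioned above.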

❓ How many duplicated rows are there in the dataset? Save your answer under variable name `duplicate_count`.

In [3]:
# YOUR CODE HERE
duplicate_count = data.duplicated().sum()
duplicate_count

300

👇 Remove the duplicates from the dataset. Overwrite the dataframe `data`.

In [4]:
# YOUR CODE HERE
data.drop_duplicates(inplace=True)

### ☑️ Test your code

In [5]:
from nbresult import ChallengeResult

result = ChallengeResult('duplicates',
                         duplicates = duplicate_count,
                         dataset = data
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/01-Preprocessing-Workflow
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 2 items

tests/test_duplicates.py::TestDuplicates::test_dataset_length PASSED     [ 50%]
tests/test_duplicates.py::TestDuplicates::test_duplicate_count PASSED    [100%]



💯 You can commit your code:

git add tests/duplicates.pickle

git commit -m 'Completed duplicates step'

git push origin master


# Missing data

👇 Print out the percentage of missing values for all columns of the dataframe.

In [7]:
# YOUR CODE HERE
data.isnull().sum()/len(data)

GrLivArea       0.000000
BedroomAbvGr    0.000000
KitchenAbvGr    0.000000
OverallCond     0.000000
RoofSurface     0.006164
GarageFinish    0.055479
CentralAir      0.000000
ChimneyStyle    0.000000
MoSold          0.000000
SalePrice       0.000000
dtype: float64
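ℹ️ The values above are fractions (0.055 means about 5.5%). A possible way to display them as percentages, sorted from most to least affected column, is sketched below on a toy dataframe with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with missing entries in two columns
toy = pd.DataFrame({
    "RoofSurface": [1995.0, np.nan, 1593.0, 2566.0],
    "GarageFinish": ["RFn", None, None, "Unf"],
    "MoSold": [2, 5, 9, 2],
})

# The mean of a boolean mask is the fraction of True values,
# so multiplying by 100 gives the percentage of missing values per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```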

## `GarageFinish`

👇 Investigate the missing values in `GarageFinish`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using Sklearn's `SimpleImputer`
3. Preserve the NaNs and replace by actual meaning

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 Hint</summary>
ℹ️ According to the dataset description, the missing values in `GarageFinish` represent a house having no garage. They need to be encoded as such.
</details>

In [8]:
# YOUR CODE HERE
data.GarageFinish.unique()

array(['RFn', 'Unf', 'Fin', nan], dtype=object)

In [13]:
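# Assign the result back instead of calling replace(..., inplace=True) on the attribute,
# which may not update `data` in recent pandas versions
data["GarageFinish"] = data["GarageFinish"].fillna("NoGarage")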
data.GarageFinish.unique()

array(['RFn', 'Unf', 'Fin', 'NoGarage'], dtype=object)

## `RoofSurface`

👇 Investigate the missing values in `RoofSurface`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using Sklearn's `SimpleImputer`
3. Preserve the NaNs and replace by actual meaning

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 Hint</summary>
ℹ️ `RoofSurface` has a few missing values that can be imputed by the median value.
</details>
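ℹ️ To see why the median is usually the safer choice here, the sketch below compares both strategies on a few hypothetical surfaces that include one very large value. The numbers are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical surfaces with one missing value and one large outlier
surfaces = np.array([[800.0], [900.0], [1000.0], [np.nan], [9000.0]])

median_imputer = SimpleImputer(strategy="median")
mean_imputer = SimpleImputer(strategy="mean")

filled_median = median_imputer.fit_transform(surfaces)
filled_mean = mean_imputer.fit_transform(surfaces)

# statistics_ holds the value learned during fit for each column
print(median_imputer.statistics_)  # median of the observed values
print(mean_imputer.statistics_)    # mean, pulled upwards by the outlier
```

The mean is dragged towards the outlier, whereas the median stays close to the typical values.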

In [15]:
# YOUR CODE HERE
from sklearn.impute import SimpleImputer

# Impute with the median, as the exercise asks (less sensitive to outliers than the mean)
spl = SimpleImputer(strategy="median")
spl.fit(data[['RoofSurface']])
data[['RoofSurface']] = spl.transform(data[['RoofSurface']])
data.RoofSurface

0       1995.0
1        874.0
2       1593.0
3       2566.0
4       3130.0
         ...  
1455    1698.0
1456    2645.0
1457     722.0
1458    3501.0
1459    3082.0
Name: RoofSurface, Length: 1460, dtype: float64

👇 When you are done, print out the percentage of missing values for the entire dataframe.

In [16]:
# YOUR CODE HERE
data.isnull().sum()/len(data)

GrLivArea       0.0
BedroomAbvGr    0.0
KitchenAbvGr    0.0
OverallCond     0.0
RoofSurface     0.0
GarageFinish    0.0
CentralAir      0.0
ChimneyStyle    0.0
MoSold          0.0
SalePrice       0.0
dtype: float64

⚠️ Be careful: not all missing values are represented as `np.nan`, and pandas' `isnull()` only detects `np.nan` (along with `None` and `NaT`). Placeholder strings or sentinel values go unnoticed ⚠️
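ℹ️ A minimal sketch of how such hidden missing values could be spotted. The column name `style` and the `"?"` placeholder are assumptions made for illustration; check the actual values of each column before replacing anything.

```python
import numpy as np
import pandas as pd

# Hypothetical column where "?" stands in for an unknown value
toy = pd.DataFrame({"style": ["bricks", "?", "castiron", "?", "?"]})

# isnull() reports nothing, because "?" is an ordinary string
hidden = toy["style"].isnull().sum()

# value_counts() reveals the suspicious placeholder
counts = toy["style"].value_counts()

# Once converted to np.nan, the missing values become visible
cleaned = toy["style"].replace("?", np.nan)

print(hidden, counts["?"], cleaned.isnull().mean())
```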

## `ChimneyStyle`

👇 Investigate the missing values in `ChimneyStyle`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median
3. Preserve the NaNs and replace by actual meaning

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 Hint</summary>
ℹ️ `ChimneyStyle` has a lot of missing values. The description does not touch on what they represent. As such, it is better not to make any assumptions and to drop the column entirely.
</details>

In [23]:
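# YOUR CODE HERE
# Drop the column entirely, since the meaning of its missing values is unknown
data.drop(columns=['ChimneyStyle'], inplace=True)

data.head()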

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,1710,3,1,5,1995.0,RFn,Y,2,208500
1,1262,3,1,8,874.0,RFn,Y,5,181500
2,1786,3,1,5,1593.0,RFn,Y,9,223500
3,1717,3,1,5,2566.0,Unf,Y,2,140000
4,2198,4,1,5,3130.0,RFn,Y,12,250000


### ☑️ Test your code

In [24]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = data
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/01-Preprocessing-Workflow
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 2 items

tests/test_missing_values.py::TestMissing_values::test_nans PASSED       [ 50%]
tests/test_missing_values.py::TestMissing_values::test_number_of_columns PASSED [100%]



💯 You can commit your code:

git add tests/missing_values.pickle

git commit -m 'Completed missing_values step'

git push origin master


# Scaling

##  `RoofSurface` 

👇 Investigate `RoofSurface` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scale
2. Robust Scale
3. MinMax Scale

Replace the original columns by the transformed values.

In [33]:
# YOUR CODE HERE
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
mms.fit(data[['RoofSurface']])
data[['RoofSurface']] = mms.transform(data[['RoofSurface']])


RoofSurface    0.0
dtype: float64

<details>
    <summary>💡 Hint</summary>
ℹ️ Since `RoofSurface` does not seem to have a normal distribution, it is better to MinMax scale.
</details>
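ℹ️ For reference, MinMax scaling maps each value to `(x - min) / (max - min)`, so the result always lies between 0 and 1. The sketch below checks this on a few hypothetical values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical roof surfaces
values = np.array([[500.0], [1500.0], [2500.0], [4500.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

# Same computation written out by hand
manual = (values - values.min()) / (values.max() - values.min())

print(scaled.ravel())
```

Because the minimum and maximum define the scale, a single extreme value compresses all other points, which is worth keeping in mind when the feature has outliers.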

In [0]:
# YOUR CODE HERE


## `GrLivArea`

👇 Investigate `GrLivArea` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scale
2. Robust Scale
3. MinMax Scale

Replace the original columns by the transformed values.

In [34]:
# YOUR CODE HERE
from sklearn.preprocessing import RobustScaler

rs = RobustScaler()
rs.fit(data[['GrLivArea']])
data[['GrLivArea']] = rs.transform(data[['GrLivArea']])


0       0.380070
1      -0.312090
2       0.497489
3       0.390885
4       1.134029
          ...   
1455    0.282735
1456    0.940904
1457    1.353418
1458   -0.596369
1459   -0.321360
Name: GrLivArea, Length: 1460, dtype: float64

<details>
    <summary>💡 Hint</summary>
ℹ️ `GrLivArea` has a normal distribution, and some outliers. It needs to be Robust scaled.
</details>
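ℹ️ Robust scaling centers on the median and divides by the interquartile range, so a few extreme values barely change the scale. The sketch below contrasts it with standard scaling on hypothetical living areas that include one outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical living areas with one very large house
areas = np.array([[1000.0], [1200.0], [1400.0], [1600.0], [1800.0], [10000.0]])

robust = RobustScaler().fit(areas)
standard = StandardScaler().fit(areas)

# RobustScaler: center_ is the median, scale_ is the interquartile range
print(robust.center_, robust.scale_)

# StandardScaler: mean_ and scale_ (standard deviation) are both inflated by the outlier
print(standard.mean_, standard.scale_)
```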

In [0]:
# YOUR CODE HERE

## `BedroomAbvGr` ,  `OverallCond` & `KitchenAbvGr`

👇 Investigate `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`. Then, choose one of the following scaling techniques:

1. MinMax Scale
2. Standard Scale
3. Robust Scale

Replace the original columns by the transformed values.

<details>
    <summary>💡 Hint</summary>
ℹ️ `BedroomAbvGr` ,  `OverallCond` & `KitchenAbvGr` are ordinal features that can be MinMax scaled.
</details>

In [42]:
# YOUR CODE HERE
features_to_transform = ['BedroomAbvGr', 'OverallCond' , 'KitchenAbvGr']
for feature in features_to_transform:
    trans_model = MinMaxScaler()
    trans_model.fit(data[[feature]])
    data[[feature]] = trans_model.transform(data[[feature]])
data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,0.38007,0.375,0.333333,0.5,0.316729,RFn,Y,2,208500
1,-0.31209,0.375,0.333333,0.875,0.06965,RFn,Y,5,181500
2,0.497489,0.375,0.333333,0.5,0.228124,RFn,Y,9,223500
3,0.390885,0.375,0.333333,0.5,0.442583,Unf,Y,2,140000
4,1.134029,0.5,0.333333,0.5,0.566894,RFn,Y,12,250000


### ☑️ Test your code

In [43]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = data
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/01-Preprocessing-Workflow
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 3 items

tests/test_scaling.py::TestScaling::test_bedroom_kitchen_condition PASSED [ 33%]
tests/test_scaling.py::TestScaling::test_gr_liv_area PASSED              [ 66%]
tests/test_scaling.py::TestScaling::test_roof_surface PASSED             [100%]



💯 You can commit your code:

git add tests/scaling.pickle

git commit -m 'Completed scaling step'

git push origin master


# Feature Engineering

## `GarageFinish`

👇 Investigate `GarageFinish` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Add the encoding to the dataframe as new column(s), and remove the original column.


<details>
    <summary>💡 Hint</summary>
ℹ️ `GarageFinish` is a multicategorical feature that must be One-Hot encoded.
</details>
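ℹ️ One point worth checking when One-Hot encoding: by default the encoder sorts the categories, so the column order is not the order in which the values appear in the data. The sketch below (toy data) takes the column names from `categories_` so that labels and columns cannot get mixed up. It uses `.toarray()` on the default sparse output, which should work regardless of the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with the four categories seen in the exercise
toy = pd.DataFrame({"GarageFinish": ["RFn", "Unf", "Fin", "NoGarage", "RFn"]})

encoder = OneHotEncoder()

# The default output is a sparse matrix; .toarray() makes it dense
encoded = encoder.fit_transform(toy[["GarageFinish"]]).toarray()

# categories_ lists the categories in the same order as the encoded columns
categories = list(encoder.categories_[0])
encoded_df = pd.DataFrame(encoded, columns=categories, index=toy.index)

print(categories)
print(encoded_df)
```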

In [64]:
data.GarageFinish.unique()

array(['RFn', 'Unf', 'Fin', 'NoGarage'], dtype=object)

In [75]:
# YOUR CODE HERE
from sklearn.preprocessing import OneHotEncoder

garage = data[['GarageFinish']]

# Set the category order explicitly (by default the encoder sorts categories alphabetically)
# (on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False)
ohe = OneHotEncoder(categories=[['RFn', 'Unf', 'Fin', 'NoGarage']], sparse=False)
garage_encoded = ohe.fit_transform(garage)

# Name each new column from ohe.categories_ so labels always match the encoded columns
for category, column in zip(ohe.categories_[0], garage_encoded.T):
    data[category] = column

data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice,RFn,Unf,Fin,NoGarage
0,0.38007,0.375,0.333333,0.5,0.316729,RFn,Y,2,208500,1.0,0.0,0.0,0.0
1,-0.31209,0.375,0.333333,0.875,0.06965,RFn,Y,5,181500,1.0,0.0,0.0,0.0
2,0.497489,0.375,0.333333,0.5,0.228124,RFn,Y,9,223500,1.0,0.0,0.0,0.0
3,0.390885,0.375,0.333333,0.5,0.442583,Unf,Y,2,140000,0.0,1.0,0.0,0.0
4,1.134029,0.5,0.333333,0.5,0.566894,RFn,Y,12,250000,1.0,0.0,0.0,0.0


## Encoding  `CentralAir`

👇 Investigate `CentralAir` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Replace the original column by the encoding.


<details>
    <summary>💡 Hint</summary>
ℹ️ `CentralAir` is a binary categorical feature.
</details>
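ℹ️ For a binary feature, a single 0/1 column carries all the information. A possible sketch on toy data, using `drop="if_binary"` so that the encoder keeps only one column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with the two values of a binary feature
toy = pd.DataFrame({"CentralAir": ["Y", "N", "Y", "Y"]})

# drop="if_binary" removes the first category ("N") when there are exactly two
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(toy[["CentralAir"]]).toarray()

# The remaining column is 1.0 for "Y" and 0.0 for "N"
toy["CentralAir"] = encoded.ravel()

print(toy)
```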

In [81]:
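# YOUR CODE HERE
from sklearn.preprocessing import OneHotEncoder

# drop='if_binary' keeps a single 0/1 column for a two-category feature
# (on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False)
ohe = OneHotEncoder(drop='if_binary', sparse=False)
ohe.fit(data[['CentralAir']])
data['CentralAir'] = ohe.transform(data[['CentralAir']])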

data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice,RFn,Unf,Fin,NoGarage
0,0.38007,0.375,0.333333,0.5,0.316729,RFn,1.0,2,208500,1.0,0.0,0.0,0.0
1,-0.31209,0.375,0.333333,0.875,0.06965,RFn,1.0,5,181500,1.0,0.0,0.0,0.0
2,0.497489,0.375,0.333333,0.5,0.228124,RFn,1.0,9,223500,1.0,0.0,0.0,0.0
3,0.390885,0.375,0.333333,0.5,0.442583,Unf,1.0,2,140000,0.0,1.0,0.0,0.0
4,1.134029,0.5,0.333333,0.5,0.566894,RFn,1.0,12,250000,1.0,0.0,0.0,0.0


## `MoSold` - Cyclical engineering 

Data can be continuous, discrete, categorical, ordinal, but it can also be cyclical. Temporal data is a prime example of that: months, days, minutes. Such data needs specific preprocessing for Machine Learning models to understand and consider its cyclical nature.

Consider the feature `MoSold`, the month in which the house was sold. If left as is, a model would not understand that after 12 (December) comes 1 (January). It would only consider the values on a linear scale.

👇 Do your own investigation on how to preprocess cyclical features in Machine Learning. Then, transform `MoSold` according to your findings.

⚠️ Replace the original column by the new features.

<details>
    <summary>💡 Hint</summary>
ℹ️ This <a href='https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time' target='blank'>article</a> explains how to deal with cyclical features.
    
</details>

**❓ How would you name these 2 new features?**

Let's add two new columns to your data frame, `sin_MoSold` and `cos_MoSold`, containing the sine and cosine of the `MoSold` column respectively.  Once these columns are added we can drop the original `MoSold` column.
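ℹ️ To illustrate why the sine and cosine pair helps, the sketch below places three months on the unit circle and compares distances. December and January end up close together, while December and June sit on opposite sides. The helper `distance` is a hypothetical name used only for this illustration.

```python
import numpy as np

# January, June and December as month numbers
months = np.array([1, 6, 12])

# Map each month to an angle on a 12-step circle
angle = 2 * np.pi * months / 12
points = np.column_stack([np.sin(angle), np.cos(angle)])


def distance(a, b):
    """Euclidean distance between two encoded months (by position in `months`)."""
    return np.linalg.norm(points[a] - points[b])


dec_jan = distance(2, 0)
dec_jun = distance(2, 1)

print(dec_jan, dec_jun)
```

On the raw 1 to 12 scale, December and January would be 11 units apart, the largest possible gap. After the transformation they are neighbours.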

In [92]:
# YOUR CODE HERE
data['sin_MoSold'] = np.sin(2*np.pi*data.MoSold/12)
data['cos_MoSold'] = np.cos(2*np.pi*data.MoSold/12)
# Drop only columns that exist: the original month and the already-encoded GarageFinish
data.drop(columns=['MoSold', 'GarageFinish'], inplace=True)
data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,CentralAir,SalePrice,RFn,Unf,Fin,NoGarage,sin_MoSold,cos_MoSold
0,0.38007,0.375,0.333333,0.5,0.316729,1.0,208500,1.0,0.0,0.0,0.0,0.8660254,0.5
1,-0.31209,0.375,0.333333,0.875,0.06965,1.0,181500,1.0,0.0,0.0,0.0,0.5,-0.8660254
2,0.497489,0.375,0.333333,0.5,0.228124,1.0,223500,1.0,0.0,0.0,0.0,-1.0,-1.83697e-16
3,0.390885,0.375,0.333333,0.5,0.442583,1.0,140000,0.0,1.0,0.0,0.0,0.8660254,0.5
4,1.134029,0.5,0.333333,0.5,0.566894,1.0,250000,1.0,0.0,0.0,0.0,-2.449294e-16,1.0


### ☑️ Test your code

In [93]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding', dataset = data, new_features = ['sin_MoSold', 'cos_MoSold'])

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/01-Preprocessing-Workflow
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 4 items

tests/test_encoding.py::TestEncoding::test_central_air PASSED            [ 25%]
tests/test_encoding.py::TestEncoding::test_columns PASSED                [ 50%]
tests/test_encoding.py::TestEncoding::test_month_sold_features PASSED    [ 75%]
tests/test_encoding.py::TestEncoding::test_month_sold_features_number PASSED [100%]



💯 You can commit your code:

git add tests/encoding.pickle

git commit -m 'Completed encoding step'

git push origin master


# Export the dataset

👇 Now that the dataset has been preprocessed, execute the code below to export it. You will keep working on it in the next exercise.

In [94]:
data.to_csv("../02-Feature-Selection/data/clean_dataset.csv", index=False)

# 🏁