# Preprocessing Workflow


🎯 This exercise will guide you through the preprocessing workflow. Step by step, feature by feature, you will investigate the dataset and make preprocessing decisions accordingly.

🌤 We stored the `ML_Houses_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv) in the cloud.

👇 Run the code down below to load the dataset and features you will be working with.

In [30]:
import pandas as pd

# Loading the dataset
url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv"
data = pd.read_csv(url)

# Selecting some columns of interest
selected_features = ['GrLivArea',
                     'BedroomAbvGr',
                     'KitchenAbvGr',
                     'OverallCond',
                     'RoofSurface',
                     'GarageFinish',
                     'CentralAir',
                     'ChimneyStyle',
                     'MoSold',
                     'SalePrice']

# Overwriting the "data" variable to keep only the columns of interest
# Notice the .copy() to copy the values
data = data[selected_features].copy()

# Showing the first five rows
data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,ChimneyStyle,MoSold,SalePrice
0,1710,3,1,5,1995.0,RFn,Y,bricks,2,208500
1,1262,3,1,8,874.0,RFn,Y,bricks,5,181500
2,1786,3,1,5,1593.0,RFn,Y,castiron,9,223500
3,1717,3,1,5,2566.0,Unf,Y,castiron,2,140000
4,2198,4,1,5,3130.0,RFn,Y,bricks,12,250000


📚 Take the time to do a ***preliminary investigation*** of the features by reading the ***dataset description*** available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset_description.txt). Make sure to refer to it throughout the day.

## (1) Duplicates

ℹ️ ***Duplicates in datasets can cause data leakage:*** an identical row may end up in both the training set and the test set.

👉 It is important to locate and remove duplicates.
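
As a minimal sketch of how duplicates can be located and removed, here is a toy dataframe (the values are made up for illustration and are not taken from the houses dataset):

```python
import pandas as pd

# Toy data (hypothetical values): the third row repeats the first one
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1710, 1786],
    "SalePrice": [208500, 181500, 208500, 223500],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of every row
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```

Note that `duplicated()` compares entire rows by default, so two houses that merely share a sale price are not counted as duplicates.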

❓ How many duplicated rows are there in the dataset ❓

<i>Save your answer under variable name `duplicate_count`.</i>

In [31]:
duplicate_count = data.duplicated().sum()
duplicate_count

300

❓ Remove the duplicates from the dataset. Overwrite the dataframe `data` ❓

In [32]:
#removing duplicates
data = data.drop_duplicates()

🧪 **Test your code**

In [33]:
from nbresult import ChallengeResult

result = ChallengeResult('duplicates',
                         duplicates = duplicate_count,
                         dataset = data
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/baska/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/baska/code/Svvcm/05-ML/02-Prepare-the-dataset/data-preprocessing-workflow/tests
plugins: typeguard-2.13.3, anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 2 items

test_duplicates.py::TestDuplicates::test_dataset_length PASSED           [ 50%]
test_duplicates.py::TestDuplicates::test_duplicate_count PASSED          [100%]



💯 You can commit your code:

git add tests/duplicates.pickle

git commit -m 'Completed duplicates step'

git push origin master



## (2) Missing data

❓ Print the percentage of missing values for every column of the dataframe (the code below reports it as a fraction between 0 and 1). ❓

In [34]:
data.isnull().sum() / len(data)

GrLivArea       0.000000
BedroomAbvGr    0.000000
KitchenAbvGr    0.000000
OverallCond     0.000000
RoofSurface     0.006164
GarageFinish    0.055479
CentralAir      0.000000
ChimneyStyle    0.000000
MoSold          0.000000
SalePrice       0.000000
dtype: float64

### `GarageFinish`

❓ **Questions** about `GarageFinish` ❓

Investigate the missing values in `GarageFinish`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using `SimpleImputer` from Scikit-Learn
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ According to the dataset description, the missing values in `GarageFinish` represent a house having no garage. They need to be encoded as such.
</details>

In [35]:
import numpy as np

data.GarageFinish = data.GarageFinish.replace(np.nan, "NoGarage") # Replace NaN by "NoGarage"
data.GarageFinish.value_counts() # Check count of each category

Unf         605
RFn         422
Fin         352
NoGarage     81
Name: GarageFinish, dtype: int64

### `RoofSurface`

❓ **Questions** about `RoofSurface` ❓

Investigate the missing values in `RoofSurface`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using sklearn's `SimpleImputer`
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `RoofSurface` has a few missing values that can be imputed by the median value.
</details>
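
To make the median imputation concrete, here is a small sketch on a toy column (the values are hypothetical; only the `SimpleImputer` usage mirrors what the exercise asks for):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values) with one missing entry
toy = pd.DataFrame({"RoofSurface": [874.0, 1593.0, np.nan, 1995.0, 3130.0]})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median of the observed values and fills the NaN with it;
# it returns a 2D array, hence the ravel() to get back a single column
toy["RoofSurface"] = imputer.fit_transform(toy[["RoofSurface"]]).ravel()

# The learned median is stored in statistics_
print(imputer.statistics_, toy["RoofSurface"].tolist())
```

The median is computed on the non-missing values only, so the imputed entry sits in the middle of the observed distribution.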

In [36]:
data.sample(10)

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,ChimneyStyle,MoSold,SalePrice
242,1440,3,1,4,4234.0,Unf,N,?,4,79000
397,1812,4,1,5,682.0,RFn,Y,?,7,169500
1184,1572,3,1,4,4555.0,RFn,Y,?,5,186700
437,904,2,1,7,2888.0,Unf,Y,?,1,119000
638,796,2,1,7,2297.0,NoGarage,Y,?,5,85000
803,2822,4,1,5,4099.0,RFn,Y,?,1,582933
847,864,3,1,6,2721.0,Unf,Y,?,8,133500
478,1718,3,1,5,1200.0,RFn,Y,?,9,297000
86,1560,3,1,5,4652.0,Fin,Y,?,3,174000
75,988,2,1,5,1709.0,Unf,Y,?,11,91000


In [37]:
from sklearn.impute import SimpleImputer

impute = SimpleImputer(strategy="median")

impute.fit(data[["RoofSurface"]])

data["RoofSurface"] = impute.transform(data[["RoofSurface"]])

impute.statistics_

array([2906.])

### `ChimneyStyle`

❓ **Questions** about `ChimneyStyle` ❓

Investigate the missing values in `ChimneyStyle`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 <i>Hint</i></summary>
    
* ⚠️ Be careful: not all missing values are represented as `np.nan`, and pandas' `isnull()` only detects `np.nan`...
    
* ℹ️ `ChimneyStyle` has a lot of missing values. The description does not touch on what they represent. As such, it is better not to make any assumptions and to drop the column entirely.
    

</details>
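
The warning in the hint can be illustrated with a short sketch. It assumes the missing entries are written as the string `"?"`, as they appear in the `data.sample(10)` output above; the toy values themselves are made up:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) mixing a real NaN with "?" placeholders
toy = pd.DataFrame({"ChimneyStyle": ["bricks", "?", "castiron", "?", np.nan]})

# isnull() only sees the real NaN
nan_share = toy["ChimneyStyle"].isnull().mean()

# Counting the placeholder explicitly reveals the hidden missing values
placeholder_share = (toy["ChimneyStyle"] == "?").mean()

# Converting the placeholder to NaN lets isnull() report both kinds at once
total_share = toy["ChimneyStyle"].replace("?", np.nan).isnull().mean()

print(nan_share, placeholder_share, total_share)
```

Running the same kind of check on the real column is one way to see how much of `ChimneyStyle` is actually unknown before deciding to drop it.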

In [38]:
data = data.drop(columns="ChimneyStyle")


In [39]:
data.sample(10)

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
191,1479,4,1,9,2446.0,Fin,Y,6,184000
1080,1258,3,1,7,2020.0,RFn,Y,10,145000
476,1493,3,1,5,4537.0,Fin,Y,4,208900
901,1306,2,1,7,5065.0,Unf,Y,5,153000
422,988,2,1,5,2391.0,RFn,Y,2,113000
179,968,2,1,6,1168.0,Unf,Y,7,100000
665,1936,3,1,5,4054.0,RFn,Y,11,230500
471,1969,3,1,6,3190.0,RFn,Y,11,190000
858,1370,3,1,5,1675.0,RFn,Y,6,152000
1437,1932,2,1,5,3482.0,Fin,Y,11,394617


🧪 **Test your code**

In [40]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = data
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/baska/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/baska/code/Svvcm/05-ML/02-Prepare-the-dataset/data-preprocessing-workflow/tests
plugins: typeguard-2.13.3, anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 2 items

test_missing_values.py::TestMissing_values::test_nans PASSED             [ 50%]
test_missing_values.py::TestMissing_values::test_number_of_columns PASSED [100%]



💯 You can commit your code:

git add tests/missing_values.pickle

git commit -m 'Completed missing_values step'

git push origin master



❓ When you are done handling missing values, print out the percentage of missing values for the entire dataframe ❓

You should no longer have missing values!

In [41]:
data.isnull().sum() / len(data)

GrLivArea       0.0
BedroomAbvGr    0.0
KitchenAbvGr    0.0
OverallCond     0.0
RoofSurface     0.0
GarageFinish    0.0
CentralAir      0.0
MoSold          0.0
SalePrice       0.0
dtype: float64

## (3) Scaling

**First of all, before scaling...**

To understand the effects of scaling and encoding on model performance, let's get a **base score without any data transformation**.

❓ Cross-validate a linear regression model that predicts `SalePrice` using the other features ❓

⚠️ Note that a linear regression model can only handle numeric features. [DataFrame.select_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) can help.
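
As a sketch of how `select_dtypes` can separate numeric features from the rest, consider this toy frame (hypothetical values, reusing a few column names from the dataset):

```python
import pandas as pd

# Toy frame (hypothetical values) with numeric and text columns
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "CentralAir": ["Y", "Y", "N"],
    "SalePrice": [208500, 181500, 223500],
})

# Keep only the numeric columns, then separate the target from the features
numeric = toy.select_dtypes(include="number")
X = numeric.drop(columns="SalePrice")
y = numeric["SalePrice"]

print(list(X.columns))
```

This avoids listing the numeric columns by hand, which is convenient when the set of columns changes during preprocessing.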

In [42]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X_old=data[["GrLivArea","BedroomAbvGr","KitchenAbvGr","OverallCond","RoofSurface","MoSold"]]
y_old=data["SalePrice"]

# Instantiate model
model = LinearRegression()
# 5-Fold Cross validate model
cv_results = cross_validate(model, X_old, y_old, cv=5)
# Scores
print(cv_results['test_score'])
# Mean of scores
cv_results['test_score'].mean()

[0.62648299 0.587701   0.63354831 0.57682692 0.43874229]


0.5726603017210607

Keep this score in mind! You will train a new model after data preprocessing in Challenge #2 - see if it improves your average score 😉

🚀 Now, back to **feature scaling**!

###  `RoofSurface` 

❓ **Question** about `RoofSurface` ❓

👇 Investigate `RoofSurface` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scaler
2. Robust Scaler
3. MinMax Scaler

Replace the original columns with the transformed values.

In [43]:
data

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,1710,3,1,5,1995.0,RFn,Y,2,208500
1,1262,3,1,8,874.0,RFn,Y,5,181500
2,1786,3,1,5,1593.0,RFn,Y,9,223500
3,1717,3,1,5,2566.0,Unf,Y,2,140000
4,2198,4,1,5,3130.0,RFn,Y,12,250000
...,...,...,...,...,...,...,...,...,...
1455,1647,3,1,5,1698.0,RFn,Y,8,175000
1456,2073,3,1,6,2645.0,Unf,Y,2,210000
1457,2340,4,1,9,722.0,RFn,Y,5,266500
1458,1078,2,1,6,3501.0,Unf,Y,4,142125


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Step 1 - Instantiate and fit the MinMax scaler
mm_scaler = MinMaxScaler()
mm_scaler.fit(data[["RoofSurface"]])

# Step 2 - Scale / Transform
data[["RoofSurface"]] = mm_scaler.transform(data[["RoofSurface"]])


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ Since `RoofSurface` has neither a Gaussian distribution, nor outliers $\rightarrow$ MinMaxScaler.
</details>
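
For reference, min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below checks this formula against `MinMaxScaler` on a few made-up values (they are illustrative, not the real column):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column (hypothetical values), shaped as one feature with five samples
values = np.array([[874.0], [1593.0], [1995.0], [2566.0], [3130.0]])

# Manual min-max scaling: the smallest value maps to 0 and the largest to 1
manual = (values - values.min()) / (values.max() - values.min())

# Same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

Because the result depends directly on the minimum and maximum, a single extreme value would squeeze all other values into a narrow band, which is why this scaler suits features without outliers.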

### `GrLivArea`

❓ **Question** about `GrLivArea` ❓

👇 Investigate `GrLivArea` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scaler
2. Robust Scaler
3. MinMax Scaler

Replace the original columns with the transformed values.

In [45]:
data[['GrLivArea']].boxplot();

In [46]:
from sklearn.preprocessing import RobustScaler

rb_scaler = RobustScaler()

rb_scaler.fit(data[["GrLivArea"]])

data[["GrLivArea"]] = rb_scaler.transform(data[["GrLivArea"]])

data

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,0.380070,3,1,5,0.316729,RFn,Y,2,208500
1,-0.312090,3,1,8,0.069650,RFn,Y,5,181500
2,0.497489,3,1,5,0.228124,RFn,Y,9,223500
3,0.390885,3,1,5,0.442583,Unf,Y,2,140000
4,1.134029,4,1,5,0.566894,RFn,Y,12,250000
...,...,...,...,...,...,...,...,...,...
1455,0.282735,3,1,5,0.251267,RFn,Y,8,175000
1456,0.940904,3,1,6,0.459996,Unf,Y,2,210000
1457,1.353418,4,1,9,0.036147,RFn,Y,5,266500
1458,-0.596369,2,1,6,0.648667,Unf,Y,4,142125


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `GrLivArea` has many outliers $\rightarrow$ RobustScaler()
</details>
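
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (IQR). The sketch below reproduces that on toy values and counts outliers with the common 1.5 × IQR rule; the values and the choice of that rule are assumptions made for illustration:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column (hypothetical values) with one very large house
values = np.array([[900.0], [1200.0], [1500.0], [1700.0], [2000.0], [5600.0]])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Manual robust scaling: center on the median, scale by the IQR
manual = (values - median) / iqr
scaled = RobustScaler().fit_transform(values)

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
n_outliers = int(is_outlier.sum())

print(np.allclose(manual, scaled), n_outliers)
```

Since the median and the IQR barely move when an extreme value is added, the bulk of the data keeps a comparable scale even in the presence of outliers.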

### `BedroomAbvGr` ,  `OverallCond` & `KitchenAbvGr`

❓ **Questions** about `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr` ❓

👇 Investigate `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`. Then, choose one of the following scaling techniques:

1. MinMax Scaler
2. Standard Scaler
3. Robust Scaler

Replace the original columns with the transformed values.

<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr` are ordinal features. Fewer than 0.1% of the values are outliers, so there is no need to use _RobustScaler()_. The distribution is not Gaussian, hence no _StandardScaler()_. By elimination, you can confidently choose _MinMaxScaler()_.
</details>

In [47]:
data.describe()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,MoSold,SalePrice
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,0.07841,2.866438,1.046575,5.575342,0.508148,6.321918,180921.19589
std,0.813952,0.815778,0.220338,1.112799,0.291583,2.703626,79442.502883
min,-2.263422,0.0,0.0,1.0,0.0,1.0,34900.0
25%,-0.516802,2.0,1.0,5.0,0.246143,5.0,129975.0
50%,0.0,3.0,1.0,5.0,0.517523,6.0,163000.0
75%,0.483198,3.0,1.0,6.0,0.761406,8.0,214000.0
max,6.455002,8.0,3.0,9.0,1.0,12.0,755000.0


In [48]:
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMax Scaler
MM_scaler = MinMaxScaler()

MM_scaler.fit(data[["BedroomAbvGr"]])
data[["BedroomAbvGr"]] = MM_scaler.transform(data[["BedroomAbvGr"]])

MM_scaler.fit(data[["OverallCond"]])
data[["OverallCond"]] = MM_scaler.transform(data[["OverallCond"]])

MM_scaler.fit(data[["KitchenAbvGr"]])
data[["KitchenAbvGr"]] = MM_scaler.transform(data[["KitchenAbvGr"]])


data

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,0.380070,0.375,0.333333,0.500,0.316729,RFn,Y,2,208500
1,-0.312090,0.375,0.333333,0.875,0.069650,RFn,Y,5,181500
2,0.497489,0.375,0.333333,0.500,0.228124,RFn,Y,9,223500
3,0.390885,0.375,0.333333,0.500,0.442583,Unf,Y,2,140000
4,1.134029,0.500,0.333333,0.500,0.566894,RFn,Y,12,250000
...,...,...,...,...,...,...,...,...,...
1455,0.282735,0.375,0.333333,0.500,0.251267,RFn,Y,8,175000
1456,0.940904,0.375,0.333333,0.625,0.459996,Unf,Y,2,210000
1457,1.353418,0.500,0.333333,1.000,0.036147,RFn,Y,5,266500
1458,-0.596369,0.250,0.333333,0.625,0.648667,Unf,Y,4,142125


🧪 **Test your code**

In [49]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = data
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/baska/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/baska/code/Svvcm/05-ML/02-Prepare-the-dataset/data-preprocessing-workflow/tests
plugins: typeguard-2.13.3, anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 3 items

test_scaling.py::TestScaling::test_bedroom_kitchen_condition PASSED      [ 33%]
test_scaling.py::TestScaling::test_gr_liv_area PASSED                    [ 66%]
test_scaling.py::TestScaling::test_roof_surface PASSED                   [100%]



💯 You can commit your code:

git add tests/scaling.pickle

git commit -m 'Completed scaling step'

git push origin master



## (4) Feature Encoding

### `GarageFinish`

❓ **Question** about `GarageFinish`❓

👇 Investigate `GarageFinish` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Add the encoding to the dataframe as new column(s), and remove the original column.


<details>
    <summary>💡 <i>Hint</i></summary>
        
ℹ️ `GarageFinish` is a multicategorical feature that should be One-hot-encoded. You could also consider an Ordinal Encoding, but we would have to know for sure that Unfinished or no garage are definitely worse than rough finished!
</details>
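
For comparison, this is roughly what the ordinal alternative mentioned in the hint could look like. The ranking in `assumed_order` is a guess made for this sketch only; the dataset description does not confirm it, which is exactly why one-hot encoding is the safer choice here:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (hypothetical rows) using the four categories of GarageFinish
toy = pd.DataFrame({"GarageFinish": ["RFn", "Unf", "Fin", "NoGarage", "Unf"]})

# ASSUMPTION: a ranking from worst to best, not confirmed by the dataset description
assumed_order = ["NoGarage", "Unf", "RFn", "Fin"]

# Each category is replaced by its position in the assumed ranking
encoder = OrdinalEncoder(categories=[assumed_order])
toy["GarageFinish_ordinal"] = encoder.fit_transform(toy[["GarageFinish"]]).ravel()

print(toy)
```

An ordinal encoding tells a linear model that moving from one category to the next always shifts the price in the same direction, which is a strong claim if the ranking is uncertain.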

In [50]:
from sklearn.preprocessing import OneHotEncoder

GarageFinish_ohe = OneHotEncoder(sparse_output=False)

GarageFinish_ohe.fit(data[["GarageFinish"]])


data[GarageFinish_ohe.get_feature_names_out()] = GarageFinish_ohe.transform(data[["GarageFinish"]])

data = data.drop(columns="GarageFinish")

data.sample(10)


Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,CentralAir,MoSold,SalePrice,GarageFinish_Fin,GarageFinish_NoGarage,GarageFinish_RFn,GarageFinish_Unf
14,-0.325995,0.25,0.333333,0.5,0.668724,Y,5,157000,0.0,0.0,1.0,0.0
580,-0.054075,0.375,0.333333,0.625,0.469914,Y,6,181900,0.0,0.0,0.0,1.0
247,-0.137505,0.25,0.333333,0.5,0.108001,Y,6,140000,0.0,0.0,0.0,1.0
965,0.0309,0.375,0.333333,0.5,0.604585,Y,7,178900,1.0,0.0,0.0,0.0
769,2.804171,0.5,0.333333,0.5,0.836897,Y,6,538000,1.0,0.0,0.0,0.0
1191,-0.035535,0.25,0.333333,0.5,0.441702,Y,11,174000,0.0,0.0,0.0,1.0
396,-0.880649,0.25,0.333333,0.5,0.726692,Y,9,123000,0.0,0.0,1.0,0.0
908,-0.868289,0.25,0.333333,0.5,0.049592,Y,6,131000,0.0,0.0,0.0,1.0
1375,0.165315,0.375,0.333333,0.5,0.845493,Y,12,239000,1.0,0.0,0.0,0.0
320,1.748938,0.5,0.333333,0.5,0.205863,Y,9,342643,0.0,0.0,1.0,0.0


In [51]:
GarageFinish_ohe.get_feature_names_out()

array(['GarageFinish_Fin', 'GarageFinish_NoGarage', 'GarageFinish_RFn',
       'GarageFinish_Unf'], dtype=object)

### Encoding  `CentralAir`

❓ **Question** about `CentralAir`❓

Investigate `CentralAir` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Replace the original column with the newly generated encoded columns.


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `CentralAir` is a binary categorical feature.
</details>

In [53]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False,drop='if_binary')
# Fit encoder
ohe.fit(data[['CentralAir']])

data["CentralAir"] = ohe.transform(data[['CentralAir']])


data

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,CentralAir,MoSold,SalePrice,GarageFinish_Fin,GarageFinish_NoGarage,GarageFinish_RFn,GarageFinish_Unf
0,0.380070,0.375,0.333333,0.500,0.316729,1.0,2,208500,0.0,0.0,1.0,0.0
1,-0.312090,0.375,0.333333,0.875,0.069650,1.0,5,181500,0.0,0.0,1.0,0.0
2,0.497489,0.375,0.333333,0.500,0.228124,1.0,9,223500,0.0,0.0,1.0,0.0
3,0.390885,0.375,0.333333,0.500,0.442583,1.0,2,140000,0.0,0.0,0.0,1.0
4,1.134029,0.500,0.333333,0.500,0.566894,1.0,12,250000,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1455,0.282735,0.375,0.333333,0.500,0.251267,1.0,8,175000,0.0,0.0,1.0,0.0
1456,0.940904,0.375,0.333333,0.625,0.459996,1.0,2,210000,0.0,0.0,0.0,1.0
1457,1.353418,0.500,0.333333,1.000,0.036147,1.0,5,266500,0.0,0.0,1.0,0.0
1458,-0.596369,0.250,0.333333,0.625,0.648667,1.0,4,142125,0.0,0.0,0.0,1.0


In the next part, we will do some feature engineering on the `MoSold` feature to take into account its cyclical nature.

We will create two new features `sin_MoSold` and `cos_MoSold`.

If you're interested in why and how we do that, you can read the next part, but you can also **proceed to the next challenge** if you want.

## (5) [Optional] Feature Engineering

### `MoSold` - Cyclical engineering 

👨🏻‍🏫 A feature can be numerical (continuous or discrete), categorical or ordinal. But a feature can also be temporal (e.g. quarters, months, days, minutes, ...). 

Cyclical features like time need some specific preprocessing. Indeed, if you want any Machine Learning algorithm to capture this cyclicity, your cyclical features must be preprocessed in a certain way.

👉 Consider the feature `MoSold`, the month in which the house was sold.

In [54]:
data["MoSold"].value_counts()


6     253
7     234
5     204
4     141
8     122
3     106
10     89
11     79
9      63
12     59
1      58
2      52
Name: MoSold, dtype: int64

* Many houses were sold in June (6), July (7) and May (5) (Spring/Summer)
* Only a few houses were sold in December (12), January (1) and February (2) (Winter)
    * But for any Machine Learning model, there is no reason why December (12) and January (1) would be "close"...

👩🏻‍🏫 ***How to deal with cyclical features?***

1.  Look at the following illustration and read the explanations to distinguish two different months.

<img src="https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/02-Prepare-the-dataset/cyclical_feature_engineering.png" alt="Cyclical features" width="1000">


2. Read this [article](https://ianlondon.github.io/posts/encoding-cyclical-features-24-hour-time/) for more details.
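
To see why the sine/cosine pair helps, the sketch below places months on a circle and compares distances. The helper `month_to_point` is a hypothetical name introduced for this illustration:

```python
import numpy as np

def month_to_point(month):
    """Map a month number (1-12) to its (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# With raw month numbers, December and January look far apart
raw_gap = abs(12 - 1)

# On the circle, December-January is as close as January-February
dec_jan = np.linalg.norm(month_to_point(12) - month_to_point(1))
jan_feb = np.linalg.norm(month_to_point(1) - month_to_point(2))

# Months half a year apart sit on opposite sides of the circle
jan_jul = np.linalg.norm(month_to_point(1) - month_to_point(7))

print(raw_gap, dec_jan, jan_feb, jan_jul)
```

Both coordinates are needed: the sine alone cannot tell apart months that mirror each other on the circle, such as February and April.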




❓ **Question** about `MoSold` ❓ 
- Create two new features `sin_MoSold` and `cos_MoSold` which correspond respectively to the sine and cosine of MoSold.
- Drop the original column `MoSold`

<details>
    <summary>💡 <i>Hint</i></summary>
    
Here is how to create time-engineered features from a column that gives the second in the day:
```python
seconds_in_day = 24*60*60

df['sin_time'] = np.sin(2*np.pi*df.seconds/seconds_in_day)
df['cos_time'] = np.cos(2*np.pi*df.seconds/seconds_in_day)
df.drop(columns=['seconds'], inplace=True)

df.head()
```


</details>

In [None]:

data['sin_MoSold'] = np.sin(2*np.pi*data["MoSold"]/12)
data['cos_MoSold'] = np.cos(2*np.pi*data["MoSold"]/12)

data.drop(columns=['MoSold'], inplace=True)


In [56]:
data

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,CentralAir,SalePrice,GarageFinish_Fin,GarageFinish_NoGarage,GarageFinish_RFn,GarageFinish_Unf,sin_MoSold,cos_MoSold
0,0.380070,0.375,0.333333,0.500,0.316729,1.0,208500,0.0,0.0,1.0,0.0,8.660254e-01,5.000000e-01
1,-0.312090,0.375,0.333333,0.875,0.069650,1.0,181500,0.0,0.0,1.0,0.0,5.000000e-01,-8.660254e-01
2,0.497489,0.375,0.333333,0.500,0.228124,1.0,223500,0.0,0.0,1.0,0.0,-1.000000e+00,-1.836970e-16
3,0.390885,0.375,0.333333,0.500,0.442583,1.0,140000,0.0,0.0,0.0,1.0,8.660254e-01,5.000000e-01
4,1.134029,0.500,0.333333,0.500,0.566894,1.0,250000,0.0,0.0,1.0,0.0,-2.449294e-16,1.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,0.282735,0.375,0.333333,0.500,0.251267,1.0,175000,0.0,0.0,1.0,0.0,-8.660254e-01,-5.000000e-01
1456,0.940904,0.375,0.333333,0.625,0.459996,1.0,210000,0.0,0.0,0.0,1.0,8.660254e-01,5.000000e-01
1457,1.353418,0.500,0.333333,1.000,0.036147,1.0,266500,0.0,0.0,1.0,0.0,5.000000e-01,-8.660254e-01
1458,-0.596369,0.250,0.333333,0.625,0.648667,1.0,142125,0.0,0.0,0.0,1.0,8.660254e-01,-5.000000e-01


🧪 **Test your code**

In [57]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding', dataset = data, new_features = ['sin_MoSold', 'cos_MoSold'])

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/baska/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/baska/code/Svvcm/05-ML/02-Prepare-the-dataset/data-preprocessing-workflow/tests
plugins: typeguard-2.13.3, anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 4 items

test_encoding.py::TestEncoding::test_central_air PASSED                  [ 25%]
test_encoding.py::TestEncoding::test_columns PASSED                      [ 50%]
test_encoding.py::TestEncoding::test_month_sold_features PASSED          [ 75%]
test_encoding.py::TestEncoding::test_month_sold_features_number PASSED   [100%]



💯 You can commit your code:

git add tests/encoding.pickle

git commit -m 'Completed encoding step'

git push origin master



## (6) Export the preprocessed dataset

👇 Now that the dataset has been preprocessed, execute the code below to export it. You will keep working on it in the next exercise.

In [58]:
data.to_csv("data/clean_dataset.csv", index=False)

🏁 Congratulations! Now you know how to ***preprocess a dataset***!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!