# **Feature Engineering**

## Objectives

**Perform Business requirement 2 user story task: Feature engineering ML tasks**
* Perform categorical encoding on categorical features.
* Perform feature selection to keep the most significant features and remove redundant ones.
* Carry out feature scaling/transformations to normalise the distributions of remaining features.
* Create the data cleaning and feature engineering pipeline.

## Inputs
* cleaned train set: outputs/datasets/ml/cleaned/train_set.csv
* cleaned test set: outputs/datasets/ml/cleaned/test_set.csv

## Outputs


---

## Change working directory

Change the working directory from the current folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load cleaned train and test datasets

In [None]:
import pandas as pd

train_set_df = pd.read_csv(filepath_or_buffer='outputs/datasets/ml/cleaned/train_set.csv')
test_set_df = pd.read_csv(filepath_or_buffer='outputs/datasets/ml/cleaned/test_set.csv')
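
One detail worth checking at load time: the ordered categories used later in this notebook include the literal string `'None'`, and `pd.read_csv` may treat that string as a missing value under its default NA handling, depending on the pandas version. The sketch below uses a small in-memory CSV (hypothetical column names and values, not the project files) to show how `keep_default_na=False` keeps such strings intact.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for a cleaned dataset
toy_csv = io.StringIO("GarageFinish,GrLivArea\nNone,1200\nFin,1500\n")

# keep_default_na=False disables the built-in list of NA strings,
# so 'None' is read as an ordinary string rather than as missing
toy_df = pd.read_csv(toy_csv, keep_default_na=False)
none_count = int((toy_df['GarageFinish'] == 'None').sum())
print(none_count)
```

With this option, empty cells are no longer converted to NaN either, so it is best suited to files from which missing values have already been removed.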

---

## Categorical feature encoding

In the sale price correlation study notebook, the categorical features were encoded with an ordinal encoder. This was deemed the most suitable choice because all the categorical features are ordinal, with a clear ordering based on a rating.

The same encoding is applied to the cleaned train and test sets.

In [None]:
# Select the categorical (object dtype) columns from each set
train_set_categorical_df = train_set_df.select_dtypes(include='object')
print(train_set_categorical_df.columns)
test_set_categorical_df = test_set_df.select_dtypes(include='object')
test_set_categorical_df.columns
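
`OrdinalEncoder` matches each entry of its `categories` argument to a column by position, so it can help to confirm that both sets expose the same categorical columns in the same order before encoding. A minimal sketch on toy frames follows; the column names and values are assumptions for illustration, not taken from the project datasets.

```python
import pandas as pd

# Hypothetical frames; the string columns are cast to object dtype
# to mirror the select_dtypes(include='object') call used above
string_cols = {'KitchenQual': 'object', 'GarageFinish': 'object'}
toy_train = pd.DataFrame({
    'KitchenQual': ['TA', 'Gd'],
    'GrLivArea': [1500, 1200],
    'GarageFinish': ['Unf', 'Fin'],
}).astype(string_cols)
toy_test = pd.DataFrame({
    'KitchenQual': ['Ex'],
    'GrLivArea': [1800],
    'GarageFinish': ['RFn'],
}).astype(string_cols)

train_cat_cols = list(toy_train.select_dtypes(include='object').columns)
test_cat_cols = list(toy_test.select_dtypes(include='object').columns)

# True only if both sets have identical categorical columns in identical order
same_columns = train_cat_cols == test_cat_cols
print(train_cat_cols, same_columns)
```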

In [None]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Ordered categories for each feature, from lowest to highest rating
bsmt_fin_type1_cat = np.array(['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'])
bsmt_exposure_cat = np.array(['None', 'No', 'Mn', 'Av', 'Gd'])
garage_finish_cat = np.array(['None', 'Unf', 'RFn', 'Fin'])
kitchen_quality_cat = np.array(['Po', 'Fa', 'TA', 'Gd', 'Ex'])

# One entry per categorical column; the list order must match the column order printed above
categories = [bsmt_exposure_cat, bsmt_fin_type1_cat, garage_finish_cat, kitchen_quality_cat]
encoder = OrdinalEncoder(categories=categories, dtype='int64')
# Return DataFrames instead of NumPy arrays from transform
encoder.set_output(transform='pandas')

# Fit the encoder on the train set only, then apply the fitted encoder to the test set
train_set_df[train_set_categorical_df.columns] = encoder.fit_transform(X=train_set_categorical_df)
print(train_set_df[train_set_categorical_df.columns].head())
test_set_df[test_set_categorical_df.columns] = encoder.transform(X=test_set_categorical_df)
test_set_df[test_set_categorical_df.columns].head()
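
To make the mapping concrete, the sketch below applies the same technique to a toy example: each category is replaced by its index in the ordered list, and the encoder fitted on the train data is reused for the test data. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, not taken from the project datasets
toy_train = pd.DataFrame({
    'KitchenQual': ['TA', 'Gd', 'Ex', 'Fa'],
    'GarageFinish': ['Unf', 'Fin', 'None', 'RFn'],
}, dtype='object')
toy_test = pd.DataFrame({
    'KitchenQual': ['Po', 'Ex'],
    'GarageFinish': ['RFn', 'Unf'],
}, dtype='object')

# One ordered category list per column, in the same order as the columns
toy_categories = [
    ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['None', 'Unf', 'RFn', 'Fin'],
]
toy_encoder = OrdinalEncoder(categories=toy_categories, dtype='int64')

# Fit on the train data, then reuse the fitted encoder on the test data
toy_train_encoded = toy_encoder.fit_transform(toy_train)
toy_test_encoded = toy_encoder.transform(toy_test)

print(toy_train_encoded.tolist())  # [[2, 1], [3, 3], [4, 0], [1, 2]]
print(toy_test_encoded.tolist())   # [[0, 2], [4, 1]]
```

Because the categories are given explicitly, a value such as `'Po'` that appears only in the test data is still encoded consistently, as long as it is in the corresponding list.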


---