# Feature Engineering Notebook

## Objectives

*   Engineer features for Classification and Cluster models


## Inputs

* outputs/datasets/cleaned/x_train_cleaned.csv
* outputs/datasets/cleaned/x_test_cleaned.csv
* outputs/datasets/cleaned/y_train_cleaned.csv
* outputs/datasets/cleaned/y_test_cleaned.csv

## Outputs

* Encode categorical variable and perform normalization
* Perform PCA

## Conclusions


---

# Change working directory

Since Jupyter notebooks live in a subfolder, we need to change the working directory from the current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Cleaned Data

Train Set

In [None]:
import pandas as pd
x_train_path = "outputs/datasets/cleaned/x_train_cleaned.csv"
y_train_path = "outputs/datasets/cleaned/y_train_cleaned.csv"
x_train = pd.read_csv(x_train_path)
y_train = pd.read_csv(y_train_path)
x_train.head(3)

Test Set

In [None]:
import pandas as pd
x_test_path = "outputs/datasets/cleaned/x_test_cleaned.csv"
y_test_path = "outputs/datasets/cleaned/y_test_cleaned.csv"
x_test = pd.read_csv(x_test_path)
y_test = pd.read_csv(y_test_path)
x_test.head(3)

# Feature Engineering

* We noticed one ordinal categorical variable, 'Contract', so we will use an ordinal encoder for it.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_categorical = ['Contract']
# Note: unique() lists categories in order of first appearance, not in a ranked order
encoder = OrdinalEncoder(categories=[list(x_train['Contract'].unique())])
# Fit on the train set only, then reuse the fitted encoder on the test set
x_train[ordinal_categorical] = encoder.fit_transform(x_train[ordinal_categorical])
x_test[ordinal_categorical] = encoder.transform(x_test[ordinal_categorical])
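
* An ordinal encoder assigns integer codes in the order the categories are listed, so the list should follow the natural ranking of the variable. Below is a minimal sketch on toy data that passes an explicit order. The labels `short`, `medium` and `long` and all `toy_*` names are hypothetical placeholders, not values taken from this dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels, listed from lowest to highest rank
contract_order = ["short", "medium", "long"]

toy_train = pd.DataFrame({"Contract": ["long", "short", "medium", "short"]})
toy_test = pd.DataFrame({"Contract": ["medium", "long"]})

# The position of each label in contract_order becomes its code
toy_encoder = OrdinalEncoder(categories=[contract_order])
toy_train_codes = toy_encoder.fit_transform(toy_train[["Contract"]])  # fit on train
toy_test_codes = toy_encoder.transform(toy_test[["Contract"]])        # reuse on test

print(toy_train_codes.ravel())
print(toy_test_codes.ravel())
```

With this ordering `short` maps to 0, `medium` to 1 and `long` to 2 in both sets.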

* Other categorical variables will be transformed with OneHotEncoder.

In [None]:
from feature_engine.encoding import OneHotEncoder

categorical_vars = x_train.columns[x_train.dtypes=='object'].to_list()
nominal_categorical = [var for var in categorical_vars if var not in ordinal_categorical]
encoder = OneHotEncoder(variables=nominal_categorical, drop_last=True)
# Learn the categories from the train set only, then apply the same mapping to the test set
x_train_encoded = encoder.fit_transform(x_train)
x_test_encoded = encoder.transform(x_test)
print(x_train_encoded.shape)
x_train_encoded.head(3)
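
* One-hot columns depend on which categories are present in the data the encoding is learned from. If a category is missing from the test set, encoding the two sets independently yields different columns. The sketch below illustrates this with plain pandas `get_dummies` rather than the feature_engine encoder used in this notebook. The `colour` column and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy data: the test set happens to contain only one of the three categories
toy_train = pd.DataFrame({"colour": ["red", "green", "blue"]})
toy_test = pd.DataFrame({"colour": ["red", "red"]})

# Encoding each set independently produces mismatched columns
train_dummies = pd.get_dummies(toy_train, columns=["colour"])
test_dummies = pd.get_dummies(toy_test, columns=["colour"])
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Align the test columns to the train layout, filling absent categories with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Learning the categories once from the train set and reusing them on the test set avoids this mismatch without a manual alignment step.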

* We will normalize the data to the [0, 1] range with MinMaxScaler.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Learn min and max from the train set only, then apply the same scaling to the test set
x_train_enc_norm = pd.DataFrame(scaler.fit_transform(x_train_encoded), columns=x_train_encoded.columns)
x_test_enc_norm = pd.DataFrame(scaler.transform(x_test_encoded), columns=x_test_encoded.columns)
x_train_enc_norm.head(3)
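
* A min-max scaler maps each value x to (x - min) / (max - min), using the min and max of the data it was fitted on. The same raw value therefore lands on a different scaled value depending on which set the scaler learned from. Below is a toy sketch of that effect; the arrays and `toy_*` names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[2.0], [4.0]])

# Scaler fitted on train: min=0, max=10, so test values become x / 10
toy_scaler = MinMaxScaler().fit(toy_train)
shared_scale = toy_scaler.transform(toy_test)

# Scaler fitted on test: min=2, max=4, so test values become (x - 2) / 2
refit_scale = MinMaxScaler().fit_transform(toy_test)

print(shared_scale.ravel())
print(refit_scale.ravel())
```

Only the first result is on the same scale as the transformed train set, which is what a model trained on that set expects.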

---

# Push engineered data to Repo

In [None]:
import os
# Create outputs/datasets/engineered; exist_ok avoids an error if the folder already exists
os.makedirs('outputs/datasets/engineered', exist_ok=True)


## Train Set

In [None]:
# Save the encoded and normalized features (file name kept as in the original notebook)
x_train_enc_norm.to_csv("outputs/datasets/engineered/x_train_cleaned.csv", index=False)
y_train.to_csv("outputs/datasets/engineered/y_train_cleaned.csv", index=False)

## Test Set

In [None]:
# Save the encoded and normalized features (file name kept as in the original notebook)
x_test_enc_norm.to_csv("outputs/datasets/engineered/x_test_cleaned.csv", index=False)
y_test.to_csv("outputs/datasets/engineered/y_test_cleaned.csv", index=False)