# Feature Engineering Notebook

## Objectives

* Engineer features for Classification and Cluster models


## Inputs

* outputs/datasets/cleaned/x_train_cleaned.csv
* outputs/datasets/cleaned/x_test_cleaned.csv
* outputs/datasets/cleaned/y_train_cleaned.csv
* outputs/datasets/cleaned/y_test_cleaned.csv

## Outputs

* Encode categorical variables and normalize the data
* Perform PCA

## Conclusions


---

# Change working directory

Since the Jupyter notebooks are stored in a subfolder, we need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
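
As a cross-check on the cells above, here is a minimal sketch of how the parent directory can also be obtained with `pathlib`. It only computes the path and does not call `os.chdir()`, so it is safe to run more than once.

```python
import os
from pathlib import Path

# Path.cwd() is the pathlib counterpart of os.getcwd()
cwd = Path.cwd()

# .parent gives the same folder as os.path.dirname(os.getcwd())
parent = cwd.parent
print(parent)
```

Note that re-running the `os.chdir(os.path.dirname(current_dir))` cell moves up one more level each time, so it should be executed only once per session.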

---

# Load Cleaned Data

In [None]:
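import glob

import pandas as pd

# Collect every CSV file in the cleaned-data folder
input_folder = "outputs/datasets/cleaned"
csv_files = glob.glob(f"{input_folder}/*.csv")
print(csv_files)

# Load the first file found; note that glob does not guarantee any ordering
df = pd.read_csv(csv_files[0])
df.head(5)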
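
Because `glob.glob` returns files in an arbitrary, OS-dependent order, `csv_files[0]` may not be the same file on every machine. The sketch below shows one possible way to make the selection deterministic by sorting the paths. The temporary folder and the file names `a.csv` and `b.csv` are hypothetical stand-ins, not the project's real files.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the outputs/datasets/cleaned folder
with tempfile.TemporaryDirectory() as folder:
    for name in ["b.csv", "a.csv"]:
        pd.DataFrame({"x": [1, 2]}).to_csv(os.path.join(folder, name), index=False)

    # sorted() makes the order reproducible across operating systems
    csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    names = [os.path.basename(path) for path in csv_files]

    # The first file is now always the alphabetically first one
    df_demo = pd.read_csv(csv_files[0])

print(names)
```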

# Feature Engineering

* We pop the target variable 'Churn' from the DataFrame and use a lambda to map "No" to 0 and every other value to 1.

In [None]:
target = df.pop('Churn').apply(lambda x: 0 if x == "No" else 1)
target.head(3)
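
The lambda above maps anything that is not "No" to 1, including missing or misspelled labels. As an alternative sketch, an explicit dictionary with `Series.map` leaves unexpected labels as NaN, which makes such problems visible. The label "Yes" is an assumption here; only "No" appears in the notebook.

```python
import pandas as pd

# Toy stand-in for the 'Churn' column; "Yes" is an assumed label
churn = pd.Series(["No", "Yes", "No"], name="Churn")

# Labels missing from the dictionary would become NaN instead of 1
target_demo = churn.map({"No": 0, "Yes": 1})
print(target_demo.tolist())
```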

* 'Contract' is an ordinal categorical variable, so we encode it with OrdinalEncoder.

In [None]:
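from sklearn.preprocessing import OrdinalEncoder

ordinal_categorical = ['Contract']
# NOTE: unique() returns the categories in order of first appearance,
# which is not necessarily their natural (ordinal) order
encoder = OrdinalEncoder(categories=[list(df['Contract'].unique())])
df[ordinal_categorical] = encoder.fit_transform(df[ordinal_categorical])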
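
For the encoded integers to reflect a real ordering, the category list passed to `OrdinalEncoder` needs to be in that order. The following is a minimal sketch with the order written out explicitly. The category names `Month-to-month`, `One year` and `Two year` are assumptions for illustration and should be checked against `df['Contract'].unique()`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data; the category names are assumed, not taken from the dataset
demo = pd.DataFrame({"Contract": ["Two year", "Month-to-month", "One year"]})

# Explicit order from shortest to longest commitment (assumed)
contract_order = ["Month-to-month", "One year", "Two year"]

enc = OrdinalEncoder(categories=[contract_order])
codes = enc.fit_transform(demo[["Contract"]]).ravel().tolist()
print(codes)
```

With `unique()` on this toy data, "Two year" would be encoded as 0 simply because it appears first.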

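* The remaining (nominal) categorical variables are one-hot encoded with feature_engine's OneHotEncoder. With drop_last=True, a variable with k categories yields k-1 dummy columns.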

In [None]:
from feature_engine.encoding import OneHotEncoder

categorical_vars = df.columns[df.dtypes=='object'].to_list()
nominal_categorical = [var for var in categorical_vars if var not in ordinal_categorical]
encoder = OneHotEncoder(variables=nominal_categorical, drop_last=True)
df_encoded = encoder.fit_transform(df)
print(df_encoded.shape)
df_encoded.head(3)
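
To illustrate what k-1 encoding means, here is a pandas-only sketch. It is not feature_engine's implementation, and which category gets dropped may differ; the column `PaymentMethod` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical nominal column with k = 3 categories
demo = pd.DataFrame({"PaymentMethod": ["card", "cash", "check", "card"]})

# k dummy columns, one per category
full = pd.get_dummies(demo, columns=["PaymentMethod"], dtype=int)

# Drop the last dummy to keep k-1 columns; the dropped category
# is then represented by a row of all zeros
reduced = full.iloc[:, :-1]
print(reduced)
```

Dropping one dummy per variable removes a redundant column, since its value is implied by the others.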

* We scale every feature to the [0, 1] range with MinMaxScaler.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_enc_norm = pd.DataFrame(scaler.fit_transform(df_encoded), columns=df_encoded.columns)
df_enc_norm.head(3)
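
MinMaxScaler rescales each column as `(x - min) / (max - min)`. The notebook fits the scaler on the full dataset before splitting it. A common alternative, sketched here on toy arrays, is to fit on the training portion only and reuse those statistics for the test portion, so that no information from the test set enters the preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data standing in for a train/test split
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

# Learn min and max from the training data only
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)
# Test values outside the training range can fall outside [0, 1]
test_scaled = scaler.transform(test)
print(test_scaled.ravel())
```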

---

## Split data into Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df_enc_norm, target, test_size=0.2, random_state=42)
print(f"Train sample shape {x_train.shape}")
print(f"Test sample shape {x_test.shape}")
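
If the target classes turn out to be imbalanced, the optional `stratify` argument of `train_test_split` keeps the class proportions the same in both subsets. It is not used in the notebook; the sketch below uses synthetic data with an assumed 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 samples of class 0 and 20 of class 1 (assumed ratio)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y preserves the 80/20 ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```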

# Push engineered data to Repo

In [None]:
import os

# Create the outputs/datasets/engineered folder;
# exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/engineered', exist_ok=True)


## Train Set

In [None]:
x_train.to_csv("outputs/datasets/engineered/x_train_cleaned.csv", index=False)
y_train.to_csv("outputs/datasets/engineered/y_train_cleaned.csv", index=False)

## Test Set

In [None]:
x_test.to_csv("outputs/datasets/engineered/x_test_cleaned.csv", index=False)
y_test.to_csv("outputs/datasets/engineered/y_test_cleaned.csv", index=False)
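
When the saved targets are loaded again, `pd.read_csv` returns a one-column DataFrame rather than a Series. The sketch below shows one way to get the Series back with `squeeze("columns")`. It writes to a temporary folder, and the file name `y_demo.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy target Series standing in for y_train
y = pd.Series([0, 1, 0], name="Churn")

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "y_demo.csv")  # hypothetical file name
    y.to_csv(path, index=False)

    # read_csv yields a DataFrame; squeeze turns the single column into a Series
    y_back = pd.read_csv(path).squeeze("columns")

print(type(y_back).__name__, y_back.tolist())
```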