# Feature Engineering Notebook

__Our cleaned dataset is now ready for the transformation process. We will transform our column values to prepare the data for the model we will be building.__
__We will perform the following operations on our data:__

+ Encoding Variables: We will encode categorical columns (label encoding, one-hot encoding) from text to numbers so the model can work with them.
+ Feature Creation: After reviewing our columns, we may need to create new features from existing ones.
+ Scaling/Normalization: For our numerical columns, we will apply transformations so they are on a similar scale for our model.

__Even complex models perform poorly when given bad data, so this notebook is a key step toward predicting our customers' behaviour accurately.__

In [1]:
import pandas as pd
cleaned_churn_df = pd.read_excel("../data/cleaned/E Commerce Dataset (cleaned).xlsx")

In [2]:
df = cleaned_churn_df.copy()
df.head()

| Unnamed: 0 | CustomerID | Churn | Tenure(months) | PreferredLoginDevice | CityTier | WarehouseToHome | PreferredPaymentMode | Gender | HourSpendOnApp | NumberOfDeviceRegistered | PreferredOrderCat | SatisfactionScore | MaritalStatus | NumberOfAddress | Complain | OrderAmountHikeFromlastYear(%) | CouponUsed | OrderCount | DaySinceLastOrder | CashbackAmount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50001 | 1 | 4 | Phone | 3 | 6 | Debit Card | Female | 3 | 3 | Laptop & Accessory | 2 | Single | 9 | 1 | 11 | 1 | 1 | 5 | 159.93 |
| 1 | 50002 | 1 | 9 | Phone | 1 | 8 | Unified Payments Interface | Male | 3 | 4 | Mobile Phone | 3 | Single | 7 | 1 | 15 | 0 | 1 | 0 | 120.9 |
| 2 | 50003 | 1 | 9 | Phone | 1 | 30 | Debit Card | Male | 2 | 4 | Mobile Phone | 3 | Single | 6 | 1 | 14 | 0 | 1 | 3 | 120.28 |
| 3 | 50004 | 1 | 0 | Phone | 3 | 15 | Debit Card | Male | 2 | 4 | Laptop & Accessory | 5 | Single | 8 | 0 | 23 | 0 | 1 | 3 | 134.07 |
| 4 | 50005 | 1 | 0 | Phone | 1 | 12 | Credit Card | Male | 0 | 3 | Mobile Phone | 5 | Single | 3 | 0 | 11 | 1 | 1 | 3 | 129.6 |


In [3]:
# We don't need the customer ID in training our model


## Encoding Categorical Variables

In [4]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

<p>We will use different encoding types depending on the nature of our features</p>

+ Label-style (0/1) encoding for binary features
+ One-hot encoding for features with more than two categories and no order or hierarchy
+ Ordinal encoding for features with more than two ordered categories

__We will use an sklearn `Pipeline` so we don't have to repeat these steps manually every time new data comes in.__

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

We'll group our columns by type to specify how each group will be encoded.

In [6]:
binary_value_columns = ["PreferredLoginDevice", "Gender", "Complain"]   # Columns with just 2 categories

ordinal_value_columns = ["CityTier", "SatisfactionScore"]    # Columns whose values have a meaningful order (one level ranks higher than another)

nominal_value_columns = ["PreferredPaymentMode", "PreferredOrderCat", "MaritalStatus"]   # Columns with more than 2 categories and no meaningful order


numeric_columns = [
    "Tenure(months)", "WarehouseToHome", "HourSpendOnApp",
    "NumberOfDeviceRegistered", "NumberOfAddress",
    "OrderAmountHikeFromlastYear(%)", "CouponUsed",
    "OrderCount", "DaySinceLastOrder", "CashbackAmount"
] 

# Churn and CustomerID not included


In [7]:
binary_encoder = OrdinalEncoder()
ordinal_encoder = OrdinalEncoder(categories=[
    [1, 2, 3],          # for city tier
    [1, 2, 3, 4, 5],    # for satisfaction score
])
nominal_encoder = OneHotEncoder(drop="first", sparse_output=False)

`LabelEncoder` won't work inside a column transformer because it operates on a single 1D column rather than several columns at once; it is usually used for encoding target variables.<br/>
We use `OrdinalEncoder` instead because it works on 2D arrays. By default it numbers the categories in sorted order, so we pass `categories` explicitly where the order matters. With binary categories, it simply maps the two values to 0 and 1.
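
The sketch below illustrates both points on invented values: `OrdinalEncoder` numbers categories in sorted order unless an explicit order is given, and `LabelEncoder` takes a single 1D array. The category labels here are assumptions made up for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Invented values for illustration (one column, four rows)
levels = np.array([["low"], ["high"], ["medium"], ["low"]])

# Default: categories are sorted alphabetically -> high=0, low=1, medium=2
default_codes = OrdinalEncoder().fit_transform(levels).ravel()

# Explicit order: low=0, medium=1, high=2
ordered_codes = OrdinalEncoder(
    categories=[["low", "medium", "high"]]
).fit_transform(levels).ravel()

# LabelEncoder expects a 1D array, which is why it suits a target column
target_codes = LabelEncoder().fit_transform(["no", "yes", "yes", "no"])

print(default_codes)
print(ordered_codes)
print(target_codes)
```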

We will also scale our numerical columns with `StandardScaler` and add it to the same column transformer.

In [8]:
from sklearn.preprocessing import StandardScaler

numerical_scaler = StandardScaler()
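
To make the effect of `StandardScaler` concrete, the sketch below scales a small made-up column and compares the result with the formula z = (x - mean) / std, where std is the population standard deviation. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration (one column, four rows)
values = np.array([[10.0], [20.0], [30.0], [40.0]])

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(values)

# Same computation by hand: subtract the mean, divide by the population std
manual = (values - values.mean()) / values.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and unit variance, which keeps features with large raw ranges from dominating scale-sensitive models.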



We will create the `ColumnTransformer` object, which applies each encoder and the scaler to its own group of columns.

In [9]:
categorical_encodings = ColumnTransformer(transformers=[    # one (name, transformer, columns) entry per column group
    ("binary", binary_encoder, binary_value_columns),   # The binary_encoder (OrdinalEncoder) is named 'binary' and is applied to binary_value_columns

    ("ordinal", ordinal_encoder, ordinal_value_columns),

    ("nominal", nominal_encoder, nominal_value_columns),

    ("numeric_scaler", numerical_scaler, numeric_columns),      # Scaling numerical columns
], remainder="passthrough")         # any column not listed above is passed through unchanged


Now that we have defined the encoding and scaling operations for our columns, we will create our model pipeline and include these transformations.

In [10]:
model_pipeline = Pipeline(steps=[
    ("categorical_encoding", categorical_encodings)
])

### Preparing for model training and evaluation

In [11]:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["CustomerID", "Churn"])
y = df["Churn"]


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y    # stratify to keep the same churn class ratio in both splits
)
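
As a quick illustration of what `stratify` does, the sketch below splits an invented, imbalanced target and compares the share of positives in each split. The toy arrays are assumptions made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented target: 20 positives out of 100 rows
toy_y = np.array([1] * 20 + [0] * 80)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both splits keep the 20% positive rate of the full target
print(toy_y_train.mean(), toy_y_test.mean())
```

Without `stratify`, a random split of an imbalanced target can leave the test set with noticeably more or fewer churners than the training set.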




In [12]:
features_encoded_train = model_pipeline.fit_transform(X_train)
features_encoded_test = model_pipeline.transform(X_test)
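
Calling `fit_transform` on the training split and only `transform` on the test split means the scaling statistics and category lists are learned from the training data alone. A minimal sketch of that behaviour with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration
train_values = np.array([[1.0], [2.0], [3.0]])
test_values = np.array([[10.0]])

leak_free_scaler = StandardScaler()
leak_free_scaler.fit(train_values)                 # statistics come from the training data only
test_scaled = leak_free_scaler.transform(test_values)

# The test value is scaled with the training mean and std, not its own
print(leak_free_scaler.mean_)
print(test_scaled)
```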

In [None]:
# We have to get the fitted encoder from the pipeline to use its column names
fitted_nominal_encoder = model_pipeline.named_steps["categorical_encoding"].named_transformers_["nominal"]

# Any column not assigned to a transformer is passed through and appended last
passthrough_columns = [
    col for col in X_train.columns
    if col not in binary_value_columns + ordinal_value_columns + nominal_value_columns + numeric_columns
]

# A list of all feature names after encoding, in the same order as the ColumnTransformer output
encoded_feature_names = (
    binary_value_columns +
    ordinal_value_columns +
    list(fitted_nominal_encoder.get_feature_names_out(nominal_value_columns)) +
    numeric_columns +
    passthrough_columns
)


X_train_encoded = pd.DataFrame(features_encoded_train, columns=encoded_feature_names, index=X_train.index)
X_test_encoded = pd.DataFrame(features_encoded_test, columns=encoded_feature_names, index=X_test.index)


X_train_encoded.isna().sum()


PreferredLoginDevice                               0
Gender                                             0
Complain                                           0
CityTier                                           0
SatisfactionScore                                  0
PreferredPaymentMode_Credit Card                   0
PreferredPaymentMode_Debit Card                    0
PreferredPaymentMode_E wallet                      0
PreferredPaymentMode_Unified Payments Interface    0
PreferredOrderCat_Grocery                          0
PreferredOrderCat_Laptop & Accessory               0
PreferredOrderCat_Mobile Phone                     0
PreferredOrderCat_Others                           0
MaritalStatus_Married                              0
MaritalStatus_Single                               0
Tenure(months)                                     0
WarehouseToHome                                    0
HourSpendOnApp                                     0
NumberOfDeviceRegistered                      

We don't encode y_train and y_test because they are already numeric.

We will now save our ready-to-model datasets for the model evaluation notebook. <br/> To do this, we build training and testing DataFrames that combine the encoded features with the target.

In [14]:
# We reset the index so rows are aligned by position when we concatenate.
# y_train and y_test are Series named "Churn", so the concat also adds the target column.
train_df = pd.concat([X_train_encoded.reset_index(drop=True),
                      y_train.reset_index(drop=True)], axis=1)

test_df = pd.concat([X_test_encoded.reset_index(drop=True),
                     y_test.reset_index(drop=True)], axis=1)


train_df.to_csv("../data/processed/train.csv", index=False)
test_df.to_csv("../data/processed/test.csv", index=False)
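
In the evaluation notebook, the saved files can be read back and split into features and target. The sketch below shows the round trip on a tiny invented frame written to a temporary directory, so it does not depend on the project's paths. The feature names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny invented frame standing in for the processed training data
toy_train = pd.DataFrame({
    "feature_a": [0.1, -0.3],
    "feature_b": [1.0, 0.0],
    "Churn": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "train.csv"
    # index=False keeps the row index out of the file, so no extra column appears on reload
    toy_train.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Split the reloaded data back into features and target
X_loaded = reloaded.drop(columns=["Churn"])
y_loaded = reloaded["Churn"]

print(list(reloaded.columns))
```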
