# **📊 Feature Engineering & Transformation**  
📌 **Notebook:** `02_feature_engineering.ipynb`  

## **🎯 Objective**  
The goal of this notebook is to **enhance our dataset** by engineering new features, scaling numerical variables, and reducing dimensionality to improve model performance.  



 **Let's start!** 🔥


### Import libraries 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn


### Load the dataset

In [None]:
file_path = "../data/processed_data.csv"  
df_cleaned = pd.read_csv(file_path)


df_cleaned.head()


Unnamed: 0,Age,Gender,Income,CampaignChannel,CampaignType,AdSpend,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,TimeOnSite,SocialShares,EmailOpens,EmailClicks,PreviousPurchases,LoyaltyPoints,Conversion
0,56,0,136912,4,0,6497.870068,0.043919,0.088031,0,2.399017,7.396803,19,6,9,4,688,1
1,69,1,41760,0,3,3898.668606,0.155725,0.182725,42,2.917138,5.352549,5,2,7,2,3459,1
2,46,0,88456,1,0,1546.429596,0.27749,0.076423,2,8.223619,13.794901,0,11,2,8,2337,1
3,32,0,44085,1,2,539.525936,0.137611,0.088004,47,4.540939,14.688363,89,2,2,0,2463,1
4,60,0,83964,1,2,1678.043573,0.252851,0.10994,0,2.046847,13.99337,6,6,6,8,4345,1


### ✅ **1. Scaling & Normalization**  

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler


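# Standardize each column to mean 0 and standard deviation 1.
standard_scaler = StandardScaler()

# Each fit_transform call refits the scaler, so the two columns are scaled independently.
df_cleaned['AdSpend_Standard'] = standard_scaler.fit_transform(df_cleaned[['AdSpend']])
df_cleaned['TimeOnSite_Standard'] = standard_scaler.fit_transform(df_cleaned[['TimeOnSite']])

# Drop the raw columns now that their standardized versions exist.
df_cleaned.drop(columns=['AdSpend', 'TimeOnSite'], inplace=True)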


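StandardScaler rescales each feature to mean = 0 and standard deviation = 1.

MinMaxScaler maps all values into the [0, 1] range, so a few extreme values can compress the variation among the typical values of a wide-ranging feature like AdSpend. That is why StandardScaler is used here.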
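To make the difference between the two scalers concrete, here is a minimal sketch that applies both formulas by hand with NumPy. The `spend` values are hypothetical and are not taken from the dataset; the standardization line uses the population standard deviation (`ddof=0`), which is what StandardScaler computes for each column.

```python
import numpy as np

# Hypothetical spend values with one large outlier (not from the dataset).
spend = np.array([100.0, 200.0, 300.0, 400.0, 10_000.0])

# Standardization: subtract the mean, divide by the population std (ddof=0).
standardized = (spend - spend.mean()) / spend.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1.
min_max = (spend - spend.min()) / (spend.max() - spend.min())

print(np.round(standardized, 3))
print(np.round(min_max, 3))
```

In the min-max result, the four typical values end up squeezed close to 0 because the single outlier defines the top of the range. Standardization is also influenced by outliers, but its output is not forced into a fixed interval.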

### ✅ **2. Feature Creation**  

In [None]:

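# Engagement score: product of visits, pages per visit, and (standardized) time on site.
df_cleaned['Engagement_Score'] = (
    df_cleaned['WebsiteVisits'] * df_cleaned['PagesPerVisit'] * df_cleaned['TimeOnSite_Standard']
)

# Marketing efficiency: conversion rate per unit of (standardized) ad spend.
# The small constant guards against division by exactly zero.
df_cleaned['Marketing_Efficiency'] = df_cleaned['ConversionRate'] / (df_cleaned['AdSpend_Standard'] + 1e-5)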

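Engagement Score is meant to quantify user activity (more visits, pages, and time → higher score). Note that TimeOnSite_Standard is centered at zero, so a below-average time on site produces a negative score no matter how many visits and pages a user has (see row 7996 in the output below).

Marketing Efficiency is meant to measure how well ad spend translates into conversions. Because AdSpend_Standard is also centered at zero, the ratio changes sign for below-average spend and becomes very large when spend is close to the mean.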
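Since standardized columns take both positive and negative values, products and ratios built from them can flip sign. One possible alternative, sketched below, is to compute both features from the raw (unscaled) columns and scale afterwards. This is an assumption about how the features could be built, not what the notebook does: the `raw` frame is hypothetical and presumes that `AdSpend` and `TimeOnSite` are still available before they are dropped.

```python
import pandas as pd

# Hypothetical rows with raw (unscaled) AdSpend and TimeOnSite values.
raw = pd.DataFrame({
    "WebsiteVisits": [0, 42, 47],
    "PagesPerVisit": [2.4, 2.9, 4.5],
    "TimeOnSite": [7.4, 5.4, 14.7],
    "ConversionRate": [0.088, 0.183, 0.088],
    "AdSpend": [6497.87, 3898.67, 539.53],
})

# All three factors are non-negative, so the product grows with activity.
raw["Engagement_Score"] = (
    raw["WebsiteVisits"] * raw["PagesPerVisit"] * raw["TimeOnSite"]
)

# Raw spend is strictly positive here, so the ratio keeps a consistent sign.
raw["Marketing_Efficiency"] = raw["ConversionRate"] / raw["AdSpend"]

print(raw[["Engagement_Score", "Marketing_Efficiency"]])
```

With raw inputs, a user with zero visits gets a score of exactly 0 and every other score is positive, which matches the "more activity → higher score" reading.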

In [24]:
df_cleaned

Unnamed: 0,Age,Gender,Income,CampaignChannel,CampaignType,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,SocialShares,EmailOpens,EmailClicks,PreviousPurchases,LoyaltyPoints,Conversion,AdSpend_Standard,TimeOnSite_Standard,Engagement_Score,Marketing_Efficiency
0,56,0,136912,4,0,0.043919,0.088031,0,2.399017,19,6,9,4,688,1,0.527484,-0.078268,-0.000000,0.166886
1,69,1,41760,0,3,0.155725,0.182725,42,2.917138,5,2,7,2,3459,1,-0.388418,-0.561778,-68.828863,-0.470445
2,46,0,88456,1,0,0.277490,0.076423,2,8.223619,0,11,2,8,2337,1,-1.217296,1.435016,23.602057,-0.062781
3,32,0,44085,1,2,0.137611,0.088004,47,4.540939,89,2,2,0,2463,1,-1.572106,1.646339,351.368420,-0.055979
4,60,0,83964,1,2,0.252851,0.109940,0,2.046847,6,6,6,8,4345,1,-1.170918,1.481958,0.000000,-0.093893
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7995,21,1,24849,0,0,0.243792,0.116773,23,9.693602,70,13,6,7,286,0,1.239442,1.537405,342.768755,0.094213
7996,43,0,44718,3,3,0.236740,0.190061,49,9.499010,52,13,1,5,1502,0,-1.260221,-0.999683,-465.303757,-0.150817
7997,28,0,125471,2,1,0.056526,0.133826,35,2.853241,38,16,0,3,738,1,-0.137924,1.629773,162.754707,-0.970357
7998,19,0,107862,1,1,0.023961,0.138386,49,1.002964,86,1,5,7,2709,1,1.576949,-0.910865,-44.764691,0.087755


### ✅ **3. Dimensionality Reduction (Feature Selection)** 

In [None]:
from sklearn.feature_selection import mutual_info_classif


features = df_cleaned.drop(columns=['Conversion'])
target = df_cleaned['Conversion']


mi_scores = mutual_info_classif(features, target, random_state=42)
feature_importance = pd.Series(mi_scores, index=features.columns).sort_values(ascending=False)

top_features = feature_importance[feature_importance > 0.01].index.tolist()
df_selected = df_cleaned[top_features + ['Conversion']] 


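Mutual Information (MI) measures how much knowing a feature reduces uncertainty about Conversion; a score of 0 means the feature and the target are independent.

We keep only the features with an MI score above 0.01, which reduces the feature count while aiming to retain the most informative ones.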
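To illustrate how the MI threshold behaves, here is a small sketch on synthetic data (the `signal` and `noise` columns are made up for illustration and are unrelated to the marketing dataset). The label depends only on `signal`, so that column should receive a clearly higher score than `noise`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 2000

# Synthetic data: `signal` drives the label, `noise` is unrelated to it.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
label = (signal > 0).astype(int)

X = pd.DataFrame({"signal": signal, "noise": noise})
mi = pd.Series(mutual_info_classif(X, label, random_state=42), index=X.columns)

# Same selection rule as in the notebook: keep features scoring above 0.01.
selected = mi[mi > 0.01].index.tolist()

print(mi.sort_values(ascending=False))
print(selected)
```

The estimator is itself noisy, so an unrelated feature can still receive a small positive score. A threshold as low as 0.01 may therefore let some uninformative features through, and it can be worth checking how stable the selection is across different `random_state` values.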

In [None]:

df_selected.to_csv("../data/feature_engineered.csv", index=False)

print("✅ feature data saved successfully as 'feature_engineered' in the 'data' folder.")


✅ feature data saved successfully as 'feature_engineered' in the 'data' folder.
