# Feature Engineering:
---

Feature engineering was performed to create additional marketing performance metrics and time-based insights, enabling more accurate analysis and future machine learning experimentation.

### Objectives:
- Create new analytical marketing KPIs.
- Generate time-based campaign performance attributes.
- Prepare an enriched dataset for modeling and deeper insight extraction.

### Engineered Features:

---
| Feature                             | Formula                                 | Meaning                             |
| ----------------------------------- | --------------------------------------- | ----------------------------------------------------- |
| CTR (Click-Through Rate)            | `clicks / impressions`                  | Measures audience engagement with ads                 |
| Conversion Rate (CR)                | `conversions / clicks`                  | Measures how efficiently clicks turn into conversions |
| CPC (Cost per Click)                | `spend_usd / clicks`                    | Shows cost efficiency of attracting each click        |
| CPM (Cost per Thousand Impressions) | `spend_usd / (impressions / 1000)`      | Shows cost per 1,000 impressions served               |
| CPA (Cost per Acquisition)          | `spend_usd / conversions`               | Cost per conversion; a critical efficiency metric     |
| ROI (Return on Investment)          | `(revenue_usd - spend_usd) / spend_usd` | Measures profitability of marketing spend             |
| ROAS (Return on Ad Spend)           | `revenue_usd / spend_usd`               | Shows revenue return per dollar spent                 |
| Revenue per Click                   | `revenue_usd / clicks`                  | How much each click generates in revenue              |
| Revenue per Conversion              | `revenue_usd / conversions`             | Revenue per successful conversion                     |
| Campaign Duration (days)            | `end_date - start_date`                 | Campaign timeline context                             |


---
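As a worked check of the formulas in the table, the sketch below plugs in the raw counts of the first campaign row displayed in the Data Check cell (`2024_0001`). The variable names are illustrative only. It also makes the relationship between the two return metrics explicit: ROI equals ROAS minus 1, so the two columns carry the same information on different scales.

```python
# Raw values copied from campaign 2024_0001 as displayed in this notebook.
impressions = 28252
clicks = 5609
spend_usd = 39193.43
revenue_usd = 79017.74

ctr = clicks / impressions                      # click-through rate
cpc = spend_usd / clicks                        # cost per click
cpm = spend_usd / (impressions / 1000)          # cost per 1,000 impressions
roas = revenue_usd / spend_usd                  # return on ad spend
roi = (revenue_usd - spend_usd) / spend_usd     # return on investment

print(f"CTR={ctr:.6f}  CPC={cpc:.6f}  CPM={cpm:.6f}")
print(f"ROAS={roas:.6f}  ROI={roi:.6f}  ROAS-1={roas - 1:.6f}")
```

The printed values should agree, to the displayed precision, with the `ctr`, `CPC`, `CPM`, `ROAS`, and `roi` columns shown for that row in the notebook output.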


### Data Handling Notes:

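- Zeros in `clicks`, `impressions`, `conversions`, and `spend_usd` are temporarily replaced with NaN, so that any ratio with a zero denominator comes out as NaN instead of `inf`.

- After metric calculation, NaN values are filled with zero for reporting clarity. A zero in a ratio column can therefore mean "undefined" rather than a true zero.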
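The zero-handling idea in these notes can also be written as a small helper that guards only the denominator and leaves the source columns untouched. This is a minimal sketch of an alternative, not the approach used in the cells of this notebook; `safe_ratio` and the `toy` frame are hypothetical names with made-up values.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Toy frame (made-up values) containing one zero-click campaign.
toy = pd.DataFrame({"clicks": [50, 0], "spend_usd": [25.0, 10.0]})

# Undefined ratios stay NaN for analysis...
toy["CPC"] = safe_ratio(toy["spend_usd"], toy["clicks"])
# ...and are filled with 0 only in a separate reporting column.
toy["CPC_reported"] = toy["CPC"].fillna(0)

print(toy)
```

Keeping the NaN version alongside the filled one makes it possible to tell an undefined metric apart from a genuine zero later on.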

### Purpose:

These engineered features form the foundation for:
- Performance benchmarking across campaigns and channels
- Exploratory data analysis (EDA)
- Predictive modeling (ROI forecasting, conversion modeling)
- Automation workflows in the MLOps pipeline (future phase)
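
For the benchmarking use case, one point worth keeping in mind is that a channel-level ratio is usually computed from summed counts rather than by averaging per-campaign ratios, because averaging gives a small campaign the same weight as a large one. A minimal sketch with made-up numbers (the `toy` frame is not project data):

```python
import pandas as pd

# Made-up campaigns: two Search campaigns of very different size, one Social.
toy = pd.DataFrame({
    "channel": ["Search", "Search", "Social"],
    "impressions": [1000, 9000, 5000],
    "clicks": [100, 90, 250],
})

# Sum the raw counts per channel first, then take the ratio.
totals = toy.groupby("channel")[["impressions", "clicks"]].sum()
totals["ctr"] = totals["clicks"] / totals["impressions"]

# For comparison: the unweighted mean of per-campaign CTRs.
toy["ctr"] = toy["clicks"] / toy["impressions"]
mean_of_ratios = toy.groupby("channel")["ctr"].mean()

print(totals)
print(mean_of_ratios)
```

In this toy data the two approaches disagree for Search, which is the situation the ratio-of-sums form is meant to handle.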

---

## Import Libraries:

In [62]:
#Libraries
import numpy as np
import pandas as pd
import os

## Ingest Data as DF:

In [63]:
#INGEST DATA

df = pd.read_csv("../data/processed/marketing_campaign_all_clean.csv")
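
As an optional variation, the date columns can be parsed while the file is read, using the `parse_dates` and `dtype` parameters of `pd.read_csv`. The sketch below uses an in-memory CSV with a single row taken from this dataset's preview so that it runs without the project file; `csv_text` and `sample` are illustrative names.

```python
import io

import pandas as pd

# One-row stand-in for the real CSV, so the example is self-contained.
csv_text = """campaign_id,start_date,end_date,spend_usd
2024_0001,2024-05-16,2024-08-16,39193.43
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["start_date", "end_date"],  # parse dates at load time
    dtype={"campaign_id": "string"},         # keep the identifier as text
)

duration_days = (sample["end_date"] - sample["start_date"]).dt.days
print(sample.dtypes)
print(duration_days)
```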


## Data Check:

In [64]:
#DATA CHECK

print(df.shape)
print(df.info())

df.head()

(1000, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   campaign_id       1000 non-null   object 
 1   campaign_name     1000 non-null   object 
 2   start_date        1000 non-null   object 
 3   end_date          1000 non-null   object 
 4   channel           1000 non-null   object 
 5   region            1000 non-null   object 
 6   impressions       1000 non-null   int64  
 7   clicks            1000 non-null   int64  
 8   conversions       1000 non-null   int64  
 9   spend_usd         1000 non-null   float64
 10  revenue_usd       1000 non-null   float64
 11  target_audience   1000 non-null   object 
 12  product_category  1000 non-null   object 
 13  device            1000 non-null   object 
 14  year              1000 non-null   int64  
 15  dataset_year      1000 non-null   int64  
dtypes: float64(2), int64(5), object(

Unnamed: 0,campaign_id,campaign_name,start_date,end_date,channel,region,impressions,clicks,conversions,spend_usd,revenue_usd,target_audience,product_category,device,year,dataset_year
0,2024_0001,Campaign_2024_0001,2024-05-16,2024-08-16,Search,South America,28252,5609,65466,39193.43,79017.74,Youth,Electronics,Desktop,2024,2024
1,2024_0002,Campaign_2024_0002,2024-04-06,2024-10-13,Search,Asia,89608,83584,26865,17291.53,49868.54,Adults,Home,Mobile,2024,2024
2,2024_0003,Campaign_2024_0003,2024-05-08,2024-11-27,Social,Europe,37853,62661,43662,6729.63,63021.28,Seniors,Electronics,Desktop,2024,2024
3,2024_0004,Campaign_2024_0004,2024-01-28,2024-08-03,Display,Africa,10577,41421,75023,15077.58,133106.71,Seniors,Clothing,Desktop,2024,2024
4,2024_0005,Campaign_2024_0005,2024-02-06,2024-08-23,Social,Asia,84039,56010,11283,16877.69,144736.99,Adults,Home,Mobile,2024,2024


## Convert Data Types - Dates & Numeric:

In [65]:
#FIX THE DATA TYPES:

# campaign_id stays as object: it is an identifier, not a numeric feature.
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['end_date'] = pd.to_datetime(df['end_date'], errors='coerce')
df['spend_usd'] = pd.to_numeric(df['spend_usd'], errors='coerce')
df['revenue_usd'] = pd.to_numeric(df['revenue_usd'], errors='coerce')
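
Because `errors='coerce'` silently turns unparseable values into `NaT` or `NaN`, it can be worth counting how many values were lost in the conversion. A minimal sketch on made-up input (the `raw` frame and its bad values are hypothetical):

```python
import pandas as pd

# Made-up raw strings, each column containing one unparseable value.
raw = pd.DataFrame({
    "start_date": ["2024-05-16", "not a date"],
    "spend_usd": ["39193.43", "n/a"],
})

converted = pd.DataFrame({
    "start_date": pd.to_datetime(raw["start_date"], errors="coerce"),
    "spend_usd": pd.to_numeric(raw["spend_usd"], errors="coerce"),
})

# Missing values after conversion minus missing values before it
# gives the number of entries that coercion discarded per column.
newly_missing = converted.isna().sum() - raw.isna().sum()
print(newly_missing)
```

A non-zero count points to rows that need cleaning upstream rather than in this step.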
    

In [66]:
# Prevent division by 0

df['impressions'] = df['impressions'].replace(0, np.nan)
df['clicks'] = df['clicks'].replace(0, np.nan)
df['conversions'] = df['conversions'].replace(0, np.nan)
df['spend_usd'] = df['spend_usd'].replace(0, np.nan)


## Compute the Metrics:

In [67]:
#ADD CUSTOM COLUMNS: 
#ctr
#conversion_rate 
#roi 
#campaign_duration_days


df["ctr"] = df["clicks"] / df["impressions"]  # Click-through rate
df["conversion_rate"] = df["conversions"] / df["clicks"]
df["roi"] = (df["revenue_usd"] - df["spend_usd"]) / df["spend_usd"]
df["campaign_duration_days"] = (df["end_date"] - df["start_date"]).dt.days

df['CPC'] = df['spend_usd'] / df['clicks']
df['CPM'] = df['spend_usd'] / (df['impressions'] / 1000)
df['CPA'] = df['spend_usd'] / df['conversions']
df['ROAS'] = df['revenue_usd'] / df['spend_usd']

df['revenue_per_click'] = df['revenue_usd'] / df['clicks']
df['revenue_per_conversion'] = df['revenue_usd'] / df['conversions']

#updated df

print(df.shape)
print(df.columns)
print(df.info())

df.head()

(1000, 26)
Index(['campaign_id', 'campaign_name', 'start_date', 'end_date', 'channel',
       'region', 'impressions', 'clicks', 'conversions', 'spend_usd',
       'revenue_usd', 'target_audience', 'product_category', 'device', 'year',
       'dataset_year', 'ctr', 'conversion_rate', 'roi',
       'campaign_duration_days', 'CPC', 'CPM', 'CPA', 'ROAS',
       'revenue_per_click', 'revenue_per_conversion'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   campaign_id             1000 non-null   object        
 1   campaign_name           1000 non-null   object        
 2   start_date              1000 non-null   datetime64[ns]
 3   end_date                1000 non-null   datetime64[ns]
 4   channel                 1000 non-null   object        
 5   region                  1000 non-null 

Unnamed: 0,campaign_id,campaign_name,start_date,end_date,channel,region,impressions,clicks,conversions,spend_usd,...,ctr,conversion_rate,roi,campaign_duration_days,CPC,CPM,CPA,ROAS,revenue_per_click,revenue_per_conversion
0,2024_0001,Campaign_2024_0001,2024-05-16,2024-08-16,Search,South America,28252,5609,65466,39193.43,...,0.198535,11.671599,1.016097,92,6.987597,1387.279839,0.598684,2.016097,14.08767,1.207004
1,2024_0002,Campaign_2024_0002,2024-04-06,2024-10-13,Search,Asia,89608,83584,26865,17291.53,...,0.932774,0.321413,1.883987,190,0.206876,192.968597,0.643645,2.883987,0.596628,1.856264
2,2024_0003,Campaign_2024_0003,2024-05-08,2024-11-27,Social,Europe,37853,62661,43662,6729.63,...,1.655377,0.696797,8.364747,203,0.107397,177.783267,0.15413,9.364747,1.00575,1.44339
3,2024_0004,Campaign_2024_0004,2024-01-28,2024-08-03,Display,Africa,10577,41421,75023,15077.58,...,3.916139,1.811231,7.828122,188,0.364008,1425.506287,0.200973,8.828122,3.213508,1.774212
4,2024_0005,Campaign_2024_0005,2024-02-06,2024-08-23,Social,Asia,84039,56010,11283,16877.69,...,0.666476,0.201446,7.57564,199,0.301334,200.831638,1.495851,8.57564,2.584128,12.827882
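
The preview above contains `ctr` and `conversion_rate` values greater than 1, which happens when clicks exceed impressions or conversions exceed clicks in the source data. The notebook does not say whether this is expected for this dataset, so the sketch below only flags such rows; it uses the counts of the first three preview rows, and `preview` and `funnel_violations` are illustrative names.

```python
import pandas as pd

# Counts copied from the first three rows of the preview above.
preview = pd.DataFrame({
    "campaign_id": ["2024_0001", "2024_0002", "2024_0003"],
    "impressions": [28252, 89608, 37853],
    "clicks": [5609, 83584, 62661],
    "conversions": [65466, 26865, 43662],
})

# A row breaks the usual funnel order when a later stage
# has a larger count than the stage before it.
funnel_violations = preview[
    (preview["clicks"] > preview["impressions"])
    | (preview["conversions"] > preview["clicks"])
]

print(funnel_violations["campaign_id"].tolist())
```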


In [68]:
# Convert NaN back to 0 now that the metrics are computed

df.fillna(0, inplace=True)

# Proper imputation will be done later for ML; replacing with 0 is fine for now
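
Note that `df.fillna(0)` applies to every column, not only the engineered ratios. If that is broader than intended, the fill can be limited to a chosen list of columns. A minimal sketch on made-up data, where `metric_cols` is a hypothetical list that would hold the engineered column names:

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing date and one undefined CPC.
toy = pd.DataFrame({
    "end_date": pd.to_datetime(["2024-08-16", None]),
    "clicks": [0, 40],
    "CPC": [np.nan, 0.25],
})

# Hypothetical subset: list the engineered metric columns here.
metric_cols = ["CPC"]
toy[metric_cols] = toy[metric_cols].fillna(0)

print(toy)
```

The missing date stays missing, while the undefined metric is reported as 0.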

In [75]:
# # VALIDATE NEW COLUMNS

# df['ctr_check'] = df['clicks'] / df['impressions']
# df['cr_check'] = df['conversions'] / df['clicks']
# df['roi_check'] = (df['revenue_usd'] - df['spend_usd']) / df['spend_usd']

# # Compare difference
# df['ctr_diff'] = df['ctr_check'] - df['ctr']
# df['cr_diff'] = df['cr_check'] - df['conversion_rate']
# df['roi_diff'] = df['roi_check'] - df['roi']
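
The commented-out check above recomputes each metric with the same formula, so it can only catch accidental overwrites. A complementary option is to test identities that link different columns, such as `roi == ROAS - 1` and `CPC * clicks == spend_usd`. A minimal sketch using the rounded values of the first two preview rows in this notebook (the `check` frame and the tolerances are assumptions):

```python
import numpy as np
import pandas as pd

# Values copied from the first two preview rows (rounded as displayed).
check = pd.DataFrame({
    "roi": [1.016097, 1.883987],
    "ROAS": [2.016097, 2.883987],
    "clicks": [5609, 83584],
    "CPC": [6.987597, 0.206876],
    "spend_usd": [39193.43, 17291.53],
})

# Identity 1: ROI is ROAS minus 1.
roi_ok = np.isclose(check["roi"], check["ROAS"] - 1, atol=1e-6)
# Identity 2: cost per click times clicks gives back the spend.
# The tolerance is loose because the inputs are rounded to 6 decimals.
cpc_ok = np.isclose(check["CPC"] * check["clicks"], check["spend_usd"], rtol=1e-4)

print(roi_ok.all(), cpc_ok.all())
```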



In [76]:
print(df.columns)
df.head()

Index(['campaign_id', 'campaign_name', 'start_date', 'end_date', 'channel',
       'region', 'impressions', 'clicks', 'conversions', 'spend_usd',
       'revenue_usd', 'target_audience', 'product_category', 'device', 'year',
       'dataset_year', 'ctr', 'conversion_rate', 'roi',
       'campaign_duration_days', 'CPC', 'CPM', 'CPA', 'ROAS',
       'revenue_per_click', 'revenue_per_conversion'],
      dtype='object')


Unnamed: 0,campaign_id,campaign_name,start_date,end_date,channel,region,impressions,clicks,conversions,spend_usd,...,ctr,conversion_rate,roi,campaign_duration_days,CPC,CPM,CPA,ROAS,revenue_per_click,revenue_per_conversion
0,2024_0001,Campaign_2024_0001,2024-05-16,2024-08-16,Search,South America,28252,5609,65466,39193.43,...,0.198535,11.671599,1.016097,92,6.987597,1387.279839,0.598684,2.016097,14.08767,1.207004
1,2024_0002,Campaign_2024_0002,2024-04-06,2024-10-13,Search,Asia,89608,83584,26865,17291.53,...,0.932774,0.321413,1.883987,190,0.206876,192.968597,0.643645,2.883987,0.596628,1.856264
2,2024_0003,Campaign_2024_0003,2024-05-08,2024-11-27,Social,Europe,37853,62661,43662,6729.63,...,1.655377,0.696797,8.364747,203,0.107397,177.783267,0.15413,9.364747,1.00575,1.44339
3,2024_0004,Campaign_2024_0004,2024-01-28,2024-08-03,Display,Africa,10577,41421,75023,15077.58,...,3.916139,1.811231,7.828122,188,0.364008,1425.506287,0.200973,8.828122,3.213508,1.774212
4,2024_0005,Campaign_2024_0005,2024-02-06,2024-08-23,Social,Asia,84039,56010,11283,16877.69,...,0.666476,0.201446,7.57564,199,0.301334,200.831638,1.495851,8.57564,2.584128,12.827882


## Save point:

In [77]:
#Save processed dataset
#File: marketing_campaign_2024_2025_processed


processed_path = "../data/processed/marketing_campaign_2024_2025_processed.csv"
df.to_csv(processed_path, index=False)

print(f"Processed dataset saved to: {processed_path}")


Processed dataset saved to: ../data/processed/marketing_campaign_2024_2025_processed.csv
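
CSV files do not store dtypes, so the datetime columns come back as plain strings when the saved file is read again unless `parse_dates` is passed. A minimal round-trip sketch that writes a one-row toy frame to a temporary directory instead of the project path (`toy`, `out_path`, and the file name are illustrative):

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row toy frame standing in for the processed dataset.
toy = pd.DataFrame({
    "campaign_id": ["2024_0001"],
    "start_date": pd.to_datetime(["2024-05-16"]),
    "ctr": [0.198535],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_processed.csv"
    toy.to_csv(out_path, index=False)
    # Re-read the file, restoring the datetime dtype explicitly.
    reloaded = pd.read_csv(out_path, parse_dates=["start_date"])

print(reloaded.dtypes)
```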
