## **4. Feature Engineering**

### **4.1 Overview**
This notebook builds customer-level features for customer segmentation using RFM (Recency, Frequency, Monetary) analysis. The same logic is packaged in the `src/feature_engineering.py` module for production use.

**Key Features Created:**
1. **RFM Metrics**: Recency, Frequency, Monetary values
2. **Behavioral Metrics**: TotalItems, UniqueProducts, AvgOrderValue, ItemsPerOrder
3. **Customer-Level Aggregation**: Transaction data → Customer features

**Production Usage:**
```python
from src.feature_engineering import create_customer_features
customer_features = create_customer_features(df_processed)
```

### **4.2 Load Processed Data**

In [1]:
# Import libraries and load processed data
import pandas as pd
import sys
import os

# Add src directory to path for importing our module
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Load processed data
df = pd.read_csv('../data/processed/Online_Retail_Cleaned.csv')
print(f"Processed data shape: {df.shape}")
df.head()

Processed data shape: (461035, 8)


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-01-12 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-01-12 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-01-12 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-01-12 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-01-12 | 3.39 | 17850.0 | United Kingdom |
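
The preview shows `StockCode` values that mix digits and letters (`85123A`, `71053`). As a minimal sketch, assuming the eight column names shown in the preview, `pd.read_csv` can be asked to keep identifier columns as text and to parse `InvoiceDate` while loading. The inline CSV text is a two-row stand-in for the real file, not the file itself.

```python
import io

import pandas as pd

# Two-row stand-in for ../data/processed/Online_Retail_Cleaned.csv, copied
# from the preview above (assumption: the file has exactly these 8 columns).
csv_text = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12,2.55,17850.0,United Kingdom
536365,71053,WHITE METAL LANTERN,6,2010-01-12,3.39,17850.0,United Kingdom
"""

loaded = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"InvoiceNo": str, "StockCode": str},  # keep identifiers as text
    parse_dates=["InvoiceDate"],                 # parse dates at load time
)
print(loaded.dtypes)
```

Reading identifiers as strings avoids purely numeric codes such as `71053` being inferred as integers while `85123A` stays text.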


### **4.3 Calculate Total Price**

In [2]:
# Calculate the total price of each invoice line (UnitPrice * Quantity)
from feature_engineering import calculate_total_price, prepare_date_column

# Prepare date column and calculate total price
df = prepare_date_column(df)
df = calculate_total_price(df)

print("Sample data with TotalPrice:")
print(df[['InvoiceDate', 'UnitPrice', 'Quantity', 'TotalPrice']].head())

# Note: This functionality is now available in src/feature_engineering.py
# Functions: prepare_date_column(df), calculate_total_price(df)

Sample data with TotalPrice:
  InvoiceDate  UnitPrice  Quantity  TotalPrice
0  2010-01-12       2.55         6       15.30
1  2010-01-12       3.39         6       20.34
2  2010-01-12       2.75         8       22.00
3  2010-01-12       3.39         6       20.34
4  2010-01-12       3.39         6       20.34
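
The sample output is consistent with `TotalPrice = UnitPrice * Quantity` (for example, 2.55 × 6 = 15.30). The module's source is not shown here, so the following is only a sketch of what the two helpers might do. The `_sketch` function names and the `fmt` parameter are hypothetical and are not part of `src/feature_engineering.py`.

```python
import pandas as pd


def prepare_date_column_sketch(df: pd.DataFrame, fmt: str = "%Y-%m-%d") -> pd.DataFrame:
    """Hypothetical stand-in: parse InvoiceDate using an explicit format."""
    out = df.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"], format=fmt)
    return out


def calculate_total_price_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: line total = unit price times quantity."""
    out = df.copy()
    out["TotalPrice"] = out["UnitPrice"] * out["Quantity"]
    return out


# First three rows of the sample output above.
sample = pd.DataFrame(
    {
        "InvoiceDate": ["2010-01-12", "2010-01-12", "2010-01-12"],
        "UnitPrice": [2.55, 3.39, 2.75],
        "Quantity": [6, 6, 8],
    }
)
sample = calculate_total_price_sketch(prepare_date_column_sketch(sample))
print(sample)
```

Passing an explicit `format` stops pandas from guessing between day-first and month-first dates. Which order is correct for the raw file is not stated in this notebook, so it is worth confirming against the source data.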


### **4.4 RFM Feature Calculation**

In [3]:
# Calculate RFM features using our modular function
from feature_engineering import calculate_rfm_features

customer_data = calculate_rfm_features(df)
print(f"Customer data shape: {customer_data.shape}")
print("\nRFM Features:")
print(customer_data[['CustomerID', 'Recency', 'Frequency', 'Monetary']].head())

# Note: This functionality is now available in src/feature_engineering.py
# Function: calculate_rfm_features(df)

Customer data shape: (13196, 7)

RFM Features:
  CustomerID  Recency  Frequency  Monetary
0    12347.0       96          7   3412.53
1    12348.0      221          3     90.20
2    12349.0      698          1   1197.15
3    12350.0      312          1    294.40
4    12352.0      275          7   1147.44
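
For readers without the module at hand, here is a rough sketch of a typical RFM aggregation. It rests on assumptions that the notebook does not confirm: Recency is counted in days from one day after the latest invoice date, Frequency is the number of distinct `InvoiceNo` values, and Monetary is the sum of `TotalPrice`. The function name `rfm_sketch` is hypothetical.

```python
import pandas as pd


def rfm_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical RFM aggregation; not the module's actual code."""
    # Assumption: the reference date is one day after the latest invoice,
    # so the most recent customer gets Recency == 1.
    reference_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    rfm = (
        df.groupby("CustomerID")
        .agg(
            LastPurchase=("InvoiceDate", "max"),
            Frequency=("InvoiceNo", "nunique"),  # distinct invoices
            Monetary=("TotalPrice", "sum"),      # total spend
        )
        .reset_index()
    )
    rfm["Recency"] = (reference_date - rfm["LastPurchase"]).dt.days
    return rfm[["CustomerID", "Recency", "Frequency", "Monetary"]]


# Toy transactions: customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame(
    {
        "CustomerID": [1, 1, 1, 2],
        "InvoiceNo": ["A1", "A1", "A2", "B1"],
        "InvoiceDate": pd.to_datetime(
            ["2011-01-01", "2011-01-01", "2011-01-10", "2011-01-05"]
        ),
        "TotalPrice": [10.0, 5.0, 20.0, 7.5],
    }
)
rfm = rfm_sketch(toy)
print(rfm)
```

The one-day offset fits the minimum Recency of 1 in the feature summary, but other reference dates are equally common, so treat it as an illustration.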


### **4.5 Additional Behavioral Features**

In [4]:
# Calculate additional features using our modular function
from feature_engineering import calculate_additional_features

customer_data = calculate_additional_features(customer_data)
print("Additional Features:")
print(customer_data[['CustomerID', 'AvgOrderValue', 'ItemsPerOrder']].head())

# Note: This functionality is now available in src/feature_engineering.py
# Function: calculate_additional_features(df)

Additional Features:
  CustomerID  AvgOrderValue  ItemsPerOrder
0    12347.0     487.504286     272.142857
1    12348.0      30.066667      46.666667
2    12349.0    1197.150000     547.000000
3    12350.0     294.400000     196.000000
4    12352.0     163.920000      71.714286
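
The printed values are consistent with `AvgOrderValue = Monetary / Frequency` (3412.53 / 7 ≈ 487.504286), and `ItemsPerOrder` appears to follow the same pattern with `TotalItems`. A minimal sketch under that assumption follows. The function name is hypothetical, and the `TotalItems` values are back-computed from the output above for illustration only.

```python
import numpy as np
import pandas as pd


def additional_features_sketch(customers: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical ratio features; not the module's actual code."""
    out = customers.copy()
    # Treat a zero Frequency as missing so the division yields NaN, not inf.
    freq = out["Frequency"].replace(0, np.nan)
    out["AvgOrderValue"] = out["Monetary"] / freq
    out["ItemsPerOrder"] = out["TotalItems"] / freq
    return out


customers = pd.DataFrame(
    {
        "CustomerID": [12347.0, 12348.0],
        "Frequency": [7, 3],
        "Monetary": [3412.53, 90.20],
        # Back-computed as ItemsPerOrder * Frequency (illustrative values).
        "TotalItems": [1905, 140],
    }
)
result = additional_features_sketch(customers)
print(result[["CustomerID", "AvgOrderValue", "ItemsPerOrder"]])
```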


### **4.6 Feature Scaling**

In [6]:
# Scale features for clustering using our modular function
from feature_engineering import scale_features, get_default_feature_columns

# Get default feature columns for scaling
feature_columns = get_default_feature_columns()
print(f"Features to scale: {feature_columns}")

# Scale features
scaled_features, scaler = scale_features(customer_data, feature_columns)
print(f"\nScaled features shape: {scaled_features.shape}")
print("Sample scaled features:")
scaled_features.head()

# Note: This functionality is now available in src/feature_engineering.py
# Functions: scale_features(df, feature_columns), get_default_feature_columns()

Features to scale: ['Recency', 'Frequency', 'Monetary', 'UniqueProducts', 'AvgOrderValue', 'ItemsPerOrder']

Scaled features shape: (13196, 6)
Sample scaled features:


| | Recency_scaled | Frequency_scaled | Monetary_scaled | UniqueProducts_scaled | AvgOrderValue_scaled | ItemsPerOrder_scaled |
|---|---|---|---|---|---|---|
| 0 | -0.253606 | -0.464574 | 2.144055 | 1.430145 | 2.320637 | 2.136988 |
| 1 | 0.345852 | -1.092186 | -0.228514 | -0.427015 | -0.339961 | -0.034825 |
| 2 | 2.633386 | -1.405993 | 0.56199 | 0.756669 | 6.448155 | 4.784445 |
| 3 | 0.782258 | -1.405993 | -0.08269 | -0.222931 | 1.197483 | 1.403571 |
| 4 | 0.604818 | -0.464574 | 0.52649 | 0.470952 | 0.438571 | 0.206436 |
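
The scaled values are consistent with z-score standardization. For the first customer, (96 − 148.882389) / 208.529431 ≈ −0.2536, using the Recency mean and standard deviation from the feature summary in section 4.7. Whether `scale_features` wraps scikit-learn's `StandardScaler` or something else is not shown, so this pandas-only sketch is an assumption about the technique and not the module's code. The function name is hypothetical.

```python
import pandas as pd


def standardize_sketch(df: pd.DataFrame, columns: list):
    """Hypothetical z-score scaling: (x - mean) / population std."""
    means = df[columns].mean()
    # ddof=0 gives the population standard deviation, which is what
    # scikit-learn's StandardScaler uses.
    stds = df[columns].std(ddof=0)
    scaled = (df[columns] - means) / stds
    scaled.columns = [f"{col}_scaled" for col in columns]
    return scaled, means, stds


toy = pd.DataFrame({"Recency": [10, 20, 30], "Frequency": [1, 2, 3]})
scaled, means, stds = standardize_sketch(toy, ["Recency", "Frequency"])
print(scaled)
```

Returning the means and standard deviations alongside the scaled frame mirrors the `(scaled_features, scaler)` pair above: the same parameters can then be reused to transform new customers.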


### **4.7 Complete Feature Engineering Pipeline**

In [11]:
# Run the complete feature engineering pipeline using our modular function
from feature_engineering import create_customer_features, save_customer_features

# This single function chains the steps from sections 4.3 to 4.5:
# 1. Prepare date column
# 2. Calculate total price
# 3. Calculate RFM features
# 4. Calculate additional features
# Scaling (section 4.6) is not part of it; the output has no *_scaled columns.

customer_features = create_customer_features(df)
print(f"Final customer features shape: {customer_features.shape}")
print("\nFeature columns:")
print(customer_features.columns.tolist())

# Display feature summary
print("\nFeature Summary:")
customer_features.describe()

# Save customer features
# save_customer_features(customer_features, '../data/processed/Customer_RFM_Features.csv')
# print("\nCustomer features saved to: ../data/processed/Customer_RFM_Features.csv")

Final customer features shape: (13196, 9)

Feature columns:
['CustomerID', 'Recency', 'Frequency', 'Monetary', 'TotalItems', 'UniqueProducts', 'Country', 'AvgOrderValue', 'ItemsPerOrder']

Feature Summary:


| | Recency | Frequency | Monetary | TotalItems | UniqueProducts | AvgOrderValue | ItemsPerOrder |
|---|---|---|---|---|---|---|---|
| count | 13196.0 | 13196.0 | 13196.0 | 13196.0 | 13196.0 | 13196.0 | 13196.0 |
| mean | 148.882389 | 9.960897 | 410.190897 | 216.308275 | 26.923537 | 88.516339 | 50.282197 |
| std | 208.529431 | 6.373602 | 1400.362109 | 765.909353 | 49.001417 | 171.936881 | 103.823254 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 30.0 | 5.0 | 71.8175 | 26.0 | 11.0 | 6.062048 | 2.181818 |
| 50% | 67.0 | 11.0 | 109.91 | 40.0 | 14.0 | 8.689911 | 3.111111 |
| 75% | 155.0 | 14.0 | 223.385 | 100.0 | 19.0 | 123.395 | 63.297619 |
| max | 698.0 | 199.0 | 91215.12 | 50203.0 | 1644.0 | 3774.27 | 2529.0 |
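
Before passing the feature table to clustering, a few cheap sanity checks can catch aggregation mistakes. The sketch below is one possible set of checks, assuming the column names listed above and the relationship `AvgOrderValue = Monetary / Frequency` suggested by the earlier outputs. The function `check_customer_features` is hypothetical and not part of the module.

```python
import numpy as np
import pandas as pd


def check_customer_features(features: pd.DataFrame) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    # Each customer should appear on exactly one row.
    if features["CustomerID"].duplicated().any():
        problems.append("duplicate CustomerID rows")
    # Every customer in the table has at least one order.
    if (features["Frequency"] < 1).any():
        problems.append("Frequency below 1")
    # Assumed relationship between the derived and the base columns.
    expected = features["Monetary"] / features["Frequency"]
    if not np.allclose(features["AvgOrderValue"], expected):
        problems.append("AvgOrderValue != Monetary / Frequency")
    return problems


good = pd.DataFrame(
    {
        "CustomerID": [1.0, 2.0],
        "Frequency": [2, 1],
        "Monetary": [35.0, 7.5],
        "AvgOrderValue": [17.5, 7.5],
    }
)
bad = good.copy()
bad.loc[1, "CustomerID"] = 1.0      # introduce a duplicate customer
bad.loc[0, "AvgOrderValue"] = 99.0  # break the ratio

print(check_customer_features(good))
print(check_customer_features(bad))
```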
