# Phase 3: Feature Engineering for Business Signals

## 1. Business Problem Statement
Raw transaction data does not directly capture customer value, demand patterns,
or operational risk. To enable forecasting, segmentation, churn prediction, and
profit optimization, we must engineer features that represent meaningful business
behavior rather than individual transactions.

## 2. Why This Matters to the Business
Well-designed features allow the business to anticipate customer needs, forecast
demand more accurately, identify high-value customers, and make proactive decisions.
Poor feature design leads to fragile models and misleading insights; for example,
a per-transaction view cannot show that a customer has stopped ordering.

In [2]:
import pandas as pd

# Load the cleaned transaction data (one row per order line item)
df = pd.read_csv("../data/processed/cleaned_data.csv")
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


## FEATURE ENGINEERING THEMES

A. TIME-BASED FEATURES

In [3]:
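# Parse order dates so the .dt accessor is available
df['Order Date'] = pd.to_datetime(df['Order Date'])

# Calendar features for trend and seasonality analysis
df['order_year'] = df['Order Date'].dt.year
df['order_month'] = df['Order Date'].dt.month
df['order_quarter'] = df['Order Date'].dt.quarter
df['order_dayofweek'] = df['Order Date'].dt.dayofweek  # Monday=0, Sunday=6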
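As a quick check of the calendar features, the sketch below recomputes them for three order dates taken from the sample rows shown earlier. The `toy` frame and the `order_dayname` column are illustrative names, not part of the notebook; the day name is added only to make the integer `dayofweek` encoding easier to read.

```python
import pandas as pd

# Three order dates copied from the sample rows above (not the full dataset).
toy = pd.DataFrame({"Order Date": ["2016-11-08", "2016-06-12", "2015-10-11"]})
toy["Order Date"] = pd.to_datetime(toy["Order Date"])

toy["order_month"] = toy["Order Date"].dt.month
toy["order_quarter"] = toy["Order Date"].dt.quarter
toy["order_dayofweek"] = toy["Order Date"].dt.dayofweek  # Monday=0, Sunday=6

# Illustrative helper column: human-readable weekday for cross-checking.
toy["order_dayname"] = toy["Order Date"].dt.day_name()

print(toy)
```

Here 2016-11-08 falls on a Tuesday (`dayofweek` 1) and the other two dates on a Sunday (`dayofweek` 6), which agrees with the `order_dayofweek` values in the notebook output.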

B. CUSTOMER VALUE FEATURES 

In [11]:
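# Reference date: one day after the last order, so the most recent buyer has Recency = 1
snapshot_date = df['Order Date'].max() + pd.Timedelta(days=1)

rfm = df.groupby('Customer ID').agg({
    'Order Date': lambda x: (snapshot_date - x.max()).days,  # Recency: days since last order
    'Order ID': 'nunique',                                    # Frequency: distinct orders
    'Sales': 'sum'                                            # Monetary: total sales
}).reset_index()

rfm.columns = ['Customer ID', 'Recency', 'Frequency', 'Monetary']
snapshot_date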

Timestamp('2017-12-31 00:00:00')
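To make the RFM definitions concrete, here is a minimal sketch on a toy table with made-up customers and amounts (all values are hypothetical). It uses pandas named aggregation instead of the dictionary form plus column rename used above; the result has the same shape.

```python
import pandas as pd

# Hypothetical transactions: customer A has two orders (o1 has two line items),
# customer B has one order.
toy = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B"],
    "Order ID": ["o1", "o1", "o2", "o3"],
    "Order Date": pd.to_datetime(
        ["2017-12-01", "2017-12-01", "2017-12-20", "2017-12-30"]
    ),
    "Sales": [100.0, 50.0, 25.0, 10.0],
})

# Same convention as the notebook: one day after the latest order.
snapshot = toy["Order Date"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("Customer ID").agg(
    Recency=("Order Date", lambda d: (snapshot - d.max()).days),
    Frequency=("Order ID", "nunique"),  # distinct orders, not line items
    Monetary=("Sales", "sum"),
).reset_index()

print(toy_rfm)
```

Customer A ends up with Recency 11, Frequency 2, and Monetary 175.0; customer B with Recency 1, Frequency 1, and Monetary 10.0. Counting distinct `Order ID` values matters because a single order can span several rows.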

In [5]:
rfm.to_csv("../data/processed/customer_rfm.csv", index=False)

C. PROFIT & DISCOUNT BEHAVIOR FEATURES

In [10]:
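# Profit per unit of sales revenue; undefined (inf/NaN) for any row where Sales is 0
df['profit_margin'] = df['Profit'] / df['Sales']
# Binary flag: 1 if any discount was applied to the line item
df['is_discounted'] = (df['Discount'] > 0).astype(int)
df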

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Sales,Quantity,Discount,Profit,order_year,order_month,order_quarter,order_dayofweek,profit_margin,is_discounted
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,261.9600,2,0.00,41.9136,2016,11,4,1,0.1600,0
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,731.9400,3,0.00,219.5820,2016,11,4,1,0.3000,0
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,14.6200,2,0.00,6.8714,2016,6,2,6,0.4700,0
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,957.5775,5,0.45,-383.0310,2015,10,4,6,-0.4000,1
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,22.3680,2,0.20,2.5164,2015,10,4,6,0.1125,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,CA-2014-110422,2014-01-21,2014-01-23,Second Class,TB-21400,Tom Boeckenhauer,Consumer,United States,Miami,...,25.2480,3,0.20,4.1028,2014,1,1,1,0.1625,1
9990,9991,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,91.9600,2,0.00,15.6332,2017,2,1,6,0.1700,0
9991,9992,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,258.5760,2,0.20,19.3932,2017,2,1,6,0.0750,1
9992,9993,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,29.6000,4,0.00,13.3200,2017,2,1,6,0.4500,0
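The margin calculation divides by `Sales`, so a zero-sales row would produce `inf` or `NaN`. Whether such rows exist in this dataset is not shown here; the sketch below is one possible guard, assuming a zero-sales line should simply get a missing margin. The third row of the toy frame is a hypothetical zero-sales case added for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sales": [261.96, 957.5775, 0.0],    # last row: hypothetical zero-sales line
    "Profit": [41.9136, -383.031, 0.0],
    "Discount": [0.0, 0.45, 0.0],
})

# Swap zero sales for NaN before dividing so the margin is NaN rather than inf.
toy["profit_margin"] = toy["Profit"] / toy["Sales"].replace(0, np.nan)
toy["is_discounted"] = (toy["Discount"] > 0).astype(int)

print(toy)
```

The first two rows reproduce the margins in the output above (0.16 and -0.40), and the zero-sales row gets `NaN`, which downstream aggregations such as `mean` skip by default.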


D. AGGREGATED PRODUCT / CUSTOMER FEATURES

In [9]:
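# One row per customer: total sales, total profit,
# mean line-item discount, and number of distinct orders
customer_agg = df.groupby('Customer ID').agg({
    'Sales': 'sum',
    'Profit': 'sum',
    'Discount': 'mean',
    'Order ID': 'nunique'
}).reset_index()
customer_agg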

Unnamed: 0,Customer ID,Sales,Profit,Discount,Order ID
0,AA-10315,5563.560,-362.8825,0.090909,5
1,AA-10375,1056.390,277.3824,0.080000,9
2,AA-10480,1790.512,435.8274,0.016667,4
3,AA-10645,5086.935,857.8033,0.063889,6
4,AB-10015,886.156,129.3465,0.066667,3
...,...,...,...,...,...
788,XP-21865,2374.658,621.2300,0.046429,11
789,YC-21895,5454.350,1305.6290,0.075000,5
790,YS-21880,6720.444,1778.2923,0.050000,8
791,ZC-21910,8025.707,-1032.1490,0.254839,13
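In the table above the aggregated columns keep their source names, so `Order ID` actually holds an order count and `Discount` a mean. One way to make that explicit is named aggregation, sketched here on hypothetical toy data; the output column names (`total_sales`, `total_profit`, `avg_discount`, `order_count`) are illustrative choices, not names used elsewhere in this project.

```python
import pandas as pd

# Hypothetical line items for two customers.
toy = pd.DataFrame({
    "Customer ID": ["A", "A", "B"],
    "Order ID": ["o1", "o2", "o3"],
    "Sales": [100.0, 50.0, 20.0],
    "Profit": [10.0, -5.0, 4.0],
    "Discount": [0.0, 0.2, 0.1],
})

toy_agg = toy.groupby("Customer ID").agg(
    total_sales=("Sales", "sum"),
    total_profit=("Profit", "sum"),
    avg_discount=("Discount", "mean"),   # unweighted mean over line items
    order_count=("Order ID", "nunique"),
).reset_index()

print(toy_agg)
```

Note that the discount is averaged per line item, not weighted by sales, which matches the `'Discount': 'mean'` aggregation in the notebook.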


In [8]:
df.to_csv("../data/processed/featured_data.csv", index=False)

## 3. Executive Summary

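This phase transformed transactional data into business-level signals: calendar
features (year, month, quarter, day of week) for seasonality, RFM values (recency,
frequency, monetary) for customer value, profit margin and a discount flag for
profitability, and per-customer aggregates of sales, profit, discount, and order
count for purchasing behavior. The RFM table and the enriched transaction data were
saved to `customer_rfm.csv` and `featured_data.csv`. These features form the
foundation for forecasting, segmentation, churn prediction, and profit optimization
models in subsequent phases.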