## **Notebook Content**

Project: *Customer Prioritization Under Constraints*

File: 02_feature_engineering.ipynb <br>
Author: Bryan Melvida

Purpose:
- Ingest cleaned transactional data
- Derive behavior-based and aggregate features
- Prepare feature set for downstream modeling and analysis

Input: [`customer_cleaned.parquet`](../data/cleaned/customer_cleaned.parquet) <br>

Output: [`customer_features.parquet`](../data/preprocessed/)<br>
Related Documentation: [`feature_definitions.md`](../docs/feature_engineering/feature_definitions.md)

<br>

---

<br>

In [1]:
import warnings
warnings.filterwarnings("ignore", category= FutureWarning)
from pathlib import Path

import sys
sys.path.append('../')
import src.assessment_views as av
from src import plot_settings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plot_settings.set()

In [2]:
df = pd.read_parquet('../data/cleaned/customer_cleaned.parquet')

# Instantiate customer features dataframe
customer_features = (df[['CustomerID']].astype(int).drop_duplicates().reset_index(drop= True))

<br>

---

## **Population Scope Definition**

- One-time buyers are excluded from the clustering population because they lack the behavioral history needed for reliable segmentation
- Including single-transaction customers would introduce incomplete feature profiles and add noise to the cluster structure without supporting differentiated action


In [3]:
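# Count rows per customer; this equals the number of transactions only if each row is one transaction
tx_counts = df.groupby('CustomerID')['InvoiceDate'].count().reset_index(name='TxCount')
# Customers with exactly one row are treated as one-time buyers
onetime_buyer = tx_counts.loc[tx_counts['TxCount']==1, :]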

df_clean = df.loc[~df['CustomerID'].isin(onetime_buyer['CustomerID']), :]

min_tx_check = df_clean.groupby('CustomerID')['InvoiceDate'].count().min()
removed_customer = onetime_buyer['CustomerID'].nunique()

print(f'Customer Minimum Transaction: {min_tx_check}')
av.df_shape(df_clean)

Customer Minimum Transaction: 2
Total Rows: 313,096
Total Columns: 8
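
The filter above counts rows per customer. If the cleaned data holds one row per line item rather than one row per invoice (an assumption; the notebook does not state the grain), a row count and an invoice count can disagree about who is a one-time buyer. The sketch below illustrates the difference on hypothetical toy data; `toy`, `row_counts`, `invoice_counts`, and `repeat_buyers` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical line-item data: several rows can share one invoice.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1", "C1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-02-10",
         "2011-01-07", "2011-01-07", "2011-03-01"]
    ),
})

# Row-level count: every line item counts once.
row_counts = toy.groupby("CustomerID")["InvoiceDate"].count()

# Invoice-level count: every distinct invoice counts once.
invoice_counts = toy.groupby("CustomerID")["InvoiceNo"].nunique()

# Keep only customers with more than one distinct invoice.
single_invoice_ids = invoice_counts[invoice_counts == 1].index
repeat_buyers = toy[~toy["CustomerID"].isin(single_invoice_ids)]

print(row_counts.to_dict())
print(invoice_counts.to_dict())
```

Here customer 2 has two rows but only one invoice, so a row-based filter would keep them while an invoice-based filter would not. Which definition is appropriate depends on the grain of `customer_cleaned.parquet`.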


<br>

---

## **Feature Engineering**

Constructs customer-level behavioral features that focus on:
- Timing and persistence of engagement
- Efficiency and complexity of transactions
- Behavioral signals that justify differentiated treatment

**Cancellation Rate**
>Captures the degree of transaction reversal behavior, serving as a signal of execution friction, behavioral instability, and operational cost associated with serving the customer

In [4]:
total_tx = df_clean['CustomerID'].value_counts()
cancelled_tx = df_clean.loc[df_clean['InvoiceNo'].str.startswith('C'), :]

cancellation_count = (
    cancelled_tx['CustomerID']
    .astype(int)
    .value_counts()
    .reset_index(name= 'cancelled_tx')
    )

cancellation_count['total_tx'] = cancellation_count['CustomerID'].map(total_tx).fillna(0)

# Assign a 100% cancellation rate to customers with no recorded transactions
cancellation_count['cancellation_rate'] = np.where(
    cancellation_count['total_tx'] == 0, 1.0, 
    cancellation_count['cancelled_tx'] / cancellation_count['total_tx']
    )

# Assign 0% cancellation rate for customers w/o cancelled transactions
customer_features['cancellation_rate'] = customer_features['CustomerID'].map(
    cancellation_count.set_index('CustomerID')['cancellation_rate']).fillna(0)
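
As a cross-check on the logic above, the same row-level rate can be sketched as the mean of a boolean cancellation flag per customer. This is a minimal illustration on hypothetical toy data, assuming (as the cell above does) that cancelled invoices carry a `C` prefix in `InvoiceNo`; `toy`, `is_cancelled`, and `toy_rate` are illustrative names.

```python
import pandas as pd

# Hypothetical data: invoice numbers starting with "C" mark cancellations.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 2, 3, 3],
    "InvoiceNo": ["536365", "536366", "C536379", "536380",
                  "536381", "536382", "C536383", "C536384"],
})

# Cast to str first so missing or numeric invoice numbers yield False instead of NaN.
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")

# The mean of a boolean flag is the share of cancelled rows per customer.
toy_rate = (
    is_cancelled
    .groupby(toy["CustomerID"])
    .mean()
    .rename("cancellation_rate")
)

print(toy_rate)
```

Customers with no cancelled rows come out as 0 directly, so no separate fill step is needed in this formulation.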

<br>

**RFM Based Behavioral Features**

**`Days Since Last Purchase`** *( Timing )*
>Indicates how recently a customer has engaged, supporting assessment of engagement recency and the expected relevance of near-term action
>

**`Total Transactions`** *( Persistence )*
>Represents the extent of repeated engagement, distinguishing sustained customer relationships from isolated or episodic interactions
>

**`Average Order Value`** *( Efficiency )*
>Reflects transaction-level efficiency, indicating whether higher-touch actions are justified relative to the typical scale of customer interaction
>
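
The three definitions above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's implementation: the `Quantity` and `UnitPrice` columns, the snapshot date (one day after the latest purchase in the data), and counting transactions as distinct invoices are all assumptions, and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical line-item data; Quantity and UnitPrice are assumed column names.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-02-10", "2011-01-07", "2011-03-01"]
    ),
    "Quantity": [2, 1, 4, 10, 5],
    "UnitPrice": [5.0, 10.0, 2.5, 1.0, 2.0],
})
toy["line_total"] = toy["Quantity"] * toy["UnitPrice"]

# Assumed reference point: the day after the most recent purchase in the data.
snapshot_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    last_purchase=("InvoiceDate", "max"),
    total_transactions=("InvoiceNo", "nunique"),  # distinct invoices
    total_spend=("line_total", "sum"),
)

# Timing: days between the snapshot date and the last purchase.
rfm["days_since_last_purchase"] = (snapshot_date - rfm["last_purchase"]).dt.days

# Efficiency: average spend per distinct invoice.
rfm["average_order_value"] = rfm["total_spend"] / rfm["total_transactions"]

print(rfm[["days_since_last_purchase", "total_transactions", "average_order_value"]])
```

Choosing a fixed snapshot date keeps recency values comparable across customers; using the current date instead would shift every value as the data ages.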