# Feature Engineering

## Purpose

This notebook builds the **Machine Learning Feature Engineering layer** of the Crisis Recovery Lakehouse.

The purpose of this layer is to convert **clean, trusted Silver-layer data**
into **customer-level, model-ready features** that can be used for:

- Churn prediction
- Crisis impact modeling
- Retention strategy optimization
- Machine learning experimentation and governance

Unlike Gold tables, this notebook does **not** serve dashboards or BI tools.

Instead, it focuses on:
- Predictive signal quality
- Feature consistency
- Reproducibility
- Safe handoff into ML pipelines (MLflow)

---

## Business Context

During a crisis, customer churn is rarely random.

Customers disengage due to:
- Repeated delivery delays
- Food safety concerns
- Accumulation of negative sentiment
- Reduced engagement following bad experiences

Leadership and retention teams need predictive answers to:
- Which customers are at risk because of the crisis?
- Which customers should receive proactive offers?
- Which customers are most valuable to retain?

This notebook transforms behavioral and sentiment data
into **quantitative signals** that machine learning models can learn from.

---

## Inputs and Outputs

### Inputs (from Silver Layer)

| Source Table | Purpose |
|-------------|---------|
| `silver_orders_enriched` | Order history, sentiment, customer attributes |
| `fact_reviews` | Review scores and customer feedback |

---

### Output (ML Feature Table)

| Table | Business Purpose |
|------|------------------|
| `ml_churn_features` | Customer-level feature set for churn modeling |

---

## Design Principles of the ML Feature Layer

- Features are **aggregated at the customer level**
- Every feature has a **clear behavioral meaning**
- No future information leakage
- Features are numerical and model-ready
- Transformations are deterministic and reproducible
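
Several of these principles can be checked mechanically on a small sample of the feature table. The following is a minimal sketch in plain Python over a list of dicts (for example, rows collected from a sample); the function name `check_feature_rows` and the sample rows are hypothetical and not part of the notebook.

```python
from numbers import Number

def check_feature_rows(rows, key="customer_id"):
    """Return a list of human-readable violations of the design principles."""
    problems = []
    seen = set()
    for row in rows:
        cid = row[key]
        # Principle: one row per customer (customer-level aggregation)
        if cid in seen:
            problems.append(f"duplicate customer: {cid}")
        seen.add(cid)
        # Principle: features are numerical and model-ready (no nulls, no strings)
        for name, value in row.items():
            if name == key:
                continue
            if isinstance(value, bool) or not isinstance(value, Number):
                problems.append(f"non-numeric feature {name!r} for customer {cid}")
    return problems

# Hypothetical sample rows, for illustration only
sample = [
    {"customer_id": "c1", "total_orders": 12, "late_order_ratio": 0.25},
    {"customer_id": "c2", "total_orders": 3, "late_order_ratio": None},
    {"customer_id": "c1", "total_orders": 12, "late_order_ratio": 0.25},
]
violations = check_feature_rows(sample)
```

Here `violations` reports the null ratio for `c2` and the repeated `c1` row. A check like this does not prove the absence of leakage; it only guards the structural properties.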


## 1. Basic Behavioral Features

### Business Problem

Before predicting churn, we must understand **baseline customer behavior**.

Key questions:
- How active is the customer?
- How frequently do they order?
- How long have they been engaged with the platform?

---

### Approach

We aggregate order data at the customer level to compute:
- Total lifetime orders
- Average review score
- Share of orders flagged as late deliveries
- Days since the last order (recency)

These features represent **long-term engagement strength**.

In [0]:
# Alias sum/max so the Python built-ins are not shadowed for the rest of the notebook
from pyspark.sql.functions import (
    avg, col, count, datediff, lit, to_date, when,
    max as max_col, sum as sum_col,
)
from pyspark.ml.feature import StringIndexer, VectorAssembler

# 1. Define the analysis date (the "today" of the simulation)
simulated_today = to_date(lit("2025-12-31"))

# 2. Load the Silver data
df_silver = spark.table("food_delivery.silver_orders_enriched")

# 3. Create customer features, aggregated at the customer level
customer_features = df_silver.groupBy("customer_id", "segment").agg(
    count("order_id").alias("total_orders"),
    avg("review_score").alias("avg_star_rating"),

    # Feature: late ratio (share of orders whose sentiment category is "Late Delivery")
    (sum_col(when(col("sentiment_category") == "Late Delivery", 1).otherwise(0)) / count("order_id")).alias("late_order_ratio"),

    # Feature: days since last order (recency)
    datediff(simulated_today, max_col("created_at_simulated")).alias("recency_days")
)

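# 4. Define the target (label)
# Rule: if recency_days > 30 the customer is churned (1), otherwise active (0).
# Note: the label is a deterministic function of recency_days, so recency_days
# (and anything derived from it) must be excluded from model inputs to avoid leakage.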
training_data = customer_features.withColumn(
    "label", 
    when(col("recency_days") > 30, 1).otherwise(0)
)


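# 5. Handle categorical data (segment)
# Machine learning models need numbers, not strings like "Student".
# StringIndexer maps each segment to a numeric index, by default ordered by
# descending frequency (most frequent -> 0.0), e.g. "Student" -> 0.0, "Corporate" -> 1.0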
indexer = StringIndexer(inputCol="segment", outputCol="segment_index")
training_data_indexed = indexer.fit(training_data).transform(training_data)

# 6. Save as a Feature Table
training_data_indexed.write.format("delta").mode("overwrite").saveAsTable("food_delivery.ml_churn_features")
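
To make the aggregation and the labeling rule concrete, here is a plain-Python walk-through for a single customer. It is a sketch on made-up orders, not the notebook's Spark code; the function `basic_features` and the `"Positive"` category value are hypothetical, while the analysis date, the `"Late Delivery"` category, and the 30-day threshold come from the cell above.

```python
from datetime import date

ANALYSIS_DATE = date(2025, 12, 31)   # same simulated "today" as the notebook
CHURN_THRESHOLD_DAYS = 30            # same rule as step 4

def basic_features(orders):
    """orders: list of (order_date, sentiment_category) for ONE customer."""
    total = len(orders)
    late = len([1 for _, category in orders if category == "Late Delivery"])
    last_order = max(order_date for order_date, _ in orders)
    recency_days = (ANALYSIS_DATE - last_order).days
    return {
        "total_orders": total,
        "late_order_ratio": late / total,
        "recency_days": recency_days,
        "label": 1 if recency_days > CHURN_THRESHOLD_DAYS else 0,
    }

# Made-up order history for one customer
toy_orders = [
    (date(2025, 10, 1), "Positive"),
    (date(2025, 11, 15), "Late Delivery"),
    (date(2025, 11, 20), "Late Delivery"),
    (date(2025, 11, 25), "Positive"),
]
features = basic_features(toy_orders)
```

The last order is 36 days before the analysis date, so this customer is labeled as churned, and half of the four orders were late. Because `label` is computed directly from `recency_days`, a model that sees `recency_days` as an input can reproduce the label trivially.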


**Basic Behavioral Features:**
- total_orders
- avg_star_rating
- late_order_ratio
- recency_days
- segment_index
- label (the prediction target, not a model input)

These features describe the customer's baseline and answer the question "how active is this customer?".
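
The `segment_index` values depend on how often each segment appears in the data the indexer was fitted on. The sketch below mimics the default frequency-based ordering in plain Python; the helper `frequency_index` and the `"Family"` segment are hypothetical, and the alphabetical tie-break is an assumption based on the documented behavior of recent Spark versions.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct value to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties (assumed)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up segment column
segments = ["Student", "Corporate", "Student", "Family", "Student", "Corporate"]
mapping = frequency_index(segments)
```

Because the mapping follows the data distribution, refitting on a different snapshot can assign different indices to the same segment. Persisting the fitted indexer alongside the model is one way to keep the encoding reproducible.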

## Advanced Temporal & Crisis Features

## 1. crisis_flag_per_customer

### Business Problem

During an operational crisis, not all customers are impacted equally.
Some customers experience **severe service degradation**, while others are
largely unaffected.

For churn prediction and retention prioritization, we need a way to identify:
- Which customers were exposed to crisis-level delays
- Whether the customer experienced **any extreme failure**, not just average delays

Event-level delay metrics are too granular for customer-level modeling.

---

### Approach

We derive a **binary crisis exposure signal** at the customer level by:

1. Computing delivery delay as the time from order creation to delivery, minus the estimated store-to-consumer driving duration
2. Flagging orders with **extreme delays** (greater than 45 minutes)
3. Aggregating per customer using a max operation

The resulting feature, `crisis_exposure_index`, indicates whether a customer
experienced **at least one crisis-level incident**.

This creates a simple binary indicator that is less sensitive to noise in individual delay values than the raw event-level metric.

In [0]:
from pyspark.sql.functions import col, when, max as max_col

# Load cleaned orders data
df_orders = spark.table("food_delivery.silver_orders_clean")

# Calculate crisis exposure index per customer
crisis_flag_per_customer = (
    df_orders
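    # Delivery delay in seconds: time from order creation to delivery, minus the
    # estimated driving duration (assumed to be stored in seconds)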
    .withColumn(
        "delivery_delay_seconds",
        col("actual_delivery_time_simulated").cast("long")
        - col("created_at_simulated").cast("long")
        - col("estimated_store_to_consumer_driving_duration")
    )
    # Flag orders with delivery delay > 45 minutes (2700 seconds) as crisis
    .withColumn(
        "crisis_flag",
        when(col("delivery_delay_seconds") > 2700, 1).otherwise(0)
    )
    # Aggregate to get the maximum crisis flag per customer
    .groupBy("customer_id")
    .agg(
        max_col("crisis_flag").alias("crisis_exposure_index")
    )
)
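
The flag-then-max logic can be traced by hand on a couple of toy customers. This is a plain-Python sketch of the same rule, with timestamps given as epoch seconds; the function `crisis_exposure` and the sample orders are hypothetical, and only the 2700-second threshold is taken from the cell above.

```python
CRISIS_THRESHOLD_SECONDS = 45 * 60  # 2700 seconds, as in the notebook

def crisis_exposure(orders):
    """orders: list of (created_ts, delivered_ts, est_driving_seconds) for ONE customer."""
    flags = []
    for created, delivered, est_driving in orders:
        delay = delivered - created - est_driving
        flags.append(1 if delay > CRISIS_THRESHOLD_SECONDS else 0)
    # A single crisis-level order is enough to mark the customer as exposed
    return max(flags, default=0)

# Customer A: second order took 4200 s with 600 s of driving, a 3600 s delay
customer_a = [(0, 1800, 600), (10_000, 10_000 + 4200, 600)]
# Customer B: every order stays under the threshold (1800 s and 2100 s delays)
customer_b = [(0, 2400, 600), (5_000, 5_000 + 3000, 900)]

exposure_a = crisis_exposure(customer_a)
exposure_b = crisis_exposure(customer_b)
```

Customer A is flagged because of one bad order, even though the other order was fine; customer B is not. That is the intended behavior of a max aggregation, and it also means one mis-recorded timestamp can flip the flag.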

## 2. sentiment_velocity

### Business Problem

Customer churn is often triggered not by static dissatisfaction,
but by **sudden deterioration in experience**.

A customer whose sentiment is rapidly worsening is at higher churn risk
than one who has been consistently neutral or mildly negative.

Simple average sentiment scores fail to capture this dynamic behavior.

---

### Approach

We compute **sentiment velocity** by:
- Ordering each customer’s reviews chronologically
- Calculating the change in sentiment score between consecutive orders
- Retaining the **most negative sentiment change** per customer

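The resulting feature captures the **sharpest decline** in customer sentiment,
which we expect to be a strong churn signal during crisis periods.
Customers with fewer than two reviews have no delta; their value is filled with 0
when the features are combined.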


In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, min as min_col

# Load review scores and timestamps for each customer
df_reviews = spark.table("food_delivery.silver_orders_enriched") \
    .select("customer_id", "created_at_simulated", "review_score")

# Define window to order reviews by time for each customer
window_spec = Window.partitionBy("customer_id").orderBy("created_at_simulated")

sentiment_velocity = (
    df_reviews
    # Get previous review score for each order
    .withColumn(
        "prev_review_score",
        lag("review_score").over(window_spec)
    )
    # Calculate change in review score compared to previous order
    .withColumn(
        "sentiment_delta",
        col("review_score") - col("prev_review_score")
    )
    # Aggregate: For each customer, get the minimum sentiment delta (largest drop)
    .groupBy("customer_id")
    .agg(
        min_col("sentiment_delta").alias("sentiment_velocity")
    )
)
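
The lag-and-minimum computation is easy to verify on short score sequences. The sketch below reproduces it in plain Python for one customer at a time; the function name mirrors the feature name, and the score sequences are made up for illustration.

```python
def sentiment_velocity(scores):
    """scores: review scores for ONE customer, already sorted by order time."""
    # Change between each review and the one before it (what lag() provides in Spark)
    deltas = [curr - prev for prev, curr in zip(scores, scores[1:])]
    # With fewer than two reviews there is no delta; the notebook later fills this with 0
    return min(deltas) if deltas else None

steady = sentiment_velocity([3, 3, 3, 3])
sudden_drop = sentiment_velocity([5, 5, 1, 4])
improving = sentiment_velocity([2, 3, 5])
single = sentiment_velocity([4])
```

The customer with a 5 to 1 drop gets -4 even though the next review recovered to 4. A customer whose scores only improve gets a positive value, so the feature is not restricted to declines.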


## 3. RFM Score

### Business Problem

Not all customers carry the same business value.
During crisis recovery, intervention resources must be prioritized toward
**high-value customers**.

Raw order counts or revenue totals alone do not provide a balanced view
of customer importance.

---

### Approach

We apply **RFM analysis** by scoring customers across three dimensions:

- **Recency (R):** How recently the customer ordered
- **Frequency (F):** How often the customer orders
- **Monetary (M):** How much the customer has spent

Each dimension is bucketed into quintiles (in this implementation, 1 is the best
quintile and 5 the worst) and combined into a single composite `rfm_score`
(`R * 100 + F * 10 + M`), producing a compact and interpretable proxy for customer value.


In [0]:
from pyspark.sql.functions import sum as sum_col

# Aggregate monetary value (total spend) per customer from enriched orders
monetary_value = (
    spark.table("food_delivery.silver_orders_enriched")
    .groupBy("customer_id")
    .agg(
        sum_col("subtotal").alias("monetary_value")
    )
)

from pyspark.sql.functions import ntile
from pyspark.sql.window import Window

# Join customer features with monetary value
rfm_base = (
    training_data_indexed   
    .join(monetary_value, "customer_id", "left")
)

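# Define windows for RFM scoring. ntile(5) assigns 1 to the first rows in each
# ordering, so with the orderings below a score of 1 is the BEST quintile and 5 the worst.
# An empty partitionBy() ranks all customers in a single partition, which is
# acceptable for a customer-level table but does not scale to very large inputs.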
r_window = Window.partitionBy().orderBy(col("recency_days").asc())      # Recency: lower is better
f_window = Window.partitionBy().orderBy(col("total_orders").desc())     # Frequency: higher is better
m_window = Window.partitionBy().orderBy(col("monetary_value").desc())   # Monetary: higher is better

# Calculate RFM scores using quintiles and combine into a single score
rfm_scored = (
    rfm_base
    .withColumn("R", ntile(5).over(r_window))                          # Recency score (1-5)
    .withColumn("F", ntile(5).over(f_window))                          # Frequency score (1-5)
    .withColumn("M", ntile(5).over(m_window))                          # Monetary score (1-5)
    .withColumn(
        "rfm_score",
        col("R") * 100 + col("F") * 10 + col("M")                     # Composite RFM score
    )
)
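
To see how the window orderings translate into scores, the following plain-Python sketch applies the same quintile logic to five made-up customers. The helpers `ntile` and `quintile_scores` are hypothetical stand-ins that follow standard SQL NTILE semantics (earlier buckets receive the extra rows when the count is not divisible by the bucket count).

```python
def ntile(n_buckets, n_rows):
    """1-based bucket number for each row position, following SQL NTILE semantics."""
    base, extra = divmod(n_rows, n_buckets)
    buckets = []
    for bucket in range(1, n_buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets

def quintile_scores(values, descending):
    """Map each customer to a 1-5 score; score 1 goes to the first rows in the ordering."""
    ordered = sorted(values, key=values.get, reverse=descending)
    return dict(zip(ordered, ntile(5, len(ordered))))

# Made-up customers
recency = {"a": 2, "b": 40, "c": 10, "d": 90, "e": 5}               # days since last order
orders = {"a": 30, "b": 4, "c": 12, "d": 1, "e": 20}                # total orders
spend = {"a": 900.0, "b": 80.0, "c": 300.0, "d": 15.0, "e": 450.0}  # total spend

R = quintile_scores(recency, descending=False)  # same ordering as r_window
F = quintile_scores(orders, descending=True)    # same ordering as f_window
M = quintile_scores(spend, descending=True)     # same ordering as m_window
rfm = {c: R[c] * 100 + F[c] * 10 + M[c] for c in recency}
```

Customer `a` (most recent, most frequent, highest spend) ends up with 111 and customer `d` with 555, which confirms that lower composite scores mark more valuable customers under these orderings. Many RFM write-ups use the opposite convention (5 is best), so the direction should be stated wherever the score is consumed.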

In [0]:
# Combine all engineered features into a single DataFrame for ML modeling
ml_features = (
    training_data_indexed         
    .join(crisis_flag_per_customer, "customer_id", "left")         # Add crisis exposure index
    .join(sentiment_velocity, "customer_id", "left")               # Add sentiment velocity feature
    .join(rfm_scored.select("customer_id", "rfm_score"), "customer_id", "left")  # Add RFM score
    .fillna({
        "crisis_exposure_index": 0,    # Fill missing crisis exposure with 0
        "sentiment_velocity": 0,       # Fill missing sentiment velocity with 0
        "rfm_score": 0                 # Fill missing RFM score with 0
    })
)

# Save the full feature set as a Delta table. The table was already created in
# step 6 with the basic features only, so overwrite it (including the schema)
# instead of skipping the write when the table exists.
ml_features.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("food_delivery.ml_churn_features")

# Display the final ML features DataFrame
display(ml_features)
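
The left-join-then-fill pattern used above can be illustrated without Spark. This is a minimal sketch with dict lookups standing in for the joined DataFrames; the helper `left_join_with_defaults` and the sample rows are hypothetical.

```python
def left_join_with_defaults(base_rows, lookup, feature, default, key="customer_id"):
    """Attach lookup[key] as `feature` to every base row, using `default` when missing."""
    return [{**row, feature: lookup.get(row[key], default)} for row in base_rows]

# Made-up base features and a partial lookup
base = [
    {"customer_id": "c1", "total_orders": 12},
    {"customer_id": "c2", "total_orders": 1},
]
velocity = {"c1": -3}  # c2 has a single review, so no velocity was computed

joined = left_join_with_defaults(base, velocity, "sentiment_velocity", 0)
```

Every base customer is kept, and `c2` receives the default. One consequence of filling with 0 is that "no data" becomes indistinguishable from "no change"; adding a separate missing-value indicator is one option if that distinction matters to the model.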

**Advanced Temporal & Crisis Features:**
- crisis_exposure_index (from `crisis_flag_per_customer`)
- sentiment_velocity
- rfm_score (from `rfm_scored`)

These features capture change, shock, and behavioral acceleration, showing "how customer sentiment and experience deteriorate over time".

In [0]:
# Display summary statistics (count, mean, stddev, min, max, etc.) for key engineered features
display(
    ml_features.select(
        "crisis_exposure_index",
        "sentiment_velocity",
        "rfm_score"
    ).summary()
)

Because of extreme class imbalance (a 0.12% churn rate, derived from the mean value of 0.9976), supervised churn prediction was not feasible without resampling or redesigning the label. The analysis instead highlights strong customer retention and the importance of how churn is defined.
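
Before choosing between resampling and label redesign, it helps to quantify the imbalance and see what a reweighting scheme would imply. The sketch below computes class rates and inverse-frequency ("balanced") class weights in plain Python. The function `class_balance` is hypothetical, and the label vector is toy data with 1% positives, not the notebook's actual distribution.

```python
from collections import Counter

def class_balance(labels):
    """Return (class rates, balanced class weights) for a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    n_classes = len(counts)
    rates = {cls: counts[cls] / n for cls in counts}
    # Balanced weighting: n_samples / (n_classes * class_count)
    weights = {cls: n / (n_classes * counts[cls]) for cls in counts}
    return rates, weights

# Toy labels: 990 active customers, 10 churned
labels = [0] * 990 + [1] * 10
rates, weights = class_balance(labels)
```

In this toy case each churned customer would carry a weight of 50 against roughly 0.5 for an active one. Weights this extreme tend to make training unstable, which is one reason revisiting the churn definition may be the more practical route.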

## Downstream Dependencies

The ML Feature Engineering layer feeds:

- **Churn Prediction Models**
  - Baseline churn model (behavioral features only)
  - Crisis-aware churn model (temporal + sentiment + RFM features)

- **MLflow Experiments**
  - Feature set comparisons (v1 vs v2)
  - Model versioning and governance
  - Recall-driven evaluation for crisis scenarios

- **Gold Customer Risk Tables**
  - `gold_customer_risk_profile`
  - Retention and CRM targeting workflows

- **Business Decision Systems**
  - Discount / incentive allocation
  - Crisis recovery prioritization
  - Executive churn risk reporting

Any error or leakage in this layer directly impacts
**model accuracy, explainability, and business decisions** —
which is why feature engineering must be **precise, defensible, and reproducible**.

---

## Summary

This notebook establishes a **production-grade feature foundation** for
crisis-aware churn prediction by:

- Translating raw operational and sentiment data into **model-ready features**
- Capturing **temporal dynamics** (recency, velocity, crisis exposure)
- Quantifying **customer value** using RFM scoring
- Designing features that align with **real business behavior**
- Preventing data leakage through customer-level aggregation

It forms the **bridge between analytics and AI**, ensuring that
machine learning models are trained on signals that are:
- Interpretable
- Actionable
- Aligned with crisis recovery objectives

This layer enables models that do not just predict churn,
but **explain why customers are at risk**.
