# Step 2: FEATURE ENGINEERING (LLM-AIDED)

## 🎯 Objective

Transform the **frozen dataset** into a **model-ready feature table**, using:
* Encoding
* Derived variables
* **LLM/SLM-assisted text feature extraction**
* Target variable definition (Loaded from Input)

## 1️⃣ Input (DO NOT CHANGE)

Person B **must not regenerate or modify raw data**.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/synthetic_customers_raw.csv")
df.head()

|  | customer_id | age | income | total_orders | avg_order_value | days_since_last_purchase | review_text | churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 58 | 74592 | 22 | 133.88 | 13 | It was okay | 0 |
| 1 | 2 | 61 | 131482 | 42 | 161.12 | 45 | Fast delivery and great quality | 0 |
| 2 | 3 | 50 | 138907 | 50 | 219.59 | 14 | Will definitely buy again | 0 |
| 3 | 4 | 44 | 64446 | 22 | 258.07 | 230 | Poor customer service | 1 |
| 4 | 5 | 62 | 115392 | 32 | 204.42 | 175 | Delivery was slow | 0 |
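One way to back up the "do not change" rule is to fingerprint the raw file and confirm its header before any features are built. The sketch below uses only the standard library; `file_sha256` and `missing_columns` are hypothetical helper names, and the expected column list is an assumption taken from the preview above.

```python
import csv
import hashlib
from pathlib import Path

# Assumption: column names as they appear in the preview of the raw file
EXPECTED_COLUMNS = [
    "customer_id", "age", "income", "total_orders", "avg_order_value",
    "days_since_last_purchase", "review_text", "churn",
]


def file_sha256(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def missing_columns(path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle))
    return [name for name in expected if name not in header]
```

Recording `file_sha256("../data/synthetic_customers_raw.csv")` before this step and comparing it afterwards would show whether the file changed on disk; an empty list from `missing_columns` suggests the schema is what this notebook expects.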


## 2️⃣ Decide Feature Categories (Design First)

### Feature Types

| Type | Examples |
| --- | --- |
| **Numerical (Raw)** | `age`, `income`, `total_orders`, `avg_order_value`, `days_since_last_purchase` |
| **Derived Numerical** | `estimated_spend` (`total_orders` × `avg_order_value`), `spend_ratio` (`estimated_spend` / `income`) |
| **LLM-Based Text Features** | `sentiment_score`, `risk_score` (from `review_text`) |
| **Target Variable** | `churn` (Pre-defined in Simulation) |

## 3️⃣ LLM-AIDED TEXT FEATURE EXTRACTION

The LLM is used at design time only: it proposes keyword rules, which are then implemented in Python to turn unstructured `review_text` into structured features.

### Step 3A: LLM Prompt (Documentation)

```text
You are a customer sentiment analyst.
Given short e-commerce reviews, classify them into:
1. Sentiment: Positive / Neutral / Negative
2. Churn Risk: Low / Medium / High

Provide keyword-based rules for Python implementation.
```

### LLM Output (Summarised Rule Set)
- **Positive / Low Risk**: "satisfied", "excellent", "recommended", "great", "buy again"
- **Negative / High Risk**: "disappointed", "poor", "slow", "bad", "not worth"
- **Neutral / Medium Risk**: "okay", "average", "acceptable" (also the default when no keyword matches)

In [2]:
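## Step 3B: Implement LLM Rules in Python

def extract_sentiment_and_risk(text):
    """Map a review to a (sentiment, churn_risk) pair using the LLM-suggested keywords."""
    # Missing or non-string reviews fall back to the neutral default
    if not isinstance(text, str):
        return "neutral", "medium"
    text = text.lower()

    # Keywords derived from LLM suggestions
    negative_keywords = ["disappointed", "poor", "slow", "bad", "not worth"]
    positive_keywords = ["excellent", "satisfied", "recommended", "great", "buy again"]

    # Negative keywords are checked first, so they win when a review contains both kinds
    if any(word in text for word in negative_keywords):
        return "negative", "high"
    elif any(word in text for word in positive_keywords):
        return "positive", "low"
    else:
        # No keyword matched (this also covers "okay", "average", "acceptable")
        return "neutral", "medium"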

df[["sentiment", "churn_risk"]] = df["review_text"].apply(
    lambda x: pd.Series(extract_sentiment_and_risk(x))
)

df[["review_text", "sentiment", "churn_risk"]].head()

|  | review_text | sentiment | churn_risk |
| --- | --- | --- | --- |
| 0 | It was okay | neutral | medium |
| 1 | Fast delivery and great quality | positive | low |
| 2 | Will definitely buy again | positive | low |
| 3 | Poor customer service | negative | high |
| 4 | Delivery was slow | negative | high |
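The keyword rules in this step rely on plain substring checks, so "satisfied" also matches inside "unsatisfied" and "bad" inside "badge". A possible variant, sketched below, matches whole words only by using regular-expression word boundaries. `classify_review` is a hypothetical name, the keyword lists are copied from the rule set, and treating missing reviews as neutral is an assumption rather than something the notebook specifies.

```python
import re

# Keyword lists copied from the LLM-suggested rule set
NEGATIVE_KEYWORDS = ["disappointed", "poor", "slow", "bad", "not worth"]
POSITIVE_KEYWORDS = ["excellent", "satisfied", "recommended", "great", "buy again"]


def _compile(keywords):
    # \b anchors each keyword at word boundaries; re.escape guards special characters
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.compile(pattern, flags=re.IGNORECASE)


NEGATIVE_RE = _compile(NEGATIVE_KEYWORDS)
POSITIVE_RE = _compile(POSITIVE_KEYWORDS)


def classify_review(text):
    """Return (sentiment, churn_risk) using whole-word keyword matches."""
    if not isinstance(text, str):
        # Assumption: missing reviews (e.g. NaN) are treated as neutral
        return "neutral", "medium"
    if NEGATIVE_RE.search(text):  # negative keywords keep precedence
        return "negative", "high"
    if POSITIVE_RE.search(text):
        return "positive", "low"
    return "neutral", "medium"


print(classify_review("Poor customer service"))        # ('negative', 'high')
print(classify_review("Totally unsatisfied with it"))  # ('neutral', 'medium')
```

On the five preview reviews this variant gives the same labels as the substring version; the two differ only on embedded matches such as "unsatisfied", which the substring rules would label positive.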


## 4️⃣ Encode Features & Define Target

### Target Definition: Churn
The target `churn` is already defined in the raw data (generated by probabilistic logic in the simulation step), so it is loaded as-is and not recomputed here.

In [3]:
# 1. Select Target (Already loaded)
# df["churn"] exists

# 2. Ordinal Encoding for LLM Features
sentiment_map = {"negative": 0, "neutral": 1, "positive": 2}
risk_map = {"low": 0, "medium": 1, "high": 2}

df["sentiment_score"] = df["sentiment"].map(sentiment_map)
df["risk_score"] = df["churn_risk"].map(risk_map)

# 3. Derived Features (Economic Logic)
# Replaced weak 'income_per_order' with meaningful ratios
df["estimated_spend"] = df["total_orders"] * df["avg_order_value"]
df["spend_ratio"] = df["estimated_spend"] / df["income"]

df.head()

|  | customer_id | age | income | total_orders | avg_order_value | days_since_last_purchase | review_text | churn | sentiment | churn_risk | sentiment_score | risk_score | estimated_spend | spend_ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 58 | 74592 | 22 | 133.88 | 13 | It was okay | 0 | neutral | medium | 1 | 1 | 2945.36 | 0.039486 |
| 1 | 2 | 61 | 131482 | 42 | 161.12 | 45 | Fast delivery and great quality | 0 | positive | low | 2 | 0 | 6767.04 | 0.051467 |
| 2 | 3 | 50 | 138907 | 50 | 219.59 | 14 | Will definitely buy again | 0 | positive | low | 2 | 0 | 10979.5 | 0.079042 |
| 3 | 4 | 44 | 64446 | 22 | 258.07 | 230 | Poor customer service | 1 | negative | high | 0 | 2 | 5677.54 | 0.088098 |
| 4 | 5 | 62 | 115392 | 32 | 204.42 | 175 | Delivery was slow | 0 | negative | high | 0 | 2 | 6541.44 | 0.056689 |
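To make the derived features concrete, the sketch below recomputes `estimated_spend` and `spend_ratio` for the first preview row (`customer_id` 1) in plain Python. The function names are hypothetical, and returning NaN for a zero, negative, or missing income is an assumption; the notebook itself divides without such a guard.

```python
import math


def estimated_spend(total_orders, avg_order_value):
    """Total spend approximated as order count times average order value."""
    return total_orders * avg_order_value


def spend_ratio(total_orders, avg_order_value, income):
    """Share of income represented by the estimated spend."""
    # Assumption: a zero, negative, or missing income yields NaN instead of inf or an error
    if not income or income <= 0:
        return math.nan
    return estimated_spend(total_orders, avg_order_value) / income


# First preview row: 22 orders, 133.88 average order value, 74592 income
spend = estimated_spend(22, 133.88)
ratio = spend_ratio(22, 133.88, 74592)
print(round(spend, 2), round(ratio, 6))  # 2945.36 0.039486
```

In the DataFrame version, a comparable guard could be written as `df["estimated_spend"] / df["income"].replace(0, np.nan)`, which leaves NaN where income is zero.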


## 5️⃣ Prepare Final Feature Set (FOR MODELS)

In [4]:
features = df[[
    "age", "income", "total_orders", "avg_order_value", 
    "days_since_last_purchase", "estimated_spend", "spend_ratio",
    "sentiment_score", "risk_score"
]]

target = df["churn"]

features.to_csv("../data/features.csv", index=False)
target.to_csv("../data/target.csv", index=False)

print("Features and Target saved successfully.")

Features and Target saved successfully.
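Because `features.csv` and `target.csv` are written separately, a few checks before saving can catch misaligned or incomplete tables. The following is a minimal sketch, not part of the original pipeline; `check_feature_table` is a hypothetical helper, and the specific checks (row alignment, missing values, numeric dtypes, binary target) are assumptions about what "model-ready" should mean here.

```python
import pandas as pd


def check_feature_table(features, target):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Features and target are saved to separate files, so rows must line up
    if len(features) != len(target):
        problems.append(f"row count mismatch: {len(features)} features vs {len(target)} targets")

    # An unmapped sentiment/risk label or a failed division would surface as NaN
    null_columns = [col for col in features.columns if features[col].isna().any()]
    if null_columns:
        problems.append(f"missing values in: {null_columns}")

    # Every model input is expected to be numeric at this point
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    # churn is expected to be a 0/1 label
    if not target.dropna().isin([0, 1]).all():
        problems.append("target contains values other than 0 and 1")

    return problems
```

Calling `check_feature_table(features, target)` just before the two `to_csv` calls and stopping on a non-empty result would be one way to use it. It may also be worth noting that, since the rule function returns only three fixed label pairs, `risk_score` always equals `2 - sentiment_score` in this dataset, so the two columns carry the same information.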


## 6️⃣ Document AI Usage (Assessment Requirement)

> **AI-Assisted Feature Engineering**
>
> A Large Language Model was used to design rule-based mappings for extracting **sentiment** and **churn risk** from unstructured review texts. 
> These AI-informed rules were implemented in Python to derive ordinal numerical features (`sentiment_score`, `risk_score`) used in predictive modelling.