# Step 2: FEATURE ENGINEERING (LLM-AIDED)

## 🎯 Objective

Transform the **frozen dataset** into a **model-ready feature table**, using:
* Encoding
* Derived variables
* **LLM/SLM-assisted text feature extraction**
* Target variable definition

## 1️⃣ Input (DO NOT CHANGE)

Person B **must not regenerate or modify raw data**.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/synthetic_customers_raw.csv")
df.head()

|  | customer_id | age | income | total_orders | avg_order_value | days_since_last_purchase | review_text |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 58 | 44592 | 24 | 55.75 | 13 | Average experience |
| 1 | 2 | 26 | 126530 | 49 | 180.43 | 53 | Highly recommended |
| 2 | 3 | 20 | 33905 | 7 | 50.25 | 48 | Decent service |
| 3 | 4 | 19 | 103563 | 49 | 296.83 | 102 | Nothing special |
| 4 | 5 | 55 | 66463 | 15 | 233.59 | 4 | Product is acceptable |
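
One way to back up the "do not change" rule is to fingerprint the raw file before and after the feature work and compare the two digests. The sketch below is a minimal illustration, not part of the original notebook: `file_sha256` and the throwaway demo file are hypothetical, and in practice the path would point at the raw CSV loaded above.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes (hypothetical helper)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Demo on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "raw_demo.csv"
    demo_path.write_text("customer_id,age\n1,58\n2,26\n")

    digest_before = file_sha256(demo_path)
    # ... feature engineering would run here, on an in-memory DataFrame only ...
    digest_after = file_sha256(demo_path)

    # True when the file on disk was not rewritten in between.
    raw_unchanged = digest_before == digest_after
```

Recording the digest alongside the frozen dataset would let anyone downstream confirm they are working from the same bytes.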


## 2️⃣ Decide Feature Categories (Design First)

### Feature Types

| Type | Examples |
| --- | --- |
| **Numerical (Raw)** | `age`, `income`, `total_orders`, `avg_order_value`, `days_since_last_purchase` |
| **Derived Numerical** | `income_per_order` |
| **LLM-Based Text Features** | `sentiment_score`, `risk_score` (from `review_text`) |
| **Target Variable** | `churn` (Derived from inactivity rules) |
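
Because the design is fixed before any code runs, it can help to check that the columns named in the table actually exist in the loaded frame. The following is a small sketch under that assumption; `missing_columns` and the one-row `demo` frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Column groups copied from the feature-type table above.
RAW_NUMERIC = ["age", "income", "total_orders", "avg_order_value", "days_since_last_purchase"]
TEXT_COLUMNS = ["review_text"]


def missing_columns(frame, required):
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]


# One-row stand-in for the raw table (values from the first row shown earlier).
demo = pd.DataFrame([{
    "customer_id": 1, "age": 58, "income": 44592, "total_orders": 24,
    "avg_order_value": 55.75, "days_since_last_purchase": 13,
    "review_text": "Average experience",
}])

required = RAW_NUMERIC + TEXT_COLUMNS
missing_full = missing_columns(demo, required)
missing_after_drop = missing_columns(demo.drop(columns=["review_text"]), required)
```

An empty list means every planned input is available; any name returned points at a column to investigate before continuing.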

## 3️⃣ LLM-AIDED TEXT FEATURE EXTRACTION

We use LLM-designed keyword rules to turn the unstructured `review_text` column into structured features.

### Step 3A: LLM Prompt (Documentation)

```text
You are a customer sentiment analyst.
Given short e-commerce reviews, classify them into:
1. Sentiment: Positive / Neutral / Negative
2. Churn Risk: Low / Medium / High

Provide keyword-based rules for Python implementation.
```

### LLM Output (Summarised Rule Set)
- **Positive / Low Risk**: "satisfied", "excellent", "recommended", "great", "buy again"
- **Negative / High Risk**: "disappointed", "poor", "slow", "bad", "not worth"
- **Neutral / Medium Risk**: "okay", "average", "acceptable" (also the default when no keyword matches)

In [2]:
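## Step 3B: Implement LLM Rules in Python

def extract_sentiment_and_risk(text):
    """Map a review string to a (sentiment, churn_risk) pair using keyword rules."""
    # Missing or non-string reviews fall through to the neutral default.
    if not isinstance(text, str):
        return "neutral", "medium"
    text = text.lower()

    # Keywords derived from LLM suggestions
    negative_keywords = ["disappointed", "poor", "slow", "bad", "not worth"]
    positive_keywords = ["excellent", "satisfied", "recommended", "great", "buy again"]

    # Negative keywords are checked first, so they win when both kinds appear.
    # Note: `in` is a substring test, so "bad" also matches inside "badge".
    if any(word in text for word in negative_keywords):
        return "negative", "high"
    elif any(word in text for word in positive_keywords):
        return "positive", "low"
    else:
        return "neutral", "medium"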

df[["sentiment", "churn_risk"]] = df["review_text"].apply(
    lambda x: pd.Series(extract_sentiment_and_risk(x))
)

df[["review_text", "sentiment", "churn_risk"]].head()

|  | review_text | sentiment | churn_risk |
| --- | --- | --- | --- |
| 0 | Average experience | neutral | medium |
| 1 | Highly recommended | positive | low |
| 2 | Decent service | neutral | medium |
| 3 | Nothing special | neutral | medium |
| 4 | Product is acceptable | neutral | medium |
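
The rule function tests keywords with `word in text`, which is a substring match. That means a review such as "Very dissatisfied" would contain "satisfied" and be scored positive. The sketch below shows one possible variant that matches whole words only, using regular-expression word boundaries. It is an illustrative alternative, not the notebook's method; `classify_review` and `_compile` are hypothetical names, and the keyword lists are copied from the rule set above.

```python
import re

NEGATIVE = ["disappointed", "poor", "slow", "bad", "not worth"]
POSITIVE = ["excellent", "satisfied", "recommended", "great", "buy again"]


def _compile(keywords):
    # \b anchors make each keyword match as a whole word or phrase,
    # so "bad" no longer matches inside "badge".
    return re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b")


NEG_RE = _compile(NEGATIVE)
POS_RE = _compile(POSITIVE)


def classify_review(text):
    """Return (sentiment, churn_risk) using whole-word keyword matching."""
    if not isinstance(text, str):
        # Missing reviews get the neutral default.
        return "neutral", "medium"
    lowered = text.lower()
    if NEG_RE.search(lowered):
        return "negative", "high"
    if POS_RE.search(lowered):
        return "positive", "low"
    return "neutral", "medium"


examples = {
    review: classify_review(review)
    for review in ["Highly recommended", "Very dissatisfied", "Not worth the money"]
}
```

Whole-word matching still cannot handle negation ("not recommended" would remain positive), so this narrows the false positives without removing them.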


## 4️⃣ Encode Features & Define Target

### Target Definition: Churn
Since the raw data doesn't have a label, we derive it using business logic:
* **Churned (1)**: Inactive for > 180 days
* **Active (0)**: Inactive for <= 180 days
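
A short worked example makes the boundary explicit and highlights a side effect worth keeping in mind: the label is a deterministic function of `days_since_last_purchase`, so any feature set that includes that column can reproduce the label exactly. The sketch below uses a made-up four-row frame; `CHURN_THRESHOLD_DAYS` is an illustrative constant name, not something defined in the notebook.

```python
import pandas as pd

CHURN_THRESHOLD_DAYS = 180  # threshold taken from the rule above

# Made-up values chosen to sit on both sides of the boundary.
demo = pd.DataFrame({"days_since_last_purchase": [4, 180, 181, 365]})
demo["churn"] = (demo["days_since_last_purchase"] > CHURN_THRESHOLD_DAYS).astype(int)

# Exactly 180 days still counts as active; 181 is the first churned value.
labels = demo["churn"].tolist()

# A "model" that only reads the inactivity column reproduces the label,
# so downstream accuracy may reflect the rule rather than learned behaviour.
rule_prediction = (demo["days_since_last_purchase"] > CHURN_THRESHOLD_DAYS).astype(int)
rule_accuracy = float((rule_prediction == demo["churn"]).mean())
```

If the goal is to learn churn from the other variables, one option is to evaluate models with and without the inactivity column and compare the results.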

In [3]:
# 1. Define Target
df["churn"] = (df["days_since_last_purchase"] > 180).astype(int)

# 2. Ordinal Encoding for LLM Features
sentiment_map = {"negative": 0, "neutral": 1, "positive": 2}
risk_map = {"low": 0, "medium": 1, "high": 2}

df["sentiment_score"] = df["sentiment"].map(sentiment_map)
df["risk_score"] = df["churn_risk"].map(risk_map)

# 3. Derived Features
df["income_per_order"] = df["income"] / df["total_orders"]

df.head()

|  | customer_id | age | income | total_orders | avg_order_value | days_since_last_purchase | review_text | sentiment | churn_risk | churn | sentiment_score | risk_score | income_per_order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 58 | 44592 | 24 | 55.75 | 13 | Average experience | neutral | medium | 0 | 1 | 1 | 1858.0 |
| 1 | 2 | 26 | 126530 | 49 | 180.43 | 53 | Highly recommended | positive | low | 0 | 2 | 0 | 2582.244898 |
| 2 | 3 | 20 | 33905 | 7 | 50.25 | 48 | Decent service | neutral | medium | 0 | 1 | 1 | 4843.571429 |
| 3 | 4 | 19 | 103563 | 49 | 296.83 | 102 | Nothing special | neutral | medium | 0 | 1 | 1 | 2113.530612 |
| 4 | 5 | 55 | 66463 | 15 | 233.59 | 4 | Product is acceptable | neutral | medium | 0 | 1 | 1 | 4430.866667 |
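
The `income_per_order` ratio divides by `total_orders`, which would produce `inf` for any customer with zero orders. The rows shown here all have at least seven orders, so whether zeros occur in the full dataset is an assumption to verify. A minimal guard, sketched on a made-up two-row frame, could look like this:

```python
import numpy as np
import pandas as pd

# Second row is a hypothetical customer with no orders.
demo = pd.DataFrame({"income": [44592, 50000], "total_orders": [24, 0]})

# Replace zero order counts with NaN so the division yields NaN rather than inf.
safe_orders = demo["total_orders"].replace(0, np.nan)
demo["income_per_order"] = demo["income"] / safe_orders

has_infinite = bool(np.isinf(demo["income_per_order"]).any())
```

The resulting NaN values still need a decision (drop, impute, or flag) before modelling, but they are easier to detect than infinities.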


## 5️⃣ Prepare Final Feature Set (FOR MODELS)

In [4]:
features = df[[
    "age", "income", "total_orders", "avg_order_value", 
    "days_since_last_purchase", "income_per_order", 
    "sentiment_score", "risk_score"
]]

target = df["churn"]

features.to_csv("../data/features.csv", index=False)
target.to_csv("../data/target.csv", index=False)

print("Features and Target saved successfully.")

Features and Target saved successfully.
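
Since features and target are written to separate files, a quick round-trip check can confirm that they stay row-aligned and free of missing values when read back. The sketch below writes to a temporary directory with made-up frames so it runs on its own; the file names mirror the ones used above, and everything else is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-ins for the real feature table and target series.
features_demo = pd.DataFrame({"age": [58, 26], "risk_score": [1, 0]})
target_demo = pd.Series([0, 1], name="churn")

with tempfile.TemporaryDirectory() as tmp:
    features_path = Path(tmp) / "features.csv"
    target_path = Path(tmp) / "target.csv"

    features_demo.to_csv(features_path, index=False)
    target_demo.to_csv(target_path, index=False)

    # Read both files back the way a modelling notebook would.
    features_back = pd.read_csv(features_path)
    target_back = pd.read_csv(target_path)["churn"]

rows_aligned = len(features_back) == len(target_back)
no_missing = not features_back.isna().any().any()
```

Because both files are saved without an index, row order is the only thing linking a feature row to its label, which is why the length check matters.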


## 6️⃣ Document AI Usage (Assessment Requirement)

> **AI-Assisted Feature Engineering**
>
> A Large Language Model was used to design rule-based mappings for extracting **sentiment** and **churn risk** from unstructured review text.
> These AI-informed rules were implemented in Python to derive ordinal numerical features (`sentiment_score`, `risk_score`) used in predictive modelling.
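
When documenting these two features, it may be worth noting that they come from the same keyword rule, so each one determines the other. The sketch below enumerates the three label pairs the rule function can return and applies the same ordinal maps used earlier; the `pairs` frame is illustrative.

```python
import pandas as pd

sentiment_map = {"negative": 0, "neutral": 1, "positive": 2}
risk_map = {"low": 0, "medium": 1, "high": 2}

# The three (sentiment, churn_risk) pairs the keyword rules can emit.
pairs = pd.DataFrame(
    [("negative", "high"), ("neutral", "medium"), ("positive", "low")],
    columns=["sentiment", "churn_risk"],
)
pairs["sentiment_score"] = pairs["sentiment"].map(sentiment_map)
pairs["risk_score"] = pairs["churn_risk"].map(risk_map)

# For every possible output, risk_score equals 2 - sentiment_score.
is_redundant = bool((pairs["risk_score"] == 2 - pairs["sentiment_score"]).all())
```

Under these rules the two columns carry the same information, so a model gains nothing from having both; keeping one, or deriving risk from separate signals, are possible options.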