# Step A: Use LLM to DESIGN the DATA LOGIC (Once)

### Example Prompt

```text
You are an e-commerce data analyst.

Design realistic rules for simulating customer purchase behavior and churn risk.
I need:
- Typical income ranges vs order frequency and value
- How review sentiment reflects recency and frequency
- Business rules for 'High Value' vs 'Risk' customers

Summarise the logic clearly so I can implement it in Python.
```

### Example LLM Output (Summarised)

**1. Income vs. Spending Behavior**
*   **High Income (> $100k)**: High frequency (15-50 orders), High value ($150-$500/order)
*   **Medium Income (> $60k up to $100k)**: Medium frequency (5-40 orders), Medium value ($80-$300/order)
*   **Low Income (≤ $60k)**: Low frequency (1-25 orders), Low value ($20-$150/order)

**2. Review Sentiment Logic**
*   **Positive**: Loyal customers (Total Orders > 30 AND Last Purchase < 60 days ago)
*   **Negative**: Churn risk (Last Purchase > 180 days ago)
*   **Neutral**: Everyone else

📌 **This logic is used to ground the synthetic data generation.**

# Step B: IMPLEMENT That Logic in Python

### 1️⃣ Imports & Setup

```python
import pandas as pd
import numpy as np
import random

np.random.seed(42)
random.seed(42)
```
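To illustrate why the seed is fixed, here is a minimal sketch (not part of the original notebook) that re-seeds the standard-library generator and checks that the same draws come back. The helper name `draw_sample` is hypothetical; the income range matches the one used later in the generation loop.

```python
import random

def draw_sample(seed, n=5):
    """Re-seed the global `random` generator and return n draws from the income range."""
    random.seed(seed)
    return [random.randint(30000, 150000) for _ in range(n)]

first = draw_sample(42)
second = draw_sample(42)

# Same seed, same sequence: this is what makes the dataset reproducible
print(first == second)
```

Note that the seed only guarantees identical data when the cells run in order from a fresh state. Re-running the generation cell on its own continues the random sequence and yields different records.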

### 2️⃣ Helper Functions (LLM-Guided)

```python
def get_purchase_behavior(income):
    """
    Returns (total_orders, avg_order_value) based on income level.
    Logic derived from GenAI suggestions.
    """
    if income > 100000:
        # High income: High frequency, High value
        orders = random.randint(15, 50)
        value = round(random.uniform(150, 500), 2)
    elif income > 60000:
        # Medium income: Medium frequency, Medium value
        orders = random.randint(5, 40)
        value = round(random.uniform(80, 300), 2)
    else:
        # Low income: Low frequency, Low value
        orders = random.randint(1, 25)
        value = round(random.uniform(20, 150), 2)
    return orders, value

def get_review_text(total_orders, days_since_last_purchase):
    """
    Returns a review string based on customer loyalty and recency.
    """
    positive_reviews = [
        "Very satisfied with the service", "Fast delivery and great quality",
        "Excellent shopping experience", "Highly recommended", "Will definitely buy again"
    ]
    neutral_reviews = [
        "It was okay", "Average experience", "Product is acceptable",
        "Nothing special", "Decent service"
    ]
    negative_reviews = [
        "Very disappointed", "Poor customer service", "Delivery was slow",
        "Product quality was bad", "Not worth the money"
    ]

    # Loyal active customers -> Positive
    if total_orders > 30 and days_since_last_purchase < 60:
        return random.choice(positive_reviews)
    # Inactive/Churned customers -> Negative
    elif days_since_last_purchase > 180:
        return random.choice(negative_reviews)
    # Standard customers -> Neutral
    else:
        return random.choice(neutral_reviews)
```
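The branching in `get_review_text` can be easier to check if the sentiment label is separated from the random choice of wording. The sketch below is one possible way to do that, using the same thresholds; `classify_sentiment` is a hypothetical helper, not a function from the notebook.

```python
def classify_sentiment(total_orders, days_since_last_purchase):
    """Return 'positive', 'negative', or 'neutral' using the notebook's thresholds."""
    if total_orders > 30 and days_since_last_purchase < 60:
        return "positive"
    if days_since_last_purchase > 180:
        return "negative"
    return "neutral"

# Boundary cases: the comparisons are strict, so exactly 30 orders,
# exactly 60 days, or exactly 180 days all fall through to neutral.
examples = [(31, 59), (30, 59), (31, 60), (10, 181), (10, 180), (50, 200)]
for orders, days in examples:
    print(orders, days, classify_sentiment(orders, days))
```

One consequence of these thresholds: customers in the low-income tier place at most 25 orders, so under these rules they can never receive a positive review.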

### 3️⃣ Main Data Generation Loop

```python
NUM_RECORDS = 1200
data = []

for i in range(NUM_RECORDS):
    # 1. Generate Info
    age = random.randint(18, 65)
    income = random.randint(30000, 150000)
    days_since_last_purchase = random.randint(1, 365)

    # 2. Apply Logic
    total_orders, avg_order_value = get_purchase_behavior(income)
    review_text = get_review_text(total_orders, days_since_last_purchase)

    # 3. Store Record
    data.append([
        i + 1,
        age,
        income,
        total_orders,
        avg_order_value,
        days_since_last_purchase,
        review_text
    ])
```
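The loop above draws every value with the standard-library `random` module, although NumPy is also imported and seeded. As an alternative, the income tiers could be applied to whole arrays at once. The following is a rough sketch under that assumption, not the notebook's method: it uses its own generator (`np.random.default_rng`), so its values will not match the dataset produced by the loop, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # separate generator; does not reproduce the loop's values
n = 1200

# endpoint=True makes the upper bound inclusive, like random.randint
income = rng.integers(30000, 150000, size=n, endpoint=True)
high = income > 100000
medium = (income > 60000) & ~high

# Draw candidate order counts for every tier, then keep the one matching each row
total_orders = np.where(
    high,
    rng.integers(15, 50, size=n, endpoint=True),
    np.where(
        medium,
        rng.integers(5, 40, size=n, endpoint=True),
        rng.integers(1, 25, size=n, endpoint=True),
    ),
)

# Same idea for the average order value, rounded to cents
avg_order_value = np.round(
    np.where(
        high,
        rng.uniform(150, 500, size=n),
        np.where(medium, rng.uniform(80, 300, size=n), rng.uniform(20, 150, size=n)),
    ),
    2,
)

print(total_orders[:5], avg_order_value[:5])
```

This version draws three candidate arrays per column and discards two of them, trading a little wasted work for the absence of a Python-level loop. For 1,200 records either approach is fast enough.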

### 4️⃣ Create DataFrame & Verify

```python
columns = [
    "customer_id", "age", "income", "total_orders",
    "avg_order_value", "days_since_last_purchase", "review_text"
]

df = pd.DataFrame(data, columns=columns)

# Minimal display to check structure.
# df.info() prints its summary itself and returns None, so it is not wrapped in print().
df.info()
df.head()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               1200 non-null   int64  
 1   age                       1200 non-null   int64  
 2   income                    1200 non-null   int64  
 3   total_orders              1200 non-null   int64  
 4   avg_order_value           1200 non-null   float64
 5   days_since_last_purchase  1200 non-null   int64  
 6   review_text               1200 non-null   object 
dtypes: float64(1), int64(5), object(1)
memory usage: 65.8+ KB
```


```text
   customer_id  age  income  total_orders  avg_order_value  days_since_last_purchase            review_text
0            1   58   44592            24            55.75                        13     Average experience
1            2   26  126530            49           180.43                        53     Highly recommended
2            3   20   33905             7            50.25                        48         Decent service
3            4   19  103563            49           296.83                       102        Nothing special
4            5   55   66463            15           233.59                         4  Product is acceptable
```
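Beyond checking the structure, it can be useful to confirm that the generated rows actually respect the Step A rules. The sketch below does this for the five rows displayed by `df.head()`; `rule_bounds` and `violates_rules` are hypothetical helpers written for this illustration, and the same check could be run on the full `df`.

```python
import pandas as pd

# The five rows displayed by df.head() (only the columns the income rules constrain)
sample = pd.DataFrame({
    "income": [44592, 126530, 33905, 103563, 66463],
    "total_orders": [24, 49, 7, 49, 15],
    "avg_order_value": [55.75, 180.43, 50.25, 296.83, 233.59],
})

def rule_bounds(income):
    """Return (min_orders, max_orders, min_value, max_value) for an income tier."""
    if income > 100000:
        return 15, 50, 150, 500
    if income > 60000:
        return 5, 40, 80, 300
    return 1, 25, 20, 150

def violates_rules(row):
    """True if the row's order count or order value falls outside its tier's range."""
    lo_o, hi_o, lo_v, hi_v = rule_bounds(row["income"])
    orders_ok = lo_o <= row["total_orders"] <= hi_o
    value_ok = lo_v <= row["avg_order_value"] <= hi_v
    return not (orders_ok and value_ok)

violations = sample[sample.apply(violates_rules, axis=1)]
print(len(violations))  # 0: all five displayed rows satisfy the income rules
```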


### 5️⃣ Save Dataset (Freeze)

```python
df.to_csv("../data/synthetic_customers_raw.csv", index=False)
print("Dataset saved to data/synthetic_customers_raw.csv")
```

```text
Dataset saved to data/synthetic_customers_raw.csv
```
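Since the saved CSV is the frozen version of the dataset, a quick round-trip check can confirm that it reloads unchanged. The following is a minimal sketch using a small stand-in frame and a temporary directory rather than the real `../data` folder; `df_small` and `out_path` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame with the same kinds of columns as the real dataset
df_small = pd.DataFrame({
    "customer_id": [1, 2],
    "avg_order_value": [55.75, 180.43],
    "review_text": ["Average experience", "Highly recommended"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "synthetic_customers_raw.csv"
    # to_csv does not create missing folders, so make sure the target exists
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df_small.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if values, column names, or dtypes differ
pd.testing.assert_frame_equal(df_small, reloaded)
print("round trip OK")
```

Writing with `index=False`, as the notebook does, is what keeps the reloaded frame free of an extra `Unnamed: 0` column.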


# Step C: DOCUMENT AI USAGE (THIS IS WHAT MARKERS LOOK FOR)

> **Use of Generative AI for Dataset Simulation**
>
> A Large Language Model (LLM) was used to design realistic data generation rules, including **income vs. spending habits** and **sentiment vs. recency logic**.
>
> Based on the LLM's guidance, rule-based logic was implemented in Python (in the `get_purchase_behavior` and `get_review_text` helper functions) to simulate 1,200 realistic customer records.
> Fixing the random seed (`seed=42`) makes the dataset reproducible, while the rules incorporate AI-informed domain knowledge into its design.