# Dataset Simulation (Person A)

## Business Context
This project simulates an e-commerce customer dataset for analyzing customer churn and purchase behavior.

## GenAI Usage
Generative AI assisted in designing the dataset structure and business rules
(e.g., the relationships between purchase frequency, recency, and customer sentiment).
The final dataset is generated programmatically in Python with fixed random seeds, so it can be reproduced.

## Dataset Schema
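- customer_id: sequential integer identifier, starting at 1
- age: integer, 18 to 65
- income: integer, 30000 to 150000
- total_orders: integer, 1 to 50
- avg_order_value: float, 20 to 500, rounded to 2 decimal places
- days_since_last_purchase: integer, 1 to 365
- review_text: one of 15 fixed review strings (5 positive, 5 neutral, 5 negative), chosen by the business rule on total_orders and days_since_last_purchase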

In [2]:
import pandas as pd
import random
import numpy as np

random.seed(42)
np.random.seed(42)

NUM_RECORDS = 1200

positive_reviews = [
    "Very satisfied with the service",
    "Fast delivery and great quality",
    "Excellent shopping experience",
    "Highly recommended",
    "Will definitely buy again"
]

neutral_reviews = [
    "It was okay",
    "Average experience",
    "Product is acceptable",
    "Nothing special",
    "Decent service"
]

negative_reviews = [
    "Very disappointed",
    "Poor customer service",
    "Delivery was slow",
    "Product quality was bad",
    "Not worth the money"
]

data = []

for i in range(NUM_RECORDS):
    age = random.randint(18, 65)
    income = random.randint(30000, 150000)
    total_orders = random.randint(1, 50)
    avg_order_value = round(random.uniform(20, 500), 2)
    days_since_last_purchase = random.randint(1, 365)

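    # Business rule enforcement (checked in order, first match wins):
    #   frequent and recent buyers (> 30 orders, last purchase < 60 days ago) -> positive
    #   lapsed customers (last purchase > 180 days ago) -> negative
    #   everyone else -> neutral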
    if total_orders > 30 and days_since_last_purchase < 60:
        review_text = random.choice(positive_reviews)
    elif days_since_last_purchase > 180:
        review_text = random.choice(negative_reviews)
    else:
        review_text = random.choice(neutral_reviews)

    data.append([
        i + 1,
        age,
        income,
        total_orders,
        avg_order_value,
        days_since_last_purchase,
        review_text
    ])

columns = [
    "customer_id",
    "age",
    "income",
    "total_orders",
    "avg_order_value",
    "days_since_last_purchase",
    "review_text"
]

df = pd.DataFrame(data, columns=columns)
df.head()

| | customer_id | age | income | total_orders | avg_order_value | days_since_last_purchase | review_text |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 58 | 44592 | 2 | 375.94 | 126 | Average experience |
| 1 | 2 | 26 | 126530 | 7 | 344.82 | 280 | Very disappointed |
| 2 | 3 | 55 | 85302 | 3 | 34.3 | 112 | Average experience |
| 3 | 4 | 50 | 108907 | 2 | 289.4 | 333 | Not worth the money |
| 4 | 5 | 44 | 58893 | 29 | 302.85 | 4 | Average experience |
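
The business rule in the generation loop can also be expressed as a small standalone function, which makes the branch order easier to test. This is a minimal sketch, not part of the original notebook: the function name `expected_sentiment` and the labels `"positive"`, `"neutral"`, and `"negative"` are hypothetical, while the thresholds are copied from the loop.

```python
def expected_sentiment(total_orders, days_since_last_purchase):
    """Return the review category the business rule would select.

    Hypothetical helper: the thresholds mirror the generation loop,
    and the branches are checked in the same order.
    """
    if total_orders > 30 and days_since_last_purchase < 60:
        return "positive"
    elif days_since_last_purchase > 180:
        return "negative"
    return "neutral"


# Inputs taken from the first rows of the generated dataset.
print(expected_sentiment(2, 126))  # neutral
print(expected_sentiment(7, 280))  # negative
print(expected_sentiment(29, 4))   # neutral
```

The last call shows a consequence of the strict inequality: a customer with 29 orders who bought 4 days ago is still labelled neutral, because the positive branch requires more than 30 orders.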


In [5]:
df.to_csv("../data/synthetic_customers_raw.csv", index=False)


In [6]:
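df.info()          # prints dtypes and non-null counts
df.describe()      # result is discarded: a notebook cell renders only its last expression
df.isnull().sum()  # missing values per column (rendered below)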


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               1200 non-null   int64  
 1   age                       1200 non-null   int64  
 2   income                    1200 non-null   int64  
 3   total_orders              1200 non-null   int64  
 4   avg_order_value           1200 non-null   float64
 5   days_since_last_purchase  1200 non-null   int64  
 6   review_text               1200 non-null   object 
dtypes: float64(1), int64(5), object(1)
memory usage: 65.8+ KB


customer_id                 0
age                         0
income                      0
total_orders                0
avg_order_value             0
days_since_last_purchase    0
review_text                 0
dtype: int64

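The dataset contains 1200 records and 7 columns with no missing values. Each numeric feature is drawn uniformly from a fixed range (for example, age from 18 to 65 and days_since_last_purchase from 1 to 365), so the values stay within plausible bounds by construction.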
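
One way to back this summary with an explicit check is to count out-of-range and missing values per column. The sketch below is an assumption-laden illustration rather than part of the original notebook: `EXPECTED_BOUNDS` and `out_of_range_counts` are hypothetical names, the bounds are copied from the generation loop, and the five rows shown by `df.head()` stand in for the full DataFrame.

```python
import pandas as pd

# Bounds copied from the generation loop; the dict name is hypothetical.
EXPECTED_BOUNDS = {
    "age": (18, 65),
    "income": (30000, 150000),
    "total_orders": (1, 50),
    "avg_order_value": (20, 500),
    "days_since_last_purchase": (1, 365),
}


def out_of_range_counts(df):
    """Count, per column, the values outside the expected inclusive bounds."""
    return {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in EXPECTED_BOUNDS.items()
    }


# Stand-in data: the five rows displayed by df.head() earlier.
sample = pd.DataFrame(
    [
        [1, 58, 44592, 2, 375.94, 126, "Average experience"],
        [2, 26, 126530, 7, 344.82, 280, "Very disappointed"],
        [3, 55, 85302, 3, 34.3, 112, "Average experience"],
        [4, 50, 108907, 2, 289.4, 333, "Not worth the money"],
        [5, 44, 58893, 29, 302.85, 4, "Average experience"],
    ],
    columns=[
        "customer_id", "age", "income", "total_orders",
        "avg_order_value", "days_since_last_purchase", "review_text",
    ],
)

violations = out_of_range_counts(sample)
missing = int(sample.isnull().sum().sum())
print(violations)
print(missing)
```

Passing the full `df` instead of `sample` would apply the same check to all 1200 records. Because `between` returns `False` for missing values, a NaN in a numeric column would also be counted as out of range.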