# Notebook 00: Raw Data Generation

## Objective
In this notebook, we:
- Generate realistic synthetic customer and transaction data  
- Simulate real-world purchase behavior, noise, and missing values  
- Create raw datasets that resemble production business data  
- Export raw tables for further processing  

> Note: This notebook creates data only. No cleaning, transformation, or modeling is performed here.

## Problem Context

Real business data often contains inconsistencies, behavioral variation, and noise.  
Since real customer financial data is confidential, this project uses synthetic but realistic data to simulate:

- Customer demographics  
- Purchase histories  
- Seasonal shopping patterns  
- High-value and low-value customer behavior  

This dataset will serve as the foundation for feature engineering and customer lifetime value (CLV) prediction in later notebooks.


### Step 1: Import Libraries

In [7]:
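import random

import numpy as np
import pandas as pd
from faker import Faker

# Seed all three random sources (NumPy, random, Faker) so reruns draw the same values.
# Faker's relative dates ("-3y", "today") still shift with the day the notebook is run.
fake = Faker()
Faker.seed(42)
random.seed(42)
np.random.seed(42)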

### Step 2: Generate Customer Table

In [8]:
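n_customers = 10000

customers = pd.DataFrame({
    "customer_id": range(1, n_customers + 1),
    "gender": np.random.choice(["Male", "Female", "Other"], n_customers, p=[0.48, 0.48, 0.04]),
    # Normal draws are unbounded, so a few implausibly low ages can occur.
    # Cast to float so the column can hold NaN once missing values are injected below.
    "age": np.random.normal(35, 10, n_customers).astype(int).astype(float),
    "city_tier": np.random.choice(["Tier 1", "Tier 2", "Tier 3"], n_customers, p=[0.5, 0.3, 0.2]),
    "signup_date": [fake.date_between(start_date="-3y", end_date="-1y") for _ in range(n_customers)],
    # No p is given, so the four channels are equally likely
    "acquisition_channel": np.random.choice(["Ads", "Organic", "Referral", "Email"], n_customers)
})

# Blank out some ages to simulate messy data. np.random.choice samples with
# replacement by default, so slightly fewer than 500 distinct rows are affected.
customers.loc[np.random.choice(customers.index, 500), "age"] = np.nan

customers.head()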


| | customer_id | gender | age | city_tier | signup_date | acquisition_channel |
|---|---|---|---|---|---|---|
| 0 | 1 | Male | 20.0 | Tier 1 | 2024-10-29 | Ads |
| 1 | 2 | Female | 23.0 | Tier 1 | 2025-01-17 | Email |
| 2 | 3 | Female | 38.0 | Tier 1 | 2024-08-16 | Referral |
| 3 | 4 | Female | 23.0 | Tier 1 | 2024-03-01 | Ads |
| 4 | 5 | Male | 46.0 | Tier 2 | 2023-12-03 | Organic |
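
Before moving on, it can help to confirm that the injected gaps look as intended. The sketch below is an illustration, not part of the original notebook: it builds a small stand-in table (`demo_customers`, a hypothetical name) in the same way as the cell above and counts the missing ages. Because `np.random.choice` samples with replacement by default, the number of distinct rows blanked out can be smaller than the number requested.

```python
import numpy as np
import pandas as pd

# Local RNG so the notebook's global seed is left untouched
rng_state = np.random.RandomState(0)

n_demo = 1000
demo_customers = pd.DataFrame({
    "customer_id": range(1, n_demo + 1),
    "age": rng_state.normal(35, 10, n_demo).astype(int).astype(float),
})

# Blank out ages at 50 sampled positions (WITH replacement, as in the cell above)
picked = rng_state.choice(demo_customers.index, 50)
demo_customers.loc[picked, "age"] = np.nan

n_missing = int(demo_customers["age"].isna().sum())
n_unique_picked = len(np.unique(picked))
print(f"requested 50, distinct rows with missing age: {n_missing}")
```

If an exact count of missing values matters, passing `replace=False` to the sampling call would guarantee distinct rows.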


### Step 3: Generate Transaction Table

In [9]:
n_transactions = 120000

product_categories = ["Electronics", "Fashion", "Grocery", "Home", "Beauty"]
payment_methods = ["Card", "UPI", "COD", "Wallet"]

transactions = []

for _ in range(n_transactions):
    # randint excludes the upper bound: ids run 1..n_customers, quantity runs 1..4
    cust_id = np.random.randint(1, n_customers + 1)
    quantity = np.random.randint(1, 5)
    price = round(np.random.uniform(100, 5000), 2)
    discount = np.random.choice([0, 5, 10, 15, 20, 30], p=[0.4, 0.1, 0.15, 0.15, 0.1, 0.1])
    returned = np.random.choice([0, 1], p=[0.9, 0.1])  # roughly 10% of orders are returned

    transactions.append([
        cust_id,
        # Drawn independently of signup_date, so a purchase can predate the customer's signup
        fake.date_between(start_date="-2y", end_date="today"),
        random.choice(product_categories),
        quantity,
        price,
        discount,
        random.choice(payment_methods),
        returned
    ])

transactions = pd.DataFrame(transactions, columns=[
    "customer_id", "purchase_date", "product_category",
    "quantity", "price_per_unit", "discount_percent",
    "payment_method", "returned"
])

transactions["transaction_id"] = range(1, len(transactions) + 1)

transactions.head()


| | customer_id | purchase_date | product_category | quantity | price_per_unit | discount_percent | payment_method | returned | transaction_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3379 | 2024-02-19 | Beauty | 2 | 3238.38 | 30 | UPI | 0 | 1 |
| 1 | 7443 | 2024-09-26 | Electronics | 1 | 1960.78 | 15 | Wallet | 0 | 2 |
| 2 | 1780 | 2024-12-21 | Electronics | 1 | 797.5 | 5 | Wallet | 1 | 3 |
| 3 | 5708 | 2025-05-01 | Electronics | 3 | 4801.28 | 0 | COD | 0 | 4 |
| 4 | 7443 | 2024-06-28 | Home | 1 | 2003.54 | 15 | Card | 0 | 5 |
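
The row-by-row loop above makes several random calls per transaction, which can be slow at 120,000 rows. As a possible alternative, the sketch below draws each column in a single vectorized call. It is an illustration under stated assumptions, not the notebook's method: `generate_transactions_vectorized` is a hypothetical helper, it uses NumPy's `default_rng` instead of the global seed, and it replaces Faker's `"-2y"` window with uniform offsets over the last 730 days, so the values it produces will not match the output shown above.

```python
import numpy as np
import pandas as pd


def generate_transactions_vectorized(n_rows, n_customers, seed=42):
    """Hypothetical helper: draw every column in one call instead of looping."""
    rng = np.random.default_rng(seed)
    end = pd.Timestamp.today().normalize()
    # Assumption: uniform day offsets over 730 days stand in for Faker's "-2y" window
    days_back = rng.integers(0, 730, size=n_rows)
    return pd.DataFrame({
        "customer_id": rng.integers(1, n_customers + 1, size=n_rows),
        "purchase_date": end - pd.to_timedelta(days_back, unit="D"),
        "product_category": rng.choice(
            ["Electronics", "Fashion", "Grocery", "Home", "Beauty"], size=n_rows
        ),
        "quantity": rng.integers(1, 5, size=n_rows),  # upper bound exclusive: 1 to 4
        "price_per_unit": rng.uniform(100, 5000, size=n_rows).round(2),
        "discount_percent": rng.choice(
            [0, 5, 10, 15, 20, 30], size=n_rows, p=[0.4, 0.1, 0.15, 0.15, 0.1, 0.1]
        ),
        "payment_method": rng.choice(["Card", "UPI", "COD", "Wallet"], size=n_rows),
        "returned": rng.choice([0, 1], size=n_rows, p=[0.9, 0.1]),
        "transaction_id": np.arange(1, n_rows + 1),
    })


demo_tx = generate_transactions_vectorized(1_000, n_customers=100)
print(demo_tx.head())
```

The trade-off is that the vectorized version consumes random numbers in a different order than the loop, so switching to it changes the generated dataset even with the same seed.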


### Step 4: Add Realistic Behavior Patterns

In [10]:
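# High-value customers spend more.
# replace=False keeps the 1,000 sampled ids distinct (the default samples with replacement).
high_value_customers = np.random.choice(customers["customer_id"], size=1000, replace=False)

is_high_value = transactions["customer_id"].isin(high_value_customers)
transactions.loc[is_high_value, "price_per_unit"] = (
    transactions.loc[is_high_value, "price_per_unit"] * 1.5
).round(2)  # keep prices at two decimals after the 50% uplift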
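
One detail worth checking when picking the high-value group: `np.random.choice` samples with replacement unless `replace=False` is passed, so asking for 1,000 ids can return fewer than 1,000 distinct customers. The short sketch below illustrates the difference on a stand-in id range; the variable names are hypothetical and it uses its own generator rather than the notebook's global seed.

```python
import numpy as np

rng = np.random.default_rng(0)
ids = np.arange(1, 10_001)  # stand-in for customers["customer_id"]

with_replacement = rng.choice(ids, size=1000)                    # duplicates possible
without_replacement = rng.choice(ids, size=1000, replace=False)  # every id distinct

n_unique_with = len(np.unique(with_replacement))
n_unique_without = len(np.unique(without_replacement))
print(f"distinct ids with replacement:    {n_unique_with}")
print(f"distinct ids without replacement: {n_unique_without}")
```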

In [11]:
# Seasonal spikes:

transactions["purchase_date"] = pd.to_datetime(transactions["purchase_date"])

# November & December shopping boost: one extra unit per transaction
is_holiday_season = transactions["purchase_date"].dt.month.isin([11, 12])
transactions.loc[is_holiday_season, "quantity"] += 1
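
To see what the seasonal rule does, the sketch below applies the same November/December adjustment to a four-row toy frame (`demo`, a hypothetical name) and compares the average quantity inside and outside the holiday months. It is only an illustration of the rule, not a check run in the original notebook.

```python
import pandas as pd

demo = pd.DataFrame({
    "purchase_date": pd.to_datetime(
        ["2024-03-10", "2024-11-05", "2024-12-24", "2024-07-01"]
    ),
    "quantity": [2, 2, 2, 2],
})

# Same rule as the cell above: one extra unit for November and December purchases
is_holiday = demo["purchase_date"].dt.month.isin([11, 12])
demo.loc[is_holiday, "quantity"] += 1

# Average quantity for holiday vs. non-holiday purchases
mean_by_season = demo.groupby(is_holiday.rename("holiday_season"))["quantity"].mean()
print(mean_by_season)
```

Running the same `groupby` on the full `transactions` table would show whether the boost is visible in the generated data.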

### Step 5: Save Files

In [14]:
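import os
os.makedirs("../data/raw", exist_ok=True)  # create the output folder if it does not exist yet
customers.to_csv("../data/raw/customers.csv", index=False)
transactions.to_csv("../data/raw/transactions.csv", index=False)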
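
CSV files store dates as plain text, so a downstream notebook has to parse them again on load. The sketch below shows one way to verify the round trip. It writes a two-row stand-in table (`demo_tx`, a hypothetical name) to a temporary directory rather than `../data/raw`, so it can be run without touching the real export.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo_tx = pd.DataFrame({
    "transaction_id": [1, 2],
    "purchase_date": pd.to_datetime(["2024-11-05", "2025-01-17"]),
    "price_per_unit": [1960.78, 797.50],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "transactions.csv"  # stand-in for ../data/raw/transactions.csv
    demo_tx.to_csv(out_path, index=False)
    # Ask pandas to turn the date column back into datetimes while reading
    reloaded = pd.read_csv(out_path, parse_dates=["purchase_date"])

same_shape = reloaded.shape == demo_tx.shape
dates_parsed = pd.api.types.is_datetime64_any_dtype(reloaded["purchase_date"])
print(f"same shape: {same_shape}, dates parsed: {dates_parsed}")
```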