## PHASE 1 – DATA ECOSYSTEM SIMULATION

### TOOLS NEEDED

* Python (main driver)
* Libraries: `faker`, `pandas` (third-party, installed with pip); `uuid`, `random`, `datetime` (standard library)
* PostgreSQL, via a desktop install or pgAdmin (for ingesting the generated data afterwards)
* Optional: Excel/CSV preview for each table

---

## TABLES TO GENERATE

We’ll build **5 core tables**. The target size is 200–5,000 rows each (adjustable later); note that the customers script in Step 1 currently generates 100,000 rows.

| Table Name     | Description                                         |
| -------------- | --------------------------------------------------- |
| `customers`    | Kuda business users with churn markers, signup date |
| `transactions` | Deposits, withdrawals, transfers with timestamps    |
| `merchants`    | Vendors linked to transactions                      |
| `campaigns`    | Marketing actions with A/B test tags                |
| `tickets`      | Support tickets with categories and resolution info |
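
The `transactions` table has to point at rows that already exist in `customers` and `merchants`. A rough sketch of how that linkage could be simulated is below. The column names, transaction types, amount range, and the fixed end of the time window are all assumptions made for illustration, not the project's final schema.

```python
import random
from datetime import datetime, timedelta


def generate_transactions(customer_ids, merchant_ids, n=1000, seed=42):
    """Return n transaction rows whose foreign keys reference existing ids."""
    rng = random.Random(seed)            # local generator, leaves global state alone
    window_end = datetime(2025, 1, 1)    # hypothetical end of the simulated period
    two_years_in_seconds = 2 * 365 * 24 * 3600

    rows = []
    for i in range(n):
        rows.append({
            'transaction_id': f'txn-{i:06d}',
            'customer_id': rng.choice(customer_ids),   # FK -> customers
            'merchant_id': rng.choice(merchant_ids),   # FK -> merchants
            'txn_type': rng.choice(['deposit', 'withdrawal', 'transfer']),
            'amount': round(rng.uniform(100, 500_000), 2),
            'created_at': window_end - timedelta(
                seconds=rng.randint(0, two_years_in_seconds)
            ),
        })
    return rows


sample = generate_transactions(['c1', 'c2', 'c3'], ['m1', 'm2'], n=5)
for row in sample:
    print(row['transaction_id'], row['customer_id'], row['merchant_id'], row['txn_type'])
```

The returned list of dicts can be passed straight to `pd.DataFrame(...)`, the same way the customers script builds its frame. Sampling ids from the parent tables, rather than generating fresh ones, is what keeps the foreign keys valid when the CSVs are loaded into PostgreSQL.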

---

## FILE STRUCTURE FOR PHASE 1

```
01_data_simulation/
│
├── simulate_customers.py
├── simulate_transactions.py
├── simulate_merchants.py
├── simulate_campaigns.py
├── simulate_tickets.py
├── init_db.sql                  # Table creation scripts
├── data_dictionary.xlsx         # Column descriptions
├── ERD.drawio                   # ER diagram (to be generated once the schema is final)
└── output_data/
    ├── customers.csv
    ├── transactions.csv
    ├── merchants.csv
    ├── campaigns.csv
    └── tickets.csv
```
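
The layout above puts every CSV under `output_data/`. One possible way to make each script write there, creating the folder if it is missing, is sketched below. The helper name `save_table` and its parameters are hypothetical and not part of the existing scripts.

```python
import csv
from pathlib import Path


def save_table(rows, table_name, out_dir='output_data'):
    """Write a list of dict rows to <out_dir>/<table_name>.csv and return the path."""
    if not rows:
        raise ValueError('rows must contain at least one record')

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)   # create output_data/ on first use

    csv_path = out_path / f'{table_name}.csv'
    with csv_path.open('w', newline='', encoding='utf-8') as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return csv_path
```

When the data is already in a DataFrame, the same idea applies: create the directory with `Path(...).mkdir(parents=True, exist_ok=True)` and then call `df.to_csv(csv_path, index=False)`.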

---

## STEP 1: SIMULATE CUSTOMERS (Base Script Preview)


### Output: `customers.csv`

```python
# simulate_customers.py
from faker import Faker
import pandas as pd
import uuid
import random

# Seed Faker and the random module so names and categorical fields are reproducible.
fake = Faker()
Faker.seed(42)
random.seed(42)


def generate_customers(n=100000):
    """Return a DataFrame of n simulated business customers."""
    data = []
    for _ in range(n):
        # Note: uuid4() ignores the seeds above, so IDs differ between runs.
        customer_id = str(uuid.uuid4())
        name = fake.company()
        industry = random.choice(['Retail', 'Tech', 'Logistics', 'Healthcare', 'Finance'])
        region = random.choice(['Lagos', 'Abuja', 'Port Harcourt', 'Ibadan'])
        # Signup dates fall within the two years before the day the script runs.
        signup_date = fake.date_between(start_date='-2y', end_date='today')
        is_churned = random.choices([0, 1], weights=[0.85, 0.15])[0]  # 15% churn
        ab_group = random.choice(['control', 'treatment'])

        data.append({
            'customer_id': customer_id,
            'business_name': name,
            'industry': industry,
            'region': region,
            'signup_date': signup_date,
            'is_churned': is_churned,
            'ab_group': ab_group
        })

    return pd.DataFrame(data)


df_customers = generate_customers(100000)
df_customers.to_csv('customers.csv', index=False)
print("✅ Customers simulated and saved.")
```

✅ Customers simulated and saved.
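
One caveat about reproducibility: `uuid.uuid4()` draws from the operating system's entropy source, so `random.seed(42)` and `Faker.seed(42)` do not fix the `customer_id` values. If stable ids across runs matter (for example, so other tables can be regenerated against the same keys), a seeded alternative could look like the sketch below. The helper name `seeded_uuid4` is hypothetical, and this is one option rather than the script's current behaviour.

```python
import random
import uuid


def seeded_uuid4(rng):
    """Build a version-4 UUID string from a seeded random.Random instance."""
    # uuid.UUID sets the version and variant bits; the rest comes from rng.
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


rng_a = random.Random(42)
rng_b = random.Random(42)

ids_a = [seeded_uuid4(rng_a) for _ in range(3)]
ids_b = [seeded_uuid4(rng_b) for _ in range(3)]

print(ids_a == ids_b)  # True: the same seed yields the same ids
```

Swapping this into `generate_customers` would change how many values are drawn from the generator per row, so the rest of the simulated data would also differ from the sample output shown here.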


```python
df_customers.head()
```

|     | customer_id                          | business_name                | industry   | region | signup_date | is_churned | ab_group  |
| --- | ------------------------------------ | ---------------------------- | ---------- | ------ | ----------- | ---------- | --------- |
| 0   | 1ea79026-d3da-4402-b0a8-89b8a532d475 | Short-Phelps                 | Logistics  | Lagos  | 2024-12-19  | 0          | control   |
| 1   | d1f4b956-a41b-40ae-b3ca-90d4baa3f774 | Ramos Group                  | Logistics  | Ibadan | 2025-06-17  | 0          | control   |
| 2   | a7e14e38-5fb8-4c01-9cfb-8f9d65e59feb | Mason PLC                    | Finance    | Ibadan | 2024-05-29  | 0          | treatment |
| 3   | 72c212e2-79e7-4ba2-89b8-5102b92f33dd | Zimmerman, Mendoza and White | Tech       | Abuja  | 2024-02-22  | 0          | control   |
| 4   | 95d80777-c500-45c9-8139-4af6e8bde46b | Bradley, Mills and French    | Healthcare | Lagos  | 2024-02-04  | 1          | treatment |


```python
df_customers.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   customer_id    100000 non-null  object
 1   business_name  100000 non-null  object
 2   industry       100000 non-null  object
 3   region         100000 non-null  object
 4   signup_date    100000 non-null  object
 5   is_churned     100000 non-null  int64
 6   ab_group       100000 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB
```
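
The `info()` output shows `signup_date` stored as `object` (plain Python `date` values) and the low-cardinality text columns as `object` too. If the frame will be analysed in pandas before ingestion, converting those dtypes may be worthwhile. The sketch below uses a small stand-in frame built from three of the sample rows above, rather than the full `df_customers`, so it runs on its own.

```python
from datetime import date

import pandas as pd

# Stand-in frame with a subset of the customers columns.
df = pd.DataFrame({
    'signup_date': [date(2024, 5, 29), date(2024, 2, 22), date(2024, 2, 4)],
    'industry': ['Finance', 'Tech', 'Healthcare'],
    'is_churned': [0, 0, 1],
})

df['signup_date'] = pd.to_datetime(df['signup_date'])   # object -> datetime64
df['industry'] = df['industry'].astype('category')      # repeated labels -> category

print(df.dtypes)
print('churn rate:', df['is_churned'].mean())
```

On the full table, the same `mean()` call on `is_churned` gives a quick check that the simulated churn rate lands near the intended 15%.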
