# PHASE - 0

Payment Risk System – Synthetic Data Generator (Part A)

This notebook generates time-based synthetic payment data for a fintech-style
cross-border payments platform.

What this part does:

Creates a realistic customer universe

Generates transactions for a given date range (monthly runs)

Keeps customer identity and trust state consistent across months


How to use this notebook:

Edit assumptions and knobs in the next cell

Run customer generation only once

Run monthly transaction generation for Jan, Feb, Mar, etc.

Each run produces one month of data

All parameters are intentionally visible and editable so that anyone reading
this notebook can understand and experiment with the data generation logic.

In [12]:
# ============================================================
# PHASE 0: ASSUMPTIONS AND USER CONTROLS
# ============================================================

# ----------------------------
# Reproducibility
# ----------------------------
# Change this if you want a different random world.
# Keeping it fixed ensures the same inputs give the same outputs.
RANDOM_SEED = 42


# ----------------------------
# Platform & Time Settings
# ----------------------------
PLATFORM_START_DATE = "2010-01-01"

# These two define the month you are generating data for.
# You will change these for Jan, Feb, Mar, etc.

#TRANSACTION_START_DATE = "2025-01-01"
#TRANSACTION_END_DATE   = "2025-01-31"

TRANSACTION_START_DATE = "2025-02-01"
TRANSACTION_END_DATE   = "2025-02-28"


# ----------------------------
# Customer Population Settings
# ----------------------------
# This is used ONLY when creating customers for the first time.
NUMBER_OF_CUSTOMERS = 10000

# Fraction (0 to 1) of customers flagged as "new" in a given month.
# Note: the phases below only set an is_new_customer flag; they do not
# assign a fresh trust score to these customers.
PERCENTAGE_NEW_CUSTOMERS = 0.05   # 0.05 = 5% new customers per month


# ----------------------------
# Customer Activity Mix
# ----------------------------
# Distribution of customer activity levels.
# These should sum to 1.0
ACTIVITY_SEGMENT_DISTRIBUTION = {
    "low": 0.60,
    "medium": 0.30,
    "high": 0.10
}

# Monthly transaction ranges by activity segment
TRANSACTION_RANGE_BY_SEGMENT = {
    "low":    (1, 5),
    "medium": (6, 20),
    "high":   (21, 80)
}


# ----------------------------
# Geography & Corridor Assumptions
# ----------------------------
# Supported source countries (sender side)
SOURCE_COUNTRIES = ["IN"]

# Supported destination countries
DESTINATION_COUNTRIES = ["US", "UK", "AE", "SG", "NG"]

# Corridor risk classification (policy-driven, not ML)
# Higher value = higher inherent risk
CORRIDOR_RISK_SCORE = {
    ("IN", "US"): 0.2,
    ("IN", "UK"): 0.2,
    ("IN", "AE"): 0.3,
    ("IN", "SG"): 0.3,
    ("IN", "NG"): 0.9
}

# Currency mapping per corridor
CURRENCY_MAPPING = {
    ("IN", "US"): ("INR", "USD"),
    ("IN", "UK"): ("INR", "GBP"),
    ("IN", "AE"): ("INR", "AED"),
    ("IN", "SG"): ("INR", "SGD"),
    ("IN", "NG"): ("INR", "NGN")
}


# ----------------------------
# Risk Environment Controls
# ----------------------------
# Controls how aggressive fraud patterns are in the synthetic world.
# Accepted values: "low", "medium", "high"
FRAUD_INTENSITY_LEVEL = "medium"

# Probability that a transaction happens from a new device
NEW_DEVICE_PROBABILITY = 0.08


# ----------------------------
# Time-based Behaviour Rules
# ----------------------------
# Transactions during these hours are considered odd-hour transactions
ODD_HOURS = list(range(0, 5))  # hours 0-4, i.e. 12:00 AM to 4:59 AM


# ----------------------------
# Trust Score Settings
# ----------------------------
# Base trust score range for newly created customers
BASE_TRUST_SCORE_RANGE = (30, 80)

# Monthly trust adjustment limits
MAX_TRUST_INCREASE_PER_MONTH = 5
MAX_TRUST_DECREASE_PER_MONTH = 10


# ----------------------------
# File & Folder Structure
# ----------------------------
CUSTOMER_FILE_PATH = r"\CROSS_BORDER_FRAUD\INPUTS\CUSTOMER_DATA\customers.csv"
TRUST_STATE_FOLDER = r"\CROSS_BORDER_FRAUD\INPUTS\TRSUT_STATE_FOLDER"
TRANSACTION_OUTPUT_FOLDER = r"\CROSS_BORDER_FRAUD\INPUTS\TRANSACTION_OUTPUT_FOLDER"

print("PHASE 0 loaded successfully.")
print("You can now proceed to customer generation.")


PHASE 0 loaded successfully.
You can now proceed to customer generation.
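
Because every knob in Phase 0 is meant to be edited by hand, a quick consistency check before running the later phases can catch typos early. The sketch below is not part of the original notebook: `validate_config` is a hypothetical helper, and the rule that corridor risk scores lie in [0, 1] is an assumption inferred from the values shown above. Only two corridors are included for brevity.

```python
import math

def validate_config(segment_distribution, txn_ranges, corridor_risk, currency_mapping):
    """Return a list of problems; an empty list means the config looks consistent."""
    problems = []

    # Segment shares are documented as summing to 1.0
    if not math.isclose(sum(segment_distribution.values()), 1.0, abs_tol=1e-9):
        problems.append("segment shares do not sum to 1.0")

    # Every segment needs a transaction range, and vice versa
    if set(segment_distribution) != set(txn_ranges):
        problems.append("segment names differ between distribution and ranges")

    for segment, (low, high) in txn_ranges.items():
        if low > high:
            problems.append(f"transaction range for '{segment}' has min > max")

    for corridor, risk in corridor_risk.items():
        # Phase 2 looks up currencies for every corridor in the risk table
        if corridor not in currency_mapping:
            problems.append(f"corridor {corridor} has no currency mapping")
        # Assumption: risk scores are meant to lie in [0, 1]
        if not 0.0 <= risk <= 1.0:
            problems.append(f"corridor {corridor} risk score is outside [0, 1]")

    return problems


# Values copied from the Phase 0 cell (subset of corridors)
segment_distribution = {"low": 0.60, "medium": 0.30, "high": 0.10}
txn_ranges = {"low": (1, 5), "medium": (6, 20), "high": (21, 80)}
corridor_risk = {("IN", "US"): 0.2, ("IN", "NG"): 0.9}
currency_mapping = {("IN", "US"): ("INR", "USD"), ("IN", "NG"): ("INR", "NGN")}

problems = validate_config(segment_distribution, txn_ranges, corridor_risk, currency_mapping)
print(problems or "Config looks consistent.")
```

Returning a list of messages rather than raising on the first failure lets a reader see every inconsistent knob in one pass.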


# PHASE - 1

Phase 1: Customer Generation (Run Once)

In this phase, we create the customer universe for the platform.

Key points:

Customers are generated only once

Customer IDs remain stable across all months

Each customer gets:

an onboarding date

a base trust score

an activity segment (low / medium / high)

base behavioural parameters

The resulting customer file is reused for every monthly run (Jan, Feb, Mar, etc.).

If this file already exists, do not rerun this phase unless you want
to intentionally reset the entire system with new customers.

In [2]:
# ============================================================
# PHASE 1: CUSTOMER GENERATION (RUN ONCE)
# ============================================================

import os
import numpy as np
import pandas as pd
from datetime import datetime, timedelta


# ----------------------------
# Set random seed
# ----------------------------
np.random.seed(RANDOM_SEED)


# ----------------------------
# Helper: create folders if missing
# ----------------------------
os.makedirs(os.path.dirname(CUSTOMER_FILE_PATH), exist_ok=True)


# ----------------------------
# Check if customer file already exists
# ----------------------------
if os.path.exists(CUSTOMER_FILE_PATH):
    print("Customer file already exists.")
    print("Path:", CUSTOMER_FILE_PATH)
    print("Skipping customer generation.")
else:
    print("Generating new customer universe...")

    # ----------------------------
    # Generate customer IDs
    # ----------------------------
    customer_ids = [f"CUST_{i:06d}" for i in range(1, NUMBER_OF_CUSTOMERS + 1)]

    # ----------------------------
    # Generate onboarding dates
    # ----------------------------
    platform_start = pd.to_datetime(PLATFORM_START_DATE)
    txn_start = pd.to_datetime(TRANSACTION_START_DATE)

    onboarding_dates = [
        platform_start + timedelta(days=np.random.randint(0, (txn_start - platform_start).days))
        for _ in range(NUMBER_OF_CUSTOMERS)
    ]

    # ----------------------------
    # Assign activity segments
    # ----------------------------
    activity_segments = np.random.choice(
        list(ACTIVITY_SEGMENT_DISTRIBUTION.keys()),
        size=NUMBER_OF_CUSTOMERS,
        p=list(ACTIVITY_SEGMENT_DISTRIBUTION.values())
    )

    # ----------------------------
    # Generate base trust scores
    # ----------------------------
    base_trust_scores = np.random.randint(
        BASE_TRUST_SCORE_RANGE[0],
        BASE_TRUST_SCORE_RANGE[1] + 1,
        size=NUMBER_OF_CUSTOMERS
    )

    # ----------------------------
    # Assemble customer dataframe
    # ----------------------------
    customers_df = pd.DataFrame({
        "customer_id": customer_ids,
        "onboarding_date": onboarding_dates,
        "activity_segment": activity_segments,
        "base_trust_score": base_trust_scores
    })

    # ----------------------------
    # Save to disk
    # ----------------------------
    customers_df.to_csv(CUSTOMER_FILE_PATH, index=False)

    print(f"Customer universe created: {len(customers_df)} customers")
    print("Saved to:", CUSTOMER_FILE_PATH)

    print("\nSample customers:")
    display(customers_df.head())


Generating new customer universe...
Customer universe created: 10000 customers
Saved to: C:\Users\91833\Desktop\FRAUD_PROJECTS\CROSS_BORDER_FRAUD\INPUTS\CUSTOMER_DATA\customers.csv

Sample customers:


Unnamed: 0,customer_id,onboarding_date,activity_segment,base_trust_score
0,CUST_000001,2012-05-10,low,54
1,CUST_000002,2024-10-04,low,32
2,CUST_000003,2024-04-23,low,54
3,CUST_000004,2024-03-19,low,57
4,CUST_000005,2020-04-30,low,41


In [3]:
customers_df.base_trust_score.max() , customers_df.base_trust_score.min()

(80, 30)

In [4]:
customers_df.onboarding_date.max() , customers_df.onboarding_date.min()

(Timestamp('2024-12-30 00:00:00'), Timestamp('2010-01-02 00:00:00'))

In [5]:
customers_df.activity_segment.value_counts()

activity_segment
low       6024
medium    3006
high       970
Name: count, dtype: int64
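
The counts above can be compared against the configured segment shares to confirm that the sampled population is close to the intended mix. The following is a minimal sketch; `segment_share_report` is a hypothetical helper, and the example series is rebuilt from the counts printed above rather than read from the customer file.

```python
import pandas as pd

def segment_share_report(segments, expected_distribution):
    """Compare observed segment shares with the configured distribution."""
    observed = segments.value_counts(normalize=True)
    report = pd.DataFrame({
        "expected": pd.Series(expected_distribution),
        "observed": observed,
    }).fillna(0.0)
    report["abs_diff"] = (report["observed"] - report["expected"]).abs()
    return report


# Rebuild a segment column from the counts shown in the output above
segments = pd.Series(["low"] * 6024 + ["medium"] * 3006 + ["high"] * 970)
expected = {"low": 0.60, "medium": 0.30, "high": 0.10}

report = segment_share_report(segments, expected)
print(report)
```

With 10,000 customers, each observed share here differs from its configured value by less than half a percentage point.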

# PHASE - 2

Phase 2: Monthly Transaction Generation

In this phase, we generate transactions for one specific date range
(e.g. January 2025).

What this phase does:

Loads the existing customer universe

Identifies existing vs new customers for the month

Assigns transactions based on customer activity segment

Generates realistic transaction timestamps, amounts, corridors, and devices

Important notes:

This phase can be run every month

Customer IDs remain consistent

New customers are introduced gradually

No ML or risk decisions happen here

Output:

One transaction-level dataset for the selected month

In [13]:
# ============================================================
# PHASE 2.1: LOAD CUSTOMERS & IDENTIFY NEW CUSTOMERS
# ============================================================

import numpy as np
import pandas as pd
from datetime import datetime

# ----------------------------
# Load customers
# ----------------------------
customers_df = pd.read_csv(CUSTOMER_FILE_PATH, parse_dates=["onboarding_date"])

total_customers = len(customers_df)

# ----------------------------
# Determine new customers for this month
# ----------------------------
num_new_customers = int(total_customers * PERCENTAGE_NEW_CUSTOMERS)

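# Randomly select new customers for this month.
# Note: no seed is set in this cell, so the selection depends on the
# current state of NumPy's global random generator.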
new_customer_ids = np.random.choice(
    customers_df["customer_id"],
    size=num_new_customers,
    replace=False
)

customers_df["is_new_customer"] = customers_df["customer_id"].isin(new_customer_ids)

print(f"Total customers: {total_customers}")
print(f"New customers this month: {num_new_customers}")

display(customers_df["is_new_customer"].value_counts())


Total customers: 10000
New customers this month: 500


is_new_customer
False    9500
True      500
Name: count, dtype: int64
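
The selection in Phase 2.1 draws from NumPy's global random state, so the set of "new" customers can change between kernel sessions. One possible way to make it repeatable per month is sketched below. It is an alternative, not the notebook's method: `select_new_customers` is a hypothetical helper, and seeding a local generator from the base seed plus the year and month is an assumption about what "reproducible" should mean here.

```python
import numpy as np
import pandas as pd

def select_new_customers(customer_ids, fraction, seed, month_start):
    """Pick a repeatable, month-specific sample of customer IDs."""
    month = pd.to_datetime(month_start)
    # A local generator seeded from (seed, year, month): the same month always
    # yields the same sample, and different months get different streams.
    rng = np.random.default_rng([seed, month.year, month.month])
    n_new = int(len(customer_ids) * fraction)
    return rng.choice(np.asarray(customer_ids), size=n_new, replace=False)


customer_ids = [f"CUST_{i:06d}" for i in range(1, 10_001)]

feb_first_run = select_new_customers(customer_ids, 0.05, 42, "2025-02-01")
feb_second_run = select_new_customers(customer_ids, 0.05, 42, "2025-02-01")
jan_run = select_new_customers(customer_ids, 0.05, 42, "2025-01-01")

print(len(feb_first_run))
print(list(feb_first_run) == list(feb_second_run))
```

Using a local `Generator` also keeps this step from shifting the random stream consumed by later cells.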

In [14]:
# ============================================================
# PHASE 2.2: GENERATE TRANSACTIONS FOR THE MONTH
# ============================================================

from datetime import timedelta
import random

np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# ----------------------------
# Date range
# ----------------------------
start_date = pd.to_datetime(TRANSACTION_START_DATE)
end_date = pd.to_datetime(TRANSACTION_END_DATE)
num_days = (end_date - start_date).days + 1

# ----------------------------
# Corridor setup
# ----------------------------
corridors = list(CORRIDOR_RISK_SCORE.keys())

# ----------------------------
# Transaction generation
# ----------------------------
transactions = []

for _, row in customers_df.iterrows():
    customer_id = row["customer_id"]
    activity_segment = row["activity_segment"]
    is_new = row["is_new_customer"]

    # Determine number of transactions for this customer
    txn_min, txn_max = TRANSACTION_RANGE_BY_SEGMENT[activity_segment]
    num_txns = np.random.randint(txn_min, txn_max + 1)

    for _ in range(num_txns):
        txn_day_offset = np.random.randint(0, num_days)
        txn_time = start_date + timedelta(days=txn_day_offset)

        # Random hour
        hour = np.random.randint(0, 24)
        txn_time = txn_time.replace(hour=hour)

        # Choose corridor
        src_country, dest_country = random.choice(corridors)
        src_currency, dest_currency = CURRENCY_MAPPING[(src_country, dest_country)]

        # Amount logic: uniform draw within a segment-specific range
        if activity_segment == "low":
            amount = np.random.uniform(50, 500)
        elif activity_segment == "medium":
            amount = np.random.uniform(200, 2000)
        else:
            amount = np.random.uniform(500, 10000)

        # Device change simulation
        is_new_device = np.random.rand() < NEW_DEVICE_PROBABILITY

        transactions.append({
            "transaction_id": f"TXN_{np.random.randint(1_000_000, 9_999_999)}_{customer_id}",
            "customer_id": customer_id,
            "transaction_timestamp": txn_time,
            "source_country": src_country,
            "destination_country": dest_country,
            "source_currency": src_currency,
            "destination_currency": dest_currency,
            "transaction_amount": round(amount, 2),
            "is_new_device": int(is_new_device),
            "is_new_customer": int(is_new)
        })

transactions_df = pd.DataFrame(transactions)

print(f"Generated {len(transactions_df)} transactions")

display(transactions_df.head())


Generated 106325 transactions


Unnamed: 0,transaction_id,customer_id,transaction_timestamp,source_country,destination_country,source_currency,destination_currency,transaction_amount,is_new_device,is_new_customer
0,TXN_3234489_CUST_000001,CUST_000001,2025-02-15 10:00:00,IN,US,INR,USD,400.86,0,0
1,TXN_5472471_CUST_000001,CUST_000001,2025-02-19 22:00:00,IN,US,INR,USD,76.14,0,0
2,TXN_5521373_CUST_000001,CUST_000001,2025-02-03 21:00:00,IN,AE,INR,AED,75.39,0,0
3,TXN_6664789_CUST_000001,CUST_000001,2025-02-06 01:00:00,IN,UK,INR,GBP,131.82,0,0
4,TXN_5721339_CUST_000002,CUST_000002,2025-02-12 16:00:00,IN,UK,INR,GBP,286.15,0,0
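
Each `transaction_id` combines a random 7-digit number with the customer ID, so two transactions by the same customer could in principle receive the same ID, and nothing in the cell above checks for that. The sketch below shows one way to check the generated data; `check_transactions` is a hypothetical helper, and the third row of the sample frame is deliberately invalid test data rather than real output.

```python
import pandas as pd

def check_transactions(transactions_df, start_date, end_date):
    """Count duplicate IDs and timestamps outside the requested date range."""
    start = pd.to_datetime(start_date)
    # Include the whole final day by comparing against the start of the next day
    end_exclusive = pd.to_datetime(end_date) + pd.Timedelta(days=1)
    timestamps = pd.to_datetime(transactions_df["transaction_timestamp"])
    return {
        "duplicate_ids": int(transactions_df["transaction_id"].duplicated().sum()),
        "out_of_range": int(((timestamps < start) | (timestamps >= end_exclusive)).sum()),
    }


sample = pd.DataFrame({
    "transaction_id": [
        "TXN_3234489_CUST_000001",
        "TXN_5472471_CUST_000001",
        "TXN_3234489_CUST_000001",   # made-up duplicate for the demo
    ],
    "transaction_timestamp": [
        "2025-02-15 10:00:00",
        "2025-02-19 22:00:00",
        "2025-03-01 00:00:00",       # made-up timestamp outside February
    ],
})

result = check_transactions(sample, "2025-02-01", "2025-02-28")
print(result)
```

On a real monthly run, both counts would be expected to be zero; a non-zero `duplicate_ids` would suggest switching to a sequential ID scheme.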


# PHASE - 3

Phase 3: Trust State Initialization and Monthly Update

In this phase, we manage the customer trust state, which evolves over time.

Key ideas:

Trust is a customer-level, long-term signal

Trust is not regenerated every month

Trust is updated based on monthly behaviour

Trust changes are capped to keep them realistic

What happens here:

If this is the first-ever run, trust is initialized from base trust

If a previous trust file exists, it is reused

Trust is updated using simple behaviour indicators

Updated trust is saved for the next month

This is a simplified version of how a payment platform might track customer trust over time.

In [15]:
# ============================================================
# PHASE 3.1: LOAD OR INITIALIZE TRUST STATE
# ============================================================

import os

# ----------------------------
# Create trust state folder if missing
# ----------------------------
os.makedirs(TRUST_STATE_FOLDER, exist_ok=True)

# ----------------------------
# Identify latest trust state file (if any)
# ----------------------------
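# Only consider trust states from months before the one being generated,
# so that re-running a month does not apply its trust update twice.
current_month_tag = pd.to_datetime(TRANSACTION_START_DATE).strftime("%Y_%m")

trust_files = [
    f for f in os.listdir(TRUST_STATE_FOLDER)
    if f.startswith("trust_state_") and f.endswith(".csv")
    and f[len("trust_state_"):-len(".csv")] < current_month_tag
]

trust_files.sort()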

if len(trust_files) == 0:
    print("No existing trust state found. Initializing trust from base trust.")

    trust_df = customers_df[["customer_id", "base_trust_score"]].copy()
    trust_df.rename(columns={"base_trust_score": "trust_score"}, inplace=True)

else:
    latest_trust_file = trust_files[-1]
    print("Loading existing trust state:", latest_trust_file)

    trust_df = pd.read_csv(
        os.path.join(TRUST_STATE_FOLDER, latest_trust_file)
    )

print(f"Trust records loaded: {len(trust_df)}")
display(trust_df.head())


Loading existing trust state: trust_state_2025_01.csv
Trust records loaded: 10000


Unnamed: 0,customer_id,trust_score
0,CUST_000001,54
1,CUST_000002,32
2,CUST_000003,54
3,CUST_000004,57
4,CUST_000005,41


In [16]:
# ============================================================
# PHASE 3.2: UPDATE TRUST SCORE FOR THE MONTH
# ============================================================

# ----------------------------
# SAFETY: Remove old intermediate columns
# ----------------------------
if "trust_change" in trust_df.columns:
    trust_df = trust_df.drop(columns=["trust_change"])


# ----------------------------
# Merge trust with monthly transactions
# ----------------------------
txn_trust_df = transactions_df.merge(
    trust_df,
    on="customer_id",
    how="left"
)

# ----------------------------
# Behaviour indicators
# ----------------------------
monthly_stats = txn_trust_df.groupby("customer_id").agg(
    total_txns=("transaction_id", "count"),
    new_device_txns=("is_new_device", "sum"),
    avg_amount=("transaction_amount", "mean")
).reset_index()

# ----------------------------
# Trust logic
# ----------------------------
def compute_trust_change(row):
    change = 0

    if row["total_txns"] > 0 and (row["new_device_txns"] / row["total_txns"]) > 0.3:
        change -= 5

    if row["total_txns"] >= 5 and row["new_device_txns"] == 0:
        change += 3

    return change


monthly_stats["trust_change"] = monthly_stats.apply(
    compute_trust_change, axis=1
).clip(
    lower=-MAX_TRUST_DECREASE_PER_MONTH,
    upper=MAX_TRUST_INCREASE_PER_MONTH
)

# ----------------------------
# Update trust score
# ----------------------------
trust_updated = trust_df.merge(
    monthly_stats[["customer_id", "trust_change"]],
    on="customer_id",
    how="left"
)

trust_updated["trust_change"] = trust_updated["trust_change"].fillna(0)
trust_updated["trust_score"] = trust_updated["trust_score"] + trust_updated["trust_change"]
trust_updated["trust_score"] = trust_updated["trust_score"].clip(0, 100)

# ----------------------------
# Save ONLY trust state
# ----------------------------
month_tag = pd.to_datetime(TRANSACTION_START_DATE).strftime("%Y_%m")
trust_state_path = os.path.join(
    TRUST_STATE_FOLDER,
    f"trust_state_{month_tag}.csv"
)

trust_updated[["customer_id", "trust_score"]].to_csv(
    trust_state_path, index=False
)

print("Updated trust state saved to:")
print(trust_state_path)

display(trust_updated.head())


Updated trust state saved to:
C:\Users\91833\Desktop\FRAUD_PROJECTS\CROSS_BORDER_FRAUD\INPUTS\TRSUT_STATE_FOLDER\trust_state_2025_02.csv


Unnamed: 0,customer_id,trust_score,trust_change
0,CUST_000001,54,0
1,CUST_000002,32,0
2,CUST_000003,54,0
3,CUST_000004,57,0
4,CUST_000005,41,0


In [17]:
trust_updated.trust_change.max() , trust_updated.trust_change.min()

(3, -5)
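
The observed range of (3, -5) follows directly from the two rules in Phase 3.2: the penalty requires at least one new-device transaction, while the reward requires none, so the two never apply to the same customer and the monthly change is always -5, 0, or +3. The monthly caps of +5 and -10 are therefore never binding with the current rules. The sketch below restates the logic as plain functions for a few hand-picked cases; the function names and the example inputs are illustrative only.

```python
def trust_change(total_txns, new_device_txns, max_increase=5, max_decrease=10):
    """Monthly trust change using the same two rules as Phase 3.2, then capped."""
    change = 0
    # Penalty: more than 30% of the month's transactions came from a new device
    if total_txns > 0 and new_device_txns / total_txns > 0.3:
        change -= 5
    # Reward: at least 5 transactions and none from a new device
    if total_txns >= 5 and new_device_txns == 0:
        change += 3
    return max(-max_decrease, min(max_increase, change))


def apply_trust_change(score, change):
    """Apply the change and keep the score within 0..100."""
    return max(0, min(100, score + change))


print(trust_change(10, 0))        # 3: active customer, no new devices
print(trust_change(4, 2))         # -5: half of the transactions on new devices
print(trust_change(4, 0))         # 0: too few transactions to earn the reward
print(trust_change(10, 1))        # 0: 10% new-device share, neither rule applies
print(apply_trust_change(98, 3))  # 100: clipped at the upper bound
print(apply_trust_change(2, -5))  # 0: clipped at the lower bound
```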

# PHASE - 4

Phase 4: Save Monthly Synthetic Output

In this phase, we create the final synthetic dataset for the month.

What happens here:

Transactions are enriched with the updated trust score

The dataset represents the platform state at the end of the month

One clean output file is saved for this month

This file is the only output consumed by the downstream notebooks:

Risk scoring

ML

Rules

Performance analysis

After this phase:

Part A (Synthetic Data Generator) is complete

The notebook can be re-run for another month by changing dates

Trust state will automatically carry forward
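
When moving to another month, the end date has to match the month's length (28, 29, 30, or 31 days). A small helper can derive both dates from a year and month so they do not need to be typed by hand. This is a sketch under that assumption; `month_bounds` is a hypothetical helper and is not part of the original notebook.

```python
import calendar

def month_bounds(year, month):
    """Return (start_date, end_date) strings for the given calendar month."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}-{month:02d}-01"
    end = f"{year:04d}-{month:02d}-{last_day:02d}"
    return start, end


TRANSACTION_START_DATE, TRANSACTION_END_DATE = month_bounds(2025, 2)
print(TRANSACTION_START_DATE, TRANSACTION_END_DATE)  # 2025-02-01 2025-02-28
print(month_bounds(2024, 2))                         # ('2024-02-01', '2024-02-29')
```

The returned strings use the same `YYYY-MM-DD` format as the Phase 0 settings, so they can be assigned to `TRANSACTION_START_DATE` and `TRANSACTION_END_DATE` directly.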

In [18]:
# ============================================================
# PHASE 4: SAVE MONTHLY SYNTHETIC OUTPUT
# ============================================================

# ----------------------------
# Merge updated trust into transactions
# ----------------------------
final_monthly_df = transactions_df.merge(
    trust_updated[["customer_id", "trust_score"]],
    on="customer_id",
    how="left"
)

# ----------------------------
# Basic sanity checks
# ----------------------------
assert final_monthly_df["trust_score"].isna().sum() == 0, \
    "Some transactions are missing trust scores"

# ----------------------------
# Create output folder if needed
# ----------------------------
os.makedirs(TRANSACTION_OUTPUT_FOLDER, exist_ok=True)

# ----------------------------
# Save monthly output
# ----------------------------
month_tag = pd.to_datetime(TRANSACTION_START_DATE).strftime("%Y_%m")
monthly_output_path = os.path.join(
    TRANSACTION_OUTPUT_FOLDER,
    f"synthetic_transactions_{month_tag}.csv"
)

final_monthly_df.to_csv(monthly_output_path, index=False)

print("Monthly synthetic dataset saved to:")
print(monthly_output_path)

print("\nFinal dataset shape:", final_monthly_df.shape)
display(final_monthly_df.head())


Monthly synthetic dataset saved to:
C:\Users\91833\Desktop\FRAUD_PROJECTS\CROSS_BORDER_FRAUD\INPUTS\TRANSACTION_OUTPUT_FOLDER\synthetic_transactions_2025_02.csv

Final dataset shape: (106325, 11)


Unnamed: 0,transaction_id,customer_id,transaction_timestamp,source_country,destination_country,source_currency,destination_currency,transaction_amount,is_new_device,is_new_customer,trust_score
0,TXN_3234489_CUST_000001,CUST_000001,2025-02-15 10:00:00,IN,US,INR,USD,400.86,0,0,54
1,TXN_5472471_CUST_000001,CUST_000001,2025-02-19 22:00:00,IN,US,INR,USD,76.14,0,0,54
2,TXN_5521373_CUST_000001,CUST_000001,2025-02-03 21:00:00,IN,AE,INR,AED,75.39,0,0,54
3,TXN_6664789_CUST_000001,CUST_000001,2025-02-06 01:00:00,IN,UK,INR,GBP,131.82,0,0,54
4,TXN_5721339_CUST_000002,CUST_000002,2025-02-12 16:00:00,IN,UK,INR,GBP,286.15,0,0,32
