## National Transaction Producer

### Executive Summary
This notebook simulates a high-velocity, national-scale banking environment. It acts as the **Data Producer**, generating transaction streams for 1,000 **unique customers** across all 9 South African provinces.

The goal of this `Mock API` is to produce data complex enough to test **stateful streaming, geospatial anomaly detection, and behavioral risk scoring** in the downstream Bronze, Silver, and Gold layers.



## Geographical Setup (9 provinces)
Define the coordinates of the major hub in each province. This acts as the `Spatial Source of Truth`.

In [0]:
import random, uuid, json, time
from datetime import datetime

# Primary hubs for all 9 South African provinces
REGIONAL_BASES = {
    "Gauteng": {"lat": -26.2041, "lon": 28.0473},      # Johannesburg
    "Western_Cape": {"lat": -33.9249, "lon": 18.4241}, # Cape Town
    "KwaZulu_Natal": {"lat": -29.8587, "lon": 31.0218}, # Durban
    "Eastern_Cape": {"lat": -33.9608, "lon": 25.6022}, # Gqeberha
    "Limpopo": {"lat": -23.8962, "lon": 29.4486},      # Polokwane
    "Mpumalanga": {"lat": -25.4753, "lon": 30.9694},   # Mbombela
    "North_West": {"lat": -25.8560, "lon": 25.6403},   # Mahikeng
    "Free_State": {"lat": -29.1181, "lon": 26.2232},   # Bloemfontein
    "Northern_Cape": {"lat": -28.7282, "lon": 24.7499}  # Kimberley
}

STORAGE_PATH = "abfss://fraud-sentinel@giftmapote2ete.dfs.core.windows.net/raw/transactions/"
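
Downstream geospatial checks usually need the distance between a transaction and a known location. The following is a minimal sketch of a haversine (great-circle) distance, applied to two of the hub coordinates above. The helper name `haversine_km` and the Earth radius of 6371 km are assumptions for illustration and are not part of the original notebook.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    earth_radius_km = 6371.0  # assumed mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Hub coordinates copied from REGIONAL_BASES
johannesburg = (-26.2041, 28.0473)
cape_town = (-33.9249, 18.4241)

hub_distance_km = haversine_km(*johannesburg, *cape_town)
print(f"Johannesburg to Cape Town: {hub_distance_km:.0f} km")
```

A distance like this, combined with the time gap between two transactions, is one way a downstream layer could flag travel that is physically implausible.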



## Persistent National Customer Registry

We assign 1,000 customers across these regions and apply **Residential Jitter**: a random offset from the provincial hub, so that no two customers share the exact same home address, even if they are in the same city.

## 1,000-Customer Registry

In [0]:
# 1. Define the Pool: 1,000 Customers
CUSTOMER_POOL = [f"CUST-{i}" for i in range(1000, 2000)]

# 2. Reset and Rebuild the Dictionary to avoid KeyError
CUSTOMER_PROFILES = {}

for cid in CUSTOMER_POOL:
    province = random.choice(list(REGIONAL_BASES.keys()))
    base = REGIONAL_BASES[province]
    
    # Set the jitter spread (in degrees) based on province size:
    # Limpopo/Northern Cape/Eastern Cape = ~200km spread; all others (e.g. Gauteng) = ~15km spread
    spread = 1.8 if province in ["Limpopo", "Northern_Cape", "Eastern_Cape"] else 0.15

    tier = random.random()

    if tier < 0.6: # 60% low spenders
        avg_spend = random.uniform(400, 2000)
    elif tier < 0.9: # 30% medium spenders
        avg_spend = random.uniform(2000,6000)
    else: # 10% high spenders
        avg_spend = random.uniform(6000,10000)

    
    # Store the unique profile
    CUSTOMER_PROFILES[cid] = {
        "home_province": province,
        "home_lat": base["lat"] + random.uniform(-spread, spread),
        "home_lon": base["lon"] + random.uniform(-spread, spread),
        "avg_spend": round(avg_spend, 2),
        "trusted_device": f"DEV-{cid}"
    }

print(f"Success: Registry rebuilt with {len(CUSTOMER_PROFILES)} profiles.")
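
The spread values are expressed in degrees, so it can help to see roughly what they mean on the ground. The sketch below converts a spread to kilometres, assuming about 111 km per degree of latitude and scaling longitude by the cosine of the latitude. The constant `KM_PER_DEG_LAT` and the helper `jitter_extent_km` are hypothetical names introduced only for this illustration.

```python
import math

KM_PER_DEG_LAT = 111.0  # assumption: rough average length of one degree of latitude

def jitter_extent_km(spread_deg, lat_deg):
    """Approximate maximum offset (north-south, east-west) in km for a degree spread."""
    north_south = spread_deg * KM_PER_DEG_LAT
    # A degree of longitude shrinks with latitude by a factor of cos(latitude)
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

# Wide spread at the Limpopo hub, tight spread at the Gauteng hub
wide_ns, wide_ew = jitter_extent_km(1.8, -23.8962)
tight_ns, tight_ew = jitter_extent_km(0.15, -26.2041)

print(f"spread 1.8:  ~{wide_ns:.0f} km N-S, ~{wide_ew:.0f} km E-W")
print(f"spread 0.15: ~{tight_ns:.0f} km N-S, ~{tight_ew:.0f} km E-W")
```

Because the same degree offset covers less ground east-west than north-south, the jittered homes fall in a slightly narrow rectangle rather than a square.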

## Creating the Registry Table

In [0]:
%sql

-- Create a top-level container for the project
CREATE CATALOG IF NOT EXISTS fraud_sentinel_catalog;

-- Use this catalog for all subsequent operations
USE CATALOG fraud_sentinel_catalog;

-- Create the schema (database) inside this catalog
CREATE SCHEMA IF NOT EXISTS detection_service
MANAGED LOCATION 'abfss://fraud-sentinel@giftmapote2ete.dfs.core.windows.net/managed_tables/';

In [0]:
CATALOG = "fraud_sentinel_catalog"
SCHEMA_NAME = "detection_service"

# Standardized Table Names
REGISTRY_TABLE = f"{CATALOG}.{SCHEMA_NAME}.customer_registry"

# 2. Convert dictionary to list of rows
registry_data = [
    {
        "customer_id": cid, 
        "home_lat": p["home_lat"], 
        "home_lon": p["home_lon"], 
        "home_province": p["home_province"],
        "avg_spend": p["avg_spend"],
        "trusted_device": p["trusted_device"]
    } 
    for cid, p in CUSTOMER_PROFILES.items()
]

# 3. Create DataFrame and Write to Unity Catalog
# mode("overwrite") ensures no duplicate errors if you run this multiple times
(spark.createDataFrame(registry_data)
      .write
      .format("delta")
      .mode("overwrite") 
      .saveAsTable(REGISTRY_TABLE))

print(f"Registry table synchronized for {len(registry_data):,} customers.")

In [0]:
%sql 
select * from fraud_sentinel_catalog.detection_service.customer_registry limit 10;

### Multi-Scenario Generator
This function creates individual events based on three distinct behavioral modes.

In [0]:
def generate_transaction():
    # Pick a random customer from the 1,000-person pool
    cust_id = random.choice(CUSTOMER_POOL)

    # Safety Check: Ensure the profile exists to prevent KeyError
    if cust_id not in CUSTOMER_PROFILES:
        return None
    
    p = CUSTOMER_PROFILES[cust_id]

    # Logic: 10% Fraud, 30% Commute/Regional, 60% Home-Based
    rand_val = random.random()

    if rand_val < 0.10:
        # --- SCENARIO A: HIGH-VELOCITY FRAUD ---
        # Pick a province completely separate from their home
        other_provs = [v for k, v in REGIONAL_BASES.items() if k != p["home_province"]]
        target = random.choice(other_provs)
        lat, lon = target["lat"], target["lon"]
        amount = p["avg_spend"] * random.uniform(5, 12) # Major spend deviation
        device = f"NEW-{uuid.uuid4().hex[:6]}"
        status = "NEW_RECIPIENT"
        hour = random.randint(1, 4) # Suspicious hours
    
    elif rand_val < 0.40:
        # --- SCENARIO B: REGIONAL/COMMUTE (NORMAL) ---
        # Move up to ~100km from home on each axis (e.g. Polokwane to Burgersfort)
        lat = p["home_lat"] + random.uniform(-0.9, 0.9)
        lon = p["home_lon"] + random.uniform(-0.9, 0.9)
        amount = p["avg_spend"] * random.uniform(0.5, 2.0)
        device = p["trusted_device"]
        status = "FAVORITE"
        hour = random.randint(7, 21)
    
    else:
        # --- SCENARIO C: LOCAL/HOME (NORMAL) ---
        lat, lon = p["home_lat"], p["home_lon"]
        amount = p["avg_spend"] * random.uniform(0.1, 1.2)
        device = p["trusted_device"]
        status = "FAVORITE"
        hour = random.randint(7, 21)
    
    return {
        "transaction_id": str(uuid.uuid4()),
        "customer_id": cust_id,
        "amount": round(amount, 2),
        "device_id": device,
        "location": {"lat": lat, "lon": lon},
        "recipient_status": status,
        "timestamp": datetime.now().replace(hour=hour).isoformat()
    }
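
The 10/30/60 split relies on comparing a single uniform draw against cumulative thresholds (0.10, then 0.40). The following self-contained sketch isolates that selection step and checks the resulting mix empirically. The function `pick_scenario`, the scenario labels, and the fixed seed are assumptions made for illustration; they do not appear in the generator above.

```python
import random
from collections import Counter

def pick_scenario(rand_val):
    """Map a uniform draw in [0, 1) to a scenario using cumulative thresholds."""
    if rand_val < 0.10:
        return "fraud"      # first 10% of the unit interval
    elif rand_val < 0.40:
        return "regional"   # next 30%
    return "home"           # remaining 60%

rng = random.Random(42)  # seeded only so the sketch is repeatable
n_draws = 100_000
counts = Counter(pick_scenario(rng.random()) for _ in range(n_draws))
shares = {name: counts[name] / n_draws for name in ("fraud", "regional", "home")}
print(shares)
```

With enough draws, the observed shares should sit close to 0.10, 0.30, and 0.60, which is a quick way to sanity-check the thresholds after changing them.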

## Production-Scale Batch Writer
This is the engine that writes the data. It uses newline-delimited JSON (NDJSON) to stay efficient with 1,000 customers.

In [0]:
# Clear instruction: Run this and leave it active to feed the pipeline
print("Producer Active: Simulating 1,000 customers across South Africa...")

try:
    while True:
        # Batch writing strategy
        # Generate 20 transactions per batch to simulate high-throughput
        batch = [generate_transaction() for _ in range(20)]

        # Filter out any 'None' returns from safety checks
        batch = [t for t in batch if t is not None]

        # Format as NewLine Delimited JSON
        batch_id = str(uuid.uuid4())[:8]
        file_path = f"{STORAGE_PATH}national_batch_{batch_id}.json"
        json_payload = "\n".join([json.dumps(t) for t in batch])

        # Write to cloud storage
        dbutils.fs.put(file_path, json_payload)

        # Console logging for monitoring
        print(f"[{datetime.now().strftime('%H:%M:%S')}] Batch {batch_id} Published | {len(batch)} txns")

        # Pause for 4 seconds
        time.sleep(4)

except KeyboardInterrupt:
    print("Producer halted manually")
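
The NDJSON payload built in the loop above is simply one JSON object per line. As a minimal sketch of that format, the snippet below serializes a small batch and parses it back without Spark or `dbutils`. The helpers `to_ndjson` and `from_ndjson` and the sample records are hypothetical and exist only for this illustration.

```python
import json

def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(record) for record in records)

def from_ndjson(payload):
    """Parse newline-delimited JSON back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]

# Hypothetical sample records shaped like the generator's output
sample_batch = [
    {"transaction_id": "t-1", "amount": 120.5, "location": {"lat": -26.2, "lon": 28.05}},
    {"transaction_id": "t-2", "amount": 87.0, "location": {"lat": -33.92, "lon": 18.42}},
]

payload = to_ndjson(sample_batch)
restored = from_ndjson(payload)
print(payload)
```

Since `json.dumps` escapes newline characters inside strings, each record stays on a single line, which is what lets a reader split the file line by line.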