# Data Simulation and Augmentation

## Notebook Purpose


Real production systems rarely provide clean, crisis-ready datasets. To demonstrate end-to-end data engineering, analytics, and ML decisioning skills, we must first *simulate* a realistic operating environment.

This notebook is responsible for **creating a high-fidelity synthetic dataset** that mimics a real food-delivery platform (QuickBite) during a crisis period.

This notebook deliberately runs in **batch / offline mode** and does **NOT** use streaming or Auto Loader. Its sole responsibility is **data generation and augmentation**, not ingestion.

---

## Business Context

QuickBite experienced a **monsoon-driven operational crisis** causing:
- Late deliveries
- Food safety complaints
- Increased customer churn risk

However, historical public datasets do not contain:
- Crisis signals
- Customer sentiment
- Operational stress indicators

Therefore, I simulate:
- A time-shifted dataset (2025)
- Crisis impact windows
- Customer behavior changes
- Review sentiment signals

These outputs will later be treated **as if they were generated by a live application**.
________________________________________
## Outputs of This Notebook

This notebook produces **foundational tables** used downstream:

| Table | Purpose |
|------|--------|
| `fact_orders` | Simulated order-level facts (ground truth) |
| `fact_reviews` | Customer reviews + sentiment labels |
| `dim_customers` | Synthetic customer profiles |

---

Note:
_ML-ready features (ml_churn_features) are intentionally created in a
separate notebook (Crisis_recovery_ml_feature_engineering) to ensure
clean separation between data generation and feature engineering._



In [0]:
%pip install kaggle
%pip install faker



## 1: Importing the Public Dataset

### Business Problem

We need a **realistic base distribution** of orders to avoid artificial patterns.

### Approach

We ingest a public DoorDash-like dataset and treat it as **historical operational truth**.

### Key Decisions

- CSV chosen for transparency
- Schema inference enabled

In [0]:
import os
import shutil

# # Paste the exact path you copied here
kaggle_json_path = "/Volumes/workspace/default/kaggle_config/kaggle.json"

# # Kaggle expects the directory, not the file itself
os.environ["KAGGLE_CONFIG_DIR"] = os.path.dirname(kaggle_json_path)

src = "/Volumes/workspace/default/kaggle_config/kaggle.json"
dst_dir = "/tmp/.kaggle"
dst = f"{dst_dir}/kaggle.json"

os.makedirs(dst_dir, exist_ok=True)
shutil.copy(src, dst)

print("Copied kaggle.json to:", dst)

In [0]:
import os

os.environ["KAGGLE_CONFIG_DIR"] = "/tmp/.kaggle"
print("KAGGLE_CONFIG_DIR set to:", os.environ["KAGGLE_CONFIG_DIR"])

In [0]:
%sh
mkdir -p /tmp/doordash_data

kaggle datasets download \
  -d dharun4772/doordash-eta-prediction \
  -p /tmp/doordash_data \
  --unzip


In [0]:
%sh
cd /tmp/doordash_data
unzip -o doordash-eta-prediction.zip
ls -lh 

In [0]:
%sh
head -5 /tmp/doordash_data/*.csv


### food_delivery SCHEMA and VOLUME creation

In [0]:
spark.sql("""
CREATE SCHEMA IF NOT EXISTS workspace.food_delivery
""")

In [0]:
spark.sql("""
CREATE VOLUME IF NOT EXISTS workspace.food_delivery.food_delivery_data
""")

In [0]:
%sh
mv /tmp/doordash_data/*.csv /Volumes/workspace/food_delivery/food_delivery_data/
   

## 2: Time Warp Simulation (Temporal Shift)

### Business Problem

The dataset ends years before our crisis scenario. Dashboards and ML models must reflect **current timelines**.

### Approach

- Identify max historical timestamp
- Warp all events forward so the dataset ends in **Dec 2025**

### Why this approach

- Preserves **relative temporal patterns**
- Avoids fabricating random timestamps
- Enables realistic seasonality analysis

---

## 3: Order ID & Deterministic Identity

### Business Problem

The public dataset lacks globally unique order identifiers suitable for joins.

### Approach

- Generate monotonic `order_id`
- Ensure deterministic joins across tables

### Design Choice

We intentionally **do not reuse existing IDs** to prevent accidental leakage of source semantics.


In [0]:
# 1. Load the data, treating "NA" as proper NULLs
volume_path = "/Volumes/workspace/food_delivery/food_delivery_data/historical_data.csv"

df_backbone = (spark.read.format("csv") 
  .option("header", "true") 
  .option("inferSchema", "true") 
  .option("nullValue", "NA") 
  .load(volume_path))

# 3. Check the date range to determine our "Warp Factor"
from pyspark.sql.functions import min, max

print("Data Schema:")
df_backbone.printSchema()

print("\nDate Range:")
df_backbone.select(min("created_at"), max("created_at")).show()




In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, count

# Create a deterministic ordering of all rows
# This ensures timestamps progress consistently
w = Window.orderBy("created_at")

df_indexed = df_backbone.withColumn(
    "row_num",
    row_number().over(w)
)

# Total number of rows (used for linear interpolation)
total_rows = df_indexed.count()

# Define the simulated crisis window
target_start = "2025-07-01 00:00:00"
target_end   = "2025-12-31 23:59:59"

# Generate simulated timestamps by spreading rows evenly across the window
df_simulated = (
    df_indexed
    .withColumn(
        "created_at_simulated",
        expr(f"""
            CAST(
              from_unixtime(
                unix_timestamp('{target_start}')
                + (
                    (row_num - 1) *
                    (
                        (unix_timestamp('{target_end}')
                        - unix_timestamp('{target_start}'))
                        / ({total_rows} - 1)
                    )
                )
              ) AS TIMESTAMP
            )
        """)
    )
    # Simulate actual delivery time using estimated driving duration
    .withColumn(
        "actual_delivery_time_simulated",
        expr("""
            CAST(
              created_at_simulated
              + estimated_store_to_consumer_driving_duration * INTERVAL 1 SECOND
              AS TIMESTAMP
            )
        """)
    )
    # Generate a unique surrogate order_id
    .withColumn("order_id", monotonically_increasing_id())
    .drop("created_at", "actual_delivery_time", "row_num")
)


df_simulated = df_simulated.select( "order_id", "market_id", "store_id", "store_primary_category", "order_protocol", "total_items", "subtotal", "num_distinct_items", "min_item_price", "max_item_price", "total_onshift_dashers", "total_busy_dashers", "total_outstanding_orders", "estimated_order_place_duration", "estimated_store_to_consumer_driving_duration", "created_at_simulated", "actual_delivery_time_simulated" )

print("Data Schema:")
df_simulated.printSchema()
df_simulated.show(5)


# 5. Verify the new range
df_simulated.select(min("created_at_simulated"), max("created_at_simulated")).show()

display(df_simulated.select(
    "order_id", 
    "total_items",
    "order_protocol",
    "market_id",
    "store_id",
    "created_at_simulated",

).limit(5))

print("Data loaded and ID generated.")




In [0]:
from pyspark.sql.functions import year, month, count

df_simulated \
    .groupBy(
        year("created_at_simulated").alias("year"),
        month("created_at_simulated").alias("month")
    ) \
    .agg(count("*").alias("rows")) \
    .orderBy("year", "month") \
    .show()


## 4: Customer Simulation

### Business Problem

Customer-level analytics and churn modeling require:
- Stable identities
- Behavioral diversity

### Approach

- Randomly assign orders to ~5,000 customers
- Create a separate customer dimension table

### Why not real customer IDs?

- Public data lacks PII
- Synthetic profiles allow controlled experimentation

### Assign Synthetic Customer IDs

In [0]:
from pyspark.sql.functions import rand, floor

# Number of synthetic customers to simulate
num_customers = 5000

# Idempotent creation of fact_orders
if not spark.catalog.tableExists("food_delivery.fact_orders"):
    print("Creating fact_orders with frozen random customer_ids")

    # Randomly assign each order to a customer
    df_orders = (
        df_simulated
        .withColumn("customer_id", floor(rand() * num_customers) + 1)
    )

    # Persist the result as a Delta table
    df_orders.write.format("delta") \
        .mode("overwrite") \
        .saveAsTable("food_delivery.fact_orders")

else:
    print("fact_orders already exists — reusing it")

    df_orders = spark.table("food_delivery.fact_orders")

df_orders.printSchema()

display(df_orders.select(
    "order_id",
    "customer_id",
    "total_items",
    "order_protocol",
    "market_id",
    "store_id",
    "created_at_simulated"

).limit(5))


### Generate Customer Profiles

In [0]:
import pandas as pd
from faker import Faker
import random

# Define our table name explicitly
table_name = "food_delivery.dim_customers"

# 1. Check if the table already exists
if not spark.catalog.tableExists(table_name):
    print(f"Generating {table_name} from scratch...")

    fake = Faker()
    customer_data = []
    segments = ["Student", "Young Professional", "Family", "Corporate"]

    # Use the num_customers variable we defined earlier
    for i in range(1, num_customers + 1):
        profile = {
            "customer_id": i,
            "customer_name": fake.name(),
            "segment": random.choice(segments),
            #need to cast the date to string or timestamp for Spark compatibility
            "signup_date": str(fake.date_between(start_date='-2y', end_date='today')),
            "email": fake.email()
        }
        customer_data.append(profile)

    # Convert to Spark DataFrame
    df_customers = spark.createDataFrame(pd.DataFrame(customer_data))

    # Write as Delta
    df_customers.write.format("delta").mode("overwrite").saveAsTable(table_name)
    print("Customer profiles generated and frozen.")

else:
    print(f"Table {table_name} already exists. Loading frozen data...")
    df_customers = spark.table(table_name)

# Display to verify
display(df_customers)

In [0]:
# Join Orders with Customers
df_enriched = df_orders.join(df_customers, "customer_id", "left")

# Select a few key columns to inspect
display(df_enriched.select(
    "order_id", #
    "customer_name",
    "segment",
    "created_at_simulated",
    "actual_delivery_time_simulated"
).limit(5))

## 5: Crisis Injection Logic

### Business Problem

To test recovery systems, we need a **controlled crisis window**.

### Approach

- Identify Oct 2025 as crisis window
- Inflate delivery times
- Increase negative review probability

### Why deterministic windows?

- Enables before/after comparisons
- Simplifies evaluation for dashboards and ML


In [0]:
from pyspark.sql.functions import col, when, lit, rand

# 1. Load the frozen orders
df_orders = spark.table("food_delivery.fact_orders")

# 2. Define the Crisis Period (October 2025)
# I will target orders created in October 2025
crisis_condition = (col("created_at_simulated") >= "2025-10-01") & \
                   (col("created_at_simulated") <= "2025-10-30")

if not spark.catalog.tableExists("food_delivery.fact_orders"):

    # 3. Inject the "Monsoon Delay"
    # Logic: If it's October 2025, multiply delivery time by 1.5x to 2.0x (Random chaos)
    df_orders_crisis = df_orders.withColumn(
        "actual_delivery_time_simulated",
        when(
            crisis_condition,
            col("actual_delivery_time_simulated")
            + expr("INTERVAL 2700 SECONDS")                       # base 45 min
            + expr("CAST(rand() * 2700 AS INT) * INTERVAL 1 SECONDS")  # +0–45 min
        ).otherwise(col("actual_delivery_time_simulated"))
    ).withColumn(
        "created_at_simulated",
        col("created_at_simulated").cast("timestamp")
    ) #To keep created_at_simulated also in timestamp

    # 4. Overwrite the table with this new "Broken" data
    df_orders_crisis.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("food_delivery.fact_orders")

    print("Crisis injected: Oct 2025 delivery times have been destabilized.")
else:
    print("Crisis already injected.")


In [0]:
spark.table("food_delivery.fact_orders").printSchema()

## 6: Review & Sentiment Simulation

### Business Problem

Crisis detection requires **qualitative signals**, not just numbers.

### Approach

We simulate customer reviews based on:
- Delivery delays
- Random food safety incidents

Sentiment categories:
- Positive
- Late Delivery
- Food Safety

In [0]:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, IntegerType

# 1. Load the (now delayed) orders
df_orders = spark.table("food_delivery.fact_orders")

# 2. Simulate an "Estimated Delivery Time" (Target)
# I assume the 'Target' was 45 mins (2700s) after creation. 
# In reality, this varies, but this baseline works for calculating "Lateness".
df_orders_w_delay = df_orders.withColumn(
    "delay_seconds", 
    (F.col("actual_delivery_time_simulated").cast("long") - F.col("created_at_simulated").cast("long")) - 2700
)

# 3. Define the Review Logic (The "Brain" of the customer)
def get_review_sentiment(delay_sec, is_food_safety_incident):
    # Case A: The Food Safety Scandal (Random 5% trigger passed in)
    if is_food_safety_incident:
        return "Critical", 1, "Food poisoning risk! Undercooked and smelled bad. Never ordering again."
    
    # Case B: The Monsoon/Late Delivery (> 45 mins late)
    elif delay_sec > 2700: 
        return "Negative", 2, "Took forever to arrive. Food was cold due to delay."
    
    # Case C: Happy Path
    else:
        return "Positive", 5, "Great service, arrived on time!"

# I have written UDF as an alter approach (User Defined Function) but it isn't efficient for large data, 
# so i will use Spark SQL Native functions for speed.

# 4. Generate the Reviews using Spark Logic
df_reviews = df_orders_w_delay.withColumn("is_safety_incident", F.rand() < 0.05) \
    .withColumn(
        "review_score",
        F.when(F.col("is_safety_incident"), 1)
         .when(F.col("delay_seconds") > 2700, 2)
         .otherwise(5)
    ).withColumn(
        "review_text",
        F.when(F.col("is_safety_incident"), "Food tasted off and undercooked. Felt sick afterwards. #HealthHazard")
         .when(F.col("delay_seconds") > 2700, "Delivery took way too long. Food was completely cold.")
         .otherwise("Delicious and on time! Loved it.")
    ).withColumn(
        "sentiment_category",
        F.when(F.col("is_safety_incident"), "Food Safety")
         .when(F.col("delay_seconds") > 2700, "Late Delivery")
         .otherwise("Positive")
    )

# 5. Select only relevant columns for the Reviews Table
df_fact_reviews = df_reviews.select(
    "order_id",
    "customer_id",
    "review_score",
    "review_text",
    "sentiment_category"
)


table_name = "food_delivery.fact_reviews"
if not spark.catalog.tableExists(table_name):
    print(f"Generating {table_name}...")
    df_fact_reviews.write.format("delta").mode("overwrite").saveAsTable(table_name)
    print("Reviews generated and frozen.")
else:
    print(f"{table_name} already exists. Skipping generation.")

display(spark.table(table_name).limit(1000))

### Checking the crisis spike

In [0]:
%sql

SELECT 
  DATE_FORMAT(o.created_at_simulated, 'yyyy-MM') as month_year,
  r.sentiment_category,
  COUNT(*) as review_count
FROM food_delivery.fact_orders o
JOIN food_delivery.fact_reviews r 
  ON o.order_id = r.order_id
WHERE o.created_at_simulated BETWEEN '2025-07-01' AND '2025-12-31'
GROUP BY 1, 2
ORDER BY 1, 2;



October 2025 Spike: Even though Food safety complaints is distributed evenly across the months ,we have 29,550 "Late Delivery" complaints on October. compared to 0 in September, this is exactly the kind of "anomaly" which was intended to capture.