# Sampling raw data to simulate data used for dev, staging, and prod

Split the chronological timeline into Dev (oldest) → Staging (recent past) → Prod (most recent). Each environment can then be used differently.

Split the rows (6.36M in the full dataset, 30k in the random sample used here) into:
- Dev → ~70% oldest steps
- Staging → ~15% middle steps
- Prod → ~15% newest steps

In [5]:
import pandas as pd
from pathlib import Path

In [6]:
RAW_PATH = Path("../data/raw")
CREDITCARD_PATH = RAW_PATH / "paysim_data.csv"

df = pd.read_csv(CREDITCARD_PATH)
df = df.sample(n=30000, random_state=42)  # Take random 30k rows for speed
df.head()
len(df)

30000

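To avoid data leakage, split on time (`step`) so that each environment only sees data older than the next one.
To keep the fraud class balance, stratified sampling on the target column (`isFraud`) keeps the fraud ratio the same. Note that the time-based split in this notebook is not stratified, so the fraud rate can differ between Dev, Staging, and Prod.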
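
As a rough illustration of the stratified sampling mentioned above, here is a minimal sketch. It is not what this notebook runs: the helper name `stratified_sample` and the toy DataFrame are assumptions made for the example, and only the `isFraud` column name comes from the data.

```python
import pandas as pd


def stratified_sample(df, target, n, random_state=42):
    """Draw roughly n rows while keeping each class's share of `target`.

    Hypothetical helper: samples the same fraction from every class.
    """
    frac = n / len(df)
    return df.groupby(target, group_keys=False).sample(
        frac=frac, random_state=random_state
    )


# Synthetic stand-in for the PaySim data: 1% of rows are marked as fraud.
toy = pd.DataFrame(
    {
        "step": range(10_000),
        "isFraud": [1 if i % 100 == 0 else 0 for i in range(10_000)],
    }
)

sample = stratified_sample(toy, "isFraud", n=1_000)
print("Rows:", len(sample))
print("Fraud rate (full):  ", toy["isFraud"].mean())
print("Fraud rate (sample):", sample["isFraud"].mean())
```

On the real data this would replace the plain `df.sample(n=30000, random_state=42)` call, at the cost of no longer being a purely random draw.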

In [7]:
# Check unique step values (time simulation)
print("Step range:", df["step"].min(), "to", df["step"].max())

# Split by quantiles on step
q1 = df["step"].quantile(0.70)   # 70% cutoff
q2 = df["step"].quantile(0.85)   # 85% cutoff

dev_df = df[df["step"] <= q1].copy()
staging_df = df[(df["step"] > q1) & (df["step"] <= q2)].copy()
prod_df = df[df["step"] > q2].copy()

dev_df.to_csv(RAW_PATH / "paysim-dev.csv", index=False)
staging_df.to_csv(RAW_PATH / "paysim-staging.csv", index=False)
prod_df.to_csv(RAW_PATH / "paysim-prod.csv", index=False)

print(f"Dev: {len(dev_df)} rows, Staging: {len(staging_df)} rows, Prod: {len(prod_df)} rows")
print("Fraud rates:", dev_df['isFraud'].mean(), staging_df['isFraud'].mean(), prod_df['isFraud'].mean())

Step range: 1 to 718
Dev: 21043 rows, Staging: 4507 rows, Prod: 4450 rows
Fraud rates: 0.001188043529914936 0.00022187708009762592 0.004943820224719101
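
The split sizes above are only approximately 70/15/15 because every row with the same `step` lands in the same split. The sketch below wraps the same quantile logic in a function and checks that the three splits do not overlap in time. The function name `chronological_split` and the toy data are assumptions for illustration.

```python
import pandas as pd


def chronological_split(df, time_col="step", cutoffs=(0.70, 0.85)):
    """Split df into (dev, staging, prod) by quantiles of time_col.

    Hypothetical helper mirroring the quantile cutoffs used in the cell above.
    """
    q1, q2 = df[time_col].quantile(list(cutoffs))
    dev = df[df[time_col] <= q1]
    staging = df[(df[time_col] > q1) & (df[time_col] <= q2)]
    prod = df[df[time_col] > q2]
    return dev, staging, prod


# Toy data: steps 1..100, ten rows per step.
toy = pd.DataFrame({"step": [s for s in range(1, 101) for _ in range(10)]})
dev, staging, prod = chronological_split(toy)

# Every row ends up in exactly one split.
assert len(dev) + len(staging) + len(prod) == len(toy)
# The splits are strictly ordered in time, so no step is shared between them.
assert dev["step"].max() < staging["step"].min()
assert staging["step"].max() < prod["step"].min()

print(len(dev), len(staging), len(prod))
```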


In [8]:
# Then push the data to S3 via DVC.
# These are shell commands, not Python; run them in a terminal
# (the paths below assume the project root as working directory), e.g.:
#   dvc add data/raw/paysim-prod.csv
#   dvc push
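
If the DVC step should cover all three files from Python rather than a terminal, one possible sketch follows. It assumes `dvc` is installed, an S3 remote is already configured, and the script is run from the project root. The helper names `dvc_commands` and `run_dvc` are made up for this example; only `dvc add` and `dvc push` are actual DVC commands.

```python
import shutil
import subprocess
from pathlib import Path

RAW_PATH = Path("data/raw")  # assumed to be relative to the project root
FILES = ["paysim-dev.csv", "paysim-staging.csv", "paysim-prod.csv"]


def dvc_commands(raw_path=RAW_PATH, files=FILES):
    """Build one `dvc add` per file, followed by a single `dvc push`."""
    cmds = [["dvc", "add", (raw_path / name).as_posix()] for name in files]
    cmds.append(["dvc", "push"])
    return cmds


def run_dvc(dry_run=True):
    """Print each command; execute it only if dry_run is False and dvc is on PATH."""
    for cmd in dvc_commands():
        print(" ".join(cmd))
        if not dry_run and shutil.which("dvc"):
            subprocess.run(cmd, check=True)


# Dry run by default, so nothing is added or pushed until dry_run=False.
run_dvc(dry_run=True)
```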