# Sampling raw data to simulate the datasets used for dev, staging and prod

- Dev (tiny) → 1–5% of the data (fast iteration, debugging code) → 10000 rows

- Staging (medium) → ~7% of the data (system integration tests, pipelines) → 20172 rows

- Prod (full) → 100% of the data (training final models, evaluation) → 284807 rows
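As a quick sanity check, the row counts listed above can be converted into shares of the full dataset. This is a minimal sketch using only the three row counts from this notebook; the names `TOTAL_ROWS`, `env_rows` and `shares` are illustrative and not part of the original code.

```python
# Row counts taken from the notebook outputs; names are illustrative.
TOTAL_ROWS = 284_807
env_rows = {"dev": 10_000, "staging": 20_172, "prod": 284_807}

# Share of the full dataset, in percent, for each environment.
shares = {env: rows / TOTAL_ROWS * 100 for env, rows in env_rows.items()}

for env, pct in shares.items():
    print(f"{env:<8}{env_rows[env]:>8} rows  {pct:6.2f}%")
```

The dev sample comes out at roughly 3.5% of the data and the staging sample at roughly 7.1%.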

In [13]:
import numpy as np
import pandas as pd
from pathlib import Path

In [14]:
# Load dataset to dataframe
RAW_PATH = Path("../data/raw")
CREDITCARD_PATH = RAW_PATH / "creditcard.csv"

df = pd.read_csv(CREDITCARD_PATH)
df.head()
len(df)

284807

To keep the fraud class balance:
Use stratified sampling on the target column (`Class`) so that the fraud ratio in a sample stays the same as in the full dataset.
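Stratified sampling can be sketched on a small synthetic frame as follows. This is an illustration only: `toy_df`, its `Amount` values and the 1% fraud rate are made up for this example and are not the real credit card data. It uses the `stratify` argument of scikit-learn's `train_test_split` to keep the `Class` proportions in the sample.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for creditcard.csv: 10,000 rows, 1% fraud (hypothetical data).
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "Amount": rng.uniform(0, 500, size=10_000),
    "Class": np.r_[np.ones(100, dtype=int), np.zeros(9_900, dtype=int)],
})

# Keep 1,000 rows; stratify=... preserves the Class proportions in the sample.
sample_df, _ = train_test_split(
    toy_df,
    train_size=1_000,
    stratify=toy_df["Class"],
    random_state=42,
)

# Compare the fraud ratio of the full frame with that of the sample.
full_ratio = toy_df["Class"].mean()
sample_ratio = sample_df["Class"].mean()
print(f"fraud ratio full={full_ratio:.4f} sample={sample_ratio:.4f}")
```

A plain `df.sample(n=...)` only preserves the ratio on average, which matters when the positive class is as rare as fraud is here.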

In [15]:
from sklearn.model_selection import train_test_split

# Make sure the per-environment output folders exist before writing
for env in ("dev", "staging", "prod"):
    (RAW_PATH / env).mkdir(parents=True, exist_ok=True)

# Dev (small sample, stratified on Class so the fraud ratio matches the full dataset)
dev_df, _ = train_test_split(df, train_size=10000, stratify=df["Class"], random_state=42)
dev_df.to_csv(RAW_PATH / "dev" / "creditcard-dev.csv", index=False)

# Staging (medium, rebalanced)
# Keeps every fraud case and randomly samples 40 times as many non-fraud cases,
# giving a 1:40 fraud:non-fraud ratio.
fraud = df[df["Class"] == 1]
non_fraud = df[df["Class"] == 0].sample(n=len(fraud) * 40, random_state=42)
staging_df = pd.concat([fraud, non_fraud])
staging_df.to_csv(RAW_PATH / "staging" / "creditcard-staging.csv", index=False)

# Prod (full dataset)
df.to_csv(RAW_PATH / "prod" / "creditcard-prod.csv", index=False)

In [16]:
len(dev_df)

10000

In [17]:
len(staging_df)

20172
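The staging row count can be cross-checked against the sampling rule (all fraud rows plus 40 non-fraud rows per fraud row). This is a minimal arithmetic sketch; the constant and variable names are illustrative, and the fraud count is derived from the row count rather than read from the data.

```python
# Staging size reported above, and the sampling multiplier used in the code.
STAGING_ROWS = 20_172
NON_FRAUD_PER_FRAUD = 40

# Each fraud row contributes itself plus 40 non-fraud rows, i.e. 41 rows in total.
fraud_rows, remainder = divmod(STAGING_ROWS, NON_FRAUD_PER_FRAUD + 1)
non_fraud_rows = fraud_rows * NON_FRAUD_PER_FRAUD
fraud_share = fraud_rows / STAGING_ROWS

print(f"fraud={fraud_rows} non_fraud={non_fraud_rows} fraud_share={fraud_share:.4f}")
```

A remainder of zero confirms that the row count is consistent with a 1:40 fraud:non-fraud ratio, which puts fraud at about 2.4% of the staging set.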

In [18]:
len(df)

284807