# Fraud Detection: ETL & Data Quality

## Objective
Prepare an analytics-ready dataset from `creditcard.csv`, validate data quality, and export cleaned outputs for analysis and Power BI.

## Outputs
- `data/processed/fraud_transactions_clean.csv`

In [14]:
import pandas as pd
import numpy as np
from pathlib import Path

RAW_PATH = Path("../data/raw/creditcard.csv")
OUT_PATH = Path("../data/processed/fraud_transactions_clean.csv")

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

In [16]:
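# Load the raw file from RAW_PATH (defined in the setup cell)
df = pd.read_csv(RAW_PATH)
df.shape, df.columns
df.isna().sum().sort_values(ascending=False).head(10)
# Class balance in percent (only this last expression is displayed)
df["Class"].value_counts(normalize=True) * 100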

Class
0    99.827251
1     0.172749
Name: proportion, dtype: float64
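
The checks in this cell (shape, missing values, class balance) could also be gathered into one reusable summary. The following is a minimal sketch, not part of the original notebook: `quality_summary` is a hypothetical helper, and the tiny `sample` frame is made-up data standing in for `creditcard.csv`.

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame, target: str = "Class") -> dict:
    """Collect basic data-quality metrics for a transactions table.

    Hypothetical helper: assumes the frame has an "Amount" column
    and a binary target column.
    """
    class_share = frame[target].value_counts(normalize=True) * 100
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "negative_amounts": int((frame["Amount"] < 0).sum()),
        "class_share_pct": class_share.round(4).to_dict(),
    }


# Made-up sample: one exact duplicate row and one negative amount
sample = pd.DataFrame({
    "Time": [0.0, 1.0, 1.0, 2.0],
    "Amount": [10.0, 5.5, 5.5, -1.0],
    "Class": [0, 0, 0, 1],
})

summary = quality_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Running the same helper on `df` before cleaning and on the cleaned frame afterwards would give a before/after view of the same metrics.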

In [17]:
# Basic rules (company-style)
# 1) Amount must be >= 0
# 2) Remove duplicates (if any)
# 3) Ensure target is integer

df_clean = df.copy()
df_clean = df_clean[df_clean["Amount"] >= 0]
df_clean = df_clean.drop_duplicates()
df_clean["Class"] = df_clean["Class"].astype(int)

df_clean.shape

(283726, 31)
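
The final shape alone does not show which rule removed which rows. As a rough sketch (the function name `clean_with_audit` and the sample data are illustrative assumptions, not the notebook's own code), the same three rules can be applied while recording how many rows each one drops:

```python
import pandas as pd


def clean_with_audit(frame: pd.DataFrame):
    """Apply the three cleaning rules and record rows removed by each.

    Hypothetical helper: returns the cleaned frame and a small audit table.
    """
    audit = []
    out = frame.copy()

    # Rule 1: Amount must be >= 0 (this comparison also drops NaN amounts)
    before = len(out)
    out = out[out["Amount"] >= 0]
    audit.append(("amount_non_negative", before - len(out)))

    # Rule 2: remove exact duplicate rows, keeping the first occurrence
    before = len(out)
    out = out.drop_duplicates()
    audit.append(("drop_duplicates", before - len(out)))

    # Rule 3: ensure the target is an integer (removes no rows)
    out = out.assign(Class=out["Class"].astype(int))

    return out, pd.DataFrame(audit, columns=["rule", "rows_removed"])


# Made-up sample: one negative amount and one exact duplicate row
sample = pd.DataFrame({
    "Time": [0.0, 1.0, 1.0, 2.0],
    "Amount": [10.0, 5.5, 5.5, -1.0],
    "Class": [0.0, 0.0, 0.0, 1.0],
})

cleaned, audit = clean_with_audit(sample)
print(audit)
print(cleaned.shape)
```

An audit table like this is easy to export next to the cleaned CSV so that row-count changes can be traced later.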

In [18]:
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
df_clean.to_csv(OUT_PATH, index=False)
print(f"Saved clean dataset to: {OUT_PATH.resolve()}")

Saved clean dataset to: C:\data_projects\banking-fraud-detection-analytics\data\processed\fraud_transactions_clean.csv
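
Before the file is handed to Power BI, it may be worth reading it back and confirming that it still satisfies the cleaning rules. The sketch below is an assumption about how such a check might look: `verify_export` is a hypothetical helper, and it writes made-up data to a temporary directory rather than to `OUT_PATH`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def verify_export(frame: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Re-read an exported CSV and check it against the in-memory frame.

    Hypothetical helper: raises AssertionError if a check fails.
    """
    reloaded = pd.read_csv(path)
    assert reloaded.shape == frame.shape, "row/column count changed on export"
    assert list(reloaded.columns) == list(frame.columns), "column order changed"
    assert (reloaded["Amount"] >= 0).all(), "negative amounts in export"
    assert not reloaded.duplicated().any(), "duplicate rows in export"
    return reloaded


# Made-up cleaned data standing in for df_clean
sample_clean = pd.DataFrame({
    "Time": [0.0, 1.0],
    "Amount": [10.0, 5.5],
    "Class": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "sample_clean.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sample_clean.to_csv(out_path, index=False)
    reloaded = verify_export(sample_clean, out_path)

print(reloaded.dtypes)
```

In the notebook itself, the equivalent call would pass `df_clean` and `OUT_PATH` instead of the sample frame and temporary path.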
