The idea for this project came from a 2024 PwC report on the scale of global financial fraud, which it puts at trillions of dollars in losses each year. With the growth of instant payment systems like Pix and Zelle and the rapid expansion of e-commerce, fraud is becoming both more frequent and more sophisticated. Traditional rule-based systems often fail to catch these newer patterns.

This motivated me to explore how modern Machine Learning techniques can be applied to build a robust, modular, and explainable pipeline for fraud detection. The focus is on practical implementation, reproducibility, and the application of best practices in MLOps.



The dataset was released as part of the IEEE-CIS Fraud Detection competition and sourced from real-world e-commerce transactions provided by Vesta Corporation. It features over 590,000 transactions (about 3.5% labeled as fraudulent), each described by **431 features** (31 categorical and 400 numerical), along with a relative timestamp and a binary fraud label.

---

### 📁 Data Files

- `train_transaction.csv` – Transaction-level data used for training (includes the target `isFraud`)
- `train_identity.csv` – Identity information linked to the training set
- `test_transaction.csv` – Transaction-level data for testing (without `isFraud`)
- `test_identity.csv` – Identity information linked to the test set
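
The transaction and identity files share the `TransactionID` key, so they can be combined into one table per split. The snippet below is a minimal sketch of that step using pandas, not necessarily how this project's pipeline does it. The function names and the `data_dir` argument are hypothetical, and the one-to-one check assumes each `TransactionID` appears at most once in the identity file.

```python
from pathlib import Path

import pandas as pd


def merge_transaction_identity(
    transactions: pd.DataFrame, identity: pd.DataFrame
) -> pd.DataFrame:
    """Left-join identity rows onto transactions using TransactionID."""
    # A left join keeps every transaction, including those with no identity row;
    # their identity columns are filled with NaN.
    # validate="one_to_one" raises if TransactionID is duplicated on either side
    # (assumption: identity data has at most one row per transaction).
    return transactions.merge(
        identity, on="TransactionID", how="left", validate="one_to_one"
    )


def load_split(data_dir: str, split: str = "train") -> pd.DataFrame:
    """Read <split>_transaction.csv and <split>_identity.csv, then merge them."""
    base = Path(data_dir)
    transactions = pd.read_csv(base / f"{split}_transaction.csv")
    identity = pd.read_csv(base / f"{split}_identity.csv")
    return merge_transaction_identity(transactions, identity)
```

For example, `load_split("data", "train")` would return the merged training table, assuming the CSV files live in a local `data/` folder.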


### Dataset Statistics

| Metric                 | Value                       |
|------------------------|-----------------------------|
| Total transactions     | 590,540                     |
| Fraudulent cases       | 20,663 (≈ 3.5%)             |
| Numerical features     | 400                         |
| Categorical features   | 31                          |
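
Figures like these can be recomputed from the merged training table. The sketch below shows one way to do it; the `summarize` helper is hypothetical. It counts every non-numeric column as categorical, which is an assumption and may not match exactly how the counts in the table were derived.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "isFraud") -> dict:
    """Return row counts, fraud rate, and a numeric/categorical feature split."""
    # Exclude the label, the row identifier, and the relative timestamp
    # so that only candidate features are counted.
    features = df.drop(
        columns=[target, "TransactionID", "TransactionDT"], errors="ignore"
    )
    n_numeric = features.select_dtypes(include="number").shape[1]
    return {
        "total_transactions": len(df),
        "fraud_cases": int(df[target].sum()),
        "fraud_rate": float(df[target].mean()),
        "numerical_features": n_numeric,
        # Assumption: anything that is not numeric is treated as categorical.
        "categorical_features": features.shape[1] - n_numeric,
    }
```

Note that some integer-coded columns (such as `card1` or `addr1`) are categorical in meaning but would be counted as numerical by this dtype-based rule.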



### Variable Dictionary

| Column              | Description                                                                 |
|---------------------|-----------------------------------------------------------------------------|
| `TransactionID`     | Unique identifier for each transaction                                       |
| `TransactionDT`     | Time delta (in seconds) from a reference datetime, not an actual timestamp  |
| `TransactionAmt`    | Transaction amount                                                          |
| `ProductCD`         | Product code (values: W, C, R, H, S)                                        |
| `card1`–`card6`     | Payment card attributes (issuer ID, type, category, etc.)                  |
| `addr1`, `addr2`    | Purchaser address information (billing region and billing country)         |
| `dist1`, `dist2`    | Distance measures (e.g., between billing and mailing address, zip code, IP address) |
| `P_emaildomain`     | Purchaser’s email domain                                                    |
| `R_emaildomain`     | Recipient’s email domain                                                    |
| `C1`–`C14`          | Count features engineered by Vesta                                          |
| `D1`–`D15`          | Time deltas related to prior user behavior                                  |
| `M1`–`M9`           | Match flags (e.g., whether the names on the card and address match)        |
| `V1`–`V339`         | Vesta-engineered features (rankings, counts, and other entity relations) with masked meanings |
| `DeviceType`        | Device type used for transaction (desktop or mobile)                        |
| `DeviceInfo`        | Device details including OS and browser models                              |
| `isFraud`           | Target variable: 1 indicates fraud, 0 indicates legitimate transaction (training set only) |

> **Note:** Many features have been anonymized for privacy reasons, but the dataset still provides rich signal for pattern detection.
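
Because `TransactionDT` is a relative offset rather than a calendar timestamp, one common way to use it is to derive coarse time features. The sketch below assumes the offset is expressed in seconds; the helper and the output column names `dt_day` and `dt_hour` are hypothetical.

```python
import pandas as pd

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add elapsed-day and hour-within-day columns derived from TransactionDT."""
    out = df.copy()
    # Whole days elapsed since the reference point (assumes seconds).
    out["dt_day"] = out["TransactionDT"] // SECONDS_PER_DAY
    # Hour within the day, 0 to 23, relative to the reference point.
    out["dt_hour"] = (out["TransactionDT"] % SECONDS_PER_DAY) // SECONDS_PER_HOUR
    return out
```

Since the reference point is not disclosed, `dt_hour` is an hour relative to that reference and does not necessarily match local clock time.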