# Credit Card Fraud Detection

## Introduction
Credit card fraud is a major issue in the financial industry, leading to significant financial losses each year.  
In this project, I aim to build a machine learning model that can identify fraudulent transactions from legitimate ones.  

I use the **Credit Card Fraud Detection dataset** from Kaggle, which contains **284,807 transactions** made by European cardholders in September 2013. Among them, only **492 transactions are frauds** (≈0.17%).  

## Challenge
The dataset is **highly imbalanced**, which complicates both training and evaluation.  
- Fraud cases are rare, but critical to detect.  
- Accuracy alone is not a reliable metric: predicting "legitimate" for every transaction already scores above 99.8%.  
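
The accuracy problem can be checked with simple arithmetic on the class counts quoted in the introduction. The sketch below assumes a trivial baseline that labels every transaction as legitimate; the variable names are illustrative only.

```python
# Class counts quoted in the introduction.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

# A baseline that always predicts "legitimate" gets every legitimate
# transaction right and every fraud wrong.
baseline_accuracy = n_legit / n_total
baseline_fraud_recall = 0 / n_fraud  # no fraud is ever flagged

print(f"accuracy:     {baseline_accuracy:.4%}")
print(f"fraud recall: {baseline_fraud_recall:.0%}")
```

The baseline looks excellent on accuracy while detecting nothing, which is why the metrics chosen later focus on the fraud class.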

## Project objective
- I will compare different models: **Logistic Regression, Random Forest, Gradient Boosting (XGBoost)**.  
- I will use appropriate evaluation metrics for imbalanced classification:  
  - **ROC-AUC**  
  - **Recall** (sensitivity for frauds)  
  - **Precision-Recall curve**  
- My goal is to build a clean, reproducible ML pipeline ready for production use.  
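
As a rough illustration of these metrics, the sketch below scores a tiny hand-made example with scikit-learn. The labels and scores are invented for the demonstration, the 0.5 threshold is an arbitrary assumption, and `average_precision_score` is used here as one common single-number summary of the Precision-Recall curve; it is not necessarily the summary used later in this notebook.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy data (invented): 8 legitimate transactions (0) and 2 frauds (1),
# with hypothetical model scores for the fraud class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.05, 0.40, 0.35, 0.60, 0.55, 0.90])

# Threshold-free metrics work directly on the scores.
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

# Recall and precision need hard predictions; 0.5 is an assumed cut-off.
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)

print(f"ROC-AUC={roc_auc:.4f}  PR-AUC={pr_auc:.4f}")
print(f"recall={recall:.2f}  precision={precision:.2f}")
```

In this toy case both frauds are caught (recall of 1.0), but one legitimate transaction is also flagged, so precision drops to 2/3. That trade-off is what the Precision-Recall curve makes visible across thresholds.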

---


## Dataset description

The dataset I am working with contains transactions made by European cardholders in September 2013.  
It includes **284,807 rows** and **31 columns**:

- **Time**: the number of seconds elapsed between each transaction and the first transaction in the dataset.  
- **Amount**: the transaction amount.  
- **V1 – V28**: the result of a PCA transformation applied by the dataset creators to protect the confidentiality of the original features.  
  - These are numerical features with no direct interpretation.  
- **Class**: the target variable.  
  - `0` = legitimate transaction  
  - `1` = fraudulent transaction  
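
The column list above can be turned into a small schema check to run right after loading the file. This is only a sketch: `EXPECTED_COLUMNS` and `find_schema_problems` are hypothetical names introduced here, not part of pandas or of the original notebook.

```python
# Expected schema, built from the column description above:
# Time, V1..V28, Amount, Class.
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def find_schema_problems(columns):
    """Return (missing, unexpected) column names relative to EXPECTED_COLUMNS.

    Hypothetical helper for illustration; accepts any iterable of names.
    """
    actual = list(columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in actual]
    unexpected = [c for c in actual if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# Example with a deliberately incomplete header (V28 and Class dropped).
sample_header = EXPECTED_COLUMNS[:28] + ["Amount"]
missing, unexpected = find_schema_problems(sample_header)

print(len(EXPECTED_COLUMNS), "expected columns")
print("missing:", missing, "unexpected:", unexpected)
```

With the real data, the call would be `find_schema_problems(df.columns)`, and both returned lists should be empty.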

👉 A key point I noticed: there are no missing values in this dataset, so I don't need to perform any imputation.  


In [1]:
# Core
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & pipeline
from sklearn.model_selection import train_test_split, StratifiedKFold, validation_curve
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    recall_score,
    precision_score,
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay,
    roc_curve,
    auc,
    precision_recall_curve
)

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import randint, uniform

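# Utils
import warnings
# Silence all warnings to keep the notebook output readable.
# Note: this also hides deprecation and convergence warnings.
warnings.filterwarnings("ignore")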

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## Data loading & inspection

In [5]:
df = pd.read_csv('../data/creditcard.csv')
df.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

## Exploratory Data Analysis (EDA)

I now run a quick exploration to better understand the dataset and the target distribution.
Since this is an imbalanced dataset, checking the target variable `Class` is crucial.
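
A minimal sketch of the target check follows. To stay self-contained it rebuilds a stand-in for the `Class` column from the counts quoted in the introduction (284,807 rows, 492 frauds) instead of reading the CSV; with the real data, `class_column` would simply be `df["Class"]`.

```python
import pandas as pd

# Stand-in for df["Class"], rebuilt from the counts given in the introduction.
class_column = pd.Series([0] * 284_315 + [1] * 492, name="Class")

# Absolute and relative frequency of each class.
counts = class_column.value_counts().sort_index()
shares = class_column.value_counts(normalize=True).sort_index()

fraud_share_pct = shares.loc[1] * 100
legit_per_fraud = counts.loc[0] / counts.loc[1]

print(counts.to_dict())
print(f"fraud share: {fraud_share_pct:.2f}%")
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

Looking at the ratio as well as the percentage makes the imbalance concrete: there are several hundred legitimate transactions for every fraudulent one.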


### EDA observations

- I observe that the dataset is **highly imbalanced**: only about 0.17% of the transactions are frauds.  
- The variable **Amount** is very skewed...  
- The variable **Time** ...  

👉 These first checks confirm that the dataset is clean (no missing values) but challenging due to the imbalance.  
