### Soldier! From this point onwards, you will be building complete, end-to-end solutions to real-world problems

### **Credit Card Fraud Detection**

#### **1. Project Brief**

**Problem Statement:**
A financial institution needs a model to detect fraudulent credit card transactions in real time. The dataset contains transactions made over two days, with a very small fraction being fraudulent. The cost of missing a fraudulent transaction (a False Negative) is extremely high, while flagging a legitimate transaction as fraud (a False Positive) is only an inconvenience for the customer. Therefore, the primary goal is to **maximize the detection of fraudulent transactions (Recall)** while maintaining reasonable precision.

**Dataset:**
Kaggle's "Credit Card Fraud Detection" dataset. It contains 284,807 transactions, of which only 492 (0.17%) are fraudulent. The features `V1` through `V28` are the result of a PCA transformation to protect user privacy. The only features that have not been transformed are `Time` and `Amount`.
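To put the imbalance in numbers, here is a quick back-of-the-envelope check using only the two counts quoted above. The "always predict legitimate" baseline is a hypothetical classifier added purely for illustration, not a model from this project.

```python
# Counts taken from the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

# Share of transactions that are fraudulent.
fraud_rate = fraud_transactions / total_transactions

# Hypothetical baseline: a "model" that labels every transaction as legitimate.
# It gets every legitimate transaction right and every fraud wrong.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions
baseline_recall = 0 / fraud_transactions  # it catches zero frauds

print(f"Fraud rate:        {fraud_rate:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

A model that never flags anything would score above 99.8% accuracy while catching no fraud at all, which is why accuracy is not a useful yardstick for this dataset.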

**What You'll Learn & Master:**
-   **Handling Severe Class Imbalance:** This is the core challenge. You'll learn why accuracy is a useless metric here and master techniques like **SMOTE**, **class weighting**, and **undersampling**.
-   **Advanced Metrics:** You'll go beyond simple metrics and master the **Precision-Recall Curve** and **ROC-AUC** score, which are essential for imbalanced classification.
-   **Decision Threshold Tuning:** You will learn that `.predict()` is not the final step. By using `.predict_proba()`, you can tune the probability threshold (e.g., from 0.5 to 0.2) to optimize for recall over precision.
-   **Model Comparison:** You will compare `LogisticRegression`, `RandomForest`, and a powerful gradient boosting model (`LightGBM`) to see which performs best under these challenging conditions.
-   **Optuna for Imbalanced Classification:** You will use Optuna to find the best model, preprocessing steps, and decision threshold simultaneously.
-   **Interpretability Focus:** For the first time, you will produce a complete `INTERPRETABILITY_REPORT.md`. You will use **SHAP** to explain *why* a specific transaction is flagged as fraud, a critical requirement for any financial institution.
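
The threshold-tuning idea from the list above can be sketched without any model at all. The labels and probabilities below are invented for illustration (they stand in for the positive-class column that `.predict_proba()` would return), and `precision_recall_at` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical true labels (1 = fraud) and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.02, 0.05, 0.10, 0.25, 0.40, 0.60, 0.15, 0.30, 0.55, 0.90]


def precision_recall_at(threshold, y_true, y_prob):
    """Flag a transaction as fraud when its probability is >= threshold,
    then return (precision, recall) for that decision rule."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Compare the usual 0.5 cut-off with a lower, recall-friendly one.
for threshold in (0.5, 0.2):
    precision, recall = precision_recall_at(threshold, y_true, y_prob)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.2 catches one more fraud (recall rises from 0.50 to 0.75) at the price of two extra false alarms (precision falls from 0.67 to 0.50). That is the trade-off this project tunes deliberately.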

---

#### **2. Complete Dataset EDA**

Let's begin by thoroughly exploring the data.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Settings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
pd.set_option('display.float_format', lambda x: '%.3f' % x)

**Load Data:**
> This assumes the data has been downloaded from Kaggle and placed in the same folder as this notebook. If not, download it from [this link](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). If you would rather not download it manually, no worries: the next cell falls back to loading it from a public URL for your (yes, your!) convenience.

In [3]:
try:
    df = pd.read_csv('creditcard.csv')
except FileNotFoundError:
    print("Didn't find the file.......")
    print("Attempting to load from a public URL....")
    print("\nThis might be slow")
    url = "https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv"
    df = pd.read_csv(url)

print("Dataset loaded successfully")

Dataset loaded successfully


In [5]:
# Initial Inspection
print(f"Shape of the dataset is : {df.shape}")
print("First Five rows ")
df.head()

Shape of the dataset is : (284807, 31)
First Five rows 


| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.36 | -0.073 | 2.536 | 1.378 | -0.338 | 0.462 | 0.24 | 0.099 | 0.364 | ... | -0.018 | 0.278 | -0.11 | 0.067 | 0.129 | -0.189 | 0.134 | -0.021 | 149.62 | 0 |
| 1 | 0.0 | 1.192 | 0.266 | 0.166 | 0.448 | 0.06 | -0.082 | -0.079 | 0.085 | -0.255 | ... | -0.226 | -0.639 | 0.101 | -0.34 | 0.167 | 0.126 | -0.009 | 0.015 | 2.69 | 0 |
| 2 | 1.0 | -1.358 | -1.34 | 1.773 | 0.38 | -0.503 | 1.8 | 0.791 | 0.248 | -1.515 | ... | 0.248 | 0.772 | 0.909 | -0.689 | -0.328 | -0.139 | -0.055 | -0.06 | 378.66 | 0 |
| 3 | 1.0 | -0.966 | -0.185 | 1.793 | -0.863 | -0.01 | 1.247 | 0.238 | 0.377 | -1.387 | ... | -0.108 | 0.005 | -0.19 | -1.176 | 0.647 | -0.222 | 0.063 | 0.061 | 123.5 | 0 |
| 4 | 2.0 | -1.158 | 0.878 | 1.549 | 0.403 | -0.407 | 0.096 | 0.593 | -0.271 | 0.818 | ... | -0.009 | 0.798 | -0.137 | 0.141 | -0.206 | 0.502 | 0.219 | 0.215 | 69.99 | 0 |


In [8]:
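# Dataset info and missing values
# Note: the trailing commas join the three expressions below into one tuple,
# so the cell echoes (None, None, <Series>) after the printed output.
print("Dataset Info")
df.info(), \
print("\nMissing values check"), \
df.isnull().sum()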

Dataset Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 2

(None,
 None,
 Time      0
 V1        0
 V2        0
 V3        0
 V4        0
 V5        0
 V6        0
 V7        0
 V8        0
 V9        0
 V10       0
 V11       0
 V12       0
 V13       0
 V14       0
 V15       0
 V16       0
 V17       0
 V18       0
 V19       0
 V20       0
 V21       0
 V22       0
 V23       0
 V24       0
 V25       0
 V26       0
 V27       0
 V28       0
 Amount    0
 Class     0
 dtype: int64)

*OK, so the dataset is clean: no missing values and no datatype issues.*

_____
#### Basic Stats


In [11]:
df.describe()
# Note the huge range in the 'Amount' column (0 to 25,691.16). It needs scaling.
# The 'V' columns are already centered at zero by PCA, but their spreads differ
# (std from about 0.33 to 1.96), so scaling them again doesn't hurt.

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | ... | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 |
| mean | 94813.86 | 0.0 | 0.0 | -0.0 | 0.0 | 0.0 | 0.0 | -0.0 | 0.0 | -0.0 | ... | 0.0 | -0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.0 | -0.0 | 88.35 | 0.002 |
| std | 47488.146 | 1.959 | 1.651 | 1.516 | 1.416 | 1.38 | 1.332 | 1.237 | 1.194 | 1.099 | ... | 0.735 | 0.726 | 0.624 | 0.606 | 0.521 | 0.482 | 0.404 | 0.33 | 250.12 | 0.042 |
| min | 0.0 | -56.408 | -72.716 | -48.326 | -5.683 | -113.743 | -26.161 | -43.557 | -73.217 | -13.434 | ... | -34.83 | -10.933 | -44.808 | -2.837 | -10.295 | -2.605 | -22.566 | -15.43 | 0.0 | 0.0 |
| 25% | 54201.5 | -0.92 | -0.599 | -0.89 | -0.849 | -0.692 | -0.768 | -0.554 | -0.209 | -0.643 | ... | -0.228 | -0.542 | -0.162 | -0.355 | -0.317 | -0.327 | -0.071 | -0.053 | 5.6 | 0.0 |
| 50% | 84692.0 | 0.018 | 0.065 | 0.18 | -0.02 | -0.054 | -0.274 | 0.04 | 0.022 | -0.051 | ... | -0.029 | 0.007 | -0.011 | 0.041 | 0.017 | -0.052 | 0.001 | 0.011 | 22.0 | 0.0 |
| 75% | 139320.5 | 1.316 | 0.804 | 1.027 | 0.743 | 0.612 | 0.399 | 0.57 | 0.327 | 0.597 | ... | 0.186 | 0.529 | 0.148 | 0.44 | 0.351 | 0.241 | 0.091 | 0.078 | 77.165 | 0.0 |
| max | 172792.0 | 2.455 | 22.058 | 9.383 | 16.875 | 34.802 | 73.302 | 120.589 | 20.007 | 15.595 | ... | 27.203 | 10.503 | 22.528 | 4.585 | 7.52 | 3.517 | 31.612 | 33.848 | 25691.16 | 1.0 |
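
The `describe()` output shows `Amount` running from 0 to 25,691.16, far wider than the PCA features. One common way to bring it onto a comparable scale is standardization, sketched below in plain Python. This is an assumption for illustration: this section does not say which scaler the project uses, `standardize` is a hypothetical helper, and the five values are just the min, quartiles, and max of `Amount` from the summary, used as a tiny stand-in sample.

```python
from statistics import mean, pstdev

# Min, 25%, 50%, 75% and max of `Amount` from the describe() summary,
# used here as a stand-in sample (not the real column).
amounts = [0.0, 5.6, 22.0, 77.165, 25691.16]


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


scaled = standardize(amounts)
for raw, z in zip(amounts, scaled):
    print(f"{raw:>10.2f} -> {z:+.3f}")
```

scikit-learn's `StandardScaler` applies the same formula, also with the population standard deviation. With a column this skewed, the largest value still sits far from the rest after scaling, so a log transform or a robust scaler may be worth comparing.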


**Target Variable Distribution Analysis**