
# Credit Card Fraud Detection System
## Professional Implementation

### Project Overview
This project implements a machine learning pipeline for detecting fraudulent credit card transactions.
The dataset typically used is the [Kaggle Credit Card Fraud Detection dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud), which is highly imbalanced: fraudulent transactions make up only a small fraction of the records.

### Objectives
1.  **Data Analysis**: Understand the distribution of legitimate vs. fraudulent transactions.
2.  **Preprocessing**: Handle class imbalance and scale numerical features.
3.  **Modeling**: Implement Logistic Regression, Random Forest, and XGBoost classifiers.
4.  **Evaluation**: Use metrics suited to imbalanced data: Confusion Matrix, Precision-Recall, F1-Score, and ROC-AUC.


In [None]:

# Installation of necessary libraries
# !pip install pandas numpy matplotlib seaborn scikit-learn xgboost
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn Metrics & Preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score, 
                             precision_score, recall_score, f1_score, roc_auc_score, 
                             roc_curve, precision_recall_curve, auc)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
try:
    import xgboost as xgb
except ImportError:
    xgb = None  # keep the name defined so later cells fail with a clear message
    print("XGBoost not installed. Install it to run the XGBoost section.")

# Visualization Settings
%matplotlib inline
sns.set(style="whitegrid", palette="muted")
plt.rcParams['figure.figsize'] = (12, 6)
print("Libraries Setup Complete")



### 1. Data Loading and Inspection
We expect the zipped dataset `creditcard.csv.zip` to be located in the `datasets` folder.


In [None]:

import os

# Define path
DATA_PATH = './datasets/creditcard.csv.zip'

if not os.path.exists(DATA_PATH):
    print(f"WARNING: Dataset not found at {DATA_PATH}")
    print("Please download 'creditcard.csv.zip' from Kaggle and place it in the 'datasets' directory.")
    # Later cells check `data_exists` and skip their work when the data is missing.
    data_exists = False
else:
    df = pd.read_csv(DATA_PATH, compression='zip')
    print("Dataset loaded successfully.")
    print(f"Shape: {df.shape}")
    display(df.head())
    data_exists = True



### 2. Exploratory Data Analysis (EDA)
Understanding the class imbalance is crucial.


In [None]:

if data_exists:
    # Class distribution
    count_classes = df['Class'].value_counts(sort=True)  # pd.value_counts() is deprecated
    count_classes.plot(kind='bar', rot=0)
    plt.title("Transaction Class Distribution")
    plt.xticks(range(2), ["Normal", "Fraud"])
    plt.xlabel("Class")
    plt.ylabel("Frequency")
    plt.show()

    fraud = df[df['Class'] == 1]
    normal = df[df['Class'] == 0]
    print(f"Fraudulent transactions: {fraud.shape[0]} ({round(fraud.shape[0]/len(df) * 100, 2)}%)")
    print(f"Normal transactions: {normal.shape[0]}")



### 3. Data Preprocessing
- **Scaling**: We use `RobustScaler` because it centers on the median and scales by the interquartile range, which makes it less sensitive to outliers.
- **Splitting**: We split into Training and Test sets *before* any oversampling to avoid data leakage.
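
The two points above can be illustrated with a minimal sketch on made-up numbers. The arrays `amounts_train` and `amounts_test` are hypothetical and not taken from the dataset. The sketch shows how `RobustScaler` behaves when an extreme value is present, and it fits the scaler on the training rows only so that no test-set statistics reach the model.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical transaction amounts; 5000.0 is a deliberate outlier.
amounts_train = np.array([10.0, 20.0, 30.0, 40.0, 5000.0]).reshape(-1, 1)
amounts_test = np.array([30.0, 50.0]).reshape(-1, 1)

scaler = RobustScaler()
scaled_train = scaler.fit_transform(amounts_train)  # learn median and IQR from train
scaled_test = scaler.transform(amounts_test)        # reuse them on test, no refit

print("median used for centering:", scaler.center_[0])
print("IQR used for scaling:", scaler.scale_[0])
print("scaled test amounts:", scaled_test.ravel())
```

The outlier barely moves the median and the interquartile range, so typical amounts keep a sensible scale. A mean/standard-deviation scaler would be dominated by the single large value.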


In [None]:

if data_exists:
    # Separation of Input and Output
    X = df.drop('Class', axis=1)
    y = df['Class']

    # Stratified split first, so the scalers never see test rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    X_train, X_test = X_train.copy(), X_test.copy()

    # Scale Amount and Time with one scaler per column, fit on training data only
    amount_scaler = RobustScaler()
    time_scaler = RobustScaler()
    X_train['scaled_amount'] = amount_scaler.fit_transform(X_train[['Amount']]).ravel()
    X_test['scaled_amount'] = amount_scaler.transform(X_test[['Amount']]).ravel()
    X_train['scaled_time'] = time_scaler.fit_transform(X_train[['Time']]).ravel()
    X_test['scaled_time'] = time_scaler.transform(X_test[['Time']]).ravel()

    # Drop the raw columns and move the scaled ones to the front (optional, for clarity)
    front = ['scaled_amount', 'scaled_time']
    rest = [c for c in X_train.columns if c not in front + ['Time', 'Amount']]
    X_train = X_train[front + rest]
    X_test = X_test[front + rest]

    print("Train/test split successful.")
    print(f"Training set: {X_train.shape[0]}")
    print(f"Testing set: {X_test.shape[0]}")



### 4. Model Building & Evaluation
We will establish a baseline with Logistic Regression and then try more complex models like Random Forest and XGBoost.


In [None]:

def evaluate_model(model, X_test, y_test, model_name="Model"):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else y_pred
    
    print(f"--- {model_name} Evaluation ---")
    
    # 1. Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'Confusion Matrix: {model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()
    
    # 2. Classification Report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    # 3. ROC-AUC Score
    try:
        roc = roc_auc_score(y_test, y_prob)
        print(f"ROC-AUC Score: {roc:.4f}")
    except ValueError as e:
        # Raised, for example, when y_test contains only one class
        print(f"ROC-AUC could not be computed: {e}")

    print("-" * 30)
    return cm

if data_exists:
    # Logistic Regression
    lr = LogisticRegression(solver='liblinear')  # simple solver, adequate for a baseline
    lr.fit(X_train, y_train)
    evaluate_model(lr, X_test, y_test, "Logistic Regression")
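
The baseline above does not address the class imbalance named in the objectives. One common option, shown here only as a sketch and not as the method this notebook uses, is scikit-learn's `class_weight='balanced'` setting. The data below is synthetic (`make_classification` with roughly 2% positives), and the names `X_demo`, `y_demo` and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 2% positives.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

results = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    # 'balanced' reweights each class inversely to its frequency in y_tr
    clf = LogisticRegression(solver="liblinear", class_weight=class_weight)
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    results[label] = {
        "recall": recall_score(y_te, y_hat, zero_division=0),
        "precision": precision_score(y_te, y_hat, zero_division=0),
    }
    print(label, results[label])
```

Weighting typically raises recall on the minority class at some cost in precision, so both numbers are worth comparing before adopting it.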



### 5. Advanced Modeling: XGBoost
XGBoost is often the gold standard for tabular data classification.


In [None]:

if data_exists:
    try:
        # use_label_encoder is deprecated and ignored by recent XGBoost releases, so it is omitted
        xgb_clf = xgb.XGBClassifier(eval_metric='logloss')
        xgb_clf.fit(X_train, y_train)
        evaluate_model(xgb_clf, X_test, y_test, "XGBoost Classifier")
    except Exception as e:
        print(f"XGBoost training failed or skipped: {e}")
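
XGBoost exposes a `scale_pos_weight` parameter for imbalanced classes, and a commonly suggested starting value is the ratio of negative to positive training samples. The sketch below computes that ratio on a hypothetical label vector (`y_demo` stands in for `y_train`) and, if XGBoost is installed, builds an unfitted classifier with it. Whether this helps on the real data is an assumption to verify, not a result from this notebook.

```python
import numpy as np

# Hypothetical label vector standing in for y_train: 998 legitimate, 2 fraud.
y_demo = np.array([0] * 998 + [1] * 2)

n_neg = int((y_demo == 0).sum())
n_pos = int((y_demo == 1).sum())
scale_pos_weight = n_neg / n_pos
print(f"scale_pos_weight = {scale_pos_weight:.1f}")

try:
    import xgboost as xgb
    # 'aucpr' (area under the precision-recall curve) suits rare positives
    weighted_clf = xgb.XGBClassifier(
        eval_metric="aucpr",
        scale_pos_weight=scale_pos_weight,
    )
except ImportError:
    weighted_clf = None  # xgboost not installed; the ratio above is still valid
```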



### 6. Metric Explanation
**Why a Confusion Matrix?**
In fraud detection, accuracy is misleading because about 99.8% of transactions are valid. A model that predicts "Valid" for every transaction reaches about 99.8% accuracy but catches no fraud at all.

- **True Positives (TP)**: Fraud correctly identified. (High Priority)
- **False Negatives (FN)**: Fraud missed. (Costly!)
- **False Positives (FP)**: Legitimate transaction flagged as fraud. (Customer friction)
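
The accuracy problem described above can be reproduced with a small sketch. The labels are made up (10,000 transactions, 20 of them fraudulent) and the "model" simply predicts valid for everything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 10,000 transactions, 20 of them fraudulent (0.2%).
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1

# A "model" that labels every transaction as valid.
y_all_valid = np.zeros_like(y_true)

# With labels=[0, 1], ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_all_valid, labels=[0, 1]).ravel()
acc = accuracy_score(y_true, y_all_valid)
rec = recall_score(y_true, y_all_valid, zero_division=0)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

All 20 fraud cases land in the FN cell, which the accuracy figure hides and the confusion matrix makes visible.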

We prioritize **Recall** (catching as much fraud as possible) while maintaining reasonable **Precision** (not flagging too many legitimate customers).
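
One way to act on this trade-off is to pick the decision threshold from the precision-recall curve instead of using the default 0.5. The sketch below uses made-up scores, and the recall target is an arbitrary illustration. It is one possible approach, not a step this notebook performs.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical fraud probabilities for ten transactions (1 = fraud).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.45, 0.60, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall carry one extra trailing point with no threshold; drop it.
target_recall = 1.0
meets_target = recall[:-1] >= target_recall

# Among thresholds that reach the recall target, keep the most precise one.
best_idx = int(np.argmax(np.where(meets_target, precision[:-1], -1.0)))
chosen_threshold = thresholds[best_idx]

y_pred = (y_score >= chosen_threshold).astype(int)
print(f"threshold={chosen_threshold:.2f} "
      f"precision={precision[best_idx]:.2f} recall={recall[best_idx]:.2f}")
```

In practice the threshold should be chosen on a validation split, not on the test set, so that the reported test metrics stay unbiased.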
