<a href="https://www.kaggle.com/code/hellixir/fraud-detection-comparing-many-sl-and-ul-model?scriptVersionId=182854362" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <h1 style="text-align:center; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; color: #007bff;">CREDIT CARD FRAUD DETECTION</h1>



# About this notebook

## Objective
- The main objective of this notebook is to explore and analyze credit card fraud detection using machine learning techniques.
- It compares supervised models, unsupervised anomaly detectors, and a neural network on the same dataset, highlighting the strengths and limitations of each approach. The findings can inform the design of practical fraud detection systems.


## Dataset
The dataset used in this analysis is sourced from [Kaggle's Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data). It contains 284,807 credit card transactions, of which only 492 (0.172%) are fraudulent, so the classes are heavily imbalanced.
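
To see why this imbalance matters, the short calculation below (a sketch that uses only the two counts quoted above) shows that a hypothetical classifier labelling every transaction as "not fraud" would reach very high accuracy while detecting no fraud at all. This is why the evaluation later relies on precision, recall, and F1-score rather than accuracy.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

fraud_share = fraud_transactions / total_transactions

# Hypothetical baseline: always predict "not fraud".
# It is right on every legitimate transaction and wrong on every fraud.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions
baseline_recall = 0 / fraud_transactions  # no fraud is ever flagged

print(f"Fraud share:       {fraud_share:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```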

## Methodology
1. **Data Exploration**: 
   - Investigated the structure and characteristics of the dataset.
   - Identified any missing values or duplicates.
   - Explored the distribution of fraudulent and non-fraudulent transactions.

2. **Data Preprocessing**:
   - Scaled the features to a uniform range.
   - Handled class imbalance using SMOTE (Synthetic Minority Over-sampling Technique).

3. **Modeling**:
   - Supervised machine learning algorithms:
     - Logistic Regression
     - Decision Tree Classifier
     - Random Forest Classifier
     - Support Vector Machines (SVM)
     - K-nearest neighbors (kNN)
     - XGBoost
   - Unsupervised machine learning algorithms:
     - One-Class SVM
     - Local Outlier Factor
     - DBSCAN
     - Isolation Forest
     - K-Means Clustering
   - Sequential Neural Network
   - Tuned hyperparameters using grid search where applicable.

4. **Evaluation**:
   - Evaluated model performance using metrics such as precision, recall, and F1-score.
   - Visualized results using confusion matrices and ROC curve.
   - Compared the performance of different models and techniques.

5. **Conclusion**:
   - Summarized key findings and insights from the analysis.
   - Discussed the implications and practical considerations for deploying fraud detection systems in real-world scenarios.


# Import statements

In [None]:
# Basic imports
import pandas as pd
import numpy as np
import tensorflow as tf
from warnings import filterwarnings
filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_theme(rc={'figure.figsize':(12,8)}, palette = "Purples")

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from collections import Counter

# Metrics and tools
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

# Imports of supervised learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
import shap

# Imports of unsupervised learning models
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans, DBSCAN

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data exploration

In [None]:
df = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')
df.head()

In [None]:
df.describe()

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

In [None]:
duplicates = df[df.duplicated()]

In [None]:
duplicates.head(10)

In [None]:
df[df.Class == 1]

# Data visualization

In [None]:
sns.countplot(x = 'Class', data = df, color='blue')

In [None]:
fraud = df[df['Class'] == 1].describe().T
nofraud = df[df['Class'] == 0].describe().T

# Selecting only the mean values and renaming the columns
fraud_mean = fraud[['mean']].rename(columns={'mean': 'Fraud Mean'})
nofraud_mean = nofraud[['mean']].rename(columns={'mean': 'No Fraud Mean'})

compare = pd.DataFrame({'Fraud Mean': fraud_mean['Fraud Mean'], 'No Fraud Mean': nofraud_mean['No Fraud Mean']})

# Displaying the mean values of all the features as a DataFrame table
print("Mean values for Fraud and No Fraud Samples:")
compare


# Data Preprocessing

In [None]:
# Defining X and y

X = df.drop('Class', axis=1)
y = df['Class']

### Scaling

In [None]:
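# Rescales each feature to the [0, 1] range; the shape of each feature's distribution is preserved.
# Note: the scaler is fitted on the full dataset, before the train/test split below.

scaler = MinMaxScaler()
X = scaler.fit_transform(X)  # returns a NumPy array, so column names are dropped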
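
As a rough illustration of what min-max scaling computes, the sketch below rescales one feature with the formula `(x - min) / (max - min)`, learning the minimum and maximum from one set of values and reusing them for new values. The helper names `fit_min_max` and `apply_min_max` and the toy numbers are illustrative assumptions, not part of the notebook.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of one feature."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values with previously learned statistics: (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # Constant feature: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


train_amounts = [10.0, 20.0, 50.0, 110.0]  # toy "training" values
new_amounts = [60.0, 210.0]                # toy unseen values

lo, hi = fit_min_max(train_amounts)
scaled_train = apply_min_max(train_amounts, lo, hi)
scaled_new = apply_min_max(new_amounts, lo, hi)
print(scaled_train)
print(scaled_new)
```

Values outside the range seen during fitting land outside [0, 1], which is expected whenever the statistics come from a different set of rows than the ones being transformed.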

### Splitting the data while keeping the class proportions the same in the train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2, stratify=y)

y.value_counts(), y_train.value_counts(), y_test.value_counts()

### SMOTE

In [None]:
# SMOTE to reduce class imbalance: sampling_strategy=0.1 oversamples fraud until it reaches 10% of the majority class size

print(f"Original class distribution: {Counter(y_train)}")

smote = SMOTE(sampling_strategy=0.1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {Counter(y_resampled)}")
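
The core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The function below is a simplified illustration only; the name `smote_like_sample` and the toy points are assumptions, and the real `imblearn` implementation also handles how many samples to generate per class.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Interpolate at a random position on the segment base -> neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Because every synthetic point is an interpolation of two real minority samples, it always stays inside the region already covered by the minority class.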


## Splitting after SMOTE and creating GridSearch subset

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

subset_size = 20000
X_train_subset,_, y_train_subset, _ = train_test_split(X_train, y_train, train_size=subset_size, random_state=42, stratify=y_train)

# Check proportion of classes
X_train_subset.shape, y_train_subset.shape, Counter(y_train_subset)

In [None]:
# Defining a function to run GridSearchCV and a function to evaluate a model's performance

def grid_search(model, params):
    # 5-fold cross-validated grid search on the training subset, scored by macro-averaged F1
    grid = GridSearchCV(model, params, scoring='f1_macro', cv=5, n_jobs=-1)
    # fit() of unsupervised estimators ignores y; the labels are then used only for scoring
    grid.fit(X_train_subset, y_train_subset)

    return grid.best_params_


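def evaluate_model(model, **kwargs):
    """Print and return the classification report and confusion matrix on the validation set."""
    try:
        y_pred = model.predict(X_val)
    except AttributeError:
        # Estimators without predict() (e.g. LocalOutlierFactor, DBSCAN) are refit on X_val
        y_pred = model.fit_predict(X_val)

    # Outlier detectors mark anomalies as -1: map -1 to fraud (1) and everything else to normal (0)
    if -1 in y_pred:
        y_pred = np.where(y_pred == -1, 1, 0)

    print(classification_report(y_val, y_pred))
    print(confusion_matrix(y_val, y_pred))
    return [
        classification_report(y_val, y_pred, output_dict=True),
        confusion_matrix(y_val, y_pred),
    ]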
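
For reference, the fraud-class precision, recall, and F1-score in the classification report can be derived directly from the confusion matrix. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that `confusion_matrix` produces for labels 0 and 1; the helper name `fraud_metrics` and the counts are made up for illustration.

```python
def fraud_metrics(cm):
    """Compute precision, recall and F1 for the positive class.

    cm is [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions.
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


toy_cm = [[990, 10], [5, 15]]  # made-up counts, not results from this notebook
precision, recall, f1 = fraud_metrics(toy_cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```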
    

# Supervised Methods

## 1. Logistic Regression
 - A simple, widely used algorithm for binary classification. It passes a weighted sum of the input features through the sigmoid function to estimate the probability of an outcome, such as whether a transaction is fraudulent.
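
A minimal sketch of the prediction step, assuming hypothetical weights and a two-feature input (none of these values come from the fitted model in this notebook):

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights and inputs, chosen only for illustration.
weights = [2.0, -1.0]
bias = -0.5
p = predict_proba([1.0, 0.5], weights, bias)
label = int(p >= 0.5)  # 0.5 is the usual default decision threshold
print(p, label)
```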

In [None]:
lr_params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength
    'penalty': ['l2', None]  # Type of regularization; the default lbfgs solver supports only L2 or no penalty (use 'none' instead of None on scikit-learn < 1.2)
}

lr = LogisticRegression()
lr_best_params = grid_search(lr, lr_params)
print(lr_best_params)
lr = LogisticRegression(**lr_best_params)
lr.fit(X_train, y_train)
lr_results = evaluate_model(lr)


## 2. Decision Tree Classifier
 - Model that splits data into branches based on feature values, making decisions at each node until a final classification is reached. It's intuitive and can handle both numerical and categorical data.
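
The `criterion` options in the grid below, `gini` and `entropy`, are two ways of measuring how mixed the class labels in a node are; a split is good when it produces children with low impurity. A small sketch of both measures on toy label lists (the helper names are illustrative):

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


mixed = [0, 0, 1, 1]  # evenly mixed node
pure = [0, 0, 0, 0]   # node containing a single class
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```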

In [None]:
dt_params = {
    'criterion': ['gini', 'entropy'],  # The function to measure the quality of a split
    'max_depth': [None, 5, 10, 20],  # The maximum depth of the tree
    'min_samples_split': [2, 5, 10]  # The minimum number of samples required to split an internal node
}

dt = DecisionTreeClassifier()
dt_best_params = grid_search(dt, dt_params)
print(dt_best_params)
dt = DecisionTreeClassifier(**dt_best_params)
dt.fit(X_train, y_train)
dt_results = evaluate_model(dt)

## 3. Random Forest Classifier
 - Ensemble method that builds multiple decision trees and merges them to get a more accurate and stable prediction. It helps reduce overfitting and improves the performance of decision trees.



In [None]:
rfc_params = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [5, 10, 15],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10]  # Minimum number of samples required to split an internal node
}

rfc = RandomForestClassifier()
rfc_best_params = grid_search(rfc, rfc_params)
print(rfc_best_params)
rfc = RandomForestClassifier(**rfc_best_params)
rfc.fit(X_train, y_train)
rfc_results = evaluate_model(rfc)

## 4. Support Vector Machines (SVM)
 - Powerful classification algorithm that finds the optimal hyperplane to separate classes in the feature space. It works well for high-dimensional data and is effective for both linear and non-linear classifications.

In [None]:
'''
svm_params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Regularization parameter
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],  # Kernel type
    'gamma': ['scale', 'auto']  # Kernel coefficient for 'rbf', 'poly', and 'sigmoid'
}


svm = SVC()
svm_best_params = grid_search(svm, svm_params)
print(svm_best_params)
'''

# To make the code run faster I commented out the grid search, but below are the best parameters it had returned
svm = SVC(C=100, gamma='scale', kernel='poly')
svm.fit(X_train, y_train)
svm_results = evaluate_model(svm)

## 5. K-nearest neighbors (kNN)
 - KNN is a simple, non-parametric algorithm that classifies a data point based on the majority class among its k-nearest neighbors. It’s easy to understand and implement but can be slow for large datasets.
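
The majority-vote rule can be sketched without any library. The function name `knn_predict` and the toy points are assumptions for illustration; this brute-force version also shows why prediction gets slow on large datasets, since every query is compared against every training point.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = [0, 0, 0, 1, 1]
pred_near_origin = knn_predict(points, labels, [0.15, 0.15], k=3)
pred_near_corner = knn_predict(points, labels, [0.95, 1.0], k=3)
print(pred_near_origin, pred_near_corner)
```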

In [None]:
knn_params = {
    'n_neighbors': [5, 10, 15],  # Number of nearest neighbors
    'weights': ['uniform', 'distance'],  # Weight function used in prediction
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']  # Algorithm
}

knn = KNeighborsClassifier()
knn_best_params = grid_search(knn, knn_params)
print(knn_best_params)
knn = KNeighborsClassifier(**knn_best_params)
knn.fit(X_train, y_train)
knn_results = evaluate_model(knn)

## 6. XGBoost
 - XGBoost (eXtreme Gradient Boosting) is an efficient and scalable implementation of gradient boosting for decision trees. It uses advanced regularization to prevent overfitting and can handle missing values. XGBoost is highly popular for its speed, performance, and accuracy, making it ideal for many machine learning tasks, including classification and regression.

In [None]:

xgb_params = {
    'n_estimators': [100, 200, 300],  # Number of boosting rounds (trees)
    'max_depth': [5, 10, 15],  # Maximum depth of the tree
    'min_child_weight': [1, 3, 5]  # Minimum sum of instance weight (hessian) needed in a child    
}

xgb = XGBClassifier()
xgb_best_params = grid_search(xgb, xgb_params)
print(xgb_best_params)
xgb = XGBClassifier(**xgb_best_params)
xgb.fit(X_train, y_train)
xgb_results = evaluate_model(xgb)

## Supervised - Predict on unseen data, visualize and compare results of supervised models

In [None]:
supervised_models = [lr, dt, rfc, svm, knn, xgb]
model_names = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM', 'KNN', 'XGBoost']

# Dictionary to store the results
test_data_results = {}

# Evaluate the trained models on the held-out test set
for model, name in zip(supervised_models, model_names):
    y_test_pred = model.predict(X_test)
    if -1 in y_test_pred:
        y_test_pred = np.where(y_test_pred == -1, 1, 0)
        
    test_data_results[name] = {
        'classification_report': classification_report(y_test, y_test_pred, output_dict=True),
        'confusion_matrix': confusion_matrix(y_test, y_test_pred)
    }



### Confusion matrices

In [None]:
def plot_confusion_matrices(results):
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))
    axes = axes.flatten()
    
    for idx, (model, result) in enumerate(results.items()):
        cm = result['confusion_matrix']
        sns.heatmap(cm, annot=True, fmt='d', ax=axes[idx], cbar=False, linewidths=1, linecolor='black', xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'], cmap='Purples')
        axes[idx].set_title(model)
        axes[idx].set_xlabel('Predicted labels')
        axes[idx].set_ylabel('True labels')
    
    plt.tight_layout()
    plt.show()

# Plot the confusion matrices
plot_confusion_matrices(test_data_results)

### Visualize metrics

In [None]:
# Convert the classification report to a DataFrame for visualization
data = []
for model, metrics in test_data_results.items():
    report = metrics['classification_report']
    for label, scores in report.items():
        if label not in ['accuracy', 'macro avg', 'weighted avg']:
            data.append([model, label, scores['precision'], scores['recall'], scores['f1-score']])

# Map labels to their corresponding names
label_map = {'0': 'Normal', '1': 'Fraud'}
data = [(model, label_map[label], precision, recall, f1) for model, label, precision, recall, f1 in data]

# Define metrics values in percents
data = [(model, label, precision * 100, recall * 100, f1 * 100) for model, label, precision, recall, f1 in data]

# Create the DataFrame (named metrics_df so the transactions DataFrame `df` is not overwritten here)
metrics_df = pd.DataFrame(data, columns=['Model', 'Class', 'Precision', 'Recall', 'F1-Score'])

# Plot the results
fig, axes = plt.subplots(3, 1, figsize=(15, 18))


# Function to add labels to all bars
def add_labels(ax):
    for container in ax.containers:
        ax.bar_label(container, fmt='%.2f%%')

sns.barplot(x='Model', y='Precision', hue='Class', data=metrics_df, ax=axes[0], dodge=True, palette='Purples')
axes[0].set_title('Precision by Model and Class', fontsize=16, fontweight='bold')
add_labels(axes[0])

sns.barplot(x='Model', y='Recall', hue='Class', data=metrics_df, ax=axes[1], dodge=True, palette='Purples')
axes[1].set_title('Recall by Model and Class', fontsize=16, fontweight='bold')
add_labels(axes[1])
axes[1].legend(title='Class', loc='lower right')

sns.barplot(x='Model', y='F1-Score', hue='Class', data=metrics_df, ax=axes[2], dodge=True, palette='Purples')
axes[2].set_title('F1-Score by Model and Class', fontsize=16, fontweight='bold')
add_labels(axes[2])

plt.tight_layout()
plt.show()


# Unsupervised Methods

## 1. One-Class SVM
 - Algorithm that learns a decision function for anomaly detection by finding the boundary that best separates normal data points from outliers in the feature space.

In [None]:
svm_params = {
    'nu': [0.01, 0.05, 0.1, 0.2],  # An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],  # Kernel type
    'gamma': ['scale', 'auto']  # Kernel coefficient for 'rbf', 'poly', and 'sigmoid'
}

svm = OneClassSVM()
svm_best_params = grid_search(svm, svm_params)
print(svm_best_params)
svm = OneClassSVM(**svm_best_params)
svm.fit(X_train)
svm_results = evaluate_model(svm)

## 2. Local Outlier Factor
 - LOF identifies anomalies by measuring the local density deviation of a data point relative to its neighbors. Points with significantly lower density than their neighbors are considered outliers.

In [None]:
lof = LocalOutlierFactor(algorithm='auto', leaf_size=10, n_neighbors=5)
lof.fit(X_train)
lof_results = evaluate_model(lof)

## 3. DBSCAN
 - Clustering algorithm that groups together points that are closely packed, marking points in low-density regions as outliers. It is effective for finding clusters of arbitrary shape and handling noise.

In [None]:
dbscan = DBSCAN(eps=0.1, min_samples=5)
dbscan.fit(X_train)
dbscan_results = evaluate_model(dbscan)

## 4. Isolation Forest
- Anomaly detection method that isolates observations by randomly selecting a feature and splitting it. Anomalies are isolated quickly, making it effective for identifying outliers in large datasets.

In [None]:
iso_for_params = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_samples': [0.5, 0.75, 1.0],  # Fraction of the samples in X drawn to train each base estimator
}

iso_for = IsolationForest()
iso_for_best_params = grid_search(iso_for, iso_for_params)
print(iso_for_best_params)
iso_for = IsolationForest(**iso_for_best_params)
iso_for.fit(X_train)
iso_for_results = evaluate_model(iso_for)

## 5. K-Means Clustering
 - K-Means is a popular clustering algorithm that partitions data into k distinct clusters based on feature similarity. Each data point is assigned to the nearest cluster center, and the centers are recalculated iteratively.
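
The assign-then-recompute loop described above can be sketched as follows. This is a bare-bones illustration with toy points and arbitrary starting centers (the helper names `assign` and `update` are assumptions); `sklearn.cluster.KMeans` adds smarter initialization and a tolerance-based stopping rule.

```python
import math


def assign(points, centers):
    """Assign each point to the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def update(points, assignments, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old_center in enumerate(centers):
        members = [p for p, a in zip(points, assignments) if a == j]
        if not members:
            new_centers.append(old_center)  # keep the center of an empty cluster
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers


points = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
centers = [[0.0, 0.0], [5.0, 5.0]]  # arbitrary starting centers

for _ in range(10):
    assignments = assign(points, centers)
    new_centers = update(points, assignments, centers)
    if new_centers == centers:  # stop once the centers no longer move
        break
    centers = new_centers

print(assignments)
print(centers)
```

Note that the cluster indices carry no meaning by themselves: nothing ties cluster 1 to the fraud class, which is one reason clustering output needs care when compared against 0/1 labels.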

In [None]:
k_means_params = {
    'n_clusters': [2, 3, 4, 5],  # The number of clusters
    'max_iter': [100, 200, 300],  # Maximum number of iterations
    'tol': [0.001, 0.01, 0.1]  # Tolerance
}

k_means = KMeans()
k_means_best_params = grid_search(k_means, k_means_params)
print(k_means_best_params)
k_means = KMeans(**k_means_best_params)
k_means.fit(X_train)
k_means_results = evaluate_model(k_means)

## Unsupervised - Predict on unseen data, visualize and compare results of unsupervised models

In [None]:
unsupervised_models = [svm, lof, dbscan, iso_for, k_means]
model_names = ['One-Class SVM', 'LOF', 'DBSCAN', 'Isolation Forest', 'K-Means']

# Dictionary to store the results
unsupervised_test_results = {}


# Predict on the test set and evaluate the models (estimators without predict() are refit on the test data)
for model, name in zip(unsupervised_models, model_names):
    try:
        y_test_pred = model.predict(X_test)
    except AttributeError:
        y_test_pred = model.fit_predict(X_test)

    if -1 in y_test_pred:
        y_test_pred = np.where(y_test_pred == -1, 1, 0)
        
    unsupervised_test_results[name] = {
        'classification_report': classification_report(y_test, y_test_pred, output_dict=True),
        'confusion_matrix': confusion_matrix(y_test, y_test_pred)
    }

### Confusion matrices

In [None]:
# Visualize the data

plot_confusion_matrices(unsupervised_test_results)

### Visualize metrics

In [None]:
# Convert the classification report to a DataFrame for visualization
data = []
for model, metrics in unsupervised_test_results.items():
    report = metrics['classification_report']
    for label, scores in report.items():
        if label not in ['accuracy', 'macro avg', 'weighted avg']:
            data.append([model, label, scores['precision'], scores['recall'], scores['f1-score']])

# Map labels to their corresponding names
label_map = {'0': 'Normal', '1': 'Fraud'}
data = [(model, label_map[label], precision, recall, f1) for model, label, precision, recall, f1 in data]

# Define metrics values in percents
data = [(model, label, precision * 100, recall * 100, f1 * 100) for model, label, precision, recall, f1 in data]

# Create the DataFrame (named unsup_metrics_df so the transactions DataFrame `df` is not overwritten here)
unsup_metrics_df = pd.DataFrame(data, columns=['Model', 'Class', 'Precision', 'Recall', 'F1-Score'])

# Plot the results
fig, axes = plt.subplots(3, 1, figsize=(15, 18))


# Function to add labels to all bars
def add_labels(ax):
    for container in ax.containers:
        ax.bar_label(container, fmt='%.2f%%')

sns.barplot(x='Model', y='Precision', hue='Class', data=unsup_metrics_df, ax=axes[0], dodge=True, palette='Purples')
axes[0].set_title('Precision by Model and Class', fontsize=16, fontweight='bold')
add_labels(axes[0])

sns.barplot(x='Model', y='Recall', hue='Class', data=unsup_metrics_df, ax=axes[1], dodge=True, palette='Purples')
axes[1].set_title('Recall by Model and Class', fontsize=16, fontweight='bold')
add_labels(axes[1])

sns.barplot(x='Model', y='F1-Score', hue='Class', data=unsup_metrics_df, ax=axes[2], dodge=True, palette='Purples')
axes[2].set_title('F1-Score by Model and Class', fontsize=16, fontweight='bold')
add_labels(axes[2])

plt.tight_layout()
plt.show()


# Neural networks and supervised methods

## Neural Network

In [None]:
# Define the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Adding callbacks
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
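# min_lr must be below Adam's default learning rate (0.001), otherwise the rate can never be reduced;
# patience is shorter than early stopping's so a reduction can happen before training stops
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=1e-5)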

# Training
history = model.fit(X_train,y_train, epochs=10, batch_size=64, validation_data=(X_val, y_val),
                    callbacks=[early_stopping, reduce_lr])


In [None]:
from sklearn.metrics import roc_curve, auc

# Step 1: Generate predictions
y_pred_proba = model.predict(X_val).ravel()

# Step 2: Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)

# Step 3: Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--', label='Random Guessing')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

# Step 4: Calculate AUC-ROC
auc_score = auc(fpr, tpr)
print("AUC-ROC Score:", auc_score)
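
The next cell applies a fixed decision threshold of 0.983, but the notebook does not show how that value was obtained. One common way to pick a threshold from ROC output is to maximize Youden's J statistic (TPR minus FPR); the sketch below illustrates that option on made-up ROC points. It is an assumption about a possible approach, not necessarily the method used here, and the helper name `best_threshold_youden` is hypothetical.

```python
def best_threshold_youden(fpr, tpr, thresholds):
    """Return the threshold that maximizes Youden's J statistic (TPR - FPR)."""
    best_index = max(range(len(thresholds)), key=lambda i: tpr[i] - fpr[i])
    return thresholds[best_index]


# Made-up ROC points, laid out like the three arrays returned by roc_curve.
toy_fpr = [0.0, 0.0, 0.02, 0.10, 1.0]
toy_tpr = [0.0, 0.60, 0.85, 0.90, 1.0]
toy_thresholds = [float("inf"), 0.99, 0.90, 0.50, 0.01]

chosen = best_threshold_youden(toy_fpr, toy_tpr, toy_thresholds)
print(chosen)
```

With real data the same function could be called on the `fpr`, `tpr`, and `thresholds` arrays computed above. For heavily imbalanced problems, a threshold chosen from the precision-recall trade-off may be more appropriate than one chosen from the ROC curve.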


In [None]:
# Evaluate on test data using a fixed decision threshold of 0.983 (hard-coded; its derivation is not shown in this notebook)
y_pred = (model.predict(X_test).ravel() > 0.983).astype(int)

neural_test_data_results = {}
# Calculate classification report
neural_test_data_results['ANN'] = {
    'classification_report': classification_report(y_test, y_pred, output_dict=True),
    'confusion_matrix': confusion_matrix(y_test, y_pred)
}

In [None]:
sns.heatmap(neural_test_data_results['ANN']['confusion_matrix'], annot=True, fmt='d', cmap='Purples', linewidths=1, linecolor='black', xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'], cbar=False)
plt.title('ANN Confusion Matrix', fontsize=16, fontweight='bold')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

# Conclusion

Based on the analysis, the XGBoost, Random Forest and Neural Network models all performed exceptionally well in detecting credit card fraud. However, XGBoost and Random Forest are more interpretable and transparent, providing feature importances and traceable decision paths. This transparency is crucial for explaining flagged transactions.

While Neural Networks demonstrated strong predictive power, their complexity and black-box nature pose challenges in explaining decisions to stakeholders.

Therefore, for practical deployment and interpretability in credit card fraud detection, Random Forest or XGBoost models are the preferred choices.

## Random Forest Classifier parameter tuning

In [None]:
# Try to improve the performance of the Random Forest Classifier by adjusting class weights

# Define class weights
class_weights = {0: 1, 1: 10}  # Adjust the weights as needed

# Train the model with adjusted class weights
rfc = RandomForestClassifier(**rfc_best_params, class_weight=class_weights)
rfc.fit(X_train, y_train)

# Evaluate the model on validation data
rfc_results = evaluate_model(rfc)

# Evaluate the model on test data
y_pred = rfc.predict(X_test)
rfc_test_data_results = {}
rfc_test_data_results['Random Forest'] = {
    'classification_report': classification_report(y_test, y_pred, output_dict=True),
    'confusion_matrix': confusion_matrix(y_test, y_pred)
}


# Plot confusion matrix
sns.heatmap(rfc_test_data_results['Random Forest']['confusion_matrix'], annot=True, fmt='d', cmap='Purples', linewidths=1, linecolor='black', xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'], cbar=False)
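
The weights `{0: 1, 1: 10}` above are chosen by hand. As a point of comparison, scikit-learn's `class_weight='balanced'` option derives weights from the label counts as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that heuristic on toy labels; the helper name `balanced_class_weights` is illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


toy_labels = [0] * 90 + [1] * 10  # toy 9:1 imbalance
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer a class is, the larger its weight, so misclassifying a fraud sample costs more during training than misclassifying a normal one.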


## Random Forest Classifier Feature importance

In [None]:
feature_importances = rfc.feature_importances_

# Sort the feature importances in descending order
sorted_idx = np.argsort(feature_importances)[::-1]
sorted_importances = feature_importances[sorted_idx]

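# Get the feature names from the CSV header, in the same column order as X
feature_names = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv', nrows=0).columns.drop('Class')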

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(X_train.shape[1]), sorted_importances, align='center', color='purple')
plt.xticks(range(X_train.shape[1]), feature_names[sorted_idx], rotation=90)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()

## XGBoost SHAP

In [None]:
# Create a SHAP explainer object using your trained XGBoost model
explainer = shap.Explainer(xgb)

# Calculate SHAP values for all features using the validation data
shap_values = explainer.shap_values(X_val)

# Plot the SHAP summary plot
shap.summary_plot(shap_values, X_val, plot_type="bar", color="purple")