Dataset: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

The provided code is a Python script that performs the following tasks:

In [1]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
# Plotting imports are disabled; the plotting code further down is commented out as well
# import scikitplot as skplt
# import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


1. **Data Loading and Preprocessing:**
   - Imports necessary libraries such as `pandas` for data manipulation and `train_test_split` from `sklearn.model_selection` for splitting the dataset.
   - Defines a function `load_data(file_path)` to load the dataset from a CSV file using pandas.
   - Defines a function `normalize_data(data)` that standardizes the `Amount` column with `StandardScaler`, stores the result as `normalized_amount`, and drops the original `Time` and `Amount` columns.

In [2]:
def load_data(file_path):
    return pd.read_csv(file_path)

In [3]:
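def normalize_data(data):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    # Standardize the transaction amount to zero mean and unit variance
    scaler = StandardScaler()
    data['normalized_amount'] = scaler.fit_transform(data['Amount'].values.reshape(-1, 1))
    # Drop the raw 'Amount' column and 'Time', which is not used as a feature
    return data.drop(['Time', 'Amount'], axis=1)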
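Note that `normalize_data` fits the scaler on the full dataset before the train/test split, so statistics from the test rows influence the scaling. The sketch below shows one possible alternative that fits the scaler on the training rows only. The helper name `scale_amount_after_split` and the toy DataFrame are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def scale_amount_after_split(X_train, X_test, column="Amount"):
    """Hypothetical helper: fit the scaler on the training rows only."""
    X_train = X_train.copy()
    X_test = X_test.copy()
    scaler = StandardScaler()
    # Learn the mean and standard deviation from the training split only
    X_train[column] = scaler.fit_transform(X_train[[column]]).ravel()
    # Reuse the training statistics to transform the test split
    X_test[column] = scaler.transform(X_test[[column]]).ravel()
    return X_train, X_test, scaler


# Tiny made-up frame standing in for the real dataset
toy = pd.DataFrame({
    "V1": [0.1, -0.3, 0.5, 0.2, -0.8, 1.1, 0.0, -0.4],
    "Amount": [10.0, 250.0, 3.5, 99.0, 1200.0, 45.0, 7.0, 60.0],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)
toy_train_s, toy_test_s, fitted_scaler = scale_amount_after_split(toy_train, toy_test)
print(toy_train_s["Amount"].mean())  # close to zero by construction
```

With this approach the test split is transformed with the training mean and variance, so its scaled values are generally not centered at exactly zero.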

2. **Classifier Evaluation:**
   - Defines a function `evaluate_classifier()` that fits a single classifier on the training set, predicts on the test set, prints the confusion matrix and classification report, and returns the fitted classifier.
   - The classifiers evaluated are Logistic Regression, Random Forest, Gaussian Naive Bayes, two Decision Trees (Gini and entropy criteria), K-Nearest Neighbors, and a Support Vector Machine (SVM).

In [4]:
def evaluate_classifier(classifier, X_train, y_train, X_test, y_test, name):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(f"Classifier: {name}")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

    # Use the probability estimates for plotting precision-recall curve
    # y_probas = classifier.predict_proba(X_test)
    # plot_precision_recall_curve(y_test, y_probas, title='Precision-Recall curve for ' + name)

    # plt.show()
    return classifier
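The commented-out lines in `evaluate_classifier` point toward a probability-based evaluation. As a rough sketch of what that could look like without any plotting, the example below computes a precision-recall curve and the average precision from `predict_proba` scores. The synthetic data from `make_classification` and all variable names are assumptions for illustration only; the real notebook would pass its own `X_test` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the fraud dataset (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
# Probability of the positive class (column 1), not the hard 0/1 prediction
pos_scores = clf.predict_proba(Xte)[:, 1]

# Precision and recall at every score threshold, plus a single summary number
precision, recall, thresholds = precision_recall_curve(yte, pos_scores)
ap = average_precision_score(yte, pos_scores)
print(f"Average precision: {ap:.3f}")
```

Average precision summarizes the whole precision-recall curve, which tends to be more informative than accuracy when the positive class is rare.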

In [5]:
"""
def plot_roc_for_all(classifiers, X_test, y_test):
    plt.figure()
    for name, classifier in classifiers.items():
        plot_roc_curve(classifier, X_test, y_test, title=f'ROC curve for {name}')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # Random guess line
    plt.title('Receiver Operating Characteristic (ROC) curve')
    plt.show()
"""

3. **Main Function:**
   - Loads the dataset from the `creditcard.csv` file and normalizes it.
   - Splits the dataset into training and testing sets using an 80:20 ratio with a fixed `random_state`.
   - Initializes the classifiers, setting a few hyperparameters explicitly (for example `max_depth=5` for the decision trees) and leaving the rest at their defaults.
   - Iterates over the classifiers, evaluates each one, and stores the trained classifiers in a dictionary.

In [6]:
def main():
    # Load the dataset
    file_path = '/content/drive/MyDrive/Colab Notebooks/Neuronexus Innovations/NeuroNexus Innovations - Data Science/Credit Card Fraud Detection/creditcard.csv'
    data = load_data(file_path)

    # Normalization
    data = normalize_data(data)

    # Separate features (X) from the target label (y); class imbalance is not handled here
    X = data.drop('Class', axis=1)
    y = data['Class']

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize classifiers with hyperparameters
    classifiers = {
        'Logistic Regression': LogisticRegression(),
        'Random Forest': RandomForestClassifier(n_estimators=100),
        'Gaussian Naive Bayes': GaussianNB(),
        'Decision Tree (Gini)': DecisionTreeClassifier(criterion="gini", max_depth=5),
        'Decision Tree (Entropy)': DecisionTreeClassifier(criterion="entropy", max_depth=5),
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'SVM': SVC(kernel='rbf', probability=True)
    }

    # Iterate over each classifier and evaluate
    trained_classifiers = {}
    for name, classifier in classifiers.items():
        trained_classifiers[name] = evaluate_classifier(classifier, X_train, y_train, X_test, y_test, name)

    # Plotting ROC curve for all classifiers
    # plot_roc_for_all(trained_classifiers, X_test, y_test)
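The split in `main()` does not pass `stratify`, so the number of fraud cases that end up in the test set depends on the random shuffle. The sketch below compares a plain split with a stratified one on made-up labels (10 positives in 1,000 rows); the arrays and variable names are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 10 positives among 1,000 rows (1% positive rate)
y_toy = np.array([1] * 10 + [0] * 990)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Plain split: the positive count in the test set depends on the shuffle
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
# Stratified split: the class ratio is preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positives in test (plain):", y_te_plain.sum())
print("positives in test (stratified):", y_te_strat.sum())
```

Applying `stratify=y` to the notebook's split would change which rows land in the test set, so the printed results would no longer match the output recorded here.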

4. **Note:**
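   - The commented-out sections for plotting ROC and precision-recall curves with scikit-plot and matplotlib are not used. The scikit-learn helpers `plot_roc_curve` and `plot_precision_recall_curve` named in the commented import were removed in scikit-learn 1.2 in favor of `RocCurveDisplay` and `PrecisionRecallDisplay`.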

5. **Execution:**
   - Executes the `main()` function when the script is run directly.

In [7]:
if __name__ == "__main__":
    main()

Classifier: Logistic Regression
Confusion Matrix:
 [[56855     9]
 [   41    57]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.86      0.58      0.70        98

    accuracy                           1.00     56962
   macro avg       0.93      0.79      0.85     56962
weighted avg       1.00      1.00      1.00     56962

Classifier: Random Forest
Confusion Matrix:
 [[56862     2]
 [   20    78]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.97      0.80      0.88        98

    accuracy                           1.00     56962
   macro avg       0.99      0.90      0.94     56962
weighted avg       1.00      1.00      1.00     56962

Classifier: Gaussian Naive Bayes
Confusion Matrix:
 [[55608  1256]
 [   18    80]]
Classification Report:
               precision    recall

Overall, this script loads and preprocesses the Kaggle credit card fraud dataset, then trains several classifiers and evaluates each of them on a held-out test set.
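The per-class numbers in the classification report follow directly from the confusion matrix. As a worked example, the snippet below recomputes the class 1 metrics for Logistic Regression from the matrix printed above and reproduces the reported precision of 0.86, recall of 0.58, and F1 score of 0.70. It also computes the accuracy of a trivial model that always predicts class 0, which shows why the reported accuracy of 1.00 says little on data this imbalanced. The variable names are illustrative.

```python
import numpy as np

# Logistic Regression confusion matrix copied from the printed output
# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = np.array([[56855, 9],
               [41, 57]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()
# Accuracy of a trivial model that always predicts class 0
baseline_accuracy = (tn + fp) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} baseline={baseline_accuracy:.4f}")
```

The gap between the model's accuracy and the always-negative baseline is under 0.1 percentage points, so precision and recall on class 1 are the more meaningful figures to compare across classifiers.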