<a href="https://colab.research.google.com/github/OsirisEscaL/Machine_Learning/blob/main/Building_a_Model_for_Credit_Card_Fraud_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Model for Credit Card Fraud Detection Using Scikit-Learn's Anomaly Detection Algorithms

Credit cards are used worldwide, which makes them a prime target for fraudsters. Credit card fraud causes monetary losses for both consumers and businesses. To combat this problem, machine learning techniques, particularly anomaly detection algorithms, are used to flag fraudulent transactions. In this article, we build a model to detect credit card fraud using Scikit-Learn's anomaly detection algorithms. Because fraud is rare, we also address how to deal with imbalanced datasets.

**Understanding Anomaly Detection**

Anomaly detection is a subfield of machine learning that identifies data points or patterns that deviate from expected behavior. In credit card fraud detection, an anomaly is a transaction that differs significantly from the norm. Supervised classifiers can struggle with this task because fraudulent transactions are so rare compared to legitimate ones. Scikit-Learn offers several unsupervised anomaly detection algorithms, including Isolation Forest and One-Class SVM, which do not need labels for training and are therefore a natural fit for heavily imbalanced datasets.
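As a minimal sketch of what "deviating from the norm" looks like in code, the example below fits an Isolation Forest to synthetic two-dimensional data. The data, the names `X_demo` and `detector`, and the parameter values are assumptions chosen for illustration, not part of the Kaggle dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for transaction features: 500 "normal" points near the
# origin plus 5 points placed far away (illustrative data only).
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X_demo = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X_demo)

labels = detector.predict(X_demo)            # +1 = inlier, -1 = anomaly
scores = detector.decision_function(X_demo)  # negative = anomaly

# A point is labelled -1 exactly when its decision score is negative.
flagged = np.where(labels == -1)[0]
print("indices flagged as anomalies:", flagged)
```

Note the sign convention: Scikit-Learn's outlier detectors use `-1` and negative scores for anomalies, which is the opposite of the dataset's `Class` label, where `1` means fraud.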

**Dataset**

We will use the Credit Card Fraud Detection Dataset from [Kaggle](https://www.kaggle.com/datasets/isaikumar/creditcardfraud) for this project. The dataset includes the transaction amount (`Amount`), a timestamp (`Time`), anonymized features (`V1` to `V28`), and a binary `Class` label indicating whether the transaction is fraudulent (1) or legitimate (0).

**Step 1: Importing Essential Libraries**

The first step is to import the Python libraries the project needs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.metrics import classification_report
```

**Step 2: Loading and Preprocessing the Dataset**

Once the dataset has been downloaded and extracted, load it and inspect the first rows.

```python
# Load the dataset
data = pd.read_csv('creditcard.csv')
data.head()
```

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


```python
# Split the data into features and labels
X = data.drop('Class', axis=1)
y = data['Class']

# Split the data into training and testing sets, stratifying on the 'Class' label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize the features (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
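To see why stratifying on the label matters when one class is rare, here is a small sketch on made-up labels. The arrays `X_demo` and `y_demo` and the 0.2% fraud rate are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 10,000 transactions, 20 of them fraudulent (0.2%).
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = np.arange(10_000).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y_demo, both splits keep the same 0.2% fraud rate.
print("fraud in train:", y_tr.sum(), "of", len(y_tr))  # 16 of 8000
print("fraud in test: ", y_te.sum(), "of", len(y_te))  # 4 of 2000
```

Without `stratify`, a random split of such rare labels could leave the test set with very few positive cases, or none, which would make precision and recall estimates unreliable.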

**Step 3: Training the Models**

Two of the anomaly detection algorithms that Scikit-Learn provides are Isolation Forest and One-Class SVM. We'll use both to detect rare anomalies in our imbalanced dataset.

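```python
# Initialize and train the Isolation Forest model
# (contamination is the expected fraction of anomalies in the training data)
iso_forest = IsolationForest(contamination=0.01, random_state=42)
iso_forest.fit(X_train)

# Initialize and train the One-Class SVM model
# (nu is an upper bound on the fraction of training points treated as outliers;
# kernel SVMs scale poorly with sample count, so this fit can be slow)
svm = OneClassSVM(nu=0.01)
svm.fit(X_train)
```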

**Step 4: Evaluating Model Performance**

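It is essential to evaluate how well the models perform. The key consideration for credit card fraud detection is the precision-recall trade-off. High precision (few legitimate transactions among those flagged) is required to avoid inconveniencing legitimate cardholders, whereas high recall (few missed frauds) is required to catch as many fraudulent transactions as possible. Accuracy alone is misleading here: with only 98 frauds among the 56,962 test transactions, a model that never flags anything would still be right more than 99% of the time.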
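The two metrics can be worked through on a toy example. The labels below are made up for illustration, with 1 meaning fraud and 0 meaning legitimate.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 6 legitimate and 4 fraudulent transactions (made-up values).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 3 / 5
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 3 / 4

print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Here two legitimate transactions are flagged by mistake (lowering precision) and one fraud is missed (lowering recall).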

**IsolationForest algorithm**

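To use IsolationForest for binary classification, you must set a prediction threshold on the anomaly scores. `decision_function` returns lower scores for more abnormal points, and negative scores mark the points the model considers outliers. Transactions scoring below the threshold are therefore labeled as fraud (1), and those at or above it as legitimate (0). You can adjust this threshold according to your specific problem and performance requirements.

```python
# Predict anomaly scores on test data (lower = more anomalous)
iso_scores = iso_forest.decision_function(X_test)

# Define a threshold (0.0 is the model's own inlier/outlier boundary; adjust as needed)
threshold = 0.0

# Convert anomaly scores to binary predictions: 1 = fraud (anomaly), 0 = legitimate
iso_forest_preds = (iso_scores < threshold).astype(int)

# Model evaluation
print("Isolation Forest:")
print(classification_report(y_test, iso_forest_preds))
```

Comparing in the opposite direction (`iso_scores > threshold`) flips every prediction: on the test split of 56,864 legitimate and 98 fraudulent transactions it yields an accuracy of 0.01 and a fraud-class recall of 0.34. Labeling low scores as fraud gives the complementary figures: an accuracy of about 0.99, a fraud-class recall of about 0.66, and a fraud-class precision of about 0.11, so roughly nine in ten flagged transactions are still legitimate.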



**One-Class SVM**

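As with Isolation Forest, you must set a prediction threshold on the One-Class SVM anomaly scores to obtain binary predictions. The sign convention is the same: negative `decision_function` values mark outliers, so scores below the threshold are labeled as fraud (1).

```python
# Predict anomaly scores on test data (lower = more anomalous)
svm_scores = svm.decision_function(X_test)

# Define a threshold (0.0 is the model's own inlier/outlier boundary; adjust as needed)
threshold = 0.0

# Convert anomaly scores to binary predictions: 1 = fraud (anomaly), 0 = legitimate
svm_preds = (svm_scores < threshold).astype(int)

# Model evaluation
print("One-Class SVM:")
print(classification_report(y_test, svm_preds))
```

Comparing in the opposite direction (`svm_scores > threshold`) flips every prediction: on the test split of 56,864 legitimate and 98 fraudulent transactions it yields an accuracy of 0.01 and a fraud-class recall of 0.43. Labeling low scores as fraud gives the complementary figures: an accuracy of about 0.99, a fraud-class recall of about 0.57, and a fraud-class precision of about 0.08.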
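An alternative to thresholding `decision_function` by hand is to call `predict` and remap its output. The sketch below does this on a small synthetic dataset. The data, the names `svm_demo` and `pred`, and the choice of `gamma="scale"` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)

# Synthetic data: 1,000 "legitimate" points and 10 shifted "fraud" points.
legit = rng.normal(0, 1, size=(1000, 3))
fraud = rng.normal(5, 1, size=(10, 3))
X_demo = np.vstack([legit, fraud])
y_demo = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])

svm_demo = OneClassSVM(nu=0.01, gamma="scale").fit(X_demo)

# predict() returns +1 for inliers and -1 for outliers; remap to the
# dataset convention where 1 = fraud and 0 = legitimate.
pred = np.where(svm_demo.predict(X_demo) == -1, 1, 0)

print(classification_report(y_demo, pred, zero_division=0))
```

The remapping step is the important part: passing the raw `-1`/`+1` output to `classification_report` against `0`/`1` labels would compare mismatched label sets.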



Depending on your needs, you may have to tune hyperparameters such as `contamination` (Isolation Forest) and `nu` (One-Class SVM), as well as the decision threshold, to reach an acceptable balance between precision and recall for your use case.
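One simple way to explore this trade-off is to refit the model over a few candidate settings and compare the metrics. The sketch below does so for `contamination` on synthetic data. The candidate values, the data, and the names `results`, `X_demo`, and `y_demo` are assumptions for illustration, and in practice the comparison should be made on a held-out validation set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(3)

# Synthetic data: 3,000 "legitimate" points and 15 shifted "fraud" points.
legit = rng.normal(0, 1, size=(3000, 4))
fraud = rng.normal(3, 1, size=(15, 4))
X_demo = np.vstack([legit, fraud])
y_demo = np.concatenate([np.zeros(3000, dtype=int), np.ones(15, dtype=int)])

results = []
for contamination in (0.002, 0.005, 0.01, 0.02):
    model = IsolationForest(contamination=contamination, random_state=42).fit(X_demo)
    pred = (model.decision_function(X_demo) < 0).astype(int)  # 1 = flagged as fraud
    p = precision_score(y_demo, pred, zero_division=0)
    r = recall_score(y_demo, pred, zero_division=0)
    results.append((contamination, p, r))
    print(f"contamination={contamination:<6} precision={p:.2f} recall={r:.2f}")
```

Raising `contamination` flags more points, so recall can only stay the same or increase, typically at the cost of precision.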

**Conclusion**

Scikit-Learn's anomaly detection algorithms offer a practical starting point for fraud detection on imbalanced datasets where anomalies are rare. By working with real data and following best practices for data preprocessing and model evaluation, it is possible to develop a system that helps protect cardholders and financial institutions from fraudulent activity.