# Anomaly Detection in Credit Card Fraud

Source: https://www.analyticsvidhya.com/blog/2023/05/anomaly-detection-in-credit-card-fraud/

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("creditcard.csv")

# Check the shape of the dataset
print("Shape of the dataset:", df.shape)

# Check the first few rows of the dataset
print(df.head())

Shape of the dataset: (284807, 31)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010  

**Handling Missing Values**

In [4]:
# Check if there are any missing values in the dataset

print(df.isnull().sum())

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64


**Scaling the data**

Anomaly detection algorithms can be sensitive to the scale of the data. Therefore, it is important to scale the data before applying the algorithm. We can use the StandardScaler class from the sklearn.preprocessing module to scale the data.
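To make the scaling step concrete, here is a minimal sketch of the transformation that StandardScaler applies to a single column: subtract the column mean and divide by the population standard deviation. The `amounts` values are made up for illustration and are not taken from creditcard.csv.

```python
import numpy as np

# Toy transaction amounts (hypothetical values, not from creditcard.csv)
amounts = np.array([1.0, 2.5, 10.0, 120.0, 2500.0])

# Standardization: subtract the mean and divide by the population standard
# deviation (ddof=0), which is what StandardScaler computes for each column
scaled = (amounts - amounts.mean()) / amounts.std()

print(scaled.round(3))
print("mean:", round(float(scaled.mean()), 6), "std:", round(float(scaled.std()), 6))
```

After scaling, the column has zero mean and unit variance, so a feature measured in large units such as Amount no longer dominates distance computations simply because of its scale.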

In [8]:
from sklearn.preprocessing import StandardScaler

# Scale the Amount column
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))

# Scale the Time column
df['Time'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))

# Check the first few rows of the dataset after scaling
print(df.head())

       Time        V1        V2        V3        V4        V5        V6  \
0 -1.996583 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388   
1 -1.996583  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361   
2 -1.996562 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499   
3 -1.996562 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203   
4 -1.996541 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921   

         V7        V8        V9  ...       V21       V22       V23       V24  \
0  0.239599  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928   
1 -0.078803  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846   
2  0.791461  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281   
3  0.237609  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575   
4  0.592941 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267   

        V25       V26       V27       V28    Amount  Class  
0  0.12

## **Isolation Forest**

Isolation Forest is a popular algorithm for anomaly detection that is based on the concept of decision trees. It builds an ensemble of trees that split the data on randomly chosen features and thresholds. Anomalies are easier to separate from the rest of the data, so they end up isolated on shorter paths from the root of each tree.
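The idea can be illustrated on a small synthetic example before moving to the real data. The sketch below assumes a hypothetical toy matrix `X_toy` with 200 clustered points and one planted outlier; it is only meant to show that the isolated point receives the lowest score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 "normal" points clustered near the origin plus one obvious outlier
# (synthetic data, used only to illustrate the idea)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])
X_toy = np.vstack([normal, outlier])

forest = IsolationForest(n_estimators=100, random_state=42).fit(X_toy)

# score_samples returns lower values for points that are more abnormal,
# i.e. points that are isolated after fewer random splits
scores = forest.score_samples(X_toy)

print("Score of the planted outlier:", scores[-1])
print("Median score of the normal points:", np.median(scores[:-1]))
print("Index of the lowest score:", scores.argmin())
```

The planted point sits far from the cluster, so a handful of random splits is enough to separate it, which is reflected in its low score.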

Let’s implement the Isolation Forest algorithm on our credit card fraud dataset.

In [11]:
from sklearn.ensemble import IsolationForest

# Create the Isolation Forest object; contamination=0.01 tells the model
# to flag roughly 1% of the rows as outliers
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.01,
                      max_features=1.0, random_state=42)

# Fit the data and tag the outliers
# Note: df still contains the Class label column at this point; dropping it
# first (df.drop(columns='Class')) would keep the labels out of the detector
clf.fit(df)

# Get the predictions (-1 = outlier, 1 = inlier)
y_pred = clf.predict(df)

# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())


Number of outliers: 2849


The Isolation Forest algorithm has flagged 2849 rows as anomalies. That is about 1% of the dataset, which is the share requested through contamination=0.01.

## **Local Outlier Factor**

Local Outlier Factor (LOF) is another popular algorithm for anomaly detection that is based on the concept of local density. It works by calculating the density of a data point relative to its neighbors and identifying points that have a much lower density than their neighbors as outliers.
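A small synthetic example shows how the density comparison plays out. The sketch below assumes a hypothetical toy matrix `X_toy` made of one dense cluster and a single isolated point; it is not the credit card data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# One dense cluster plus a single point far away from it (synthetic data)
dense = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
isolated = np.array([[6.0, 6.0]])
X_toy = np.vstack([dense, isolated])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_toy)  # -1 = outlier, 1 = inlier

# negative_outlier_factor_ is close to -1 for inliers and much lower for
# points whose local density is far below that of their neighbors
factors = lof.negative_outlier_factor_

print("Label of the isolated point:", labels[-1])
print("Its negative outlier factor:", factors[-1])
print("Median factor in the cluster:", np.median(factors[:-1]))
```

The neighbors of the isolated point all lie inside the dense cluster, so its local density is far lower than theirs and it is labelled -1.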

Let’s implement the LOF algorithm on the credit card fraud dataset.

In [17]:
from sklearn.neighbors import LocalOutlierFactor

# Create the LOF object; contamination=0.01 flags roughly 1% of the rows
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.01)

# Fit the data and tag the outliers (-1 = outlier, 1 = inlier)
y_pred = clf.fit_predict(df)

# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())


Number of outliers: 2849


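The LOF algorithm has also flagged 2849 rows, the same count as the Isolation Forest algorithm. This is expected: both models were given contamination=0.01, which fixes the share of rows labelled as outliers at about 1%. Matching counts therefore do not mean that the two algorithms flagged the same transactions.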

## **One-class SVM**

One-class SVM is another popular algorithm for anomaly detection that is based on the concept of maximum margin hyperplanes. It works by creating a hyperplane that separates the normal data points from the anomalies and identifying points that lie on the wrong side of the hyperplane as anomalies.

Let’s implement the One-class SVM algorithm on the credit card fraud dataset.

In [22]:
from sklearn.svm import OneClassSVM

# Create the One-class SVM object; nu=0.01 bounds the fraction of training
# points treated as outliers at about 1%
clf = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.01)

# Fit the data and tag the outliers
# Note: training a kernel SVM on all 284,807 rows can take a very long time
clf.fit(df)

# Get the predictions (-1 = outlier, 1 = inlier)
y_pred = clf.predict(df)

# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())

KeyboardInterrupt: 

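The run above was interrupted (KeyboardInterrupt) before training finished, so this notebook shows no output for this step. The original article reports that the One-class SVM algorithm detected 492 anomalies in the dataset.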
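The KeyboardInterrupt in the output above reflects a practical limit: the training time of a kernel SVM typically grows much faster than linearly with the number of rows. One common workaround is to fit on a random subsample and then score the full dataset. The sketch below is a hedged illustration on synthetic data. `X_full`, the subsample size of 2,000, and gamma="scale" (used instead of the gamma=0.001 above) are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled feature matrix
# (with the real data this would be df without the Class column)
X_full = rng.normal(size=(20_000, 5))

# Fit on a random subsample to keep training time manageable
sample_idx = rng.choice(len(X_full), size=2_000, replace=False)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
ocsvm.fit(X_full[sample_idx])

# Score every row with the model trained on the subsample
y_pred_svm = ocsvm.predict(X_full)  # -1 = outlier, 1 = inlier
n_outliers = int((y_pred_svm == -1).sum())

print("Number of outliers:", n_outliers)
print("Share of rows flagged:", n_outliers / len(X_full))
```

The trade-off is that the decision boundary is learned from fewer points, so the share of flagged rows on the full dataset can differ somewhat from nu.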

## **Evaluation and Model Selection**

In [29]:
from sklearn.model_selection import train_test_split


# Define X and y
X = df.drop('Class', axis=1)
y = df['Class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [31]:
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Parameter grids for each classifier
# Note: the default lbfgs solver of LogisticRegression does not support the
# 'l1' penalty, so those fits fail and are scored as NaN (see the warning in
# the output); LogisticRegression(solver='liblinear') accepts both penalties
lr_params = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}
dt_params = {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 7]}

# Pair each classifier with its own parameter grid
models = [
    (LogisticRegression(), lr_params),
    (DecisionTreeClassifier(), dt_params),
]

# Run a 5-fold grid search for each classifier and report test-set performance
for classifier, params in models:
    clf = GridSearchCV(classifier, params, cv=5)
    clf.fit(X_train, y_train)
    print(classifier.__class__.__name__)
    print(clf.best_params_)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))

15 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Tooter\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Tooter\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tooter\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LogisticRegression
{'C': 10, 'penalty': 'l2'}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85307
           1       0.87      0.64      0.74       136

    accuracy                           1.00     85443
   macro avg       0.93      0.82      0.87     85443
weighted avg       1.00      1.00      1.00     85443

DecisionTreeClassifier
{'criterion': 'entropy', 'max_depth': 5}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85307
           1       0.91      0.79      0.85       136

    accuracy                           1.00     85443
   macro avg       0.95      0.90      0.92     85443
weighted avg       1.00      1.00      1.00     85443



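In this code, we tune two models, Logistic Regression and Decision Tree, with a grid search and evaluate each tuned model on the held-out test set. Because the estimators are classifiers and cv=5 is an integer, GridSearchCV splits the training data with stratified 5-fold cross-validation, which keeps the proportion of fraud cases roughly the same in each fold. The warning in the output comes from the Logistic Regression grid: the default lbfgs solver does not support the l1 penalty, so the 15 fits that use it (3 values of C × 5 folds) fail and only the l2 candidates are compared. Note that GridSearchCV ranks candidates by accuracy here, since no scoring argument is passed. Accuracy is a weak criterion for a dataset this imbalanced; a metric such as average precision (scoring='average_precision') is better suited. In the classification reports, the Decision Tree achieves the higher fraud-class recall (0.79 versus 0.64) and F1 score (0.85 versus 0.74).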
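A stratified 5-fold search that selects models by average precision can be sketched as follows. This is an illustration on synthetic data from make_classification rather than the credit card dataset. `X_demo`, `y_demo`, the shuffled StratifiedKFold, and the liblinear solver (which accepts both the l1 and l2 penalties) are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 5% positives) standing in for X_train, y_train
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratified folds keep the share of positives similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# liblinear supports both penalties, so none of the candidate fits fail
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
    scoring="average_precision",  # rank candidates by average precision
    cv=cv,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best mean average precision:", round(search.best_score_, 3))
```

With scoring set explicitly, best_params_ reflects how well each candidate ranks the rare positive class rather than how often it predicts the majority class correctly.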

## **Model Deployment**

The final step in the machine learning pipeline is to deploy the selected model to make predictions on new data. In this step, we will use the selected model to make predictions on the test dataset and evaluate its performance using classification metrics.

We will use the predict method of the trained model to make predictions on the test data, and then evaluate the model’s performance using the accuracy_score, precision_score, recall_score, and f1_score metrics from the sklearn.metrics module.

The code for this step is as follows:

In [None]:
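from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumption: the original notebook never defines rf_model, so a random forest
# with illustrative settings is fit on the training split here
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# make predictions on the test set
y_pred = rf_model.predict(X_test)

# evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# print the classification metrics
print(f"Accuracy: {acc}")
print(f"Precision: {prec}")
print(f"Recall: {rec}")
print(f"F1 Score: {f1}")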


In this code, we first use the predict method of the trained rf_model to make predictions on the test set X_test. We then evaluate the model’s performance using the accuracy_score, precision_score, recall_score, and f1_score metrics. Finally, we print the classification metrics to the console.

Note that these metrics come from the sklearn.metrics module and must be imported before use. They help us to evaluate the performance of the model and make informed decisions about its suitability for deployment.
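The numbers in the classification reports above show why accuracy alone says little on this dataset. The short calculation below uses only the support counts and the fraud-class precision and recall printed in those reports. The helper name `f1` is chosen here for illustration.

```python
# Support counts taken from the classification reports above
n_legit, n_fraud = 85307, 136

# A model that labels every transaction as legitimate still looks accurate
baseline_accuracy = n_legit / (n_legit + n_fraud)
print(f"All-legitimate baseline accuracy: {baseline_accuracy:.4f}")


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Fraud-class precision and recall as reported for each model
print(f"Logistic Regression F1: {f1(0.87, 0.64):.2f}")
print(f"Decision Tree F1: {f1(0.91, 0.79):.2f}")
```

A classifier that never predicts fraud would reach an accuracy of about 0.9984 on this test split, which is why precision, recall, and F1 on the fraud class are the more informative figures.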

## **Conclusion**

In this article, we have discussed the concept of anomaly detection and various algorithms that can be used to detect anomalies in a dataset. We have also implemented some of these algorithms in Python and applied them to a credit card fraud dataset to detect anomalies. It is important to note that the choice of algorithm and the preprocessing techniques depend on the nature of the data and the problem at hand.

**Key Takeaways for Anomaly Detection in credit card fraud**

- Anomaly detection is used to detect unusual data points or patterns in a dataset and can be applied in various fields such as finance, healthcare, and cybersecurity.
- The choice of algorithm and preprocessing techniques should be based on the nature of the data and the problem at hand.
- The isolation forest algorithm builds an ensemble of randomly split trees and flags the points that are isolated on short paths. It is effective in detecting point anomalies and can be a suitable option for anomaly detection in some cases.
- Preprocessing techniques such as scaling and feature selection can improve the accuracy of the model. They should be considered when implementing anomaly detection.
- As the amount of data continues to grow, stay up-to-date with the latest algorithms and techniques to improve the accuracy and effectiveness of anomaly detection methods.