# Overview

Anomaly detection aims to identify data points that differ significantly from the majority of the data. In this notebook the anomalies are fraudulent credit card transactions.

# Data Preprocessing and Cleaning

In [1]:
import pandas as pd

df = pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")

In [2]:
import numpy as np

# Detect outliers using Z-score for 'Amount'
z_scores = np.abs((df['Amount'] - df['Amount'].mean()) / df['Amount'].std())
outliers = np.where(z_scores > 3)[0]  # Set a threshold for outliers (usually z-score > 3)
print(f"Number of Outliers in 'Amount': {len(outliers)}")

Number of Outliers in 'Amount': 4076


In [3]:
# Convert 'Time' (in seconds) to an hour-of-day value (0-23) for analysis
df['Hour'] = (df['Time'] //  3600) % 24

# create a rolling mean to identify spikes in fraud
df['Fraud_Spike'] = df['Class'].rolling(window=1200).mean()  # Window size can be adjusted

# Bin 'Amount' into ranges so the share of fraudulent transactions per range can be compared
bins = [0, 50, 100, 200, 500, 1000, 5000, 10000, 50000]
labels = ['0-50', '51-100', '101-200', '201-500', '501-1000', '1001-5000', '5001-10000', '10001+']
# Right-closed bins match the labels (50 falls in '0-50'); include_lowest=True keeps zero-amount rows
df['Amount Range'] = pd.cut(df['Amount'], bins=bins, labels=labels, include_lowest=True)
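
Once amounts are binned, the fraud percentage per range follows from a group-by on the bin column. The following is a minimal sketch on a small made-up frame; the `demo` data, the shortened bin list, and the `fraud_pct` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Tiny made-up sample: 'Class' is 1 for fraud, 0 otherwise
demo = pd.DataFrame({
    "Amount": [10, 40, 75, 150, 150, 600],
    "Class":  [0, 1, 0, 0, 1, 0],
})

bins = [0, 50, 100, 200, 500, 1000]
labels = ['0-50', '51-100', '101-200', '201-500', '501-1000']
demo['Amount Range'] = pd.cut(demo['Amount'], bins=bins, labels=labels, include_lowest=True)

# The mean of a 0/1 column is the fraud rate; multiply by 100 for a percentage.
# observed=False keeps empty ranges in the result (they show up as NaN).
fraud_pct = demo.groupby('Amount Range', observed=False)['Class'].mean() * 100
print(fraud_pct)
```

On the real data, the same group-by would be applied to `df` before the `Amount Range` column is dropped.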

## Data Scaling

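Data scaling puts features on comparable ranges so that features with large magnitudes, such as `Amount`, do not dominate distance- or gradient-based models. It is a standard preprocessing step in machine learning projects. `RobustScaler` centers each feature on its median and divides by its interquartile range, which makes it less sensitive to outliers than scaling by mean and standard deviation.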
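
To make the scaling concrete, here is a small sketch that compares `RobustScaler` (with its default settings) against the formula (x - median) / IQR computed by hand. The `amounts` array is made-up data with one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up amounts; 100.0 is a deliberate outlier
amounts = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

# Default RobustScaler: subtract the median, divide by the 25th-75th percentile range
scaled = RobustScaler().fit_transform(amounts)

# The same computation by hand
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The outlier stays far from the rest after scaling, but it does not distort the center or spread used to scale the other values.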

In [4]:
from sklearn.preprocessing import RobustScaler

# Create an instance of RobustScaler
rob_scaler = RobustScaler()

# Scale 'Time', 'Amount', and 'Hour' (the scaler is refit for each column)
df['Time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1, 1))
df['Amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['Hour'] = rob_scaler.fit_transform(df['Hour'].values.reshape(-1, 1))

# Drop the 'Amount Range' and 'Fraud_Spike' columns, which were only added for the analysis
df.drop(['Amount Range', 'Fraud_Spike'], axis=1, inplace=True)

## Splitting the Dataset

We use stratified K-fold cross-validation so that each fold has a distribution of the target classes *y* similar to that of the full dataset. This matters for imbalanced datasets such as this one, where fraud accounts for about 0.17% of transactions.

In [5]:
from sklearn.model_selection import StratifiedKFold


X = df.drop('Class', axis=1)
y = df['Class']

# shuffle=False keeps rows in their original order, so random_state is not needed
skf = StratifiedKFold(n_splits=5, shuffle=False)

# Each iteration overwrites the previous split, so only the last fold is kept
for train_index, test_index in skf.split(X, y):
    original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
    original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]

# Check the distribution of the labels

# Turn into an array
original_Xtrain = original_Xtrain.values
original_Xtest = original_Xtest.values
original_ytrain = original_ytrain.values
original_ytest = original_ytest.values

# See if both the train and test label distribution are similarly distributed
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)

print('Label Distributions in the Dataset :')
print(' No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100,2), '%')
print(' Frauds', round(df['Class'].value_counts()[1]/len(df) * 100,2), '%')

print('Label Distributions in Train :')
print(" No Frauds " + str(round(train_counts_label[0]/ len(original_ytrain)*100, 2)) + " %")
print(" Frauds " + str(round(train_counts_label[1]/ len(original_ytrain)*100, 2)) + " %")

print('Label Distributions in Test :')
print(" No Frauds " + str(round(test_counts_label[0]/ len(original_ytest)*100, 2)) + " %")
print(" Frauds " + str(round(test_counts_label[1]/ len(original_ytest)*100, 2)) + " %")

Label Distributions in the Dataset :
 No Frauds 99.83 %
 Frauds 0.17 %
Label Distributions in Train :
 No Frauds 99.83 %
 Frauds 0.17 %
Label Distributions in Test :
 No Frauds 99.83 %
 Frauds 0.17 %
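
The check above only looks at the split from the last fold. As a rough sketch of how the class ratio could be verified for every fold, the example below runs `StratifiedKFold` on made-up labels with a 2% positive rate; `X_demo`, `y_demo`, and `fold_rates` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Made-up imbalanced labels: 20 positives out of 1000 samples (2%)
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1
rng.shuffle(y_demo)
X_demo = rng.normal(size=(1000, 3))

skf_demo = StratifiedKFold(n_splits=5)

# Record the positive rate of each test fold
fold_rates = []
for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    fold_rates.append(y_demo[test_idx].mean())

print(fold_rates)
```

Each test fold should carry the same share of positives as the full label vector, which is the property the stratified split is chosen for.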


# Classification

## SMOTE Technique

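SMOTE (Synthetic Minority Over-sampling Technique) addresses class imbalance by generating synthetic minority-class samples. Each new sample is interpolated between an existing minority sample and one of its nearest minority-class neighbors.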
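
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used by any library; `smote_sketch` and its parameters are hypothetical names, and in practice a maintained implementation such as the `SMOTE` class in imbalanced-learn would normally be used.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Hypothetical helper: build n_new synthetic points, each lying on the
    segment between a minority sample and one of its k nearest minority
    neighbours. Requires k < len(X_minority)."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every minority sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate with a random factor in [0, 1)
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Made-up minority class: 20 points in 2 dimensions
X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_sketch(X_min, n_new=100)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region already covered by the minority class. Oversampling is usually applied to the training folds only, so that validation data contains no synthetic samples.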

## RandomizedSearchCV

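RandomizedSearchCV is a hyperparameter optimization technique. It samples a fixed number of parameter settings from specified distributions and evaluates each one with cross-validation, rather than trying every combination as a grid search does.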
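
A minimal sketch of a randomized search follows. The model (`LogisticRegression`), the search range for `C`, the scoring metric, and the synthetic data are all assumptions chosen for illustration; the notebook does not specify them.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data (roughly 90% / 10%), standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # Sample the regularization strength C on a log scale
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,                       # number of sampled settings
    scoring="average_precision",     # assumed metric, suited to imbalanced classes
    cv=StratifiedKFold(n_splits=5),
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

`n_iter` controls the cost of the search directly: ten sampled settings with five folds means fifty model fits, regardless of how wide the distribution is.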

## Cross-Validation

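Cross-validation assesses the performance and generalizability of a machine learning model by repeatedly training on one part of the data and evaluating on a held-out part, then aggregating the scores.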
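
As a short illustration, `cross_val_score` runs this train-and-evaluate loop and returns one score per fold. The classifier, metric, and synthetic data below are placeholders chosen for the sketch, not choices taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data used only for this sketch
X_cv, y_cv = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# One ROC AUC score per fold; each fold is held out exactly once
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_cv,
    y_cv,
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
)

print(scores)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a rough sense of how much the estimate depends on which samples end up in the held-out fold.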

# Acknowledgements

* https://python.plainenglish.io/anomaly-detection-end-to-end-real-life-bank-card-fraud-detection-with-xgboost-2a343f761fa9
* https://medium.com/@onersarpnalcin/standardscaler-vs-minmaxscaler-vs-robustscaler-which-one-to-use-for-your-next-ml-project-ae5b44f571b9