# Fraud Detection

Fraudulent activity has permeated multiple sectors, from e-commerce and healthcare to banking and payment systems. This
illicit industry amasses billions every year and is on an upward trajectory. The 2018 global economic crime survey by
PwC verifies this assertion, revealing that 49 percent of the 7,200 enterprises surveyed had fallen prey to some form of
fraudulent conduct.

<figure>
  <img src="images/artboard.png" alt="fraud-detection-banking" style="width:100%">
</figure>

Although fraud poses a serious risk to businesses, systems such as rule engines and machine learning models give us
the tools to detect and prevent it. In this notebook, we demonstrate how a machine learning system helps us achieve
this.

At its core, a rules engine is a sophisticated software system that enforces one or more business rules in a real-time
production environment. More often than not, these rules are the crystallization of hard-earned insights gleaned from
domain experts. For instance, we could establish rules limiting the number of transactions in a given time frame, and
blocking transactions that originate from previously identified fraudulent IPs and/or domains. Such rules prove highly
effective in detecting certain types of fraud, yet they are not without their limitations. Rules with predefined
threshold values may give rise to false positives or false negatives. To illustrate, imagine a rule that rejects any
transaction exceeding \\$10,000 for a particular user. A seasoned fraudster might exploit this by staying one step
ahead, consciously making a transaction slightly below this threshold (for instance, \\$9,999), thereby evading
detection.
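
The threshold problem described above can be sketched in a few lines. This is only an illustration, not a production
rules engine: the `Transaction` type, the `evaluate_rules` function, the blocked IP address, and the limits of 10,000
per transaction and five transactions per hour are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical limits chosen for illustration only.
AMOUNT_LIMIT = 10_000.0
MAX_TX_PER_HOUR = 5


@dataclass
class Transaction:
    customer: str
    amount: float
    source_ip: str
    recent_tx_count: int  # transactions by this customer in the last hour


def evaluate_rules(tx, blocked_ips):
    """Return the names of all rules the transaction violates."""
    violations = []
    if tx.amount > AMOUNT_LIMIT:
        violations.append("amount_limit")
    if tx.recent_tx_count > MAX_TX_PER_HOUR:
        violations.append("velocity_limit")
    if tx.source_ip in blocked_ips:
        violations.append("blocked_ip")
    return violations


blocked = {"203.0.113.7"}
over_limit = Transaction("C1", 10_500.0, "198.51.100.2", 1)
just_under = Transaction("C1", 9_999.0, "198.51.100.2", 1)

print(evaluate_rules(over_limit, blocked))  # flagged by the amount rule
print(evaluate_rules(just_under, blocked))  # passes every rule
```

The second transaction differs from the first only in its amount, yet it slips past every rule, which is exactly the
weakness of fixed thresholds.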

This is where machine learning comes to the rescue. A model can learn patterns that fixed thresholds miss, which helps
reduce both the risk of fraud and the potential financial losses to businesses. Combining machine learning with a
rules-based system makes fraud detection more precise and reliable. In this notebook, you will inspect fraudulent
transactions using the Banksim dataset, a synthetic dataset that combines customer payments made at different
intervals and in varying amounts. The goal is to show how you can detect and curtail fraudulent activity with high
accuracy.

In this notebook, you will go through the steps below:

1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
1. [Data Preprocessing](#Data-Preprocessing)
1. [Oversampling with SMOTE](#Oversampling-with-SMOTE)
1. [K-Neighbours Classifier](#K-Neighbours-Classifier)
1. [Random Forest Classifier](#Random-Forest-Classifier)
1. [XGBoost Classifier](#XGBoost-Classifier)
1. [Logistic Regression Classifier](#Logistic-Regression-Classifier)
1. [Model Deployment](#Model-Deployment)
1. [Prediction](#Prediction)

Let's begin!

## Experiment Overview

First, let's go through the overview of this experiment. Below you can see the steps involved in this experiment:

- Import all required packages, define helper functions, and initialize global variables.
- Process and validate the data.
- Initiate the model training process and retrieve the best performing model.
- Deploy the model using KServe.
- Transform the notebook into a Kubeflow pipeline using Kale.

# Imports & Initialization

In this section, you import all the necessary packages that are required for your analysis. These packages provide the
tools and functionalities needed to effectively process the data, train your machine learning model, and evaluate its
performance.

In [None]:
import os
import json
import pickle
import requests
import subprocess

import boto3
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.over_sampling import SMOTE
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Suppress warnings
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

# Set seaborn style
sns.set()

In [None]:
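# Read the namespace of the current pod; the context manager closes the file handle
with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
    NAMESPACE = f.read().strip()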

config = {
    "MINIO_HOST_URL": os.environ["MLFLOW_S3_ENDPOINT_URL"],
    "MINIO_ACCESS_KEY": os.environ["AWS_ACCESS_KEY_ID"],
    "MINIO_SECRET_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
    "KSERVE_MODEL_NAME": "fraud-detection",
    "CURRENT_USER": NAMESPACE,
    "NAMESPACE": NAMESPACE,
    "BUCKET": "experiments",
    "SOURCE_PATH": "dataset/feed.csv",
    "SERVICE_ACCOUNT": "kserve-minio-sa",
    "PROTOCOL_VERSION": "v2"
}

In [None]:
client = boto3.client(
    service_name="s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    endpoint_url=os.environ["MLFLOW_S3_ENDPOINT_URL"],
    verify=False)

bucket_name = config.get("BUCKET")
buckets = client.list_buckets()

if not any(bucket['Name'] == bucket_name for bucket in buckets['Buckets']):
    client.create_bucket(Bucket=config.get("BUCKET"))

if os.path.exists("dataset"):
    train_dataset = os.path.join("dataset", "feed.csv")

    client.upload_file(Filename=train_dataset, 
                       Bucket=config.get("BUCKET"), 
                       Key=f"{train_dataset}")

# Define Helper Functions

In this section, you define two helper functions: `init_minio_client`, which creates a client for the object store,
and `plot_roc_auc`, which plots the Receiver Operating Characteristic (ROC) curve and reports the Area Under the Curve
(AUC). The ROC curve is a diagnostic tool for assessing the performance of binary classification models. It plots the
True Positive Rate (TPR), or sensitivity, on the $y$-axis against the False Positive Rate (FPR), or 1-specificity, on
the $x$-axis. Both rates range between `0` and `1`. Each point on the ROC curve represents a sensitivity/specificity
pair corresponding to a particular decision threshold.

The Area Under the ROC Curve (AUC-ROC) is a single scalar value that aggregates the performance of the classifier over
all possible thresholds, providing a measure of the model's ability to distinguish between positive and negative
classes. Here's a breakdown of what the AUC-ROC value means:

- An AUC of `1.0` indicates that the classifier ranks the classes perfectly - some threshold achieves a 100% true
  positive rate with a 0% false positive rate.
- An AUC of `0.5` suggests that the classifier is no better than random guessing - a randomly chosen positive sample
  receives a higher score than a randomly chosen negative sample only half of the time.
- An AUC of less than `0.5` implies that the classifier is worse than random guessing - it systematically ranks
  negatives above positives, as if the model were "learning" to make the wrong predictions.

By visualizing the ROC curve, you can better understand the trade-off between sensitivity (the ability to correctly
classify true positives) and specificity (the ability to correctly classify true negatives). This helps in selecting the
optimal threshold that balances both metrics according to the specific needs of a given application.
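
One way to build intuition for AUC is to compute it directly as the fraction of (positive, negative) pairs in which
the positive sample receives the higher score. The sketch below does exactly that; `roc_auc_by_ranking` is a
hypothetical helper written for this illustration, and its quadratic loop is meant for small examples rather than
for real datasets.

```python
def roc_auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
print(roc_auc_by_ranking(labels, [0.1, 0.2, 0.8, 0.9]))   # 1.0: every fraud outranks every benign
print(roc_auc_by_ranking(labels, [0.9, 0.8, 0.2, 0.1]))   # 0.0: the ranking is fully inverted
print(roc_auc_by_ranking(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75: three of four pairs are correct
```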

In [None]:
def init_minio_client():
    client = boto3.client(
        service_name="s3",
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        endpoint_url=os.environ["MLFLOW_S3_ENDPOINT_URL"],
        verify=False)
    return client


def plot_roc_auc(y_test, preds):
    """Plot the Receiver Operating Characteristic (ROC) curve."""
    fpr, tpr, threshold = roc_curve(y_test, preds)
    roc_auc = auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

# Exploratory Data Analysis

In this section, you undertake a detailed Exploratory Data Analysis (EDA) of the dataset, aiming to uncover critical
insights that could inform your subsequent analysis. The initial snapshot of the dataset, as seen below, reveals nine
feature columns and a target column. The feature columns are as follows:

<table style="width:70%">
  <tr>
    <th style="text-align:left">Feature</th>
    <th style="text-align:left">Description</th>
  </tr>
  <tr>
    <td style="text-align:left">Step</td>
    <td style="text-align:left">
        Represents the simulation day. The simulation ran for a total of 180 steps, equivalent to six months.
    </td>
  </tr>
  <tr>
    <td style="text-align:left">Customer</td>
    <td style="text-align:left">Denotes the unique ID assigned to each customer.</td>
  </tr>
  <tr>
    <td style="text-align:left">zipcodeOri</td>
    <td style="text-align:left">Represents the originating or source zip code.</td>
  </tr>
  <tr>
    <td style="text-align:left">Merchant</td>
    <td style="text-align:left">Specifies the unique ID of the merchant.</td>
  </tr>
  <tr>
    <td style="text-align:left">zipMerchant</td>
    <td style="text-align:left">Represents the zip code associated with the merchant.</td>
  </tr>
  <tr>
    <td style="text-align:left">Age</td>
    <td style="text-align:left">
        Classifies the age of customers:
        <ul>
            <li>0: <= 18 years</li>
            <li>1: 19-25 years</li>
            <li>2: 26-35 years</li>
            <li>3: 36-45 years</li>
            <li>4: 46-55 years</li>
            <li>5: 56-65 years</li>
            <li>6: > 65 years</li>
            <li>U: Unknown</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td style="text-align:left">Gender</td>
    <td style="text-align:left">
        Indicates the gender of the customer:
        <ul>
            <li>E : Enterprise</li>
            <li>F: Female</li>
            <li>M: Male</li>
            <li>U: Unknown</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td style="text-align:left">Category</td>
    <td style="text-align:left">Denotes the category of the purchase.</td>
  </tr>
  <tr>
    <td style="text-align:left">Amount</td>
    <td style="text-align:left">Specifies the purchase amount.</td>
  </tr>
  <tr>
    <td style="text-align:left">Fraud</td>
    <td style="text-align:left">The target variable: indicates whether the transaction was fraudulent (1) or benign (0).
    </td>
  </tr>
</table>

So, let's fetch the dataset from object storage and read it into a pandas DataFrame.

In [None]:
file_name = config.get("SOURCE_PATH")

client = init_minio_client()
client.download_file(Bucket=config.get("BUCKET"), Key=config.get("SOURCE_PATH"), Filename="feed.csv")

data = pd.read_csv("feed.csv")
data.head(5)

Next, let's take a closer look at the types of data stored in each column and assess whether there are any missing
values present in the dataset. Fortunately, this dataset does not contain any missing values, which means you don't have
to include any imputation strategies in your preprocessing steps. The absence of missing values simplifies the data
cleaning process and allows you to proceed with the analysis.

In [None]:
data.info()

## Synthetic Minority Over-sampling TEchnique (SMOTE)

Fraud datasets, as the plot below shows, tend to be heavily imbalanced: fraudulent transactions are a small minority.
This imbalance can bias a machine learning model towards the majority class during training. To counteract this issue,
you can use techniques such as oversampling or undersampling.

- Oversampling refers to the process of augmenting the number of instances in the minority class by generating similar
  instances.
- Undersampling involves reducing the number of instances in the majority class by randomly selecting data points until
  the count aligns with the minority class.

Each of these strategies has associated risks. For instance, oversampling may result in the creation of duplicate or
highly similar data points, which may not be beneficial for fraud detection given that fraudulent transactions often
exhibit unique characteristics. Undersampling, however, implies a loss of data points, and consequently, valuable
information.

To balance the dataset without introducing excessive redundancy or losing crucial information, you could employ an
oversampling technique known as SMOTE (Synthetic Minority Over-sampling TEchnique). Unlike naive oversampling methods,
SMOTE generates synthetic instances of the minority class using neighboring instances, ensuring that the new samples are
not exact copies but closely resemble existing instances. This technique can potentially improve the performance of your
model by providing it with a more representative and balanced view of the different classes in the dataset.
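
The interpolation idea behind SMOTE can be sketched without any library. The code below is a deliberately simplified
illustration, not the `imblearn` implementation: `smote_like_sample` is a hypothetical helper that picks a random
minority sample, chooses one of its `k` nearest minority neighbours, and returns a point on the line segment between
them. The toy `fraud_points` are made up for the example.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class points: [feature 1, amount]
fraud_points = [[1.0, 200.0], [1.2, 220.0], [0.9, 180.0], [1.1, 210.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(fraud_points, k=2, rng=rng) for _ in range(3)]
print(synthetic)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region already
occupied by the minority class instead of duplicating existing rows.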

In [None]:
# Create two dataframes with fraud and non-fraud data 
df_fraud = data.loc[data.fraud == 1] 
df_non_fraud = data.loc[data.fraud == 0]

sns.countplot(x="fraud",data=data)
plt.title("Count of Fraudulent Payments")
plt.show()

print("Number of normal examples:", df_non_fraud.fraud.count())
print("Number of fraudulent examples:", df_fraud.fraud.count())

Moreover, by examining the data below, it becomes apparent that the 'leisure' and 'travel' categories seem to be most
frequently targeted by fraudsters. It appears that these perpetrators might be strategically choosing categories where
people typically spend more on average.

To validate this hypothesis, you need to delve deeper into the transaction data, specifically comparing the amounts
transacted in fraudulent and non-fraudulent cases. By doing this, you could gain a better understanding of the patterns
and behaviors of fraudsters, and thus improve the effectiveness of the detection system.

In [None]:
print("Mean feature values per category:")
data.groupby('category')[['amount', 'fraud']].mean()

Upon further analysis, the initial hypothesis, that fraudsters predominantly target categories where average spending
is higher, holds true only to a certain extent. However, a clear trend emerges when you examine the transaction values
associated with fraudulent activities.

As the table below illustrates, a fraudulent transaction is typically much larger, about four times or more, than the
average transaction within a given category. This deviation in transaction amounts may be a useful indicator when
identifying potentially fraudulent activity in the future.

In [None]:
# Compare the mean fraudulent and non-fraudulent amounts and the fraud rate per category
pd.concat([df_fraud.groupby('category')['amount'].mean(),
           df_non_fraud.groupby('category')['amount'].mean(),
           data.groupby('category')['fraud'].mean()*100],
           keys=["Fraudulent","Non-Fraudulent","Percent(%)"], axis=1,sort=False).sort_values(by=['Non-Fraudulent'])

Upon examining the average amount spent across various categories, you can see that spending is typically similar,
generally ranging from `0` to `500`, once you exclude the outliers. However, the 'travel' category stands as an
exception, with spending reaching significantly higher levels.

This deviation in the 'travel' category could be due to a variety of factors, such as the inherent high cost associated
with travel and tourism activities. Such information is crucial, not only for understanding the spending behavior of
customers but also for improving the fraud detection strategies, as categories with higher average spending might
attract more fraudulent activities.

In [None]:
# Plot a boxplot of the amounts spent in each category
plt.figure(figsize=(30, 10))
sns.boxplot(x=data.category, y=data.amount)
plt.title("Boxplot of the amount spent per category")
plt.ylim(0, 4000)
plt.show()

Reinforcing previous observations, the histogram below presents a striking representation of the relationship between
the number and amount of fraudulent transactions. While the count of fraudulent transactions is relatively low, the
monetary value they represent is disproportionately high. This pattern underscores the serious financial implications of
fraud, even when the number of incidents might appear relatively minor at first glance. It's precisely this disparity
that makes effective and precise fraud detection systems crucial for the integrity of any financial system.

In [None]:
# Plot histograms of the amounts in fraud and non-fraud data 
plt.hist(df_fraud.amount, alpha=0.5, label='fraud',bins=100)
plt.hist(df_non_fraud.amount, alpha=0.5, label='nonfraud',bins=100)
plt.title("Histogram for fraudulent and nonfraudulent payments")
plt.ylim(0, 10000)
plt.xlim(0, 1000)
plt.legend()
plt.show()

Examining the table below, you can observe the percentage of fraudulent transactions within each age category. Among
known age categories, the group '0' (ages 18 and under) exhibits the highest fraud percentage, at roughly 1.96%
(`1.957586`). This information helps you understand which demographics are most exposed to fraudulent activity and can
inform how you tailor the fraud detection algorithms and preventive measures.

In [None]:
((data.groupby('age')['fraud'].mean() * 100).reset_index()
                                            .rename(columns={'age':'Age','fraud' : 'Fraud Percent'})
                                            .sort_values(by='Fraud Percent'))

# Data Preprocessing

In this section, you will focus on preprocessing the data and preparing it for the training phase. Upon investigating
the data, you can see that the two zip code columns, `zipcodeOri` and `zipMerchant`, each contain only one unique
value. A feature with a single value adds no predictive power, since it remains constant for all observations.
Therefore, you can drop both columns from the dataset to streamline the model training process.

In [None]:
print("Unique 'zipcodeOri' values:", data.zipcodeOri.nunique())
print("Unique 'zipMerchant' values:", data.zipMerchant.nunique())
# Drop zipcodeOri and zipMerchant since each has only one unique value
data_reduced = data.drop(["zipcodeOri", "zipMerchant"], axis=1)

Next, you will convert the categorical features into numerical values. One efficient way to do this is the pandas
`cat.codes` accessor, which encodes each category as an integer code and, unlike one-hot encoding, does not increase
the dimensionality of the dataset.
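
To make the encoding concrete, the sketch below mimics in plain Python what `cat.codes` produces for a column of
strings: each label is replaced by its position in the sorted list of distinct labels, and a missing value becomes
`-1`. The `category_codes` helper and the sample `genders` list are assumptions made for this illustration.

```python
def category_codes(values):
    """Map each label to its index in the sorted list of distinct labels.

    None stands in for a missing value and is encoded as -1.
    """
    categories = sorted({v for v in values if v is not None})
    index = {label: code for code, label in enumerate(categories)}
    return [index.get(v, -1) for v in values], categories


genders = ["M", "F", "F", "U", "E", None]
codes, categories = category_codes(genders)
print(categories)  # ['E', 'F', 'M', 'U']
print(codes)       # [2, 1, 1, 3, 0, -1]
```

Keep in mind that the resulting integers imply an order (here `E < F < M < U`) that carries no real meaning.
Tree-based models are largely insensitive to this, but distance-based models such as KNN will treat the codes as
numeric distances.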

In [None]:
# Convert object columns to the categorical dtype to simplify the transformation
col_categorical = data_reduced.select_dtypes(include=['object']).columns
for col in col_categorical:
    data_reduced[col] = data_reduced[col].astype('category')
# Replace each categorical value with its integer code
data_reduced[col_categorical] = data_reduced[col_categorical].apply(lambda x: x.cat.codes)
data_reduced.head(5)

To proceed with model training, you need to define the independent variable ($X$) and dependent/target variable ($y$).
In this context, the independent variable ($X$) refers to the set of features or attributes that you will use to
predict the dependent/target variable ($y$). By properly defining $X$ and $y$, you can establish the foundation for
training your machine learning model and exploring the relationships between the independent variables and the
dependent/target variable.

In [None]:
X = data_reduced.drop(['fraud'], axis=1)
y = data_reduced['fraud']

print(X.head(), "\n")
print(y.head())

# Oversampling with SMOTE

To address the issue of class imbalance in the dataset, you will apply the Synthetic Minority Over-sampling TEchnique
(SMOTE). As you saw earlier, this oversampling technique generates synthetic instances of the minority class (fraudulent
transactions) by interpolating between existing instances. By applying SMOTE, you will effectively increase the number
of instances in the minority class to match the number of instances in the majority class (non-fraudulent transactions).
As a result, you will have an equal number of instances for both classes, which helps to alleviate the potential bias
and improve the performance of your machine learning model.

In [None]:
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
y_res = pd.DataFrame(y_res)

print(y_res.value_counts())

# Train-Test Split for Model Performance Measurement

To assess the performance of your machine learning model, you should split the data into two sets: a training set and a
testing set. The training set is used to train the model, allowing it to learn patterns and relationships within the
data. The testing set is used to evaluate the model's performance on unseen data, which shows how well the model
generalizes to new instances. Note that in this notebook the split is performed on the already oversampled data, so
synthetic fraud samples in the training set can be interpolated from points that end up in the testing set, and the
reported scores are likely optimistic. Resampling only the training portion after the split avoids this.

While cross-validation is a commonly recommended practice for model evaluation, in this case, due to the large number of
instances in the dataset, you could opt for a simple train-test split. However, it is important to note that
cross-validation should be used whenever feasible, as it provides a more comprehensive evaluation of the model's
performance and helps to mitigate potential biases introduced by a single train-test split.
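
For reference, the mechanics of k-fold cross-validation can be sketched as follows. The `k_fold_indices` generator is
a hypothetical helper written for this illustration: it partitions the sample indices into `k` contiguous folds and
yields each fold once as the test set, with the remaining indices forming the training set. A real evaluation would
usually shuffle (and, for imbalanced data, stratify) the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample is used for testing exactly once, so the averaged score depends less on one particular split.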

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res,
                                                    test_size=0.3,
                                                    random_state=42,
                                                    shuffle=True,
                                                    stratify=y_res)

As mentioned earlier, fraud datasets often suffer from severe class imbalance, where the majority of instances are
non-fraudulent transactions. If you were to naively predict non-fraudulent for all instances in such imbalanced
datasets, you would achieve a high accuracy score, typically around 99%. However, this misleading accuracy score does
not indicate a successful fraud detection system. In reality, the goal of a fraud detection classifier is to identify
the fraudulent transactions accurately, which are the minority class in the dataset.

To accurately evaluate the performance of a fraud detection classifier, it is essential to consider metrics that are
sensitive to both the minority (fraudulent) and majority (non-fraudulent) classes. Metrics such as precision, recall,
F1-score, and the area under the Receiver Operating Characteristic (ROC) curve provide a more comprehensive assessment
of the model's effectiveness in detecting fraud.

In addition, it is important to establish a baseline: the accuracy obtained by always predicting the majority class
(non-fraudulent). A fraud detection model is only useful if it clearly outperforms this simplistic approach. Note that
the baseline computed below refers to the original, imbalanced data; on the SMOTE-balanced split, always predicting a
single class yields only 50% accuracy.

Therefore, when evaluating the performance of a fraud detection model, it is imperative to consider multiple metrics
that provide a comprehensive understanding of its effectiveness in detecting both fraudulent and non-fraudulent
instances, rather than relying solely on accuracy.
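
The gap between accuracy and the other metrics is easy to see with a small worked example. The confusion-matrix
counts below are invented for illustration (1,000 transactions, 10 of them fraudulent), and `classification_metrics`
is a hypothetical helper, not part of scikit-learn.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive (fraud) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A model that always predicts "non-fraudulent" catches none of the 10 frauds.
always_benign = classification_metrics(tp=0, fp=0, fn=10, tn=990)
# A model that finds 8 of the 10 frauds at the cost of 20 false alarms.
detector = classification_metrics(tp=8, fp=20, fn=2, tn=970)

print(always_benign)
print(detector)
```

The always-benign model reaches 99% accuracy with zero recall, while the second model has slightly lower accuracy but
actually detects most of the fraud.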

In [None]:
# Any useful model must beat the accuracy of always predicting non-fraudulent
n_non_fraud = df_non_fraud.fraud.count()
n_fraud = df_fraud.fraud.count()
print("Base accuracy score: ", n_non_fraud / (n_non_fraud + n_fraud) * 100)

# K-Neighbours Classifier

The K-Nearest Neighbors (KNN) classifier is a popular machine learning algorithm used for classification. In this
section, you will train a KNN classifier as a potential approach for the fraud detection problem.

The KNN classifier works based on the principle that instances with similar feature values tend to belong to the same
class. It classifies new instances by finding the K nearest neighbors in the training set and assigning the majority
class label among those neighbors to the new instance.

Key features of the KNN classifier include:

- K value: The value of K represents the number of neighbors to consider for classification. It is an important
  parameter that needs to be carefully chosen to achieve optimal performance.
- Distance metric: The choice of distance metric, such as Euclidean distance or Manhattan distance, determines the
  similarity between instances and influences the classification process.
- Computational cost: The KNN classifier can be computationally expensive, especially for large datasets, as it requires
  calculating distances between the new instance and all training instances. Therefore, it is essential to consider the
  computational trade-offs when working with KNN.

By employing the KNN classifier, you can leverage the proximity-based nature of the algorithm to detect potential fraud
patterns in the data. However, it is important to experiment with different values of K and evaluate the model's
performance using appropriate evaluation metrics to ensure its effectiveness in detecting fraudulent transactions.
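
The classification rule described above fits in a few lines of plain Python. This is a minimal sketch on made-up
data, not the scikit-learn implementation; `manhattan` and `knn_predict` are hypothetical helpers. The Manhattan
distance is used because it corresponds to the Minkowski distance with `p=1`.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski distance with p=1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: manhattan(pair[0], query))
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy data: [category code, amount]; labels 1 = fraud, 0 = benign.
train_X = [[0, 20.0], [1, 35.0], [0, 25.0], [2, 900.0], [2, 1100.0], [1, 950.0]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [1, 30.0], k=3))    # 0: the three nearest points are benign
print(knn_predict(train_X, train_y, [2, 1000.0], k=3))  # 1: the three nearest points are fraud
```

Note how the amount dominates the distance because it spans a much larger range than the category code; scaling the
features is one common way to keep a single column from drowning out the others.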

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, p=1)
knn.fit(X_train, np.ravel(y_train, order='C'))
y_pred = knn.predict(X_test)

print("Classification Report for K-Nearest Neighbours: \n", classification_report(y_test, y_pred))
print("Confusion Matrix of K-Nearest Neighbours: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, knn.predict_proba(X_test)[:,1])

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Save the model locally, then upload it to MinIO (S3-compatible object storage)
filename = "model.pkl"
with open(filename, "wb") as f:
    pickle.dump(knn, f)

minio_client = init_minio_client()
minio_client.upload_file(Filename=filename, 
                         Bucket=config.get("BUCKET"), 
                         Key=f"banking/pickles/k-neighbors/model/{filename}")

# Random Forest Classifier

The Random Forest classifier is a powerful and versatile machine learning algorithm widely used for classification. In
the context of the fraud detection problem, let's explore the Random Forest classifier as a potential approach.

The Random Forest algorithm is an ensemble method that works by constructing a multitude of decision trees and
aggregating their predictions to make the final classification. Each decision tree in the Random Forest is trained on a
different subset of the data, using a random selection of features. This randomness helps to reduce overfitting and
improve the generalization ability of the model.

Key features of the Random Forest classifier include:

- Ensemble learning: The Random Forest classifier combines the predictions of multiple decision trees to make a more
  robust and accurate prediction. The ensemble approach helps to mitigate the risk of individual decision trees making
  errors.
- Feature importance: The Random Forest classifier provides a measure of feature importance, indicating the relative
  importance of each feature in making predictions. This information can be valuable for understanding the key factors
  contributing to fraudulent transactions.
- Parallelization: The Random Forest algorithm lends itself well to parallelization, as each decision tree in the forest
  can be trained independently. This makes it suitable for large datasets and can lead to faster training times.

When applying the Random Forest classifier to the fraud detection problem, it is crucial to tune hyperparameters, such
as the number of decision trees in the forest and the maximum depth of each tree, to optimize performance. Additionally,
evaluating the model's performance using appropriate metrics and considering feature importance can provide valuable
insights for fraud detection and prevention.
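
The two sources of randomness and the final vote can be illustrated with a toy ensemble. The sketch below is a heavy
simplification and not how scikit-learn builds its trees: each "tree" is a one-split decision stump, and the random
feature is chosen once per stump rather than at every split. All function names and the toy data are assumptions
made for this example.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Find the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        for high_label in (0, 1):
            errors = sum(
                (high_label if row[feature] > threshold else 1 - high_label) != label
                for row, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, high_label)
    _, threshold, high_label = best
    return feature, threshold, high_label


def predict_stump(stump, row):
    feature, threshold, high_label = stump
    return high_label if row[feature] > threshold else 1 - high_label


def fit_forest(X, y, n_trees=25, seed=42):
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw n_samples rows with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Random feature selection: each stump sees a single random feature.
        feature = rng.randrange(n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data: [amount, transactions in the last hour]; 1 = fraud, 0 = benign.
X = [[20.0, 1], [25.0, 2], [30.0, 1], [35.0, 3], [40.0, 2], [60.0, 1],
     [900.0, 10], [950.0, 12], [1000.0, 15], [1100.0, 11], [1200.0, 20], [1500.0, 18]]
y = [0] * 6 + [1] * 6

forest = fit_forest(X, y)
print(predict_forest(forest, [10.0, 0]))
print(predict_forest(forest, [2000.0, 30]))
```

Individual stumps may pick slightly different thresholds depending on their bootstrap sample, but the majority vote
smooths out those differences.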

In [None]:
rf_clf = RandomForestClassifier(n_estimators=50,
                                max_depth=8,
                                random_state=42,
                                verbose=1,
                                class_weight="balanced")
rf_clf.fit(X_train, y_train.values.ravel())
y_pred = rf_clf.predict(X_test)

print("Classification Report for Random Forest Classifier: \n", classification_report(y_test, y_pred))
print("Confusion Matrix of Random Forest Classifier: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, rf_clf.predict_proba(X_test)[:,1])

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Save the model locally, then upload it to MinIO (S3-compatible object storage)
filename = "model.pkl"
with open(filename, "wb") as f:
    pickle.dump(rf_clf, f)

minio_client = init_minio_client()
minio_client.upload_file(Filename=filename, 
                         Bucket=config.get("BUCKET"), 
                         Key=f"banking/pickles/random_forest/model/{filename}")

# XGBoost Classifier

The XGBoost (Extreme Gradient Boosting) classifier is a state-of-the-art machine learning algorithm known for its
exceptional performance and widespread use in various domains, including fraud detection. Let's delve into the XGBoost
classifier and its relevance to the fraud detection problem.

XGBoost is an ensemble learning method that combines the power of gradient boosting with several innovative techniques.
It excels at handling large-scale datasets and effectively capturing complex relationships between features. The
algorithm constructs a series of decision trees iteratively, where each subsequent tree corrects the mistakes made by
the previous trees.

Key features of the XGBoost classifier include:

- Gradient boosting: XGBoost utilizes gradient boosting, a technique that sequentially adds decision trees to improve
  the model's predictive accuracy. By iteratively minimizing a specified loss function, XGBoost focuses on capturing
  intricate patterns and relationships in the data.
- Feature importance: XGBoost provides valuable insights into feature importance by quantifying the impact of each
  feature on the model's performance. This information aids in identifying the most influential features for detecting
  fraudulent transactions.
- Parallel processing: XGBoost supports parallel processing, enabling faster training times and efficient computation on
  large-scale datasets. It leverages the capabilities of multicore processors and distributed computing frameworks for
  accelerated model training.

When employing the XGBoost classifier for fraud detection, it is crucial to tune hyperparameters, such as the learning
rate, maximum depth of trees, and regularization parameters, to optimize performance. Evaluating the model's performance
using appropriate metrics and considering feature importance can enhance the effectiveness of the fraud detection
system.
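
The idea that each new tree corrects the mistakes of the previous ones can be shown with a bare-bones boosting loop.
The sketch below is plain gradient boosting with squared loss and one-split stumps on a single feature; it leaves out
everything that makes XGBoost distinctive (second-order gradients, regularization, column sampling). The function
names and the toy data are assumptions made for this illustration.

```python
def fit_residual_stump(xs, residuals):
    """Pick the split on a 1-D feature that minimises squared error of the residuals."""
    best = None
    # Excluding the largest value keeps the right-hand side of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return the initial prediction, the fitted stumps, and the loss after each round."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_residual_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, losses


amounts = [10.0, 20.0, 30.0, 400.0, 500.0, 600.0]
labels = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
base, stumps, losses = boost(amounts, labels)
print("loss after first round:", losses[0])
print("loss after last round: ", losses[-1])
```

The learning rate shrinks each stump's contribution, so the loss falls gradually over the rounds instead of being
fitted in a single step.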

In [None]:
# `missing` keeps its default (NaN) so that legitimate feature values of 1 are not treated as missing.
# `binary:logistic` yields class probabilities, which the ROC curve below relies on.
XGBoost_CLF = xgb.XGBClassifier(max_depth=6, learning_rate=0.05, n_estimators=50,
                                objective="binary:logistic", booster='gbtree',
                                n_jobs=-1, gamma=0, min_child_weight=1, max_delta_step=0,
                                subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1,
                                reg_alpha=0, reg_lambda=1, base_score=0.5, random_state=42, verbosity=1)
XGBoost_CLF.fit(X_train, y_train)
y_pred = XGBoost_CLF.predict(X_test)

print("Classification Report for XGBoost: \n", classification_report(y_test, y_pred))
print("Confusion Matrix of XGBoost: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, XGBoost_CLF.predict_proba(X_test)[:,1])

# Get accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Save the model locally, then upload it to MinIO (S3-compatible object storage)
filename = "model.pkl"
with open(filename, "wb") as f:
    pickle.dump(XGBoost_CLF, f)

minio_client = init_minio_client()
minio_client.upload_file(Filename=filename, 
                         Bucket=config.get("BUCKET"), 
                         Key=f"banking/pickles/xgb/model/{filename}")

# Logistic Regression Classifier

The Logistic Regression classifier is a well-established and widely used algorithm for binary classification tasks,
making it relevant to the fraud detection problem. Let's explore the Logistic Regression classifier and its
applicability to this scenario.

Despite its name, Logistic Regression is a classification algorithm that models the probability of an instance belonging
to a particular class. It is particularly suited for problems where the dependent variable is binary, as in this case,
where the goal is to distinguish between fraudulent and non-fraudulent transactions.

Key features of the Logistic Regression classifier include:

- Probabilistic modeling: Logistic Regression models the relationship between the independent variables and the
  probability of belonging to a specific class. It employs the logistic function (also known as the sigmoid function)
  to map the output to a probability score.
- Interpretability: Logistic Regression provides interpretable coefficients for each independent variable, which allows
  us to understand the impact of the features on the likelihood of fraud. These coefficients indicate the direction and
  magnitude of the relationship between each feature and the probability of fraudulent transactions.
- Efficiency: Logistic Regression is computationally efficient and can handle large datasets with relative ease. It
  typically converges quickly and, having few parameters, is less prone to overfitting than more flexible models,
  making it suitable for situations where interpretability and simplicity are important factors.
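
The probabilistic modeling and interpretability points above can be illustrated with a few lines of plain Python. This
is a minimal sketch, not the trained model: the coefficients, intercept, and feature values are hypothetical, and the
helper names `sigmoid` and `fraud_probability` are introduced here for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fraud_probability(features, coefficients, intercept):
    """Probability of the positive (fraud) class under a logistic model."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients for two scaled features; not taken from the trained model.
coefficients = [1.2, -0.4]
intercept = -3.0

p_low = fraud_probability([0.5, 1.0], coefficients, intercept)
p_high = fraud_probability([4.0, 1.0], coefficients, intercept)
print(f"first feature low: {p_low:.3f}, first feature high: {p_high:.3f}")

# exp(coefficient) is the multiplicative change in the odds of fraud
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = [math.exp(w) for w in coefficients]
print("odds ratios:", [round(r, 3) for r in odds_ratios])
```

A positive coefficient yields an odds ratio above 1 (the feature raises the odds of fraud), while a negative
coefficient yields an odds ratio below 1.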

When utilizing the Logistic Regression classifier for fraud detection, it is crucial to preprocess the data
appropriately, handle categorical variables, and consider feature scaling if necessary. Evaluating the model's
performance using appropriate metrics such as precision, recall, F1-score, and area under the ROC curve can provide a
comprehensive understanding of its effectiveness in detecting fraudulent transactions.
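
To make the metrics above concrete, the following sketch computes them by hand from true and predicted labels. It is an
illustrative, dependency-free version of what `classification_report` summarizes; the function name
`classification_metrics` and the toy labels are assumptions made for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy, imbalanced example: 8 legitimate transactions (0) and 2 fraudulent ones (1).
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

metrics = classification_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

In this toy case the accuracy looks high even though only half of the fraudulent transactions are caught, which is why
precision and recall for the fraud class matter more than accuracy on imbalanced data.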

In [None]:
LRmodel = LogisticRegression(max_iter=999, solver='lbfgs')
LRmodel.fit(X_train, y_train)

# Get predictions
y_pred = LRmodel.predict(X_test)
print("Classification Report for LogisticRegression: \n", classification_report(y_test, y_pred))
print("Confusion Matrix of LogisticRegression: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, LRmodel.predict_proba(X_test)[:,1])

# Get accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Save the model locally, then upload it to the MinIO (S3-compatible) bucket
filename = "model.pkl"
with open(filename, "wb") as f:
    pickle.dump(LRmodel, f)

minio_client = init_minio_client()
minio_client.upload_file(Filename=filename, 
                         Bucket=config.get("BUCKET"), 
                         Key=f"banking/pickles/logisticregression/model/{filename}")

# Model Deployment

After training and evaluating the fraud detection model, the next step is to deploy it into a production environment
where it can be used to detect fraud in real-time transactions.

Before deploying, set up secure access to the S3 endpoint. The credentials are held in a Secret object, and the manifest
below creates a ServiceAccount that references this Secret, giving the deployment an identity that can read the model
from object storage. The manifest assumes that the Secret already exists in the namespace. It also defines the
InferenceService itself, which points KServe at the Logistic Regression model uploaded in the previous step.

In [None]:
manifest = f"""
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-minio-sa
secrets:
- name: {config['CURRENT_USER']}-objectstore-secret

---
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fraud-detection"
spec:
  predictor:
    serviceAccountName: kserve-minio-sa
    sklearn:
      protocolVersion: "v2"
      storageUri: "s3://{config['BUCKET']}/banking/pickles/logisticregression/model"
"""

os.makedirs("manifests", exist_ok=True)

with open(os.path.join("manifests", "isvc.yaml"), "w") as f:
    f.write(manifest)

In [None]:
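# Capture the output so that the result of `kubectl apply` is visible in the notebook
res = subprocess.run(["kubectl", "apply", "-f", "manifests/isvc.yaml"], capture_output=True, text=True)
print(res.stdout or res.stderr)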

# Prediction

With the deployed fraud detection model, you can now use it to make predictions on new, incoming transactions. The
prediction process involves passing the relevant transaction data through the deployed model to obtain a prediction of
whether the transaction is fraudulent or not.
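
The request and response bodies used for prediction follow the V2 inference protocol that the deployed service exposes.
The sketch below shows their shape without calling the service. The helper names `build_v2_request` and
`first_prediction`, the feature values, and the sample response are hypothetical and serve only to illustrate the
structure.

```python
import json


def build_v2_request(row, input_name="input-0", datatype="FP32"):
    """Wrap a single feature row in a V2 inference protocol request body."""
    values = [float(v) for v in row]
    return {
        "inputs": [{
            "name": input_name,
            "shape": [1, len(values)],  # one row, one column per feature
            "datatype": datatype,
            "data": values,
        }]
    }


def first_prediction(response_body):
    """Return the first value of the first output tensor in a V2 response."""
    return response_body["outputs"][0]["data"][0]


# Hypothetical, already-encoded transaction with seven features.
request_body = build_v2_request([1, 210, 4, 2, 30, 12, 35.5])
print(json.dumps(request_body))

# Hypothetical response body in the V2 shape, used here instead of a live call.
sample_response = {
    "model_name": "fraud-detection",
    "outputs": [{"name": "output-0", "shape": [1], "datatype": "INT64", "data": [0]}],
}
label = "fraudulent" if first_prediction(sample_response) == 1 else "non-fraudulent"
print("Prediction:", label)
```

Deriving the shape from the row length avoids a mismatch between the declared shape and the data if the feature set
changes.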

In [None]:
data = pd.read_csv(os.path.join("dataset", "generated-data.csv"))
data.head(5)

In [None]:
data_reduced = data.drop(['zipcodeOri', 'zipMerchant'], axis=1)

# Convert every string column (such as customer, merchant, and category) to a categorical dtype
col_categorical = data_reduced.select_dtypes(include=['object']).columns
for col in col_categorical:
    data_reduced[col] = data_reduced[col].astype('category')

# Replace each category with its integer code. Codes are assigned per dataset, so they
# match the training encoding only if the same set of categories is present here.
data_reduced[col_categorical] = data_reduced[col_categorical].apply(lambda x: x.cat.codes)
data_reduced.head(5)

In [None]:
DOMAIN_NAME = "svc.cluster.local"  # change this to your domain for external access
NAMESPACE = config.get("NAMESPACE")
DEPLOYMENT_NAME = config.get("KSERVE_MODEL_NAME")
MODEL_NAME = DEPLOYMENT_NAME
SVC = f'{DEPLOYMENT_NAME}-predictor-default.{NAMESPACE}.{DOMAIN_NAME}'
URL = f"https://{SVC}/v2/models/{MODEL_NAME}/infer"

print(URL)

In [None]:
X = data_reduced.drop(['fraud'], axis=1)
y = data_reduced['fraud']
print("Shape:", list(X.shape))

inference_request = {
    "inputs": [{
        "name": "fraud-detection-infer-001",
        "datatype": "FP32",
        # One row, with one column per feature
        "shape": [1, X.shape[1]],
        # Row 14 is an example of a non-fraudulent transaction;
        # row 17 (used here) is an example of a fraudulent one.
        "data": X.iloc[17].astype(float).tolist(),
    }]
}

print("data:", inference_request)

In [None]:
headers = {"Authorization": f"Bearer {os.environ['AUTH_TOKEN']}"}
# verify=False skips TLS certificate verification; use it only for testing
response = requests.post(URL, json=inference_request, headers=headers, verify=False)

# ANSI escape codes for colored output
RED, GREEN, RESET = "\033[91m", "\033[92m", "\033[0m"

if response.status_code == 200:
    # Parse the response body once and reuse the predictions
    predictions = response.json()["outputs"][0]["data"]
    message = {"message": "", "value": predictions[0]}
    if predictions[0] == 1:
        message["message"] = "Fraudulent Banking Transaction!"
        print(f"{RED}Prediction Result: {json.dumps(message)}{RESET}")
    elif len(predictions) > 1:
        print("Model-Infer-dtl:[data]:\n", predictions)
    else:
        message["message"] = "Non-fraudulent Banking Transaction!"
        print(f"{GREEN}Prediction Result: {json.dumps(message)}{RESET}")
else:
    print(response.status_code, response.content)

# Conclusion

In this notebook, you developed a fraud detection model using bank payment data. You trained and evaluated several
classifiers, including XGBoost and Logistic Regression, deployed the Logistic Regression model as an inference service,
and used it to classify new transactions. Because fraud datasets often suffer from class imbalance, you used the SMOTE
oversampling technique to address this issue by generating synthetic minority class examples.

# References

1. Lavion, Didier; et al., "PwC's Global Economic Crime and Fraud Survey 2022",
   https://www.pwc.com/gx/en/services/forensics/economic-crime-survey.html
1. https://www.pwc.com/gx/en/services/forensics/gecs/outcomes-of-platform-fraud.svg
1. https://www.pwc.com/gx/en/forensics/gecsm-2022/pdf/PwC%E2%80%99s-Global-Economic-Crime-and-Fraud-Survey-2022.pdf
1. [SMOTE: Synthetic Minority Over-sampling Technique](https://jair.org/index.php/jair/article/view/10302)
1. [Banksim Data Set Paper](http://www.msc-les.org/proceedings/emss/2014/EMSS2014_144.pdf)