<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Financial Fraud Detection with In-Database Machine Learning</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>
    <b>ABC Bank</b> is a global bank that offers financial services to millions of customers worldwide. They have been experiencing a significant increase in fraud incidents, resulting in financial losses and customer dissatisfaction. As a result, <b>ABC Bank</b> has decided to seek a solution to improve their fraud detection capabilities.
<br>
<br>
After conducting a thorough evaluation of their current fraud detection system and analyzing historical transaction data, <b>ABC Bank</b> has identified the need for a robust and accurate fraud detection solution. They have approached us for assistance in developing a solution using machine learning techniques.
    <br>
    <br>
    This notebook demonstrates a data science workflow that uses the <b>teradataml</b> package to build, validate, and score a model at scale in Vantage. Feature analysis, data transformation, model training, and model scoring all run inside the Vantage environment, so the data never has to be moved.</p>


<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the environment</li>
    <li>Connect to Vantage</li>
    <li>Data Exploration</li>
    <li>Data Preparation</li>
    <li>In-Database Machine Learning</li>
    <li>In-Database Model Scoring</li>
    <li>Visualize the results</li>
    <li>Cleanup</li>
</ol>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Configuring the Environment</b>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Standard Libraries
import os
import sys  # used by the sys.exit() table checks in section 6
import getpass
import warnings

# Data Manipulation and Visualization Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

# Teradata Libraries
from teradataml import *

# Machine Learning Metrics and Visualizations
from sklearn.metrics import mean_absolute_error, roc_auc_score, roc_curve, confusion_matrix, ConfusionMatrixDisplay

# Configuration
display.max_rows = 5
configure.val_install_location = 'val'

# Suppress Warnings
warnings.filterwarnings("ignore")

# Magic Command for Inline Plotting
%matplotlib inline

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted for the password. Enter your password, press the Enter key, then use the down arrow to move to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=Financial_Fraud_Detection_InDB_PY_SQL.ipynb;' UPDATE FOR SESSION;")

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either access it through foreign tables, which uses no storage in your environment, or download it to local storage, which may run faster but consumes available space. The following cell contains both statements, with one commented out; switch modes by moving the comment marker to the other line.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('demo_glm_fraud_cloud');"        # Takes 1 minute
%run -i ../run_procedure.py "call get_data('demo_glm_fraud_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used. </p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial'>The data from <a href = 'https://www.kaggle.com/datasets/ealaxi/paysim1'>https://www.kaggle.com/datasets/ealaxi/paysim1</a> is loaded in Vantage in a table named "transaction_data". Check the data size and print sample rows: 63k rows and 12 columns.</p>
<p style = 'font-size:16px;font-family:Arial'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

In [None]:
txn_data = DataFrame(in_schema('DEMO_GLM_Fraud','transaction_data'))

print(txn_data.shape)
txn_data

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>3.1 Renaming columns</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here we rename a misspelt column without moving the data out of Vantage: <b>oldbalanceOrg</b> becomes <b>oldbalanceOrig</b>.</p>

In [None]:
txn_data = txn_data.assign(oldbalanceOrig = txn_data.oldbalanceOrg).drop(['oldbalanceOrg'], axis=1)

txn_data

<p style = 'font-size:16px;font-family:Arial'>Fraudulent agents inside a simulation make these transactions. In this specific dataset, the agents aim to profit by taking control of customers' accounts, transferring the funds to another account, and then cashing out of the system.</p>
<!-- <p style = 'font-size:16px;font-family:Arial'><b>Below are some insights about the dataset:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>There are 92 fraud transactions i.e. 0.14% of fraud transactions in the dataset.</li>
    <li>From these 92 fraud transactions, 47 are of type TRANSFER and 45 are of type CASH_OUT.</li>
    <li>97.83% of fraud transactions have transaction amount equal to oldbalanceOrig i.e. account cleanout.</li>
    <li>71.74% of fraud transactions have recipient's old balance as zero.</li>
    <li>isFlaggedFraud is correct only two times among the 92 fraud transactions.</li>
</ol> -->

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>3.2 How many fraudulent transactions do we have in our dataset?</b></p>

In [None]:
# There are 92 fraud transactions i.e. 0.14% of fraud transactions in the dataset.
print("No of fraud transactions: %d\nPercentage of fraud transactions: %.2f%%"%(
    txn_data.loc[txn_data.isFraud == 1].shape[0],
    txn_data.loc[txn_data.isFraud == 1].shape[0]/txn_data.shape[0]*100)
)

# Calculate percentage of fraud transactions
fraud_transactions_count = txn_data.loc[txn_data.isFraud == 1].shape[0]
total_transactions_count = txn_data.shape[0]
fraud_percentage = fraud_transactions_count / total_transactions_count * 100

# Create a pie chart with Plotly
fig = px.pie(values = [fraud_percentage, 100-fraud_percentage],
             labels = ["Fraud Transactions", "Non-Fraud Transactions"],
             names = ["Fraud Transactions", "Non-Fraud Transactions"],
             color_discrete_sequence = ['lightgreen', 'red'],
             hover_name = ["Fraud Transactions", "Non-Fraud Transactions"],
             hole = 0.6)

# Update layout
fig.update_traces(textposition = 'inside', textinfo = 'percent+label')
fig.update_layout(title_text = 'Percentage of Fraud Transactions')

# Show plot
fig.show()

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>3.3 How many fraudulent transactions do we have, grouped by transaction type?</b></p>

In [None]:
# Filter data for fraud transactions and group by 'type'
fraud_transactions_by_type = txn_data.loc[txn_data.isFraud == 1].groupby('type').count().get(['type','count_step']).to_pandas()


# Sort by 'count_step' column in descending order
fraud_transactions_by_type = fraud_transactions_by_type.sort_values('count_step', ascending = False)

# Create a bar chart with Plotly
fig = px.bar(data_frame = fraud_transactions_by_type,
             x = 'type',
             y = 'count_step',
             color = 'type',
             color_discrete_sequence = px.colors.qualitative.Pastel,
             hover_name = 'type',
             text = 'count_step')

# Update layout
fig.update_traces(textposition = 'inside')
fig.update_layout(title_text = 'Distribution of Fraud Transactions by Type',
                  xaxis_title = 'Transaction Type',
                  yaxis_title = 'Count')

# Show plot
fig.show()

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>3.4 What percentage of fraudulent transactions do we have where transaction amount is equal to old balance in the origin account?</b></p>

<p style = 'font-size:16px;font-family:Arial'>This might be the case where the fraudster emptied the account of the victim.</p>

In [None]:
print("No of cleanout fraud transactions: %d\nPercentage of cleanout fraud transactions: %.2f%%"%(
    txn_data.loc[txn_data['amount'] == txn_data.oldbalanceOrig].loc[txn_data['isFraud'] == 1].shape[0],
    txn_data.loc[txn_data['amount'] == txn_data.oldbalanceOrig].loc[txn_data['isFraud'] == 1].shape[0] / txn_data.loc[txn_data.isFraud == 1].shape[0]*100)
)

# Calculate percentage of cleanout fraud transactions
# Use & (not Python's `and`) to combine teradataml column expressions
cleanout_fraud_transactions_count = txn_data.loc[(txn_data.isFraud == 1) & (txn_data.amount == txn_data.oldbalanceOrig)].shape[0]
fraud_transactions_count = txn_data.loc[(txn_data.isFraud == 1)].shape[0]
fraud_percentage = cleanout_fraud_transactions_count / fraud_transactions_count * 100

# Create a pie chart with Plotly
fig = px.pie(values = [fraud_percentage, 100-fraud_percentage],
             labels = ["Cleanout Fraud Transactions", "Non-Cleanout Fraud Transactions"],
             names = ["Cleanout Fraud Transactions", "Non-Cleanout Fraud Transactions"],
             color_discrete_sequence = ['pink', 'red'],
             hover_name = ["Cleanout Fraud Transactions", "Non-Cleanout Fraud Transactions"],
             hole = 0.6)

# Update layout
fig.update_traces(textposition = 'inside', textinfo = 'percent+label')
fig.update_layout(title_text = 'Percentage of Cleanout Fraud Transactions')

# Show plot
fig.show()

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Data Preparation</b>

<p style = 'font-size:16px;font-family:Arial'><b>We'll perform the following steps:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>One-hot encoding categorical "type" column</li>
    <li>Feature scaling using ScaleFit and ScaleTransform on numerical columns.</li>
    <li>Splitting the data into training and testing datasets (75:25 split)</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>ScaleFit computes the statistics needed for scaling and writes them to an output table. ScaleTransform then uses that table to scale the specified input columns.</p>

<p style = 'font-size:16px;font-family:Arial'>Feature scaling is performed during data pre-processing to handle features whose magnitudes, ranges, or units vary widely. Without it, a machine learning algorithm tends to give more weight to features with larger values and less to those with smaller values, regardless of their units.</p>
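<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of the fit-then-transform idea, the following pure-Python sketch standardizes one column. It assumes the population standard deviation and hypothetical sample values; the exact statistics ScaleFit stores for its STD method may differ, and the function names here are not part of teradataml.</p>

```python
import math

def fit_standard_scaler(values):
    """Return (mean, std) using the population standard deviation (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

def transform_standard(values, mean, std):
    """Scale values to zero mean and unit variance; a constant column maps to 0."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

amounts = [10.0, 20.0, 30.0, 40.0]               # hypothetical 'amount' values
mean, std = fit_standard_scaler(amounts)         # "fit" step: compute the statistics
scaled = transform_standard(amounts, mean, std)  # "transform" step: apply them
print(mean, round(std, 4), [round(s, 4) for s in scaled])
```

<p style = 'font-size:16px;font-family:Arial'>Keeping the two steps separate means the statistics computed once can be reused to scale any later data in the same way.</p>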

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.1 Drop unnecessary columns</b></p>
<p style = 'font-size:16px;font-family:Arial'>nameDest and nameOrig are not required, as txn_id uniquely identifies each transaction.</p>

In [None]:
txn_data = txn_data.drop(['nameDest', 'nameOrig'], axis = 1)
txn_data

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.2 One-hot encoding</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here we are one-hot encoding the "type" column. One-hot encoding is necessary in many cases to represent categorical variables as binary values, enable numerical processing, ensure feature independence, handle non-numeric data, and improve the performance and interpretability of machine learning models.</p>
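<p style = 'font-size:16px;font-family:Arial'>To make the idea concrete, here is a minimal pure-Python sketch of one-hot encoding on two hypothetical rows. The helper <code>one_hot_encode</code> is illustrative only and is not a teradataml function; the output column names follow the <code>CASH_IN_type</code> pattern used later in this notebook.</p>

```python
def one_hot_encode(rows, column, categories):
    """Replace `column` with one 0/1 indicator per category, named '<category>_<column>'."""
    encoded = []
    for row in rows:
        # Copy every field except the categorical one
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{category}_{column}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

# Hypothetical sample rows
rows = [
    {"txn_id": 1, "type": "TRANSFER"},
    {"txn_id": 2, "type": "CASH_OUT"},
]
categories = ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"]
encoded_rows = one_hot_encode(rows, "type", categories)
print(encoded_rows[0])
```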

In [None]:
txn_type_encoder = OneHotEncoder(values = ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"], columns = "type")

retain = Retain(columns = ['step', 'amount','newbalanceOrig','oldbalanceDest','newbalanceDest','oldbalanceOrig', 'isFlaggedFraud', 'isFraud'])

obj = valib.Transform(data = txn_data,
                      one_hot_encode = txn_type_encoder,
                      retain = retain,
                      index_columns = 'txn_id')
txn_trans = obj.result
txn_trans

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.3 Feature Scaling</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here we are using ScaleFit and ScaleTransform for scaling the numerical columns using Standard Deviation as scale method.
<br><br>
Feature scaling is important in machine learning to avoid numerical instability, ensure fair comparison of features, improve model performance, enhance interpretability, and handle distance-based algorithms.</p>

In [None]:
from teradataml import ScaleFit, ScaleTransform

sf_fit = ScaleFit(data = txn_trans, scale_method = 'STD',
                     target_columns = ['step', 'amount','newbalanceOrig','oldbalanceDest','newbalanceDest','oldbalanceOrig'])

sf_trns = ScaleTransform(data = txn_trans, object = sf_fit.output, accumulate = ["txn_id", "isFraud", 'isFlaggedFraud',
                                                                                 'CASH_IN_type', 'CASH_OUT_type', 'DEBIT_type',
                                                                                 'PAYMENT_type', 'TRANSFER_type'])

copy_to_sql(sf_trns.result, table_name = 'clean_data', if_exists = 'replace')
sf_trns.result

<p style = 'font-size:16px;font-family:Arial'>The above output shows that the data has been transformed into a scaled dataset.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.4 Train-test split</b></p>
<p style = 'font-size:16px;font-family:Arial'>Split the data into training and test datasets in a 75:25 ratio.</p>
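<p style = 'font-size:16px;font-family:Arial'>The sketch below illustrates what a seeded, reproducible split means, using only the Python standard library. It does not reproduce the sampling algorithm inside TD_TrainTestSplit; the helper name, the ids, and the example fraction are hypothetical.</p>

```python
import random

def train_test_split_ids(ids, train_fraction, seed):
    """Shuffle ids reproducibly and split them into train and test lists."""
    shuffled = sorted(ids)                 # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global random state alone
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

txn_ids = list(range(1, 21))  # 20 hypothetical transaction ids
train_ids, test_ids = train_test_split_ids(txn_ids, train_fraction=0.75, seed=7)
print(len(train_ids), len(test_ids))
```

<p style = 'font-size:16px;font-family:Arial'>Using the same seed again yields the same partition, which is what makes model comparisons repeatable.</p>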

In [None]:
query = '''
CREATE MULTISET TABLE TrainTestSplit_output AS (
    SELECT * FROM TD_TrainTestSplit(
        ON clean_data AS InputTable
        USING
        IDColumn('txn_id')
        trainSize(0.75)
        testSize(0.25)
        Seed(7)
    ) AS dt
) WITH DATA;
'''

try:
    execute_sql(query)
except:
    db_drop_table('TrainTestSplit_output')
    execute_sql(query)

In [None]:
query = '''
CREATE MULTISET TABLE clean_data_train AS (
    SELECT * FROM TrainTestSplit_output WHERE TD_IsTrainRow = 1
) WITH DATA;
'''

try:
    execute_sql(query)
except:
    db_drop_table('clean_data_train')
    execute_sql(query)

In [None]:
query = '''
CREATE MULTISET TABLE clean_data_test AS (
    SELECT * FROM TrainTestSplit_output WHERE TD_IsTrainRow = 0
) WITH DATA;
'''

try:
    execute_sql(query)
except:
    db_drop_table('clean_data_test')
    execute_sql(query)

In [None]:
DataFrame('clean_data_train')

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. In-Database Machine Learning</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.1 Generalized Linear Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>The GLM function is a generalized linear model (GLM) that performs regression and classification analysis on data sets, where the response follows an exponential or binomial family distribution.</p>
<p style = 'font-size:16px;font-family:Arial'>Because it uses gradient-based learning, the function is highly sensitive to feature scaling. Standardize the input features, for example with ScaleFit and ScaleTransform, before passing them to the function. The function accepts only numeric features, so categorical features must be converted to numeric values before training. Rows with missing (null) values are skipped during training.</p>
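<p style = 'font-size:16px;font-family:Arial'>For intuition about gradient-based learning on a binomial response, the following is a minimal pure-Python sketch of mini-batch gradient descent for logistic regression. It is not the TD_GLM implementation: the function names, the toy data, the fixed learning rate, and the absence of regularization are all simplifying assumptions.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_sgd(X, y, learning_rate=0.1, batch_size=2, epochs=200):
    """Fit intercept and weights by mini-batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(epochs):
        for start in range(0, len(X), batch_size):
            batch = list(zip(X[start:start + batch_size], y[start:start + batch_size]))
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for features, label in batch:
                # Gradient of the log loss w.r.t. the linear predictor is (p - y)
                error = sigmoid(intercept + sum(w * f for w, f in zip(weights, features))) - label
                grad_b += error
                for j, f in enumerate(features):
                    grad_w[j] += error * f
            intercept -= learning_rate * grad_b / len(batch)
            weights = [w - learning_rate * g / len(batch) for w, g in zip(weights, grad_w)]
    return intercept, weights

# Hypothetical, already-standardized single feature; label 1 for larger values
X = [[-1.5], [-0.5], [0.5], [1.5]]
y = [0, 0, 1, 1]
intercept, weights = fit_logistic_sgd(X, y)
predictions = [1 if sigmoid(intercept + weights[0] * row[0]) >= 0.5 else 0 for row in X]
print(predictions)
```

<p style = 'font-size:16px;font-family:Arial'>Each update is proportional to the feature value, which is why a feature on a much larger scale than the others would dominate the steps if the inputs were not standardized.</p>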

In [None]:
query = '''
CREATE MULTISET TABLE glm_model AS (
    SELECT * FROM TD_GLM (
        ON clean_data_train
        OUT TABLE MetaInformationTable(glm_out) 
        USING
            InputColumns('[3:13]')
            ResponseColumn('isFraud')
            Family('Binomial')
            BatchSize(10)
            MaxIterNum(300)
            RegularizationLambda(0.02)
            Alpha(0.15)
            IterNumNoChange(50)
            Intercept('true')
            LearningRate('optimal')
            InitialEta(0.05)
            Momentum(0)
            Nesterov('false')
            LocalSGDIterations(0)
    ) AS dt
) WITH DATA;
'''

try:
    execute_sql(query)
except:
    # Drop the tables and try again if the table already exists
    db_drop_table('glm_model')
    db_drop_table('glm_out')
    execute_sql(query)

<p style = 'font-size:16px;font-family:Arial'>In the next cell, we extract the feature importances and plot them. Remember to consider absolute value of the feature importances.</p>

In [None]:
glm_model_out = DataFrame('glm_model').to_pandas().reset_index()
feat_imp = glm_model_out[glm_model_out['attribute'] > 0].sort_values(by = 'estimate', ascending = False)

In [None]:
# Specify figure size
fig, ax = plt.subplots(figsize = (10, 6))

# Use ax.barh() for horizontal bar chart
ax.barh(feat_imp['predictor'], feat_imp['estimate'], edgecolor = 'red')

# Add text labels on right of the bars
for x, y in zip(feat_imp['estimate'], feat_imp['predictor']):
    ax.text(x, y, str(round(x, 2)), ha = 'left', va = 'center')

# Set x-axis label
ax.set_xlabel('Estimate')

plt.show()

<p style = 'font-size:16px;font-family:Arial'>The above plot shows that GLM model considers the transaction types: PAYMENT, CASH_OUT and CASH_IN as important.</p>

<p style = 'font-size:16px;font-family:Arial'>The TD_GLM output is a trained GLM model, which can be passed to the TD_GLMPredict function for prediction. The output also contains the model statistics MSE, Loglikelihood, AIC, and BIC.</p>
<p style = 'font-size:16px;font-family:Arial'><b><i>*Please scroll down to the end of the notebook for detailed output explanation.</i></b></p>

In [None]:
glm_model_out[glm_model_out['attribute'] < 1]

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.2 XGBoost</b></p>

<p style = 'font-size:16px;font-family:Arial'>The TD_XGBoost function implements eXtreme Gradient Boosting (XGBoost), a gradient boosted decision tree algorithm designed for speed and performance. It is widely used in applied machine learning.
<br>
<br>
In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration in order to correct the errors made by the existing models. The predicted residual is multiplied by a learning rate and then added to the previous prediction. Models are added sequentially until no further improvement can be made. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.</p>
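<p style = 'font-size:16px;font-family:Arial'>The residual-fitting loop described above can be sketched in a few lines of pure Python. This is a simplified illustration, not TD_XGBoost: it uses squared-error regression with one-split trees (stumps) on a single hypothetical feature, and every name in it is made up for the example.</p>

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.1):
    """Sequentially fit stumps to residuals; return predictions and per-round MSE."""
    predictions = [sum(ys) / len(ys)] * len(ys)   # start from the mean of the target
    history = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Shrink each correction by the learning rate before adding it
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, history

# Hypothetical toy data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
predictions, mse_history = boost(xs, ys)
print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

<p style = 'font-size:16px;font-family:Arial'>A smaller learning rate makes each tree's correction more cautious, so more trees are needed to reach the same training error.</p>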

In [None]:
# This cell might take up to a few minutes to run

query = '''
CREATE MULTISET TABLE xgb_model AS (
    SELECT * FROM TD_XGBoost(
        ON clean_data_train PARTITION BY ANY
        OUT TABLE MetaInformationTable(xgb_out) 
        USING
            ResponseColumn('isFraud')
            InputColumns('[3:13]')
            MaxDepth(10)
            NumBoostedTrees(100)
            ModelType('CLASSIFICATION')
            Seed(2)
            ShrinkageFactor(0.1)
            IterNum(10)
            ColumnSampling(1.0) 
    ) AS dt
) WITH DATA;
'''

try:
    execute_sql(query)
except:
    # Drop the tables and try again if the table already exists
    db_drop_table('xgb_model')
    db_drop_table('xgb_out')
    execute_sql(query)

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. In-Database Model Scoring</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>6.1 Generalized Linear Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_GLMPredict function predicts target values (regression) and class labels (classification) for test data using a GLM model trained by the TD_GLM function.</p>
<p style = 'font-size:16px;font-family:Arial'>As with TD_GLM, standardize the input features, for example with ScaleFit and ScaleTransform, before passing them to the function. The function accepts only numeric features, so categorical features must be converted to numeric values before prediction.</p>

In [None]:
query = '''
CREATE MULTISET TABLE glm_predict_out AS (
    SELECT * FROM TD_GLMPredict(
        ON "clean_data_test" AS inputtable
        PARTITION BY ANY 
        ON glm_model AS ModelTable
        DIMENSION
        USING
            IDColumn('txn_id')
            Accumulate('isFraud')
            OutputProb('True')
            Responses('0','1')
    ) AS dt
) WITH DATA;
'''

try:
    execute_sql(query)
except Exception as e:
    db_drop_table('glm_predict_out')
    execute_sql(query)

In [None]:
# Evaluate the GLM model's performance using TD_CLASSIFICATIONEVALUATOR

# Check if the necessary tables exist before executing the query
if not get_connection().dialect.has_table(get_connection(), 'glm_predict_out'):
    print('Error: glm_predict_out table does not exist.')
    sys.exit(1)

query = '''
SELECT * from TD_CLASSIFICATIONEVALUATOR(
    ON (
        SELECT
            CAST("isFraud" AS INTEGER) AS "isFraud",
            CAST(prediction AS INTEGER) as prediction
        FROM glm_predict_out
    ) AS InputTable
    OUT TABLE OutputTable(additional_metrics_glm)
    USING
        Labels(0, 1)
        ObservationColumn('isFraud')
        PredictionColumn('Prediction')
) AS dt1
ORDER BY 1, 2, 3;
'''

try:
    execute_sql(query)
except:
    db_drop_table('additional_metrics_glm')
    execute_sql(query)

In [None]:
metrics_glm = DataFrame('additional_metrics_glm').to_pandas()
metrics_glm['Metric'] = metrics_glm['Metric'].str.strip('\x00')
metrics_glm

In [None]:
glm_result = DataFrame('glm_predict_out').to_pandas()
glm_result

<p style = 'font-size:16px;font-family:Arial'>The output above shows prob_1, the probability that the transaction is fraudulent, and prob_0, the probability that it is not. The prediction column holds the class label derived from these probabilities.</p>
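<p style = 'font-size:16px;font-family:Arial'>As a sketch of how such columns relate to each other in a binomial GLM, the snippet below scores one row. The coefficients and feature values are hypothetical, and the 0.5 decision threshold is an assumption made for illustration; it is not taken from the TD_GLMPredict documentation.</p>

```python
import math

def score(intercept, coefficients, features, threshold=0.5):
    """Return (prob_0, prob_1, prediction) for one row of a binomial GLM."""
    linear = intercept + sum(c * f for c, f in zip(coefficients, features))
    prob_1 = 1.0 / (1.0 + math.exp(-linear))   # logistic (sigmoid) link
    prob_0 = 1.0 - prob_1
    prediction = 1 if prob_1 >= threshold else 0   # assumed threshold
    return prob_0, prob_1, prediction

# Hypothetical coefficients and one scaled feature row
prob_0, prob_1, prediction = score(intercept=-2.0, coefficients=[1.5, 0.5], features=[2.0, 1.0])
print(round(prob_0, 4), round(prob_1, 4), prediction)
```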

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>6.2 XGBoost</b></p>

In [None]:
query = '''
CREATE MULTISET TABLE xgb_predict_out AS (
    SELECT * FROM TD_XGBoostPredict(
        ON clean_data_test AS inputtable PARTITION BY ANY
        ON xgb_model AS modeltable DIMENSION ORDER BY task_index, tree_num, iter, class_num, tree_order
        USING
            IdColumn('txn_id')
            OutputProb('True')
            Responses('0','1')
            ModelType('classification')
            Accumulate('isFraud')
    ) AS dt
) WITH DATA;
'''

try:
    execute_sql(query)
except Exception as e:
    db_drop_table('xgb_predict_out')
    execute_sql(query)

In [None]:
# Evaluate the XGBoost model's performance using TD_CLASSIFICATIONEVALUATOR

# Check if the necessary tables exist before executing the query
if not get_connection().dialect.has_table(get_connection(), 'xgb_predict_out'):
    print('Error: xgb_predict_out table does not exist.')
    sys.exit(1)

query = '''
SELECT * from TD_CLASSIFICATIONEVALUATOR(
    ON (
        SELECT CAST("isFraud" AS INTEGER) AS "isFraud", prediction FROM xgb_predict_out
    ) AS InputTable
    OUT TABLE OutputTable(additional_metrics_xgb)
    USING
        Labels(0, 1)
        ObservationColumn('isFraud')
        PredictionColumn('Prediction')
) AS dt1
ORDER BY 1, 2, 3;
'''

try:
    execute_sql(query)
except:
    db_drop_table('additional_metrics_xgb')
    execute_sql(query)

In [None]:
metrics_xgb = DataFrame('additional_metrics_xgb').to_pandas()
metrics_xgb['Metric'] = metrics_xgb['Metric'].str.strip('\x00')
metrics_xgb

In [None]:
xgb_result = DataFrame('xgb_predict_out').to_pandas().reset_index()
xgb_result

<p style = 'font-size:16px;font-family:Arial'>The output above shows Prob_1, the probability that the transaction is fraudulent, and Prob_0, the probability that it is not. The Prediction column holds the class label derived from these probabilities.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>7. Visualize the results</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.1 Comparing the metrics</b></p>

In [None]:
# Set the width of the bars
bar_width = 0.35

# Generate the x-axis values for the bars
ind = np.arange(len(metrics_glm['Metric']))

# Create bar graph with increased size
fig, ax = plt.subplots(figsize = (10, 6))  # Set figsize to (width, height) in inches

# Define soothing colors for the bars
color1 = '#2C7BB6'  # blue
color2 = '#ABD18A'  # green

# Plotting bars for metrics_glm with soothing blue color
ax.bar(ind, metrics_glm['MetricValue'], bar_width, label = 'metrics_glm', color = color1)

# Plotting bars for metrics_xgb with soothing green color
ax.bar(ind + bar_width, metrics_xgb['MetricValue'], bar_width, label = 'metrics_xgb', color = color2)

ax.set_ylabel('MetricValue')
ax.set_xlabel('Metric')
ax.set_title('Comparison of MetricValues between metrics_glm and metrics_xgb', fontsize = 14)
ax.set_xticks(ind + bar_width / 2)
ax.set_xticklabels(metrics_glm['Metric'], rotation = 45, ha = 'right', fontsize = 12)
ax.legend()

plt.tight_layout()  # Add padding between subplots and prevent overlapping labels
plt.show()

<p style = 'font-size:16px;font-family:Arial'>In this use case, minimizing Type-2 errors (false negatives) is crucial, because we want to avoid fraudulent transactions being classified as non-fraud. The bar graph above indicates that the xgboost model outperforms the glm model in handling Type-2 errors.
    <br>
    <br>
This is supported by the higher sensitivity (recall) of the xgboost model in identifying fraud cases, as seen in the comparison of the macro-precision, macro-f1, and macro-recall values. Minimizing Type-2 errors reduces the risk of financial loss or reputational damage from undetected fraud, making the xgboost model the more reliable choice for fraud detection in this scenario.</p>
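<p style = 'font-size:16px;font-family:Arial'>To show how recall, false negatives, and macro-averaged metrics are related, here is a small pure-Python sketch on hypothetical labels. The helper functions are illustrative and are not the TD_ClassificationEvaluator implementation; macro averaging is taken here to be the unweighted mean of the per-class values.</p>

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "false_negatives": fn}

def macro_average(y_true, y_pred, metric):
    """Unweighted mean of a per-class metric over all classes in y_true."""
    labels = sorted(set(y_true))
    return sum(class_metrics(y_true, y_pred, label)[metric] for label in labels) / len(labels)

# Hypothetical labels: 1 = fraud, 0 = non-fraud
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
fraud = class_metrics(y_true, y_pred, positive=1)
macro_recall = macro_average(y_true, y_pred, "recall")
print(fraud, round(macro_recall, 4))
```

<p style = 'font-size:16px;font-family:Arial'>Each false negative is a fraud case the model missed, so recall for the fraud class falls as the number of Type-2 errors rises.</p>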

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.2 Confusion Matrix</b></p>

In [None]:
# Calculate confusion matrix for GLM
cm_glm = confusion_matrix(glm_result['isFraud'], glm_result['prediction'])

# Calculate confusion matrix for XGBoost
cm_xgb = confusion_matrix(xgb_result['isFraud'], xgb_result['Prediction'])

# Create figure and axes objects
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (16, 8))

# Plot GLM confusion matrix
disp_glm = ConfusionMatrixDisplay(confusion_matrix = cm_glm, display_labels = ['No Fraud', 'Fraud'])
disp_glm.plot(ax = ax1, cmap = 'Blues', colorbar = False)
ax1.set_title('GLM Confusion Matrix')
ax1.set_xlabel('Predicted Label')
ax1.set_ylabel('True Label')
ax1.set_xticks([0, 1])
ax1.set_yticks([0, 1])
ax1.set_xticklabels(['No Fraud', 'Fraud'])
ax1.set_yticklabels(['No Fraud', 'Fraud'])

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm_glm.shape[0]):
    for j in range(cm_glm.shape[1]):
        ax1.text(j, i, f'{cm_glm[i, j]}', ha = 'center', va = 'center', color = 'white' if cm_glm[i, j] > cm_glm.max() / 2 else 'black')

# Plot XGBoost confusion matrix
disp_xgb = ConfusionMatrixDisplay(confusion_matrix = cm_xgb, display_labels = ['No Fraud', 'Fraud'])
disp_xgb.plot(ax = ax2, cmap = 'Blues', colorbar = False)
ax2.set_title('XGBoost Confusion Matrix')
ax2.set_xlabel('Predicted Label')
ax2.set_ylabel('True Label')
ax2.set_xticks([0, 1])
ax2.set_yticks([0, 1])
ax2.set_xticklabels(['No Fraud', 'Fraud'])
ax2.set_yticklabels(['No Fraud', 'Fraud'])

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm_xgb.shape[0]):
    for j in range(cm_xgb.shape[1]):
        ax2.text(j, i, f'{cm_xgb[i, j]}', ha = 'center', va = 'center', color = 'white' if cm_xgb[i, j] > cm_xgb.max() / 2 else 'black')

# Adjust layout and spacing
plt.tight_layout()

# Show the plot
plt.show()

<p style = 'font-size:16px;font-family:Arial'>The comparison of the confusion matrices reveals that the xgboost model demonstrates superior performance in detecting fraud cases compared to the glm model. While the xgboost model may misclassify some non-fraud cases as fraud, it is important to note that the primary objective is to minimize false negatives, or type-2 errors. This underscores the significance of prioritizing high recall or sensitivity in fraud detection scenarios, where the consequences of missed fraud cases can be substantial.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.3 ROC-AUC</b></p>
<p style = 'font-size:16px;font-family:Arial'>The ROC curve is a graphical representation of the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) for different classification thresholds. AUC measures the overall performance of a classification model, where a higher value indicates better performance. AUC above 0.75 is generally considered decent.</p>
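<p style = 'font-size:16px;font-family:Arial'>One way to build intuition for AUC is its ranking interpretation: it equals the fraction of (fraud, non-fraud) pairs in which the fraud case receives the higher score. The pure-Python sketch below computes it that way on hypothetical scores; it is an illustration, not how <code>roc_auc_score</code> is implemented.</p>

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted fraud probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

<p style = 'font-size:16px;font-family:Arial'>A value of 0.5 corresponds to random ranking (the dashed diagonal in a ROC plot), and 1.0 means every fraud case outranks every non-fraud case.</p>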

In [None]:
# Plot 1
AUC_glm = roc_auc_score(glm_result['isFraud'], glm_result['prob_1'])
fpr_glm, tpr_glm, thresholds_glm = roc_curve(glm_result['isFraud'], glm_result['prob_1'])
plt.plot(fpr_glm, tpr_glm, color = 'orange', label='GLM ROC. AUC = {}'.format(str(round(AUC_glm, 4))))

# Plot 2
AUC_xgb = roc_auc_score(xgb_result['isFraud'], xgb_result['Prob_1'])
fpr_xgb, tpr_xgb, thresholds_xgb = roc_curve(xgb_result['isFraud'], xgb_result['Prob_1'])
plt.plot(fpr_xgb, tpr_xgb, color = 'green', label = 'XGB ROC. AUC = {}'.format(str(round(AUC_xgb, 4))))

# Plot the diagonal dashed line
plt.plot([0, 1], [0, 1], color = 'darkblue', linestyle = '--')

# Set labels and title
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')

# Add legend
plt.legend()

# Show the plot
plt.show()

<p style = 'font-size:16px;font-family:Arial'>Based on the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) analysis, it is evident that the xgboost model performs better compared to the glm model in this use case.
<br>
<br>
The ROC curve of the xgboost model consistently shows higher true positive rates at various false positive rates compared to the glm model. This indicates that the xgboost model has better sensitivity or ability to correctly identify true positive cases (fraud transactions) while maintaining a lower false positive rate or misclassification of non-fraud cases.
<br>
<br>
In conclusion, both the ROC curve and AUC analysis suggest that the xgboost model performs better than the glm model in accurately classifying fraud and non-fraud transactions, and minimizing both false negatives and false positives.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>8. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors the next time the notebook runs.</p>

In [None]:
tables = ['clean_data', 'TrainTestSplit_output', 'clean_data_train', 'clean_data_test',
          'glm_model', 'glm_out', 'xgb_model', 'xgb_out', 'glm_predict_out',
          'additional_metrics_glm', 'xgb_predict_out', 'additional_metrics_xgb']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name = table)
    except:
        pass

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('demo_glm_fraud');"        # Takes 5 seconds

In [None]:
remove_context()

<b style = 'font-size:24px;font-family:Arial;color:#E37C4D'>Dataset:</b>

- `txn_id`: transaction id
- `step`: a unit of time in the real world. Here, 1 step represents 1 hour, for a total of 744 steps (a 31-day simulation).
- `type`: transaction type, one of CASH-IN, CASH-OUT, DEBIT, PAYMENT, or TRANSFER
- `amount`: amount of the transaction in local currency
- `nameOrig`: customer who started the transaction
- `oldbalanceOrig`: customer's balance before the transaction
- `newbalanceOrig`: customer's balance after the transaction
- `nameDest`: customer who is the recipient of the transaction
- `oldbalanceDest`: recipient's balance before the transaction
- `newbalanceDest`: recipient's balance after the transaction
- `isFraud`: marks a transaction as fraudulent (1) or non-fraudulent (0)
- `isFlaggedFraud`: flags illegal attempts to transfer more than 200,000 in a single transaction

<b style = 'font-size:24px;font-family:Arial;color:#E37C4D'>Model Output Explanations:</b>
<br>
<b style = 'font-size:16px;font-family:Arial;color:#E37C4D'>Model Configuration</b>
- `Loss Function (LOG)`: The logarithmic loss function measures the difference between the predicted probabilities and the actual outcomes. It is commonly used in binary classification problems.
- `Regularization (Enabled, 0.02)`: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. The value 0.02 indicates the strength of this penalty.
- `Alpha (Elasticnet, 0.15)`: Alpha is a parameter in the Elastic Net regularization method, which combines L1 (Lasso) and L2 (Ridge) regularization. A value of 0 represents pure L2 regularization, while a value of 1 represents pure L1 regularization. In this case, the model uses a combination of both (0.15).
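
The way the regularization strength and alpha combine can be sketched as follows. This assumes the common glmnet-style form of the Elastic Net penalty; the exact scaling used inside Vantage is not shown in the model output, so treat the formula, the function name `elastic_net_penalty`, and the example weights as illustrative assumptions.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Elastic Net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    Assumed glmnet-style convention; alpha = 0 is pure L2 (Ridge),
    alpha = 1 is pure L1 (Lasso).
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Hypothetical coefficient values, for illustration only
weights = [0.5, -1.0, 2.0]

# Settings reported in the model output: regularization 0.02, alpha 0.15
penalty = elastic_net_penalty(weights, lam=0.02, alpha=0.15)
ridge_only = elastic_net_penalty(weights, lam=0.02, alpha=0.0)
lasso_only = elastic_net_penalty(weights, lam=0.02, alpha=1.0)
print(penalty, ridge_only, lasso_only)
```

With alpha at 0.15, most of the penalty comes from the L2 term, so coefficients are shrunk smoothly and only weakly pushed toward exactly zero.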

<b style = 'font-size:16px;font-family:Arial;color:#E37C4D'>Optimization Algorithm</b>
- `Learning Rate (Initial, 0.05; Final, 0.388190)`: The learning rate determines the step size the optimization algorithm takes during each iteration. The initial learning rate is set to 0.05, and it reaches a final value of 0.388190.
- `Momentum (0.0)`: Momentum is a technique used in optimization algorithms to speed up convergence by adding a fraction of the previous update to the current update. A momentum value of 0.0 means it's not being used in this case.
- `Nesterov (FALSE)`: Nesterov momentum is a variation of the standard momentum method. This parameter being set to FALSE indicates it is not being used.
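
The role of these three settings can be illustrated with a generic gradient descent update. This is a textbook sketch on a toy one-dimensional objective, not the optimizer implemented in Vantage; the function `sgd_step` and the objective are assumptions made for illustration.

```python
def sgd_step(weight, velocity, grad_fn, learning_rate, momentum=0.0, nesterov=False):
    """One gradient descent update with optional (Nesterov) momentum."""
    # Nesterov evaluates the gradient at the look-ahead position
    position = weight + momentum * velocity if nesterov else weight
    gradient = grad_fn(position)
    # The velocity carries over a fraction of the previous update
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
def toy_gradient(w):
    return 2.0 * (w - 3.0)


# Momentum 0.0 and Nesterov FALSE reduce to plain gradient descent
w, v = 0.0, 0.0
w, v = sgd_step(w, v, toy_gradient, learning_rate=0.05)
first_step = w

for _ in range(199):
    w, v = sgd_step(w, v, toy_gradient, learning_rate=0.05)
print(first_step, w)
```

With momentum set to 0.0, the velocity term is just the current gradient step, so the Nesterov flag has no effect on the result.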

<b style = 'font-size:16px;font-family:Arial;color:#E37C4D'>Model Performance</b>
- `Log-likelihood (-0.013004)`: Log-likelihood is a measure of how well the model fits the observed data. Higher values indicate a better fit.
- `Akaike Information Criterion (AIC, 26.026008)`: AIC is used to compare models with different numbers of parameters, balancing goodness of fit with model complexity. Lower values indicate a better model.
- `Bayesian Information Criterion (BIC, 140.9153)`: BIC is similar to AIC but places a stronger penalty on model complexity. Lower values indicate a better model.
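
The reported AIC and BIC can be reproduced from the reported log-likelihood using the standard definitions AIC = 2k - 2·lnL and BIC = k·ln(n) - 2·lnL. The parameter count k = 13 does not appear in the model output; it is inferred here because it is the value that reproduces the reported AIC, so treat it as an assumption.

```python
import math

log_likelihood = -0.013004   # reported log-likelihood
n_obs = 50901                # reported number of observations
k = 13                       # assumed parameter count, inferred from the reported AIC

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood

print(f"AIC = {aic:.6f}")
print(f"BIC = {bic:.4f}")
```

Because ln(50,901) is about 10.84, each parameter costs roughly 10.84 in BIC but only 2 in AIC, which is why BIC penalizes model complexity more strongly on a dataset of this size.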

<b style = 'font-size:16px;font-family:Arial;color:#E37C4D'>Convergence</b>
- `Number of Observations (50,901)`: The total number of data points used in the model.
- `Number of Iterations (111, Converged)`: The number of iterations the optimization algorithm ran before stopping. "Converged" indicates that the stopping criterion was met, so further iterations would not meaningfully improve the fit.
- `Intercept (-4.113711)`: The intercept (or bias term) is the model's output on the log-odds scale when all input features are set to zero.
- `LocalSGD Iterations (0)`: The number of iterations used in the Local Stochastic Gradient Descent (SGD) method. A value of 0 indicates it's not being used in this case.
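
For a logistic model, the intercept can be turned into a baseline probability by applying the sigmoid function. The sketch below assumes the reported intercept is on the log-odds scale; the function name `sigmoid` is introduced here for illustration.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


intercept = -4.113711   # reported intercept

# Predicted fraud probability when every input feature equals zero
baseline_probability = sigmoid(intercept)
print(f"Baseline fraud probability: {baseline_probability:.4f}")
```

If the features were scaled before training, "all features equal zero" corresponds to a reference point in the scaled feature space rather than to literal zero balances and amounts.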

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Dataset source: <a href = 'https://www.kaggle.com/datasets/ealaxi/paysim1'>https://www.kaggle.com/datasets/ealaxi/paysim1</a></li>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>