<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>GLM Fraud Detection with Python and Teradata SQL</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>
    In recent years, fraud attempts have increased sharply, making fraud detection a necessity for banking and financial institutions. Despite extensive effort and human supervision, hundreds of millions are lost to fraud. Fraud takes many forms, e.g., stolen credit cards, misleading accounting, and phishing emails. Because fraud cases are rare within a large population of transactions, detecting them is both important and challenging.
    <br>
    <br>
    This notebook demonstrates a data science workflow that uses Vantage's <b>Advanced SQL Engine (was NewSQL Engine)</b> to build, validate, and score a model at scale. Large-scale operations such as feature analysis, data transformation, model training, and model scoring all run inside Vantage, without moving the data.</p>


<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the environment</li>
    <li>Initiate a connection to Vantage</li>
    <li>Read the data from Vantage as a teradataml Dataframe</li>
    <li>Clean up the dataset</li>
    <li>Create training and testing datasets in Vantage</li>
    <li>In-Database GLM model training</li>
    <li>In-Database GLM model scoring</li>
    <li>Visualize the results (ROC curve and AUC)</li>
    <li>Cleanup</li>
</ol>

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Accessing the Data</b>
<p style = 'font-size:16px;font-family:Arial'>These demos work either with foreign tables accessed from Cloud Storage via NOS or with tables imported to your machine. If you import data for multiple demos, you may need to use the Data Dictionary "Manage Your Space" routine to clean up tables you no longer need.</p>

<p style = 'font-size:16px;font-family:Arial'>Use the links below to open the Data Dictionary notebook, which offers both options for getting the data:</p>

[Click Here to get data for this notebook](../Data_Dictionary/Data_Dictionary.ipynb#TRNG_GLMFraud)

[Click Here to Manage Your Space](../Data_Dictionary/Data_Dictionary.ipynb#Manage_Your_Space)

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Configuring the Environment</b>
<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import os
import getpass
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from teradataml.dataframe.dataframe import DataFrame
from teradataml.dataframe.copy_to import copy_to_sql
from teradataml.dataframe.dataframe import in_schema
from teradataml.context.context import create_context, remove_context

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

warnings.filterwarnings("ignore")
%matplotlib inline

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>You might be prompted to enter the password.</p>

In [None]:
eng = create_context(host = 'host.docker.internal', username='demo_user', password = getpass.getpass())
print(eng)

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Read the data from Vantage as a teradataml Dataframe</b>
<p style = 'font-size:16px;font-family:Arial'>The data from <a href = 'https://www.kaggle.com/code/georgepothur/4-financial-fraud-detection-xgboost/data'>https://www.kaggle.com/code/georgepothur/4-financial-fraud-detection-xgboost/data</a> is loaded in Vantage in a table named "transaction_data". The cell below checks the data size and prints sample rows; the table has about 63k rows and 12 columns.</p>
<p style = 'font-size:16px;font-family:Arial'>Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</p>

In [None]:
txn_data = DataFrame(in_schema('TRNG_GLMFraud','transaction_data'))

print(txn_data.shape)
txn_data.to_pandas(num_rows = 5).head()

<p style = 'font-size:16px;font-family:Arial'>Here we rename a misspelt column without moving the data out of Vantage. We are renaming <b>oldbalanceOrg</b> to <b>oldbalanceOrig</b></p>

In [None]:
new_data = txn_data.assign(oldbalanceOrig = txn_data.oldbalanceOrg).drop(['oldbalanceOrg'] , axis=1)

new_data.to_pandas(num_rows = 5).head()

<p style = 'font-size:16px;font-family:Arial'>These transactions are made by fraudulent agents inside a simulation. In this dataset, the agents aim to profit by taking control of customers' accounts and emptying the funds, first by transferring them to another account and then by cashing out of the system.</p>
<p style = 'font-size:16px;font-family:Arial'><b>Below are some insights about the dataset:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>There are 92 fraud transactions, i.e. 0.14% of the transactions in the dataset.</li>
    <li>Of these 92 fraud transactions, 47 are of type TRANSFER and 45 are of type CASH_OUT.</li>
    <li>97.83% of fraud transactions have a transaction amount equal to oldbalanceOrig, i.e. the account is cleaned out.</li>
    <li>71.74% of fraud transactions have a recipient's old balance of zero.</li>
    <li>isFlaggedFraud correctly flags only two of the 92 fraud transactions.</li>
</ol>
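
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of how insights like these can be derived, the sketch below computes the same kinds of statistics with pandas. The toy DataFrame and its values are hypothetical and are not taken from the real table; only the column names follow the dataset.</p>

```python
import pandas as pd

# Hypothetical toy transactions; the values are made up for illustration only.
toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "CASH_OUT", "TRANSFER", "DEBIT"],
    "amount": [500.0, 120.0, 40.0, 300.0, 75.0, 10.0],
    "oldbalanceOrig": [500.0, 900.0, 100.0, 300.0, 200.0, 50.0],
    "oldbalanceDest": [0.0, 10.0, 0.0, 0.0, 35.0, 5.0],
    "isFraud": [1, 0, 0, 1, 0, 0],
})

fraud = toy[toy["isFraud"] == 1]

# Share of fraudulent transactions in the whole dataset, in percent
fraud_rate_pct = 100 * len(fraud) / len(toy)

# Number of fraud transactions per transaction type
fraud_by_type = fraud["type"].value_counts().to_dict()

# Share of fraud transactions that empty the origin account
cleanout_pct = 100 * (fraud["amount"] == fraud["oldbalanceOrig"]).mean()

# Share of fraud transactions whose recipient started with a zero balance
zero_dest_pct = 100 * (fraud["oldbalanceDest"] == 0).mean()

print(round(fraud_rate_pct, 2), fraud_by_type, cleanout_pct, zero_dest_pct)
```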

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Clean up the dataset</b>
<p style = 'font-size:16px;font-family:Arial'>Based on what we discovered above, we will:</p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Remove all rows whose type is not 'CASH_OUT' or 'TRANSFER', since every fraud transaction is one of these two types.</li>
    <li>Drop "nameOrig" and "nameDest", since the origin and destination account identifiers are not needed for the model.</li>
    <li>Drop "isFlaggedFraud" because it flags only two transactions and therefore carries little signal.</li>
</ol> 

In [None]:
clean_data = new_data.loc[(new_data.type == 'CASH_OUT') | (new_data.type == 'TRANSFER')]
clean_data.shape

<p style = 'font-size:16px;font-family:Arial'>Now our dataset is reduced to 27k records.</p>

In [None]:
clean_data = clean_data.drop(['nameDest', 'nameOrig', 'isFlaggedFraud'], axis = 1)
clean_data.head(5)

In [None]:
#create the source data table in the database
clean_data.to_sql('clean_data', if_exists = 'replace', primary_index='txn_id')

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Create training and testing datasets in Vantage</b>

<p style = 'font-size:16px;font-family:Arial'><b>We'll perform the following steps:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Create training and testing datasets in the ratio 80:20</li>
    <li>Use TD_ScaleFit to compute the scaling statistics</li>
    <li>Use TD_ScaleTransform to scale the training and testing datasets</li>
</ul>
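
<p style = 'font-size:16px;font-family:Arial'>The 80:20 split can be pictured with a small local sketch: draw a random 80% sample for training and treat every row that was not sampled as test data. The DataFrame below is a hypothetical stand-in for the clean_data table, and pandas is used only for illustration; the notebook itself performs the split in SQL inside Vantage.</p>

```python
import pandas as pd

# Hypothetical stand-in for the clean_data table
clean = pd.DataFrame({
    "txn_id": range(1, 11),
    "amount": [float(i) * 10 for i in range(1, 11)],
})

# 80% random sample for training (random_state is fixed only for repeatability)
train = clean.sample(frac=0.8, random_state=0)

# Every row that was not sampled becomes the test set
test = clean[~clean["txn_id"].isin(train["txn_id"])]

print(len(train), len(test))
```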

<p style = 'font-size:16px;font-family:Arial'>TD_ScaleFit outputs a table of statistics, which TD_ScaleTransform then uses to scale the specified input table columns.</p>

<p style = 'font-size:16px;font-family:Arial'>Feature scaling is performed during data pre-processing to handle features whose magnitudes, ranges, or units differ widely. Without it, a machine learning algorithm tends to give more weight to features with larger values and less weight to features with smaller values, regardless of their units.</p>
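
<p style = 'font-size:16px;font-family:Arial'>The fit-then-transform idea can be sketched locally with NumPy. This is only an illustration of standardization, not the TD_ScaleFit implementation: the arrays are hypothetical, and the use of the population standard deviation is an assumption. The statistics are learned from the training rows only and then applied to both sets.</p>

```python
import numpy as np

# Hypothetical features: column 0 has small values, column 1 has large values
train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
test = np.array([[2.0, 2000.0], [4.0, 7000.0]])

# "Fit": learn per-column location and scale from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), an assumption

# "Transform": apply the training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

<p style = 'font-size:16px;font-family:Arial'>After scaling, both training columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.</p>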

In [None]:
# Create clean_data_train table using SAMPLE

qry = '''
CREATE MULTISET TABLE clean_data_train 
  AS(SELECT * FROM clean_data SAMPLE 0.8) 
WITH DATA
PRIMARY INDEX (txn_id);
'''

try:
    eng.execute(qry)
except Exception:
    # The table probably exists from an earlier run: drop it and retry
    eng.execute('DROP TABLE clean_data_train;')
    eng.execute(qry)

In [None]:
qry = '''SELECT * FROM TD_ScaleFit(
    ON clean_data_train as InputTable
    OUT VOLATILE TABLE OutputTable(scale_train)
    USING
    TargetColumns('step', 'amount','newbalanceOrig','oldbalanceDest','newbalanceDest','oldbalanceOrig')
    ScaleMethod('STD')
) as dt;'''

eng.execute(qry)

In [None]:
qry = '''CREATE MULTISET TABLE transformed_train AS (
    SELECT * FROM TD_ScaleTransform(
        ON clean_data_train AS InputTable
        ON scale_train AS FitTable DIMENSION
        USING
        accumulate('txn_id', 'isFraud')
    ) AS dt
) WITH data;'''

eng.execute(qry)

In [None]:
# Create clean_data_test table from the rows of clean_data that are not in clean_data_train

qry = '''
CREATE MULTISET TABLE clean_data_test
  AS(SELECT * FROM clean_data
    EXCEPT
    SELECT * FROM clean_data_train)
WITH DATA
PRIMARY INDEX (txn_id); 
'''

try:
    eng.execute(qry)
except Exception:
    # The table probably exists from an earlier run: drop it and retry
    eng.execute('DROP TABLE clean_data_test;')
    eng.execute(qry)

In [None]:
# Note: the next cell transforms the test set with scale_train, so scale_test is not used there
qry = '''SELECT * FROM TD_ScaleFit(
    ON clean_data_test as InputTable
    OUT VOLATILE TABLE OutputTable(scale_test)
    USING
    TargetColumns('step', 'amount','newbalanceOrig','oldbalanceDest','newbalanceDest','oldbalanceOrig')
    ScaleMethod('STD')
) as dt;'''

eng.execute(qry)

In [None]:
qry = '''CREATE MULTISET TABLE transformed_test AS (
    SELECT * FROM TD_ScaleTransform(
        ON clean_data_test AS InputTable
        ON scale_train AS FitTable DIMENSION
        USING
        accumulate('txn_id', 'isFraud')
    ) AS dt
) WITH data;'''

eng.execute(qry)

In [None]:
temp = pd.read_sql('select * from transformed_train', eng)
temp.head(5)

<p style = 'font-size:16px;font-family:Arial'>The above output shows that the data has been transformed into a scaled dataset. Scaling makes it easier for the model to learn from the data.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. In-Database GLM model training</b>
<p style = 'font-size:16px;font-family:Arial'>The TD_GLM function is a generalized linear model (GLM) that performs regression and classification analysis on data sets, where the response follows an exponential family distribution.</p>
<p style = 'font-size:16px;font-family:Arial'>Because it uses gradient-based learning, the function is highly sensitive to feature scaling. Input features should be standardized, for example with TD_ScaleFit and TD_ScaleTransform, before they are passed to the function. The function accepts only numeric features, so categorical features must be converted to numeric values before training. Rows with missing (null) values are skipped during training.</p>
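
<p style = 'font-size:16px;font-family:Arial'>To make "gradient-based learning" concrete, here is a minimal NumPy sketch of a binomial GLM (logistic regression) fitted by batch gradient descent. It is not the TD_GLM implementation: the function names, learning rate, iteration count, and synthetic data are all hypothetical choices made for illustration.</p>

```python
import numpy as np

def sigmoid(z):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit intercept and weights by batch gradient descent on the mean log-loss."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                # current predicted probabilities
        grad = Xb.T @ (p - y) / len(y)     # gradient of the mean log-loss
        w -= lr * grad
    return w

def predict_proba(X, w):
    """Return P(y = 1) for each row of X."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)

# Synthetic, already standardized features with a linear decision rule
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w = fit_logistic(X, y)
accuracy = ((predict_proba(X, w) > 0.5) == y).mean()
print(w, accuracy)
```

<p style = 'font-size:16px;font-family:Arial'>Because every weight is updated with the same step size, a feature with a much larger range would dominate the gradient, which is why standardizing the inputs first matters for this kind of optimizer.</p>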

In [None]:
qry = '''CREATE VOLATILE TABLE glm_model AS (
SELECT * from TD_GLM (
ON transformed_train
USING
InputColumns('step', 'amount','newbalanceOrig','oldbalanceDest','newbalanceDest','oldbalanceOrig')
ResponseColumn('isFraud')
Family('Binomial')
) as dt) WITH DATA
ON COMMIT PRESERVE ROWS
;'''

eng.execute(qry)

<p style = 'font-size:16px;font-family:Arial'>The function output is a trained GLM model which can be input to the TD_GLMPredict function for prediction. The model also contains model statistics of MSE, Loglikelihood, AIC, and BIC.</p>

In [None]:
pd.read_sql('select * from glm_model', eng)

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>7. In-Database GLM model scoring</b>
<p style = 'font-size:16px;font-family:Arial'>The TD_GLMPredict function predicts target values (regression) and class labels (classification) for test data using a GLM model trained by the TD_GLM function.</p>
<p style = 'font-size:16px;font-family:Arial'>As with TD_GLM, input features should be standardized, for example with TD_ScaleFit and TD_ScaleTransform, before they are passed to the function. The function accepts only numeric features, so categorical features must be converted to numeric values before prediction.</p>

In [None]:
qry = '''CREATE VOLATILE TABLE glm_prediction AS (
SELECT * from TD_GLMPredict (
ON transformed_test AS INPUTTABLE
ON glm_model AS Model DIMENSION
USING
IDColumn ('txn_id')
Accumulate('isFraud')
OutputProb('true')
Responses ('1','0')
) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS
;'''

eng.execute(qry)

In [None]:
pred = pd.read_sql('select * from glm_prediction;', eng)
pred

<p style = 'font-size:16px;font-family:Arial'>The output above shows prob_1, the probability that a transaction is fraud, and prob_0, the probability that it is not. The prediction column holds the class label derived from these probabilities.</p>
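
<p style = 'font-size:16px;font-family:Arial'>One common way to turn probabilities into a class label is a fixed cutoff. The sketch below assumes a threshold of 0.5 on prob_1; that threshold, like the toy values, is an assumption for illustration and is not confirmed as the exact rule TD_GLMPredict applies.</p>

```python
import pandas as pd

# Hypothetical scored rows using the column names from the prediction output
scored = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "prob_1": [0.92, 0.10, 0.55],
})
scored["prob_0"] = 1.0 - scored["prob_1"]

# Assumed rule: label a row as fraud (1) when prob_1 is at least 0.5
threshold = 0.5
scored["prediction"] = (scored["prob_1"] >= threshold).astype(int)

print(scored)
```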

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>8. Visualize the results (ROC curve and AUC)</b>
<p style = 'font-size:16px;font-family:Arial'>Calculate the mean absolute error and the AUC (Area Under the Curve) for the Receiver Operating Characteristic (ROC) curve.</p>
<p style = 'font-size:16px;font-family:Arial'>Mean absolute error is the sum of the absolute differences between actual and predicted values, divided by the number of observations.</p>
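
<p style = 'font-size:16px;font-family:Arial'>As a small worked example, the definition can be written out directly in NumPy. The labels and probabilities below are hypothetical and are unrelated to the model output.</p>

```python
import numpy as np

# Hypothetical true labels and predicted fraud probabilities
y_true = np.array([1, 0, 0, 1])
y_prob = np.array([0.9, 0.2, 0.1, 0.6])

# Mean of |actual - predicted| over all observations
mae = np.mean(np.abs(y_true - y_prob))
print(mae)
```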

In [None]:
print(mean_absolute_error(pred['isFraud'], pred['prob_1']))

<p style = 'font-size:16px;font-family:Arial'>The ROC curve plots TPR (True Positive Rate) against FPR (False Positive Rate). The area under the ROC curve measures how well the model distinguishes between the positive and negative classes: the higher the AUC, the better. An AUC above 0.75 is generally considered decent.</p>
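
<p style = 'font-size:16px;font-family:Arial'>For intuition, the curve and its area can be computed by hand: sort the rows by score, lower the threshold one row at a time, and accumulate true and false positives. The helper names and toy data below are hypothetical, and the sketch assumes there are no tied scores.</p>

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) obtained by sweeping the threshold over the scores."""
    order = np.argsort(-scores)        # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)          # true positives at each cut
    fps = np.cumsum(1 - y_sorted)      # false positives at each cut
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Hypothetical labels and scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print(fpr, tpr, auc)
```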

In [None]:
AUC = roc_auc_score(pred['isFraud'], pred['prob_1'])
AUC

In [None]:
fpr, tpr, thresholds = roc_curve(pred['isFraud'], pred['prob_1'])
plt.plot(fpr, tpr, color='orange', label='ROC. AUC = {}'.format(str(AUC)))
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

<p style = 'font-size:16px;font-family:Arial'>Looking at the ROC curve above, we can say that the model performs well on the testing data. The AUC value is well above 0.75, which is consistent with that conclusion.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>9. Cleanup</b>

In [None]:
eng.execute('DROP TABLE clean_data;')

In [None]:
eng.execute('DROP TABLE clean_data_train;')

In [None]:
eng.execute('drop table transformed_train;')

In [None]:
eng.execute('DROP TABLE clean_data_test;')

In [None]:
eng.execute('drop table transformed_test;')

In [None]:
eng.execute('DROP TABLE glm_model;')

In [None]:
eng.execute('DROP TABLE glm_prediction;')

In [None]:
remove_context()

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Dataset:</b>

- `txn_id`: transaction id
- `step`: maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (31 days simulation).
- `type`: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER
- `amount`: amount of the transaction in local currency
- `nameOrig`: customer who started the transaction
- `oldbalanceOrig`: customer's balance before the transaction
- `newbalanceOrig`: customer's balance after the transaction
- `nameDest`: customer who is the recipient of the transaction
- `oldbalanceDest`: recipient's balance before the transaction
- `newbalanceDest`: recipient's balance after the transaction
- `isFraud`: identifies a fraudulent transaction (1) or a non-fraudulent one (0)
- `isFlaggedFraud`: flags illegal attempts to transfer more than 200,000 in a single transaction

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Uses a dataset and feature discovery methods outlined here: <a href = 'https://www.kaggle.com/georgepothur/4-financial-fraud-detection-xgboost/notebook'>https://www.kaggle.com/georgepothur/4-financial-fraud-detection-xgboost/notebook</a></li>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/May-2022/Introduction-to-Teradata-Package-for-Python'>https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/May-2022/Introduction-to-Teradata-Package-for-Python</a></li>
</ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>