<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Financial Fraud Detection using AutoML
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>AutoML Approach</b></p>
<p style = 'font-size:16px;font-family:Arial'>Teradataml Automated Machine Learning (AutoML) automates the end-to-end machine learning flow. It raises data scientist productivity by automatically training high-quality models specific to a business need. AutoML covers the distinct phases of the machine learning pipeline: feature exploration, feature engineering, data preparation, model selection, model training with hyperparameter tuning, and model evaluation. By automating these tasks, AutoML reduces the manual intervention required from trained data scientists and lowers the prerequisite knowledge required for beginners, so users of varying expertise levels can create machine learning models in an automated fashion.
</p>


<p style = 'font-size:16px;font-family:Arial'>Key Features of Teradata AutoML approach:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Helps users determine the best-performing model automatically.</li>
    <li>Makes model building easier.</li>
    <li>Supports various problem types, including Regression, Binary Classification, and Multiclass Classification.</li>
    <li>Provides five different models for training: GLM, SVM, Decision Forest, XGBoost, and KNN.</li>
    <li>Offers the flexibility to select specific models out of the available models.</li>
    <li>All five phases are automated and can be customized based on user input.</li>
    <li>Generates a model leaderboard and leader for a given dataset.</li>
    <li>Allows prediction on the validation dataset and on user-supplied data using models from the leaderboard.</li>
    
</ul>

<p style = 'font-size:16px;font-family:Arial'>Below are the different phases of AutoML:</p>
<center><img src="images/AutoML_phases.png" alt="efs" width=800 height=1200  style = "border: 4px solid #404040; padding-right: 10px; border-radius: 10px;"/></center>

<p style = 'font-size:18px;font-family:Arial'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial'>To maximize the business value of advanced analytic techniques including Machine Learning and Artificial Intelligence, it is estimated that organizations must scale their model development and deployment pipelines to 100s or 1000s of times greater amounts of data, models, or both.</p>    

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Configuring the Environment</b>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
%%capture
!pip install teradataml==20.0.0.5 
!pip install teradatasqlalchemy==20.0.0.5

<p style = 'font-size:16px;font-family:Arial'>Enterprise Feature Store is a new feature added in teradataml 20.0.0.3, so we upgrade the installed teradataml version.</p>

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>Please restart the kernel after executing the above command to bring the upgraded library into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
# Standard Libraries
import os
import getpass
import warnings
warnings.filterwarnings("ignore")

# Teradata Libraries
from teradataml import *

# Configuration
spacing_large = " "*95
spacing_small = " "*12
display.max_rows = 5
configure.val_install_location = 'val'

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../../startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=EE_Financial_Fraud_Detection_AutoML_Approach.ipynb;' UPDATE FOR SESSION;")

<p style = 'font-size:16px;font-family:Arial'>We begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>

In [None]:
# %run -i ../../run_procedure.py "call get_data('DEMO_GLM_Fraud_cloud');"        # Takes 1 minute
%run -i ../../run_procedure.py "call get_data('DEMO_GLM_Fraud_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. Data Exploration</b>
<p style = 'font-size:16px;font-family:Arial'>We loaded the data from <a href = 'https://www.kaggle.com/code/georgepothur/4-financial-fraud-detection-xgboost/data'>https://www.kaggle.com/code/georgepothur/4-financial-fraud-detection-xgboost/data</a> into Vantage in a table named "transaction_data". The cell below prints the data size (about 63k rows and 12 columns) and a sample of rows.</p>
<p style = 'font-size:16px;font-family:Arial'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

In [None]:
txn_data = DataFrame(in_schema('DEMO_GLM_Fraud', 'transaction_data'))
# txn_data = DataFrame(in_schema('demo_user', 'transaction_data'))
print(txn_data.shape)
txn_data

<p style = 'font-size:16px;font-family:Arial'>In this simulated scenario, deceptive agents engage in transactions with the objective of taking control of customers' accounts, transferring funds to another account, and ultimately cashing out for profit.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.1 How many fraudulent transactions do we have in our dataset?</b></p>

In [None]:
# There are 92 fraud transactions i.e. 0.14% of fraud transactions in the dataset.
print("No of fraud transactions: %d\nPercentage of fraud transactions: %.2f%%"%(
    txn_data.loc[txn_data.isFraud == 1].shape[0],
    txn_data.loc[txn_data.isFraud == 1].shape[0]/txn_data.shape[0]*100)
)

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.2 How many transactions do we have, grouped by transaction type?</b></p>

In [None]:
# Group all transactions by 'type' and count them
transactions_by_type = txn_data.groupby('type').count().get(['type','count_txn_id'])

# Sort by the 'count_txn_id' column in descending order
transactions_by_type = transactions_by_type.sort('count_txn_id', ascending = False)

transactions_by_type = transactions_by_type.assign(
    type_int = case([
        (transactions_by_type.type == 'CASH_IN', 0),
        (transactions_by_type.type == 'CASH_OUT', 1),
        (transactions_by_type.type == 'DEBIT', 2),
        (transactions_by_type.type == 'PAYMENT', 3),
        (transactions_by_type.type == 'TRANSFER', 4),
    ])
)

In [None]:
transactions_by_type.plot(
    x = transactions_by_type.type_int,
    y = transactions_by_type.count_txn_id,
    kind = 'bar',
    legend = ['Count by Type'],
    ylabel = 'Count of Transactions',
    xlabel = spacing_small.join(sorted(list(transactions_by_type[['type']].get_values().flatten()))),
    title = "Number of Transactions per Transaction Type"
)

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.3 How many fraudulent transactions do we have, grouped by transaction type?</b></p>

In [None]:
# Filter data for fraud transactions and group by 'type'
fraud_transactions_by_type = txn_data.loc[txn_data.isFraud == 1].groupby('type').count().get(['type','count_txn_id'])

# Sort by the 'count_txn_id' column in descending order
fraud_transactions_by_type = fraud_transactions_by_type.sort('count_txn_id', ascending = False)

fraud_transactions_by_type = fraud_transactions_by_type.assign(
    total_fraud = txn_data.loc[txn_data.isFraud == 1].shape[0],
    type_int = case([(fraud_transactions_by_type.type == 'TRANSFER', 0)], else_ = 1)
)

In [None]:
fraud_transactions_by_type.plot(
    x = fraud_transactions_by_type.type_int,
    y = [fraud_transactions_by_type.total_fraud, fraud_transactions_by_type.count_txn_id],
    kind = 'bar',
    figsize = (800, 500),
    legend = ['Total Fraud', 'Count by Type'],
    ylabel = 'Count of Fraud Transactions',
    xlabel = 'TRANSFER' + spacing_large + 'CASH_OUT',
    title = "Number of Fraud Transactions by Transaction Type"
)

<p style = 'font-size:16px;font-family:Arial'>From the above result, we can see that out of the 92 fraud transactions, 47 are from transaction type "TRANSFER" and 45 are from "CASH_OUT".</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.4 What percentage of fraudulent transactions do we have where transaction amount is equal to old balance in the origin account?</b></p>

<p style = 'font-size:16px;font-family:Arial'>This might be the case where the fraudster emptied the account of the victim.</p>

In [None]:
print("No of cleanout fraud transactions: %d\nPercentage of cleanout fraud transactions: %.2f%%"%(
    txn_data.loc[txn_data['amount'] == txn_data.oldbalanceOrig].loc[txn_data['isFraud'] == 1].shape[0],
    txn_data.loc[txn_data['amount'] == txn_data.oldbalanceOrig].loc[txn_data['isFraud'] == 1].shape[0] / txn_data.loc[txn_data.isFraud == 1].shape[0]*100)
)

<p style = 'font-size:16px;font-family:Arial'>From the above result, we can see that out of 92 Fraud transactions, the amount involved in 90 fraud transactions was equal to the total balance in the account. </p>

<hr style="height:1px;border:none;">
<p style = 'font-size:16px;font-family:Arial'><b>Below are some insights about the dataset:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>We have 92 fraud transactions, which account for 0.14% of the dataset.</li>
    <li>Out of these 92 fraud transactions, 47 are of type TRANSFER, and 45 are of type CASH_OUT.</li>
    <li>Approximately 97.83% of our fraud transactions have a transaction amount equal to oldbalanceOrig, indicating account cleanout.</li>
    <li>About 71.74% of our fraud transactions have the recipient's old balance as zero.</li>
    <li>The isFlaggedFraud indicator is correct only two times among our 92 fraud transactions.</li>
</ol>
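<p style = 'font-size:16px;font-family:Arial'>As a quick sanity check, the percentages in the list above can be reproduced from the counts reported earlier in this section. The snippet below is a minimal plain-Python sketch that uses only those reported counts (92, 47, 45, and 90); it does not query Vantage, and the variable names are chosen here for illustration.</p>

```python
# Counts reported by the exploration cells in this section.
total_fraud = 92
fraud_by_type = {"TRANSFER": 47, "CASH_OUT": 45}
cleanout_fraud = 90  # fraud rows where amount equals oldbalanceOrig

# The per-type counts should add up to the total number of fraud rows.
assert sum(fraud_by_type.values()) == total_fraud

# Share of fraud rows that empty the origin account.
cleanout_pct = round(cleanout_fraud / total_fraud * 100, 2)
print(f"Cleanout share of fraud: {cleanout_pct}%")

# Share of each transaction type among fraud rows.
type_pct = {t: round(n / total_fraud * 100, 2) for t, n in fraud_by_type.items()}
print(type_pct)
```

<p style = 'font-size:16px;font-family:Arial'>The cleanout share evaluates to 97.83%, which matches the third insight in the list.</p>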

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.5 Univariate statistics</b></p>

<p style = 'font-size:16px;font-family:Arial'>The describe function computes the count, mean, std, min, percentiles, and max for numeric columns.</p>
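<p style = 'font-size:16px;font-family:Arial'>To make these statistics concrete, the sketch below computes the same kinds of summary values for a small made-up list using only the Python standard library. It is an illustration, not the Vantage implementation: the helper name <code>describe_values</code> is hypothetical, and the sample standard deviation and "inclusive" percentile interpolation used here are assumptions that may differ from what Vantage computes.</p>

```python
import statistics

def describe_values(values):
    """Summary statistics for one numeric column (illustrative only)."""
    # Quartiles with the 'inclusive' method treat the data as the full population range.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumption)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

# Made-up values, not rows from transaction_data.
summary = describe_values([1, 2, 3, 4, 5])
print(summary)
```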

In [None]:
txn_data.describe()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.6 Checking for Null Values</b></p>
<p style = 'font-size:16px;font-family:Arial'>The ColumnSummary() function can be used to take a quick look at the columns, their datatypes, and summary of NULLs/non-NULLs for a given table.</p>

In [None]:
colsum = ColumnSummary(
    data  = txn_data,
    target_columns = [':']
)
colsum.result

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>4. Feature Engineering</b>

<p style='font-size:16px;font-family:Arial'>Teradata Enterprise Feature Store (EFS) functions handle feature management within the Vantage environment. Although their syntax is inspired by Feast, EFS functions are tailored specifically to Teradata Vantage. Unlike Feast, which works with pandas DataFrames, EFS functions use Teradata DataFrames for feature management, so we avoid extracting data from Vantage to create or use features. The EFS functions are designed to give data science teams effective and streamlined feature management. This notebook walks through the capabilities of the EFS functions and shows how they integrate with data models and processes.</p>

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>4.1 Setup a Feature Store Repository</b>

<p style='font-size:16px;font-family:Arial'>The Enterprise Feature Store (EFS) SDK follows an object-oriented design focused on intuitive interaction with feature stores. Central to this design are four core objects: Feature, Entity, DataSource, and FeatureGroup. Together, these objects support efficient management and use of features within your data ecosystem, with Teradata Vantage storing the metadata.</p>
<p style='font-size:16px;font-family:Arial'>A feature store repository serves as the foundational environment for storing and managing your data features. The owner of the FeatureStore can grant or revoke read-only, write-only, or read-and-write authorization for other users.</p>
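<p style='font-size:16px;font-family:Arial'>One way to picture how the core objects relate is the conceptual sketch below. These dataclasses are <b>not</b> the teradataml classes; the <code>...Sketch</code> names and their fields are hypothetical and exist only to show that a feature group bundles a set of features with the entity that keys them and the data source they come from.</p>

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FeatureSketch:
    name: str               # a single column used as a model input, e.g. 'amount'

@dataclass
class EntitySketch:
    name: str
    columns: List[str]      # key column(s) that identify one row

@dataclass
class DataSourceSketch:
    name: str
    source: str             # table or query the features are read from

@dataclass
class FeatureGroupSketch:
    name: str
    features: List[FeatureSketch]
    entity: EntitySketch
    data_source: DataSourceSketch
    description: str = ""

# Illustrative instance mirroring the names used later in this notebook.
group = FeatureGroupSketch(
    name="TransDF",
    features=[FeatureSketch("amount"), FeatureSketch("step")],
    entity=EntitySketch("TrxnId", ["txn_id"]),
    data_source=DataSourceSketch("TransDF", "transaction_data"),
)
print([f.name for f in group.features], group.entity.columns)
```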

In [None]:
# The FeatureStore is not yet set up for repo FinFraud. Let's set it up.
fin_fs = FeatureStore('FinFraud')
fin_fs.setup(perm_size='10e8')

In [None]:
# Let's verify by listing the repos.
FeatureStore.list_repos()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>4.2 Create and Register Entity </b></p>

In [None]:
# Create an entity for DataFrame 'txn_data', keyed on the txn_id column
entity=Entity(name='TrxnId', columns=txn_data.txn_id)

In [None]:
# Register the Entity.
fin_fs.apply(entity)

In [None]:
# Look at existing Entities after registering the Entity.
fin_fs.list_entities()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>4.3 Create and Register FeatureGroup </b></p>
<ul style = 'font-size:16px;font-family:Arial'>
<li style = 'font-size:16px;font-family:Arial'>A FeatureGroup can be created from a Teradata DataFrame.</li>
<li style = 'font-size:16px;font-family:Arial'>A FeatureGroup can be created from a SQL query.</li>
<li style = 'font-size:16px;font-family:Arial'>A FeatureGroup can be created from Feature, Entity, and DataSource objects.</li>
</ul>


<p style = 'font-size:16px;font-family:Arial'><b>Creating a FeatureGroup from Teradata DataFrame
</b></p>

In [None]:
fin_fg = FeatureGroup.from_DataFrame(
    name='TransDF', 
    entity_columns='txn_id', 
    df=txn_data
)

In [None]:
# Let's look at Properties.
fin_fg.features, fin_fg.entity, fin_fg.data_source, fin_fg.description

In [None]:
fin_fs.apply(fin_fg)

In [None]:
fin_fs.list_features()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>4.4 Use Enterprise Feature Store with teradataml analytic functions for data preparation.</b>


<p style = 'font-size:16px;font-family:Arial'>Since the FeatureStore also stores the DataSource, you can retrieve a Teradata DataFrame from the FeatureStore. <br> <code>FeatureStore.get_dataset()</code> returns a Teradata DataFrame for a FeatureGroup.</p>

In [None]:
# Get the dataset for FeatureGroup TransDF.
txn_data_df=fin_fs.get_dataset('TransDF')
txn_data_df

In [None]:
txn_data_df = txn_data_df.assign(txn_type=txn_data_df.type)
txn_data_final = txn_data_df.select(['txn_id','newbalanceDest','isFlaggedFraud','isFraud','step',
                                     'nameOrig', 'oldbalanceDest', 'newbalanceOrig', 'amount',
                                     'nameDest', 'oldbalanceOrig', 'txn_type'])
txn_data_final

In [None]:
copy_to_sql(txn_data_final, table_name='new_data', if_exists='replace')
txn_data_df = DataFrame('new_data')

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>5. Create training and testing datasets in Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>We'll create two datasets for training and testing in the ratio of 80:20.</p>
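<p style = 'font-size:16px;font-family:Arial'>Conceptually, a seeded 80:20 split shuffles the row identifiers reproducibly and cuts the shuffled list at the 80% mark. The plain-Python sketch below illustrates that idea on made-up ids; the helper name <code>split_ids</code> is hypothetical, and the in-database split used in this notebook may select rows differently.</p>

```python
import random

def split_ids(ids, train_size=0.80, seed=25):
    """Shuffle ids with a fixed seed and cut at the train_size fraction."""
    shuffled = list(ids)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]

# 100 made-up transaction ids.
train_ids, test_ids = split_ids(range(1, 101))
print(len(train_ids), len(test_ids))  # prints: 80 20
```

<p style = 'font-size:16px;font-family:Arial'>Fixing the seed means that rerunning the split yields the same two sets, which keeps later model comparisons consistent.</p>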

In [None]:
TrainTestSplit_out = TrainTestSplit(
    data = txn_data_df,
    id_column = "txn_id",
    train_size = 0.80,
    test_size = 0.20,
    seed = 25
)

df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)
df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)

print("Training Set = " + str(df_train.shape[0]) + ". Testing Set = " + str(df_test.shape[0]))

In [None]:
copy_to_sql(df_train, table_name = 'clean_data_train', if_exists = 'replace')
copy_to_sql(df_test, table_name = 'clean_data_test', if_exists = 'replace')

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>6. AutoML Training</b>

<p style = 'font-size:16px;font-family:Arial'>AutoML (Automated Machine Learning) is an approach that automates the process of building, training, and validating machine learning models. It uses a range of algorithms to automate aspects of the machine learning workflow such as data preparation, feature engineering, model selection, hyperparameter tuning, and model deployment. It aims to simplify model building by automating some of the more time-consuming and labor-intensive tasks involved in the process.</p>

<p style = 'font-size:16px;font-family:Arial'>We create an <code>AutoClassifier</code> instance, which is a special-purpose AutoML feature for classification tasks. We use the <code>exclude</code> parameter to specify model algorithms to be excluded from the model training phase. Here we exclude the 'knn', 'svm', and 'glm' models. The <code>max_runtime_secs</code> parameter specifies the time limit in seconds for model training.
<br><br>
<code>verbose</code>: specifies the detailed execution steps based on verbose level as follows:
</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li><b>0</b>: prints the progress bar and leaderboard.</li>
    <li><b>1</b>: prints the execution steps of AutoML.</li>
    <li><b>2</b>: prints the intermediate data between the execution of each step of AutoML.</li>
</ul>

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>6.1. AutoML Training</b>

In [None]:
df_train = df_train.drop(['nameDest', 'nameOrig', 'isFlaggedFraud'], axis = 1)
df_train

In [None]:
# Creating AutoClassifier Instance
# Selecting 'Auto' mode for AutoML training
# Excluding knn,glm and svm model from default model list for training
# Used early stopping timer criteria with value 600 sec

aml = AutoClassifier(
    exclude          = ['knn','svm','glm'],
    verbose          = 2,
    max_runtime_secs = 600
)

<p style = 'font-size:16px;font-family:Arial'><b><i>Note: AutoML performs many steps, including feature exploration, data preparation, model training, and evaluation, to select the best model, so the step below may take 12-15 minutes.</i></b></p>

In [None]:
# Fitting train data 
aml.fit(data = df_train, target_column = 'isFraud')

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>6.2. Model Leaderboard Generation</b>

<p style = 'font-size:16px;font-family:Arial'>Here, we generate the model leaderboard and the leader for the given dataset. The leaderboard is a ranked table listing the models with their evaluation metrics.</p>

In [None]:
# Fetching leaderboard

leaderboard = aml.leaderboard()
leaderboard

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>6.3. Best Performing Model</b>

<p style = 'font-size:16px;font-family:Arial'>The following function displays the best performing model.</p>

In [None]:
# Fetching best performing model
aml.leader()

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>7. Prediction</b>

<p style = 'font-size:16px;font-family:Arial'>The predict function generates predictions using either the default test data or any specified dataset, based on the model's rank in the leaderboard, and displays the performance metrics of the chosen model. If the test data contains a target column, both predictions and performance metrics are displayed; otherwise, only the predictions are shown.
<br><br>
You can also use the <code>rank</code> parameter in the predict function. The <code>rank</code> parameter specifies the model's rank in the leaderboard to be used for prediction. By default, the rank is set to 1, meaning the best-performing model is used.</p>
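<p style = 'font-size:16px;font-family:Arial'>The idea of a ranked leaderboard and a <code>rank</code> lookup can be sketched in plain Python as follows. Everything here is hypothetical: the model ids and scores are made up, the helper names <code>build_leaderboard</code> and <code>model_at_rank</code> are not part of teradataml, and ordering by a single accuracy value is an assumption made only for illustration.</p>

```python
# Hypothetical evaluation results; ids and scores are made up for illustration.
results = [
    {"model_id": "model_a", "accuracy": 0.91},
    {"model_id": "model_b", "accuracy": 0.97},
    {"model_id": "model_c", "accuracy": 0.94},
]

def build_leaderboard(results, metric="accuracy"):
    """Order models from best to worst on one metric and attach a 1-based rank."""
    ordered = sorted(results, key=lambda r: r[metric], reverse=True)
    return [dict(r, rank=i) for i, r in enumerate(ordered, start=1)]

def model_at_rank(leaderboard, rank=1):
    """Return the model id at the requested rank (rank 1 is the leader)."""
    for row in leaderboard:
        if row["rank"] == rank:
            return row["model_id"]
    raise ValueError(f"no model with rank {rank}")

toy_leaderboard = build_leaderboard(results)
print(model_at_rank(toy_leaderboard))          # leader
print(model_at_rank(toy_leaderboard, rank=2))  # second-best model
```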

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>7.1 Generating prediction on test data using Best Model</b>

<p style = 'font-size:16px;font-family:Arial'>Here, we specify the <code>df_test</code> dataset for prediction. When using external data instead of the default test data, the predict function applies all the data transformation steps performed during the training phase on the external data before passing the data to the model for prediction.</p>
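<p style = 'font-size:16px;font-family:Arial'>The principle of reusing training-time transformations on new data can be sketched as below. Min-max scaling is used here only as a stand-in, since the exact transformations AutoML applies are not listed in this notebook; the helper names and the amounts are made up for illustration.</p>

```python
def fit_min_max(values):
    """Learn scaling parameters from the training data only."""
    return min(values), max(values)

def apply_min_max(values, params):
    """Scale values with parameters learned earlier; never refit on new data."""
    lo, hi = params
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Made-up amounts, not rows from the dataset.
train_amounts = [100.0, 300.0, 500.0]
test_amounts = [200.0, 700.0]

params = fit_min_max(train_amounts)               # fitted once, on training data
scaled_test = apply_min_max(test_amounts, params)
print(scaled_test)  # prints: [0.25, 1.5]
```

<p style = 'font-size:16px;font-family:Arial'>The second test value scales to more than 1 because it lies outside the range seen during training; the parameters are deliberately not refitted, so the model sees new data on the same scale it was trained on.</p>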

In [None]:
# Fetching prediction and metrics on test data
prediction = aml.predict(df_test)

In [None]:
# Printing prediction
prediction

<b style = 'font-size:18px;font-family:Arial'>Generating predictions using 2nd Best Model</b>

In [None]:
#Prediction using the second best performing model
prediction_second = aml.predict(df_test, rank=2)

#Printing prediction
prediction_second

<b style = 'font-size:18px;font-family:Arial'>Generating predictions using 3rd Best Model</b>

In [None]:
prediction_third = aml.predict(df_test, rank=3)

#Printing prediction
prediction_third

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>7.2 Generating and Comparing ROC for the Top 3 Models</b>

<p style = 'font-size:16px;font-family:Arial'>The ROC curve is a graph between TPR(True Positive Rate) and FPR(False Positive Rate). The area under the ROC curve measures how well the model can distinguish between positive and negative classes. The higher the AUC, the better the model's performance in distinguishing between the positive and negative categories. AUC above 0.75 is generally considered decent.</p>
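<p style = 'font-size:16px;font-family:Arial'>To show how TPR, FPR, and AUC relate, the sketch below builds ROC points for a tiny made-up set of labels and scores and integrates them with the trapezoidal rule. It is a plain-Python illustration that assumes evenly spaced thresholds between 0 and 1; the helper names <code>roc_points</code> and <code>auc</code> are hypothetical, and the in-database ROC function may compute its values differently.</p>

```python
def roc_points(labels, scores, num_thresholds=100):
    """Return sorted (fpr, tpr) pairs for evenly spaced thresholds in [0, 1]."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = {(0.0, 0.0)}  # a threshold above every score predicts no positives
    for i in range(num_thresholds + 1):
        threshold = i / num_thresholds
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.add((fp / negatives, tp / positives))
    return sorted(points)

def auc(points):
    """Trapezoidal area under the (fpr, tpr) curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels (1 = fraud) and predicted fraud probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_points = roc_points(toy_labels, toy_scores)
print(toy_points)
print(auc(toy_points))  # prints: 0.75
```

<p style = 'font-size:16px;font-family:Arial'>A model that ranks every fraud row above every non-fraud row would reach an AUC of 1.0, while the diagonal baseline corresponds to 0.5.</p>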

In [None]:
#Calculating True-Positive Rate (TPR), False-Positive Rate (FPR), and threshold values for the top three models
roc_first = ROC(
    probability_column = "prob_1",
    observation_column = "isFraud",
    positive_class = '1',
    num_thresholds = 100,
    data = prediction
)

roc_second = ROC(
    probability_column = "prob_1",
    observation_column = "isFraud",
    positive_class = '1',
    num_thresholds = 100,
    data = prediction_second
)

roc_third = ROC(
    probability_column = "prob_1",
    observation_column = "isFraud",
    positive_class = '1',
    num_thresholds = 100,
    data = prediction_third
)

#Getting the AUC score for all three models
auc_first = roc_first.result.get_values()[0][0]
auc_second = roc_second.result.get_values()[0][0]
auc_third = roc_third.result.get_values()[0][0]

In [None]:
#first model
first_model = leaderboard.MODEL_ID.iloc[0]

#second model
second_model = leaderboard.MODEL_ID.iloc[1]

#third model
third_model = leaderboard.MODEL_ID.iloc[2]

#Plotting the ROC Curve
roc_second.output_data.plot(
    x = roc_first.output_data.fpr,
    y = [roc_first.output_data.tpr, roc_second.output_data.tpr, roc_third.output_data.tpr,roc_first.output_data.fpr],
    legend = [
                '{}: AUC = {}'.format(first_model,str(auc_first)),
                '{}: AUC = {}'.format(second_model,str(auc_second)),
                '{}: AUC = {}'.format(third_model,str(auc_third)),
                'Baseline: AUC = {}'.format(str(round(0.5, 4)))
             ],
    legend_style = 'lower right',
    title = 'Receiver Operating Characteristic (ROC) Curve',
    xlabel = 'False Positive Rate',
    ylabel = 'True Positive Rate',
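    # Four series are plotted (three models plus the baseline), so four colors and line styles are listed
    color = ['green', 'orange', 'blue', 'red'],
    linestyle = ['-', '-', '-', '--']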
)

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>Conclusion</b>

<p style = 'font-size:16px;font-family:Arial'>We used the feature store to store features along with their processing, and we reused them in model training. The features and processing can be reused across multiple machine learning models and use cases, helping to improve data science productivity.</p>

<p style = 'font-size:16px;font-family:Arial'>Teradata's AutoML functionality plays a crucial role in this context by automating the complex process of building and deploying machine learning models. AutoML automates data preparation and model training, delivering high-quality machine learning models in minutes. Through hyperparameter tuning (HPT), Teradata's AutoML can automatically select the best parameters for machine learning algorithms using grid search and random search techniques, significantly enhancing model performance.
<br><br>
By leveraging Teradata's AutoML, companies can save time and reduce costs associated with manual model building and tuning. The technology not only improves the accuracy of predictive models but also democratizes the power of machine learning, allowing customers to utilize advanced analytics without requiring extensive coding or data science expertise. This capability enables companies to swiftly analyze transaction data, develop predictive models, and act on suspicious transactions before losses grow.
<br><br>
In conclusion, Teradata's AutoML functionality is a valuable tool for financial institutions aiming to detect fraudulent transactions. By automating and optimizing the machine learning process, Teradata empowers various industries to make data-driven decisions that reduce fraud losses and protect customers.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>8. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>We need to clean up our work tables to prevent errors next time.</p>

In [None]:
tables = ['clean_data', 'clean_data_train', 'clean_data_test','new_data']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name = table)
    except Exception:
        # The table may not exist (for example, if a step was skipped); ignore and continue
        pass

In [None]:
fin_fs.archive_feature_group(feature_group='TransDF')

In [None]:
fin_fs.delete_feature_group(feature_group='TransDF')

In [None]:
fin_fs.delete()

<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>We will use the following code to clean up tables and databases created for this demonstration.</p>

In [None]:
%run -i ../../run_procedure.py "call remove_data('Demo_glm_fraud');"        # Takes 5 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;">

<b style = 'font-size:20px;font-family:Arial'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial'>Let’s look at the elements we have available for reference for this notebook:</p>

<p style = 'font-size:18px;font-family:Arial'><b>Filters:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><b>Industry:</b> Finance</li>
    <li><b>Functionality:</b> Machine Learning</li>
    <li><b>Use Case:</b> Fraud Detection</li>
</ul>

<p style = 'font-size:18px;font-family:Arial'><b>Related Resources:</b></p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li><a href='https://www.teradata.com/Blogs/Fraud-Busting-AI'>Fraud-Busting-AI</a></li>
    <li><a href='https://www.teradata.com/Industries/Financial-Services'>Financial Services</a></li>
    <li><a href='https://www.teradata.com/Resources/Datasheets/Move-from-Detection-to-Prevention-and-Outsmart-Fraudsters'>Move from Detection to Prevention and Outsmart Tech-Savvy Fraudsters</a></li>
</ul>

<b style = 'font-size:20px;font-family:Arial'>Dataset:</b>

- `txn_id`: transaction id
- `step`: maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (31 days simulation).
- `type`: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER
- `amount`: amount of the transaction in local currency
- `nameOrig`: customer who started the transaction
- `oldbalanceOrig`: customer's balance before the transaction
- `newbalanceOrig`: customer's balance after the transaction
- `nameDest`: customer who is the recipient of the transaction
- `oldbalanceDest`: recipient's balance before the transaction
- `newbalanceDest`: recipient's balance after the transaction
- `isFraud`: identifies a fraudulent transaction (1) and non fraudulent (0)
- `isFlaggedFraud`: flags illegal attempts to transfer more than 200,000 in a single transaction
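<p style = 'font-size:16px;font-family:Arial'>Two of the column descriptions above can be expressed as small rules. The sketch below is a plain-Python illustration with hypothetical helper names; it assumes that steps are numbered from 1 and that the flag applies only to rows of type TRANSFER, neither of which is stated explicitly in the descriptions.</p>

```python
HOURS_PER_DAY = 24
TOTAL_STEPS = 744  # 31 simulated days, one step per hour

def step_to_day_hour(step):
    """Map a step to (day, hour_of_day), assuming step 1 is hour 0 of day 1."""
    if not 1 <= step <= TOTAL_STEPS:
        raise ValueError("step out of range")
    return (step - 1) // HOURS_PER_DAY + 1, (step - 1) % HOURS_PER_DAY

def would_be_flagged(txn_type, amount, limit=200_000):
    """Flag rule as described: a transfer of more than 200,000 in one transaction."""
    return txn_type == "TRANSFER" and amount > limit

print(step_to_day_hour(1))                     # prints: (1, 0)
print(step_to_day_hour(744))                   # prints: (31, 23)
print(would_be_flagged("TRANSFER", 250_000))   # prints: True
```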

<p style = 'font-size:18px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Uses a dataset and feature discovery methods outlined here: <a href = 'https://www.kaggle.com/georgepothur/4-financial-fraud-detection-xgboost/notebook'>https://www.kaggle.com/georgepothur/4-financial-fraud-detection-xgboost/notebook</a></li>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>