<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Anomaly Detection in Credit Card
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style='font-size:20px;font-family:Arial;'><b>Credit Card Fraud Detection using K-Means Clustering:</b></p> 
<p style='font-size:16px;font-family:Arial;'>
Detecting fraudulent transactions is crucial for financial security. This approach leverages <b>K-Means clustering</b> to group similar transactions and identifies anomalies based on <b>Euclidean distance</b>, where fraud-like patterns deviate significantly from normal spending behaviors.
</p>

<ul style='font-size:16px;font-family:Arial;'>
    <li><strong>Anomaly Detection:</strong> Identifies outliers based on their distance from the cluster center, marking transactions that deviate from normal spending patterns.</li>
    <li><strong>Vector Embeddings:</strong> Converts categorical transaction data into vector representations to improve clustering accuracy.</li>
    <li><strong>Feature Engineering:</strong> Includes transaction amount, location, time, and merchant category to enhance fraud detection.</li>
    <li><strong>Dimensionality Reduction:</strong> Uses t-SNE to visualize clusters and detect fraudulent transactions that do not fit normal behavior.</li>
    <li><strong>Scalability:</strong> Works efficiently on large datasets by leveraging K-Means for unsupervised learning and anomaly detection.</li>
</ul>
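
<p style='font-size:16px;font-family:Arial;'>The distance-based idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's in-database implementation: the centroids and points are made-up 2-D values, and <code>nearest_centroid</code> is a hypothetical helper introduced only for this example.</p>

```python
import math

def nearest_centroid(point, centroids):
    """Return (cluster index, Euclidean distance) for the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

# Hypothetical 2-D centroids and points, for illustration only
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (3.0, 4.0)]

assignments = [nearest_centroid(p, centroids) for p in points]
for point, (cluster_id, distance) in zip(points, assignments):
    print(f"{point} -> cluster {cluster_id}, distance {distance:.2f}")
```

<p style='font-size:16px;font-family:Arial;'>The point (3.0, 4.0) lies much farther from its centroid than the other members of cluster 0, which is the kind of deviation the notebook later flags with a per-cluster distance threshold.</p>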


<p style = 'font-size:18px;font-family:Arial;'><b>Why Vantage?</b></p>

<p style = 'font-size:16px;font-family:Arial;'>
    Teradata's integration with <b>LLMs and in-database model hosting</b>, together with the Open Analytics Framework, lets customers run NLP models at scale. On-prem customers typically face two obstacles: the latency of moving data to external services and limited access to cloud-hosted models. Bringing language models into Vantage addresses both, because large volumes of data no longer need to move to and from external services.
</p>

<hr style="height:1px;border:none;background-;">
<p style = 'font-size:18px;font-family:Arial;'><b>Downloading and installing additional software needed</b></p>

In [None]:
# %%capture

!pip install wordcloud nltk --quiet --no-warn-script-location

In [None]:
# %%capture

!pip install --force-reinstall pillow --quiet --no-warn-script-location

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>The above libraries have to be installed. Restart the kernel after executing these cells to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing <b> 0 0</b></i> (zero zero) and pressing <i>Enter</i>.</p>
</div>
<p style = 'font-size:16px;font-family:Arial;'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Standard libraries
import time
import warnings
import random

# Data manipulation and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.manifold import TSNE
import seaborn as sns

# Teradata libraries
from teradatamlwidgets import *
from teradataml import (
    configure,
    concat,
    create_context, 
    delete_byom, 
    display,
    copy_to_sql,
    execute_sql,
    save_byom,
    remove_context,
    in_schema,
    ScaleFit,
    ScaleTransform, 
    VectorDistance,
    KMeans,
    KMeansPredict,
    DataFrame,
    db_drop_table,
    db_drop_view,
    ONNXEmbeddings
)
display.max_rows = 5

# NLP libraries
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)



# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Anomaly_Detection_Credit_Card_ONNXEmbeddings.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial;'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:18px;font-family:Arial;'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may run faster but consumes space. The following cell contains both statements, with one commented out; switch modes by moving the comment marker.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_CreditCard_cloud');" 
# takes about 20 seconds, estimated space: 0 MB
%run -i ../run_procedure.py "call get_data('DEMO_CreditCard_local');" 
# takes about 35 seconds, estimated space: 11 MB

<p style = 'font-size:16px;font-family:Arial;'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>2. Loading sample data for the demo</b>

In [None]:
tdf_cc_original_data = DataFrame(in_schema("DEMO_CreditCard","Credit_Card"))
tdf_cc_original_data

In [None]:
tdf_cc_original_data.shape

In [None]:
#Storing a copy in database
copy_to_sql(df = tdf_cc_original_data, table_name = "credit_card_db", if_exists="replace")

In [None]:
DF_CC = DataFrame.from_table('credit_card_db')
DF_CC.shape

In [None]:
DF_CC

<hr style="height:1px;border:none;background-;">
<b style = 'font-size:18px;font-family:Arial;'>2.1 Real-time Data Collection
</b>

<p style = 'font-size:16px;font-family:Arial;'>This simulated data mimics a real-time process in which transaction details are captured continuously, enabling anomaly detection on live credit card activity. We first build and test the model on sample historical data and save the resulting artifacts. As a last step, we reuse those artifacts to detect anomalies in simulated real-time data generated by the functions below.</p>

In [None]:
# Real time data collection
def fetching_real_time_data():
    contract_types = ["Cash loans", "Revolving loans"]
    genders = ["M", "F"]
    own_car = ["Y", "N"]
    family_status = ["Married", "Single", "Separated"]
    house_types = ["Block of flats", "House", "Municipal"]
    occupations = ["Sales staff", "Managers", "Core staff", "None"]

    records = []
    for _ in range(10):
        record = {
            "SK_ID_CURR": random.randint(456255, 999999),
            "TARGET": random.choice([0, 1]),
            "NAME_CONTRACT_TYPE": random.choice(contract_types),
            "CODE_GENDER": random.choice(genders),
            "FLAG_OWN_CAR": random.choice(own_car),
            "CNT_CHILDREN": random.randint(0, 5),
            "AMT_INCOME_TOTAL": round(random.uniform(117000000, 117100000), 2),
            "NAME_FAMILY_STATUS": random.choice(family_status),
            "REGION_POPULATION_RELATIVE": round(random.uniform(0.001, 0.05), 6),
            "FLAG_MOBIL": 1,
            "FLAG_EMP_PHONE": random.choice([0, 1]),
            "CNT_FAM_MEMBERS": random.randint(1, 6),
            "HOUSETYPE_MODE": random.choice(house_types),
            "OCCUPATION_TYPE": random.choice(occupations),
            "AGE": random.randint(20, 70)
        }
        records.append(record)
    
    return pd.DataFrame(records)

In [None]:
# Function for real-time data fetching
def fetch_data():
    all_data = [] 

    print("\nInitializing real-time credit card data fetch...\n")
    time.sleep(1)
    
    while True:
        user_input = input("Do you want to fetch updated credit card records? (yes to start, stop to end): ").strip().lower()
        
        if user_input == 'yes':
            print("Fetching new credit card record...", end="", flush=True)
            time.sleep(random.uniform(0.5, 1.5))

            data = fetching_real_time_data()
            print(" Data fetched successfully!")
            all_data.append(data)
        elif user_input == 'stop':
            print("\nStopping the data collection. Finalizing...\n")
            time.sleep(1)
            break
        else:
            print("Invalid input. Please enter 'yes' to generate a record or 'stop' to end.")
    
    if all_data:
        print("\nFinalizing the dataset...\n")
        time.sleep(1)
        print("Merging all generated records...\n")
        print("Inserted new records")
        return pd.concat(all_data, ignore_index=True)
    else:
        print("No records fetched, returning an empty DataFrame.")  
        return pd.DataFrame(columns=['TransactionID', 'Amount'])  

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>3. Load HuggingFace Model</b></p>
<p style = 'font-size:16px;font-family:Arial;'>To generate embeddings, we need an ONNX model capable of transforming text into vector representations. Teradata publishes pretrained ONNX embedding models on Hugging Face, such as <a href="https://huggingface.co/Teradata/gte-base-en-v1.5">gte-base-en-v1.5</a>; the cells below use <b>bge-small-en-v1.5</b>, configured with a 384-dimensional output. The model and its tokenizer are downloaded and stored in Vantage tables as BLOBs using the <code>save_byom</code> function.</p>

In [None]:
import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

In [None]:
from huggingface_hub import hf_hub_download

model_name = "bge-small-en-v1.5"
number_dimensions_output = 384
model_file_name = "model.onnx" 

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>3.1 Download and Store Model
</b>
<p style = 'font-size:16px;font-family:Arial;'>In this step, we download the ONNX embedding model and tokenizer from Hugging Face, then store them as BLOBs in Vantage tables using the <b>save_byom</b> function. This allows the model to be cached and reused across multiple nodes for parallel embedding generation.</p>

In [None]:
# Step 1: Download Model from Teradata HuggingFace Page

hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"onnx/{model_file_name}", local_dir="./")
hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"tokenizer.json", local_dir="./")

In [None]:
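# Drop leftover model tables from earlier runs; ignore the error if a table does not exist
for table_name in ("embeddings_models", "embeddings_tokenizers"):
    try:
        db_drop_table(table_name)
    except Exception:
        pass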

In [None]:
# Step 2: Load Models into Vantage
# a) Embedding model
save_byom(model_id = model_name, # must be unique in the models table
               model_file = f"onnx/{model_file_name}",
               table_name = 'embeddings_models' )
# b) Tokenizer
save_byom(model_id = model_name, # must be unique in the models table
              model_file = 'tokenizer.json',
              table_name = 'embeddings_tokenizers') 

<p style = 'font-size:16px;font-family:Arial;'>Verify that the model and tokenizer were stored.</p>

In [None]:
df_model = DataFrame('embeddings_models')
df_model

In [None]:
df_token = DataFrame('embeddings_tokenizers')
df_token

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>4. Create the Embeddings</b>

<p style = 'font-size:16px;font-family:Arial;'>Let us take a look at the demo data once again.</p>

In [None]:
DF_CC

In [None]:
DF_CC.shape

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>4.1 Creation of Views and Final Embeddings Table
</b>

<p style = 'font-size:16px;font-family:Arial;'>
Now it's time to generate the embeddings using <b>ONNXEmbeddings</b>.<br>We run the ONNXEmbeddings function to generate embeddings for a small subset of records. The model is <b>loaded into the cache memory on each node</b>, and Teradata's <b>Massively Parallel Processing (MPP)</b> architecture ensures that embeddings are computed in parallel using <b>ONNX Runtime</b> on each node.<br>However, generating embeddings for the entire training set can be time-consuming, especially on a system with limited resources. In the <b>ClearScape Analytics experience</b>, only a <b>4 AMP system</b> with constrained RAM and CPU power is available. To ensure smooth execution, we test embedding generation on a small sample and use <b>pre-calculated embeddings</b> for the remainder of the demo. In a real-life scenario you would typically encounter several hundred AMPs with much more compute power.<br>The most important input parameters of the <b>ONNXEmbeddings</b> function are:
</p>
<ul style = 'font-size:16px;font-family:Arial;'>
<li><b>InputTable</b>: The source table containing the text to be embedded. </li>
<li><b>ModelTable</b>: The table storing the ONNX model.                    </li>
<li><b>TokenizerTable</b>: The table storing the tokenizer JSON file.       </li>
<li><b>Accumulate</b>: Specifies additional columns to retain in the output </li>  
<li><b>OutputFormat</b>: Specifies the data format of the output embeddings (<b>FLOAT32(384)</b>, matching the model's output dimension).</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
Since embedding generation is computationally expensive, we only process a <b>sample of 100 records for testing</b> and rely on precomputed embeddings for further analysis.
</p>


<p style='font-size:18px;font-family:Arial;'><b>Generate Embeddings</b></p>    

<p style='font-size:16px;font-family:Arial;'>
Generating embeddings will take approximately <b>10-15 minutes.</b>
</p>


<center><img src="images/visual.svg" alt="embeddings_decision" width=700 style="border: 4px solid #404040; border-radius: 10px;"/></center>


<div class="alert alert-block alert-info">
<p style='font-size:16px;font-family:Arial;'>
<i><b>
These embeddings will later be used in anomaly detection by comparing the similarity between different transactions. By converting each transaction into a vector representation, we can identify outliers or anomalies based on the distance between vectors.</b></i>
</p>
</div>


In [None]:
#configure Teradata Bring Your Own Model Artifacts
configure.byom_install_location = "mldb"
my_model = DataFrame.from_query(f"select * from embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'")

In [None]:
query = """
SELECT 
    SK_ID_CURR AS id,
    NAME_CONTRACT_TYPE || ' ' ||
    CODE_GENDER || ' ' ||
    FLAG_OWN_CAR || ' ' ||
    NAME_FAMILY_STATUS || ' ' ||
    COALESCE(HOUSETYPE_MODE, '') || ' ' ||
    COALESCE(OCCUPATION_TYPE, '') || ' ' ||
    CAST(CNT_CHILDREN AS VARCHAR(50)) || ' ' ||
    CAST(REGION_POPULATION_RELATIVE AS VARCHAR(50)) || ' ' ||
    CAST(CNT_FAM_MEMBERS AS VARCHAR(50)) || ' ' ||
    CAST(AMT_INCOME_TOTAL AS VARCHAR(50)) AS txt
FROM credit_card_db
SAMPLE 100
"""

In [None]:
DF_sample100 = DataFrame.from_query(query)
DF_sample100

In [None]:
DF_embeddings_training = ONNXEmbeddings(
    newdata = DF_sample100,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id","txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
DF_embeddings_training.show_query()

In [None]:
from teradatamlwidgets import *
DF_embeddings_training.head(2)

<p style = 'font-size:16px;font-family:Arial;'>Store the generated embeddings in a permanent table for use in K-Means clustering and subsequent analysis.</p>

In [None]:
copy_to_sql(DF_embeddings_training, table_name="embeddings_table_training" , if_exists='replace')

<hr style="height:1px;border:none;background-;">
<b style = 'font-size:18px;font-family:Arial;'>4.2 Elbow Method
</b>

<p style = 'font-size:16px;font-family:Arial;'>In this step, we apply the <b>Elbow Method</b> to determine the optimal number of clusters for the KMeans clustering algorithm. The Elbow Method helps in selecting the number of clusters that best represents the data by analyzing the <b>Total Within-Cluster Sum of Squares (Total_WithinSS)</b>.</p>

<p style='font-size:16px;font-family:Arial;'>
For each candidate number of clusters, we print the <code>Total_WithinSS</code> value, which measures how compact the clusters are (lower means more compact).
</p>

<p style='font-size:16px;font-family:Arial;'>
The resulting plot shows how <b>Total_WithinSS</b> changes with the number of clusters. The "elbow" in the curve, where adding more clusters yields only small improvements, suggests the optimal <code>k</code> for further analysis.
</p>

<p style='font-size:16px;font-family:Arial;'>
To locate the <b>Elbow Point</b> programmatically, we calculate the differences between successive WCSS values and then the second-order differences. The largest second-order difference marks the sharpest bend in the curve and therefore the optimal number of clusters (<code>k</code>).
</p>
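
<p style='font-size:16px;font-family:Arial;'>As a small worked example of the second-order difference idea, the sketch below uses plain Python and made-up WCSS values. The helper names <code>second_differences</code> and <code>elbow_k</code> are hypothetical, and the sketch assumes that the i-th second difference describes the bend at position i + 1 of the original sequence.</p>

```python
def second_differences(values):
    """Return the discrete second differences of a sequence."""
    first = [b - a for a, b in zip(values, values[1:])]
    return [b - a for a, b in zip(first, first[1:])]

def elbow_k(k_values, wcss):
    """Pick the k at which the WCSS curve bends most sharply."""
    if len(wcss) < 3:
        raise ValueError("need at least three WCSS values")
    d2 = second_differences(wcss)
    # d2[i] describes the bend at position i + 1 of the original sequence
    bend = max(range(len(d2)), key=d2.__getitem__)
    return k_values[bend + 1]

# Hypothetical WCSS values, for illustration only
k_values = [2, 3, 4, 5, 6]
wcss = [100.0, 60.0, 30.0, 25.0, 22.0]
print(second_differences(wcss))
print(elbow_k(k_values, wcss))
```

<p style='font-size:16px;font-family:Arial;'>In this toy series the drop in WCSS slows sharply after k = 4, so that is where the second difference peaks.</p>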

In [None]:
def elbo_method_calculation(embeddings_sample_df: DataFrame, embedding_columns: list, k_values: list) -> list:
    total_withinss_values = []
    for num_clusters in k_values:
        kmeans_out = KMeans(
            data=embeddings_sample_df,
            id_column="id",
            target_columns=embedding_columns,
            num_clusters=num_clusters,
            num_init=10,
            iter_max=50
        )
        
        result_table = kmeans_out.result
        
        # Convert to pandas FIRST, then filter
        result_pandas = result_table.to_pandas()
        
        # Filter in pandas instead of teradataml
        filtered_rows = result_pandas[
            result_pandas['td_modelinfo_kmeans'].str.contains('Total_WithinSS', na=False)
        ]
        
        if len(filtered_rows) == 0:
            print(f"Warning: No Total_WithinSS found for {num_clusters} clusters")
            continue
        
        value_str = filtered_rows['td_modelinfo_kmeans'].iloc[0]
        numeric_str = value_str.split(':')[1].strip()
        numeric_value = round(float(numeric_str), 2)
        total_withinss_values.append(numeric_value)
               
        print(f'Number of clusters: {num_clusters}, Total_WithinSS: {numeric_value}')

    # Ensure k_values matches the actual data collected
    plt.figure(figsize=(10, 6))
    plt.plot(k_values[:len(total_withinss_values)], total_withinss_values, marker='o', linestyle='--', color='b')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Total Within-Cluster Sum of Squares (Total_WithinSS)')
    plt.title('Elbow Method for Optimal Number of Clusters')
    plt.xticks(k_values[:len(total_withinss_values)])  # Show only integer ticks
    plt.grid(True)
    plt.show()
    return total_withinss_values

<p style = 'font-size:16px;font-family:Arial;'>The function below identifies the optimal number of clusters by finding the elbow point in the Total Within-Cluster Sum of Squares curve. It computes first- and second-order differences of the values and selects the cluster count where the second-order difference is largest, that is, where the curve bends most sharply.</p>

In [None]:
def find_elbow_point(total_withinss_values: list, k_values: list) -> int:
    # Need at least 3 values for second-order differences
    if len(total_withinss_values) < 3:
        raise ValueError("Need at least 3 Total_WithinSS values to determine the optimal K")
    else:
        wcss_diff = np.diff(total_withinss_values)   # First derivative (length: n-1)
        wcss_diff2 = np.diff(wcss_diff)              # Second derivative (length: n-2)
        
        # wcss_diff2[i] is the second difference centered on point i + 1,
        # so shift by 1 to map the largest bend back to an index in k_values
        elbow_index = int(np.argmax(wcss_diff2)) + 1
        optimal_k_index = k_values[elbow_index]
      
    print(f'Rate of WCSS change between K values: {wcss_diff}')
    print(f'Change in rate (acceleration): {wcss_diff2}')
    print(f'Index where curve bends most: {elbow_index}')
    print(f'Optimal number of clusters (K) based on Elbow Method: {optimal_k_index}')
    return optimal_k_index

<p style = 'font-size:16px;font-family:Arial;'>Extract the list of embedding columns (emb_0 through emb_383) to be used as features for K-Means clustering.</p>

In [None]:
embedding_column_list = [col for col in DF_embeddings_training.columns if col not in ["id", "txt"]]
k_values = list(range(5, 10))
total_withinss_values = elbo_method_calculation(DF_embeddings_training, embedding_column_list, k_values)
optimal_k = find_elbow_point(total_withinss_values, k_values)

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>5. Run K-Means on the Embeddings Store and then build final table with Cluster ID assignments to rows</b>

<p style = 'font-size:16px;font-family:Arial;'>The <code>KMeans</code> function groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm minimizes the total within-cluster sum of squares, that is, the sum of squared Euclidean distances between each data point and the center of its cluster.</p>
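
<p style = 'font-size:16px;font-family:Arial;'>Written out, the standard K-Means objective is shown below. The symbols are introduced here for illustration: \(C_j\) is the set of points assigned to cluster \(j\), and \(\mu_j\) is that cluster's centroid. We assume the <code>Total_WithinSS</code> value reported by the in-database function corresponds to this quantity.</p>

$$
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$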

<p style = 'font-size:16px;font-family:Arial;'>The function below encapsulates the complete K-Means training workflow. It trains the model, generates predictions with distances, computes cluster statistics, and calculates the 95th percentile distance threshold for each cluster. These thresholds serve as boundaries for identifying anomalies.</p>

In [None]:
def train_kmeans_model(embeddings_table: str, feature_columns: list, num_clusters: int) -> dict:
    kmeans_trained = KMeans(
        id_column="id",
        data=DataFrame(embeddings_table),
        target_columns=feature_columns,
        num_init=10,
        num_clusters=num_clusters,
        iter_max=50,
    )

    trained_key_means_df = kmeans_trained.result
    
    clustered_data = KMeansPredict(data=DataFrame(embeddings_table),
                                    object=trained_key_means_df,
                                    output_distance=True)
    
    cluster_stats = clustered_data.result.groupby('td_clusterid_kmeans').count()

    training_data_scored = DataFrame(embeddings_table).join(
        other=clustered_data.result[['id', 'td_clusterid_kmeans', 'td_distance_kmeans']],
        on=["id"],
        how="inner",
        lprefix="l",
        rprefix="r"
    )

    thresholds_df = training_data_scored[['td_clusterid_kmeans','td_distance_kmeans']].groupby('td_clusterid_kmeans').percentile(0.95)
    


    copy_to_sql(trained_key_means_df, table_name='kmeans_trained_model', if_exists='replace')
    copy_to_sql(training_data_scored, table_name='kmeans_scored_training_output', if_exists='replace')
    copy_to_sql(cluster_stats, table_name='kmeans_cluster_stats', if_exists='replace')
    copy_to_sql(thresholds_df, table_name='kmeans_thresholds', if_exists='replace')  

    return {
        'kmeans_model_trained_table': 'kmeans_trained_model',
        'kmeans_scored_training_table': 'kmeans_scored_training_output',
        'kmeans_cluster_stats_table': 'kmeans_cluster_stats',
        'kmeans_thresholds_table': 'kmeans_thresholds'
    }

In [None]:
model_artifacts = train_kmeans_model('embeddings_table_training', embedding_column_list, optimal_k)

<p style = 'font-size:16px;font-family:Arial;'>The output below shows cluster assignment for each row.</p>

In [None]:
kmeans_scored_training_data = DataFrame(model_artifacts['kmeans_scored_training_table'])
kmeans_scored_training_data.head(2)

<p style = 'font-size:16px;font-family:Arial;'>Let's check how many data points each cluster has.</p>

In [None]:
kmeans_cluster_stats_df = DataFrame(model_artifacts['kmeans_cluster_stats_table'])
kmeans_cluster_stats_df[['td_clusterid_kmeans','count_id']].head(optimal_k)

<p style = 'font-size:16px;font-family:Arial;'>And the thresholds</p>

In [None]:
kmeans_thresholds_df = DataFrame(model_artifacts['kmeans_thresholds_table'])
kmeans_thresholds_df.head(optimal_k)

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>6. Visualization</b>

<hr style='height:1px;border:none;background-;'> 

<p style='font-size:18px;font-family:Arial;'><b>6.1 Visualization of Clusters</b></p> 

<p style='font-size:16px;font-family:Arial;'>The graph illustrates the clustering of transactions into distinct groups for credit card fraud detection using anomaly detection. Based on the analysis, the data has been divided into optimal clusters, each representing a unique transaction pattern. This clustering approach helps identify potential fraudulent activities by distinguishing normal and anomalous transaction behaviors, enabling more targeted fraud detection and prevention efforts.</p>


<p style='font-size:16px;font-family:Arial;'>This visualization helps us explore the structure of the clusters and visually identify <b>anomalies</b> (points that deviate significantly from their respective cluster centers). The use of <b>diamonds for anomalies</b> and <b>circles for normal data points</b> provides a clear and intuitive way to distinguish between outliers and inliers in the data, making it easier to detect potential fraudulent or unusual activity.</p>

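<p style='font-size:16px;font-family:Arial;'>The function renders a static Matplotlib figure with one color per cluster and a legend entry for each group. The <code>truncated_features</code> column it builds is not displayed in this figure; it is available as hover text if the plot is ported to an interactive library such as Plotly.</p>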


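<p style = 'font-size:16px;font-family:Arial;'>The visualization function below applies <b>t-SNE</b> (t-distributed Stochastic Neighbor Embedding) to reduce the high-dimensional embeddings to 2D for plotting. It then measures each point's distance from its cluster's mean position in the 2D projection and marks the 5% most distant points in each cluster as anomalies. Because these distances are computed after projection, the points highlighted here can differ from those flagged by the K-Means distance thresholds in section 7.</p>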

In [None]:
def visualize_clusters(clustered_data: DataFrame):   
    clus = clustered_data.to_pandas()
    # --- Perform t-SNE ---
    tsne = TSNE(n_components=2, random_state=123)
    tsne_result = tsne.fit_transform(clus.iloc[:, 3:-2])

    # --- Create visualization DataFrame ---
    tsne_df = pd.DataFrame(tsne_result, columns=['tsne_1', 'tsne_2'])
    tsne_df['cluster_id'] = clus['td_clusterid_kmeans']
    tsne_df['record_id'] = clus['l_id']
    tsne_df['features'] = clus['txt']  # rename for clarity

    # Truncate text values
    tsne_df['truncated_features'] = tsne_df['features'].apply(
        lambda x: f"{x[:50]}..." if isinstance(x, str) else x
    )

    # --- Compute cluster centers ---
    cluster_centers = tsne_df.groupby('cluster_id')[['tsne_1', 'tsne_2']].mean()

    # --- Compute distance to cluster center ---
    def euclidean_distance(row):
        center = cluster_centers.loc[row['cluster_id']]
        return np.sqrt((row['tsne_1'] - center['tsne_1'])**2 + (row['tsne_2'] - center['tsne_2'])**2)

    tsne_df['distance'] = tsne_df.apply(euclidean_distance, axis=1)

    # --- Mark anomalies per cluster (top 5% distance) ---
    tsne_df['is_anomaly'] = tsne_df.groupby('cluster_id')['distance'].transform(
        lambda x: x > x.quantile(0.95)
    )

    # --- Plot using Matplotlib ---
    plt.figure(figsize=(12, 9))
    palette = sns.color_palette('tab10', n_colors=tsne_df['cluster_id'].nunique())

    # Draw normal and anomaly points separately
    for i, cluster in enumerate(sorted(tsne_df['cluster_id'].unique())):
        cluster_data = tsne_df[tsne_df['cluster_id'] == cluster]
        
        # Normal points
        normal_points = cluster_data[~cluster_data['is_anomaly']]
        plt.scatter(
            normal_points['tsne_1'], normal_points['tsne_2'],
            label=f'Cluster {cluster}',
            color=palette[i],
            alpha=0.7,
            edgecolor='k',
            s=50,
            marker='o'
        )
        
        # Anomalies
        anomalies = cluster_data[cluster_data['is_anomaly']]
        plt.scatter(
            anomalies['tsne_1'], anomalies['tsne_2'],
            color=palette[i],
            marker='D',  # diamond marker
            edgecolor='black',
            s=120,
            label=f'Anomaly (Cluster {cluster})'
        )

    # --- Plot cluster centers ---
    plt.scatter(
        cluster_centers['tsne_1'], cluster_centers['tsne_2'],
        color='black', s=250, marker='X', label='Cluster Centers'
    )

    plt.title('t-SNE Visualization of Clusters with Anomaly Detection', fontsize=16, weight='bold')
    plt.xlabel('Dimension-1', fontsize=14)
    plt.ylabel('Dimension-2', fontsize=14)
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.tight_layout()


    plt.show()

In [None]:
visualize_clusters(kmeans_scored_training_data)

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>7. Anomaly Prediction</b>

<hr style='height:1px;border:none;background-;'> 

<p style='font-size:18px;font-family:Arial;'><b>7.1 Predict Anomalies</b></p> 

<p style = 'font-size:16px;font-family:Arial;'>The anomaly prediction function joins the scored data with the cluster thresholds and flags each transaction as an anomaly (1) or normal (0) based on whether its distance exceeds the cluster's 95th percentile threshold.</p>
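
<p style = 'font-size:16px;font-family:Arial;'>The thresholding logic can be illustrated outside the database with plain Python. This is a sketch under stated assumptions: the scored rows are made up, <code>percentile</code> is a hypothetical helper that uses linear interpolation, and the in-database percentile may interpolate differently.</p>

```python
from collections import defaultdict

def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

# Hypothetical scored rows: (id, cluster id, distance to cluster center)
scored = [
    (1, 0, 1.0), (2, 0, 2.0), (3, 0, 3.0), (4, 0, 4.0), (5, 0, 10.0),
    (6, 1, 2.0), (7, 1, 2.0), (8, 1, 2.0),
]

# Collect the distances observed in each cluster
distances_by_cluster = defaultdict(list)
for _, cluster_id, distance in scored:
    distances_by_cluster[cluster_id].append(distance)

# One 95th-percentile threshold per cluster
thresholds = {c: percentile(d, 0.95) for c, d in distances_by_cluster.items()}

# Flag a row as anomalous (1) when its distance exceeds its cluster's threshold
flags = {row_id: int(distance > thresholds[cluster_id])
         for row_id, cluster_id, distance in scored}
print(thresholds)
print(flags)
```

<p style = 'font-size:16px;font-family:Arial;'>Only row 5 exceeds its cluster's threshold here. Rows in cluster 1 all sit exactly at the threshold, and the strict comparison leaves them unflagged.</p>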

In [None]:
def predict_anomalies(scored_data_table: str, thresholds_table: str, output_table: str) -> dict:
    scored_data_df = DataFrame(scored_data_table)
    thresholds_df = DataFrame(thresholds_table)

    prediction_df = scored_data_df.join(
        other=thresholds_df,
        on='td_clusterid_kmeans',
        how='inner',
        lsuffix='_l',
        rsuffix='_r'
    )

    from teradataml import case

    prediction_df = prediction_df.assign(
        anomaly = case(
            [(prediction_df['td_distance_kmeans'] > prediction_df['percentile_td_distance_kmeans'], 1)],
            else_=0
        )
    )

    copy_to_sql(prediction_df, table_name=output_table, if_exists='replace')

    return {
        'anomaly_prediction_table': output_table
    }

<p style = 'font-size:16px;font-family:Arial;'>We apply the anomaly prediction function to our training data to identify transactions that deviate significantly from their assigned cluster patterns.</p>

In [None]:
predict_anomalies(model_artifacts['kmeans_scored_training_table'],model_artifacts['kmeans_thresholds_table'], 'training_data_predictions')

<p style = 'font-size:16px;font-family:Arial;'>Let us view the prediction results. The <b>anomaly</b> column indicates whether each transaction is flagged as anomalous (1) or normal (0).</p>

In [None]:
predictions_training_df= DataFrame('training_data_predictions')
predictions_training_df.head(5)

<p style = 'font-size:16px;font-family:Arial;'>Examine the distribution of anomalies across clusters to understand which transaction patterns are most likely to be flagged as suspicious.</p>

In [None]:
predictions_training_df[predictions_training_df['anomaly'] == 1].groupby('td_clusterid_kmeans__r').count()[['td_clusterid_kmeans__r','count_anomaly']].head(optimal_k)

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>8. Real-time Data Fetching</b>

<hr style='height:1px;border:none;background-;'> 

<p style='font-size:18px;font-family:Arial;'><b>8.1 Data Fetching Process</b></p> 

<p style='font-size:16px;font-family:Arial;'>The <code>fetch_data()</code> function lets the user fetch new credit card transaction records in real time. Each time the user enters <b>'yes'</b>, the function fetches and stores 10 new records, simulating live data collection. Entering <b>'stop'</b> ends the fetching process and merges all collected records into a single dataset.</p>
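
<p style='font-size:16px;font-family:Arial;'>The notebook's own <code>fetch_data()</code> is defined elsewhere and is not reproduced here. As a rough illustration of the yes/stop loop described above, the following sketch uses hypothetical names (<code>collect_batches</code>, <code>next_batch</code>, <code>ask</code>) and a fake record source; the prompt function is injectable so the loop can be exercised without typing at a console.</p>

```python
from typing import Callable, List


def collect_batches(
    next_batch: Callable[[], List[dict]],
    ask: Callable[[str], str] = input,
) -> List[dict]:
    """Collect batches while the user answers 'yes'; finish on 'stop'."""
    collected: List[dict] = []
    while True:
        answer = ask("Fetch 10 new records? (yes/stop): ").strip().lower()
        if answer == "stop":
            break
        if answer == "yes":
            collected.extend(next_batch())
        # Any other answer is ignored and the prompt repeats.
    return collected


# Scripted answers and a fake record source, for illustration only.
answers = iter(["yes", "maybe", "yes", "stop"])
counter = iter(range(1000))


def fake_batch() -> List[dict]:
    # Hypothetical stand-in for a real data source: 10 records per call.
    return [{"SK_ID_CURR": next(counter)} for _ in range(10)]


records = collect_batches(fake_batch, ask=lambda prompt: next(answers))
print(len(records))
```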





In [None]:
# Fetch real-time credit card data
tdf = fetch_data()

<p style = 'font-size:16px;font-family:Arial;'>Store the fetched real-time data in the database for subsequent processing and embedding generation.</p>

In [None]:
# Store in database
copy_to_sql(tdf, table_name="credit_card_db_test", if_exists="replace")

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>9. Create the Embeddings on Fetched Data</b>

<p style='font-size:16px;font-family:Arial;'>We generate and store the embeddings for the fetched credit card transaction data. The embeddings are created using a pre-trained model, which transforms the transaction data into vector representations that can be used for anomaly detection.</p>

<p style = 'font-size:16px;font-family:Arial;'>Load and verify the test data from the database to confirm the data was stored correctly.</p>

In [None]:
tdf = DataFrame.from_table('credit_card_db_test')
tdf.shape

<p style = 'font-size:16px;font-family:Arial;'>Prepare the test data by concatenating relevant features into a single text column, following the same format used for training data.</p>

In [None]:
query = """
SELECT 
    SK_ID_CURR AS id,
    NAME_CONTRACT_TYPE || ' ' ||
    CODE_GENDER || ' ' ||
    FLAG_OWN_CAR || ' ' ||
    NAME_FAMILY_STATUS || ' ' ||
    COALESCE(HOUSETYPE_MODE, '') || ' ' ||
    COALESCE(OCCUPATION_TYPE, '') || ' ' ||
    CAST(CNT_CHILDREN AS VARCHAR(50)) || ' ' ||
    CAST(REGION_POPULATION_RELATIVE AS VARCHAR(50)) || ' ' ||
    CAST(CNT_FAM_MEMBERS AS VARCHAR(50)) || ' ' ||
    CAST(AMT_INCOME_TOTAL AS VARCHAR(50)) AS txt
FROM credit_card_db_test
"""

<p style = 'font-size:16px;font-family:Arial;'>Execute the query to create the formatted text representation of test transaction data.</p>

In [None]:
DF_test_data = DataFrame.from_query(query)
DF_test_data.head(2)
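
<p style = 'font-size:16px;font-family:Arial;'>To make the effect of the query easier to see, here is a small Python sketch of the same row-to-text rule. It is an approximation under stated assumptions: the helper <code>row_to_text</code> and the sample row are made up, and Python's <code>str()</code> will not format numbers exactly as a SQL <code>CAST(... AS VARCHAR)</code> does. The sketch also reflects standard SQL behavior in which concatenating with <code>NULL</code> yields <code>NULL</code>, so only the two columns wrapped in <code>COALESCE</code> are protected against missing values.</p>

```python
from typing import Mapping, Optional

# Column order follows the SELECT list in the query above.
TEXT_COLUMNS = [
    "NAME_CONTRACT_TYPE", "CODE_GENDER", "FLAG_OWN_CAR", "NAME_FAMILY_STATUS",
    "HOUSETYPE_MODE", "OCCUPATION_TYPE", "CNT_CHILDREN",
    "REGION_POPULATION_RELATIVE", "CNT_FAM_MEMBERS", "AMT_INCOME_TOTAL",
]
# Columns the query wraps in COALESCE(..., '').
COALESCED = {"HOUSETYPE_MODE", "OCCUPATION_TYPE"}


def row_to_text(row: Mapping[str, object]) -> Optional[str]:
    """Join the selected columns with single spaces, mimicking the query."""
    parts = []
    for col in TEXT_COLUMNS:
        value = row.get(col)
        if value is None:
            if col in COALESCED:
                value = ""
            else:
                # In SQL, NULL || 'text' is NULL, so the whole txt is NULL.
                return None
        parts.append(str(value))
    return " ".join(parts)


# Made-up example row, for illustration only.
sample_row = {
    "NAME_CONTRACT_TYPE": "Cash loans", "CODE_GENDER": "F", "FLAG_OWN_CAR": "N",
    "NAME_FAMILY_STATUS": "Married", "HOUSETYPE_MODE": None,
    "OCCUPATION_TYPE": "Managers", "CNT_CHILDREN": 0,
    "REGION_POPULATION_RELATIVE": 0.0188, "CNT_FAM_MEMBERS": 2,
    "AMT_INCOME_TOTAL": 135000,
}

print(repr(row_to_text(sample_row)))
```

<p style = 'font-size:16px;font-family:Arial;'>Note the double space where <code>HOUSETYPE_MODE</code> is missing: the empty string still contributes its separator, which is the same text the embedding model would receive from the SQL version.</p>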

<p style = 'font-size:16px;font-family:Arial;'>Generate embeddings for the test data using the same ONNX model and tokenizer. These embeddings will be used to predict cluster assignments and detect anomalies in the new transactions.</p>

In [None]:
# Generate Embeddings
DF_embeddings_test = ONNXEmbeddings(
    newdata = DF_test_data,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result
print("All steps completed successfully!")

In [None]:
# View sample data
DF_embeddings_test.sample(2)

In [None]:
# View shape of dataframe
DF_embeddings_test.shape

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>10. Applying KMeans Prediction</b>

<p style='font-size:16px;font-family:Arial;'>We use <b>KMeansPredict</b> with the previously trained KMeans model to assign each record in the new dataset to a cluster. This labels every data point with the cluster it belongs to so that we can analyze the results further.</p>

<p style = 'font-size:16px;font-family:Arial;'>Let us verify the model artifacts that were saved during training. These include the trained K-Means model, scored training output, cluster statistics, and distance thresholds.</p>

In [None]:
print(model_artifacts)

<p style = 'font-size:16px;font-family:Arial;'>Apply the trained K-Means model to the test embeddings to predict cluster assignments and calculate distances. The <b>output_distance=True</b> parameter ensures we get the distance to the nearest cluster center for anomaly detection.</p>

In [None]:
# Apply KMeansPredict to assign each test embedding to a cluster
kmeans_scored_test_data = KMeansPredict(
    data=DF_embeddings_test,
    object=DataFrame(model_artifacts['kmeans_model_trained_table']),
    output_distance=True
)

# Extract the scored result as a teradataml DataFrame
kmeans_scored_test_data_df = kmeans_scored_test_data.result

In [None]:
kmeans_scored_test_data_df
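
<p style = 'font-size:16px;font-family:Arial;'>Conceptually, scoring a point against a trained K-Means model means finding the closest centroid and reporting the distance to it. The sketch below illustrates that idea with Euclidean distance in plain Python. It is not how <code>KMeansPredict</code> is implemented in-database; the function name <code>assign_cluster</code>, the zero-based cluster IDs, and the two-dimensional sample centroids are assumptions chosen for readability.</p>

```python
import math
from typing import Sequence, Tuple


def assign_cluster(
    point: Sequence[float],
    centroids: Sequence[Sequence[float]],
) -> Tuple[int, float]:
    """Return (index of nearest centroid, Euclidean distance to it)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    cluster_id = min(range(len(centroids)), key=distances.__getitem__)
    return cluster_id, distances[cluster_id]


# Made-up 2-D centroids; real embeddings have many more dimensions.
example_centroids = [(0.0, 0.0), (10.0, 0.0)]

near_first = assign_cluster((3.0, 4.0), example_centroids)
near_second = assign_cluster((9.0, 0.0), example_centroids)
print(near_first, near_second)
```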

<p style = 'font-size:16px;font-family:Arial;'>Let's check how many data points each cluster has.</p>

In [None]:
# Count of each clusterid
kmeans_scored_test_data_df.groupby('td_clusterid_kmeans').count().head(optimal_k)

<p style = 'font-size:16px;font-family:Arial;'>Store the scored test data with cluster assignments and distances in the database.</p>

In [None]:
copy_to_sql(kmeans_scored_test_data_df, 'kmeans_scored_test_output', if_exists='replace')

<p style = 'font-size:16px;font-family:Arial;'>Apply the anomaly detection function to the test data using the established thresholds from training. Transactions exceeding their cluster's distance threshold will be flagged as anomalies.</p>

In [None]:
predict_anomalies(
    'kmeans_scored_test_output',
    model_artifacts['kmeans_thresholds_table'],
    'test_data_predictions'
)

<p style = 'font-size:16px;font-family:Arial;'>View the anomaly predictions for the test data. The <b>anomaly</b> column shows which transactions have been flagged as potential fraud.</p>

In [None]:
predictions_test_df = DataFrame('test_data_predictions')
predictions_test_df.head(10)
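
<p style = 'font-size:16px;font-family:Arial;'>The thresholds reused here are per-cluster 95th percentiles of the training distances. As a rough, database-free sketch of how such thresholds could be derived, the code below groups distances by cluster and takes a linearly interpolated percentile. The helper names and sample values are hypothetical, and the interpolation rule is an assumption: the in-database percentile calculation may use a different convention and give slightly different cutoffs.</p>

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: Iterable[float], q: float) -> float:
    """Linearly interpolated percentile of values, for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def cluster_thresholds(
    rows: Iterable[Tuple[int, float]], q: float = 95.0
) -> Dict[int, float]:
    """Map each cluster ID to the q-th percentile of its distances."""
    by_cluster: Dict[int, List[float]] = defaultdict(list)
    for cluster_id, distance in rows:
        by_cluster[cluster_id].append(distance)
    return {cid: percentile(d, q) for cid, d in by_cluster.items()}


# Made-up training distances: cluster 0 has 1.0 .. 21.0, cluster 1 one point.
training_rows = [(0, float(i)) for i in range(1, 22)] + [(1, 0.5)]

example_thresholds = cluster_thresholds(training_rows)
print(example_thresholds)
```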

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>11. Visualization</b>

<hr style='height:1px;border:none;background-;'> 

<p style='font-size:18px;font-family:Arial;'><b>11.1 Visualization of Clusters</b></p> 

<p style='font-size:16px;font-family:Arial;'>This visualization helps us explore the structure of the clusters and visually identify <b>anomalies</b> (points that deviate significantly from their respective cluster centers). The use of <b>diamonds for anomalies</b> and <b>circles for normal data points</b> provides a clear and intuitive way to distinguish between outliers and inliers in the data, making it easier to detect potential fraudulent or unusual activity.</p>

<p style='font-size:16px;font-family:Arial;'>The interactive nature of the plot allows for an engaging exploration of the data, where users can hover over points to view more detailed information.</p>


<p style = 'font-size:16px;font-family:Arial;'>Join the test embeddings with their cluster predictions and distances to create a complete scored dataset for anomaly detection.</p>

In [None]:
test_data_scored = DF_embeddings_test.join(
    other = kmeans_scored_test_data_df[['id', 'td_clusterid_kmeans', 'td_distance_kmeans']],
    on = ["id"],
    how = "inner",
    lprefix = "l",
    rprefix = "r"
)

In [None]:
test_data_scored

<p style = 'font-size:16px;font-family:Arial;'>Combine the training and test scored data to visualize both datasets together, allowing us to see how the new transactions compare to the established cluster patterns.</p>

In [None]:
combined_clustered_data = concat([kmeans_scored_training_data, test_data_scored], allow_duplicates=False)

In [None]:
combined_clustered_data.shape

<p style = 'font-size:16px;font-family:Arial;'>Visualize the test data clusters using t-SNE to see how the real-time transactions are positioned relative to the learned cluster patterns. Anomalies will appear as diamond shapes, indicating transactions that deviate from normal behavior.</p>

In [None]:
visualize_clusters(combined_clustered_data)

<hr style="height:1px;border:none;background-;">
<p style = 'font-size:20px;font-family:Arial;'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;'>In this demo, we have seen how to run a HuggingFace embedding model (BAAI/bge-small-1.5) in ONNX format inside the database, in parallel, to create embeddings. We then used KMeans clustering to group credit card records and flagged anomalies as potential fraud based on their distance from the cluster centers. The interactive t-SNE visualization allowed us to explore the clusters, distinguish anomalies by their <b>diamond shapes</b>, and analyze the structure of the data more intuitively.</p>

<hr style="height:2px;border:none;background-;">
<b style = 'font-size:20px;font-family:Arial;'>12. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Clean up the work tables and views so that the notebook can be rerun without errors.</p>

In [None]:
tables = [
    'credit_card_db',
    'embeddings_models',
    'embeddings_tokenizers',
    'embeddings_table_training',
    'kmeans_trained_model',
    'kmeans_scored_training_output',
    'kmeans_cluster_stats',
    'kmeans_thresholds',
    'training_data_predictions',
    'credit_card_db_test',
    'kmeans_scored_test_output',
    'test_data_predictions'
]

# Drop each work table; ignore failures (for example, when the table does not exist)
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        pass

views = ['credit_card_tokenized_for_embeddings', 'credit_card_embeddings']

# Drop each view; ignore failures (for example, when the view does not exist)
for view in views:
    try:
        db_drop_view(view_name=view)
    except Exception:
        pass
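
<p style = 'font-size:16px;font-family:Arial;'>Silently ignoring every failure keeps the cleanup short, but it also hides objects that could not be dropped. One possible alternative, sketched here with a hypothetical helper <code>drop_all</code> and a fake drop function, is to record which names were dropped and which were skipped. In the notebook, the <code>drop</code> argument could be a small wrapper around <code>db_drop_table</code> or <code>db_drop_view</code>; that wiring is an assumption and is not shown.</p>

```python
from typing import Callable, Iterable, List, Tuple


def drop_all(
    names: Iterable[str], drop: Callable[[str], None]
) -> Tuple[List[str], List[str]]:
    """Try to drop every name; return (dropped, skipped) lists."""
    dropped: List[str] = []
    skipped: List[str] = []
    for name in names:
        try:
            drop(name)
            dropped.append(name)
        except Exception:
            # Keep going, but remember what could not be dropped.
            skipped.append(name)
    return dropped, skipped


# Fake in-memory "database", for illustration only.
existing_objects = {"kmeans_thresholds", "test_data_predictions"}


def fake_drop(name: str) -> None:
    if name not in existing_objects:
        raise LookupError(f"{name} does not exist")
    existing_objects.remove(name)


dropped_names, skipped_names = drop_all(
    ["kmeans_thresholds", "no_such_table", "test_data_predictions"], fake_drop
)
print("dropped:", dropped_names)
print("skipped:", skipped_names)
```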

<hr style="height:1px;border:none;background-;">
<p style = 'font-size:18px;font-family:Arial;'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_CreditCard_local');"        # Takes 10 seconds

<p style = 'font-size:16px;font-family:Arial;'>Close the connection to Vantage to release resources.</p>

In [None]:
remove_context()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>Dataset:</b>

- `SK_ID_CURR`: Customer ID
- `TARGET`: Target (0 = No, 1 = Yes)
- `NAME_CONTRACT_TYPE`: Contract type (e.g., Cash loans, Revolving loans)
- `CODE_GENDER`: Gender (Female / Male)
- `FLAG_OWN_CAR`: Car ownership status (Y = Yes, N = No)
- `CNT_CHILDREN`: Number of children
- `AMT_INCOME_TOTAL`: Total income
- `NAME_FAMILY_STATUS`: Family status (e.g., Married, Separated, Single)
- `REGION_POPULATION_RELATIVE`: Relative population of the region
- `FLAG_MOBIL`: Mobile phone status (1 = Yes)
- `FLAG_EMP_PHONE`: Employment phone status (0 = No, 1 = Yes)
- `CNT_FAM_MEMBERS`: Number of family members
- `HOUSETYPE_MODE`: Type of house (e.g., Block of flats, House, Municipal)
- `OCCUPATION_TYPE`: Occupation (e.g., None, Sales staff, Managers)
- `AGE`: Age of the customer

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
    <li>KMeans reference: <a href = 'https://docs.teradata.com/search/all?query=KMeans&value-filters=vrm_release~%252220.00.00.03%2522&content-lang=en-US'>here</a></li>
    <li>KMeansPredict reference: <a href = 'https://docs.teradata.com/search/all?query=KMeansPredict&value-filters=vrm_release~%252220.00.00.03%2522&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>