<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Anomaly Detection in Credit Card
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style='font-size:20px;font-family:Arial;color:#00233C'><b>Credit Card Fraud Detection using K-Means Clustering:</b></p> 
<p style='font-size:16px;font-family:Arial;color:#00233C'>
Detecting fraudulent transactions is crucial for financial security. This approach leverages <b>K-Means clustering</b> to group similar transactions and identifies anomalies based on <b>Euclidean distance</b>, where fraud-like patterns deviate significantly from normal spending behaviors.
</p>

<ul style='font-size:16px;font-family:Arial;color:#00233C'>
    <li><strong>Anomaly Detection:</strong> Identifies outliers based on their distance from the cluster center, marking transactions that deviate from normal spending patterns.</li>
    <li><strong>Vector Embeddings:</strong> Converts categorical transaction data into vector representations to improve clustering accuracy.</li>
    <li><strong>Feature Engineering:</strong> Includes transaction amount, location, time, and merchant category to enhance fraud detection.</li>
    <li><strong>Dimensionality Reduction:</strong> Uses t-SNE to visualize clusters and detect fraudulent transactions that do not fit normal behavior.</li>
    <li><strong>Scalability:</strong> Works efficiently on large datasets by leveraging K-Means for unsupervised learning and anomaly detection.</li>
</ul>
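
<p style='font-size:16px;font-family:Arial;color:#00233C'>
The distance-based idea in the list above can be sketched with plain NumPy. This is a minimal illustration on made-up 2-D points, not the in-database implementation used later in the notebook; the function name <code>flag_anomalies</code>, the toy data, and the 0.95 quantile threshold are assumptions chosen for the example.
</p>

```python
import numpy as np

def flag_anomalies(points, centroids, quantile=0.95):
    """Return (labels, distances, is_anomaly) for each point."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)    # index of the nearest centroid
    nearest = dists.min(axis=1)      # distance to that centroid
    # Points farther away than the chosen quantile are treated as anomalies
    threshold = np.quantile(nearest, quantile)
    return labels, nearest, nearest > threshold

# Toy data (hypothetical): two tight groups plus one far-away point
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                   [9.0, 0.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

labels, distances, is_anomaly = flag_anomalies(points, centroids)
print(is_anomaly)  # only the last point is flagged
```

<p style='font-size:16px;font-family:Arial;color:#00233C'>
A quantile threshold always flags a fixed share of points, so it ranks the most unusual records rather than proving that any of them is fraudulent.
</p>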


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
    Teradata’s integration with <b>LLMs and in-database hosting capabilities</b>, together with the Open Analytics Framework, enables customers to run NLP models at scale. On-prem customers face two key challenges: data movement latency and lack of access to cloud models. By bringing language models into Vantage, Teradata lets these customers run NLP models without moving large amounts of data to and from external services.
</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Downloading and installing additional software needed</b></p>

In [None]:
%%capture

!pip install wordcloud nltk --quiet

In [None]:
%%capture

!pip install --force-reinstall pillow --quiet

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above libraries have to be installed. Restart the kernel after executing these cells to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing <b> 0 0</b></i> (zero zero) and pressing <i>Enter</i>.</p>
</div>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Standard libraries
import time
import warnings
import random

# Data manipulation and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Teradata libraries
from teradataml import *
from teradataml import (
    create_context, 
    delete_byom, 
    display,
    execute_sql,
    save_byom,
    remove_context,
    in_schema,
    KMeans,
    DataFrame,
    db_drop_table,
    db_drop_view
)
display.max_rows = 5

# NLP libraries
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)


# machine learning libraries
from sklearn.manifold import TSNE

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Anomaly_Detection_Credit_Card.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available space. The following cell contains both statements, with one commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_CreditCard_cloud');" 
# takes about 20 seconds, estimated space: 0 MB
#%run -i ../run_procedure.py "call get_data('DEMO_CreditCard_local');" 
# takes about 35 seconds, estimated space: 11 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Confirmation for functions</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Before starting let us confirm that the required functions are installed.</p>

In [None]:
from IPython.display import display, Markdown

df_check= DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')
if df_check.get_values()[0][0] >= 10:
    print('Functions are installed, please continue.')
else:
    print('Functions are not installed, please go to the Initialization notebook before proceeding further')
    display(Markdown("[Initialization Notebook](./Initialization_and_Model_Load.ipynb)"))

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>3. Create the Embeddings</b>

In [None]:
tdf_cc = DataFrame(in_schema("DEMO_CreditCard","Credit_Card"))
tdf_cc

In [None]:
tdf_cc.shape

In [None]:
#Storing in database
copy_to_sql(df = tdf_cc, table_name = "credit_card_db", if_exists="replace")

In [None]:
tdf = DataFrame.from_table('credit_card_db')
tdf.shape

<hr style="height:1px;border:none;background-color:#00233C;">
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>3.1 Real-time Data Collection
</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This simulated data mimics the real-time process, where transaction details are captured continuously, enabling anomaly detection in real-time credit card activity.</p>

In [None]:
# Real time data collection
def fetching_real_time_data():
    contract_types = ["Cash loans", "Revolving loans"]
    genders = ["M", "F"]
    own_car = ["Y", "N"]
    family_status = ["Married", "Single", "Separated"]
    house_types = ["Block of flats", "House", "Municipal"]
    occupations = ["Sales staff", "Managers", "Core staff", "None"]

    records = []
    for _ in range(10):
        record = {
            "SK_ID_CURR": random.randint(456255, 999999),
            "TARGET": random.choice([0, 1]),
            "NAME_CONTRACT_TYPE": random.choice(contract_types),
            "CODE_GENDER": random.choice(genders),
            "FLAG_OWN_CAR": random.choice(own_car),
            "CNT_CHILDREN": random.randint(0, 5),
            "AMT_INCOME_TOTAL": round(random.uniform(117000000, 117100000), 2),
            "NAME_FAMILY_STATUS": random.choice(family_status),
            "REGION_POPULATION_RELATIVE": round(random.uniform(0.001, 0.05), 6),
            "FLAG_MOBIL": 1,
            "FLAG_EMP_PHONE": random.choice([0, 1]),
            "CNT_FAM_MEMBERS": random.randint(1, 6),
            "HOUSETYPE_MODE": random.choice(house_types),
            "OCCUPATION_TYPE": random.choice(occupations),
            "AGE": random.randint(20, 70)
        }
        records.append(record)
    
    return pd.DataFrame(records)

In [None]:
# Function for real-time data fetching
def fetch_data():
    all_data = [] 

    print("\nInitializing real-time credit card data fetch...\n")
    time.sleep(1)
    
    while True:
        user_input = input("Do you want to fetch updated credit card records? (yes to start, stop to end): ").strip().lower()
        
        if user_input == 'yes':
            # flush=True shows the message immediately without needing the sys module
            print("Fetching new credit card record...", end="", flush=True)
            time.sleep(random.uniform(0.5, 1.5))

            data = fetching_real_time_data()
            print(" Data fetched successfully!")
            all_data.append(data)
        elif user_input == 'stop':
            print("\nStopping the data collection. Finalizing...\n")
            time.sleep(1)
            break
        else:
            print("Invalid input. Please enter 'yes' to generate a record or 'stop' to end.")
    
    if all_data:
        print("\nFinalizing the dataset...\n")
        time.sleep(1)
        print("Merging all generated records...\n")
        print("Inserted new records")
        return pd.concat(all_data, ignore_index=True)
    else:
        print("No records fetched, returning an empty DataFrame.")  
        return pd.DataFrame(columns=['TransactionID', 'Amount'])  

<hr style="height:1px;border:none;background-color:#00233C;">
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>3.2 Creation of Views and Final Embeddings Table
</b>

<p style='font-size:16px;font-family:Arial;color:#00233C'>
This code creates a view named <code>credit_card_tokenized_for_embeddings</code>, which contains tokenized credit card data for embedding generation. It extracts the <code>id</code>, <code>txt</code> (the <code>AMT_INCOME_TOTAL</code> amount), <code>input_ids</code> (tokenized representations), and <code>attention_mask</code> using the <code>ivsm.tokenizer_encode</code> function.
</p>
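
<p style='font-size:16px;font-family:Arial;color:#00233C'>
To make <code>input_ids</code> and <code>attention_mask</code> concrete, the sketch below pads a short token sequence to a fixed length and builds the matching mask. It is a conceptual illustration only and does not call or reproduce <code>ivsm.tokenizer_encode</code>; the helper name <code>pad_to_max_length</code>, the token ids, and the padding id of 0 are all hypothetical.
</p>

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Truncate or pad token ids and build the matching attention mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask tells the model to ignore the padded positions
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical token ids for one short input
input_ids, attention_mask = pad_to_max_length([101, 2054, 102], max_length=6)
print(input_ids)       # [101, 2054, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0, 0]
```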

<p style='font-size:16px;font-family:Arial;color:#00233C'>
Additionally, a view named <code>credit_card_embeddings</code> is created to expose the computed embeddings, that is, vector representations of each record's text. These embeddings are generated via the <code>ivsm.IVSM_score</code> function, which encodes the tokenized input using the <code>bge-small-en-v1.5</code> model.
</p>

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>Generate Embeddings</b></p>    

<p style='font-size:16px;font-family:Arial;color:#00233C'>
Generating embeddings will take approximately <b>10-15 minutes.</b>
</p>


<center><img src="images/visual.svg" alt="embeddings_decision" width=1300 height=1400/></center>


<div class="alert alert-block alert-info">
<p style='font-size:16px;font-family:Arial;color:#00233C'>
<i><b>
These embeddings will later be used in anomaly detection by comparing the similarity between different transactions. By converting each transaction into a vector representation, we can identify outliers or anomalies based on the distance between vectors.</b></i>
</p>
</div>


In [None]:
def generate_and_store_embeddings():
    
    # Step 1: Tokenization View
    global tdf_embeddings_store

    execute_sql("""
    REPLACE VIEW credit_card_tokenized_for_embeddings AS (
      SELECT 
        id, 
        txt, 
        IDS AS input_ids, 
        attention_mask 
      FROM 
        ivsm.tokenizer_encode(
          ON (
            SELECT 
              top 1000 SK_ID_CURR AS id, 
              AMT_INCOME_TOTAL AS txt 
            FROM 
              credit_card_db
          ) ON (
            SELECT 
              model AS tokenizer 
            FROM 
              embeddings_tokenizers 
            WHERE 
              model_id = 'bge-small-en-v1.5'
          ) DIMENSION
          USING 
              ColumnsToPreserve('id', 'txt')
              OutputFields('IDS', 'ATTENTION_MASK')
              MaxLength(1024)
              PadToMaxLength('True')
              TokenDataType('INT64')
        ) a
    );
    """)
    print("Tokenized View Created")
    
    # Step 2: Embeddings View
    execute_sql("""
    REPLACE VIEW credit_card_embeddings AS (
      SELECT 
        * 
      FROM 
        ivsm.IVSM_score(
          ON credit_card_tokenized_for_embeddings 
          ON (
            SELECT 
              * 
            FROM 
              embeddings_models 
            WHERE 
              model_id = 'bge-small-en-v1.5'
          ) dimension
          USING 
              ColumnsToPreserve('id', 'txt')
              ModelType('ONNX')
              BinaryInputFields('input_ids', 'attention_mask')
              BinaryOutputFields('sentence_embedding')
              Caching('inquery')
        ) a
    );
    """)
    print("Embeddings View Created")
    
    # Step 3: Store in Table
    print("\nGenerating embeddings and Saving to the database, please wait...")
    start = time.time()
    qry = """
    CREATE multiset TABLE credit_card_embeddings_store AS (
      SELECT 
        * 
      FROM 
        ivsm.vector_to_columns(
          ON credit_card_embeddings 
        USING 
          ColumnsToPreserve('id', 'txt') 
          VectorDataType('FLOAT32')
          VectorLength(384) 
          OutputColumnPrefix('emb_')
          InputColumnName('sentence_embedding')
        ) a
    ) WITH DATA PRIMARY index(id)
    """
    
    # Create the table; if it already exists from an earlier run, drop it and retry once
    start = time.time()
    try:
        execute_sql(qry)
    except Exception:
        db_drop_table('credit_card_embeddings_store')
        start = time.time()
        execute_sql(qry)
    end = time.time()
    print('Table Created')

    tdf_embeddings_store = DataFrame('credit_card_embeddings_store')
    print(f"Total time to run tokenization+embeddings on {tdf_embeddings_store.shape[0]} rows = ", (end-start)/60, " min")

In [None]:
generate_and_store_embeddings()  

print("All steps completed successfully!")

In [None]:
# Displaying Sample Rows from the Embeddings Store
tdf_embeddings_store.sample(2)

In [None]:
# Checking the Shape of the Embeddings Store
tdf_embeddings_store.shape

<hr style="height:1px;border:none;background-color:#00233C;">
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>3.3 Elbow Method
</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this step, we apply the <b>Elbow Method</b> to determine the optimal number of clusters for the KMeans clustering algorithm. The Elbow Method helps in selecting the number of clusters that best represents the data by analyzing the <b>Total Within-Cluster Sum of Squares (Total_WithinSS)</b>.</p>

<p style='font-size:16px;font-family:Arial;color:#00233C'>
For each number of clusters, we print the <code>Total_WithinSS</code> value, which represents the compactness of the clusters.
</p>
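
<p style='font-size:16px;font-family:Arial;color:#00233C'>
As a rough illustration of what this quantity measures, the sketch below computes a total within-cluster sum of squares with NumPy on made-up points. It assumes <code>Total_WithinSS</code> follows the standard definition (squared distance of each point to its cluster mean, summed over all clusters); the function name <code>total_within_ss</code> and the data are hypothetical.
</p>

```python
import numpy as np

def total_within_ss(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Four points on a line that form two obvious groups
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
one_cluster = total_within_ss(points, np.array([0, 0, 0, 0]))
two_clusters = total_within_ss(points, np.array([0, 0, 1, 1]))
print(one_cluster, two_clusters)  # 104.0 4.0
```

<p style='font-size:16px;font-family:Arial;color:#00233C'>
The value drops sharply when the number of clusters matches the structure of the data and only slowly afterwards, which is what produces the elbow shape.
</p>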


In [None]:
# Extract embedding features (excluding ID and text columns)
embedding_column_list = [col for col in tdf_embeddings_store.columns if col not in ["id", "txt"]]

total_withinss_values = []

for num_clusters in range(2, 10):
    # Perform KMeans clustering
    kmeans_out = KMeans(data=tdf_embeddings_store,
                        id_column="id",
                        target_columns=embedding_column_list,
                        num_clusters=num_clusters,
                        num_init=10,
                        iter_max=50
                        )
    result_table = kmeans_out.result
    result_table_df = result_table[['td_modelinfo_kmeans']]
    df = result_table_df.assign(has_name = result_table_df.td_modelinfo_kmeans.str.contains('Total_WithinSS', na = False))
    df1 = df[df.has_name == 1]
    df2 = df1.drop(columns =['has_name'])
    df3 = dict(df2.to_pandas()['td_modelinfo_kmeans'])
    
    # Access the value associated with key 0
    value_str = df3[0]

    # Split the string by ':' and strip any leading/trailing whitespace
    numeric_str = value_str.split(':')[1].strip()
    numeric_value = round(float(numeric_str), 2)
    total_withinss_values.append(numeric_value)
    
    # Display the numeric value for each number of clusters
    print(f'Number of clusters: {num_clusters}, Total_WithinSS: {numeric_value}')

<p style='font-size:16px;font-family:Arial;color:#00233C'>
The resulting plot provides a visual representation of the relationship between the number of clusters and the <b>Total_WithinSS</b>. The "elbow" in the plot will suggest the optimal <code>k</code> for further analysis.
</p>

In [None]:
# Plotting the elbow graph
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(range(2, 10), total_withinss_values, marker='o', linestyle='--', color='b')
plt.xlabel('Number of Clusters')
plt.ylabel('Total Within-Cluster Sum of Squares (Total_WithinSS)')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.grid(True)
plt.show()

<p style='font-size:16px;font-family:Arial;color:#00233C'>
In this step, we calculate the differences between successive <code>Total_WithinSS</code> values (also known as WCSS, the within-cluster sum of squares) and use the second-order differences to identify the <b>Elbow Point</b>, which helps in selecting the optimal number of clusters (<code>k</code>).
</p>
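
<p style='font-size:16px;font-family:Arial;color:#00233C'>
A small worked example may help with the index arithmetic. The sketch below uses made-up WCSS values for k = 2 to 6 and a hypothetical helper <code>elbow_k</code>; it assumes the elbow is the point with the largest second-order difference, which is one common heuristic rather than the only possible rule.
</p>

```python
import numpy as np

def elbow_k(wcss, k_start=2):
    """Return the k at which the WCSS curve bends most sharply."""
    # Second-order differences: wcss[j + 2] - 2 * wcss[j + 1] + wcss[j]
    second_diff = np.diff(wcss, n=2)
    # second_diff[j] is centred on wcss[j + 1], i.e. on k = k_start + j + 1
    return int(np.argmax(second_diff)) + k_start + 1

# Hypothetical values for k = 2, 3, 4, 5, 6
wcss = [100.0, 50.0, 40.0, 35.0, 32.0]
print(elbow_k(wcss))  # 3: large drop before k = 3, small drops after it
```

<p style='font-size:16px;font-family:Arial;color:#00233C'>
The helper needs at least three WCSS values, and on noisy curves it is worth confirming the result against the plot.
</p>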

In [None]:
import numpy as np

# Calculate the differences between successive WCSS values
wcss_diff = np.diff(total_withinss_values)

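# Calculate the second-order differences; the elbow is where the curve bends most sharply
wcss_diff2 = np.diff(wcss_diff)

# wcss_diff2[j] is centred on total_withinss_values[j + 1], which corresponds to k = j + 3
# because the cluster counts start at 2. The largest second difference marks the elbow.
optimal_k_index = np.argmax(wcss_diff2) + 3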

print(f'Optimal number of clusters (K) based on Elbow Method: {optimal_k_index}')

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>4. Run K-Means on the Embeddings Store and then build final table with Cluster ID assignments to rows</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The KMeans() function groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm minimizes the objective function, that is, the total squared Euclidean distance of all data points from the center of their assigned cluster.</p>
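
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition, here is a minimal NumPy sketch of the classic iterative procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute the centroids, and repeat. It is not the implementation behind the in-database <code>KMeans</code> function; the function name <code>kmeans</code>, the random initialization, and the toy data are assumptions for illustration.</p>

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    def assign(current):
        dists = np.linalg.norm(points[:, None, :] - current[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its members (keep it if empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(centroids)

# Two well-separated groups of made-up points
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans(points, k=2)
print(centroids[np.argsort(centroids[:, 0])])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the result depends on the starting centroids, practical implementations usually run several initializations and keep the best one, which is the role of <code>num_init</code> in the cell below.</p>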

In [None]:
# Convert to Teradata-compatible int
optimal_k_index = int(optimal_k_index)

In [None]:
embedding_column_list = [col for col in tdf_embeddings_store.columns if col not in ["id", "txt"]]

num_clusters = optimal_k_index
kmeans_out = KMeans(
    id_column="id",
    data=tdf_embeddings_store,
    target_columns=embedding_column_list,
    num_init=10,
    num_clusters=num_clusters,
    iter_max=50,
)

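<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output below shows the trained model (the cluster centroids and model information such as <code>Total_WithinSS</code>). Per-row cluster assignments are produced by the prediction step that follows.</p>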

In [None]:
# Print out the result
kmeans_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next, we apply <code>KMeansPredict</code> to assign each row to its nearest cluster.</p>

In [None]:
# Applying KMeans Predict
KMeansPredict_out_1 = KMeansPredict(data=tdf_embeddings_store,
                                    object=kmeans_out.result,
                                    #accumulate="ram",
                                    output_distance=False)

In [None]:
# Print the result DataFrames.
KMeansPredict_out_1.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's check how many data points each cluster has.</p>

In [None]:
# Count of rows in each cluster
kmeans_df = KMeansPredict_out_1.result
d2 = kmeans_df.groupby('td_clusterid_kmeans').count()
d2

In [None]:
# Combine embeddings and clusterid
clustered_df = tdf_embeddings_store.join(
    other = kmeans_df,
    on = ["id"],
    how = "inner",
    lprefix="l",
    rprefix="r"
)

clustered_df.shape

In [None]:
# Storing into database
copy_to_sql(df = clustered_df, table_name = "clustered_data", if_exists="replace")

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>5. Visualization</b>

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>5.1 Visualization of Clusters</b></p> 

<p style='font-size:16px;font-family:Arial;color:#00233c'>The graph illustrates the clustering of transactions into distinct groups for credit card fraud detection using anomaly detection. Based on the analysis, the data has been divided into optimal clusters, each representing a unique transaction pattern. This clustering approach helps identify potential fraudulent activities by distinguishing normal and anomalous transaction behaviors, enabling more targeted fraud detection and prevention efforts.</p>


<p style='font-size:16px;font-family:Arial;color:#00233c'>This visualization helps us explore the structure of the clusters and visually identify <b>anomalies</b> (points that deviate significantly from their respective cluster centers). The use of <b>diamonds for anomalies</b> and <b>circles for normal data points</b> provides a clear and intuitive way to distinguish between outliers and inliers in the data, making it easier to detect potential fraudulent or unusual activity.</p>

<p style='font-size:16px;font-family:Arial;color:#00233c'>The interactive nature of the plot allows for an engaging exploration of the data, where users can hover over points to view more detailed information.</p>


In [None]:
def cluster_visualization():
    
    clus = clustered_df.to_pandas()

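    # Perform t-SNE dimensionality reduction on the embedding columns only
    embedding_cols = [col for col in clus.columns if col.startswith('emb_')]
    tsne = TSNE(n_components=2, random_state=123)
    tsne_result = tsne.fit_transform(clus[embedding_cols])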

    # Create DataFrame for visualization
    tsne_df = pd.DataFrame(tsne_result, columns=['tsne_1', 'tsne_2'])
    tsne_df['cluster_id'] = clus['td_clusterid_kmeans']
    tsne_df['customer_id'] = clus['l_id']
    tsne_df['amount'] = clus['txt']  # Rename 'txt' to 'amount'

    # Truncate amount for hover data
    tsne_df['truncated_amount'] = clus['txt'].apply(lambda x: f"{x[:50]}..." if isinstance(x, str) else x)

    # Compute cluster centers
    cluster_centers = tsne_df.groupby('cluster_id')[['tsne_1', 'tsne_2']].mean()

    # Calculate Euclidean distance of each point from its cluster center
    def euclidean_distance(row):
        center = cluster_centers.loc[row['cluster_id']]
        return np.sqrt((row['tsne_1'] - center['tsne_1'])**2 + (row['tsne_2'] - center['tsne_2'])**2)

    tsne_df['distance'] = tsne_df.apply(euclidean_distance, axis=1)

    # Apply anomaly detection for each cluster independently
    tsne_df['is_anomaly'] = tsne_df.groupby('cluster_id')['distance'].transform(lambda x: x > x.quantile(0.95))  # Detect anomalies per cluster

    # Assign marker shapes
    tsne_df['marker_shape'] = tsne_df['is_anomaly'].apply(lambda x: 'diamond' if x else 'circle')

    # Plot using Plotly Express
    fig = px.scatter(
        tsne_df, x='tsne_1', y='tsne_2', 
        color='cluster_id',
        symbol='marker_shape',  # Change shape for anomalies
        hover_data=['customer_id', 'truncated_amount', 'cluster_id']
    )

    fig.update_traces(marker=dict(size=15))
    
    # Remove marker_shape from legend
    fig.for_each_trace(lambda trace: trace.update(showlegend=False) if trace.name in ['circle', 'diamond'] else trace)


    fig.update_layout(
        title='t-SNE Visualization of Clusters with Anomaly Detection',
        xaxis_title='Dimension-1',
        yaxis_title='Dimension-2',
        width=1000,
        height=800,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
        autosize=False
    )

    # Customize hover template
    fig.update_traces(hovertemplate="<b>Customer ID:</b> %{customdata[0]}<br>"
                                     "<b>Amount:</b> %{customdata[1]}<br>"
                                     "<b>Cluster ID:</b> %{customdata[2]}<extra></extra>")

    fig.show()

In [None]:
# Visualization of clusters
cluster_visualization()  

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>6. Real-time Data Fetching</b>

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>6.1 Data Fetching Process</b></p> 

<p style='font-size:16px;font-family:Arial;color:#00233c'>The <code>fetch_data()</code> function allows the user to fetch new credit card transaction records in real-time. By entering <b>'yes'</b>, the system fetches and stores 10 new records, simulating live data collection. Entering <b>'stop'</b> terminates the data fetching process, finalizing and merging all collected records into a dataset.</p>





In [None]:
# Fetch Real time credit card data
tdf = fetch_data()

In [None]:
# Store in database
copy_to_sql(tdf, table_name="credit_card_db", if_exists="replace")

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>7. Create the Embeddings</b>

<p style='font-size:16px;font-family:Arial;color:#00233c'>We now generate and store the embeddings for the newly fetched credit card data. The embeddings are created using a pre-trained model, which transforms the data into vector representations that can be used for anomaly detection.</p>

In [None]:
tdf = DataFrame.from_table('credit_card_db')
tdf.shape

In [None]:
# Generate Embeddings
generate_and_store_embeddings() 
print("All steps completed successfully!")

In [None]:
# View sample data
tdf_embeddings_store.sample(2)

In [None]:
# View shape of dataframe
tdf_embeddings_store.shape

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>8. Applying KMeans Prediction</b>

<p style='font-size:16px;font-family:Arial;color:#00233c'>We apply the <b>KMeans Prediction</b> to predict the cluster assignments for the dataset using the previously trained KMeans model. This allows us to label the data points with the clusters they belong to and further analyze the results.</p>
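
<p style='font-size:16px;font-family:Arial;color:#00233c'>Conceptually, prediction keeps the trained centroids fixed and assigns each new record to the closest one. The NumPy sketch below illustrates that step on made-up values; it is not the <code>KMeansPredict</code> implementation, and the helper name <code>predict_clusters</code> and the data are hypothetical.</p>

```python
import numpy as np

def predict_clusters(new_points, centroids):
    """Assign each new point to its nearest existing centroid."""
    dists = np.linalg.norm(new_points[:, None, :] - centroids[None, :, :], axis=2)
    # The centroids are not updated: new data is only labelled
    return dists.argmin(axis=1), dists.min(axis=1)

# Centroids from an earlier training run (hypothetical values)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
new_points = np.array([[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]])

cluster_ids, distances = predict_clusters(new_points, centroids)
print(cluster_ids)  # [0 1 0]
```

<p style='font-size:16px;font-family:Arial;color:#00233c'>The returned distances are the quantity an anomaly rule would threshold: a new record that is far from every centroid does not resemble any pattern seen during training.</p>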

In [None]:
# Applying KMeansPredict
KMeansPredict_out_1 = KMeansPredict(data=tdf_embeddings_store,
                                    object=kmeans_out.result,
                                    #accumulate="ram",
                                    output_distance=False)

# Print the result DataFrames.
KMeansPredict_out_1.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's check how many data points each cluster has.</p>

In [None]:
# Count of each clusterid
kmeans_df = KMeansPredict_out_1.result
d2 = kmeans_df.groupby('td_clusterid_kmeans').count()
d2

In [None]:
# Combine embeddings and clusterid
clustered_df = tdf_embeddings_store.join(
    other = kmeans_df,
    on = ["id"],
    how = "inner",
    lprefix="l",
    rprefix="r"
)

In [None]:
# Store in database
copy_to_sql(df = clustered_df, table_name = "clustered_data", if_exists="append")

In [None]:
# View updated data
clustered_df = DataFrame.from_table('clustered_data')
clustered_df.shape

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>9. Visualization</b>

<hr style='height:1px;border:none;background-color:#00233C;'> 

<p style='font-size:18px;font-family:Arial;color:#00233c'><b>9.1 Visualization of Clusters</b></p> 

<p style='font-size:16px;font-family:Arial;color:#00233c'>This visualization helps us explore the structure of the clusters and visually identify <b>anomalies</b> (points that deviate significantly from their respective cluster centers). The use of <b>diamonds for anomalies</b> and <b>circles for normal data points</b> provides a clear and intuitive way to distinguish between outliers and inliers in the data, making it easier to detect potential fraudulent or unusual activity.</p>

<p style='font-size:16px;font-family:Arial;color:#00233c'>The interactive nature of the plot allows for an engaging exploration of the data, where users can hover over points to view more detailed information.</p>


In [None]:
# Visualization of clusters
cluster_visualization()  

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this demo, we have seen how to run a Hugging Face embedding model (BAAI/bge-small-en-v1.5) in ONNX format inside the database, in parallel, to create embeddings. We then used K-Means clustering to group the records and flagged anomalies by their distance from the cluster centers. The interactive t-SNE visualization allowed us to explore the clusters, distinguish anomalies using <b>diamond shapes</b>, and analyze the structure of the data more intuitively.</p> 

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>10. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up work tables to prevent errors next time.</p>

In [None]:
tables = ['credit_card_embeddings_store', 'credit_card_db', 'clustered_data']

# Loop through the list of tables and execute the drop table command for each table
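for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # The table may not exist if an earlier step was skipped
        pass

views = ['credit_card_tokenized_for_embeddings','credit_card_embeddings']

for view in views:
    try:
        db_drop_view(view_name=view)
    except Exception:
        # The view may not exist if an earlier step was skipped
        pass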

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_CreditCard_cloud');"        # Takes 10 seconds

In [None]:
remove_context()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>Dataset:</b>

- `SK_ID_CURR`: Customer ID
- `TARGET`: Target (0 = No, 1 = Yes)
- `NAME_CONTRACT_TYPE`: Contract type (e.g., Cash loans, Revolving loans)
- `CODE_GENDER`: Gender (Female / Male)
- `FLAG_OWN_CAR`: Car ownership status (Y = Yes, N = No)
- `CNT_CHILDREN`: Number of children
- `AMT_INCOME_TOTAL`: Total income
- `NAME_FAMILY_STATUS`: Family status (e.g., Married, Separated, Single)
- `REGION_POPULATION_RELATIVE`: Relative population of the region
- `FLAG_MOBIL`: Mobile phone status (1 = Yes)
- `FLAG_EMP_PHONE`: Employment phone status (0 = No, 1 = Yes)
- `CNT_FAM_MEMBERS`: Number of family members
- `HOUSETYPE_MODE`: Type of house (e.g., Block of flats, House, Municipal)
- `OCCUPATION_TYPE`: Occupation (e.g., None, Sales staff, Managers)
- `AGE`: Age of the customer

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
    <li>KMeans reference: <a href = 'https://docs.teradata.com/search/all?query=KMeans&value-filters=vrm_release~%252220.00.00.03%2522&content-lang=en-US'>here</a></li>
    <li>KMeansPredict reference: <a href = 'https://docs.teradata.com/search/all?query=KMeansPredict&value-filters=vrm_release~%252220.00.00.03%2522&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>