<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Vector Embeddings<br>
   <span style="font-size: 20px;">An introduction to generating vector embeddings using Hugging Face models in Vantage</span>
       
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Hugging Face is a French-American company based in New York City that develops tools for building machine learning applications. It is best known for its <b>Transformers library</b>, which provides open-source implementations of transformer models for text, image, video, audio, and time-series tasks. These models include well-known architectures such as BERT and GPT. The library is compatible with the PyTorch, TensorFlow, and JAX deep learning libraries. <br>
    Deep learning models on Hugging Face are pre-trained by individual users, open-source groups, and companies on many types of data: text, audio, images, video, and more. The most popular tool is PyTorch, an open-source Python library that can be used to build a deep learning model from scratch or to take an existing model and retrain or fine-tune it (transfer learning) on a new dataset before publishing it on Hugging Face. Models can be run for inference on CPUs or GPUs; for smaller models, a GPU brings only a slight performance improvement.<br>
</p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Many Hugging Face models are available in the <b>ONNX</b> format, so we can load them with the <b>BYOM</b> (Bring Your Own Model) feature of Vantage and run them inside Vantage with <b>ONNX Runtime</b>. Thanks to the <b>graph optimizations</b> in ONNX Runtime, benchmarks show that inference with <b>ONNX Runtime can be about 20% faster than a native PyTorch model on a CPU</b>. </p>
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Vantage parallelism</b> on top of accelerated ONNX Runtime inference can make a Vantage system as effective as inference on GPUs. On a <b>Vantage system with 72 AMPs</b>, and assuming the table is perfectly distributed, inference will <b>closely match the performance of a dedicated GPU, and the data never moves across the network, which saves time and I/O operations</b>. Because parallelism increases with the number of AMPs, inference over the same amount of text data can complete faster in Teradata Vantage than on a GPU. We can also quantize the model (for example, convert float32 weights to int8 or int4) to make CPU inference even faster, at some cost in accuracy. However, as model size goes up, the GPU advantage widens, for example with an LLM such as Llama 3, where costs also grow disproportionately. For smaller models, we can get comparable performance.
</p>
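<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the quantization trade-off concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. The helper names <code>quantize_int8</code> and <code>dequantize</code> and the weight values are made up for illustration; real ONNX quantization tooling is more involved, so treat this as a sketch of the idea rather than what Vantage or ONNX Runtime does internally.</p>

```python
# Illustrative only: symmetric int8 quantization of a few made-up weights.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.9]  # hypothetical values, not from a real model
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding step loses at most half a quantization step per weight.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 codes:", quantized)
print("scale:", scale)
print("largest rounding error:", max_error)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each weight now fits in one byte instead of four, and the rounding error printed at the end is the accuracy cost mentioned above.</p>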

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Overall flow:</b></p>  

<center><img src="../../../UseCases/Language_Models_InVantage/images/pat1.png" alt="Design pattern 1" width=1200 height=900/></center>

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:20px;font-family:Arial;color:#00233c'>1. Configuring the environment</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>1.1 Install the required libraries</b></p>

In [None]:
%%capture

!pip install tdstone2

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>Please restart the kernel after executing these two lines. The simplest way to restart the kernel is to press zero twice: <b> 0 0</b>
</i>
    <br>You can remove or comment out <b>%%capture</b> if you want to see what <i>!pip install</i> is doing. </p>
</div>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>1.2 Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')
from teradataml import *

import pandas as pd
import numpy as np

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../../../UseCases/startup.ipynb
eng = create_context(host='host.docker.internal', username='demo_user', password=password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=PP_Vectore_Embedding_iVSM.ipynb;' UPDATE FOR SESSION;")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Getting data for this demo</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will generate the required data. The data is organized by typical software problem types, each with some questions that users commonly ask. To keep things simple, we will put the data into a Python dictionary, load it into a pandas DataFrame, and then copy the DataFrame into a table in Vantage.</p> 

In [None]:
# Define the three types of software problems and corresponding questions
problems_data = {
    "Problem_Type": ["Installation Issue", "Performance Issue", "Functionality Issue"],
    "User_Question": [
        [
            "Why can't I install the software on my machine?",
            "What do I do if the installer keeps crashing?",
            "How do I resolve dependency errors during installation?",
            "Why is my antivirus blocking the software installation?"
        ],
        [
            "Why is the software running so slowly?",
            "How do I fix memory issues causing the software to crash?",
            "What can I do if the software takes too long to load?",
            "Why is the CPU usage so high when using the software?"
        ],
        [
            "Why is the 'Save' button not working?",
            "How do I troubleshoot errors when trying to export data?",
            "Why does the software keep freezing when I try to open certain files?",
            "What should I do if features are missing after an update?"
        ]
    ]
}

# Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(problems_data)

# Expanding the dataframe so each row corresponds to one question
expanded_rows = []

for index, row in df.iterrows():
    problem_type = row["Problem_Type"]
    questions = row["User_Question"]
    for question in questions:
        expanded_rows.append({"Problem_Type": problem_type, "User_Question": question})

# Create a new DataFrame with the expanded rows
df = pd.DataFrame(expanded_rows)
df['id'] = df.index
df = df[['id','Problem_Type','User_Question']]
df
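<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The expansion step above turns one row per problem type into one row per question. As a sketch of the same idea without <code>iterrows</code>, the rows can be built with a single comprehension. The names <code>sample</code> and <code>rows</code> are hypothetical, and <code>sample</code> is a shortened stand-in for <code>problems_data</code>.</p>

```python
# A shortened stand-in for problems_data with the same shape.
sample = {
    "Problem_Type": ["Installation Issue", "Performance Issue"],
    "User_Question": [
        [
            "Why can't I install the software on my machine?",
            "What do I do if the installer keeps crashing?",
        ],
        ["Why is the software running so slowly?"],
    ],
}

# Pair each problem type with its question list, then emit one row per
# question, numbering the rows from 0 as the notebook's 'id' column does.
rows = [
    {"id": i, "Problem_Type": problem_type, "User_Question": question}
    for i, (problem_type, question) in enumerate(
        (ptype, q)
        for ptype, questions in zip(sample["Problem_Type"], sample["User_Question"])
        for q in questions
    )
]

for row in rows:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Passing <code>rows</code> to <code>pd.DataFrame</code> would give the same three columns, in the same order, as the DataFrame built above.</p>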

In [None]:
copy_to_sql(df, table_name = 'questions', if_exists = 'replace')

In [None]:
dataset = DataFrame('questions')
dataset

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Data Distribution</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will check the number of AMPs available on our system and then distribute the data equally across all AMPs, so that we use the full power of the system available to us.</p> 

In [None]:
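# HASHAMP() with no argument returns the highest AMP number (zero-based), so add 1 to get the AMP count
nb_amps = execute_sql('SEL HASHAMP()').fetchall()[0][0]+1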
nb_amps

In [None]:
from tdstone2.data_distribution import InverseHash, EquallyDistribute
from tdstone2.dataset_generation import gen_query

In [None]:
df = gen_query(dataset[['User_Question']],n=1)[['User_Question']] # Generate a single partition
df = gen_query(df, n=nb_amps, replication_column = 'Partition_ID')
df

In [None]:
df = EquallyDistribute(df)

In [None]:
df = df.assign(Problem_Type = df.Partition_ID)
df = FillRowId(data=df,
                    row_id_column='Id'
                   ).result
df[['Id','Problem_Type','User_Question']].to_sql(
    table_name='questions_large',
    primary_index = 'Id',
    if_exists = 'replace'
)
dataset_large = DataFrame('questions_large')
dataset_large

In [None]:
from tdstone2.data_distribution import PlotDistribution
PlotDistribution('demo_user', 'questions_large', partition = 'Problem_Type')
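<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Why equal distribution matters can be illustrated with a small simulation: a parallel step finishes only when the busiest AMP finishes, so an uneven spread wastes capacity. The sketch below is plain Python with toy sizes, and the modulo assignment is only a stand-in for Teradata's row hashing, which works differently. The helper names <code>rows_per_amp</code> and <code>skew_ratio</code> are hypothetical.</p>

```python
from collections import Counter


def rows_per_amp(amp_assignments, n_amps):
    """Count rows on each AMP, including AMPs that received none."""
    counts = Counter(amp_assignments)
    return [counts.get(amp, 0) for amp in range(n_amps)]


def skew_ratio(counts):
    """Row count on the busiest AMP divided by the average row count."""
    return max(counts) / (sum(counts) / len(counts))


n_amps, n_rows = 4, 12  # toy sizes, not the demo system's

# Skewed case: the distribution key takes only three distinct values,
# so one of the four AMPs receives no rows at all.
skewed = rows_per_amp([key % 3 for key in range(n_rows)], n_amps)

# Even case: one partition per AMP, each holding the same number of rows.
even = rows_per_amp([key % n_amps for key in range(n_rows)], n_amps)

print("skewed:", skewed, "skew ratio:", skew_ratio(skewed))
print("even:  ", even, "skew ratio:", skew_ratio(even))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A skew ratio of 1.0 means every AMP holds the same share of the work, which is the goal of the distribution steps in this section.</p>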

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Installing the files in Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The commands below create the database and functions required to run text summarization and embedding models from Hugging Face in Vantage.</p>

In [None]:
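import json

# Load the list of setup queries shipped with the use case
with open("../../../UseCases/Language_Models_InVantage/commands.json", "r") as file:
    data = json.load(file)

# Run each setup query; any failure is reported as an already-initialized environment
for item in data["queries"]:
    try:
        print('Executing query: ', item["query"])
        execute_sql(item["query"])
    except Exception as e:
        print("The initialization steps have already been executed for this environment!")
        # Uncomment to see the underlying error:
        # print(f"Error: {e}")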

In [None]:
from tdstone2.tdsgenai import install_model_in_vantage_from_name_for_byom

In [None]:
install_model_in_vantage_from_name_for_byom(
    sequence_length=256,
    model_name = 'BAAI/bge-small-en-v1.5',
    model_task = 'feature-extraction',
    replace = True
)

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Check installed files</b></p>

In [None]:
from tdstone2.tdsgenai import list_installed_files_byom

In [None]:
list_installed_files_byom()

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Compute Vector Embeddings</b></p>

In [None]:
dataset = DataFrame('questions')
dataset

In [None]:
model_id = 'tdstone2_emb_256_BAAI_bge-small-en-v1.5'
get_model_dimension(model_id)

In [None]:
from tdstone2.tdsgenai import compute_vector_embedding_byom

In [None]:
res = compute_vector_embedding_byom(
    # choose your language model
    model              = 'tdstone2_emb_256_BAAI_bge-small-en-v1.5',
    # the description of the dataset
    dataset            = dataset,           # the teradataml DataFrame
    text_column        = 'User_Question',   # the column containing the text
    accumulate_columns = ['id','User_Question'],  # the columns we want to keep in the output results
    # the output table
    schema_name        = 'demo_user',       # the database
    table_name         = 'embeddings_ivsm', # the output table name
    primary_index      = ['id'],            # the primary index columns
    # choose iVSM instead of ONNXEmbeddings
    mldb_function      = 'iVSM'
)

In [None]:
res
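<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each row of the result holds one question together with its embedding vector. A common way to compare two embeddings is cosine similarity. The sketch below uses made-up 3-dimensional vectors to keep the arithmetic readable; the real vectors have as many components as <code>get_model_dimension</code> reports, and the variable names here are hypothetical.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of three questions (not real model output).
install_q1 = [0.9, 0.1, 0.0]
install_q2 = [0.8, 0.2, 0.1]
perf_q = [0.1, 0.9, 0.2]

same_topic = cosine_similarity(install_q1, install_q2)
other_topic = cosine_similarity(install_q1, perf_q)

print("same topic: ", same_topic)
print("other topic:", other_topic)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Vectors that point in similar directions score close to 1, which is how questions about the same kind of problem would be expected to group together.</p>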

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>

In [None]:
tables = ['questions', 'questions_large', 'embeddings_ivsm']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # Ignore tables that do not exist or cannot be dropped
        pass

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>