<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Hugging Face Model Using Script Table Operator (STO)
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Hugging Face is a French-American company based in New York City that develops tools for building machine learning applications. It is best known for its <b>Transformers library</b>, which provides open-source implementations of transformer models for text, image, video, and audio tasks, as well as time series. These models include well-known architectures such as BERT and GPT. The library is compatible with the PyTorch, TensorFlow, and JAX deep learning libraries. <br>
    Deep learning models on Hugging Face are pretrained by individual users, open-source groups, and companies on various types of data, including text (NLP), audio, images, and video. The most popular tool is PyTorch, an open-source Python library used to build a deep learning model from scratch, or to take an existing model and retrain or fine-tune it (transfer learning) on a new dataset before publishing it on Hugging Face. Inference can run on CPUs or GPUs; for smaller models, GPUs offer only a slight performance improvement.<br>
</p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As many Hugging Face models are available in ONNX format, we can load them using the BYOM feature of Vantage and run them in Vantage. Because of graph optimizations in ONNX Runtime, benchmarks indicate that inference with ONNX Runtime can be about 20% faster than with a native PyTorch model on a CPU. Vantage parallelism on top of the faster ONNX Runtime inference can make a Vantage system as effective as inference on GPUs. For example, on a Vantage system with 72 AMPs and a perfectly distributed table, performance can closely match that of a dedicated GPU, and the data never moves across the network, which saves time and I/O operations. Because parallelism increases with the number of AMPs, model inference can complete faster in Teradata Vantage than on a GPU for the same amount of text data. We can also quantize the model (convert float32 weights to int8/int4) so that CPU inference runs even faster, with some trade-off in accuracy. However, as model size goes up, the GPU advantage widens (for example, with an LLM such as Llama 2), although GPU costs rise disproportionately; for smaller models we can get comparable performance.
</p>
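<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the quantization trade-off concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. The weight values and the helper names <code>quantize_int8</code> and <code>dequantize</code> are illustrative assumptions, not part of Vantage, ONNX Runtime, or any Hugging Face API; real toolchains typically quantize per tensor or per channel with calibrated scales.</p>

```python
# Illustrative only: symmetric int8 quantization of a few made-up weights.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # hypothetical float weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(f"scale={scale:.6f}, max_error={max_error:.6f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The restored weights differ slightly from the originals, which is the source of the accuracy trade-off; the integer codes need a quarter of the storage of float32 values.</p>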


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
    This notebook demonstrates how to use a Hugging Face model with the Vantage Script Table Operator (STO) on an on-premises Enterprise edition of Vantage.
</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>1. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
%%capture
!pip install tdstone2==0.1.3.13

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The install command above is needed only if you run the notebook on a platform other than ClearScape Analytics Experience that does not have the libraries installed. After running it, be sure to restart the kernel to bring the installed libraries into memory. The simplest way to restart the kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')


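from teradataml import (create_context, execute_sql, copy_to_sql, DataFrame, remove_context)
# teradataml display options object; without this import, `display` is not the teradataml setting
from teradataml.options.display import display
import tdstone2

# Modify the following to match the specific client environment settings
display.max_rows = 5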


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_HuggingFace_model_STO_Python.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>2. Getting Data for This Demo</b>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will generate the required data. The data relates to software issues and the questions users might ask about them. To keep the process simple, we generate the data, load it into a pandas DataFrame, and then copy it into Vantage.</p> 

In [None]:
import pandas as pd
import numpy as np

# Define the three types of software problems and corresponding questions
problems_data = {
    "Problem_Type": ["Installation Issue", "Performance Issue", "Functionality Issue"],
    "User_Question": [
        [
            "Why can't I install the software on my machine?",
            "What do I do if the installer keeps crashing?",
            "How do I resolve dependency errors during installation?",
            "Why is my antivirus blocking the software installation?"
        ],
        [
            "Why is the software running so slowly?",
            "How do I fix memory issues causing the software to crash?",
            "What can I do if the software takes too long to load?",
            "Why is the CPU usage so high when using the software?"
        ],
        [
            "Why is the 'Save' button not working?",
            "How do I troubleshoot errors when trying to export data?",
            "Why does the software keep freezing when I try to open certain files?",
            "What should I do if features are missing after an update?"
        ]
    ]
}

# Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(problems_data)

# Expanding the dataframe so each row corresponds to one question
expanded_rows = []

for index, row in df.iterrows():
    problem_type = row["Problem_Type"]
    questions = row["User_Question"]
    for question in questions:
        expanded_rows.append({"Problem_Type": problem_type, "User_Question": question})

# Create a new DataFrame with the expanded rows
df = pd.DataFrame(expanded_rows)

df
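<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The row-by-row loop above can also be expressed with the pandas <code>DataFrame.explode</code> method. The following is a minimal sketch on a smaller, made-up frame; the names <code>nested</code> and <code>flat</code> and the sample questions are illustrative only.</p>

```python
import pandas as pd

# A small nested frame: one row per problem type, with a list of questions
nested = pd.DataFrame({
    "Problem_Type": ["Installation Issue", "Performance Issue"],
    "User_Question": [
        ["Why can't I install the software?", "Why does the installer crash?"],
        ["Why is the software slow?"],
    ],
})

# explode() emits one row per list element and repeats the other columns
flat = nested.explode("User_Question").reset_index(drop=True)
print(flat)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Both approaches produce one row per question; <code>explode</code> simply avoids the explicit Python loop.</p>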

In [None]:
copy_to_sql(df,table_name = 'questions', if_exists = 'replace')
dataset = DataFrame('questions')
dataset

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Hugging Face model usage with Vantage STO</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the tdstone2 package to install the generic Python files that enable ML training, scoring, feature engineering, and vector embedding computations. These files are installed once and enable user-friendly interactions with the platform, such as the vector embedding shown in the following steps.</p> 

In [None]:
from tdstone2.tdstone import TDStone
sto = TDStone(schema_name = 'demo_user', SEARCHUIFDBPATH = 'demo_user')
sto.setup()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will install the necessary libraries into the STO environment of Vantage. <code>PushFile</code> installs the Python files, including the file that implements the vector embedding.</p> 

In [None]:
sto.PushFile()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will then install the required models into Vantage. Here, we are using the <code>sentence-transformers/paraphrase-MiniLM-L6-v2</code> and <code>prajjwal1/bert-mini</code> models.</p> 

In [None]:
from tdstone2.tdsgenai import install_model_in_vantage_from_name

In [None]:
install_model_in_vantage_from_name(model_name = 'sentence-transformers/paraphrase-MiniLM-L6-v2', model_task = 'sentence-similarity')

In [None]:
install_model_in_vantage_from_name(model_name = 'prajjwal1/bert-mini')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will check the installed models using the list function. As we can see, the two models we installed above are now available in the Vantage environment.</p> 

In [None]:
from tdstone2.tdsgenai import list_installed_files
list_installed_files()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will specify the model and the dataset to be used for computing the vector embeddings.</p> 

In [None]:
model = 'tdstone2_emb_512_sentence-transformers_paraphrase-MiniLM-L6-v2.zip'
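<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The archive name above appears to combine a fixed prefix with the Hugging Face model id, where <code>/</code> is replaced by <code>_</code>. This pattern is inferred only from the names used in this notebook and is not confirmed tdstone2 behavior; the helper <code>guess_archive_name</code> and its default prefix are hypothetical. Always confirm the real file name against the output of <code>list_installed_files()</code>.</p>

```python
def guess_archive_name(model_name, prefix="tdstone2_emb_512_"):
    """Guess the installed archive name for a model id.

    Assumption: the name is prefix + model id with '/' replaced by '_' + '.zip',
    as observed for the single example in this notebook.
    """
    return f"{prefix}{model_name.replace('/', '_')}.zip"


guessed = guess_archive_name("sentence-transformers/paraphrase-MiniLM-L6-v2")
print(guessed)
```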

In [None]:
dataset = DataFrame('questions')
dataset

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will specify the details of the dataset: the text column to be used, the hash columns that act as the primary index (PI) and drive the parallel processing, and the columns that need to be retained in the output.</p> 

In [None]:
from tdstone2.tdsgenai import compute_vector_embedding
schema_name        = 'demo_user'
table_name         = 'embeddings'

text_column        = 'User_Question'
hash_columns       = ['Problem_Type']
accumulate_columns = ['Problem_Type']
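<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The choice of hash columns determines how evenly rows are spread across parallel workers. The sketch below illustrates the idea with an MD5 hash and eight hypothetical partitions; it does not reproduce the Teradata hashing algorithm or AMP assignment, and <code>partition_for</code> is an illustrative helper. Hashing on a column with only three distinct values can occupy at most three partitions, however many are available.</p>

```python
import hashlib
from collections import Counter


def partition_for(values, num_partitions):
    """Assign a tuple of column values to a partition via a stable hash."""
    key = "|".join(str(v) for v in values).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % num_partitions


problem_types = ["Installation Issue", "Performance Issue", "Functionality Issue"]
# Twelve placeholder rows: four questions per problem type
rows = [(ptype, f"question {i}") for ptype in problem_types for i in range(4)]

num_partitions = 8  # hypothetical number of parallel workers

# Hash on Problem_Type only: rows of the same type always land together
by_type = Counter(partition_for((ptype,), num_partitions) for ptype, _ in rows)
# Hash on both columns: each distinct row is placed independently
by_row = Counter(partition_for(row, num_partitions) for row in rows)

print("partitions used, Problem_Type only:", len(by_type))
print("partitions used, both columns:", len(by_row))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For a small demo table this makes little difference, but on large tables a low-cardinality hash column can leave most workers idle.</p>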

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will then compute the embeddings using the <code>compute_vector_embedding</code> function, which applies the model to the dataset. The steps involved in the process are displayed as part of the output. The function sets up and executes a script for the given model and dataset, after ensuring that the text column is VARCHAR and that the model exists. Finally, it creates a pivot view of the results.</p> 

In [None]:
res = compute_vector_embedding(model, dataset, schema_name, table_name, text_column, hash_columns, accumulate_columns)

In [None]:
res
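<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of what a pivot of embedding results involves, the sketch below turns long-format rows into one vector per key. It assumes the raw results hold one row per embedding dimension as (key, dimension index, value); that layout, the sample values, and the helper <code>pivot_embeddings</code> are assumptions for illustration and are not taken from tdstone2.</p>

```python
from collections import defaultdict

# Hypothetical long-format results: (row key, dimension index, value)
long_rows = [
    ("q1", 0, 0.11), ("q1", 1, -0.42), ("q1", 2, 0.30),
    ("q2", 0, 0.05), ("q2", 1, 0.27), ("q2", 2, -0.19),
]


def pivot_embeddings(rows):
    """Collect the values of each key into a list ordered by dimension index."""
    by_key = defaultdict(dict)
    for key, dim, value in rows:
        by_key[key][dim] = value
    return {key: [dims[i] for i in sorted(dims)] for key, dims in by_key.items()}


wide = pivot_embeddings(long_rows)
print(wide)
```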

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will then execute the <code>run_tds_vector_embedding_script_locally</code> function. It runs the 'tds_vector_embedding.py' script in the data module of the 'tdstone2' package, passing a dataframe via stdin along with the required arguments.</p> 

In [None]:
from tdstone2.tdsgenai import run_tds_vector_embedding_script_locally 
import pandas as pd

In [None]:
# Define the three types of software problems and corresponding questions
problems_data = {
    "Problem_Type": ["Installation Issue", "Performance Issue", "Functionality Issue"],
    "User_Question": [
        [
            "Why can't I install the software on my machine?",
            "What do I do if the installer keeps crashing?",
            "How do I resolve dependency errors during installation?",
            "Why is my antivirus blocking the software installation?"
        ],
        [
            "Why is the software running so slowly?",
            "How do I fix memory issues causing the software to crash?",
            "What can I do if the software takes too long to load?",
            "Why is the CPU usage so high when using the software?"
        ],
        [
            "Why is the 'Save' button not working?",
            "How do I troubleshoot errors when trying to export data?",
            "Why does the software keep freezing when I try to open certain files?",
            "What should I do if features are missing after an update?"
        ]
    ]
}

# Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(problems_data)

# Expanding the dataframe so each row corresponds to one question
expanded_rows = []

for index, row in df.iterrows():
    problem_type = row["Problem_Type"]
    questions = row["User_Question"]
    for question in questions:
        expanded_rows.append({"Problem_Type": problem_type, "User_Question": question})

# Create a new DataFrame with the expanded rows
df = pd.DataFrame(expanded_rows)
df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will execute the function for both models. </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The inputs to the function are as below:</p> 
<ul>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'>df (pd.DataFrame): The dataframe to process.</li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'>zip_file_path (str): The path to the zip file.</li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'>text_column (int): The index of the text column in the dataframe. The calls in this notebook pass the column name instead.</li>
    <li style = 'font-size:16px;font-family:Arial;color:#00233C'>accumulate_columns (list): The list of column indexes to accumulate and print. The calls in this notebook pass column names instead.</li>
</ul>
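<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The parameter descriptions mention column indexes, while this notebook passes column names. Whether the function accepts names, indexes, or both is not confirmed here. If indexes turn out to be required, they can be derived from the column order, as in this minimal sketch; <code>column_index</code> is an illustrative helper, not part of tdstone2.</p>

```python
def column_index(columns, name):
    """Return the zero-based position of `name` within `columns`."""
    return list(columns).index(name)


# Same column order as the expanded dataframe built in this notebook
columns = ["Problem_Type", "User_Question"]

text_index = column_index(columns, "User_Question")
accumulate_indexes = [column_index(columns, c) for c in ["Problem_Type"]]
print(text_index, accumulate_indexes)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With a pandas DataFrame, the same lookup can be done by passing <code>df.columns</code> as the first argument.</p>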

In [None]:
# Define the arguments
zip_file_path      = "models/sentence-transformers_paraphrase-MiniLM-L6-v2/"  # Replace with actual path
text_column        = 'User_Question'  # Text column to embed (passed by name here)
accumulate_columns = ['Problem_Type']  # Other columns to carry through to the output

# Run the script
output = run_tds_vector_embedding_script_locally(df, zip_file_path, text_column, accumulate_columns)
output

In [None]:
# Define the arguments
zip_file_path      = "models/prajjwal1_bert-mini/"  # Replace with actual path
text_column        = 'User_Question'  # Text column to embed (passed by name here)
accumulate_columns = ['Problem_Type']  # Other columns to carry through to the output

# Run the script
output = run_tds_vector_embedding_script_locally(df, zip_file_path, text_column, accumulate_columns)
output

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will then compute the embeddings directly with <code>SentenceTransformer</code> to verify the embeddings produced above.</p> 

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('prajjwal1/bert-mini')
model.encode(output.index[0][1])

In [None]:
model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')
model.encode(output.index[0][1])
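<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Rather than comparing the printed vectors by eye, the check can be made explicit with an element-wise tolerance comparison. The sketch below uses short made-up vectors; the helper <code>embeddings_match</code> and the tolerance value are assumptions, and small numerical differences between runtimes are expected.</p>

```python
import math


def embeddings_match(a, b, abs_tol=1e-4):
    """Return True if both vectors have equal length and close values."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b))


# Hypothetical vectors standing in for the script output and model.encode(...)
from_script = [0.1, 0.2, -0.3]
from_library = [0.10001, 0.19999, -0.3]

print(embeddings_match(from_script, from_library))
```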

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can use these functions and this process for business use cases such as similarity search, clustering, or topic modeling.</p> 
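<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As an example of the similarity use case, the sketch below ranks texts by cosine similarity to a query. The three-dimensional vectors are made up for illustration and are not real model output; <code>cosine_similarity</code> is a plain Python helper written for this sketch.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings keyed by text
embeddings = {
    "installer keeps crashing": [0.9, 0.1, 0.0],
    "cannot install software": [0.8, 0.2, 0.1],
    "software runs slowly": [0.1, 0.9, 0.2],
}

query = "installer keeps crashing"
# Rank the other texts from most to least similar to the query
ranked = sorted(
    (text for text in embeddings if text != query),
    key=lambda text: cosine_similarity(embeddings[query], embeddings[text]),
    reverse=True,
)
print(ranked)
```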

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>

In [None]:
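# These helpers must be imported; otherwise the try/except below hides a NameError and nothing is dropped
from teradataml import db_drop_table, db_drop_view

tables = ['questions', 'T_embeddings','TDS_MODEL_REPOSITORY','TDS_MAPPER_REPOSITORY','TDS_HYPER_MODEL_REPOSITORY',
          'TDS_FEATURE_ENGINEERING_PROCESS_REPOSITORY','TDS_CODE_REPOSITORY']

# Loop through the list of tables and drop each one, skipping any that do not exist
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        pass

# Drop the 'embeddings' view, if it exists
try:
    db_drop_view('embeddings')
except Exception:
    pass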

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>