<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       RAG solution with Vantage catalogue and AWS Bedrock integration
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this demo we will explore how to extract information from the metadata tables of Teradata using embeddings and vector-database-style indexing in Vantage, and then query an LLM with that context and a prompt to get the details.</p>

<p style = 'font-size:16px;font-family:Arial'>We use the ONNXEmbeddings function to create embeddings with Hugging Face PyTorch models that are stored in the database using the BYOM (Bring Your Own Model) functionality.</p>

<p style = 'font-size:16px;font-family:Arial'>Teradata integrates with LLM services such as Amazon Bedrock, and the emerging Open Analytics Framework in VantageCloud Lake also lets you host a language model yourself.</p>

<p style = 'font-size:16px;font-family:Arial'>LLMs are a key artificial intelligence (AI) technology powering intelligent chatbots and other natural language processing (NLP) applications. The goal is to create bots that can answer user questions in various contexts by cross-referencing authoritative knowledge sources. Unfortunately, the nature of LLM technology introduces unpredictability in LLM responses. Additionally, LLM training data is static and introduces a cut-off date on the knowledge it has.</p>

<p style = 'font-size:16px;font-family:Arial'>Known challenges of LLMs include:</p>

<ul style = 'font-size:16px;font-family:Arial'>
<li>Presenting false information when it does not have the answer.</li>
<li>Presenting out-of-date or generic information when the user expects a specific, current response.</li>
<li>Creating a response from non-authoritative sources.</li>
<li>Creating inaccurate responses due to terminology confusion, wherein different training sources use the same terminology to talk about different things.</li>
</ul>
<p style = 'font-size:16px;font-family:Arial'>RAG is one approach to solving some of these challenges. It redirects the LLM to retrieve relevant information from authoritative, pre-determined knowledge sources. Organizations have greater control over the generated text output, and users gain insights into how the LLM generates the response.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we will create a catalogue from the tables in the database and use an LLM to answer prompts about that catalogue.</p>

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>1. Configuring the environment</b>

<p style = 'font-size:18px;font-family:Arial'><b>1.1 Install the required libraries</b></p>

In [None]:
%%capture
!pip install langchain_community pypdf
!pip install boto3 awscli
!pip install pyopenssl --upgrade --force-reinstall
!pip install -U pandas==2.1.3
!pip install langchain
!pip install langchain_text_splitters

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>Please restart the kernel after executing the cell above. The simplest way to restart the kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.2 Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
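import os
import getpass
import warnings

import boto3
import pandas as pd
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from teradataml import *

warnings.filterwarnings('ignore')

# Database where the BYOM functions are installed
configure.byom_install_location = "mldb"

# Limit the rows shown when a teradataml DataFrame is displayed.
# This must run before IPython's display is imported below, because that
# import shadows the teradataml display options object of the same name.
display.max_rows = 5

from IPython.display import display, Markdown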


<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1.3. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host='host.docker.internal', username='demo_user', password=password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=Language_Model_RAG_Catalogue_Python.ipynb;' UPDATE FOR SESSION;")

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Confirmation for Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Before starting, let us confirm that the required embedding model and its tokenizer are installed.</p>
 

In [None]:
model_name = "bge-small-en-v1.5"

In [None]:
from IPython.display import display, Markdown

df_check= DataFrame.from_query(f'''select (select 1 as cnt from embeddings_models where model_id = '{model_name}') +
(select 1 as cnt from embeddings_tokenizers where model_id =  '{model_name}') as cnt''')
if df_check.get_values()[0][0] == 2:
    print('Model is installed, please continue.')
else:
    print('Model is not installed, please go to the Initialization notebook before proceeding further')
    display(Markdown("[Initialization Notebook](./Initialization_and_Model_Load.ipynb)"))

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>3. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. Because this demo uses embeddings that are stored in Vantage, we only use the option of creating the tables locally.</p>


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_SLMRAG_Catalogue_local');"
 # Takes about 2 minutes 

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>4. Create td catalogue using the tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here we will create a catalogue from the tables which are present in the database.</p>
<p style = 'font-size:16px;font-family:Arial'>To create the catalogue, we will first create tables related to different use cases. We create these tables here only to showcase the approach. In a production system, the catalogue would be built on the schemas and tables that already exist.</p>
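<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of what each catalogue row will contain, the sketch below builds the same "Database: ..., Table: ..., Columns: ..." text in plain Python. The sample rows and the helper name <b>build_catalogue_entries</b> are hypothetical; in the demo itself the rows come from dbc.columnsV and the aggregation runs inside the database.</p>

```python
from collections import defaultdict

def build_catalogue_entries(rows):
    """Group (database, table, column_id, column_name) rows into one text line per table."""
    columns = defaultdict(list)
    for database, table, column_id, column_name in rows:
        columns[(database, table)].append((column_id, column_name.strip()))
    entries = []
    for (database, table), cols in sorted(columns.items()):
        # Order the column names by column id, as the SQL does with ORDER BY ColumnId
        names = ", ".join(name for _, name in sorted(cols))
        entries.append(f"Database: {database}, Table: {table}, Columns: {names}")
    return entries

# Hypothetical sample rows, deliberately out of order and with a padded name
sample_rows = [
    ("DEMO_Sales", "orders", 1026, "order_date"),
    ("DEMO_Sales", "orders", 1025, "order_id "),
    ("DEMO_Bank", "customers", 1025, "customer_id"),
]
catalogue_entries = build_catalogue_entries(sample_rows)
for entry in catalogue_entries:
    print(entry)
```

<p style = 'font-size:16px;font-family:Arial'>Keeping one line per table means each embedding later represents exactly one table and its columns.</p>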

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_BankChurn_cloud');"
%run -i ../run_procedure.py "call get_data('DEMO_UAF_cloud');"
%run -i ../run_procedure.py "call get_data('DEMO_TelcoNetwork_cloud');"
%run -i ../run_procedure.py "call get_data('DEMO_5G_cloud');"
%run -i ../run_procedure.py "call get_data('DEMO_SalesForecasting_cloud');"

In [None]:
qry = """CREATE TABLE td_catalog_for_rag
    as
    (
        SELECT 
            sum(1) over( rows unbounded preceding ) as id,
            schema as txt
        FROM(
            SELECT
                'Database: ' || DATABASENAME ||', Table: ' || TABLENAME || ', Columns: ' || TRIM(TRAILING ',' 
                 FROM (XMLAGG(TRIM(ColumnName) || ', ' ORDER BY ColumnId)(VARCHAR(10000)))) AS Schema
            FROM dbc.columnsV
            WHERE TableName NOT LIKE 'ml__%' 
            and DataBasename like any ('DBC','Demo%','mldb')
            GROUP BY DATABASENAME, TABLENAME 
        ) as x
    ) with data
;"""

try:
    execute_sql(qry)
    print('Table Created')
except Exception:
    # Most likely the table already exists from an earlier run: drop and recreate it
    db_drop_table('td_catalog_for_rag')
    execute_sql(qry)
    print('Table Created')

In [None]:
df = DataFrame('td_catalog_for_rag')
df

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Generate embeddings on the TD catalogue</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will create embeddings for the catalogue.</p>

<p style = 'font-size:16px;font-family:Arial;'>
Now it's time to generate the embeddings using <b>ONNXEmbeddings</b>.<br>We run the ONNXEmbeddings function to generate embeddings for a small subset of records. The model is <b>loaded into the cache memory on each node</b>, and Teradata's <b>Massively Parallel Processing (MPP)</b> architecture ensures that embeddings are computed in parallel using <b>ONNX Runtime</b> on each node.<br>That said, generating embeddings for the entire catalogue can be time-consuming, especially on a system with limited resources. In the <b>ClearScape Analytics Experience</b>, only a <b>4 AMP system</b> with constrained RAM and CPU power is available. To ensure smooth execution, we test embedding generation on a small sample and use <b>pre-calculated embeddings</b> for the remainder of the demo. In a real-life scenario you would typically encounter several hundred AMPs with much more compute power!<br>The most important input parameters of the <b>ONNXEmbeddings</b> function are:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
<li><b>InputTable</b>: The source table containing the text to be embedded. </li>
<li><b>ModelTable</b>: The table storing the ONNX model.                    </li>
<li><b>TokenizerTable</b>: The table storing the tokenizer JSON file.       </li>
<li><b>Accumulate</b>: Specifies additional columns to retain in the output. </li>
<li><b>OutputFormat</b>: Specifies the data format of the output embeddings (<b>FLOAT32(384)</b>, matching the model's output dimension).</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
Since embedding generation is computationally expensive, we only process <b>100 records for testing</b> and rely on precomputed embeddings for further analysis.  
</p>
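<p style = 'font-size:16px;font-family:Arial;'>To make the OutputFormat argument concrete: the output has one column per embedding dimension, and the search step later in this demo addresses those columns as the range emb_0 to emb_383. The sketch below derives both strings from a single dimension value so they cannot drift apart. The helper names are hypothetical, and the <b>emb_</b> prefix is assumed from the column range used in the search query.</p>

```python
def output_format(dimensions):
    """Format string for the embedding output, e.g. FLOAT32(384)."""
    return f"FLOAT32({dimensions})"

def feature_column_range(dimensions, prefix="emb_"):
    """Column range covering every embedding dimension, first to last."""
    return f"[{prefix}0:{prefix}{dimensions - 1}]"

dims = 384
fmt = output_format(dims)
col_range = feature_column_range(dims)
print(fmt, col_range)
```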

In [None]:
my_model = DataFrame.from_query(f"select * from embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'")

In [None]:
df_sample = df.iloc[:100, :]

In [None]:
number_dimensions_output = 384

In [None]:
DF_embeddings_sample = ONNXEmbeddings(
    newdata = df_sample,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
DF_embeddings_sample

<p style = 'font-size:16px;font-family:Arial;'> Here we can see how the embeddings are generated for the catalogue data. For further analysis we will use the precomputed embeddings.</p>

In [None]:
df_emb = DataFrame(in_schema("DEMO_SLMRAG_Catalogue","Catalogue_Embedding_Data"))

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. Insert Prompts into a Table</b></p>

<p style = 'font-size:16px;font-family:Arial'>We will create the required table and then insert the different prompts into it.</p>

In [None]:
qry = '''CREATE MULTISET TABLE rag_topics_of_interest(
      txt VARCHAR(1024) CHARACTER SET UNICODE NOT CASESPECIFIC,
      id INT) NO PRIMARY INDEX''' ;
try:
    execute_sql(qry)
except Exception:
    # Most likely the table already exists from an earlier run: drop and recreate it
    db_drop_table('rag_topics_of_interest')
    execute_sql(qry)

<p style = 'font-size:16px;font-family:Arial'>We will create prompts for different questions that can be answered from the catalogue. Below are some sample questions that can be asked.</p>
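<p style = 'font-size:16px;font-family:Arial'>The INSERT statements in this step are built with string formatting, so a prompt that contains a single quote would end the SQL string literal early. A minimal sketch of one way to guard against that, by doubling embedded single quotes as standard SQL expects; <b>sql_literal</b> and <b>build_insert</b> are hypothetical helpers and not part of teradataml.</p>

```python
def sql_literal(text):
    """Quote a Python string as a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"

def build_insert(table, text, row_id):
    """Build an INSERT statement for a (txt, id) row."""
    return f"INSERT INTO {table} VALUES ({sql_literal(text)}, {int(row_id)});"

# Hypothetical prompt containing an apostrophe
statement = build_insert("rag_topics_of_interest", "What's in DBQL?", 8)
print(statement)
```

<p style = 'font-size:16px;font-family:Arial'>The sample prompts in this notebook contain no single quotes, so this only matters when you add your own questions.</p>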

In [None]:
prompts = ["I have to demo Teradata features to a Telco network for use case they have. What database or tables \
 can I use to prepare for my demo so the presentation is relevant ? Can you write some Teradata SQL queries around these \
 tables. Make sure you prefix the relevant database name in front of the tables in the queries",
           "I have to demo Teradata features to a Bank customer churn for some use case they have. What database or tables \
 can I use to prepare for my demo so the presentation is relevant ? Can you write some Teradata SQL queries around these \
 tables. Make sure you prefix the relevant database name in front of the tables in the queries",
           "I have to demo Teradata features for Sales Forecasting for some use case they have. What database or tables \
 can I use to prepare for my demo so the presentation is relevant ? Can you write some Teradata SQL queries around these \
 tables. Make sure you prefix the relevant database name in front of the tables in the queries",
           "What logging tables are available in Teradata to check AWT usage ?",
           "What logging tables are available in Teradata to check CPU usage ?",
           "What metadata tables are available in Teradata to DBQL Details ?",
           "What BYOM functions are available in Teradata?"]

for idx, prompt in enumerate(prompts, start=1):
    execute_sql(f'''INSERT into rag_topics_of_interest values ('{prompt}', {idx});''')
    # print(f'''INSERT into rag_topics_of_interest values ('{prompt}', {idx});''')

In [None]:
df = DataFrame('rag_topics_of_interest')
df

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Generate Embeddings from the Prompts</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will create embeddings for the prompts which we have inserted into the table above.</p>

In [None]:
DF_rag_embeddings = ONNXEmbeddings(
    newdata = DataFrame('rag_topics_of_interest'),
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
copy_to_sql(DF_rag_embeddings,table_name='rag_topics_embeddings_store', if_exists='replace')

In [None]:
df = DataFrame('rag_topics_embeddings_store')
df

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>8. Find top 10 matching chunks</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will find the 10 catalogue entries that best match each prompt using TD_VectorDistance. The function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs. If only one table is provided as input, the function computes the distances between pairs of vectors from that same table. The feature columns must be in the same order in the TargetFeatureColumns and RefFeatureColumns arguments. Feature values that are NULL, NaN, or INF are ignored during distance computation. Because we use the cosine distance measure, we report the similarity as 1 - distance.</p>
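<p style = 'font-size:16px;font-family:Arial'>To show the idea behind the search on a small scale, here is a plain-Python sketch of cosine distance and a top-k selection. It is an illustration of the technique and not how TD_VectorDistance is implemented; the toy vectors and the function names are hypothetical.</p>

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(target, references, k):
    """Return the k (reference_id, distance) pairs closest to the target."""
    scored = [(ref_id, cosine_distance(target, vec)) for ref_id, vec in references.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 3-dimensional vectors (hypothetical); the demo uses 384 dimensions
references = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0], 3: [1.0, 1.0, 0.0]}
nearest = top_k([1.0, 0.1, 0.0], references, k=2)
for ref_id, distance in nearest:
    # Similarity is reported as 1 - distance, as in the query for this step
    print(ref_id, round(1.0 - distance, 3))
```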

In [None]:
qry="""create multiset table rag_semantic_search_results
as (
SELECT 
    dt.target_id,
    dt.reference_id,
    e_tgt.txt as target_txt,
    e_ref.txt as reference_txt,
    (1.0 - dt.distance) as similarity 
FROM
    TD_VECTORDISTANCE (
        ON rag_topics_embeddings_store AS TargetTable
        ON DEMO_SLMRAG_Catalogue.Catalogue_Embedding_Data AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('id')
            TargetFeatureColumns('[emb_0:emb_383]')
            RefIDColumn('id')
            RefFeatureColumns('[emb_0:emb_383]')
            DistanceMeasure('cosine')
            topk(10)
    ) AS dt
JOIN rag_topics_embeddings_store e_tgt on e_tgt.id = dt.target_id
JOIN DEMO_SLMRAG_Catalogue.Catalogue_Embedding_Data e_ref on e_ref.id = dt.reference_id
) with data;"""

try:
    execute_sql(qry)
    print('Table Created')
except Exception:
    # Most likely the table already exists from an earlier run: drop and recreate it
    db_drop_table('rag_semantic_search_results')
    execute_sql(qry)
    print('Table Created')

In [None]:
df = DataFrame('rag_semantic_search_results').to_pandas()
df

<hr style="height:2px;border:none;">
<a id="rule"></a>
<p style = 'font-size:20px;font-family:Arial'><b>9. Create Context and Prompt for LLM</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will create context and prepare instructions and prompt to the LLM.</p>
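<p style = 'font-size:16px;font-family:Arial'>One point to keep in mind: the search results table holds matches for every stored prompt, so the context can be narrowed to the rows retrieved for the question being asked. The sketch below shows that idea with plain Python records standing in for the results DataFrame; <b>build_llm_query</b>, the prompt wording, and the sample rows are hypothetical.</p>

```python
def build_llm_query(question, results):
    """Assemble an LLM prompt from the reference texts retrieved for one question."""
    context_lines = [
        row["reference_txt"] for row in results if row["target_txt"] == question
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Do not give information that is not in the context."
    )

# Hypothetical search results for two different questions
results = [
    {"target_txt": "Q1", "reference_txt": "Database: A, Table: t1, Columns: c1"},
    {"target_txt": "Q2", "reference_txt": "Database: B, Table: t2, Columns: c2"},
]
query_text = build_llm_query("Q1", results)
print(query_text)
```

<p style = 'font-size:16px;font-family:Arial'>Separating the context from the question with explicit line breaks also makes it easier for the model to tell the two apart.</p>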

In [None]:
prompt = ["I have to demo Teradata features to a Telco network for use case they have. What database or tables \
 can I use to prepare for my demo so the presentation is relevant ? Can you write some Teradata SQL queries around these \
 tables. Make sure you prefix the relevant database name in front of the tables in the queries"]

<p style = 'font-size:16px;font-family:Arial'>Below are the other options available.</p>
<ul style = 'font-size:16px;font-family:Arial'>
<li> prompt = ["I have to demo Teradata features to a Bank customer churn for some use case they have. What database or tables can I use to prepare for my demo so the presentation is relevant ? Can you write some Teradata SQL queries around these tables. Make sure you prefix the relevant database name in front of the tables in the queries"]</li>
<li> prompt = ["I have to demo Teradata features for Sales Forecasting for some use case they have. What database or tables can I use to prepare for my demo so the presentation is relevant ? Can you write some Teradata SQL queries around these tables. Make sure you prefix the relevant database name in front of the tables in the queries"]</li>
<li> prompt = ["What logging tables are available in Teradata to check AWT usage ?"]</li>
<li> prompt = ["What logging tables are available in Teradata to check CPU usage ?"]</li>
<li> prompt = ["What metadata tables are available in Teradata to DBQL Details ?"]</li>
<li> prompt = ["What BYOM functions are available in Teradata?"]</li>
</ul>


In [None]:
context = str.join('\n',df['reference_txt'].to_list())

In [None]:
# Separate the instruction, the context, and the question with line breaks
llm_query = "Answer the question based only on the following context:\n" + context + \
"\n\nAnswer the question based on the above context: " + prompt[0] + \
""" 
Provide a detailed answer.
Don’t justify your answers.
Don’t give information not mentioned in the CONTEXT INFORMATION.
Do not say "according to the context" or "mentioned in the context" or similar.
"""


<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>10. Configuring AWS CLI and Initialize Bedrock Model</b>
<p style = 'font-size:16px;font-family:Arial'>The following cell will prompt us for this information:</p>
<ol style = 'font-size:16px;font-family:Arial'>
<li><b>aws_access_key_id</b>: Enter your AWS access key ID</li>
<li><b>aws_secret_access_key</b>: Enter your AWS secret access key</li>
<li><b>region name</b>: Enter the AWS region you want to configure (e.g., us-east-1)</li>
<li><b>aws_session_token</b>: Enter your AWS session token if you have one; otherwise leave it blank</li>
</ol>

In [None]:
import boto3
from botocore.exceptions import ClientError

# Prompt for credentials securely
aws_access_key_id = getpass.getpass("Enter AWS Access Key ID: ")
aws_secret_access_key = getpass.getpass("Enter AWS Secret Access Key: ")
region_name = getpass.getpass("Enter AWS Region (e.g., us-east-1): ")
aws_session_token = getpass.getpass("Enter AWS Session Token (if any, else leave blank): ")

<b style = 'font-size:18px;font-family:Arial'>Initialize the Bedrock Model</b>
<ul style = 'font-size:16px;font-family:Arial'>
<li>The code below initializes a Boto3 client for the “bedrock-runtime” service.</li>
<li>We provide the model to be used.</li>
</ul>

In [None]:
# Create a Bedrock Runtime client with user-provided credentials
client = boto3.client(
    "bedrock-runtime",
    region_name=region_name,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    aws_session_token=aws_session_token
)


# Set the model ID
model_id = "mistral.mistral-7b-instruct-v0:2"

print("Bedrock client successfully created!")

In [None]:
# Optional: wrap the Bedrock client created above in a LangChain LLM object.
# The Converse call in the next section uses `client` directly and does not need `llm`.
from langchain_community.llms import Bedrock

def get_llm():
    # Create a Bedrock model with specific configuration options
    return Bedrock(
        model_id=model_id,
        client=client,
        model_kwargs={
            'temperature': 0.2,
            'max_tokens' : 200
        }
    )

# Get the Bedrock model
llm = get_llm()

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>11. Pass the question and get Answer from the catalogue</b>
<p style = 'font-size:16px;font-family:Arial'>The following cells pass the question to the LLM and get an answer based on the catalogue entries that were retrieved for the prompts.</p>
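<p style = 'font-size:16px;font-family:Arial'>For reference, a Converse response carries the generated text in a list of content blocks, together with a stopReason field that indicates whether generation ended naturally or hit the token limit. The sketch below pulls both out of a response-shaped dictionary. The <b>extract_text</b> helper and the sample dictionary are hypothetical, and only the keys used here are assumed.</p>

```python
def extract_text(response):
    """Join all text blocks of a Converse-style response and report truncation."""
    blocks = response["output"]["message"]["content"]
    text = "".join(block["text"] for block in blocks if "text" in block)
    truncated = response.get("stopReason") == "max_tokens"
    return text, truncated

# Hand-written stand-in for a real response (hypothetical values)
sample_response = {
    "output": {
        "message": {"role": "assistant", "content": [{"text": "Use DEMO_X.table_y"}]}
    },
    "stopReason": "max_tokens",
}
answer, was_truncated = extract_text(sample_response)
print(answer)
if was_truncated:
    print("The response hit the maxTokens limit; consider raising it.")
```

<p style = 'font-size:16px;font-family:Arial'>Checking for truncation is worthwhile here because the prompt asks for a detailed answer while the token limit is set to 200.</p>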


In [None]:
# ---- Prepare the message for the model ----
messages = [
    {"role": "user", "content": [{"text": llm_query}]}
]

In [None]:
# ---- Call the model ----
try:
    response = client.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={"maxTokens": 200, "temperature": 0.2}
    )

    # ---- Extract and print model response ----
    output_text = response['output']['message']['content'][0]['text']
    print("✅ Model response:")
    print(output_text)

except ClientError as e:
    print(f"❌ Error testing client: {e}")

<p style = 'font-size:16px;font-family:Arial'>To check the answer for another question, change the prompt <a href='#rule'>here</a> and run the steps that follow it again.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>12. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial'> <b>Work Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables created above.</p>

In [None]:
tables = ['td_catalog_for_rag','rag_topics_embeddings_store','rag_topics_of_interest','rag_semantic_search_results']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # The table may not exist if an earlier step was skipped
        pass
    


<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_BankChurn');"
%run -i ../run_procedure.py "call remove_data('DEMO_UAF');"
%run -i ../run_procedure.py "call remove_data('DEMO_TelcoNetwork');"
%run -i ../run_procedure.py "call remove_data('DEMO_5G');"
%run -i ../run_procedure.py "call remove_data('DEMO_SalesForecasting');"
%run -i ../run_procedure.py "call remove_data('DEMO_SLMRAG_Catalogue');"        # Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>