<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Teradata Enterprise Vector Store : Vectorizing PDF, Audio and Text files
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial'>This demo builds a chat experience over documentation and database content using Generative AI. It combines <b>RAG, LangChain, LLMs, and SQL agents</b>, so that we can ask questions in plain language, retrieve relevant information from the Vector Store and/or a Vantage table, and generate accurate, concise answers from the retrieved data. Combining retrieval-based and generative approaches gives us a practical way to extract knowledge from structured or unstructured sources such as PDFs, text, or audio files and to deliver user-friendly responses.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we build a chatbot-style feature using LangChain, a library for working with LLMs such as <b>OpenAI's GPT-4, Amazon's Titan, and Anthropic's Claude 3.5</b>, together with JumpStart in ClearScape notebooks. The result is a system in which users can ask business questions in plain English and receive answers drawn from the relevant databases.</p>

<p style = 'font-size:16px;font-family:Arial'>The following diagram illustrates the architecture.</p>

<center><img src="images/rag1.png" alt="architecture"  width=1200 height=1000 style="border: 4px solid #404040; border-radius: 10px;"/></center>

<br>
<p style = 'font-size:16px;font-family:Arial'>Before going any further, let's get a better understanding of RAG, LangChain, and LLMs.</p>

<ol style = 'font-size:16px;font-family:Arial'><b><li> Retrieval-Augmented Generation (RAG):</li></b></ol>
<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp;RAG is a framework that combines the strengths of retrieval-based and generative approaches in question-answering systems. It uses a retrieval model together with a generative model to produce high-quality answers to user queries. The retrieval model retrieves relevant information from a knowledge source, such as a database or a set of documents. The generative model then takes the retrieved information as input and generates a concise, accurate answer in natural language.</p>


<p style = 'font-size:16px;font-family:Arial'>A typical RAG (Retrieval-Augmented Generation) application has two main components:</p>

<p style = 'font-size:16px;font-family:Arial'><b>Indexing:</b> a pipeline for ingesting data from a source and indexing it. This usually happens offline. The indexing process involves several steps, including loading the data, splitting it into smaller chunks, and storing and indexing the splits. This is often done using a VectorStore and Embeddings model.</p>
    
<p style = 'font-size:16px;font-family:Arial'><b>Retrieval and generation:</b> the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model. The retrieval process involves searching the index for the most relevant data based on the user query, and then passing that data to the model for generation.</p>

<p style = 'font-size:16px;font-family:Arial'>The most common full sequence from raw data to answer looks like:</p>
<p style = 'font-size:16px;font-family:Arial'><b>Indexing</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><b>Load:</b> First we need to load our data. We'll use <code>PyMuPDFLoader</code> for this.</li>
    <li><b>Split:</b> Text splitters break large documents into smaller chunks. This is useful both for indexing data and for passing it to a model, since large chunks are harder to search over and won't fit in a model's finite context window. Here, our PDF document will be split into pages.</li>
    <li><b>Store:</b> We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and an Embeddings model.</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial'>The following diagram illustrates the architecture of load, split and store.</p>

<center><img src="images/rag_load_store.png" alt="rag indexing architecture"  width=800 height=600 style="border: 4px solid #404040; border-radius: 20px;"/></center>
<center>image source: <a href="https://python.langchain.com/docs/use_cases/question_answering/">langchain.com</a></center>
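<p style = 'font-size:16px;font-family:Arial'>The split step can be pictured with a minimal sketch. It is a plain sliding window over characters, not the separator-aware logic of LangChain's <code>RecursiveCharacterTextSplitter</code>. The function name <code>split_with_overlap</code> and the sample text are assumptions made for illustration; the defaults of 200 and 30 mirror the splitter settings used later in this notebook.</p>

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=30):
    """Split text into fixed-size character windows that overlap.

    Illustrative stand-in only: a real text splitter also tries to break
    on separators such as paragraphs and sentences.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Made-up sample: 500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(500))
chunks = split_with_overlap(sample)

print([len(c) for c in chunks])
print(chunks[0][-30:] == chunks[1][:30])  # neighbouring chunks share 30 characters
```

<p style = 'font-size:16px;font-family:Arial'>The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.</p>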

<p style = 'font-size:16px;font-family:Arial'><b>Retrieval and generation</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><b>Retrieval:</b> At run time, the user enters a query. We first generate an embedding for it, which is then passed to the Vantage in-database function <b>TD_VectorDistance</b> to retrieve similar document chunks as context. This context is then fed into the LLM.</li>
    <li><b>Generation:</b> Finally, the model generates an answer based on the retrieved data. The answer is then presented to the user.</li>
    </ul>
    
<p style = 'font-size:16px;font-family:Arial'>The following diagram illustrates the architecture of retrieval and generation.</p>
<center><img src="images/rag_retrieval_generation_td.png" alt="retrieval generation architecture" width=800 height=600 style="border: 4px solid #404040; border-radius: 10px;"/></center>
<center>image source: <a href="https://python.langchain.com/docs/use_cases/question_answering/">langchain.com</a></center>
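<p style = 'font-size:16px;font-family:Arial'>To make the retrieval step concrete, here is a small client-side sketch that ranks stored chunks against a query vector. In this notebook the ranking is delegated to <b>TD_VectorDistance</b> inside Vantage; the sketch uses cosine similarity as one common measure and does not reproduce that function. The three-dimensional vectors and chunk IDs are made up for illustration.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings keyed by chunk id (hypothetical values)
chunk_vecs = {
    1000: [1.0, 0.0, 0.0],
    1001: [0.9, 0.1, 0.0],
    1002: [0.0, 1.0, 0.0],
}
query_vec = [1.0, 0.05, 0.0]

results = top_k(query_vec, chunk_vecs, k=2)
print([chunk_id for chunk_id, _ in results])
```

<p style = 'font-size:16px;font-family:Arial'>The text of the top-ranked chunks is what gets passed to the LLM as context.</p>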

<ol style = 'font-size:16px;font-family:Arial' start="2"><b><li> LangChain:</li></b></ol>
<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp; LangChain is a framework that facilitates the integration and chaining of large language models with other tools and sources to build more sophisticated AI applications. LangChain does not serve its own LLMs; instead, it provides a standard way of communicating with a variety of LLMs, including those from OpenAI and HuggingFace. LangChain accelerates the development of AI applications with building blocks. We leverage the following building blocks in this notebook:</p>
 
<ol style = 'font-size:16px;font-family:Arial'>
    <li> <b> LLMs</b> – LangChain's <code>llm</code> class is designed to provide a standard interface for all the LLMs it supports.   </li>
    <li> <b> PromptTemplate</b>  - LangChain’s <code>PromptTemplate</code> class provides predefined structures for generating prompts for LLMs. Templates can be reused across different LLMs.</li>
    <li> <b> Chains</b> – When we build complex AI applications, we may need to combine multiple calls to LLMs and to other components. LangChain’s <code>chain</code> class allows us to link calls to LLMs and components. The most common type of chaining in any LLM application is combining a prompt template with an LLM and, optionally, an output parser. </li>
</ol>
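<p style = 'font-size:16px;font-family:Arial'>The chaining idea can be sketched without any LLM at all: a prompt template, a model call, and an output parser are composed so that each step feeds the next. The code below is a conceptual stand-in written with the Python standard library, not LangChain's API. <code>fake_llm</code>, <code>parse_output</code>, and <code>run_chain</code> are hypothetical names, and the stub model only echoes the question.</p>

```python
from string import Template

# Prompt template with two placeholders, filled in at run time
PROMPT = Template(
    "Answer the question using only the context.\n"
    "Context: $context\n"
    "Question: $question"
)


def fake_llm(prompt_text):
    """Hypothetical stand-in for a model call: echoes the last prompt line."""
    last_line = prompt_text.splitlines()[-1]
    return {"content": f"  (stub answer to) {last_line}  "}


def parse_output(message):
    """Output parser: extract the text and trim surrounding whitespace."""
    return message["content"].strip()


def run_chain(inputs, steps):
    """Pass the value through each step in order, like links in a chain."""
    value = inputs
    for step in steps:
        value = step(value)
    return value


# prompt template -> model -> output parser
chain = [lambda inputs: PROMPT.substitute(**inputs), fake_llm, parse_output]

answer = run_chain(
    {
        "context": "The policy covers medical expenses.",
        "question": "Is medical cover included?",
    },
    chain,
)
print(answer)
```

<p style = 'font-size:16px;font-family:Arial'>Because each step only depends on the output of the previous one, the model or the parser can be swapped without touching the rest of the chain.</p>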

<ol style = 'font-size:16px;font-family:Arial' start="3"><b><li> LLM Models (Large Language Models):</li></b></ol>

<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp; LLMs are large-scale language models trained on vast amounts of text data.
These models, such as GPT-4, Llama 3, and Google's Gemini 1.5, are capable of generating human-like text responses. They are pre-trained on diverse sources of text, which lets them learn patterns, grammar, and context across a wide range of topics. They can also be fine-tuned for specific tasks, such as question answering, natural language understanding, and text generation.
LLMs have achieved impressive results in various natural language processing tasks and are widely used in AI applications.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the environment</li>
    <li>Connect to Vantage</li>
    <li>Data Exploration Getting Data for This Demo</li>
    <li>Read source data</li>
    <li>Generate embeddings from the chunks</li>
    <li>Insert Prompts into a Table</li>
    <li>Generate Embeddings from the Prompts</li>
    <li>Find top 10 matching chunks</li>
    <li>Configuring AWS CLI and Initialize Bedrock Model</li>
    <li>Test and Compare Results</li>
    <li>Cleanup</li>
</ol>

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>1. Configuring the environment</b>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>1.1 Install the required libraries</b></p>

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> Installing the required libraries takes approximately <b>4 to 5 minutes</b> the first time. If the libraries are already installed, the cells complete within about 5 seconds.</i></p>
</div>

In [None]:
%%capture

# Install required libraries from requirements.txt
!pip install --upgrade -r requirements.txt --quiet

In [None]:
%%capture

# Install additional ML and LLM libraries
!pip install torchaudio transformers litellm langchain-openai

In [None]:
%%capture

# Install dataset and video processing libraries
!pip install datasets torchcodec

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>The above statements will install the required libraries to run this demo. Be sure to restart the kernel after executing the above lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
    </div>
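<p style = 'font-size:16px;font-family:Arial'>After restarting the kernel, a quick check can confirm that the libraries are visible to Python before continuing. The sketch below is an optional helper, not part of the original demo. <code>find_missing</code> is a hypothetical name, and the module list is only an example; note that import names can differ from pip package names (for instance, <code>langchain-openai</code> is imported as <code>langchain_openai</code>).</p>

```python
import importlib.util


def find_missing(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Example list: two standard-library modules plus one deliberately fake name.
# In this notebook, the list might instead hold names such as "transformers".
required = ["json", "os", "definitely_not_a_real_module_xyz"]
print(find_missing(required))
```

<p style = 'font-size:16px;font-family:Arial'>An empty list means every module in the list can be located; any name that is returned needs to be installed, followed by another kernel restart.</p>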

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>1.2 Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

<ul style="font-size: 16px; font-family: Arial; list-style-type: disc; padding-left: 20px;">
    <li>
        <b>teradataml</b>: Enables us to establish a connection to our database using the <code>create_context()</code> function and to create virtual DataFrames, which serve as references to database objects. This allows exploration of object storage data and operations directly on Vantage without transferring entire datasets to the client, except when needed. For this demo, we will be exploring a dataset in S3 via a foreign table on Vantage.
    </li>
    <li>
        <b>LangChain’s SQLDatabase class </b>: A wrapper around the SQLAlchemy engine to facilitate interactions with databases using SQLAlchemy’s Python SQL toolkit and ORM capabilities.
    </li>
    <li>
        <b> LangChain’s create_sql_agent function</b>: A LangChain function to build a SQL agent by providing a language model and a database connection.
    </li>
    <li>
        <b>LangChain’s ChatBedrockConverse class</b>: A common interface for working with Amazon Bedrock's foundation models (FMs) that support chat functionality.
    </li>
</ul>

In [None]:
# Standard libraries
import os
import time
import timeit
import warnings
import ipywidgets as widgets
from ipywidgets import interact, Dropdown
from dotenv import load_dotenv
from litellm import completion

# Data manipulation and visualization libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Teradata libraries
from teradataml import (
    copy_to_sql,
    create_context,
    delete_byom,
    execute_sql,
    save_byom,
    remove_context,
    in_schema,
    display,
    DataFrame,
    db_drop_table,
    db_drop_view,
    VectorDistance,
    configure,
    ONNXEmbeddings,
)

# Helper functions
from utils.sql_helper_func import *
from utils.transcripts_helper_func import *

# LLM libraries - Updated for langchain 1.0.5
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.chat_models import init_chat_model
from langchain.tools import tool
from langchain.agents import create_agent
import bs4
from IPython.display import display, Markdown

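# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# IPython's display() imported above shadows teradataml's display options,
# so re-import the options object under an alias before limiting printed rows.
from teradataml import display as td_display
td_display.max_rows = 5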

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>1.3 Load Audio model and test</b></p>

<p style = 'font-size:16px;font-family:Arial'>Let's load the small <b>Whisper</b> speech-to-text model (<code>openai/whisper-small</code>) from Hugging Face and verify its output on a sample audio clip.</p>


In [None]:
import os

# Configure Hugging Face environment
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

try:
    # Load Whisper model and processor
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    model.config.forced_decoder_ids = None

    # Load dummy dataset for testing audio transcription
    ds = load_dataset(
        "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
    )
    sample = ds[2]["audio"]

    # Process audio sample
    input_features = processor(
        sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
    ).input_features

    # Generate token ids and decode to text
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    # Display transcription result
    print("--" * 25)
    print("Transcription: \n", transcription[0])
    print("--" * 25)

except Exception as e:
    print(f"Error loading or testing Whisper model: {e}")

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><i>The code above downloads the Whisper model used to transcribe audio files in this demo. The initial download may take approximately 50-60 seconds if you are running this demo for the first time in this environment. Subsequent runs will be much faster because the model will already be available locally.</i></p>
</div>

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>2. Connect to Vantage</b>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>2.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
# Run startup script to get password
%run -i ../startup.ipynb

try:
    # Create database connection context
    eng = create_context(host='host.docker.internal', username='demo_user', password=password)
    print(eng)
    
    # Set query band for session tracking
    execute_sql('''SET query_band='DEMO=Teradata_Enterprise_VectorStore_VectorizingPDFs_GenAI_Python.ipynb;' UPDATE FOR SESSION;''')
    
except Exception as e:
    print(f"Error connecting to Vantage: {e}")
    raise
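
<p style = 'font-size:16px;font-family:Arial'>The query band set above is a string of <code>name=value;</code> pairs attached to the session. As a minimal sketch, the statement can be assembled from a dictionary so that additional pairs are formatted consistently. <code>build_query_band</code> is a hypothetical helper, and rejecting the characters <code>=</code>, <code>;</code> and <code>'</code> is a conservative assumption made here to keep the generated string unambiguous.</p>

```python
def build_query_band(pairs):
    """Build a session-level SET QUERY_BAND statement from name/value pairs."""
    for name, value in pairs.items():
        for text in (name, str(value)):
            # Conservative check (assumption): these characters delimit the band
            if any(ch in text for ch in "=;'"):
                raise ValueError(f"Unsupported character in query band item: {text!r}")

    band = "".join(f"{name}={value};" for name, value in pairs.items())
    return f"SET query_band='{band}' UPDATE FOR SESSION;"


stmt = build_query_band(
    {"DEMO": "Teradata_Enterprise_VectorStore_VectorizingPDFs_GenAI_Python.ipynb"}
)
print(stmt)
```

<p style = 'font-size:16px;font-family:Arial'>The resulting string has the same shape as the statement passed to <code>execute_sql</code> in the cell above.</p>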

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>3. Data Exploration Getting Data for This Demo</b>

<p style = 'font-size:16px;font-family:Arial'>The Chat with Documentation demo shows how users can interact with documents such as insurance policy wordings, invoices, and similar files through a conversational interface. In this demo we have also added audio and text files, so that transcripts extracted from audio can be queried in the same conversational way.</p>

In [None]:
# Load demo data (takes approximately 2 minutes)
%run -i ../run_procedure.py "call get_data('DEMO_ComplaintAnalysis_local');"

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
# Optional: View space report for databases and tables
%run -i ../run_procedure.py "call space_report();"

<p style = 'font-size:16px;font-family:Arial'>We have a Customer 360 details table containing all the customers' personal and banking-related information. We will use this table to ask questions in natural language and retrieve answers from the Vantage Database.</p>

In [None]:
# Load Customer 360 details table
complaints_data = DataFrame(in_schema("DEMO_ComplaintAnalysis", "Customer_360_Details"))
complaints_data

<hr style='height:2px;border:none;'>
<a id='section4'></a>
<b style = 'font-size:20px;font-family:Arial'>4. Read source data. </b>
<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>4.1 Run the data loader </b></p>

<p style = 'font-size:16px;font-family:Arial'>The Traveller Easy Single Trip - International insurance policy is a comprehensive travel insurance plan that provides cover for a wide range of risks, including medical expenses, trip cancellation, loss of luggage, and personal accident. The policy is designed to be affordable and flexible, and it can be purchased online or over the phone.</p>

<p style = 'font-size:16px;font-family:Arial'>The source data from <a href="https://axa-com-my.cdn.axa-contento-118412.eu/axa-com-my/3d2f84a5-42b9-459b-911a-710546df0633_Policy+wording+-+SmartTraveller+Easy+Single+Trip+-+International+%280820%29.pdf">AXA</a> is loaded into Teradata Vantage, which serves as the vector database.</p>

<p style = 'font-size:16px;font-family:Arial'>Now, let's use the <code>PyMuPDFLoader</code> class to read the PDF document and split it into pages.</p>

<p style = 'font-size:16px;font-family:Arial'>For audio files, we use the open-source <code>openai/whisper-small</code> model to extract transcripts, which are then split into chunks in the same way.</p>

In [None]:
def get_next_id(df, id_column="id", start_id=1000):
    """
    Get the next available ID for a new record.

    Args:
        df: DataFrame containing existing records
        id_column: Name of the ID column
        start_id: Starting ID value if DataFrame is empty

    Returns:
        Next available ID value
    """
    if df.empty or df[id_column].max() < start_id:
        return start_id
    else:
        return df[id_column].max() + 1


def get_splitter():
    """
    Create a text splitter for chunking documents.

    Returns:
        RecursiveCharacterTextSplitter configured for 200-character chunks
        with a 30-character overlap
    """
    return RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=30,
        length_function=len,
        is_separator_regex=False,
    )


def read_document_content(raw_data_df, pages, file_name):
    """
    Process document pages and convert to chunked text with metadata.

    Args:
        raw_data_df: Existing DataFrame with processed documents
        pages: List of document pages to process
        file_name: Source file name for tracking

    Returns:
        Updated DataFrame with new document chunks
    """
    # Extract page content and split into chunks
    docs = [p.page_content for p in pages]
    docs = get_splitter().create_documents(docs)

    # Convert chunks to text list
    texts_data = [chunk.page_content for chunk in docs]

    # Create DataFrame with text chunks and sequential IDs
    temp_df = pd.DataFrame(data=texts_data, columns=["txt"])
    next_id = get_next_id(raw_data_df)
    temp_df["id"] = range(next_id, next_id + len(temp_df.index))
    temp_df["file_name"] = file_name

    # Concatenate with existing data
    return pd.concat([raw_data_df, temp_df], ignore_index=True)
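
<p style = 'font-size:16px;font-family:Arial'>The ID rule used by <code>get_next_id</code> is easiest to see on a tiny example: IDs start at 1000 and continue after the current maximum, so chunks from successive files never collide. The sketch below mirrors that rule with plain lists instead of pandas. <code>next_id</code>, <code>append_chunks</code>, and the file names are hypothetical and exist only for illustration.</p>

```python
def next_id(existing_ids, start_id=1000):
    """Start at start_id, otherwise continue after the current maximum."""
    if not existing_ids or max(existing_ids) < start_id:
        return start_id
    return max(existing_ids) + 1


def append_chunks(records, chunks, file_name):
    """Return records extended with one row per chunk, using sequential IDs."""
    first = next_id([row["id"] for row in records])
    new_rows = [
        {"id": first + offset, "txt": chunk, "file_name": file_name}
        for offset, chunk in enumerate(chunks)
    ]
    return records + new_rows


records = []
records = append_chunks(records, ["a", "b", "c"], "policy.pdf")  # made-up file
records = append_chunks(records, ["d", "e"], "call.mp3")         # made-up file

print([(row["id"], row["file_name"]) for row in records])
```

<p style = 'font-size:16px;font-family:Arial'>The first file receives IDs 1000 to 1002 and the second continues at 1003, which is the behaviour the notebook relies on when it concatenates chunks from several files.</p>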

In [None]:
def read_data_files(directory_path):
    """
    Read and process multiple file types (PDF, MP3, TXT) from a directory.

    Args:
        directory_path: Path to directory containing source files

    Returns:
        DataFrame containing all processed document chunks with metadata
    """
    import warnings

    warnings.filterwarnings(
        "ignore", "Due to a bug fix in https://github.com/huggingface/transformers/pull"
    )

    # Define schema for output DataFrame
    columns = {
        "id": "int64",
        "txt": "object",
        "file_name": "object",
    }

    # Create loading spinner widget
    loading_spinner = widgets.HTML(
        value="<i class='fas fa-cog fa-spin' style='font-size:24px'></i> Reading the raw data...",
    )

    # Initialize empty DataFrame
    raw_data_df = pd.DataFrame(
        {col: pd.Series(dtype=dt) for col, dt in columns.items()}
    )

    # Show the spinner once, then walk the directory tree and process files
    display(loading_spinner)
    for root, dirs, files in os.walk(directory_path):
        # Skip checkpoint directories
        if ".ipynb_checkpoints" in root:
            continue

        for file_name in files:
            file_path = os.path.join(root, file_name)

            try:
                # Process MP3 audio files
                if file_name.lower().endswith(".mp3"):
                    print(f"MP3 File: {file_name}")
                    transcripts = process_audio(file_path)
                    texts = get_splitter().create_documents([transcripts])
                    raw_data_df = read_document_content(raw_data_df, texts, file_name)

                # Process PDF files
                elif file_name.lower().endswith(".pdf"):
                    print(f"PDF File: {file_name}")
                    pages = PyMuPDFLoader(file_path).load_and_split()
                    print(f"Total pages: {len(pages)}")
                    raw_data_df = read_document_content(raw_data_df, pages, file_name)

                # Process text files
                elif file_name.lower().endswith(".txt"):
                    print(f"TXT File: {file_name}")
                    with open(file_path, "r", encoding="utf-8") as file:
                        text_content = file.read()
                        texts = get_splitter().create_documents([text_content])
                        raw_data_df = read_document_content(
                            raw_data_df, texts, file_name
                        )

                else:
                    print(f"Skipping: {file_name}")

            except Exception as e:
                print(f"Error processing {file_name}: {e}")
                continue

    print("*" * 70)
    print("All source files have been read, and chunking has been completed.")
    print("*" * 70)

    # Hide the loading spinner
    loading_spinner.value = ""

    return raw_data_df

<p style='font-size:16px;font-family:Arial'>The cell above defines a function that reads every PDF, MP3, and TXT file in a directory and splits the content into chunks. For further processing, we will save these chunks to Vantage.</p>
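<p style='font-size:16px;font-family:Arial'>Before running the loader on a new directory, it can help to preview which files it would pick up. The following sketch groups file names by extension in the same case-insensitive way as the loader's <code>endswith</code> checks. <code>classify_files</code> and the sample file names are assumptions for illustration and are not part of the notebook's helpers.</p>

```python
from pathlib import Path


def classify_files(file_names, supported=(".pdf", ".mp3", ".txt")):
    """Group file names by lower-cased extension; unsupported ones are skipped."""
    groups = {ext: [] for ext in supported}
    groups["skipped"] = []
    for name in file_names:
        ext = Path(name).suffix.lower()
        if ext in supported:
            groups[ext].append(name)
        else:
            groups["skipped"].append(name)
    return groups


# Made-up file names covering each branch
files = ["Policy.PDF", "call_01.mp3", "notes.txt", "logo.png", "README"]
groups = classify_files(files)

for key, names in groups.items():
    print(key, names)
```

<p style='font-size:16px;font-family:Arial'>Anything listed under <code>skipped</code> would be ignored by the loader, which makes unexpected file types easy to spot before a long run.</p>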

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>5. Load HuggingFace Model</b></p>
<p style = 'font-size:16px;font-family:Arial;'>To generate embeddings, we need an ONNX model that transforms text into vector representations. We use a pretrained model from <a href="https://huggingface.co/Teradata/bge-base-en-v1.5">Teradata's Hugging Face repository</a>, <code>bge-base-en-v1.5</code>. The model and its tokenizer are downloaded here and then stored in Vantage tables as BLOBs using the <code>save_byom</code> function.</p>

In [None]:
from huggingface_hub import hf_hub_download

# Configure Hugging Face environment
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

# Model configuration
model_name = "bge-base-en-v1.5"
number_dimensions_output = 768
model_file_name = "model.onnx"

try:
    # Download embedding model and tokenizer from Hugging Face
    hf_hub_download(
        repo_id=f"Teradata/{model_name}",
        filename=f"onnx/{model_file_name}",
        local_dir="./",
    )
    hf_hub_download(
        repo_id=f"Teradata/{model_name}", filename="tokenizer.json", local_dir="./"
    )
    print(f"Successfully downloaded {model_name} model and tokenizer")

except Exception as e:
    print(f"Error downloading model files: {e}")
    raise

<hr style="height:1px;border:none">
<p style = 'font-size:18px;font-family:Arial'><b>5.1 Save the Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>In the step above, we downloaded the ONNX model and its tokenizer. Now we will save both files to Vantage.</p>

In [None]:
# Clean up existing model tables if they exist
try:
    db_drop_table("embeddings_models")
except Exception:
    pass

try:
    db_drop_table("embeddings_tokenizers")
except Exception:
    pass

In [None]:
try:
    # Save embedding model to Vantage
    save_byom(
        model_id=model_name,  # Must be unique in the models table
        model_file=f"onnx/{model_file_name}",
        table_name="embeddings_models",
    )
    print(f"Successfully saved embedding model: {model_name}")

    # Save tokenizer to Vantage
    save_byom(
        model_id=model_name,  # Must be unique in the models table
        model_file="tokenizer.json",
        table_name="embeddings_tokenizers",
    )
    print(f"Successfully saved tokenizer: {model_name}")

except Exception as e:
    print(f"Error saving model to Vantage: {e}")
    raise

<p style = 'font-size:16px;font-family:Arial;'>Recheck the installed model and tokenizer.</p>

In [None]:
# Verify embedding models saved successfully
df_model = DataFrame("embeddings_models")
df_model

In [None]:
# Verify tokenizer saved successfully
df_token = DataFrame("embeddings_tokenizers")
df_token

<p style = 'font-size:16px;font-family:Arial'>Load the model and tokenizer that we saved to the database in the previous steps by filtering on the model ID.</p>

In [None]:
# Load the saved model and tokenizer from database
my_model = DataFrame.from_query(
    f"SELECT * FROM embeddings_models WHERE model_id = '{model_name}'"
)
my_tokenizer = DataFrame.from_query(
    f"SELECT model AS tokenizer FROM embeddings_tokenizers WHERE model_id = '{model_name}'"
)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. Generate embeddings from the chunks.</b></p>
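<p style = 'font-size:16px;font-family:Arial'>We now define a helper that generates an embedding for each text chunk in the database, using the saved ONNX model and tokenizer. The columns listed in <code>cols_to_preserve</code> are carried through to the output so that each embedding stays linked to its source chunk.</p>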

In [None]:
# Configure BYOM installation locations
configure.val_install_location = "val"
configure.byom_install_location = "mldb"


def generate_embeddings_data(input_tdf, cols_to_preserve):
    """
    Generate embeddings for text data using ONNX model.

    Args:
        input_tdf: Input DataFrame containing text data
        cols_to_preserve: List of column names to preserve in output

    Returns:
        DataFrame containing embeddings and preserved columns
    """
    try:
        return ONNXEmbeddings(
            newdata=input_tdf,
            modeldata=my_model,
            tokenizerdata=my_tokenizer,
            accumulate=cols_to_preserve,
            model_output_tensor="sentence_embedding",
            output_format=f"FLOAT32({number_dimensions_output})",
            enable_memory_check=False,
        ).result

    except Exception as e:
        print(f"Error generating embeddings: {e}")
        raise
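
<p style = 'font-size:16px;font-family:Arial'>The <code>FLOAT32(768)</code> output format means each chunk yields a vector of 768 single-precision values. As a rough sizing aid, the sketch below computes the raw payload of such vectors, assuming four bytes per element. It says nothing about how Vantage stores the vectors internally, and <code>pack_float32</code> and <code>estimate_bytes</code> are hypothetical names.</p>

```python
import struct

DIMENSIONS = 768        # matches number_dimensions_output in this notebook
BYTES_PER_FLOAT32 = 4   # size of one IEEE 754 single-precision value


def pack_float32(vector):
    """Pack a list of floats into little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def estimate_bytes(num_chunks, dims=DIMENSIONS):
    """Raw vector payload for num_chunks embeddings, excluding any overhead."""
    return num_chunks * dims * BYTES_PER_FLOAT32


vector = [0.0] * DIMENSIONS
packed = pack_float32(vector)

print(len(packed))             # bytes for a single embedding
print(estimate_bytes(10_000))  # bytes for 10,000 chunks (example figure)
```

<p style = 'font-size:16px;font-family:Arial'>Since the text columns are preserved alongside each vector, the actual table will be larger than this estimate.</p>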

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>6.1 Do you want to generate the embeddings?</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Generating embeddings will take around <b>35-40 minutes.</b></p>

<p style = 'font-size:16px;font-family:Arial'>We have already generated embeddings for the PDF and stored them in a <b>Vantage</b> table.</p>
 
<center><img src="images/decision_emb_gen_2.svg" alt="embeddings_decision"  width=300 height=400/></center>
 
<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><i><b>Note: If you would like to skip the embedding generation step to save the time and move quickly to next step, please enter "No" in the next prompt.</b></i></p>
</div>

<div class="alert alert-block alert-warning">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> If you choose <b>"yes"</b> to run the embeddings generation step, you must first execute the <a href="./Initialization_and_Model_Load.ipynb">Initialization_and_Model_Load.ipynb</a> file to install the ONNX model on the ClearScape machine.</i></p>
</div>

 
<p style = 'font-size:16px;font-family:Arial'>To save time, you can move to the already generated embeddings section. However, if you would like to see how we generate the embeddings, or if you need to generate the embeddings for a different dataset, then continue to the following section.</p>

In [None]:
def generate_emb():
    """
    Generate embeddings from source documents (PDF, text, audio).

    This function:
    1. Reads raw data from various file formats
    2. Saves raw data to SQL table
    3. Generates embeddings for all chunks
    4. Saves embeddings to database
    """
    display(loading_spinner)

    try:
        print("*" * 50)
        print("Step 1: Reading raw data from PDF, TXT and audio files...")
        print("*" * 50)

        directory_path = "./data"
        final_raw_data_df = read_data_files(directory_path)

        print("*" * 50)
        print("Step 2: Saving raw data to SQL...")
        print("*" * 50)

        # Copy documents to Vantage
        copy_to_sql(
            final_raw_data_df,
            table_name="docs_data",
            primary_index="id",
            if_exists="replace",
        )

        tdf_docs = DataFrame("docs_data")
        print(f"Data information: \n{tdf_docs.shape}")
        tdf_docs.sort("id")

        print("*" * 50)
        print("Step 3: Generating embeddings...")
        print("*" * 50)
        display(loading_spinner)
        display(Markdown(get_section5_desc_start(tdf_docs)))

        start = time.time()

        # Generate embeddings with preserved columns
        cols_to_preserve = ["id", "txt", "file_name"]
        docs_data = DataFrame("docs_data")
        df_embeddings = generate_embeddings_data(docs_data, cols_to_preserve)

        # Save embeddings to database
        copy_to_sql(
            df_embeddings,
            table_name="pdf_embeddings_store",
            if_exists="replace",
            index=False,
        )

        elapsed_time = time.time() - start
        print(f"Embeddings generated successfully in {elapsed_time:.2f} seconds")

    except Exception as e:
        print(f"Error in embedding generation process: {e}")
        raise

In [None]:
def load_data_emb():
    """
    Load pre-generated embeddings from local parquet files.

    This function loads both raw data and embeddings from compressed
    parquet files and saves them to the database.
    """
    try:
        print("*" * 60)
        print("Step 1: Loading raw data from parquet file stored locally.")
        print("*" * 60)

        # Load raw data from parquet
        raw_data_prq = pd.read_parquet("./embeddings/all_source_data_v1.parquet.gzip")

        # Save raw data to database
        delete_and_copy_embeddings(
            table_name="docs_data",
            tdf=raw_data_prq,
            eng=eng,
        )

        print("*" * 60)
        print("Step 2: Loading embeddings from parquet file stored locally.")
        print("*" * 60)

        # Load embeddings from parquet
        embeddings_prq = pd.read_parquet("./embeddings/all_embeddings_v3.parquet.gzip")

        # Save embeddings to database
        delete_and_copy_embeddings(
            table_name="pdf_embeddings_store",
            tdf=embeddings_prq,
            eng=eng,
        )

        print("*" * 50)
        print("Embeddings loaded and saved successfully!")
        print("*" * 50)

    except FileNotFoundError as e:
        print(f"Error: Parquet files not found. Please check file paths: {e}")
        raise
    except Exception as e:
        print(f"Error loading embeddings from parquet: {e}")
        raise

In [None]:
# Create loading spinner widget
loading_spinner = widgets.HTML(
    value="<i class='fas fa-cog fa-spin' style='font-size:24px'></i> Generating embeddings for documents...",
)


def get_section5_desc_start(tdf):
    """
    Generate informational message about embedding generation time.

    Args:
        tdf: DataFrame containing documents to process

    Returns:
        HTML formatted information message
    """
    return f"""<div class="alert alert-info">
    <p style='font-size:16px;font-family:Arial'><i><b>Please be patient:</b> Generating embeddings for {tdf.shape[0]} document chunks may take 35 to 40 minutes, depending on the number of AMPs in the database, because the data volume is large relative to the size of the machine.</i></p>
</div>"""


# Prompt user for their preference
generate = input("Do you want to generate embeddings? ('yes'/'no'): ").strip().lower()

try:
    if generate == "yes":
        generate_emb()
    elif generate == "no":
        load_data_emb()
    else:
        print("\nInvalid input. Please enter 'yes' or 'no' to proceed.")

except Exception as e:
    print(f"ERROR: {e}")
    raise
finally:
    # Hide loading spinner
    loading_spinner.value = ""

In [None]:
# Load and display embeddings table
tdf_embeddings_store = DataFrame("pdf_embeddings_store")
tdf_embeddings_store

<p style = 'font-size:16px;font-family:Arial'>Let's view the shape of the embeddings table.</p>

In [None]:
# Display shape of embeddings table (rows, columns)
tdf_embeddings_store.shape

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Insert Prompts into a Table</b></p>


<p style = 'font-size:16px;font-family:Arial'>We will create the required table and then insert the prompt values into it.</p>

In [None]:
def create_question_to_ask_table():
    """
    Create table to store questions for query generation.

    Drops existing table if present and creates new one.
    """
    qry = """CREATE MULTISET TABLE question_to_ask(
        txt VARCHAR(1024) CHARACTER SET UNICODE NOT CASESPECIFIC,
        query_id INT
    ) NO PRIMARY INDEX"""

    try:
        execute_sql(qry)
        print("Table 'question_to_ask' created successfully")
    except Exception:
        # Drop and recreate if table already exists
        db_drop_table("question_to_ask")
        execute_sql(qry)
        print("Table 'question_to_ask' recreated successfully")

<p style = 'font-size:16px;font-family:Arial'>We will create prompts for different questions that can be answered from the document. Below are some sample questions that can be asked.</p>
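<p style = 'font-size:16px;font-family:Arial'>Question text can contain apostrophes, which break a hand-built SQL string literal. The following is a minimal sketch of inserting such text with bound parameters. It assumes a DB-API style connection and uses Python's built-in <code>sqlite3</code> purely as a stand-in for Vantage so that it runs anywhere; the sample prompts are hypothetical.</p>

```python
import sqlite3

# Stand-in database: sqlite3 is used only so this sketch is runnable anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE question_to_ask (txt VARCHAR(1024), query_id INT)")

# Hypothetical sample prompts; the second one contains a single quote.
sample_prompts = [
    "Where can I submit my complaints or feedback?",
    "What is covered under the insured's baggage policy?",
]

# Bound parameters (?) keep quotes in the text from breaking the statement.
rows = [(prompt, idx) for idx, prompt in enumerate(sample_prompts, start=100)]
conn.executemany("INSERT INTO question_to_ask (txt, query_id) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT query_id, txt FROM question_to_ask ORDER BY query_id"
).fetchall()
print(stored)
```

<p style = 'font-size:16px;font-family:Arial'>When bound parameters are not available, doubling each single quote in the text before building the statement has the same effect for string literals.</p>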

In [None]:
def add_questions():
    """
    Insert sample questions into the question_to_ask table.

    These questions cover both structured data queries and document-based questions.
    """
    prompts = [
    "Does this policy cover  Loss of or Damage to the Insured’s Articles?",
    "What is the reimbursement limit per Baggage?",
    "What is the sum insured amount in the case Accidental Death in domestic and international for adult as well as child?",
    "What documents are required for Rental Car Excess?",
    "Where can I submit my complaints or feedback?",
    "What is the bank tenure of customer 789456123?",
    "What is the Total credit balance of customer 456789123?",
    "How many customers have only Credit Card as product holdings?",
    ]

    try:
        for idx, prompt in enumerate(prompts, start=100):
            # Double any single quotes so the text is a valid SQL string literal
            safe_prompt = prompt.replace("'", "''")
            execute_sql(
                f"INSERT INTO question_to_ask VALUES ('{safe_prompt}', {idx});"
            )
        print(f"Successfully inserted {len(prompts)} questions")

    except Exception as e:
        print(f"Error inserting questions: {e}")
        raise

In [None]:
def load_que_emb(table_name):
    """
    Load pre-generated question embeddings from parquet file to database.

    Args:
        table_name: Name of the target table for embeddings
    """
    try:
        print("*" * 50)
        print("Loading question embeddings from parquet file stored locally.")
        print("*" * 50)

        # Load embeddings from parquet
        embeddings_prq = pd.read_parquet(
            "./embeddings/questions_embeddings.parquet.gzip"
        )

        # Save to database
        copy_to_sql(
            embeddings_prq,
            table_name=table_name,
            primary_index="query_id",
            if_exists="replace",
        )

        print("*" * 50)
        print("Question embeddings loaded and saved successfully!")
        print("*" * 50)

    except FileNotFoundError as e:
        print(f"Error: Question embeddings parquet file not found: {e}")
        raise
    except Exception as e:
        print(f"Error loading question embeddings: {e}")
        raise

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>8. Generate Embeddings from the Prompts</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will create embeddings for the prompts which we have inserted into the table above.</p>

<div class="alert alert-block alert-warning">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> If you choose <b>"yes"</b> to run the embeddings generation step, you must first execute the <a href="./Initialization_and_Model_Load.ipynb">Initialization_and_Model_Load.ipynb</a> notebook to install the ONNX model on the ClearScape machine.</i></p>
</div>

In [None]:
# Create loading spinner
loading_spinner = widgets.HTML(
    value="<i class='fas fa-cog fa-spin' style='font-size:24px'></i> Generating embeddings for documents...",
)

# Create question table and add sample questions
create_question_to_ask_table()
add_questions()

# Request user input
generate = input("Do you want to generate embeddings? ('yes'/'no'): ").strip().lower()

try:
    if generate == "yes":
        # Generate embeddings for questions
        loading_spinner = widgets.HTML(
            value="<i class='fas fa-cog fa-spin' style='font-size:24px'></i> Generating embeddings for questions...",
        )

        display(loading_spinner)

        # Define columns to preserve in output
        cols_to_preserve = ["query_id", "txt"]
        question_to_ask = DataFrame("question_to_ask")
        df_embeddings_que = generate_embeddings_data(question_to_ask, cols_to_preserve)

        # Save to database
        copy_to_sql(
            df_embeddings_que,
            table_name="question_to_ask_embeddings",
            if_exists="replace",
            index=False,
        )

        loading_spinner.value = ""
        print("Question embeddings generated successfully")

    elif generate == "no":
        load_que_emb(table_name="question_to_ask_embeddings")

    else:
        print("\nInvalid input. Please enter 'yes' or 'no' to proceed.")

except Exception as e:
    print(f"Error in question embedding process: {e}")
    loading_spinner.value = ""
    raise

In [None]:
# Load and display question embeddings table
tdf_question_embeddings_store = DataFrame("question_to_ask_embeddings")
tdf_question_embeddings_store

In [None]:
# Display shape of question embeddings table
tdf_question_embeddings_store.shape

In [None]:
# Optional: Save question embeddings to parquet file for future use
# df = tdf_question_embeddings_store.to_pandas().reset_index()
# df.drop("index", axis=1, inplace=True)
# df.to_parquet('./embeddings/questions_embeddings.parquet.gzip', compression='gzip')

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>9. Find top 10 matching chunks</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will find the top 10 chunks that match each query using <b>TD_VectorDistance</b>. The TD_VectorDistance function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs. If only one table is provided as input, the function computes the distances between target and reference pairs from that same table. The column order must be the same in the TargetFeatureColumns and RefFeatureColumns arguments. The function ignores feature values that are NULL, NaN, or INF during distance computation.</p>
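<p style = 'font-size:16px;font-family:Arial'>To make the idea concrete, here is a small pure-Python sketch of a top-k search by cosine distance. It is not the TD_VectorDistance implementation; it assumes cosine distance is defined as one minus cosine similarity, and the vectors and helper names (<code>cosine_distance</code>, <code>top_k_matches</code>) are hypothetical.</p>

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_k_matches(targets, references, k):
    """For each target id, return the k reference ids with the smallest distance."""
    results = {}
    for target_id, target_vec in targets.items():
        scored = [
            (ref_id, cosine_distance(target_vec, ref_vec))
            for ref_id, ref_vec in references.items()
        ]
        scored.sort(key=lambda pair: pair[1])  # smallest distance first
        results[target_id] = scored[:k]
    return results


# Toy data: 3 dimensions instead of a real embedding size.
targets = {100: [1.0, 0.0, 0.0]}
references = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0], 3: [1.0, 1.0, 0.0]}

matches = top_k_matches(targets, references, k=2)
print(matches)
```

<p style = 'font-size:16px;font-family:Arial'>A smaller distance means a closer match, which is why results are sorted in ascending order before the top k are kept.</p>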

In [None]:
def calculate_vector_distance(target_table, reference_table, emb_column_names, topk):
    """
    Calculate cosine distance between target and reference vectors.

    Args:
        target_table: DataFrame containing target (query) vectors
        reference_table: DataFrame containing reference (document) vectors
        emb_column_names: List of embedding column names
        topk: Number of top matching results to return

    Returns:
        DataFrame containing top-k matches with distances
    """
    try:
        start = timeit.default_timer()

        VectorDistance_out = VectorDistance(
            target_id_column="query_id",
            target_feature_columns=emb_column_names,
            ref_id_column="id",
            ref_feature_columns=emb_column_names,
            distance_measure=["Cosine"],
            topk=topk,
            target_data=target_table,
            reference_data=reference_table,
        )

        elapsed_time = timeit.default_timer() - start
        print(f"Vector distance calculation time: {elapsed_time:.4f} seconds")

        return VectorDistance_out.result

    except Exception as e:
        print(f"Error calculating vector distance: {e}")
        raise

In [None]:
# Get embedding column names (exclude id and text columns)
emb_column_names = DataFrame("question_to_ask_embeddings").columns[2:]

# Define number of top matches to retrieve
number_of_recommendations = 10

# Calculate vector distances between questions and document chunks
vector_distance_df = calculate_vector_distance(
    target_table=tdf_question_embeddings_store,
    reference_table=tdf_embeddings_store,
    emb_column_names=emb_column_names,
    topk=number_of_recommendations,
)

In [None]:
def get_final_matching_chunks(
    vector_distance_df, tdf_embeddings_store, tdf_question_embeddings_store
):
    """
    Join vector distances with original text and metadata.

    Args:
        vector_distance_df: DataFrame with vector distance results
        tdf_embeddings_store: DataFrame with document embeddings
        tdf_question_embeddings_store: DataFrame with question embeddings

    Returns:
        DataFrame with matched chunks including text, metadata, and distances
    """
    try:
        # Select relevant columns from embeddings table
        embeddings_df_selected_columns = tdf_embeddings_store.select(
            ["id", "txt", "file_name"]
        )

        # Join vector-distance results with document text
        vec_prod_join_result = vector_distance_df.merge(
            right=embeddings_df_selected_columns,
            left_on="reference_id",
            right_on="id",
            lsuffix="t1",
            rsuffix="t2",
        )

        # Select columns for final output
        vec_prod_join_result_selected = vec_prod_join_result[
            ["id", "txt", "file_name", "target_id", "distancetype", "distance"]
        ]

        # Get question text
        df_que_selected = tdf_question_embeddings_store.select(["query_id", "txt"])

        # Join with question text to get complete results
        df_matched_chunks = df_que_selected.merge(
            right=vec_prod_join_result_selected,
            left_on="query_id",
            right_on="target_id",
            how="inner",
            lsuffix="que",
            rsuffix="matched",
        )

        # Drop near-zero distances (exact or near-duplicate matches)
        df_matched_chunks = df_matched_chunks[df_matched_chunks.distance > 0.001]

        # Sort by query_id and distance (ascending = most similar first)
        df_matched_chunks = df_matched_chunks.sort(
            ["query_id", "distance"], ascending=True
        )

        return df_matched_chunks[
            ["query_id", "txt_que", "txt_matched", "id", "file_name", "distance"]
        ]

    except Exception as e:
        print(f"Error processing matching chunks: {e}")
        raise

In [None]:
# Get final matching chunks with text and metadata
tdf_matching_chunks = get_final_matching_chunks(
    vector_distance_df, tdf_embeddings_store, tdf_question_embeddings_store
)

# Copy results to SQL for improved performance
copy_to_sql(tdf_matching_chunks, table_name="df_matching_chunks", if_exists="replace")

# Reload from database
tdf_matching_chunks = DataFrame("df_matching_chunks")
tdf_matching_chunks

In [None]:
# Load document data for reference
tdf_docs = DataFrame("docs_data")


def get_similarity_search_context(target_id):
    """
    Retrieve context, file names, and reference IDs for a given query.

    Args:
        target_id: Query ID to retrieve context for

    Returns:
        Tuple containing:
        - context: Concatenated text from matching chunks
        - file_names: List of unique source file names
        - ref_ids: List of reference document IDs
    """
    try:
        # Get matching chunks for the target query
        tmp1 = tdf_matching_chunks.loc[tdf_matching_chunks["query_id"] == target_id][
            ["txt_matched", "query_id", "file_name", "id"]
        ]

        # Collect the matched chunk IDs, then the unique file names they come from
        ref_ids = tmp1[["id"]].get_values().flatten()
        file_names = list(
            set(
                tdf_docs[tdf_docs.id.isin(list(ref_ids))][["file_name"]]
                .get_values()
                .flatten()
            )
        )

        # Concatenate matched text as context
        context = "\n".join(tmp1[["txt_matched"]].get_values().flatten())

        return context, file_names, ref_ids

    except Exception as e:
        print(f"Error retrieving similarity search context: {e}")
        return "", [], []

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>10. Configuring LiteLLM Proxy Environment</b>

<p style="font-size:16px;font-family:Arial;color:#00233C">
The following cell configures the LiteLLM environment for this demo by enabling debug logging, loading credentials from a secure environment file, and routing all LLM requests through a centralized LiteLLM proxy.
</p>

<ol style="font-size:16px;font-family:Arial;color:#00233C">
  <li>
    <b>LITELLM_LOG</b>: Enables detailed debug-level logging for request tracing and troubleshooting
  </li>
  <li>
    <b>LITELLM_API_KEY</b>: Loads the LiteLLM API key securely from the environment file
  </li>
  <li>
    <b>LITELLM_BASE_URL</b>: Defines the LiteLLM proxy endpoint through which all model requests are routed
  </li>
</ol>
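<p style="font-size:16px;font-family:Arial;color:#00233C">
A minimal sketch of reading and validating such settings before using them. The helper <code>require_env</code> and the example values are hypothetical; in the notebook the values come from the environment file.
</p>

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise ValueError(f"Required setting '{name}' was not found in the environment")
    return value


# Hypothetical values, set here only so the sketch runs on its own.
os.environ["litellm_key"] = "sk-example"
os.environ["litellm_base_url"] = "https://litellm.example.invalid"

# Validate first, then publish the settings under the names LiteLLM reads.
litellm_settings = {
    "LITELLM_API_KEY": require_env("litellm_key"),
    "LITELLM_BASE_URL": require_env("litellm_base_url"),
}
os.environ.update(litellm_settings)
print(sorted(litellm_settings))
```

<p style="font-size:16px;font-family:Arial;color:#00233C">
Validating before assignment matters because <code>os.environ</code> only accepts string values, so assigning a missing (<code>None</code>) value fails with a less helpful error.
</p>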

In [None]:
# Configure LiteLLM environment
os.environ["LITELLM_LOG"] = "DEBUG"

try:
    # Load environment variables from config file
    load_dotenv("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env")

    # Read the LiteLLM API key and proxy base URL from the loaded environment
    litellm_key = os.getenv("litellm_key")
    base_url = os.getenv("litellm_base_url")

    # Validate before assigning: os.environ rejects None values
    if not litellm_key or not base_url:
        raise ValueError("LiteLLM API key or base URL not found in environment")

    os.environ["LITELLM_API_KEY"] = litellm_key
    os.environ["LITELLM_BASE_URL"] = base_url

    print("LiteLLM environment configured successfully")

except Exception as e:
    print(f"Error configuring LiteLLM environment: {e}")
    raise

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>9.1 Connect to databases using SQLAlchemy</b></p>

<p style='font-size:16px;font-family:Arial'>Under the hood, we use SQLAlchemy to connect to SQL databases. This means that the SQLDatabaseChain can be used with any SQL dialect supported by SQLAlchemy, such as Teradata Vantage, MS SQL, MySQL, MariaDB, PostgreSQL, Oracle SQL, and SQLite. For more information about the requirements for connecting to our database, we recommend referring to the <a href="https://docs.sqlalchemy.org/en/20/">SQLAlchemy documentation</a>.</p>

<p style='font-size:16px;font-family:Arial'>Important: The code below establishes a database connection for our data sources and Large Language Models. Please note that the solution will only work if we define the database connection for our sources in the cell below.</p>

<p style='font-size:16px;font-family:Arial'>We build a consolidated view of the Table Data Catalog by combining metadata stored for the database and table.</p>

In [None]:
try:
    # Wrap the existing SQLAlchemy engine for Vantage in a LangChain SQLDatabase
    database = "DEMO_ComplaintAnalysis_db"
    db = SQLDatabase(
        eng,
        schema=database,
        include_tables=["Customer_360_Details"],
    )

    print(f"Database dialect: {db.dialect}")
    print(f"Available tables: {db.get_usable_table_names()}")

except Exception as e:
    print(f"Error creating SQL database connection: {e}")
    raise

In [None]:
# Define table schema mapping
main_d = {
    "Customer_360_Details": complaints_data.columns,
}


def get_db_schema():
    """
    Generate database schema string for LLM prompts.

    Returns:
        Formatted string containing table names and column names
    """
    table_dicts = []
    for k in main_d:
        table_dicts.append(
            {
                "table_name": k,
                "column_names": main_d[k],
            }
        )

    database_schema_string = "\n".join(
        [
            f"Table: {table['table_name']}\nColumns: {', '.join(table['column_names'])}"
            for table in table_dicts
        ]
    )

    return database_schema_string

In [None]:
# Generate and display database schema
database_schema = get_db_schema()
print(database_schema)

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>9.2 Define LLM model</b></p>  


<p style="font-size: 16px; font-family: Arial;">
    Define the LLM using the <code>ChatOpenAI</code> interface, which sends requests to the LiteLLM proxy configured above. When defining <code>ChatOpenAI</code>, set the proxy base URL, the LiteLLM API key, the model name exposed by the proxy (here based on an <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html#model-ids-arns">Amazon Bedrock base model ID</a>), and the common inference parameters.
</p> 
<p style="font-size: 16px; font-family: Arial;">
    We use the optional parameter <b>temperature</b> to make our Teradata SQL outputs more predictable.
</p>

<div style="margin-left: 16px; font-size: 16px; font-family: Arial;">
    <b>- Temperature:</b> controls how creative the results will be. The allowed range depends on the model (commonly 0.0 to 1.0 or 0.0 to 2.0). Setting it to 0.1 makes the model favor higher-probability (more predictable) words, resulting in more consistent and less varied outputs.<br>
</div>
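<p style="font-size: 16px; font-family: Arial;">
    As an illustration of why a low temperature gives more predictable output, the sketch below applies a temperature-scaled softmax to a few made-up token scores. This is the textbook formulation, shown under the assumption that the model samples in a comparable way; the scores and the function name are hypothetical.
</p>

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

low_temp = softmax_with_temperature(logits, temperature=0.1)
high_temp = softmax_with_temperature(logits, temperature=1.0)

print("T=0.1:", [round(p, 4) for p in low_temp])
print("T=1.0:", [round(p, 4) for p in high_temp])
```

<p style="font-size: 16px; font-family: Arial;">
    At the lower temperature almost all of the probability mass moves to the highest-scoring token, so repeated runs tend to produce the same SQL.
</p>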

<p style="font-size: 16px; font-family: Arial;">
    For a complete list of optional parameters for base models provided by Amazon Bedrock, visit the <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters.html">AWS docs</a>.
</p>

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_core.messages import HumanMessage, SystemMessage

try:
    # Initialize ChatOpenAI LLM with LiteLLM proxy
    llm = ChatOpenAI(
        openai_api_base=base_url,  # LiteLLM Proxy URL
        model="AWS-Bedrock-anthropic.claude-opus-4-1-20250805-v1:0",
        temperature=0.1,  # Low temperature for more deterministic SQL generation
        api_key=litellm_key,
    )
    print("LLM initialized successfully")

except Exception as e:
    print(f"Error initializing LLM: {e}")
    raise

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>9.3 Define SQL Agent</b></p>  

<p style="font-size: 16px; font-family: Arial;">
    With the connection to Teradata Vantage established and our database (<code>db</code>) and Large Language Model (<code>llm</code>) defined, we are ready to create and invoke our SQL Agent using the <code>create_sql_agent()</code> function.
    </p>
<p style="font-size: 16px; font-family: Arial;">
    We pass in our <code>llm</code> and <code>db</code> as required parameters and set <code>agent_type</code> to "zero-shot-react-description" to instruct the agent to perform a reasoning step before acting.  
    </p>
<p style="font-size: 16px; font-family: Arial;">
    We set <code>verbose</code> to <code>True</code> so that the agent prints detailed information about its intermediate steps. Additionally, we set <code>handle_parsing_errors</code> to <code>True</code>, so that parsing errors are sent back to the LLM as observations and the LLM can attempt to correct them.
    </p>
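<p style="font-size: 16px; font-family: Arial;">
    The idea of returning a parsing error as an observation can be sketched as follows. This is not LangChain's parser; the function <code>parse_agent_output</code>, its regular expression, and its return format are hypothetical and only illustrate the behavior described above.
</p>

```python
import re

# Matches an "Action: <tool>" line followed by an "Action Input: <input>" line.
ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def parse_agent_output(text):
    """Classify model output as a final answer, a tool call, or a format error."""
    if "Final Answer:" in text:
        return "final", text.split("Final Answer:", 1)[1].strip()
    match = ACTION_PATTERN.search(text)
    if match:
        return "action", (match.group(1).strip(), match.group(2).strip())
    # Instead of raising, hand the problem back as an observation for the next step.
    return "error", "Invalid format: expected 'Action:' and 'Action Input:' or 'Final Answer:'"


print(parse_agent_output("Thought: done\nFinal Answer: 42"))
print(parse_agent_output("Thought: query\nAction: sql_db_query\nAction Input: SELECT TOP 3 * FROM t"))
print(parse_agent_output("I think the answer is 42"))
```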
    
<p style="font-size: 16px; font-family: Arial;">
 We can optimize the agent's performance with additional prompt engineering.
</p>

<p style="font-size: 16px; font-family: Arial;">
We use the <code>ChatPromptTemplate</code> class to build flexible, reusable prompts for our agent. Here we define a prefix, format instructions, and a suffix, and join them to create a custom prompt. The prefix contains rules specific to Teradata, the format instructions guide the agent's Question, Thought, Action, Observation behavior, and the suffix cues it to begin.
</p>
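<p style="font-size: 16px; font-family: Arial;">
The joining and placeholder filling can be sketched with plain Python string formatting. This is a simplified stand-in for the template class, with shortened, hypothetical prompt parts; it also shows that doubled braces are kept as literal braces while single braces are treated as placeholders.
</p>

```python
# Hypothetical, shortened prompt parts; the notebook's real prefix is much longer.
prefix = (
    "You are a Teradata SQL assistant. Limit results with SELECT TOP {top_k}.\n"
    "Reply as {{'sql': <sql here>}} when the question can be answered."
)
format_instructions = "Use one of these tools: [{tool_names}]"
suffix = "Begin!\n\nQuestion: {input}"

# Join the parts with blank lines between them, as the notebook does.
template = "\n\n".join([prefix, format_instructions, suffix])

# Single braces are placeholders; doubled braces survive as literal braces.
prompt_text = template.format(
    top_k=3,
    tool_names="sql_db_query, sql_db_schema",
    input="How many customers are there?",
)
print(prompt_text)
```

<p style="font-size: 16px; font-family: Arial;">
Any literal brace in the prompt, such as the JSON-like response format, therefore has to be doubled, otherwise it would be read as a missing placeholder.
</p>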

In [None]:
# Define custom prompt template for Teradata SQL Agent
prefix = (
    """You are a helpful and expert TeradataSQL database admin. TeradataSQL shares many similarities to SQL, with a few key differences.
Given an input question, first create a syntactically correct TeradataSQL query to run, then look at the results of the query and return the answer.
Given an input question, create a syntactically correct {dialect} query to run,
then look at the results of the query and return the answer. Unless the user
specifies a specific number of examples they wish to obtain, always limit your
query to at most {top_k} results.

IMPORTANT: Unless the user specifies an exact number of rows they wish to obtain, you must always limit your query to at most {top_k} results by using "SELECT TOP {top_k}".

The following keywords do not exist in TeradataSQL: 
1. LIMIT 
2. FETCH
3. FIRST
Instead of LIMIT or FETCH, use the TOP keyword. The TOP keyword should immediately follow a "SELECT" statement.
For example, to select the top 3 results, use "SELECT TOP 3 * FROM <table_name>"
Enclose all value identifiers in quotes to prevent errors from restricted keywords. Append an underscore to all alias keywords (e.g., AS count_).
Always use double quotation marks (" ") for column names in SQL queries to avoid syntax errors.
Do NOT make any DML statements (INSERT, UPDATE, DELETE, DROP, etc.) to the database. 
If the question does not seem related to the database, just return "I don't know" as the answer

IMPORTANT: Use default database as 'DEMO_ComplaintAnalysis_db'

IMPORTANT: Use the following Tables: \n
"""
    + database_schema
    + """

Few examples of Question-SQL Pairs:
Question: What is the Total credit balance of customer 456789123?
SQL: SELECT TOP 1 "Total Credit Balance" FROM DEMO_ComplaintAnalysis_db.Customer_360_Details WHERE "Customer Identifier" = '456789123'

IMPORTANT: Here are some tips for writing Teradata style queries:

Always use table aliases when your SQL statement involves more than one source
Aggregated fields like COUNT(*) must be appropriately named 
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most 3 results by using SELECT TOP 3, note that LIMIT function does not work in Teradata DB.
[Best] If the question can be answered with the available tables: {{'sql': <sql here>}} 
If the question cannot be answered with the available tables: {{'error': <explanation here>}} 
Remove unnecessary ORDER BY clauses unless required. 
Remember: Do not use 'LIMIT' or 'FETCH' keyword in the SQLQuery, instead use TOP keyword. For Example: To select top 3 results, use TOP keyword instead of LIMIT or FETCH.  
\nResponse Guidelines: 
If the provided context is insufficient, please explain why it can't be generated. 
Critical Instruction: Ensure responses are exclusively derived from query results. Refrain from generating or adding synthetic data in any form.
Most important: The function should return the relevant answer for the question asked only based on Query results.
Given a user's question about this data, write a valid Teradata SQL query that accurately extracts or calculates the requested information from these tables 
and adheres to SQL best practices for Teradata database, optimizing for readability and performance where applicable. Do not make up an answer.

You have access to the following tools:"""
)

format_instructions = """You must always use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Don't forget to prefix your final answer with the string, "Final Answer:"!"""

suffix = """Begin!

Question: {input}
Thought:{agent_scratchpad}"""

# Create custom prompt template
custom_prompt = ChatPromptTemplate.from_template(
    "\n\n".join(
        [
            prefix,
            "{tools}",
            format_instructions,
            suffix,
        ]
    )
)

try:
    # Create SQL Agent with custom prompt
    agent = create_sql_agent(
        llm=llm,
        db=db,
        agent_type="zero-shot-react-description",
        verbose=True,
        handle_parsing_errors=True,
        prompt=custom_prompt,
        max_iterations=10,
    )
    print("SQL Agent created successfully")

except Exception as e:
    print(f"Error creating SQL agent: {e}")
    raise

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>9.4 Setup Hybrid RAG</b></p>  
<p style="font-size: 16px; font-family: Arial;">We have source data stored in both VectorDB and Vantage Database. Our hybrid RAG system is designed to automatically identify the appropriate source and query it accordingly. In some cases, responses to certain questions may be derived from both sources.</p>
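<p style="font-size: 16px; font-family: Arial;">The routing idea can be sketched without any LLM. In the sketch below the classifier reply is passed in as a string, and the two handlers are hypothetical stubs standing in for the SQL agent and the vector-store lookup; the names <code>normalize_label</code> and <code>route</code> are illustrative only.</p>

```python
VALID_LABELS = {"SQL", "VECTOR", "BOTH"}


def normalize_label(raw_label):
    """Clean a classifier reply and map anything unexpected to UNKNOWN."""
    cleaned = raw_label.strip().upper().rstrip(".")
    return cleaned if cleaned in VALID_LABELS else "UNKNOWN"


def route(question, raw_label, sql_handler, vector_handler):
    """Call the handler(s) that match the label and collect their answers."""
    label = normalize_label(raw_label)
    answers = []
    if label in ("SQL", "BOTH"):
        answers.append(sql_handler(question))
    if label in ("VECTOR", "BOTH"):
        answers.append(vector_handler(question))
    return {"source": label, "answers": answers}


# Stub handlers (hypothetical) standing in for the SQL agent and the vector lookup.
def fake_sql_handler(question):
    return f"[table answer] {question}"


def fake_vector_handler(question):
    return f"[document answer] {question}"


routed = route(
    "What is the bank tenure of customer 789456123?",
    " sql.\n",  # a reply with stray whitespace and punctuation
    fake_sql_handler,
    fake_vector_handler,
)
print(routed)
```

<p style="font-size: 16px; font-family: Arial;">Normalizing the label before comparing it makes the routing tolerant of replies that carry extra whitespace or a trailing period.</p>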

In [None]:
def _get_classifier(query):
    """
    Create a chain to classify the query type.

    Args:
        query: User's question to classify

    Returns:
        LangChain chain for query classification
    """
    query_classifier_prompt = PromptTemplate(
        input_variables=["query"],
        template="""Classify if the following query requires:
            1. SQL database (if it's asking about structured data like Customer_360_Details, Customer Identifier, Name, City, State, Customer Type, Product Holdings, Total Deposit Balance, Total Credit Balance, Total Investments AUM, Customer Profitability,
             Customer Lifetime Value, Bank Tenure, Affluence Segment, Digital Banking Segment, Branch Banking Segment.)
            2. Vector database (if it's asking about document content, general knowledge, customer complaints)
            3. Both (if it needs to combine information from structured data and documents)
            
            Query: {query}
            
            Return only one word: SQL, VECTOR, or BOTH
            """,
    )
    chain = query_classifier_prompt | llm | StrOutputParser()
    return chain


def _query_vector_store(query_id, query):
    """
    Query the vector store and return relevant content.

    Args:
        query_id: Unique identifier for the query
        query: User's question text

    Returns:
        Dictionary containing response, references, and reference IDs
    """
    try:
        # Get similar document chunks from vector store
        context, file_names, ref_ids = get_similarity_search_context(query_id)

        # Create prompt for response generation
        response_prompt = PromptTemplate(
            input_variables=["context", "query"],
            template="""Using the following context, answer the question:

            Context: {context}

            Question: {query}

            Answer:""",
        )

        # Generate response using context
        response_chain = response_prompt | llm | StrOutputParser()
        response = response_chain.invoke({"context": context, "query": query})

        return {"response": response, "reference": file_names, "ref_ids": ref_ids}

    except Exception as e:
        print(f"Error querying vector store: {e}")
        return {"response": f"Error: {e}", "reference": [], "ref_ids": []}


def _combine_responses(sql_response, vector_response, query):
    """
    Combine responses from SQL and vector stores.

    Args:
        sql_response: Response from SQL database query
        vector_response: Response from vector store query
        query: Original user question

    Returns:
        Combined response string
    """
    try:
        combination_prompt = PromptTemplate(
            input_variables=["sql_response", "vector_response", "query"],
            template="""Combine the following information to provide a complete answer:

            SQL Database Info: {sql_response}
            Document Info: {vector_response}
            Original Question: {query}

            Combined Answer:""",
        )

        combination_chain = combination_prompt | llm | StrOutputParser()
        combined_response = combination_chain.invoke(
            {
                "sql_response": sql_response,
                "vector_response": vector_response,
                "query": query,
            }
        )
        return combined_response

    except Exception as e:
        print(f"Error combining responses: {e}")
        return f"Error combining responses: {e}"


def process_query(query_id):
    """
    Process user query and return appropriate response using hybrid RAG.

    Args:
        query_id: Query identifier or custom query string

    Returns:
        Dictionary containing response, references, and reference IDs
    """
    try:
        if not isinstance(query_id, str):
            # Get query text from query_id
            query = tdf_question_embeddings_store[
                tdf_question_embeddings_store["query_id"] == query_id
            ][["txt"]].get_values()[0][0]

            # Classify query type
            query_type = _get_classifier(query).invoke(query).strip().upper()

            if query_type == "SQL":
                # Query SQL database only
                return {
                    "response": agent.invoke(query)["output"],
                    "reference": ["Customer_360_Details"],
                    "ref_ids": [],
                }

            elif query_type == "VECTOR":
                # Query vector store only
                return _query_vector_store(query_id, query)

            elif query_type == "BOTH":
                # Query both sources and combine
                sql_response = agent.invoke(query)["output"]
                vector_response = _query_vector_store(query_id, query)
                return {
                    "response": _combine_responses(
                        sql_response, vector_response["response"], query
                    ),
                    "reference": vector_response["reference"],
                    "ref_ids": vector_response["ref_ids"],
                }

            else:
                return {
                    "response": "Unable to classify query type. Please rephrase your question.",
                    "reference": [],
                    "ref_ids": [],
                }

    except Exception as e:
        print(f"Error processing query: {e}")
        return {
            "response": f"Error processing query: {e}",
            "reference": [],
            "ref_ids": [],
        }
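
<p style = 'font-size:16px;font-family:Arial'>The routing step in <code>process_query</code> can also be expressed as a lookup table from classifier label to handler. The following is a minimal sketch under stated assumptions: <code>answer_from_sql</code>, <code>answer_from_vector</code>, <code>answer_from_both</code>, and <code>route_query</code> are hypothetical stand-ins for illustration and do not call the notebook's agent or vector store.</p>

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the SQL agent and the vector store lookup.
def answer_from_sql(query: str) -> dict:
    return {
        "response": f"[sql] {query}",
        "reference": ["Customer_360_Details"],
        "ref_ids": [],
    }


def answer_from_vector(query: str) -> dict:
    return {"response": f"[vector] {query}", "reference": ["doc.pdf"], "ref_ids": [1, 2]}


def answer_from_both(query: str) -> dict:
    # Query both sources and merge the two answers into one dictionary.
    sql = answer_from_sql(query)
    vec = answer_from_vector(query)
    return {
        "response": f"{sql['response']} | {vec['response']}",
        "reference": sql["reference"] + vec["reference"],
        "ref_ids": vec["ref_ids"],
    }


# Map each classifier label to the handler that answers it.
ROUTES: Dict[str, Callable[[str], dict]] = {
    "SQL": answer_from_sql,
    "VECTOR": answer_from_vector,
    "BOTH": answer_from_both,
}


def fallback() -> dict:
    # Return a fresh dictionary so callers never share mutable lists.
    return {
        "response": "Unable to classify query type. Please rephrase your question.",
        "reference": [],
        "ref_ids": [],
    }


def route_query(query: str, label: str) -> dict:
    # Normalize the label the same way process_query does before dispatching.
    handler = ROUTES.get(label.strip().upper())
    return handler(query) if handler else fallback()
```

<p style = 'font-size:16px;font-family:Arial'>With a table like this, supporting a new query type means adding one entry rather than another <code>elif</code> branch.</p>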

<hr style="height:2px;border:none;">
<a id="rule"></a>
<p style = 'font-size:20px;font-family:Arial'><b>10. Test and Compare Results</b></p>
<p style = 'font-size:16px;font-family:Arial'>To test and compare our results, let's invoke the agent by selecting a question from the dropdown.</p>

In [None]:
def response_template(response):
    """
    Generate HTML template for displaying query response.

    Args:
        response: Dictionary containing response text and references

    Returns:
        HTML formatted string for display
    """
    view = """<p style='font-size:18px;font-family:Arial;'><b>Here is your response:</b></p>"""
    view += f"""<ul style='font-size:16px;font-family:Arial;'>
    <li><strong>{response['response']}</strong><ul>
    <li>References: """

    for ref in response["reference"]:
        view += f"""<ul style='font-size:16px;font-family:Arial;'><li>{ref}</li></ul>"""

    view += """</ul></ul>"""
    return view
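
<p style = 'font-size:16px;font-family:Arial'>Because the template interpolates model output directly into HTML, characters such as <code>&lt;</code> or <code>&amp;</code> in an answer could break the rendered markup. The following is a minimal sketch of an escaped variant using the standard library's <code>html.escape</code>. The function name <code>render_response</code> and the sample data are assumptions for illustration, not part of the demo.</p>

```python
from html import escape


def render_response(response: dict) -> str:
    """Build the response HTML with all dynamic text escaped."""
    # Escape each reference before placing it inside a list item.
    items = "".join(f"<li>{escape(str(ref))}</li>" for ref in response["reference"])
    return (
        "<p style='font-size:18px;font-family:Arial;'><b>Here is your response:</b></p>"
        "<ul style='font-size:16px;font-family:Arial;'>"
        f"<li><strong>{escape(str(response['response']))}</strong></li>"
        f"<li>References:<ul>{items}</ul></li>"
        "</ul>"
    )


# Hypothetical sample containing characters that need escaping.
sample = {"response": "Balance < 100 & overdue", "reference": ["policy <v2>.pdf"]}
html_view = render_response(sample)
```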

In [None]:
def plot_vectors_2d(question_vector, matching_chunks, doc_chunks, loading_spinner2):
    """
    Create 2D visualization of question and document embeddings using PCA.

    Args:
        question_vector: Question embedding vector
        matching_chunks: Embeddings of matching document chunks
        doc_chunks: Embeddings of all document chunks
        loading_spinner2: Loading spinner widget to hide after plotting
    """
    try:
        # Combine all vectors for PCA transformation
        all_vectors = np.vstack(
            [question_vector.reshape(1, -1), matching_chunks, doc_chunks]
        )

        # Reduce dimensionality to 2D using PCA
        pca = PCA(n_components=2)
        vectors_2d = pca.fit_transform(all_vectors)

        # Split back into separate arrays
        question_2d = vectors_2d[0]
        matching_2d = vectors_2d[1 : len(matching_chunks) + 1]
        docs_2d = vectors_2d[len(matching_chunks) + 1 :]

        # Create plot
        plt.figure(figsize=(10, 8))

        # Plot all document chunks in background
        plt.scatter(
            docs_2d[:, 0],
            docs_2d[:, 1],
            c="gray",
            alpha=0.5,
            label="All document embeddings",
        )

        # Plot matching chunks
        plt.scatter(
            matching_2d[:, 0],
            matching_2d[:, 1],
            c="green",
            marker="s",
            s=100,
            label="Matching embeddings",
        )

        # Plot question vector
        plt.scatter(
            question_2d[0],
            question_2d[1],
            c="red",
            marker="*",
            s=200,
            label="Question embeddings",
        )

        # Add labels and legend
        plt.title("2D Visualization of Question with Matched Documents")
        plt.xlabel("Principal Component 1")
        plt.ylabel("Principal Component 2")
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Add arrows from question to matching chunks
        for match in matching_2d:
            plt.arrow(
                question_2d[0],
                question_2d[1],
                match[0] - question_2d[0],
                match[1] - question_2d[1],
                color="blue",
                alpha=0.3,
                head_width=0.001,
                head_length=0.001,
            )

        plt.tight_layout()
        plt.show()

    except Exception as e:
        print(f"Error creating 2D visualization: {e}")
    finally:
        loading_spinner2.value = ""

In [None]:
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

# Setup loading spinner for initial interface preparation
loading_spinner3 = widgets.HTML(
    value="<i class='fas fa-cog fa-spin' style='font-size:24px'></i> Please wait while we prepare the question-answer interface and get the answer for the first question...",
)

# Load question embeddings data
pdf_question_embeddings_store = tdf_question_embeddings_store.to_pandas().reset_index()

display(loading_spinner3)

# Create dropdown options from questions
op = list(
    zip(
        pdf_question_embeddings_store[["txt"]].values.flatten(),
        pdf_question_embeddings_store[["query_id"]].values.flatten(),
    )
)

# Create dropdown widget for question selection
prod_dw = Dropdown(
    options=op,
    description="Please select the query:",
    style={"description_width": "initial"},
    display="flex",
    flex_flow="column",
    align_items="stretch",
    layout=widgets.Layout(width="50%", height="50px"),
    value=101,
)

# Load all document chunks for 2D visualization
doc_chunks = tdf_embeddings_store.loc[:, "emb_0":"emb_767"].get_values()


@interact(query_id=prod_dw)
def print_product(query_id):
    """
    Process selected query and display results with visualization.

    Args:
        query_id: Selected query ID from dropdown
    """
    # Create loading spinner for query processing
    loading_spinner = widgets.HTML(
        value="<i class='fas fa-cog fa-spin' style='font-size:24px'></i> Thinking...",
    )

    # Create loading spinner for plot generation
    loading_spinner2 = widgets.HTML(
        value="<i class='fas fa-cog fa-spin' style='font-size:24px'></i> Drawing...",
    )

    if query_id != "":
        display(loading_spinner)

    try:
        # Process the query
        response = process_query(query_id)

        if response is not None:
            loading_spinner.value = ""
            display(Markdown(response_template(response)))

            # Generate 2D visualization if document references exist
            if len(response["ref_ids"]) > 0:
                print("Drawing a 2D plot, please wait.")
                display(loading_spinner2)

                # Filter matching chunks for the query
                matching_chunks = (
                    tdf_embeddings_store.loc[
                        tdf_embeddings_store["id"].isin(list(response["ref_ids"]))
                    ]
                    .iloc[:, 3:]
                    .get_values()
                )

                # Get question embedding
                question_vector = (
                    tdf_question_embeddings_store.loc[
                        tdf_question_embeddings_store["query_id"] == query_id
                    ]
                    .iloc[:, 2:]
                    .get_values()
                )

                # Generate 2D plot
                plot_vectors_2d(
                    question_vector, matching_chunks, doc_chunks, loading_spinner2
                )

    except Exception as e:
        loading_spinner.value = ""
        loading_spinner2.value = ""
        print(f"Error processing query: {e}")
    finally:
        loading_spinner3.value = ""
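
<p style = 'font-size:16px;font-family:Arial'>The dropdown above is initialized with a fixed default of <code>value=101</code>, which is expected to fail if no question with that ID exists in the table. The following is a minimal sketch of choosing the default defensively from the <code>(label, value)</code> option pairs. The helper <code>pick_default</code> and the sample questions are hypothetical.</p>

```python
def pick_default(options, preferred):
    """Return `preferred` if it is among the option values, else the first value."""
    values = [value for _label, value in options]
    if not values:
        raise ValueError("options must contain at least one (label, value) pair")
    return preferred if preferred in values else values[0]


# Hypothetical (label, value) pairs in the same shape as the `op` list above.
options = [("How do I file a claim?", 7), ("What is my balance?", 101)]

default_present = pick_default(options, 101)  # preferred ID exists
default_missing = pick_default(options, 999)  # falls back to the first option
```

<p style = 'font-size:16px;font-family:Arial'>In the notebook, the result could be passed as the <code>value</code> argument of the dropdown in place of the hard-coded ID.</p>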

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>11. Integrated Data with Customer 360</b>
<p style = 'font-size:16px;font-family:Arial'>The following is an example of LLM output integrated with the existing Customer 360 data. Please scroll to the right to see all the columns.</p>

In [None]:
# Create augmented Customer 360 data with chatbot interaction metrics
customer360_augmented = complaints_data.to_pandas().reset_index()

# Add simulated chatbot interaction columns (random values, so they change on every run)
customer360_augmented["num_of_question_asked"] = np.random.randint(
    8, 40, size=len(customer360_augmented)
)
customer360_augmented["types of question"] = np.random.choice(
    ["insurance", "personal banking"], size=len(customer360_augmented)
)

# Add an illustrative bank strategy per customer.
# Note: this list has five entries, so the assignment only succeeds
# when complaints_data contains exactly five rows.
customer360_augmented["bank strategy"] = [
    "Insurance Manager to contact customer immediately",
    "Send Policy Letter from Insurance Servicing",
    "Insurance Manager to follow-up with Title Company for documentation and contact customer",
    "Insurance Manager to contact customer immediately",
    "All answered by chatbot, no action required",
]

In [None]:
# Display full column content without truncation
pd.set_option("display.max_colwidth", None)
customer360_augmented

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>12. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial'><b>12.1 Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors on the next run.</p>

In [None]:
# Clean up work tables created during the demo
tables_to_drop = [
    "question_to_ask",
    "question_to_ask_embeddings",
    "df_matching_chunks",
    "docs_data",
    "pdf_embeddings_store",
]

for table in tables_to_drop:
    try:
        db_drop_table(table_name=table, schema_name="demo_user")
        print(f"Dropped table: {table}")
    except Exception as e:
        print(f"Could not drop table {table}: {e}")

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'> <b>12.2 Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>We will use the following code to clean up the tables and databases created for this demonstration.</p>

In [None]:
# Remove demo data and databases (takes approximately 5 seconds)
%run -i ../run_procedure.py "call remove_data('DEMO_ComplaintAnalysis');"

In [None]:
# Close database connection and remove context
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023-2026. All Rights Reserved
        </div>
    </div>
</footer>