<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Teradata &mdash; Multimodal Agentic Semantic Search with Enterprise Vector Store and Unstructured.io
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style='font-size:20px;font-family:Arial;'><b>Multimodal Semantic Search with Teradata Enterprise Vector Store and Unstructured.io</b></p>
<p style='font-size:16px;font-family:Arial;'>
This notebook demonstrates how to build a semantic search pipeline over unstructured data (composite PDF documents and raw images) with Teradata VantageCloud. Rather than moving data to an external vector database, embeddings are stored, indexed, and queried in-database using the <b>Teradata Enterprise Vector Store</b>, which keeps retrieval next to the data and helps with data-residency requirements.
</p>

<ul style='font-size:16px;font-family:Arial;'>
    <li><strong>Unstructured.io Ingestion:</strong> Parses and chunks composite PDFs and images via the Unstructured API, then stores the resulting text chunks and embeddings directly in Vantage tables.</li>
    <li><strong>Enterprise Vector Store:</strong> Uses <code>TeradataVectorStore</code> from the <code>langchain-teradata</code> library to create and manage an in-database index over the stored embeddings.</li>
    <li><strong>Similarity Search:</strong> Queries the vector store using image embeddings to find semantically similar documents, then uses an LLM to summarize and format the matched results.</li>
    <li><strong>LangChain Agent:</strong> Wires the similarity search and a PDF renderer into a conversational agent, enabling a natural-language interface for exploring the document and image library.</li>
</ul>

<p style='font-size:18px;font-family:Arial;'><b>Why Vantage?</b></p>
<p style='font-size:16px;font-family:Arial;'>
Storing embeddings inside Vantage eliminates the data movement, latency, and operational complexity of maintaining a separate vector database. Teradata's Massively Parallel Processing architecture scales similarity search across billions of vectors while keeping the data collocated with the rest of your enterprise data warehouse.
</p>

<hr style='height:2px;border:none;background-color:#00233c;'>
<b style='font-size:20px;font-family:Arial;'>1. Configure the Environment</b>

<p style='font-size:16px;font-family:Arial;'>Before running this notebook, install the required libraries. This demo depends on <b>teradatagenai</b> (which provides the <code>VSManager</code> class) and <b>langchain-teradata</b> (the LangChain integration layer, from which <code>TeradataVectorStore</code> is imported). The cell also installs <b>anywidget</b>. The install runs quietly; if any of the packages were not already present, restart the kernel before proceeding.</p>

In [None]:
%%capture
!pip install teradatagenai langchain-teradata anywidget --quiet

<div class="alert alert-block alert-info">
    <p style='font-size:16px;font-family:Arial'><i><b>Note:</b> If the cell above installed any modules, restart the kernel after it finishes so that the newly installed libraries are loaded into memory. The simplest way to restart the kernel is to press zero twice: <b>0 0</b></i></p>
</div>
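
<p style='font-size:16px;font-family:Arial;'>As a quick way to tell whether a kernel restart is needed, the sketch below checks which modules the current kernel can locate. It uses only the standard library; the helper name <code>missing_modules</code> is hypothetical, and the list uses import names (for example <code>langchain_teradata</code>), which are assumed to match the packages installed above.</p>

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that this kernel cannot locate."""
    return [name for name in module_names if find_spec(name) is None]


# Import names (not pip distribution names) for the packages installed above.
required = ["teradatagenai", "langchain_teradata", "anywidget"]

missing = missing_modules(required)
if missing:
    print("Not importable yet (install, then restart the kernel):", missing)
else:
    print("All required modules are importable.")
```

<p style='font-size:16px;font-family:Arial;'>An empty result only shows that the modules can be found; it does not verify their versions.</p>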

<p style='font-size:16px;font-family:Arial;'>The cell below loads all required Python libraries. Key imports include <b>VSManager</b> for vector store lifecycle management, <b>TeradataVectorStore</b> from LangChain Teradata for index creation and querying, <b>teradataml</b> for in-database DataFrame operations, and the LangChain agent and tool utilities used to build the conversational interface later in the notebook.</p>

In [None]:
# Required imports

# General imports
from teradatagenai import VSManager
from langchain_teradata import TeradataVectorStore
from teradataml import *
import os
import json
import re

# Credentials and configuration management
from dotenv import load_dotenv, dotenv_values
from getpass import getpass


#Langchain Imports
from langchain.chat_models import init_chat_model
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage
from langchain.tools import tool

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
display.suppress_vantage_runtime_warnings = True

# Widget display
from IPython.display import display, HTML, IFrame
import ipywidgets as widgets

# Import utilities
from unstructured_utils.teradata_ingest import ingest
from utils.image_grid import display_image_grid

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>2. Connect to VantageCloud</b></p>
<p style='font-size:16px;font-family:Arial;'>Establish a connection to VantageCloud Lake using <code>create_context</code> from the teradataml library. Connection details — host, username, and password — are read automatically from the environment configuration file provisioned for your lab. The query band is also set so that all SQL generated by this session is tagged for auditability.</p>
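<p style='font-size:16px;font-family:Arial;'>Because the connection code reads each setting with <code>.get()</code>, a missing key silently becomes <code>None</code>. A small pre-flight check can surface that earlier. This is a minimal sketch: <code>missing_settings</code> is a hypothetical helper, and the example dictionary stands in for the one loaded from the lab's environment file.</p>

```python
def missing_settings(settings, required_keys):
    """Return the required keys that are absent or empty in a settings mapping."""
    return [key for key in required_keys if not settings.get(key)]


# Hypothetical values standing in for the dictionary loaded from the .env file.
example_settings = {"host": "example-host", "username": "demo_user", "my_variable": ""}

problems = missing_settings(example_settings, ["host", "username", "my_variable"])
if problems:
    print("Missing or empty settings:", problems)
```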

In [None]:
print("Checking if this environment is ready to connect to VantageCloud Lake...")

if os.path.exists("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env"):
    print("Your environment parameter file exists. Please proceed with this use case.")
    # Load all the variables from the .env file into a dictionary
    env_vars = dotenv_values("/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env")
    # Create the Context
    eng = create_context(host=env_vars.get("host"), username=env_vars.get("username"), password=env_vars.get("my_variable"))
    execute_sql('''SET query_band='DEMO=VCL_GettingStarted_VectorStore.ipynb;' UPDATE FOR SESSION;''')
    print("Connected to VantageCloud Lake with:", eng)
else:
    print("Your environment has not been prepared for connecting to VantageCloud Lake.")
    print("Please contact the support team.")

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>3. Authenticate into the User Environment Service (UES)</b></p>

<p style='font-size:16px;font-family:Arial;'><b>UES authentication</b> is required to create and manage the Open Analytics environments that power the embedding model inference layer inside Vantage. A VantageCloud Lake user can create the necessary authentication objects through the Console; these objects have already been provisioned for this lab.
</p>
<p style='font-size:16px;font-family:Arial;'>The <code>set_auth_token</code> function accepts a Personal Access Token (PAT) and a PEM key file to establish a secure session with the UES endpoint. After authentication, <code>VSManager.health()</code> confirms that the vector store service is reachable and ready.
</p>
<ul style='font-size:16px;font-family:Arial; margin-top:4px;'>
  <li><a href='https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Teradata-Package-for-Python-on-VantageCloud-Lake/Working-with-Open-Analytics/APIs-to-Use-with-Open-Analytics-Framework/API-to-Set-Authentication-Token/set_auth_token'>Click here</a> to see more details about using the Teradata APIs to set the authentication objects.</li>
  <li>Check out <a href='https://medium.com/teradata/deploy-hugging-face-llms-on-teradata-vantagecloud-lake-with-nvidia-gpu-acceleration-d94d999edaa5'>Step 4</a> of this tutorial for details on configuring a VantageCloud Lake environment to use the Open Analytics Framework.</li>
</ul>

In [None]:
# We've already loaded all the values into our environment variables and into a dictionary, env_vars.
# username=env_vars.get("username") isn't required when using base_url, pat and pem.
ues_uri=env_vars.get("ues_uri")
if ues_uri.endswith("/open-analytics"):
    ues_uri = ues_uri[:-len("/open-analytics")]   # strip the trailing "/open-analytics" (15 characters)

if set_auth_token(base_url=ues_uri,
                  pat_token=env_vars.get("access_token"), 
                  pem_file=env_vars.get("pem_file")
                 ):
    print("UES Authentication successful")
else:
    print("UES Authentication failed. Check credentials.")
    raise SystemExit(1)  # same effect as sys.exit(1), without needing to import sys

In [None]:
VSManager.health()

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>4. Ingest Unstructured Data and Generate Embeddings</b></p>

<p style='font-size:16px;font-family:Arial;'>
This section uses the <b>Unstructured.io API</b> to parse and embed a library of healthcare assets stored in S3. Unstructured.io handles the heavy lifting of document parsing — extracting text from PDFs, detecting layout elements, chunking content into semantically coherent pieces, and computing embeddings. The resulting records (text chunks plus embedding vectors) are written directly into Vantage tables.
</p>
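<p style='font-size:16px;font-family:Arial;'>To give a feel for what chunking means, here is a deliberately naive sketch that splits text into overlapping word windows. It is not how Unstructured.io chunks documents (its chunking is layout-aware); the function name and the <code>max_words</code> and <code>overlap</code> parameters are hypothetical.</p>

```python
def chunk_words(text, max_words=50, overlap=10):
    """Split text into windows of at most max_words words that share `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

<p style='font-size:16px;font-family:Arial;'>The overlap keeps some shared context between neighbouring chunks, so a sentence that straddles a boundary is less likely to be lost to retrieval.</p>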
<p style='font-size:16px;font-family:Arial;'>
You will need an <b>Unstructured API key</b> to run this section. If you do not already have one, create a free account at <a href='https://unstructured.io/'>unstructured.io</a> and retrieve your key from the API Keys page in the platform dashboard.
</p>
<div class='alert alert-block alert-info'>
<p style='font-size:16px;font-family:Arial;'><b>Note:</b> The first cell below retrieves the current default database name, which is used as the target schema for the two embedded tables created during ingestion: <code>composite_pdfdocs_embedded</code> and <code>image_samples_embedded</code>.</p>
</div>

In [None]:
default_db = execute_sql("SELECT DATABASE").fetchone()[0]

In [None]:
unstructured_api_key = getpass("Enter your Unstructured API key: ")

<hr style='height:1px;border:none;background-color:#ccc;'>
<p style='font-size:18px;font-family:Arial;'><b>4.1 Ingest Composite PDF Documents</b></p>
<p style='font-size:16px;font-family:Arial;'>
The <code>ingest</code> utility calls the Unstructured API against an S3 prefix containing a library of composite healthcare PDFs. Each PDF is parsed, chunked, and embedded; the resulting records are written to the <b>composite_pdfdocs_embedded</b> table in Vantage. The table stores the raw text chunk, its embedding vector, and metadata such as the source filename and S3 record locator.
</p>

In [None]:
ingest(api_key=unstructured_api_key, 
       td_host=env_vars.get("host"), 
       td_user=env_vars.get("username"), 
       td_password=env_vars.get("my_variable"), 
       td_database=default_db, 
       td_table='composite_pdfdocs_embedded', 
       s3_uri="s3://dev-rel-demos/teradata-unstructured/healthcare-assets/composite-pdfs/", 
       s3_anonymous=True)

<hr style='height:1px;border:none;background-color:#ccc;'>
<p style='font-size:18px;font-family:Arial;'><b>4.2 Ingest Sample Images</b></p>
<p style='font-size:16px;font-family:Arial;'>
The same ingestion pipeline is applied to a set of sample medical images stored at a separate S3 prefix. Unstructured processes each image, generates an embedding, and writes the record to the <b>image_samples_embedded</b> table. These image embeddings will later be used as query vectors to search the PDF document store for semantically similar content.
</p>

In [None]:
ingest(api_key=unstructured_api_key, 
       td_host=env_vars.get("host"), 
       td_user=env_vars.get("username"), 
       td_password=env_vars.get("my_variable"), 
       td_database=default_db, 
       td_table='image_samples_embedded', 
       s3_uri="s3://dev-rel-demos/teradata-unstructured/healthcare-assets/images/", 
       s3_anonymous=True)

<hr style='height:1px;border:none;background-color:#ccc;'>
<p style='font-size:18px;font-family:Arial;'><b>4.3 Preview Ingested Data</b></p>
<p style='font-size:16px;font-family:Arial;'>
Load both tables into teradataml DataFrames and inspect a sample of the records. The preview displays the <b>record_id</b>, <b>filename</b>, and <b>record_locator</b> columns — the embedding vectors are omitted here for readability, but they are present in the underlying table and will be used for similarity search in later steps.
</p>

In [None]:
image_documentation_bank = DataFrame.from_query(f"SELECT * FROM {default_db}.composite_pdfdocs_embedded")

In [None]:
image_documentation_bank[['record_id','filename','record_locator']].head(2)

In [None]:
raw_images_df = DataFrame.from_query(f"SELECT * FROM {default_db}.image_samples_embedded")

In [None]:
raw_images_df[['record_id','filename','record_locator']].head(2)

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>5. Review the Vector Store Registry</b></p>
<p style='font-size:16px;font-family:Arial;'>
<code>VSManager.list()</code> returns a catalogue of the registered vector stores; the cell below filters it to the stores in your user's database. Use this to confirm that there are no conflicting names before creating a new store, or to audit existing stores and their metadata.
</p>

In [None]:
vslist = VSManager.list()
vslist[ (vslist['database_name'] == env_vars.get("username"))]

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>6. Build the Enterprise Vector Store</b></p>
<p style='font-size:16px;font-family:Arial;'>
<code>TeradataVectorStore.from_embeddings</code> creates a named vector store backed by the <b>composite_pdfdocs_embedded</b> table. The method registers the store in Vantage and indexes the embedding column for similarity search. Key parameters include the data table, the column containing the embedding vectors, the key columns that identify each record, and the metadata columns to carry through into search results.
</p>
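<p style='font-size:16px;font-family:Arial;'>Before building a store it can help to sanity-check the input: every key should be unique and every embedding should have the same length. The sketch below does this on plain Python dictionaries. It is an illustration under assumptions, not part of the Teradata API; <code>check_records</code> and the sample rows are hypothetical, with field names borrowed from the columns used in this notebook.</p>

```python
def check_records(records, key_fields, vector_field):
    """Return a list of problems: duplicate keys or inconsistent vector lengths."""
    problems = []
    seen_keys = set()
    expected_dim = None
    for position, record in enumerate(records):
        key = tuple(record[field] for field in key_fields)
        if key in seen_keys:
            problems.append(f"row {position}: duplicate key {key}")
        seen_keys.add(key)
        dim = len(record[vector_field])
        if expected_dim is None:
            expected_dim = dim  # the first row defines the expected dimensionality
        elif dim != expected_dim:
            problems.append(f"row {position}: dimension {dim}, expected {expected_dim}")
    return problems


# Hypothetical rows; real embeddings have far more dimensions.
rows = [
    {"id": 1, "record_id": "a", "embeddings": [0.1, 0.2, 0.3]},
    {"id": 2, "record_id": "b", "embeddings": [0.4, 0.5]},
    {"id": 1, "record_id": "a", "embeddings": [0.7, 0.8, 0.9]},
]
for problem in check_records(rows, ["id", "record_id"], "embeddings"):
    print(problem)
```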

In [None]:
vs_emb = TeradataVectorStore.from_embeddings(name = "unstructured_embeddings_demo",
                                     data = image_documentation_bank,
                                     data_columns = "embeddings",
                                     key_columns = ["id", "record_id"],
                                     embedding_data_columns = "text",
                                     metadata_columns = ["text","date_created", "date_modified", "record_locator", "filename"],)

<hr style='height:1px;border:none;background-color:#ccc;'>
<p style='font-size:18px;font-family:Arial;'><b>6.1 Check Vector Store Status and Details</b></p>
<p style='font-size:16px;font-family:Arial;'>
Although the create, update, and destroy APIs are synchronous, the <b>status</b> method gives an explicit confirmation that the operation completed and the index is in a healthy state. <code>get_details()</code> returns a richer summary that includes the index configuration, the number of indexed records, and the embedding dimensionality.
</p>

In [None]:
vs_emb.status()

In [None]:
vs_emb.get_details()

<hr style='height:1px;border:none;background-color:#ccc;'>
<p style='font-size:18px;font-family:Arial;'><b>6.2 Auxiliary Functions</b></p>
<p style='font-size:16px;font-family:Arial;'>
The two cells below show common lifecycle operations on a vector store. <code>TeradataVectorStore(name=..., log=True)</code> reconnects to an existing named store without recreating it, which is useful when returning to an already-indexed dataset. <code>vs_emb.destroy()</code> tears down the index and removes the associated metadata from the registry — use this to clean up after experimentation.
</p>
<div class='alert alert-block alert-info'>
<p style='font-size:16px;font-family:Arial;'><b>Note:</b> The <code>destroy()</code> call in this cell is provided for reference. Do not execute it here unless you intend to delete the vector store before completing the similarity search steps in Section 7.</p>
</div>

In [None]:
# ACTION: Specify the name of an existing vector store to reconnect to it.
vs_emb = TeradataVectorStore(name='unstructured_embeddings_demo', log=True)

In [None]:
# NOTE: To destroy a vector store, use the destroy() function.
# Left commented out so the store stays available for Section 7; uncomment to run it.
# vs_emb.destroy()

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>7. Perform a Similarity Search</b></p>
<p style='font-size:16px;font-family:Arial;'>
With the vector store built over the PDF embeddings, we can now query it using an image embedding as the search vector. <code>similarity_search_by_vector</code> compares the query embedding with the indexed vectors using the store's similarity metric and returns the top matches ranked by score.
</p>
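<p style='font-size:16px;font-family:Arial;'>To make the ranking step concrete, the sketch below scores a query vector against a handful of stored vectors with cosine similarity, one common choice of metric, and returns the best matches. It runs in plain Python on toy two-dimensional vectors; the function names and document keys are hypothetical, and the in-database search does not work this way internally.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, indexed, k=1):
    """Return the k (score, key) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query, vector), key) for key, vector in indexed.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Toy vectors standing in for document embeddings.
indexed = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
for score, key in top_k([1.0, 0.1], indexed, k=2):
    print(f"{key}: {score:.3f}")
```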
<p style='font-size:16px;font-family:Arial;'>
The raw similarity response is then passed to <code>prepare_response</code>, which forwards the matched documents together with a natural-language question to the connected LLM. The LLM synthesizes the results into a readable answer, in this case the title, description, record ID, and locator of the best-matching document.
</p>
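<p style='font-size:16px;font-family:Arial;'>Conceptually, this step combines the retrieved records with the question into a single prompt for the LLM. The sketch below shows one plausible way to assemble such a prompt. It is an assumption for illustration only and does not reflect how <code>prepare_response</code> is implemented; <code>build_prompt</code> and the sample match are hypothetical.</p>

```python
def build_prompt(question, matches, instruction=""):
    """Assemble an LLM prompt from a question and a list of matched records."""
    context_lines = [
        f"[{rank}] {match['filename']} (score {match['score']:.3f}): {match['text']}"
        for rank, match in enumerate(matches, start=1)
    ]
    parts = ["Context:", *context_lines, "", f"Question: {question}"]
    if instruction:
        parts.append(f"Instruction: {instruction}")
    return "\n".join(parts)


# Hypothetical search result with the fields used in the prompt.
matches = [{"filename": "report.pdf", "score": 0.91234, "text": "Chest X-ray findings."}]
prompt_text = build_prompt("What is the title?", matches, "Answer briefly.")
print(prompt_text)
```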

In [None]:
response = vs_emb.similarity_search_by_vector(data = raw_images_df.head(1), column='embeddings')

In [None]:
response.similar_objects.sort('score',False).head(1)

In [None]:
question='What are the title, description, record ID, and locator of the most similar record?'
prompt='Format the response in a conversational way.'
response = vs_emb.prepare_response(question=question, similarity_results=response, prompt=prompt)

In [None]:
# NOTE: The "response" object is a string. If we ask to display the string itself,
# the special characters like new lines will not be interpreted.
# To show the actual text with new lines, explicitly specify to print() the "response" object.

print(response)

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>8. Build a Conversational Agent with the Vector Store</b></p>
<p style='font-size:16px;font-family:Arial;'>
This section wires the Enterprise Vector Store into a <b>LangChain agent</b>, creating a natural-language interface for exploring the document and image library. The agent is equipped with two tools:
</p>
<ul style='font-size:16px;font-family:Arial;'>
    <li><b>search_and_display_similar_image:</b> Takes the currently selected image from the interactive grid, runs a similarity search against the PDF vector store, and displays the best-matching document metadata.</li>
    <li><b>display_pdf_from_locator:</b> Renders a PDF inline in the notebook given a filename and S3 path extracted from a previous search result.</li>
</ul>
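<p style='font-size:16px;font-family:Arial;'>Rendering a PDF from an S3 path means turning an <code>s3://</code> URI into an HTTPS URL the browser can load. The sketch below builds a virtual-hosted-style URL and percent-encodes the object key. It assumes the object is publicly readable and that the default endpoint works for the bucket; <code>s3_to_https</code> and the bucket name are hypothetical.</p>

```python
from urllib.parse import quote


def s3_to_https(remote_file_path, filename):
    """Build a virtual-hosted-style HTTPS URL for an object in a public S3 bucket."""
    if not remote_file_path.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    without_scheme = remote_file_path[len("s3://"):].strip("/")
    bucket, _, prefix = without_scheme.partition("/")
    key = f"{prefix}/{filename}" if prefix else filename
    # quote() keeps "/" by default and escapes spaces and other unsafe characters.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


print(s3_to_https("s3://my-bucket/docs/pdfs/", "a b.pdf"))
```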
<p style='font-size:16px;font-family:Arial;'>
The agent uses an LLM (routed through the VantageCloud Lake LiteLLM proxy) to decide which tool to call based on the user's message. An interactive image grid lets the user select an image before asking the agent to find similar documents or display a specific PDF.
</p>
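<p style='font-size:16px;font-family:Arial;'>The routing idea can be pictured as a lookup from tool name to function: the LLM proposes a tool name and arguments, and the runtime calls the matching function. The sketch below illustrates that pattern in plain Python. It is not LangChain's implementation; the registry helpers and the two stand-in tools are hypothetical.</p>

```python
def make_registry(*functions):
    """Map each function's name to the function itself."""
    return {fn.__name__: fn for fn in functions}


def dispatch(registry, tool_call):
    """Run the tool named in tool_call with its keyword arguments."""
    name = tool_call["name"]
    if name not in registry:
        return f"Unknown tool: {name}"
    return registry[name](**tool_call.get("args", {}))


# Hypothetical stand-ins for the two notebook tools.
def find_similar(record_id):
    return f"searching for documents similar to {record_id}"


def show_pdf(filename, remote_file_path):
    return f"rendering {remote_file_path}/{filename}"


registry = make_registry(find_similar, show_pdf)
print(dispatch(registry, {"name": "find_similar", "args": {"record_id": "img-1"}}))
```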

<p style='font-size:16px;font-family:Arial;'>
The code below initializes the LLM connection using credentials from the environment file, renders the image grid widget, and sets up the chat UI. The <code>on_send</code> handler captures user messages and dispatches them to the agent, routing tool calls back to the vector store or the PDF renderer as needed.
</p>

In [None]:
display_df = raw_images_df.to_pandas()

In [None]:
# ── Environment ───────────────────────────────────────────────────────────────

environment_path = "/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env"
load_dotenv(dotenv_path=environment_path)

llm_key = os.getenv("litellm_key")
llm_url = os.getenv("litellm_base_url")

llm = init_chat_model(
    model="openai-gpt-41",
    model_provider="openai",
    base_url=llm_url,
    api_key=llm_key,
)

# ── Display the grid ──────────────────────────────────────────────────────────
def on_image_selected(record_id: str) -> None:
    chat_output.clear_output()

result = display_image_grid(display_df, on_select=on_image_selected)

display(HTML("<hr style='margin:20px 0'>"))

# ── Chat UI ───────────────────────────────────────────────────────────────────

chat_output = widgets.Output(layout=widgets.Layout(
    max_width="720px",
    min_height="200px",
    padding="12px",
    border="1px solid #ddd",
    border_radius="6px",
))

text_input = widgets.Text(
    placeholder="Ask something about the selected image...",
    layout=widgets.Layout(width="600px")
)

send_btn = widgets.Button(
    description="Send",
    button_style="primary",
    layout=widgets.Layout(width="80px")
)

display(widgets.VBox([
    chat_output,
    widgets.HBox([text_input, send_btn]),
], layout=widgets.Layout(max_width="720px")))

# ── Tool ──────────────────────────────────────────────────────────────────────

@tool
def search_and_display_similar_image(dummy: str = "") -> str:
    """
    Runs a similarity search using the currently selected image and displays
    the most similar result. Call this whenever the user asks about similar images.
    """
    if result.selected_id is None:
        return "No image is selected. Please select an image from the grid first."

    response_similarity = vs_emb.similarity_search_by_vector(
        data=raw_images_df[raw_images_df["record_id"] == result.selected_id],
        column="embeddings",
        return_type="json",
    )
    question = "What are the title, description, filename, record id, and file_locator of the most similar record?"
    prompt = "Format the response as a JSON object."
    response_chat = vs_emb.prepare_response(
        question=question,
        similarity_results=response_similarity,
        prompt=prompt,
    )

    with chat_output:
        display(HTML(
            f"<div style='font-family:sans-serif; padding:12px; background:#f9f9f9;"
            f"border-radius:6px; margin:8px 0;'>{response_chat}</div>"
        ))

    return response_chat

@tool
def display_pdf_from_locator(filename: str, remote_file_path: str) -> str:
    """
    Renders a PDF in the notebook given a filename and S3 remote_file_path.
    Use this when the user wants to view a PDF from the search results.
    Extract filename and remote_file_path from the search response and pass them here.
    """
    s3_path = remote_file_path.rstrip("/")
    without_scheme = s3_path.replace("s3://", "", 1)
    bucket, _, prefix = without_scheme.partition("/")
    key = f"{prefix}/{filename}" if prefix else filename
    url = f"https://{bucket}.s3.amazonaws.com/{key}"

    with chat_output:
        display(HTML(
            f"<div style='margin:8px 0;'>"
            f"<b>Displaying:</b> <a href='{url}' target='_blank'>{filename}</a>"
            f"</div>"
        ))
        display(IFrame(src=url, width="100%", height="600px"))

    return f"PDF displayed: {filename}"


# ── Agent setup ───────────────────────────────────────────────────────────────

tools = [search_and_display_similar_image, display_pdf_from_locator]

agent = create_agent(
    model=llm,
    tools=tools,
    system_prompt=(
        "You are an image analysis assistant with two tools. "
        "Use search_and_display_similar_image to find similar images. "
        "Use display_pdf_from_locator to render a PDF by passing the filename "
        "and remote_file_path extracted from the search response. "
        "When the user asks to view or open a PDF, extract those two values "
        "from the search result and call display_pdf_from_locator."
    ),
)
# ── Send handler (synchronous, no threads) ────────────────────────────────────

def on_send(btn):
    msg = text_input.value
    if not msg.strip():
        return
    text_input.value = ""

    with chat_output:
        display(HTML(
            f"<div style='margin:6px 0; padding:8px 12px; background:#e8f0fe;"
            f"border-radius:6px;'><b>You:</b> {msg}</div>"
        ))

    response = agent.invoke({
        "messages": [HumanMessage(content=msg)]
    })

    agent_reply = response["messages"][-1].content
    with chat_output:
        display(HTML(
            f"<div style='margin:6px 0; padding:8px 12px; background:#f0f4f0;"
            f"border-radius:6px;'><b>Agent:</b> {agent_reply}</div>"
        ))

send_btn.on_click(on_send)
text_input.on_submit(lambda t: on_send(None) or setattr(t, "value", ""))
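
<p style='font-size:16px;font-family:Arial;'>The chat handler interpolates the user's message and the agent's reply directly into HTML. If that text can contain characters such as <code>&lt;</code> or <code>&amp;</code>, escaping it first avoids broken rendering or injected markup. A minimal sketch using the standard library follows; <code>chat_bubble</code> is a hypothetical helper, not part of the notebook's utilities.</p>

```python
from html import escape


def chat_bubble(speaker, message, background="#e8f0fe"):
    """Return an HTML snippet for one chat message with the text escaped."""
    return (
        f"<div style='margin:6px 0; padding:8px 12px; background:{background};"
        f"border-radius:6px;'><b>{escape(speaker)}:</b> {escape(message)}</div>"
    )


bubble = chat_bubble("You", "<script>alert(1)</script> & more")
print(bubble)
```

<p style='font-size:16px;font-family:Arial;'>The returned string can be passed to <code>HTML(...)</code> in place of the inline f-strings.</p>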

<hr style='height:2px;border:none;background-color:#00233c;'>
<p style='font-size:20px;font-family:Arial;'><b>9. Cleanup</b></p>
<p style='font-size:16px;font-family:Arial;'>
Run the cells in this section to tear down all resources created during the notebook. This includes destroying the vector store index, removing the embedded tables from Vantage, disconnecting the UES session, and closing the database connection.
</p>
<hr style='height:1px;border:none;background-color:#ccc;'>
<p style='font-size:18px;font-family:Arial;'><b>9.1 Destroy the Vector Store and Review Registry</b></p>
<p style='font-size:16px;font-family:Arial;'>Call <code>vs_emb.destroy()</code> to remove the index and deregister the vector store. The subsequent <code>status()</code> and <code>VSManager.list()</code> calls confirm that the store has been fully removed.</p>

In [None]:
vs_emb.destroy()

In [None]:
vs_emb.status()

In [None]:
vslist = VSManager.list()
vslist[ (vslist['database_name'] == env_vars.get("username"))]

<hr style='height:1px;border:none;background-color:#ccc;'>
<p style='font-size:18px;font-family:Arial;'><b>9.2 Disconnect and Drop Tables</b></p>
<p style='font-size:16px;font-family:Arial;'>Use <code>VSManager.disconnect()</code> to close the UES session, then drop the two embedded tables created during ingestion to return the database to a clean state. Finally, <code>remove_context()</code> closes the teradataml connection to Vantage.</p>

In [None]:
VSManager.disconnect()

In [None]:
# Drop tables
db_drop_table('composite_pdfdocs_embedded', schema_name=default_db)
db_drop_table('image_samples_embedded', schema_name=default_db)

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>