<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Enterprise Vector Store - Embedding and Search in SQL
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<hr>

<p style = 'font-size:28px;font-family:Arial;color:#00233C'><b>Overview</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata Vantage provides a suite of new in-database analytic capabilities for vector storage, management, indexing, and search, including:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Vector Datatype</b> based on VARBYTE arrays</li>
    <li><b>Normalization</b> functions to improve search efficiency</li>
    <li><b>Vector Indexing and Search</b> leveraging multiple algorithms</li>
    </ul>

<b style = 'font-size:16px;font-family:Arial;color:#00233C'>Vector Datatype using AI_TextEmbeddings</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Currently, the Vantage Database supports several methods for generating <b>Vector Embeddings</b>, including in-database Bring Your Own Model (BYOM) functions, in-platform GPU-accelerated open-source model inferencing, and API-supported embedding using cloud-based Large Language Models (LLMs).</p>

<p style = 'font-size:28px;font-family:Arial;color:#00233C'><b>Demonstration</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following demonstration uses a subset of the Consumer Financial Protection Bureau (CFPB) complaints data to illustrate a SQL-based end-to-end Vector Embedding use case:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Generate <b>Vector Embeddings</b> using Cloud-based LLMs</li>
    <li><b>Normalize</b> the vector data for efficient search</li>
    <li>Calculate <b>Vector Distance</b> between complaints and topics data</li>
    <li>Perform <b>Retrieval Augmented Generation (RAG)</b> using native functions and cloud-based LLMs</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Connect to the database</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If necessary, create a new named connection using the %addconnect magic.</p>

In [None]:
%addconnect name=vs_demo, host=34.232.150.25

In [None]:
%chconnect name=vs_demo, host=34.232.150.25

In [None]:
%connect vs_demo, user=data_engineer, hidewarnings=True

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 1 - Generate Vector Embeddings</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ClearScape Analytics function AI_TextEmbeddings can use either built-in or user-defined CSP LLM endpoints and models to generate vector embeddings.  The built-in capabilities follow the model support matrix provided in the User Guide.</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>View the original dataset (CFPB consumer complaints)</li>
    <li>Set up authorization.  To use built-in LLM services, keep the USER and PASSWORD values blank</li>
    <li>Pass this data to the AI_TextEmbeddings function</li>
    </ol>
<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Show original data:</b></p>

In [None]:
SELECT TOP 2 * FROM demo_ofs.CFPB_Complaints_1K;

In [None]:
SELECT TOP 2 * FROM demo_ofs.topics_of_interest;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Configure authorization</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>VantageCloud environments that have Enterprise Vector Store enabled leverage the existing CSP account for LLM access.  For these accounts, users can pass a blank authorization object to the SQL functions.  A user without Enterprise Vector Store can instead pass valid credentials to access the CSP LLM (AWS Bedrock, Azure/OpenAI, Google Gemini, etc.).  See the documentation for more details.</p>

In [None]:
REPLACE AUTHORIZATION demo_embeddings_auth USER '' PASSWORD '';

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Generate Vector Embeddings</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function takes the following as input:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Input Data</b>.  Including a partition value to execute the embedding function from a single AMP</li>
    <li><b>Model information</b> including the model name, AWS region of the deployment, and any additional arguments</li>
    <li><b>Authorization</b>.  In this case, the blank passthrough object</li>
    </ul>

In [None]:
SELECT *
FROM AI_TEXTEMBEDDINGS (
    ON (SELECT TOP 2 txt, id, TD_BYONE() p FROM demo_ofs.CFPB_Complaints_1K) AS InputTable
    PARTITION BY p
USING       
     region('us-east-1')
     refreshcredentialtimeseconds('3600')
     Authorization(demo_embeddings_auth)
     apitype('aws') 
     modelname('amazon.titan-embed-text-v2:0') --'amazon.titan-embed-image-v1', or 'amazon.titan-embed-text-v1'
     --modelargs('{"dimensions":256}') --to change the number of embedding dimensions
     textcolumn('txt')
     outputformat('vector')
) as dt;

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 2 - Normalize the vector values</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Vector normalization scales a vector to a magnitude (length) of 1 while preserving its direction.  The resulting vector is called a unit vector, and it is obtained by dividing each component of the vector by the vector's length.  Working with unit vectors makes some calculations much more efficient, including some of the search and indexing operations.  The ClearScape Analytics function <b>TD_VectorNormalize</b> performs this operation at scale on the VECTOR datatype.</p>
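
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For illustration only, the arithmetic behind unit-vector normalization can be sketched in a few lines of Python.  This is not the implementation of TD_VectorNormalize; the helper name <code>unit_vector</code> and the toy two-dimensional vector are assumptions made for this example, and the snippet is not meant to be run by the SQL kernel.</p>

```python
import math

def unit_vector(vec):
    """Scale vec to length 1 by dividing each component by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

# Toy 2-dimensional "embedding" (hypothetical data); its length is 5.
embedding = [3.0, 4.0]
normalized = unit_vector(embedding)
print(normalized)  # [0.6, 0.8]
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The direction of the vector is unchanged (the components keep the ratio 3:4), while its length becomes 1.</p>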

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Execute the VectorNormalize function against the output of the embedding function</b></p>


In [None]:
SELECT * FROM TD_Vectornormalize(
       ON (SELECT txt, id, Embedding, Embedding Embedding_Normalized
            FROM AI_TEXTEMBEDDINGS (
                ON (SELECT TOP 2 txt, id, TD_BYONE() p FROM demo_ofs.CFPB_Complaints_1K) AS InputTable
                PARTITION BY p
            USING       
                 region('us-east-1')
                 refreshcredentialtimeseconds('3600')
                 Authorization(demo_embeddings_auth)
                 apitype('aws')
                 modelname('amazon.titan-embed-text-v2:0')
                 modelargs('{}')
                 textcolumn('txt')
                 outputformat('vector')
    ) as ve) AS InputTable
USING
    IDColumns('id')
    TargetColumns('Embedding_Normalized')
    Approach('UNITVECTOR')
    Accumulate('txt','Embedding')
    EmbeddingSize(1024)
) AS dt;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Create new volatile tables to store the final embeddings and normalized values.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create new volatile tables by nesting the functions in a single expression.  Use the Accumulate clause to return the original embedding and the original text.</p>

In [None]:
DROP TABLE topics_embeddings_Normalized;

In [None]:
CREATE VOLATILE TABLE topics_embeddings_Normalized AS (
    SELECT * FROM TD_Vectornormalize(
           ON (SELECT txt, id, Embedding, Embedding Embedding_Normalized
                FROM AI_TEXTEMBEDDINGS (
                    ON (SELECT txt, id, TD_BYONE() p FROM demo_ofs.topics_of_interest) AS InputTable
                    PARTITION BY p
                USING       
                     region('us-east-1')
                     refreshcredentialtimeseconds('3600')
                     Authorization(demo_embeddings_auth)
                     apitype('aws')
                     modelname('amazon.titan-embed-text-v2:0')
                     modelargs('{}')
                     textcolumn('txt')
                     outputformat('vector')
        ) as ve) AS InputTable
    USING
        IDColumns('id')
        TargetColumns('Embedding_Normalized')
        Approach('UNITVECTOR')
        Accumulate('txt','Embedding')
        EmbeddingSize(1024)
) AS d) WITH DATA
ON COMMIT PRESERVE ROWS;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Use <b>CAST</b> to VARCHAR to create a human-readable embedding array.</p>

In [None]:
SELECT TOP 2 id, 
    CAST(Embedding_Normalized AS VARCHAR(34000)), 
    txt, 
    CAST(Embedding as VARCHAR(34000))
FROM topics_embeddings_Normalized;

In [None]:
DROP TABLE CFPB_embeddings_Normalized;

In [None]:
CREATE VOLATILE TABLE CFPB_embeddings_Normalized AS (
    SELECT * FROM TD_Vectornormalize(
           ON (SELECT txt, id, Embedding, Embedding Embedding_Normalized
                FROM AI_TEXTEMBEDDINGS (
                    ON (SELECT TOP 1000 txt, id, TD_BYONE() p FROM demo_ofs.CFPB_Complaints_1K) AS InputTable --only pass 1000 rows to the embedding function
                    PARTITION BY p
                USING       
                     region('us-east-1')
                     refreshcredentialtimeseconds('3600')
                     Authorization(demo_embeddings_auth)
                     apitype('aws')
                     modelname('amazon.titan-embed-text-v2:0')
                     modelargs('{}')
                     textcolumn('txt')
                     outputformat('vector')
        ) as ve) AS InputTable
    USING
        IDColumns('id')
        TargetColumns('Embedding_Normalized')
        Approach('UNITVECTOR')
        Accumulate('txt','Embedding')
        EmbeddingSize(1024)
) AS d) WITH DATA
ON COMMIT PRESERVE ROWS;

In [None]:
SELECT TOP 2 id, 
    CAST(Embedding_Normalized AS VARCHAR(34000)), 
    txt, 
    CAST(Embedding as VARCHAR(34000))
FROM CFPB_embeddings_Normalized;

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 3 - Perform Vector Distance calculations</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ClearScape Analytics function <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Model-Training-Functions/TD_VectorDistance'>TD_VectorDistance</a> takes a table of input (target) vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs.  Because this function scans every row and performs the distance calculation, it is resource-intensive and usually best suited to smaller numbers of records.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Additionally, perform a simple join to display the original complaints, topics of interest, and distance calculations.</p>
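
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea of a brute-force distance scan with a top-k cutoff can be sketched in Python as follows.  This is an illustration under stated assumptions, not the TD_VectorDistance implementation: cosine distance is assumed to be 1 minus cosine similarity, and the helper names <code>cosine_distance</code> and <code>nearest_references</code> as well as the toy vectors are hypothetical.  The snippet is not meant to be run by the SQL kernel.</p>

```python
import math

def cosine_distance(a, b):
    """Assumed definition: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_references(target, references, k=1):
    """Score the target against every reference vector and keep the k closest.

    references maps a reference id to its vector; the result is a list of
    (reference_id, distance) pairs in ascending order of distance.
    """
    scored = sorted(
        (cosine_distance(target, vec), ref_id)
        for ref_id, vec in references.items()
    )
    return [(ref_id, dist) for dist, ref_id in scored[:k]]

# Hypothetical 2-dimensional unit vectors standing in for topic embeddings.
topics = {1: [1.0, 0.0], 2: [0.0, 1.0]}
complaint = [0.6, 0.8]

closest = nearest_references(complaint, topics, k=1)
print(closest)  # topic 2 is closest, at a distance of roughly 0.2
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>When both vectors already have length 1, the denominator is 1 and the cosine similarity reduces to a plain dot product, which is one reason normalized vectors can make distance calculations cheaper.</p>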

In [None]:
SELECT TOP 10 target_id, reference_id, distancetype, cast(distance as decimal(36,8)) as distance FROM TD_VECTORDISTANCE (
    ON CFPB_Embeddings_Normalized AS TargetTable
    ON topics_embeddings_Normalized AS ReferenceTable DIMENSION
USING
    TargetIDColumn('id')
    TargetFeatureColumns('Embedding_Normalized')
    RefIDColumn('id')
    RefFeatureColumns('Embedding_Normalized')
    DistanceMeasure('cosine')
    topk(1)
) AS dt order by 3,1,2,4;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Join the results back to the original complaints and topics</b></p>

In [None]:
SELECT TOP 2 * FROM demo_ofs.CFPB_Complaints_1K;

In [None]:
SELECT TOP 2 * FROM demo_ofs.topics_of_interest;

In [None]:
SELECT TOP 10 c.id complaint_id, r.txt topic, c.txt complaint, d.distance

FROM (SELECT target_id, reference_id, distancetype, cast(distance as decimal(36,8)) as distance FROM TD_VECTORDISTANCE (
    ON CFPB_Embeddings_Normalized AS TargetTable
    ON topics_embeddings_Normalized AS ReferenceTable DIMENSION
USING
    TargetIDColumn('id')
    TargetFeatureColumns('Embedding_Normalized')
    RefIDColumn('id')
    RefFeatureColumns('Embedding_Normalized')
    DistanceMeasure('cosine')
    topk(1)
) AS dt) d
    
JOIN demo_ofs.CFPB_Complaints_1K c
    ON c.id = d.target_id
JOIN demo_ofs.topics_of_interest r
    ON r.id = d.reference_id

ORDER BY d.distance;

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 4 - Pass the search results to the LLM to generate a response</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ClearScape Analytics function <a href = 'https://docs.teradata.com/r/Lake-Analyze-Your-Data-with-ClearScape-AnalyticsTM/Text-Analytics-AI-Functions/AI_AskLLM'>AI_ASKLLM</a> can use user-defined CSP LLM endpoints and models for generating a response.  The function takes multiple user-defined parameters to select the model and control the generation tasks:</p> 
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Input Table</b> containing one or more "questions".  Each row in this table will call the generation task once</li>
    <li><b>Context Table</b> containing additional context to pass to the prompt.  This can be the result of a similarity search as in the example here, or any other data that the user wishes to send in the prompt.  Note the function will use a single column, so use PACK or other string functions to concatenate additional context if desired</li>
    <li><b>Model information</b> including region, CSP, and model name</li>
    <li><b>Authorization</b>, passed as keys/secrets or a secure Authorization Object</li>
    </ol>
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For this demonstration, we will construct a small questions table as input, and pass the similarity search results from above as context.</p>
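
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, each question and its context are merged into a prompt template through placeholder markers, which the AI_AskLLM query in this notebook identifies with the QUESTIONPOSITION and DATAPOSITION arguments.  The Python sketch below shows that substitution idea only.  How AI_AskLLM assembles the prompt internally is not described here, so the helper <code>build_prompt</code> and the placeholder context string are assumptions made for illustration; the snippet is not meant to be run by the SQL kernel.</p>

```python
def build_prompt(template, question, context,
                 question_marker="#QUESTION#", data_marker="#DATA#"):
    """Fill the question and context markers in a template (hypothetical helper)."""
    return template.replace(question_marker, question).replace(data_marker, context)

# Template text copied from the Prompt argument used in this notebook.
template = ("Provide an answer to the question using data as information "
            "relevant to the question. \nQuestion: #QUESTION# \n Data: #DATA#")

question = "What are the most common complaints in the given data"
# Placeholder standing in for a row returned by the similarity search (not real data).
context = "Example complaint text returned by the similarity search."

prompt = build_prompt(template, question, context)
print(prompt)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the context arrives as a single string, any additional fields need to be concatenated into that one value before substitution, which mirrors the single-column note in the list above.</p>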

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Create an input table</b></p>

In [None]:
DROP TABLE input_questions;

In [None]:
CREATE VOLATILE TABLE input_questions(
    id BIGINT,
    question VARCHAR(300)
    )
ON COMMIT PRESERVE ROWS;

In [None]:
INSERT INTO input_questions VALUES(1,'What are the most common complaints in the given data');
INSERT INTO input_questions VALUES(2,'What is the best way to mitigate the worst complaints');

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Create the Context query</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For this demonstration, we will perform a similarity search using one of the topics from above.</p>

In [None]:
DROP TABLE input_context;

In [None]:
CREATE VOLATILE TABLE input_context AS (

SELECT TOP 1 c.id id, c.txt complaint

FROM (SELECT target_id, reference_id, distancetype, cast(distance as decimal(36,8)) as distance FROM TD_VECTORDISTANCE (
    ON CFPB_Embeddings_Normalized AS TargetTable
    ON (SELECT * FROM topics_embeddings_Normalized WHERE id = 1) AS ReferenceTable DIMENSION --select a single topic/question
USING
    TargetIDColumn('id')
    TargetFeatureColumns('Embedding_Normalized')
    RefIDColumn('id')
    RefFeatureColumns('Embedding_Normalized')
    DistanceMeasure('cosine')
    topk(1)
) AS dt) d
    
JOIN demo_ofs.CFPB_Complaints_1K c
    ON c.id = d.target_id
JOIN demo_ofs.topics_of_interest r
    ON r.id = d.reference_id

ORDER BY d.distance) WITH DATA
ON COMMIT PRESERVE ROWS;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Generate Responses</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This query will return two rows, one for each question.</p>

In [None]:
SELECT * FROM AI_AskLLM( 
      ON input_questions AS InputTable partition by id
      --ON (SELECT 'what is the subject of the provided data' question, 1 id) AS InputTable partition by id
      --ON (SELECT 1 id, 'apples, bananas, peas, and plums' complaint) AS ContextTable partition by id
      ON input_context AS ContextTable partition by id
      USING   
      TextColumn('question')
      ContextColumn('complaint')
      ApiType('aws')
      REGION('us-west-2')
      Authorization(Repositories.BedrockAuth)
      ModelName('anthropic.claude-instant-v1')
      Prompt('Provide an answer to the question using data as information relevant to the question. \nQuestion: #QUESTION# \n Data: #DATA#')
      DATAPOSITION('#DATA#')
      QUESTIONPOSITION('#QUESTION#')
      isDebug('true')
      Accumulate('[0:]')
    ) as dt;

In [None]:
%disconnect vs_demo