<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Enterprise Vector Store database functions
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<hr>

<p style = 'font-size:28px;font-family:Arial;color:#00233C'><b>Overview</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata Vantage provides a suite of new in-database analytic capabilities for vector storage, management, indexing, and search, including:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Vector Datatype</b> based on VARBYTE arrays</li>
    <li><b>Normalization</b> functions to improve search efficiency</li>
    <li><b>Vector Indexing and Search</b> leveraging multiple algorithms</li>
    </ul>

<b style = 'font-size:16px;font-family:Arial;color:#00233C'>Vector Datatype</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Currently, the Vantage Database supports several methods for generating <b>Vector Embeddings</b>, including in-database Bring Your Own Model (BYOM) functions, in-platform GPU-accelerated open-source model inferencing, and API-supported embedding using cloud-based Large Language Models. However, not all of these methods support the native VECTOR datatype; some return data as numeric or character-based columns.</p>

<p style = 'font-size:28px;font-family:Arial;color:#00233C'><b>Demonstration</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following demonstration uses a sample of Consumer Financial Protection Bureau (CFPB) complaints data and embeddings to illustrate some of the in-database functions for constructing and analyzing vector data.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Construct a VECTOR Datatype from FLOAT columns</li>
    <li>Normalize the vector data for efficient search</li>
    <li>Calculate Vector Distance between complaints and topics data</li>
    <li>Create an HNSW index and use it for highly scalable distance analysis</li>
    </ol>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Connect to the database</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If necessary, create a new named connection with the %addconnect magic, or update an existing one with %chconnect.</p>

In [None]:
%addconnect name=vs_demo, host=XXX.XXX.XXX.XXX

In [None]:
%chconnect name=vs_demo, host=XXX.XXX.XXX.XXX

In [None]:
%connect vs_demo, user=data_scientist, hidewarnings=True

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 1 - Construct a Vector Datatype from float columns</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Many Vantage functions (teradatagenai classes, BYOM, OAF inferencing, and CSP-direct FastPath functions) generate Vector data as FLOAT columns.  Convert these to the new VECTOR datatype:</p> 
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Show the original dataset. This data is a set of vector embeddings generated with Open Analytics Framework using a Hugging Face open-source embedding model. See that demonstration <a href = 'https://github.com/Teradata/lake-demos/tree/main/UseCases/GenAI/Complaints_Search'>here</a>.</li>
    <li>Run the native ClearScape Analytics function "PACK" to construct a comma-separated array of FLOAT values</li>
    <li>Pass this data to the VECTOR datatype constructor (NEW VECTOR or CAST)</li>
    </ol>
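<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the packing step concrete, the following is a minimal Python sketch, run outside the database, of what "packing" a row of FLOAT columns into a single comma-separated value looks like. The column names (<code>emb_0</code> ... <code>emb_3</code>), the sample values, and the helper functions are hypothetical; the exact string format PACK emits may differ.</p>

```python
# Illustrative only: mimics the idea of packing FLOAT columns into one
# comma-separated value. Column names and values are made up; a real row
# in this demo would have 384 embedding columns.
row = {"id": 101, "emb_0": 0.12, "emb_1": -0.5, "emb_2": 0.33, "emb_3": 0.0}


def pack_row(row, id_key="id"):
    """Join every non-id column into one comma-separated string, keeping the id."""
    values = [v for k, v in row.items() if k != id_key]
    return {id_key: row[id_key], "packed_data": ",".join(repr(v) for v in values)}


def unpack(packed_data):
    """Parse the comma-separated string back into a list of floats."""
    return [float(x) for x in packed_data.split(",")]


packed = pack_row(row)
vector = unpack(packed["packed_data"])
print(packed["packed_data"])
print(vector)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Carrying the id alongside the packed value plays the same role as the <code>Accumulate('id')</code> argument in the queries in this step.</p>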
<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Show original data:</b></p>

In [None]:
SELECT TOP 2 * FROM demo_ofs.CFPB_Complaints_1K;

In [None]:
SELECT TOP 2 * FROM demo_ofs.CFPB_Embeddings_1K;

In [None]:
SELECT TOP 2 * FROM demo_ofs.topics_of_interest;

In [None]:
SELECT TOP 2 * FROM demo_ofs.topics_embeddings;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Pass the output of PACK to the datatype constructor</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note that either <b>CAST</b> or the <b>NEW</b> constructor can be used to create the VECTOR datatype. This query uses CAST; the queries in Step 2 use NEW VECTOR.</p>

In [None]:
SELECT CAST(packed_data AS VECTOR) Vector_Data, id

FROM (
    SELECT * FROM PACK (
    ON (SELECT TOP 2 * FROM demo_ofs.CFPB_Embeddings_1K)
    USING
        OutputColumn('packed_data')
        TargetColumns('[1:384]')
        IncludeColumnName('False')
        Accumulate('id')
) AS dt) d;

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 2 - Normalize the vector values</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Vector normalization is the process of scaling a vector to a magnitude (length) of 1 while preserving its direction. The resulting vector is called a unit vector, and it is obtained by dividing each component of the vector by the vector's length. Normalization makes some calculations much more efficient, including some of the search and indexing operations. The ClearScape Analytics function <b>TD_VectorNormalize</b> performs this operation at scale on the VECTOR datatype.</p>
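<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a small worked example of the arithmetic described above, the following Python sketch (run outside the database) normalizes a two-element vector. The function name and the choice to raise an error on a zero vector are assumptions for illustration; they do not describe how TD_VectorNormalize behaves internally.</p>

```python
import math


def unit_normalize(vector):
    """Scale a vector to length 1 by dividing each component by its Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; this sketch simply refuses to normalize it.
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


v = [3.0, 4.0]            # length is sqrt(9 + 16) = 5
u = unit_normalize(v)
print(u)                  # [0.6, 0.8]
print(math.sqrt(sum(x * x for x in u)))  # length of the result, approximately 1
```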

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Review the queries for both tables</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Complaints Embeddings and Topics Embeddings</p>

In [None]:
SELECT TOP 10 id, NEW VECTOR(packed_data) Vector_Data

FROM (
    SELECT * FROM PACK (
    ON (SELECT * FROM demo_ofs.CFPB_Embeddings_1K)
    USING
        OutputColumn('packed_data')
        TargetColumns('[1:384]')
        IncludeColumnName('False')
        Accumulate('id')
) AS dt) d;

In [None]:
SELECT TOP 10 id, NEW VECTOR(packed_data) Vector_Data

FROM (
    SELECT * FROM PACK (
    ON (SELECT * FROM demo_ofs.topics_embeddings)
    USING
        OutputColumn('packed_data')
        TargetColumns('[1:384]')
        IncludeColumnName('False')
        Accumulate('id')
) AS dt) d;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Execute the VectorNormalize function to create new tables</b></p>

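<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create new volatile tables by nesting PACK, the VECTOR constructor, and TD_VectorNormalize in a single statement. The first cell previews the normalized output before the tables are created.</p>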

In [None]:
SELECT TOP 10 * FROM TD_Vectornormalize(
   ON (SELECT id, NEW VECTOR(packed_data) Vector_Data
        FROM (
            SELECT * FROM PACK (
                ON (SELECT * FROM demo_ofs.topics_embeddings) --original FLOAT-based embeddings table
            USING
                OutputColumn('packed_data')
                TargetColumns('[1:384]')
                IncludeColumnName('False')
                Accumulate('id')
        ) AS p) d) AS InputTable
USING
    IDColumns('id')
    TargetColumns('Vector_Data')
    Approach('UNITVECTOR')
    EmbeddingSize(384)
) AS dt;

In [None]:
DROP TABLE topics_embeddings_Normalized;

In [None]:
CREATE VOLATILE TABLE topics_embeddings_Normalized AS (
    SELECT TOP 10 * FROM TD_Vectornormalize(
       ON (SELECT id, NEW VECTOR(packed_data) Vector_Data
            FROM (
                SELECT * FROM PACK (
                    ON (SELECT * FROM demo_ofs.topics_embeddings) --original FLOAT-based embeddings table
                USING
                    OutputColumn('packed_data')
                    TargetColumns('[1:384]')
                    IncludeColumnName('False')
                    Accumulate('id')
            ) AS p) d) AS InputTable
    USING
        IDColumns('id')
        TargetColumns('Vector_Data')
        Approach('UNITVECTOR')
        EmbeddingSize(384)
    ) AS dt) WITH DATA
ON COMMIT PRESERVE ROWS;

In [None]:
DROP TABLE CFPB_embeddings_Normalized;

In [None]:
CREATE VOLATILE TABLE CFPB_embeddings_Normalized AS (
    SELECT TOP 10 * FROM TD_Vectornormalize(
       ON (SELECT id, NEW VECTOR(packed_data) Vector_Data
            FROM (
                SELECT * FROM PACK (
                    ON (SELECT * FROM demo_ofs.CFPB_embeddings_1K) --original FLOAT-based embeddings table
                USING
                    OutputColumn('packed_data')
                    TargetColumns('[1:384]')
                    IncludeColumnName('False')
                    Accumulate('id')
            ) AS p) d) AS InputTable
    USING
        IDColumns('id')
        TargetColumns('Vector_Data')
        Approach('UNITVECTOR')
        EmbeddingSize(384)
    ) AS dt) WITH DATA
ON COMMIT PRESERVE ROWS;

In [None]:
SELECT TOP 10 * FROM CFPB_embeddings_Normalized;

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 3 - Perform Vector Distance calculations</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ClearScape Analytics function <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Model-Training-Functions/TD_VectorDistance'>TD_VectorDistance</a> takes a table of input (target) vectors and a table of reference vectors, and returns a table containing the distance between each target-reference pair. Because this function scans every row and performs the distance calculation, it is resource-intensive and usually best suited to smaller numbers of records.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Additionally, perform a simple join to display the original complaints, the topics of interest, and the distance calculations.</p>
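<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The exhaustive comparison described above can be sketched in a few lines of Python, run outside the database. This sketch assumes cosine distance is defined as 1 minus cosine similarity, and it keeps only the single closest reference per target, in the spirit of <code>topk(1)</code>. The ids, vectors, and function names are made up for illustration.</p>

```python
import math


def cosine_distance(a, b):
    """Cosine distance, taken here as 1 - cosine similarity (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_reference(targets, references):
    """For each target id, return (reference id, distance) of the closest reference.

    Every target is compared with every reference, which is why the cost
    grows with (number of targets) x (number of references).
    """
    result = {}
    for t_id, t_vec in targets.items():
        result[t_id] = min(
            ((r_id, cosine_distance(t_vec, r_vec)) for r_id, r_vec in references.items()),
            key=lambda pair: pair[1],
        )
    return result


# Made-up two-dimensional vectors keyed by id.
targets = {1: [1.0, 0.0], 2: [0.0, 2.0]}
references = {10: [1.0, 0.1], 20: [0.0, 1.0]}

matches = nearest_reference(targets, references)
for target_id, (reference_id, distance) in matches.items():
    print(target_id, reference_id, round(distance, 8))
```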

In [None]:
SELECT TOP 10 target_id, reference_id, distancetype, cast(distance as decimal(36,8)) as distance FROM TD_VECTORDISTANCE (
    ON CFPB_Embeddings_Normalized AS TargetTable
    ON topics_embeddings_Normalized AS ReferenceTable DIMENSION
USING
    TargetIDColumn('id')
    TargetFeatureColumns('Vector_Data')
    RefIDColumn('id')
    RefFeatureColumns('Vector_Data')
    DistanceMeasure('cosine')
    topk(1)
) AS dt order by 3,1,2,4;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Join the results back to the original complaints and topics</b></p>

In [None]:
SELECT TOP 2 * FROM demo_ofs.CFPB_Complaints_1K;

In [None]:
SELECT TOP 2 * FROM demo_ofs.topics_of_interest;

In [None]:
SELECT TOP 10 c.id complaint_id, r.txt topic, c.txt complaint, d.distance

FROM (SELECT target_id, reference_id, distancetype, cast(distance as decimal(36,8)) as distance FROM TD_VECTORDISTANCE (
    ON CFPB_Embeddings_Normalized AS TargetTable
    ON topics_embeddings_Normalized AS ReferenceTable DIMENSION
USING
    TargetIDColumn('id')
    TargetFeatureColumns('Vector_Data')
    RefIDColumn('id')
    RefFeatureColumns('Vector_Data')
    DistanceMeasure('cosine')
    topk(1)
) AS dt) d
    
JOIN demo_ofs.CFPB_Complaints_1K c
    ON c.id = d.target_id
JOIN demo_ofs.topics_of_interest r
    ON r.id = d.reference_id

ORDER BY d.distance;

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Step 4 - Leverage an HNSW index for fast search</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ClearScape Analytics function <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Model-Training-Functions/TD_VectorDistance'>TD_HNSW</a> (Hierarchical Navigable Small World) creates an index table representing.... This index model can then be used by the TD_HNSWPredict function to perform extremely efficient similarity searches.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Train the model using the Vector Datatype as input</li>
    <li>Predict nearest matches using the topics embeddings as input</li>
    <li>Join the original data for human-readable results</li>
    </ol>
    
<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Train the model</b></p>

In [None]:
SELECT * FROM TD_HNSW (
    ON CFPB_Embeddings_Normalized AS InputTable
    OUT VOLATILE TABLE OutputTable(hnsw_model)
USING
    IdColumn('id')
    VectorColumn('Vector_Data')
    EfConstruction(16)
    NumConnPerNode(16)
    MaxNumConnPerNode(20)
    DistanceMeasure('euclidean')
    EmbeddingSize(384)
    ApplyHeuristics('true')
) as dt;
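
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The training cell above uses <code>DistanceMeasure('euclidean')</code>, while Step 3 used cosine. For unit-length vectors, such as those produced in Step 2, the two measures give the same nearest-neighbor ordering, because the squared Euclidean distance equals twice the cosine distance (taking cosine distance as 1 minus cosine similarity). The following Python sketch, run outside the database with made-up vectors and hypothetical helper names, checks that identity.</p>

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance_unit(a, b):
    """Cosine distance for vectors that are already unit length: 1 - dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


a = unit([1.0, 2.0, 3.0])
b = unit([-2.0, 0.5, 1.0])

squared_euclidean = euclidean(a, b) ** 2
twice_cosine = 2.0 * cosine_distance_unit(a, b)
print(squared_euclidean, twice_cosine)  # equal up to floating-point rounding

# Ranking candidates by either measure yields the same order.
query = unit([1.0, 0.0, 0.0])
candidates = {
    "a": unit([1.0, 1.0, 0.0]),
    "b": unit([0.0, 1.0, 0.0]),
    "c": unit([-1.0, 0.0, 0.5]),
}
by_euclidean = sorted(candidates, key=lambda k: euclidean(query, candidates[k]))
by_cosine = sorted(candidates, key=lambda k: cosine_distance_unit(query, candidates[k]))
print(by_euclidean, by_cosine)
```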

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Return similar results using topics as input</b></p>

In [None]:
SELECT TOP 10 * FROM
TD_HNSWPREDICT (
    ON hnsw_model AS ModelTable
    ON topics_embeddings_Normalized AS InputTable DIMENSION
    USING
    IdColumn('id')
    VectorColumn('Vector_Data')
    EfSearch(16)
    TopK(2)
    OutputNearestVector('true')
) T ORDER BY 1,3;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Join the results back to the original complaints and topics</b></p>

In [None]:
SELECT TOP 10 c.id complaint_id, r.txt topic, c.txt complaint, d.distance

FROM (SELECT * FROM
TD_HNSWPREDICT (
    ON hnsw_model AS ModelTable
    ON topics_embeddings_Normalized AS InputTable DIMENSION
    USING
    IdColumn('id')
    VectorColumn('Vector_Data')
    EfSearch(16)
    TopK(2)
    OutputNearestVector('true')
) T) d
    
    
JOIN demo_ofs.CFPB_Complaints_1K c
    ON c.id = d.nearest_neighbor_id
JOIN demo_ofs.topics_of_interest r
    ON r.id = d.id

ORDER BY d.distance;

<hr>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Use TD_HNSWSummary to create a human-readable model output</b></p>

In [None]:
SELECT amp_id, graph_id, node_id, layer_id, input_row_id,
    CAST(node_vector AS VARCHAR(60)) AS node_vector,
    num_neighbors,
    CAST(neighbor_node_id AS VARCHAR(60)) AS neighbor_node_id,
    CAST(model_info AS VARCHAR(500)) AS model_info

FROM TD_HNSWSummary(
    ON hnsw_model AS ModelTable
) AS dt
ORDER BY 1,9;

In [None]:
%disconnect vs_demo