
# Create Managed Vector Search Index

This notebook walks through creating a **managed** Vector Search index for retrieval-augmented generation (RAG) applications. You will configure Databricks Vector Search to ingest data from a Delta table that contains text, precomputed embeddings, and metadata.

In [0]:
%pip install -U -qq databricks-vectorsearch databricks-sdk flashrank PyPDF2
dbutils.library.restartPython()

In [0]:
%run ../Includes/Classroom-Setup-03

**Other Conventions:**

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

## Step 1: Create a Vector Search Endpoint

To start, you need to create a Vector Search endpoint to serve the index.


### Step-by-Step Instructions:


**Vector Search Endpoint**: The first step in creating a Vector Search index is to create a compute endpoint. This endpoint has already been created for you in this lab environment.

**Wait for Endpoint to be Ready**: After defining the endpoint name, check the status of the endpoint using the provided function `wait_for_vs_endpoint_to_be_ready`.

Additionally, you can check the endpoint status in the Databricks workspace [Vector Search Endpoints in Compute section](#/setting/clusters/vector-search).

In [0]:
## assign the Vector Search endpoint by username
vs_endpoint_prefix = "vs_endpoint_"
vs_endpoint_name = vs_endpoint_prefix + str(get_fixed_integer(DA.unique_name("_")))
print(f"Assigned Vector Search endpoint name: {vs_endpoint_name}.")

In [0]:
import databricks.sdk.service.catalog as c
from databricks.vector_search.client import VectorSearchClient
from databricks.sdk import WorkspaceClient

vsc = VectorSearchClient(disable_notice=True)

## check the status of the endpoint.
wait_for_vs_endpoint_to_be_ready(vsc, vs_endpoint_name)
print(f"Endpoint named {vs_endpoint_name} is ready.")
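
The course helper `wait_for_vs_endpoint_to_be_ready` is not shown in this notebook. As a rough illustration of what such a helper might do, the sketch below polls a state-returning callable until it reports `ONLINE`. The name `wait_until_online` and its parameters are hypothetical. In a workspace, the callable could wrap `vsc.get_endpoint(vs_endpoint_name)`, assuming the returned dictionary exposes the state under `endpoint_status` and `state`.

```python
import time
from typing import Callable


def wait_until_online(
    get_state: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
) -> str:
    """Poll get_state() until it returns "ONLINE" or the timeout elapses.

    Hypothetical helper for illustration; not the course-provided function.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == "ONLINE":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint still in state {state!r} after {timeout_s} seconds"
            )
        time.sleep(poll_s)


# Possible usage inside a Databricks notebook (assumes the response layout noted above):
# wait_until_online(
#     lambda: vsc.get_endpoint(vs_endpoint_name)
#     .get("endpoint_status", {})
#     .get("state", "UNKNOWN")
# )
```

Passing the state lookup in as a callable keeps the polling logic independent of the client, so it can be exercised without a live endpoint.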

## Step 2: Create a Managed Vector Search Index

Now, connect the Delta table containing text and metadata with the Vector Search endpoint. In this step, you will create a **managed** index, which means you don't need to create the embeddings manually. For API details, check the [documentation page](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html#create-index-using-the-python-sdk).


**📌 Note 1: You will use the embeddings table that you created in the previous lab. If you haven't completed that lab, stop here and complete it first.**

**📌 Note 2:** Although the source table already has a precomputed embedding column, we will not use it here. Instead, we let the managed Vector Search index compute embeddings on the fly during data ingestion and at query time, so that we can test this capability.

**💡 Instructions:**

1. Define the source Delta table containing the text to be indexed.

1. Create a Vector Search index with these parameters: `content` as the source column and `databricks-gte-large-en` as the embedding model. The sync process should be manually triggered (`TRIGGERED` pipeline type).

1. Create or synchronize the Vector Search index based on the source Delta table.


In [0]:
# Enable Change Data Feed on the source table (required for Delta Sync indexes)
spark.sql(
    f"ALTER TABLE {DA.catalog_name}.{DA.schema_name}.pdf_text_embeddings "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)


In [0]:
# Define full table names
source_table_fullname = f"{DA.catalog_name}.{DA.schema_name}.pdf_text_embeddings"
vs_index_fullname = f"{DA.catalog_name}.{DA.schema_name}.pdf_text_managed_vs_index"

# Optional: Ensure the source table has an `id` column
# You can skip this part if you've already added it manually
from pyspark.sql.functions import monotonically_increasing_id

df = spark.table(source_table_fullname)
if 'id' not in df.columns:
    print("Adding 'id' column to the source table...")
    df = df.withColumn("id", monotonically_increasing_id())
    df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(source_table_fullname)
else:
    print("'id' column already present in the table.")

# Create or sync the index
if not index_exists(vsc, vs_endpoint_name, vs_index_fullname):
    print(f"Creating index '{vs_index_fullname}' on endpoint '{vs_endpoint_name}'...")
    
    vsc.create_delta_sync_index(
        endpoint_name=vs_endpoint_name,
        index_name=vs_index_fullname,
        source_table_name=source_table_fullname,
        pipeline_type="TRIGGERED",
        primary_key="id",
        embedding_source_column="content",  # The column containing raw text
        embedding_model_endpoint_name="databricks-gte-large-en"  # Pretrained embedding model
    )
else:
    print(f"Index '{vs_index_fullname}' already exists. Triggering sync...")
    vsc.get_index(vs_endpoint_name, vs_index_fullname).sync()

# Wait for the index to be ready
print("Waiting for the index to be ready...")
wait_for_index_to_be_ready(vsc, vs_endpoint_name, vs_index_fullname)
print("Index is ready.")


## Step 3: Search Documents Similar to the Query

Test the Vector Search index by searching for similar content based on a sample query.

**💡 Instructions:**

1. Get the index instance that we created.

1. Send a sample query to the index using **query text**. 🚨 Note: Because you created a managed index, you pass plain text through the `query_text` parameter, and the index computes the query embedding for you.

1. Retrieve the most similar content from the Vector Search index and inspect the results.

In [0]:
## get VS index
index = vsc.get_index(vs_endpoint_name, vs_index_fullname)

question = "What are the security and privacy concerns when training generative models?"

## search for similar documents  
results = index.similarity_search(
    query_text = question,
    columns=["pdf_name", "content"],
    num_results=4
    )

## show the results
docs = results.get("result", {}).get("data_array", [])

print(docs)
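
Each row in `data_array` is a plain list, which is hard to read when printed. The sketch below pairs each row with column names, assuming the response also carries a `manifest` whose `columns` entries each have a `name` (with the similarity score as the final column). The helper name `rows_to_dicts` and the sample response are illustrative, not part of the lab.

```python
from typing import Any


def rows_to_dicts(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert a similarity-search response into one dict per result row.

    Assumes column names live under response["manifest"]["columns"][i]["name"]
    and rows under response["result"]["data_array"].
    """
    columns = [col["name"] for col in response.get("manifest", {}).get("columns", [])]
    rows = response.get("result", {}).get("data_array", [])
    return [dict(zip(columns, row)) for row in rows]


# Made-up response with the assumed shape, for illustration only.
sample_response = {
    "manifest": {"columns": [{"name": "pdf_name"}, {"name": "content"}, {"name": "score"}]},
    "result": {"data_array": [["a.pdf", "first chunk", 0.71], ["b.pdf", "second chunk", 0.65]]},
}

for row in rows_to_dicts(sample_response):
    print(row["pdf_name"], row["score"])
```

In the notebook, calling `rows_to_dicts(results)` on the real response would serve the same purpose, provided the response matches the assumed layout.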

## Step 4: Re-rank Search Results

You have retrieved some documents that are similar to the query text. However, similarity search alone does not determine which of these documents are the most relevant to the question. Use the `flashrank` library to re-rank the results and show the top 3 most relevant documents.

**💡 Instructions:**

1. Define `flashrank` with **`rank-T5-flan`** model.

1. Re-rank the search results.

1. Show the most relevant **top 3** documents.


In [0]:
from flashrank import Ranker, RerankRequest

## define the ranker.
cache_dir = f"{DA.paths.working_dir}/opt"

ranker = Ranker(model_name="rank-T5-flan", cache_dir=cache_dir)

## format the result to align with reranker library format. 
passages = []
for doc in docs:
    new_doc = {"file": doc[0], "text": doc[1]}
    passages.append(new_doc)

## rerank the passages.
rerankrequest = RerankRequest(query=question, passages=passages)
ranked_passages = ranker.rerank(rerankrequest)

## show the top 3 results.
print(*ranked_passages[:3], sep="\n\n")
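
Printing the raw passage dictionaries can be noisy when the text chunks are long. As a minimal sketch, the helper below builds a compact summary of the top results, assuming each re-ranked passage is a dictionary that keeps the `file` and `text` keys supplied above and gains a numeric `score` key. The name `summarize_ranked` and the sample data are hypothetical.

```python
from typing import Any


def summarize_ranked(
    ranked: list[dict[str, Any]], k: int = 3, preview_chars: int = 80
) -> list[str]:
    """Return one summary line per passage for the k highest-scoring passages."""
    # Sort defensively by score, in case the input is not already ordered.
    top = sorted(ranked, key=lambda p: p["score"], reverse=True)[:k]
    return [
        f'{i}. {p.get("file", "?")} (score={float(p["score"]):.3f}): {p["text"][:preview_chars]}'
        for i, p in enumerate(top, start=1)
    ]


# Made-up passages with the assumed shape, for illustration only.
sample_ranked = [
    {"file": "a.pdf", "text": "alpha", "score": 0.2},
    {"file": "b.pdf", "text": "beta", "score": 0.9},
]
print(*summarize_ranked(sample_ranked, k=2), sep="\n")
```

In the notebook, `summarize_ranked(ranked_passages)` could replace the final `print` call if the re-ranker output matches this shape.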