## Install the LangChain integration for Databricks
The `databricks-langchain` package connects a LangChain application to Databricks services; this notebook uses its Vector Search and embeddings integrations.

In [0]:
%pip install -qU databricks-langchain

In [0]:
dbutils.library.restartPython()

## Import packages

In [0]:
from databricks.vector_search.client import VectorSearchClient
from databricks_langchain import DatabricksVectorSearch, DatabricksEmbeddings
from langchain_core.documents import Document

## Create VectorSearchClient

In [0]:
client = VectorSearchClient()

## Create Direct Vector Access Index

In [0]:
index_name = "vector_db_demo.default.direct_access_index"
endpoint_name = "vector-db"

index = client.create_direct_access_index(
    endpoint_name=endpoint_name,
    index_name=index_name,
    primary_key="id",
    embedding_dimension=1024,
    embedding_vector_column="text_vector",
    schema={
        "id": "string",
        "text": "string",
        "text_vector": "array<float>",
        "source": "string",
    },
)
index.describe()
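
The schema above implies that every row written to the index needs a string `id` and a `text_vector` with exactly 1024 floats. The following is a minimal sketch of a pre-flight check, assuming rows are plain dictionaries. `validate_row`, `SCHEMA`, and `EXPECTED_DIM` are hypothetical names used for illustration and are not part of the Databricks SDK.

```python
EXPECTED_DIM = 1024  # matches embedding_dimension in the index definition above

# Column name -> expected Python type, mirroring the index schema (illustrative only)
SCHEMA = {"id": str, "text": str, "text_vector": list, "source": str}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks valid."""
    problems = []
    for column, expected_type in SCHEMA.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"{column} should be {expected_type.__name__}")
    vector = row.get("text_vector")
    if isinstance(vector, list):
        if len(vector) != EXPECTED_DIM:
            problems.append(
                f"text_vector has {len(vector)} values, expected {EXPECTED_DIM}"
            )
        elif not all(isinstance(v, float) for v in vector):
            problems.append("text_vector must contain only floats")
    return problems


good_row = {"id": "1", "text": "hello", "text_vector": [0.0] * EXPECTED_DIM, "source": "demo"}
bad_row = {"id": 2, "text": "hello", "text_vector": [0.0] * 3}

print(validate_row(good_row))
print(validate_row(bad_row))
```

A check like this is optional here, because `DatabricksVectorSearch` computes the vectors from the embeddings model later in the notebook. It may be more useful when rows are written to the index directly.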

## Initialize databricks-gte-large-en embeddings model

In [0]:
embeddings = DatabricksEmbeddings(
    endpoint="databricks-gte-large-en",
)

## Initialize DatabricksVectorSearch client

In [0]:
vector_store = DatabricksVectorSearch(
    endpoint=endpoint_name,
    index_name=index_name,
    embedding=embeddings,
    text_column="text",
    columns=["source"]
)

## Add documents to Direct Vector Access Index

In [0]:
document_1 = Document(page_content="""What is Mosaic AI Vector Search?
Mosaic AI Vector Search is a vector database that is built into the Databricks Data Intelligence Platform and integrated with its governance and productivity tools. A vector database is a database that is optimized to store and retrieve embeddings. Embeddings are mathematical representations of the semantic content of data, typically text or image data. Embeddings are generated by a large language model and are a key component of many GenAI applications that depend on finding documents or images that are similar to each other. Examples are RAG systems, recommender systems, and image and video recognition.

With Mosaic AI Vector Search, you create a vector search index from a Delta table. The index includes embedded data with metadata. You can then query the index using a REST API to identify the most similar vectors and return the associated documents. You can structure the index to automatically sync when the underlying Delta table is updated.""", metadata={"source": "https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search"})

document_2 = Document(page_content="""How does Mosaic AI Vector Search work?
Mosaic AI Vector Search uses the Hierarchical Navigable Small World (HNSW) algorithm for its approximate nearest neighbor searches and the L2 distance metric to measure embedding vector similarity. If you want to use cosine similarity, you need to normalize your datapoint embeddings before feeding them into vector search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity.

Mosaic AI Vector Search also supports hybrid keyword-similarity search, which combines vector-based embedding search with traditional keyword-based search techniques. This approach matches exact words in the query while also using a vector-based similarity search to capture the semantic relationships and context of the query.

By integrating these two techniques, hybrid keyword-similarity search retrieves documents that contain not only the exact keywords but also those that are conceptually similar, providing more comprehensive and relevant search results. This method is particularly useful in RAG applications where source data has unique keywords such as SKUs or identifiers that are not well suited to pure similarity search.""", metadata={"source": "https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search"})

document_3 = Document(page_content="""How to set up Mosaic AI Vector Search
To use Mosaic AI Vector Search, you must create the following:

- A vector search endpoint. This endpoint serves the vector search index. You can query and update the endpoint using the REST API or the SDK. Endpoints scale automatically to support the size of the index or the number of concurrent requests. See Create a vector search endpoint for instructions.
- A vector search index. The vector search index is created from a Delta table and is optimized to provide real-time approximate nearest neighbor searches. The goal of the search is to identify documents that are similar to the query. Vector search indexes appear in and are governed by Unity Catalog. See Create a vector search index for instructions.
In addition, if you choose to have Databricks compute the embeddings, you can use a pre-configured Foundation Model APIs endpoint or create a model serving endpoint to serve the embedding model of your choice. See Pay-per-token Foundation Model APIs or Create generative AI model serving endpoints for instructions.

To query the model serving endpoint, you use either the REST API or the Python SDK. Your query can define filters based on any column in the Delta table. For details, see Use filters on queries, the API reference, or the Python SDK reference.""", metadata={"source": "https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search"})

documents = [document_1, document_2, document_3]

vector_store.add_documents(documents=documents, ids=["1", "2", "3"])
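
The second document states that L2 distance and cosine similarity produce the same ranking once vectors are normalized. The sketch below checks that claim on toy data in plain Python. The three-dimensional vectors and the candidate names are made up for illustration and have nothing to do with the real 1024-dimensional embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up for illustration)
query = [1.0, 2.0, 0.5]
candidates = {
    "a": [10.0, 19.0, 6.0],  # nearly the same direction as the query, but much longer
    "b": [1.0, 0.0, 0.0],
    "c": [0.0, 1.0, 1.0],
}

q_unit = normalize(query)
unit = {name: normalize(vec) for name, vec in candidates.items()}

# Ascending L2 distance on the raw, unnormalized vectors
by_raw_l2 = sorted(candidates, key=lambda n: l2_distance(query, candidates[n]))
# Ascending L2 distance on the normalized vectors
by_unit_l2 = sorted(unit, key=lambda n: l2_distance(q_unit, unit[n]))
# Descending cosine similarity on the raw vectors
by_cosine = sorted(
    candidates, key=lambda n: cosine_similarity(query, candidates[n]), reverse=True
)

print("raw L2:       ", by_raw_l2)
print("normalized L2:", by_unit_l2)
print("cosine:       ", by_cosine)
```

On this data the raw L2 ranking puts the long vector `a` last, while the normalized L2 ranking and the cosine ranking agree and put it first. For unit vectors the squared L2 distance equals 2 - 2 * cosine similarity, so the two orderings always match.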

## Basic Query

In [0]:
results = vector_store.similarity_search(
    query="what is mosaic vector search",
    k=1,
)

for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

## Query with Filter

In [0]:
results = vector_store.similarity_search(
    query="how do I setup the vector search",
    k=1,
    filter={"source": "https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search"},
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
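
All three documents in this notebook share the same `source` value, so the filter above does not exclude any of them. To show what an equality filter does when metadata values differ, here is a small in-memory sketch. It is a conceptual illustration only, not how Mosaic AI Vector Search implements filtering. `filtered_search`, the two-dimensional vectors, and the `docs`/`blog` source values are all hypothetical.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_search(rows, query_vector, k, filters=None):
    """Keep rows whose metadata equals every key/value in `filters`, then return the k nearest."""
    filters = filters or {}
    matching = [
        row
        for row in rows
        if all(row["metadata"].get(key) == value for key, value in filters.items())
    ]
    matching.sort(key=lambda row: l2_distance(row["vector"], query_vector))
    return matching[:k]


# Made-up rows with 2-dimensional vectors and hypothetical source values
rows = [
    {"id": "1", "vector": [0.0, 1.0], "metadata": {"source": "docs"}},
    {"id": "2", "vector": [1.0, 0.0], "metadata": {"source": "blog"}},
    {"id": "3", "vector": [0.9, 0.1], "metadata": {"source": "docs"}},
]

query_vector = [1.0, 0.0]
unfiltered = filtered_search(rows, query_vector, k=1)
only_docs = filtered_search(rows, query_vector, k=1, filters={"source": "docs"})

print([row["id"] for row in unfiltered])
print([row["id"] for row in only_docs])
```

Without a filter the nearest row is the `blog` entry. With the filter the search considers only `docs` rows, so it returns the closest of those instead.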