# Vector Search Python SDK example usage

This notebook demonstrates the Vector Search Python SDK, which provides `VectorSearchClient` as the primary API for working with Vector Search.

Alternatively, you may call the REST API directly.
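
As a rough sketch of the REST alternative, the snippet below builds (but does not send) a similarity query using only the standard library. The route `/api/2.0/vector-search/indexes/{index_name}/query`, the payload field names, and the workspace URL are assumptions that are not confirmed by this notebook. Check them against the REST reference for your workspace before relying on them.

```python
import json
import urllib.request


def build_query_request(workspace_url, index_name, token, query_text, columns, num_results=5):
    """Build (but do not send) a similarity-search request for one index."""
    # Assumed route and payload shape; confirm against the REST API reference.
    url = f"{workspace_url.rstrip('/')}/api/2.0/vector-search/indexes/{index_name}/query"
    payload = {"query_text": query_text, "columns": columns, "num_results": num_results}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder workspace URL and token, for illustration only.
request = build_query_request(
    "https://example.cloud.databricks.com",
    "vector_database.vector_search.product_vsindex",
    "<TOKEN>",
    query_text="Databases",
    columns=["id", "title"],
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the URL and payload easy to inspect before any network call is made.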

**Prerequisite**: This notebook assumes you have already created a Model Serving endpoint for the embedding model. See `embedding_model_endpoint` below, and the companion notebook for creating endpoints.

## Similarity search

Query the Vector Index to find similar documents!

In [0]:
%pip install --upgrade --force-reinstall databricks-vectorsearch
dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting databricks-vectorsearch
  Using cached databricks_vectorsearch-0.22-py3-none-any.whl (8.5 kB)
Collecting mlflow-skinny<3,>=2.4.0
  Using cached mlflow_skinny-2.10.2-py3-none-any.whl (4.8 MB)
Collecting protobuf<5,>=3.12.0
  Using cached protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Collecting requests>=2
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting cloudpickle<4
  Using cached cloudpickle-3.0.0-py3-none-any.whl (20 kB)
Collecting importlib-metadata!=4.7.0,<8,>=3.7.0
  Using cached importlib_metadata-7.0.1-py3-none-any.whl (23 kB)
Collecting entrypoints<1
  Using cached entrypoints-0.4-py3-none-any.whl (5.3 kB)
Collecting packaging<24
  Using cached packaging-23.2-py3-none-any.whl (53 kB)
Collecting pytz<2024
  Using cached pytz-2023.4-py2.py3-none-any.whl (506 kB)
Collecting pyyaml<7,>=5.1
  Using cached PyYAML-6.0.1-cp310-cp310

In [0]:
from databricks.vector_search.client import VectorSearchClient
# Automatically generates a PAT Token for authentication
vsc = VectorSearchClient()

# Alternatively, authenticate with a service principal (recommended outside development)
# vsc = VectorSearchClient(workspace_url=<WORKSPACE_URL>, service_principal_client_id=<CLIENT_ID>, service_principal_client_secret=<CLIENT_SECRET>)

[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().
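
The notice above recommends service principal authentication. One way to keep the credentials out of the notebook is to read them from environment variables and pass them as keyword arguments. This is a minimal sketch: the environment variable names are hypothetical choices, and `workspace_url` is assumed to be needed alongside the two service principal arguments.

```python
import os


def service_principal_kwargs(env=os.environ):
    """Collect VectorSearchClient keyword arguments from environment variables."""
    # Hypothetical variable names; use whatever your secret store provides.
    required = {
        "workspace_url": "DATABRICKS_HOST",
        "service_principal_client_id": "DATABRICKS_CLIENT_ID",
        "service_principal_client_secret": "DATABRICKS_CLIENT_SECRET",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in required.items()}


# Usage in a notebook with the SDK installed:
# vsc = VectorSearchClient(**service_principal_kwargs())
```

Failing early on a missing variable gives a clearer error than an authentication failure at query time.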


In [0]:
source_catalog = "vector_database"
source_schema = "vector_search"
source_table = "product"
source_table_fullname = f"{source_catalog}.{source_schema}.{source_table}"
vs_index = "product_vsindex"
vector_search_endpoint_name = "vector-search-demo-endpoint"
vs_index_fullname = f"{source_catalog}.{source_schema}.{vs_index}"

In [0]:
index = vsc.get_index(endpoint_name=vector_search_endpoint_name, index_name=vs_index_fullname)
index.describe()

{'name': 'vector_database.vector_search.product_vsindex1',
 'endpoint_name': 'vector-search-demo-endpoint',
 'primary_key': 'id',
 'index_type': 'DELTA_SYNC',
 'delta_sync_index_spec': {'source_table': 'vector_database.vector_search.product',
  'embedding_source_columns': [{'name': 'content',
    'embedding_model_endpoint_name': 'databricks-bge-large-en'}],
  'pipeline_type': 'TRIGGERED',
  'pipeline_id': 'ae9926f6-b87e-46f3-98b5-bf60eebb3edf'},
 'status': {'detailed_state': 'ONLINE_NO_PENDING_UPDATE',
  'message': 'Index creation succeeded. Check latest status: https://adb-2878440741389728.8.azuredatabricks.net/explore/data/vector_database/vector_search/product_vsindex1',
  'indexed_row_count': 108,
  'triggered_update_status': {'last_processed_commit_version': 0,
   'last_processed_commit_timestamp': '2024-02-20T05:46:13Z'},
  'ready': True,
  'index_url': 'adb-2878440741389728.8.azuredatabricks.net/api/2.0/vector-search/endpoints/vector-search-demo-endpoint/indexes/vector_database.v
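
Before querying, it can help to confirm that the index reports itself as ready. The sketch below reads the `status` fields that appear in the `describe()` output above. The helper names are illustrative, and the sample dictionary is a trimmed copy of that output rather than a live result.

```python
def index_is_ready(description):
    """Return True when a describe() result reports the index as ready."""
    status = description.get("status", {})
    return bool(status.get("ready", False))


def summarize_index(description):
    """Format the index name, detailed state, and row count on one line."""
    status = description.get("status", {})
    return (
        f"{description.get('name')}: {status.get('detailed_state')} "
        f"({status.get('indexed_row_count')} rows)"
    )


# Trimmed copy of the describe() output shown above.
sample = {
    "name": "vector_database.vector_search.product_vsindex1",
    "status": {
        "detailed_state": "ONLINE_NO_PENDING_UPDATE",
        "indexed_row_count": 108,
        "ready": True,
    },
}

if index_is_ready(sample):
    print(summarize_index(sample))
```

In the notebook, the same helpers could be called on `index.describe()` directly.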

### Performing a similarity search and converting the results to a DataFrame

In [0]:
from pyspark.sql.functions import asc, col
from pyspark.sql.types import DoubleType

# Request every column of the source table in the search results
all_columns = spark.table(source_table_fullname).columns

results = index.similarity_search(
  query_text="Databases",
  columns=all_columns)

# Each result row holds the requested columns followed by one extra value, named `distance` here
ls_results = results.get('result').get('data_array')
df = spark.createDataFrame(data=ls_results, schema="category STRING, comment STRING, id STRING, title STRING, distance STRING")
# Cast the last column from string to double so it sorts numerically
df = df.withColumn('distance', col('distance').cast(DoubleType()))
display(df)

category,comment,id,title,distance
Databases,"Azure Database for MySQL is a fully managed, scalable, and secure relational database service that enables you to build and manage MySQL applications in Azure. It provides features like automatic backups, monitoring, and high availability. Database for MySQL supports various data types, such as JSON, spatial, and full-text. You can use Azure Database for MySQL to migrate your existing applications, build new applications, and ensure the performance and security of your data. It also integrates with other Azure services, such as Azure App Service and Azure Data Factory.",66,Azure Database for MySQL,0.5787114
Databases,"Azure Database for MariaDB is a fully managed, scalable, and secure relational database service that enables you to build and manage MariaDB applications in Azure. It provides features like automatic backups, monitoring, and high availability. Database for MariaDB supports various data types, such as JSON, spatial, and full-text. You can use Azure Database for MariaDB to migrate your existing applications, build new applications, and ensure the performance and security of your data. It also integrates with other Azure services, such as Azure App Service and Azure Data Factory.",68,Azure Database for MariaDB,0.56967765
Databases,"Azure SQL Database is a fully managed relational database service based on the latest stable version of Microsoft SQL Server. It offers built-in intelligence that learns your application patterns and adapts to maximize performance, reliability, and data protection. SQL Database supports elastic scaling, allowing you to dynamically adjust resources to match your workload. It provides advanced security features, such as encryption, auditing, and threat detection. You can migrate your existing SQL Server databases to Azure SQL Database with minimal downtime.",5,Azure SQL Database,0.5664096
Databases,"Azure Cosmos DB is a fully managed, globally distributed, multi-model database service designed for building highly responsive and scalable applications. It offers turnkey global distribution, automatic and instant scalability, and guarantees low latency, high availability, and consistency. Cosmos DB supports popular NoSQL APIs, including MongoDB, Cassandra, Gremlin, and Azure Table Storage. You can build globally distributed applications with ease, without having to deal with complex configuration and capacity planning. Data stored in Cosmos DB is automatically indexed, enabling you to query your data with SQL, JavaScript, or other supported query languages.",6,Azure Cosmos DB,0.5639423
Databases,"Azure Database for PostgreSQL is a fully managed, scalable, and secure relational database service that enables you to build and manage PostgreSQL applications in Azure. It provides features like automatic backups, monitoring, and high availability. Database for PostgreSQL supports various data types, such as JSON, spatial, and full-text. You can use Azure Database for PostgreSQL to migrate your existing applications, build new applications, and ensure the performance and security of your data. It also integrates with other Azure services, such as Azure App Service and Azure Data Factory.",67,Azure Database for PostgreSQL,0.5619546
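
When Spark is not needed, the `data_array` rows can also be turned into plain dictionaries. This is a small sketch that assumes the caller supplies the column names in the same order as the values in each row. The sample response is a shortened stand-in with the `comment` column left out, not real SDK output.

```python
def rows_to_dicts(response, column_names):
    """Pair each row of result.data_array with the given column names."""
    rows = (response.get("result") or {}).get("data_array") or []
    records = []
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError(
                f"Expected {len(column_names)} values per row, got {len(row)}"
            )
        records.append(dict(zip(column_names, row)))
    return records


# Shortened stand-in for a similarity_search response (comment column omitted).
sample_response = {
    "result": {
        "data_array": [
            ["Databases", "66", "Azure Database for MySQL", 0.5787114],
            ["Databases", "5", "Azure SQL Database", 0.5664096],
        ]
    }
}

records = rows_to_dicts(sample_response, ["category", "id", "title", "distance"])
for record in records:
    print(record["title"], record["distance"])
```

The length check catches a mismatch between the requested columns and the hard-coded names before it silently mislabels values.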


### Returning the best five search results

In [0]:
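# Sorts ascending on `distance`. If that column actually holds a similarity score
# (larger means closer), sort descending instead to list the best match first.
df_result = df.select(df.title, df.category).sort(asc('distance')).limit(5)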
display(df_result)

title,category
Azure Database for PostgreSQL,Databases
Azure Cosmos DB,Databases
Azure SQL Database,Databases
Azure Database for MariaDB,Databases
Azure Database for MySQL,Databases
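
The sort direction decides which hit counts as "best". The sketch below ranks the five values shown earlier in plain Python and makes the direction an explicit flag. Whether the last column is a distance (smaller is closer) or a similarity score (larger is closer) is an assumption to verify for your index. The helper `top_k` is illustrative and not part of the SDK.

```python
def top_k(hits, k=5, higher_is_better=True):
    """Return the k best (title, value) pairs under the chosen ordering."""
    return sorted(hits, key=lambda hit: hit[1], reverse=higher_is_better)[:k]


# (title, value) pairs copied from the search output above.
hits = [
    ("Azure Database for MySQL", 0.5787114),
    ("Azure Database for MariaDB", 0.56967765),
    ("Azure SQL Database", 0.5664096),
    ("Azure Cosmos DB", 0.5639423),
    ("Azure Database for PostgreSQL", 0.5619546),
]

# Treating the value as a similarity score: largest first.
print([title for title, _ in top_k(hits, k=2)])
# Treating the value as a distance: smallest first.
print([title for title, _ in top_k(hits, k=2, higher_is_better=False)])
```

Keeping the direction as a named parameter makes the ranking assumption visible at the call site.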


## Delete vector index

In [0]:
vsc.delete_index(endpoint_name=vector_search_endpoint_name, index_name=vs_index_fullname)

{}