# Vector Search Python SDK example usage

This notebook demonstrates usage of the Vector Search Python SDK, which provides a `VectorSearchClient` as a primary API for working with Vector Search.

Alternatively, you can call the REST API directly.
For additional documentation, see
https://learn.microsoft.com/en-us/azure/databricks/generative-ai/create-query-vector-search and
https://www.databricks.com/blog/introducing-databricks-vector-search-public-preview
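
For readers who prefer the REST route, the following is a minimal sketch of how a request for a single Vector Search endpoint might be built with the standard library. The path prefix `/api/2.0/vector-search/endpoints/` matches the `index_url` this notebook prints later. The helper name `build_get_endpoint_request`, the example workspace URL, and the `<TOKEN>` placeholder are assumptions for illustration only. The request is built but not sent.

```python
import urllib.request


def build_get_endpoint_request(
    workspace_url: str, token: str, endpoint_name: str
) -> urllib.request.Request:
    """Build (but do not send) a GET request for one Vector Search endpoint.

    Hypothetical helper: the workspace URL and token are supplied by the caller.
    """
    url = f"{workspace_url.rstrip('/')}/api/2.0/vector-search/endpoints/{endpoint_name}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values, for illustration only.
req = build_get_endpoint_request(
    "https://example.azuredatabricks.net", "<TOKEN>", "vector-search-demo-endpoint"
)
print(req.get_method(), req.full_url)
```

To send the request, pass `req` to `urllib.request.urlopen` with a real workspace URL and token. The SDK calls used in the rest of this notebook wrap the same kind of HTTP request.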

**Prerequisite**: This notebook assumes you have already created a Model Serving endpoint for the embedding model. See `embedding_model_endpoint` below and the companion notebook for creating endpoints.

In [0]:
%pip install --upgrade --force-reinstall databricks-vectorsearch
dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting databricks-vectorsearch
  Downloading databricks_vectorsearch-0.22-py3-none-any.whl (8.5 kB)
Collecting protobuf<5,>=3.12.0
  Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Collecting requests>=2
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Collecting mlflow-skinny<3,>=2.4.0
  Downloading mlflow_skinny-2.10.2-py3-none-any.whl (4.8 MB)
Collecting entrypoints<1
  Downloading entrypoints-0.4-py3-none-any.whl (5.3 kB)
Collecting packaging<24
  Using cached packaging-23.2-py3-none-any.whl (53 kB)
Collecting pytz<2024
  Downloading pytz-2023.4-py2.py3-none-any.whl (506 kB)

In [0]:
from databricks.vector_search.client import VectorSearchClient
# Automatically generates a PAT Token for authentication
vsc = VectorSearchClient()

# Alternatively, authenticate with a service principal instead of the notebook token:
# vsc = VectorSearchClient(service_principal_client_id=<CLIENT_ID>, service_principal_client_secret=<CLIENT_SECRET>)

[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().
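
The notice above recommends service principal authentication outside of development. One way to switch between the two modes without editing code is to build the constructor arguments from configuration. The sketch below is a minimal illustration: the keyword names come from the `VectorSearchClient` signature printed by `help()` in this notebook, while the environment variable names (`VS_WORKSPACE_URL`, `VS_SP_CLIENT_ID`, `VS_SP_CLIENT_SECRET`) and the helper name are hypothetical.

```python
import os
from typing import Mapping, Optional


def client_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments for VectorSearchClient(...).

    The environment variable names are hypothetical; adapt them to your setup.
    An empty dict means "fall back to the notebook authentication token".
    """
    env = os.environ if env is None else env
    client_id = env.get("VS_SP_CLIENT_ID")
    client_secret = env.get("VS_SP_CLIENT_SECRET")
    if not (client_id and client_secret):
        return {}
    kwargs = {
        "service_principal_client_id": client_id,
        "service_principal_client_secret": client_secret,
    }
    # Only pass workspace_url when it is explicitly configured.
    workspace_url = env.get("VS_WORKSPACE_URL")
    if workspace_url:
        kwargs["workspace_url"] = workspace_url
    return kwargs


# Usage inside the notebook (not executed here):
# vsc = VectorSearchClient(**client_kwargs_from_env())
```

On Databricks, secrets would more typically be read with `dbutils.secrets.get` than from environment variables. The mapping-based signature keeps the helper easy to test either way.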


In [0]:
help(VectorSearchClient)

Help on class VectorSearchClient in module databricks.vector_search.client:

class VectorSearchClient(builtins.object)
 |  VectorSearchClient(workspace_url=None, personal_access_token=None, service_principal_client_id=None, service_principal_client_secret=None, azure_tenant_id=None, azure_login_id=None, disable_notice=False)
 |  
 |  Methods defined here:
 |  
 |  __init__(self, workspace_url=None, personal_access_token=None, service_principal_client_id=None, service_principal_client_secret=None, azure_tenant_id=None, azure_login_id=None, disable_notice=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  create_delta_sync_index(self, endpoint_name, index_name, primary_key, source_table_name, pipeline_type, embedding_dimension=None, embedding_vector_column=None, embedding_source_column=None, embedding_model_endpoint_name=None)
 |  
 |  create_direct_access_index(self, endpoint_name, index_name, primary_key, embedding_dimension, embedding_vector_column,

## Load sample dataset Products.json into source Delta table

The following cells define the source table name, read the sample JSON file, and write it to the source Delta table.

In [0]:
source_catalog = "vector_database"
source_schema = "vector_search"
source_table = "product"
source_table_fullname = f"{source_catalog}.{source_schema}.{source_table}"

In [0]:
# Uncomment the below, if you need to create a catalog for the source data.
# spark.sql(f"CREATE CATALOG IF NOT EXISTS {source_catalog}")
# Uncomment to create the source schema, if needed.
# spark.sql(f"CREATE SCHEMA IF NOT EXISTS {source_catalog}.{source_schema} COMMENT 'This is a schema for source data for Vector Search indexes.'")

In [0]:
# Uncomment to start from scratch by dropping the existing table.
# spark.sql(f"DROP TABLE IF EXISTS {source_table_fullname}")

In [0]:
source_df = spark.read.option("multiline", "true").json("dbfs:/mnt/hvdata/product_docs.json")
display(source_df)

category,content,id,title
Web,"Azure App Service is a fully managed platform for building, deploying, and scaling web apps. You can host web apps, mobile app backends, and RESTful APIs. It supports a variety of programming languages and frameworks, such as .NET, Java, Node.js, Python, and PHP. The service offers built-in auto-scaling and load balancing capabilities. It also provides integration with other Azure services, such as Azure DevOps, GitHub, and Bitbucket.",1,Azure App Service
Compute,"Azure Functions is a serverless compute service that enables you to run code on-demand without having to manage infrastructure. It allows you to build and deploy event-driven applications that automatically scale with your workload. Functions support various languages, including C#, F#, Node.js, Python, and Java. It offers a variety of triggers and bindings to integrate with other Azure services and external services. You only pay for the compute time you consume.",2,Azure Functions
AI + Machine Learning,"Azure Cognitive Services are a set of AI services that enable you to build intelligent applications with powerful algorithms using just a few lines of code. These services cover a wide range of capabilities, including vision, speech, language, knowledge, and search. They are designed to be easy to use and integrate into your applications. Cognitive Services are fully managed, scalable, and continuously improved by Microsoft. It allows developers to create AI-powered solutions without deep expertise in machine learning.",3,Azure Cognitive Services
Storage,"Azure Storage is a scalable, durable, and highly available cloud storage service that supports a variety of data types, including blobs, files, queues, and tables. It provides a massively scalable object store for unstructured data. Storage supports data redundancy and geo-replication, ensuring high durability and availability. It offers a variety of data access and management options, including REST APIs, SDKs, and Azure Portal. You can secure your data using encryption at rest and in transit.",4,Azure Storage
Databases,"Azure SQL Database is a fully managed relational database service based on the latest stable version of Microsoft SQL Server. It offers built-in intelligence that learns your application patterns and adapts to maximize performance, reliability, and data protection. SQL Database supports elastic scaling, allowing you to dynamically adjust resources to match your workload. It provides advanced security features, such as encryption, auditing, and threat detection. You can migrate your existing SQL Server databases to Azure SQL Database with minimal downtime.",5,Azure SQL Database
Databases,"Azure Cosmos DB is a fully managed, globally distributed, multi-model database service designed for building highly responsive and scalable applications. It offers turnkey global distribution, automatic and instant scalability, and guarantees low latency, high availability, and consistency. Cosmos DB supports popular NoSQL APIs, including MongoDB, Cassandra, Gremlin, and Azure Table Storage. You can build globally distributed applications with ease, without having to deal with complex configuration and capacity planning. Data stored in Cosmos DB is automatically indexed, enabling you to query your data with SQL, JavaScript, or other supported query languages.",6,Azure Cosmos DB
Containers,"Azure Kubernetes Service (AKS) is a managed container orchestration service based on the popular open-source Kubernetes system. It simplifies Kubernetes deployment and management, making it easy for developers and administrators to deploy, scale, and manage containerized applications. AKS offers automatic upgrades, scaling, and self-healing capabilities, reducing the operational overhead of managing Kubernetes clusters. It also integrates with Azure services like Azure Active Directory, Azure Monitor, and Azure Policy, providing a seamless experience for managing your applications and infrastructure.",7,Azure Kubernetes Service (AKS)
Compute,"Azure Virtual Machines (VMs) is an Infrastructure-as-a-Service (IaaS) offering that allows you to deploy and manage virtual machines in the cloud. You can choose from a wide range of VM sizes, operating systems, and software configurations. VMs support various operating systems, including Windows Server, Linux, and SQL Server. You can scale your VMs up or down as needed, and pay only for the resources you use. VMs provide built-in security features, such as Azure Security Center, Azure Active Directory, and encryption.",8,Azure Virtual Machines
Developer Tools,"Azure DevOps is a suite of services that help you plan, build, and deploy applications. It includes Azure Boards for work item tracking, Azure Repos for source code management, Azure Pipelines for continuous integration and continuous deployment, Azure Test Plans for manual and automated testing, and Azure Artifacts for package management. DevOps supports a wide range of programming languages, frameworks, and platforms, making it easy to integrate with your existing development tools and processes. It also integrates with other Azure services, such as Azure App Service and Azure Functions.",9,Azure DevOps
Internet of Things,"Azure IoT Hub is a managed service that enables you to connect, monitor, and manage billions of IoT devices. It provides secure and reliable communication between your IoT devices and your backend solution. IoT Hub supports multiple communication protocols, including MQTT, AMQP, and HTTPS. It offers device-to-cloud and cloud-to-device messaging, device management, and device twin capabilities. With IoT Hub, you can build scalable and secure IoT solutions that integrate with other Azure services and custom applications.",10,Azure IoT Hub


In [0]:
source_df.write.format("delta").option("delta.enableChangeDataFeed", "true").saveAsTable(source_table_fullname)

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-3694651593041713>, line 1
----> 1 source_df.write.format("delta").option("delta.enableChangeDataFeed", "true").saveAsTable(source_table_fullname)

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_

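The traceback above is truncated, so the exact cause of the `AnalysisException` is not visible. One common cause is that the table already exists, because `saveAsTable` defaults to failing in that case. The sketch below shows one way to make the write cell safe to re-run: create the table only when it is missing, and otherwise make sure the change data feed property is enabled. The helper names are hypothetical, and `spark.catalog.tableExists` assumes a reasonably recent PySpark (3.3 or later).

```python
def enable_cdf_sql(table_fullname: str) -> str:
    """SQL that enables the Delta change data feed on an existing table."""
    return (
        f"ALTER TABLE {table_fullname} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def write_source_table(spark, df, table_fullname: str) -> bool:
    """Create the Delta table if it is missing; return True if it was created.

    Hypothetical helper: `spark` is the notebook's SparkSession and `df` the
    DataFrame to persist. An existing table is left in place and only has its
    change data feed property set.
    """
    if spark.catalog.tableExists(table_fullname):
        spark.sql(enable_cdf_sql(table_fullname))
        return False
    (
        df.write.format("delta")
        .option("delta.enableChangeDataFeed", "true")
        .saveAsTable(table_fullname)
    )
    return True


# Usage inside the notebook (not executed here):
# write_source_table(spark, source_df, source_table_fullname)
```
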
In [0]:
display(spark.sql(f"SELECT * FROM {source_table_fullname}"))

category,content,id,title
Web,"Azure App Service is a fully managed platform for building, deploying, and scaling web apps. You can host web apps, mobile app backends, and RESTful APIs. It supports a variety of programming languages and frameworks, such as .NET, Java, Node.js, Python, and PHP. The service offers built-in auto-scaling and load balancing capabilities. It also provides integration with other Azure services, such as Azure DevOps, GitHub, and Bitbucket.",1,Azure App Service
Compute,"Azure Functions is a serverless compute service that enables you to run code on-demand without having to manage infrastructure. It allows you to build and deploy event-driven applications that automatically scale with your workload. Functions support various languages, including C#, F#, Node.js, Python, and Java. It offers a variety of triggers and bindings to integrate with other Azure services and external services. You only pay for the compute time you consume.",2,Azure Functions
AI + Machine Learning,"Azure Cognitive Services are a set of AI services that enable you to build intelligent applications with powerful algorithms using just a few lines of code. These services cover a wide range of capabilities, including vision, speech, language, knowledge, and search. They are designed to be easy to use and integrate into your applications. Cognitive Services are fully managed, scalable, and continuously improved by Microsoft. It allows developers to create AI-powered solutions without deep expertise in machine learning.",3,Azure Cognitive Services
Storage,"Azure Storage is a scalable, durable, and highly available cloud storage service that supports a variety of data types, including blobs, files, queues, and tables. It provides a massively scalable object store for unstructured data. Storage supports data redundancy and geo-replication, ensuring high durability and availability. It offers a variety of data access and management options, including REST APIs, SDKs, and Azure Portal. You can secure your data using encryption at rest and in transit.",4,Azure Storage
Databases,"Azure SQL Database is a fully managed relational database service based on the latest stable version of Microsoft SQL Server. It offers built-in intelligence that learns your application patterns and adapts to maximize performance, reliability, and data protection. SQL Database supports elastic scaling, allowing you to dynamically adjust resources to match your workload. It provides advanced security features, such as encryption, auditing, and threat detection. You can migrate your existing SQL Server databases to Azure SQL Database with minimal downtime.",5,Azure SQL Database
Databases,"Azure Cosmos DB is a fully managed, globally distributed, multi-model database service designed for building highly responsive and scalable applications. It offers turnkey global distribution, automatic and instant scalability, and guarantees low latency, high availability, and consistency. Cosmos DB supports popular NoSQL APIs, including MongoDB, Cassandra, Gremlin, and Azure Table Storage. You can build globally distributed applications with ease, without having to deal with complex configuration and capacity planning. Data stored in Cosmos DB is automatically indexed, enabling you to query your data with SQL, JavaScript, or other supported query languages.",6,Azure Cosmos DB
Containers,"Azure Kubernetes Service (AKS) is a managed container orchestration service based on the popular open-source Kubernetes system. It simplifies Kubernetes deployment and management, making it easy for developers and administrators to deploy, scale, and manage containerized applications. AKS offers automatic upgrades, scaling, and self-healing capabilities, reducing the operational overhead of managing Kubernetes clusters. It also integrates with Azure services like Azure Active Directory, Azure Monitor, and Azure Policy, providing a seamless experience for managing your applications and infrastructure.",7,Azure Kubernetes Service (AKS)
Compute,"Azure Virtual Machines (VMs) is an Infrastructure-as-a-Service (IaaS) offering that allows you to deploy and manage virtual machines in the cloud. You can choose from a wide range of VM sizes, operating systems, and software configurations. VMs support various operating systems, including Windows Server, Linux, and SQL Server. You can scale your VMs up or down as needed, and pay only for the resources you use. VMs provide built-in security features, such as Azure Security Center, Azure Active Directory, and encryption.",8,Azure Virtual Machines
Developer Tools,"Azure DevOps is a suite of services that help you plan, build, and deploy applications. It includes Azure Boards for work item tracking, Azure Repos for source code management, Azure Pipelines for continuous integration and continuous deployment, Azure Test Plans for manual and automated testing, and Azure Artifacts for package management. DevOps supports a wide range of programming languages, frameworks, and platforms, making it easy to integrate with your existing development tools and processes. It also integrates with other Azure services, such as Azure App Service and Azure Functions.",9,Azure DevOps
Internet of Things,"Azure IoT Hub is a managed service that enables you to connect, monitor, and manage billions of IoT devices. It provides secure and reliable communication between your IoT devices and your backend solution. IoT Hub supports multiple communication protocols, including MQTT, AMQP, and HTTPS. It offers device-to-cloud and cloud-to-device messaging, device management, and device twin capabilities. With IoT Hub, you can build scalable and secure IoT solutions that integrate with other Azure services and custom applications.",10,Azure IoT Hub



## Create Vector Search Endpoint

In [0]:
vector_search_endpoint_name = "vector-search-demo-endpoint"

In [0]:
vsc.create_endpoint(
    name=vector_search_endpoint_name,
    endpoint_type="STANDARD"
)

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-cf4ea018-c031-4169-be54-a5499b2024df/lib/python3.10/site-packages/databricks/vector_search/utils.py:123, in RequestUtils.issue_request(url, token, method, params, json, verify)
    122 try:
--> 123     response.raise_for_status()
    124 except Exception as e:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-cf4ea018-c031-4169-be54-a5499b2024df/lib/python3.10/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1020 if http_error_msg:
-> 1021     raise

In [0]:
endpoint = vsc.get_endpoint(name=vector_search_endpoint_name)
endpoint

{'name': 'vector-search-demo-endpoint',
 'creator': 'hema.verma@mngenvmcap040685.onmicrosoft.com',
 'creation_timestamp': 1707500999351,
 'last_updated_timestamp': 1707500999351,
 'endpoint_type': 'STANDARD',
 'last_updated_user': 'hema.verma@mngenvmcap040685.onmicrosoft.com',
 'id': 'f4ccadd6-e96a-43b5-b62e-e38a7e3030a3',
 'endpoint_status': {'state': 'ONLINE'},
 'num_indexes': 1,
 'units': 1}
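
An index can only be created once the endpoint is usable, so it can help to wait until `endpoint_status.state` reports `ONLINE`, as in the output above. The following is a minimal polling sketch built only on `get_endpoint(name=...)` and the dictionary shape shown in that output. The helper name, the timeout, and the poll interval are assumptions for illustration.

```python
import time


def wait_for_endpoint_online(client, name, timeout_s=1200, poll_s=15, sleep=time.sleep):
    """Poll client.get_endpoint(name=...) until the endpoint state is ONLINE.

    Hypothetical helper. `client` is any object with a get_endpoint(name=...)
    method returning a dict shaped like the output above. Raises TimeoutError
    if the endpoint is not ONLINE within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = client.get_endpoint(name=name).get("endpoint_status", {}).get("state")
        if state == "ONLINE":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {name!r} not ONLINE (last state: {state!r})")
        sleep(poll_s)


# Usage inside the notebook (not executed here):
# wait_for_endpoint_online(vsc, vector_search_endpoint_name)
```

The `sleep` parameter exists only so the loop can be exercised in tests without real delays.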

## Create vector index

In [0]:
# Vector index
vs_index = "product_vsindex"
vs_index_fullname = f"{source_catalog}.{source_schema}.{vs_index}"
embedding_model_endpoint = "databricks-bge-large-en"


In [0]:
index = vsc.create_delta_sync_index(
  endpoint_name=vector_search_endpoint_name,
  source_table_name=source_table_fullname,
  index_name=vs_index_fullname,
  pipeline_type='TRIGGERED',
  primary_key="id",
  embedding_source_column="content",
  embedding_model_endpoint_name=embedding_model_endpoint
)
index.describe()

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-cf4ea018-c031-4169-be54-a5499b2024df/lib/python3.10/site-packages/databricks/vector_search/utils.py:123, in RequestUtils.issue_request(url, token, method, params, json, verify)
    122 try:
--> 123     response.raise_for_status()
    124 except Exception as e:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-cf4ea018-c031-4169-be54-a5499b2024df/lib/python3.10/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1020 if http_error_msg:
-> 1021     raise

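The `create_delta_sync_index` signature printed by `help()` earlier accepts two groups of embedding arguments: `embedding_source_column` with `embedding_model_endpoint_name` (the combination used in the cell above, where the service computes embeddings from text), and `embedding_vector_column` with `embedding_dimension` (for precomputed vectors). The sketch below validates that exactly one complete group is supplied before the call is made. The helper name is hypothetical, and treating the two groups as mutually exclusive is an assumption of this sketch rather than documented SDK behavior.

```python
def embedding_kwargs(
    *, source_column=None, model_endpoint_name=None, vector_column=None, dimension=None
) -> dict:
    """Return embedding keyword arguments for create_delta_sync_index.

    Hypothetical helper: accepts either (source_column, model_endpoint_name)
    or (vector_column, dimension), and raises ValueError otherwise.
    """
    managed = source_column is not None or model_endpoint_name is not None
    self_managed = vector_column is not None or dimension is not None
    if managed == self_managed:
        raise ValueError("Provide exactly one group of embedding arguments.")
    if managed:
        if not (source_column and model_endpoint_name):
            raise ValueError("Both source_column and model_endpoint_name are required.")
        return {
            "embedding_source_column": source_column,
            "embedding_model_endpoint_name": model_endpoint_name,
        }
    if not (vector_column and dimension):
        raise ValueError("Both vector_column and dimension are required.")
    return {
        "embedding_vector_column": vector_column,
        "embedding_dimension": dimension,
    }


# Usage inside the notebook (not executed here):
# vsc.create_delta_sync_index(
#     endpoint_name=vector_search_endpoint_name,
#     source_table_name=source_table_fullname,
#     index_name=vs_index_fullname,
#     pipeline_type="TRIGGERED",
#     primary_key="id",
#     **embedding_kwargs(source_column="content", model_endpoint_name=embedding_model_endpoint),
# )
```
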
## Get a vector index  

Use the `get_index()` method to retrieve the vector index object by its endpoint name and index name. You can also call `describe()` on the index object to see a summary of the index's configuration and status.
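
The `describe()` summary in this section includes a `status` dictionary with a boolean `ready` field. A minimal sketch of waiting for that flag before querying is shown below. It relies only on `describe()` and the output shape printed in this notebook. The helper name, the timeout, and the poll interval are assumptions for illustration.

```python
import time


def wait_for_index_ready(index, timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll index.describe() until status['ready'] is true; return the status.

    Hypothetical helper. `index` is any object whose describe() returns a dict
    with a 'status' entry like the one shown in this notebook. Raises
    TimeoutError if the index is not ready within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = index.describe().get("status", {})
        if status.get("ready"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Index not ready (last state: {status.get('detailed_state')!r})"
            )
        sleep(poll_s)


# Usage inside the notebook (not executed here):
# index = vsc.get_index(endpoint_name=vector_search_endpoint_name, index_name=vs_index_fullname)
# wait_for_index_ready(index)
```
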

In [0]:
index = vsc.get_index(endpoint_name=vector_search_endpoint_name, index_name=vs_index_fullname)
index.describe()

{'name': 'vector_database.vector_search.product_vsindex',
 'endpoint_name': 'vector-search-demo-endpoint',
 'primary_key': 'id',
 'index_type': 'DELTA_SYNC',
 'delta_sync_index_spec': {'source_table': 'vector_database.vector_search.product',
  'embedding_source_columns': [{'name': 'content',
    'embedding_model_endpoint_name': 'databricks-bge-large-en'}],
  'pipeline_type': 'TRIGGERED',
  'pipeline_id': '5ebe5888-5602-4924-bf68-c6001dfe1ce5'},
 'status': {'detailed_state': 'ONLINE_NO_PENDING_UPDATE',
  'message': 'Index creation succeeded. Check latest status: https://adb-2878440741389728.8.azuredatabricks.net/explore/data/vector_database/vector_search/product_vsindex',
  'indexed_row_count': 108,
  'triggered_update_status': {'last_processed_commit_version': 0,
   'last_processed_commit_timestamp': '2024-02-20T05:46:13Z'},
  'ready': True,
  'index_url': 'adb-2878440741389728.8.azuredatabricks.net/api/2.0/vector-search/endpoints/vector-search-demo-endpoint/indexes/vector_database.vec