# LangChain Vector Search with Azure Cognitive Search

Use Azure Cognitive Search to retrieve relevant content and build an effective prompt for Azure OpenAI. The example below uses LangChain modules to perform the task.

## Prerequisites
You need:
- [Python 3][Python 3.x]
  - Your Python installation should include [pip](https://pip.pypa.io/en/stable/)
- The following Python packages:
  - `pip install openai pandas langchain azure-identity azure-search`
  - `pip install --index-url=https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/ azure-search-documents==11.4.0a20230509004`
- [Jupyter Notebook][Notebook]
- An instance of Azure OpenAI
  - Set the OS environment variables <b>OPENAI_API_TYPE</b>, <b>OPENAI_API_BASE</b>, <b>OPENAI_API_KEY</b>, <b>OPENAI_API_VERSION</b>
  - Deploy the text-embedding-ada-002 model
    - Set the OS environment variable <b>EMBEDDINGS_TEXT_MODEL_DEPLOYMENT_NAME</b> to the deployment name
- An instance of Azure Cognitive Search, along with its endpoint and admin key
  - Set the OS environment variables <b>COGNITIVE_SEARCH_ENDPOINT_VALUE</b>, <b>COGNITIVE_SEARCH_KEY_VALUE</b>
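
Before running the notebook, it can help to confirm that the environment variables listed above are actually set. The following is a minimal sketch and not part of the original notebook; the helper name `find_missing_env_vars` is hypothetical, and only the variable names come from the prerequisites.

```python
import os

# Environment variable names taken from the prerequisites above.
REQUIRED_ENV_VARS = [
    "OPENAI_API_TYPE",
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
    "OPENAI_API_VERSION",
    "EMBEDDINGS_TEXT_MODEL_DEPLOYMENT_NAME",
    "COGNITIVE_SEARCH_ENDPOINT_VALUE",
    "COGNITIVE_SEARCH_KEY_VALUE",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this check first gives a clear message instead of a less obvious authentication error later in the notebook.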

#### References
- [Azure OpenAI](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/overview)
- [LangChain home page](https://python.langchain.com/docs/get_started/introduction.html)
- [Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search)
- [Azure Cognitive Search as vector store](https://github.com/hwchase17/langchain/pull/5146/files/ef78d38fd12a6edcf6b04ab06493305d0d601ac3..f9b67d653854ef08e3dc56563964bb86deba9d8e)
- [LangChain Data connection Vector store integration with Azure Cognitive Search](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/azuresearch)

[Python 3.x]: https://www.python.org/
[Notebook]: https://docs.jupyter.org/en/latest/install/notebook-classic.html

In [1]:
# Get all the config params
import openai
import sys
import os

openai.api_key = os.getenv('OPENAI_API_KEY')
openai.api_base = os.getenv('OPENAI_API_BASE')
# Example value: 2023-03-15-preview
openai.api_version = os.getenv('OPENAI_API_VERSION')
openai.api_type = os.getenv('OPENAI_API_TYPE')

deployedEmbeddings = os.getenv('EMBEDDINGS_TEXT_MODEL_DEPLOYMENT_NAME')
azureSearchAdminKey = os.getenv('COGNITIVE_SEARCH_KEY_VALUE')
azureSearchEndpoint = os.getenv('COGNITIVE_SEARCH_ENDPOINT_VALUE')


#### Create the Search Index in Azure Cognitive Search
<font color=red>Note: This deletes any existing index with the same name.</font>

In [4]:
from AzureOpenAIUtil.AzureCognitiveSearchIndex import AzureCognitiveSearchIndex
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents.indexes import SearchIndexClient  

MY_SEARCH_INDEX_NAME = "this-demo-search_index"

theSearchIndexInstance = AzureCognitiveSearchIndex()
theSearchIndex = theSearchIndexInstance.create_index(
                    index_name=MY_SEARCH_INDEX_NAME, 
                    title_field="title", # name of the title column in the CSV file
                    keywords_field="bill_id" # name of the bill ID column in the CSV file
                )
#print(f' {theSearchIndex} created')

theSearchCredential = AzureKeyCredential(azureSearchAdminKey)
theSearchClient = SearchIndexClient(
                        endpoint=azureSearchEndpoint, 
                        credential=theSearchCredential
                    )
# Delete the index if it exists, for a clean start
theSearchClient.delete_index(MY_SEARCH_INDEX_NAME)
# Add the index
result = theSearchClient.create_or_update_index(theSearchIndex)
print(f'Created index {result.name}')

Created index this-demo-search_index


#### Create the Azure OpenAI embeddings and AzureSearch instances:

In [5]:
from langchain.embeddings import OpenAIEmbeddings

EMBEDDINGS_MODEL_NAME = 'text-embedding-ada-002'
embeddings = OpenAIEmbeddings(
                    openai_api_key=openai.api_key,
                    openai_api_type=openai.api_type,
                    openai_api_version=openai.api_version,
                    openai_api_base=openai.api_base,
                    model=EMBEDDINGS_MODEL_NAME,
                    deployment=deployedEmbeddings
            )

In [7]:
from langchain.vectorstores.azuresearch import AzureSearch
vectorStore: AzureSearch = AzureSearch(
                                azure_search_endpoint=azureSearchEndpoint,
                                azure_search_key=azureSearchAdminKey,
                                index_name=MY_SEARCH_INDEX_NAME,
                                embedding_function=embeddings.embed_query,
                            )

### Load the BillSum Dataset
BillSum is a dataset of United States Congressional and California state bills. For illustration purposes, we'll look only at the US bills. The corpus consists of bills from the 103rd-115th (1993-2018) sessions of Congress. The data was split into 18,949 train bills and 3,269 test bills. The BillSum corpus focuses on mid-length legislation from 5,000 to 20,000 characters in length. More information on the project and the academic paper from which this dataset is derived can be found in the BillSum project's GitHub repository.

We curated the content and saved it in ../data/bill_sum_data_curated.csv

#### Load the curated data and select the bill_id, title, summary, and sum_len columns

In [8]:
import pandas as pd

# Assumes bill_sum_data_curated.csv is in a data folder under the directory where you run Jupyter Notebook
df = pd.read_csv(os.path.join(os.getcwd(), './data/bill_sum_data_curated.csv'))
df_bills = df[['bill_id', 'title', 'summary', 'sum_len']]

from langchain.document_loaders import DataFrameLoader

loader = DataFrameLoader(df_bills, page_content_column="summary")
docs = loader.load()
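
To make the loading step more concrete: each row becomes one document whose text comes from the `summary` column, while the remaining columns travel along as metadata (this matches the search output shown later in this notebook). The sketch below imitates that mapping with plain dictionaries. It is an illustrative stand-in, not LangChain's implementation; the function name `rows_to_documents` and the sample row are made up.

```python
def rows_to_documents(rows, page_content_column):
    """Split each row into text content and metadata (illustrative stand-in)."""
    documents = []
    for row in rows:
        metadata = dict(row)  # copy so the input row is not modified
        page_content = metadata.pop(page_content_column)
        documents.append({"page_content": page_content, "metadata": metadata})
    return documents


# Made-up example row, shaped like the columns selected above.
sample_rows = [
    {
        "bill_id": "example_id",
        "title": "Example Act",
        "summary": "Example summary text.",
        "sum_len": 21,
    },
]

for document in rows_to_documents(sample_rows, page_content_column="summary"):
    print(document)
```

Only the page content is embedded for vector search; the metadata is stored with the document so that results can be traced back to a specific bill.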

In [9]:
results = vectorStore.add_documents(documents=docs)
print("Stored %s documents with embeddings in Azure Cognitive Search" % len(results))

Stored 20 documents with embeddings in Azure Cognitive Search


## Different Search functions

[LangChain API Reference Docs](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.azuresearch.AzureSearch.html#langchain.vectorstores.azuresearch.AzureSearch.semantic_hybrid_search)

In [10]:
# Return docs most similar to query.
searchResultDocs = vectorStore.similarity_search(
                        query="federal agency green energy bill",
                        k=1, # return only the nearest neighbor
                        #search_type="similarity" # omit this argument to try a hybrid search
                     )

for doc in searchResultDocs:
    print("Doc: %s\n" %doc)

Doc: page_content="Directs the President, in coordination with designated Secretaries, to establish: (1) a demonstration program for fuel cell proton exchange membrane technology for commercial, residential, and transportation applications within the Secretaries' respective areas. And (2) a comprehensive proton exchange membrane fuel cell bus demonstration program to address hydrogen production, storage, and use in transit bus applications. Mandates that each Federal agency that maintains a motor vehicle fleet develop a plan for fleet transition to vehicles powered by fuel cell technology. Directs the Secretary of Energy to establish a fuel cell technology grant program for State or local government to meet their energy requirements, including such technology as a motor vehicle power source. Authorizes appropriations." metadata={'bill_id': '106_hr5585', 'title': 'Energy Independence Act of 2000', 'sum_len': 810}
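
Conceptually, a vector similarity search ranks the stored embeddings by how close they are to the query embedding and returns the top `k`. The brute-force sketch below illustrates that idea with cosine similarity over tiny made-up vectors. It is only an illustration under those assumptions: the search service's actual algorithm and distance metric depend on how the index is configured, and the names `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored_vectors, k=1):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in stored_vectors.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real embeddings have far more dimensions.
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

print(top_k([0.9, 0.1, 0.0], toy_vectors, k=1))
```

With `k=1`, as in the cell above, only the single best match is returned; a larger `k` returns more candidates in descending order of similarity.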

