# Query Rewriting (Azure AI Search)

This code demonstrates how to use Azure AI Search with advanced query rewriting to improve the relevance of your search results. The code performs the following tasks:

+ Create an index schema
+ Load the sample data from a local folder
+ Embed the documents in-memory using Azure OpenAI's text-embedding-ada-002 model
+ Index the vector and nonvector fields on Azure AI Search
+ Rewrite a sample question to improve the relevance of the result documents
+ Manually combine the results of multiple rewritten queries using [Reciprocal Rank Fusion (RRF)](https://learn.microsoft.com/azure/search/hybrid-search-ranking).
+ Use [simple query syntax](https://learn.microsoft.com/azure/search/query-simple-syntax) and [multi-vector queries](https://learn.microsoft.com/azure/search/vector-search-how-to-query?tabs=query-2023-11-01%2Cfilter-2023-11-01#multiple-vector-queries) to automatically combine multiple rewritten queries using built-in RRF

The code uses Azure OpenAI to generate embeddings for title and content fields. You'll need access to Azure OpenAI to run this demo.

The code reads the `text-sample.json` file, which contains the input data for which embeddings need to be generated.

The output is a combination of human-readable text and embeddings that can be pushed into a search index.

## Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access). You must have the Azure OpenAI service name and an API key.

+ A deployment of the text-embedding-ada-002 embedding model.

+ Azure AI Search, any tier, but choose a service that has sufficient capacity for your vector index. We recommend Basic or higher. [Enable semantic ranking](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run the hybrid query with semantic ranking.

We used Python 3.11, [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/python/python-tutorial), and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) to test this example.

### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [1]:
! pip install -r query-rewrite-requirements.txt --quiet

## Import required libraries and environment variables

In [2]:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os

load_dotenv(override=True) # take environment variables from .env.

# The following variables from your .env file are used in this notebook
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"]) if len(os.environ["AZURE_SEARCH_ADMIN_KEY"]) > 0 else DefaultAzureCredential()
index_name = os.environ["AZURE_SEARCH_INDEX"]
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.environ["AZURE_OPENAI_KEY"] if len(os.environ["AZURE_OPENAI_KEY"]) > 0 else None
azure_openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]
azure_openai_api_version = os.environ["AZURE_OPENAI_API_VERSION"]
azure_openai_chatgpt_deployment = os.environ["AZURE_OPENAI_CHATGPT_DEPLOYMENT"]

## Create embeddings
Read your data, generate OpenAI embeddings and export to a format to insert your Azure AI Search index:

In [3]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import json

openai_credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(openai_credential, "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    api_version=azure_openai_api_version,
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_key,
    azure_ad_token_provider=token_provider if not azure_openai_key else None
)

output_path = os.path.join('..', '..', '..', 'output', 'docVectors.json')

if not os.path.exists(output_path):
    # Generate Document Embeddings using OpenAI Ada 002
    # Read the text-sample.json
    path = os.path.join('..', '..', '..', 'data', 'text-sample.json')
    with open(path, 'r', encoding='utf-8') as file:
        input_data = json.load(file)

    titles = [item['title'] for item in input_data]
    content = [item['content'] for item in input_data]
    title_response = client.embeddings.create(input=titles, model=azure_openai_embedding_deployment)
    title_embeddings = [item.embedding for item in title_response.data]
    content_response = client.embeddings.create(input=content, model=azure_openai_embedding_deployment)
    content_embeddings = [item.embedding for item in content_response.data]

    # Generate embeddings for title and content fields
    for i, item in enumerate(input_data):
        title = item['title']
        content = item['content']
        item['titleVector'] = title_embeddings[i]
        item['contentVector'] = content_embeddings[i]

    # Output embeddings to docVectors.json file
    output_directory = os.path.dirname(output_path)
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
    with open(output_path, "w") as f:
        json.dump(input_data, f)

## Create your search index

Create your search index schema and vector search configuration. If you get an error, check the search service for available quota and check the .env file to make sure you're using a unique search index name.

In [4]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex,
    AzureOpenAIVectorizer,
    AzureOpenAIParameters
)


# Create a search index
index_client = SearchIndexClient(
    endpoint=endpoint, credential=credential)
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchableField(name="category", type=SearchFieldDataType.String,
                    filterable=True),
    SearchField(name="titleVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile"),
    SearchField(name="contentVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile"),
]

# Configure the vector search configuration  
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
            vectorizer="myVectorizer"
        )
    ],
    vectorizers=[
        AzureOpenAIVectorizer(
            name="myVectorizer",
            azure_open_ai_parameters=AzureOpenAIParameters(
                resource_uri=azure_openai_endpoint,
                deployment_id=azure_openai_embedding_deployment,
                api_key=azure_openai_key
            )
        )
    ]
)



semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        content_fields=[SemanticField(field_name="content")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search, semantic_search=semantic_search)
result = index_client.create_or_update_index(index)
print(f'{result.name} created')


rewriter created


## Insert text and embeddings into vector store
Add texts and metadata from the JSON data to the vector store:

In [5]:
from azure.search.documents import SearchClient

# Upload some documents to the index
output_path = os.path.join('..', '..', '..', 'output', 'docVectors.json')
output_directory = os.path.dirname(output_path)
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
with open(output_path, 'r') as file:  
    documents = json.load(file)  
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents)
print(f"Uploaded {len(documents)} documents") 

Uploaded 108 documents


If you are indexing a very large number of documents, you can use the `SearchIndexingBufferedSender` which is an optimized way to automatically index the docs as it will handle the batching for you:

In [6]:
from azure.search.documents import SearchIndexingBufferedSender

# Upload some documents to the index  
with open(output_path, 'r') as file:  
    documents = json.load(file)  
  
# Use SearchIndexingBufferedSender to upload the documents in batches optimized for indexing  
with SearchIndexingBufferedSender(  
    endpoint=endpoint,  
    index_name=index_name,  
    credential=credential,  
) as batch_client:  
    # Add upload actions for all documents  
    batch_client.upload_documents(documents=documents)  
print(f"Uploaded {len(documents)} documents in total")  


Uploaded 108 documents in total


## Retrieve chunks using hybrid search

Before evaluating the effects of query rewriting, it's useful to establish a baseline as to what hybrid search returns without any query rewriting

In [7]:
from azure.search.documents.models import VectorizableTextQuery
import pandas as pd

def hybrid_search(search_client: SearchClient, query: str) -> pd.DataFrame:
    results = search_client.search(
        search_text=query,
        vector_queries=[
            # k_nearest_neighbors should be set to 50 in order to boost the relevance of hybrid search
            # Increasing the vector recall set size from 1 to 50 in hybrid search benefits relevance by
            # improving the diversity of vector query results that will be considered by RRF, ensuring a more comprehensive representation
            # of the data results and more robustness to varying similarity scores or closely related similarity scores.
            VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="contentVector")
        ],
        top=3,
        select="id, title, content",
        search_fields=["content"]
    )
    data = [[result["id"], result["title"], result["content"], result["@search.score"]] for result in results]
    return pd.DataFrame(data, columns=["id", "title", "content", "@search.score"])



The following cell demonstrates the results of hybrid search using a sample query

In [8]:
hybrid_search(search_client, "scalable storage solution")

Unnamed: 0,id,title,content,@search.score
0,4,Azure Storage,"Azure Storage is a scalable, durable, and high...",0.03306
1,36,Azure Data Lake Storage,"Azure Data Lake Storage is a scalable, secure,...",0.032266
2,52,Azure Table Storage,"Azure Table Storage is a fully managed, NoSQL ...",0.031754


## Rewriting queries for improved relevance of results

Users often use terse terms such as "scalable storage solution". These terms may match the contents of documents in the search index, but often an LLM can rewrite the query to improve the results

In [9]:
import json

REWRITE_PROMPT = """You are a helpful assistant. You help users search for the answers to their questions.
You have access to Azure AI Search index with 100's of documents. Rewrite the following question into 3 useful search queries to find the most relevant documents.
Always output a JSON object in the following format:
===
Input: "scalable storage solution"
Output: { "queries": ["what is a scalable storage solution in Azure", "how to create a scalable storage solution", "steps to create a scalable storage solution"] }
===
"""

# If you are not using a supported model or region, you may not be able to use json_object response format
# Please see https://learn.microsoft.com/azure/ai-services/openai/how-to/json-mode
def rewrite_query(openai_client: AzureOpenAI, query: str):
    response = openai_client.chat.completions.create(
        model=azure_openai_chatgpt_deployment,
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": f"Input: {query}"}
        ]
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError as e:
        print("JSON decoding error:", e)
        raise

The following cell demonstrates how an LLM can rewrite queries to improve their clarity

In [10]:
rewrite_query(client, "what is azure sarch?")

{'queries': ['what is Azure Search',
  'features of Azure Search',
  'how to set up Azure Search']}

## Combining the rewritten queries manually using RRF

Now that we can use a LLM to rewrite the query, we need to issue our queries and combine the results. We'll start by doing this manually to demonstrate how the RRF calculation works

In [11]:
def query_rewrite_manual_rrf(search_client: SearchClient, openai_client: AzureOpenAI, query: str) -> pd.DataFrame:
    rewritten_queries = rewrite_query(openai_client, query)
    # pd.concat preserves the original index by default when concatenating tables
    # This is important for the RRF calculation below
    results = pd.concat([hybrid_search(search_client, rewritten_query) for rewritten_query in rewritten_queries["queries"]], axis=0)
    def rrf_score(row: pd.Series) -> float:
        score = 0.0
        k = 60
        # rank = the original position in the results list the document was located at
        for rank, df_row in results.iterrows():
            # The RRF score is the sum of 1.0 / (k + document rank) in every result set the document shows up in
            if df_row["id"] == row["id"]:
                score += 1.0 / (k + rank)
        return score
    # Apply the RRF scoring function to every row in the data frame
    results["rrf_score"] = results.apply(rrf_score, axis=1)
    # Return the deduplicated result set sorted by the most relevant RRF score
    return rewritten_queries, results.drop_duplicates(subset=["id"]).sort_values(by="rrf_score", ascending=False)
    

The following cell demonstrates how an unclear query ("srch service") is automatically rewritten and made more clear by an LLM. The resulting RRF score is higher for the most relevant document compared to the original search score

In [12]:
from IPython.display import display

rewritten_queries, results = query_rewrite_manual_rrf(search_client, client, "srch service")
display(results)
print(rewritten_queries)

Unnamed: 0,id,title,content,@search.score,rrf_score
0,40,Azure Cognitive Search,Azure Cognitive Search is a fully managed sear...,0.033333,0.05
2,90,Azure Cognitive Services,Azure Cognitive Services is a collection of AI...,0.032522,0.048916
1,3,Azure Cognitive Services,Azure Cognitive Services are a set of AI servi...,0.032522,0.048652


{'queries': ['What is Azure Cognitive Search service?', 'How to create an Azure Cognitive Search service?', 'Best practices for using Azure Cognitive Search service']}


## Combining the rewritten queries automatically using RRF

We can use the built-in RRF instead of manually performing the RRF calculation ourselves. We will use query combination using boolean operators and multi-vector search to accomplish a similar goal. Please note that the RRF score will not be exactly the same as the manual calculation because the text index can be more efficiently queried using this approach and less-relevant documents are automatically filtered out

In [13]:
def query_rewrite_automatic_rrf(search_client: SearchClient, openai_client: AzureOpenAI, query: str) -> pd.DataFrame:
    rewritten_queries = rewrite_query(openai_client, query)
    # Quote the rewritten queries before joining them in the query syntax
    formatted_queries = [f'"{rewritten_query}"' for rewritten_query in rewritten_queries]
    # Use the OR operator to join rewritten queries together
    # https://learn.microsoft.com/azure/search/query-lucene-syntax#bkmk_boolean
    search_text = " | ".join(formatted_queries)
    results = search_client.search(
        search_text=search_text,
        # Issue a vector query for every single rewritten query
        vector_queries=[VectorizableTextQuery(text=rewritten_query, k_nearest_neighbors=50, fields="contentVector") for rewritten_query in rewritten_queries],
        query_type="simple",
        # Any rewritten query from the joined query could match
        search_mode="any",
        search_fields=["content"],
        top=3
    )
    # @search.score is equivalent to the manually computed RRF score above
    data = [[result["id"], result["title"], result["content"], result["@search.score"]] for result in results]
    return rewritten_queries, pd.DataFrame(data, columns=["id", "title", "content", "@search.score"])

The following cell demonstrates how the automatic approach has similar results to the manual one, even though the scores are not exactly equal.

In [14]:
rewritten_queries, results = query_rewrite_automatic_rrf(search_client, client, "srch service")
display(results)
print(rewritten_queries)

Unnamed: 0,id,title,content,@search.score
0,40,Azure Cognitive Search,Azure Cognitive Search is a fully managed sear...,0.016667
1,3,Azure Cognitive Services,Azure Cognitive Services are a set of AI servi...,0.016393
2,34,Azure Data Explorer,"Azure Data Explorer is a fast, fully managed d...",0.016129


{'queries': ['Azure search service', 'how to create an Azure search service', 'best practices for Azure search service']}


## Continue to improve relevance using hybrid and semantic

Once you are using the automatic RRF combination method, you can add semantic ranking to improve relevance further

In [15]:
def query_rewrite_automatic_rrf_semantic(search_client: SearchClient, openai_client: AzureOpenAI, query: str) -> pd.DataFrame:
    rewritten_queries = rewrite_query(openai_client, query)
    # Quote the rewritten queries before joining them together using the query syntax
    formatted_queries = [f'"{rewritten_query}"' for rewritten_query in rewritten_queries]
    # Use the OR operator to join rewritten queries together
    # https://learn.microsoft.com/azure/search/query-lucene-syntax#bkmk_boolean
    search_text = " | ".join(formatted_queries)
    # The semantic ranker expects plain text queries with no search operators
    semantic_query = " ".join(rewritten_queries["queries"])
    results = search_client.search(
        search_text=search_text,
        # Issue a vector query for every single rewritten query
        vector_queries=[VectorizableTextQuery(text=rewritten_query, k_nearest_neighbors=50, fields="contentVector") for rewritten_query in rewritten_queries],
        # Any rewritten query from the joined query could match
        search_mode="any",
        search_fields=["content"],
        query_type="simple",
        # Pass in the plain text concatenation of the rewritten queries for semantic ranking
        semantic_query=semantic_query,
        semantic_configuration_name='my-semantic-config',
        top=3
    )
    # @search.score is equivalent to the manually computed RRF score above
    # @search.rerankerscore is the semantic reranking of the combined results
    data = [[result["id"], result["title"], result["content"], result["@search.score"], result["@search.reranker_score"]] for result in results]
    return rewritten_queries, pd.DataFrame(data, columns=["id", "title", "content", "@search.score", "@search.reranker_score"])

The following cell demonstrates how the semantic score compares to the RRF score. The semantic score ranges from 0-4, where a higher score indicates higher relvance

In [16]:
rewritten_queries, results = query_rewrite_automatic_rrf_semantic(search_client, client, "srch service")
display(results)
print(rewritten_queries)

Unnamed: 0,id,title,content,@search.score,@search.reranker_score
0,40,Azure Cognitive Search,Azure Cognitive Search is a fully managed sear...,0.016667,3.679841
1,3,Azure Cognitive Services,Azure Cognitive Services are a set of AI servi...,0.016393,2.989737
2,90,Azure Cognitive Services,Azure Cognitive Services is a collection of AI...,0.015152,2.972886


{'queries': ['what is Azure Cognitive Search service', 'how to create an Azure Cognitive Search service', 'best practices for managing Azure Cognitive Search service']}
