# Queries with Azure OpenAI

So far, you have loaded your search engine **from two different data sources into two different text-based indexes**. In this notebook we are going to try some example queries and then use the Azure OpenAI service to see if we can get even better results.

The idea is that a user can ask a question about Computer Science (first datasource/index) or about Covid (second datasource/index), and the engine will respond accordingly.
This **Multi-Index** demo mimics the scenario where a company loads multiple types of documents about completely different topics, and the search engine must respond with the most relevant results.

## Set up variables

In [3]:
import os
import urllib
import requests
import random
import json
from collections import OrderedDict
from IPython.display import display, HTML, Markdown
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import AzureOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.embeddings import OpenAIEmbeddings

from common.prompts import COMBINE_QUESTION_PROMPT, COMBINE_PROMPT, COMBINE_PROMPT_TEMPLATE
from common.utils import (
    get_search_results,
    model_tokens_limit,
    num_tokens_from_docs,
    num_tokens_from_string
)

from dotenv import load_dotenv
load_dotenv("credentials.env")

True

In [2]:
# Set up the request headers and query parameters for Azure Search
headers = {'Content-Type': 'application/json', 'api-key': os.environ['AZURE_SEARCH_KEY']}
params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}

# Using Azure OpenAI

To use OpenAI to get a better answer to our question, the thought process is simple: let's **give the question and the content of the documents from the search result to the GPT model as context and let it provide a better response**.

Now, before we do this, we need to understand a few things first:

1) Chaining and Prompt Engineering
2) Embeddings

We will use a library called **LangChain** that wraps a lot of boilerplate code.
LangChain does much of the prompt engineering for us under the hood. For more information, see [here](https://python.langchain.com/en/latest/index.html).

In [4]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"]
os.environ["OPENAI_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"]
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]
os.environ["OPENAI_API_TYPE"] = "azure"

**Important Note**: Starting now, we will utilize OpenAI models. Please ensure that you have deployed the following models within the Azure OpenAI portal using these precise deployment names:

- text-embedding-ada-002
- gpt-35-turbo
- gpt-35-turbo-16k
- gpt-4
- gpt-4-32k

Should you have deployed the models under different names, the code provided below will not function as expected. To resolve this, you would need to modify the variable names throughout all the notebooks.

## A gentle intro to chaining LLMs and prompt engineering

Chains are what you get by connecting one or more large language models (LLMs) in a logical way. (Chains can be built of entities other than LLMs but for now, let’s stick with this definition for simplicity).

Azure OpenAI is one LLM provider that you can use, but there are others such as Cohere and Hugging Face.

Chains can be simple (i.e. Generic) or specialized (i.e. Utility).

* Generic — A single LLM is the simplest chain. It takes an input prompt and the name of the LLM and then uses the LLM for text generation (i.e. output for the prompt).

Here’s an example:

In [5]:
MODEL = "gpt-4" # options: gpt-35-turbo, gpt-35-turbo-16k, gpt-4, gpt-4-32k
COMPLETION_TOKENS = 1000
llm = AzureChatOpenAI(deployment_name=MODEL, temperature=0, max_tokens=COMPLETION_TOKENS)

In [9]:
QUESTION = "Welche Regeln gelten für Motoren bei Land Transporten?"

# Now we create a simple prompt template
prompt = PromptTemplate(
    input_variables=["question", "language"],
    template='Answer the following question: "{question}". Give your response in {language}',
)

print(prompt.format(question=QUESTION, language="English"))

Answer the following question: "Welche Regeln gelten für Motoren bei Land Transporten?". Give your response in English


In [10]:
# And finally we create our first generic chain
chain_chat = LLMChain(llm=llm, prompt=prompt)
chain_chat({"question": QUESTION, "language": "English"})

{'question': 'Welche Regeln gelten für Motoren bei Land Transporten?',
 'language': 'English',
 'text': 'The question asks: "What rules apply to engines for land transports?" \n\nThe rules for engines in land transports can vary depending on the country and the type of vehicle. However, some common rules include:\n\n1. Emission Standards: Engines must meet certain emission standards to reduce air pollution. These standards can vary by country and vehicle type.\n\n2. Maintenance and Inspection: Regular maintenance and inspection of engines are often required to ensure they are operating safely and efficiently.\n\n3. Noise Regulations: There may be rules regarding the amount of noise an engine can produce.\n\n4. Fuel Efficiency Standards: Some countries have fuel efficiency standards that engines must meet.\n\n5. Safety Standards: Engines must meet certain safety standards to prevent accidents and breakdowns.\n\n6. Engine Size/Power: Some types of vehicles may have restrictions on engine

**Note**: This is the first time you use OpenAI in this Accelerator, so if you get a `Resource not found` error, it is most likely because the name of your OpenAI model deployment is different from the value of the `MODEL` variable set above.

Great! Now you know how to create a simple prompt and use a chain to answer a general question using ChatGPT's knowledge.

It is important to note that we rarely use generic chains as standalone chains. More often they are used as building blocks for Utility chains (as we will see next). Also important to notice is that we are NOT using our documents or the result of the Azure Search yet, just the knowledge of ChatGPT on the data it was trained on.

**The second type of Chains are Utility:**

* Utility — These are specialized chains, composed of many LLMs, that help solve a specific task. For example, LangChain supports some end-to-end chains (such as [QA_WITH_SOURCES](https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) for QnA document retrieval, summarization, etc.) and some specific ones (such as GraphQnAChain for creating, querying, and saving graphs).

In this workshop we will dig deeper into one specific chain, **qa_with_sources**, to solve our use case of enhancing the results of Azure Cognitive Search.


But before we use the utility chain, we first need to deal with this problem: **the content of the search result files is, or can be, very lengthy, exceeding the number of tokens allowed by the Azure OpenAI GPT models**.

This is where the concept of embeddings/vectors comes into play.

## Embeddings and Vector Search

From the Azure OpenAI documentation ([HERE](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/embeddings?tabs=python)): an embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information-dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with the semantic similarity between the two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.

To fit the context within the token limit of a Large Language Model (LLM), the solution involves the following steps:

1. **Segmenting Documents**: Divide the documents into smaller segments or chunks.
2. **Vectorization of Chunks**: Transform these chunks into vectors using appropriate techniques.
3. **Vector Semantic Search**: Execute a semantic search using vectors to identify the top chunks similar to the given question.
4. **Optimal Context Provision**: Provide the LLM with the most relevant and concise context, thereby achieving an optimal balance between comprehensiveness and lengthiness.
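
The four steps above can be sketched end to end with the standard library only. This is a toy illustration, not the notebook's implementation: `chunk_text`, `toy_embed`, `cosine_similarity` and `top_chunks` are hypothetical helpers, and the word-count vector is a stand-in for a real embedding model such as `text-embedding-ada-002`.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_chars=200):
    """Step 1: split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text):
    """Step 2 (stand-in): a word-count vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(question, chunks, k=2):
    """Step 3: rank chunks by similarity to the question and keep the best k."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, toy_embed(c)), reverse=True)
    return ranked[:k]


document = (
    "Door locks must withstand a longitudinal load. "
    "Engines must meet emission standards. "
    "Door hinges are tested together with the door locks. "
    "Tyres are inspected every year."
)
chunks = chunk_text(document, max_chars=60)
# Step 4: only the best-matching chunks are passed to the LLM as context.
context = "\n".join(top_chunks("How are door locks tested?", chunks, k=2))
print(context)
```

With real embeddings the ranking reflects meaning rather than shared words, but the shape of the pipeline stays the same.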


Notice that **the document chunks are already created in Azure Search**. The `ordered_content` dictionary returned by `get_search_results` contains the chunks of each document, so we don't need to chunk them again. We still need to make sure that we are as fast as possible and that we stay below the maximum input token limit of our selected OpenAI model.

Our ultimate goal is to rely solely on vector indexes. While it is possible to manually code parsers with OCR for various file types and develop a scheduler to synchronize data with the index, there is a more efficient alternative: **Azure Cognitive Search is going to release automated chunking strategies and vectorization within the next few months**, so we have three options:
1. Wait for this functionality and, in the meantime, manually push chunks and their vectors to the vector-based indexes.
2. Fill up the vector-based indexes on demand, as documents are discovered by users.
3. Use custom skills (for chunking and vectorization) and knowledge stores to create a vector-based index from a text-based, AI-enriched index at ingestion time. See [HERE](https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb) for instructions on how to do this.

In this notebook we are going to implement Option 2: **Create a vector-based index for each text-based index and fill it up on demand as documents are discovered**. Why? Because it is simple and quick to implement while we wait for Option 1 to become a feature of Azure Search (which is the automation of Option 3 inside the search engine).

As observed in Notebooks 1 and 2, each text-based index contains a field named `vectorized` that we have not utilized yet. We will now harness this field. The objective is to avoid vectorizing all documents at the time of ingestion (Option 3). Instead, we can vectorize the chunks as users search for or discover documents. This approach ensures that we allocate funds and resources only when the documents are actually required. Typically, in an organization with a vast repository of documents in a data lake, only 20% of the documents are frequently accessed, while the rest remain untouched. This phenomenon mirrors the [Pareto Principle](https://en.wikipedia.org/wiki/Pareto_principle) found in nature.
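
The on-demand idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the accelerator's code: `vectorize_on_demand` is a hypothetical helper, and `embed`, `push_chunk` and `mark_vectorized` are hypothetical callables the caller would supply (for example an embedding call, an upload to the vector index, and an update of the `vectorized` flag in the text-based index). The field names `chunk` and `chunkVector` follow the ones used in the search payload later in this notebook.

```python
def vectorize_on_demand(results, embed, push_chunk, mark_vectorized):
    """Embed and index the chunks of every search hit that is not vectorized yet.

    results: dict of doc_id -> {"chunks": [...], "vectorized": bool, ...}
    embed, push_chunk, mark_vectorized: hypothetical callables supplied by the caller.
    Returns the ids of the documents vectorized by this call.
    """
    newly_vectorized = []
    for doc_id, doc in results.items():
        if doc.get("vectorized"):
            continue  # already embedded on an earlier search, so nothing to pay for
        for i, chunk in enumerate(doc["chunks"]):
            push_chunk({"id": f"{doc_id}_{i}", "chunk": chunk, "chunkVector": embed(chunk)})
        mark_vectorized(doc_id)
        newly_vectorized.append(doc_id)
    return newly_vectorized


# Toy run with in-memory stand-ins for the embedding model and the two indexes.
vector_index, flagged = [], set()
hits = {
    "doc1": {"chunks": ["a", "bb"], "vectorized": False},
    "doc2": {"chunks": ["ccc"], "vectorized": True},
}
done = vectorize_on_demand(
    hits,
    embed=lambda text: [float(len(text))],
    push_chunk=vector_index.append,
    mark_vectorized=flagged.add,
)
print(done, len(vector_index))
```

Only `doc1` is embedded here; `doc2` was flagged as vectorized and is skipped, which is where the cost saving comes from.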

In [11]:
index_name = "cogsrch-index-files"
index2_name = "cogsrch-index-csv"
indexes = [index_name, index2_name]

To avoid duplicating code, we have put much of the code used so far into functions. These functions are in the `common/utils.py` and `common/prompts.py` files. This way we can reuse these functions in the app that we will build later.

In [13]:
embedder = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1) 

For vector search, it is not recommended to give more than k=5 chunks (of at most 5000 characters each) to the LLM as context. Otherwise, you may hit the token limit later when trying to hold a conversation with memory.

In [18]:
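# Calculate number of tokens of our docs
# NOTE: `top_docs` (the retrieved documents used as context) must be defined before running this cell
if len(top_docs) > 0: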
    tokens_limit = model_tokens_limit(MODEL) # this is a custom function we created in common/utils.py
    prompt_tokens = num_tokens_from_string(COMBINE_PROMPT_TEMPLATE) # this is a custom function we created in common/utils.py
    context_tokens = num_tokens_from_docs(top_docs) # this is a custom function we created in common/utils.py
    
    requested_tokens = prompt_tokens + context_tokens + COMPLETION_TOKENS
    
    chain_type = "map_reduce" if requested_tokens > 0.9 * tokens_limit else "stuff"  
    
    print("System prompt token count:",prompt_tokens)
    print("Max Completion Token count:", COMPLETION_TOKENS)
    print("Combined docs (context) token count:",context_tokens)
    print("--------")
    print("Requested token count:",requested_tokens)
    print("Token limit for", MODEL, ":", tokens_limit)
    print("Chain Type selected:", chain_type)
        
else:
    print("NO RESULTS FROM AZURE SEARCH")

System prompt token count: 1669
Max Completion Token count: 1000
Combined docs (context) token count: 628
--------
Requested token count: 3297
Token limit for gpt-35-turbo : 4096
Chain Type selected: stuff
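
The selection rule in the cell above can be checked by hand with the numbers it printed. The sketch below wraps the same arithmetic in a hypothetical helper, `choose_chain_type`; the 0.9 safety factor is the one used in the cell, and the second call uses a made-up larger context to show the other branch.

```python
def choose_chain_type(prompt_tokens, context_tokens, completion_tokens, tokens_limit, safety=0.9):
    """Return (requested_tokens, chain_type) using the same rule as the cell above."""
    requested = prompt_tokens + context_tokens + completion_tokens
    chain_type = "map_reduce" if requested > safety * tokens_limit else "stuff"
    return requested, chain_type


# Values printed above: 1669 + 628 + 1000 = 3297, and 0.9 * 4096 = 3686.4
print(choose_chain_type(1669, 628, 1000, 4096))
# A larger (hypothetical) context pushes the request over the threshold
print(choose_chain_type(1669, 2500, 1000, 4096))
```

Because 3297 is below 3686.4, all chunks fit into a single prompt and `stuff` is selected; once the request exceeds the threshold, `map_reduce` processes the chunks separately before combining them.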


Now we will use our Utility Chain from LangChain `qa_with_sources`

In [15]:
# Import required libraries  
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient  
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryLanguage,
    QueryType,
    RawVectorQuery,
    VectorizableTextQuery,
    VectorFilterMode,    
)
from azure.search.documents.indexes.models import (  
    AzureOpenAIEmbeddingSkill,  
    AzureOpenAIParameters,  
    AzureOpenAIVectorizer,  
    ExhaustiveKnnParameters,  
    ExhaustiveKnnVectorSearchAlgorithmConfiguration,
    FieldMapping,  
    HnswParameters,  
    HnswVectorSearchAlgorithmConfiguration,  
    IndexProjectionMode,  
    InputFieldMappingEntry,  
    OutputFieldMappingEntry,  
    PrioritizedFields,    
    SearchField,  
    SearchFieldDataType,  
    SearchIndex,  
    SearchIndexer,  
    SearchIndexerDataContainer,  
    SearchIndexerDataSourceConnection,  
    SearchIndexerIndexProjectionSelector,  
    SearchIndexerIndexProjections,  
    SearchIndexerIndexProjectionsParameters,  
    SearchIndexerSkillset,  
    SemanticConfiguration,  
    SemanticField,  
    SemanticSettings,  
    SplitSkill,  
    VectorSearch,  
    VectorSearchAlgorithmKind,  
    VectorSearchAlgorithmMetric,  
    VectorSearchProfile,  
)  

from azure.storage.blob import BlobServiceClient  
import openai  

from dotenv import load_dotenv  
import os  
  
# Configure environment variables
load_dotenv("credentials.env")
service_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")  # or os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
index_name = "cogsrch-index-files"  # or os.getenv("AZURE_SEARCH_INDEX_NAME")
key = os.getenv("AZURE_SEARCH_KEY")  # remove _ADMIN_ here
openai.api_type = "azure"
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_version = os.getenv("AZURE_OPENAI_API_VERSION")
model: str = "text-embedding-ada-002"
blob_connection_string = os.getenv("BLOB_CONNECTION_STRING")
container_name = "demo-vbd-mercedes"  # or os.getenv("BLOB_CONTAINER_NAME")
credential = AzureKeyCredential(key)

In [None]:
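# This function needs to be adapted for calling OpenAI in this notebook
# (the return annotation is OrderedDict because that is what the function returns;
# the original `List[dict]` also relied on `List`, which is never imported)
def get_search_results(query: str, indexes: list, 
                       k: int = 5,
                       reranker_threshold: int = 1,
                       sas_token: str = "",
                       vector_search: bool = False,
                       similarity_k: int = 3, 
                       query_vector: list = []) -> OrderedDict: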
    
    headers = {'Content-Type': 'application/json','api-key': os.environ["AZURE_SEARCH_KEY"]}
    params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}

    agg_search_results = dict()
    
    for index in indexes:
        search_payload = {
            "search": query,
            "queryType": "semantic",
            "semanticConfiguration": "my-semantic-config",
            "count": "true",
            "speller": "lexicon",
            "queryLanguage": "en-us",
            "captions": "extractive",
            "answers": "extractive",
            "top": k
        }
        if vector_search:
            search_payload["vectors"]= [{"value": query_vector, "fields": "chunkVector","k": k}]
            search_payload["select"]= "id, title, chunk, name, location"
        else:
            search_payload["select"]= "id, title, chunks, language, name, location, vectorized"
        

        resp = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index + "/docs/search",
                         data=json.dumps(search_payload), headers=headers, params=params)

        search_results = resp.json()
        agg_search_results[index] = search_results
    
    content = dict()
    ordered_content = OrderedDict()
    
    for index,search_results in agg_search_results.items():
        for result in search_results['value']:
            if result['@search.rerankerScore'] > reranker_threshold: # Show results that are at least N% of the max possible score=4
                content[result['id']]={
                                        "title": result['title'], 
                                        "name": result['name'], 
                                        "location": result['location'] + sas_token if result['location'] else "",
                                        "caption": result['@search.captions'][0]['text'],
                                        "index": index
                                    }
                if vector_search:
                    content[result['id']]["chunk"]= result['chunk']
                    content[result['id']]["score"]= result['@search.score'] # Uses the Hybrid RRF score
              
                else:
                    content[result['id']]["chunks"]= result['chunks']
                    content[result['id']]["language"]= result['language']
                    content[result['id']]["score"]= result['@search.rerankerScore'] # Uses the reranker score
                    content[result['id']]["vectorized"]= result['vectorized']
                
    # After results have been filtered, sort and add the top k to the ordered_content
    if vector_search:
        topk = similarity_k
    else:
        topk = k*len(indexes)
        
    count = 0  # To keep track of the number of results added
    for id in sorted(content, key=lambda x: content[x]["score"], reverse=True):
        ordered_content[id] = content[id]
        count += 1
        if count >= topk:  # Stop after adding topk results
            break

    return ordered_content

In [40]:
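# get_search_results expects a list of index names; a bare string would be iterated character by character
vector_indexes = ["cogsrch-index-files"]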

k = 10
similarity_k = 3
ordered_results = get_search_results(QUESTION, vector_indexes,
                                        k=k, # Number of results per vector index
                                        reranker_threshold=1,
                                        vector_search=True, 
                                        similarity_k=similarity_k,
                                        query_vector = embedder.embed_query(QUESTION)
                                        )
print("Number of results:",len(ordered_results))

KeyError: 'value'
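
A `KeyError: 'value'` like this means the JSON returned by the search service had no `value` key, which typically happens when the service answers with an error payload instead of results. Possible causes include an index name that does not exist (for example, when a single string is passed as `indexes` and the loop iterates over its characters), or a payload field that the configured API version or index does not support. The sketch below shows one defensive way to surface the real message. `normalize_indexes` and `extract_hits` are hypothetical helpers, and the `{"error": {"message": ...}}` shape is an assumption about the error payload.

```python
def normalize_indexes(indexes):
    """Accept a single index name or any iterable of names; always return a list."""
    if isinstance(indexes, str):
        return [indexes]
    return list(indexes)


def extract_hits(index, payload):
    """Return payload['value'], or raise an error that includes the service message."""
    if "value" not in payload:
        # Assumed error shape: {"error": {"message": "..."}}
        message = payload.get("error", {}).get("message", "no 'value' key in response")
        raise RuntimeError(f"Search on index '{index}' failed: {message}")
    return payload["value"]


# A bare string iterates character by character, which is rarely what is intended
print(list("abc"))
print(normalize_indexes("cogsrch-index-files"))
print(extract_hits("cogsrch-index-files", {"value": [{"id": "1"}]}))
```

Inside `get_search_results`, the same check could be applied to `resp.json()` before the results are aggregated.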

In [36]:
index_name = "cogsrch-index-files"
# Pure Vector Search
query = "summarize Door Locks and Door Retention Components"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k=10, fields="vector", exhaustive=True)
# Use the below query to pass in the raw vector query instead of the query vectorization
# vector_query = RawVectorQuery(vector=generate_embeddings(query), k=3, fields="vector")
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    #filter="filter eq 'id1' ",
    #filter="filter/any(filter: search.in(filter, 'group_id1, group_id2'))"  ,
    top=10
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    #print(f"Content: {result['chunk']}")   


parent_id: aHR0cHM6Ly9zdG9yYWdlZGVtb29wZW5haS5ibG9iLmNvcmUud2luZG93cy5uZXQvZGVtby12YmQtbWVyY2VkZXMvMjYwNTI0MTYzOSUyMFtPcmlnaW5hbCUyMGVuXSUyMENBTiUyMFRTRCUyMDIwNiUyMFJldiUyMDIlMjBFTiUyMDIwMTAtMDYtMDMucGRmLnBkZg2
chunk_id: 7f98ad2134ee_aHR0cHM6Ly9zdG9yYWdlZGVtb29wZW5haS5ibG9iLmNvcmUud2luZG93cy5uZXQvZGVtby12YmQtbWVyY2VkZXMvMjYwNTI0MTYzOSUyMFtPcmlnaW5hbCUyMGVuXSUyMENBTiUyMFRTRCUyMDIwNiUyMFJldiUyMDIlMjBFTiUyMDIwMTAtMDYtMDMucGRmLnBkZg2_pages_3
Score: 0.8714593
parent_id: aHR0cHM6Ly9zdG9yYWdlZGVtb29wZW5haS5ibG9iLmNvcmUud2luZG93cy5uZXQvZGVtby12YmQtbWVyY2VkZXMvMjYwNTI0MTYzOSUyMFtPcmlnaW5hbCUyMGVuXSUyMENBTiUyMFRTRCUyMDIwNiUyMFJldiUyMDIlMjBFTiUyMDIwMTAtMDYtMDMucGRmLnBkZg2
chunk_id: 7f98ad2134ee_aHR0cHM6Ly9zdG9yYWdlZGVtb29wZW5haS5ibG9iLmNvcmUud2luZG93cy5uZXQvZGVtby12YmQtbWVyY2VkZXMvMjYwNTI0MTYzOSUyMFtPcmlnaW5hbCUyMGVuXSUyMENBTiUyMFRTRCUyMDIwNiUyMFJldiUyMDIlMjBFTiUyMDIwMTAtMDYtMDMucGRmLnBkZg2_pages_25
Score: 0.8665724
parent_id: aHR0cHM6Ly9zdG9yYWdlZGVtb29wZW5haS5ibG9iLmNvcmUud2luZG93cy5uZXQvZGVtby

In [19]:
chain_type =  "stuff"  

In [25]:
if chain_type == "stuff":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       prompt=COMBINE_PROMPT)
elif chain_type == "map_reduce":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       question_prompt=COMBINE_QUESTION_PROMPT,
                                       combine_prompt=COMBINE_PROMPT,
                                       return_intermediate_steps=True)

In [37]:
%%time
QUESTION = "summarize Door Locks and Door Retention Components"
# Try with other language as well
response = chain({"input_documents": results, "question": query, "language": "English"})

CPU times: user 48.9 ms, sys: 26 µs, total: 48.9 ms
Wall time: 1.95 s


In [39]:
print(results)

<iterator object azure.core.paging.ItemPaged at 0x7f5d20a8cf10>


In [38]:
display(Markdown(response['output_text']))

I'm sorry, but I can't provide the information you're looking for because there are no extracted parts provided for me to reference.
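
A reply like the one above is what the chain tends to produce when `input_documents` carries no usable text. Here `results` is an `ItemPaged` iterator that the `for` loop in the vector search cell has already consumed, and its items are raw search hits rather than LangChain `Document` objects. A minimal sketch of one way to materialize the hits first is shown below. `hits_to_pairs` is a hypothetical helper, the toy hits stand in for what `search_client.search(...)` yields, and using `chunk_id` as the source is an assumption.

```python
def hits_to_pairs(hits, content_field="chunk", source_field="chunk_id"):
    """Materialize search hits into (text, metadata) pairs, skipping empty chunks."""
    pairs = []
    for hit in hits:  # iterating consumes a paged iterator, so do it exactly once
        text = hit.get(content_field)
        if text:
            pairs.append((text, {"source": hit.get(source_field, "")}))
    return pairs


# Toy hits standing in for the items yielded by search_client.search(...)
toy_hits = iter([
    {"chunk_id": "doc_pages_3", "chunk": "example chunk text"},
    {"chunk_id": "doc_pages_25", "chunk": ""},
])
pairs = hits_to_pairs(toy_hits)
print(pairs)
```

Each pair can then be wrapped as `Document(page_content=text, metadata=metadata)`, using the `Document` class imported at the top of this notebook, and the resulting list passed as `input_documents` after running the search again.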

**Please Note**: There are some instances where, despite the answer's high accuracy and quality, the references are not done according to the instructions provided in the COMBINE_PROMPT. This behavior is anticipated when dealing with GPT-3.5 models. We will provide a more detailed explanation of this phenomenon towards the conclusion of Notebook 5.

In [22]:
# Uncomment if you want to inspect the results from map_reduce chain type, each top similar chunk summary (k=4 by default)

# if chain_type == "map_reduce":
#     for step in response['intermediate_steps']:
#         display(HTML("<b>Chunk Summary:</b> " + step))

# Summary
##### This answer is way better than taking just the result from Azure Cognitive Search. So the summary is:
- Utilizing Azure Cognitive Search, we conduct a multi-index text-based search that identifies the top documents from each index.
- Utilizing Azure Cognitive Search's vector search, we extract the most relevant chunks of information.
- Subsequently, Azure OpenAI utilizes these extracted chunks as context, comprehends the content, and employs it to deliver optimal answers.
- Best of both worlds!

# NEXT
In the next notebook, we are going to see how we can treat complex and large documents separately, also using vector search.