# RAG Pattern Application - Simple Implementation

In this sample, we'll demonstrate how to build a RAG Pattern application using a subset of the MovieLens dataset. The sample uses the Python SDK for Azure Cosmos DB for NoSQL to perform vector search and cache the results, and Azure OpenAI to generate embeddings and LLM completions.

There are two implementations in this project: one using LangChain, and this simple implementation. The simple implementation connects directly to Azure Cosmos DB for NoSQL to perform vector search and cache responses. It also connects directly to Azure OpenAI to generate embeddings and completions. This version requires you to define and build the payloads sent to the LLM and to define the RAG Pattern request pipeline yourself. The cache must be consulted manually in the pipeline, and responses must also be cached manually.

Vector search uses the vector similarity search functionality of Azure Cosmos DB for NoSQL. It runs over the vectorized movie data as well as the conversation history, which also serves as the cache.

At the end, we will create a simple UX using Gradio that lets users type in questions and displays responses generated by a GPT model or served from the cache. Each response also displays an elapsed time, so you can see the impact caching has on performance compared with generating a response.

**Important Note**
This sample requires an Azure Cosmos DB for NoSQL account with the Movies data uploaded to a container that has vector indexing set up. You also need a second container with vector indexing to serve as the cache. To learn how to set up vector-search-enabled containers, please refer to [this notebook](https://aka.ms/vector-search-nosql-nb).

# Preliminaries <a class="anchor" id="preliminaries"></a>
First, let's start by installing the packages that we'll need later. 

In [None]:
! pip install azure-cosmos
! pip install python-dotenv

! pip install openai

! pip install gradio

In [None]:
# Import the required libraries
import time
import json
import uuid
from dotenv import dotenv_values
from openai import AzureOpenAI
import gradio as gr


#Cosmos DB imports
from azure.cosmos import CosmosClient

Please use the example.env as a template to provide the necessary keys and endpoints in your own .env file.
Make sure to modify the env_name accordingly.

In [None]:
# Variables
# Name of your .env file; copy example.env and fill in your own values
env_name = "fabcondemo.env"
config = dotenv_values(env_name)

cosmos_conn = config['cosmos_connection_string']
cosmos_key = config['cosmos_key']
cosmos_database = config['cosmos_database_name']
cosmos_collection = config['cosmos_collection_name']
cosmos_vector_property = config['cosmos_vector_property_name']
cosmos_cache_db = config['cosmos_cache_database_name']
cosmos_cache = config['cosmos_cache_collection_name']
# Create the Azure Cosmos DB for NoSQL client
# Note: cosmos_conn is passed as the account URL (endpoint), with cosmos_key as the credential
cosmos_client = CosmosClient(url=cosmos_conn, credential=cosmos_key)

openai_endpoint = config['openai_endpoint']
openai_key = config['openai_key']
openai_api_version = config['openai_api_version']
openai_embeddings_deployment = config['openai_embeddings_deployment']
openai_embeddings_dimensions = int(config['openai_embeddings_dimensions'])
openai_completions_deployment = config['openai_completions_deployment']
# Create the OpenAI client
openai_client = AzureOpenAI(azure_endpoint=openai_endpoint, api_key=openai_key, api_version=openai_api_version)
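A missing or empty entry in the .env file only surfaces as a `KeyError` or a failed request later on, so it can help to check the settings up front. The following is a minimal sketch, assuming the key names used in the cell above; the helper `find_missing_keys` is hypothetical and not part of the original sample.

```python
# Key names taken from the configuration cell above
REQUIRED_KEYS = [
    "cosmos_connection_string", "cosmos_key", "cosmos_database_name",
    "cosmos_collection_name", "cosmos_vector_property_name",
    "cosmos_cache_database_name", "cosmos_cache_collection_name",
    "openai_endpoint", "openai_key", "openai_api_version",
    "openai_embeddings_deployment", "openai_embeddings_dimensions",
    "openai_completions_deployment",
]


def find_missing_keys(settings, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in settings (hypothetical helper)."""
    return [key for key in required if not settings.get(key)]


# Example with a deliberately incomplete settings dictionary
example_settings = {"cosmos_key": "abc", "openai_key": ""}
print(find_missing_keys(example_settings, ["cosmos_key", "openai_key", "openai_endpoint"]))
```

In the notebook, calling `find_missing_keys(config)` right after `dotenv_values(env_name)` would list anything that still needs to be filled in.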


# Database and collections

Please make sure that you have the movies and cache containers set up already. To learn how to set up vector-search-enabled containers, please refer to [this notebook](https://aka.ms/vector-search-nosql-nb).

In [None]:
# get databases and containers to work with

db = cosmos_client.get_database_client(cosmos_database)
movies_container = db.get_container_client(cosmos_collection)
cache_container = db.get_container_client(cosmos_cache)
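The containers fetched above are assumed to exist already with vector indexing configured. For orientation, here is a rough sketch of what the two policies for such a container might look like. The `/embeddings` path matches the property queried later in this notebook; the 1536 dimensions, the `quantizedFlat` index type, and the helper name `build_vector_policies` are assumptions for illustration, and the linked notebook remains the reference for the actual setup.

```python
def build_vector_policies(vector_path="/embeddings", dimensions=1536):
    """Build vector embedding and indexing policies for a container (illustrative sketch)."""
    vector_embedding_policy = {
        "vectorEmbeddings": [
            {
                "path": vector_path,
                "dataType": "float32",
                "distanceFunction": "cosine",
                "dimensions": dimensions,
            }
        ]
    }
    indexing_policy = {
        "includedPaths": [{"path": "/*"}],
        # Exclude the vector path from the regular index; it gets a vector index instead
        "excludedPaths": [{"path": '/"_etag"/?'}, {"path": f"{vector_path}/*"}],
        "vectorIndexes": [{"path": vector_path, "type": "quantizedFlat"}],
    }
    return vector_embedding_policy, indexing_policy


embedding_policy, indexing_policy = build_vector_policies("/embeddings", 1536)

# With a live account and a recent azure-cosmos release, the policies could be
# passed when creating a container, for example:
# from azure.cosmos import PartitionKey
# db.create_container_if_not_exists(
#     id="my_cache_container",  # hypothetical container name
#     partition_key=PartitionKey(path="/id"),
#     indexing_policy=indexing_policy,
#     vector_embedding_policy=embedding_policy,
# )
```

The dimensions value should match the embedding model you deploy, which this notebook reads from `openai_embeddings_dimensions`.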


# Generate embeddings from Azure OpenAI

This is used to vectorize the user input for the vector search.

In [None]:
# generate openai embeddings
def generate_embeddings(text):    
    '''
    Generate embeddings from string of text.
    This will be used to vectorize data and user input for interactions with Azure OpenAI.
    '''
    print("Generating embeddings for: ", text, " with model: ", openai_embeddings_deployment)
    response = openai_client.embeddings.create(input=text, model=openai_embeddings_deployment)
    embeddings = response.model_dump()
    time.sleep(0.5)  # brief pause between calls to help avoid rate limiting
    return embeddings['data'][0]['embedding']

# Vector Search in Azure Cosmos DB for NoSQL

This defines a function for performing a vector search over the movies data and chat cache containers. The function takes a container reference, an array of vector embeddings, an optional minimum similarity score used to filter matches, and an optional number of results to return.

In [None]:
# Perform a vector search on the Cosmos DB container
def vector_search(container, vectors, similarity_score=0.02, num_results=3):
    
    # Execute the query
    results = list(container.query_items(
        query='SELECT TOP @num_results c.title,c.overview,c.completion, VectorDistance(c.embeddings,@embedding, false, {"distanceFunction": "cosine"}) as SimilarityScore FROM c WHERE VectorDistance(c.embeddings,@embedding, true, {"distanceFunction": "cosine"}) > @similarity_score',
        parameters=[
            {"name": "@embedding", "value": vectors},
            {"name": "@num_results", "value": num_results},
            {"name": "@similarity_score", "value": similarity_score}
        ],
        enable_cross_partition_query=True)
        )

    # Extract the necessary information from the results
    formatted_results = []
    for result in results:
        formatted_result = {
            'similarityScore': result['SimilarityScore'],
            'document': result
        }
        formatted_results.append(formatted_result)
    
    print(formatted_results)

    return formatted_results
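To make the roles of `similarity_score` and `num_results` concrete, the sketch below reproduces the same idea locally on a handful of in-memory documents: score each document by cosine similarity, keep those above the threshold, and return the top few. This is an illustration only, not how Cosmos DB executes the query. The helper names are hypothetical, and the explicit sort by score is an addition that the query above does not request.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def local_vector_search(documents, query_vector, similarity_score=0.02, num_results=3):
    """Score in-memory documents, filter by threshold, and return the best matches."""
    scored = [
        {"similarityScore": cosine_similarity(doc["embeddings"], query_vector), "document": doc}
        for doc in documents
    ]
    matches = [item for item in scored if item["similarityScore"] > similarity_score]
    matches.sort(key=lambda item: item["similarityScore"], reverse=True)
    return matches[:num_results]


# Toy two-dimensional "embeddings" for demonstration
toy_documents = [
    {"title": "A", "embeddings": [1.0, 0.0]},
    {"title": "B", "embeddings": [0.0, 1.0]},
    {"title": "C", "embeddings": [1.0, 1.0]},
]
toy_results = local_vector_search(toy_documents, [1.0, 0.0], similarity_score=0.5)
print([match["document"]["title"] for match in toy_results])
```

A threshold close to 1, such as the 0.99 used for the cache lookup later on, only admits questions that are nearly identical in meaning, while the low default of 0.02 admits almost anything loosely related.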

# Get recent chat history

This function provides conversational context to the LLM, allowing it to better have a conversation with the user.

In [None]:
# Grab recent chat history to include in the payload sent to the GPT model for completion.
def get_chat_history(history_container, completions=3):
    # Query Cosmos DB to retrieve chat history
    query = f"SELECT TOP {completions} c.prompt, c.completion FROM c ORDER BY c._ts DESC"
    items = list(history_container.query_items(query=query, enable_cross_partition_query=True))
    return items
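The f-string above is fine while `completions` is an integer set in code. If that value could ever come from user input, a parameterized query would be the safer choice. Here is a minimal sketch that mirrors the parameter style `vector_search` already uses for `TOP`; the helper name `build_history_query` is hypothetical.

```python
def build_history_query(completions=3):
    """Return the chat-history query text and its parameter list (hypothetical helper)."""
    query = "SELECT TOP @completions c.prompt, c.completion FROM c ORDER BY c._ts DESC"
    parameters = [{"name": "@completions", "value": int(completions)}]
    return query, parameters


history_query, history_parameters = build_history_query(3)
print(history_query)
print(history_parameters)

# With a live container, this could be run as:
# items = list(history_container.query_items(
#     query=history_query,
#     parameters=history_parameters,
#     enable_cross_partition_query=True))
```

Note that `ORDER BY c._ts DESC` returns the newest exchanges first, so a caller that needs chronological order would reverse the list.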

# Chat Completion Function

This function assembles all of the required data as a payload to send to a GPT model to generate a completion.

In [None]:
def generate_completion(user_prompt, vector_search_results, chat_history):
    # Minimal sketch of the payload builder that the pipeline function calls.
    # The system prompt wording and the temperature value are assumptions.
    system_prompt = (
        "You are an intelligent assistant for movies. "
        "Answer only from the movie information provided in this conversation, "
        "and keep your answers concise and friendly."
    )

    # Start the payload with the system prompt
    messages = [{'role': 'system', 'content': system_prompt}]

    # Add recent chat history, oldest first (get_chat_history returns newest first)
    for chat in reversed(chat_history):
        messages.append({'role': 'user', 'content': chat['prompt']})
        messages.append({'role': 'assistant', 'content': chat['completion']})

    # Add the vector search results as grounding context
    for result in vector_search_results:
        messages.append({'role': 'system', 'content': json.dumps(result['document'])})

    # Add the current user question last
    messages.append({'role': 'user', 'content': user_prompt})

    # Request the completion and return it as a plain dictionary
    response = openai_client.chat.completions.create(
        model=openai_completions_deployment,
        messages=messages,
        temperature=0.1
    )
    return response.model_dump()




# Cache Generated Responses

Save the user prompts and generated completions for a conversation. The cache is used to answer the same questions from other users, which is cheaper and faster than generating a new response each time.

In [None]:
def cache_response(cache_container, user_prompt, prompt_vectors, response):
    container = cache_container

    # Create a dictionary representing the chat document
    chat_document = {
        'id': str(uuid.uuid4()),  # Generate a unique ID for the document
        'prompt': user_prompt,
        'completion': response['choices'][0]['message']['content'],
        'completionTokens': str(response['usage']['completion_tokens']),
        'promptTokens': str(response['usage']['prompt_tokens']),
        'totalTokens': str(response['usage']['total_tokens']),
        'model': response['model'],
        'embeddings': prompt_vectors
    }

    # Insert the chat document into the Cosmos DB container
    container.create_item(body=chat_document)




# LLM Pipeline function

This function defines the pipeline for our RAG Pattern application. When a user submits a question, the cache is consulted first for a near-identical question (cosine similarity above 0.99). If there is no match, a vector search is run over the movies data, recent chat history is gathered, and the LLM generates a response, which is cached before being returned to the user.

In [None]:
def chat_completion(cache_container, movies_container, user_input):

    # Generate embeddings from the user input
    print("1\n")
    user_embeddings = generate_embeddings(user_input)
    print("10\n")
    # Query the chat history cache first to see if this question has been asked before
    cache_results = vector_search(container = cache_container, vectors = user_embeddings, similarity_score=0.99, num_results=1)

    if len(cache_results) > 0:
        print("11\n Cached Result\n")
        return cache_results[0]['document']['completion']
        
    else:
    
        #perform vector search on the movie collection
        print("2\n New result\n")
        search_results = vector_search(movies_container, user_embeddings)

        print("Getting Chat History\n")
        #chat history
        chat_history = get_chat_history(cache_container, 3)

        #generate the completion
        print("Generating completions \n")
        completions_results = generate_completion(user_input, search_results, chat_history)

        print("Caching response \n")
        #cache the response
        cache_response(cache_container, user_input, user_embeddings, completions_results)

        print("\n")
        # Return the generated LLM completion
        return completions_results['choices'][0]['message']['content'] 
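The control flow of this pipeline is the cache-aside pattern: look in the cache, and only on a miss do the expensive work and store the result. The sketch below isolates that flow with in-memory stand-ins so it can run without Azure. All names in it are hypothetical, and the dictionary performs an exact-match lookup, whereas the real pipeline matches on vector similarity.

```python
def answer_with_cache(question, cache, search, generate):
    """Return (answer, served_from_cache) using a cache-aside lookup."""
    cached = cache.get(question)
    if cached is not None:
        return cached, True

    # Cache miss: retrieve context, generate an answer, then store it
    context = search(question)
    answer = generate(question, context)
    cache[question] = answer
    return answer, False


# Stand-ins for vector search and the LLM call
generate_calls = []


def fake_search(question):
    return ["Toy Story"]


def fake_generate(question, context):
    generate_calls.append(question)
    return f"Try {context[0]}"


demo_cache = {}
first = answer_with_cache("kids movie?", demo_cache, fake_search, fake_generate)
second = answer_with_cache("kids movie?", demo_cache, fake_search, fake_generate)
print(first, second, len(generate_calls))
```

The second call never reaches the generator, which is where the time and token savings in the real application come from.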


# Create a simple UX in Gradio


In [None]:
chat_history = []
with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Ask me anything about movies!")
    clear = gr.Button("Clear")

    def user(user_message, chat_history):

        # Create a timer to measure the time it takes to complete the request
        start_time = time.time()
        
        
        print("5\n")
        # Get LLM completion
        response_payload = chat_completion(cache_container, movies_container, user_message)

        # Stop the timer
        end_time = time.time()

        elapsed_time = round((end_time - start_time) * 1000, 2)

        # Append user message and response to chat history
        chat_history.append([user_message, response_payload + f"\n (Time: {elapsed_time}ms)"])
        
        return gr.update(value=""), chat_history
    
    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False)
    
    clear.click(lambda: None, None, chatbot, queue=False)
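The timing logic inside `user` can also be factored into a small reusable helper. This is a sketch only: `timed_call` is a hypothetical name, and it uses `time.perf_counter`, which is intended for measuring short intervals, in place of `time.time`.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds rounded to 2 places)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    return result, elapsed_ms


# Example with a trivial function standing in for chat_completion
doubled, elapsed_ms = timed_call(lambda x: x * 2, 21)
print(doubled, f"(Time: {elapsed_ms}ms)")
```

In the Gradio handler, this would be called as `timed_call(chat_completion, cache_container, movies_container, user_message)`.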

In [None]:
# launch the gradio interface
demo.launch(debug=True)

In [None]:
# be sure to run this cell to close or restart the gradio demo
demo.close()