# Case Study: Retrieval Augmented Generation

This case study covers a typical Retrieval Augmented Generation (RAG) chatbot that provides information about a fictional product, the BirdBrain Automatic WiFi Pet Chicken Feeder.

Two common scenarios are shown, each with a series of steps that accelerate the response to improve the customer experience. In these examples the improvements speed up the application by up to 6.8x (troubleshooting) and 17x (frequently asked questions), though the gains depend heavily on your specific use case, so your results may vary.


A set of synthetically generated documents contains the product information and troubleshooting steps. As this case study is not about implementing RAG, these documents are simply loaded directly into the model's prompt (no vector search is performed) to keep the implementation as simple as possible.
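The prompt assembly used throughout this notebook can be captured in one small helper. This is a minimal sketch: the function name `build_prompt` and the placeholder document strings are assumptions made for illustration, while the section labels mirror the prompts used in the cells of this notebook.

```python
def build_prompt(product_information: str,
                 troubleshooting_information: str,
                 user_question: str) -> str:
    """Place every context document in the prompt, followed by the user's question."""
    return (
        "CONTEXT DOCUMENTS:\n"
        "Product information:\n"
        f"{product_information}\n"
        "Troubleshooting information:\n"
        f"{troubleshooting_information}\n"
        "\n"
        "User question:\n"
        f"{user_question}\n"
    )


# Placeholder documents, for illustration only (not the real product files).
example_prompt = build_prompt(
    "(product information goes here)",
    "(troubleshooting information goes here)",
    "How can I fix the wifi",
)
print(example_prompt)
```

Because no retrieval step filters the documents, the prompt size grows with the total size of the knowledge base, which is acceptable here only because the documents are small.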

Whilst these concepts are relatively simple, many production applications today do not make use of them. These should be considered a starting point for ideas to further improve and build out your RAG application.

If you have additional ideas, please submit a PR!

In [1]:
with open('troubleshooting_information.txt', 'r') as f:
    troubleshooting_information = f.read()
with open('product_information.txt', 'r') as f:
    product_information = f.read()

import copy
import datetime
import json
import os
import textwrap
import time

from dotenv import load_dotenv
from openai import AzureOpenAI

# Load environment variables
load_dotenv()

def aoai_call(system_message, prompt, model):
    """Send one chat completion request and return the parsed result, token counts, text, and elapsed time."""
    client = AzureOpenAI(
        api_version=os.getenv("API_VERSION"),
        azure_endpoint=os.getenv("AZURE_ENDPOINT"),
        api_key=os.getenv("API_KEY"),
    )

    # Time the full request, from sending the prompt to receiving the complete response
    start_time = time.time()

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ],
    )

    end_time = time.time()
    e2e_time = end_time - start_time

    # Convert the response to a plain dictionary and pull out the fields of interest
    result = json.loads(completion.model_dump_json(indent=2))
    prompt_tokens = result["usage"]["prompt_tokens"]
    completion_tokens = result["usage"]["completion_tokens"]
    completion_text = result["choices"][0]["message"]["content"]

    return result, prompt_tokens, completion_tokens, completion_text, e2e_time

model=os.getenv("MODELGPT432k")

## Scenario: Troubleshooting

This scenario covers a customer asking how to perform basic troubleshooting of a product. It demonstrates how two techniques, _generation token compression_ and _avoid rewriting documents_, can be used to speed up the application by 6.8x.
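As a quick check of the 6.8x figure, the speed-up of each step can be computed from the rounded timings reported for this scenario. The dictionary keys below are just labels chosen for this sketch.

```python
# Rounded end-to-end timings reported in this scenario, in seconds.
timings = {
    "base case": 23.0,
    "generation token compression": 9.8,
    "avoid rewriting documents": 3.4,
}

baseline = timings["base case"]

# Speed-up relative to the base case: baseline time divided by the new time.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

These are single measurements, so the ratios indicate the scale of the improvement rather than a precise benchmark.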



### Base Case

**Time taken: 23 seconds**

The bot has a number of context documents containing product and troubleshooting information. The user asks for help fixing the wifi, and the bot is expected to provide a summary and step-by-step instructions.

In [3]:
user_question="How can I fix the wifi"

system_message="""
You are a helpful AI assistant.
"""
prompt=f"""
CONTEXT DOCUMENTS:
Product information:
{product_information}
Troubleshooting information:
{troubleshooting_information}
"""

prompt=prompt+f"""
User question:
{user_question}
"""

result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model)
print(f"Prompt Tokens: {prompt_tokens}")
print(f"Completion Tokens: {completion_tokens}")
print(f"Time taken: {e2e_time:.2f} seconds")
print(f"Completion text: {completion_text}")


Prompt Tokens: 1085
Completion Tokens: 331
Time taken: 22.86 seconds
Completion text: Here are some steps to help troubleshoot Wi-Fi issues with your BirdBrain Automatic WiFi Pet Chicken Feeder:

1. **Check Wi-Fi Status:**
    - **Device Wi-Fi Toggle:** Ensure that your device’s Wi-Fi is turned on. Visit your device settings and confirm that the Wi-Fi is activated. If not, you can switch it on from the same setting.
    - **Phone Wi-Fi Capabilities:** Confirm that your smartphone’s Wi-Fi capabilities are working. Sometimes, modes like airplane or battery-saving can turn off Wi-Fi. If it's off, turn it back on.

2. **Restart Your Router:**
    - **Unplug and Wait:** Disconnect your router's power source. Wait for about half a minute to a minute before plugging it back again. This usually fixes most of the Wi-Fi problems.
    - **Router Placement:** Ensure your router's placement for optimal Wi-Fi connections. A centrally located router with no major obstructions around works best.

3. *

### Implement the Generation Token Compression technique

**Time taken: 9.8 seconds**

The fewer tokens the model generates, the faster the response will be. Prompting the model to be as concise as possible reduces the response time significantly, from 23 seconds to 9.8 seconds.
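To make the relationship between generated tokens and latency concrete, the sketch below fits a straight line through the two measurements printed in this scenario (331 completion tokens in 22.86 seconds for the base case, 139 tokens in 9.82 seconds for the concise run). The linear model is an assumption for illustration only: two data points always fit a line exactly, and real latency also depends on prompt length, load, and the model deployment.

```python
# Measurements printed by the two runs: (completion tokens, seconds).
base_tokens, base_seconds = 331, 22.86
concise_tokens, concise_seconds = 139, 9.82

# Assumed model: latency ~= fixed_overhead + seconds_per_token * completion_tokens
seconds_per_token = (base_seconds - concise_seconds) / (base_tokens - concise_tokens)
fixed_overhead = base_seconds - seconds_per_token * base_tokens


def estimate_seconds(completion_tokens: int) -> float:
    """Estimate end-to-end latency for a given number of generated tokens."""
    return fixed_overhead + seconds_per_token * completion_tokens


print(f"Roughly {seconds_per_token * 1000:.0f} ms per generated token")
print(f"Roughly {fixed_overhead:.2f} s of fixed overhead")
print(f"Estimated time for a 50-token answer: {estimate_seconds(50):.1f} s")
```

Under this rough model most of the response time is spent generating tokens, which is why shortening the output has such a large effect.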

The bot must still provide enough information to be genuinely helpful to the user, so a balance needs to be struck between speed and completeness.

Few-shot prompting can help set the right level of detail: the examples give the LLM clear guidance on how much information is expected.
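A minimal sketch of few-shot prompting for answer length is shown below. The helper name `build_few_shot_messages` and the example question and answer pairs are invented purely to show the expected length and format; they are not taken from the product documents. The resulting list uses the same `role`/`content` message structure that `aoai_call` passes to the chat completions API.

```python
def build_few_shot_messages(system_message, examples, user_question):
    """Build a chat message list with example exchanges placed before the real question."""
    messages = [{"role": "system", "content": system_message}]
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_question})
    return messages


# Hypothetical examples demonstrating the desired level of detail.
examples = [
    ("How do I reset the device?",
     "1. Hold the reset button for 5 seconds. 2. Wait for the light to blink."),
    ("Is there a warranty?",
     "Yes. See the warranty card included in the box."),
]

messages = build_few_shot_messages(
    "You are a helpful AI assistant. Answer in the same style and length as the examples.",
    examples,
    "How can I fix the wifi",
)

for message in messages:
    print(f"{message['role']}: {message['content']}")
```

Each example adds prompt tokens, so a small number of short examples is usually enough to set the tone without eroding the latency gain.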

In [4]:
user_question="How can I fix the wifi"

system_message="""
You are a helpful AI assistant. Be as succinct as possible.
"""
prompt=f"""
CONTEXT DOCUMENTS:
Product information:
{product_information}
Troubleshooting information:
{troubleshooting_information}
"""

prompt=prompt+f"""
User question:
{user_question}
"""

result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model)
print(f"Prompt Tokens: {prompt_tokens}")
print(f"Completion Tokens: {completion_tokens}")
print(f"Time taken: {e2e_time:.2f} seconds")
print(f"Completion text: {completion_text}")


Prompt Tokens: 1092
Completion Tokens: 139
Time taken: 9.82 seconds
Completion text: To fix the WiFi for the BirdBrain Automatic WiFi Pet Chicken Feeder, follow these steps:

1. Verify your device's WiFi is on. Check in your device settings.
2. Check your phone’s WiFi capabilities are enabled.
3. Restart your router by unplugging it for 30 seconds to 1 minute, then plug it back in.
4. Check its placement - it should be centrally located, elevated and away from obstructions.
5. Visit your ISP website to check for any reported outages in your area.
6. If the problem persists, consider resetting the WiFi connection on the feeder by powering it off and then back on, and reconnecting it to the WiFi network.


### Implement the Avoid Rewriting Documents technique

**Time taken: 3.4 seconds**

Rather than spending significant amounts of time writing out information already contained in the knowledge base, the LLM can identify the relevant section and let code append it to the response.

A common approach is to include a citation or link to the relevant document, or to append the document or chunk in full. The approach used here has the advantage of showing the user only the relevant section.

In this implementation, the LLM scans the context documents to find the relevant step-by-step information. It then returns the first three words and last three words of the relevant section. Python code is then used to extract the relevant information and append it to the short, succinct response to the user.
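The extraction step can be sketched as below. This is a variation on the split-based extraction used in this notebook, not a replacement for it: the helper names `extract_section` and `render_response` are hypothetical, the sample document is a placeholder, and unlike a plain `split` this version keeps the marker text and falls back to the summary when a marker cannot be found.

```python
import json
from typing import Optional


def extract_section(document: str, start: str, end: str) -> Optional[str]:
    """Return the text from `start` through `end` (inclusive), or None if a marker is missing."""
    start_index = document.find(start)
    if start_index == -1:
        return None
    end_index = document.find(end, start_index + len(start))
    if end_index == -1:
        return None
    return document[start_index:end_index + len(end)].strip()


def render_response(completion_text: str, document: str) -> str:
    """Parse the model's four-item list and append the extracted section when one is found."""
    try:
        flag, start, end, summary = json.loads(completion_text)
    except (ValueError, TypeError):
        # The model did not return a valid four-item JSON list.
        return "Sorry, I could not process that request."
    section = extract_section(document, start, end) if flag == "YES" else None
    if section is None:
        return summary
    return f"{summary}\n\nRelevant information:\n{section}"


# Placeholder document and model output, for illustration only.
sample_document = (
    "Intro text.\n"
    "1. Check Wi-Fi Status:\n"
    "Turn Wi-Fi on.\n"
    "Contact support if Wi-Fi is down.\n"
    "2. Other section."
)
sample_completion = '["YES","Check Wi-Fi Status:","Wi-Fi is down.","Here are some steps."]'
print(render_response(sample_completion, sample_document))
```

Handling the "marker not found" case matters in practice, because the model may paraphrase the first or last words instead of copying them verbatim.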

Rather than a JSON object with key-value pairs, the response uses a list, which is more efficient because the model does not spend tokens writing out the keys.
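The saving can be illustrated by serializing the same four values both ways. The key names below are hypothetical, and character count is used only as a rough proxy for tokens; an exact comparison would need the model's tokenizer.

```python
import json

# The same four values, once with (hypothetical) keys and once as a bare list.
fields = {
    "relevant_information": "YES",
    "first_three_words": "Check Wi-Fi Status:",
    "last_three_words": "Wi-Fi is down.",
    "summary": "Here are some troubleshooting steps for fixing your Wi-Fi.",
}

as_object = json.dumps(fields, separators=(",", ":"))
as_list = json.dumps(list(fields.values()), separators=(",", ":"))

print(f"Object form: {len(as_object)} characters")
print(f"List form:   {len(as_list)} characters")
print(f"Characters saved: {len(as_object) - len(as_list)}")
```

The trade-off is that a positional list is less self-describing, so the prompt has to spell out the meaning of each position, as the `LIST_STRUCTURE` description in this notebook does.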

_Note: this adds some complexity to the application, and this is only a rudimentary implementation of the technique. It assumes the source documents are written in a manner intended for a consumer to read._

In [5]:
user_question="How can I fix the wifi"

system_message="""
You are a helpful AI assistant.
"""
prompt=f"""
CONTEXT DOCUMENTS:
Product information:
{product_information}
Troubleshooting information:
{troubleshooting_information}

Respond with a list object, following the structure below. If there is relevant information in the context documents, set the relevant_information flag to "YES", and include the first 3 words of the relevant section, and the last 3 words in the list below. This will be used by a python script to extract and print the results to the user. If the question does not require additional information, or it cannot be found, set the relevant_information flag to "NO".
LIST_STRUCTURE:
["The first item in the list is either YES or NO, indicating whether there is relevant information in the context documents", "The second item is the first 3 words found at the start of the relevant section.", "The third item is the last 3 words found at the end of the relevant section.", "The fourth item is a short summary of the information, responding to the user's question, ideally a single sentence."],

USER_QUESTION:
How can I unblock the feeder?
LIST_OBJECT:
["YES","Check for Obstructions","arrange for repairs.","Here are instructions on unblocking the feeder!"]

USER_QUESTION:
Hi!
LIST_OBJECT:
["NO","NA","NA","Hi! How can I help?"]

"""

prompt=prompt+f"""
USER_QUESTION:
{user_question}
LIST_OBJECT:
"""

result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model)
print(f"Prompt Tokens: {prompt_tokens}")
print(f"Completion Tokens: {completion_tokens}")
print(f"Time taken: {e2e_time:.2f} seconds")
print(f"Completion text: {completion_text}")


Prompt Tokens: 1334
Completion Tokens: 25
Time taken: 3.42 seconds
Completion text: ["YES","Check Wi-Fi Status:","Wi-Fi is down.","Here are some troubleshooting steps for fixing your Wi-Fi."]


In [6]:
combined_document=product_information+troubleshooting_information

# Convert the JSON string to a Python list
lst = json.loads(completion_text)

flag = lst[0]

if flag=="YES":
    # The start and end substrings
    start = lst[1]
    end = lst[2]
    bot_response=lst[3]

    # Extract the string between the start and end substrings
    extracted_string = combined_document.split(start)[1].split(end)[0].strip()

    print(bot_response+"\n\nRelevant information:\n"+extracted_string)
else:
    print(lst[3])

Here are some troubleshooting steps for fixing your Wi-Fi.

Relevant information:
Device Wi-Fi Toggle:
Ensure your device’s Wi-Fi is turned on. Sometimes, accidental taps or settings adjustments can disable Wi-Fi.
Navigate to your device settings (usually found in the system menu or quick settings) and verify that Wi-Fi is indeed enabled.
If it’s off, toggle it back on.
Phone Wi-Fi Capabilities:
Verify that your phone’s Wi-Fi capabilities are enabled. Sometimes, airplane mode or battery-saving settings can inadvertently turn off Wi-Fi.
Double-check your phone’s Wi-Fi settings. If it’s disabled, switch it back to “on.”
2. Restart Your Router:
Unplug and Wait:
Unplug your router from its power source.
Wait for 30 seconds to 1 minute. This brief pause allows the router’s internal components to reset.
Plug it back in and observe the blinking lights as it boots up.
This simple step often resolves many Wi-Fi issues.
Router Placement:
Consider the placement of your router.
Is it tucked away i

## Scenario: Frequently asked, simple questions

With the Semantic Caching technique, answers to common questions can be cached and reused. This can lead to speed increases of up to 17x. Here are a number of common, related questions the bot would receive:

* How much does it cost? | What is the price? | How much is it?
* What sizes of chickens can eat from it? | What breeds of chickens can use it? | What birds is it suited for?
* How do I clean it? | What are the steps for cleaning? | What do I do if it is dirty?

The first time the bot encounters one of these questions (How much does it cost?), it takes the default amount of time to answer, around 16 seconds with GPT-4 32k. It correctly responds with "The BirdBrain Automatic WiFi Pet Chicken Feeder is priced at $89.90."

The second time the bot sees the exact same question, it takes only 0.3-1.4s to respond, as the answer has been cached in Redis, and is now being retrieved, rather than making an entirely new call to Azure OpenAI.
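The exact-match case can be sketched with an in-memory dictionary standing in for Redis. Everything here is an assumption for illustration: the class name `ExactMatchCache`, the normalization rule, and the `placeholder_answer` function that stands in for the call to Azure OpenAI.

```python
class ExactMatchCache:
    """Cache answers keyed on the normalized question text."""

    def __init__(self, answer_fn):
        self._answer_fn = answer_fn
        self._store = {}
        self.misses = 0  # number of times the underlying model had to be called

    def ask(self, question: str) -> str:
        # Normalize case and whitespace so trivially different spellings share a key.
        key = " ".join(question.lower().split())
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._answer_fn(question)
        return self._store[key]


def placeholder_answer(question: str) -> str:
    """Stand-in for a slow LLM call."""
    return f"(model answer to: {question})"


cache = ExactMatchCache(placeholder_answer)
first = cache.ask("How much does it cost?")
second = cache.ask("how much does it  cost?")   # served from the cache
third = cache.ask("What is the price?")         # different wording, so a cache miss
print(f"Model calls made: {cache.misses}")
```

The last call shows the limitation of exact matching: a paraphrased question misses the cache even though the answer should be identical, which is the gap semantic caching is meant to close.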

Let's consider a third question "What is the price?". As this is semantically related to "How much does it cost?", the bot still only takes 0.3-1.4s to respond, as it recognises that the answer should be the same for both questions.

It is important to note that this approach introduces a new parameter, "score_threshold", which must be tuned. Too low, and only questions that are extremely similar to previous questions will use the cache. Too high, and questions that are close but distinct in meaning will be matched incorrectly, and the wrong answer returned.
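The effect of the threshold can be sketched with toy vectors. This sketch assumes, as the description above implies, that the threshold acts as a maximum allowed distance between the new question and a cached one. The two-dimensional "embeddings", the `lookup` helper, and the use of cosine distance are all illustrative assumptions, not a description of how the Redis cache computes its scores.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, so identical directions give a distance of 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def lookup(query_vector, cache, score_threshold):
    """Return the closest cached answer if it is within score_threshold, else None."""
    best_answer, best_distance = None, float("inf")
    for vector, answer in cache:
        distance = cosine_distance(query_vector, vector)
        if distance < best_distance:
            best_answer, best_distance = answer, distance
    return best_answer if best_distance <= score_threshold else None


# Toy vectors invented for illustration; real embeddings have many more dimensions.
cache = [((1.0, 0.0), "price answer")]   # cached: "How much does it cost?"
paraphrase = (0.98, 0.2)                 # stands in for "What is the price?"
different = (0.6, 0.8)                   # stands in for "How do I clean it?"

for threshold in (0.001, 0.05, 0.5):
    print(threshold, lookup(paraphrase, cache, threshold), lookup(different, cache, threshold))
```

With the smallest threshold even the paraphrase misses the cache, with the middle value only the paraphrase hits, and with the largest value the unrelated question is wrongly served the price answer.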

This example requires setting up an Enterprise Redis Cache, in line with this tutorial:
* https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-tutorial-semantic-cache

You may wish to skip the setup and simply read through the cells below.


In [None]:
import os
import time

import langchain
import openai
import redis
from dotenv import load_dotenv
from langchain.cache import RedisSemanticCache
from langchain.globals import set_llm_cache
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
# Load environment variables
load_dotenv()
AZURE_ENDPOINT=os.getenv("AZURE_ENDPOINT")
API_KEY=os.getenv("API_KEY")
API_VERSION=os.getenv("API_VERSION")
LLM_DEPLOYMENT_NAME=os.getenv("MODELGPT432k")
LLM_MODEL_NAME=os.getenv("MODELGPT432k")
EMBEDDINGS_DEPLOYMENT_NAME=os.getenv("EMBEDDING")
EMBEDDINGS_MODEL_NAME=os.getenv("EMBEDDING")
REDIS_ENDPOINT=os.getenv("REDIS_ENDPOINT")
REDIS_PASSWORD=os.getenv("REDIS_PASSWORD")
os.environ["OPENAI_API_VERSION"] = API_VERSION
os.environ["AZURE_OPENAI_ENDPOINT"] = AZURE_ENDPOINT
os.environ["AZURE_OPENAI_API_KEY"] = API_KEY

llm = AzureChatOpenAI(
    deployment_name=LLM_MODEL_NAME,
)
from langchain_openai import AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
    model=EMBEDDINGS_MODEL_NAME,
)



### Base Case

**Time taken: 17 seconds**

On a website with a customer-facing chatbot, many of the questions to the chatbot will be repeated, for example "hello" or "tell me about the product". Every time one of these questions is asked, the LLM is called, using up tokens and compute.

In [20]:
user_question="How much does it cost?"

system_message="""
You are a helpful AI assistant.
"""
prompt=f"""
CONTEXT DOCUMENTS:
Product information:
{product_information}
Troubleshooting information:
{troubleshooting_information}
"""

prompt=prompt+f"""
User question:
{user_question}
"""

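# This reassignment overrides the system message defined earlier in the cell; only this version is sent to the model.
system_message = "You are a helpful assistant that answers questions to the best of your knowledge."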

def ask_question(question):
    response = llm.invoke([{"role": "system", "content": system_message}, {"role": "user", "content": question}])
    return response
start_time = time.time()
result=ask_question(prompt)
end_time = time.time()
elapsed_time = round(end_time - start_time,2)
print(f"The code took {elapsed_time} seconds to run.") # note, once run, the code will be cached, so the elapsed time stored in this notebook will be less than the actual time taken to run the code for the first time!
print(result)

### Implement Semantic Caching technique

**Time taken: 0.3 - 1.4 seconds**

Rather than making repeated calls to the API, the Semantic Caching technique serves repeated or similar questions from a cache, which can significantly reduce both latency and cost.

Care does need to be taken with this approach; refer to the technique notebook for more information.

In [16]:
redis_url = "rediss://:" + REDIS_PASSWORD + "@"+ REDIS_ENDPOINT
set_llm_cache(RedisSemanticCache(redis_url = redis_url, embedding=embeddings, score_threshold=0.01))

In [17]:
user_question="How much does it cost?"

system_message="""
You are a helpful AI assistant.
"""
prompt=f"""
CONTEXT DOCUMENTS:
Product information:
{product_information}
Troubleshooting information:
{troubleshooting_information}
"""

prompt=prompt+f"""
User question:
{user_question}
"""

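# This reassignment overrides the system message defined earlier in the cell; only this version is sent to the model.
system_message = "You are a helpful assistant that answers questions to the best of your knowledge."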

def ask_question(question):
    response = llm.invoke([{"role": "system", "content": system_message}, {"role": "user", "content": question}])
    return response
start_time = time.time()
result=ask_question(prompt)
end_time = time.time()
elapsed_time = round(end_time - start_time,2)
print(f"The code took {elapsed_time} seconds to run.")
print(result)

The code took 0.56 seconds to run.
content='The BirdBrain Automatic WiFi Pet Chicken Feeder is priced at $89.90.' id='run-f556823c-fa23-4a96-9282-5d45b8836abb-0'


In [14]:
user_question="What is the price?"

system_message="""
You are a helpful AI assistant.
"""
prompt=f"""
CONTEXT DOCUMENTS:
Product information:
{product_information}
Troubleshooting information:
{troubleshooting_information}
"""

prompt=prompt+f"""
User question:
{user_question}
"""

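# This reassignment overrides the system message defined earlier in the cell; only this version is sent to the model.
system_message = "You are a helpful assistant that answers questions to the best of your knowledge."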

def ask_question(question):
    response = llm.invoke([{"role": "system", "content": system_message}, {"role": "user", "content": question}])
    return response
start_time = time.time()
result=ask_question(prompt)
end_time = time.time()
elapsed_time = round(end_time - start_time,2)
print(f"The code took {elapsed_time} seconds to run.")
print(result)

The code took 0.39 seconds to run.
content='The BirdBrain Automatic WiFi Pet Chicken Feeder is priced at $89.90.' id='run-f556823c-fa23-4a96-9282-5d45b8836abb-0'
