# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">  Testing Prompt Engineering-Based LLM Applications </center>
<center style="font-family: consolas; font-size: 32px; font-weight: bold;">  Hands-On Prompt Engineering for LLMs Application Development </center> 

***



Once you have built an LLM-based application, how can you assess its performance? As you deploy it and users interact with it, how can you monitor its effectiveness, identify shortcomings, and continually improve the quality of its responses?

In this notebook, we will explore and share best practices for evaluating LLM outputs and provide insights into the experience of building these systems. One key distinction between this approach and traditional supervised machine learning applications is the speed at which you can develop LLM-based applications. 

As a result, evaluation methods typically do not begin with a predefined test set; instead, you gradually build a set of test examples as you refine the system.

#### <a id="top"></a>
# <div style="box-shadow: rgb(60, 121, 245) 0px 0px 0px 3px inset, rgb(255, 255, 255) 10px -10px 0px -3px, rgb(31, 193, 27) 10px -10px, rgb(255, 255, 255) 20px -20px 0px -3px, rgb(255, 217, 19) 20px -20px, rgb(255, 255, 255) 30px -30px 0px -3px, rgb(255, 156, 85) 30px -30px, rgb(255, 255, 255) 40px -40px 0px -3px, rgb(255, 85, 85) 40px -40px; padding:20px; margin-right: 40px; font-size:30px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(60, 121, 245);"><b>Table of contents</b></div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. Testing LLMs vs Testing Supervised Machine Learning Models </a> 
        <ul>
        <li><a href="#1.1" target="_self" rel=" noreferrer nofollow">1.1. Incremental Development of Test Sets</a></li>
        <li><a href="#1.2" target="_self" rel=" noreferrer nofollow">1.2. Automating Evaluation Metrics</a></li>  
        <li><a href="#1.3" target="_self" rel=" noreferrer nofollow">1.3. Scaling Up: From Handful to Larger Test Sets</a></li>    
        <li><a href="#1.4" target="_self" rel=" noreferrer nofollow">1.4. High-Risk Applications and Rigorous Testing</a></li>    
           </ul>
            </li>
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">2. Case Study: Product Recommendation System </a></li>
    <li><a href="#3" target="_self" rel=" noreferrer nofollow">3. Handling Errors and Refining Prompts </a></li> 
    <li><a href="#4" target="_self" rel=" noreferrer nofollow">4. Refining Prompts: Version 2 </a></li> 
    <li><a href="#5" target="_self" rel=" noreferrer nofollow">5. Testing and Validating the New Prompt </a></li> 
    <li><a href="#6" target="_self" rel=" noreferrer nofollow">6. Automating the Testing Process </a></li> 
    <li><a href="#7" target="_self" rel=" noreferrer nofollow">7. Further Steps: Iterative Tuning and Testing </a></li> 
    <li><a href="#8" target="_self" rel=" noreferrer nofollow">8. Conclusion </a></li> 
</ul>
</div>

***

<a id="1"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 1. Testing LLMs vs. Testing Supervised Machine Learning Models </b></div>


In the traditional supervised learning approach, collecting an additional 1,000 test examples when you already have 10,000 labeled examples isn't too burdensome.

It's common in this setting to gather a training set, a development set, and a test set, using them throughout the development process. However, when working with large language models (LLMs), you can specify a prompt in minutes and get results in hours. This makes pausing to collect 1,000 test examples a significant inconvenience, as LLMs don't require initial training examples to start working.

<a id="1.1"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.1 Incremental Development of Test Sets </b></div>


Building an application with an LLM often begins by tuning the prompts on a small set of examples, typically between one and five. As you continue testing, you'll encounter tricky examples where the prompt or algorithm fails.
In these failure cases, you can add these difficult examples to your growing development set. Eventually, manually running every example through the prompt becomes impractical each time you make a change.

<a id="1.2"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.2. Automating Evaluation Metrics </b></div>


At this stage, you develop metrics to measure performance on your small set of examples, such as average accuracy. An interesting aspect of this process is that if your system is working well enough at any point, you can stop and avoid further steps.
Many deployed applications stop at this stage and perform adequately. However, if your hand-built development set doesn't instill sufficient confidence in your system's performance, you may need to collect a randomly sampled set of examples for further tuning.
This set continues to serve as a development or hold-out cross-validation set, as it's common to keep tuning your prompt against it.
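
As a minimal sketch of such a metric, the snippet below computes average accuracy over a tiny hand-built development set. The names `dev_set`, `fake_classifier`, and `average_accuracy` are hypothetical, and the keyword rule stands in for a real LLM call so that the example runs on its own.

```python
# Hypothetical hand-built development set: (customer message, expected category).
dev_set = [
    ("Which TV can I buy if I'm on a budget?", "Televisions and Home Theater Systems"),
    ("I need a charger for my smartphone", "Smartphones and Accessories"),
    ("What computers do you have?", "Computers and Laptops"),
]

def fake_classifier(text):
    """Stand-in for an LLM call; a keyword rule used only for illustration."""
    lowered = text.lower()
    if "tv" in lowered:
        return "Televisions and Home Theater Systems"
    if "computer" in lowered:
        return "Computers and Laptops"
    return "Audio Equipment"

def average_accuracy(predict, examples):
    """Fraction of examples for which predict(text) equals the expected label."""
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    return correct / len(examples)

accuracy = average_accuracy(fake_classifier, dev_set)
print(f"accuracy: {accuracy:.2f}")
```

Because the metric is a single number, you can rerun it after every prompt change instead of reading each output by hand.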

<a id="1.3"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.3. Scaling Up: From Handful to Larger Test Sets </b></div>

If you require a higher fidelity estimate of your system's performance, you might collect and use a hold-out test set that you do not look at while tuning the model.
This step is crucial when your system is achieving 91% accuracy and you aim to reach 92% or 93%. Measuring such small performance differences necessitates a larger set of examples.
To get an unbiased, fair estimate of your system's performance, you'll need to go beyond the development set and collect a separate hold-out test set.
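
To see why small differences need more examples, here is a rough sketch that uses the normal approximation to the binomial distribution (an assumption that treats test examples as independent). The helper name `accuracy_standard_error` is hypothetical.

```python
import math

def accuracy_standard_error(p, n):
    """Standard error of an accuracy estimate p measured on n independent examples."""
    return math.sqrt(p * (1 - p) / n)

# Width of an approximate 95% interval around 91% accuracy for several test-set sizes.
for n in (100, 1000, 10000):
    half_width = 1.96 * accuracy_standard_error(0.91, n)
    print(f"n={n:>5}: 0.91 +/- {half_width:.3f}")
```

With only 100 examples the interval spans several percentage points, so a move from 91% to 92% cannot be told apart from noise; with 10,000 examples the interval shrinks to well under one point.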


<a id="1.4"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.4. High-Risk Applications and Rigorous Testing </b></div>


For many applications of LLMs, there is minimal risk of harm if the model provides a slightly incorrect answer. However, in high-risk applications where there is a risk of bias or harmful outputs, it is crucial to rigorously evaluate your system's performance before deployment.
In these cases, collecting a comprehensive test set is necessary to ensure the system performs correctly. Conversely, if you're using the LLM for low-risk tasks, such as summarizing articles for personal use, you can afford to stop early in the process without the expense of collecting larger data sets for evaluation.

---

<a id="2"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 2. Case Study: Product Recommendation System </b></div>


Let's take a case study in which we build a product recommendation system based on the user's input query. We will use the OpenAI Python library to access the OpenAI API. You can install this library using pip like this:

In [1]:
pip install openai

Collecting openai
  Downloading openai-1.30.5-py3-none-any.whl.metadata (21 kB)
Downloading openai-1.30.5-py3-none-any.whl (320 kB)
Installing collected packages: openai
Successfully installed openai-1.30.5
Note: you may need to restart the kernel to use updated packages.


Next, we import OpenAI and set the OpenAI API key, which is a secret key. You can get an API key from the OpenAI website. It is better to store it as an environment variable or a notebook secret (as done below with Kaggle secrets) so that it stays safe if you share your code. We will use OpenAI's GPT-3.5 Turbo model through the chat completions endpoint.

In [2]:
import os
import openai
from openai import OpenAI
import sys
sys.path.append('/kaggle/input/chatdata')
import utils
import panel as pn  # GUI
pn.extension()

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
openai.api_key = user_secrets.get_secret("openai_api")
client = OpenAI(
    # This is the default and can be omitted
    api_key=openai.api_key,
)


Finally, we define a helper function that makes it easier to send prompts and look at the generated outputs. The function, get_completion_from_messages, takes a list of messages and returns the model's completion for them.

In [3]:
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, max_tokens=500):

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message.content

Next, we use a function from the utils module to get the products and categories. The result maps each category to a list of products.
For example, the Computers and Laptops category holds a list of computers and laptops, the Smartphones and Accessories category holds a list of smartphones and accessories, and so on for the other categories.

In [4]:
products_and_category = utils.get_products_and_category()
products_and_category

{'Computers and Laptops': ['TechPro Ultrabook',
  'BlueWave Gaming Laptop',
  'PowerLite Convertible',
  'TechPro Desktop',
  'BlueWave Chromebook'],
 'Smartphones and Accessories': ['SmartX ProPhone',
  'MobiTech PowerCase',
  'SmartX MiniPhone',
  'MobiTech Wireless Charger',
  'SmartX EarBuds'],
 'Televisions and Home Theater Systems': ['CineView 4K TV',
  'SoundMax Home Theater',
  'CineView 8K TV',
  'SoundMax Soundbar',
  'CineView OLED TV'],
 'Gaming Consoles and Accessories': ['GameSphere X',
  'ProGamer Controller',
  'GameSphere Y',
  'ProGamer Racing Wheel',
  'GameSphere VR Headset'],
 'Audio Equipment': ['AudioPhonic Noise-Canceling Headphones',
  'WaveSound Bluetooth Speaker',
  'AudioPhonic True Wireless Earbuds',
  'WaveSound Soundbar',
  'AudioPhonic Turntable'],
 'Cameras and Camcorders': ['FotoSnap DSLR Camera',
  'ActionCam 4K',
  'FotoSnap Mirrorless Camera',
  'ZoomMaster Camcorder',
  'FotoSnap Instant Camera']}

Now, let's say the task is this: given a user input such as "**Which TV can I buy if I'm on a budget?**", retrieve the relevant categories and products so that we have the right information to answer the user's query.

In [5]:
def find_category_and_product_v1(user_input, products_and_category):
    delimiter = "####"
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a python list of json objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>

    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.
    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    """
    few_shot_user_1 = """I want the most expensive computer."""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
        {'role': 'system', 'content': system_message},    
        {'role': 'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
        {'role': 'assistant', 'content': few_shot_assistant_1},
        {'role': 'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)


The prompt specifies a set of instructions and gives the LLM one example of a good output. This is sometimes called few-shot prompting, or more precisely one-shot prompting, because we use a user message and an assistant message to give it a single example of a good output.
If someone says, "I want the most expensive computer," we return all the computers because we don't have pricing information. Now, let's use this prompt on the customer message, "Which TV can I buy if I'm on a budget?"
We pass in both the customer message, customer_msg_0, and the products and categories that we retrieved earlier using the utils function.

In [6]:
customer_msg_0 = f"""Which TV can I buy if I'm on a budget?"""
products_by_category_0 = find_category_and_product_v1(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)


    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]


We can see that it lists the information relevant to this query under the category Televisions and Home Theater Systems, along with the products in that category.
To see how well the prompt is doing, you can evaluate it on a second customer message: "I need a charger for my smartphone."

In [7]:
customer_msg_1 = f"""I need a charger for my smartphone"""
products_by_category_1 = find_category_and_product_v1(customer_msg_1,
                                                      products_and_category)
print(products_by_category_1)


    [{'category': 'Smartphones and Accessories', 'products': ['MobiTech Wireless Charger']}]
    


It correctly retrieves the category, Smartphones and Accessories, and lists the relevant product. We can try another example: "**What computers do you have?**" Hopefully, it retrieves a list of the computers.

In [8]:
customer_msg_2 = f"""
What computers do you have?"""
products_by_category_2 = find_category_and_product_v1(customer_msg_2,
                                                      products_and_category)
products_by_category_2

"\n    [{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]"

So far we have tried three different customer messages. If you are developing this prompt for the first time, it is quite reasonable to start with one, two, or three examples like these and to keep tuning the prompt until it retrieves the relevant products and categories for every one of them, all three in this example.

<a id="3"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 3. Handling Errors and Refining Prompts </b></div>


If the prompt had been missing some products or got a wrong category, then we should go back to edit the prompt a few times until it gets it right on all three of these prompts.

After you've gotten the system to this point, you might then start running the system in testing. Maybe send it to internal test users or try using it yourself, and just run it for a while to see what happens.

Sometimes you will run across a customer message that the prompt fails on. Here is an example: "Tell me about the smartx pro phone and the fotosnap camera, the dslr one. Also, what TVs do you have?" When this prompt was first run on this message, the response contained the right data followed by extra explanatory text, which makes it harder to parse into a Python list of dictionaries. Model outputs vary between runs: in the run shown below, the output is a clean list, but it leaves out the Televisions and Home Theater Systems category that the customer also asked about. Either way, the prompt does not handle this message well.

In [9]:
customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""
products_by_category_3 = find_category_and_product_v1(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)


    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}]
    


When you run across an example that the system fails on, common practice is to note it down as a tricky example and add it to the set of examples that you test the system on systematically.
If you keep running the system for a while longer, it may work on many more examples, since we tuned the prompt on three of them, but by chance you may run across another example where it produces an error.
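
When a response does carry extra text around the list, one defensive option is to cut out the bracketed span before parsing it. The sketch below is an assumption about how you might do that, not part of the original notebook: `extract_list` is a hypothetical helper, and it uses `ast.literal_eval` from the standard library because the model returns a Python-style list with single quotes.

```python
import ast

def extract_list(response):
    """Return the Python list embedded in a model response, or None if none parses."""
    start = response.find("[")
    end = response.rfind("]")
    if start == -1 or end < start:
        return None
    try:
        parsed = ast.literal_eval(response[start:end + 1])
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, list) else None

# Hypothetical response with trailing explanatory text.
noisy = """[{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}]
Note: I could not find pricing information."""
parsed = extract_list(noisy)
print(parsed)
```

This only helps when the extra text contains no square brackets of its own; tightening the prompt, as in the next section, addresses the problem at its source.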

<a id="4"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 4. Refining Prompts: Version 2 </b></div>


Here is a new version of the prompt, called prompt v2. We added the instruction "**Do not output any additional text that is not in JSON format**" to emphasize that the model should not append extra text to the JSON output. We also added a second few-shot example, using a user and an assistant message, in which the user asks for the cheapest computer. In both few-shot examples, we demonstrate a response that contains only JSON.

In the code, the added instruction appears in the system message, and "few_shot_user_1", "few_shot_assistant_1", "few_shot_user_2", and "few_shot_assistant_2" supply the two few-shot examples.

In [10]:
def find_category_and_product_v2(user_input, products_and_category):
    """
    Added: Do not output any additional text that is not in JSON format.
    Added a second example (for few-shot prompting) where user asks for 
    the cheapest computer. In both few-shot examples, the shown response 
    is the full list of products in JSON only.
    """
    delimiter = "####"
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a python list of json objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>
    Do not output any additional text that is not in JSON format.
    Do not write any explanatory text after outputting the requested JSON.

    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.
    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    """
    
    few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    few_shot_user_2 = """I want the cheapest computer. What do you recommend?"""
    few_shot_assistant_2 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages = [  
        {'role': 'system', 'content': system_message},    
        {'role': 'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
        {'role': 'assistant', 'content': few_shot_assistant_1},
        {'role': 'user', 'content': f"{delimiter}{few_shot_user_2}{delimiter}"},  
        {'role': 'assistant', 'content': few_shot_assistant_2},
        {'role': 'user', 'content': f"{delimiter}{user_input}{delimiter}"},
    ] 
    return get_completion_from_messages(messages)


Let's now test this new prompt and see how it performs on the same customer message.

<a id="5"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 5. Testing and Validating the New Prompt </b></div>


If you go back and rerun the new prompt, version v2, on the customer message that previously produced a broken output with extra text after the JSON, it generates a better output: only the list, with no additional text. Note that in the run shown below the Televisions and Home Theater Systems category is still missing, so this message remains a useful test case.
You can then manually rerun the new prompt on the other example inputs as well to confirm that they still produce correct outputs.

In [11]:
customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""

products_by_category_3 = find_category_and_product_v2(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)


    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}]
    


When you modify a prompt, it is also useful to do a bit of regression testing to make sure that fixing the incorrect output on one customer message did not break the output on the others.
However, it is not efficient to inspect every output by eye to confirm that it is exactly right. Once the development set that you are tuning to grows beyond a small handful of examples, it becomes useful to automate the testing process.
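
A regression check can be as simple as comparing pass/fail records between two prompt versions. The sketch below assumes you have already recorded, for each example, whether the output was acceptable; the example ids and results are made up for illustration, and `find_regressions` is a hypothetical helper.

```python
def find_regressions(old_results, new_results):
    """Return example ids that passed under the old prompt but fail under the new one."""
    return sorted(
        example_id
        for example_id, passed in old_results.items()
        if passed and not new_results.get(example_id, False)
    )

# Hypothetical pass/fail records per example id for two prompt versions.
v1_results = {"msg_0": True, "msg_1": True, "msg_2": True, "msg_3": False}
v2_results = {"msg_0": True, "msg_1": False, "msg_2": True, "msg_3": True}

regressions = find_regressions(v1_results, v2_results)
print(regressions)  # ['msg_1']
```

Here the new version fixes `msg_3` but breaks `msg_1`, which is exactly the kind of side effect that is easy to miss when checking only the example you set out to fix.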

<a id="6"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 6. Automating the Testing Process </b></div>


To automate the testing process, we define a set of 10 examples, each pairing a customer message with its ideal answer, which you can think of as the right answer in a test set.
The examples are indexed from 0 through 9. In the last one the user says, "I would like a hot tub time machine." We have no relevant products for that, so the ideal answer is an empty list.

In [12]:
msg_ideal_pairs_set = [
    
    # eg 0
    {'customer_msg': """Which TV can I buy if I'm on a budget?""",
     'ideal_answer': {
         'Televisions and Home Theater Systems': set(
             ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']
         )}
    },
    # eg 1
    {'customer_msg': """I need a charger for my smartphone""",
     'ideal_answer': {
         'Smartphones and Accessories': set(
             ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']
         )}
    },
    # eg 2
    {'customer_msg': f"""What computers do you have?""",
     'ideal_answer': {
         'Computers and Laptops': set(
             ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']
         )}
    },
    # eg 3
    {'customer_msg': f"""tell me about the smartx pro phone and \
    the fotosnap camera, the dslr one.\
    Also, what TVs do you have?""",
     'ideal_answer': {
         'Smartphones and Accessories': set(
             ['SmartX ProPhone']),
         'Cameras and Camcorders': set(
             ['FotoSnap DSLR Camera']),
         'Televisions and Home Theater Systems': set(
             ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV'])
         }
    }, 
    
    # eg 4
    {'customer_msg': """tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?""",
     'ideal_answer': {
         'Televisions and Home Theater Systems': set(
             ['CineView 8K TV']),
         'Gaming Consoles and Accessories': set(
             ['GameSphere X']),
         'Computers and Laptops': set(
             ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'])
         }
    },
    
    # eg 5
    {'customer_msg': f"""What smartphones do you have?""",
     'ideal_answer': {
         'Smartphones and Accessories': set(
             ['SmartX ProPhone', 'MobiTech PowerCase', 'SmartX MiniPhone', 'MobiTech Wireless Charger', 'SmartX EarBuds']
         )}
    },
    # eg 6
    {'customer_msg': f"""I'm on a budget. Can you recommend some smartphones to me?""",
     'ideal_answer': {
         'Smartphones and Accessories': set(
             ['SmartX EarBuds', 'SmartX MiniPhone', 'MobiTech PowerCase', 'SmartX ProPhone', 'MobiTech Wireless Charger']
         )}
    },
    # eg 7 # this will output a subset of the ideal answer
    {'customer_msg': f"""What Gaming consoles would be good for my friend who is into racing games?""",
     'ideal_answer': {
         'Gaming Consoles and Accessories': set([
             'GameSphere X',
             'ProGamer Controller',
             'GameSphere Y',
             'ProGamer Racing Wheel',
             'GameSphere VR Headset'
         ])}
    },
    # eg 8
    {'customer_msg': f"""What could be a good present for my videographer friend?""",
     'ideal_answer': {
         'Cameras and Camcorders': set([
             'FotoSnap DSLR Camera', 'ActionCam 4K', 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', 'FotoSnap Instant Camera'
         ])}
    },
    
    # eg 9
    {'customer_msg': f"""I would like a hot tub time machine.""",
     'ideal_answer': []
    }
    
]


If you want to automatically evaluate what the prompt is doing on any of these 10 examples, here is a function to do so.

In [13]:
import json

def eval_response_with_ideal(response, ideal, debug=False):
    
    if debug:
        print("response")
        print(response)
    
    # json.loads() expects double quotes, not single quotes
    # (note: this simple swap breaks if any name contains an apostrophe)
    json_like_str = response.replace("'", '"')
    
    # parse into a list of dictionaries
    l_of_d = json.loads(json_like_str)
    
    # special case when response is an empty list
    if l_of_d == [] and ideal == []:
        return 1
    
    # otherwise, response is empty 
    # or ideal should be empty, there's a mismatch
    elif l_of_d == [] or ideal == []:
        return 0
    
    correct = 0    
    
    if debug:
        print("l_of_d is")
        print(l_of_d)
        
    for d in l_of_d:
        cat = d.get('category')
        prod_l = d.get('products')
        
        if cat and prod_l:
            # convert list to set for comparison
            prod_set = set(prod_l)
            # get ideal set of products
            ideal_cat = ideal.get(cat)
            
            if ideal_cat:
                prod_set_ideal = set(ideal_cat)
            else:
                if debug:
                    print(f"did not find category {cat} in ideal")
                    print(f"ideal: {ideal}")
                continue
                
            if debug:
                print("prod_set\n", prod_set)
                print()
                print("prod_set_ideal\n", prod_set_ideal)
                
            if prod_set == prod_set_ideal:
                if debug:
                    print("correct")
                correct += 1
            else:
                print("incorrect")
                print(f"prod_set: {prod_set}")
                print(f"prod_set_ideal: {prod_set_ideal}")
                if prod_set <= prod_set_ideal:
                    print("response is a subset of the ideal answer")
                elif prod_set >= prod_set_ideal:
                    print("response is a superset of the ideal answer")
                    
    # count correct over the total number of items in the list
    pc_correct = correct / len(l_of_d)
        
    return pc_correct
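
To make the scoring rule concrete without calling the API, here is a simplified sketch of the same idea applied to hand-written response strings. `score_response` is a hypothetical, condensed variant of the function above, not a replacement for it; it parses with `ast.literal_eval` so that apostrophes inside names would not break parsing.

```python
import ast

def score_response(response, ideal):
    """Fraction of returned categories whose product set exactly matches the ideal."""
    items = ast.literal_eval(response.strip())
    if items == [] and ideal == []:
        return 1.0
    if items == [] or ideal == []:
        return 0.0
    correct = 0
    for item in items:
        expected = ideal.get(item.get("category"))
        if expected is not None and set(item.get("products", [])) == set(expected):
            correct += 1
    return correct / len(items)

# Ideal answer for example 1, copied from the test set above.
ideal_1 = {"Smartphones and Accessories": {
    "MobiTech PowerCase", "MobiTech Wireless Charger", "SmartX EarBuds"}}

# Hand-written responses: one with a single product, one with all three.
response_partial = ("[{'category': 'Smartphones and Accessories', "
                    "'products': ['MobiTech Wireless Charger']}]")
response_full = ("[{'category': 'Smartphones and Accessories', "
                 "'products': ['SmartX EarBuds', 'MobiTech PowerCase', "
                 "'MobiTech Wireless Charger']}]")

partial_score = score_response(response_partial, ideal_1)
full_score = score_response(response_full, ideal_1)
print(partial_score, full_score)  # 0.0 1.0
```

One design point to keep in mind: the denominator is the number of categories the model returned, so a response that omits an entire category is not penalized for the omission. If that matters for your application, you may want to divide by the number of categories in the ideal answer instead.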


Let's print the customer message for example 0, "**Which TV can I buy if I'm on a budget?**", together with its ideal answer. The ideal answer lists all the TVs and home theater products that we want the prompt to retrieve.

In [14]:
print(f'Customer message: {msg_ideal_pairs_set[0]["customer_msg"]}')
print(f'Ideal answer: {msg_ideal_pairs_set[0]["ideal_answer"]}')

Customer message: Which TV can I buy if I'm on a budget?
Ideal answer: {'Televisions and Home Theater Systems': {'CineView OLED TV', 'CineView 8K TV', 'CineView 4K TV', 'SoundMax Home Theater', 'SoundMax Soundbar'}}


When a response contains the category we wanted and the entire list of products, as the earlier response to this message did, eval_response_with_ideal gives it a score of 1.0. To show one more example, the prompt is known to get example 1 wrong, so let's change the index from 0 to 1 and print that example.

In [15]:
print(f'Customer message: {msg_ideal_pairs_set[1]["customer_msg"]}')
print(f'Ideal answer: {msg_ideal_pairs_set[1]["ideal_answer"]}')

Customer message: I need a charger for my smartphone
Ideal answer: {'Smartphones and Accessories': {'MobiTech Wireless Charger', 'MobiTech PowerCase', 'SmartX EarBuds'}}


For this customer message, the ideal answer lists three products under Smartphones and Accessories. The response we saw earlier for this message contained only one product where it should have had three, so it is missing some of the products. The next cell runs the v2 prompt on example 7 and scores the response against its ideal answer.

In [16]:
response = find_category_and_product_v2(msg_ideal_pairs_set[7]["customer_msg"],
                                         products_and_category)
print(f'Response: {response}')

eval_response_with_ideal(response,
                              msg_ideal_pairs_set[7]["ideal_answer"])

Response: 
    [{'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X', 'ProGamer Controller', 'GameSphere Y', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]


1.0
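
The subset and superset checks used when a response is marked incorrect can be illustrated in isolation. The following is a minimal sketch, not the notebook's own function: `compare_product_sets` is a hypothetical helper, and it uses strict comparisons (`<`, `>`) so that an exact match is reported separately.

```python
def compare_product_sets(response_products, ideal_products):
    """Classify how a predicted product list relates to the ideal set.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    prod_set = set(response_products)
    prod_set_ideal = set(ideal_products)
    if prod_set == prod_set_ideal:
        return "exact match"
    if prod_set < prod_set_ideal:   # strict subset: some products are missing
        return "proper subset (missing products)"
    if prod_set > prod_set_ideal:   # strict superset: extra products returned
        return "proper superset (extra products)"
    return "partial overlap or disjoint"


# Ideal answer for example 1, taken from the development set above.
ideal = {"MobiTech Wireless Charger", "MobiTech PowerCase", "SmartX EarBuds"}
print(compare_product_sets(["MobiTech Wireless Charger"], ideal))
# proper subset (missing products)
```

Knowing whether a wrong answer is a subset or a superset indicates whether the prompt needs to return more products or fewer.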

<a id="7"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 7. Further Steps: Iterative Tuning and Testing </b></div>

If I were tuning the prompt now, I would use a for loop to iterate over all 10 development set examples: pull out the customer message, get the ideal answer, call the LLM to get a response, evaluate it, and accumulate the score to compute an average.

This takes a while to run. When it finishes, we can see the result for all 10 examples: example 1 is wrong, as expected, and in the run shown below example 8 also scores 0, so 80% of the examples are correct. If you tune the prompt, you can rerun this loop to see whether the fraction correct goes up or down.

In [17]:
# Note, this will not work if any of the api calls time out
score_accum = 0
for i, pair in enumerate(msg_ideal_pairs_set):
    print(f"example {i}")
    
    customer_msg = pair['customer_msg']
    ideal = pair['ideal_answer']
    
    # print("Customer message",customer_msg)
    # print("ideal:",ideal)
    response = find_category_and_product_v2(customer_msg,
                                            products_and_category)

    # print("products_by_category",products_by_category)
    score = eval_response_with_ideal(response,ideal,debug=False)
    print(f"{i}: {score}")
    score_accum += score

n_examples = len(msg_ideal_pairs_set)
fraction_correct = score_accum / n_examples
print(f"Fraction correct out of {n_examples}: {fraction_correct}")

example 0
0: 1.0
example 1
incorrect
prod_set: {'MobiTech Wireless Charger'}
prod_set_ideal: {'MobiTech Wireless Charger', 'MobiTech PowerCase', 'SmartX EarBuds'}
response is a subset of the ideal answer
1: 0.0
example 2
2: 1.0
example 3
3: 1.0
example 4
4: 1.0
example 5
5: 1.0
example 6
6: 1.0
example 7
7: 1.0
example 8
8: 0
example 9
9: 1
Fraction correct out of 10: 0.8
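
The loop above stops if any API call times out. One way to make it more tolerant is sketched below, under stated assumptions: `evaluate_with_retries`, `predict_fn`, and `score_fn` are hypothetical names, and a stub stands in for the real LLM call so the example runs without network access. In the notebook, `predict_fn` would wrap `find_category_and_product_v2` and `score_fn` would wrap `eval_response_with_ideal`.

```python
import time


def evaluate_with_retries(pairs, predict_fn, score_fn, max_retries=2, wait_seconds=0.0):
    """Score each example, retrying failed calls and recording examples that never succeed."""
    scores, failed = [], []
    for i, pair in enumerate(pairs):
        for attempt in range(max_retries + 1):
            try:
                response = predict_fn(pair["customer_msg"])
            except Exception:  # in practice, catch only your client's timeout/error types
                time.sleep(wait_seconds)
                continue
            scores.append(score_fn(response, pair["ideal_answer"]))
            break
        else:
            # Every attempt raised, so exclude this example from the average.
            failed.append(i)
    fraction_correct = sum(scores) / len(scores) if scores else 0.0
    return fraction_correct, failed


# Stand-in for the LLM call: raises once on the second overall call, then succeeds.
calls = {"n": 0}

def flaky_predict(msg):
    calls["n"] += 1
    if calls["n"] == 2:
        raise TimeoutError("simulated timeout")
    return msg.upper()


pairs = [{"customer_msg": "a", "ideal_answer": "A"},
         {"customer_msg": "b", "ideal_answer": "B"}]
fraction, failed = evaluate_with_retries(
    pairs, flaky_predict, lambda response, ideal: float(response == ideal))
print(fraction, failed)  # 1.0 []
```

Returning the indices of failed examples keeps a transient outage from being silently counted as a wrong answer.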


<a id="8"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 8. Conclusion </b></div>


This notebook walked through the testing cycle of an LLM application, using a development set of 10 examples to tune and validate the prompts.

If you need an additional level of rigor, you can collect a randomly sampled set of about 100 examples with their ideal outputs, and potentially also a holdout test set that you do not look at while tuning the prompt.

However, if you are working on a safety-critical application or one with a non-trivial risk of harm, it is essential to gather a much larger test set to thoroughly verify performance before deployment.
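
The random sampling and holdout split mentioned above can be sketched as follows. This is a minimal illustration under assumptions: `split_dev_and_holdout` is a hypothetical helper, the placeholder examples are synthetic, and the 100/20 split is arbitrary.

```python
import random


def split_dev_and_holdout(examples, n_dev, seed=0):
    """Shuffle a copy of the examples and split it into development and holdout sets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(examples)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_dev], shuffled[n_dev:]


# Synthetic placeholders; real entries would hold customer messages and ideal answers.
examples = [{"customer_msg": f"message {i}", "ideal_answer": None} for i in range(120)]
dev_set, holdout_set = split_dev_and_holdout(examples, n_dev=100)
print(len(dev_set), len(holdout_set))  # 100 20
```

Fixing the seed keeps the holdout set stable across runs, so it stays unseen while the prompt is tuned on the development set.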

# <div style="box-shadow: rgba(240, 46, 170, 0.4) -5px 5px inset, rgba(240, 46, 170, 0.3) -10px 10px inset, rgba(240, 46, 170, 0.2) -15px 15px inset, rgba(240, 46, 170, 0.1) -20px 20px inset, rgba(240, 46, 170, 0.05) -25px 25px inset; padding:20px; font-size:30px; font-family: consolas; display:fill; border-radius:15px; color: rgba(240, 46, 170, 0.7)"> <b> ༼⁠ ⁠つ⁠ ⁠◕⁠‿⁠◕⁠ ⁠༽⁠つ Thank You!</b></div>

<p style="font-family:verdana; color:rgb(34, 34, 34); font-family: consolas; font-size: 16px;"> 💌 Thank you for taking the time to read through my notebook. I hope you found it interesting and informative. If you have any feedback or suggestions for improvement, please don't hesitate to let me know in the comments. <br><br> 🚀 If you liked this notebook, please consider upvoting it so that others can discover it too. Your support means a lot to me, and it helps to motivate me to create more content in the future. <br><br> ❤️ Once again, thank you for your support, and I hope to see you again soon!</p>