# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">  Prompt Engineering Best Practices: LLM Output Validation & Evaluation </center>

# <center style="font-family: consolas; font-size: 25px; font-weight: bold;">  Validating Output from Instruction-Tuned LLMs </center>
***



Checking outputs before showing them to users can be important for ensuring the quality, relevance, and safety of the responses provided to them or used in automation flows.
In this notebook, we will learn how to use the Moderation API by OpenAI to ensure safety and free of harassment output. 

Also, we will learn how to use additional prompts to the model to evaluate output quality before displaying them to the user to ensure the generated output follows the given instructions and is free of hallucinations.

#### <a id="top"></a>
# <div style="box-shadow: rgb(60, 121, 245) 0px 0px 0px 3px inset, rgb(255, 255, 255) 10px -10px 0px -3px, rgb(31, 193, 27) 10px -10px, rgb(255, 255, 255) 20px -20px 0px -3px, rgb(255, 217, 19) 20px -20px, rgb(255, 255, 255) 30px -30px 0px -3px, rgb(255, 156, 85) 30px -30px, rgb(255, 255, 255) 40px -40px 0px -3px, rgb(255, 85, 85) 40px -40px; padding:20px; margin-right: 40px; font-size:30px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(60, 121, 245);"><b>Table of contents</b></div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. Setting Up Working Environment & Getting Started </a> </li>
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">2. Checking Harmful Output </a></li>
    <li><a href="#3" target="_self" rel=" noreferrer nofollow">3. Checking Instruction Following </a></li> 
</ul>
</div>

***



<a id="1"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 1. Setting Up Working Environment & Getting Started </b></div>



We will use the OpenAI Python library to access the OpenAI API. You can this Python library using pip like this:

In [1]:
!pip install openai

Collecting openai
  Downloading openai-1.23.2-py3-none-any.whl.metadata (21 kB)
Downloading openai-1.23.2-py3-none-any.whl (311 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m311.2/311.2 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: openai
Successfully installed openai-1.23.2


Next, we will import **OpenAI** and then set the OpenAI API key which is a secret key. You can get one of these API keys from the OpenAI website. It is better to set this as an environment variable to keep it safe if you share your code. We will use OpenAI's chatGPT GPT 3.5 Turbo model, and the chat completions endpoint.

In [2]:
from openai import OpenAI
import openai
import os
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
openai.api_key = user_secrets.get_secret("openai_api")
client = OpenAI(
    # This is the default and can be omitted
    api_key=openai.api_key,
)

Finally, we will define a helper function to make it easier to use prompts and look at generated outputs. So that's this function, get_completion, that just takes in a prompt and will return the completion for that prompt.

In [3]:
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, max_tokens=500):

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message.content

<a id="2"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 2. Checking Harmful Output </b></div>


Moderation API can be used to filter and moderate outputs generated by the system itself. In the example below we will pass a generated response to the user and we're going to use the moderation API to see if this output is flagged.

In [4]:
final_response_to_customer = f"""
Introducing our latest tech lineup! The MegaScreen \
Tablet boasts a massive 10.5-inch display, 256GB storage, \
dual rear cameras, and lightning-fast 5G connectivity. \
Looking to capture breathtaking moments? \
Our ProCapture DSLR Camera sports a 30.4MP sensor, \
4K video recording, articulating touchscreen, \
and a range of compatible lenses. Dive into immersive entertainment \
with our VisionX 8K TV, featuring an expansive 75-inch display, \
cutting-edge 8K resolution, Dolby Vision HDR, and intuitive smart \ 
TV capabilities. Enhance your audio experience with our SonicWave \
Surround Sound System, delivering 7.1 channel audio, 1500W output, \
wireless rear speakers, and seamless Bluetooth connectivity. \
Have inquiries about these top-notch products or any others \
in our catalog? Feel free to ask!
"""
response = client.moderations.create(
    input=final_response_to_customer
)
moderation_output = response.results[0]

print(moderation_output)

Moderation(categories=Categories(harassment=False, harassment_threatening=False, hate=False, hate_threatening=False, self_harm=False, self_harm_instructions=False, self_harm_intent=False, sexual=False, sexual_minors=False, violence=False, violence_graphic=False, self-harm=False, sexual/minors=False, hate/threatening=False, violence/graphic=False, self-harm/intent=False, self-harm/instructions=False, harassment/threatening=False), category_scores=CategoryScores(harassment=7.094880129443482e-05, harassment_threatening=0.0003029387444257736, hate=1.1360682037775405e-05, hate_threatening=1.2714865079033189e-05, self_harm=2.4469779873470543e-06, self_harm_instructions=5.649101240123855e-07, self_harm_intent=9.25224412640091e-07, sexual=0.0029811603017151356, sexual_minors=7.626360456924886e-05, violence=0.00452632550150156, violence_graphic=0.0006234676693566144, self-harm=2.4469779873470543e-06, sexual/minors=7.626360456924886e-05, hate/threatening=1.2714865079033189e-05, violence/graphic=

As you can see, this output is not flagged and has shallow scores in all categories, which makes sense given the response. Let's take another response that has some harassment and is not safe.

In [5]:
final_response_to_customer = f"""
Introducing our latest tech lineup! The MegaScreen \
Tablet boasts a massive 10.5-inch display, 256GB storage, \
dual rear cameras, and lightning-fast 5G connectivity. \
Looking to capture breathtaking moments and feel the real horror? \
Our ProCapture DSLR Camera sports a 30.4MP sensor, \
4K video recording, articulating touchscreen, \
and a range of compatible lenses. Dive into immersive entertainment \ with our VisionX 8K TV, featuring an expansive 75-inch display, \
cutting-edge 8K resolution, Dolby Vision HDR, and intuitive smart \ 
TV capabilities. Enhance your shitty audio experience with our SonicWave \
Surround Sound System, delivering 7.1 channel audio, 1500W output, \
wireless rear speakers, and seamless Bluetooth connectivity. \
Have inquiries about these top-notch products or any others \
in our catalog? Feel free to ask althoug I know you are stupid and did not 
understand anything!
"""

response = client.moderations.create(
    input=final_response_to_customer
)
moderation_output = response.results[0]

print(moderation_output)


Moderation(categories=Categories(harassment=True, harassment_threatening=False, hate=False, hate_threatening=False, self_harm=False, self_harm_instructions=False, self_harm_intent=False, sexual=False, sexual_minors=False, violence=False, violence_graphic=False, self-harm=False, sexual/minors=False, hate/threatening=False, violence/graphic=False, self-harm/intent=False, self-harm/instructions=False, harassment/threatening=False), category_scores=CategoryScores(harassment=0.9488646984100342, harassment_threatening=0.0002881233813241124, hate=0.004045594017952681, hate_threatening=2.306127726114937e-06, self_harm=5.0582943913468625e-06, self_harm_instructions=2.4409227989963256e-06, self_harm_intent=9.364253514831944e-07, sexual=0.00029609090415760875, sexual_minors=4.70113400297123e-06, violence=0.0005727125098928809, violence_graphic=3.36030097969342e-05, self-harm=5.0582943913468625e-06, sexual/minors=4.70113400297123e-06, hate/threatening=2.306127726114937e-06, violence/graphic=3.3603

We can see now that the response is flagged and the harassment score is high. In general, it can also be important to check the outputs. For example, if you were creating a chatbot for sensitive audiences, you could use lower thresholds for flagging outputs. 

In general, if the moderation output indicates that the content is flagged, you can take appropriate action such as responding with a fallback answer or generating a new response. Note that as we improve the models, they also are becoming less and less likely to return some kind of harmful output.

<a id="3"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 3. Checking Instruction Following </b></div>


Another approach for checking outputs is asking the model itself if the generated was satisfactory and if it follows a rubric you define. This can be done by providing the generated output as part of the input to the model and asking it to rate the quality of the output. 

You can do this in various ways. 
So let's see an example in which our system message is "**You are an assistant that evaluates whether customer service agent responses sufficiently answer customer questions and also validates that all the facts the assistant cites from the product information are correct. The product information and user and customer service agent messages will be delivered by three backticks. respond with a Y or N character with no punctuation. Y if the output sufficiently answers the question and the response correctly uses product information and no otherwise. Output a single letter only**"

In [6]:
system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
The product information and user and customer \
service agent messages will be delimited by \
3 backticks, i.e. ```.
Respond with a Y or N character, with no punctuation:
Y - if the output sufficiently answers the question \
AND the response correctly uses product information
N - otherwise

Output a single letter only.
"""


You could also add some other kinds of guidelines. You could ask, or give a rubric, like a rubric for an exam or essay grading. You could use that format and say, does this use a friendly tone and maybe outline some of your brand guidelines.

So let's add our customer message and the product information.

In [7]:
customer_message = f"""
tell me about the smartx pro phone and \
the fotosnap camera, the dslr one. \
Also tell me about your tvs"""

product_information = """{ "name": "SmartX ProPhone", "category": "Smartphones and Accessories", "brand": "SmartX", "model_number": "SX-PP10", "warranty": "1 year", "rating": 4.6, "features": [ "6.1-inch display", "128GB storage", "12MP dual camera", "5G" ], "description": "A powerful smartphone with advanced camera features.", "price": 899.99 } { "name": "FotoSnap DSLR Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-DSLR200", "warranty": "1 year", "rating": 4.7, "features": [ "24.2MP sensor", "1080p video", "3-inch LCD", "Interchangeable lenses" ], "description": "Capture stunning photos and videos with this versatile DSLR camera.", "price": 599.99 } { "name": "CineView 4K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-4K55", "warranty": "2 years", "rating": 4.8, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "A stunning 4K TV with vibrant colors and smart features.", "price": 599.99 } { "name": "SoundMax Home Theater", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-HT100", "warranty": "1 year", "rating": 4.4, "features": [ "5.1 channel", "1000W output", "Wireless subwoofer", "Bluetooth" ], "description": "A powerful home theater system for an immersive audio experience.", "price": 399.99 } { "name": "CineView 8K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-8K65", "warranty": "2 years", "rating": 4.9, "features": [ "65-inch display", "8K resolution", "HDR", "Smart TV" ], "description": "Experience the future of television with this stunning 8K TV.", "price": 2999.99 } { "name": "SoundMax Soundbar", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-SB50", "warranty": "1 year", "rating": 4.3, "features": [ "2.1 channel", "300W output", "Wireless subwoofer", "Bluetooth" ], "description": "Upgrade your TV's audio with this sleek and powerful soundbar.", "price": 199.99 } { "name": "CineView OLED TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-OLED55", "warranty": "2 years", "rating": 4.7, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "Experience true blacks and vibrant colors with this OLED TV.", "price": 1499.99 }"""

### Now we will define the comparison

In [8]:
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{final_response_to_customer}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question

Output Y or N
"""


#### Finally, we will format this into a messages list and get the response from the model

In [9]:
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages, max_tokens=1)
print(response)

N


So the model says yes, the product information is correct and the question is answered sufficiently. 
Let's try another response which is "**Life is like a box of chocolates**."

In [10]:
another_response = "life is like a box of chocolates"

So let's add our message to do with the output checking.

In [11]:
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{another_response}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages)
print(response)

N


The model has determined that this does not sufficiently answer the question or use the retrieved information.
The question we used here "Does the response use the retrieved information correctly? Does the response sufficiently answer the question?" is a good prompt to use if you want to make sure that the model isn't hallucinating and is not making up things that aren't true. 

As you can see, the model can provide feedback on the quality of a generated output, and you can use this feedback to decide whether to present the output to the user or to generate a new response. 

You could even experiment with generating multiple model responses per user query, and then having the model choose the best one to show the user.

***


 In general, it is advisable to use the moderation API to validate outputs, as it promotes responsible and ethical usage. While there is a possibility of obtaining immediate feedback by asking the model to evaluate itself in certain cases, it appears to be largely unnecessary, particularly with advanced models such as GPT-4.
 
From my experience, this method is not frequently utilized in practice. Additionally, implementing it would result in increased system delay and expenses, as it requires an extra model call and additional tokens.

If achieving an extremely low error rate of 0.00001% is crucial for your Apple product, then this approach may be worth considering. However, overall, I would not recommend adopting it as a standard practice in general circumstances.

# <div style="box-shadow: rgba(240, 46, 170, 0.4) -5px 5px inset, rgba(240, 46, 170, 0.3) -10px 10px inset, rgba(240, 46, 170, 0.2) -15px 15px inset, rgba(240, 46, 170, 0.1) -20px 20px inset, rgba(240, 46, 170, 0.05) -25px 25px inset; padding:20px; font-size:30px; font-family: consolas; display:fill; border-radius:15px; color: rgba(240, 46, 170, 0.7)"> <b> ༼⁠ ⁠つ⁠ ⁠◕⁠‿⁠◕⁠ ⁠༽⁠つ Thank You!</b></div>

<p style="font-family:verdana; color:rgb(34, 34, 34); font-family: consolas; font-size: 16px;"> 💌 Thank you for taking the time to read through my notebook. I hope you found it interesting and informative. If you have any feedback or suggestions for improvement, please don't hesitate to let me know in the comments. <br><br> 🚀 If you liked this notebook, please consider upvoting it so that others can discover it too. Your support means a lot to me, and it helps to motivate me to create more content in the future. <br><br> ❤️ Once again, thank you for your support, and I hope to see you again soon!</p>