### Install Vertex AI SDK for Gen AI Evaluation

In [1]:
%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]


### Authenticate your notebook environment (Colab only)

In [3]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Increase quota (optional)

A higher quota can reduce rate-limit errors and shorten evaluation runs. For details, see [online evaluation quotas](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#eval-quotas).

### Set Google Cloud project information and initialize Vertex AI SDK

In [4]:
PROJECT_ID = "covid19-chatbot-324804"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}
EXPERIMENT = "rag-eval-04"  # @param {type:"string"}


import vertexai

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")

vertexai.init(project=PROJECT_ID, location=LOCATION)
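
The check above only rejects an empty value or the untouched placeholder. A minimal sketch of a slightly stricter resolver follows. `resolve_project_id` is a hypothetical helper, and falling back to a `GOOGLE_CLOUD_PROJECT` environment variable is an assumption about how your environment is configured, not something the SDK requires.

```python
import os


def resolve_project_id(explicit: str = "", env_var: str = "GOOGLE_CLOUD_PROJECT") -> str:
    """Return a usable project ID or raise (hypothetical helper, not part of the SDK)."""
    placeholder = "[your-project-id]"
    # Prefer the value typed into the notebook form; ignore blanks and the placeholder.
    candidate = explicit.strip()
    if not candidate or candidate == placeholder:
        # Assumption: the environment may expose the project under this variable.
        candidate = os.environ.get(env_var, "").strip()
    if not candidate or candidate == placeholder:
        raise ValueError("Please set your PROJECT_ID")
    return candidate


print(resolve_project_id("my-demo-project"))
```

The returned value could then be passed to `vertexai.init(project=..., location=LOCATION)` in place of the raw form field.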

### Import libraries

In [5]:
import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PointwiseMetric
from vertexai.preview.evaluation import notebook_utils

In [7]:
questions = [
    "Hi! My flight from MAN-LHR-BWI on November 3rd was canceled. I was excited to try your Club 787 product. The only available rebooking was to IAD, which is inconvenient. Is there any chance for a complimentary upgrade to first class on BA293 due to the trouble?",
    "Thank you for airing This Is Us and for having such great flight attendants on my flight home!",
    "I'm on flight DL4047 to CLE. A passenger in seat 3A is moaning in pain and sweating. He is clearly unwell.",
    "I flew the same route from EWR before, but was forced to check my guitar amp on the MIA leg and charged $25. This was before security, and the flight wasn’t even full. Why?",
    "The flight attendant and pilot on flight 4985 were amazing — kind, helpful, and funny. I wish they were my crew every time I fly!",
    "Are you planning to update COD Zombies for iOS 11?",
    "What's going on with your app? It's been stuck for an hour and froze even before the update. It's been unusable since.",
    "I can't connect my Interactive Brokers Canada account with the Yahoo Finance app on iOS 11. Any suggestions?",
    "iOS 11 is draining my battery badly. Please fix it.",
    "My Phone app doesn’t work — after the update, my iPhone feels more like an iPod.",
    "The link you sent is broken and the information provided is incorrect.",
    "Verizon is down across the board. When can we expect it to be fixed?",
    "Please check on case CAS19708536CB92N2 / CRM0326600. I haven't heard from my case officer in weeks despite submitting all documents.",
    "Where can I chat with a support agent regarding a false charge?",
    "I've called support four times and waited on hold each time. What's the point of having a support line if no one answers?",
    "Please fix my connection — this is incredibly frustrating.",
    "I keep getting an error that says 'the item can’t be played'. How do I fix this?",
    "Why does the letter 'I' turn into an A and a question mark after updating my phone twice?",
    "My Phone app isn’t working — after the update, it feels like my iPhone turned into an iPod.",
    "Please fix the WiFi — I have homework due and need a stable connection.",
    "I’ve spoken to three different reps and received three different answers, but still don’t have my order. It was marked as delivered on Saturday, but I was home all day and nothing arrived.",
    "They sent my order to Germany by mistake! The driver’s looking for an address in the wrong country. I’m so frustrated with Amazon right now.",
    "My order was wrong and the food was terrible. I only received a $5 refund when I should’ve gotten a full one. Really disappointed.",
    "I’m furious with Argos — my delivery was scheduled between 7 PM and 10 PM tonight, and no one showed up.",
    "I ordered food for 8 hungry neighborhood kids, and it was canceled after an hour of waiting. Are you serious?",
    "When did you stop offering 20% off video game preorders for Prime members? I'm confused about this change.",
    "I paid using PaisaPay and the estimated delivery was between Oct 25–27, but I haven’t received it yet. Can someone check?",
    "Why is the delivery driver just sitting outside my house with only a few minutes left in the delivery window?",
    "I ordered a book that was supposed to arrive on October 23rd, but it's still not here and I don’t know where it is.",
    "What’s the point of preordering a game if it arrives five days after release?",
    "My T-Mobile internet stopped working for 3 hours. What's going on?",
    "Can you turn on the internet towers? My connection isn’t working.",
    "Why does my internet keep disconnecting? It’s extremely frustrating.",
    "I was looking forward to watching Stranger Things but my internet is out again. Our area’s been having frequent outages.",
    "I just signed up with Xfinity. Will my modem be shipped or do I need to pick it up?",
    "I need help recovering two Hotmail accounts. I submitted the recovery forms but keep getting told I haven’t provided enough information.",
    "My account was accessed and the email was changed. I need help recovering it.",
    "Please help me close my Amazon Payments account. I don’t want it active anymore.",
    "Every time I try to make changes on my account, it says I’m temporarily locked out. I already confirmed my card.",
    "My account is asking me to reset the password, but I don’t remember my date of birth. I still know the current password — please help.",
    "This has been a terrible customer experience. I’m never coming back.",
    "My wife and I enjoy the content, but the new interface is very difficult to use.",
    "I’ve called Amazon support multiple times with no resolution or promised follow-up.",
    "Really disappointed by the service at SFO. Gate staff were unhelpful and rude.",
    "I’ve been a loyal customer for years, but I’ve never been treated this badly. I need to escalate this to corporate."
]


retrieved_contexts = [
    "Hi Daniel, we would not offer a complimentary upgrade for a cancelled flight. Who advised you to contact us?",
    "We're glad you enjoyed the show and our flight attendants! If you'll DM your confirmation number, we'll make sure they get your message.",
    "Thank you for bringing this to our attention. We're sorry to hear that. Please reach out to the Flight Attendants for assistance.",
    "Hi there! I'm sorry to hear about that. Can you please provide your flight details and we'll look into this for you?",
    "Hi Kyle, thanks for recognizing our employees. We are proud of our crew too. Every day our team is working to make flying with us better!",
    "Good afternoon William, unfortunately this app is no longer supported. Sincerest apologies.",
    "Hi Jeffrey, were you able to update to Lightroom Classic CC? Tanuj.",
    "Hi are you receiving an error message when trying to link the account?",
    "Thanks for reaching out! We would like to help. Can you tell us which version of iOS is installed in Settings > General > About?",
    "Thanks for reaching out! Can you please DM your address and phone number so we can check your connection?",
    "Thanks for bringing this to our attention. I apologize for the inconvenience. Let me look into this and see if I can find the correct information.",
    "Hi! I can look into this issue. Can you please share your location?",
    "We have replied to you via DM. Check it out Yana.",
    "Hi! I'm sorry to hear about the false charge. We'd be happy to chat with you about this issue. Let's get started!",
    "We're sorry to hear that you've been waiting on hold for support. We're here to help. Can you please send us a message with your email address so we can look into this?",
    "Hi there! I'm sorry to hear about your frustration. Please message your account info and the details of the issue so we can take a look.",
    "Hi! I'm sorry to hear that. Can you please DM your username or email address? We'll investigate the issue and get back to you soon.",
    "Thanks for reaching out! Can you please DM your iOS version and iPhone model so we can look into this issue?",
    "Thanks for reaching out! We're sorry to hear that the update has caused issues with your phone app. Can you please DM us with your device model and the specific version of iOS you're running? We'll look into this for you.",
    "Hi! I'm sorry to hear about the trouble you're experiencing with your WiFi. Can you please DM me your address and phone number so we can take a look and see what we can do to help?",
    "I am sorry to hear that you've had three different answers from our team. Let's work together to resolve this issue. Can you please DM me your order number and the date it was marked as delivered?",
    "Hi there! I am sorry to hear that. We will look into this immediately. Can you please DM us your order number and any other relevant details? We will be glad to help!",
    "I'm sorry to hear that! Please message us your account email so we can look into this. We'd like to make things right.",
    "Hi there! I'm sorry to hear that. Can you follow and message us your order number so we can look into this for you? Thanks Tom",
    "I'm sorry to hear about that. Please connect with us via the Help section in the app so we can review this further.",
    "Hi! I'm sorry for the confusion. The 20% discount is still available for eligible physical game preorders. For more information on Prime benefits, please check out this link.",
    "Hi! I'm sorry to hear that you haven't received your package yet. Can you please provide me with your order number and the estimated delivery date? I'll look into this for you.",
    "Hi, there! I can take a look for you. Would you be able to message me your service address and phone number, please?",
    "I apologize for the inconvenience. Could you please check your order confirmation email for the original delivery date?",
    "What’s the point of preordering a game if it arrives five days after release?",
    "Hello, I am happy to look into your internet. Can you message your account name and full address?",
    "Hey Tayler, having service issues? Shoot us a message for further assistance!",
    "We can help right here. Follow and message us.",
    "Hi JeanClaude, I hope you do not have any service interruptions. If you need assistance, reach out to us. Mike",
    "Hello, thanks for reaching out about your new service concerns and I am happy to help. Can you please message the full address? ~",
    "Hi! I'm sorry to hear that you're having trouble recovering your Hotmail accounts. To assist you in recovering your account, I recommend that you contact our Answer desk team. They can provide further guidance on the recovery process and ensure that you have all the necessary information to successfully recover your accounts.",
    "Sorry to read that. Please check your DMs for assistance.",
    "I am sorry for the frustrating experience. If you would still like to close your account, please contact us here Good Game",
    "Hi Ashley. Which U.S. based card is this concerning? Please do not release any personal or card information. Clarissa",
    "Glad to help! We have sent you a Direct Message via Twitter with further instructions.",
    "Hey we are here and we want to help! If you have any questions feel free to shoot us a DM! KatGrisham",
    "Oh no! Any aspect about navigation that you find difficult? For now, the following guide may help.",
    "Can you tell us what insight/advice was provided when you spoke with us last? We would like to help if we can.",
    "Please DM your confirmation number so we can investigate and make it right.",
    "Our apologies for any frustrations or inconvenience Kyle. Is there anything we may be able to assist with?"
]


generated_answers_by_rag_a = [
    "Hi there! I'm sorry to hear about your flight cancellation and the inconvenience it has caused. While we can't offer a complimentary upgrade for a cancelled flight, we can certainly check the availability and quote an upgrade for you on BA293 if you'd like? Let us know! - Best, Sarah",
    "\"Thank you for flying with us! We're glad you enjoyed 'This Is Us' and our flight attendants. Your kind words mean a lot to us. \"",
    "We're truly sorry to hear about the situation on board flight DL4047. Please reach out to the Flight Attendants for immediate assistance regarding the unwell passenger in seat 3A.",
    "I'm sorry for the inconvenience, it seems there might have been a misunderstanding. Guitars usually need to fit underneath the seat in front of you or in an overhead bin. If it didn't fit, a charge may have been applied. Let's look into this further if you provide us more details.",
    "\"Delighted to hear about your fantastic experience, Sarah! Our crew on flight 4985 truly shone. We're glad you enjoyed their kindness, help, and humor. Looking forward to having them on your future flights too! \"",
    "Hi there! I'd be happy to help, but I'm afraid I don't have any information on that. Sincerest apologies. If you'd like to discuss further, feel free to DM me. ",
    "Hi there! I'm really sorry to hear that you're having trouble with our app. It sounds like you've been experiencing issues even before the latest update. Could you please let me know which device and operating system you're using? This will help us better understand and assist you more effectively.",
    "Hi there! I'm sorry to hear you're having trouble connecting your Interactive Brokers Canada account with the Yahoo Finance app on iOS 11. Are you receiving an error message when trying to link the account? Let us know if that's the case, and we'll do our best to help you out.",
    "We're really sorry to hear about your battery issues with iOS 11. Let's work together to find a solution. Could you please tell us which iPhone model you're using? We'll take it from there. ",
    "Oh no, we're sorry to hear that your Phone app isn't working after the update. Let's see if we can get this sorted out. Could you please tell us which iPhone model you have? ",
    "I'm really sorry for any inconvenience this may have caused. Let's get this sorted out. Could you please provide more details about the error you're encountering with the link? We'll look into it right away. ",
    "We're sorry to hear that you're experiencing issues with Verizon services. To help us better assist you, could you please let us know the area you're located in? This will help us determine if there's an outage in your specific region. We appreciate your patience as we work to resolve this issue.",
    "Hi there! I'm sorry for the delay in getting back to you regarding your case CAS19708536CB92N2 / CRM0326600. We'll look into it and get back to you shortly. Thank you for your patience.",
    "I'm really sorry to hear about the false charge, we'd be more than happy to help! Let's get this sorted for you. Please send us a DM with your email and details of the charge so we can look into it further. ",
    "We're really sorry for the inconvenience you've experienced. Let's get this sorted! Could you please send us a DM with your email address so we can look into this issue for you? ",
    "I'm really sorry to hear about the frustration you're experiencing with your connection. Let's get this sorted out for you. Please message us your details so we can take a closer look. ",
    "We're sorry to hear you're having trouble playing your item. Can we check if the song appears to be greyed out for you? Let us know if that's the case! ",
    "Hi there! I'm sorry to hear about the issue with your 'I' turning into an 'A' and a question mark after updating your phone. Let's see if we can help. Could you please tell me your iPhone model and the version of iOS you're currently using? We'll take it from there. ",
    "Oh no, we're sorry to hear that your Phone app isn't working as expected after the update. Let's troubleshoot together. Can you tell us more about the issue you're experiencing?",
    "Hi there, I'm really sorry to hear about the inconvenience you're experiencing with your WiFi. I'd be more than happy to help you get a stable connection so you can complete your homework. Let's troubleshoot together. Please send me a DM with your account details. Best, [Your Name]",
    "We're truly sorry for the inconvenience, it seems we might have some confusion here. Let's take a closer look at your case. Please reach out to us via phone or chat so we can help sort this out together.",
    "Oh dear, I'm really sorry for the inconvenience! Let's get this sorted out. Please reach out to us via phone or chat, and we'll be happy to help! ",
    "So sorry for the inconvenience! We'd love to help make this right. Could you please message us your account email so we can look into this issue? ",
    "Hi there, I'm really sorry to hear about your delivery issue with Argos. Could you please DM us your order number so we can help you look into this? Best, Alex",
    "I'm really sorry to hear about the inconvenience, @username. Let's get this sorted out. Could you please send us a message at [support email] so our team can look into this for you?",
    "Hi there! I understand your confusion. To qualify for a 20% discount on video game preorders, purchases must be made 2 weeks before the release date. If you have any other questions or need further assistance, feel free to ask!",
    "I'm really sorry for the inconvenience! Have you reached out to the seller yet to check on the status of your order? They might be able to help speed things up. ",
    "Hi there! We're really sorry for any inconvenience. Could you please provide us with more details about your order so we can look into this for you? Thanks for your patience.",
    "I'm really sorry for the inconvenience, Sarah! Could you please check the current status of your order [here](link_to_order_status)? Let us know if there's any issue.",
    "I'm sorry for any inconvenience, Elain! To help resolve this, could you please confirm the delivery date provided at checkout when you preordered the game?",
    "Hey there! I'm sorry to hear about the trouble with your T-Mobile internet. Let's get to the bottom of this. Could you please message me your address and phone number so I can take a closer look? We're here to keep you connected. Best, JoanO",
    "Hey there! I'd be happy to help you with your internet issue. Could you please let me know if this is for your home or cell services? That way, I can better assist you. ",
    "I'm really sorry for the inconvenience you're experiencing. Let's work together to resolve this. Could you please message me your account details, phone number, and full address so I can investigate further?",
    "I'm sorry to hear about the inconvenience, John. Hopefully, we can help get your internet back on track so you can enjoy Stranger Things. Let's see if there are any known outages in your area. Reach out if you need assistance.",
    "Hello! Great to have you on board with Xfinity. To help you get started, could you please provide me with your full address so I can confirm if your modem will be shipped to you? ~",
    "We're here to help! Let's work together to recover your Hotmail accounts. Our Answer Desk team can provide assistance. Keep us updated on your progress. ",
    "Hi there! We're really sorry to hear about your account issue. Could you please message us your phone number and an old reservation code? We're here to help. ",
    "I'm sorry for any inconvenience, Jasper. To close your Amazon Payments account, please reach out to us directly from the account you'd like to close. We'll be happy to assist you.",
    "Hi there! It seems like we're having a similar issue. Could you please let us know which service or platform this account belongs to, so we can help you out? Thanks!",
    "Hi there! Let's get this sorted out. Please check your DMs for further instructions on resetting your password without needing your date of birth.",
    "Hi there, we're really sorry to hear about your experience. Let's work together to resolve any issues you might be facing. Could you please share some more details so we can help?",
    "We're sorry to hear you're having trouble navigating the new interface. Any specific aspects that you find difficult? Here's a guide that might help improve your experience.",
    "We're really sorry for the inconvenience, Dana. Let's get this sorted out. Could you please provide us with more details about your issue so we can ensure a prompt resolution?",
    "I'm truly sorry to hear about your experience at SFO, [UserName]. We strive for courteous and helpful service. Could you please DM me your flight details so we can look into this further?",
    "Hi there! We're sorry to hear about your experience. We value our long-term customers and strive to provide the best service. To help escalate your concern, please provide your name and zip code so we can look into this further. We appreciate your patience."
]


generated_answers_by_rag_b = [
    "Hi! Sorry to hear about the cancellation. We're unable to offer a complimentary upgrade for a cancelled flight. However, we can quote an upgrade if you'd like. Would you like us to do that? (Helen)",
    "Thanks for sharing your positive experience! We're thrilled you enjoyed your flight and appreciated our amazing flight attendants. If you'd like to recognize any specific attendant, feel free to DM us your flight confirmation number!",
    "Oh no, sorry to hear that! Please let the Flight Attendants know about the situation in seat 3A, and they'll do their best to assist him.",
    "Hey there! Sorry to hear you were surprised by the fee for checking your guitar amp. According to our policy, instruments must fit underneath the seat or in an overhead bin. It seems this wasn't possible on your MIA leg, hence the charge.",
    "Thanks for sharing your wonderful experience! We're thrilled to hear our crew exceeded your expectations. We'll pass along your kind words to them – they deserve it!",
    "Hey! Unfortunately, COD Zombies is no longer supported on iOS 11. Sorry for the inconvenience.",
    "Hey there! Sorry our app has been freezing. Can you send us your account's username or email so we can investigate further? Thanks!",
    "Hi! Sorry you're having trouble linking your Interactive Brokers Canada account with Yahoo Finance on iOS 11. Received any error messages? Please DM your email and details so we can investigate.",
    "Sorry to hear iOS 11 is affecting your battery life. Can you DM your iPhone model and country so we can investigate further?",
    "Sorry your Phone app isn't working after the update. Is it crashing, not showing up, or something else? Let us know, we’ll help troubleshoot.",
    "Can you provide more details about the link and incorrect information so we can investigate further?",
    "Sorry Verizon is experiencing an outage. We’re working on it. Can you share your location for more accurate info?",
    "Sorry you’re experiencing delays in your case. Please check your DMs. We’ve sent messages regarding your case.",
    "I'm here to help with your false charge concern. You can message us here or visit our website to chat. Please provide your email and charge details.",
    "Trouble getting through? Please send us your email so we can look into this and get back ASAP.",
    "Please DM your address and phone number so we can investigate the connection issue.",
    "Sorry you're having issues playing something. Is the item greyed out or is there an error message? Let me know!",
    "Sorry that the letter \"I\" is being replaced with \"A\" and a question mark after your update. What device and iOS version are you using?",
    "Sorry your Phone app isn't working after the update. Which iPhone model are you using? What happens when you open it?",
    "WiFi issues? Please DM your account name and phone number so we can check from our end.",
    "Sorry to hear about the issue with your order. Please DM your order number, delivery address, and postcode.",
    "Oh no! Please DM your address and phone number so we can help.",
    "Can you please message us your account email so we can look into this?",
    "Can you DM your order number so we can investigate further?",
    "Sorry your order was canceled. DM your account email so we can help.",
    "We still offer discounts on eligible game preorders. Visit our site or reach out for Prime benefits info.",
    "Sorry your order hasn't arrived. Please provide your order number and carrier so we can investigate.",
    "Delivery driver stuck outside? Please DM your order number and more details.",
    "Book hasn’t arrived? Click the link to check order status. Let us know if there's trouble locating it.",
    "Sorry your preorder arrived late. Please DM your confirmation email so we can check delivery expectations.",
    "Sorry your T-Mobile internet stopped for 3 hours. Please message us your address and phone number.",
    "Internet not working? Please DM your address and phone number so we can check.",
    "Internet keeps disconnecting? Is it on a wired or wireless connection? Let us know!",
    "Internet outage in your area? Try restarting the router. Need help checking for maintenance updates?",
    "Congrats on signing up with Xfinity! Your modem will be shipped. DM us if you have questions.",
    "Trouble recovering Hotmail accounts? Try our Answer desk team. Let me know if you need more help.",
    "Account compromised? Please DM your account’s current email so we can investigate.",
    "Want to close your Amazon Payments account? Best to contact Amazon support directly. Want the link?",
    "Trouble updating your account? Already confirmed your card? Reach us via phone or chat with your email and ZIP.",
    "Need help with your password? Since you know it, we’ll DM you further steps.",
    "Unhappy with service? Tell me more so we can help fix it.",
    "Trouble with the new interface? Let us know what's confusing so we can guide you.",
    "Facing an unresolved issue? DM your order/case number and issue description so we can investigate.",
    "Poor experience at SFO? DM your record locator and flight details so we can look into it.",
    "Bad experience? We value your loyalty. Please DM your concerns so we can resolve it."
]

# Both RAG systems answer the same questions with the same retrieved context,
# so build the prompts once and reuse them.
prompts = [
    "Answer the question: " + question + " Context: " + item
    for question, item in zip(questions, retrieved_contexts)
]

eval_dataset_rag_a = pd.DataFrame(
    {"prompt": prompts, "response": generated_answers_by_rag_a}
)

eval_dataset_rag_b = pd.DataFrame(
    {"prompt": prompts, "response": generated_answers_by_rag_b}
)

eval_dataset_rag_a

| | prompt | response |
|---|---|---|
| 0 | Answer the question: Hi! My flight from MAN-LH... | Hi there! I'm sorry to hear about your flight ... |
| 1 | Answer the question: Thank you for airing This... | "Thank you for flying with us! We're glad you ... |
| 2 | Answer the question: I'm on flight DL4047 to C... | We're truly sorry to hear about the situation ... |
| 3 | Answer the question: I flew the same route fro... | I'm sorry for the inconvenience, it seems ther... |
| 4 | Answer the question: The flight attendant and ... | "Delighted to hear about your fantastic experi... |
| 5 | Answer the question: Are you planning to updat... | Hi there! I'd be happy to help, but I'm afraid... |
| 6 | Answer the question: What's going on with your... | Hi there! I'm really sorry to hear that you're... |
| 7 | Answer the question: I can't connect my Intera... | Hi there! I'm sorry to hear you're having trou... |
| 8 | Answer the question: iOS 11 is draining my bat... | We're really sorry to hear about your battery ... |
| 9 | Answer the question: My Phone app doesn’t work... | Oh no, we're sorry to hear that your Phone app... |
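
Because `zip` stops at the shortest input, a missing entry in any of the lists would silently drop rows or shift questions against the wrong context. A minimal sketch of a length check that could run before building the DataFrames follows. `common_length` is a hypothetical helper, and the toy data is for illustration only.

```python
def common_length(**columns):
    """Return the shared length of all columns, or raise if they differ (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # zip() would silently stop at the shortest list, so fail loudly instead.
        raise ValueError(f"Misaligned columns: {lengths}")
    return next(iter(lengths.values()), 0)


# Toy data for illustration only; not taken from the lists above.
n_rows = common_length(
    questions=["Where is my order?", "Why is my internet down?"],
    retrieved_contexts=["Please DM your order number.", "Can you share your location?"],
    responses=["Sorry! Please DM your order number.", "Sorry! Where are you located?"],
)
print(n_rows)  # 2
```

In this notebook the same call could take `questions`, `retrieved_contexts`, `generated_answers_by_rag_a`, and `generated_answers_by_rag_b` as keyword arguments.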


In [None]:
# See all the available metric examples
MetricPromptTemplateExamples.list_example_metric_names()

['coherence',
 'fluency',
 'safety',
 'groundedness',
 'instruction_following',
 'verbosity',
 'text_quality',
 'summarization_quality',
 'question_answering_quality',
 'multi_turn_chat_quality',
 'multi_turn_safety',
 'pairwise_coherence',
 'pairwise_fluency',
 'pairwise_safety',
 'pairwise_groundedness',
 'pairwise_instruction_following',
 'pairwise_verbosity',
 'pairwise_text_quality',
 'pairwise_summarization_quality',
 'pairwise_question_answering_quality',
 'pairwise_multi_turn_chat_quality',
 'pairwise_multi_turn_safety']
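
The list mixes pointwise metrics, which score a single response, with pairwise metrics, which compare two responses. A small sketch for separating them follows. It assumes the `pairwise_` prefix is the only distinguishing convention, which is inferred from the output above and is not a documented guarantee.

```python
# Names copied from the output above.
metric_names = [
    "coherence", "fluency", "safety", "groundedness", "instruction_following",
    "verbosity", "text_quality", "summarization_quality",
    "question_answering_quality", "multi_turn_chat_quality", "multi_turn_safety",
    "pairwise_coherence", "pairwise_fluency", "pairwise_safety",
    "pairwise_groundedness", "pairwise_instruction_following",
    "pairwise_verbosity", "pairwise_text_quality",
    "pairwise_summarization_quality", "pairwise_question_answering_quality",
    "pairwise_multi_turn_chat_quality", "pairwise_multi_turn_safety",
]

PREFIX = "pairwise_"
pairwise = [name for name in metric_names if name.startswith(PREFIX)]
pointwise = [name for name in metric_names if not name.startswith(PREFIX)]

# Check that every pairwise metric mirrors a pointwise one.
unmatched = [name for name in pairwise if name[len(PREFIX):] not in pointwise]

print(len(pointwise), len(pairwise), unmatched)  # 11 11 []
```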

In [None]:
# See the prompt example for one of the pointwise metrics
print(MetricPromptTemplateExamples.get_prompt_template("question_answering_quality"))


# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in user input. The instruction for performing a question-answering task is provided in the user prompt.

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Grounded

#### Create custom metrics

In [8]:
relevance_prompt_template = """
You are a professional writing evaluator. Your job is to score writing responses according to pre-defined evaluation criteria.

You will be assessing relevance, which measures the ability to respond with relevant information when given a prompt.

You will assign the writing response a score from 5, 4, 3, 2, 1, following the rating rubric and evaluation steps.

## Criteria
Relevance: The response should be relevant to the instruction and directly address the instruction.

## Rating Rubric
5 (completely relevant): Response is entirely relevant to the instruction and provides clearly defined information that addresses the instruction's core needs directly.
4 (mostly relevant): Response is mostly relevant to the instruction and addresses the instruction mostly directly.
3 (somewhat relevant): Response is somewhat relevant to the instruction and may address the instruction indirectly, but could be more relevant and more direct.
2 (somewhat irrelevant): Response is minimally relevant to the instruction and does not address the instruction directly.
1 (irrelevant): Response is completely irrelevant to the instruction.

## Evaluation Steps
STEP 1: Assess relevance: is the response relevant to the instruction, and does it directly address the instruction?
STEP 2: Score based on the criteria and rubrics.

Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.


# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""
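
The template ends with `{prompt}` and `{response}` placeholders. The sketch below assumes these are filled from the dataset columns of the same name; the actual substitution happens inside the SDK and may differ. It uses a shortened stand-in template, and `template_fields` is a hypothetical helper for checking that the dataset provides every field the template expects.

```python
import string

# Shortened stand-in for the template above; only the placeholders matter here.
demo_template = """
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""


def template_fields(template: str) -> set:
    """Return the names of the {placeholders} in a format-style template."""
    return {field for _, field, _, _ in string.Formatter().parse(template) if field}


dataset_columns = {"prompt", "response"}  # columns of eval_dataset_rag_a
missing = template_fields(demo_template) - dataset_columns
print(sorted(template_fields(demo_template)), missing)

# Preview the filled-in text for one row (illustrative values only).
print(demo_template.format(prompt="Where is my order?", response="Please DM your order number."))
```

An empty `missing` set means every placeholder has a matching column. A non-empty set points to a column that must be added or a placeholder that is misspelled.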

In [9]:
helpfulness_prompt_template = """
You are a professional writing evaluator. Your job is to score writing responses according to pre-defined evaluation criteria.

You will be assessing helpfulness, which measures the ability to provide important details when answering a prompt.

You will assign the writing response a score from 5, 4, 3, 2, 1, following the rating rubric and evaluation steps.

## Criteria
Helpfulness: The response is comprehensive with well-defined key details. The user would feel very satisfied with the content in a good response.

## Rating Rubric
5 (completely helpful): Response is useful and very comprehensive with well-defined key details to address the needs in the instruction and usually beyond what is explicitly asked. The user would feel very satisfied with the content in the response.
4 (mostly helpful): Response is very relevant to the instruction, providing clearly defined information that addresses the instruction's core needs. It may include additional insights that go slightly beyond the immediate instruction. The user would feel quite satisfied with the content in the response.
3 (somewhat helpful): Response is relevant to the instruction and provides some useful content, but could be more relevant, well-defined, comprehensive, and/or detailed. The user would feel somewhat satisfied with the content in the response.
2 (somewhat unhelpful): Response is minimally relevant to the instruction and may provide some vaguely useful information, but it lacks clarity and detail. It might contain minor inaccuracies. The user would feel only slightly satisfied with the content in the response.
1 (unhelpful): Response is useless/irrelevant, contains inaccurate/deceptive/misleading information, and/or contains harmful/offensive content. The user would feel not at all satisfied with the content in the response.

## Evaluation Steps
STEP 1: Assess comprehensiveness: does the response provide specific, comprehensive, and clearly defined information for the user needs expressed in the instruction?
STEP 2: Assess relevance: When appropriate for the instruction, does the response exceed the instruction by providing relevant details and related information to contextualize content and help the user better understand the response?
STEP 3: Assess accuracy: Is the response free of inaccurate, deceptive, or misleading information?
STEP 4: Assess safety: Is the response free of harmful or offensive content?

Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.


# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [10]:
relevance = PointwiseMetric(
    metric="relevance",
    metric_prompt_template=relevance_prompt_template,
)

helpfulness = PointwiseMetric(
    metric="helpfulness",
    metric_prompt_template=helpfulness_prompt_template,
)
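
Both templates end with `{prompt}` and `{response}` placeholders, which correspond to the `prompt` and `response` columns of the evaluation dataset. The following is a minimal sketch of that substitution using plain `str.format`; the shortened template and the example row are made up for illustration, and the SDK's own rendering may differ in detail.

```python
# Hypothetical, shortened template: only the placeholder section is kept.
template = """# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

# One made-up dataset row; the keys mirror the placeholder names.
row = {
    "prompt": "Answer the question: Is my flight delayed?",
    "response": "Sorry for the wait! Could you share your flight number?",
}

# Fill each placeholder with the matching column value.
filled = template.format(**row)
print(filled)
```

If a placeholder has no matching key, `str.format` raises a `KeyError`, which is a quick way to catch a misspelled column name before running an evaluation.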

### Run evaluation with your dataset

In [11]:
rag_eval_task_rag_a = EvalTask(
    dataset=eval_dataset_rag_a,
    metrics=[
        "question_answering_quality",
        relevance,
        helpfulness,
        "groundedness",
        "safety",
        "instruction_following",
        "coherence",
        "fluency",
        "verbosity",
        "text_quality",
    ],
    experiment=EXPERIMENT,
)

rag_eval_task_rag_b = EvalTask(
    dataset=eval_dataset_rag_b,
    metrics=[
        "question_answering_quality",
        relevance,
        helpfulness,
        "groundedness",
        "safety",
        "instruction_following",
        "fluency",
        "verbosity",
        "text_quality",
    ],
    experiment=EXPERIMENT,
)

In [12]:
result_rag_a = rag_eval_task_rag_a.evaluate()
result_rag_b = rag_eval_task_rag_b.evaluate()

INFO:vertexai.evaluation._evaluation:Computing metrics with a total of 450 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 450/450 [05:38<00:00,  1.33it/s]
INFO:vertexai.evaluation._evaluation:Evaluation Took:338.73867109799994 seconds


INFO:vertexai.evaluation._evaluation:Computing metrics with a total of 405 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 405/405 [04:58<00:00,  1.35it/s]
INFO:vertexai.evaluation._evaluation:Evaluation Took:299.12185689800003 seconds
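
The two request totals in the logs are consistent with one request per dataset row per metric. The sketch below checks that arithmetic, assuming the 45-row dataset reported as `row_count` in the summary metrics; the custom metrics are represented by their names only. It also shows that the second task differs from the first only by omitting `coherence`.

```python
rows = 45  # row_count reported in the summary metrics

# Metric names as passed to the first EvalTask (custom metrics by name).
metrics_a = [
    "question_answering_quality",
    "relevance",
    "helpfulness",
    "groundedness",
    "safety",
    "instruction_following",
    "coherence",
    "fluency",
    "verbosity",
    "text_quality",
]
# The second EvalTask uses the same list without "coherence".
metrics_b = [m for m in metrics_a if m != "coherence"]

# Assumption: one evaluation request per row per metric.
requests_a = rows * len(metrics_a)
requests_b = rows * len(metrics_b)
print(requests_a, requests_b)  # 450 405

# Metrics evaluated for the first task only.
print(set(metrics_a) - set(metrics_b))  # {'coherence'}
```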


### Display evaluation results

#### View summary results

To view all metrics from a single model's evaluation result in one table, use the `display_eval_result()` helper function.

In [13]:
notebook_utils.display_eval_result(
    title="Llama 2 7B Eval Result", eval_result=result_rag_a
)

## Llama 2 7B Eval Result

### Summary Metrics

Unnamed: 0,row_count,question_answering_quality/mean,question_answering_quality/std,relevance/mean,relevance/std,helpfulness/mean,helpfulness/std,groundedness/mean,groundedness/std,safety/mean,...,instruction_following/mean,instruction_following/std,coherence/mean,coherence/std,fluency/mean,fluency/std,verbosity/mean,verbosity/std,text_quality/mean,text_quality/std
0,45.0,2.909091,1.400081,3.891892,1.219831,2.75,0.731925,0.783784,0.417342,0.947368,...,2.971429,1.464998,3.857143,0.974464,4.694444,0.524783,-0.424242,0.708445,3.694444,0.950772


### Row-based Metrics

Unnamed: 0,prompt,response,question_answering_quality/explanation,question_answering_quality/score,relevance/explanation,relevance/score,helpfulness/explanation,helpfulness/score,groundedness/explanation,groundedness/score,...,instruction_following/explanation,instruction_following/score,coherence/explanation,coherence/score,fluency/explanation,fluency/score,verbosity/explanation,verbosity/score,text_quality/explanation,text_quality/score
0,Answer the question: Hi! My flight from MAN-LH...,Hi there! I'm sorry to hear about your flight ...,The response directly answers the question and...,5.0,The response is completely relevant as it dire...,5.0,"The response is relevant and somewhat helpful,...",3.0,The response acknowledges the flight cancellat...,1.0,...,The response is good because it understands th...,4.0,The response begins with an empathetic stateme...,4.0,"The response is grammatically sound, has good ...",5.0,"The response is appropriately concise, providi...",0.0,The response effectively addresses the user's ...,5.0
1,Answer the question: Thank you for airing This...,"""Thank you for flying with us! We're glad you ...","The response follows instructions, is grounded...",4.0,The response is entirely relevant to the instr...,5.0,The response acknowledges the user's feedback ...,3.0,The response only uses information included in...,1.0,...,The response acknowledged the user's gratitude...,4.0,The response is well-organized and presents a ...,4.0,The response is completely fluent as it contai...,5.0,The response is slightly brief because it does...,-1.0,"The response is well-written, coherent, and fl...",4.0
2,Answer the question: I'm on flight DL4047 to C...,We're truly sorry to hear about the situation ...,"The answer is complete, grounded, and fluent; ...",5.0,The response is entirely relevant to the instr...,5.0,The response acknowledges the situation and di...,3.0,The response only contains information present...,1.0,...,The response directly addresses the prompt by ...,5.0,The response directly acknowledges the situati...,5.0,"The response is grammatically correct, uses ap...",5.0,"The response is perfectly concise, providing a...",0.0,"The response is clear, coherent, fluent, conci...",5.0
3,Answer the question: I flew the same route fro...,"I'm sorry for the inconvenience, it seems ther...",The answer does not follow the instructions ve...,2.0,The response is mostly relevant as it addresse...,4.0,The response is relevant by explaining that gu...,3.0,The answer references the user mentioning guit...,1.0,...,The response provides a plausible explanation ...,4.0,The response is somewhat coherent because it a...,3.0,"The response is clear and grammatically sound,...",4.0,Error,,"The response is well-written, coherent, and mo...",4.0
4,Answer the question: The flight attendant and ...,"""Delighted to hear about your fantastic experi...",The response is well-written and answers the q...,5.0,The response is entirely relevant to the instr...,5.0,The response acknowledges the positive feedbac...,3.0,All aspects of the response are attributable t...,1.0,...,"The response thoroughly answers the question, ...",5.0,"The response has a strong logical flow, clear ...",4.0,The response demonstrates perfect grammar and ...,5.0,"The response is perfectly concise, providing a...",0.0,"The response is well-written, coherent, and mo...",4.0
5,Answer the question: Are you planning to updat...,"Hi there! I'd be happy to help, but I'm afraid...",The response doesn't take into account the pro...,1.0,The response acknowledges the user's question ...,3.0,The response acknowledges the user's inquiry a...,3.0,The response references elements that are not ...,0.0,...,The response does not answer the question from...,1.0,The response is somewhat incoherent because it...,2.0,"The response is mostly fluent, with clear word...",4.0,Error,,The response does not provide an answer to the...,2.0
6,Answer the question: What's going on with your...,Hi there! I'm really sorry to hear that you're...,The response acknowledges the user's problem a...,3.0,Error,,The response acknowledges the user's issue and...,3.0,The response only discusses information contai...,1.0,...,The response acknowledges the user's issue and...,3.0,"The response acknowledges the user's problem, ...",4.0,The response is completely fluent because it i...,5.0,"The response is appropriately concise, providi...",0.0,"The response is well-written, coherent, and fl...",4.0
7,Answer the question: I can't connect my Intera...,Hi there! I'm sorry to hear you're having trou...,"The response mostly follows instructions, answ...",3.0,The response acknowledges the user's problem a...,3.0,Error,,The response is fully grounded because it only...,1.0,...,Error,,The response is completely coherent as it cont...,5.0,Error,,"The response is appropriately concise, providi...",0.0,Error,
8,Answer the question: iOS 11 is draining my bat...,We're really sorry to hear about your battery ...,The response does not answer the user question...,2.0,The response is somewhat relevant as it acknow...,3.0,The response acknowledges the user's problem b...,2.0,The response only contains information present...,1.0,...,Error,,Error,,The response is completely fluent as it is fre...,5.0,Error,,"The response is adequate in writing, coherence...",3.0
9,Answer the question: My Phone app doesn’t work...,"Oh no, we're sorry to hear that your Phone app...",The response does not answer the question dire...,2.0,The response acknowledges the user's problem b...,3.0,Error,,The response is entirely based on the informat...,1.0,...,Error,,"The response is mostly coherent, showing a cle...",4.0,Error,,"The response is appropriately concise, providi...",0.0,"The response is adequate, displaying decent co...",3.0


In [None]:
import pandas as pd

In [None]:
result_rag_a.metrics_table.to_csv("Llama 2 7B result.csv")
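
Several rows in the table above show `Error` with no score, so any statistic computed from the exported file has to skip those rows. The following is a rough sketch using only the standard library. It assumes failed scores are written as empty fields, and the inline excerpt holds the `relevance` and `helpfulness` scores of the ten rows displayed above for `result_rag_a` rather than the full 45-row file.

```python
import csv
import io
from statistics import mean

# Scores copied from the ten displayed rows; an empty field marks an "Error" row.
excerpt = """relevance/score,helpfulness/score
5.0,3.0
5.0,3.0
5.0,3.0
4.0,3.0
5.0,3.0
3.0,3.0
,3.0
3.0,
3.0,2.0
3.0,
"""

# Collect each column as a list of floats, with None for missing scores.
columns = {}
for record in csv.DictReader(io.StringIO(excerpt)):
    for name, value in record.items():
        columns.setdefault(name, []).append(float(value) if value else None)

# Count missing scores and average the remaining ones per metric.
summary = {}
for name, values in columns.items():
    scored = [v for v in values if v is not None]
    summary[name] = {"missing": len(values) - len(scored), "mean": mean(scored)}
    print(name, summary[name])
```

The reported summary means look consistent with this treatment: for example, a `relevance/mean` of 3.891892 matches 144/37, which would correspond to 37 scored rows out of 45.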

In [None]:
notebook_utils.display_eval_result(
    title="Mistral 7B Eval Result",
    eval_result=result_rag_b,
)

## Mistral 7B Eval Result

### Summary Metrics

Unnamed: 0,row_count,question_answering_quality/mean,question_answering_quality/std,relevance/mean,relevance/std,helpfulness/mean,helpfulness/std,groundedness/mean,groundedness/std,safety/mean,safety/std,instruction_following/mean,instruction_following/std,fluency/mean,fluency/std,verbosity/mean,verbosity/std,text_quality/mean,text_quality/std
0,45.0,3.694444,1.214659,4.28125,0.92403,3.028571,0.663578,0.916667,0.280306,1.0,0.0,3.648649,1.274018,4.486486,0.65071,-0.40625,0.614837,4.133333,0.62881


### Row-based Metrics

Unnamed: 0,prompt,response,question_answering_quality/explanation,question_answering_quality/score,relevance/explanation,relevance/score,helpfulness/explanation,helpfulness/score,groundedness/explanation,groundedness/score,safety/explanation,safety/score,instruction_following/explanation,instruction_following/score,fluency/explanation,fluency/score,verbosity/explanation,verbosity/score,text_quality/explanation,text_quality/score
0,Answer the question: Hi! My flight from MAN-LH...,Hi! Sorry to hear about the cancellation. We'r...,The response directly addresses the customer's...,5.0,The response is entirely relevant to the instr...,5.0,The response is completely helpful because it ...,5.0,The response only uses information provided in...,1.0,The response is safe as it does not contain an...,1.0,The response accurately and concisely answers ...,5.0,"The response is free of grammatical errors, de...",5.0,The response is concise and provides a relevan...,0.0,The response accurately reflects the informati...,5.0
1,Answer the question: Thank you for airing This...,Thanks for sharing your positive experience! W...,"The response follows instructions, is grounded...",5.0,The response is completely relevant as it ackn...,5.0,"The response is helpful and relevant, addressi...",4.0,The response is completely grounded in the use...,1.0,The response is safe as it does not contain an...,1.0,The response is a natural reply to the prompt ...,5.0,"The response is completely fluent, with no gra...",5.0,The response is appropriately concise and prov...,0.0,"The response is very well-written, coherent, f...",5.0
2,Answer the question: I'm on flight DL4047 to C...,"Oh no, sorry to hear that! Please let the Flig...","The answer is grounded, complete, and fluent, ...",5.0,"The response is entirely relevant, directly ad...",5.0,The response acknowledges the situation and in...,3.0,The response is fully grounded because it ackn...,1.0,The response is safe as it does not contain an...,1.0,The response directly addresses the user's que...,5.0,"The response is fluent, with no grammatical er...",5.0,The response is perfectly concise and provides...,0.0,"The response is very well-written, coherent, a...",5.0
3,Answer the question: I flew the same route fro...,Hey there! Sorry to hear you were surprised by...,The response is only partially grounded as it ...,2.0,The response directly addresses the user's que...,5.0,The response is somewhat helpful because it pr...,3.0,The AI references a policy about instruments f...,0.0,"The response does not contain any hate speech,...",1.0,"The response attempts to answer the question, ...",3.0,"The response is mostly fluent, with clear word...",4.0,"The response is appropriately concise, providi...",0.0,The response is well-written and mostly adhere...,4.0
4,Answer the question: The flight attendant and ...,Thanks for sharing your wonderful experience! ...,"The response follows instructions, is grounded...",5.0,The response is entirely relevant to the user'...,5.0,"The response is helpful and relevant, acknowle...",4.0,The response is completely grounded in the pro...,1.0,The response is safe as it does not contain an...,1.0,The response appropriately answers the questio...,5.0,The response is completely fluent as it is fre...,5.0,"The response is appropriately concise, providi...",0.0,"The response is very well-written, coherent, f...",5.0
5,Answer the question: Are you planning to updat...,"Hey! Unfortunately, COD Zombies is no longer s...","The response follows instructions, is grounded...",5.0,The response directly addresses the question a...,5.0,The response directly addresses the prompt by ...,4.0,The response is entirely grounded in the provi...,1.0,The response does not contain any harmful cont...,1.0,The response directly answers the question bas...,5.0,The response is completely fluent; it has no g...,5.0,The response is perfectly concise and provides...,0.0,The model completely followed the instruction ...,5.0
6,Answer the question: What's going on with your...,Hey there! Sorry our app has been freezing. Ca...,The response partially addresses the issue by ...,3.0,The response is completely relevant as it ackn...,5.0,The response is relevant and attempts to addre...,3.0,The response is fully grounded; it acknowledge...,1.0,The response is safe as it does not contain an...,1.0,The response appropriately answers the questio...,5.0,The response is completely fluent; it is free ...,5.0,"The response is concise and to the point, dire...",0.0,"The response is well-written, coherent, and fl...",4.0
7,Answer the question: I can't connect my Intera...,Hi! Sorry you're having trouble linking your I...,The response is helpful and relevant to the pr...,4.0,The response acknowledges the user's problem a...,3.0,The response acknowledges the issue but mainly...,2.0,The response only contains information present...,1.0,The response is safe as it does not contain an...,1.0,The response provides a starting point but imm...,3.0,The response is mostly fluent with clear word ...,4.0,The response is a bit brief as it asks for inf...,-1.0,"The response is well-written, coherent, and fl...",4.0
8,Answer the question: iOS 11 is draining my bat...,Sorry to hear iOS 11 is affecting your battery...,Error,,"The response is mostly relevant, as it acknowl...",4.0,The response is minimally relevant as it ackno...,2.0,The response is fully grounded as it only uses...,1.0,The response is safe as it does not contain an...,1.0,The response acknowledges the issue and attemp...,3.0,The response is completely fluent as it has no...,5.0,"The response is appropriately concise, acknowl...",0.0,The response is okay because it acknowledges t...,3.0
9,Answer the question: My Phone app doesn’t work...,Sorry your Phone app isn't working after the u...,The response does not answer the question in t...,2.0,The response is mostly relevant because it ack...,4.0,The response acknowledges the user's problem a...,3.0,The response is fully grounded in the user pro...,1.0,Error,,The response acknowledges the problem but does...,3.0,The response is mostly fluent with clear word ...,4.0,"The response is appropriately concise, providi...",0.0,"The response is well-written and coherent, mos...",4.0


#### Visualize evaluation results

In [14]:
eval_results = [
    ("Llama 2 7B", result_rag_a),
    ("Mistral 7B", result_rag_b),
]

In [17]:
notebook_utils.display_radar_plot(
    eval_results,
    metrics=[
        "question_answering_quality",
        "groundedness",
        "instruction_following",
        "relevance",
        "helpfulness",
        "fluency",
        "text_quality",
    ],
)

In [18]:
notebook_utils.display_bar_plot(
    eval_results,
    metrics=[
        "question_answering_quality",
        "groundedness",
        "instruction_following",
        "relevance",
        "helpfulness",
        "fluency",
        "text_quality",
    ],
)
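
The plots compare per-metric means across the two results. As a complement, the sketch below tabulates the same comparison from the values in the two Summary Metrics tables above, taking the first table as `result_rag_a` (labeled "Llama 2 7B" in `eval_results`) and the second as `result_rag_b` ("Mistral 7B"). The numbers are copied by hand, so this only illustrates how to read the differences.

```python
# Means copied from the Summary Metrics table for result_rag_a.
llama = {
    "question_answering_quality": 2.909091,
    "groundedness": 0.783784,
    "instruction_following": 2.971429,
    "relevance": 3.891892,
    "helpfulness": 2.75,
    "fluency": 4.694444,
    "text_quality": 3.694444,
}
# Means copied from the Summary Metrics table for result_rag_b.
mistral = {
    "question_answering_quality": 3.694444,
    "groundedness": 0.916667,
    "instruction_following": 3.648649,
    "relevance": 4.28125,
    "helpfulness": 3.028571,
    "fluency": 4.486486,
    "text_quality": 4.133333,
}

# Positive delta: Mistral 7B scores higher; negative: Llama 2 7B scores higher.
deltas = {metric: mistral[metric] - llama[metric] for metric in llama}
for metric, delta in deltas.items():
    leader = "Mistral 7B" if delta > 0 else "Llama 2 7B"
    print(f"{metric:28s} {delta:+.3f}  ({leader})")
```

Groundedness is scored on a 0 to 1 scale while most other metrics use 1 to 5, so raw deltas are not directly comparable across metrics.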