## Notebook 4: RAG Human-Like Evaluation - LLM-as-a-Judge

In this notebook, we use a high-quality LLM to generate human-like evaluation scores for the final outputs of a RAG system.

We will use the Llama 2 70B model to evaluate the example RAG pipeline.
The score granularity is from 1 to 5, where:

- **Score 1**: Answer irrelevant or invalid; does not follow the context of the question
- **Score 2**: Answer barely usable, missing significant accurate information
- **Score 3**: Answer mostly helpful, missing some information or added erroneous information
- **Score 4**: Answer helpful, room for some improvement, could be more concise
- **Score 5**: Answer helpful, accurate, relevant and concise
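
The rubric above can also be kept in code, which makes it easy to check that a parsed rating falls on the scale. The following is a minimal sketch; `RUBRIC` and `describe_rating` are hypothetical names that are not used elsewhere in this notebook.

```python
# Hypothetical lookup table mirroring the 1-5 rubric listed above.
RUBRIC = {
    1: "Answer irrelevant or invalid",
    2: "Answer barely usable, missing significant accurate information",
    3: "Answer mostly helpful, missing some information or added erroneous information",
    4: "Answer helpful, room for some improvement, could be more concise",
    5: "Answer helpful, accurate, relevant and concise",
}


def describe_rating(rating):
    """Return the rubric text for a rating, or None if it is not on the 1-5 scale."""
    return RUBRIC.get(rating)
```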


### Step 1: Load the Data

Let's first load the JSON dataset. The file should contain a list of entries, each with the following structure:

```
{
'gt_context': chunk,
'document': filename,
'question': "xxxxx",
'gt_answer': "xxx xxx xxxx",
'contexts': "xxx xxx xxxx",
'answer':"xxx xxx xxxx",
}
```

In [None]:
import json

# The path to your JSON file
file_path = 'eval.json'

# Read the JSON file
with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

In [None]:
# Check some of the loaded data
print(data[:1])
print("Number of entries", len(data))
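
Before spending API calls on the judge, it can help to confirm that every entry carries the fields described above. The following is a minimal sketch; `REQUIRED_KEYS` and `find_incomplete_entries` are hypothetical helpers, and the key names are taken from the structure shown in this step.

```python
# Field names from the dataset structure described in Step 1.
REQUIRED_KEYS = {"gt_context", "document", "question", "gt_answer", "contexts", "answer"}


def find_incomplete_entries(entries):
    """Return (index, missing_keys) pairs for entries that lack required fields."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append((i, sorted(missing)))
    return problems
```

Calling `find_incomplete_entries(data)` on the loaded dataset should return an empty list when every entry is complete.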

Set your API key for the NVIDIA AI Playground.

In [None]:
import requests

invoke_url = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/0e349b44-440a-44e1-93e9-abe8dcb27158" #Llama 2 70B
fetch_url_format = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/"

headers = {
    "Authorization": "Bearer REPLACE_THIS_WITH_API_KEY",
    "Accept": "application/json",
}


### Step 2: Design the LLM-as-a-Judge Prompt

The evaluation axes are helpfulness, relevance, accuracy, and level of detail. Prompting a high-quality LLM to generate human-like evaluations requires careful prompt engineering with explicit instructions.

We must provide the evaluation criteria and the methodology in the same fashion as if we were instructing a human evaluator.
We also ask the LLM to consider both the reference answer and the reference context (ground truth) when evaluating the response produced by the RAG pipeline.
Finally, we ask the LLM to provide a score on a scale of 1-5 (Likert scale) along with an explanation.

Here is the judge prompt template (`LLAMA_PROMPT_TEMPLATE`) that we will use with Llama 2 70B. Notice the evaluation examples provided in the prompt; they help guide the LLM.

In [None]:
# Reference context shared by the two few-shot examples
EXAMPLE_CONTEXT = """On 8 September 2022, Buckingham Palace released a statement which read: "Following further evaluation this morning, the Queen's doctors are concerned for Her Majesty's health and have recommended she remain under medical supervision. The Queen remains comfortable and at Balmoral."[257][258] Her immediate family rushed to Balmoral to be by her side.[259][260] She died peacefully at 15:10 BST at the age of 96, with two of her children, Charles and Anne, by her side;[261][262] Charles immediately succeeded as monarch. Her death was announced to the public at 18:30,[263][264] setting in motion Operation London Bridge and, because she died in Scotland, Operation Unicorn.[265][266] Elizabeth was the first monarch to die in Scotland since James V in 1542.[267] Her death certificate recorded her cause of death as old age"""

# Join the pieces with newlines so that each marker and field sits on its own line
# (adjacent string literals would otherwise be concatenated with no separator)
LLAMA_PROMPT_TEMPLATE = "\n".join([
    "<s>[INST] <<SYS>>",
    "{system_prompt}",
    "<</SYS>>",
    "",
    "Example 1:",
    "[Question]",
    "When did Queen Elizabeth II die?",
    "[The Start of the Reference Context]",
    EXAMPLE_CONTEXT,
    "[The End of Reference Context]",
    "[The Start of the Reference Answer]",
    "Queen Elizabeth II died on September 8, 2022.",
    "[The End of Reference Answer]",
    "[The Start of the Assistant's Answer]",
    "She died on September 8, 2022",
    "[The End of Assistant's Answer]",
    '"Rating": 5, "Explanation": "The answer is helpful, relevant, accurate, and concise. It matches the information provided in the reference context and answer."',
    "",
    "Example 2:",
    "[Question]",
    "When did Queen Elizabeth II die?",
    "[The Start of the Reference Context]",
    EXAMPLE_CONTEXT,
    "[The End of Reference Context]",
    "[The Start of the Reference Answer]",
    "Queen Elizabeth II died on September 8, 2022.",
    "[The End of Reference Answer]",
    "[The Start of the Assistant's Answer]",
    "Queen Elizabeth II was the longest reigning monarch of the United Kingdom and the Commonwealth.",
    "[The End of Assistant's Answer]",
    '"Rating": 1, "Explanation": "The answer is not helpful or relevant. It does not answer the question and instead goes off topic."',
    "",
    "Following the exact same format as above, what is the rating and explanation for the following assistant's answer?",
    "[Question]",
    "{question}",
    "[The Start of the Reference Context]",
    "{ctx_ref}",
    "[The End of Reference Context]",
    "[The Start of the Reference Answer]",
    "{answer_ref}",
    "[The End of Reference Answer]",
    "[The Start of the Assistant's Answer]",
    "{answer}",
    "[The End of Assistant's Answer][/INST]",
])

system_prompt = """
You are an impartial judge that evaluates the quality of an assistant's answer to the question provided.
Your evaluation takes into account the helpfulness, relevance, accuracy, and level of detail of the answer.
You must use both the reference context and reference answer to guide your evaluation.
"""

Now call the judge LLM on the RAG results.

In [None]:
# Reuse one HTTP session (and its connections) across all requests
session = requests.Session()

llama_judge_responses = []
for d in data:
    try:
        prompt = LLAMA_PROMPT_TEMPLATE.format(
            system_prompt=system_prompt,
            question=d["question"],
            ctx_ref=d["gt_context"],
            answer_ref=d["gt_answer"],
            answer=d["answer"],
        )
        payload = {
            "messages": [{"content": prompt, "role": "user"}],
            "temperature": 0.1,
            "top_p": 1.0,
            "max_tokens": 200,
            "stream": False,
        }

        response = session.post(invoke_url, headers=headers, json=payload)

        # A 202 status means the request is still pending; poll the status endpoint until it finishes
        while response.status_code == 202:
            request_id = response.headers.get("NVCF-REQID")
            fetch_url = fetch_url_format + request_id
            response = session.get(fetch_url, headers=headers)

        # Surface HTTP errors (e.g. an invalid API key) instead of failing on the JSON lookup below
        response.raise_for_status()
        response_body = response.json()
        llama_judge_responses.append(response_body['choices'][0]['message']['content'])
    except Exception as e:
        # Record a placeholder so the responses stay aligned with the entries in `data`
        print(f"Judge request failed: {e}")
        llama_judge_responses.append(None)
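
The polling loop above keeps requesting the status endpoint for as long as the service answers with 202. If you would rather give up after a fixed number of checks, the loop can be factored into a small helper. This is only a sketch: `poll_until_ready`, `fetch_status`, and `max_polls` are hypothetical names, and the default limit of 30 is an arbitrary assumption.

```python
def poll_until_ready(response, fetch_status, max_polls=30):
    """Follow 202 (pending) responses until a final response arrives.

    `fetch_status` is a caller-supplied function that takes the request ID
    and returns the next response. Raises TimeoutError after `max_polls` checks.
    """
    polls = 0
    while response.status_code == 202:
        if polls >= max_polls:
            raise TimeoutError(f"No result after {max_polls} status checks")
        request_id = response.headers.get("NVCF-REQID")
        response = fetch_status(request_id)
        polls += 1
    return response
```

In the cell above, it could be called as `poll_until_ready(response, lambda rid: session.get(fetch_url_format + rid, headers=headers))`.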


Parse the ratings and explanations out of the judge responses.

In [None]:

import re
import statistics

# Regular expression patterns to extract the rating and explanation.
# The quotes are optional because the prompt examples use the form "Rating": 5, "Explanation": "..."
rating_pattern = r'"?Rating"?\s*:\s*(\d+)'
explanation_pattern = r'"?Explanation"?\s*:\s*(.+)'

llama_ratings = []
llama_explanations = []
for response in llama_judge_responses:
    try:
        # Search for the patterns (raises TypeError when the response is None)
        rating_match = re.search(rating_pattern, response)
        explanation_match = re.search(explanation_pattern, response)

        # Extract the rating and explanation, dropping any quotes around the explanation
        llama_ratings.append(int(rating_match.group(1)) if rating_match else None)
        llama_explanations.append(explanation_match.group(1).strip().strip('"') if explanation_match else None)
    except Exception as e:
        print(f"Could not parse judge response: {e}")
        llama_ratings.append(None)
        llama_explanations.append(None)
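
Because the judge may answer with or without quotes around the field names, it is worth checking the parsing logic against a few sample strings before trusting the scores. The following is a minimal sketch that wraps the parsing in a function; `parse_judgement` and the compiled pattern names are hypothetical, and the patterns assume the response follows the format of the examples in the prompt.

```python
import re

# Quote-tolerant patterns: match both `Rating: 5` and `"Rating": 5`.
RATING_RE = re.compile(r'"?Rating"?\s*:\s*(\d+)')
EXPLANATION_RE = re.compile(r'"?Explanation"?\s*:\s*(.+)')


def parse_judgement(text):
    """Return (rating, explanation); either value is None if it cannot be found."""
    if not text:
        return None, None
    rating_match = RATING_RE.search(text)
    explanation_match = EXPLANATION_RE.search(text)
    rating = int(rating_match.group(1)) if rating_match else None
    # Strip whitespace and the surrounding quotes from the explanation text
    explanation = explanation_match.group(1).strip().strip('"') if explanation_match else None
    return rating, explanation
```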


Let's take a peek at the results!

In [None]:
print("Number of judgements:", len(llama_ratings))
print("*************************************")

for i, d in enumerate(data[:len(llama_ratings)]):
    print("Question:", d["question"])
    print("Reference Answer:", d["gt_answer"])
    print("Answer:", d["answer"])
    print("Rating:", llama_ratings[i])
    print("Explanation:", llama_explanations[i])
    print("*************************************")

Now let's calculate the mean Likert score and then display a histogram of all the ratings.

In [None]:

# Calculate the mean rating
llama_ratings = [1 if r == 0 else r for r in llama_ratings]  # Change 0 ratings to 1
llama_ratings_filtered = [r for r in llama_ratings if r]  # Remove empty ratings
print("Number of ratings:", len(llama_ratings_filtered))
if llama_ratings_filtered:
    mean = round(statistics.mean(llama_ratings_filtered), 1)
    print(f"Mean rating: {mean}")
else:
    # statistics.mean raises StatisticsError on an empty list
    print("No ratings could be parsed")
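
The mean alone hides how many judge responses could not be parsed and how the scores are spread. A per-score count is a useful companion figure. This is a minimal sketch; `summarize_ratings` is a hypothetical helper that treats `None` as an unparsed rating, as in the cells above.

```python
from collections import Counter


def summarize_ratings(ratings):
    """Count parsed and unparsed ratings and tally each score from 1 to 5."""
    parsed = [r for r in ratings if r is not None]
    counts = Counter(parsed)
    return {
        "total": len(ratings),
        "parsed": len(parsed),
        "unparsed": len(ratings) - len(parsed),
        "counts": {score: counts.get(score, 0) for score in range(1, 6)},
    }
```

For example, `summarize_ratings(llama_ratings)` reports how many entries were dropped before the mean was computed.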

In [None]:
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns

# Set the style of the visualization
sns.set(style="white")

# Create a histogram
plt.figure(figsize=(10, 6))
ax = sns.histplot(llama_ratings_filtered, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5], kde=False)
plt.xlim(.5, 5.5)
plt.xticks([1, 2, 3, 4, 5])
ax.yaxis.set_major_locator(MaxNLocator(integer=True))

# Add titles and labels
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')

# Show the plot
plt.show()

Lastly, let's write the evaluation results to a CSV file so you can examine them in more detail later.

Note: A few LLM Judge evaluation responses may be malformed and therefore unparseable. In these cases the rating and explanation fields will be empty.

In [None]:
import csv

results = list(zip(llama_ratings,
                   llama_explanations,
                   [d["question"] for d in data],
                   [d["answer"] for d in data],
                   [d["gt_answer"] for d in data],
                   [d["gt_context"] for d in data]))

output_file = 'judgements.csv'

with open(output_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # headers
    writer.writerow(['Rating', 'Explanation', 'Question', 'Answer', 'Groundtruth Answer', 'Groundtruth Context'])

    # Write the data
    for row in results:
        writer.writerow(row)

print(f"Data written to {output_file}")

Bonus! A good practice for improving a RAG pipeline is to look at the responses that were rated poorly and then decide which changes would improve them.

In [None]:
[bad_result for bad_result in results if bad_result[0] == 1]
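
The filter above keeps only the answers rated exactly 1. To review a wider band of weak answers, the cutoff can be made a parameter. This is a minimal sketch; `low_rated` and `threshold` are hypothetical names, and it assumes each row has the rating as its first element, as in `results`.

```python
def low_rated(results, threshold=2):
    """Return rows rated at or below `threshold`, worst first; skip unparsed rows."""
    rated = [row for row in results if row[0] is not None and row[0] <= threshold]
    return sorted(rated, key=lambda row: row[0])
```

For example, `low_rated(results, threshold=2)` returns every answer the judge scored 1 or 2.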