# Module 1, Activity 3: Evaluating and Refining Model Outputs Using LLM-Based Techniques

In [1]:
import boto3

from langchain_aws import ChatBedrockConverse
from langchain_core.prompts import ChatPromptTemplate

In [2]:
session = boto3.session.Session()
region = session.region_name
bedrock_runtime = boto3.client("bedrock-runtime", region_name='us-west-2')

## The Importance of Evaluation in your GenAI Solutions

You can write a great app, do extensive prompt engineering, and even fine-tune an LLM for your use case.  But all of that is wasted effort if you cannot show whether your work is actually improving anything.

Many of the best minds will tell you that the very first thing you need to do in developing a GenAI solution is determine how it will be evaluated.  What is more fun is that you can actually use an LLM to evaluate the results of an LLM!
In fact, as your evaluation datasets get larger, you will likely find that this is one of the only viable ways to test.

There are many ways to evaluate your LLM (LangChain even has its own platform for this, called LangSmith).  Ultimately, the gold standard comes down to two things: a series of question-answer pairs that the model is evaluated against, and a grading prompt with a rubric for scoring the model's answers against those reference answers.  The wording of your prompt is very important to creating a good automated evaluator.  So I encourage you to run the rest of the code in this notebook and experiment with this prompt to see how its wording changes the accuracy of your evaluations.

In [3]:
grading_prompt = ChatPromptTemplate.from_template("""
You are a grader evaluating whether an AI-generated answer is correct and contextually accurate compared to a reference answer.

QUESTION:
{question}

REFERENCE ANSWER:
{reference_answer}

AI-GENERATED ANSWER:
{generated_answer}

Score the AI answer from 1 (completely wrong) to 5 (perfectly correct and well-reasoned).
Then briefly justify your score.

Respond in this format:
Score: <number>
Justification: <your explanation>
""")

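Since the wording of the rubric matters so much, one option is to spell out what each score means instead of describing only the two endpoints.  The sketch below is an assumption on my part and not part of the prompt above: `RUBRIC` and `render_rubric` are hypothetical names, and the level descriptions are only illustrative.  It builds a rubric block in plain Python that you could paste into the grading prompt in place of the single scoring sentence.

```python
# Hypothetical rubric: the wording of each level is illustrative, not prescribed.
RUBRIC = {
    1: "Completely wrong or contradicts the reference answer.",
    2: "Mostly wrong, with only a minor correct detail.",
    3: "Partially correct, but missing or misstating a key point.",
    4: "Correct, with small omissions or imprecise wording.",
    5: "Perfectly correct and well-reasoned.",
}


def render_rubric(rubric: dict[int, str]) -> str:
    """Render the rubric as one 'score: description' line per level, in ascending order."""
    return "\n".join(f"{score}: {text}" for score, text in sorted(rubric.items()))


rubric_text = render_rubric(RUBRIC)
print(rubric_text)
```

Keeping the rubric in a dictionary makes it easy to try different wordings for a single level and rerun the evaluation to see what changes.
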
## Generate and Evaluate AI Answer

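The `grade_generated_answer` function below takes an answer produced by the `generator_llm`, asks the `grader_llm` to compare it to the `reference_answer`, and parses the score and justification out of the reply.  The `generate_answer` helper that follows it produces the answer to be graded.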

In [4]:
def grade_generated_answer(question: str, reference_answer: str, generated_answer: str) -> dict:
    messages = grading_prompt.format_messages(
        question=question,
        reference_answer=reference_answer,
        generated_answer=generated_answer
    )
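    # Use .invoke(); calling the chat model object directly is deprecated in recent LangChain releases.
    response = grader_llm.invoke(messages)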
    content = response.content.strip()

    lines = content.split("\n")
    score_line = next((l for l in lines if l.lower().startswith("score")), "Score: 0")
    justification_line = next((l for l in lines if l.lower().startswith("justification")), "Justification: N/A")

    return {
        "score": int(score_line.split(":")[1].strip()),
        "justification": justification_line.split(":", 1)[1].strip(),
        "generated_answer": generated_answer
    }
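
The parsing above assumes the grader follows the requested format exactly.  If the model replies with something like `Score: 4/5`, the `int(...)` call raises a `ValueError`, and a missing score silently becomes 0.  The following is a minimal sketch of a more forgiving parser, assuming a regular-expression approach; `parse_grade` and the two patterns are hypothetical names, not part of the notebook.

```python
import re

# Matches "Score: 4", "score: 4/5", or "**Score:** 4" at the start of a line.
SCORE_PATTERN = re.compile(r"^\W*score\W*([1-5])\b", re.IGNORECASE | re.MULTILINE)
# Captures everything after "Justification:", including any following lines.
JUSTIFICATION_PATTERN = re.compile(
    r"^\W*justification\W*(.*)", re.IGNORECASE | re.MULTILINE | re.DOTALL
)


def parse_grade(content: str) -> dict:
    """Extract the score and justification; the score is None if it cannot be parsed."""
    score_match = SCORE_PATTERN.search(content)
    justification_match = JUSTIFICATION_PATTERN.search(content)
    return {
        "score": int(score_match.group(1)) if score_match else None,
        "justification": justification_match.group(1).strip() if justification_match else "N/A",
    }


print(parse_grade("Score: 4/5\nJustification: Correct, but it omits the location."))
print(parse_grade("I cannot grade this answer."))
```

Returning `None` instead of 0 for an unparseable score keeps formatting failures separate from genuinely bad answers when you aggregate results.
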

In [None]:
def generate_answer(question: str) -> str:
    # .invoke() returns an AIMessage; return its text content to match the declared str type.
    return generator_llm.invoke(question).content

## On the Creation of QA Pairs

Ultimately, you will want to create a nice, long list of QA pairs that are customized to your given application.  Be sure to get creative.  For example, the list below starts off with some very easy, knowable questions.  But then it gets a bit more subjective, even venturing into opinions.

Spend some time thinking long and hard about how to develop QA pairs for your application.  Test edge cases.  Be sure, particularly for agentic workflows, that each tool is being tested thoroughly.  And do notice that there might be a question or two below whose reference answer either represents an opinion or is factually incorrect.  Watch what happens when you run these through your test.  How might you address that?

In [5]:
qa_pairs = [
    {
        "question": "What is the capital of France?",
        "reference_answer": "Paris is the capital of France. It's located in the northern part of the country."
    },
    {
        "question": "What is 5 + 7?",
        "reference_answer": "First, we add 5 and 7. The result is 12."
    },
    {
        "question": "Why does the sun rise in the east?",
        "reference_answer": "Because the Earth rotates west to east, the sun appears to rise in the east."
    },
    {
        "question": "What is the answer to life, the universe, and everything?",
        "reference_answer": "42 is the answer to life, the universe, and everything."
    },
    {
        "question": "Does pineapple belong on a pizza?",
        "reference_answer": "Yes, absolutely."
    },
    {
        "question": "What is the best operating system?",
        "reference_answer": "Linux"
    },
    {
        "question": "What is the largest planet in the solar system?",
        "reference_answer": "Mercury"
    },
]
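
One possible way to handle the opinion-based and factually incorrect entries above is to label each pair and decide per label how it should be treated.  The sketch below assumes a hypothetical `kind` field and label names (`factual`, `opinion`, `bad_reference`) that are not part of the list above, and it uses a shortened copy of that list so it runs on its own.

```python
from collections import defaultdict

# Shortened copy of the list above, with a hypothetical "kind" label added to each pair.
labeled_pairs = [
    {
        "question": "What is the capital of France?",
        "reference_answer": "Paris is the capital of France.",
        "kind": "factual",
    },
    {
        "question": "Does pineapple belong on a pizza?",
        "reference_answer": "Yes, absolutely.",
        "kind": "opinion",
    },
    {
        "question": "What is the largest planet in the solar system?",
        "reference_answer": "Mercury",
        "kind": "bad_reference",
    },
]


def group_by_kind(pairs: list[dict]) -> dict[str, list[dict]]:
    """Group QA pairs by their 'kind' label so each group can be handled differently."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair["kind"]].append(pair)
    return dict(groups)


groups = group_by_kind(labeled_pairs)
# Only objectively checkable pairs count toward the headline score.
scored_pairs = groups.get("factual", [])
# Opinion pairs and suspect references are set aside for human review or correction.
review_pairs = groups.get("opinion", []) + groups.get("bad_reference", [])

print(len(scored_pairs), "pairs to score,", len(review_pairs), "pairs to review")
```
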

In [6]:
generator_llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.0,
)
grader_llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.0,
)

Let's now run the grader against all of the questions.  Be sure to try running this a few different times: LLM outputs are probabilistic, and even with `temperature=0.0` you may still see some variation between runs.

In [7]:
for i, qa in enumerate(qa_pairs, 1):
    print(f"\n--- QA #{i} ---")
    gen_answer = generate_answer(qa["question"])
    result = grade_generated_answer(
        question=qa["question"],
        reference_answer=qa["reference_answer"],
        generated_answer=gen_answer
    )

    print("Question:", qa["question"])
    print("Generated Answer:", result["generated_answer"])
    print("Score:", result["score"])
    print("Justification:", result["justification"])
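
To see how much the scores move between runs, you can grade each question several times and summarize the results.  The sketch below is self-contained, so it uses a hypothetical `fake_grade` function that replays canned scores in place of a real model call; in the notebook you would instead pass a function that calls `generate_answer` and `grade_generated_answer` and returns the score.  `summarize_runs` is also a name introduced here for illustration.

```python
import statistics
from typing import Callable


def summarize_runs(grade_once: Callable[[], int], runs: int = 5) -> dict:
    """Call grade_once `runs` times and report the raw scores, mean, and spread."""
    scores = [grade_once() for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Population standard deviation; 0.0 means every run gave the same score.
        "stdev": statistics.pstdev(scores),
    }


# Hypothetical stand-in for an LLM grader: replays canned scores instead of calling a model.
canned_scores = iter([5, 4, 5, 5, 4])


def fake_grade() -> int:
    return next(canned_scores)


summary = summarize_runs(fake_grade, runs=5)
print(summary)
```

A question whose standard deviation stays well above zero across runs is a good candidate for a clearer rubric or a more specific reference answer.
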



## Concluding Thoughts

There are many different metrics on which an LLM-based application can be evaluated.  [This documentation](https://docs.smith.langchain.com/reference/sdk_reference/langchain_evaluators) by LangChain shows some of the off-the-shelf evaluators built into LangSmith and is a great starting point.  You can also develop your own.

As you prepare to deploy your own GenAI apps to production, think about scheduling regular tests for your apps so you can confirm that your chosen metrics are not drifting over time.  Creating your testing strategy up front will save you a lot of time down the road!
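
A scheduled test can be as simple as comparing the latest average score to a stored baseline and flagging drops beyond a tolerance.  The following is a minimal sketch under assumed values: `check_for_drift`, the baseline of 4.5, and the tolerance of 0.3 are all hypothetical and should be replaced with numbers that fit your own metrics.

```python
def check_for_drift(scores: list[int], baseline_mean: float, tolerance: float = 0.3) -> dict:
    """Flag drift when the mean score falls more than `tolerance` below the baseline."""
    if not scores:
        raise ValueError("scores must not be empty")
    current_mean = sum(scores) / len(scores)
    return {
        "current_mean": current_mean,
        "baseline_mean": baseline_mean,
        "drifted": current_mean < baseline_mean - tolerance,
    }


# Hypothetical numbers: a baseline from an earlier run and scores from the latest run.
report = check_for_drift(scores=[5, 4, 3, 4, 3], baseline_mean=4.5)
print(report)
```
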