In [1]:
%load_ext autoreload
%autoreload 2

In [14]:
import os
from dotenv import load_dotenv
import weave

load_dotenv()  # reads GOOGLE_API_KEY from a local .env file

import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

import nest_asyncio
nest_asyncio.apply()

In [3]:
# initialize weave
weave.init(project_name="eval-course/eval-course-dev")

Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/eval-course/eval-course-dev/weave


<weave.trace.weave_client.WeaveClient at 0x124235f00>

## Essay Writer (Aligning LLM evaluators with human evaluators)

Imagine a task in which you use an LLM to write essays.

query ----> [LLM-based essay writer] ----> essay

- You have built an evaluation set of query-essay pairs.
- You have a set of human evaluators who have labeled the essays based on some criteria.

You don't want to rely on human evaluators for every new essay, so you want to build an LLM-based evaluator whose scores agree with the human labels.

Let's start with building a simple evaluator.

## Part 1: Prompt

Any LLM evaluator needs a prompt. A judge's prompt has three key components:
1. A task description: what the judge is asked to do.
2. Measuring criteria: what the output is judged on.
3. A scoring rubric: the scale and what each score on it means.
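
As a minimal sketch, the three components can be kept separate and assembled into one template. The helper `build_judge_prompt` and the criteria and rubric wording below are hypothetical placeholders for illustration; they are not taken from the Holistic Rating rubric or from the prompt used in this notebook.

```python
# Hypothetical helper: assemble a judge prompt from its three components.
def build_judge_prompt(task: str, criteria: list[str], rubric: dict[int, str]) -> str:
    criteria_block = "\n".join(f"- {c}" for c in criteria)
    rubric_block = "\n".join(
        f"{score}: {description}" for score, description in sorted(rubric.items())
    )
    return (
        f"{task}\n\n"
        f"Criteria:\n{criteria_block}\n\n"
        f"Scoring rubric:\n{rubric_block}\n\n"
        # Not an f-string: {full_text} stays a placeholder for str.format later.
        "Essay:\n{full_text}\n"
    )


# Placeholder wording, made up for this example.
template = build_judge_prompt(
    task="You are an expert essay evaluator.",
    criteria=["Use of the source text", "Organization"],
    rubric={1: "Little or no mastery", 6: "Clear and consistent mastery"},
)
prompt = template.format(full_text="My essay text")
```

Keeping the pieces separate makes it easier to change one component, for example the rubric wording, and compare evaluator versions.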

In [22]:
JUDGE_PROMPT = """You are an expert essay evaluator. 
Please evaluate the following essay according to the Holistic Rating for Source-Based Writing rubric on a scale of 1-6.
First give a reason for the score and return the result as a valid JSON object.

Example:
```json
{{"score": 4, "reason": "The essay demonstrates a clear understanding of the source text and effectively uses it to support its points."}}
```

Essay:
{full_text}
"""

## Part 2: The Evaluator

The LLM evaluator takes the judge prompt, initializes an LLM, and passes the prompt along with the generated content to the LLM.

We expect the evaluator to return a judgement which can be in the form of raw text or a JSON object.

Here we are using the `weave.Model` class which under the hood is a Pydantic `BaseModel`. By structuring your code to be compatible with this API, you benefit from a structured way to version your application so you can more systematically keep track of your experiments.

In this case, we are passing the `full_text` to the evaluator and expect it to return a JSON object with `score` and `reason` keys.
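
Since the evaluator is expected to return a dictionary with `score` and `reason` keys, it can help to check that shape before trusting a judgement. The following is a minimal sketch; `validate_judgement` is a hypothetical helper, not part of Weave or the Gemini SDK, and the 1 to 6 range is taken from the judge prompt.

```python
# Hypothetical helper: check that a judgement has the expected shape.
def validate_judgement(judgement: dict, min_score: int = 1, max_score: int = 6) -> bool:
    score = judgement.get("score")
    reason = judgement.get("reason")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return False
    return min_score <= score <= max_score and isinstance(reason, str)


ok = validate_judgement({"score": 4, "reason": "Clear use of the source."})
out_of_range = validate_judgement({"score": 0, "reason": "Failed to parse JSON"})
missing_key = validate_judgement({"reason": "No score given"})
```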

In [26]:
from weave import Model, Evaluation
import asyncio
import json


class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(self.judge_prompt.format(full_text=full_text))
        try:
            result = response.text.strip()
            result = json.loads(result)
            return result
        except Exception:
            # 0 is outside the 1-6 scale, so it flags a parse failure rather than a real score
            return {"score": 0, "reason": "Failed to parse JSON"}

# Initialize evaluator
essay_evaluator = EssayEvaluator()

## Part 3: The evaluation dataset

To simulate this imaginary scenario, we use a small subset of the `train.csv` file from the "[Learning Agency Lab - Automated Essay Scoring 2.0](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/data?select=train.csv)" Kaggle competition.

Specifically, we have two columns of interest: `full_text` and `score`. The `full_text` column stands in for the essays generated by our LLM-based essay writer. The `score` column is the score given by the human evaluators.

Each essay was scored on a scale of 1 to 6 using the "[Holistic Rating for Source-Based Writing](https://storage.googleapis.com/kaggle-forum-message-attachments/2733927/20538/Rubric_%20Holistic%20Essay%20Scoring.pdf)" code book.
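
For illustration, here is a sketch of how such a CSV could be turned into rows with `full_text` and `score` keys. The two sample rows and the `essay_id` column are made up for this example, and `load_rows` is a hypothetical helper; the notebook itself loads a dataset that was already published to Weave.

```python
import csv
import io

# Made-up sample rows standing in for a slice of train.csv.
SAMPLE_CSV = """essay_id,full_text,score
a1,"Cars have shaped modern cities, for better and worse.",3
b2,"The author argues that exploring Venus is worthwhile.",4
"""


def load_rows(csv_file) -> list[dict]:
    """Keep only the two columns of interest and cast the score to int."""
    rows = []
    for record in csv.DictReader(csv_file):
        rows.append({"full_text": record["full_text"], "score": int(record["score"])})
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
```

Casting `score` to `int` matters for the comparison later on: a string `"3"` would never equal a predicted integer `3`.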

In [24]:
# Load the dataset
weave.init('eval-course/eval-course-dev')
essay_scorer_small = weave.ref('essay_scorer_small:v0').get()

Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/eval-course/eval-course-dev/weave


## Part 4: The evaluation metric

We want to measure how well the evaluator agrees with the human `score` column in the dataset. We use an `exact_match` metric that checks whether the evaluator's predicted score equals the human score.

The `weave.op()` decorator allows us to track the metric as an operation in the weave graph.

In [29]:
# Define a simple exact match metric
@weave.op()
def exact_match(score: int, model_output: dict) -> bool:
    """Return True if the predicted score equals the human score."""
    return model_output["score"] == score
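
Averaged over a dataset, an exact-match metric is the fraction of essays where the predicted score equals the human score. The sketch below shows that arithmetic in plain Python, without Weave; the five (human score, model output) pairs are made up for illustration.

```python
# Plain-Python version of the comparison, without the Weave decorator.
def exact_match_plain(score: int, model_output: dict) -> bool:
    return model_output["score"] == score


# Made-up (human score, model output) pairs.
pairs = [
    (3, {"score": 3}),
    (4, {"score": 2}),
    (2, {"score": 0}),  # a parse failure falls back to 0 and can never match
    (5, {"score": 5}),
    (1, {"score": 2}),
]

matches = [exact_match_plain(human, output) for human, output in pairs]
accuracy = sum(matches) / len(matches)  # True counts as 1, False as 0
```

Exact match is strict: a prediction that is off by one counts the same as one that is off by five.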

## Part 5: The evaluation

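We pass the dataset and the scorer to `Evaluation` and run it against the evaluator. Weave calls `predict` on each row of the dataset and aggregates the scorer results.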

In [27]:
# Create evaluation
evaluation = Evaluation(
    dataset=essay_scorer_small,
    scorers=[exact_match]
)

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d1a3-568d-7c02-be5a-dd57e2536e3a


{'model_output': {'score': {'mean': 0.0}},
 'exact_match': {'mean': 0.0},
 'model_latency': {'mean': 6.09924967288971}}

## Better JSON parsing

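The mean predicted score of 0.0 indicates that every response fell back to the parse-failure default. We need to improve the JSON parsing to handle cases where the LLM returns a JSON object wrapped in extra markdown formatting.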

In [28]:
@weave.op()
def parse_json(result: str) -> dict:
    if "```json" in result:
        result = result.split("```json\n")[1].split("\n```")[0]
    # Clean up any remaining markdown formatting
    result = result.strip()
    return json.loads(result)


class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(self.judge_prompt.format(full_text=full_text))
        try:
            result = response.text.strip()
            return parse_json(result)
        except Exception:
            # 0 is outside the 1-6 scale, so it flags a parse failure rather than a real score
            return {"score": 0, "reason": "Failed to parse JSON"}

# Initialize evaluator
essay_evaluator = EssayEvaluator()

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d1b0-bcdc-7ae0-af4d-139bbebaf998


{'model_output': {'score': {'mean': 2.2}},
 'exact_match': {'mean': 0.2},
 'model_latency': {'mean': 5.599770975112915}}
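
The `split`-based parsing above assumes the fence is written exactly as a `json`-tagged fence followed by a newline. A slightly more tolerant variant, sketched here with a regular expression, also accepts an untagged fence and text around it. `extract_json` is a hypothetical name, and this is one possible approach rather than the method used in this notebook.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Optional "json" tag after the opening fence; capture everything up to the closing fence.
_FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Parse JSON from a response that may wrap it in a markdown code fence."""
    match = _FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


fenced = FENCE + 'json\n{"score": 4, "reason": "ok"}\n' + FENCE
bare = '{"score": 2, "reason": "thin evidence"}'
chatty = "Here is my answer:\n" + FENCE + '\n{"score": 1, "reason": "off topic"}\n' + FENCE + "\nThanks"

parsed = [extract_json(fenced), extract_json(bare), extract_json(chatty)]
```

Malformed JSON still raises `json.JSONDecodeError`, so the caller's fallback for parse failures is still needed.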

### Structured output

Most frontier LLM providers support structured outputs, which constrain the LLM's response to a specific schema.

Note: If your LLM evaluator needs to do complex reasoning, consider using two API calls: the first to do the reasoning in free text, and the second to turn that reasoning into the structured response. Reference: https://arxiv.org/abs/2408.02442v1
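
The two-call pattern from the note can be sketched as follows. `two_step_judge`, `reason_fn`, and `format_fn` are hypothetical names: each callable stands for one LLM call that takes a prompt and returns text. The stubs below replace real API calls so the example runs offline, and the prompt wording is only an assumption.

```python
import json
from typing import Callable


def two_step_judge(
    essay: str,
    reason_fn: Callable[[str], str],
    format_fn: Callable[[str], str],
) -> dict:
    # Call 1: free-text reasoning, with no output format imposed.
    reasoning = reason_fn(
        f"Assess this essay and explain your assessment in free text.\n\nEssay:\n{essay}"
    )
    # Call 2: convert the reasoning into the structured response.
    structured = format_fn(
        "Convert the assessment below into JSON with keys "
        "'reason' (string) and 'score' (integer 1-6).\n\n"
        f"Assessment:\n{reasoning}"
    )
    return json.loads(structured)


# Stubs standing in for two LLM calls.
def fake_reason(prompt: str) -> str:
    return "Clear thesis, weak evidence. I would give it a 3."


def fake_format(prompt: str) -> str:
    return '{"reason": "Clear thesis, weak evidence.", "score": 3}'


judgement = two_step_judge("Some essay text", fake_reason, fake_format)
```

In a real setup, only the second call would use the structured-output configuration, so the schema does not constrain the reasoning step.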

In [38]:
import typing_extensions as typing

class Judgement(typing.TypedDict):
    reason: str
    score: int


class EssayEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, full_text: str) -> dict:
        response = self.model.generate_content(
            self.judge_prompt.format(full_text=full_text),
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json", response_schema=Judgement
            ),
        )
        try:
            result = json.loads(response.text.strip("\n"))
            return result
        except Exception:
            # 0 is outside the 1-6 scale, so it flags a parse failure rather than a real score
            return {"score": 0, "reason": "Failed to parse JSON"}

# Initialize evaluator
essay_evaluator = EssayEvaluator()

# Run evaluation
asyncio.run(evaluation.evaluate(essay_evaluator))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192d20a-d23a-7461-8c1a-528820d8d0a2


{'model_output': {'score': {'mean': 2.2}},
 'exact_match': {'mean': 0.2},
 'model_latency': {'mean': 4.168220973014831}}