# Custom Scorers with HumanEval

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JudgmentLabs/judgment-cookbook/blob/refactor/HumanEval_Custom_Scorer.ipynb)
[![Docs](https://img.shields.io/badge/Documentation-blue)](https://docs.judgmentlabs.ai/documentation)

In this notebook, you will learn how to evaluate code generation on OpenAI's [HumanEval](https://github.com/openai/human-eval) benchmark by creating two kinds of **custom scorers** with the [`judgeval`](https://github.com/JudgmentLabs/judgeval) library:

1. **Code Execution Scorer**: Uses sandboxed code execution to evaluate code correctness
2. **LLM-as-a-Judge Scorer**: Uses language models to evaluate code quality

You will generate code with an LLM, build custom scorers that use HumanEval's sandboxed execution harness and LLM-as-a-Judge, and evaluate the generated code on the HumanEval benchmark dataset.

In [None]:
# Installations
!pip install human-eval datasets openai judgeval python-dotenv

To run this notebook, select **Runtime -> Run all**.

## Setup

You can get your Judgment API key and Org ID for free on [Judgment](https://app.judgmentlabs.ai/register).

![Get Started](./assets/get_started.png)

Within your organization, create a project called `humaneval-project`.

In [None]:
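# Set API keys: replace each `...` with the key as a string,
# or delete these assignments and define the variables in a .env file instead.
import os
from dotenv import load_dotenv

os.environ["OPENAI_API_KEY"] = ...
os.environ["JUDGMENT_API_KEY"] = ...
os.environ["JUDGMENT_ORG_ID"] = ...

load_dotenv()  # Reads a local .env file; variables that are already set are not overridden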

In [2]:
from judgeval import JudgmentClient

from human_eval.execution import check_correctness
from openai import AsyncOpenAI

from typing import Dict, Any
import asyncio

import re
import numpy as np



In [3]:
# Initialize clients
judgment = JudgmentClient()
client = AsyncOpenAI()

## 1. Understanding HumanEval

HumanEval is a benchmark dataset created by OpenAI for evaluating code generation models. Introduced in the paper ["Evaluating Large Language Models Trained on Code"](https://arxiv.org/abs/2107.03374), it contains 164 Python programming problems designed to test functional correctness.

### What HumanEval Contains
Each problem includes:
- **Function signature and docstring**: The problem description
- **Test cases**: Automated tests to verify correctness  
- **Canonical solution**: Reference implementation
- **Entry point**: Function name to test

### How HumanEval Evaluates Code
HumanEval evaluates model outputs by dynamically building a Python program that stitches together the **Function signature and docstring**, the **model’s generated solution**, and the **test cases**. This combined program is then executed in a sandbox to verify whether the generated code passes all test cases.

The `check_correctness` function orchestrates this process: it assembles the prompt, generated solution, and tests into a single program, then executes that program in a sandboxed environment.

```python
# Construct the check program and run it.
check_program = (
    problem["prompt"] +      # Function signature + docstring
    completion +            # Generated code
    "\n" +
    problem["test"] +       # Test cases
    "\n" +
    f"check({problem['entry_point']})"  # Call the test function
)

...

# WARNING: This executes untrusted model-generated code
exec(check_program, exec_globals)
```

The evaluation is **pass/fail**: if all test cases pass without exceptions, the code is correct. If any test fails or the code crashes, it's incorrect.
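
To make the assembly step concrete, here is a minimal sketch on a toy problem. The `toy_problem` dict, the `run_check_program` helper, and both completions are illustrative assumptions and are not taken from HumanEval. Unlike `check_correctness`, this sketch applies no sandboxing or timeout, so it should only be run on code you trust.

```python
# Toy illustration of how a HumanEval-style check program is assembled and judged.
# NOTE: no sandbox and no timeout here; only run this on trusted code.

toy_problem = {
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def run_check_program(problem: dict, completion: str) -> bool:
    """Return True if the completion passes every assertion in the test."""
    check_program = (
        problem["prompt"]      # Function signature + docstring
        + completion           # Candidate function body
        + "\n"
        + problem["test"]      # Defines check(candidate)
        + "\n"
        + f"check({problem['entry_point']})"
    )
    try:
        exec(check_program, {})  # Fresh globals for every run
        return True
    except Exception:
        # Any failed assertion or crash counts as a failure
        return False


print(run_check_program(toy_problem, "    return a + b\n"))  # correct body
print(run_check_program(toy_problem, "    return a - b\n"))  # buggy body
```

The first call prints `True` and the second prints `False`, which mirrors the pass/fail rule described above.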

### Pass@k

Instead of stopping at a simple pass/fail, HumanEval extends this to compute the **Pass@k** statistic. Pass@k estimates the probability that at least one of k sampled solutions is correct, given:

- **n** = total number of generated solutions  
- **c** = number of correct solutions among them  
- **k** = how many solutions we sample  

For example, if we generate **n = 5** completions and some are correct, we can compute **Pass@1** (the chance a single sampled solution is correct) and **Pass@3** (the chance at least one out of three sampled solutions is correct). This provides a more practical view of model performance, since in real use cases we usually sample multiple completions and care whether any of them solves the problem.

In [4]:
def estimator(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
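
The product in `estimator` is a numerically stable way of computing `1 - C(n - c, k) / C(n, k)`, the probability that a random size-k subset of the n generations contains at least one correct solution. As a sanity check, the sketch below compares the two forms using only the standard library. The function names `pass_at_k_product` and `pass_at_k_comb` are illustrative and not part of `human-eval` or `judgeval`.

```python
import math


def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Product form, mirroring the estimator above without NumPy."""
    if n - c < k:
        return 1.0  # Fewer than k incorrect samples: every draw of k contains a correct one
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def pass_at_k_comb(n: int, c: int, k: int) -> float:
    """Closed form: 1 - C(n - c, k) / C(n, k)."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Worked example: n = 5 generations, c = 2 of them correct
n, c = 5, 2
for k in (1, 3):
    print(k, round(pass_at_k_product(n, c, k), 6), round(pass_at_k_comb(n, c, k), 6))
```

With n = 5 and c = 2, both forms give Pass@1 = 0.4 (two of five samples are correct) and Pass@3 = 0.9 (only one of the ten possible triples contains no correct sample).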

## Define Your Custom Example Class

In `judgeval`, all data passed into scorers is represented as an `Example`. The base `Example` object is a container that standardizes how data is stored and accessed. By inheriting from it, you can define your own fields that describe the task you want to monitor.

For HumanEval, we’ll create a `HumanEvalExample` that captures the fields needed to represent a problem and its generated code candidates:

In [None]:
from judgeval.data import Example

class HumanEvalExample(Example):
    """
    Custom Example for HumanEval tasks.
    """
    task_id: str
    prompt: str
    canonical_solution: str
    test: str
    entry_point: str
    generated_codes: list[str]

In [None]:
one_example = HumanEvalExample(
    task_id="HumanEval/0",
    prompt="def add(a: int, b: int) -> int:\n    # Write your code here",
    canonical_solution="def add(a: int, b: int) -> int:\n    return a + b",
    test="assert add(1, 2) == 3",
    entry_point="add",
    generated_codes=["def add(a, b): return a+b"]
)

print("The prompt is: ", one_example.prompt)
print("The canonical solution is: ", one_example.canonical_solution)

The prompt is:  def add(a: int, b: int) -> int:
    # Write your code here
The canonical solution is:  def add(a: int, b: int) -> int:
    return a + b


## Custom Code Execution Scorer

We'll create a custom scorer called `CodeExecutionScorer` using `judgeval`. It combines HumanEval's sandboxed `check_correctness` function with the `estimator` function defined above to report Pass@k.

In `judgeval`, the user must implement:

`async def a_score_example(self, example: Example)`

This method scores each example asynchronously and should set three key fields:

- **`self.name`**: label shown in the dashboard  
- **`self.score`**: numeric metric value (e.g., Pass@k in `[0, 1]`)  
- **`self.reason`**: human-readable explanation or context behind the score  


In [9]:
from judgeval.scorers.example_scorer import ExampleScorer

class CodeExecutionScorer(ExampleScorer):
    """
    A scorer for evaluating code generation using check_correctness
    and Pass@k statistics.
    """
    #default values
    k: int = 1
    n: int = 1

    async def a_score_example(self, example: HumanEvalExample) -> float:
        """
        Score an example by running each generated solution against the test cases.

        This method uses check_correctness to execute every generated solution
        in a sandboxed environment, counts how many pass all test cases,
        and converts that count into a Pass@k estimate.

        Args:
            example (HumanEvalExample): The example containing the problem and generated code

        Returns:
            float: The Pass@k estimate in [0, 1]
        """

        # Count the generations first so the scorer name reflects the actual n
        self.n = len(example.generated_codes)

        # Name the scorer
        self.name = f"Pass@{self.k} for {self.n} generations"

        # Create problem dict in the format expected by check_correctness
        problem = {
            "task_id": example.task_id,
            "prompt": example.prompt,
            "test": example.test,
            "entry_point": example.entry_point
        }

        # Use check_correctness to evaluate each generated solution
        failed_results = []

        for i in range(self.n):
            result = check_correctness(
                problem=problem,
                completion=example.generated_codes[i],
                timeout=3.0
            )

            if not result["passed"]:
                failed_results.append(result['result'])

        c = self.n - len(failed_results)

        pass_k = estimator(self.n, c, self.k)
        
        self.score = pass_k
        self.reason = (
            f"Passed {self.n - len(failed_results)} out of {self.n} tests.\n\n"
            "Failing snippets:\n"
            + "\n---\n".join(failed_results)
        )        
        return self.score

## Custom LLM-as-a-Judge Scorer


It may not be enough to judge code solutions on their execution results alone, since production-ready code is often evaluated against broader qualities like readability, efficiency, and adherence to best practices. 

To capture these, we’ll create another custom scorer with `judgeval` that integrates **LLM-as-a-Judge**, using a language model to assess generated code against a clear rubric emphasizing **readability**, **efficiency**, and **adherence to best practices**.



Unlike the pass/fail execution check, this scorer asks the judge to rate each generation as:
- **High Quality (1.0)**: Excellent readability, efficiency, and best practices
- **Medium Quality (0.5)**: Good overall with minor issues
- **Low Quality (0.0)**: Significant problems

The judge sums the ratings into a total between 0 and n (where n is the number of generations), and the scorer divides that total by n. The final score is therefore the average rating in `[0, 1]`.

In [17]:
class CodeQualityScorer(ExampleScorer):
    """
    A scorer for evaluating code generation using LLM-as-a-Judge.
    """

    #default values
    n: int = 1

    async def a_score_example(self, example: HumanEvalExample) -> float:
        """
        Score an example by asking an LLM judge to rate each generated solution.

        Args:
            example (HumanEvalExample): The example containing the problem and generated code

        Returns:
            float: The average rating across the n generations, in [0, 1]
                (each generation is rated 1.0 for high, 0.5 for medium, or 0.0 for low quality)
        """

        # Score every generation attached to the example
        self.n = len(example.generated_codes)

        # Name the scorer
        self.name = f"Code Quality for {self.n} generations"

        generations_text = "\n".join([
            f"Generation {i+1}:\n{code}\n"
            for i, code in enumerate(example.generated_codes)
        ])

        prompt = f"""You are an expert code reviewer. Evaluate each code generation below.

        Rate each as:
        - HIGH QUALITY (1.0): Excellent readability (clear naming, structure, comments), efficient algorithms, follows best practices (input validation, robustness, maintainability)
        - MEDIUM QUALITY (0.5): Good overall with minor issues in readability, efficiency, or best practices  
        - LOW QUALITY (0.0): Significant problems with readability, efficiency, or best practices

        Consider: Code clarity, naming, algorithm efficiency, error handling, organization, Python best practices.

        Problem: {example.prompt}

        {generations_text}

        For each generation, provide:
        SCORE: [1.0/0.5/0.0]
        REASON: [brief explanation]

        Then provide the total:
        TOTAL SCORE: [sum of all scores]"""

        response = await client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": "You are an expert code reviewer. Evaluate each code generation below and return the individual scores, the reasons for the scores, and the sum of all generation ratings (e.g., 4.5)."},
                {"role": "user", "content": prompt}
            ],
        )

        # Extract total score using regex
        total_score_match = re.search(r'(?:\*\*)?TOTAL SCORE:?\*?\*?\s*(\d+\.?\d*)', response.choices[0].message.content, re.IGNORECASE)
        if total_score_match:
            total_score = float(total_score_match.group(1)) / self.n
        else:
            total_score = 0.0
        
        
        self.score = total_score
        self.reason = response.choices[0].message.content
        
        return self.score
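
The parsing step is the most fragile part of this scorer, so it helps to check it in isolation. The sketch below reuses the same regular expression on hand-written judge replies. The `normalized_total` helper and the sample replies are illustrative assumptions, not output from a real model.

```python
import re

# Same pattern as in CodeQualityScorer: tolerates optional Markdown bold around the label
TOTAL_SCORE_PATTERN = re.compile(
    r"(?:\*\*)?TOTAL SCORE:?\*?\*?\s*(\d+\.?\d*)", re.IGNORECASE
)


def normalized_total(judge_reply: str, n: int) -> float:
    """Extract the judge's total and divide by n; fall back to 0.0 if it is missing."""
    match = TOTAL_SCORE_PATTERN.search(judge_reply)
    if match is None:
        return 0.0
    return float(match.group(1)) / n


plain_reply = "SCORE: 1.0\nREASON: clear and efficient\n\nTOTAL SCORE: 4.0"
bold_reply = "Generation 1 looks fine.\n\n**Total Score:** 3.5"

print(normalized_total(plain_reply, 5))
print(normalized_total(bold_reply, 5))
print(normalized_total("The judge forgot the total.", 5))
```

Note that a reply without a parsable total produces the same 0.0 as a set of genuinely low-quality generations, so it may be worth inspecting `self.reason` for any example that scores exactly 0.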

## Code Generation Function

Next, we’ll implement a function that, given a HumanEval problem, queries an LLM to produce a candidate implementation. The function is written with `async/await` so that multiple requests can be in flight concurrently, significantly reducing total runtime.

In [11]:
async def generate_code(problem: Dict[str, Any]) -> str:
    """Generate code using LLM for a given HumanEval problem."""
    prompt = problem["prompt"]
    
    response = await client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": "You are an expert Python programmer. Write ONLY the Python function code that solves the given problem. Do not include any markdown formatting, explanations, or code blocks. Return only the raw Python code."},
            {"role": "user", "content": prompt}
        ],
        temperature=1.0,
    )
    
    generated_code = response.choices[0].message.content
    
    return generated_code
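
Even with the instruction above, a model may occasionally wrap its answer in a Markdown code fence, which would make the assembled check program fail with a syntax error. One possible safeguard is to strip the fence before scoring. The `strip_code_fences` helper below is a hypothetical addition and is not used elsewhere in this notebook.

```python
import re

# Built dynamically so the snippet itself contains no literal code fence
FENCE = "`" * 3

# Matches an opening fence with an optional language tag, the code, and the closing fence
FENCED_BLOCK = re.compile(FENCE + r"[a-zA-Z]*\n(.*?)" + FENCE, re.DOTALL)


def strip_code_fences(text: str) -> str:
    """Return the code inside the first fenced block, or the text unchanged if there is none."""
    match = FENCED_BLOCK.search(text)
    return match.group(1) if match else text


fenced = FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
unfenced = "def add(a, b):\n    return a + b\n"

print(strip_code_fences(fenced))
print(strip_code_fences(unfenced))
```

Both calls print the same function definition. Under this assumption, `generate_code` could return `strip_code_fences(generated_code)` instead of the raw message content.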

## Load HumanEval Dataset

Now let's load the HumanEval [dataset](https://huggingface.co/datasets/openai/openai_humaneval) from Hugging Face and examine its structure.


In [12]:
from judgeval.dataset import Dataset
from datasets import load_dataset
# Load the HumanEval dataset
print("📊 Loading HumanEval dataset...")
dataset = load_dataset("openai/openai_humaneval")
print(f"   Found {len(dataset['test'])} problems")

# Examine the structure of a single problem
example_problem = dataset["test"][0]
print("\n📋 Example problem structure:")
print(f"   Task ID: {example_problem['task_id']}")
print(f"   Entry Point: {example_problem['entry_point']}")
print(f"   Prompt length: {len(example_problem['prompt'])} characters")
print(f"   Test length: {len(example_problem['test'])} characters")

print("\n📝 Sample prompt:")
print(example_problem['prompt'][:200] + "...")

📊 Loading HumanEval dataset...
   Found 164 problems

📋 Example problem structure:
   Task ID: HumanEval/0
   Entry Point: has_close_elements
   Prompt length: 348 characters
   Test length: 531 characters

📝 Sample prompt:
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given thr...


Now, let’s generate code responses for the first 20 problems in the HumanEval benchmark and wrap each problem, together with its generated candidates, in a `HumanEvalExample` for upload to Judgment.

In [13]:
print("\n🤖 Generating code...")
problems = list(dataset["test"].select(range(20)))

num_generations = 5

# Generate the candidates for each problem concurrently (problems are processed one after another)
generations = [
    await asyncio.gather(*[generate_code(problem) for _ in range(num_generations)])
    for problem in problems
]

# Create examples
examples = []
for i, (problem, generated_codes) in enumerate(zip(problems, generations)):
    print(f"   Problem {i+1}/{len(problems)}: {problem['task_id']}")
    
    example = HumanEvalExample(
        task_id=problem["task_id"],
        prompt=problem["prompt"],
        canonical_solution=problem["canonical_solution"],
        test=problem["test"],
        entry_point=problem["entry_point"],
        generated_codes=generated_codes
    )
    examples.append(example)


🤖 Generating code...
   Problem 1/20: HumanEval/0
   Problem 2/20: HumanEval/1
   Problem 3/20: HumanEval/2
   Problem 4/20: HumanEval/3
   Problem 5/20: HumanEval/4
   Problem 6/20: HumanEval/5
   Problem 7/20: HumanEval/6
   Problem 8/20: HumanEval/7
   Problem 9/20: HumanEval/8
   Problem 10/20: HumanEval/9
   Problem 11/20: HumanEval/10
   Problem 12/20: HumanEval/11
   Problem 13/20: HumanEval/12
   Problem 14/20: HumanEval/13
   Problem 15/20: HumanEval/14
   Problem 16/20: HumanEval/15
   Problem 17/20: HumanEval/16
   Problem 18/20: HumanEval/17
   Problem 19/20: HumanEval/18
   Problem 20/20: HumanEval/19
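
The generation cell awaits one `asyncio.gather` per problem, so the problems themselves are handled one after another. If rate limits allow, all requests can be issued in a single `gather` with a semaphore capping concurrency. The sketch below shows that pattern with a stand-in coroutine: `fake_generate`, `generate_all`, and the `max_concurrency` parameter are hypothetical, and `fake_generate` would be replaced by a call to `generate_code`.

```python
import asyncio


async def fake_generate(problem_id: int, sample_id: int, limit: asyncio.Semaphore) -> str:
    """Stand-in for generate_code: waits briefly instead of calling an LLM."""
    async with limit:  # At most max_concurrency requests run at once
        await asyncio.sleep(0.01)
        return f"problem {problem_id}, sample {sample_id}"


async def generate_all(
    num_problems: int, num_generations: int, max_concurrency: int = 8
) -> list[list[str]]:
    """Issue every request in one gather, then regroup the results per problem."""
    limit = asyncio.Semaphore(max_concurrency)
    flat = await asyncio.gather(*[
        fake_generate(p, s, limit)
        for p in range(num_problems)
        for s in range(num_generations)
    ])
    # gather preserves input order, so consecutive slices belong to the same problem
    return [
        flat[p * num_generations:(p + 1) * num_generations]
        for p in range(num_problems)
    ]


grouped = asyncio.run(generate_all(num_problems=3, num_generations=2))
print(grouped)
```

In a notebook, where an event loop is already running, `await generate_all(...)` would be used in place of `asyncio.run(...)`.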


We use `Dataset.create()` to create a new dataset and upload it to the [Judgment](https://app.judgmentlabs.ai/app) platform.

In [14]:
dataset = Dataset.create(
    name="humaneval-dataset", 
    project_name="humaneval-project", 
    examples=examples,
    overwrite=True
)

2025-09-30 17:14:44 - judgeval - INFO - Successfully created dataset humaneval-dataset!


We use `Dataset.get()` to retrieve an existing dataset from the [Judgment](https://app.judgmentlabs.ai/app) platform.

In [15]:
dataset = Dataset.get(
    name="humaneval-dataset",
    project_name="humaneval-project"
)

2025-09-30 17:14:45 - judgeval - INFO - Successfully retrieved dataset humaneval-dataset!


## Running Evaluation

We then pass both custom scorers to the Judgment client, which runs the evaluations asynchronously on our servers and displays the results on the Judgment platform for analysis.

In [18]:
print("\n⚡ Running evaluation...")
judgment.run_evaluation(
    examples=dataset.examples,
    scorers=[CodeExecutionScorer(k=1, n=num_generations), CodeQualityScorer(n=num_generations)],
    project_name="humaneval-project"
)


⚡ Running evaluation...


Evaluating 20 example(s) in parallel: |██████████|100% (20/20) [Time Taken: 00:17,  1.12Example/s]


[ScoringResult(success=True, scorers_data=[ScorerData(id=None, name='Pass@1 for 5 generations', threshold=0.5, success=True, score=1.0, reason='Passed 5 out of 5 tests.\n\nFailing snippets:\n', strict_mode=False, evaluation_model=None, error=None, additional_metadata=None), ScorerData(id=None, name='Code Quality for 5 generations', threshold=0.5, success=True, score=0.8, reason="Generation 1:  \nSCORE: 0.5  \nREASON: This is a brute-force O(n^2) solution that checks all pairs. It's clear and easy to understand, but it is not efficient for large lists. Variable names are generic but acceptable, docstring is missing, and there's no input validation. Overall robustness is okay for small input sizes, but algorithmic efficiency could be improved.\n\nGeneration 2:  \nSCORE: 1.0  \nREASON: This implementation sorts the list (O(n log n)), then checks only adjacent numbers for threshold difference. This is optimal for this problem. The code is clear, maintains good readability, and follows best

Click **View Results** to open the dashboard. You should see something like this — be sure to explore the **Tests** page and the analytics panels to view detailed results and insights.  


![Dashboard Screenshot](./assets/offline_tests.png)