# Code Execution Scorer

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]()

In this notebook, you will learn how to create **custom code-based scorers** using the **judgeval** library. It shows how to evaluate generated code against OpenAI's HumanEval benchmark using the benchmark's sandboxed execution harness.

You will generate code using LLMs, create a custom scorer that leverages OpenAI's sandboxed environment, and evaluate it against the HumanEval benchmark dataset using functional correctness testing.

In [None]:
# Installations
!pip install human-eval datasets openai judgeval python-dotenv


To run this notebook, select **Runtime** -> **Run All**.

In [None]:
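# Set API keys
import os
from dotenv import load_dotenv

# Load any keys defined in a local .env file
load_dotenv()

# Alternatively, set the keys directly: replace the placeholders, then uncomment
# os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
# os.environ["JUDGMENT_API_KEY"] = "<your-judgment-api-key>"
# os.environ["JUDGMENT_ORG_ID"] = "<your-judgment-org-id>"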

In [None]:
from judgeval import JudgmentClient
from judgeval.dataset import Dataset
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.data import Example
from datasets import load_dataset
from human_eval.execution import check_correctness
from openai import AsyncOpenAI
from typing import Dict, Any
import asyncio

In [None]:
# Initialize clients
judgment = JudgmentClient()
client = AsyncOpenAI()

## Understanding HumanEval

HumanEval is a benchmark dataset created by OpenAI for evaluating code generation models. Introduced in the paper ["Evaluating Large Language Models Trained on Code"](https://arxiv.org/abs/2107.03374), it contains 164 Python programming problems designed to test functional correctness.

### What HumanEval Contains
Each problem includes:
- **Function signature and docstring**: The problem description
- **Test cases**: Automated tests to verify correctness  
- **Canonical solution**: Reference implementation
- **Entry point**: Function name to test

### How HumanEval Evaluates Code
HumanEval evaluates model outputs by dynamically building a Python program that stitches together the **Function signature and docstring**, the **model’s generated solution**, and the **test cases**. This combined program is then executed in a sandbox to verify whether the generated code passes all test cases.

The `check_correctness` function orchestrates this process: it assembles the prompt, the generated solution, and the tests into a single program, then executes that program in a sandboxed environment.

```python
# Construct the check program and run it.
check_program = (
    problem["prompt"] +      # Function signature + docstring
    completion +            # Generated code
    "\n" +
    problem["test"] +       # Test cases
    "\n" +
    f"check({problem['entry_point']})"  # Call the test function
)

...

# WARNING: This executes untrusted model-generated code
exec(check_program, exec_globals)
```

The evaluation is **pass/fail**: if all test cases pass without exceptions, the code is correct. If any test fails or the code crashes, it's incorrect.
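
To make the assembly step concrete, here is a minimal sketch that builds the same kind of check program for a toy problem and runs it in-process. The `toy_problem` record and the helpers `build_check_program` and `passes` are hypothetical names invented for this illustration; they are not part of `human_eval`, and the plain `exec` call here has no sandboxing, so it is only suitable for code you wrote yourself.

```python
# A toy problem in the HumanEval record shape (hypothetical, not from the dataset).
toy_problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def build_check_program(problem: dict, completion: str) -> str:
    """Concatenate prompt, completion, tests, and the final check() call."""
    return (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + "\n"
        + f"check({problem['entry_point']})"
    )


def passes(problem: dict, completion: str) -> bool:
    """Run the assembled program in-process (no sandbox: trusted toy code only)."""
    try:
        exec(build_check_program(problem, completion), {})
        return True
    except Exception:
        # A failed assertion or any runtime error counts as a failure.
        return False


good = passes(toy_problem, "    return a + b\n")
bad = passes(toy_problem, "    return a - b\n")
print(good, bad)
```

The correct body passes both assertions, while the subtracting body trips the first assertion and is reported as a failure, which mirrors the all-or-nothing outcome described above.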

## Code Generation Function

Next, we’ll implement a function that, given a HumanEval problem, queries an LLM to produce a candidate implementation. The function is written with `async/await` so multiple problems can be evaluated in parallel, significantly reducing total runtime.

In [None]:
async def generate_code(problem: Dict[str, Any]) -> str:
    """Generate code using LLM for a given HumanEval problem."""
    prompt = problem["prompt"]
    
    response = await client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "You are an expert Python programmer. Write ONLY the Python function code that solves the given problem. Do not include any markdown formatting, explanations, or code blocks. Return only the raw Python code."},
            {"role": "user", "content": prompt}
        ],
    )
    
    generated_code = response.choices[0].message.content
    
    return generated_code
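
Even with the system prompt above, a model may occasionally wrap its answer in a Markdown fence, which would make the assembled check program fail to parse. The following is a minimal sketch of a post-processing step, assuming the reply is either plain code or code wrapped in a single fence. `strip_code_fences` is a hypothetical helper introduced here; it is not part of judgeval or the OpenAI SDK.

```python
import re

# Build the fence marker programmatically so this snippet contains no literal fence.
FENCE = "`" * 3

# Matches a reply wrapped entirely in one fenced block, with an optional language tag.
_FENCED_REPLY = re.compile(rf"^{FENCE}[\w+-]*[ \t]*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def strip_code_fences(text: str) -> str:
    """Return the code inside a single surrounding fence, or the text unchanged."""
    match = _FENCED_REPLY.match(text.strip())
    return match.group(1) if match else text


fenced_reply = FENCE + "python\ndef f():\n    return 1\n" + FENCE
print(strip_code_fences(fenced_reply))
```

One way to use it would be to return `strip_code_fences(generated_code)` from `generate_code`. Unfenced replies are passed through untouched, so leading indentation in a bare function body is preserved.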

## Custom Code Execution Scorer

We'll create a custom scorer using judgeval that integrates HumanEval's sandboxed code execution. The scorer uses the `check_correctness` function to evaluate whether generated code passes the test cases.

The scorer runs the generated code in a sandboxed environment and checks if all tests pass. If they do, it assigns a score of 1.0. If any test fails or the code crashes, it assigns a score of 0.0.

In [None]:
class HumanEvalCodeExecutionScorer(ExampleScorer):
    """
    A scorer for evaluating code generation using functional correctness.
    
    This scorer uses the human_eval.execution.check_correctness function
    to run generated code against test cases and determine if it passes.
    
    Attributes:
        name (str): The name of the scorer
    """
    name: str = "HumanEval Code Execution Scorer"

    async def a_score_example(self, example: Example) -> float:
        """
        Score an example by running the generated code against test cases.
        
        This method uses check_correctness to execute the generated code
        in a sandboxed environment and check if it passes all test cases.
        
        Args:
            example (Example): The example containing the problem and generated code
            
        Returns:
            float: The score (1.0 if all tests pass, 0.0 otherwise)
        """
        # Create problem dict in the format expected by check_correctness
        problem = {
            "task_id": example.task_id,
            "prompt": example.prompt,
            "test": example.test,
            "entry_point": example.entry_point
        }
        
        # Use check_correctness to evaluate the generated code
        result = check_correctness(
            problem=problem,
            completion=example.generated_code,
            timeout=3.0
        )
        
        # Set score based on whether tests passed
        if result["passed"]:
            self.score = 1.0
            self.reason = "All test cases passed"
        else:
            self.score = 0.0
            self.reason = f"Test failed: {result['result']}"
        
        return self.score
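
The score mapping itself can be checked without executing any generated code. The sketch below pulls the mapping into a standalone function and feeds it hand-written result dicts. `result_to_score` is a hypothetical helper, and the dict shape is an assumption based only on the `passed` and `result` keys that the scorer above reads.

```python
from typing import Any, Dict, Tuple


def result_to_score(result: Dict[str, Any]) -> Tuple[float, str]:
    """Map a check_correctness-style result dict to a (score, reason) pair."""
    if result["passed"]:
        return 1.0, "All test cases passed"
    return 0.0, f"Test failed: {result['result']}"


# Hand-written stand-ins for execution results (assumed shape, not real output).
passed_case = result_to_score({"passed": True, "result": "passed"})
failed_case = result_to_score({"passed": False, "result": "timed out"})
print(passed_case)
print(failed_case)
```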

## Load HumanEval Dataset

Now let's load the HumanEval dataset from Hugging Face and examine its structure.


In [None]:
# Load the HumanEval dataset
print("📊 Loading HumanEval dataset...")
dataset = load_dataset("openai/openai_humaneval")
print(f"   Found {len(dataset['test'])} problems")

# Examine the structure of a single problem
example_problem = dataset["test"][0]
print("\n📋 Example problem structure:")
print(f"   Task ID: {example_problem['task_id']}")
print(f"   Entry Point: {example_problem['entry_point']}")
print(f"   Prompt length: {len(example_problem['prompt'])} characters")
print(f"   Test length: {len(example_problem['test'])} characters")

print("\n📝 Sample prompt:")
print(example_problem['prompt'][:200] + "...")

Next, generate a candidate solution for each problem in the HumanEval benchmark and wrap each one in an `Example` object for upload to judgeval.

In [None]:
print("\n🤖 Generating code...")
problems = list(dataset["test"].select(range(164)))

# Generate all code in parallel
generated_codes = await asyncio.gather(*[
    generate_code(problem) 
    for problem in problems
])

# Create examples
examples = []
for i, (problem, generated_code) in enumerate(zip(problems, generated_codes)):
    print(f"   Problem {i+1}/{len(problems)}: {problem['task_id']}")
    
    example = Example(
        task_id=problem["task_id"],
        prompt=problem["prompt"],
        canonical_solution=problem["canonical_solution"],
        test=problem["test"],
        entry_point=problem["entry_point"],
        generated_code=generated_code
    )
    examples.append(example)
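
Launching all 164 requests at once with `asyncio.gather` may run into API rate limits. One possible variation, sketched below, caps the number of in-flight requests with an `asyncio.Semaphore`. `gather_bounded` and `fake_generate` are hypothetical names used only for this illustration; `fake_generate` stands in for `generate_code` so the snippet runs without an API key, and the limit of 2 is an arbitrary choice.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def gather_bounded(
    worker: Callable[[Dict[str, Any]], Awaitable[str]],
    items: List[Dict[str, Any]],
    limit: int = 8,
) -> List[str]:
    """Run worker over items concurrently, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: Dict[str, Any]) -> str:
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_generate(problem: Dict[str, Any]) -> str:
    """Stand-in for generate_code: sleeps briefly instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return f"# code for {problem['task_id']}"


demo_problems = [{"task_id": f"Demo/{i}"} for i in range(5)]
demo_codes = asyncio.run(gather_bounded(fake_generate, demo_problems, limit=2))
print(demo_codes)
```

Inside a notebook, where an event loop is already running, the equivalent call would be `await gather_bounded(generate_code, problems, limit=8)` rather than `asyncio.run(...)`.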

We use `Dataset.create()` to create a new dataset from these examples and upload it to the Judgment platform.

In [None]:
dataset = Dataset.create(
    name="humaneval-dataset", 
    project_name="humaneval-project", 
    examples=examples,
    overwrite=True
)

We use `Dataset.get()` to retrieve an existing dataset from the judgment platform.

In [None]:
dataset = Dataset.get(
    name="humaneval-dataset",
    project_name="humaneval-project"
)

## Running Evaluation

We then use our custom scorer and the judgment client to run evaluations asynchronously on our servers and display the results on the judgment platform for analysis.

In [None]:
print("\n⚡ Running evaluation...")
judgment.run_evaluation(
    examples=dataset.examples,
    scorers=[HumanEvalCodeExecutionScorer()],
    project_name="humaneval-project"
)