# Evaluating a NeMo checkpoint with simple-evals

This notebook shows how to extend the set of evaluations available in the NeMo Framework container.
It guides you through installing an additional evaluation harness and the different ways of specifying a benchmark.

If you would like to learn more about in-framework deployment and the difference between completions and chat endpoints, see [this tutorial](mmlu.ipynb) first.

In this tutorial we will evaluate an LLM on the [HumanEval benchmark](https://arxiv.org/abs/2107.03374) implemented in [NVIDIA Evals Factory simple-evals](https://pypi.org/project/nvidia-simple-evals/).
HumanEval consists of 164 hand-crafted programming problems in Python, each specified with a function signature and a docstring explaining the function's purpose.
The benchmark assesses the functional correctness of the generated code by running it against unit tests, rather than just measuring textual similarity to a reference solution.
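To make the idea of functional correctness concrete, here is a minimal sketch of a HumanEval-style check. The problem (`add`), its tests, and the helper `passes_tests` are made up for illustration and are not taken from the dataset or from simple-evals; a real harness also runs generated code in a sandbox with timeouts, which this sketch omits.

```python
# A toy problem in the HumanEval style: a signature plus a docstring.
# PROMPT, TEST and passes_tests are hypothetical names used only in this sketch.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

# Unit tests are wrapped in a `check` function that receives the candidate.
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes_tests(prompt, completion, test, entry_point):
    """Return True if prompt + completion passes every assertion in `test`."""
    namespace = {}
    try:
        # Define the completed function and the test function in one namespace.
        exec(prompt + completion + "\n" + test, namespace)
        # Run the unit tests against the generated function.
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors and failed assertions all count as a fail.
        return False
    return True


print(passes_tests(PROMPT, "    return a + b\n", TEST, "add"))
print(passes_tests(PROMPT, "    return a - b\n", TEST, "add"))
```

The first completion satisfies both assertions while the second one does not, even though the two differ by a single character. This is why execution-based scoring is preferred over text similarity for code.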

We will use the chat variant of the benchmark, tailored for assessing the coding abilities of instruction-tuned (chat) models.

> NOTE: It is recommended to run this notebook inside a [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) which has all the required dependencies.

## 1. Adding evaluation harness

We will start by exploring the available evaluations.
First, we take a look at the benchmarks that come pre-installed with the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo).

The `list_available_evaluations` function finds all tasks in all installed evaluation frameworks.
Initially it only shows `lm_evaluation_harness`.

We can also use the `find_framework` function to find the framework that defines a specified task.
Note that by default it is able to find `mmlu`, but cannot find a framework for executing `humaneval`.

In [None]:
%%python
from nemo.collections.llm.evaluation.base import list_available_evaluations, find_framework

print("frameworks:", list(list_available_evaluations()))
for task in ("mmlu", "humaneval"):
    try:
        print(f"{task} found in {find_framework(task)}")
    except Exception as e:
        print(str(e))

Now we will install an additional evaluation framework, [NVIDIA Evals Factory simple-evals](https://pypi.org/project/nvidia-simple-evals/).
It can be added by simply installing the package:

In [None]:
! pip install -q nvidia-simple-evals

If we repeat the same checks as before, we can now see the newly installed framework and find an implementation of the `humaneval` task.

At the same time, since both lm-evaluation-harness and simple-evals implement `mmlu`, the task name alone is now ambiguous, and we need to specify which version of this task we want to execute.

In [None]:
%%python
from nemo.collections.llm.evaluation.base import list_available_evaluations, find_framework

print("frameworks:", list(list_available_evaluations()))
for task in ("mmlu", "humaneval"):
    try:
        print(f"{task} found in {find_framework(task)}")
    except Exception as e:
        print(str(e))
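The ambiguity described above can be pictured with a toy registry. This is a conceptual sketch only: `REGISTRY`, `resolve_framework` and its `framework` argument are hypothetical, the framework names are plain labels, and none of it reproduces how `find_framework` is implemented in NeMo.

```python
# Toy registry: framework label -> task names it implements (illustrative only).
REGISTRY = {
    "lm-evaluation-harness": {"mmlu"},
    "simple-evals": {"mmlu", "humaneval"},
}


def resolve_framework(task, framework=None):
    """Pick the framework for `task`, refusing to guess when several match."""
    candidates = sorted(name for name, tasks in REGISTRY.items() if task in tasks)
    if framework is not None:
        # An explicit choice removes the ambiguity, but it must be valid.
        if framework not in candidates:
            raise ValueError(f"{framework!r} does not implement task {task!r}")
        return framework
    if not candidates:
        raise ValueError(f"no framework implements task {task!r}")
    if len(candidates) > 1:
        raise ValueError(f"task {task!r} is ambiguous, found in: {candidates}")
    return candidates[0]


print(resolve_framework("humaneval"))
print(resolve_framework("mmlu", framework="lm-evaluation-harness"))
```

In this sketch, `humaneval` resolves on its own because only one framework provides it, while a bare `mmlu` raises an error until the caller names the framework explicitly.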

## 2. Deploying the model

We are now ready to deploy and evaluate the model.
First, you need to prepare a NeMo 2 checkpoint of the model you would like to evaluate. For the purpose of this tutorial, we will use the Llama 3.2 1B Instruct checkpoint, which you can download from the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/llama-3_2-1b-instruct). Make sure to mount the directory containing the checkpoint when starting the container. In this tutorial, we assume that the checkpoint is available under the `"/checkpoints/llama-3_2-1b-instruct_v2.0"` path.

> NOTE: You can learn more about deployment and available server endpoints from the ["Evaluating a NeMo checkpoint with lm-eval"](mmlu.ipynb) tutorial. 

In [None]:
import signal
import subprocess

from nemo.collections.llm import api
from nemo.collections.llm.evaluation.api import EvaluationConfig, EvaluationTarget
from nemo.utils import logging

logging.setLevel(logging.INFO)

In [None]:
# modify this variable to point to your checkpoint
CHECKPOINT_PATH = "/checkpoints/llama-3_2-1b-instruct_v2.0"

# if you are not using NeMo FW container, modify this path to point to scripts directory
SCRIPTS_PATH = "/opt/NeMo/scripts"

# modify this path if you would like to save results in a different directory
WORKSPACE = "."

In [None]:
deploy_script = f"{SCRIPTS_PATH}/deploy/nlp/deploy_in_fw_oai_server_eval.py"

In [None]:
deploy_process = subprocess.Popen(
    ["python", deploy_script, "--nemo_checkpoint", CHECKPOINT_PATH, "--max_input_len", "8192"], 
)
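`subprocess.Popen` returns immediately, while the server needs some time to load the model. One way to avoid sending requests too early is to poll the server port before starting the evaluation. The helper below is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical name, and the default timeout and interval are arbitrary assumptions. It only checks that the TCP port accepts connections, which does not guarantee that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=5.0):
    """Poll until a TCP connection to (host, port) succeeds or `timeout` elapses.

    Returns True if the port accepted a connection, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # The context manager closes the probe connection right away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval)
    return False
```

With the endpoint used later in this tutorial, the call would be `wait_for_port("0.0.0.0", 8886)`.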

## 3. Evaluating the chat endpoint on HumanEval

simple-evals provides a "chat" variant of the HumanEval benchmark.

To learn more about the difference between "completions" and "chat" benchmarks, see the tutorial on ["Evaluating a NeMo checkpoint with lm-eval"](mmlu.ipynb).

In [None]:
model_name = "triton_model"
chat_url = "http://0.0.0.0:8886/v1/chat/completions/"

target_config = EvaluationTarget(api_endpoint={"url": chat_url, "type": "chat"})
eval_config = EvaluationConfig(
    type="humaneval",
    output_dir=f"{WORKSPACE}/humaneval",
)

results = api.evaluate(target_cfg=target_config, eval_cfg=eval_config)

When the job finishes, we can shut down the server and inspect the results.

In [None]:
deploy_process.send_signal(signal.SIGINT)
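`send_signal` returns without waiting for the server to exit. If you want to be sure that the process is gone and its resources are released before continuing, you can wait for it and escalate if it does not stop in time. The helper below is a sketch under that assumption; `stop_process` is a hypothetical name and the 30-second timeout is arbitrary.

```python
import signal
import subprocess


def stop_process(process, timeout=30.0):
    """Ask a subprocess.Popen process to exit and return its exit code."""
    if process.poll() is not None:
        # The process has already exited.
        return process.returncode
    # First ask politely, the same way Ctrl+C would.
    process.send_signal(signal.SIGINT)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # SIGINT was ignored or shutdown is too slow: send SIGTERM.
        process.terminate()
        try:
            return process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Last resort: force the process to stop.
            process.kill()
            return process.wait()
```

In this notebook the call would be `stop_process(deploy_process)`.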

In [None]:
results

We can also examine the artifacts produced by the evaluation job.
Inside the output directory you can find a detailed report in HTML format: [humaneval.html](humaneval/humaneval.html).
The report contains a metrics summary as well as the input-output pairs for all samples used in the evaluation.

In [None]:
! ls {WORKSPACE}/humaneval