# Evaluating Performance of Llama 3.1 8B with NeMo Evaluator OSS

[NeMo Evaluator](https://github.com/NVIDIA-NeMo/Evaluator/tree/main) is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It can be used to run the most popular academic benchmarks by executing open-source Docker containers. 

We will use this library to evaluate [Llama 3.1 8B](https://build.nvidia.com/meta/llama-3_1-8b-instruct), which will be deployed as a NIM in the Lepton service.

The benchmark used in this notebook will be [MMLU](https://huggingface.co/datasets/cais/mmlu), a QA dataset for multitask understanding.

[This blog post](https://www.google.com/search?q=model+evaluation+nvidia+blog&oq=model+evaluation+nvidia+blog&gs_lcrp=EgRlZGdlKgYIABBFGDkyBggAEEUYOTIICAEQ6QcY_FXSAQg0MTEzajBqMagCALACAA&sourceid=chrome&ie=UTF-8) provides more details on LLM evaluation and on how to choose the right benchmarks and techniques.

## Objectives

The goal of this notebook is to demonstrate the usage of NeMo Evaluator OSS in Lepton. 

NeMo Evaluator can run locally, on Slurm, or on Lepton. Evaluation requires a deployed model: the evaluator sends a series of requests to the model's endpoint. This notebook deploys the model as a NIM, but the scripts in [NeMo Evaluator](https://github.com/NVIDIA-NeMo/Evaluator/tree/main) also let you deploy the model with vLLM or point to an already deployed endpoint.

Model evaluation can be performed in many ways. The most common is to collect the model's answers to the questions in a benchmark and either compare them to a ground truth or have a separate, more powerful LLM judge them (LLM-as-a-judge). NeMo Evaluator supports many benchmarks: this [GitHub link](https://github.com/NVIDIA-NeMo/Evaluator/tree/main/tutorials) provides tutorials, and this [documentation page](https://nv-eval-platform-dl-joc-competitive-evaluation-c0c268c1aead4fd0.gitlab-master-pages.nvidia.com/benchmarks_doc) gives more detail on the included benchmarks and how to run them.
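
As a rough illustration of the ground-truth comparison described above, the sketch below scores multiple-choice answers by exact match. The function name `exact_match_accuracy` and the sample data are hypothetical, and this is not meant to show how NeMo Evaluator scores benchmarks internally.

```python
def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions that equal their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Normalize whitespace and case so that "c" and "C " count as the same choice.
    matches = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


# Made-up model answers to four multiple-choice questions.
model_answers = ["A", "c", "B ", "D"]
ground_truth = ["A", "C", "D", "D"]
accuracy = exact_match_accuracy(model_answers, ground_truth)
print(f"accuracy = {accuracy:.2f}")
```

Real benchmarks usually need an extra step to extract the chosen option from free-form model output before a comparison like this one is possible.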

## Requirements

### System Configuration
- Access to at least one NVIDIA GPU to deploy Llama 3.1 8B. The default is an H200; for a different GPU, change the `resource_shape` variable in *Part 3: Deployment*.
- An NGC API key, obtained from [NVIDIA](https://build.nvidia.com/).
- A Hugging Face [access token](https://huggingface.co/docs/hub/en/security-tokens), which will be used to download gated models or datasets.
- A Lepton token: this notebook is meant to run **locally**, so you need a token to log in to your Lepton workspace, where the endpoint and the batch jobs are deployed. To get one, go to *Settings* and then *Tokens* in your Lepton workspace.

## Step 1: Create your environment

For this notebook, you need to install leptonai and NeMo Evaluator. You can do so with the following commands:

```bash
pip install leptonai
git clone https://github.com/NVIDIA-NeMo/Evaluator
cd Evaluator
pip install -e .
```
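
To confirm the installation worked, you can check that the two command-line tools used later in this notebook, `lep` and `nemo-evaluator-launcher`, are on your `PATH`. This is a minimal sketch; the helper name `missing_commands` is hypothetical.

```python
import shutil


def missing_commands(commands):
    """Return the commands from `commands` that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


# `lep` and `nemo-evaluator-launcher` are the two CLIs invoked later in this notebook.
missing = missing_commands(["lep", "nemo-evaluator-launcher"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Both command-line tools are available.")
```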

Feel free to create a virtual environment, so you can reuse it.

## Step 2: Connect to Lepton

Now connect to your Lepton workspace using the token you obtained. <span style="color:blue">Paste your token below</span> and run the cell.

In [None]:
!LEP_TOKEN="YOUR_TOKEN_HERE" && lep login -c $LEP_TOKEN



                            N V I D I A

                         D G X  C L O U D

        ██╗     ███████╗██████╗ ████████╗ ██████╗ ███╗   ██╗
        ██║     ██╔════╝██╔══██╗╚══██╔══╝██╔═══██╗████╗  ██║
        ██║     █████╗  ██████╔╝   ██║   ██║   ██║██╔██╗ ██║
        ██║     ██╔══╝  ██╔═══╝    ██║   ██║   ██║██║╚██╗██║
        ███████╗███████╗██║        ██║   ╚██████╔╝██║ ╚████║
        ╚══════╝╚══════╝╚═╝        ╚═╝    ╚═════╝ ╚═╝  ╚═══╝



Logged in to your workspace xfre17eu.
              tier: enterprise
        build time: 2025-10-07T22:57:54+00:00
           version: (0, 41, 20)


You should see **LEPTON** in big green font, and text indicating you're logged in.

## Step 3: Build your NeMo Evaluator script with your keys

In this section, we will create an evaluation configuration file, similar to the Lepton example on [GitHub](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator-launcher/examples/lepton_nim_llama_3_1_8b_instruct.yaml).

By running the configuration, you will:
1. Deploy the specified NIM container to a Lepton endpoint
2. Wait for the endpoint to be ready
3. Run evaluation tasks as parallel Lepton jobs that connect to the deployed NIM
4. Clean up the endpoint automatically on failure, or remind you to clean it up yourself on success
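
Step 2 amounts to polling with a deadline. The sketch below shows that general pattern only; `wait_until_ready` and `fake_endpoint_is_ready` are hypothetical names, and the launcher's own implementation may differ.

```python
import time


def wait_until_ready(is_ready, timeout_s, poll_interval_s=1.0):
    """Poll `is_ready()` until it returns True or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Stand-in for a real readiness check: reports ready on the third call.
calls = {"count": 0}


def fake_endpoint_is_ready():
    calls["count"] += 1
    return calls["count"] >= 3


ready = wait_until_ready(fake_endpoint_is_ready, timeout_s=5, poll_interval_s=0.01)
print("endpoint ready:", ready)
```

The `timeout` value set in the execution section below plays the role of `timeout_s` here.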

Let's create the configuration yaml part by part.

#### > Part 1: Defaults

In [20]:
defaults = f"""

defaults:
  - execution: lepton/default
  - deployment: nim
  - _self_
  
"""

The only thing of note here is the *deployment* section, which is defined as **nim** for this notebook. This would change depending on your model deployment.

#### > Part 2: Execution

Let's now write the **execution** configurations. For this, you will need the id of your Lepton node. 

You can obtain the node id from your Lepton Dashboard: click **Nodes**, then click the node box you see. The node id appears in the URL, between slashes, right after */node-groups/detail/dedicated* (example: *nv-int-multiteam-nebius-h200-01-mjgbgffo*). 
<span style="color:blue">Write your node id in the cell below.</span>

In [None]:
lepton_node_id = "YOUR_NODE_ID_HERE"

<span style="color:blue">Set your Lepton storage path</span>, where the results will be saved:

In [22]:
lepton_storage_path = "/EU-Model-Builder-SAs/user_homes/${oc.env:USER}/nemo-evaluator-launcher-workspace"

Additionally, you will need your NGC and your Hugging Face tokens. <span style="color:blue">Set them here:</span>

In [None]:
ngc_key = "YOUR_NGC_KEY_HERE"
hf_token = "YOUR_HF_TOKEN_HERE"
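
If you would rather not paste secrets into the notebook, one option is to read them from environment variables. The sketch below assumes you exported `NGC_API_KEY` and `HF_TOKEN` in the shell that started Jupyter; the helper `read_secret` is a hypothetical name.

```python
import os


def read_secret(name, default=None):
    """Return the environment variable `name`, or `default` if unset or empty."""
    value = os.environ.get(name, "").strip()
    return value or default


# Falls back to the placeholders from the cell above when a variable is not set.
ngc_key = read_secret("NGC_API_KEY", default="YOUR_NGC_KEY_HERE")
hf_token = read_secret("HF_TOKEN", default="YOUR_HF_TOKEN_HERE")
```

Either way, the values are written in plain text into the YAML file created in the following cells, so treat that file as sensitive.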

In [95]:
execution = f"""

execution:
  output_dir: lepton_nim_llama_3_1_8b_results

  evaluation_tasks:
    resource_shape: "cpu.large"  # Evaluation tasks require only CPU resources
    timeout: 3600  # Seconds to wait for the endpoint to be ready (3600 is also the default)

  lepton_platform:
    deployment:
      node_group: {lepton_node_id}

      platform_defaults:
        image_pull_secrets:
          - "lepton-nvidia"

    tasks:
      api_tokens:
      - value_from:
          token_name_ref: "ENDPOINT_API_KEY"  # Token to access the model endpoint
          
      env_vars:
        HF_TOKEN: "{hf_token}"

      # Node group for evaluation tasks
      node_group: {lepton_node_id.rsplit('-', 1)[0]}
      
      # Storage mounts for task execution
      mounts:
        # Main workspace mount
        - from: "node-nfs:lepton-shared-fs"
          path: {lepton_storage_path}
          mount_path: "/workspace"
          
"""

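The `node_group` expression under `tasks` drops the last hyphen-separated segment of the node id. The sketch below shows what it computes for the example id given earlier. The helper name `node_group_from_node_id` is hypothetical, and it relies on `str.rsplit`, which gives the same result for ids of this form.

```python
def node_group_from_node_id(node_id):
    """Drop the final hyphen-separated segment of a Lepton node id."""
    # rsplit with maxsplit=1 splits only at the last hyphen.
    return node_id.rsplit("-", 1)[0]


example_node_id = "nv-int-multiteam-nebius-h200-01-mjgbgffo"
print(node_group_from_node_id(example_node_id))
# nv-int-multiteam-nebius-h200-01
```
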
#### > Part 3: Deployment

Now let's look at the NIM-specific deployment configurations.

In [96]:
deployment = f"""

deployment:
  # NIM container configuration
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct

  # Lepton-specific deployment settings
  lepton_config:
    endpoint_name: llama-3-1-8b  # Base name for Lepton endpoint
    resource_shape: gpu.1xh200 # GPU shape for the endpoint
    min_replicas: 1
    max_replicas: 1

    api_tokens:
      - value_from:
          token_name_ref: "ENDPOINT_API_KEY"  # Token for the model endpoint (must be the same as the api_tokens in the Execution section)

    # Auto-scaling settings
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600
        scale_from_zero: false

    # Environment variables for NIM container
    envs:
      # Direct values
      OMPI_ALLOW_RUN_AS_ROOT: "1"
      OMPI_ALLOW_RUN_AS_ROOT_CONFIRM: "1"

      HF_TOKEN: "{hf_token}"
      
      NGC_API_KEY: "{ngc_key}"

    # Storage mounts for model caching
    mounts:
      enabled: true
      cache_path: {lepton_storage_path}/.cache
      mount_path: "/opt/nim/.cache"

"""

#### > Part 4: Evaluation

Now we can move to the evaluation section, where we choose one or more tasks to run the model on. 
A task is run simply by listing its name, and you can add as many overrides as you wish; the available options are described in the documentation. Note that the cell below runs the `ifeval` task; to run the MMLU task from simple_evals instead, replace the task name with the one listed in the documentation.

In [97]:
evaluation = f"""

evaluation:
  # Evaluation tasks to run
  tasks:
    - name: ifeval
    
"""

#### > Part 5: Saving the yaml file

All the components are ready for the file to be saved. <span style="color:blue">Define the directory where the file is saved:</span>

In [84]:
%env CURR_DIR=eval

env: CURR_DIR=eval


In [75]:
!mkdir -p $CURR_DIR

In [98]:
import os

with open(f"{os.environ['CURR_DIR']}/lepton_nim_evaluation.yaml", "w") as f:
    f.write(defaults + execution + deployment + evaluation)

## Step 4: Deploy the Evaluation Job

Now that we have created the yaml file, it's time to deploy it in Lepton.
Once the deployment is ready, you can monitor it in the Lepton UI:
- Deployment status: UI/Endpoints
- Evaluation jobs: UI/Batch Jobs

The command in the following cell deploys the job; **make sure you set the directory that contains *lepton_nim_evaluation.yaml* and the local directory where you want the results stored**.

In [None]:
%env RESULTS_DIR=results

In [99]:
!nemo-evaluator-launcher run --config-dir $CURR_DIR --config-name lepton_nim_evaluation --override execution.output_dir=$RESULTS_DIR

🚀 Processing 1 evaluation tasks with dedicated endpoints...
🚀 Creating 1 endpoints in parallel...
🚀 Task ifeval: Creating endpoint nim-ifeval-0-710546
✅ Successfully created Lepton endpoint: nim-ifeval-0-710546
⏳ Task ifeval: Waiting for endpoint nim-ifeval-0-710546 to be ready...
Traceback (most recent call last):
  File "/home/ritan/miniconda3/envs/nv-eval/bin/nemo-evaluator-launcher", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ritan/miniconda3/envs/nv-eval/lib/python3.11/site-packages/nemo_evaluator_launcher/cli/main.py", line 123, in main
    args.run.execute()
  File "/home/ritan/miniconda3/envs/nv-eval/lib/python3.11/site-packages/nemo_evaluator_launcher/cli/run.py", line 94, in execute
    invocation_id = run_eval(config, self.dry_run)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ritan/miniconda3/envs/nv-eval/lib/python3.11/site-packages/nemo_evaluator_launcher/api/functional.py", line 99, in run_eval
    return get_executor(cf

______________________

That's it! You have managed to deploy a NIM model and run an evaluation task on Lepton. Check the documentation to learn how to run other tasks and how to deploy models with vLLM!