# Getting Started with NeMo Evaluator

In this notebook, we walk through a routine experimentation flow: we first select a baseline model and evaluate it, then we customize the model using a dataset created with Synthetic Data Generation and evaluate the result.

We will work with Llama 3.1 8B Instruct as our baseline model, and customize it for a title-generation (summarization) task using the Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) method on a document-title pair dataset that was created using Synthetic Data Generation.

This notebook follows on from the [customizer tutorial](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama-3/sdg-law-title-generation).

We will explore how to use NeMo Evaluator for the following tasks:

1. Baseline Evaluation of Llama 3.1 8B Instruct Using LM Evaluation Harness (GSM8K)
2. Custom Dataset Evaluation of a Customized Model Using ROUGE Through Similarity Metrics

Before you begin, make sure you're in an environment with access to the NeMo Evaluator API, the baseline model NIM, the customized model NIM, and a judge LLM NIM.

For instructions on setting these up, please check out the detailed [NeMo Evaluator deployment guide](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/deploy-helm.html) and the [NIM deployment guide](https://developer.nvidia.com/docs/nemo-microservices/inference/getting_started/deploy-helm.html).

## Verify NeMo Evaluator is Healthy

Before digging into the Evaluator service, we first need to verify that the service is up and running. This can be achieved through the health endpoint.

The first step in this process is to provide the NeMo Evaluator endpoint URL. Assuming you've followed the deployment guide, you will use the same URL used during the [Verify Installation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/deploy-helm.html#verify-installation) step.

In [None]:
import requests

EVAL_URL = "<< YOUR EVALUATOR URL >>"

Next, we can send a request to the `/health` endpoint to verify that the endpoint is active and healthy.

In [None]:
endpoint = f"{EVAL_URL}/health"
response = requests.get(endpoint).json()
print(response)
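
The cell above parses whatever body comes back, even if the service answered with an error status. As a minimal sketch (the `check_health` helper and its `client` and `timeout` parameters are illustrative names, not part of the Evaluator API), the same check can fail loudly on HTTP errors and avoid hanging on an unreachable host:

```python
def check_health(client, base_url, timeout=10.0):
    """Query <base_url>/health and return the parsed JSON body.

    `client` is any object with a requests-style `get` method; the
    `requests` module itself works. Raises if the service answers with
    an HTTP error status.
    """
    url = f"{base_url.rstrip('/')}/health"
    # A timeout keeps the notebook from blocking forever on a dead host.
    response = client.get(url, timeout=timeout)
    # Turn 4xx/5xx answers into exceptions instead of parsing an error body.
    response.raise_for_status()
    return response.json()


# Usage in this notebook (EVAL_URL is defined above):
# print(check_health(requests, EVAL_URL))
```

Passing the HTTP client in as an argument is only a convenience for testing; calling `requests.get` directly inside the function would work just as well.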

## Baseline Evaluation of Llama 3.1 8B Instruct with LM Evaluation Harness

The NeMo Evaluator microservice allows users to run a number of academic benchmarks, all of which are accessible through the NeMo Evaluator API.

> NOTE: For more details on what evaluations are available, please head to the [Evaluation documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html)

For this notebook, we will be running the LM Evaluation Harness evaluation!

First, we'll point to our NIM baseline model for our "target" in our Evaluation payload.

In [None]:
target_config = {
  "type": "model",
  "model": {
    "api_endpoint": {
      "url": "<< YOUR NIM INFERENCE URL >>",
      "model_id": "<< YOUR MODEL ID >>"
    }
  }
}

In [None]:
target_endpoint = f"{EVAL_URL}/v1/evaluation/targets"
response = requests.post(
    target_endpoint,
    json=target_config,
    headers={'accept': 'application/json'}
).json()

In [None]:
target_namespace = response["namespace"]
target_name = response["name"]
print(f"Target Namespace: {target_namespace}, Target Name: {target_name}")
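
If the POST request fails, the response body will not contain `namespace` and `name`, and the cell above raises a `KeyError` that hides the real problem. The following is a small sketch of a helper that checks the HTTP status first and returns the `namespace/name` reference used later in the job payload. The name `create_entity` and its parameters are hypothetical, chosen for illustration only.

```python
def create_entity(client, url, payload, timeout=30.0):
    """POST `payload` to `url` and return "<namespace>/<name>".

    `client` is any object with a requests-style `post` method, such as
    the `requests` module. Assumes, as in the cells above, that a
    successful response carries "namespace" and "name" fields.
    """
    response = client.post(
        url,
        json=payload,
        headers={"accept": "application/json"},
        timeout=timeout,
    )
    # Surface HTTP errors here rather than as a KeyError further down.
    response.raise_for_status()
    body = response.json()
    return f"{body['namespace']}/{body['name']}"


# Usage in this notebook:
# target_ref = create_entity(requests, target_endpoint, target_config)
```

The same helper applies unchanged to the evaluation config endpoint, since both responses are read the same way in this notebook.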

Now we can initialize our evaluation config, which is how we communicate which benchmark tasks, subtasks, etc. to use during evaluation. 

For this evaluation, we'll focus on the [GSM8K](https://arxiv.org/abs/2110.14168) evaluation, which uses EleutherAI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.3) as a backend.

The LM Evaluation Harness supports more than 60 standard academic benchmarks for LLMs!

In [None]:
evaluation_config = {
 "type": "lm_eval_harness",
 "tasks": [
   {
     "type": "gsm8k",
     "params": {
       "num_fewshot": 0,
       "batch_size": 4,
       "bootstrap_iters": 10,
       "limit": 5
     }
   }
 ],
 "params": {
   "use_greedy": True,
   "top_p": 0.0,
   "top_k": 1,
   "temperature": 1.0,
   "stop": [
     "<|endoftext|>",
     "<extra_id_1>"
   ],
   "tokens_to_generate": 512
 }
}

Now that we have our payload, we can send it to our NeMo Evaluator endpoint.

We'll set up our Evaluator endpoint URL...

In [None]:
eval_config_endpoint = f"{EVAL_URL}/v1/evaluation/configs"
response = requests.post(
    eval_config_endpoint,
    json=evaluation_config,
    headers={'accept': 'application/json'}
).json()

Let's again capture our evaluation config for use later.

In [None]:
config_namespace = response["namespace"]
config_name = response["name"]
print(f"Config Namespace: {config_namespace}, Config Name: {config_name}")


### Running an Evaluation Job

Now that we have the namespace and name of both our target and our config, we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [None]:
job_config = {
    "target": target_namespace + "/" + target_name,
    "config": config_namespace + "/" + config_name,
    "tags": [
        "lm-eval-harness-gsm8k"
    ]
}

Next, let's set the evaluation jobs endpoint.

In [None]:
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

All that's left to do is fire off our job!

In [None]:
response = requests.post(
    job_endpoint,
    json=job_config,
    headers={'accept': 'application/json'}
).json()

In [None]:
job_id = response["id"]
print(f"Job ID: {job_id}")

#### Monitoring

We can monitor the status of our job through the following endpoint.

In [None]:
monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{job_id}"

In [None]:
response = requests.get(
    monitoring_endpoint,
).json()

Let's check our job status and wait for it to be done!

In [None]:
print(response["status"]["status"])
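
Rather than re-running the two cells above by hand, the wait can be automated with a polling loop. This is a sketch under stated assumptions: the helper name `wait_for_job` is hypothetical, and because this notebook does not list the status strings the service reports, the caller must supply the set of terminal states.

```python
import time


def wait_for_job(fetch_job, terminal_states, poll_seconds=10.0,
                 timeout_seconds=3600.0, sleep=time.sleep):
    """Call `fetch_job()` until the job reaches a terminal state.

    `fetch_job` returns the job JSON; the status is read from
    job["status"]["status"], as in the cell above. `terminal_states`
    is supplied by the caller (an assumption, not taken from the API docs).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_job()
        status = job["status"]["status"]
        if status in terminal_states:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)


# Usage in this notebook; the state names below are placeholders to be
# replaced with the values your deployment actually reports:
# response = wait_for_job(
#     lambda: requests.get(monitoring_endpoint).json(),
#     terminal_states={"<< FINISHED STATUS >>", "<< FAILED STATUS >>"},
# )
```

The returned value is the final job JSON, so it can be printed directly in the results cell.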

Once it's done - let's look at the full results!

In [None]:
print(response)

## Upload a Custom Dataset for Evaluation

The first thing we will need to do is to upload our custom dataset to the Data Store. The dataset is provided in the `custom_dataset` directory. 

First, we will examine the structure of the dataset:

- `inputs.jsonl` is a collection of the raw question prompts that can be useful for custom evaluation.
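
To look at the file's contents before uploading, the first few records can be loaded directly. This is a minimal sketch that makes no assumption about which fields the records contain; `peek_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path


def peek_jsonl(path, limit=3):
    """Return up to `limit` parsed records from a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            records.append(json.loads(line))
            if len(records) >= limit:
                break
    return records


# Usage in this notebook:
# for record in peek_jsonl("./custom_dataset/inputs.jsonl"):
#     print(record)
```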

### Preparing to Upload to Data Store

To upload this custom dataset, we'll use the `huggingface_hub` library from Hugging Face to interact with our Data Store.

In [None]:
!pip install -qU huggingface_hub==0.26.2

Next, we'll point to our Data Store API and use the provided `mock` token to gain access to the Data Store.

In [None]:
datastore_url = "<< YOUR_DATASTORE_URL >>"
token = "mock"

We'll also name our Data Store repository with something descriptive so we can reference it later.

We will also provide the path to our local data that needs to be added to our Data Store.

In [None]:
repository_name = "nvidia/legal-title-dataset"
repository_type = "dataset"
local_data_path = "./custom_dataset"

Now we can create an empty dataset repository in our Data Store.

In [None]:
import huggingface_hub as hh

hf_api = hh.HfApi(endpoint=datastore_url, token=token)

hf_api.create_repo(
    repo_id=repository_name,
    repo_type=repository_type,
)

Now that we have a repository available on our Data Store - we can upload our dataset!

In [None]:
path_in_repo = "."
result = hf_api.upload_folder(repo_id=repository_name, folder_path=local_data_path, path_in_repo=path_in_repo, repo_type=repository_type)
print(f"Dataset Folder Uploaded To: {result}")
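
Before pointing an evaluation at the dataset, it can help to confirm that the expected file actually landed in the repository. The sketch below uses `HfApi.list_repo_files` from `huggingface_hub`; whether your Data Store deployment supports that listing call is an assumption, and `missing_from_repo` is a hypothetical helper name.

```python
def missing_from_repo(api, repo_id, repo_type, expected_files):
    """Return the expected files that are absent from the repository.

    `api` is an `HfApi`-like object exposing `list_repo_files`.
    """
    listed = set(api.list_repo_files(repo_id=repo_id, repo_type=repo_type))
    return sorted(set(expected_files) - listed)


# Usage in this notebook (hf_api, repository_name and repository_type
# are defined in the cells above):
# missing = missing_from_repo(
#     hf_api, repository_name, repository_type, ["inputs.jsonl"]
# )
# print("Missing files:", missing)
```

An empty list means every expected file is present.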

## Using Similarity Metrics to Evaluate the Customized Model on ROUGE 

Now that we've seen how to run a benchmark against our baseline model, we can evaluate our customized model on our title-generation task using ROUGE.

ROUGE is available through the `similarity_metrics` evaluation type, which contains metrics that compare the target model's response to a ground truth. Other similarity metrics are available as well, like `f1`, `bleu`, and more!
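
To make the idea of comparing a response to a ground truth concrete, here is a simplified sketch of ROUGE-1, which scores unigram overlap between a generated title and a reference title. It assumes lowercased whitespace tokenization and is not the Evaluator's implementation: production ROUGE scorers typically add their own tokenization and stemming and also report variants such as ROUGE-2 and ROUGE-L.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Compute simplified ROUGE-1 precision, recall and F1.

    Tokens are lowercased, whitespace-separated words. Overlap counts
    each shared word at most as often as it appears in both texts.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: four of the five words are shared between the two titles.
scores = rouge1("The court upheld the ruling", "The court reversed the ruling")
print(scores)
```

Precision penalizes extra words in the generated title, recall penalizes missing words from the reference, and F1 balances the two.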

> NOTE: As a reminder, we used PEFT LoRA to customize our model on synthetically created document-title data.

We can reuse the target config above with one modification: it needs to reference the customized model's NIM!

In [None]:
target_config = {
  "type": "model",
  "model": {
    "api_endpoint": {
      "url": "<< YOUR CUSTOMIZED NIM INFERENCE URL >>",
      "model_id": "<< YOUR MODEL ID >>"
    }
  }
}

In [None]:
target_endpoint = f"{EVAL_URL}/v1/evaluation/targets"
response = requests.post(
    target_endpoint,
    json=target_config,
    headers={'accept': 'application/json'}
).json()

In [None]:
target_namespace = response["namespace"]
target_name = response["name"]
print(f"Target Namespace: {target_namespace}, Target Name: {target_name}")

We can now create a customized evaluation config for our ROUGE evaluation. 

In [None]:
evaluation_config = {
 "type": "similarity_metrics",
 "tasks": [
   {
     "type": "default",
     "metrics": [
       {
         "name": "rouge"
       },
     ],
     "dataset": {
       "files_url": f"nds:{repository_name}/inputs.jsonl",
     },
     "params": {
       "tokens_to_generate": 200,
       "temperature": 0.7,
       "top_k": 20,
       "n_samples": -1
     }
   }
 ]
}

Let's get our evaluation config name and namespace from the Evaluator API endpoint.

In [None]:
eval_config_endpoint = f"{EVAL_URL}/v1/evaluation/configs"
response = requests.post(
    eval_config_endpoint,
    json=evaluation_config,
    headers={'accept': 'application/json'}
).json()

In [None]:
config_namespace = response["namespace"]
config_name = response["name"]
print(f"Config Namespace: {config_namespace}, Config Name: {config_name}")

Now we can send the evaluation job off to the Evaluator API endpoint.

In [None]:
job_config = {
    "target": target_namespace + "/" + target_name,
    "config": config_namespace + "/" + config_name,
    "tags": [
        "rouge-similarity"
    ]
}

Next, let's set the evaluation jobs endpoint.

In [None]:
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

All that's left to do is fire off our job!

In [None]:
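response = requests.post(
    job_endpoint,
    json=job_config,
    headers={'accept': 'application/json'}
).json()

In [None]:
# Capture the new job's ID so monitoring tracks this job, not the baseline job.
job_id = response["id"]
print(f"Job ID: {job_id}")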

#### Monitoring

We can monitor the status of our job through the following endpoint.

In [None]:
monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{job_id}"

In [None]:
response = requests.get(
    monitoring_endpoint,
).json()

Let's check our job status and wait for it to be done!

In [None]:
print(response["status"]["status"])

Once it's done - let's look at the full results!

In [None]:
print(response)