# Introduction to Evaluation with Nemo Evaluator

In this notebook we walk through a routine experimentation flow: we first select a baseline model and evaluate it, then we customize the model using a dataset created with Synthetic Data Generation and evaluate the result.

We will be working with Llama 3.1 8B Instruct as our baseline model, and customizing it for a title-generation (summarization) task by using the Low-Rank Adaptation (LoRA) Parameter Efficient Fine-tuning (PEFT) method on a document-title pair dataset that was created using Synthetic Data Generation.

This notebook follows on from [this customizer tutorial](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama-3/sdg-law-title-generation).

We will explore how to leverage Nemo Evaluator for the following tasks:

1. Baseline Evaluation of Llama 3.1 8B Instruct using the LM Evaluation Harness (GSM8K)
2. Custom Dataset Evaluation of a Customized Model Using ROUGE
3. Custom Dataset Evaluation of a Customized Model using LLM-As-A-Judge

Before you begin, make sure you're in an environment with access to the Nemo Evaluator API, the baseline model NIM, the customized model NIM, and a judge LLM NIM.

For instructions on the above, please check out the detailed [Nemo Evaluator deployment guide](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/deploy-helm.html), and the [NIM deployment guide](https://developer.nvidia.com/docs/nemo-microservices/inference/getting_started/deploy-helm.html).

## Verify Nemo Evaluator is Healthy

Before digging into the Evaluator Service, we first need to verify that the service is active and running. This can be achieved through the health endpoint.

The first step in this process is to provide the Nemo Evaluator endpoint URL. Assuming you've followed the deployment guide, you will use the same URL used during the [Verify Installation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/deploy-helm.html#verify-installation) step.

In [None]:
import requests

EVAL_URL = "MY_EVALUATOR_URL"

Next, we can send a request to the `/health` endpoint to verify that the endpoint is active and healthy.

In [None]:
endpoint = f"{EVAL_URL}/health"
response = requests.get(endpoint).json()
print(response)
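
If the service is unreachable, the call above raises an exception or fails while decoding JSON. As a minimal sketch, the check can be wrapped so that failures are reported instead of raised. The helper name `check_health` and its `http_get` parameter are hypothetical conveniences, not part of the Evaluator API; the sketch only assumes that `/health` returns a success status code when the service is up.

```python
def check_health(base_url, http_get=None, timeout=10):
    """Return (ok, detail) for the service's /health endpoint."""
    if http_get is None:
        import requests  # imported lazily so a stub can be injected for testing
        http_get = requests.get
    url = f"{base_url.rstrip('/')}/health"
    try:
        response = http_get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into an exception
    except OSError as exc:  # requests' exceptions derive from OSError
        return False, f"{type(exc).__name__}: {exc}"
    return True, response.text
```

Calling `check_health(EVAL_URL)` returns a pair such as `(True, <response body>)`, which is easier to branch on in later cells than an unhandled exception.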

## Baseline Evaluation of Llama 3.1 8B Instruct with LM Evaluation Harness

The Nemo Evaluator microservice allows users to run a number of academic benchmarks, all of which are accessible through the Nemo Evaluator API.

> NOTE: For more details on what evaluations are available, please head to the [Evaluation documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html)

For this notebook, we will be running the LM Evaluation Harness evaluation!

First, we'll point to our NIM baseline model for our "model" in our Evaluation payload.

In [None]:
model_config = {
    # Placeholder name: replace with the model name your baseline NIM serves
    "llm_name": "my-baseline-model",
    "inference_url": "MY_NIM_URL/v1",
    "use_chat_endpoint": False,
}

Now we can initialize our evaluation config, which is how we communicate which benchmark tasks, subtasks, etc. to use during evaluation. 

For this evaluation, we'll focus on the [GSM8K](https://arxiv.org/abs/2110.14168) evaluation which uses Eleuther AI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.3) as a backend. 

The LM Evaluation Harness supports more than 60 standard academic benchmarks for LLMs!

In [None]:
evaluation_config = {
    "eval_type": "automatic",
    "eval_subtype": "lm_eval_harness",
    "tasks": [
        {
        "task_name" : "gsm8k",
        "task_config" : None,
        "num_fewshot" : 5,
        "batch_size" : 16,
        "bootstrap_iters" : 1000,
        "limit" : -1
        }
    ]
}

Now we can wrap our model config and evaluation config in a single payload for the Evaluator API!

In [None]:
evaluator_payload = {
    "model" : model_config,
    "evaluations" : [evaluation_config],
}
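
Note that the payload holds references to the config dicts that existed when it was built, so reassigning `model_config` or `evaluation_config` later in the notebook does not update it. One way to keep this explicit is a small builder that is called before every request. This is only a sketch: `build_payload` is a hypothetical helper, and the `model`, `evaluations`, and `tag` keys are simply the ones used in this notebook.

```python
def build_payload(model, evaluations, tag=None):
    """Assemble an Evaluator request body from a model config and evaluation configs."""
    if not evaluations:
        raise ValueError("at least one evaluation config is required")
    # Shallow copies, so replacing top-level keys in the originals later
    # does not silently change a payload that was already built.
    payload = {
        "model": dict(model),
        "evaluations": [dict(config) for config in evaluations],
    }
    if tag is not None:
        payload["tag"] = tag
    return payload
```

With this helper, `build_payload(model_config, [evaluation_config])` produces the same structure as the cell above.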

Now that we have our payload - we can send it to our Nemo Evaluator endpoint.

We'll set up our Evaluator endpoint URL...

In [None]:
evaluator_endpoint = f"{EVAL_URL}/v1/evaluations"

And fire off the request!

In [None]:
response = requests.post(evaluator_endpoint, json=evaluator_payload).json()
evaluation_id = response["evaluation_id"]
print(f"Evaluation ID: {evaluation_id}")

Each Nemo Evaluator job returns an Evaluation ID, which we can use to track the job and then collect its results.

Let's see how our job is doing - and check the results!

In [None]:
evaluation_id_endpoint = evaluator_endpoint + f"/{evaluation_id}"
response = requests.get(evaluation_id_endpoint).json()
response
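
Evaluation jobs can take a while, so a single GET may return before results are ready. A polling loop along the following lines can help. It is a generic sketch: `wait_for_evaluation` is a hypothetical helper, and because the response schema is not shown here, the caller supplies the `is_finished` predicate rather than the sketch guessing at field names.

```python
import time

def wait_for_evaluation(fetch, is_finished, interval_s=30.0, max_polls=120, sleep=time.sleep):
    """Call fetch() until is_finished(result) is true or max_polls is reached."""
    for attempt in range(max_polls):
        result = fetch()
        if is_finished(result):
            return result
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"evaluation not finished after {max_polls} polls")

# Example wiring. The "status" key and its values are ASSUMPTIONS for
# illustration only; inspect your own response to find the real field.
# result = wait_for_evaluation(
#     fetch=lambda: requests.get(evaluation_id_endpoint).json(),
#     is_finished=lambda r: r.get("status") in {"succeeded", "failed"},
# )
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.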

## Upload a Custom Dataset for Evaluation

The first thing we will need to do is to upload our custom dataset to the Data Store. The dataset is provided in the `custom_dataset` directory. 

First, we will examine the structure of the dataset:

- `question.jsonl` contains our initial documents that we want to create titles for. 
- `reference_answer/references.jsonl` contains the reference titles generated during our Synthetic Data Generation (SDG) data curation step.
- `inputs.jsonl` is a collection of the raw question prompts that can be useful for custom evaluation.
- `judge_prompts.jsonl` contains the prompt, in the format the judge model expects, that is used when the dataset is evaluated with our LLM-As-A-Judge approach.
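
Before uploading, it can be worth confirming that the local folder matches this layout and that every line is valid JSON. The following is a minimal sketch using only the standard library; `check_dataset` is a hypothetical helper, and it checks nothing beyond file presence and JSON syntax, since the exact record schema is not described here.

```python
import json
from pathlib import Path

# File names taken from the dataset description above.
EXPECTED_FILES = [
    "question.jsonl",
    "reference_answer/references.jsonl",
    "inputs.jsonl",
    "judge_prompts.jsonl",
]

def check_dataset(root, expected=EXPECTED_FILES):
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []
    for rel in expected:
        path = Path(root) / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
            continue
        with path.open(encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # tolerate blank lines
                try:
                    json.loads(line)
                except json.JSONDecodeError as exc:
                    problems.append(f"{rel}:{lineno}: invalid JSON ({exc.msg})")
    return problems
```

For example, `check_dataset("./custom_dataset")` should return an empty list when all four files are present and parse cleanly.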

### Preparing to Upload to Data Store

To upload this custom dataset, we'll use the `huggingface_hub` library from Hugging Face to interact with our Data Store.

In [None]:
!pip install -qU huggingface_hub

Next, we'll point to our Data Store API and use the provided `mock` token to gain access to the Data Store.

In [None]:
datastore_url = "YOUR_DATASTORE_URL"
token = "mock"

We'll also name our Data Store repository with something descriptive so we can reference it later.

We will also provide the path to our local data that needs to be added to our Data Store.

In [None]:
repository_name = "legal-title-dataset"
local_data_path = "./custom_dataset"

Now we can create an empty dataset repository in our Data Store.

In [None]:
datasets_endpoint = datastore_url + "/v1/datasets"

post_body = {
    "name" : repository_name,
    "description" : "Legal Title Dataset - 128",
}

repo_response = requests.post(datasets_endpoint, json=post_body, allow_redirects=True)

Now that we have a repository available on our Data Store - we can upload our dataset!

In [None]:
import huggingface_hub as hh

repo_full_name = f"nvidia/{repository_name}"
path_in_repo = "."
repo_type = "dataset"
hf_api = hh.HfApi(endpoint=datastore_url, token=token)
result = hf_api.upload_folder(repo_id=repo_full_name, folder_path=local_data_path, path_in_repo=path_in_repo, repo_type=repo_type)
print(f"Dataset Folder Uploaded To: {result}")

## Evaluating the Customized Model on ROUGE

Now that our custom dataset is in the Data Store, we can evaluate our customized model on the title-generation task using ROUGE.

> NOTE: As a reminder, we used PEFT LoRA to customize our model on synthetically created document-title data.

We can reuse the model config from above with one minor modification: it needs to reference the customized model's NIM!

In [None]:
model_config = {
    "llm_name" : "my-customized-model",
    "inference_url" : "my-customized-inference-url",
    "use_chat_endpoint" : False,
}

We will also need to modify our evaluation configs to reference the new model's NIM.

In [None]:
evaluation_config = {
    "eval_type" : "automatic",
    "eval_subtype" : "custom_eval",
    "input_file" : f"nds:{repository_name}/inputs.jsonl",
    "inference_configs" : [
        {
            "model" : {
                "llm_name" : "my-customized-model",
            },
            "run_inference" : "True",
            "inference_params" :  {
                "tokens_to_generate" : 200,
                "temperature" : 0.7,
                "top_k" : 20,
            }
        }
    ],
    "num_of_samples" : -1,
    "scorers" : ["rouge"],
}

Now we can send the evaluation job off to the Evaluator API endpoint.

In [None]:
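# Rebuild the payload so it carries the customized model and the ROUGE config
evaluator_payload = {"model" : model_config, "evaluations" : [evaluation_config]}
customized_response = requests.post(evaluator_endpoint, json=evaluator_payload).json()
customized_evaluation_id = customized_response["evaluation_id"]
print(f"Evaluation ID: {customized_evaluation_id}")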

Finally, we can check on how the evaluation went by accessing our Evaluation ID endpoint!

In [None]:
customized_evaluation_id_endpoint = evaluator_endpoint + f"/{customized_evaluation_id}"
response = requests.get(customized_evaluation_id_endpoint).json()
response
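
To build intuition for what a ROUGE score means, here is a simplified ROUGE-1 F1 computation based on unigram overlap. It is an illustration only: it uses lowercase whitespace tokenization, and it is not the implementation the Evaluator's `rouge` scorer uses, which may tokenize differently and report additional variants. The example titles are made up.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 4 shared tokens out of 5 candidate and 6 reference tokens: F1 = 8/11
print(round(rouge1_f1("Court Rules on Data Privacy",
                      "Court Ruling on Data Privacy Law"), 3))  # 0.727
```

A score near 1 means the generated title shares most of its words with the reference title; a score near 0 means little lexical overlap, even if the meaning is similar.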

## Evaluating the Customized Model with LLM-As-A-Judge

Finally, we can evaluate our customized model using Nemo Evaluator's easy-to-use LLM-As-A-Judge API!

First, let's check out the custom prompt we're going to send to our judge LLM that we've included in our Data Store.

In [None]:
import json

# Each line of judge_prompts.jsonl is one JSON object; read them all
with open("custom_dataset/judge_prompts.jsonl") as f:
    judge_prompts = f.readlines()

full_prompt_object = json.loads(judge_prompts[0])

system_prompt = full_prompt_object["system_prompt"]
judge_prompt_template = full_prompt_object["prompt_template"]

In [None]:
print(f"System Prompt: {system_prompt}")
print(f"Prompt Template: {judge_prompt_template}")

Notice how we have the following formattable attributes:

- `{question}` - this is our source document that we wish to generate a title for
- `{ref_answer_1}` - this is the reference title provided from our test set
- `{answer}` - this is the output that is generated by the LLM we're evaluating

So, for every instance in our test data - we'll prompt the Judge LLM to judge our customized model's response against the ground truth and provide ratings.
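
The substitution can be pictured with Python's `str.format`. This is a sketch under two assumptions: the template below is invented for illustration (the real one lives in `judge_prompts.jsonl`), and whether the service fills the template with `str.format` or another mechanism is not stated here. The helper name `fill_judge_prompt` is hypothetical.

```python
# Made-up template using the three placeholder names listed above.
example_template = (
    "Document:\n{question}\n\n"
    "Reference title: {ref_answer_1}\n"
    "Candidate title: {answer}\n"
)

def fill_judge_prompt(template, question, reference, answer):
    """Substitute the three placeholders into a judge prompt template."""
    return template.format(question=question, ref_answer_1=reference, answer=answer)

print(fill_judge_prompt(
    example_template,
    question="Full text of a legal document...",
    reference="A reference title",
    answer="A generated title",
))
```

If a template contains literal braces that are not placeholders, `str.format` requires them to be doubled (`{{` and `}}`), which is worth checking when writing a custom judge prompt.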

We'll need to create an evaluation payload again - let's start with our model config.

> NOTE: This model config will refer to our customized model - as it is the model that is *being* judged. We can re-use the config we used before.

In [None]:
model_config = {
    "llm_name" : "my-customized-model",
    "inference_url" : "my-customized-inference-url",
    "use_chat_endpoint" : False,
}

Now we can set-up our evaluation config.

Notice that we're now providing `judge_model` and `judge_inference_params` fields. These describe the model that will act as our LLM-As-A-Judge and the sampling parameters it will use.

In [None]:
evaluation_config = {
    "eval_type" : "llm_as_a_judge",
    "eval_subtype" : "mtbench",
    "bench_name" : f"{repository_name}",
    "mode" : "single",
    "input_dir" : f"nds:{repository_name}",
    "inference_params" : {
        "top_p" : 0.9,
        "top_k" : 0,
        "temperature" : 0.75,
        "stop" : [],
        "tokens_to_generate" : 1024,
    },
    "judge_model" : {
        "llm_type" : "nvidia-nemo-nim",
        "llm_name" : "my-judge-llm",
        "inference_url" : "my-judge-llm-url",
        "use_chat_endpoint" : False,
    },
    "judge_inference_params" : {
        "top_p" : 0.9, 
        "top_k" : 40,
        "temperature" : 0.1,
        "stop" : [],
        "tokens_to_generate" : 1024,
    }
}

Let's wrap this in our payload - and add a useful tag for tracking our Evaluation job!

In [None]:
evaluator_payload = {
    "model" : model_config,
    "evaluations" : [evaluation_config],
    "tag" : "title-generation-llm-as-a-judge"
}

Once again, we'll fire this off to the evaluator endpoint.

In [None]:
llm_as_judge_response = requests.post(evaluator_endpoint, json=evaluator_payload).json()
llm_as_judge_evaluation_id = llm_as_judge_response["evaluation_id"]
print(f"Evaluation ID: {llm_as_judge_evaluation_id}")

In [None]:
llm_as_a_judge_evaluation_id_endpoint = evaluator_endpoint + f"/{llm_as_judge_evaluation_id}"
response = requests.get(llm_as_a_judge_evaluation_id_endpoint).json()
response

Now we can download our results as a `.csv` to see how our customized model did!

In [None]:
result_repository_path = llm_as_judge_evaluation_id
download_path = "./llm_as_a_judge_results/"

repo_name = f"nvidia/{result_repository_path}"

api = hh.HfApi(endpoint=datastore_url, token=token)
repo_type = "dataset"
api.snapshot_download(repo_id=repo_name, repo_type=repo_type, local_dir=download_path, local_dir_use_symlinks=False)

Let's look at the results! 

Remember from our LLM-As-A-Judge prompt:

> You will evaluate the quality of Summary 2 on a scale of 1-7

This means our total score will be out of 7!

In [None]:
with open("./llm_as_a_judge_results/llm_as_a_judge/mtbench/results/my-customized-model.csv", "r") as table:
    for row in table:
        print(row)

Additionally, we can inspect the response directly to see how the Judge LLM arrived at the scores.

In [None]:
llm_judge_responses = "./llm_as_a_judge_results/llm_as_a_judge/mtbench/legal-title-dataset/model_judgement/my-judge-llm_single_for_my-customized-model.jsonl"

with open(llm_judge_responses, "r") as file:
    for line in file:
        row = json.loads(line)
        print(f"{row['question_id']} - Score: {row['score']}\nExplanation:{row['judgment']} \n")
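
To condense the per-item judgments into a single number, the scores can be averaged. The sketch below relies only on the `score` field printed above; `summarize_scores` is a hypothetical helper, and treating anything outside 1 to 7 as unparseable is an assumption based on the rating scale in the judge prompt, not documented behavior of the service.

```python
import json
from statistics import mean

def summarize_scores(jsonl_lines, max_score=7):
    """Average the valid judge scores and count rows that had none."""
    scores = []
    skipped = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        score = row.get("score")
        # Assumption: a valid rating lies within the 1..max_score scale.
        if isinstance(score, (int, float)) and 1 <= score <= max_score:
            scores.append(score)
        else:
            skipped += 1
    return {
        "count": len(scores),
        "skipped": skipped,
        "mean": mean(scores) if scores else None,
    }

# Usage with the file from the cell above:
# with open(llm_judge_responses) as f:
#     print(summarize_scores(f))
```

Reporting the skipped count alongside the mean makes it visible when the judge failed to produce a usable rating for some items.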