# NeMo Evaluator Microservice: Custom LLM-as-a-Judge Eval (Summarization Task)

This notebook covers how to create a custom LLM-as-a-judge evaluation with NVIDIA NeMo Evaluator Microservice (NeMo Evaluator).

We'll walk through the required steps of: 

1. Creating an evaluation target
2. Creating an evaluation configuration
3. Submitting the Evaluation job
4. Collecting results!

Let's dive right in!

> NOTE: You will need to be in an environment where you have access to a deployed instance of NeMo Evaluator, as well as NVIDIA NeMo Data Store Microservice.

In [None]:
!pip install -qU requests huggingface_hub==0.26.2

### Evaluator Endpoint

Here we just need to capture our Evaluator endpoint!

In [None]:
import requests

EVAL_URL = "<< YOUR EVALUATOR MS URL >>"

Now we can do a health check to confirm we're connected to a working instance.

In [2]:
endpoint = f"{EVAL_URL}/health"
response = requests.get(endpoint).json()
print(response)

{'status': 'healthy'}
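The check above assumes the service answers quickly and with JSON. A slightly more defensive variant is sketched below: it adds a timeout and raises on HTTP errors before parsing the body. The helper name `check_health` and the 10-second default are illustrative choices, not part of NeMo Evaluator, and `http_get` is assumed to behave like `requests.get`.

```python
def check_health(base_url, http_get, timeout=10):
    """Return True if `{base_url}/health` reports a healthy status.

    `http_get` is any callable with the same interface as `requests.get`
    (hypothetical parameter, used so the helper is easy to test).
    """
    response = http_get(f"{base_url.rstrip('/')}/health", timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.json().get("status") == "healthy"
```

In this notebook it would be called as `check_health(EVAL_URL, requests.get)`.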


### Prepare dataset in the correct format

In this example, we're evaluating models on a summarization task, so both the evaluation dataset and the prompt for our judge model need to reflect that.

Look at files in `summarization_bench`:
* `question.jsonl` - prompts for the user model (can be multi-turn) and question categories
* `reference_answer/reference.jsonl` - corresponding reference answers for prompts listed in `question.jsonl`
* `judge_prompts.jsonl` - judge prompt for each category
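Since all three files are JSON Lines, it can be worth confirming that every line parses before uploading. The sketch below makes no assumption about the fields inside each record; the helper name `load_jsonl` is hypothetical and not part of NeMo Evaluator.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file, returning (records, bad_line_numbers)."""
    records, bad_lines = [], []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad_lines.append(line_number)  # remember where parsing failed
    return records, bad_lines
```

Running it over each file and checking that `bad_lines` is empty catches malformed lines before the job does.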

### Upload dataset to NeMo Datastore

In [None]:
import huggingface_hub as hh
import requests

DATASTORE_URL = "<< YOUR DATA STORE URL >>"

token = "mock"
repo_name = "nvidia/LLMAsAJudge-Simple"
repo_type = "dataset"
dir_path = "./llm_as_a_judge"

hf_api = hh.HfApi(endpoint=DATASTORE_URL, token=token)

# create repo
hf_api.create_repo(
    repo_id=repo_name,
    repo_type=repo_type,
)

# upload dir
path_in_repo = "."
result = hf_api.upload_folder(repo_id=repo_name, folder_path=dir_path, path_in_repo=path_in_repo, repo_type=repo_type)

print(f"Dataset folder uploaded to: {result}")

### Prepare Evaluation Configs - Custom dataset for Summarization Task

In order to run a job in NeMo Evaluator - we need two specific things:

1. A Target Model - that is, the model that is going to be evaluated.
2. An Evaluation Configuration - that is, a configuration that describes our evaluation.

With those two objects created - we can run jobs! Because targets and configurations are created separately, a single target model can be reused across a large number of evaluations without being redefined each time.

#### Target Model 

Let's start by creating a new target model - we can do this as easily as pointing to the desired inference URL where the NIM is hosted and providing the appropriate model ID!

In [8]:
target_config = {
  "type": "model",
  "model": {
    "api_endpoint": {
      "url": "<< YOUR NIM INFERENCE URL >>",
      "model_id": "<< YOUR MODEL ID >>"
    }
  }
}

We'll want to point our request at the `v1/evaluation/targets` endpoint to create the target.

In [9]:
target_endpoint = f"{EVAL_URL}/v1/evaluation/targets"

Then we are clear to fire off the request!

In [10]:
response = requests.post(
    target_endpoint,
    json=target_config,
    headers={'accept': 'application/json'}
).json()

We'll capture the target's namespace and name for the coming steps - but with this step we have created our target and are ready to create an evaluation configuration!

In [None]:
target_namespace = response["namespace"]
target_name = response["name"]
print(f"Target Namespace: {target_namespace}, Target Name: {target_name}")

llm-judge-job
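The job payload later refers to the target (and the config) as `namespace/name`. A small helper, sketched below, builds that string and fails early if the create request did not return both fields, for example because the service responded with an error body. `entity_ref` is a hypothetical name, and the shape of an error response is not shown in this notebook.

```python
def entity_ref(entity):
    """Build the `namespace/name` string for a created target or config."""
    missing = [key for key in ("namespace", "name") if not entity.get(key)]
    if missing:
        # Most likely the create request failed and `entity` is an error body.
        raise KeyError(f"response is missing {missing}: {entity}")
    return f"{entity['namespace']}/{entity['name']}"
```

With the response above, `entity_ref(response)` yields the same value as `target_namespace + "/" + target_name`.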


#### Evaluation Configuration

In the following step we'll create an Evaluation Configuration which will describe exactly how, and what, we wish to evaluate our target against.

In this example - we'll be using a custom LLM-As-A-Judge evaluation. Let's create that configuration now!

Notice how we need to provide a judge model as part of our tasks - this is the model that will be doing the judging of our model's responses.

In [17]:
evaluation_config = {
 "type": "llm_as_a_judge",
 "tasks": [
   {
     "type": "custom",
     "params": {
       "judge_model": {
         "api_endpoint": {
           "url": "<< YOUR JUDGE NIM URL >>",
           "model_id": "<< YOUR JUDGE MODEL NAME >>"
         }
       },
       "judge_inference_params": {
         "top_p": 1.0e-05,
         "top_k": 1,
         "temperature": 1.0e-05,
         "stop": [],
         "tokens_to_generate": 512
       },
       "top_p": 0.9,
       "top_k": 40,
       "temperature": 0.75,
       "stop": [],
       "tokens_to_generate": 512
     },
     "dataset": {
       "files_url": "nds:LLMAsAJudge"
     }
   }
 ]
}
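The configuration carries two sets of generation parameters. Those under `judge_inference_params` are close to greedy decoding (`top_k` of 1, near-zero `temperature`), while the ones directly under `params` use ordinary sampling values and are presumably applied to the model being evaluated. The sketch below wraps the same dictionary in a function so the judge endpoint and dataset can be swapped without retyping it; `build_judge_eval_config` is a hypothetical helper, and the comments on how each parameter set is used are this reading of the config, not confirmed behavior.

```python
def build_judge_eval_config(judge_url, judge_model_id, files_url,
                            tokens_to_generate=512):
    """Assemble the custom LLM-as-a-judge config used in this notebook."""
    return {
        "type": "llm_as_a_judge",
        "tasks": [{
            "type": "custom",
            "params": {
                "judge_model": {
                    "api_endpoint": {"url": judge_url, "model_id": judge_model_id}
                },
                # Near-greedy decoding, presumably to keep judge scores repeatable.
                "judge_inference_params": {
                    "top_p": 1.0e-05, "top_k": 1, "temperature": 1.0e-05,
                    "stop": [], "tokens_to_generate": tokens_to_generate,
                },
                # Sampling settings, presumably for the evaluated model's answers.
                "top_p": 0.9, "top_k": 40, "temperature": 0.75,
                "stop": [], "tokens_to_generate": tokens_to_generate,
            },
            "dataset": {"files_url": files_url},
        }],
    }
```

Keeping the values in one function also makes it harder for the two parameter sets to drift apart when one of them is edited.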

Now we can point to our evaluation config endpoint.

In [18]:
eval_config_endpoint = f"{EVAL_URL}/v1/evaluation/configs"

And fire off the request!

In [21]:
response = requests.post(
    eval_config_endpoint,
    json=evaluation_config,
    headers={'accept': 'application/json'}
).json()

Let's again capture the namespace and name of our evaluation config for use later.

In [None]:
config_namespace = response["namespace"]
config_name = response["name"]
print(f"Config Namespace: {config_namespace}, Config Name: {config_name}")

eval-config-WkskqHD4VeawBTTgQnP2BP


### Running an Evaluation Job

Now that we have the namespace and name of both our target and our config - we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [24]:
job_config = {
    "target": target_namespace + "/" + target_name,
    "config": config_namespace + "/" + config_name,
    "tags": [
        "custom-llm-as-a-judge"
    ]
}

Next, let's set the evaluation jobs endpoint.

In [25]:
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

All that's left to do is fire off our job!

In [28]:
response = requests.post(
    job_endpoint,
    json=job_config,
    headers={'accept': 'application/json'}
).json()

#### Monitoring

We can monitor the status of our job through the following endpoint.

In [None]:
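# NOTE: assumes the job-creation response exposes the job identifier under "id".
job_id = response["id"]
monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{job_id}"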

In [40]:
response = requests.get(
    monitoring_endpoint,
).json()

Let's check our job status and wait for it to be done!

In [42]:
print(response["status"]["status"])

succeeded
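Rather than re-running the status cell by hand, the wait can be automated with a simple polling loop. The sketch below is one way to do it under stated assumptions: only `succeeded` appears in this notebook, so the other terminal state names (`failed`, `cancelled`) are guesses, and `wait_for_job` with its default intervals is a hypothetical helper.

```python
import time


def wait_for_job(fetch_job, poll_seconds=10, timeout_seconds=3600,
                 terminal_states=("succeeded", "failed", "cancelled")):
    """Poll `fetch_job()` until the job reaches a terminal state.

    `fetch_job` returns the job JSON, for example
    `lambda: requests.get(monitoring_endpoint).json()`.
    Only "succeeded" is confirmed by this notebook; the other states are assumed.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_job()
        status = job["status"]["status"]  # same path the notebook prints
        if status in terminal_states:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

The returned dictionary is the final job response, so `response = wait_for_job(...)` can feed directly into the results step.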


Now we can observe the results of our simple benchmark!

In [44]:
response["results"]

[{'metrics': [{'name': 'average',
    'value': '  5.62',
    'metadata': {'name': 'mtbench', 'metric_ranking': 0}}],
  'level_name': 'evaluation',
  'isRecommended': True,
  'evaluation_results': [{'metrics': [{'name': 'average',
      'value': ' 5.95',
      'metadata': {'name': 'helpfulness ', 'metric_ranking': 0}},
     {'name': 'average',
      'value': ' 5.3',
      'metadata': {'name': 'toxicity ', 'metric_ranking': 0}},
     {'name': 'average',
      'value': '  nan',
      'metadata': {'name': 'turn 1', 'metric_ranking': 0}},
     {'name': 'average',
      'value': '  5.62',
      'metadata': {'name': 'turn 2', 'metric_ranking': 0}},
     {'name': 'average',
      'value': ' 8',
      'metadata': {'name': 'number of missing judgements',
       'metric_ranking': 0}}],
    'level_name': 'task',
    'isRecommended': True,
    'evaluation_results': None,
    'extra_grouping_fields': None}],
  'extra_grouping_fields': None}]
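The metric values in this output are padded strings (for example `'  5.62'`), and one of them is `nan`. To compare scores numerically, they need converting first. The sketch below flattens the nested structure shown above into a dictionary keyed by level and metric name; `flatten_metrics` is a hypothetical helper written against this output shape only.

```python
import math


def flatten_metrics(results):
    """Map '<level_name>/<metric name>' to its value as a float.

    Values arrive as padded strings; 'nan' entries are returned as None
    so they are easy to filter out.
    """
    flat = {}
    for entry in results or []:
        level = entry.get("level_name")
        for metric in entry.get("metrics") or []:
            label = metric["metadata"]["name"].strip()  # names carry trailing spaces
            value = float(metric["value"])
            flat[f"{level}/{label}"] = None if math.isnan(value) else value
        # Nested task-level results use the same shape, so recurse into them.
        flat.update(flatten_metrics(entry.get("evaluation_results")))
    return flat
```

Calling `flatten_metrics(response["results"])` gives one flat mapping with numeric values, which is easier to log or compare across runs than the nested list.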