![](https://github.ibm.com/WML/watson-machine-learning-samples/blob/dev/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png?raw=true "title")


# Use lm-evaluation-harness and own benchmarking data with watsonx.ai foundation models

This notebook contains the steps and code to demonstrate how to use the `lm-evaluation-harness` package (also called `lm-eval`) with `ibm_watsonx_ai` and the `watsonx_llm` language model.

Some familiarity with Python is helpful. This notebook uses Python 3.11.

## Learning goals

The learning goals of this notebook are: 
1. Setting up `lm-evaluation-harness` and `ibm_watsonx_ai`
2. Basic `lm-evaluation-harness` usage with available tasks
3. Preparing custom tasks and setting up local datasets
4. Calling `lm-evaluation-harness` with locally prepared tasks

## Table of contents
This notebook contains the following parts:

1. [Prerequisites](#prerequisites)
   - [How to install lm-evaluation-harness - two ways](#install-lm-eval)
   - [Install lm-evaluation-harness and ibm_watsonx_ai](#install-packages)
   - [Validate installation](#validate-install)
2. [Setting up necessary IBM watsonx credentials](#credentials)
   - [Working with projects](#projects)
   - [Export watsonx variables to be used by lm-evaluation-harness](#export-variables)
3. [Basic lm-evaluation-harness usage](#usage)
4. [Preparing own data for benchmarking](#prepare-own-data)
   - [Prepare an APIClient instance](#prepare-client)
   - [Prepare data assets](#prepare-assets)
   - [Download datasets](#download-files)
   - [Validate download](#validate-download)
   - [Sample YAML task syntax](#yaml-task)
   - [Save custom task to file](#save-task)
5. [Run lm-evaluation-harness benchmarks with local data](#benchmarks-local)
6. [Summary and next steps](#summary)

## Prerequisites <a name="prerequisites"></a>

Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a [watsonx.ai Runtime](https://cloud.ibm.com/catalog/services/watson-machine-learning) instance (a free plan is offered and information about how to create the instance can be found [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/wml-plans.html?context=wx&audience=wdp)).
- Create a [Cloud Object Storage (COS) instance](https://console.bluemix.net/catalog/infrastructure/cloud-object-storage) (a lite plan is offered and information about how to order storage can be found [here](https://cloud.ibm.com/docs/cloud-object-storage/basics/order-storage.html#order-storage)).
  
__Note: When using Watson Studio, you already have a COS instance associated with the project you are running the notebook in.__

#### How to install lm-evaluation-harness - two ways <a name="install-lm-eval"></a>

`lm-evaluation-harness` is a unified framework for testing generative language models on a large number of evaluation tasks. For more information and the source code, see its [GitHub repository](https://github.com/EleutherAI/lm-evaluation-harness/tree/main).

1. Package installation - to use as is: 
    ```
    !pip install lm-eval | tail -n 1
    ```

2. Local installation - for debugging purposes:

    ```
    git clone https://github.com/EleutherAI/lm-evaluation-harness
    cd lm-evaluation-harness
    pip install -e .
    ```

#### Install ibm_watsonx_ai and lm-evaluation-harness package from pip <a name="install-packages"></a>

__Note__: 
- `ibm-watsonx-ai` documentation can be found [here](https://ibm.github.io/watsonx-ai-python-sdk/index.html).
- `lm-evaluation-harness` documentation can be found [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)

In [1]:
!pip install -U ibm_watsonx_ai | tail -n 1 
!pip install lm-eval | tail -n 1 

Successfully installed DataProperty-1.1.0 accelerate-1.3.0 chardet-5.2.0 colorama-0.4.6 datasets-3.2.0 dill-0.3.8 evaluate-0.4.3 huggingface-hub-0.27.1 jsonlines-4.0.0 lm-eval-0.4.7 mbstrdecoder-1.1.4 more_itertools-10.6.0 multiprocess-0.70.16 pathvalidate-3.2.3 peft-0.14.0 portalocker-3.1.1 pybind11-2.13.6 pytablewriter-1.2.1 rouge-score-0.1.2 sacrebleu-2.5.1 safetensors-0.5.2 sqlitedict-2.1.0 tabledata-1.3.4 tcolorpy-0.1.7 tokenizers-0.21.0 tqdm-multiprocess-0.0.11 transformers-4.48.0 typepy-1.3.4 word2number-1.1 xxhash-3.5.0 zstandard-0.23.0


- `wget` is needed only if you plan to download datasets directly from HuggingFace. If you already have them stored locally or as data assets on Cloud, skip the cell below.

In [2]:
!pip install wget | tail -n 1

Successfully installed wget-3.2


#### Validate installation <a name="validate-install"></a>

In [3]:
!pip list | grep ibm_watsonx_ai
!pip list | grep lm_eval

ibm_watsonx_ai                          1.2.1
lm_eval                                 0.4.7
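The same check can be done from Python instead of `pip list | grep`. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical and introduced only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names as they appear in the `pip list` output above
for name in ("ibm_watsonx_ai", "lm_eval"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A `not installed` line means the corresponding `pip install` cell needs to be rerun in the current kernel.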


## Setting up necessary IBM watsonx credentials <a name="credentials"></a>

Required credentials:
- IBM Cloud API key
- IBM Cloud URL
- IBM Cloud project ID

Authenticate the Watson Machine Learning service on IBM Cloud. You need to provide your Cloud API key and location.

Tip: Your `Cloud API key` can be generated by going to the [**Users** section of the Cloud console](https://cloud.ibm.com/iam#/users). From that page, click your name, scroll down to the __API Keys__ section, and click __Create an IBM Cloud API key__. Give your key a name and click __Create__, then copy the created key and paste it below. You can also get a service-specific URL from the [Endpoint URLs section of the watsonx.ai Runtime](https://cloud.ibm.com/apidocs/machine-learning) docs. You can check your instance location in your [watsonx.ai Runtime](https://cloud.ibm.com/catalog/services/watson-machine-learning) instance details.

You can use [IBM Cloud CLI](https://cloud.ibm.com/docs/cli/index.html) to retrieve the instance location.

```
ibmcloud login --apikey API_KEY -a https://cloud.ibm.com
ibmcloud resource service-instance WML_INSTANCE_NAME
```

NOTE: You can also get a service-specific API key by going to the [Service IDs section of the Cloud Console](https://cloud.ibm.com/iam/serviceids). From that page, click __Create__, and then copy the created key and paste it in the following cell.

Import widely used modules:

In [4]:
import os
from pathlib import Path 

__Action__: Enter your `api_key` in the following cell

In [5]:
import getpass
api_key = getpass.getpass("Please enter your api key (hit enter): ")

Please enter your api key (hit enter):  ········


__Action__: Enter your `location` in the following cell

In [None]:
location = "INSERT YOUR LOCATION HERE"

If you are running this notebook on Cloud, you can access the `location` via:

In [6]:
location = os.environ.get("RUNTIME_ENV_REGION")

In [7]:
url = f"https://{location}.ml.cloud.ibm.com"
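When `RUNTIME_ENV_REGION` is not set, `os.environ.get` returns `None` and the f-string above silently produces `https://None.ml.cloud.ibm.com`. A minimal sketch of a guard, assuming a hypothetical helper named `build_watsonx_url`:

```python
def build_watsonx_url(location):
    """Build the regional endpoint URL, failing early on a missing location."""
    if not location or not location.strip():
        raise ValueError("location is empty; set it manually, for example 'us-south'")
    return f"https://{location.strip()}.ml.cloud.ibm.com"


print(build_watsonx_url("us-south"))
```

In the notebook this could be used as `url = build_watsonx_url(location)`.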

### Working with projects <a name="projects"></a>

You need to create a project that will be used for your work. If you do not have a project, you can use the [Projects Dashboard](https://dataplatform.cloud.ibm.com/wx/home?context=wx) to create one.

- Click __Create a new project__
- Provide a name
- Select Cloud Object Storage
- Select Watson Machine Learning instance and press __Create__
- Copy `project_id` and paste it below

__Action__: Assign project ID below

In [None]:
project_id = "INSERT YOUR PROJECT ID HERE"

If you are running this notebook on Cloud, you can access the `project_id` via: 

In [8]:
project_id = os.environ.get("PROJECT_ID")

### Export watsonx variables to be used by lm-evaluation-harness <a name="export-variables"></a>

In [9]:
os.environ["WATSONX_API_KEY"] = api_key
os.environ["WATSONX_URL"] = url
os.environ["WATSONX_PROJECT_ID"] = project_id
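Assigning `None` to an `os.environ` entry raises a `TypeError`, so it can help to confirm that all three variables ended up set before calling `lm_eval`. The following is a small sketch; the helper name `missing_watsonx_vars` is an assumption made for this example.

```python
import os

# Variable names exported in the cell above
REQUIRED_VARS = ("WATSONX_API_KEY", "WATSONX_URL", "WATSONX_PROJECT_ID")


def missing_watsonx_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_watsonx_vars()
print("All variables set" if not missing else f"Missing: {', '.join(missing)}")
```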

## Basic `lm-evaluation-harness` usage <a name="usage"></a>

Basic `lm-evaluation-harness` syntax requires providing: 
- model
- specific model_id
- task name

```
!lm_eval \
--model [model] \
--model_args model_id=[model_id] \
--limit 10 \
--tasks [task_name]
```

`--limit 10` is used to evaluate only 10 records.

In order to get more info about possible arguments, use the command: 

```
!lm-eval -h
```
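The command template above can also be assembled in Python, which avoids shell line-continuation mistakes and makes the arguments easy to reuse. This is a sketch only: `build_lm_eval_command` is a hypothetical helper, and it uses only the flags that appear in this notebook.

```python
import shlex


def build_lm_eval_command(model, model_id, tasks, limit=None,
                          include_path=None, output_path=None):
    """Assemble an lm_eval argument list from the flags used in this notebook."""
    command = [
        "lm_eval",
        "--model", model,
        "--model_args", f"model_id={model_id}",
        "--tasks", ",".join(tasks),  # several tasks are passed comma-separated
    ]
    if limit is not None:
        command += ["--limit", str(limit)]
    if include_path is not None:
        command += ["--include_path", str(include_path)]
    if output_path is not None:
        command += ["--output_path", str(output_path)]
    return command


command = build_lm_eval_command(
    model="watsonx_llm",
    model_id="ibm/granite-13b-instruct-v2",
    tasks=["gsm8k"],
    limit=10,
)
# Print the command as a single shell-safe string for inspection
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the watsonx environment variables are exported.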

A sample call using the `watsonx_llm` model and the available `gsm8k` task:

In [50]:
!lm_eval --model watsonx_llm \
--verbosity ERROR \
--model_args model_id=ibm/granite-13b-instruct-v2 \
--limit 10 \
--tasks gsm8k

2025-01-20:15:26:01,154 INFO     [client.py:443] Client successfully initialized
2025-01-20:15:26:01,608 INFO     [wml_resource.py:112] Successfully finished Get available foundation models for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/foundation_model_specs?version=2025-01-10&project_id=7e8b59ba-2610-4a29-9d90-dc02483ed5f4&filters=function_text_generation%2C%21lifecycle_withdrawn%3Aand&limit=200'
100%|██████████████████████████████████████████| 10/10 [00:00<00:00, 251.49it/s]
Running generate_until function ...:   0%|               | 0/10 [00:00<?, ?it/s]2025-01-20:15:26:05,490 INFO     [_client.py:1027] HTTP Request: POST https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-01-10 "HTTP/1.1 200 OK"
2025-01-20:15:26:05,497 INFO     [wml_resource.py:112] Successfully finished generate for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-01-10'
Running generate_until function ...:  10%|▋      | 1/10 [00:01<00:10,  1.19s/it]2025-01-20:15:26:06

If you get the following error:
`RuntimeError: Model [model_id] is not supported: does not return logprobs for input tokens`, try again with a different model that returns `logprobs` for input tokens. Available models are listed [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx).

## Preparing own data for benchmarking <a name="prepare-own-data"></a>

### Prepare an APIClient instance <a name="prepare-client"></a>

In [11]:
from ibm_watsonx_ai import Credentials, APIClient

api_client = APIClient(credentials=Credentials(api_key=api_key, url=url), project_id=project_id)

### Prepare data assets <a name="prepare-assets"></a>

This example uses the `validation-00000-of-00001.parquet` file from the ARC-Easy subset of the AI2 ARC dataset, which is available [on HuggingFace](https://huggingface.co/datasets/allenai/ai2_arc/tree/main/ARC-Easy).

Let's save its name for further use.

In [12]:
validation_filename = "validation-00000-of-00001.parquet"

If you are running this notebook locally, you can download the dataset directly from the HuggingFace hub using `wget`. If you already have the file in your desired location or are using DataConnections, skip the cell below.

In [14]:
import wget

base_url = 'https://huggingface.co/datasets/allenai/ai2_arc/resolve/main/ARC-Easy/'

path = Path(os.getcwd()) / validation_filename
if path.exists():
    path.unlink()
wget.download(f"{base_url}{validation_filename}")

'validation-00000-of-00001.parquet'

If you are running this notebook on Cloud and wish to use a connection to a data asset, execute the cells below.

__Action__: Provide path to a file that you wish to create a data asset with. 

In [16]:
def create_asset(client, file_path): 
    asset_details = client.data_assets.create(file_path=file_path, name=Path(file_path).name)
    return client.data_assets.get_id(asset_details)

In [17]:
validation_asset_id = create_asset(api_client, validation_filename)

Creating data asset...
SUCCESS


If you already have a connection to the file and wish to download from it, skip the cells above and execute the cell below.

__Action__: Provide existing connection IDs.

In [None]:
train_asset_id = "INSERT TRAIN ASSET ID HERE"
test_asset_id = "INSERT TEST ASSET ID HERE"
validation_asset_id = "INSERT VALIDATION ASSET ID HERE"

### Download data from Data Assets <a name="download-files"></a>

If your file is already stored locally in your desired location, skip the cells below.

In [18]:
def download_file(client, asset_id, name):
    # Download the data asset content and save it locally under the given file name
    return client.data_assets.download(asset_id=asset_id, filename=name)

In [40]:
path = Path(os.getcwd()) / validation_filename
if path.exists():
    path.unlink()
download_file(api_client, validation_asset_id, validation_filename)

Successfully saved data asset content to file: 'validation-00000-of-00001.parquet'


'/home/wsuser/work/validation-00000-of-00001.parquet'

### Validate files download <a name="validate-download"></a>

In [57]:
list(map(str, Path(os.getcwd()).iterdir()))

['/home/wsuser/work/validation-00000-of-00001.parquet']
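Listing the directory confirms that a file exists, but not that it is a Parquet file (a failed download can leave an HTML error page under the same name). Parquet files start and end with the 4-byte marker `PAR1`, so a cheap sanity check is possible with the standard library alone. The helper `looks_like_parquet` below is a hypothetical name, and the check is a sketch rather than a full validation.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to end of file
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# File name used throughout this notebook; prints False if the file is absent
print(looks_like_parquet("validation-00000-of-00001.parquet"))
```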

### Sample YAML task syntax <a name="yaml-task"></a>
This section uses the [`arc_easy`](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/arc) task as an example of how to build a task and run it from outside the `lm-evaluation-harness` repository. 
Benchmarking tasks are stored as `yaml` files. Consider the [ARC-Easy](https://huggingface.co/datasets/allenai/ai2_arc/tree/main/ARC-Easy) dataset and its corresponding task. The `yaml` file that defines the task looks like this: 

```
tag:
  - ai2_arc
task: arc_easy
dataset_path: allenai/ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

Normally, `dataset_path` and `dataset_name` point to datasets stored in the `HuggingFace` hub, and `task` refers to a task registered inside the `lm-evaluation-harness` repo. However, it is also possible to point to a local dataset with a custom task. To do so, specify the file type in the `dataset_path` field and the local paths in the `dataset_kwargs` field:

```
dataset_path: file_type (arrow, parquet, jsonl...)
dataset_kwargs:
  data_files:
    train: /path/to/train/train_file
    validation: /path/to/validation/validation_file
    test: /path/to/test/test_file
```
The local `yaml` file must also be saved to a known directory, and that directory must be passed to the `lm-eval` command through `--include_path`.
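Relative paths in `data_files` depend on the directory the command is run from, so it can be safer to resolve them to absolute paths before writing the task file. The sketch below assumes a hypothetical helper, `local_data_files`, that also fails early when a split file is missing.

```python
from pathlib import Path


def local_data_files(**splits):
    """Map split names to absolute paths, rejecting files that do not exist."""
    resolved = {}
    for split, filename in splits.items():
        path = Path(filename).expanduser().resolve()
        if not path.is_file():
            raise FileNotFoundError(f"{split} split not found: {path}")
        resolved[split] = str(path)
    return resolved
```

For example, `local_data_files(validation="validation-00000-of-00001.parquet")` returns a dictionary that can be used as the `data_files` value inside `dataset_kwargs`.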

Knowing what the task structure must include, we can recreate it as a Python dictionary.

In [49]:
task = dict(
    tag="test_task_ai2_arc_local",
    task="test_task_local",
    dataset_path="parquet",
    dataset_kwargs={
        "data_files": {
            "validation": validation_filename
        }
    },
    output_type="multiple_choice",
    validation_split="validation",
    doc_to_text="Question: {{question}}\\nAnswer:",
    doc_to_target="{{choices.label.index(answerKey)}}",
    doc_to_choice="{{choices.text}}",
    should_decontaminate=True,
    doc_to_decontamination_query="Question: {{question}}\\nAnswer:",
    metric_list=[
        {
            "metric": "acc",
            "aggregation": "mean",
            "higher_is_better": True
        },
        {
            "metric": "acc_norm",
            "aggregation": "mean",
            "higher_is_better": True
        },
    ],
    metadata={"version": "1.0"},
)
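To see what the three `doc_to_*` templates compute, the sketch below reproduces them in plain Python for a single record. The record is invented for illustration and only mimics the ARC-style fields the templates reference (`question`, `choices.text`, `choices.label`, `answerKey`); `render_multiple_choice` is a hypothetical helper, not part of `lm-eval`. It uses a real newline, as in the double-quoted YAML above; note that `\\n` in the Python dictionary is a literal backslash followed by `n`.

```python
# Hypothetical record in an ARC-style schema, for illustration only
sample_doc = {
    "question": "Which tool is used to measure temperature?",
    "choices": {
        "text": ["ruler", "thermometer", "scale", "clock"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}


def render_multiple_choice(doc):
    """Plain-Python equivalent of doc_to_text, doc_to_choice and doc_to_target."""
    prompt = f"Question: {doc['question']}\nAnswer:"          # doc_to_text
    choices = doc["choices"]["text"]                          # doc_to_choice
    target = doc["choices"]["label"].index(doc["answerKey"])  # doc_to_target
    return prompt, choices, target


prompt, choices, target = render_multiple_choice(sample_doc)
print(prompt)
print(f"Correct choice: index {target} ({choices[target]})")
```

The target is an index into the list of choices, which is what a `multiple_choice` task compares against the highest-scoring continuation.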

### Save task to file <a name="save-task"></a>

In [44]:
import yaml

# Write the task definition in block style, matching the YAML sample above
with open("test_task.yaml", "w", encoding="utf-8") as yaml_file:
    yaml.dump(task, yaml_file, default_flow_style=False)

## Run `lm-evaluation-harness` benchmarks with local data <a name="benchmarks-local"></a>

With the dataset and the `yaml` task saved, run the `lm_eval` command with the `--include_path .` argument, which points to the directory containing the task file, and pass the name of the local copy of the `arc_easy` task (`test_task_local`) to `--tasks`. Evaluation results are saved to the directory given by `--output_path` (`results`).

In [51]:
!lm_eval --model watsonx_llm \
--model_args model_id=ibm/granite-13b-instruct-v2 \
--include_path . \
--limit 10 \
--tasks test_task_local \
--output_path results

2025-01-20:15:27:58,139 INFO     [__main__.py:279] Verbosity set to INFO
2025-01-20:15:27:58,139 INFO     [__main__.py:303] Including path: .
2025-01-20:15:28:05,492 INFO     [__main__.py:376] Selected Tasks: ['test_task_local']
2025-01-20:15:28:05,493 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-01-20:15:28:05,493 INFO     [evaluator.py:201] Initializing watsonx_llm model, with arguments: {'model_id': 'ibm/granite-13b-instruct-v2'}
2025-01-20:15:28:06,528 INFO     [client.py:443] Client successfully initialized
2025-01-20:15:28:07,166 INFO     [wml_resource.py:112] Successfully finished Get available foundation models for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/foundation_model_specs?version=2025-01-10&project_id=7e8b59ba-2610-4a29-9d90-dc02483ed5f4&filters=function_text_generation%2C%21lifecycle_withdrawn%3Aand&limit=200'
2025-01-20:15:28:07,450 INFO     [task.py:

Now let's see the evaluation results. The file name consists of the `results` prefix and a unique timestamp. 

In [52]:
import json

def read_json_results(file_name):
    # Parse a results file written by lm-eval into a dictionary
    with open(file_name, encoding="utf-8") as results_file:
        return json.load(results_file)

In [53]:
results_files = list(map(str, (Path(os.getcwd()) / "results").iterdir()))
results_files

['/home/wsuser/work/results/results_2025-01-20T15-28-23.709514.json',
 '/home/wsuser/work/results/results_2025-01-20T15-23-15.196310.json']
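Each run adds a new file, and `iterdir()` does not return entries in any particular order. Because the timestamps in the file names have a fixed width, sorting the names alphabetically also sorts them chronologically. A minimal sketch for picking the most recent run, assuming the flat `results/results_<timestamp>.json` layout shown above (`latest_results_file` is a hypothetical helper):

```python
from pathlib import Path


def latest_results_file(results_dir):
    """Return the newest results_*.json file in a directory, or None if there is none."""
    directory = Path(results_dir)
    if not directory.is_dir():
        return None
    # Fixed-width timestamps sort chronologically when sorted by name
    candidates = sorted(directory.glob("results_*.json"), key=lambda p: p.name)
    return candidates[-1] if candidates else None


print(latest_results_file("results"))
```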

To keep the printed output readable, the `pretty_env_info` entry is removed before printing. If you want to see this data, comment out the `try ... except` block.

In [54]:
for file in results_files:
    data = read_json_results(file)
    try:
        data.pop("pretty_env_info")
    except KeyError:
        pass
    print(json.dumps(data, indent=2), end="\n------------------\n")

{
  "results": {
    "test_task_local": {
      "alias": "test_task_local",
      "acc,none": 0.6,
      "acc_stderr,none": 0.16329931618554522,
      "acc_norm,none": 0.8,
      "acc_norm_stderr,none": 0.13333333333333333
    }
  },
  "group_subtasks": {
    "test_task_local": []
  },
  "configs": {
    "test_task_local": {
      "task": "test_task_local",
      "tag": "test_task_ai2_arc_local",
      "dataset_path": "parquet",
      "dataset_kwargs": {
        "data_files": {
          "validation": "validation-00000-of-00001.parquet"
        }
      },
      "validation_split": "validation",
      "doc_to_text": "Question: {{question}}\\nAnswer:",
      "doc_to_target": "{{choices.label.index(answerKey)}}",
      "doc_to_choice": "{{choices.text}}",
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "aggregation": "mean",
          "higher_is_better": true,
          "metric": 
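The `results` section is usually the part of interest. The sketch below flattens it into rows and checks one observation about the numbers shown above: the reported standard errors are consistent with the sample standard error of a mean of 0/1 scores over the 10 records selected by `--limit 10`. That is inferred from the values, not a statement of how `lm-eval` computes them internally. The helper names are hypothetical, and treating the part after the comma as a filter name is an assumption.

```python
import math


def metric_rows(results):
    """Flatten results["results"] into (task, metric, filter_name, value) tuples."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            if key == "alias":
                continue
            # Keys look like "acc,none"; the suffix is assumed to be a filter name
            metric, _, filter_name = key.partition(",")
            rows.append((task, metric, filter_name, value))
    return rows


def binary_stderr(accuracy, n):
    """Sample standard error of the mean of n scores that are each 0 or 1."""
    return math.sqrt(accuracy * (1 - accuracy) / (n - 1))


# Values copied from the output above
sample = {
    "results": {
        "test_task_local": {
            "alias": "test_task_local",
            "acc,none": 0.6,
            "acc_stderr,none": 0.16329931618554522,
            "acc_norm,none": 0.8,
            "acc_norm_stderr,none": 0.13333333333333333,
        }
    }
}

for row in metric_rows(sample):
    print(row)
print(binary_stderr(0.6, 10), binary_stderr(0.8, 10))
```

With only 10 records the standard error is large relative to the accuracy itself, so results obtained with `--limit 10` are best treated as a smoke test rather than a benchmark score.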

## Summary and next steps <a name="summary"></a>
You successfully completed this notebook!

You learned how to use `ibm-watsonx-ai` and `lm-evaluation-harness` to run custom local and registered benchmarks.

Check out our [Online Documentation](https://www.ibm.com/cloud/watson-studio/autoai) for more samples, tutorials, documentation, how-tos, and blog posts.

### Authors 
<b>Marta Tomzik</b>, Software Engineer at Watson Machine Learning.

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.