RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain

RAmBLA (Reliability Assessment for Biomedical LLM Assistants) is a framework for evaluating LLMs on a set of tasks designed to test for reliability. Specifically, the tasks can be divided into the following three aspects of reliability:

  1. Robustness to non-semantic variations: LLMs should be robust to prompt variations that do not alter prompt meaning, and they should not display biases during few-shot prompting.
  2. High recall: When operating on documents, LLMs should recall all relevant information, relying on either parametric knowledge or context exclusively, as instructed.
  3. Hallucinations: If they have insufficient knowledge or context information to answer a question, LLMs should refuse to answer.

Further details can be found in our [paper](LINK PLACEHOLDER).


RAmBLA uses Python version 3.10.10. To install it, follow these steps:

  1. Clone the repository:
git clone (URL placeholder)
  2. Create a conda environment and install the package using the Makefile with the following command:
make init
  3. Set environment variables by creating a .env file according to .env_example. This includes the following environment variables:

| Variable | Description |
| --- | --- |
| OPENAI_<var-name> | Set of variables required to access OpenAI API |
| DATASET_STORAGE_PATH | Path where datasets should be stored |
| MLFLOW_PROJECT_NAME | Sets the name of the project to run evaluations under for logging purposes |
| BACKOFF_MAX_TRIES/BACKOFF_INTERVAL | Controls retry parameters when using API-based models |

  4. Download the bioasq dataset under DATASET_STORAGE_PATH. See this for instructions.
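
As a quick sanity check before running evaluations, the variables listed above can be validated from Python. This is a minimal sketch and not part of RAmBLA: the helper name `check_env` and the decision to treat all four variables as required are assumptions made for illustration. The `OPENAI_<var-name>` variables are left out because their exact names are defined in .env_example. The check reads the process environment, so it assumes the .env file has already been loaded.

```python
import os

# Variable names taken from the installation steps above; treating all four
# as required is an assumption made for this sketch.
REQUIRED_VARS = (
    "DATASET_STORAGE_PATH",
    "MLFLOW_PROJECT_NAME",
    "BACKOFF_MAX_TRIES",
    "BACKOFF_INTERVAL",
)


def check_env(env=None):
    """Return the names of required variables that are missing or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = check_env()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```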

Running Evaluations

Individual Tasks

The main entry point for evaluating an LLM on an individual task is in rambla/run/. An example command is:

python rambla/run/ task=mcqabaseline model=openai_chat

NOTE: We provide a few model configs under rambla/conf/model/. For rambla/conf/model/llama2_7b_chat_local.yaml, the params.model_name parameter needs to be updated to point to the path where the model is stored.

All tasks in this repo are configured using Hydra.
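
The `task=mcqabaseline model=openai_chat` arguments in the example command are Hydra-style `key=value` overrides. The sketch below only illustrates how such arguments map to configuration choices. It does not use Hydra itself, and the function name `parse_overrides` is hypothetical.

```python
def parse_overrides(args):
    """Split Hydra-style `key=value` command-line overrides into a dict."""
    overrides = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected key=value, got: {arg!r}")
        overrides[key] = value
    return overrides


if __name__ == "__main__":
    # Mirrors the example command: task=mcqabaseline model=openai_chat
    print(parse_overrides(["task=mcqabaseline", "model=openai_chat"]))
```

In Hydra, a group override such as `model=openai_chat` typically selects a YAML file from the matching config directory, which is consistent with the model configs stored under rambla/conf/model/.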

Full Evaluation Suite

To run the full evaluation suite on a model, use the script bin/. For example:

python bin/ --models=openai_chat,mistral_7b_instruct_hf

This will run the full evaluation suite on ChatGPT and the Mistral 7b model.

NOTE: Running the full evaluation suite can be very slow. We recommend running individual tasks over the full suite.


For detailed information on each task, including how to configure it and example run commands, please refer to the docs.

Supported Tasks




Supported LLMs

Supported datasets

Semantic/Textual similarity component evaluation

This task evaluates how well different components (models) measure semantic similarity. Each component takes two pieces of text as input and outputs a score (binary or continuous) that reflects the similarity between the two input texts. The best-performing component (GPT-4 chat) was then chosen as the default for the evaluation tasks that require a semantic similarity metric.

Supported tasks

We currently support one task, which consists of passing two long-form texts to a component and receiving a metric for how similar the two texts are. The task can be run with different components against different datasets and can capture a range of metrics.

Supported Components

LLM Component

We prompt GPT with the two sentences and ask whether they are semantically equivalent. The component returns Yes or No.
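
As an illustration only, the sketch below shows one way to build such a prompt and map the model's reply to a binary label. The prompt wording and the names `build_prompt` and `parse_yes_no` are assumptions, not RAmBLA's actual implementation, and no API call is made.

```python
def build_prompt(text_a, text_b):
    """Build a yes/no semantic-equivalence prompt (wording is hypothetical)."""
    return (
        "Are the following two sentences semantically equivalent? "
        "Answer Yes or No.\n"
        f"Sentence 1: {text_a}\n"
        f"Sentence 2: {text_b}\n"
        "Answer:"
    )


def parse_yes_no(response):
    """Map a free-text reply to True (Yes), False (No) or None (unparseable)."""
    cleaned = response.strip().lower().rstrip(".!")
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    return None
```

Returning `None` for anything other than a clean Yes or No keeps malformed replies separate from genuine No answers, so they can be counted or retried rather than silently scored.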

Embedding-based Component

We first embed the two sentences using an embeddings model and then compute the inner product between the two embeddings. If the embeddings are normalised, this is their cosine similarity, which lies between -1 and 1.
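
The computation can be sketched in plain Python as follows. The function names are hypothetical and the toy vectors stand in for real embeddings, which would come from an embeddings model. This is a sketch of the general technique, not RAmBLA's own code.

```python
import math


def l2_normalise(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("Cannot normalise a zero vector")
    return [x / norm for x in vector]


def inner_product(vec_a, vec_b):
    """Inner (dot) product of two equal-length vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("Embeddings must have the same dimension")
    return sum(a * b for a, b in zip(vec_a, vec_b))


def similarity(vec_a, vec_b):
    """Inner product of the normalised embeddings (cosine similarity)."""
    return inner_product(l2_normalise(vec_a), l2_normalise(vec_b))


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real ones come from an embeddings model.
    print(similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

For unit-length vectors the inner product equals the cosine similarity, so vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.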

NLI (Natural Language Inference) models (see NLIModel in rambla/models/)

We provide the two texts as input to the NLI model, and the outputs are scores for the following classes: {entailment, neutral, contradiction}.

  • Unidirectional model: “Does sentence A follow from sentence B?”

    • Classification: Argmax of the scores (returns the predicted class)

    • Regression: Softmax over the class scores, taking the entailment probability (returns a score between 0 and 1)

  • Bidirectional model: “Does sentence A follow from sentence B AND does sentence B follow from sentence A?”

    • Classification:

      • Strict - Entailment in both directions is required for the texts to be classified as similar (this was our initial preferred method given results from the SICK dataset - please see below)

      • Relaxed - Either entailment in both directions, or entailment in one direction and neutral in the other, is required for the texts to be classified as similar

    • Regression:

      • Average - Mean of the softmax entailment probabilities from the two directions (returns a score between 0 and 1)
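
The variants listed above can be sketched in plain Python as follows. This is an illustrative reading of the descriptions, not the NLIModel implementation: the function names, the label order in `LABELS`, and the assumption that the model returns one raw score (logit) per class are all hypothetical.

```python
import math

# Assumed class order; a real NLI model defines its own label mapping.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(scores):
    """Convert raw class scores (logits) into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Unidirectional classification: label with the highest score."""
    return LABELS[scores.index(max(scores))]


def entailment_probability(scores):
    """Unidirectional regression: softmax probability of entailment."""
    return softmax(scores)[LABELS.index("entailment")]


def bidirectional_strict(scores_ab, scores_ba):
    """Similar only if both directions are classified as entailment."""
    return classify(scores_ab) == "entailment" and classify(scores_ba) == "entailment"


def bidirectional_relaxed(scores_ab, scores_ba):
    """Similar if at least one direction is entailment and neither is contradiction."""
    labels = {classify(scores_ab), classify(scores_ba)}
    return "entailment" in labels and "contradiction" not in labels


def bidirectional_average(scores_ab, scores_ba):
    """Mean of the entailment probabilities from the two directions."""
    return (entailment_probability(scores_ab) + entailment_probability(scores_ba)) / 2
```

Here `scores_ab` holds the scores for "A follows from B" and `scores_ba` the scores for the reverse direction, so each bidirectional function needs two forward passes of the model.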

Supported datasets


For testing, RAmBLA uses pytest.

All unit-tests are located under tests. To run the full test suite, run:

pytest tests/

Integration tests

NOTE: These need to be run manually!

We have two sets of integration tests:

Integration tests for rambla/run/

Example usage:

  • This will run a minimal version of the mcqabaseline task against openai_chat

    • python integration_tests/ -m openai_chat -t mcqabaseline
  • This will run a minimal version of the mcqabaseline task against all available models

    • python integration_tests/ -t mcqabaseline
  • This will run a minimal version of all available tasks against openai_chat

    • python integration_tests/ -m openai_chat
  • This will run a minimal version of all available tasks against all available models

    • python integration_tests/

Integration tests for rambla/run/

  • This will run a minimal version of all available tasks against all available components.
    • python integration_tests/

More Information

For further details about working with RAmBLA, see the extended documentation located under docs.


We welcome contributions, feedback and suggestions to RAmBLA. If you would like to make a contribution, please follow our guidelines.

Please check for existing GitHub issues related to the change, and create a new issue if one does not exist, so that we can first discuss the proposed change.

Setting up local development environment

  1. Clone and install the repo according to the installation instructions

  2. Create a new branch:

git checkout -b <my-branch-name>

Ideally use the prefix feat/ for feature-based branches, and hotfix/ for bug fixes.

Making Changes

When you make changes to the code please ensure your changes adhere to our code style.

We use the following:

We use pre-commit to ensure all code adheres to these standards. If you install the package according to our installation instructions, the hooks will run automatically on every commit. To run them manually, use:

pre-commit run --all-files


All code submissions should include unit-tests written using the pytest framework and these should be located in the relevant directory under tests.
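
For orientation, a pytest-style unit test is a plain function whose name starts with `test_` and which uses bare `assert` statements. The example below is a minimal sketch: `normalise_answer` is a hypothetical helper invented for illustration and does not exist in RAmBLA.

```python
def normalise_answer(text):
    """Hypothetical helper: strip whitespace and lower-case an LLM answer."""
    return text.strip().lower()


def test_normalise_answer_strips_whitespace():
    assert normalise_answer("  Yes \n") == "yes"


def test_normalise_answer_is_idempotent():
    once = normalise_answer(" No ")
    assert normalise_answer(once) == once
```

In a real submission the helper would live in the package and only the test functions would go in a file under tests, where pytest discovers them by their `test_` prefix.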

Before submitting a change, please ensure all tests pass by following the unit testing and integration testing instructions.


After following the above guidelines, please create a pull request into the master branch. Please ensure your pull request contains:

  • Title
  • Brief description of changes made


Copyright 2023 of GlaxoSmithKline Research & Development Limited. All rights reserved.

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Contact Info

RAmBLA was originally created by the Responsible AI team at

To get in touch please find our contact details:


If you find this code useful in your research, please cite the associated paper:

author={William James Bolton and Rafael Poyiadzi and Edward Morrell and Gabriela van Bergen Gonzalez Bueno and Lea Goetz},
booktitle={ICLR 2024 Workshop on Reliable and Responsible Foundation Models},