![LangTest logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/benchmarks/Langtest_Cli_Eval_Command.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API, and Azure-OpenAI** based LLMs, LangTest has you covered. You can test any Named Entity Recognition (NER), Text Classification, Fill-Mask, or Translation model with the library. It also supports testing LLMs on Question-Answering, Summarization, and Text-Generation tasks using benchmark datasets. The library ships with 60+ out-of-the-box tests. For a complete list of supported test categories, refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

This notebook shows how to benchmark Large Language Models (LLMs) on Question-Answering tasks with the LangTest command-line interface. It walks step by step through running robustness and accuracy tests to evaluate LLM performance.

# Getting started with LangTest CLI

In [None]:
!pip install -q langtest[all]

## Example JSON

In [None]:
{
    "task": "question-answering",
    "model": {
        "model": "google/flan-t5-base",
        "hub": "huggingface"
    },
    "data": [
        {
            "data_source": "MedMCQA"
        },
        {
            "data_source": "PubMedQA"
        },
        {
            "data_source": "MMLU"
        },
        {
            "data_source": "MedQA"
        }
    ],
    "config": {
        "model_parameters": {
            "max_tokens": 64
        },
        "tests": {
            "defaults": {
                "min_pass_rate": 1.0
            },
            "robustness": {
                "add_typo": {
                    "min_pass_rate": 0.70
                }
            },
            "accuracy": {
                "llm_eval": {
                    "min_score": 0.60
                }
            }
        }
    }
}
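
Before handing a configuration like this to the CLI, it can help to list every threshold it sets. The following is a minimal sketch using only the standard library, not part of LangTest: the helper names `collect_thresholds` and `check_thresholds` are hypothetical, the inline JSON is a trimmed copy of the example above, and the check assumes that every threshold is a fraction between 0 and 1.

```python
import json

# Trimmed copy of the example configuration, inlined so the sketch is self-contained.
CONFIG_JSON = """
{
    "task": "question-answering",
    "model": {"model": "google/flan-t5-base", "hub": "huggingface"},
    "data": [{"data_source": "MedMCQA"}, {"data_source": "PubMedQA"}],
    "config": {
        "tests": {
            "defaults": {"min_pass_rate": 1.0},
            "robustness": {"add_typo": {"min_pass_rate": 0.70}},
            "accuracy": {"llm_eval": {"min_score": 0.60}}
        }
    }
}
"""


def collect_thresholds(config):
    """Hypothetical helper: flatten config -> tests into {"dotted.name": value}."""
    thresholds = {}
    for category, entries in config["config"]["tests"].items():
        if category == "defaults":
            # Defaults sit one level higher than per-test settings.
            for key, value in entries.items():
                thresholds[f"defaults.{key}"] = value
            continue
        for test_name, params in entries.items():
            for key, value in params.items():
                thresholds[f"{category}.{test_name}.{key}"] = value
    return thresholds


def check_thresholds(thresholds):
    """Return the names of thresholds outside [0, 1] (an assumed valid range)."""
    return [name for name, value in thresholds.items() if not 0.0 <= value <= 1.0]


config = json.loads(CONFIG_JSON)
thresholds = collect_thresholds(config)
bad = check_thresholds(thresholds)

for name, value in thresholds.items():
    print(f"{name}: {value}")
print("out of range:", bad)
```

A non-empty `out of range` list points at a likely typo in the file, such as `70` written where `0.70` was intended.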

## Example YAML

In [1]:
yaml_content = """
task: question-answering
model:
  model: google/flan-t5-base
  hub: huggingface
data:
- data_source: MedMCQA
- data_source: PubMedQA
- data_source: MMLU
- data_source: MedQA
config:
  model_parameters:
    max_tokens: 64
    device: 0
    task: text2text-generation
  tests:
    defaults:
      min_pass_rate: 0.65
    robustness:
      add_typo:
        min_pass_rate: 0.7
"""

The next cell writes the contents of `yaml_content`, which must be valid YAML, to `config.yml` using `f.write`.

In [2]:
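import yaml

# Parse the string first so that malformed YAML fails here rather than inside `langtest eval`.
yaml.safe_load(yaml_content)

# Write the configuration to the file that `langtest eval` reads via `-c`.
with open('config.yml', 'w') as f:
    f.write(yaml_content)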

## The `langtest eval` command for model benchmarking

The LangTest command-line interface lets you evaluate language models against a set of tests through the `langtest eval` command. Suppose you want to test `google/flan-t5-base`, a language model developed by Google. You pass the details as arguments: `-m google/flan-t5-base` specifies the model to evaluate, `-h huggingface` tells LangTest that the model is hosted on Hugging Face, a popular platform for sharing pre-trained models, and `-c config.yml` points to a configuration file that describes the evaluation, such as which tests to run and the thresholds used to judge performance.

In notebook environments such as Jupyter, the command is prefixed with `!`. The prefix is specific to those environments and runs the rest of the line as a shell command; omit it when working in a regular terminal.

Breakdown of the `langtest eval` command:

* `langtest eval`: The core part of the command; it invokes the evaluation functionality in LangTest.
* `-m <model_identifier>`: The model to evaluate. In the example, `google/flan-t5-base` refers to the `flan-t5-base` model published under the `google` namespace.
* `-h <hub>`: The hub where the model is hosted. In the example, `huggingface` refers to Hugging Face, a popular repository for pre-trained models.
* `-c <config_file>`: The configuration file that controls the evaluation. It typically holds settings such as evaluation metrics and test parameters.
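
Outside a notebook, the same command can be assembled from a Python script. The sketch below uses only the flags described above; `build_eval_command` is a hypothetical helper rather than part of LangTest, and the call that would actually launch the benchmark is left commented out because it requires `langtest` to be installed and on the `PATH`.

```python
import shlex


def build_eval_command(model, hub, config_path):
    """Hypothetical helper: argument list for `langtest eval` with the -m, -h and -c flags."""
    return ["langtest", "eval", "-m", model, "-h", hub, "-c", config_path]


cmd = build_eval_command("google/flan-t5-base", "huggingface", "config.yml")

# Show the command as it would be typed in a terminal (no leading "!").
print(shlex.join(cmd))

# To run the benchmark for real, uncomment the lines below:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and looping over several model identifiers with the same helper is one way to benchmark them back to back.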

In [1]:
!langtest eval -m google/flan-t5-base -h huggingface -c config.yml

2024-04-02 13:13:57.744792: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-02 13:13:57.744869: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-02 13:13:57.752894: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
cannot import name 'LangtestRetrieverEvaluator' from 'langtest.evaluation' (/usr/local/lib/python3.10/dist-packages/langtest/evaluation/__init__.py) please install llama_index using `pip install llama-index`
INFO:langtest.leaderboard:Initializing new langtest leaderboard...
/root/.langtest/
Test Configuration : 
 {
 "model_parameters": {
  "max_tokens": 64,
  "de

In [2]:
!langtest show-leaderboard

2024-04-02 13:29:36.147363: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-02 13:29:36.147430: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-02 13:29:36.155959: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
cannot import name 'LangtestRetrieverEvaluator' from 'langtest.evaluation' (/usr/local/lib/python3.10/dist-packages/langtest/evaluation/__init__.py) please install llama_index using `pip install llama-index`
./.langtest


                                   robustness                                   
INFO:langtest.leaderboard:robustness Leaderboard
|    | model 

To benchmark a different model, replace `google/flan-t5-base` with the desired model identifier in the `!langtest eval` command. Keep `-h huggingface` unless your model is hosted elsewhere.

In [4]:
!langtest eval -m google/flan-t5-large -h huggingface -c config.yml

2024-04-02 13:34:00.338874: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-02 13:34:00.338947: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-02 13:34:00.347016: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
cannot import name 'LangtestRetrieverEvaluator' from 'langtest.evaluation' (/usr/local/lib/python3.10/dist-packages/langtest/evaluation/__init__.py) please install llama_index using `pip install llama-index`
INFO:langtest.leaderboard:Initializing new langtest leaderboard...
/root/.langtest/
INFO:langtest.leaderboard:Testcases already exist at: /root/.langtest/tes

In [9]:
!langtest show-leaderboard

2024-04-02 14:05:07.671633: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-02 14:05:07.671708: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-02 14:05:07.679796: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
cannot import name 'LangtestRetrieverEvaluator' from 'langtest.evaluation' (/usr/local/lib/python3.10/dist-packages/langtest/evaluation/__init__.py) please install llama_index using `pip install llama-index`
./.langtest


                                   robustness                                   
INFO:langtest.leaderboard:robustness Leaderboard
|    | model 