![logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/misc/MultiPrompt_MultiDataset.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model using the library. We also support testing LLMs for Question-Answering, Summarization and text-generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's extractions on the original list of sentences against its extractions on the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
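
The self-comparison idea can be sketched with a toy stand-in model. The names `toy_model` and `robustness_flags` are hypothetical and not part of LangTest; the sketch only illustrates that ground-truth labels never enter the comparison.

```python
def toy_model(text: str) -> str:
    """Hypothetical stand-in for a model: answers 'True' if the text mentions 'tax'."""
    return "True" if "tax" in text else "False"


def robustness_flags(model, originals, perturbed):
    """Compare the model's output on each original input with its output on the
    perturbed version of that input. No annotated labels are involved."""
    return [model(o) == model(p) for o, p in zip(originals, perturbed)]


originals = ["is house tax and property tax are same", "who wrote the book"]
perturbed = [s.upper() for s in originals]  # an 'uppercase'-style perturbation

flags = robustness_flags(toy_model, originals, perturbed)
print(flags)  # [False, True]
```

The first sample fails because the toy model is case-sensitive, which is exactly the kind of brittleness a robustness test is meant to surface.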

# Getting started with LangTest

In [None]:
!pip install "langtest[openai,transformers,evaluate]==2.2.0"

# Harness and Its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way.

In [1]:
# Import Harness from the LangTest library
from langtest import Harness

The `Harness` class provides a framework for conducting NLP testing, and its instances can be configured for different testing scenarios or environments.

Here is a list of the parameters that can be passed to the `Harness` constructor:

<br/>


| Parameter  | Description |  
| - | - |
|**task**     |Task for which the model is to be evaluated (question-answering or summarization)|
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): 	PipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in back-end for loading model from public models hub or from path</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified in the form of a YAML file. |

<br/>
<br/>

# OpenAI Model Testing For Question Answering

In this section, we test OpenAI models on the Question Answering task.

For now, LangTest supports robustness tests for LLM testing.

### Set environment for OpenAI

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

### Multi Dataset Testing

To evaluate the model's performance on multiple datasets, pass a list of dictionaries to the `data` parameter. Each dictionary in the list describes one dataset through its `data_source` and `split` keys, for example:

```python
data=[
    {"data_source": "BoolQ", "split": "test-tiny"},
    {"data_source": "NQ-open", "split": "test-tiny"},
    {"data_source": "MedQA", "split": "test-tiny"},
    {"data_source": "LogiQA", "split": "test-tiny"},
],
```

Here, we specify different data sources and their corresponding splits for testing. This allows for a comprehensive evaluation of the model's performance across diverse datasets. The notebook can then be executed to assess how well the model generalizes to various types of questions and contexts presented in these datasets.
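
When many datasets are involved, the list can be built programmatically. The helper below is a hypothetical convenience (`build_data_config` is not a LangTest function); it only produces dictionaries with the `data_source` and `split` keys described above.

```python
def build_data_config(splits_by_dataset):
    """Turn a {dataset name: split} mapping into the list-of-dicts form
    expected by the `data` parameter."""
    return [
        {"data_source": name, "split": split}
        for name, split in splits_by_dataset.items()
    ]


data = build_data_config({"BoolQ": "dev-tiny", "NQ-open": "test-tiny"})
print(data)
```

The resulting list could then be passed as `data=data` when constructing the harness.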

In [3]:
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data=[
        {"data_source": "BoolQ", "split": "dev-tiny"},
        {"data_source": "NQ-open", "split": "test-tiny"}
    ],
)

Test Configuration : 
 {
 "model_parameters": {
  "max_tokens": 64
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "add_typo": {
    "min_pass_rate": 0.7
   },
   "lowercase": {
    "min_pass_rate": 0.7
   }
  }
 }
}


## Robustness

For the tests in this notebook we use `uppercase`, `dyslexia_word_swap`, `add_slangs`, `add_abbreviation` and `add_speech_to_text_typo`. The full list of available robustness tests for the QA task is:
* `add_context`
* `add_contraction`
* `add_punctuation`
* `add_typo`
* `add_ocr_typo`
* `american_to_british`
* `british_to_american`
* `lowercase`
* `strip_punctuation`
* `titlecase`
* `uppercase`
* `number_to_word`
* `add_abbreviation`
* `add_speech_to_text_typo`
* `add_slangs`
* `dyslexia_word_swap`
* `multiple_perturbations`
* `adjective_synonym_swap`
* `adjective_antonym_swap`
* `strip_all_punctuation`

You can also set prompts and other model parameters in config. Possible parameters are:
* `user_prompt:` Prompt to be given to the model.
* `temperature:` Temperature of the model.
* `max_tokens:` Maximum number of output tokens allowed for model.

To configure prompts for different datasets, you can use the `user_prompt` dictionary. Here's how it works:

- Each key in the dictionary represents a dataset name or task (e.g., "BoolQ", "NQ-open").
- The corresponding value is a string template that defines the user prompt for the dataset.
- The template can include placeholders:
  - `{context}`: This will be replaced with the actual context (passage) relevant to the question from the specific dataset.
  - `{question}`: This will be replaced with the actual question from the dataset.
- The newline character `\n` can be used to separate the context and question in the final prompt.

Here is an example:
```python
harness.configure(
    {
        "model_parameters": {
            "user_prompt": {
                "BoolQ": "Answer the following question with a True or False. {context}\nQuestion {question}",
                "NQ-open": "Answer the following question. Question {question}",
            }
        },
        ....
    })
```
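
To see what a filled-in prompt might look like, the sketch below renders the templates with plain `str.format`. This is an assumption made for illustration; it is not necessarily how LangTest substitutes the placeholders internally, and `render_prompt` is a hypothetical helper.

```python
user_prompt = {
    "BoolQ": "Answer the following question with a True or False. {context}\nQuestion {question}",
    "NQ-open": "Answer the following question. Question {question}",
}


def render_prompt(dataset_name, question, context=""):
    """Pick the template for the dataset and fill in its placeholders."""
    template = user_prompt[dataset_name]
    # str.format ignores unused keyword arguments, so templates without
    # a {context} placeholder (such as NQ-open) still render correctly.
    return template.format(context=context, question=question)


print(render_prompt(
    "BoolQ",
    question="is house tax and property tax are same",
    context="Property tax or 'house tax' is a local tax.",
))
print(render_prompt("NQ-open", question="who played grand moff tarkin in rogue one"))
```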

In [4]:
harness.configure(
    {
        "model_parameters": {
            "user_prompt": {
                "BoolQ": "Answer the following question with a True or False. {context}\nQuestion {question}",
                "NQ-open": "Answer the following question. Question {question}",
            }
        },
        "tests": {
            "defaults": {"min_pass_rate": 0.65},
            "robustness": {
                "uppercase": {"min_pass_rate": 0.66},
                "dyslexia_word_swap": {"min_pass_rate": 0.60},
                "add_abbreviation": {"min_pass_rate": 0.60},
                "add_slangs": {"min_pass_rate": 0.60},
                "add_speech_to_text_typo": {"min_pass_rate": 0.60},
            },
        }
    }
)

{'model_parameters': {'user_prompt': {'BoolQ': 'Answer the following question with a True or False. {context}\nQuestion {question}',
   'NQ-open': 'Answer the following question. Question {question}'}},
 'tests': {'defaults': {'min_pass_rate': 0.65},
  'robustness': {'uppercase': {'min_pass_rate': 0.66},
   'dyslexia_word_swap': {'min_pass_rate': 0.6},
   'add_abbreviation': {'min_pass_rate': 0.6},
   'add_slangs': {'min_pass_rate': 0.6},
   'add_speech_to_text_typo': {'min_pass_rate': 0.6}}}}

➤ You can adjust the level of transformation in the sentence by using the "`prob`" parameter, which controls the proportion of words to be changed during robustness tests.

➤ **NOTE** : "`prob`" defaults to 1.0, which means all words will be transformed.
```python
harness.configure(
    {
        "tests": {
            "defaults": {"min_pass_rate": 0.65},
            "robustness": {
                "uppercase": {"min_pass_rate": 0.66, "prob": 0.50},
                "dyslexia_word_swap": {"min_pass_rate": 0.60, "prob": 0.70},
            },
        }
    }
)
```
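
As a rough illustration of what `prob` controls, the sketch below uppercases each word with probability `prob`. Treating `prob` as a per-word probability is an assumption, and `uppercase_with_prob` is a hypothetical function, not LangTest's implementation.

```python
import random


def uppercase_with_prob(sentence, prob=1.0, seed=0):
    """Uppercase each whitespace-separated word with probability `prob`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    words = sentence.split()
    return " ".join(w.upper() if rng.random() < prob else w for w in words)


print(uppercase_with_prob("is house tax and property tax are same", prob=1.0))
print(uppercase_with_prob("is house tax and property tax are same", prob=0.5))
```

With `prob=1.0` every word is transformed, matching the default described above; lower values leave part of the sentence untouched.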

The configuration passed to `harness.configure()` in the cell above defines five robustness tests and a minimum pass rate for each of them.

In [5]:
# Slice the data: keep only the first 10 samples of each dataset
harness.data = {k: v[:10] for k, v in harness.data.items()}
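
The same slicing pattern on a toy dictionary, assuming (as the cell above implies) that the data is a mapping from dataset name to a list-like collection of samples:

```python
# Toy stand-in for harness.data: dataset name -> list of samples
toy_data = {"BoolQ": list(range(25)), "NQ-open": list(range(3))}

# Keep at most the first 10 samples per dataset; shorter lists are left as-is
sliced = {name: samples[:10] for name, samples in toy_data.items()}

print({name: len(samples) for name, samples in sliced.items()})
# {'BoolQ': 10, 'NQ-open': 3}
```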

### Generating the test cases

In [6]:
harness.generate()

                                     BoolQ                                      


Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]
[W010] - Test 'add_slangs': 2 samples removed out of 10



--------------------------------------------------------------------------------

                                    NQ-open                                     


Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]
[W010] - Test 'dyslexia_word_swap': 3 samples removed out of 10
[W010] - Test 'add_abbreviation': 1 samples removed out of 10
[W010] - Test 'add_slangs': 8 samples removed out of 10
[W010] - Test 'add_speech_to_text_typo': 1 samples removed out of 10



--------------------------------------------------------------------------------





In [7]:
harness.testcases()

| | category | dataset_name | test_type | original_context | original_question | perturbed_context | perturbed_question |
| - | - | - | - | - | - | - | - |
| 0 | robustness | BoolQ | uppercase | All biomass goes through at least some of thes... | does ethanol take more energy make that produces | ALL BIOMASS GOES THROUGH AT LEAST SOME OF THES... | DOES ETHANOL TAKE MORE ENERGY MAKE THAT PRODUCES |
| 1 | robustness | BoolQ | uppercase | Property tax or 'house tax' is a local tax on ... | is house tax and property tax are same | PROPERTY TAX OR 'HOUSE TAX' IS A LOCAL TAX ON ... | IS HOUSE TAX AND PROPERTY TAX ARE SAME |
| 2 | robustness | BoolQ | uppercase | Phantom pain sensations are described as perce... | is pain experienced in a missing body part or ... | PHANTOM PAIN SENSATIONS ARE DESCRIBED AS PERCE... | IS PAIN EXPERIENCED IN A MISSING BODY PART OR ... |
| 3 | robustness | BoolQ | uppercase | Harry Potter and the Escape from Gringotts is ... | is harry potter and the escape from gringotts ... | HARRY POTTER AND THE ESCAPE FROM GRINGOTTS IS ... | IS HARRY POTTER AND THE ESCAPE FROM GRINGOTTS ... |
| 4 | robustness | BoolQ | uppercase | Hydroxyzine preparations require a doctor's pr... | is there a difference between hydroxyzine hcl ... | HYDROXYZINE PREPARATIONS REQUIRE A DOCTOR'S PR... | IS THERE A DIFFERENCE BETWEEN HYDROXYZINE HCL ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | robustness | NQ-open | add_speech_to_text_typo | - | who played grand moff tarkin in rogue one | - | Hoo played grand moff tarkin in rogue one |
| 81 | robustness | NQ-open | add_speech_to_text_typo | - | youngest current member of the house of repres... | - | youngest current member of the Hause of repres... |
| 82 | robustness | NQ-open | add_speech_to_text_typo | - | who wrote the miraculous journey of edward tulane | - | Houx wrote the miraculous journey of edward tu... |
| 83 | robustness | NQ-open | add_speech_to_text_typo | - | when did the night mare before christmas come out | - | when did the night Mehr before christmas come out |


The `harness.generate()` method automatically generates the test cases based on the provided configuration, and `harness.testcases()` returns them as a table for inspection.

### Running the tests

In [8]:
harness.run()

                                     BoolQ                                      


Running testcases... : 100%|██████████| 48/48 [00:52<00:00,  1.10s/it]


--------------------------------------------------------------------------------

                                    NQ-open                                     


Running testcases... : 100%|██████████| 37/37 [00:48<00:00,  1.31s/it]

--------------------------------------------------------------------------------








`harness.run()` is called after `harness.generate()` and is used to run all the tests. It returns a pass/fail flag for each test.

### Generated Results

In [9]:
harness.generated_results()

| | category | dataset_name | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - | - | - | - |
| 0 | robustness | BoolQ | uppercase | All biomass goes through at least some of thes... | does ethanol take more energy make that produces | ALL BIOMASS GOES THROUGH AT LEAST SOME OF THES... | DOES ETHANOL TAKE MORE ENERGY MAKE THAT PRODUCES | \n\n\nTrue | TRUE | True |
| 1 | robustness | BoolQ | uppercase | Property tax or 'house tax' is a local tax on ... | is house tax and property tax are same | PROPERTY TAX OR 'HOUSE TAX' IS A LOCAL TAX ON ... | IS HOUSE TAX AND PROPERTY TAX ARE SAME | \n\nTrue | \n\nTrue | True |
| 2 | robustness | BoolQ | uppercase | Phantom pain sensations are described as perce... | is pain experienced in a missing body part or ... | PHANTOM PAIN SENSATIONS ARE DESCRIBED AS PERCE... | IS PAIN EXPERIENCED IN A MISSING BODY PART OR ... | ?\n\n\nTrue | \nTrue | True |
| 3 | robustness | BoolQ | uppercase | Harry Potter and the Escape from Gringotts is ... | is harry potter and the escape from gringotts ... | HARRY POTTER AND THE ESCAPE FROM GRINGOTTS IS ... | IS HARRY POTTER AND THE ESCAPE FROM GRINGOTTS ... | \n\nTrue | ?\n\nTrue | True |
| 4 | robustness | BoolQ | uppercase | Hydroxyzine preparations require a doctor's pr... | is there a difference between hydroxyzine hcl ... | HYDROXYZINE PREPARATIONS REQUIRE A DOCTOR'S PR... | IS THERE A DIFFERENCE BETWEEN HYDROXYZINE HCL ... | \n\nTrue | OATE\n\nTrue | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | robustness | NQ-open | add_speech_to_text_typo | - | who played grand moff tarkin in rogue one | - | Hoo played grand moff tarkin in rogue one | \n\nPeter Cushing | \n\nPeter Cushing played Grand Moff Tarkin in ... | True |
| 81 | robustness | NQ-open | add_speech_to_text_typo | - | youngest current member of the house of repres... | - | youngest current member of the Hause of repres... | \n\nAs of 2021, the youngest current member of... | \n\nAs of 2021, the youngest current member of... | True |
| 82 | robustness | NQ-open | add_speech_to_text_typo | - | who wrote the miraculous journey of edward tulane | - | Houx wrote the miraculous journey of edward tu... | \n\nThe Miraculous Journey of Edward Tulane wa... | \n\nWho wrote "The Miraculous Journey of Edwar... | False |
| 83 | robustness | NQ-open | add_speech_to_text_typo | - | when did the night mare before christmas come out | - | when did the night Mehr before christmas come out | \n\nThe Nightmare Before Christmas was release... | \n\nThe Nightmare Before Christmas was release... | True |


This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed.
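
A small sketch of how the failing cases might be isolated. To stay dependency-free it works on plain dictionaries, such as those obtained from a DataFrame with pandas' `to_dict("records")`. The four records below are abbreviated copies of rows 0, 4, 80 and 82 from the output above, keeping only the columns needed here.

```python
from collections import Counter

records = [
    {"dataset_name": "BoolQ", "test_type": "uppercase", "pass": True},
    {"dataset_name": "BoolQ", "test_type": "uppercase", "pass": False},
    {"dataset_name": "NQ-open", "test_type": "add_speech_to_text_typo", "pass": True},
    {"dataset_name": "NQ-open", "test_type": "add_speech_to_text_typo", "pass": False},
]

# Keep only the rows whose pass flag is False
failed = [r for r in records if not r["pass"]]

# Count failures per (dataset, test type) to see where fixes are needed
failures_per_test = Counter((r["dataset_name"], r["test_type"]) for r in failed)

for (dataset, test_type), count in failures_per_test.items():
    print(f"{dataset} / {test_type}: {count} failed")
```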

### Final Results

We can call `.report()`, which summarizes the results by giving the pass and fail counts, the pass rate and an overall pass/fail flag for each test.

In [10]:
harness.report()

Benchmarking Results: gpt-3.5-turbo-instruct

| dataset_name | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| BoolQ | robustness | uppercase | 2 | 8 | 80% | 66% | True |
| BoolQ | robustness | dyslexia_word_swap | 2 | 8 | 80% | 60% | True |
| BoolQ | robustness | add_abbreviation | 1 | 9 | 90% | 60% | True |
| BoolQ | robustness | add_slangs | 4 | 4 | 50% | 60% | False |
| BoolQ | robustness | add_speech_to_text_typo | 3 | 7 | 70% | 60% | True |
| NQ-open | robustness | uppercase | 3 | 7 | 70% | 66% | True |
| NQ-open | robustness | dyslexia_word_swap | 2 | 5 | 71% | 60% | True |
| NQ-open | robustness | add_abbreviation | 5 | 4 | 44% | 60% | False |
| NQ-open | robustness | add_slangs | 1 | 1 | 50% | 60% | False |
| NQ-open | robustness | add_speech_to_text_typo | 5 | 4 | 44% | 60% | False |
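
The arithmetic behind each report row can be reproduced as follows. The `summarize` helper is hypothetical, and the use of `>=` for the threshold comparison is an assumption that is consistent with the rows shown in the report.

```python
def summarize(pass_count, fail_count, minimum_pass_rate):
    """Return (pass_rate, passed) for one row of the report."""
    total = pass_count + fail_count
    pass_rate = pass_count / total
    return pass_rate, pass_rate >= minimum_pass_rate


# (dataset, test type, fail_count, pass_count, minimum_pass_rate) from the report
rows = [
    ("BoolQ", "add_slangs", 4, 4, 0.60),
    ("NQ-open", "dyslexia_word_swap", 2, 5, 0.60),
]

for dataset, test_type, fail_count, pass_count, minimum in rows:
    rate, passed = summarize(pass_count, fail_count, minimum)
    print(f"{dataset} / {test_type}: {rate:.0%} (minimum {minimum:.0%}) -> {passed}")
```

Note that the totals per test differ (8 cases for BoolQ `add_slangs`, 7 for NQ-open `dyslexia_word_swap`) because some samples were removed during test case generation, as the `[W010]` warnings indicate.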
