![logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/llm_notebooks/Fewshot_QA_Notebook.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model using the library. It also supports testing LLMs for Question-Answering, Summarization and text-generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's output on the original sentences against its output on the perturbed (noisy) versions of those sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
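
The self-comparison idea can be sketched in a few lines of plain Python. This is a minimal illustration, not LangTest's implementation: `normalize`, `self_consistency_pass_rate`, and `toy_model` are hypothetical names, and treating answers as equal after stripping whitespace and lowercasing is an assumption made only for this sketch.

```python
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences are ignored."""
    return text.strip().lower()


def self_consistency_pass_rate(
    model: Callable[[str], str],
    originals: Sequence[str],
    perturbed: Sequence[str],
) -> float:
    """Fraction of samples whose prediction is unchanged after perturbation."""
    if len(originals) != len(perturbed):
        raise ValueError("originals and perturbed must have the same length")
    if not originals:
        raise ValueError("at least one sample is required")
    passed = sum(
        normalize(model(o)) == normalize(model(p))
        for o, p in zip(originals, perturbed)
    )
    return passed / len(originals)


def toy_model(text: str) -> str:
    """Toy stand-in for a real model: it is deliberately case-sensitive."""
    return "True" if text.startswith("is ") else "False"


originals = ["is the sky blue", "can birds swim"]
perturbed = [s.upper() for s in originals]  # the "noisy" setting

# No gold labels are involved: the model is only compared against itself.
rate = self_consistency_pass_rate(toy_model, originals, perturbed)
print(rate)
```

The toy model changes its answer for the first question once it is uppercased and keeps it for the second, so one of the two samples passes.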

# Getting started with LangTest 

In [None]:
!pip install "langtest[evaluate,openai,transformers]==2.2.0" 

# Harness and Its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with the test results. Harness can be imported from the LangTest library in the following way.

In [1]:
#Import Harness from the LangTest library
from langtest import Harness

This imports the `Harness` class from the `langtest` module. The class provides a framework for conducting NLP testing, and its instances can be configured for different testing scenarios or environments.

The following parameters can be passed to the `Harness` constructor:

<br/>


| Parameter  | Description |  
| - | - | 
|**task**     |Task for which the model is to be evaluated (question-answering or summarization)|
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): PipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in back-end for loading model from public models hub or from path</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified in the form of a YAML file. |

<br/>
<br/>
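
As a quick sanity check before constructing a harness, the `model` and `data` dictionaries can be compared against the keys listed in the table above. The helper below is a hypothetical sketch (`check_spec` is not part of LangTest), and flagging unlisted keys as problems is an assumption; the library may accept keys beyond those documented here.

```python
from typing import AbstractSet, Any, List, Mapping

# Key names come from the parameter table; the helper itself is hypothetical.
MODEL_REQUIRED = {"model", "hub"}
DATA_REQUIRED = {"data_source"}
DATA_OPTIONAL = {"subset", "feature_column", "target_column", "split", "source"}


def check_spec(
    spec: Mapping[str, Any],
    required: AbstractSet[str],
    optional: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return a list of problems found in a parameter dictionary (empty if none)."""
    present = set(spec)
    problems = [f"missing mandatory key: {k}" for k in sorted(set(required) - present)]
    problems += [
        f"unknown key: {k}" for k in sorted(present - set(required) - set(optional))
    ]
    return problems


model_spec = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}
data_spec = {"data_source": "BoolQ", "split": "test-tiny"}
misspelled = {"split": "test-tiny", "splt": "x"}  # no data_source, one typo

print(check_spec(model_spec, MODEL_REQUIRED))
print(check_spec(data_spec, DATA_REQUIRED, DATA_OPTIONAL))
print(check_spec(misspelled, DATA_REQUIRED, DATA_OPTIONAL))
```

The first two dictionaries match the table and produce no problems; the third is reported as missing `data_source` and containing the unlisted key `splt`.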

# OpenAI Model Testing For Question Answering

In this section, we test OpenAI models on the Question Answering task.

This notebook runs robustness tests only; the other test categories available for the QA task are listed further down.

### Set environment for OpenAI

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" #Replace with your OpenAI API Key
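
To avoid hard-coding the key in the notebook, one option is to reuse a key that is already set in the environment and prompt for it only when it is missing. The sketch below uses only the standard library; `ensure_api_key` is a hypothetical helper name, not a LangTest or OpenAI function.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_api_key(
    name: str = "OPENAI_API_KEY",
    ask: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, prompting once if it is not set."""
    key = os.environ.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so the key does not end up in cell output.
        key = ask(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = key
    return key
```

Calling `ensure_api_key()` in place of the assignment above leaves an existing key untouched and asks for one otherwise.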

## BoolQ and NQ-open test-tiny dataset testing

The YAML content defines a `prompt_config` with one entry per dataset, "BoolQ" and "NQ-open", followed by the tests to run. Each entry specifies how the model should respond to queries. For "BoolQ", the instructions require a concise answer of either "true" or "false"; for "NQ-open", they require a short, concise answer. The `prompt_type` is set to "instruct", which is used for completion models; "chat" is used for conversational (chat) models.

Each entry also includes few-shot examples that show the model how to handle specific questions. Each "BoolQ" example contains a "context" that provides background information and a "question" to be answered with either "true" or "false"; the "NQ-open" examples contain only a question and its answer. The two "BoolQ" examples are:
1. The context discusses the renewal of the series "The Good Fight" for a third season, and the question asks whether there is a third series of "The Good Fight"; the example answer is "True".
2. The context mentions the cancellation of "Lost in Space" at the end of season 3 without resolving the story, and the question asks whether the Robinsons ever returned to Earth; the example answer is given as "True", even though the context says the castaways' fate is never resolved.
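
To make the role of the examples concrete, the sketch below assembles a few-shot prompt from the "BoolQ" instructions and examples. It only illustrates the general technique: `build_fewshot_prompt` and the `Context:` / `Question:` / `Answer:` layout are assumptions, not the template LangTest actually sends to the model.

```python
from typing import Any, Dict

boolq_config: Dict[str, Any] = {
    "instructions": (
        "You are an intelligent bot and it is your responsibility to make sure "
        "to give a concise answer. Answer should be `true` or  `false`."
    ),
    "examples": [
        {
            "user": {
                "context": "The Good Fight -- A second 13-episode season premiered on March 4, 2018. On May 2, 2018, the series was renewed for a third season.",
                "question": "is there a third series of the good fight?",
            },
            "ai": {"answer": "True"},
        },
        {
            "user": {
                "context": "Lost in Space -- The fate of the castaways is never resolved, as the series was unexpectedly canceled at the end of season 3.",
                "question": "did the robinsons ever get back to earth",
            },
            "ai": {"answer": "True"},
        },
    ],
}


def build_fewshot_prompt(config: Dict[str, Any], context: str, question: str) -> str:
    """Concatenate instructions, worked examples, and the new query into one prompt."""
    parts = [config["instructions"], ""]
    for example in config["examples"]:
        user, ai = example["user"], example["ai"]
        parts += [
            f"Context: {user['context']}",
            f"Question: {user['question']}",
            f"Answer: {ai['answer']}",
            "",
        ]
    # The final block is left open so the model completes the answer.
    parts += [f"Context: {context}", f"Question: {question}", "Answer:"]
    return "\n".join(parts)


prompt = build_fewshot_prompt(
    boolq_config,
    context="<passage for the new sample>",  # placeholder, not real data
    question="are all nba playoff games best of 7",
)
print(prompt)
```

Because the examples are part of every prompt, a mislabeled example can steer the model toward wrong answers, which is why the example answers deserve a careful check.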

In [3]:
yaml_content = """
prompt_config:
  "BoolQ":
    instructions: "You are an intelligent bot and it is your responsibility to make sure to give a concise answer. Answer should be `true` or  `false`."
    prompt_type: "instruct" # instruct for completion and chat for conversation(chat models)
    examples:
      - user:
          context: "The Good Fight -- A second 13-episode season premiered on March 4, 2018. On May 2, 2018, the series was renewed for a third season."
          question: "is there a third series of the good fight?"
        ai:
          answer: "True"
      - user:
          context: "Lost in Space -- The fate of the castaways is never resolved, as the series was unexpectedly canceled at the end of season 3."
          question: "did the robinsons ever get back to earth"
        ai:
          answer: "True"
  "NQ-open":
    instructions: "You are an intelligent bot and it is your responsibility to make sure to give a short concise answer."
    prompt_type: "instruct" # completion
    examples:
      - user:
          question: "where does the electron come from in beta decay?"
        ai:
          answer: "an atomic nucleus"
      - user:
          question: "who wrote you're a grand ol flag?"
        ai:
          answer: "George M. Cohan"

tests:
  defaults:
    min_pass_rate: 0.8
  robustness:
    uppercase:
      min_pass_rate: 0.8
    add_typo:
      min_pass_rate: 0.8
"""

with open("config.yaml", "w") as f:
    f.write(yaml_content)


### Setup and Configure Harness

In [4]:
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data=[
        {"data_source": "BoolQ", "split": "test-tiny"},
        {"data_source": "NQ-open", "split": "test-tiny"},
    ],
    config="config.yaml",
)

Test Configuration : 
 {
 "prompt_config": {
  "BoolQ": {
   "instructions": "You are an intelligent bot and it is your responsibility to make sure to give a concise answer. Answer should be `true` or  `false`.",
   "prompt_type": "instruct",
   "examples": [
    {
     "user": {
      "context": "The Good Fight -- A second 13-episode season premiered on March 4, 2018. On May 2, 2018, the series was renewed for a third season.",
      "question": "is there a third series of the good fight?"
     },
     "ai": {
      "answer": "True"
     }
    },
    {
     "user": {
      "context": "Lost in Space -- The fate of the castaways is never resolved, as the series was unexpectedly canceled at the end of season 3.",
      "question": "did the robinsons ever get back to earth"
     },
     "ai": {
      "answer": "True"
     }
    }
   ]
  },
  "NQ-open": {
   "instructions": "You are an intelligent bot and it is your responsibility to make sure to give a short concise answer.",
   "prompt_t

We have specified the task as `question-answering`, the hub as `openai`, and the model as `gpt-3.5-turbo-instruct`.

For the datasets, we used `BoolQ` and `NQ-open`, each with the `test-tiny` split, which includes 50 samples. Other available datasets are listed under [Benchmark Datasets](https://langtest.org/docs/pages/docs/data#question-answering).

For tests, we used `uppercase` and `add_typo`. The available robustness tests for the QA task are:
* `add_context`
* `add_contraction`
* `add_punctuation`
* `add_typo`
* `add_ocr_typo`
* `american_to_british`
* `british_to_american`
* `lowercase`
* `strip_punctuation`
* `titlecase`
* `uppercase`
* `number_to_word`
* `add_abbreviation`
* `add_speech_to_text_typo`
* `add_slangs`
* `dyslexia_word_swap`
* `multiple_perturbations`
* `adjective_synonym_swap`
* `adjective_antonym_swap`
* `strip_all_punctuation`

Available Bias tests for QA task are:

* `replace_to_male_pronouns`
* `replace_to_female_pronouns`
* `replace_to_neutral_pronouns`
* `replace_to_high_income_country`
* `replace_to_low_income_country`
* `replace_to_upper_middle_income_country`
* `replace_to_lower_middle_income_country`
* `replace_to_white_firstnames`
* `replace_to_black_firstnames`
* `replace_to_hispanic_firstnames`
* `replace_to_asian_firstnames`
* `replace_to_white_lastnames`
* `replace_to_sikh_names`
* `replace_to_christian_names`
* `replace_to_hindu_names`
* `replace_to_muslim_names`
* `replace_to_inter_racial_lastnames`
* `replace_to_native_american_lastnames`
* `replace_to_asian_lastnames`
* `replace_to_hispanic_lastnames`
* `replace_to_black_lastnames`
* `replace_to_parsi_names`
* `replace_to_jain_names`
* `replace_to_buddhist_names`

Available Representation tests for QA task are:

* `min_gender_representation_count`
* `min_ethnicity_name_representation_count`
* `min_religion_name_representation_count`
* `min_country_economic_representation_count`
* `min_gender_representation_proportion`
* `min_ethnicity_name_representation_proportion`
* `min_religion_name_representation_proportion`
* `min_country_economic_representation_proportion`


Available Accuracy tests for QA task are:

* `min_exact_match_score`
* `min_bleu_score`
* `min_rouge1_score`
* `min_rouge2_score`
* `min_rougeL_score`
* `min_rougeLsum_score`


Available Fairness tests for QA task are:

* `max_gender_rouge1_score`
* `max_gender_rouge2_score`
* `max_gender_rougeL_score`
* `max_gender_rougeLsum_score`
* `min_gender_rouge1_score`
* `min_gender_rouge2_score`
* `min_gender_rougeL_score`
* `min_gender_rougeLsum_score`

You can also set prompts and other model parameters in config. Possible parameters are:
* `user_prompt:` Prompt to be given to the model.
* `temperature:` Temperature of the model.
* `max_tokens:` Maximum number of output tokens allowed for model.
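
As an illustration, a configuration dictionary carrying model parameters alongside the tests might look like the sketch below. Grouping the settings under a `model_parameters` key is an assumption not shown in this notebook, and the values `0.2` and `64` are arbitrary; `user_prompt` is left out because the few-shot `prompt_config` already supplies the instructions.

```python
config = {
    # Assumption: model settings are grouped under a "model_parameters" key.
    "model_parameters": {
        "temperature": 0.2,  # lower values make the output more deterministic
        "max_tokens": 64,    # upper bound on the number of generated tokens
    },
    "tests": {
        "defaults": {"min_pass_rate": 0.8},
        "robustness": {
            "uppercase": {"min_pass_rate": 0.8},
            "add_typo": {"min_pass_rate": 0.8},
        },
    },
}

print(config["model_parameters"])
```

A dictionary of this shape would be passed to `harness.configure(...)` in the same way as the tests-only dictionary in this notebook.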

Here we have configured the harness to perform two robustness tests (`uppercase` and `add_typo`) and defined the minimum pass rate for each test.

➤ You can adjust the level of transformation in the sentence by using the "`prob`" parameter, which controls the proportion of words to be changed during robustness tests.

➤ **NOTE** : "`prob`" defaults to 1.0, which means all words will be transformed.
```
harness.configure(
    {
        'tests': {
            'defaults': {'min_pass_rate': 0.65},
            'robustness': {
                'lowercase': {'min_pass_rate': 0.66, 'prob': 0.50},
                'uppercase': {'min_pass_rate': 0.60, 'prob': 0.70},
            },
        }
    }
)
```
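
To build intuition for `prob`, here is a minimal sketch of an uppercase perturbation that only touches some of the words. It is not LangTest's implementation: `uppercase_with_prob` is a hypothetical function, and transforming each word independently with probability `prob` is an assumption about how the proportion is applied.

```python
import random


def uppercase_with_prob(sentence: str, prob: float = 1.0, seed: int = 0) -> str:
    """Uppercase each whitespace-separated word independently with probability `prob`."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    words = sentence.split()
    # rng.random() is in [0, 1), so prob=1.0 always transforms and prob=0.0 never does.
    return " ".join(w.upper() if rng.random() < prob else w for w in words)


question = "can a bull snake kill a small dog"
print(uppercase_with_prob(question, prob=1.0))  # every word transformed
print(uppercase_with_prob(question, prob=0.5))  # roughly half of the words
print(uppercase_with_prob(question, prob=0.0))  # unchanged
```

Lower `prob` values give milder perturbations, which is useful when a model fails the fully transformed test and you want to see how much noise it can tolerate.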

In [7]:
# Optional: keep only the first 10 samples of each dataset for a quicker run
harness.data = {k: v[:10] for k, v in harness.data.items()}


### Generating the test cases.

In [5]:
harness.generate()

                                     BoolQ                                      


Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 995.80it/s]


--------------------------------------------------------------------------------

                                    NQ-open                                     


Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]
[W010] - Test 'add_typo': 5 samples removed out of 50



--------------------------------------------------------------------------------

`harness.generate()` automatically generates the test cases based on the provided configuration.

In [6]:
harness.testcases()

| | category | dataset_name | test_type | original_context | original_question | perturbed_context | perturbed_question |
| - | - | - | - | - | - | - | - |
| 0 | robustness | BoolQ | uppercase | 20 euro note -- Until now there has been only ... | is the first series 20 euro note still legal t... | 20 EURO NOTE -- UNTIL NOW THERE HAS BEEN ONLY ... | IS THE FIRST SERIES 20 EURO NOTE STILL LEGAL T... |
| 1 | robustness | BoolQ | uppercase | 2018–19 UEFA Champions League -- The final wil... | do the champions league winners get automatic ... | 2018–19 UEFA CHAMPIONS LEAGUE -- THE FINAL WIL... | DO THE CHAMPIONS LEAGUE WINNERS GET AUTOMATIC ... |
| 2 | robustness | BoolQ | uppercase | Bullsnake -- Bullsnakes are very powerful cons... | can a bull snake kill a small dog | BULLSNAKE -- BULLSNAKES ARE VERY POWERFUL CONS... | CAN A BULL SNAKE KILL A SMALL DOG |
| 3 | robustness | BoolQ | uppercase | NBA playoffs -- All rounds are best-of-seven s... | are all nba playoff games best of 7 | NBA PLAYOFFS -- ALL ROUNDS ARE BEST-OF-SEVEN S... | ARE ALL NBA PLAYOFF GAMES BEST OF 7 |
| 4 | robustness | BoolQ | uppercase | Manchester station group -- The Manchester sta... | can i use my train ticket on the tram in manch... | MANCHESTER STATION GROUP -- THE MANCHESTER STA... | CAN I USE MY TRAIN TICKET ON THE TRAM IN MANCH... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | robustness | NQ-open | add_typo | - | who has the most followers on the twitter | - | who has the most followers on tme twitter |
| 191 | robustness | NQ-open | add_typo | - | who said it's not what your country can do for... | - | who said it's not what your country can do for... |
| 192 | robustness | NQ-open | add_typo | - | when does lil wayne new album drop 2018 | - | jhen does lil wayne new album drop 2018 |
| 193 | robustness | NQ-open | add_typo | - | the khajuraho temples are especially well know... | - | the khajuraho temples are rspecially well know... |


`harness.testcases()` displays the generated test cases as a pandas DataFrame.

### Running the tests

In [7]:
harness.run()

                                     BoolQ                                      


Running testcases... : 100%|██████████| 100/100 [01:19<00:00,  1.26it/s]


--------------------------------------------------------------------------------

                                    NQ-open                                     


Running testcases... : 100%|██████████| 95/95 [01:47<00:00,  1.13s/it]

--------------------------------------------------------------------------------

`harness.run()` is called after `harness.generate()` and runs all the generated test cases, recording a pass/fail flag for each one.

In [8]:
harness.generated_results()

| | category | dataset_name | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - | - | - | - |
| 0 | robustness | BoolQ | uppercase | 20 euro note -- Until now there has been only ... | is the first series 20 euro note still legal t... | 20 EURO NOTE -- UNTIL NOW THERE HAS BEEN ONLY ... | IS THE FIRST SERIES 20 EURO NOTE STILL LEGAL T... | \nFalse | \nFalse | True |
| 1 | robustness | BoolQ | uppercase | 2018–19 UEFA Champions League -- The final wil... | do the champions league winners get automatic ... | 2018–19 UEFA CHAMPIONS LEAGUE -- THE FINAL WIL... | DO THE CHAMPIONS LEAGUE WINNERS GET AUTOMATIC ... | \nTrue | \nTrue | True |
| 2 | robustness | BoolQ | uppercase | Bullsnake -- Bullsnakes are very powerful cons... | can a bull snake kill a small dog | BULLSNAKE -- BULLSNAKES ARE VERY POWERFUL CONS... | CAN A BULL SNAKE KILL A SMALL DOG | \nFalse | \nFalse | True |
| 3 | robustness | BoolQ | uppercase | NBA playoffs -- All rounds are best-of-seven s... | are all nba playoff games best of 7 | NBA PLAYOFFS -- ALL ROUNDS ARE BEST-OF-SEVEN S... | ARE ALL NBA PLAYOFF GAMES BEST OF 7 | \nTrue | True | True |
| 4 | robustness | BoolQ | uppercase | Manchester station group -- The Manchester sta... | can i use my train ticket on the tram in manch... | MANCHESTER STATION GROUP -- THE MANCHESTER STA... | CAN I USE MY TRAIN TICKET ON THE TRAM IN MANCH... | \nTrue | \nFalse | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | robustness | NQ-open | add_typo | - | who has the most followers on the twitter | - | who has the most followers on tme twitter | ?\n\nAs of 2021, the person with the most foll... | ?\n\nAs of June 2021, the account with the mos... | True |
| 191 | robustness | NQ-open | add_typo | - | who said it's not what your country can do for... | - | who said it's not what your country can do for... | ?\n\nJohn F. Kennedy | ?\n\n\nJohn F. Kennedy | True |
| 192 | robustness | NQ-open | add_typo | - | when does lil wayne new album drop 2018 | - | jhen does lil wayne new album drop 2018 | \nLil Wayne's album, "Tha Carter V," was relea... | ?\n\nThere is no official release date for Lil... | False |
| 193 | robustness | NQ-open | add_typo | - | the khajuraho temples are especially well know... | - | the khajuraho temples are rspecially well know... | \nerotic sculptures | ?\n\nerotic sculptures. | True |


This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed.
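
For example, the failed cases can be isolated with ordinary pandas filtering. The sketch below uses a small hand-made DataFrame with the same column names as the results table (the rows are illustrative, not real results); with a live harness, the DataFrame returned by `harness.generated_results()` would take its place.

```python
import pandas as pd

# Hand-made stand-in that mirrors the column names of the results table.
results = pd.DataFrame(
    {
        "dataset_name": ["BoolQ", "BoolQ", "NQ-open", "NQ-open"],
        "test_type": ["uppercase", "uppercase", "add_typo", "add_typo"],
        "perturbed_question": [
            "ARE ALL NBA PLAYOFF GAMES BEST OF 7",
            "CAN I USE MY TRAIN TICKET ON THE TRAM",
            "jhen does lil wayne new album drop 2018",
            "who has the most followers on tme twitter",
        ],
        "pass": [True, False, False, True],
    }
)

# Keep only the rows whose answer changed under perturbation.
failed = results[~results["pass"]]

# Count failures per dataset and test type to see where the model is weakest.
failure_counts = failed.groupby(["dataset_name", "test_type"]).size()

print(failed[["dataset_name", "test_type", "perturbed_question"]])
print(failure_counts)
```

Reading the `perturbed_question` values of the failed rows is usually the quickest way to see which kind of perturbation the model struggles with.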

### Final Results

Calling `harness.report()` summarizes the results, giving the pass and fail counts, the pass rate, and an overall pass/fail flag for each test.

In [9]:
harness.report()

**Benchmarking Results: gpt-3.5-turbo-instruct**

| dataset_name | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| BoolQ | robustness | uppercase | 14 | 36 | 72% | 80% | False |
| BoolQ | robustness | add_typo | 8 | 42 | 84% | 80% | True |
| NQ-open | robustness | uppercase | 9 | 41 | 82% | 80% | True |
| NQ-open | robustness | add_typo | 10 | 35 | 78% | 80% | False |
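
The pass rates and flags in the report can be checked by hand: each rate is `pass_count / (pass_count + fail_count)`, and a test passes when its rate reaches the configured `min_pass_rate` of 0.8. The sketch below redoes that arithmetic with the counts from the report; `summarize` is a hypothetical helper, and comparing the unrounded rate with the threshold is an assumption.

```python
from typing import List, Tuple

# (dataset_name, test_type, fail_count, pass_count) copied from the report.
rows: List[Tuple[str, str, int, int]] = [
    ("BoolQ", "uppercase", 14, 36),
    ("BoolQ", "add_typo", 8, 42),
    ("NQ-open", "uppercase", 9, 41),
    ("NQ-open", "add_typo", 10, 35),
]
MIN_PASS_RATE = 0.80  # min_pass_rate from config.yaml


def summarize(fail_count: int, pass_count: int, min_pass_rate: float) -> Tuple[float, bool]:
    """Return the pass rate and whether it meets the minimum pass rate."""
    pass_rate = pass_count / (fail_count + pass_count)
    return pass_rate, pass_rate >= min_pass_rate


summary = [
    (dataset, test, *summarize(fails, passes, MIN_PASS_RATE))
    for dataset, test, fails, passes in rows
]
for dataset, test, rate, passed in summary:
    print(f"{dataset:8} {test:10} {rate:.0%} {passed}")
```

Note that NQ-open `add_typo` has only 45 test cases, because 5 of the 50 samples were removed during generation, so its rate is 35/45, which rounds to 78%.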
