![LangTest logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/benchmarks/Question-Answering.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model using the library. We also support testing LLMs for Question-Answering, Summarization, and text-generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

This notebook provides an overview of benchmarking Large Language Models (LLMs) on Question-Answering tasks, with step-by-step instructions for running robustness and accuracy tests to evaluate LLM performance.

# Getting started with LangTest

In [None]:
!pip install "langtest[evaluate,openai,transformers]"

# Harness and Its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with the test results. Harness can be imported from the LangTest library in the following way.

In [9]:
from langtest import Harness

This imports the Harness class from the langtest module. The class provides a framework for NLP testing, and its instances can be configured for different testing scenarios or environments.

Here is a list of the parameters that can be passed to the Harness constructor:

<br/>


| Parameter  | Description |  
| - | - | 
|**task**     |Task for which the model is to be evaluated (question-answering or summarization)|
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): PipelineModel, path to a saved model, or pretrained pipeline/model from a hub.</li><li>hub (mandatory): Hub (library) used in the back end to load the model from a public model hub or from a path.</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified as a YAML file or, as in this notebook, as a dictionary. |
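
As a rough illustration of the shapes described in the table, the sketch below builds the `model` and `data` arguments as plain dictionaries and checks their mandatory keys. The `check_required_keys` helper is hypothetical and not part of LangTest; the values mirror the ones used later in this notebook.

```python
# Mandatory keys per argument, taken from the parameter table above.
REQUIRED_KEYS = {
    "model": {"model", "hub"},
    "data": {"data_source"},
}


def check_required_keys(name, value):
    """Return the sorted list of mandatory keys missing from `value`.

    Hypothetical helper for illustration only; LangTest does its own validation.
    """
    return sorted(REQUIRED_KEYS[name] - set(value))


model = {"model": "mistralai/Mistral-7B-Instruct-v0.1", "hub": "huggingface-inference-api"}
data = {"data_source": "OpenBookQA", "split": "test-tiny"}  # `split` is optional

missing_model = check_required_keys("model", model)
missing_data = check_required_keys("data", data)
incomplete = check_required_keys("model", {"model": "some-model"})  # no `hub` key
print(missing_model, missing_data, incomplete)
```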

<br/>
<br/>

#### Initial Setup

In [10]:
import pandas as pd
import os

# Replace the placeholder values below with your own API keys.
os.environ['OPENAI_API_KEY'] = 'OPENAI_API_KEY'
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "HUGGINGFACEHUB_API_TOKEN"
pd.set_option('display.max_colwidth', None)
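
The cell above uses placeholder strings for the keys. One possible way to avoid hard-coding real credentials is to read them from the environment and fail early when one is missing. This is only a sketch; `missing_keys` is a hypothetical helper, not a LangTest function.

```python
import os

REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN")


def missing_keys(env=None, names=REQUIRED_ENV_VARS):
    """Return the names that are unset, empty, or still equal to their own name (a placeholder)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name) or env.get(name) == name]


# Example with an explicit mapping instead of the real environment.
# "hf_example" is a made-up value, not a real token.
example_env = {"OPENAI_API_KEY": "OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN": "hf_example"}
print(missing_keys(example_env))
```

Calling `missing_keys()` without arguments checks the real environment, so the notebook can stop before any API call is made with a placeholder key.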

In [11]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
hub = "huggingface-inference-api"
data_source = "OpenBookQA"
split = "test-tiny"

In [12]:
robustness_folder = f"{data_source}-{split}/{'-'.join(model_name.split('/'))}/robustness"
os.makedirs(robustness_folder, exist_ok=True)  # also creates missing parent folders

accuracy_folder = f"{data_source}-{split}/{'-'.join(model_name.split('/'))}/accuracy"
os.makedirs(accuracy_folder, exist_ok=True)
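
The folder names above are derived from the dataset, split, and model name, with the `/` in the model name replaced by `-`. A small sketch of the same idea using `pathlib` follows; the `result_dir` helper is hypothetical.

```python
import tempfile
from pathlib import Path


def result_dir(data_source, split, model_name, category, root="."):
    """Build and create <root>/<data_source>-<split>/<model-name>/<category>."""
    safe_model = model_name.replace("/", "-")
    path = Path(root) / f"{data_source}-{split}" / safe_model / category
    path.mkdir(parents=True, exist_ok=True)  # also creates missing parent folders
    return path


# Demonstrated inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = result_dir(
        "OpenBookQA", "test-tiny", "mistralai/Mistral-7B-Instruct-v0.1", "robustness", root=tmp
    )
    created = folder.is_dir()
    parts = folder.parts[-3:]

print(created, parts)
```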

## Robustness

### Setup and Configure Harness
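
Every robustness test in the configuration below uses the same `min_pass_rate`, so the `tests` section can also be generated rather than typed out. A minimal sketch, assuming the same test names and thresholds as in this notebook (`build_tests_config` is a hypothetical helper):

```python
ROBUSTNESS_TESTS = [
    "uppercase", "lowercase", "titlecase", "add_typo", "dyslexia_word_swap",
    "add_abbreviation", "add_slangs", "add_speech_to_text_typo",
    "add_ocr_typo", "adjective_synonym_swap",
]


def build_tests_config(test_names, min_pass_rate=0.75, default_min_pass_rate=0.65):
    """Return a `tests` dictionary with one entry per robustness test."""
    return {
        "defaults": {"min_pass_rate": default_min_pass_rate},
        "robustness": {name: {"min_pass_rate": min_pass_rate} for name in test_names},
    }


tests_config = build_tests_config(ROBUSTNESS_TESTS)
print(len(tests_config["robustness"]))
```

The resulting dictionary can be placed under the `tests` key of `config`.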

In [14]:
harness = Harness(
    task="question-answering",
    model={"model": model_name, "hub": hub},
    data={"data_source": data_source, "split": split},
    config={
        "model_parameters": {
            "max_tokens": 32,
            "user_prompt": "You are an AI bot specializing in providing accurate and concise answers to questions. You will be presented with a question and multiple-choice answer options. Your task is to choose the correct answer.\nNote: Do not explain your answer.\nQuestion: {question}\nOptions: {options}\n Answer:",
        },
        "evaluation": {"metric": "llm_eval", "model": "gpt-3.5-turbo-instruct", "hub": "openai"},
        "tests": {
            "defaults": {"min_pass_rate": 0.65},
            "robustness": {
                "uppercase": {"min_pass_rate": 0.75},
                "lowercase": {"min_pass_rate": 0.75},
                "titlecase": {"min_pass_rate": 0.75},
                "add_typo": {"min_pass_rate": 0.75},
                "dyslexia_word_swap": {"min_pass_rate": 0.75},
                "add_abbreviation": {"min_pass_rate": 0.75},
                "add_slangs": {"min_pass_rate": 0.75},
                "add_speech_to_text_typo": {"min_pass_rate": 0.75},
                "add_ocr_typo": {"min_pass_rate": 0.75},
                "adjective_synonym_swap": {"min_pass_rate": 0.75},
            },
        },
    },
)

Test Configuration : 
 {
 "model_parameters": {
  "max_tokens": 32,
  "user_prompt": "You are an AI bot specializing in providing accurate and concise answers to questions. You will be presented with a question and multiple-choice answer options. Your task is to choose the correct answer.\nNote: Do not explain your answer.\nQuestion: {question}\nOptions: {options}\n Answer:"
 },
 "evaluation": {
  "metric": "llm_eval",
  "model": "gpt-3.5-turbo-instruct",
  "hub": "openai"
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 0.65
  },
  "robustness": {
   "uppercase": {
    "min_pass_rate": 0.75
   },
   "lowercase": {
    "min_pass_rate": 0.75
   },
   "titlecase": {
    "min_pass_rate": 0.75
   },
   "add_typo": {
    "min_pass_rate": 0.75
   },
   "dyslexia_word_swap": {
    "min_pass_rate": 0.75
   },
   "add_abbreviation": {
    "min_pass_rate": 0.75
   },
   "add_slangs": {
    "min_pass_rate": 0.75
   },
   "add_speech_to_text_typo": {
    "min_pass_rate": 0.75
   },
   "add_ocr_

### Generating the test cases

In [15]:
harness.generate(seed=42)

Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]
[W010] - Test 'lowercase': 4 samples removed out of 50
[W010] - Test 'titlecase': 1 samples removed out of 50
[W010] - Test 'add_typo': 1 samples removed out of 50
[W010] - Test 'dyslexia_word_swap': 16 samples removed out of 50
[W010] - Test 'add_abbreviation': 14 samples removed out of 50
[W010] - Test 'add_slangs': 34 samples removed out of 50
[W010] - Test 'add_speech_to_text_typo': 10 samples removed out of 50
[W010] - Test 'add_ocr_typo': 2 samples removed out of 50
[W010] - Test 'adjective_synonym_swap': 24 samples removed out of 50
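
The `[W010]` messages report how many of the 50 samples were dropped for each test. The sketch below parses such log lines to work out how many test cases remain; the regular expression is an assumption based on the message format shown above.

```python
import re

LOG = """\
[W010] - Test 'lowercase': 4 samples removed out of 50
[W010] - Test 'titlecase': 1 samples removed out of 50
[W010] - Test 'add_typo': 1 samples removed out of 50
[W010] - Test 'dyslexia_word_swap': 16 samples removed out of 50
[W010] - Test 'add_abbreviation': 14 samples removed out of 50
[W010] - Test 'add_slangs': 34 samples removed out of 50
[W010] - Test 'add_speech_to_text_typo': 10 samples removed out of 50
[W010] - Test 'add_ocr_typo': 2 samples removed out of 50
[W010] - Test 'adjective_synonym_swap': 24 samples removed out of 50
"""

PATTERN = re.compile(
    r"Test '(?P<test>\w+)': (?P<removed>\d+) samples removed out of (?P<total>\d+)"
)


def remaining_per_test(log_text):
    """Map each test named in the log to the number of samples it keeps."""
    remaining = {}
    for match in PATTERN.finditer(log_text):
        remaining[match["test"]] = int(match["total"]) - int(match["removed"])
    return remaining


remaining = remaining_per_test(LOG)
# `uppercase` has no warning, so all 50 of its samples are kept.
total = 50 + sum(remaining.values())
print(remaining["add_slangs"], total)  # 16 394
```

The total of 394 matches the two batches (200 and 194) reported when the tests are run.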





In [16]:
harness.testcases()

|  | category | test_type | original_question | perturbed_question | options |
| - | - | - | - | - | - |
| 0 | robustness | uppercase | A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to | A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO | A. make more phone calls\nB. quit eating lunch out\nC. buy less with monopoly money\nD. have lunch with friends |
| 1 | robustness | uppercase | There is most likely going to be fog around: | THERE IS MOST LIKELY GOING TO BE FOG AROUND: | A. a marsh\nB. a tundra\nC. the plains\nD. a desert |
| 2 | robustness | uppercase | Predators eat | PREDATORS EAT | A. lions\nB. humans\nC. bunnies\nD. grass |
| 3 | robustness | uppercase | Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means | OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS | A. roots may be split\nB. roots may begin to die\nC. parts may break the concrete\nD. roots may fall apart |
| 4 | robustness | uppercase | An electric car runs on electricity via | AN ELECTRIC CAR RUNS ON ELECTRICITY VIA | A. gasoline\nB. a power station\nC. electrical conductors\nD. fuel |
| ... | ... | ... | ... | ... | ... |
| 389 | robustness | adjective_synonym_swap | Bill's arm got cold when he put it inside the | Bill's arm got bitter when he put it inside the | A. refrigerator\nB. room\nC. jacket\nD. oven |
| 390 | robustness | adjective_synonym_swap | What is different about birth in humans and chickens? | What is contrasting about birth in humans and chickens? | A. Mother\nB. Fertilization\nC. Father\nD. the hard shell |
| 391 | robustness | adjective_synonym_swap | Which of these situations is an example of pollutants? | Which of the particular situations is an example of pollutants? | A. plastic bags floating in the ocean\nB. mallard ducks floating on a lake\nC. cottonwood seeds floating in the air\nD. cirrus clouds floating in the sky |
| 392 | robustness | adjective_synonym_swap | A balloon is filled with helium for a party. After the party, the balloons are left in the living room, where a fireplace is heating the room. The balloons | A balloon is filled with helium for a party. After the party, the balloons are south in the living room, where a fireplace is heating the room. The balloons | A. expand\nB. melt\nC. shrink\nD. fall |


#### Saving Test Cases

In [17]:
harness.save(f"{robustness_folder}/saved_test_configurations")

### Running the tests

The tests are run with checkpointing enabled, so that progress is saved under the checkpoints directory while the run proceeds.

In [18]:
harness.run(checkpoint=True, batch_size=200, save_checkpoints_dir=f"{robustness_folder}/checkpoints")

Running testcases... : 100%|██████████| 200/200 [02:11<00:00,  1.52it/s]
Running testcases... : 100%|██████████| 194/194 [01:49<00:00,  1.78it/s]
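
With `batch_size=200`, the 394 test cases are processed in two batches, which is why two progress bars appear. A minimal sketch of that split (plain Python, not LangTest's internal batching code):

```python
def batch_sizes(total, batch_size):
    """Return the size of each consecutive batch needed to cover `total` items."""
    return [min(batch_size, total - start) for start in range(0, total, batch_size)]


sizes = batch_sizes(394, 200)
print(sizes)  # [200, 194]
```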




**Note**: If the kernel restarts or an API failure occurs, you can resume execution from the last saved checkpoint, so model responses that were already processed are not lost:
```python
harness = Harness.load_checkpoints(
    save_checkpoints_dir=f"{robustness_folder}/checkpoints",
    task="question-answering",
    model={"model": model_name, "hub": hub},
)
harness.run(checkpoint=True, batch_size=200, save_checkpoints_dir=f"{robustness_folder}/checkpoints")
```

#### Saving Model Responses for Robustness

In [19]:
harness.save(save_dir=f"{robustness_folder}/model-response", include_generated_results=True)

After executing the .run() method, you can save model responses for re-evaluation and analysis.

### Generated Results

In [20]:
generated_results = harness.generated_results()

In [21]:
generated_results

|  | category | test_type | original_question | perturbed_question | options | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - | - |
| 0 | robustness | uppercase | A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to | A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO | A. make more phone calls\nB. quit eating lunch out\nC. buy less with monopoly money\nD. have lunch with friends | B. quit eating lunch out | B. quit eating lunch out | True |
| 1 | robustness | uppercase | There is most likely going to be fog around: | THERE IS MOST LIKELY GOING TO BE FOG AROUND: | A. a marsh\nB. a tundra\nC. the plains\nD. a desert | A. a marsh | A. a marsh | True |
| 2 | robustness | uppercase | Predators eat | PREDATORS EAT | A. lions\nB. humans\nC. bunnies\nD. grass | A. lions | A. lions | True |
| 3 | robustness | uppercase | Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means | OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS | A. roots may be split\nB. roots may begin to die\nC. parts may break the concrete\nD. roots may fall apart | C. parts may break the concrete | C. parts may break the concrete | True |
| 4 | robustness | uppercase | An electric car runs on electricity via | AN ELECTRIC CAR RUNS ON ELECTRICITY VIA | A. gasoline\nB. a power station\nC. electrical conductors\nD. fuel | C. electrical conductors | B. a power station | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 389 | robustness | adjective_synonym_swap | Bill's arm got cold when he put it inside the | Bill's arm got bitter when he put it inside the | A. refrigerator\nB. room\nC. jacket\nD. oven | A. refrigerator | A. refrigerator | True |
| 390 | robustness | adjective_synonym_swap | What is different about birth in humans and chickens? | What is contrasting about birth in humans and chickens? | A. Mother\nB. Fertilization\nC. Father\nD. the hard shell | D. the hard shell | D. the hard shell | True |
| 391 | robustness | adjective_synonym_swap | Which of these situations is an example of pollutants? | Which of the particular situations is an example of pollutants? | A. plastic bags floating in the ocean\nB. mallard ducks floating on a lake\nC. cottonwood seeds floating in the air\nD. cirrus clouds floating in the sky | A. plastic bags floating in the ocean | A. plastic bags floating in the ocean | True |
| 392 | robustness | adjective_synonym_swap | A balloon is filled with helium for a party. After the party, the balloons are left in the living room, where a fireplace is heating the room. The balloons | A balloon is filled with helium for a party. After the party, the balloons are south in the living room, where a fireplace is heating the room. The balloons | A. expand\nB. melt\nC. shrink\nD. fall | A. expand | A. expand | True |


### Final Results

In [22]:
report = harness.report()

In [23]:
report

|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | uppercase | 5 | 45 | 90% | 75% | True |
| 1 | robustness | lowercase | 2 | 44 | 96% | 75% | True |
| 2 | robustness | titlecase | 4 | 45 | 92% | 75% | True |
| 3 | robustness | add_typo | 4 | 45 | 92% | 75% | True |
| 4 | robustness | dyslexia_word_swap | 2 | 32 | 94% | 75% | True |
| 5 | robustness | add_abbreviation | 7 | 29 | 81% | 75% | True |
| 6 | robustness | add_slangs | 3 | 13 | 81% | 75% | True |
| 7 | robustness | add_speech_to_text_typo | 10 | 30 | 75% | 75% | True |
| 8 | robustness | add_ocr_typo | 7 | 41 | 85% | 75% | True |
| 9 | robustness | adjective_synonym_swap | 5 | 21 | 81% | 75% | True |
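
The `pass_rate` column is the share of passing test cases, and a test passes when that rate reaches its `minimum_pass_rate`. The sketch below recomputes two rows of the report; rounding to a whole percentage is an assumption about how the values are displayed, and `summarize` is a hypothetical helper.

```python
def summarize(fail_count, pass_count, minimum_pass_rate):
    """Return (pass rate as a rounded percentage, whether the minimum is met)."""
    rate = pass_count / (fail_count + pass_count)
    return round(rate * 100), rate >= minimum_pass_rate


lowercase = summarize(2, 44, 0.75)
speech_to_text = summarize(10, 30, 0.75)
print(lowercase)       # (96, True)
print(speech_to_text)  # (75, True)
```

The `add_speech_to_text_typo` row sits exactly on the threshold, which shows that a rate equal to the minimum still counts as a pass.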


#### Saving report and generated_results

In [24]:
generated_results.to_csv(f"{robustness_folder}/generated-results.csv", index=False)
report.to_csv(f"{robustness_folder}/report.csv", index=False)

## Accuracy

### Setup and Configure Harness
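
The `user_prompt` in the configuration is a template with `{question}` and `{options}` placeholders. A minimal sketch of how such a template might be filled, assuming plain `str.format` substitution (LangTest's internal rendering may differ):

```python
USER_PROMPT = (
    "You are an AI bot specializing in providing accurate and concise answers to questions. "
    "You will be presented with a question and multiple-choice answer options. "
    "Your task is to choose the correct answer.\n"
    "Note: Do not explain your answer.\n"
    "Question: {question}\nOptions: {options}\n Answer:"
)

# Question and options taken from one of the OpenBookQA samples in this notebook.
prompt = USER_PROMPT.format(
    question="Predators eat",
    options="A. lions\nB. humans\nC. bunnies\nD. grass",
)
print(prompt)
```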

In [25]:
harness = Harness(
    task="question-answering",
    model={"model": model_name, "hub": hub},
    data={"data_source": data_source, "split": split},
    config={
        "model_parameters": {
            "max_tokens": 32,
            "user_prompt": "You are an AI bot specializing in providing accurate and concise answers to questions. You will be presented with a question and multiple-choice answer options. Your task is to choose the correct answer.\nNote: Do not explain your answer.\nQuestion: {question}\nOptions: {options}\n Answer:",
        },
        "tests": {
            "defaults": {"min_pass_rate": 0.65},
            "accuracy": {
                "llm_eval": {"min_score": 0.75, "model": "gpt-3.5-turbo-instruct", "hub": "openai"},
                "min_exact_match_score": {"min_score": 0.75},
                "min_rouge1_score": {"min_score": 0.75},
                "min_rougeL_score": {"min_score": 0.75},
                "min_bleu_score": {"min_score": 0.75},
                "min_rouge2_score": {"min_score": 0.75},
                "min_rougeLsum_score": {"min_score": 0.75},
            },
        },
    },
)

Test Configuration : 
 {
 "model_parameters": {
  "max_tokens": 32,
  "user_prompt": "You are an AI bot specializing in providing accurate and concise answers to questions. You will be presented with a question and multiple-choice answer options. Your task is to choose the correct answer.\nNote: Do not explain your answer.\nQuestion: {question}\nOptions: {options}\n Answer:"
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 0.65
  },
  "accuracy": {
   "llm_eval": {
    "min_score": 0.75,
    "model": "gpt-3.5-turbo-instruct",
    "hub": "openai"
   },
   "min_exact_match_score": {
    "min_score": 0.75
   },
   "min_rouge1_score": {
    "min_score": 0.75
   },
   "min_rougeL_score": {
    "min_score": 0.75
   },
   "min_bleu_score": {
    "min_score": 0.75
   },
   "min_rouge2_score": {
    "min_score": 0.75
   },
   "min_rougeLsum_score": {
    "min_score": 0.75
   }
  }
 }
}


### Generating the test cases

In [26]:
harness.generate(seed=42)

Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]




In [27]:
harness.testcases()

|  | category | test_type |
| - | - | - |
| 0 | accuracy | llm_eval |
| 1 | accuracy | min_exact_match_score |
| 2 | accuracy | min_rouge1_score |
| 3 | accuracy | min_rougeL_score |
| 4 | accuracy | min_bleu_score |
| 5 | accuracy | min_rouge2_score |
| 6 | accuracy | min_rougeLsum_score |


### Running the tests

In [28]:
harness.run()

Running testcases... : 100%|██████████| 7/7 [01:51<00:00, 15.54s/it]



#### Model Responses for Accuracy

In [29]:
model_response = harness.model_response(category="accuracy")

In [30]:
model_response

|  | original_question | original_context | options | expected_results | actual_results |
| - | - | - | - | - | - |
| 0 | A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to | - | A. make more phone calls\nB. quit eating lunch out\nC. buy less with monopoly money\nD. have lunch with friends | [B. quit eating lunch out] | B. quit eating lunch out |
| 1 | There is most likely going to be fog around: | - | A. a marsh\nB. a tundra\nC. the plains\nD. a desert | [A. a marsh] | A. a marsh |
| 2 | Predators eat | - | A. lions\nB. humans\nC. bunnies\nD. grass | [C. bunnies] | A. lions |
| 3 | Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means | - | A. roots may be split\nB. roots may begin to die\nC. parts may break the concrete\nD. roots may fall apart | [C. parts may break the concrete] | C. parts may break the concrete |
| 4 | An electric car runs on electricity via | - | A. gasoline\nB. a power station\nC. electrical conductors\nD. fuel | [C. electrical conductors] | C. electrical conductors |
| 5 | As the rain forest is deforested the atmosphere will increase with | - | A. oxygen\nB. nitrogen\nC. carbon\nD. rain | [C. carbon] | C. carbon |
| 6 | an electric car contains a motor that runs on | - | A. gas\nB. hydrogen\nC. ions\nD. plutonium | [C. ions] | C. ions |
| 7 | The middle of the day usually involves the bright star nearest to the earth to be straight overhead why? | - | A. moons gravity\nB. human planet rotation\nC. global warming\nD. moon rotation | [B. human planet rotation] | B. human planet rotation |
| 8 | The summer solstice in the northern hemisphere is four months before | - | A. May\nB. July\nC. April\nD. October | [D. October] | B. July |
| 9 | The main component in dirt is | - | A. microorganisms\nB. broken stones\nC. pollution\nD. bacteria | [B. broken stones] | A. microorganisms |
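
To make the accuracy metrics more concrete, the sketch below computes a simple exact-match rate over the ten responses shown above. It is a simplified stand-in, not LangTest's implementation, and the score the harness reports is computed over the samples it evaluated, which need not be only these rows.

```python
# (expected, actual) pairs copied from the ten rows displayed above.
PAIRS = [
    ("B. quit eating lunch out", "B. quit eating lunch out"),
    ("A. a marsh", "A. a marsh"),
    ("C. bunnies", "A. lions"),
    ("C. parts may break the concrete", "C. parts may break the concrete"),
    ("C. electrical conductors", "C. electrical conductors"),
    ("C. carbon", "C. carbon"),
    ("C. ions", "C. ions"),
    ("B. human planet rotation", "B. human planet rotation"),
    ("D. October", "B. July"),
    ("B. broken stones", "A. microorganisms"),
]


def exact_match_rate(pairs):
    """Fraction of pairs whose answers are identical after trimming and lowercasing."""
    hits = sum(exp.strip().lower() == act.strip().lower() for exp, act in pairs)
    return hits / len(pairs)


rate = exact_match_rate(PAIRS)
print(rate)  # 0.7
```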


### Generated Results

In [32]:
generated_results = harness.generated_results()
generated_results

|  | category | test_type | expected_result | actual_result | pass |
| - | - | - | - | - | - |
| 0 | accuracy | llm_eval | 0.75 | 0.84 | True |
| 1 | accuracy | min_exact_match_score | 0.75 | 0.8 | True |
| 2 | accuracy | min_rouge1_score | 0.75 | 0.839619 | True |
| 3 | accuracy | min_rougeL_score | 0.75 | 0.841905 | True |
| 4 | accuracy | min_bleu_score | 0.75 | 0.864158 | True |
| 5 | accuracy | min_rouge2_score | 0.75 | 0.81 | True |
| 6 | accuracy | min_rougeLsum_score | 0.75 | 0.83681 | True |
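
Each accuracy test yields a single score that is compared with its `min_score`. A minimal sketch of that comparison using the values from the table above, assuming a score passes when it is greater than or equal to the threshold:

```python
MIN_SCORE = 0.75  # the `min_score` configured for every accuracy test

# Scores copied from the `actual_result` column above.
scores = {
    "llm_eval": 0.84,
    "min_exact_match_score": 0.8,
    "min_rouge1_score": 0.839619,
    "min_rougeL_score": 0.841905,
    "min_bleu_score": 0.864158,
    "min_rouge2_score": 0.81,
    "min_rougeLsum_score": 0.83681,
}

results = {name: score >= MIN_SCORE for name, score in scores.items()}
all_passed = all(results.values())
print(all_passed)  # True
```

Because each metric produces one score, each accuracy test counts as a single test case in the final report.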


### Final Results

In [33]:
report = harness.report()
report

|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | accuracy | llm_eval | 0 | 1 | 100% | 65% | True |
| 1 | accuracy | min_exact_match_score | 0 | 1 | 100% | 65% | True |
| 2 | accuracy | min_rouge1_score | 0 | 1 | 100% | 65% | True |
| 3 | accuracy | min_rougeL_score | 0 | 1 | 100% | 65% | True |
| 4 | accuracy | min_bleu_score | 0 | 1 | 100% | 65% | True |
| 5 | accuracy | min_rouge2_score | 0 | 1 | 100% | 65% | True |
| 6 | accuracy | min_rougeLsum_score | 0 | 1 | 100% | 65% | True |


### Saving report, model_response and generated_results

In [34]:
model_response.to_csv(f"{accuracy_folder}/model-response.csv", index=False)
generated_results.to_csv(f"{accuracy_folder}/generated-results.csv", index=False)
report.to_csv(f"{accuracy_folder}/report.csv", index=False)