![LangTest logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/llm_notebooks/dataset-notebooks/MultiLexSum_dataset.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification, Fill-Mask, or Translation model using the library. We also support testing LLMs for Question-Answering, Summarization and Text-Generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Robustness metrics are calculated by comparing the model's output on the original list of sentences against its output on the noisy (perturbed) list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.

# Getting started with LangTest

In [None]:
!pip install "langtest[openai,transformers,evaluate]"

# Harness and Its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with the test results. Harness can be imported from the LangTest library in the following way.

In [3]:
#Import Harness from the LangTest library
from langtest import Harness

This imports the `Harness` class, which provides a framework for conducting NLP testing. Instances of `Harness` can be configured for different testing scenarios or environments.

Here is a list of the parameters that can be passed to the `Harness` class:

<br/>


| Parameter  | Description |  
| - | - | 
|**task**     |Task for which the model is to be evaluated (question-answering or summarization)|
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): 	PipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in back-end for loading model from public models hub or from path</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified in the form of a YAML file. |

<br/>
<br/>
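
As a rough illustration of the table above, the sketch below assembles the `model` and `data` arguments as plain dictionaries and checks that the mandatory keys are present before they would be handed to `Harness`. The helper `missing_keys` is hypothetical and not part of LangTest; the dictionary values are the ones used later in this notebook.

```python
# Hypothetical helper (not part of LangTest): list the mandatory keys that are absent.
def missing_keys(arguments: dict, mandatory: tuple) -> list:
    return [key for key in mandatory if key not in arguments]


# Values taken from the Harness call used later in this notebook.
model = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}
data = {"data_source": "MultiLexSum", "split": "test-tiny"}

# Mandatory keys as listed in the parameter table above.
print(missing_keys(model, ("model", "hub")))
print(missing_keys(data, ("data_source",)))

# A dictionary without "hub" is flagged.
print(missing_keys({"model": "gpt-3.5-turbo-instruct"}, ("model", "hub")))
```

Both complete dictionaries yield an empty list, while the last call reports `hub` as missing.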

# OpenAI Model Testing For Summarization

In this section, we dive into testing OpenAI models on the summarization task.

This notebook covers robustness, fairness, and accuracy tests for LLMs.

In [3]:
import os

os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

## MultiLexSum
[Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities](https://arxiv.org/abs/2206.10883)

**Dataset Summary**

The Multi-LexSum dataset consists of legal case summaries. The aim is for the model to thoroughly examine the given context and, upon understanding its content, produce a concise summary that captures the essential themes and key details.

**Data Splits**

- `test`: Testing set from the MultiLexSum dataset, containing 868 document and summary examples.
- `test-tiny`: Truncated version of the MultiLexSum test set, containing 50 document and summary examples.

### Setup and Configure Harness

In [5]:
harness = Harness(
                  task="summarization", 
                  model={"model": "gpt-3.5-turbo-instruct","hub":"openai"}, 
                  data={"data_source" :"MultiLexSum",
                        "split":"test-tiny"}
                  )

Test Configuration : 
 {
 "model_parameters": {
  "temperature": 0.2,
  "max_tokens": 64
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "add_typo": {
    "min_pass_rate": 0.7
   },
   "lowercase": {
    "min_pass_rate": 0.7
   }
  }
 }
}


## Robustness

For the tests below we use `uppercase` and `lowercase`. The available robustness tests for the summarization task are:
* `add_context`
* `add_contraction`
* `add_punctuation`
* `add_typo`
* `add_ocr_typo`
* `american_to_british`
* `british_to_american`
* `lowercase`
* `strip_punctuation`
* `titlecase`
* `uppercase`
* `number_to_word`
* `add_abbreviation`
* `add_speech_to_text_typo`
* `add_slangs`
* `dyslexia_word_swap`
* `multiple_perturbations`
* `adjective_synonym_swap`
* `adjective_antonym_swap`
* `strip_all_punctuation`

You can also set prompts and other model parameters in config. Possible parameters are:
* `user_prompt:` Prompt to be given to the model.
* `temperature:` Temperature of the model.
* `max_tokens:` Maximum number of output tokens allowed for model.
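
A minimal sketch of how these parameters might sit in a configuration dictionary. The `model_parameters` key and the `temperature` and `max_tokens` values mirror the default configuration printed when the harness was created; the prompt wording and its `{context}` placeholder are assumptions made for illustration.

```python
import json

config = {
    "model_parameters": {
        # Assumed prompt wording and placeholder; adjust to your task.
        "user_prompt": "Summarize the following legal case in a few sentences.\n{context}",
        "temperature": 0.2,
        "max_tokens": 64,
    },
    "tests": {
        "defaults": {"min_pass_rate": 0.65},
        "robustness": {"uppercase": {"min_pass_rate": 0.66}},
    },
}

# Print the dictionary in a layout similar to the "Test Configuration" output.
print(json.dumps(config, indent=1))
```

A dictionary of this shape would then be passed to `harness.configure(...)`.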

In [6]:
harness.configure(
{
"evaluation":{"threshold": 0.5},

 'tests': {'defaults': {'min_pass_rate': 0.65,
                        "threshold":0.50
                        },
           'robustness': {'uppercase': {'min_pass_rate': 0.66},
                          'lowercase':{'min_pass_rate': 0.60},

                        }
          }
 }
 )

{'evaluation': {'threshold': 0.5},
 'tests': {'defaults': {'min_pass_rate': 0.65, 'threshold': 0.5},
  'robustness': {'uppercase': {'min_pass_rate': 0.66},
   'lowercase': {'min_pass_rate': 0.6}}}}

➤ The default metric for summarization is `rouge`. The other available metric is `bertscore`, which can be selected with `"evaluation":{"metric":"bertscore", "threshold": 0.5}`.

➤ The default threshold value is `0.50`. A test case passes (`pass` is `True`) when its `eval_score` is higher than the threshold.

➤ You can adjust the level of transformation in the sentence by using the "`prob`" parameter, which controls the proportion of words to be changed during robustness tests.

➤ **NOTE** : "`prob`" defaults to 1.0, which means all words will be transformed.
```
harness.configure(
{
 'tests': {
    'defaults': {'min_pass_rate': 0.65},
      'robustness': {
        'uppercase': {'min_pass_rate': 0.66, 'prob': 0.50},
        'lowercase':{'min_pass_rate': 0.60, 'prob': 0.70},
      }
  }
})

```
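
To make the effect of `prob` more concrete, here is a small sketch that uppercases each word independently with probability `prob`. This is only one plausible reading of "proportion of words"; how LangTest samples words internally is not shown in this notebook, and the function name `uppercase_with_prob` and the `seed` argument are hypothetical.

```python
import random


def uppercase_with_prob(sentence: str, prob: float, seed: int = 0) -> str:
    """Uppercase each whitespace-separated word independently with probability `prob`."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same perturbation
    words = sentence.split()
    return " ".join(word.upper() if rng.random() < prob else word for word in words)


text = "On May 1, 2012, a D.C. resident filed a lawsuit"

print(uppercase_with_prob(text, prob=1.0))  # every word is transformed
print(uppercase_with_prob(text, prob=0.5))  # roughly half of the words are transformed
print(uppercase_with_prob(text, prob=0.0))  # nothing is transformed
```

With `prob=1.0` the result equals the fully uppercased sentence, which corresponds to the default behaviour described above.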

Here we have configured the harness to perform robustness tests and defined the minimum pass rate for each test.

In [6]:
harness.data = harness.data[:10]

### Generating the test cases.

In [7]:
harness.generate()

Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]






In [8]:
harness.testcases()

Unnamed: 0,category,test_type,original,test_case
0,robustness,uppercase,"On March 8th, 2014, several citizens of Montgo...","ON MARCH 8TH, 2014, SEVERAL CITIZENS OF MONTGO..."
1,robustness,uppercase,"On August 28, 2013, an indigent detainee in th...","ON AUGUST 28, 2013, AN INDIGENT DETAINEE IN TH..."
2,robustness,uppercase,"On May 1, 2006, an inmate awaiting execution a...","ON MAY 1, 2006, AN INMATE AWAITING EXECUTION A..."
3,robustness,uppercase,"On August 23, 2018, three Maricopa County, Ari...","ON AUGUST 23, 2018, THREE MARICOPA COUNTY, ARI..."
4,robustness,uppercase,"On March 8, 2006, the Pacific News Service fil...","ON MARCH 8, 2006, THE PACIFIC NEWS SERVICE FIL..."
5,robustness,uppercase,"On April 20, 2012, a state prisoner filed this...","ON APRIL 20, 2012, A STATE PRISONER FILED THIS..."
6,robustness,uppercase,"On June 9, 2018, the plaintiff in this case wa...","ON JUNE 9, 2018, THE PLAINTIFF IN THIS CASE WA..."
7,robustness,uppercase,"On May 1, 2012, a D.C. resident whose car was ...","ON MAY 1, 2012, A D.C. RESIDENT WHOSE CAR WAS ..."
8,robustness,uppercase,The city of Doraville relied on its municipal ...,THE CITY OF DORAVILLE RELIED ON ITS MUNICIPAL ...
9,robustness,uppercase,"On May 22, 2012, several national and local ne...","ON MAY 22, 2012, SEVERAL NATIONAL AND LOCAL NE..."


The `harness.generate()` method automatically generates the test cases based on the provided configuration.

### Running the tests

In [9]:
harness.run()

Running testcases... : 100%|██████████| 20/20 [01:27<00:00,  4.37s/it]




`harness.run()` is called after `harness.generate()` and runs all the tests. It returns a pass/fail flag for each test.

### Generated Results

In [12]:
harness.generated_results()

Unnamed: 0,category,test_type,original,test_case,expected_result,actual_result,eval_score,pass
0,robustness,uppercase,"On March 8th, 2014, several citizens of Montgo...","ON MARCH 8TH, 2014, SEVERAL CITIZENS OF MONTGO...","On March 8th, 2014, several citizens of Montg...","\nIn March 2014, several citizens of Montgomer...",0.304762,False
1,robustness,uppercase,"On August 28, 2013, an indigent detainee in th...","ON AUGUST 28, 2013, AN INDIGENT DETAINEE IN TH...","\nIn August 2013, an indigent detainee in the ...","On August 28, 2013, an indigent detainee in t...",0.647619,True
2,robustness,uppercase,"On May 1, 2006, an inmate awaiting execution a...","ON MAY 1, 2006, AN INMATE AWAITING EXECUTION A...","\nIn 2006, two inmates in the Arkansas Departm...","\n\nIn May 2006, an inmate awaiting execution ...",0.594059,True
3,robustness,uppercase,"On August 23, 2018, three Maricopa County, Ari...","ON AUGUST 23, 2018, THREE MARICOPA COUNTY, ARI...","\nOn August 23, 2018, three Maricopa County, A...","\n\nOn August 23, 2018, three Maricopa County,...",0.903226,True
4,robustness,uppercase,"On March 8, 2006, the Pacific News Service fil...","ON MARCH 8, 2006, THE PACIFIC NEWS SERVICE FIL...","On March 8, 2006, Pacific News Service filed ...","\n\nOn March 8, 2006, Pacific News Service fil...",0.54717,True
5,robustness,uppercase,"On April 20, 2012, a state prisoner filed this...","ON APRIL 20, 2012, A STATE PRISONER FILED THIS...","\nIn April 2012, a state prisoner filed a clas...","\n\nIn April 2012, a state prisoner filed a cl...",0.596154,True
6,robustness,uppercase,"On June 9, 2018, the plaintiff in this case wa...","ON JUNE 9, 2018, THE PLAINTIFF IN THIS CASE WA...","\n\nIn June 2018, the plaintiff was arrested i...","\n\nOn June 9, 2018, a plaintiff was arrested ...",0.849057,True
7,robustness,uppercase,"On May 1, 2012, a D.C. resident whose car was ...","ON MAY 1, 2012, A D.C. RESIDENT WHOSE CAR WAS ...","\nIn May 2012, a D.C. resident whose car was s...","\n\nOn May 1, 2012, a D.C. resident filed a la...",0.653846,True
8,robustness,uppercase,The city of Doraville relied on its municipal ...,THE CITY OF DORAVILLE RELIED ON ITS MUNICIPAL ...,"\nIn May 2018, four individuals filed a lawsui...",\nFour individuals filed a lawsuit against the...,0.640777,True
9,robustness,uppercase,"On May 22, 2012, several national and local ne...","ON MAY 22, 2012, SEVERAL NATIONAL AND LOCAL NE...","On May 22, 2012, several news agencies filed ...","\n\nIn May 2012, several news agencies filed a...",0.601942,True


This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed.
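
The `eval_score` and `pass` columns can be read as "similarity between the two summaries, compared against the threshold". The sketch below uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1; it is not the scorer LangTest uses, and the two example sentences are made up rather than taken from the dataset.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style F1: overlap of lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


threshold = 0.5  # same value as configured under "evaluation"

# Made-up stand-ins for expected_result (original input) and actual_result (perturbed input).
expected = "In May 2012 a resident filed a lawsuit against the city"
actual = "On May 1 2012 a resident sued the city"

score = unigram_f1(expected, actual)
print(round(score, 3), score > threshold)
```

Here six tokens overlap between an 11-token and a 9-token summary, giving an F1 of 0.6, which is above the 0.5 threshold and would therefore count as a pass.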

### Final Results

We can call `.report()`, which summarizes the results with pass and fail counts for each test and an overall pass/fail flag.

In [13]:
harness.report()

Unnamed: 0,category,test_type,fail_count,pass_count,pass_rate,minimum_pass_rate,pass
0,robustness,uppercase,1,9,90%,66%,True
1,robustness,lowercase,1,9,90%,60%,True
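
The report columns follow directly from the per-test-case flags. The sketch below recomputes the `uppercase` row from the ten `pass` values shown in the generated results; the function name `summarize` is hypothetical, and treating a pass rate equal to the minimum as passing is an assumption.

```python
def summarize(passes: list, min_pass_rate: float) -> dict:
    """Aggregate per-test-case pass flags into report-style columns."""
    pass_count = sum(passes)
    fail_count = len(passes) - pass_count
    pass_rate = pass_count / len(passes)
    return {
        "fail_count": fail_count,
        "pass_count": pass_count,
        "pass_rate": pass_rate,
        # Assumption: a rate equal to the minimum still counts as a pass.
        "pass": pass_rate >= min_pass_rate,
    }


# One failing and nine passing uppercase test cases, as in the generated results.
uppercase_flags = [False] + [True] * 9
uppercase_row = summarize(uppercase_flags, min_pass_rate=0.66)
print(uppercase_row)
```

This reproduces the 1 fail, 9 pass, 90% row of the report for `uppercase`.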


## Fairness

Available fairness tests for the summarization task are:

* `max_gender_rouge1_score`
* `max_gender_rouge2_score`
* `max_gender_rougeL_score`
* `max_gender_rougeLsum_score`
* `min_gender_rouge1_score`
* `min_gender_rouge2_score`
* `min_gender_rougeL_score`
* `min_gender_rougeLsum_score`

In [15]:
harness = Harness(
                  task="summarization", 
                  model={"model": "gpt-3.5-turbo-instruct","hub":"openai"}, 
                  data={"data_source" :"MultiLexSum",
                        "split":"test-tiny"}
                  )

Test Configuration : 
 {
 "model_parameters": {
  "temperature": 0.2,
  "max_tokens": 64
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "add_typo": {
    "min_pass_rate": 0.7
   },
   "lowercase": {
    "min_pass_rate": 0.7
   }
  }
 }
}


In [16]:
harness.configure(
{
 'tests': {'defaults': {'min_pass_rate': 0.65},
           'fairness': {
               'min_gender_rouge1_score': {'min_score': 0.66},
               'min_gender_rouge2_score':{'min_score': 0.60},
               'min_gender_rougeL_score': {'min_score': 0.66},
               'min_gender_rougeLsum_score': {'min_score': 0.66},
               'max_gender_rouge1_score': {'max_score': 0.66},
               'max_gender_rouge2_score':{'max_score': 0.60},
               'max_gender_rougeL_score': {'max_score': 0.66},
               'max_gender_rougeLsum_score': {'max_score': 0.66},




                        }
          }
 }
 )

{'tests': {'defaults': {'min_pass_rate': 0.65},
  'fairness': {'min_gender_rouge1_score': {'min_score': 0.66},
   'min_gender_rouge2_score': {'min_score': 0.6},
   'min_gender_rougeL_score': {'min_score': 0.66},
   'min_gender_rougeLsum_score': {'min_score': 0.66},
   'max_gender_rouge1_score': {'max_score': 0.66},
   'max_gender_rouge2_score': {'max_score': 0.6},
   'max_gender_rougeL_score': {'max_score': 0.66},
   'max_gender_rougeLsum_score': {'max_score': 0.66}}}}

In [17]:
harness.data = harness.data[:10]

### Generating the Test Cases

In [18]:
harness.generate()

Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 662.29it/s]




In [19]:
harness.testcases()

Unnamed: 0,category,test_type,test_case
0,fairness,min_gender_rouge1_score,male
1,fairness,min_gender_rouge1_score,female
2,fairness,min_gender_rouge1_score,unknown
3,fairness,min_gender_rouge2_score,male
4,fairness,min_gender_rouge2_score,female
5,fairness,min_gender_rouge2_score,unknown
6,fairness,min_gender_rougeL_score,male
7,fairness,min_gender_rougeL_score,female
8,fairness,min_gender_rougeL_score,unknown
9,fairness,min_gender_rougeLsum_score,male


### Running the tests

In [20]:
harness.run()

Running testcases... :   0%|          | 0/24 [00:00<?, ?it/s]

Running testcases... : 100%|██████████| 24/24 [06:01<00:00, 15.04s/it]




### Generated Results

In [22]:
harness.generated_results()

Unnamed: 0,category,test_type,test_case,expected_result,actual_result,pass
0,fairness,min_gender_rouge1_score,male,0.66,0.431206,False
1,fairness,min_gender_rouge1_score,female,0.66,0.322581,False
2,fairness,min_gender_rouge1_score,unknown,0.66,0.389023,False
3,fairness,min_gender_rouge2_score,male,0.6,0.248398,False
4,fairness,min_gender_rouge2_score,female,0.6,0.086957,False
5,fairness,min_gender_rouge2_score,unknown,0.6,0.253425,False
6,fairness,min_gender_rougeL_score,male,0.66,0.355613,False
7,fairness,min_gender_rougeL_score,female,0.66,0.172043,False
8,fairness,min_gender_rougeL_score,unknown,0.66,0.326059,False
9,fairness,min_gender_rougeLsum_score,male,0.66,0.357904,False


### Final Results

In [23]:
harness.report()

Unnamed: 0,category,test_type,fail_count,pass_count,pass_rate,minimum_pass_rate,pass
0,fairness,min_gender_rouge1_score,3,0,0%,65%,False
1,fairness,min_gender_rouge2_score,3,0,0%,65%,False
2,fairness,min_gender_rougeL_score,3,0,0%,65%,False
3,fairness,min_gender_rougeLsum_score,3,0,0%,65%,False
4,fairness,max_gender_rouge1_score,0,3,100%,65%,True
5,fairness,max_gender_rouge2_score,0,3,100%,65%,True
6,fairness,max_gender_rougeL_score,0,3,100%,65%,True
7,fairness,max_gender_rougeLsum_score,0,3,100%,65%,True
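
The contrast between the failing `min_*` tests and the passing `max_*` tests can be sketched as two simple comparisons per gender group. The ROUGE-1 values are copied from the generated results above; the helper names are hypothetical, and it is an assumption that the `max_*` tests see the same per-group scores and pass when the score does not exceed `max_score`.

```python
# Per-group ROUGE-1 scores copied from the generated results above.
rouge1_by_gender = {"male": 0.431206, "female": 0.322581, "unknown": 0.389023}


def min_score_test(scores: dict, min_score: float) -> dict:
    """A group passes when its score reaches at least `min_score`."""
    return {group: score >= min_score for group, score in scores.items()}


def max_score_test(scores: dict, max_score: float) -> dict:
    """A group passes when its score stays at or below `max_score` (assumed semantics)."""
    return {group: score <= max_score for group, score in scores.items()}


min_results = min_score_test(rouge1_by_gender, min_score=0.66)
max_results = max_score_test(rouge1_by_gender, max_score=0.66)
print(min_results)
print(max_results)
```

Under these assumptions all three groups fail the minimum test and pass the maximum test, matching the 0% and 100% pass rates in the report.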


## Accuracy

Available accuracy tests for the summarization task are:

* `min_exact_match_score`
* `min_bleu_score`
* `min_rouge1_score`
* `min_rouge2_score`
* `min_rougeL_score`
* `min_rougeLsum_score`

In [24]:
harness = Harness(
                  task="summarization", 
                  model={"model": "gpt-3.5-turbo-instruct","hub":"openai"}, 
                  data={"data_source" :"MultiLexSum",
                        "split":"test-tiny"}
                  )

Test Configuration : 
 {
 "model_parameters": {
  "temperature": 0.2,
  "max_tokens": 64
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "add_typo": {
    "min_pass_rate": 0.7
   },
   "lowercase": {
    "min_pass_rate": 0.7
   }
  }
 }
}


In [25]:
harness.configure(
{
 'tests': {'defaults': {'min_pass_rate': 0.65},
          'accuracy': {'min_exact_match_score': {'min_score': 0.70},
                        'min_rouge1_score':{'min_score': 0.70},
                        'min_rougeL_score':{'min_score': 0.70},
                        'min_bleu_score':{'min_score': 0.70},
                        'min_rouge2_score':{'min_score': 0.70},
                        'min_rougeLsum_score':{'min_score': 0.70}

                        }
          }
 }
 )

{'tests': {'defaults': {'min_pass_rate': 0.65},
  'accuracy': {'min_exact_match_score': {'min_score': 0.7},
   'min_rouge1_score': {'min_score': 0.7},
   'min_rougeL_score': {'min_score': 0.7},
   'min_bleu_score': {'min_score': 0.7},
   'min_rouge2_score': {'min_score': 0.7},
   'min_rougeLsum_score': {'min_score': 0.7}}}}

### Generating the test cases.

In [26]:
harness.data = harness.data[:5]

In [27]:
harness.generate()

Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]




In [28]:
harness.testcases()

Unnamed: 0,category,test_type
0,accuracy,min_exact_match_score
1,accuracy,min_rouge1_score
2,accuracy,min_rougeL_score
3,accuracy,min_bleu_score
4,accuracy,min_rouge2_score
5,accuracy,min_rougeLsum_score


### Running the tests

In [29]:
harness.run()

Downloading builder script: 100%|██████████| 5.67k/5.67k [00:00<?, ?B/s]
Running testcases... : 100%|██████████| 6/6 [01:58<00:00, 19.72s/it]




### Generated Results

In [30]:
harness.generated_results()

Unnamed: 0,category,test_type,expected_result,actual_result,pass
0,accuracy,min_exact_match_score,0.7,0.0,False
1,accuracy,min_rouge1_score,0.7,0.399834,False
2,accuracy,min_rougeL_score,0.7,0.312736,False
3,accuracy,min_bleu_score,0.7,0.083641,False
4,accuracy,min_rouge2_score,0.7,0.213542,False
5,accuracy,min_rougeLsum_score,0.7,0.311746,False
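
An exact-match score of 0.0 is unsurprising for free-form summaries, since a prediction only counts when it is identical to the reference. The sketch below shows a simple version of that metric; the function `exact_match_score`, the whitespace stripping, and the example strings are assumptions for illustration and not the scorer LangTest calls.

```python
def exact_match_score(references: list, predictions: list) -> float:
    """Fraction of predictions identical to their reference after stripping whitespace."""
    matches = sum(
        ref.strip() == pred.strip() for ref, pred in zip(references, predictions)
    )
    return matches / len(references)


# Made-up reference and generated summaries.
references = [
    "A prisoner filed a class action over jail conditions.",
    "Residents sued the city over excessive fines.",
]
predictions = [
    "In 2012, a state prisoner filed a class action lawsuit.",
    "Four individuals filed a lawsuit against the city.",
]

min_score = 0.70  # value configured for min_exact_match_score
score = exact_match_score(references, predictions)
print(score, score >= min_score)
```

Even summaries that convey the same facts score zero here, which is why overlap-based metrics such as ROUGE are usually more informative for summarization.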


### Final Results

In [31]:
harness.report()

Unnamed: 0,category,test_type,fail_count,pass_count,pass_rate,minimum_pass_rate,pass
0,accuracy,min_exact_match_score,1,0,0%,65%,False
1,accuracy,min_rouge1_score,1,0,0%,65%,False
2,accuracy,min_rougeL_score,1,0,0%,65%,False
3,accuracy,min_bleu_score,1,0,0%,65%,False
4,accuracy,min_rouge2_score,1,0,0%,65%,False
5,accuracy,min_rougeLsum_score,1,0,0%,65%,False
