![LangTest logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/misc/Editing_TestCases_Notebook.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model using the library. We also support testing LLMs for Question-Answering, Summarization and text-generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's extractions on the original list of sentences against its extractions on the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
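
To make the "model against itself" idea concrete, here is a minimal sketch in plain Python. The `toy_ner` function is a hypothetical stand-in for a real NER model, and the exact-match comparison is an assumption made for illustration; it is not LangTest's actual scoring logic.

```python
from typing import Callable, List, Set, Tuple

Entity = Tuple[str, str]  # (lowercased entity text, label)


def toy_ner(sentence: str) -> Set[Entity]:
    """Hypothetical case-sensitive 'model': tags tokens found in a tiny gazetteer."""
    gazetteer = {"Japan": "LOC", "China": "LOC", "Syria": "LOC"}
    return {(tok.lower(), gazetteer[tok]) for tok in sentence.split() if tok in gazetteer}


def pass_rate(sentences: List[str],
              perturb: Callable[[str], str],
              model: Callable[[str], Set[Entity]]) -> float:
    """Share of sentences where the model's output is unchanged by the perturbation."""
    passed = 0
    for sentence in sentences:
        expected = model(sentence)         # prediction on the original sentence
        actual = model(perturb(sentence))  # prediction on the noisy sentence
        if expected == actual:             # no gold labels are involved
            passed += 1
    return passed / len(sentences)


sentences = ["Japan beat Syria", "China lost", "It rained all day"]
rate = pass_rate(sentences, str.lower, toy_ner)
print(f"lowercase pass rate: {rate:.0%}")
```

Only the sentence without entities survives lowercasing here, because the toy model relies on capitalization. A real model can degrade in the same way, which is what the robustness tests are meant to surface.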

# Getting started with LangTest

In [None]:
!pip install langtest[transformers]

# Harness and Its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way.

In [2]:
#Import Harness from the LangTest library
from langtest import Harness


# Editable TestCases in Harness

In this section, we will discuss how to edit test cases in the Harness class. The Harness class has a method called `edit_testcases`, which generates a CSV file so that the test cases can be edited manually. The method takes the following parameter:

**output_path**: Path to the output CSV file.

After editing the CSV file, the user can call the `import_edited_testcases` method to import the test cases from it. The method takes the following parameter:

**input_path**: Path to the input CSV file.



### Setup and Configure Harness

In [3]:
harness = Harness(task="ner",
                  model={"model":'dslim/bert-base-NER',"hub":'huggingface'}
                  )


Test Configuration : 
 {
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "add_typo": {
    "min_pass_rate": 0.7
   },
   "american_to_british": {
    "min_pass_rate": 0.7
   }
  },
  "accuracy": {
   "min_micro_f1_score": {
    "min_score": 0.7
   }
  },
  "bias": {
   "replace_to_female_pronouns": {
    "min_pass_rate": 0.7
   },
   "replace_to_low_income_country": {
    "min_pass_rate": 0.7
   }
  },
  "fairness": {
   "min_gender_f1_score": {
    "min_score": 0.6
   }
  },
  "representation": {
   "min_label_representation_count": {
    "min_count": 50
   }
  }
 }
}


We have specified the task as `ner`, the hub as `huggingface`, and the model as `dslim/bert-base-NER`.



For the tests, we use `lowercase` and `uppercase`. Other available robustness tests are:
* `add_context`
* `add_contraction`
* `add_punctuation`
* `add_typo`
* `add_ocr_typo`
* `american_to_british`
* `british_to_american`
* `lowercase`
* `strip_punctuation`
* `titlecase`
* `uppercase`
* `number_to_word`
* `add_abbreviation`
* `add_speech_to_text_typo`
* `add_slangs`
* `dyslexia_word_swap`
* `multiple_perturbations`
* `adjective_synonym_swap`
* `adjective_antonym_swap`

Bias tests:

* `replace_to_male_pronouns`
* `replace_to_female_pronouns`
* `replace_to_neutral_pronouns`
* `replace_to_high_income_country`
* `replace_to_low_income_country`
* `replace_to_upper_middle_income_country`
* `replace_to_lower_middle_income_country`
* `replace_to_white_firstnames`
* `replace_to_black_firstnames`
* `replace_to_hispanic_firstnames`
* `replace_to_asian_firstnames`
* `replace_to_white_lastnames`
* `replace_to_sikh_names`
* `replace_to_christian_names`
* `replace_to_hindu_names`
* `replace_to_muslim_names`
* `replace_to_inter_racial_lastnames`
* `replace_to_native_american_lastnames`
* `replace_to_asian_lastnames`
* `replace_to_hispanic_lastnames`
* `replace_to_black_lastnames`
* `replace_to_parsi_names`
* `replace_to_jain_names`
* `replace_to_buddhist_names`

In [4]:
harness.configure(
{
 'tests': {'defaults': {'min_pass_rate': 0.65},
           'robustness': {'lowercase': {'min_pass_rate': 0.66},
                          'uppercase': {'min_pass_rate': 0.66},
                        }
          }
 }
 )

{'tests': {'defaults': {'min_pass_rate': 0.65},
  'robustness': {'lowercase': {'min_pass_rate': 0.66},
   'uppercase': {'min_pass_rate': 0.66}}}}

Here we have configured the harness to perform two robustness tests (uppercase and lowercase) and defined the minimum pass rate for each test.
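
The sketch below shows one way to read the threshold for a given test out of such a configuration dictionary. The helper `resolve_min_pass_rate` is hypothetical, and the fallback to `defaults` when a test sets no value of its own is an assumption about how the configuration is meant to be interpreted.

```python
config = {
    "tests": {
        "defaults": {"min_pass_rate": 0.65},
        "robustness": {
            "lowercase": {"min_pass_rate": 0.66},
            "uppercase": {"min_pass_rate": 0.66},
        },
    }
}


def resolve_min_pass_rate(config, category, test_type):
    """Return the test's own min_pass_rate, or the default if none is set (assumed rule)."""
    tests = config["tests"]
    default = tests.get("defaults", {}).get("min_pass_rate")
    test_settings = tests.get(category, {}).get(test_type, {})
    return test_settings.get("min_pass_rate", default)


print(resolve_min_pass_rate(config, "robustness", "lowercase"))  # per-test value
print(resolve_min_pass_rate(config, "robustness", "titlecase"))  # falls back to the default
```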

### Generating the test cases

In [5]:
harness.generate()

Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 4583.94it/s]




The `harness.generate()` method automatically generates the test cases based on the provided configuration.

In [6]:
harness.testcases()

| | category | test_type | original | test_case |
|---|---|---|---|---|
| 0 | robustness | lowercase | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | soccer - japan get lucky win , china in surpri... |
| 1 | robustness | lowercase | Nadim Ladki | nadim ladki |
| 2 | robustness | lowercase | AL-AIN , United Arab Emirates 1996-12-06 | al-ain , united arab emirates 1996-12-06 |
| 3 | robustness | lowercase | Japan began the defence of their Asian Cup tit... | japan began the defence of their asian cup tit... |
| 4 | robustness | lowercase | But China saw their luck desert them in the se... | but china saw their luck desert them in the se... |
| ... | ... | ... | ... | ... |
| 447 | robustness | uppercase | Portuguesa 1 Atletico Mineiro 0 | PORTUGUESA 1 ATLETICO MINEIRO 0 |
| 448 | robustness | uppercase | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . |
| 449 | robustness | uppercase | Robert Galvin | ROBERT GALVIN |
| 450 | robustness | uppercase | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 |


The `harness.testcases()` method displays the generated test cases as a pandas DataFrame.
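
As a rough illustration of where such rows come from, the sketch below applies two perturbations to a list of sentences and collects one row per sentence and test. The `build_testcases` helper and the `PERTURBATIONS` mapping are hypothetical; the column names are taken from the table above, and the library's own generation logic may differ.

```python
# Map each test name to a plain string transformation (illustrative only).
PERTURBATIONS = {"lowercase": str.lower, "uppercase": str.upper}


def build_testcases(sentences, test_types, category="robustness"):
    """Return one row per (test_type, sentence) pair with the perturbed text."""
    rows = []
    for test_type in test_types:
        perturb = PERTURBATIONS[test_type]
        for sentence in sentences:
            rows.append({
                "category": category,
                "test_type": test_type,
                "original": sentence,
                "test_case": perturb(sentence),
            })
    return rows


rows = build_testcases(["Nadim Ladki", "MELBOURNE 1996-12-06"], ["lowercase", "uppercase"])
for row in rows:
    print(row["test_type"], "|", row["original"], "->", row["test_case"])
```

Note that an already uppercase sentence such as `MELBOURNE 1996-12-06` is unchanged by the `uppercase` test, so that test case cannot reveal anything about the model. Rows like this are natural candidates for manual editing.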

### Edit the test cases

The **harness.edit_testcases()** method generates a CSV file so that the test cases can be edited manually. The method takes the following parameter:

**output_path**: Path to the output csv file.

In [7]:
harness.edit_testcases('edit_testcases.csv')

Manually inspect your test cases and edit them according to your needs.
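
Edits can also be scripted instead of made by hand. The sketch below uses Python's standard `csv` module to drop rows where the perturbation left the sentence unchanged. It assumes the exported file has `original` and `test_case` columns, as in the data frame shown earlier; the real file layout may differ, so check the header first. The `drop_noop_rows` helper and the sample rows are for illustration only, and the demo runs on a temporary stand-in file rather than a real export.

```python
import csv
import os
import tempfile


def drop_noop_rows(path):
    """Remove rows whose perturbed text is identical to the original text."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames  # keep whatever columns the file has
        rows = list(reader)

    kept = [row for row in rows if row["original"] != row["test_case"]]

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(rows), len(kept)


# Stand-in data with the assumed column layout.
sample = [
    {"category": "robustness", "test_type": "lowercase",
     "original": "Nadim Ladki", "test_case": "nadim ladki"},
    {"category": "robustness", "test_type": "uppercase",
     "original": "Robert Galvin", "test_case": "ROBERT GALVIN"},
    {"category": "robustness", "test_type": "uppercase",
     "original": "MELBOURNE 1996-12-06", "test_case": "MELBOURNE 1996-12-06"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edit_testcases.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(sample[0]))
        writer.writeheader()
        writer.writerows(sample)
    before, after = drop_noop_rows(path)

print(f"rows before: {before}, rows after: {after}")
```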

### Import the test cases

Once you have made the desired changes, the **harness.import_edited_testcases()** method can be used to import the edited test cases from the CSV file.

In [8]:
harness.import_edited_testcases('edit_testcases.csv')



### Running the tests

In [9]:
harness.run()

Running testcases... : 100%|██████████| 452/452 [02:36<00:00,  2.89it/s]




`harness.run()` is called after `harness.generate()` and is used to run all the tests. It returns a pass/fail flag for each test.

In [10]:
harness.generated_results()

| | category | test_type | original | test_case | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | lowercase | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | soccer - japan get lucky win , china in surpri... | JAPAN: MISC, LUCKY: PER, CHINA: ORG | | False |
| 1 | robustness | lowercase | Nadim Ladki | nadim ladki | Nadim Ladki: PER | | False |
| 2 | robustness | lowercase | AL-AIN , United Arab Emirates 1996-12-06 | al-ain , united arab emirates 1996-12-06 | AL-AIN: LOC, United Arab Emirates: LOC | al-ain: LOC | False |
| 3 | robustness | lowercase | Japan began the defence of their Asian Cup tit... | japan began the defence of their asian cup tit... | Japan: LOC, Asian Cup: MISC, Syria: LOC, Group... | japan: ORG, syria: ORG | False |
| 4 | robustness | lowercase | But China saw their luck desert them in the se... | but china saw their luck desert them in the se... | China: LOC, Uzbekistan: LOC | uzbekistan: LOC | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 447 | robustness | uppercase | Portuguesa 1 Atletico Mineiro 0 | PORTUGUESA 1 ATLETICO MINEIRO 0 | Portuguesa: ORG, Atletico Mineiro: ORG | PORTUGUESA: ORG, ATLETICO MINEIRO: ORG | True |
| 448 | robustness | uppercase | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | LARA: LOC, MISERABLE: PER | LARA: LOC, MISERABLE: PER | True |
| 449 | robustness | uppercase | Robert Galvin | ROBERT GALVIN | Robert Galvin: PER | ROBERT: ORG, GALVIN: PER | False |
| 450 | robustness | uppercase | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 | MELBOURNE: LOC | MELBOURNE: LOC | True |


This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed.

### Final Results

We can call `.report()`, which summarizes the results, giving the pass and fail counts and an overall pass/fail flag for each test.

In [11]:
harness.report()

| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | lowercase | 182 | 44 | 19% | 66% | False |
| 1 | robustness | uppercase | 152 | 74 | 33% | 66% | False |
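
The figures in the report can be checked by hand. The sketch below recomputes the pass rate from the fail and pass counts shown above and compares it with the configured minimum. The `summarize` helper is hypothetical, and using `>=` for the comparison is an assumption about how the threshold is applied.

```python
def summarize(fail_count, pass_count, minimum_pass_rate):
    """Recompute pass rate and the overall flag from raw counts."""
    total = fail_count + pass_count
    rate = pass_count / total
    return {
        "pass_rate": f"{rate:.0%}",
        "minimum_pass_rate": f"{minimum_pass_rate:.0%}",
        "pass": rate >= minimum_pass_rate,  # assumed comparison rule
    }


lowercase = summarize(fail_count=182, pass_count=44, minimum_pass_rate=0.66)
uppercase = summarize(fail_count=152, pass_count=74, minimum_pass_rate=0.66)
print("lowercase:", lowercase)
print("uppercase:", uppercase)
```

Both tests cover 226 test cases (182 + 44 and 152 + 74), so the pass rates are 44/226 ≈ 19% and 74/226 ≈ 33%, both well below the 66% minimum.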
