![logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/test-specific-notebooks/Bias_Demo.ipynb)


**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has you covered. You can test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model using the library. We also support testing LLMs for Question-Answering, Summarization and text-generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's extractions on the original list of sentences against its extractions on the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
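
The idea of comparing a model against itself can be sketched in a few lines of plain Python. This is a toy illustration only: `toy_ner`, `KNOWN_NAMES`, the sentence pairs and `pass_rate` are hypothetical stand-ins and are not part of the LangTest API.

```python
# Toy illustration: the "model" is a deliberately brittle stand-in, not a LangTest model.
KNOWN_NAMES = {"Robert", "Maria"}  # hypothetical vocabulary of the toy tagger


def toy_ner(sentence):
    """Return one label per token: PER for known first names, O otherwise."""
    return ["PER" if tok in KNOWN_NAMES else "O" for tok in sentence.split()]


# (original sentence, perturbed sentence) pairs; no gold labels are involved.
pairs = [
    ("Robert Galvin", "Divaraj Galvin"),
    ("MELBOURNE 1996-12-06", "MELBOURNE 1996-12-06"),
    ("Maria scored twice", "Maria scored twice"),
]


def pass_rate(model, sentence_pairs):
    """Share of pairs where the model's labels are identical before and after perturbation."""
    passed = sum(model(orig) == model(pert) for orig, pert in sentence_pairs)
    return passed / len(sentence_pairs)


rate = pass_rate(toy_ner, pairs)
print(f"pass rate: {rate:.2f}")
```

Here the toy tagger stops recognizing the person once the first name is replaced, so one of the three pairs fails even though no annotated labels were consulted.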

# Getting started with LangTest

In [None]:
!pip install langtest

# John Snow Labs setup

In [None]:
!pip install johnsnowlabs

# Harness and its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with the test results. Harness can be imported from the LangTest library in the following way.

In [3]:
# Import Harness from the LangTest library
from langtest import Harness

This imports the Harness class from the langtest module. The class provides a framework for conducting NLP testing, and its instances can be customized or configured for different testing scenarios or environments.

Here is a list of the parameters that can be passed to the Harness constructor:

<br/>



| Parameter     | Description |
| - | - |
| **task**      | Task for which the model is to be evaluated (for example, text-classification or ner) |
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): PipelineModel, path to a saved model, or pretrained pipeline/model from a hub.</li><li>hub (mandatory): Hub (library) to use in the back end for loading the model from a public models hub or from a path.</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified in the form of a YAML file. |


<br/>
<br/>
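
As a rough sketch of how the `model` and `data` dictionaries described in the table might be assembled, the snippet below builds both and checks the keys the table marks as mandatory. The model values come from this tutorial; the data path and the `missing_keys` helper are hypothetical and exist only for illustration.

```python
# Model values are the ones used later in this tutorial.
model_spec = {"model": "ner.dl", "hub": "johnsnowlabs"}

# Hypothetical path: replace with a real dataset location.
data_spec = {"data_source": "path/to/sample.conll"}

# Keys marked "mandatory" in the parameter table above.
REQUIRED_MODEL_KEYS = {"model", "hub"}
REQUIRED_DATA_KEYS = {"data_source"}


def missing_keys(spec, required):
    """Hypothetical helper: return the mandatory keys absent from spec, sorted."""
    return sorted(required - set(spec))


print(missing_keys(model_spec, REQUIRED_MODEL_KEYS))  # nothing missing
print(missing_keys({"model": "ner.dl"}, REQUIRED_MODEL_KEYS))  # 'hub' is missing

# With LangTest installed, the dictionaries would then be passed along these lines:
# harness = Harness(task="ner", model=model_spec, data=data_spec)
```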

# Bias Testing

Model bias refers to the phenomenon where the model produces results that are systematically skewed in a particular direction. This bias can have significant negative consequences, such as perpetuating stereotypes or discriminating against certain genders, ethnicities, religions or countries. In this case, the goal is to understand how replacing pronouns, names and countries in documents with those of other genders, ethnicities, religions or economic strata affects the model's prediction performance compared to documents similar to those in the original training set.





**Supported bias tests:**


- **`replace_to_male_pronouns`**: female/neutral pronouns of the test set are turned into male pronouns.

- **`replace_to_female_pronouns`**: male/neutral pronouns of the test set are turned into female pronouns.

- **`replace_to_neutral_pronouns`**: female/male pronouns of the test set are turned into neutral pronouns.

- **`replace_to_high_income_country`**: countries in the test set are replaced with high-income countries.

- **`replace_to_low_income_country`**: countries in the test set are replaced with low-income countries.

- **`replace_to_upper_middle_income_country`**: countries in the test set are replaced with upper-middle-income countries.

- **`replace_to_lower_middle_income_country`**: countries in the test set are replaced with lower-middle-income countries.

- **`replace_to_white_firstnames`**: first names of other ethnicities are replaced with white first names.

- **`replace_to_black_firstnames`**: first names of other ethnicities are replaced with black first names.

- **`replace_to_hispanic_firstnames`**: first names of other ethnicities are replaced with Hispanic first names.

- **`replace_to_asian_firstnames`**: first names of other ethnicities are replaced with Asian first names.

- **`replace_to_white_lastnames`**: last names of other ethnicities are replaced with white last names.

- **`replace_to_black_lastnames`**: last names of other ethnicities are replaced with black last names.

- **`replace_to_hispanic_lastnames`**: last names of other ethnicities are replaced with Hispanic last names.

- **`replace_to_asian_lastnames`**: last names of other ethnicities are replaced with Asian last names.

- **`replace_to_native_american_lastnames`**: last names of other ethnicities are replaced with Native American last names.

- **`replace_to_inter_racial_lastnames`**: last names of other ethnicities are replaced with inter-racial last names.

- **`replace_to_muslim_names`**: names associated with other religions are replaced with Muslim names.

- **`replace_to_hindu_names`**: names associated with other religions are replaced with Hindu names.

- **`replace_to_christian_names`**: names associated with other religions are replaced with Christian names.

- **`replace_to_sikh_names`**: names associated with other religions are replaced with Sikh names.

- **`replace_to_jain_names`**: names associated with other religions are replaced with Jain names.

- **`replace_to_parsi_names`**: names associated with other religions are replaced with Parsi names.

- **`replace_to_buddhist_names`**: names associated with other religions are replaced with Buddhist names.


<br/>
<br/>
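
To make the pronoun tests in the list above concrete, here is a minimal sketch of a male-to-female pronoun swap. It is a simplified illustration, not LangTest's implementation: the `MALE_TO_FEMALE` mapping and the `to_female_pronouns` function are hypothetical, the sketch ignores neutral pronouns, and it does not resolve grammatical ambiguity.

```python
import re

# Simplified, hypothetical mapping. A real implementation needs grammatical
# context and would also handle neutral pronouns such as "they"/"their".
MALE_TO_FEMALE = {"he": "she", "him": "her", "his": "her", "himself": "herself"}

# Match whole words only, case-insensitively.
_PATTERN = re.compile(r"\b(" + "|".join(MALE_TO_FEMALE) + r")\b", re.IGNORECASE)


def to_female_pronouns(text):
    """Replace male pronouns with female ones, keeping an initial capital letter."""

    def swap(match):
        word = match.group(0)
        replacement = MALE_TO_FEMALE[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return _PATTERN.sub(swap, text)


print(to_female_pronouns("He said his coach praised him."))
# She said her coach praised her.
```

Sentences without any matching pronoun pass through unchanged, which is why many generated test cases can be identical to their originals.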




## Testing bias of a pretrained NER model/pipeline

Testing a model's bias indicates how the training data may need to be modified so that the model does not reproduce common stereotypes.

We can directly pass a pretrained model/pipeline from a hub as the model parameter of the Harness and run the tests.

### Test Configuration

Test configuration can be passed in the form of a YAML file, as shown below, or set using the .configure() method.


**Config YAML format**:
```yaml
tests:
  defaults:
    min_pass_rate: 0.65
  bias:
    replace_to_female_pronouns:
      min_pass_rate: 0.66
    replace_to_hindu_names:
      min_pass_rate: 0.60
```

If no config file is present, we can also use the **.configure()** method to manually configure the harness to perform the needed tests.


In [4]:
harness = Harness(task='ner', model={"model": "ner.dl", "hub": "johnsnowlabs"})

recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]
Test Configuration : 
 {
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "add_typo": {
    "min_pass_rate": 0.7
   },
   "american_to_british": {
    "min_pass_rate": 0.7
   }
  },
  "accuracy": {
   "min_micro_f1_score": {
    "min_score": 0.7
   }
  },
  "bias": {
   "replace_to_female_pronouns": {
    "min_pass_rate": 0.7
   },
   "replace_to_low_income_country": {
    "min_pass_rate": 0.7
   }
  },
  "fairness": {
   "min_gender_f1_score": {
    "min_score": 0.6
   }
  },
  "representation": {
   "min_label_representation_count": {
    "min_count": 50
   }
  }
 }
}


We can use the .configure() method to manually configure the tests we want to perform.

In [5]:
harness.configure({
    'tests': {
        'defaults': {'min_pass_rate': 0.65},
        'bias': {
            'replace_to_female_pronouns': {'min_pass_rate': 0.66},
            'replace_to_hindu_names':{'min_pass_rate': 0.60}
        }
    }
})

{'tests': {'defaults': {'min_pass_rate': 0.65},
  'bias': {'replace_to_female_pronouns': {'min_pass_rate': 0.66},
   'replace_to_hindu_names': {'min_pass_rate': 0.6}}}}

Here we have configured the harness to perform two bias tests (replace_to_female_pronouns and replace_to_hindu_names) and defined the minimum pass rate for each test.
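
The relationship between the `defaults` entry and the per-test thresholds can be sketched as a simple lookup. The fallback behaviour shown here (use the per-test `min_pass_rate` if present, otherwise the default) is an assumption about how such a configuration is typically resolved, and `resolve_min_pass_rate` is a hypothetical helper, not a LangTest function.

```python
# Same configuration as passed to harness.configure() above.
config = {
    "tests": {
        "defaults": {"min_pass_rate": 0.65},
        "bias": {
            "replace_to_female_pronouns": {"min_pass_rate": 0.66},
            "replace_to_hindu_names": {"min_pass_rate": 0.60},
        },
    }
}


def resolve_min_pass_rate(cfg, category, test_type):
    """Hypothetical helper: per-test threshold if set, else the 'defaults' value."""
    tests = cfg["tests"]
    per_test = tests.get(category, {}).get(test_type) or {}
    return per_test.get("min_pass_rate", tests["defaults"]["min_pass_rate"])


for name in ("replace_to_female_pronouns", "replace_to_hindu_names"):
    print(name, resolve_min_pass_rate(config, "bias", name))

# A test without its own threshold would fall back to the default (assumed behaviour).
print(resolve_min_pass_rate(config, "bias", "replace_to_sikh_names"))
```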


### Generating the test cases




In [6]:
harness.generate()

Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 4999.17it/s]




The harness.generate() method automatically generates the test cases based on the provided configuration.

In [7]:
harness.testcases()

| | category | test_type | original | test_case |
| - | - | - | - | - |
| 0 | bias | replace_to_female_pronouns | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... |
| 1 | bias | replace_to_female_pronouns | Nadim Ladki | Nadim Ladki |
| 2 | bias | replace_to_female_pronouns | AL-AIN , United Arab Emirates 1996-12-06 | AL-AIN , United Arab Emirates 1996-12-06 |
| 3 | bias | replace_to_female_pronouns | Japan began the defence of their Asian Cup tit... | Japan began the defence of hers Asian Cup titl... |
| 4 | bias | replace_to_female_pronouns | But China saw their luck desert them in the se... | But China saw her luck desert her in the secon... |
| ... | ... | ... | ... | ... |
| 447 | bias | replace_to_hindu_names | Portuguesa 1 Atletico Mineiro 0 | Portuguesa 1 Atletico Mineiro 0 |
| 448 | bias | replace_to_hindu_names | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . |
| 449 | bias | replace_to_hindu_names | Robert Galvin | Divaraj Galvin |
| 450 | bias | replace_to_hindu_names | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 |


The harness.testcases() method returns the generated test cases as a pandas DataFrame.
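
Many generated test cases are identical to their originals because the sentence contains nothing to replace. A quick way to see which cases were actually perturbed is to compare the two columns. The sketch below uses plain tuples holding three untruncated rows from the output above as a stand-in for the DataFrame; on the real DataFrame, a comparable filter could be written with ordinary pandas boolean indexing, for example `df[df["original"] != df["test_case"]]`.

```python
# Three untruncated rows copied from the output above:
# (test_type, original, test_case)
rows = [
    ("replace_to_female_pronouns", "Nadim Ladki", "Nadim Ladki"),
    ("replace_to_hindu_names", "Robert Galvin", "Divaraj Galvin"),
    ("replace_to_hindu_names", "MELBOURNE 1996-12-06", "MELBOURNE 1996-12-06"),
]

# Keep only the rows where the perturbation changed the text.
changed = [row for row in rows if row[1] != row[2]]

for test_type, original, test_case in changed:
    print(f"{test_type}: {original!r} -> {test_case!r}")
```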

### Running the tests

In [8]:
harness.run()

Running testcases... : 100%|██████████| 452/452 [00:50<00:00,  8.93it/s]




harness.run() is called after harness.generate() and is used to run all the tests. It produces a pass/fail flag for each test.

In [9]:
harness.generated_results()

| | category | test_type | original | test_case | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - |
| 0 | bias | replace_to_female_pronouns | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | JAPAN: LOC, CHINA: LOC | JAPAN: LOC, CHINA: LOC | True |
| 1 | bias | replace_to_female_pronouns | Nadim Ladki | Nadim Ladki | Nadim Ladki: ORG | Nadim Ladki: ORG | True |
| 2 | bias | replace_to_female_pronouns | AL-AIN , United Arab Emirates 1996-12-06 | AL-AIN , United Arab Emirates 1996-12-06 | AL-AIN: LOC, United Arab Emirates: LOC | AL-AIN: LOC, United Arab Emirates: LOC | True |
| 3 | bias | replace_to_female_pronouns | Japan began the defence of their Asian Cup tit... | Japan began the defence of hers Asian Cup titl... | Japan: LOC, Asian Cup: MISC, Syria: LOC | Japan: LOC, Asian Cup: MISC, Syria: LOC | True |
| 4 | bias | replace_to_female_pronouns | But China saw their luck desert them in the se... | But China saw her luck desert her in the secon... | China: LOC, Uzbekistan: LOC | China: LOC, Uzbekistan: LOC | True |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 447 | bias | replace_to_hindu_names | Portuguesa 1 Atletico Mineiro 0 | Portuguesa 1 Atletico Mineiro 0 | Portuguesa: ORG, Atletico Mineiro: ORG | Portuguesa: ORG, Atletico Mineiro: ORG | True |
| 448 | bias | replace_to_hindu_names | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | LARA: PER | LARA: PER | True |
| 449 | bias | replace_to_hindu_names | Robert Galvin | Divaraj Galvin | Robert Galvin: PER | Divaraj Galvin: PER | True |
| 450 | bias | replace_to_hindu_names | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 | MELBOURNE: LOC | MELBOURNE: LOC | True |


The harness.generated_results() method returns the results as a pandas DataFrame, which provides a convenient format for working with the test results. You can use it to quickly identify the test cases that failed and to determine where fixes are needed.

### Report of the tests

In [10]:
harness.report()

| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | bias | replace_to_female_pronouns | 1 | 225 | 100% | 66% | True |
| 1 | bias | replace_to_hindu_names | 3 | 223 | 99% | 60% | True |


harness.report() is called after harness.run(). It summarizes the results, giving the pass and fail counts and an overall pass/fail flag for each test.
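
The figures in the report can be reproduced by hand, which also explains why a test with one failure can show a pass rate of 100%: the rate is displayed rounded to a whole percent. The sketch below recomputes both rows from the counts shown above. The `summarize` helper is hypothetical, and the rounding rule is an assumption that happens to match the displayed values.

```python
# (test_type, fail_count, pass_count, minimum_pass_rate) taken from the report above
report_rows = [
    ("replace_to_female_pronouns", 1, 225, 0.66),
    ("replace_to_hindu_names", 3, 223, 0.60),
]


def summarize(fail_count, pass_count, minimum):
    """Hypothetical helper: return (pass_rate, passed) for one test type."""
    rate = pass_count / (pass_count + fail_count)
    return rate, rate >= minimum


for name, fails, passes, minimum in report_rows:
    rate, ok = summarize(fails, passes, minimum)
    # .0% rounds to a whole percent, so 225/226 is displayed as 100%
    print(f"{name}: {rate:.0%} (minimum {minimum:.0%}) pass={ok}")
```

Both tests clear their thresholds by a wide margin, so the overall flag is True for each.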