![logog](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/end-to-end-notebooks/JohnSnowLabs_RealWorld_Custom_Pipeline_Notebook.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. It works with **John Snow Labs, Hugging Face, and spaCy** models as well as LLMs served through **OpenAI, Cohere, AI21, the Hugging Face Inference API, and Azure-OpenAI**. You can test Named Entity Recognition (NER), text classification, fill-mask, and translation models, and LLMs can be tested on question-answering, summarization, and text-generation tasks using benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's extractions on the original sentences against its extractions on the noisy (perturbed) versions of those sentences. The original annotated labels are not used at any point; the model is simply compared against itself in two settings.
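
The self-comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LangTest's implementation: `toy_ner` is a hypothetical stand-in for a real model (it only tags tokens found in a tiny, case-sensitive gazetteer), and a test case is counted as passed when the entities predicted on the perturbed sentence match those predicted on the original, ignoring case.

```python
from typing import Callable, List, Set, Tuple

Entity = Tuple[str, str]  # (entity text, label)

# Hypothetical gazetteer-based "model", used only for illustration.
GAZETTEER = {"Japan": "LOC", "China": "LOC", "Syria": "LOC"}


def toy_ner(sentence: str) -> Set[Entity]:
    """Tag tokens found in the gazetteer; matching is case-sensitive on purpose."""
    return {(tok, GAZETTEER[tok]) for tok in sentence.split() if tok in GAZETTEER}


def normalize(entities: Set[Entity]) -> Set[Entity]:
    """Lowercase entity text so a pure casing change does not count as a mismatch."""
    return {(text.lower(), label) for text, label in entities}


def pass_rate(
    model: Callable[[str], Set[Entity]],
    originals: List[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of sentences whose predictions survive the perturbation."""
    passed = 0
    for sentence in originals:
        expected = normalize(model(sentence))          # model on original text
        actual = normalize(model(perturb(sentence)))   # model on noisy text
        passed += expected == actual
    return passed / len(originals)


sentences = ["Japan beat Syria", "China lost", "No entities here"]
rate = pass_rate(toy_ner, sentences, str.lower)
print(f"lowercase pass rate: {rate:.2f}")
```

Note that no gold labels appear anywhere in the sketch: the "expected" side comes from the model itself.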

# Getting started with LangTest on John Snow Labs

In [None]:
!pip install langtest[johnsnowlabs]

# Harness and its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with the test results. Harness can be imported from the LangTest library in the following way.

In [None]:
#Import Harness from the LangTest library
from langtest import Harness

This imports the Harness class from the langtest module. Harness provides a blueprint for conducting NLP testing, and its instances can be configured for different testing scenarios or environments.

Here is a list of the parameters that can be passed to the Harness constructor:

<br/>



| Parameter  | Description |  
| - | - | 
|**task**     |Task for which the model is to be evaluated (text-classification or ner)|
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): 	PipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in back-end for loading model from public models hub or from path</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified as a YAML file or as a dictionary. |
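
As a rough illustration of the table above, the `model` and `data` arguments can be assembled as plain dictionaries before being handed to `Harness`. The `REQUIRED_KEYS` mapping and the `check_required` helper below are hypothetical and exist only for this sketch; LangTest performs its own validation, and the model value is a placeholder string rather than a real pipeline.

```python
# Mandatory keys per argument, taken from the parameter table above.
REQUIRED_KEYS = {"model": {"model", "hub"}, "data": {"data_source"}}


def check_required(name: str, value: dict) -> list:
    """Return the sorted mandatory keys missing from `value` (hypothetical helper)."""
    return sorted(REQUIRED_KEYS[name] - set(value))


# Placeholder values for illustration only.
model_arg = {"model": "path/to/saved_model", "hub": "johnsnowlabs"}
data_arg = {"data_source": "sample.conll"}
incomplete_model_arg = {"model": "path/to/saved_model"}  # "hub" is missing

for label, name, arg in [
    ("model_arg", "model", model_arg),
    ("data_arg", "data", data_arg),
    ("incomplete_model_arg", "model", incomplete_model_arg),
]:
    print(label, "missing:", check_required(name, arg))
```

Checking the dictionaries up front like this is only a convenience; the same dictionaries are what the `Harness(...)` call later in this notebook receives.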


<br/>
<br/>

# Real-World Project Workflows

In this section, we dive into complete workflows for using the model testing module in real-world project settings.

## Robustness Testing

In this example, we will be testing a model's robustness to changes in capitalization - more specifically, we will be applying 2 tests: uppercase and lowercase. The real-world project workflow of the model robustness testing and fixing in this case goes as follows:

1. Train NER model on original CoNLL training set

2. Test NER model robustness on CoNLL test set

3. Augment CoNLL training set based on test results

4. Train new NER model on augmented CoNLL training set

5. Test new NER model robustness on the CoNLL test set from step 2

6. Compare robustness of new NER model against original NER model

#### Load Train and Test CoNLL

In [None]:
# Load test CoNLL
!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll

# Load train CoNLL
!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/demo/data/conll03.conll

#### Step 1: Train NER Model

In [None]:
from johnsnowlabs import nlp
spark = nlp.start()

🤓 Looks like /root/.johnsnowlabs is missing, creating it
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.3.2, running on ⚡ PySpark==3.1.2


In [None]:

embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

nerTagger = nlp.NerDLApproach()\
    .setInputCols(["document", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(20)\
    .setBatchSize(64)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')

training_pipeline = nlp.Pipeline(stages=[
          embeddings,
          nerTagger
 ])


conll_data = nlp.CoNLL().readDataset(spark, 'conll03.conll')

ner_model = training_pipeline.fit(conll_data)

ner_model.stages[-1].write().overwrite().save('models/trained_ner_model')

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
!zip -r trained_ner_model.zip models/trained_ner_model

  adding: models/trained_ner_model/ (stored 0%)
  adding: models/trained_ner_model/metadata/ (stored 0%)
  adding: models/trained_ner_model/metadata/part-00000 (deflated 44%)
  adding: models/trained_ner_model/metadata/.part-00000.crc (stored 0%)
  adding: models/trained_ner_model/metadata/_SUCCESS (stored 0%)
  adding: models/trained_ner_model/metadata/._SUCCESS.crc (stored 0%)
  adding: models/trained_ner_model/.tensorflow.crc (deflated 0%)
  adding: models/trained_ner_model/tensorflow (deflated 16%)
  adding: models/trained_ner_model/fields/ (stored 0%)
  adding: models/trained_ner_model/fields/datasetParams/ (stored 0%)
  adding: models/trained_ner_model/fields/datasetParams/part-00001 (deflated 75%)
  adding: models/trained_ner_model/fields/datasetParams/.part-00001.crc (stored 0%)
  adding: models/trained_ner_model/fields/datasetParams/part-00000 (deflated 27%)
  adding: models/trained_ner_model/fields/datasetParams/.part-00000.crc (stored 0%)
  adding: models/trained_ner_model/f

#### Step 2: Test NER Model Robustness on Capitalization tests

In [None]:
!unzip trained_ner_model.zip

Archive:  trained_ner_model.zip
replace models/trained_ner_model/metadata/part-00000? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: models/trained_ner_model/metadata/part-00000  
 extracting: models/trained_ner_model/metadata/.part-00000.crc  
 extracting: models/trained_ner_model/metadata/_SUCCESS  
 extracting: models/trained_ner_model/metadata/._SUCCESS.crc  
  inflating: models/trained_ner_model/.tensorflow.crc  
  inflating: models/trained_ner_model/tensorflow  
  inflating: models/trained_ner_model/fields/datasetParams/part-00001  
 extracting: models/trained_ner_model/fields/datasetParams/.part-00001.crc  
  inflating: models/trained_ner_model/fields/datasetParams/part-00000  
 extracting: models/trained_ner_model/fields/datasetParams/.part-00000.crc  
 extracting: models/trained_ner_model/fields/datasetParams/_SUCCESS  
 extracting: models/trained_ner_model/fields/datasetParams/._SUCCESS.crc  


In [None]:
documentAssembler = nlp.DocumentAssembler()\
		.setInputCol("text")\
		.setOutputCol("document")

tokenizer = nlp.Tokenizer()\
		.setInputCols(["document"])\
		.setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

ner = nlp.NerDLModel.load("models/trained_ner_model") \
		.setInputCols(["document", "token", "embeddings"]) \
		.setOutputCol("ner")

ner_pipeline = nlp.Pipeline().setStages([
				documentAssembler,
				tokenizer,
				embeddings,
				ner
    ])

ner_model = ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
harness = Harness(task="ner", model={"model": ner_model, "hub": "johnsnowlabs"}, data={"data_source" :"sample.conll"})

In [None]:
harness.configure({
    'tests': {
        'defaults': {'min_pass_rate': 0.65},

        'robustness': {
            'lowercase': {'min_pass_rate': 0.60},
            'uppercase':{'min_pass_rate': 0.60}
        }
    }
})

{'tests': {'defaults': {'min_pass_rate': 0.65},
 'robustness': {'lowercase': {'min_pass_rate': 0.6},
   'uppercase': {'min_pass_rate': 0.6}}}}

Here we have configured the harness to perform two robustness tests (uppercase and lowercase) and defined the minimum pass rate for each test.
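
One way to read this configuration is that a test's own `min_pass_rate` takes precedence, and the value under `defaults` applies to any test that does not set one. The sketch below illustrates that reading with a hypothetical `resolve_min_pass_rate` helper. It is an assumption about the semantics, not LangTest's code, and the `hypothetical_test` entry is added purely to show the fallback.

```python
config = {
    "tests": {
        "defaults": {"min_pass_rate": 0.65},
        "robustness": {
            "lowercase": {"min_pass_rate": 0.60},
            "uppercase": {"min_pass_rate": 0.60},
            "hypothetical_test": {},  # no threshold of its own; illustration only
        },
    }
}


def resolve_min_pass_rate(cfg: dict, category: str, test_type: str) -> float:
    """Prefer the per-test threshold, otherwise fall back to the default."""
    tests = cfg["tests"]
    default = tests["defaults"]["min_pass_rate"]
    return tests[category].get(test_type, {}).get("min_pass_rate", default)


thresholds = {
    name: resolve_min_pass_rate(config, "robustness", name)
    for name in config["tests"]["robustness"]
}
print(thresholds)
```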


#### Generating the test cases.




In [None]:
harness.generate()

Generating testcases... (robustness): 100%|██████████| 1/1 [00:36<00:00, 36.49s/it]




The harness.generate() method automatically generates the test cases based on the provided configuration.
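
Conceptually, each configured test maps every original sentence to a perturbed counterpart. The sketch below mimics this for the two capitalization tests using plain string methods. `PERTURBATIONS` and `generate_testcases` are hypothetical names chosen for this illustration and are not part of LangTest.

```python
# Hypothetical mapping from test name to a string transformation.
PERTURBATIONS = {"lowercase": str.lower, "uppercase": str.upper}


def generate_testcases(sentences, test_types, category="robustness"):
    """Build one row per (test type, sentence) pair."""
    rows = []
    for test_type in test_types:
        perturb = PERTURBATIONS[test_type]
        for sentence in sentences:
            rows.append(
                {
                    "category": category,
                    "test_type": test_type,
                    "original": sentence,
                    "test_case": perturb(sentence),
                }
            )
    return rows


testcases = generate_testcases(
    ["Nadim Ladki", "Robert Galvin"], ["lowercase", "uppercase"]
)
for row in testcases:
    print(row)
```

Two sentences and two tests yield four rows, which is why the number of test cases grows with both the dataset size and the number of configured tests.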

In [None]:
harness.testcases()

|  | category | test_type | original | test_case | expected_result |
| - | - | - | - | - | - |
| 0 | robustness | lowercase | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | soccer - japan get lucky win , china in surpri... | JAPAN: LOC, CHINA: LOC |
| 1 | robustness | lowercase | Nadim Ladki | nadim ladki | Nadim Ladki: PER |
| 2 | robustness | lowercase | AL-AIN , United Arab Emirates 1996-12-06 | al-ain , united arab emirates 1996-12-06 | AL-AIN: LOC, United Arab Emirates: ORG |
| 3 | robustness | lowercase | Japan began the defence of their Asian Cup tit... | japan began the defence of their asian cup tit... | Japan: LOC, Asian Cup: MISC, Syria: LOC, Group... |
| 4 | robustness | lowercase | But China saw their luck desert them in the se... | but china saw their luck desert them in the se... | China: LOC, Uzbekistan: LOC |
| ... | ... | ... | ... | ... | ... |
| 447 | robustness | uppercase | Portuguesa 1 Atletico Mineiro 0 | PORTUGUESA 1 ATLETICO MINEIRO 0 | Portuguesa: ORG, Atletico Mineiro: ORG |
| 448 | robustness | uppercase | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | LARA: PER |
| 449 | robustness | uppercase | Robert Galvin | ROBERT GALVIN | Robert Galvin: PER |
| 450 | robustness | uppercase | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 | MELBOURNE: LOC |


The harness.testcases() method returns the generated test cases as a pandas DataFrame.

#### Saving test configurations, data, test cases

In [None]:
harness.save("saved_test_configurations")

#### Running the tests

In [None]:
harness.run()

Running test cases...: 100%|██████████| 452/452 [00:59<00:00,  7.64it/s]




harness.run() is called after harness.generate() and runs all the tests. It produces a pass/fail flag for each test case.

In [None]:
harness.generated_results()

|  | category | test_type | original | test_case | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | lowercase | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | soccer - japan get lucky win , china in surpri... | JAPAN: LOC, CHINA: LOC |  | False |
| 1 | robustness | lowercase | Nadim Ladki | nadim ladki | Nadim Ladki: PER |  | False |
| 2 | robustness | lowercase | AL-AIN , United Arab Emirates 1996-12-06 | al-ain , united arab emirates 1996-12-06 | AL-AIN: LOC, United Arab Emirates: ORG | al-ain: LOC | False |
| 3 | robustness | lowercase | Japan began the defence of their Asian Cup tit... | japan began the defence of their asian cup tit... | Japan: LOC, Asian Cup: MISC, Syria: LOC, Group... | asian: MISC | False |
| 4 | robustness | lowercase | But China saw their luck desert them in the se... | but china saw their luck desert them in the se... | China: LOC, Uzbekistan: LOC |  | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 447 | robustness | uppercase | Portuguesa 1 Atletico Mineiro 0 | PORTUGUESA 1 ATLETICO MINEIRO 0 | Portuguesa: ORG, Atletico Mineiro: ORG | PORTUGUESA: ORG, ATLETICO MINEIRO: ORG | True |
| 448 | robustness | uppercase | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | LARA: PER | LARA: PER | True |
| 449 | robustness | uppercase | Robert Galvin | ROBERT GALVIN | Robert Galvin: PER | ROBERT GALVIN: PER | True |
| 450 | robustness | uppercase | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 | MELBOURNE: LOC | MELBOURNE: LOC | True |


This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed.
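
As a small illustration of how failed cases can be isolated, the sketch below filters a handful of rows copied from the output above. Plain dictionaries stand in for DataFrame rows so the snippet has no dependencies; on the actual DataFrame, a boolean mask on the `pass` column would serve the same purpose.

```python
from collections import Counter

# A few rows copied from the results shown above (columns trimmed for brevity).
results = [
    {"test_type": "lowercase", "test_case": "nadim ladki",
     "expected_result": "Nadim Ladki: PER", "actual_result": "", "pass": False},
    {"test_type": "lowercase", "test_case": "al-ain , united arab emirates 1996-12-06",
     "expected_result": "AL-AIN: LOC, United Arab Emirates: ORG",
     "actual_result": "al-ain: LOC", "pass": False},
    {"test_type": "uppercase", "test_case": "ROBERT GALVIN",
     "expected_result": "Robert Galvin: PER",
     "actual_result": "ROBERT GALVIN: PER", "pass": True},
    {"test_type": "uppercase", "test_case": "MELBOURNE 1996-12-06",
     "expected_result": "MELBOURNE: LOC",
     "actual_result": "MELBOURNE: LOC", "pass": True},
]

# Keep only the failing rows and count them per test type.
failed = [row for row in results if not row["pass"]]
failures_by_test = Counter(row["test_type"] for row in failed)

for row in failed:
    print(f'{row["test_type"]}: expected [{row["expected_result"]}] '
          f'but got [{row["actual_result"]}]')
print(dict(failures_by_test))
```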

In [None]:
harness.report()

|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | lowercase | 198 | 28 | 12% | 60% | False |
| 1 | robustness | uppercase | 83 | 143 | 63% | 60% | True |


The report summarizes the results per test type, giving the pass and fail counts, the pass rate, and the overall pass/fail flag.
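
The arithmetic behind the report columns can be reproduced from the counts alone. The sketch below uses the counts from the report above; the `summarize` helper is hypothetical, and treating a pass rate equal to the minimum as a pass (`>=`) is an assumption rather than something confirmed by the library.

```python
def summarize(test_type, pass_count, fail_count, minimum_pass_rate):
    """Derive pass_rate and the overall flag from raw counts (hypothetical helper)."""
    total = pass_count + fail_count
    pass_rate = pass_count / total
    return {
        "test_type": test_type,
        "fail_count": fail_count,
        "pass_count": pass_count,
        "pass_rate": f"{pass_rate:.0%}",
        "minimum_pass_rate": f"{minimum_pass_rate:.0%}",
        # Assumption: meeting the threshold exactly counts as a pass.
        "pass": pass_rate >= minimum_pass_rate,
    }


# Counts taken from the report above.
report_rows = [
    summarize("lowercase", pass_count=28, fail_count=198, minimum_pass_rate=0.60),
    summarize("uppercase", pass_count=143, fail_count=83, minimum_pass_rate=0.60),
]
for row in report_rows:
    print(row)
```

Both rows cover the same 226 sentences, so the two pass rates are directly comparable.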

#### Step 3: Augment CoNLL Training Set Based on Robustness Test Results

In [None]:
data_kwargs = {
      "data_source" : "conll03.conll",
       }

harness.augment(training_data=data_kwargs, save_data_path="augmented_train.conll", export_mode="add")



harness.augment() applies perturbations to the training data based on the results in the harness report and saves the result to augmented_train.conll. This augmented dataset is then used to retrain the original model, with the aim of making it more robust.
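
To make the idea concrete, the sketch below appends perturbed copies of training sentences to the original data, which is how we read `export_mode="add"`. Everything here is hypothetical: `augment_add` is not a LangTest function, sentences are plain token/label pairs rather than CoNLL files, and the fixed `proportion` stands in for whatever sampling the library derives from the report.

```python
import random


def augment_add(dataset, perturbations, proportion, seed=0):
    """Return the original samples plus perturbed copies of a random subset."""
    rng = random.Random(seed)
    augmented = list(dataset)  # originals are kept untouched
    sample_size = round(len(dataset) * proportion)
    for perturb in perturbations:
        for tokens, labels in rng.sample(dataset, sample_size):
            # Only the surface form changes; the labels stay aligned per token.
            augmented.append(([perturb(tok) for tok in tokens], labels))
    return augmented


train = [
    (["Japan", "beat", "Syria"], ["B-LOC", "O", "B-LOC"]),
    (["Nadim", "Ladki"], ["B-PER", "I-PER"]),
]
augmented = augment_add(train, [str.lower, str.upper], proportion=1.0)
for tokens, labels in augmented:
    print(tokens, labels)
```

Because case changes do not alter token boundaries, the original labels can be reused as-is for the perturbed copies.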

#### Step 4: Train New NER Model on Augmented CoNLL

In [None]:
embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

nerTagger = nlp.NerDLApproach()\
    .setInputCols(["document", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(20)\
    .setBatchSize(64)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')

training_pipeline = nlp.Pipeline(stages=[
          embeddings,
          nerTagger
 ])


conll_data = nlp.CoNLL().readDataset(spark, 'augmented_train.conll')

ner_model = training_pipeline.fit(conll_data)

ner_model.stages[-1].write().overwrite().save('models/augmented_ner_model')

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
!zip -r augmented_ner_model.zip models/augmented_ner_model

  adding: models/augmented_ner_model/ (stored 0%)
  adding: models/augmented_ner_model/metadata/ (stored 0%)
  adding: models/augmented_ner_model/metadata/part-00000 (deflated 44%)
  adding: models/augmented_ner_model/metadata/.part-00000.crc (stored 0%)
  adding: models/augmented_ner_model/metadata/_SUCCESS (stored 0%)
  adding: models/augmented_ner_model/metadata/._SUCCESS.crc (stored 0%)
  adding: models/augmented_ner_model/.tensorflow.crc (deflated 0%)
  adding: models/augmented_ner_model/tensorflow (deflated 16%)
  adding: models/augmented_ner_model/fields/ (stored 0%)
  adding: models/augmented_ner_model/fields/datasetParams/ (stored 0%)
  adding: models/augmented_ner_model/fields/datasetParams/part-00001 (deflated 75%)
  adding: models/augmented_ner_model/fields/datasetParams/.part-00001.crc (stored 0%)
  adding: models/augmented_ner_model/fields/datasetParams/part-00000 (deflated 26%)
  adding: models/augmented_ner_model/fields/datasetParams/.part-00000.crc (stored 0%)
  adding

In [None]:
!unzip augmented_ner_model.zip

Archive:  augmented_ner_model.zip
replace models/augmented_ner_model/metadata/part-00000? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: models/augmented_ner_model/metadata/part-00000  
 extracting: models/augmented_ner_model/metadata/.part-00000.crc  
 extracting: models/augmented_ner_model/metadata/_SUCCESS  
 extracting: models/augmented_ner_model/metadata/._SUCCESS.crc  
  inflating: models/augmented_ner_model/.tensorflow.crc  
  inflating: models/augmented_ner_model/tensorflow  
  inflating: models/augmented_ner_model/fields/datasetParams/part-00001  
 extracting: models/augmented_ner_model/fields/datasetParams/.part-00001.crc  
  inflating: models/augmented_ner_model/fields/datasetParams/part-00000  
 extracting: models/augmented_ner_model/fields/datasetParams/.part-00000.crc  
 extracting: models/augmented_ner_model/fields/datasetParams/_SUCCESS  
 extracting: models/augmented_ner_model/fields/datasetParams/._SUCCESS.crc  


In [None]:
documentAssembler = nlp.DocumentAssembler()\
		.setInputCol("text")\
		.setOutputCol("document")

tokenizer = nlp.Tokenizer()\
		.setInputCols(["document"])\
		.setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

ner = nlp.NerDLModel.load("models/augmented_ner_model") \
		.setInputCols(["document", "token", "embeddings"]) \
		.setOutputCol("ner")

ner_pipeline = nlp.Pipeline().setStages([
				documentAssembler,
				tokenizer,
				embeddings,
				ner
    ])

ner_model_2 = ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


#### Load saved test configurations, data

In [None]:
h_new = Harness.load("saved_test_configurations",
                       model={"model": ner_model_2,"hub":"johnsnowlabs"}, 
                       task="ner",                                 
                       load_testcases=True)

Generating testcases... (robustness): 100%|██████████| 1/1 [00:30<00:00, 30.30s/it]


#### Test New NER Model Robustness

In [None]:
# Run the saved test cases against the augmented model loaded into h_new
h_new.run().report()

Running test cases...: 100%|██████████| 452/452 [01:00<00:00,  7.49it/s]


|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | lowercase | 31 | 195 | 86% | 60% | True |
| 1 | robustness | uppercase | 39 | 187 | 83% | 60% | True |


We can see that after augmentation both tests now pass, with the lowercase pass rate rising from 12% to 86% and the uppercase pass rate from 63% to 83%.

# Comparison Table (Initial vs. Augmented Model)

| Model            | Category   | Test_Type | Fail_Count | Pass_Count | Pass_Rate | Minimum_Pass_Rate | Pass  |
|------------------|------------|-----------|------------|------------|-----------|-------------------|-------|
| Initial_Model    | Robustness | Lowercase |     198    |     28     |    12%    |        60%        | False |
| Initial_Model    | Robustness | Uppercase |     83     |    143     |    63%    |        60%        | True  |
| Augmented_Model  | Robustness | Lowercase |     31     |    195     |    86%    |        60%        | True  |
| Augmented_Model  | Robustness | Uppercase |     39     |    187     |    83%    |        60%        | True  |
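
The size of the improvement can be computed directly from the counts in the table. The short sketch below does this in percentage points; the counts are copied from the two reports, while the variable and function names are only illustrative.

```python
# (pass_count, fail_count) per test, copied from the two reports.
initial = {"lowercase": (28, 198), "uppercase": (143, 83)}
augmented = {"lowercase": (195, 31), "uppercase": (187, 39)}


def rate(counts):
    """Pass rate as a fraction of all test cases."""
    passed, failed = counts
    return passed / (passed + failed)


gains = {}
for test_type in initial:
    before = rate(initial[test_type])
    after = rate(augmented[test_type])
    # Gain in percentage points, rounded to one decimal place.
    gains[test_type] = round((after - before) * 100, 1)
    print(f"{test_type}: {before:.0%} -> {after:.0%} ({gains[test_type]:+.1f} points)")
```

The lowercase test gains far more than the uppercase test, which is consistent with lowercase being the test that failed for the initial model.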
