![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlptest/tree/issue-23-robustness-notebook-init/example/Automated_Robustness_Testing_Spark_NLP.ipynb)

# Automated Robustness Testing for NLP Models

# Spark NLP Setup

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==4.2.5

In [None]:
import json
import os

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp.start(params=params)

spark

# Data Preparation: Downloading CoNLL

In [None]:
# Download example train and test CoNLL files from nlptest repo
! wget https://raw.githubusercontent.com/JohnSnowLabs/nlptest/main/example/data/train.conll
! wget https://raw.githubusercontent.com/JohnSnowLabs/nlptest/main/example/data/test.conll

# Robustness

Model robustness can be described as the ability of a model to keep similar levels of accuracy, precision and recall when perturbations are made to the data it is predicting on. In the case of NER, the goal is to understand how documents with typos or fully uppercased sentences affect the model's prediction performance compared to documents similar to those in the original training set.
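
To make the idea concrete, here is a minimal sketch in plain Python. The `toy_tagger` function is a made-up stand-in for a NER model (it is not part of Spark NLP or `nlptest`), and `agreement` is a hypothetical helper that measures how many token-level predictions survive an uppercase perturbation.

```python
# Toy stand-in for a NER model (an assumption for illustration only):
# a token is tagged as an entity when it is capitalized like a proper name.
def toy_tagger(tokens):
    return ["ENT" if tok[:1].isupper() and tok[1:].islower() else "O"
            for tok in tokens]


def agreement(original_tags, perturbed_tags):
    """Share of tokens whose predicted tag is unchanged by the perturbation."""
    same = sum(o == p for o, p in zip(original_tags, perturbed_tags))
    return same / len(original_tags)


tokens = ["Paris", "is", "the", "capital", "of", "France"]
upper_tokens = [tok.upper() for tok in tokens]  # the perturbation

original_tags = toy_tagger(tokens)         # predictions on the clean sentence
perturbed_tags = toy_tagger(upper_tokens)  # predictions on the noisy sentence
score = agreement(original_tags, perturbed_tags)
```

A model that relies heavily on casing, like this toy one, loses its entity predictions once the sentence is uppercased, which is the kind of behavior a robustness test is meant to surface.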

## Robustness Testing

Testing a NER model's robustness gives us an idea of how our data may need to be modified to make the model more robust.

### Spark NLP Model for Robustness Test

Testing robustness first requires building a pipeline with a NER model that will be tested. This model should ideally not have been trained on the samples in the test set. We are using a pretrained model here for demo purposes, but it is more common to use a locally trained model.

In [None]:
documentAssembler = DocumentAssembler()\
		.setInputCol("text")\
		.setOutputCol("document")

tokenizer = Tokenizer()\
		.setInputCols(["document"])\
		.setOutputCol("token")
	
embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

ner = NerDLModel.pretrained("ner_dl", 'en') \
		.setInputCols(["document", "token", "embeddings"]) \
		.setOutputCol("ner")

ner_pipeline = Pipeline().setStages([
				documentAssembler,
				tokenizer,
				embeddings,
				ner
    ])

ner_model_pipeline = ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

### Robustness Test Parameters & Perturbations

The `test_robustness` function tests the robustness of a NER model by applying different types of perturbations to a list of sentences taken from a test dataset. Metrics are calculated by comparing the model's extractions on the original sentences against its extractions on the noisy sentences. The original annotated labels are not used at any point: we are simply comparing the model against itself in two settings.
<br/>
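
The self-comparison described above can be sketched at token level as follows. This is an illustrative simplification, not the way `nlptest` computes its report: the helper name `self_comparison_metrics` is hypothetical, and the sketch assumes the perturbation keeps the number of tokens unchanged so that tags can be compared position by position.

```python
def self_comparison_metrics(original_tags, noisy_tags, outside="O"):
    """Precision, recall and F1 of the noisy predictions, using the model's
    own predictions on the clean sentence as the reference."""
    tp = fp = fn = 0
    for ref, pred in zip(original_tags, noisy_tags):
        if pred != outside and pred == ref:
            tp += 1          # same entity predicted in both settings
        else:
            if pred != outside:
                fp += 1      # entity predicted only (or differently) on noisy text
            if ref != outside:
                fn += 1      # entity from the clean prediction was lost or changed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


original = ["PER", "O", "LOC", "O"]   # predictions on the clean sentence
noisy = ["PER", "O", "O", "ORG"]      # predictions on the perturbed sentence
metrics = self_comparison_metrics(original, noisy)
```

Because the reference is the model's own clean output, a perfect score means the perturbation did not change any prediction; it says nothing about whether the predictions were right in the first place.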

Here is a list of the different parameters that can be passed to the `test_robustness` function:

| Parameter | Description |
| - | - |
|**spark**      |An active spark session.|
|**pipeline_model** |PipelineModel with document assembler, sentence detector, tokenizer, <br/>word embeddings (if applicable), NER model with _"ner"_ output label name, <br/> and NER converter with _"ner_chunk"_ output label name.|
|**test_file_path**     |Path to test file to test robustness. Can be .txt or .conll file in CoNLL format <br/> or .csv file with just one column (text) with series of test samples.|
|**test**      |List of robustness tests to implement. Possible values are described in the list <br/> below. Defaults to all tests.|
|**noise_prob**      |Proportion of samples from test data to apply noise to (between 0 and 1).|
|**sample_sentence_count**     |Number of sentences that will be sampled from the test data.|
|**metric_type**      |Set "strict" to calculate metrics in IOB2 format, "flex" to calculate in IO <br/> format. Defaults to 'flex'.|
|**metrics_output_format**   |Set "dictionary" to get a report in dictionary format, "dataframe" to get it in <br/> dataframe format. Defaults to 'dictionary'.|
|**log_path**      |Path to log file, False to avoid saving test results. Defaults to <br/>'./robustness_test_results.json'|
|**starting_context**     |List of words or phrases to add as context perturbations to the beginning <br/> of sentences when running the `add_context` test.|
|**ending_context**     |List of words or phrases to add as context perturbations to the end of <br/> sentences when running the `add_context` test.|

<br/>
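
The difference between the two `metric_type` settings can be illustrated with a small sketch. It assumes that "strict" compares full IOB2 tags while "flex" compares tags after the `B-`/`I-` prefix is dropped; the helper `to_io` is hypothetical and only serves this example.

```python
def to_io(tag):
    """Drop the B-/I- prefix of an IOB2 tag, e.g. 'B-PER' -> 'PER'."""
    return tag.split("-", 1)[1] if tag[:2] in ("B-", "I-") else tag


original = ["B-PER", "I-PER", "O", "B-LOC"]  # predictions on the clean sentence
noisy = ["B-PER", "B-PER", "O", "I-LOC"]     # predictions on the noisy sentence

# "strict"-style comparison: the IOB2 tags must match exactly
strict_matches = sum(o == n for o, n in zip(original, noisy))

# "flex"-style comparison: only the entity type must match
flex_matches = sum(to_io(o) == to_io(n) for o, n in zip(original, noisy))
```

Here the noisy prediction keeps every entity type but shifts two chunk boundaries, so the strict comparison penalizes two tokens while the flexible one penalizes none.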

Multiple perturbation methods are available to test the model's robustness. These are meant to be passed in a list to the `test` parameter in the `test_robustness` function:

- **`capitalization_upper`**: capitalization of the test set is turned into uppercase

- **`capitalization_lower`**: capitalization of the test set is turned into lowercase

- **`capitalization_title`**: capitalization of the test set is turned into title case

- **`add_punctuation`**: special characters at the end of each sentence are replaced by other special characters; if there is no special character at the end, one is added

- **`strip_punctuation`**: special characters are removed from the sentences (except if found in numbers, such as '2.5')

- **`introduce_typos`**: typos are introduced in sentences

- **`add_contractions`**: contractions are added where possible (e.g. 'do not' contracted into 'don't')

- **`add_context`**: tokens are added at the beginning and at the end of the sentences

- **`swap_entities`**: named entities are replaced with entities of the same type and token count from a terminology

- **`swap_cohyponyms`**: named entities are replaced with a co-hyponym from the WordNet database

- **`american_to_british`**: American English will be changed to British English

- **`british_to_american`**: British English will be changed to American English
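
To give a feel for what some of these perturbations do to a sentence, here is a rough sketch in plain Python. These functions are illustrative approximations written for this notebook, not the implementations used inside `nlptest`, and the `mark` argument is an assumption of the sketch.

```python
import string


def capitalization_upper(sentence):
    return sentence.upper()


def capitalization_lower(sentence):
    return sentence.lower()


def capitalization_title(sentence):
    return sentence.title()


def add_punctuation(sentence, mark="!"):
    """Replace a trailing punctuation mark with `mark`, or append it if none."""
    stripped = sentence.rstrip()
    if stripped and stripped[-1] in string.punctuation:
        return stripped[:-1] + mark
    return stripped + mark


def strip_punctuation(sentence):
    """Remove punctuation unless it sits between two digits (e.g. '2.5')."""
    kept = []
    for i, ch in enumerate(sentence):
        between_digits = (0 < i < len(sentence) - 1
                          and sentence[i - 1].isdigit()
                          and sentence[i + 1].isdigit())
        if ch in string.punctuation and not between_digits:
            continue
        kept.append(ch)
    return "".join(kept)


sentence = "Apple paid 2.5 million, in Paris."
stripped = strip_punctuation(sentence)
```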

### Robustness Testing Module


In [None]:
from nlptest import test_robustness

# Running robustness test on sentences from test set
test_results = test_robustness(spark = spark,
                               pipeline_model = ner_model_pipeline,
                               test_file_path = 'test.conll',
                               test = ['capitalization_upper', 'capitalization_lower', 
                                       'capitalization_title', 'add_punctuation', 'strip_punctuation', 
                                       'introduce_typos', 'add_contractions', 'american_to_british', 
                                       'add_context', 'swap_entities', 'swap_cohyponyms'],
                               noise_prob = 0.5,
                               metric_type = 'flex',
                               metrics_output_format = 'dictionary')

In [None]:
# Dictionary outputs metrics, a comparison dataframe and written details for each test
test_results.keys()

In [None]:
# Metrics contains detailed metrics for each test
test_results['metrics'].keys()

In [None]:
# Select a specific test to view its metrics
test_results['metrics']['modify_capitalization_upper']

In [None]:
# 1-to-1 token comparison of every perturbation applied
test_results['comparison_df']

In [None]:
# Written details for each test (preview)
print(test_results['test_details'][:865])

## Robustness Fixing

Once a NER model's robustness has been tested, we can make an informed decision about how to make it more robust to perturbations.

### Robustness Fixing Parameters

The `augment_robustness` function augments a training set by generating perturbations. The resulting dataset includes both the original samples and the noisy ones. Here is a list of the different parameters that can be passed to the function:

| Parameter | Description |
| - | - |
|**spark**      |An active spark session.|
|**conll_path**      |Path to CoNLL file to augment with selected perturbations.|
|**conll_save_path**    |Path to save augmented CoNLL file.|
|**return_spark**      |Return Spark DataFrame instead of CoNLL file.|
|**perturbation_map**    |A dictionary of perturbation names and desired proportions <br/> to apply on all entity classes.|
|**entity_perturbation_map**  |A dictionary of perturbation names and desired perturbation <br/> proportions defined for each entity class.|
|**optimized_inplace**    |Whether you want to apply perturbations inplace or create <br/> duplicate sentences with perturbations applied.|
|**starting_context**   |List of words or phrases to add as context perturbations <br/> to the beginning of sentences when running the <br/> `add_context` perturbation.|
|**ending_context**      |List of words or phrases to add as context perturbations <br/> to the end of sentences when running the <br/> `add_context` perturbation.|
|**print_info**     |Print logs of augmentation process, default is False.|
|**ignore_warnings**     |Ignore warnings from augmentation process, default is False.|
|**regex_pattern**     |Regex pattern to tokenize context and contractions, <br/> defaults to pattern used in regular tokenizer.|
|**random_state**      |Random state to apply perturbations on a consistent sample <br/> of sentences.|

<br/>
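
The interplay between a perturbation proportion, `optimized_inplace` and `random_state` can be sketched in plain Python. The `augment` function below is a hypothetical simplification that works on a list of strings rather than on a CoNLL file, and it is not the `nlptest` implementation; it only illustrates the two output modes described in the table.

```python
import random


def augment(sentences, perturb, proportion, inplace=False, random_state=None):
    """Apply `perturb` to a random `proportion` of the sentences.

    inplace=True  -> selected sentences are replaced by their noisy version
    inplace=False -> noisy versions are appended, originals are all kept
    """
    rng = random.Random(random_state)  # a fixed seed gives a repeatable sample
    n_selected = round(len(sentences) * proportion)
    selected = set(rng.sample(range(len(sentences)), n_selected))
    if inplace:
        return [perturb(s) if i in selected else s
                for i, s in enumerate(sentences)]
    return list(sentences) + [perturb(sentences[i]) for i in sorted(selected)]


sentences = [f"sentence number {i}" for i in range(10)]
duplicated = augment(sentences, str.upper, 0.2, inplace=False, random_state=42)
in_place = augment(sentences, str.upper, 0.2, inplace=True, random_state=42)
```

With ten sentences and a proportion of 0.2, the duplicating mode returns twelve sentences while the in-place mode returns ten, two of which are perturbed. Reusing the same `random_state` selects the same sentences on every run.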

Multiple perturbation methods are available to augment the dataset in an attempt to increase the robustness of the model trained on it. These are meant to be passed as keys of the dictionary passed to the `perturbation_map` or `entity_perturbation_map` parameter:

- **`capitalization_upper`**: capitalization of the dataset is turned into uppercase

- **`capitalization_lower`**: capitalization of the dataset is turned into lowercase

- **`capitalization_title`**: capitalization of the dataset is turned into title case

- **`add_punctuation`**: special characters at the end of each sentence are replaced by other special characters; if there is no special character at the end, one is added

- **`strip_punctuation`**: special characters are removed from the sentences (except if found in numbers, such as '2.5')

- **`introduce_typos`**: typos are introduced in sentences

- **`add_contractions`**: contractions are added where possible (e.g. 'do not' contracted into 'don't')

- **`add_context`**: tokens are added at the beginning and at the end of the sentences

- **`swap_entities`**: named entities are replaced with entities of the same type and token count from a terminology

- **`swap_cohyponyms`**: named entities are replaced with a co-hyponym from the WordNet database

- **`american_to_british`**: American English will be changed to British English

- **`british_to_american`**: British English will be changed to American English


### Robustness Fixing Module

In [None]:
from nlptest import augment_robustness

# using perturbation_map
perturbation_map = {
   "capitalization_upper": 0.05,
   "capitalization_lower": 0.05,
   'capitalization_title': 0.05, 
   'add_punctuation': 0.05, 
   'strip_punctuation': 0.05,                  
   'introduce_typos': 0.05, 
   'add_contractions': 0.05, 
   'american_to_british': 0.05, 
   'add_context': 0.05, 
   'swap_entities': 0.05, 
   'swap_cohyponyms': 0.05
}

augment_robustness(conll_path = 'train.conll',
                   conll_save_path = 'augmented_train.conll',
                   perturbation_map = perturbation_map,
                   print_info=False,
                   ignore_warnings=True,
                   random_state=42)

In [None]:
# using entity_perturbation_map
entity_perturbation_map = {
   "capitalization_upper": {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01},
   "capitalization_lower": {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01},
   'capitalization_title': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}, 
   'add_punctuation': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}, 
   'strip_punctuation': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01},                  
   'introduce_typos': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}, 
   'add_contractions': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}, 
   'american_to_british': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}, 
   'add_context': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}, 
   'swap_entities': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}, 
   'swap_cohyponyms': {'PER':0.05, 'ORG':0.02, 'LOC':0.06, 'MISC':0.01}
}

augment_robustness(conll_path = 'train.conll',
                   conll_save_path = 'augmented_train.conll',
                   entity_perturbation_map = entity_perturbation_map,
                   print_info=False,
                   ignore_warnings=True,
                   random_state=42)
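
Since every perturbation in the map above uses the same per-entity proportions, the dictionary can also be built programmatically. This is a small convenience sketch in plain Python; the names `entity_proportions` and `perturbations` are introduced here for illustration only.

```python
# Per-entity proportions shared by all perturbations (values from the cell above)
entity_proportions = {'PER': 0.05, 'ORG': 0.02, 'LOC': 0.06, 'MISC': 0.01}

perturbations = [
    'capitalization_upper', 'capitalization_lower', 'capitalization_title',
    'add_punctuation', 'strip_punctuation', 'introduce_typos',
    'add_contractions', 'american_to_british', 'add_context',
    'swap_entities', 'swap_cohyponyms',
]

# Copy the proportions for each perturbation so that later tweaks to one
# perturbation do not silently change the others
entity_perturbation_map = {name: dict(entity_proportions)
                           for name in perturbations}

# Example of adjusting a single perturbation for a single entity class
entity_perturbation_map['introduce_typos']['PER'] = 0.10
```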

## Robustness Test and Fix One-Liner

To fully automate robustness testing and fixing, the `nlptest` library provides a one-liner, `test_and_augment_robustness`, that allows users to simply pass in the path to a CoNLL file along with the different perturbations they would like to test and fix for. Here is a list of the different parameters that can be passed to the function:

| Parameter | Description |
| - | - |
|**spark**      |An active spark session.|
|**pipeline_model** |PipelineModel with document assembler, sentence detector, tokenizer, <br/>word embeddings (if applicable), NER model with _"ner"_ output label name, <br/> and NER converter with _"ner_chunk"_ output label name.|
|**test_file_path**      |Path to test file to test and fix robustness. Can be .txt or .conll file <br/> in CoNLL format.|
|**conll_path_to_augment**      |Path to CoNLL file to augment with selected perturbations.|
|**conll_save_path**    |Path to save augmented CoNLL file.|
|**test**      |List of robustness tests to implement. Possible values are described in the list <br/> below. Defaults to all tests.|
|**noise_prob**      |Proportion of samples from test data to apply noise to (between 0 and 1).|
|**optimized_inplace**    |Whether you want to apply perturbations inplace or create <br/> duplicate sentences with perturbations applied.|
|**sample_sentence_count**     |Number of sentences that will be sampled from the test data.|
|**metric_type**      |Set "strict" to calculate metrics in IOB2 format, "flex" to calculate in IO <br/> format. Defaults to 'flex'.|
|**metrics_output_format**   |Set "dictionary" to get a report in dictionary format, "dataframe" to get it in <br/> dataframe format. Defaults to 'dictionary'.|
|**log_path**      |Path to log file, False to avoid saving test results. Defaults to <br/>'./robustness_test_results.json'|
|**starting_context**     |List of words or phrases to add as context perturbations to the beginning <br/> of sentences when running the `add_context` test.|
|**ending_context**     |List of words or phrases to add as context perturbations to the end of <br/> sentences when running the `add_context` test.|
|**print_info**     |Print logs of augmentation process, default is False.|
|**ignore_warnings**     |Ignore warnings from augmentation process, default is False.|
|**regex_pattern**     |Regex pattern to tokenize context and contractions, <br/> defaults to pattern used in regular tokenizer.|
|**random_state**      |Random state to apply perturbations on consistent samples <br/> of sentences.|
|**return_spark**      |Return Spark DataFrame instead of CoNLL file.|


<br/>

Multiple perturbation methods are available to test the model and augment the dataset in an attempt to increase the robustness of the model trained on it. These are meant to be passed in a list to the `test` parameter:

- **`capitalization_upper`**: capitalization of the dataset is turned into uppercase

- **`capitalization_lower`**: capitalization of the dataset is turned into lowercase

- **`capitalization_title`**: capitalization of the dataset is turned into title case

- **`add_punctuation`**: special characters at the end of each sentence are replaced by other special characters; if there is no special character at the end, one is added

- **`strip_punctuation`**: special characters are removed from the sentences (except if found in numbers, such as '2.5')

- **`introduce_typos`**: typos are introduced in sentences

- **`add_contractions`**: contractions are added where possible (e.g. 'do not' contracted into 'don't')

- **`add_context`**: tokens are added at the beginning and at the end of the sentences

- **`swap_entities`**: named entities are replaced with entities of the same type and token count from a terminology

- **`swap_cohyponyms`**: named entities are replaced with a co-hyponym from the WordNet database

- **`american_to_british`**: American English will be changed to British English

- **`british_to_american`**: British English will be changed to American English

In [None]:
from nlptest import test_and_augment_robustness

# Running robustness test on sentences from test set
test_results = test_and_augment_robustness(spark = spark,
                                           pipeline_model = ner_model_pipeline,
                                           test_file_path = 'test.conll',
                                           conll_path_to_augment = 'train.conll',
                                           conll_save_path = 'augmented_train.conll',
                                           test = ['capitalization_upper', 'capitalization_lower', 
                                                   'add_punctuation', 'introduce_typos', 'add_contractions', 
                                                   'add_context', 'swap_entities'],
                                           noise_prob = 0.5, 
                                           metric_type = 'flex',
                                           metrics_output_format = 'dictionary')

# Real-World Project Workflows

In this section, we dive into complete workflows for using the model testing module in real-world project settings.

## Robustness

In this example, we will be testing a model's robustness to changes in capitalization by applying two tests: uppercasing and lowercasing. The real-world project workflow for testing and fixing model robustness in this case goes as follows:

1. Train NER model on original CoNLL training set

2. Test NER model robustness on CoNLL test set

3. Augment CoNLL training set based on test results 

4. Train new NER model on augmented CoNLL training set

5. Test new NER model robustness on the CoNLL test set from step 2

6. Compare robustness of new NER model against original NER model

#### Step 1: Train NER Model

In [None]:
from sparknlp.training import CoNLL

embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

nerTagger = NerDLApproach()\
    .setInputCols(["document", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(20)\
    .setBatchSize(64)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')

training_pipeline = Pipeline(stages=[
          embeddings,
          nerTagger
 ])


conll_data = CoNLL().readDataset(spark, 'train.conll')

ner_model = training_pipeline.fit(conll_data)

ner_model.stages[-1].write().overwrite().save('models/first_NER_20epoch')

In [None]:
!zip -r first_NER_20epoch.zip models/first_NER_20epoch

#### Step 2: Test NER Model Robustness on Capitalization

In [None]:
!unzip first_NER_20epoch.zip

In [None]:
documentAssembler = DocumentAssembler()\
		.setInputCol("text")\
		.setOutputCol("document")

tokenizer = Tokenizer()\
		.setInputCols(["document"])\
		.setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

ner = NerDLModel.load("models/first_NER_20epoch") \
		.setInputCols(["document", "token", "embeddings"]) \
		.setOutputCol("ner")

ner_pipeline = Pipeline().setStages([
				documentAssembler,
				tokenizer,
				embeddings,
				ner
    ])

ner_model = ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

In [None]:
from nlptest import test_robustness

# Running robustness test on test set sentences
test_results = test_robustness(spark = spark,
                               pipeline_model = ner_model,
                               test_file_path = 'test.conll',
                               test = ['capitalization_upper', 'capitalization_lower'],
                               noise_prob = 0.5,
                               metric_type = 'flex',
                               metrics_output_format = 'dataframe')

In [None]:
test_results['metrics']

In [None]:
test_results['metrics'].to_csv('first_ner_robustness_results.csv')

#### Step 3: Augment CoNLL Training Set Based on Robustness Test Results

In [None]:
from nlptest import augment_robustness

# using perturbation_map
perturbation_map = {
   "capitalization_upper": 0.05,
   "capitalization_lower": 0.05,
}

augment_robustness(conll_path = 'train.conll',
                   conll_save_path = 'augmented_train.conll',
                   perturbation_map = perturbation_map,
                   print_info=False,
                   ignore_warnings=True,
                   random_state=42)

#### Step 4: Train New NER Model on Augmented CoNLL

In [None]:
from sparknlp.training import CoNLL

embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

nerTagger = NerDLApproach()\
    .setInputCols(["document", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(20)\
    .setBatchSize(64)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')

training_pipeline = Pipeline(stages=[
          embeddings,
          nerTagger
 ])


conll_data = CoNLL().readDataset(spark, 'augmented_train.conll')

ner_model = training_pipeline.fit(conll_data)

ner_model.stages[-1].write().overwrite().save('models/second_NER_20epoch')

In [None]:
!zip -r second_NER_20epoch.zip models/second_NER_20epoch

#### Step 5: Test New NER Model Robustness on Capitalization

In [None]:
# Rebuild the inference pipeline around the newly trained NER model

documentAssembler = DocumentAssembler()\
		.setInputCol("text")\
		.setOutputCol("document")

tokenizer = Tokenizer()\
		.setInputCols(["document"])\
		.setOutputCol("token")
	
embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
		.setInputCols(["document", 'token']) \
		.setOutputCol("embeddings")

ner = NerDLModel.load("models/second_NER_20epoch") \
		.setInputCols(["document", "token", "embeddings"]) \
		.setOutputCol("ner")

ner_pipeline = Pipeline().setStages([
				documentAssembler,
				tokenizer,
				embeddings,
				ner
    ])

ner_model = ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

In [None]:
from nlptest import test_robustness

# Running robustness test on test set sentences
new_test_results = test_robustness(spark = spark,
                                   pipeline_model = ner_model,
                                   test_file_path = 'test.conll',
                                   test = ['capitalization_upper', 'capitalization_lower'],
                                   noise_prob = 0.5,
                                   metric_type = 'flex',
                                   metrics_output_format = 'dataframe')

In [None]:
new_test_results['metrics']

In [None]:
new_test_results['metrics'].to_csv('second_ner_robustness_results.csv')

#### Step 6: Compare Robustness Reports for First and Second NER Models

In [None]:
robustness_comparison = new_test_results['metrics'][['precision', 'recall', 'f1-score']] - test_results['metrics'][['precision', 'recall', 'f1-score']]
robustness_comparison[['entity', 'test']] = test_results['metrics'][['entity', 'test']]

In [None]:
robustness_comparison

This dataframe shows the difference between the metrics of the model trained on augmented data and those of the original model, so positive values indicate an improvement. We notice that augmenting our training set by adding uppercase and lowercase sentences increases the NER model's robustness when compared to the original NER model tested on the same test set.
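
The subtraction in Step 6 lines rows up by their dataframe index, so it assumes both reports list the same entity and test rows in the same order. A slightly more defensive variant, sketched below with pandas, aligns the two reports on the `entity` and `test` columns before subtracting. The numbers are made-up toy values used only to show the mechanics; they are not results from the models trained above.

```python
import pandas as pd

metric_cols = ['precision', 'recall', 'f1-score']
key_cols = ['entity', 'test']

# Toy report for the first model (made-up values)
first = pd.DataFrame({
    'entity': ['PER', 'LOC'],
    'test': ['capitalization_upper', 'capitalization_upper'],
    'precision': [0.5, 0.25],
    'recall': [0.5, 0.5],
    'f1-score': [0.5, 0.25],
})

# Toy report for the second model, deliberately listed in a different row order
second = pd.DataFrame({
    'entity': ['LOC', 'PER'],
    'test': ['capitalization_upper', 'capitalization_upper'],
    'precision': [0.75, 0.75],
    'recall': [0.75, 1.0],
    'f1-score': [0.75, 0.75],
})

# Index both reports by (entity, test) so rows are matched by key, not position
delta = (second.set_index(key_cols)[metric_cols]
         - first.set_index(key_cols)[metric_cols])

comparison = delta.reset_index()  # back to plain columns for display
```

Aligning on keys keeps the comparison correct even if one report is sorted differently or filtered before the subtraction.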