## Robust Testing in NLP Models

This notebook shows an example of a test that checks the robustness of a trained NLP model. In this case, we create a test to check whether the model is robust to typos introduced in the text. For that, we use `ClassificationInvarianceTest` from `mercury.robust`.

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

## Load Dataset

We use the banking intents dataset, which contains short customer queries classified under 77 possible labels. This dataset was used in [[1]](#[1]).

In [2]:
path_dataset = "./data/bankintents/"
train_df = pd.read_csv(path_dataset + "train.csv")
test_df = pd.read_csv(path_dataset + "test.csv")
categories_df = pd.read_json(path_dataset + "categories.json")
train_df = train_df.sample(frac=1, random_state=342)  # shuffle

In [3]:
train_df.head()

| | text | category | set | category_id | id |
|---|---|---|---|---|---|
| 9695 | how do i set up my apple pay watch to connect ... | apple_pay_or_google_pay | train | 74 | 9695 |
| 3369 | Why isn't my balance updating after depositing... | balance_not_updated_after_cheque_or_cash_deposit | train | 26 | 3369 |
| 377 | How do you decide what your exchange rates are? | exchange_rate | train | 2 | 377 |
| 4669 | I see a direct debit transaction that I didn't... | direct_debit_payment_not_recognised | train | 37 | 4669 |
| 7815 | How can I create another card linked to this a... | getting_spare_card | train | 60 | 7815 |


## Train model

We train a classification model to predict the category of a text. We build a basic NLP pipeline with a `TfidfVectorizer` and a `LogisticRegression` model.


In [4]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer()
clf = LogisticRegression()
pipeline = Pipeline([('tfidf', tfidf), ('classifier', clf)])
pipeline = pipeline.fit(train_df["text"].values, train_df["category_id"].values)

In [5]:
from sklearn.metrics import accuracy_score
accuracy_score(test_df.category_id, pipeline.predict(test_df.text))

0.8772727272727273

## Invariance Test

Now we will create an `InvarianceTest` for the model that we trained. The idea of the `ClassificationInvarianceTest` is that if we apply a label-preserving perturbation to a sample, its prediction shouldn't change. For example, if we change the name of a person in a sentence when using a sentiment analysis model, the sentiment shouldn't change, or if we make a minor typo in a text when using a text classification model, the prediction shouldn't change.
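As a minimal sketch of this idea, the snippet below checks invariance by hand for a single sample. The keyword classifier `toy_predict` and the helper `changed_predictions` are hypothetical names introduced only for illustration; they are not part of `mercury.robust` and say nothing about how the library implements the test.

```python
from typing import Callable, List, Sequence


def toy_predict(texts: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for a trained model: labels a text by keyword."""
    return ["card" if "card" in t.lower() else "other" for t in texts]


def changed_predictions(
    predict: Callable[[Sequence[str]], List[str]],
    original: str,
    perturbed: Sequence[str],
) -> List[str]:
    """Return the perturbed texts whose prediction differs from the original's."""
    expected = predict([original])[0]
    predictions = predict(list(perturbed))
    return [text for text, pred in zip(perturbed, predictions) if pred != expected]


original = "Can I freeze my card right now?"
perturbed = [
    "Can I freeze my crad right now?",  # typo inside the keyword
    "Can I freeze my card rihgt now?",  # typo elsewhere in the sentence
]
failures = changed_predictions(toy_predict, original, perturbed)
print(failures)  # ['Can I freeze my crad right now?']
```

Only the first perturbation breaks the toy classifier, because the typo lands on the one word it relies on. A model that is invariant to typos would return an empty list here.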

The idea of the Invariance Test comes from the paper [Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf). You can read the paper to get further ideas for creating more tests.

### Create Dataset with perturbed samples (typos)

First, let's create perturbed samples. We take a sample from our test dataset as the original dataset. Then, we apply perturbations to these samples to introduce typos and create a perturbed dataset. We first define the `add_typos` function, which creates typos in a given string. Then, we define a function that takes a dataset and a perturbation function, and applies the function `num_perturbations` times to each sample.



In [6]:
sample_test_data = test_df.sample(50)["text"].values.tolist()

In [7]:
def add_typos(string: str, num_typos: int = 1) -> str:
    """
    Return a copy of `string` with typos added.

    Typos are introduced by swapping randomly chosen pairs of consecutive characters.
    The string may come back unchanged if it is too short to swap characters, or if
    the swapped characters are identical.

    Args:
        string: the string to add typos to
        num_typos: the number of typos to add

    Returns:
        (str): The string with the typos added
    """
    if len(string) <= 1:
        return string
    chars = list(string)
    # Pick the left-hand position of each swap; positions may repeat.
    indices = np.random.choice(len(chars) - 1, num_typos)
    for idx in indices:
        chars[idx], chars[idx + 1] = chars[idx + 1], chars[idx]
    return ''.join(chars)

def generate_perturbations(data, num_perturbations, generation_fn):
    """Apply `generation_fn` to each sample `num_perturbations` times.

    Returns one list of perturbed versions per sample.
    """
    return [
        [generation_fn(sample) for _ in range(num_perturbations)]
        for sample in data
    ]

typos_test_data = generate_perturbations(
    sample_test_data, 
    num_perturbations=3, 
    generation_fn=add_typos
)

Let's look at an example:

In [8]:
print(sample_test_data[1])
print(typos_test_data[1][0])

The amount of cash I received was different than what I requested.
The amount of cash I received was different than what I requeste.d


### Create test and run

Now, let's create the `ClassificationInvarianceTest` and run it. We need to pass the test either the trained model object, which must have a `predict()` method that returns the predictions, or a function as the `predict_fn` argument that generates the predictions for our samples. We also pass a `threshold`, which indicates the error rate we are willing to tolerate for the test to pass. If the error rate is higher than the threshold, the test fails and raises a `FailedTestError`.

In [9]:
from mercury.robust.model_tests import ClassificationInvarianceTest

test = ClassificationInvarianceTest(
    original_samples=sample_test_data, 
    perturbed_samples=typos_test_data,
    model=pipeline, 
    threshold=0.05, 
    name="Invariance to typos"
)
test.run()

FailedTestError: Error rate 0.22666666666666666 is higher than threshold 0.05

You might obtain a different error rate depending on the generated typos, but the test will most likely fail if you keep this low threshold.
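If you want the same typos on every run, one option is to drive the perturbation from a seeded random generator. The sketch below is a standard-library variant of the typo function; `add_typos_seeded` and its `rng` parameter are hypothetical names for illustration and are not used elsewhere in this notebook.

```python
import random
from typing import Optional


def add_typos_seeded(
    string: str, num_typos: int = 1, rng: Optional[random.Random] = None
) -> str:
    """Swap `num_typos` randomly chosen pairs of adjacent characters using `rng`."""
    if len(string) <= 1:
        return string
    rng = rng or random.Random()  # unseeded fallback: results vary between runs
    chars = list(string)
    for _ in range(num_typos):
        idx = rng.randrange(len(chars) - 1)  # left-hand position of the swap
        chars[idx], chars[idx + 1] = chars[idx + 1], chars[idx]
    return "".join(chars)


text = "Can I freeze my card right now?"
first = add_typos_seeded(text, rng=random.Random(342))
second = add_typos_seeded(text, rng=random.Random(342))
print(first == second)  # True: the same seed produces the same typo
```

With the NumPy-based function used in this notebook, calling `np.random.seed(...)` before generating the perturbations, and passing `random_state` to `test_df.sample`, should have a similar effect.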

By calling the `info()` method you can check:
- rate_samples_with_errors: the fraction of samples with at least one perturbation whose prediction differs from that of the original sample.
- total_rate_errors: the fraction of all perturbations whose prediction differs from that of their original sample.
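To make the difference between the two rates concrete, here is a small sketch that computes them from predictions, following the definitions above. The function `invariance_rates` and the example predictions are made up for illustration; this is one reading of the definitions, not the library's implementation.

```python
from typing import Dict, List, Sequence


def invariance_rates(
    pred_original: Sequence[int],
    pred_perturbed: Sequence[Sequence[int]],
) -> Dict[str, float]:
    """Compute both error rates from predictions.

    pred_original[i] is the prediction for sample i; pred_perturbed[i] holds
    the predictions for the perturbed versions of that sample.
    """
    # Number of perturbations per sample whose prediction changed.
    errors_per_sample: List[int] = [
        sum(p != orig for p in perturbed)
        for orig, perturbed in zip(pred_original, pred_perturbed)
    ]
    n_perturbations = sum(len(perturbed) for perturbed in pred_perturbed)
    return {
        "rate_samples_with_errors": sum(e > 0 for e in errors_per_sample) / len(pred_original),
        "total_rate_errors": sum(errors_per_sample) / n_perturbations,
    }


# Made-up predictions: 4 samples with 3 perturbations each.
rates = invariance_rates(
    pred_original=[63, 9, 47, 61],
    pred_perturbed=[[63, 23, 9], [9, 9, 9], [47, 47, 31], [61, 61, 61]],
)
print(rates)  # {'rate_samples_with_errors': 0.5, 'total_rate_errors': 0.25}
```

Under this reading, the values reported in this notebook for 50 samples with 3 perturbations each would correspond to 18 of 50 samples with at least one error (0.36) and 34 of 150 perturbations with a changed prediction (about 0.227).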

In [10]:
test.info()

{'rate_samples_with_errors': 0.36, 'total_rate_errors': 0.22666666666666666}

You can also call `get_examples_failed()` to obtain some examples where the prediction changes:

In [11]:
pd.set_option('display.max_colwidth', None)

In [12]:
test.get_examples_failed()

| | original | perturbed | pred_original | pred_perturbed |
|---|---|---|---|---|
| 0 | How do I pay by check? | How do I pay by chekc? | 61 | 74 |
| 1 | I am seeing a weird payment showing up that I know I did not make, how can I get it cancelled? | I am seeing a weird payment showing up that Ik now I did not make, how can I get it cancelled? | 37 | 52 |
| 2 | Who can top up my accounts? | Who can top u pmy accounts? | 66 | 18 |
| 3 | Can I freeze my card right now? | Can I freeze my crad right now? | 9 | 11 |
| 4 | I need your help in deleting my account. | I need your hel pin deleting my account. | 47 | 31 |


In [13]:
test.get_examples_failed(n_samples=3, n_perturbed=2)

| | original | perturbed | pred_original | pred_perturbed |
|---|---|---|---|---|
| 0 | What is the process for activating my card? | What is the process for atcivating my card? | 71 | 0 |
| 1 | Why can't I get my virtual card to work? | Why can't I get my virtual card to owrk? | 63 | 23 |
| 2 | Why can't I get my virtual card to work? | Why can't I get m yvirtual card to work? | 63 | 9 |
| 3 | How do I make my virtual card work? | How do I maek my virtual card work? | 63 | 23 |
| 4 | How do I make my virtual card work? | How do I make ym virtual card work? | 63 | 23 |


### Test Suites

You can also group different tests in a `TestSuite` and execute them together. See the TestSuite tutorial for an example that you can adapt to an NLP use case.


## References
<a id="[1]">[1]</a> 
Iñigo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulic.
Efficient Intent Detection with Dual Sentence Encoders. https://arxiv.org/abs/2003.04807.
Data available at https://github.com/PolyAI-LDN/task-specific-datasets