# End-to-end behavioral testing

In this notebook we show how to use `nlptest` to create a set of tests (a `TestPack`) that checks specific behaviors of a model.

We focus on a span extraction task, specifically named entity recognition (NER), and use the English NER model from the [spaCy](https://spacy.io/) library.

In [1]:
import spacy

from typing import List, Union

from nlptest.behavior import SpanClassificationBehavior
from nlptest.types import BehaviorType, TaskType, Span
from nlptest.generator import Generator
from nlptest.testpack import TestPack
from nlptest.performers import Performer


First, let us generate samples to test our model on.

We generate two sets of samples, each meant to test a different behavioral aspect of the model.

In [2]:
kwargs_generator = Generator()

In [3]:
template = "I am flying to {LOC} with {PERSON}."

locations = ["NYC", "Copenhagen", "Miami"]
people = ["John", "Stefan", "Maria"]

# Set `return_pos=True` to also get the positions of the filled-in keywords
loc_pers_samples, all_positions = kwargs_generator.generate(template, 
                                                            generate_all=True, 
                                                            return_pos=True,
                                                            LOC=locations, 
                                                            PERSON=people)
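To make the shape of the returned values concrete, here is a minimal pure-Python sketch of template expansion. It is not `nlptest`'s implementation: `expand_template` is a hypothetical helper, and it assumes that `generate_all=True` yields every combination of slot values and that each position is a `(start, end, label, text)` tuple with an exclusive `end`, which is what the indexing in the next cell suggests.

```python
from itertools import product


def expand_template(template, **slots):
    """Hypothetical helper: fill every combination of slot values and record spans."""
    names = list(slots)
    samples, all_positions = [], []
    for combo in product(*(slots[name] for name in names)):
        values = dict(zip(names, combo))
        text, positions, cursor = "", [], 0
        # Walk the template left to right, replacing each {NAME} placeholder.
        while True:
            open_idx = template.find("{", cursor)
            if open_idx == -1:
                text += template[cursor:]
                break
            close_idx = template.index("}", open_idx)
            name = template[open_idx + 1:close_idx]
            text += template[cursor:open_idx]
            start = len(text)
            text += values[name]
            # (start, end, label, text), with `end` exclusive
            positions.append((start, len(text), name, values[name]))
            cursor = close_idx + 1
        samples.append(text)
        all_positions.append(positions)
    return samples, all_positions


samples, all_positions = expand_template(
    "I am flying to {LOC} with {PERSON}.",
    LOC=["NYC", "Copenhagen", "Miami"],
    PERSON=["John", "Stefan", "Maria"],
)
print(len(samples))       # 9 combinations (3 locations x 3 people)
print(samples[0])         # I am flying to NYC with John.
print(all_positions[0])   # [(15, 18, 'LOC', 'NYC'), (24, 28, 'PERSON', 'John')]
```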

In [4]:
# creating the set of ground truth labels
labels = []
for positions in all_positions:
    labels.append([Span(start=pos[0], end=pos[1], label=pos[2], text=pos[3]) for pos in positions])

In [5]:
# Creating some minimum functionality samples
mf_samples = [
    "He is working for a company called John.",
    "I love all the apple products."
]
mf_labels = [
    [Span(start=35, end=39, label="PERSON", text="John")],
    [Span(start=15, end=20, label="PRODUCT", text="apple")]
]
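Hand-written character offsets are easy to get wrong, so it can help to check them before running any test. The sketch below uses plain `(start, end, label, text)` tuples rather than `Span` objects so that it runs without `nlptest`; `check_offsets` is a hypothetical helper, and `end` is assumed to be exclusive.

```python
def check_offsets(samples, labels):
    """Return one entry per span whose slice of the sample differs from its text."""
    mismatches = []
    for i, (sample, spans) in enumerate(zip(samples, labels)):
        for start, end, label, text in spans:
            found = sample[start:end]
            if found != text:
                mismatches.append((i, start, end, text, found))
    return mismatches


mf_samples = [
    "He is working for a company called John.",
    "I love all the apple products.",
]
# Same offsets as the Span objects above, written as plain tuples.
mf_spans = [
    [(35, 39, "PERSON", "John")],
    [(15, 20, "PRODUCT", "apple")],
]

print(check_offsets(mf_samples, mf_spans))  # [] means every offset matches its text
```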

In [6]:
# load our model
nlp = spacy.load("en_core_web_sm")

We now define the prediction function, which runs inference and converts the model output into `Span` objects.

In [7]:
def predict(samples: Union[List[str], str]) -> List[List[Span]]:
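    """Run the spaCy pipeline on one or more samples and return, for each sample, its predicted entities as `Span` objects."""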
    if isinstance(samples, str):
        samples = [samples]
        
    all_spans = []
    for sample in samples:
        doc = nlp(sample)
        all_spans.append([Span(start=ent.start_char, end=ent.end_char, label=ent.label_, text=ent.text) for ent in doc.ents])
    return all_spans
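The contract of `predict` is simple: it accepts a string or a list of strings and returns one list of spans per sample. As an illustration of that contract only, here is a stub predictor with the same input and output shape that needs neither spaCy nor `nlptest`. `SimpleSpan`, `GAZETTEER`, and `stub_predict` are hypothetical names introduced for this sketch; the gazetteer lookup is a toy and not how the spaCy model works.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class SimpleSpan:
    """Hypothetical stand-in for nlptest's Span, with the same four fields."""
    start: int
    end: int
    label: str
    text: str


# Toy lookup table, for illustration only.
GAZETTEER = {"NYC": "LOC", "Miami": "LOC", "John": "PERSON"}


def stub_predict(samples: Union[List[str], str]) -> List[List[SimpleSpan]]:
    """Return one list of spans per sample, mirroring the shape of `predict`."""
    if isinstance(samples, str):
        samples = [samples]
    all_spans = []
    for sample in samples:
        spans = []
        for term, label in GAZETTEER.items():
            # Whole-word matches only; offsets are character-based, end exclusive.
            for match in re.finditer(rf"\b{re.escape(term)}\b", sample):
                spans.append(SimpleSpan(match.start(), match.end(), label, term))
        all_spans.append(sorted(spans, key=lambda s: s.start))
    return all_spans


print(stub_predict("I am flying to NYC with John."))
```

A stub like this can be handy for checking the test setup itself before plugging in a real model.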

In [8]:
# We define the first behavior to test
invariance_loc_pers = SpanClassificationBehavior(
    capability="Class agnotic",
    name="Loc pers invariance behavior",
    test_type=BehaviorType.invariance,
    samples=loc_pers_samples,
    predict_fn=predict,
    labels=labels,
    description="Checking model invariance on LOC & PERS entities."
)

In [9]:
# We define the second behavior to test
mf_behavior = SpanClassificationBehavior(
    capability="Be good",
    name="Random tests",
    test_type=BehaviorType.minimum_functionality,
    samples=mf_samples,
    predict_fn=predict,
    labels=mf_labels,
    description="Random tests"
)

We can now define our suite of tests and evaluate it.

In [10]:
testpack = TestPack(behaviors=[mf_behavior, invariance_loc_pers], performer=Performer())

In [11]:
testpack.run()

In [12]:
print(testpack.performer.tabulate_result())

-------------------------------------  --------
Total                                  0.181818
Name - Random tests                    0
Name - Loc pers invariance behavior    0.222222
Behavior type - minimum functionality  0
Behavior type - invariance             0.222222
Capability - Be good                   0
Capability - Class agnotic             0.222222
-------------------------------------  --------
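
The reported scores are consistent with simple pass rates over samples. The sketch below reproduces the numbers under two assumptions that the notebook does not confirm: that `generate_all=True` produced all 3 x 3 = 9 template combinations, and that `Performer` reports the fraction of passed samples. Under those assumptions, 2 of the 9 invariance samples and 0 of the 2 minimum functionality samples passed.

```python
from fractions import Fraction

n_invariance = 3 * 3     # assumed: 3 locations x 3 people
n_mf = 2                 # two hand-written minimum functionality samples
passed_invariance = 2    # assumed, inferred from 0.222222 = 2/9
passed_mf = 0            # reported score of 0

invariance_rate = Fraction(passed_invariance, n_invariance)
total_rate = Fraction(passed_invariance + passed_mf, n_invariance + n_mf)

print(round(float(invariance_rate), 6))  # 0.222222
print(round(float(total_rate), 6))       # 0.181818
```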
