In [1]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV

import numpy as np



For this tutorial, we will assume that our task is sentiment analysis.

In [2]:
editor = Editor()

## Minimum Functionality Test (MFT)

A Minimum Functionality Test is like a unit test in Software Engineering.
If you are testing a certain capability (e.g. 'can the model handle negation?'), an MFT is composed of simple examples that verify a specific behavior.  
Let's create a very simple MFT for negations:

In [3]:
# First, let's find some positive and negative adjectives
thing = ['plot', 'movie', 'show', 'storyline']
', '.join(editor.suggest('This is not {a:mask} {thing}.', thing=thing)[:30])


'easy, ordinary, original, good, interesting, action, exciting, enjoyable, independent, innocent, average, entertaining, actual, old, ideal, great, normal, unusual, excellent, adult, introductory, individual, animated, origin, epic, new, amazing, acceptable, alternative, anime'

In [4]:
pos = ['original','interesting','entertaining','lovely','good', 'enjoyable', 'exciting', 'excellent', 'amazing', 'great', 'engaging']
neg = ['bad', 'terrible', 'awful', 'horrible','boring','unoriginal','sleep-inducing']

Now let's create some data with negated positive and negated negative adjectives, assuming `1` means positive and `0` means negative:

In [5]:
ret = editor.template('This is not {a:pos} {thing}.', pos=pos, thing=thing, labels=0, save=True, nsamples=100)
ret += editor.template('This is not {a:neg} {thing}.', neg=neg, thing=thing, labels=1, save=True, nsamples=100)
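
To make the idea of template expansion concrete, here is a minimal sketch in plain Python. It is not CheckList's implementation: `fill_template` and `with_article` are hypothetical helpers, the `{a_pos}` placeholder stands in for `{a:pos}`, and the article rule is a crude vowel check. It also enumerates every combination, whereas the cell above asks for a sample of 100.

```python
from itertools import product

def with_article(word):
    # Crude heuristic (an assumption): "an" before a vowel letter, else "a".
    return ('an ' if word[0].lower() in 'aeiou' else 'a ') + word

def fill_template(template, **fillers):
    """Expand `template` over every combination of the filler lists."""
    keys = list(fillers)
    out = []
    for combo in product(*(fillers[k] for k in keys)):
        values = dict(zip(keys, combo))
        # Offer an "a_<key>" variant that carries the article for each slot.
        values.update({'a_' + k: with_article(v) for k, v in values.items()})
        out.append(template.format(**values))
    return out

sentences = fill_template('This is not {a_pos} {thing}.',
                          pos=['original', 'good'], thing=['plot', 'movie'])
labels = [0] * len(sentences)  # a negated positive adjective is labelled negative (0)

for s in sentences:
    print(s)
```

Every generated sentence shares one label because the template itself fixes the expected sentiment.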

We can easily turn this data into an MFT:

In [6]:
test = MFT(ret.data, labels=ret.labels, name='Simple negation',
           capability='Negation', description='Very simple negations.')

### Running tests

Let's use an off-the-shelf sentiment analysis model.

In [7]:
from pattern.en import sentiment

In [8]:
import numpy as np
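def predict_proba(inputs):
    # pattern's polarity lies in [-1, 1]; rescale it to [0, 1] and treat it as P(positive)
    p1 = np.array([(sentiment(x)[0] + 1)/2. for x in inputs]).reshape(-1, 1)
    p0 = 1 - p1
    # One row per input: [P(negative), P(positive)]
    return np.hstack((p0, p1))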

There are two ways of running tests.  
In the first (and simplest) way, you pass a function as argument to `test.run`, which gets called to make predictions.  
We assume that the function returns a tuple `(predictions, confidences)`, so CheckList provides a wrapper that converts a function returning softmax probabilities (like ours above) into this form:

In [11]:
from checklist.pred_wrapper import PredictorWrapper
wrapped_pp = PredictorWrapper.wrap_softmax(predict_proba)
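
For intuition, the `(predictions, confidences)` contract can be sketched in plain Python as below. This is an illustration of the idea, not the library's code: `wrap_softmax_sketch` and `toy_predict_proba` are hypothetical names, and the toy model is a stand-in for a real classifier.

```python
def wrap_softmax_sketch(softmax_fn):
    """Turn a function returning per-class probabilities into one
    returning (predictions, confidences)."""
    def pred_and_conf(inputs):
        confs = softmax_fn(inputs)
        # Predicted label = index of the highest probability in each row.
        preds = [max(range(len(row)), key=row.__getitem__) for row in confs]
        return preds, confs
    return pred_and_conf

def toy_predict_proba(inputs):
    # Hypothetical stand-in model: leans positive if the text contains "good".
    return [[0.2, 0.8] if 'good' in x else [0.7, 0.3] for x in inputs]

preds, confs = wrap_softmax_sketch(toy_predict_proba)(['a good plot', 'a bad plot'])
print(preds)
```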

Once you have this function, running the test is as simple as calling `test.run`.  
You can run the test on a subset of test cases (for speed's sake) by specifying `n` if needed.  
(We won't do that here since our test is small.)

In [13]:
test.run(wrapped_pp)

Predicting 200 examples


Once you run a test, you can print a summary of the results with `test.summary()`:

In [15]:
test.summary()

Test cases:      200
Fails (rate):    121 (60.5%)

Example fails:
0.7 This is not an exciting movie.
----
0.5 This is not a sleep-inducing show.
----
1.0 This is not an excellent show.
----


It seems that this off-the-shelf system has trouble with negation.
Note the failures: examples that should be negative are predicted as positive and vice versa (the number shown is the predicted probability of the positive class).
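
How a failure is counted can be sketched as follows. This is a simplified reading, not CheckList's code: the first three rows reuse the rounded probabilities from the failures listed above, the fourth row is a hypothetical passing case, and sending a tie at 0.5 to class `0` is an assumption.

```python
cases = [
    # (text, expected label, P(positive))
    ('This is not an exciting movie.', 0, 0.7),
    ('This is not a sleep-inducing show.', 1, 0.5),
    ('This is not an excellent show.', 0, 1.0),
    ('Hypothetical passing case.', 1, 0.9),  # invented for illustration
]

def predicted_label(p_positive):
    # Assumption: a tie at 0.5 goes to class 0.
    return 1 if p_positive > 0.5 else 0

# A test case fails when the predicted label differs from the expected one.
fails = [(p, text) for text, expected, p in cases if predicted_label(p) != expected]
fail_rate = len(fails) / len(cases)
print(f'Fails (rate):    {len(fails)} ({fail_rate:.1%})')
```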

If you are using Jupyter notebooks, you can use `test.visual_summary()` for an interactive visualization of these results:  
(I'll load a gif so you can see this in preview mode.)

In [16]:
# from IPython.display import HTML, Image
# with open('visual_summary.gif','rb') as f:
#     display(Image(data=f.read(), format='png'))
test.visual_summary()

TestSummarizer(stats={'npassed': 79, 'nfailed': 121, 'nfiltered': 0}, summarizer={'name': 'Simple negation', '…

The second way to run a test is from a prediction file.  
First, we export the test into a text file:

In [17]:
test.to_raw_file('/tmp/raw_file.txt')
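
The next step in this workflow is to produce a prediction file for those examples. The sketch below shows one way to do that under stated assumptions: the raw file holds one example per line, the prediction file holds one line of space-separated class probabilities per example, and `toy_predict_proba` is a hypothetical stand-in for the real model. Temporary paths are used so the sketch runs on its own.

```python
import os
import tempfile

def toy_predict_proba(texts):
    # Hypothetical stand-in model: returns [P(negative), P(positive)] per text.
    return [[0.25, 0.75] if 'good' in t else [0.6, 0.4] for t in texts]

workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, 'raw_file.txt')
pred_path = os.path.join(workdir, 'softmax_preds.txt')

# Stand-in for the exported test file (assumed layout: one example per line).
with open(raw_path, 'w') as f:
    f.write('This is not a good plot.\nThis is not a bad movie.\n')

with open(raw_path) as f:
    docs = f.read().splitlines()

# Write one row of probabilities per example, in the same order as the inputs.
with open(pred_path, 'w') as f:
    for probs in toy_predict_proba(docs):
        f.write(' '.join(f'{p:f}' for p in probs) + '\n')

with open(pred_path) as f:
    pred_lines = f.read().splitlines()
print(pred_lines)
```

The resulting file is then handed back to the test object. Check the CheckList documentation for the exact loading call and the file formats it accepts.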

## Invariance tests

An Invariance test (INV) applies label-preserving perturbations to inputs and expects the model prediction to remain the same.  
Let's start by creating a fictitious dataset to serve as an example, and process it with spaCy.

In [19]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [20]:
dataset = ['This was a very nice movie directed by John Smith.',
           'Mary Keen was brilliant.',
           'I hated everything about this.',
           'This movie was very bad.',
           'I really liked this movie.',
           'just bad.',
           'amazing.',
           ]
pdataset = list(nlp.pipe(dataset))

Now let's apply a simple perturbation: changing people's names and expecting predictions to remain the same:

In [21]:
t = Perturb.perturb(pdataset, Perturb.change_names)
print('\n'.join(t.data[0][:3]))
print('...')
test = INV(**t)

This was a very nice movie directed by John Smith.
This was a very nice movie directed by Michael James.
This was a very nice movie directed by Christopher Ward.
...


In [22]:
test.run(wrapped_pp)
test.summary()

Predicting 22 examples
Test cases:      2
Fails (rate):    0 (0.0%)


Let's try a different test: adding typos and expecting predictions to remain the same:

In [23]:
t = Perturb.perturb(dataset, Perturb.add_typos)
print('\n'.join(t.data[0][:3]))
print('...')
test = INV(**t)

This was a very nice movie directed by John Smith.
This was a very nice movie directed by John Smit.h
...
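
The typos in this tutorial's output ("Smit.h", "amaizng", "brilliatn") all look like two neighbouring characters swapped. A minimal sketch of such a perturbation, assuming that is all the typo function does, could look like this. `swap_adjacent` is a hypothetical helper, not the library's function.

```python
import random

def swap_adjacent(text, rng):
    """Return `text` with one random pair of neighbouring characters swapped."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = 'Mary Keen was brilliant.'
perturbed = swap_adjacent(original, rng)
print(perturbed)
```

Swapping two identical neighbours (such as the "ee" in "Keen") leaves the text unchanged, so a real perturbation function may want to filter out such no-ops.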


In [24]:
test.run(wrapped_pp)
test.summary()

Predicting 14 examples
Test cases:      7
Fails (rate):    2 (28.6%)

Example fails:
0.8 amazing.
0.5 amaizng.

----
0.9 Mary Keen was brilliant.
0.5 Mary Keen was brilliatn.

----
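
The pairs above show the original input followed by a perturbed version whose prediction changed. A simplified sketch of that invariance check is given below. It compares predicted labels only, sends a tie at 0.5 to class `0`, and includes one hypothetical stable case. CheckList's actual expectation function may differ, for example by applying a tolerance.

```python
def label(p_positive):
    # Assumption: a tie at 0.5 goes to class 0.
    return 1 if p_positive > 0.5 else 0

def inv_fails(probs_positive):
    """probs_positive[0] belongs to the original input, the rest to its perturbations."""
    original = label(probs_positive[0])
    return any(label(p) != original for p in probs_positive[1:])

test_cases = {
    'amazing.': [0.8, 0.5],
    'Mary Keen was brilliant.': [0.9, 0.5],
    'Hypothetical stable case.': [0.9, 0.8],  # invented for illustration
}
failed = [text for text, probs in test_cases.items() if inv_fails(probs)]
print(failed)
```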


In [25]:
import csv

inputs = []
# Print the first six reviews in data.csv, closing the file afterwards
with open('data.csv', newline='') as f:
    r = csv.DictReader(f)
    for count, row in enumerate(r, start=1):
        print(row['review'])
        if count > 5:
            break

Cinderella is a beautiful film, with beautiful songs of course. In fact, it's one of the best films of the 1950's.<br /><br />I think all the characters are portrayed amazingly. You can see the cruelness of Cinderella's stepsisters and her stepmother, the sweetness of Cinderella. The mice are funny and sweet too.<br /><br />I think they changed the tale a bit, but I think it's for the best. It's such a nice film, and I don't think anyone could resist it deep down.<br /><br />I give it a 8/10. I don't think it's the best Disney film. But it sure is a true classic.
HLOTS was an outstanding series, its what NYPD Blue will never be, on HLOTS the plots are real, the dialog is real, the Relationships are real. With HLOTS back as a movie, Tying up all the loose ends, it was good to have all the gang back together, even a few that passed away show up (wont say how) The storyline was fast paced, emotional and full of the spirit the series had week in and week out. Homicide , Life on the Streets