In [46]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
from checklist.test_suite import TestSuite
from checklist.expect import Expect

If you read the [paper](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf), you know that CheckList is more than this package; it is also a process.  
This tutorial walks through a short version of that process, but you should really read the paper if you haven't :)

# Task and Model: QQP, BERT

For the purpose of this tutorial, we'll use Quora Question Pairs (QQP) as an example, with [a finetuned BERT model hosted by TextAttack](https://huggingface.co/textattack/bert-base-uncased-QQP).
**Please note that this is not the model reported in the paper -- we finetuned that model locally.** 
Here, we instead use a model that is available online (loaded through [Huggingface Pipeline](https://huggingface.co/transformers/main_classes/pipelines.html)), so that you can easily follow the tutorial.

Loading the model and spacy:

In [47]:
import sys
import spacy
import numpy as np
processor = spacy.load('en_core_web_sm')

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "textattack/bert-base-uncased-QQP"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# "sentiment-analysis" is the general Huggingface pipeline name for text classification tasks.
# device=0 selects the first GPU; set device=-1 if you don't have a GPU
pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, framework="pt", device=0)

Device set to use cuda:0


Loading the dataset. First, install the Huggingface `datasets` library (formerly released as [`nlp`](https://huggingface.co/nlp/quicktour.html)):

In [48]:
!pip install datasets



In [99]:
from datasets import load_dataset

qqp_data = load_dataset('glue', 'qqp', split='validation')
all_questions = set()
q1s = [d["question1"] for d in qqp_data]
q2s = [d["question2"] for d in qqp_data]
labels = np.array([d["label"] for d in qqp_data]).astype(int)

qs = list(zip(q1s, q2s))
qqp_data[0]

{'question1': 'Why are African-Americans so beautiful?',
 'question2': 'Why are hispanics so beautiful?',
 'label': 0,
 'idx': 0}

Preprocess all the questions with spaCy. This may take some time.

In [51]:
from tqdm import tqdm
all_questions.update(set(q1s))
all_questions.update(set(q2s))
print(f"Total count of unique questions: {len(all_questions)}")
processed_qs = list(tqdm(processor.pipe(all_questions, batch_size=64)))

Total count of unique questions: 73324


73324it [00:50, 1451.07it/s]


In [52]:
spacy_map = {q: processed_q for (q, processed_q) in zip(all_questions, processed_qs)}
parsed_qs = [(spacy_map[q[0]], spacy_map[q[1]]) for q in qs]

# Top-Down approach: the CheckList matrix

## Capabilities x Test Types

In tutorial #3, we talked about specific test types.  
In order to guide test ideation, it's useful to think of CheckList as a matrix of Capabilities x Test Types.  
*Capabilities* refers to general-purpose linguistic capabilities, which manifest in one way or another in almost any NLP application.   
We suggest that anyone CheckListing a model go through *at least* the following capabilities, trying to create MFTs, INVs, and DIRs for each if possible.
1. **Vocabulary + POS**: important words or groups of words (by part-of-speech) for the task
2. **Taxonomy**: synonyms, antonyms, word categories, etc
3. **Robustness**: to typos, irrelevant additions, contractions, etc
4. **Named Entity Recognition (NER)**: person names, locations, numbers, etc
5. **Fairness**
6. **Temporal understanding**: understanding order of events and how they impact the task
7. **Negation**
8. **Coreference** 
9. **Semantic Role Labeling (SRL)**: understanding roles such as agent, object, passive/active, etc
10. **Logic**: symmetry, consistency, conjunctions, disjunctions, etc

Notice that we are framing this as a very top-down approach: you start with a list of capabilities and try to think of what kinds of tests can be created, based on the three test types. We'll talk about how to incorporate some bottom-up thinking later on.
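
One lightweight way to keep track of the matrix while brainstorming is a plain dictionary keyed by (capability, test type). The sketch below is only an illustration and is not part of the CheckList API: `CAPABILITIES`, `TEST_TYPES`, `matrix`, `register`, and `empty_cells` are hypothetical names, and the registered test names are the NER tests built later in this tutorial.

```python
from itertools import product

CAPABILITIES = [
    "Vocabulary + POS", "Taxonomy", "Robustness", "NER", "Fairness",
    "Temporal", "Negation", "Coreference", "SRL", "Logic",
]
TEST_TYPES = ["MFT", "INV", "DIR"]

# One cell per (capability, test type); each cell holds the names of planned tests.
matrix = {cell: [] for cell in product(CAPABILITIES, TEST_TYPES)}


def register(capability, test_type, test_name):
    """Record a test idea in the matching cell of the matrix."""
    matrix[(capability, test_type)].append(test_name)


def empty_cells():
    """Return the cells that have no test idea yet."""
    return [cell for cell, tests in matrix.items() if not tests]


register("NER", "MFT", "same adjectives, different people")
register("NER", "INV", "Change same name in both questions")
register("NER", "DIR", "Change name in one of the questions")

print(len(matrix), "cells,", len(empty_cells()), "still empty")
```

Listing the empty cells is a quick reminder of which capability and test type combinations have not been considered yet.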

We won't try to create tests for **all** of these capabilities (but we do have notebooks with tests for all of them in the repo), just one as an example. 
Anyway, let's create a test suite (used to save and aggregate tests):

In [7]:
suite = TestSuite()
editor = Editor()

## Capability: NER

Let's start with the NER capability.  
How do named entities impact duplicate question detection? 


### MFT
It seems that the model should at least be able to distinguish questions about different people as non-duplicates.   
Let's write an MFT where we have two people that have the same last name, but different first names.  
Instead of running the test now, we'll add it to the suite and run all tests later.

In [55]:
t = editor.template((
    'Is {first_name} {last_name} {mask}?',
    'Is {first_name2} {last_name} {mask}?',
    ),
    remove_duplicates=True, 
    nsamples=300)
test = MFT(**t, labels=0, name='same adjectives, different people', capability = 'NER',
          description='Different first name, same adjective and last name')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])

('Is Mary Marshall lying?', 'Is Lynn Marshall lying?')
('Is Don Peterson Jewish?', 'Is Carl Peterson Jewish?')
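
Conceptually, an MFT with `labels=0` fails on an example whenever the predicted label is not 0. The sketch below shows that check in plain Python; it is not CheckList's implementation, `toy_predict` is a hypothetical stand-in for a real model, and `mft_failure_rate` is a hypothetical helper.

```python
def toy_predict(pair):
    """Hypothetical stand-in model: duplicate (1) only if both questions match exactly."""
    q1, q2 = pair
    return 1 if q1.lower() == q2.lower() else 0


def mft_failure_rate(pairs, expected_label, predict):
    """Return the fraction of failing examples and the failing examples themselves."""
    fails = [pair for pair in pairs if predict(pair) != expected_label]
    return len(fails) / len(pairs), fails


toy_mft_pairs = [
    ("Is Mary Marshall lying?", "Is Lynn Marshall lying?"),
    ("Is Don Peterson Jewish?", "Is Carl Peterson Jewish?"),
]
rate, fails = mft_failure_rate(toy_mft_pairs, expected_label=0, predict=toy_predict)
print(f"failure rate: {rate:.1%}")
```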


### INV
If you have two questions with the same named entity, changing the entity on both should not change whether the questions are duplicates or not.  
Let's write an INV for this.

Since we are dealing with pairs of questions, we have to write a wrapper to make sure the same name is changed on both:

In [57]:
def change_name_on_both(qs):
    q1, q2 = qs
    c1 = Perturb.change_names(q1, seed=1, meta=True)
    c2 = Perturb.change_names(q2, seed=1, meta=True)
    if not c1 or not c2:
        return
    # separating out examples and meta. Meta has tuples (a, b), where name 'a' was changed to 'b'
    c1, m1 = c1
    c2, m2 = c2
    # Only include examples where the same name was changed on both questions
    return [(new_q1, new_q2) for new_q1, new_q2, meta1, meta2 in zip(c1, c2, m1, m2) if meta1 == meta2][:10]

In [59]:
t = Perturb.perturb(parsed_qs, change_name_on_both, nsamples=200)
test = INV(**t, name='Change same name in both questions', capability='NER',
          description='')
suite.add(test, overwrite=True)
print(t.data[0][0])
print(t.data[0][1])
print(t.data[0][2])

('Who do you think will win, Trump or Hillary?', 'Who is going to win, Trump or Hillary?')
('Who do you think will win, Trump or Kayla?', 'Who is going to win, Trump or Kayla?')
('Who do you think will win, Trump or Kimberly?', 'Who is going to win, Trump or Kimberly?')
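
Each test case above is a group: the original pair followed by its perturbed versions. A simplified view of the INV check is that every perturbed pair should receive the same label as the original. The sketch below compares labels only and does not model any tolerance the library's own expectation may apply; `inv_failures` and `toy_inv_predict` are hypothetical names, and the toy model is a stand-in, not the BERT model.

```python
def inv_failures(group, predict):
    """Given [original, perturbed_1, ...], return perturbed examples whose label differs from the original's."""
    original, *perturbed = group
    original_label = predict(original)
    return [ex for ex in perturbed if predict(ex) != original_label]


def toy_inv_predict(pair):
    """Hypothetical stand-in model: duplicate (1) if both questions end with the same word."""
    q1, q2 = pair
    return 1 if q1.split()[-1] == q2.split()[-1] else 0


toy_group = [
    ("Who do you think will win, Trump or Hillary?", "Who is going to win, Trump or Hillary?"),
    ("Who do you think will win, Trump or Kayla?", "Who is going to win, Trump or Kayla?"),
    ("Who do you think will win, Trump or Kimberly?", "Who is going to win, Trump or Kimberly?"),
]
print(inv_failures(toy_group, toy_inv_predict))
```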


### DIR
Conversely, if an entity is present on a pair the model predicts as a duplicate and we change it to something else on *only one* of the sentences, the prediction should change to non-duplicate.  
Let's write this as a DIR test:

In [60]:
def change_name_on_one(qs):
    q1, q2 = qs
    c1 = Perturb.change_names(q1, seed=1, meta=True)
    c2 = Perturb.change_names(q2, seed=1, meta=True)
    if not c1 or not c2:
        return
    c1, m1 = c1
    c2, m2 = c2
    ret = []
    ret.extend([(q1_, str(q2)) for q1_, m1_ in zip(c1, m1) if m1_[0] in str(q2)])
    ret.extend([(str(q1), q2_) for q2_, m2_ in zip(c2, m2) if m2_[0] in str(q1)])
    return ret

We'll write an expectation function in two steps.  
First, we want the prediction to be 0.  
Second, we only want to include examples where the original prediction is one. We do this with a slice wrapper:

In [61]:
expect_fn = Expect.eq(0)
expect_fn = Expect.slice_orig(expect_fn, lambda orig, *args: orig == 1)
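
To make the two steps concrete, here is a plain-Python sketch of the same idea: a check on the new prediction, wrapped so that it only applies when the original prediction was 1. This mimics the behavior described above rather than CheckList's internals; `expect_eq`, `slice_on_original`, and `toy_expect` are hypothetical names, and using `None` to mark skipped examples is a convention of this sketch.

```python
def expect_eq(value):
    """Build a check that passes when the new prediction equals `value`."""
    return lambda orig_pred, new_pred: new_pred == value


def slice_on_original(check, keep):
    """Apply `check` only where `keep(orig_pred)` holds; return None (skipped) otherwise."""
    def wrapped(orig_pred, new_pred):
        if not keep(orig_pred):
            return None
        return check(orig_pred, new_pred)
    return wrapped


toy_expect = slice_on_original(expect_eq(0), lambda orig: orig == 1)

# (original prediction, prediction after changing the name in one question)
cases = [(1, 0), (1, 1), (0, 0), (0, 1)]
results = [toy_expect(orig, new) for orig, new in cases]
print(results)
```

Only the first two cases count toward the test: the first passes, the second fails, and the last two are skipped because the original pair was not predicted as a duplicate.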


Let's put it all together into a test:

In [63]:
t = Perturb.perturb(parsed_qs, change_name_on_one, nsamples=200)
name = 'Change name in one of the questions'
desc = 'Take pairs that are originally predicted as duplicates, change name in one of them and expect new prediction to be non-duplicate'
test = DIR(**t, expect=expect_fn, name=name, description=desc, capability='NER')
suite.add(test, overwrite=True)
print(t.data[0][0])
print(t.data[0][1])
print(t.data[0][2])

('How was the personal relationship between Steve Jobs and Bill Gates?', 'Did Steve Jobs hate Bill Gates?')
('How was the personal relationship between Steve Jobs and Michael Brooks?', 'Did Steve Jobs hate Bill Gates?')
('How was the personal relationship between Steve Jobs and Christopher Cox?', 'Did Steve Jobs hate Bill Gates?')


These examples illustrate how thinking through the matrix can help test ideation. We now turn to a bottom-up approach.

# Bottom-up approach

In this approach, we look at specific examples (from the validation dataset or elsewhere) and try to generalize them into MFTs, INVs or DIRs, placing them into a specific capability.  
Here is an example:

In [64]:
np.random.seed(14)
i = np.random.choice(len(qs))
qs[i]

('Which company should I join as a fresher, TCS or Virtusa?',
 'Is it a good decision to join Tcs as a fresher?')

This is a good example, in which one question asks about a comparison between two options, while the other asks about a single option.  
While they are not duplicates, it is possible that models would get confused here. I think this test fits into the Vocabulary+POS capability (it's not crucial for us to be completely precise about where a test fits).  
Let's try to create an MFT out of it:

In [65]:
', '.join(editor.suggest('{mask} is a large tech company.')[:40])

'Apple, Google, Facebook, This, Microsoft, Amazon, Uber, It, Intel, Samsung, Netflix, Tesla, Twitter, LinkedIn, Oracle, Snap, Target, Disney, AMD, Bloomberg, Sony, That, Wikipedia, China, Fox, Here, this, FB, YouTube, HP, Reddit, Ford, Harris, Pinterest, MIT, GE, Dialog, Square, CBS, Orange'

In [66]:
companies = ['Apple', 'Google', 'Facebook', 'Microsoft', 'Amazon', 'Uber', 'Intel', 'Samsung', 'Netflix', 'Tesla', 'LinkedIn', 'Oracle', 'Target', 'Snap', 'Disney', 'AMD', 'Sony', 'Reddit', 'Youtube']

In [67]:
', '.join(editor.suggest('Should I join {company} as a {mask}?', company=companies)[:30])

'customer, shareholder, member, developer, volunteer, competitor, contributor, contractor, professional, student, buyer, writer, rookie, manager, consumer, worker, pro, teenager, client, sponsor, subscriber, consultant, CEO, user, Beta, director, beta, researcher, citizen, result'

In [68]:
role = ['developer', 'contributor', 'freshman', 'college grad', 'volunteer', 'writer', 'contractor', 'consultant']

In [69]:
t = editor.template((
       'Which company should I join as a {role}, {company1} or {company2}?',
       'Should I join {company1} as a {role}?',
   ),
    company=companies,
    role=role,
    remove_duplicates=True,
    nsamples=100,
)
print(t.data[0])
print(t.data[1])
print(t.data[2])

('Which company should I join as a contributor, Facebook or Samsung?', 'Should I join Facebook as a contributor?')
('Which company should I join as a volunteer, Sony or LinkedIn?', 'Should I join Sony as a volunteer?')
('Which company should I join as a consultant, Apple or Disney?', 'Should I join Apple as a consultant?')
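
To see how the paired templates line up, the following plain-Python sketch fills both templates with the same role and the same pair of companies using `str.format`. It is an illustration only: it enumerates every combination instead of sampling as `editor.template` does with `nsamples`, it assumes the two companies should differ, and `fill_pairs` and the `toy_*` names are hypothetical.

```python
from itertools import permutations, product


def fill_pairs(templates, company_names, role_names):
    """Fill both templates with the same role and the same ordered pair of distinct companies."""
    filled = []
    for role_name, (c1, c2) in product(role_names, permutations(company_names, 2)):
        values = {"role": role_name, "company1": c1, "company2": c2}
        filled.append(tuple(tpl.format(**values) for tpl in templates))
    return filled


toy_templates = (
    "Which company should I join as a {role}, {company1} or {company2}?",
    "Should I join {company1} as a {role}?",
)
toy_pairs = fill_pairs(toy_templates, ["Apple", "Google", "Sony"], ["developer", "writer"])
print(len(toy_pairs))
print(toy_pairs[0])
```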


We've replicated the original example, but we can generalize it a bit to other comparisons:

In [70]:
', '.join([str(x) for x in editor.suggest('Will Google\'s {mask} {mask}?')][:50])

"('dominance', 'continue'), ('strategy', 'work'), ('efforts', 'succeed'), ('success', 'continue'), ('experiment', 'work'), ('dominance', 'last'), ('experiment', 'succeed'), ('dominance', 'end'), ('strategy', 'succeed'), ('efforts', 'work'), ('plan', 'work'), ('woes', 'continue'), ('growth', 'continue'), ('strategy', 'stick'), ('gamble', 'succeed'), ('approach', 'work'), ('dominance', 'endure'), ('failures', 'continue'), ('tactics', 'work'), ('experiments', 'work'), ('decision', 'stick'), ('success', 'last'), ('domination', 'continue'), ('lawsuit', 'succeed'), ('gamble', 'work'), ('push', 'succeed'), ('behavior', 'change'), ('ambitions', 'succeed'), ('dominance', 'fade'), ('dominance', 'persist'), ('strategy', 'change'), ('experiments', 'succeed'), ('dominance', 'return'), ('problems', 'continue'), ('policies', 'change'), ('scandals', 'continue'), ('strategy', 'continue'), ('plan', 'succeed'), ('focus', 'change'), ('dominance', 'survive'), ('takeover', 'succeed'), ('move', 'work'), ('mi

In [73]:
t += editor.template((
       'Which company\'s {fverb[0]} will {fverb[1]} {comp}, {company1} or {company2}?',
       'Will {company1}\'s {fverb[0]} {fverb[1]}?',
   ),
    company=companies,
    comp=['most', 'least', 'sooner', 'later'],
    fverb=[('stock', 'rise'), ('CEO', 'quit'), ('board', 'resign'), ('stock', 'fall'), ('effort', 'succeed'), ('strategy', 'work'), ('plan', 'work'), ('gamble', 'work'), ('focus', 'change'), ('intentions', 'change')],
    nsamples=300,
    remove_duplicates=True,
)
print(t.data[-1])
print(t.data[-2])
print(t.data[-3])

("Which company's board will resign most, Microsoft or Apple?", "Will Microsoft's board resign?")
("Which company's board will resign most, LinkedIn or Youtube?", "Will LinkedIn's board resign?")
("Which company's CEO will quit sooner, Youtube or Amazon?", "Will Youtube's CEO quit?")


In [76]:
test = MFT(**t, labels=0, name='Comparison between two entities is not the same as asking about one', capability = 'Vocabulary',
          description='')
suite.add(test, overwrite=True)


# Running the suite, seeing results

When running the prediction, the Huggingface pipeline returns one dict per input, with the predicted label and its probability:

In [77]:
example = ('Which company should I join as a freshman, Google or Facebook?', 'Should I join Google as a freshman?')
pipe([example], truncation=True, padding=True)


[{'label': 'LABEL_0', 'score': 0.9998934268951416}]

We write a simple wrapper to make the output compatible with CheckList:

In [82]:
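def pred_and_conf(data):
    # The pipeline returns one {'label': 'LABEL_k', 'score': p} dict per input pair
    raw_preds = pipe(data, padding=True, truncation=True)
    # The predicted class is the trailing digit of the label string
    preds = np.array([ int(p["label"][-1]) for p in raw_preds])
    # Probabilities ordered as [P(label 0), P(label 1)]
    pp = np.array([[p["score"], 1-p["score"]] if int(p["label"][-1]) == 0 else [1-p["score"], p["score"]] for p in raw_preds])
    return preds, pp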
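
The conversion can be checked without loading the model by feeding it a hand-written pipeline output. The sketch below uses plain lists instead of numpy arrays and assumes a binary task with labels of the form `LABEL_0` / `LABEL_1`; `to_preds_and_probs` and `fake_output` are hypothetical names, and the scores are made up for illustration.

```python
def to_preds_and_probs(raw_preds):
    """Convert [{'label': 'LABEL_k', 'score': s}, ...] into labels and [P(0), P(1)] rows."""
    preds, probs = [], []
    for p in raw_preds:
        label = int(p["label"].rsplit("_", 1)[-1])  # 'LABEL_1' -> 1
        score = p["score"]  # probability of the predicted label
        preds.append(label)
        probs.append([score, 1 - score] if label == 0 else [1 - score, score])
    return preds, probs


# Hand-written stand-in for the pipeline output (made-up scores)
fake_output = [
    {"label": "LABEL_0", "score": 0.875},
    {"label": "LABEL_1", "score": 0.75},
]
toy_preds, toy_probs = to_preds_and_probs(fake_output)
print(toy_preds)
print(toy_probs)
```

Each row of probabilities sums to 1, and the column with the larger value matches the predicted label.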


In [83]:
suite.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 299 examples
Running Change same name in both questions
Predicting 2065 examples
Running Change name in one of the questions
Predicting 3876 examples
Running Comparison between two entities is not the same as asking about one
Predicting 946 examples
Running How can I become more {synonym}?
Predicting 200 examples
Running How can I become more X = How can I become less antonym(X)
Predicting 600 examples


We can see a (text) summary of the results by calling `suite.summary()`

In [80]:
suite.summary()

Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      946
Fails (rate):    0 (0.0%)




Taxonomy

How can I become more {synonym}?
Test cases:      200
Fails (rate):    200 (100.0%)

Example fails:
0.0 ('How can I become more angry?', 'How can I become more furious?')
----
0.0 ('How can I become less organized?', 'How can I become less organised?')
----
0.0 ('How can I become less intelligent?', 'How can I become less smart?')
----


How can I become more X = How can I become less antonym(X)
Test cases:      600
Fails (rate):    600 (100.0%)

Example fails:
0.0 ('How can I become less progressive?', 'How can I become more conservative?')
----
0.0 ('How can I become less humble?', 'How can I become more proud?')
----
0.0 ('How can I become less stupid?', 'How can I become more smart?')
----




NER

same adjectives, different people
Test cases:      299
Fails (rate):    0 (0.0%)


Change same name in both questions
Test cases:      200
Fails (r

Or, if we're using Jupyter, we can use a nifty visualization that shows all of the tests we created in a matrix.  
You can navigate the matrix and see results for individual tests (*the screenshot below is based on our locally finetuned model, so the numbers may not match your results*).

In [81]:
# from IPython.display import HTML, Image
# with open('visual_table_summary.gif','rb') as f:
#     display(Image(data=f.read(), format='png'))
suite.visual_summary_table()

Please wait as we prepare the table data...



## Bonus: testing Taxonomy

Let's create a few additional tests for the Taxonomy capability.

In [84]:
tmp = []
x = editor.suggest('How can I become more {mask}?')
x += editor.suggest('How can I become less {mask}?')
for a in set(x):
    e = editor.synonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
        # keep the word together with its suggested synonyms
        tmp.append([a] + e)
print(', '.join([str(tuple(x)) for x in tmp][:50]))

('addicted', 'addict'), ('organised', 'organized', 'direct', 'engineer'), ('capable', 'open', 'able'), ('worried', 'upset'), ('religious', 'spiritual'), ('conservative', 'cautious'), ('effective', 'efficient'), ('bad', 'spoiled', 'sorry', 'risky', 'tough', 'defective'), ('inspired', 'divine'), ('desperate', 'heroic'), ('strict', 'rigid', 'stern'), ('stressed', 'distressed', 'stress'), ('intimidating', 'daunting'), ('ethical', 'honorable'), ('positive', 'confident'), ('educated', 'enlightened'), ('miserable', 'poor', 'suffering', 'low', 'pathetic', 'wretched'), ('evil', 'vicious'), ('alone', 'lonely', 'solitary'), ('nervous', 'anxious'), ('understanding', 'savvy'), ('organized', 'organised', 'direct'), ('hungry', 'thirsty'), ('important', 'authoritative', 'significant'), ('lonely', 'alone', 'solitary'), ('obnoxious', 'objectionable'), ('enlightened', 'educated', 'clear'), ('anxious', 'nervous'), ('independent', 'autonomous'), ('sensitive', 'sensible'), ('authoritarian', 'dictator'), ('f

Out of all of those, let's pick a few:

In [85]:
synonyms = [ ('spiritual', 'religious'), ('angry', 'furious'), ('organized', 'organised'),
            ('vocal', 'outspoken'), ('grateful', 'thankful'), ('intelligent', 'smart'),
            ('humble', 'modest'), ('courageous', 'brave'), ('happy', 'joyful'), ('scared', 'frightened'),
           ]

With these, we can create a simple MFT, where we expect the model to recognize these synonyms.  


In [87]:
t = editor.template(
    (
    'How can I become {moreless} {x[0]}?',
    'How can I become {moreless} {x[1]}?',
    ),
    x=synonyms,
    moreless=['more', 'less'],
    remove_duplicates=True, 
    nsamples=200)
name = 'How can I become more {synonym}?' 
desc = 'different (simple) templates where words are replaced with their synonyms'
test = MFT(**t, labels=1, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test, overwrite=True)
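
The `{x[0]}` and `{x[1]}` placeholders index into each synonym tuple. Python's built-in `str.format` accepts the same indexing syntax, so the plain-Python sketch below shows what the filled pairs look like. It enumerates all combinations rather than sampling and is not CheckList's implementation; `toy_synonyms`, `synonym_templates`, and `synonym_pairs` are hypothetical names.

```python
from itertools import product

toy_synonyms = [("angry", "furious"), ("happy", "joyful")]
synonym_templates = (
    "How can I become {moreless} {x[0]}?",
    "How can I become {moreless} {x[1]}?",
)

# Both templates in a pair share the same `moreless` value and the same synonym tuple.
synonym_pairs = [
    tuple(tpl.format(moreless=m, x=pair) for tpl in synonym_templates)
    for m, pair in product(["more", "less"], toy_synonyms)
]
for first_q, second_q in synonym_pairs:
    print(first_q, "|", second_q)
```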

Let's do the same with antonyms:

In [88]:
opps = []
x = editor.suggest('How can I become more {mask}?')
x += editor.suggest('How can I become less {mask}?')
for a in set(x):
    e = editor.antonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
        # keep the word together with its suggested antonyms
        opps.append([a] + e)
print(','.join([str(tuple(x)) for x in opps]))

('conservative', 'progressive', 'liberal'),('bad', 'good'),('pessimistic', 'optimistic'),('positive', 'negative'),('organic', 'functional'),('impatient', 'patient'),('hungry', 'thirsty'),('active', 'passive'),('optimistic', 'pessimistic'),('invisible', 'visible'),('negative', 'positive'),('cautious', 'brave'),('fat', 'lean', 'thin'),('relevant', 'irrelevant'),('powerless', 'powerful'),('corrupt', 'straight'),('rude', 'civil', 'polite'),('stupid', 'intelligent', 'smart'),('specific', 'general'),('unhappy', 'happy'),('emotional', 'intellectual'),('visible', 'invisible'),('passive', 'active'),('insecure', 'secure'),('progressive', 'conservative'),('courageous', 'fearful'),('individual', 'common'),('dependent', 'independent'),('shy', 'confident'),('uncomfortable', 'comfortable'),('defensive', 'offensive'),('humble', 'proud'),('hopeful', 'hopeless'),('conspicuous', 'invisible'),('irresponsible', 'responsible')


In [89]:
antonyms = [('progressive', 'conservative'),('religious', 'secular'),('positive', 'negative'),('defensive', 'offensive'),('rude',  'polite'),('optimistic', 'pessimistic'),('stupid', 'smart'),('negative', 'positive'),('unhappy', 'happy'),('active', 'passive'),('impatient', 'patient'),('powerless', 'powerful'),('visible', 'invisible'),('fat', 'thin'),('bad', 'good'),('cautious', 'brave'), ('hopeful', 'hopeless'),('insecure', 'secure'),('humble', 'proud'),('passive', 'active'),('dependent', 'independent'),('pessimistic', 'optimistic'),('irresponsible', 'responsible'),('courageous', 'fearful')]

In [91]:
t = editor.template([(
    'How can I become more {x[0]}?',
    'How can I become less {x[1]}?',
    ),
    (
    'How can I become less {x[0]}?',
    'How can I become more {x[1]}?',
    )],
    unroll=True,
    x=antonyms,
    remove_duplicates=True, 
    nsamples=300)
name = 'How can I become more X = How can I become less antonym(X)' 
desc = ''
test = MFT(**t, labels=1, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test, overwrite=True)

It would be easy to turn the synonym test into an INV as well (we do this in another notebook), but let's end here after we run the suite again and see the new results.

In [92]:
suite.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 299 examples
Running Change same name in both questions
Predicting 2065 examples
Running Change name in one of the questions
Predicting 3876 examples
Running Comparison between two entities is not the same as asking about one
Predicting 946 examples
Running How can I become more {synonym}?
Predicting 200 examples
Running How can I become more X = How can I become less antonym(X)
Predicting 600 examples


In [93]:
suite.summary()

Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      946
Fails (rate):    0 (0.0%)




Taxonomy

How can I become more {synonym}?
Test cases:      200
Fails (rate):    200 (100.0%)

Example fails:
0.0 ('How can I become more angry?', 'How can I become more furious?')
----
0.0 ('How can I become less happy?', 'How can I become less joyful?')
----
0.0 ('How can I become more scared?', 'How can I become more frightened?')
----


How can I become more X = How can I become less antonym(X)
Test cases:      600
Fails (rate):    600 (100.0%)

Example fails:
0.0 ('How can I become more progressive?', 'How can I become less conservative?')
----
0.0 ('How can I become less stupid?', 'How can I become more smart?')
----
0.0 ('How can I become less hopeful?', 'How can I become more hopeless?')
----




NER

same adjectives, different people
Test cases:      299
Fails (rate):    0 (0.0%)


Change same name in both questions
Test cases:      200
Fails (rate

In [97]:
# suite.visual_summary_table()
suite.save('./test.pkl')



In [98]:
# Reload the saved suite, rerun it, and summarize the reloaded copy
suite_path = './test.pkl'
test_suite = TestSuite.from_file(suite_path)
test_suite.run(pred_and_conf, overwrite=True)
test_suite.summary()

Running same adjectives, different people
Predicting 299 examples
Running Change same name in both questions
Predicting 2065 examples
Running Change name in one of the questions
Predicting 3876 examples
Running Comparison between two entities is not the same as asking about one
Predicting 946 examples
Running How can I become more {synonym}?
Predicting 200 examples
Running How can I become more X = How can I become less antonym(X)
Predicting 600 examples
Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      946
Fails (rate):    0 (0.0%)




Taxonomy

How can I become more {synonym}?
Test cases:      200
Fails (rate):    200 (100.0%)

Example fails:
0.0 ('How can I become more vocal?', 'How can I become more outspoken?')
----
0.0 ('How can I become more organized?', 'How can I become more organised?')
----
0.0 ('How can I become less courageous?', 'How can I become less brave?')
----


How can I become more X = How can I become less antonym(X)
