In [1]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
from checklist.test_suite import TestSuite
from checklist.expect import Expect

If you read the [paper](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf), you know that CheckList is more than this package, it's also a process.  
This tutorial is a short version of that process, but you should really read the paper if you haven't :)

# Task and Model: QQP, BERT

For the purpose of this tutorial, we'll use Quora Question Pair as an example, with [a finetuned BERT model hosted by Textattack](https://huggingface.co/textattack/bert-base-uncased-QQP).
**Please note that this is not the model reported in the paper -- we finetuned that model locally.** 
Here, we instead use a model that is available online (loaded through [Huggingface Pipeline](https://huggingface.co/transformers/main_classes/pipelines.html)), so that you can easily follow the tutorial.

Loading the model and spacy:

In [2]:
import sys
import spacy
import numpy as np
processor = spacy.load('en_core_web_sm')

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
# model_name = "textattack/bert-base-uncased-QQP"
model_name = "textattack/distilbert-base-uncased-QQP"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# sentiment analysis is a general name in Huggingface to load the pipeline for text classification tasks.
# set device=-1 if you don't have a gpu
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, framework="pt", device=0)

Device set to use cuda:0


Loading the dataset. First, install the [NLP dataset library](https://huggingface.co/nlp/quicktour.html):

In [3]:
!pip install nlp



In [4]:
# from nlp import load_dataset
# from huggingface_hub import login

# login("")

from datasets import load_dataset

qqp_data = load_dataset('glue', 'qqp', split='validation')
all_questions = set()
q1s = [d["question1"] for d in qqp_data]
q2s = [d["question2"] for d in qqp_data]
labels = np.array([d["label"] for d in qqp_data]).astype(int)

qs = list(zip(q1s, q2s))
qqp_data[0]

{'question1': 'Why are African-Americans so beautiful?',
 'question2': 'Why are hispanics so beautiful?',
 'label': 0,
 'idx': 0}

Preprocess all the questions with spacy. This may take sometime.

In [5]:
from tqdm import tqdm
all_questions.update(set(q1s))
all_questions.update(set(q2s))
print(f"Total count of unique questions: {len(all_questions)}")
processed_qs = list(tqdm(processor.pipe(all_questions, batch_size=64)))

Total count of unique questions: 73324


73324it [00:50, 1442.08it/s]


In [6]:
spacy_map = {q: processed_q for (q, processed_q) in zip(all_questions, processed_qs)}
parsed_qs = [(spacy_map[q[0]], spacy_map[q[1]]) for q in qs]

# Top-Down approach: the CheckList matrix

## Capabilities x Test Types

In tutorial #3, we talked about specific test types.  
In order to guide test ideation, it's useful to think of CheckList as a matrix of Capabilities x Test Types.  
*Capabilities* refers to general-purpose linguistic capabilities, which manifest in one way or another in almost any NLP application.   
We suggest that anyone CheckListing a model go through *at least* the following capabilities, trying to create MFTs, INVs, and DIRs for each if possible.
1. **Vocabulary + POS:** important words or groups of words (by part-of-speech) for the task
2. **Taxonomy**: synonyms, antonyms, word categories, etc
3. **Robustness**: to typos, irrelevant additions, contractions, etc
4. **Named Entity Recognition (NER)**: person names, locations, numbers, etc
5. **Fairness**
6. **Temporal understanding**: understanding order of events and how they impact the task
7. **Negation**
8. **Coreference** 
9. **Semantic Role Labeling (SRL)**: understanding roles such as agent, object, passive/active, etc
10. **Logic**: symmetry, consistency, conjunctions, disjunctions, etc

Notice that we are framing this as very top-down approach: you start with a list of capabilities and try to think of what kinds of tests can be created, based on the three test types. We'll talk about how to incorporate some bottom-up thinking later on.

We won't try to create tests for **all** of these capabilities (but we do have notebooks with tests for all of them in the repo), just one as an example. 
Anyway, let's create a test suite (used to save and aggregate tests):

In [7]:
suite_train = TestSuite()
suite_test = TestSuite()
editor = Editor()

## Capability: NER

Let's start with the NER capability.  
How do named entities impact duplicate question detection? 


### MFT
It seems that the model should at least be able to distinguish questions about different people as non-duplicates.   
Let's write an MFT where we have two people that have the same last name, but different first names.  
Instead of running the test now, we'll add it to the suite and run all tests later.

In [8]:
t_train = editor.template((
    'Hi {first_name} {last_name} {mask}!',
    'Hi {first_name2} {last_name} {mask}！',
    ),
    remove_duplicates=True, 
    nsamples=3000)

t_test = editor.template((
    'Is {first_name} {last_name} {mask}?',
    'Is {first_name2} {last_name} {mask}?',
    ),
    remove_duplicates=True, 
    nsamples=300)

train = MFT(**t_train, labels=0, name='same adjectives, different people', capability = 'NER',
          description='Different first name, same adjective and last name')
suite_train.add(train, overwrite=True)

test = MFT(**t_test, labels=0, name='same adjectives, different people', capability = 'NER',
          description='Different first name, same adjective and last name')
suite_test.add(test, overwrite=True)

print(t_train.data[0])
print(t_train.data[1])

  to_pred = torch.tensor(to_pred, device=self.device).to(torch.int64)


('Hi Alison Wright :)!', 'Hi Ian Wright :)！')
("Hi Ashley Gordon '!", "Hi Jessica Gordon '！")


### INV
If you have two questions with the same named entity, changing the entity on both should not change whether the questions are duplicates or not.  
Let's write an INV for this.

Since we are dealing with pairs of questions, we have to write a wrapper to make sure the same name is changed on both:

In [9]:
def change_name_on_both(qs):
    q1, q2 = qs
    c1 = Perturb.change_names(q1, seed=1, meta=True)
    c2 = Perturb.change_names(q2, seed=1, meta=True)
    if not c1 or not c2:
        return
    # separating out examples and meta. Meta has tuples (a, b), where name 'a' was changed to 'b'
    c1, m1 = c1
    c2, m2 = c2
    # Only include examples where the same name was changed on both questions
    return [(q1, q2) for q1, q2, m1, m2 in zip(c1, c2, m1, m2) if m1 == m2][:10]

In [10]:
t_train = Perturb.perturb(parsed_qs, change_name_on_both, nsamples=2000)
train = INV(**t_train, name='Change same name in both questions', capability='NER',
          description='')

t_test = Perturb.perturb(parsed_qs, change_name_on_both, nsamples=200)
test = INV(**t_test, name='Change same name in both questions', capability='NER',
          description='')

# test.run(new_pp)
# test.summary(3)
suite_train.add(train, overwrite=True)
suite_test.add(test, overwrite=True)

print(t_train.data[0][0])
print(t_train.data[0][1])
print(t_train.data[0][2])

("Why isn't Hillary Clinton in jail?", 'Could Hillary Clinton actually go to jail?')
("Why isn't Jennifer Brooks in jail?", 'Could Jennifer Brooks actually go to jail?')
("Why isn't Jessica Cox in jail?", 'Could Jessica Cox actually go to jail?')


### DIR
Conversely, if an entity is present on a pair the model predicts as a duplicate and we change it to something else on *only one* of the sentences, the prediction should change to non-duplicate.  
Let's write this as a DIR test:

In [11]:
def change_name_on_one(qs):
    q1, q2 = qs
    c1 = Perturb.change_names(q1, seed=1, meta=True)
    c2 = Perturb.change_names(q2, seed=1, meta=True)
    if not c1 or not c2:
        return
    c1, m1 = c1
    c2, m2 = c2
    ret = []
    ret.extend([(q1_, str(q2)) for q1_, m1_ in zip(c1, m1) if m1_[0] in str(q2)])
    ret.extend([(str(q1), q2_) for q2_, m2_ in zip(c2, m2) if m2_[0] in str(q1)])
    return ret

We'll write an expectation function in two steps.  
First, we want the prediction to be 0.  
Second, we only want to include examples where the original prediction is one. We do this with a slice wrapper:

In [12]:
# expect_fn = Expect.eq(0)
# expect_fn = Expect.slice_orig(expect_fn, lambda orig, *args: orig == 1)
expect_fn = Expect.monotonic(label=1, increasing=False, tolerance=0.1)


Let's put it all together into a test:

In [13]:
t_train = Perturb.perturb(parsed_qs, change_name_on_one, nsamples=2000)
t_test = Perturb.perturb(parsed_qs, change_name_on_one, nsamples=200)

name = 'Change name in one of the questions'
desc = 'Take pairs that are originally predicted as duplicates, change name in one of them and expect new prediction to be non-duplicate'
train = DIR(**t_train, expect=expect_fn, name=name, description=desc, capability='NER')
test = DIR(**t_test, expect=expect_fn, name=name, description=desc, capability='NER')

suite_train.add(train, overwrite=True)
suite_test.add(test, overwrite=True)
print(t_train.data[0][0])
print(t_train.data[0][1])
print(t_train.data[0][2])

('Who do you think will win, Trump or Hillary?', 'Who is going to win, Trump or Hillary?')
('Who do you think will win, Trump or Kayla?', 'Who is going to win, Trump or Hillary?')
('Who do you think will win, Trump or Kimberly?', 'Who is going to win, Trump or Hillary?')


These examples illustrate how thinking through the matrix can help test ideation. We now turn to a bottom up approach

# Bottom up approach

In this approach, we look at specific examples (from the validation dataset or elsewhere) and try to generalize them into MFTs, INVs or DIRs, placing them into a specific capability.  
Here is an example:

In [14]:
np.random.seed(14)
i = np.random.choice(len(qs))
qs[i]

('Which company should I join as a fresher, TCS or Virtusa?',
 'Is it a good decision to join Tcs as a fresher?')

This is a good example, in which a question asks about a comparison between two options, while the other question asks about a single option.  
While they are not duplicates, it is possible that models would get confused here. I think this test fits into the Vocabulary+POS capability (it's not crucial for us to be completely precise about where a test fits).  
Let's try to create an MFT out of it:

In [15]:
', '.join(editor.suggest('{mask} is a large tech company.')[:50])

'Apple, Google, Facebook, This, Microsoft, Amazon, Uber, It, Intel, Samsung, Netflix, Tesla, Twitter, LinkedIn, Oracle, Snap, Target, Disney, AMD, Bloomberg, Sony, That, Wikipedia, China, Fox, Here, this, FB, YouTube, HP, Reddit, Ford, Harris, Pinterest, MIT, GE, Dialog, Square, CBS, Orange, Sky, NVIDIA, Nintendo, Bloom, GM, NASA, He, LG, Tech, Buffer'

In [16]:
companies_string = 'Apple, Google, Facebook, This, Microsoft, Amazon, Uber, It, Intel, Samsung, Netflix, Tesla, Twitter, LinkedIn, Oracle, Snap, Target, Disney, AMD, Bloomberg, Sony, That, Wikipedia, China, Fox, Here, this, FB, YouTube, HP, Reddit, Ford, Harris, Pinterest, MIT, GE, Dialog, Square, CBS, Orange, Sky, NVIDIA, Nintendo, Bloom, GM, NASA, He, LG, Tech, Buffer'

companies = [company.strip() for company in companies_string.split(',')]


In [17]:
', '.join(editor.suggest('Should I join {company} as a {mask}?', company=companies)[:30])

'developer, member, customer, shareholder, volunteer, contributor, writer, student, competitor, professional, manager, rookie, contractor, consultant, subscriber, reader, pro, client, follower, buyer, fan, user, trader, journalist, millennial, beta, director, citizen, worker, result'

In [18]:
role_string = 'developer, member, customer, shareholder, volunteer, contributor, writer, student, competitor, professional, manager, rookie, contractor, consultant, subscriber, reader, pro, client, follower, buyer, fan, user, trader, journalist, millennial, beta, director, citizen, worker, result'

role = [role.strip() for role in role_string.split(',')]

In [19]:
t_train = editor.template((
       'Which company should {male} join as a {role}, {company1} or {company2}?',
       'Should {male} join {company1} as a {role}?',
   ),
    company=companies,
    role=role,
    remove_duplicates=True,
    nsamples=1000,
)

t_test = editor.template((
       'Which company should {female} join as a {role}, {company1} or {company2}?',
       'Should {female} join {company1} as a {role}?',
   ),
    company=companies,
    role=role,
    remove_duplicates=True,
    nsamples=100,
)
print(t_train.data[0])
print(t_train.data[1])
print(t_train.data[2])

('Which company should Billy join as a member, Sky or LinkedIn?', 'Should Billy join Sky as a member?')
('Which company should Philip join as a contractor, Amazon or Bloom?', 'Should Philip join Amazon as a contractor?')
('Which company should Jim join as a student, Tech or That?', 'Should Jim join Tech as a student?')


We've replicated the original example, but we can generalize it a bit to other comparisons:

In [20]:
', '.join([str(x) for x in editor.suggest('Will Google\'s {mask} {mask}?')][:50])

"('dominance', 'continue'), ('strategy', 'work'), ('efforts', 'succeed'), ('success', 'continue'), ('experiment', 'work'), ('dominance', 'last'), ('experiment', 'succeed'), ('dominance', 'end'), ('strategy', 'succeed'), ('efforts', 'work'), ('plan', 'work'), ('woes', 'continue'), ('growth', 'continue'), ('strategy', 'stick'), ('gamble', 'succeed'), ('approach', 'work'), ('dominance', 'endure'), ('failures', 'continue'), ('tactics', 'work'), ('experiments', 'work'), ('decision', 'stick'), ('success', 'last'), ('domination', 'continue'), ('lawsuit', 'succeed'), ('gamble', 'work'), ('push', 'succeed'), ('behavior', 'change'), ('ambitions', 'succeed'), ('dominance', 'fade'), ('dominance', 'persist'), ('strategy', 'change'), ('experiments', 'succeed'), ('dominance', 'return'), ('problems', 'continue'), ('policies', 'change'), ('scandals', 'continue'), ('strategy', 'continue'), ('plan', 'succeed'), ('focus', 'change'), ('dominance', 'survive'), ('takeover', 'succeed'), ('move', 'work'), ('mi

In [21]:
t_train += editor.template((
       'Which company\'s {fverb[0]} will {fverb[1]} {comp}, {company1} or {company2}?',
       'Will {company1}\'s {fverb[0]} {fverb[1]}?',
   ),
    company=companies,
    comp=['most', 'least', 'sooner', 'later'],
    fverb=[('stock', 'rise'), ('CEO', 'quit'), ('board', 'resign'), ('stock', 'fall'), ('effort', 'succeed'), ('strategy', 'work'), ('plan', 'work'), ('gamble', 'work'), ('focus', 'change'), ('intentions', 'change')],
    nsamples=3000,
    remove_duplicates=True,
)
t_test += editor.template((
       'Which company\'s {fverb[0]} will {fverb[1]} {comp}, {company1} or {company2}?',
       'Will {company1}\'s {fverb[0]} {fverb[1]}?',
   ),
    company=companies,
    comp=['most', 'least', 'sooner', 'later'],
    fverb=[('stock', 'rise'), ('CEO', 'quit'), ('board', 'resign'), ('stock', 'fall'), ('effort', 'succeed'), ('strategy', 'work'), ('plan', 'work'), ('gamble', 'work'), ('focus', 'change'), ('intentions', 'change')],
    nsamples=300,
    remove_duplicates=True,
)
print(t_train.data[-1])
print(t_train.data[-2])
print(t_train.data[-3])

("Which company's effort will succeed least, Oracle or Apple?", "Will Oracle's effort succeed?")
("Which company's board will resign sooner, Fox or China?", "Will Fox's board resign?")
("Which company's board will resign sooner, Orange or Reddit?", "Will Orange's board resign?")


In [22]:

train = MFT(**t_train, labels=0, name='Comparison between two entities is not the same as asking about one', capability = 'Vocabulary',
          description='')
test = MFT(**t_test, labels=0, name='Comparison between two entities is not the same as asking about one', capability = 'Vocabulary',
          description='')

suite_train.add(train, overwrite=True)
suite_test.add(test, overwrite=True)


# Running the suite_t, seeing results

When running the prediction, the Huggingface pipeline returns a dict with predicted label and probability:

In [23]:
example = ('Which company should I join as a freshman, Google or Facebook?', 'Should I join Google as a freshman?')
pipe([example], truncation=True, padding=True)


[{'label': 'LABEL_0', 'score': 0.6755980253219604}]

We write a simple wrapper to make the output compatible with CheckList:

In [24]:
def pred_and_conf(data):
    raw_preds = pipe(data, padding=True, truncation=True)
    preds = np.array([ int(p["label"][-1]) for p in raw_preds])
    pp = np.array([[p["score"], 1-p["score"]] if int(p["label"][-1]) == 0 else [1-p["score"], p["score"]] for p in raw_preds])
    return preds, pp

pred_and_conf([example])

(array([0]), array([[0.67559803, 0.32440197]]))

In [25]:
suite_train.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 2982 examples
Running Change same name in both questions
Predicting 5865 examples
Running Change name in one of the questions
Predicting 13430 examples
Running Comparison between two entities is not the same as asking about one
Predicting 3935 examples


In [26]:
suite_test.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 297 examples
Running Change same name in both questions
Predicting 2047 examples
Running Change name in one of the questions
Predicting 3864 examples
Running Comparison between two entities is not the same as asking about one
Predicting 392 examples


We can see a (text) summary of the results by calling `suite_t.summary()`

In [27]:
suite_train.summary()

Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      3935
Fails (rate):    740 (18.8%)

Example fails:
0.6 ("Which company's stock will rise most, LinkedIn or Pinterest?", "Will LinkedIn's stock rise?")
----
0.5 ("Which company's stock will fall most, CBS or Buffer?", "Will CBS's stock fall?")
----
0.5 ("Which company's board will resign sooner, Amazon or Harris?", "Will Amazon's board resign?")
----




NER

same adjectives, different people
Test cases:      2982
Fails (rate):    139 (4.7%)

Example fails:
0.5 ('Hi Joan Crawford Twitter!', 'Hi Kenneth Crawford Twitter！')
----
0.6 ('Hi David Jordan Team!', 'Hi Anna Jordan Team！')
----
0.5 ('Hi Ray Kennedy team!', 'Hi Emily Kennedy team！')
----


Change same name in both questions
Test cases:      566
Fails (rate):    144 (25.4%)

Example fails:
0.6 ('Who would make a better president, Michelle Obama or Hillary Clinton?', 'Who would make a better president: Michelle Obama or Hillary Clinton?')

In [28]:
suite_test.summary()

Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      392
Fails (rate):    74 (18.9%)

Example fails:
0.5 ("Which company's board will resign later, Apple or Fox?", "Will Apple's board resign?")
----
0.5 ("Which company's stock will rise least, Orange or HP?", "Will Orange's stock rise?")
----
0.6 ("Which company's board will resign sooner, MIT or GM?", "Will MIT's board resign?")
----




NER

same adjectives, different people
Test cases:      297
Fails (rate):    60 (20.2%)

Example fails:
0.5 ('Is Ashley Murray dangerous?', 'Is Matt Murray dangerous?')
----
0.6 ('Is Donald Adams coming?', 'Is Hugh Adams coming?')
----
0.6 ('Is Benjamin Murray evil?', 'Is Nancy Murray evil?')
----


Change same name in both questions
Test cases:      200
Fails (rate):    50 (25.0%)

Example fails:
0.5 ('How is Donald Trump winning?', 'Why Donald Trump is winning the Republican nomination now?')
0.4 ('How is Daniel Myers winning?', 'Why Daniel Myers is winnin

Or if we're using jupyter, we can use a nifty visualization that has all of the tests we created in a matrix.  
You can navigate the matrix and see results for individual tests (*The screenshot below is based on our locally finetuned model, so the numbers may not match with your results.*).

In [29]:
# from IPython.display import HTML, Image
# with open('visual_table_summary.gif','rb') as f:
#     display(Image(data=f.read(), format='png'))
suite_train.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'same adjectives, dif…

In [30]:
suite_test.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'same adjectives, dif…

## Bonus: testing Taxonomy

Let's create a few additional tests for the Taxonomy capability

In [31]:
tmp = []
x = editor.suggest('How can I become more {mask}?')
x += editor.suggest('How can I become less {mask}?')
for a in set(x):
    e = editor.synonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
#         print(a, [b[0][0] for b in e] )
        tmp.append([a] + e)
#         opps.append((a, e[0][0][0]))
print(', '.join([str(tuple(x)) for x in tmp][:50]))

('anxious', 'nervous'), ('depressed', 'blue'), ('desperate', 'heroic'), ('religious', 'spiritual'), ('kind', 'tolerant'), ('thoughtful', 'attentive'), ('evil', 'vicious'), ('shy', 'timid', 'unsure'), ('effective', 'efficient'), ('intelligent', 'healthy', 'thinking', 'sound'), ('stressed', 'stress', 'distressed'), ('courageous', 'brave'), ('passive', 'inactive', 'peaceful'), ('enlightened', 'educated', 'clear'), ('disconnected', 'confused', 'fragmented'), ('positive', 'confident'), ('mean', 'hateful', 'average'), ('emotional', 'excited'), ('vain', 'futile'), ('authoritarian', 'dictator'), ('rigid', 'strict', 'stiff', 'fixed'), ('insecure', 'unsafe'), ('militant', 'competitive', 'activist'), ('consistent', 'uniform', 'coherent', 'logical'), ('frustrated', 'queer'), ('knowledgeable', 'learned', 'intimate', 'knowing'), ('charitable', 'benevolent', 'sympathetic'), ('rude', 'crude', 'primitive'), ('lonely', 'alone', 'solitary'), ('hungry', 'thirsty'), ('confident', 'positive'), ('cautious', 

Out of all of those, let's pick a few:

In [32]:
synonyms = [ ('vain', 'futile'), ('committed', 'attached'), ('effective', 'efficient'), ('scared', 'frightened'), ('capable', 'open', 'able'), ('strict', 'rigid', 'stern'), ('spiritual', 'religious'), ('progressive', 'liberal', 'imperfect'), ('hateful', 'mean'), ('passive', 'inactive', 'peaceful'), ('miserable', 'poor', 'pathetic', 'suffering', 'wretched', 'low'), ('alone', 'lonely', 'solitary'), ('suspicious', 'wary', 'suspect'), ('open', 'capable', 'clear', 'receptive', 'candid'), ('individual', 'single', 'private', 'person', 'someone'), ('grateful', 'thankful'), ('alienated', 'alien', 'estranged'), ('lonely', 'alone', 'solitary'), ('knowledgeable', 'learned', 'intimate', 'knowing'), ('radical', 'revolutionary'), ('hopeful', 'promising'), ('religious', 'spiritual'), ('emotional', 'excited'), ('specific', 'particular'), ('corrupt', 'corrupted'), ('so', 'then'), ('positive', 'confident'), ('honest', 'reliable', 'good', 'fair', 'true', 'honorable'), ('unhappy', 'distressed'), ('anxious', 'nervous'), ('dependent', 'qualified'), ('inspired', 'divine'), ('charitable', 'benevolent', 'sympathetic'), ('sensitive', 'sensible'), ('humble', 'modest'), ('organized', 'organised', 'direct'), ('consistent', 'uniform', 'coherent', 'logical'), ('intimidating', 'daunting'), ('independent', 'autonomous'), ('innovative', 'modern', 'advanced'), ('desperate', 'heroic'), ('educated', 'enlightened'), ('efficient', 'effective'), ('insecure', 'unsafe'), ('kind', 'tolerant'), ('aware', 'mindful'), ('mindful', 'aware'), ('fat', 'productive', 'rich', 'fatty'), ('authentic', 'reliable'), ('ambitious', 'challenging')]

With these, we can create a simple MFT, where we expect the model to recognize these synonyms.  


In [33]:
t_train = editor.template(
    (
    'How can {male} become {moreless} {x[0]}?',
    'How can {male} become {moreless} {x[1]}?',
    ),
    x=synonyms,
    moreless=['more', 'less'],
    remove_duplicates=True, 
    nsamples=2000)

t_test = editor.template(
    (
    'Why is {female} {moreless} {x[0]}?',
    'Why is {female} {moreless} {x[1]}?',
    ),
    x=synonyms,
    moreless=['more', 'less'],
    remove_duplicates=True, 
    nsamples=200)

name_train = 'How can {male} become more {synonym}?' 
name_test = 'Why is {female} more {synonym}?' 
desc = 'different (simple) templates where words are replaced with their synonyms'

train = MFT(**t_train, labels=1, name=name_train, capability = 'Taxonomy',
          description=desc)
test = MFT(**t_test, labels=1, name=name_test, capability = 'Taxonomy',
          description=desc)

suite_train.add(train, overwrite=True)
suite_test.add(test, overwrite=True)

Let's do the same with antonyms:

In [34]:
opps = []
x = editor.suggest('How can I become more {mask}?')
x += editor.suggest('How can I become less {mask}?')
for a in set(x):
    e = editor.antonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
#         print(a, [b[0][0] for b in e] )
        opps.append([a] + e)
#         opps.append((a, e[0][0][0]))
print(','.join([str(tuple(x)) for x in opps]))

('irresponsible', 'responsible'),('pessimistic', 'optimistic'),('shy', 'confident'),('courageous', 'fearful'),('passive', 'active'),('positive', 'negative'),('powerless', 'powerful'),('invisible', 'visible'),('emotional', 'intellectual'),('insecure', 'secure'),('rude', 'polite', 'civil'),('hungry', 'thirsty'),('cautious', 'brave'),('organic', 'functional'),('stupid', 'smart', 'intelligent'),('dependent', 'independent'),('negative', 'positive'),('defensive', 'offensive'),('fat', 'lean', 'thin'),('conspicuous', 'invisible'),('active', 'passive'),('conservative', 'progressive', 'liberal'),('relevant', 'irrelevant'),('impatient', 'patient'),('uncomfortable', 'comfortable'),('visible', 'invisible'),('progressive', 'conservative'),('bad', 'good'),('hopeful', 'hopeless'),('humble', 'proud'),('corrupt', 'straight'),('optimistic', 'pessimistic'),('specific', 'general'),('unhappy', 'happy'),('individual', 'common')


In [35]:
antonyms = [('conservative', 'liberal', 'progressive'),('emotional', 'intellectual'),('individual', 'common'),('pessimistic', 'optimistic'),('conspicuous', 'invisible'),('unhappy', 'happy'),('relevant', 'irrelevant'),('defensive', 'offensive'),('dependent', 'independent'),('insecure', 'secure'),('specific', 'general'),('stupid', 'intelligent', 'smart'),('progressive', 'conservative'),('shy', 'confident'),('negative', 'positive'),('corrupt', 'straight'),('fat', 'lean', 'thin'),('impatient', 'patient'),('hungry', 'thirsty'),('uncomfortable', 'comfortable'),('active', 'passive'),('passive', 'active'),('rude', 'civil', 'polite'),('irresponsible', 'responsible'),('invisible', 'visible'),('visible', 'invisible'),('positive', 'negative'),('optimistic', 'pessimistic'),('courageous', 'fearful'),('powerless', 'powerful'),('organic', 'functional'),('cautious', 'brave'),('bad', 'good'),('humble', 'proud'),('hopeful', 'hopeless')]

In [36]:
t_train = editor.template([
    (
    'How can {male} become more {x[0]}?',
    'How can {male} become less {x[1]}?',
    ),
    (
    'How can {male} become less {x[0]}?',
    'How can {male} become more {x[1]}?',
    )],
    unroll=True,
    x=antonyms,
    moreless=['more', 'less'],
    remove_duplicates=True, 
    nsamples=2000)

t_test = editor.template([
    (
    'Why is {female} more {x[0]}?',
    'Why is {female} less {x[1]}?',
    ),
    (
    'Why is {female} less {x[0]}?',
    'Why is {female} more {x[1]}?',
    )],
    x=antonyms,
    moreless=['more', 'less'],
    remove_duplicates=True, 
    nsamples=200)

name_train = 'How can I become more X = How can I become less antonym(X)' 
name_test = 'Why is {female} more X = Why is {female} less antonym(X)' 
desc = ''

train = MFT(**t_train, labels=1, name=name_train, capability = 'Taxonomy',
          description=desc)
test = MFT(**t_test, labels=1, name=name_test, capability = 'Taxonomy',
          description=desc)

suite_train.add(train, overwrite=True)
suite_test.add(test, overwrite=True)

It would be easy to turn the synonym one into an INV as well (we do this in another notebook), but let's end here after we run the suite again and see new results.

In [37]:
suite_train.run(pred_and_conf, overwrite=True)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Running same adjectives, different people
Predicting 2982 examples
Running Change same name in both questions
Predicting 5865 examples
Running Change name in one of the questions
Predicting 13430 examples
Running Comparison between two entities is not the same as asking about one
Predicting 3935 examples
Running How can {male} become more {synonym}?
Predicting 2000 examples
Running How can I become more X = How can I become less antonym(X)
Predicting 4000 examples


In [38]:
suite_test.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 297 examples
Running Change same name in both questions
Predicting 2047 examples
Running Change name in one of the questions
Predicting 3864 examples
Running Comparison between two entities is not the same as asking about one
Predicting 392 examples
Running Why is {female} more {synonym}?
Predicting 200 examples
Running Why is {female} more X = Why is {female} less antonym(X)
Predicting 400 examples


In [39]:
suite_train.summary()

Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      3935
Fails (rate):    740 (18.8%)

Example fails:
0.6 ("Which company's stock will rise later, Target or Bloom?", "Will Target's stock rise?")
----
0.7 ("Which company's stock will rise most, Apple or Harris?", "Will Apple's stock rise?")
----
0.5 ("Which company's stock will fall least, Fox or Samsung?", "Will Fox's stock fall?")
----




Taxonomy

How can {male} become more {synonym}?
Test cases:      2000
Fails (rate):    1952 (97.6%)

Example fails:
0.4 ('How can Louis become more efficient?', 'How can Louis become more effective?')
----
0.4 ('How can Tom become less intimidating?', 'How can Tom become less daunting?')
----
0.3 ('How can Donald become more individual?', 'How can Donald become more single?')
----


How can I become more X = How can I become less antonym(X)
Test cases:      4000
Fails (rate):    3859 (96.5%)

Example fails:
0.3 ('How can Joe become more shy?', 'How can Jo

In [40]:
suite_test.summary()

Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      392
Fails (rate):    74 (18.9%)

Example fails:
0.6 ("Which company's stock will fall later, Tech or It?", "Will Tech's stock fall?")
----
0.5 ("Which company's stock will fall least, Target or Disney?", "Will Target's stock fall?")
----
0.7 ("Which company's stock will rise most, Ford or Disney?", "Will Ford's stock rise?")
----




Taxonomy

Why is {female} more {synonym}?
Test cases:      200
Fails (rate):    190 (95.0%)

Example fails:
0.3 ('Why is Laura less emotional?', 'Why is Laura less excited?')
----
0.3 ('Why is Kathryn more capable?', 'Why is Kathryn more open?')
----
0.2 ('Why is Michelle more suspicious?', 'Why is Michelle more wary?')
----


Why is {female} more X = Why is {female} less antonym(X)
Test cases:      200
Fails (rate):    192 (96.0%)

Example fails:
0.3 ('Why is Jessica more defensive?', 'Why is Jessica less offensive?')
0.3 ('Why is Jessica less defensive?', 'Wh

In [41]:
suite_train.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'same adjectives, dif…

In [42]:
suite_test.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'same adjectives, dif…

In [43]:
suite_train.save('./version_0_train.pkl')
suite_test.save('./version_0_test.pkl')

In [44]:
# suite_path = './test.pkl'
# test_suite = TestSuite.from_file(suite_path)
# test_suite.run(pred_and_conf, overwrite=True)
# suite.summary()

In [45]:
failed_samples = []

for test_name, test in suite_train.tests.items():
    result = test.results
    if result:  
        passed = result["passed"]  
        preds = result["preds"]  
        confs = result["confs"]  
        expect_results = result["expect_results"]  
        inputs = test.data  

        for i, is_passed in enumerate(passed):
            if not is_passed and is_passed is not None:
                failed_samples.append({
                    "test_name": test_name,
                    "input": inputs[i],  
                    "pred": preds[i],  
                    "conf": confs[i],  
                    "expected": expect_results[i],  
                })

for sample in failed_samples:
    print(f"Test Name: {sample['test_name']}")
    print(f"Input: {sample['input']}")
    print(f"Prediction: {sample['pred']}, Confidence: {sample['conf']}")
    print(f"Expected: {sample['expected']}")
    print("---")


Test Name: same adjectives, different people
Input: ('Hi Robert Jones …!', 'Hi Hugh Jones …！')
Prediction: 1, Confidence: [0.42456162 0.57543838]
Expected: [-0.57543838]
---
Test Name: same adjectives, different people
Input: ('Hi Don Mitchell Fan!', 'Hi Lucy Mitchell Fan！')
Prediction: 1, Confidence: [0.48946249 0.51053751]
Expected: [-0.51053751]
---
Test Name: same adjectives, different people
Input: ('Hi Wendy Clark Jones!', 'Hi Bill Clark Jones！')
Prediction: 1, Confidence: [0.4870913 0.5129087]
Expected: [-0.5129087]
---
Test Name: same adjectives, different people
Input: ('Hi David Jordan Team!', 'Hi Anna Jordan Team！')
Prediction: 1, Confidence: [0.44920707 0.55079293]
Expected: [-0.55079293]
---
Test Name: same adjectives, different people
Input: ('Hi Fred Hughes Jenkins!', 'Hi Karen Hughes Jenkins！')
Prediction: 1, Confidence: [0.45588475 0.54411525]
Expected: [-0.54411525]
---
Test Name: same adjectives, different people
Input: ('Hi Francis Howard Jones!', 'Hi Donna Howard J

Prediction: 0, Confidence: [0.77404708 0.22595292]
Expected: [-0.77404708]
---
Test Name: How can I become more X = How can I become less antonym(X)
Input: ('How can Alan become more negative?', 'How can Alan become less positive?')
Prediction: 0, Confidence: [0.73151034 0.26848966]
Expected: [-0.73151034]
---
Test Name: How can I become more X = How can I become less antonym(X)
Input: ('How can Alan become less negative?', 'How can Alan become more positive?')
Prediction: 0, Confidence: [0.69394648 0.30605352]
Expected: [-0.69394648]
---
Test Name: How can I become more X = How can I become less antonym(X)
Input: ('How can Richard become more progressive?', 'How can Richard become less conservative?')
Prediction: 0, Confidence: [0.67330426 0.32669574]
Expected: [-0.67330426]
---
Test Name: How can I become more X = How can I become less antonym(X)
Input: ('How can Richard become less progressive?', 'How can Richard become more conservative?')
Prediction: 0, Confidence: [0.64524436 0.3

In [46]:
# store data local

In [47]:
# for MFT
fine_tune_data = []
for sample in failed_samples:
    test_name = sample['test_name']
    label = None

    if test_name in ["How can {male} become more {synonym}", "How can I become more X = How can I become less antonym(X)"]:
        label = 1
    elif test_name in ["Comparison between two entities is not the same as asking about one",
                       "same adjectives, different people"]:
        label = 0

    if label is not None:
        fine_tune_data.append({
            "input": sample['input'],
            "label": label
        })


In [48]:
# store data local

In [53]:
from transformers import (
    pipeline, 
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
    Trainer, 
    TrainingArguments, 
    DataCollatorWithPadding
)
from datasets import Dataset

dataset = Dataset.from_list(fine_tune_data)

def tokenize_function(example):
    return tokenizer(example["input"], truncation=True)

encoded_dataset = dataset.map(tokenize_function, batched=True)

split_dataset = encoded_dataset.train_test_split(test_size=0.2)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./results",          
    evaluation_strategy="epoch",    
    save_strategy="epoch",          
    learning_rate=5e-5,             
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=8,   
    num_train_epochs=3,             
    weight_decay=0.01,              
    logging_dir="./logs",           
    logging_steps=10,           
    save_total_limit=1,             
    load_best_model_at_end=True,    
    metric_for_best_model="accuracy"  
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

import torch
torch.cuda.empty_cache()

trainer.train()

model.save_pretrained("./fine_tuned_model_version_0")
tokenizer.save_pretrained("./fine_tuned_model_version_0")


Map:   0%|          | 0/4738 [00:00<?, ? examples/s]

  trainer = Trainer(


OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 11.71 GiB of which 120.62 MiB is free. Process 35636 has 1.95 GiB memory in use. Process 37040 has 764.00 MiB memory in use. Process 42483 has 798.00 MiB memory in use. Process 53320 has 1.38 GiB memory in use. Process 58735 has 2.14 GiB memory in use. Process 60379 has 2.15 GiB memory in use. Including non-PyTorch memory, this process has 1.58 GiB memory in use. Of the allocated memory 1.25 GiB is allocated by PyTorch, and 96.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [51]:
tokenizer_v0 = AutoTokenizer.from_pretrained('./fine_tuned_model_version_0/')
model_v0 = AutoModelForSequenceClassification.from_pretrained('./fine_tuned_model_version_0/')

pipe_v0 = pipeline("text-classification", model=model, tokenizer=tokenizer, framework="pt", device=0)

def pred_and_conf_v0(data):
    raw_preds = pipe_v0(data, padding=True, truncation=True)
    preds = np.array([ int(p["label"][-1]) for p in raw_preds])
    pp = np.array([[p["score"], 1-p["score"]] if int(p["label"][-1]) == 0 else [1-p["score"], p["score"]] for p in raw_preds])
    return preds, pp

pred_and_conf_v0([example])

suite = TestSuite.from_file("./version_0_test.pkl")


Device set to use cuda:0


In [None]:
suite.run(pred_and_conf_v0, overwrite=True)

suite.summary()

In [None]:
suite.visual_summary_table()