# Table of Contents

&emsp;[<a id="#vocab" /> <font color='yellow' size=5> Vocab </font>](#%3Ca-id%3D%22%23vocab%22-/%3E-%3Cfont-color%3D%27yellow%27%3E-Vocab-%3C/font%3E)<br>
&emsp;&emsp;[1. Comparison](#1.-Comparison)<br>
&emsp;&emsp;[2. Intensifier](#2.-Intensifier)<br><br>
&emsp;[<a id="Taxonomy" /> <font color='yellow' size=5> Taxonomy </font>](#%3Ca-id%3D%22Taxonomy%22-/%3E-%3Cfont-color%3D%27yellow%27%3E-Taxonomy-%3C/font%3E)<br>
&emsp;&emsp;[1. Size, Shape, Color, Age, Material](#1.-Size%2C-Shape%2C-Color%2C-Age%2C-Material)<br>
&emsp;&emsp;[2. Professions vs nationalities](#2.-Professions-vs-nationalities)<br>
&emsp;&emsp;[3. Animal vs Vehicles - Example1](#3.-Animal-vs-Vehicles---Example1)<br>
&emsp;&emsp;[4. Animal vs Vehicles - Example2](#4.-Animal-vs-Vehicles---Example2)<br>
&emsp;&emsp;[5. Synonyms](#5.-Synonyms)<br>
&emsp;&emsp;[6. Antonyms](#6.-Antonyms)<br>
&emsp;&emsp;[7. Antonyms Comparison](#7.-Antonyms-Comparison)<br><br>
&emsp;[<a id="Temporal" /> <font color='yellow' size=5> Temporal </font>](#%3Ca-id%3D%22Temporal%22-/%3E-%3Cfont-color%3D%27yellow%27%3E-Temporal-%3C/font%3E)<br>
&emsp;&emsp;[1. Change in profession](#1.-Change-in-profession)<br>
&emsp;&emsp;[2. Understand time difference](#2.-Understand-time-difference)<br><br>
&emsp;[<a id="Negation" /> <font color='yellow' size=5> Negation </font>](#%3Ca-id%3D%22Negation%22-/%3E-%3Cfont-color%3D%27yellow%27%3E-Negation-%3C/font%3E)<br>
&emsp;&emsp;[1. Negation in Context](#1.-Negation-in-Context)<br>
&emsp;&emsp;[3. Negation in Question](#3.-Negation-in-Question)<br><br>
&emsp;[<a id="Coref" /> <font color='yellow' size=5> Coref </font>](#%3Ca-id%3D%22Coref%22-/%3E-%3Cfont-color%3D%27yellow%27%3E-Coref-%3C/font%3E)<br>
&emsp;&emsp;[1. He/she coref](#1.-He/she-coref)<br>
&emsp;&emsp;[2. His/Her coref](#2.-His/Her-coref)<br>
&emsp;&emsp;[3. Former and Latter](#3.-Former-and-Latter)<br>

In [26]:
%load_ext autoreload
%autoreload 2
import checklist.editor
import munch
import itertools
from checklist.test_suite import TestSuite
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.perturb import Perturb
from helper import get_finetuned_electra_predictor, format_squad_with_context,show_example, crossproduct, export_suite_to_jsonl, get_summary,display_and_export_mdtable

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [27]:
suite = TestSuite()
editor = checklist.editor.Editor()
model_predictor = get_finetuned_electra_predictor()

## <a id="#vocab" /> <font color='yellow'> Vocab </font>

### 1. Comparison

In [28]:
adj = ['old', 'smart', 'tall', 'young', 'strong', 'short', 'tough', 'cool', 'fast', 'nice', 'small', 'dark', 'wise', 'rich', 'great', 'weak', 'high', 'slow', 'strange', 'clean']
adj = [(x.rstrip('e'), x) for x in adj]


t_comparison = editor.template(
    [(
    '{first_name} is {adj[0]}er than {first_name1}.',
    'Who is less {adj[1]}?'
    ),(
    '{first_name} is {adj[0]}er than {first_name1}.',
    'Who is {adj[0]}er?'
    ),
    ],
    labels = ['{first_name1}','{first_name}'],
    adj=adj,
    remove_duplicates=True,
    nsamples=500,
    save=True
)

In [29]:
show_example(t_comparison, n=1)

  [1] ('Ed is greater than George.', 'Who is less great?') , pred: Ed , label: George
  [2] ('Ed is greater than George.', 'Who is greater?') , pred: Ed , label: Ed
---


In [30]:
name = 'Comparisons'
description = 'A is COMP than B. Who is more / less COMP?'
test = MFT(**t_comparison, name=name, description=description, capability='Vocabulary')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1000 examples
Test cases:      500
Fails (rate):    500 (100.0%)

Example fails:
C: Daniel is cooler than Claire.
Q: Who is less cool?
A: Claire
P: Daniel


----


### 2. Intensifier

In [31]:
state = editor.suggest('John is very {mask} about the project.')[:20]
very = ['very', 'extremely', 'really', 'quite', 'incredibly', 'particularly', 'highly', 'super']
somewhat = ['a little', 'somewhat', 'slightly', 'mildly']

t_intensifier = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {very} {s} about the project. {first_name1} is {s} about the project.',
            '{first_name1} is {s} about the project. {first_name} is {very} {s} about the project.',
            '{first_name} is {s} about the project. {first_name1} is {somewhat} {s} about the project.',
            '{first_name1} is {somewhat} {s} about the project. {first_name} is {s} about the project.',
            '{first_name} is {very} {s} about the project. {first_name1} is {somewhat} {s} about the project.',
            '{first_name1} is {somewhat} {s} about the project. {first_name} is {very} {s} about the project.',
        ],
        'qas': [
            (
                'Who is most {s} about the project?',
                '{first_name}'
            ), 
            (
                'Who is least {s} about the project?',
                '{first_name1}'
            ), 
            
        ]
        
    },
    s = state,
    very=very,
    somewhat=somewhat,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))

  to_pred = torch.tensor(to_pred, device=self.device).to(torch.int64)


In [32]:
show_example(t_intensifier)

  [1] ('Kate is really confident about the project. Virginia is confident about the project.', 'Who is most confident about the project?') , pred: Virginia , label: Kate
  [2] ('Kate is really confident about the project. Virginia is confident about the project.', 'Who is least confident about the project?') , pred: Virginia , label: Virginia
  [3] ('Virginia is confident about the project. Kate is really confident about the project.', 'Who is most confident about the project?') , pred: Kate , label: Kate
  [4] ('Virginia is confident about the project. Kate is really confident about the project.', 'Who is least confident about the project?') , pred: Kate , label: Virginia
  [5] ('Kate is confident about the project. Virginia is slightly confident about the project.', 'Who is most confident about the project?') , pred: Kate , label: Kate
  [6] ('Kate is confident about the project. Virginia is slightly confident about the project.', 'Who is least confident about the project?') , pred: 

In [33]:
name = 'Intensifiers'
desc = '(very, super, extremely) and reducers (somewhat, kinda, etc)?'
test = MFT(**t_intensifier, name=name, description=desc, capability='Vocabulary')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context, n_per_testcase=2)
suite.add(test)

Predicting 5964 examples
Test cases:      497
Fails (rate):    497 (100.0%)

Example fails:
C: Ruth is incredibly pessimistic about the project. Donna is pessimistic about the project.
Q: Who is most pessimistic about the project?
A: Ruth
P: empty

C: Donna is pessimistic about the project. Ruth is incredibly pessimistic about the project.
Q: Who is most pessimistic about the project?
A: Ruth
P: empty


----


## <a id="Taxonomy" /> <font color='yellow'> Taxonomy </font>

### 1. Size, Shape, Color, Age, Material

In [34]:
order = ['size', 'shape', 'age', 'color']
props = []
properties = {
    'color' : ['red', 'blue','yellow', 'green', 'pink', 'white', 'black', 'orange', 'grey', 'purple', 'brown'],
    'size' : ['big', 'small', 'tiny', 'enormous'],
    'age' : ['old', 'new'],
    'shape' : ['round', 'oval', 'square', 'triangular'],
    'material' : ['iron', 'wooden', 'ceramic', 'glass', 'stone']
}
objects = ['box', 'clock', 'table', 'object', 'toy', 'painting', 'sculpture', 'thing', 'figure']

for i in range(len(order)):
    for j in range(i + 1, len(order)):
        p1, p2 = order[i], order[j]
        for v1, v2 in itertools.product(properties[p1], properties[p2]):
            props.append(munch.Munch({
                'p1': p1,
                'p2': p2,
                'v1': v1,
                'v2': v2,
            }))
            
t_properties = crossproduct(editor.template(
    {
        'contexts': [
            'There is {a:p.v1} {p.v2} {obj} in the room.',
            'There is {a:obj} in the room. The {obj} is {p.v1} and {p.v2}.',
        ],
        'qas': [
            (
                'What {p.p1} is the {obj}?',
                '{p.v1}'
            ), 
            (
                'What {p.p2} is the {obj}?',
                '{p.v2}'
            ), 
            
        ]
    },
    obj=objects,
    p=props,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))


In [35]:
show_example(t_properties, n=2)

  [1] ('There is a round black thing in the room.', 'What shape is the thing?') , pred: round black , label: round
  [2] ('There is a round black thing in the room.', 'What color is the thing?') , pred: round black , label: black
  [3] ('There is a thing in the room. The thing is round and black.', 'What shape is the thing?') , pred: round and black , label: round
  [4] ('There is a thing in the room. The thing is round and black.', 'What color is the thing?') , pred: black , label: black
---
  [1] ('There is a tiny black clock in the room.', 'What size is the clock?') , pred: tiny black , label: tiny
  [2] ('There is a tiny black clock in the room.', 'What color is the clock?') , pred: black , label: black
  [3] ('There is a clock in the room. The clock is tiny and black.', 'What size is the clock?') , pred: tiny and black , label: tiny
  [4] ('There is a clock in the room. The clock is tiny and black.', 'What color is the clock?') , pred: black , label: black
---


In [36]:
name = 'Properites'
desc = 'size, shape, age, color'
test = MFT(**t_properties, name=name, description=desc, capability='Taxonomy')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 2000 examples
Test cases:      500
Fails (rate):    500 (100.0%)

Example fails:
C: There is a small oval box in the room.
Q: What size is the box?
A: small
P: oval

C: There is a box in the room. The box is small and oval.
Q: What size is the box?
A: small
P: small and oval


----


### 2. Professions vs nationalities

In [37]:
professions = editor.suggest('{first_name} works as {a:mask}.')[:30]
professions += editor.suggest('{first_name} {last_name} works as {a:mask}.')[:30]
professions = list(set(professions))
if 'translator' in professions:
    professions.remove('translator')


def clean(string):
    return string.lstrip('[a,the,an,in,at] ').rstrip('.')
def expect_squad(x, pred, conf, label=None, meta=None):
    return clean(pred) == clean(label)
expect_squad = Expect.single(expect_squad)


t_profession_nationtionality = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {a:nat} {prof}.',
            '{first_name} is {a:prof}. {first_name} is {nat}.',
            '{first_name} is {nat}. {first_name} is {a:prof}.',
            '{first_name} is {nat} and {a:prof}.',
            '{first_name} is {a:prof} and {nat}.',
        ],
        'qas': [
            (
                'What is {first_name}\'s job?',
                '{prof}'
            ), 
            (
                'What is {first_name}\'s nationality?',
                '{nat}'
            ), 
            
        ]
        
    },
    nat = editor.lexicons['nationality'][:10],
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    save=True,
    ))

In [38]:
show_example(t_profession_nationtionality, n=1)

  [1] ('Nick is a Russian assistant.', "What is Nick's job?") , pred: Russian assistant , label: assistant
  [2] ('Nick is a Russian assistant.', "What is Nick's nationality?") , pred: Russian , label: Russian
  [3] ('Nick is an assistant. Nick is Russian.', "What is Nick's job?") , pred: assistant , label: assistant
  [4] ('Nick is an assistant. Nick is Russian.', "What is Nick's nationality?") , pred: Russian , label: Russian
  [5] ('Nick is Russian. Nick is an assistant.', "What is Nick's job?") , pred: assistant , label: assistant
  [6] ('Nick is Russian. Nick is an assistant.', "What is Nick's nationality?") , pred: Russian , label: Russian
  [7] ('Nick is Russian and an assistant.', "What is Nick's job?") , pred: assistant , label: assistant
  [8] ('Nick is Russian and an assistant.', "What is Nick's nationality?") , pred: Russian , label: Russian
  [9] ('Nick is an assistant and Russian.', "What is Nick's job?") , pred: assistant and Russian , label: assistant
  [10] ('Nick is a

In [39]:
name = 'Profession vs nationality'
test = MFT(**t_profession_nationtionality, name=name, expect=expect_squad, description='',  capability='Taxonomy')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 5000 examples
Test cases:      500
Fails (rate):    274 (54.8%)

Example fails:
C: Sara is a Chinese investor.
Q: What is Sara's job?
A: investor
P: Chinese investor

C: Sara is an investor and Chinese.
Q: What is Sara's job?
A: investor
P: investor and Chinese


----


### 3. Animal vs Vehicles - Example1

In [40]:
animals = ['dog', 'cat', 'bull', 'cow', 'fish', 'serpent', 'snake', 'lizard', 'hamster', 'rabbit', 'guinea pig', 'iguana', 'duck']
vehicles = ['car', 'truck', 'train', 'motorcycle', 'bike', 'firetruck', 'tractor', 'van', 'SUV', 'minivan']
t_animal_vehicles = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} has {a:animal} and {a:vehicle}.',
            '{first_name} has {a:vehicle} and {a:animal}.',
        ],
        'qas': [
            (
                'What animal does {first_name} have?',
                '{animal}'
            ), 
            (
                'What vehicle does {first_name} have?',
                '{vehicle}'
            ), 
            
        ]
        
    },
    animal=animals,
    vehicle=vehicles,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))

In [41]:
show_example(t_animal_vehicles)

  [1] ('Ray has a lizard and a train.', 'What animal does Ray have?') , pred: lizard , label: lizard
  [2] ('Ray has a lizard and a train.', 'What vehicle does Ray have?') , pred: a lizard and a train , label: train
  [3] ('Ray has a train and a lizard.', 'What animal does Ray have?') , pred: lizard , label: lizard
  [4] ('Ray has a train and a lizard.', 'What vehicle does Ray have?') , pred: a train , label: train
---


In [42]:
name = 'Animal vs Vehicle'
test = MFT(**t_animal_vehicles, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test, overwrite=True)

Predicting 2000 examples
Test cases:      500
Fails (rate):    493 (98.6%)

Example fails:
C: Philip has an iguana and a bike.
Q: What vehicle does Philip have?
A: bike
P: iguana and a bike

C: Philip has a bike and an iguana.
Q: What animal does Philip have?
A: iguana
P: a bike and an iguana


----


### 4. Animal vs Vehicles - Example2

In [43]:
animals = ['dog', 'cat', 'bull', 'cow', 'fish', 'serpent', 'snake', 'lizard', 'hamster', 'rabbit', 'guinea pig', 'iguana', 'duck']
vehicles = ['car', 'truck', 'train', 'motorcycle', 'bike', 'firetruck', 'tractor', 'van', 'SUV', 'minivan']
t_animals_vehicles_2 = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} bought {a:animal}. {first_name2} bought {a:vehicle}.',
            '{first_name2} bought {a:vehicle}. {first_name} bought {a:animal}.',
        ],
        'qas': [
            (
                'Who bought an animal?',
                '{first_name}'
            ), 
            (
                'Who bought a vehicle?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    animal=animals,
    vehicle=vehicles,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))


In [44]:
show_example(t_animals_vehicles_2)

  [1] ('James bought a snake. Pamela bought a bike.', 'Who bought an animal?') , pred: James , label: James
  [2] ('James bought a snake. Pamela bought a bike.', 'Who bought a vehicle?') , pred: Pamela , label: Pamela
  [3] ('Pamela bought a bike. James bought a snake.', 'Who bought an animal?') , pred: James , label: James
  [4] ('Pamela bought a bike. James bought a snake.', 'Who bought a vehicle?') , pred: Pamela , label: Pamela
---


In [45]:

name = 'Animal vs Vehicle v2'
test = MFT(**t_animals_vehicles_2, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test, overwrite=True)

Predicting 1992 examples
Test cases:      498
Fails (rate):    242 (48.6%)

Example fails:
C: Florence bought a firetruck. Alex bought a bull.
Q: Who bought a vehicle?
A: Florence
P: Alex


----


### 5. Synonyms 

In [46]:
synonyms = [ ('spiritual', 'religious'), ('angry', 'furious'), ('organized', 'organised'),
            ('vocal', 'outspoken'), ('grateful', 'thankful'), ('intelligent', 'smart'),
            ('humble', 'modest'), ('courageous', 'brave'), ('happy', 'joyful'), ('scared', 'frightened'),
           ]

t_synonym = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is very {s1[0]}. {first_name2} is very {s2[0]}.',
            '{first_name2} is very {s2[0]}. {first_name} is very {s1[0]}.',
        ],
        'qas': [
            (
                'Who is {s1[1]}?',
                '{first_name}'
            ), 
            (
                'Who is {s2[1]}?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    s=synonyms,
    remove_duplicates=True,
    nsamples=250,
    save=True
   ))
t_synonym += crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is very {s1[1]}. {first_name2} is very {s2[1]}.',
            '{first_name2} is very {s2[1]}. {first_name} is very {s1[1]}.',
        ],
        'qas': [
            (
                'Who is {s1[0]}?',
                '{first_name}'
            ), 
            (
                'Who is {s2[0]}?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    s=synonyms,
    remove_duplicates=True,
    nsamples=250,
    save=True
    )) 

In [47]:
show_example(t_synonym)

  [1] ('Tony is very courageous. Philip is very scared.', 'Who is brave?') , pred: Tony , label: Tony
  [2] ('Tony is very courageous. Philip is very scared.', 'Who is frightened?') , pred: Philip , label: Philip
  [3] ('Philip is very scared. Tony is very courageous.', 'Who is brave?') , pred: Tony , label: Tony
  [4] ('Philip is very scared. Tony is very courageous.', 'Who is frightened?') , pred: Philip , label: Philip
---


In [48]:
name = 'Synonyms'
test = MFT(**t_synonym, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1796 examples
Test cases:      449
Fails (rate):    28 (6.2%)

Example fails:
C: Anna is very vocal. Frances is very courageous.
Q: Who is outspoken?
A: Anna
P: Frances


----


### 6. Antonyms

In [49]:
comp_pairs = [('better', 'worse'), ('older', 'younger'), ('smarter', 'dumber'), ('taller', 'shorter'), ('bigger', 'smaller'), ('stronger', 'weaker'), ('faster', 'slower'), ('darker', 'lighter'), ('richer', 'poorer'), ('happier', 'sadder'), ('louder', 'quieter'), ('warmer', 'colder')]
comp_pairs = list(set(comp_pairs))

In [50]:
t_antonymns = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {comp[0]} than {first_name1}.',
            '{first_name1} is {comp[1]} than {first_name}.',
        ],
        'qas': [
            (
                'Who is {comp[1]}?',
                '{first_name1}',
            ),
            (
                'Who is {comp[0]}?',
                '{first_name}',
            )
            
        ]
        ,
    },
    comp=comp_pairs,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))

In [51]:
show_example(t_antonymns)

  [1] ('Rose is taller than Amanda.', 'Who is shorter?') , pred: empty , label: Amanda
  [2] ('Rose is taller than Amanda.', 'Who is taller?') , pred: Rose , label: Rose
  [3] ('Amanda is shorter than Rose.', 'Who is shorter?') , pred: Amanda , label: Amanda
  [4] ('Amanda is shorter than Rose.', 'Who is taller?') , pred: Amanda , label: Rose
---


In [52]:
name = 'Antonyms'
test = MFT(**t_antonymns, name=name, description='A is COMP than B. Who is antonym(COMP)? B', capability='Taxonomy')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1992 examples
Test cases:      498
Fails (rate):    498 (100.0%)

Example fails:
C: Pamela is older than Roy.
Q: Who is younger?
A: Roy
P: Pamela

C: Roy is younger than Pamela.
Q: Who is older?
A: Pamela
P: Roy


----


### 7. Antonyms Comparison

In [53]:
antonym_adjs = [('progressive', 'conservative'),('religious', 'secular'),('positive', 'negative'),('defensive', 'offensive'),('rude',  'polite'),('optimistic', 'pessimistic'),('stupid', 'smart'),('negative', 'positive'),('unhappy', 'happy'),('active', 'passive'),('impatient', 'patient'),('powerless', 'powerful'),('visible', 'invisible'),('fat', 'thin'),('bad', 'good'),('cautious', 'brave'), ('hopeful', 'hopeless'),('insecure', 'secure'),('humble', 'proud'),('passive', 'active'),('dependent', 'independent'),('pessimistic', 'optimistic'),('irresponsible', 'responsible'),('courageous', 'fearful')]

t_antonymns_compare = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is more {a[0]} than {first_name1}.',
            '{first_name1} is more {a[1]} than {first_name}.',
            '{first_name} is less {a[1]} than {first_name1}.',
            '{first_name1} is less {a[0]} than {first_name}.',
        ],
        'qas': [
            (
                'Who is more {a[0]}?',
                '{first_name}',
            ),
            (
                'Who is less {a[0]}?',
                '{first_name1}',
            ),
            (
                'Who is more {a[1]}?',
                '{first_name1}',
            ),
            (
                'Who is less {a[1]}?',
                '{first_name}',
            ),
        ]
        ,
    },
    a = antonym_adjs,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))

In [54]:
show_example(t_antonymns_compare)

  [1] ('Mary is more insecure than Kate.', 'Who is more insecure?') , pred: Mary , label: Mary
  [2] ('Mary is more insecure than Kate.', 'Who is less insecure?') , pred: Mary , label: Kate
  [3] ('Mary is more insecure than Kate.', 'Who is more secure?') , pred: Mary , label: Kate
  [4] ('Mary is more insecure than Kate.', 'Who is less secure?') , pred: Mary , label: Mary
  [5] ('Kate is more secure than Mary.', 'Who is more insecure?') , pred: Kate , label: Mary
  [6] ('Kate is more secure than Mary.', 'Who is less insecure?') , pred: Kate , label: Kate
  [7] ('Kate is more secure than Mary.', 'Who is more secure?') , pred: Kate , label: Kate
  [8] ('Kate is more secure than Mary.', 'Who is less secure?') , pred: Kate , label: Mary
  [9] ('Mary is less secure than Kate.', 'Who is more insecure?') , pred: Mary , label: Mary
  [10] ('Mary is less secure than Kate.', 'Who is less insecure?') , pred: Mary , label: Kate
  [11] ('Mary is less secure than Kate.', 'Who is more secure?') , pr

In [55]:
description = 'A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.'
name= 'Antonyms Comparison'
test = MFT(**t_antonymns_compare, name=name, description=description, capability='Taxonomy')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 7952 examples
Test cases:      497
Fails (rate):    497 (100.0%)

Example fails:
C: Roy is more religious than Diana.
Q: Who is less religious?
A: Diana
P: Roy

C: Roy is more religious than Diana.
Q: Who is more secular?
A: Diana
P: Roy

C: Diana is more secular than Roy.
Q: Who is more religious?
A: Roy
P: Diana


----


## <a id="Temporal" /> <font color='yellow'> Temporal </font>

### 1. Change in profession

In [56]:
t_profession = crossproduct(editor.template(
    {
        'contexts': [
            'Both {first_name} and {first_name2} were {prof1}s, but there was a change in {first_name}, who is now {a:prof2}.',
            'Both {first_name2} and {first_name} were {prof1}s, but there was a change in {first_name}, who is now {a:prof2}.',
        ],
        'qas': [
            (
                'Who is {a:prof2}?',
                '{first_name}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))


In [57]:
show_example(t_profession)

  [1] ('Both Nick and Betty were editors, but there was a change in Nick, who is now an accountant.', 'Who is an accountant?') , pred: Nick , label: Nick
  [2] ('Both Betty and Nick were editors, but there was a change in Nick, who is now an accountant.', 'Who is an accountant?') , pred: Nick , label: Nick
---


In [58]:
name = 'There was a change in profession'
test = MFT(**t_profession, expect=expect_squad, capability='Temporal', name=name, description='' )
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 948 examples
Test cases:      474
Fails (rate):    8 (1.7%)

Example fails:
C: Both Carol and James were analysts, but there was a change in James, who is now an author.
Q: Who is an author?
A: James
P: empty


----


### 2. Understand time difference
e.g before, after

In [59]:
t_time_difference = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} became a {prof} before {first_name2} did.',
            '{first_name2} became a {prof} after {first_name} did.',
        ],
        'qas': [
            (
                'Who became a {prof} first?',
                '{first_name}'
            ), 
            (
                'Who became a {prof} last?',
                '{first_name2}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))

In [60]:
show_example(t_time_difference)

  [1] ('Betty became a producer before Tim did.', 'Who became a producer first?') , pred: Betty , label: Betty
  [2] ('Betty became a producer before Tim did.', 'Who became a producer last?') , pred: Betty , label: Tim
  [3] ('Tim became a producer after Betty did.', 'Who became a producer first?') , pred: Tim , label: Betty
  [4] ('Tim became a producer after Betty did.', 'Who became a producer last?') , pred: Tim , label: Tim
---


In [61]:

description = 'Understanding before / after -> first / last.'
name="Time Difference"
test = MFT(**t_time_difference, expect=expect_squad, capability='Temporal', name=name, description=description )
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1992 examples
Test cases:      498
Fails (rate):    498 (100.0%)

Example fails:
C: Joseph became a assistant before Bruce did.
Q: Who became a assistant last?
A: Bruce
P: Joseph

C: Bruce became a assistant after Joseph did.
Q: Who became a assistant first?
A: Joseph
P: Bruce


----


## <a id="Negation" /> <font color='yellow'> Negation </font>

### 1. Negation in Context

In [62]:
t_context_negation = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is not {a:prof}. {first_name2} is.',
            '{first_name2} is {a:prof}. {first_name} is not.',
        ],
        'qas': [
            (
                'Who is {a:prof}?',
                '{first_name2}'
            ), 
            (
                'Who is not {a:prof}?',
                '{first_name}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))

In [63]:
show_example(t_context_negation)

  [1] ('Mark is not a DJ. Ralph is.', 'Who is a DJ?') , pred: empty , label: Ralph
  [2] ('Mark is not a DJ. Ralph is.', 'Who is not a DJ?') , pred: Mark , label: Mark
  [3] ('Ralph is a DJ. Mark is not.', 'Who is a DJ?') , pred: Ralph , label: Ralph
  [4] ('Ralph is a DJ. Mark is not.', 'Who is not a DJ?') , pred: Mark , label: Mark
---


In [64]:

name = 'Negation in context, may or may not be in question'
test = MFT(**t_context_negation, expect=expect_squad, capability='Negation', name=name, description='' )
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 2000 examples
Test cases:      500
Fails (rate):    487 (97.4%)

Example fails:
C: Walter is not a waitress. Kathleen is.
Q: Who is a waitress?
A: Kathleen
P: Walter


----


### 3. Negation in Question

In [65]:
t_question_negation = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {a:prof}. {first_name2} is {a:prof2}.',
            '{first_name2} is {a:prof2}. {first_name} is {a:prof}.',
        ],
        'qas': [
            (
                'Who is {a:prof}?',
                '{first_name}'
            ), 
            (
                'Who is not {a:prof}?',
                '{first_name2}'
            ), 
            (
                'Who is {a:prof2}?',
                '{first_name2}'
            ), 
            (
                'Who is not {a:prof2}?',
                '{first_name}'
            ), 
        ]
        
    },
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))


In [66]:

show_example(t_question_negation)

  [1] ('Judith is an editor. Ron is an assistant.', 'Who is an editor?') , pred: Judith , label: Judith
  [2] ('Judith is an editor. Ron is an assistant.', 'Who is not an editor?') , pred: Judith , label: Ron
  [3] ('Judith is an editor. Ron is an assistant.', 'Who is an assistant?') , pred: Ron , label: Ron
  [4] ('Judith is an editor. Ron is an assistant.', 'Who is not an assistant?') , pred: Ron , label: Judith
  [5] ('Ron is an assistant. Judith is an editor.', 'Who is an editor?') , pred: Judith , label: Judith
  [6] ('Ron is an assistant. Judith is an editor.', 'Who is not an editor?') , pred: Judith , label: Ron
  [7] ('Ron is an assistant. Judith is an editor.', 'Who is an assistant?') , pred: Ron , label: Ron
  [8] ('Ron is an assistant. Judith is an editor.', 'Who is not an assistant?') , pred: Ron , label: Judith
---


In [67]:
name = 'Negation in question only.'
test = MFT(**t_question_negation, expect=expect_squad, capability='Negation', name=name, description='' )
test.run(model_predictor)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 3888 examples
Test cases:      486
Fails (rate):    486 (100.0%)

Example fails:
C: Greg is a reporter. Ellen is an analyst.
Q: Who is not a reporter?
A: Ellen
P: Greg

C: Greg is a reporter. Ellen is an analyst.
Q: Who is not an analyst?
A: Greg
P: Ellen

C: Ellen is an analyst. Greg is a reporter.
Q: Who is not a reporter?
A: Ellen
P: Greg


----
C: Kathryn is an economist. Adam is an interpreter.
Q: Who is not an economist?
A: Adam
P: Kathryn

C: Kathryn is an economist. Adam is an interpreter.
Q: Who is not an interpreter?
A: Kathryn
P: Adam

C: Adam is an interpreter. Kathryn is an economist.
Q: Who is not an economist?
A: Adam
P: Kathryn


----
C: Robin is an activist. Jonathan is an entrepreneur.
Q: Who is not an activist?
A: Jonathan
P: Robin

C: Robin is an activist. Jonathan is an entrepreneur.
Q: Who is not an entrepreneur?
A: Robin
P: Jonathan

C: Jonathan is an entrepreneur. Robin is an activist.
Q: Who is not an activist?
A: Jonathan
P: Robin


----


## <a id="Coref" /> <font color='yellow'> Coref </font>

### 1. He/she coref

In [68]:
if 'actress' in professions:
    professions.remove('actress')

t_he_she_coref = crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. He is {a:prof1}, and she is {a:prof2}.',
            '{female} and {male} are friends. He is {a:prof1}, and she is {a:prof2}.',
            '{male} and {female} are friends. She is {a:prof2}, and he is {a:prof1}.',
            '{female} and {male} are friends. She is {a:prof2}, and he is {a:prof1}.',
        ],
        'qas': [
            (
                'Who is {a:prof1}?',
                '{male}'
            ), 
            (
                'Who is {a:prof2}?',
                '{female}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))


In [69]:
show_example(t_he_she_coref)

  [1] ('Adam and Sophie are friends. He is an agent, and she is an interpreter.', 'Who is an agent?') , pred: Adam and Sophie , label: Adam
  [2] ('Adam and Sophie are friends. He is an agent, and she is an interpreter.', 'Who is an interpreter?') , pred: Adam and Sophie , label: Sophie
  [3] ('Sophie and Adam are friends. He is an agent, and she is an interpreter.', 'Who is an agent?') , pred: Sophie and Adam , label: Adam
  [4] ('Sophie and Adam are friends. He is an agent, and she is an interpreter.', 'Who is an interpreter?') , pred: Sophie and Adam , label: Sophie
  [5] ('Adam and Sophie are friends. She is an interpreter, and he is an agent.', 'Who is an agent?') , pred: Adam and Sophie , label: Adam
  [6] ('Adam and Sophie are friends. She is an interpreter, and he is an agent.', 'Who is an interpreter?') , pred: Adam and Sophie , label: Sophie
  [7] ('Sophie and Adam are friends. She is an interpreter, and he is an agent.', 'Who is an agent?') , pred: Sophie and Adam , label: A

In [70]:
name = 'Basic coref, he / she'
test = MFT(**t_he_she_coref, expect=expect_squad, name=name, description='', capability='Coref')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 3784 examples
Test cases:      473
Fails (rate):    473 (100.0%)

Example fails:
C: Jimmy and Lucy are friends. He is an entrepreneur, and she is a model.
Q: Who is an entrepreneur?
A: Jimmy
P: Jimmy and Lucy

C: Jimmy and Lucy are friends. He is an entrepreneur, and she is a model.
Q: Who is a model?
A: Lucy
P: Jimmy and Lucy

C: Lucy and Jimmy are friends. He is an entrepreneur, and she is a model.
Q: Who is an entrepreneur?
A: Jimmy
P: Lucy and Jimmy


----


### 2. His/Her coref

In [71]:
t_his_her = crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. His mom is {a:prof}.',
            '{female} and {male} are friends. His mom is {a:prof}.',
        ],
        'qas': [
            (
                'Whose mom is {a:prof}?',
                '{male}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=250,
    ))
t_his_her += crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. Her mom is {a:prof}.',
            '{female} and {male} are friends. Her mom is {a:prof}.',
        ],
        'qas': [
            (
                'Whose mom is {a:prof}?',
                '{female}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=250,
    ))


In [72]:
show_example(t_his_her)

  [1] ('Ray and Frances are friends. His mom is an intern.', 'Whose mom is an intern?') , pred: Ray and Frances , label: Ray
  [2] ('Frances and Ray are friends. His mom is an intern.', 'Whose mom is an intern?') , pred: Frances and Ray , label: Ray
---


In [73]:
name = 'Basic coref, his / her'
test = MFT(**t_his_her, expect=expect_squad, name=name, description='', capability='Coref')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1000 examples
Test cases:      500
Fails (rate):    500 (100.0%)

Example fails:
C: Bruce and Edith are friends. His mom is an intern.
Q: Whose mom is an intern?
A: Bruce
P: Bruce and Edith

C: Edith and Bruce are friends. His mom is an intern.
Q: Whose mom is an intern?
A: Bruce
P: Edith and Bruce


----


### 3. Former and Latter

In [74]:
t_former_latter = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} and {first_name2} are friends. The former is {a:prof1}.',
            '{first_name2} and {first_name} are friends. The latter is {a:prof1}.',
            '{first_name} and {first_name2} are friends. The former is {a:prof1} and the latter is {a:prof2}.',
            '{first_name2} and {first_name} are friends. The former is {a:prof2} and the latter is {a:prof1}.',
        ],
        'qas': [
            (
                'Who is {a:prof1}?',
                '{first_name}'
            ), 
        ]
        
    },
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))


In [75]:
show_example(t_former_latter)

  [1] ('Heather and Eric are friends. The former is a nurse.', 'Who is a nurse?') , pred: Heather and Eric , label: Heather
  [2] ('Eric and Heather are friends. The latter is a nurse.', 'Who is a nurse?') , pred: Eric and Heather , label: Heather
  [3] ('Heather and Eric are friends. The former is a nurse and the latter is an artist.', 'Who is a nurse?') , pred: Heather and Eric are friends. The former is a nurse and the latter is an artist , label: Heather
  [4] ('Eric and Heather are friends. The former is an artist and the latter is a nurse.', 'Who is a nurse?') , pred: Eric and Heather , label: Heather
---


In [76]:
name = 'Former / Latter'
test = MFT(**t_former_latter, expect=expect_squad, name=name, description='', capability='Coref')
test.run(model_predictor)
test.summary(n=1, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1920 examples
Test cases:      480
Fails (rate):    480 (100.0%)

Example fails:
C: Henry and Sharon are friends. The former is an editor.
Q: Who is an editor?
A: Henry
P: Henry and Sharon

C: Sharon and Henry are friends. The latter is an editor.
Q: Who is an editor?
A: Henry
P: Sharon and Henry

C: Henry and Sharon are friends. The former is an editor and the latter is an accountant.
Q: Who is an editor?
A: Henry
P: an accountant


----


In [77]:
# suite.summary(n=0)

In [78]:
df_suite=get_summary(suite)

In [79]:
display_and_export_mdtable(df_suite, do_export=False)

| Test Name                                          |   Total Cases |   Example Per Case |   Failures | Failure Rate   |
|----------------------------------------------------|---------------|--------------------|------------|----------------|
| Comparisons                                        |           500 |                  2 |        500 | 100.00%        |
| Intensifiers                                       |           497 |                 12 |        497 | 100.00%        |
| Properites                                         |           500 |                  4 |        500 | 100.00%        |
| Profession vs nationality                          |           500 |                 10 |        274 | 54.80%         |
| Animal vs Vehicle                                  |           500 |                  4 |        493 | 98.60%         |
| Animal vs Vehicle v2                               |           498 |                  4 |        242 | 48.59%         |
| Synonyms                                           |           449 |                  4 |         28 | 6.24%          |
| Antonyms                                           |           498 |                  4 |        498 | 100.00%        |
| Antonyms Comparison                                |           497 |                 16 |        497 | 100.00%        |
| There was a change in profession                   |           474 |                  2 |          8 | 1.69%          |
| Time Difference                                    |           498 |                  4 |        498 | 100.00%        |
| Negation in context, may or may not be in question |           500 |                  4 |        487 | 97.40%         |
| Negation in question only.                         |           486 |                  8 |        486 | 100.00%        |
| Basic coref, he / she                              |           473 |                  8 |        473 | 100.00%        |
| Basic coref, his / her                             |           500 |                  2 |        500 | 100.00%        |
| Former / Latter                                    |           480 |                  4 |        480 | 100.00%        |

In [80]:
export_suite_to_jsonl(suite)

âœ… Saved, 45228 rows to checklist_testsuite.jsonl.
