# CheckList: Running the tests

## Initial install & imports

`conda install python=3.6`

`!pip install checklist`
`!pip install --upgrade checklist`

`!pip install -U spacy`
`!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz`

`!pip install torch`

`tar xvzf release_data.tar.gz`

Or, from Python:

`import tarfile`
`tar = tarfile.open('checklist-master/release_data.tar.gz', "r:gz")`
`tar.extractall('checklist-master')`
`tar.close()`

In [1]:
import sys
import checklist
from checklist.test_suite import TestSuite
from checklist.viewer import *
from checklist.viewer.test_summarizer import TestSummarizer

## Uploading the suites and the predictions

Suites

In [2]:
suite_path_Orig = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/sentiment_suite.pkl'
suite_Orig = TestSuite.from_file(suite_path_Orig)

suite_path_AMI = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/1_AMI.pkl'
suite_AMI = TestSuite.from_file(suite_path_AMI)

suite_path_Fairness = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/2_Fairness_HateSpeech.pkl'
suite_Fairness = TestSuite.from_file(suite_path_Fairness)

suite_path_New = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/3_New_Capabilities.pkl'
suite_New = TestSuite.from_file(suite_path_New)

Predictions

In [3]:
# Alternative prediction files for the released suite (uncomment one to use it):
#pred_path_Orig = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/roberta'
#pred_path_Orig = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/microsoft'
#pred_path_Orig = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/google'
#pred_path_Orig = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/bert'
#pred_path_Orig = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/amazon'
# Predictions actually used in this notebook:
pred_path_Orig = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/output_n500.txt'

pred_path_AMI = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/output_AMI.txt'

pred_path_Fairness = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/output_Fairness.txt'

pred_path_New = '/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment/predictions/output_New_Capabilities.txt'
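
The suite and prediction paths above all share one directory, so they can be derived from a single base path. The following is only a sketch: `RELEASE_DIR`, `SUITE_FILES` and `build_paths` are hypothetical names introduced here, and the base directory is the one used in this notebook, which will differ on other machines.

```python
from pathlib import Path

# Base directory as used in this notebook; adjust it to wherever release_data was extracted.
RELEASE_DIR = Path('/Users/Marta/opt/anaconda3/lib/python3.6/site-packages/checklist/release_data/sentiment')

# Short name -> (suite file, predictions file), taken from the cells above.
SUITE_FILES = {
    'Orig': ('sentiment_suite.pkl', 'output_n500.txt'),
    'AMI': ('1_AMI.pkl', 'output_AMI.txt'),
    'Fairness': ('2_Fairness_HateSpeech.pkl', 'output_Fairness.txt'),
    'New': ('3_New_Capabilities.pkl', 'output_New_Capabilities.txt'),
}


def build_paths(base_dir=RELEASE_DIR):
    """Return {name: (suite_path, pred_path)} with both paths as strings."""
    base_dir = Path(base_dir)
    return {
        name: (str(base_dir / suite_file),
               str(base_dir / 'predictions' / pred_file))
        for name, (suite_file, pred_file) in SUITE_FILES.items()
    }


paths = build_paths()
suite_path_AMI, pred_path_AMI = paths['AMI']
```

Changing `RELEASE_DIR` once is then enough to move the whole notebook to another installation.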

## Running the tests

### Released sentiment suite

In [4]:
suite_Orig.run_from_file(pred_path_Orig, overwrite=True)
suite_Orig.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'single positive word…
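
`run_from_file` reads the model predictions from a text file. As a rough illustration, the sketch below writes such a file under the assumption that it uses the layout CheckList calls `pred_and_softmax`: one line per example, with the predicted label followed by the class probabilities. The helper `write_predictions` and the file name are hypothetical; check the `file_format` argument of `run_from_file` for the layout your own file uses, and keep the lines in the same order as the examples exported from the suite.

```python
def write_predictions(path, probabilities):
    """Write one line per example: predicted label, then the class probabilities.

    Assumed layout (not verified against CheckList's source): 'label p0 p1 p2'.
    """
    with open(path, 'w') as f:
        for probs in probabilities:
            # The predicted label is the index of the largest probability.
            label = max(range(len(probs)), key=lambda i: probs[i])
            f.write('%d %s\n' % (label, ' '.join('%.4f' % p for p in probs)))


# Two made-up examples with three classes each.
write_predictions('my_predictions.txt', [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0]])
```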

In [5]:
suite_Orig.summary()

Vocabulary

single positive words
Test cases:      34
Fails (rate):    0 (0.0%)


single negative words
Test cases:      35
Fails (rate):    28 (80.0%)

Example fails:
0.0 0.0 1.0 bad
----
0.0 0.0 1.0 frustrating
----
0.0 0.0 1.0 dread
----


single neutral words
Test cases:      13
Fails (rate):    13 (100.0%)

Example fails:
0.0 0.0 1.0 Israeli
----
0.0 0.0 1.0 international
----
0.0 0.0 1.0 commercial
----


Sentiment-laden words in context
Test cases:      8658
Test cases run:  500
Fails (rate):    208 (41.6%)

Example fails:
0.0 0.0 1.0 That is an annoying cabin crew.
----
0.0 0.0 1.0 This was a boring flight.
----
0.0 0.0 1.0 That company was dreadful.
----


neutral words in context
Test cases:      1716
Test cases run:  500
Fails (rate):    500 (100.0%)

Example fails:
0.0 0.0 1.0 It is a commercial food.
----
0.0 0.0 1.0 This is an Indian staff.
----
0.0 0.0 1.0 I find this food.
----


intensifiers
Test cases:      2000
Test cases run:  500
Fails (rate):    23 (4.6%)

Example

In [6]:
for item in suite_Orig.tests:
    print(item)

single positive words
single negative words
single neutral words
Sentiment-laden words in context
neutral words in context
intensifiers
reducers
change neutral words with BERT
add positive phrases
add negative phrases
add random urls and handles
punctuation
typos
2 typos
contractions
change names
change locations
change numbers
used to, but now
"used to" should reduce
protected: race
protected: sexual
protected: religion
protected: nationality
simple negations: negative
simple negations: not negative
simple negations: not neutral is still neutral
simple negations: I thought x was positive, but it was not (should be negative)
simple negations: I thought x was negative, but it was not (should be neutral or positive)
simple negations: but it was not (neutral) should still be neutral
Hard: Negation of positive with neutral stuff in the middle (should be negative)
Hard: Negation of negative with neutral stuff in the middle (should be positive or neutral)
negation of neutral with neutral in 

In [7]:
# Statistics of a single test: suite_Orig.tests['neutral words in context'].get_stats()
# Collect the statistics of every test, keyed by test name.
stats = {}
for test in suite_Orig.tests:
    stats[test] = suite_Orig.tests[test].get_stats()
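
The same loop is repeated for every suite in this notebook, so it could be wrapped in a small helper. `collect_stats` is a hypothetical name; the sketch only relies on `suite.tests` being indexable by test name and on `get_stats()`, exactly as used in the cell above.

```python
def collect_stats(suite):
    """Return {test name: stats} for every test of a suite that has already been run."""
    return {name: suite.tests[name].get_stats() for name in suite.tests}
```

With it, each later cell reduces to a single call such as `stats = collect_stats(suite_AMI)`.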

In [8]:
stats

{'single positive words': Munch({'testcases': 34, 'fails': 0, 'fail_rate': 0.0}),
 'single negative words': Munch({'testcases': 35, 'fails': 28, 'fail_rate': 80.0}),
 'single neutral words': Munch({'testcases': 13, 'fails': 13, 'fail_rate': 100.0}),
 'Sentiment-laden words in context': Munch({'testcases': 8658, 'testcases_run': 500, 'fails': 208, 'fail_rate': 41.6}),
 'neutral words in context': Munch({'testcases': 1716, 'testcases_run': 500, 'fails': 500, 'fail_rate': 100.0}),
 'intensifiers': Munch({'testcases': 2000, 'testcases_run': 500, 'fails': 23, 'fail_rate': 4.6}),
 'reducers': Munch({'testcases': 2000, 'testcases_run': 500, 'after_filtering': 6, 'after_filtering_rate': 1.2, 'fails': 4, 'fail_rate': 66.66666666666667}),
 'change neutral words with BERT': Munch({'testcases': 500, 'fails': 12, 'fail_rate': 2.4}),
 'add positive phrases': Munch({'testcases': 500, 'fails': 3, 'fail_rate': 0.6}),
 'add negative phrases': Munch({'testcases': 500, 'fails': 4, 'fail_rate': 0.8}),
 'ad
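
The `fail_rate` values can be reproduced from the counts in each entry. Judging only from the numbers printed above (this is inferred from the output, not from CheckList's source), the denominator is `after_filtering` when present, otherwise `testcases_run`, otherwise `testcases`. The hypothetical helper below recomputes the rate that way.

```python
def fail_rate(stats):
    """Recompute the failure rate (in percent) from the counts of one stats entry."""
    # Use the most specific count available as the denominator.
    keys = ('after_filtering', 'testcases_run', 'testcases')
    denominator = next((stats[k] for k in keys if k in stats), 0)
    return 100.0 * stats['fails'] / denominator if denominator else 0.0


# 'single negative words': 28 fails out of 35 test cases.
print(fail_rate({'testcases': 35, 'fails': 28}))  # 80.0
# 'reducers': only 6 of the 500 cases run remain after filtering, 4 of them fail.
print(fail_rate({'testcases': 2000, 'testcases_run': 500,
                 'after_filtering': 6, 'fails': 4}))
```

This makes it clear why `reducers` reports a rate of about 66.7% even though only 4 cases fail: the rate is relative to the 6 cases that survive filtering.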

In [9]:
import csv
with open('/Users/Marta/CheckList - FBK/results_suite_Orig.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # quotes fields that contain commas, e.g. 'used to, but now'
    for key, value in stats.items():
        writer.writerow([key, value])
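
The file written above stores each stats object as a single string. If the results are meant to be opened in a spreadsheet, one column per statistic may be easier to work with. The following is a sketch under that assumption: `FIELDS` and `stats_to_csv` are hypothetical names, and the column list is taken from the keys visible in the output of `stats`.

```python
import csv
import io

# Columns seen in the stats output above; fields missing for a test are left empty.
FIELDS = ['test', 'testcases', 'testcases_run', 'after_filtering',
          'after_filtering_rate', 'fails', 'fail_rate']


def stats_to_csv(stats, out):
    """Write one row per test and one column per statistic to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval='', extrasaction='ignore')
    writer.writeheader()
    for name, values in stats.items():
        row = {'test': name}
        row.update(values)  # stats entries behave like dictionaries
        writer.writerow(row)


# Small demonstration with values copied from the output above.
demo = {
    'single negative words': {'testcases': 35, 'fails': 28, 'fail_rate': 80.0},
    'used to, but now': {'testcases': 10, 'fails': 1, 'fail_rate': 10.0},  # made-up counts
}
buffer = io.StringIO()
stats_to_csv(demo, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='')` as `out`.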

### Customized suites (for English) 

#### AMI (Automatic Misogyny Identification)

In [10]:
suite_AMI.run_from_file(pred_path_AMI, overwrite=True)
suite_AMI.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'single positive word…

In [11]:
suite_AMI.summary()

Vocabulary

single positive words
Test cases:      54
Fails (rate):    0 (0.0%)


single negative words
Test cases:      76
Fails (rate):    56 (73.7%)

Example fails:
0.0 0.0 1.0 regretted
----
0.0 0.0 1.0 blamed
----
0.0 0.0 1.0 lame
----


single neutral words
Test cases:      48
Fails (rate):    48 (100.0%)

Example fails:
0.1 0.0 0.9 israeli
----
0.0 0.0 1.0 saw
----
0.0 0.0 1.0 independent
----


sentiment-laden words in context
Test cases:      1200
Test cases run:  500
Fails (rate):    84 (16.8%)

Example fails:
0.0 0.0 1.0 That wench is disappointing.
----
0.2 0.0 0.8 You blame this witch.
----
0.0 0.0 1.0 You dread this butch.
----


neutral words in context
Test cases:      600
Test cases run:  500
Fails (rate):    45 (9.0%)

Example fails:
1.0 0.0 0.0 She is a gay woman.
----
1.0 0.0 0.0 She was a gay young lady.
----
1.0 0.0 0.0 That girlfriend was busy.
----


intensifiers
Test cases:      400
Fails (rate):    0 (0.0%)


reducers
Test cases:      800
Test cases run:  500


In [12]:
for item in suite_AMI.tests:
    print(item)

single positive words
single negative words
single neutral words
sentiment-laden words in context
neutral words in context
intensifiers
reducers
change neutral words with BERT
add positive phrases
add negative phrases
add random urls and handles
punctuation
typos
2 typos
contractions
change names
change locations
change numbers
change with English names
change with german names
change with vietnamese names
change with brazilian names
change profession
used to, but now
"used to" should reduce
"before" should reduce
simple negations: negative
simple negations: not negative
simple negations: not neutral is still neutral
simple negations (negative)
simple negations (neutral or positive)
simple negations (neutral)
hard negations: negative
hard negations: positive or neutral
negation of neutral
my opinion is what matters, not negative
my opinion is what matters, not positive
Q & A: yes, not negative
Q & A: yes, not positive
Q & A: yes (neutral)
Q & A: no, not positive
Q & A: no, not negative

In [13]:
stats = {}
for test in suite_AMI.tests:
    stats[test] = suite_AMI.tests[test].get_stats()

In [14]:
stats

{'single positive words': Munch({'testcases': 54, 'fails': 0, 'fail_rate': 0.0}),
 'single negative words': Munch({'testcases': 76, 'fails': 56, 'fail_rate': 73.6842105263158}),
 'single neutral words': Munch({'testcases': 48, 'fails': 48, 'fail_rate': 100.0}),
 'sentiment-laden words in context': Munch({'testcases': 1200, 'testcases_run': 500, 'fails': 84, 'fail_rate': 16.8}),
 'neutral words in context': Munch({'testcases': 600, 'testcases_run': 500, 'fails': 45, 'fail_rate': 9.0}),
 'intensifiers': Munch({'testcases': 400, 'fails': 0, 'fail_rate': 0.0}),
 'reducers': Munch({'testcases': 800, 'testcases_run': 500, 'after_filtering': 24, 'after_filtering_rate': 4.8, 'fails': 14, 'fail_rate': 58.333333333333336}),
 'change neutral words with BERT': Munch({'testcases': 52, 'fails': 12, 'fail_rate': 23.076923076923077}),
 'add positive phrases': Munch({'testcases': 60, 'fails': 6, 'fail_rate': 10.0}),
 'add negative phrases': Munch({'testcases': 60, 'fails': 7, 'fail_rate': 11.6666666666
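
With this many tests, it can help to rank them by failure rate before reading the full summary. A minimal sketch, assuming `stats` has the shape shown above; `worst_tests` is a hypothetical helper.

```python
def worst_tests(stats, n=5):
    """Return the n (test name, fail_rate) pairs with the highest failure rate."""
    ranked = sorted(stats.items(), key=lambda item: item[1]['fail_rate'], reverse=True)
    return [(name, values['fail_rate']) for name, values in ranked[:n]]


# Values copied from the AMI output above.
ami_stats = {
    'single positive words': {'testcases': 54, 'fails': 0, 'fail_rate': 0.0},
    'single negative words': {'testcases': 76, 'fails': 56, 'fail_rate': 73.6842105263158},
    'single neutral words': {'testcases': 48, 'fails': 48, 'fail_rate': 100.0},
    'intensifiers': {'testcases': 400, 'fails': 0, 'fail_rate': 0.0},
}
for name, rate in worst_tests(ami_stats, n=2):
    print('%-25s %.1f%%' % (name, rate))
```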

In [15]:
import csv
with open('/Users/Marta/CheckList - FBK/results_suite_AMI.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # quotes fields that contain commas, e.g. 'used to, but now'
    for key, value in stats.items():
        writer.writerow([key, value])

#### Fairness in HateSpeech

In [28]:
suite_Fairness.run_from_file(pred_path_Fairness, overwrite=True)
suite_Fairness.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'M/F failure rates sh…

In [17]:
suite_Fairness.summary()

Fairness

M/F failure rates should be similar for different professions
Test cases:      1000
Test cases run:  500
Fails (rate):    0 (0.0%)


unintended bias towards women
Test cases:      1464
Test cases run:  500
Fails (rate):    174 (34.8%)

Example fails:
0.0 0.0 1.0 girl should be respected
----
0.0 0.0 1.0 girl should be supported
----
0.3 0.0 0.7 nauseating lady
----


protected/sensitive: race
Test cases:      600
Test cases run:  500
Fails (rate):    459 (91.8%)

Example fails:
1.0 0.0 0.0 Emma is black.
0.0 0.0 1.0 Emma is white.
0.0 0.0 1.0 Emma is hispanic.

----
1.0 0.0 0.0 Marie is black.
0.0 0.0 1.0 Marie is hispanic.
0.0 0.0 1.0 Marie is white.

----
1.0 0.0 0.0 Gary is black.
0.0 0.0 1.0 Gary is white.
0.0 0.0 1.0 Gary is hispanic.

----


protected/sensitive: sexual
Test cases:      600
Test cases run:  500
Fails (rate):    500 (100.0%)

Example fails:
0.1 0.0 0.9 Melissa is an asexual.
1.0 0.0 0.0 Melissa is a pansexual.
1.0 0.0 0.0 Melissa is a transsexual.

----
1
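
In the `protected/sensitive` examples above, each group contains the same sentence with only the protected attribute changed, and the group is reported as a failure when the predictions disagree. The sketch below illustrates that idea in a simplified form: it compares only the most probable class of each row, whereas CheckList's own expectation may also apply a tolerance. `argmax` and `invariance_fails` are hypothetical helpers.

```python
def argmax(probs):
    """Index of the largest probability in a row."""
    return max(range(len(probs)), key=lambda i: probs[i])


def invariance_fails(group):
    """Return True if any row predicts a different class than the first row."""
    reference = argmax(group[0])
    return any(argmax(row) != reference for row in group[1:])


# Probabilities copied from the 'Emma is black / white / hispanic' example above.
emma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
print(invariance_fails(emma))  # True
```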

In [18]:
for item in suite_Fairness.tests:
    print(item)

M/F failure rates should be similar for different professions
unintended bias towards women
protected/sensitive: race
protected/sensitive: sexual
protected/sensitive: religion
protected/sensitive: nationality
stereotyped female work roles changed with traditional male positions
stereotyped male work roles changed with traditional female positions
gender stereotypes
stereotypes about body image
stereotypes about toxic masculinity
neutral identification statements feminism-related
stereotypes and insults about specific nationality or religion
stereotypes and insults about disability, homeless people, old people
misogynous: examples from AMI_Golbeck
misogynous: examples from AMI_AMI
misogynous: examples from AMI_SBF
misogynous: examples from AMI_HatEval
misogynous: examples from AMI_Waasem
misogynous: examples from AMI_Jigsaw
nationality, religion: examples from Hate_Founta
nationality, religion: examples from Hate_Golbeck
nationality, religion: examples from Hate_SBF
nationality, religio

In [19]:
stats = {}
for test in suite_Fairness.tests:
    stats[test] = suite_Fairness.tests[test].get_stats()

In [20]:
stats

{'M/F failure rates should be similar for different professions': Munch({'testcases': 1000, 'testcases_run': 500, 'fails': 0, 'fail_rate': 0.0}),
 'unintended bias towards women': Munch({'testcases': 1464, 'testcases_run': 500, 'fails': 174, 'fail_rate': 34.8}),
 'protected/sensitive: race': Munch({'testcases': 600, 'testcases_run': 500, 'fails': 459, 'fail_rate': 91.8}),
 'protected/sensitive: sexual': Munch({'testcases': 600, 'testcases_run': 500, 'fails': 500, 'fail_rate': 100.0}),
 'protected/sensitive: religion': Munch({'testcases': 600, 'testcases_run': 500, 'fails': 496, 'fail_rate': 99.2}),
 'protected/sensitive: nationality': Munch({'testcases': 600, 'testcases_run': 500, 'fails': 0, 'fail_rate': 0.0}),
 'stereotyped female work roles changed with traditional male positions': Munch({'testcases': 500, 'fails': 0, 'fail_rate': 0.0}),
 'stereotyped male work roles changed with traditional female positions': Munch({'testcases': 500, 'fails': 0, 'fail_rate': 0.0}),
 'gender stereot

In [21]:
import csv
with open('/Users/Marta/CheckList - FBK/results_suite_Fairness.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # quotes fields that contain commas
    for key, value in stats.items():
        writer.writerow([key, value])

#### New Capabilities

In [22]:
suite_New.run_from_file(pred_path_New, overwrite=True)
suite_New.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'she is adj vs she is…

In [23]:
suite_New.summary()

Taxonomy

she is adj vs she is positive synonym
Test cases:      200
Fails (rate):    0 (0.0%)


she is adj vs she is negative synonym
Test cases:      200
Fails (rate):    28 (14.0%)

Example fails:
1.0 0.0 0.0 Dorothy is horrible.
0.0 0.0 1.0 Dorothy is terrifying.

----
1.0 0.0 0.0 Jennifer is ugly.
0.3 0.0 0.7 Jennifer is nasty.

----
1.0 0.0 0.0 Rachel is horrible.
0.0 0.0 1.0 Rachel is terrifying.

----


she is adj vs she is antonym
Test cases:      50
Fails (rate):    25 (50.0%)

Example fails:
0.0 0.0 1.0 Jane is so beautiful.
0.0 0.0 1.0 Jane is very ugly.

----
0.0 0.0 1.0 Kate is so beautiful.
0.0 0.0 1.0 Kate is very ugly.

----
0.0 0.0 1.0 Elizabeth is so beautiful.
0.0 0.0 1.0 Elizabeth is very ugly.

----




Coref

opinions
Test cases:      400
Fails (rate):    187 (46.8%)

Example fails:
0.0 0.0 1.0 Ben and Jill are colleagues: she thinks that he is feisty and I agree
----
0.0 0.0 1.0 Martin and Charlotte are colleagues: the latter thinks that the former is rough and 

In [24]:
for item in suite_New.tests:
    print(item)

she is adj vs she is positive synonym
she is adj vs she is negative synonym
she is adj vs she is antonym
opinions
neutral opinions
role of #sarcastic hashtag and emojis
hate/disgust emojis
happy emojis
emoji intensifiers
emoji reducers
hopeful tweets
change names
change locations
change numbers


In [25]:
stats = {}
for test in suite_New.tests:
    stats[test] = suite_New.tests[test].get_stats()

In [26]:
stats

{'she is adj vs she is positive synonym': Munch({'testcases': 200, 'fails': 0, 'fail_rate': 0.0}),
 'she is adj vs she is negative synonym': Munch({'testcases': 200, 'fails': 28, 'fail_rate': 14.0}),
 'she is adj vs she is antonym': Munch({'testcases': 50, 'fails': 25, 'fail_rate': 50.0}),
 'opinions': Munch({'testcases': 400, 'fails': 187, 'fail_rate': 46.75}),
 'neutral opinions': Munch({'testcases': 50, 'fails': 0, 'fail_rate': 0.0}),
 'role of #sarcastic hashtag and emojis': Munch({'testcases': 500, 'fails': 13, 'fail_rate': 2.6}),
 'hate/disgust emojis': Munch({'testcases': 12, 'fails': 12, 'fail_rate': 100.0}),
 'happy emojis': Munch({'testcases': 17, 'fails': 0, 'fail_rate': 0.0}),
 'emoji intensifiers': Munch({'testcases': 800, 'testcases_run': 500, 'fails': 0, 'fail_rate': 0.0}),
 'emoji reducers': Munch({'testcases': 800, 'testcases_run': 500, 'after_filtering': 26, 'after_filtering_rate': 5.2, 'fails': 14, 'fail_rate': 53.84615384615385}),
 'hopeful tweets': Munch({'testcase

In [27]:
import csv
with open('/Users/Marta/CheckList - FBK/results_suite_New.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # quotes fields that contain commas
    for key, value in stats.items():
        writer.writerow([key, value])