## Interannotator Agreement

An appropriate interannotator agreement measure for our choice of annotation is pi. Our annotation unit is the sentence, and we have three annotators per sentence. The annotation options are 'Yes' and 'No', for whether the corrected sentence is a valid correction of the original sentence or not. Therefore, for three different annotators per unit and two annotation options, and assuming that the annotators use a shared probability distribution, pi is the best choice of agreement measure.

Here is how we calculated our interannotator agreement:

In [164]:
import nltk
from nltk.metrics.agreement import AnnotationTask

We create a function that makes triples out of our annotations, since that is the input to nltk's AnnotationTask.

In [170]:
def convert_to_tuples(annotations_list):
    # We need to have (Annotator, Sentence, Yes/No) triples
    triple_list = []
    for (index, (workers_list, label_list)) in enumerate(annotations_list):
        for i in range(len(label_list)):
            triple_list.append((workers_list[i], str(index),label_list[i]))
    
    return triple_list


We read in the data from the file produced by Mechanical Turk. We turn it into a list so that it can be passed into the `create_triples` function, and we print out the list below to examine it.

In [166]:
import pandas as pd

annotations_df = pd.read_csv("annotations.csv")
hit_id_annotation_df = annotations_df.iloc[:,[0,15,29]]
hit_id_annotation_df = hit_id_annotation_df.rename(columns={"Answer.grammar-correction.label": "Label"})
hit_id_annotation_df = hit_id_annotation_df.groupby('HITId').agg(lambda x: x.tolist())
list_df = hit_id_annotation_df.values.tolist()
list_df

[[['A2UNQ8YQL05X7T', 'A13XNZBCRSEGGY', 'A2PQN6EIDHNJWJ'],
  ['Yes', 'No', 'Yes']],
 [['A3L6JLT39UBMCD', 'AJ9O2ZA0E8UDZ', 'A1P8G437YE33GL'], ['Yes', 'Yes', 'No']],
 [['A1BPX78NLNLI9B', 'A10042UW3Q59GF', 'AJ9O2ZA0E8UDZ'], ['No', 'Yes', 'Yes']],
 [['A19Y8QDS9ABVD4', 'A13HAGXAAV0R3W', 'A10042UW3Q59GF'],
  ['Yes', 'No', 'Yes']],
 [['A3EB5G62854MAJ', 'A13XNZBCRSEGGY', 'A1P8G437YE33GL'],
  ['Yes', 'No', 'Yes']],
 [['A3EB5G62854MAJ', 'A1KQ4XZM996XWW', 'AF8OCFNA5KE8A'], ['Yes', 'Yes', 'No']],
 [['AJ9O2ZA0E8UDZ', 'A1EF3SB1XTBETJ', 'A2F0X4LN9N4O4C'],
  ['Yes', 'Yes', 'Yes']],
 [['A2ZLJQWCM8KU36', 'A2NI0G1DKQJ08H', 'AWS8VZX8K72NU'],
  ['Yes', 'Yes', 'Yes']],
 [['A1JKZ6D8L1A8J1', 'AJ9O2ZA0E8UDZ', 'A3FOC1PCYZ0VT1'],
  ['Yes', 'Yes', 'Yes']],
 [['A7O1CG4QYBEIP', 'AKSLU0C30G3JT', 'AD01UNQ1IAKCP'], ['Yes', 'No', 'Yes']],
 [['A10042UW3Q59GF', 'ADP1QUGXHGJRS', 'A20CVITAYVBNBU'], ['Yes', 'No', 'No']],
 [['A3EB5G62854MAJ', 'A13XNZBCRSEGGY', 'A3P57IUDHUKNCE'], ['No', 'No', 'Yes']],
 [['A3PFU4042GIQLE', 'AIR

We can see that in each sublist, we have three worker id's and their three respective annotations. In order to use AnnotationTask's pi agreement measure, we have to make sure that there are only three unique annotators. Since we have more than three annotators, we change the id to 'annotator0' for the first annotator in each sublist, 'annotator1' for the second one, and 'annnotator2' for the third one.

In [194]:
for sublist in list_df:
    if 'AJ9O2ZA0E8UDZ' in sublist[0] or 'A3EB5G62854MAJ' in sublist[0] or 'A20CVITAYVBNBU' in sublist[0]:
        list_df.remove(sublist)
        
triples = convert_to_tuples(list_df)

trips = []
for i, triple in enumerate(triples):
    triple = ("annotator"+str(i % 3), triple[1], triple[2])
    trips.append(triple)

trips

[('annotator0', '0', 'Yes'),
 ('annotator1', '0', 'No'),
 ('annotator2', '0', 'Yes'),
 ('annotator0', '1', 'Yes'),
 ('annotator1', '1', 'No'),
 ('annotator2', '1', 'Yes'),
 ('annotator0', '2', 'Yes'),
 ('annotator1', '2', 'Yes'),
 ('annotator2', '2', 'Yes'),
 ('annotator0', '3', 'Yes'),
 ('annotator1', '3', 'No'),
 ('annotator2', '3', 'Yes'),
 ('annotator0', '4', 'No'),
 ('annotator1', '4', 'No'),
 ('annotator2', '4', 'No'),
 ('annotator0', '5', 'Yes'),
 ('annotator1', '5', 'No'),
 ('annotator2', '5', 'Yes'),
 ('annotator0', '6', 'No'),
 ('annotator1', '6', 'Yes'),
 ('annotator2', '6', 'No'),
 ('annotator0', '7', 'No'),
 ('annotator1', '7', 'Yes'),
 ('annotator2', '7', 'Yes'),
 ('annotator0', '8', 'Yes'),
 ('annotator1', '8', 'Yes'),
 ('annotator2', '8', 'Yes'),
 ('annotator0', '9', 'No'),
 ('annotator1', '9', 'No'),
 ('annotator2', '9', 'Yes'),
 ('annotator0', '10', 'Yes'),
 ('annotator1', '10', 'No'),
 ('annotator2', '10', 'Yes'),
 ('annotator0', '11', 'No'),
 ('annotator1', '11', 'N

Now we can calculate the pi measure:

In [195]:
annotation_task = AnnotationTask(trips)

print(annotation_task.avg_Ao())
print(annotation_task.pi())
print(annotation_task.kappa())
print(annotation_task.alpha())

0.6143497757847534
0.20349792339640052
0.2040545274583225
0.2046885094600831


As we can see, our interannotator agreement measure is really low. This suggests that the annotators disagree more often than they agree. Upon closer examination, this might be caused by the following reasons:

- The sentences are too long and and contain multiple errors. Our annotation task asks the annotators whether **all** of the grammatical errors have been corrected. When there are multiple errors, some annotators might not have caught all of them and gave the correction a 'Yes' when **most** of the errors were corrected. There might have been others who gave the sentence a 'No' unless **all** errors were corrected, as per the instructions. We did not think that the instructions were ambiguous, but because of the nature of the data, this might have been confusing.

- There might be a bad annotator who consistently disagrees with other annotators, causing agreement to go down. 

In the next few cells, I investigate whether those types of bad annotators exist.

### Looking for bad annotators:

When looking through the data, I found that annotator with worker id 'A13XNZBCRSEGGY' disagrees with other annotators often. I decided to count his annotations:

In [174]:
from collections import defaultdict

triples = convert_to_tuples(hit_id_annotation_df.values.tolist())

# Check for bad annotators
workerid_to_no = defaultdict()

# Found a worker that says 'No' often - A13XNZBCRSEGGY
ggy_count = 0
ggy_no_count = 0
ggy_disagrees = 0
for triple in triples:
    if triple[0] == 'A13XNZBCRSEGGY':
        ggy_count += 1
        
print(ggy_count)

64


In [178]:
from collections import defaultdict

triples = convert_to_tuples(hit_id_annotation_df.values.tolist())

# Check for bad annotators
workerid_to_no = defaultdict()

# Found a worker that says 'No' often - A13XNZBCRSEGGY
ggy_count = 0
ggy_yes_count = 0
ggy_disagrees = 0
for triple in triples:
    if triple[0] == 'AJ9O2ZA0E8UDZ':
        ggy_count += 1
        if triple[2] == 'Yes':
            ggy_yes_count += 1
        
print(ggy_count)
print(ggy_yes_count)

210
207


Now, I create a dictionary of annotators that have been the "odd one out", and how often this happened, to see if there are more annotators who consistently disagree with others.

In [193]:
from collections import defaultdict
all_agree = 0
all_agree_annotators = []
one_no = 0
one_no_annotators = defaultdict()
one_yes = 0
one_yes_annotators = defaultdict()
for sublist in list_df:
    if sublist[1] == ['Yes', 'Yes', 'Yes'] or sublist[1] == ['No', 'No', 'No']:
        all_agree += 1
    elif sublist[1] == ['Yes', 'Yes', 'No'] or sublist[1] == ['Yes', 'No', 'Yes'] or sublist[1] == ['No', 'Yes', 'Yes']:
        one_no += 1
        for i in range(len(sublist[1])):
            if sublist[1][i] =='No':
                if (sublist[0][i]) in one_no_annotators:
                    one_no_annotators[(sublist[0][i])] += 1
                else:
                    one_no_annotators[(sublist[0][i])] = 1
    elif sublist[1] == ['No', 'No', 'Yes'] or sublist[1] == ['Yes', 'No', 'No'] or sublist[1] == ['No', 'Yes', 'No']:
        one_yes += 1
        for i in range(len(sublist[1])):
            if sublist[1][i] =='Yes':
                if (sublist[0][i]) in one_yes_annotators:
                    one_yes_annotators[(sublist[0][i])] += 1
                else:
                    one_yes_annotators[(sublist[0][i])] = 1

print("All agree: ", all_agree)
print("Percentage all agree: ", all_agree/len(list_df))
print("One No: ", one_no)
print("Percentage one no: ", one_no/len(list_df))
print("One Yes: ", one_yes)
print("Percentage one yes: ", one_yes/len(list_df))
print('-------------------------')

print("One no annotators: ", len(one_no_annotators))
print("One no annotators: ", one_no_annotators)
print("One yes annotators: ", len(one_yes_annotators))
print("One yes annotators: ", one_yes_annotators)
print(sorted(list(one_yes_annotators.values())))
print(sorted(list(one_no_annotators.values())))

All agree:  115
Percentage all agree:  0.40350877192982454
One No:  88
Percentage one no:  0.3087719298245614
One Yes:  82
Percentage one yes:  0.28771929824561404
-------------------------
One no annotators:  47
One no annotators:  defaultdict(None, {'A13XNZBCRSEGGY': 6, 'A13HAGXAAV0R3W': 1, 'AF8OCFNA5KE8A': 4, 'AKSLU0C30G3JT': 2, 'A2CLHHGG0NU8X1': 1, 'A2ZLJQWCM8KU36': 3, 'A3P7XSX4AH6VRP': 1, 'A3MGXAMLZW9FA0': 1, 'A1P8G437YE33GL': 5, 'A3HKD5JYXYBK5Q': 1, 'A3U120IPBPBC42': 2, 'A3BWC01TKU0IV9': 2, 'AXQQBHFXMWR26': 1, 'A3BVPQFBYGWJWX': 1, 'AQBFAQGUE6RE7': 2, 'A20CVITAYVBNBU': 9, 'A3EB5G62854MAJ': 1, 'A10DID6M4BA0GY': 1, 'A3HNNVSKBUSJQK': 1, 'A19Y8QDS9ABVD4': 1, 'AIRMPT0ELMSVF': 1, 'A1BPX78NLNLI9B': 4, 'AE9M4Z1NCYSDM': 1, 'A16C1MIPDO21HZ': 2, 'A1Z4XLTF0PG9S2': 1, 'AD01UNQ1IAKCP': 4, 'A21J8P4FV3RW3D': 1, 'A17W1W897L6VTD': 1, 'ALCPF5NANBDSZ': 1, 'A1KQ4XZM996XWW': 2, 'A21NJYGU5OYHZS': 1, 'A1GL6P2M5VMVT5': 1, 'A18XBX32LS7OH6': 1, 'A3L6JLT39UBMCD': 1, 'AANOH0710G7RN': 1, 'A3SK7097GI4WLX': 1, '

As we can see, worker with id 'A13XNZBCRSEGGY' disagrees with the other two annotators 16 times. That is 16/64 = 25% of the time. Another worker with id 'A1P8G437YE33GL' disagrees 13 times. That is 13/80 = 16.25% of the time. However, when I looked at the sentences these annotators were saying 'No' too, they annotator seemed to be right. The other two annotators were selecting 'Yes' because **most** of the errors were being corrected, but these annotators said 'No' because not **all** of the errors were corrected. Therefore, the annotators were not causing bad agreement, bad agreement was caused by an ambiguous task due to the nature of the sentences provided.

AJ9O2ZA0E8UDZ - 56 times is the one_yes annotator, removed him and agreement improved to 18% - removed another two and agreement improved to 20%.