In [98]:
import pandas as pd
import numpy as np
import nltk
from nltk.metrics.agreement import AnnotationTask
from collections import defaultdict, Counter

### First, import the CSV file, keep the required columns, and save the DataFrame to `filtered_res`

In [60]:
ann_result = pd.read_csv("mturk_results.csv")
filtered_res = ann_result[["Input.ingredients", 
                           "Input.preparation", 
                           "Input.description", 
                           "Input.current_url",
                           "Input.document_url",
                           "WorkerId",
                           "WorkTimeInSeconds",
                           "Answer.category.label"]]
filtered_res.head(5)

Unnamed: 0,Input.ingredients,Input.preparation,Input.description,Input.current_url,Input.document_url,WorkerId,WorkTimeInSeconds,Answer.category.label
0,"3 bananas 1 vanilla pod, or 1 teaspoon vanilla...",Cut the bananas into slices and place into a f...,"Here's what you need: bananas, vanilla pod, fu...",https://tasty.co/recipe/banana-ice-cream-choco...,Banana Ice Cream Chocolate Bites,AJ9O2ZA0E8UDZ,7,Dessert
1,"3 bananas 1 vanilla pod, or 1 teaspoon vanilla...",Cut the bananas into slices and place into a f...,"Here's what you need: bananas, vanilla pod, fu...",https://tasty.co/recipe/banana-ice-cream-choco...,Banana Ice Cream Chocolate Bites,A3FOC1PCYZ0VT1,6,Dessert
2,"3 bananas 1 vanilla pod, or 1 teaspoon vanilla...",Cut the bananas into slices and place into a f...,"Here's what you need: bananas, vanilla pod, fu...",https://tasty.co/recipe/banana-ice-cream-choco...,Banana Ice Cream Chocolate Bites,ATUCZB2W7VVI7,7,Dessert
3,"nonstick cooking spray, for greasing 1 can ref...",Preheat the oven to 425F (220C). Grease a baki...,"Here's what you need: nonstick cooking spray, ...",https://tasty.co/recipe/spicy-chicken-bacon-fl...,Spicy Chicken Bacon Flatbread,ALS4ARR5D4BEC,15,Appetizer or side dish
4,"nonstick cooking spray, for greasing 1 can ref...",Preheat the oven to 425F (220C). Grease a baki...,"Here's what you need: nonstick cooking spray, ...",https://tasty.co/recipe/spicy-chicken-bacon-fl...,Spicy Chicken Bacon Flatbread,A3TKUXUTDX6FBF,53,Main course


### Next, set up a dictionary `annotations` keyed by annotator slot (`C0`, `C1`, `C2`), with each value holding that slot's annotations

In [27]:
annotations = dict()
for i in range(3):
    annotator = filtered_res[np.mod(np.arange(filtered_res.index.size),3) == i]
    res_annotator = annotator['Answer.category.label'].tolist()
    key = "C"+str(i)
    annotations[key] = res_annotator

In [31]:
titles = filtered_res[np.mod(np.arange(filtered_res.index.size),3) == 0]['Input.document_url'].tolist()
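The `np.mod(..., 3)` slicing above only works if every recipe has exactly three consecutive rows. The following is a minimal sketch for checking that assumption on plain `(item, label)` pairs. The helper names `group_labels_by_item` and `items_without_three_labels` and the toy rows are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

def group_labels_by_item(rows):
    """Group (item, label) rows by item, keeping row order within each item."""
    grouped = defaultdict(list)
    for item, label in rows:
        grouped[item].append(label)
    return dict(grouped)

def items_without_three_labels(rows, expected=3):
    """Return the items whose number of annotations differs from `expected`."""
    grouped = group_labels_by_item(rows)
    return [item for item, labels in grouped.items() if len(labels) != expected]

# Toy rows (hypothetical data, not taken from mturk_results.csv)
toy_rows = [
    ("Recipe A", "Dessert"), ("Recipe A", "Dessert"), ("Recipe A", "Drinks"),
    ("Recipe B", "Main course"), ("Recipe B", "Other"),
]
incomplete = items_without_three_labels(toy_rows)
print(incomplete)
```

With the real data, the rows could be built with `zip(filtered_res['Input.document_url'], filtered_res['Answer.category.label'])`. Note that this groups by title, so two different recipes that share a title would be merged.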

### We can do a simple analysis first to check how well annotators agree on different observations

In [126]:
answer_triples = []
distinct_idx = []
two_agree_idx = []
all_agree_lst = []
all_distinct = 0
two_agree = 0
all_agree = 0
for i in range(len(annotations['C0'])):  # total number of annotated examples (800 here)
    answers_set = set()
    answers_lst = []
    for key in ['C0','C1','C2']:
        answers_set.add(annotations[key][i])
        answers_lst.append(annotations[key][i])
    if len(answers_set) == 1:
        all_agree += 1
        all_agree_lst.extend(list(answers_set))
    elif len(answers_set) == 2:
        two_agree += 1
        two_agree_idx.append(i)
    elif len(answers_set) == 3:
        all_distinct += 1
        distinct_idx.append(i)
    answer_triples.append(tuple(answers_lst))

In [117]:
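# Collect the label triples for the items where exactly two annotators agree
two_triples = [answer_triples[i] for i in two_agree_idx]
agreed_list = []
for i in two_triples:
    count_dict = Counter(list(i))
    # The agreed label is the one that appears twice in the triple
    item = [key for key in count_dict.keys() if count_dict[key] == 2]
    agreed_list.append(item[0])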

In [47]:
print(f"Number of cases where three annotators provide different answers: {all_distinct}")
print(f"Number of cases where two annotators reach agreement: {two_agree}")
print(f"Number of cases where all three annotators reach agreement: {all_agree}")

Number of cases where three annotators provide different answers: 65
Number of cases where two annotators reach agreement: 394
Number of cases where all three annotators reach agreement: 341


#### Quick Summary
From the result above, all three annotators agree in only 341 of 800 cases, which is `42.625%`. We can therefore expect the scores from the following `AnnotationTask` to be around this number as well. In another `394/800 (49.25%)` cases, two of the three annotators agree. This gives a total of `735/800 (91.875%)` cases where we can conclude an answer from the crowd.
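The percentages above can be reproduced with a few lines of arithmetic. The sketch below also derives the average pairwise observed agreement, which is the raw quantity that chance-corrected measures typically start from. `agreement_summary` is a hypothetical helper, not part of the original notebook.

```python
def agreement_summary(all_agree, two_agree, all_distinct):
    """Summarise agreement counts for items labelled by exactly three annotators."""
    total = all_agree + two_agree + all_distinct
    # Three annotators form three pairs per item:
    # full agreement -> 3 of 3 pairs agree, two agree -> 1 of 3, all distinct -> 0 of 3.
    agreeing_pairs = 3 * all_agree + 1 * two_agree
    return {
        "all_agree_share": all_agree / total,
        "two_agree_share": two_agree / total,
        "majority_share": (all_agree + two_agree) / total,
        "pairwise_observed_agreement": agreeing_pairs / (3 * total),
    }

summary = agreement_summary(all_agree=341, two_agree=394, all_distinct=65)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For these counts the pairwise observed agreement is 1417/2400, about 0.59, before any correction for chance.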

### Check the distribution between categories for the cases that we can conclude an answer from the annotators

In [118]:
Counter(agreed_list + all_agree_lst)

Counter({'Appetizer or side dish': 212,
         'Dessert': 214,
         'Main course': 285,
         'Drinks': 15,
         'Other': 9})

The counts for `Appetizer or side dish`, `Dessert` and `Main course` are roughly even, while the remaining two categories, `Drinks` and `Other`, are rare. This distribution is based on the crowd's majority annotations.
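The majority answer per recipe can also be derived in a single pass over the label triples. This is a minimal sketch, assuming each item carries a tuple of labels. `majority_label` and the toy triples are hypothetical and not taken from the data above.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a strict majority, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Toy triples (hypothetical data)
toy_triples = [
    ("Dessert", "Dessert", "Dessert"),
    ("Main course", "Appetizer or side dish", "Main course"),
    ("Drinks", "Dessert", "Other"),
]
resolved = [majority_label(t) for t in toy_triples]
distribution = Counter(label for label in resolved if label is not None)
print(resolved)
print(distribution)
```

Applied to `answer_triples`, this would cover the unanimous and the two-agree cases together, and it leaves the fully distinct cases as `None`.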

### We can check on the cases where all three annotators show distinct opinions

In [76]:
three_triples = [answer_triples[i] for i in distinct_idx] 
df_three_answers = pd.DataFrame(three_triples, columns=['C0', 'C1', 'C2'])

In [119]:
df_three_receipe = pd.DataFrame()
df_three_receipe['ReceipeName'] = filtered_res[np.mod(np.arange(filtered_res.index.size),3) == 0].iloc[distinct_idx,:]['Input.document_url']
df_three_receipe = df_three_receipe.reset_index()

df_three = pd.concat([df_three_receipe, df_three_answers], axis=1)
df_three.head(10)

Unnamed: 0,index,ReceipeName,C0,C1,C2
0,27,Squid Ink Fettuccine With Black Mussels,Drinks,Appetizer or side dish,Main course
1,48,Jackfruit Tacos,Dessert,Main course,Other
2,84,Smashed Cucumber Salad,Appetizer or side dish,Dessert,Main course
3,96,Cinnamon Bun French Toast,Appetizer or side dish,Dessert,Main course
4,102,Salted Caramel,Other,Dessert,Appetizer or side dish
5,111,Buffalo Chicken Dip,Main course,Appetizer or side dish,Other
6,132,Vegan Sweet Potatoes Au Gratin,Appetizer or side dish,Drinks,Main course
7,138,Sweet Potato Breakfast Bars,Main course,Dessert,Other
8,162,Banana Oat Freezer-Prep Smoothie,Dessert,Drinks,Appetizer or side dish
9,168,Freezer-Prep Breakfast Burritos,Main course,Appetizer or side dish,Other


### Now we calculate the interannotator agreement measure

In [33]:
#Code adapted from Julian's lecture
def convert_to_triples(sentence,annotations_by_annotator):
    triple_list = []
    for annotator, annotations in annotations_by_annotator.items():
        for i in range(len(annotations)):
            triple_list.append((annotator,sentence[i],annotations[i]))
    return triple_list

triples = convert_to_triples(titles, annotations)

In [34]:
annotation_task = AnnotationTask(triples)

### Choose the measure

Since many different annotators contribute to each slot, we cannot use kappa, as it requires the same coder to submit annotations for all cases. Although the distribution is skewed for the categories `Drinks` and `Other`, it is fairly even for the remaining three categories. We can use either π or Krippendorff’s α. We choose π here because our five categories (`Appetizer or side dish`, `Main course`, `Dessert`, `Drinks` and `Other`) are nominal and involve no scale. Krippendorff’s α is therefore not required, and π suffices.
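To make the measure concrete, π can be written as (A_o − A_e) / (1 − A_e), where A_o is the average share of agreeing annotator pairs per item and A_e is the chance that two labels drawn from the pooled label distribution match. The sketch below implements that formula in plain Python. `multi_pi` is a hypothetical helper, and whether it reproduces `AnnotationTask.pi()` digit for digit is an assumption, not something verified here.

```python
from collections import Counter
from itertools import combinations

def multi_pi(label_sets):
    """Chance-corrected agreement for items that each carry the same number of labels."""
    # Observed agreement: share of annotator pairs that agree, averaged over items.
    per_item = []
    for labels in label_sets:
        pairs = list(combinations(labels, 2))
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    observed = sum(per_item) / len(per_item)
    # Expected agreement: chance that two labels drawn from the pooled distribution match.
    pooled = Counter(label for labels in label_sets for label in labels)
    total = sum(pooled.values())
    expected = sum((count / total) ** 2 for count in pooled.values())
    return (observed - expected) / (1 - expected)

# Toy example (hypothetical labels): A_o = 2/3 and A_e = 1/2
toy_sets = [("A", "A", "A"), ("A", "B", "A"), ("B", "B", "B"), ("A", "B", "B")]
toy_pi = multi_pi(toy_sets)
print(round(toy_pi, 4))
```

In the toy example the observed agreement of 2/3 shrinks to a π of 1/3 once the 50% chance agreement is removed, which illustrates why a chance-corrected score sits below the raw pairwise agreement.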

In [81]:
annotation_task.pi()

0.4231398387080211

In [39]:
annotation_task.alpha()  # calculated for reference

0.42338019710855934

### Summary

Both measures give similar results, in line with the estimate made earlier. Krippendorff's alpha compares the 'observed' disagreement with the 'expected' disagreement. Observed disagreement is high in our dataset: in 57.375% of cases (459 of 800), at least one annotator gives a different label. The result is around 0.42, which is below the lowest conceivable limit (0.667) and falls in the 0.41 to 0.60 range, which is considered `moderate`.

We can probably improve the score by providing more examples for each category and writing more detailed explanations for annotators to refer to. We can also check the cases where at least one annotator disagrees, to see whether it is always the same annotator whose answer deviates from the others. If that is the case, we can reject that annotator's answers, take our money back and resubmit the annotation task so that other people annotate those cases.

For example, below are 5 cases completed by the annotator with ID `A770IL76LZ9T6`:

| Input.document_url | Answer.category.label |
| --- | --- |
| Squid Ink Fettuccine With Black Mussels | Drinks |
| Vegan Sweet Potatoes Au Gratin | Drinks |
| Fluffy Jiggly Japanese Cheesecake | Main course |
| Acai And Blueberry Smoothie | Dessert |
| Pancake Breakfast Sandwich | Dessert |
    
These five recipes were all annotated incorrectly. In this case, we can reject this annotator's answers and let other annotators re-submit annotations.
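The check described above, whether the same annotator repeatedly deviates from the others, can be sketched as a per-worker deviation count against the majority label. This is a minimal sketch on plain `(item, worker, label)` tuples. `deviation_counts`, the worker IDs `W1` to `W3` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def deviation_counts(rows):
    """Map each worker to (deviations from the majority, items judged).

    rows: iterable of (item, worker, label) tuples.
    Items without a strict majority label are skipped.
    """
    rows = list(rows)
    labels_by_item = defaultdict(list)
    for item, _worker, label in rows:
        labels_by_item[item].append(label)
    # Majority label per item, where one exists
    majority = {}
    for item, labels in labels_by_item.items():
        label, count = Counter(labels).most_common(1)[0]
        if count > len(labels) / 2:
            majority[item] = label
    stats = defaultdict(lambda: [0, 0])
    for item, worker, label in rows:
        if item in majority:
            stats[worker][1] += 1
            if label != majority[item]:
                stats[worker][0] += 1
    return {worker: tuple(counts) for worker, counts in stats.items()}

# Toy rows (hypothetical data); R3 has no majority and is skipped
toy_rows = [
    ("R1", "W1", "Dessert"), ("R1", "W2", "Dessert"), ("R1", "W3", "Drinks"),
    ("R2", "W1", "Main course"), ("R2", "W2", "Main course"), ("R2", "W3", "Drinks"),
    ("R3", "W1", "Other"), ("R3", "W2", "Dessert"), ("R3", "W3", "Drinks"),
]
counts = deviation_counts(toy_rows)
print(counts)
```

With the real data, the rows could come from `filtered_res[['Input.document_url', 'WorkerId', 'Answer.category.label']].itertuples(index=False, name=None)`. A worker with a high deviation share over many items would be a candidate for rejection, although a worker who judged only a handful of items gives little evidence either way.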