## Annotation analysis

In this notebook, we will analyze the data we annotated as a class. ML often relies on human-annotated data, so it is very important to check whether humans actually agree on labels _before_ you start designing a model. If people don't agree, then the machine has no hope!

Skipping this check is a very common mistake in applied ML. It is imperative to validate that you have a well-defined task before you start predicting things.

Specifically, we will measure the extent to which different annotators agreed during annotation on Friday. If the annotators don't agree with enough regularity, it does not make sense to proceed to modeling. You have a BS task!

In [None]:
#### Takeaways

1. Do annotators actually agree?
2. To measure this, we compute a pairwise agreement statistic.
3. We also need to ask: do they agree at rates higher than chance?

In [57]:
from glob import glob
import csv
from collections import defaultdict
import json


def get_results(input_directory = "submissions/", output_file = "class.jsonl"):
    '''
    In this function we will manipulate raw data to make a list of dictionaries.

    In this case, we want a list of dictionaries, each of which has the following fields:
        - text holds the text that was judged
        - text_id holds an ID for the text
        - label holds the annotator's label
        - annotator holds the annotator's name
        - category holds the dataset the text came from (yelp or emotion)
    
    '''
    out = []

    for submission in glob(input_directory + "*"):
        category = "TODO" # should be yelp or emotion
        annotator = "TODO"
        with open(submission, "r") as inf:
            fl = csv.reader(inf)
            # your code here

    return out
            
all_results = get_results(input_directory = "submissions/")

#### Some basic questions
- How many annotators are there? 

[your answer here]

- How many total reviews are there? 

[your answer here]

In [58]:
def get_annotators(results):
    '''
    Get a set of all annotators in the dataset
    '''
    annotators = set()
    # your code here
    return annotators

def get_reviews2judgments(results):
    '''
    Return a map from a given text_id to all judgments for that text_id
    '''
    review2judgments = defaultdict(list)

    # your code here

    return review2judgments
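
As a rough sketch of the data structures these helpers work with, the snippet below groups a few made-up records by `text_id`. The records, the annotator names, and the function names (`toy_annotators`, `toy_group_by_text`) are hypothetical and only assume the fields described in `get_results`; the real submissions may look different.

```python
from collections import defaultdict

# Hypothetical toy records, not real class data
toy_results = [
    {"text_id": "r1", "text": "Great tacos", "label": "positive",
     "annotator": "ann_a", "category": "yelp"},
    {"text_id": "r1", "text": "Great tacos", "label": "positive",
     "annotator": "ann_b", "category": "yelp"},
    {"text_id": "r2", "text": "Slow service", "label": "negative",
     "annotator": "ann_a", "category": "yelp"},
]


def toy_annotators(results):
    # A set comprehension keeps each annotator name once
    return {r["annotator"] for r in results}


def toy_group_by_text(results):
    # defaultdict(list) creates an empty list the first time a text_id is seen
    grouped = defaultdict(list)
    for r in results:
        grouped[r["text_id"]].append(r)
    return grouped


print(len(toy_annotators(toy_results)))      # number of distinct annotators
print(len(toy_group_by_text(toy_results)))   # number of distinct texts
```

Counting the annotators and the keys of the grouped map is one way to approach the two basic questions above.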

#### More data exploration

- Do annotators tend to agree on item \#2? Does that make sense to you?

[your answer here]

In [90]:
def pairwise_agreement(results):
    '''
    Compute the pairwise agreement between raters for the input results
    
    To compute pairwise agreement, compare judgments from all pairs of annotators.
    Return the fraction of pairs of annotators who agree.
    '''
    total = 0
    agrees = 0
    for result in results:
        for other_result in results:
            # Count how many times the annotators agree.
            # Only count pairs in the same category, for the same unit of text,
            # and from two different annotators. Items labeled by only one
            # annotator contribute no pairs and should be skipped.
            pass  # your code here
    return agrees/total

pairwise_agreement([k for k in all_results if k["category"] == "yelp"])

0.4899691806710473
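
One possible way to organize this computation, shown here only as a sketch on made-up data, is to group judgments by `(category, text_id)` first and then compare every pair inside each group. The function name `pairwise_agreement_sketch` and the toy records are hypothetical. This version counts each unordered pair once, while a nested loop over all results counts each pair twice; the resulting fraction is the same.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement_sketch(results):
    # Group judgments so that only labels for the same text are compared
    groups = defaultdict(list)
    for r in results:
        groups[(r["category"], r["text_id"])].append(r)

    total = 0
    agrees = 0
    for judgments in groups.values():
        # A group with a single judgment yields no pairs, so it is skipped
        for a, b in combinations(judgments, 2):
            if a["annotator"] == b["annotator"]:
                continue  # ignore duplicate submissions from one annotator
            total += 1
            if a["label"] == b["label"]:
                agrees += 1
    # With no comparable pairs, the agreement rate is undefined
    return agrees / total if total else float("nan")


# Hypothetical toy records, not real class data
toy_results = [
    {"category": "yelp", "text_id": "r1", "annotator": "ann_a", "label": "positive"},
    {"category": "yelp", "text_id": "r1", "annotator": "ann_b", "label": "positive"},
    {"category": "yelp", "text_id": "r1", "annotator": "ann_c", "label": "negative"},
    {"category": "yelp", "text_id": "r2", "annotator": "ann_a", "label": "negative"},
    {"category": "yelp", "text_id": "r2", "annotator": "ann_b", "label": "negative"},
    {"category": "yelp", "text_id": "r3", "annotator": "ann_a", "label": "positive"},
]

print(pairwise_agreement_sketch(toy_results))  # 2 agreeing pairs out of 4 -> 0.5
```

In the toy data, item `r1` gives three pairs (one agreeing), `r2` gives one agreeing pair, and `r3` gives none because only one annotator labeled it.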

In [None]:
### Per-category analysis

- Which has higher pairwise agreement, the Yelp data or the emotion data? Does that make sense?

[Type your answer here]

### Per-item analysis

- Which reviews have the highest and the lowest pairwise agreement rates? Does this make sense?

[Type your answer here]

In [2]:
from random import random

def annotator1():
    return random() < .5

def annotator2():
    return random() < .5

trials = 10000
agreements = 0
for j in range(trials):
    if annotator1() == annotator2():
        agreements += 1
        
agreements/trials

0.4959
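
The simulated rate above can be sanity-checked analytically. A minimal sketch, assuming both annotators pick labels independently from the same distribution: the probability that both pick label $j$ is $p_j^2$, so the chance agreement rate is $\sum_j p_j^2$. The function name `chance_agreement` and the example distributions are illustrative and are not taken from the class data.

```python
def chance_agreement(label_probs):
    # label_probs maps each label to the probability that an annotator picks it.
    # Two independent annotators agree on label j with probability p_j * p_j.
    return sum(p * p for p in label_probs.values())


# Two equally likely labels, as in the simulation above
print(chance_agreement({True: 0.5, False: 0.5}))  # 0.5

# A skewed (hypothetical) distribution makes chance agreement much higher
print(chance_agreement({"positive": 0.9, "negative": 0.1}))
```

Note that chance agreement depends on how the labels are distributed, not only on how many labels there are.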

### Random agreement rate

If two reviewers answered randomly (meaning they just picked random annotations), how often would they agree just by chance?

[Type your answer here, and explain your reasoning]

### Fleiss Kappa

[Fleiss kappa](https://en.wikipedia.org/wiki/Fleiss%27_kappa) measures the extent to which pairs of reviewers agree, compared to how much they would agree by chance.

- $\bar{P}_e$ is the rate at which reviewers agree by chance
- $\bar{P}$ is the pairwise agreement rate across all items in the dataset
    - note: the Wikipedia article uses a slightly different definition of $\bar{P}$, because it assumes all reviewers review all items, which is not true in our case


$\kappa = \frac{\bar{P} - \bar{P}_e}{1-\bar{P}_e}$
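
To make the formula concrete, here is a small sketch. It assumes $\bar{P}_e$ is estimated the way the Wikipedia article does, from the overall share of judgments that fall into each label, and it takes $\bar{P}$ as given. The names `estimate_pe` and `kappa_sketch` and the toy numbers are hypothetical.

```python
from collections import Counter


def estimate_pe(labels):
    # labels is a flat list of every label assigned by any annotator.
    # p_j is the share of all judgments with label j; chance agreement is sum of p_j^2.
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())


def kappa_sketch(p_e, p_bar):
    # Agreement above chance, rescaled by the maximum possible agreement above chance
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy judgments: 6 positive and 4 negative
toy_labels = ["positive"] * 6 + ["negative"] * 4
toy_pe = estimate_pe(toy_labels)        # roughly 0.52
print(kappa_sketch(toy_pe, 0.76))       # roughly 0.5
print(kappa_sketch(toy_pe, toy_pe))     # 0.0 when agreement equals chance
```

The guard for $\bar{P}_e = 1$ is a design choice in this sketch: if every judgment uses the same label, the denominator is zero and the statistic carries no information.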

- What is the highest possible value of Fleiss Kappa? What is the lowest?

[Type your answer here]

- What does the denominator mean? If $\bar{P}_e$ is high, then is the denominator high or low?

[Type your answer here]

- If $\bar{P}$ is high and $\bar{P}_e$ is high, do you think the task is well-defined?

[Type your answer here]

- If $\bar{P}$ is low and $\bar{P}_e$ is high, do you think the task is well-defined?

[Type your answer here]

- If $\bar{P}$ is high and $\bar{P}_e$ is low, do you think the task is well-defined?

[Type your answer here]

- What do you think the Fleiss Kappa will be for the Yelp dataset? Do you think it will be higher or lower than for the emotion dataset?

[Type your answer here]

In [None]:
# Compute Fleiss Kappa for the dataset

def kappa(Pe, Pbar):
    return 0

