## Annotation Quality
After you have finished assignment 3, you can contact a TA to obtain data from another annotator. The data will also be made available on LearnIt at 13:30, but we **STRONGLY** recommend not looking at the other annotator's data before completing your own annotation.

* a) Calculate the accuracy between you and the other annotator: how often did you agree?
* b) Now implement Cohen’s Kappa score and calculate the Kappa for your annotation sample. In which range does your Kappa score fall?
* c) Take a closer look at the cases where you disagreed with the other annotator. Are these disagreements due to ambiguity, or are there mistakes in the annotation? Would you classify your agreement in the same category that the standard Kappa interpretation assigns it?

In [None]:
# Function to extract the second column (labels) from annotated files
def extract_labels(file_path):
    labels = []
    with open(file_path) as f:
        for line in f:
            if line.strip():  # Skip empty lines
                columns = line.rstrip("\n").split("\t")  # Tab-separated columns
                if len(columns) > 1:  # Skip lines without a label column
                    labels.append(columns[1].strip())  # Second column is the label
    return labels

In [None]:
# Next, extract the labels from the two files you have after annotation
out1_labels = extract_labels("pos-data/afek.conllu")  # replace afek.conllu with your annotated file
out2_labels = extract_labels("pos-data/afek.conllu.gold")

In [None]:
# Function to calculate accuracy between two lists of labels
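def calculate_accuracy(labels1, labels2):
    # zip() silently truncates to the shorter list, so check the lengths first
    if len(labels1) != len(labels2):
        raise ValueError("Both annotations must contain the same number of labels")
    total_tokens = len(labels1)
    correct_tokens = sum(1 for label1, label2 in zip(labels1, labels2) if label1 == label2)
    accuracy = correct_tokens / total_tokens
    return accuracy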

# Calculate accuracy
accuracy = calculate_accuracy(out1_labels, out2_labels)
print("Accuracy:", accuracy)

### Steps:

Count how often both annotators assigned the same label.

Count how often each annotator assigned each label.

Compute the observed agreement and the expected agreement.
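
Written out, the steps above correspond to the usual definition of Cohen's Kappa. The notation is introduced here for illustration only: $N$ is the number of tokens, $a_i$ and $b_i$ are the labels the two annotators gave token $i$, and $n_k^{(1)}$, $n_k^{(2)}$ are the number of times annotator 1 and annotator 2 used label $k$.

$$
P_o = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[a_i = b_i], \qquad
P_e = \sum_{k} \frac{n_k^{(1)}}{N}\cdot\frac{n_k^{(2)}}{N}, \qquad
\kappa = \frac{P_o - P_e}{1 - P_e}
$$

As a made-up example, take $N = 10$ tokens with 9 agreements, where annotator 1 used NOUN 6 times and VERB 4 times, and annotator 2 used each label 5 times. Then $P_o = 0.9$, $P_e = 0.6 \cdot 0.5 + 0.4 \cdot 0.5 = 0.5$, and $\kappa = (0.9 - 0.5) / (1 - 0.5) = 0.8$.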

### Kappa Range:

κ=1: Perfect agreement.

κ>0.75: Excellent agreement.

0.40<κ≤0.75: Fair to good agreement.

κ≤0.40: Poor agreement.

In [None]:
#4B
from collections import Counter

def cohen_kappa(annotation1, annotation2):
    # How often each annotator assigned each label
    count1 = Counter(annotation1)
    count2 = Counter(annotation2)
    
    # Observed agreement (P_o)
    observed_agreement = sum((a == b) for a, b in zip(annotation1, annotation2)) / len(annotation1)
    
    # Expected agreement (P_e)
    total = len(annotation1)
    categories = set(annotation1).union(set(annotation2)) # categories that appear in either of annotations
    expected_agreement = 0
    
    for category in categories:
        p1 = count1.get(category, 0) / total
        p2 = count2.get(category, 0) / total
        expected_agreement += p1 * p2
    
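    # Cohen's Kappa: (P_o - P_e) / (1 - P_e)
    if expected_agreement == 1:
        # Both annotators used one and the same label throughout, so kappa is undefined (0/0)
        return float("nan")
    kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
    return kappa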

# Calculate Cohen's Kappa
kappa = cohen_kappa(out1_labels, out2_labels)
print(f"Cohen's Kappa: {kappa}")

A finer-grained scale for interpreting the Kappa score:
- < 0.2: Poor agreement
- 0.2 - 0.4: Fair
- 0.4 - 0.6: Moderate
- 0.6 - 0.8: Substantial
- 0.8 - 1.0: Near-perfect
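
The five-category scale above can be turned into a small lookup so the category is printed next to the score. This is a minimal sketch: `interpret_kappa` is a hypothetical helper name, and treating each lower bound as inclusive (so 0.4 counts as "Moderate") is an assumption, since the list does not say which side a boundary value belongs to.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the category names of the five-category scale.

    Assumption: each lower bound is inclusive, e.g. 0.4 is "Moderate".
    """
    if kappa < 0.2:
        return "Poor"
    if kappa < 0.4:
        return "Fair"
    if kappa < 0.6:
        return "Moderate"
    if kappa < 0.8:
        return "Substantial"
    return "Near-perfect"


# Illustrative values only, not results from the annotation data
for value in (0.15, 0.45, 0.85):
    print(value, interpret_kappa(value))
```

With the notebook's own variable, this would be called as `interpret_kappa(kappa)` after computing the score.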

In [None]:
#4C
def find_disagreements(annotation1, annotation2):
    disagreements = []
    for idx, (a, b) in enumerate(zip(annotation1, annotation2)):
        if a != b:
            disagreements.append((idx, a, b))
    return disagreements

disagreements = find_disagreements(out1_labels, out2_labels)

# here I only print the labels that were differing, you can also print out both words and labels 
# by reading in the files again and accounting for both columns 
for idx, annot1_label, annot2_label in disagreements:
    print(f"Index: {idx}, Annotator 1: {annot1_label}, Annotator 2: {annot2_label}")
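
For question c) it can help to see which label pairs are confused most often, rather than reading the disagreements one index at a time. The sketch below counts each (annotator 1, annotator 2) pair among the disagreements. `count_disagreement_pairs` is a hypothetical helper, and the toy label lists are invented for illustration; in the notebook they would be replaced by `out1_labels` and `out2_labels`.

```python
from collections import Counter


def count_disagreement_pairs(annotation1, annotation2):
    """Count each (annotator 1 label, annotator 2 label) pair where the two differ."""
    return Counter((a, b) for a, b in zip(annotation1, annotation2) if a != b)


# Toy labels for illustration only; replace with out1_labels and out2_labels
toy1 = ["NOUN", "VERB", "ADJ", "NOUN", "ADV", "NOUN"]
toy2 = ["NOUN", "VERB", "NOUN", "PROPN", "ADJ", "PROPN"]

# Most frequent confusions first
for (label1, label2), count in count_disagreement_pairs(toy1, toy2).most_common():
    print(f"{label1} vs {label2}: {count}")
```

A pair that shows up repeatedly may point to a systematic difference in how the guidelines were read, while one-off pairs are more likely to be slips; this is only a heuristic, and the individual cases still need to be inspected.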