# Calculate inter-annotator agreement for annotated data

Because the annotations are nominal and multi-label (each item can be assigned several categories),
we use Krippendorff's alpha (see [1]) to measure inter-annotator agreement.

- α = 1 indicates perfect reliability.
- α = 0 indicates the complete absence of reliability. Units and the values assigned to them are statistically unrelated.
- α < 0 when disagreements are systematic and exceed what can be expected by chance. [2]
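
These three cases follow from the general form of α, which compares observed with expected disagreement. As a brief sketch in the usual notation (the symbols $D_o$ and $D_e$ are introduced here for illustration and are not used elsewhere in this notebook):

$$
\alpha = 1 - \frac{D_o}{D_e}
$$

Here $D_o$ is the disagreement observed among annotators and $D_e$ is the disagreement expected by chance. $D_o = 0$ gives α = 1, $D_o = D_e$ gives α = 0, and $D_o > D_e$ gives α < 0.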

The minimum acceptable α coefficient should be chosen according to the importance of the conclusions to be drawn from imperfect data. When the costs of mistaken conclusions are high, the minimum alpha needs to be set high as well. In the absence of knowledge of the risks of drawing false conclusions from unreliable data, social scientists commonly rely on data with reliabilities α ≥ 0.800, consider data with 0.800 > α ≥ 0.667 only to draw tentative conclusions, and discard data whose agreement measures α < 0.667. [3]
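
The rule of thumb above can be written down as a small helper. This is only a sketch: `interpret_alpha` and its return strings are hypothetical names chosen for illustration, and the cut-offs are the 0.800 and 0.667 values quoted above, which should be tightened when mistaken conclusions are costly.

```python
def interpret_alpha(alpha: float) -> str:
    """Map an alpha value to the rule-of-thumb categories quoted above."""
    if alpha >= 0.800:
        return "reliable"
    if alpha >= 0.667:
        return "tentative conclusions only"
    return "discard"


# Example values chosen for illustration only.
for value in (0.85, 0.70, 0.30):
    print(f"alpha = {value:.2f}: {interpret_alpha(value)}")
```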

[1] K. Krippendorff, “Reliability in Content Analysis: Some Common Misconceptions and Recommendations,” Human Communication Research, vol. 30, no. 3, pp. 411–433, Jul. 2004, doi: 10.1111/j.1468-2958.2004.tb00738.x.  
[2] https://en.wikipedia.org/wiki/Krippendorff%27s_alpha#General_form_of_alpha  
[3] https://en.wikipedia.org/wiki/Krippendorff%27s_alpha#Significance


In [1]:
import pandas as pd

# Load the semicolon-delimited CSV exports, reading them as UTF-16 to avoid decoding errors
variables_df = pd.read_csv('../data/variables_res-soft-class_2024-10-31_21-58.csv', delimiter=';', encoding='utf-16')
data_df = pd.read_csv('../data/data_res-soft-class_2024-10-31_21-46.csv', delimiter=';', encoding='utf-16')
values_df = pd.read_csv('../data/values_res-soft-class_2024-10-31_21-58.csv', delimiter=';', encoding='utf-16')

First, build the matrix.

In [2]:
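# Column names in the exported data
item_col = "R004_01"       # annotated item (a DOI, see the output below)
complete_col = "FINISHED"  # marks completed annotations
annotator_col = "R003"     # annotator ID
cat_cols = [
    "R001x01", "R001x11", "R001x12", "R001x13",
    "R001x14", "R001x15", "R001x02", "R001x21",
    "R001x22", "R001x23", "R001x03", "R001x31",
    "R001x32", "R001x33", "R001x34", "R001x35",
    "R001x36", "R001x37", "R001x38", "R001x99"
]

# Build (annotator, item, label set) triples from completed annotations only.
# A category is included in the label set when its column holds the value 2.
data = [
    (
        row[annotator_col],
        row[item_col],
        frozenset(cat for cat in cat_cols if row[cat] == 2),
    )
    for _, row in data_df.iterrows()
    if row[complete_col]
]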

data

[(3, '10.21105/joss.07134', frozenset({'R001x01', 'R001x11'})),
 (9, '10.21105/joss.07134', frozenset({'R001x01', 'R001x11'})),
 (9,
  '10.21105/joss.06914',
  frozenset({'R001x01', 'R001x03', 'R001x12', 'R001x33', 'R001x36'})),
 (2, '10.21105/joss.06914', frozenset({'R001x03', 'R001x32'})),
 (2, '10.21105/joss.06825', frozenset({'R001x02'})),
 (1, '10.21105/joss.06825', frozenset({'R001x02'})),
 (1, '10.21105/joss.06642', frozenset({'R001x01', 'R001x12'})),
 (1,
  '10.21105/joss.05833',
  frozenset({'R001x01', 'R001x03', 'R001x12', 'R001x32'})),
 (3,
  '10.21105/joss.06642',
  frozenset({'R001x02', 'R001x03', 'R001x22', 'R001x36'})),
 (1, '10.21105/joss.06932', frozenset({'R001x01', 'R001x11'})),
 (1, '10.21105/joss.07031', frozenset({'R001x01', 'R001x12'})),
 (12, '10.21105/joss.07031', frozenset({'R001x01', 'R001x12'})),
 (12, '10.21105/joss.05833', frozenset({'R001x01', 'R001x12'})),
 (12, '10.21105/joss.06932', frozenset({'R001x03', 'R001x32'})),
 (3, '10.21105/joss.05766', frozen
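
Items that received only one completed annotation offer no pair of annotators to compare, so it can be worth checking coverage before computing agreement. The following is a minimal sketch, assuming triples shaped like `data` above. `annotations_per_item` is a hypothetical helper, and `sample` copies only a few triples from the output, so its counts say nothing about the full data set.

```python
from collections import Counter


def annotations_per_item(triples):
    """Count how many annotations each item received."""
    return Counter(item for _, item, _ in triples)


# A few triples copied from the output above (not the full data set).
sample = [
    (3, "10.21105/joss.07134", frozenset({"R001x01", "R001x11"})),
    (9, "10.21105/joss.07134", frozenset({"R001x01", "R001x11"})),
    (2, "10.21105/joss.06825", frozenset({"R001x02"})),
    (1, "10.21105/joss.06825", frozenset({"R001x02"})),
    (1, "10.21105/joss.06932", frozenset({"R001x01", "R001x11"})),
]

counts = annotations_per_item(sample)
single = sorted(item for item, n in counts.items() if n < 2)
print(f"{len(counts)} items, {len(single)} with fewer than two annotations")
```

In the notebook itself, the same helper could be called as `annotations_per_item(data)`.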

Now, compute Krippendorff's alpha in two annotation tasks that use set-valued distances suited to multi-label annotations: Jaccard distance [1] and MASI distance [2].

[1] P. Jaccard, Nouvelles recherches sur la distribution florale. Lausanne: Rouge, 1908.  
[2] R. Passonneau, “Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation,” in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J. Mariani, J. Odijk, and D. Tapias, Eds., Genoa, Italy: European Language Resources Association (ELRA), May 2006. Accessed: Nov. 06, 2024. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2006/pdf/636_pdf.pdf
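
To illustrate how the two distances differ, here is a plain-Python sketch of both definitions, applied to two label sets taken from the output above. The function names `jaccard` and `masi` are hypothetical, and the treatment of two empty sets as identical is an assumption. The MASI weights 1, 2/3, 1/3 and 0 follow Passonneau's description; library implementations may round them, so results can differ slightly from `nltk`.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty label sets count as identical
    return 1 - len(a & b) / len(union)


def masi(a: frozenset, b: frozenset) -> float:
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    union = a | b
    if not union:
        return 0.0  # same assumption as above
    inter = a & b
    if a == b:
        weight = 1.0      # identical sets
    elif a <= b or b <= a:
        weight = 2 / 3    # one set contains the other
    elif inter:
        weight = 1 / 3    # overlap, but neither contains the other
    else:
        weight = 0.0      # disjoint sets
    return 1 - len(inter) / len(union) * weight


# Two label sets assigned to the same item in the output above.
x = frozenset({"R001x01", "R001x03", "R001x12", "R001x32"})
y = frozenset({"R001x01", "R001x12"})
print(f"Jaccard: {jaccard(x, y):.3f}, MASI: {masi(x, y):.3f}")
```

Because `y` is a subset of `x`, MASI penalizes the pair more heavily than Jaccard does, which is why the two tasks can yield different α values for the same data.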


In [3]:
import nltk
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance

distances = {jaccard_distance: "Jaccard distance", masi_distance: "MASI distance"}

jaccard_task = nltk.AnnotationTask(distance=jaccard_distance)
masi_task = nltk.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]
for task in tasks:
    task.load_array(data)
    print(f"Calculated Krippendorff's alpha based on {distances[task.distance]}.")
    print(f"Annotators: {task.C}\nItems: {task.I}\nLabel sets: {task.K}")
    print(f"Krippendorff's alpha: {task.alpha()}")
    print()

Calculated Krippendorff's alpha based on Jaccard distance.
Annotators: {1, 2, 3, 5, 6, 9, 10, 12, 13}
Items: {'10.21105/joss.04591', '10.21105/joss.02817', '10.21105/joss.07031', '10.21105/joss.06914', '10.21105/joss.06617', '10.21105/joss.05375', '10.21105/joss.06940', '10.21105/joss.06067', '10.21105/joss.04677', '10.21105/joss.02343', '10.21105/joss.05098', '10.21105/joss.02825', '10.21105/joss.02334', '10.21105/joss.04278', '10.21105/joss.04354', '10.21105/joss.05496', '10.21105/joss.02017', '10.21105/joss.04183', '10.21105/joss.05201', '10.21105/joss.04958', '10.21105/joss.06932', '10.21105/joss.04360', '10.21105/joss.05402', '10.21105/joss.06574', '10.21105/joss.02011', '10.21105/joss.02805', '10.21105/joss.04953', '10.21105/joss.05202', '10.21105/joss.05100', '10.21105/joss.02646', '10.21105/joss.05619', '10.21105/joss.03032', '10.21105/joss.05453', '10.21105/joss.01981', '10.21105/joss.04099', '10.21105/joss.05573', '10.21105/joss.02369', '10.21105/joss.02653', '10.21105/joss.0

Both distance measures yield α ≈ 0.3, well below the 0.667 minimum cited above, so our annotation data should be **discarded** as unreliable.
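
To make the computation behind such α values more transparent, the following is a minimal from-scratch sketch of the general definition (one minus observed over expected disagreement). It is not `nltk`'s implementation. The names `krippendorff_alpha` and `nominal` are hypothetical, and the two toy data sets are invented for illustration only.

```python
from collections import defaultdict
from itertools import permutations


def krippendorff_alpha(triples, distance):
    """Alpha = 1 - D_o / D_e for (annotator, item, label) triples."""
    by_item = defaultdict(list)
    for _, item, label in triples:
        by_item[item].append(label)
    # Only items with at least two annotations can be compared.
    units = [labels for labels in by_item.values() if len(labels) >= 2]
    values = [label for labels in units for label in labels]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two annotations")
    # Observed disagreement: pairwise distances within each item.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(labels, 2)) / (len(labels) - 1)
        for labels in units
    ) / n
    # Expected disagreement: pairwise distances across all comparable values.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("expected disagreement is zero; alpha is undefined")
    return 1 - d_o / d_e


def nominal(a, b):
    """Nominal distance: 0 for equal labels, 1 otherwise."""
    return 0.0 if a == b else 1.0


# Invented toy data: two annotators, two items.
agree = [(1, "A", "x"), (2, "A", "x"), (1, "B", "y"), (2, "B", "y")]
disagree = [(1, "A", "x"), (2, "A", "y"), (1, "B", "x"), (2, "B", "y")]
print(krippendorff_alpha(agree, nominal))     # perfect agreement
print(krippendorff_alpha(disagree, nominal))  # systematic disagreement
```

The first toy set gives α = 1 and the second gives a negative α of about -0.5, matching the interpretation of perfect reliability and systematic disagreement. The function accepts any distance on labels, so a set distance such as `jaccard_distance` could be passed together with `data`, although small numerical differences from `task.alpha()` are possible.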