# Inter-Annotator Agreement (IAA)

Lecture 14 | CMU ANLP Fall 2025 | Instructor: Sean Welleck

This notebook shows a simple example of computing Cohen's Kappa, a common inter-annotator agreement (IAA) metric.

## Background

When multiple annotators label the same data, we need to measure their agreement to assess annotation quality. A common metric is **Cohen's Kappa (κ)**, defined as:
  
  $$\kappa = \frac{P_o - P_e}{1 - P_e}$$
  
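  where $P_o$ is the observed agreement (the proportion of items on which the annotators assign the same label) and $P_e$ is the agreement expected by chance given each annotator's label frequencies. $\kappa = 1$ indicates perfect agreement and $\kappa = 0$ indicates agreement no better than chance.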
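
To make the two terms concrete, assume the annotations are summarized in a $K \times K$ table of counts, where $n_{ij}$ is the number of items labeled $i$ by the first annotator and $j$ by the second, and $n$ is the total number of items. The symbols $n_{ij}$, $n_{k\cdot}$ (row total), and $n_{\cdot k}$ (column total) are notation introduced here for illustration:

$$P_o = \frac{1}{n}\sum_{k=1}^{K} n_{kk}, \qquad P_e = \sum_{k=1}^{K} \frac{n_{k\cdot}}{n} \cdot \frac{n_{\cdot k}}{n}$$

$P_o$ sums the diagonal of the table, and $P_e$ is the probability that two annotators who label independently, each following their own marginal label frequencies, would pick the same label.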

### Setup and imports

In [None]:
from __future__ import annotations
import numpy as np
import pandas as pd
from collections import Counter

### Building a confusion matrix

The first step in computing agreement is to build a confusion matrix showing how often each label from Rater 1 co-occurs with each label from Rater 2.

In [2]:
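def confusion_matrix_from_labels(y1, y2, label_order=None):
    """Cross-tabulate two raters' labels.

    Returns a DataFrame of counts (rows: Rater 1, columns: Rater 2) and the
    list of labels in row/column order.
    """
    y1 = np.asarray(y1)
    y2 = np.asarray(y2)
    if y1.shape[0] != y2.shape[0]:
        raise ValueError("y1 and y2 must have the same length.")
    if label_order is None:
        # Default to the sorted union of labels used by either rater
        labels = sorted(set(y1) | set(y2))
    else:
        labels = list(label_order)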
    
    idx = {lab: i for i, lab in enumerate(labels)}
    C = np.zeros((len(labels), len(labels)), dtype=int)
    for a, b in zip(y1, y2):
        C[idx[a], idx[b]] += 1
    
    dfC = pd.DataFrame(
        C, 
        index=pd.Index(labels, name="Rater 1"), 
        columns=pd.Index(labels, name="Rater 2")
    )
    return dfC, labels
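
As a quick cross-check of the idea, the same kind of table can be produced with `pd.crosstab`. The sketch below uses made-up toy labels (`rater_a` and `rater_b` are hypothetical names, not data from this notebook):

```python
import pandas as pd

# Toy labels for two hypothetical raters (not from the notebook's data).
rater_a = pd.Series(["pos", "neg", "pos", "neg", "pos", "neg"], name="Rater 1")
rater_b = pd.Series(["pos", "neg", "neg", "neg", "pos", "pos"], name="Rater 2")

# Rows index Rater 1's label, columns index Rater 2's label.
table = pd.crosstab(rater_a, rater_b)
print(table)

# Diagonal cells count agreements; off-diagonal cells count disagreements.
n_agree = int(table.loc["neg", "neg"] + table.loc["pos", "pos"])
print(f"{n_agree} of {len(rater_a)} items agree")
```

One caveat: `pd.crosstab` only includes labels that actually occur, so the table may not be square if a rater never uses some label. Passing an explicit `label_order` to the function above avoids that.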

### Computing Cohen's Kappa

In [4]:
def cohen_kappa_from_confmat(C: np.ndarray | pd.DataFrame):
    """Return (kappa, Po, Pe) for a square confusion matrix of counts."""
    if isinstance(C, pd.DataFrame):
        C = C.values
    
    n = C.sum()
    if n == 0:
        raise ValueError("Empty confusion matrix.")
    
    # Observed agreement: fraction of items on the diagonal
    Po = np.trace(C) / n
    
    # Chance agreement: product of marginal proportions
    r_marg = C.sum(axis=1) / n  # rater1 marginal proportions
    c_marg = C.sum(axis=0) / n  # rater2 marginal proportions
    Pe = float(np.dot(r_marg, c_marg))  # sum_k p1_k * p2_k
    
    denom = 1 - Pe
    if denom == 0:
        kappa = np.nan  # undefined when both raters assign one and the same label to every item
    else:
        kappa = (Po - Pe) / denom
    
    return kappa, Po, Pe

def cohen_kappa(y1, y2, label_order=None, return_conf_mat=False):
    C, labels = confusion_matrix_from_labels(y1, y2, label_order=label_order)
    kappa, Po, Pe = cohen_kappa_from_confmat(C)
    if return_conf_mat:
        return kappa, Po, Pe, C
    return kappa, Po, Pe
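
To build intuition for the range of values, here is a minimal, self-contained sketch that evaluates kappa on a few hand-made tables. `kappa_from_counts` is a hypothetical helper written for this illustration; it follows the same formula as the functions above.

```python
import numpy as np

def kappa_from_counts(counts):
    """Cohen's kappa from a square table of counts (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_o = np.trace(counts) / n
    p_e = float(np.dot(counts.sum(axis=1) / n, counts.sum(axis=0) / n))
    # Kappa is undefined when chance agreement is already 1.
    return np.nan if p_e == 1 else (p_o - p_e) / (1 - p_e)

perfect = kappa_from_counts([[5, 0], [0, 5]])       # raters always agree
chance = kappa_from_counts([[25, 25], [25, 25]])    # agreement equals the chance level
opposite = kappa_from_counts([[0, 5], [5, 0]])      # raters always disagree
degenerate = kappa_from_counts([[10, 0], [0, 0]])   # both raters use a single label

print(perfect, chance, opposite, degenerate)
```

The first three tables give kappa values of 1, 0, and -1 respectively, and the last one gives `nan` because the denominator $1 - P_e$ is zero.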

### Bootstrap confidence intervals

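To quantify uncertainty in our kappa estimate, we can use the percentile bootstrap: resample the items (pairs of labels) with replacement, recompute kappa on each resample, and take the $\alpha/2$ and $1 - \alpha/2$ quantiles of the resampled values as the confidence interval.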

In [5]:
def bootstrap_ci_kappa(y1, y2, B=5000, alpha=0.05, label_order=None, random_state=0):
    """Percentile bootstrap CI for Cohen's kappa, resampling items with replacement."""
    rng = np.random.default_rng(random_state)
    y1 = np.asarray(y1)
    y2 = np.asarray(y2)
    n = len(y1)
    if n != len(y2):
        raise ValueError("y1 and y2 must have same length.")
    
    kappas = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        kb, _, _ = cohen_kappa(y1[idx], y2[idx], label_order=label_order, return_conf_mat=False)
        kappas[b] = kb
    
    lo = np.quantile(kappas, alpha/2)
    hi = np.quantile(kappas, 1 - alpha/2)
    return float(lo), float(hi)
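
The same resampling recipe works for any agreement statistic. As a hedged illustration, the sketch below applies it to raw observed agreement on made-up data (the 200 items and the 80% agreement rate are assumptions chosen for the demo), drawing all resamples in one vectorized step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 items, with the raters agreeing on about 80% of them.
agree = rng.random(200) < 0.8          # True where both raters gave the same label
p_o = agree.mean()                     # point estimate of observed agreement

# Resample item indices with replacement; each row is one bootstrap replicate.
B = 2000
idx = rng.integers(0, agree.size, size=(B, agree.size))
boot = agree[idx].mean(axis=1)         # agreement rate in each replicate

# Percentile interval: the alpha/2 and 1 - alpha/2 quantiles of the replicates.
alpha = 0.05
lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
print(f"P_o = {p_o:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Items are resampled as whole units, so each replicate keeps both raters' labels for an item together. Resampling each rater's labels separately would destroy the pairing that agreement measures.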

## Example: Sentiment annotation task

Suppose two annotators labeled 674 tweets as either positive (1) or not-positive (0). Let's compute their agreement.

For demonstration purposes, we'll simulate plausible annotation data with moderate agreement.

In [6]:
# Simulate a plausible 674-item annotation scenario with moderate agreement
n = 674
rng = np.random.default_rng(2025)

# Balanced underlying sentiment for demo
true_label = rng.binomial(1, 0.5, size=n)

# Rater error rates (demo): the thresholds below give roughly 8% error for
# rater1 and 10% for rater2, because latent + noise ~ N(0, 2).
# A shared latent factor makes the two raters' errors mildly dependent.
latent = rng.normal(size=n)
err1 = (latent + rng.normal(scale=1.0, size=n) > 2.0)  # P(err1) ≈ 0.079
err2 = (latent + rng.normal(scale=1.0, size=n) > 1.8)  # P(err2) ≈ 0.102

rater1 = np.where(err1, 1-true_label, true_label)
rater2 = np.where(err2, 1-true_label, true_label)

print(f"Simulated {n} annotations")
print(f"Rater 1 positive rate: {rater1.mean():.3f}")
print(f"Rater 2 positive rate: {rater2.mean():.3f}")

Simulated 674 annotations
Rater 1 positive rate: 0.497
Rater 2 positive rate: 0.519
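
The error rates implied by the two thresholds can be checked in closed form and compared with the rates quoted in the code comment. The sketch below assumes only what the simulation states: `latent` and each rater's noise are independent standard normals, so their sum is normal with variance 2. `upper_tail_prob` is a hypothetical helper.

```python
import math

def upper_tail_prob(threshold, variance):
    """P(X > threshold) for X ~ N(0, variance) (hypothetical helper)."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0 * variance))

# latent and the per-rater noise are independent N(0, 1), so their sum has variance 2.
p_err1 = upper_tail_prob(2.0, variance=2.0)
p_err2 = upper_tail_prob(1.8, variance=2.0)
print(f"rater1 error rate ~ {p_err1:.3f}, rater2 error rate ~ {p_err2:.3f}")
```

These are marginal rates of roughly 8% and 10%. Because both error indicators share `latent`, the two raters tend to err on the same items, which lowers their disagreement rate relative to independent errors.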


### Computing agreement metrics

In [7]:
# Compute kappa and confusion matrix
kappa, Po, Pe, C = cohen_kappa(rater1, rater2, label_order=[0,1], return_conf_mat=True)

# Compute bootstrap CI
ci_lo, ci_hi = bootstrap_ci_kappa(rater1, rater2, B=3000, alpha=0.05, label_order=[0,1], random_state=7)

print("\nConfusion Matrix:")
print(C)
print(f"\nAgreement Metrics:")
print(f"  Observed agreement (P₀): {Po:.4f}")
print(f"  Chance agreement (Pₑ):   {Pe:.4f}")
print(f"  Cohen's kappa (κ):       {kappa:.4f}")
print(f"  95% CI:                  [{ci_lo:.4f}, {ci_hi:.4f}]")


Confusion Matrix:
Rater 2    0    1
Rater 1          
0        293   46
1         31  304

Agreement Metrics:
  Observed agreement (P₀): 0.8858
  Chance agreement (Pₑ):   0.4999
  Cohen's kappa (κ):       0.7716
  95% CI:                  [0.7239, 0.8161]
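
The reported numbers can be reproduced by hand from the printed confusion matrix alone. This is a minimal check using only the four counts shown above; the variable names are chosen so they do not overwrite the notebook's `C`, `n`, or `kappa`.

```python
import numpy as np

# Counts copied from the printed confusion matrix (rows: Rater 1, columns: Rater 2).
counts = np.array([[293, 46],
                   [31, 304]])
total = counts.sum()

p_o_check = np.trace(counts) / total      # (293 + 304) / 674
row_marg = counts.sum(axis=1) / total     # Rater 1 label frequencies
col_marg = counts.sum(axis=0) / total     # Rater 2 label frequencies
p_e_check = float(row_marg @ col_marg)    # chance agreement
kappa_check = (p_o_check - p_e_check) / (1 - p_e_check)

print(f"n={total}, P_o={p_o_check:.4f}, P_e={p_e_check:.4f}, kappa={kappa_check:.4f}")
```

The raters agree on 597 of the 674 items, and with near-balanced marginals chance agreement is close to 0.5, so kappa is roughly twice the excess of $P_o$ over 0.5.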
