# NB11: Annotation & Inter-Rater Reliability

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB11_annotation_irr.ipynb)

**Time:** ~30 min

## Learning Goals

- Understand annotation workflows and why label quality matters
- Measure inter-rater reliability using Cohen's kappa and Gwet's AC1
- Compare human vs LLM labels systematically
- Apply deductive coding as a structured labeling methodology

In [None]:
!pip install openai pandas scikit-learn numpy datasets -q

import pandas as pd
import numpy as np
import json
import time
from openai import OpenAI
from datasets import load_dataset
from sklearn.metrics import cohen_kappa_score, confusion_matrix, accuracy_score, classification_report

## 1. Why Annotation Matters

Every supervised model is only as good as its labels. If your training data has noisy, inconsistent, or biased labels, the model will learn those problems.

**Inter-rater reliability (IRR)** measures how consistently different annotators label the same data. It answers the question: "If two people independently label the same text, how often do they agree?"

Low IRR is a signal that:
- The task definition is ambiguous
- The categories overlap or are poorly defined
- The annotators need better guidelines or training

Stance detection is a **perfect case study** for annotation disagreement. Unlike topic classification (where the topic is usually clear), stance involves interpreting the author's *position* — which can be implicit, sarcastic, or genuinely ambiguous. This makes IRR measurement meaningful rather than a mere exercise.

## 2. Simulating a Labeling Task

In a real annotation project, you would have multiple human annotators labeling the same texts independently. Here, we simulate that setup:

- **Annotator 1 (Human):** The ground-truth labels from the tweet_eval dataset
- **Annotator 2 (LLM):** Labels produced by an LLM classifier

We use the **stance_climate** subset of tweet_eval, where each tweet is labeled as **favor** (supports climate action), **against** (opposes it), or **none** (neutral/unrelated). This setup lets us measure how well the LLM agrees with human labels on a genuinely difficult classification task.

In [None]:
# Load tweet_eval stance dataset (climate change topic)
ds = load_dataset("cardiffnlp/tweet_eval", "stance_climate")

STANCE_LABELS = {0: "none", 1: "against", 2: "favor"}

# Convert test split to DataFrame
test_full = pd.DataFrame(ds["test"])
test_full["label_name"] = test_full["label"].map(STANCE_LABELS)

# Select 50 examples for annotation
annotation_set = test_full.sample(50, random_state=42)[['text', 'label_name']].reset_index(drop=True)
annotation_set.columns = ['text', 'human_label']

print(f"Annotation set: {len(annotation_set)} examples")
print(f"\nStance distribution:")
print(annotation_set['human_label'].value_counts())
print(f"\nExample tweet:")
print(annotation_set.iloc[0]['text'][:200])

In [None]:
GROQ_API_KEY = ""  # @param {type:"string"}
client = OpenAI(api_key=GROQ_API_KEY, base_url="https://api.groq.com/openai/v1")
CATEGORIES = ["none", "against", "favor"]

llm_labels = []
for _, row in annotation_set.iterrows():
    try:
        resp = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[
                {"role": "system", "content": f"Determine the stance of this tweet toward climate change action. Classify into one of: {CATEGORIES}. 'favor' means supporting action on climate change, 'against' means opposing it, 'none' means neutral or unrelated. Return JSON with 'label' field."},
                {"role": "user", "content": row['text'][:500]}
            ],
            response_format={"type": "json_object"},
            temperature=0.0,
            max_tokens=50
        )
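        # Normalize the model output; anything outside CATEGORIES counts as invalid
        label = str(json.loads(resp.choices[0].message.content).get('label', '')).strip().lower()
        llm_labels.append(label if label in CATEGORIES else 'Unknown')
    except Exception:  # API/network errors or malformed JSON
        llm_labels.append('Unknown')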
    time.sleep(0.1)

annotation_set['llm_label'] = llm_labels
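valid = annotation_set[annotation_set.llm_label != 'Unknown']
# Report how many rows were dropped so the agreement figures can be read in context
print(f"Valid LLM labels: {len(valid)}/{len(annotation_set)}")
print(f"Raw agreement: {accuracy_score(valid.human_label, valid.llm_label):.1%}")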

## 3. Cohen's Kappa

Raw agreement ("how often do they pick the same label?") is misleading because some agreement happens **by chance**. With 2 categories, two annotators who each guess uniformly at random still agree about 50% of the time.

**Cohen's kappa** corrects for this:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

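where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance, computed from each annotator's own label frequencies. A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.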
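
To make the formula concrete, here is a minimal sketch that computes kappa directly from the definition. The 10 labels are made up for illustration and the function name `cohen_kappa` is not part of the notebook. It assumes the usual two-annotator form, where $p_e$ sums, over categories, the product of the two annotators' label proportions.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators, computed from the definition."""
    n = len(labels_a)
    # p_o: share of items on which both annotators chose the same label
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's own label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy data (illustrative only): 10 tweets labeled by two annotators
human = ["favor"] * 5 + ["against"] * 3 + ["none"] * 2
llm = ["favor"] * 4 + ["none"] + ["against"] * 2 + ["favor"] + ["none"] * 2

print(f"kappa = {cohen_kappa(human, llm):.3f}")
```

On this toy data $p_o = 0.80$ and $p_e = 0.37$, so $\kappa = 0.43 / 0.63 \approx 0.683$. `cohen_kappa_score` from scikit-learn should give the same value for the same inputs.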

**Interpretation rules of thumb:**

| Kappa | Interpretation |
|-------|---------------|
| < 0.00 | Poor agreement (worse than chance) |
| 0.00 -- 0.20 | Slight agreement |
| 0.20 -- 0.40 | Fair agreement |
| 0.40 -- 0.60 | Moderate agreement |
| 0.60 -- 0.80 | Substantial agreement |
| > 0.80 | Almost perfect agreement |

In [None]:
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(valid['human_label'], valid['llm_label'])
print(f"Cohen's Kappa: {kappa:.3f}")

if kappa > 0.8: interpretation = "Almost perfect agreement"
elif kappa > 0.6: interpretation = "Substantial agreement"
elif kappa > 0.4: interpretation = "Moderate agreement"
elif kappa > 0.2: interpretation = "Fair agreement"
elif kappa >= 0: interpretation = "Slight agreement"
else: interpretation = "Poor agreement (worse than chance)"
print(f"Interpretation: {interpretation}")

## 4. Gwet's AC1

Cohen's kappa has a well-known problem called the **kappa paradox**: when the category distribution is highly imbalanced (e.g., 90% of items belong to one class), kappa can be misleadingly low even when annotators agree most of the time. The reason is that $p_e$ is computed from each annotator's marginal label frequencies: when both annotators assign most items to the same class, $p_e$ is already close to 1, so even a high $p_o$ leaves little room above chance.
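
The paradox is easy to reproduce on synthetic labels. The sketch below uses two made-up pairs of annotators who both agree on 90 of 100 items; only the class balance differs. The helper `agreement_and_kappa` and all data are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two annotators."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return p_o, (p_o - p_e) / (1 - p_e)

# Skewed case: both annotators say "none" for 95 of 100 items
skew_a = ["none"] * 90 + ["favor"] * 5 + ["none"] * 5
skew_b = ["none"] * 90 + ["none"] * 5 + ["favor"] * 5

# Balanced case: both annotators split their labels 50/50
bal_a = ["none"] * 45 + ["favor"] * 45 + ["favor"] * 5 + ["none"] * 5
bal_b = ["none"] * 45 + ["favor"] * 45 + ["none"] * 5 + ["favor"] * 5

skew_po, skew_kappa = agreement_and_kappa(skew_a, skew_b)
bal_po, bal_kappa = agreement_and_kappa(bal_a, bal_b)

print(f"Skewed:   agreement={skew_po:.0%}, kappa={skew_kappa:.3f}")
print(f"Balanced: agreement={bal_po:.0%}, kappa={bal_kappa:.3f}")
```

Both pairs agree on 90% of items, yet kappa is about 0.8 in the balanced case and slightly negative (about -0.05) in the skewed case, because $p_e$ rises from 0.5 to 0.905.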

**Gwet's AC1** is an alternative agreement coefficient that is more robust to this issue. It uses a different model for chance agreement that is less sensitive to marginal distributions.
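
For reference, the chance-agreement term used by AC1 in the two-annotator case can be written as follows. The symbols $q$ (number of categories) and $\pi_k$ (the share of items assigned to category $k$, averaged over both annotators) are notation introduced here for illustration.

$$\text{AC1} = \frac{p_o - p_e^{\text{AC1}}}{1 - p_e^{\text{AC1}}}, \qquad p_e^{\text{AC1}} = \frac{1}{q - 1} \sum_{k=1}^{q} \pi_k (1 - \pi_k)$$

As a worked example, take two categories where both annotators assign 95% of items to the same class and $p_o = 0.90$. Cohen's chance term is $0.95^2 + 0.05^2 = 0.905$, giving $\kappa \approx -0.05$. The AC1 chance term is $0.95 \cdot 0.05 + 0.05 \cdot 0.95 = 0.095$, giving $\text{AC1} = 0.805 / 0.905 \approx 0.89$.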

Since scikit-learn does not provide AC1, we implement it manually below.

In [None]:
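def gwet_ac1(labels1, labels2):
    """Calculate Gwet's AC1 agreement coefficient for two annotators.

    Categories are taken from the labels actually observed, so q is the
    number of distinct labels used by either annotator.
    """
    if len(labels1) != len(labels2) or len(labels1) == 0:
        raise ValueError("labels1 and labels2 must be non-empty and of equal length")
    n = len(labels1)
    categories = sorted(set(labels1) | set(labels2))
    q = len(categories)

    # Observed agreement
    po = sum(a == b for a, b in zip(labels1, labels2)) / n

    # pi_k: mean share of items assigned to category k across both annotators
    marginals = [
        (sum(l == cat for l in labels1) + sum(l == cat for l in labels2)) / (2 * n)
        for cat in categories
    ]

    # Expected chance agreement under AC1 (0 if only one category was used)
    pe = sum(pk * (1 - pk) for pk in marginals) / (q - 1) if q > 1 else 0

    if pe == 1: return 1.0
    return (po - pe) / (1 - pe)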

ac1 = gwet_ac1(valid['human_label'].tolist(), valid['llm_label'].tolist())
print(f"Gwet's AC1: {ac1:.3f}")
print(f"Cohen's Kappa: {kappa:.3f}")
print(f"\nAC1 is often higher than Kappa when categories are imbalanced.")

## 5. Disagreement Analysis

Where do the human and LLM annotators disagree? For stance detection, disagreements are particularly informative because they often reveal:

- **Sarcasm and irony** — tweets where the literal text says one thing but means the opposite
- **Implicit stance** — tweets that discuss climate without explicitly stating a position
- **Ambiguous tweets** — genuinely borderline cases where reasonable annotators would disagree

The most concerning pattern is **stance inversion**: cases where the LLM labels "favor" as "against" or vice versa. These are worse than "none" confusions because they reverse the meaning entirely.
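
A full contingency table (agreements included) gives a quick overview before drilling into individual disagreements. The sketch below builds one with the standard library on a handful of made-up labels. The names `contingency_table` and `LABELS` and the toy data are assumptions for illustration.

```python
from collections import Counter

LABELS = ["favor", "against", "none"]

def contingency_table(human, llm, labels=LABELS):
    """Count (human, llm) label pairs; rows are human labels, columns are LLM labels."""
    pairs = Counter(zip(human, llm))
    return [[pairs[(h, l)] for l in labels] for h in labels]

# Toy labels (illustrative only)
human = ["favor", "favor", "against", "none", "favor", "against"]
llm = ["favor", "none", "favor", "none", "favor", "against"]

table = contingency_table(human, llm)

# Off-diagonal favor/against cells are the stance inversions
inversions = table[0][1] + table[1][0]

print("human \\ llm " + " ".join(f"{l:>8}" for l in LABELS))
for h, row in zip(LABELS, table):
    print(f"{h:<12}" + " ".join(f"{count:>8}" for count in row))
print(f"Stance inversions: {inversions}")
```

The diagonal holds the agreements; everything off the diagonal is a disagreement, and the two favor/against cells isolate the inversions.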

In [None]:
disagreements = valid[valid.human_label != valid.llm_label]
print(f"Disagreements: {len(disagreements)}/{len(valid)} ({len(disagreements)/len(valid):.0%})")

# Check for stance inversions
flipped = disagreements[
    ((disagreements.human_label == 'favor') & (disagreements.llm_label == 'against')) |
    ((disagreements.human_label == 'against') & (disagreements.llm_label == 'favor'))
]
print(f"Stance inversions (favor\u2194against): {len(flipped)}/{len(disagreements)} disagreements")

print(f"\nConfusion pairs:")
confusion_pairs = disagreements.groupby(['human_label', 'llm_label']).size().sort_values(ascending=False)
for (h, l), count in confusion_pairs.head(5).items():
    print(f"  Human: {h:<10} -> LLM: {l} ({count}x)")

print(f"\nExample disagreements:")
for _, row in disagreements.head(3).iterrows():
    print(f"\n  Tweet: {row['text'][:120]}...")
    print(f"  Human: {row['human_label']} | LLM: {row['llm_label']}")

## 6. Deductive Coding Workflow

**Deductive coding** is a structured annotation methodology from qualitative research. Instead of letting categories emerge from the data (inductive coding), you start with a predefined **codebook** and apply it systematically.

For stance detection, a well-defined codebook is critical because the categories are *interpretive*, not *descriptive*. Unlike topic classification (where "this post is about finance" is fairly objective), stance classification requires judging the author's position — which depends on clear definitions of what counts as "favor," "against," and "none."

The workflow:

1. **Define codebook** -- Write clear category definitions with inclusion/exclusion criteria and examples
2. **LLM applies codes** -- Use the codebook as a prompt to classify all texts
3. **Human validates** -- A human reviews a sample (e.g., 20%) of the LLM's labels
4. **Measure agreement** -- Compute IRR between human and LLM labels
5. **Refine codebook** -- If agreement is low, revise definitions to reduce ambiguity, then repeat

Repeating this loop tends to produce clearer, more reproducible labels. Stop once agreement reaches a level that is acceptable for your use case.
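
Steps 3 to 5 of the workflow can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names, the fixed seed, and the 0.6 kappa threshold (the lower edge of "substantial" agreement) are choices made here, not prescribed by the methodology.

```python
import random
from collections import Counter

def _kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators (compact helper for this sketch)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def validation_sample(n_items, fraction=0.2, seed=42):
    """Step 3: pick the row indices a human will review."""
    rng = random.Random(seed)
    k = max(1, round(n_items * fraction))
    return sorted(rng.sample(range(n_items), k))

def needs_refinement(human_labels, llm_labels, threshold=0.6):
    """Steps 4-5: flag the codebook for revision if kappa is below the threshold."""
    return _kappa(human_labels, llm_labels) < threshold

# Example: 50 LLM-coded tweets, a human reviews 20% of them
review_indices = validation_sample(50)
print(f"Review {len(review_indices)} items, e.g. rows {review_indices[:5]}")
```

In practice, the human labels for `review_indices` would be collected first and then passed to `needs_refinement` together with the LLM labels for the same rows.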

In [None]:
CODEBOOK = {
    "favor": "Tweet explicitly or implicitly supports action on climate change. Includes: calling for policy action, expressing concern about climate impacts, supporting renewable energy, criticizing climate inaction.",
    "against": "Tweet explicitly or implicitly opposes action on climate change. Includes: climate change denial/skepticism, opposing climate policies, dismissing environmental concerns, criticizing climate activists.",
    "none": "Tweet mentions climate change but does not take a clear stance, OR is unrelated to climate change. Includes: neutral reporting, asking questions, discussing weather without connecting to climate policy.",
}

print("Stance Detection Codebook:")
print("=" * 60)
for code, desc in CODEBOOK.items():
    print(f"\n  {code.upper()}:")
    print(f"  {desc}")

print("\n\nThis codebook would be used to:")
print("1. LLM codes all tweets using these definitions as the system prompt")
print("2. Human reviews a sample (e.g., 20%)")
print("3. Measure agreement")
print("4. If low agreement -> refine definitions -> repeat")
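
One possible way to use a codebook like the one above as a system prompt is sketched below. The function `build_system_prompt`, the prompt wording, and the shortened `demo_codebook` are illustrative assumptions; the definitions would need tuning against measured agreement.

```python
def build_system_prompt(codebook):
    """Render a {label: definition} codebook as a system prompt string."""
    lines = [
        "Determine the stance of this tweet toward climate change action.",
        "Apply the following codebook:",
    ]
    for code, definition in codebook.items():
        lines.append(f"- {code}: {definition}")
    # List the allowed labels explicitly so the output is easy to validate
    labels = ", ".join(codebook)
    lines.append(f"Return JSON with a 'label' field set to one of: {labels}.")
    return "\n".join(lines)

# Shortened, hypothetical definitions to keep the example compact
demo_codebook = {
    "favor": "Supports action on climate change.",
    "against": "Opposes action on climate change.",
    "none": "Takes no clear stance or is unrelated.",
}

demo_prompt = build_system_prompt(demo_codebook)
print(demo_prompt)
```

Keeping the prompt generated from the codebook dictionary means a revised definition automatically changes what the LLM sees in the next iteration.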

## 7. Note on Argilla

For real annotation projects that go beyond quick experiments, consider using [Argilla](https://docs.argilla.io/). Argilla is an open-source data annotation platform that provides:

- A proper web UI for labeling (much better than spreadsheets)
- Multi-annotator support with built-in IRR computation
- Integration with Hugging Face datasets for easy export
- Support for various task types (classification, NER, ranking, etc.)

You can deploy Argilla for free on [Hugging Face Spaces](https://huggingface.co/spaces) and start annotating in minutes. This is the recommended approach when you need to label more than a few dozen examples or when you have multiple annotators.

## 8. Summary

Key takeaways:

- **Always measure inter-rater reliability** before trusting your labels. Labels without measured reliability are labels of unknown quality.
- **Cohen's kappa** is a standard IRR metric for two annotators. It corrects for chance agreement and is widely understood.
- **Gwet's AC1** handles imbalanced categories better than kappa. Use it when your class distribution is skewed.
- **Stance detection reveals real annotation challenges.** Unlike simple topic classification, stance is inherently subjective — disagreements between human and LLM annotators are expected and informative.
- **Human + LLM labeling** is a powerful combination: the LLM provides speed and scale, the human provides quality control and validation.
- **Deductive coding** brings structure to qualitative analysis: define your codebook first, apply it systematically, measure agreement, and iterate. For stance detection, clear definitions of "favor," "against," and "none" are essential.