# Computing Krippendorff's Alpha for Inter-Annotator Agreement

This notebook demonstrates how to use the compute_alpha function to evaluate inter-annotator agreement using Krippendorff’s alpha.
Overview

Krippendorff’s alpha is a statistical measure used to assess the reliability of annotations, particularly when multiple annotators assign labels to the same data points. It is widely used in content analysis, NLP tasks, and other domains requiring human annotation.
Function: compute_alpha

This function calculates Krippendorff’s alpha given a dataset of annotations. It supports various data types, including nominal and ordinal annotations.
Parameters:

    df (pd.DataFrame): The dataset containing annotation data.
    column_mapping (Optional[ColumnMapping]): Defines which columns correspond to annotators and annotations.
    annotation_schema (Optional[AnnotationSchema]): Specifies the type of annotations (e.g., nominal, ordinal).
    weight_dict (Optional[Dict[str, float]]): Assigns weights to different annotators if necessary.
    ordinal_scale (Optional[Dict[str, float]]): Defines a scale for ordinal annotations if applicable.

Returns:

A dictionary containing:

    Krippendorff’s alpha score
    Observed and expected disagreement
    Category-wise agreement scores (for nominal and ordinal data types)

In [35]:
from krippendorff_alpha.compute_alpha import compute_alpha

In [36]:
print(help(compute_alpha))

Help on function compute_alpha in module krippendorff_alpha.compute_alpha:

compute_alpha(
    df: pandas.core.frame.DataFrame,
    data_type: str,
    column_mapping: Optional[krippendorff_alpha.schema.ColumnMapping] = None,
    annotation_level: str = <AnnotationLevelEnum.TEXT_LEVEL: 'text_level'>,
    weight_dict: Optional[Dict[str, float]] = None,
    ordinal_scale: Optional[Dict[str, float]] = None
) -> Any
    Computes Krippendorff's alpha for inter-annotator agreement.

    Parameters:
    - df (pd.DataFrame): The dataframe containing annotation data.
    - data_type (str): The type of annotation data (e.g., "nominal" or "ordinal").
    - column_mapping (Optional[ColumnMapping]): Specifies which columns correspond to annotators and text.
      If None, it will be inferred automatically.
    - annotation_level (str, default="text_level"): The level of annotation (e.g., "text_level", "token_level").
    - weight_dict (Optional[Dict[str, float]]): A dictionary specifying weights fo

### Sample Datasets for Analysis

In [37]:
import pandas as pd

# Load the CSV file
csv_path = "../examples/ordinal_orderedCategories_unequalGaps_sample.csv"
df_high_disagreement = pd.read_csv(csv_path)

# Display the first few rows
df_high_disagreement.head()

Unnamed: 0,Text,Annotator 1,Annotator 2,Annotator 3,Annotator 4,Annotator 5
0,This movie is ok.,Neutral,Positive,Neutral,Neutral,Positive
1,Absolutely loved it!,Very Positive,Positive,Very Positive,Very Positive,Positive
2,Horrible experience.,Very Negative,Negative,Very Negative,Negative,Negative
3,"Not bad, could be better.",Neutral,Neutral,Neutral,Neutral,Neutral
4,Amazing storytelling!,Very Positive,Very Positive,Positive,Very Positive,Very Positive


In [38]:
# Compute Krippendorff's Alpha for ordinal data
results_ordinal = compute_alpha(df_high_disagreement , data_type='ordinal')

# Display results
print('An example of high disagreement:')
print("Krippendorff's Alpha Results (Ordinal):")
print(results_ordinal)

An example of high disagreement:
Krippendorff's Alpha Results (Ordinal):
{
    "alpha": 0.513,
    "observed_disagreement": 0.379,
    "expected_disagreement": 0.779,
    "per_category_scores": {
        "Very Negative": {
            "observed_disagreement": 0.731,
            "expected_disagreement": 0.102
        },
        "Negative": {
            "observed_disagreement": 0.366,
            "expected_disagreement": 0.178
        },
        "Neutral": {
            "observed_disagreement": 0.19,
            "expected_disagreement": 0.172
        },
        "Positive": {
            "observed_disagreement": 0.652,
            "expected_disagreement": 0.118
        },
        "Very Positive": {
            "observed_disagreement": 0.259,
            "expected_disagreement": 0.208
        }
    }
}


### Let's try a dataset with high agreemnet!

In [39]:
csv_path = "../examples/ordinal_orderedCategories_highAgreement_sample.csv"
df_high_agreement = pd.read_csv(csv_path)

# Display the first few rows
df_high_agreement

Unnamed: 0,text,annotator1,annotator2,annotator3,annotator4,annotator5
0,I love this product!,Excellent,Excellent,Excellent,Excellent,Excellent
1,This is the worst experience ever.,Poor,Poor,Poor,Poor,Poor
2,"It's okay, not great but not terrible.",Good,Good,Good,Good,Fair
3,I am extremely happy with my purchase.,Excellent,Excellent,Excellent,Excellent,Excellent
4,The service was terrible.,Poor,Poor,Poor,Poor,Poor
5,I'm somewhat satisfied.,Good,Good,Good,Good,Good
6,"Nothing special, just an average experience.",Fair,Fair,Fair,Fair,Fair
7,Absolutely horrible! Never again.,Poor,Poor,Poor,Poor,Poor
8,"It's pretty decent, I might buy it again.",Very Good,Very Good,Very Good,Very Good,Good
9,I hate it so much!,Poor,Poor,Poor,Poor,Poor


In [40]:
# Compute Krippendorff's Alpha for ordinal data
results_ordinal = compute_alpha(df_high_agreement , data_type='ordinal')

# Display results
print('An example of high agreement:')
print("Krippendorff's Alpha Results (Ordinal):")
print(results_ordinal)

An example of high agreement:
Krippendorff's Alpha Results (Ordinal):
{
    "alpha": 0.854,
    "observed_disagreement": 0.109,
    "expected_disagreement": 0.75,
    "per_category_scores": {
        "Poor": {
            "observed_disagreement": 0.0,
            "expected_disagreement": 0.231
        },
        "Fair": {
            "observed_disagreement": 0.0,
            "expected_disagreement": 0.097
        },
        "Good": {
            "observed_disagreement": 0.2,
            "expected_disagreement": 0.149
        },
        "Very Good": {
            "observed_disagreement": 0.4,
            "expected_disagreement": 0.083
        },
        "Excellent": {
            "observed_disagreement": 0.133,
            "expected_disagreement": 0.19
        }
    }
}


## Krippendorff's Alpha Formula

$$
\alpha = 1 - \frac{D_o}{D_e}
$$

Where:

- \( D_o \) = Observed disagreement (how much annotators actually disagree)  
- \( D_e \) = Expected disagreement (how much disagreement would be expected by chance)


### Notes:

1. Krippendorff’s Alpha is not just percent agreement but also considers chance agreement.
2. With such a small dataset, a single disagreement has a large effect.
3. For nominal and ordinal distance, any mismatch counts as a full disagreement (1).
4. High **expected disagreement** suggests that, given the distribution of labels, there’s a significant chance that annotators could have disagreed randomly.

### Let's try more datatsets.

In [41]:
tsv_path = "../examples/ratio_numeric_equalGaps_withAbsoluteZero.tsv"
df_ratio = pd.read_csv(tsv_path, sep="\t")

# Display the first few rows
df_ratio.head()

Unnamed: 0,Text,Annotator 1,Annotator 2,Annotator 3,Annotator 4,Annotator 5,Annotator 6,Annotator 7
0,Nice day.,0,0,0,0,0,0,0
1,You idiot!,2,3,2,3,2,3,3
2,What the hell!,1,2,1,2,1,2,2
3,Perfect day.,0,0,0,0,0,0,0
4,"Shut up, you jerk!",4,5,4,5,4,5,5


In [42]:
# Compute Krippendorff's Alpha for ratio data
results_ordinal = compute_alpha(df_high_agreement , data_type='ratio')

# Display results
print("Krippendorff's Alpha Results (Ratio):")
print(results_ordinal)

Krippendorff's Alpha Results (Ratio):
{
    "alpha": 0.984,
    "observed_disagreement": 0.025,
    "expected_disagreement": 1.491
}


### Interpreting the high Alpha:

- \( D_o = 0.025 \) (Observed Disagreement)  
  This means that the actual disagreement among annotators is very low.

- \( D_e = 1.491 \) (Expected Disagreement)  
  This is the disagreement we'd expect by chance, assuming annotators chose their ratings more randomly.

- Krippendorff’s Alpha:  
  \[
  \alpha = 1 - \frac{D_o}{D_e} = 1 - \frac{0.025}{1.491} \approx 0.984
  \]
  Since \( D_o \) is much smaller than \( D_e \), we get a very high Krippendorff’s Alpha.

### Does This Make Sense?

Yes, it makes sense given the dataset:

- Most annotations are either **exactly the same** (e.g., all 0s) or **differ by only 1** (e.g., 4 vs. 5).
- When disagreement exists, it's **small in magnitude**.
- The expected disagreement is **higher** because, in theory, if annotators were picking numbers randomly, we'd expect a wider range of values.


### Let's Try a Nominal Data for Entity Annotation

In [43]:
import json

# Load JSON data from file
with open("../examples/nominal_categorical_noOrder_sample.json", "r") as file:
    json_data = json.load(file)  # Parse JSON

# Flatten the data
rows = []
for entry in json_data:
    for annotation in entry["annotations"]:
        row = {"text": entry["text"], "word": annotation["word"]}
        row.update({key: value for key, value in annotation.items() if key != "word"})
        rows.append(row)

# Create DataFrame
df_nominal = pd.DataFrame(rows)

# Display DataFrame
df_nominal.head(10)

Unnamed: 0,text,word,annotator_1,annotator_2,annotator_3
0,Apple is a tech company.,Apple,B-ORG,B-ORG,B-ORG
1,Apple is a tech company.,is,O,O,O
2,Apple is a tech company.,a,O,O,O
3,Apple is a tech company.,tech,O,O,B-ORG
4,Apple is a tech company.,company,O,O,O
5,New York is known for its landmarks.,New,B-LOC,B-LOC,B-LOC
6,New York is known for its landmarks.,York,I-LOC,I-LOC,I-LOC
7,New York is known for its landmarks.,is,O,O,O
8,New York is known for its landmarks.,known,O,O,O
9,New York is known for its landmarks.,for,O,O,O


In [44]:
# Compute Krippendorff's Alpha for nominal data
results_nominal_word = compute_alpha(df_nominal,
                                     data_type='nominal', 
                                     annotation_level='token_level') # we can specify that we have a word column for analysis

# Display results
print("Krippendorff's Alpha Results (nominal):")
print(results_nominal_word)

Krippendorff's Alpha Results (nominal):
{
    "alpha": 0.888,
    "observed_disagreement": 0.054,
    "expected_disagreement": 0.484,
    "per_category_scores": {
        "B-LOC": {
            "observed_disagreement": 0.182,
            "expected_disagreement": 0.089
        },
        "B-ORG": {
            "observed_disagreement": 0.0,
            "expected_disagreement": 0.082
        },
        "B-PERSON": {
            "observed_disagreement": 0.0,
            "expected_disagreement": 0.051
        },
        "I-LOC": {
            "observed_disagreement": 0.0,
            "expected_disagreement": 0.026
        },
        "I-PERSON": {
            "observed_disagreement": 0.0,
            "expected_disagreement": 0.026
        },
        "O": {
            "observed_disagreement": 0.051,
            "expected_disagreement": 0.209
        }
    }
}


In [48]:
data = []
with open("../examples/interval_numeric_equalGaps_noAbsoluteZero.jsonl", "r") as f:
    for line in f:
        data.append(json.loads(line.strip()))

# Convert to structured format
rows = []
for entry in data:
    row = {"text": entry["text"]}
    for annotation in entry["annotations"]:
        annotator = annotation["annotator"]
        row[annotator] = annotation["label"]
    rows.append(row)

# Convert to DataFrame
df_interval = pd.DataFrame(rows)

# Rename columns for clarity
df_interval.columns = ["text", "Annotator 1", "Annotator 2", "Annotator 3", "Annotator 4"]

# Print DataFrame
df_interval.head(5)

Unnamed: 0,text,Annotator 1,Annotator 2,Annotator 3,Annotator 4
0,The cat sat.,60.0,55.0,58.0,59.0
1,Complex passage,40.0,50.0,42.0,48.0
2,Scientific paper,20.0,22.0,21.0,19.0
3,An interesting article,65.0,70.0,68.0,66.0
4,Difficult to understand.,30.0,35.0,33.0,32.0


In [49]:
# Compute Krippendorff's Alpha for ratio data
results_interval = compute_alpha(df_interval , data_type='Interval')

# Display results
print("Krippendorff's Alpha Results (Interval):")
print(results_interval)

Krippendorff's Alpha Results (Interval):
{
    "alpha": 0.973,
    "observed_disagreement": 5.467,
    "expected_disagreement": 205.22
}
