# Computing Krippendorff's Alpha for Inter-Annotator Agreement

This notebook demonstrates how to use the `compute_alpha` function to evaluate inter-annotator agreement using Krippendorff’s alpha.

## Overview

Krippendorff’s alpha is a statistical measure used to assess the reliability of annotations, particularly when multiple annotators assign labels to the same data points. It is widely used in content analysis, NLP tasks, and other domains requiring human annotation.
## Function: compute_alpha

This function calculates Krippendorff’s alpha given a dataset of annotations. It supports nominal, ordinal, interval, and ratio annotations.

Parameters:

- `df` (pd.DataFrame): The dataset containing annotation data.
- `data_type` (str): The type of annotation data (e.g., "nominal", "ordinal", "interval", "ratio").
- `column_mapping` (Optional[ColumnMapping | dict]): Defines which columns hold the text and which hold each annotator's labels. If None, it is inferred automatically.
- `annotation_level` (str, default "text_level"): The level of annotation (e.g., "text_level", "token_level").
- `weight_dict` (Optional[dict[str, float]]): Assigns weights to different annotators if necessary.
- `ordinal_scale` (Optional[list[int | float | str]]): The ordered list of categories for ordinal annotations, if applicable.
- `config_path` (Optional[str | Path]): Path to a configuration file (for example, one that defines the ordinal scale).

Returns:

A dictionary containing:

- `alpha`: Krippendorff’s alpha score
- `observed_disagreement` and `expected_disagreement`
- `per_category_scores`: per-category observed and expected disagreement (for nominal and ordinal data types)

In [2]:
from krippendorff_alpha.compute_alpha import compute_alpha

In [3]:
help(compute_alpha)

Help on function compute_alpha in module krippendorff_alpha.compute_alpha:

compute_alpha(
    df: pandas.core.frame.DataFrame,
    data_type: str,
    column_mapping: krippendorff_alpha.schema.ColumnMapping | dict[str, typing.Any] | None = None,
    annotation_level: str = <AnnotationLevelEnum.TEXT_LEVEL: 'text_level'>,
    weight_dict: dict[str, float] | None = None,
    ordinal_scale: list[int | float | str] | None = None,
    config_path: str | pathlib._local.Path | None = None
) -> dict[str, typing.Any]
    Computes Krippendorff's alpha for inter-annotator agreement.

    Parameters:
    - df (pd.DataFrame): The dataframe containing annotation data.
    - data_type (str): The type of annotation data (e.g., "nominal" or "ordinal").
    - column_mapping (Optional[ColumnMapping]): Specifies which columns correspond to annotators and text.
      If None, it will be inferred automatically.
    - annotation_level (str, default="text_level"): The level of annotation (e.g., "text_level", 

### Sample Datasets for Analysis

In [4]:
import pandas as pd

tsv_path = "../datasets/ordinal_orderedCategories_unequalGaps_sample.tsv"
df_high_disagreement = pd.read_csv(tsv_path, sep="\t")

df_high_disagreement.head()

Unnamed: 0,text,annotator1,annotator2,annotator3,annotator4,annotator5,annotator6,annotator7,annotator8,annotator9
0,This movie is ok.,Neutral,Positive,Neutral,Neutral,Positive,Neutral,Neutral,Positive,Neutral
1,Absolutely loved it!,Very Positive,Positive,Very Positive,Very Positive,Positive,Very Positive,Very Positive,Positive,Very Positive
2,Horrible experience.,Very Negative,Negative,Very Negative,Negative,Negative,Very Negative,Very Negative,Negative,Very Negative
3,"Not bad, could be better.",Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral,Neutral
4,Amazing storytelling!,Very Positive,Very Positive,Positive,Very Positive,Very Positive,Very Positive,Very Positive,Very Positive,Positive


In [7]:
# Compute Krippendorff's Alpha for ordinal data
results_ordinal = compute_alpha(df_high_disagreement, data_type='ordinal')

# Display results
print('An example of high disagreement:')
print("Krippendorff's Alpha Results (Ordinal):")
print(results_ordinal)

An example of high disagreement:
Krippendorff's Alpha Results (Ordinal):
{'alpha': 0.79, 'observed_disagreement': 0.164, 'expected_disagreement': 0.78, 'per_category_scores': {'Very Negative': {'observed_disagreement': 0.029, 'expected_disagreement': 0.129}, 'Negative': {'observed_disagreement': 0.027, 'expected_disagreement': 0.159}, 'Neutral': {'observed_disagreement': 0.01, 'expected_disagreement': 0.173}, 'Positive': {'observed_disagreement': 0.04, 'expected_disagreement': 0.108}, 'Very Positive': {'observed_disagreement': 0.011, 'expected_disagreement': 0.212}}}
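
The result is a plain nested dictionary, so the per-category scores can be ranked directly. The sketch below pastes the values printed above into a literal (so it runs without the library) and sorts the categories by observed disagreement. The variable names are illustrative only.

```python
# Values copied from the output printed above, so the snippet is self-contained.
results = {
    "alpha": 0.79,
    "observed_disagreement": 0.164,
    "expected_disagreement": 0.78,
    "per_category_scores": {
        "Very Negative": {"observed_disagreement": 0.029, "expected_disagreement": 0.129},
        "Negative": {"observed_disagreement": 0.027, "expected_disagreement": 0.159},
        "Neutral": {"observed_disagreement": 0.01, "expected_disagreement": 0.173},
        "Positive": {"observed_disagreement": 0.04, "expected_disagreement": 0.108},
        "Very Positive": {"observed_disagreement": 0.011, "expected_disagreement": 0.212},
    },
}

# Rank categories from most to least observed disagreement.
ranked = sorted(
    results["per_category_scores"].items(),
    key=lambda item: item[1]["observed_disagreement"],
    reverse=True,
)

for label, scores in ranked:
    print(f"{label:>13}: observed={scores['observed_disagreement']}, "
          f"expected={scores['expected_disagreement']}")
```

In this sample, "Positive" carries the most observed disagreement and "Neutral" the least.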


### Let's try a dataset with high agreement!

In [8]:
tsv_path = "../datasets/ordinal_orderedCategories_highAgreement_sample.tsv"
df_high_agreement = pd.read_csv(tsv_path, sep="\t")

df_high_agreement

Unnamed: 0,text,annotator1,annotator2,annotator3,annotator4,annotator5,annotator6,annotator7,annotator8,annotator9
0,I love this product!,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent
1,This is the worst experience ever.,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor
2,"It's okay, not great but not terrible.",Good,Good,Good,Good,Fair,Good,Good,Fair,Good
3,I am extremely happy with my purchase.,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent,Excellent
4,The service was terrible.,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor
5,I'm somewhat satisfied.,Good,Good,Good,Good,Good,Good,Good,Good,Good
6,"Nothing special, just an average experience.",Fair,Fair,Fair,Fair,Fair,Fair,Fair,Fair,Fair
7,Absolutely horrible! Never again.,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor
8,"It's pretty decent, I might buy it again.",Very Good,Very Good,Very Good,Very Good,Good,Very Good,Very Good,Good,Very Good
9,I hate it so much!,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor,Poor


In [9]:
# Compute Krippendorff's Alpha for ordinal data
results_ordinal = compute_alpha(df_high_agreement, data_type='ordinal')

# Display results
print('An example of high agreement:')
print("Krippendorff's Alpha Results (Ordinal):")
print(results_ordinal)

An example of high agreement:
Krippendorff's Alpha Results (Ordinal):
{'alpha': 0.939, 'observed_disagreement': 0.045, 'expected_disagreement': 0.747, 'per_category_scores': {'Poor': {'observed_disagreement': 0.0, 'expected_disagreement': 0.231}, 'Fair': {'observed_disagreement': 0.01, 'expected_disagreement': 0.099}, 'Good': {'observed_disagreement': 0.012, 'expected_disagreement': 0.149}, 'Very Good': {'observed_disagreement': 0.021, 'expected_disagreement': 0.074}, 'Excellent': {'observed_disagreement': 0.002, 'expected_disagreement': 0.194}}}


## Krippendorff's Alpha Formula

$$
\alpha = 1 - \frac{D_o}{D_e}
$$

Where:

- $D_o$ = Observed disagreement (how much annotators actually disagree)  
- $D_e$ = Expected disagreement (how much disagreement would be expected by chance)
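
As a quick sanity check, the formula can be applied to the disagreement values printed in the two ordinal results above. This is a minimal sketch; `alpha_from_disagreement` is a hypothetical helper, not part of the library. Because the printed $D_o$ and $D_e$ are already rounded, the recomputed alpha may differ from the reported one in the last digit.

```python
def alpha_from_disagreement(d_o, d_e):
    """Apply alpha = 1 - D_o / D_e (hypothetical helper, not a library function)."""
    if d_e == 0:
        raise ValueError("alpha is undefined when the expected disagreement is zero")
    return 1 - d_o / d_e


# (D_o, D_e, reported alpha) copied from the outputs above.
reported = {
    "ordinal, first sample": (0.164, 0.78, 0.79),
    "ordinal, high agreement": (0.045, 0.747, 0.939),
}

for name, (d_o, d_e, alpha) in reported.items():
    recomputed = alpha_from_disagreement(d_o, d_e)
    print(f"{name}: reported {alpha}, recomputed {recomputed:.3f}")
```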


### Notes:

1. Krippendorff’s alpha is not simply percent agreement; it also corrects for the agreement expected by chance.
2. With such small datasets, a single disagreement has a large effect on the score.
3. For nominal and ordinal distance, any mismatch counts as a full disagreement (1).
4. High **expected disagreement** means that, given the overall distribution of labels, annotators labeling at random would have disagreed often.
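
The difference between raw agreement and alpha can be seen with a small from-scratch computation. The sketch below implements the textbook nominal formulation via a coincidence matrix; it is not the library's code, and the library's numbers may be normalized differently. The toy data and the `nominal_alpha` name are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(units):
    """Textbook nominal alpha. `units` is a list of label lists, one per item.

    Raises ZeroDivisionError if every label is identical (D_e would be 0).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label on an item says nothing about agreement
        # Each ordered pair of labels on the same item contributes 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per category, and the overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Nominal distance: 1 for any mismatch, 0 for a match.
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    d_e = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1 - d_o / d_e


# Hypothetical toy data: two annotators, ten items, label "b" used only once.
toy = [["a", "a"]] * 9 + [["a", "b"]]
raw_agreement = sum(len(set(labels)) == 1 for labels in toy) / len(toy)
alpha = nominal_alpha(toy)
print(f"raw agreement = {raw_agreement:.2f}, alpha = {alpha:.2f}")
```

Nine of the ten items agree, so raw agreement is 0.9, yet alpha is 0: with labels this skewed, the same amount of disagreement is expected by chance.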

### Let's try more datasets.

In [10]:
tsv_path = "../datasets/ratio_numeric_equalGaps_withAbsoluteZero.tsv"
df_ratio = pd.read_csv(tsv_path, sep="\t")

df_ratio.head()

Unnamed: 0,text,annotator1,annotator2,annotator3,annotator4,annotator5,annotator6,annotator7,annotator8,annotator9,annotator10
0,Nice day.,0,0,0,0,0,0,0,0,0,0
1,You idiot!,2,3,2,3,2,3,3,2,3,2
2,What the hell!,1,2,1,2,1,2,2,1,2,1
3,Perfect day.,0,0,0,0,0,0,0,0,0,0
4,"Shut up, you jerk!",4,5,4,5,4,5,5,4,5,4


In [11]:
results_ratio = compute_alpha(df_ratio, data_type='ratio')

print("Krippendorff's Alpha Results (Ratio):")
print(results_ratio)

Krippendorff's Alpha Results (Ratio):
{'alpha': 0.986, 'observed_disagreement': 0.026, 'expected_disagreement': 1.876}


### Interpreting the high Alpha:

- $D_o = 0.026$ (Observed Disagreement)  
  This means that the actual disagreement among annotators is very low.

- $D_e = 1.876$ (Expected Disagreement)  
  This is the disagreement we'd expect by chance, that is, if ratings were drawn at random from the overall pool of values regardless of the item.

- Krippendorff’s Alpha:  
  $$
  \alpha = 1 - \frac{D_o}{D_e} = 1 - \frac{0.026}{1.876} \approx 0.986
  $$
  Since $D_o$ is much smaller than $D_e$, we get a very high Krippendorff’s Alpha.

### Does This Make Sense?

Yes, it makes sense given the dataset:

- Most annotations are either **exactly the same** (e.g., all 0s) or **differ by only 1** (e.g., 4 vs. 5).
- When disagreement exists, it's **small in magnitude**.
- The expected disagreement is **higher** because the ratings pooled across all items span a wide range (0 to 5 in the rows shown), so two ratings paired at random would often differ by a lot.
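
The contrast between small within-item differences and large pooled differences can be sketched numerically. The code below uses the textbook interval metric (squared difference) on only the five rows shown by `head()`. It is an illustration under those assumptions, not the library's ratio implementation, so its numbers are not expected to match the output above.

```python
from itertools import permutations


def interval_alpha(units):
    """Textbook alpha with the interval metric; returns (D_o, D_e, alpha)."""
    usable = [u for u in units if len(u) >= 2]
    pooled = [value for u in usable for value in u]
    n = len(pooled)

    # Observed: squared differences between ratings of the same item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in usable
    ) / n

    # Expected: squared differences between all ratings, ignoring the item.
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    return d_o, d_e, 1 - d_o / d_e


# The five rows displayed above (one list of ten ratings per text).
rows = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [2, 3, 2, 3, 2, 3, 3, 2, 3, 2],
    [1, 2, 1, 2, 1, 2, 2, 1, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [4, 5, 4, 5, 4, 5, 5, 4, 5, 4],
]

d_o, d_e, alpha = interval_alpha(rows)
print(f"D_o = {d_o:.3f}, D_e = {d_e:.3f}, alpha = {alpha:.3f}")
```

Within an item the ratings never differ by more than 1, while randomly paired ratings from the pool can differ by up to 5, which is why $D_e$ dwarfs $D_o$.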


### Let's Try Nominal Data for Entity Annotation

In [13]:
tsv_path = "../datasets/nominal_categorical_noOrder_sample.tsv"
df_nominal = pd.read_csv(tsv_path, sep="\t")

df_nominal.head(10)

Unnamed: 0,text,entity,annotator1,annotator2,annotator3,annotator4,annotator5,annotator6,annotator7
0,Apple is a tech company.,Apple,B-ORG,B-ORG,B-ORG,B-ORG,B-PERSON,B-ORG,B-ORG
1,Apple is a tech company.,is,O,O,O,O,O,O,O
2,Apple is a tech company.,a,O,O,O,O,O,O,O
3,Apple is a tech company.,tech,O,O,B-ORG,O,O,O,O
4,Apple is a tech company.,company,O,O,O,O,O,O,O
5,New York is known for its landmarks.,New,B-LOC,B-LOC,B-LOC,B-LOC,B-LOC,B-LOC,B-LOC
6,New York is known for its landmarks.,York,I-LOC,I-LOC,I-LOC,I-LOC,I-LOC,I-LOC,I-LOC
7,New York is known for its landmarks.,is,O,O,O,O,O,O,O
8,New York is known for its landmarks.,known,O,O,O,O,O,O,O
9,New York is known for its landmarks.,for,O,O,O,O,O,O,O


In [14]:
from krippendorff_alpha.schema import ColumnMapping

column_mapping = ColumnMapping(text_col="text", annotator_cols=["annotator1", "annotator2", "annotator3", "annotator4", "annotator5", "annotator6", "annotator7"])
results_nominal = compute_alpha(df_nominal, data_type='nominal', column_mapping=column_mapping, annotation_level='token_level')

print("Krippendorff's Alpha Results (nominal):")
print(results_nominal)

Krippendorff's Alpha Results (nominal):
{'alpha': 0.956, 'observed_disagreement': 0.022, 'expected_disagreement': 0.495, 'per_category_scores': {'B-LOC': {'observed_disagreement': 0.011, 'expected_disagreement': 0.092}, 'B-ORG': {'observed_disagreement': 0.008, 'expected_disagreement': 0.078}, 'B-PERSON': {'observed_disagreement': 0.006, 'expected_disagreement': 0.057}, 'I-LOC': {'observed_disagreement': 0.0, 'expected_disagreement': 0.028}, 'I-PERSON': {'observed_disagreement': 0.0, 'expected_disagreement': 0.028}, 'O': {'observed_disagreement': 0.002, 'expected_disagreement': 0.212}}}


In [15]:
tsv_path = "../datasets/interval_numeric_equalGaps_noAbsoluteZero.tsv"
df_interval = pd.read_csv(tsv_path, sep="\t")

df_interval.head(5)

Unnamed: 0,text,annotator1,annotator2,annotator3,annotator4,annotator5,annotator6,annotator7,annotator8
0,The cat sat.,60.0,55.0,58.0,59.0,57.0,56.0,60.0,58.0
1,Complex passage,40.0,50.0,42.0,48.0,45.0,43.0,47.0,46.0
2,Scientific paper,20.0,22.0,21.0,19.0,20.0,21.0,22.0,20.0
3,An interesting article,65.0,70.0,68.0,66.0,67.0,69.0,65.0,68.0
4,Difficult to understand.,30.0,35.0,33.0,32.0,34.0,31.0,33.0,32.0


In [16]:
results_interval = compute_alpha(df_interval, data_type='interval')

print("Krippendorff's Alpha Results (Interval):")
print(results_interval)

Krippendorff's Alpha Results (Interval):
{'alpha': 0.992, 'observed_disagreement': 2.814, 'expected_disagreement': 367.705}


### Let's incorporate annotator weights

The following call gives annotator 1's annotations twice the weight of the other annotators' annotations, which changes the alpha score.

In [17]:
results_interval_with_weight = compute_alpha(df_interval, data_type='interval', weight_dict={'annotator1': 2.0})

print("Krippendorff's Alpha Results (Interval):")
print(results_interval_with_weight)

Krippendorff's Alpha Results (Interval):
{'alpha': 0.989, 'observed_disagreement': 3.871, 'expected_disagreement': 367.705}


### Let's incorporate column mapping

In [18]:
from krippendorff_alpha.schema import ColumnMapping

column_mapping = ColumnMapping(text_col='text', annotator_cols=['annotator1', 'annotator2', 'annotator3'])
results_interval_only3_annotators = compute_alpha(df_interval, data_type='interval', column_mapping=column_mapping)

print("Krippendorff's Alpha Results (Interval):")
print(results_interval_only3_annotators)

Krippendorff's Alpha Results (Interval):
{'alpha': 0.985, 'observed_disagreement': 1.6, 'expected_disagreement': 104.953}
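
Typing out every annotator column gets tedious for wider tables. One possible shortcut, assuming the annotator columns share the `annotator` prefix as in these datasets, is to select them by name. The sketch uses a plain list of column names so it runs on its own; with a real DataFrame, `df.columns` would take its place.

```python
# Column names as they appear in the nominal dataset above.
columns = ["Unnamed: 0", "text", "entity"] + [f"annotator{i}" for i in range(1, 8)]

# Keep only the columns whose name starts with the shared prefix (an assumption
# about the naming scheme, not something the library requires).
annotator_cols = [name for name in columns if name.startswith("annotator")]
print(annotator_cols)

# The list could then be passed on as in the cells above, e.g.:
# column_mapping = ColumnMapping(text_col="text", annotator_cols=annotator_cols)
```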


### Let's incorporate scale for ordinals

Use `ordinal_scale` to pass the category order directly when the scale is not defined in the config file.

In [19]:
df_my_scales = pd.DataFrame({
    "text": ["Text A", "Text B", "Text C", "Text D", "Text E"],
    "annotator_1": ["weird", "weirder", "weird", "weird", "the weirdest"],
    "annotator_2": ["weird", "weird", "weird", "weirder", "the weirdest"],
    "annotator_3": ["weird", "weirder", "weirder", "weird", "the weirdest"]
})

df_my_scales

Unnamed: 0,text,annotator_1,annotator_2,annotator_3
0,Text A,weird,weird,weird
1,Text B,weirder,weird,weirder
2,Text C,weird,weird,weirder
3,Text D,weird,weirder,weird
4,Text E,the weirdest,the weirdest,the weirdest


In [20]:
results_ordinal_my_scale = compute_alpha(
    df_my_scales,
    data_type='ordinal',
    ordinal_scale=['weird', 'weirder', 'the weirdest'],
)

# Display results
print("Krippendorff's Alpha Results (Weird Ordinal):")
print(results_ordinal_my_scale)

Krippendorff's Alpha Results (Weird Ordinal):
{'alpha': 0.669, 'observed_disagreement': 0.2, 'expected_disagreement': 0.604, 'per_category_scores': {'the weirdest': {'observed_disagreement': 0.0, 'expected_disagreement': 0.16}, 'weird': {'observed_disagreement': 0.094, 'expected_disagreement': 0.249}, 'weirder': {'observed_disagreement': 0.188, 'expected_disagreement': 0.196}}}
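
Before passing a custom scale, it can help to confirm that every label in the data appears in it. How the library reacts to a missing label is not shown in this notebook, so the check below is a defensive sketch. `labels_missing_from_scale` is a hypothetical helper, and the rows are copied from the table above.

```python
def labels_missing_from_scale(rows, scale):
    """Return labels that occur in the annotations but not in the scale."""
    seen = {label for row in rows for label in row}
    return sorted(seen - set(scale))


scale = ["weird", "weirder", "the weirdest"]

# One row per text, one label per annotator (same values as df_my_scales).
rows = [
    ["weird", "weird", "weird"],
    ["weirder", "weird", "weirder"],
    ["weird", "weird", "weirder"],
    ["weird", "weirder", "weird"],
    ["the weirdest", "the weirdest", "the weirdest"],
]

print(labels_missing_from_scale(rows, scale))

# The list order is what makes the labels ordinal: position 0 is the lowest rank.
rank = {label: position for position, label in enumerate(scale)}
ranked_rows = [[rank[label] for label in row] for row in rows]
print(ranked_rows)
```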
