# Annotation Inter-Rater Reliability

We are creating a span categorization model, which means each annotation carries two kinds of "data":

* Whether there was an annotation at all
* The start and stop of each annotation

## "Any Annotation at All"

Whether an example received any annotation at all is a binary judgment, so standard IRR metrics apply.
In this case we use [Fleiss Kappa](https://en.wikipedia.org/wiki/Fleiss%27_kappa), which handles more than two annotators.
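
As a rough illustration of what the statistic measures: Fleiss Kappa compares the mean observed agreement per example with the agreement expected by chance, as `(p_bar - p_e) / (1 - p_e)`. The sketch below computes it in plain Python from a table of rating counts. The function name `fleiss_kappa_sketch` and the toy table are made up for illustration; the notebook itself relies on `statsmodels` for the real computation.

```python
def fleiss_kappa_sketch(table):
    """Fleiss Kappa from an items x categories table of rating counts.

    Assumes every item was rated by the same number of raters (at least two)
    and that the raters did not all use one single category.
    """
    n_items = len(table)
    n_raters = sum(table[0])

    # Proportion of all ratings that fell into each category
    totals = [sum(column) for column in zip(*table)]
    p_cat = [total / (n_items * n_raters) for total in totals]

    # Observed agreement per item: share of rater pairs that agree
    p_item = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]

    p_bar = sum(p_item) / n_items    # mean observed agreement
    p_e = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Toy data: 4 examples, 3 annotators, columns are [accept, ignore]
toy_table = [[3, 0], [0, 3], [3, 0], [2, 1]]
toy_kappa = fleiss_kappa_sketch(toy_table)
print(round(toy_kappa, 3))  # 0.625
```

Three of the four toy examples are unanimous and one is split two to one, which is why the value is well above zero but below 1.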

## "Same Label on Span"

Extract the annotation label from each span and use Fleiss Kappa on the retrieved data.

## "Number of Spans within Example"

Extract the number of spans from each example and use Fleiss Kappa on the retrieved data.
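
Before Fleiss Kappa can be computed, the per-annotator values have to be turned into a table that counts, for every example, how many annotators chose each category. The sketch below shows that aggregation step in plain Python, assuming the span counts are already laid out as one row per example and one entry per annotator. `aggregate_counts` and the toy data are hypothetical names used only for this illustration.

```python
def aggregate_counts(ratings):
    """Turn an items x raters grid of labels into an items x categories count table."""
    # Every distinct value becomes its own category, in sorted order
    categories = sorted({label for item in ratings for label in item})
    table = [[item.count(category) for category in categories] for item in ratings]
    return table, categories


# Toy data: number of spans each of 3 annotators marked in 3 examples
toy_n_spans = [[0, 0, 0], [1, 1, 2], [1, 1, 1]]
toy_table, toy_categories = aggregate_counts(toy_n_spans)
print(toy_categories)  # [0, 1, 2]
print(toy_table)       # [[3, 0, 0], [0, 2, 1], [0, 3, 0]]
```

Note that each span count is treated as a nominal category here, so a disagreement of 1 versus 2 spans is penalized the same as 0 versus 5.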

## "Agreement on Span Positioning"

For each example in which annotators agreed a span was present, compare the span start and stop.
Comparing start and stop points across annotators is most commonly done when evaluating "text segmentation".
As such, we are using a text segmentation evaluation metric called ["boundary agreement"](https://segeval.readthedocs.io/en/latest/api/?highlight=agreement#inter-coder-agreement-coefficients).

## Basic Data Prep

In [1]:
# Import and combine all data
import pandas as pd
import json

ANNOTATOR_TO_FILE_LUT = {
    "angel": "angel-annotated-irr-data.json",
    "sarah": "sarah-annotated-irr-data.json",
    "kelly": "kelly-annotated-irr-data.json",
}
annotation_dfs = []
for annotator, filename in ANNOTATOR_TO_FILE_LUT.items():
    with open(filename) as open_f:
        this_annotator_df = pd.DataFrame(json.load(open_f))
        this_annotator_df["annotator"] = annotator
        annotation_dfs.append(this_annotator_df)

annotations = pd.concat(annotation_dfs, ignore_index=True)
annotations.shape, annotations.columns

((84, 10),
 Index(['text', 'meta', '_input_hash', '_task_hash', 'tokens', '_view_id',
        'answer', '_timestamp', 'spans', 'annotator'],
       dtype='object'))

In [2]:
# Unpack meta
unpackaged_json = pd.json_normalize(annotations["meta"])
annotations["muni"] = unpackaged_json["muni"]
annotations["session_id"] = unpackaged_json["session_id"]

# Drop columns
annotations = annotations[["text", "answer", "spans", "annotator", "muni", "session_id"]]
annotations.head()

| | text | answer | spans | annotator | muni | session_id |
|---|---|---|---|---|---|---|
| 0 | Good morning. As you said, I'm a downtown resi... | ignore | | angel | seattle | 6c40d8abf3c9 |
| 1 | Down morning. I'm chair of tree pack. It's dis... | ignore | | angel | seattle | 6c40d8abf3c9 |
| 2 | So Doug and Andrew if you are out there, call ... | accept | [{'start': 107, 'end': 270, 'token_start': 25,... | angel | seattle | 6c40d8abf3c9 |
| 3 | Thank you. Hi, I just want to bring attention ... | accept | [{'start': 174, 'end': 207, 'token_start': 38,... | angel | seattle | c6bbc7ceec24 |
| 4 | Yes, I'm here. I'm unmuted, it appears. Yes. O... | ignore | | angel | seattle | c6bbc7ceec24 |


## Fleiss Kappa for "Any Annotation at All"

In [3]:
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def _interpreted_score(v: float) -> str:
    """Map a kappa value to a qualitative agreement label."""
    if v < 0:
        return "No agreement"
    if v < 0.2:
        return "Poor agreement"
    if v < 0.4:
        return "Fair agreement"
    if v < 0.6:
        return "Moderate agreement"
    if v < 0.8:
        return "Substantial agreement"
    return "Almost perfect agreement"

# Create new dataframe where rows are the answers and columns are each annotator
annotator_answers = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    annotator_answers[annotator] = this_annotator_data["answer"].reset_index(drop=True)

annotator_answers = pd.DataFrame(annotator_answers)

# Aggregate annotator answers
agg_raters, _ = aggregate_raters(annotator_answers)

# Compute Fleiss Kappa
any_annotation_at_all_score = fleiss_kappa(agg_raters)

# Interpret and print
any_annotation_at_all_score, _interpreted_score(any_annotation_at_all_score)

(0.8250000000000001, 'Almost perfect agreement')

## Data Prep for Span Evaluation

Note: we construct segmentation strings from each span's start and stop offsets using the [NLTK segmentation string standard](https://segeval.readthedocs.io/en/latest/api/?highlight=agreement#segeval.convert_nltk_to_masses), then convert them to segment masses for `segeval`.
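
To make the two representations concrete, the sketch below builds a segmentation string for a toy example and converts it to masses in plain Python. Both helper names are hypothetical. The counting convention in `string_to_masses` is an assumption chosen to match the shape of the masses shown in this notebook's output (for example `(175, 33, 400)` for a span running from 174 to 207); it is not a copy of `segeval.convert_nltk_to_masses`.

```python
def spans_to_segmentation_string(text_length, spans):
    """Mark every span start and end offset with "1" and everything else with "0"."""
    boundaries = {offset for span in spans for offset in (span["start"], span["end"])}
    return "".join("1" if i in boundaries else "0" for i in range(text_length))


def string_to_masses(segmentation, boundary_symbol="1"):
    """Hypothetical stand-in for the conversion step: sizes of the segments between boundaries."""
    masses = []
    mass = 1
    for symbol in segmentation:
        if symbol == boundary_symbol:
            # A boundary closes the current segment and starts a new one
            masses.append(mass)
            mass = 1
        else:
            mass += 1
    masses.append(mass)
    return tuple(masses)


# Toy example: a 10-character text with one span from offset 2 to offset 5
toy_string = spans_to_segmentation_string(10, [{"start": 2, "end": 5}])
print(toy_string)                    # 0010010000
print(string_to_masses(toy_string))  # (3, 3, 5)
```

An example with no spans produces a string of zeros and therefore a single mass, which matches the one-element tuples shown for the `ignore` rows.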

In [4]:
import segeval

# For each row in the dataset, unpack the span content to the larger dataframe
def unpack_span_details(row):
    if isinstance(row["spans"], list):
        row["n_spans"] = len(row["spans"])
        row["span_label"] = ",".join(span["label"] for span in row["spans"])
        # Character offsets at which any span starts or ends
        boundary_indices = {
            offset for span in row["spans"] for offset in (span["start"], span["end"])
        }
        # Segmentation string: "1" marks a boundary, "0" everything else
        row["span_segmentation"] = "".join(
            "1" if i in boundary_indices else "0" for i in range(len(row["text"]))
        )
    else:
        # No spans: no labels and no boundaries
        row["n_spans"] = 0
        row["span_label"] = "None"
        row["span_segmentation"] = "0" * len(row["text"])

    # Convert to segeval masses
    row["span_segmentation"] = segeval.convert_nltk_to_masses(row["span_segmentation"])

    return row

annotations = annotations.apply(unpack_span_details, axis=1)
annotations.head()

| | text | answer | spans | annotator | muni | session_id | n_spans | span_label | span_segmentation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Good morning. As you said, I'm a downtown resi... | ignore | | angel | seattle | 6c40d8abf3c9 | 0 | | (314,) |
| 1 | Down morning. I'm chair of tree pack. It's dis... | ignore | | angel | seattle | 6c40d8abf3c9 | 0 | | (423,) |
| 2 | So Doug and Andrew if you are out there, call ... | accept | [{'start': 107, 'end': 270, 'token_start': 25,... | angel | seattle | 6c40d8abf3c9 | 1 | PERSON-AFFLIATED-WITH-ORG | (108, 163) |
| 3 | Thank you. Hi, I just want to bring attention ... | accept | [{'start': 174, 'end': 207, 'token_start': 38,... | angel | seattle | c6bbc7ceec24 | 1 | PERSON-AFFLIATED-WITH-ORG | (175, 33, 400) |
| 4 | Yes, I'm here. I'm unmuted, it appears. Yes. O... | ignore | | angel | seattle | c6bbc7ceec24 | 0 | | (253,) |


## Fleiss Kappa for "Same Label on Span"

In [5]:
# Create new dataframe where rows are the answers and columns are each annotator
annotator_answers = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    annotator_answers[annotator] = this_annotator_data["span_label"].reset_index(drop=True)

annotator_answers = pd.DataFrame(annotator_answers)

# Aggregate annotator answers
agg_raters, _ = aggregate_raters(annotator_answers)

# Compute Fleiss Kappa
same_label_score = fleiss_kappa(agg_raters)

# Interpret and print
same_label_score, _interpreted_score(same_label_score)

(0.6386233269598469, 'Substantial agreement')

## Fleiss Kappa for "Number of Spans within Example"

In [6]:
# Create new dataframe where rows are the answers and columns are each annotator
annotator_answers = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    annotator_answers[annotator] = this_annotator_data["n_spans"].reset_index(drop=True)

annotator_answers = pd.DataFrame(annotator_answers)

# Aggregate annotator answers
agg_raters, _ = aggregate_raters(annotator_answers)

# Compute Fleiss Kappa
n_spans_score = fleiss_kappa(agg_raters)

# Interpret and print
n_spans_score, _interpreted_score(n_spans_score)

(0.6472705458908214, 'Substantial agreement')

## Boundary Similarity for "Agreement on Span Positioning"

In [7]:
# Convert the dataset to the format needed by segeval:
# a dict of items, each mapping every coder to that coder's masses
# https://github.com/cfournie/segmentation.evaluation/blob/master/segeval/agreement/__init__.py#L116
# items_masses = {
#     'item1' : {
#         'coder1' : [5],
#         'coder2' : [2,3],
#         'coder3' : [1,1,3]
#     },
#     'item2' : {
#         'coder1' : [8],
#         'coder2' : [4,4],
#         'coder3' : [2,2,4]
#     }
# }
item_masses = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    for _, row in this_annotator_data.iterrows():
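        # NOTE: several examples can share a muni and session_id, so this key
        # is not unique per example and later rows overwrite earlier ones.
        item_index = f"{row.muni}-{row.session_id}"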
        if item_index not in item_masses:
            item_masses[item_index] = {}
        
        item_masses[item_index][annotator] = row["span_segmentation"]

# Compute actual agreement between annotators on boundary placement
segeval.actual_agreement_linear(item_masses)

Decimal('0.7183098591549295774647887324')