# Annotation Inter-Rater Reliability

We are building a span categorization model, so each annotation carries two kinds of information:

* Whether there was an annotation at all
* The start and stop of each annotation

## "Any Annotation at All"

This reduces to a binary label per example, so we can use standard IRR metrics.
In this case we will be using [Fleiss Kappa](https://en.wikipedia.org/wiki/Fleiss%27_kappa).
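
To make the metric concrete, here is a minimal pure-Python sketch of Fleiss' kappa computed directly from per-example labels. The function name `fleiss_kappa_from_labels` and the toy labels are illustrative assumptions; the analysis below uses `statsmodels` rather than this function.

```python
from collections import Counter


def fleiss_kappa_from_labels(rows):
    """Fleiss' kappa for `rows`: one list per example, one label per rater.

    Hypothetical helper for illustration only.
    """
    n_items = len(rows)
    n_raters = len(rows[0])
    if n_raters < 2 or any(len(row) != n_raters for row in rows):
        raise ValueError("every example needs the same number (>= 2) of raters")

    # Count how many raters chose each label, per example
    counts = [Counter(row) for row in rows]
    categories = {label for row in rows for label in row}

    # Observed agreement: share of rater pairs that agree, averaged over examples
    per_item = [
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each label
    proportions = [
        sum(item[cat] for item in counts) / (n_items * n_raters) for cat in categories
    ]
    expected = sum(p * p for p in proportions)

    # Kappa is undefined when only one label is ever used
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy data: three examples, four raters each
toy_rows = [
    ["accept", "accept", "accept", "accept"],
    ["ignore", "ignore", "ignore", "ignore"],
    ["accept", "accept", "accept", "ignore"],
]
print(round(fleiss_kappa_from_labels(toy_rows), 3))  # 0.657
```

A single dissenting rater on one of three examples already pulls the score well below 1, which is worth keeping in mind when reading scores on a small dataset.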

## "Same Label on Span"

Extract the annotation label from the span and use Fleiss Kappa on the retrieved data.

## "Number of Spans within Example"

Extract the number of spans from each example and use Fleiss Kappa on the retrieved data.

## "Agreement on Span Positioning"

For each example in which annotators agreed a span was present, compare the span start and stop positions.
Comparing start and stop points across annotators is most commonly done in "text segmentation".
As such, we are using a text segmentation evaluation metric called ["boundary agreement"](https://segeval.readthedocs.io/en/latest/api/?highlight=agreement#inter-coder-agreement-coefficients).

## Basic Data Prep

In [1]:
# Import and combine all data
import pandas as pd

ANNOTATOR_TO_FILE_LUT = {
    "angel": "annotations-angel-round-3.jsonl",
    "sarah": "annotations-sarah-round-3.jsonl",
    "kelly": "annotations-kelly-round-3.jsonl",
    "leo": "annotations-leo-round-3.jsonl",
}
annotation_dfs = []
for annotator, filename in ANNOTATOR_TO_FILE_LUT.items():
    with open(filename) as open_f:
        this_annotator_df = pd.read_json(open_f, lines=True).sort_values(by=["text"])
        this_annotator_df["annotator"] = annotator
        annotation_dfs.append(this_annotator_df)

annotations = pd.concat(annotation_dfs, ignore_index=True)
annotations.shape, annotations.columns

((112, 10),
 Index(['text', 'meta', '_input_hash', '_task_hash', 'tokens', '_view_id',
        'answer', '_timestamp', 'spans', 'annotator'],
       dtype='object'))

In [2]:
# Unpack meta
unpackaged_json = pd.json_normalize(annotations["meta"])
annotations["muni"] = unpackaged_json["muni"]
annotations["session_id"] = unpackaged_json["session_id"]

# Drop columns
annotations = annotations[["text", "answer", "spans", "annotator", "muni", "session_id"]]
annotations.head()

| | text | answer | spans | annotator | muni | session_id |
|---|---|---|---|---|---|---|
| 0 | Down morning. I'm chair of tree pack. It's dis... | ignore | | angel | seattle | 6c40d8abf3c9 |
| 1 | Efz Ning, members of the Oakland City Council,... | accept | [{'start': 58, 'end': 132, 'token_start': 13, ... | angel | oakland | c658b9361f53 |
| 2 | Good afternoon, Council. Thank you for the cha... | accept | [{'start': 71, 'end': 157, 'token_start': 16, ... | angel | seattle | c6bbc7ceec24 |
| 3 | Good afternoon, Greg McConnell, I'm here on be... | accept | [{'start': 16, 'end': 72, 'token_start': 3, 't... | angel | oakland | d83519594701 |
| 4 | Good afternoon, members of committee, my name ... | accept | [{'start': 49, 'end': 99, 'token_start': 10, '... | angel | oakland | b7cdf3723be0 |


In [3]:
from itertools import combinations
from typing import Dict, Optional

def print_annotator_diffs(data, label_col, annotator_col="annotator", link_col="text"):
    # Get just annotation series
    annotations: Dict[str, pd.Series] = {}
    link_series: Optional[pd.Series] = None
    
    # Iter annotators
    for annotator_label in data[annotator_col].unique():
        # Just their subset
        annotator_subset = data.loc[
            data[annotator_col] == annotator_label
        ].reset_index(drop=True)
        
        # Get their labels
        annotations[annotator_label] = annotator_subset[label_col]
        if link_series is None:
            link_series = annotator_subset[link_col]

    # Each annotator column values as columns
    annotations_df = pd.DataFrame(
        {
            link_col: link_series,
            **annotations,
        },
    )
    
    # Get all annotator pairs
    annotator_pairs = combinations(data[annotator_col].unique(), 2)
    # Construct pairwise diffs
    diffs = []
    for anno_one, anno_two in annotator_pairs:
        diffs.append(annotations_df.loc[annotations_df[anno_one] != annotations_df[anno_two]])
    diff_df = pd.concat(diffs)
    diff_df = diff_df.drop_duplicates(subset=[link_col]).reset_index(drop=True)
    print(f"Differing labels for '{label_col}'")
    print(diff_df)
    print()
    print("=" * 80)
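
The comparisons in this notebook line annotators up by row position after sorting on `text`, so they assume every annotator labeled the same examples. The sketch below is one way to check that assumption; `check_alignment` and its input shape are hypothetical and not part of the original analysis.

```python
def check_alignment(texts_by_annotator):
    """Return a list of mismatches between annotators' example texts.

    `texts_by_annotator` is a hypothetical dict mapping annotator name to the
    list of example texts in the order used for comparison.
    """
    names = list(texts_by_annotator)
    reference = texts_by_annotator[names[0]]
    problems = []
    for name in names[1:]:
        other = texts_by_annotator[name]
        # Different number of examples: positional comparison is meaningless
        if len(other) != len(reference):
            problems.append((name, "length", len(other)))
            continue
        # Same length: report every position where the text differs
        for position, (expected, actual) in enumerate(zip(reference, other)):
            if expected != actual:
                problems.append((name, "text", position))
    return problems


# Toy data: "leo" has the second example replaced by a different text
toy_texts = {
    "angel": ["Good afternoon", "Good morning"],
    "sarah": ["Good afternoon", "Good morning"],
    "leo": ["Good afternoon", "Hello"],
}
print(check_alignment(toy_texts))
```

With the dataframe above, the input could presumably be built with something like `annotations.groupby("annotator")["text"].apply(list).to_dict()`; an empty result means positional comparison is safe.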

## Fleiss Kappa for "Any Annotation at All"

In [4]:
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def _interpreted_score(v: float) -> str:
    # Map a kappa value to a qualitative agreement label
    if v < 0:
        return "No agreement"
    if v < 0.2:
        return "Poor agreement"
    if v < 0.4:
        return "Fair agreement"
    if v < 0.6:
        return "Moderate agreement"
    if v < 0.8:
        return "Substantial agreement"
    return "Almost perfect agreement"

# Create new dataframe where rows are the answers and columns are each annotator
annotator_answers = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    annotator_answers[annotator] = this_annotator_data["answer"].reset_index(drop=True)

annotator_answers = pd.DataFrame(annotator_answers)

# Aggregate annotator answers
agg_raters, _ = aggregate_raters(annotator_answers)

# Compute statistical fleiss kappa
any_annotation_at_all_score = fleiss_kappa(agg_raters)

# Interpret and print
print_annotator_diffs(annotations, "answer")
any_annotation_at_all_score, _interpreted_score(any_annotation_at_all_score)

Differing labels for 'answer'
                                                text   angel   sarah   kelly  \
0  Good morning, Pete her. Good morning. I'm in d...  ignore  accept  ignore   
1  So Doug and Andrew if you are out there, call ...  accept  ignore  accept   

      leo  
0  ignore  
1  accept  



(0.9181286549707602, 'Almost perfect agreement')

## Data Prep for Span Evaluation

Note: we are constructing segmentation strings from the span start and stop using the [NLTK segmentation string standard](https://segeval.readthedocs.io/en/latest/api/?highlight=agreement#segeval.convert_nltk_to_masses).
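
As an illustration of that conversion, the sketch below builds a boundary string from span offsets and turns it into segment masses in pure Python. Both function names are hypothetical, and `segmentation_string_to_masses` is only assumed to mirror `segeval.convert_nltk_to_masses` based on the masses shown in this notebook's output; it is not the library's code.

```python
def spans_to_segmentation_string(text_length, spans):
    """Mark each span start and end offset with "1", everything else with "0"."""
    boundaries = set()
    for span in spans:
        boundaries.add(span["start"])
        boundaries.add(span["end"])
    return "".join("1" if i in boundaries else "0" for i in range(text_length))


def segmentation_string_to_masses(segmentation, boundary_symbol="1"):
    """Convert a boundary string into segment sizes (masses).

    Assumed behaviour: each boundary symbol closes the current segment.
    """
    masses = []
    mass = 1
    for symbol in segmentation:
        if symbol == boundary_symbol:
            masses.append(mass)
            mass = 1
        else:
            mass += 1
    masses.append(mass)
    return tuple(masses)


# Toy example: a 10-character text with one span covering offsets 2 to 5
toy_string = spans_to_segmentation_string(10, [{"start": 2, "end": 5}])
print(toy_string)                                 # 0010010000
print(segmentation_string_to_masses(toy_string))  # (3, 3, 5)
```

Two details follow from this construction: a text with no spans becomes a single mass, and a span whose `end` equals the text length leaves no boundary because that offset is outside `range(text_length)`.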

In [5]:
import segeval

# For each row in the dataset, unpack the span content to the larger dataframe
def unpack_span_details(row):
    if isinstance(row["spans"], list):
        row["n_spans"] = len(row["spans"])
        row["span_label"] = ",".join([span["label"] for span in row["spans"]])
        start_and_stop_indices = [
            *[span["start"] for span in row["spans"]],
            *[span["end"] for span in row["spans"]],
        ]
        row["span_segmentation"] = "".join(["1" if i in start_and_stop_indices else "0" for i in range(len(row["text"]))])
    else:
        row["n_spans"] = 0
        row["span_label"] = "None"
        row["span_segmentation"] = "".join(["0" for i in range(len(row["text"]))])
        
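    # An empty span list yields an empty joined label; treat it
    # (and any label of three characters or fewer) as "no spans"
    if len(row["span_label"]) <= 3:
        row["n_spans"] = 0
        row["span_label"] = "None"
        row["span_segmentation"] = "".join(["0" for i in range(len(row["text"]))])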
    
    # Convert to segeval masses
    row["span_segmentation"] = segeval.convert_nltk_to_masses(row["span_segmentation"])
    
    return row

annotations = annotations.apply(unpack_span_details, axis=1)
annotations.head()

| | text | answer | spans | annotator | muni | session_id | n_spans | span_label | span_segmentation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Down morning. I'm chair of tree pack. It's dis... | ignore | | angel | seattle | 6c40d8abf3c9 | 0 | | (423,) |
| 1 | Efz Ning, members of the Oakland City Council,... | accept | [{'start': 58, 'end': 132, 'token_start': 13, ... | angel | oakland | c658b9361f53 | 1 | PERSON-AFFLIATED-WITH-ORG | (59, 74, 732) |
| 2 | Good afternoon, Council. Thank you for the cha... | accept | [{'start': 71, 'end': 157, 'token_start': 16, ... | angel | seattle | c6bbc7ceec24 | 1 | PERSON-AFFLIATED-WITH-ORG | (72, 86, 246) |
| 3 | Good afternoon, Greg McConnell, I'm here on be... | accept | [{'start': 16, 'end': 72, 'token_start': 3, 't... | angel | oakland | d83519594701 | 1 | PERSON | (17, 56, 1389) |
| 4 | Good afternoon, members of committee, my name ... | accept | [{'start': 49, 'end': 99, 'token_start': 10, '... | angel | oakland | b7cdf3723be0 | 1 | PERSON-AFFLIATED-WITH-ORG | (50, 50, 1471) |


## Fleiss Kappa for "Same Label on Span"

In [6]:
# Create new dataframe where rows are the answers and columns are each annotator
annotator_answers = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    annotator_answers[annotator] = this_annotator_data["span_label"].reset_index(drop=True)

annotator_answers = pd.DataFrame(annotator_answers)

# Aggregate annotator answers
agg_raters, _ = aggregate_raters(annotator_answers)

# Compute statistical fleiss kappa
same_label_score = fleiss_kappa(agg_raters)

# Interpret and print
print_annotator_diffs(annotations, "span_label")
same_label_score, _interpreted_score(same_label_score)

Differing labels for 'span_label'
                                                text  \
0  Good morning, Pete her. Good morning. I'm in d...   
1  Good morning. I am Madison, resident of distri...   
2  So Doug and Andrew if you are out there, call ...   
3  Good afternoon, Greg McConnell, I'm here on be...   

                       angel   sarah                      kelly  \
0                       None  PERSON                       None   
1  PERSON-AFFLIATED-WITH-ORG  PERSON  PERSON-AFFLIATED-WITH-ORG   
2                     PERSON    None                     PERSON   
3                     PERSON  PERSON                     PERSON   

                         leo  
0                       None  
1  PERSON-AFFLIATED-WITH-ORG  
2                     PERSON  
3  PERSON-AFFLIATED-WITH-ORG  



(0.8914728682170543, 'Almost perfect agreement')

## Fleiss Kappa for "Number of Spans within Example"

In [7]:
# Create new dataframe where rows are the answers and columns are each annotator
annotator_answers = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    annotator_answers[annotator] = this_annotator_data["n_spans"].reset_index(drop=True)

annotator_answers = pd.DataFrame(annotator_answers)

# Aggregate annotator answers
agg_raters, _ = aggregate_raters(annotator_answers)

# Compute statistical fleiss kappa
n_spans_score = fleiss_kappa(agg_raters)

# Interpret and print
print_annotator_diffs(annotations, "n_spans")
n_spans_score, _interpreted_score(n_spans_score)

Differing labels for 'n_spans'
                                                text  angel  sarah  kelly  leo
0  Good morning, Pete her. Good morning. I'm in d...      0      1      0    0
1  So Doug and Andrew if you are out there, call ...      1      0      1    1



(0.9360730593607308, 'Almost perfect agreement')

## Boundary Similarity for "Agreement on Span Positioning"

In [8]:
# Convert the dataset to the format needed by segeval
# https://github.com/cfournie/segmentation.evaluation/blob/master/segeval/agreement/__init__.py#L116
# items_masses = {
#     'item1' : {
#         'coder1' : [5],
#         'coder2' : [2,3],
#         'coder3' : [1,1,3]
#     },
#     'item2' : {
#         'coder1' : [8],
#         'coder2' : [4,4],
#         'coder3' : [2,2,4]
#     }
# }
item_masses = {}
for annotator in annotations.annotator.unique():
    this_annotator_data = annotations[annotations.annotator == annotator]
    for _, row in this_annotator_data.iterrows():
        item_index = f"{row.muni}-{row.session_id}"
        if item_index not in item_masses:
            item_masses[item_index] = {}
        
        item_masses[item_index][annotator] = row["span_segmentation"]

# Compute boundary similarity
segeval.actual_agreement_linear(item_masses)

ValueError: max() arg is an empty sequence
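
The description of this metric says positions are compared only for examples where annotators agreed a span was present, while the cell above passes every item, including ones that are a single mass with no boundaries. The sketch below shows one way to apply that filter before calling segeval. The helper `keep_items_with_boundaries` is hypothetical, and whether filtering resolves the `ValueError` above has not been verified.

```python
def keep_items_with_boundaries(item_masses):
    """Keep only items where every coder placed at least one boundary.

    A masses tuple of length 1 is one unbroken segment, meaning no span.
    """
    kept = {}
    for item, coder_masses in item_masses.items():
        if all(len(masses) > 1 for masses in coder_masses.values()):
            kept[item] = coder_masses
    return kept


# Toy data in the same shape as `item_masses` (item and coder names are made up)
toy_item_masses = {
    "item-a": {"coder1": (3, 3, 5), "coder2": (3, 4, 4)},
    "item-b": {"coder1": (11,), "coder2": (3, 3, 5)},
    "item-c": {"coder1": (11,), "coder2": (11,)},
}
filtered = keep_items_with_boundaries(toy_item_masses)
print(sorted(filtered))  # ['item-a']
```

The filtered dictionary keeps the same shape, so it could be passed to `segeval.actual_agreement_linear` in place of `item_masses`.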