# Inter-Annotator Agreement Calculation

This notebook contains the code used to calculate the inter-annotator agreement (IAA) scores reported in the paper. The scores are computed on a small subset of the original dataset and cover the dates, event phrases, event classes and date-event relations.

## Index

1. [Date Agreement](#date_agreement)
2. [Event Class and Phrase Agreement](#phrase_agreement)
3. [Relation Agreement](#relation_agreement)


<a id="date_agreement"><h2>Agreement on Dates</h2></a>

The first inter-annotator agreement measurement is Cohen's kappa over the dates, which shows how well the annotators agree on the dates extracted from the sentences.
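As a reminder of what the score measures: Cohen's kappa compares the observed agreement `p_o` with the agreement `p_e` expected by chance, as `(p_o - p_e) / (1 - p_e)`. The sketch below is a minimal pure-Python version for illustration only. The function name `cohens_kappa` and the example labels are made up; the notebook itself relies on `cohen_kappa_score` from scikit-learn.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_1) != len(labels_2):
        raise ValueError("Both annotators must label the same number of items.")
    n = len(labels_1)

    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Chance agreement: product of both annotators' label frequencies,
    # summed over all labels.
    counts_1 = Counter(labels_1)
    counts_2 = Counter(labels_2)
    p_e = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)

    if p_e == 1.0:
        # Both annotators used one and the same label throughout;
        # kappa is undefined in that case.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Made-up example: the annotators disagree on the last of four items.
print(cohens_kappa(["1990", "1990", "2001", "2001"],
                   ["1990", "1990", "2001", "1990"]))  # 0.5
```

Each distinct date string is treated as its own category, so the score only rewards exact string matches.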

In [77]:
import pandas as pd
from nltk.tokenize import word_tokenize
from torchmetrics.text.rouge import ROUGEScore

In [97]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def align_annotation_files(filename_1: str, filename_2: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Load both annotation files and keep only the rows that belong to documents
    annotated by both annotators.

    returns: (df1, df2), the two matched dataframes, one per annotator.
    """
    annot_df_1 = pd.read_csv(filename_1)
    annot_df_2 = pd.read_csv(filename_2)
    
    sort_cols = ['doc_id', 'start'] if 'start' in annot_df_1.columns else ['doc_id']
    
    annot_df_1 = annot_df_1.sort_values(sort_cols)
    annot_df_2 = annot_df_2.sort_values(sort_cols)
    
    matched_documents = set(annot_df_1['doc_id']) & set(annot_df_2['doc_id'])

    annot_df_1 = annot_df_1[annot_df_1['doc_id'].isin(matched_documents)]
    annot_df_2 = annot_df_2[annot_df_2['doc_id'].isin(matched_documents)]

    return annot_df_1, annot_df_2

In [98]:
# Load the annotation files for both annotators
dates_df_1, dates_df_2 = align_annotation_files('annotator1/entities.csv', 'annotator2/entities.csv')

In [99]:
# Select the dates from the sentences, which are contained in the 'text' column
aligned_dates_annot_1 = dates_df_1['text']
aligned_dates_annot_2 = dates_df_2['text']

In [100]:
cohens_kappa_dates = cohen_kappa_score(aligned_dates_annot_1, aligned_dates_annot_2)

In [101]:
print("The cohens Kappa for the %d aligned dates is %.2f" % (aligned_dates_annot_1.shape[0], cohens_kappa_dates))

The cohens Kappa for the 334 aligned dates is 1.00


Unsurprisingly, Cohen's kappa for the date extraction is perfect. The dates are taken directly from the text, so they are easy to identify and extract.

<a id="phrase_agreement"><h2>Agreement on Event Classes and Phrases</h2></a>

After calculating the agreement over the dates, we calculate the agreement over the events and their classes. For the event phrases, we do this both with exact matches and with the ROUGE-L score, which allows for small annotation differences.
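To make the ROUGE-L measure concrete: it is based on the longest common subsequence (LCS) of the two token sequences, from which a precision, a recall and an F-measure are derived. The sketch below is a simplified illustration under stated assumptions: it tokenizes on whitespace and does no stemming, so its values can differ from the `ROUGEScore` setup used in this notebook. The names `lcs_length` and `rouge_l_f1` and the example phrases are hypothetical.

```python
def lcs_length(tokens_1, tokens_2):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(tokens_2) + 1)
    for token_1 in tokens_1:
        current = [0]
        for j, token_2 in enumerate(tokens_2, start=1):
            if token_1 == token_2:
                # Extend the best subsequence that ends before both tokens.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(phrase_1, phrase_2):
    """ROUGE-L F-measure with lowercasing and whitespace tokenization."""
    tokens_1 = phrase_1.lower().split()
    tokens_2 = phrase_2.lower().split()
    lcs = lcs_length(tokens_1, tokens_2)
    if lcs == 0:
        return 0.0
    precision = lcs / len(tokens_1)
    recall = lcs / len(tokens_2)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the phrases share the subsequence "treaty was signed".
print(round(rouge_l_f1("the treaty was signed", "treaty was signed in Paris"), 2))  # 0.67
```

Two phrases that differ by a leading article or a trailing modifier still score well here, whereas an exact-match comparison counts them as full disagreement.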

In [102]:
event_df_1, event_df_2 = align_annotation_files('annotator1/date_event_combinations.csv', 'annotator2/date_event_combinations.csv')

# Event class Kappa score
print("The cohen's kappa for the %d events is %.2f" % (event_df_1.shape[0], cohen_kappa_score(event_df_1['label'], event_df_2['label'])))

# Event phrase Kappa score: exact match after stripping whitespace and lowercasing
annotator_1_event_phrases = event_df_1['event'].str.strip().str.lower()
annotator_2_event_phrases = event_df_2['event'].str.strip().str.lower()

print("The cohen's kappa for the %d event phrases is %.2f" % (event_df_1.shape[0], cohen_kappa_score(annotator_1_event_phrases, annotator_2_event_phrases)))


def calculate_rouge_overlap(ser1: pd.Series, ser2: pd.Series) -> float:
    """Return the mean ROUGE-L F-measure over the row-aligned phrase pairs."""
    rouge = ROUGEScore(use_stemmer=True, tokenizer=word_tokenize)
    rouge_scores = pd.Series([rouge(phrase_1, phrase_2)['rougeL_fmeasure'].item() for phrase_1, phrase_2 in zip(ser1, ser2)])
    return rouge_scores.mean()

print("The average Rouge-L overlap between event phrases is %.2f" % calculate_rouge_overlap(annotator_1_event_phrases, annotator_2_event_phrases))

The cohen's kappa for the 26 events is 0.91
The cohen's kappa for the 26 event phrases is 0.68
The average Rouge-L overlap between event phrases is 0.86


<a id="relation_agreement"><h2>Agreement on Relations</h2></a>

Besides the agreement on the dates and the event classes, we also measure the agreement on the relations, i.e. whether an event description consists of one or two parts.

In [153]:
import ast

relation_df_1, relation_df_2 = align_annotation_files('annotator1/relations.csv', 'annotator2/relations.csv')

def get_relation_ids(relations: list) -> list:
    # Parse each stringified list of ids with literal_eval, which is safer than eval
    ids = [ast.literal_eval(event_list) for event_list in relations]
    return ids

annot_1_event_ids = get_relation_ids(event_df_1['event_ids'].values)
annot_2_event_ids = get_relation_ids(event_df_2['event_ids'].values)

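def get_relations(event_ids, df):
    """Build one label per event by concatenating the types of its relations."""
    relations = []
    for ids in event_ids:
        relation = ''
        for event_id in ids:
            # Type of the first relation whose 'to_id' matches this id
            relation_type = df.loc[df['to_id'] == event_id, 'type'].values[0]
            relation += relation_type
        relations.append(relation)
    return relations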

relation_types_1 = get_relations(annot_1_event_ids, relation_df_1)
relation_types_2 = get_relations(annot_2_event_ids, relation_df_2)
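The label that is compared between annotators is therefore one string per event, formed by concatenating the relation types of its parts. The sketch below illustrates this on made-up data without pandas. The relation types `"A"` and `"B"`, the ids and the helper `relation_labels` are hypothetical and only mirror the logic of the cell above.

```python
import ast

# Hypothetical stand-ins for the 'event_ids' column and the relations table.
event_ids_column = ["[3]", "[5, 6]"]
relation_type_by_target = {3: "A", 5: "A", 6: "B"}


def relation_labels(event_ids_column, relation_type_by_target):
    """Concatenate the relation types of all parts of each event."""
    labels = []
    for raw_ids in event_ids_column:
        # Each cell holds a stringified list such as "[5, 6]".
        ids = ast.literal_eval(raw_ids)
        labels.append("".join(relation_type_by_target[i] for i in ids))
    return labels


print(relation_labels(event_ids_column, relation_type_by_target))  # ['A', 'AB']
```

An event with one part yields a single type and an event with two parts yields a combined label, so Cohen's kappa over these strings captures both the number of parts and their types.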

In [155]:
print("Cohen's kappa for agreeing on labeling relations correctly:", round(cohen_kappa_score(relation_types_1, relation_types_2), 2))

Cohen's kappa for agreeing on labeling relations correctly: 0.62
