# Calculating Inter-Annotator Agreement

This notebook focuses on calculating inter-annotator (a.k.a. inter-rater, inter-assessor) agreement: the degree to which multiple human assessors agree.

### Why annotators disagree

The research paper at https://www.nature.com/articles/s41746-023-00773-3 (published 21 February 2023) examines the effect of annotation discrepancies in the clinical domain.

#### Reasons for Disagreement

The paper makes the following observations:

> Annotation inconsistencies commonly occur when even highly experienced clinical experts annotate the same phenomenon (e.g., medical image, diagnostics, or prognostic status), due to inherent expert bias, judgments, and slips, among other factors. While their existence is relatively well-known, **the implications of such inconsistencies are largely understudied in real-world settings, when supervised learning is applied on such ‘noisy’ labelled data**.

> There are four main sources of annotation inconsistencies: (a) Insufficient information to perform reliable labelling (e.g., poor quality data or unclear guidelines); (b) Insufficient domain expertise; (c) Human error (i.e., slips & noise); (d) Subjectivity in the labelling task (i.e., judgment & bias).

We can extend this list with the **definition of the class schema** as a further source of disagreement. If the defined classes do not match the reality of the dataset, then no amount of supplied information, expertise, annotation quality or objectivity will solve this. The data itself can also cause annotation problems (for example data errors, data quality issues, or a lack of meaningful features).

We can summarize the reasons for disagreement between annotators as:

- **Insufficient Information**: Challenges in reliable labeling due to poor quality data or unclear guidelines.
- **Lack of Domain Expertise**: Inadequate expert knowledge leading to inconsistent annotations.
- **Human Error**: Mistakes such as slips and noise affecting the accuracy of annotations.
- **Subjectivity in Labelling**: Judgment and bias influencing the annotation process.
- **Class Schema Definition**: Discrepancies when defined classes do not align with the dataset's reality.
- **Data-Related Issues**: Problems arising from data errors, quality issues, or lack of meaningful features.




## 1. Provisioning the Data

### 1.1 Load the Data

We will use the annotated Swiss SMS set as the basis for calculating the agreements.
The SMS texts were annotated by students with the following classes:
* Content_Type (what kind of message was sent)
    * Appointment [APP]
    * News [NEWS]
    * No Content [NC]
* Age (whether the author of the text message was rather young or old)
    * young [JUNG]
    * old [ALT]

In order to do some calculations on top of these assessments we will have to
* Load the CSV (created via an export from Google Sheets) file into a dataframe
* Replace the String labels (the annotations) with numeric values (this is called `encoding`)

In [67]:
# Loading the CSV as a dataframe

import pandas as pd

df_age_annotations = pd.read_csv('../../../data/swiss_txt_age_2019.csv', header=None)
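
Because the file is read with `header=None`, pandas does not treat the first line as column names and assigns the integer labels `0, 1, 2, ...` instead. This is why the text ends up in column `0` and the annotations in columns `1` to `6`. The following minimal sketch illustrates the behaviour with a small in-memory CSV; the two sample rows are made up for illustration and are not taken from the data set.

```python
import io

import pandas as pd

# Hypothetical two-row CSV without a header line (text first, then three labels).
csv_text = "first message,ALT,JUNG,ALT\nsecond message,JUNG,JUNG,JUNG\n"

df_demo = pd.read_csv(io.StringIO(csv_text), header=None)

# With header=None the columns are labelled with integers starting at 0.
print(list(df_demo.columns))
print(df_demo.shape)
```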

### 1.2 Exercise: Explore the Data

Let's get some orientation in the dataframe.
Use the commands

* shape
* head()
* tail()

in order to see what we have loaded.

In [8]:
pd.set_option("display.max_colwidth", 200)
# Explore the dataframe with shape, head(), tail() commands
df_age_annotations.tail()

Unnamed: 0,0,1,2,3,4,5,6
15,"ja isch fuchbar .. obwohl bide export hani au nei gstume . abr has langsam igseh , dass e demokratie eifach d regierig vode dumme isch .. immer schön lerne und schaffe ,",ALT,ALT,ALT,ALT,ALT,ALT
16,"Na ja , joga ish nedso miis ding . finds chli störend wenn d leiterin immer seit mantarashaktahari und jetzt gömmer liecht is komotoshiri",ALT,ALT,JUNG,ALT,ALT,ALT
17,"Cool , freu mi ! Gaht dir so zwüschet halbi und 7 ni ? lg und bis morn !",ALT,ALT,ALT,ALT,ALT,ALT
18,"mitem taekwondo organisieret ! isch jetz chli chorzfrischtig , abr wör mi freue wenns klappt .. Wie gsehts öbrigens us miteme neue datum för eis go ziehe ? glg ond guet nacht :*",ALT,ALT,ALT,ALT,ALT,ALT
19,Ou dbnke das mi nomal dra erinneret hesch . Jetz hanis doch glatt scho vergäße gha . =D,JUNG,JUNG,JUNG,ALT,ALT,ALT


## 2. Calculating the Agreement for Age

In order to calculate the agreement for the class `age` we have to complete the following steps:

1. Create a dataframe that contains only the age label columns
2. Apply the analysis of unique values
3. Calculate how often the annotators agreed (unique values = 1)

### 2.1 Check the Dataframe with Age Annotations

In [3]:

df_age_annotations.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,"Ey schnabbel , wie gsehsch es mit de mittwuch-...",ALT,JUNG,JUNG,JUNG,ALT,JUNG
1,fangt dn de concours a ? chani jetz midm ber...,ALT,JUNG,JUNG,JUNG,JUNG,JUNG
2,Hoi gian-andrea . wie gahts ? Has dir in brie...,ALT,ALT,ALT,ALT,ALT,ALT
3,Ab wenn isch pilates ? Abem 11 i odr 12 i ?,ALT,ALT,ALT,ALT,ALT,ALT
4,einersiits iöh andrsiits sone mißge ! Hmm ka ...,JUNG,JUNG,ALT,JUNG,JUNG,JUNG


In [68]:
#Select only the columns with the labels

df_age_labels_only = df_age_annotations.loc[:,[1,2,3,4,5,6]]
df_age_labels_only


Unnamed: 0,1,2,3,4,5,6
0,ALT,JUNG,JUNG,JUNG,ALT,JUNG
1,ALT,JUNG,JUNG,JUNG,JUNG,JUNG
2,ALT,ALT,ALT,ALT,ALT,ALT
3,ALT,ALT,ALT,ALT,ALT,ALT
4,JUNG,JUNG,ALT,JUNG,JUNG,JUNG
5,ALT,ALT,ALT,ALT,ALT,ALT
6,ALT,ALT,ALT,ALT,ALT,ALT
7,ALT,ALT,ALT,ALT,ALT,ALT
8,JUNG,JUNG,JUNG,JUNG,JUNG,JUNG
9,ALT,ALT,ALT,ALT,ALT,ALT


#### Pandas Dataframe Selection with `.loc[]`

In the previous cell we used the `.loc[]` indexer to select specific rows and columns from the dataframe.

The indexer expects a row selection and a column selection, both given as labels. Whether the result is a view on the original dataframe or a copy depends on the selection and on the pandas version, so do not rely on changes to the selection showing up in the original dataframe.

In the above example `df_age_annotations.loc[:,[1,2,3,4,5,6]]`:
* `[:` selects all rows of the `df_age_annotations` dataframe.
    * `[5:]` would select the rows from label `5` until the end of the dataframe
    * `[:5]` would select the rows `0, 1, 2, 3, 4, 5` of the dataframe (with `.loc[]` the end label is included)
* `[1,2,3,4,5,6]]` selects the columns with the labels `1`, `2`, `3`, `4`, `5`, `6`. `.loc[]` always interprets values as labels (as opposed to interpreting the `2` as the third column from the left or something like that).

See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html for the official documentation. 
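
The difference between label-based and position-based selection can be sketched as follows. The toy dataframe is made up for illustration; it only mimics the integer row and column labels used in this notebook.

```python
import pandas as pd

# Hypothetical toy dataframe with integer row and column labels.
df_demo = pd.DataFrame(
    [["text a", "ALT", "JUNG"],
     ["text b", "ALT", "ALT"],
     ["text c", "JUNG", "JUNG"],
     ["text d", "ALT", "JUNG"]],
    columns=[0, 1, 2],
)

# .loc is label-based: the slice :2 includes the row labelled 2.
rows_by_label = df_demo.loc[:2, [1, 2]]

# .iloc is position-based: the slice :2 stops before position 2.
rows_by_position = df_demo.iloc[:2, [1, 2]]

print(len(rows_by_label))     # 3 rows (labels 0, 1, 2)
print(len(rows_by_position))  # 2 rows (positions 0, 1)
```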

### 2.2 Analysing the Unique Values

One way to look for disagreement in our annotations is to look at the number of unique values in a row.

`60 	HDMFG , pfus guet und danke vil vil mal H... 	JUNG 	JUNG 	JUNG 	JUNG 	JUNG 	JUNG`

If all annotators agree, as is the case in row `60`, the number of unique values in columns 1-6 is `1` ("JUNG").
If the number of unique values is `> 1`, then this indicates disagreement between the annotators.

We will use this approach for our first calculation of the inter-annotator agreement. 


In [39]:
# nunique() (number of unique) allows us to identify the number of unique values per row.

# this import is just for displaying the output as an HTML table
from IPython.display import HTML

# we make our selection on the dataframe and use nunique() with axis=1 (unique values per row).
# calling nunique(axis=0) or just nunique() instead would give us the unique values per column.
unique_values = df_age_labels_only.nunique(axis=1)


# the line below just formats the unique_values series as HTML and, in addition, outputs
# the corresponding rows of the dataframe
HTML(unique_values.to_frame().to_html() + df_age_labels_only.to_html())

Unnamed: 0,0
0,2
1,2
2,1
3,1
4,2
5,1
6,1
7,1
8,1
9,1

Unnamed: 0,1,2,3,4,5,6
0,ALT,JUNG,JUNG,JUNG,ALT,JUNG
1,ALT,JUNG,JUNG,JUNG,JUNG,JUNG
2,ALT,ALT,ALT,ALT,ALT,ALT
3,ALT,ALT,ALT,ALT,ALT,ALT
4,JUNG,JUNG,ALT,JUNG,JUNG,JUNG
5,ALT,ALT,ALT,ALT,ALT,ALT
6,ALT,ALT,ALT,ALT,ALT,ALT
7,ALT,ALT,ALT,ALT,ALT,ALT
8,JUNG,JUNG,JUNG,JUNG,JUNG,JUNG
9,ALT,ALT,ALT,ALT,ALT,ALT


### 2.3 Calculate the Agreement

Now that we have a way to identify all rows where the annotators agreed we can easily calculate the annotator agreement with:

$
 \frac{\text{Number of Samples with Agreement}}{\text{Number of Samples}} 
$

This then translates to the cell below, where:

* `(df_age_annotations.loc[:,[1,2,3,4,5,6]].nunique(1) == 1).sum()` is the number of samples where annotators agreed.
* `len(df_age_annotations)` gives us the number of samples

Based on this calculation we see that the agreement is `0.6`. In 60% of the rows (samples) all 6 assessors agreed completely in their judgement of `JUNG/ALT`.


In [34]:
simple_annotator_agreement_age = (df_age_labels_only.nunique(1) == 1).sum()/len(df_age_labels_only)
print(simple_annotator_agreement_age)

0.6
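
To make the formula concrete, the sketch below applies it in plain Python to only the ten label rows displayed in section 2.2. The full data set contains more rows than the ten displayed, so the result of this sketch differs from the `0.6` computed above.

```python
# The ten label rows displayed in section 2.2 (annotators 1 to 6).
rows = [
    ["ALT", "JUNG", "JUNG", "JUNG", "ALT", "JUNG"],
    ["ALT", "JUNG", "JUNG", "JUNG", "JUNG", "JUNG"],
    ["ALT"] * 6,
    ["ALT"] * 6,
    ["JUNG", "JUNG", "ALT", "JUNG", "JUNG", "JUNG"],
    ["ALT"] * 6,
    ["ALT"] * 6,
    ["ALT"] * 6,
    ["JUNG"] * 6,
    ["ALT"] * 6,
]

# A row counts as agreement when it contains exactly one distinct label.
rows_with_agreement = sum(1 for row in rows if len(set(row)) == 1)
agreement_displayed_rows = rows_with_agreement / len(rows)

print(rows_with_agreement, len(rows), agreement_displayed_rows)  # 7 10 0.7
```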

#### Exercise: Calculate the Agreement for content_type

Calculate the agreement for the assessments of class `content_type`, analogous to what we have done above for `age`, based on the file:
`../../../data/swiss_txt_content_type_2019.csv`.


In [63]:
df_content_type_annotations = pd.read_csv('../../../data/swiss_txt_content_type_2019.csv', header=None, sep=";")
unique_values = df_content_type_annotations.loc[:,[1,2,3,4,5,6]].nunique(1)
(unique_values == 1).sum()/len(df_content_type_annotations)

0.46

## 3. Discussion: Simple Annotator Agreement

The tendency in the two calculated agreements fits well with the initial intuition most people had when performing the annotation.

Most people found it easier to annotate for age than for the content_type.
This can be the result of several factors:

* Lack of a good definition for the meaning of the classes in `content_type`
* Mismatch between the classes in `content_type` and the "reality" reflected by the SMS
* The higher agreement for `age` (0.6) compared to `content_type` (0.46) is of course also a reflection of the first being a binary classification and the second a multi-class classification scheme with 3 classes
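
To illustrate the effect of the number of classes, consider a simplifying assumption that is not part of the analysis above: every annotator picks one of `k` classes independently and uniformly at random. The probability that all `n` annotators land on the same class is then `k * (1/k)**n`. A minimal sketch under this assumption:

```python
def chance_of_full_agreement(num_classes: int, num_annotators: int) -> float:
    """Probability that all annotators pick the same class when each one
    chooses independently and uniformly at random (illustrative assumption)."""
    return num_classes * (1 / num_classes) ** num_annotators


binary = chance_of_full_agreement(2, 6)       # age: JUNG/ALT
three_class = chance_of_full_agreement(3, 6)  # content_type: APP/NEWS/NC

print(f"{binary:.4f} {three_class:.4f}")
```

Under this assumption, unanimous agreement by pure chance is several times more likely with two classes than with three, so the two simple agreement values are not directly comparable.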

The important question to ask with regard to the calculated agreement is how to react to these observations.

As a rule of thumb, an annotator agreement below 0.5 should make us reconsider the annotation setup:

1. Clear Instructions for Annotators?
2. Annotator Fit for Task?
     * Do they have the required knowledge?
     * Do they have the required motivation?
3. Defined Classes Make Sense?





## 4. Calculating the Agreement Based on Statistical Measures

### 4.1 Encoding Categorical Data


When we are working with annotated data, we often encounter categorical data.

#### Categorical Data

Categorical data refers to data that represents categories in the widest sense.
In supervised ML we will often encounter categorical data such as:

* relevance: "relevant/ not relevant"
* class: "spam/not spam"
* contract type: "lease/mortgage/..."

or as in our case with the SMS:

* content_type: "NC/APP/NEWS"
* age: JUNG/ALT

When data is annotated this is often done with the categorical labels.
If we want to make computations based on these labels then it is often advantageous to map those categories to numeric values (e.g. removes necessity for String handling, smaller size of integer type). This process is called encoding. 

#### Encoding

Encoding is a straightforward process. 
If we have labels in the form of Strings that we want to map to Ints, we simply create a mapping between the labels and integer numbers.

Sklearn supports this with the `LabelEncoder` as shown in the cell below.



In [92]:
# There are some handy tools for transforming data in the preprocessing
# package of scikit-learn.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df_age_numeric = df_age_labels_only.apply(encoder.fit_transform)

In [70]:
df_age_numeric

Unnamed: 0,1,2,3,4,5,6
0,0,1,1,1,0,1
1,0,1,1,1,1,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,1,0,1,1,1
5,0,0,0,0,0,0
6,0,0,0,0,0,0
7,0,0,0,0,0,0
8,1,1,1,1,1,1
9,0,0,0,0,0,0


In [71]:
# we can use the encoder to map back to the original values

encoder.inverse_transform([0, 1])

array(['ALT', 'JUNG'], dtype=object)
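
Note that `apply(encoder.fit_transform)` refits the encoder for every column, and `LabelEncoder` numbers the labels it sees in sorted order. A column in which only `JUNG` occurs would therefore encode `JUNG` as `0`. One way to avoid this is an explicit mapping shared by all columns. The sketch below assumes the mapping `ALT -> 0`, `JUNG -> 1` (matching the output above) and uses a made-up toy dataframe:

```python
import pandas as pd

# Explicit label-to-integer mapping shared by all annotator columns.
age_mapping = {"ALT": 0, "JUNG": 1}

# Hypothetical toy data: the second annotator only ever used JUNG.
df_demo = pd.DataFrame({1: ["ALT", "JUNG", "ALT"], 2: ["JUNG", "JUNG", "JUNG"]})

# Series.map applies the same dictionary to every column.
df_demo_numeric = df_demo.apply(lambda column: column.map(age_mapping))

print(df_demo_numeric[2].tolist())  # [1, 1, 1]
```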

### 4.2 Statistical Tools to Calculate Inter-Annotator Agreement

Statistical tooling is often applied to calculate the level of agreement between annotators.
The reason to bring in statistical tooling is the following:

"We would like to consider how likely it is, that n people agree on their assessments."

In a nutshell these tools take the distribution of the labels and the number of assessors into account when calculating the agreement.
This can be useful in cases where the distribution of labels is extremely skewed towards some labels (e.g. 2 out of 10 labels that make up 90% of the annotations).

Therein lies the advantage of these tools. Their disadvantage is that the results are harder to interpret.
In our simple calculation it is completely clear how to interpret the result.
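
The effect of a skewed label distribution can be made concrete with a small calculation. Assuming two annotators who label independently of each other and follow the same label distribution (an illustrative assumption), the probability that they agree purely by chance is the sum of the squared label proportions:

```python
def chance_agreement(label_proportions):
    """Probability that two independent annotators with the same label
    distribution pick the same label for an item."""
    return sum(p * p for p in label_proportions)


balanced = chance_agreement([0.5, 0.5])  # two equally frequent labels
skewed = chance_agreement([0.9, 0.1])    # one label covers 90% of the items

print(round(balanced, 2), round(skewed, 2))  # 0.5 0.82
```

With a distribution this skewed, a raw agreement of around 0.8 is roughly what chance alone would produce, which is why chance-corrected measures are useful.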

### 4.3 Calculating Agreement

#### Agreement Between Two Annotators - Cohen's Kappa

Cohen's Kappa is probably the most commonly used approach to calculate the agreement between two annotators (exactly two, and not more).

In [90]:
# Cohen's kappa for annotators 1 and 5

from sklearn.metrics import cohen_kappa_score

# select each annotator as a 1-D Series (single brackets) rather than a one-column dataframe
cohens_sklearn = cohen_kappa_score(df_age_numeric[1], df_age_numeric[5])
print("Cohens kappa from sklearn: {:.2f}".format(cohens_sklearn))


Cohens kappa from sklearn: 0.47
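
For intuition, Cohen's Kappa compares the observed agreement $p_o$ with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below is a plain-Python illustration, not the scikit-learn implementation. It is applied to annotators 1 and 5 on only the ten rows displayed in section 2.2, so its result differs from the value computed on the full data set.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: product of both annotators' label proportions, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    # Undefined (division by zero) if both annotators only ever used one identical label.
    return (p_o - p_e) / (1 - p_e)


# Annotators 1 and 5 on the ten displayed rows.
annotator_1 = ["ALT", "ALT", "ALT", "ALT", "JUNG", "ALT", "ALT", "ALT", "JUNG", "ALT"]
annotator_5 = ["ALT", "JUNG", "ALT", "ALT", "JUNG", "ALT", "ALT", "ALT", "JUNG", "ALT"]

print(round(cohen_kappa(annotator_1, annotator_5), 2))  # 0.74
```

Here the two annotators agree on 9 of 10 items ($p_o = 0.9$), while $p_e = 0.8 \cdot 0.7 + 0.2 \cdot 0.3 = 0.62$.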


#### Interpretation of Cohen's Kappa

The following table provides the basis for interpreting Cohen's Kappa:

| Range           | Interpretation       |
|-----------------|-----------------------|
| ≤ 0            | No Agreement          |
| 0.01–0.20       | None to Slight        |
| 0.21–0.40       | Fair                  |
| 0.41–0.60       | Moderate              |
| 0.61–0.80       | Substantial           |
| 0.81–1.00       | Almost Perfect        |



#### Exercise:
Calculate Cohen's Kappa with different pairings of the six annotators. Compare your results with the underlying ratings and test if the Kappa scores match your intuition.


#### Agreement Between N Annotators - Fleiss' Kappa

Fleiss' Kappa extends the idea behind Cohen's Kappa to a fixed number of `N` annotators (strictly speaking, it generalises Scott's pi rather than Cohen's Kappa).

Further reading:
* Documentation of the R package `irr`: https://cran.r-project.org/web/packages/irr/irr.pdf
* Background on interpreting the kappa statistic: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/


In [96]:
# You can install pyirr with pip. It is a port of an R library for calculating inter-annotator agreement.
import pyirr


pyirr.kappam_fleiss(df_age_numeric, detail=True)


            Fleiss` Kappa for m Raters            
Subjects = 20
  Raters = 6
   Kappa = 0.603

       z = 10.447
 p-value = 0.000

   Kappa       z  p.value
0  0.603  10.447      0.0
1  0.603  10.447      0.0
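
For intuition about what such a function computes, the following plain-Python sketch implements the standard Fleiss' Kappa formula: the mean share of agreeing annotator pairs per item, corrected by the chance agreement derived from the overall label proportions. It is an illustration, not the `pyirr` implementation, and it is applied to only the ten rows displayed in section 2.2, so its result differs from the value reported above for the full data set.

```python
from collections import Counter


def fleiss_kappa(rows):
    """Fleiss' kappa for items (rows) that were each labelled by the same number of annotators."""
    num_items = len(rows)
    num_raters = len(rows[0])
    label_totals = Counter()
    per_item_agreement = []
    for row in rows:
        counts = Counter(row)
        label_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (num_raters * (num_raters - 1)))
    p_bar = sum(per_item_agreement) / num_items
    # Chance agreement from the overall label proportions.
    p_e = sum((total / (num_items * num_raters)) ** 2 for total in label_totals.values())
    return (p_bar - p_e) / (1 - p_e)


# The ten label rows displayed in section 2.2 (annotators 1 to 6).
rows = [
    ["ALT", "JUNG", "JUNG", "JUNG", "ALT", "JUNG"],
    ["ALT", "JUNG", "JUNG", "JUNG", "JUNG", "JUNG"],
    ["ALT"] * 6,
    ["ALT"] * 6,
    ["JUNG", "JUNG", "ALT", "JUNG", "JUNG", "JUNG"],
    ["ALT"] * 6,
    ["ALT"] * 6,
    ["ALT"] * 6,
    ["JUNG"] * 6,
    ["ALT"] * 6,
]

print(round(fleiss_kappa(rows), 2))  # 0.73
```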

#### Interpretation of Fleiss' Kappa

The following table shows an interpretation scale for Fleiss' Kappa.
As with the interpretation of Cohen's Kappa, the "correct" interpretation of the scores depends on the context of your ML project.
You might tolerate more disagreement if your annotators are labelling comments on the Web with the categories OFFENSIVE/NON-OFFENSIVE than if they are labelling cancer scans.

| Value of Kappa | Level of Agreement   | % of Data that are Reliable |
|-----------------|-----------------------|-----------------------------|
| 0–0.20          | None                  | 0–4%                        |
| 0.21–0.39       | Minimal               | 4–15%                       |
| 0.40–0.59       | Weak                  | 15–35%                      |
| 0.60–0.79       | Moderate              | 35–63%                      |
| 0.80–0.90       | Strong                | 64–81%                      |
| Above 0.90      | Almost Perfect        | 82–100%                     |


<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/#:~:text=Cohen%20suggested%20the%20Kappa%20result,1.00%20as%20almost%20perfect%20agreement"> Source</a> 
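
The table can be turned into a small lookup helper. The function name and the handling of values that fall between two listed ranges (for example between 0.39 and 0.40) are assumptions of this sketch, not part of the source table.

```python
def agreement_level(kappa: float) -> str:
    """Map a kappa value to the level of agreement from the table above.

    Values between two listed ranges are assigned to the higher range,
    which is an assumption of this sketch.
    """
    if kappa <= 0.20:
        return "None"
    if kappa < 0.40:
        return "Minimal"
    if kappa < 0.60:
        return "Weak"
    if kappa < 0.80:
        return "Moderate"
    if kappa <= 0.90:
        return "Strong"
    return "Almost Perfect"


# Fleiss' Kappa reported above for the age annotations.
print(agreement_level(0.603))  # Moderate
```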


## Exercise 1: Calculate Fleiss' and Cohen's Kappa for `content_type`

Calculate Fleiss' and Cohen's Kappa for `content_type` by using the methods shown in the cells above.

In [107]:
df_content_type_numeric = df_content_type_annotations.loc[:,[1,2,3,4,5,6]].apply(encoder.fit_transform)
cohens_sklearn = cohen_kappa_score(df_content_type_numeric[2], df_content_type_numeric[3])
print("Cohens kappa from sklearn: {:.2f}".format(cohens_sklearn))
pyirr.kappam_fleiss(df_content_type_numeric, detail=True)


Cohens kappa from sklearn: 0.55


            Fleiss` Kappa for m Raters            
Subjects = 50
  Raters = 6
   Kappa = 0.615

       z = 23.681
 p-value = 0.000

   Kappa       z  p.value
0  0.759  20.779      0.0
1  0.573  15.694      0.0
2  0.524  14.351      0.0