Welcome to this hands-on session's official Google Colab!



---



In this environment, you get text blocks (like this one) and editable code blocks (like the one that follows this one).

In some blocks, there will be an ellipsis somewhere - ...

This marks places where you are expected to fill in your own code. The required code should fit neatly into a single line, but you can of course use as many lines as you feel comfortable with.

There is no ellipsis in the following code block, so you can just go ahead and run it -- it loads:

1) the *os* and *pandas* packages;

2) the *metrics* library from the *scikit-learn* package, which you can optionally use later.

The *expected agreement* function, which we use later, is defined in a later block.

Click the Play button in the top left of the code block!


In [None]:
import pandas as pd
import os

import sklearn.metrics as metrics

print("Packages loaded!")

Packages loaded!


The following block loads the annotations everyone has worked on today into memory. I anonymize the data on my machine beforehand; if you're reading this, that has probably already happened -- if you're unsure, ask!

Each original X-ray report and its translation share a unique id (something like "0P10090238471000"), which replaces the text itself here. If we become interested in a particular report, I can easily find its corresponding text, in both German and English, and show it on the projector.

The cell tries to load the data into memory from GitHub and contains error-handling logic: if loading fails, it prints an error message.

In [None]:
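try:
  url = "https://raw.githubusercontent.com/iml-r/heidelberg-nmt/refs/heads/main/anonymized_reports.tsv"
  # The file is tab-separated, hence sep="\t"
  data = pd.read_csv(url, sep="\t")

  print("Data successfully loaded!")
  print(data.columns)
  print(data)

except Exception as error:
  # Catch Exception instead of using a bare except, so interrupting the cell still works
  print("Whoops! Something went wrong. Let me know and I'll fix it!")
  print("Details:", error)

Data successfully loaded!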
Index(['report_id', 'annotator_English', 'annotator_German',
       'diagnosis_English', 'diagnosis_German'],
      dtype='object')
           report_id  annotator_English  annotator_German  \
0   0P10087274932000                 18                15   
1   0P10087451436000                 10                13   
2   0P10087451447000                 18                15   
3   0P10087490536000                 10                13   
4   0P10087510626000                 18                15   
5   0P10087619283000                  5                13   
6   0P10087641883000                  5                13   
7   0P10087656469000                  8                17   
8   0P10087663944000                  5                13   
9   0P10087673761000                  5                13   
10  0P10087676176000                 10                13   
11  0P10087747788000                 18                15   
12  0P10088264123000                 10           

Extract the "diagnosis_English" column and the corresponding German column. A sanity check then asserts that both have the same length.

In [None]:
english_labels = data.diagnosis_English
german_labels = data.diagnosis_German

assert len(english_labels) == len(german_labels), "Lengths don't match!"
print("Success! (probably)")

Success! (probably)


Now we can calculate observed agreement. As we discussed during the intro presentation, this alone is not sufficient -- two annotators who randomly label items as apples or oranges with a 50/50 split will still agree about 50% of the time.

Observed agreement is simply the number of positions at which $x$ and $y$ hold the same label, divided by the length of $x$ and $y$ (which has to be the same -- the previous block already checked that). In our case, $x$ is the German labels and $y$ is the English labels (or vice versa; neither Cohen's kappa nor observed agreement cares about ordering).
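
To see why raw agreement is misleading, here is a small simulation sketch. The two "annotators", the label names and the number of items are all made up for illustration; nothing here touches the session's data. With a 50/50 coin flip per item, the printed agreement should land close to 0.5.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

labels = ["apple", "orange"]
n_items = 10_000  # hypothetical number of items

# Two hypothetical annotators who label every item by coin flip
annotator_a = [random.choice(labels) for _ in range(n_items)]
annotator_b = [random.choice(labels) for _ in range(n_items)]

# Share of items on which the two random annotators happen to agree
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n_items
print(f"Agreement between two random annotators: {agreement:.2f}")
```

Neither annotator looked at the items, yet they agree on roughly half of them, which is why we need a measure that discounts chance.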

In addition, the *expected_agreement()* function is defined in the same code cell as *observed_agreement()*. It is a bit too difficult for the time we have, but you can at least check the code.

In [None]:
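def observed_agreement(x, y):
  # Both label sequences must be equally long; zip() would silently truncate otherwise
  assert len(x) == len(y)

  # Share of positions at which both annotators chose the same label
  p_o = sum(a == b for a, b in zip(x, y)) / len(x)

  return p_o

# Do not edit this
def expected_agreement(rater1, rater2):
    rater1 = list(rater1)
    rater2 = list(rater2)

    n = len(rater1)
    assert n == len(rater2)

    # Every label used by at least one of the raters
    categories = set(rater1) | set(rater2)

    # Relative frequency of each label, per rater
    p_rater1 = {c: rater1.count(c) / n for c in categories}
    p_rater2 = {c: rater2.count(c) / n for c in categories}

    # Chance that both raters pick the same label independently of each other
    p_e = sum(p_rater1[c] * p_rater2[c] for c in categories)
    return p_e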
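
As a sketch of what *expected_agreement()* computes, written as a formula: let $C$ be the set of all labels used by either rater, and let $p_1(c)$ and $p_2(c)$ be the share of items that rater 1 and rater 2 assigned to label $c$ (this notation is introduced only for illustration). Then

$$
p_e = \sum_{c \in C} p_1(c) \, p_2(c).
$$

A made-up example: if rater 1 labels 60% of the items "A" and 40% "B", while rater 2 splits them 50/50, then $p_e = 0.6 \cdot 0.5 + 0.4 \cdot 0.5 = 0.5$.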

Cohen's kappa corrects for agreement that arises by chance.

It expects two sequences of labels, $x$ and $y$, of equal length.
We already know that $x$ and $y$ are of the same length, since the previous block checks for it.

Formally, the definition is

$$
\kappa = \frac{p_o - p_e}{1 - p_e},
$$

where $p_o$ is the observed agreement (the number of positions at which $x$ and $y$ hold the same label, divided by the total number of items) and $p_e$ is the expected agreement. Use the *observed_agreement()* and *expected_agreement()* functions from the previous block to calculate them.
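
To make the formula concrete, here is a small worked sketch on made-up toy labels (the label names and values are hypothetical, not taken from today's annotations). It recomputes $p_o$ and $p_e$ inline with the standard library, so it runs on its own.

```python
from collections import Counter

# Hypothetical toy labels from two annotators (not the session's data)
x = ["normal", "normal", "fracture", "fracture", "normal", "fracture", "normal", "normal"]
y = ["normal", "fracture", "fracture", "fracture", "normal", "normal", "normal", "normal"]

n = len(x)
assert n == len(y)

# Observed agreement: share of positions with identical labels
p_o = sum(a == b for a, b in zip(x, y)) / n

# Expected agreement: chance that both pick the same label independently
counts_x, counts_y = Counter(x), Counter(y)
categories = set(x) | set(y)
p_e = sum((counts_x[c] / n) * (counts_y[c] / n) for c in categories)

kappa = (p_o - p_e) / (1 - p_e)

print("p_o:", round(p_o, 2))
print("p_e:", round(p_e, 2))
print("kappa:", round(kappa, 2))
```

Here the annotators agree on 6 of 8 items ($p_o = 0.75$), but more than half of that would be expected by chance ($p_e \approx 0.53$), so kappa comes out at roughly 0.47.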

If you find yourself struggling, either ask me or just uncomment the commented line that calls *metrics.cohen_kappa_score*, which uses a pre-coded version from scikit-learn. (By the way, in actual practice this would be the preferred approach over re-inventing the wheel, but this is a coding exercise and a learning opportunity!)

Also, you can uncomment the line with the expected agreement to see the difference between it and Cohen's kappa.

In [None]:
def cohen_kappa_score(x, y) -> float:
  assert len(x) == len(y)

  p_o = observed_agreement(x,y)
  p_e = expected_agreement(x,y)

  kappa = (p_o - p_e) / (1 - p_e)

  return kappa

result = cohen_kappa_score(english_labels, german_labels)
#result = metrics.cohen_kappa_score(english_labels, german_labels)
#result = expected_agreement(english_labels, german_labels)

print("Result: ", round(result, 2))

Result:  1.0


Finally, our task is to find the reports the annotators disagreed on, so that we can compare them and discuss what went wrong and, more importantly, why.

Write code that prints the id of each report where the labels by the two annotators don't match. (I will find the reports corresponding to the ids on my machine -- again, data confidentiality.)

In [None]:
for _, row in data.iterrows():
  english_label = row.diagnosis_English
  german_label = row.diagnosis_German

  ...


