## 3. Active Learning Exercises: Calculating Cohen's Kappa in Python Solutions

In our analysis, we calculated inter-annotator agreement using the Cohen's kappa function from the `scikit-learn` library. However, this statistic can also be calculated manually. This exercise walks you through computing Cohen's Kappa by hand, step by step.

### 3.1. Understanding the Confusion Matrix

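Cohen's Kappa is based on a confusion matrix. When a classifier is evaluated against a known ground truth, this table reports the number of true positives, false negatives, false positives, and true negatives. For inter-annotator agreement there is no ground truth, so the table instead cross-tabulates the two annotators' labels and counts how often each combination occurs.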

Let's assume we have two annotators, Annotators A and B, who classified profanity in 100 lines of lyrics into two categories: "Positive" and "Negative":

| Annotator A | Annotator B | Count |
|-------------|-------------|-------|
| Positive    | Positive    | 50    |
| Positive    | Negative    | 10    |
| Negative    | Positive    | 5     |
| Negative    | Negative    | 35    |

In this table:
- The first row indicates that both annotators classified 50 profanity occurrences as "Positive".
- The second row indicates that Annotator A classified 10 profanity occurrences as "Positive" while Annotator B classified them as "Negative".
- The third row indicates that Annotator A classified 5 profanity occurrences as "Negative" while Annotator B classified them as "Positive".
- The fourth row indicates that both annotators classified 35 profanity occurrences as "Negative".

Your first exercise is to create a confusion matrix from the example data above using `NumPy`, a fundamental package for numerical computing in Python that provides arrays and mathematical functions.

*(Hint: you'll need the `array` function, see documentation [here](https://numpy.org/doc/2.1/reference/generated/numpy.array.html))*

In [144]:
# Import NumPy:
import numpy as np

# Create the confusion matrix:
confusion_matrix = np.array([[50, 10],
                              [5, 35]])
# Print the matrix:
print(confusion_matrix)

[[50 10]
 [ 5 35]]
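
The matrix above was typed in directly from the table. In practice the counts usually come from per-item labels, so the following is a minimal sketch of that step. The `labels_a` and `labels_b` arrays are hypothetical and were constructed only to reproduce the example counts (1 = "Positive", 0 = "Negative").

```python
import numpy as np

# Hypothetical per-line labels that reproduce the counts in the example table
# (1 = "Positive", 0 = "Negative"); real annotation data would be loaded instead.
labels_a = np.array([1] * 50 + [1] * 10 + [0] * 5 + [0] * 35)
labels_b = np.array([1] * 50 + [0] * 10 + [1] * 5 + [0] * 35)

# Rows follow Annotator A, columns follow Annotator B,
# both ordered Positive (1), then Negative (0).
categories = [1, 0]
matrix_from_labels = np.array([
    [np.sum((labels_a == row) & (labels_b == col)) for col in categories]
    for row in categories
])

print(matrix_from_labels)
```

Each cell counts the lines where Annotator A chose the row category and Annotator B chose the column category, so the result should match the hand-typed matrix.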


### 3.2. Calculate Observed Agreement (Po)

Next, we will need to calculate the observed agreement (Po), the proportion of times the annotators agree on their classifications. The cases where the two annotators agree are visible in the *diagonal* of the confusion matrix (top left to bottom right). 

The formula for observed agreement is:

$P_o = \frac{a + d}{N}$

Where:
- a = number of agreements in the positive category (top-left cell in the confusion matrix)
- d = number of agreements in the negative category (bottom-right cell in the confusion matrix)
- N = total number of observations (sum of all cells in the confusion matrix)

Your task is to calculate Po using the confusion matrix and formula. Index the confusion matrix to extract the necessary values.

In [147]:
# Use indexing to extract the values ('a' and 'd') needed for the calculation of Po:
a = confusion_matrix[0, 0]  # Positive-Positive
d = confusion_matrix[1, 1]  # Negative-Negative

# Calculate the total number of observations (N) by summing all the elements in the confusion matrix:
N = np.sum(confusion_matrix)

# Calculate Po using the formula:
P_o = (a + d) / N

# Print the result:
print('Observed Agreement =', P_o)

Observed Agreement = 0.85
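
Because the agreements sit on the main diagonal, the same value can also be obtained without indexing individual cells. The sketch below uses `np.trace`, which sums the diagonal, and would extend unchanged to more than two categories. The variable name `P_o_trace` is only illustrative.

```python
import numpy as np

confusion_matrix = np.array([[50, 10],
                             [5, 35]])

# np.trace sums the main diagonal, i.e. all cells where the annotators agree.
P_o_trace = np.trace(confusion_matrix) / confusion_matrix.sum()

print('Observed Agreement =', P_o_trace)
```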


### 3.3. Calculate Expected Agreement (Pe)

We will now do the same for the expected agreement (Pe), the agreement between two annotators that would be expected by chance. 

The formula for expected agreement is:

$P_e = \left(\frac{(a+b)(a+c)}{N^2}\right) + \left(\frac{(c+d)(b+d)}{N^2}\right)$ 

Where:
- b = number of positive ratings by Annotator A and negative ratings by Annotator B (top-right cell in the confusion matrix)
- c = number of negative ratings by Annotator A and positive ratings by Annotator B (bottom-left cell in the confusion matrix)

*(Note that you already created variables for values `a`, `d` and `N` in the previous exercise)*

In [150]:
# Use indexing to extract the values ('b' and 'c') needed for the calculation of Pe:
b = confusion_matrix[0, 1]  # Positive-Negative
c = confusion_matrix[1, 0]  # Negative-Positive

# Calculate Pe using the formula:
P_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (N ** 2)

# Print the result:
print('Expected Agreement =', P_e)

Expected Agreement = 0.51
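
The formula for Pe can also be read as "for each category, multiply the proportion of items Annotator A put in that category by the proportion Annotator B put in it, then add the results". A minimal sketch of that reading, using row and column totals, is shown here. The names `totals_a`, `totals_b`, and `P_e_marginals` are illustrative and not part of the exercise.

```python
import numpy as np

confusion_matrix = np.array([[50, 10],
                             [5, 35]])
N = confusion_matrix.sum()

# Row sums: how often Annotator A chose each category (Positive, Negative).
totals_a = confusion_matrix.sum(axis=1)   # [60, 40]
# Column sums: how often Annotator B chose each category.
totals_b = confusion_matrix.sum(axis=0)   # [55, 45]

# Chance agreement per category is the product of the two annotators' proportions.
chance_per_category = (totals_a / N) * (totals_b / N)
P_e_marginals = chance_per_category.sum()

print('Chance agreement per category =', chance_per_category)
print('Expected Agreement =', P_e_marginals)
```

Up to floating-point rounding, the two per-category terms are 0.33 and 0.18, which correspond to the two fractions in the formula.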


### 3.4. Calculate Cohen's Kappa ($\kappa$)

Now we have the values needed to calculate Cohen's Kappa ($\kappa$). 

Calculate $\kappa$ with its formula:

$\kappa = \frac{P_o - P_e}{1 - P_e}$

In [156]:
# Calculate k using the formula:
k = (P_o - P_e) / (1 - P_e)

# Print the result:
print("Cohen's Kappa =", k)

Cohen's Kappa = 0.6938775510204082
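
The three steps can be collected into one reusable function. The sketch below is one possible way to do this. `cohens_kappa` is a hypothetical helper name, and returning `nan` when Pe equals 1 is an assumption made here to avoid dividing by zero. The optional cross-check uses `cohen_kappa_score` from `sklearn.metrics` and is skipped if scikit-learn is not installed. The label lists in that part are hypothetical and only reproduce the example counts.

```python
import numpy as np

def cohens_kappa(matrix):
    """Cohen's kappa for a square agreement matrix (rows: Annotator A, columns: Annotator B)."""
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.sum()
    # Observed agreement: share of items on the main diagonal.
    p_o = np.trace(matrix) / n
    # Expected agreement: sum over categories of (row total * column total) / n^2.
    p_e = np.sum(matrix.sum(axis=1) * matrix.sum(axis=0)) / n ** 2
    if p_e == 1:
        # Both annotators used a single category throughout; kappa is undefined.
        return float('nan')
    return (p_o - p_e) / (1 - p_e)

confusion_matrix = np.array([[50, 10],
                             [5, 35]])
print("Cohen's Kappa =", cohens_kappa(confusion_matrix))

# Optional cross-check against scikit-learn, skipped if it is not installed.
try:
    from sklearn.metrics import cohen_kappa_score
except ImportError:
    cohen_kappa_score = None

if cohen_kappa_score is not None:
    # Hypothetical per-line labels reproducing the example counts (1 = Positive, 0 = Negative).
    labels_a = [1] * 60 + [0] * 40
    labels_b = [1] * 50 + [0] * 10 + [1] * 5 + [0] * 35
    print("scikit-learn =", cohen_kappa_score(labels_a, labels_b))
```

Both routes should agree up to floating-point rounding, which is a quick way to check the manual calculation.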


### 3.5. Interpreting Cohen's Kappa

After calculating Cohen's Kappa ($\kappa$), you have a numerical value between -1 and 1 that indicates the level of agreement between the two annotators. However, we would also like to know what this value actually *means*.

Consulting [this tutorial](https://datatab.net/tutorial/cohens-kappa), write a few sentences below interpreting the $\kappa$ value you just computed:

    (Assuming you calculated Cohen's Kappa and obtained a value of 0.69)

    The Cohen's Kappa of 0.69 falls within Landis & Koch's (1977) range of 0.61-0.80, indicating substantial agreement between the two annotators. This suggests that the annotators are generally consistent in their classifications, but there may still be some discrepancies that could be addressed.
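
The lookup against the Landis & Koch (1977) bands can also be written as a small helper. This is only a sketch. `interpret_kappa` is a hypothetical function name, and the bands are the commonly cited ones (below 0 poor, up to 0.20 slight, up to 0.40 fair, up to 0.60 moderate, up to 0.80 substantial, up to 1.00 almost perfect). Treat these labels as a convention, not as fixed thresholds.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the Landis & Koch (1977) agreement label."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")

print(interpret_kappa(0.6938775510204082))
```

For the value computed in this exercise, the helper returns the same label as the written interpretation.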

### 3.6. Reflection

We now know that Cohen's Kappa is a statistical measure used to assess inter-annotator reliability in classification tasks. High agreement is not critical for every application, such as our example of analysing profanity sentiment in music lyrics, but there are many fields where consistent classification is crucial, and it never hurts to report the measure in your research.

Can you identify any fields/domains where achieving a high Cohen's Kappa is particularly important?

    Consistent classification is important in any field that requires subjective judgement. Some fields where it is particularly important are medical diagnosis and psychological assessments. Inconsistent diagnoses could lead to misdiagnosis, delayed treatment, or missed critical conditions, potentially putting patients' lives at risk.