<center>    
    <h1 id='spacy-notebook-10' style='color:#7159c1; font-size:350%'>Evaluations</h1>
    <i style='font-size:125%'>Exploring Evaluation Strategies and Kappa Scores</i>
</center>

> **Topics**

```
- 🪙 Gold Standard
- 🎯 Accuracy
- 🥅 Cohen's Kappa Score
- 🥅 Fleiss's Kappa Score
- 🥅 Weighted Cohen's Kappa Score
```

<h1 id='0-gold-standard' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🪙 | Gold Standard</h1>

Suppose that we created a Language Model (LM) that predicts the sentiment related to anime comments as `positive`, `neutral` and `negative`.

In order to evaluate how good the predictions were, we must compare them to a dataset containing the correct sentiments related to each comment. This very dataset is called `Gold Standard`, so:

```txt
Validation or Evaluation Dataset == Gold Standard
```

For now, let's suppose our `Gold Standard` and model's predictions are the following:

In [1]:
gold_standard = [
    'positive', 'neutral', 'negative', 'positive', 'neutral'
    , 'positive', 'negative', 'neutral', 'positive', 'negative'
]

model_predictions = [
    'positive', 'neutral', 'negative', 'neutral', 'neutral'
    , 'positive', 'positive', 'neutral', 'positive', 'neutral'
]

<h1 id='1-accuracy' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🎯 | Accuracy</h1>

Taking the Gold Standard and the model's predictions into consideration, the first thing we could think of is to calculate the `Accuracy` between them, that is, the fraction of predictions that matched the correct sentiment of the anime comments. This metric is also known as `Agreement Degree`.

In [2]:
# 7 out of 10 comments have been correctly classified!!
agreement_degree = 7 / 10
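
Instead of counting the matches by eye, the same number can be obtained from the two lists. This is just a minimal sketch that repeats the lists from the cells above so it runs on its own:

```python
# Same lists as in the cells above
gold_standard = [
    'positive', 'neutral', 'negative', 'positive', 'neutral'
    , 'positive', 'negative', 'neutral', 'positive', 'negative'
]

model_predictions = [
    'positive', 'neutral', 'negative', 'neutral', 'neutral'
    , 'positive', 'positive', 'neutral', 'positive', 'neutral'
]

# Count the positions where the prediction equals the gold label
matches = sum(
    gold == predicted
    for gold, predicted in zip(gold_standard, model_predictions)
)

agreement_degree = matches / len(gold_standard)
print(f'- Matches: {matches} | Accuracy: {agreement_degree}')
```

The three misses are the comments at indexes 3, 6 and 9.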

<h1 id='2-cohens-kappa-score' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🥅 | Cohen's Kappa Score</h1>

Even though Accuracy (Agreement Degree) tells us how many predictions the model got right, relying only on it can lead us to terrible conclusions, since it totally ignores the probability of the model getting predictions right `by chance`, like rolling a die to pick 'positive', 'neutral' or 'negative', which inflates the Agreement percentage.

In order to avoid it, we can use `Cohen's Kappa` score, which can be used when:

- we are evaluating `two raters`, e.g. Gold Standard and model's predictions;

- the categories (prediction results) are `nominal`, that is, they are not ordinal, so `there isn't hierarchy between them`.

Its equation is given by:

```python
cohens_kappa = (observed_probability - expected_probability) / (1 - expected_probability)
```

$$
\text{Cohen's Kappa} = \frac{(\text{observed probability} - \text{expected probability})}{(1 - \text{expected probability})}
$$

where:

- **observed probability** - `probability that the two raters agree, as observed in the data`;

- **expected probability** - `probability that the two raters agree purely by chance`.

---

The `Observed Probability (Po)` is literally the `Accuracy (Agreement Degree)`, whereas the `Expected Probability (Pe)` is the probability that the Gold Standard and the model's predictions land on the same category/sentiment by chance: for each category, we multiply the share of the Gold Standard and the share of the predictions that fall in it, and then sum the products, so:

```python
observed_probability = correct_predictions / total_predictions
expected_probability = sum(prod(frequency_category / total_predictions))
```

$$
\text{Observed Probability} = \frac{\text{correctPredictions}}{\text{totalPredictions}}
$$

$$
\text{Expected Probability} = \sum_{i}{(\prod_{j}({\frac{\text{frequencyCategory}_{i,j}}{\text{totalPredictions}})})}
$$

In [3]:
# Observed Probability
observed_probability = 7 / 10
print(f'- Observed Probability (Po): {observed_probability}')

- Observed Probability (Po): 0.7


In [4]:
# Expected Probability
#
#  gold_standard_sentiment_probability * model_prediction_sentiment_probability
#
positive_probability = (4 / 10) * (4 / 10)
neutral_probability = (3 / 10) * (5 / 10)
negative_probability = (3 / 10) * (1 / 10)

expected_probability = positive_probability + neutral_probability + negative_probability
print(f'- Expected Probability (Pe): {expected_probability}')

- Expected Probability (Pe): 0.3400000000000001
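
The fractions used above (4/10, 3/10, 5/10 and so on) are just the category counts of each rater divided by the number of comments. As a rough sketch, assuming the same two lists, they can be derived with `collections.Counter` instead of being typed by hand:

```python
from collections import Counter

gold_standard = [
    'positive', 'neutral', 'negative', 'positive', 'neutral'
    , 'positive', 'negative', 'neutral', 'positive', 'negative'
]

model_predictions = [
    'positive', 'neutral', 'negative', 'neutral', 'neutral'
    , 'positive', 'positive', 'neutral', 'positive', 'neutral'
]

total = len(gold_standard)
gold_counts = Counter(gold_standard)            # how often each sentiment appears in the Gold Standard
prediction_counts = Counter(model_predictions)  # how often the model predicted each sentiment

# Every category used by at least one of the two raters
categories = sorted(set(gold_standard) | set(model_predictions))

# Pe: sum over categories of (gold share * prediction share)
expected_probability = sum(
    (gold_counts[category] / total) * (prediction_counts[category] / total)
    for category in categories
)
print(f'- Expected Probability (Pe): {expected_probability:.2f}')
```

A `Counter` returns 0 for a category that a rater never used, so that category simply contributes nothing to the sum.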


With both Observed and Expected Probabilities, we can finally calculate `Cohen's Kappa` score. To do it, we just need to subtract the expected (chance) probability from the observed one and then divide the result by the complement of the expected probability, which scales the score so that +1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.

```python
cohens_kappa = (observed_probability - expected_probability) / (1 - expected_probability)
```

$$
\text{Cohen's Kappa} = \frac{(\text{observed probability} - \text{expected probability})}{(1 - \text{expected probability})}
$$

In [5]:
# Cohen's Kappa Score
cohens_kappa_score = (observed_probability - expected_probability) / (1 - expected_probability)
print(f'- Cohen\'s Kappa Score: {cohens_kappa_score}')

- Cohen's Kappa Score: 0.5454545454545453


There's also a table proposed by Landis and Koch to interpret Cohen's Kappa score. Since the divisions were chosen `arbitrarily`, we must take it with a pinch of salt!!

<table>
    <tr>
        <th>Score</th>
        <th>Strength of Agreement</th>
    </tr>
    <tr>
        <td>&lt; 0.00</td>
        <td>Poor</td>
    </tr>
    <tr>
        <td>0.00 to 0.20</td>
        <td>Slight</td>
    </tr>
    <tr>
        <td>0.21 to 0.40</td>
        <td>Fair</td>
    </tr>
    <tr>
        <td>0.41 to 0.60</td>
        <td>Moderate</td>
    </tr>
    <tr>
        <td>0.61 to 0.80</td>
        <td>Substantial</td>
    </tr>
    <tr>
        <td>0.81 to 1.00</td>
        <td>Almost Perfect</td>
    </tr>
</table>

Taking our example into consideration, since Cohen's Kappa score is approximately 0.55, we can tell that the agreement between the model's predictions and the Gold Standard is `Moderate`!!
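
The lookup in the table can be sketched as a small helper. `interpret_kappa` is a hypothetical name made up for this example, not part of any library, and the top band uses Landis and Koch's own label, 'Almost Perfect':

```python
def interpret_kappa(score):
    """Map a kappa score to the Landis and Koch label (hypothetical helper)."""
    if score < 0:
        return 'Poor'

    # (upper bound of the band, label), checked from the lowest band up
    bands = [
        (0.20, 'Slight')
        , (0.40, 'Fair')
        , (0.60, 'Moderate')
        , (0.80, 'Substantial')
    ]

    for upper_bound, label in bands:
        if score <= upper_bound:
            return label

    return 'Almost Perfect'


print(f'- Strength of Agreement: {interpret_kappa(0.5454545454545453)}')
```

Since the bands themselves are arbitrary, a helper like this is only a convenience for reports, not a statistical test.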

---

Of course, in a real-world project we won't calculate Cohen's Kappa score by hand when we can take advantage of a great Python package to do the job for us.

We only calculated it by hand in order to get a glimpse of how the algorithm works.

Now let's do it again, but using `sklearn`!!

In [6]:
# Calculating Cohen's Kappa Score with SKLearn
#
#  OBS.: the result may differ in the last decimals due to floating-point rounding
#
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score_sklearn = cohen_kappa_score(gold_standard, model_predictions)
print(f'- Cohen\'s Kappa Score from SKLearn: {cohen_kappa_score_sklearn}')

- Cohen's Kappa Score from SKLearn: 0.5454545454545454
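
To see why the chance correction matters, here is a small made-up scenario (the lists below are invented for illustration and are not part of the dataset above): nine out of ten comments are positive and a lazy model always answers 'positive'. The sketch wraps the hand calculation from this section in a hypothetical `cohens_kappa` helper written in plain Python:

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two equally long label lists (hypothetical helper)."""
    total = len(rater_a)

    # Po: share of items where both raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / total

    # Pe: sum over categories of the product of both raters' shares
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[category] / total) * (counts_b[category] / total)
        for category in set(rater_a) | set(rater_b)
    )

    # NOTE: undefined (division by zero) when both raters always use one single category
    return (observed - expected) / (1 - expected)


# Invented, imbalanced example: the model never predicts 'negative'
imbalanced_gold = ['positive'] * 9 + ['negative']
always_positive = ['positive'] * 10

accuracy = sum(g == p for g, p in zip(imbalanced_gold, always_positive)) / len(imbalanced_gold)
kappa = cohens_kappa(imbalanced_gold, always_positive)

print(f'- Accuracy: {accuracy}')
print(f'- Cohen\'s Kappa: {kappa}')
```

The Accuracy is 0.9, which looks great, but the Kappa is 0.0: the model agrees with the Gold Standard exactly as often as chance alone would predict.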


<h1 id='2-fleiss-kappa-score' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🥅 | Fleiss's Kappa Score</h1>

Cohen's Kappa score is a good metric when we are evaluating only two raters, e.g. Gold Standard and a model's predictions, but when it comes to three or more raters, we need some adjustments in order to get accurate results. So, when we are evaluating `more than two raters`, we must stick to `Fleiss's Kappa` score, which can be used when:

- we are evaluating `three or more raters`, e.g. Gold Standard, model 1's predictions and model 2's predictions;

- the categories are `nominal`, that is, they are not ordinal and `there isn't hierarchy between them`.

For the calculations, the main equation is the same:

```python
fleiss_kappa = (observed_probability - expected_probability) / (1 - expected_probability)
```

$$
\text{Fleiss's Kappa} = \frac{(\text{observed probability} - \text{expected probability})}{(1 - \text{expected probability})}
$$

The only differences are how the `Observed Probability (Po)` and the `Expected Probability (Pe)` are calculated!!

So, let's consider the same Gold Standard and model's predictions from the previous example, but adding a new list of predictions.

In [7]:
gold_standard = [
  'positive', 'neutral', 'negative', 'positive', 'neutral'
  , 'positive', 'negative', 'neutral', 'positive', 'negative'
]

model_1_predictions = [
  'positive', 'neutral', 'negative', 'neutral', 'neutral'
  , 'positive', 'positive', 'neutral', 'positive', 'neutral'
]

model_2_predictions = [
  'neutral', 'neutral', 'negative', 'positive', 'positive'
  , 'positive', 'negative', 'neutral', 'neutral', 'negative'
]

---

Let's calculate the `Expected Probability (Pe)` first. To do so, we need to create a table where the index is each comment, the columns are each rater and the values are the sentiment assigned by the rater.

In [8]:
# Expected Probability
import pandas as pd

evaluation_table = pd.DataFrame(columns=['gold_standard', 'model_1_predictions', 'model_2_predictions'])
evaluation_table['gold_standard'] = gold_standard
evaluation_table['model_1_predictions'] = model_1_predictions
evaluation_table['model_2_predictions'] = model_2_predictions
evaluation_table

Unnamed: 0,gold_standard,model_1_predictions,model_2_predictions
0,positive,positive,neutral
1,neutral,neutral,neutral
2,negative,negative,negative
3,positive,neutral,positive
4,neutral,neutral,positive
5,positive,positive,positive
6,negative,positive,negative
7,neutral,neutral,neutral
8,positive,positive,neutral
9,negative,neutral,negative


After this, we create a second table where the index is each comment, the columns are each possible sentiment and the rows are the frequencies of the assigned sentiments to the comment.

Then, we calculate the probability of each possible category.

In [9]:
# Expected Probability
frequency_table = evaluation_table.apply(
    pd.Series.value_counts, axis=1
)               \
  .fillna(0)    \
  .astype(int)

frequency_table

Unnamed: 0,negative,neutral,positive
0,0,1,2
1,0,3,0
2,3,0,0
3,0,1,2
4,0,2,1
5,0,0,3
6,2,0,1
7,0,3,0
8,0,1,2
9,2,1,0


In [10]:
total_sum = frequency_table.sum().sum()
positive_probability = frequency_table['positive'].sum() / total_sum
neutral_probability = frequency_table['neutral'].sum() / total_sum
negative_probability = frequency_table['negative'].sum() / total_sum

Finally, we can get the `Expected Probability (Pe)` by taking the ratio of each sentiment and summing their squares:

```python
expected_probability = sum(category_probability**2)
```

$$
\text{Expected Probability} = \sum_{i}{(\text{categoryProbability}_{i}^2)}
$$

In [11]:
# Pe for Fleiss's Kappa: sum of the squared category proportions
expected_probability = (positive_probability**2) + (neutral_probability**2) + (negative_probability**2)
print(f'- Expected Probability: {expected_probability:.4f}')

- Expected Probability: 0.3489


---

Now, to calculate the `Observed Probability (Po)`, we must solve the following equation:

```python
observed_probability = (1 / (N * n * (n - 1))) * (sum(sum(nij**2)) - (N * n))
```

$$
\text{Observed Probability} = (\frac{1}{N \cdot n \cdot (n - 1)}) \cdot ((\sum_{i}^{N}{\sum_{j}^{k}{n_{ij}^2}}) - (N \cdot n))
$$

where:

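- **N** - `number of items (comments)`;

- **n** - `number of raters`;

- **k** - `number of categories`;

- **i** - `item (row) index`;

- **j** - `category (column) index`;

- **n_ij** - `number of raters that assigned the category j to the item i`.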

Don't worry about the equation being big, let's just split it up into three small pieces, solve each one, and then combine them as `Piece 1 * (Piece 2 - Piece 3)`:

$$
\text{Piece 1} = (\frac{1}{N \cdot n \cdot (n - 1)})
$$

$$
\text{Piece 2} = \sum_{i}^{N}{\sum_{j}^{k}{n_{ij}^2}}
$$

$$
\text{Piece 3} = (N \cdot n)
$$

In [12]:
# Observed Probability
number_of_items = frequency_table.shape[0]
number_of_raters = evaluation_table.shape[1]  # one column per rater (frequency_table has one per category)

piece_1 = 1 / (number_of_items * number_of_raters * (number_of_raters - 1))
piece_2 = frequency_table.apply(lambda frequency: sum(frequency**2)).sum()
piece_3 = number_of_items * number_of_raters

observed_probability = piece_1 * (piece_2 - piece_3)
print(f'- Observed Probability: {observed_probability}')

- Observed Probability: 0.6
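
Another way to read the same equation: `Po` is the average, over all comments, of the share of rater pairs that agree on that comment. The sketch below assumes the rows of the frequency table shown above (typed by hand so it runs without pandas) and computes that per-comment agreement explicitly:

```python
# Rows of the frequency table above: (negative, neutral, positive) counts per comment
frequency_rows = [
    (0, 1, 2), (0, 3, 0), (3, 0, 0), (0, 1, 2), (0, 2, 1)
    , (0, 0, 3), (2, 0, 1), (0, 3, 0), (0, 1, 2), (2, 1, 0)
]
number_of_raters = 3

# Agreement of each comment: (sum of squared counts - n) / (n * (n - 1))
per_comment_agreement = [
    (sum(count**2 for count in row) - number_of_raters)
    / (number_of_raters * (number_of_raters - 1))
    for row in frequency_rows
]

# Po is the mean of the per-comment agreements
observed_probability = sum(per_comment_agreement) / len(per_comment_agreement)

print(f'- Agreement per comment: {[round(value, 2) for value in per_comment_agreement]}')
print(f'- Observed Probability: {observed_probability:.2f}')
```

A comment where all three raters agree scores 1.0, and a comment where only two of them agree scores about 0.33, since one of the three possible rater pairs matches.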


---

Now, to calculate `Fleiss's Kappa` score, we use the same equation from Cohen's Kappa score:

```python
fleiss_kappa = (observed_probability - expected_probability) / (1 - expected_probability)
```

$$
\text{Fleiss's Kappa} = \frac{\text{observedProbability} - \text{expectedProbability}}{1 - \text{expectedProbability}}
$$

So, notice that the equation is the same, and the only differences are the ways to calculate both observed and expected probabilities!!

In [13]:
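# Fleiss's Kappa Score (Pe is the sum of the squared category proportions)
expected_probability = positive_probability**2 + neutral_probability**2 + negative_probability**2
fleiss_kappa_score = (observed_probability - expected_probability) / (1 - expected_probability)
print(f'- Fleiss\'s Kappa Score: {fleiss_kappa_score:.4f}')

- Fleiss's Kappa Score: 0.3857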


Taking our example into consideration, since Fleiss's Kappa score is approximately 0.39, we can tell that the agreement among the three raters (Gold Standard, model 1 and model 2) is `Fair`!!

Now let's calculate the score the easy way, using `statsmodels`!!

In [14]:
# Calculating Fleiss's Kappa Score with statsmodels
#
#  OBS.: the result may differ in the last decimals due to floating-point rounding
#
from statsmodels.stats.inter_rater import fleiss_kappa
fleiss_kappa_score = fleiss_kappa(frequency_table, method='fleiss')
print(f'- Fleiss\'s Kappa Score: {fleiss_kappa_score}')

- Fleiss's Kappa Score: 0.3856655290102387
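
For reference, the whole hand calculation of this section fits in one plain-Python function. `fleiss_kappa_from_ratings` is a hypothetical helper written for this notebook, not a statsmodels function, and it assumes every rater labelled every comment:

```python
from collections import Counter


def fleiss_kappa_from_ratings(*raters):
    """Fleiss's Kappa for equally long label lists, one list per rater (hypothetical helper)."""
    number_of_raters = len(raters)
    number_of_items = len(raters[0])
    total_ratings = number_of_items * number_of_raters
    categories = sorted(set().union(*raters))

    # One Counter per comment: how many raters chose each category
    item_counts = [Counter(labels) for labels in zip(*raters)]

    # Pe: sum of the squared overall category proportions
    expected = sum(
        (sum(counts[category] for counts in item_counts) / total_ratings) ** 2
        for category in categories
    )

    # Po: (sum of squared counts - N * n) / (N * n * (n - 1))
    squared_counts = sum(
        count**2 for counts in item_counts for count in counts.values()
    )
    observed = (squared_counts - total_ratings) / (total_ratings * (number_of_raters - 1))

    return (observed - expected) / (1 - expected)


gold_standard = [
    'positive', 'neutral', 'negative', 'positive', 'neutral'
    , 'positive', 'negative', 'neutral', 'positive', 'negative'
]
model_1_predictions = [
    'positive', 'neutral', 'negative', 'neutral', 'neutral'
    , 'positive', 'positive', 'neutral', 'positive', 'neutral'
]
model_2_predictions = [
    'neutral', 'neutral', 'negative', 'positive', 'positive'
    , 'positive', 'negative', 'neutral', 'neutral', 'negative'
]

score = fleiss_kappa_from_ratings(gold_standard, model_1_predictions, model_2_predictions)
print(f'- Fleiss\'s Kappa Score: {score:.4f}')
```

For these three lists the result agrees with the statsmodels score above, which is a handy sanity check for the manual steps.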


<h1 id='4-weighted-cohens-kappa-score' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🥅 | Weighted Cohen's Kappa Score</h1>

Sometimes the categories have a natural order, so some disagreements are worse than others: mistaking 'positive' for 'negative' is a bigger miss than mistaking it for 'neutral'. To express that, we assign weights to the disagreements and apply `Weighted Cohen's Kappa` score, especially when:

- we are evaluating `two raters`, e.g. Gold Standard and model's predictions;

- the categories are `ordinal`, that is, they are not nominal and `there's hierarchy between them`.

For instance, taking our anime comments sentiment analysis example into consideration, we can rank 'positive' as the highest sentiment, 'neutral' as the second, and 'negative' as the last one, then:

$$
\text{positive} >> \text{neutral} >> \text{negative}
$$

And, to calculate the `Weighted Cohen's Kappa` score, we simply calculate the complement of the sum of the weights multiplied by the observed frequencies or probabilities, divided by the sum of the same weights multiplied by the expected frequencies or probabilities:

```python
weighted_cohens_kappa = 1 - (sum(weights * observed_probability)) / (sum(weights * expected_probability))
```

$$
\text{Weighted Cohen's Kappa} = 1 - \frac{\sum_{i}\sum_{j}({\text{weight}_{ij} \cdot \text{observedProbability}_{ij}})}{\sum_{i}\sum_{j}({\text{weight}_{ij} \cdot \text{expectedProbability}_{ij}})}
$$

In [37]:
gold_standard = [
	'positive', 'neutral', 'negative', 'positive', 'neutral'
	, 'positive', 'negative', 'neutral', 'positive', 'negative'
]

model_predictions = [
	'positive', 'neutral', 'negative', 'neutral', 'neutral'
	, 'positive', 'positive', 'neutral', 'positive', 'neutral'
]

---

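First off, we have to calculate a `Confusion Matrix` between the Gold Standard and the model's predictions in order to get the `Observed Probability (Po)`. The rows are the Gold Standard labels and the columns are the predicted labels, both sorted alphabetically (negative, neutral, positive).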

In [38]:
# Observed Probability
from sklearn.metrics import confusion_matrix

observed_frequency = confusion_matrix(gold_standard, model_predictions)
total_frequency = observed_frequency.sum()
observed_probability = observed_frequency / total_frequency
observed_probability

array([[0.1, 0.1, 0.1],
       [0. , 0.3, 0. ],
       [0. , 0.1, 0.3]])

---

For the `Expected Probability (Pe)`, we calculate the sum of each row and each column of the observed matrix and, for every cell, multiply the sum of its row by the sum of its column (an outer product).

In [39]:
# Expected Probability
import numpy as np

rows_sum = observed_probability.sum(axis=1)
columns_sum = observed_probability.sum(axis=0)

expected_probability = np.outer(rows_sum, columns_sum)
expected_frequency = expected_probability * total_frequency
expected_probability

array([[0.03, 0.15, 0.12],
       [0.03, 0.15, 0.12],
       [0.04, 0.2 , 0.16]])

---

Now we can start calculating the `weights`. Each weight is the penalty for a disagreement, and it depends on how far apart the two categories are. There are two types of weights, `Linear` and `Quadratic`: the first one makes the penalty grow `proportionally` to the distance between the categories, while the second one makes it grow with the `square` of the distance, so near misses get `small penalties` and far misses get `big penalties`. Then:

- **Linear Weight** - `same weight gap between neighboring categories, that is, each extra step of distance adds the same penalty`;

- **Quadratic Weight** - `different weight gap between the categories, that is, each extra step of distance adds a bigger penalty than the previous one`.

Starting off with `Linear Weight`, its equation is given by the absolute value of the difference between the row and column indexes divided by the total number of categories minus 1:

```python
linear_weight = abs(row_index - column_index) / (total_categories - 1)
```

$$
\text{Linear Weight} = \frac{ \lvert \text{rowIndex} - \text{columnIndex} \rvert }{\text{totalCategories} - 1}
$$

Whereas `Quadratic Weight` is given by the square of the difference between the row and column indexes divided by the square of the total number of categories minus 1:

```python
quadratic_weight = ((row_index - column_index)**2) / ((total_categories - 1)**2)
```

$$
\text{Quadratic Weight} = \frac{(\text{rowIndex} - \text{columnIndex})^2}{(\text{totalCategories} - 1)^2}
$$

Let's calculate both weights!!

In [40]:
# Linear Weight and Quadratic Weight
rows_index = np.array([0, 1, 2])
columns_index = np.array([0, 1, 2])
normalizer_factor = rows_index.shape[0] - 1

linear_weight = np.abs(rows_index[:, None] - columns_index[None, :]) / normalizer_factor
quadratic_weight = ((rows_index[:, None] - columns_index[None, :])**2) / normalizer_factor**2

print(f'- Linear Weight: {linear_weight}')
print('---')
print(f'- Quadratic Weight: {quadratic_weight}')

- Linear Weight: [[0.  0.5 1. ]
 [0.5 0.  0.5]
 [1.  0.5 0. ]]
---
- Quadratic Weight: [[0.   0.25 1.  ]
 [0.25 0.   0.25]
 [1.   0.25 0.  ]]
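
The same two equations work for any number of ordered categories. As a minimal sketch in plain Python (the `disagreement_weights` name is made up for this example), the matrices can be built like this:

```python
def disagreement_weights(number_of_categories, kind='linear'):
    """Build the linear or quadratic weight matrix as nested lists (hypothetical helper)."""
    power = {'linear': 1, 'quadratic': 2}[kind]
    normalizer = (number_of_categories - 1) ** power

    # Cell (row, column) holds the penalty for confusing category `row` with category `column`
    return [
        [
            abs(row - column) ** power / normalizer
            for column in range(number_of_categories)
        ]
        for row in range(number_of_categories)
    ]


# Three categories reproduce the matrices printed above
print(disagreement_weights(3, kind='linear'))
print(disagreement_weights(3, kind='quadratic'))

# With five categories, a one-step miss costs 0.25 (linear) but only 0.0625 (quadratic)
print(disagreement_weights(5, kind='linear')[0][1])
print(disagreement_weights(5, kind='quadratic')[0][1])
```

In both cases the diagonal is 0 (agreement is never penalized) and the largest possible miss always costs 1.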


---

Then we can finally calculate the `Weighted Cohen's Kappa` score using the following equation, where the same weight matrix is applied to both the observed and the expected values:

```python
weighted_cohens_kappa = 1 - (sum(weights * observed_probability)) / (sum(weights * expected_probability))
```

$$
\text{Weighted Cohen's Kappa} = 1 - \frac{\sum_{i}\sum_{j}({\text{weight}_{ij} \cdot \text{observedProbability}_{ij}})}{\sum_{i}\sum_{j}({\text{weight}_{ij} \cdot \text{expectedProbability}_{ij}})}
$$

In [41]:
# Linear Weight
linear_observed_frequency = (linear_weight * observed_frequency).sum()
linear_expected_frequency = (linear_weight * expected_frequency).sum()
linear_weighted_cohens_kappa = 1 - (linear_observed_frequency / linear_expected_frequency)
print(f'- Linear Weighted Cohen\'s Kappa: {linear_weighted_cohens_kappa}')

- Linear Weighted Cohen's Kappa: 0.5121951219512195


In [42]:
# Quadratic Weight
quadratic_observed_frequency = (quadratic_weight * observed_frequency).sum()
quadratic_expected_frequency = (quadratic_weight * expected_frequency).sum()
quadratic_weighted_cohens_kappa = 1 - (quadratic_observed_frequency / quadratic_expected_frequency)
print(f'- Quadratic Weighted Cohen\'s Kappa: {quadratic_weighted_cohens_kappa}')

- Quadratic Weighted Cohen's Kappa: 0.47368421052631593


Taking our example into consideration, both Linear and Quadratic Weighted Cohen's Kappa scores indicate a `Moderate` agreement between the model's predictions and the Gold Standard.

Now let's calculate the score the easy way, using sklearn!!

In [43]:
# Calculating Weighted Cohen's Kappa Score with SKLearn
#
#  OBS.: the result may differ in the last decimals due to floating-point rounding
#
from sklearn.metrics import cohen_kappa_score

linear_weighted_cohens_kappa = cohen_kappa_score(
    gold_standard
    , model_predictions
    , weights='linear'
)

quadratic_weighted_cohens_kappa = cohen_kappa_score(
    gold_standard
    , model_predictions
    , weights='quadratic'
)

print(f'- Linear Weighted Cohen\'s Kappa: {linear_weighted_cohens_kappa}')
print(f'- Quadratic Weighted Cohen\'s Kappa: {quadratic_weighted_cohens_kappa}')

- Linear Weighted Cohen's Kappa: 0.5121951219512195
- Quadratic Weighted Cohen's Kappa: 0.4736842105263158


---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/)