In [1]:
from datasets import load_dataset
import pandas as pd
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
from seqeval.metrics import recall_score
from seqeval.metrics import precision_score
from seqeval.scheme import IOB2

In [2]:
greek_legal_ner = load_dataset("joelito/lextreme", "greek_legal_ner")
lener_br = load_dataset("joelito/lextreme","lener_br")


# NER evaluation

## Problem statement

- There are many ways to evaluate NER results, and they do not always yield the same scores
    
- There are two major problems:
    - 1. We need to decide which method we are going to use. 
    - 2. Many papers do not state how the scores were calculated. In those cases, we need to read the code to find out how it was done. Other papers explain it very well, for example:
        - Angelidis, I., Chalkidis, I., & Koubarakis, M. (2018). Named Entity Recognition, Linking and Generation for Greek Legislation. JURIX.
          - *"MUC guidelines consider partial token overlaps between the gold annotations and the predicted entities (sequences of consecutive tokens that belong in the same class), given the correct (gold) class."*
        

### 1. How we did it

#### 1.1. Tools

- We used the library seqeval: https://github.com/chakki-works/seqeval
- The documentation states that seqeval supports the following schemes:
    - IOB1
    - IOB2
    - IOE1
    - IOE2
    - IOBES (only in strict mode)
    - BILOU (only in strict mode)
- Usually, when you annotate NER tags, you specify where a named entity starts and ends. This is usually done with a prefix on the label, such as B- (beginning) or I- (inside); tokens outside any entity are tagged O.
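
To illustrate how such prefixes encode entity boundaries, here is a minimal pure-Python sketch that reads spans from an IOB2-tagged sentence. The function name `iob2_spans` is made up for this notebook and this is not seqeval's implementation; as a simplifying assumption, an I- tag that does not continue an open entity is ignored.

```python
def iob2_spans(tags):
    """Return (type, start, end) spans from IOB2 tags; end is inclusive."""
    spans = []
    start, current = None, None
    # The extra "O" acts as a sentinel that closes an entity at the end.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, label = tag.partition("-")
        # Close the open entity unless this token continues it with I-<same type>.
        if current is not None and (prefix != "I" or label != current):
            spans.append((current, start, i - 1))
            start, current = None, None
        # Only B- opens a new entity in this sketch.
        if prefix == "B":
            start, current = i, label
    return spans


print(iob2_spans(["B-PER", "I-PER", "O", "B-LOC"]))
# [('PER', 0, 1), ('LOC', 3, 3)]
```

Because every entity starts with B-, two adjacent entities of the same type (`B-PER`, `B-PER`) are still read as two separate spans.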


#### 1.2. Problem

- The LEXTREME dataset contains many NER datasets
- But they do not all use the same tagging scheme


- All NER datasets that we created follow a very simple scheme that is not supported by seqeval: the label column only gives the ID of the entity class, without marking where an entity begins or ends
- So we have these label classes:

"label_classes": [
        "FACILITY",
        "GPE",
        "LEG-REFS",
        "LOCATION-NAT",
        "LOCATION-UNK",
        "ORG",
        "PERSON",
        "PUBLIC-DOCS",
        "O",
    ]
       - NOTE: with this kind of annotation, we must be very sure that two instances of the same entity class never occur directly next to each other, because they could not be told apart
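
One possible way to bring such prefix-free labels into a scheme that seqeval supports is to derive IOB2 tags from runs of identical labels. The sketch below assumes exactly that rule; `to_iob2` is a hypothetical helper, not part of seqeval or LEXTREME. It also shows the limitation from the note above: adjacent entities of the same class collapse into one.

```python
def to_iob2(labels, outside="O"):
    """Turn prefix-free labels into IOB2 tags, one entity per run of equal labels."""
    tags = []
    previous = outside
    for label in labels:
        if label == outside:
            tags.append(outside)
        elif label == previous:
            tags.append(f"I-{label}")  # same class as the previous token: continue
        else:
            tags.append(f"B-{label}")  # class changed: a new entity begins
        previous = label
    return tags


label_classes = ["FACILITY", "GPE", "LEG-REFS", "LOCATION-NAT", "LOCATION-UNK",
                 "ORG", "PERSON", "PUBLIC-DOCS", "O"]
ids = [8, 8, 8, 1, 1]  # an ID sequence from the greek_legal_ner training split
print(to_iob2([label_classes[i] for i in ids]))
# ['O', 'O', 'O', 'B-GPE', 'I-GPE']

# Two adjacent PERSON mentions cannot be separated any more:
print(to_iob2(["PERSON", "PERSON", "PERSON"]))
# ['B-PERSON', 'I-PERSON', 'I-PERSON']
```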

In [3]:
pd.DataFrame(greek_legal_ner['train'])

Unnamed: 0,input,label,language
0,"[ΝΟΜΟΣ, ΥΠ', ΑΡΙΘΜ, ., 4156]","[8, 8, 8, 8, 8]",el
1,"[Μνημόνιο, Κατανόησης, στον, τομέα, της, προστ...","[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...",el
2,"[Ο, ΠΡΟΕΔΡΟΣ, ΤΗΣ, ΕΛΛΗΝΙΚΗΣ, ΔΗΜΟΚΡΑΤΙΑΣ]","[8, 8, 8, 1, 1]",el
3,"[Εκδίδομε, τον, ακόλουθο, νόμο, που, ψήφισε, η...","[8, 8, 8, 8, 8, 8, 8, 5, 8]",el
4,"[Άρθρο, 1]","[8, 8]",el
...,...,...,...
17694,"[Την, έγκριση, της, πρότασης, της, Κοινότητας,...","[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...",el
17695,"[5.000, ., Κατά, της, παρούσης, επιτρέπεται, η...","[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...",el
17696,"[Η, απόφαση, αυτή, να, δημοσιευθεί, στην, Εφημ...","[8, 8, 8, 8, 8, 8, 8, 8, 8, 8]",el
17697,"[Παλλήνη, ,, 1, Δεκεμβρίου, 2004, Ο, Νομάρχης,...","[8, 8, 8, 8, 8, 8, 8, 6]",el


- On the other hand, lener_br uses a scheme that is supported by seqeval
- Its label classes are:

"label_classes": [
        "O",
        "B-ORGANIZACAO",
        "I-ORGANIZACAO",
        "B-PESSOA",
        "I-PESSOA",
        "B-TEMPO",
        "I-TEMPO",
        "B-LOCAL",
        "I-LOCAL",
        "B-LEGISLACAO",
        "I-LEGISLACAO",
        "B-JURISPRUDENCIA",
        "I-JURISPRUDENCIA",
    ]

In [4]:
pd.DataFrame(lener_br['train'])

Unnamed: 0,input,label,language
0,"[EMENTA, :, APELAÇÃO, CÍVEL, -, AÇÃO, DE, INDE...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",pt
1,"[-, O, art, ., 178, ,, II, ,, do, CPC, prescre...","[0, 0, 9, 10, 10, 10, 10, 10, 10, 10, 0, 0, 0,...",pt
2,"[-, Tratando-se, de, ação, indenizatória, ajui...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",pt
3,"[-, Tendo, o, vício, sido, arguido, pelo, Parq...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",pt
4,"[-, Preliminar, acolhida, para, reconhecer, a,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",pt
...,...,...,...
7823,"[Verbas, Rescisórias, .]","[0, 0, 0]",pt
7824,"[Controversas, '', ,, por, ofensa, ao, artigo,...","[0, 0, 0, 0, 0, 0, 9, 10, 10, 10, 0, 0, 0, 0, ...",pt
7825,"[Brasília, ,, 04, de, novembro, de, 2015, .]","[7, 0, 5, 6, 6, 6, 6, 0]",pt
7826,"[Firmado, por, assinatura, digital, (, MP, 2.2...","[0, 0, 0, 0, 0, 9, 10, 0, 3, 4, 4, 4, 0, 0, 0,...",pt


- The scheme affects what counts as a correct or an incorrect prediction

- **Let's take, for example, the scheme used in greek_legal_ner**
- We add the prefix I-, because seqeval would otherwise truncate the tags (it reads the leading character of each tag as the scheme prefix)

- Below, everything is perfectly predicted

In [5]:
y_true = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))

Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1:  1.0


- Next, the last token of the second PER entity is predicted as O instead of PER
- As we can see, accuracy is measured at the token level
- All the other scores are calculated at the entity level
- There are two instances of PER. Since the model missed a single token of the second instance, that instance is not counted as correctly recognized at all. This is a big difference from the scheme used in Angelidis, I., Chalkidis, I., & Koubarakis, M. (2018). Named Entity Recognition, Linking and Generation for Greek Legislation. JURIX.
  - In other words: while the scores for MISC remain perfect, precision, recall and F1 for PER are 0.50, because one of the two PER instances was not matched exactly (even though only one token was predicted incorrectly)
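
To make the difference between token-level and entity-level scoring concrete, here is a minimal pure-Python sketch for a single sentence. It is not seqeval's implementation: the helper names `spans` and `scores` are made up, and the span rule (an entity starts at B-X, or at I-X after O or after a different type) is an assumption meant to mimic the lenient behaviour seen in this notebook.

```python
def spans(tags):
    """Lenient span extraction; returns a set of (type, start, end) tuples."""
    found, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            found.add((current, start, i - 1))
            start, current = None, None
        if prefix in ("B", "I") and current is None:
            start, current = i, label
    return found


def scores(y_true, y_pred):
    """Token-level accuracy plus exact-match entity-level precision, recall, F1."""
    gold, pred = spans(y_true), spans(y_pred)
    tp = len(gold & pred)  # an entity counts only if type and both boundaries match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, precision, recall, f1


y_true = ['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'I-PER']
y_pred = ['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']
print([round(value, 2) for value in scores(y_true, y_pred)])
# [0.9, 0.67, 0.67, 0.67]
```

Nine of ten tokens are correct, but only two of the three gold entities are matched exactly, and the shortened PER span counts as a wrong prediction as well.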

In [6]:
y_true = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.9
Precision:  0.6666666666666666
Recall:  0.6666666666666666
F1:  0.6666666666666666
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.50      0.50      0.50         2

   micro avg       0.67      0.67      0.67         3
   macro avg       0.75      0.75      0.75         3
weighted avg       0.67      0.67      0.67         3



- It does not matter if a further token of the same NE instance is wrong; only the accuracy score changes

In [7]:
y_true = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','O','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))




Accuracy:  0.8
Precision:  0.6666666666666666
Recall:  0.6666666666666666
F1:  0.6666666666666666
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.50      0.50      0.50         2

   micro avg       0.67      0.67      0.67         3
   macro avg       0.75      0.75      0.75         3
weighted avg       0.67      0.67      0.67         3



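- Exception: if the NE is missed entirely (i.e. every token is predicted as O), no entity is predicted at that position. Recall stays the same, but precision increases again, which yields a better F1 score
  - **This is counterintuitive**: a completely missed entity is scored better than a partially detected one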
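
The effect can be followed by counting true positives (TP), false positives (FP) and false negatives (FN) at the entity level, assuming the usual exact-match definitions:

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}
$$

When the second PER is only partially detected, the truncated prediction is a false positive and the gold entity is a false negative:

$$
TP = 2,\; FP = 1,\; FN = 1 \;\Rightarrow\; P = \tfrac{2}{3},\; R = \tfrac{2}{3},\; F_1 = \tfrac{2}{3}
$$

When it is missed entirely, there is no wrong prediction left, only the false negative:

$$
TP = 2,\; FP = 0,\; FN = 1 \;\Rightarrow\; P = 1,\; R = \tfrac{2}{3},\; F_1 = \frac{2 \cdot 1 \cdot \tfrac{2}{3}}{1 + \tfrac{2}{3}} = 0.8
$$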

In [8]:
y_true = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','O','O','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.7
Precision:  1.0
Recall:  0.6666666666666666
F1:  0.8
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       1.00      0.50      0.67         2

   micro avg       1.00      0.67      0.80         3
   macro avg       1.00      0.75      0.83         3
weighted avg       1.00      0.67      0.78         3



- The following examples show the results when we have false positives

In [9]:
y_true = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','O','O']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','I-PER','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.9
Precision:  0.6666666666666666
Recall:  0.6666666666666666
F1:  0.6666666666666666
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.50      0.50      0.50         2

   micro avg       0.67      0.67      0.67         3
   macro avg       0.75      0.75      0.75         3
weighted avg       0.67      0.67      0.67         3



In [10]:
y_true = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','O','O']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','I-PER','I-PER']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.8
Precision:  0.6666666666666666
Recall:  0.6666666666666666
F1:  0.6666666666666666
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.50      0.50      0.50         2

   micro avg       0.67      0.67      0.67         3
   macro avg       0.75      0.75      0.75         3
weighted avg       0.67      0.67      0.67         3



- **Let's take, for example, the scheme used in lener_br**

In [11]:
y_true = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))

Accuracy:  0.9
Precision:  1.0
Recall:  1.0
F1:  1.0


In [12]:
y_true = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.8
Precision:  0.6666666666666666
Recall:  0.6666666666666666
F1:  0.6666666666666666
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.50      0.50      0.50         2

   micro avg       0.67      0.67      0.67         3
   macro avg       0.75      0.75      0.75         3
weighted avg       0.67      0.67      0.67         3



In [13]:
y_true = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','O','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.7
Precision:  0.6666666666666666
Recall:  0.6666666666666666
F1:  0.6666666666666666
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.50      0.50      0.50         2

   micro avg       0.67      0.67      0.67         3
   macro avg       0.75      0.75      0.75         3
weighted avg       0.67      0.67      0.67         3



- Again, if the NE is missed entirely (i.e. every token is predicted as O), recall stays the same, but precision increases again, which yields a better F1 score
  - **This is counterintuitive**

In [14]:
y_true = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O','B-PER','I-PER','I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O','O','O','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.6
Precision:  1.0
Recall:  0.6666666666666666
F1:  0.8
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       1.00      0.50      0.67         2

   micro avg       1.00      0.67      0.80         3
   macro avg       1.00      0.75      0.83         3
weighted avg       1.00      0.67      0.78         3



- The following examples show the results when we have false positives
     - The results are slightly different from those of the previous scheme
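
A likely explanation, assuming exact-match counting at the entity level: in the earlier false-positive examples the extra tokens were tagged I-PER, so they extended the preceding PER entity. Its boundaries no longer matched the gold entity, which cost one true positive and added both a false positive and a false negative:

$$
TP = 2,\; FP = 1,\; FN = 1 \;\Rightarrow\; P = \tfrac{2}{3},\; R = \tfrac{2}{3},\; F_1 = \tfrac{2}{3}
$$

In the examples that follow, the extra tokens start with B-PER and therefore form a separate entity. The preceding PER entity stays intact, and only one false positive is added:

$$
TP = 3,\; FP = 1,\; FN = 0 \;\Rightarrow\; P = \tfrac{3}{4},\; R = 1,\; F_1 = \frac{2 \cdot \tfrac{3}{4} \cdot 1}{\tfrac{3}{4} + 1} \approx 0.857
$$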

In [15]:
y_true = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','O','O']]
y_pred = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','B-PER','O']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.9
Precision:  0.75
Recall:  1.0
F1:  0.8571428571428571
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.67      1.00      0.80         2

   micro avg       0.75      1.00      0.86         3
   macro avg       0.83      1.00      0.90         3
weighted avg       0.78      1.00      0.87         3



In [16]:
y_true = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','O','O']]
y_pred = [['B-PER', 'I-PER', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'B-PER','I-PER','B-PER','I-PER']]
print('Accuracy: ',accuracy_score(y_true, y_pred))
print('Precision: ',precision_score(y_true, y_pred))
print('Recall: ',recall_score(y_true, y_pred))
print('F1: ',f1_score(y_true, y_pred))
print('------------------------')
print(classification_report(y_true, y_pred))

Accuracy:  0.8
Precision:  0.75
Recall:  1.0
F1:  0.8571428571428571
------------------------
              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       0.67      1.00      0.80         2

   micro avg       0.75      1.00      0.86         3
   macro avg       0.83      1.00      0.90         3
weighted avg       0.78      1.00      0.87         3



## Possible solutions

- I think we can stick to seqeval, but we must understand how it works internally
- We should definitely use the same scheme for all our NER datasets!
- It might be necessary to report four different metrics, as was done in this paper:
    - Segura-Bedmar, I., Martínez, P., & Herrero-Zazo, M. (2013). SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). *Second Joint Conference on Lexical and Computational Semantics (\*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)*, 341–350. https://aclanthology.org/S13-2056
    
     - *Our evaluation metrics should score if a system is able to identify the exact span of an entity (regardless of the type) and if it is able to assign the correct entity type (regardless of the boundaries). Thus, our evaluation script will output four sets of scores according to*:
      - 1. Strict evaluation (exact-boundary and type matching).
      - 2. Exact boundary matching (regardless to the type).
      - 3. Partial boundary matching (regardless to the type).
      - 4. Type matching (some overlap between the tagged entity and the gold entity is required).
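
The four criteria quoted above can be sketched for a single pair of gold and predicted entities. This is only an illustration under simple assumptions: `Span`, `overlaps` and `match_modes` are made-up names, spans use inclusive token indices, and the code only decides whether a pair would count under each criterion. It does not reproduce the task's scoring script or any partial-credit weighting.

```python
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    start: int  # index of the first token
    end: int    # index of the last token (inclusive)


def overlaps(a, b):
    """True if the two spans share at least one token."""
    return a.start <= b.end and b.start <= a.end


def match_modes(gold, pred):
    """Check one gold/predicted pair against the four matching criteria."""
    same_bounds = (gold.start, gold.end) == (pred.start, pred.end)
    same_type = gold.label == pred.label
    return {
        "strict": same_bounds and same_type,         # exact boundary and type
        "exact": same_bounds,                        # exact boundary, any type
        "partial": overlaps(gold, pred),             # some overlap, any type
        "type": same_type and overlaps(gold, pred),  # same type, some overlap
    }


# The second PER entity from the examples above, with its last token missed.
gold = Span("PER", 7, 9)
pred = Span("PER", 7, 8)
print(match_modes(gold, pred))
# {'strict': False, 'exact': False, 'partial': True, 'type': True}
```

Under the strict criterion this prediction is an error, as in the seqeval results above, while the partial and type criteria would still give it credit.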
      

- One of these options might help us:
 - https://github.com/MantisAI/nervaluate
 - If we evaluate at the token level, we could also simply use scikit-learn and treat each token as a sample
     - https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
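
As a rough sketch of what "each token as a sample" means, the following standard-library code flattens the sentences and computes per-tag precision and recall. The function name `token_level_report` is made up; with scikit-learn, passing the same flattened lists to `sklearn.metrics.classification_report` would be the library route. Note that B- and I- tags are treated as separate classes here.

```python
from collections import Counter


def token_level_report(y_true, y_pred):
    """Per-tag (precision, recall), treating every token as one sample."""
    true_flat = [tag for sentence in y_true for tag in sentence]
    pred_flat = [tag for sentence in y_pred for tag in sentence]
    correct, predicted, actual = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            correct[t] += 1
    report = {}
    for tag in sorted(set(true_flat) | set(pred_flat)):
        precision = correct[tag] / predicted[tag] if predicted[tag] else 0.0
        recall = correct[tag] / actual[tag] if actual[tag] else 0.0
        report[tag] = (precision, recall)
    return report


y_true = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'I-PER']]
y_pred = [['B-PER', 'I-PER', 'O', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
precision, recall = token_level_report(y_true, y_pred)['I-PER']
print(round(precision, 2), round(recall, 2))
# 1.0 0.67
```

At the token level, the single missed token only lowers the recall of I-PER (and the precision of O), instead of invalidating the whole entity.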