## Setting up an objective evaluation
Very important: we need a validated reference, also called "ground truth", "targets", or "gold standard".

This reference contains, for a representative sample of our system's input data, the ideal data that our system should produce as output.
When in doubt, it is important to stick closely to the definition of a "classic" data processing task, that is, a triplet (type and format of the input data, type and format of the output data, method for evaluating the agreement between predicted and expected data) commonly used by teams experienced with the subject.

TODO: introduce the notions of precision, recall and F-score (detection / retrieval metrics)
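
As a starting point for these notions: precision is the fraction of predicted entities that are correct, recall is the fraction of expected entities that were found, and the F-score is their harmonic mean. The sketch below assumes exact matching on `(start, end, label)` triples; the function name `span_prf` and the sample spans are hypothetical, and spaCy's `Scorer` remains the implementation actually used in this notebook.

```python
def span_prf(predicted, expected):
    """Return (precision, recall, F1) for two collections of (start, end, label) spans."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: the model finds 2 of the 3 expected entities and adds 2 spurious ones.
expected = [(0, 5, "PER"), (15, 20, "LOC"), (30, 36, "PER")]
predicted = [(0, 5, "PER"), (15, 20, "LOC"), (22, 27, "LOC"), (40, 44, "PER")]
precision, recall, f1 = span_prf(predicted, expected)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here two of the four predictions are correct (precision 2/4) and two of the three expected entities are found (recall 2/3), so a model that over-predicts is penalized on precision even when its recall is decent.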

In [None]:
# Load the dataset in an easy-to-use format
import json
def load_dataset(path_to_json: str) -> dict[str, tuple[str, list[tuple[int, int, str]]]]:
    with open(path_to_json, encoding="utf8") as in_file:
        return json.load(in_file)

all_data = load_dataset("../dataset/French_ELTEC_NER_Open_Dataset.json")
print(f"Loaded text and target entities for {len(all_data)} samples.")

Loaded text and target entities for 100 samples.
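
Judging from the return annotation of `load_dataset`, each entry maps a document identifier to a pair made of the raw text and a list of `(start, end, label)` character offsets. A minimal sketch of that layout follows; the document id, the sentence and the helper `entity_strings` are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample in the same layout as the loaded dataset (content is invented).
sample_data = {
    "doc_001": (
        "Marie habite à Paris.",
        [(0, 5, "PER"), (15, 20, "LOC")],
    ),
}


def entity_strings(text, entities):
    """Turn (start, end, label) character offsets into (surface form, label) pairs."""
    return [(text[start:end], label) for start, end, label in entities]


for doc_id, (text, target_entities) in sample_data.items():
    print(doc_id, entity_strings(text, target_entities))
```

After `json.load`, the pairs and triples come back as lists rather than tuples; the unpacking above works the same way.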


In [None]:
from spacy.scorer import Scorer
from spacy.training.example import Example

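def evaluate(ner_model, dataset_dict, debug=False):
    """Score `ner_model` on `dataset_dict` and return spaCy's span-level scores.

    `dataset_dict` maps a document id to `(text, entities)`, where `entities`
    is a list of `(start_char, end_char, label)` offsets.
    Returns the dict produced by `Scorer.score_spans` (keys `ents_p`,
    `ents_r`, `ents_f` and `ents_per_type`).
    Raises ValueError if a document cannot be converted to an `Example`.
    """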
    examples = []
    for doc_id, (text, target_entities) in dataset_dict.items():
        pred_doc = ner_model(text)
        if debug:
            print("Pred.:", [(ent.text, ent.label_) for ent in pred_doc.ents], " ↔ Targ.:", [(text[e[0]:e[1]], e[2]) for e in target_entities])
        try:
            example = Example.from_dict(pred_doc, {"entities": target_entities})
            examples.append(example)
        except ValueError as e:
            err_msg = f"Error parsing document '{doc_id}': "
            err_msg += getattr(e, "msg", str(e))
            print(err_msg)
            raise ValueError(err_msg) from e
    
    scorer = Scorer()
    scores = scorer.score_spans(examples, "ents")
    # print(scores["ents_f"])
    return scores

In [None]:
# Load a NER model
import spacy

ner_model = spacy.load('fr_core_news_sm')

The pipeline components that are not needed for NER can be disabled here to speed up the evaluation.

In [None]:
ner_model.pipe_names

['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
ner_model.select_pipes(enable="ner")
ner_model.pipe_names

['ner']

In [None]:
%%time
# Evaluate using the custom function; it may be redundant with the Language.evaluate() method: <https://spacy.io/api/language#evaluate>
results = evaluate(ner_model, all_data, debug=False)
results

Où..." with entities "[[51, 58, 'PER'], [106, 113, 'PER'], [369, 374, 'L...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
Dans la grande salle des fêtes de 1' « ..." with entities "[[1003, 1014, 'PER'], [1246, 1252, 'PER'], [1254, ...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.



CHAPITRE PREMIER
PREMIERS SIGNES
Je suis toute..." with entities "[[122, 140, 'PER'], [160, 168, 'PER'], [998, 1006,...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.


CPU times: user 7.77 s, sys: 213 ms, total: 7.98 s
Wall time: 7.98 s


{'ents_p': 0.4317656129529684,
 'ents_r': 0.6339622641509434,
 'ents_f': 0.5136829231004433,
 'ents_per_type': {'MISC': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'PER': {'p': 0.6165458937198067,
   'r': 0.5984759671746777,
   'f': 0.6073765615704938},
  'LOC': {'p': 0.4205488194001276,
   'r': 0.698093220338983,
   'f': 0.5248904818797292},
  'ORG': {'p': 0.0, 'r': 0.0, 'f': 0.0}}}
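
The alignment warnings printed before the scores mean that some annotated character offsets do not start or end on a token boundary, so those entities cannot be mapped onto tokens. The sketch below illustrates the idea with a deliberately crude regular-expression tokenizer standing in for spaCy's (an assumption made to keep the example dependency-free; `simple_token_spans` and `misaligned_entities` are hypothetical helpers, and the sentence is invented). For the real check, the warning itself points to `spacy.training.offsets_to_biluo_tags`.

```python
import re


def simple_token_spans(text):
    """Character spans of word and punctuation tokens (crude stand-in for a real tokenizer)."""
    return [match.span() for match in re.finditer(r"\w+|[^\w\s]", text)]


def misaligned_entities(text, entities):
    """Return the entities whose start or end does not coincide with a token boundary."""
    spans = simple_token_spans(text)
    token_starts = {start for start, _ in spans}
    token_ends = {end for _, end in spans}
    return [
        (start, end, label)
        for start, end, label in entities
        if start not in token_starts or end not in token_ends
    ]


text = "Marie habite à Paris."
entities = [(0, 3, "PER"), (15, 20, "LOC")]  # (0, 3) cuts "Marie" in the middle
print(misaligned_entities(text, entities))
```

Only the first entity is reported, because its end offset falls inside the token "Marie". Which offsets count as misaligned in practice depends on spaCy's own tokenization, which this stand-in does not reproduce.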


Next, we run the same evaluation with spaCy's built-in [`Language.evaluate()`](https://spacy.io/api/language#evaluate) method.

In [None]:
%%time
examples = []
for doc_id, (text, target_entities) in all_data.items():
    # Tokenized-only docs should be enough here: Language.evaluate() runs the pipeline itself.
    base_doc = ner_model.make_doc(text)
    try:
        example = Example.from_dict(base_doc, {"entities": target_entities})
        examples.append(example)
    except ValueError as e:
        err_msg = f"Error parsing document '{doc_id}': "
        err_msg += getattr(e, "msg", str(e))
        print(err_msg)
        raise ValueError(err_msg) from e
print(f"Created {len(examples)} examples.")

Où..." with entities "[[51, 58, 'PER'], [106, 113, 'PER'], [369, 374, 'L...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
Dans la grande salle des fêtes de 1' « ..." with entities "[[1003, 1014, 'PER'], [1246, 1252, 'PER'], [1254, ...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.



CHAPITRE PREMIER
PREMIERS SIGNES
Je suis toute..." with entities "[[122, 140, 'PER'], [160, 168, 'PER'], [998, 1006,...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.


Created 100 examples.
CPU times: user 3.13 s, sys: 2.09 ms, total: 3.13 s
Wall time: 3.14 s


In [None]:
%%time
scores = ner_model.evaluate(examples)
scores

CPU times: user 7.78 s, sys: 8.48 s, total: 16.3 s
Wall time: 16.4 s


{'token_acc': 1.0,
 'token_p': 1.0,
 'token_r': 1.0,
 'token_f': 1.0,
 'ents_p': 0.43243243243243246,
 'ents_r': 0.6339622641509434,
 'ents_f': 0.5141545524100996,
 'ents_per_type': {'LOC': {'p': 0.4216250799744082,
   'r': 0.698093220338983,
   'f': 0.5257279617072198},
  'MISC': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'PER': {'p': 0.6172914147521161,
   'r': 0.5984759671746777,
   'f': 0.6077380952380953},
  'ORG': {'p': 0.0, 'r': 0.0, 'f': 0.0}},
 'speed': 9838.436375865644}

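We obtain nearly the same values (recall is identical; precision and F-score differ only around the third decimal place), but more slowly: 16.4 s of wall time instead of 7.98 s, not counting the 3.14 s spent building the examples. This is probably because `Language.evaluate()` performs a broader evaluation that also covers tokenization and speed.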
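
To put a number on how close the two runs are, the overall and per-type scores printed above can be compared directly. A small sketch, with the values copied from the two outputs (the flattened key names such as `PER_p` are ours):

```python
# Scores copied from the two outputs above: custom evaluate() vs. Language.evaluate().
custom = {
    "ents_p": 0.4317656129529684, "ents_r": 0.6339622641509434, "ents_f": 0.5136829231004433,
    "PER_p": 0.6165458937198067, "PER_r": 0.5984759671746777, "PER_f": 0.6073765615704938,
    "LOC_p": 0.4205488194001276, "LOC_r": 0.698093220338983, "LOC_f": 0.5248904818797292,
}
builtin = {
    "ents_p": 0.43243243243243246, "ents_r": 0.6339622641509434, "ents_f": 0.5141545524100996,
    "PER_p": 0.6172914147521161, "PER_r": 0.5984759671746777, "PER_f": 0.6077380952380953,
    "LOC_p": 0.4216250799744082, "LOC_r": 0.698093220338983, "LOC_f": 0.5257279617072198,
}

# Positive values mean Language.evaluate() reported the higher score.
differences = {key: builtin[key] - custom[key] for key in custom}
for key, diff in differences.items():
    print(f"{key}: {diff:+.4f}")

largest_gap = max(abs(diff) for diff in differences.values())
print(f"Largest absolute difference: {largest_gap:.4f}")
```

Recall is identical in both runs, while precision (and therefore the F-score) is slightly higher with `Language.evaluate()`, which suggests that the two runs count a slightly different number of predicted entities.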