# NER model training pipeline with spaCy (Google Colab)

Author: Lucas Terriel

Date: 27/10/2021

Run this notebook in a GPU environment: `Runtime > Change runtime type > Hardware accelerator: GPU`


### 1. Start by mounting your *drive* / folder to store your data (persistence)

The *drive* or folder can be organized as follows:

```
drive/MyDrive
  |
  |- spaCy/
  |   |- data/ (contains the examples: here train.spacy, test.spacy and eval.spacy (optional))
  |   |- base_config_nom.cfg : the config file, named after the model architecture used
  |   |- output_models/ : a folder to hold the trained models
  |           |- version_projet
  |           |       |- evaluations/ (metrics/ and visualisations/)
  |           |- model1, model2 ...
  |   |- raw_texts_tests/ : one or two raw text files to test the model "on the fly"
  |   |- vectors/ : fr_wiki fastText embeddings (optional)
  |
```
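The layout above can be created in one step rather than by hand. The following is a minimal sketch using only the standard library; the helper name `create_layout` and the `SUBDIRS` list are illustrative assumptions, not part of the original notebook, and the root is whatever folder the drive is mounted on.

```python
from pathlib import Path

# Sub-folders taken from the layout above, relative to the drive root.
SUBDIRS = [
    "spaCy/data",
    "spaCy/output_models",
    "spaCy/raw_texts_tests",
    "spaCy/vectors",
]


def create_layout(root):
    """Create the project folders under `root` and return their paths."""
    created = []
    for sub in SUBDIRS:
        path = Path(root) / sub
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Example (on Colab, after mounting the drive):
# create_layout("/drive/MyDrive")
```

Because existing folders are left untouched, the cell can be rerun at the start of every session.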

In [None]:
#from google.colab import drive
#drive.mount("/drive")

Mounted at /drive


### 2. Install the Python dependencies

In [None]:
# Use this block on a GPU runtime:
#!pip install -U pip setuptools wheel
#!pip install -U spacy[cuda111,transformers]
#!pip install torch
#!python -m spacy download fr_dep_news_trf
#!python -m spacy download fr_core_news_lg

# Use this block on a CPU runtime:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download fr_core_news_lg

Collecting pip
  Downloading pip-21.3.1-py3-none-any.whl (1.7 MB)
     |████████████████████████████████| 1.7 MB 12.9 MB/s
Collecting setuptools
  Downloading setuptools-58.3.0-py3-none-any.whl (946 kB)
     |████████████████████████████████| 946 kB 53.4 MB/s
Installing collected packages: setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 57.4.0
    Uninstalling setuptools-57.4.0:
      Successfully uninstalled setuptools-57.4.0
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed pip-21.3.1 setuptools-58.3.0


Collecting spacy[cuda111,transformers]
  Downloading spacy-3.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
     |████████████████████████████████| 5.9 MB 13.3 MB/s            
Collecting thinc<8.1.0,>=8.0.9
  Downloading thinc-8.0.11-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (627 kB)
     |████████████████████████████████| 627 kB 40.6 MB/s            
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 1.4 MB/s             
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
     |████████████████████████████████| 10.1 MB 26.6 MB/s    

### 3. Check that spaCy is version 3.x and that the GPU is used

In [None]:
import spacy
#import torch
#torch.cuda.empty_cache()

# GPU checks: uncomment on a GPU runtime (the last one also needs the torch import above)
#print('GPU State : ', spacy.prefer_gpu())
#print('GPU required : ', spacy.require_gpu(0))
#print('Cuda available : ', torch.cuda.is_available())
print('SPACY VERSION : ', spacy.__version__)

GPU State :  True
GPU required :  True
Cuda available :  True
SPACY VERSION :  3.1.3


### 4. Set the paths to your data

In [None]:
# Adjust these paths to where your data is located

CONFIG_BASE="/drive/MyDrive/spaCy/base_config_cnn.cfg"
CONFIG_BASE_FILLED="/drive/MyDrive/spaCy/config_spacy_cnn.cfg"

OUTPUT_MODEL="/drive/MyDrive/spaCy/output_models/v1_n4a/n4a_v1_lg_cnn_fr"
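Since a mistyped path only surfaces later, when a spaCy command fails, it can help to check the paths right after defining them. The sketch below is an illustration using only the standard library; the helper name `missing_paths` is an assumption, and the list to check should contain whichever of the variables above must already exist (the base config, for instance).

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Path copied from CONFIG_BASE above; extend the list as needed.
paths_to_check = ["/drive/MyDrive/spaCy/base_config_cnn.cfg"]
for path in missing_paths(paths_to_check):
    print("Not found:", path)
```

The filled config and the output folder are created by later steps, so they are deliberately left out of the list.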

### ➹ 5a. Create the word vectors (*embeddings*) [optional if already done]

In [None]:
# !python -m spacy init vectors fr /drive/MyDrive/spaCy/vectors/wiki.fr.vec.txt.zip /drive/MyDrive/spaCy/vectors/vec_wikifr --name with_frwiki_vec --verbose

ℹ Creating blank nlp object for language 'fr'
[2021-10-26 09:16:08,274] [INFO] Reading vectors from /drive/MyDrive/spaCy/vectors/wiki.fr.vec.txt.zip
tcmalloc: large alloc 1382940672 bytes == 0x55e90da40000 @  0x7f3e972e1001 0x7f3e94ddf54f 0x7f3e94e2fb58 0x7f3e94e33b17 0x7f3e94ed2203 0x55e90386a544 0x55e90386a240 0x55e9038de627 0x55e90386bafa 0x55e9038d9915 0x55e9038d8ced 0x55e90386bbda 0x55e9038da737 0x55e9038d89ee 0x55e9037aae2b 0x55e9038dafe4 0x55e9038d8ced 0x55e9037aae2b 0x55e9038dafe4 0x55e9038d89ee 0x55e90386c48c 0x55e90386c698 0x55e9038dafe4 0x55e90386bafa 0x55e9038d9c0d 0x55e9038d8ced 0x55e90386bbda 0x55e9038d9c0d 0x55e9038d8ced 0x55e90386bbda 0x55e9038d9c0d
1152449it [02:27, 7802.89it/s]
[2021-10-26 09:18:37,336] [INFO] Loaded vectors from /drive/MyDrive/spaCy/vectors/wiki.fr.vec.txt.zip
✔ Successfully converted 1152449 vectors
✔ Saved nlp object with vectors to output directory. You can now use
the path to it in your config as the 'vectors' s

### ⚙️ 5b. Prepare the configuration file [optional if already done]

In the config file, set the `train` and `dev` variables (`vectors.path` is optional) to point to your training data, then run the cell below.

In [None]:
# If using a transformer with embeddings
#import os
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
!python -m spacy init fill-config $CONFIG_BASE $CONFIG_BASE_FILLED

✔ Auto-filled config with all values
✔ Saved config
/drive/MyDrive/spaCy/config_spacy_cnn.cfg
You can now add your data and train your pipeline:
python -m spacy train config_spacy_cnn.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
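As the message above indicates, the data paths can also be passed on the command line with `--paths.train` and `--paths.dev` instead of being written into the config file. The sketch below assembles such a command in Python; the helper name `build_train_command` and the example file names are assumptions for illustration, and the resulting list could be handed to `subprocess.run` or printed and pasted after a `!` in a cell.

```python
import shlex


def build_train_command(config, output, train, dev, gpu_id=-1):
    """Assemble a `spacy train` command with the data paths as overrides."""
    return [
        "python", "-m", "spacy", "train", str(config),
        "--output", str(output),
        "--gpu-id", str(gpu_id),  # -1 runs on CPU, 0 selects the first GPU
        "--paths.train", str(train),
        "--paths.dev", str(dev),
    ]


# Placeholder file names, to be replaced with your own paths.
cmd = build_train_command(
    "config_spacy_cnn.cfg", "output_models/model1",
    "./train.spacy", "./dev.spacy",
)
# Quote each argument so the printed command is safe to paste into a shell
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Overrides given this way take precedence over the values stored in the config, which keeps a single config file reusable across several datasets.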


### 6. ✅ Check the training data

In [None]:
!python -m spacy debug data $CONFIG_BASE_FILLED

✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: fr
Training pipeline: tok2vec, ner
1205 training docs
266 evaluation docs
⚠ 51 training examples also in evaluation data
⚠ Low number of examples to train a new pipeline (1205)

ℹ 14633 total word(s) in the data (2840 unique)
ℹ No word vectors present in the package

ℹ 5 label(s)
0 missing value(s) (tokens with '-' label)
⚠ 115 entity span(s) with punctuation
⚠ Low number of examples for label 'EVENT' (11)
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
Entity spans consisting of or starting/ending with punctuation can not be
trained with a noise level > 0.

✔ 4 checks passed
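The report flags training examples that also appear in the evaluation data, which inflates the evaluation scores. The sketch below shows one way to find and drop such duplicates, assuming they are identified by identical text; the function names and sample sentences are hypothetical. With real data, the texts could be read from the `.spacy` files, for example with `DocBin().from_disk(path).get_docs(nlp.vocab)` and `doc.text`.

```python
def overlapping_texts(train_texts, dev_texts):
    """Return the texts present in both sets, ignoring surrounding whitespace."""
    train_set = {t.strip() for t in train_texts}
    dev_set = {t.strip() for t in dev_texts}
    return sorted(train_set & dev_set)


def drop_overlap(train_texts, dev_texts):
    """Keep only the training texts that do not also appear in the dev set."""
    dev_set = {t.strip() for t in dev_texts}
    return [t for t in train_texts if t.strip() not in dev_set]


# Hypothetical sample sentences, for illustration only.
train = ["Paris est en France.", "Lucas travaille à Lyon.", "Bonjour."]
dev = ["Bonjour.", "Marie habite à Nice."]
print(overlapping_texts(train, dev))  # ['Bonjour.']
print(len(drop_overlap(train, dev)))  # 2
```

Removing the duplicates from the training side, rather than from the evaluation side, keeps the evaluation set unchanged between runs.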


### 7. 🚀 🤖 Launch training

Use `-g 0` to train on GPU; `-g -1`, as in the cell below, trains on CPU.

In [None]:
!python -m spacy train $CONFIG_BASE_FILLED -g -1 --output $OUTPUT_MODEL

ℹ Saving to output directory:
/drive/MyDrive/spaCy/output_models/v1_n4a/n4a_v1_lg_cnn_fr
ℹ Using CPU

[2021-10-26 11:27:24,247] [INFO] Set up nlp object from config
[2021-10-26 11:27:24,259] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-10-26 11:27:24,264] [INFO] Created vocabulary
[2021-10-26 11:27:24,265] [INFO] Finished initializing nlp object
[2021-10-26 11:27:26,067] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     51.68    0.00    0.00    0.00    0.00
  1     200        632.22   4175.26   58.65   65.84   52.88    0.59
  2     400        411.02   2049.09   67.83   71.39   64.60    0.68
  5     600        474.85   1195.64   73.38   74.21   72.57    0.73
  7   

### 8. 📊 Model evaluation

In [None]:
# Adjust these paths to where your data is located

model_eval="/drive/MyDrive/spaCy/output_models/v1_n4a/n4a_v1_lg_cnn_fr/model-best"
eval_file="/drive/MyDrive/spaCy/data/large/eval.spacy"
output_metrics="/drive/MyDrive/spaCy/output_models/v1_n4a/evaluations/metrics/n4a_v1_lg_cnn_fr" + ".json"
output_visualisations="/drive/MyDrive/spaCy/output_models/v1_n4a/evaluations/visualisations/"

In [None]:
# On GPU, use --gpu-id 0
!python -m spacy evaluate $model_eval $eval_file --output $output_metrics --gpu-id -1 --displacy-path $output_visualisations

ℹ Using CPU

TOK     -    
NER P   54.55
NER R   48.32
NER F   51.25
SPEED   19470


                   P       R       F
ORGANISATION   30.00   23.08   26.09
PERSON         62.79   64.29   63.53
LOCATION       57.38   53.85   55.56
TITLE          66.67   26.67   38.10
EVENT           0.00    0.00    0.00

✔ Generated 25 parses as HTML
/drive/MyDrive/spaCy/output_models/v1_n4a/evaluations/visualisations
✔ Saved results to
/drive/MyDrive/spaCy/output_models/v1_n4a/evaluations/metrics/n4a_v1_lg_cnn_fr.json
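To read the table, recall that the F column is the harmonic mean of precision (P) and recall (R): F = 2PR / (P + R). The short sketch below recomputes it from the P and R values reported above; the function name `f_score` is an illustrative choice, and small differences in the last digit are possible because the inputs are already rounded.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the evaluation output above.
scores = {
    "overall": (54.55, 48.32),
    "ORGANISATION": (30.00, 23.08),
    "PERSON": (62.79, 64.29),
    "EVENT": (0.0, 0.0),
}
for label, (p, r) in scores.items():
    print(f"{label:<13} F = {f_score(p, r):.2f}")
```

The harmonic mean is pulled toward the lower of the two values, which is why TITLE, with a precision of 66.67 but a recall of only 26.67, ends up with an F-score of 38.10.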


### 9. 🧪 Test the NER model "on the fly" / visualization

In [None]:
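import spacy
from spacy import displacy

# Adjust to the raw text file you want to test
test_text = "/drive/MyDrive/spaCy/raw_texts/text_test_2.txt"

# Read the whole file; the context manager closes it afterwards
with open(test_text, mode="r", encoding="utf-8") as f:
    text = f.read()

# Non-empty lines, kept for line-by-line inspection (not used below)
sample = [line for line in text.splitlines() if line != ""]

# Load the best model saved during training and annotate the full text
nlp = spacy.load(model_eval)
doc = nlp(text)

# Labels to display and the color assigned to each
opt_render = {
    "ents": ["ORGANISATION", "LOCATION", "TITLE", "PERSON", "EVENT"],
    "colors": {
        "LOCATION": "#e74c3c",
        "ORGANISATION": "#9b59b6",
        "PERSON": "#45b39d",
        "TITLE": "#85c1e9",
        "EVENT": "#4d8fbb"
    }
}

# Render the entities inline in the notebook
displacy.render(doc, style="ent", jupyter=True, options=opt_render)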
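Alongside the visual rendering, a quick count of entities per label gives a rough idea of what the model finds in the test text. The sketch below works on plain `(text, label)` pairs so that it runs without a trained model; the helper name `count_labels` and the sample pairs are hypothetical. With the `doc` object above, the pairs could be built with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter


def count_labels(entities):
    """Count how many entities were found for each label."""
    return Counter(label for _text, label in entities)


# Hypothetical (text, label) pairs, for illustration only.
entities = [
    ("Marie Curie", "PERSON"),
    ("Paris", "LOCATION"),
    ("Lyon", "LOCATION"),
]
for label, n in count_labels(entities).most_common():
    print(label, n)
```

A label that never appears on a text where it is expected, such as EVENT given its low training count, is a hint that more annotated examples are needed.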