**NER Notebook**

Created by: Tan Poh Keam, Republic Polytechnic

Acknowledgement: The notebook is inspired by https://youtu.be/TKoPva69_6E


This notebook demonstrates:

1. How to prepare data for NER training tasks.
The input data is labelled in the spaCy 2 format and then converted into a binary DocBin file.

2. How to use the CLI commands to train a custom NER model.

3. How to reload the trained NER model to predict on unseen data.

We further assume that the configuration file has been prepared.

The tested environment is based on Python 3.8 and spaCy 3.

In [1]:
import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin
from spacy import displacy

The next function takes annotated data and converts it to a DocBin object.

In [2]:
### Do NOT change this block of code.

nlp = spacy.blank("en") # load a new spacy model

def create_spacy3_training_data(TRAIN_DATA):
    db = DocBin() # create a DocBin object
    for text, annot in tqdm(TRAIN_DATA): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text
        ents = []
        for start, end, label in annot["entities"]: # add character indexes
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity") # no token falls fully inside these offsets
            else:
                ents.append(span)
        doc.ents = ents # label the text with the ents
        db.add(doc)
    return db



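**Step 1A: Prepare the Annotated Training Data in spaCy 2 format**

You will first need to prepare the training and validation data based on an 80/20 split.
In practice, you will need hundreds of example statements per entity label, and they will typically be loaded from a JSON file instead.
Note that in this demo the validation list below repeats the training examples, so the reported scores do not reflect performance on unseen data.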
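
The loading and splitting described above could be sketched as follows. This is only an illustration: the JSON layout (a list of `[text, {"entities": [[start, end, label], ...]}]` records) and the helper names `load_examples` and `split_examples` are assumptions, not part of the original notebook.

```python
import json
import random

def load_examples(json_text):
    """Parse JSON text into spaCy 2 style (text, {"entities": [...]}) tuples.

    Assumed layout: a list of [text, {"entities": [[start, end, label], ...]}].
    JSON has no tuple type, so the entity lists are converted back to tuples.
    """
    records = json.loads(json_text)
    return [
        (text, {"entities": [tuple(ent) for ent in annot["entities"]]})
        for text, annot in records
    ]

def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and split it into (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Small stand-in for the contents of a JSON annotation file.
json_text = json.dumps([
    ["is the F15 outdated", {"entities": [[7, 10, "aircraft"]]}],
    ["who manufactures the F16", {"entities": [[21, 24, "aircraft"]]}],
    ["how much does a F35 cost", {"entities": [[16, 19, "aircraft"]]}],
    ["is the F35 a waste of money", {"entities": [[7, 10, "aircraft"]]}],
    ["did you see the F16 landing?", {"entities": [[16, 19, "aircraft"]]}],
])

examples = load_examples(json_text)
train_part, val_part = split_examples(examples)
print(len(train_part), len(val_part))
```

With a real file, `json_text` would come from reading that file; the two returned lists can then play the roles of `TRAIN_DATA` and `VAL_DATA`.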


In [3]:
TRAIN_DATA = \
[('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]}),
 ('did you see the F16 landing?', {'entities': [(16, 19, 'aircraft')]}),
 ('how many missiles can a F35 carry', {'entities': [(24, 27, 'aircraft')]}),
 ('is the F15 outdated', {'entities': [(7, 10, 'aircraft')]}),
 ('does the US still train pilots to dog fight?',{'entities': [(0, 0, 'aircraft')]}),
 ('how long does it take to train a F16 pilot',{'entities': [(33, 36, 'aircraft')]}),
 ('how much does a F35 cost', {'entities': [(16, 19, 'aircraft')]}),
 ('would it be possible to steal a F15', {'entities': [(32, 35, 'aircraft')]}),
 ('who manufactures the F16', {'entities': [(21, 24, 'aircraft')]}),
 ('how many countries have bought the F35',{'entities': [(35, 38, 'aircraft')]}),
 ('is the F35 a waste of money', {'entities': [(7, 10, 'aircraft')]})]

In [4]:
VAL_DATA = \
[('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]}),
 ('did you see the F16 landing?', {'entities': [(16, 19, 'aircraft')]}),
 ('how many missiles can a F35 carry', {'entities': [(24, 27, 'aircraft')]}),
 ('is the F15 outdated', {'entities': [(7, 10, 'aircraft')]}),
 ('does the US still train pilots to dog fight?',{'entities': [(0, 0, 'aircraft')]}),
 ('how long does it take to train a F16 pilot',{'entities': [(33, 36, 'aircraft')]}),
 ('how much does a F35 cost', {'entities': [(16, 19, 'aircraft')]}),
 ('would it be possible to steal a F15', {'entities': [(32, 35, 'aircraft')]}),
 ('who manufactures the F16', {'entities': [(21, 24, 'aircraft')]}),
 ('how many countries have bought the F35',{'entities': [(35, 38, 'aircraft')]}),
 ('is the F35 a waste of money', {'entities': [(7, 10, 'aircraft')]})]
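
Before converting, it can help to sanity-check the character offsets. The sketch below uses a hypothetical helper, `find_bad_offsets`, that flags spans which are empty, out of range, or padded with whitespace. The `(0, 0, 'aircraft')` annotation in the lists above is an empty span of this kind, which appears to be what triggers the "Skipping entity" message during conversion.

```python
def find_bad_offsets(data):
    """Return (example index, start, end, reason) for suspicious entity spans."""
    problems = []
    for i, (text, annot) in enumerate(data):
        for start, end, label in annot["entities"]:
            if not (0 <= start < end <= len(text)):
                problems.append((i, start, end, "empty or out-of-range span"))
            elif text[start:end] != text[start:end].strip():
                problems.append((i, start, end, "span starts or ends with whitespace"))
    return problems

# Two examples copied from the data above: one valid span and one empty span.
sample = [
    ("is the F15 outdated", {"entities": [(7, 10, "aircraft")]}),
    ("does the US still train pilots to dog fight?", {"entities": [(0, 0, "aircraft")]}),
]
print(find_bad_offsets(sample))
# prints [(1, 0, 0, 'empty or out-of-range span')]
```

A sentence with no entity to label can be given an empty list, `{'entities': []}`, instead of a placeholder span.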

**Step 1B: Convert to the DocBin format that spaCy 3 requires**
The training data needs to be converted to a binary format and saved to a path on disk.
You can modify the next two variables if needed.
Do NOT modify the code, except for the location of the output files.

In [5]:
# modify if needed
path_to_train_data = '../assets/train.spacy'
path_to_test_data = '../assets/dev.spacy'


In [7]:
# performs the conversion from Spacy 2 to Spacy 3 format
db_train = create_spacy3_training_data(TRAIN_DATA)
db_train.to_disk(path_to_train_data) # save the docbin object

100%|██████████| 11/11 [00:00<00:00, 1789.80it/s]

Skipping entity





In [8]:
db_dev = create_spacy3_training_data(VAL_DATA)
db_dev.to_disk(path_to_test_data) # save the docbin object

100%|██████████| 11/11 [00:00<00:00, 1899.83it/s]

Skipping entity





In [9]:
# Check that the files are saved
!ls -l ../assets

total 16
-rw-r--r--  1 tanpohkeam  staff  1173 Jun 30 22:07 dev.spacy
-rw-r--r--  1 tanpohkeam  staff  1173 Jun 30 22:07 train.spacy


In [10]:
# check for errors in the doc bin files
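
One possible way to do this check is to read a saved file back and count the documents and entity labels it contains. This is a minimal sketch, not the author's method: `summarize_docbin` and the `demo_` names are hypothetical, and the demo writes a one-document file to a temporary directory so that the block runs on its own.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

def summarize_docbin(path, nlp):
    """Load a .spacy file and return (number of docs, {label: entity count})."""
    db = DocBin().from_disk(path)
    docs = list(db.get_docs(nlp.vocab))
    label_counts = {}
    for doc in docs:
        for ent in doc.ents:
            label_counts[ent.label_] = label_counts.get(ent.label_, 0) + 1
    return len(docs), label_counts

# Self-contained demo: save one annotated doc, then read it back.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp.make_doc("is the F15 outdated")
demo_doc.ents = [demo_doc.char_span(7, 10, label="aircraft")]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.spacy"
    DocBin(docs=[demo_doc]).to_disk(demo_path)
    n_docs, counts = summarize_docbin(demo_path, demo_nlp)

print(n_docs, counts)
```

In this notebook the same helper could be called as `summarize_docbin(path_to_train_data, nlp)` and `summarize_docbin(path_to_test_data, nlp)`. A document count or label count that differs from what you annotated points to entities that were skipped during conversion.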


**Step 2: Use the  CLI to perform the NER Training**

Start the training using the spacy train CLI command.

In [11]:
## This is a sample from the website. Do not run this.
#!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev dev.spacy


In [13]:
path_to_model = '../output'

!python -m spacy train ../configs/config.cfg --output $path_to_model --paths.train $path_to_train_data --paths.dev $path_to_test_data



ℹ Using CPU

[2021-06-30 22:08:32,787] [INFO] Set up nlp object from config
[2021-06-30 22:08:32,791] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-06-30 22:08:32,792] [INFO] Created vocabulary
[2021-06-30 22:08:32,793] [INFO] Finished initializing nlp object
[2021-06-30 22:08:32,971] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     41.50    0.00    0.00    0.00    0.00
✔ Saved pipeline to output directory
../output/model-last


The ../output folder contains two custom NER models, model-best and model-last, which you can use to create new pipelines.

In [14]:
!ls -l $path_to_model
!pwd

total 0
drwxr-xr-x  8 tanpohkeam  staff  256 Jun 30 22:08 model-best
drwxr-xr-x  8 tanpohkeam  staff  256 Jun 30 22:08 model-last
/Users/tanpohkeam/Workspace/Spacy/ner-course/scripts


**Step 3: Use the new model for inference**
    

In [16]:
# load the best model from training
best_nlp = spacy.load("../output/model-best")

In [24]:
doc = best_nlp('is the F16 a waste of money') # use the trained model, not the blank pipeline
displacy.render(doc, style="ent")
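
Besides the displacy view, the predicted entities can be read directly from `doc.ents`. The sketch below uses a hypothetical helper, `ents_as_tuples`. Because the trained model is not available outside this notebook, the demo sets one entity by hand on a blank pipeline, so it shows the output format only and not a real prediction.

```python
import spacy

def ents_as_tuples(doc):
    """Return (text, start_char, end_char, label) for every entity in a Doc."""
    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

# Demo with a manually annotated doc; a trained pipeline would fill doc.ents itself.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp.make_doc("is the F16 a waste of money")
demo_doc.ents = [demo_doc.char_span(7, 10, label="aircraft")]

print(ents_as_tuples(demo_doc))
# prints [('F16', 7, 10, 'aircraft')]
```

With the trained model loaded, the call would be `ents_as_tuples(best_nlp('is the F16 a waste of money'))`.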



In [21]:
# use the trained model; a blank pipeline has no NER component and finds no entities
doc = best_nlp('Who is the manufacturer of the F16')
displacy.render(doc, style="ent")
