### Importing Required Libraries

In [45]:
import spacy
from spacy.tokens import DocBin  # spaCy's binary serialization format for training data
from tqdm import tqdm  # progress bar

nlp = spacy.load("en_core_web_sm")  # load the pretrained small English pipeline
db = DocBin()  # create an empty DocBin to collect the annotated docs

In [38]:
import json

# Load the annotations; the context manager closes the file afterwards
with open('./annotations.json') as f:
    TRAIN_DATA = json.load(f)

### Preprocessing the JSON File to spaCy Format

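This step also skips any entity whose character offsets cannot be aligned to token boundaries (those for which `doc.char_span` returns `None`).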

In [39]:
for text, annot in tqdm(TRAIN_DATA['annotations']):
    doc = nlp.make_doc(text)  # tokenize only; no pipeline components are run
    ents = []
    for start, end, label in annot["entities"]:
        # "contract" shrinks the span to the tokens that lie fully inside the offsets
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # attach the aligned entity spans to the doc
    db.add(doc)

db.to_disk("./training_data.spacy")  # save the DocBin object

100%|██████████| 324/324 [00:00<00:00, 4317.94it/s]

Skipping entity
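
Assigning to `doc.ents` raises an error if two spans overlap, so it can help to check the raw annotations before running the loop above. The following is only a sketch in plain Python, assuming each entity is a `(start, end, label)` triple as in the loop; the helper name `drop_overlapping` and the sample offsets are made up for illustration. For `Span` objects, spaCy provides `spacy.util.filter_spans` for the same purpose.

```python
def drop_overlapping(entities):
    """Keep longer spans first and discard any span that overlaps a kept one.

    `entities` is assumed to be a list of (start, end, label) triples.
    """
    kept = []
    # Visit the longest spans first; ties are broken by start offset
    for start, end, label in sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0])):
        # Two spans are disjoint if one ends before the other starts
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Hypothetical offsets: the first two spans overlap, the third stands alone
sample = [(0, 5, "FOOD"), (3, 9, "FOOD"), (12, 18, "FOOD")]
print(drop_overlapping(sample))
```

Preferring the longest span is one possible policy; which of two overlapping annotations is actually correct depends on the data.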





### Initializing the Config File and Training on the Food Dataset


In [40]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency


✘ The provided output file already exists. To force overwriting the config file,
set the --force or -F flag.



In [41]:
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy

ℹ Saving to output directory: .
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     45.90    0.00    0.00    0.00    0.00
  2     200         72.01   2063.54   82.51   83.72   81.34    0.83
  5     400        300.88    657.42   94.62   94.72   94.52    0.95
 10     600        172.77    423.30   95.98   97.49   94.52    0.96
 15     800        582.48    456.99   95.82   96.31   95.33    0.96
 21    1000        220.48    368.10   96.32   97.11   95.54    0.96
 28    1200        146.16    345.65   96.85   96.95   96.75    0.97
 38    1400        186.02    350.07   97.05   97.35   96.75    0.97
 49    1600        178.54    448.07   97.04   97.74   96.35    0.97
 64    1800         93.30    445.73   96.93   97.93   95.94    0.97

[2022-06-20 23:09:14,508] [INFO] Set up nlp object from config
[2022-06-20 23:09:14,515] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-06-20 23:09:14,517] [INFO] Created vocabulary
[2022-06-20 23:09:14,517] [INFO] Finished initializing nlp object
[2022-06-20 23:09:14,718] [INFO] Initialized pipeline components: ['tok2vec', 'ner']


### Combining the Existing Model with the New NER Model

So far, the model has been trained only on the food-annotated dataset, so it recognizes only the `FOOD` label. Fine-tuning the pretrained model directly on this data would risk catastrophic forgetting of its existing labels. To keep the existing labels as well, we add the new NER component to the base model's pipeline, next to its original `ner` component.

In [46]:
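nlp_ner = spacy.load("./model-best")  # the food NER pipeline trained above
# Give the ner component its own copy of tok2vec so it no longer depends on the shared listener
nlp_ner.replace_listeners("tok2vec", "ner", ["model.tok2vec"])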

In [47]:
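nlp.vocab.strings.add("FOOD")  # add the new label string to the base model's vocab
# Insert the trained food NER before the base model's own ner component
nlp.add_pipe('ner', source=nlp_ner, name='food_ner', before='ner')
nlp.to_disk("ConcatedModelNER")  # save the combined pipeline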

In [48]:

ConcatedNlp = spacy.load("./ConcatedModelNER")
doc = ConcatedNlp('''We take tea in the morning. Tea contains less caffine than coffee. Fresh tea is a delicious treat to start the day with. Dhaka University is a bad institution. Dollars are flying in the sky. Before Dhaka University was established, near its grounds were the former buildings of Dhaka College affiliated to the University of Calcutta. In 1873 the college was relocated to Bahadur Shah Park. Later it shifted to Curzon Hall, which would become the first institute of the university''')
from spacy import displacy  # explicit import of the visualizer
displacy.render(doc, style="ent", jupyter=True)  # display in Jupyter