After reading a text, humans can naturally recognize common entities such as a person's name or a date. For a computer to do the same, we have to help it learn the task. For this we can use `Natural Language Processing (NLP)` and `Machine Learning (ML)`. NLP makes it possible for the computer to read text, communicate with humans, understand their sentiments, and interpret them using the patterns and rules of language. ML helps machines learn and improve over time.

Just as a heartbeat is defined as a two-part pumping action, NER works as a two-step process:

1. Identify the named entity 
2. Categorize the named entity.

### Importing necessary libraries

In [28]:
import random
from pathlib import Path
import spacy
from tqdm import tqdm
from spacy.training.example import Example
import pickle

### Training Data

First, we need to define entity categories such as `Degree`, `School name`, `Location`, `Percentage` and `Date`, and feed the NER model with relevant training data.

spaCy accepts training data as a list of tuples, each containing a text and a dictionary. The dictionary holds, for every named entity in the text, its start and end character offsets (the end offset is exclusive) and its category.
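
Counting character offsets by hand is error-prone, so here is a minimal sketch of one way to derive them from the entity substrings instead. The helper name `find_entities` is hypothetical (it is not part of spaCy), and it assumes the substrings are listed in the order they appear in the text.

```python
def find_entities(text, labelled_substrings):
    """Return (start, end, label) tuples for each (substring, label) pair.

    `end` is exclusive, so text[start:end] == substring.
    Assumes the substrings are given in order of appearance.
    """
    entities = []
    cursor = 0  # search left to right so a repeated word maps to its next occurrence
    for substring, label in labelled_substrings:
        start = text.find(substring, cursor)
        if start == -1:
            raise ValueError(f"{substring!r} not found in {text!r}")
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end
    return entities


text = "Higher School Certificate, Parramatta Marist High School, Westmead (1998)"
entities = find_entities(
    text,
    [
        ("Higher School Certificate", "degree"),
        ("Parramatta Marist High School", "school_name"),
        ("Westmead", "location"),
        ("1998", "date"),
    ],
)
print((text, {"entities": entities}))
```

The resulting tuple has the same shape as one entry of the training data.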

In [31]:
TRAIN_DATA = [('Higher School Certificate, Parramatta Marist High School, Westmead (1998)',
            {'entities':[(0,25,'degree'),(27,56,'school_name'),(58,66,'location'),(68,72,'date')]}),
            
            ('Bachelor of Business, University of Western Sydney (2005) ',
            {'entities':[(0,20,'degree'),(22,43,'school_name'),(44,50,'location'),(52,56,'date')]}),
            
            ('2007–2010 BCA (Bachelor of Computer Application) from Khalsa college for women, Amritsar (Affiliated to Guru Nanak Dev University (G.N.D.U) India ',
            {'entities':[(0,9,'date'),(10,48,'degree'),(54,78,'school_name'),(80,88,'location')]}),
            
            ('2010–2013 MCA (Masters in Computer Applications) from Amritsar College of Engineering, Amritsar (Affiliated to Punjab Technical University (P.T.U) India. ',
            {'entities':[(0,9,'date'),(10,48,'degree'),(54,85,'school_name'),(87,95,'location')]})]
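
Because a span whose boundaries fall inside a word cannot be matched to whole tokens, it is worth printing what each offset pair actually selects before training. The sketch below is a rough, pure-Python sanity check and not spaCy's own alignment logic: the helper `span_is_clean` is hypothetical, and the sample text and the deliberately wrong `(12, 50)` span are made up for illustration.

```python
def span_is_clean(text, start, end):
    """Heuristic check that text[start:end] does not cut through a word.

    Returns False if the offsets are out of range, if the span starts or
    ends with whitespace, or if a boundary splits a run of letters/digits.
    """
    if not 0 <= start < end <= len(text):
        return False
    snippet = text[start:end]
    if snippet != snippet.strip():
        return False
    splits_start = start > 0 and text[start - 1].isalnum() and text[start].isalnum()
    splits_end = end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
    return not (splits_start or splits_end)


sample = "2007–2010 BCA (Bachelor of Computer Application) from Khalsa college for women, Amritsar"
candidates = [(0, 9, "date"), (10, 48, "degree"), (12, 50, "degree")]

for start, end, label in candidates:
    status = "ok" if span_is_clean(sample, start, end) else "CHECK"
    print(f"{status:5} {label:8} {sample[start:end]!r}")
```

Here the `(12, 50)` span is flagged because it begins in the middle of `BCA` and ends in the middle of `from`.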

### Creating Blank Model

The first step in building a custom model is to create a blank ‘en’ model, which will then be trained to carry out NER.

In [32]:
model = None
output_dir=Path("ner/")
n_iter=100

#load the model

if model is not None:
    nlp = spacy.load(model)  
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')  
    print("Created blank 'en' model")

Created blank 'en' model


### Pipeline Set-up

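The next step is to set up the pipeline with only the NER component. In spaCy v3, `nlp.add_pipe` creates the component, adds it to the pipeline, and returns it.

In [33]:
#set up the pipeline

if 'ner' not in nlp.pipe_names:
    # add_pipe returns the component that was actually added to the pipeline;
    # a separate create_pipe call would produce a second, unattached component
    ner = nlp.add_pipe('ner', last=True)
else:
    ner = nlp.get_pipe('ner')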

### Training the model

Before training, we add the entity categories (labels) to the ‘ner’ component with the `ner.add_label()` method. We then disable every pipeline component other than ‘ner’, using the `nlp.disable_pipes()` method, so that those components are not affected by training.
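
As a small illustration of which labels end up being registered, the following sketch collects the distinct categories from annotations in the training-data format. `sample_data` is a shortened, hypothetical stand-in for the real training list.

```python
# Two shortened entries in the same (text, {"entities": [...]}) format
sample_data = [
    ("Higher School Certificate, Westmead (1998)",
     {"entities": [(0, 25, "degree"), (27, 35, "location"), (37, 41, "date")]}),
    ("Bachelor of Business, University of Western Sydney",
     {"entities": [(0, 20, "degree"), (22, 50, "school_name")]}),
]

# Each entity is (start, end, label); a set removes duplicates
labels = sorted(
    {label for _, annotations in sample_data for _, _, label in annotations["entities"]}
)
print(labels)  # ['date', 'degree', 'location', 'school_name']
```

Each of these strings is what would be passed to `ner.add_label()`.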

To train the ‘ner’ model, we loop over the training data for a sufficient number of iterations, set here by `n_iter` (100). To ensure that the model does not generalize from the order of the examples, we shuffle the training data before every iteration with the `random.shuffle()` function.
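
The per-iteration shuffle can be sketched on its own as follows. This variant differs from the notebook in two ways that are assumptions made here: it uses a seeded `random.Random` so that runs are repeatable, and it shuffles a copy so the original list keeps its order. The placeholder strings stand in for training examples.

```python
import random

sample_data = ["example A", "example B", "example C", "example D"]  # placeholders
rng = random.Random(0)  # fixed seed, chosen here only for repeatability

orders = []
for epoch in range(3):
    epoch_data = sample_data[:]  # copy, so sample_data itself is left untouched
    rng.shuffle(epoch_data)      # in-place shuffle of the copy
    orders.append(epoch_data)
    print(epoch, epoch_data)
```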

We use the `tqdm()` function to display a progress bar. The `Example` class holds the information for one training instance. It stores two objects: one holding the predictions of the pipeline and the other holding the reference data. The `Example.from_dict(doc, annotations)` method constructs an `Example` object from the predicted document (`doc`) and the reference annotations provided as a dictionary (`annotations`). The `nlp.update()` function is then used to train the recognizer.

In [34]:
# adding labels to ner
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])  # ent is (start, end, label)

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()  # initializes the weights and returns the optimizer
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)  # new example order every iteration
        losses = {}  # accumulated over this iteration's updates
        for text, annotations in tqdm(TRAIN_DATA):
            doc = nlp.make_doc(text)  # tokenize only, no predictions
            example = Example.from_dict(doc, annotations)
            nlp.update(
                [example],
                drop=0.5,  # dropout rate
                sgd=optimizer,
                losses=losses)
        print(losses)

100%|██████████| 4/4 [00:00<00:00, 23.48it/s]

{'ner': 63.28670035302639}

100%|██████████| 4/4 [00:00<00:00, 28.08it/s]

{'ner': 59.828527707606554}

100%|██████████| 4/4 [00:00<00:00, 28.74it/s]

{'ner': 52.66073753032833}

...

100%|██████████| 4/4 [00:00<00:00, 31.24it/s]

{'ner': 3.567108196151816}

100%|██████████| 4/4 [00:00<00:00, 31.97it/s]

{'ner': 0.0007736536440640603}

### Saving the model

Save the model to the directory stored in the `output_dir` variable and also export it as a pickle (`.pkl`) file.

In [36]:
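if output_dir is not None:
    output_dir = Path(output_dir)
    # create the directory (and any missing parents) if it does not exist yet
    output_dir.mkdir(parents=True, exist_ok=True)
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)
# use a context manager so the pickle file is flushed and closed
with open("education nlp.pkl", "wb") as f:
    pickle.dump(nlp, f)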

Saved model to ner


### Testing the trained model

In [37]:
doc=nlp("2015-2017, BE Chemical Engineering, Coimbatore Institute of Technology , India")
for ent in doc.ents:
    print(ent.label_+ '  ------>   ' + ent.text)

date  ------>   2015
degree  ------>   -2017, BE Chemical Engineering
school_name  ------>   Coimbatore Institute of Technology




Entities that were not recognised correctly can be assigned manually.

`spacy.tokens.Span(doc, start, end)` takes token indices, and `start:end` works like ordinary list slicing in Python (the end index is exclusive).
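
To make the token-index slicing concrete, here is a small pure-Python sketch. The `tokens` list is written by hand to mirror how the test sentence appears to be split, judging from the output above; it is an assumption, not something produced by running the tokenizer.

```python
# Hand-written stand-in for the first tokens of the test sentence
tokens = ["2015", "-", "2017", ",", "BE", "Chemical", "Engineering"]

start, end = 0, 3  # same indices as a span over tokens 0, 1 and 2
selected = tokens[start:end]

print(selected)           # ['2015', '-', '2017']
print("".join(selected))  # 2015-2017
```

The slice stops before index 3, so the comma is not included.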

In [39]:
from spacy.tokens import Span

s1 = Span(doc,0,3,label = 'date')

doc.set_ents([s1],default='unmodified')

Previously `2017` was categorized as part of `degree`, but now it is categorized correctly as part of `date`.

In [40]:
for ent in doc.ents:
    print(ent.label_+ '  ------>   ' + ent.text)

date  ------>   2015-2017
degree  ------>   , BE Chemical Engineering
school_name  ------>   Coimbatore Institute of Technology
