**Named-entity recognition (NER)** is the process of automatically identifying the entities discussed in a text and classifying them into pre-defined categories such as ‘person’, ‘organization’, ‘location’ and so on. 

A named entity recognizer is a model that performs this task. It should identify named entities such as ‘America’, ‘Emily’, and ‘London’ and categorize them as PERSON, LOCATION, and so on. It is a very useful tool for information retrieval.


The **spaCy library** lets you train NER models either by updating an existing spaCy model to suit the specific context of your text documents or by training a fresh NER model from scratch.

In spaCy, named entity recognition is implemented by the pipeline component ner. Most of the pretrained models have it in their processing pipeline by default.

**NER with spaCy (nothing customised yet)**

In [None]:
#using spacy
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp=en_core_web_sm.load()

In [None]:
doc=nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')

In [None]:
print(type(doc))
for x in doc.ents:
  print(x) #displaying all entities found in doc

<class 'spacy.tokens.doc.Doc'>
European
Google
$5.1 billion
Wednesday


In [None]:
[(x.text,x.label,x.label_) for x in doc.ents] #displaying text, integer label ID and label name

#European is NORP (nationalities or religious or political groups), Google is an organization, $5.1 billion is a monetary value and Wednesday is a date. They are all correct.

[('European', 381, 'NORP'),
 ('Google', 383, 'ORG'),
 ('$5.1 billion', 394, 'MONEY'),
 ('Wednesday', 391, 'DATE')]

**visualising results**

In [None]:
# to display results
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True, options={'distance': 90}) #visualising results

print("\n NORP ", spacy.explain("NORP"))


 NORP  Nationalities or religious or political groups


In [None]:
#a named entity can be a person, organisation, quantity, date etc
import spacy
nlp=spacy.load('en_core_web_sm')

print("by default pipeline is ",nlp.pipe_names)
#In case your model does not have the ner component, you can add it using the nlp.add_pipe() method.

doc=nlp("Australia wants Facebook and Google to pay media companies for news")

print("\ntype(doc) ",type(doc))

for ent in doc.ents:
  print("\n  ent :{} ---> ent.label_ : {} end.start_char: {}  end.end_char: {} ".format( ent ,ent.label_ , ent.start_char, ent.end_char))


by default pipeline is  ['tagger', 'parser', 'ner']

type(doc)  <class 'spacy.tokens.doc.Doc'>

  ent :Australia ---> ent.label_ : GPE end.start_char: 0  end.end_char: 9 

  ent :Facebook and Google ---> ent.label_ : ORG end.start_char: 16  end.end_char: 35 


**Need for a custom NER**

In [None]:
import spacy
import en_core_web_sm

nlp=en_core_web_sm.load()
doc=nlp('tennis champion Emerson was expected to win Wimbeldon')
print(doc.ents)
for ent in doc.ents:
    print(ent.text, ent.label_)  #Wimbeldon is wrongly classified as PERSON; we need a custom NER in such cases

(Emerson, Wimbeldon)
Emerson PERSON
Wimbeldon PERSON


In [None]:
#In cases like the one below, we need custom entity recognition. This is a finance-specific text and the default NER finds no entities in it.
doc=nlp("I donot have money to pay my credit card account")

print("\ntype(doc) ",type(doc))

print(len(doc.ents))

for ent in doc.ents:
  print("\n  ent :{} ---> ent.label_ : {} end.start_char: {}  end.end_char: {} ".format( ent ,ent.label_ , ent.start_char, ent.end_char))


type(doc)  <class 'spacy.tokens.doc.Doc'>
0


**Updating the Named Entity Recognizer**

In [None]:
#To do this, let’s use an existing pre-trained spacy model and update it with newer examples.

#First , let’s load a pre-existing spacy model with an in-built ner component. Then, get the Named Entity Recognizer using get_pipe() method .

import spacy
nlp=spacy.load('en_core_web_sm')
ner=nlp.get_pipe('ner')
print("\n existing ner.labels :",ner.labels)
print("\n len(ner.labels) :",len(ner.labels))


 existing ner.labels : ('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')

 len(ner.labels) : 18


**Format of the training examples**

In [None]:
#To update a pretrained model, you’ll have to provide many examples to meaningfully improve the system.

#spaCy accepts training data as a list of tuples.

#Each tuple should contain the text and a dictionary. The dictionary should hold the start and end character offsets of each named entity in the text, and the category or label of the named entity. The end offset is exclusive, so text[start:end] should be exactly the entity.

#For example, ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})

In [None]:
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(19, 28, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(24,32, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16,19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20,29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0,6, "PRODUCT")]}),
              ("I bought a new Washer", {"entities": [(16,22, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(16,21, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(18,23, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(12,18, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(12,16, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(12,22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(15,23, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(16,21, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(16,21, "PRODUCT")]}),
              ("Flipkart started it's journey from zero", {"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24,27, "ORG")]}),
              ("Flipkart is recognized as leader in market",{"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24,29, "ORG")]})
              ]
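
Hand-counted character offsets are easy to get wrong. For instance, slicing "I reached Chennai yesterday." with (19, 28) yields "esterday." rather than "Chennai". The following is a minimal sketch, not part of spaCy, for checking what a span actually covers and for deriving offsets by searching for the entity string; the helper names `span_text` and `find_span` are made up for illustration.

```python
def span_text(text, start, end):
    """Return the substring that a (start, end) annotation actually covers."""
    return text[start:end]

def find_span(text, entity, label):
    """Build a (start, end, label) tuple by searching for the entity string."""
    start = text.find(entity)
    if start == -1:
        raise ValueError(f"{entity!r} not found in {text!r}")
    return (start, start + len(entity), label)

# Two annotations taken from the training list above.
samples = [
    ("Walmart is a leading e-commerce company", (0, 7, "ORG")),
    ("I reached Chennai yesterday.", (19, 28, "GPE")),
]
covered = [span_text(text, start, end) for text, (start, end, _) in samples]
print(covered)  # ['Walmart', 'esterday.']

# Deriving the offsets from the entity string avoids counting by hand.
fixed = find_span("I reached Chennai yesterday.", "Chennai", "GPE")
print(fixed)  # (10, 17, 'GPE')
```

Running a check like this over every training example before training makes misaligned spans visible early. Note that `find_span` only locates the first occurrence of the entity string, which is enough for short sentences like these.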

In [None]:
for text, annotations in TRAIN_DATA:
  print(text ," and annotation  is : ", annotations)

Walmart is a leading e-commerce company  and annotation  is :  {'entities': [(0, 7, 'ORG')]}
I reached Chennai yesterday.  and annotation  is :  {'entities': [(19, 28, 'GPE')]}
I recently ordered a book from Amazon  and annotation  is :  {'entities': [(24, 32, 'ORG')]}
I was driving a BMW  and annotation  is :  {'entities': [(16, 19, 'PRODUCT')]}
I ordered this from ShopClues  and annotation  is :  {'entities': [(20, 29, 'ORG')]}
Fridge can be ordered in Amazon   and annotation  is :  {'entities': [(0, 6, 'PRODUCT')]}
I bought a new Washer  and annotation  is :  {'entities': [(16, 22, 'PRODUCT')]}
I bought a old table  and annotation  is :  {'entities': [(16, 21, 'PRODUCT')]}
I bought a fancy dress  and annotation  is :  {'entities': [(18, 23, 'PRODUCT')]}
I rented a camera  and annotation  is :  {'entities': [(12, 18, 'PRODUCT')]}
I rented a tent for our trip  and annotation  is :  {'entities': [(12, 16, 'PRODUCT')]}
I rented a screwdriver from our neighbour  and annotation  is :  {'e

In [None]:
"""
The output above shows the training format. Each label has to be added to the ner component using its ner.add_label() method. The code below demonstrates this.
"""
for _, annotations in TRAIN_DATA:
   for ent in annotations.get("entities"):
      print("adding label to ner : " ,ent[2])
      ner.add_label(ent[2])


      

adding label to ner :  ORG
adding label to ner :  GPE
adding label to ner :  ORG
adding label to ner :  PRODUCT
adding label to ner :  ORG
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  PRODUCT
adding label to ner :  ORG
adding label to ner :  ORG
adding label to ner :  ORG
adding label to ner :  ORG


In [None]:
print("\n existing ner.labels :",ner.labels)
print("\n len(ner.labels) :",len(ner.labels))


 existing ner.labels : ('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')

 len(ner.labels) : 18


In [None]:
"""
Now it’s time to train the NER over these examples. But before you train, remember that apart from ner, the model has other pipeline components. These components should not be affected by the training.

So, disable the other pipeline components through the nlp.disable_pipes() method.
"""

pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes= [ pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions ]
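
To make the filter concrete, here is a small standalone sketch that applies the same list comprehension to a hard-coded copy of the pipeline names printed earlier for en_core_web_sm. The `pipe_names` list is an assumption standing in for nlp.pipe_names, so no model needs to be loaded.

```python
# Stand-in for nlp.pipe_names, copied from the pipeline printed earlier.
pipe_names = ["tagger", "parser", "ner"]

# Components that should stay enabled during training.
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# Everything else gets disabled while ner is being updated.
unaffected_pipes = [pipe for pipe in pipe_names if pipe not in pipe_exceptions]
print(unaffected_pipes)  # ['tagger', 'parser']
```

The two trf_ names only matter for transformer-based pipelines. For this model they match nothing, so the tagger and parser are the components that end up disabled.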


**Training the NER model**

In [None]:
#All of the training is done within the context of the nlp model with the other pipeline components disabled, to prevent them from being involved.

#Training the NER model

#To train an ner model, the model has to loop over the examples for a sufficient number of iterations.

#Before every iteration it’s good practice to shuffle the examples randomly with the random.shuffle() function.

#The training data is usually passed in batches.

#You can call the minibatch() function of spaCy over the training data, which returns the data in batches.
  #The minibatch function takes a size parameter to denote the batch size. You can make use of the utility function compounding() to generate an infinite series of compounding values.

#The compounding() function takes three inputs: start (the first value), stop (the maximum value that can be generated) and compound.
  #The value stored in compound is the compounding factor for the series: each value is the previous one multiplied by it, until stop is reached.

#For each iteration, the model or ner is updated through the nlp.update() command. Parameters of nlp.update() are:
  #docs: This expects a batch of texts as input. You can pass each batch to the zip method, which will return you batches of text and annotations.
  #drop: The dropout rate (0.5 in the cells below).
  #losses: A dictionary to which update() adds the loss of each trained component.

#At each word, update() makes a prediction. It then consults the annotations to check if the prediction is right. If it isn’t, it adjusts the weights so that the correct action will score higher next time.
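
The batching described above can be illustrated without spaCy. The following is a rough pure-Python sketch of the behaviour, not spaCy's own implementation: `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins for compounding() and minibatch(), the sketch assumes a compounding factor greater than 1, and the 19 placeholder examples only mimic the shape of the training data.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... never exceeding stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose lengths are drawn from sizes."""
    items = iter(items)
    for size in sizes:
        batch = list(islice(items, int(size)))
        if not batch:
            return
        yield batch

# 19 placeholder examples in the same (text, annotations) shape as the training data.
examples = [("text %d" % i, {"entities": []}) for i in range(19)]

batches = list(minibatch_sketch(examples, compounding_sketch(4.0, 32.0, 1.001)))
batch_lengths = [len(batch) for batch in batches]
print(batch_lengths)  # [4, 4, 4, 4, 3]

# zip(*batch) separates a batch into a tuple of texts and a tuple of annotations.
texts, annotations = zip(*batches[0])
print(texts)  # ('text 0', 'text 1', 'text 2', 'text 3')
```

With a factor as small as 1.001 the batch size stays at 4 for a long time, so 19 examples split into four batches of 4 and one of 3. The size only grows noticeably on much larger datasets.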

In [None]:
#making predictions before training

doc=nlp("I was driving a Alto")
print("\n entities ,",[ (ent.text , ent.label_) for ent in doc.ents])
#below results are wrong


 entities , [('Alto', 'LOC')]


**training and analysing results after one iteration**

In [None]:
#training with just one iteration, printing the details of each batch

from spacy.util import minibatch, compounding
import random

print("len(TRAIN_DATA)",len(TRAIN_DATA))
with nlp.disable_pipes(*unaffected_pipes):
  for iteration in range(1): # a single iteration
      random.shuffle(TRAIN_DATA)# randomly shuffling before every iteration
      losses={}

      batches= minibatch(TRAIN_DATA,size=compounding(4.0, 32.0, 1.001))

      for c,batch in enumerate(batches):
         print("\nbatch no {} length is {}".format(c,len(batch)))
         print("contents of batch {} are:\n {}".format(c,batch))

         text,annotation=zip(*batch)
         nlp.update(text,annotation,drop=0.5,losses=losses) 
         print("\nloss is {}",losses)
         
#Loop over the examples and call nlp.update, which steps through the words of the input.
#At each word, update makes a prediction. 
#It then consults the annotations, to see whether it was right. 
#If it was wrong, it adjusts its weights so that the correct action will score higher next time

#Finally, all of the training is done within the context of the nlp model with disabled pipeline, to prevent the other components from being involved.


len(TRAIN_DATA) 19

batch no 0 length is 4
contents of batch 0 are:
 [('I got my clock fixed', {'entities': [(16, 21, 'PRODUCT')]}), ('I rented a camera', {'entities': [(12, 18, 'PRODUCT')]}), ('I bought a new Washer', {'entities': [(16, 22, 'PRODUCT')]}), ('Flipkart is recognized as leader in market', {'entities': [(0, 8, 'ORG')]})]

loss is {} {'ner': 1.4224304639210459}

batch no 1 length is 4
contents of batch 1 are:
 [('I bought a fancy dress', {'entities': [(18, 23, 'PRODUCT')]}), ('I rented a tent for our trip', {'entities': [(12, 16, 'PRODUCT')]}), ('Fridge can be ordered in Amazon ', {'entities': [(0, 6, 'PRODUCT')]}), ('I repaired my computer', {'entities': [(15, 23, 'PRODUCT')]})]

loss is {} {'ner': 6.491001476853853}

batch no 2 length is 4
contents of batch 2 are:
 [('Walmart is a leading e-commerce company', {'entities': [(0, 7, 'ORG')]}), ('I recently ordered from Max', {'entities': [(24, 27, 'ORG')]}), ('I was driving a BMW', {'entities': [(16, 19, 'PRODUCT')]}), ("Fli

**Training with 30 iterations**

In [None]:
#training with 30 iterations

from spacy.util import minibatch, compounding
import random

print("len(TRAIN_DATA)",len(TRAIN_DATA))
with nlp.disable_pipes(*unaffected_pipes):
  for iteration in range(30): # running 30 iterations
      random.shuffle(TRAIN_DATA)# randomly shuffling before every iteration
      losses={} # reset per iteration; nlp.update() adds each batch's loss to it

      batches= minibatch(TRAIN_DATA,size=compounding(4.0, 32.0, 1.001)) # batch up the examples using spaCy's minibatch

      for batch in batches:
         text,annotation=zip(*batch)
         nlp.update(text,annotation,drop=0.5,losses=losses) 
         print("\n iteration {} , loss is {}".format(iteration,losses)) # running total, so it grows within an iteration


len(TRAIN_DATA) 19

 iteration 0 , loss is {'ner': 3.576853639235196}

 iteration 0 , loss is {'ner': 3.9724964438864845}

 iteration 0 , loss is {'ner': 6.747934089646151}

 iteration 0 , loss is {'ner': 14.488696412433683}

 iteration 0 , loss is {'ner': 18.688489220092606}

 iteration 1 , loss is {'ner': 4.0160229475659435}

 iteration 1 , loss is {'ner': 5.3159096537253845}

 iteration 1 , loss is {'ner': 7.026368742975379}

 iteration 1 , loss is {'ner': 9.305351496607969}

 iteration 1 , loss is {'ner': 9.314088947620647}

 iteration 2 , loss is {'ner': 3.814085906094988}

 iteration 2 , loss is {'ner': 3.8918759282462005}

 iteration 2 , loss is {'ner': 5.111412194477644}

 iteration 2 , loss is {'ner': 8.204993931065708}

 iteration 2 , loss is {'ner': 10.491811769773022}

 iteration 3 , loss is {'ner': 1.2740659473674896}

 iteration 3 , loss is {'ner': 3.2785224414822522}

 iteration 3 , loss is {'ner': 4.759598728817579}

 iteration 3 , loss is {'ner': 10.581819293986698}

 

**Let’s predict on new texts the model has not seen**

In [None]:
#making predictions after training

doc=nlp("I was driving a Alto")
print("\n entities ,",[ (ent.text , ent.label_) for ent in doc.ents])

doc = nlp("Fridge can be ordered in FlipKart" )
print("\n entities ,",[ (ent.text , ent.label_) for ent in doc.ents])


 entities , [('Alto', 'PRODUCT')]

 entities , [('Fridge', 'PRODUCT'), ('FlipKart', 'ORG')]


In [None]:
#You can observe that even though I didn’t directly train the model to recognize “Alto” as a vehicle name, it has predicted the label based on the similarity of context.
#This is the awesome part of the NER model.

#The model should not just memorize the training examples. It should learn from them and be able to generalize to new examples.

#Once you find the performance of the model satisfactory, save the updated model.

#You can save it to your desired directory through the to_disk command.
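
Saving and reloading might look like the sketch below. It assumes spaCy is installed and uses a blank English pipeline as a stand-in for the updated nlp object; the directory name `updated_ner_model` is hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import tempfile
from pathlib import Path

import spacy

# Stand-in for the updated pipeline; in the notebook this would be the trained nlp object.
nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "updated_ner_model"  # hypothetical directory name
    nlp.to_disk(output_dir)            # write the pipeline to disk
    reloaded = spacy.load(output_dir)  # load it back from the same directory
    same_pipeline = reloaded.pipe_names == nlp.pipe_names

print(same_pipeline)
```

In practice you would write to a permanent directory instead of a temporary one, and then run a few test sentences through the reloaded pipeline to confirm it predicts the same entities as the one in memory.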