# Introduction to NER

**Named-entity recognition (NER)** is the process of automatically identifying the entities discussed in a text and classifying them into pre-defined categories such as ‘person’, ‘organization’, ‘location’ and so on.

The **spaCy** library lets you train NER models in two ways: by updating an existing spaCy model to suit the specific context of your text documents, or by training a fresh NER model from scratch.

In spaCy, **named entity recognition** is implemented by the pipeline component "ner". Most of the pretrained models have it in their processing pipeline by default.

In [1]:
import numpy as np
import pandas as pd
import spacy
import random
import spacy.cli

Let us define a sample news text to run through the pipeline.

In [2]:
txt = """Coronavirus Vaccine Registration, Coronavirus Omicron Variant Cases in India Live: Gujarat’s health department has confirmed one more case of Omicron variant which takes up India’s total tally of Omicron cases to three. According to news agency PTI, the infected man is 72-year-old man and had returned to Jamnagar city of Gujarat from Zimbabwe. The Gujarat health department had on Friday informed that the man’s samples were sent for laboratory testing to identify whether he was infected with the new mutant Omicron or not. A South Africa returnee was also booked for violating home quarantine protocol, Chandigarh Health Secretary Yashpal Garg told news agency ANI on Saturday. The woman in question had returned on December 1. Although her RT-PCR report came out to be negative, she still broke home quarantine on December 2 and checked into a hotel. Keeping in view the possibility of travelers breaking protocols and checking into hotels, the local administration has directed the hotels to ask for 15-day travel history of international travelers.

Two cases of Omicron variant were reported from Karnataka on Thursday after which a wave of panic, fear and concern was felt across states. The Ministry of Health and Family Welfare on Thursday issued a list of frequently asked questions about the Omicron variant. The Centre had earlier also informed that given Omicron’s characteristics, it is likely to spread to more countries, including India, but given the high exposure to delta variant and high-paced vaccination, the severity of the disease is expected to be low.
"""

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
doc = nlp(txt)    
olist = []
for token in doc:
    l = [token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_]
    olist.append(l)
    
odf = pd.DataFrame(olist)
odf.columns= ["Text", "StartIndex", "Lemma", "IsPunctuation", "IsSpace", "WordShape", "PartOfSpeech", "POSTag"]
odf

| | Text | StartIndex | Lemma | IsPunctuation | IsSpace | WordShape | PartOfSpeech | POSTag |
|---|---|---|---|---|---|---|---|---|
| 0 | Coronavirus | 0 | Coronavirus | False | False | Xxxxx | PROPN | NNP |
| 1 | Vaccine | 12 | Vaccine | False | False | Xxxxx | PROPN | NNP |
| 2 | Registration | 20 | Registration | False | False | Xxxxx | PROPN | NNP |
| 3 | , | 32 | , | True | False | , | PUNCT | , |
| 4 | Coronavirus | 34 | Coronavirus | False | False | Xxxxx | PROPN | NNP |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 279 | to | 1569 | to | False | False | xx | PART | TO |
| 280 | be | 1572 | be | False | False | xx | AUX | VB |
| 281 | low | 1575 | low | False | False | xxx | ADJ | JJ |
| 282 | . | 1578 | . | True | False | . | PUNCT | . |


So using "nlp" we got a lot of information. The details are as follows:

* Text - The tokenized word
* StartIndex - Character offset at which the token starts in the document
* Lemma - Lemma of the word (no separate lemmatization step is needed)
* IsPunctuation - Whether the token is a punctuation mark
* IsSpace - Whether the token consists only of whitespace
* WordShape - The shape of the word: uppercase letters map to X, lowercase letters to x and digits to d, and runs of the same character type are truncated after four characters (an all-uppercase word such as "INDIA" gives XXXX, an all-lowercase word such as "health" gives xxxx, and a capitalized word such as "Coronavirus" gives Xxxxx)
* PartOfSpeech - Coarse-grained part-of-speech tag of the word (e.g. PROPN)
* POSTag - Fine-grained part-of-speech tag of the word (e.g. NNP)
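The WordShape column can be approximated in plain Python. The sketch below is a re-implementation written for illustration, not spaCy's own code; the function name `word_shape_sketch` is made up here, and it assumes that runs of the same character type are cut off after four characters, which is consistent with the shapes in the table above (Xxxxx for Coronavirus, xxx for low).

```python
def word_shape_sketch(text):
    """Approximate a spaCy-style word shape (illustrative re-implementation)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            kind = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            kind = "d"
        else:
            kind = ch  # punctuation and symbols are kept as-is
        # Count how long the current run of identical shape characters is
        run = run + 1 if kind == last else 1
        last = kind
        if run <= 4:
            shape.append(kind)
    return "".join(shape)

for word in ["Coronavirus", "PTI", "72-year", "low"]:
    print(word, word_shape_sketch(word))
```

Shapes like these let a model treat "Gujarat" and "Zimbabwe" as similar even if it has never seen one of them before.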

**Named Entity Recognition:**

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. 

Named entity recognition comes built into the spaCy package. It is part of the English language model, and we can also train our own entity types if needed.

In [5]:
doc = nlp(txt)
olist = []
for ent in doc.ents:
    olist.append([ent.text, ent.label_])
    
odf = pd.DataFrame(olist)
odf.columns = ["Text", "EntityType"]
odf

| | Text | EntityType |
|---|---|---|
| 0 | Coronavirus Vaccine Registration | ORG |
| 1 | Variant Cases | PERSON |
| 2 | India | GPE |
| 3 | Gujarat | GPE |
| 4 | one | CARDINAL |
| 5 | Omicron | ORG |
| 6 | India | GPE |
| 7 | Omicron | ORG |
| 8 | three | CARDINAL |
| 9 | PTI | ORG |
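Once the entities are in a list of (text, label) pairs, it is often handy to summarize them by type. A minimal sketch, using only the ten rows shown above copied by hand (with a live doc you would build the list from doc.ents instead):

```python
from collections import Counter

# (text, label) pairs copied from the output above
entities = [
    ("Coronavirus Vaccine Registration", "ORG"),
    ("Variant Cases", "PERSON"),
    ("India", "GPE"),
    ("Gujarat", "GPE"),
    ("one", "CARDINAL"),
    ("Omicron", "ORG"),
    ("India", "GPE"),
    ("Omicron", "ORG"),
    ("three", "CARDINAL"),
    ("PTI", "ORG"),
]

# How many entities of each type were found
label_counts = Counter(label for _, label in entities)
print(label_counts.most_common())

# Distinct place names among the rows above
places = sorted({text for text, label in entities if label == "GPE"})
print(places)
```

Rows such as "Variant Cases" tagged as PERSON or "Omicron" tagged as ORG suggest that the pretrained model struggles with terms it has not seen, which is one reason to train custom entities.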


The complete list of entity types can be seen [here](https://spacy.io/usage/linguistic-features#entity-types).

spaCy also includes the displaCy visualizer, which has Jupyter notebook support. It can be used to visualize the recognized named entities.

In [6]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

This one looks cool. We can also restrict the visualization to selected entity types and give them custom colors.

In [7]:
doc = nlp(txt)
colors = {'GPE': 'lightblue', 'NORP':'lightgreen'}
options = {'ents': ['GPE', 'NORP'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)

**Noun Phrase Chunking:**

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world’s largest tech fund". 

Now let us look at how to do noun phrase chunking using spaCy. In addition to the noun phrase itself, spaCy also gives us the root word of each phrase.

In [8]:
doc = nlp(txt)
olist = []
for chunk in doc.noun_chunks:
    olist.append([chunk.text, chunk.label_, chunk.root.text])
odf = pd.DataFrame(olist)
odf.columns = ["NounPhrase", "Label", "RootWord"]
odf

| | NounPhrase | Label | RootWord |
|---|---|---|---|
| 0 | Coronavirus Vaccine Registration | NP | Registration |
| 1 | Coronavirus Omicron Variant Cases | NP | Cases |
| 2 | India | NP | India |
| 3 | Gujarat’s health department | NP | department |
| 4 | one more case | NP | case |
| ... | ... | ... | ... |
| 62 | India | NP | India |
| 63 | the high exposure | NP | exposure |
| 64 | delta variant and high-paced vaccination | NP | vaccination |
| 65 | the severity | NP | severity |


# Custom Training Model

### Training Data Set

In [9]:
# SPECIFY THE NER TRAINING DATA
# Each entity is (start_char, end_char, label); end_char is exclusive.
TRAIN_DATA = [
        ("I have deposited an amount of $500 using my debit card.",{"entities":[(7,16,"action"),(30,34,"amount")]}),
        ("Send $500 to the merchant with account number 1234567890. ",{"entities":[(0,4,"action"),(5,9,"amount")]}),
        ("Transfer 2000 to my new bank account ending with the number 4567. ",{"entities":[(0,8,"action"),(9,13,"amount")]}),
        ("Please deposit 200 in my account. ",{"entities":[(7,14,"action"),(15,18,"amount")]}),
        ("I would like to withdraw $10000 from my bank account. ",{"entities":[(16,24,"action"),(25,31,"amount")]})]
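Character offsets are easy to get wrong by hand, so it helps to print the slice each annotation selects before training. The helper below is a hypothetical sketch (the name `show_entity_spans` is not part of spaCy); it relies only on Python slicing, where the end offset is exclusive. A small sample in the same format is redefined so the block runs on its own.

```python
def show_entity_spans(train_data):
    """Return (label, selected_text) pairs for every annotated span."""
    spans = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            spans.append((label, text[start:end]))
    return spans

sample = [
    ("Send $500 to the merchant with account number 1234567890. ",
     {"entities": [(0, 4, "action"), (5, 9, "amount")]}),
    ("I would like to withdraw $10000 from my bank account. ",
     {"entities": [(16, 24, "action"), (25, 31, "amount")]}),
]

for label, selected in show_entity_spans(sample):
    print(label, repr(selected))
```

A slice that starts or ends mid-word, or that includes a trailing space, signals an offset that needs fixing.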


Load a blank English model and create the NER pipeline component. We also add the new entity labels "action" and "amount" to the model.

In [10]:
# Start from a blank English pipeline (spaCy v2-style API)
custom_nlp = spacy.blank('en')
ner = custom_nlp.create_pipe("ner")
custom_nlp.add_pipe(ner, last=True)

# ADD THE CUSTOM NAMED ENTITIES HERE
ner.add_label('action')
ner.add_label('amount')

# Training

To train an NER model, we loop over the training examples for a sufficient number of iterations.

Before every iteration, it is good practice to shuffle the examples randomly with the random.shuffle() function. This ensures the model does not make generalizations based on the order of the examples.

The training data is usually passed in batches. spaCy's minibatch() function takes the training data and returns it in batches; its size parameter sets the batch size.
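The shuffle-then-batch pattern can be sketched without spaCy. The generator below, `simple_minibatch`, is a hypothetical stand-in written for illustration; in a real training loop you would use spaCy's own minibatch() instead.

```python
import random

def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items (illustrative stand-in)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

examples = [f"example {n}" for n in range(5)]

random.seed(0)            # fixed seed only so repeated runs shuffle the same way
random.shuffle(examples)  # reorder in place before each pass over the data

batches = list(simple_minibatch(examples, size=2))
for batch in batches:
    print(batch)
```

With five examples and a batch size of two, the last batch holds the single leftover example.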

In each iteration, the model (here, the ner component) is updated through the nlp.update() command.

The parameters of nlp.update() are:

**docs**: This expects a batch of texts as input. You can unpack each batch with zip(*batch), which separates it into the texts and their annotations.

**golds**: The batch of annotations, i.e. the second sequence obtained from zip.

**drop**: This represents the dropout rate.

**losses**: A dictionary to hold the losses against each pipeline component. Create an empty dictionary and pass it here.
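To make the docs and golds arguments concrete, here is how one batch of (text, annotations) pairs splits into two parallel sequences with zip. The batch below is a small hand-written sample in the same format as the training data; the nlp.update() call appears only as a comment because it needs a model, and the dropout value in it is an arbitrary example.

```python
batch = [
    ("Send $500 to the merchant.",
     {"entities": [(0, 4, "action"), (5, 9, "amount")]}),
    ("I would like to withdraw $10000.",
     {"entities": [(16, 24, "action"), (25, 31, "amount")]}),
]

# zip(*batch) transposes the list of pairs into (all texts), (all annotations)
texts, annotations = zip(*batch)
losses = {}  # update() would record each component's loss under its name

print(texts)
print(len(annotations))

# With a spaCy v2-style model this batch would then be passed as:
# custom_nlp.update(texts, annotations, drop=0.2, sgd=optimizer, losses=losses)
```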

At each word, update() makes a prediction. It then consults the annotations to check whether the prediction is right. If it isn't, it adjusts the weights so that the correct action will score higher next time.
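The predict, compare, adjust cycle can be illustrated with a toy perceptron. This is not how spaCy is implemented (its NER component is a neural, transition-based model and far more involved); the feature strings, labels and function names below are made up purely to show the idea of raising the score of the correct answer after a mistake.

```python
from collections import defaultdict

def predict(weights, features, labels):
    """Return the label whose summed feature weights score highest."""
    scores = {label: sum(weights[(f, label)] for f in features) for label in labels}
    return max(labels, key=lambda label: scores[label])

def update(weights, features, gold, labels):
    """Predict, compare with the gold label, and adjust weights on a mistake."""
    guess = predict(weights, features, labels)
    if guess != gold:
        for f in features:
            weights[(f, gold)] += 1.0   # correct label scores higher next time
            weights[(f, guess)] -= 1.0  # wrong label scores lower next time
    return guess

labels = ["O", "action", "amount"]
weights = defaultdict(float)
examples = [
    (["word=deposit"], "action"),
    (["word=$200", "has_digit"], "amount"),
    (["word=my"], "O"),
]

# A few passes over the toy data are enough for it to be learned
for _ in range(3):
    for features, gold in examples:
        update(weights, features, gold, labels)

print(predict(weights, ["word=deposit"], labels))
```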

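Finally, when training on top of an existing model, all of the training is done within the context of the nlp model with the other pipeline components disabled, to prevent them from being updated. The blank model used here contains only the ner component, so nothing needs to be disabled.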

In [11]:
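# Name the vectors of the model being trained (custom_nlp, not the pretrained nlp)
custom_nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
optimizer = custom_nlp.begin_training()
for i in range(20):
    # Shuffle so the model does not learn from the order of the examples
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # One example per update; losses accumulates the ner loss for this iteration
        custom_nlp.update([text], [annotations], sgd=optimizer, losses=losses)
# SAVE THE CUSTOM NER MODEL TO DISK
custom_nlp.to_disk("custom_ner_model")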

# Inference

We now load the custom model from disk and use it to extract entities.

In [19]:
#LOAD THE CUSTOM MODEL
nlp2 = spacy.load("custom_ner_model")
doc2 = nlp2("Please deposit $200 in my account.")
for ent in doc2.ents:
    print(ent.label_, ent.text)

action deposit
amount $200
