<p style="text-align:center">
    <a href="https://www.linkedin.com/company/mt-learners/?viewAsMember=true" target="_blank">
    <img src="https://github.com/Mr-MeerMoazzam/Mr-MeerMoazzam/blob/main/Untitled-2.jpg?raw=true" width="200" alt="MT Learners"  />
    </a>
</p>

# Introduction

In this notebook, you will learn the essentials of spaCy, Named Entity Recognition (NER), and how to implement NER in Python. We will cover the building blocks of spaCy 3 and rule-based spaCy in detail. By the end of the notebook, you will have the theoretical and practical knowledge needed to design your own custom NER in spaCy 3, and you will understand pipes and pipelines in general as well as the ones that spaCy provides.

# Outline

Here's a step-by-step outline of the notebook.
<ol>
    <li>Basic Definitions (NLP, SpaCy, NER, etc)</li>    
    <li>Basics of SpaCy </li>
    <li>SpaCy's Linguistic Annotation</li>
    <li>Word Vectors</li>
    <li>Pipelines in SpaCy</li>
    <li>Rule-Based NER</li>
    <li>RegEx Role in SpaCy</li>
    <li>Build Custom NER</li>
    <li>Closing Remarks</li>
</ol>


# Basic Definitions

## What is NLP?


Computers excel at working with tabular data such as spreadsheets. Humans, however, do not typically communicate in tables; they use words and sentences. Human speech and writing therefore produce a great deal of unstructured text, which is not straightforward for a computer to interpret. Natural language processing (NLP) is a subfield of artificial intelligence concerned with enabling computers to comprehend unstructured text and extract useful information from it.

## What is spaCy?

spaCy is a free, open-source Python package for advanced Natural Language Processing (NLP). If you work with text frequently, you will want to answer questions such as: What is the text about? What do the words mean in context? Who is doing what to whom? Which brand or product names are mentioned? Which texts are similar to each other? spaCy helps answer these questions.

spaCy is designed specifically for production use and helps you build applications that process and understand large volumes of text. It can be used to build information extraction or natural language understanding systems, or to prepare text for deep learning.

# Named Entity Recognition  (NER)

Named entity recognition is one of the most important entity detection techniques in NLP. It automatically scans a text, identifies key items, and classifies them into predefined categories. Put simply, it is the process of extracting named entities from text, such as the names of people, places, and businesses. Entities are often the most informative parts of a sentence and are typically noun phrases. NER is also referred to as entity identification, entity extraction, or entity chunking.

# Basic Building Blocks of SpaCy

## Install spaCy and download the small model




In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# spaCy's Containers

Containers are spaCy objects that hold a large amount of information about a text. spaCy uses several container objects to analyze texts. You should be familiar with the following containers.<br>
**1. Doc**<br>
A Doc is a sequence of Token objects and the place where linguistic annotations are stored. Sentences and named entities can both be accessed through the Doc class. Annotations can also be exported to NumPy arrays and serialized to a compressed binary format.<br> 
**2. Token**<br>
A token, as its name suggests, represents a single element, such as a word, a punctuation mark, whitespace, or a symbol.<br>
**3. Span**<br>
A Span is a slice of the Doc object described above.<br>
**4. Lexeme**<br>
A Lexeme is an entry in the vocabulary. In contrast to a word token, a Lexeme has no context. It is a word type, so it has no PoS (Part-of-Speech) tag, dependency parse, or lemma.

# SpaCy's Linguistic Annotation

## Importing spaCy and loading data

In [32]:
import spacy
nlp=spacy.load("en_core_web_sm")
#loading data of U.S. for analysis
with open("wiki_us.txt","r") as f:
     text=f.read()

print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[i] and 326 Indian reservations with limited sovereignty. It is the third-largest country by both land and total area.[c] The United States shares land borders with Canada to the north and with Mexico to the south as well as maritime borders with the Bahamas, Cuba, and Russia, among others.[j] It has a population of over 331 million,[d] and is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City. The United States is a melting pot of cultures and ethnicities, and its population has been profoundly shaped by centuries of immigration. It has a highly diverse climate and geography and is officially recognized 

## Create Doc container

In [33]:
doc = nlp(text)

In [34]:
doc

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[i] and 326 Indian reservations with limited sovereignty. It is the third-largest country by both land and total area.[c] The United States shares land borders with Canada to the north and with Mexico to the south as well as maritime borders with the Bahamas, Cuba, and Russia, among others.[j] It has a population of over 331 million,[d] and is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City. The United States is a melting pot of cultures and ethnicities, and its population has been profoundly shaped by centuries of immigration. It has a highly diverse climate and geography and is officially recognized 

## Difference Between the Doc Object and the Raw Text

In [35]:
print(len(doc))
print(len(text))


811
4468


In [36]:
for token in doc[:10]:
  print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [37]:
for token in text.split()[:10]:
  print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


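<b>Note:</b>
spaCy's tokenizer separates punctuation from words and keeps abbreviations such as U.S.A. together, whereas split() only cuts at whitespace. That is why a plain split() is rarely sufficient in real-world scenarios.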
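To make the difference concrete, here is a minimal sketch that uses only Python's standard library. The regular expression is a toy rule set invented for this illustration (it is not how spaCy's tokenizer is implemented), and it assumes that any run of single letters followed by periods is an abbreviation.

```python
import re

sample = "America (U.S.A. or USA), commonly known"

# Whitespace splitting leaves punctuation glued to the neighbouring words.
whitespace_tokens = sample.split()

# Toy rules, tried in order: dotted abbreviations, plain words, single punctuation marks.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.)+|\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_tokens)
print(regex_tokens)
```

Even this small rule set has blind spots (a sentence ending in a one-letter word would be treated as an abbreviation), which is one reason to rely on a tested tokenizer rather than hand-written rules.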

## Sentence Boundary Detection


Sentence boundary detection (SBD) is the process of identifying the sentences in a text. At first glance this looks simple to do with rules. One could use split("."), but in English the period also marks abbreviations. You could write additional rules, for example treating a period as a boundary only when it is not followed by a lowercase word, but why bother? With spaCy, SBD splits a text into sentences in a matter of seconds.
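The problem with splitting on periods is easy to reproduce with plain Python. The sample string below is made up for this sketch; it contains two sentences and one abbreviation.

```python
text_sample = "The U.S. is large. It has 50 states."

# Naive approach: treat every period as the end of a sentence.
naive_sentences = [part.strip() for part in text_sample.split(".") if part.strip()]

print(naive_sentences)
print(len(naive_sentences), "pieces for a text with 2 sentences")
```

The abbreviation is cut into separate pieces, so the naive approach reports four "sentences" instead of two.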

By using the attribute sents, we can use the following syntax to retrieve the sentences in the Doc container:

In [38]:
for sentence in doc.sents:
  print(sentence)
  print('Next Sentence')

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country primarily located in North America.
Next Sentence
It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[i] and 326 Indian reservations with limited sovereignty.
Next Sentence
It is the third-largest country by both land and total area.[c] The United States shares land borders with Canada to the north and with Mexico to the south as well as maritime borders with the Bahamas, Cuba, and Russia, among others.[j]
Next Sentence
It has a population of over 331 million,[d] and is the third most populous country in the world.
Next Sentence
The national capital is Washington, D.C., and the most populous city and financial center is New York City.
Next Sentence
The United States is a melting pot of cultures and ethnicities, and its population has been profoundly shaped by centuries of immigration.
Next 

In [39]:
sent_list=list(doc.sents)

In [40]:
sent_list[0]

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country primarily located in North America.

## Token Attributes


The Token object has a wide variety of attributes that are essential for doing NLP in spaCy. We will work with some of them, including:<br>

text: Verbatim text content<br>
head: The syntactic parent, or "governor", of this token<br>
left_edge: The leftmost token of this token's syntactic descendants<br>
right_edge: The rightmost token of this token's syntactic descendants<br>
ent_type_: Named entity type of the token<br>
ent_iob_: IOB code of the named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set<br>
lemma_: Base form of the token, with no inflectional suffixes<br>
morph: Morphological analysis of the token<br>
pos_: Part-of-speech from the Universal POS tag set<br>
lang_: Language of the parent document's vocabulary<br>
dep_: Syntactic dependency relation of the token<br>
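The IOB scheme is easier to follow with a small worked example. The sketch below is plain Python rather than spaCy code: the helper iob_tags and the hand-written entity spans are hypothetical, and the spans are assumed not to overlap.

```python
def iob_tags(tokens, entities):
    """Return one (iob, label) pair per token.

    `entities` is a list of (start, end, label) token spans, end exclusive.
    """
    tags = [("O", "")] * len(tokens)
    for start, end, label in entities:
        tags[start] = ("B", label)          # first token begins the entity
        for i in range(start + 1, end):
            tags[i] = ("I", label)          # remaining tokens are inside it
    return tags


tokens = ["New", "York", "City", "is", "in", "America"]
entities = [(0, 3, "GPE"), (5, 6, "GPE")]

for token, (iob, label) in zip(tokens, iob_tags(tokens, entities)):
    print(token, iob, label)
```

Tokens outside any entity receive "O" and an empty label, which mirrors the empty entity type that spaCy reports for such tokens.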

In [41]:
sentence1=list(doc.sents)[0]

In [42]:
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country primarily located in North America.

In [43]:
token12=sentence1[12]

In [44]:
print("Token Text : {}".format(token12.text))
print("The syntactic parent of this token : {}".format(token12.head))
print("The leftmost token of this token’s syntactic descendants : {}".format(token12.left_edge))
print("The rightmost token of this token’s syntactic descendants : {}".format(token12.right_edge))
print("Token Named entity type : {}".format(token12.ent_type_))
print("Token IOB code : {}".format(token12.ent_iob_))
print("Token Morphological analysis : {}".format(token12.morph))
print("Token Part of Speech : {}".format(token12.pos_))
print("Token Syntactic dependency relation : {}".format(token12.dep_))
print("Token Language : {}".format(token12.lang_))
print("Token Base form of the token : {}".format(token12.lemma_))

Token Text : known
The syntactic parent of this token : States
The leftmost token of this token’s syntactic descendants : commonly
The rightmost token of this token’s syntactic descendants : America
Token Named entity type : 
Token IOB code : O
Token Morphological analysis : Aspect=Perf|Tense=Past|VerbForm=Part
Token Part of Speech : VERB
Token Syntactic dependency relation : acl
Token Language : en
Token Base form of the token : know


## Part of Speech Tagging (POS)

In [14]:
for token in sentence1:
    print (token.text, token.pos_)

The DET
United PROPN
States PROPN
of ADP
America PROPN
( PUNCT
U.S.A. PROPN
or CCONJ
USA PROPN
) PUNCT
, PUNCT
commonly ADV
known VERB
as ADP
the DET
United PROPN
States PROPN
( PUNCT
U.S. PROPN
or CCONJ
US PROPN
) PUNCT
or CCONJ
America PROPN
, PUNCT
is AUX
a DET
transcontinental ADJ
country NOUN
primarily ADV
located VERB
in ADP
North PROPN
America PROPN
. PUNCT


In [15]:
from spacy import displacy
displacy.render(sentence1, style="dep", jupyter = True)

In [16]:
sentence= "I am enjoying weather"
doc1=nlp(sentence)


In [17]:
for token in doc1:
    print (token.text, token.pos_, token.dep_)

I PRON nsubj
am AUX aux
enjoying VERB ROOT
weather NOUN dobj


In [18]:
from spacy import displacy
displacy.render(doc1, style="dep", jupyter = True)

## Named Entity Recognition

In [19]:
for ent in doc.ents:
    print (ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
nine CARDINAL
326 CARDINAL
Indian NORP
third ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
Russia GPE
over 331 CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York City GPE
The United States GPE
centuries DATE
one CARDINAL
17 CARDINAL
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
the Thirteen British Colonies ORG
the East Coast LOC
the American Revolution ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
east LOC
Southern NORP
the Confederate States of America GPE
the United States GPE
the American Civil War ORG
Union ORG
the United States GPE
the Thirteenth Amendment LAW
Spanish NORP
World War EVENT
U.S. GPE
World War II EVENT
the United States GPE
the Soviet Union GPE
two CARD

In [20]:
spacy.displacy.render(doc,style='ent',jupyter=True)

<b>Next, we are going to discuss word vectors (word embeddings). For that we need to download the next largest English model, en_core_web_md, because the small model does not include word vectors.</b>

In [21]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1/en_core_web_md-3.4.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [22]:
nlp = spacy.load("en_core_web_md")
with open ("wiki_us.txt", "r") as f:
    text = f.read()
doc = nlp(text)
sentence1 = list(doc.sents)[0]

# Word Vectors
Word vectors, also known as word embeddings, are numerical representations of words in a multidimensional space. Computers struggle to interpret text directly, but they process numbers quickly and accurately, so converting words to numbers is crucial. A word vector is what lets a computer system work with the meaning of a word.

The simplest approach is to convert every word in a corpus into a single, unique number. The words are then stored in a dictionary that maps each word to its number, for example "the" to 1 and "a" to 2. This is referred to as a bag of words. However, this kind of numerical representation only lets a computer recognise a particular word by its number. It gives the machine no way to comprehend meaning.

Let's take a scenario:

I love to eat bananas
I like to eat bananas

A numerical array (list) for each of these sentences would look as follows:

[3, 5, 6, 8, 9]

[3, 2, 6, 8, 9]

To a human, both sentences are nearly equivalent. The only difference is how much the speaker enjoys eating bananas. The two arrays also look very similar, yet the numbers say nothing about what the sentences mean. How similar are the numbers 5 and 2? The numbers are arbitrary identifiers, so 2 could just as easily have stood for "hate" as for "like". This is where word vectors become useful.
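The encoding described above can be sketched in a few lines of plain Python. The numbering scheme here (the next free integer on first appearance) is an assumption made for the illustration, so the resulting IDs differ from the arrays shown earlier.

```python
sentences = ["I love to eat bananas", "I like to eat bananas"]

word_to_id = {}
encoded = []
for sentence in sentences:
    ids = []
    for word in sentence.split():
        # Assign the next free integer the first time a word is seen.
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id) + 1
        ids.append(word_to_id[word])
    encoded.append(ids)

print(word_to_id)
print(encoded)
```

The two encoded sentences differ in exactly one position, but nothing in the IDs themselves indicates that "love" and "like" are close in meaning.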

Word vectors replace these arbitrary, one-dimensional word identifiers with a representation in the higher-dimensional space mentioned above, so that a word's position carries meaning. These representations are learned with machine learning, which can be done with Python libraries such as Gensim.

# What Do Word Vectors Look Like?

Word vectors have a fixed number of dimensions, and the values in those dimensions are learned through machine learning. Models consider how frequently words occur in a corpus and which other words appear in similar contexts. This lets the computer compute a numerical measure of how similar two words are. These relationships are represented mathematically as vectors: each word is stored as a flat array of floats (decimal numbers), and the number of dimensions corresponds to the number of floats in that array.

Let's examine the first word in our sentence and take a closer look at its vector.

In [23]:
print(sentence1[0].vector)

[-7.2681e+00 -8.5717e-01  5.8105e+00  1.9771e+00  8.8147e+00 -5.8579e+00
  3.7143e+00  3.5850e+00  4.7987e+00 -4.4251e+00  1.7461e+00 -3.7296e+00
 -5.1407e+00 -1.0792e+00 -2.5555e+00  3.0755e+00  5.0141e+00  5.8525e+00
  7.3378e+00 -2.7689e+00 -5.1641e+00 -1.9879e+00  2.9782e+00  2.1024e+00
  4.4306e+00  8.4355e-01 -6.8742e+00 -4.2949e+00 -1.7294e-01  3.6074e+00
  8.4379e-01  3.3419e-01 -4.8147e+00  3.5683e-02 -1.3721e+01 -4.6528e+00
 -1.4021e+00  4.8342e-01  1.2549e+00 -4.0644e+00  3.3278e+00 -2.1590e-01
 -5.1786e+00  3.5360e+00 -3.1575e+00 -3.5273e+00 -3.6753e+00  1.5863e+00
 -8.1594e+00 -3.4657e+00  1.5262e+00  4.8135e+00 -3.8428e+00 -3.9082e+00
  6.7549e-01 -3.5787e-01 -1.7806e+00  3.5284e+00 -5.1114e-02 -9.7150e-01
 -9.0553e-01 -1.5570e+00  1.2038e+00  4.7708e+00  9.8561e-01 -2.3186e+00
 -7.4899e+00 -9.5389e+00  8.5572e+00  2.7420e+00 -3.6270e+00  2.7456e+00
 -6.9574e+00 -1.7190e+00 -2.9145e+00  1.1838e+00  3.7864e+00  2.0413e+00
 -3.5808e+00  1.4319e+00  2.0528e-01 -7.0640e-01 -5

Once a word vector model has been trained, we can perform similarity matches quickly. Let's look at a few vectors from our medium-sized model and find the words that are most similar to the word "friends".

In [24]:
import numpy as np

word = "friends"

most_similar = nlp.vocab.vectors.most_similar(np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=10)
words = [nlp.vocab.strings[w] for w in most_similar[0][0]]
print(words)

['friendzone', 'Boyfriends', 'Vriendt', 'kinships', 'skinship', 'Ungers', 'kamas', 'girlfriend-', 'Kuzuri', 'DeBeers']


# Doc Similarity
Similarity can also be computed at the document level in spaCy. The similarity between two documents is calculated using word vectors. Let's look at the example from spaCy's documentation.

In [25]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.6916494076349142


## Word Similarity
In spaCy, we can also calculate similarity between two words.

In [26]:
# Similarity of tokens and spans
doc=nlp("I love to eat bananas and I love coding")
fruit = doc[4:5]
coding = doc[8]
print(fruit, "<->", coding, fruit.similarity(coding))

bananas <-> coding 0.03394070640206337


The similarity score between "bananas" and "coding" is about 0.03. That makes sense, since the two words have little in common.
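By default, spaCy's similarity scores are cosine similarities between vectors. The sketch below computes that measure by hand with the standard library. The three-dimensional vectors are invented for illustration and are not taken from any spaCy model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "banana": [0.9, 0.1, 0.0],
    "apple": [0.8, 0.2, 0.1],
    "coding": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["banana"], toy_vectors["apple"]))
print(cosine_similarity(toy_vectors["banana"], toy_vectors["coding"]))
```

Vectors that point in nearly the same direction score close to 1, while vectors that share almost no direction score close to 0, which is the pattern seen in the "bananas" and "coding" result.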

# Pipelines in spaCy

When you call nlp on a text, spaCy first tokenizes it to create a Doc object. The Doc is then processed in several steps, known as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser, and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

The capabilities of a processing pipeline always depend on its components, their models, and how they were trained. For instance, a pipeline for named entity recognition must include a trained named entity recognizer component with a statistical model and weights that enable it to predict entity labels.
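The idea of passing one object through a sequence of components can be sketched without spaCy. Everything in the block below is hypothetical: the "doc" is a plain dictionary, and the component names lowercaser and capital_flagger are made up for the illustration.

```python
def tokenize(text):
    # The tokenizer runs first and creates the shared document object.
    return {"text": text, "tokens": text.split()}


def lowercaser(doc):
    # Adds a lowercase form for every token.
    doc["lower"] = [t.lower() for t in doc["tokens"]]
    return doc


def capital_flagger(doc):
    # Flags tokens that start with a capital letter.
    doc["is_title"] = [t.istitle() for t in doc["tokens"]]
    return doc


def run_pipeline(text, components):
    doc = tokenize(text)
    for name, component in components:
        # Each component receives the doc, annotates it, and returns it.
        doc = component(doc)
    return doc


pipeline = [("lowercaser", lowercaser), ("capital_flagger", capital_flagger)]
result = run_pipeline("Sialkot is a city", pipeline)
print(result)
```

Each component only adds annotations to the same object, so what the pipeline can do is determined by which components it contains.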

# Attribute Rulers
The attribute ruler lets you set token attributes for tokens identified by Matcher patterns. It is frequently used to handle exceptions for token attributes and to map values between attributes, for example mapping fine-grained POS tags to coarse-grained POS tags.

The AttributeRuler is one of several pipeline components. Other entries in the Pipeline section of spaCy's API documentation include:

-> DependencyParser

-> EntityLinker

-> EntityRecognizer

-> EntityRuler

-> Lemmatizer

-> Morphologizer

-> SentenceRecognizer

-> Sentencizer

-> SpanCategorizer

-> Tagger

-> TextCategorizer

-> Tok2Vec

-> Tokenizer

-> TrainablePipe

-> Transformer

# Matchers
The Matcher lets you find words and phrases using rules that describe their token attributes. Rules can refer to token annotations (such as the text or part-of-speech tags) as well as lexical attributes such as Token.is_punct. Applying the matcher to a Doc gives you access to the matched tokens in context. spaCy provides the following matchers:<br>
-> DependencyMatcher

-> Matcher

-> PhraseMatcher
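To show what "rules that describe token attributes" means, here is a toy matcher written in plain Python. It is only a sketch of the idea and not spaCy's Matcher: the functions token_matches and find_matches, and the way is_punct is computed, are assumptions made for this example.

```python
def token_matches(token, spec):
    # A token matches when every attribute named in the spec has the required value.
    return all(token.get(attr) == value for attr, value in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) index pairs where the pattern matches consecutive tokens."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(t, s) for t, s in zip(window, pattern)):
            hits.append((start, start + len(pattern)))
    return hits


words = ["Hello", ",", "world", "!"]
tokens = [{"text": w, "lower": w.lower(), "is_punct": not w.isalnum()} for w in words]

# One dictionary per token: "hello", then any punctuation mark, then "world".
pattern = [{"lower": "hello"}, {"is_punct": True}, {"lower": "world"}]
print(find_matches(tokens, pattern))
```

The pattern is a list with one dictionary per token, so the same rule would also match "HELLO ! World" because it tests the lowercase form and the punctuation flag rather than the exact text.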

# Add Pipes: A Guide
You will typically use an off-the-shelf spaCy model. In some circumstances, however, a standard model will not meet your requirements or will perform a given task very slowly. Sentence tokenization is a good example. Suppose you had a document containing about a million sentences. Even with the small English model, it would take a very long time to parse the document and separate those million sentences. In this case, the best course of action is to create a blank English model and add only the Sentencizer to it.
The reason is that every pipe in a pipeline is enabled unless you specify otherwise, which means that each pipe, from the dependency parser to named entity recognition, is applied to your data. For this task, that is a very inefficient use of computational resources and time. It could take the small model hours to complete, whereas a blank model with only a Sentencizer can cut this down to a matter of minutes.



Let's start by making a blank model.

In [None]:
nlp1 = spacy.blank("en")


Notice that we used spacy.blank() instead of spacy.load(). All that is required to create a blank model is the language's two-letter code, in this case en for English. Let's now add a new pipe to it using the add_pipe() function. We will add only a sentencizer.

In [None]:
nlp1.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7f91b103b0a0>

In [None]:
%%time
doc = nlp1(text)
print (len(list(doc.sents)))

27
CPU times: user 33.4 ms, sys: 1.03 ms, total: 34.4 ms
Wall time: 83.1 ms


In [None]:
%%time
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print (len(list(doc.sents)))

27
CPU times: user 952 ms, sys: 37.3 ms, total: 989 ms
Wall time: 992 ms


Note that the text contains just 27 sentences, and the difference is already obvious: about 83 ms of wall time for the blank model versus about 992 ms for loading and running the small model. For a file containing millions of sentences, the gap becomes far more significant. Often you need to find sentences quickly rather than precisely, and in such circumstances it pays to know techniques like the one above.

## Pipeline Inspection
spaCy offers several ways to examine a pipeline. In a script, you can use the analyze_pipes() function:

In [None]:
nlp1.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}

Take note of the dictionary structure. It reveals not only what is in the pipeline but also the order of the pipes. Under "summary", every key is a pipe and every value is a dictionary that tells us a few distinct things. Each of these value dictionaries has an "assigns" key, which lists the values that the pipe assigns to tokens and the document as they move through the pipeline. The "scores" key lists the metrics used to evaluate the component; for pipes that are not evaluated, the list is empty.
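Because the result is an ordinary dictionary, it can be inspected with plain Python. The sketch below copies the output shown above into a variable named analysis (a name chosen for this example) and lists what each pipe assigns.

```python
# The dictionary printed above for the blank pipeline with a sentencizer.
analysis = {
    "summary": {
        "sentencizer": {
            "assigns": ["token.is_sent_start", "doc.sents"],
            "requires": [],
            "scores": ["sents_f", "sents_p", "sents_r"],
            "retokenizes": False,
        }
    },
    "problems": {"sentencizer": []},
    "attrs": {
        "token.is_sent_start": {"assigns": ["sentencizer"], "requires": []},
        "doc.sents": {"assigns": ["sentencizer"], "requires": []},
    },
}

# The keys of "summary" are the pipe names, in pipeline order.
pipe_order = list(analysis["summary"])
for name in pipe_order:
    info = analysis["summary"][name]
    print(name, "assigns", info["assigns"], "scores", info["scores"])
```

With a real pipeline, the same loop could be run directly on the value returned by analyze_pipes().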

# Rule-Based spaCy

## What Is spaCy's EntityRuler?

A variety of techniques are available for implementing rule-based NER in the Python library spaCy. Its EntityRuler is one of them.

The EntityRuler is a spaCy factory that lets you create a set of patterns with corresponding labels. In spaCy, a factory is a set of pre-loaded classes and functions that perform specific tasks. The EntityRuler factory lets the user create an EntityRuler, give it a set of instructions, and then use those instructions to find and label entities.

The EntityRuler is added to the spaCy pipeline as a new pipe and given its instructions in the form of patterns.

A pipe is one component of a pipeline. The goal of a pipeline is to accept input data, perform operations on that data, and output the results as new data or extracted metadata. In spaCy, several pipes fulfil different functions: the tokenizer divides the text into individual tokens, the parser parses the text, and the NER recognises and labels entities. All of this information is stored in the Doc object.

It is vital to remember that pipelines are sequential: earlier components influence what later components receive. Sometimes the order is crucial because later pipes depend on earlier ones; sometimes it is not, and later pipes can operate without the earlier ones. Keep this in mind as you construct custom spaCy models (or any pipeline, for that matter).

Now we'll take a closer look at the EntityRuler as a component of a spaCy model's pipeline. Off-the-shelf spaCy models include a NER model but not an EntityRuler. To include one in a spaCy model, an EntityRuler must be added as a new pipe and given its instructions. After that, the user can save the updated model with the EntityRuler to disk.

**Note:** The detailed documentation of EntityRuler can be found at https://spacy.io/api/entityruler

In [None]:
#Import the requisite library
import spacy

#Load the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = """Sialkot is a city located in Punjab, Pakistan. It is the capital of Sialkot District and the 13th most populous city in Pakistan. """

#Create the Doc object
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Sialkot PERSON
Punjab GPE
Pakistan GPE
Sialkot District FAC
Pakistan GPE


Results may differ depending on the model version you are using. These are the results with spaCy's small model.<br>
Off-the-shelf models frequently fail in the domains where we want to use them because they have not been trained on texts specific to those domains. We can fix this either by using spaCy's EntityRuler or by training a new model. As we will see next, spaCy's EntityRuler can help with both.<br>
**As you can see in the sample text, Sialkot is a geopolitical entity but was labeled as a person. Let's remedy the issue by giving the model instructions for correctly identifying Sialkot. For simplicity, we will use spaCy's GPE label.**


**Note that the NER in off-the-shelf spaCy models labels the following 18 types of entities:** <br>
PERSON:      People, including fictional.<br>
NORP:        Nationalities or religious or political groups.<br>
FAC:         Buildings, airports, highways, bridges, etc.<br>
ORG:         Companies, agencies, institutions, etc.<br>
GPE:         Countries, cities, states.<br>
LOC:         Non-GPE locations, mountain ranges, bodies of water.<br>
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)<br>
EVENT:       Named hurricanes, battles, wars, sports events, etc.<br>
WORK_OF_ART: Titles of books, songs, etc.<br>
LAW:         Named documents made into laws.<br>
LANGUAGE:    Any named language.<br>
DATE:        Absolute or relative dates or periods.<br>
TIME:        Times smaller than a day.<br>
PERCENT:     Percentage, including ”%“.<br>
MONEY:       Monetary values, including unit.<br>
QUANTITY:    Measurements, as of weight or distance.<br>
ORDINAL:     “first”, “second”, etc.<br>
CARDINAL:    Numerals that do not fall under another type<br>

In [None]:
#Import the requisite library
import spacy

#Load the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = """Sialkot is a city located in Punjab, Pakistan. It is the capital of Sialkot District and the 13th most populous city in Pakistan. """

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Sialkot"}
            ]

ruler.add_patterns(patterns)

#Create the Doc object
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Sialkot PERSON
Punjab GPE
Pakistan GPE
Sialkot District FAC
Pakistan GPE


If you ran the code above and got the same result, you did everything correctly, and yet the approach did not work. Why? The answer lies in how pipelines work. Although we created the EntityRuler and added it to the spaCy model's pipeline, spaCy adds a new pipe at the end of the pipeline by default. Let's use spaCy's analyze_pipes() function to view the pipeline.

In [None]:
nlp.analyze_pipes()


{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

The output can be a little hard to read at first, but it shows the order of our pipes along with some other important details about each one. If we locate "entity_ruler", we can see that it sits after "ner".

For our EntityRuler to take precedence, we must place it before the "ner" pipe, as in the following example:<br>
ruler = nlp.add_pipe("entity_ruler", before="ner")

In [None]:
#Import the requisite library
import spacy

#Load the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = """Sialkot is a city located in Punjab, Pakistan. It is the capital of Sialkot District and the 13th most populous city in Pakistan. """

#Create the EntityRuler before ner
ruler = nlp.add_pipe('entity_ruler', before='ner')

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Sialkot"}
            ]

ruler.add_patterns(patterns)

#Create the Doc object
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Sialkot GPE
Punjab GPE
Pakistan GPE
Sialkot GPE
Pakistan GPE


In [None]:
nlp.analyze_pipes()


{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

Now that the EntityRuler sits before the "ner" pipe, you can see that it finds and labels entities before the NER model even sees them. Because it runs earlier in the pipeline, its annotations take priority over those of the later "ner" pipe.
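The same "earlier pipe wins" behaviour can be reproduced without a trained model. The sketch below is an illustration under assumptions: it uses two EntityRulers on a blank pipeline instead of an EntityRuler and "ner", and the pipe names `first_ruler` and `second_ruler` and the `CITY` label are made up for the example. It also shows the EntityRuler's `overwrite_ents` setting, which lets a later ruler replace entities that are already set.

```python
import spacy

text = "Sialkot is a city located in Punjab, Pakistan."

def label_with_two_rulers(overwrite_second):
    """Run two rulers that disagree about the label for 'Sialkot'."""
    nlp = spacy.blank("en")

    # Runs first and labels "Sialkot" as GPE.
    first = nlp.add_pipe("entity_ruler", name="first_ruler")
    first.add_patterns([{"label": "GPE", "pattern": "Sialkot"}])

    # Runs second; CITY is a hypothetical label used only for this sketch.
    second = nlp.add_pipe(
        "entity_ruler",
        name="second_ruler",
        config={"overwrite_ents": overwrite_second},
    )
    second.add_patterns([{"label": "CITY", "pattern": "Sialkot"}])

    return [(ent.text, ent.label_) for ent in nlp(text).ents]

default_result = label_with_two_rulers(False)
overwrite_result = label_with_two_rulers(True)
print(default_result)
print(overwrite_result)
```

With the default setting the earlier ruler's GPE label should survive; with `overwrite_ents` set to True the later ruler's label should replace it.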

## Complex Rules and Variance to the Entity Ruler for matching

The spaCy EntityRuler also lets the user define more complex rules and variations (using, among other things, RegEx) by passing token-level rules as the pattern. Many attributes can be used in these patterns. The whole list can be seen in the <a href="https://spacy.io/usage/rule-based-matching">spaCy rule-based matching guide</a>.<br> Below is an example of pattern matching using the EntityRuler.

In [None]:
#Import the required library
import spacy

#Sample text
text = "This is a sample phone number 444 4444 444 and this is also 555 7866 987"

#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [ {"label": "PHONE_NUMBER","pattern": [
           {'SHAPE': 'ddd'},{'SHAPE': 'dddd'},{'SHAPE': 'ddd'}]}]

#add patterns to ruler
ruler.add_patterns(patterns)



#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

444 4444 444 PHONE_NUMBER
555 7866 987 PHONE_NUMBER
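The SHAPE attribute used in this pattern compares against each token's `shape_`, in which digits become `d` and letters become `x` or `X`. A small sketch, using a made-up sentence, that prints the shapes so the pattern is easier to read:

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical sample sentence, chosen only to show token shapes.
doc = nlp("Call 444 4444 444 today")

# Pair every token with the shape spaCy computes for it.
shapes = [(token.text, token.shape_) for token in doc]
for token_text, shape in shapes:
    print(token_text, shape)

# Runs of the same character type are capped at four symbols in the shape,
# so a longer number shares the shape of a four-digit one.
long_number_shape = nlp("1234567")[0].shape_
print(long_number_shape)
```

Because of that cap, a `dddd` rule also accepts tokens with more than four digits. If the exact length matters, adding a LENGTH condition to the token rule is one possible refinement.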


## spaCy's Matcher

spaCy provides the Matcher, a rule-based matching engine. It operates on the tokens of a Doc rather than on the raw text. You can also pass a custom callback to the Matcher to handle matches. Every match it returns comes from one of the patterns that were added to it.

In [None]:
#Basic and Simple Email Matcher 
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])
doc = nlp("Sample email address: moazzamceo@mtlearners.com")
matches = matcher(doc)
#extract entities
print('Lexeme {} start token {}, end token {}'.format(matches[0][0],matches[0][1],matches[0][2]))
print ('Entity Label : ',nlp.vocab[matches[0][0]].text)
print ('Entity Text : ',doc[matches[0][1]:matches[0][2]])


Lexeme 16571425990740197027 start token 4, end token 5
Entity Label :  EMAIL_ADDRESS
Entity Text :  moazzamceo@mtlearners.com


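A lexeme is a fundamental unit of meaning. In dictionaries, lexemes are the headwords. The lexeme "play," for example, can take many different forms, including playing, plays, and played. In the output above, the first number is the hash that spaCy stores for the pattern name; looking it up in nlp.vocab gives back the text "EMAIL_ADDRESS".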
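The custom callback mentioned earlier can be sketched as follows. This is a minimal example on a blank pipeline; the function name `collect_email`, the list `found`, and the email address are hypothetical. spaCy calls the function once per match with the arguments `(matcher, doc, i, matches)`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

found = []  # filled in by the callback below

def collect_email(matcher, doc, i, matches):
    # i is the index of the current match inside the matches list.
    match_id, start, end = matches[i]
    # The string store turns the hash back into the pattern name.
    found.append((doc.vocab.strings[match_id], doc[start:end].text))

# Register the pattern together with the callback.
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]], on_match=collect_email)

# Hypothetical sample address.
doc = nlp("Sample email address: user@example.com")
matcher(doc)
print(found)
```

A callback like this keeps the match handling next to the pattern definition instead of in a separate loop over the returned matches.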

## Attributes Taken by Matcher

<ol>
  <li>ORTH - The exact verbatim text of a token (str)</li>
  <li>TEXT - The exact verbatim text of a token (str)</li>
  <li>LOWER - The lowercase form of the token text (str)</li>
  <li>LENGTH - The length of the token text (int)</li>
  <li>IS_ALPHA</li>
  <li>IS_ASCII</li>
  <li>IS_DIGIT</li>
  <li>IS_LOWER</li>
  <li>IS_UPPER</li>
  <li>IS_TITLE</li>
  <li>IS_PUNCT</li>
  <li>POS</li>
  <li>SHAPE</li>
  <li>ENT_TYPE</li>
  <li>LEMMA</li>
  <li>DEP</li>
  <li>TAG</li>
  <li>MORPH</li>
  <li>LIKE_URL</li>
  <li>LIKE_EMAIL</li>

</ol>
Explore Matcher attributes in more detail in the <a href="https://spacy.io/usage/rule-based-matching">spaCy rule-based matching guide</a>.
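Several of these attributes can be combined in one pattern, with one dictionary per token. The sketch below uses a made-up sentence and a hypothetical pattern name, `SPACY_VERSION`, to match the word "spacy" in any casing followed by a number.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Token 1: any casing of "spacy". Token 2: a token made only of digits.
pattern = [{"LOWER": "spacy"}, {"IS_DIGIT": True}]
matcher.add("SPACY_VERSION", [pattern])

# Hypothetical sample sentence.
doc = nlp("I moved from SPACY 2 to spaCy 3 last year.")

# Sort by start token so the results follow document order.
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```

LOWER and IS_DIGIT are lexical attributes, so this works on a blank pipeline; attributes such as POS or LEMMA need a pipeline whose components set them.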

## Grabbing all proper nouns

In [None]:
with open ("/content/ww2_data.txt", "r") as f:
    text = f.read()

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
for match in matches[:20]:
    print (doc[match[1]:match[2]])

World
War
II
Second
World
War
WWII
WW2
Allies
Axis
Aircraft
World
War
II
Holocaust
Axis
Germany
Japan
World
War


In [None]:
#Improving it with Multi-Word Proper Nouns
pattern = [{"POS": "PROPN", "OP":'+'}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print (doc[match[1]:match[2]])

World
War
World War
II
World War II
War II
Second
World
Second World
War


In [None]:
#Greedy Keyword Argument
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print (doc[match[1]:match[2]])

World War I. World War II
World War II
Second World War
World War II
World War II
Spanish Civil War
World War II
World War II
Second Italo
Ethiopian War
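The effect of the "+" operator and of the greedy argument is easier to see on a single short sentence. The sketch below is an illustration under assumptions: it swaps `"POS": "PROPN"` for `"IS_TITLE": True` so that it runs on a blank pipeline without a trained tagger, and the sentence and the pattern name `TITLE_RUN` are made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("the Second World War ended in 1945")

# One or more title-cased tokens in a row (stand-in for proper nouns).
pattern = [{"IS_TITLE": True, "OP": "+"}]

# Without greedy, every sub-span of the run is reported.
plain = Matcher(nlp.vocab)
plain.add("TITLE_RUN", [pattern])
all_spans = sorted({doc[start:end].text for _, start, end in plain(doc)})

# With greedy="LONGEST", only the longest non-overlapping span is kept.
longest = Matcher(nlp.vocab)
longest.add("TITLE_RUN", [pattern], greedy="LONGEST")
longest_spans = [doc[start:end].text for _, start, end in longest(doc)]

print(all_spans)
print(longest_spans)
```

Each matcher here is created fresh, so the two settings can be compared side by side without one rule affecting the other.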


# RegEx Role in spaCy

## Regular Expressions (RegEx)
Regular expressions, abbreviated as RegEx, are a way of performing complex string matching based on simple or complicated patterns. They can be used to find and retrieve patterns, as well as to replace matched patterns in a string with other text. RegEx allows more thorough searching than plain substring matching and is supported by many search tools. Almost all data scientists, especially those who work with text, use RegEx at some point in their workflow, from data searching to data cleaning to putting machine learning models into practice.
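The three uses named above (finding, retrieving, and replacing) map onto three functions in Python's built-in `re` module. A minimal sketch with a made-up sentence and a simple date pattern:

```python
import re

# Hypothetical sample text containing two ISO-style dates.
text = "Order 66 shipped on 2022-11-06; order 67 ships on 2022-11-09."
date_pattern = r"\d{4}-\d{2}-\d{2}"

# Find: the first place where the pattern occurs.
first = re.search(date_pattern, text)

# Retrieve: every non-overlapping occurrence.
all_dates = re.findall(date_pattern, text)

# Replace: swap each occurrence for a placeholder.
masked = re.sub(date_pattern, "<DATE>", text)

print(first.group())
print(all_dates)
print(masked)
```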

## Benefits of RegEx


<ol>
<li>Because of its compact syntax, it allows programmers to express robust rules in limited space.</li>
<li>It enables the researcher to discover many sorts of variation in strings.</li>
<li>Compared with other techniques, it can run very quickly.</li>
</ol>


## Drawbacks of RegEx
Despite these benefits, RegEx has a few drawbacks.
<ol>
<li>Its syntax is quite tough for newbies.</li>
<li>To make the system effective, a domain expert must collaborate with the programmer and consider all possible variations of a pattern in texts.</li>
</ol>

## RegEx in Python


The pattern in the cell below matches any number of one or two digits followed by a space and the name of a month.

In [None]:
import re

# (\d){1,2} looks for any digit between 0 and 9 that may occur once or twice.
# The space that follows matches the space we expect inside a date.
# The parenthesised list of month names is another component of the pattern
# (because it is in parentheses). The | means the same as "or" in English,
# so it matches either January, or February, or March, etc.


pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"
text = "Today is 5 November. My birthday was 18 September."
matches = re.findall(pattern, text)
print (matches)

[('5 November', '5', 'November'), ('18 September', '8', 'September')]
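Notice that the second tuple above holds '8' rather than '18'. When a capture group is repeated, as in `(\d){1,2}`, Python keeps only the last repetition. One way around this, sketched below as a variant rather than a change to the notebook's pattern, is to put the repetition inside the group.

```python
import re

months = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
text = "Today is 5 November. My birthday was 18 September."

# The group is repeated, so only the last digit is captured.
repeated_group = re.findall(r"(\d){1,2} (" + months + ")", text)

# The repetition sits inside the group, so the whole number is captured.
whole_group = re.findall(r"(\d{1,2}) (" + months + ")", text)

print(repeated_group)
print(whole_group)
```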


In [None]:
#Add variations of pattern
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"
text = "Today is 5 November. My birthday was September 18."
matches = re.findall(pattern, text)
print (matches)

[('5 November', '5 November', '5', ' November', 'November', '', '', '', ''), ('September 18', '', '', '', '', 'September 18', 'September ', 'September', '8')]


However, you'll notice that each match carries a lot of unnecessary information: findall returns one element per capture group, so every tuple also contains the sub-matches and empty strings for the groups that did not take part. We can eliminate them in a number of ways. One method is to use finditer instead of findall, as follows:

In [None]:
iter_matches = re.finditer(pattern, text)
print (iter_matches)

<callable_iterator object at 0x7f91a7871c50>


In [None]:
#This is an iterator object, we can loop over it, however, and get our results
for hit in iter_matches:
    start = hit.start()
    end = hit.end()
    print ('Date Found: ',text[start:end])

Date Found:  5 November
Date Found:  September 18
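Another way to drop the extra elements, not used in this notebook, is to write the groups as non-capturing groups with `(?:...)`. With no capture groups left, findall returns only the full matches. A minimal sketch:

```python
import re

months = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# (?:...) groups the alternatives without capturing them.
pattern = r"\d{1,2} (?:" + months + r")|(?:" + months + r") \d{1,2}"
text = "Today is 5 November. My birthday was September 18."

matches = re.findall(pattern, text)
print(matches)
```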


## RegEx in spaCy

RegEx works well for data with a consistent or largely consistent structure, such as dates, times, and IP addresses. spaCy offers three rule-based matching tools: Matcher, PhraseMatcher, and EntityRuler, and the token patterns of the Matcher and the EntityRuler can include RegEx. One of the main drawbacks of the Matcher and PhraseMatcher is that they do not store their matches in doc.ents. Because this notebook is all about NER and our goal is to store the entities in doc.ents, we will concentrate on using RegEx with the EntityRuler.

In [None]:
#Import the requisite library
import spacy

#Sample text
text = "This is a sample of Pakistan number 0341-5555555. This is also 0309 8765432"
#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PAK PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){4})"}},
                                                          {'ORTH': '-', 'OP': '?'},
                                                         {"TEXT": {"REGEX": r"((\d){7})"}}
                                                        ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

0341-5555555 PAK PHONE_NUMBER
0309 8765432 PAK PHONE_NUMBER
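In the pattern above, each REGEX is checked against a single token. When one expression should cover several tokens at once, a possible alternative is to run Python's `re` module over the whole text and convert the character offsets into spans with `doc.char_span`. The sketch below shows that approach under assumptions: the expression `phone_pattern` is a simplified variant written for this example, not the notebook's rule.

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("This is a sample of Pakistan number 0341-5555555. This is also 0309 8765432")

# Simplified expression for this sketch: 4 digits, a hyphen or space, 7 digits.
phone_pattern = r"\d{4}[- ]\d{7}"

spans = []
for match in re.finditer(phone_pattern, doc.text):
    # char_span returns None when the offsets do not line up with tokens.
    span = doc.char_span(match.start(), match.end(), label="PAK PHONE_NUMBER")
    if span is not None:
        spans.append(span)

# Store the spans as entities, as the EntityRuler would.
doc.ents = spans
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```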


# Custom NER Training

In [27]:
import json

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

# Load the annotated examples
with open('annotations.json') as f:
    TRAIN_DATA = json.load(f)

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object
for text, annot in tqdm(TRAIN_DATA['annotations']): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./training_data.spacy") # save the docbin object

100%|████████████████████████████████████████████| 8/8 [00:00<00:00, 594.20it/s]
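The loop above unpacks each item as a text plus a dictionary with an "entities" list of (start, end, label) character offsets. The notebook does not show annotations.json itself, so the sketch below is an assumption about its layout inferred from that loop, and the label `FRUIT` is hypothetical. It also shows one way to compute the offsets.

```python
import json

text = "Best Quality Mangoes in Pakistan"
entity_text = "Mangoes"

# Character offsets: start is inclusive, end is exclusive.
start = text.find(entity_text)
end = start + len(entity_text)

# Assumed layout, inferred from how the conversion loop reads the data.
# FRUIT is a made-up label for this sketch.
train_data = {"annotations": [[text, {"entities": [[start, end, "FRUIT"]]}]]}

# Read it back the same way the conversion loop does.
for sample_text, annot in train_data["annotations"]:
    for ent_start, ent_end, label in annot["entities"]:
        print(sample_text[ent_start:ent_end], label)

print(json.dumps(train_data))
```

Checking that `text[start:end]` gives back the entity text is a quick way to catch offsets that would later be skipped by `char_span`.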


In [30]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [31]:
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy

ℹ Saving to output directory: .
ℹ Using CPU
[2022-11-06 15:27:48,668] [INFO] Set up nlp object from config
[2022-11-06 15:27:48,693] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-11-06 15:27:48,703] [INFO] Created vocabulary
[2022-11-06 15:27:48,704] [INFO] Finished initializing nlp object
[2022-11-06 15:27:49,127] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     35.67    0.00    0.00    0.00    0.00
 78     200          3.64    573.13  100.00  100.00  100.00    1.00
178     400          0.00      0.00  100.00  100.00  100.00    1.00
278     600          0.00      0.00  100.00  100.00  100.00    1.00
448     800          0.00      0.00  100.00  100.00  100.00    
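The training command above passes the same file as both training and development data, so the scores in the table are measured on examples the model has already seen. A common alternative is to hold out part of the annotations as a separate development set. The sketch below shows only the split, with placeholder strings standing in for real annotated examples; the 75/25 ratio and the seed are arbitrary choices for illustration.

```python
import random

# Placeholders standing in for the 8 annotated examples.
examples = [f"example {i}" for i in range(8)]

# Shuffle a copy with a fixed seed so the split is reproducible.
rng = random.Random(0)
shuffled = examples[:]
rng.shuffle(shuffled)

# Keep 75% for training and hold out the rest for evaluation.
cut = int(len(shuffled) * 0.75)
train_examples = shuffled[:cut]
dev_examples = shuffled[cut:]

print(len(train_examples), len(dev_examples))
```

Each part would then be converted to its own DocBin file and passed to `--paths.train` and `--paths.dev` respectively.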

In [None]:
nlp_ner = spacy.load("model-best")

In [None]:
doc = nlp_ner("Strawberry is a luscious, red fruit grown on plants belonging to the Rose or Rosaceae family.") # input sample text
doc.ents

(Strawberry,)

In [None]:
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter

In [None]:
doc1=nlp_ner("Best Quality Mangoes in Pakistan")
doc1.ents

(Mangoes,)

In [None]:
spacy.displacy.render(doc1, style="ent", jupyter=True) # display in Jupyter

### Thank you 
## Author

<a href="https://www.linkedin.com/in/meermoazzam/" target="_blank">Moazzam Ali</a>


<hr>

## <h3 align="center"> © <a href="https://www.linkedin.com/company/mt-learners/" target="_blank">MT Learners</a> 2022. All rights reserved. </h3>
