
# Named Entity Recognition with NLTK and SpaCy

In this notebook, I explain what Named Entity Recognition is and how it works. It is demonstrated on a sentence taken from a public source and on an article from an online news outlet. I use both the `NLTK` and `SpaCy` libraries to preprocess data, to explain language elements, and to have the computer label the entities it recognizes.

The work is inspired by an [article](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da) in Towards Data Science but provides more explanations and technical details. Text and code are organized in several sections.

### Imports

In [1]:
from collections import Counter
from pprint import pprint
import requests
import re

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags

import spacy
from spacy import displacy
import en_core_web_sm

from bs4 import BeautifulSoup

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Named entity recognition (NER), also referred to as entity chunking, extraction, or identification, is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, a machine learning (ML) model might detect the word “super.AI” in a text and classify it as a “Company”.

NER is a form of Natural Language Processing (NLP), a subfield of Artificial Intelligence. NLP is concerned with computers processing and analyzing natural language, i.e., any language that has developed naturally rather than being designed artificially, as programming languages are.

To demonstrate how NER works, we need a sequence of strings - a sentence or an article (or a larger body of text).

# 1. Analyze a sentence and its elements with `NLTK`

A sentence is taken from a Politico.eu [article](https://www.politico.eu/article/putin-macron-agree-expert-mission-zaporizhzhia-nuclear-plant-ukraine/) and stored in a variable so as to demonstrate how Named Entity Recognition works.

In [2]:
sentence = "During a phone call on Friday, Emmanuel Macron and Vladimir Putin agreed that a team from the International Atomic Energy Agency (IAEA) should be sent to the Zaporizhzhia nuclear power plant in Ukraine, according to the French president's office and the Kremlin."

The function below tokenizes a string and tags the resulting tokens. `word_tokenize` takes a string (a sentence or a longer text) and divides it into a list of substrings (e.g., words and punctuation marks). `pos_tag` tags words according to their part-of-speech role.
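To make the idea of tokenizing concrete, here is a minimal sketch that uses only the standard library. `simple_tokenize` is a hypothetical helper, not part of `NLTK`, and its regular expression is only a rough approximation: unlike `word_tokenize`, it would split *president's* into three pieces instead of keeping `'s` as one token.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration; NLTK's word_tokenize is more careful.
    """
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Macron and Putin agreed (on Friday)."))
# ['Macron', 'and', 'Putin', 'agreed', '(', 'on', 'Friday', ')', '.']
```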

In [3]:
def preprocess(sent):
  """
  Tokenizes a string and tags each token with its part of speech.
  Args: string
  Returns: a list of tuples, each containing a token and its associated
        part-of-speech tag
  """
  sent = nltk.word_tokenize(sent)
  sent = nltk.pos_tag(sent)
  
  return sent

Each word (token) is tagged with its associated part of speech. For example, *During* (IN) is marked as a preposition/subordinating conjunction, *phone* (NN) as a singular noun, *Friday* (NNP) as a singular proper noun, *Macron* (NNP) also as a singular proper noun, *sent* (VBN) as the past participle of a verb, and so on. All part-of-speech abbreviations are listed [here](https://www.guru99.com/pos-tagging-chunking-nltk.html).

In [4]:
sentence_preprocessed = preprocess(sentence)
sentence_preprocessed 

[('During', 'IN'),
 ('a', 'DT'),
 ('phone', 'NN'),
 ('call', 'NN'),
 ('on', 'IN'),
 ('Friday', 'NNP'),
 (',', ','),
 ('Emmanuel', 'NNP'),
 ('Macron', 'NNP'),
 ('and', 'CC'),
 ('Vladimir', 'NNP'),
 ('Putin', 'NNP'),
 ('agreed', 'VBD'),
 ('that', 'IN'),
 ('a', 'DT'),
 ('team', 'NN'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('International', 'NNP'),
 ('Atomic', 'NNP'),
 ('Energy', 'NNP'),
 ('Agency', 'NNP'),
 ('(', '('),
 ('IAEA', 'NNP'),
 (')', ')'),
 ('should', 'MD'),
 ('be', 'VB'),
 ('sent', 'VBN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('Zaporizhzhia', 'NNP'),
 ('nuclear', 'JJ'),
 ('power', 'NN'),
 ('plant', 'NN'),
 ('in', 'IN'),
 ('Ukraine', 'NNP'),
 (',', ','),
 ('according', 'VBG'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('French', 'JJ'),
 ('president', 'NN'),
 ("'s", 'POS'),
 ('office', 'NN'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('Kremlin', 'NNP'),
 ('.', '.')]
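The tagged output already hints at where names are: runs of consecutive `NNP` tokens often belong together. The sketch below groups such runs with `itertools.groupby`. It is a crude heuristic written for illustration (`proper_noun_runs` is a hypothetical helper), not the way `NLTK` or `SpaCy` recognize entities, and the input is a hand-copied excerpt of the output above.

```python
from itertools import groupby

# Excerpt of the (token, tag) pairs printed above, copied by hand.
tagged = [
    ("on", "IN"), ("Friday", "NNP"), (",", ","),
    ("Emmanuel", "NNP"), ("Macron", "NNP"), ("and", "CC"),
    ("Vladimir", "NNP"), ("Putin", "NNP"), ("agreed", "VBD"),
]


def proper_noun_runs(tagged_tokens):
    """Join each run of consecutive NNP tokens into one candidate name."""
    runs = []
    for is_nnp, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            runs.append(" ".join(word for word, _ in group))
    return runs


print(proper_noun_runs(tagged))
# ['Friday', 'Emmanuel Macron', 'Vladimir Putin']
```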

## 1.1. Chunk sentence into parts-of-speech and IOBs

A chunk pattern is defined with a regular expression over part-of-speech tags to find the elements of the sentence that meet predefined criteria. The pattern consists of one rule: a noun phrase (NP) is formed whenever the chunker finds an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN).

In [5]:
pattern = "NP: {<DT>?<JJ>*<NN>}"

In [6]:
chunk_parser = nltk.RegexpParser(pattern)
chunked_sentence = chunk_parser.parse(sentence_preprocessed)
print(chunked_sentence)

(S
  During/IN
  (NP a/DT phone/NN)
  (NP call/NN)
  on/IN
  Friday/NNP
  ,/,
  Emmanuel/NNP
  Macron/NNP
  and/CC
  Vladimir/NNP
  Putin/NNP
  agreed/VBD
  that/IN
  (NP a/DT team/NN)
  from/IN
  the/DT
  International/NNP
  Atomic/NNP
  Energy/NNP
  Agency/NNP
  (/(
  IAEA/NNP
  )/)
  should/MD
  be/VB
  sent/VBN
  to/TO
  the/DT
  Zaporizhzhia/NNP
  (NP nuclear/JJ power/NN)
  (NP plant/NN)
  in/IN
  Ukraine/NNP
  ,/,
  according/VBG
  to/TO
  (NP the/DT French/JJ president/NN)
  's/POS
  (NP office/NN)
  and/CC
  the/DT
  Kremlin/NNP
  ./.)
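The rule can be mimicked in plain Python to see why some tokens are grouped and others are not. The sketch below writes the tags as one string and applies the same pattern with the `re` module. `np_chunks` is a hypothetical helper and only an approximation of what `RegexpParser` does internally; note that `<NN>` matches the tag `NN` exactly, which is why proper nouns (`NNP`) such as *Zaporizhzhia* stay outside the chunks.

```python
import re

# Excerpt of the tagged sentence, copied by hand from the output above.
tagged = [
    ("During", "IN"), ("a", "DT"), ("phone", "NN"), ("call", "NN"),
    ("on", "IN"), ("Friday", "NNP"), ("the", "DT"), ("Zaporizhzhia", "NNP"),
    ("nuclear", "JJ"), ("power", "NN"), ("plant", "NN"),
]


def np_chunks(tagged_tokens):
    """Return the words of every match of <DT>?<JJ>*<NN> over the tag sequence."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)

    # Map the character offset of each tag back to its token index.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged_tokens):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # the tag plus its angle brackets

    chunks = []
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        first = offset_to_index[match.start()]
        length = match.group().count("<")  # number of tags in the match
        chunks.append([word for word, _ in tagged_tokens[first:first + length]])
    return chunks


print(np_chunks(tagged))
# [['a', 'phone'], ['call'], ['nuclear', 'power'], ['plant']]
```

As in the tree above, *phone* and *call* end up in separate chunks because the rule allows only one noun per noun phrase.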


`tree2conlltags` is an NLTK function that receives a chunked tree and returns a list of 3-tuples of the form (word, part-of-speech tag, IOB tag). In addition to the part-of-speech tags, the IOB tag marks whether a token is inside, outside, or at the beginning of a chunk. For example, *B-NP* marks the beginning of a noun phrase (e.g., call, office), *I-NP* marks a word inside the current noun phrase (e.g., phone, team), and *O* marks a token outside any chunk. In a grammar that also defined verb phrases, *B-VP* and *I-VP* would mark the beginning and the inside of a verb phrase; the pattern above defines only NP, so they do not appear here.
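The IOB scheme itself is simple enough to write by hand. The following sketch assigns `B-`, `I-`, and `O` tags from a list of chunk spans. `spans_to_iob` and the hand-written spans are hypothetical and serve only to illustrate the tagging convention; in the notebook the conversion is done by `tree2conlltags`.

```python
def spans_to_iob(tagged_tokens, chunk_spans, label="NP"):
    """Attach an IOB tag to every (word, pos) pair.

    chunk_spans holds (start, end) token indices, with end exclusive.
    """
    iob = ["O"] * len(tagged_tokens)       # default: outside any chunk
    for start, end in chunk_spans:
        iob[start] = f"B-{label}"          # first token begins the chunk
        for index in range(start + 1, end):
            iob[index] = f"I-{label}"      # remaining tokens are inside it
    return [(word, pos, tag) for (word, pos), tag in zip(tagged_tokens, iob)]


tagged = [("During", "IN"), ("a", "DT"), ("phone", "NN"), ("call", "NN"), ("on", "IN")]
spans = [(1, 3), (3, 4)]  # "a phone" and "call"

for triple in spans_to_iob(tagged, spans):
    print(triple)
```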

In [7]:
iob_tagged = tree2conlltags(chunked_sentence)
pprint(iob_tagged)

[('During', 'IN', 'O'),
 ('a', 'DT', 'B-NP'),
 ('phone', 'NN', 'I-NP'),
 ('call', 'NN', 'B-NP'),
 ('on', 'IN', 'O'),
 ('Friday', 'NNP', 'O'),
 (',', ',', 'O'),
 ('Emmanuel', 'NNP', 'O'),
 ('Macron', 'NNP', 'O'),
 ('and', 'CC', 'O'),
 ('Vladimir', 'NNP', 'O'),
 ('Putin', 'NNP', 'O'),
 ('agreed', 'VBD', 'O'),
 ('that', 'IN', 'O'),
 ('a', 'DT', 'B-NP'),
 ('team', 'NN', 'I-NP'),
 ('from', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('International', 'NNP', 'O'),
 ('Atomic', 'NNP', 'O'),
 ('Energy', 'NNP', 'O'),
 ('Agency', 'NNP', 'O'),
 ('(', '(', 'O'),
 ('IAEA', 'NNP', 'O'),
 (')', ')', 'O'),
 ('should', 'MD', 'O'),
 ('be', 'VB', 'O'),
 ('sent', 'VBN', 'O'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'O'),
 ('Zaporizhzhia', 'NNP', 'O'),
 ('nuclear', 'JJ', 'B-NP'),
 ('power', 'NN', 'I-NP'),
 ('plant', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('Ukraine', 'NNP', 'O'),
 (',', ',', 'O'),
 ('according', 'VBG', 'O'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'B-NP'),
 ('French', 'JJ', 'I-NP'),
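 ('president', 'NN', 'I-NP'),
 ("'s", 'POS', 'O'),
 ('office', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('the', 'DT', 'O'),
 ('Kremlin', 'NNP', 'O'),
 ('.', '.', 'O')]
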

# 2. Analyze a sentence and its elements with `SpaCy`

`en_core_web_sm` is one of the pretrained `SpaCy` language models. It must be downloaded before it can be used. The same sentence is used to illustrate how this library works.

In [8]:
nlp = en_core_web_sm.load()

In [9]:
doc = nlp("During a phone call on Friday, Emmanuel Macron and Vladimir Putin agreed that a team from the International Atomic Energy Agency (IAEA) should be sent to the Zaporizhzhia nuclear power plant in Ukraine, according to the French president's office and the Kremlin.")

The entities recognized in this sentence are *Friday* (DATE), *Emmanuel Macron* and *Vladimir Putin* (PERSON), *the International Atomic Energy Agency*, its abbreviation *IAEA*, and *Kremlin* (ORG, organization), *Ukraine* (GPE, geo-political entity), and *French* (NORP, which stands for nationalities or religious or political groups). The model also makes a mistake: it labels *Zaporizhzhia*, a place name, as PERSON.

In [10]:
pprint([(X.text, X.label_) for X in doc.ents])

[('Friday', 'DATE'),
 ('Emmanuel Macron', 'PERSON'),
 ('Vladimir Putin', 'PERSON'),
 ('the International Atomic Energy Agency', 'ORG'),
 ('IAEA', 'ORG'),
 ('Zaporizhzhia', 'PERSON'),
 ('Ukraine', 'GPE'),
 ('French', 'NORP'),
 ('Kremlin', 'ORG')]
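Grouping the entities by label makes such a list easier to read. The sketch below does this with a `defaultdict`. To stay runnable without the language model, it uses the tuples printed above as literal data; with a real `doc`, the same loop could iterate over `(ent.text, ent.label_)` pairs.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [
    ("Friday", "DATE"),
    ("Emmanuel Macron", "PERSON"),
    ("Vladimir Putin", "PERSON"),
    ("the International Atomic Energy Agency", "ORG"),
    ("IAEA", "ORG"),
    ("Zaporizhzhia", "PERSON"),
    ("Ukraine", "GPE"),
    ("French", "NORP"),
    ("Kremlin", "ORG"),
]

by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

Listed this way, the misclassified *Zaporizhzhia* stands out next to the two presidents under PERSON.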


Likewise, `SpaCy` can annotate each token with its entity IOB tag and entity type.

In [11]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(During, 'O', ''),
 (a, 'O', ''),
 (phone, 'O', ''),
 (call, 'O', ''),
 (on, 'O', ''),
 (Friday, 'B', 'DATE'),
 (,, 'O', ''),
 (Emmanuel, 'B', 'PERSON'),
 (Macron, 'I', 'PERSON'),
 (and, 'O', ''),
 (Vladimir, 'B', 'PERSON'),
 (Putin, 'I', 'PERSON'),
 (agreed, 'O', ''),
 (that, 'O', ''),
 (a, 'O', ''),
 (team, 'O', ''),
 (from, 'O', ''),
 (the, 'B', 'ORG'),
 (International, 'I', 'ORG'),
 (Atomic, 'I', 'ORG'),
 (Energy, 'I', 'ORG'),
 (Agency, 'I', 'ORG'),
 ((, 'O', ''),
 (IAEA, 'B', 'ORG'),
 (), 'O', ''),
 (should, 'O', ''),
 (be, 'O', ''),
 (sent, 'O', ''),
 (to, 'O', ''),
 (the, 'O', ''),
 (Zaporizhzhia, 'B', 'PERSON'),
 (nuclear, 'O', ''),
 (power, 'O', ''),
 (plant, 'O', ''),
 (in, 'O', ''),
 (Ukraine, 'B', 'GPE'),
 (,, 'O', ''),
 (according, 'O', ''),
 (to, 'O', ''),
 (the, 'O', ''),
 (French, 'B', 'NORP'),
 (president, 'O', ''),
 ('s, 'O', ''),
 (office, 'O', ''),
 (and, 'O', ''),
 (the, 'O', ''),
 (Kremlin, 'B', 'ORG'),
 (., 'O', '')]
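The per-token tags carry the same information as `doc.ents`: a `B` opens an entity, every following `I` extends it, and an `O` closes it. The sketch below rebuilds entity spans from such triples. `iob_to_spans` is a hypothetical helper, the input is a hand-copied excerpt of the output above, and joining tokens with a space is a simplification.

```python
def iob_to_spans(triples):
    """Rebuild (entity text, entity type) pairs from (token, iob, type) triples."""
    spans = []
    words, ent_type = [], None

    def close():
        # Store the entity collected so far, if any.
        if words:
            spans.append((" ".join(words), ent_type))

    for text, iob, label in triples:
        if iob == "B":            # a new entity starts here
            close()
            words, ent_type = [text], label
        elif iob == "I" and words:  # continue the open entity
            words.append(text)
        else:                     # "O": outside any entity
            close()
            words, ent_type = [], None
    close()
    return spans


triples = [
    ("from", "O", ""), ("the", "B", "ORG"), ("International", "I", "ORG"),
    ("Atomic", "I", "ORG"), ("Energy", "I", "ORG"), ("Agency", "I", "ORG"),
    ("(", "O", ""), ("IAEA", "B", "ORG"), (")", "O", ""),
]

print(iob_to_spans(triples))
# [('the International Atomic Energy Agency', 'ORG'), ('IAEA', 'ORG')]
```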


## 2.1. Extract named entities from an article

Another Named Entity Recognition task is to find entities in a whole article (or a document). To demonstrate how it works, the entire text of an [article](https://www.theguardian.com/world/2022/aug/19/canada-zoo-escaped-wolf-pups) from The Guardian is downloaded from the site and stored in a variable. The function below scrapes the text and prepares it for further processing.

In [12]:
def url_to_string(url):
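  """
  Extracts text from a web page and prepares it for preprocessing.
  Args: a URL address
  Returns: the page text as a single string, with scripts, styles and
        asides removed and runs of newlines and tabs replaced by spaces
  """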
  res = requests.get(url)
  html = res.text
  soup = BeautifulSoup(html, 'html5lib')
  for script in soup(["script", "style", 'aside']):
      script.extract()
  return " ".join(re.split(r'[\n\t]+', soup.get_text()))

In [13]:
article = url_to_string("https://www.theguardian.com/world/2022/aug/19/canada-zoo-escaped-wolf-pups")
article

'                      Canada zoo finds escaped wolf pups in moment of joy tinged with tragedy | Canada | The Guardian                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Skip to main contentSkip to navigationAdvertisementUS editionUS editionUK editionAustralian editionInternational editionThe Guardian - Back to homeThe Guardian: news website of the yearSearch jobs Sign inSearchNewsOpinionSportCultureLifestyleShowMoreShow MoreNewsUS newsWorld newsEnvironmentSoccerUS politicsBusinessTechScienceNewslettersFight to voteOpinionThe Guardian viewColumnists

Unfortunately, `BeautifulSoup` scrapes everything on the page. Thus, not only the meaningful text of the article is extracted, but also links to other articles and pages. To simplify cleaning, I extract only the relevant sentences by hand.
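A possible alternative to copying by hand is to keep only the text inside `<p>` elements. The sketch below does this with the standard library's `html.parser`. It rests on the assumption that the article body lives in `<p>` tags, which has not been checked against the real Guardian page; `ParagraphExtractor`, `paragraphs_from_html`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def _flush(self):
        # Normalise whitespace and store the paragraph if it is not empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def paragraphs_from_html(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up page fragment, used only to exercise the parser.
sample_html = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<nav>Skip to main content</nav>"
    "<p>Two pups were\n   found.</p>"
    "<p>The zoo <b>reopened</b> on Saturday.</p>"
    "</body></html>"
)

print(paragraphs_from_html(sample_html))
# ['Two pups were found.', 'The zoo reopened on Saturday.']
```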

In [14]:
article = "Canada zoo finds escaped wolf pups in moment of joy tinged with tragedy. Canada zoo finds escaped wolf pups in moment of joy tinged with tragedyFour days after a ‘suspicious’ break-in, one pup is found safe and another appears to have been hit by a car Wolf pups in Minnesota. Two pups from the Greater Vancouver Zoo were found after four days on the loose. Emotions are bittersweet at a Canadian zoo after a runaway wolf pup was safely located after four days on the loose, but another was found dead along a road. Conservation officers and zoo staff in Canada have spent the last four days searching for a runaway wolf after mysterious break-in freed a pack of the predators from the popular zoo. In a statement on Friday morning, officials at the Greater Vancouver Zoo said Tempest, a one-year-old grey wolf, had been found and was “back with her family”. She had been located near the zoo, the statement said. But the news came a day after the zoo announced that another escaped wolf, Chia, had died, probably after being hit by a car. “We are so grateful for this positive outcome for Tempest but are still processing the loss of Chia,” the zoo’s statement said, adding it hoped to finally reopen on Saturday. The zoo announced on Tuesday morning it would not open to crowds that day. It later acknowledged that a pack of grey wolves had escaped after “suspicious” damage to the fence of their enclosure. The zoo said the incident was probably the result of “malicious intent”.Canada conservation officers seek runaway wolf days after zoo break-inRead moreOpened in 1970, the tourist attraction has nine adult grey wolves and six pups. Staff did not confirm how many had initially escaped after the fence was broken, nor did it say how many remained unaccounted for. The Royal Canadian Mounted Police are investigating the break-in, but a lack of surveillance footage has made it difficult to identify any suspects.The Greater Vancouver Zoo is seen in Langley, British Columbia. 
I can just tell you that there was damage done to the enclosure to allow the wolves to exit. At this point, there’s no surveillance, so we don’t have any information to indicate how they got in or suspect information,” said Cpl Holly Largy of Langley RCMP. Located outside Vancouver, the zoo spans 120 acres in the Fraser Valley and is close to a large forested area that contains a naval radio communications facility. Animal rights activists have in recent years focused on the zoo following two attacks: one on a girl who was bitten by a black bear, and another on a staff member attacked by a jaguar. In 2019, the Vancouver Humane Society released a report criticizing the zoo. In 2020, the owners of the zoo spent millions on a “major overhaul” of the facility. The apparent act of vandalism also comes as the plight of wolves, which once thrived in the region, faces activists’ scrutiny. This year, the province extended its aerial wolf cull for another five years. The controversial program kills as many as 300 wolves a year in an effort to save woodland caribou."

The `SpaCy` language model found 50 entities in it.

In [15]:
article = nlp(article)
len(article.ents)

50

Ten of the entities are identified as GPE (geo-political entities, e.g., countries, cities, states), 6 are PERSON, and 8 are CARDINAL, i.e., numerals that do not fall under another type. Six are ORG (i.e., companies, agencies, institutions, etc.), 15 are DATE (i.e., an absolute date or a period), 1 is NORP (nationalities or religious or political groups), 2 are TIME (i.e., times shorter than a day), 1 is QUANTITY (a measurement, e.g., a weight or a distance), and 1 is FAC, which marks buildings, airports, highways, bridges, etc.

In [16]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'GPE': 10,
         'PERSON': 6,
         'CARDINAL': 8,
         'ORG': 6,
         'DATE': 15,
         'NORP': 1,
         'TIME': 2,
         'QUANTITY': 1,
         'FAC': 1})
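As a quick sanity check, the label counts should add up to the 50 entities reported earlier. The sketch below verifies this and expresses each label as a share of the total. The `Counter` is copied from the output above, so no model is needed to run it.

```python
from collections import Counter

# Label counts copied from the output above.
label_counts = Counter({
    "GPE": 10, "PERSON": 6, "CARDINAL": 8, "ORG": 6, "DATE": 15,
    "NORP": 1, "TIME": 2, "QUANTITY": 1, "FAC": 1,
})

total = sum(label_counts.values())
print(total)  # 50

# Share of each label in percent, most frequent first.
shares = {
    label: round(100 * count / total, 1)
    for label, count in label_counts.most_common()
}
print(shares["DATE"], shares["GPE"])  # 30.0 20.0
```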

The most common entities in this article are "Canada", "wolf pups", and "one".

In [17]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Canada', 4), ('wolf pups', 2), ('one', 2)]

Exploration goes deeper by selecting two sentences for further analysis.

In [18]:
sentences = [x for x in article.sents]
pprint(sentences[2:4])

[Two pups from the Greater Vancouver Zoo were found after four days on the loose.,
 Emotions are bittersweet at a Canadian zoo after a runaway wolf pup was safely located after four days on the loose, but another was found dead along a road.]


`displaCy` renders the text and marks the recognized entities, both in color and with their labels.

In [19]:
displacy.render(nlp(str(sentences[2:4])), jupyter = True, style = "ent")

Furthermore, the `displaCy` dependency visualizer (dep) shows the part-of-speech tags and syntactic dependencies.

In [20]:
displacy.render(nlp(str(sentences[2])), style= "dep", jupyter = True, options = {"distance": 120})

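The linguistic analysis continues by extracting the verbatim text, the part-of-speech tag, and the lemma of each token in the two sentences, skipping stop words and punctuation. Because `str(sentences[2:4])` converts a list to a string, the opening bracket `[` of that list shows up as the first token in the output.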

In [21]:
[(x.orth_, x.pos_, x.lemma_) for x in [y for y in nlp(str(sentences[2:4])) if not y.is_stop and y.pos_ != "PUNCT"]]

[('[', 'X', '['),
 ('pups', 'NOUN', 'pup'),
 ('Greater', 'PROPN', 'Greater'),
 ('Vancouver', 'PROPN', 'Vancouver'),
 ('Zoo', 'PROPN', 'Zoo'),
 ('found', 'VERB', 'find'),
 ('days', 'NOUN', 'day'),
 ('loose', 'NOUN', 'loose'),
 ('Emotions', 'NOUN', 'emotion'),
 ('bittersweet', 'ADJ', 'bittersweet'),
 ('Canadian', 'ADJ', 'canadian'),
 ('zoo', 'NOUN', 'zoo'),
 ('runaway', 'ADJ', 'runaway'),
 ('wolf', 'NOUN', 'wolf'),
 ('pup', 'PROPN', 'pup'),
 ('safely', 'ADV', 'safely'),
 ('located', 'VERB', 'locate'),
 ('days', 'NOUN', 'day'),
 ('loose', 'ADJ', 'loose'),
 ('found', 'VERB', 'find'),
 ('dead', 'ADJ', 'dead'),
 ('road', 'NOUN', 'road')]

The code line below prints the recognised entities and the type of entity they belong to.

In [22]:
dict([(str(x), x.label_) for x in nlp(str(sentences[2:4])).ents])

{'Two': 'CARDINAL',
 'the Greater Vancouver Zoo': 'ORG',
 'four days': 'DATE',
 'Canadian': 'NORP'}

`SpaCy` marks "Greater", "Vancouver", "Zoo", and "days" as inside an entity (I), and all of these labels are correct. "Two", "the", and "four" each begin an entity (B), and the remaining tokens are labelled as outside any entity (O).

In [23]:
pprint([(x, x.ent_iob_, x.ent_type_) for x in sentences[2]])

[(Two, 'B', 'CARDINAL'),
 (pups, 'O', ''),
 (from, 'O', ''),
 (the, 'B', 'ORG'),
 (Greater, 'I', 'ORG'),
 (Vancouver, 'I', 'ORG'),
 (Zoo, 'I', 'ORG'),
 (were, 'O', ''),
 (found, 'O', ''),
 (after, 'O', ''),
 (four, 'B', 'DATE'),
 (days, 'I', 'DATE'),
 (on, 'O', ''),
 (the, 'O', ''),
 (loose, 'O', ''),
 (., 'O', '')]


At the end, the entire article is displayed with the recognised entities marked. A brief check reveals an error: "intent" is labelled as GPE, although it is not a named entity at all.

In [24]:
displacy.render(nlp(str(article)), jupyter=True, style = "ent")

To conclude, the sentence and the article above showed what Named Entity Recognition is, what it is used for, and how it works.