
# Named Entity Recognition (NER) in NLP

**Named Entity Recognition (NER)** is a core task in Natural Language Processing (NLP): the goal is to identify named entities in text and classify them into predefined categories such as **persons, organizations, locations, and dates**. NER is useful for many applications, including **information retrieval, question answering, and summarization**.

NER can be performed with different techniques, including rule-based methods, statistical models, and deep learning approaches. The cells below show how NER is run in practice with two popular libraries, spaCy and NLTK.
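
To make the rule-based family concrete, here is a minimal sketch that recognizes one entity type (dates written like "November 25, 2024") with a regular expression. The pattern, the `find_dates` helper, and the sample sentence are illustrative assumptions, not part of spaCy or NLTK, and a real rule set would need many more patterns.

```python
import re

# Hypothetical rule: "<Month name> <day>, <4-digit year>"
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")


def find_dates(text):
    """Return (matched text, start offset, end offset, label) for each date."""
    return [(m.group(), m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]


sample = "Tim Cook will visit London on November 25, 2024."
dates = find_dates(sample)
for matched, start, end, label in dates:
    print(matched, " | ", label, " | ", start, "|", end)
```

Rules like this are precise but cover only the surface forms they were written for, which is one reason the statistical and neural models used in the rest of this notebook are the more common choice.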

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names



['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [18]:
doc = nlp("Apple CEO Tim Cook will visit London on November 25, 2024, and posted something about the Royal Albert Hall on Twitter")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Apple  |  ORG  |  Companies, agencies, institutions, etc.
Tim Cook  |  PERSON  |  People, including fictional
London  |  GPE  |  Countries, cities, states
November 25, 2024  |  DATE  |  Absolute or relative dates or periods
the Royal Albert Hall  |  ORG  |  Companies, agencies, institutions, etc.


In [19]:
from spacy import displacy

displacy.render(doc, style="ent")

In [20]:
# List every entity label the NER component can predict
nlp.pipe_labels['ner']


['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [21]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

Apple  |  ORG  |  0 | 5
Tim Cook  |  PERSON  |  10 | 18
London  |  GPE  |  30 | 36
November 25, 2024  |  DATE  |  40 | 57
the Royal Albert Hall  |  ORG  |  86 | 107
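
`start_char` and `end_char` are character offsets into the original string, with the end offset exclusive, so they can be used directly as a Python slice. The sketch below replays the offsets printed above on the plain text, without spaCy; the `spans` list is copied by hand from that output.

```python
text = (
    "Apple CEO Tim Cook will visit London on November 25, 2024, "
    "and posted something about the Royal Albert Hall on Twitter"
)

# (start_char, end_char, label) triples copied from the output above
spans = [
    (0, 5, "ORG"),
    (10, 18, "PERSON"),
    (30, 36, "GPE"),
    (40, 57, "DATE"),
    (86, 107, "ORG"),
]

# Slicing the raw text with the offsets recovers each entity string
recovered = [(text[start:end], label) for start, end, label in spans]
for entity_text, label in recovered:
    print(entity_text, " | ", label)
```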


In [29]:
s = doc[0:5]
print(s)
print(type(s))

Apple CEO Tim Cook will
<class 'spacy.tokens.span.Span'>


In [42]:
# Number of tokens in the doc, and the span starting at token 22 (the last token)
print(len(doc), doc[22:])


23 Twitter


In [43]:
# Custom entities: label the last token ("Twitter") as ORG and keep the entities already on the doc
from spacy.tokens import Span

s1 = Span(doc, 22, len(doc), label="ORG")
doc.set_ents([s1], default="unmodified")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Apple  |  ORG
Tim Cook  |  PERSON
London  |  GPE
November 25, 2024  |  DATE
the Royal Albert Hall  |  ORG
on  |  ORG
Twitter  |  ORG
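
Hard-coded token indices such as `22` break as soon as the sentence changes. A minimal sketch of an alternative, assuming a blank English pipeline (`spacy.blank("en")`, tokenizer only, so no trained model is needed): locate the text by character offsets with `Doc.char_span`, then attach it with `Doc.set_ents`. The names `nlp_blank`, `doc_blank`, and `twitter_span` are illustrative. Because the blank pipeline has no NER component, only the custom entity appears here.

```python
import spacy

text = (
    "Apple CEO Tim Cook will visit London on November 25, 2024, "
    "and posted something about the Royal Albert Hall on Twitter"
)

# Tokenizer-only pipeline: no statistical NER component is loaded
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank(text)

# Find the entity by character offsets instead of token indices.
# char_span returns None when the offsets do not align with token boundaries.
start = text.index("Twitter")
twitter_span = doc_blank.char_span(start, start + len("Twitter"), label="ORG")

if twitter_span is not None:
    # default="unmodified" leaves the annotation of all other tokens as it is
    doc_blank.set_ents([twitter_span], default="unmodified")

ents = [(ent.text, ent.label_) for ent in doc_blank.ents]
print(ents)
```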


In [53]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [70]:
from nltk import word_tokenize, pos_tag
from nltk.chunk import ne_chunk

# Text input
text = "Apple's CEO 'Tim Cook' will visit London on November 25, 2024, and posted something about the Royal Albert Hall on Twitter"

# Tokenize, POS tag and perform NER
tokens = word_tokenize(text)
tags = pos_tag(tokens)
tree = ne_chunk(tags)

# Extract named entities
for subtree in tree:
    if isinstance(subtree, nltk.Tree):  # Check if the subtree is a named entity
        entity = " ".join([word for word, tag in subtree])
        label = subtree.label()  # Organization, Location, etc.
        print(entity, " | ", label)


Apple  |  GPE
CEO  |  ORGANIZATION
Cook  |  PERSON
London  |  GPE
Royal  |  ORGANIZATION
Albert Hall  |  PERSON
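
The loop above walks the chunk tree by hand. NLTK can also flatten a chunk tree into per-token IOB tags with `nltk.chunk.tree2conlltags`, which is often easier to compare against other tools. The sketch below uses a small hand-built tree (an assumption for illustration, not the output of `ne_chunk`) so it runs without any downloaded models.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built chunk tree in the same shape ne_chunk returns:
# entity subtrees hold (word, POS tag) leaves, other tokens sit at the top level
toy_tree = Tree(
    "S",
    [
        Tree("PERSON", [("Tim", "NNP"), ("Cook", "NNP")]),
        ("visits", "VBZ"),
        Tree("GPE", [("London", "NNP")]),
    ],
)

# Each token becomes (word, POS tag, IOB tag): B- opens an entity, I- continues it, O is outside
iob_tags = tree2conlltags(toy_tree)
for word, pos, iob in iob_tags:
    print(word, " | ", pos, " | ", iob)
```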


In [62]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk import pos_tag
from nltk.chunk import ne_chunk

text = "Apple CEO Tim Cook will visit London on November 25, 2024, and posted something about the Royal Albert Hall on Twitter"

# Step 1: Replace multi-word entities with single-token placeholders
multi_token_entities = {
    "Royal Albert Hall": "ROYAL_ALBERT_HALL",
    "Twitter": "TWITTER"
}

for entity, placeholder in multi_token_entities.items():
    text = text.replace(entity, placeholder)

# Step 2: Tokenize with a regular-expression tokenizer.
# Note: the `\s` alternative turns every whitespace character into a token of its own.
tokenizer = RegexpTokenizer(r'\s|[.,!?;()]|[A-Za-z0-9_]+')
tokens = tokenizer.tokenize(text)

# Step 3: Restore the original entity text inside the tokens
tokens = [token.replace("ROYAL_ALBERT_HALL", "Royal Albert Hall").replace("TWITTER", "Twitter") for token in tokens]
print(tokens)

# Step 4: POS-tag the tokens and run the named-entity chunker
tags = pos_tag(tokens)
tree = ne_chunk(tags)

# Step 5: Extract named entities
for subtree in tree:
    if isinstance(subtree, nltk.Tree):  # Check if the subtree is a named entity
        entity = " ".join([word for word, tag in subtree])
        label = subtree.label()  # Organization, Location, etc.
        print(entity, " | ", label)


['Apple', ' ', 'CEO', ' ', 'Tim', ' ', 'Cook', ' ', 'will', ' ', 'visit', ' ', 'London', ' ', 'on', ' ', 'November', ' ', '25', ',', ' ', '2024', ',', ' ', 'and', ' ', 'posted', ' ', 'something', ' ', 'about', ' ', 'the', ' ', 'Royal Albert Hall', ' ', 'on', ' ', 'Twitter']
Apple  |  PERSON
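
The token list above contains a `' '` entry between every pair of words, because the tokenizer pattern matches whitespace as a token. A minimal sketch of a tokenizer that keeps multi-word phrases together and drops whitespace, using only the standard `re` module; the `phrases` list and the pattern are illustrative assumptions, and how the NLTK tagger and chunker treat a token containing spaces is not shown here.

```python
import re

text = (
    "Apple CEO Tim Cook will visit London on November 25, 2024, "
    "and posted something about the Royal Albert Hall on Twitter"
)

# Hypothetical phrase list; longest first so longer phrases win over shorter ones
phrases = sorted(["Royal Albert Hall", "Twitter"], key=len, reverse=True)
phrase_alternatives = "|".join(re.escape(p) for p in phrases)

# Try a protected phrase first, then a word, then a single punctuation mark.
# Whitespace matches none of the alternatives, so it is skipped.
token_re = re.compile(rf"\b(?:{phrase_alternatives})\b|\w+|[^\w\s]")

clean_tokens = token_re.findall(text)
print(clean_tokens)
```

Compared with the placeholder approach, this needs no replace-and-restore round trip, since the phrase is matched as one token in a single pass.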
