# Named Entity Recognition

Named Entity Recognition (NER) is a technique in natural language processing (NLP) that focuses on identifying and classifying entities in text. The purpose of NER is to automatically extract structured information from unstructured text, enabling machines to understand and categorize entities in a meaningful manner for applications such as text summarization, question answering, and knowledge graph construction. This article explores the fundamentals, methods, and implementation of NER.

# What is Named Entity Recognition (NER)?
Named entity recognition (NER) is also referred to as entity identification, entity chunking, and entity extraction. NER is the component of information extraction that aims to identify and categorize named entities within unstructured text. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is a thing that is consistently talked about or referred to in the text, such as a person name, organization, location, time expression, quantity, percentage, or another predefined category.

NER systems find applications across various domains, including question answering, information retrieval, and machine translation. NER also plays an important role in enhancing the precision of other NLP tasks like part-of-speech tagging and parsing. At its core, NER is a two-step process:

1. Detecting the entities in the text
2. Classifying them into different categories

# Ambiguity in NER
For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in classification. Let’s look at some ambiguous examples:
- England (Organization) won the 2019 World Cup vs. The 2019 World Cup happened in England (Location).
- Washington (Location) is the capital of the US vs. The first president of the US was Washington (Person).
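
The ambiguity can be made concrete with a small sketch. The lookup table and cue words below are hypothetical and chosen only to match the two examples; a real system would learn such context signals from data rather than hard-code them.

```python
# Hypothetical ambiguous lexicon: one surface form, several possible labels.
AMBIGUOUS = {
    "England": {"ORGANIZATION", "LOCATION"},
    "Washington": {"LOCATION", "PERSON"},
}

# Hypothetical context cues (assumption for illustration only).
CUES = {
    "won": "ORGANIZATION",
    "in": "LOCATION",
    "capital": "LOCATION",
    "president": "PERSON",
}

def disambiguate(entity, sentence):
    """Pick a label for `entity` using the first matching cue word in the sentence."""
    candidates = AMBIGUOUS.get(entity, set())
    for word in sentence.replace(".", "").split():
        label = CUES.get(word.lower())
        if label in candidates:
            return label
    return None  # no cue found: the lookup alone cannot decide

print(disambiguate("England", "England won the 2019 world cup"))
print(disambiguate("England", "The 2019 world cup happened in England"))
```

Without the cue words, the table alone returns two candidate labels for each name, which is exactly the ambiguity described above.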

# How Does Named Entity Recognition (NER) Work?
The working of Named Entity Recognition is discussed below:

- The NER system analyses the entire input text to identify and locate the named entities.
- The system then identifies sentence boundaries using capitalization rules: a word that starts with a capital letter may mark the beginning of a new sentence, and therefore the end of the previous one. Knowing sentence boundaries helps place entities in context, allowing the model to understand relationships and meanings.
- An NER pipeline can also be trained to classify entire documents into different types, such as invoices, receipts, or passports. Document classification makes NER more versatile, allowing it to adapt its entity recognition to the specific characteristics and context of each document type.
- NER employs machine learning algorithms, including supervised learning, to analyze labeled datasets. These datasets contain examples of annotated entities, guiding the model in recognizing similar entities in new, unseen data.
- Through multiple training iterations, the model refines its understanding of contextual features, syntactic structures, and entity patterns, improving its accuracy over time.
- The model's ability to adapt to new data allows it to handle variations in language, context, and entity types, making it more robust and effective.
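
To illustrate what a labeled dataset of annotated entities can look like, here is a minimal sketch that converts entity spans into per-token tags. It assumes the BIO tagging scheme (B = beginning, I = inside, O = outside), which is a common convention but is not prescribed by this article; the helper name `spans_to_bio` is hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) spans over token indices to BIO tags.

    `end` is exclusive. Tokens outside every span get the tag "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Mahua", "Moitra", "moved", "the", "Supreme", "Court"]
spans = [(0, 2, "PERSON"), (4, 6, "ORG")]  # hand-made annotations
bio_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio_tags)))
```

A supervised model is then trained to predict one such tag per token on new, unseen text.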

# Named Entity Recognition (NER) Methods
# 1. Lexicon-Based Method
Lexicon-based NER uses a dictionary containing a list of words or terms. The process involves checking whether any of these words are present in a given text. However, this approach isn't commonly used because the dictionary requires constant updating and careful maintenance to stay accurate and effective.
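
A minimal sketch of dictionary lookup, assuming a tiny hand-made gazetteer and simple whitespace tokenization. The names `GAZETTEER` and `lexicon_ner` are hypothetical and do not come from any library.

```python
# Hypothetical gazetteer mapping known terms (lowercased) to entity labels.
GAZETTEER = {
    "india": "GPE",
    "geeksforgeeks": "ORGANIZATION",
    "supreme court": "ORGANIZATION",
}

def lexicon_ner(text, gazetteer=GAZETTEER, max_len=3):
    """Return (phrase, label) pairs found by longest-match dictionary lookup."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).strip(".,")
            label = gazetteer.get(phrase.lower())
            if label:
                match = (phrase, label, n)
                break
        if match:
            found.append(match[:2])
            i += match[2]  # skip the tokens that were consumed
        else:
            i += 1
    return found

result = lexicon_ner("GeeksforGeeks is a recognised platform for online learning in India")
print(result)
```

Any entity missing from the dictionary is silently skipped, which is the maintenance burden described above.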

# 2. Rule-Based Method
The rule-based NER method uses a set of predefined rules to guide the extraction of information. These rules are based on patterns and context. Pattern-based rules focus on the structure and form of words, looking at their morphological patterns. Context-based rules, on the other hand, consider the surrounding words or the context in which a word appears within the text. Combining pattern-based and context-based rules improves the precision of information extraction in NER.
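
The two kinds of rules can be sketched with regular expressions. Both rules below are illustrative assumptions: one pattern-based rule for percentages and one context-based rule that treats capitalized words after a title as a person name.

```python
import re

# Pattern-based rule: digits (optionally with a decimal part) followed by "%".
PERCENT = re.compile(r"\b\d+(?:\.\d+)?%")

# Context-based rule: capitalized words right after a title form a person name.
TITLE_NAME = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+((?:[A-Z][a-z]+\s?)+)")

def rule_based_ner(text):
    """Apply both rules and return a list of (entity_text, label) pairs."""
    entities = [(m.group(), "PERCENT") for m in PERCENT.finditer(text)]
    entities += [(m.group(1).strip(), "PERSON") for m in TITLE_NAME.finditer(text)]
    return entities

print(rule_based_ner("Dr. Jane Smith reported a 12.5% rise."))
```

The percentage rule looks only at the form of the token, while the name rule depends on the word that precedes it.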

# 3. Machine Learning-Based Method
A. Multi-Class Classification with Machine Learning Algorithms
One way is to train a model for multi-class classification using different machine learning algorithms, but this requires a lot of labelled data. In addition to labelling, the model also requires a deep understanding of context to deal with ambiguous sentences. This makes NER a challenging task for a simple machine learning algorithm.
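
As a rough sketch of how context can be given to such a classifier, each token can be turned into a feature dictionary that includes its neighbours. The feature set below is an assumption chosen for illustration, not a recommended or standard list.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] from the word and its neighbours."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),   # starts with a capital letter
        "is_upper": word.isupper(),   # e.g. acronyms such as "US"
        "is_digit": word.isdigit(),
        "suffix3": word[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = "Washington is the capital of the US".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[0])
```

Each dictionary, paired with a label such as PERSON or LOCATION, would form one training example for the multi-class classifier.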
B. Conditional Random Field (CRF)
Conditional random fields are implemented by both NLP Speech Tagger and NLTK. A CRF is a probabilistic model that can be used to model sequential data such as words.
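
For reference, the standard linear-chain form of a CRF can be written as below. The symbols are introduced here for illustration: $x$ is the word sequence of length $T$, $y$ the label sequence, $f_k$ the feature functions, and $\lambda_k$ their learned weights.

```latex
$$
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
$$
```

Because each feature function sees both the previous and the current label, the model scores whole label sequences rather than classifying each word independently.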

# 4. Deep Learning-Based Method
Deep learning NER systems are generally more accurate than the previous methods because they can group related words together. This is due to word embeddings, which capture the semantic and syntactic relationships between words.
Such a system can also learn to analyze topic-specific as well as high-level words automatically.
This makes deep learning NER applicable to multiple tasks. Because deep learning handles most of the repetitive work itself, researchers, for example, can use their time more efficiently.
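
The idea that embeddings place related words close together can be sketched with cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

# Toy vectors (hypothetical values, not taken from any trained model).
EMBEDDINGS = {
    "london": [0.9, 0.1, 0.0],
    "paris": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

city_sim = cosine(EMBEDDINGS["london"], EMBEDDINGS["paris"])
fruit_sim = cosine(EMBEDDINGS["london"], EMBEDDINGS["banana"])
print(city_sim > fruit_sim)
```

In this toy setup the two city names are much closer to each other than either is to "banana", which is the kind of signal a deep learning NER model can use to generalize to words it has rarely seen.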

# Using NLTK

In [3]:
import nltk
nltk.download('maxent_ne_chunker')
  

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.


True

In [6]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [7]:
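import nltk

# word_tokenize and pos_tag need their own NLTK resources as well, e.g.
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# (resource names can differ between NLTK versions).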

# Define the text to be analyzed
text = "GeeksforGeeks is a recognised platform for online learning in India"

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Apply part-of-speech tagging to the tokens
tagged = nltk.pos_tag(tokens)

# Apply named entity recognition to the tagged words
entities = nltk.chunk.ne_chunk(tagged)

# Print the organizations and geopolitical entities found in the text
for entity in entities:
    if hasattr(entity, 'label') and entity.label() in ('ORGANIZATION', 'GPE'):
        # Join with a space so multi-word entities stay readable
        print(entity.label(), '-->', ' '.join(c[0] for c in entity))


ORGANIZATION --> GeeksforGeeks
GPE --> India


In [8]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import ne_chunk

text = "GeeksforGeeks is a recognised platform for online learning in India"

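tokens = word_tokenize(text)
tagged = pos_tag(tokens)

entities = ne_chunk(tagged)

# Entity subtrees carry a label; plain (word, tag) tuples do not
for entity in entities:
    if hasattr(entity, 'label'):
        # Join with a space so multi-word entities stay readable
        print(entity.label(), '-->', ' '.join(c[0] for c in entity))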

ORGANIZATION --> GeeksforGeeks
GPE --> India


# Using spaCy

In [9]:
import pandas as pd 
import spacy 
import requests 
from bs4 import BeautifulSoup
nlp = spacy.load("en_core_web_sm")
pd.set_option("display.max_rows", 200)


In [10]:
content = "Trinamool Congress leader Mahua Moitra has moved the Supreme Court against her expulsion from the Lok Sabha over the cash-for-query allegations against her. Moitra was ousted from the Parliament last week after the Ethics Committee of the Lok Sabha found her guilty of jeopardising national security by sharing her parliamentary portal's login credentials with businessman Darshan Hiranandani."

doc = nlp(content)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


Congress 10 18 ORG
Mahua Moitra 26 38 PERSON
the Supreme Court 49 66 ORG
the Lok Sabha 94 107 PERSON
Moitra 157 163 ORG
Parliament 184 194 ORG
last week 195 204 DATE
the Ethics Committee 211 231 ORG
Darshan Hiranandani 373 392 PERSON


# Visualize
The displacy.render function from spaCy is used to visualize the named entities in a text. It generates a visual representation with colored highlights indicating the recognized entities and their respective categories.

In [11]:
from spacy import displacy
displacy.render(doc, style="ent")


Using the following code, we will create a dataframe from the named entities extracted by spaCy, including the text, type (label), and lemma of each entity.

In [12]:
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df = pd.DataFrame(entities, columns=['text', 'type', 'lemma'])
print(df)


                   text    type                 lemma
0              Congress     ORG              Congress
1          Mahua Moitra  PERSON          Mahua Moitra
2     the Supreme Court     ORG     the Supreme Court
3         the Lok Sabha  PERSON         the Lok Sabha
4                Moitra     ORG                Moitra
5            Parliament     ORG            Parliament
6             last week    DATE             last week
7  the Ethics Committee     ORG  the Ethics Committee
8   Darshan Hiranandani  PERSON   Darshan Hiranandani
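
As a small follow-up, the extracted entities can be summarized by type. This sketch copies the (text, label) pairs from the spaCy output above into a plain list so that it runs without spaCy installed; in the notebook, the same count could be taken over `doc.ents` directly.

```python
from collections import Counter

# (text, label) pairs copied from the spaCy output shown above.
extracted = [
    ("Congress", "ORG"),
    ("Mahua Moitra", "PERSON"),
    ("the Supreme Court", "ORG"),
    ("the Lok Sabha", "PERSON"),
    ("Moitra", "ORG"),
    ("Parliament", "ORG"),
    ("last week", "DATE"),
    ("the Ethics Committee", "ORG"),
    ("Darshan Hiranandani", "PERSON"),
]

# Count how many entities were found for each label.
label_counts = Counter(label for _, label in extracted)
print(label_counts.most_common())
```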
