In [None]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Named Entity Recognition (NER)

**Sources:**
- https://www.analyticsvidhya.com/blog/2021/11/a-beginners-introduction-to-ner-named-entity-recognition/
- https://www.turing.com/kb/a-comprehensive-guide-to-named-entity-recognition

Named Entity Recognition (NER), also known as entity chunking, entity extraction, or entity identification, is a task within **Natural Language Processing (NLP)**, a subfield of **Artificial Intelligence (AI)**. It involves identifying and classifying specific pieces of information, known as entities, in text. An entity can be a single word or a group of words that consistently refers to the same concept.

The main goal of NER is to analyze unstructured text, identify named entities, and assign them to predetermined categories such as Organization, Person, Location, or Time, or to custom categories tailored to a specific use case, such as Healthcare Terms or Programming Languages. This makes the data easier to use for tasks such as data analysis, information retrieval, and knowledge graph construction.
 
## Different NER Systems

There are several approaches to building NER systems: dictionary-based, rule-based, machine learning (ML) based, and deep learning approaches.

### Dictionary-based Systems
This is the simplest NER approach. It relies on a dictionary containing a vocabulary of known entity names and uses basic string-matching algorithms to check whether any of those entries occur in the given text. Its main limitation is that the dictionary must be continually updated and maintained.
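
The string-matching idea can be sketched in a few lines of plain Python. This is only an illustration: the `GAZETTEER` dictionary, its entries, and the `dictionary_ner` function are hypothetical names invented for this example, not part of any library.

```python
import re

# Hypothetical dictionary (gazetteer): known surface form -> entity label.
GAZETTEER = {
    "new york": "Location",
    "google": "Organization",
    "ada lovelace": "Person",
}

def dictionary_ner(text, gazetteer=GAZETTEER):
    """Return (matched_text, label, start, end) for each dictionary entry found in text."""
    matches = []
    for term, label in gazetteer.items():
        # Word boundaries stop "google" from matching inside "googled".
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.group(0), label, m.start(), m.end()))
    # Sort by start offset so results follow reading order.
    return sorted(matches, key=lambda item: item[2])

for entity in dictionary_ner("Ada Lovelace never worked at Google in New York."):
    print(entity)
```

Any name missing from the dictionary is silently skipped, which is the maintenance burden described above.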

### Rule-based Systems
Here, the model uses a predefined set of rules for information extraction. Two main types of rules are used: pattern-based rules, which depend on the morphological pattern of the words, and context-based rules, which depend on the context in which a word appears in the document.

- **Pattern-based rule:** We define a pattern-based rule that identifies names by looking for sequences of capitalized words. For example, a pattern-based rule might say that any sequence of two or more consecutive capitalized words (e.g., "John Smith," "Mary Jane Watson") should be classified as a "Person" entity. This rule relies on the morphological pattern of capitalized words.

- **Context-based rule:** We define context-based rules that consider the context in which words appear. For instance, we might create a rule that checks if a capitalized word appears after the salutation "Mr." or "Ms." (e.g., "Mr. Smith," "Ms. Watson"). In such cases, we would classify the capitalized word as a "Person" entity based on the context.
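
Both rule types described above can be approximated with regular expressions. The sketch below is deliberately simplistic; the names `CAPITALIZED_SEQUENCE`, `AFTER_SALUTATION`, `pattern_rule`, and `context_rule` are made up for this example, and the patterns assume plain ASCII names.

```python
import re

# Pattern-based rule: two or more consecutive capitalized words.
CAPITALIZED_SEQUENCE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b")

# Context-based rule: a capitalized word directly after "Mr." or "Ms.".
AFTER_SALUTATION = re.compile(r"\b(?:Mr|Ms)\.\s+([A-Z][a-z]+)")

def pattern_rule(text):
    """Return every run of two or more capitalized words as a Person candidate."""
    return [m.group(0) for m in CAPITALIZED_SEQUENCE.finditer(text)]

def context_rule(text):
    """Return the capitalized word that follows a salutation as a Person candidate."""
    return [m.group(1) for m in AFTER_SALUTATION.finditer(text)]

text = "The board thanked Mr. Smith and Ms. Watson before Mary Jane Watson spoke."
print("Pattern-based:", pattern_rule(text))
print("Context-based:", context_rule(text))
```

The pattern-based rule would also label "New York" or "Apple Inc" as a Person, which shows why hand-written rules usually need many exceptions.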

### Machine Learning-based Systems
ML-based systems use statistical models to detect entity names. These models build a feature-based representation of the observed data. This overcomes many limitations of the dictionary-based and rule-based approaches, because the model can recognize an entity name even when it appears with small spelling variations.

An ML-based NER solution has two main phases:
- In the first phase, the ML model is trained on annotated documents.
- In the second phase, the trained model is used to annotate new, unlabeled documents.
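
To make "feature-based representation" more concrete, here is a minimal sketch of per-token features of the kind often fed to a statistical sequence model (for example, a conditional random field) during the training phase. The function name `token_features` and the particular feature set are assumptions chosen for illustration, not a prescribed recipe.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # Neighbouring words give the model some local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = "Jon Smyth joined Acme in 2021".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[1])
```

Because "Smyth" and "Smith" share capitalization and the same neighbouring words, a model trained on such features can plausibly generalize across the spelling variation, which a strict dictionary lookup cannot do.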

### Deep Learning Approaches

**Key Components:** 

- **Distributed representations for input**: This component converts words or characters into fixed-length vectors. These vectors capture the semantic meaning of each word or character, which lets models process and understand text more efficiently.

- **Context encoder**: This is a component of the model that takes a sequence of words or characters as input and converts it into a contextual representation, considering the relationships and dependencies between the elements of the sequence. The context encoder helps the model understand the connections between words in a sentence or text, capturing the semantic and grammatical relationships between words.

- **Tag decoder**: This component predicts a label (tag) for each element of the sequence. In sequence-labeling tasks such as part-of-speech tagging or Named Entity Recognition (NER), the tag decoder outputs part-of-speech tags, entity labels, or other semantic labels for the input sequence.
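
The toy sketch below shows only how data might flow through the three components; nothing in it is learned. All vectors, weights, and function names (`embed`, `encode_context`, `decode_tags`) are hypothetical hand-set values for illustration. Real systems typically use trained embeddings, a recurrent or Transformer encoder, and a softmax or CRF decoder.

```python
# Hypothetical hand-set values; a real model learns all of these from data.
EMBEDDINGS = {"alice": [1.0, 0.0], "visited": [0.0, 0.0], "paris": [0.0, 1.0]}
TAG_WEIGHTS = {"PER": [1.0, 0.0], "LOC": [0.0, 1.0], "O": [0.0, 0.0]}
TAG_BIAS = {"PER": 0.0, "LOC": 0.0, "O": 0.4}

def embed(tokens):
    """Distributed representation: map each token to a fixed-length vector."""
    return [EMBEDDINGS.get(t.lower(), [0.0, 0.0]) for t in tokens]

def encode_context(vectors):
    """Context encoder (toy): average each vector with its immediate neighbours."""
    encoded = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - 1): i + 2]
        encoded.append([sum(dim) / len(window) for dim in zip(*window)])
    return encoded

def decode_tags(encoded):
    """Tag decoder (toy): score every tag linearly and keep the best one."""
    tags = []
    for h in encoded:
        scores = {
            tag: sum(w * x for w, x in zip(weights, h)) + TAG_BIAS[tag]
            for tag, weights in TAG_WEIGHTS.items()
        }
        tags.append(max(scores, key=scores.get))
    return tags

tokens = ["Alice", "visited", "Paris"]
print(list(zip(tokens, decode_tags(encode_context(embed(tokens))))))
```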

In [None]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [None]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Load text from a file
file_path = 'ner.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Process the text with spaCy
doc = nlp(text)

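# Set to keep track of (text, label) pairs that were already printed
seen_entities = set()

# Extract and print unique named entities
for ent in doc.ents:
    # Key on both text and label so the same string with a different label is not hidden
    key = (ent.text, ent.label_)
    if key not in seen_entities:
        seen_entities.add(key)
        print(f"Entity: {ent.text}, Label: {ent.label_}")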