<a href="https://colab.research.google.com/github/SKumarAshutosh/natural-language-processing/blob/master/NLP_NER_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition (NER).

### 1. What is NER?

Named Entity Recognition (NER) is a sub-task of information extraction that locates named entities in text and classifies them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.

Example:

Input: "Apple is planning to buy U.K. startup for $1 billion."

Output:
- "Apple" → ORGANIZATION
- "U.K." → LOCATION
- "$1 billion" → MONEY

### 2. When is it required?

NER is required in various scenarios including:

- Information retrieval: Enhance search by tagging named entities in documents.
- Content recommendation: Based on entities identified in the content.
- Data analytics: Extract structured information from unstructured data sources.
- Knowledge graph construction: Identifying entities and their relations.
- Automating business processes: e.g., extracting financial information or client names from emails.

### 3. Why is it required?

- **Structure to Unstructured Data:** Most of the world's data is unstructured. NER provides a way to extract structured information.
- **Efficiency:** It's an automated way to process large datasets and extract valuable information.
- **Enhanced Analysis:** By identifying entities, data scientists and analysts can focus on more specific aspects of data analysis.

### 4. How can we do NER?

NER can be achieved through:

- **Rule-based Systems:** Using regular expressions and dictionaries to identify named entities.
- **Statistical Models:** Such as Conditional Random Fields (CRF).
- **Deep Learning:** Using architectures like Recurrent Neural Networks (RNNs) or Transformer-based models like BERT.

### 5. Different types of models available for NER:

1. **Rule-based Models:**
    - Pros: Highly specific, can be very accurate for known patterns.
    - Cons: Not adaptive; need manual effort for rule creation.

2. **Statistical Models:**
   - **Hidden Markov Models (HMM):** Uses states and transitions with probabilities.
   - **Conditional Random Fields (CRF):** Especially popular for sequence labeling tasks.
     - Pros: Take into account the context.
     - Cons: Require feature engineering; might be outperformed by deep models.

3. **Deep Learning Models:**
   - **RNNs (LSTM/GRU):** Effective for sequence labeling; consider sequence context.
   - **Bidirectional LSTMs:** Capture forward and backward context.
   - **Transformer-based Models (like BERT, RoBERTa, etc.):** Pre-trained on vast corpora and fine-tuned for NER.
     - Pros: State-of-the-art performance, captures deep contextual information.
     - Cons: Computationally intensive.

4. **Hybrid Models:**
   - Combine rule-based, statistical, and deep learning methods to benefit from each.

5. **Ensemble Models:**
   - Combine predictions from multiple models to achieve better accuracy.

6. **Pre-trained Language Models for Transfer Learning:**
   - Models like BERT, GPT-2, and others can be fine-tuned on specific NER tasks to leverage the knowledge they've acquired during their extensive pre-training.

### Final Note:

The best model often depends on the specific task, the amount and quality of training data, and computational resources. In many real-world applications, a combination of different approaches is used to achieve robust and accurate NER.



---
# Layman's perspective.

### What is NER?

Named Entity Recognition (NER) is like highlighting the names of important things in a story or article. If you read a news article, NER will help underline names of people, places, companies, dates, and more.

### Why and When is it Used?

Imagine you're quickly skimming through a newspaper and just want to know the main people or places mentioned. NER helps with this. Businesses use it to quickly understand documents, researchers use it to summarize content, and search engines use it to categorize information.

### Is NER a Regression or Classification Problem?

NER is a classification problem, and more precisely a sequence labeling problem: every word in a sentence gets its own label. Think of it like sorting candies by their colors. Each word (or candy) is given a label (or color).

### Different Models for NER:

1. **Rule-based Models:** Like using a grammar book. If a rule says a word is a name, it's a name.
   - Pros: Simple and straightforward.
   - Cons: Can't adapt to new patterns easily.

2. **Statistical Models:** Imagine asking many friends who often read news about which words are names of companies. Over time, you'd get a good list.
   - Example: Conditional Random Fields (CRF) is a popular method here.

3. **Deep Learning Models:** It's like a very observant person reading thousands of books and noting down names of people, places, and more. Over time, this person gets really good at spotting names.
   - **RNNs:** Think of someone reading a sentence and remembering the previous words to guess the next word's type.
   - **Bidirectional LSTMs:** The same as above, but they remember both previous and upcoming words.
   - **Transformer-based Models (like BERT):** Think of a genius who's read millions of pages and can instantly tell you what each word in a new sentence likely represents.

4. **Hybrid Models:** Combining rules, statistics, and deep learning. Like having a grammar book, friends' suggestions, and a keen observer together.

5. **Ensemble Models:** Asking multiple people (or models) about names in a sentence and then going with the majority vote.

6. **Pre-trained Models:** Imagine someone who's already an expert in recognizing names in English sentences now fine-tuning their skill to recognize names in science articles. That's how models like BERT are used for NER after being pre-trained on lots of general data.

### Conclusion:

NER helps spot names of important things in text. There are many ways to do it, from simple rules to complex deep learning models. The best method often depends on how much data you have and what you want to achieve.




---

# Different NER models based on ML & DL techniques.

Named Entity Recognition (NER) is a popular task in Natural Language Processing, and over the years, various machine learning and deep learning models have been proposed and used for it. Here's a list:

### Machine Learning Models:

1. **Rule-based Systems:** These systems use hand-crafted rules (often regular expressions) to identify entities.
  
2. **Decision Trees:** They predict the entity label based on features derived from the input text, though they're not the most popular choice for NER.

3. **Hidden Markov Models (HMMs):** These consider the sequence of words and their associated states to predict entities.

4. **Maximum Entropy Markov Models (MEMMs):** Like HMMs but more flexible, allowing for the inclusion of arbitrary features.

5. **Conditional Random Fields (CRFs):** A popular choice for NER in the pre-deep learning era. CRFs consider the entire sequence of words (and their context) to predict entity labels. They can include diverse features, like the word's position in a sentence, its capitalization pattern, and more.

### Deep Learning Models:

1. **Recurrent Neural Networks (RNNs):** They process sequences word-by-word, maintaining a hidden state from previous words to inform predictions for the current word.

   - **Long Short-Term Memory (LSTM):** A type of RNN that's better at capturing long-range dependencies in the data.
   
   - **Bidirectional LSTMs (BiLSTM):** These process the sequence from both directions (start-to-end and end-to-start), providing a more comprehensive view of the context for each word.

2. **Gated Recurrent Units (GRUs):** A variation of RNNs that's simpler than LSTMs but offers similar performance for many tasks.

3. **Convolutional Neural Networks (CNNs):** While mostly used for image processing, they've also been employed for NER, capturing local patterns within the text.

4. **Transformer-based Models:** These models use self-attention mechanisms to weigh the importance of different words in the sequence relative to a given word.

   - **BERT (Bidirectional Encoder Representations from Transformers):** Pre-trained on vast amounts of text and can be fine-tuned for NER.
   
   - **RoBERTa, DistilBERT, ALBERT:** Variations and optimizations of the original BERT architecture.
   
   - **XLNet:** A generalized autoregressive model that outperformed BERT on several benchmarks.
   
   - **GPT (Generative Pre-trained Transformer):** While it's primarily used for generation tasks, with the right setup, it can be adapted for NER.

5. **CRF layer on top of Deep Learning models:** It's common to combine the strengths of CRFs and deep learning by using, for example, a BiLSTM to extract features from sequences, followed by a CRF layer to make the final predictions, considering the sequence's structure.

### Hybrid Models:

These combine elements from both traditional machine learning and deep learning. For instance, using rule-based entity recognition to guide or correct a deep learning model's predictions.

In practice, while traditional machine learning models like CRFs were once the state of the art for NER, deep learning models, particularly transformer-based architectures like BERT, currently dominate the field in terms of performance.


## How each NER model works.

### Machine Learning Models:

1. **Rule-based Systems:**
    - **How they work:** Uses pre-defined rules, typically written as regular expressions, to identify entities in text.
    - **Example:** A rule might state that any sequence of digits with a '-' in the middle is a phone number.

2. **Decision Trees:**
    - **How they work:** These models ask a series of yes/no questions about the data to make decisions. Each question splits the data into subsets until a prediction is made.
    - **Example:** Is the word capitalized? If yes, it might be a proper noun.

3. **Hidden Markov Models (HMMs):**
    - **How they work:** Assumes each word corresponds to a hidden "state" (like a part-of-speech or entity label). Transitions between states and the likelihood of a state emitting a particular word are learned from data.
    - **Example:** If the current word is a verb, what's the likelihood the next word is a noun?

4. **Maximum Entropy Markov Models (MEMMs):**
    - **How they work:** Like HMMs, but more flexible. Instead of modeling how each state emits a word, they directly model the probability of the next state given the previous state and the current observation, which lets them use arbitrary, overlapping features.
    - **Example:** What's the likelihood of a word being a city name if it's capitalized and follows the word "in"?

5. **Conditional Random Fields (CRFs):**
    - **How they work:** They're sequence models like HMMs and MEMMs but take the entire sequence into account when making a prediction for each word.
    - **Example:** Given the surrounding words and their features, what's the most likely entity label for the current word?
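The HMM idea described above (hidden states, transition probabilities, emission probabilities) can be sketched with a tiny Viterbi decoder. This is a minimal illustration, not a trained model: the two states, the five-word vocabulary, and every probability are made-up values, and a word outside the vocabulary would raise a `KeyError`.

```python
import math

# Toy HMM: all probabilities are made-up illustrative values, not learned from data.
states = ["O", "LOC"]
start_p = {"O": 0.8, "LOC": 0.2}
trans_p = {
    "O": {"O": 0.7, "LOC": 0.3},
    "LOC": {"O": 0.4, "LOC": 0.6},
}
emit_p = {
    "O": {"i": 0.3, "live": 0.3, "in": 0.3, "new": 0.05, "york": 0.05},
    "LOC": {"i": 0.01, "live": 0.01, "in": 0.01, "new": 0.47, "york": 0.5},
}

def viterbi(tokens):
    """Return the most probable state sequence for tokens (computed in log space)."""
    # best[t][s] = (log-prob of the best path ending in state s at step t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][tokens[0]]), None)
             for s in states}]
    for tok in tokens[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]) + math.log(emit_p[s][tok]), p)
                for p in states
            )
            column[s] = (score, prev)
        best.append(column)
    # Backtrack from the best final state to recover the full path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

hmm_tags = viterbi(["i", "live", "in", "new", "york"])
print(hmm_tags)
```

With these toy numbers the decoder tags "new" and "york" as `LOC` and everything else as `O`, because the high `LOC` to `LOC` transition probability rewards keeping the two words in the same state.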

### Deep Learning Models:

1. **Recurrent Neural Networks (RNNs):**
    - **How they work:** Processes text word-by-word, remembering some information from previous words to help in understanding the current word.
    - **Example:** If you've seen "New" and you encounter "York", your context from previous words helps you label "York" as part of a location.

2. **Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs):**
    - **How they work:** They're specialized RNNs with mechanisms (called gates) that allow them to remember or forget information over longer sequences, mitigating the vanishing gradient problem of basic RNNs.
    - **Example:** Even if the cue that signals a location (such as "flew to") appears many words before the name itself, the network can still carry that information forward and label the name correctly.

3. **Convolutional Neural Networks (CNNs):**
    - **How they work:** These apply a series of filters on local parts of the input data. For text, it means looking at small windows of words/phrases and extracting features.
    - **Example:** Capturing local patterns like "San Francisco" as a likely location.

4. **Transformer-based Models (e.g., BERT, RoBERTa):**
    - **How they work:** Uses self-attention mechanisms to weigh the importance of different words when considering a particular word in a sequence.
    - **Example:** If the sentence mentions several companies and then says "its CEO," the model gives more weight to the company names to determine which company "its" refers to.

5. **CRF layer on top of Deep Learning models:**
    - **How they work:** Deep models (like BiLSTMs) extract features, and a CRF layer uses these features to predict the best entity labels for the sequence, taking into account the structure and dependencies in the sequence.
    - **Example:** Even if the BiLSTM, looking at "York" in "New York" on its own, scores `B-PER` highest, the CRF layer can still prefer `I-LOC`, because `B-LOC` followed by `I-LOC` is a far more plausible label transition than `B-LOC` followed by `B-PER`.
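One kind of structure a CRF layer captures is which label may follow which. A trained CRF learns soft transition scores from data; the sketch below only hard-codes the BIO rule that an `I-X` label must continue an entity of the same type, to make the idea concrete. The function names are hypothetical.

```python
def is_valid_transition(prev_label, label):
    """Return True if `label` may follow `prev_label` under the BIO scheme."""
    if label.startswith("I-"):
        entity_type = label[2:]
        # I-X must continue an entity of the same type X.
        return prev_label in (f"B-{entity_type}", f"I-{entity_type}")
    return True  # "O" and any "B-X" may follow anything

def is_valid_sequence(labels):
    """Check a whole label sequence; the sequence start behaves like 'O'."""
    prev = "O"
    for label in labels:
        if not is_valid_transition(prev, label):
            return False
        prev = label
    return True

valid = is_valid_sequence(["B-LOC", "I-LOC", "O"])
invalid = is_valid_sequence(["B-PER", "I-ORG", "O"])
print(valid, invalid)
```

A per-token classifier can emit the second sequence, since it scores each position independently; a CRF layer scores the sequence as a whole and so can rule it out.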

### Hybrid Models:

- **How they work:** These combine rule-based or traditional machine learning techniques with deep learning. For instance, initial predictions might be made using deep learning, but then refined using rule-based logic.
- **Example:** A deep learning model might tag "Baker Street" as a location, and a rule can then refine this: if a number directly precedes the phrase, as in "221 Baker Street", it is more likely a street address.

Each model's actual operation can get quite technical, involving mathematical equations and algorithms. The above explanations provide a high-level, intuitive view to help understand their basic functioning.

## End-to-end NER model development.

Creating an end-to-end NER model involves several steps, from data acquisition and preprocessing to model training and evaluation. Here's a general guide, first for a machine learning approach and then for a deep learning one:

## 1. Machine Learning Approach (using CRFs):

### Data Collection and Preprocessing:
1. **Data Collection:** Obtain labeled data where each word/token in your text is annotated with its respective entity type.
2. **Tokenization:** Split your text into words/tokens.
3. **Feature Engineering:** For NER using CRFs, you'll typically extract various features from each word, such as:
   - Word identity
   - Word suffix/prefix
   - Word shape (capitalization, digit, punctuation patterns)
   - Part-of-speech tag
   - Surrounding word information
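The feature list above can be sketched as a function that turns each token into a dictionary of features. This is a minimal sketch: the feature names are arbitrary choices, and the part-of-speech tags are assumed to be supplied already (for example by a tagger). One dictionary per token, grouped by sentence, is the input shape commonly used with `sklearn-crfsuite`.

```python
def word_shape(word):
    """Map characters to X (upper), x (lower), d (digit); keep other characters."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def token_features(sentence, i):
    """Build a feature dict for the i-th token of a list of (word, pos_tag) pairs."""
    word, pos = sentence[i]
    features = {
        "word.lower": word.lower(),   # word identity
        "prefix3": word[:3],          # prefix
        "suffix3": word[-3:],         # suffix
        "shape": word_shape(word),    # capitalization/digit/punctuation pattern
        "pos": pos,                   # part-of-speech tag
    }
    # Surrounding word information, with markers for sentence boundaries.
    if i > 0:
        features["prev.word.lower"] = sentence[i - 1][0].lower()
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next.word.lower"] = sentence[i + 1][0].lower()
    else:
        features["EOS"] = True
    return features

sentence = [("Apple", "NNP"), ("bought", "VBD"), ("U.K.", "NNP"), ("startup", "NN")]
features = [token_features(sentence, i) for i in range(len(sentence))]
print(features[2])
```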

### Model Training:
1. Use a library like `sklearn-crfsuite` or `CRF++` to train the CRF model on your features and labels.

### Evaluation:
1. Split your data into a training set and a test set.
2. Train your model on the training set and evaluate its performance on the test set using metrics like precision, recall, and F1-score.

## 2. Deep Learning Approach (using BiLSTM-CRF):

### Data Collection and Preprocessing:
1. **Data Collection:** Same as above.
2. **Tokenization:** Split your text into words/tokens.
3. **Word Embeddings:** Convert words into vectors using embeddings like Word2Vec, GloVe, or FastText.
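Before an embedding layer can look up a vector, each token has to be mapped to an integer index. A minimal sketch of that step, assuming lowercased lookups and two reserved entries (`<PAD>` for padding, `<UNK>` for unseen words); the helper names and the example sentences are hypothetical.

```python
PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(sentences):
    """Assign an integer id to every lowercased word seen in the training data."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word.lower(), len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Map words to ids, truncate to max_len, and right-pad with the PAD id."""
    ids = [vocab.get(word.lower(), vocab[UNK]) for word in sentence[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))

train_sentences = [["Apple", "is", "a", "company"], ["Paris", "is", "a", "city"]]
vocab = build_vocab(train_sentences)

# "Google" was never seen in training, so it maps to the UNK id.
encoded = encode(["Google", "is", "a", "company"], vocab, max_len=6)
print(encoded)
```

Lowercasing shrinks the vocabulary but discards capitalization, which is a strong NER signal; keeping case or adding character-level features is a common alternative.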

### Model Training:
1. Use a deep learning framework like TensorFlow or PyTorch.
2. Create a neural network architecture:
   - **Embedding Layer:** Convert words into dense vectors.
   - **BiLSTM Layer:** Capture context from both forward and backward directions in the text.
   - **CRF Layer:** Make sequence predictions taking into account the structure of the output labels.
3. Train your model using the training data.

### Evaluation:
1. Split your data as in the ML approach.
2. Evaluate using the same metrics (precision, recall, F1-score).

## Implementation Steps:

1. **Data Loading:**
   ```python
   # Example data format: list of sentences where each sentence is a list of (token, label) tuples.
   data = [[('Apple', 'B-ORG'), ('is', 'O'), ('a', 'O'), ('company', 'O')], ...]
   ```

2. **Data Preprocessing:**
   - For ML: Extract features and prepare them in the format suitable for the CRF library.
   - For DL: Tokenize words and convert them to indices; use embeddings.

3. **Model Definition:**
   - For ML: Define a CRF model and the set of possible labels.
   - For DL: Define the neural network architecture (Embedding -> BiLSTM -> CRF).

4. **Training:**
   - Train the model using the training dataset.
  
5. **Evaluation:**
   - Predict entity labels on the test set.
   - Calculate precision, recall, and F1-score.

6. **Deployment (Optional):**
   - Once satisfied with the model, you can deploy it as a service using tools like Flask for real-time NER predictions.

Remember, the quality of an NER model significantly depends on the quality and quantity of the training data. If possible, gather a diverse and comprehensive labeled dataset or consider using pre-trained models and fine-tuning them on your specific dataset.

## End-to-end NER model development 2nd Part.



---
Creating an end-to-end NER model involves several steps, including data preparation, feature extraction (more so for ML models), model selection, training, evaluation, and deployment. This part walks through a general pipeline for both traditional machine learning (ML) and deep learning techniques:

### 1. Data Preparation:

#### a. Data Collection:

- **Pre-annotated data**: Use datasets like CoNLL, ACE, or others specific to your domain.
- **Annotation tools**: If you have raw text, you can use tools like [Doccano](https://doccano.herokuapp.com/) or [BRAT](http://brat.nlplab.org/) to annotate your data.

#### b. Data Split:
Split your data into at least three sets:
- Training set: To train the model.
- Validation set: To tune hyperparameters.
- Test set: To evaluate model performance.
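A three-way split can be sketched with the standard library alone. The 80/10/10 proportions and the fixed seed are assumptions for illustration; splitting at the sentence (or document) level keeps each sentence's tokens in the same set.

```python
import random

def split_dataset(sentences, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a copy of the sentences and cut it into train/validation/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test

# Integer ids stand in for annotated sentences here.
train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))
```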

### 2. Feature Extraction (especially crucial for ML models):

#### a. Tokenization:
Convert sentences into tokens (usually words).

#### b. Lexical Features:
- Word embeddings (like Word2Vec or FastText).
- Part-of-Speech tags.
- Word shapes (capitalization, digit patterns, etc.).

#### c. Contextual Features:
Previous and next words or tags, n-grams, etc.

### 3. Model Selection:

#### ML Models:

- **Conditional Random Fields (CRFs)**:
  - Popular for NER in ML.
  - Use tools like `CRFsuite` or `sklearn-crfsuite`.

#### Deep Learning Models:

- **BiLSTM-CRF**:
  - Bi-directional Long Short-Term Memory networks combined with a CRF layer on top.
  - Frameworks: TensorFlow, PyTorch.
  
- **Transformers (like BERT, RoBERTa)**:
  - These models can be fine-tuned on NER tasks.
  - Use libraries like `HuggingFace Transformers`.

### 4. Model Training:

#### ML Models:

- Feed the extracted features and corresponding labels to the model.
- Optimize using algorithms like `L-BFGS` for CRFs.

#### Deep Learning Models:

- Prepare data loaders to feed data in batches.
- Fine-tune on your NER data if using pre-trained transformers.
- Optimize using gradient-based methods like Adam.

### 5. Model Evaluation:

- Common metrics: Precision, Recall, F1-score.
- Use tools/libraries like `seqeval` for sequence labeling evaluation.
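For sequence labeling, these metrics are usually computed over whole entities rather than individual tokens: a prediction counts only if both its span and its type match the gold annotation. The sketch below is a hand-rolled illustration of that idea for BIO labels, not `seqeval`'s implementation; the function names are hypothetical.

```python
def extract_spans(labels):
    """Return a set of (start, end, type) spans from one BIO label sequence."""
    spans, start, etype = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        inside = label.startswith("I-") and label[2:] == etype
        if start is not None and not inside:
            spans.add((start, i, etype))        # the open entity ends before i
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans

def entity_prf(gold_sequences, pred_sequences):
    """Micro-averaged entity-level precision, recall and F1 (exact span match)."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [["B-ORG", "I-ORG", "O", "B-LOC"]]
pred = [["B-ORG", "O", "O", "B-LOC"]]  # ORG span cut short, LOC span correct
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

In this example three of the four tokens are labeled correctly, yet only one of the two entities matches exactly, so all three scores come out at 0.5.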

### 6. Hyperparameter Tuning:

- For ML models: Regularization parameters, state transition features, etc.
- For deep models: Learning rate, dropout rate, number of layers/neurons, etc.
- Techniques: Grid search, random search, or Bayesian optimization.

### 7. Deployment:

- Convert your model to a format suitable for serving (e.g., ONNX).
- Deploy using tools like Flask (for a web API) or TensorFlow Serving.
- Ensure your deployment solution can handle tokenization and any other preprocessing.

### 8. Post-Deployment:

- Monitor the model for any drifts in performance.
- Periodically retrain if you have accumulating new data.

### Tips:

1. **Transfer Learning**: Especially for deep learning, leverage pre-trained models to benefit from knowledge transfer.
2. **Active Learning**: If annotating data from scratch, start with a small set, train your model, then iteratively annotate more challenging examples.
3. **Domain Adaptation**: If your target domain differs significantly from your source domain (where you got your initial training data), consider domain adaptation techniques.

Remember, building an NER system is iterative. Based on model performance and real-world feedback, you'll likely need to revisit data annotation, feature engineering, and model selection.


# 1. Deployment of Machine Learning based Models:

### 1. Rule Based NER Models:

A rule-based NER model primarily relies on pattern matching to identify named entities. The most common tools used for rule-based NER are regular expressions and dictionaries. Here's a step-by-step guide to develop a rule-based NER model and prepare it for deployment:

### 1. Data Collection & Analysis:

- Gather a representative sample of texts that you will process.
- Manually inspect the data to identify common patterns for entities. For instance, dates might follow patterns like "YYYY-MM-DD" or "Month DD, YYYY".

### 2. Rule Creation:

#### a. Dictionary-Based:

- Compile lists of named entities that you want to recognize. For instance, a list of countries, major cities, product names, etc.
- Store these lists in dictionaries.

#### b. Regular Expressions:

- Write regular expressions to capture common patterns.
- Example patterns:
  - Dates: `\d{4}-\d{2}-\d{2}`
  - Email: `[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}`
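The patterns above can be tried out with Python's `re` module. This sketch also adds a pattern for the "Month DD, YYYY" style; that pattern, the English-only month list, and the sample sentence are illustrative assumptions.

```python
import re

# ISO-style dates such as 2023-10-12.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Long-form dates such as "October 15, 2023" (English month names only).
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
LONG_DATE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")

# Email pattern from the list above; {2,4} misses longer top-level domains.
EMAIL = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}")

sample = "Shipped 2023-10-12, invoiced October 15, 2023; ask billing@example.com."
iso_dates = ISO_DATE.findall(sample)
long_dates = LONG_DATE.findall(sample)
emails = EMAIL.findall(sample)
print(iso_dates, long_dates, emails)
```

Note that these patterns check only the shape of a string: `\d{4}-\d{2}-\d{2}` also matches impossible dates such as `2023-99-99`, so a validation step may be needed afterwards.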

### 3. Rule Testing:

- Test your rules on the sample data.
- Refine rules based on false positives and false negatives.

### 4. Software Development:

#### a. Preprocessing:

- Tokenize the input text.
- Convert text to lowercase (unless case is important for your rules).
- Other preprocessing steps like removing stop words or punctuation, if necessary.

#### b. Entity Recognition:

- Apply dictionary lookups: For each token or sequence of tokens, check if it's in your entity dictionaries.
- Apply regular expression matches.
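Dictionary lookup over token sequences can be sketched as a greedy longest-match scan: at each position, try the longest candidate phrase first and fall back to shorter ones. The function name, the `max_len` cap, and the gazetteer entries are made up for illustration, and the text is split on whitespace only.

```python
def gazetteer_lookup(tokens, gazetteer, max_len=3):
    """Greedy longest-match lookup of token n-grams in a phrase -> label dict."""
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at token i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                entities.append((i, i + n, phrase, gazetteer[phrase]))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no entry starts at this token
    return entities

# "Francisco" alone is listed as a person name to show why longest match matters.
gazetteer = {"San Francisco": "LOC", "Francisco": "PER", "Apple Inc.": "ORG"}
tokens = "Apple Inc. opens a factory in San Francisco".split()
found = gazetteer_lookup(tokens, gazetteer)
print(found)
```

Because "San Francisco" is matched as a two-token phrase and consumed, the shorter entry "Francisco" never fires on its own.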

#### c. Post-processing:

- Resolve entity overlaps: If two rules identify overlapping entities, you'll need logic to decide which one to prioritize or how to handle the overlap.
- Entity classification: If not already classified by your rules, classify entities into predefined categories (e.g., PERSON, LOCATION, DATE).
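One simple policy for overlapping matches is to keep the longest span and drop anything that collides with it. The sketch below assumes matches are `(start, end, label)` tuples with end-exclusive character offsets; both the policy and the sample offsets are illustrative choices, and a real system might prioritize by entity type instead.

```python
def resolve_overlaps(matches):
    """Keep non-overlapping (start, end, label) spans, preferring longer ones."""
    # Longest first; ties broken by earliest start.
    ranked = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    for start, end, label in ranked:
        # Accept the span only if it is disjoint from every span kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Two pairs of colliding matches, e.g. "Apple" inside "Apple Inc."
matches = [(0, 5, "ORG"), (0, 10, "ORG"), (48, 51, "PER"), (48, 61, "LOC")]
resolved = resolve_overlaps(matches)
print(resolved)
```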




### 4. Evaluation:

#### a. Metrics:

- **Precision**: Proportion of entities identified by the model that are correct. High precision means few false positives.
  
- **Recall**: Proportion of actual entities in the data that the model correctly identified. High recall means few false negatives.

- **F1-Score**: Harmonic mean of precision and recall. It provides a single score that balances the trade-off between precision and recall.
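Written as formulas, with $TP$, $FP$, and $FN$ denoting the counts of true positives, false positives, and false negatives over predicted entities:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

As a worked example with made-up counts $TP = 8$, $FP = 2$, $FN = 4$: precision is $8/10 = 0.80$, recall is $8/12 \approx 0.67$, and $F_1 = 16/22 \approx 0.73$.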

#### b. Compute Metrics:

- Run your NER system on the annotated data.
- Compare the system's output to the manual annotations using the metrics.

#### c. Iterative Refinement:

- Based on the evaluation results, adjust your rules to improve performance.
- Re-evaluate after making changes.


### 5. Deployment Preparation:

#### a. Wrap the Model:

- Create a function or class that takes raw text as input and outputs the identified named entities and their types.
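A wrapper of this kind might look like the following sketch. The class name `RuleBasedNER`, its `extract` method, and the output dictionary keys are hypothetical choices; overlap handling is left out to keep the example short.

```python
import re

class RuleBasedNER:
    """Hypothetical wrapper combining regex patterns with a phrase dictionary."""

    def __init__(self, patterns, gazetteer):
        # patterns: label -> regex string; gazetteer: phrase -> label
        self.patterns = {label: re.compile(rx) for label, rx in patterns.items()}
        self.gazetteer = gazetteer

    def extract(self, text):
        """Return a list of dicts with text, label, start and end offsets."""
        entities = []
        # Dictionary lookups: exact, case-sensitive phrase matches.
        for phrase, label in self.gazetteer.items():
            for m in re.finditer(re.escape(phrase), text):
                entities.append({"text": m.group(), "label": label,
                                 "start": m.start(), "end": m.end()})
        # Regular expression matches.
        for label, pattern in self.patterns.items():
            for m in pattern.finditer(text):
                entities.append({"text": m.group(), "label": label,
                                 "start": m.start(), "end": m.end()})
        return sorted(entities, key=lambda e: e["start"])

ner = RuleBasedNER(
    patterns={"DATE": r"\d{4}-\d{2}-\d{2}"},
    gazetteer={"San Francisco": "LOC"},
)
entities = ner.extract("A factory opens in San Francisco on 2023-10-12.")
print(entities)
```

Returning character offsets alongside the text lets callers highlight entities in the original string, and a list of plain dictionaries serializes directly to JSON for a web API.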

#### b. Create an API:

- Use frameworks like Flask or FastAPI to expose your NER model as a web service.
- Your API should:
  - Receive raw text as input.
  - Return the recognized entities and their types.

#### c. Containerization (optional but recommended):

- Use Docker to containerize your API. This ensures it can run consistently across different environments.
- Create a `Dockerfile` that defines the environment, installs required libraries, and runs your API.

### 6. Deployment:

- Deploy your containerized application on platforms like AWS ECS, Google Cloud Run, or any server where Docker can be run.

### 7. Monitoring & Updates:

- Once deployed, continuously monitor the performance.
- Collect feedback, especially false positives and negatives.
- Periodically update rules to handle new patterns and improve accuracy.

### Tips:

1. **Debugging**: Use tools like [regex101](https://regex101.com/) to test and debug your regular expressions.
2. **Scalability**: If expecting high traffic, consider scaling solutions or caching results for frequent queries.
3. **Iterative Development**: Rule-based models often require iterative development. As new edge cases emerge, you might need to add or adjust rules.

This guide provides a general approach for rule-based NER. Depending on specific needs and data complexities, adjustments might be required.

Building an end-to-end rule-based NER system as described takes a combination of libraries and some manual effort, and an out-of-the-box solution with a dataset that matches your exact requirements may be hard to come by. The simplified example below uses Python and the `re` library, with guidance on using a widely available NER dataset.

### Dataset:

One of the most popular datasets for NER is the CoNLL-2003 dataset. Rule-based methods need no training data, but an annotated dataset like this is still useful as a gold standard: you can use a small portion of it to test your rules.

You can find the CoNLL-2003 dataset [here](https://www.clips.uantwerpen.be/conll2003/ner/).



```python
import re

# Sample text and its annotations
text = """Apple Inc. is planning to open a new factory in San Francisco on 2023-10-12.
Contact john.doe@email.com for more information."""

# Gold-standard annotations (assumed correct): entity text -> label
gold_standard = {
    'Apple Inc.': 'ORG',
    'San Francisco': 'LOC',
    '2023-10-12': 'DATE',
    'john.doe@email.com': 'EMAIL'
}

# Dictionary of known entities, grouped by label
entity_dict = {
    'ORG': ['Apple Inc.'],
    'LOC': ['San Francisco'],
}

# Regular expressions for patterns
date_pattern = r'\d{4}-\d{2}-\d{2}'
email_pattern = r'[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}'

predictions = {}

# Extract entities based on dictionary (plain substring match)
for entity_type, values in entity_dict.items():
    for value in values:
        if value in text:
            predictions[value] = entity_type

# Extract entities based on regex patterns
for match in re.finditer(date_pattern, text):
    predictions[match.group()] = 'DATE'

for match in re.finditer(email_pattern, text):
    predictions[match.group()] = 'EMAIL'

# Evaluation metrics: a prediction is correct if both text and label match the gold standard
true_positives = sum(1 for entity, label in predictions.items()
                     if gold_standard.get(entity) == label)
false_positives = len(predictions) - true_positives
false_negatives = len(gold_standard) - true_positives

# Guard against division by zero when nothing is predicted or annotated
precision = true_positives / len(predictions) if predictions else 0.0
recall = true_positives / len(gold_standard) if gold_standard else 0.0
f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) != 0 else 0.0

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1_score:.2f}")
```

Output:

```
Precision: 1.00
Recall: 1.00
F1-Score: 1.00
```




This code is a basic example. In real-world scenarios, you'd tokenize the text, handle overlapping entities, and apply many more rules and dictionaries.

### Tips for Improvement:

1. For tokenization and text preprocessing, consider using libraries like `spaCy` or `NLTK`.
2. For rule creation, if you have a labeled dataset, inspect the data to identify common patterns, then create rules accordingly.
3. To handle overlaps, post-process your matches to prioritize longer matches or specific entity types.
4. Extend your dictionaries and patterns based on the data you have or the domain you're focusing on.

Remember, rule-based NER is highly domain-dependent. The quality of your system will largely depend on the comprehensiveness and precision of your rules and dictionaries.

To evaluate your rule-based NER system, you'll need to compare its predictions against a gold standard, i.e., a set of manually annotated data.

The evaluation block at the end of the code above does this: it computes precision, recall, and F1-score by comparing the entities extracted by the rule-based system against the gold annotations.

Remember, this example demonstrates evaluation on a single sample text. In a real-world scenario:

1. You'd apply the rules to a larger dataset.
2. The `gold_standard` would come from the actual annotated entities in that dataset.
3. You'd compute precision, recall, and F1-score over all the samples in your dataset.