# **Named Entity Recognition (NER) for News Headlines**

## **1. Introduction**
   - **Project Overview:** This project focuses on developing a Named Entity Recognition (NER) system to identify and classify entities such as persons, organizations, and locations within news headlines. NER is a key task in Natural Language Processing (NLP), particularly useful in extracting meaningful information from unstructured text data.
   ---
   - **Objectives:**
     - Implement an NER model using the CoNLL-2003 dataset.
     - Preprocess the data to prepare it for model training.
     - Train and fine-tune a model using spaCy’s small English model.
     - Evaluate the model using standard NLP metrics.
     - Apply the model to new headlines for practical insights.

## **2. Dataset Description**
   - **Dataset:** The CoNLL-2003 dataset is a well-known benchmark dataset in NLP, specifically designed for NER tasks. It includes annotated text data with entities categorized as:
     - `PER` (Person)
     - `ORG` (Organization)
     - `LOC` (Location)
     - `MISC` (Miscellaneous)
---
   - **Data Structure:**
     - The dataset is organized into sentences, where each word is tagged with its corresponding entity type or labeled as `O` if it does not belong to any named entity.
     - The dataset is split into training, validation, and test sets to facilitate model development and evaluation.
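
To make the tagging scheme concrete, the sketch below groups tagged tokens back into entities. The sentence is made up for illustration, and `iob_to_entities` is a hypothetical helper, not part of the dataset or of any library; it assumes tags carry `B-` (beginning) and `I-` (inside) prefixes.

```python
# Illustrative sentence: one tag per token, 'O' for tokens outside any entity.
tokens = ["Barack", "Obama", "visited", "Berlin", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]

def iob_to_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A 'B' tag, or an 'I' tag of a different type, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if current_words and (prefix == "O" or starts_new):
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):
            current_words.append(token)
            current_type = entity_type
    if current_words:  # flush an entity that runs to the end of the sentence
        entities.append((" ".join(current_words), current_type))
    return entities

print(iob_to_entities(tokens, tags))
```

Each multi-token name ends up as a single entity, which is the view a reader of the headlines usually cares about.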


## **3. Data Preprocessing and Exploration**
---
   - **Data Loading:** Load the CoNLL-2003 dataset with the Hugging Face `datasets` library so that it is ready for processing.
   - **Data Cleaning:**
     - Convert the raw data into a structured format suitable for model training.
     - Tokenize sentences and align tokens with their respective entity tags.
     - Handle missing data, punctuation, and any inconsistencies.
   - **Exploratory Data Analysis (EDA):**
     - Analyze the distribution of entity types in the dataset.
     - Visualize the frequency of entities to understand the dataset's characteristics and potential challenges.



### **3.1 Library Installation and Importing**

- This section installs and imports the required libraries. The `datasets` library loads the CoNLL-2003 dataset, which contains annotated data for Named Entity Recognition (NER), and `spacy` handles the NLP tasks, including model training and evaluation.

In [None]:
# Install necessary libraries
!pip install datasets

# Import necessary libraries for the project
from datasets import load_dataset  # For loading the CoNLL-2003 dataset
from collections import Counter  # For counting elements
from spacy.training import Example  # For creating training examples in spaCy
from sklearn.metrics import classification_report  # For evaluating the model's performance
import spacy  # For NLP and NER tasks
import random  # For shuffling data during training
import pandas as pd  # For data manipulation and analysis
import matplotlib.pyplot as plt  # For creating visualizations
import seaborn as sns  # For enhancing visualizations
import numpy as np  # For numerical operations


### **3.2 Loading the CoNLL-2003 Dataset**

- Here, the CoNLL-2003 dataset is loaded and its predefined training and test splits are selected. This dataset contains labeled data used for training and evaluating the NER model.

In [None]:
# Load the CoNLL-2003 dataset, focusing on English
dataset = load_dataset("conll2003")

# Select the predefined training and test splits
train_data = dataset["train"]
test_data = dataset["test"]

# Display the overall structure of the dataset
print(dataset)

### **3.3 Dataset Exploration**
- This section inspects the dataset's features and structure, which clarifies the data format and the available labels (NER tags).

In [None]:
# Display the structure of the dataset and inspect features
print(dataset["train"].features["ner_tags"])
print(dataset['train'].description)
print(dataset.keys())
print(dataset.shape)

In [None]:
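# Visualize the distribution of NER labels
# Map the integer tag IDs to label names (e.g. 'B-PER') so the x-axis is readable
label_names = train_data.features["ner_tags"].feature.names
entity_types = [label_names[tag] for sublist in train_data['ner_tags'] for tag in sublist]
entity_counts = Counter(entity_types)

plt.figure(figsize=(10, 6))
plt.bar(list(entity_counts.keys()), list(entity_counts.values()))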
plt.title('Distribution of Named Entity Types in the Training Set')
plt.xlabel('Entity Type')
plt.ylabel('Frequency')
plt.show()


## **4. Model Development**
---
   - **Model Selection:**
     - SpaCy's small English model is chosen for its balance between speed and accuracy.
     - Discuss why this model is suitable for NER tasks, especially in the context of news headlines.
   - **Training Process:**
     - Fine-tune the pre-trained spaCy model using the CoNLL-2003 dataset.
     - Optimize hyperparameters such as learning rate, batch size, and the number of epochs to improve performance.
     - Monitor training metrics to ensure the model is learning effectively.
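
The epoch and batch-size settings mentioned above can be sketched as a plain training-loop skeleton. This is a minimal illustration in pure Python, not the notebook's training code: `minibatches` is a hypothetical helper, the data are placeholders, and the values of `n_epochs` and `batch_size` are arbitrary rather than tuned. spaCy ships its own batching helper (`spacy.util.minibatch`) that could take the place of the hand-written one.

```python
import random

def minibatches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder data standing in for (text, annotations) training pairs.
examples = [f"sentence {i}" for i in range(10)]

n_epochs, batch_size = 3, 4   # illustrative values, not tuned
rng = random.Random(0)        # fixed seed so runs are repeatable
for epoch in range(n_epochs):
    rng.shuffle(examples)     # new example order in every epoch
    batch_sizes = [len(batch) for batch in minibatches(examples, batch_size)]
    # A real loop would update the model once per batch here.
    print(f"Epoch {epoch + 1}: batch sizes {batch_sizes}")
```

Larger batches mean fewer, smoother updates per epoch; a batch size of one corresponds to updating the model after every single sentence.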

### **4.1 NER Tag Conversion**

- The numerical NER tags in the dataset are converted to their corresponding text labels for readability.

In [None]:
# Convert NER label numeric values to text labels
ner_feature = dataset["train"].features["ner_tags"].feature
label_converter = ner_feature.int2str
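
As a rough picture of what this conversion does, the sketch below performs the same lookup with a plain list. The label order is an assumption about the Hugging Face version of CoNLL-2003 and should be checked against `ner_feature.names`; `LABEL_NAMES` and `ids_to_labels` are hypothetical names used only for illustration.

```python
# Assumed label order of the Hugging Face "conll2003" dataset; verify it against
# dataset["train"].features["ner_tags"].feature.names before relying on it.
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def ids_to_labels(tag_ids):
    """Map integer tag IDs to label strings, mirroring what int2str does per tag."""
    return [LABEL_NAMES[tag_id] for tag_id in tag_ids]

print(ids_to_labels([3, 0, 7, 0]))
```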


### **4.2 Data Transformation for Model Training**

- This function converts the dataset into a format that is compatible with the spaCy training process. It prepares the data by converting each sentence into a list of tuples, where each tuple contains the token, its POS tag, chunk tag, and the corresponding NER label.

In [None]:
# Function to transform the data into a format suitable for training the model
def transform_data_for_training(data):
    transformed_data = []
    for sentence in data:
        sentence_formatted = []
        for token, pos_tag, chunk_tag, ner_tag in zip(sentence["tokens"], sentence["pos_tags"], sentence["chunk_tags"], sentence["ner_tags"]):
            sentence_formatted.append((token, pos_tag, chunk_tag, label_converter(ner_tag)))
        transformed_data.append(sentence_formatted)
    return transformed_data

# Transform the training and test datasets
formatted_train_data = transform_data_for_training(train_data)
formatted_test_data = transform_data_for_training(test_data)

# Print one sample sentence from each formatted split
print(formatted_train_data[0])
print(formatted_test_data[0])


In [None]:
# Visualizing sentence lengths after transformation
def visualize_sentence_lengths(formatted_data, data_type="Train"):
    sentence_lengths = [len(sentence) for sentence in formatted_data]

    plt.figure(figsize=(10, 6))
    plt.hist(sentence_lengths, bins=30, color='lightcoral', edgecolor='black')
    plt.title(f'Distribution of Sentence Lengths in {data_type} Data After Transformation')
    plt.xlabel('Number of Tokens')
    plt.ylabel('Frequency')
    plt.show()

# Visualize sentence lengths for training data
visualize_sentence_lengths(formatted_train_data, data_type="Train")

# Visualize sentence lengths for test data
visualize_sentence_lengths(formatted_test_data, data_type="Test")

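### **4.3 Converting Data to spaCy Format**

- This function converts the data into the `(text, {"entities": [...]})` format that spaCy uses for NER training. Tokens are joined with single spaces, and every token whose label is not `O` is recorded as a `(start, end, label)` character span. Note that each labeled token becomes its own span and keeps its IOB prefix (for example `B-PER`), so a multi-token name is stored as several adjacent spans rather than as one.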

In [None]:
# Convert the formatted data to a format compatible with spaCy for NER training
def convert_to_spacy_format(data):
    spacy_data = []
    for sentence in data:
        words = [token for token, pos_tag, chunk_tag, label in sentence]
        entities = []
        start = 0
        for token, pos_tag, chunk_tag, label in sentence:
            end = start + len(token)
            if label != 'O':  # 'O' indicates non-entity tokens
                entities.append((start, end, label))
            start = end + 1  # Move to the next token
        spacy_data.append((' '.join(words), {"entities": entities}))
    return spacy_data

# Convert the training and test data
train_data_spacy = convert_to_spacy_format(formatted_train_data)
test_data_spacy = convert_to_spacy_format(formatted_test_data)

# Print a sample of the data in spaCy format
print(train_data_spacy[0])
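
A common alternative is to merge each run of `B-`/`I-` tokens into one character span and drop the prefix, so that an entity such as a full name is a single span labeled `PER`. The sketch below is one possible way to do that, not the approach used in this notebook; `merge_iob_to_spans` is a hypothetical helper, and the POS and chunk slots in the sample are filled with `None` because they are ignored here.

```python
def merge_iob_to_spans(sentence):
    """Return (text, {"entities": [(start, end, label), ...]}) with one span per entity."""
    words = [token for token, _pos, _chunk, _label in sentence]
    entities = []
    offset = 0
    open_span = None  # [start, end, entity_type] of the entity being built
    for token, _pos, _chunk, label in sentence:
        start, end = offset, offset + len(token)
        prefix, _, entity_type = label.partition("-")
        continues = (prefix == "I" and open_span is not None
                     and open_span[2] == entity_type)
        if continues:
            open_span[1] = end  # extend the current entity over this token
        else:
            if open_span is not None:
                entities.append(tuple(open_span))
                open_span = None
            if prefix in ("B", "I"):
                open_span = [start, end, entity_type]
        offset = end + 1  # account for the joining space
    if open_span is not None:
        entities.append(tuple(open_span))
    return " ".join(words), {"entities": entities}

sample = [("Barack", None, None, "B-PER"), ("Obama", None, None, "I-PER"),
          ("visited", None, None, "O"), ("Berlin", None, None, "B-LOC")]
text, annotations = merge_iob_to_spans(sample)
print(text, annotations)
```

With this format the model would predict labels such as `PER` instead of `B-PER`, so any evaluation that compares against prefixed labels would need to be adapted accordingly.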

### **4.4 Model Training**

- This section loads spaCy's small English model (`en_core_web_sm`) and continues training it on the processed dataset. Only the `ner` component stays enabled; all other pipeline components are disabled. The model is trained for three epochs, one example at a time, and the accumulated losses are printed at the end of each epoch to monitor training.

In [None]:
# Load the small English model from spaCy
nlp = spacy.load("en_core_web_sm")

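# Disable all pipes except the 'ner' pipe
pipes_to_disable = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*pipes_to_disable)

# Register the CoNLL-2003 labels with the NER component, since the
# pre-trained model was trained with a different label set
ner = nlp.get_pipe("ner")
labels = {label for _, annotations in train_data_spacy for _, _, label in annotations["entities"]}
for label in sorted(labels):
    ner.add_label(label)

# Set up the optimizer for training
optimizer = nlp.create_optimizer()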

# Training process
for epoch in range(3):  # Train for a specific number of epochs (iterations)
    random.shuffle(train_data_spacy)  # Shuffle training data at the beginning of each epoch
    epoch_losses = {}  # Dictionary to keep track of losses for the current epoch

    # Update the model for each training example
    for sentence, labels in train_data_spacy:
        doc = nlp.make_doc(sentence)  # Create a Doc object from the sentence
        example = Example.from_dict(doc, labels)  # Convert the sentence and labels to an Example
        nlp.update([example], drop=0.5, sgd=optimizer, losses=epoch_losses)  # Update the model with the Example

    # Print losses at the end of each epoch
    print(f"Epoch {epoch + 1} - Losses: {epoch_losses}")

## **5. Model Evaluation**
---
   - **Evaluation Metrics:**
     - **Precision:** The proportion of correctly identified entities out of all entities identified by the model.
     - **Recall:** The proportion of correctly identified entities out of all actual entities in the dataset.
     - **F1-Score:** The harmonic mean of precision and recall, providing a balanced evaluation metric.
   - **Confusion Matrix:**
     - A confusion matrix is used to visualize the model’s performance across different entity types.
     - Highlight any misclassifications and discuss potential reasons.
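
The metrics and the confusion matrix described above can be worked through by hand on a tiny example. The sketch below is a pure-Python illustration with made-up labels; `per_label_metrics` is a hypothetical helper, and the scores are token-level, like the `classification_report` used later, rather than entity-level.

```python
from collections import Counter

def per_label_metrics(true_labels, predicted_labels):
    """Return ({label: (precision, recall, f1)}, confusion) for two aligned label lists."""
    confusion = Counter(zip(true_labels, predicted_labels))  # (true, predicted) -> count
    labels = sorted(set(true_labels) | set(predicted_labels))
    metrics = {}
    for label in labels:
        tp = confusion[(label, label)]
        predicted_total = sum(n for (_, pred), n in confusion.items() if pred == label)
        true_total = sum(n for (true, _), n in confusion.items() if true == label)
        precision = tp / predicted_total if predicted_total else 0.0
        recall = tp / true_total if true_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics, confusion

# Made-up labels for six tokens.
true = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
pred = ["B-PER", "O", "O", "B-ORG", "O", "B-LOC"]

metrics, confusion = per_label_metrics(true, pred)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The off-diagonal entries of `confusion`, such as `("B-LOC", "B-ORG")`, are the misclassifications that a confusion-matrix plot would highlight.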



### **5.1 Model Evaluation**

- This function evaluates the trained NER model by comparing predicted and true labels token by token on the test dataset. Precision, recall, and F1-score are printed for each label, providing insight into the model's performance.


In [None]:
def ner_model_evaluation(nlp_model, dataset):
    real_labels = []
    predicted_labels = []

    for sentence in dataset:
        # Split the data and extract real labels
        tokens = [word for word, pos_tag, chunk_tag, ner_label in sentence]
        real_labels.extend([ner_label for word, pos_tag, chunk_tag, ner_label in sentence])

        # Process the text with the model
        processed_doc = nlp_model(' '.join(tokens))

        # Collect predicted labels
        for token in processed_doc:
            if token.ent_iob_ == 'O':
                predicted_labels.append('O')
            else:
                predicted_labels.append(token.ent_type_)

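        # spaCy may tokenize the joined text differently from CoNLL-2003, so pad or
        # truncate the predictions to match; labels in such sentences can be misaligned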
        if len(predicted_labels) < len(real_labels):
            predicted_labels.extend(['O'] * (len(real_labels) - len(predicted_labels)))
        elif len(predicted_labels) > len(real_labels):
            predicted_labels = predicted_labels[:len(real_labels)]

    # Evaluate performance with metrics and print
    report = classification_report(real_labels, predicted_labels)
    print(report)

# Evaluate using test data
ner_model_evaluation(nlp, formatted_test_data)


### **5.2 Final Model Evaluation**

- This function provides a more detailed evaluation of the NER model: it prints precision, recall, F1-score, and support for each label, reports accuracy along with macro and weighted averages, and plots each metric per label.


In [None]:
def plot_metrics(report):
    metrics = ['precision', 'recall', 'f1-score']
    labels = [label for label in report.keys() if label not in ['accuracy', 'macro avg', 'weighted avg']] # Exclude the 'accuracy', 'macro avg' and 'weighted avg' keys

    for metric in metrics:
        values = [report[label][metric] if label in report else 0 for label in labels]
        plt.figure(figsize=(12, 6))
        sns.barplot(x=labels, y=values, palette='viridis')
        plt.xticks(rotation=45)
        plt.title(f'{metric.capitalize()} for Each Entity Type')
        plt.xlabel('Entity Type')
        plt.ylabel(metric.capitalize())
        plt.show()

def detailed_ner_model_evaluation(nlp_model, dataset):
    real_labels = []
    predicted_labels = []
    for sentence in dataset:
        tokens = [word for word, pos_tag, chunk_tag, ner_label in sentence]
        real_labels.extend([ner_label for word, pos_tag, chunk_tag, ner_label in sentence])
        processed_doc = nlp_model(' '.join(tokens))
        for token in processed_doc:
            if token.ent_iob_ == 'O':
                predicted_labels.append('O')
            else:
                predicted_labels.append(token.ent_type_)
        if len(predicted_labels) < len(real_labels):
            predicted_labels.extend(['O'] * (len(real_labels) - len(predicted_labels)))
        elif len(predicted_labels) > len(real_labels):
            predicted_labels = predicted_labels[:len(real_labels)]
    report = classification_report(real_labels, predicted_labels, output_dict=True)
    classes = ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']
    for label in classes:
        if label in report:
            print(f"{label}:\n"
                  f"  Precision: {report[label]['precision']:.2f}\n"
                  f"  Recall: {report[label]['recall']:.2f}\n"
                  f"  F1-Score: {report[label]['f1-score']:.2f}\n"
                  f"  Support: {report[label]['support']}\n")
    print(f"Accuracy: {report['accuracy']:.2f}")
    print(f"Macro Avg - Precision: {report['macro avg']['precision']:.2f}")
    print(f"Macro Avg - Recall: {report['macro avg']['recall']:.2f}")
    print(f"Macro Avg - F1-Score: {report['macro avg']['f1-score']:.2f}")
    print(f"Weighted Avg - Precision: {report['weighted avg']['precision']:.2f}")
    print(f"Weighted Avg - Recall: {report['weighted avg']['recall']:.2f}")
    print(f"Weighted Avg - F1-Score: {report['weighted avg']['f1-score']:.2f}")

    # Plot precision, recall, and F1-score for each label from the report
    plot_metrics(report)

# Evaluate the trained model on the test data with detailed metrics
detailed_ner_model_evaluation(nlp, formatted_test_data)

## **6. Named Entity Recognition on New Headlines**
---
   - **Implementation:**
     - Develop a function to apply the trained NER model to new, unseen headlines.
     - Ensure the function is user-friendly and can handle various input formats.

   - **Example Predictions:**
     - Demonstrate the model's predictions on a set of new headlines.
     - Analyze the results to showcase the model's ability to generalize to real-world data.





In [None]:
def perform_ner_on_headlines(headlines):
    """
    Perform Named Entity Recognition (NER) on a list of news headlines.

    Parameters:
    ----------
    headlines : list of str
        A list of news headlines to analyze.

    Returns:
    -------
    list of dict
        A list of dictionaries where each dictionary contains:
        - 'headline': The original headline (str).
        - 'entities': A list of tuples with identified entities and their labels (list of tuples).
    """

    # Initialize an empty list to store the results
    results = []

    # Process each headline
    for headline in headlines:
        # Use the spaCy model to process the headline
        doc = nlp(headline)

        # Extract entities and their labels from the processed document
        entities = [(ent.text, ent.label_) for ent in doc.ents]

        # Create a dictionary for the headline and its identified entities
        result = {
            'headline': headline,
            'entities': entities
        }

        # Append the result to the results list
        results.append(result)

    # Return the list of results
    return results

# Example usage:
headlines = [
    "Apple is looking at buying U.K. startup for $1 billion.",
    "Barack Obama will be visiting Germany next week.",
    "Amazon opens a new office in San Francisco."
]

# Perform NER on the provided headlines
ner_results = perform_ner_on_headlines(headlines)

# Print the results
for result in ner_results:
    print(f"Headline: {result['headline']}")
    print("Entities:")
    for entity in result['entities']:
        print(f"  - {entity[0]}: {entity[1]}")
    print()  # Add a blank line for readability
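
To cover the "various input formats" mentioned above, a small normalization step could run before the model is applied. The sketch below is one possible approach, not part of the original function: `normalize_headlines` is a hypothetical helper that accepts either a single string (one headline per line) or an iterable of strings, collapses extra whitespace, and drops empty entries.

```python
def normalize_headlines(headlines):
    """Return a clean list of headline strings from a str or an iterable of str."""
    if isinstance(headlines, str):
        headlines = headlines.splitlines()  # treat each line as one headline
    cleaned = []
    for headline in headlines:
        if not isinstance(headline, str):
            raise TypeError(f"Expected str, got {type(headline).__name__}")
        headline = " ".join(headline.split())  # collapse runs of whitespace
        if headline:  # skip blank entries
            cleaned.append(headline)
    return cleaned

raw_input = "  Amazon opens   a new office \n\nApple buys startup"
print(normalize_headlines(raw_input))
```

The cleaned list can then be passed on, for example as `perform_ner_on_headlines(normalize_headlines(raw_input))`.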



## **7. Conclusion**
- This project walked through the development of an NER system for news headlines: loading and exploring the CoNLL-2003 dataset, converting it to spaCy's training format, fine-tuning spaCy's pre-trained small English model, evaluating it with per-label precision, recall, and F1-score on the test set, and applying it to new headlines. Moving forward, continued refinement of the training setup and exploration of broader applications can further improve the model's capabilities.
