# Project 1: Named Entity Recognition (NER) for News Headlines

### **Objective**: Implement a Named Entity Recognition system to identify and classify named entities in news headlines.

**Tasks:**

* Use the CoNLL-2003 dataset (English subset)
* Implement data preprocessing and exploration
* Train a simple NER model using spaCy's small English model
* Evaluate the model's performance using precision, recall, and F1-score
* Create a function to perform NER on new headlines

## Dataset Description

**Dataset:** The CoNLL-2003 dataset is a well-known benchmark dataset in NLP, specifically designed for NER tasks. It includes annotated text data with entities categorized as:

- `PER` (Person)
- `ORG` (Organization)
- `LOC` (Location)
- `MISC` (Miscellaneous)

**Data Structure:**

- The dataset is organized into sentences, where each word is tagged with its corresponding entity type or labeled as `O` if it does not belong to any named entity.
- The dataset is split into training, validation, and test sets to facilitate model development and evaluation.

# Use the CoNLL-2003 dataset (English subset)

The cell below installs the Hugging Face `datasets` package and imports the libraries used throughout the project: `datasets` to load the corpus, spaCy to train the NER model, and scikit-learn to evaluate its performance.









In [None]:
# Install the Hugging Face datasets library
!pip install datasets
# Import libraries
from datasets import load_dataset  # Loads datasets from the Hugging Face Hub
import spacy
from spacy.training import Example  # Pairs a Doc with its gold annotations for training in spaCy
import random
from sklearn.metrics import classification_report



In [None]:
# Load the CoNLL-2003 dataset
dataset = load_dataset("conll2003")

train_data = dataset["train"]
test_data = dataset["test"]



The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



# Implement data preprocessing and exploration

Exploring the Dataset

In [None]:
# Display the structure of the dataset
print(dataset)

# Display a sample from the training data
print(train_data[0])
print(train_data[2])

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})
{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
{'id': '2', 'tokens': ['BRUSSELS', '1996-08-22'], 'pos_tags': [22, 11], 'chunk_tags': [11, 12], 'ner_tags': [5, 0]}
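
The integer `ner_tags` in the samples above are indices into the dataset's label list. The following is a minimal sketch of how they decode, assuming the label order of the Hugging Face `conll2003` dataset; the list is hardcoded here (as the hypothetical constant `NER_LABELS`) so the snippet runs without downloading anything. In the notebook itself, the same mapping comes from `dataset["train"].features["ner_tags"].feature`.

```python
# Label order assumed to match the `ner_tags` ClassLabel of the Hugging Face
# `conll2003` dataset; hardcoded for illustration only.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def decode_tags(tokens, tag_ids, labels=NER_LABELS):
    """Pair each token with the string label for its integer tag."""
    return [(token, labels[tag_id]) for token, tag_id in zip(tokens, tag_ids)]

# The first training sentence shown above
sample_tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
sample_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
decoded = decode_tags(sample_tokens, sample_tags)
print(decoded)
```

The `B-` prefix marks the first token of an entity and `I-` marks a continuation, so a multi-word entity appears as one `B-` tag followed by one or more `I-` tags of the same type.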


### Implement data preprocessing






In [None]:
# Map the integer NER tags found in the dataset to their string labels,
# such as "B-PER" for the first token of a person entity.
ner_labels = dataset["train"].features["ner_tags"].feature
id2label = ner_labels.int2str

# Convert each split into a list of sentences, where each sentence is a list
# of (word, POS tag id, chunk tag id, NER label) tuples.
def preprocess_data(data):
    processed_data = []
    for sentence in data:
        processed_sentence = []
        # Walk the parallel token and tag lists one token at a time
        for word, postag, chunk, ner in zip(sentence["tokens"], sentence["pos_tags"], sentence["chunk_tags"], sentence["ner_tags"]):
            processed_sentence.append((word, postag, chunk, id2label(ner)))
        # Append the finished sentence to the output list
        processed_data.append(processed_sentence)
    return processed_data

train_data = preprocess_data(train_data)
test_data = preprocess_data(test_data)
# Print one processed sentence per split; printing every sentence floods the output
print(train_data[0])
print(test_data[0])



### Prepare Data for spaCy





In [None]:

# Convert the preprocessed sentences into spaCy's training format: a list of
# (text, {"entities": [(start_char, end_char, label), ...]}) tuples.
# Every non-"O" token becomes its own entity span and keeps its B-/I- prefix.

def convert_to_spacy_format(data):
    spacy_data = []
    for example in data:
        words = [token for token, postag, chunk, label in example]
        entities = []
        start = 0
        # Iterate over the words and their NER labels, tracking character offsets
        for word, _postag, _chunk, label in example:
            if label != 'O':
                entity = (start, start + len(word), label)
                entities.append(entity)
            start += len(word) + 1
        spacy_data.append((' '.join(words), {"entities": entities}))
    return spacy_data

train_data_spacy = convert_to_spacy_format(train_data)
test_data_spacy = convert_to_spacy_format(test_data)

# Display a sample from the training data
print(train_data_spacy[0])
print(train_data_spacy[3])

('EU rejects German call to boycott British lamb .', {'entities': [(0, 2, 'B-ORG'), (11, 17, 'B-MISC'), (34, 41, 'B-MISC')]})
('The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .', {'entities': [(4, 12, 'B-ORG'), (13, 23, 'I-ORG'), (59, 65, 'B-MISC'), (94, 101, 'B-MISC')]})
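
In the output above, "European Commission" is stored as two separate one-token entities labeled `B-ORG` and `I-ORG`. A common alternative is to merge each `B-`/`I-` run into a single span with a prefix-free label. The sketch below shows one way to do that; `bio_to_spans` is a hypothetical helper, not part of this notebook or of spaCy, and the results reported later were produced with the per-token format, not with this one.

```python
def bio_to_spans(tokens, labels):
    """Merge BIO-tagged tokens into (start_char, end_char, type) spans.

    Offsets refer to the text obtained by joining the tokens with single spaces.
    """
    spans = []
    current = None  # [start, end, type] of the span being built, if any
    offset = 0
    for token, label in zip(tokens, labels):
        start, end = offset, offset + len(token)
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and current is not None and current[2] == ent_type:
            current[1] = end  # extend the open span over this token
        else:
            if current is not None:
                spans.append(tuple(current))  # close the previous span
            current = [start, end, ent_type] if label != "O" else None
        offset = end + 1  # account for the joining space
    if current is not None:
        spans.append(tuple(current))
    return " ".join(tokens), spans

tokens = ["The", "European", "Commission", "said"]
labels = ["O", "B-ORG", "I-ORG", "O"]
text, spans = bio_to_spans(tokens, labels)
print(text, spans)
```

The merged span covers the same characters as the two per-token spans `(4, 12)` and `(13, 23)` shown above, including the space between them.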


# Training a simple NER model using spaCy's small English model

* The code loads spaCy's small English model (`en_core_web_sm`) and disables every pipeline component except `ner`.
* It creates an optimizer and then trains the NER component for 3 iterations (epochs), reshuffling the training data at the start of each one.
* Each training example triggers one model update, and the accumulated loss is printed after every iteration to track training progress.

In [None]:
# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Disable other pipelines
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*other_pipes)

# Creating an optimizer that will be used to update the model's weights during training
optimizer = nlp.create_optimizer()

# Training the model
for i in range(3):  # Number of iterations (epochs)
    # Shuffle the training data to prevent the model from learning order-specific patterns
    random.shuffle(train_data_spacy)
    losses = {}
    for text, annotations in train_data_spacy:
        doc = nlp.make_doc(text)
        # Create an Example object that pairs the Doc with its annotations (here, the entity spans)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], drop=0.5, sgd=optimizer, losses=losses)
    print(f"Iteration {i} - Losses: {losses}")

Iteration 0 - Losses: {'ner': 18227.157272612334}
Iteration 1 - Losses: {'ner': 12004.707271408148}
Iteration 2 - Losses: {'ner': 10247.424668678403}


### Saving Model

In [None]:
# After training the model
output_dir = "/content/trained_spacy_model"  # Define the path to save the model

# Save the model to the specified directory
nlp.to_disk(output_dir)

# Output the path to the saved model
print(f"Model saved to {output_dir}")



Model saved to /content/trained_spacy_model


# Evaluate the Model


In [None]:


# Evaluate a spaCy pipeline at the token level using precision, recall, and F1-score.
# Takes the pipeline (nlp) and the preprocessed test sentences (data) as inputs.

def evaluate_model(nlp, data):
    true_labels = []
    pred_labels = []

    for example in data:
        words = [token for token, postag, chunk, label in example]
        true_labels.extend([label for token, postag, chunk, label in example])

        # Process the text with the NLP model
        doc = nlp(' '.join(words))

        for token in doc:
            if token.ent_iob_ == 'O':
                pred_labels.append('O')
            else:
                # ent_type_ already carries the B-/I- prefix, because the model
                # was trained on labels such as "B-ORG"
                pred_labels.append(token.ent_type_)

        # spaCy's tokenizer may split the text differently from the CoNLL tokens.
        # The padding below only equalizes the list lengths; labels inside a
        # sentence that was tokenized differently can still be shifted.
        # If there are fewer predicted labels, append 'O' until the lists are of equal length
        while len(pred_labels) < len(true_labels):
            pred_labels.append('O')

        # If there are more predicted labels, remove the excess labels until the lists are of equal length
        while len(pred_labels) > len(true_labels):
            pred_labels.pop()
    print(classification_report(true_labels, pred_labels))

# Evaluate on the test data
evaluate_model(nlp, test_data)



              precision    recall  f1-score   support

       B-LOC       0.68      0.70      0.69      1668
      B-MISC       0.56      0.70      0.62       702
       B-ORG       0.69      0.72      0.71      1661
       B-PER       0.67      0.55      0.60      1617
       I-LOC       0.75      0.56      0.64       257
      I-MISC       0.60      0.43      0.50       216
       I-ORG       0.59      0.78      0.67       835
       I-PER       0.68      0.60      0.64      1156
           O       0.96      0.96      0.96     38323

    accuracy                           0.91     46435
   macro avg       0.69      0.67      0.67     46435
weighted avg       0.91      0.91      0.91     46435



### *Summary of Evaluation*

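The model reaches 91% token-level accuracy, but that figure is dominated by the `O` class, which accounts for 38,323 of the 46,435 test tokens (about 83%) and scores 0.96 F1.

Performance on the entity classes is considerably weaker: F1 ranges from 0.50 (`I-MISC`) to 0.71 (`B-ORG`), and the macro-averaged F1 is 0.67.

The largest room for improvement is in the classes with low recall, such as `B-PER` (0.55) and `I-MISC` (0.43).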
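
Because the report above scores each token separately, a partially recognized entity still earns credit for the tokens it gets right. NER systems are often also scored at the entity level, where a prediction counts only if both its boundaries and its type match exactly. The following is a minimal pure-Python sketch of that idea; `extract_entities` and `entity_prf` are hypothetical helpers, intended to be applied to one sentence's label sequence at a time, and were not used to produce the numbers in this notebook.

```python
def extract_entities(labels):
    """Return the set of (start_token, end_token, type) entities in a BIO sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes any entity that runs to the end of the sequence
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, current_type = label.partition("-")
        continues = prefix == "I" and start is not None and current_type == ent_type
        if not continues:
            if start is not None:
                entities.add((start, i, ent_type))
            start, ent_type = (i, current_type) if label != "O" else (None, None)
    return entities

def entity_prf(true_labels, pred_labels):
    """Exact-match entity-level precision, recall, and F1."""
    gold = extract_entities(true_labels)
    pred = extract_entities(pred_labels)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the model finds only the first token of a two-token ORG entity
true_seq = ["O", "B-ORG", "I-ORG", "O", "B-MISC"]
pred_seq = ["O", "B-ORG", "O", "O", "B-MISC"]
print(entity_prf(true_seq, pred_seq))
```

In this toy example four of the five tokens are labeled correctly, yet only one of the two gold entities is matched exactly, so the entity-level scores are lower than the token-level accuracy.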

# Create a function to perform NER on new headlines

In [None]:
# Creating a function to perform NER on new headlines
def perform_ner(text, nlp):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Example usage:
# Load the trained spaCy model saved earlier
nlp = spacy.load("/content/trained_spacy_model")  # trained model path

# Test the function
new_headline = "The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep ."
entities = perform_ner(new_headline, nlp)
print(entities)


[('European', 'B-ORG'), ('Commission', 'I-ORG'), ('German', 'B-MISC'), ('British', 'B-MISC')]
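
The output above returns "European" and "Commission" as two entities because the model was trained on per-token `B-`/`I-` labels. As a post-processing step, consecutive pieces can be stitched back together. The sketch below uses a hypothetical helper, `merge_bio_entities`, and assumes that an `I-` piece directly follows its predecessor in the text; a more robust version would check token positions (for example via `ent.start` and `ent.end`) before merging.

```python
def merge_bio_entities(entities):
    """Merge consecutive (text, BIO label) pairs into (text, type) entities."""
    merged = []
    for text, label in entities:
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and merged and merged[-1][1] == ent_type:
            # Continuation of the previous entity: append this piece to it
            merged[-1] = (merged[-1][0] + " " + text, ent_type)
        else:
            # New entity; fall back to the raw label if it has no B-/I- prefix
            merged.append((text, ent_type or label))
    return merged

raw = [("European", "B-ORG"), ("Commission", "I-ORG"),
       ("German", "B-MISC"), ("British", "B-MISC")]
print(merge_bio_entities(raw))
```

Note that "German" and "British" stay separate because each carries a `B-` tag, which always starts a new entity.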
