### Name: Aditya Kumar Tiwari
### Roll no.: MSA23023

**NLP Lab 6:** Named Entity Recognition (NER) with Language Models

**Objective:** Implement an NER system using transformers and evaluate how well it
identifies and classifies named entities in text.

In [1]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import pandas as pd
import numpy as np
from seqeval.metrics import classification_report
from tqdm import tqdm


In [5]:
class NERSystem:
    def __init__(self, model_name="dslim/bert-base-NER"):
        """Initialize the NER system with a pre-trained model."""
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.ner_pipeline = pipeline(
            "ner",
            model=self.model,
            tokenizer=self.tokenizer,
            aggregation_strategy="simple"
        )
        
    def predict(self, text):
        """Perform NER prediction on input text."""
        return self.ner_pipeline(text)
    
    def format_results(self, results):
        """Format NER results for better readability."""
        formatted = []
        for entity in results:
            formatted.append({
                'entity': entity['word'],
                'type': entity['entity_group'],
                'confidence': f"{entity['score']:.4f}",
                'start': entity['start'],
                'end': entity['end']
            })
        return formatted
    
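    def _convert_to_bio_tags(self, text, entities):
        """
        Convert pipeline entity spans to word-level BIO tags.

        Words are the whitespace-separated tokens of `text`. Entities whose
        start or end offset falls outside any word are skipped.
        """
        # Split on whitespace; each word receives exactly one tag
        words = text.split()
        bio_tags = ['O'] * len(words)
        
        # Map each character offset to the index of the word containing it.
        # Locating words with str.find keeps the offsets correct even when
        # words are separated by more than one whitespace character.
        char_to_word = {}
        current_pos = 0
        for i, word in enumerate(words):
            word_start = text.find(word, current_pos)
            current_pos = word_start + len(word)
            for j in range(word_start, current_pos):
                char_to_word[j] = i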
            
        # Assign BIO tags
        for entity in entities:
            try:
                start_word = char_to_word[entity['start']]
                end_word = char_to_word[min(entity['end'] - 1, len(text) - 1)] + 1
                
                for i in range(start_word, end_word):
                    if i == start_word:
                        bio_tags[i] = f"B-{entity['entity_group']}"
                    else:
                        bio_tags[i] = f"I-{entity['entity_group']}"
            except KeyError:
                continue  # Skip if character position mapping fails
                
        return bio_tags
    
    def evaluate_on_dataset(self, test_data):
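        """
        Evaluate the NER system on a test dataset.

        `test_data` is an iterable of (text, labels) pairs with one BIO label
        per whitespace-separated word. Examples whose label count does not
        match the word count are skipped with a warning.
        """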
        predictions = []
        true_labels = []
        
        for text, annotations in tqdm(test_data):
            # Get predicted entities
            pred = self.predict(text)
            
            # Convert to BIO tags
            pred_labels = self._convert_to_bio_tags(text, pred)
            
            # Ensure prediction length matches ground truth
            if len(pred_labels) == len(annotations):
                predictions.append(pred_labels)
                true_labels.append(annotations)
            else:
                print(f"Warning: Skipping example due to length mismatch. "
                      f"Text: {text}")
                print(f"Predicted length: {len(pred_labels)}, "
                      f"True length: {len(annotations)}")
                
        # Generate evaluation report
        return classification_report(
            true_labels,
            predictions,
            digits=4,
            zero_division=0
        )
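
The span-to-BIO conversion can be exercised without loading the model. The sketch below is a standalone variant, not the class method itself: the function name `spans_to_bio` and the sample spans are hypothetical, and the spans are hand-written in the shape the pipeline returns (`entity_group`, `start`, `end`). It takes token offsets from `re.finditer` and tags every token whose character range overlaps an entity span.

```python
import re

def spans_to_bio(text, entities):
    """Map character-level entity spans onto whitespace tokens as BIO tags."""
    # Record each whitespace-delimited token with its real character offsets.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)

    for ent in entities:
        # A token belongs to the entity if the two character ranges overlap.
        covered = [
            i for i, (start, end) in enumerate(tokens)
            if start < ent["end"] and ent["start"] < end
        ]
        for k, i in enumerate(covered):
            prefix = "B" if k == 0 else "I"
            tags[i] = f"{prefix}-{ent['entity_group']}"
    return tags

# Hand-written spans for illustration (an assumption, not real model output).
# Note the double space after "Satya".
text = "Satya  Nadella visited New York."
entities = [
    {"entity_group": "PER", "start": 0, "end": 14},
    {"entity_group": "LOC", "start": 23, "end": 31},
]
print(spans_to_bio(text, entities))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']
```

Because the overlap test works on ranges rather than single offsets, trailing punctuation attached to a word ("York.") does not prevent the word from being tagged.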

In [6]:
if __name__ == "__main__":
    # Initialize NER system
    ner_system = NERSystem()
    
    # Example texts for testing
    test_texts = [
        "Microsoft CEO Satya Nadella spoke at a conference in New York last week.",
        "The European Union signed a trade deal with Singapore worth $50 billion.",
        "Tesla's Elon Musk announced plans to build a new factory in Berlin, Germany."
    ]
    
    # Process each text and display results
    print("NER Analysis Results:")
    for text in test_texts:
        print(f"\nText: {text}")
        results = ner_system.predict(text)
        formatted_results = ner_system.format_results(results)
        
        print("\nDetected Entities:")
        for entity in formatted_results:
            print(f"- {entity['entity']} ({entity['type']}) "
                  f"[Confidence: {entity['confidence']}]")
    
    # Example test dataset with annotations (simplified)
    test_dataset = [
        (
            "Microsoft CEO Satya Nadella spoke at a conference in New York",
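            # NOTE: this sentence has 11 whitespace-separated words but only
            # 10 labels, so evaluate_on_dataset skips it with a warning.
            ['B-ORG', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']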
        ),
        (
            "Tesla announced new factory in Berlin",
            ['B-ORG', 'O', 'O', 'O', 'O', 'B-LOC']
        )
    ]
    
    # Evaluate on test dataset
    print("\nEvaluation Results:")
    evaluation_results = ner_system.evaluate_on_dataset(test_dataset)
    print(evaluation_results)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


NER Analysis Results:

Text: Microsoft CEO Satya Nadella spoke at a conference in New York last week.

Detected Entities:
- Microsoft (ORG) [Confidence: 0.9989]
- Satya Nadella (PER) [Confidence: 0.9859]
- New York (LOC) [Confidence: 0.9993]

Text: The European Union signed a trade deal with Singapore worth $50 billion.

Detected Entities:
- European Union (ORG) [Confidence: 0.9994]
- Singapore (LOC) [Confidence: 0.9998]

Text: Tesla's Elon Musk announced plans to build a new factory in Berlin, Germany.

Detected Entities:
- Tesla (ORG) [Confidence: 0.9787]
- El (ORG) [Confidence: 0.9994]
- ##on Musk (ORG) [Confidence: 0.9930]
- Berlin (LOC) [Confidence: 0.9996]
- Germany (LOC) [Confidence: 0.9997]
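
In the third sentence the name "Elon Musk" is returned as two fragments, `El` and `##on Musk`, where `##` marks a WordPiece continuation. One possible post-processing step, sketched below, folds such a fragment into the entity that ends where it starts. The helper name `merge_wordpiece_fragments` is hypothetical; the words and scores are copied from the output above, while the `start`/`end` offsets are computed by hand from the sentence and are an assumption. The sketch keeps the first fragment's label, so it does not correct the ORG label assigned to a person. The pipeline also accepts other `aggregation_strategy` values such as `"first"`; whether that resolves this case has not been checked here.

```python
def merge_wordpiece_fragments(entities):
    """Fold entities whose text starts with '##' into the preceding entity."""
    merged = []
    for ent in entities:
        word = ent["word"]
        prev = merged[-1] if merged else None
        # Merge only when the fragment starts exactly where the previous
        # entity ends, i.e. both pieces belong to the same word.
        if word.startswith("##") and prev and prev["end"] == ent["start"]:
            prev["word"] += word[2:]  # drop the '##' marker, join without a space
            prev["end"] = ent["end"]
            # Keep the lower score as a conservative confidence (a design choice).
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input list is not mutated
    return merged

# Words and scores from the output above; offsets computed by hand (assumption).
raw = [
    {"word": "Tesla", "entity_group": "ORG", "score": 0.9787, "start": 0, "end": 5},
    {"word": "El", "entity_group": "ORG", "score": 0.9994, "start": 8, "end": 10},
    {"word": "##on Musk", "entity_group": "ORG", "score": 0.9930, "start": 10, "end": 17},
]
print([e["word"] for e in merge_wordpiece_fragments(raw)])
# ['Tesla', 'Elon Musk']
```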

Evaluation Results:


100%|██████████| 2/2 [00:00<00:00, 12.50it/s]

Predicted length: 11, True length: 10
              precision    recall  f1-score   support

         LOC     1.0000    1.0000    1.0000         1
         ORG     1.0000    1.0000    1.0000         1

   micro avg     1.0000    1.0000    1.0000         2
   macro avg     1.0000    1.0000    1.0000         2
weighted avg     1.0000    1.0000    1.0000         2
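
The line `Predicted length: 11, True length: 10` means the first example was skipped, so the report above reflects only the second sentence (one ORG and one LOC, matching the support column). A pre-check along the following lines can surface such annotation problems before evaluation. It is a minimal sketch: the helper name `find_label_mismatches` is hypothetical, and it assumes the same whitespace tokenization that the evaluation uses.

```python
def find_label_mismatches(dataset):
    """Return (index, n_words, n_labels) for examples whose counts differ."""
    mismatches = []
    for idx, (text, labels) in enumerate(dataset):
        n_words = len(text.split())
        if n_words != len(labels):
            mismatches.append((idx, n_words, len(labels)))
    return mismatches

# The two annotated examples from the notebook, unchanged.
test_dataset = [
    (
        "Microsoft CEO Satya Nadella spoke at a conference in New York",
        ['B-ORG', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC'],
    ),
    (
        "Tesla announced new factory in Berlin",
        ['B-ORG', 'O', 'O', 'O', 'O', 'B-LOC'],
    ),
]

for idx, n_words, n_labels in find_label_mismatches(test_dataset):
    print(f"Example {idx}: {n_words} words but {n_labels} labels")
# Example 0: 11 words but 10 labels
```

With only two named entities contributing to the report, the perfect scores say little about the model's general accuracy; a larger, verified test set would be needed for that.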




