# Assignment 4: Named Entity Recognition (NER) System

This notebook demonstrates:
- **NER Extraction** - Extract entities from text
- **Evaluation Metrics** - Accuracy, Precision, Recall, F1-Score
- **Token-Level Evaluation** - Detailed performance metrics
- **Entity Type Distribution** - Analysis of entity types

In [4]:
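# Import required libraries
import nltk

# Download the tokenizer, POS-tagger, and NE-chunker resources used below.
# Both the older resource names and the newer `_tab` / `_eng` variants are
# requested, since newer NLTK releases look for the latter.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
nltk.download('conll2000')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')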

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import conlltags2tree, tree2conlltags
from collections import defaultdict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


## Sample Text Data

In [5]:
# Sample news articles / social media text data
sample_texts = [
    "Apple Inc. is planning to open a new store in New York City next month. Tim Cook announced this at the conference.",
    "President Joe Biden met with Prime Minister Narendra Modi in Washington D.C. to discuss trade relations.",
    "Google and Microsoft are competing for market share in the cloud computing industry.",
    "Elon Musk's Tesla delivered record numbers of electric vehicles in California last quarter.",
    "The United Nations held a summit in Geneva, Switzerland about climate change.",
    "Amazon founder Jeff Bezos announced plans to invest in renewable energy projects in Europe.",
    "The FIFA World Cup will be hosted by Saudi Arabia in 2034.",
    "Scientists at MIT and Stanford University made a breakthrough in artificial intelligence research."
]

# Ground truth entities for evaluation
ground_truth = [
    [("Apple Inc.", "ORGANIZATION"), ("New York City", "GPE"), ("Tim Cook", "PERSON")],
    [("Joe Biden", "PERSON"), ("Narendra Modi", "PERSON"), ("Washington D.C.", "GPE")],
    [("Google", "ORGANIZATION"), ("Microsoft", "ORGANIZATION")],
    [("Elon Musk", "PERSON"), ("Tesla", "ORGANIZATION"), ("California", "GPE")],
    [("United Nations", "ORGANIZATION"), ("Geneva", "GPE"), ("Switzerland", "GPE")],
    [("Amazon", "ORGANIZATION"), ("Jeff Bezos", "PERSON"), ("Europe", "GPE")],
    [("FIFA", "ORGANIZATION"), ("Saudi Arabia", "GPE")],
    [("MIT", "ORGANIZATION"), ("Stanford University", "ORGANIZATION")]
]

print("Sample Texts:")
for i, text in enumerate(sample_texts):
    print(f"  {i+1}. {text}")

Sample Texts:
  1. Apple Inc. is planning to open a new store in New York City next month. Tim Cook announced this at the conference.
  2. President Joe Biden met with Prime Minister Narendra Modi in Washington D.C. to discuss trade relations.
  3. Google and Microsoft are competing for market share in the cloud computing industry.
  4. Elon Musk's Tesla delivered record numbers of electric vehicles in California last quarter.
  5. The United Nations held a summit in Geneva, Switzerland about climate change.
  6. Amazon founder Jeff Bezos announced plans to invest in renewable energy projects in Europe.
  7. The FIFA World Cup will be hosted by Saudi Arabia in 2034.
  8. Scientists at MIT and Stanford University made a breakthrough in artificial intelligence research.
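
Because the evaluation later compares entity strings directly, it helps to confirm that every annotated entity actually occurs verbatim in its text. The following is a minimal sketch of such a check, not part of the original notebook; the helper name `missing_annotations` is hypothetical, and only the first two sample texts are repeated here to keep the block self-contained.

```python
def missing_annotations(texts, annotations):
    """Return (doc_number, entity_name) pairs whose name is not found in its text.

    Hypothetical helper: a plain substring test, case-sensitive.
    """
    missing = []
    for doc_number, (text, entities) in enumerate(zip(texts, annotations), start=1):
        for name, _label in entities:
            if name not in text:
                missing.append((doc_number, name))
    return missing


# First two sample texts and their ground-truth annotations
texts = [
    "Apple Inc. is planning to open a new store in New York City next month. "
    "Tim Cook announced this at the conference.",
    "President Joe Biden met with Prime Minister Narendra Modi in Washington D.C. "
    "to discuss trade relations.",
]
annotations = [
    [("Apple Inc.", "ORGANIZATION"), ("New York City", "GPE"), ("Tim Cook", "PERSON")],
    [("Joe Biden", "PERSON"), ("Narendra Modi", "PERSON"), ("Washington D.C.", "GPE")],
]

print(missing_annotations(texts, annotations))  # [] means every annotation was found
```

An empty list means the annotations are consistent with the texts; any pair that is returned points at a typo or a span that was paraphrased rather than copied.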


## 1. NER Extraction from Text

In [6]:
def extract_entities_nltk(text):
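    """Extract named entities with NLTK's ne_chunk.

    Returns a list of (entity_text, entity_label) tuples.
    """
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    tree = ne_chunk(pos_tags)  # Tree whose named-entity chunks are labelled subtrees
    
    entities = []
    for subtree in tree:
        # Plain (word, tag) tuples have no label(); only entity chunks do
        if hasattr(subtree, 'label'):
            entity_name = ' '.join(leaf[0] for leaf in subtree.leaves())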
            entity_type = subtree.label()
            entities.append((entity_name, entity_type))
    
    return entities

# Extract entities from all texts
print("Extracted Entities:")
predicted_entities = []
for i, text in enumerate(sample_texts):
    entities = extract_entities_nltk(text)
    predicted_entities.append(entities)
    print(f"\nText {i+1}: {text[:60]}...")
    print(f"  Entities: {entities}")

Extracted Entities:

Text 1: Apple Inc. is planning to open a new store in New York City ...
  Entities: [('Apple', 'PERSON'), ('Inc.', 'ORGANIZATION'), ('New York City', 'GPE'), ('Tim Cook', 'PERSON')]

Text 2: President Joe Biden met with Prime Minister Narendra Modi in...
  Entities: [('Joe Biden', 'PERSON'), ('Narendra Modi', 'PERSON'), ('Washington', 'GPE')]

Text 3: Google and Microsoft are competing for market share in the c...
  Entities: [('Google', 'GPE'), ('Microsoft', 'ORGANIZATION')]

Text 4: Elon Musk's Tesla delivered record numbers of electric vehic...
  Entities: [('Elon', 'PERSON'), ('Musk', 'ORGANIZATION'), ('Tesla', 'PERSON'), ('California', 'GPE')]

Text 5: The United Nations held a summit in Geneva, Switzerland abou...
  Entities: [('United Nations', 'ORGANIZATION'), ('Geneva', 'GPE'), ('Switzerland', 'GPE')]

Text 6: Amazon founder Jeff Bezos announced plans to invest in renew...
  Entities: [('Amazon', 'GPE'), ('Jeff Bezos', 'PERSON'), ('Europe', 'GPE')]

Text 7
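
The output above mixes two kinds of error: wrong labels on a correct span (for example `Google` tagged `GPE`) and wrong spans (for example `Apple` and `Inc.` returned separately). As a rough sketch, span errors can be listed by pairing each unmatched prediction with a ground-truth name that contains it. The helper `boundary_errors` is hypothetical and relies on a plain substring test, which is crude but sufficient for this sample.

```python
def boundary_errors(gt_entities, pred_entities):
    """Pair each unmatched prediction with a ground-truth name that contains it.

    Hypothetical helper: predictions whose text exactly equals a ground-truth
    name are skipped, regardless of label.
    """
    gt_names = [name for name, _label in gt_entities]
    errors = []
    for pred_name, _label in pred_entities:
        if pred_name in gt_names:
            continue  # exact span match, so not a boundary error
        for gt_name in gt_names:
            if pred_name in gt_name:  # substring test
                errors.append((pred_name, gt_name))
                break
    return errors


# Documents 1 and 2, copied from the sample data and the extraction output
gt_doc1 = [("Apple Inc.", "ORGANIZATION"), ("New York City", "GPE"), ("Tim Cook", "PERSON")]
pred_doc1 = [("Apple", "PERSON"), ("Inc.", "ORGANIZATION"),
             ("New York City", "GPE"), ("Tim Cook", "PERSON")]
gt_doc2 = [("Joe Biden", "PERSON"), ("Narendra Modi", "PERSON"), ("Washington D.C.", "GPE")]
pred_doc2 = [("Joe Biden", "PERSON"), ("Narendra Modi", "PERSON"), ("Washington", "GPE")]

print(boundary_errors(gt_doc1, pred_doc1))
# [('Apple', 'Apple Inc.'), ('Inc.', 'Apple Inc.')]
print(boundary_errors(gt_doc2, pred_doc2))
# [('Washington', 'Washington D.C.')]
```

Separating the two error types matters for the evaluation that follows, because exact string matching penalizes a split span twice: once as a missed ground-truth entity and once per spurious fragment.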

## 2. Evaluation Metrics: Accuracy, Precision, Recall, F1-Score
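
The cell in this section matches entities by normalized name only, so a prediction with the right span but the wrong label still counts as a true positive. As a hedged illustration of how much that choice matters, the sketch below computes the counts both ways for one document. The function names `entity_counts` and `precision_recall_f1` and the `match_type` flag are assumptions introduced here, not part of the notebook.

```python
def entity_counts(gt_entities, pred_entities, match_type=False):
    """Return (tp, fp, fn) for one document.

    match_type=False compares lowercased names only;
    match_type=True additionally requires the entity label to agree.
    """
    def key(entity):
        name, label = entity
        name = name.lower().strip()
        return (name, label) if match_type else name

    gt = {key(e) for e in gt_entities}
    pred = {key(e) for e in pred_entities}
    return len(gt & pred), len(pred - gt), len(gt - pred)


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1


# Document 3: 'Google' has the right span but is labelled GPE instead of ORGANIZATION
gt_doc3 = [("Google", "ORGANIZATION"), ("Microsoft", "ORGANIZATION")]
pred_doc3 = [("Google", "GPE"), ("Microsoft", "ORGANIZATION")]

print(entity_counts(gt_doc3, pred_doc3))                   # (2, 0, 0)
print(entity_counts(gt_doc3, pred_doc3, match_type=True))  # (1, 1, 1)
print(precision_recall_f1(1, 1, 1))                        # (0.5, 0.5, 0.5)
```

Under name-only matching Document 3 scores perfectly, while label-aware matching halves both precision and recall, so the figures reported by the cell below are best read as span-detection scores rather than full NER scores.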

In [7]:
def normalize_entity(entity_name):
    """Normalize entity name for comparison"""
    return entity_name.lower().strip()

# Calculate metrics
true_positives = 0
false_positives = 0
false_negatives = 0

print("Comparison of Ground Truth vs Predicted:")
for i, (gt_entities, pred_entities) in enumerate(zip(ground_truth, predicted_entities)):
    print(f"\nDocument {i+1}:")
    print(f"  Ground Truth: {gt_entities}")
    print(f"  Predicted:    {pred_entities}")
    
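    # Match on normalized entity names only. Entity types are not compared,
    # so a correct span with the wrong label still counts as a true positive.
    gt_names = {normalize_entity(e[0]) for e in gt_entities}
    pred_names = {normalize_entity(e[0]) for e in pred_entities}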
    
    matches = gt_names & pred_names
    tp = len(matches)
    fp = len(pred_names - gt_names)
    fn = len(gt_names - pred_names)
    
    true_positives += tp
    false_positives += fp
    false_negatives += fn
    
    print(f"  TP: {tp}, FP: {fp}, FN: {fn}")

# Calculate overall metrics
print("\n" + "-" * 60)
print("OVERALL METRICS (Entity-Level):")
print("-" * 60)

precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

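# "Accuracy" here is the fraction of ground-truth entities that were matched.
# When no document repeats an entity name it equals recall; it is not a
# conventional classification accuracy, since no true negatives are counted.
total_ground_truth = sum(len(gt) for gt in ground_truth)
accuracy = true_positives / total_ground_truth if total_ground_truth > 0 else 0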

print(f"\n  True Positives:  {true_positives}")
print(f"  False Positives: {false_positives}")
print(f"  False Negatives: {false_negatives}")
print(f"\n  Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"  Precision: {precision:.4f} ({precision*100:.2f}%)")
print(f"  Recall:    {recall:.4f} ({recall*100:.2f}%)")
print(f"  F1-Score:  {f1:.4f} ({f1*100:.2f}%)")

Comparison of Ground Truth vs Predicted:

Document 1:
  Ground Truth: [('Apple Inc.', 'ORGANIZATION'), ('New York City', 'GPE'), ('Tim Cook', 'PERSON')]
  Predicted:    [('Apple', 'PERSON'), ('Inc.', 'ORGANIZATION'), ('New York City', 'GPE'), ('Tim Cook', 'PERSON')]
  TP: 2, FP: 2, FN: 1

Document 2:
  Ground Truth: [('Joe Biden', 'PERSON'), ('Narendra Modi', 'PERSON'), ('Washington D.C.', 'GPE')]
  Predicted:    [('Joe Biden', 'PERSON'), ('Narendra Modi', 'PERSON'), ('Washington', 'GPE')]
  TP: 2, FP: 1, FN: 1

Document 3:
  Ground Truth: [('Google', 'ORGANIZATION'), ('Microsoft', 'ORGANIZATION')]
  Predicted:    [('Google', 'GPE'), ('Microsoft', 'ORGANIZATION')]
  TP: 2, FP: 0, FN: 0

Document 4:
  Ground Truth: [('Elon Musk', 'PERSON'), ('Tesla', 'ORGANIZATION'), ('California', 'GPE')]
  Predicted:    [('Elon', 'PERSON'), ('Musk', 'ORGANIZATION'), ('Tesla', 'PERSON'), ('California', 'GPE')]
  TP: 2, FP: 2, FN: 1

Document 5:
  Ground Truth: [('United Nations', 'ORGANIZATION'), ('Gen