# NLTK Complete Guide - Section 8: Named Entity Recognition (NER)

This notebook covers:
- What is NER?
- NLTK's Named Entity Chunker
- Entity Types
- Extracting Entities
- Practical Applications

In [25]:
import nltk

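nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)

# Newer NLTK releases (around 3.9 and later) load these resources under different names
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)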

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

## 8.1 What is Named Entity Recognition?

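**NER** identifies and classifies named entities in text. Commonly used categories include:
- **PERSON**: People's names
- **ORGANIZATION**: Companies, institutions
- **GPE**: Geo-Political Entities (countries, cities)
- **LOCATION**: Mountains, rivers, regions
- **DATE/TIME**: Temporal expressions
- **MONEY**: Monetary values
- **PERCENT**: Percentages

NLTK's pretrained chunker typically emits only the first four of these, plus FACILITY and GSP (see section 8.3). It generally does not label dates, monetary values, or percentages.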

In [26]:
text = "Apple Inc. was founded by Steve Jobs in California."

# NER Pipeline: Tokenize → POS Tag → NE Chunk
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)

print(f"Text: {text}\n")
print("Named Entities:")
print(entities)

Text: Apple Inc. was founded by Steve Jobs in California.

Named Entities:
(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  was/VBD
  founded/VBN
  by/IN
  (PERSON Steve/NNP Jobs/NNP)
  in/IN
  (GPE California/NNP)
  ./.)


In [27]:
# Visualize the tree: draw() opens a separate Tkinter window, so it needs a GUI environment
# entities.draw()  # Uncomment to see tree visualization

# Print tree structure
entities.pprint()

(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  was/VBD
  founded/VBN
  by/IN
  (PERSON Steve/NNP Jobs/NNP)
  in/IN
  (GPE California/NNP)
  ./.)


## 8.2 Extracting Named Entities

In [28]:
def extract_entities(text):
    """Extract named entities from text"""
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    tree = ne_chunk(tagged)
    
    entities = []
    for subtree in tree:
        if isinstance(subtree, Tree):
            entity_type = subtree.label()
            entity_text = ' '.join(word for word, tag in subtree.leaves())
            entities.append((entity_text, entity_type))
    
    return entities

In [29]:
text = """Barack Obama was the 44th President of the United States.
He was born in Hawaii and studied at Harvard University.
Microsoft and Google are major tech companies in America."""

print(f"Text:\n{text}\n")

entities = extract_entities(text)

print("Extracted Entities:")
print("-" * 40)
for entity, entity_type in entities:
    print(f"{entity_type:<15} {entity}")

Text:
Barack Obama was the 44th President of the United States.
He was born in Hawaii and studied at Harvard University.
Microsoft and Google are major tech companies in America.

Extracted Entities:
----------------------------------------
PERSON          Barack
PERSON          Obama
GPE             United States
GPE             Hawaii
ORGANIZATION    Harvard University
PERSON          Microsoft
GPE             Google
GPE             America
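
The output above splits "Barack Obama" into two separate PERSON chunks. One possible post-processing step, sketched here as a heuristic and not something NLTK provides, is to join directly adjacent chunks that share a label. `merge_adjacent_chunks` is a hypothetical helper, and the tree is built by hand as a stand-in for `ne_chunk` output so the example runs without any model downloads:

```python
from nltk.tree import Tree

def merge_adjacent_chunks(tree):
    """Return a new tree in which neighbouring chunks with the same label are joined."""
    merged = Tree(tree.label(), [])
    for child in tree:
        previous = merged[-1] if len(merged) > 0 else None
        if (isinstance(child, Tree) and isinstance(previous, Tree)
                and previous.label() == child.label()):
            previous.extend(child)  # append this chunk's (word, tag) leaves
        elif isinstance(child, Tree):
            merged.append(Tree(child.label(), list(child)))  # copy, leave the input intact
        else:
            merged.append(child)
    return merged

# Hand-built stand-in for part of the ne_chunk output shown above
split_tree = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP')]),
    Tree('PERSON', [('Obama', 'NNP')]),
    ('was', 'VBD'),
    Tree('GPE', [('Hawaii', 'NNP')]),
])

merged_tree = merge_adjacent_chunks(split_tree)
merged_entities = [(' '.join(w for w, _ in st.leaves()), st.label())
                   for st in merged_tree if isinstance(st, Tree)]
print(merged_entities)  # [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```

The heuristic can also join names that merely happen to sit next to each other, so it is worth checking against your own data before relying on it.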


## 8.3 Entity Types in NLTK

In [30]:
entity_types = {
    'PERSON': 'People, including fictional characters',
    'ORGANIZATION': 'Companies, agencies, institutions',
    'GPE': 'Countries, cities, states (Geo-Political Entities)',
    'LOCATION': 'Non-GPE locations (mountains, rivers)',
    'FACILITY': 'Buildings, airports, highways',
    'GSP': 'Geo-Socio-Political groups',
}

print("NLTK Named Entity Types")
print("=" * 60)
for entity_type, description in entity_types.items():
    print(f"{entity_type:<15} {description}")

NLTK Named Entity Types
PERSON          People, including fictional characters
ORGANIZATION    Companies, agencies, institutions
GPE             Countries, cities, states (Geo-Political Entities)
LOCATION        Non-GPE locations (mountains, rivers)
FACILITY        Buildings, airports, highways
GSP             Geo-Socio-Political groups


In [31]:
# Examples of each entity type
examples = [
    ("PERSON", "Elon Musk founded SpaceX."),
    ("ORGANIZATION", "NASA launched a new satellite."),
    ("GPE", "Tokyo is the capital of Japan."),
    ("LOCATION", "Mount Everest is the tallest mountain."),
]

print("Entity Type Examples")
print("=" * 60)

for expected_type, text in examples:
    entities = extract_entities(text)
    print(f"\nText: {text}")
    print(f"Expected: {expected_type}")
    print(f"Found: {entities}")

Entity Type Examples

Text: Elon Musk founded SpaceX.
Expected: PERSON
Found: [('Elon', 'PERSON'), ('Musk', 'PERSON'), ('SpaceX', 'ORGANIZATION')]

Text: NASA launched a new satellite.
Expected: ORGANIZATION
Found: [('NASA', 'ORGANIZATION')]

Text: Tokyo is the capital of Japan.
Expected: GPE
Found: [('Tokyo', 'GPE'), ('Japan', 'GPE')]

Text: Mount Everest is the tallest mountain.
Expected: LOCATION
Found: [('Mount', 'PERSON'), ('Everest', 'ORGANIZATION')]


## 8.4 Binary NER (Named Entity or Not)

In [32]:
text = "Steve Jobs founded Apple in California."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)

# With entity types (default)
entities_typed = ne_chunk(tagged, binary=False)
print("With entity types (binary=False):")
entities_typed.pprint()

print("\n" + "=" * 50 + "\n")

# Binary (just NE or not)
entities_binary = ne_chunk(tagged, binary=True)
print("Binary mode (binary=True):")
entities_binary.pprint()

With entity types (binary=False):
(S
  (PERSON Steve/NNP)
  (PERSON Jobs/NNP)
  founded/VBD
  Apple/NNP
  in/IN
  (GPE California/NNP)
  ./.)


Binary mode (binary=True):
(S
  (NE Steve/NNP Jobs/NNP)
  founded/VBD
  Apple/NNP
  in/IN
  (NE California/NNP)
  ./.)


## 8.5 How NER Works Under the Hood

### The NER Pipeline

NLTK's NER is a **multi-stage pipeline**:

```
Raw Text
    ↓ word_tokenize()
Tokens: ["Apple", "Inc.", "was", "founded", "by", "Steve", "Jobs"]
    ↓ pos_tag()
POS Tagged: [("Apple", "NNP"), ("Inc.", "NNP"), ("was", "VBD"), ...]
    ↓ ne_chunk()
Entity Tree (ideal): (ORGANIZATION Apple Inc.) (PERSON Steve Jobs)
```

### What Features Does NER Look At?

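NLTK's `maxent_ne_chunker` (a Maximum Entropy NE chunker) classifies each token using features along these lines:

| Feature | Example | Why It Helps |
|---------|---------|--------------|
| **POS Tag** | `NNP` (proper noun) | Entities are usually proper nouns |
| **Word Shape** | `Xxxxx` (capitalized) | Names usually start with a capital letter |
| **Neighboring POS** | Tags of the previous and next tokens | The grammatical context hints at where a name starts and ends |
| **Neighboring Words** | "Mr." or "Dr." before, "Inc." after | Titles and company suffixes are strong clues |
| **Prefix/Suffix** | First and last few letters, e.g. `-ville`, `-son` | Some endings suggest entity types |
| **Word Itself** | Token identity, English word-list lookup | Names seen in training, such as "Microsoft", are recognized; ordinary dictionary words are less likely to be names |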
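
A feature function in this spirit might look like the following. It is an illustrative sketch, not NLTK's actual feature extractor: `word_shape`, `token_features`, and the feature names are assumptions chosen to mirror the table.

```python
def word_shape(word):
    """Map characters to X/x/d, so 'Obama' -> 'Xxxxx' and 'BRCA1' -> 'XXXXd'."""
    return ''.join(
        'X' if c.isupper() else 'x' if c.islower() else 'd' if c.isdigit() else c
        for c in word
    )

def token_features(tagged, index, previous_iob):
    """Build a feature dict for tagged[index], where tagged is a list of (word, pos)."""
    word, pos = tagged[index]
    prev_word, prev_pos = tagged[index - 1] if index > 0 else ('<START>', '<START>')
    next_word, next_pos = (tagged[index + 1] if index + 1 < len(tagged)
                           else ('<END>', '<END>'))
    return {
        'pos': pos,
        'shape': word_shape(word),
        'prefix3': word[:3].lower(),
        'suffix3': word[-3:].lower(),
        'prev_word': prev_word.lower(),
        'prev_pos': prev_pos,
        'next_word': next_word.lower(),
        'next_pos': next_pos,
        'prev_iob': previous_iob,  # the tag the classifier chose for the previous token
    }

tagged = [('Dr.', 'NNP'), ('John', 'NNP'), ('Smith', 'NNP'), ('works', 'VBZ')]
features = token_features(tagged, 1, previous_iob='O')
for name, value in features.items():
    print(f"{name:<10} {value}")
```

For "John", the previous word `dr.` and the shape `Xxxx` are the kind of evidence that pushes a classifier toward a PERSON tag.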

### Yes, NER is Context-Dependent!

Unlike simple dictionary lookup, NER considers **surrounding context**. The results below show that context helps in some cases and still fails in others:

In [33]:
# Demonstration: Context affects NER decisions
print("=" * 70)
print("CONTEXT MATTERS: Same Word, Different Entity Decisions")
print("=" * 70)

context_examples = [
    # "Apple" as company vs fruit
    ("Apple announced new products today.", "ORGANIZATION expected"),
    ("I ate an apple for breakfast.", "No entity expected"),
    
    # "Washington" as person vs place
    ("George Washington was the first president.", "PERSON expected"),
    ("I visited Washington last summer.", "GPE expected"),
    
    # "Jordan" as person vs country  
    ("Michael Jordan played basketball.", "PERSON expected"),
    ("Jordan is located in the Middle East.", "GPE expected"),
    
    # Capitalization matters
    ("Bill works at Microsoft.", "PERSON expected"),
    ("The bill was passed by Congress.", "No entity expected"),
]

print("\nHow context changes entity recognition:\n")

for text, expected in context_examples:
    entities = extract_entities(text)
    entity_str = entities if entities else "No entities"
    print(f"Text: {text}")
    print(f"  Expected: {expected}")
    print(f"  Found: {entity_str}\n")

CONTEXT MATTERS: Same Word, Different Entity Decisions

How context changes entity recognition:

Text: Apple announced new products today.
  Expected: ORGANIZATION expected
  Found: [('Apple', 'PERSON')]

Text: I ate an apple for breakfast.
  Expected: No entity expected
  Found: No entities

Text: George Washington was the first president.
  Expected: PERSON expected
  Found: [('George', 'PERSON'), ('Washington', 'GPE')]

Text: I visited Washington last summer.
  Expected: GPE expected
  Found: [('Washington', 'GPE')]

Text: Michael Jordan played basketball.
  Expected: PERSON expected
  Found: [('Michael', 'PERSON'), ('Jordan', 'PERSON')]

Text: Jordan is located in the Middle East.
  Expected: GPE expected
  Found: [('Jordan', 'GPE'), ('Middle East', 'GPE')]

Text: Bill works at Microsoft.
  Expected: PERSON expected
  Found: [('Bill', 'PERSON'), ('Microsoft', 'ORGANIZATION')]

Text: The bill was passed by Congress.
  Expected: No entity expected
  Found: [('Congress', 'ORGANIZATION')]

In [34]:
# What features does the NER model actually use?
# Let's examine the input to ne_chunk

print("=" * 70)
print("INSIDE ne_chunk: What the Model Sees")
print("=" * 70)

test_sentences = [
    "Apple released the iPhone.",
    "Steve Jobs founded Apple.",
    "I live in New York City.",
]

for text in test_sentences:
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    
    print(f"\nText: {text}")
    print("POS Tags the NER model receives:")
    for word, tag in tagged:
        # Show what features are available
        shape = ''.join('X' if c.isupper() else 'x' if c.islower() else c for c in word)
        print(f"  {word:<15} POS={tag:<5} Shape={shape}")
    
    # extract_entities() reruns the tokenize → tag → chunk pipeline on the text
    print(f"Result: {extract_entities(text)}")

INSIDE ne_chunk: What the Model Sees

Text: Apple released the iPhone.
POS Tags the NER model receives:
  Apple           POS=NNP   Shape=Xxxxx
  released        POS=VBD   Shape=xxxxxxxx
  the             POS=DT    Shape=xxx
  iPhone          POS=NN    Shape=xXxxxx
  .               POS=.     Shape=.
Result: [('Apple', 'PERSON'), ('iPhone', 'ORGANIZATION')]

Text: Steve Jobs founded Apple.
POS Tags the NER model receives:
  Steve           POS=NNP   Shape=Xxxxx
  Jobs            POS=NNP   Shape=Xxxx
  founded         POS=VBD   Shape=xxxxxxx
  Apple           POS=NNP   Shape=Xxxxx
  .               POS=.     Shape=.
Result: [('Steve', 'PERSON'), ('Jobs', 'PERSON'), ('Apple', 'PERSON')]

Text: I live in New York City.
POS Tags the NER model receives:
  I               POS=PRP   Shape=X
  live            POS=VBP   Shape=xxxx
  in              POS=IN    Shape=xx
  New             POS=NNP   Shape=Xxx
  York            POS=NNP   Shape=Xxxx
  City            POS=NNP   Shape=Xxxx
  .               POS=.     Shape=.

### The Algorithm: IOB Tagging + Maximum Entropy

NER uses **IOB tagging** (Inside-Outside-Beginning):

```
Token       POS    IOB Tag      Meaning
-------     ---    -------      -------
Steve       NNP    B-PERSON     Beginning of PERSON entity
Jobs        NNP    I-PERSON     Inside PERSON entity
founded     VBD    O            Outside (not an entity)
Apple       NNP    B-ORG        Beginning of ORG entity
Inc         NNP    I-ORG        Inside ORG entity
.           .      O            Outside
```
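
NLTK can produce this representation directly: `nltk.chunk.tree2conlltags` flattens a chunk tree into `(word, pos, iob)` triples, and `conlltags2tree` rebuilds the tree. The sketch below uses a hand-built tree as a stand-in for `ne_chunk` output so it runs without model downloads. NLTK's label is `ORGANIZATION`, so the tags come out as `B-ORGANIZATION` where the table above abbreviates to `B-ORG`.

```python
from nltk.tree import Tree
from nltk.chunk import tree2conlltags, conlltags2tree

# Hand-built tree in the same shape ne_chunk returns: leaves are (word, pos) tuples
ne_tree = Tree('S', [
    Tree('PERSON', [('Steve', 'NNP'), ('Jobs', 'NNP')]),
    ('founded', 'VBD'),
    Tree('ORGANIZATION', [('Apple', 'NNP'), ('Inc', 'NNP')]),
    ('.', '.'),
])

iob_rows = tree2conlltags(ne_tree)
for word, pos, iob in iob_rows:
    print(f"{word:<10} {pos:<5} {iob}")

# The conversion is reversible: the IOB triples rebuild the same tree
round_trip = conlltags2tree(iob_rows)
print(round_trip == ne_tree)
```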

**Maximum Entropy Classifier** decides each token's IOB tag by:
1. Extracting features (POS, word shape, context)
2. Computing probability of each IOB tag
3. Picking the highest probability tag

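This is a **trained statistical classifier**: it generalizes from features to words it has never seen, whereas an N-gram POS tagger can only look up contexts it memorized during training.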
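
The three steps listed above can be sketched with a toy log-linear model. The weights are made up for illustration (a trained model learns them from annotated data), and `tag_probabilities` is a hypothetical helper, not part of NLTK:

```python
import math

# Hypothetical weights for (feature, tag) pairs
WEIGHTS = {
    ('pos=NNP', 'B-PERSON'): 1.2,
    ('pos=NNP', 'B-ORGANIZATION'): 1.0,
    ('pos=NNP', 'O'): -0.5,
    ('prev_word=mr.', 'B-PERSON'): 2.5,
    ('shape=Xxxxx', 'B-PERSON'): 0.4,
    ('shape=Xxxxx', 'B-ORGANIZATION'): 0.3,
}
TAGS = ['B-PERSON', 'B-ORGANIZATION', 'O']

def tag_probabilities(active_features):
    """Sum the weights of the active features per tag, then normalize with softmax."""
    scores = {tag: sum(WEIGHTS.get((f, tag), 0.0) for f in active_features)
              for tag in TAGS}
    total = sum(math.exp(score) for score in scores.values())
    return {tag: math.exp(score) / total for tag, score in scores.items()}

# Step 1: features extracted for the token "Smith" in "Mr. Smith"
active = ['pos=NNP', 'prev_word=mr.', 'shape=Xxxxx']
# Step 2: probability of each IOB tag
probs = tag_probabilities(active)
# Step 3: pick the most probable tag
best_tag = max(probs, key=probs.get)

for tag, p in probs.items():
    print(f"{tag:<16} {p:.3f}")
print("Chosen:", best_tag)
```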

In [35]:
# Simulate IOB tagging to understand the concept
print("=" * 70)
print("IOB TAGGING: How NER Labels Each Token")
print("=" * 70)

def show_iob_concept(text):
    """Demonstrate IOB tagging concept"""
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    tree = ne_chunk(tagged)
    
    print(f"\nText: {text}")
    print(f"{'Token':<15} {'POS':<6} {'IOB Tag':<12} {'Explanation'}")
    print("-" * 60)
    
    # Walk through tree to show IOB concept
    for item in tree:
        if isinstance(item, Tree):
            # This is an entity
            entity_type = item.label()
            words = item.leaves()
            for i, (word, pos) in enumerate(words):
                if i == 0:
                    iob = f"B-{entity_type}"
                    expl = f"Beginning of {entity_type}"
                else:
                    iob = f"I-{entity_type}"
                    expl = f"Inside {entity_type}"
                print(f"{word:<15} {pos:<6} {iob:<12} {expl}")
        else:
            # Not an entity
            word, pos = item
            print(f"{word:<15} {pos:<6} {'O':<12} Outside (not entity)")

show_iob_concept("Barack Obama visited Microsoft headquarters in Seattle.")
show_iob_concept("Dr. John Smith works at Stanford University.")

IOB TAGGING: How NER Labels Each Token

Text: Barack Obama visited Microsoft headquarters in Seattle.
Token           POS    IOB Tag      Explanation
------------------------------------------------------------
Barack          NNP    B-PERSON     Beginning of PERSON
Obama           NNP    B-PERSON     Beginning of PERSON
visited         VBD    O            Outside (not entity)
Microsoft       NNP    B-ORGANIZATION Beginning of ORGANIZATION
headquarters    NNS    O            Outside (not entity)
in              IN     O            Outside (not entity)
Seattle         NNP    B-GPE        Beginning of GPE
.               .      O            Outside (not entity)

Text: Dr. John Smith works at Stanford University.
Token           POS    IOB Tag      Explanation
------------------------------------------------------------
Dr.             NNP    O            Outside (not entity)
John            NNP    B-PERSON     Beginning of PERSON
Smith           NNP    I-PERSON     Inside PERSON
works   

### Why NER Sometimes Fails

Common failure modes:

| Problem | Example | Why It Fails |
|---------|---------|--------------|
| **Lowercase names** | "obama spoke today" | Relies on capitalization |
| **Unknown entities** | "Zorbflex Inc." | Not in training data |
| **Ambiguous context** | "Apple is great" | Could be company or fruit |
| **Multi-word entities** | "New York Stock Exchange" | May split incorrectly |
| **Domain-specific** | "BRCA1 gene" | Medical/scientific terms |

In [36]:
# Demonstrate NER failures and limitations
print("=" * 70)
print("NER LIMITATIONS: When It Goes Wrong")
print("=" * 70)

failure_cases = [
    # Lowercase problem
    ("obama visited china", "Lowercase names → missed"),
    ("Obama visited China", "Capitalized → found"),
    
    # Unknown entities
    ("Zorbflex Corporation announced earnings", "Unknown company name"),
    ("Microsoft Corporation announced earnings", "Known company name"),
    
    # Ambiguous without context
    ("Apple is delicious", "Ambiguous: fruit or company?"),
    ("Apple announced iPhone", "Clear context: company"),
    
    # Complex multi-word entities
    ("The New York Stock Exchange opened today", "Long entity name"),
    ("United Nations Security Council met", "Another long entity"),
]

print(f"\n{'Text':<50} {'Found':<30} {'Note'}")
print("-" * 100)

for text, note in failure_cases:
    entities = extract_entities(text)
    entity_str = str(entities) if entities else "None"
    print(f"{text:<50} {entity_str:<30} {note}")

NER LIMITATIONS: When It Goes Wrong

Text                                               Found                          Note
----------------------------------------------------------------------------------------------------
obama visited china                                None                           Lowercase names → missed
Obama visited China                                [('Obama', 'PERSON'), ('China', 'GPE')] Capitalized → found
Zorbflex Corporation announced earnings            [('Zorbflex', 'PERSON'), ('Corporation', 'ORGANIZATION')] Unknown company name
Microsoft Corporation announced earnings           [('Microsoft', 'PERSON'), ('Corporation', 'ORGANIZATION')] Known company name
Apple is delicious                                 [('Apple', 'GPE')]             Ambiguous: fruit or company?
Apple announced iPhone                             [('Apple', 'PERSON'), ('iPhone', 'ORGANIZATION')] Clear context: company
The New York Stock Exchange opened today           [('New York'

## 8.6 Custom Training NER

### Can You Train Your Own NER Model?

**Short answer:** Yes, but NLTK's built-in NER trainer is limited.

**Options for custom NER:**

| Approach | Difficulty | Best For |
|----------|------------|----------|
| **Rule-based (gazetteers)** | Easy | Known entity lists |
| **NLTK ClassifierBasedTagger** | Medium | Educational purposes |
| **spaCy custom NER** | Medium | Production use |
| **Transformers (BERT, etc.)** | Hard | State-of-the-art accuracy |

### Option 1: Rule-Based NER with Gazetteers

A **gazetteer** is simply a list of known entities. This is the easiest approach:

In [37]:
# Option 1: Rule-based NER using gazetteers (lists of known entities)

class GazetteerNER:
    """Simple rule-based NER using entity lists"""
    
    def __init__(self):
        # Define gazetteers (entity lists)
        self.gazetteers = {
            'COMPANY': {'Apple', 'Microsoft', 'Google', 'Amazon', 'Meta', 
                       'Tesla', 'SpaceX', 'Netflix', 'Uber', 'Airbnb'},
            'PROGRAMMING_LANG': {'Python', 'Java', 'JavaScript', 'C++', 
                                'Rust', 'Go', 'TypeScript', 'Ruby'},
            'FRAMEWORK': {'Django', 'Flask', 'React', 'Angular', 'Vue',
                         'TensorFlow', 'PyTorch', 'Spring', 'Rails'},
            'DATABASE': {'MySQL', 'PostgreSQL', 'MongoDB', 'Redis', 
                        'Cassandra', 'SQLite', 'Oracle'},
        }
        
        # Also include multi-word entities
        self.multi_word = {
            'COMPANY': {'Goldman Sachs', 'JP Morgan', 'General Motors'},
            'FRAMEWORK': {'Spring Boot', 'Ruby on Rails', 'ASP.NET'},
        }
    
    def extract(self, text):
        """Extract entities using gazetteer lookup"""
        entities = []
        tokens = word_tokenize(text)
        
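        # Check multi-word entities first.
        # This is a case-insensitive substring test, so an entry can also match
        # inside a longer string, and its individual words may be reported again below.
        for entity_type, entity_set in self.multi_word.items():
            for entity in entity_set:
                if entity.lower() in text.lower():
                    entities.append((entity, entity_type))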
        
        # Check single-word entities
        for token in tokens:
            for entity_type, entity_set in self.gazetteers.items():
                if token in entity_set:
                    entities.append((token, entity_type))
        
        return entities

# Demo
gaz_ner = GazetteerNER()

tech_texts = [
    "I'm learning Python and Django for web development.",
    "Microsoft uses TypeScript for VS Code.",
    "Tesla uses PyTorch for their AI systems.",
    "Goldman Sachs runs Java applications on Oracle databases.",
]

print("=" * 70)
print("GAZETTEER-BASED NER: Custom Entity Lists")
print("=" * 70)

for text in tech_texts:
    entities = gaz_ner.extract(text)
    print(f"\nText: {text}")
    print(f"Entities: {entities}")

GAZETTEER-BASED NER: Custom Entity Lists

Text: I'm learning Python and Django for web development.
Entities: [('Python', 'PROGRAMMING_LANG'), ('Django', 'FRAMEWORK')]

Text: Microsoft uses TypeScript for VS Code.
Entities: [('Microsoft', 'COMPANY'), ('TypeScript', 'PROGRAMMING_LANG')]

Text: Tesla uses PyTorch for their AI systems.
Entities: [('Tesla', 'COMPANY'), ('PyTorch', 'FRAMEWORK')]

Text: Goldman Sachs runs Java applications on Oracle databases.
Entities: [('Goldman Sachs', 'COMPANY'), ('Java', 'PROGRAMMING_LANG'), ('Oracle', 'DATABASE')]
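
The `extract` method above finds multi-word entries with a substring test and then checks single tokens separately, so a phrase like "Ruby on Rails" would be reported together with its parts "Ruby" and "Rails". A possible alternative, sketched here under simplifying assumptions, scans token n-grams and keeps the longest match at each position. `GAZETTEER`, `longest_match_entities`, and the regex tokenizer are illustrative choices (the regex stands in for `word_tokenize` to keep the example self-contained), and matching is case-sensitive.

```python
import re

# Flat gazetteer: surface form -> entity type (a small illustrative subset)
GAZETTEER = {
    'Goldman Sachs': 'COMPANY',
    'Ruby on Rails': 'FRAMEWORK',
    'Ruby': 'PROGRAMMING_LANG',
    'Rails': 'FRAMEWORK',
    'Java': 'PROGRAMMING_LANG',
    'Oracle': 'DATABASE',
}
MAX_WORDS = max(len(key.split()) for key in GAZETTEER)

def longest_match_entities(text):
    """Scan left to right, preferring the longest gazetteer entry at each position."""
    # Keep '+', '#' and inner dots so forms like C++ or ASP.NET stay whole
    tokens = re.findall(r"[A-Za-z0-9+#.]*[A-Za-z0-9+#]", text)
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = ' '.join(tokens[i:i + n])
            if candidate in GAZETTEER:
                entities.append((candidate, GAZETTEER[candidate]))
                i += n  # skip past the matched words
                break
        else:
            i += 1  # no entry starts at this token
    return entities

print(longest_match_entities("Goldman Sachs runs Java applications on Oracle databases."))
print(longest_match_entities("We deploy Ruby on Rails."))
```

Because the scan jumps past each match, "Ruby on Rails" is reported once as a FRAMEWORK and not again as "Ruby" and "Rails".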


### Option 2: Training NLTK's NE Chunker

NLTK can train a custom NER model with **ClassifierBasedTagger**, which learns to assign an IOB tag to each `(word, POS)` token.
The training data is annotated with IOB tags:

In [38]:
# Option 2: Training format for NLTK NER
# Data format: list of sentences, each word has (token, POS, IOB_tag)

# Example training data in IOB format
training_data_iob = [
    # Sentence 1
    [
        ('Python', 'NNP', 'B-LANG'),      # Beginning of LANG entity
        ('is', 'VBZ', 'O'),                # Outside
        ('developed', 'VBN', 'O'),
        ('by', 'IN', 'O'),
        ('Guido', 'NNP', 'B-PERSON'),     # Beginning of PERSON
        ('van', 'NNP', 'I-PERSON'),       # Inside PERSON
        ('Rossum', 'NNP', 'I-PERSON'),    # Inside PERSON
        ('.', '.', 'O'),
    ],
    # Sentence 2
    [
        ('TensorFlow', 'NNP', 'B-FRAMEWORK'),
        ('was', 'VBD', 'O'),
        ('created', 'VBN', 'O'),
        ('by', 'IN', 'O'),
        ('Google', 'NNP', 'B-ORG'),
        ('.', '.', 'O'),
    ],
    # Sentence 3
    [
        ('Django', 'NNP', 'B-FRAMEWORK'),
        ('and', 'CC', 'O'),
        ('Flask', 'NNP', 'B-FRAMEWORK'),
        ('are', 'VBP', 'O'),
        ('Python', 'NNP', 'B-LANG'),
        ('frameworks', 'NNS', 'O'),
        ('.', '.', 'O'),
    ],
]

print("=" * 70)
print("IOB TRAINING DATA FORMAT")
print("=" * 70)
print("\nHow to annotate data for NER training:\n")

for i, sentence in enumerate(training_data_iob, 1):
    print(f"Sentence {i}:")
    print(f"{'Token':<15} {'POS':<6} {'IOB Tag':<12}")
    print("-" * 35)
    for token, pos, iob in sentence:
        print(f"{token:<15} {pos:<6} {iob:<12}")
    print()

IOB TRAINING DATA FORMAT

How to annotate data for NER training:

Sentence 1:
Token           POS    IOB Tag     
-----------------------------------
Python          NNP    B-LANG      
is              VBZ    O           
developed       VBN    O           
by              IN     O           
Guido           NNP    B-PERSON    
van             NNP    I-PERSON    
Rossum          NNP    I-PERSON    
.               .      O           

Sentence 2:
Token           POS    IOB Tag     
-----------------------------------
TensorFlow      NNP    B-FRAMEWORK 
was             VBD    O           
created         VBN    O           
by              IN     O           
Google          NNP    B-ORG       
.               .      O           

Sentence 3:
Token           POS    IOB Tag     
-----------------------------------
Django          NNP    B-FRAMEWORK 
and             CC     O           
Flask           NNP    B-FRAMEWORK 
are             VBP    O           
Python          NNP    B-LANG   
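
Data in this format can be fed to `ClassifierBasedTagger`. The sketch below is a minimal illustration, not a production recipe: `iob_features` is a hypothetical feature function, Naive Bayes is used as the classifier for simplicity, and two sentences are far too few for meaningful predictions, so treat the predicted tags as a demonstration of the mechanics only.

```python
from nltk.tag import ClassifierBasedTagger
from nltk.classify import NaiveBayesClassifier
from nltk.chunk import conlltags2tree

train_iob = [
    [('Python', 'NNP', 'B-LANG'), ('is', 'VBZ', 'O'), ('developed', 'VBN', 'O'),
     ('by', 'IN', 'O'), ('Guido', 'NNP', 'B-PERSON'), ('van', 'NNP', 'I-PERSON'),
     ('Rossum', 'NNP', 'I-PERSON'), ('.', '.', 'O')],
    [('TensorFlow', 'NNP', 'B-FRAMEWORK'), ('was', 'VBD', 'O'), ('created', 'VBN', 'O'),
     ('by', 'IN', 'O'), ('Google', 'NNP', 'B-ORG'), ('.', '.', 'O')],
]

def iob_features(tokens, index, history):
    """tokens is a list of (word, pos); history holds the IOB tags assigned so far."""
    word, pos = tokens[index]
    prev_word = tokens[index - 1][0] if index > 0 else '<START>'
    prev_iob = history[index - 1] if index > 0 else '<START>'
    return {
        'word': word.lower(),
        'pos': pos,
        'is_title': word.istitle(),
        'prev_word': prev_word.lower(),
        'prev_iob': prev_iob,
    }

# The tagger expects sentences of (token, tag) pairs; here each token is (word, pos)
train_sents = [[((w, p), iob) for w, p, iob in sent] for sent in train_iob]
tagger = ClassifierBasedTagger(
    feature_detector=iob_features,
    train=train_sents,
    classifier_builder=NaiveBayesClassifier.train,
)

# Tag a new, already POS-tagged sentence and convert the result back to a tree
new_sentence = [('Flask', 'NNP'), ('was', 'VBD'), ('created', 'VBN'), ('.', '.')]
predicted = tagger.tag(new_sentence)
iob_triples = [(w, p, iob) for (w, p), iob in predicted]
predicted_tree = conlltags2tree(iob_triples)
print(iob_triples)
print(predicted_tree)
```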

In [39]:
# Build a simple custom NER using NLTK's chunking framework
from nltk import RegexpParser

# Option 2a: Rule-based chunker using regex patterns
# This is simpler than training but can be effective

# Define grammar rules for chunking.
# Tag patterns match POS tags, not words, and rules are applied top to bottom,
# so the first rule below claims every run of NNP tokens.
grammar = r"""
    TECH_COMPANY: {<NNP>+<(Inc|Corp|LLC|Ltd)\.?>?}  # Proper nouns; the Inc/Corp part is compared with tags, so it never matches
    PERSON: {<NNP><NNP>+}                            # Two or more proper nouns (already taken by the rule above)
    LANGUAGE: {<NNP>}                                # Single proper noun (already taken by the rule above)
"""

chunk_parser = RegexpParser(grammar)

# Test it
test_text = "Steve Jobs founded Apple Inc in California"
tokens = word_tokenize(test_text)
tagged = pos_tag(tokens)
chunked = chunk_parser.parse(tagged)

print("=" * 70)
print("REGEX-BASED CHUNKER (Simple Custom NER)")
print("=" * 70)
print(f"\nText: {test_text}")
print(f"POS Tags: {tagged}")
print(f"\nChunked result:")
print(chunked)
print("\n⚠️ Note: This is pattern-based, not trained on data.")

REGEX-BASED CHUNKER (Simple Custom NER)

Text: Steve Jobs founded Apple Inc in California
POS Tags: [('Steve', 'NNP'), ('Jobs', 'NNP'), ('founded', 'VBD'), ('Apple', 'NNP'), ('Inc', 'NNP'), ('in', 'IN'), ('California', 'NNP')]

Chunked result:
(S
  (TECH_COMPANY Steve/NNP Jobs/NNP)
  founded/VBD
  (TECH_COMPANY Apple/NNP Inc/NNP)
  in/IN
  (TECH_COMPANY California/NNP))

⚠️ Note: This is pattern-based, not trained on data.
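
Every chunk in the output above is labeled `TECH_COMPANY`. `RegexpParser` patterns match POS tags, not words, so `<(Inc|Corp|LLC|Ltd)\.?>` can never match the word "Inc", and because rules run in order, the first rule claims every run of `NNP` tokens before the `PERSON` and `LANGUAGE` rules get a chance. One way around this, sketched below, is to chunk proper-noun runs with a single rule and then decide the label in Python by looking at the words. `COMPANY_SUFFIXES` and `label_names` are hypothetical names, and the labeling rules are deliberately crude.

```python
from nltk import RegexpParser
from nltk.tree import Tree

COMPANY_SUFFIXES = {'Inc', 'Inc.', 'Corp', 'Corp.', 'LLC', 'Ltd', 'Ltd.'}

grammar = r"""
    NAME: {<NNP>+}   # any run of proper nouns
"""
name_parser = RegexpParser(grammar)

def label_names(tagged):
    """Chunk NNP runs, then label each chunk from its words instead of its tags."""
    results = []
    for subtree in name_parser.parse(tagged):
        if isinstance(subtree, Tree):
            words = [word for word, _ in subtree.leaves()]
            if words[-1] in COMPANY_SUFFIXES:
                label = 'COMPANY'
            elif len(words) >= 2:
                label = 'PERSON'
            else:
                label = 'UNKNOWN'  # a single proper noun needs a gazetteer to classify
            results.append((' '.join(words), label))
    return results

# Same POS tags as in the output above, typed in by hand so no tagger is needed
tagged = [('Steve', 'NNP'), ('Jobs', 'NNP'), ('founded', 'VBD'), ('Apple', 'NNP'),
          ('Inc', 'NNP'), ('in', 'IN'), ('California', 'NNP')]
labeled = label_names(tagged)
print(labeled)
# [('Steve Jobs', 'PERSON'), ('Apple Inc', 'COMPANY'), ('California', 'UNKNOWN')]
```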


### Option 3: Combining NLTK with Gazetteers

The most practical approach for custom domains: **combine NLTK's NER with your own entity lists**:

In [40]:
# Option 3: Hybrid NER - Combine NLTK's NER with custom gazetteers

class HybridNER:
    """
    Combines NLTK's built-in NER with custom gazetteers.
    Best of both worlds!
    """
    
    def __init__(self):
        # Custom domain-specific gazetteers
        self.custom_entities = {
            # Tech domain
            'PROGRAMMING_LANG': {'Python', 'Java', 'JavaScript', 'C++', 'Rust', 'Go'},
            'FRAMEWORK': {'Django', 'Flask', 'React', 'TensorFlow', 'PyTorch', 'Spring'},
            'DATABASE': {'MySQL', 'PostgreSQL', 'MongoDB', 'Redis', 'SQLite'},
            # Medical domain (example)
            'DRUG': {'Aspirin', 'Ibuprofen', 'Paracetamol', 'Amoxicillin'},
            'DISEASE': {'Diabetes', 'Hypertension', 'Alzheimer', 'Parkinson'},
        }
    
    def extract(self, text):
        """Extract entities using both NLTK and custom gazetteers"""
        entities = []
        
        # 1. Use NLTK's built-in NER for standard entities
        tokens = word_tokenize(text)
        tagged = pos_tag(tokens)
        tree = ne_chunk(tagged)
        
        for subtree in tree:
            if isinstance(subtree, Tree):
                entity_type = subtree.label()
                entity_text = ' '.join(w for w, t in subtree.leaves())
                entities.append((entity_text, entity_type))
        
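        # 2. Add custom gazetteer matches
        for token in tokens:
            for entity_type, entity_set in self.custom_entities.items():
                if token in entity_set:
                    # Skip tokens NLTK already covered. Note that `token in e` is a
                    # substring test, so 'Go' is also skipped when 'Google' was found,
                    # and NLTK's label wins even when the gazetteer disagrees.
                    if not any(token in e for e, t in entities):
                        entities.append((token, entity_type))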
        
        return entities

# Demo
hybrid_ner = HybridNER()

test_texts = [
    "Elon Musk uses Python and TensorFlow at Tesla.",
    "Dr. Smith prescribed Aspirin for Hypertension.",
    "Google developed Go and TensorFlow in California.",
    "The Django framework uses PostgreSQL database.",
]

print("=" * 70)
print("HYBRID NER: NLTK + Custom Gazetteers")
print("=" * 70)

for text in test_texts:
    entities = hybrid_ner.extract(text)
    print(f"\nText: {text}")
    print(f"Entities found:")
    for entity, etype in entities:
        print(f"  • {entity} ({etype})")

HYBRID NER: NLTK + Custom Gazetteers

Text: Elon Musk uses Python and TensorFlow at Tesla.
Entities found:
  • Elon (PERSON)
  • Musk (ORGANIZATION)
  • Python (PERSON)
  • TensorFlow (ORGANIZATION)
  • Tesla (ORGANIZATION)

Text: Dr. Smith prescribed Aspirin for Hypertension.
Entities found:
  • Smith (PERSON)
  • Aspirin (PERSON)
  • Hypertension (DISEASE)

Text: Google developed Go and TensorFlow in California.
Entities found:
  • Google (PERSON)
  • TensorFlow (ORGANIZATION)
  • California (GPE)

Text: The Django framework uses PostgreSQL database.
Entities found:
  • Django (GPE)
  • PostgreSQL (ORGANIZATION)


### Summary: NER Under the Hood

| Aspect | How It Works |
|--------|--------------|
| **Algorithm** | Maximum Entropy classifier with IOB tagging |
| **Context-Dependent?** | ✅ Yes - uses POS tags, word shape, surrounding words |
| **Features Used** | POS tag, word shape, capitalization, previous/next words |
| **Custom Training** | Possible but complex; easier to use gazetteers |

**Best Practices for Custom NER:**
1. **Simple domains**: Use gazetteer-based approach
2. **Combined approach**: NLTK + your entity lists
3. **Production quality**: Consider spaCy or Transformers

**Remember:** NLTK's NER is a **trained classifier** that generalizes from features, unlike N-gram POS taggers, which look up contexts memorized during training.

## 8.7 Extracting Entities by Type

In [41]:
def extract_entities_by_type(text):
    """Extract entities grouped by type"""
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    tree = ne_chunk(tagged)
    
    entities_by_type = {}
    
    for subtree in tree:
        if isinstance(subtree, Tree):
            entity_type = subtree.label()
            entity_text = ' '.join(word for word, tag in subtree.leaves())
            
            # setdefault creates the list the first time a type is seen
            entities_by_type.setdefault(entity_type, []).append(entity_text)
    
    return entities_by_type

In [42]:
text = """Bill Gates and Satya Nadella lead Microsoft in Redmond, Washington.
Tim Cook is the CEO of Apple, headquartered in Cupertino, California.
Google was founded by Larry Page and Sergey Brin at Stanford University."""

print(f"Text:\n{text}\n")

entities = extract_entities_by_type(text)

print("Entities by Type:")
print("=" * 50)
for entity_type, entity_list in entities.items():
    print(f"\n{entity_type}:")
    for entity in entity_list:
        print(f"  • {entity}")

Text:
Bill Gates and Satya Nadella lead Microsoft in Redmond, Washington.
Tim Cook is the CEO of Apple, headquartered in Cupertino, California.
Google was founded by Larry Page and Sergey Brin at Stanford University.

Entities by Type:

PERSON:
  • Bill
  • Satya Nadella
  • Microsoft
  • Tim Cook
  • Google
  • Larry Page
  • Sergey Brin

ORGANIZATION:
  • Gates
  • CEO
  • Stanford University

GPE:
  • Redmond
  • Washington
  • Apple
  • Cupertino
  • California


## 8.8 Entity Counter

In [43]:
from collections import Counter

def count_entities(text):
    """Count entity occurrences"""
    entities = extract_entities(text)
    return Counter(entities)

text = """Apple announced new products. Apple's CEO Tim Cook presented.
Microsoft also had announcements. Google and Apple compete in many markets.
Tim Cook mentioned Apple's commitment to privacy."""

print(f"Text:\n{text}\n")

entity_counts = count_entities(text)

print("Entity Counts:")
print("-" * 40)
for (entity, etype), count in entity_counts.most_common():
    print(f"{entity:<20} ({etype:<12}) {count}x")

Text:
Apple announced new products. Apple's CEO Tim Cook presented.
Microsoft also had announcements. Google and Apple compete in many markets.
Tim Cook mentioned Apple's commitment to privacy.

Entity Counts:
----------------------------------------
Apple                (PERSON      ) 3x
CEO Tim Cook         (ORGANIZATION) 1x
Microsoft            (PERSON      ) 1x
Google               (PERSON      ) 1x
Tim Cook             (PERSON      ) 1x


## 8.9 Complete NER Pipeline Class

In [44]:
class NERExtractor:
    """Named Entity Recognition utility class"""
    
    def __init__(self, binary=False):
        self.binary = binary
    
    def process(self, text):
        """Process text and return NE tree"""
        tokens = word_tokenize(text)
        tagged = pos_tag(tokens)
        return ne_chunk(tagged, binary=self.binary)
    
    def extract_all(self, text):
        """Extract all entities as list of tuples"""
        tree = self.process(text)
        entities = []
        for subtree in tree:
            if isinstance(subtree, Tree):
                entity_type = subtree.label()
                entity_text = ' '.join(w for w, t in subtree.leaves())
                entities.append((entity_text, entity_type))
        return entities
    
    def extract_by_type(self, text, target_type):
        """Extract entities of specific type"""
        entities = self.extract_all(text)
        return [e for e, t in entities if t == target_type]
    
    def get_people(self, text):
        """Extract person names"""
        return self.extract_by_type(text, 'PERSON')
    
    def get_organizations(self, text):
        """Extract organization names"""
        return self.extract_by_type(text, 'ORGANIZATION')
    
    def get_locations(self, text):
        """Extract locations (GPE + LOCATION)"""
        gpe = self.extract_by_type(text, 'GPE')
        loc = self.extract_by_type(text, 'LOCATION')
        return gpe + loc
    
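    def summary(self, text):
        """Get summary of all entities (runs the NER pipeline only once)"""
        entities = self.extract_all(text)

        def of_type(target_type):
            return [e for e, t in entities if t == target_type]

        return {
            'people': of_type('PERSON'),
            'organizations': of_type('ORGANIZATION'),
            'locations': of_type('GPE') + of_type('LOCATION'),
        }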

In [45]:
# Use the NER class
ner = NERExtractor()

text = """Elon Musk is the CEO of Tesla and SpaceX.
Tesla is headquartered in Austin, Texas.
SpaceX launches rockets from Cape Canaveral, Florida.
Mark Zuckerberg runs Meta in Menlo Park, California."""

print(f"Text:\n{text}\n")

summary = ner.summary(text)

print("Entity Summary:")
print("=" * 50)
for category, entities in summary.items():
    print(f"\n{category.upper()}:")
    for entity in entities:
        print(f"  • {entity}")

Text:
Elon Musk is the CEO of Tesla and SpaceX.
Tesla is headquartered in Austin, Texas.
SpaceX launches rockets from Cape Canaveral, Florida.
Mark Zuckerberg runs Meta in Menlo Park, California.

Entity Summary:

PEOPLE:
  • Elon
  • Tesla
  • Mark Zuckerberg

ORGANIZATIONS:
  • Musk
  • CEO of Tesla
  • SpaceX
  • SpaceX
  • Meta

LOCATIONS:
  • Austin
  • Texas
  • Cape
  • Florida
  • Menlo Park
  • California


## 8.10 Practical Application: News Article Analysis

In [46]:
def analyze_article(text):
    """Analyze a news article for entities"""
    ner = NERExtractor()
    
    # Get all entities
    all_entities = ner.extract_all(text)
    
    # Count by type
    type_counts = Counter(t for e, t in all_entities)
    
    # Most mentioned entities
    entity_counts = Counter(e for e, t in all_entities)
    
    return {
        'total_entities': len(all_entities),
        'unique_entities': len(set(e for e, t in all_entities)),
        'type_distribution': dict(type_counts),
        'top_entities': entity_counts.most_common(5),
        'summary': ner.summary(text)
    }

In [47]:
article = """Technology giants Apple, Google, and Microsoft reported strong quarterly earnings.
Apple CEO Tim Cook announced record iPhone sales in China and Europe.
Google's Sundar Pichai highlighted growth in cloud computing services.
Microsoft's Satya Nadella discussed the company's AI investments.
Wall Street analysts predict continued growth for these Silicon Valley companies.
Apple's headquarters in Cupertino and Google's campus in Mountain View remain innovation hubs."""

print("NEWS ARTICLE ANALYSIS")
print("=" * 60)
print(f"\n{article}\n")
print("=" * 60)

analysis = analyze_article(article)

print(f"\nTotal entities found: {analysis['total_entities']}")
print(f"Unique entities: {analysis['unique_entities']}")

print(f"\nEntity Type Distribution:")
for etype, count in analysis['type_distribution'].items():
    print(f"  {etype}: {count}")

print(f"\nTop Mentioned Entities:")
for entity, count in analysis['top_entities']:
    print(f"  {entity}: {count}x")

NEWS ARTICLE ANALYSIS

Technology giants Apple, Google, and Microsoft reported strong quarterly earnings.
Apple CEO Tim Cook announced record iPhone sales in China and Europe.
Google's Sundar Pichai highlighted growth in cloud computing services.
Microsoft's Satya Nadella discussed the company's AI investments.
Wall Street analysts predict continued growth for these Silicon Valley companies.
Apple's headquarters in Cupertino and Google's campus in Mountain View remain innovation hubs.


Total entities found: 18
Unique entities: 14

Entity Type Distribution:
  GPE: 6
  PERSON: 9
  ORGANIZATION: 3

Top Mentioned Entities:
  Google: 3x
  Apple: 2x
  Microsoft: 2x
  Technology: 1x
  Apple CEO Tim Cook: 1x


## Summary

| Method | Description |
|--------|-------------|
| `ne_chunk(tagged)` | Perform NER on POS-tagged tokens |
| `ne_chunk(tagged, binary=True)` | Binary NER (entity or not) |
| `tree.label()` | Get entity type |
| `tree.leaves()` | Get words in entity |

### NER Pipeline
```python
tokens = word_tokenize(text)     # 1. Tokenize
tagged = pos_tag(tokens)         # 2. POS Tag
entities = ne_chunk(tagged)      # 3. NE Chunk
```

### Entity Types
- **PERSON**: People
- **ORGANIZATION**: Companies, institutions
- **GPE**: Countries, cities, states
- **LOCATION**: Non-GPE locations
- **FACILITY**: Buildings, airports, highways
- **GSP**: Geo-Socio-Political groups