# NLTK vs SpaCy: POS Tagging and NER Comparison Tutorial

This notebook provides a hands-on comparison between NLTK and SpaCy for:
- **Part-of-Speech (POS) Tagging**
- **Named Entity Recognition (NER)**

We'll run NLTK's default perceptron tagger, an HMM tagger trained on the Treebank sample, and SpaCy's `en_core_web_sm` pipeline on the same text, then compare their output, processing time, and ease of use.

## 1. Installation and Setup

First, let's install the required libraries and download necessary data.

In [None]:
# Installation commands (uncomment if needed)
# !pip install nltk spacy pandas tabulate
# !python -m spacy download en_core_web_sm

In [None]:
import nltk
import spacy
import pandas as pd
from tabulate import tabulate
import time
from collections import Counter

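# Download required NLTK data. Several resources were renamed in NLTK 3.9
# (e.g. 'punkt' -> 'punkt_tab'), so request both the old and new names;
# fetching a resource your NLTK version does not use is harmless.
for resource in [
    'punkt', 'punkt_tab',
    'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
    'maxent_ne_chunker', 'maxent_ne_chunker_tab',
    'words', 'treebank', 'universal_tagset',
]:
    nltk.download(resource)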

# Load SpaCy model
nlp = spacy.load('en_core_web_sm')

print("✓ Setup complete!")

## 2. Sample Text for Analysis

We'll use a sample text containing various entities and grammatical structures.

In [None]:
sample_text = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. 
The company is headquartered in Cupertino, California. In 2023, Apple released the iPhone 15
which quickly became popular worldwide. Microsoft and Google are major competitors in the 
technology industry. Tim Cook became CEO in 2011 and has led the company to unprecedented growth.
"""

print("Sample Text:")
print(sample_text)

## 3. Part-of-Speech (POS) Tagging

### 3.1 NLTK POS Tagging (Default Tagger)

In [None]:
# Tokenize and tag with NLTK's default tagger
tokens_nltk = nltk.word_tokenize(sample_text)

# perf_counter() is the recommended clock for short intervals; note that the
# first call also loads the tagger model from disk
start_time = time.perf_counter()
pos_tags_nltk = nltk.pos_tag(tokens_nltk)
nltk_default_time = time.perf_counter() - start_time

print("NLTK POS Tagging (Default Tagger):")
print("="*60)
for word, tag in pos_tags_nltk[:20]:
    print(f"{word:15} -> {tag}")
    
print(f"\nProcessing time: {nltk_default_time:.4f} seconds")

### 3.2 NLTK POS Tagging with HMM Tagger

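Let's train an HMM (Hidden Markov Model) tagger on the Penn Treebank sample that ships with NLTK (roughly 3,900 tagged sentences, of which we use the first 3,000 for training).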

In [None]:
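from nltk.tag import hmm
from nltk.corpus import treebank
from nltk.probability import LidstoneProbDist

# Load training data from Treebank corpus
print("Training HMM tagger on Treebank corpus...")
train_data = treebank.tagged_sents()[:3000]

# Train HMM tagger. The default maximum-likelihood estimator assigns zero
# probability to words not seen in training (e.g. "Wozniak"), which derails
# decoding, so we use Lidstone smoothing instead.
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(
    train_data, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)
)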

print("✓ HMM tagger trained successfully!")

# Tag with HMM tagger
start_time = time.perf_counter()
pos_tags_hmm = hmm_tagger.tag(tokens_nltk)
nltk_hmm_time = time.perf_counter() - start_time

print("\nNLTK POS Tagging (HMM Tagger):")
print("="*60)
for word, tag in pos_tags_hmm[:20]:
    print(f"{word:15} -> {tag}")
    
print(f"\nProcessing time: {nltk_hmm_time:.4f} seconds")
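
To make the HMM tagger less of a black box, here is a minimal sketch of the Viterbi decoding step that such a tagger performs when it picks the most probable tag sequence. The two-tag model, its probabilities, the `UNSEEN` floor, and the function name `viterbi` are toy assumptions for illustration; NLTK's trained tagger estimates its tables from the Treebank sentences and has its own implementation.

```python
# Toy HMM: all tags, probabilities and names here are illustrative assumptions.
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.7, "VERB": 0.3}                        # P(tag at sentence start)
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},              # P(next tag | tag)
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},  # P(word | tag)
        "VERB": {"dogs": 0.1, "bark": 0.6, "chase": 0.3}}
UNSEEN = 1e-6  # small floor so unknown words do not zero out every path


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []
    # best[i][tag] = (probability of the best path ending in tag at position i, previous tag)
    best = [{t: (START[t] * EMIT[t].get(words[0], UNSEEN), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            column[tag] = max(
                (best[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, UNSEEN), prev)
                for prev in TAGS
            )
        best.append(column)

    # Backtrack from the most probable final tag
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["dogs", "bark"]))
```

The `UNSEEN` floor plays the same role as smoothing in a trained tagger: without it, a single unknown word would give every candidate path a probability of zero.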

### 3.3 SpaCy POS Tagging

In [None]:
# Process with SpaCy (this runs the full pipeline: tagger, parser, NER, ...)
start_time = time.perf_counter()
doc_spacy = nlp(sample_text)
spacy_time = time.perf_counter() - start_time

print("SpaCy POS Tagging:")
print("="*60)
# Skip whitespace tokens (the sample text contains newlines), then show the first 20
spacy_tokens = [token for token in doc_spacy if not token.is_space]
for token in spacy_tokens[:20]:
    print(f"{token.text:15} -> {token.pos_:8} (Fine: {token.tag_})")

print(f"\nProcessing time: {spacy_time:.4f} seconds")

### 3.4 POS Tagging Comparison Table

In [None]:
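# Create comparison for first 15 tokens
comparison_data = []

# NLTK and SpaCy tokenize differently, so align by text while moving a cursor
# forward; this keeps repeated words (e.g. the second "Steve") matched to the
# right occurrence instead of always hitting the first one.
spacy_words = [t for t in doc_spacy if not t.is_space]
cursor = 0

for i, token in enumerate(tokens_nltk[:15]):
    spacy_pos, spacy_fine = "N/A", "N/A"
    # Look only a few tokens ahead so a mismatch cannot jump far down the text
    for j in range(cursor, min(cursor + 5, len(spacy_words))):
        if spacy_words[j].text == token:
            spacy_pos, spacy_fine = spacy_words[j].pos_, spacy_words[j].tag_
            cursor = j + 1
            break

    comparison_data.append({
        'Token': token,
        'NLTK Default': pos_tags_nltk[i][1],
        'NLTK HMM': pos_tags_hmm[i][1],
        # tag_ is Penn Treebank style like the NLTK columns; pos_ is the coarse universal tag
        'SpaCy (fine)': spacy_fine,
        'SpaCy (coarse)': spacy_pos
    })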

df_pos_comparison = pd.DataFrame(comparison_data)
print("\nPOS Tagging Comparison:")
print(tabulate(df_pos_comparison, headers='keys', tablefmt='grid', showindex=False))
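
NLTK reports Penn Treebank tags while SpaCy's `pos_` attribute uses universal tags, so the columns are easier to compare after mapping both onto a shared coarse tagset. The sketch below does this with small hand-written mappings. The mappings are deliberately partial, the helper names are hypothetical, and the example rows are hand-written rather than actual tagger output; for a fuller Penn Treebank mapping, NLTK provides `nltk.tag.map_tag` together with the `universal_tagset` resource.

```python
# Hypothetical, deliberately partial mappings onto a tiny shared tagset.
PTB_PREFIX_TO_COARSE = {"NN": "NOUN", "VB": "VERB", "MD": "VERB", "JJ": "ADJ",
                        "RB": "ADV", "IN": "ADP", "DT": "DET", "CD": "NUM",
                        "CC": "CONJ", "PR": "PRON", "WP": "PRON"}
UPOS_TO_COARSE = {"NOUN": "NOUN", "PROPN": "NOUN", "VERB": "VERB", "AUX": "VERB",
                  "ADJ": "ADJ", "ADV": "ADV", "ADP": "ADP", "DET": "DET",
                  "NUM": "NUM", "CCONJ": "CONJ", "PRON": "PRON", "PUNCT": "PUNCT"}


def coarse_from_ptb(tag):
    """Map a Penn Treebank tag (e.g. 'NNP', 'VBD') to a coarse category."""
    if tag and not tag[0].isalnum():   # punctuation tags are the symbols themselves
        return "PUNCT"
    return PTB_PREFIX_TO_COARSE.get(tag[:2], "OTHER")


def coarse_from_upos(tag):
    """Map a universal POS tag (e.g. 'PROPN', 'AUX') to a coarse category."""
    return UPOS_TO_COARSE.get(tag, "OTHER")


def agreement(rows):
    """Fraction of (token, ptb_tag, upos_tag) rows whose coarse categories match."""
    if not rows:
        return 0.0
    matches = sum(coarse_from_ptb(ptb) == coarse_from_upos(upos) for _, ptb, upos in rows)
    return matches / len(rows)


# Hand-written example rows (not real tagger output)
rows = [("Apple", "NNP", "PROPN"), ("was", "VBD", "AUX"), ("founded", "VBN", "VERB"),
        ("by", "IN", "ADP"), (",", ",", "PUNCT"), ("popular", "NN", "ADJ")]
print(f"Coarse agreement: {agreement(rows):.2f}")
```

In the notebook, the rows could be built from `pos_tags_nltk` and the `pos_` values of the SpaCy tokens once the two tokenizations are aligned.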

## 4. Named Entity Recognition (NER)

### 4.1 NLTK NER

In [None]:
from nltk import ne_chunk

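# Perform NER with NLTK (ne_chunk needs POS-tagged input, so the timing below
# excludes tokenization and tagging)
start_time = time.perf_counter()
ne_tree = ne_chunk(pos_tags_nltk)
nltk_ner_time = time.perf_counter() - start_time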

print("NLTK Named Entities:")
print("="*60)

nltk_entities = []
for subtree in ne_tree:
    if hasattr(subtree, 'label'):
        entity_name = ' '.join([word for word, tag in subtree.leaves()])
        entity_type = subtree.label()
        nltk_entities.append((entity_name, entity_type))
        print(f"{entity_name:30} -> {entity_type}")

print(f"\nTotal entities found: {len(nltk_entities)}")
print(f"Processing time: {nltk_ner_time:.4f} seconds")
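
The chunk tree can also be flattened into IOB tags (`B-` begins an entity, `I-` continues it, `O` is outside), which is the form `nltk.chunk.tree2conlltags` produces and which is often easier to line up against another system's output. As a rough sketch, the hypothetical helper `iob_to_entities` below groups such tags back into entity spans; the tagged tokens are hand-written for illustration, not output from the cell above.

```python
def iob_to_entities(tagged):
    """Group (word, iob_tag) pairs such as ("Steve", "B-PERSON") into (text, label) spans."""
    entities, words, label = [], [], None
    for word, iob in tagged:
        if iob.startswith("B-") or (iob.startswith("I-") and iob[2:] != label):
            # A new entity starts (a stray or mismatched I- tag is treated like B-)
            if words:
                entities.append((" ".join(words), label))
            words, label = [word], iob[2:]
        elif iob.startswith("I-"):
            words.append(word)          # continue the current entity
        else:
            # "O": outside any entity, so close the current one if there is one
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written example (not real chunker output)
example = [("Steve", "B-PERSON"), ("Jobs", "I-PERSON"), ("founded", "O"),
           ("Apple", "B-ORGANIZATION"), ("in", "O"), ("Cupertino", "B-GPE")]
print(iob_to_entities(example))
```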

### 4.2 SpaCy NER

In [None]:
# Extract entities from SpaCy
print("SpaCy Named Entities:")
print("="*60)

spacy_entities = []
for ent in doc_spacy.ents:
    spacy_entities.append((ent.text, ent.label_))
    print(f"{ent.text:30} -> {ent.label_:12} ({spacy.explain(ent.label_)})")

print(f"\nTotal entities found: {len(spacy_entities)}")

### 4.3 Visualize SpaCy Entities

In [None]:
from spacy import displacy

# Visualize entities (works in Jupyter)
print("SpaCy Entity Visualization:")
displacy.render(doc_spacy, style='ent', jupyter=True)

### 4.4 NER Comparison Table

In [None]:
# Create NER comparison
ner_comparison_data = []

# Add NLTK entities
for entity, entity_type in nltk_entities:
    ner_comparison_data.append({
        'Entity': entity,
        'NLTK Type': entity_type,
        'SpaCy Type': 'Not Found'
    })

# Match with SpaCy entities
for entity, entity_type in spacy_entities:
    found = False
    for item in ner_comparison_data:
        if entity.lower() in item['Entity'].lower() or item['Entity'].lower() in entity.lower():
            item['SpaCy Type'] = entity_type
            found = True
            break
    
    if not found:
        ner_comparison_data.append({
            'Entity': entity,
            'NLTK Type': 'Not Found',
            'SpaCy Type': entity_type
        })

df_ner_comparison = pd.DataFrame(ner_comparison_data)
print("\nNER Comparison:")
print(tabulate(df_ner_comparison, headers='keys', tablefmt='grid', showindex=False))
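
The substring test used above is forgiving: a short entity can pair with any longer entity that happens to contain it. A stricter alternative, sketched here under the assumption that exact matching on normalized text is acceptable, is to compare the two entity lists as sets. The helper names and the example lists are hypothetical and hand-written, not results from the cells above.

```python
def normalize(text):
    """Lowercase and collapse whitespace so 'Steve  Jobs' and 'steve jobs' compare equal."""
    return " ".join(text.lower().split())


def compare_entities(entities_a, entities_b):
    """Split two lists of (text, label) pairs into shared and system-specific entity texts."""
    texts_a = {normalize(text) for text, _ in entities_a}
    texts_b = {normalize(text) for text, _ in entities_b}
    return {
        "both": sorted(texts_a & texts_b),
        "only_a": sorted(texts_a - texts_b),
        "only_b": sorted(texts_b - texts_a),
    }


# Hand-written example lists (not real NER output)
system_a = [("Steve Jobs", "PERSON"), ("Cupertino", "GPE"), ("Apple", "GPE")]
system_b = [("Steve  Jobs", "PERSON"), ("Cupertino", "GPE"), ("April 1976", "DATE")]

for group, texts in compare_entities(system_a, system_b).items():
    print(f"{group:8} {texts}")
```

Passing `nltk_entities` and `spacy_entities` to `compare_entities` would give a quick view of where the two libraries agree on entity boundaries, independent of their differing label names.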

## 5. Performance and Feature Comparison

### 5.1 Processing Speed Comparison

In [None]:
speed_comparison = [
    ['Task', 'NLTK Default', 'NLTK HMM', 'SpaCy'],
    ['POS Tagging', f'{nltk_default_time:.4f}s', f'{nltk_hmm_time:.4f}s', f'{spacy_time:.4f}s (full pipeline)'],
    ['NER', f'{nltk_ner_time:.4f}s', 'N/A', 'Included above']
]

print("\nProcessing Speed Comparison:")
print(tabulate(speed_comparison, headers='firstrow', tablefmt='grid'))
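
A single timed call is noisy and, on the first run, includes one-off costs such as loading a model from disk. A more stable comparison repeats each call and reports the median. The `benchmark` helper below is a hypothetical sketch of that idea, demonstrated on a stand-in workload so it runs on its own.

```python
import statistics
import time


def benchmark(func, *args, repeats=5, warmup=1):
    """Return the median wall-clock time in seconds of func(*args) over several runs."""
    for _ in range(warmup):
        func(*args)                  # warm-up runs absorb one-off costs such as model loading
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch is self-contained
elapsed = benchmark(str.split, "Apple Inc. was founded by Steve Jobs " * 1000)
print(f"Median time: {elapsed:.6f}s")
```

In this notebook the same helper could be called as `benchmark(nltk.pos_tag, tokens_nltk)`, `benchmark(hmm_tagger.tag, tokens_nltk)`, and `benchmark(nlp, sample_text)`, keeping in mind that the SpaCy call runs its whole pipeline, not just the tagger.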

### 5.2 Comprehensive Feature Comparison

In [None]:
feature_comparison = [
    ['Feature', 'NLTK', 'SpaCy'],
    ['POS Tagging Approach', 'Statistical (averaged perceptron, HMM)', 'Neural network (CNN)'],
    ['Tagset', 'Penn Treebank', 'Universal POS (pos_) + Penn Treebank-style (tag_)'],
    ['NER Approach', 'Statistical (MaxEnt chunker)', 'Neural network (CNN)'],
    ['Entity Types (Default)', 'Few (mainly PERSON, ORGANIZATION, GPE)', '18 types (OntoNotes)'],
    ['Processing Speed', 'Moderate (HMM slower)', 'Fast (batch processing via nlp.pipe)'],
    ['Setup Complexity', 'Multiple downloads required', 'Single model download'],
    ['Dependency Parsing', 'Limited support', 'Built-in, robust'],
    ['Lemmatization', 'WordNet lookup (POS must be supplied)', 'Rule-based, uses predicted POS'],
    ['Pipeline Architecture', 'Manual chaining', 'Integrated pipeline'],
    ['Customization', 'High (train own models)', 'Medium (update or retrain pipeline components)'],
    ['Memory Usage', 'Low', 'Higher (neural models)'],
    ['Best For', 'Learning, research, custom models', 'Production, accuracy, speed']
]

print("\nComprehensive Feature Comparison:")
print(tabulate(feature_comparison, headers='firstrow', tablefmt='grid'))

## 6. Key Observations and Recommendations

### NLTK Strengths:
- **Educational**: Excellent for learning NLP concepts
- **Customizable**: Easy to train custom models (like HMM tagger)
- **Lightweight**: Lower memory footprint
- **Flexible**: More control over individual components

### NLTK Weaknesses:
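- **Accuracy**: The pretrained tagger and NE chunker are older statistical models and tend to be less accurate than SpaCy's pipeline, especially on text unlike their training data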
- **Speed**: HMM tagger can be slower
- **Limited NER**: Fewer entity types, less sophisticated
- **Manual Pipeline**: Requires manual setup of processing steps

### SpaCy Strengths:
- **Accuracy**: Neural models that perform well out of the box; the larger pipelines (e.g. `en_core_web_trf`) trade speed for higher accuracy
- **Speed**: Highly optimized for production
- **Complete Pipeline**: All-in-one processing
- **Rich Entities**: More entity types and better recognition
- **Visualization**: Built-in visualization tools

### SpaCy Weaknesses:
- **Memory**: Higher memory usage
- **Black Box**: Less transparent model internals
- **Less Flexible**: Harder to modify core algorithms

### Recommendations:
- **Use NLTK when**: Learning NLP, building custom models, working with limited resources, need maximum control
- **Use SpaCy when**: Building production systems, need high accuracy, processing large volumes, want quick setup

## 7. Conclusion

This notebook demonstrated the practical differences between NLTK and SpaCy for POS tagging and NER:

1. **NLTK** offers multiple tagging approaches including trainable HMM models, making it ideal for educational purposes and custom solutions.

2. **SpaCy** runs tagging, parsing, and NER in a single optimized neural pipeline and recognizes many more entity types, making it better suited for production applications.

3. **HMM Tagger** in NLTK showcases a classical statistical approach: it is quick to train on your own tagged data and easy to inspect, but it is sensitive to words it has not seen in training.

The choice between NLTK and SpaCy depends on your specific needs:
- **Learning and experimentation**: NLTK
- **Production systems**: SpaCy
- **Custom models**: NLTK
- **Quick deployment**: SpaCy

Both libraries have their place in the NLP toolkit, and understanding both makes you a more versatile NLP practitioner!