# Named Entity Recognition (NER)

This notebook demonstrates Named Entity Recognition using different approaches and tools.

## What you'll learn:
- What Named Entity Recognition is
- Using spaCy for NER
- Custom NER with machine learning
- Evaluating NER models
- Visualizing NER results
- Building a simple NER system

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import re

# Try to import spaCy (if available)
try:
    import spacy
    from spacy import displacy
    SPACY_AVAILABLE = True
    print("spaCy imported successfully!")
except ImportError:
    SPACY_AVAILABLE = False
    print("spaCy not available. Install with: pip install spacy")
    print("Then download a model: python -m spacy download en_core_web_sm")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Understanding Named Entity Recognition
NER identifies and classifies named entities in text into predefined categories.

In [None]:
# Sample texts with various named entities
sample_texts = [
    "Apple Inc. is planning to open a new store in New York City next month.",
    "Microsoft Corporation was founded by Bill Gates and Paul Allen in 1975.",
    "The Amazon rainforest covers much of the Amazon basin in South America.",
    "Google's headquarters is located in Mountain View, California.",
    "The iPhone was first released by Apple on June 29, 2007.",
    "Elon Musk is the CEO of Tesla and SpaceX companies."
]

print("Sample texts for NER analysis:")
for i, text in enumerate(sample_texts, 1):
    print(f"{i}. {text}")

print("\nCommon entity types:")
entity_types = {
    'PERSON': 'People, including fictional characters',
    'ORG': 'Organizations, companies, agencies',
    'GPE': 'Geopolitical entities (countries, cities, states)',
    'DATE': 'Dates or periods',
    'MONEY': 'Monetary values',
    'PRODUCT': 'Products, vehicles, foods, etc.',
    'EVENT': 'Named events',
    'LOC': 'Locations, mountain ranges, bodies of water'
}

for entity, description in entity_types.items():
    print(f"{entity:>8}: {description}")
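
An entity annotation is usually stored as a character span plus a label, which is the same shape the extractors later in this notebook return. The following is a minimal sketch of that representation; `EntitySpan` and `annotate` are hypothetical names introduced for illustration, and the labels are hand-assigned.

```python
from typing import NamedTuple


class EntitySpan(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def annotate(text, mentions):
    """Locate each (surface, label) pair in text and return sorted character spans."""
    spans = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            continue  # mention does not occur in the text
        spans.append(EntitySpan(start, start + len(surface), label))
    return sorted(spans)


text = "Microsoft Corporation was founded by Bill Gates and Paul Allen in 1975."
spans = annotate(text, [
    ("Microsoft Corporation", "ORG"),
    ("Bill Gates", "PERSON"),
    ("Paul Allen", "PERSON"),
    ("1975", "DATE"),
])

for span in spans:
    print(f"{text[span.start:span.end]!r:>25} -> {span.label} [{span.start}:{span.end}]")
```

Slicing the text with `start` and `end` recovers the entity string, which is why offsets are enough to highlight or evaluate entities later.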

## Using spaCy for NER
spaCy provides pre-trained pipelines such as `en_core_web_sm` that perform NER without further training once the model has been downloaded.

In [None]:
if SPACY_AVAILABLE:
    try:
        # Load English model
        nlp = spacy.load("en_core_web_sm")
        
        def extract_entities_spacy(text):
            """Extract named entities using spaCy"""
            doc = nlp(text)
            entities = []
            for ent in doc.ents:
                entities.append({
                    'text': ent.text,
                    'label': ent.label_,
                    'start': ent.start_char,
                    'end': ent.end_char,
                    'description': spacy.explain(ent.label_)
                })
            return entities
        
        # Test on sample texts
        print("spaCy NER Results:")
        print("=" * 50)
        
        for i, text in enumerate(sample_texts[:3], 1):
            print(f"\nText {i}: {text}")
            entities = extract_entities_spacy(text)
            
            if entities:
                print("Entities found:")
                for ent in entities:
                    print(f"  {ent['text']:>15} -> {ent['label']:<8} ({ent['description']})")
            else:
                print("  No entities found.")
            print("-" * 50)
        
        SPACY_MODEL_LOADED = True
        
    except OSError:
        print("spaCy model 'en_core_web_sm' not found.")
        print("Download it with: python -m spacy download en_core_web_sm")
        SPACY_MODEL_LOADED = False
else:
    print("spaCy not available. Proceeding with rule-based approach.")
    SPACY_MODEL_LOADED = False

## Rule-based NER Approach
This section builds a simple rule-based system that recognizes entities with regular expressions.

In [None]:
class SimpleNER:
    """
    A simple rule-based Named Entity Recognition system.
    """
    
    def __init__(self):
        # Define patterns for different entity types
        self.patterns = {
            'PERSON': [
                r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',  # First Last
                r'\b(?:Mr|Mrs|Ms|Dr|Prof)\. [A-Z][a-z]+\b'  # Title + Name
            ],
            'ORG': [
                r'\b[A-Z][a-z]+ (?:Inc|Corp|Corporation|LLC|Ltd|Company)\b',
                r'\b(?:Apple|Microsoft|Google|Amazon|Facebook|Tesla|SpaceX)\b'
            ],
            'GPE': [
                r'\b(?:New York|Los Angeles|Chicago|Houston|Phoenix|Philadelphia|San Antonio|San Diego|Dallas|San Jose)\b',
                r'\b(?:United States|America|China|Japan|Germany|France|Italy|Spain|Canada|Australia)\b'
            ],
            'DATE': [
                r'\b\d{1,2}/\d{1,2}/\d{4}\b',  # MM/DD/YYYY
                r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}\b',
                r'\b\d{4}\b'  # Year
            ],
            'MONEY': [
                r'\$[\d,]+(?:\.\d{2})?\b',  # $1,000.00
                r'\b\d+(?:,\d{3})* dollars?\b'
            ]
        }
    
    def extract_entities(self, text):
        """Extract entities using pattern matching"""
        entities = []
        
        for entity_type, patterns in self.patterns.items():
            for pattern in patterns:
                matches = re.finditer(pattern, text)
                for match in matches:
                    entities.append({
                        'text': match.group(),
                        'label': entity_type,
                        'start': match.start(),
                        'end': match.end(),
                        'confidence': 0.8  # Fixed placeholder score for rule matches, not a calibrated probability
                    })
        
        # Remove duplicates and sort by position
        entities = self._remove_duplicates(entities)
        entities.sort(key=lambda x: x['start'])
        
        return entities
    
    def _remove_duplicates(self, entities):
        """Drop entities that overlap one already kept (first match wins, so pattern order decides)."""
        unique_entities = []
        for entity in entities:
            # Check if this entity overlaps with any existing entity
            overlaps = False
            for existing in unique_entities:
                if (entity['start'] < existing['end'] and 
                    entity['end'] > existing['start']):
                    overlaps = True
                    break
            
            if not overlaps:
                unique_entities.append(entity)
        
        return unique_entities

# Test the simple NER system
simple_ner = SimpleNER()

print("Simple Rule-based NER Results:")
print("=" * 50)

for i, text in enumerate(sample_texts[:4], 1):
    print(f"\nText {i}: {text}")
    entities = simple_ner.extract_entities(text)
    
    if entities:
        print("Entities found:")
        for ent in entities:
            print(f"  {ent['text']:>20} -> {ent['label']:<8} (confidence: {ent['confidence']})")
    else:
        print("  No entities found.")
    print("-" * 50)
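
Because `SimpleNER` keeps the first match among overlapping candidates, the generic two-capitalized-words `PERSON` pattern can claim spans such as "Apple Inc" or "New York" before the more specific patterns are tried. One possible alternative, sketched below, is to keep the longest span and break ties with a label priority. `resolve_overlaps`, `candidate`, and the priority order are assumptions for illustration, not part of the class above.

```python
def resolve_overlaps(candidates, label_priority=("ORG", "GPE", "DATE", "MONEY", "PERSON")):
    """Keep the longest span among overlapping candidates; break ties by label priority."""
    rank = {label: i for i, label in enumerate(label_priority)}
    ordered = sorted(
        candidates,
        key=lambda e: (-(e["end"] - e["start"]), rank.get(e["label"], len(rank)), e["start"]),
    )
    kept = []
    for cand in ordered:
        # Accept the candidate only if it does not overlap any span already kept
        if all(cand["end"] <= k["start"] or cand["start"] >= k["end"] for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda e: e["start"])


text = "Apple Inc. is planning to open a new store in New York City next month."


def candidate(surface, label):
    """Hypothetical helper: build a candidate dict by locating surface in text."""
    start = text.index(surface)
    return {"text": surface, "label": label, "start": start, "end": start + len(surface)}


candidates = [
    candidate("Apple Inc", "PERSON"),  # generic capitalized-pair pattern
    candidate("Apple Inc", "ORG"),     # company-suffix pattern
    candidate("Apple", "ORG"),         # company-name list
    candidate("New York", "PERSON"),
    candidate("New York", "GPE"),
]

for ent in resolve_overlaps(candidates):
    print(f"{ent['text']:>12} -> {ent['label']}")
```

Placing `PERSON` last in the priority reflects the fact that its pattern is the least specific one here; a different rule set may call for a different order.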

## Entity Analysis and Statistics

In [None]:
# Analyze entities across all texts
all_entities = []
entity_counts = Counter()
entity_types_count = Counter()

for text in sample_texts:
    entities = simple_ner.extract_entities(text)
    all_entities.extend(entities)
    
    for entity in entities:
        entity_counts[entity['text']] += 1
        entity_types_count[entity['label']] += 1

print("Entity Statistics:")
print("=" * 30)
print(f"Total entities found: {len(all_entities)}")
print(f"Unique entities: {len(entity_counts)}")

print("\nEntity types distribution:")
for entity_type, count in entity_types_count.most_common():
    print(f"{entity_type:>8}: {count}")

print("\nMost frequent entities:")
for entity, count in entity_counts.most_common(10):
    if count > 1:
        print(f"{entity:>15}: {count}")

In [None]:
# Visualize entity distribution
if entity_types_count:
    plt.figure(figsize=(12, 6))
    
    # Entity types distribution
    plt.subplot(1, 2, 1)
    type_labels = list(entity_types_count.keys())  # separate name: keeps the entity_types description dict intact
    counts = list(entity_types_count.values())
    
    plt.bar(type_labels, counts, color='skyblue', alpha=0.7)
    plt.title('Entity Types Distribution')
    plt.xlabel('Entity Type')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    
    # Add value labels on bars
    for i, count in enumerate(counts):
        plt.text(i, count + 0.1, str(count), ha='center', va='bottom')
    
    # Entity length distribution
    plt.subplot(1, 2, 2)
    entity_lengths = [len(entity['text'].split()) for entity in all_entities]
    plt.hist(entity_lengths, bins=range(1, max(entity_lengths) + 2), 
             color='lightcoral', alpha=0.7, edgecolor='black')
    plt.title('Entity Length Distribution (Words)')
    plt.xlabel('Number of Words')
    plt.ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
else:
    print("No entities found for visualization.")

## Building a Simple ML-based NER
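This section trains a basic machine learning classifier that decides whether an isolated word or phrase is an entity; it does not predict the entity type.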

In [None]:
# Create training data for a simple binary classifier (is entity or not)
def create_training_data():
    """
    Create simple training data for entity detection.
    """
    # Known entities and their types
    entities = {
        'Apple': 'ORG',
        'Microsoft': 'ORG',
        'Google': 'ORG',
        'Barack Obama': 'PERSON',
        'Bill Gates': 'PERSON',
        'New York': 'GPE',
        'California': 'GPE',
        'United States': 'GPE',
        '2009': 'DATE',
        '1975': 'DATE'
    }
    
    # Non-entities
    non_entities = [
        'the', 'is', 'was', 'planning', 'open', 'store', 'month',
        'President', 'founded', 'headquarters', 'located', 'released'
    ]
    
    # Create features and labels
    features = []
    labels = []
    
    # Add entity examples
    for entity in entities.keys():
        features.append(entity)
        labels.append(1)  # Is entity
    
    # Add non-entity examples
    for word in non_entities:
        features.append(word)
        labels.append(0)  # Not entity
    
    return features, labels

# Create feature extractor
def extract_word_features(word):
    """
    Extract features for a word that might help identify entities.
    """
    features = {
        'word': word.lower(),
        'is_capitalized': word[0].isupper() if word else False,
        'is_all_caps': word.isupper(),
        'length': len(word),
        'has_digit': any(char.isdigit() for char in word),
        'starts_with_capital': word[0].isupper() if word else False
    }
    return features

# Get training data
train_words, train_labels = create_training_data()

# Extract features
feature_dicts = [extract_word_features(word) for word in train_words]

# Convert to format suitable for scikit-learn
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(feature_dicts)
y_train = train_labels

# Train a simple classifier
classifier = LogisticRegression(random_state=42)
classifier.fit(X_train, y_train)

print("Simple ML-based Entity Detector trained!")
print(f"Training accuracy: {classifier.score(X_train, y_train):.3f}")
print(f"Number of features: {X_train.shape[1]}")

# Test the classifier
test_words = ['Apple', 'the', 'Obama', 'store', 'Tesla', 'building']

print("\nTesting ML Entity Detector:")
for word in test_words:
    word_features = extract_word_features(word)
    X_test = vectorizer.transform([word_features])
    prediction = classifier.predict(X_test)[0]
    probability = classifier.predict_proba(X_test)[0][1]  # Probability of being entity
    
    result = "ENTITY" if prediction == 1 else "NOT ENTITY"
    print(f"{word:>10}: {result:<12} (confidence: {probability:.3f})")
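
The detector above scores each word in isolation, so it cannot use cues such as "founded by" appearing before a name. A common extension is to add features from neighboring tokens. The sketch below assumes whitespace tokenization; `extract_context_features` and its feature names are hypothetical.

```python
def extract_context_features(tokens, i):
    """Build a feature dict for tokens[i] that also describes its left and right neighbors."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.has_digit": any(ch.isdigit() for ch in word),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    if i > 0:
        prev_word = tokens[i - 1]
        features["prev.lower"] = prev_word.lower()
        features["prev.istitle"] = prev_word.istitle()
    if i < len(tokens) - 1:
        next_word = tokens[i + 1]
        features["next.lower"] = next_word.lower()
        features["next.istitle"] = next_word.istitle()
    return features


tokens = "Microsoft Corporation was founded by Bill Gates".split()
sentence_features = [extract_context_features(tokens, i) for i in range(len(tokens))]

for token, feats in zip(tokens, sentence_features):
    print(f"{token:>12}: prev={feats.get('prev.lower')}, next={feats.get('next.lower')}")
```

Dictionaries of this shape can be passed to `DictVectorizer` in the same way as the output of `extract_word_features`.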

## Text Visualization with Entities

In [None]:
def visualize_entities_text(text, entities):
    """
    Create a simple text visualization with highlighted entities.
    """
    # Sort entities by start position
    entities = sorted(entities, key=lambda x: x['start'])
    
    # Create highlighted text
    highlighted_text = ""
    last_end = 0
    
    for entity in entities:
        # Add text before entity
        highlighted_text += text[last_end:entity['start']]
        
        # Add highlighted entity
        highlighted_text += f"**{entity['text']}** ({entity['label']})"
        
        last_end = entity['end']
    
    # Add remaining text
    highlighted_text += text[last_end:]
    
    return highlighted_text

# Visualize entities in sample texts
print("Entity Visualization:")
print("=" * 60)

for i, text in enumerate(sample_texts[:3], 1):
    entities = simple_ner.extract_entities(text)
    highlighted = visualize_entities_text(text, entities)
    
    print(f"\nText {i}:")
    print(f"Original: {text}")
    print(f"With entities: {highlighted}")
    print("-" * 60)
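
The entity dictionaries used in this notebook can also be handed to spaCy's displaCy visualizer, which accepts pre-computed spans in its manual mode and needs no loaded model for that. The sketch below converts entities to that input format and renders them only if spaCy is installed; `to_displacy_manual` and `make_entity` are hypothetical helpers, and the example entities are hand-labeled.

```python
def to_displacy_manual(text, entities, title=None):
    """Convert entity dicts with 'start', 'end', 'label' keys to displaCy's manual input format."""
    ents = [
        {"start": ent["start"], "end": ent["end"], "label": ent["label"]}
        for ent in sorted(entities, key=lambda ent: ent["start"])
    ]
    return {"text": text, "ents": ents, "title": title}


def make_entity(text, surface, label):
    """Hypothetical helper: build an entity dict by locating surface in text."""
    start = text.index(surface)
    return {"text": surface, "label": label, "start": start, "end": start + len(surface)}


example_text = "Google's headquarters is located in Mountain View, California."
example_entities = [
    make_entity(example_text, "California", "GPE"),
    make_entity(example_text, "Google", "ORG"),
]
doc_data = to_displacy_manual(example_text, example_entities)

try:
    from spacy import displacy
    # jupyter=False returns the markup as a string instead of displaying it inline
    html = displacy.render(doc_data, style="ent", manual=True, jupyter=False)
    print(html[:200])
except ImportError:
    print("spaCy is not installed; displaCy rendering skipped.")
    print(doc_data)
```

Inside a notebook, passing `jupyter=True` instead displays the highlighted text directly in the output cell.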

## Entity Extraction Pipeline

In [None]:
class NERPipeline:
    """
    Complete NER pipeline combining different approaches.
    """
    
    def __init__(self):
        self.rule_based_ner = SimpleNER()
        self.spacy_available = globals().get('SPACY_MODEL_LOADED', False)
        
        if self.spacy_available:
            self.nlp = nlp  # Use the loaded spaCy model
    
    def extract_entities(self, text, method='rule_based'):
        """
        Extract entities using the specified method, falling back to the
        rule-based extractor if spaCy is unavailable or the method is unknown.
        
        Args:
            text (str): Input text
            method (str): 'rule_based', 'spacy', or 'combined'
        
        Returns:
            list: List of entity dictionaries
        """
        if method == 'rule_based':
            return self.rule_based_ner.extract_entities(text)
        
        elif method == 'spacy' and self.spacy_available:
            return extract_entities_spacy(text)
        
        elif method == 'combined':
            # Combine rule-based and spaCy results
            entities = self.rule_based_ner.extract_entities(text)
            
            if self.spacy_available:
                spacy_entities = extract_entities_spacy(text)
                # Simple combination - add spaCy entities that don't overlap
                for spacy_ent in spacy_entities:
                    overlaps = False
                    for rule_ent in entities:
                        if (spacy_ent['start'] < rule_ent['end'] and 
                            spacy_ent['end'] > rule_ent['start']):
                            overlaps = True
                            break
                    if not overlaps:
                        entities.append(spacy_ent)
            
            return sorted(entities, key=lambda x: x['start'])
        
        else:
            return self.rule_based_ner.extract_entities(text)
    
    def analyze_text(self, text):
        """
        Comprehensive analysis of text entities.
        """
        results = {}
        
        # Extract using all available methods
        results['rule_based'] = self.extract_entities(text, 'rule_based')
        
        if self.spacy_available:
            results['spacy'] = self.extract_entities(text, 'spacy')
            results['combined'] = self.extract_entities(text, 'combined')
        
        return results

# Test the pipeline
pipeline = NERPipeline()

test_text = "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."
results = pipeline.analyze_text(test_text)

print("NER Pipeline Results:")
print("=" * 50)
print(f"Text: {test_text}")
print()

for method, entities in results.items():
    print(f"{method.upper()} Method:")
    if entities:
        for ent in entities:
            print(f"  {ent['text']:>15} -> {ent['label']}")
    else:
        print("  No entities found.")
    print()

## Interactive NER Demo

In [None]:
def interactive_ner_demo():
    """
    Interactive demo for testing NER on custom text.
    """
    print("Interactive NER Demo")
    print("=" * 30)
    print("Enter text to analyze (or 'quit' to stop):")
    
    while True:
        user_text = input("\nEnter text: ")
        
        if user_text.strip().lower() == 'quit':
            print("Thanks for using the NER demo!")
            break
        
        if user_text.strip():
            entities = pipeline.extract_entities(user_text, 'rule_based')
            
            if entities:
                print("\nEntities found:")
                for ent in entities:
                    print(f"  {ent['text']:>20} -> {ent['label']}")
                
                # Show visualization
                highlighted = visualize_entities_text(user_text, entities)
                print(f"\nHighlighted: {highlighted}")
            else:
                print("\nNo entities found.")
        else:
            print("Please enter some text.")

# For demonstration, test with predefined examples
demo_texts = [
    "Elon Musk announced that Tesla will open a new factory in Austin, Texas.",
    "The meeting is scheduled for January 15, 2024 at Microsoft headquarters.",
    "Amazon reported revenue of $469 billion in 2021."
]

print("Demo NER Results:")
print("=" * 40)

for i, text in enumerate(demo_texts, 1):
    print(f"\nExample {i}: {text}")
    entities = pipeline.extract_entities(text, 'rule_based')
    
    if entities:
        print("Entities:")
        for ent in entities:
            print(f"  {ent['text']:>20} -> {ent['label']}")
        
        highlighted = visualize_entities_text(text, entities)
        print(f"Highlighted: {highlighted}")
    else:
        print("No entities found.")
    print("-" * 40)

# Uncomment to run interactive demo
# interactive_ner_demo()

## Performance Evaluation

In [None]:
# Simple evaluation of our NER system
def evaluate_ner_simple():
    """
    Simple evaluation based on known entities in our test sentences.
    """
    # Ground truth entities, keyed by index into sample_texts
    ground_truth = {
        0: [('Apple Inc.', 'ORG'), ('New York City', 'GPE')],
        1: [('Microsoft Corporation', 'ORG'), ('Bill Gates', 'PERSON'), ('Paul Allen', 'PERSON'), ('1975', 'DATE')],
        5: [('Elon Musk', 'PERSON'), ('Tesla', 'ORG'), ('SpaceX', 'ORG')]
    }
    
    total_true = 0
    total_predicted = 0
    correct_predictions = 0
    
    for idx, true_entities in ground_truth.items():
        text = sample_texts[idx]
        predicted_entities = pipeline.extract_entities(text, 'rule_based')
        
        # Convert to comparable format
        true_set = set(true_entities)
        pred_set = {(ent['text'], ent['label']) for ent in predicted_entities}
        
        total_true += len(true_set)
        total_predicted += len(pred_set)
        correct_predictions += len(true_set & pred_set)
        
        print(f"Text {idx + 1}:")
        print(f"  True: {true_set}")
        print(f"  Predicted: {pred_set}")
        print(f"  Correct: {true_set & pred_set}")
        print()
    
    # Calculate metrics
    precision = correct_predictions / total_predicted if total_predicted > 0 else 0
    recall = correct_predictions / total_true if total_true > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    print("Evaluation Results:")
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"F1-Score: {f1:.3f}")
    
    return precision, recall, f1

# Run evaluation
precision, recall, f1 = evaluate_ner_simple()
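
Overall precision and recall can hide the fact that one entity type is handled well and another poorly. The sketch below breaks the same exact-match comparison down by label. `per_label_scores` is a hypothetical helper, and the `gold` and `pred` lists are hand-written illustrations rather than output captured from the pipeline.

```python
from collections import Counter


def per_label_scores(true_entities, predicted_entities):
    """Exact-match precision, recall and F1 per label for (text, label) pairs."""
    true_set, pred_set = set(true_entities), set(predicted_entities)
    tp = Counter(label for _, label in true_set & pred_set)  # correct predictions
    fp = Counter(label for _, label in pred_set - true_set)  # predicted but wrong
    fn = Counter(label for _, label in true_set - pred_set)  # missed gold entities

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores


gold = [("Microsoft Corporation", "ORG"), ("Bill Gates", "PERSON"),
        ("Paul Allen", "PERSON"), ("1975", "DATE")]
# Hypothetical predictions in which the company name is mislabeled as a person
pred = [("Microsoft Corporation", "PERSON"), ("Bill Gates", "PERSON"),
        ("Paul Allen", "PERSON"), ("1975", "DATE")]

scores = per_label_scores(gold, pred)
for label, s in scores.items():
    print(f"{label:>8}: P={s['precision']:.3f} R={s['recall']:.3f} F1={s['f1']:.3f}")
```

A single mislabeled span counts twice here: as a false positive for the predicted label and as a false negative for the gold label.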

## Key Takeaways

1. **NER identifies and classifies** named entities in text into predefined categories
2. **Rule-based approaches** use patterns and dictionaries, good for specific domains
3. **Machine learning approaches** can generalize better but need training data
4. **Pre-trained models** like spaCy work well out-of-the-box for general domains
5. **Evaluation is important** - use precision, recall, and F1-score
6. **Combining approaches** can improve overall performance

## Next Steps

- Try more sophisticated ML models (CRF, BiLSTM-CRF)
- Use transformer-based models (BERT for NER)
- Create domain-specific NER systems
- Handle nested and overlapping entities
- Build entity linking systems (connecting entities to knowledge bases)