# Exercise 5: Named Entity Recognition

Welcome to Named Entity Recognition! You'll learn how to automatically identify and classify important entities in German text.

## Learning Objectives
By the end of this exercise, you will be able to:
1. **Entity Types**: Identify the main types of named entities (person, location, organization, miscellaneous; labeled PER, LOC, ORG, and MISC in spaCy's German models)
2. **Multi-Model Comparison**: Compare spaCy, Flair, and transformer-based NER models
3. **German NER Challenges**: Handle German-specific issues (compound words, capitalization)
4. **Custom Entity Training**: Train NER models for domain-specific entities
5. **Entity Linking**: Connect entities to knowledge bases (Wikipedia, Wikidata)
6. **Performance Evaluation**: Assess NER model quality using precision, recall, and F1-score

## What You'll Build
- Multi-model German NER system
- Entity extraction and analysis pipeline
- Custom entity recognizer for specific domains
- Entity relationship visualization
- Knowledge base linking system

## Applications
- **Information Extraction**: Extract structured data from unstructured text
- **Document Analysis**: Analyze legal documents, news articles, research papers
- **Privacy Protection**: Identify and anonymize personal information
- **Knowledge Graphs**: Build connections between entities across documents

**Ready to find the hidden structure in text?** 🔍📄

## Exercise 1: Multi-Model German NER Comparison

**Goal**: Compare different NER approaches on German text and analyze their strengths and weaknesses.

**Your Tasks**: 
1. Extract entities using spaCy's German NER model
2. Apply Flair's German NER for comparison
3. Use transformer-based NER models
4. Evaluate and compare model performance

**Hints**:
- spaCy provides fast statistical NER and supports rule-based matching alongside it
- Flair uses contextual embeddings, which often improve accuracy at the cost of speed
- Transformer models (BERT-based) typically reach the highest accuracy but need the most compute
- German capitalization rules can help with entity detection
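
The capitalization hint comes with a caveat: German capitalizes every noun, so capitalization alone flags common nouns as well as names. The sketch below illustrates this with a hypothetical helper, `capitalized_candidates`, which uses a naive regex tokenizer (an assumption for illustration, not a real German tokenizer) and skips sentence-initial tokens.

```python
import re

def capitalized_candidates(text):
    """Return runs of capitalized tokens that do not start a sentence.

    Hypothetical helper for illustration only; tokenization is a naive regex.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    candidates, current = [], []
    sentence_start = True
    for tok in tokens:
        if tok[0].isupper() and not sentence_start:
            current.append(tok)          # extend the current capitalized run
        elif current:
            candidates.append(" ".join(current))
            current = []
        sentence_start = tok in {".", "!", "?"}
    if current:
        candidates.append(" ".join(current))
    return candidates

sentence = "Gestern besuchte Angela Merkel die Universität in Leipzig."
print(capitalized_candidates(sentence))
# ['Angela Merkel', 'Universität', 'Leipzig']
```

"Universität" is a common noun here, yet it is flagged just like the two real entities. Capitalization is therefore better treated as one feature among several than as a detector on its own.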

### Setup and Imports

In [None]:
# Quick check: try to load spaCy's small German model
import warnings
warnings.filterwarnings('ignore')

try:
    import spacy
    nlp = spacy.load("de_core_news_sm")
    print("German spaCy model loaded successfully!")
    print(f"Available entity types: {list(nlp.get_pipe('ner').labels)}")
except ImportError:
    print("Please install spaCy: pip install spacy")
    nlp = None
except OSError:
    print("Please install German spaCy model: python -m spacy download de_core_news_sm")
    nlp = None

In [None]:
# Essential imports for Named Entity Recognition
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# Try to load German spaCy model with NER
try:
    import spacy
    nlp = spacy.load("de_core_news_sm")
    print("✅ German spaCy model loaded successfully!")
    print(f"   Available entity types: {list(nlp.get_pipe('ner').labels)}")
except ImportError:
    print("❌ Please install spaCy: pip install spacy")
    nlp = None
except IOError:
    print("❌ Please install German spaCy model: python -m spacy download de_core_news_sm")
    nlp = None

# Try to load Flair for German NER (optional, more accurate)
try:
    from flair.data import Sentence
    from flair.models import SequenceTagger
    flair_tagger = SequenceTagger.load('de-ner')
    print("✅ Flair German NER model loaded!")
except ImportError:
    print("⚠️  Flair not available. Install with: pip install flair")
    flair_tagger = None
except Exception as e:
    print(f"⚠️  Flair model loading failed: {e}")
    flair_tagger = None

# Import for transformer-based NER (optional)
try:
    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
    print("✅ Transformers library available for BERT-based NER!")
except ImportError:
    print("⚠️  Transformers not available. Install with: pip install transformers")

print("\n🔍 NER Toolkit Ready!")
print("Available tools: spaCy NER, Flair NER, Transformer-based NER")
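
The setup cell imports `pipeline` from transformers but never calls it. When a token-classification pipeline is created with `aggregation_strategy="simple"`, it returns a list of dicts with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below converts that output into the entity format used by the spaCy functions in this exercise. The helper name `normalize_hf_entities` and its `min_score` parameter are hypothetical. The sample input is hand-written, not real model output, so no model is downloaded here.

```python
def normalize_hf_entities(raw_entities, min_score=0.0):
    """Convert aggregated transformers NER output to this exercise's dict format.

    Hypothetical adapter; expects dicts with the keys entity_group, score,
    word, start, and end.
    """
    entities = []
    for item in raw_entities:
        if item["score"] < min_score:
            continue  # drop low-confidence predictions
        entities.append({
            "text": item["word"],
            "label": item["entity_group"],
            "start": item["start"],
            "end": item["end"],
            "score": float(item["score"]),
        })
    return sorted(entities, key=lambda e: e["start"])

# Hand-written stand-in for pipeline output on "Angela Merkel aus Hamburg"
# (scores are illustrative, not produced by a model)
raw = [
    {"entity_group": "LOC", "score": 0.98, "word": "Hamburg", "start": 18, "end": 25},
    {"entity_group": "PER", "score": 0.99, "word": "Angela Merkel", "start": 0, "end": 13},
]

for ent in normalize_hf_entities(raw):
    print(ent["text"], "->", ent["label"])
```

Transformer pipelines expose a confidence score per entity, which the spaCy `ner` component does not. The `min_score` filter is therefore only useful on this side of the comparison.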

### Step 1: Basic NER with spaCy

In [None]:
def extract_entities_spacy(text, nlp_model):
    """
    Extract named entities using spaCy.
    
    Args:
        text (str): Input text
        nlp_model: Loaded spaCy model
    
    Returns:
        dict: Extracted entities with metadata
    """
    if nlp_model is None:
        print("spaCy model not available")
        return None
    
    # TODO: Implement NER extraction:
    # 1. Process text with spaCy
    # 2. Extract entities with labels and positions
    # 3. Group entities by type
    # 4. Calculate confidence scores if available
    # 5. Handle overlapping entities
    
    doc = nlp_model(text)
    
    entities = []
    entity_types = defaultdict(list)
    
    for ent in doc.ents:
        entity_info = {
            'text': ent.text,
            'label': ent.label_,
            'description': spacy.explain(ent.label_),
            'start': ent.start_char,
            'end': ent.end_char,
            'start_token': ent.start,
            'end_token': ent.end
        }
        
        entities.append(entity_info)
        entity_types[ent.label_].append(ent.text)
    
    return {
        'entities': entities,
        'entity_types': dict(entity_types),
        'doc': doc
    }

def display_entities(ner_result):
    """
    Display extracted entities in organized format.
    
    Args:
        ner_result (dict): Result from extract_entities_spacy
    """
    if ner_result is None:
        return
    
    entities = ner_result['entities']
    entity_types = ner_result['entity_types']
    
    print(f"Found {len(entities)} entities:")
    print("=" * 40)
    
    # Display all entities
    for entity in entities:
        print(f"'{entity['text']}' -> {entity['label']} ({entity['description']})")
        print(f"  Position: {entity['start']}-{entity['end']}")
    
    print("\nEntities by type:")
    print("-" * 30)
    
    for entity_type, entity_list in entity_types.items():
        unique_entities = list(set(entity_list))
        print(f"{entity_type} ({spacy.explain(entity_type)}):")
        print(f"  {', '.join(unique_entities)}")
        print(f"  Count: {len(entity_list)} (unique: {len(unique_entities)})")
        print()

# Sample German texts for NER
sample_texts = [
    """Angela Merkel war von 2005 bis 2021 Bundeskanzlerin von Deutschland. 
    Sie wurde am 17. Juli 1954 in Hamburg geboren und studierte Physik an der 
    Universität Leipzig. Vor ihrer politischen Laufbahn arbeitete sie als 
    Wissenschaftlerin am Zentralinstitut für Physikalische Chemie in Berlin.""",
    
    """Die BMW AG mit Hauptsitz in München ist ein deutscher Automobilhersteller. 
    Das Unternehmen wurde 1916 gegründet und beschäftigt heute über 120.000 
    Mitarbeiter weltweit. Der Umsatz betrug 2022 etwa 142,6 Milliarden Euro.""",
    
    """Am 15. September 2023 fand in der Allianz Arena in München das Spiel zwischen 
    dem FC Bayern München und Borussia Dortmund statt. Thomas Müller erzielte in der 
    78. Minute das entscheidende Tor zum 2:1-Sieg."""
]

# Extract entities from sample texts
print("Named Entity Recognition Results:")
print("=" * 50)

for i, text in enumerate(sample_texts):
    print(f"\nText {i+1}:")
    print(f"'{text[:100]}...'")
    print()
    
    ner_result = extract_entities_spacy(text, nlp)
    display_entities(ner_result)
    
    if i < len(sample_texts) - 1:
        print("\n" + "="*50)
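
The `start` and `end` character offsets returned by `extract_entities_spacy` are what make the privacy-protection application from the introduction possible. The sketch below replaces selected entity spans with placeholders. The function name `anonymize` and the hand-written entity list are assumptions for illustration; in practice the list would come from `extract_entities_spacy(text, nlp)['entities']`.

```python
def anonymize(text, entities, labels=("PER",)):
    """Replace entity spans with [LABEL] placeholders, using character offsets."""
    result = text
    # Work right to left so earlier offsets stay valid after each replacement
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["label"] in labels:
            result = result[:ent["start"]] + f"[{ent['label']}]" + result[ent["end"]:]
    return result

text = "Angela Merkel wurde in Hamburg geboren."
# Hand-written entities in the same format as extract_entities_spacy
entities = [
    {"text": "Angela Merkel", "label": "PER", "start": 0, "end": 13},
    {"text": "Hamburg", "label": "LOC", "start": 23, "end": 30},
]

print(anonymize(text, entities))
# [PER] wurde in Hamburg geboren.
print(anonymize(text, entities, labels=("PER", "LOC")))
# [PER] wurde in [LOC] geboren.
```

Any entity the model misses stays in the text, so recall matters more than precision for this use case.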

### Step 2: NER Visualization

In [None]:
def visualize_entities(text, nlp_model, style="ent"):
    """
    Visualize named entities in text using spaCy's displacy.
    
    Args:
        text (str): Input text
        nlp_model: spaCy model
        style (str): Visualization style ('ent' or 'dep')
    """
    if nlp_model is None:
        print("spaCy model not available")
        return
    
    # TODO: Create entity visualizations:
    # 1. Use displacy for web-based visualization
    # 2. Create custom matplotlib-based visualization
    # 3. Add color coding for different entity types
    # 4. Include entity statistics
    
    from spacy import displacy
    
    doc = nlp_model(text)
    
    print(f"Text: {text}")
    print("\nEntity Visualization:")
    
    try:
        # Generate the HTML markup; in Jupyter, pass jupyter=True to render it inline
        html = displacy.render(doc, style=style, jupyter=False)
        print(f"HTML visualization generated ({len(html)} characters)")
    except Exception as e:
        print(f"Visualization error: {e}")
    
    # Text-based alternative: mark each entity inline as [text:LABEL]
    print("\nSimple entity highlighting:")
    print("-" * 40)
    
    highlighted_text = text
    for ent in reversed(doc.ents):  # right to left keeps earlier offsets valid
        highlighted_text = (
            highlighted_text[:ent.start_char]
            + f"[{ent.text}:{ent.label_}]"
            + highlighted_text[ent.end_char:]
        )
    print(highlighted_text)

def create_entity_statistics_plot(texts, nlp_model):
    """
    Create statistical plots for entity analysis.
    
    Args:
        texts (list): List of texts to analyze
        nlp_model: spaCy model
    """
    if nlp_model is None:
        print("spaCy model not available")
        return
    
    # TODO: Create comprehensive entity statistics:
    # 1. Entity type distribution
    # 2. Entity frequency analysis
    # 3. Text length vs entity count
    # 4. Most common entities by type
    
    all_entities = []
    entity_type_counts = Counter()
    entity_text_counts = Counter()
    text_stats = []
    
    for i, text in enumerate(texts):
        doc = nlp_model(text)
        text_entities = []
        
        for ent in doc.ents:
            all_entities.append({
                'text': ent.text,
                'label': ent.label_,
                'text_id': i
            })
            text_entities.append(ent.label_)
            entity_type_counts[ent.label_] += 1
            entity_text_counts[ent.text.lower()] += 1
        
        text_stats.append({
            'text_id': i,
            'text_length': len(text),
            'entity_count': len(text_entities),
            'unique_entity_types': len(set(text_entities))
        })
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Entity type distribution
    if entity_type_counts:
        types, counts = zip(*entity_type_counts.most_common())
        axes[0, 0].bar(types, counts)
        axes[0, 0].set_title('Entity Type Distribution')
        axes[0, 0].set_xlabel('Entity Type')
        axes[0, 0].set_ylabel('Count')
        axes[0, 0].tick_params(axis='x', rotation=45)
    
    # Most common entity texts
    if entity_text_counts:
        top_entities = entity_text_counts.most_common(10)
        entities, counts = zip(*top_entities)
        axes[0, 1].barh(range(len(entities)), counts)
        axes[0, 1].set_yticks(range(len(entities)))
        axes[0, 1].set_yticklabels(entities)
        axes[0, 1].set_title('Most Common Entities')
        axes[0, 1].set_xlabel('Frequency')
    
    # Text length vs entity count
    if text_stats:
        text_lengths = [stat['text_length'] for stat in text_stats]
        entity_counts = [stat['entity_count'] for stat in text_stats]
        axes[1, 0].scatter(text_lengths, entity_counts)
        axes[1, 0].set_title('Text Length vs Entity Count')
        axes[1, 0].set_xlabel('Text Length (characters)')
        axes[1, 0].set_ylabel('Entity Count')
        
        # Add trend line
        if len(text_lengths) > 1:
            z = np.polyfit(text_lengths, entity_counts, 1)
            p = np.poly1d(z)
            axes[1, 0].plot(text_lengths, p(text_lengths), "r--", alpha=0.8)
    
    # Entity diversity by text
    if text_stats:
        text_ids = [stat['text_id'] for stat in text_stats]
        unique_types = [stat['unique_entity_types'] for stat in text_stats]
        axes[1, 1].bar(text_ids, unique_types)
        axes[1, 1].set_title('Entity Type Diversity by Text')
        axes[1, 1].set_xlabel('Text ID')
        axes[1, 1].set_ylabel('Unique Entity Types')
    
    plt.tight_layout()
    plt.show()
    
    return {
        'entity_type_counts': entity_type_counts,
        'entity_text_counts': entity_text_counts,
        'text_stats': text_stats,
        'all_entities': all_entities
    }

# Visualize entities in sample texts
print("Entity Visualization Examples:")
print("=" * 40)

for i, text in enumerate(sample_texts[:2]):  # First two texts
    print(f"\nVisualization {i+1}:")
    visualize_entities(text, nlp)

# Create statistical analysis
print("\n\nEntity Statistical Analysis:")
print("=" * 40)
entity_stats = create_entity_statistics_plot(sample_texts, nlp)

### Step 3: Entity Relationship Analysis

In [None]:
import networkx as nx  # needed by the network functions below (pip install networkx)

def analyze_entity_cooccurrence(texts, nlp_model, window_size=3):
    """
    Analyze co-occurrence patterns between entities.
    
    Args:
        texts (list): List of texts to analyze
        nlp_model: spaCy model
        window_size (int): Maximum distance, counted in entity positions within
            a sentence, for two entities to count as co-occurring
    
    Returns:
        dict: Co-occurrence analysis results
    """
    if nlp_model is None:
        print("spaCy model not available")
        return None
    
    # TODO: Implement entity co-occurrence analysis:
    # 1. Find entities that appear together in sentences
    # 2. Calculate co-occurrence frequencies
    # 3. Create entity relationship network
    # 4. Identify entity clusters
    # 5. Calculate relationship strength
    
    cooccurrence_matrix = defaultdict(lambda: defaultdict(int))
    entity_sentences = defaultdict(list)
    all_relationships = []
    
    for text_id, text in enumerate(texts):
        doc = nlp_model(text)
        
        # Group entities by sentence
        for sent in doc.sents:
            sent_entities = []
            for ent in sent.ents:
                sent_entities.append(ent.text.lower())
                entity_sentences[ent.text.lower()].append((text_id, sent.text))
            
            # Calculate co-occurrences within sentence
            for i, entity1 in enumerate(sent_entities):
                for j, entity2 in enumerate(sent_entities):
                    if i != j and abs(i - j) <= window_size:
                        cooccurrence_matrix[entity1][entity2] += 1
                        all_relationships.append((entity1, entity2))
    
    return {
        'cooccurrence_matrix': dict(cooccurrence_matrix),
        'entity_sentences': dict(entity_sentences),
        'relationships': all_relationships
    }

def create_entity_network(cooccurrence_data, min_cooccurrence=1):
    """
    Create and visualize entity relationship network.
    
    Args:
        cooccurrence_data (dict): Co-occurrence analysis results
        min_cooccurrence (int): Minimum co-occurrence threshold
    
    Returns:
        networkx.Graph: Entity network graph
    """
    if cooccurrence_data is None:
        return None
    
    # TODO: Create entity relationship network:
    # 1. Build NetworkX graph from co-occurrence data
    # 2. Add node attributes (entity type, frequency)
    # 3. Add edge weights (co-occurrence strength)
    # 4. Apply layout algorithms for visualization
    # 5. Color nodes by entity type
    
    G = nx.Graph()
    cooccurrence_matrix = cooccurrence_data['cooccurrence_matrix']
    
    # Add nodes and edges
    for entity1, connections in cooccurrence_matrix.items():
        if not G.has_node(entity1):
            G.add_node(entity1)
        
        for entity2, weight in connections.items():
            if weight >= min_cooccurrence:
                if not G.has_node(entity2):
                    G.add_node(entity2)
                G.add_edge(entity1, entity2, weight=weight)
    
    print(f"Entity Network Statistics:")
    print(f"  Nodes (entities): {G.number_of_nodes()}")
    print(f"  Edges (relationships): {G.number_of_edges()}")
    
    if G.number_of_nodes() > 0:
        # Degree centrality is used for both the summary and the node sizes,
        # so compute it even when the graph has no edges
        degree_centrality = nx.degree_centrality(G)
        
        if G.number_of_edges() > 0:
            density = nx.density(G)
            print(f"  Network density: {density:.3f}")
            
            # Find most connected entities
            top_entities = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
            print(f"  Most connected entities:")
            for entity, centrality in top_entities:
                print(f"    {entity}: {centrality:.3f}")
        
        # Visualize network
        if G.number_of_nodes() <= 20:  # Only visualize if not too crowded
            plt.figure(figsize=(12, 8))
            pos = nx.spring_layout(G, k=1, iterations=50)
            
            # Draw edges
            edge_weights = [G[u][v]['weight'] for u, v in G.edges()]
            nx.draw_networkx_edges(G, pos, width=[w*0.5 for w in edge_weights], alpha=0.6)
            
            # Draw nodes
            node_sizes = [degree_centrality.get(node, 0) * 3000 + 100 for node in G.nodes()]
            nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='lightblue', alpha=0.8)
            
            # Draw labels
            nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold')
            
            plt.title('Entity Relationship Network')
            plt.axis('off')
            plt.tight_layout()
            plt.show()
        else:
            print("  Network too large for visualization (>20 nodes)")
    
    return G

def find_entity_clusters(network_graph):
    """
    Find clusters of related entities.
    
    Args:
        network_graph (networkx.Graph): Entity network
    
    Returns:
        list: List of entity clusters
    """
    if network_graph is None or network_graph.number_of_nodes() == 0:
        return []
    
    # TODO: Implement entity clustering:
    # 1. Use community detection algorithms
    # 2. Find connected components
    # 3. Identify entity groups by type
    # 4. Analyze cluster characteristics
    
    # Find connected components (basic clustering)
    clusters = [list(component) for component in nx.connected_components(network_graph)]
    
    print(f"\nEntity Clusters Found:")
    print("=" * 30)
    
    for i, cluster in enumerate(clusters):
        if len(cluster) > 1:  # Only show clusters with multiple entities
            print(f"Cluster {i+1}: {', '.join(cluster)}")
            print(f"  Size: {len(cluster)} entities")
            
            # Calculate cluster metrics
            if len(cluster) > 2:
                subgraph = network_graph.subgraph(cluster)
                if subgraph.number_of_edges() > 0:
                    density = nx.density(subgraph)
                    print(f"  Internal density: {density:.3f}")
            print()
    
    return clusters

# Analyze entity relationships
print("Entity Relationship Analysis:")
print("=" * 40)

cooccurrence_data = analyze_entity_cooccurrence(sample_texts, nlp)

if cooccurrence_data:
    print("\nCo-occurrence Analysis Results:")
    print(f"Found relationships between {len(cooccurrence_data['cooccurrence_matrix'])} entities")
    
    # Create network visualization
    entity_network = create_entity_network(cooccurrence_data)
    
    # Find clusters
    if entity_network:
        clusters = find_entity_clusters(entity_network)
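
The TODO item "Calculate relationship strength" is left open above. Raw co-occurrence counts favor entities that are simply frequent. One possible normalization is the Jaccard coefficient over the sets of sentences in which each entity appears. The sketch below is one option under that assumption, not the only reasonable measure. The function name `jaccard_strength` and the sample data are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def jaccard_strength(sentence_entities):
    """Score entity pairs by |shared sentences| / |sentences containing either|.

    sentence_entities: list of sets of entity strings, one set per sentence.
    """
    occurrences = defaultdict(set)
    for idx, ents in enumerate(sentence_entities):
        for ent in ents:
            occurrences[ent].add(idx)

    strengths = {}
    for a, b in combinations(sorted(occurrences), 2):
        shared = occurrences[a] & occurrences[b]
        if shared:  # keep only pairs that co-occur at least once
            union = occurrences[a] | occurrences[b]
            strengths[(a, b)] = len(shared) / len(union)
    return strengths

# Hand-written example: one set of (lowercased) entities per sentence
sentences = [
    {"fc bayern münchen", "borussia dortmund", "münchen"},
    {"fc bayern münchen", "thomas müller"},
    {"münchen", "bmw ag"},
]

for pair, score in sorted(jaccard_strength(sentences).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

The resulting scores lie between 0 and 1 and could replace the raw `weight` values when building the graph in `create_entity_network`.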

### Step 4: NER Performance Evaluation

In [None]:
def evaluate_ner_performance(texts, true_entities, nlp_model):
    """
    Evaluate NER model performance against ground truth.
    
    Args:
        texts (list): List of texts
        true_entities (list): List of ground truth entities for each text
        nlp_model: spaCy model
    
    Returns:
        dict: Evaluation metrics
    """
    if nlp_model is None:
        print("spaCy model not available")
        return None
    
    # TODO: Implement NER evaluation:
    # 1. Extract entities from texts using model
    # 2. Compare with ground truth annotations
    # 3. Calculate precision, recall, F1-score
    # 4. Analyze errors by entity type
    # 5. Create confusion matrix for entity types
    
    all_predictions = []
    all_true = []
    entity_type_metrics = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})
    
    for text, true_ents in zip(texts, true_entities):
        doc = nlp_model(text)
        predicted_ents = [(ent.text.lower(), ent.label_) for ent in doc.ents]
        
        # Convert ground truth to same format
        true_ents_formatted = [(ent['text'].lower(), ent['label']) for ent in true_ents]
        
        all_predictions.extend(predicted_ents)
        all_true.extend(true_ents_formatted)
        
        # Calculate metrics per entity type
        pred_set = set(predicted_ents)
        true_set = set(true_ents_formatted)
        
        # True positives
        for ent in pred_set.intersection(true_set):
            entity_type_metrics[ent[1]]['tp'] += 1
        
        # False positives
        for ent in pred_set - true_set:
            entity_type_metrics[ent[1]]['fp'] += 1
        
        # False negatives
        for ent in true_set - pred_set:
            entity_type_metrics[ent[1]]['fn'] += 1
    
    # Calculate overall metrics
    total_tp = sum(metrics['tp'] for metrics in entity_type_metrics.values())
    total_fp = sum(metrics['fp'] for metrics in entity_type_metrics.values())
    total_fn = sum(metrics['fn'] for metrics in entity_type_metrics.values())
    
    overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
    overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
    overall_f1 = 2 * (overall_precision * overall_recall) / (overall_precision + overall_recall) if (overall_precision + overall_recall) > 0 else 0
    
    # Calculate per-type metrics
    type_metrics = {}
    for entity_type, metrics in entity_type_metrics.items():
        precision = metrics['tp'] / (metrics['tp'] + metrics['fp']) if (metrics['tp'] + metrics['fp']) > 0 else 0
        recall = metrics['tp'] / (metrics['tp'] + metrics['fn']) if (metrics['tp'] + metrics['fn']) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        type_metrics[entity_type] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'support': metrics['tp'] + metrics['fn']
        }
    
    return {
        'overall': {
            'precision': overall_precision,
            'recall': overall_recall,
            'f1': overall_f1
        },
        'by_type': type_metrics,
        'confusion_data': entity_type_metrics
    }

def create_mock_ground_truth():
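    """
    Create mock ground truth annotations for demonstration.
    In practice, this would come from human annotations.
    
    Note: de_core_news_sm only predicts the labels LOC, MISC, ORG, and PER,
    so the DATE, CARDINAL, MONEY, FAC, and TIME entries below can only count
    as false negatives and will lower the measured recall.
    
    Returns:
        list: Mock ground truth entities
    """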
    # TODO: Create realistic ground truth annotations:
    # 1. Manual annotation of sample texts
    # 2. Include various entity types
    # 3. Handle edge cases and ambiguous entities
    # 4. Ensure consistency in annotation scheme
    
    ground_truth = [
        # Text 1: Angela Merkel biography
        [
            {'text': 'Angela Merkel', 'label': 'PER'},
            {'text': '2005', 'label': 'DATE'},
            {'text': '2021', 'label': 'DATE'},
            {'text': 'Deutschland', 'label': 'LOC'},
            {'text': '17. Juli 1954', 'label': 'DATE'},
            {'text': 'Hamburg', 'label': 'LOC'},
            {'text': 'Universität Leipzig', 'label': 'ORG'},
            {'text': 'Zentralinstitut für Physikalische Chemie', 'label': 'ORG'},
            {'text': 'Berlin', 'label': 'LOC'}
        ],
        # Text 2: BMW information
        [
            {'text': 'BMW AG', 'label': 'ORG'},
            {'text': 'München', 'label': 'LOC'},
            {'text': '1916', 'label': 'DATE'},
            {'text': '120.000', 'label': 'CARDINAL'},
            {'text': '2022', 'label': 'DATE'},
            {'text': '142,6 Milliarden Euro', 'label': 'MONEY'}
        ],
        # Text 3: Football match
        [
            {'text': '15. September 2023', 'label': 'DATE'},
            {'text': 'Allianz Arena', 'label': 'FAC'},
            {'text': 'München', 'label': 'LOC'},
            {'text': 'FC Bayern München', 'label': 'ORG'},
            {'text': 'Borussia Dortmund', 'label': 'ORG'},
            {'text': 'Thomas Müller', 'label': 'PER'},
            {'text': '78. Minute', 'label': 'TIME'}
        ]
    ]
    
    return ground_truth

def display_evaluation_results(evaluation_results):
    """
    Display NER evaluation results in organized format.
    
    Args:
        evaluation_results (dict): Evaluation results
    """
    if evaluation_results is None:
        return
    
    print("NER Evaluation Results:")
    print("=" * 30)
    
    # Overall metrics
    overall = evaluation_results['overall']
    print(f"Overall Performance:")
    print(f"  Precision: {overall['precision']:.3f}")
    print(f"  Recall: {overall['recall']:.3f}")
    print(f"  F1-Score: {overall['f1']:.3f}")
    
    # Per-type metrics
    print(f"\nPer-Type Performance:")
    print("-" * 50)
    print(f"{'Type':<10} {'Precision':<10} {'Recall':<10} {'F1':<10} {'Support':<10}")
    print("-" * 50)
    
    for entity_type, metrics in evaluation_results['by_type'].items():
        print(f"{entity_type:<10} {metrics['precision']:<10.3f} {metrics['recall']:<10.3f} {metrics['f1']:<10.3f} {metrics['support']:<10}")
    
    # Create visualization
    if evaluation_results['by_type']:
        types = list(evaluation_results['by_type'].keys())
        precisions = [evaluation_results['by_type'][t]['precision'] for t in types]
        recalls = [evaluation_results['by_type'][t]['recall'] for t in types]
        f1s = [evaluation_results['by_type'][t]['f1'] for t in types]
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Performance by type
        x = np.arange(len(types))
        width = 0.25
        
        ax1.bar(x - width, precisions, width, label='Precision', alpha=0.8)
        ax1.bar(x, recalls, width, label='Recall', alpha=0.8)
        ax1.bar(x + width, f1s, width, label='F1-Score', alpha=0.8)
        
        ax1.set_xlabel('Entity Types')
        ax1.set_ylabel('Score')
        ax1.set_title('NER Performance by Entity Type')
        ax1.set_xticks(x)
        ax1.set_xticklabels(types, rotation=45)
        ax1.legend()
        ax1.set_ylim(0, 1)
        
        # Overall metrics as a bar chart (a pie chart would wrongly suggest
        # that the three scores are shares of one whole)
        overall_metrics = [overall['precision'], overall['recall'], overall['f1']]
        metric_names = ['Precision', 'Recall', 'F1-Score']
        
        ax2.bar(metric_names, overall_metrics, alpha=0.8)
        ax2.set_ylabel('Score')
        ax2.set_ylim(0, 1)
        ax2.set_title('Overall Performance Metrics')
        
        plt.tight_layout()
        plt.show()

# Create evaluation
print("NER Model Evaluation:")
print("=" * 30)

# Create mock ground truth
ground_truth = create_mock_ground_truth()

# Evaluate model
evaluation_results = evaluate_ner_performance(sample_texts, ground_truth, nlp)

# Display results
display_evaluation_results(evaluation_results)
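
The evaluation above matches entities on lowercased `(text, label)` pairs. That is forgiving: repeated mentions collapse into one, and positions are ignored. A stricter alternative is exact span matching on `(start, end, label)` triples, sketched below with a hypothetical `span_prf` helper and hand-written offsets chosen only for illustration.

```python
def span_prf(predicted, gold):
    """Precision, recall, and F1 over (start, end, label) triples, exact match."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)  # same boundaries and same label
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative annotations: (start_char, end_char, label)
gold = [(0, 13, "PER"), (23, 30, "LOC"), (50, 69, "ORG")]
pred = [(0, 13, "PER"), (23, 30, "ORG"), (50, 69, "ORG"), (80, 86, "LOC")]

p, r, f1 = span_prf(pred, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```

Two of the four predictions are exactly right, so precision is 2/4. Two of the three gold spans are found, so recall is 2/3. The span at 23-30 has the right boundaries but the wrong label, so it counts as both a false positive and a false negative.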

### Step 5: Domain-Specific NER

In [None]:
def create_domain_specific_ner_rules(nlp_model):
    """
    Add domain-specific NER rules to enhance entity recognition.
    
    Args:
        nlp_model: spaCy model
    
    Returns:
        spaCy model with additional rules
    """
    if nlp_model is None:
        return None
    
    # TODO: Implement domain-specific enhancements:
    # 1. Add custom entity patterns
    # 2. Create rule-based entity recognizers
    # 3. Handle domain-specific terminology
    # 4. Improve entity boundary detection
    # 5. Add post-processing rules
    
    from spacy.matcher import Matcher
    from spacy.tokens import Span
    
    # Create matcher for custom patterns
    matcher = Matcher(nlp_model.vocab)
    
    # Define patterns for German-specific entities
    patterns = {
        "GERMAN_UNIVERSITY": [
            [{"LOWER": "universität"}, {"IS_TITLE": True, "OP": "+"}],
            [{"LOWER": "hochschule"}, {"IS_TITLE": True, "OP": "+"}],
            [{"LOWER": "fachhochschule"}, {"IS_TITLE": True, "OP": "+"}]
        ],
        "GERMAN_COMPANY": [
            [{"IS_TITLE": True, "OP": "+"}, {"LOWER": "ag"}],
            [{"IS_TITLE": True, "OP": "+"}, {"LOWER": "gmbh"}],
            [{"IS_TITLE": True, "OP": "+"}, {"LOWER": "se"}]
        ],
        "GERMAN_CITY": [
            [{"LOWER": {"IN": ["berlin", "münchen", "hamburg", "köln", "frankfurt", "stuttgart", "düsseldorf", "dortmund", "essen", "leipzig"]}}]
        ]
    }
    
    # Add patterns to matcher
    for pattern_name, pattern_list in patterns.items():
        matcher.add(pattern_name, pattern_list)
    
    def add_custom_entities(doc):
        """Add custom entities to doc based on matcher results."""
        matches = matcher(doc)
        new_ents = []
        
        for match_id, start, end in matches:
            span = doc[start:end]
            label_name = nlp_model.vocab.strings[match_id]
            
            # Map custom labels to standard ones
            if "UNIVERSITY" in label_name or "COMPANY" in label_name:
                label = "ORG"
            elif "CITY" in label_name:
                label = "LOC"
            else:
                label = "MISC"  # Miscellaneous
            
            new_ents.append(Span(doc, start, end, label=label))
        
        # Merge with existing entities, avoiding overlaps
        existing_ents = list(doc.ents)
        all_ents = existing_ents + new_ents
        
        # Remove overlapping entities (keep longer ones)
        filtered_ents = []
        for ent in sorted(all_ents, key=len, reverse=True):
            if not any(ent.start < existing.end and ent.end > existing.start 
                      for existing in filtered_ents):
                filtered_ents.append(ent)
        
        doc.ents = filtered_ents
        return doc
    
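    # Register the function as a named component, then add it after the
    # statistical NER (spaCy v3 only accepts registered component names)
    from spacy.language import Language
    
    if "custom_ner" not in nlp_model.pipe_names:
        Language.component("custom_ner", func=add_custom_entities)
        nlp_model.add_pipe("custom_ner", after="ner")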
    
    return nlp_model

def compare_ner_models(text, models_dict):
    """
    Compare different NER models on the same text.
    
    Args:
        text (str): Text to analyze
        models_dict (dict): Dictionary of model names and models
    
    Returns:
        dict: Comparison results
    """
    # TODO: Implement NER model comparison:
    # 1. Apply different models to same text
    # 2. Compare entity extraction results
    # 3. Analyze differences in entity recognition
    # 4. Calculate performance metrics
    # 5. Visualize comparison results
    
    results = {}
    
    print(f"Comparing NER Models on Text:")
    print(f"'{text[:100]}...'")
    print("=" * 60)
    
    for model_name, model in models_dict.items():
        if model is None:
            continue
            
        doc = model(text)
        entities = []
        
        for ent in doc.ents:
            entities.append({
                'text': ent.text,
                'label': ent.label_,
                'start': ent.start_char,
                'end': ent.end_char
            })
        
        results[model_name] = entities
        
        print(f"\n{model_name}:")
        print(f"  Found {len(entities)} entities")
        for ent in entities:
            print(f"    {ent['text']} -> {ent['label']} ({spacy.explain(ent['label']) or 'Unknown'})")
    
    # Analyze differences
    if len(results) > 1:
        print(f"\nModel Comparison Analysis:")
        print("-" * 40)
        
        all_entities = set()
        for entities in results.values():
            for ent in entities:
                all_entities.add((ent['text'], ent['label']))
        
        print(f"Total unique entities across all models: {len(all_entities)}")
        
        # Find entities recognized by all models
        common_entities = set([(ent['text'], ent['label']) for ent in list(results.values())[0]])
        for entities in list(results.values())[1:]:
            model_entities = set([(ent['text'], ent['label']) for ent in entities])
            common_entities = common_entities.intersection(model_entities)
        
        print(f"Entities recognized by all models: {len(common_entities)}")
        for ent_text, ent_label in common_entities:
            print(f"  {ent_text} ({ent_label})")
    
    return results

# Test domain-specific NER enhancements
print("Domain-Specific NER Enhancement:")
print("=" * 40)

# Create enhanced model from a freshly loaded pipeline, so that the original
# `nlp` object is not modified and the comparison below is meaningful
if nlp is not None:
    enhanced_nlp = create_domain_specific_ner_rules(spacy.load("de_core_news_sm"))
    
    # Test text with domain-specific entities
    test_text = """Die Universität München und die BMW AG arbeiten zusammen an 
    einem Forschungsprojekt. Das Projekt wird von der Siemens AG und der 
    Fachhochschule Köln unterstützt. Die Ergebnisse werden in Berlin präsentiert."""
    
    # Compare original and enhanced models
    models_to_compare = {
        "Original spaCy": nlp,
        "Enhanced spaCy": enhanced_nlp
    }
    
    comparison_results = compare_ner_models(test_text, models_to_compare)
else:
    print("spaCy model not available for domain-specific enhancement")
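
`compare_ner_models` counts an entity as shared only when text and label are identical. That works for two spaCy pipelines, but models from different libraries may name their labels differently, for example `PERSON` instead of `PER`. The sketch below maps labels to a common scheme and then measures agreement as the Jaccard overlap of `(start, end, label)` triples. The `LABEL_MAP` contents, the function names, and the sample outputs are assumptions for illustration and would need adapting to the models actually used.

```python
# Assumed mapping to the PER/LOC/ORG/MISC scheme; extend it for your models
LABEL_MAP = {"PERSON": "PER", "LOCATION": "LOC", "GPE": "LOC", "ORGANIZATION": "ORG"}

def normalize_spans(entities):
    """Turn entity dicts into a set of (start, end, unified_label) triples."""
    return {
        (e["start"], e["end"], LABEL_MAP.get(e["label"], e["label"]))
        for e in entities
    }

def agreement(entities_a, entities_b):
    """Jaccard overlap between two models' normalized entity sets."""
    a, b = normalize_spans(entities_a), normalize_spans(entities_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hand-written outputs from two hypothetical models on the same text
model_a = [
    {"start": 0, "end": 13, "label": "PER"},
    {"start": 23, "end": 30, "label": "LOC"},
]
model_b = [
    {"start": 0, "end": 13, "label": "PERSON"},
    {"start": 23, "end": 30, "label": "ORG"},
]

print(f"Agreement: {agreement(model_a, model_b):.3f}")
# Agreement: 0.333
```

Without the mapping the two models would share nothing, even though they agree on the person span.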

## Exercise Tasks

Complete the following tasks to deepen your understanding:

1. **Advanced Entity Analysis**:
   - Implement entity disambiguation (linking entities to knowledge bases)
   - Create entity timeline analysis for temporal entities
   - Build entity importance ranking based on frequency and context

2. **Custom NER Training**:
   - Create training data for domain-specific entities
   - Fine-tune a BERT model for German NER
   - Compare custom model with pre-trained models

3. **Multi-language NER**:
   - Test NER on mixed German-English texts
   - Compare monolingual vs multilingual models
   - Handle code-switching scenarios

4. **NER Applications**:
   - Build an entity-based document search system
   - Create automatic content tagging using entities
   - Implement entity-based text summarization

5. **Error Analysis and Improvement**:
   - Analyze common NER errors and their causes
   - Implement post-processing rules for error correction
   - Create ensemble methods combining multiple NER approaches
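
As a starting point for the ensemble task, one simple approach is majority voting: keep a span only if enough models predict the same boundaries and label. The sketch below assumes each model's output has already been converted to `(start, end, label)` triples in a shared label scheme. The function name `majority_vote`, the `min_votes` parameter, and the sample outputs are hypothetical.

```python
from collections import Counter

def majority_vote(model_outputs, min_votes=2):
    """Keep (start, end, label) triples predicted by at least min_votes models."""
    votes = Counter()
    for spans in model_outputs:
        votes.update(set(spans))  # each model votes at most once per span
    return sorted(span for span, n in votes.items() if n >= min_votes)

# Hand-written predictions from three hypothetical models on the same text
outputs = [
    [(0, 13, "PER"), (23, 30, "LOC")],
    [(0, 13, "PER"), (23, 30, "ORG")],
    [(0, 13, "PER"), (23, 30, "LOC"), (40, 46, "MISC")],
]

print(majority_vote(outputs))
# [(0, 13, 'PER'), (23, 30, 'LOC')]
```

With more models or a lower `min_votes`, two different labels for the same span can both pass the threshold, so a real ensemble would also need a tie-breaking rule.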

## Reflection Questions

1. What are the main challenges in German NER compared to English?
2. How do you handle ambiguous entities (e.g., "Apple" as fruit vs company)?
3. What evaluation metrics are most appropriate for NER tasks?
4. How can domain knowledge improve NER performance?
5. What are the privacy implications of NER in text processing?

## Next Steps

- Explore relation extraction between identified entities
- Study entity linking and knowledge graph construction
- Learn about nested and overlapping entity recognition
- Investigate zero-shot and few-shot NER approaches