# Named Entity Recognition with Interactive Visualization

This notebook demonstrates Named Entity Recognition (NER) using spaCy, including:

- Text preprocessing for entity extraction
- Detection of different entity types (people, organizations, locations, etc.)
- Simulated confidence scores for entity predictions (a heuristic, not model probabilities)
- Interactive visualization with color-coded entity highlighting
- Entity distribution analysis on different text samples


In [1]:
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
import warnings
from spacy import displacy
import random

# Suppress warnings
warnings.filterwarnings('ignore')

# Download and load spaCy model
try:
    nlp = spacy.load('en_core_web_md')
    print("Loaded spaCy model: en_core_web_md")
except OSError:
    print("Downloading spaCy model...")
    !python -m spacy download en_core_web_md
    nlp = spacy.load('en_core_web_md')
    print("Downloaded and loaded spaCy model: en_core_web_md")

Loaded spaCy model: en_core_web_md


## 1️⃣ Entity Types in Named Entity Recognition

Named Entity Recognition identifies and classifies key elements in text into predefined categories.


In [None]:
# Entity types in spaCy's NER model
entity_types = [
    ["PERSON", "People, including fictional characters"],
    ["ORG", "Companies, agencies, institutions"],
    ["GPE", "Geopolitical entities: countries, cities, states"],
    ["LOC", "Non-GPE locations, mountain ranges, bodies of water"],
    ["DATE", "Dates or references to dates"],
    ["TIME", "Times of day"],
    ["MONEY", "Monetary values, including unit"],
    ["PERCENT", "Percentages"],
    ["PRODUCT", "Products, objects, vehicles, foods, etc."],
    ["EVENT", "Named events such as hurricanes, battles, wars, sports events"],
    ["WORK_OF_ART", "Titles of books, songs, etc."],
    ["LAW", "Named documents made into laws"],
    ["LANGUAGE", "Any named language"],
    ["NORP", "Nationalities, religious or political groups"],
    ["FAC", "Buildings, airports, highways, bridges, etc."],
    ["QUANTITY", "Measurements, such as weight or distance"],
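    ["ORDINAL", "Ordinal numbers such as first, second"],
    ["CARDINAL", "Numerals that do not fall under another type"],
]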

# Create a color mapping for entity types
colors = {
    "PERSON": "#ff9999",  # Light red
    "ORG": "#66b3ff",  # Light blue
    "GPE": "#99ff99",  # Light green
    "LOC": "#ffcc99",  # Light orange
    "DATE": "#c2c2f0",  # Light purple
    "TIME": "#ffb3e6",  # Light pink
    "MONEY": "#c4e17f",  # Light yellow-green
    "PERCENT": "#f7b6d2",  # Light salmon
    "PRODUCT": "#c5a3ff",  # Light violet
    "EVENT": "#afeeee",  # Pale turquoise
    "WORK_OF_ART": "#d8bfd8",  # Thistle
    "LAW": "#ffdead",  # Navajo white
    "LANGUAGE": "#98fb98",  # Pale green
    "NORP": "#dda0dd",  # Plum
    "FAC": "#fafad2",  # Light goldenrod
    "QUANTITY": "#d3d3d3",  # Light gray
    "ORDINAL": "#e6e6a8",  # Pale khaki
    "CARDINAL": "#b0c4de",  # Light steel blue
}

# Display entity types and descriptions
display(HTML("<h3>Entity Types in Named Entity Recognition:</h3>"))
entity_df = pd.DataFrame(entity_types, columns=["Entity Type", "Description"])
display(entity_df)

Unnamed: 0,Entity Type,Description
0,PERSON,"People, including fictional characters"
1,ORG,"Companies, agencies, institutions"
2,GPE,"Geopolitical entities: countries, cities, states"
3,LOC,"Non-GPE locations, mountain ranges, bodies of ..."
4,DATE,Dates or references to dates
5,TIME,Times of day
6,MONEY,"Monetary values, including unit"
7,PERCENT,Percentages
8,PRODUCT,"Products, objects, vehicles, foods, etc."
9,EVENT,"Named events such as hurricanes, battles, wars..."


## 2️⃣ Sample Text Data

Let's examine some example texts with different types of entities.


In [3]:
sample_texts = [
    "Apple CEO Tim Cook announced a new iPhone at the company's headquarters in Cupertino, California on September 12, 2023.",
    "The European Union and United Nations collaborated on a $50 million climate initiative that will begin next January in Paris.",
    "Microsoft acquired GitHub for $7.5 billion in 2018, and Satya Nadella described it as a strategic investment.",
    "The Eiffel Tower in Paris, France attracts over 7 million visitors annually and was built by Gustave Eiffel in 1889.",
    "According to the Wall Street Journal, Amazon's Jeff Bezos donated 2% of his net worth to various charitable organizations last year.",
]

# Display sample texts in a nicely formatted table
display(HTML("<h3>Sample Texts for Entity Recognition:</h3>"))
pd.DataFrame(sample_texts, columns=["Sample Text"])

Unnamed: 0,Sample Text
0,Apple CEO Tim Cook announced a new iPhone at t...
1,The European Union and United Nations collabor...
2,Microsoft acquired GitHub for $7.5 billion in ...
3,"The Eiffel Tower in Paris, France attracts ove..."
4,"According to the Wall Street Journal, Amazon's..."


## 3️⃣ Entity Recognition Function

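Create a function to identify entities in text and attach a simulated confidence score to each one. The score is a heuristic based on entity length and type, not a probability produced by the model.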


In [None]:
def extract_entities(text):
    """Extract named entities from text with confidence scores."""
    # Process text with spaCy
    doc = nlp(text)

    # Extract entities with positions and generate confidence scores
    entities = []
    for ent in doc.ents:
        # Calculate a simulated confidence score
        # In a real implementation, this would come from the model
        # Here we'll simulate it with a score that favors longer entities and certain types
        base_confidence = 0.70 + (len(ent.text) / 50)  # Longer entities get slightly higher scores
        type_bonus = (
            0.1 if ent.label_ in ["PERSON", "ORG", "GPE"] else 0.05
        )  # Common entity types get a boost
        confidence = min(0.99, base_confidence + type_bonus)  # Cap at 0.99

        # Add very slight random variation to make it look more realistic
        confidence += random.uniform(-0.03, 0.03)
        confidence = max(0.6, min(0.99, confidence))  # Keep within reasonable bounds

        entities.append(
            {
                "text": ent.text,
                "start": ent.start_char,
                "end": ent.end_char,
                "type": ent.label_,
                "confidence": round(confidence, 2),
            }
        )

    return entities


# Test the function on a sample text
sample_idx = 0
sample_text = sample_texts[sample_idx]
entities = extract_entities(sample_text)

# Display the extracted entities
print(f"Text: {sample_text}\n")
print("Extracted Entities:")
for ent in entities:
    print(f"- {ent['text']} ({ent['type']}): Confidence {ent['confidence']}")

Text: Apple CEO Tim Cook announced a new iPhone at the company's headquarters in Cupertino, California on September 12, 2023.

Extracted Entities:
- Apple (ORG): Confidence 0.89
- Tim Cook (PERSON): Confidence 0.96
- iPhone (ORG): Confidence 0.9
- Cupertino (GPE): Confidence 0.98
- California (GPE): Confidence 0.96
- September 12, 2023 (DATE): Confidence 0.99
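
Because `extract_entities` adds random jitter, the scores above change from run to run. The following is a minimal sketch of the same heuristic with the jitter made optional and reproducible. The name `simulated_confidence` and the `rng` parameter are illustrative assumptions, not part of the notebook.

```python
import random


def simulated_confidence(text, label, rng=None):
    """Heuristic score mirroring extract_entities; not a model probability."""
    base = 0.70 + len(text) / 50  # longer spans score slightly higher
    bonus = 0.10 if label in ("PERSON", "ORG", "GPE") else 0.05
    score = min(0.99, base + bonus)
    if rng is not None:  # optional jitter, reproducible when rng is seeded
        score += rng.uniform(-0.03, 0.03)
    return round(max(0.6, min(0.99, score)), 2)


# Without jitter: 0.70 + 8/50 + 0.10
print(simulated_confidence("Tim Cook", "PERSON"))  # 0.96

# With a seeded generator the jittered score is the same on every run
print(simulated_confidence("Tim Cook", "PERSON", random.Random(42)))
```

Passing a seeded `random.Random` instance instead of calling the module-level `random.uniform` keeps the jitter local, so other code that uses `random` is unaffected.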


## 4️⃣ Visualize Entities in Text

Create a function to highlight entities in text with type-specific colors and show each confidence score as a hover tooltip.


In [None]:
def highlight_entities(text, entities):
    """Highlight entities in text with colors and confidence scores."""
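    # Process entities from last to first so that inserted markup does not shift
    # the character offsets of entities that are still to be processed.
    # This assumes the spans do not overlap, which holds for spaCy's doc.ents.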
    sorted_entities = sorted(entities, key=lambda x: x["start"], reverse=True)

    # Highlighted HTML text
    html_text = text

    for ent in sorted_entities:
        # Get color for entity type
        color = colors.get(ent["type"], "#ddd")

        # Get text before and after the entity
        before = html_text[: ent["start"]]
        after = html_text[ent["end"] :]

        # Create tooltip with confidence score
        tooltip = f"{ent['type']}: {ent['confidence'] * 100:.0f}% confidence"

        # Replace entity with highlighted version
        entity_html = f'<span style="background-color: {color}; padding: 2px; border-radius: 3px;" title="{tooltip}">{ent["text"]}</span>'
        html_text = before + entity_html + after

    return html_text


# Test the visualization
highlighted_text = highlight_entities(sample_text, entities)
display(HTML("<h3>Entity Highlighting Example:</h3>"))
display(HTML(f"<p style='font-size: 1.1em; line-height: 1.5;'>{highlighted_text}</p>"))

# Add a legend for entity colors
legend_html = "<div style='margin-top: 20px;'><b>Entity Types:</b> "
for ent_type, color in colors.items():
    if any(
        e["type"] == ent_type for e in entities
    ):  # Only show entity types that appear in the text
        legend_html += f"<span style='margin-right: 10px;'><span style='background-color: {color}; padding: 2px 5px; border-radius: 3px;'>{ent_type}</span></span>"
legend_html += "</div>"
display(HTML(legend_html))
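
The function above inserts raw text into HTML, so input containing `<`, `>` or `&` can break the rendering. A possible variant, sketched below under the assumption that entities use the same dict keys as in this notebook, builds the output left to right, escapes every piece with `html.escape`, and skips spans that overlap one already emitted. The name `highlight_entities_safe` is hypothetical.

```python
import html


def highlight_entities_safe(text, entities, colors, default_color="#ddd"):
    """Build highlighted HTML left to right, escaping text and skipping overlaps."""
    parts = []
    cursor = 0  # position in `text` up to which output has been emitted
    for ent in sorted(entities, key=lambda e: (e["start"], -e["end"])):
        if ent["start"] < cursor:  # overlaps a span that was already emitted
            continue
        parts.append(html.escape(text[cursor:ent["start"]]))
        color = colors.get(ent["type"], default_color)
        tooltip = html.escape(
            f"{ent['type']}: {ent['confidence'] * 100:.0f}% confidence", quote=True
        )
        body = html.escape(text[ent["start"]:ent["end"]])
        parts.append(
            f'<span style="background-color: {color};" title="{tooltip}">{body}</span>'
        )
        cursor = ent["end"]
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


demo = highlight_entities_safe(
    "AT&T hired <b>Ann</b>",
    [{"start": 0, "end": 4, "type": "ORG", "confidence": 0.9}],
    {"ORG": "#66b3ff"},
)
print(demo)
```

Working left to right with a cursor avoids the reverse-order trick entirely, because the original offsets are only ever applied to the unmodified input string.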

## 5️⃣ Entity Distribution Visualization

Create visualizations to show the distribution of entity types in the text and the confidence score of each entity.


In [None]:
def plot_entity_distribution(entities):
    """Create visualizations of entity distribution."""
    if not entities:
        print("No entities found to visualize.")
        return

    # Count entities by type
    entity_counts = Counter([ent["type"] for ent in entities])

    # Create DataFrame for visualization
    df = pd.DataFrame(list(entity_counts.items()), columns=["Entity Type", "Count"])
    df = df.sort_values("Count", ascending=False)

    # Set up the figure
    plt.figure(figsize=(10, 5))

    # Create bar chart with entity colors
    sns.barplot(
        x="Entity Type",
        y="Count",
        data=df,
        palette=[colors.get(t, "#ddd") for t in df["Entity Type"]],
    )

    plt.title("Distribution of Entity Types", fontsize=14)
    plt.xlabel("Entity Type")
    plt.ylabel("Count")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

    # Display confidence scores for each entity
    plt.figure(figsize=(10, 5))

    # Extract entity texts and confidence scores
    entity_data = [
        {"Text": ent["text"], "Type": ent["type"], "Confidence": ent["confidence"]}
        for ent in entities
    ]
    entity_df = pd.DataFrame(entity_data)

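    # Create horizontal bar chart of confidence scores.
    # Build the palette per label so that types missing from `colors` fall back
    # to gray instead of raising an error for a missing palette key.
    type_palette = {t: colors.get(t, "#ddd") for t in entity_df["Type"].unique()}
    sns.barplot(
        x="Confidence", y="Text", hue="Type", data=entity_df, palette=type_palette, dodge=False
    )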

    plt.title("Entity Confidence Scores", fontsize=14)
    plt.xlabel("Confidence Score")
    plt.ylabel("Entity")
    plt.xlim(0.5, 1.0)  # Set x-axis range for confidence scores
    plt.legend(title="Entity Type", loc="lower right")
    plt.tight_layout()
    plt.show()


# Visualize the entities in our sample text
plot_entity_distribution(entities)
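
The same distribution can be checked numerically without plotting. The sketch below groups entity dicts by type and reports the count and mean confidence. `summarize_by_type` is an illustrative helper, and the input values are copied from the sample output in section 3, so they are simulated scores.

```python
from collections import defaultdict


def summarize_by_type(entities):
    """Return {type: (count, mean confidence)} for a list of entity dicts."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["type"]].append(ent["confidence"])
    return {
        label: (len(scores), round(sum(scores) / len(scores), 3))
        for label, scores in grouped.items()
    }


# Values copied from the sample output in section 3 (simulated scores)
sample_entities = [
    {"text": "Apple", "type": "ORG", "confidence": 0.89},
    {"text": "Tim Cook", "type": "PERSON", "confidence": 0.96},
    {"text": "iPhone", "type": "ORG", "confidence": 0.9},
    {"text": "Cupertino", "type": "GPE", "confidence": 0.98},
    {"text": "California", "type": "GPE", "confidence": 0.96},
    {"text": "September 12, 2023", "type": "DATE", "confidence": 0.99},
]

summary = summarize_by_type(sample_entities)
for label, (count, mean_conf) in sorted(summary.items()):
    print(f"{label:<8} count={count} mean_confidence={mean_conf}")
```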

## 6️⃣ Interactive Entity Recognition Tool

Create an interactive widget that allows users to input text and see highlighted entities with confidence scores.


In [None]:
# Create widgets for user interaction
text_input = widgets.Textarea(
    value="Google CEO Sundar Pichai visited the company's new office in London, UK last Thursday and met with Prime Minister Rishi Sunak to discuss a £1 billion investment.",
    placeholder="Enter text to analyze...",
    description="Text:",
    disabled=False,
    layout=widgets.Layout(width="95%", height="100px"),
)

sample_dropdown = widgets.Dropdown(
    options=[f"Sample {i+1}" for i in range(len(sample_texts))],
    value="Sample 1",
    description="Try Sample:",
    disabled=False,
    layout=widgets.Layout(width="50%"),
)

analyze_button = widgets.Button(
    description="Analyze Entities",
    disabled=False,
    button_style="primary",
    tooltip="Click to analyze entities",
    icon="search",
)

clear_button = widgets.Button(
    description="Clear",
    disabled=False,
    button_style="",
    tooltip="Clear results",
    icon="trash",
)

min_confidence_slider = widgets.FloatSlider(
    value=0.6,
    min=0.5,
    max=0.95,
    step=0.05,
    description="Min Confidence:",
    disabled=False,
    continuous_update=False,
    orientation="horizontal",
    readout=True,
    readout_format=".2f",
    layout=widgets.Layout(width="50%"),
)

show_chart_checkbox = widgets.Checkbox(
    value=True, description="Show Charts", disabled=False, indent=False
)

output = widgets.Output()
status = widgets.HTML(value="")


# Function to load sample text
def on_sample_change(change):
    if change["type"] == "change" and change["name"] == "value":
        sample_idx = int(change["new"].split()[1]) - 1
        text_input.value = sample_texts[sample_idx]


# Function to handle analyze button click
def on_analyze_button_clicked(b):
    with output:
        clear_output()
        status.value = "<i>Analyzing entities...</i>"

        # Get input text
        text = text_input.value.strip()
        if not text:
            status.value = "<span style='color:red'>Please enter some text</span>"
            return

        # Extract entities
        entities = extract_entities(text)

        # Filter by minimum confidence
        min_confidence = min_confidence_slider.value
        filtered_entities = [ent for ent in entities if ent["confidence"] >= min_confidence]

        # Highlight entities
        highlighted_text = highlight_entities(text, filtered_entities)

        # Show results
        display(HTML("<h3>Identified Entities:</h3>"))
        display(
            HTML(
                f"<p style='font-size: 1.1em; line-height: 1.5; padding: 10px; background-color: #f8f8f8; border-radius: 5px;'>{highlighted_text}</p>"
            )
        )

        # Add a legend for entity colors
        legend_html = "<div style='margin-top: 10px;'><b>Entity Types:</b> "
        for ent_type, color in colors.items():
            if any(e["type"] == ent_type for e in filtered_entities):
                legend_html += f"<span style='margin-right: 10px;'><span style='background-color: {color}; padding: 2px 5px; border-radius: 3px;'>{ent_type}</span></span>"
        legend_html += "</div>"
        display(HTML(legend_html))

        # Display entity table
        if filtered_entities:
            entity_data = [
                {
                    "Entity": ent["text"],
                    "Type": ent["type"],
                    "Confidence": f"{ent['confidence']:.2f}",
                }
                for ent in filtered_entities
            ]

            entity_df = pd.DataFrame(entity_data)
            display(HTML("<h3>Entity Details:</h3>"))
            display(entity_df)

            # Show charts if enabled
            if show_chart_checkbox.value:
                plot_entity_distribution(filtered_entities)
        else:
            display(HTML("<p>No entities found with the current confidence threshold.</p>"))

        # Update status
        filtered_count = len(filtered_entities)
        total_count = len(entities)

        if filtered_count < total_count:
            status.value = f"<span style='color:green'>Found {filtered_count} entities (filtered from {total_count} total entities)</span>"
        else:
            status.value = f"<span style='color:green'>Found {filtered_count} entities</span>"


# Function to clear output
def on_clear_button_clicked(b):
    with output:
        clear_output()
    status.value = ""


# Register callbacks
sample_dropdown.observe(on_sample_change)
analyze_button.on_click(on_analyze_button_clicked)
clear_button.on_click(on_clear_button_clicked)

# Display the interactive widget
display(HTML("<h2>🔍 Interactive Named Entity Recognition</h2>"))
display(
    HTML(
        "<p>Enter text or select a sample, then click 'Analyze Entities' to identify named entities.</p>"
    )
)
display(text_input)
display(widgets.HBox([sample_dropdown, min_confidence_slider]))
display(widgets.HBox([analyze_button, clear_button, show_chart_checkbox]))
display(status)
display(output)

Textarea(value="Google CEO Sundar Pichai visited the company's new office in London, UK last Thursday and met …

HBox(children=(Dropdown(description='Try Sample:', layout=Layout(width='50%'), options=('Sample 1', 'Sample 2'…

HBox(children=(Button(button_style='primary', description='Analyze Entities', icon='search', style=ButtonStyle…

HTML(value='')

Output()
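
The filtering and status logic inside `on_analyze_button_clicked` can be separated from the widgets so that it is testable outside a notebook. This is a sketch under that assumption. `analyze_text` and `fake_extractor` are hypothetical names, and the stub stands in for `extract_entities` with a fixed result.

```python
def analyze_text(text, min_confidence, extractor):
    """Return (filtered_entities, status_message) without touching any widget."""
    text = text.strip()
    if not text:
        return [], "Please enter some text"
    entities = extractor(text)
    kept = [e for e in entities if e["confidence"] >= min_confidence]
    if len(kept) < len(entities):
        status = f"Found {len(kept)} entities (filtered from {len(entities)} total entities)"
    else:
        status = f"Found {len(kept)} entities"
    return kept, status


def fake_extractor(text):
    """Stub standing in for extract_entities; returns a fixed, made-up result."""
    return [
        {"text": "London", "type": "GPE", "confidence": 0.92},
        {"text": "last Thursday", "type": "DATE", "confidence": 0.71},
    ]


kept, status = analyze_text("Visited London last Thursday", 0.8, fake_extractor)
print(status)  # Found 1 entities (filtered from 2 total entities)
```

The button callback would then only read the widget values, call this function, and render the result.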

## 7️⃣ Custom Entity Recognition Example

Demonstrate how custom entities might be recognized in domain-specific text.


In [None]:
def custom_entity_example():
    """Show how custom entities could be recognized (simulated example)."""
    # This is a simulated example - in a real application, you would train a custom NER model

    # Medical domain example text
    medical_text = """Patient presented with acute myocardial infarction and was treated with 100mg of Aspirin. 
    Dr. Johnson recommended a follow-up with a cardiologist at Mayo Clinic within 2 weeks. 
    Patient has a history of hypertension and type 2 diabetes mellitus."""

    # Standard spaCy NER for comparison (collected here but not displayed below)
    doc = nlp(medical_text)
    standard_entities = [{"text": ent.text, "type": ent.label_} for ent in doc.ents]

    # Simulated custom medical entities
    # In a real application, these would come from a custom-trained model
    custom_entities = [
        {
            "text": "acute myocardial infarction",
            "start": medical_text.find("acute myocardial infarction"),
            "end": medical_text.find("acute myocardial infarction")
            + len("acute myocardial infarction"),
            "type": "MEDICAL_CONDITION",
            "confidence": 0.94,
        },
        {
            "text": "100mg of Aspirin",
            "start": medical_text.find("100mg of Aspirin"),
            "end": medical_text.find("100mg of Aspirin") + len("100mg of Aspirin"),
            "type": "MEDICATION",
            "confidence": 0.89,
        },
        {
            "text": "Dr. Johnson",
            "start": medical_text.find("Dr. Johnson"),
            "end": medical_text.find("Dr. Johnson") + len("Dr. Johnson"),
            "type": "HEALTHCARE_PROVIDER",
            "confidence": 0.91,
        },
        {
            "text": "cardiologist",
            "start": medical_text.find("cardiologist"),
            "end": medical_text.find("cardiologist") + len("cardiologist"),
            "type": "MEDICAL_SPECIALTY",
            "confidence": 0.86,
        },
        {
            "text": "Mayo Clinic",
            "start": medical_text.find("Mayo Clinic"),
            "end": medical_text.find("Mayo Clinic") + len("Mayo Clinic"),
            "type": "HEALTHCARE_FACILITY",
            "confidence": 0.92,
        },
        {
            "text": "hypertension",
            "start": medical_text.find("hypertension"),
            "end": medical_text.find("hypertension") + len("hypertension"),
            "type": "MEDICAL_CONDITION",
            "confidence": 0.95,
        },
        {
            "text": "type 2 diabetes mellitus",
            "start": medical_text.find("type 2 diabetes mellitus"),
            "end": medical_text.find("type 2 diabetes mellitus") + len("type 2 diabetes mellitus"),
            "type": "MEDICAL_CONDITION",
            "confidence": 0.93,
        },
    ]

    # Custom colors for medical entities
    medical_colors = {
        "MEDICAL_CONDITION": "#ff6666",  # Red
        "MEDICATION": "#66ccff",  # Blue
        "HEALTHCARE_PROVIDER": "#99cc99",  # Green
        "MEDICAL_SPECIALTY": "#ffcc99",  # Orange
        "HEALTHCARE_FACILITY": "#cc99ff",  # Purple
    }

    # Create highlighting function for custom entities
    def highlight_medical_entities(text, entities):
        sorted_entities = sorted(entities, key=lambda x: x["start"], reverse=True)
        html_text = text

        for ent in sorted_entities:
            color = medical_colors.get(ent["type"], "#ddd")
            before = html_text[: ent["start"]]
            after = html_text[ent["end"] :]
            tooltip = f"{ent['type']}: {ent['confidence'] * 100:.0f}% confidence"
            entity_html = f'<span style="background-color: {color}; padding: 2px; border-radius: 3px;" title="{tooltip}">{ent["text"]}</span>'
            html_text = before + entity_html + after

        return html_text

    # Display results
    display(HTML("<h3>Domain-Specific Entity Recognition: Medical Example</h3>"))
    display(HTML("<p><b>Original Text:</b></p>"))
    display(HTML(f"<p>{medical_text}</p>"))

    display(HTML("<p><b>With Custom Medical Entity Recognition:</b></p>"))
    highlighted_medical = highlight_medical_entities(medical_text, custom_entities)
    display(HTML(f"<p style='line-height: 1.5;'>{highlighted_medical}</p>"))

    # Add legend for entity types
    legend_html = "<div style='margin-top: 10px;'><b>Medical Entity Types:</b> "
    for ent_type, color in medical_colors.items():
        if any(e["type"] == ent_type for e in custom_entities):
            legend_html += f"<span style='margin-right: 10px;'><span style='background-color: {color}; padding: 2px 5px; border-radius: 3px;'>{ent_type}</span></span>"
    legend_html += "</div>"
    display(HTML(legend_html))

    # Display entity counts
    entity_types = Counter([e["type"] for e in custom_entities])
    display(HTML("<p><b>Entity Counts:</b></p>"))
    for entity_type, count in entity_types.items():
        display(HTML(f"<p>{entity_type}: {count}</p>"))


# Run the custom entity example
custom_entity_example()
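
The hard-coded spans above rely on `str.find`, which returns only the first occurrence of each term. As a stand-in for a trained model, a dictionary (gazetteer) lookup with regular expressions can locate every occurrence and its offsets. The sketch below is an illustration only. `MEDICAL_TERMS` and `gazetteer_entities` are hypothetical names, and this approach does not generalize to unseen terms the way a trained NER model could.

```python
import re

# Hypothetical gazetteer: surface form -> label (illustrative, not a trained model)
MEDICAL_TERMS = {
    "acute myocardial infarction": "MEDICAL_CONDITION",
    "hypertension": "MEDICAL_CONDITION",
    "type 2 diabetes mellitus": "MEDICAL_CONDITION",
    "Mayo Clinic": "HEALTHCARE_FACILITY",
    "cardiologist": "MEDICAL_SPECIALTY",
}


def gazetteer_entities(text, terms):
    """Find every case-insensitive, whole-word occurrence of each term."""
    # Longest terms first so longer phrases win over terms they contain
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    lookup = {t.lower(): label for t, label in terms.items()}
    return [
        {
            "text": m.group(0),
            "start": m.start(),
            "end": m.end(),
            "type": lookup[m.group(0).lower()],
        }
        for m in pattern.finditer(text)
    ]


note = "History of hypertension; hypertension worsened. Seen at mayo clinic."
for ent in gazetteer_entities(note, MEDICAL_TERMS):
    print(ent["start"], ent["end"], ent["type"], ent["text"])
```

The returned dicts use the same `text`, `start`, `end` and `type` keys as the simulated entities, so they could be passed to the highlighting function after adding a `confidence` value.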

## 8️⃣ Conclusion and Further Applications

Named Entity Recognition has many practical applications across different domains:

- **Information Extraction**: Automatically extract structured information from unstructured text
- **Content Classification**: Categorize documents based on mentioned entities
- **Question Answering**: Identify entities that match specific queries
- **Search Enhancement**: Improve search relevance by prioritizing entity matches
- **Content Recommendations**: Suggest related content based on entity similarity
- **Relationship Extraction**: Identify connections between different entities

**Further improvements could include:**

- Training domain-specific NER models for better performance in specialized fields
- Implementing entity linking to connect entities to knowledge bases
- Using transformer-based models such as BERT or RoBERTa (for example, spaCy's `en_core_web_trf` pipeline) for better entity recognition
- Adding support for cross-document entity coreference resolution
- Implementing real-time entity recognition for streaming text data
