# Named Entity Recognition with Haystack Pipelines

## Overview
This notebook demonstrates how to build an end-to-end **Named Entity Recognition (NER)** pipeline using Haystack 2.0. We'll create a system that searches the web for articles, extracts content, and automatically identifies and categorizes named entities (people, organizations, locations, and miscellaneous entities).

## What You'll Learn
By the end of this notebook, you will understand:
1. How to build modular Haystack pipelines with multiple components
2. How to integrate web search capabilities into NLP workflows
3. How to extract and process HTML content from web pages
4. How to use pre-trained NER models with Haystack
5. How to create custom Haystack components for specialized data processing
6. How to structure extracted entities for downstream analysis

## Use Cases
This pipeline pattern is useful for:
- **Media Monitoring**: Track mentions of people, organizations, and places in news articles
- **Research**: Gather and categorize information about specific topics or entities
- **Content Analysis**: Automatically tag and organize web content by entities
- **Knowledge Base Creation**: Extract structured information from unstructured web data

## Pipeline Architecture
Our pipeline follows this flow:
```
Web Search → Fetch URLs → Convert HTML → Clean Text → Extract Entities → Structure Results
```

Each component performs a specific task, and they're connected to create a complete workflow.

In [None]:
import os
from typing import Any, Dict, List, Union

from dotenv import load_dotenv
from haystack import Pipeline, component, Document
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.websearch import SearchApiWebSearch
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.extractors import NamedEntityExtractor
from haystack.utils import Secret

# Load environment variables (API keys) from a local .env file
load_dotenv(".env")
# Note: the pipeline below reads SEARCH_API_KEY directly from the environment
# via Secret.from_env_var; the two variables here are not used by it.
open_ai_key = os.getenv("OPENAI_API_KEY")
serper_api_key = os.getenv("SERPERDEV_API_KEY")


## Step 1: Import Required Libraries

We'll import all necessary components from Haystack and supporting libraries:

**Haystack Components:**
- **`Pipeline`**: The core container that orchestrates all components
- **`SearchApiWebSearch`**: Performs web searches using SearchAPI
- **`LinkContentFetcher`**: Downloads content from URLs
- **`HTMLToDocument`**: Converts HTML to Haystack Document objects
- **`DocumentCleaner`**: Preprocesses text by removing unwanted content
- **`NamedEntityExtractor`**: Identifies named entities using ML models
- **`Secret`**: Securely manages API keys from environment variables

**Supporting Libraries:**
- **`dotenv`**: Loads environment variables from .env file
- **`typing`**: Provides type hints for better code documentation

## Step 2: Define a Custom Haystack Component

### What is a Custom Component?
Haystack allows you to create custom components to handle specialized processing tasks. A custom component must:
1. Be decorated with `@component`
2. Define input/output types using `@component.output_types()`
3. Implement a `run()` method that processes data

### Why Create NERPopulator?
The built-in `NamedEntityExtractor` identifies entities, but we need to:
- Filter entities by confidence score (only keep high-quality predictions)
- Remove duplicates within each entity type
- Organize entities into a structured format (LOC, PER, ORG, MISC)
- Preserve source URLs for traceability

### Component Architecture
Our `NERPopulator` component:
- **Input**: List of documents with `named_entities` in metadata
- **Processing**: Filters, deduplicates, and categorizes entities
- **Output**: Structured documents ready for analysis or storage

In [2]:
@component
class NERPopulator:
    """Structure the named entities found in a list of documents.

    For each document, entities with a score of at least 0.8 are grouped
    by label (LOC, PER, ORG, MISC) and deduplicated.

    Args:
        sources (list): List of Haystack Document objects with `named_entities` in their metadata

    Returns:
        dict: A dictionary with the key 'documents' (to stay Haystack-compatible) whose value
        is a list of dictionaries containing the extracted entities, one per document
    """
    
    @component.output_types(documents=List[Dict[str, Any]])
    def run(self, sources: List[Document]) -> Dict[str, List[Dict[str, Any]]]:
        extracted_data = []

        for document in sources:
            content = document.content
            doc_id = document.id
            named_entities = document.meta.get('named_entities', [])
            url = document.meta.get('url', 'N/A')  # Default to 'N/A' if URL is not available

            # Sets to store unique entities by type
            entities_by_type = {
                "LOC": set(),
                "PER": set(),
                "ORG": set(),
                "MISC": set()
            }
            
            # Loop through the entities and filter by score and type
            for entity in named_entities:
                if float(entity.score) < 0.8:
                    continue
                
                word = content[entity.start:entity.end]
                if entity.entity in entities_by_type:
                    entities_by_type[entity.entity].add(word)  # Use set to ensure uniqueness
            
            # Prepare the meta field with comma-separated values
            meta = {
                "LOC": ",".join(entities_by_type["LOC"]),
                "PER": ",".join(entities_by_type["PER"]),
                "ORG": ",".join(entities_by_type["ORG"]),
                "MISC": ",".join(entities_by_type["MISC"]),
                "url": url
            }
            
            # Append the result for this document
            extracted_data.append({
                'document_id': doc_id,
                'content': content,
                'meta': meta
            })
        

        return {"documents": extracted_data}


### Understanding the NERPopulator Implementation

**Key Features:**

1. **Confidence Filtering**: `if float(entity.score) < 0.8: continue`
   - Only entities with a confidence score of at least 0.8 are kept
   - Reduces false positives and improves data quality

2. **Entity Categories (Standard NER labels)**:
   - **LOC**: Locations (cities, countries, landmarks)
   - **PER**: Persons (names of people)
   - **ORG**: Organizations (companies, institutions)
   - **MISC**: Miscellaneous entities (events, products, etc.)

3. **Deduplication with Sets**: `entities_by_type[entity.entity].add(word)`
   - Using Python sets automatically removes duplicate entities
   - Important when the same entity appears multiple times in text

4. **Metadata Preservation**:
   - Original URL is maintained for source tracking
   - Entities are stored as comma-separated strings for easy export

**Output Format:**
Each processed document contains:
- `document_id`: Unique identifier
- `content`: Full text content
- `meta`: Dictionary with entity categories and source URL
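
The filtering and deduplication described above can be tried in isolation, without loading a model. The following is a minimal sketch using only the standard library; `Annotation` is a hypothetical stand-in for the annotation objects stored under `named_entities`, and `group_entities` is an illustrative helper, not part of Haystack. Unlike the component above, it sorts each set before joining so that the output order is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Annotation:
    """Hypothetical stand-in for an entity annotation produced by the extractor."""
    entity: str   # label such as "PER" or "ORG"
    start: int    # start offset into the text
    end: int      # end offset (exclusive)
    score: float  # model confidence between 0 and 1


def group_entities(text: str, annotations: List[Annotation],
                   min_score: float = 0.8) -> Dict[str, str]:
    """Filter by score, deduplicate per label, and join as comma-separated strings."""
    by_type = {"LOC": set(), "PER": set(), "ORG": set(), "MISC": set()}
    for ann in annotations:
        if ann.score < min_score or ann.entity not in by_type:
            continue
        by_type[ann.entity].add(text[ann.start:ann.end])
    # Sorting makes the joined string deterministic; set iteration order is not.
    return {label: ",".join(sorted(words)) for label, words in by_type.items()}


text = "Elon Musk founded SpaceX. Elon Musk also leads Tesla."
annotations = [
    Annotation("PER", 0, 9, 0.99),
    Annotation("ORG", 18, 24, 0.97),
    Annotation("PER", 26, 35, 0.98),  # repeated mention, collapsed by the set
    Annotation("ORG", 47, 52, 0.55),  # below the threshold, dropped
]
result = group_entities(text, annotations)
print(result)
```

Here the low-scoring "Tesla" annotation is discarded and the repeated "Elon Musk" mention is stored once, which mirrors what `NERPopulator` does for each document.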

## Step 3: Build the Complete Haystack Pipeline

### Pipeline Components Explained

In this step, we'll initialize each component and connect them into a working pipeline.

**Component Configuration:**

1. **SearchApiWebSearch**
   - `top_k=5`: Returns the top 5 search results
   - `api_key`: Loaded securely from environment variables
   - `allowed_domains`: Restricts results to Britannica.com for quality content

2. **LinkContentFetcher**
   - `retry_attempts=3`: Retries failed downloads up to 3 times
   - `timeout=10`: Waits maximum 10 seconds per request

3. **HTMLToDocument**
   - Converts raw HTML into structured Haystack Document objects

4. **DocumentCleaner**
   - `remove_empty_lines=True`: Removes blank lines
   - `remove_extra_whitespaces=True`: Normalizes spacing
   - `remove_substrings=['\n-']`: Removes specific unwanted patterns

5. **NamedEntityExtractor**
   - `backend="hugging_face"`: Uses HuggingFace models
   - `model="dslim/bert-base-NER"`: Pre-trained BERT model fine-tuned for NER
   - `warm_up()`: Loads the model into memory for faster inference

6. **NERPopulator**
   - Our custom component for structuring the results

In [14]:

# Initialize pipeline
pipeline = Pipeline()
web_search = SearchApiWebSearch(top_k=5,
                                api_key=Secret.from_env_var("SEARCH_API_KEY"),
                                allowed_domains=["https://www.britannica.com/"])
link_content = LinkContentFetcher(retry_attempts=3,
                                  timeout=10)
html_to_doc = HTMLToDocument()
document_cleaner = DocumentCleaner(
                                remove_empty_lines=True,
                                remove_extra_whitespaces=True,
                                remove_repeated_substrings=False,
                                remove_substrings=['\n-']
                            )
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()

ner_component = NERPopulator()

# Add components
pipeline.add_component(name='search', instance=web_search)
pipeline.add_component(name='fetcher', instance=link_content)
pipeline.add_component(name='htmldocument', instance=html_to_doc)
pipeline.add_component(name='cleaner', instance=document_cleaner)
pipeline.add_component(name='extractor', instance=extractor)
pipeline.add_component(name='ner', instance=ner_component)

# Connect components to one another
pipeline.connect("search.links", "fetcher.urls")
pipeline.connect("fetcher", "htmldocument")
pipeline.connect("htmldocument", "cleaner")
pipeline.connect("cleaner", "extractor")
pipeline.connect("extractor", "ner")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps


<haystack.core.pipeline.pipeline.Pipeline object at 0x3288769c0>
🚅 Components
  - search: SearchApiWebSearch
  - fetcher: LinkContentFetcher
  - htmldocument: HTMLToDocument
  - cleaner: DocumentCleaner
  - extractor: NamedEntityExtractor
  - ner: NERPopulator
🛤️ Connections
  - search.links -> fetcher.urls (list[str])
  - fetcher.streams -> htmldocument.sources (list[ByteStream])
  - htmldocument.documents -> cleaner.documents (list[Document])
  - cleaner.documents -> extractor.documents (list[Document])
  - extractor.documents -> ner.sources (list[Document])

### Understanding Pipeline Connections

**The Data Flow:**

```
1. search (SearchApiWebSearch)
   ↓ outputs: links
2. fetcher (LinkContentFetcher)
   ↓ outputs: streams (raw HTML)
3. htmldocument (HTMLToDocument)
   ↓ outputs: documents
4. cleaner (DocumentCleaner)
   ↓ outputs: documents (cleaned)
5. extractor (NamedEntityExtractor)
   ↓ outputs: documents (with named_entities in meta)
6. ner (NERPopulator)
   ↓ outputs: documents (structured)
```

**Key Concepts:**

- **Component Naming**: Each component gets a unique name (e.g., 'search', 'fetcher')
- **Output Routing**: We specify which output connects to which input
  - Example: `"search.links"` → `"fetcher.urls"`
- **Automatic Type Matching**: Haystack validates that outputs match expected inputs
- **Sequential Processing**: Each component processes data from the previous one

**Why This Order Matters:**
1. Can't extract entities before fetching content
2. Can't clean text before converting HTML
3. Can't structure results before extracting entities

This is a **linear pipeline** - each step depends on the previous one.

## Step 4: Run the Pipeline

### How Pipeline Execution Works

When you call `pipeline.run()`, Haystack:
1. Passes your input to the first component ('search')
2. Each component processes data and passes results to the next
3. Returns the outputs that no downstream component consumed (here, chiefly the result of the last component, 'ner')

**Input Format:**
```python
data={"search": {"query": query}}
```
- Dictionary keys match component names
- Nested dictionary contains the component's input parameters
- For 'search', we only need to provide the 'query' parameter

**What Happens During Execution:**
1. SearchAPI finds 5 relevant articles about "Elon Musk"
2. LinkContentFetcher downloads HTML from each URL
3. HTMLToDocument converts HTML to text
4. DocumentCleaner removes noise and formatting
5. NamedEntityExtractor identifies people, places, organizations
6. NERPopulator structures the entities into our custom format

**Expected Processing Time:**
- Web search: ~1-2 seconds
- Content fetching: ~3-10 seconds (depends on website response)
- NER extraction: ~5-10 seconds (model inference)
- Total: ~10-20 seconds for 5 articles
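
These timings vary with network conditions and hardware, so it can help to measure them yourself. Below is a minimal sketch of a timing wrapper built on `time.perf_counter`; `timed` is a hypothetical helper name, and the call to `sum` is only a stand-in so the example runs without the pipeline.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; with the pipeline from this notebook the call would be
# timed(pipeline.run, data={"search": {"query": "Elon Musk"}}).
total, seconds = timed(sum, range(1000))
print(f"result={total}, elapsed={seconds:.4f}s")
```

This measures the whole run; timing individual stages would require running the components separately.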

In [15]:
query = "Elon Musk"
output = pipeline.run(data={"search":{"query":query}})

### Understanding the Output Structure

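By default, `pipeline.run()` returns only the outputs that no other component consumed, keyed by component name. In this pipeline that is the result of 'ner', plus any unconnected outputs of earlier components (for example, the `documents` output of 'search', since only `links` is connected to the fetcher). Intermediate results are included only if you request them with the `include_outputs_from` argument. Structure:
```python
{
    'search': {...},      # Unconnected outputs of the search component, if any
    'ner': {              # Our final structured output
        'documents': [...]
    }
}
```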

We're interested in the final output from our custom 'ner' component.

In [16]:
extracted_documents = output['ner']['documents']

## Step 5: Extract and Examine Results

Now we'll extract just the final processed documents from our pipeline output.

**What's in extracted_documents?**
Each document is a dictionary containing:
- `document_id`: Unique identifier
- `content`: Full article text
- `meta`: Dictionary with:
  - `LOC`: Comma-separated locations (e.g., "California,Texas,Mars")
  - `PER`: Comma-separated person names (e.g., "Elon Musk,Jeff Bezos")
  - `ORG`: Comma-separated organizations (e.g., "Tesla,SpaceX,NASA")
  - `MISC`: Miscellaneous entities (e.g., "Cybertruck,Model 3")
  - `url`: Source URL for the article
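
Because each entity field is a comma-separated string, it has to be split again before any counting. As a rough sketch (standard library only), the helper below counts in how many documents each entity of a given label appears. `count_entities` and `sample_documents` are hypothetical; the sample mimics the shape of `extracted_documents` and is not real pipeline output.

```python
from collections import Counter
from typing import Any, Dict, List


def count_entities(documents: List[Dict[str, Any]], label: str) -> Counter:
    """Count in how many documents each entity of the given label appears."""
    counts: Counter = Counter()
    for doc in documents:
        raw = doc["meta"].get(label, "")
        # An empty string means no entities; drop blanks produced by splitting.
        names = {name.strip() for name in raw.split(",") if name.strip()}
        counts.update(names)
    return counts


# Hypothetical sample in the same shape as extracted_documents.
sample_documents = [
    {"document_id": "a", "content": "...", "meta": {"ORG": "Tesla,SpaceX", "url": "N/A"}},
    {"document_id": "b", "content": "...", "meta": {"ORG": "Tesla", "url": "N/A"}},
    {"document_id": "c", "content": "...", "meta": {"ORG": "", "url": "N/A"}},
]
org_counts = count_entities(sample_documents, "ORG")
print(org_counts.most_common())
```

Note that this splitting assumes entity names never contain a comma themselves; storing lists instead of joined strings would avoid that limitation.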

## Step 6: Convert to DataFrame for Analysis

### Why Use a DataFrame?

Converting our results to a pandas DataFrame provides:
- **Tabular View**: Easy-to-read table format
- **Data Analysis**: Use pandas methods for filtering, grouping, aggregating
- **Export Options**: Save to CSV, Excel, or databases
- **Visualization**: Create charts and graphs

### What You'll See

The DataFrame will have columns:
- `document_id`: Unique ID for each article
- `content`: Full article text (may be long)
- `meta`: Nested dictionary with entity categories and URL

**Tip**: You can expand the meta column to see all extracted entities for each document.
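
One way to expand the nested `meta` column is to flatten each record before building the DataFrame. The sketch below uses only the standard library; `flatten_records` is a hypothetical helper and `sample` is made-up data in the same shape as `extracted_documents`.

```python
from typing import Any, Dict, List


def flatten_records(documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Lift every key of the nested 'meta' dictionary to a top-level field."""
    flat = []
    for doc in documents:
        row = {key: value for key, value in doc.items() if key != "meta"}
        row.update(doc.get("meta", {}))
        flat.append(row)
    return flat


# Hypothetical record in the same shape as extracted_documents.
sample = [{
    "document_id": "a",
    "content": "text",
    "meta": {"LOC": "Texas", "PER": "", "ORG": "Tesla", "MISC": "", "url": "N/A"},
}]
rows = flatten_records(sample)
print(sorted(rows[0]))
```

Passing `rows` to `pd.DataFrame` then gives one column per entity type. pandas also offers `pd.json_normalize`, which flattens nested dictionaries directly and names the resulting columns `meta.LOC`, `meta.PER`, and so on.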

In [17]:
import pandas as pd
df = pd.DataFrame(extracted_documents)

df

| | document_id | content | meta |
|---|---|---|---|
| 0 | b673a37853a56857b7ca61fa0597deb527c8b1aff16e1f... | Elon Musk\nWhen was Elon Musk born?\nElon Musk... | {'LOC': 'Canada,Kingston,Pretoria,South Africa... |
| 1 | 0375af1ce02ed33eb7f2d3655a45341afa0a79ceb5108f... | Zip2 Date: 1995 - 1999 Ticker: HPQ Share price... | {'LOC': '', 'PER': 'Kim,bal Musk,Richard Sorki... |
| 2 | 6542168e6211791c375551bfc8a3f069180b96bf692e8c... | Department of Government Efficiency\nWhat is t... | {'LOC': 'U.S.,.S.', 'PER': 'Donald Trump,Mu,on... |
| 3 | 20063c6de70ed2964ab610b9da0132f72fc0da2deaa2db... | Tesla, Inc. formerly (2003–17): Tesla Motors D... | {'LOC': 'Austin,Texas', 'PER': 'Nikola Tesla,M... |
| 4 | 3808990b1c9929bc0c40cca51a77685593e5354bbd2bd4... | Tesla under Musk: New models, battery technolo... | {'LOC': 'Buffalo,Nevada,Shanghai,Europe,Berlin... |


### Exploring the DataFrame

**Useful operations you can try:**

```python
# See all columns
df.columns

# View just the metadata
df['meta']

# Access specific entity types for first document
df['meta'][0]['PER']  # People mentioned
df['meta'][0]['ORG']  # Organizations mentioned
df['meta'][0]['LOC']  # Locations mentioned

# Count documents
len(df)

# Save to CSV
df.to_csv('ner_results.csv', index=False)
```

In [18]:
extracted_documents

[{'document_id': 'b673a37853a56857b7ca61fa0597deb527c8b1aff16e1f106ede9013be3fb999',
  'content': 'Elon Musk\nWhen was Elon Musk born?\nElon Musk was born on June 28, 1971, in Pretoria, South Africa.Where did Elon Musk go to school?\nElon Musk went to Queen’s University in Kingston, Ontario, Canada, and then transferred to the University of Pennsylvania in Philadelphia, where he received bachelor’s degrees in physics and economics in 1997. He began graduate school in physics at Stanford University but left after several days to focus on being an entrepreneur.What are Elon Musk’s top accomplishments?\nElon Musk cofounded the electronic payment firm PayPal, and in 2002 he founded SpaceX, a company that makes rockets and spacecraft. He was a major early funder of Tesla, which makes electric cars and batteries, and became its chief executive officer in 2008. He cofounded Neuralink, a neurotechnology company, in 2016. Musk purchased the social media service Twitter in 2022 and renamed

### Raw Output Inspection

This cell displays the raw list of dictionaries before DataFrame conversion. Useful for:
- Debugging data structure issues
- Understanding the exact format returned by our pipeline
- Verifying that entities were extracted correctly

**What to look for:**
- Are all entity categories populated?
- Are there any empty strings or missing data?
- Do the URLs look correct?
- Is the confidence threshold (0.8) filtering appropriately?
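
Some of these checks can be automated. The DataFrame output above contains fragments such as "Mu", ".S." and "bal Musk", which look like partial words rather than full entities. The sketch below flags empty categories and suspicious-looking entities using a simple heuristic (very short strings, or strings that do not start with an uppercase letter or digit). `audit_document` and the sample record are hypothetical, and the heuristic will misfire on legitimate names such as "iPhone", so treat its output as hints only.

```python
from typing import Any, Dict, List

LABELS = ("LOC", "PER", "ORG", "MISC")


def audit_document(doc: Dict[str, Any], min_length: int = 3) -> Dict[str, List[str]]:
    """Report empty entity categories and entities that look like fragments."""
    meta = doc["meta"]
    empty = [label for label in LABELS if not meta.get(label)]
    suspicious = []
    for label in LABELS:
        for name in meta.get(label, "").split(","):
            if not name:
                continue
            # Heuristic: too short, or not starting with an uppercase letter or digit.
            starts_ok = name[0].isupper() or name[0].isdigit()
            if len(name) < min_length or not starts_ok:
                suspicious.append(name)
    return {"empty_labels": empty, "suspicious_entities": suspicious}


# Hypothetical record in the same shape as one item of extracted_documents.
sample_doc = {
    "document_id": "example",
    "content": "...",
    "meta": {"LOC": "U.S.,.S.", "PER": "Donald Trump,Mu,bal Musk",
             "ORG": "", "MISC": "", "url": "https://example.com/article"},
}
report = audit_document(sample_doc)
print(report)
```

If many fragments are flagged, adjusting the score threshold or post-processing the extractor's annotations may be worth investigating.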

## Key Takeaways and Best Practices

### What We've Accomplished

1. **Built a Complete NER Pipeline**: From web search to structured entity extraction
2. **Created a Custom Component**: Extended Haystack with specialized processing logic
3. **Implemented Quality Filtering**: Used confidence scores to ensure accuracy
4. **Structured Unstructured Data**: Converted web articles into analyzable entities
