# Reader Exercise: Building a Multi-Task NLP Agent with Classification and NER

## 🎯 Exercise Objective

In this exercise, you will:

1. **Build a Custom NER Component**: Create a component that extracts named entities from web articles and stores them in document metadata
2. **Combine Two Pipelines**: Integrate text classification and named entity recognition into unified workflows
3. **Create SuperComponents**: Wrap pipelines with simplified interfaces
4. **Build an AI Agent**: Create an agent that can handle natural language queries for both classification and entity extraction
5. **Deploy with Hayhooks**: Serialize and deploy your pipelines as REST APIs

## 📚 What You'll Learn

- How to create custom Haystack components that combine multiple NLP tasks
- Techniques for enriching document metadata with extracted information
- Best practices for building SuperComponents from complex pipelines
- How to create tools that agents can use to process natural language requests
- Pipeline serialization and deployment strategies with Hayhooks

## 🔧 Prerequisites

Before starting, ensure you have:
- Completed the NER and text classification tutorial notebooks
- A working understanding of Haystack pipeline architecture
- API keys configured in your `.env` file:
  - `OPENAI_API_KEY`
  - `SEARCH_API_KEY` or `SERPERDEV_API_KEY`

## 📋 Exercise Structure

This exercise is divided into the following sections:

1. **Setup and Imports**
2. **Part 1: Build Custom NER Component**
3. **Part 2: Create Classification Pipeline**
4. **Part 3: Build Combined NER + Classification Pipeline**
5. **Part 4: Create SuperComponents**
6. **Part 5: Build Component Tools for Agent**
7. **Part 6: Implement the Multi-Task Agent**
8. **Part 7: Serialize Pipelines**
9. **Part 8: Deploy with Hayhooks**
10. **Testing and Validation**

## 💡 Tips for Success

- Review the reference notebooks before starting
- Test each component individually before combining them
- Pay attention to input/output types when connecting components
- Use clear descriptions for tools so the agent understands when to use them
- Test your agent with various natural language queries

## Section 1: Setup and Imports

### 📝 Task
Import all necessary libraries and load environment variables.

### 💡 Hint
You'll need imports from both the NER and classification notebooks, plus additional imports for agents and Hayhooks.

In [None]:
# TODO: Import Haystack core components
from haystack import Pipeline, Document, component
from haystack.utils import Secret

# TODO: Import web search and content fetching components
# Hint: SearchApiWebSearch, LinkContentFetcher, HTMLToDocument

# TODO: Import preprocessing components
# Hint: DocumentCleaner

# TODO: Import NER and classification components
# Hint: NamedEntityExtractor, TransformersZeroShotTextRouter

# TODO: Import agent-related components
# Hint: Agent, OpenAIChatGenerator, ChatMessage

# TODO: Import SuperComponent and ComponentTool
# Hint: SuperComponent, ComponentTool

# TODO: Import supporting libraries
from typing import List, Dict, Any
from dotenv import load_dotenv
import os
import pandas as pd

# Load environment variables
load_dotenv(".env")

# Verify API keys are loaded
# TODO: Add code to verify your API keys are present
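
One possible way to do the key check is sketched below. It assumes the variable names listed in the Prerequisites section; the helper name `find_missing_keys` and the "at least one of a group" rule are illustrative choices, not part of the tutorial notebooks.

```python
import os
from typing import List, Mapping, Sequence


def find_missing_keys(
    required_groups: Sequence[Sequence[str]],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return one entry per group in which no variable has a non-empty value."""
    missing = []
    for group in required_groups:
        # A group is satisfied when at least one of its variables is set.
        if not any(env.get(name) for name in group):
            missing.append(" or ".join(group))
    return missing


# Each inner tuple lists alternatives: at least one of them must be present.
REQUIRED_KEYS = [
    ("OPENAI_API_KEY",),
    ("SEARCH_API_KEY", "SERPERDEV_API_KEY"),
]

missing = find_missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing API keys:", "; ".join(missing))
else:
    print("All required API keys are present.")
```

The check only reports variable names and never prints their values, so it is safe to leave in a shared notebook.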

## Section 2: Part 1 - Build Custom NER Component

### 📝 Task
Create a custom Haystack component called `EntityExtractor` that:
1. Takes a list of documents as input
2. Extracts named entities from each document using the NER model
3. Stores entities in the document metadata organized by type (PER, ORG, LOC, MISC)
4. Filters entities by confidence score (>0.8)
5. Removes duplicate entities within each category
6. Returns the enriched documents

### 💡 Hints
- Use the `@component` decorator
- Define output types with `@component.output_types(documents=List[Document])`
- Implement a `run()` method that processes the documents
- Store entities in `document.meta['entities']` as a dictionary with keys: LOC, PER, ORG, MISC
- Use sets to automatically remove duplicates

### 📖 Reference
Review the `NERPopulator` component in the NER tutorial notebook for inspiration.

In [None]:
# TODO: Initialize the NER extractor model
# Hint: Use NamedEntityExtractor with "dslim/bert-base-NER" model

ner_extractor = None  # Replace with actual initialization

@component
class EntityExtractor:
    """
    Custom component that extracts named entities from documents
    and stores them in document metadata.
    
    This component:
    - Processes a list of documents
    - Extracts named entities using a pre-trained NER model
    - Filters entities by confidence score
    - Organizes entities by type (PER, ORG, LOC, MISC)
    - Removes duplicates within each category
    """
    
    def __init__(self, confidence_threshold: float = 0.8):
        """
        Initialize the EntityExtractor.
        
        Args:
            confidence_threshold: Minimum confidence score for entity extraction (default: 0.8)
        """
        # TODO: Store the confidence threshold
        pass
    
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, Any]:
        """
        Extract entities from documents and store in metadata.
        
        Args:
            documents: List of Haystack Document objects
            
        Returns:
            Dictionary with 'documents' key containing enriched documents
        """
        # TODO: Implement the entity extraction logic
        # Steps:
        # 1. Loop through each document
        # 2. Get the document content
        # 3. Run NER extraction on the content
        # 4. Filter entities by confidence threshold
        # 5. Organize entities by type using sets for deduplication
        # 6. Store in document.meta['entities'] as a dictionary
        # 7. Return the enriched documents
        
        enriched_documents = []
        
        for document in documents:
            # TODO: Extract content from document
            content = document.content
            
            # TODO: Run NER extraction
            # Hint: Use the ner_extractor you initialized above
            
            # TODO: Initialize entity storage by type
            entities_by_type = {
                "LOC": set(),
                "PER": set(),
                "ORG": set(),
                "MISC": set()
            }
            
            # TODO: Loop through extracted entities
            # Filter by confidence threshold
            # Add to appropriate category
            
            # TODO: Convert sets to comma-separated strings or lists
            # Store in document.meta['entities']
            
            enriched_documents.append(document)
        
        return {"documents": enriched_documents}
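
The filtering, grouping, and deduplication steps inside `run()` can be prototyped without loading a model. The sketch below uses a hypothetical `Annotation` dataclass as a stand-in for whatever the NER extractor returns (a label, a character span, and a score); the real annotation type and how you retrieve it may differ, so treat the field names as assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Annotation:
    """Hypothetical stand-in for one NER annotation."""
    entity: str   # entity type, e.g. "PER"
    start: int    # start offset in the text (inclusive)
    end: int      # end offset in the text (exclusive)
    score: float  # model confidence


def group_entities(
    text: str, annotations: Iterable[Annotation], threshold: float = 0.8
) -> Dict[str, List[str]]:
    """Keep annotations scoring above the threshold, grouped by type, without duplicates."""
    grouped = {"LOC": set(), "PER": set(), "ORG": set(), "MISC": set()}
    for ann in annotations:
        if ann.score <= threshold or ann.entity not in grouped:
            continue  # skip low-confidence or unknown entity types
        grouped[ann.entity].add(text[ann.start:ann.end])
    # Sorted lists are easier to serialize and compare than sets.
    return {label: sorted(values) for label, values in grouped.items()}


text = "Marie Curie worked in Paris. Marie Curie won two Nobel Prizes."
annotations = [
    Annotation("PER", 0, 11, 0.99),
    Annotation("LOC", 22, 27, 0.97),
    Annotation("PER", 29, 40, 0.98),   # duplicate mention, removed by the set
    Annotation("MISC", 49, 61, 0.55),  # below the threshold, filtered out
]
entities = group_entities(text, annotations)
print(entities)
```

Once this logic behaves as expected, the returned dictionary is what you would assign to `document.meta['entities']` inside the component.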

### Test Your EntityExtractor Component

Before integrating into a pipeline, test your component with sample documents.

In [None]:
# TODO: Create test documents
# test_docs = [Document(content="Your test text here")]

# TODO: Initialize your EntityExtractor
# entity_extractor = EntityExtractor()

# TODO: Run the component
# result = entity_extractor.run(documents=test_docs)

# TODO: Print the results to verify it works
# print(result['documents'][0].meta.get('entities'))

## Section 3: Part 2 - Create Classification Pipeline

### 📝 Task
Build a pipeline that:
1. Searches the web for articles
2. Fetches and converts HTML content
3. Cleans the text
4. Classifies articles into categories

### 💡 Hints
- Reuse the classification components from the tutorial
- Use the `NewsClassifier` component pattern
- Categories: Politics, Sport, Technology, Entertainment, Business

### 📖 Reference
Review the classification tutorial notebook for the complete pipeline structure.
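
If your classifier produces a score per category, the label selection itself is a small, testable step. The following is a minimal sketch under that assumption; `pick_label`, the `min_score` parameter, and the example scores are all hypothetical and are not taken from the tutorial's `NewsClassifier`.

```python
from typing import Dict, Optional

CATEGORIES = ["Politics", "Sport", "Technology", "Entertainment", "Business"]


def pick_label(scores: Dict[str, float], min_score: float = 0.0) -> Optional[str]:
    """Return the highest-scoring known category, or None if none reaches min_score."""
    known = {label: score for label, score in scores.items() if label in CATEGORIES}
    if not known:
        return None
    label, score = max(known.items(), key=lambda item: item[1])
    return label if score >= min_score else None


# Made-up scores, for illustration only.
example_scores = {
    "Technology": 0.81,
    "Business": 0.12,
    "Politics": 0.04,
    "Sport": 0.02,
    "Entertainment": 0.01,
}
print(pick_label(example_scores))
```

Returning `None` for low-confidence results gives downstream code an explicit signal instead of a misleading category.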

In [None]:
# TODO: Initialize the zero-shot text router for classification
# Hint: Use TransformersZeroShotTextRouter with appropriate labels

# TODO: Create the NewsClassifier component (refer to classification notebook)

# TODO: Initialize the classification pipeline
classification_pipeline = Pipeline()

# TODO: Add components to the pipeline
# - SearchApiWebSearch
# - LinkContentFetcher
# - HTMLToDocument
# - DocumentCleaner
# - NewsClassifier

# TODO: Connect the components
# Follow the pattern: search -> fetcher -> html -> cleaner -> classifier

print("Classification pipeline created successfully!")

### Test the Classification Pipeline

In [None]:
# TODO: Test the classification pipeline with a query
# test_query = "artificial intelligence"
# classification_result = classification_pipeline.run(data={"search": {"query": test_query}})

# TODO: Display the results
# for doc in classification_result['classifier']['documents']:
#     print(f"Classification: {doc.meta.get('labels')}")
#     print(f"URL: {doc.meta.get('url')}")
#     print("---")

## Section 4: Part 3 - Build Combined NER + Classification Pipeline

### 📝 Task
Create a unified pipeline that performs BOTH classification and entity extraction:
1. Searches and fetches articles
2. Cleans the content
3. Classifies articles into categories
4. Extracts named entities

### 💡 Hints
- You can chain the classifier and entity extractor in sequence
- Both components should add metadata without overwriting each other
- Test that both `labels` and `entities` appear in the final document metadata

### 🤔 Design Decision
Should classification happen before or after entity extraction? Consider:
- Do entities affect classification?
- Does classification affect entity extraction?
- Which order makes more logical sense?
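
To reason about the ordering question, it can help to model the metadata updates on plain dictionaries. The sketch below assumes each step writes only its own key (`labels` or `entities`, as named in the hints); the helper functions are illustrative and stand in for your two components.

```python
from typing import Any, Dict, List


def with_label(meta: Dict[str, Any], label: str) -> Dict[str, Any]:
    """Return a copy of meta with the classification stored under 'labels'."""
    return {**meta, "labels": label}


def with_entities(meta: Dict[str, Any], entities: Dict[str, List[str]]) -> Dict[str, Any]:
    """Return a copy of meta with extracted entities stored under 'entities'."""
    return {**meta, "entities": entities}


meta = {"url": "https://example.com/article"}
entities = {"ORG": ["Tesla"], "PER": [], "LOC": [], "MISC": []}

classify_first = with_entities(with_label(meta, "Technology"), entities)
extract_first = with_label(with_entities(meta, entities), "Technology")

# Because each step touches a different key, the order does not change the result.
print(classify_first == extract_first)
print(sorted(classify_first))
```

If neither step reads the other's output, the order is mainly a readability choice; it starts to matter only if one component uses the other's metadata as input.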

In [None]:
# TODO: Create a combined pipeline
combined_pipeline = Pipeline()

# TODO: Initialize all necessary components
# - Web search
# - Link fetcher
# - HTML converter
# - Document cleaner
# - Classification component
# - Entity extraction component

# TODO: Add components to the pipeline

# TODO: Connect components in logical order
# Recommended flow: search -> fetch -> convert -> clean -> classify -> extract entities

print("Combined NER + Classification pipeline created successfully!")

### Test the Combined Pipeline

In [None]:
# TODO: Test the combined pipeline
# test_query = "Elon Musk"
# combined_result = combined_pipeline.run(data={"search": {"query": test_query}})

# TODO: Verify both classification and entities are present
# for doc in combined_result['entity_extractor']['documents']:
#     print(f"Classification: {doc.meta.get('labels')}")
#     print(f"Entities: {doc.meta.get('entities')}")
#     print(f"URL: {doc.meta.get('url')}")
#     print("---")

## Section 5: Part 4 - Create SuperComponents

### 📝 Task
Wrap your pipelines as SuperComponents with simplified interfaces.

Create three SuperComponents:
1. **Classification SuperComponent**: Takes query, returns classified documents
2. **NER SuperComponent**: Takes query, returns documents with entities
3. **Combined SuperComponent**: Takes query, returns documents with both classification and entities

### 💡 Hints
- Use `input_mapping` to map external parameters to internal component inputs
- Use `output_mapping` to expose specific outputs from the pipeline
- Review the NER tutorial notebook for SuperComponent examples

### 📖 Reference
```python
SuperComponent(
    pipeline=your_pipeline,
    input_mapping={"query": ["search.query"]},
    output_mapping={"component_name.output": "new_name"}
)
```
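
Conceptually, `input_mapping` routes one external argument to one or more `component.input` targets. The plain-Python sketch below illustrates that idea only; `expand_inputs` is a hypothetical helper written for this explanation and is not how Haystack implements the mapping internally.

```python
from typing import Any, Dict, List


def expand_inputs(
    input_mapping: Dict[str, List[str]], **kwargs: Any
) -> Dict[str, Dict[str, Any]]:
    """Turn flat keyword arguments into nested per-component pipeline inputs."""
    data: Dict[str, Dict[str, Any]] = {}
    for name, value in kwargs.items():
        for target in input_mapping[name]:
            # "search.query" -> component "search", input "query"
            component_name, input_name = target.split(".", 1)
            data.setdefault(component_name, {})[input_name] = value
    return data


mapping = {"query": ["search.query"]}
print(expand_inputs(mapping, query="machine learning"))
# {'search': {'query': 'machine learning'}}
```

The result has the same shape as the `data` argument passed to `pipeline.run()` in the earlier test cells, which is why a wrapped pipeline can be called with just `query=...`.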

In [None]:
from haystack import SuperComponent

# TODO: Create Classification SuperComponent
classification_super = None  # Replace with SuperComponent initialization

# TODO: Create NER SuperComponent
# ner_super = SuperComponent(
#     pipeline=...,
#     input_mapping={...},
#     output_mapping={...}
# )

# TODO: Create Combined SuperComponent
# combined_super = SuperComponent(
#     pipeline=...,
#     input_mapping={...},
#     output_mapping={...}
# )

print("SuperComponents created successfully!")

### Test SuperComponents

In [None]:
# TODO: Test each SuperComponent independently
# test_query = "machine learning"

# Test classification
# result_classification = classification_super.run(query=test_query)

# Test NER
# result_ner = ner_super.run(query=test_query)

# Test combined
# result_combined = combined_super.run(query=test_query)

## Section 6: Part 5 - Build Component Tools for Agent

### 📝 Task
Convert your SuperComponents into tools that an AI agent can use.

Create three tools:
1. **Classification Tool**: For classifying articles
2. **NER Tool**: For extracting entities
3. **Combined Tool**: For both tasks simultaneously

### 💡 Critical: Tool Descriptions
The tool description is crucial! The agent uses it to decide when to use each tool.

Good description format:
- Clear purpose statement
- Input parameters explained
- Output format described
- Use cases or examples

### 📖 Reference
```python
ComponentTool(
    name="tool_name",
    component=your_super_component,
    description="Clear description of what this tool does..."
)
```
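
To keep the three descriptions consistent with the format listed above (purpose, inputs, output, examples), you could assemble them from parts. This is only a sketch: `build_tool_description` is a hypothetical helper, and the wording of the example description is a suggestion.

```python
from typing import Dict, List


def build_tool_description(
    purpose: str, inputs: Dict[str, str], output: str, examples: List[str]
) -> str:
    """Assemble a description with purpose, inputs, output, and example requests."""
    lines = [purpose.strip(), "", "Inputs:"]
    lines += [f"- {name}: {text}" for name, text in inputs.items()]
    lines += ["", f"Output: {output.strip()}", "", "Example requests:"]
    lines += [f"- {example}" for example in examples]
    return "\n".join(lines)


classifier_description = build_tool_description(
    purpose="Searches the web for articles on a topic and classifies each article.",
    inputs={"query": "the search topic, e.g. 'climate change'"},
    output=(
        "Documents whose metadata holds one category out of Politics, Sport, "
        "Technology, Entertainment, Business."
    ),
    examples=["Classify articles about climate change"],
)
print(classifier_description)
```

The resulting string can be passed as the `description` argument when you create each `ComponentTool`.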

In [None]:
from haystack.tools.component_tool import ComponentTool

# TODO: Create Classification Tool
classification_tool = None  # Replace with ComponentTool initialization
# Suggested name: "article_classifier"
# Description should explain: what it classifies, categories used, input/output

# TODO: Create NER Tool
# ner_tool = ComponentTool(
#     name="entity_extractor",
#     component=ner_super,
#     description="..."
# )

# TODO: Create Combined Tool
# combined_tool = ComponentTool(
#     name="classify_and_extract",
#     component=combined_super,
#     description="..."
# )

print("Component tools created successfully!")

## Section 7: Part 6 - Implement the Multi-Task Agent

### 📝 Task
Create an AI agent that can:
1. Understand natural language requests
2. Decide which tool to use (classification, NER, or both)
3. Execute the appropriate tool
4. Summarize results for the user

### 💡 Hints
- Use a capable chat model such as GPT-4 or GPT-4o-mini for more reliable tool selection
- Write a comprehensive system prompt that explains each tool
- Include example queries in the system prompt
- Test with various natural language requests

### 🎯 Example Queries to Handle
- "Classify articles about climate change"
- "Extract entities from articles about Tesla"
- "Classify and extract entities from news about artificial intelligence"
- "What people and organizations are mentioned in articles about SpaceX?"
- "What category do articles about quantum computing fall into?"

In [None]:
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# TODO: Initialize the LLM for the agent
# agent_llm = OpenAIChatGenerator(
#     api_key=Secret.from_env_var("OPENAI_API_KEY"),
#     model="gpt-4o-mini"  # or "gpt-4"
# )

# TODO: Write a comprehensive system prompt
system_prompt = """
You are a helpful NLP assistant with access to tools for analyzing web articles.

Available tools:
1. article_classifier: Classifies articles into categories (Politics, Sport, Technology, Entertainment, Business)
2. entity_extractor: Extracts named entities (people, organizations, locations, misc) from articles
3. classify_and_extract: Performs both classification and entity extraction simultaneously

When users ask you to:
- Only classify articles → Use article_classifier
- Only extract entities → Use entity_extractor
- Do both tasks → Use classify_and_extract

Always:
- Parse the user's query to identify the search topic
- Choose the appropriate tool based on the request
- Summarize the results clearly
- List key findings (classifications, entities, or both)

Be conversational and helpful!
"""

# TODO: Create the agent with all three tools
# tools = [classification_tool, ner_tool, combined_tool]

# nlp_agent = Agent(
#     chat_generator=agent_llm,
#     tools=tools,
#     system_prompt=system_prompt
# )

print("Multi-task NLP agent created successfully!")

### Test the Agent with Various Queries

In [None]:
# TODO: Test Query 1 - Classification only
# query1 = "Classify articles about renewable energy"
# result1 = nlp_agent.run(messages=[ChatMessage.from_user(query1)])
# print(result1['last_message'].text)  # use the public .text property rather than private _content

In [None]:
# TODO: Test Query 2 - Entity extraction only
# query2 = "Extract all people and organizations from articles about Apple Inc"
# result2 = nlp_agent.run(messages=[ChatMessage.from_user(query2)])
# print(result2['last_message'].text)  # use the public .text property rather than private _content

In [None]:
# TODO: Test Query 3 - Both classification and entity extraction
# query3 = "Classify and extract entities from articles about artificial intelligence"
# result3 = nlp_agent.run(messages=[ChatMessage.from_user(query3)])
# print(result3['last_message'].text)  # use the public .text property rather than private _content

## Section 8: Part 7 - Serialize Pipelines

### 📝 Task
Serialize your pipelines to YAML format for deployment with Hayhooks.

Save three pipeline files:
1. `classification_pipeline.yaml`
2. `ner_pipeline.yaml`
3. `combined_pipeline.yaml`

### 💡 Hints
- Use `pipeline.dumps()` to serialize to YAML string
- Save to the `pipelines/` directory
- Verify the YAML files are valid
- Check that API keys are properly referenced (should use env variables)

### ⚠️ Security Note
Never hardcode API keys in YAML files! Always use `Secret.from_env_var()`.
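
A quick heuristic scan of the serialized YAML can catch an accidentally embedded key before you commit or deploy the file. The sketch below assumes that leaked keys look like OpenAI-style `sk-...` strings; the pattern, the function name, and both sample YAML snippets are illustrative, and a clean result is not proof that no secret is present.

```python
import re
from typing import List

# Heuristic: OpenAI-style keys start with "sk-" followed by a long token.
SUSPICIOUS = re.compile(r"sk-[A-Za-z0-9_-]{16,}")


def find_hardcoded_secrets(yaml_text: str) -> List[int]:
    """Return the 1-based line numbers that appear to contain a literal API key."""
    return [
        number
        for number, line in enumerate(yaml_text.splitlines(), start=1)
        if SUSPICIOUS.search(line)
    ]


# Illustrative snippets, not real pipeline output.
safe_yaml = (
    "init_parameters:\n"
    "  api_key:\n"
    "    type: env_var\n"
    "    env_vars:\n"
    "    - OPENAI_API_KEY\n"
)
leaky_yaml = (
    "init_parameters:\n"
    "  api_key: sk-abcdefghijklmnopqrstuvwxyz123456\n"
)

print(find_hardcoded_secrets(safe_yaml))
print(find_hardcoded_secrets(leaky_yaml))
```

Reporting line numbers instead of the matched text avoids echoing a real key into notebook output.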

In [None]:
# TODO: Serialize classification pipeline
# os.makedirs("pipelines", exist_ok=True)  # create the output directory if it does not exist
# classification_yaml = classification_pipeline.dumps()
# with open("pipelines/classification_pipeline.yaml", "w") as f:
#     f.write(classification_yaml)

# TODO: Serialize NER pipeline
# Create a standalone NER pipeline if needed (without classification)

# TODO: Serialize combined pipeline
# combined_yaml = combined_pipeline.dumps()
# with open("pipelines/combined_pipeline.yaml", "w") as f:
#     f.write(combined_yaml)

print("Pipelines serialized successfully!")
print("Files saved in pipelines/ directory")

### Verify Serialized Pipelines

In [None]:
# TODO: Load and verify a serialized pipeline
# test_pipeline = Pipeline.loads(classification_yaml)
# test_result = test_pipeline.run(data={"search": {"query": "test"}})
# print("Pipeline loaded and tested successfully!")

## Section 9: Part 8 - Deploy with Hayhooks

### 📝 Task
Deploy your pipelines as REST APIs using Hayhooks.

### 🚀 Deployment Steps

1. **Install Hayhooks** (if not already installed):
   ```bash
   pip install hayhooks
   ```

2. **Start Hayhooks Server**:
   ```bash
   hayhooks run --pipelines-dir pipelines/
   ```

3. **Test Endpoints**:
   - List pipelines: `GET http://localhost:1416/pipelines`
   - Run pipeline: `POST http://localhost:1416/pipelines/classification_pipeline`

### 💡 Hints
- Hayhooks automatically creates REST endpoints from YAML files
- Each pipeline gets its own endpoint
- Use tools like `curl`, Postman, or Python `requests` library to test
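- Check the Hayhooks documentation for advanced configuration; endpoint paths can differ between Hayhooks versions, so confirm them against your running server (for example, its interactive API docs page, typically at `/docs`)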

### 📖 References
- Review the Yelp Navigator Hayhooks guide in your workspace
- Check `yelp-navigator/yelp-navigator-hayhooks-guide.md`

In [None]:
# This cell provides Python code to test your deployed Hayhooks endpoints

import requests
import json

# TODO: Define Hayhooks server URL
HAYHOOKS_URL = "http://localhost:1416"

def list_pipelines():
    """List all deployed pipelines."""
    # TODO: Implement function to list pipelines
    # Hint: GET request to /pipelines endpoint
    pass

def run_classification(query: str):
    """Run classification pipeline via Hayhooks."""
    # TODO: Implement function to call classification endpoint
    # Hint: POST request to /pipelines/classification_pipeline
    # Body should contain the search query
    pass

def run_ner(query: str):
    """Run NER pipeline via Hayhooks."""
    # TODO: Implement function to call NER endpoint
    pass

def run_combined(query: str):
    """Run combined pipeline via Hayhooks."""
    # TODO: Implement function to call combined endpoint
    pass

# TODO: Test the functions
# print(list_pipelines())
# print(run_classification("machine learning"))
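
Before sending anything, it can be useful to build and inspect the request that each helper would send. The sketch below uses only the standard library and follows the endpoint path and JSON body shown in this exercise; `build_run_request` is a hypothetical helper, and the path is an assumption that may need adjusting for your Hayhooks version.

```python
import json
from urllib.request import Request


def build_run_request(base_url: str, pipeline_name: str, query: str) -> Request:
    """Build (but do not send) a POST request that runs a deployed pipeline."""
    # Assumed path, taken from the exercise text: /pipelines/<pipeline_name>
    url = f"{base_url.rstrip('/')}/pipelines/{pipeline_name}"
    body = json.dumps({"search": {"query": query}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "http://localhost:1416", "classification_pipeline", "machine learning"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, the same object can be sent with `urllib.request.urlopen(request)`, or you can reuse the URL and body with `requests.post` in the functions above.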

### Terminal Commands for Hayhooks

Run these commands in your terminal:

```bash
# Start Hayhooks server
hayhooks run --pipelines-dir pipelines/

# In another terminal, test with curl:
curl http://localhost:1416/pipelines

# Test classification endpoint
curl -X POST http://localhost:1416/pipelines/classification_pipeline \
  -H "Content-Type: application/json" \
  -d '{"search": {"query": "artificial intelligence"}}'

# Test combined endpoint
curl -X POST http://localhost:1416/pipelines/combined_pipeline \
  -H "Content-Type: application/json" \
  -d '{"search": {"query": "Elon Musk"}}'
```

## Section 10: Testing and Validation

### 📝 Comprehensive Testing Checklist

Test your complete system:

#### ✅ Component Testing
- [ ] EntityExtractor extracts entities correctly
- [ ] NewsClassifier classifies articles correctly
- [ ] Both components preserve document metadata

#### ✅ Pipeline Testing
- [ ] Classification pipeline works end-to-end
- [ ] NER pipeline works end-to-end
- [ ] Combined pipeline works end-to-end
- [ ] Pipelines handle various query topics

#### ✅ Agent Testing
- [ ] Agent correctly interprets classification requests
- [ ] Agent correctly interprets NER requests
- [ ] Agent correctly interprets combined requests
- [ ] Agent provides clear, helpful responses
- [ ] Agent handles ambiguous queries appropriately

#### ✅ Deployment Testing
- [ ] Pipelines serialize to valid YAML
- [ ] YAML files load without errors
- [ ] Hayhooks server starts successfully
- [ ] All endpoints are accessible
- [ ] Endpoints return expected results

### 🧪 Test Cases

Run these test cases and verify results:

In [None]:
# Test Case 1: Simple Classification
# TODO: Test with a Technology topic
# Expected: Articles should be classified as "Technology"

In [None]:
# Test Case 2: Simple NER
# TODO: Test with a person's name (e.g., "Marie Curie")
# Expected: Should extract PER, ORG, LOC entities

In [None]:
# Test Case 3: Combined Task
# TODO: Test with a company name (e.g., "Tesla")
# Expected: Should have both classification and entities

In [None]:
# Test Case 4: Agent Intelligence
# TODO: Test with an ambiguous query
# Example: "Tell me about articles on quantum computing"
# Expected: Agent should decide which tool(s) to use

## 🎓 Summary and Key Learnings

### What You've Accomplished

Congratulations! You've built a sophisticated multi-task NLP system that:

1. ✅ **Combines Multiple NLP Tasks**: Classification and entity extraction in one system
2. ✅ **Uses Custom Components**: Extended Haystack with specialized functionality
3. ✅ **Implements SuperComponents**: Created reusable pipeline wrappers
4. ✅ **Built an AI Agent**: Enabled natural language interaction with NLP pipelines
5. ✅ **Deployed as APIs**: Made pipelines accessible via REST endpoints

### Key Concepts Mastered

- **Component Design**: Creating custom Haystack components with proper input/output types
- **Pipeline Composition**: Connecting multiple components into coherent workflows
- **Metadata Management**: Enriching documents without losing information
- **Agent Architecture**: Building intelligent systems that choose appropriate tools
- **API Deployment**: Serializing and deploying ML pipelines as web services

### Real-World Applications

Your system can be used for:
- **Content Analysis Platforms**: Automatically categorize and extract key information from articles
- **Media Monitoring**: Track mentions of entities across different content categories
- **Research Tools**: Systematically analyze and categorize large document collections
- **News Aggregation**: Classify articles and identify key people, organizations, and locations

### Next Steps

Extend your system further:

1. **Add More NLP Tasks**:
   - Sentiment analysis
   - Summarization
   - Question answering
   - Relation extraction

2. **Enhance Entity Processing**:
   - Entity linking (connect to knowledge bases)
   - Coreference resolution
   - Entity disambiguation

3. **Improve Classification**:
   - Add hierarchical categories
   - Support multi-label classification
   - Fine-tune models on domain-specific data

4. **Scale the System**:
   - Add caching for repeated queries
   - Implement batch processing
   - Deploy on cloud infrastructure
   - Add database storage for results

5. **Enhance the Agent**:
   - Add conversation memory
   - Support follow-up questions
   - Implement result filtering and refinement
   - Add visualization tools

### Resources for Further Learning

- **Haystack Documentation**: https://docs.haystack.deepset.ai/
- **Hayhooks Documentation**: https://docs.haystack.deepset.ai/docs/hayhooks
- **HuggingFace Models**: https://huggingface.co/models
- **NER Research**: Explore different NER models and techniques
- **Zero-Shot Learning**: Deep dive into zero-shot classification methods

## 📝 Reflection Questions

Take a moment to reflect on your learning:

1. **What was the most challenging part of this exercise?**
   - Building custom components?
   - Connecting pipeline components?
   - Designing the agent system?
   - Deploying with Hayhooks?

2. **How would you modify this system for your own use case?**
   - Different NLP tasks?
   - Different data sources?
   - Different classification categories?

3. **What performance optimizations would you consider?**
   - Caching strategies?
   - Parallel processing?
   - Model selection?

4. **How would you evaluate the quality of results?**
   - Classification accuracy?
   - Entity extraction precision/recall?
   - Agent response quality?

Write your reflections in the markdown cell below:

### Your Reflections

*(Add your thoughts here)*