# Pipeline 2: Document Store with Website Content

## Overview
This notebook implements a Haystack pipeline that turns Pipeline 1 results into documents ready for a document store. It takes the business search results (which already include each business's details and website URL), fetches each website, and produces Haystack Documents that combine the website text with the business metadata.

## What This Pipeline Does
1. Accepts business results list from Pipeline 1 (each containing bizId, name, website, rating, etc.)
2. Extracts website URLs from the business data
3. Fetches and converts website content to Haystack Documents
4. Returns Documents that carry both business metadata and website content, ready to be written to a document store

## Use Cases
- Creating searchable document stores from business data
- Website content extraction and indexing
- Enriching business metadata with website content
- Preparing data for RAG (Retrieval Augmented Generation)

## Pipeline Architecture
```
Business Results (from Pipeline 1) → Extract Websites → Fetch Content → HTML to Document → Enriched Documents
```

## Input Format
Pipeline 2 accepts the **complete Pipeline 1 output** with this structure:
```python
{
  'yelp_search': {
    'results': {
      'resultCount': 11,
      'currentPage': 1,
      'totalPages': 2,
      'location': {'city': 'Madison'},
      'results': [
        {
          'bizId': 'abc123',
          'name': 'Business Name',
          'alias': 'business-name-location',
          'website': 'https://example.com',
          'rating': 4.5,
          'reviewCount': 100,
          'categories': ['Category1', 'Category2'],
          'priceRange': '$$',
          'phone': '(555) 123-4567',
          'lat': 43.05,
          'lon': -89.52,
          'services': [],
          'businessHighlights': [],
          'images': [...]
        },
        ...
      ]
    }
  }
}
```

The `Pipeline1ResultParser` component automatically extracts the business list from `pipeline1_output['yelp_search']['results']['results']`.
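
The lookup path can be tried out on a plain dictionary. The sketch below is not the component itself: `extract_business_results` is a hypothetical helper that follows the same `yelp_search` → `results` → `results` path and, as an added assumption about how malformed input might look, returns an empty list whenever a level is missing or is not a dictionary.

```python
from typing import Any, Dict, List


def extract_business_results(pipeline1_output: Any) -> List[Dict[str, Any]]:
    """Hypothetical helper: walk 'yelp_search' -> 'results' -> 'results'."""
    node: Any = pipeline1_output
    for key in ("yelp_search", "results"):
        # Stop early if the current level is not a dictionary.
        if not isinstance(node, dict):
            return []
        node = node.get(key, {})
    if not isinstance(node, dict):
        return []
    businesses = node.get("results", [])
    # Anything other than a list is treated as "no results".
    return businesses if isinstance(businesses, list) else []


sample = {
    "yelp_search": {
        "results": {"resultCount": 1, "results": [{"name": "Business Name"}]}
    }
}
print(extract_business_results(sample))  # the single business dictionary
print(extract_business_results({}))      # missing keys
print(extract_business_results(None))    # not a dictionary at all
```

Returning an empty list instead of raising keeps downstream steps simple, at the cost of hiding malformed input unless it is also logged.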

## Setup and Environment Variables

No API keys are needed for this pipeline: it processes data from Pipeline 1 and fetches the public business websites over HTTP.

In [1]:
# Import required libraries
from haystack import Pipeline, component, Document
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from typing import List, Dict, Any
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

print("✓ Libraries loaded successfully")

✓ Libraries loaded successfully
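
Because `logging.basicConfig` sets the root logger to `INFO`, third-party loggers print at that level too. If the per-request and per-component lines become too noisy, one option is to raise the level for those loggers only. The logger names below (`httpx`, `haystack`) match the prefixes that appear in this notebook's log output; treat them as assumptions and adjust them to whatever names show up in your own environment.

```python
import logging

# Raise the threshold for third-party loggers only; loggers created with
# logging.getLogger(__name__ + "...") in this notebook keep the root INFO level.
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("haystack").setLevel(logging.WARNING)
```

Child loggers such as `haystack.core.pipeline.pipeline` inherit the level from `haystack` unless they set their own.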


## Custom Component 1: Pipeline 1 Result Parser

This component parses the full Pipeline 1 output and extracts the business results list.

In [2]:
@component
class Pipeline1ResultParser:
    """
    Parses the full Pipeline 1 output to extract business results.
    
    This component:
    1. Accepts the complete Pipeline 1 output dictionary
    2. Navigates the nested structure to find business results
    3. Returns the business results list for downstream processing
    
    Input:
        - pipeline1_output (Dict): Complete output from Pipeline 1
    
    Output:
        - business_results (List[Dict]): List of business dictionaries
    """
    
    def __init__(self):
        """Initialize the component with a logger."""
        self.logger = logging.getLogger(__name__ + ".Pipeline1ResultParser")
    
    @component.output_types(business_results=List[Dict])
    def run(self, pipeline1_output: Dict) -> Dict[str, List[Dict]]:
        """
        Parse Pipeline 1 output to extract business results.
        
        Args:
            pipeline1_output: Full output dictionary from Pipeline 1
                Expected structure: {'yelp_search': {'results': {'results': [...]}}}
            
        Returns:
            Dictionary with business_results key containing list of businesses
        """
        self.logger.info("Parsing Pipeline 1 output")
        
        try:
            # Navigate the nested structure
            yelp_search = pipeline1_output.get('yelp_search', {})
            results_wrapper = yelp_search.get('results', {})
            business_results = results_wrapper.get('results', [])
            
            result_count = results_wrapper.get('resultCount', 0)
            location = results_wrapper.get('location', {})
            
            self.logger.info(f"Extracted {len(business_results)} businesses from Pipeline 1")
            self.logger.debug(f"Result count: {result_count}, Location: {location}")
            
            return {"business_results": business_results}
            
        except Exception as e:
            self.logger.error(f"Error parsing Pipeline 1 output: {e}", exc_info=True)
            return {"business_results": []}

print("✓ Pipeline1ResultParser component defined")

✓ Pipeline1ResultParser component defined


## Custom Component 2: Website URL Extractor

This component extracts website URLs from business results and prepares metadata.
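
The component accepts any non-empty value other than `'N/A'` as a website. If stricter checks are wanted, a small validation step could run first. The sketch below is one possible approach, not part of the notebook: `normalize_website` is a hypothetical helper, and defaulting scheme-less values to `https://` is an assumption.

```python
from typing import Optional
from urllib.parse import urlparse


def normalize_website(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return a fetchable http(s) URL, or None."""
    if not raw:
        return None
    candidate = raw.strip()
    if not candidate or candidate.upper() == "N/A":
        return None
    # Assumption: values without a scheme (e.g. "example.com") default to https.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    # Only http and https URLs with a host part are considered fetchable.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return candidate


for value in ["https://example.com", "  example.com ", "N/A", "", None, "ftp://example.com"]:
    print(repr(value), "->", normalize_website(value))
```

Inside the extractor, such a helper would replace the inline `website and website.strip() and website != 'N/A'` check, with the normalized value appended to `urls`.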

In [3]:
@component
class WebsiteURLExtractor:
    """
    Extracts website URLs from business results.
    
    This component:
    1. Accepts business results list from Pipeline1ResultParser
    2. Extracts website URLs for each business with a website
    3. Prepares business metadata for document enrichment
    4. Returns URLs to fetch and corresponding metadata
    
    Input:
        - business_results (List[Dict]): Business data from parser
    
    Output:
        - urls (List[str]): Website URLs to fetch
        - business_metadata (List[Dict]): Associated business information
    """
    
    def __init__(self):
        """Initialize the component with a logger."""
        self.logger = logging.getLogger(__name__ + ".WebsiteURLExtractor")
    
    @component.output_types(
        urls=List[str],
        business_metadata=List[Dict]
    )
    def run(self, business_results: List[Dict]) -> Dict[str, Any]:
        """
        Extract website URLs and prepare metadata from business results.
        
        Args:
            business_results: List of business dictionaries
            
        Returns:
            Dictionary with urls and business_metadata
        """
        if not business_results:
            self.logger.warning("No business results provided")
            return {"urls": [], "business_metadata": []}
        
        self.logger.info(f"Processing {len(business_results)} business results")
        
        urls = []
        metadata_list = []
        
        for business in business_results:
            website = business.get('website')
            business_name = business.get('name', 'Unknown')
            
            if website and website.strip() and website != 'N/A':
                urls.append(website)
                self.logger.info(f"Extracted website for '{business_name}': {website}")
                
                # Prepare metadata from Pipeline 1 data
                meta = {
                    "business_id": business.get('bizId', ''),
                    "business_name": business_name,
                    "business_alias": business.get('alias', ''),
                    "price_range": business.get('priceRange', 'N/A'),
                    "latitude": business.get('lat', 0.0),
                    "longitude": business.get('lon', 0.0),
                    "rating": business.get('rating', 0.0),
                    "review_count": business.get('reviewCount', 0),
                    "phone": business.get('phone', ''),
                    "categories": business.get('categories', []),
                    "services": business.get('services', []),
                    "business_highlights": business.get('businessHighlights', []),
                    "website": website,
                    "images": business.get('images', [])
                }
                metadata_list.append(meta)
                self.logger.debug(f"Prepared metadata for '{business_name}'")
            else:
                self.logger.warning(f"No website found for business: '{business_name}'")
        
        self.logger.info(f"Successfully extracted {len(urls)} website URLs")
        
        return {
            "urls": urls,
            "business_metadata": metadata_list
        }

print("✓ WebsiteURLExtractor component defined")

✓ WebsiteURLExtractor component defined


In [4]:
# Test the components with sample data in Pipeline 1 output format
parser = Pipeline1ResultParser()
url_extractor = WebsiteURLExtractor()

# Sample Pipeline 1 output (full structure)
test_pipeline1_output = {
    'yelp_search': {
        'results': {
            'resultCount': 2,
            'currentPage': 1,
            'totalPages': 1,
            'location': {'city': 'Madison'},
            'results': [
                {
                    'bizId': 'Xp_cWXY5rxDLkX-wqUg-iQ',
                    'name': "Brennan's Market",
                    'alias': 'brennans-market-madison-3',
                    'serviceArea': None,
                    'lat': 43.0556119,
                    'lon': -89.524489,
                    'rating': 4.5,
                    'reviewCount': 81,
                    'categories': ['Off Licence', 'Greengrocers', 'Cheese Shops'],
                    'services': [],
                    'businessHighlights': [],
                    'priceRange': '$$',
                    'phone': '(608) 833-2893',
                    'website': 'https://www.brennansmarket.com',
                    'images': ['https://s3-media0.fl.yelpcdn.com/bphoto/WuL0iPar3ea_lGrMInAPzw/348s.jpg']
                },
                {
                    'bizId': 'RJNAeNA-209sctUO0dmwuA',
                    'name': 'The Old Fashioned',
                    'alias': 'the-old-fashioned-madison',
                    'serviceArea': None,
                    'lat': 43.073904,
                    'lon': -89.384872,
                    'rating': 4.0,
                    'reviewCount': 2156,
                    'categories': ['American', 'Bars'],
                    'services': [],
                    'businessHighlights': [],
                    'priceRange': '$$',
                    'phone': '(608) 310-4545',
                    'website': 'https://www.theoldfashioned.com',
                    'images': []
                }
            ]
        }
    }
}

# Test the parser
parsed = parser.run(pipeline1_output=test_pipeline1_output)
print(f"Parsed {len(parsed['business_results'])} businesses")

# Test the URL extractor
result = url_extractor.run(business_results=parsed['business_results'])

2025-11-07 16:41:50,512 - __main__.Pipeline1ResultParser - INFO - Parsing Pipeline 1 output
2025-11-07 16:41:50,512 - __main__.Pipeline1ResultParser - INFO - Extracted 2 businesses from Pipeline 1
2025-11-07 16:41:50,513 - __main__.WebsiteURLExtractor - INFO - Processing 2 business results
2025-11-07 16:41:50,513 - __main__.WebsiteURLExtractor - INFO - Extracted website for 'Brennan's Market': https://www.brennansmarket.com
2025-11-07 16:41:50,513 - __main__.WebsiteURLExtractor - INFO - Extracted website for 'The Old Fashioned': https://www.theoldfashioned.com
2025-11-07 16:41:50,513 - __main__.WebsiteURLExtractor - INFO - Successfully extracted 2 website URLs

Parsed 2 businesses


In [5]:
print(f"URLs: {result['urls']}")
print(f"\nMetadata count: {len(result['business_metadata'])}")
print(f"\nFirst business metadata:")
result['business_metadata'][0]

URLs: ['https://www.brennansmarket.com', 'https://www.theoldfashioned.com']

Metadata count: 2

First business metadata:


{'business_id': 'Xp_cWXY5rxDLkX-wqUg-iQ',
 'business_name': "Brennan's Market",
 'business_alias': 'brennans-market-madison-3',
 'price_range': '$$',
 'latitude': 43.0556119,
 'longitude': -89.524489,
 'rating': 4.5,
 'review_count': 81,
 'phone': '(608) 833-2893',
 'categories': ['Off Licence', 'Greengrocers', 'Cheese Shops'],
 'services': [],
 'business_highlights': [],
 'website': 'https://www.brennansmarket.com',
 'images': ['https://s3-media0.fl.yelpcdn.com/bphoto/WuL0iPar3ea_lGrMInAPzw/348s.jpg']}

## Custom Component 3: Document Metadata Enricher

This component enriches HTML documents with business metadata for the document store.
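
Enrichment merges two dictionaries, so it helps to know which side wins when both define the same key. With dictionary unpacking, later entries override earlier ones. The sketch below uses plain dictionaries with made-up values; the colliding `content_type` key is deliberately contrived to show the precedence.

```python
# Metadata as a converter might attach it (illustrative values).
converter_meta = {"content_type": "text/html", "url": "https://example.com"}

# Business metadata with one deliberately colliding key.
business_meta = {
    "business_name": "Business Name",
    "website": "https://example.com",
    "content_type": "overridden",
}

# Later unpacking wins: business_meta overrides converter_meta on collisions.
merged = {**converter_meta, **business_meta}
print(merged["content_type"])  # -> overridden
print(sorted(merged))          # -> ['business_name', 'content_type', 'url', 'website']
```

If converter metadata should take precedence instead, swapping the order of the two unpacked dictionaries is enough.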

In [6]:
@component
class DocumentMetadataEnricher:
    """
    Enriches documents with business metadata.
    
    This component:
    1. Receives documents from HTMLToDocument converter
    2. Matches documents with corresponding business metadata
    3. Enriches document metadata with business information
    4. Returns fully enriched Haystack Documents
    
    Input:
        - documents (List[Document]): Documents from HTMLToDocument
        - business_metadata (List[Dict]): Business metadata from extractor
    
    Output:
        - documents (List[Document]): Enriched documents with full metadata
    """
    
    def __init__(self):
        """Initialize the component with a logger."""
        self.logger = logging.getLogger(__name__ + ".DocumentMetadataEnricher")
    
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document], business_metadata: List[Dict]) -> Dict[str, List[Document]]:
        """
        Enrich documents with business metadata.
        
        Args:
            documents: Documents containing website content
            business_metadata: Business information to add to documents
            
        Returns:
            Dictionary with enriched documents
        """
        self.logger.info(f"Enriching {len(documents)} documents with business metadata")
        
        enriched_documents = []
        
        # Match documents with metadata by URL rather than by list position,
        # so a website that yields no document cannot shift later matches.
        # The fetcher stores the requested URL under doc.meta['url'].
        metadata_by_url = {meta.get('website'): meta for meta in business_metadata}
        
        for i, doc in enumerate(documents):
            meta = metadata_by_url.get(doc.meta.get('url'))
            if meta is not None:
                business_name = meta.get('business_name', 'Unknown')
                
                self.logger.debug(f"Enriching document {i+1} for business: '{business_name}'")
                
                # Create enriched document
                enriched_doc = Document(
                    content=doc.content,
                    meta={
                        **doc.meta,  # Keep original metadata (URL, etc.)
                        **meta       # Add business metadata
                    }
                )
                enriched_documents.append(enriched_doc)
                
                content_length = len(doc.content or "")
                self.logger.info(f"Successfully enriched document for '{business_name}' (content length: {content_length} chars)")
            else:
                # If no matching metadata, keep original document
                self.logger.warning(f"No metadata available for document {i+1}, keeping original")
                enriched_documents.append(doc)
        
        self.logger.info(f"Completed enrichment of {len(enriched_documents)} documents")
        
        return {"documents": enriched_documents}

print("✓ DocumentMetadataEnricher component defined")

✓ DocumentMetadataEnricher component defined


## Build the Pipeline

Assembling all components:
1. Pipeline1ResultParser - Parse Pipeline 1 output and extract business results
2. WebsiteURLExtractor - Extract URLs and metadata from business results
3. LinkContentFetcher - Fetch website content
4. HTMLToDocument - Convert HTML to Documents
5. DocumentMetadataEnricher - Enrich documents with business metadata

In [7]:
# Initialize pipeline
pipeline = Pipeline()

# Initialize components
parser = Pipeline1ResultParser()
url_extractor = WebsiteURLExtractor()
content_fetcher = LinkContentFetcher(
    retry_attempts=2,
    timeout=10,
    raise_on_failure=False  # Don't fail pipeline if some websites timeout
)
html_converter = HTMLToDocument()
metadata_enricher = DocumentMetadataEnricher()

# Add components to pipeline
pipeline.add_component("parser", parser)
pipeline.add_component("url_extractor", url_extractor)
pipeline.add_component("content_fetcher", content_fetcher)
pipeline.add_component("html_converter", html_converter)
pipeline.add_component("metadata_enricher", metadata_enricher)

# Connect components
pipeline.connect("parser.business_results", "url_extractor.business_results")
pipeline.connect("url_extractor.urls", "content_fetcher.urls")
pipeline.connect("content_fetcher.streams", "html_converter.sources")
pipeline.connect("html_converter.documents", "metadata_enricher.documents")
pipeline.connect("url_extractor.business_metadata", "metadata_enricher.business_metadata")

print("✓ Pipeline built successfully")
print("\nPipeline structure:")
print("Pipeline 1 Full Output → Parser → URLExtractor → ContentFetcher → HTMLConverter → MetadataEnricher → Enriched Documents")

✓ Pipeline built successfully

Pipeline structure:
Pipeline 1 Full Output → Parser → URLExtractor → ContentFetcher → HTMLConverter → MetadataEnricher → Enriched Documents


## Test the Pipeline

Let's test with the full Pipeline 1 output format.

In [8]:
# Sample Pipeline 1 output (complete structure as it would come from Pipeline 1)
test_pipeline1_output = {
    'yelp_search': {
        'results': {
            'resultCount': 2,
            'currentPage': 1,
            'totalPages': 1,
            'location': {'city': 'Madison'},
            'results': [
                {
                    'bizId': 'Xp_cWXY5rxDLkX-wqUg-iQ',
                    'name': "Brennan's Market",
                    'alias': 'brennans-market-madison-3',
                    'serviceArea': None,
                    'lat': 43.0556119,
                    'lon': -89.524489,
                    'rating': 4.5,
                    'reviewCount': 81,
                    'categories': ['Off Licence', 'Greengrocers', 'Cheese Shops'],
                    'services': [],
                    'businessHighlights': [],
                    'priceRange': '$$',
                    'phone': '(608) 833-2893',
                    'website': 'https://www.brennansmarket.com',
                    'images': ['https://s3-media0.fl.yelpcdn.com/bphoto/WuL0iPar3ea_lGrMInAPzw/348s.jpg']
                },
                {
                    'bizId': 'RJNAeNA-209sctUO0dmwuA',
                    'name': 'The Old Fashioned',
                    'alias': 'the-old-fashioned-madison',
                    'serviceArea': None,
                    'lat': 43.073904,
                    'lon': -89.384872,
                    'rating': 4.0,
                    'reviewCount': 2156,
                    'categories': ['American', 'Bars'],
                    'services': [],
                    'businessHighlights': [],
                    'priceRange': '$$',
                    'phone': '(608) 310-4545',
                    'website': 'https://www.theoldfashioned.com',
                    'images': []
                }
            ]
        }
    }
}

print("Testing Pipeline 2 with full Pipeline 1 output:")
print(f"Result count: {test_pipeline1_output['yelp_search']['results']['resultCount']}")
print("="*60)

result = pipeline.run(data={
    "parser": {
        "pipeline1_output": test_pipeline1_output
    }
})

# Display results
documents = result['metadata_enricher']['documents']
print(f"\n✓ Retrieved {len(documents)} business documents")

for i, doc in enumerate(documents, 1):
    print(f"\n--- Business {i} ---")
    print(f"Name: {doc.meta.get('business_name', 'N/A')}")
    print(f"Business ID: {doc.meta.get('business_id', 'N/A')}")
    print(f"Alias: {doc.meta.get('business_alias', 'N/A')}")
    print(f"Price Range: {doc.meta.get('price_range', 'N/A')}")
    print(f"Rating: {doc.meta.get('rating', 0.0)} ({doc.meta.get('review_count', 0)} reviews)")
    print(f"Location: ({doc.meta.get('latitude', 0.0)}, {doc.meta.get('longitude', 0.0)})")
    print(f"Phone: {doc.meta.get('phone', 'N/A')}")
    print(f"Categories: {', '.join(doc.meta.get('categories', []))}")
    print(f"Website: {doc.meta.get('website', 'N/A')}")
    print(f"Content length: {len(doc.content)} characters")
    print(f"Content preview: {doc.content[:200]}...")

2025-11-07 16:41:50,676 - haystack.core.pipeline.pipeline - INFO - Running component parser
2025-11-07 16:41:50,676 - __main__.Pipeline1ResultParser - INFO - Parsing Pipeline 1 output
2025-11-07 16:41:50,676 - __main__.Pipeline1ResultParser - INFO - Extracted 2 businesses from Pipeline 1
2025-11-07 16:41:50,676 - haystack.core.pipeline.pipeline - INFO - Running component url_extractor
2025-11-07 16:41:50,677 - __main__.WebsiteURLExtractor - INFO - Processing 2 business results
2025-11-07 16:41:50,677 - __main__.WebsiteURLExtractor - INFO - Extracted website for 'Brennan's Market': https://www.brennansmarket.com
2025-11-07 16:41:50,677 - __main__.WebsiteURLExtractor - INFO - Extracted website for 'The Old Fashioned': https://www.theoldfashioned.com
2025-11-07 16:41:50,677 - __main__.WebsiteURLExtractor - INFO - Successfully extracted 2 website URLs
2025-11-07 16:41:50,678 - haystack.core.pipeline.pipeline - INFO - Running component content_fetcher

Testing Pipeline 2 with full Pipeline 1 output:
Result count: 2


2025-11-07 16:41:50,755 - httpx - INFO - HTTP Request: GET https://www.theoldfashioned.com "HTTP/1.1 301 Moved Permanently"
2025-11-07 16:41:50,773 - httpx - INFO - HTTP Request: GET https://www.brennansmarket.com "HTTP/1.1 301 Moved Permanently"
2025-11-07 16:41:50,828 - httpx - INFO - HTTP Request: GET https://theoldfashioned.com/ "HTTP/1.1 200 OK"
2025-11-07 16:41:51,159 - httpx - INFO - HTTP Request: GET https://brennansmarket.com/ "HTTP/1.1 200 OK"
2025-11-07 16:41:51,163 - haystack.core.pipeline.pipeline - INFO - Running component html_converter


✓ Retrieved 2 business documents

--- Business 1 ---
Name: Brennan's Market
Business ID: Xp_cWXY5rxDLkX-wqUg-iQ
Alias: brennans-market-madison-3
Price Range: $$
Rating: 4.5 (81 reviews)
Location: (43.0556119, -89.524489)
Phone: (608) 833-2893
Categories: Off Licence, Greengrocers, Cheese Shops
Website: https://www.brennansmarket.com
Content length: 1009 characters
Content preview: MORE WAYS TO SHOP
Design & Ship a Gift Box
Design a custom gift box with your favorite Brennan's Cheese and Sausage. Add crackers, spreads, mustard, granola, jam and more!
Shop Online- Non-Gift
Don't ...

--- Business 2 ---
Name: The Old Fashioned
Business ID: RJNAeNA-209sctUO0dmwuA
Alias: the-old-fashioned-madison
Price Range: $$
Rating: 4.0 (2156 reviews)
Location: (43.073904, -89.384872)
Phone: (608) 310-4545
Categories: American, Bars
Website: https://www.theoldfashioned.com
Content length: 274 characters
Content preview: The Old Fashioned Tavern & Restaurant
HOURS:
Monday - Thursday 11 am - 9 pm
Friday 

In [9]:
documents

[Document(id=a3825c86afe0c43640f29919bfe0c83cd7b1b365c67244c1a65592402c6427e1, content: 'MORE WAYS TO SHOP
 Design & Ship a Gift Box
 Design a custom gift box with your favorite Brennan's Che...', meta: {'content_type': 'text/html', 'url': 'https://www.brennansmarket.com', 'business_id': 'Xp_cWXY5rxDLkX-wqUg-iQ', 'business_name': "Brennan's Market", 'business_alias': 'brennans-market-madison-3', 'price_range': '$$', 'latitude': 43.0556119, 'longitude': -89.524489, 'rating': 4.5, 'review_count': 81, 'phone': '(608) 833-2893', 'categories': ['Off Licence', 'Greengrocers', 'Cheese Shops'], 'services': [], 'business_highlights': [], 'website': 'https://www.brennansmarket.com', 'images': ['https://s3-media0.fl.yelpcdn.com/bphoto/WuL0iPar3ea_lGrMInAPzw/348s.jpg']}),
 Document(id=985e2351ad800721e58de630424d3a875e13f3e6837dd439d6ec25a3a7de7db3, content: 'The Old Fashioned Tavern & Restaurant
 HOURS:
 Monday - Thursday 11 am - 9 pm
 Friday 11 am - 10 pm
 Sat...', meta: {'content_type': 'text/h

## Helper Function: Extract Document Metadata

Utility function to extract key metadata from the enriched documents for downstream use.
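
One way such a summary might be used downstream is to rank businesses before later stages run. The sketch below works on plain dictionaries shaped like the helper's output; `rank_businesses` and its `min_rating` threshold are hypothetical and not part of the notebook.

```python
from typing import Any, Dict, List


def rank_businesses(summary: List[Dict[str, Any]], min_rating: float = 0.0) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep entries with website content, best rated first."""
    usable = [
        entry for entry in summary
        if entry.get("has_website_content")
        and (entry.get("rating") or 0.0) >= min_rating
    ]
    # Sort by rating, then by review count as a tie-breaker, highest first.
    return sorted(
        usable,
        key=lambda entry: (entry.get("rating") or 0.0, entry.get("review_count") or 0),
        reverse=True,
    )


summary = [
    {"business_name": "A", "rating": 4.0, "review_count": 2156, "has_website_content": True},
    {"business_name": "B", "rating": 4.5, "review_count": 81, "has_website_content": True},
    {"business_name": "C", "rating": 5.0, "review_count": 3, "has_website_content": False},
]
ranked_names = [entry["business_name"] for entry in rank_businesses(summary, min_rating=4.0)]
print(ranked_names)  # -> ['B', 'A']
```

Entry `C` is dropped despite its rating because it has no website content to index.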

In [10]:
def extract_document_metadata(documents: List[Document]) -> List[Dict[str, Any]]:
    """
    Extract key metadata from enriched documents.
    
    Args:
        documents: List of enriched Haystack Documents
    
    Returns:
        List of metadata dictionaries with key business information
    """
    metadata_list = []
    
    for doc in documents:
        metadata = {
            "business_id": doc.meta.get("business_id"),
            "business_name": doc.meta.get("business_name"),
            "business_alias": doc.meta.get("business_alias"),
            "price_range": doc.meta.get("price_range"),
            "rating": doc.meta.get("rating"),
            "review_count": doc.meta.get("review_count"),
            "latitude": doc.meta.get("latitude"),
            "longitude": doc.meta.get("longitude"),
            "phone": doc.meta.get("phone"),
            "categories": doc.meta.get("categories"),
            "website": doc.meta.get("website"),
            "has_website_content": len(doc.content) > 0
        }
        metadata_list.append(metadata)
    
    return metadata_list

# Test the helper function
metadata_summary = extract_document_metadata(documents)
print("\nExtracted Metadata Summary:")
for meta in metadata_summary:
    print(f"\n{meta['business_name']}:")
    print(f"  - Price: {meta['price_range']}")
    print(f"  - Rating: {meta['rating']}")
    print(f"  - Website content available: {meta['has_website_content']}")


Extracted Metadata Summary:

Brennan's Market:
  - Price: $$
  - Rating: 4.5
  - Website content available: True

The Old Fashioned:
  - Price: $$
  - Rating: 4.0
  - Website content available: True


## Test with Single Business

Test the pipeline with a single business to verify it works with different input sizes.

In [11]:
# Test with single business in full Pipeline 1 output format
single_business_pipeline1_output = {
    'yelp_search': {
        'results': {
            'resultCount': 1,
            'currentPage': 1,
            'totalPages': 1,
            'location': {'city': 'Madison'},
            'results': [
                {
                    'bizId': 'Xp_cWXY5rxDLkX-wqUg-iQ',
                    'name': "Brennan's Market",
                    'alias': 'brennans-market-madison-3',
                    'lat': 43.0556119,
                    'lon': -89.524489,
                    'rating': 4.5,
                    'reviewCount': 81,
                    'categories': ['Off Licence', 'Greengrocers', 'Cheese Shops'],
                    'priceRange': '$$',
                    'phone': '(608) 833-2893',
                    'website': 'https://www.brennansmarket.com'
                }
            ]
        }
    }
}

print("Testing with single business:")
print(f"Business: {single_business_pipeline1_output['yelp_search']['results']['results'][0]['name']}")
print("="*60)

result = pipeline.run(data={
    "parser": {
        "pipeline1_output": single_business_pipeline1_output
    }
})

documents = result['metadata_enricher']['documents']
print(f"\n✓ Retrieved {len(documents)} document")

if documents:
    doc = documents[0]
    print(f"\nBusiness: {doc.meta.get('business_name')}")
    print(f"Website content length: {len(doc.content)} characters")

2025-11-07 16:41:51,289 - haystack.core.pipeline.pipeline - INFO - Running component parser
2025-11-07 16:41:51,290 - __main__.Pipeline1ResultParser - INFO - Parsing Pipeline 1 output
2025-11-07 16:41:51,290 - __main__.Pipeline1ResultParser - INFO - Extracted 1 businesses from Pipeline 1
2025-11-07 16:41:51,290 - haystack.core.pipeline.pipeline - INFO - Running component url_extractor
2025-11-07 16:41:51,291 - __main__.WebsiteURLExtractor - INFO - Processing 1 business results
2025-11-07 16:41:51,291 - __main__.WebsiteURLExtractor - INFO - Extracted website for 'Brennan's Market': https://www.brennansmarket.com
2025-11-07 16:41:51,291 - __main__.WebsiteURLExtractor - INFO - Successfully extracted 1 website URLs
2025-11-07 16:41:51,291 - haystack.core.pipeline.pipeline - INFO - Running component content_fetcher

Testing with single business:
Business: Brennan's Market


2025-11-07 16:41:51,470 - httpx - INFO - HTTP Request: GET https://brennansmarket.com/ "HTTP/1.1 200 OK"
2025-11-07 16:41:51,475 - haystack.core.pipeline.pipeline - INFO - Running component html_converter
2025-11-07 16:41:51,542 - haystack.core.pipeline.pipeline - INFO - Running component metadata_enricher
2025-11-07 16:41:51,542 - __main__.DocumentMetadataEnricher - INFO - Enriching 1 documents with business metadata
2025-11-07 16:41:51,543 - __main__.DocumentMetadataEnricher - INFO - Successfully enriched document for 'Brennan's Market' (content length: 1009 chars)
2025-11-07 16:41:51,543 - __main__.DocumentMetadataEnricher - INFO - Completed enrichment of 1 documents


✓ Retrieved 1 document

Business: Brennan's Market
Website content length: 1009 characters


In [14]:
documents[0].meta

{'content_type': 'text/html',
 'url': 'https://www.brennansmarket.com',
 'business_id': 'Xp_cWXY5rxDLkX-wqUg-iQ',
 'business_name': "Brennan's Market",
 'business_alias': 'brennans-market-madison-3',
 'price_range': '$$',
 'latitude': 43.0556119,
 'longitude': -89.524489,
 'rating': 4.5,
 'review_count': 81,
 'phone': '(608) 833-2893',
 'categories': ['Off Licence', 'Greengrocers', 'Cheese Shops'],
 'services': [],
 'business_highlights': [],
 'website': 'https://www.brennansmarket.com',
 'images': []}

## Summary

### What We Built
- **Pipeline 2** turns Pipeline 1 results into Haystack Documents ready for a document store
- Fetches website content and enriches it with business metadata
- No API keys or additional Yelp calls needed: business data comes from Pipeline 1, and only the business websites are fetched over HTTP

### Key Outputs
Each Document contains:
- **Business Metadata**: ID, name, alias, categories, services, highlights
- **Location and Contact Data**: Latitude, longitude, phone
- **Quality Metrics**: Rating, review count, price range
- **Website Content**: Website HTML converted to plain text
- **Additional Data**: Image URLs and the website URL from Pipeline 1

### Integration with Agentic RAG
This pipeline runs directly after Pipeline 1, which makes it the second or third node in the agentic RAG system depending on whether Pipeline 5 is used as the entry point:
1. **Pipeline 5** (Interactive Clarification) - First entry point for user queries
2. **Pipeline 1** (Business Search with NER) - Searches Yelp for businesses
3. **Pipeline 2** (This pipeline) - Creates document store with website content
4. **Pipeline 3** (Reviews & Sentiment) - Analyzes business reviews
5. **Pipeline 4** (Summary & Recommendations) - Generates final recommendations

### Usage Example
```python
# Get full output from Pipeline 1
from pipeline1 import pipeline as pipeline1

pipeline1_result = pipeline1.run(data={"query_converter": {"query": "cheese shops in Madison"}})

# Pass the entire Pipeline 1 output to Pipeline 2
# The parser will automatically extract the business results
pipeline2_result = pipeline.run(data={
    "parser": {
        "pipeline1_output": pipeline1_result
    }
})

# Access enriched documents for document store
documents = pipeline2_result['metadata_enricher']['documents']

# Extract metadata summary
metadata = extract_document_metadata(documents)
```

### Direct Piping
```python
# Or pipe directly from Pipeline 1 to Pipeline 2
result1 = pipeline1.run(data={"query_converter": {"query": "cheese shops in Madison"}})
result2 = pipeline.run(data={"parser": {"pipeline1_output": result1}})
```