# Pipeline 3: Business Reviews with Sentiment Analysis

## Overview
This notebook implements a Haystack pipeline that fetches business reviews from Yelp and performs sentiment analysis to identify the highest and lowest rated reviews. It uses components from the sentiment analysis pipeline to classify reviews and create enriched documents.

## What This Pipeline Does
1. Accepts business IDs from Pipeline 1
2. Fetches reviews using the GET BUSINESS REVIEWS endpoint
3. Applies sentiment analysis to each review
4. Identifies highest-rated and lowest-rated reviews
5. Returns Haystack documents with enhanced metadata containing classified reviews

## Use Cases
- Review sentiment analysis
- Identifying best and worst customer experiences
- Content analysis for business recommendations
- Review summarization and theme extraction

## Pipeline Architecture
```
Business IDs → Fetch Reviews → Create Documents → Sentiment Classification → Enrich Metadata → Enhanced Documents
```

## API Reference
- Endpoint: GET BUSINESS REVIEWS
- URL: https://yelp-business-reviews.p.rapidapi.com/reviews/{bizId}
- Path param: bizId (business ID)
- Optional params: query, page, sortBy, language
- Returns: Review text, rating, user info, timestamps
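
As a rough illustration of how the endpoint and its optional parameters fit together, the sketch below builds a request URL using only the standard library. The helper name `build_reviews_url` and the example values (`page=2`, `language="en"`) are assumptions for illustration; only the base URL and the four parameter names come from the reference above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://yelp-business-reviews.p.rapidapi.com/reviews"

def build_reviews_url(biz_id, **params):
    """Build the reviews URL for one business, dropping unset optional params."""
    # Only the four optional parameters listed in the API reference pass through.
    allowed = ("query", "page", "sortBy", "language")
    query = {k: params[k] for k in allowed if params.get(k) is not None}
    url = f"{BASE_URL}/{quote(biz_id, safe='')}"
    return f"{url}?{urlencode(query)}" if query else url

print(build_reviews_url("RJNAeNA-209sctUO0dmwuA", page=2, language="en"))
```

The resulting string could be passed to `requests.get` together with the RapidAPI headers; accepted values for `sortBy` and `language` are not listed here and would need to be checked against the API's own documentation.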

## Setup and Environment Variables

Ensure your `.env` file contains:
```
RAPID_API_KEY=your_key_here
```
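
The setup cell below prints its success message whether or not the key is actually present, so a missing key only surfaces later as a failed API call. A minimal fail-fast check could look like the following sketch; `require_env` is a hypothetical helper, not part of the notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Intended use, after load_dotenv(".env") has run:
# RAPID_API_KEY = require_env("RAPID_API_KEY")
```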

In [1]:
# Import required libraries
import requests
from dotenv import load_dotenv
import os
from haystack import Pipeline, component, Document
from haystack.components.routers import TransformersTextRouter
from typing import List, Dict, Any
import json
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Load environment variables
load_dotenv(".env")
RAPID_API_KEY = os.getenv("RAPID_API_KEY")

print("✓ Environment variables loaded successfully")

✓ Environment variables loaded successfully


## Custom Component 1: Pipeline 1 Result Parser

This component parses the full Pipeline 1 output and extracts the business results list.

In [2]:
@component
class Pipeline1ResultParser:
    """
    Parses the full Pipeline 1 output to extract business results.
    
    This component:
    1. Accepts the complete Pipeline 1 output dictionary
    2. Navigates the nested structure to find business results
    3. Extracts business IDs for downstream processing
    
    Input:
        - pipeline1_output (Dict): Complete output from Pipeline 1
    
    Output:
        - business_ids (List[str]): List of business IDs for review fetching
    """
    
    def __init__(self):
        """Initialize the component with a logger."""
        self.logger = logging.getLogger(__name__ + ".Pipeline1ResultParser")
    
    @component.output_types(business_ids=List[str])
    def run(self, pipeline1_output: Dict) -> Dict[str, List[str]]:
        """
        Parse Pipeline 1 output to extract business IDs.
        
        Args:
            pipeline1_output: Full output dictionary from Pipeline 1
                Expected structure: {'yelp_search': {'results': {'results': [...]}}}
            
        Returns:
            Dictionary with business_ids key containing list of business IDs
        """
        self.logger.info("Parsing Pipeline 1 output")
        
        try:
            # Navigate the nested structure
            yelp_search = pipeline1_output.get('yelp_search', {})
            results_wrapper = yelp_search.get('results', {})
            business_results = results_wrapper.get('results', [])
            
            result_count = results_wrapper.get('resultCount', 0)
            location = results_wrapper.get('location', {})
            
            # Extract business IDs
            business_ids = [business.get('bizId') for business in business_results if business.get('bizId')]
            
            self.logger.info(f"Extracted {len(business_ids)} business IDs from Pipeline 1")
            self.logger.debug(f"Result count: {result_count}, Location: {location}")
            self.logger.debug(f"Business IDs: {business_ids}")
            
            return {"business_ids": business_ids}
            
        except Exception as e:
            self.logger.error(f"Error parsing Pipeline 1 output: {e}", exc_info=True)
            return {"business_ids": []}

print("✓ Pipeline1ResultParser component defined")

✓ Pipeline1ResultParser component defined
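
To make the nested lookup easier to follow outside of Haystack, here is the same extraction written as a plain function over dictionaries. The function name `extract_business_ids` and the sample input are illustrative assumptions; the key names (`yelp_search`, `results`, `bizId`) are the ones the component above reads.

```python
from typing import Any, Dict, List

def extract_business_ids(pipeline1_output: Dict[str, Any]) -> List[str]:
    """Walk yelp_search -> results -> results and collect non-empty bizId values."""
    results = (
        pipeline1_output.get("yelp_search", {})
        .get("results", {})
        .get("results", [])
    )
    return [b["bizId"] for b in results if b.get("bizId")]

sample = {
    "yelp_search": {
        "results": {
            "resultCount": 2,
            "results": [
                {"bizId": "RJNAeNA-209sctUO0dmwuA", "name": "The Old Fashioned"},
                {"name": "Entry without an ID"},  # skipped: no bizId
            ],
        }
    }
}
print(extract_business_ids(sample))  # ['RJNAeNA-209sctUO0dmwuA']
```

Any missing level in the structure yields an empty list rather than an exception, which matches the component's behavior of returning no business IDs on malformed input.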


## Custom Component 2: Yelp Reviews Fetcher

This component fetches reviews for one or more businesses from the Yelp API and creates Haystack Documents with review metadata.

In [3]:
@component
class YelpReviewsFetcher:
    """
    Fetches business reviews from Yelp API and creates Documents.
    
    This component:
    1. Accepts a list of business IDs
    2. Fetches reviews for each business
    3. Creates Haystack Documents with review text and metadata
    4. Returns documents ready for sentiment analysis
    
    Input:
        - business_ids (List[str]): List of Yelp business IDs
    
    Output:
        - documents (List[Document]): Documents containing review text and metadata
    """
    
    def __init__(self, api_key: str, max_reviews_per_business: int = 10):
        """
        Initialize the reviews fetcher.
        
        Args:
            api_key: RapidAPI key for Yelp Business Reviews API
            max_reviews_per_business: Maximum reviews to keep per business
                (only the first page of results is requested)
        """
        self.api_key = api_key
        self.base_url = "https://yelp-business-reviews.p.rapidapi.com/reviews"
        self.max_reviews = max_reviews_per_business
        self.headers = {
            "x-rapidapi-key": self.api_key,
            "x-rapidapi-host": "yelp-business-reviews.p.rapidapi.com"
        }
    
    @component.output_types(documents=List[Document])
    def run(self, business_ids: List[str]) -> Dict[str, List[Document]]:
        """
        Fetch reviews and create Documents.
        
        Args:
            business_ids: List of business IDs to fetch reviews for
            
        Returns:
            Dictionary with 'documents' key containing review Documents
        """
        all_documents = []
        
        for biz_id in business_ids:
            try:
                # Construct URL with business ID
                url = f"{self.base_url}/{biz_id}"
                
                # Execute API request
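                response = requests.get(url, headers=self.headers, timeout=30)
                # Raise on 4xx/5xx so an error payload is not parsed as reviews
                response.raise_for_status()
                results = response.json()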
                
                # Extract reviews
                reviews = results.get('reviews', [])
                
                # Create documents for each review (up to max_reviews)
                for i, review in enumerate(reviews[:self.max_reviews]):
                    doc = Document(
                        content=review.get('text', ''),
                        meta={
                            "business_id": biz_id,
                            "review_id": review.get('id', f"{biz_id}_{i}"),
                            "rating": review.get('rating', 0),
                            "user_name": review.get('user', {}).get('name', 'Anonymous'),
                            "review_url": review.get('url', ''),
                            "time_created": review.get('timeCreated', ''),
                            "review_index": i
                        }
                    )
                    all_documents.append(doc)
                    
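            except Exception as e:
                # Log instead of print so failures appear alongside the pipeline logs
                logging.getLogger(__name__).error(
                    f"Error fetching reviews for business {biz_id}: {e}"
                )
                continue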
        
        return {"documents": all_documents}

print("✓ YelpReviewsFetcher component defined")

✓ YelpReviewsFetcher component defined
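
The review-to-metadata mapping can be looked at in isolation. The sketch below reproduces it as a plain function, with one small change: `review.get("user") or {}` also tolerates an explicit `null` user, which in the cell above would raise inside the loop and cause the rest of that business's reviews to be skipped. Whether the API ever returns a `null` user is an assumption, and the sample payload is invented for illustration.

```python
from typing import Any, Dict

def review_to_meta(biz_id: str, review: Dict[str, Any], index: int) -> Dict[str, Any]:
    """Map one raw review payload to the metadata dict stored on a Document."""
    # `or {}` covers an explicit null user as well as a missing key.
    user = review.get("user") or {}
    return {
        "business_id": biz_id,
        "review_id": review.get("id", f"{biz_id}_{index}"),
        "rating": review.get("rating", 0),
        "user_name": user.get("name", "Anonymous"),
        "review_url": review.get("url", ""),
        "time_created": review.get("timeCreated", ""),
        "review_index": index,
    }

# Hypothetical payload for illustration only.
sample_review = {"text": "Great curds.", "rating": 5, "user": None}
meta = review_to_meta("biz-1", sample_review, 0)
print(meta["review_id"], meta["user_name"])  # biz-1_0 Anonymous
```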


## Initialize Sentiment Analysis Model

We'll use the TransformersTextRouter with a pre-trained sentiment model to classify reviews.

In [4]:
# Initialize sentiment analysis router
print("Loading sentiment analysis model...")
sentiment_router = TransformersTextRouter(
    model="cardiffnlp/twitter-roberta-base-sentiment"
)
sentiment_router.warm_up()
print("✓ Sentiment model loaded")
print("\nModel outputs:")
print("  - LABEL_0: Negative sentiment")
print("  - LABEL_1: Neutral sentiment")
print("  - LABEL_2: Positive sentiment")

Loading sentiment analysis model...


Device set to use mps


✓ Sentiment model loaded

Model outputs:
  - LABEL_0: Negative sentiment
  - LABEL_1: Neutral sentiment
  - LABEL_2: Positive sentiment
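
The label-to-sentiment mapping printed above is small enough to keep as a single lookup with a fallback. This is only a sketch; `to_sentiment` is a hypothetical helper, and the three labels are the ones listed in the output above.

```python
SENTIMENT_MAP = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}

def to_sentiment(label):
    """Translate a raw model label into a readable sentiment, or 'unknown'."""
    return SENTIMENT_MAP.get(label, "unknown")

print(to_sentiment("LABEL_2"))  # positive
print(to_sentiment(None))       # unknown
```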


## Custom Component 3: Review Sentiment Enricher

This component enriches review documents with sentiment classifications, mapping model labels to human-readable sentiment categories.

In [5]:
@component
class ReviewSentimentEnricher:
    """
    Enriches review documents with sentiment classifications.
    
    This component:
    1. Receives text and documents from sentiment router
    2. Maps sentiment labels to human-readable categories
    3. Adds sentiment metadata to documents
    4. Returns enriched documents
    
    Input:
        - text (str): Review text from router
        - documents (List[Document]): Original review documents
        - sentiment_label (str): The label from sentiment router (LABEL_0, LABEL_1, LABEL_2)
    
    Output:
        - documents (List[Document]): Documents with added sentiment metadata
    """
    
    def __init__(self):
        """Initialize the sentiment enricher."""
        self.sentiment_map = {
            "LABEL_0": "negative",
            "LABEL_1": "neutral",
            "LABEL_2": "positive"
        }
    
    @component.output_types(documents=List[Document])
    def run(self, text: str, documents: List[Document], sentiment_label: str = None) -> Dict[str, List[Document]]:
        """
        Enrich documents with sentiment information.
        
        Args:
            text: The review text that was classified
            documents: Original review documents
            sentiment_label: Sentiment label from the router (LABEL_0, LABEL_1, or LABEL_2).
                Optional; when omitted, the sentiment is recorded as "unknown".
            
        Returns:
            Dictionary with enriched documents
        """
        # sentiment_label is not inferred from the pipeline connection; the caller
        # must pass it explicitly. Without it, matching documents are tagged
        # with sentiment "unknown".
        
        enriched_documents = []
        
        for doc in documents:
            # Only enrich the document that matches the text being processed
            if doc.content == text:
                # Map the explicit label to a sentiment; default to "unknown"
                sentiment = "unknown"
                if sentiment_label:
                    sentiment = self.sentiment_map.get(sentiment_label, "unknown")
                
                # Create enriched document
                enriched_doc = Document(
                    content=doc.content,
                    meta={
                        **doc.meta,
                        "sentiment": sentiment,
                        "sentiment_label": sentiment_label
                    }
                )
                enriched_documents.append(enriched_doc)
            else:
                enriched_documents.append(doc)
        
        return {"documents": enriched_documents}

print("✓ ReviewSentimentEnricher component defined")

✓ ReviewSentimentEnricher component defined


## Custom Component 4: Reviews Aggregator by Business

This component groups reviews by business and selects up to three highest-rated reviews (rating of 4 or more with positive sentiment) and up to three lowest-rated reviews (rating of 3 or less with negative sentiment) for each business.

In [6]:
@component
class ReviewsAggregatorByBusiness:
    """
    Aggregates reviews by business and identifies top/bottom reviews.
    
    This component:
    1. Groups reviews by business ID
    2. Identifies highest-rated reviews (high star rating + positive sentiment)
    3. Identifies lowest-rated reviews (low star rating + negative sentiment)
    4. Creates summary documents for each business
    
    Input:
        - documents (List[Document]): All review documents with sentiment
    
    Output:
        - documents (List[Document]): One document per business with aggregated review metadata
    """
    
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        """
        Aggregate reviews by business.
        
        Args:
            documents: List of review documents with sentiment metadata
            
        Returns:
            Dictionary with one document per business containing review summaries
        """
        # Group reviews by business_id
        business_reviews = {}
        
        for doc in documents:
            biz_id = doc.meta.get("business_id", "unknown")
            
            if biz_id not in business_reviews:
                business_reviews[biz_id] = []
            
            business_reviews[biz_id].append(doc)
        
        # Create aggregated documents
        aggregated_docs = []
        
        for biz_id, reviews in business_reviews.items():
            # Separate by sentiment
            positive_reviews = [r for r in reviews if r.meta.get("sentiment") == "positive"]
            negative_reviews = [r for r in reviews if r.meta.get("sentiment") == "negative"]
            neutral_reviews = [r for r in reviews if r.meta.get("sentiment") == "neutral"]
            
            # Find highest-rated reviews (high rating + positive sentiment)
            highest_rated = sorted(
                [r for r in positive_reviews if r.meta.get("rating", 0) >= 4],
                key=lambda x: x.meta.get("rating", 0),
                reverse=True
            )[:3]
            
            # Find lowest-rated reviews (low rating + negative sentiment)
            lowest_rated = sorted(
                [r for r in negative_reviews if r.meta.get("rating", 0) <= 3],
                key=lambda x: x.meta.get("rating", 0)
            )[:3]
            
            # Create summary content (without detailed review text)
            summary_content = f"Business Review Summary (ID: {biz_id})\n\n"
            summary_content += f"Total Reviews: {len(reviews)}\n"
            summary_content += f"Positive: {len(positive_reviews)}, "
            summary_content += f"Neutral: {len(neutral_reviews)}, "
            summary_content += f"Negative: {len(negative_reviews)}\n\n"
            
               
            # Create aggregated document
            agg_doc = Document(
                content=summary_content,
                meta={
                    "business_id": biz_id,
                    "total_reviews": len(reviews),
                    "positive_count": len(positive_reviews),
                    "neutral_count": len(neutral_reviews),
                    "negative_count": len(negative_reviews),
                    "highest_rated_reviews": [
                        {
                            "rating": r.meta.get("rating"),
                            "sentiment": r.meta.get("sentiment"),
                            "text": r.content,
                            "user": r.meta.get("user_name"),
                            "url": r.meta.get("review_url")
                        }
                        for r in highest_rated
                    ],
                    "lowest_rated_reviews": [
                        {
                            "rating": r.meta.get("rating"),
                            "sentiment": r.meta.get("sentiment"),
                            "text": r.content,
                            "user": r.meta.get("user_name"),
                            "url": r.meta.get("review_url")
                        }
                        for r in lowest_rated
                    ]
                }
            )
            aggregated_docs.append(agg_doc)
        
        return {"documents": aggregated_docs}

print("✓ ReviewsAggregatorByBusiness component defined")

✓ ReviewsAggregatorByBusiness component defined
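
The selection rule is easier to inspect on plain dictionaries. The sketch below applies the same thresholds as the component (rating of 4 or more with positive sentiment, rating of 3 or less with negative sentiment, at most three of each); the function name `top_and_bottom` and the sample reviews are made up for illustration.

```python
from typing import Any, Dict, List, Tuple

Review = Dict[str, Any]

def top_and_bottom(reviews: List[Review], k: int = 3) -> Tuple[List[Review], List[Review]]:
    """Pick up to k highest-rated positive and k lowest-rated negative reviews."""
    highest = sorted(
        (r for r in reviews if r["sentiment"] == "positive" and r["rating"] >= 4),
        key=lambda r: r["rating"],
        reverse=True,
    )[:k]
    lowest = sorted(
        (r for r in reviews if r["sentiment"] == "negative" and r["rating"] <= 3),
        key=lambda r: r["rating"],
    )[:k]
    return highest, lowest

reviews = [
    {"id": "a", "rating": 5, "sentiment": "positive"},
    {"id": "b", "rating": 4, "sentiment": "positive"},
    {"id": "c", "rating": 5, "sentiment": "negative"},  # rating and sentiment disagree
    {"id": "d", "rating": 2, "sentiment": "negative"},
    {"id": "e", "rating": 1, "sentiment": "positive"},  # rating and sentiment disagree
]
highest, lowest = top_and_bottom(reviews)
print([r["id"] for r in highest], [r["id"] for r in lowest])  # ['a', 'b'] ['d']
```

One consequence worth noting: reviews whose star rating and predicted sentiment disagree (`c` and `e` here) appear in neither list, although they still count toward the per-sentiment totals.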


## Build a Simplified Pipeline for Processing Reviews

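TransformersTextRouter classifies one text string per call and emits it on an output socket named after the predicted label, so wiring it directly into the pipeline would require a separate branch for each label. Instead, we wrap the router in a single component that loops over all review documents and attaches the sentiment to each one. The ReviewSentimentEnricher defined above is not used in the final pipeline.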

In [7]:
@component
class BatchSentimentAnalyzer:
    """
    Analyzes sentiment for multiple review documents in batch.
    
    This component wraps the sentiment router to process all documents
    and add sentiment metadata efficiently.
    """
    
    def __init__(self):
        """Initialize the batch sentiment analyzer."""
        self.router = TransformersTextRouter(
            model="cardiffnlp/twitter-roberta-base-sentiment"
        )
        self.router.warm_up()
        
        self.sentiment_map = {
            "LABEL_0": "negative",
            "LABEL_1": "neutral",
            "LABEL_2": "positive"
        }
    
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        """
        Analyze sentiment for all documents.
        
        Args:
            documents: List of review documents
            
        Returns:
            Dictionary with sentiment-enriched documents
        """
        enriched_docs = []
        
        for doc in documents:
            # Run sentiment analysis
            result = self.router.run(text=doc.content)
            
            # Extract the label from the result
            # The router outputs to different sockets (LABEL_0, LABEL_1, LABEL_2)
            # We need to find which socket has output
            sentiment_label = None
            for label in ["LABEL_0", "LABEL_1", "LABEL_2"]:
                if label in result and result[label]:
                    sentiment_label = label
                    break
            
            # Map to human-readable sentiment
            sentiment = self.sentiment_map.get(sentiment_label, "unknown")
            
            # Create enriched document
            enriched_doc = Document(
                content=doc.content,
                meta={
                    **doc.meta,
                    "sentiment": sentiment,
                    "sentiment_label": sentiment_label
                }
            )
            enriched_docs.append(enriched_doc)
        
        return {"documents": enriched_docs}

print("✓ BatchSentimentAnalyzer component defined")

✓ BatchSentimentAnalyzer component defined
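
The label lookup inside `run` can be sketched on its own. This assumes, as the loop above implies, that the router returns a dictionary with a single key named after the predicted label and the input text as its value; `pick_label` is a hypothetical helper that mirrors the same truthiness check.

```python
from typing import Dict, Optional

LABELS = ("LABEL_0", "LABEL_1", "LABEL_2")

def pick_label(result: Dict[str, str]) -> Optional[str]:
    """Return the first known label whose socket carries non-empty text."""
    for label in LABELS:
        if result.get(label):
            return label
    return None

print(pick_label({"LABEL_2": "Loved the cheese curds"}))  # LABEL_2
print(pick_label({"LABEL_1": ""}))                         # None
```

Because the check is on the value rather than the key, a review with empty text ends up with no label and is recorded as "unknown", regardless of which socket the router chose.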


## Build the Complete Pipeline

Assembling all components:
1. Pipeline1ResultParser - Parse Pipeline 1 output to extract business IDs
2. YelpReviewsFetcher - Fetch reviews from Yelp API
3. BatchSentimentAnalyzer - Analyze sentiment for all reviews
4. ReviewsAggregatorByBusiness - Aggregate and summarize by business

In [8]:
# Initialize pipeline
pipeline = Pipeline()

# Initialize components
parser = Pipeline1ResultParser()
reviews_fetcher = YelpReviewsFetcher(
    api_key=RAPID_API_KEY,
    max_reviews_per_business=10
)
sentiment_analyzer = BatchSentimentAnalyzer()
reviews_aggregator = ReviewsAggregatorByBusiness()

# Add components to pipeline
pipeline.add_component("parser", parser)
pipeline.add_component("reviews_fetcher", reviews_fetcher)
pipeline.add_component("sentiment_analyzer", sentiment_analyzer)
pipeline.add_component("reviews_aggregator", reviews_aggregator)

# Connect components
pipeline.connect("parser.business_ids", "reviews_fetcher.business_ids")
pipeline.connect("reviews_fetcher.documents", "sentiment_analyzer.documents")
pipeline.connect("sentiment_analyzer.documents", "reviews_aggregator.documents")

print("✓ Pipeline built successfully")
print("\nPipeline structure:")
print("Pipeline1Output → Parser → ReviewsFetcher → SentimentAnalyzer → ReviewsAggregator → Aggregated Documents")

Device set to use mps


✓ Pipeline built successfully

Pipeline structure:
Pipeline1Output → Parser → ReviewsFetcher → SentimentAnalyzer → ReviewsAggregator → Aggregated Documents


## Test the Pipeline

Let's test with the full Pipeline 1 output structure.

In [9]:
# Create a mock Pipeline 1 output for testing
mock_pipeline1_output = {
    "yelp_search": {
        "results": {
            "resultCount": 1,
            "location": {"city": "Madison", "state": "WI"},
            "results": [
                {
                    "bizId": "RJNAeNA-209sctUO0dmwuA",
                    "name": "The Old Fashioned",
                    "alias": "the-old-fashioned-madison",
                    "rating": 4.0,
                    "categories": [{"title": "American", "alias": "newamerican"}]
                }
            ]
        }
    }
}

print("Testing Pipeline 3 with Pipeline 1 output:")
print(f"Result count: {mock_pipeline1_output['yelp_search']['results']['resultCount']}")
print(f"Location: {mock_pipeline1_output['yelp_search']['results']['location']}")
print("="*60)

result = pipeline.run(data={
    "parser": {
        "pipeline1_output": mock_pipeline1_output
    }
})

# Display results
documents = result['reviews_aggregator']['documents']
print(f"\n✓ Processed reviews for {len(documents)} business(es)")

for doc in documents:
    print(f"\n{'='*60}")
    print(f"Business ID: {doc.meta['business_id']}")
    print(f"Total Reviews Analyzed: {doc.meta['total_reviews']}")
    print(f"  Positive: {doc.meta['positive_count']}")
    print(f"  Neutral: {doc.meta['neutral_count']}")
    print(f"  Negative: {doc.meta['negative_count']}")
    
    print(f"\n--- Top {len(doc.meta['highest_rated_reviews'])} Highest-Rated Reviews ---")
    for i, review in enumerate(doc.meta['highest_rated_reviews'], 1):
        print(f"\n{i}. Rating: {review['rating']}/5 | Sentiment: {review['sentiment']}")
        print(f"   User: {review['user']}")
        print(f"   Text: {review['text'][:150]}...")
    
    print(f"\n--- Top {len(doc.meta['lowest_rated_reviews'])} Lowest-Rated Reviews ---")
    for i, review in enumerate(doc.meta['lowest_rated_reviews'], 1):
        print(f"\n{i}. Rating: {review['rating']}/5 | Sentiment: {review['sentiment']}")
        print(f"   User: {review['user']}")
        print(f"   Text: {review['text'][:150]}...")

2025-11-07 17:38:25,728 - haystack.core.pipeline.pipeline - INFO - Running component parser
2025-11-07 17:38:25,728 - __main__.Pipeline1ResultParser - INFO - Parsing Pipeline 1 output
2025-11-07 17:38:25,728 - __main__.Pipeline1ResultParser - INFO - Extracted 1 business IDs from Pipeline 1
2025-11-07 17:38:25,729 - haystack.core.pipeline.pipeline - INFO - Running component reviews_fetcher


Testing Pipeline 3 with Pipeline 1 output:
Result count: 1
Location: {'city': 'Madison', 'state': 'WI'}


2025-11-07 17:38:28,013 - haystack.core.pipeline.pipeline - INFO - Running component sentiment_analyzer
2025-11-07 17:38:28,726 - haystack.core.pipeline.pipeline - INFO - Running component reviews_aggregator



✓ Processed reviews for 1 business(es)

Business ID: RJNAeNA-209sctUO0dmwuA
Total Reviews Analyzed: 10
  Positive: 9
  Neutral: 0
  Negative: 1

--- Top 3 Highest-Rated Reviews ---

1. Rating: 5/5 | Sentiment: positive
   User: Thien-Tyler N.
   Text: Food 5/5
Service 4.5/5
Ambiance 4.5/5

A Wisconsin staple at this point, The Old Fashioned does pub food (and even some brunch) well. The cheese curds...

2. Rating: 5/5 | Sentiment: positive
   User: Caryn I.
   Text: Came to Old Fashion on a mission to finally try their famous cheese curds and a Spotted Cow on tap (because when in Wisconsin, right?). The cheese cur...

3. Rating: 5/5 | Sentiment: positive
   User: Candace R.
   Text: Visited The Old Fashioned while in Madison, WI for a friend's 40th birthday, and we ended up going twice in one day. That's how fun it was! 

It's a c...

--- Top 1 Lowest-Rated Reviews ---

1. Rating: 3/5 | Sentiment: negative
   User: Matthew C.
   Text: Just very inconsistent. I went here on a work trip

In [10]:
documents

[Document(id=e2fdae75458e6a8c6b14f398bf4072f486848a2af19cc33a8c0a9c0c687f9475, content: 'Business Review Summary (ID: RJNAeNA-209sctUO0dmwuA)
 
 Total Reviews: 10
 Positive: 9, Neutral: 0, Neg...', meta: {'business_id': 'RJNAeNA-209sctUO0dmwuA', 'total_reviews': 10, 'positive_count': 9, 'neutral_count': 0, 'negative_count': 1, 'highest_rated_reviews': [{'rating': 5, 'sentiment': 'positive', 'text': 'Food 5/5\nService 4.5/5\nAmbiance 4.5/5\n\nA Wisconsin staple at this point, The Old Fashioned does pub food (and even some brunch) well. The cheese curds were solid, the tenderloin sandwich lasted me a whole two meals, and the food the rest of my party got (a Belgian waffle and a Double Brat) also was quite tasty. Would recommend after a nice market stroll.', 'user': 'Thien-Tyler N.', 'url': 'https://www.yelp.com/biz/RJNAeNA-209sctUO0dmwuA?hrid=-AHOyDmPRahLrNXnXYS0PA'}, {'rating': 5, 'sentiment': 'positive', 'text': "Came to Old Fashion on a mission to finally try their famous cheese curds

## Test with Multiple Businesses

Test the pipeline with multiple businesses to see aggregated results.

In [11]:
# Create mock Pipeline 1 output with multiple businesses
mock_pipeline1_output_multi = {
    "yelp_search": {
        "results": {
            "resultCount": 2,
            "location": {"city": "Madison", "state": "WI"},
            "results": [
                {
                    "bizId": "RJNAeNA-209sctUO0dmwuA",
                    "name": "The Old Fashioned",
                    "alias": "the-old-fashioned-madison",
                    "rating": 4.0
                },
                {
                    "bizId": "EgtyW19V-64c6PmRuvzSEA",
                    "name": "The Great Dane",
                    "alias": "the-great-dane-madison",
                    "rating": 4.5
                }
            ]
        }
    }
}

print("Testing with multiple businesses:")
print(f"Result count: {mock_pipeline1_output_multi['yelp_search']['results']['resultCount']}")
print("="*60)

result = pipeline.run(data={
    "parser": {
        "pipeline1_output": mock_pipeline1_output_multi
    }
})

documents = result['reviews_aggregator']['documents']
print(f"\n✓ Processed reviews for {len(documents)} businesses")

for i, doc in enumerate(documents, 1):
    print(f"\n{'='*60}")
    print(f"Business {i} - ID: {doc.meta['business_id']}")
    print(f"Reviews: {doc.meta['total_reviews']} total")
    print(f"  Positive: {doc.meta['positive_count']}")
    print(f"  Neutral: {doc.meta['neutral_count']}")
    print(f"  Negative: {doc.meta['negative_count']}")
    
    sentiment_ratio = (
        doc.meta['positive_count'] / doc.meta['total_reviews'] * 100
        if doc.meta['total_reviews'] > 0 else 0
    )
    print(f"  Positive Ratio: {sentiment_ratio:.1f}%")

2025-11-07 17:38:28,747 - haystack.core.pipeline.pipeline - INFO - Running component parser
2025-11-07 17:38:28,748 - __main__.Pipeline1ResultParser - INFO - Parsing Pipeline 1 output
2025-11-07 17:38:28,749 - __main__.Pipeline1ResultParser - INFO - Extracted 2 business IDs from Pipeline 1
2025-11-07 17:38:28,749 - haystack.core.pipeline.pipeline - INFO - Running component reviews_fetcher


Testing with multiple businesses:
Result count: 2


2025-11-07 17:38:32,238 - haystack.core.pipeline.pipeline - INFO - Running component sentiment_analyzer
2025-11-07 17:38:33,078 - haystack.core.pipeline.pipeline - INFO - Running component reviews_aggregator



✓ Processed reviews for 2 businesses

Business 1 - ID: RJNAeNA-209sctUO0dmwuA
Reviews: 10 total
  Positive: 9
  Neutral: 0
  Negative: 1
  Positive Ratio: 90.0%

Business 2 - ID: EgtyW19V-64c6PmRuvzSEA
Reviews: 10 total
  Positive: 8
  Neutral: 1
  Negative: 1
  Positive Ratio: 80.0%


In [12]:
documents

[Document(id=e2fdae75458e6a8c6b14f398bf4072f486848a2af19cc33a8c0a9c0c687f9475, content: 'Business Review Summary (ID: RJNAeNA-209sctUO0dmwuA)
 
 Total Reviews: 10
 Positive: 9, Neutral: 0, Neg...', meta: {'business_id': 'RJNAeNA-209sctUO0dmwuA', 'total_reviews': 10, 'positive_count': 9, 'neutral_count': 0, 'negative_count': 1, 'highest_rated_reviews': [{'rating': 5, 'sentiment': 'positive', 'text': 'Food 5/5\nService 4.5/5\nAmbiance 4.5/5\n\nA Wisconsin staple at this point, The Old Fashioned does pub food (and even some brunch) well. The cheese curds were solid, the tenderloin sandwich lasted me a whole two meals, and the food the rest of my party got (a Belgian waffle and a Double Brat) also was quite tasty. Would recommend after a nice market stroll.', 'user': 'Thien-Tyler N.', 'url': 'https://www.yelp.com/biz/RJNAeNA-209sctUO0dmwuA?hrid=-AHOyDmPRahLrNXnXYS0PA'}, {'rating': 5, 'sentiment': 'positive', 'text': "Came to Old Fashion on a mission to finally try their famous cheese curds

In [13]:
documents[0].content

'Business Review Summary (ID: RJNAeNA-209sctUO0dmwuA)\n\nTotal Reviews: 10\nPositive: 9, Neutral: 0, Negative: 1\n\n'

## Usage with Real Pipeline 1 Output

To use this pipeline in production with actual Pipeline 1 output:

```python
# Assume you have run Pipeline 1 and got the output
# pipeline1_result = pipeline1.run(...)
# pipeline1_output = pipeline1_result  # must contain the top-level 'yelp_search' key the parser expects

# Run Pipeline 3 with Pipeline 1 output
result = pipeline.run(data={
    "parser": {
        "pipeline1_output": pipeline1_output  # Full nested structure from Pipeline 1
    }
})

# Access the results
aggregated_reviews = result['reviews_aggregator']['documents']
```

## Helper Function: Extract Review Insights

Utility function to extract key insights from the aggregated review documents.

In [14]:
def extract_review_insights(documents: List[Document]) -> Dict[str, Any]:
    """
    Extract key insights from aggregated review documents.
    
    Args:
        documents: List of aggregated review documents
    
    Returns:
        Dictionary with insights for each business
    """
    insights = {}
    
    for doc in documents:
        biz_id = doc.meta['business_id']
        total = doc.meta['total_reviews']
        
        if total == 0:
            continue
        
        insights[biz_id] = {
            "total_reviews": total,
            "sentiment_distribution": {
                "positive": doc.meta['positive_count'],
                "neutral": doc.meta['neutral_count'],
                "negative": doc.meta['negative_count']
            },
            "sentiment_percentages": {
                "positive": round(doc.meta['positive_count'] / total * 100, 1),
                "neutral": round(doc.meta['neutral_count'] / total * 100, 1),
                "negative": round(doc.meta['negative_count'] / total * 100, 1)
            },
            "overall_sentiment": (
                "positive" if doc.meta['positive_count'] > doc.meta['negative_count']
                else "negative" if doc.meta['negative_count'] > doc.meta['positive_count']
                else "neutral"
            ),
            "highest_rated_count": len(doc.meta['highest_rated_reviews']),
            "lowest_rated_count": len(doc.meta['lowest_rated_reviews'])
        }
    
    return insights

# Test the helper function
insights = extract_review_insights(documents)
print("\nReview Insights Summary:")
for biz_id, data in insights.items():
    print(f"\nBusiness ID: {biz_id}")
    print(f"  Total Reviews: {data['total_reviews']}")
    print(f"  Overall Sentiment: {data['overall_sentiment']}")
    print(f"  Distribution:")
    for sentiment, percentage in data['sentiment_percentages'].items():
        print(f"    {sentiment.capitalize()}: {percentage}%")


Review Insights Summary:

Business ID: RJNAeNA-209sctUO0dmwuA
  Total Reviews: 10
  Overall Sentiment: positive
  Distribution:
    Positive: 90.0%
    Neutral: 0.0%
    Negative: 10.0%

Business ID: EgtyW19V-64c6PmRuvzSEA
  Total Reviews: 10
  Overall Sentiment: positive
  Distribution:
    Positive: 80.0%
    Neutral: 10.0%
    Negative: 10.0%
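
Once the insights dictionary exists, comparing businesses is a short step. The sketch below ranks business IDs by their share of positive reviews; `rank_by_positive_share` is a hypothetical helper, and the sample values are the percentages printed above.

```python
from typing import Any, Dict, List

def rank_by_positive_share(insights: Dict[str, Dict[str, Any]]) -> List[str]:
    """Order business IDs from highest to lowest positive-review percentage."""
    return sorted(
        insights,
        key=lambda biz_id: insights[biz_id]["sentiment_percentages"]["positive"],
        reverse=True,
    )

sample_insights = {
    "EgtyW19V-64c6PmRuvzSEA": {"sentiment_percentages": {"positive": 80.0}},
    "RJNAeNA-209sctUO0dmwuA": {"sentiment_percentages": {"positive": 90.0}},
}
print(rank_by_positive_share(sample_insights))
# ['RJNAeNA-209sctUO0dmwuA', 'EgtyW19V-64c6PmRuvzSEA']
```

With only ten reviews per business, differences of this size are suggestive at best, so the ranking is better treated as a starting point than as a verdict.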


## Summary

### What We Built
- **Pipeline 3** successfully fetches business reviews and performs sentiment analysis
- Identifies highest and lowest rated reviews based on star ratings and sentiment
- Returns aggregated documents with comprehensive review metadata per business

### Key Outputs
Each aggregated Document contains:
- **Review Counts**: Total, positive, neutral, negative
- **Highest-Rated Reviews**: Top reviews with high ratings and positive sentiment
- **Lowest-Rated Reviews**: Bottom reviews with low ratings and negative sentiment
- **Full Review Data**: Text, ratings, sentiment, user info, URLs

### Next Steps
These enriched review documents can be:
- Passed to Pipeline 4 for summarization and recommendations
- Used to identify themes and patterns
- Analyzed for business improvement insights
- Combined with business details from Pipeline 2

### Usage Example
```python
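# Run the pipeline with the full Pipeline 1 output.
# The parser is the entry point; reviews_fetcher.business_ids is already
# connected to parser.business_ids, so it cannot be supplied directly.
result = pipeline.run(data={
    "parser": {
        "pipeline1_output": pipeline1_output
    }
})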

# Access aggregated documents
documents = result['reviews_aggregator']['documents']

# Extract insights
insights = extract_review_insights(documents)
```