🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# Advanced Branching Pipeline - Multi-Source Knowledge Graph Generation

This notebook demonstrates how to build sophisticated branching pipelines that can:
1. **Process Multiple Input Types**: Handle PDFs and web URLs in a single run, with room for other document formats
2. **Intelligent Routing**: Automatically route different content types through appropriate processing paths
3. **Unified Knowledge Graphs**: Combine information from multiple sources into a single knowledge representation
4. **Scalable Architecture**: Design patterns that can be extended to handle additional content types

## Learning Objectives

By the end of this notebook, you will understand:
- How to use Haystack's `FileTypeRouter` for automatic input type detection
- How to design branching pipelines that process heterogeneous data sources
- How to use `DocumentJoiner` to combine processed content from multiple branches
- Best practices for production-ready multi-source processing pipelines

## Key Architectural Components
- **FileTypeRouter**: Classifies each source by MIME type and sends it to the matching output
- **DocumentJoiner**: Combines documents from different processing branches
- **LinkContentFetcher + HTMLToDocument**: Web content processing branch
- **PyPDFToDocument**: PDF processing branch
- **Shared Processing Components**: Unified cleaning, splitting, and knowledge graph generation
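
To make the routing idea concrete, here is a minimal standard-library sketch of MIME-based routing. It is not Haystack's implementation: the helper `route_by_mime_type` and the file names are hypothetical, and it classifies by file extension only, assuming anything outside the allow-list lands in an `unclassified` bucket.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path


def route_by_mime_type(sources, allowed_mime_types):
    """Group paths by guessed MIME type; anything else goes to 'unclassified'."""
    routed = defaultdict(list)
    for source in sources:
        # guess_type looks only at the file extension, not the file contents.
        mime_type, _ = mimetypes.guess_type(str(source))
        key = mime_type if mime_type in allowed_mime_types else "unclassified"
        routed[key].append(source)
    return dict(routed)


# Hypothetical file names, chosen only to exercise each bucket.
routed = route_by_mime_type(
    [Path("report.pdf"), Path("notes.txt"), Path("page.html"), Path("archive.zip")],
    allowed_mime_types=["text/plain", "application/pdf", "text/html"],
)
for mime_type, paths in routed.items():
    print(mime_type, [p.name for p in paths])
```

Each bucket corresponds to one branch of the pipeline: a converter only ever sees the sources whose type it can handle.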

## Real-World Applications
This approach is essential for:
- **Enterprise Knowledge Management**: Processing diverse document collections
- **Research Data Integration**: Combining academic papers, web articles, and reports
- **Multi-Format Content Analysis**: Handling various content formats in a single workflow
- **Automated Content Pipelines**: Production systems that need to handle varied input types

In [1]:
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument, HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter)
from pathlib import Path
from scripts.knowledge_graph_component import (
    KnowledgeGraphGenerator,
    DocumentToLangChainConverter,
)
from scripts.synthetic_test_components import (
    SyntheticTestGenerator,
    TestDatasetSaver,
)

# Load environment variables
load_dotenv("./.env")

False

## Building the Advanced Branching Pipeline

### Pipeline Architecture Overview

Our advanced pipeline will follow this architecture:

```
Input Sources (PDF + Web URL)
    ↓                    ↓
FileTypeRouter    LinkContentFetcher
    ↓                    ↓  
PDFConverter      HTMLConverter
    ↓                    ↓
    └── DocumentJoiner ──┘
            ↓
    Document Processing Chain
    (Cleaner → Splitter → Converter)
            ↓
    Knowledge Graph Generator
            ↓
    Synthetic Test Generator  
            ↓
    Test Dataset Saver
```
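
One way to sanity-check an architecture like this before wiring it up is to write the stages as a dependency graph and ask for a valid execution order. The sketch below uses the standard library's `graphlib`. The stage names follow the diagram, and the extra edge from the converter to the test generator is an assumption based on how the components are connected in the build cell of this notebook.

```python
from graphlib import TopologicalSorter

# Each key lists the stages that must finish before it can run.
# Stage names follow the architecture diagram.
dependencies = {
    "PDFConverter": {"FileTypeRouter"},
    "HTMLConverter": {"LinkContentFetcher"},
    "DocumentJoiner": {"PDFConverter", "HTMLConverter"},
    "Cleaner": {"DocumentJoiner"},
    "Splitter": {"Cleaner"},
    "Converter": {"Splitter"},
    "KnowledgeGraphGenerator": {"Converter"},
    # The test generator reads both the graph and the converted documents.
    "SyntheticTestGenerator": {"KnowledgeGraphGenerator", "Converter"},
    "TestDatasetSaver": {"SyntheticTestGenerator"},
}

# static_order() raises CycleError if the graph contains a cycle.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two entry points (`FileTypeRouter` and `LinkContentFetcher`) have no prerequisites, which is what lets both branches start from the same `run` call.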

### Key Design Principles

1. **Separation of Concerns**: Each component has a single, well-defined responsibility
2. **Flexible Input Handling**: Can process multiple input types simultaneously
3. **Unified Processing**: Same downstream logic regardless of input source
4. **Extensibility**: Easy to add new input types (CSV, Word docs, etc.)
5. **Error Isolation**: Each branch can be tested and debugged independently of the others

In [None]:
# Initialize pipeline
pipeline = Pipeline()

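# Core routing and joining components
# Only the "application/pdf" output is connected below; the other MIME types
# are declared but have no branch yet.
file_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/html"])
doc_joiner = DocumentJoiner()  # Joins documents from different branches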

# Input converters for each file type
pdf_converter = PyPDFToDocument()
html_converter = HTMLToDocument()  
link_fetcher = LinkContentFetcher()

# Shared processing components
doc_cleaner = DocumentCleaner(
    remove_empty_lines=True, 
    remove_extra_whitespaces=True
)
doc_splitter = DocumentSplitter(split_by="sentence", split_length=50, split_overlap=5)
doc_converter = DocumentToLangChainConverter()
kg_generator = KnowledgeGraphGenerator(apply_transforms=True)
test_generator = SyntheticTestGenerator(
    testset_size=15,  # Larger test set for multiple sources
    llm_model="gpt-4o-mini",
    query_distribution=[
        ("single_hop", 0.3),
        ("multi_hop_specific", 0.3), 
        ("multi_hop_abstract", 0.4)
    ]
)
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_advanced_branching.csv")

# Add all components to pipeline
pipeline.add_component("file_router", file_router)
pipeline.add_component("link_fetcher", link_fetcher)
pipeline.add_component("pdf_converter", pdf_converter) 
pipeline.add_component("html_converter", html_converter)
pipeline.add_component("doc_joiner", doc_joiner)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect file routing branches
pipeline.connect("file_router.application/pdf", "pdf_converter.sources") 
pipeline.connect("link_fetcher.streams", "html_converter.sources")

# Connect converters to joiner
pipeline.connect("pdf_converter.documents", "doc_joiner.documents")
pipeline.connect("html_converter.documents", "doc_joiner.documents")

# Connect main processing path
pipeline.connect("doc_joiner.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")

<haystack.core.pipeline.pipeline.Pipeline object at 0x10751cda0>
🚅 Components
  - file_router: FileTypeRouter
  - link_fetcher: LinkContentFetcher
  - pdf_converter: PyPDFToDocument
  - html_converter: HTMLToDocument
  - doc_joiner: DocumentJoiner
  - doc_cleaner: DocumentCleaner
  - doc_splitter: DocumentSplitter
  - doc_converter: DocumentToLangChainConverter
  - kg_generator: KnowledgeGraphGenerator
  - test_generator: SyntheticTestGenerator
  - test_saver: TestDatasetSaver
🛤️ Connections
  - file_router.application/pdf -> pdf_converter.sources (list[Union[str, Path, ByteStream]])
  - link_fetcher.streams -> html_converter.sources (list[ByteStream])
  - pdf_converter.documents -> doc_joiner.documents (list[Document])
  - html_converter.documents -> doc_joiner.documents (list[Document])
  - doc_joiner.documents -> doc_cleaner.documents (list[Document])
  - doc_cleaner.documents -> doc_splitter.documents (list[Document])
  - doc_splitter.documents -> doc_converter.documents (list[Docu
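
The `doc_splitter` above is configured with `split_length=50` and `split_overlap=5`. As a rough illustration of what those two numbers mean, the following sketch cuts a list of sentences into overlapping windows. It is a simplified stand-in, not Haystack's `DocumentSplitter`: the helper `sliding_windows` is hypothetical, and real sentence detection and edge-case handling may differ.

```python
def sliding_windows(units, split_length, split_overlap):
    """Return overlapping chunks; consecutive chunks share `split_overlap` items."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # this chunk already reaches the end of the input
    return chunks


# 120 placeholder sentences stand in for a cleaned document.
sentences = [f"Sentence {i}." for i in range(120)]
chunks = sliding_windows(sentences, split_length=50, split_overlap=5)
print([len(chunk) for chunk in chunks])  # [50, 50, 30]
```

With these settings each new chunk starts 45 sentences after the previous one, so the last five sentences of one chunk reappear at the start of the next and context is preserved across chunk boundaries.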

In [None]:
# Define inputs
pdf_file = Path("./data_for_indexing/howpeopleuseai.pdf")
web_url = "https://www.tableau.com/data-insights/ai/examples"

# Run pipeline with both input types
result = pipeline.run({
    "file_router": {"sources": [pdf_file]},  # PDF input through FileTypeRouter
    "link_fetcher": {"urls": [web_url]}      # Web input through LinkContentFetcher
})
    

Applying HeadlinesExtractor: 100%|██████████| 19/19 [00:10<00:00,  1.86it/s]
Applying HeadlineSplitter: 100%|██████████| 20/20 [00:00<00:00, 526.48it/s]
Applying SummaryExtractor: 100%|██████████| 19/19 [00:12<00:00,  1.51it/s]
Applying CustomNodeFilter:   0%|          | 0/51 [00:00<?, ?it/s]
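
Because a run like this one makes many LLM calls, it can be worth checking the inputs before starting. The following is a minimal pre-flight sketch using only the standard library. The helper `check_inputs` is hypothetical and not part of Haystack, and it only checks the file extension, the file's existence, and the URL shape.

```python
from pathlib import Path
from urllib.parse import urlparse


def check_inputs(pdf_paths, urls):
    """Return a list of readable problems; an empty list means the inputs look usable."""
    problems = []
    for path in pdf_paths:
        path = Path(path)
        if path.suffix.lower() != ".pdf":
            problems.append(f"{path}: expected a .pdf extension")
        elif not path.is_file():
            problems.append(f"{path}: file not found")
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{url}: not an absolute http(s) URL")
    return problems


problems = check_inputs(
    pdf_paths=["./data_for_indexing/howpeopleuseai.pdf"],
    urls=["https://www.tableau.com/data-insights/ai/examples"],
)
print(problems or "inputs look fine")
```

Running the pipeline only when `problems` is empty avoids spending time and API budget on a run that would fail at the first converter.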

In [None]:
# Visualize the advanced branching pipeline architecture
pipeline.draw(path="./images/advanced_branching_kg_pipeline.png")
print("📸 Pipeline diagram saved to: ./images/advanced_branching_kg_pipeline.png")

📸 Pipeline diagram saved to: ./images/advanced_branching_kg_pipeline.png


![Advanced Branching Pipeline](./images/advanced_branching_kg_pipeline.png)

In [None]:
import pandas as pd

# Load and analyze results from the advanced branching pipeline
advanced_test_file = "data_for_eval/synthetic_tests_advanced_branching.csv"

if os.path.exists(advanced_test_file):
    advanced_tests_df = pd.read_csv(advanced_test_file)

    # Preview the first and last rows of the generated test set
    display(advanced_tests_df.head())
    display(advanced_tests_df.tail())
else:
    print("❌ Synthetic test file not found")
    print("Please run the previous cells to generate the test data.")

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What AI-powered service does Samsung provide?,['You may or may not be aware of how pervasive...,"Samsung provides Bixby, which is a digital ass...",single_hop_specific_query_synthesizer
1,How YouTube use AI for keep people engaged?,['Social media\nSocial media platforms are ano...,"YouTube, like other social media platforms, us...",single_hop_specific_query_synthesizer
2,How does the Mars rover Perseverance contribut...,['Some examples of industrial robots include:\...,The Mars rover Perseverance is programmed to g...,single_hop_specific_query_synthesizer
3,How does AI enhance analytics in business deci...,['Fraud prevention\nIf you have an account wit...,AI enhances analytics in business decision-mak...,single_hop_specific_query_synthesizer
4,What is OpenAI's role in the development of Ch...,['NBER WORKING PAPER SERIES\nHOW PEOPLE USE CH...,OpenAI is involved in the development of ChatG...,single_hop_specific_query_synthesizer


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
11,How does ChatGPT usage relate to fraud prevent...,['<1-hop>\n\nPanel C2.Technical Help. Panel C3...,ChatGPT usage is broadly focused on seeking in...,multi_hop_abstract_query_synthesizer
12,How does the automated classification of messa...,['<1-hop>\n\nThe left column\nshows a standalo...,The automated classification of messages ensur...,multi_hop_abstract_query_synthesizer
13,How does ChatGPT usage for decision-making var...,['<1-hop>\n\nPanel C2.Technical Help. Panel C3...,ChatGPT usage for decision-making is notably c...,multi_hop_abstract_query_synthesizer
14,What are the main writing sub-categories ident...,"['<1-hop>\n\nFor example, the five sub-categor...",The main writing sub-categories identified in ...,multi_hop_abstract_query_synthesizer
15,How does message classification relate to user...,['<1-hop>\n\nThe left column\nshows a standalo...,Message classification is performed using auto...,multi_hop_abstract_query_synthesizer


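
A quick follow-up check is to count how many questions each synthesizer produced and compare that with the `query_distribution` requested earlier. The sketch below does this with the standard library so it runs without the generated file. The in-memory CSV is a placeholder: its questions are invented, and only the two `synthesizer_name` values visible in the preview are used.

```python
import csv
import io
from collections import Counter


def count_by_synthesizer(csv_file):
    """Count rows per synthesizer_name in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["synthesizer_name"] for row in reader)


# Placeholder rows; the real file also has reference and context columns.
sample_csv = io.StringIO(
    "user_input,synthesizer_name\n"
    "placeholder question 1,single_hop_specific_query_synthesizer\n"
    "placeholder question 2,single_hop_specific_query_synthesizer\n"
    "placeholder question 3,multi_hop_abstract_query_synthesizer\n"
    "placeholder question 4,multi_hop_abstract_query_synthesizer\n"
    "placeholder question 5,multi_hop_abstract_query_synthesizer\n"
)

counts = count_by_synthesizer(sample_csv)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```

For the real dataset, pass `open(advanced_test_file, newline="", encoding="utf-8")` in place of the in-memory sample.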

## Summary and Architecture Analysis

### What We've Accomplished

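In this notebook, we built one branching pipeline in two stages:

1. **Branching Ingestion**: PDF files flow through `FileTypeRouter` and `PyPDFToDocument`, web pages through `LinkContentFetcher` and `HTMLToDocument`
2. **Shared Downstream Processing**: `DocumentJoiner` merges both branches before cleaning, splitting, knowledge graph generation, and synthetic test generation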

### Key Architectural Benefits

1. **Modularity**: Each component has a single responsibility and can be reused
2. **Flexibility**: Easy to add new input types (CSV, Word docs, etc.) 
3. **Scalability**: `DocumentJoiner` merges the output of any number of branches, so one run can cover multiple sources
4. **Consistency**: Same processing logic regardless of input source
5. **Error Isolation**: Each branch can be tested and debugged independently of the others

### Production Considerations

**Advantages of Branching Pipelines:**
- **Unified Output**: Single knowledge graph and test dataset from multiple sources
- **Rich Context**: Cross-referencing information between different document types
- **Operational Efficiency**: One pipeline deployment handles multiple scenarios
- **Quality Improvement**: More diverse source documents give the test generator broader material for synthetic questions

**When to Use Branching Pipelines:**
- Processing heterogeneous document collections
- Building comprehensive knowledge bases from multiple sources
- Creating robust test datasets that cover various content types
- Implementing production pipelines that need input flexibility


### Extension Patterns

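To add new input types:
1. Add the MIME type to `FileTypeRouter`
2. Create the appropriate converter component and add it to the pipeline
3. Connect the router's output for that MIME type to the converter, and the converter to `DocumentJoiner`
4. No changes needed to downstream processing!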

This modular approach makes the pipeline highly maintainable and extensible for future requirements.
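
The extension steps above can be sketched in plain Python as a registry of branches. This is an analogy rather than Haystack code: `make_converter`, `convert_and_join`, and the file names are hypothetical, and the converters return placeholder text instead of parsing files. The point is that adding `text/plain` touches only the registry, while the joining logic stays the same.

```python
import mimetypes


def make_converter(label):
    """Build a placeholder converter; a real branch would parse the file."""
    def convert(source):
        return {"content": f"[{label} text from {source}]", "source": source}
    return convert


# One entry per branch: MIME type -> converter.
branches = {
    "application/pdf": make_converter("pdf"),
    "text/html": make_converter("html"),
}


def convert_and_join(sources, branches):
    """Send each source through its branch and merge the results into one list."""
    joined, skipped = [], []
    for source in sources:
        mime_type, _ = mimetypes.guess_type(source)
        if mime_type in branches:
            joined.append(branches[mime_type](source))
        else:
            skipped.append(source)  # no branch registered for this type
    return joined, skipped


sources = ["paper.pdf", "article.html", "notes.txt"]
before, skipped_before = convert_and_join(sources, branches)

# Register the new type with its converter; the join step is reused unchanged.
branches["text/plain"] = make_converter("text")
after, skipped_after = convert_and_join(sources, branches)

print(len(before), skipped_before)
print(len(after), skipped_after)
```

Before the registration `notes.txt` is skipped. Afterwards it flows into the same joined list as the other sources, with no change to `convert_and_join`.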