🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# Advanced Branching Pipeline - Multi-Source Knowledge Graph Generation

This notebook demonstrates how to build branching pipelines with the following capabilities:
1. **Process Multiple Input Types**: Handle PDFs, web URLs, and other document formats simultaneously
2. **Intelligent Routing**: Automatically route different content types through appropriate processing paths
3. **Unified Knowledge Graphs**: Combine information from multiple sources into a single knowledge representation
4. **Scalable Architecture**: Design patterns that can be extended to handle additional content types

## Learning Objectives

By the end of this notebook, you will understand:
- How to use Haystack's `FileTypeRouter` for automatic input type detection
- How to design branching pipelines that process heterogeneous data sources
- How to use `DocumentJoiner` to combine processed content from multiple branches
- Best practices for production-ready multi-source processing pipelines

## Key Architectural Components
- **FileTypeRouter**: Automatically detects input types and routes them appropriately
- **DocumentJoiner**: Combines documents from different processing branches
- **LinkContentFetcher + HTMLToDocument**: Web content processing branch
- **PyPDFToDocument**: PDF processing branch
- **Shared Processing Components**: Unified cleaning, splitting, and knowledge graph generation
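
The routing idea behind `FileTypeRouter` can be illustrated with the standard library alone. The snippet below is a conceptual sketch, not Haystack's implementation: the `route_by_mime_type` helper and the file names are hypothetical, and the MIME type is guessed from the file extension only.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path


def route_by_mime_type(sources, accepted_mime_types):
    """Group file paths by guessed MIME type; anything else goes to 'unclassified'."""
    routes = defaultdict(list)
    for source in sources:
        # guess_type only inspects the file name, so the file does not need to exist
        mime_type, _ = mimetypes.guess_type(Path(source).name)
        key = mime_type if mime_type in accepted_mime_types else "unclassified"
        routes[key].append(Path(source))
    return dict(routes)


# Hypothetical file names, used only for illustration
routes = route_by_mime_type(
    ["report.pdf", "page.html", "archive.zip"],
    accepted_mime_types=["text/plain", "application/pdf", "text/html"],
)
for mime_type, paths in routes.items():
    print(mime_type, [path.name for path in paths])
```

Each key of the returned dictionary plays the role of one router output, which is why a pipeline connection such as `file_router.application/pdf` names a MIME type.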

## Real-World Applications
This approach is essential for:
- **Enterprise Knowledge Management**: Processing diverse document collections
- **Research Data Integration**: Combining academic papers, web articles, and reports
- **Multi-Modal Content Analysis**: Handling various content formats in a single workflow
- **Automated Content Pipelines**: Production systems that need to handle varied input types

In [3]:
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument, HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter)
from pathlib import Path
from scripts.knowledge_graph_component import KnowledgeGraphGenerator
from scripts.langchaindocument_component import DocumentToLangChainConverter
from scripts.synthetic_test_components import SyntheticTestGenerator, TestDatasetSaver

# Load environment variables (such as OPENAI_API_KEY) from the local .env file
load_dotenv(".env")

True

## Building the Advanced Branching Pipeline

### Pipeline Architecture Overview

Our advanced pipeline will follow this architecture:

```
Input Sources (PDF + Web URL)
    ↓                    ↓
FileTypeRouter    LinkContentFetcher
    ↓                    ↓
PDFConverter      HTMLConverter
    ↓                    ↓
    └── DocumentJoiner ──┘
            ↓
    Document Processing Chain
    (Cleaner → Splitter → Converter)
            ↓
    Knowledge Graph Generator
            ↓
    Synthetic Test Generator
            ↓
    Test Dataset Saver
```

### Key Design Principles

1. **Separation of Concerns**: Each component has a single, well-defined responsibility
2. **Flexible Input Handling**: Can process multiple input types simultaneously
3. **Unified Processing**: Same downstream logic regardless of input source
4. **Extensibility**: Easy to add new input types (CSV, Word docs, etc.)
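5. **Error Isolation**: Each input type has its own branch, which makes failures easier to trace to a specific source (an unhandled exception in any branch still stops the run)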

In [5]:
# Initialize pipeline
pipeline = Pipeline()

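# Core routing and joining components
# Note: only the "application/pdf" output is connected below; "text/plain" and
# "text/html" are declared but left unconnected in this notebook.
file_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/html"])
doc_joiner = DocumentJoiner()  # Joins documents from different branches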

# Input converters for each file type
pdf_converter = PyPDFToDocument()
html_converter = HTMLToDocument()  
link_fetcher = LinkContentFetcher()

# Shared processing components
doc_cleaner = DocumentCleaner(
    remove_empty_lines=True, 
    remove_extra_whitespaces=True
)
doc_splitter = DocumentSplitter(split_by="sentence", split_length=50, split_overlap=5)
doc_converter = DocumentToLangChainConverter()
kg_generator = KnowledgeGraphGenerator(apply_transforms=True)
test_generator = SyntheticTestGenerator(
    test_size=10,
    llm_model="gpt-4o-mini",
    embedder_model="text-embedding-ada-002",
    query_distribution=[
        ("single_hop", 0.3),
        ("multi_hop_specific", 0.3),
        ("multi_hop_abstract", 0.4),
    ],
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_advanced_branching.csv")

# Add all components to pipeline
pipeline.add_component("file_router", file_router)
pipeline.add_component("link_fetcher", link_fetcher)
pipeline.add_component("pdf_converter", pdf_converter) 
pipeline.add_component("html_converter", html_converter)
pipeline.add_component("doc_joiner", doc_joiner)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect file routing branches
pipeline.connect("file_router.application/pdf", "pdf_converter.sources") 
pipeline.connect("link_fetcher.streams", "html_converter.sources")

# Connect converters to joiner
pipeline.connect("pdf_converter.documents", "doc_joiner.documents")
pipeline.connect("html_converter.documents", "doc_joiner.documents")

# Connect main processing path
pipeline.connect("doc_joiner.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")

<haystack.core.pipeline.pipeline.Pipeline object at 0x310cad0d0>
🚅 Components
  - file_router: FileTypeRouter
  - link_fetcher: LinkContentFetcher
  - pdf_converter: PyPDFToDocument
  - html_converter: HTMLToDocument
  - doc_joiner: DocumentJoiner
  - doc_cleaner: DocumentCleaner
  - doc_splitter: DocumentSplitter
  - doc_converter: DocumentToLangChainConverter
  - kg_generator: KnowledgeGraphGenerator
  - test_generator: SyntheticTestGenerator
  - test_saver: TestDatasetSaver
🛤️ Connections
  - file_router.application/pdf -> pdf_converter.sources (list[Union[str, Path, ByteStream]])
  - link_fetcher.streams -> html_converter.sources (list[ByteStream])
  - pdf_converter.documents -> doc_joiner.documents (list[Document])
  - html_converter.documents -> doc_joiner.documents (list[Document])
  - doc_joiner.documents -> doc_cleaner.documents (list[Document])
  - doc_cleaner.documents -> doc_splitter.documents (list[Document])
  - doc_splitter.documents -> doc_converter.documents (l
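
The `DocumentSplitter` settings used above (`split_length=50`, `split_overlap=5`) describe a sliding window over sentences. The following is a minimal sketch of that windowing idea in plain Python; the `sliding_windows` helper is hypothetical, and Haystack's own sentence detection and boundary handling may differ.

```python
def sliding_windows(units, split_length, split_overlap):
    """Yield chunks of `split_length` units, each re-using `split_overlap` units of the previous one."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    for start in range(0, len(units), step):
        yield units[start:start + split_length]
        if start + split_length >= len(units):
            break  # this window already reached the end of the input


# 100 placeholder "sentences" stand in for a real document
sentences = [f"Sentence {i}." for i in range(100)]
chunks = list(sliding_windows(sentences, split_length=50, split_overlap=5))
print([len(chunk) for chunk in chunks])  # [50, 50, 10]
print(chunks[1][0])  # Sentence 45.
```

With these settings each new chunk starts 45 sentences after the previous one, so the last 5 sentences of a chunk reappear at the start of the next.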

In [8]:
# Define inputs
pdf_file = Path("./data_for_indexing/howpeopleuseai.pdf")
web_urls = ["https://www.bbc.com/news/articles/c2l799gxjjpo",
            "https://www.brookings.edu/articles/how-artificial-intelligence-is-transforming-the-world/"
            ]

try:
    # Run pipeline with both input types
    result = pipeline.run({
        "file_router": {"sources": [pdf_file]},  # PDF input through FileTypeRouter
        "link_fetcher": {"urls": web_urls},      # Web input through LinkContentFetcher
    })

    print("\n📊 Pipeline Results:")
    print(f"  📄 Documents Processed: {result['doc_converter']['document_count']}")
    print(f"  🧠 Knowledge Graph Nodes: {result['kg_generator']['node_count']}")
    print(f"  🧪 Test Cases Generated: {result['test_generator']['testset_size']}")
    print(f"  🔧 Generation Method: {result['test_generator']['generation_method']}")

except Exception as e:
    # Any stage can raise here (PDF parsing, web fetching, LLM calls), not only the web branch
    print(f"❌ Error running pipeline: {e}")
    print("This might be due to network issues or website access restrictions, among other causes.")

Applying HeadlinesExtractor: 100%|██████████| 26/26 [00:07<00:00,  3.68it/s]
Applying HeadlineSplitter: 100%|██████████| 26/26 [00:00<00:00, 524.45it/s]
Applying SummaryExtractor:   0%|          | 0/27 [00:00<?, ?it/s]
Applying SummaryExtractor:  67%|██████▋   | 18/27 [00:07<00:02,  4.18it/s]Property 'summary' already exists in node 'a2079d'. Skipping!
Applying SummaryExtractor:  70%|███████   | 19/27 [00:07<00:02,  3.72it/s]Property 'summary' already exists in node 'a2079d'. Skipping!
Applying SummaryExtractor: 100%|██████████| 27/27 [00:09<00:00,  2.83it/s]
Applying CustomNodeFilter: 100%|██████████| 74/74 [00:17<00:00,  4.21it/s]
Applying EmbeddingExtractor:   0%|          | 0/27 [00:00<?


📊 Pipeline Results:
  📄 Documents Processed: 26
  🧠 Knowledge Graph Nodes: 26
  🧪 Test Cases Generated: 10
  🔧 Generation Method: knowledge_graph


In [9]:
# Visualize the advanced branching pipeline architecture
pipeline.draw(path="./images/advanced_branching_kg_pipeline.png")
print("📸 Pipeline diagram saved to: ./images/advanced_branching_kg_pipeline.png")

📸 Pipeline diagram saved to: ./images/advanced_branching_kg_pipeline.png


![Advanced Branching Pipeline](./images/advanced_branching_kg_pipeline.png)

In [11]:
import pandas as pd

# Load and analyze results from the advanced branching pipeline
advanced_test_file = "data_for_eval/synthetic_tests_advanced_branching.csv"

if os.path.exists(advanced_test_file):
    advanced_tests_df = pd.read_csv(advanced_test_file)

    display(advanced_tests_df.head())
    display(advanced_tests_df.tail())

else:
    print("❌ Synthetic test file not found")
    print("Please run the previous cells to generate the test data.")

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wut is the role of Apple in AI technolgy?,"[""What is AI, how does it work and why are som...",Apple is involved in AI technology through its...,single_hop_specific_query_synthesizer
1,What happened with UnitedHealthcare CEO?,['Why is AI controversial?\nWhile acknowledgin...,The BBC complained about Apple's AI falsely te...,single_hop_specific_query_synthesizer
2,What laws is US having about AI and how it com...,['Are there laws governing AI?\nSome governmen...,"In the US, there are AI Safety Institutes that...",single_hop_specific_query_synthesizer
3,How does the gpt-5 classifier improve user sat...,['<1-hop>\n\nPrivacy via Automated Classifiers...,The gpt-5 classifier improves user satisfactio...,multi_hop_specific_query_synthesizer
4,What are the key aspects of the EU's Artificia...,['<1-hop>\n\nAre there laws governing AI?\nSom...,The EU's Artificial Intelligence Act places st...,multi_hop_specific_query_synthesizer


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
5,How Google help researchers with data access a...,['<1-hop>\n\nImproving data access\nThe United...,Google long has made available search results ...,multi_hop_specific_query_synthesizer
6,What privacy protections are implemented in th...,['<1-hop>\n\nWe describe the contents of each ...,The privacy protections implemented in the con...,multi_hop_abstract_query_synthesizer
7,How does the establishment of a federal AI adv...,['<1-hop>\n\nAI will reconfigure how society a...,The establishment of a federal AI advisory com...,multi_hop_abstract_query_synthesizer
8,How does the issue of discrimination claims re...,"['<1-hop>\n\n27-28.\n- Christian Davenport, “ ...",The issue of discrimination claims is closely ...,multi_hop_abstract_query_synthesizer
9,What trends can be observed in the usage of Ch...,['<1-hop>\n\nThe yellow line represents the fi...,The trends observed in the usage of ChatGPT us...,multi_hop_abstract_query_synthesizer
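
One quick sanity check on the saved test set is to compare the share of each `synthesizer_name` with the configured `query_distribution`. The sketch below uses only the standard library; the `observed_vs_configured` helper is hypothetical, and matching synthesizer names by prefix is an assumption based on the names visible in the output above.

```python
import csv
from collections import Counter
from pathlib import Path


def observed_vs_configured(rows, query_distribution):
    """Compare configured query weights with the observed share of each synthesizer."""
    counts = Counter(row["synthesizer_name"] for row in rows)
    total = sum(counts.values())
    report = {}
    for label, weight in query_distribution:
        # Assumption: each synthesizer name starts with its configured label,
        # e.g. "single_hop" -> "single_hop_specific_query_synthesizer"
        matched = sum(n for name, n in counts.items() if name.startswith(label))
        report[label] = {
            "configured": weight,
            "observed": matched / total if total else 0.0,
        }
    return report


query_distribution = [
    ("single_hop", 0.3),
    ("multi_hop_specific", 0.3),
    ("multi_hop_abstract", 0.4),
]
csv_path = Path("data_for_eval/synthetic_tests_advanced_branching.csv")

if csv_path.exists():
    with csv_path.open(newline="", encoding="utf-8") as handle:
        report = observed_vs_configured(csv.DictReader(handle), query_distribution)
    for label, shares in report.items():
        print(f"{label}: configured={shares['configured']:.1f}, observed={shares['observed']:.1f}")
else:
    print(f"{csv_path} not found; run the pipeline first.")
```

For the ten rows displayed above, the counts are 3, 3, and 4, which matches the configured 0.3 / 0.3 / 0.4 weights for a `test_size` of 10.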


## Summary and Architecture Analysis

### What We've Accomplished

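In this notebook, we built a branching pipeline with two main parts:

1. **Branching by Input Type**: PDFs are routed by `FileTypeRouter` to `PyPDFToDocument`, while web URLs are fetched by `LinkContentFetcher` and converted by `HTMLToDocument`
2. **Shared Downstream Processing**: `DocumentJoiner` merges both branches before cleaning, splitting, knowledge graph generation, and synthetic test generation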

### Key Architectural Benefits

1. **Modularity**: Each component has a single responsibility and can be reused
2. **Flexibility**: Easy to add new input types (CSV, Word docs, etc.) 
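3. **Scalability**: `DocumentJoiner` merges the output of any number of branches, so multiple sources are processed in a single run
4. **Consistency**: Same processing logic regardless of input source
5. **Error Isolation**: Each input type has its own branch, which makes failures easier to trace to a specific source (an unhandled exception in any branch still stops the run)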

### Production Considerations

**Advantages of Branching Pipelines:**
- **Unified Output**: Single knowledge graph and test dataset from multiple sources
- **Rich Context**: Cross-referencing information between different document types
- **Operational Efficiency**: One pipeline deployment handles multiple scenarios
- **Quality Improvement**: More diverse source documents can lead to more varied synthetic questions

**When to Use Branching Pipelines:**
- Processing heterogeneous document collections
- Building comprehensive knowledge bases from multiple sources
- Creating robust test datasets that cover various content types
- Implementing production pipelines that need input flexibility


### Extension Patterns

To add a new input type:
1. Add its MIME type to the `mime_types` list of `FileTypeRouter`
2. Create the appropriate converter component and connect it to the router output for that MIME type
3. Connect the converter to `DocumentJoiner`
4. Leave downstream processing unchanged

This modular approach makes the pipeline highly maintainable and extensible for future requirements.