🔧 **Setup Required**: Before running this notebook, follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **Make sure you have run the Indexing pipeline before starting this exercise.**

# Standalone Synthetic Data Generation with Knowledge Graphs for PDFs and Websites

This notebook demonstrates how to:
1. Initialize Haystack and custom components
2. Connect components into a pipeline for knowledge graph and synthetic data generation
3. Run the pipeline on a PDF and two website URLs


## 1. Setup and Imports

First, let's import all the necessary libraries and set up our environment.

In [1]:
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument, HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter)
from pathlib import Path
from scripts.synthetic_data_generation.knowledge_graph_component import KnowledgeGraphGenerator
from scripts.synthetic_data_generation.langchaindocument import DocumentToLangChainConverter
from scripts.synthetic_data_generation.synthetic_test_components import (
    SyntheticTestGenerator,
    TestDatasetSaver)

# Load environment variables
load_dotenv(".env")




True

## 2. Initialize components 

We will use a combination of custom and built-in Haystack components to generate a synthetic dataset from a PDF and two URLs. Each source type is processed on its own branch (the PDF goes through a file-type router, the URLs through a link fetcher), and the branches are merged once their contents have been converted to Haystack `Document` objects.

In [9]:
# Core routing and joining components  
file_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/html"])
doc_joiner = DocumentJoiner()  # Joins documents from different branches

# Input converters for each file type
pdf_converter = PyPDFToDocument()
html_converter = HTMLToDocument()  
link_fetcher = LinkContentFetcher()

# Shared processing components
doc_cleaner = DocumentCleaner(
    remove_empty_lines=True, 
    remove_extra_whitespaces=True
)
doc_splitter = DocumentSplitter(split_by="sentence", split_length=50, split_overlap=5)
doc_converter = DocumentToLangChainConverter()
kg_generator = KnowledgeGraphGenerator(apply_transforms=True)
test_generator = SyntheticTestGenerator(
    testset_size=3,  # Number of synthetic test samples to generate; increase for broader coverage
    llm_model="gpt-4o-mini",
    query_distribution=[
        ("single_hop", 0.3),
        ("multi_hop_specific", 0.3), 
        ("multi_hop_abstract", 0.4)
    ]
)
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_advanced_branching_3.csv")
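
The `query_distribution` weights above are expected to sum to 1.0. The sketch below is a minimal, standalone illustration of checking that and of how a `testset_size` of 3 could be divided across the three query types. The helper `allocate_questions` is hypothetical, and the rounding rule (largest remainder) is an assumption made for illustration: the custom `SyntheticTestGenerator` may well sample query types at random rather than split them deterministically.

```python
import math
from typing import Dict, List, Tuple


def allocate_questions(
    testset_size: int, distribution: List[Tuple[str, float]]
) -> Dict[str, int]:
    """Hypothetical helper: split testset_size across query types by weight."""
    total_weight = sum(weight for _, weight in distribution)
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights sum to {total_weight}, expected 1.0")

    # Ideal (fractional) share for each query type.
    raw = [(name, testset_size * weight) for name, weight in distribution]
    # Start from the floor of each share.
    counts = {name: math.floor(share) for name, share in raw}
    # Hand out the leftover samples to the largest fractional remainders.
    leftover = testset_size - sum(counts.values())
    by_remainder = sorted(
        raw, key=lambda item: item[1] - math.floor(item[1]), reverse=True
    )
    for name, _ in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Same values as the cell above.
query_distribution = [
    ("single_hop", 0.3),
    ("multi_hop_specific", 0.3),
    ("multi_hop_abstract", 0.4),
]
allocation = allocate_questions(3, query_distribution)
print(allocation)
```

With only three samples, each query type can get at most one or two questions, so the configured weights have little visible effect until `testset_size` is increased.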

## 2. Initialize and set up the pipeline

We now build a pipeline with two input streams: a PDF routed by file type, and web content fetched from URLs. The converted documents are joined, cleaned, and split, then passed to our custom components, which build a knowledge graph from the combined documents and generate a synthetic dataset of question-answer pairs from it.
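
As a rough picture of the splitting step, the sketch below groups sentences into overlapping windows using the same numbers configured for `doc_splitter` (`split_length=50`, `split_overlap=5`). It is a simplified stand-in, not Haystack's implementation: the function `window_sentences` is hypothetical, and it assumes the text has already been broken into sentences.

```python
from typing import List


def window_sentences(
    sentences: List[str], split_length: int = 50, split_overlap: int = 5
) -> List[List[str]]:
    """Hypothetical helper: group sentences into overlapping chunks."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    # Each new chunk starts this many sentences after the previous one.
    step = split_length - split_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + split_length])
        # Stop once a chunk has reached the end of the input.
        if start + split_length >= len(sentences):
            break
    return chunks


# Placeholder input: 120 numbered sentences.
sentences = [f"Sentence {i}." for i in range(1, 121)]
chunks = window_sentences(sentences)
print([len(chunk) for chunk in chunks])  # → [50, 50, 30]
print(chunks[0][-5:] == chunks[1][:5])   # → True
```

The last five sentences of one chunk reappear at the start of the next, which gives the downstream extractors some shared context across chunk boundaries.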

In [None]:
# Initialize pipeline
pipeline = Pipeline()
# Add all components to pipeline
pipeline.add_component("file_router", file_router)
pipeline.add_component("link_fetcher", link_fetcher)
pipeline.add_component("pdf_converter", pdf_converter) 
pipeline.add_component("html_converter", html_converter)
pipeline.add_component("doc_joiner", doc_joiner)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect file routing branches
pipeline.connect("file_router.application/pdf", "pdf_converter.sources") 
pipeline.connect("link_fetcher.streams", "html_converter.sources")

# Connect converters to joiner
pipeline.connect("pdf_converter.documents", "doc_joiner.documents")
pipeline.connect("html_converter.documents", "doc_joiner.documents")

# Connect main processing path
pipeline.connect("doc_joiner.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")
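
The connections above form a directed acyclic graph with two entry points and two fan-in nodes. As a sanity check that does not depend on Haystack, the sketch below re-states the same edges as plain tuples (socket names omitted) and uses the standard-library `graphlib` module to derive a valid execution order. The edge list is copied by hand from the `connect` calls, so it is an illustration of the wiring rather than something read from the `pipeline` object.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# (sender, receiver) pairs mirroring the pipeline.connect(...) calls above.
connections: List[Tuple[str, str]] = [
    ("file_router", "pdf_converter"),
    ("link_fetcher", "html_converter"),
    ("pdf_converter", "doc_joiner"),
    ("html_converter", "doc_joiner"),
    ("doc_joiner", "doc_cleaner"),
    ("doc_cleaner", "doc_splitter"),
    ("doc_splitter", "doc_converter"),
    ("doc_converter", "kg_generator"),
    ("kg_generator", "test_generator"),
    ("doc_converter", "test_generator"),
    ("test_generator", "test_saver"),
]


def build_predecessors(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every component to the set of components that feed into it."""
    predecessors: Dict[str, Set[str]] = {}
    for sender, receiver in edges:
        predecessors.setdefault(sender, set())
        predecessors.setdefault(receiver, set()).add(sender)
    return predecessors


predecessors = build_predecessors(connections)

# Raises graphlib.CycleError if the wiring accidentally contains a loop.
order = list(TopologicalSorter(predecessors).static_order())

# Components with no upstream component need inputs passed to run().
entry_points = sorted(name for name, preds in predecessors.items() if not preds)
# Components that receive data from more than one upstream component.
fan_in = {name: sorted(preds) for name, preds in predecessors.items() if len(preds) > 1}

print("Entry points:", entry_points)
print("Fan-in nodes:", fan_in)
print("One valid order:", order)
```

The two entry points are the components that receive inputs in the `run` call, and `test_generator` has two upstream components because it consumes both the knowledge graph and the converted documents.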

## 3. Execute pipeline
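
Because a full run calls an LLM many times, it can be worth checking the inputs first. Only the `application/pdf` output of `file_router` is connected, so a source with a different type would not reach `pdf_converter`. The sketch below is an optional pre-flight check written with the standard library only; the helper `find_input_problems` and the example path and URLs are hypothetical and not part of the notebook's scripts.

```python
from pathlib import Path
from typing import Iterable, List
from urllib.parse import urlparse


def find_input_problems(pdf_path: Path, urls: Iterable[str]) -> List[str]:
    """Hypothetical helper: return human-readable problems with the inputs."""
    problems: List[str] = []

    # The PDF branch expects an existing file with a .pdf extension.
    if pdf_path.suffix.lower() != ".pdf":
        problems.append(f"Not a .pdf file: {pdf_path}")
    elif not pdf_path.is_file():
        problems.append(f"File not found: {pdf_path}")

    # The web branch expects absolute http(s) URLs.
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"Not an absolute HTTP(S) URL: {url}")
    return problems


# Placeholder inputs chosen to trigger both kinds of problem.
problems = find_input_problems(
    Path("does_not_exist_example.pdf"),
    ["https://example.com/article", "not-a-url"],
)
for problem in problems:
    print(problem)
```

An empty list means the inputs look plausible; it does not guarantee that the URLs are reachable or that the PDF contains extractable text.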

In [8]:
pdf_file = Path("./data_for_indexing/howpeopleuseai.pdf")
web_urls = [
    "https://www.bbc.com/news/articles/c2l799gxjjpo",
    "https://www.brookings.edu/articles/how-artificial-intelligence-is-transforming-the-world/",
]

# Run pipeline with both input types
result = pipeline.run({
    "file_router": {"sources": [pdf_file]},  # PDF input through FileTypeRouter
    "link_fetcher": {"urls": web_urls},      # Web input through LinkContentFetcher
})

Applying HeadlinesExtractor: 100%|██████████| 26/26 [00:06<00:00,  4.06it/s]
Applying HeadlineSplitter: 100%|██████████| 26/26 [00:00<00:00, 676.59it/s]
Applying SummaryExtractor:  63%|██████▎   | 17/27 [00:06<00:02,  3.93it/s]
Property 'summary' already exists in node 'fcce35'. Skipping!
Applying SummaryExtractor: 100%|██████████| 27/27 [00:08<00:00,  3.28it/s]
Applying CustomNodeFilter: 100%|██████████| 72/72 [00:17<00:00,  4.16it/s]
Applying EmbeddingExtractor:  74%|███████▍  | 20/27 [00:00<00:00, 42.33it/s]
Property 'summary_embedding' already exists in node 'fcce35'. Skipping!
Applying EmbeddingExtractor: 100%|██████████| 27/27 [00:01<00:00, 18.59it/s]
Applying ThemesExtractor: 100%|██████████| 71/71 [00:17<00:00,  4.03it/s]
Applying NERExtractor: 100%|██████████| 71/71 [00:17<00:00,  3.98it/s]
Applying CosineSimilarityBuilder: 100%|██████████| 1/1 [00:00<00:00, 195.20it/s]
Applying OverlapScoreBuilder: 100%|██████████| 1/1 [00:00<00:00, 19.95it/s]
Generating personas: 100%|████████