# Knowledge Graph and Synthetic Data Generation Pipelines

This notebook demonstrates how to build comprehensive pipelines for:
1. **Knowledge Graph Creation** - Converting documents into structured knowledge representations
2. **Synthetic Test Data Generation** - Creating question-answer pairs for evaluation
3. **Multi-Source Processing** - Working with PDFs and web content using one shared processing chain

## Learning Objectives

By the end of this notebook, you will understand:
- How to build end-to-end Haystack pipelines for knowledge extraction
- The relationship between knowledge graphs and test data generation
- Best practices for processing different document formats
- How to evaluate and validate synthetic datasets

## Prerequisites

Before running this notebook, ensure you have:
- ✅ OpenAI API key configured in your `.env` file
- ✅ Required dependencies installed (`ragas`, `haystack-ai`, etc.)
- ✅ Sample documents in the `data_for_indexing` directory
- ✅ Understanding of Haystack 2.0 pipeline architecture
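
A quick way to confirm the first prerequisite is to check the environment once the `.env` file has been loaded. The sketch below is a minimal check, not part of the original notebook: the helper name `missing_settings` is hypothetical, and it assumes the OpenAI client reads its key from the `OPENAI_API_KEY` variable.

```python
import os


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Assumption: the key is exposed as OPENAI_API_KEY after load_dotenv() runs.
missing = missing_settings(["OPENAI_API_KEY"])
if missing:
    print(f"Missing settings: {', '.join(missing)} (check your .env file)")
else:
    print("All required settings are present")
```

Running this right after `load_dotenv` surfaces a missing key before any paid API call is attempted.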

## Part 1: PDF Processing Pipeline

### Overview
In this section, we'll build a comprehensive pipeline that:
1. **Extracts content** from PDF files using Haystack's PyPDFToDocument converter
2. **Preprocesses the text** with cleaning and splitting components
3. **Creates a knowledge graph** from the processed documents
4. **Generates synthetic test data** using the knowledge graph

### Key Components
- **PyPDFToDocument**: Converts PDF files to Haystack Document objects
- **DocumentCleaner**: Removes extra whitespace and empty lines
- **DocumentSplitter**: Breaks documents into manageable chunks
- **KnowledgeGraphGenerator**: Creates structured knowledge representations
- **SyntheticTestGenerator**: Produces question-answer pairs for evaluation

### Why This Approach?
Using knowledge graphs as an intermediate step improves the quality of synthetic test generation because:
- Knowledge graphs capture relationships between entities
- They provide structured context for question generation
- The resulting questions are more coherent and factually grounded
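
To make "relationships between entities" concrete, here is a toy sketch that links text chunks whose entity sets overlap. It is an illustration only: the entity sets are made up, the threshold is arbitrary, and the functions `jaccard` and `build_edges` are hypothetical helpers rather than the logic used inside `KnowledgeGraphGenerator`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_edges(chunk_entities, threshold=0.2):
    """Link every pair of chunks whose entity sets overlap by at least `threshold`."""
    edges = []
    for (id_a, ents_a), (id_b, ents_b) in combinations(chunk_entities.items(), 2):
        score = jaccard(ents_a, ents_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 2)))
    return edges


# Made-up entity sets, standing in for what an NER step might extract per chunk.
chunks = {
    "chunk_1": {"ChatGPT", "NBER", "OpenAI"},
    "chunk_2": {"ChatGPT", "OpenAI", "users"},
    "chunk_3": {"privacy", "classifier"},
}
edges = build_edges(chunks)
print(edges)  # [('chunk_1', 'chunk_2', 0.5)]
```

An edge such as `chunk_1`-`chunk_2` is what makes a multi-hop question possible: the generator can draw on two passages that are known to discuss the same entities.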

In [None]:
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter)
from pathlib import Path
from scripts.knowledge_graph_component import KnowledgeGraphGenerator
from scripts.synthetic_test_components import (
    DocumentToLangChainConverter,
    SyntheticTestGenerator,
    TestDatasetSaver,
)

# Load environment variables
load_dotenv("./.env")

# Example: Create a complete pipeline for synthetic test generation
data_path = "data_for_indexing"

if os.path.exists(data_path):
    print("Creating synthetic test generation pipeline...")
    
    # Get PDF files from the directory
    pdf_files = list(Path(data_path).glob("*.pdf"))
    
    if pdf_files:
        print(f"Found {len(pdf_files)} PDF files to process")
        
        # Create pipeline components
        pdf_converter = PyPDFToDocument()
        doc_cleaner = DocumentCleaner(remove_empty_lines=True,
                                      remove_extra_whitespaces=True)
        doc_splitter = DocumentSplitter(split_by="sentence",
                                       split_length=50,
                                       split_overlap=5)
        doc_converter = DocumentToLangChainConverter()
        kg_generator = KnowledgeGraphGenerator(apply_transforms=True)
        
        
        test_generator = SyntheticTestGenerator(
            testset_size=10,  
            llm_model="gpt-4o-mini",
            query_distribution=[
                ("single_hop", 0.25), 
                ("multi_hop_specific", 0.25),
                ("multi_hop_abstract", 0.5)
            ],
            # Optional: Add max_testset_size=5 if you want to limit due to API constraints
            # max_testset_size=5  # Uncomment this line if you experience API timeouts
        )
        test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_10_from_pdf.csv")
        
        # Create pipeline
        pipeline = Pipeline()
        pipeline.add_component("pdf_converter", pdf_converter)
        pipeline.add_component("doc_cleaner", doc_cleaner)
        pipeline.add_component("doc_splitter", doc_splitter)
        pipeline.add_component("doc_converter", doc_converter)
        pipeline.add_component("kg_generator", kg_generator)
        pipeline.add_component("test_generator", test_generator)
        pipeline.add_component("test_saver", test_saver)
        
        # Connect components in sequence
        pipeline.connect("pdf_converter.documents", "doc_cleaner.documents")
        pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
        pipeline.connect("doc_splitter.documents", "doc_converter.documents")
        pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
        pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
        pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
        pipeline.connect("test_generator.testset", "test_saver.testset")
        
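        # Prepare input data: PyPDFToDocument accepts file paths as well as ByteStream objects.
        # This run targets one known file; pass `pdf_files` instead to process every PDF found above.
        pdf_sources = [Path("./data_for_indexing/howpeopleuseai.pdf")]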
         
        result = pipeline.run({
            "pdf_converter": {"sources": pdf_sources}
        })
        
        print("\n📊 Pipeline Results:")
        print(f"  📄 Documents Processed: {result['doc_converter']['document_count']}")
        print(f"  🧠 Knowledge Graph Nodes: {result['kg_generator']['node_count']}")
        print(f"  🧪 Test Cases Generated: {result['test_generator']['testset_size']}")
        print(f"  🔧 Generation Method: {result['test_generator']['generation_method']}")
        
    else:
        print("❌ No PDF files found in data_for_indexing directory")
else:
    print("❌ Data path 'data_for_indexing' not found")

Creating synthetic test generation pipeline...
Found 1 PDF files to process


Applying HeadlinesExtractor: 100%|██████████| 17/17 [00:08<00:00,  2.01it/s]
Applying HeadlineSplitter: 100%|██████████| 17/17 [00:00<00:00, 431.12it/s]
Applying SummaryExtractor: 100%|██████████| 17/17 [00:13<00:00,  1.30it/s]
Applying CustomNodeFilter: 100%|██████████| 49/49 [00:25<00:00,  1.94it/s]
Applying EmbeddingExtractor: 100%|██████████| 17/17 [00:04<00:00,  4.07it/s]
Applying ThemesExtractor: 100%|██████████| 45/45 [00:25<00:00,  1.79it/s]
Applying NERExtractor: 100%|██████████| 45/45 [00:21<00:00,  2.06it/s]
Applying CosineSimilarityBuilder: 100%|██████████| 1/1 [00:00<00:00, 314.04it/s]
Applying OverlapScoreBuilder: 100%|██████████| 1/1 [00:00<00:00, 75.99it/s]
Generating personas: 100%|██████████| 3/3 [00:02<00:00,  1.35it/s]
Generating Scenarios: 100%|██████████| 3/3 [00:20<00:00,  6.75s/it]
Generating Samples: 100%|██████████| 11/11 [00:08<00:00,  1.37it/s]



📊 Pipeline Results:
  📄 Documents Processed: 17
  🧠 Knowledge Graph Nodes: 17
  🧪 Test Cases Generated: 11
  🔧 Generation Method: knowledge_graph


### Understanding the Pipeline Architecture

The pipeline built above follows this flow:

```
PDF File → PDF Converter → Document Cleaner → Document Splitter 
    ↓
Document Converter → Knowledge Graph Generator
    ↓                         ↓
Test Generator ← ← ← ← ← ← ← ←
    ↓
Test Dataset Saver
```

**Key Design Decisions:**

1. **Document Processing Chain**: We clean and split documents before knowledge graph generation to ensure high-quality input
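2. **Dual Input to Test Generator**: Both the knowledge graph and the original documents are passed to the test generator, so it has a fallback when graph-based generation is not possible (the `generation_method` field in the results reports which path was used)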
3. **Configurable Test Distribution**: We can control the types of questions generated (single-hop vs multi-hop)
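
The distribution weights can be turned into per-type counts with a few lines of arithmetic. The sketch below assumes that fractional shares are rounded up, which would give 11 rather than 10 test cases for the weights used in this notebook. That rounding rule is an assumption made for illustration; it has not been checked against how `SyntheticTestGenerator` or Ragas allocates samples, and `allocate` is a hypothetical helper.

```python
import math


def allocate(testset_size, distribution):
    """Split a target test-set size across query types by weight."""
    total = sum(weight for _, weight in distribution)
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights must sum to 1.0, got {total}")
    # Assumption: fractional shares are rounded up, so the total can exceed the target.
    return {name: math.ceil(testset_size * weight) for name, weight in distribution}


counts = allocate(10, [
    ("single_hop", 0.25),
    ("multi_hop_specific", 0.25),
    ("multi_hop_abstract", 0.5),
])
print(counts)                # {'single_hop': 3, 'multi_hop_specific': 3, 'multi_hop_abstract': 5}
print(sum(counts.values()))  # 11
```

Validating that the weights sum to 1.0 before starting a run is cheap and avoids spending API calls on a misconfigured distribution.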

**Pipeline Parameters Explained:**
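- `testset_size=10`: Target number of question-answer pairs; the actual count can differ slightly (the run above produced 11)
- `split_length=50`: Number of sentences per document chunk, with `split_overlap=5` sentences shared between consecutive chunks
- `query_distribution`: Weight given to each question type, which controls the mix of simple and complex questions; the weights here sum to 1.0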
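
The effect of `split_length` and `split_overlap` is easiest to see on a list of sentences. The function below is a simplified model of sentence windowing with overlap, written for illustration; it is not `DocumentSplitter`'s implementation, and real sentence detection is more involved than a pre-split list.

```python
def split_with_overlap(sentences, split_length=50, split_overlap=5):
    """Group sentences into windows of `split_length` that share `split_overlap` sentences."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + split_length])
        if start + split_length >= len(sentences):
            break  # this window already reaches the end of the text
    return chunks


sentences = [f"Sentence {i}." for i in range(120)]
chunks = split_with_overlap(sentences)
print([len(c) for c in chunks])          # [50, 50, 30]
print(chunks[0][-5:] == chunks[1][:5])   # True: the last 5 sentences are repeated
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.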

In [17]:
import pandas as pd

# Load and display the generated synthetic tests
test_file_path = "data_for_eval/synthetic_tests_10_from_pdf.csv"

if os.path.exists(test_file_path):
    synthetic_tests_df = pd.read_csv(test_file_path)
    print("\n🧪 Synthetic Tests Sample:")
    print("First 5 rows:")
    display(synthetic_tests_df.head())
    print("Last 5 rows:")
    display(synthetic_tests_df.tail())
else:
    print("❌ Synthetic test file not found")


🧪 Synthetic Tests Sample:
First 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Who is Zoe Hitzig and what is her role in the ...,['NBER WORKING PAPER SERIES\nHOW PEOPLE USE CH...,Zoe Hitzig is one of the co-authors of the NBE...,single_hop_specific_query_synthesizer
1,How many users was ChatGPT having by July 2025?,['ABSTRACT Despite the rapid adoption of LLM c...,"By July 2025, ChatGPT had 700 million users, r...",single_hop_specific_query_synthesizer
2,What insights does Roth provide regarding the ...,"['LLM, allowing us to classify messages withou...",Roth (2025) reports that 28% of US adults used...,single_hop_specific_query_synthesizer
3,What trends in user engagement and message vol...,['<1-hop>\n\nThe yellow line represents the fi...,"For ChatGPT users who signed up in 2023, parti...",multi_hop_specific_query_synthesizer
4,What trends in user interaction quality and ge...,['<1-hop>\n\n5.5 Quality of Interactions\nWe a...,"In the lead-up to June 2025, trends in user in...",multi_hop_specific_query_synthesizer


Last 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
6,What trends can be observed in ChatGPT user co...,['<1-hop>\n\nThe yellow line represents the fi...,The trends observed in ChatGPT user cohorts in...,multi_hop_abstract_query_synthesizer
7,How does user satisfaction relate to data priv...,['<1-hop>\n\nWe retain this classifier because...,User satisfaction in ChatGPT interactions is a...,multi_hop_abstract_query_synthesizer
8,What is the significance of IWA ID in relation...,['<1-hop>\n\nTask details Your response should...,The significance of IWA ID in relation to Cohe...,multi_hop_abstract_query_synthesizer
9,What are the key patterns of ChatGPT usage amo...,['<1-hop>\n\nNBER WORKING PAPER SERIES\nHOW PE...,The key patterns of ChatGPT usage reveal that ...,multi_hop_abstract_query_synthesizer
10,What insights can be drawn about user satisfac...,"['<1-hop>\n\nOuyang, Long, Jeff Wu, Xu Jiang, ...",Insights about user satisfaction from the feed...,multi_hop_abstract_query_synthesizer


### Analyzing the Generated Test Dataset

Now let's examine the synthetic test data that was generated from our PDF processing pipeline.

**What to Look For:**
- **Question Quality**: Are the questions grammatically correct and meaningful?
- **Answer Accuracy**: Do the answers correctly reflect the source material?
- **Question Types**: Notice the variety of single-hop and multi-hop questions
- **Context Relevance**: Check if the reference contexts support the answers

**Common Question Types You'll See:**
1. **Single-hop questions**: Direct factual queries (e.g., "What is X?")
2. **Multi-hop specific**: Questions that require connecting specific facts from more than one passage
3. **Multi-hop abstract**: Questions requiring broader reasoning across multiple concepts
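
A simple first check is to count how many test cases each synthesizer produced. The sketch below uses only the standard library and three placeholder rows that reuse the column names from the generated CSV; the row contents are invented for the example and `count_by_synthesizer` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Placeholder rows (not real pipeline output) with the generated CSV's column names.
sample_csv = """user_input,reference,synthesizer_name
Who wrote the paper?,The authors are listed on the title page.,single_hop_specific_query_synthesizer
How did usage change?,Usage grew across cohorts.,multi_hop_abstract_query_synthesizer
What drives satisfaction?,Several factors are discussed.,multi_hop_abstract_query_synthesizer
"""


def count_by_synthesizer(csv_text):
    """Count test cases per synthesizer name."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["synthesizer_name"] for row in rows)


counts = count_by_synthesizer(sample_csv)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

With the DataFrame already loaded in this notebook, `synthetic_tests_df["synthesizer_name"].value_counts()` gives the same breakdown.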

## Part 2: Web Content Processing Pipeline

### Overview
In this section, we'll adapt our pipeline to work with web content instead of PDF files. This demonstrates the flexibility of Haystack pipelines and how the same knowledge graph generation approach can work across different content sources.

### Key Differences from PDF Processing
1. **LinkContentFetcher**: Retrieves content directly from URLs
2. **HTMLToDocument**: Converts HTML content to Haystack Documents
3. **Same Processing Chain**: The rest of the pipeline remains identical

### Real-World Applications
This approach is particularly useful for:
- **Documentation Analysis**: Processing online documentation and creating test datasets
- **Content Monitoring**: Regularly generating tests from updated web content  
- **Multi-Source Knowledge**: Combining web content with other document types
- **Research Applications**: Creating datasets from academic papers, blog posts, etc.

### Technical Considerations
- **Rate Limiting**: Be mindful of website rate limits when fetching content
- **Content Quality**: Web content may require more aggressive cleaning
- **Dynamic Content**: Some websites use JavaScript; static HTML fetching may miss content
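
As an example of more aggressive cleaning, the sketch below extracts visible text from HTML while skipping elements that rarely hold article content. It uses only the standard library and is independent of `HTMLToDocument`; the list of skipped tags and the sample page are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text while skipping tags that rarely hold article content."""

    # Assumption: these tags are treated as page chrome rather than content.
    SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def extract_visible_text(html):
    """Return the text found outside the skipped tags, one fragment per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up page used only to exercise the extractor.
page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
  <h1>Release notes</h1>
  <p>Pipelines are now composable.</p>
  <script>trackVisit();</script>
  <footer>Cookie settings</footer>
</body></html>
"""
print(extract_visible_text(page))
```

A filter like this only sees the static HTML, so content rendered by JavaScript is still missing and would need a browser-based fetcher.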

In [15]:
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument

fetcher = LinkContentFetcher()
converter = HTMLToDocument()
doc_cleaner = DocumentCleaner(remove_empty_lines=True,
                              remove_extra_whitespaces=True)
doc_splitter = DocumentSplitter(split_by="sentence",
                                split_length=50,
                                split_overlap=5)
doc_converter = DocumentToLangChainConverter()
kg_generator = KnowledgeGraphGenerator(apply_transforms=True)
test_generator = SyntheticTestGenerator(
    testset_size=10,
    llm_model="gpt-4o-mini",
    query_distribution=[
        ("single_hop", 0.25),
        ("multi_hop_specific", 0.25),
        ("multi_hop_abstract", 0.5),
    ],
)
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_10_from_html_page.csv")

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)
pipeline.add_component("converter", converter)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect components in sequence
pipeline.connect("fetcher.streams", "converter.sources")
pipeline.connect("converter.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")

web_url = "https://haystack.deepset.ai/blog/haystack-2-release"

result = pipeline.run({
    "fetcher": {"urls": [web_url]}
})

print("\n📊 Pipeline Results:")
print(f"  📄 Documents Processed: {result['doc_converter']['document_count']}")
print(f"  🧠 Knowledge Graph Nodes: {result['kg_generator']['node_count']}")
print(f"  🧪 Test Cases Generated: {result['test_generator']['testset_size']}")
print(f"  🔧 Generation Method: {result['test_generator']['generation_method']}")

Applying HeadlinesExtractor: 100%|██████████| 2/2 [00:02<00:00,  1.33s/it]
Applying HeadlineSplitter: 100%|██████████| 2/2 [00:00<00:00, 442.76it/s]
Applying SummaryExtractor: 100%|██████████| 2/2 [00:04<00:00,  2.41s/it]
Applying CustomNodeFilter: 100%|██████████| 6/6 [00:04<00:00,  1.43it/s]
Applying EmbeddingExtractor: 100%|██████████| 2/2 [00:00<00:00,  2.59it/s]
Applying ThemesExtractor: 100%|██████████| 6/6 [00:05<00:00,  1.19it/s]
Applying NERExtractor: 100%|██████████| 6/6 [00:03<00:00,  1.60it/s]
Applying CosineSimilarityBuilder: 100%|██████████| 1/1 [00:00<00:00, 964.21it/s]
Applying OverlapScoreBuilder: 100%|██████████| 1/1 [00:00<00:00, 1633.93it/s]
Generating personas: 100%|██████████| 2/2 [00:02<00:00,  1.03s/it]
Generating Scenarios: 100%|██████████| 3/3 [00:06<00:00,  2.00s/it]
Generating Samples: 100%|██████████| 11/11 [00:07<00:00,  1.57it/s]



📊 Pipeline Results:
  📄 Documents Processed: 2
  🧠 Knowledge Graph Nodes: 2
  🧪 Test Cases Generated: 11
  🔧 Generation Method: knowledge_graph


### Web Pipeline Architecture

The web processing pipeline follows a similar structure but with adapted input components:

```
Web URL → Link Fetcher → HTML Converter → Document Cleaner → Document Splitter
    ↓
Document Converter → Knowledge Graph Generator  
    ↓                         ↓
Test Generator ← ← ← ← ← ← ← ←
    ↓
Test Dataset Saver
```

**Why This Works:**
- The knowledge graph generation is **content-agnostic** - it works the same whether input comes from PDFs, web pages, or other sources
- Document preprocessing steps ensure consistent quality regardless of input format
- The same test generation logic produces comparable quality across all sources

**Pipeline Reusability:**
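Notice that the shared stages (`doc_cleaner`, `doc_splitter`, `kg_generator`, and the rest) are configured exactly as in Part 1; only the input components change. Each stage is created as a fresh instance because a Haystack component instance cannot be added to more than one pipeline. This demonstrates the modularity and flexibility of Haystack's component architecture.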

In [18]:
# Load and display the generated synthetic tests
test_file_path = "data_for_eval/synthetic_tests_10_from_html_page.csv"

if os.path.exists(test_file_path):
    synthetic_tests_df = pd.read_csv(test_file_path)
    print("\n🧪 Synthetic Tests Sample:")
    print("First 5 rows:")
    display(synthetic_tests_df.head())
    print("Last 5 rows:")
    display(synthetic_tests_df.tail())
else:
    print("❌ Synthetic test file not found")


🧪 Synthetic Tests Sample:
First 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wen was Haystack first released?,['Haystack 2.0: The Composable Open-Source LLM...,Haystack was first officially released in 2020.,single_hop_specific_query_synthesizer
1,How does Haystack 2.0 improve upon the limitat...,['Composable and customizable Pipelines\nModer...,Haystack 2.0 improves upon the limitations of ...,single_hop_specific_query_synthesizer
2,What are the key features and benefits of Asse...,['A common interface for storing data - A clea...,Assembly AI contributes to the Haystack ecosys...,single_hop_specific_query_synthesizer
3,What are the limitations of Haystack 1.0 regar...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 1.0 had a significant limitation in t...,multi_hop_specific_query_synthesizer
4,What limitations of Haystack 1.0 were addresse...,['<1-hop>\n\nA common interface for storing da...,One important limitation in Haystack 1.0 was t...,multi_hop_specific_query_synthesizer


Last 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
6,How does Haystack 2.0 enhance the integration ...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 enhances the integration of data ...,multi_hop_abstract_query_synthesizer
7,What are the main features of Haystack 2.0 tha...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 introduces several main features ...,multi_hop_abstract_query_synthesizer
8,How does Haystack 2.0 facilitate integration w...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 facilitates integration with mode...,multi_hop_abstract_query_synthesizer
9,What are the key features of Haystack 2.0 that...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 introduces several key features t...,multi_hop_abstract_query_synthesizer
10,How does Haystack 2.0 integrate with data stor...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 integrates with data storage serv...,multi_hop_abstract_query_synthesizer


### Comparing Results Across Sources

Let's examine how the synthetic test generation performs when using web content versus PDF content.

**Expected Differences:**
- **Content Structure**: Web content may have different formatting and structure
- **Question Complexity**: Question complexity tends to follow the complexity of the source material
- **Context Quality**: Web content might include navigation elements or ads that need filtering

**Quality Assessment Checklist:**
- [ ] Questions are grammatically correct
- [ ] Answers are factually accurate based on the source
- [ ] Context excerpts support the provided answers
- [ ] Questions test different levels of comprehension
- [ ] No duplicate or overly similar questions
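
The last checklist item can be partly automated. The sketch below flags question pairs that are nearly identical by character similarity, using the standard library's `difflib`. The 0.85 threshold and the sample questions are assumptions for illustration, and character similarity is only a rough proxy: it will miss paraphrases that use different wording.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_similar_questions(questions, threshold=0.85):
    """Return (i, j, ratio) for question pairs whose similarity is at least `threshold`."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(questions), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs


# Made-up questions; the first two differ by a single word.
questions = [
    "What are the main features of the release?",
    "What are the key features of the release?",
    "Who maintains the project?",
]
for i, j, ratio in find_similar_questions(questions):
    print(f"Questions {i} and {j} look alike (similarity {ratio})")
```

In this notebook the input would be `synthetic_tests_df["user_input"].tolist()`; flagged pairs are candidates for manual review rather than automatic deletion.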

## Summary and Next Steps

### What We've Learned

In this notebook, we explored:

1. **Knowledge Graph-Driven Test Generation**: How structured knowledge representations improve synthetic data quality
2. **Multi-Source Processing**: Adapting the same pipeline architecture for different input types (PDFs, web content)
3. **Pipeline Modularity**: Reusing components across different use cases while maintaining consistency
4. **Quality Assessment**: Evaluating synthetic test datasets for accuracy and usefulness

### Key Takeaways

- **Knowledge graphs act as a quality filter** for test generation, producing more coherent and factually grounded questions
- **Haystack's component architecture** enables easy adaptation between different content sources
- **Preprocessing matters** - cleaning and splitting documents appropriately affects downstream quality
- **Synthetic test generation** can scale evaluation efforts but requires careful quality validation

### Production Considerations

When moving to production, consider:

1. **Quality Control**: Implement automated quality checks (see the quality control components in other notebooks)
2. **Scalability**: Use batch processing for large document collections
3. **Monitoring**: Track generation success rates and quality metrics over time
4. **Cost Management**: Balance test quantity with API usage costs
5. **Validation**: Always human-review a sample of generated tests before deployment
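
For the human review step, one option is to draw a small, reproducible sample from each question type so that no type is skipped. The sketch below is one way to do that with the standard library; `review_sample`, the per-type sample size, and the placeholder rows are all assumptions for illustration.

```python
import random
from collections import defaultdict


def review_sample(rows, per_type=2, seed=0):
    """Pick up to `per_type` rows per synthesizer name for manual review."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["synthesizer_name"]].append(row)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for name in sorted(groups):
        group = groups[name]
        sample.extend(rng.sample(group, min(per_type, len(group))))
    return sample


# Placeholder rows; in practice these would come from the generated CSV.
names = ["single_hop"] * 3 + ["multi_hop_specific"] * 1 + ["multi_hop_abstract"] * 5
rows = [
    {"user_input": f"question {i}", "synthesizer_name": name}
    for i, name in enumerate(names)
]
picked = review_sample(rows)
print(len(picked))  # 5: two single-hop, one multi-hop specific, two multi-hop abstract
```

Sampling per type rather than across the whole file keeps rare question types in front of the reviewer even when one type dominates the dataset.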
