🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# PDF Knowledge Graph and Synthetic Data Generation Pipeline

This notebook demonstrates how to build a comprehensive pipeline for PDF document processing that:
1. **Extracts content** from PDF files using Haystack's PyPDFToDocument converter
2. **Preprocesses the text** with cleaning and splitting components
3. **Creates a knowledge graph** from the processed documents
4. **Generates synthetic test data** using the knowledge graph

## Learning Objectives

By the end of this notebook, you will understand:
- How to build end-to-end Haystack pipelines for PDF processing
- The relationship between knowledge graphs and synthetic test data generation
- Best practices for PDF document preprocessing
- How to evaluate synthetic datasets generated from PDF content

## Key Components
- **PyPDFToDocument**: Converts PDF files to Haystack Document objects
- **DocumentCleaner**: Removes extra whitespace and empty lines
- **DocumentSplitter**: Breaks documents into manageable chunks
- **KnowledgeGraphGenerator**: Creates structured knowledge representations
- **SyntheticTestGenerator**: Produces question-answer pairs for evaluation

## Why This Approach?
Using knowledge graphs as an intermediate step improves the quality of synthetic test generation because:
- Knowledge graphs capture relationships between entities
- They provide structured context for question generation
- The resulting questions are more coherent and factually grounded
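
To make the idea of "relationships between entities" concrete, here is a minimal sketch in plain Python. It is not how `KnowledgeGraphGenerator` or the underlying library builds its graph; the chunk names and entity sets are invented for illustration. It links chunks that share an entity and then lists two-hop paths, which are the kind of connection a multi-hop question can be built on.

```python
# Hypothetical chunks, each tagged with the entities it mentions (illustrative only).
nodes = {
    "chunk_1": {"ChatGPT", "launch date"},
    "chunk_2": {"ChatGPT", "work usage"},
    "chunk_3": {"O*NET", "work usage"},
}


def build_edges(nodes):
    """Connect every pair of chunks that shares at least one entity."""
    edges = {}
    names = sorted(nodes)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = nodes[first] & nodes[second]
            if shared:
                edges[(first, second)] = shared
    return edges


def two_hop_paths(edges):
    """Return (start, middle, end) triples where both ends touch the middle chunk."""
    neighbors = {}
    for first, second in edges:
        neighbors.setdefault(first, set()).add(second)
        neighbors.setdefault(second, set()).add(first)
    paths = []
    for middle, adjacent in neighbors.items():
        for start in adjacent:
            for end in adjacent:
                if start < end:
                    paths.append((start, middle, end))
    return sorted(paths)


edges = build_edges(nodes)
print(two_hop_paths(edges))
```

In this toy graph, `chunk_1` and `chunk_3` share no entity directly, but both connect to `chunk_2`, so a question spanning them needs two hops.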

In [1]:
import os
from pathlib import Path

from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

from scripts.knowledge_graph_component import KnowledgeGraphGenerator
from scripts.langchaindocument_component import DocumentToLangChainConverter
from scripts.synthetic_test_components import SyntheticTestGenerator, TestDatasetSaver

# Load environment variables (such as OPENAI_API_KEY) from the local .env file
load_dotenv("./.env")

# Create pipeline components
pdf_converter = PyPDFToDocument()
doc_cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
)
doc_splitter = DocumentSplitter(
    split_by="sentence",
    split_length=50,
    split_overlap=5,
)
doc_converter = DocumentToLangChainConverter()
kg_generator = KnowledgeGraphGenerator(apply_transforms=True)

test_generator = SyntheticTestGenerator(
    test_size=10,
    llm_model="gpt-4o-mini",
    embedder_model="text-embedding-ada-002",
    query_distribution=[
        ("single_hop", 0.3),
        ("multi_hop_specific", 0.3),
        ("multi_hop_abstract", 0.4),
    ],
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_10_from_pdf.csv")

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("pdf_converter", pdf_converter)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect components in sequence
pipeline.connect("pdf_converter.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")

<haystack.core.pipeline.pipeline.Pipeline object at 0x168c43920>
🚅 Components
  - pdf_converter: PyPDFToDocument
  - doc_cleaner: DocumentCleaner
  - doc_splitter: DocumentSplitter
  - doc_converter: DocumentToLangChainConverter
  - kg_generator: KnowledgeGraphGenerator
  - test_generator: SyntheticTestGenerator
  - test_saver: TestDatasetSaver
🛤️ Connections
  - pdf_converter.documents -> doc_cleaner.documents (list[Document])
  - doc_cleaner.documents -> doc_splitter.documents (list[Document])
  - doc_splitter.documents -> doc_converter.documents (list[Document])
  - doc_converter.langchain_documents -> kg_generator.documents (List[Document])
  - doc_converter.langchain_documents -> test_generator.documents (List[Document])
  - kg_generator.knowledge_graph -> test_generator.knowledge_graph (KnowledgeGraph)
  - test_generator.testset -> test_saver.testset (DataFrame)

In [2]:
# Prepare input data: paths to the PDF files to convert
pdf_sources = [Path("./data_for_indexing/howpeopleuseai.pdf")]

try:
    # Run the pipeline; only the first component needs explicit input
    result = pipeline.run({
        "pdf_converter": {"sources": pdf_sources}
    })
    print("\n📊 Pipeline Results:")
    print(f"  📄 Documents Processed: {result['doc_converter']['document_count']}")
    print(f"  🧠 Knowledge Graph Nodes: {result['kg_generator']['node_count']}")
    print(f"  🧪 Test Cases Generated: {result['test_generator']['testset_size']}")
    print(f"  🔧 Generation Method: {result['test_generator']['generation_method']}")

except Exception as e:
    print(f"❌ Error processing PDF content: {e}")
    print("This might be due to a missing or unreadable PDF file, or to an API key or network issue.")

Applying HeadlinesExtractor: 100%|██████████| 17/17 [00:05<00:00,  3.34it/s]
Applying HeadlineSplitter: 100%|██████████| 17/17 [00:00<00:00, 511.67it/s]
Applying SummaryExtractor: 100%|██████████| 17/17 [00:06<00:00,  2.73it/s]
Applying CustomNodeFilter: 100%|██████████| 49/49 [00:11<00:00,  4.16it/s]
Applying EmbeddingExtractor: 100%|██████████| 17/17 [00:01<00:00, 11.93it/s]
Applying ThemesExtractor: 100%|██████████| 45/45 [


📊 Pipeline Results:
  📄 Documents Processed: 17
  🧠 Knowledge Graph Nodes: 17
  🧪 Test Cases Generated: 10
  🔧 Generation Method: knowledge_graph


### Understanding the PDF Processing Pipeline Architecture

The pipeline built above follows this flow:

```
PDF File → PDF Converter → Document Cleaner → Document Splitter
    ↓
Document Converter → Knowledge Graph Generator
    ↓                         ↓
Test Generator ← ← ← ← ← ← ← ←
    ↓
Test Dataset Saver
```

**Key Design Decisions:**

1. **Document Processing Chain**: We clean and split documents before knowledge graph generation to ensure high-quality input
2. **Dual Input to Test Generator**: Both the knowledge graph and original documents are provided to enable fallback generation methods
3. **Configurable Test Distribution**: We can control the types of questions generated (single-hop vs multi-hop)
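
The dual input to the test generator means it can only run once both the document converter and the knowledge graph generator have finished. The sketch below illustrates that dependency using the component names and connections from this notebook and Python's standard-library `graphlib` (Python 3.9+). It is an illustration of the dependency order only, not a description of how Haystack schedules components internally.

```python
from graphlib import TopologicalSorter

# (sender, receiver) pairs mirroring the pipeline.connect(...) calls in this notebook.
connections = [
    ("pdf_converter", "doc_cleaner"),
    ("doc_cleaner", "doc_splitter"),
    ("doc_splitter", "doc_converter"),
    ("doc_converter", "kg_generator"),
    ("kg_generator", "test_generator"),
    ("doc_converter", "test_generator"),
    ("test_generator", "test_saver"),
]


def execution_order(edges):
    """Return component names in an order that respects every connection."""
    dependencies = {}
    for sender, receiver in edges:
        dependencies.setdefault(sender, set())
        dependencies.setdefault(receiver, set()).add(sender)
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(connections)
print(" -> ".join(order))
```

Because `test_generator` depends on both `doc_converter` and `kg_generator`, it always appears after both of them in the resulting order.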

**Pipeline Parameters Explained:**
- `test_size=10`: Number of question-answer pairs to generate
- `split_length=50`: Number of sentences per document chunk
- `split_overlap=5`: Number of sentences shared between consecutive chunks
- `query_distribution`: Controls complexity of generated questions
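
The parameters above can be sketched in plain Python. This is a simplified illustration, not the implementation used by `DocumentSplitter` or `SyntheticTestGenerator`: `chunk_sentences` and `allocate_tests` are hypothetical helper names, and the rounding rule in `allocate_tests` is an assumption. The first function shows how a window of `split_length` sentences advances by `split_length - split_overlap`; the second shows how `query_distribution` weights could translate into per-type question counts.

```python
def chunk_sentences(sentences, split_length=50, split_overlap=5):
    """Group sentences into windows of split_length that share split_overlap sentences."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + split_length])
        if start + split_length >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


def allocate_tests(test_size, query_distribution):
    """Turn (name, weight) pairs into whole-number question counts summing to test_size."""
    total = sum(weight for _, weight in query_distribution)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    exact = [(name, test_size * weight) for name, weight in query_distribution]
    # Round down first (the small epsilon guards against float error), ...
    counts = {name: int(value + 1e-9) for name, value in exact}
    # ... then hand leftover questions to the largest fractional parts.
    leftover = test_size - sum(counts.values())
    by_fraction = sorted(exact, key=lambda item: item[1] - counts[item[0]], reverse=True)
    for name, _ in by_fraction[:leftover]:
        counts[name] += 1
    return counts


sentences = [f"sentence {i}" for i in range(120)]
chunks = chunk_sentences(sentences)
print([len(chunk) for chunk in chunks])

distribution = [
    ("single_hop", 0.3),
    ("multi_hop_specific", 0.3),
    ("multi_hop_abstract", 0.4),
]
print(allocate_tests(10, distribution))
```

With `test_size=10`, this split gives 3 single-hop, 3 multi-hop specific, and 4 multi-hop abstract questions, which matches the `synthesizer_name` counts in the sample output of this notebook.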

In [3]:
pipeline.draw(path="./images/pdf_knowledge_graph_pipeline.png")
print("📸 Pipeline diagram saved to: ./images/pdf_knowledge_graph_pipeline.png")

📸 Pipeline diagram saved to: ./images/pdf_knowledge_graph_pipeline.png


![](./images/pdf_knowledge_graph_pipeline.png)

In [4]:
import pandas as pd

# Load and display the generated synthetic tests
test_file_path = "data_for_eval/synthetic_tests_10_from_pdf.csv"

if os.path.exists(test_file_path):
    synthetic_tests_df = pd.read_csv(test_file_path)
    print("\n🧪 Synthetic Tests Sample:")
    print("First 5 rows:")
    display(synthetic_tests_df.head())
    print("Last 5 rows:")
    display(synthetic_tests_df.tail())
else:
    print("❌ Synthetic test file not found")


🧪 Synthetic Tests Sample:
First 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Who is Kevin Wadman and what is his role in th...,['NBER WORKING PAPER SERIES\nHOW PEOPLE USE CH...,Kevin Wadman is one of the co-authors of the N...,single_hop_specific_query_synthesizer
1,Wen did ChatGPT launch and how has it grown si...,['ABSTRACT Despite the rapid adoption of LLM c...,ChatGPT launched in November 2022. By July 202...,single_hop_specific_query_synthesizer
2,How many US adults used ChatGPT in late 2024?,['to classify messages without any human seein...,"28% of US adults used ChatGPT in late 2024, wh...",single_hop_specific_query_synthesizer
3,What are the primary work activities associate...,['<1-hop>\n\n5.4 O*NET Work Activities\nWe map...,The primary work activities associated with Ch...,multi_hop_specific_query_synthesizer
4,What datasets were used in the analysis and wh...,['<1-hop>\n\nData and Privacy\nIn this section...,"The analysis utilized several datasets, includ...",multi_hop_specific_query_synthesizer


Last 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
5,What does Zao-Sanders (2025) say about the dis...,['<1-hop>\n\nThis is consistent with the fact ...,Zao-Sanders (2025) indicates that their findin...,multi_hop_specific_query_synthesizer
6,How has the usage of ChatGPT evolved since its...,['<1-hop>\n\nNBER WORKING PAPER SERIES\nHOW PE...,"Since its launch in November 2022, the usage o...",multi_hop_abstract_query_synthesizer
7,What patterns of ChatGPT usage have been obser...,['<1-hop>\n\nNBER WORKING PAPER SERIES\nHOW PE...,"Since its launch in November 2022, ChatGPT has...",multi_hop_abstract_query_synthesizer
8,How has the usage of ChatGPT evolved since its...,['<1-hop>\n\nNBER WORKING PAPER SERIES\nHOW PE...,"Since its launch in November 2022, the usage o...",multi_hop_abstract_query_synthesizer
9,How does ChatGPT usage differ between work-rel...,['<1-hop>\n\nNBER WORKING PAPER SERIES\nHOW PE...,ChatGPT usage shows a significant difference b...,multi_hop_abstract_query_synthesizer


### Analyzing the Generated Test Dataset

Now let's examine the synthetic test data generated by our PDF processing pipeline, shown in the tables above.

**What to Look For:**
- **Question Quality**: Are the questions grammatically correct and meaningful?
- **Answer Accuracy**: Do the answers correctly reflect the source material?
- **Question Types**: Notice the variety of single-hop and multi-hop questions
- **Context Relevance**: Check if the reference contexts support the answers

**Common Question Types You'll See:**
1. **Single-hop questions**: Direct factual queries (e.g., "What is X?")
2. **Multi-hop specific**: Questions requiring connecting specific facts
3. **Multi-hop abstract**: Questions requiring broader reasoning across multiple concepts

**PDF-Specific Considerations:**
- **Text Extraction Quality**: PDFs may have formatting artifacts that affect question quality
- **Document Structure**: Well-structured PDFs tend to produce better knowledge graphs
- **Content Density**: Dense technical content may result in more complex questions
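
Some of these checks can be automated. The sketch below uses only the standard library and the column names visible in the output above (`user_input`, `reference`, `synthesizer_name`). `summarize_testset` is a hypothetical helper and the sample rows are placeholders, not real pipeline output. It counts questions per synthesizer, flags questions that repeat (ignoring case), and lists rows with an empty reference answer. A duplicate check is worth running here because rows 6 and 8 in the sample output show the same visible question prefix.

```python
import csv
import io
from collections import Counter


def summarize_testset(rows):
    """Basic sanity checks over test rows given as dicts keyed by column name."""
    rows = list(rows)
    by_synthesizer = Counter(row["synthesizer_name"] for row in rows)
    question_counts = Counter(row["user_input"].strip().lower() for row in rows)
    duplicates = sorted(q for q, count in question_counts.items() if count > 1)
    missing_reference = [i for i, row in enumerate(rows) if not row["reference"].strip()]
    return {
        "by_synthesizer": dict(by_synthesizer),
        "duplicate_questions": duplicates,
        "missing_reference_rows": missing_reference,
    }


# Placeholder rows for illustration; replace with the generated CSV file.
sample_csv = """user_input,reference_contexts,reference,synthesizer_name
Question A?,['context 1'],Answer A,single_hop_specific_query_synthesizer
Question B?,['context 2'],Answer B,multi_hop_abstract_query_synthesizer
question b?,['context 3'],,multi_hop_abstract_query_synthesizer
"""

report = summarize_testset(csv.DictReader(io.StringIO(sample_csv)))
for key, value in report.items():
    print(f"{key}: {value}")
```

To run it on the real dataset, open `data_for_eval/synthetic_tests_10_from_pdf.csv` and pass `csv.DictReader(file)` to the same function.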

## Summary

### What We've Accomplished

In this notebook, we successfully:

1. **Built a PDF Processing Pipeline**: Created an end-to-end pipeline that converts, cleans, and splits PDF documents
2. **Generated Knowledge Graphs**: Converted unstructured PDF content into structured knowledge representations
3. **Produced Synthetic Test Data**: Created question-answer pairs for evaluation and testing purposes
4. **Analyzed Results**: Examined the quality and characteristics of the generated synthetic dataset

### Key Benefits of This Approach

- **Automated Processing**: No manual intervention required for PDF-to-test-data conversion
- **Scalable**: Can process multiple PDF documents in batch
- **Quality-Driven**: Knowledge graphs supply structured context, which helps keep generated questions coherent and grounded in the source
- **Configurable**: Easy to adjust parameters for different use cases
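
For batch processing, the `sources` list passed to `pdf_converter` can hold more than one file. A minimal sketch for gathering every PDF in a folder follows; `collect_pdf_sources` is a hypothetical helper name, and the folder path in the usage comment is the one used earlier in this notebook.

```python
from pathlib import Path


def collect_pdf_sources(directory):
    """Return the PDF files found directly inside directory, sorted by path."""
    folder = Path(directory)
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Usage with the pipeline defined earlier in this notebook:
# pdf_sources = collect_pdf_sources("./data_for_indexing")
# result = pipeline.run({"pdf_converter": {"sources": pdf_sources}})
```

Note that every additional document adds LLM and embedding calls during knowledge graph construction, so run time and API cost are likely to grow with the batch size.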
