🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# PDF Knowledge Graph and Synthetic Data Generation Pipeline

This notebook demonstrates how to build a comprehensive pipeline for PDF document processing that:
1. **Extracts content** from PDF files using Haystack's PyPDFToDocument converter
2. **Preprocesses the text** with cleaning and splitting components
3. **Creates a knowledge graph** from the processed documents
4. **Generates synthetic test data** using the knowledge graph

## Learning Objectives

By the end of this notebook, you will understand:
- How to build end-to-end Haystack pipelines for PDF processing
- The relationship between knowledge graphs and synthetic test data generation
- Best practices for PDF document preprocessing
- How to evaluate synthetic datasets generated from PDF content

## Key Components
- **PyPDFToDocument**: Converts PDF files to Haystack Document objects
- **DocumentCleaner**: Removes extra whitespace and empty lines
- **DocumentSplitter**: Breaks documents into manageable chunks
- **KnowledgeGraphGenerator**: Creates structured knowledge representations
- **SyntheticTestGenerator**: Produces question-answer pairs for evaluation

## Why This Approach?
Using knowledge graphs as an intermediate step improves the quality of synthetic test generation because:
- Knowledge graphs capture relationships between entities
- They provide structured context for question generation
- The resulting questions are more coherent and factually grounded
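
To make the idea of "relationships between entities" concrete, here is a minimal sketch of a knowledge graph stored as an adjacency list, with a helper that finds two-edge paths. Multi-hop questions need a chain of connected facts like these. The triples and the `two_hop_paths` helper are hypothetical illustrations, not output from or part of `KnowledgeGraphGenerator`.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not real pipeline output.
triples = [
    ("ChatGPT", "launched_in", "November 2022"),
    ("ChatGPT", "studied_in", "NBER working paper"),
    ("NBER working paper", "analyzes", "usage patterns"),
]

# Adjacency list: each entity maps to its outgoing (relation, neighbor) edges.
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))


def two_hop_paths(graph, start):
    """Return every path start -> middle -> end that spans exactly two edges."""
    paths = []
    for rel1, middle in graph.get(start, []):
        for rel2, end in graph.get(middle, []):
            paths.append((start, rel1, middle, rel2, end))
    return paths


paths = two_hop_paths(graph, "ChatGPT")
print(paths)
```

A single-hop question can be answered from one edge. A multi-hop question needs a path such as the one returned here, which is why a graph is a convenient intermediate structure.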

In [2]:
import os
from pathlib import Path

from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

from scripts.knowledge_graph_component import (
    DocumentToLangChainConverter,
    KnowledgeGraphGenerator,
)
from scripts.synthetic_test_components import (
    SyntheticTestGenerator,
    TestDatasetSaver,
)

# Load environment variables
load_dotenv("./.env")

        
# Create pipeline components
pdf_converter = PyPDFToDocument()
doc_cleaner = DocumentCleaner(remove_empty_lines=True,
                                remove_extra_whitespaces=True)
doc_splitter = DocumentSplitter(split_by="sentence",
                                split_length=50,
                                split_overlap=5)
doc_converter = DocumentToLangChainConverter()
kg_generator = KnowledgeGraphGenerator(apply_transforms=True)


test_generator = SyntheticTestGenerator(
    testset_size=10,  
    llm_model="gpt-4o-mini",
    query_distribution=[
        ("single_hop", 0.25), 
        ("multi_hop_specific", 0.25),
        ("multi_hop_abstract", 0.5)
    ],
    # Optional: uncomment to cap the test set size if you hit API timeouts or rate limits
    # max_testset_size=5,
)
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_10_from_pdf.csv")

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("pdf_converter", pdf_converter)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect components in sequence
pipeline.connect("pdf_converter.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")

<haystack.core.pipeline.pipeline.Pipeline object at 0x300728f50>
🚅 Components
  - pdf_converter: PyPDFToDocument
  - doc_cleaner: DocumentCleaner
  - doc_splitter: DocumentSplitter
  - doc_converter: DocumentToLangChainConverter
  - kg_generator: KnowledgeGraphGenerator
  - test_generator: SyntheticTestGenerator
  - test_saver: TestDatasetSaver
🛤️ Connections
  - pdf_converter.documents -> doc_cleaner.documents (list[Document])
  - doc_cleaner.documents -> doc_splitter.documents (list[Document])
  - doc_splitter.documents -> doc_converter.documents (list[Document])
  - doc_converter.langchain_documents -> kg_generator.documents (List[Document])
  - doc_converter.langchain_documents -> test_generator.documents (List[Document])
  - kg_generator.knowledge_graph -> test_generator.knowledge_graph (KnowledgeGraph)
  - test_generator.testset -> test_saver.testset (DataFrame)

In [3]:
# Prepare input data: the converter accepts file paths directly, so no ByteStream conversion is needed
pdf_sources = [Path("./data_for_indexing/howpeopleuseai.pdf")]
result = pipeline.run({"pdf_converter": {"sources": pdf_sources}})

Applying HeadlinesExtractor: 100%|██████████| 17/17 [00:09<00:00,  1.81it/s]
Applying HeadlineSplitter: 100%|██████████| 17/17 [00:00<00:00, 346.79it/s]
Applying SummaryExtractor: 100%|██████████| 17/17 [00:10<00:00,  1.63it/s]
Applying CustomNodeFilter: 100%|██████████| 49/49 [00:26<00:00,  1.82it/s]
Applying EmbeddingExtractor: 100%|██████████| 17/17 [00:04<00:00,  3.79it/s]
Applying ThemesExtractor: 100%|██████████| 44/44 [00:29<00:00,  1.48it/s]
Applying NERExtractor: 100%|██████████| 44/44 [00:23<00:00,  1.84it/s]
Applying CosineSimilarityBuilder: 100%

### Understanding the PDF Processing Pipeline Architecture

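The pipeline built above follows this flow. Note that the Document Converter feeds two components: the Knowledge Graph Generator and the Test Generator.

```
PDF File → PDF Converter → Document Cleaner → Document Splitter → Document Converter

Document Converter → Knowledge Graph Generator → Test Generator   (knowledge graph)
Document Converter → Test Generator                               (documents)

Test Generator → Test Dataset Saver
```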

**Key Design Decisions:**

1. **Document Processing Chain**: We clean and split documents before knowledge graph generation so the graph is built from tidy, chunk-sized text rather than raw extracted pages
2. **Dual Input to Test Generator**: Both the knowledge graph and original documents are provided to enable fallback generation methods
3. **Configurable Test Distribution**: We can control the types of questions generated (single-hop vs multi-hop)
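
The dual-input design makes the pipeline a directed acyclic graph rather than a simple chain. As a rough sketch, the dependencies declared by the `pipeline.connect(...)` calls can be written as a mapping and ordered with the standard library's `graphlib`. This only illustrates the topology; it is not a description of how Haystack schedules components internally.

```python
from graphlib import TopologicalSorter

# Each key lists the components whose output it consumes,
# mirroring the pipeline.connect(...) calls in this notebook.
dependencies = {
    "doc_cleaner": {"pdf_converter"},
    "doc_splitter": {"doc_cleaner"},
    "doc_converter": {"doc_splitter"},
    "kg_generator": {"doc_converter"},
    # The test generator waits for both the graph and the converted documents.
    "test_generator": {"kg_generator", "doc_converter"},
    "test_saver": {"test_generator"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because `test_generator` depends on `kg_generator`, it can only run after the knowledge graph exists, even though the converted documents are available earlier.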

**Pipeline Parameters Explained:**
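- `testset_size=10`: Target number of question-answer pairs to generate
- `split_length=50`: Number of sentences per document chunk (the unit is sentences because `split_by="sentence"`)
- `split_overlap=5`: Number of sentences shared between consecutive chunks
- `query_distribution`: Relative weight of each question type; here 25% single-hop, 25% multi-hop specific, and 50% multi-hop abstract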
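
To see how `split_length` and `split_overlap` interact, the following sketch groups a list of sentences into overlapping windows. It is a simplified illustration of the sliding-window idea under the assumption that each window advances by `split_length - split_overlap` units; `split_with_overlap` is a hypothetical helper, and Haystack's `DocumentSplitter` may treat edge cases differently.

```python
def split_with_overlap(units, split_length, split_overlap):
    """Group units into windows of split_length that share split_overlap units."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        # Stop once a window has reached the end of the input.
        if start + split_length >= len(units):
            break
    return chunks


# 120 placeholder "sentences" standing in for a cleaned document.
sentences = [f"s{i}" for i in range(120)]
chunks = split_with_overlap(sentences, split_length=50, split_overlap=5)
print([(chunk[0], chunk[-1], len(chunk)) for chunk in chunks])
```

With these settings each chunk starts 45 sentences after the previous one, so the last five sentences of one chunk reappear at the start of the next. That overlap helps preserve context that would otherwise be cut at a chunk boundary.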

In [4]:
pipeline.draw(path="./images/pdf_knowledge_graph_pipeline.png")
print("📸 Pipeline diagram saved to: ./images/pdf_knowledge_graph_pipeline.png")

📸 Pipeline diagram saved to: ./images/pdf_knowledge_graph_pipeline.png


![](./images/pdf_knowledge_graph_pipeline.png)

In [5]:
import pandas as pd

# Load and display the generated synthetic tests
test_file_path = "data_for_eval/synthetic_tests_10_from_pdf.csv"

if os.path.exists(test_file_path):
    synthetic_tests_df = pd.read_csv(test_file_path)
    print("\n🧪 Synthetic Tests Sample:")
    print("First 5 rows:")
    display(synthetic_tests_df.head())
    print("Last 5 rows:")
    display(synthetic_tests_df.tail())
else:
    print("❌ Synthetic test file not found")


🧪 Synthetic Tests Sample:
First 5 rows:


| Unnamed: 0 | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Who is Kevin Wadman and what role does he play... | ['NBER WORKING PAPER SERIES\nHOW PEOPLE USE CH... | Kevin Wadman is one of the co-authors of the N... | single_hop_specific_query_synthesizer |
| 1 | When did ChatGPT launch and how many users did... | ['ABSTRACT Despite the rapid adoption of LLM c... | ChatGPT launched in November 2022. By July 202... | single_hop_specific_query_synthesizer |
| 2 | What info in Appendix B? | ['to classify messages without any human seein... | Appendix B contains details about how the prom... | single_hop_specific_query_synthesizer |
| 3 | What are the primary functions of ChatGPT usag... | ['<1-hop>\n\nWe also document several importan... | The primary functions of ChatGPT usage at work... | multi_hop_specific_query_synthesizer |
| 4 | What are the trends in user satisfaction regar... | ['<1-hop>\n\n37% of messages are work-related\... | The trends in user satisfaction regarding Tech... | multi_hop_specific_query_synthesizer |


Last 5 rows:


| Unnamed: 0 | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 6 | What are the primary conversation topics users... | ['<1-hop>\n\nFigure 9 disaggregates four of th... | Users primarily engage with ChatGPT for conver... | multi_hop_abstract_query_synthesizer |
| 7 | How did the usage patterns of ChatGPT evolve f... | ['<1-hop>\n\nNBER WORKING PAPER SERIES\nHOW PE... | From its launch in November 2022 to July 2025,... | multi_hop_abstract_query_synthesizer |
| 8 | How do the trends in ChatGPT user cohorts rela... | ['<1-hop>\n\nThe yellow line represents the fi... | The trends in ChatGPT user cohorts indicate th... | multi_hop_abstract_query_synthesizer |
| 9 | What are the primary usage patterns of ChatGPT... | ['<1-hop>\n\nNBER WORKING PAPER SERIES\nHOW PE... | The primary usage patterns of ChatGPT reveal t... | multi_hop_abstract_query_synthesizer |
| 10 | What are the main conversation topics users en... | ['<1-hop>\n\nFigure 9 disaggregates four of th... | Users engage with ChatGPT primarily for conver... | multi_hop_abstract_query_synthesizer |


### Analyzing the Generated Test Dataset

Now let's examine the synthetic test data that was generated from our PDF processing pipeline.

**What to Look For:**
- **Question Quality**: Are the questions grammatically correct and meaningful?
- **Answer Accuracy**: Do the answers correctly reflect the source material?
- **Question Types**: Notice the variety of single-hop and multi-hop questions
- **Context Relevance**: Check if the reference contexts support the answers
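
Some of these checks can be automated before a manual review. The sketch below runs basic sanity checks on rows in the same column layout as the saved CSV. It assumes `reference_contexts` is stored as a stringified Python list, as the `['...` prefix in the output above suggests. The two sample rows and the `check_rows` helper are hypothetical.

```python
import ast
import csv
import io
from collections import Counter

# Hypothetical two-row sample in the same column layout as the saved CSV.
sample_csv = '''user_input,reference_contexts,reference,synthesizer_name
What is X?,"['X is a placeholder term.']",X is a placeholder term.,single_hop_specific_query_synthesizer
How does X relate to Y?,"['<1-hop> X feeds Y.', '<2-hop> Y depends on X.']",Y depends on X.,multi_hop_specific_query_synthesizer
'''


def check_rows(rows):
    """Return a list of (row_number, problem) for rows that fail basic checks."""
    problems = []
    for number, row in enumerate(rows):
        # Assumption: contexts were saved as the string form of a Python list.
        contexts = ast.literal_eval(row["reference_contexts"])
        if not row["user_input"].strip():
            problems.append((number, "empty question"))
        if not row["reference"].strip():
            problems.append((number, "empty reference answer"))
        if not contexts:
            problems.append((number, "no reference contexts"))
    return problems


rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = check_rows(rows)
type_counts = Counter(row["synthesizer_name"] for row in rows)
print(problems, dict(type_counts))
```

Checks like these only catch structural problems such as empty fields. Whether an answer is actually supported by its contexts still needs a human reviewer or a dedicated evaluation step.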

**Common Question Types You'll See:**
1. **Single-hop questions**: Direct factual queries (e.g., "What is X?")
2. **Multi-hop specific**: Questions requiring connecting specific facts
3. **Multi-hop abstract**: Questions requiring broader reasoning across multiple concepts
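
The mix of these three types is driven by the `query_distribution` weights. As a rough sketch of how weights could translate into per-type counts, the hypothetical `allocate_counts` helper below uses largest-remainder rounding. This is an assumption for illustration: the generator's actual rounding is not specified in this notebook, and the counts it produces may differ.

```python
def allocate_counts(testset_size, query_distribution):
    """Split testset_size across query types in proportion to their weights."""
    total_weight = sum(weight for _, weight in query_distribution)
    raw = [
        (name, testset_size * weight / total_weight)
        for name, weight in query_distribution
    ]
    # Start from the rounded-down share of each type.
    counts = {name: int(share) for name, share in raw}
    # Hand out any remainder to the types with the largest fractional share.
    leftover = testset_size - sum(counts.values())
    by_fraction = sorted(raw, key=lambda item: item[1] - int(item[1]), reverse=True)
    for name, _ in by_fraction[:leftover]:
        counts[name] += 1
    return counts


counts = allocate_counts(
    10,
    [("single_hop", 0.25), ("multi_hop_specific", 0.25), ("multi_hop_abstract", 0.5)],
)
print(counts)
```

With a test set of 10, half of the questions go to the multi-hop abstract type and the remaining five are split between the other two.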

**PDF-Specific Considerations:**
- **Text Extraction Quality**: PDFs may have formatting artifacts that affect question quality
- **Document Structure**: Well-structured PDFs tend to produce better knowledge graphs
- **Content Density**: Dense technical content may result in more complex questions
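
Typical extraction artifacts include words hyphenated across line breaks, runs of spaces, and stray empty lines. The sketch below shows one way to tidy such text with regular expressions. `tidy_extracted_text` is a hypothetical helper written for illustration; it is not what `DocumentCleaner` does internally, and the de-hyphenation rule in particular is a heuristic.

```python
import re


def tidy_extracted_text(text):
    """Apply a few common cleanups to text extracted from a PDF."""
    # Rejoin words hyphenated across a line break: "infor-\nmation" -> "information".
    # Heuristic: this also joins genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the ones that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


raw = "ChatGPT  usage infor-\nmation\n\n\n  37% of messages   are work-related  "
cleaned = tidy_extracted_text(raw)
print(cleaned)
```

Inspecting a few cleaned chunks before building the knowledge graph is a cheap way to spot artifacts that would otherwise leak into the generated questions.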

## Summary

### What We've Accomplished

In this notebook, we successfully:

1. **Built a PDF Processing Pipeline**: Created an end-to-end pipeline that converts, cleans, and splits PDF documents
2. **Generated Knowledge Graphs**: Converted unstructured PDF content into structured knowledge representations
3. **Produced Synthetic Test Data**: Created question-answer pairs for evaluation and testing purposes
4. **Analyzed Results**: Examined the quality and characteristics of the generated synthetic dataset

### Key Benefits of This Approach

- **Automated Processing**: No manual intervention is required for PDF-to-test-data conversion
- **Scalable**: The converter's `sources` input is a list, so multiple PDF documents can be processed in one run
- **Quality-Driven**: Grounding question generation in the knowledge graph keeps questions tied to entities and relationships found in the source
- **Configurable**: Easy to adjust parameters for different use cases
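
For batch processing, the list passed as `sources` can be built from a directory. The sketch below gathers every PDF in a folder; `collect_pdf_sources` is a hypothetical helper, and the demonstration uses a temporary directory with empty placeholder files so that it runs on its own.

```python
import tempfile
from pathlib import Path


def collect_pdf_sources(directory):
    """Return all PDF files directly inside directory, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    found_names = [path.name for path in collect_pdf_sources(tmp)]

print(found_names)
```

In this notebook the result could be passed to the pipeline as `pipeline.run({"pdf_converter": {"sources": collect_pdf_sources("./data_for_indexing")}})`. Keep in mind that larger inputs mean more LLM calls during knowledge graph construction.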
