🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# PDF Knowledge Graph and Synthetic Data Generation Pipeline

This notebook demonstrates how to build a comprehensive pipeline for PDF document processing that:
1. **Extracts content** from PDF files using Haystack's PyPDFToDocument converter
2. **Preprocesses the text** with cleaning and splitting components
3. **Creates a knowledge graph** from the processed documents
4. **Generates synthetic test data** using the knowledge graph

## Learning Objectives

By the end of this notebook, you will understand:
- How to build end-to-end Haystack pipelines for PDF processing
- The relationship between knowledge graphs and synthetic test data generation
- Best practices for PDF document preprocessing
- How to evaluate synthetic datasets generated from PDF content

## Key Components
- **PyPDFToDocument**: Converts PDF files to Haystack Document objects
- **DocumentCleaner**: Removes extra whitespace and empty lines
- **DocumentSplitter**: Breaks documents into manageable chunks
- **KnowledgeGraphGenerator**: Creates structured knowledge representations
- **SyntheticTestGenerator**: Produces question-answer pairs for evaluation

## Why This Approach?
Using a knowledge graph as an intermediate step can improve the quality of synthetic test generation because:
- Knowledge graphs capture relationships between entities
- They provide structured context for question generation
- The resulting questions tend to be more coherent and better grounded in the source text

In [1]:
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter)
from haystack.components.generators import OpenAIGenerator
from haystack.components.embedders.openai_text_embedder import OpenAITextEmbedder
from haystack.utils import Secret
from pathlib import Path
from scripts.knowledge_graph_component import KnowledgeGraphGenerator
from scripts.langchaindocument_component import DocumentToLangChainConverter
from scripts.synthetic_test_components import SyntheticTestGenerator, TestDatasetSaver
                                                    
# Load environment variables
load_dotenv("./.env")

# Helper function to create fresh generator and embedder instances
def create_llm_components():
    """Create fresh instances of generator and embedder."""
    # You can use OpenAI models:
    # Secret.from_env_var reads the key lazily from the environment
    generator = OpenAIGenerator(
        model="gpt-4o-mini",
        api_key=Secret.from_env_var("OPENAI_API_KEY")
    )
    embedder = OpenAITextEmbedder(
        model="text-embedding-3-small",
        api_key=Secret.from_env_var("OPENAI_API_KEY")
    )
    
    # Or use Ollama models (uncomment to use):
    # from haystack_integrations.components.generators.ollama import OllamaGenerator
    # from haystack_integrations.components.embedders.ollama import OllamaTextEmbedder
    # 
    # generator = OllamaGenerator(
    #     model="mistral-nemo:12b",
    #     generation_kwargs={
    #         "num_predict": 100,
    #         "temperature": 0.9,
    #     }
    # )
    # embedder = OllamaTextEmbedder(model="nomic-embed-text")
    
    return generator, embedder
        
# Create pipeline components
pdf_converter = PyPDFToDocument()
doc_cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
)
doc_splitter = DocumentSplitter(split_by="sentence",
                                split_length=5,
                                split_overlap=1)
doc_converter = DocumentToLangChainConverter()

# Create knowledge graph component with its own generator and embedder instances
kg_gen, kg_embed = create_llm_components()
kg_generator = KnowledgeGraphGenerator(
    generator=kg_gen,
    embedder=kg_embed,
    apply_transforms=True
)

# Create test generator component with its own generator and embedder instances
test_gen, test_embed = create_llm_components()
test_generator = SyntheticTestGenerator(
    generator=test_gen,
    embedder=test_embed,
    test_size=10,
    query_distribution=[
        ("single_hop", 0.3),
        ("multi_hop_specific", 0.3),
        ("multi_hop_abstract", 0.4)
    ]
)
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_10_from_pdf.csv")

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("pdf_converter", pdf_converter)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect components in sequence
pipeline.connect("pdf_converter.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")


<haystack.core.pipeline.pipeline.Pipeline object at 0x31cd0d610>
🚅 Components
  - pdf_converter: PyPDFToDocument
  - doc_cleaner: DocumentCleaner
  - doc_splitter: DocumentSplitter
  - doc_converter: DocumentToLangChainConverter
  - kg_generator: KnowledgeGraphGenerator
  - test_generator: SyntheticTestGenerator
  - test_saver: TestDatasetSaver
🛤️ Connections
  - pdf_converter.documents -> doc_cleaner.documents (list[Document])
  - doc_cleaner.documents -> doc_splitter.documents (list[Document])
  - doc_splitter.documents -> doc_converter.documents (list[Document])
  - doc_converter.langchain_documents -> kg_generator.documents (List[Document])
  - doc_converter.langchain_documents -> test_generator.documents (List[Document])
  - kg_generator.knowledge_graph -> test_generator.knowledge_graph (KnowledgeGraph)
  - test_generator.testset -> test_saver.testset (DataFrame)

In [2]:
# Prepare input data: a list of PDF file paths for the converter
pdf_sources = [Path("./data_for_indexing/howpeopleuseai.pdf")]

try:
    # Run the pipeline on the PDF sources
    result = pipeline.run({
            "pdf_converter": {"sources": pdf_sources}
        })
    print("\n📊 Pipeline Results:")
    print(f"  📄 Documents Processed: {result['doc_converter']['document_count']}")
    print(f"  🧠 Knowledge Graph Nodes: {result['kg_generator']['node_count']}")
    print(f"  🧪 Test Cases Generated: {result['test_generator']['testset_size']}")
    print(f"  🔧 Generation Method: {result['test_generator']['generation_method']}")

except Exception as e:
    print(f"❌ Error processing PDF content: {str(e)}")
    print("This might be due to a missing or unreadable PDF file, or an API error.")

Applying SummaryExtractor: 100%|██████████| 163/163 [00:40<00:00,  4.01it/s]
Applying CustomNodeFilter:  10%|▉         | 18/185 [00:05<00:37,  4.41it/s]
Node 2b41eae4-c8fc-47f4-bea2-0c4df514fa87 does not have a summary. Skipping filtering.
Node 4fd81ab9-f797-4b10-8767-02e13a22be46 does not have a summary. Skipping filtering.
Node 0dc5ffdc-d76c-44ad-a519-a747ceebab60 does not have a summary. Skipping filtering.
Applying CustomNodeFilter:  22%|██▏       | 41/185 [00:11<00:36,  3.96it/s]
Node aa017d86-b472-400f-af93-f26ac4d1e36f does not have a summary. Skipping filtering.
Applying CustomNodeFilter:  28%|██▊       | 51/185 [00:12<00:27,  4.80it/s]
Node 48416ab7-775c-4f58-82cd-ef8d5a455eae does not have a summary. Skipping filtering.
Node 9742dd49-3487-4991-bf44-2c493fea6132 does not have a summary. Skipping filtering.
Node 73e00305-3349-472e-9966-4d4778d6b62c does not have a summary. Skipping filtering.
Applying CustomNodeFilter:  31%|███▏      | 58/


📊 Pipeline Results:
  📄 Documents Processed: 185
  🧠 Knowledge Graph Nodes: 185
  🧪 Test Cases Generated: 10
  🔧 Generation Method: knowledge_graph


### Understanding the PDF Processing Pipeline Architecture

The pipeline built above follows this flow:

```
PDF File → PDF Converter → Document Cleaner → Document Splitter
    ↓
Document Converter → Knowledge Graph Generator
    ↓                         ↓
Test Generator ← ← ← ← ← ← ← ←
    ↓
Test Dataset Saver
```

**Key Design Decisions:**

1. **Document Processing Chain**: We clean and split documents before knowledge graph generation to ensure high-quality input
2. **Dual Input to Test Generator**: Both the knowledge graph and original documents are provided to enable fallback generation methods
3. **Configurable Test Distribution**: We can control the types of questions generated (single-hop vs multi-hop)
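The second design decision can be sketched as follows. This is a minimal illustration of the fallback idea, not the actual `SyntheticTestGenerator` code in `scripts/`: the helper name `choose_generation_method` and the `"documents"` label are assumptions (only the `"knowledge_graph"` label appears in the pipeline output above).

```python
from typing import Optional, Sequence


def choose_generation_method(
    knowledge_graph_nodes: Optional[Sequence[object]],
    documents: Sequence[object],
) -> str:
    """Pick the input source for test generation (hypothetical helper)."""
    # Prefer the knowledge graph when it exists and has at least one node.
    if knowledge_graph_nodes:
        return "knowledge_graph"
    # Otherwise fall back to the raw document chunks, if there are any.
    if documents:
        return "documents"
    raise ValueError("Neither a knowledge graph nor documents were provided")


print(choose_generation_method(["node-1"], ["chunk-1"]))
print(choose_generation_method([], ["chunk-1"]))
```

Wiring both inputs into the component means a failed or empty graph build does not have to abort the whole run.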

**Pipeline Parameters Explained:**
- `test_size=10`: Number of question-answer pairs to generate
- `split_length=5`: Number of sentences per document chunk
- `query_distribution`: Controls complexity of generated questions
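To make these parameters concrete, here is a rough sketch of what they imply. Both helpers (`expected_counts`, `chunk_sentences`) are hypothetical and only approximate the behavior. The real `DocumentSplitter` and test generator may round, tokenize, or window differently.

```python
import math
from typing import Dict, List, Tuple


def expected_counts(test_size: int, distribution: List[Tuple[str, float]]) -> Dict[str, int]:
    """Approximate how many questions each query type receives.

    Assumption: weights are proportions summing to 1, and leftover questions
    go to the types with the largest fractional remainder.
    """
    if abs(sum(weight for _, weight in distribution) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    raw = [(name, test_size * weight) for name, weight in distribution]
    counts = {name: math.floor(value) for name, value in raw}
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda item: item[1] - math.floor(item[1]), reverse=True)
    for name, _ in by_remainder[:leftover]:
        counts[name] += 1
    return counts


def chunk_sentences(sentences: List[str], split_length: int, split_overlap: int) -> List[List[str]]:
    """Sliding window of `split_length` sentences sharing `split_overlap` sentences."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    step = split_length - split_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + split_length])
        if start + split_length >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


distribution = [("single_hop", 0.3), ("multi_hop_specific", 0.3), ("multi_hop_abstract", 0.4)]
print(expected_counts(10, distribution))
print(chunk_sentences([f"s{i}" for i in range(9)], split_length=5, split_overlap=1))
```

With `test_size=10` and weights of 0.3, 0.3, and 0.4, the sketch yields a 3/3/4 split, which matches the three single-hop, three multi-hop specific, and four multi-hop abstract rows in the generated dataset.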

In [3]:
pipeline.draw(path="./images/pdf_knowledge_graph_pipeline.png")
print("📸 Pipeline diagram saved to: ./images/pdf_knowledge_graph_pipeline.png")

📸 Pipeline diagram saved to: ./images/pdf_knowledge_graph_pipeline.png


In [4]:
import pandas as pd

# Load and display the generated synthetic tests
test_file_path = "data_for_eval/synthetic_tests_10_from_pdf.csv"

if os.path.exists(test_file_path):
    synthetic_tests_df = pd.read_csv(test_file_path)
    print("\n🧪 Synthetic Tests Sample:")
    print("First 5 rows:")
    display(synthetic_tests_df.head())
    print("Last 5 rows:")
    display(synthetic_tests_df.tail())
else:
    print("❌ Synthetic test file not found")


🧪 Synthetic Tests Sample:
First 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What NBER do?,['NBER WORKING PAPER SERIES\nHOW PEOPLE USE CH...,The National Bureau of Economic Research (NBER...,single_hop_specific_query_synthesizer
1,What is the role of Zoe Hitzig in the context ...,['The views expressed herein are those of the ...,Zoe Hitzig is one of the co-authors of the NBE...,single_hop_specific_query_synthesizer
2,Who are the authors of the NBER Working Paper ...,"['© 2025 by Aaron Chatterji, Thomas Cunningham...",The authors of the NBER Working Paper No. 3425...,single_hop_specific_query_synthesizer
3,Wht is the breakdown of convrsation topics in ...,['<1-hop>\n\nShares are calculated from a samp...,Figure 11 presents the breakdown of conversati...,multi_hop_specific_query_synthesizer
4,What are the differences in user demographics ...,['<1-hop>\n\n8Handa et al. (2025) report that ...,Handa et al. (2025) report that the discrepanc...,multi_hop_specific_query_synthesizer


Last 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
5,What does Table 5 reveal about the model-plura...,['<1-hop>\n\nA development set (46 items) was ...,Table 5 reveals that the model-plurality agree...,multi_hop_specific_query_synthesizer
6,How did the adoption of ChatGPT vary between l...,['<1-hop>\n\nThe figure below plots this propo...,The adoption of ChatGPT grew dramatically in l...,multi_hop_abstract_query_synthesizer
7,What impact did the adoption of ChatGPT have o...,['<1-hop>\n\nThe figure below plots this propo...,"By July 2025, the adoption of ChatGPT had a si...",multi_hop_abstract_query_synthesizer
8,What are the total daily counts of messages an...,['<1-hop>\n\nTotal daily counts are exact meas...,Total daily counts are exact measurements of m...,multi_hop_abstract_query_synthesizer
9,How does the quality of user interaction relat...,['<1-hop>\n\nWe do not show the shares for the...,The quality of user interaction is assessed th...,multi_hop_abstract_query_synthesizer


### Analyzing the Generated Test Dataset

Now let's examine the synthetic test data that was generated from our PDF processing pipeline.

**What to Look For:**
- **Question Quality**: Are the questions grammatically correct and meaningful?
- **Answer Accuracy**: Do the answers correctly reflect the source material?
- **Question Types**: Notice the variety of single-hop and multi-hop questions
- **Context Relevance**: Check if the reference contexts support the answers
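Some of these checks can be partly automated. The sketch below uses only the standard library and assumes the column names shown in the output above (`user_input`, `reference_contexts`, `reference`, `synthesizer_name`). The helper `summarize_testset` is hypothetical, and the sample rows are illustrative, not taken from the generated file.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, List, Tuple

REQUIRED_COLUMNS = ("user_input", "reference_contexts", "reference")


def summarize_testset(rows: Iterable[Dict[str, str]]) -> Tuple[Counter, List[int]]:
    """Count rows per synthesizer and list row positions with empty required fields."""
    per_synthesizer: Counter = Counter()
    incomplete: List[int] = []
    for position, row in enumerate(rows):
        per_synthesizer[row.get("synthesizer_name") or "unknown"] += 1
        if any(not (row.get(column) or "").strip() for column in REQUIRED_COLUMNS):
            incomplete.append(position)
    return per_synthesizer, incomplete


# Illustrative rows only; the second row has an empty reference answer.
sample_csv = io.StringIO(
    "user_input,reference_contexts,reference,synthesizer_name\n"
    "What is X?,['context A'],X is a thing.,single_hop_specific_query_synthesizer\n"
    "How do X and Y relate?,['context A'],,multi_hop_specific_query_synthesizer\n"
)
counts, incomplete = summarize_testset(csv.DictReader(sample_csv))
print(dict(counts))
print("Rows with missing fields:", incomplete)
```

To run it on the real dataset, pass `csv.DictReader(open("data_for_eval/synthetic_tests_10_from_pdf.csv", newline=""))` instead of the sample. Structural checks like these only catch missing fields and skewed distributions; answer accuracy and context relevance still need a manual read.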

**Common Question Types You'll See:**
1. **Single-hop questions**: Direct factual queries (e.g., "What is X?")
2. **Multi-hop specific**: Questions requiring connecting specific facts
3. **Multi-hop abstract**: Questions requiring broader reasoning across multiple concepts

**PDF-Specific Considerations:**
- **Text Extraction Quality**: PDFs may have formatting artifacts that affect question quality
- **Document Structure**: Well-structured PDFs tend to produce better knowledge graphs
- **Content Density**: Dense technical content may result in more complex questions
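As an illustration of the extraction artifacts mentioned above, the following sketch rejoins words hyphenated across line breaks and flags footnote markers fused to the next word (one of the sample contexts begins with `8Handa et al.`). The regular expressions and helper names are assumptions for demonstration; they are not part of `DocumentCleaner` or this pipeline.

```python
import re
from typing import List

# A word split across a line break with a trailing hyphen, e.g. "implemen-\ntation".
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")
# One or two digits fused to a capitalized word, e.g. "8Handa" (footnote marker + name).
_FUSED_FOOTNOTE = re.compile(r"\b\d{1,2}[A-Z][a-z]+")


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated at a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


def find_fused_footnotes(text: str) -> List[str]:
    """Return tokens that look like a footnote number glued to a word."""
    return _FUSED_FOOTNOTE.findall(text)


print(dehyphenate("an implemen-\ntation detail"))
print(find_fused_footnotes("8Handa et al. (2025) report that"))
```

Note that `dehyphenate` also merges genuine compounds that happen to break at a hyphen (for example `multi-` followed by `hop`), so it is worth spot-checking the cleaned text before relying on it.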

## Summary

### What We've Accomplished

In this notebook, we successfully:

1. **Built a PDF Processing Pipeline**: Created an end-to-end pipeline tailored to PDF documents
2. **Generated a Knowledge Graph**: Converted unstructured PDF content into a structured knowledge representation
3. **Produced Synthetic Test Data**: Created question-answer pairs for evaluation and testing purposes
4. **Analyzed Results**: Examined the quality and characteristics of the generated synthetic dataset

### Key Benefits of This Approach

- **Automated Processing**: No manual intervention required for PDF-to-test-data conversion
- **Scalable**: Can process multiple PDF documents in batch
- **Quality-Driven**: Knowledge graphs act as a quality filter for better synthetic questions
- **Configurable**: Easy to adjust parameters for different use cases
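For batch processing, note that the `sources` input in this notebook is already a list, so several PDFs can be passed in one run. A minimal sketch for gathering them, where the helper `collect_pdf_sources` is hypothetical and the directory name is taken from the notebook's single-file example:

```python
from pathlib import Path
from typing import List


def collect_pdf_sources(directory: str) -> List[Path]:
    """Return all PDF files directly inside `directory`, sorted by path."""
    folder = Path(directory)
    return sorted(path for path in folder.glob("*.pdf") if path.is_file())


pdf_sources = collect_pdf_sources("./data_for_indexing")
print(f"Found {len(pdf_sources)} PDF file(s)")
# The list can then be passed to the existing pipeline, for example:
# result = pipeline.run({"pdf_converter": {"sources": pdf_sources}})
```

Because `TestDatasetSaver` is configured with a single output path, all PDFs in one run would feed the same test set. Separate datasets per document would need separate runs or a different saver configuration.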
