🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# Web Content Knowledge Graph and Synthetic Data Generation Pipeline

This notebook demonstrates how to build a comprehensive pipeline for web content processing that:
1. **Retrieves content** from web URLs using Haystack's LinkContentFetcher
2. **Converts HTML** to structured documents using HTMLToDocument
3. **Preprocesses the text** with cleaning and splitting components
4. **Creates a knowledge graph** from the processed web content
5. **Generates synthetic test data** using the knowledge graph

## Learning Objectives

By the end of this notebook, you will understand:
- How to build end-to-end Haystack pipelines for web content processing
- The differences between PDF and web content processing
- Best practices for web scraping and content extraction
- How web content characteristics affect synthetic test generation

## Key Components for Web Processing
- **LinkContentFetcher**: Retrieves content directly from URLs
- **HTMLToDocument**: Converts HTML content to Haystack Documents
- **DocumentCleaner**: Removes empty lines, extra whitespace, and unwanted substrings
- **DocumentSplitter**: Breaks web content into manageable chunks
- **KnowledgeGraphGenerator**: Creates structured knowledge representations
- **SyntheticTestGenerator**: Produces question-answer pairs for evaluation

## Real-World Applications
This approach is particularly useful for:
- **Documentation Analysis**: Processing online documentation and creating test datasets
- **Content Monitoring**: Regularly generating tests from updated web content  
- **Multi-Source Knowledge**: Combining web content with other document types
- **Research Applications**: Creating datasets from academic papers, blog posts, etc.

## Technical Considerations
- **Rate Limiting**: Be mindful of website rate limits when fetching content
- **Content Quality**: Web content may require more aggressive cleaning
- **Dynamic Content**: Some websites use JavaScript; static HTML fetching may miss content
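The rate-limiting point can be made concrete with a small helper that pauses between requests. This is a minimal sketch, not part of Haystack: `fetch_politely`, its `fetch` callable, and the one-second default delay are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call `fetch` on each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

In this notebook, `fetch` could be a function that runs the pipeline for a single URL, for example `lambda url: pipeline.run({"fetcher": {"urls": [url]}})`. A suitable delay depends on the target site, so treat the default as a placeholder.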

In [1]:
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter)
from pathlib import Path
from scripts.knowledge_graph_component import KnowledgeGraphGenerator,\
                                                DocumentToLangChainConverter
from scripts.synthetic_test_components import SyntheticTestGenerator,\
                                                TestDatasetSaver

# Load environment variables
load_dotenv("./.env")

# Create web content processing components
fetcher = LinkContentFetcher()
converter = HTMLToDocument()
doc_cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    # Strip hop markers left over from synthetic data generation,
    # plus stray newline runs, form feeds, and carriage returns
    remove_substrings=[
        '<1-hop>\n\n', '<multi-hop>\n\n', '<single-hop>\n\n',
        '\n\n\n', '\f', '\r',
    ],
)
doc_splitter = DocumentSplitter(split_by="sentence",
                                split_length=50,
                                split_overlap=5)
doc_converter = DocumentToLangChainConverter()
kg_generator = KnowledgeGraphGenerator(apply_transforms=True)
test_generator = SyntheticTestGenerator(
            testset_size=10,  
            llm_model="gpt-4o-mini",
            query_distribution=[
                ("single_hop", 0.25), 
                ("multi_hop_specific", 0.25),
                ("multi_hop_abstract", 0.5)
            ]
        )
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_10_from_web.csv")

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)
pipeline.add_component("converter", converter)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect components in sequence
pipeline.connect("fetcher.streams", "converter.sources")
pipeline.connect("converter.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")

print("✅ Web Content Processing Pipeline created successfully!")
print("🌐 Ready to process web content and generate knowledge graphs + synthetic tests")


✅ Web Content Processing Pipeline created successfully!
🌐 Ready to process web content and generate knowledge graphs + synthetic tests


In [2]:
web_url = "https://haystack.deepset.ai/blog/haystack-2-release"

print(f"🌐 Processing web content from: {web_url}")
print("This may take a moment to fetch and process the content...")

try:
    result = pipeline.run({
        "fetcher": {"urls": [web_url]}
    })

    print("\n📊 Pipeline Results:")
    print(f"  📄 Documents Processed: {result['doc_converter']['document_count']}")
    print(f"  🧠 Knowledge Graph Nodes: {result['kg_generator']['node_count']}")
    print(f"  🧪 Test Cases Generated: {result['test_generator']['testset_size']}")
    print(f"  🔧 Generation Method: {result['test_generator']['generation_method']}")
    
except Exception as e:
    print(f"❌ Error processing web content: {str(e)}")
    print("This might be due to network issues or website access restrictions.")

🌐 Processing web content from: https://haystack.deepset.ai/blog/haystack-2-release
This may take a moment to fetch and process the content...


Applying HeadlinesExtractor: 100%|██████████| 2/2 [00:02<00:00,  1.41s/it]
Applying HeadlineSplitter: 100%|██████████| 2/2 [00:00<00:00, 259.53it/s]
Applying SummaryExtractor: 100%|██████████| 2/2 [00:05<00:00,  2.95s/it]
Applying CustomNodeFilter: 100%|██████████| 6/6 [00:05<00:00,  1.15it/s]
Applying EmbeddingExtractor: 100%|██████████| 2/2 [00:00<00:00,  2.82it/s]
Applying ThemesExtractor: 100%|██████████| 6/6 [00:07<00:00,  1.21s/it]
Applying NERExtractor: 100%|██████████| 6/6 [00:05<00:00,  1.01it/s]
Applying CosineSimilarityBuilder: 100%|██████████| 1/1 [


📊 Pipeline Results:
  📄 Documents Processed: 2
  🧠 Knowledge Graph Nodes: 2
  🧪 Test Cases Generated: 11
  🔧 Generation Method: knowledge_graph


### Understanding the Web Content Processing Pipeline Architecture

The web processing pipeline follows a similar structure to PDF processing but with adapted input components:

```
Web URL → Link Fetcher → HTML Converter → Document Cleaner → Document Splitter
    ↓
Document Converter → Knowledge Graph Generator  
    ↓                         ↓
Test Generator ← ← ← ← ← ← ← ←
    ↓
Test Dataset Saver
```
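The diagram describes a directed acyclic graph: each component runs only after the components it reads from. As an illustration only (this is not how Haystack schedules components internally), the `pipeline.connect(...)` calls from this notebook can be written as a dependency map and ordered with the standard library's `graphlib`, assuming Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each key lists the components whose output it consumes,
# mirroring the pipeline.connect(...) calls in this notebook.
dependencies = {
    "converter": {"fetcher"},
    "doc_cleaner": {"converter"},
    "doc_splitter": {"doc_cleaner"},
    "doc_converter": {"doc_splitter"},
    "kg_generator": {"doc_converter"},
    "test_generator": {"kg_generator", "doc_converter"},
    "test_saver": {"test_generator"},
}

# A valid execution order: every component appears after its inputs
execution_order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(execution_order))
```

Note that `test_generator` depends on both `kg_generator` and `doc_converter`, which is why the diagram shows two paths converging on it.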

**Why This Works:**
- The knowledge graph generation is **content-agnostic** - it works the same whether input comes from PDFs, web pages, or other sources
- Document preprocessing steps ensure consistent quality regardless of input format
- The same test generation logic produces comparable quality across all sources

**Pipeline Reusability:**
Notice how we can reuse the same components (`doc_cleaner`, `doc_splitter`, `kg_generator`, etc.) with different input sources. This demonstrates the modularity and flexibility of Haystack's component architecture.

**Web-Specific Considerations:**
- **Content Structure**: Web pages may have navigation, ads, and other non-content elements
- **HTML Artifacts**: May require more aggressive cleaning than PDF content
- **Dynamic Loading**: Static HTML fetching may miss JavaScript-rendered content
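One rough way to handle the non-content elements mentioned above is a line-level heuristic applied to the extracted text. The sketch below is an assumption-laden example, not a Haystack feature: `drop_probable_boilerplate` and its `min_words` threshold are hypothetical, and the rule (keep lines that are long enough or end in sentence punctuation) will also drop short headings.

```python
import re


def drop_probable_boilerplate(text: str, min_words: int = 5) -> str:
    """Keep lines that look like prose; drop short, menu-like lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        ends_like_sentence = bool(re.search(r"[.!?:]$", stripped))
        if len(stripped.split()) >= min_words or ends_like_sentence:
            kept.append(stripped)
    return "\n".join(kept)
```

A function like this could run on each document's text before splitting. For pages with heavy navigation or ads, a dedicated content-extraction library is likely to be more reliable than this heuristic.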

In [3]:
pipeline.draw(path="./images/web_knowledge_graph_pipeline.png")
print("📸 Pipeline diagram saved to: ./images/web_knowledge_graph_pipeline.png")

📸 Pipeline diagram saved to: ./images/web_knowledge_graph_pipeline.png


In [4]:
import pandas as pd

# Load and display the generated synthetic tests
test_file_path = "data_for_eval/synthetic_tests_10_from_web.csv"

if os.path.exists(test_file_path):
    synthetic_tests_df = pd.read_csv(test_file_path)
    print("\n🧪 Synthetic Tests Sample:")
    print("First 5 rows:")
    display(synthetic_tests_df.head())
    print("Last 5 rows:")
    display(synthetic_tests_df.tail())
else:
    print("❌ Synthetic test file not found")
    print("Please run the previous cells to generate the test data.")


🧪 Synthetic Tests Sample:
First 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the key features and improvements int...,['Haystack 2.0: The Composable Open-Source LLM...,Haystack 2.0 introduces several key features a...,single_hop_specific_query_synthesizer
1,Wut is AI in Haystack 2.0?,['Composable and customizable Pipelines\nModer...,AI in Haystack 2.0 refers to the customizable ...,single_hop_specific_query_synthesizer
2,What role do LLMs play in modern AI applications?,['A common interface for storing data - A clea...,"In modern AI applications, LLMs are used to an...",single_hop_specific_query_synthesizer
3,What are the key improvements in Haystack 2.0 ...,['<1-hop>\n\nA common interface for storing da...,Haystack 2.0 introduces significant improvemen...,multi_hop_specific_query_synthesizer
4,What are the benefits of using Chroma in Hayst...,"['<1-hop>\n\nThese include Chroma, Weaviate, P...",Chroma is one of the many storage services int...,multi_hop_specific_query_synthesizer


Last 5 rows:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
6,How Haystack 2.0 make user-friendly installati...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 makes user-friendly installation ...,multi_hop_abstract_query_synthesizer
7,How does Haystack 2.0 facilitate the developme...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 facilitates the development of pr...,multi_hop_abstract_query_synthesizer
8,What features in Haystack 2.0 contribute to it...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 offers a user-friendly installati...,multi_hop_abstract_query_synthesizer
9,"How does Haystack 2.0, as an open-source frame...",['<1-hop>\n\nHaystack 2.0: The Composable Open...,Haystack 2.0 is an open-source Python framewor...,multi_hop_abstract_query_synthesizer
10,How can I create customizable pipelines in Hay...,['<1-hop>\n\nHaystack 2.0: The Composable Open...,To create customizable pipelines in Haystack 2...,multi_hop_abstract_query_synthesizer
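To check how the generated tests are spread across synthesizers, and how that compares with the `query_distribution` requested in the first cell, the share of each `synthesizer_name` can be computed. This is a minimal sketch; `synthesizer_shares` is a hypothetical helper that accepts any iterable of names.

```python
from collections import Counter
from typing import Dict, Iterable


def synthesizer_shares(names: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of test cases produced by each synthesizer."""
    counts = Counter(names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}
```

With the DataFrame loaded above, `synthesizer_shares(synthetic_tests_df["synthesizer_name"])` would give the observed mix. The realized shares may not match the requested weights exactly, especially for small test sets.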


### Analyzing Web Content vs PDF Results

This section outlines how synthetic test generation on web content can differ from generation on PDF content.

**Expected Differences:**
- **Content Structure**: Web content may have different formatting and structure
- **Question Complexity**: Generated questions tend to track the complexity of the source material
- **Context Quality**: Web content might include navigation elements or ads that need filtering

**Web Content Specific Benefits:**
1. **Real-time Content**: Access to the most current information available online
2. **Rich Media Context**: Web pages often have supplementary context that enhances understanding
3. **Diverse Sources**: Easy to process content from multiple websites
4. **Hyperlinked Knowledge**: Web content often contains references that enrich the knowledge graph

**Potential Challenges:**
1. **Content Quality Variability**: Web content quality can vary significantly
2. **Noise Filtering**: Need to filter out navigation, ads, and irrelevant content
3. **Rate Limiting**: Must respect website rate limits and robots.txt
4. **Dynamic Content**: Some content may require JavaScript rendering
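Respecting robots.txt can be checked before a URL is handed to the fetcher. The sketch below uses the standard library's `urllib.robotparser`; the `is_allowed` helper and the example rules are hypothetical, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules, made up for illustration
example_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "https://example.com/blog/post"))
print(is_allowed(example_rules, "https://example.com/private/page"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads and parses the site's own robots.txt instead of an inline string.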

## Summary

### What We've Accomplished

In this notebook, we successfully:

1. **Built a Web Content Processing Pipeline**: Created an end-to-end pipeline specifically optimized for web content
2. **Demonstrated Source Flexibility**: Processed a live web page with the same cleaning, splitting, and generation components that work for other input sources
3. **Generated Knowledge Graphs from Web Content**: Converted unstructured web content into structured knowledge representations
4. **Produced Synthetic Test Data**: Created question-answer pairs from the fetched web content and saved them to a CSV file
5. **Analyzed Web-Specific Characteristics**: Examined how web content affects synthetic test generation

### Key Advantages of Web Content Processing

- **Real-Time Content**: Access to the most current information available
- **Diverse Sources**: Easy to process content from multiple websites in sequence
- **Rich Context**: Web content often includes hyperlinks and references that enhance knowledge graphs
- **Scalable Collection**: Can systematically process large numbers of web resources

