🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# Web Content Knowledge Graph and Synthetic Data Generation Pipeline

This notebook demonstrates how to build a comprehensive pipeline for web content processing that:
1. **Retrieves content** from web URLs using Haystack's LinkContentFetcher
2. **Converts HTML** to structured documents using HTMLToDocument
3. **Preprocesses the text** with cleaning and splitting components
4. **Creates a knowledge graph** from the processed web content
5. **Generates synthetic test data** using the knowledge graph

## Learning Objectives

By the end of this notebook, you will understand:
- How to build end-to-end Haystack pipelines for web content processing
- The differences between PDF and web content processing
- Best practices for web scraping and content extraction
- How web content characteristics affect synthetic test generation

## Key Components for Web Processing
- **LinkContentFetcher**: Retrieves content directly from URLs
- **HTMLToDocument**: Converts HTML content to Haystack Documents
- **DocumentCleaner**: Removes extra whitespaces and HTML artifacts
- **DocumentSplitter**: Breaks web content into manageable chunks
- **KnowledgeGraphGenerator**: Creates structured knowledge representations
- **SyntheticTestGenerator**: Produces question-answer pairs for evaluation

## Real-World Applications
This approach is particularly useful for:
- **Documentation Analysis**: Processing online documentation and creating test datasets
- **Content Monitoring**: Regularly generating tests from updated web content  
- **Multi-Source Knowledge**: Combining web content with other document types
- **Research Applications**: Creating datasets from academic papers, blog posts, etc.

## Technical Considerations
- **Rate Limiting**: Be mindful of website rate limits when fetching content
- **Content Quality**: Web content may require more aggressive cleaning
- **Dynamic Content**: Some websites use JavaScript; static HTML fetching may miss content
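
The rate-limiting point can be sketched as a small generator that spaces out requests. This is a minimal illustration, not part of the notebook's pipeline: `throttled`, `delay_seconds`, and the injectable `sleep` argument are hypothetical names, and the example URLs are placeholders.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(
    urls: Iterable[str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs one at a time, pausing between consecutive items."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        yield url


# Usage: record the pauses instead of really sleeping, so the demo runs instantly
pauses = []
example_urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
ordered = list(throttled(example_urls, delay_seconds=2.0, sleep=pauses.append))
print(ordered)
print(pauses)  # one pause between each pair of consecutive URLs
```

In a real run, each yielded URL would be fetched individually before the next pause; the default `time.sleep` then enforces the delay.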

In [None]:
import os
from dotenv import load_dotenv
from haystack import Pipeline
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter)
from haystack.components.generators import OpenAIGenerator
from haystack.components.embedders.openai_text_embedder import OpenAITextEmbedder
from haystack.utils import Secret
from pathlib import Path
from scripts.knowledge_graph_component import KnowledgeGraphGenerator
from scripts.langchaindocument_component import DocumentToLangChainConverter
from scripts.synthetic_test_components import SyntheticTestGenerator, TestDatasetSaver

# Load environment variables
load_dotenv("./.env")

# Helper function to create fresh generator and embedder instances
def create_llm_components():
    """Create fresh instances of generator and embedder."""
    # You can use OpenAI models:
    generator = OpenAIGenerator(
        model="gpt-4o-mini",
        # Read the key from the environment at run time instead of embedding its value
        api_key=Secret.from_env_var("OPENAI_API_KEY")
    )
    embedder = OpenAITextEmbedder(
        model="text-embedding-3-small",
        api_key=Secret.from_env_var("OPENAI_API_KEY")
    )
    
    # Or use Ollama models (uncomment to use):
    # from haystack_integrations.components.generators.ollama import OllamaGenerator
    # from haystack_integrations.components.embedders.ollama import OllamaTextEmbedder
    # 
    # generator = OllamaGenerator(
    #     model="mistral-nemo:12b",
    #     generation_kwargs={
    #         "num_predict": 100,
    #         "temperature": 0.9,
    #     }
    # )
    # embedder = OllamaTextEmbedder(model="nomic-embed-text")
    
    return generator, embedder

# Create web content processing components
fetcher = LinkContentFetcher()
converter = HTMLToDocument()
doc_cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_substrings=['<1-hop>\n\n', '<multi-hop>\n\n', '<single-hop>\n\n', '\n\n\n', '\f', '\r']  # Remove synthetic data generation artifacts and weird characters
)
doc_splitter = DocumentSplitter(split_by="sentence",
                                split_length=5,  # Reduced from 50 to create more chunks
                                split_overlap=1)
doc_converter = DocumentToLangChainConverter()

# Create knowledge graph component with its own generator and embedder instances
kg_gen, kg_embed = create_llm_components()
kg_generator = KnowledgeGraphGenerator(
    generator=kg_gen,
    embedder=kg_embed,
    apply_transforms=True
)

# Create test generator component with its own generator and embedder instances
test_gen, test_embed = create_llm_components()
test_generator = SyntheticTestGenerator(
    generator=test_gen,
    embedder=test_embed,
    test_size=10,
    query_distribution=[
        ("single_hop", 0.3),
        ("multi_hop_specific", 0.3),
        ("multi_hop_abstract", 0.4)
    ]
)
test_saver = TestDatasetSaver("data_for_eval/synthetic_tests_10_from_web.csv")

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)
pipeline.add_component("converter", converter)
pipeline.add_component("doc_cleaner", doc_cleaner)
pipeline.add_component("doc_splitter", doc_splitter)
pipeline.add_component("doc_converter", doc_converter)
pipeline.add_component("kg_generator", kg_generator)
pipeline.add_component("test_generator", test_generator)
pipeline.add_component("test_saver", test_saver)

# Connect components in sequence
pipeline.connect("fetcher.streams", "converter.sources")
pipeline.connect("converter.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("test_generator.testset", "test_saver.testset")

print("✅ Web Content Processing Pipeline created successfully!")
print("🌐 Ready to process web content and generate knowledge graphs + synthetic tests")

✅ Web Content Processing Pipeline created successfully!
🌐 Ready to process web content and generate knowledge graphs + synthetic tests


In [2]:
# Sample web URLs to process - using multiple Haystack docs pages for more content
# Note: Some websites (like Wikipedia) block automated requests, so we use documentation sites
web_urls = [
    "https://docs.haystack.deepset.ai/docs/intro",
    "https://docs.haystack.deepset.ai/docs/creating-pipelines", 
    "https://docs.haystack.deepset.ai/docs/components"
]

print(f"🌐 Processing web content from {len(web_urls)} URLs")
print("This may take a moment to fetch and process the content...")

try:
    result = pipeline.run({
        "fetcher": {"urls": web_urls}
    })

    print("\n📊 Pipeline Results:")
    print(f"  📄 Documents Processed: {result['doc_converter']['document_count']}")
    print(f"  🧠 Knowledge Graph Nodes: {result['kg_generator']['node_count']}")
    print(f"  🧪 Test Cases Generated: {result['test_generator']['testset_size']}")
    print(f"  🔧 Generation Method: {result['test_generator']['generation_method']}")
    
except Exception as e:
    print(f"❌ Error processing web content: {str(e)}")
    print("This might be due to network issues or website access restrictions.")

🌐 Processing web content from 3 URLs
This may take a moment to fetch and process the content...


Applying HeadlinesExtractor: 100%|██████████| 3/3 [00:02<00:00,  1.47it/s]
Applying HeadlineSplitter: 100%|██████████| 12/12 [00:00<00:00, 4153.80it/s]
Applying SummaryExtractor:  50%|█████     | 2/4 [00:03<00:03,  1.53s/it]
Property 'summary' already exists in node 'cefa86'. Skipping!
Applying SummaryExtractor:  75%|███████▌  | 3/4 [00:03<00:00,  1.06it/s]
Property 'summary' already exists in node 'cefa86'. Skipping!
Applying SummaryExtractor: 100%|██████████| 4/4 [00:04<00:00,  1.08s/it]
Applying CustomNodeFilter: 100%|██████████| 5/5 [00:01<00:00,  2.67it/s]


📊 Pipeline Results:
  📄 Documents Processed: 12
  🧠 Knowledge Graph Nodes: 12
  🧪 Test Cases Generated: 11
  🔧 Generation Method: knowledge_graph


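### 🔧 Troubleshooting Note

**If you encountered a "No clusters found in the knowledge graph" error above:**

This error occurred with the original configuration, which used:
- `split_length=50` sentences (too large for most web pages)
- This resulted in only 1 document chunk → 1 knowledge graph node
- Clusters cannot be created from a single node

The splitter in this notebook therefore uses `split_length=5` with `split_overlap=1`, which yields enough chunks for the knowledge graph to form clusters.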
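
To see why the split length matters, here is a simplified sliding-window model of sentence splitting. It is only a sketch for reasoning about chunk counts: `sliding_chunks` is a hypothetical helper, the 40-sentence page is made up, and Haystack's `DocumentSplitter` may differ in details such as sentence detection and handling of the final window.

```python
from typing import List


def sliding_chunks(sentences: List[str], split_length: int, split_overlap: int) -> List[List[str]]:
    """Group sentences into windows of split_length that share split_overlap sentences."""
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")
    if not sentences:
        return []

    step = split_length - split_overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(sentences[start:start + split_length])
        if start + split_length >= len(sentences):
            break  # this window already reaches the end of the text
        start += step
    return chunks


# A hypothetical page with 40 sentences
page = [f"Sentence {i}." for i in range(40)]

large = sliding_chunks(page, split_length=50, split_overlap=0)
small = sliding_chunks(page, split_length=5, split_overlap=1)
print(len(large))  # the whole page fits in a single chunk
print(len(small))  # many small chunks, each sharing one sentence with its neighbor
```

With one chunk per page there is little for the graph to cluster; shorter windows give the knowledge graph more nodes to relate to each other.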

### Understanding the Web Content Processing Pipeline Architecture

The web processing pipeline follows a similar structure to PDF processing but with adapted input components:

```
Web URL → Link Fetcher → HTML Converter → Document Cleaner → Document Splitter
    ↓
Document Converter → Knowledge Graph Generator
    ↓                         ↓
Test Generator ← ← ← ← ← ← ← ←
    ↓
Test Dataset Saver
```

**Why This Works:**
- The knowledge graph generation is **content-agnostic** - it works the same whether input comes from PDFs, web pages, or other sources
- Document preprocessing steps ensure consistent quality regardless of input format
- The same test generation logic produces comparable quality across all sources

**Pipeline Reusability:**
Notice how we can reuse the same components (`doc_cleaner`, `doc_splitter`, `kg_generator`, etc.) with different input sources. This demonstrates the modularity and flexibility of Haystack's component architecture.

**Web-Specific Considerations:**
- **Content Structure**: Web pages may have navigation, ads, and other non-content elements
- **HTML Artifacts**: May require more aggressive cleaning than PDF content
- **Dynamic Loading**: Static HTML fetching may miss JavaScript-rendered content
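
The content-structure concern can be illustrated with a small standard-library filter that drops text inside navigation and other non-content elements. This is a sketch under assumptions: `MainTextExtractor`, `extract_main_text`, the set of skipped tags, and the sample HTML are all hypothetical, and real pages usually need more careful handling than a fixed tag list.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect visible text while ignoring common non-content elements."""

    # Assumed list of tags that rarely carry article content
    SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    """Return the text outside skipped elements, one fragment per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = """<html><body>
<nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
<main><h1>Pipelines</h1><p>Components are connected into a graph.</p></main>
<footer>Cookie settings</footer>
</body></html>"""

main_text = extract_main_text(sample_html)
print(main_text)
```

A filter like this could run on the raw HTML before conversion, or its idea could be folded into the `remove_substrings` list of the cleaner when the leftover noise is a known fixed string.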

In [3]:
pipeline.draw(path="./images/web_knowledge_graph_pipeline.png")
print("📸 Pipeline diagram saved to: ./images/web_knowledge_graph_pipeline.png")

📸 Pipeline diagram saved to: ./images/web_knowledge_graph_pipeline.png


In [4]:
import pandas as pd

# Load and display the generated synthetic tests
test_file_path = "data_for_eval/synthetic_tests_10_from_web.csv"

if os.path.exists(test_file_path):
    synthetic_tests_df = pd.read_csv(test_file_path)
    print("\n🧪 Synthetic Tests Sample:")
    print("First 5 rows:")
    display(synthetic_tests_df.head())
    print("Last 5 rows:")
    display(synthetic_tests_df.tail())
else:
    print("❌ Synthetic test file not found")
    print("Please run the previous cells to generate the test data.")


🧪 Synthetic Tests Sample:
First 5 rows:


| Unnamed: 0 | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Wut is a Document Store in data pipelines? | `['You can check them on the documentation page...` | A Document Store is a component used in data p... | single_hop_specific_query_synthesizer |
| 1 | How is the llm component integrated into the d... | `['3. Create the pipeline\nquery_pipeline = Pip...` | The llm component is integrated into the data ... | single_hop_specific_query_synthesizer |
| 2 | How do I use HTMLToDocument in my pipeline? | `['Pipeline.run()\ncan be called in two ways, e...` | To use HTMLToDocument in your pipeline, you fi... | single_hop_specific_query_synthesizer |
| 3 | What are the steps to run a pipeline using Doc... | `['<1-hop>\n\nPipeline.run()\ncan be called in ...` | To run a pipeline using DocumentWriter, you fi... | multi_hop_specific_query_synthesizer |
| 4 | What steps should be followed to create a pipe... | `['<1-hop>\n\nYou can check them on the documen...` | To create a pipeline that utilizes documents, ... | multi_hop_specific_query_synthesizer |


Last 5 rows:


| Unnamed: 0 | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 6 | How do you connect components in a query pipel... | `['<1-hop>\n\n3. Create the pipeline\nquery_pip...` | To connect components in a query pipeline that... | multi_hop_specific_query_synthesizer |
| 7 | How do you create a pipeline and what are the ... | `['<1-hop>\n\nPipeline.run()\ncan be called in ...` | To create a pipeline, you first need to import... | multi_hop_abstract_query_synthesizer |
| 8 | How does the InMemoryDocumentStore relate to t... | `['<1-hop>\n\nPipeline.run()\ncan be called in ...` | The InMemoryDocumentStore is a component used ... | multi_hop_abstract_query_synthesizer |
| 9 | How do you create a pipeline and what componen... | `['<1-hop>\n\nPipeline.run()\ncan be called in ...` | To create a pipeline, you first need to import... | multi_hop_abstract_query_synthesizer |
| 10 | What steps are involved in validating the comp... | `['<1-hop>\n\nPipeline.run()\ncan be called in ...` | To validate the components of a pipeline that ... | multi_hop_abstract_query_synthesizer |
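
One quick sanity check on a generated test set is to compare the mix of synthesizers against the configured `query_distribution`. The sketch below uses only the ten rows displayed above (row 5 is not shown in the output, so it is left out); `observed_shares` is a hypothetical helper, and mapping the configured names onto the `synthesizer_name` values is an assumption.

```python
from collections import Counter
from typing import Dict, List

# synthesizer_name values of the ten displayed rows (0-4 and 6-10)
shown_synthesizers = (
    ["single_hop_specific_query_synthesizer"] * 3
    + ["multi_hop_specific_query_synthesizer"] * 3
    + ["multi_hop_abstract_query_synthesizer"] * 4
)


def observed_shares(names: List[str]) -> Dict[str, float]:
    """Return the fraction of test cases produced by each synthesizer."""
    counts = Counter(names)
    total = len(names)
    return {name: count / total for name, count in counts.items()}


shares = observed_shares(shown_synthesizers)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

With the full CSV loaded, the same check could be run on the whole `synthesizer_name` column instead of the hand-copied list.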


### Analyzing Web Content vs PDF Results

Let's examine how the synthetic test generation performs when using web content versus PDF content.

**Expected Differences:**
- **Content Structure**: Web content may have different formatting and structure
- **Question Complexity**: Generated questions tend to reflect the complexity of the source material
- **Context Quality**: Web content might include navigation elements or ads that need filtering

**Web Content Specific Benefits:**
1. **Real-time Content**: Access to the most current information available online
2. **Rich Media Context**: Web pages often have supplementary context that enhances understanding
3. **Diverse Sources**: Easy to process content from multiple websites
4. **Hyperlinked Knowledge**: Web content often contains references that enrich the knowledge graph

**Potential Challenges:**
1. **Content Quality Variability**: Web content quality can vary significantly
2. **Noise Filtering**: Need to filter out navigation, ads, and irrelevant content
3. **Rate Limiting**: Must respect website rate limits and robots.txt
4. **Dynamic Content**: Some content may require JavaScript rendering
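
Respecting robots.txt can be checked with Python's standard `urllib.robotparser` before any URL is handed to the fetcher. This is a minimal sketch: the robots.txt rules, the `my-notebook-bot` user agent, and the example URLs are made up for illustration, and the rules are parsed from a string so the example runs without network access.

```python
from typing import List
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from the site
EXAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(urls: List[str], parser: RobotFileParser, user_agent: str) -> List[str]:
    """Keep only the URLs that the rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = build_robots_parser(EXAMPLE_ROBOTS_TXT)
candidate_urls = [
    "https://example.com/docs/intro",
    "https://example.com/private/report",
]
allowed_urls = filter_allowed(candidate_urls, robots, user_agent="my-notebook-bot")
print(allowed_urls)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` loads the rules over the network instead of from a string.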

## Summary

### What We've Accomplished

In this notebook, we successfully:

1. **Built a Web Content Processing Pipeline**: Created an end-to-end pipeline specifically optimized for web content
2. **Demonstrated Source Flexibility**: Processed content from multiple pages of the Haystack documentation site
3. **Generated Knowledge Graphs from Web Content**: Converted unstructured web content into structured knowledge representations
4. **Produced Synthetic Test Data**: Created question-answer pairs from the fetched web pages
5. **Analyzed Web-Specific Characteristics**: Examined how web content affects synthetic test generation

### Key Advantages of Web Content Processing

- **Real-Time Content**: Access to the most current information available
- **Diverse Sources**: Easy to process content from multiple websites in sequence
- **Rich Context**: Web content often includes hyperlinks and references that enhance knowledge graphs
- **Scalable Collection**: Can systematically process large numbers of web resources

