web2llm

Fetch any web page. Get clean Markdown. Ready for LLMs.

`web2llm` is a high-performance, modular Rust crate that fetches web pages, strips away computational noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.

Quick Start

Add this to your Cargo.toml:

[dependencies]
web2llm = "0.3.0"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }

Fetch and print Markdown in one call:

use web2llm::fetch;

#[tokio::main]
async fn main() {
    // 1. Simple fetch (Uses Auto-mode + default settings)
    let result = fetch("https://example.com".to_string()).await.unwrap();
    
    // 2. Print the cleaned Markdown
    println!("{}", result.markdown());
}

Features

Content-aware extraction — isolates the main article body with extreme precision.
Clean Markdown output — preserves headings, tables, code blocks, and inline links.
Adaptive fetch — automatic fallback to headless browser for JS-heavy SPAs.
High Performance — zero-copy traversal and bump-allocation (~3.9ms for Wikipedia).
Semantic Chunking — divide content into logical, token-budgeted islands for AI apps.

Configuration & Fetch Strategies

You can control how web2llm handles pages via the FetchMode configuration:

FetchMode::Static: Fast, standard HTTP request. No JavaScript execution.
FetchMode::Dynamic: Uses a headless browser to render the page. Required for SPAs.
FetchMode::Auto: (Default) Smart mode. Tries a fast static fetch first, detects if the page is an SPA shell, and automatically restarts using the browser only if needed.

use web2llm::{Web2llm, Web2llmConfig, FetchMode};

let config = Web2llmConfig {
    fetch_mode: FetchMode::Auto,
    ..Default::default()
};

Lightweight Build (Optional)

web2llm includes Chromium support by default for a "plug-and-play" experience. Power users who only need static scraping can disable defaults to remove the Chromium dependency (~50 sub-dependencies):

[dependencies]
web2llm = { version = "0.3.0", default-features = false }

Performance

web2llm is built for extreme speed and high-throughput ingestion.

Note: Metrics represent pure extraction and processing throughput, excluding network latency.

Task	Average Time	Throughput
Simple Page Extraction	< 1.0 ms	~1,000+ pages/sec
Wikipedia (Large) Extraction	~4.0 ms	~250 pages/sec
Batch Fetch (100x Wikipedia)	~100 ms	~1,000 pages/sec

Speed may vary on different systems

Advanced: Semantic Chunking

For "true AI" applications and RAG pipelines, web2llm can divide documents into logical, structurally-aware chunks that fit your token budget without splitting paragraphs mid-sentence.

let config = Web2llmConfig {
    max_tokens: 500, // Target 500 tokens per chunk
    ..Default::default()
};

let client = Web2llm::new(config).unwrap();
let result = client.fetch(url).await.unwrap();

// Access granular chunks for precise vector embedding
for chunk in result.chunks {
    println!("Chunk #{} ({} tokens): {:.2} quality score", chunk.index, chunk.tokens, chunk.score);
}

Architecture

The pipeline executes in 5 stages:

URL
 │
 ▼
[1] Pre-flight       — URL validation, robots.txt check, rate limiting
 │
 ▼
[2] Fetch            — Static fetch (reqwest) or Dynamic fallback (chromiumoxide)
 │
 ▼
[3] Score            — Bottom-up recursive scoring builds a "Scored Tree" (Bump-allocated)
 │
 ▼
[4] Chunk & Wash     — Top-down "Flatten or Recurse" chunking + Markdown optimization
 │
 ▼
[5] Output           — PageResult struct containing Vec<PageChunk>

Name		Name	Last commit message	Last commit date
Latest commit History 116 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
examples		examples
src		src
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

web2llm

Fetch any web page. Get clean Markdown. Ready for LLMs.

`web2llm` is a high-performance, modular Rust crate that fetches web pages, strips away computational noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.

Quick Start

Features

Configuration & Fetch Strategies

Lightweight Build (Optional)

Performance

Note: Metrics represent pure extraction and processing throughput, excluding network latency.

Advanced: Semantic Chunking

Architecture

Roadmap

About

Uh oh!

Releases 10

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

web2llm

Fetch any web page. Get clean Markdown. Ready for LLMs.

web2llm is a high-performance, modular Rust crate that fetches web pages, strips away computational noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.

Quick Start

Features

Configuration & Fetch Strategies

Lightweight Build (Optional)

Performance

Note: Metrics represent pure extraction and processing throughput, excluding network latency.

Advanced: Semantic Chunking

Architecture

Roadmap

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 10

Contributors

Uh oh!

Languages

`web2llm` is a high-performance, modular Rust crate that fetches web pages, strips away computational noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.