Async-first web scraping framework for Rust. Built on `wreq` + `scraper`, with
XPath support via `sxd-xpath`. It keeps the API small
(`Spider`/`Request`/`Response`), adds middlewares and pipelines, and ships with
structured logging so you can focus on crawling.
- Async engine with configurable concurrency, queue limits, and request de-dupe.
- Minimal Spider/Request/Response model with follow helpers and callback overrides.
- HTML helpers for CSS and XPath (`select`, `select_first`, `xpath`, `xpath_first`) plus ergonomic variants.
- Built-in middlewares for user agents, proxy rotation, delays, retries, and skipping non-HTML.
- Pipelines for JSON Lines, CSV, XML, or custom callbacks.
- Structured logging with periodic crawl statistics (`SILKWORM_LOG_LEVEL`).
```bash
cargo add silkworm
```

If you want to use the async API directly (instead of `run_spider`), add Tokio:

```bash
cargo add tokio --features rt-multi-thread,macros
```

The examples below also use `serde_json` for convenience:

```bash
cargo add serde_json
```

Tip: `use silkworm::prelude::*;` imports the most common types.
```rust
use serde_json::json;
use silkworm::{run_spider, HtmlResponse, Spider, SpiderResult};

struct QuotesSpider;

impl Spider for QuotesSpider {
    fn name(&self) -> &str {
        "quotes"
    }

    fn start_urls(&self) -> Vec<&str> {
        vec!["https://quotes.toscrape.com/"]
    }

    async fn parse(&self, response: HtmlResponse<Self>) -> SpiderResult<Self> {
        let mut out = Vec::new();
        for quote in response.select_or_empty(".quote") {
            let text = quote.text_from(".text");
            let author = quote.text_from(".author");
            if !text.is_empty() && !author.is_empty() {
                out.push(json!({
                    "text": text,
                    "author": author,
                }).into());
            }
        }
        // Follow pagination links.
        out.extend(response.follow_css_outputs("li.next a", "href"));
        out
    }
}

fn main() -> silkworm::SilkwormResult<()> {
    run_spider(QuotesSpider)
}
```

If you already run a Tokio runtime, use `crawl`/`crawl_with`:
```rust
use silkworm::{crawl_with, RunConfig};

#[tokio::main]
async fn main() -> silkworm::SilkwormResult<()> {
    let config = RunConfig::new().with_concurrency(32);
    crawl_with(QuotesSpider, config).await
}
```

Write scraped items to files or plug in your own callback:
```rust
use silkworm::{run_spider_with, JsonLinesPipeline, RunConfig};

let config = RunConfig::new().with_item_pipeline(JsonLinesPipeline::new("data/items.jl"));
run_spider_with(QuotesSpider, config)?;
```

Available pipelines:

- `JsonLinesPipeline` (streaming JSON Lines)
- `CsvPipeline` (flattened CSV)
- `XmlPipeline` (nested XML)
- `CallbackPipeline` (custom per-item handler)
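To change the output format, swap the pipeline. A minimal sketch, assuming `CsvPipeline::new` takes an output path the way `JsonLinesPipeline::new` does above:

```rust
use silkworm::{run_spider_with, CsvPipeline, RunConfig};

// Assumption: CsvPipeline::new takes an output path, mirroring
// JsonLinesPipeline::new in the previous example.
let config = RunConfig::new().with_item_pipeline(CsvPipeline::new("data/items.csv"));
run_spider_with(QuotesSpider, config)?;
```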
Enable the built-in middlewares by adding them to the run config:
```rust
use std::time::Duration;
use silkworm::{
    DelayMiddleware, ProxyMiddleware, RetryMiddleware, RunConfig, SkipNonHtmlMiddleware,
    UserAgentMiddleware,
};

let config = RunConfig::new()
    .with_request_middleware(UserAgentMiddleware::new(
        vec![],
        Some("silkworm-rs/0.1".to_string()),
    ))
    .with_request_middleware(DelayMiddleware::fixed(0.25))
    .with_request_middleware(ProxyMiddleware::new(
        vec!["http://proxy.local:8080".to_string()],
        true,
    ))
    .with_response_middleware(RetryMiddleware::new(3, None, None, 0.5))
    .with_response_middleware(SkipNonHtmlMiddleware::new(None, 1024))
    .with_request_timeout(Duration::from_secs(15));
```

`Response::follow_url` carries the current callback by default:
```rust
let next = response.follow_url("/page/2");
```

You can also build new requests fluently:
```rust
use silkworm::Request;

let request = Request::get("https://example.com/search")
    .with_params([("q", "rust"), ("page", "1")])
    .with_headers([("Accept", "text/html"), ("User-Agent", "silkworm-rs/0.1")]);
```

If `SkipNonHtmlMiddleware` is enabled, mark requests you want to handle as JSON/XML:
let request = Request::get("https://example.com/api")
.with_headers([("Accept", "application/json")])
.with_allow_non_html(true);For per-request parsing, attach a callback with with_callback_fn or
callback_from_fn (signature: Fn(Arc<S>, Response<S>) -> SpiderResult<S>).
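A minimal sketch of attaching one inline, assuming `SpiderResult<S>` is the list of outputs that `parse` returns in the quick start (the empty closure body and the reuse of `QuotesSpider` are illustrative):

```rust
use std::sync::Arc;
use silkworm::Request;

// Sketch only: per the signature quoted above, the callback receives the
// spider and the response. This handler emits no items; a real one would
// parse the response body and build outputs.
let request = Request::get("https://example.com/api")
    .with_allow_non_html(true)
    .with_callback_fn(|_spider: Arc<QuotesSpider>, _response| Vec::new());
```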
Silkworm provides both error-returning and convenience selector methods for maximum flexibility:

```rust
// Error-returning methods (when you need precise error handling)
let elements = response.select(".item")?;
let element = response.select_first(".item")?;

// Ergonomic methods (for cleaner code when errors can be ignored)
let elements = response.select_or_empty(".item"); // Returns Vec, never errors
let element = response.select_first_or_none(".item"); // Returns Option

// Direct text/attribute extraction
let title = response.text_from("h1"); // Empty string if not found
let href = response.attr_from("a", "href"); // None if not found
let tags = response.select_texts(".tag"); // Vec<String>
let links = response.select_attrs("a.next", "href"); // Vec<String>

// Follow links directly from selectors
let next_requests = response.follow_css("a.next", "href");

// Also works on HtmlElement for nested selections
for item in response.select_or_empty(".item") {
    let name = item.text_from(".name");
    let price = item.text_from(".price");
}
```

All these methods work with CSS selectors and XPath (use `xpath_or_empty()`,
`xpath_first_or_none()`, etc.).
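For instance, a sketch of the XPath variants against the quotes markup used earlier (the expressions are illustrative, not taken from the crate's docs):

```rust
// XPath counterparts of select_or_empty / select_first_or_none;
// the expressions here are illustrative.
let quotes = response.xpath_or_empty("//div[@class='quote']");
let first_text = response.xpath_first_or_none("//span[@class='text']");
```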
`RunConfig` controls concurrency, queue sizing, and HTTP behavior:

```rust
use std::time::Duration;
use silkworm::RunConfig;

let config = RunConfig::new()
    .with_concurrency(32)
    .with_max_pending_requests(500)
    .with_log_stats_interval(Duration::from_secs(10))
    .with_request_timeout(Duration::from_secs(10))
    .with_html_max_size_bytes(2_000_000)
    .with_keep_alive(true);
```

Structured logs include crawl statistics and can be adjusted via environment:

```bash
SILKWORM_LOG_LEVEL=DEBUG cargo run
```

Fetch HTML directly and parse with `scraper`:
```rust
#[tokio::main]
async fn main() -> silkworm::SilkwormResult<()> {
    let (text, document) = silkworm::fetch_html("https://example.com").await?;
    Ok(())
}
```

If you only need a parsed document:

```rust
#[tokio::main]
async fn main() -> silkworm::SilkwormResult<()> {
    let document = silkworm::fetch_document("https://example.com").await?;
    Ok(())
}
```

Check out the runnable examples in `examples/`, including:
- `examples/quotes_spider.rs`
- `examples/quotes_spider_xpath.rs`
- `examples/hackernews_spider.rs`
- `examples/sitemap_spider.rs`
MIT