[repo-monitor] Medium: concurrent_requests_per_domain setting silently ignored — no per-domain rate limiting enforced #10

@Liohtml

Summary

The Spider trait exposes a concurrent_requests_per_domain() method that users can override to cap simultaneous requests to any single domain. The value is stored in CrawlStats, but CrawlerEngine never enforces it: all requests contend on a single global Semaphore regardless of their target domain. Spiders that override this method appear to limit per-domain concurrency, but the override has no effect on scheduling.

Location

  • File: src/spiders/engine.rs, line 95 (value stored in stats); the entire process_request / crawl loop (no per-domain semaphore)
  • File: src/spiders/spider.rs, line 17 (trait method declaration)
  • File: src/spiders/result.rs, line 68 (stored in CrawlStats but unused for control flow)

Severity

Medium

Details

CrawlerEngine creates one global_limiter: Arc<Semaphore> with capacity spider.concurrent_requests().max(1). The per-domain value from spider.concurrent_requests_per_domain() is copied into CrawlStats for reporting purposes only — no HashMap<String, Arc<Semaphore>> keyed on domain is ever created or consulted.

Consequence: a spider that sets concurrent_requests_per_domain to 1 (one-at-a-time per host) can still hammer the same origin with as many parallel requests as concurrent_requests allows. This may trigger bans, violate politeness policies, or cause unintended load on the target.

// engine.rs line 95 — value recorded but never used to throttle
stats.concurrent_requests_per_domain = self.spider.concurrent_requests_per_domain();

No code path checks concurrent_requests_per_domain before acquiring a semaphore permit.
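
One way to make the missing enforcement observable, for example in a regression test, is to record the peak number of in-flight requests per domain and compare it with the configured cap. The sketch below is std-only and not taken from the repository; PeakTracker and its methods are hypothetical names, and the engine would need to call start/finish around each fetch.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical helper: records current and peak in-flight requests per domain.
#[derive(Default)]
pub struct PeakTracker {
    // domain -> (current in-flight, highest in-flight seen so far)
    counts: Mutex<HashMap<String, (usize, usize)>>,
}

impl PeakTracker {
    /// Call when a request to `domain` starts.
    pub fn start(&self, domain: &str) {
        let mut counts = self.counts.lock().unwrap();
        let entry = counts.entry(domain.to_string()).or_insert((0, 0));
        entry.0 += 1;
        entry.1 = entry.1.max(entry.0);
    }

    /// Call when a request to `domain` completes (successfully or not).
    pub fn finish(&self, domain: &str) {
        let mut counts = self.counts.lock().unwrap();
        if let Some(entry) = counts.get_mut(domain) {
            entry.0 = entry.0.saturating_sub(1);
        }
    }

    /// Highest number of simultaneous requests observed for `domain`.
    pub fn peak(&self, domain: &str) -> usize {
        self.counts.lock().unwrap().get(domain).map(|e| e.1).unwrap_or(0)
    }
}
```

With the current behavior, a crawl of a single host with concurrent_requests_per_domain set to 1 and a larger concurrent_requests would be expected to report a peak above 1; after a fix, the peak should not exceed the per-domain cap.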

Suggested Fix

Introduce a HashMap<String, Arc<Semaphore>> keyed on the request's domain (lazily created on first encounter). Before dispatching each request, acquire a permit from both the global semaphore and the per-domain semaphore when concurrent_requests_per_domain > 0. Example sketch:

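// Shared by all in-flight requests; entries are created lazily per domain.
let domain_limiters: Arc<Mutex<HashMap<String, Arc<Semaphore>>>> = ...;

// in process_request:
// Bind the permit outside the `if` so it is held until the fetch completes,
// not released at the end of the block.
let _domain_permit = if per_domain > 0 {
    let domain = request.domain().unwrap_or_default();
    // The map lock is a temporary and is released at the end of this statement,
    // before we wait on the semaphore.
    let sem = domain_limiters.lock().await
        .entry(domain)
        .or_insert_with(|| Arc::new(Semaphore::new(per_domain as usize)))
        .clone();
    Some(sem.acquire_owned().await?)
} else {
    None
};
// proceed with fetch while holding both the global and the per-domain permit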
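
The per-domain accounting behind this fix can be illustrated without an async runtime. The following is a minimal std-only sketch, not the engine's code: DomainLimiter, DomainPermit, and try_acquire are hypothetical names, and a counter map stands in for tokio's Semaphore. It assumes, following the suggestion above, that a limit of 0 means no per-domain cap. An async engine would wait for a permit rather than receive None.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a map of per-domain semaphores.
#[derive(Clone)]
pub struct DomainLimiter {
    limit: usize, // 0 is treated as "no per-domain cap" (assumption)
    in_flight: Arc<Mutex<HashMap<String, usize>>>,
}

/// Holds one slot for a domain and frees it when dropped.
pub struct DomainPermit {
    domain: String,
    in_flight: Arc<Mutex<HashMap<String, usize>>>,
}

impl DomainLimiter {
    pub fn new(limit: usize) -> Self {
        Self { limit, in_flight: Arc::new(Mutex::new(HashMap::new())) }
    }

    /// Returns a permit if `domain` has a free slot, otherwise `None`.
    pub fn try_acquire(&self, domain: &str) -> Option<DomainPermit> {
        let mut map = self.in_flight.lock().unwrap();
        // Lazily create the counter on first encounter of the domain.
        let count = map.entry(domain.to_string()).or_insert(0);
        if self.limit > 0 && *count >= self.limit {
            return None;
        }
        *count += 1;
        Some(DomainPermit {
            domain: domain.to_string(),
            in_flight: Arc::clone(&self.in_flight),
        })
    }

    /// Number of permits currently held for `domain`.
    pub fn in_flight(&self, domain: &str) -> usize {
        self.in_flight.lock().unwrap().get(domain).copied().unwrap_or(0)
    }
}

impl Drop for DomainPermit {
    fn drop(&mut self) {
        let mut map = self.in_flight.lock().unwrap();
        if let Some(count) = map.get_mut(&self.domain) {
            *count = count.saturating_sub(1);
        }
    }
}
```

With a limit of 1, a second try_acquire for the same domain fails while the first permit is alive, and requests to other domains are unaffected. In the real engine, acquiring the per-domain permit before the global one may be preferable, so that requests queued behind a busy host do not occupy global slots while they wait; that ordering is a design suggestion, not something the issue specifies.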

Automated finding by repo-monitor
