⚡ Optimize sitemap parsing with concurrent fetching (#77)
When encountering a sitemap index containing multiple child sitemaps, the `Sitemap` class previously processed each child sequentially. This severely limited performance when processing large sitemap indexes. This commit updates `Sitemap.processSitemap` to use `Promise.all` alongside `pLimit(10)` to process up to 10 child sitemaps concurrently. Locally, a test simulating 40 child sitemaps with an artificial 10ms latency showed a reduction in fetch time from ~538ms to ~78ms.
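The change described above can be sketched as follows. This is a minimal illustration, not the actual `Sitemap.processSitemap` implementation: the function names are hypothetical, and the inline `pLimit` is a tiny stand-in for the real `p-limit` package so the sketch is self-contained.

```typescript
// Minimal stand-in for the `p-limit` package (the PR uses the real one).
// Returns a wrapper that allows at most `concurrency` promises in flight.
function pLimit(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = () => {
    active--;
    const run = queue.shift();
    if (run) run();
  };
  return <T>(fn: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        fn().then(resolve, reject).finally(next);
      };
      if (active < concurrency) {
        run();
      } else {
        queue.push(run);
      }
    });
}

// Hypothetical shape of the updated logic: instead of awaiting each
// child sitemap in turn, start all of them through a pLimit(10) gate
// and wait for the whole batch with Promise.all.
async function processChildSitemaps(
  urls: string[],
  fetchChild: (url: string) => Promise<void>,
): Promise<void> {
  // Before (sequential): for (const url of urls) await fetchChild(url);
  const limit = pLimit(10);
  await Promise.all(urls.map((url) => limit(() => fetchChild(url))));
}
```

With 40 children and a 10ms fetch each, this runs in roughly four waves of ten rather than forty sequential waits, which matches the order-of-magnitude speedup reported above.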
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode; when this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
💡 What: The `Sitemap` parser now fetches child sitemaps from a sitemap index concurrently instead of sequentially. It uses `pLimit` to cap concurrency at 10 active requests at a time, preventing spikes in network connections or memory usage while still providing a significant speedup.

🎯 Why: Previously, child sitemaps were processed with a `for...of` loop that `await`ed each iteration, so the crawler had to wait for one sitemap to fully download and process before starting the next. This was an inefficient use of network and CPU resources, especially for large sites with many sitemaps.

📊 Measured Improvement: I created a new performance test (`sitemap_perf.test.ts`) that mocks a sitemap index with 40 child sitemaps, each with an artificial 10ms network delay. Fetch time dropped from ~538ms to ~78ms.

PR created automatically by Jules for task 11465383376499099777 started by @saurabhsharma2u
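The measurement setup can be sketched like this. The names are hypothetical and the concurrent side uses a simple fixed batch of 10 via `Promise.all` as an approximation of `pLimit(10)`; the real `sitemap_perf.test.ts` may be structured differently.

```typescript
// Hypothetical sketch of the timing comparison: 40 mocked child
// sitemaps, each taking ~10ms, fetched sequentially vs. in batches of 10.
const DELAY_MS = 10;
const CHILD_COUNT = 40;
const BATCH_SIZE = 10;

// Mock of a child-sitemap fetch with artificial network latency.
const mockFetchChild = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, DELAY_MS));

async function timeIt(run: () => Promise<void>): Promise<number> {
  const start = Date.now();
  await run();
  return Date.now() - start;
}

async function comparePerf(): Promise<{ sequential: number; concurrent: number }> {
  // Sequential baseline: ~40 × 10ms, i.e. 400ms or more.
  const sequential = await timeIt(async () => {
    for (let i = 0; i < CHILD_COUNT; i++) await mockFetchChild();
  });

  // Batched-concurrent: ~ceil(40 / 10) = 4 waves of ~10ms each.
  const concurrent = await timeIt(async () => {
    const ids = Array.from({ length: CHILD_COUNT }, (_, i) => i);
    for (let i = 0; i < ids.length; i += BATCH_SIZE) {
      await Promise.all(ids.slice(i, i + BATCH_SIZE).map(() => mockFetchChild()));
    }
  });

  return { sequential, concurrent };
}
```

Under these assumptions the concurrent path should finish roughly an order of magnitude faster, which is consistent with the ~538ms → ~78ms numbers quoted above.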