⚡ Optimize sitemap parsing with concurrent fetching #77

Merged
saurabhsharma2u merged 4 commits into main from perf/sitemap-concurrency-11465383376499099777 on Mar 2, 2026

Conversation

@saurabhsharma2u
Contributor

💡 What: The Sitemap parser now fetches the child sitemaps of a sitemap index concurrently instead of sequentially. It uses pLimit to cap concurrency at 10 in-flight requests, which avoids spikes in open network connections and memory usage while still providing a significant speedup.

🎯 Why: Previously, child sitemaps were processed in a for...of loop that awaited each iteration, so the crawler had to wait for one sitemap to download and process completely before starting the next. This left network and CPU resources idle, especially on large sites with many sitemaps.
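For illustration, the difference between the two approaches can be sketched as below. This is a minimal, self-contained sketch and not the code from this PR: `FetchChild`, `createLimit`, `fetchSequential`, and `fetchConcurrent` are hypothetical names, and `createLimit` is a small hand-rolled stand-in for pLimit so the example runs without dependencies.

```typescript
// Hypothetical stand-in for the crawler's child-sitemap fetch (not the project's API).
type FetchChild = (url: string) => Promise<string[]>;

// Minimal concurrency limiter, similar in spirit to pLimit (assumption: simplified).
// At most `max` tasks run at once; the rest wait in a FIFO queue.
function createLimit(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  // Called when a task settles: free the slot and start the next queued task.
  const next = (): void => {
    active--;
    const run = queue.shift();
    if (run) run();
  };

  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = (): void => {
        active++;
        // Promise.resolve().then(task) also captures synchronous throws from task.
        Promise.resolve().then(task).then(resolve, reject).then(next);
      };
      if (active < max) run();
      else queue.push(run);
    });
}

// Before: each child is awaited before the next one starts.
async function fetchSequential(urls: string[], fetchChild: FetchChild): Promise<string[]> {
  const pages: string[] = [];
  for (const url of urls) {
    pages.push(...(await fetchChild(url)));
  }
  return pages;
}

// After: every child is scheduled up front, but only `concurrency` run at a time.
// Promise.all keeps results in input order regardless of completion order.
async function fetchConcurrent(
  urls: string[],
  fetchChild: FetchChild,
  concurrency = 10,
): Promise<string[]> {
  const limit = createLimit(concurrency);
  const results = await Promise.all(urls.map((url) => limit(() => fetchChild(url))));
  return ([] as string[]).concat(...results);
}
```

Note that `Promise.all` rejects as soon as any child fails. Whether one failing child sitemap should abort the whole index is a design choice this sketch does not settle.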

📊 Measured Improvement:
I created a new performance test (sitemap_perf.test.ts) that mocks a sitemap index with 40 child sitemaps, each having an artificial 10ms network delay.

  • Baseline (Sequential): ~538ms
  • Optimized (Concurrent): ~78ms
  • Improvement: ~85% reduction in time taken to fetch all child sitemaps.
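As a sanity check on these numbers, the ideal wall time can be estimated from the test parameters alone. The helpers below are hypothetical and ignore all overhead (timer granularity, parsing, scheduling), so they give lower bounds rather than predictions.

```typescript
// Ideal wall time when `children` requests of `delayMs` each run in waves of `concurrency`.
function idealWallTimeMs(children: number, delayMs: number, concurrency: number): number {
  return Math.ceil(children / concurrency) * delayMs;
}

// Percentage reduction in elapsed time going from `beforeMs` to `afterMs`.
function reductionPercent(beforeMs: number, afterMs: number): number {
  return ((beforeMs - afterMs) / beforeMs) * 100;
}

// 40 children at 10 ms each:
const sequentialFloor = idealWallTimeMs(40, 10, 1);  // 400 ms
const concurrentFloor = idealWallTimeMs(40, 10, 10); // 40 ms
const measured = reductionPercent(538, 78);          // about 85.5
```

The measured ~538ms and ~78ms both sit above these floors of 400 ms and 40 ms, as expected once overhead is included, and the measured reduction of about 85.5% matches the ~85% reported above.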

PR created automatically by Jules for task 11465383376499099777 started by @saurabhsharma2u

When encountering a sitemap index containing multiple child sitemaps, the `Sitemap` class previously processed each child sequentially. This severely limited performance when processing large sitemap indexes.

This commit updates `Sitemap.processSitemap` to use `Promise.all` alongside `pLimit(10)` to process up to 10 child sitemaps concurrently.

Locally, a test simulating 40 child sitemaps with an artificial 10ms latency showed a reduction in fetch time from ~538ms to ~78ms.
@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Comment thread packages/core/tests/sitemap_perf.test.ts Fixed
Comment thread packages/core/tests/sitemap_perf.test.ts Fixed
@saurabhsharma2u saurabhsharma2u marked this pull request as ready for review March 2, 2026 10:25
@saurabhsharma2u saurabhsharma2u merged commit f1017e9 into main Mar 2, 2026
3 checks passed
@saurabhsharma2u saurabhsharma2u deleted the perf/sitemap-concurrency-11465383376499099777 branch March 2, 2026 10:27