⚡ Optimize sitemap parsing with concurrent fetching (#77)
When encountering a sitemap index containing multiple child sitemaps, the `Sitemap` class previously processed each child sequentially. This severely limited performance when processing large sitemap indexes. This commit updates `Sitemap.processSitemap` to use `Promise.all` alongside `pLimit(10)` to process up to 10 child sitemaps concurrently. Locally, a test simulating 40 child sitemaps with an artificial 10ms latency showed a reduction in fetch time from ~538ms to ~78ms.
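The change described above can be sketched as follows. This is a minimal illustration, not the actual `Sitemap.processSitemap` implementation: the function names are hypothetical, and the inline `pLimit` is a tiny stand-in for the real `p-limit` package so the sketch is self-contained.

```typescript
// Minimal stand-in for the `p-limit` package (the PR uses the real one).
// Returns a wrapper that allows at most `concurrency` promises in flight.
function pLimit(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = () => {
    active--;
    const run = queue.shift();
    if (run) run();
  };
  return <T>(fn: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        fn().then(resolve, reject).finally(next);
      };
      if (active < concurrency) {
        run();
      } else {
        queue.push(run);
      }
    });
}

// Hypothetical shape of the updated logic: instead of awaiting each
// child sitemap in turn, start all of them through a pLimit(10) gate
// and wait for the whole batch with Promise.all.
async function processChildSitemaps(
  urls: string[],
  fetchChild: (url: string) => Promise<void>,
): Promise<void> {
  // Before (sequential): for (const url of urls) await fetchChild(url);
  const limit = pLimit(10);
  await Promise.all(urls.map((url) => limit(() => fetchChild(url))));
}
```

With 40 children and a 10ms fetch each, this runs in roughly four waves of ten rather than forty sequential waits, which matches the order-of-magnitude speedup reported above.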
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode; when this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
💡 What: The `Sitemap` parser now fetches child sitemaps from a sitemap index concurrently instead of sequentially. It uses `pLimit` to cap concurrency at 10 active requests at a time, preventing spikes in network connections or memory usage while still providing a significant speedup.

🎯 Why: Previously, child sitemaps were processed with a `for...of` loop that `await`ed each iteration, so the crawler had to wait for one sitemap to fully download and process before starting the next. This was an inefficient use of network and CPU resources, especially for large sites with many sitemaps.

📊 Measured Improvement: I created a new performance test (`sitemap_perf.test.ts`) that mocks a sitemap index with 40 child sitemaps, each with an artificial 10ms network delay. Fetch time dropped from ~538ms to ~78ms.

PR created automatically by Jules for task 11465383376499099777 started by @saurabhsharma2u
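The measurement setup can be sketched like this. The names are hypothetical and the concurrent side uses a simple fixed batch of 10 via `Promise.all` as an approximation of `pLimit(10)`; the real `sitemap_perf.test.ts` may be structured differently.

```typescript
// Hypothetical sketch of the timing comparison: 40 mocked child
// sitemaps, each taking ~10ms, fetched sequentially vs. in batches of 10.
const DELAY_MS = 10;
const CHILD_COUNT = 40;
const BATCH_SIZE = 10;

// Mock of a child-sitemap fetch with artificial network latency.
const mockFetchChild = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, DELAY_MS));

async function timeIt(run: () => Promise<void>): Promise<number> {
  const start = Date.now();
  await run();
  return Date.now() - start;
}

async function comparePerf(): Promise<{ sequential: number; concurrent: number }> {
  // Sequential baseline: ~40 × 10ms, i.e. 400ms or more.
  const sequential = await timeIt(async () => {
    for (let i = 0; i < CHILD_COUNT; i++) await mockFetchChild();
  });

  // Batched-concurrent: ~ceil(40 / 10) = 4 waves of ~10ms each.
  const concurrent = await timeIt(async () => {
    const ids = Array.from({ length: CHILD_COUNT }, (_, i) => i);
    for (let i = 0; i < ids.length; i += BATCH_SIZE) {
      await Promise.all(ids.slice(i, i + BATCH_SIZE).map(() => mockFetchChild()));
    }
  });

  return { sequential, concurrent };
}
```

Under these assumptions the concurrent path should finish roughly an order of magnitude faster, which is consistent with the ~538ms → ~78ms numbers quoted above.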