Strip buffers after writing to disk to reduce memory usage (issue #302) #315
jornmineur wants to merge 1 commit into 11ty:main
Conversation
…duction builds (issue 11ty#302)
Your approach makes perfect sense to me, but, bizarrely, I couldn’t build my blog using your branch because it kept running out of memory. I realized this was the first time I’d upgraded from v5 of the plugin, so I tried it with the upstream v6 and had the same issue. Increasing the heap size allowed me to complete that build successfully, though much more slowly (~120s vs ~40s) and using much more memory than with v5. I tried the branch again. The runtime was about the same. I don’t have a scientific way to measure memory usage without creating a VM or container, but just eyeballing what the system tells me as it runs, I don’t see any tangible difference. I don’t think Node is releasing the memory from the buffers, at least in my case. But I may be doing something unusual. I’m sorry to say all I can offer is this anecdotal evidence, heh. I hope someone else can try it with a normal image-heavy site.
Great that you invested the time to test this, @Aankhen, much appreciated. I'll take another look!
It looks like the problem runs deeper than I thought, which explains why the patch isn't working (not sure yet why it worked for me – perhaps I mixed something up on my end). This is my updated understanding of how eleventy-image works:
So the hash is computed before checking the cache, and the file is read into memory as a side effect of computing the hash. The cache lookup is:
A hit means: "we've seen this exact image with these exact options before; here's the Image instance that's already processing or has processed it." The problem: even on a cache hit, the new Image instance has already read the file into memory. Hence the memory explosion. Comparing content hashes is a really safe approach, but it is also quite expensive. Possibly it was done in a different way in v5 – I haven't looked into that yet. It could be done much faster, and with >99% cache hit accuracy, by keeping a cache file that tracks image paths. I'll keep looking! :-)
That’s interesting. Thank you for the detailed analysis. I guess in this model the cache is really just there to prevent processing images more than once, not to prevent reading them. I wonder if there could be an option for an alternative hash function that uses only the filename and file stats without reading the contents.
@Aankhen The patched version is now running in production on my end without issues. Build times are very fast, and memory usage remains negligible even with a large image set. If you'd be interested in giving it a try, I would be curious to hear how the patched version works for you. The PR introduces two changes:

1. Persistent manifest cache

A new manifest file tracks previously processed images. The manifest is conservative: any change in path, mtime, or size forces a full reprocess.

2. Avoids buffering images in JS memory for production builds

For local images in production mode, the source file path is passed directly to Sharp instead of reading the file into a buffer first. Sharp supports this natively, so the data flows from disk → Sharp → disk without ever sitting in JavaScript memory.

When does this apply?

The optimization kicks in when producing local binary images (not SVG), in production mode (not dryRun, statsOnly, or transformOnRequest). In all other cases, existing behavior remains unchanged.

No changes required

This works as a drop-in replacement. No configuration or code changes are needed; the manifest is managed automatically.
Calculating the hash of the images is really inexpensive. You can make it even cheaper, if you have multiple GB of images, by using xxhash instead of SHA. I tried using mtime to skip hashing, but mtime is not so reliable. For example, consider building on Netlify: they restore the cache contents, but may not restore the mtime and other file attributes. If your CI/CD uses a build system that mounts files over the network, mtime is unreliable. One might check mtime to detect whether a file has possibly changed, but it shouldn't be relied on to prove that a file hasn't changed. Hashing is a nice way to store the image and the options in a stateless way, it's really fast, and the filenames include a hash so the browser can "cache forever" in deployment. Passing the filepaths directly to sharp is very nice, I wonder why this wasn't already done.
That's a really good point @zeroby0! I updated the code to use the content hash for cache validation instead of mtime+size, tested locally (not on Netlify), and pushed to the new PR. What do you think? |
The premise and the methodology look good to me, but the code has changed so much since I last looked at it that I can't provide good feedback on whether this causes any unforeseen bugs. But the worst that might happen is that some images might get processed again and again, so I'd say it's alright :D |
EDIT: the approach in this PR didn't turn out to work well and is superseded by PR 319.
This PR addresses issue #302 by removing image buffers from memory after files have been written to disk.
Problem
When processing many images, the memory cache retains image buffers even after they've been written to disk. This causes memory usage to grow linearly with the number of images processed, leading to builds failing on large sites.
Solution
After processing completes, strip the buffer property from results when:

- The file has been written to disk (`outputPath` exists)
- Not in `dryRun` mode (which needs buffers)

The metadata (dimensions, URLs, paths) remains available for generating HTML.
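A minimal sketch of the idea, under assumed shapes: `stripBuffers` is an illustrative name, and eleventy-image's actual result object may differ from the format-keyed layout assumed here:

```javascript
// Hypothetical sketch: after writing to disk, drop each entry's buffer
// but keep the metadata needed to generate HTML.
function stripBuffers(results, { dryRun = false } = {}) {
  for (const entries of Object.values(results)) {
    for (const entry of entries) {
      if (entry.outputPath && !dryRun) {
        delete entry.buffer; // file is on disk; free the memory
      }
    }
  }
  return results;
}

// Demo with a fake results object keyed by output format.
const results = {
  webp: [{ url: "/img/a-600.webp", outputPath: "img/a-600.webp", width: 600, buffer: Buffer.alloc(8) }],
  jpeg: [{ url: "/img/a-600.jpeg", width: 600, buffer: Buffer.alloc(8) }], // no outputPath: buffer kept
};
stripBuffers(results);
console.log("buffer" in results.webp[0], "buffer" in results.jpeg[0]); // false true
```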
Testing