Fix utf8-c8 memory usage #1271
Conversation
Commits look sensible to me. What I don't understand is: we allocate a buffer large enough to hold one grapheme for every input byte. To my understanding, we would only actually use the whole buffer if the input stream consisted solely of ASCII characters, as those give us one grapheme per input byte. The absolute worst case would be an input buffer full of graphemes from the supplementary planes, as those can take up to 4 bytes per grapheme; in that case we'd allocate a buffer 4 times as large as we actually needed. This makes me wonder what your input file looks like. If it does not look like this worst case, there may be some other reason why your patch helps so much, a reason that could lead us to an even better solution (like not allocating too much in the first place).
I've tested it on 100% ASCII files and on files of random bytes with newlines thrown in. The file size is the only thing that has an effect, since it controls the size of the read buffer and therefore the size of the output strings. There really isn't a worst case like you're asking about, since the resulting strings are 4 bytes per grapheme no matter what. To actually use the whole buffer, you just need a line that's longer than the input buffer. In all my fuzzing, I don't think I ever tested that, though...

Rakudo might share some blame. While the default read length is 65535 bytes if you call read() on an IO::Handle, it reads 0x100000 bytes for things like .lines and probably .slurp, too. That's why the file size has an effect on the resulting strings for files up to 1MB. If Rakudo were reading 65K instead of 1MB, I might never have noticed the issue. If you want a smaller initial allocation per line, a change in Rakudo to read smaller chunks for calls to .lines should do the trick. I'm not very familiar with nqp and Rakudo, but it looks like this might be the place: src/core.c/IO/Handle.pm6
If you've got some further time to dig into this, I'd very much appreciate that. Otherwise I'm fine with merging the PR as-is, as it's clearly an improvement already.
I'll look into it. |
There's no reason utf8-c8 can't use
With so many <pick-a-format>-decoding vulnerabilities in the historical record, I'm afraid to make this function any more complex. It's completely untouched by the nqp tests (not sure about Rakudo's), and generating sufficient test cases is nontrivial. However, thanks to China's generous donation of free time, I just might get around to testing it properly. I'd be afraid to make more than minor changes until then, though. I plan on giving all the decoders a thorough kicking once I get some more tooling in place. I think it would make sense for me to open a new issue regarding refactoring once I'm done fuzzing, but feel free to leave this open if you'd like.
Ok, many thanks for the excellent analysis! Will merge the PR as-is and am looking forward to your future contributions. Welcome to the team :) |
MVM_string_utf8_c8_decodestream allocates 4x the size of its input buffer for every string it spits out and never releases the excess. This means the size of every MVMString produced is directly tied to the size of the input file (for files up to 1MB). With a large input file, each MVMString ends up consuming 4MB of memory (on my machine).
Consider something like this:
for 'bigFileWithManyLines.txt'.IO.lines(:enc('utf8-c8')) {...}
The GC can keep up if a single thread is consuming the decoder's output, though the memory usage is a bit high. However, multiple workers overwhelm it, and the OOM killer comes out.
This change reallocs the result buffer down if it's larger than required. Since it only ever requests a smaller chunk, the call to realloc is cheap.
This script demonstrates the issue.
Here are before and after runs using a readily available source file.
Before patch:
After patch: