
Fix utf8-c8 memory usage #1271

Merged: 3 commits, Apr 26, 2020

Conversation

@ddlws (Contributor) commented Apr 16, 2020

MVM_string_utf8_c8_decodestream allocates 4x the size of its input buffer for every string it spits out and never releases the excess. This means the size of every MVMString produced is directly tied to the size of the input file (for files up to 1MB). With a large input file, this results in each MVMString consuming 4MB of memory (on my machine).
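To make the 4x arithmetic concrete, here is a sketch (the helper name is illustrative, not MoarVM code; the assumption is that the result buffer holds one 32-bit grapheme slot per input byte):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: the decoder sizes its result for one grapheme per
 * input byte, and each grapheme slot is 32 bits (4 bytes, as with
 * MVMGrapheme32), so a 1MB read ends up backing a ~4MB grapheme buffer
 * regardless of what the input actually contains. */
static size_t result_buffer_bytes(size_t input_bytes) {
    const size_t bytes_per_grapheme = 4; /* sizeof a 32-bit grapheme */
    return input_bytes * bytes_per_grapheme;
}
```

With Rakudo's 0x100000-byte reads (see below), that works out to the 4MB per string observed above.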

Consider something like this:
for 'bigFileWithManyLines.txt'.IO.lines(:enc('utf8-c8')) {...}

The GC can keep up if a single thread is consuming the decoder's output, though the memory usage is a bit high. However, multiple workers overwhelm it, and the OOM killer comes out.

This change reallocs the result buffer if it's larger than required. Since it can only request a smaller chunk, the call to realloc is cheap.
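The shape of the fix can be sketched like this (hypothetical names, not the actual MoarVM code; a shrinking realloc lets the allocator trim the block in place rather than copy it):

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the fix: after decoding, shrink the over-allocated grapheme
 * buffer down to the number of slots actually used. Grapheme is a
 * stand-in for MoarVM's 32-bit grapheme type. */
typedef int Grapheme;

static Grapheme *shrink_to_fit(Grapheme *buf, size_t allocated, size_t used) {
    if (used > 0 && used < allocated) {
        /* Requesting a smaller block is cheap for the allocator. */
        Grapheme *smaller = realloc(buf, used * sizeof(Grapheme));
        if (smaller != NULL)
            return smaller;
    }
    /* Nothing to trim, or realloc failed: the original block is still valid. */
    return buf;
}
```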

This script demonstrates the issue.

#!/usr/bin/env raku
my $words-done = 0;
class statm {
    has Int $.Vmsize;
    has Int $.VmRSS;
    has Int $.shared;
    has Int $.text;
    has Int $.lib;
    has Int $.data;
    has Int $.dt;
    method new($Vmsize,$VmRSS,$shared,$text,$lib,$data,$dt) {
        self.bless(:$Vmsize,:$VmRSS,:$shared,:$text,:$lib,:$data,:$dt);
    }
}
sub memstats() { return statm.new(|('/proc/self/statm'.IO.slurp.split(' ')>>.Int)); }

sub MAIN($wordlist) {
    signal(SIGINT).tap({ say 'words-done: '~ $words-done; exit 1;});
    say "lines\tVirt(GB)\tResident(GB)";
    my $s = memstats;
    for $wordlist.IO.lines(:enc('utf8-c8')) {
        $words-done++;
        if ($words-done % 1000 == 0) {
            $s = memstats;
            say $words-done~"\t"~
                Rat.new($s.Vmsize*4096,2**30).fmt('%.1f') ~ 
                "\t\t"~ Rat.new($s.VmRSS*4096,2**30).fmt('%.1f') ;
        }
        if $_ ~~ 'adsfjalks;dhjasdlfh;dsalfjkdlsajf' { say 'what are the odds' }
    }
}

Here are before and after runs using a readily available source file.

Before patch:

perl6 bughunt.pl6 MoarVM/src/strings/gb18030_codeindex.h
lines   Virt(GB)    Resident(GB)
1000    3.5         0.2
2000    6.6         0.4
3000    7.9         0.4
4000    7.9         0.9
5000    8.4         1.0
6000    11.5        1.1
7000    14.6        1.3
8000    17.7        1.4
9000    20.8        1.5
10000   23.9        1.7

After patch:

perl6 bughunt.pl6 MoarVM/src/strings/gb18030_codeindex.h
lines    Virt(GB)    Resident(GB)
10000    0.4         0.1
20000    0.4         0.1
30000    0.4         0.1
40000    0.4         0.1
50000    0.4         0.1
60000    0.4         0.1
70000    0.4         0.1
80000    0.4         0.1
90000    0.4         0.1
100000   0.4         0.1

@niner (Contributor) commented Apr 16, 2020

Commits look sensible to me. What I don't understand is: we allocate a buffer large enough to hold one grapheme for every input byte. To my understanding, we would actually use the whole buffer if the input stream consisted solely of ASCII characters, as those give us one grapheme per input byte. The absolute worst case would be an input buffer full of graphemes from the supplementary planes, as those can take up to 4 bytes per grapheme; in that case we'd allocate a buffer 4 times as large as we actually needed.

This makes me wonder what your input file looks like. If it does not match this worst case, there may be some other reason why your patch helps so much; a reason that could lead us to an even better solution (like not over-allocating in the first place).

@ddlws (Author) commented Apr 16, 2020

I've tested it on 100% ASCII files and on files of random bytes with newlines thrown in. The file size is the only thing that has an effect, since it controls the size of the read buffer and therefore the size of the output strings.

There really isn't a worst case like you're asking about, since the resulting strings are 4 bytes/grapheme no matter what. To actually use the whole buffer, you just need a line that's longer than the input buffer. In all my fuzzing, I don't think I ever tested that, though...

Rakudo might share some blame. While the default read length is 65535 bytes if you call read() on an IO::Handle, it reads 0x100000 bytes for things like .lines and probably .slurp, too. That's why the file size has an effect on the resulting string for files up to 1MB. If Rakudo were reading 65K instead of 1MB, I might never have noticed the issue.

If you want a smaller initial allocation per line, a change in rakudo to read smaller chunks for calls to .lines should do the trick. I'm not very familiar with nqp and rakudo, but it looks like this might be the place: src/core.c/IO/Handle.pm6

@niner (Contributor) commented Apr 17, 2020

Apparently MVM_string_utf8_decodestream uses ds->result_size_guess to size the buffer and uses MVM_string_decodestream_add_chars in case that buffer might be too small after all. I wonder if a similar strategy could be employed for utf8_c8?

If you've got some further time to dig into this, I'd very much appreciate that. Otherwise I'm fine with merging the PR as-is, as it's clearly an improvement already.

@ddlws (Author) commented Apr 22, 2020

I'll look into it.

@ddlws (Author) commented Apr 26, 2020

There's no reason utf8-c8 can't use result_size_guess, but I think it's a bad idea at the moment.

  • The utf8 stream decoder only adds a single grapheme to the result at a time, and it only does it in two places, making it easy to track the size and reallocate when needed.

  • utf8-c8 adds to the result buffer in append_grapheme, which is called from process_bad_bytes and process_ok_codepoint. MVM_string_utf8_c8_decodestream calls those two functions in six different places. append_grapheme can also add multiple graphemes to the buffer per call.

  • In short, safely tracking the bufsize would require some refactoring.

With so many <pick-a-format>-decoding vulnerabilities in the historical record, I'm afraid to make this function any more complex. It's completely untouched by the nqp tests (not sure about rakudo's), and generating sufficient test cases is nontrivial.

However, thanks to China's generous donation of free time, I just might get around to testing it properly. I'd be afraid to make more than minor changes until then, though.

I plan on giving all the decoders a thorough kicking once I get some more tooling in place. I think it would make sense for me to open a new issue regarding refactoring once I'm done fuzzing, but feel free to leave this open if you'd like.

@niner (Contributor) commented Apr 26, 2020

Ok, many thanks for the excellent analysis! Will merge the PR as-is and am looking forward to your future contributions. Welcome to the team :)

@niner niner merged commit 9bb7a18 into MoarVM:master Apr 26, 2020