
Memory leak when trying to import large data set via WriteStream #298

Closed
danburzo opened this Issue Jan 9, 2015 · 7 comments

3 participants
danburzo commented Jan 9, 2015

I am getting out-of-memory errors while trying to import about 500MB worth of nodes from an OpenStreetMap JSON file. Here's my code, in case I'm approaching this the wrong way:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    through2 = require('through2'),
    levelup = require('level');

var db = levelup('db');

var write_stream = db.createWriteStream();

fs.createReadStream('output/nodes.json',{ encoding: 'utf8' })
    .pipe(JSONStream.parse('*.*'))
    .pipe(through2.obj(function(item, enc, next){
        /* JSON looks like [ [ [id,lat,lon], [id,lat,lon] ... ] ] */
        this.push({ 
            key: item[0] + "",  // id
            value: item[1] + "," + item[2] // lat,lon
        });
        next();
    }))
    .pipe(write_stream);

This happens for level@0.18.0 (levelup@0.18.6 + leveldown@0.10.2), on Node v0.10.35 OS X.

mcollina (Member) commented Jan 9, 2015

Quite possible, we even want to deprecate WriteStream #199. However, it should respect backpressure and not have this kind of issue

My educated guess is that JSONStream put everything in 'old mode' and then we lose backpressure somehow. Or worse, the JSON structure causes a massive stack in JSONStream. @dominictarr your help is needed here :).

Can you please try to see if the same happens with a newline-delimited JSON file (you can parse it with split2)?

In any case, I would recommend against a single 500MB JSON file. Use something simpler, like newline-delimited records, and everything will be easier. JSON is not built for big files.
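
The newline-delimited format suggested above can be produced with a tiny helper. This is only a sketch, and nodesToLines is a hypothetical name, assuming the parsed JSON yields [id, lat, lon] triples as described in the original snippet:

```javascript
// Convert an array of [id, lat, lon] triples into newline-delimited
// "id,lat,lon" records, ready to be parsed back with split2.
function nodesToLines(nodes) {
  return nodes.map(function (node) {
    return node.join(',');
  }).join('\n');
}
```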


danburzo commented Jan 9, 2015

Thank you for the recommendations. Unfortunately, memory still goes through the roof (I am getting FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory) with this version:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    through2 = require('through2'),
    split2 = require('split2'),
    levelup = require('level');

var db = levelup('db');

var write_stream = db.createWriteStream();

fs.createReadStream('output/nodes.txt', { encoding: 'utf8' })
    .pipe(split2())
    .pipe(through2.obj(function(line, enc, next){
        var parts = line.split(',');
        this.push({ 
            key: parts[0] + "", 
            value: parts[1] + "," + parts[2]
        });
        next();
    }))
    .pipe(write_stream);

nodes.txt is a 600MB newline-delimited set of nodes in the form id, lat, lon.

Indeed, it sounds like the backpressure is not working as expected, but I'm not sure whether levelup is trying to write too frequently or whether split2/JSONStream is simply too fast and levelup's buffer spills over by the time it gets around to writing. [N.B. I'm still a n00b with streams, so my intuition might be horribly off :-)]


brycebaril (Member) commented Jan 9, 2015

I used to have this problem as well with old versions of levelup, but I hadn't seen it for quite some time. I wonder if object-mode backpressure signalling is the issue, with your keys being large. Maybe try manually setting lower highWaterMarks to test...

Also, you may want to check out https://github.com/maxogden/level-bulk-load, which attempts to auto-batch the writes; in my experience that's a significant improvement.
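
If level-bulk-load doesn't fit, the auto-batching idea can be sketched by hand: group records into fixed-size chunks and flush each chunk with levelup's db.batch(). The helpers toOps and chunk and the chunk size of 1000 are assumptions for illustration:

```javascript
// Turn "id,lat,lon" lines into levelup batch operations.
function toOps(lines) {
  return lines.map(function (line) {
    var parts = line.split(',');
    return { type: 'put', key: parts[0], value: parts[1] + ',' + parts[2] };
  });
}

// Split an array into chunks of at most `size` elements.
function chunk(arr, size) {
  var out = [];
  for (var i = 0; i < arr.length; i += size) {
    out.push(arr.slice(i, i + size));
  }
  return out;
}

// Usage sketch, assuming `db` is an open levelup instance:
//   chunk(lines, 1000).forEach(function (ops) {
//     db.batch(toOps(ops), function (err) { if (err) throw err; });
//   });
```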

Another optimization is to pre-sort your data by key in reverse, though I wouldn't expect that to be necessary.
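
The pre-sort suggestion, as a minimal in-memory sketch (for a 600MB file you would use an external sort such as the shell's sort instead; sortByKeyDesc is a hypothetical helper):

```javascript
// Sort "id,lat,lon" lines by numeric id, descending, without mutating
// the input array.
function sortByKeyDesc(lines) {
  return lines.slice().sort(function (a, b) {
    return Number(b.split(',')[0]) - Number(a.split(',')[0]);
  });
}
```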

@maxogden has also spent a lot of time working with bulk-loading leveldb


mcollina (Member) commented Jan 9, 2015

Let me suggest one trick I learned in the past:

Replace:

        next();

with

        setImmediate(next);

Also, you should wait for your level instance to be open before starting the import.


danburzo commented Jan 9, 2015

Wow, with setImmediate(next) it's writing like a champ! Although it seems to be slower, the memory is consistently at ~50MB and I don't expect it to crash any time soon.

Going with the levelup(db, callback) pattern does not help the memory problem in itself, but it does sound like a good idea since there can be quite a lot of writes queued by the time the database opens.

Keys are 64-bit integers; replacing them with 32-bit integers for the purpose of this exercise did help, but not enough to prevent the memory issue.

I will also check out level-bulk-load to see if it helps, but I'm glad there's a workable solution in the meantime.


mcollina (Member) commented Jan 9, 2015

If you want it to be faster, you can call setImmediate only every 100 (or more) samples. It's ugly, but it works.


danburzo commented Jan 9, 2015

Awesome 👍
I'm using it for scripts in a controlled environment, so it does not matter if it's a bit ugly if it works faster.
I'm also closing this issue in light of the decision to deprecate WriteStream in levelup.
Thank you, guys!

For future reference, this is the final code that works well:

var fs = require('fs'),
    through2 = require('through2'),
    split2 = require('split2'),
    levelup = require('level');

var i = 0;
levelup('db', function(err, db) {
    var write_stream = db.createWriteStream();

    fs.createReadStream('output/nodes.txt', { encoding: 'utf8' })
        .pipe(split2())
        .pipe(through2.obj(function(line, enc, next){
            var parts = line.split(',');
            this.push({ 
                key: parts[0], 
                value: parts[1] + "," + parts[2]
            });

            // Yield to the event loop every 1000 items so queued writes can drain
            // See: https://github.com/rvagg/node-levelup/issues/298
            if (i++ > 999) {
                setImmediate(next);
                i = 0;
            } else {
                next();
            }
        }))
        .pipe(write_stream);
});


@danburzo danburzo closed this Jan 9, 2015
