Memory leak in db.batch #171

Closed
fergiemcdowall opened this Issue Aug 14, 2013 · 99 comments


Hi

Maintainer of search-index and Norch here :)

There seems to be a memory leak in db.batch. When inserting thousands of batch files with db.batch the memory allocation of the parent app shoots up rather alarmingly, and does not come down again.

This issue has appeared in the last few weeks.

Any ways to manually free up memory or otherwise force garbage collection?

F

Owner

dominictarr commented Aug 14, 2013

do you have a test example that reproduces this?
I'm seeing similar stuff running npmd - which is also doing indexing,
I think this can happen if you queue up loads of writes at once?

Owner

dominictarr commented Aug 14, 2013

What I meant to say is that I've encountered a similar problem,
but I think it was just a flow control issue (just trying to do too many things at once).
It's easy to queue up many more operations than leveldb can take.
Actually, balancing this is a problem that has been encountered in other areas.

rvagg#153

Test example: reading in the test dataset in the Norch package - it generates about 400,000 entries (I think) from 21,000 batch files of about 150-200 entries each.

Yes, in the past I have also managed to queue up way too many operations, leading to huge server hangs. Now, however, search-index reads in batches in serial (one after the other). This seems to be more stable, but leads to a big memory leak in db.batch.

Owner

dominictarr commented Aug 14, 2013

I've found that large batches work pretty well - how many items do you have in your batches?

Owner

rvagg commented Aug 14, 2013

We need a simple test case for this so we can replicate the behavior and track it down, and we also need to know whether this is new behavior or has been present for a while. There was a LevelDOWN release this week that contains lots of new C++ which could possibly be leaky, but it'd be nice to know if this is recent behavior.

Also note that LevelDB does have some behavior that can look like a leak when you're doing heavy writes. Initially it'll just climb right up and not look like it's going down, but it eventually comes right back down and settles there. I don't think I have a graph of this but that'd probably be helpful to weed this kind of thing out.
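
One cheap way to tell the "balloons then settles" pattern from a genuine leak is to log process.memoryUsage() on an interval while the load runs; a minimal sketch (interval and formatting are arbitrary):

// log memory usage every 5 seconds so you can see whether rss balloons
// and later settles back down, or just keeps climbing
setInterval(function () {
    var m = process.memoryUsage();
    console.log((new Date()).toISOString(),
        'rss ' + Math.round(m.rss / 1048576) + 'M',
        'heapTotal ' + Math.round(m.heapTotal / 1048576) + 'M',
        'heapUsed ' + Math.round(m.heapUsed / 1048576) + 'M');
}, 5000);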

@dominictarr it's just over 21000 batch files of maybe 1-200 entries each

@rvagg OK- I will throw together a gist

@rvagg See https://gist.github.com/fergiemcdowall/6239924

NOTE: this gist demonstrates the memory jump (about 3 GB on my machine) when inserting many batches to levelUP, although I can't replicate the out-of-memory error I am experiencing when inserting to levelUP from Express.js - there the cleanup doesn't happen unless there is a start-stop of the application.

Owner

rvagg commented Aug 15, 2013

added an adjusted version to the gist that gives some breathing space to V8 to let the callbacks come in.

increasing the total number of batches triggers the problem you're talking about:

FATAL ERROR: JS Allocation failed - process out of memory

so there's something there but I'm not sure where and whether it's a real memory leak on our end or not (could plausibly be up the stack or down the stack!).
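
For anyone following along, the "breathing space" idea is roughly the following (a sketch, not the exact code in the gist): only start the next batch from the previous batch's callback, and yield to the event loop in between so V8 gets a chance to run callbacks and collect garbage.

var level = require('level');
var db = level('./breathing-room-db'); // hypothetical database location

var totalBatches = 1000;

var writeBatch = function (i) {
    if (i >= totalBatches) return console.log('Finished!');
    var ops = [];
    for (var j = 0; j < 200; j++) {
        ops.push({ type: 'put', key: 'batch' + i + '~key' + j, value: 'value ' + j });
    }
    db.batch(ops, function (err) {
        if (err) return console.log(err);
        // yield to the event loop before queuing the next batch
        setImmediate(function () { writeBatch(i + 1); });
    });
};
writeBatch(0);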

OK - thanks for the tips - that breathing space was a good idea

@rvagg could you push the adjusted version to the gist? :)

Owner

rvagg commented Aug 15, 2013

see the comment in the gist

Aha- got it, cheers!

Owner

rvagg commented Aug 15, 2013

if I run the same stuff against leveldown directly I get exactly the same behaviour so it's not in levelup.

I'm still not entirely convinced this isn't a leveldb thing. if you run leak-tester.js in the tests directory of leveldown and watch it over time you'll see memory usage balloon and then settle back down again over time, it's odd behaviour but I imagine if you push hard enough you could make that ballooning push you over Node's limit. Perhaps that's what's happening here?

I've also opened a leakage branch of leveldown with some minor things I've found so far.

Owner

No9 commented Aug 20, 2013

The GC plot makes for scary reading https://gist.github.com/No9/fa818a9d63d22551a837 (See plot at bottom of page)
We could have hit a known issue but I need to definitively map the node.js group thread to our situation.

Owner

No9 commented Aug 20, 2013

Here is a flame graph of the execution over 20 mins.

The leveldb core library is out on the left which would suggest to me that our issue is in node or our JS as opposed to the leveldb code.

[flame graph image]

Owner

rvagg commented Aug 21, 2013

that's an odd flamegraph... for 20 mins' worth of work there's a lot of string and module stuff going on in there and I can't see much related to levelup/leveldown. Can you skip the use of lorem-ipsum, since it seems to be getting in the way, and see what it does with plain Buffers via crypto.randomBytes(x), or Strings if you want crypto.randomBytes(x).toString('hex') (both cases are interesting since they have to do different things).
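
For clarity, the two kinds of test value being asked for look roughly like this (sizes are arbitrary):

var crypto = require('crypto');

// plain Buffer values
var bufferValue = crypto.randomBytes(1024);

// String values (hex-encoded, so twice as many characters as bytes)
var stringValue = crypto.randomBytes(1024).toString('hex');

console.log(Buffer.isBuffer(bufferValue), typeof stringValue, stringValue.length);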

Owner

No9 commented Aug 21, 2013

Yup and I will also provide one that filters on doBatch so we can see if that sheds any light.

No9 referenced this issue in Level/leveldown Aug 22, 2013

Closed

Windows memory leak? #55

Owner

No9 commented Aug 22, 2013

@rvagg here is the flamegraph you were looking for.
This appears to be more readable. This is a ten-minute sample and the app seg faulted, so I have a coredump I can run ::findjsobjects on if I can get hold of the right V8.so.
[flame graph image: crypto output]

Owner

No9 commented Aug 22, 2013

@dominictarr suggested compaction, and now we have a clearer flamegraph that could be worth some investigation. I'll run a trace on:

leveldb::DBImpl::BackgroundCompaction()
leveldb::DBImpl::MaybeScheduleCompaction()
leveldb::DBImpl::FinishCompactionOutputFile()
leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)

And try and get a correlation by timestamping the above and GC so we can see possible cause and effect:

(Edit)
See this link for more detail on the current GC metrics.
https://gist.github.com/No9/4f979544861588945788

Maybe compaction is placing demands on system resources that force aggressive GC by V8?

Owner

0x00A commented Aug 23, 2013

As I said to @dominictarr it could be compaction kicking in, but this might not be a problem at all. I'm not sure, but I think that leveldb::DBImpl::BackgroundCompaction() is designed to run when it can, as much as it can.

Owner

No9 commented Aug 23, 2013

OK, so actual compactions over a 10-minute period look like the following (milliseconds):

121.787329
993.087684
1223.09732
2774.197186
988.637749

The full range of data is available here in the funky github tsv format:
https://github.com/No9/datadump/blob/master/gc-vs-comp.tsv
(You can filter for 'Do Compaction' at the top. All the other compaction calls are in there too.)

While nearly 3 seconds is not optimal, it isn't clear how this would be (or is) having an impact.
(I was hoping for longer compactions throttled by GC.)

I think I am going to look at memory paging next, but I will also keep chasing the V8.so, as analysis of the core dump could be handy.

Owner

rvagg commented Aug 29, 2013

Owner

juliangruber commented Aug 29, 2013

did you try using the chained batch, eg db.batch().put()......write() ?
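
For reference, the chained form (as opposed to passing an array of operations to db.batch()) looks roughly like this; the keys and values are just placeholders:

var level = require('level');
var db = level('./chained-batch-db'); // hypothetical database location

// array form: db.batch([{ type: 'put', key: 'a', value: '1' }], callback)
// chained form:
db.batch()
    .put('foo', 'bar')
    .put('baz', 'qux')
    .del('obsolete-key')
    .write(function (err) {
        if (err) return console.log(err);
        console.log('batch written');
    });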

I'm not so sure this is just a batch issue anymore. I tried loading data via looping single puts, and I eventually hit the same out-of-memory error. Puts with {sync: true} still leak, it just takes a lot longer to get there.

var levelup = require('level');
var db = levelup('./leakydb', {valueEncoding: 'json'}); // modifying cache size here doesn't help

var putOne = function (i) {
    var i = i + 1;
    // log a little message every 10,000 puts
    if (i % 10000 == 0) {
        console.log(i + " ops   " + (new Date()).toISOString());
    }
    // if we're under 9,000,001 ops, do another put, otherwise stop
    if (i < 9000001) {
        var keyFriendlyI = i + 1000000; // add a million so its sort friendly
        var key = "aKeyLabel~" + keyFriendlyI;
        var value = {
            i: i,
            createdDate: new Date(),
            someText: "I thought that maybe single puts would make it further than the batch approach",
            moreText: "but that isn't necessarily true.",
            evenMoreText: "This hits a FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory."
        };
        // tried setting {sync: true} here and we still run out of memory 
        // it just takes a lot longer to get there
        db.put(key, value, function (err) {
            if (err) {
                console.log(err);
            } else {
                putOne(i);
            }
        });
    } else {
        console.log("Finished!");
    }
};
putOne(0);
Owner

rvagg commented Aug 29, 2013

I concur, my initial testing suggests it's deeper than batch

Owner

dominictarr commented Aug 29, 2013

hmm, so, what happens if you run leveldown directly, without levelup?

Owner

rvagg commented Aug 29, 2013

@tjfontaine I think we need some SmartOS magic to hunt down some kind of memory leak. It's likely in the C++ of LevelDOWN, and perhaps even has something to do with the way Node interacts with LevelDB. How's that blog post coming along about finding leaks? Some pointers would be nice.

I haven't finished the blog post, so here is a list of steps. Some are obvious; they are not meant to be patronizing, just there for those who might also wander into this thread.

  • mlogin
    • requires an active joyent account, or if you already have a smartos/illumos based instance somewhere else get there
  • git clone ...
  • npm install
  • UMEM_DEBUG=default node myLeakyScript.js &
    • you can also use screen or tmux instead of running in the background, the point is it should be running for the next steps
    • if you're using a library that doesn't link against libumem or links against something else for its memory allocation you can use LD_PRELOAD=libumem.so UMEM_DEBUG=default node myLeakyScript.js to force the dependency chain to use libumem allocators.
  • ps -ef | grep node
    • find the pid of your node process
  • gcore <pid>
    • creates a file named core.<pid> in $CWD
  • mdb core.<pid>
  • ::findleaks
    • This is only going to show you leaks it can verify on the C/C++ side of things, it's not going to be able to diagnose JS leaks
    • You're looking for output that looks roughly like

CACHE     LEAKED  BUFCTL    CALLER
08072c90       1  0807dd08  buf_create+0x12
08072c90       1  0807dca0  func_leak+0x12
08072c90       1  0807dbd0  main+0x12

  • 0807dd08::bufctl_audit
    • The third field is a bucket you can diagnose, the result of this will show you the stack trace for where the leaked memory was alloc'd

Catch me in IRC if there are more details you need or want an extra pair of eyes

Owner

rvagg commented Aug 30, 2013

@tjfontaine I just don't get anything interesting, it mostly looks like this:

BYTES             LEAKED VMEM_SEG CALLER
8192                   1 932d9000 MMAP
------------------------------------------------------------------------
           Total       1 oversized leak, 8192 bytes

does that mean it can't find any c++ leaks and that the problem could be something to do with V8, like not letting go of persistent handles properly?

Right, this means that -- at least during the time of the core -- all memory allocated on the native side was accounted for. I should have asked: you did do the gcore after you were able to observe the memory leak happening?

The next step, since you've eliminated at least the obvious native leaks, works like this, in the same core:

  • ::load v8.so
  • ::findjsobjects -v -a

This may take a while (depending on the number of objects that are still observable). It will output into a pager a list of object addresses, the count of objects, the number of properties on that specific object type, and then finally a brief description of the "type" of object. This may be a primitive, or the property-based class of an object.

8355dce5      210        8 Array
a313320d     1140        2 Arguments: length, callee
835860f9      492        7 Error: message, path, type, code, arguments, ...
8358aec1      537       40 Array
> a313320d::jsprint
{
    length: undefined,
    callee: undefined,
}
> a313320d::findjsobjects | ::jsprint
{
    length: undefined,
    callee: undefined,
}
{
    length: 3,
    callee: function <anonymous> (as EventEmitter.emit),
}
{
    length: 3,
    callee: function <anonymous> (as EventEmitter.emit),
}
...

This should give the easiest way to see what objects are still being held around in the JS heap, and where your leak may be, or it may not even be a leak so much as a less than ideal design.

Other reading can be found at http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/ or catching me on IRC

Owner

rvagg commented Aug 30, 2013

oh the irony!

*** mdb: received signal SEGV at:
    [1] v8.so`jsobj_properties+0x262()
    [2] v8.so`findjsobjects_range+0x14b()
    [3] v8.so`findjsobjects_mapping+0x4a()
    [4] libproc.so.1`i_Pmapping_iter+0x60()
    [5] libproc.so.1`Pmapping_iter+0x16()
    [6] v8.so`dcmd_findjsobjects+0x187()
    [7] mdb`dcmd_invoke+0x40()
    [8] mdb`mdb_call_idcmd+0x128()
    [9] mdb`mdb_call+0x325()
    [10] mdb`yyparse+0x3f7()
    [11] mdb`mdb_run+0x26d()
    [12] mdb`main+0x153a()
    [13] mdb`_start+0x83()

@rvagg this can happen on older v8.so's. If you put that core in manta, however, and mlogin -s /rvagg/stor/core.1234 --memory=2048, you should be able to do mdb /assets/rvagg/stor/core.1234 and it will do the right thing.

Owner

rvagg commented Aug 30, 2013

Thanks @tjfontaine, I've managed to figure out doing this on manta.

Everyone else: we seem to have a problem with persistent references not being freed. Running test/leak-tester.js for a while, I get a lot of objects; these are the ones with significant enough counts to bother looking at, along with what I know & can guess about them:

817d5185    30523        0 Array
# 40 'empty' elements in each of these
85008a2d     3423        1 Object: key
# I'm pretty sure this is from get() calls
8172f419     3579        1 Array
8500890d     3372        2 Object: key, value
# I'm pretty sure this is from put() calls
850088e9     6799        1 Object: callback
# anon
8172f44d     3580        2 Object: callback, domain
# `run`, which is the main function in leak-tester.js
8172ed89    40025        2 SlowBuffer: length, used
# length=0, cleaned up perhaps?
8171484d    27367        3 Error: arguments, type, message
# mostly empty, the occasional one has a large hex string as the `message`

/cc @kkoopa

Owner

rvagg commented Aug 30, 2013

I should also say that the nature of leak-tester.js is this: it generates random keys in a space of 10000000 possible, performs a get() on each key (the Object with 'key' in the above data) to see if it exists (mostly they don't, hence the Error in the above data), then does a put() (the Object with 'key' and 'value' in the data above) for that key with a random String from crypto.randomBytes(1024).toString('hex') (it can be made to do just Buffers, but in my case I'm running Strings).
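
A rough sketch of that loop, for anyone who doesn't want to dig out the real leak-tester.js (the key format here is made up; the key space and value size are as described above):

var crypto = require('crypto');
var level = require('level');
var db = level('./leak-test-db'); // hypothetical database location

var run = function () {
    // pick a random key in a space of 10,000,000 possible keys
    var key = 'key~' + Math.floor(Math.random() * 10000000);
    db.get(key, function (err) {
        // err is usually a NotFound error since most keys don't exist yet
        db.put(key, crypto.randomBytes(1024).toString('hex'), function (err) {
            if (err) return console.log(err);
            run();
        });
    });
};
run();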

kkoopa commented Aug 30, 2013

I see SlowBuffer in the list there, which indicates Node 0.11.2 or less. Is the problem present in older versions of leveldown? What about on newer Node? Could some persistent handles require being made weak?

Owner

rvagg commented Aug 30, 2013

sorry, I should have said, this is the latest 0.10. I'm just messing with 0.11 now but am finding some other odd problems just getting the leak tester going!

kkoopa commented Aug 30, 2013

I see, latest node master produces a whole bunch of failed tests due to uv-stuff:

node: ../deps/uv/src/unix/loop.c:150: uv__loop_delete: Assertion `!((((*(&(loop)->active_reqs))[0]) == (&(loop)->active_reqs)) == 0)' failed.
not ok test/approximate-size-test.js .................... 1/2
    Command: "node" "approximate-size-test.js"
    TAP version 13
    ok 1 cleanup returned an error
    not ok 2 test/approximate-size-test.js
      ---
        exit:    ~
        signal:  SIGABRT
        stderr:  node: ../deps/uv/src/unix/loop.c:150: uv__loop_delete: Assertion `!((((*(&(loop)->active_reqs))[0]) == (&(loop)->active_reqs)) == 0)' failed.
        command: "node" "approximate-size-test.js"
      ...

    1..2
    # tests 2
    # pass  1
    # fail  1

I'll try having a look with 0.10 in the meantime.

kkoopa commented Aug 30, 2013

I don't recall anyone mentioning testing with --expose-gc and forcing garbage collection. If all (non-persistent) objects are still in the active context, they might not get garbage collected despite having no references and so forth.
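
For anyone wanting to try that: starting node with --expose-gc makes a global gc() function available that forces a collection, e.g. (a minimal sketch; the batch contents are arbitrary):

// run with: node --expose-gc force-gc-test.js
var level = require('level');
var db = level('./force-gc-db'); // hypothetical database location

var ops = [];
for (var i = 0; i < 10000; i++) {
    ops.push({ type: 'put', key: 'key' + i, value: 'value' + i });
}

db.batch(ops, function (err) {
    if (err) return console.log(err);
    console.log('before gc:', process.memoryUsage());
    if (global.gc) global.gc(); // only present when started with --expose-gc
    console.log('after gc: ', process.memoryUsage());
});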

Owner

rvagg commented Aug 30, 2013

I messed a bit with --trace_gc and even --gc_interval but not much beyond that. All I could see was a bunch of V8 objects not being collected.

Owner

rvagg commented Aug 30, 2013

So I think I've solved at least part of this puzzle. leak-tester.js behaves a little more reasonably after this: nodejs/nan@e4097ab

I honestly don't understand the interaction of v8::HandleScope and Persistent references (or perhaps this isn't even about Persistent references?) but it seems that when you forget to use it you get subtle but nasty behaviour. Perhaps @tjfontaine or @kkoopa have more of a clue than I?

Owner

rvagg commented Aug 30, 2013

Anyone who thinks they see a leak: try reinstalling leveldown to pick up nan@0.3.2 from npm and see if it makes a difference.

Owner

rvagg commented Aug 30, 2013

getCount = 1561000 , putCount =  1445759 , rss = 1583% 144M ["1","44","208","1284","0","0","0"]

Much more like normal now! memory goes up when the throughput is high and settles down as leveldb starts to slow down a little with higher volume. I'm not even sure whether it's v8 or leveldb that causes this memory behaviour but I know it's normal!

The next problem is that I can't even get leak-tester.js to run in Node 0.11.4+; Node seems to completely lose the reference to the callback given to put() and it doesn't get called at all, so it just stops. If I put a console.log() in there then it continues. This is the case even up to current Node master, so I'd be interested to hear from anyone who's using Node 0.11 for anything serious; @juliangruber perhaps?

kkoopa commented Aug 30, 2013

Aha, that might very well be it. It's not about persistent references per se, but Locals (or Handles) on the stack. I am not sure, but I seem to remember something along the lines of: if a HandleScope is not specified in a non-inlined function, the scope of the caller is used. For some reason I don't quite remember now, any time a new Local or Handle is created, you have to have a HandleScope in the surrounding scope (or is it context). NanPersistentToLocal always creates a new Local<T> in 0.11+, as well as for weak handles in 0.10-. This is 0.10- and the handles are not weak, so this does not apply in this particular case. However, #define NanSymbol(value) v8::String::NewSymbol(value) creates a new String and therefore a HandleScope is required.

In general, NAN assumes scopes are available in several functions, e.g. NanReturnValue(), which assumes a HandleScope called scope exists in 0.10- but not in 0.11+, and so will work in the latter, but not in the former. The easiest solution is recommending always doing a NanScope() whenever using NAN methods.

Owner

juliangruber commented Aug 30, 2013

@rvagg using node 0.11, levelup 0.14 and leveldown 0.8 in a dev environment and so far it works fine

Owner

rvagg commented Aug 31, 2013

@fergiemcdowall could you report back when you've given the latest a go and close this issue if you think your immediate problem has gone away.

You should npm ls | grep nan and see nan@0.3.2 to know you have the fixed version.

Testing on a Windows 8 machine. Will try OSX and Linux later

The issue as reported is still there, although the behaviour is slightly different. Memory allocation still climbs inexorably upwards, and finally crashes with FATAL ERROR: JS Allocation failed - process out of memory, although now "batch" is a bit slower. Same end result, but it takes longer to get there (although it's around the same in terms of total .batch calls issued).

My Level is as follows:

├── levelup@0.14.0 (concat-stream@0.1.1, simple-bufferstream@0.0.4, prr@0.0.0, errno@0.0.5, semver@1.1.4, bops@0.0.6, xtend@2.0.6)
└── leveldown@0.8.0 (bindings@1.1.1, nan@0.3.2)

And my node is:

$ node --version
v0.11.6

Owner

rvagg commented Sep 3, 2013

A leveldown 0.8.1 was published a couple of days ago with some additional (minor) leak fixes that were present in my local testing environment but I'd forgotten to commit and publish them when I found the NAN leak. 0.8.2 is also out now but that's just got FreeBSD support.

So, grab the latest version and try again and let's see if it's any different.

Hmmm, now I get a slightly different error at more or less the same point:

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

My Level is now

search-index@0.2.17 node_modules\search-index
├── levelup@0.14.0 (concat-stream@0.1.1, prr@0.0.0, simple-bufferstream@0.0.4, errno@0.0.5, semver@1.1.4, bops@0.0.6, xtend@2.0.6)
└── leveldown@0.8.1 (bindings@1.1.1, nan@0.3.2)

kkoopa commented Sep 3, 2013

Are you sure it's leaking and not just using a lot of memory? What happens with increased heap size? --max-old-space-size=4096

@kkoopa thanks for that - yes, --max-old-space-size=4096 does prevent crashing and even makes memory use DECLINE slightly once it hits 95% or so on my 6 GB RAM test machine (Hurrah!).

As for "leaking" vs "using a lot of memory", I would say 'leaking'. Memory is allocated cumulatively and the node/level test app only seems to clear up after itself once it gets hit over the head by the OS. If I index a lot of documents, I have 5 GB of memory allocated to the app until I stop and start it, at which point it instantly disappears, indicating that the objects held in memory were not in use.

@rvagg interesting- have starred it!

kkoopa commented Sep 3, 2013

The max-old-space-size option sets the maximum heap size of V8. It defaults to 1 GB in 32-bit and 1.5 GB in 64-bit. For databases and such, this can easily be far too little.

kkoopa commented Sep 3, 2013

Regarding the failure on Node 0.11.4+, it now works with latest HEAD of Node nodejs/node-v0.x-archive@7a235f9 and leveldown Level/leveldown@6c48cac.

fergiemcdowall referenced this issue in fergiemcdowall/norch Sep 13, 2013

Closed

Indexing Memory Problem #16

Owner

rvagg commented Oct 1, 2013

where are we at with this? is it still a problem?

Still a problem.

Users of the levelUPpy Forage (there are AT LEAST 3 of us :) are currently working around this problem by stopping and starting their servers after big indexing jobs to instantly free up a few GB of RAM.

Observation: LevelUP apps with big RAM leaks can be fixed almost instantaneously by restarting the app via, for example, forever. Maybe it is possible to implement a temporary high-level fix in levelUP that simply forks big batch insert jobs and then kills them afterwards?

Owner

rvagg commented Oct 2, 2013

I'm wondering if this could be related to the thing that @maxogden found with writestream performance and not exceeding the writeBufferSize (the size of the memtable/log), which is configurable (see leveldown docs).

I'm pretty confident this isn't an actual memory leak but either something to do with the way LevelDB is handling the large input or to do with V8 garbage collection.
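
For anyone who wants to experiment with that, writeBufferSize can be passed through as an open option (a sketch; the 16 MB figure is just an example, LevelDB's default is 4 MB):

var level = require('level');

// writeBufferSize is a leveldown open option (the size of the memtable/log);
// the idea discussed above is to keep each batch's total byte size below it
var db = level('./tuned-db', {
    writeBufferSize: 16 * 1024 * 1024 // 16 MB
});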

given a well defined and contained test case I would be happy to look into this

@tjfontaine there are a couple of small test cases here (look also in the comments) https://gist.github.com/fergiemcdowall/6239924

Owner

rvagg commented Oct 3, 2013

@tjfontaine that'd be amazing if you could have a look at this. There is also a leak-tester.js in the test directory that can be adjusted to do batch() writes. My experience with testing this is that the memory used balloons initially but eventually settles right down over time; it's the scale and timeframe of the ballooning that's getting us into trouble here, I suspect. Maybe you can help us understand what causes this behaviour, or perhaps there's something else entirely going on here.

Owner

maxogden commented Oct 4, 2013

Ran into segfaulting today, here's some code to reproduce it: https://github.com/maxogden/level-csv-bench

Follow the readme and run node raw.js. It should segfault. Then rm -rf test.db and npm install level@0.11.0. It should run fine in roughly 1 minute.

The version of level that gets installed by default is latest stable, 0.17.0. I tested every release in between 0.11 and 0.17 with these results:

0.11, 0.12: completes after ~1 minute
0.13: segfaults after ~1 minute, process was around 300mb after steady increase
0.14: completes after ~2 minutes, memory usage at end was ~400mb
0.15, 0.16: completes after ~2 minutes, memory usage at end was ~500mb
0.17: segfaults after ~6 seconds

The big differences seem to be between 12 and 13, 13 and 14 and 16 and 17

Owner

No9 commented Oct 8, 2013

Yes, downgraded to 0.12 and the modified test using crypto instead of lorem ipsum runs.
https://gist.github.com/No9/4f979544861588945788

Queued :  10 Inserted :  20 Total Puts :  2000 { rss: 432910336, heapTotal: 133092096, heapUsed: 79213712 }
Queued :  23 Inserted :  37 Total Puts :  3700 { rss: 864186368, heapTotal: 267243776, heapUsed: 259586360 }
Queued :  35 Inserted :  55 Total Puts :  5500 { rss: 1228398592, heapTotal: 342575104, heapUsed: 292723808 }
Queued :  47 Inserted :  73 Total Puts :  7300 { rss: 1705848832, heapTotal: 530387456, heapUsed: 410235976 }
Queued :  60 Inserted :  90 Total Puts :  9000 { rss: 2252857344, heapTotal: 781147904, heapUsed: 435587912 }
Queued :  73 Inserted :  108 Total Puts :  10800 { rss: 2660450304, heapTotal: 887437312, heapUsed: 870932104 }
Queued :  84 Inserted :  126 Total Puts :  12600 { rss: 2878947328, heapTotal: 832744704, heapUsed: 712754240 }
Queued :  96 Inserted :  144 Total Puts :  14400 { rss: 3171160064, heapTotal: 835840512, heapUsed: 816263800 }
Queued :  108 Inserted :  162 Total Puts :  16200 { rss: 3619356672, heapTotal: 994758656, heapUsed: 898126104 }
Queued :  121 Inserted :  179 Total Puts :  17900 { rss: 4022054912, heapTotal: 1108271616, heapUsed: 1022284520 }
Queued :  104 Inserted :  196 Total Puts :  19600 { rss: 4022452224, heapTotal: 1108271616, heapUsed: 1022312296 }
Queued :  88 Inserted :  212 Total Puts :  21200 { rss: 4029661184, heapTotal: 1108271616, heapUsed: 1022323600 }
Queued :  72 Inserted :  228 Total Puts :  22800 { rss: 4022190080, heapTotal: 1108271616, heapUsed: 1022334008 }
Queued :  55 Inserted :  245 Total Puts :  24500 { rss: 4029292544, heapTotal: 1108271616, heapUsed: 1022344552 }
Queued :  38 Inserted :  262 Total Puts :  26200 { rss: 4029419520, heapTotal: 1108271616, heapUsed: 1022355128 }
Queued :  21 Inserted :  279 Total Puts :  27900 { rss: 4030140416, heapTotal: 1108271616, heapUsed: 1022365672 }
Queued :  5 Inserted :  295 Total Puts :  29500 { rss: 4030316544, heapTotal: 1108271616, heapUsed: 1022376048 }
Queued :  0 Inserted :  300 Total Puts :  30000 { rss: 4005097472, heapTotal: 1108271616, heapUsed: 1022384544 }

Owner

rvagg commented Oct 8, 2013

@maxogden what platform (mac presumably?) and Node version are you getting crashes on? working for me in Linux with 0.10.18.

There was a leak that was fixed in NAN somewhere around 0.15/0.16 that would cause ballooning memory for these large jobs. (see above comment).

.. unfortunately, speaking of NAN, my guess would be that the issue is somewhere hidden in there since that's where most of the work has gone on for leveldown, either there or the integration of NAN into leveldown.

Owner

maxogden commented Oct 8, 2013

@rvagg node -v 0.10.18 on mac os 10.8.4.

I tested the csv-bench with level@latest (17) on my linode 2048 VPS w/ ubuntu and got a segfault after 2 minutes:

$ time node raw.js 
Segmentation fault

real    2m17.361s
user    2m41.937s
sys 0m6.156s

same thing on my mac segfaulted in 13 seconds:

$ time node raw.js 
Segmentation fault: 11

real    0m13.900s
user    0m19.085s
sys 0m0.633s
Owner

rvagg commented Oct 8, 2013

Memory size thing then, perhaps; I have 16G here so plenty of breathing room. I guess it's the same issue as referenced by the rest of this thread.

Owner

maxogden commented Oct 8, 2013

@rvagg maybe you could get a segfault if you insert a csv 10x or 20x the size

Owner

rvagg commented Oct 8, 2013

actually, I was working in tmpfs and switched to my normal disk (SSD) to try it there and got the segfault and whaddya know, it's a LevelDB error, here's the trace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5bc8700 (LWP 16439)]
0x00007ffff7786473 in std::string::size() const () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
(gdb) backtrace 
#0  0x00007ffff7786473 in std::string::size() const () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff49688f7 in leveldb::WriteBatchInternal::ByteSize (batch=0x111bfbb0)
    at ../deps/leveldb/leveldb-1.14.0/db/write_batch_internal.h:36
#2  0x00007ffff496618c in leveldb::DBImpl::BuildBatchGroup (this=0x7fffe80008f0, 
    last_writer=0x7ffff5bc7d30) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1228
#3  0x00007ffff4965dce in leveldb::DBImpl::Write (this=0x7fffe80008f0, options=..., 
    my_batch=0x111bfbb0) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1177
#4  0x00007ffff494f3b0 in leveldown::Database::WriteBatchToDatabase (this=0xe9e5e0, 
    options=0x111bfb10, batch=0x111bfbb0) at ../src/database.cc:74
#5  0x00007ffff4948c06 in leveldown::Batch::Write (this=0x111dffc0) at ../src/batch.cc:30
#6  0x00007ffff494d81c in leveldown::BatchWriteWorker::Execute (this=0x1126f6c0)
    at ../src/batch_async.cc:25
#7  0x00007ffff494b748 in NanAsyncExecute (req=0x1126f6c8) at ../node_modules/nan/nan.h:627
#8  0x00000000006ddebd in worker (arg=arg@entry=0x0) at ../deps/uv/src/unix/threadpool.c:74
#9  0x00000000006d3a0f in uv__thread_start (ctx_v=<optimised out>) at ../deps/uv/src/uv-common.c:322
#10 0x00007ffff6f9af8e in start_thread (arg=0x7ffff5bc8700) at pthread_create.c:311
#11 0x00007ffff6cc4e1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

@rescrv any ideas here?

Owner

maxogden commented Oct 8, 2013

my batching code is here: https://github.com/maxogden/level-csv-bench/blob/master/index.js

the batches should be <= 16mb, which is also the writeBufferSize (https://github.com/maxogden/level-csv-bench/blob/master/setup.js)

the size can be verified by iterating over the batch here https://github.com/maxogden/level-csv-bench/blob/master/index.js#L25 and summing byteLengths etc.

is this the correct way to do it? the sum of the byte length of all keys and values? as long as that number is below the writeBufferSize then everything should go swimmingly?
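
The summing described above would look something like this (a sketch; assumes an array-form batch with string or Buffer keys and values):

// rough byte size of an array-form batch: the sum of the byte lengths of all keys and values
var byteSizeOf = function (x) {
    return Buffer.isBuffer(x) ? x.length : Buffer.byteLength(String(x));
};

var batchByteSize = function (ops) {
    return ops.reduce(function (sum, op) {
        return sum + byteSizeOf(op.key) + (op.value !== undefined ? byteSizeOf(op.value) : 0);
    }, 0);
};

var ops = [
    { type: 'put', key: 'hello', value: 'world' },
    { type: 'put', key: 'foo', value: 'bar' }
];
console.log(batchByteSize(ops) + ' bytes'); // keep this under the writeBufferSize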

Owner

rvagg commented Oct 8, 2013

@rescrv re irc:

(gdb) f 0
#0  0x00007ffff7786473 in std::string::size() const () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) print *this
No symbol "this" in current context.
(gdb) f 1
#1  0x00007ffff49688f7 in leveldb::WriteBatchInternal::ByteSize (batch=0x10457670)
    at ../deps/leveldb/leveldb-1.14.0/db/write_batch_internal.h:36
36      return batch->rep_.size();
(gdb) print *this
No symbol "this" in current context.
(gdb) f 2
#2  0x00007ffff496618c in leveldb::DBImpl::BuildBatchGroup (this=0x7fffe80008f0, 
    last_writer=0x7ffff63c8d30) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1228
1228      size_t size = WriteBatchInternal::ByteSize(first->batch);
(gdb) print *this
$7 = {<leveldb::DB> = {_vptr.DB = 0x7ffff4bc1330 <vtable for leveldb::DBImpl+16>}, env_ = 0xee6160, 
  internal_comparator_ = {<leveldb::Comparator> = {
      _vptr.Comparator = 0x7ffff4bc1410 <vtable for leveldb::InternalKeyComparator+16>}, 
    user_comparator_ = 0xe9aba0}, internal_filter_policy_ = {<leveldb::FilterPolicy> = {
      _vptr.FilterPolicy = 0x7ffff4bc13d0 <vtable for leveldb::InternalFilterPolicy+16>}, 
    user_policy_ = 0x0}, options_ = {comparator = 0x7fffe8000900, create_if_missing = true, 
    error_if_exists = false, paranoid_checks = false, env = 0xee6160, info_log = 0x7fffe8000ee0, 
    write_buffer_size = 16777216, max_open_files = 1000, block_cache = 0xee6520, block_size = 4096, 
    block_restart_interval = 16, compression = leveldb::kSnappyCompression, filter_policy = 0x0}, 
  owns_info_log_ = true, owns_cache_ = false, dbname_ = {static npos = <optimised out>, 
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7fffe80008d8 "test.db"}}, table_cache_ = 0x7fffe80021a0, 
  db_lock_ = 0x7fffe8000c80, mutex_ = {mu_ = {__data = {__lock = 2, __count = 0, __owner = 13704, 
        __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\000\000\000\000\210\065\000\000\001", '\000' <repeats 26 times>, 
      __align = 2}}, shutting_down_ = {rep_ = 0x0}, bg_cv_ = {cv_ = {__data = {__lock = 0, 
        __futex = 2, __total_seq = 1, __wakeup_seq = 1, __woken_seq = 1, __mutex = 0x7fffe8000990, 
        __nwaiters = 0, __broadcast_seq = 1}, 
      __size = "\000\000\000\000\002\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\220\t\000\350\377\177\000\000\000\000\000\000\001\000\000", 
      __align = 8589934592}, mu_ = 0x7fffe8000990}, mem_ = 0x7fffd80cd670, imm_ = 0x7fffd8563080, 
  has_imm_ = {rep_ = 0x7fffd8563080}, logfile_ = 0x7fffd80cd610, logfile_number_ = 281, 
  log_ = 0x7fffd8286190, seed_ = 0, 
  writers_ = {<std::_Deque_base<leveldb::DBImpl::Writer*, std::allocator<leveldb::DBImpl::Writer*> >> = {
      _M_impl = {<std::allocator<leveldb::DBImpl::Writer*>> = {<__gnu_cxx::new_allocator<leveldb::DBImpl::Writer*>> = {<No data fields>}, <No data fields>}, _M_map = 0x7fffe8000be0, _M_map_size = 8, 
        _M_start = {_M_cur = 0x7fffd80dd968, _M_first = 0x7fffd80dd820, _M_last = 0x7fffd80dda20, 
          _M_node = 0x7fffe8000c10}, _M_finish = {_M_cur = 0x7fffd80dd970, _M_first = 0x7fffd80dd820, 
          _M_last = 0x7fffd80dda20, _M_node = 0x7fffe8000c10}}}, <No data fields>}, 
  tmp_batch_ = 0x7fffe8000c30, snapshots_ = {list_ = {<leveldb::Snapshot> = {
        _vptr.Snapshot = 0x7ffff4bc13b0 <vtable for leveldb::SnapshotImpl+16>}, number_ = 0, 
      prev_ = 0x7fffe8000a88, next_ = 0x7fffe8000a88, list_ = 0x0}}, pending_outputs_ = {_M_t = {
      _M_impl = {<std::allocator<std::_Rb_tree_node<unsigned long> >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<unsigned long> >> = {<No data fields>}, <No data fields>}, 
        _M_key_compare = {<std::binary_function<unsigned long, unsigned long, bool>> = {<No data fields>}, <No data fields>}, _M_header = {_M_color = std::_S_red, _M_parent = 0x7fffe416e620, 
          _M_left = 0x7fffe416e620, _M_right = 0x7fffe40f1640}, _M_node_count = 2}}}, 
  bg_compaction_scheduled_ = true, manual_compaction_ = 0x0, versions_ = 0x7fffe8002e20, bg_error_ = {
    state_ = 0x0}, consecutive_compaction_errors_ = 0, stats_ = {{micros = 19773873, bytes_read = 0, 
      bytes_written = 227771407}, {micros = 23647903, bytes_read = 185973394, 
      bytes_written = 196165495}, {micros = 24064573, bytes_read = 204562615, 
      bytes_written = 213360961}, {micros = 0, bytes_read = 0, bytes_written = 0}, {micros = 0, 
      bytes_read = 0, bytes_written = 0}, {micros = 0, bytes_read = 0, bytes_written = 0}, {
      micros = 0, bytes_read = 0, bytes_written = 0}}}
(gdb) f 3
#3  0x00007ffff4965dce in leveldb::DBImpl::Write (this=0x7fffe80008f0, options=..., 
    my_batch=0x10457670) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1177
1177        WriteBatch* updates = BuildBatchGroup(&last_writer);
(gdb) print *this
$8 = {<leveldb::DB> = {_vptr.DB = 0x7ffff4bc1330 <vtable for leveldb::DBImpl+16>}, env_ = 0xee6160, 
  internal_comparator_ = {<leveldb::Comparator> = {
      _vptr.Comparator = 0x7ffff4bc1410 <vtable for leveldb::InternalKeyComparator+16>}, 
    user_comparator_ = 0xe9aba0}, internal_filter_policy_ = {<leveldb::FilterPolicy> = {
      _vptr.FilterPolicy = 0x7ffff4bc13d0 <vtable for leveldb::InternalFilterPolicy+16>}, 
    user_policy_ = 0x0}, options_ = {comparator = 0x7fffe8000900, create_if_missing = true, 
    error_if_exists = false, paranoid_checks = false, env = 0xee6160, info_log = 0x7fffe8000ee0, 
    write_buffer_size = 16777216, max_open_files = 1000, block_cache = 0xee6520, block_size = 4096, 
    block_restart_interval = 16, compression = leveldb::kSnappyCompression, filter_policy = 0x0}, 
  owns_info_log_ = true, owns_cache_ = false, dbname_ = {static npos = <optimised out>, 
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7fffe80008d8 "test.db"}}, table_cache_ = 0x7fffe80021a0, 
  db_lock_ = 0x7fffe8000c80, mutex_ = {mu_ = {__data = {__lock = 2, __count = 0, __owner = 13704, 
        __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\000\000\000\000\210\065\000\000\001", '\000' <repeats 26 times>, 
      __align = 2}}, shutting_down_ = {rep_ = 0x0}, bg_cv_ = {cv_ = {__data = {__lock = 0, 
        __futex = 2, __total_seq = 1, __wakeup_seq = 1, __woken_seq = 1, __mutex = 0x7fffe8000990, 
        __nwaiters = 0, __broadcast_seq = 1}, 
      __size = "\000\000\000\000\002\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\220\t\000\350\377\177\000\000\000\000\000\000\001\000\000", 
      __align = 8589934592}, mu_ = 0x7fffe8000990}, mem_ = 0x7fffd80cd670, imm_ = 0x7fffd8563080, 
  has_imm_ = {rep_ = 0x7fffd8563080}, logfile_ = 0x7fffd80cd610, logfile_number_ = 281, 
  log_ = 0x7fffd8286190, seed_ = 0, 
  writers_ = {<std::_Deque_base<leveldb::DBImpl::Writer*, std::allocator<leveldb::DBImpl::Writer*> >> = {
      _M_impl = {<std::allocator<leveldb::DBImpl::Writer*>> = {<__gnu_cxx::new_allocator<leveldb::DBImpl::Writer*>> = {<No data fields>}, <No data fields>}, _M_map = 0x7fffe8000be0, _M_map_size = 8, 
        _M_start = {_M_cur = 0x7fffd80dd968, _M_first = 0x7fffd80dd820, _M_last = 0x7fffd80dda20, 
          _M_node = 0x7fffe8000c10}, _M_finish = {_M_cur = 0x7fffd80dd970, _M_first = 0x7fffd80dd820, 
          _M_last = 0x7fffd80dda20, _M_node = 0x7fffe8000c10}}}, <No data fields>}, 
  tmp_batch_ = 0x7fffe8000c30, snapshots_ = {list_ = {<leveldb::Snapshot> = {
        _vptr.Snapshot = 0x7ffff4bc13b0 <vtable for leveldb::SnapshotImpl+16>}, number_ = 0, 
      prev_ = 0x7fffe8000a88, next_ = 0x7fffe8000a88, list_ = 0x0}}, pending_outputs_ = {_M_t = {
      _M_impl = {<std::allocator<std::_Rb_tree_node<unsigned long> >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<unsigned long> >> = {<No data fields>}, <No data fields>}, 
        _M_key_compare = {<std::binary_function<unsigned long, unsigned long, bool>> = {<No data fields>}, <No data fields>}, _M_header = {_M_color = std::_S_red, _M_parent = 0x7fffe416e620, 
          _M_left = 0x7fffe416e620, _M_right = 0x7fffe40f1640}, _M_node_count = 2}}}, 
  bg_compaction_scheduled_ = true, manual_compaction_ = 0x0, versions_ = 0x7fffe8002e20, bg_error_ = {
    state_ = 0x0}, consecutive_compaction_errors_ = 0, stats_ = {{micros = 19773873, bytes_read = 0, 
      bytes_written = 227771407}, {micros = 23647903, bytes_read = 185973394, 
      bytes_written = 196165495}, {micros = 24064573, bytes_read = 204562615, 
      bytes_written = 213360961}, {micros = 0, bytes_read = 0, bytes_written = 0}, {micros = 0, 
      bytes_read = 0, bytes_written = 0}, {micros = 0, bytes_read = 0, bytes_written = 0}, {
      micros = 0, bytes_read = 0, bytes_written = 0}}}
(gdb) f 4
#4  0x00007ffff494f3b0 in leveldown::Database::WriteBatchToDatabase (this=0xe9e5e0, 
    options=0x104575d0, batch=0x10457670) at ../src/database.cc:74
74    return db->Write(*options, batch);
(gdb) print *this
$9 = {<node::ObjectWrap> = {_vptr.ObjectWrap = 0x7ffff4bc0f10 <vtable for leveldown::Database+16>, 
    handle_ = {<v8::Handle<v8::Object>> = {val_ = 0xea7550}, <No data fields>}, refs_ = 0}, 
  db = 0x7fffe80008f0, location = 0xe99090 "test.db", currentIteratorId = 0, 
  pendingCloseWorker = 0x0, iterators = {_M_t = {
      _M_impl = {<std::allocator<std::_Rb_tree_node<std::pair<unsigned int const, leveldown::Iterator*> > >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<unsigned int const, leveldown::Iterator*> > >> = {<No data fields>}, <No data fields>}, 
        _M_key_compare = {<std::binary_function<unsigned int, unsigned int, bool>> = {<No data fields>}, <No data fields>}, _M_header = {_M_color = std::_S_red, _M_parent = 0x0, _M_left = 0xe9e620, 
          _M_right = 0xe9e620}, _M_node_count = 0}}}}
Owner

rvagg commented Oct 8, 2013

@rescrv re irc

(gdb) thread apply all bt

Thread 7 (Thread 0x7ffff48f4700 (LWP 13745)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff6f9d17c in _L_lock_982 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff6f9cfcb in __GI___pthread_mutex_lock (mutex=0x7fffe8000990) at pthread_mutex_lock.c:64
#3  0x00007ffff498e39a in leveldb::port::Mutex::Lock (this=0x7fffe8000990)
    at ../deps/leveldb/leveldb-1.14.0/port/port_posix.cc:26
#4  0x00007ffff4963ff9 in leveldb::DBImpl::OpenCompactionOutputFile (this=0x7fffe80008f0, 
    compact=0x7fffe42375a0) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:760
#5  0x00007ffff4964d47 in leveldb::DBImpl::DoCompactionWork (this=0x7fffe80008f0, 
    compact=0x7fffe42375a0) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:952
#6  0x00007ffff4963be6 in leveldb::DBImpl::BackgroundCompaction (this=0x7fffe80008f0)
    at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:703
#7  0x00007ffff49634dd in leveldb::DBImpl::BackgroundCall (this=0x7fffe80008f0)
    at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:623
#8  0x00007ffff4963454 in leveldb::DBImpl::BGWork (db=0x7fffe80008f0)
    at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:616
#9  0x00007ffff4990a2c in leveldb::(anonymous namespace)::PosixEnv::BGThread (this=0xee6160)
    at ../deps/leveldb/leveldb-1.14.0/util/env_posix.cc:692
#10 0x00007ffff499070f in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper (arg=0xee6160)
    at ../deps/leveldb/leveldb-1.14.0/util/env_posix.cc:629
#11 0x00007ffff6f9af8e in start_thread (arg=0x7ffff48f4700) at pthread_create.c:311
#12 0x00007ffff6cc4e1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 6 (Thread 0x7ffff53c7700 (LWP 13706)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00000000006ddb79 in uv_cond_wait (cond=cond@entry=0xe54000 <cond>, 
    mutex=mutex@entry=0xe53fc0 <mutex>) at ../deps/uv/src/unix/thread.c:322
#2  0x00000000006dde6f in worker (arg=arg@entry=0x0) at ../deps/uv/src/unix/threadpool.c:56
#3  0x00000000006d3a0f in uv__thread_start (ctx_v=<optimised out>) at ../deps/uv/src/uv-common.c:322
#4  0x00007ffff6f9af8e in start_thread (arg=0x7ffff53c7700) at pthread_create.c:311
#5  0x00007ffff6cc4e1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7ffff5bc8700 (LWP 13705)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00000000006ddb79 in uv_cond_wait (cond=cond@entry=0xe54000 <cond>, 
    mutex=mutex@entry=0xe53fc0 <mutex>) at ../deps/uv/src/unix/thread.c:322
#2  0x00000000006dde6f in worker (arg=arg@entry=0x0) at ../deps/uv/src/unix/threadpool.c:56
#3  0x00000000006d3a0f in uv__thread_start (ctx_v=<optimised out>) at ../deps/uv/src/uv-common.c:322
#4  0x00007ffff6f9af8e in start_thread (arg=0x7ffff5bc8700) at pthread_create.c:311
#5  0x00007ffff6cc4e1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7ffff63c9700 (LWP 13704)):
#0  0x00007ffff7786473 in std::string::size() const () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff49688f7 in leveldb::WriteBatchInternal::ByteSize (batch=0x10457670)
    at ../deps/leveldb/leveldb-1.14.0/db/write_batch_internal.h:36
#2  0x00007ffff496618c in leveldb::DBImpl::BuildBatchGroup (this=0x7fffe80008f0, 
    last_writer=0x7ffff63c8d30) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1228
#3  0x00007ffff4965dce in leveldb::DBImpl::Write (this=0x7fffe80008f0, options=..., 
    my_batch=0x10457670) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1177
#4  0x00007ffff494f3b0 in leveldown::Database::WriteBatchToDatabase (this=0xe9e5e0, 
    options=0x104575d0, batch=0x10457670) at ../src/database.cc:74
#5  0x00007ffff4948c06 in leveldown::Batch::Write (this=0x10467a20) at ../src/batch.cc:30
#6  0x00007ffff494d81c in leveldown::BatchWriteWorker::Execute (this=0x104eaed0)
    at ../src/batch_async.cc:25
#7  0x00007ffff494b748 in NanAsyncExecute (req=0x104eaed8) at ../node_modules/nan/nan.h:627
#8  0x00000000006ddebd in worker (arg=arg@entry=0x0) at ../deps/uv/src/unix/threadpool.c:74
#9  0x00000000006d3a0f in uv__thread_start (ctx_v=<optimised out>) at ../deps/uv/src/uv-common.c:322
#10 0x00007ffff6f9af8e in start_thread (arg=0x7ffff63c9700) at pthread_create.c:311
#11 0x00007ffff6cc4e1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7ffff6bca700 (LWP 13703)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00000000006ddb79 in uv_cond_wait (cond=cond@entry=0xe54000 <cond>, 
    mutex=mutex@entry=0xe53fc0 <mutex>) at ../deps/uv/src/unix/thread.c:322
#2  0x00000000006dde6f in worker (arg=arg@entry=0x0) at ../deps/uv/src/unix/threadpool.c:56
#3  0x00000000006d3a0f in uv__thread_start (ctx_v=<optimised out>) at ../deps/uv/src/uv-common.c:322
#4  0x00007ffff6f9af8e in start_thread (arg=0x7ffff6bca700) at pthread_create.c:311
#5  0x00007ffff6cc4e1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7ffff7ff7700 (LWP 13702)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x000000000092955d in v8::internal::LinuxSemaphore::Wait() ()
#2  0x00000000008496b2 in v8::internal::RuntimeProfiler::WaitForSomeIsolateToEnterJS() ()
#3  0x000000000092a80a in v8::internal::SignalSender::Run() ()
#4  0x000000000092976e in v8::internal::ThreadEntry(void*) ()
#5  0x00007ffff6f9af8e in start_thread (arg=0x7ffff7ff7700) at pthread_create.c:311
#6  0x00007ffff6cc4e1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7ffff7fd0740 (LWP 13698)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00000000006e23ea in uv__epoll_wait (epfd=<optimised out>, events=events@entry=0x7fffffffad10, 
    nevents=nevents@entry=1024, timeout=timeout@entry=-1) at ../deps/uv/src/unix/linux-syscalls.c:282
#2  0x00000000006e0cfe in uv__io_poll (loop=loop@entry=0xe53c60 <default_loop_struct>, timeout=-1)
    at ../deps/uv/src/unix/linux-core.c:160
#3  0x00000000006d4dd8 in uv_run (loop=0xe53c60 <default_loop_struct>, mode=<optimised out>)
    at ../deps/uv/src/unix/core.c:317
#4  0x0000000000596898 in node::Start(int, char**) ()
#5  0x00007ffff6becea5 in __libc_start_main (main=0x58c430 <main>, argc=2, ubp_av=0x7fffffffdef8, 
    init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>, stack_end=0x7fffffffdee8)
    at libc-start.c:260
#6  0x000000000058c7c5 in _start ()
Owner

rvagg commented Oct 8, 2013

Current theory: the segfault is a LevelDB problem; @rescrv will hopefully figure out what the issue is. HyperLevelDB doesn't suffer from the same problem, so a work-around is to switch to that, and HyperLevelDB is probably going to be significantly better for bulk-loading anyway @maxogden, so you might want to consider it a default for dat? I haven't released a leveldown-hyper to match the leveldown release last week yet because I've had some compile problems and no time to sort it out.

But that's not the whole story; there's possibly also another problem here causing memory issues and/or slowness. The fact that there have been problems post level@0.12 suggests that it's to do with the upgrade to NAN, which makes sense since that's been the most major work that's happened in leveldown, with lots of adjustments to fit in NAN & Node@0.11+ compatibility. We probably need some of @tjfontaine's MDB magic to figure out if we have a legitimate leak going on.

Owner

maxogden commented Oct 8, 2013

@rvagg sounds good. If you get some time to do a leveldown-hyper release I'd appreciate it!

Owner

rvagg commented Oct 9, 2013

@maxogden I have leveldown-hyper fixed up and released; you'll have to let me know if there are any compile errors (or even warnings actually, it'd be nice to get rid of those too) on OSX. It's compiling on Linux but it's a stab in the dark for me for OSX.

Use level-hyper in place of level; it should be a drop-in replacement so you don't have to mess around with plugging leveldown-hyper into levelup.

mcollina referenced this issue in mcollina/levelgraph Oct 14, 2013

Closed

Memleak #40

fwiw, I hit FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory trying to transform a 3-million-row database with ReadStream and WriteStream. The leak also appears to be there with batch(); I'm working around it for now by exec()ing another node once process.memoryUsage() exceeds a certain amount. But this sucks :(
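
The kind of guard being described can be sketched like this (the threshold and restart mechanism are arbitrary; here the process just exits and relies on a supervisor such as forever to start a fresh one):

var MEMORY_LIMIT = 1.2 * 1024 * 1024 * 1024; // e.g. 1.2 GB of rss

// call this periodically, e.g. every few thousand writes
var bailIfBloated = function () {
    if (process.memoryUsage().rss > MEMORY_LIMIT) {
        console.error('memory threshold exceeded, exiting so a fresh process can take over');
        process.exit(1);
    }
};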

Owner

rvagg commented Oct 29, 2013

@nornagon chained batch or array batch?

Array.

j

Owner

brycebaril commented Oct 29, 2013

You definitely want chained batch for that -- switching to it here https://github.com/brycebaril/level-bufferstreams/blob/master/write.js#L41 made a vast improvement on memory.

Owner

mcollina commented Oct 31, 2013

We are getting the same FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory in LevelGraph too. Sigh. The discussion is here: mcollina/levelgraph#40.

We get two different behaviours on levelup 0.12 and 0.17.

On 0.12 it grinds to a halt after writing around 1,200,000 records; the CPU skyrockets but the memory usage stays steady, then it writes some more and stops again. It continues doing so, on and off, for each of the following 300,000 records.

On 0.17 it keeps writing, but the memory usage explodes. In the heap I see my keys, values and callbacks passed to batch (I'm using level-writestream). It is writing to the database, as I can see it growing. As far as I can see, LevelDOWN is not freeing some stuff, or it is calling the batch callback too early.

Does anybody have any clue on this?

Owner

mcollina commented Nov 3, 2013

I think I made some serious progress. I was actually able to insert 12 million non-ordered pairs into levelup using only 300 MB of RAM. I used my branch of level-ws (Level/level-ws#1) that uses the chainable batch instead of the array batch. So, I used regular streams.

The same code segfaults on node v0.10.21, but it works perfectly on node v0.11.8.
It segfaults reliably at some point, but I cannot reproduce it on pure leveldown/levelup: as my data is coming from a file, I'm guessing something weird is going on.
Anybody with some serious C++-fu to take it from here? Or to guide me to the culprit.

Owner

mcollina commented Nov 3, 2013

This is the backtrace I am getting from gdb.

#1  leveldb::WriteBatchInternal::ByteSize () at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/c++/4.2.1/bits/basic_string.h:36
#2  0x0000000103911d54 in leveldb::DBImpl::BuildBatchGroup (this=0x100c00b10, last_writer=0x103b7ad88) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1228
#3  0x00000001039117d2 in leveldb::DBImpl::Write (this=0x100c00b10, options=@0x110e5c980, my_batch=<value temporarily unavailable, due to optimizations>) at ../deps/leveldb/leveldb-1.14.0/db/db_impl.cc:1177
#4  0x0000000103904a0b in leveldown::Database::WriteBatchToDatabase (this=<value temporarily unavailable, due to optimizations>, options=0x1018a8c00, batch=0xc) at ../src/database.cc:74
#5  0x0000000103902e7f in leveldown::Batch::Write (this=<value temporarily unavailable, due to optimizations>) at ../src/batch.cc:30
#6  0x00000001039043c5 in leveldown::BatchWriteWorker::Execute (this=0x1138e8300) at ../src/batch_async.cc:25
#7  0x000000010012c8d8 in worker (arg=<value temporarily unavailable, due to optimizations>) at ../deps/uv/src/unix/threadpool.c:74
#8  0x00000001001232e3 in uv__thread_start (ctx_v=<value temporarily unavailable, due to optimizations>) at ../deps/uv/src/uv-common.c:322
#9  0x00007fff88706899 in _pthread_body ()
#10 0x00007fff8870672a in _pthread_start ()
#11 0x00007fff8870afc9 in thread_start ()
Owner

ralphtheninja commented Nov 3, 2013

static size_t ByteSize(const WriteBatch* batch) {
  return batch->rep_.size();
}

There are basically two things that can go wrong here. Either the batch parameter is NULL, or it's pointing to dead memory, i.e. someone has deleted the pointer.

Owner

mcollina commented Nov 3, 2013

I think it's NULL:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
[Switching to process 37040 thread 0x2503]
std::string::size () at basic_string.h:605
605           size() const
Owner

mcollina commented Nov 3, 2013

OK, it's not NULL, as it is asserted, so it's dead memory.

I'm hitting this precondition:

// REQUIRES: Writer list must be non-empty
// REQUIRES: First writer must have a non-NULL batch
WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {
  assert(!writers_.empty());
  Writer* first = writers_.front();
  WriteBatch* result = first->batch;
  assert(result != NULL);

  size_t size = WriteBatchInternal::ByteSize(first->batch);

It seems the rep_ gets GCed after the write, but leveldb internals still need it.

Owner

ralphtheninja commented Nov 3, 2013

From node-leveldown/deps/leveldb/leveldb-1.14.0/db/db_impl.cc

// REQUIRES: Writer list must be non-empty
// REQUIRES: First writer must have a non-NULL batch
WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {
  assert(!writers_.empty());
  Writer* first = writers_.front();
  WriteBatch* result = first->batch;
  assert(result != NULL);

  size_t size = WriteBatchInternal::ByteSize(first->batch);
..
}

We need to make sure that's the case. Before ByteSize() is called above, add a logging line, e.g.:

  printf("batch pointer: %p\n", (void*)first->batch);
Owner

ralphtheninja commented Nov 3, 2013

Ok, nevermind my last comment. I wrote it before I saw your comment :)

Owner

ralphtheninja commented Nov 3, 2013

Do you know for sure that the asserts fire if something is wrong? Some builds only enable asserts in, e.g., debug mode.

Owner

mcollina commented Nov 3, 2013

Confirming it. The first->batch pointer is there.

mcollina referenced this issue in Level/leveldown Nov 3, 2013

Merged

Persistent handles not needed for batch #70

Owner

mcollina commented Nov 3, 2013

While debugging my segfault issue, I think I found the cause of the memory footprint of batches. Given my scarce C++ skills I might be wrong, but check out rvagg/node-leveldown#70.

Owner

rvagg commented Nov 11, 2013

@tjfontaine managed to track down a missing HandleScope in core. I'm not sure yet if this impacts us (I honestly don't understand what HandleWrap::Close & HandleWrap::OnClose are used for).

Until this makes it into a release, someone on this thread who can reproduce this problem could try patching Node source and running against that. See the patch here https://gist.github.com/tjfontaine/7394912 - basically you just need HandleScope scope above the MakeCallback in src/handle_wrap.cc.

Has anyone come up with a very simple script that will reproduce this problem reliably so the rest of us can dig?

Owner

rvagg commented Nov 11, 2013

oh, and that dtrace script in the gist might be helpful, but currently we're (naughtily) not using MakeCallback yet; we need to do this, particularly to properly support domains I believe, see rvagg/node-leveldown#62

I am working to make the script more generic, but the way v8 inlines a bunch of things makes it difficult to get the probe point to work on handle creation; I'm hoping to have a good solution for it soon.

Owner

rvagg commented Nov 11, 2013

and ftr, the last mem leak we squashed in leveldown was a missing HandleScope although I did a bit of an audit then to look for any more and nothing stood out

Owner

rvagg commented Nov 18, 2013

I believe this is fixed now in LevelDOWN@0.10.0 which comes with LevelUP/Level@0.18.0. Can everyone who's had the problem try it out and report back so I can close this?

I tried this out and I think it's resolved! Thanks everyone!

I loaded a 4 million document/2.6 GB leveldb with level 0.18.0 and it finished successfully. My node process started at 160 MB and finished at 240 MB. With previous versions my process would hit 1.6 GB after loading the initial 10% of my documents, slow to a crawl and then die shortly after that.

Wow! So fast! So little memory!

Works for me, so closing issue...

Thanks for the great fix!

F

fergiemcdowall added a commit to fergiemcdowall/search-index that referenced this issue Nov 18, 2013

fergiemcdowall fixed memory problems (Level/levelup#171) e613d11
Owner

mcollina commented Nov 18, 2013

Just to point out that I got an overall 25% increase in writing speed. I'm writing around 120,000 k-v pairs per second in LevelGraph.
It was really worth the work! 👯

Owner

rvagg commented Nov 18, 2013

I wrote details here if you want the gory innards of what's in leveldown@0.10 http://r.va.gg/2013/11/leveldown-v0.10-managing-gc-in-native-v8-programming.html
