Unknown memory leak #2978

Closed
o1pranay opened this issue Jul 24, 2019 · 20 comments

@o1pranay
Contributor

~$ coda daemon -peer hello-coda.o1test.net:8303
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
2019-07-24 21:47:21 UTC [Info] Starting Bootstrap Controller phase
2019-07-24 21:47:21 UTC [Info] Pausing block production while bootstrapping
2019-07-24 21:47:21 UTC [Info] Daemon ready. Clients can now connect
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 54.185.199.39, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 52.37.41.83, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:48:06 UTC [Error] RPC call error: $error, same error in machine format: $machine_error
	error: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
	machine_error: "((rpc_error(Bin_io_exn((location\"server-side rpc query un-bin-io'ing\")(exn(src/common.ml.Read_error\"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))"
2019-07-24 21:48:06 UTC [Faulty_peer] Banning peer "34.90.45.209" until "2019-07-25 21:48:06.168045Z" because it Trust_system.Actions.Violated_protocol (RPC call failed, reason: $exn)
	exn: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
2019-07-24 21:48:06 UTC [Info] Removing peer from peer set: [host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]
2019-07-24 21:48:06 UTC [Warn] Network error: ((rpc_error(Bin_io_exn((location"server-side rpc query un-bin-io'ing")(exn(src/common.ml.Read_error"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))
2019-07-24 21:48:06 UTC [Error] RPC call error: $error, same error in machine format: $machine_error
	error: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
	machine_error: "((rpc_error(Bin_io_exn((location\"server-side rpc query un-bin-io'ing\")(exn(src/common.ml.Read_error\"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))"
2019-07-24 21:48:06 UTC [Info] Removing peer from peer set: [host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]
2019-07-24 21:48:06 UTC [Warn] Network error: ((rpc_error(Bin_io_exn((location"server-side rpc query un-bin-io'ing")(exn(src/common.ml.Read_error"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))
2019-07-24 21:48:28 UTC [Info] Bootstrap state: complete.
2019-07-24 21:48:28 UTC [Info] Starting Transition Frontier Controller phase
@o1pranay o1pranay added this to the Testnet Beta milestone Jul 30, 2019
@o1pranay o1pranay changed the title [daemon][bug] RPC call error RPC call error Jul 30, 2019
@o1pranay
Contributor Author

User RomanS reports the same issue when starting daemon as snark worker:

RomanS (Today at 11:12 AM)
What's a good way to tell if my snark worker is running correctly?
I used the command from the documentation
I did see a "terminate called after throwing an instance of 'std::bad_alloc' / what(): std::bad_alloc" in my node logs when I launched with the snark worker command

@enolan
Contributor

enolan commented Jul 30, 2019

I was assuming that Linux always overcommitted, but apparently it's free to do so or not. According to RomanS on Discord, who I think is quoting the sysctl documentation: "0: The Linux kernel is free to overcommit memory (this is the default), a heuristic algorithm is applied to figure out if enough memory is available."

Something in libff or libsnark may be requesting more memory than the kernel thinks is available, or there may be some tricky memory corruption bug. We'd need a stack trace to debug further. Can we turn on coredumps and debugging symbols for our C++ dependencies?
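
For reference, here is one way to check the overcommit policy on an affected machine and to capture a C++ backtrace the next time the crash happens. This is only a sketch: the binary path and core file location are assumptions and will depend on how the daemon was installed.

# Check the kernel's overcommit policy (0 = heuristic, 1 = always overcommit, 2 = never).
$ cat /proc/sys/vm/overcommit_memory

# Allow core dumps in the shell that launches the daemon, and see where the
# kernel writes them (the pattern varies by distro).
$ ulimit -c unlimited
$ cat /proc/sys/kernel/core_pattern

# After a crash, load the core file in gdb to get a stack trace. The paths
# below are placeholders; substitute the installed coda binary and the actual core file.
$ gdb /usr/local/bin/coda ./core
(gdb) bt

Debug symbols for the C++ dependencies would still be needed for the backtrace to be readable.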

@emberian emberian changed the title RPC call error Prover allocation failure (std::bad_alloc) Aug 16, 2019
@emberian
Member

emberian commented Aug 16, 2019

And another from #3214, which shares with #3194 the fact that the daemon was running fine for a while. Crash report from #3205.

Are we leaking memory, leading to an eventual failed alloc?

@emberian
Member

emberian commented Aug 16, 2019

In #3196, @AlexanderYudin reports seeing this repeatedly with two different crash reports. How much RAM does that proposer machine have, @AlexanderYudin?

@AlexanderYudin

free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        4.4G        2.3G        912K        1.0G        3.1G
Swap:            0B          0B          0B


@enolan
Contributor

enolan commented Aug 16, 2019

Maybe? I do see a slight upward slope here: https://search-testnet-djtfjlry3tituhytrovphxtleu.us-west-2.es.amazonaws.com/_plugin/kibana/goto/27b3550924c2090e8b606743b622d4e5 but we don't have data going back far enough to be confident. Memory usage increases from 5.34GB to 5.55GB over the course of ~3 hours, i.e. roughly 70MB per hour.

[Screenshot from 2019-08-15 18-59-38: memory usage over time]

@jkrauska can we have better stats?

@imeckler
Member

@jkrauska to add per-process memory stats, and then we will try to narrow in on the leak. The running theory is that it's in the parallel scan state.

@imeckler imeckler moved this from Discuss to Next release in Protocol Prioritization Aug 16, 2019
@AlexanderYudin

@cmr After increasing the RAM, the node works stably

@jkrauska
Contributor

Seeing a memory leak in the parent OCaml process.

The Coda process was stable at 11% of memory until around 5:30, when it jumped to 20% and then 35%. It's now up to 45%.
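
In case it helps reproduce this elsewhere, a rough way to sample the daemon's resident memory over time; the pgrep pattern 'coda daemon' is an assumption about the process name:

# Append a timestamped RSS sample (in KB) for the daemon once a minute.
$ while true; do
    echo "$(date -u +%FT%TZ) $(ps -o rss= -p "$(pgrep -f 'coda daemon' | head -n1)")"
    sleep 60
  done >> coda-rss.log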

@bkase bkase moved this from Next release to This release in Protocol Prioritization Aug 23, 2019
@bkase bkase moved this from This release to Next release in Protocol Prioritization Aug 23, 2019
@imeckler imeckler changed the title Prover allocation failure (std::bad_alloc) Unknown memory leak Aug 30, 2019
@imeckler
Member

@nholland94 tried several approaches and ran a bunch of nodes locally. Multiple nodes need to be running for the leak to reproduce.

@nholland94
Member

I have not been able to narrow down the exact cause of this, though I was able to rule out several candidates. It is not any of the following:

  • Scan state (at least internally, still possible scan state objects are leaking)
  • Transition frontier (breadcrumbs and nodes do not leak from internal structure)
  • Transition frontier extensions (including snark pool refcount)
  • Transaction pool
  • Coda_subscriptions and related components

This bug does not reproduce on a single proposer that is not connected to a network. It also does not reproduce on @yourbuddyconner's container-deployed instances that are scraped by Prometheus. The strongest candidates to look at next are items in the networking stack, including get client status.

@psteckler
Member

Did we see monotonic increases in the OCaml heap size, as shown by the every-10-minutes info log entry? If so, that means the leak is in OCaml, not in C++.

@psteckler
Member

Could we build a node with spacetime and run it on AWS?

@mrmr1993 says spacetime may fill up a dev machine's disk; maybe on AWS it's OK.
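
Roughly what that would look like, assuming a Spacetime-enabled opam compiler variant is available for the version we build against; the switch name, snapshot interval, and prof_spacetime invocation below are assumptions and may need adjusting:

# Build against a Spacetime-enabled compiler variant, then rebuild coda as usual.
$ opam switch create 4.07.1+spacetime
$ eval $(opam env)

# Run the daemon with periodic heap snapshots (interval in milliseconds);
# the runtime writes a spacetime-<pid> profile in the working directory.
$ OCAML_SPACETIME_INTERVAL=10000 coda daemon -peer hello-coda.o1test.net:8303

# Inspect the profile in a browser with prof_spacetime.
$ prof_spacetime serve -p 8080 spacetime-<pid>

The profile files can get large quickly, which is the disk concern above.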

@imeckler imeckler moved this from Next release to This release in Protocol Prioritization Sep 19, 2019
@imeckler
Member

@psteckler and @jkrauska will collect info with spacetime on the next testnet

@bkase
Member

bkase commented Oct 3, 2019

@enolan has more information

@bkase
Member

bkase commented Oct 3, 2019

It doesn't seem to be in the OCaml heap. It is possible that it is curve points (which are allocated in C++).
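
One way to check whether the growth is on the native side would be to run a node under a C/C++ heap profiler for a while. A sketch using Valgrind's massif; note it slows the daemon down substantially, and the peer argument is just the one from the original report:

# Profile malloc/new allocations. If C++ objects such as curve points are
# accumulating, the growth should show up attributed to libsnark/libff call sites.
$ valgrind --tool=massif --time-unit=ms coda daemon -peer hello-coda.o1test.net:8303

# massif writes massif.out.<pid> in the working directory; summarize it with:
$ ms_print massif.out.<pid>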

@imeckler
Member

imeckler commented Oct 17, 2019

@enolan fixed the jemalloc PR and @nholland94 merged it into release. We'll see how that goes.

@imeckler
Member

Things look relatively flat according to @jkrauska

@ghost-not-in-the-shell suspects this may be because the scan state is smaller than last week's

@bkase bkase moved this from This release to Done in Protocol Prioritization Oct 31, 2019
@enolan
Contributor

enolan commented Dec 12, 2019

Given the memory improvements, and that nobody has reported this in the last couple of months, I'm closing this.

@enolan enolan closed this as completed Dec 12, 2019