Unknown memory leak #2978

Closed
o1pranay opened this issue Jul 24, 2019 · 20 comments

@o1pranay
Contributor

~$ coda daemon -peer hello-coda.o1test.net:8303
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
2019-07-24 21:47:21 UTC [Info] Starting Bootstrap Controller phase
2019-07-24 21:47:21 UTC [Info] Pausing block production while bootstrapping
2019-07-24 21:47:21 UTC [Info] Daemon ready. Clients can now connect
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 54.185.199.39, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 52.37.41.83, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:48:06 UTC [Error] RPC call error: $error, same error in machine format: $machine_error
	error: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
	machine_error: "((rpc_error(Bin_io_exn((location\"server-side rpc query un-bin-io'ing\")(exn(src/common.ml.Read_error\"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))"
2019-07-24 21:48:06 UTC [Faulty_peer] Banning peer "34.90.45.209" until "2019-07-25 21:48:06.168045Z" because it Trust_system.Actions.Violated_protocol (RPC call failed, reason: $exn)
	exn: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
2019-07-24 21:48:06 UTC [Info] Removing peer from peer set: [host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]
2019-07-24 21:48:06 UTC [Warn] Network error: ((rpc_error(Bin_io_exn((location"server-side rpc query un-bin-io'ing")(exn(src/common.ml.Read_error"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))
2019-07-24 21:48:06 UTC [Error] RPC call error: $error, same error in machine format: $machine_error
	error: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
	machine_error: "((rpc_error(Bin_io_exn((location\"server-side rpc query un-bin-io'ing\")(exn(src/common.ml.Read_error\"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))"
2019-07-24 21:48:06 UTC [Info] Removing peer from peer set: [host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]
2019-07-24 21:48:06 UTC [Warn] Network error: ((rpc_error(Bin_io_exn((location"server-side rpc query un-bin-io'ing")(exn(src/common.ml.Read_error"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))
2019-07-24 21:48:28 UTC [Info] Bootstrap state: complete.
2019-07-24 21:48:28 UTC [Info] Starting Transition Frontier Controller phase
@o1pranay o1pranay added this to the Testnet Beta milestone Jul 30, 2019
@o1pranay o1pranay changed the title [daemon][bug] RPC call error RPC call error Jul 30, 2019
@o1pranay
Contributor Author

User RomanS reports the same issue when starting daemon as snark worker:

RomanS (Today at 11:12 AM)
What's a good way to tell if my snark worker is running correctly?
I used the command from the documentation
I did see a "terminate called after throwing an instance of 'std::bad_alloc' / what(): std::bad_alloc" in my node logs when I launched with the snark worker command

@enolan
Contributor

enolan commented Jul 30, 2019

I was assuming that Linux always overcommitted, but apparently it's free to do so or not. According to RomanS on Discord, who I think is quoting the sysctl documentation: "0: The Linux kernel is free to overcommit memory (this is the default), a heuristic algorithm is applied to figure out if enough memory is available."

Something in libff or libsnark may be requesting more memory than the kernel thinks is available, or there may be some tricky memory corruption bug. We'd need a stack trace to debug further. Can we turn on coredumps and debugging symbols for our C++ dependencies?
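
For reference, here is one way to check the overcommit policy on an affected machine and to capture a C++ backtrace the next time the crash happens. This is only a sketch: the binary path and core file location are assumptions and will depend on how the daemon was installed.

# Check the kernel's overcommit policy (0 = heuristic, 1 = always overcommit, 2 = never).
$ cat /proc/sys/vm/overcommit_memory

# Allow core dumps in the shell that launches the daemon, and see where the
# kernel writes them (the pattern varies by distro).
$ ulimit -c unlimited
$ cat /proc/sys/kernel/core_pattern

# After a crash, load the core file in gdb to get a stack trace. The paths
# below are placeholders; substitute the installed coda binary and the actual core file.
$ gdb /usr/local/bin/coda ./core
(gdb) bt

Debug symbols for the C++ dependencies would still be needed for the backtrace to be readable.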

@emberian emberian changed the title RPC call error Prover allocation failure (std::bad_alloc) Aug 16, 2019
@emberian
Member

emberian commented Aug 16, 2019

And another from #3214, which shares with #3194 the fact that the daemon was running fine for a while. Crash report from #3205.

Are we leaking memory, leading to an eventual failed alloc?

@emberian
Member

emberian commented Aug 16, 2019

In #3196, @AlexanderYudin reports seeing this repeatedly with two different crash reports. How much RAM does that proposer machine have, @AlexanderYudin?

@AlexanderYudin

free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        4.4G        2.3G        912K        1.0G        3.1G
Swap:            0B          0B          0B


@enolan
Contributor

enolan commented Aug 16, 2019

Maybe? I do see a slight upward slope here: https://search-testnet-djtfjlry3tituhytrovphxtleu.us-west-2.es.amazonaws.com/_plugin/kibana/goto/27b3550924c2090e8b606743b622d4e5 but we don't have data going back far enough to be confident. Memory usage increases from 5.34GB to 5.55GB over the course of ~3 hours, i.e. roughly 70MB per hour.

[Screenshot from 2019-08-15 18-59-38: memory usage over time]

@jkrauska can we have better stats?

@imeckler
Member

@jkrauska to add per-process memory stats, and then we will try to narrow in on the leak. The running theory is that it's in the parallel scan state.

@imeckler imeckler moved this from Discuss to Next release in Protocol Prioritization Aug 16, 2019
@AlexanderYudin

@cmr After increasing the RAM, the node works stably

@jkrauska
Contributor

Seeing a memory leak in the parent OCaml process.

The Coda process was stable at 11% of memory until around 5:30, when it jumped to 20% and then 35%. It's now up to 45%.
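
In case it helps reproduce this elsewhere, a rough way to sample the daemon's resident memory over time; the pgrep pattern 'coda daemon' is an assumption about the process name:

# Append a timestamped RSS sample (in KB) for the daemon once a minute.
$ while true; do
    echo "$(date -u +%FT%TZ) $(ps -o rss= -p "$(pgrep -f 'coda daemon' | head -n1)")"
    sleep 60
  done >> coda-rss.log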

@bkase bkase moved this from Next release to This release in Protocol Prioritization Aug 23, 2019
@bkase bkase moved this from This release to Next release in Protocol Prioritization Aug 23, 2019
@imeckler imeckler changed the title Prover allocation failure (std::bad_alloc) Unknown memory leak Aug 30, 2019
@imeckler
Member

@nholland94 tried several approaches and ran a bunch of nodes locally. Multiple nodes need to be running for the leak to reproduce.

@nholland94
Member

I have not been able to narrow down the exact cause of this, though I was able to rule out several candidates. It is not any of the following:

  • Scan state (at least internally, still possible scan state objects are leaking)
  • Transition frontier (breadcrumbs and nodes do not leak from internal structure)
  • Transition frontier extensions (including snark pool refcount)
  • Transaction pool
  • Coda_subscriptions and related components

This bug does not reproduce on a single proposer that is not connected to a network. It also does not reproduce on @yourbuddyconner's container-deployed instances that are scraped by Prometheus. The strongest candidates to look at next are items in the networking stack, including get client status.

@psteckler
Member

Did we see monotonic increases in the OCaml heap size, as shown by the every-10-minutes info log entry? If so, that means the leak is in OCaml, not in C++.

@psteckler
Member

Could we build a node with spacetime and run it on AWS?

@mrmr1993 says spacetime may fill up a dev machine's disk; maybe on AWS it's OK.
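
Roughly what that would look like, assuming a Spacetime-enabled opam compiler variant is available for the version we build against; the switch name, snapshot interval, and prof_spacetime invocation below are assumptions and may need adjusting:

# Build against a Spacetime-enabled compiler variant, then rebuild coda as usual.
$ opam switch create 4.07.1+spacetime
$ eval $(opam env)

# Run the daemon with periodic heap snapshots (interval in milliseconds);
# the runtime writes a spacetime-<pid> profile in the working directory.
$ OCAML_SPACETIME_INTERVAL=10000 coda daemon -peer hello-coda.o1test.net:8303

# Inspect the profile in a browser with prof_spacetime.
$ prof_spacetime serve -p 8080 spacetime-<pid>

The profile files can get large quickly, which is the disk concern above.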

@imeckler imeckler moved this from Next release to This release in Protocol Prioritization Sep 19, 2019
@imeckler
Member

@psteckler and @jkrauska will collect info with spacetime on the next testnet

@bkase
Member

bkase commented Oct 3, 2019

@enolan has more information

@bkase
Member

bkase commented Oct 3, 2019

It doesn't seem to be in the OCaml heap. It is possible that it is curve points (which are allocated in C++).
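
One way to check whether the growth is on the native side would be to run a node under a C/C++ heap profiler for a while. A sketch using Valgrind's massif; note it slows the daemon down substantially, and the peer argument is just the one from the original report:

# Profile malloc/new allocations. If C++ objects such as curve points are
# accumulating, the growth should show up attributed to libsnark/libff call sites.
$ valgrind --tool=massif --time-unit=ms coda daemon -peer hello-coda.o1test.net:8303

# massif writes massif.out.<pid> in the working directory; summarize it with:
$ ms_print massif.out.<pid>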

@imeckler
Member

imeckler commented Oct 17, 2019

@enolan fixed the jemalloc PR and @nholland94 merged it into release. We'll see how that goes.

@imeckler
Member

Things look relatively flat according to @jkrauska

@ghost-not-in-the-shell suspects this may be because the scan state is smaller than last week's

@bkase bkase moved this from This release to Done in Protocol Prioritization Oct 31, 2019
@enolan
Contributor

enolan commented Dec 12, 2019

Given the memory improvements, and that nobody has reported this in the last couple of months, I'm closing this.

@enolan enolan closed this as completed Dec 12, 2019