
Unknown memory leak #2978

Open · o1pranay opened this issue Jul 24, 2019 · 14 comments

@o1pranay (Contributor) commented Jul 24, 2019

~$ coda daemon -peer hello-coda.o1test.net:8303
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
2019-07-24 21:47:21 UTC [Info] Starting Bootstrap Controller phase
2019-07-24 21:47:21 UTC [Info] Pausing block production while bootstrapping
2019-07-24 21:47:21 UTC [Info] Daemon ready. Clients can now connect
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 54.185.199.39, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 52.37.41.83, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:47:27 UTC [Info] Connected to some peers [[host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]]
2019-07-24 21:48:06 UTC [Error] RPC call error: $error, same error in machine format: $machine_error
	error: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
	machine_error: "((rpc_error(Bin_io_exn((location\"server-side rpc query un-bin-io'ing\")(exn(src/common.ml.Read_error\"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))"
2019-07-24 21:48:06 UTC [Faulty_peer] Banning peer "34.90.45.209" until "2019-07-25 21:48:06.168045Z" because it Trust_system.Actions.Violated_protocol (RPC call failed, reason: $exn)
	exn: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
2019-07-24 21:48:06 UTC [Info] Removing peer from peer set: [host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]
2019-07-24 21:48:06 UTC [Warn] Network error: ((rpc_error(Bin_io_exn((location"server-side rpc query un-bin-io'ing")(exn(src/common.ml.Read_error"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))
2019-07-24 21:48:06 UTC [Error] RPC call error: $error, same error in machine format: $machine_error
	error: "((rpc_error\n  (Bin_io_exn\n   ((location \"server-side rpc query un-bin-io'ing\")\n    (exn\n     (src/common.ml.Read_error\n      \"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\" 40)))))\n (connection_description <created-directly>)\n (rpc_tag answer_sync_ledger_query) (rpc_version 1))"
	machine_error: "((rpc_error(Bin_io_exn((location\"server-side rpc query un-bin-io'ing\")(exn(src/common.ml.Read_error\"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query\"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))"
2019-07-24 21:48:06 UTC [Info] Removing peer from peer set: [host : 34.90.45.209, discovery_port : 8303, communication_port : 8302]
2019-07-24 21:48:06 UTC [Warn] Network error: ((rpc_error(Bin_io_exn((location"server-side rpc query un-bin-io'ing")(exn(src/common.ml.Read_error"Sum_tag / lib/syncable_ledger/syncable_ledger.ml.Make.query"40)))))(connection_description <created-directly>)(rpc_tag answer_sync_ledger_query)(rpc_version 1))
2019-07-24 21:48:28 UTC [Info] Bootstrap state: complete.
2019-07-24 21:48:28 UTC [Info] Starting Transition Frontier Controller phase

@o1pranay added this to the Testnet Beta milestone Jul 30, 2019

@o1pranay changed the title from "[daemon][bug] RPC call error" to "RPC call error" Jul 30, 2019

@o1pranay (Contributor, Author) commented Jul 30, 2019

User RomanS reports the same issue when starting the daemon as a snark worker:

RomanS, Today at 11:12 AM
What's a good way to tell if my snark worker is running correctly?
I used the command from the documentation
I did see the following in my node logs when I launched with the snark worker command:

    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc

@enolan (Contributor) commented Jul 30, 2019

I was assuming that Linux always overcommitted, but apparently it's free to do so or not. According to RomanS on Discord who I think is quoting the sysctl documentation: "0: The Linux kernel is free to overcommit memory (this is the default), a heuristic algorithm is applied to figure out if enough memory is available."

Something in libff or libsnark may be requesting more memory than the kernel thinks is available, or there may be some tricky memory corruption bug. We'd need a stack trace to debug further. Can we turn on coredumps and debugging symbols for our C++ dependencies?
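
For reference, a rough sketch of how one could check the overcommit policy and get core dumps out of a crashing daemon on Linux. The sysctl/ulimit commands are standard; the coda invocation is the one from the original report, and getting debug symbols into our C++ dependencies is a separate build question:

    # Check the current overcommit policy: 0 = heuristic (default), 1 = always, 2 = never
    cat /proc/sys/vm/overcommit_memory

    # Allow unlimited-size core files in the shell that launches the daemon
    ulimit -c unlimited

    # See where the kernel writes core files (may be piped to systemd-coredump or apport)
    cat /proc/sys/kernel/core_pattern

    # Optionally write cores to a predictable location instead
    sudo sysctl -w kernel.core_pattern=/tmp/core.%e.%p

    # Reproduce; after a crash, the core plus debug symbols would give us a stack trace
    coda daemon -peer hello-coda.o1test.net:8303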

@cmr changed the title from "RPC call error" to "Prover allocation failure (std::bad_alloc)" Aug 16, 2019

@cmr (Contributor) commented Aug 16, 2019

And another from #3214, which shares with #3194 the fact that the daemon was running fine for a while. Crash report from #3205.

Are we leaking memory, leading to an eventual failed alloc?

@cmr added this to Discuss in Work Prioritization - Protocol via automation Aug 16, 2019

@cmr (Contributor) commented Aug 16, 2019

In #3196, @AlexanderYudin reports seeing this repeatedly with two different crash reports. How much RAM does that proposer machine have, @AlexanderYudin?

@AlexanderYudin commented Aug 16, 2019

free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        4.4G        2.3G        912K        1.0G        3.1G
Swap:            0B          0B          0B

@enolan (Contributor) commented Aug 16, 2019

Maybe? I do see a slight upward slope here: https://search-testnet-djtfjlry3tituhytrovphxtleu.us-west-2.es.amazonaws.com/_plugin/kibana/goto/27b3550924c2090e8b606743b622d4e5 but we don't have data going back far enough to be confident. Memory usage increases from 5.34GB to 5.55GB over the course of ~3 hours.

[Screenshot from 2019-08-15 18-59-38]

@jkrauska can we have better stats?

@imeckler (Contributor) commented Aug 16, 2019

@jkrauska to add per-process memory stats, and then we will try to narrow in on the leak. Running theory is that it's in the parallel scan state.
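
Until the dashboards have per-process numbers, a rough way to collect them by hand is a sampling loop like the sketch below. Assumptions: the daemon and its subprocesses show up under the process name coda, and coda-memory.log is just an arbitrary output file:

    # Append a timestamped RSS/VSZ reading (in kB) for every coda process once a minute
    while true; do
      ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
      for pid in $(pgrep coda); do
        echo "$ts pid=$pid $(ps -o rss=,vsz=,comm= -p "$pid")" >> coda-memory.log
      done
      sleep 60
    done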

@imeckler moved this from Discuss to Next release in Work Prioritization - Protocol Aug 16, 2019

@AlexanderYudin commented Aug 17, 2019

@cmr After increasing the RAM, the node works stably.

@jkrauska (Contributor) commented Aug 17, 2019

Seeing a memory leak in the parent OCaml process.

The Coda process was stable at 11% memory until around 5:30, when it jumped to 20% and then 35%. It's now up to 45%.

@bkase moved this from Next release to This release in Work Prioritization - Protocol Aug 23, 2019

@bkase moved this from This release to Next release in Work Prioritization - Protocol Aug 23, 2019

@imeckler changed the title from "Prover allocation failure (std::bad_alloc)" to "Unknown memory leak" Aug 30, 2019

@imeckler (Contributor) commented Aug 30, 2019

@nholland94 tried a few things and ran a bunch of nodes locally. The leak only reproduces when multiple nodes are running.

@nholland94 (Contributor) commented Aug 30, 2019

I have not been able to narrow down the exact cause of this, but I was able to rule out a number of things. It is not any of the following:

  • Scan state (at least internally, still possible scan state objects are leaking)
  • Transition frontier (breadcrumbs and nodes do not leak from internal structure)
  • Transition frontier extensions (including snark pool refcount)
  • Transaction pool
  • Coda_subscriptions and related components

This bug does not reproduce on single proposers that are not connected to a network. It also does not reproduce on @yourbuddyconner's container-deployed instances that are scraped by Prometheus. The strongest next things to look at are items in the networking stack, including get client status.

@psteckler (Contributor) commented Sep 10, 2019

Did we see monotonic increases in the OCaml heap size, as shown by the every-10-minutes info log entry? If so, that means the leak is in OCaml, not in C++.

@psteckler (Contributor) commented Sep 11, 2019

Could we build a node with spacetime and run it on AWS?

@mrmr1993 says spacetime may fill up a dev machine's disk; maybe on AWS it would be OK.
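
For reference, a rough sketch of what a spacetime run could look like, assuming we can build the daemon against a spacetime-enabled compiler switch. The switch name, snapshot interval, and paths below are illustrative, and the snapshot files are what can fill up a disk:

    # Use a spacetime variant of the compiler (pick the variant matching our pinned version)
    opam switch create 4.07.1+spacetime
    eval $(opam env)
    # ... rebuild coda against this switch ...

    # Run with periodic heap snapshots; OCAML_SPACETIME_INTERVAL is in milliseconds,
    # and snapshots land in OCAML_SPACETIME_SNAPSHOT_DIR as spacetime-<pid> files
    mkdir -p /var/tmp/spacetime
    OCAML_SPACETIME_INTERVAL=60000 \
    OCAML_SPACETIME_SNAPSHOT_DIR=/var/tmp/spacetime \
      coda daemon -peer hello-coda.o1test.net:8303

    # Inspect the resulting profile afterwards
    opam install prof_spacetime
    prof_spacetime serve /var/tmp/spacetime/spacetime-<pid>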
