Unknown memory leak #2978
Comments
User RomanS reports the same issue when starting the daemon as a snark worker.
I was assuming that Linux always overcommits, but apparently it's free to do so or not, according to RomanS on Discord.
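For reference, whether a large allocation fails up front depends on the kernel's overcommit policy, which is exposed at /proc/sys/vm/overcommit_memory. A minimal sketch (in OCaml, not from the thread) of reading that setting on a Linux host:

```ocaml
(* Sketch only: read the kernel's overcommit policy.
   0 = heuristic overcommit (the default), 1 = always overcommit,
   2 = strict accounting (allocations can fail immediately). *)
let () =
  let ic = open_in "/proc/sys/vm/overcommit_memory" in
  let mode = input_line ic in
  close_in ic;
  Printf.printf "vm.overcommit_memory = %s\n" mode
```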
From #3194, we have another crash report: https://github.com/CodaProtocol/coda/files/3502612/coda_crash_report_2019-08-14_13-43-41.739273.tar.zip
And another from #3214, which shares with #3194 the fact that the daemon had been running fine for a while. There is also a crash report from #3205. Are we leaking memory, leading to an eventual failed alloc?
In #3196, @AlexanderYudin reports seeing this repeatedly with two different crash reports. How much RAM does that proposer machine have, @AlexanderYudin?
`free -h`
Maybe? I do see a slight upward slope here: https://search-testnet-djtfjlry3tituhytrovphxtleu.us-west-2.es.amazonaws.com/_plugin/kibana/goto/27b3550924c2090e8b606743b622d4e5 but we don't have data going back far enough to be confident. Memory usage increases from 5.34 GB to 5.55 GB over the course of ~3 hours. @jkrauska can we have better stats?
@jkrauska to add per-process memory stats, and then we will try to narrow in on the leak. The running theory is that it's in the parallel scan state.
@cmr After increasing the RAM, the node works stably |
Seeing a memory leak in the parent OCaml process. The Coda process was stable at 11% until around 5:30, when it jumped to 20% and then 35%. It's now up to 45%.
@nholland94 tried several things and ran a bunch of nodes locally. There need to be multiple nodes for the leak to reproduce.
I have not been able to narrow down the exact cause of this, though I was able to narrow down things that it is not. It is not any of the following:

- Single proposers not connected to a network: the bug does not reproduce there.
- @yourbuddyconner's container-deployed instances that are scraped by Prometheus: the bug does not reproduce there either.

Strong next things to look at are items in the networking stack, including get client status.
Did we see monotonic increases in the OCaml heap size, as shown by the every-10-minutes info log entry? If so, that means the leak is in OCaml, not in C++. |
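The every-10-minutes heap-size entry would come from something like the following (a rough sketch, not the daemon's actual logging code): Gc.quick_stat reports the size of the OCaml major heap, so a monotonic rise in this number points at OCaml allocations rather than C++ ones.

```ocaml
(* Illustrative sketch: periodically log the size of the OCaml major heap.
   Requires the unix library for Unix.sleep. A monotonically growing value
   here would place the leak on the OCaml side. *)
let log_ocaml_heap () =
  let s = Gc.quick_stat () in
  let bytes = s.Gc.heap_words * (Sys.word_size / 8) in
  Printf.eprintf "ocaml major heap: %d words (%d bytes)\n%!" s.Gc.heap_words bytes

let () =
  while true do
    log_ocaml_heap ();
    Unix.sleep (10 * 60)  (* every 10 minutes, matching the info log cadence *)
  done
```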
Could we build a node with Spacetime and run it on AWS? @mrmr1993 says Spacetime may fill up a dev machine's disk; maybe on AWS it would be OK.
@psteckler and @jkrauska will collect info with Spacetime on the next testnet.
@enolan has more information |
It doesn't seem to be in the OCaml heap. It is possibly the curve points (which are allocated in C++).
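To illustrate why a C++-side leak would not show up in OCaml heap statistics, here is a small sketch (not from the codebase) using a Bigarray as a stand-in for memory that is malloc'd outside the OCaml heap, much as curve points allocated by the C++ bindings would be:

```ocaml
(* Illustrative sketch: allocating ~80 MB outside the OCaml heap (a Bigarray's
   data is malloc'd) barely changes Gc heap statistics, so the OCaml heap can
   look flat while the process RSS keeps growing. *)
let heap_words () = (Gc.quick_stat ()).Gc.heap_words

let () =
  let before = heap_words () in
  let _big = Bigarray.Array1.create Bigarray.float64 Bigarray.c_layout 10_000_000 in
  let after = heap_words () in
  Printf.printf "heap words before: %d, after: %d\n" before after
```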
@enolan fixed the jemalloc PR and @nholland94 merged it into release. We'll see how that goes.
Things look relatively flat according to @jkrauska. @ghost-not-in-the-shell suspects this may be because the scan state is smaller than last week's.
Given the memory improvements, and that nobody has reported this in the last couple of months, I'm closing this.