Improve elements memory usage #1190
Conversation
Force-pushed eb1f32d to 4613a1f
Fixed a bug I introduced when refactoring the "Don't let header syncing..." commit last night. Don't code after bedtime, kids.
Oops, tests are failing due to a lint error.
Force-pushed 4613a1f to 1f9ba3f
It turns out that it wasn't actually documentation, it was "I mistyped the name of the flag in one place". I swear I had fixed that; I may have lost it in a rebase. Fixed now. (Also, apparently nothing except CI runs the lint checks.)
Hmmm, CI suggests I have broken signet. Will take a look.
Nice! I'm busy doing a rough test measuring memory usage during IBD 👍
Ooh, the undefined behavior sanitizer caught me out.

I could fix the warning with a cast, but in practice I should probably have a less-stupid fallback value for non-dynafed chains (the default value for the dynafed period is very large).
Ran into an issue: IBD crashed.
Second restart went fine; maybe a race with a new block arriving during startup?
It seems I spoke too soon :/
Restarted again with debug; crashed again. Debug log file attached.
I added an arg to the function that is called here: https://github.com/gwillen/elements/blob/fix-elements-memory-usage/src/net_processing.cpp#L3170
Thanks for your work on this, it's incredibly helpful. It seems that people are sending us headers we didn't ask for. In any case, this is a good reminder of something I had meant to do but forgot, which is to have some kind of reasonable handling of unsolicited headers.
…something reasonable.
Force-pushed 1afc9b0 to 29345e9
Rebased to bring in new changes from master, to fix a merge conflict.
utACK 29345e9
I'm busy testing IBD and restarting again.
Left some nits and a comment about the CI fuzzer issue.
Co-authored-by: Byron Hambly <byron@hambly.dev>
Force-pushed 408ab19 to b9b7f65
I had started a new node for IBD, and had to interrupt it before letting it run overnight; it appears to be stuck. Judging from the log file before the first shutdown, it was already stuck once it reached this point.

(The attached log is compressed with xz; I had to add a .zip extension to make GitHub happy.) I will try another sync from scratch with no interruptions.
Another IBD stuck, this time at 2117769. So in this case the headers are 41543 ahead of the blocks:

```json
{
  "chain": "liquidv1",
  "blocks": 2117769,
  "headers": 2159312,
  "bestblockhash": "7a81d57330179218a0106b0c9421aadf18a6a61f0392c191859ba8dc5230c2c5",
  "mediantime": 1669830008,
  "verificationprogress": 0.976,
  "initialblockdownload": true
}
```

From what I can tell, moving the "enough headers" check up in fba37b7 means that we are discarding the headers but then not continuing to sync the outstanding blocks.
```cpp
@@ -307,12 +307,47 @@ bool CBlockTreeDB::WritePAKList(const std::vector<std::vector<unsigned char> >&
    return Write(std::make_pair(DB_PAK, uint256S("1")), offline_list) && Write(std::make_pair(DB_PAK, uint256S("2")), online_list) && Write(std::make_pair(DB_PAK, uint256S("3")), reject);
}

bool CBlockTreeDB::LoadBlockIndexGuts(const Consensus::Params& consensusParams, std::function<CBlockIndex*(const uint256&)> insertBlockIndex)
bool CBlockTreeDB::WalkBlockIndexGutsForMaxHeight(int* nHeight) {
```
This is not guaranteed to be the max height, correct? This is our best guess of the max height, from a random sample of 10k blocks?
Could that be placed into the docstring for this method, since it is public?
Also, forgive my ignorance about the bitcoin / elements codebase, but is there no way to get the max height here? Certainly that is stored somewhere, since certain RPCs produce it.
Is it just that accessing it here would require some major refactors which are overkill for this PR? Or... do we actually have to load the block index first in order to get the max height?
We have to load the block index first to get the actual max height, yeah, it's a catch-22 here. This is one of the very first stages of startup. And the headers in the index are stored on disk in hash order, not height order, so we have no idea what the max height is until after we've loaded all of them. (But since hash order is random with respect to height, loading a few gives us a really good guess.)
Also note that this really only tells us the max header height, we don't learn what blocks we actually have until even later in the startup process.
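The sampling argument above can be sketched in a few lines. This is a hypothetical standalone illustration (the names `EstimateMaxHeight` and the 10k/2M figures are assumptions for the example, not the actual `WalkBlockIndexGutsForMaxHeight` code): because the on-disk order is random with respect to height, scanning only the first few thousand entries yields an estimate that is at most the true maximum and, with overwhelming probability, very close to it.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Estimate the max height by scanning only the first `sample` entries of an
// index stored in (effectively) random order. The result can never exceed the
// true max; with 10k samples of a ~2M-entry index it is almost certainly
// within a fraction of a percent of it.
int EstimateMaxHeight(const std::vector<int>& heights_in_hash_order, size_t sample) {
    int best = 0;
    size_t n = std::min(sample, heights_in_hash_order.size());
    for (size_t i = 0; i < n; ++i) {
        best = std::max(best, heights_in_hash_order[i]);
    }
    return best;
}
```

The chance that all 10,000 samples miss the top 0.5% of a 2M-entry index is roughly (0.995)^10000 ≈ e^-50, i.e. negligible.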
Sounds good. That makes sense, thanks.

What do you think about writing "this is not guaranteed to be the max height, it's just from a random sample" in the method's docstring?
Added (hopefully I did the docstring right, there aren't a lot of them around.)
```cpp
while (min_index && !min_index->trimmed()) {
    min_index->trim();
    min_index = min_index->pprev;
}
```
Is this the logic which trims blocks that were new when we got them, but are now old enough that they should be trimmed?
It looks like it only gets triggered when we flush to disk. How often does this flushing take place?
Correct. Adjusting the flush frequency is part of this PR elsewhere -- I think I lowered it from 1 hour to 5 minutes. It's not ideal for it to be time-based instead of memory-pressure-based, but that would be a more invasive change.
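The time-based trigger described above can be sketched as a simple predicate. This is a hypothetical illustration (the constant and function names are assumptions, not the actual flush code): trimming only runs when the index is flushed, so shortening the interval from 1 hour to 5 minutes bounds how long untrimmed headers accumulate in memory.

```cpp
#include <cstdint>

// Trimming piggybacks on index flushes, so the flush interval caps how long
// untrimmed headers can sit in memory. Times are Unix timestamps in seconds.
constexpr int64_t TRIM_FLUSH_INTERVAL = 5 * 60;  // lowered from 60 * 60

bool ShouldFlushForTrim(int64_t now, int64_t last_flush) {
    return now - last_flush >= TRIM_FLUSH_INTERVAL;
}
```

A memory-pressure-based trigger would replace the time comparison with a check against the index's estimated footprint, at the cost of threading that estimate through to this call site.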
Hmmm, this is odd. Block syncing is done elsewhere and ought to be totally independent of this. I will try to figure out what's going on.
```cpp
// For simplicity, if any of the headers they're asking for are trimmed,
// just drop the request.
LogPrint(BCLog::NET, "%s: ignoring getheaders from peer=%i which would return at least one trimmed header\n", __func__, pfrom.GetId());
return;
```
This is for the `GETHEADERS` message. Does the `GETBLOCK` message still work fine if the headers are not in memory? I have not tested it. It doesn't appear that this call needs in-memory headers to find blocks from the index, but I just want to confirm.
I haven't tested this specific case other than by running against the live network. But I systematically checked for references to the trimmed fields, so I should have caught this if it were an issue. Looking at it now, I can see that it never calls `GetBlockHeader` (which is the usual suspect) or any of the other offending methods -- it calls `ReadBlockFromDisk`, which is safe.
utACK 8cad8cd.
When receiving unwanted headers from a peer, first check whether we have recent "best known block" state for that peer. If not, accept one batch of headers to refresh it. Otherwise, we can become stuck in a state where we need blocks, but we don't know which peers have those blocks available. This especially happens during startup, when we don't have this information for any peers until we start processing header messages.
It turns out that @delta1 was 100% correct about this. Apparently header syncing is the only mechanism through which we learn what blocks a peer has available to sync. So if you refuse all incoming headers except from one peer you're doing IBD from, and then that peer dies or something, you can get in a state where you have a bunch of live peers, but you have no idea what blocks they have, so you'll never request blocks from them. (And restarting doesn't help, because on startup we have no idea which peers hold which blocks.) Fixed this by relaxing the rejection of unwanted headers just slightly: if a peer sends us headers, and we have never received headers from that peer before, OR the latest headers we have from that peer are very stale (predating our most recent full block), then we accept the headers.
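The relaxed acceptance rule described above can be sketched as a standalone predicate. This is a hypothetical illustration (the struct and function names are assumptions, not the actual `net_processing.cpp` code): accept unsolicited headers only when we have no record for the peer or its record is stale.

```cpp
#include <cstdint>

// Minimal per-peer state for the acceptance decision. Times are Unix
// timestamps in seconds.
struct PeerHeaderState {
    bool ever_sent_headers = false;   // have we ever processed headers from this peer?
    int64_t last_headers_time = 0;    // when we last processed headers from it
};

// Accept unsolicited headers if we have never heard headers from this peer,
// or if what we last heard predates our most recent full block (stale).
// Otherwise keep rejecting them, to cap memory use during IBD.
bool ShouldAcceptUnsolicitedHeaders(const PeerHeaderState& peer, int64_t our_tip_time) {
    if (!peer.ever_sent_headers) return true;               // refresh unknown peers
    if (peer.last_headers_time < our_tip_time) return true; // refresh stale peer state
    return false;
}
```

The key property is that every live peer eventually gets one batch of headers accepted, which refreshes its "best known block" state and makes it eligible for block requests again.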
Added a couple of minor documentation fixes. I believe all outstanding comments are now resolved.
utACK e622156. |
Just merged this locally on top of 22.1 and all the tests passed without issue.
Adds a `-trim_headers` flag, which punts some of the largest header fields out of `CBlockIndex`, the struct used for keeping the in-memory copies of the headers of the whole chain. We keep several dynafed periods' worth of headers fully in memory, so that we can be assured of being able to validate new blocks; additionally, during IBD we keep a window of headers in memory past where we have blocks validated (but restrict the size of the window, so that memory usage does not balloon too much).

If the chain is not configured for dynafed, the default value for the dynafed period is very large, so my guess is this code will do basically nothing in that case (it will have a window of headers to keep which is larger than the entire chain), but I haven't tested that.
Turning the flag on while in the middle of IBD may trip an assert during startup, as we try to load the index (due to having way more headers than blocks on disk, such that we try to trim away headers we still need for validation.) This should have no ill effects once the node is restarted without the flag, but this should probably be better documented.
RPCs and REST endpoints which need a trimmed header will load it from the block file on disk, in a fairly wasteful way (loading the entire block each time.) RPCs which print a trimmed header in JSON will omit the missing fields instead (replacing them with a placeholder.) This could also be better documented.
A node operating in trimmed mode will act as a "light" or SPV node to the P2P network, since it can't provide full headers to other nodes. So this mode is not appropriate for "infrastructure" nodes which need to provide support for IBD or block/transaction propagation. (I am fairly convinced that, with the current on-disk structures being set up the way they are, the performance impact of trying to go to disk for this would be prohibitive. I haven't tried it, though.)
The memory usage reduction I see with this, downloading the entire Liquid v1 chain, is something like 2x. There are definitely further improvements that could be made.