
Improve elements memory usage #1190

Merged: 16 commits into ElementsProject:master on Jan 26, 2023

Conversation

@gwillen (Contributor) commented Nov 15, 2022

Adds a -trim_headers flag, which evicts some of the largest header fields from CBlockIndex, the struct used to keep in-memory copies of the headers of the whole chain. We keep several dynafed periods' worth of headers fully in memory, so that we can be assured of being able to validate new blocks; additionally, during IBD we keep a window of headers in memory past the point where we have validated blocks (but we restrict the size of the window, so that memory usage does not balloon too much.)

If the chain is not configured for dynafed, the default value for the dynafed period is very large, so my guess is this code will do basically nothing in that case (it will have a window of headers to keep which is larger than the entire chain), but I haven't tested that.

Turning the flag on while in the middle of IBD may trip an assert during startup, as we try to load the index (due to having way more headers than blocks on disk, such that we try to trim away headers we still need for validation.) This should have no ill effects once the node is restarted without the flag, but this should probably be better documented.

RPCs and REST endpoints which need a trimmed header will load it from the block file on disk, in a fairly wasteful way (loading the entire block each time.) RPCs which print a trimmed header in JSON will omit the missing fields instead (replacing them with a placeholder.) This could also be better documented.

A node operating in trimmed mode will act as a "light" or SPV node to the P2P network, since it can't provide full headers to other nodes. So this mode is not appropriate for "infrastructure" nodes which need to provide support for IBD or block/transaction propagation. (I am fairly convinced that, with the current on-disk structures being set up the way they are, the performance impact of trying to go to disk for this would be prohibitive. I haven't tried it, though.)

The memory usage reduction I see with this, downloading the entire Liquid v1 chain, is something like 2x. There are definitely further improvements that could be made.

@gwillen force-pushed the fix-elements-memory-usage branch 2 times, most recently from eb1f32d to 4613a1f on November 15, 2022
@gwillen (Author) commented Nov 15, 2022

Fixed a bug I introduced when refactoring the "Don't let header syncing..." commit last night. Don't code after bedtime, kids.

@gwillen (Author) commented Nov 15, 2022

Oops, tests failing due to a failing lint (check-doc.py noting that there's some place I'm supposed to document the new flag.) For some reason, on my system lint-all.sh does not run check-doc.py so I didn't catch it.

@gwillen (Author) commented Nov 15, 2022

> Oops, tests failing due to a failing lint (check-doc.py noting that there's some place I'm supposed to document the new flag.) For some reason, on my system lint-all.sh does not run check-doc.py so I didn't catch it.

It turns out that it wasn't actually documentation, it was "I mistyped the name of the flag one place". I swear I had fixed that; I may have lost it in a rebase. Fixed now. (Also, apparently nothing except CI runs check-doc.py, you just have to know to run it. Add it to the list...)

@gwillen (Author) commented Nov 16, 2022

Hmmm, CI suggests I have broken signet. Will take a look.

@delta1 (Member) commented Nov 16, 2022

Nice! I'm busy doing a rough test measuring memory usage during IBD 👍

@gwillen (Author) commented Nov 17, 2022

Ooh, the undefined behavior sanitizer caught me out.

nHeaderDownloadBuffer = chainparams.GetConsensus().dynamic_epoch_length * 2;

This line is uint64 = uint32*2, which seems superficially legitimate (and is not technically undefined behavior, I think), but because of integer promotion, it gives unexpected (although here I think harmless) results for very large values of dynamic_epoch_length.

I could fix the warning with a cast, but in practice I should probably have a less-stupid fallback value for non-dynafed chains (the default value for dynamic_epoch_length is the maximum uint32 value.)

@delta1 (Member) commented Nov 17, 2022

Ran into an issue, IBD on liquidv1 went fine, but then after restarting elementsd I hit this:

2022-11-17T08:30:35Z UpdateTip: new best=bb49a7efcbc1e8289258e64fd524329a17b0eb2d3032136d8cfa9d57bf22ffe5 height=2098714 version=0x20000000 tx=3411874 date='2022-11-17T08:30:08Z' progress=1.000000 cache=0.0MiB(14txo)
elementsd: ./chain.h:223: void CBlockIndex::assert_untrimmed() const: Assertion `!m_trimmed' failed.
zsh: IOT instruction (core dumped)  ./src/elementsd -datadir=/tmp/elements -chain=liquidv1 -trim_headers

@delta1 (Member) commented Nov 17, 2022

Second restart went fine, maybe a race with a new block arriving during startup?

@delta1 (Member) commented Nov 17, 2022

It seems I spoke too soon :/

2022-11-17T08:34:57Z Loaded best chain: hashBestChain=fcbbfa42ee6184f242244501a86929f2d72c2245135447203ab22bad0e02f8e1 height=2098709 date=2022-11-17T08:25:08Z progress=1.000000
2022-11-17T08:34:58Z init message: Verifying blocks…
2022-11-17T08:34:58Z Verifying last 6 blocks at level 3
2022-11-17T08:34:58Z [0%]...[16%]...[33%]...[50%]...[66%]...[83%]...[99%]...[DONE].
2022-11-17T08:34:58Z No coin database inconsistencies in last 6 blocks (7 transactions)
2022-11-17T08:34:58Z  block index           24772ms
2022-11-17T08:34:58Z Unsetting NODE_NETWORK on prune mode
2022-11-17T08:34:58Z init message: Pruning blockstore…
2022-11-17T08:34:58Z Unsetting NODE_NETWORK_LIMITED on header trim mode
2022-11-17T08:34:58Z block tree size = 2098710
2022-11-17T08:34:58Z nBestHeight = 2098709
2022-11-17T08:34:58Z loadblk thread start
2022-11-17T08:34:58Z Imported mempool transactions from disk: 0 succeeded, 0 failed, 0 expired, 0 already there, 0 waiting for initial broadcast
2022-11-17T08:34:58Z loadblk thread exit
2022-11-17T08:34:58Z Bound to 127.0.0.1:37041
2022-11-17T08:34:58Z Bound to [::]:7042
2022-11-17T08:34:58Z Bound to 0.0.0.0:7042
2022-11-17T08:34:58Z init message: Loading P2P addresses…
2022-11-17T08:34:58Z torcontrol thread start
2022-11-17T08:34:58Z Loaded 975 addresses from peers.dat  3ms
2022-11-17T08:34:58Z Missing or invalid file /tmp/elements/liquidv1/anchors.dat
2022-11-17T08:34:58Z 0 block-relay-only anchors will be tried for connections.
2022-11-17T08:34:58Z init message: Starting network threads…
2022-11-17T08:34:58Z dnsseed thread start
2022-11-17T08:34:58Z net thread start
2022-11-17T08:34:58Z addcon thread start
2022-11-17T08:34:58Z Waiting 11 seconds before querying DNS seeds.
2022-11-17T08:34:58Z msghand thread start
2022-11-17T08:34:58Z opencon thread start
2022-11-17T08:34:58Z init message: Done loading
2022-11-17T08:34:58Z tor: Got service ID ul2tiw2q5waxqnmd47emvdz67yvuapmt7s7akouzf4sfdtolvpfsq4yd, advertising service ul2tiw2q5waxqnmd47emvdz67yvuapmt7s7akouzf4sfdtolvpfsq4yd.onion:7042
2022-11-17T08:34:58Z AddLocal(ul2tiw2q5waxqnmd47emvdz67yvuapmt7s7akouzf4sfdtolvpfsq4yd.onion:7042,4)
2022-11-17T08:34:59Z Leaving InitialBlockDownload (latching to false)
2022-11-17T08:34:59Z New outbound peer connected: version: 70016, blocks=2098718, peer=0 (outbound-full-relay)
2022-11-17T08:35:00Z New outbound peer connected: version: 70016, blocks=2098718, peer=1 (outbound-full-relay)
2022-11-17T08:35:09Z Loading addresses from DNS seed seed.liquidnetwork.io
2022-11-17T08:35:09Z 9 addresses found from DNS seeds
2022-11-17T08:35:09Z dnsseed thread exit
2022-11-17T08:35:18Z UpdateTip: new best=b1d7729a01fdc5c711a7de77fb3479529813ad3845754a53d1b15641063a33d0 height=2098710 version=0x20000000 tx=3411867 date='2022-11-17T08:26:08Z' progress=1.000000 cache=0.0MiB(0txo)
2022-11-17T08:35:18Z New outbound peer connected: version: 70016, blocks=2098718, peer=2 (outbound-full-relay)
2022-11-17T08:35:18Z UpdateTip: new best=082748a02e94e3d54c9b3f5d693e156e3cf4c4c2aebf837a5f6502134c264258 height=2098711 version=0x20000000 tx=3411870 date='2022-11-17T08:27:08Z' progress=1.000000 cache=0.0MiB(10txo)
2022-11-17T08:35:18Z UpdateTip: new best=bc6b983fe934d3dffc8023447fdfc58ba5ddf12932b6922cb2b2466d969ef030 height=2098712 version=0x20000000 tx=3411871 date='2022-11-17T08:28:07Z' progress=1.000000 cache=0.0MiB(10txo)
2022-11-17T08:35:19Z UpdateTip: new best=da574173a5458367b7d4d6aece8978f6ed02e4b52d07b75be5bcd22af088bc4d height=2098713 version=0x20000000 tx=3411872 date='2022-11-17T08:29:05Z' progress=1.000000 cache=0.0MiB(10txo)
2022-11-17T08:35:19Z UpdateTip: new best=bb49a7efcbc1e8289258e64fd524329a17b0eb2d3032136d8cfa9d57bf22ffe5 height=2098714 version=0x20000000 tx=3411874 date='2022-11-17T08:30:08Z' progress=1.000000 cache=0.0MiB(14txo)
2022-11-17T08:35:19Z UpdateTip: new best=5c223bb79036809c5be42bea50ae721b88dc9ae02179f22855775347a3716368 height=2098715 version=0x20000000 tx=3411875 date='2022-11-17T08:31:08Z' progress=1.000000 cache=0.0MiB(14txo)
2022-11-17T08:35:19Z UpdateTip: new best=58ef9a0c941ef6eb647152eb21ab8220369e15fefb2cfddc9b03f08a868f4144 height=2098716 version=0x20000000 tx=3411876 date='2022-11-17T08:31:51Z' progress=1.000000 cache=0.0MiB(14txo)
2022-11-17T08:35:19Z UpdateTip: new best=acaf97cd0f6e4b302bd51ab582a7a042d0b27defb78ae0eb4fa69c925866b5fa height=2098717 version=0x20000000 tx=3411877 date='2022-11-17T08:33:08Z' progress=1.000000 cache=0.0MiB(14txo)
2022-11-17T08:35:19Z UpdateTip: new best=9e394fa1536d6ee1d20c9fa0b8ca80480295a573cedab0d55ac25ad495741c2c height=2098718 version=0x20000000 tx=3411878 date='2022-11-17T08:34:08Z' progress=1.000000 cache=0.0MiB(14txo)
2022-11-17T08:35:35Z UpdateTip: new best=4526a1c604c3b7e0621e73c61a15050c06ecc34c7e4ae1bed4799f5c27c2aae6 height=2098719 version=0x20000000 tx=3411879 date='2022-11-17T08:34:58Z' progress=1.000000 cache=0.0MiB(14txo)
2022-11-17T08:35:40Z New outbound peer connected: version: 70016, blocks=2098719, peer=3 (outbound-full-relay)
2022-11-17T08:35:40Z New outbound peer connected: version: 70016, blocks=2098719, peer=4 (outbound-full-relay)
2022-11-17T08:36:15Z New outbound peer connected: version: 70016, blocks=2098719, peer=6 (outbound-full-relay)
2022-11-17T08:36:15Z New outbound peer connected: version: 70016, blocks=2098719, peer=5 (outbound-full-relay)
2022-11-17T08:36:16Z New outbound peer connected: version: 70016, blocks=2098719, peer=7 (outbound-full-relay)
2022-11-17T08:36:35Z UpdateTip: new best=febc66baea17978a23985ba337d5378bcf561480bb214eafc8072e384f8561a5 height=2098720 version=0x20000000 tx=3411880 date='2022-11-17T08:36:08Z' progress=1.000000 cache=0.0MiB(14txo)
2022-11-17T08:36:37Z New outbound peer connected: version: 70016, blocks=2098720, peer=8 (outbound-full-relay)
2022-11-17T08:36:38Z New outbound peer connected: version: 70015, blocks=2098720, peer=9 (outbound-full-relay)
2022-11-17T08:36:59Z New outbound peer connected: version: 70015, blocks=1517039, peer=10 (outbound-full-relay)
2022-11-17T08:37:00Z New outbound peer connected: version: 70016, blocks=2098720, peer=11 (outbound-full-relay)
2022-11-17T08:37:01Z New outbound peer connected: version: 70015, blocks=1507686, peer=12 (outbound-full-relay)
2022-11-17T08:37:01Z New outbound peer connected: version: 70016, blocks=2098720, peer=13 (outbound-full-relay)
2022-11-17T08:37:03Z New outbound peer connected: version: 70015, blocks=1517039, peer=14 (outbound-full-relay)
2022-11-17T08:37:05Z New outbound peer connected: version: 70016, blocks=2098720, peer=15 (block-relay-only)
2022-11-17T08:37:05Z New outbound peer connected: version: 70015, blocks=1517039, peer=16 (outbound-full-relay)
2022-11-17T08:37:06Z New outbound peer connected: version: 70016, blocks=2098720, peer=17 (block-relay-only)
2022-11-17T08:37:15Z New outbound peer connected: version: 70016, blocks=2098720, peer=18 (block-relay-only)
2022-11-17T08:37:35Z UpdateTip: new best=365cce856fa32dc934f5e03277377ede7be94a89e52f53d8645dfc0866491890 height=2098721 version=0x20000000 tx=3411882 date='2022-11-17T08:37:08Z' progress=1.000000 cache=0.0MiB(15txo)
elementsd: ./chain.h:223: void CBlockIndex::assert_untrimmed() const: Assertion `!m_trimmed' failed.
zsh: IOT instruction (core dumped)  ./src/elementsd -datadir=/tmp/elements -chain=liquidv1 -trim_headers

@delta1 (Member) commented Nov 17, 2022

restarted again with debug, crashed again - attached debug log file

debug.log

@delta1 (Member) commented Nov 17, 2022

Very unscientific measurement of RAM usage during IBD

x-axis: number of 5 second increments
y-axis: resident set size as reported by pmap -x <pid>

RSS climbed to a max of approx 5.2GB when IBD finished and stayed there

[graph omitted: RSS vs. time during IBD]

@delta1 (Member) commented Nov 17, 2022

I added an arg to assert_untrimmed to see where it was happening; it appears to be in GetBlockHeader():

2022-11-17T09:52:06Z Requesting block 4fee51568938601608ae2600624a284425b49b0847e020e9f4445906ff8baa6d (2098726) peer=1
2022-11-17T09:52:06Z Requesting block 0fca94737b5a5a65963c189afaeedc0aca41c026d87ad2b733f083302e944f7b (2098727) peer=1
2022-11-17T09:52:06Z Requesting block 59e9017d6c98de570c678bb6c31577f0fe44da2dec8f7779d8ce04208e1e7990 (2098728) peer=1
2022-11-17T09:52:06Z Requesting block 8f70225f4e2f21a79b400aa53b658d89f10d49bf1ff2e7003f445e1e8ae6e830 (2098729) peer=1
2022-11-17T09:52:06Z Requesting block 5fe4db51af8b5eda78a01ee0e2bc125b9559b317e14d0d9986bb0a45340e89e5 (2098730) peer=1
2022-11-17T09:52:06Z Requesting block ef96b77431d1f5272f04709b56e2144007a9ebacaaf03ad1e4768e6b4607de03 (2098731) peer=1
2022-11-17T09:52:06Z Requesting block 6284c554cc32d518ff72698183caa8572895e855f1fe1d11d21389b5c8dfebcd (2098732) peer=1
2022-11-17T09:52:06Z Requesting block a546deee33bdceca63221688622b77ef8c6ac336fa4a5eb9c443844ddf8e9536 (2098733) peer=1
2022-11-17T09:52:06Z Requesting block b436910715f53d10361203b1ac9a2e1c8700fb361fee99b0a62e0ec629a8ba6f (2098734) peer=1
2022-11-17T09:52:06Z Requesting block 4724ae6e760f7101cb89b04a854d7dce7655450be30571d0c183c8ebe73c7171 (2098735) peer=1
2022-11-17T09:52:06Z Requesting block 9e58f2d6a1c522b7ffac9c7cc87415fd3d7c2bcfe1deb0b8c6b1938c7a83cabd (2098736) peer=1
2022-11-17T09:52:06Z Requesting block 0dd38c66e282b6861ed44b949c8a088224ecbfa2109c9b5e9c7928c8aa1be5cd (2098737) peer=1
2022-11-17T09:52:06Z Requesting block 885ca46a3f6074fceaacd5c4ccbf7f46915465f10cca3f856500f9229b1cde91 (2098738) peer=1
2022-11-17T09:52:06Z Requesting block 266e528bb14c564277954860f5200cffbf39343aa990b00777a4b6a099fec0c7 (2098739) peer=1
2022-11-17T09:52:06Z Requesting block 7263ec438c09765772d7be0dfed73d5656f0a51612ab024c1b05b6873c788f17 (2098740) peer=1
2022-11-17T09:52:06Z Requesting block a8161fd4673be4c223d50eb3e50cfa61f80a5708d7996a3cf64097fd6d96b4bc (2098741) peer=1
2022-11-17T09:52:06Z sending getdata (577 bytes) peer=1
2022-11-17T09:52:06Z received: addrv2 (9000 bytes) peer=1
2022-11-17T09:52:06Z Received addr: 543 addresses (543 processed, 0 rate-limited) from peer=1
2022-11-17T09:52:06Z IP 185.212.69.79 mapped to AS0 belongs to new bucket 565
2022-11-17T09:52:06Z IP 170.75.170.78 mapped to AS0 belongs to new bucket 337
2022-11-17T09:52:06Z IP zj5yy2wjf5lwswvlafuqzoxapl7mqwaxlte7coy7ri6rcaoha4xiudqd.onion mapped to AS0 belongs to new bucket 148
2022-11-17T09:52:06Z Cleared nodestate for peer=2
2022-11-17T09:52:07Z received: getheaders (1061 bytes) peer=0
2022-11-17T09:52:07Z getheaders 1509329 to b1d7729a01fdc5c711a7de77fb3479529813ad3845754a53d1b15641063a33d0 from peer=0
2022-11-17T09:52:07Z assert_untrimmed: GetBlockHeader()
elementsd: ./chain.h:225: void CBlockIndex::assert_untrimmed(std::string) const: Assertion `!m_trimmed' failed.
zsh: IOT instruction (core dumped)  ./src/elementsd -datadir=/tmp/elements -chain=liquidv1 -trim_headers  -debug

Which is called here in net_processing.cpp

https://github.com/gwillen/elements/blob/fix-elements-memory-usage/src/net_processing.cpp#L3170

@gwillen (Author) commented Nov 17, 2022

Thanks for your work on this, it's incredibly helpful.

It seems that peers are sending us getheaders even though we should be clearing the network service bits that say it's okay to ask us for them. I will have to investigate whether (1) we're not actually clearing them, (2) the peers are misbehaving somehow, (3) I have misunderstood what those bits do, or (4) something else entirely.

In any case, this is a good reminder of something I had meant to do but forgot, which is to have some kind of reasonable handling of getheaders messages even if we don't expect to receive them (i.e. banning the peer, if it's truly misbehaving, or just ignoring them entirely.)

@gwillen (Author) commented Dec 12, 2022

Rebased to bring in new changes from master, to fix merge conflict.

@delta1 (Member) left a comment

utACK 29345e9

I'm busy testing IBD and restarting again.

Left some nits and a comment about the CI fuzzer issue

Co-authored-by: Byron Hambly <byron@hambly.dev>
@delta1 (Member) commented Jan 5, 2023

I had started a new node for IBD and had to interrupt it before letting it run overnight; it now appears to be stuck:

Discarding received headers and pausing header sync from peer=2, because we are too far ahead of block sync (189974 > 148383, 2000 new headers received)

Judging from the log file before the first shutdown, it was already stuck once it reached this point:

2023-01-04T14:38:38Z Discarding received headers and pausing header sync from peer=43, because we are too far ahead of block sync (189974 > 148383, 2000 new headers received)
2023-01-04T14:38:39Z received: headers (2767039 bytes) peer=41
2023-01-04T14:38:39Z Discarding received headers and pausing header sync from peer=41, because we are too far ahead of block sync (189974 > 148383, 2000 new headers received)
2023-01-04T14:38:39Z received: headers (2767039 bytes) peer=42
2023-01-04T14:38:39Z Discarding received headers and pausing header sync from peer=42, because we are too far ahead of block sync (189974 > 148383, 2000 new headers received)
2023-01-04T14:38:40Z received: headers (2767039 bytes) peer=44
2023-01-04T14:38:40Z Discarding received headers and pausing header sync from peer=44, because we are too far ahead of block sync (189974 > 148383, 2000 new headers received)
2023-01-04T14:38:40Z received: headers (2767039 bytes) peer=40
2023-01-04T14:38:40Z Discarding received headers and pausing header sync from peer=40, because we are too far ahead of block sync (189974 > 148383, 2000 new headers received)

debug.log.xz.zip

(this is compressed with xz, had to add a zip ext to make github happy)

I will try another sync from scratch with no interruptions

@delta1 (Member) commented Jan 6, 2023

Another IBD stuck, this time at 2117769

2023-01-06T07:38:40Z Discarding received headers and pausing header sync from peer=211, because we are too far ahead of block sync (2159312 > 2117769, 2000 new headers received)

So in this case the headers are 41543 ahead of the blocks

  "chain": "liquidv1",
  "blocks": 2117769,
  "headers": 2159312,
  "bestblockhash": "7a81d57330179218a0106b0c9421aadf18a6a61f0392c191859ba8dc5230c2c5",
  "mediantime": 1669830008,
  "verificationprogress": 0.976,
  "initialblockdownload": true,

From what I can tell, moving the "enough headers" check up in fba37b7 means that we are discarding the headers but then not continuing to sync the outstanding blocks

@@ -307,12 +307,47 @@ bool CBlockTreeDB::WritePAKList(const std::vector<std::vector<unsigned char> >&
return Write(std::make_pair(DB_PAK, uint256S("1")), offline_list) && Write(std::make_pair(DB_PAK, uint256S("2")), online_list) && Write(std::make_pair(DB_PAK, uint256S("3")), reject);
}

bool CBlockTreeDB::LoadBlockIndexGuts(const Consensus::Params& consensusParams, std::function<CBlockIndex*(const uint256&)> insertBlockIndex)
bool CBlockTreeDB::WalkBlockIndexGutsForMaxHeight(int* nHeight) {
Contributor:

This is not guaranteed to be the max height, correct? This is our best guess of the max height, from a random sample of 10k blocks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could that be placed into the docstring for this method, since it is public?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, forgive my ignorance about the bitcoin / elements codebase, but is there no way to get the max height here? Certainly that is stored somewhere, since certain RPCs produce it.

Is it just that accessing it here would require some major refactors which are overkill for this PR? Or... do we actually have to load the block index first in order to get the max height?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to load the block index first to get the actual max height, yeah, it's a catch-22 here. This is one of the very first stages of startup. And the headers in the index are stored on disk in hash order, not height order, so we have no idea what the max height is until after we've loaded all of them. (But since hash order is random with respect to height, loading a few gives us a really good guess.)

Also note that this really only tells us the max header height, we don't learn what blocks we actually have until even later in the startup process.

Contributor:

Sounds good. That makes sense, thanks.

What do you think about writing that "this is not guaranteed to be the max height, it's just from a random sample" in the method's docstring?

Contributor (Author):

Added (hopefully I did the docstring right, there aren't a lot of them around.)

while (min_index && !min_index->trimmed()) {
min_index->trim();
min_index = min_index->pprev;
}
Contributor:

Is this the logic which trims blocks that were new when we got them, but are now old enough that they should be trimmed?

It looks like it only gets triggered when we flush to disk. How often does this flushing take place?

Contributor (Author):

Correct. Adjusting the flush frequency is part of this PR elsewhere -- I think I lowered it from 1 hour to 5 minutes. It's not ideal for it to be time-based instead of memory-pressure-based, but that would be a more invasive change.

@gwillen (Author) commented Jan 18, 2023

> From what I can tell, moving the "enough headers" check up in fba37b7 means that we are discarding the headers but then not continuing to sync the outstanding blocks

Hmmm, this is odd. Block syncing is done elsewhere and ought to be totally independent of this. I will try to figure out what's going on.

// For simplicity, if any of the headers they're asking for are trimmed,
// just drop the request.
LogPrint(BCLog::NET, "%s: ignoring getheaders from peer=%i which would return at least one trimmed header\n", __func__, pfrom.GetId());
return;
Contributor:

This is for the GETHEADERS message.

Does the GETBLOCK message still work fine if the headers are not in memory? I have not tested it. It doesn't appear that this call needs in-memory headers to find blocks from the index, but I just want to confirm.

@gwillen (Author) commented Jan 25, 2023

I haven't tested this specific case other than by running against the live network. But I systematically checked for references to the trimmed fields, so I should have caught this if it were an issue. Looking at it now, I can see that it never calls GetBlockHeader (which is the usual suspect) or any of the other offending methods -- it calls ReadBlockFromDisk, which is safe.

@jamesdorfman (Contributor)

utACK 8cad8cd.

When receiving unwanted headers from a peer, first check whether we have
recent "best known block" state for that peer. If not, accept one batch
of headers to refresh it. Otherwise, we can become stuck in a state where
we need blocks, but we don't know which peers have those blocks available.
This especially happens during startup, when we don't have this information
for any peers until we start processing header messages.
@gwillen (Author) commented Jan 25, 2023

> From what I can tell, moving the "enough headers" check up in fba37b7 means that we are discarding the headers but then not continuing to sync the outstanding blocks
>
> Hmmm, this is odd. Block syncing is done elsewhere and ought to be totally independent of this. I will try to figure out what's going on.

It turns out that @delta1 was 100% correct about this. Apparently header syncing is the only mechanism through which we learn what blocks a peer has available to sync. So if you refuse all incoming headers except from one peer you're doing IBD from, and then that peer dies or something, you can get in a state where you have a bunch of live peers, but you have no idea what blocks they have so you'll never request blocks from them. (And restarting doesn't help, because on startup we have no idea what peers hold what blocks.)

Fixed this by relaxing the rejection of unwanted headers just slightly: if a peer sends us headers, and we have never received headers from that peer before, OR the latest headers we have from that peer are very stale (predating our most recent full block), then we accept the headers.

@gwillen (Author) commented Jan 25, 2023

Added a couple of minor documentation fixes. I believe all outstanding comments are now resolved.

@delta1 (Member) left a comment

Tested ACK e622156

IBD issue is resolved so far in my sync test.

I also ran test/functional/test_runner.py successfully after cherry-picking the python fixes in 069bec1 and 0f69670 from the elements-22.x branch

@jamesdorfman (Contributor)

utACK e622156.

@gwillen (Author) commented Jan 26, 2023

@psgreco @jhfrontz I would like to merge this -- I forget who has the power on this repo. What say you?

@psgreco (Contributor) commented Jan 26, 2023

Just merged this locally on top of 22.1 and all the tests passed without issue

@psgreco psgreco merged commit ffeedde into ElementsProject:master Jan 26, 2023
jamesdorfman added a commit to jamesdorfman/elements that referenced this pull request Mar 13, 2023