
blocksync: Some performance improvements on top of #3957 #3985

Merged — 3 commits into ElementsProject:master on Sep 2, 2020

Conversation

@cdecker (Member) commented Aug 26, 2020

Inspired by #3957 I took some time to benchmark and improve the block sync
time as well. It's not as impressive as the 30x speedup that #3957 achieved,
but there are some incremental gains. The benchmarks use the following code,
with a local bitcoind and the default bcli backend:

```python
from pyln.client import LightningRpc
import subprocess
import time

startheight = 600000
numblocks = 1000
targetheight = startheight + numblocks
proc = subprocess.Popen([
    'lightningd/lightningd',
    '--network=bitcoin',
    '--log-level=debug',
    '--rescan=-{}'.format(startheight),
    '--lightning-dir=/tmp/ll'
])
time.sleep(5)
rpc = LightningRpc('/tmp/ll/bitcoin/lightning-rpc')

def height():
    return rpc.getinfo()['blockheight']

def rate(first, last):
    if first == last:
        return 0
    return (last[1] - first[1]) / (last[0] - first[0])

start = (time.time(), height())
heights = []
while True:
    heights.append((time.time(), height()))
    # Keep a sliding window of the last 30 samples for the current rate.
    if len(heights) > 30:
        heights = heights[-30:]
    first = heights[0]
    last = heights[-1]

    print("Speed: {:.2f} blocks / s, overall {:.2f} blocks / s".format(
        rate(first, last),
        rate(start, last)
    ))
    time.sleep(1)

    if last[1] >= targetheight:
        break

rpc.stop()
proc.wait()
```

The following are the performance gains achieved by individual commits (and compared to #3957):

| Scenario | TTS 1000 blocks | Blocks / second | Speedup |
|----------|-----------------|-----------------|---------|
| pre-3957 (bf5e994) | 8:01 | 2.073498 | - |
| post-3957 (128adf0) | 7:28 | 2.224472 | 7% |
| commit 1 (68f3f30) | 5:42 | 2.923359 | 41% |
| commit 2 (dc0859e) | 5:00 | 3.321625 | 60% |
| commit 3 (0e4c6b6) | 3:07 | 5.357775 | 158% |

And here are the flamegraphs that I used to guide the changes:

- flamegraph-pre-3957
- flamegraph-post-3957
- flamegraph-68f3f30
- flamegraph-dc0859e
- flamegraph-0e4c6b6

@cdecker cdecker added this to the v0.9.1 milestone Aug 26, 2020
@cdecker (Member, Author) commented Aug 27, 2020

The fixup switches from searching for `\n\n` to searching for `}\n`, which makes no additional assumptions about the behavior of the plugins and instead relies on the existing "JSON object terminated by a newline" semantics.

@cdecker (Member, Author) commented Aug 27, 2020

Performance remains the same for block sync, if not better:

flamegraph-newsep

| Scenario | TTS 1000 blocks | Blocks / second | Speedup |
|----------|-----------------|-----------------|---------|
| pre-3957 (bf5e994) | 8:01 | 2.073498 | - |
| post-3957 (128adf0) | 7:28 | 2.224472 | 7% |
| commit 1 (68f3f30) | 5:42 | 2.923359 | 41% |
| commit 2 (dc0859e) | 5:00 | 3.321625 | 60% |
| commit 3 (0e4c6b6) | 3:07 | 5.357775 | 158% |
| new separator (1634f45) | 3:01 | 5.609808 | 170% |

@rustyrussell (Contributor) commented Aug 30, 2020

Hmm, my response to this got lost.

Anyway, I can't find any requirement on json.org or jsonrpc.org that a JSON-RPC message end with '\n'. Even if it said that, my ignorance suggests it's dangerous to rely on it.

We can, however, insist that we have just read a '}'. In the normal case this is harmless, but in the "huge token" case here it's an effective hack. Needs a huge comment though:

```c
/* Our JSON parser is pretty good at incremental parsing, but
 * `getrawblock` gives a giant 2MB token, which forces it to re-parse
 * every time until we have all of it.  However, we can't complete a
 * JSON object without a '}', so we do a cheaper check here. */
```

@cdecker (Member, Author) commented Sep 1, 2020

> Hmm, my response to this got lost.
>
> Anyway, I can't find any requirement on json.org or jsonrpc.org that a JSON RPC end with '\n'. Even if it said that, my ignorance suggests it's dangerous to rely on it.

That is because the JSON-RPC spec is (mostly) transport agnostic, and the usual suspects (HTTP/HTTPS/HTTP2.0) provide built-in framing support, i.e., either the header contains the number of bytes to read, or the connection is closed after submitting the payload. In our case we don't have that, which required us to come up with a different end-of-message marker. We chose a very inefficient end-of-json = end-of-message, which requires either re-parsing the message on every increment or having a resumable parser like jsmn. Notice that resumable parsers are very rare, and we got lucky. For consumers such as the pyln-client library we do provide stronger guarantees, like the `\n\n` message separator, but I think we should slowly consolidate the different interfaces and start requiring `\n\n` over time (gated behind a COMPAT flag).

I'm pretty sure that requiring a newline after the message is safe, even if not explicitly documented right now, but happy to start enforcing that only in conjunction with a COMPAT flag as well.
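To illustrate the framing problem described above (an illustrative Python sketch, not lightningd's actual C code): without length-prefixed transport framing, a reader can only detect the end of a message by parsing it, and a prefix of a JSON object fails to parse, so a non-resumable reader must retry the full parse on every incremental read.

```python
import json

def try_parse(buf: bytes):
    """Return the decoded object, or None if the buffer still holds only
    a partial message. Without transport-level framing this retry-the-parse
    loop (or a resumable parser like jsmn) is the only way to find the end
    of a message."""
    try:
        return json.loads(buf)
    except json.JSONDecodeError:
        return None

buf = b'{"result": "00ff'
partial = try_parse(buf)    # still incomplete, parse fails
buf += b'"}'
complete = try_parse(buf)   # now a full object
```

With a multi-megabyte `getrawblock` response, re-running that parse on every read is exactly the quadratic cost this PR attacks.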

> We can, however, insist that we have just read a '}'. In the normal case, this is harmless, but in the case here "huge token", it's an effective hack. Needs a huge comment though:
>
> ```c
> /* Our JSON parser is pretty good at incremental parsing, but
>  * `getrawblock` gives a giant 2MB token, which forces it to re-parse
>  * every time until we have all of it.  However, we can't complete a
>  * JSON object without a '}', so we do a cheaper check here. */
> ```

I went ahead and implemented the single } check:

| Scenario | TTS 1000 blocks | Blocks / second | Speedup |
|----------|-----------------|-----------------|---------|
| pre-3957 (bf5e994) | 8:01 | 2.073498 | - |
| post-3957 (128adf0) | 7:28 | 2.224472 | 7% |
| commit 1 (68f3f30) | 5:42 | 2.923359 | 41% |
| commit 2 (dc0859e) | 5:00 | 3.321625 | 60% |
| commit 3 (0e4c6b6) | 3:07 | 5.357775 | 158% |
| new separator (1634f45) | 3:01 | 5.609808 | 170% |
| single } sep (eea01b0) | 2:46 | 5.966785 | 188% |

flamegraph-parensep

So it seems like checking for } is just as good as the longer separator. At least we don't need to be more restrictive in what we accept yet 😉

I added the details on why we filter to the comment, and if this looks good I can squash it all into place.

```c
	/* Our JSON parser is pretty good at incremental parsing, but
	 * `getrawblock` gives a giant 2MB token, which forces it to re-parse
	 * every time until we have all of it.  However, we can't complete a
	 * JSON object without a '}', so we do a cheaper check here. */
	have_full = memmem(plugin->buffer + plugin->used - offset,
			   plugin->len_read + offset, "}", 1);
```
@rustyrussell (Contributor) commented:
This can be memchr :)

And we no longer need offset. I'll add a fixup (and seems like a file needs regen, too...)

@rustyrussell (Contributor) commented:
Changed to a straight memchr, squashed fixups and rebased on (Travis-fixed) master...

@rustyrussell (Contributor) commented:
Ack e45b22d

We were using `tal_fmt` to truncate off the last byte of the response (the
trailing newline), which required an allocation, a call to `vsnprintf`, and a
copy of the block contents. This being >2MB in size on mainnet was rather
expensive.

We now just signal the end of the string by overwriting the trailing newline,
and steal directly onto the result.
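The idea in this commit message can be sketched in Python terms (the actual change is C code using tal; the function name here is hypothetical): instead of formatting the multi-megabyte response into a fresh buffer just to drop the trailing newline, mark the end in place, or take a zero-copy view of the payload.

```python
def strip_trailing_newline(resp: bytearray) -> memoryview:
    """Drop the trailing newline without copying the payload.

    The C code instead overwrites the trailing newline in place to
    terminate the string; either way, no second multi-megabyte buffer
    is allocated and no copy of the block contents is made."""
    assert resp.endswith(b"\n")
    # Slicing a memoryview does not copy the underlying bytes.
    return memoryview(resp)[:-1]

raw = bytearray(b'{"result": "00ff..."}\n')
view = strip_trailing_newline(raw)
```

The design point is the same in both languages: the expensive part was the copy, not the truncation itself.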

We're not going to mutate transactions in a block, so computing the txids
every time we need them is a waste; let's compute them upfront and use them
afterwards.
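The caching idea in this commit can be sketched as follows (illustrative Python; lightningd's implementation is in C and the `Block` class here is hypothetical). Since transactions in a block are immutable, each txid is computed exactly once, when the block is loaded:

```python
import hashlib

def txid(raw_tx: bytes) -> bytes:
    # Bitcoin txids are the double-SHA256 of the serialized transaction.
    return hashlib.sha256(hashlib.sha256(raw_tx).digest()).digest()

class Block:
    """Hypothetical container: hash each raw transaction a single time
    upfront, then reuse the cached digests everywhere."""
    def __init__(self, raw_txs):
        self.raw_txs = raw_txs
        # Precompute: transactions in a block never change.
        self.txids = [txid(tx) for tx in raw_txs]
```

This trades a small amount of memory per block for never repeating the double-SHA256 work.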

PR ElementsProject#3957 improved performance considerably; however, we still
searched the entire message for the message separator. If instead we only
search the incrementally read data, we remove the quadratic behavior for
large messages.

This is safe since we then loop over the messages, which drains any
message separator from the buffer before we attempt the next read.
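The incremental search described above can be sketched like this (illustrative Python using the `}\n` separator discussed earlier; the real code is C in lightningd/plugin.c). Only the newly read bytes, plus a small overlap in case the separator straddles a read boundary, are scanned, so each byte is examined a bounded number of times regardless of message size:

```python
SEP = b"}\n"

class MessageReader:
    """Accumulate stream data and split off complete messages, scanning
    only the freshly read region for the separator instead of re-scanning
    the whole buffer on every read (which is quadratic for large messages)."""

    def __init__(self):
        self.buf = b""

    def feed(self, data: bytes) -> list:
        # Start just before the old end so a separator that straddles
        # the read boundary is still found.
        start = max(0, len(self.buf) - (len(SEP) - 1))
        self.buf += data
        msgs = []
        while True:
            idx = self.buf.find(SEP, start)
            if idx == -1:
                return msgs
            msgs.append(self.buf[:idx + len(SEP)])
            # Draining each message also drains its separator, so the
            # next read starts from a clean buffer position.
            self.buf = self.buf[idx + len(SEP):]
            start = 0
```

The inner loop mirrors the "loop over the messages" safety argument from the commit message: every separator found is consumed before the next read.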

Changelog-Fixed: bcli: Significant speedups for block synchronization
@cdecker (Member, Author) commented Sep 2, 2020

ACK 64f2dcb

@cdecker cdecker merged commit d7ff74e into ElementsProject:master Sep 2, 2020