Final gossip store optimization #2644
Conversation
Great change(set).
This is a bit large for a detailed analysis; we will have to see if it behaves as expected.
But here are my two Satoshis...
- There is a conflict in openingd/openingd.c (which is why Jenkins won't build, right?).
- I thought putting declarations at function start was project style, but I'm fine with both.
- On my testnet node I get the following upon startup; not sure if it's related. Is there a test case for upgrading from 4 to 5?
2019-05-17T08:50:14.521Z lightning_gossipd(19181): Upgraded gossip_store from version 4 to 5
2019-05-17T08:50:14.522Z lightning_gossipd(19181): gossip_store: Bad channel_update (0102231f797252fdb20277b5bee70f942e08cd8a0a7361ea2ee989b3dcee8a63d7470a845ebda5381bac371a1b037e33f883e9cc7531fb3e8c53ebd7315f673b5c5243497fd7f826957108f4a30fd9cec3aeba79972084e90ead01ea33090000000016b6bc00001d00005cb3e7890100002800000000000003e8000003e800000001000000000229e4e0) truncating to 605593
- After startup, my lightning_channeld processes remain at 100% CPU for an extended period of time; it has not stopped while I write this. Edit: now 10+ minutes at 100% CPU on a decent machine...
- Retest on the master branch: no such behaviour.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19207 will 20 0 9264 3284 2700 R 100.0 0.0 3:22.13 lightning_chann
19211 will 20 0 9264 3196 2604 R 100.0 0.0 3:26.07 lightning_chann
19213 will 20 0 9264 3068 2472 R 100.0 0.0 3:17.24 lightning_chann
peer_failed_connection_lost();

type = fromwire_peektype(reply);
if (type == replytype) {
I would drop the int type; (line 825) and inline this to:

if (fromwire_peektype(reply) == replytype) {

...or put the int type; at the function start.
We reuse type below this in the status msg though...
cli/lightning-cli.c
resp = tal_arr(ctx, char, 1000);
toks = tal_arr(ctx, jsmntok_t, 100);
resp_len = 1000;
resp = malloc(resp_len);
This seems to never be released/freed, right?
True, though we exit. Neater to free it anyway; will fix in a separate patch.
I reproduced this here. The upgrade logic was too simplistic, but it's not worth fixing. We simply truncated the gossip_store on upgrade again :( Before 0.7.1 I'll make sure we're not hammered by every peer feeding us gossip, to mitigate.
Hmm, that's weird. If you can gdb attach to it, or even send channeld a SIGABRT and look for a backtrace in the logs, that might offer a clue? Thanks!
Force-pushed 8aa7ee4 to 94d7220.
Rebased on master to resolve the conflict, and added a final commit to drop the upgrade code.
> It also weans us off broadcast ordering, which we can now eliminate.

What is the meaning of that sentence?
@rustyrussell Here you go, I tried again and luckily the same thing happened again. I SIGABRTed it (huge stacktrace logs collapsed).
Force-pushed 5d2d012 to 182e617.
Thanks! Indeed, I reproduced it here too. It spins on a 0 timeout until we get a timestamp_filter message, which doesn't happen with modern lnd. I've fixed this up and rebased, and am testing now...
It means this was the last user of the broadcast ordering uintmap; we can now remove it.
Indeed! Nice addition; I've reproduced it here and am bisecting. Sleep soon, so it'll be done in about 13 hours, I'd guess.
Force-pushed 182e617 to 3547623.
Force-pushed 3547623 to 9ff3554.
@rustyrussell Testing this PR with a mainnet gossip_store (gossip_store_fromHPLaptop2_May23_1521.zip), the command crashes (crashlog collapsed below).
Another thing is that at restart, ...
Thanks, will investigate further... Truncation is almost always a sign of previous corruption (though compaction will usually find those issues too). At least this code is now getting more testing!
Maybe this is of interest: since mid-April there are messages like the following (grep result collapsed).
OK, I reproduced this problem, and enhanced our test to expose it: we weren't deleting private channels from the store when they become public. It also found something else: our counter of entries was not counting everything. Here's the total difference after rebase:

diff --git a/gossipd/gossip_store.c b/gossipd/gossip_store.c
index 779525580..58957c3da 100644
--- a/gossipd/gossip_store.c
+++ b/gossipd/gossip_store.c
@@ -385,6 +385,8 @@ u64 gossip_store_add(struct gossip_store *gs, const u8 *gossip_msg,
}
gs->count++;
+ if (addendum)
+ gs->count++;
return off;
}
diff --git a/gossipd/routing.c b/gossipd/routing.c
index 1dc7015e5..ecef419a3 100644
--- a/gossipd/routing.c
+++ b/gossipd/routing.c
@@ -1413,8 +1413,10 @@ bool routing_add_channel_announcement(struct routing_state *rstate,
}
/* Pretend it didn't exist, for the moment. */
- if (chan)
+ if (chan) {
+ remove_channel_from_store(rstate, chan);
free_chan(rstate, chan);
+ }
uc = tal(rstate, struct unupdated_channel);
uc->channel_announce = tal_dup_arr(uc, u8, msg, tal_count(msg), 0);
diff --git a/tests/test_gossip.py b/tests/test_gossip.py
index 8e36498a1..5efb3dd65 100644
--- a/tests/test_gossip.py
+++ b/tests/test_gossip.py
@@ -1094,10 +1094,22 @@ def test_gossip_store_compact(node_factory, bitcoind):
l1, l2, l3 = node_factory.line_graph(3, fundchannel=False)
# Create channel.
- l2.fund_channel(l3, 10**6)
+ scid23 = l2.fund_channel(l3, 10**6)
+ # Replace the update
l2.rpc.setchannelfee(l3.info['id'], 20, 1000)
- l2.daemon.wait_for_log(r'Received channel_update for channel')
+ l2.daemon.wait_for_log(r'Received channel_update for channel {}'.format(scid23))
+
+ # Have that channel announced.
+ bitcoind.generate_block(5)
+ wait_for(lambda: [c['active'] for c in l2.rpc.listchannels(scid23)['channels']] == [True, True])
+
+ # Create another channel, which will stay private.
+ scid12 = l1.fund_channel(l2, 10**6)
+
+ # Replace the update
+ l1.rpc.setchannelfee(l2.info['id'], 20, 1000)
+ l2.daemon.wait_for_log(r'Received channel_update for channel {}'.format(scid12))
# Now compact store.
l2.rpc.call('dev-compact-gossip-store')
Force-pushed ae55921 to 724b0cf.
Trivial rebase to fix the conflict.
ACK 724b0cf
They're really gossipd-internal, and we don't want per-peer daemons to confuse them with normal updates. I don't bump the gossip_store version; that's coming with another update anyway. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
More useful if something is wrong. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently don't care, but the next patch means we have to find them again. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We're about to bump version again, and the code to upgrade it was quite hairy (and buggy!). It's not worthwhile for such a poorly-tested path; I will just add code to limit how much incoming gossip we get, to avoid flooding when we upgrade. I also use a modern gossip_store version in our test_gossip_store_load test, instead of relying on the upgrade path. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use the high bit of the length field: this way we can still check that the checksums are valid on deleted fields. Once this is done, serially reading the gossip_store file will result in a complete, ordered, minimal gossip broadcast. Also, the horrible corner case where we might try to delete things from the store during load time is completely gone: we only load non-deleted things. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
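A minimal sketch of how marking an entry deleted via the high bit of the length field could look; the 0x80000000 flag value, the big-endian length word at the front of each record, and the helper name are assumptions for illustration, not the project's actual code. The key property is that only the length word is rewritten, so the message body and its checksum stay verifiable.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>      /* pread, pwrite */

#define LEN_DELETED_BIT 0x80000000U   /* assumed flag value */

/* Set the deleted bit in the record's length word, leaving the message
 * body and its checksum untouched so integrity checks still pass. */
static int mark_record_deleted(int store_fd, off_t record_off)
{
	uint32_t belen;

	if (pread(store_fd, &belen, sizeof(belen), record_off) != sizeof(belen))
		return -1;
	belen = htonl(ntohl(belen) | LEN_DELETED_BIT);
	if (pwrite(store_fd, &belen, sizeof(belen), record_off) != sizeof(belen))
		return -1;
	return 0;
}
```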
…pdate. We ask gossipd for the channel_update for the outgoing channel; any other messages it sends us get queued for later processing. But this is overzealous: we can shunt those msgs to the peer while we're waiting. This fixes a nasty case where we have to handle WIRE_GOSSIPD_NEW_STORE_FD messages by queuing the fd for later. This then means that WIRE_GOSSIPD_NEW_STORE_FD can be handled internally inside handle_gossip_msg(), since it's always dealt with the same, simplifying all callers. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Encapsulating the peer state was a win for lightningd; not surprisingly, it's even more of a win for the other daemons, especially as we want to add a little gossip information. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
They already send *us* gossip messages, so they have to be distinct anyway. Why make us both do extra work? Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There's a corner case where otherwise a reader could see the header and not the body of a message. It could handle that in various ways, but simplest (and most efficient) is to avoid it happening. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
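A sketch of the "write header and body together" idea, assuming a hypothetical 12-byte header of length, checksum, and timestamp; this is not the project's actual append routine. Because the header and body go out in a single write(), a sequential reader never sees a header whose body hasn't landed yet.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Append one record with a single write() of header plus body. */
static int append_record(int store_fd, const uint8_t *msg, uint32_t msglen,
			 uint32_t crc, uint32_t timestamp)
{
	uint8_t *buf = malloc(12 + msglen);   /* assumed 12-byte header */
	uint32_t be;
	ssize_t written;

	if (!buf)
		return -1;
	be = htonl(msglen);    memcpy(buf, &be, 4);
	be = htonl(crc);       memcpy(buf + 4, &be, 4);
	be = htonl(timestamp); memcpy(buf + 8, &be, 4);
	memcpy(buf + 12, msg, msglen);

	written = write(store_fd, buf, 12 + msglen);
	free(buf);
	return written == (ssize_t)(12 + msglen) ? 0 : -1;
}
```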
We used to generate this in the caller, then save it in case we needed to retry. We're about to change the message we send to lightningd, so we'll need to regenerate it every time; just hand all the extra args into peer_connected() and we can generate the `connect_peer_connected` msg there. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…elves. Keeping the uintmap ordering all the broadcastable messages is expensive: 130MB for the million-channels project. But now we delete obsolete entries from the store, we can have the per-peer daemons simply read that sequentially and stream the gossip itself. This is the most primitive version, where all gossip is streamed; successive patches will bring back proper handling of timestamp filtering and initial_routing_sync. We add a gossip_state field to track what's happening with our gossip streaming: it's initialized in gossipd, and currently always set, but once we handle timestamps the per-peer daemon may do it when the first filter is sent. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
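A sketch (not the project's code) of what sequential streaming from the store could look like in a per-peer daemon: walk the file record by record and forward anything not marked deleted, with no broadcast-ordering map at all. The header layout, the deleted bit, and the single version byte at the start of the file are the same assumptions as in the sketches above; a real implementation would also frame the message for the peer connection rather than writing it raw.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LEN_DELETED_BIT 0x80000000U   /* assumed, as above */
#define HDR_SIZE 12                   /* assumed: len + crc + timestamp */

/* Walk the store sequentially, forwarding live records to the peer. */
static void stream_store_to_peer(int store_fd, int peer_fd)
{
	uint8_t hdr[HDR_SIZE];
	off_t off = 1;                /* assumed single version byte at start */

	while (pread(store_fd, hdr, HDR_SIZE, off) == HDR_SIZE) {
		uint32_t belen, len, msglen;

		memcpy(&belen, hdr, sizeof(belen));
		len = ntohl(belen);
		msglen = len & ~LEN_DELETED_BIT;

		if (!(len & LEN_DELETED_BIT)) {
			uint8_t *msg = malloc(msglen);
			if (msg
			    && pread(store_fd, msg, msglen, off + HDR_SIZE)
			       == (ssize_t)msglen
			    && write(peer_fd, msg, msglen) != (ssize_t)msglen) {
				free(msg);
				return;   /* peer connection gone */
			}
			free(msg);
		}
		off += HDR_SIZE + msglen;
	}
}
```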
The single caller can easily use transfer_store_msg instead. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(We don't increment the gossip_store version, since there are only a few commits since the last time we did this). This lets the reader simply filter messages; this is especially nice since the channel_announcement timestamp is *derived*, not in the actual message. This also creates a 'struct gossip_hdr' which makes the code a bit clearer. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
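For illustration, a header struct along the lines this commit describes might look like the following; the exact field widths and ordering are assumptions, not the project's definition.

```c
#include <stdint.h>

/* One record header preceding each message in the store (assumed layout,
 * all fields big-endian on disk). */
struct gossip_hdr {
	uint32_t len;        /* high bit set means the entry is deleted */
	uint32_t crc;        /* checksum covering timestamp + message */
	uint32_t timestamp;  /* lets readers filter without parsing the body */
};
```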
We need the timestamp for channel_announcement, but this is simplified because MCP always follows the channel_announcement by a channel_update. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means we intercept the peer's gossip_timestamp_filter request in the per-peer subdaemon itself. The rest of the semantics are fairly simple however. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
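A sketch of the timestamp check a per-peer daemon could apply once it has intercepted the peer's gossip_timestamp_filter; the struct and field names here are hypothetical, chosen to mirror the filter's first-timestamp/range semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-peer filter state, filled in when the peer's
 * gossip_timestamp_filter message arrives. */
struct gossip_filter_example {
	uint32_t first_timestamp;
	uint32_t timestamp_range;
};

/* Stream a record only if its timestamp falls inside the requested window. */
static bool timestamp_wanted(const struct gossip_filter_example *f,
			     uint32_t timestamp)
{
	if (timestamp < f->first_timestamp)
		return false;
	/* Subtract first to avoid overflow in first_timestamp + range. */
	return timestamp - f->first_timestamp < f->timestamp_range;
}
```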
…unt. We didn't count some records before, so we could compare the two counters. This is much simpler, and avoids reliance on bs. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a problem: if we get halfway through writing the compacted store and run out of disk space, we've already changed half the indexes. This changes it so we do nothing until writing is finished: then we iterate through and update indexes. It also weans us off broadcast ordering, which we can now eliminate. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
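The two-phase idea in a sketch: copy every record into the new file first, and only then rewrite the offsets the rest of the daemon relies on, so running out of disk space midway leaves the old store and old indexes intact. The types and the append_to_new_store() helper are hypothetical, not the project's API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index entry: 'offset' is what the daemon uses to find the
 * record in the store. */
struct index_entry_example {
	uint64_t offset;
	uint64_t pending_offset;   /* position in the compacted file */
};

/* Hypothetical helper: copy one record into the new store and return its
 * new offset, or 0 on failure (e.g. out of disk space). */
extern uint64_t append_to_new_store(int new_fd, uint64_t old_offset);

static bool compact_example(int new_fd, struct index_entry_example *idx, size_t n)
{
	/* Phase 1: write everything; no index is touched yet, so a failure
	 * here leaves the daemon pointing at the old, still-valid store. */
	for (size_t i = 0; i < n; i++) {
		idx[i].pending_offset = append_to_new_store(new_fd, idx[i].offset);
		if (idx[i].pending_offset == 0)
			return false;
	}

	/* Phase 2: all data is on disk, so updating the offsets can no
	 * longer leave us half-migrated. */
	for (size_t i = 0; i < n; i++)
		idx[i].offset = idx[i].pending_offset;
	return true;
}
```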
This clarifies things a fair bit: we simply add and remove from the gossip_store directly.

Before this series: (--disable-developer, -Og)
store_load_msec:20669-20902(20822.2+/-82)
vsz_kb:439704-439712(439706+/-3.2)
listnodes_sec:0.890000-1.000000(0.92+/-0.04)
listchannels_sec:11.960000-13.380000(12.576+/-0.49)
routing_sec:3.070000-5.970000(4.814+/-1.2)
peer_write_all_sec:28.490000-30.580000(29.532+/-0.78)

After: (--disable-developer, -Og)
store_load_msec:19722-20124(19921.6+/-1.4e+02)
vsz_kb:288320
listnodes_sec:0.860000-0.980000(0.912+/-0.056)
listchannels_sec:10.790000-12.260000(11.65+/-0.5)
routing_sec:2.540000-4.950000(4.262+/-0.88)
peer_write_all_sec:17.570000-19.500000(18.048+/-0.73)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For some reason I was reluctant to use the hc local variable; I even re-declared it! Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Triggered by a previous variant of this PR, but a good idea to simply discard the store in general when we get a duplicate entry. We crash trying to delete old ones, which means writing to the store. But they should have already been deleted. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(Or, if we crashed before we got to write out the channel_update). It's a corner case, but one reported by @darosior and reproduced on my test node (both with bad gossip_store due to previous iterations of this patchset!). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's not required, but it means valgrind won't complain about leaks. Suggested-by: @m-schmoock Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Force-pushed 724b0cf to dcb8646.
This is based on #2685 which fixes some gossip bugs. Review and merge that first :)
This moves all traversal out to the per-peer daemons. It's pretty nice, actually.
Before this series: (--disable-developer, -Og)
After: (--disable-developer, -Og)