Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for HTLC db issues #2000

Merged
merged 21 commits into from Oct 9, 2018

Conversation

rustyrussell
Copy link
Contributor

Puzzled about two PR's, I wrote a stress test, which revealed a number of bugs, including those which caused the horrific issues #1960 and #1993.

It also triggers a yet-to-be-diagnosed occasional issue where we dislike the peer's option_dataloss_protect; I have temporarily disabled that check in order to resolve these more important issues, but am now chasing that one.

Fixes: #1960
Fixes: #1993

… is True

This is what the callers want; generalize it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
lightningd: Outstanding taken pointers: lightningd/pay.c:243:channel_update

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means we need to check when we've altered the state, so the checks
are moved to the callers of htlc_in_update_state and htlc_out_update_state.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…utgoing.

Usually, we only close an incoming HTLC once the outgoing HTLC is completely
resolved.  However, we short-cut this in the FULFILL case: we have the
preimage, so might as well use it immediately (in fact, we wait for it to
be committed, but we don't need to in theory).

As a side-effect of this, our assumption that every outgoing HTLC has
a corresponding incoming HTLC is incorrect, and this test (xfail) tickles
that corner case.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
… outgoing.

We now need an explicit 'local' flag, rather than relying on the existence
of the 'in' pointer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…reed.

Now we know this can happen (see previous patch), we need to handle it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Which we don't handle, due to a separate bug, so it's xfail.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…hout a preimage.

We need to handle this case (old db) before the next commit, which actually
fixes it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can now wrap the 'missing preimage' hack in COMPAT_V061.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Under stress, it fails (test_restart_many_payments, the next test).

I suspect a deep misunderstanding in the comparison code, will chase
separately.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't save them to the database, so fix things up as we load them.

Next patch will actually save them into the db, and this will become
COMPAT code.

Also: call htlc_in_check() with NULL on db load, as otherwise it aborts
internally.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't write it in there yet, so this change currently has no effect.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now we can finally move the fixup code under COMPAT_V061, so it's only
for old nodes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
failoutchannel tells us which channel to send an update for (specifically
for temporary_channel_failure); but we don't save it into the db.  It's
not even clear we should, since it's a corner case and the channel might
not even exist when we come back.

So on db restore, change such errors to WIRE_TEMPORARY_NODE_FAILURE
which doesn't need an update.

We also don't memset it to 0 in the normal case (we only access if it
failcode has the UPDATE bit set) so valgrind will trigger if we're
wrong.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't expect payment or payment->route_channels to be NULL without an
old db, but putting an assert there reveals that we try to fail an HTLC
which has already succeeded in 'test_onchain_unwatch'.

Obviously we only want to fail an HTLC which goes onchain if we don't
already have the preimage!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Copy link
Member

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks ok from my side, just some minor nits about naming and comments. The only thing that might take some time, but I think is a nice cleanup, is to remove the preimage from hout altogether and making the state an explicit enum field, to make it easier to reason about the states that an HTLC may be in.

ACK 362fe44

# Make sure everyone sees all channels: we can cheat and
# simply check the ends (since it's a line).
wait_for(lambda: both_dirs_ready(nodes[0], scids[-1]))
wait_for(lambda: both_dirs_ready(nodes[-1], scids[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification


if not announce:
return nodes

bitcoin.generate_block(5)

def both_dirs_ready(n, scid):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will take a really long time without DEVELOPER=1, since we can't set the broadcast interval, so each hop adds 30 wait time this way. But it seems we mostly wait anyway (or have a 2-line-graph which does not suffer this problem).


# Now, l2 should restore from DB fine, even though outgoing HTLC no longer
# has an incoming.
l2.restart()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically restarting does not trigger a reconnect, and it's not needed to check the HTLC restore from the database. It might be interesting nevertheless to have them reconnect and check that they clear correctly, not just load correctly from the DB. A bunch of wait_fors should suffice to see the HTLC settle correctly.

Covered in 4f42de2

@@ -73,6 +73,9 @@ struct htlc_out {
/* If we fulfilled, here's the preimage. */
struct preimage *preimage;

/* Is this local? Implies ->in is NULL. */
bool local;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're already using local in the context of a single channel quite extensively, could we name this origin or sender instead? I'd like to separate the nomenclatures of channel context from the multi-channel context a bit more, to avoid confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

/* FIXME: COMPAT_V061 only */
log_broken(ld->log,
"Missing preimage for orphaned HTLC; replacing with zeros");
hout->preimage = talz(hout, struct preimage);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears we are using the preimage solely to test for success and to pass it to the payment. So it is never used on the downstream side, only on the upstream side (towards the sender), so why are we keeping this at all?

I might have been a bit overzealous in saving it on both sides, when it could just have been part of the state. This might be part of a bigger PR that drops the preimage from hout, but it'd allow us to avoid a compat flag here (and subsequent forced global reload).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a FIXME.

# Manually create routes, get invoices
route = []
payment_hash = []
for i in range(len(innodes)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a tuple of (route, hash) seems to be preferable here. it'd make the unpacking for loop below easier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also since we are iterating over innodes we could extend this to be (innode, route, hash) making the loop really trivial below.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the spirit of pythonicity, this is a bit nicer:

import itertools
for inode, outnode, outchan, inchan in itertools.izip(innodes,outnodes,outchans,inchans):

via https://stackoverflow.com/a/6498180


# sendpay is async.
for i in range(len(payment_hash)):
innodes[i // 2].rpc.sendpay(route[i], payment_hash[i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The [i // 2] seems strange here, shouldn't we iterate over all innodes not just the first half?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two payments per innode...

wallet/wallet.c Outdated
@@ -1346,6 +1346,34 @@ static bool wallet_stmt2htlc_out(struct channel *channel,
return ok;
}

/* We didn't used to save failcore, failuremsg... */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "didn't used to"

@@ -1378,19 +1385,20 @@ static void fixup_hin(struct wallet *wallet, struct htlc_in *hin)
if (hin->failcode || hin->failuremsg)
return;

hin->failcode = WIRE_TEMPORARY_CHANNEL_FAILURE;
hin->failcode = WIRE_TEMPORARY_NODE_FAILURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This and the next change could be added to the commit that added this fix in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. I only realized it when I found that issue though...

#endif
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra newline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll kill that, too.

Noted by @cdecker, the term 'local' is grossly overused, and the hout
preimage is basically only used as a sanity check (though I've just put
a FIXME there for now).

Also eliminated spurious blank line which crept into wallet.c.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@cdecker
Copy link
Member

cdecker commented Oct 9, 2018

Awesome, thanks for the fixes @rustyrussell

ACK 14e2798

The merge request should disappear when doing a Rebase and merge, hopefully 🤞

# We agree on fee change first, then add HTLC, then remove; stop after remove.
disconnects = ['+WIRE_COMMITMENT_SIGNED*3']
# We manually reconnect l2 & l3, after 100 blocks; hence allowing manual
# reconnect, but disabling auto coneect, and massive cltv so 2/3 doesn't
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: coneect -> connect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to rebase anyway due to CHANGELOG conflict, so I snuck this in too... thanks!

struct preimage *preimage;

/* Is this local? Implies ->in is NULL. */
bool local;
/* Is this a locally-generated payment? Implies ->in is NULL. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much helpful 👍

# Manually create routes, get invoices
route = []
payment_hash = []
for i in range(len(innodes)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the spirit of pythonicity, this is a bit nicer:

import itertools
for inode, outnode, outchan, inchan in itertools.izip(innodes,outnodes,outchans,inchans):

via https://stackoverflow.com/a/6498180

IIUC, namedtuple is like tuple for grown-ups: Pythonify!

Suggested-by: @cdecker
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell merged commit a3e87af into ElementsProject:master Oct 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants