Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cloneBlockchainTime #1506

Merged
merged 7 commits into from Mar 2, 2020
Merged

Conversation

nfrisby
Copy link
Contributor

@nfrisby nfrisby commented Jan 27, 2020

Fixes #1489. Fixes #1524.

Fixing Issue #1489 (let nodes lead when they join) and letting k vary in the
range [2 .. 10] since the dual-ledger tests do that now revealed several
Issues.

Issues in the library, not just the test infrastructure:

Issues only in the test infrastructure:

  • Issue Fix cloning of TestBlockChaintime #1489: let nodes lead when they join. This bug slipped in recently,
    when I added cloning of BlockchainTimes as part of the
    restarting/rekeying loop in the test infrastructure.

  • PBFT reference simulator. (Was masked by 1489.) Model competing 1-block
    chains in Ref.PBFT and use its results where applicable instead of the
    .Expectations module. Check that the PBFT threadnet and Ref.PBFT.simulate
    results agree on Ref.Nominal slots.

    The Ref.PBFT module had been making assumptions that were accurate given
    the guards in RealPBFT generators, given the k >= n regime. But outside
    of that regime, when the security parameter exceeds the node count, it
    wasn't enough. Also, it couldn't be compared against the PBFT threadnet
    because of those assumptions.

  • PBFT reference simulator. (Cascade of above.) Add
    definitelyEnoughBlocks to confirm the "at least k blocks in 2k slots"
    invariant in genRealPBFTNodeJoinPlan. The existing guards in the RealPBFT
    generators are intentionally insufficient by themselves; this way we can
    optimize them to avoid O(n^2) complexity without risking divergence from
    the suchThats.

  • Origin corner-case. (Was masked by 1489.) Discard DualPBFT tests that
    forge in slot 0. The current cardano-ledger-specs doesn't allow for that.
    My hypothesis is that cvsLastSlot would need to be able to represent
    origin.

  • Dlg cert tx. Adjust genNodeRekeys to respect "If node X rekeys in slot
    S and Y leads slot S+1, then either the topology must connect X and Y
    directly, or Y must join before slot S."

  • Dlg cert tx. (Was (statistically?) masked by relatively large k.) Add
    rekeyOracle and use it to determine which epoch number to record in the
    dlg cert. The correct value depends on which block the dlg cert tx will end
    up in, not when we first add it to our mempool.

  • Dlg cert tx. (Was (statistically?) masked by relatively large k.) Add the
    dlg cert tx to the mempool each time the ledger changes.

@@ -0,0 +1,620 @@
{-# LANGUAGE LambdaCase #-}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the commits for a delta without the renaming (can't pass the git diff -M option to GitHub's PR renderer?).

@mrBliss mrBliss added the consensus issues related to ouroboros-consensus label Jan 27, 2020
@nfrisby
Copy link
Contributor Author

nfrisby commented Jan 31, 2020

I've pushed new commits and updated the PR description, but the commit history is still just a trail of breadcrumbs along my bug hunt. When I'm back online in about 6 hours, my immediate goal is to clean up those commits. The tip of this branch has passed 50000 RealPBFT k=2 tests and 50000 RealPBFT k=3 tests and tens of thousands of more varying tests.

@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 3, 2020

That rebase cleaned up the git history, removing a lot of the code that arose during debugging.

test-consensus: let k vary in PBFT and RealPBFT tests b8d8ab5 is still a pretty big commit, but I'm not sure that splitting it up (into up to 3 commits, I'd guess) is worth the effort.

-- not reach it soon enough)
, let nextLeader = Ref.mkLeaderOf params $ succ jslot
, jslot /= coreNodeIdJoinSlot nodeJoinPlan nextLeader ||
cid `elem` coreNodeIdNeighbors nodeTopology nextLeader
-> pure $ NodeRestarts $
Copy link
Contributor Author

@nfrisby nfrisby Feb 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO We also need to prevent the rekeying node from restarting too soon, since it might need to propagate its dlg cert tx more than once.

@nfrisby nfrisby force-pushed the nfrisby/issue-1489-btime-clone-slot0 branch from 3b86ee3 to 980a641 Compare February 6, 2020 15:20
@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 6, 2020

That force push is passing tests. Remaining work:

@nfrisby nfrisby force-pushed the nfrisby/issue-1489-btime-clone-slot0 branch 3 times, most recently from e7c83b5 to 98d9f85 Compare February 6, 2020 17:10
@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 6, 2020

I rebased onto cee384a, because that's the last commit on master that still has the Byron ledger wrapper in this repo. My patch for cardano-ledger Issue 715 cannot exist in this repo beyond that commit.

Unfortunately, the fixes for 1544 (prevPointAndBlockNo) and 1565 (removeTxs) come later.

I'm going to open a PR in cardano-ledger if only to get the ball rolling.

@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 6, 2020

I opened IntersectMBO/cardano-ledger#716, which I think is the necessary fix (it is the same as the patch I have locally on this branch). Once that's merged, then my PR can update the deps in the repo accordingly and therefore rebase all the way to our origin/master.

@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 7, 2020

After rebasing onto master (now that a cardano-ledger fix commit is available for gitfetch), I'm seeing errors that I believe are due to Issue #1147. I see the characteristic FetchDeclineConcurrencyLimit BF decision between a CompletedBlockFetch and CompletedFetchBatch, and if I cherry-pick my patch for that from PR 1131 (i.e. commit 0d2fa18), then the failing repro passes.

I'm guessing the recent changes to ChainSync client (in light of the recent refinement of tips to possibly include "origin" instead of an off-by-one blockNumber = 0") has somehow enabled this family of mini protocol message/event interleavings, which PR 1131 instead has been exploring by adding network latencies.

cc: @dcoutts @edsko @mrBliss

@nfrisby nfrisby force-pushed the nfrisby/issue-1489-btime-clone-slot0 branch from 98d9f85 to 5f71595 Compare February 7, 2020 05:58
iohk-bors bot added a commit that referenced this pull request Feb 25, 2020
1691: consensus: handle EBBs in rewindHeaderState r=nfrisby a=nfrisby

Fixes #1690.

I believe this bug was masked by Issue #1489; I discovered it and confirmed that this patch fixes it on my related WIP branch. This PR does not include a repro for this bug, but that branch does, and PR #1506 will as soon as I'm able to update it with that WIP branch -- that's my current focus.

Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
@nfrisby nfrisby requested a review from edsko February 27, 2020 22:42
@nfrisby nfrisby force-pushed the nfrisby/issue-1489-btime-clone-slot0 branch 2 times, most recently from 48babe5 to 23a0a4b Compare February 27, 2020 22:50
@nfrisby nfrisby marked this pull request as ready for review February 27, 2020 22:50
@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 27, 2020

I eventually realized that the Just-In-Time EBBs (enabled by a recent PR on master) avoid the error case that otherwise required a patch for Issue 1631 (Chain Density check in NodeKernel) and a rewrite of the reference simulator. This meant I could drop the NodeKernel change, which simplified the PR quite a bit. It no longer addresses 1631, leaving that for future work.

@nfrisby nfrisby force-pushed the nfrisby/issue-1489-btime-clone-slot0 branch 2 times, most recently from 6178e19 to 3acec92 Compare March 2, 2020 15:07
Copy link
Contributor

@edsko edsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.

--
-- INVARIANT the 'NonEmpty's are all ascending
--
newtype Pam v k = UnsafePam {getPam :: Map v (NonEmpty k)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to InvertedMap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 3de1fc8

@@ -74,8 +73,10 @@ data TestClock =
deriving (Eq, Generic, NoUnexpectedThunks)

data TestBlockchainTime m = TestBlockchainTime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really ought to move this to test-infra.

ouroboros-consensus-byron/test/Test/ThreadNet/RealPBFT.hs Outdated Show resolved Hide resolved
-- EBB predecessor's hash
-> blk
, ebbSlotBefore :: SlotNo -> SlotNo
-- ^ the slot of the most recent expected that precedes the given slot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this name or this comment :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elaborated in commit 9b2aa86

-- If we add the transaction and then the mempools discards it for some
-- reason, this thread will add it again.
--
forkTxs0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could do with a one or two line explanation of what the purpose of it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elaborated in commit 183da45


-- The test infrastructure allows nodes to forge in slot 0; however, the
-- cardano-ledger-specs code causes @PBFTFailure (SlotNotAfterLastBlock
-- (Slot 0) (Slot 0))@ in that case. So we discard such tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do with a comment that the spec here is therefore more strict than the real implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elaborated in commit d0134ca

@nfrisby nfrisby force-pushed the nfrisby/issue-1489-btime-clone-slot0 branch 3 times, most recently from 8be8d2f to d0134ca Compare March 2, 2020 17:41
@nfrisby
Copy link
Contributor Author

nfrisby commented Mar 2, 2020

OK. I addressed each comment with a commit, and now I'm going to squash those down and then bors it.

@nfrisby nfrisby force-pushed the nfrisby/issue-1489-btime-clone-slot0 branch from d0134ca to 5e8a8d2 Compare March 2, 2020 18:31
@nfrisby
Copy link
Contributor Author

nfrisby commented Mar 2, 2020

bors r+

@iohk-bors
Copy link
Contributor

iohk-bors bot commented Mar 2, 2020

@iohk-bors iohk-bors bot merged commit abc472c into master Mar 2, 2020
@iohk-bors iohk-bors bot deleted the nfrisby/issue-1489-btime-clone-slot0 branch March 2, 2020 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consensus issues related to ouroboros-consensus
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test with k = 2 in test-consensus Fix cloning of TestBlockChaintime
3 participants