Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug reapplyTxSameState: unexpected error: MockInvalidInputs #1505

Closed
nfrisby opened this issue Jan 24, 2020 · 17 comments
Closed

Debug reapplyTxSameState: unexpected error: MockInvalidInputs #1505

nfrisby opened this issue Jan 24, 2020 · 17 comments
Assignees
Labels
bug Something isn't working byron Required for a Byron mainnet: replace the old core nodes with cardano-node consensus issues related to ouroboros-consensus mempool
Milestone

Comments

@nfrisby
Copy link
Contributor

nfrisby commented Jan 24, 2020

This Issue is a bit of a placeholder; I'm not yet sure what the ultimate cause is.

While fixing Issue #1489, I started to see additional errors. One example is:

ouroboros-consensus
  PBFT
    simple convergence: FAIL (0.05s)
      *** Failed! (after 1 test):
      Exception:
        reapplyTxSameState: unexpected error: MockInvalidInputs (fromList [(9b87a9e5,1)])
        CallStack (from HasCallStack):
          error, called at src/Ouroboros/Consensus/Ledger/Mock/Block.hs:288:32 in ouroboros-consensus-0.1.0.0-inplace:Ouroboros.Consensus.Ledger.Mock.Block
          reapplyTxSameState, called at src/Ouroboros/Consensus/Mempool/Impl.hs:575:54 in ouroboros-consensus-0.1.0.0-inplace:Ouroboros.Consensus.Mempool.Impl

So far, I only have a repro. It looks like this:

$ git lg -4
* 8befbfb9 - (HEAD -> nfrisby/issue-1505-repro) WIP narrow test to repro (18 minutes ago) <Nicolas Frisby>
* d0e3c0d2 - consensus: include Tx validation error in error message (18 minutes ago) <Nicolas Frisby>
* 3594250d - consensus: fix TestBlockchainTime cloning (18 minutes ago) <Nicolas Frisby>
*   1dac815e - (origin/master, origin/bors/staging, origin/HEAD) Merge input-output-hk/ouroboros-network#1180 (4 hours ago) <iohk-bors[bot]>
|\  
$

Commit 3594250 is the aforementioned anticipated fix for Issue 1489. Commit d0e3c0d adds the actual Tx invalidity to the error message. Commit 8befbfb narrows test-consensus to just run this one test:

tests :: TestTree
tests = testGroup "PBFT" [
      testProperty "simple convergence" $
        prop_simple_pbft_convergence k $
           -- This is a nondeterministic plan: the pbftLimit is 1 block and c3 would
           -- choose among the three 1-block chains when it leads in slot 28,
           -- but c2 raises a MockInvalidInputs error (during
           -- 'reapplyTxSameState') in slot 26.
           let ncn5 = NumCoreNodes 5 in
           TestConfig
             { numCoreNodes = ncn5
             , numSlots     = NumSlots 100
             , nodeJoinPlan = NodeJoinPlan $ Map.fromList
               [ (CoreNodeId 0, SlotNo 0)   -- 0 only leads this slot
               , (CoreNodeId 1, SlotNo 6)   -- 1 only leads this slot
               , (CoreNodeId 2, SlotNo 22)  -- 2 only leads this slot
               , (CoreNodeId 3, SlotNo 24)
               , (CoreNodeId 4, SlotNo 99)  -- irrelevant, beyond affecting pbftThreshold via numCoreNodes
               ]
             , nodeRestarts = noRestarts
             , nodeTopology = meshNodeTopology ncn5
             , slotLengths  = singletonSlotLengths $ slotLengthFromSec 1
             , initSeed     = Seed (9550173506264790139,4734409083700350196,9697926137031612922,16476814117921936461,9569412668768792610)
             }
    ]
  where
    k = SecurityParam 5

The only hint I have so far is that if I revert PR #1468, then the test no longer fails. The contents of that PR do not obviously cause such an error. I anticipate that it is incriminated only because it changes the blocking behavior of addTxs and there's a bug somewhere in addTxs regarding the logic about whether or not the ledger snapshot being used for validation is stale. (@edsko and I have found some similar issues in that module during separate work on Issue #1297.)

Edit: instead of reverting PR 1468, I can just set this field in nodeArgs in Test.ThreadNet.Network.hs.

  mempoolCap = MempoolCapacityBytesOverride $ MempoolCapacityBytes $ multiplier*256

The capacity was 3000 bytes before PR 1468, and the repro's test now passes if 0 < multiplier < 47, and fail if it's >= 47 -- does that support the addTxs blocking/fingerprinting theory?

@nfrisby nfrisby added bug Something isn't working mempool labels Jan 24, 2020
nfrisby added a commit that referenced this issue Jan 27, 2020
@mrBliss mrBliss added the consensus issues related to ouroboros-consensus label Jan 27, 2020
@nfrisby
Copy link
Contributor Author

nfrisby commented Jan 31, 2020

[Edit: never mind the mention of the invariant in the following. The repro is for PBFT, not RealPBFT.]

Hmmmmm. I just noticed that this repro necessarily violates the "at least k blocks in 2k slots" Byron invariant. That's usually enough to explain weird behavior. But I haven't connected the dots to how that could cause a reapplyTxSameState failure in this case.

(I suppose this test case arose while I was rewriting the generators on my WIP PR but before I had finished doing so. As far as I know, they currently do not create tests cases that violate the "enough blocks" Byron invariant.)

If I understand correctly, nodes 1-2 are stuck in a stand-off (hence the invariant violation) and the only thing (confirm?) they're actively doing is adding txs to their mempool at the onset of each slot.

    forkTxProducer btime cfg produceDRG getExtLedger mempool =
      void $ onSlotChange btime $ \curSlotNo -> do
        varDRG <- uncheckedNewTVarM =<< produceDRG
        txs <- atomically $ do
          ledger <- ledgerState <$> getExtLedger
          simChaChaT varDRG id $ testGenTxs numCoreNodes curSlotNo cfg ledger
        void $ addTxs mempool txs

@nfrisby
Copy link
Contributor Author

nfrisby commented Jan 31, 2020

This just occurred on my WIP PR for Issue 1489, which includes the one-line reversion that resets the max mempool size to 3000. I'll edit this comment soon to manually shrink it and confirm it happens on master too. I'm excited that it only involves two nodes.

      SecurityParam {maxRollbacks = 2}
      TestConfig {numCoreNodes = NumCoreNodes 2, numSlots = NumSlots 56, nodeJoinPlan = NodeJoinPlan (fromList [(CoreNodeId 0,SlotNo {unSlotNo = 0}),(CoreNodeId 1,SlotNo {unSlotNo = 23})]), nodeRestarts = NodeRestarts (fromList []), nodeTopology = NodeTopology (fromList [(CoreNodeId 0,fromList []),(CoreNodeId 1,fromList [CoreNodeId 0])]), slotLengths = SlotLengths {currentSlotLength = SlotLength {getSlotLength = 1s}, nextSlotLengths = Nothing}, initSeed = Seed {getSeed = (12583441396583418650,10344129402959190140,9790632537009006143,7136608049403144127,5991750057253903322)}}

Edit: And another.

      SecurityParam {maxRollbacks = 2}
      TestConfig {numCoreNodes = NumCoreNodes 2, numSlots = NumSlots 51, nodeJoinPlan = NodeJoinPlan (fromList [(CoreNodeId 0,SlotNo {unSlotNo = 0}),(CoreNodeId 1,SlotNo {unSlotNo = 13})]), nodeRestarts = NodeRestarts (fromList []), nodeTopology = NodeTopology (fromList [(CoreNodeId 0,fromList []),(CoreNodeId 1,fromList [CoreNodeId 0])]), slotLengths = SlotLengths {currentSlotLength = SlotLength {getSlotLength = 1s}, nextSlotLengths = Nothing}, initSeed = Seed {getSeed = (7599403326098459112,3329525978019681109,5919560847621141701,14182739294269672657,15340091703535017950)}}

Observation: both of these have two nodes that both lead in the slot they join.

Edit And another.

      SecurityParam {maxRollbacks = 4}
      TestConfig {numCoreNodes = NumCoreNodes 3, numSlots = NumSlots 52, nodeJoinPlan = NodeJoinPlan (fromList [(CoreNodeId 0,SlotNo {unSlotNo = 0}),(CoreNodeId 1,SlotNo {unSlotNo = 25}),(CoreNodeId 2,SlotNo {unSlotNo = 29})]), nodeRestarts = NodeRestarts (fromList []), nodeTopology = NodeTopology (fromList [(CoreNodeId 0,fromList []),(CoreNodeId 1,fromList [CoreNodeId 0]),(CoreNodeId 2,fromList [CoreNodeId 0,CoreNodeId 1])]), slotLengths = SlotLengths {currentSlotLength = SlotLength {getSlotLength = 1s}, nextSlotLengths = Nothing}, initSeed = Seed {getSeed = (18132660796886713536,10773371602508597336,2703129924415452064,8277492550891641619,16125829469494116032)}}

Edit: All four of those fail as expected on master after cherry-picking only the basic testBlockchainTimeClone fix for Issue 1489. Two don't shrink and the other two do shrink but with a different failure that the rest of PR 1504 for Issue 1489 fixes, but I'm hoping to debug without involving that new code. All four are relatively minimal as is, especially those with 2 nodes.

@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 3, 2020

I started debugging this one.

      SecurityParam {maxRollbacks = 2}
      TestConfig {numCoreNodes = NumCoreNodes 2, numSlots = NumSlots 56, nodeJoinPlan = NodeJoinPlan (fromList [(CoreNodeId 0,SlotNo {unSlotNo = 0}),(CoreNodeId 1,SlotNo {unSlotNo = 23})]), nodeRestarts = NodeRestarts (fromList []), nodeTopology = NodeTopology (fromList [(CoreNodeId 0,fromList []),(CoreNodeId 1,fromList [CoreNodeId 0])]), slotLengths = SlotLengths {currentSlotLength = SlotLength {getSlotLength = 1s}, nextSlotLengths = Nothing}, initSeed = Seed {getSeed = (12583441396583418650,10344129402959190140,9790632537009006143,7136608049403144127,5991750057253903322)}}

I'm not sure yet, but my leading theory is that there's a race condition involving this line:

https://github.com/input-output-hk/ouroboros-network/blob/a43961c1133b9e1b0826f3d4d1bc5b93819037d6/ouroboros-consensus/src/Ouroboros/Consensus/NodeKernel.hs#L470

It's looking like it removes a transaction that a subsequent one depends on, but then that leaves a dangling UTxO reference.

@edsko
Copy link
Contributor

edsko commented Feb 3, 2020

But wouldn't that just happen on the next validation? Or does something go wrong at that point (perhaps a "did we already validate this?" check incorrectly saying "yeah, this is fine?").

@mrBliss
Copy link
Contributor

mrBliss commented Feb 3, 2020

I had a quick look at this and my first suspicion is that applyChainTick in Ledger.Mock.Block is not actually updating the last slot in the mock ledger state. This means that the mempool incorrectly thinks the ledger state hasn't changed and uses reapplyTxSameState. However, @edsko just told me that he talked about this with you and that you're already aware of this. However, looking at your WIP branch: #1506, I don't see any changes to Ledger.Mock.Block fixing this. Maybe I'm looking at the wrong branch? Or did you forget to push your latest changes?

@mrBliss
Copy link
Contributor

mrBliss commented Feb 3, 2020

Also, I think that the idea of ledgerStatePoint is not correct. Say we have an empty ledger state with no applied blocks, so the last hash is GenesisHash. When we now "tick" it using a SlotNo s, which Point would we return? We can't return GenesisPoint, but we can't return BlockPoint either, because we don't have a hash.

A simple fix would be to return (WithOrigin SlotNo, ChainHash blk) instead of Point blk, but that would allow for (Origin, BlockHash h), which is an invalid combination. A new type would be better, e.g.:

data LedgerTip blk
  = NoAppliedBlock !(WithOrigin SlotNo)
    -- ^ No block has been applied to the ledger. When the ledger hasn't been
    -- \"ticked\", the argument will be 'Origin', otherwise, it will be @'At'
    -- slotNo@ where @slotNo@ is the ticked slot number.
  | AppliedBlock !(HeaderHash blk) !SlotNo
    -- ^ A block with the given hash has been applied. The 'SlotNo' will be
    -- that of the block, unless the ledger has been \"ticked\" with a more
    -- recent slot number.

@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 3, 2020

I had a quick look at this and my first suspicion is that applyChainTick in Ledger.Mock.Block is not actually updating the last slot in the mock ledger state.

Yes, @edsko and I noticed that as part of Issue #1297, which I paused work on when I found this error case (and others). It seems that my work here is coming full circle.

@mrBliss Thanks for sharing your helpful thoughts; that's where I was headed with my #1297 approach. Edsko and I had determined to attempt a middle ground, in part because the existing Byron ledger state type cannot correctly represent all of the cases near Origin: it has cvsLastSlot :: SlotNo instead of cvsLastSlot :: WithOrigin SlotNo, e.g.

@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 4, 2020

Edit 1: Hmm. I don't know why the removeTxs call seemingly doesn't remove the dependent txs.

Edit 2: Aha! That removeTxs call also uses the fast path (same reason as below, I think) when removing the removed txs' dependents. But vrAfter isn't forced by the rest of the removeTxs logic, so those reapplyTxSameStates failures' thunks are never forced. If I force vrAfter as part of implRemoveTxs, then I see the same error, but earlier: slot 14 and before the TraceMempoolManuallyRemovedTxs event is emitted.

Edit 3: Should the validation after removeTxs ever use the fast path? If I force it not to, these repros no longer fail. (Well, they fail differently, in an expected way since I only cherry-picked one commit from PR 1506.) It's a pretty surgical fix, but I'm not sure it's not myopic.

OK, here's an explanation of the simplest repro's failure. @edsko @mrBliss @intricate

  • c0 joins in slot 0 and forges and adopts block 39e157ff. c0 can't forge anymore blocks until someone else does because of the PBFT threshold (nn = 2, k = 2).

https://github.com/input-output-hk/ouroboros-network/blob/c1b9ba34ee2eac870a10e5194a9e2c85ddf595ec/ouroboros-consensus/test-consensus/Test/ThreadNet/PBFT.hs#L50

  • c1 joins in slot 13 and forges its own block before syncing with c0. c0 doesn't select c1's new block because it has the same block number as c0's selected block.

  • Also in slot 13, c0 receives the a69096cc and d4155c3d txs from c1.

TraceLabelPeer (CoreId (CoreNodeId 1)) Recv MsgReplyTxs [SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(076d805e,0)]) [(\"b\",482),(\"a\",518)], simpleGenTxId = a69096cc},SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(076d805e,1)]) [(\"a\",287),(\"b\",713)], simpleGenTxId = d4155c3d}]
  • In slot 14, c0 forges a block with the a69096cc and d4155c3d txs, but it's
    invalid because of the PBFT threshold. c0 therefore doubts those txs and
    removes them from its mempool.

    However, before removing those txs, c0 also receives from c1 the 52125d90
    and d89f9e6f txs, which depend on the a69096cc and d4155c3d txs.

https://github.com/input-output-hk/ouroboros-network/blob/c1b9ba34ee2eac870a10e5194a9e2c85ddf595ec/ouroboros-consensus/src/Ouroboros/Consensus/NodeKernel.hs#L462-L470

CoreNodeId 0 TraceLabelPeer (CoreId (CoreNodeId 1)) Recv MsgReplyTxs
  [ SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(a69096cc,1),(d4155c3d,0)]) [("b",202),("a",603)], simpleGenTxId = 52125d90}
  , SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(a69096cc,0),(d4155c3d,1)]) [("a",263),("b",932)], simpleGenTxId = d89f9e6f}
  ]
TraceForgedBlock (SlotNo {unSlotNo = 14}) (SimpleBlock {simpleHeader = SimpleHeader {simpleHeaderHash = b267c2ab, simpleHeaderStd = SimpleStdHeader {simplePrev = BlockHash 39e157ff, simpleSlotNo = SlotNo {unSlotNo = 14}, simpleBlockNo = BlockNo {unBlockNo = 2}, simpleBodyHash = 4f4521e3, simpleBlockSize = 56}, simpleHeaderExt = SimplePBftExt {simplePBftExt = PBftFields {pbftIssuer = VerKeyMockDSIGN 0, pbftGenKey = VerKeyMockDSIGN 0, pbftSignature = SignedDSIGN (SigMockDSIGN "9r\141\149" 0)}}}, simpleBody = SimpleBody {simpleTxs = [UnsafeTx (fromList [(076d805e,0)]) [("b",482),("a",518)],UnsafeTx (fromList [(076d805e,1)]) [("a",287),("b",713)]]}}) (MempoolSize {msNumTxs = 2, msNumBytes = 70})
  [ SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(076d805e,0)]) [("b",482),("a",518)], simpleGenTxId = a69096cc}
  , SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(076d805e,1)]) [("a",287),("b",713)], simpleGenTxId = d4155c3d}
  ]
TraceMempoolAddTxs
  [ SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(a69096cc,0),(d4155c3d,1)]) [("a",263),("b",932)], simpleGenTxId = d89f9e6f}
  , SimpleGenTx {simpleGenTx = UnsafeTx (fromList [(a69096cc,1),(d4155c3d,0)]) [("b",202),("a",603)], simpleGenTxId = 52125d90}
  ] (MempoolSize {msNumTxs = 4, msNumBytes = 157})
...
TraceMempoolManuallyRemovedTxs [a69096cc,d4155c3d] [] (MempoolSize {msNumTxs = 2, msNumBytes = 87})
  • In slot 15, c0 adds some more txs, but in doing so, it must first reapply
    52125d90 and d89f9e6f.

    However, the code is using the fast path in validateStateFor, which
    assumes all existing txs are valid while 52125d90 and d89f9e6f are now
    invalid; hence, the test fails.

CoreNodeId 0 ADDING (SlotNo {unSlotNo = 15},SimpleLedgerState
  { simpleLedgerState = MockState {mockUtxo = fromList [((076d805e,0),("a",1000)),((076d805e,1),("b",1000))]
  , mockConfirmed = fromList [076d805e]
  , mockTip = At (Block {blockPointSlot = SlotNo {unSlotNo = 0}, blockPointHash = 39e157ff})}
  },...)
("FAST"
  , blockSlot = TxsForUnknownBlock
  , st        = ... as in the ADDING message above ...
  , isTip     = BlockHash 39e157ff
  , isSlotNo  = At (SlotNo {unSlotNo = 1})
  )
  • It is ultimately using the fast path because of the following definition;
    note that the ledgerTipSlot is 0 is in this case.

https://github.com/input-output-hk/ouroboros-network/blob/c1b9ba34ee2eac870a10e5194a9e2c85ddf595ec/ouroboros-consensus/src/Ouroboros/Consensus/Mempool/Impl.hs#L667-L677

  • In other words "the next available slot" is being defined as "the slot
    after the last block on our chain". But it seems that's Is that the wrong basis if
    time has since passed without the chain being extended . ?

@mrBliss
Copy link
Contributor

mrBliss commented Feb 4, 2020

@nfrisby Excellent debugging! I agree that removeTxs should indeed not use the fast path. My apologies, I wrote that code 😳. I'll create a ticket for it.

@edsko
Copy link
Contributor

edsko commented Feb 4, 2020

Ok, so, @mrBliss and I discussed this.

You indeed identified the bug. Let's summarize what's going on. Suppose the mempool contains two transactions A and B, where B depends on A. Now we delete A from the mempool. The bug you identified means that B is not removed. For a while I thought that this didn't matter; after all, when we apply the next block to the mempool, or indeed get the contents of the mempool to produce a block, we will revalidate anyway. However, if before any of that happens we add a new transaction C to the mempool, we need a ledger state in order to validate C. To do that, we use reapplyTxSameState for the existing transactions---that is, B---and since B is no longer actually valid, that goes wrong. Hence, we need an invariant that says that isTxs must be valid with respect to isTip and isSlotNo.

In other words "the next available slot" is being defined as "the slot after the last block on our chain". But it seems that's Is that the wrong basis if time has since passed without the chain being extended . ?

I think this is stlll okay, isn't it? Sure, it might be possible that when you add a transaction into the mempool, it might already have expired with respect to the wall clock, but it will nonetheless be valid with respect to the ticked ledger state slot no. I don't think this could cause trouble, but if you think otherwise, I'm all ears :) We have a ticket for changing this (IntersectMBO/ouroboros-consensus#744) but so far I'm regarding this as strictly an enhancement only.

@mrBliss mrBliss self-assigned this Feb 4, 2020
@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 4, 2020

@edsko I agree that Issue 1298 is an enhancement: it might save some churn but that's it. In particular, the TxsForBlockInSlot argument in this piece of code ensures that only valid txs actually make it into a block, right?

https://github.com/input-output-hk/ouroboros-network/blob/8ef7cb074629117e17d124d02eb21fa3ccfabb6b/ouroboros-consensus/src/Ouroboros/Consensus/NodeKernel.hs#L413-L423

@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 4, 2020

@edsko @mrBliss If there are no loose ends in the comments on this Issue, I think we can close it in favor of the new Issue 1565.

The only obstacle I see is that the repros I'm aware of depend on a fix to Issue 1489; I'm working on PR #1506 for that Issue.

@edsko
Copy link
Contributor

edsko commented Feb 5, 2020

Let's close it after we validate that @mrBliss fix-to-come to #1565 also fixes this?

@mrBliss mrBliss added this to the S6 2020-02-13 milestone Feb 10, 2020
@mrBliss
Copy link
Contributor

mrBliss commented Feb 11, 2020

@nfrisby can you check whether #1599 (the fix for #1565) fixed it?

@edsko edsko added byron Required for a Byron mainnet: replace the old core nodes with cardano-node priority low and removed priority high labels Feb 11, 2020
@edsko
Copy link
Contributor

edsko commented Feb 11, 2020

Reprioritizing as low priority, since we are pretty sure this is fixed now after #1599 , though it would still be good to verify. Once @nfrisby does that, we can close this ticket.

@edsko edsko modified the milestones: S6 2020-02-13, S7 2020-02-27 Feb 13, 2020
@edsko edsko modified the milestones: S7 2020-02-27, S8 2020-03-12 Feb 27, 2020
@nfrisby
Copy link
Contributor Author

nfrisby commented Feb 28, 2020

Yes; I confirm it works. Closing.

In particular, I've now added a repro for it (commit 6178e19) to the tip of my pending PR #1506 -- I had accidentally dropped that repro during the various rebases.

@nfrisby nfrisby closed this as completed Feb 28, 2020
@edsko
Copy link
Contributor

edsko commented Mar 2, 2020

Hurray! :D

iohk-bors bot added a commit that referenced this issue Mar 2, 2020
1506: Fix cloneBlockchainTime r=nfrisby a=nfrisby

Fixes #1489. Fixes #1524.

Fixing Issue #1489 (let nodes lead when they join) and letting k vary in the
range [2 .. 10] since the dual-ledger tests do that now revealed several
Issues.

Issues in the library, not just the test infrastructure:

  * Issue #1505 -- `removeTxs` cannot use the fast path when validating after
    removing or else we might have dangling tx inputs references. Was fixed
    by #1565.

  * Issue #1511, bullet 1 (closed) -- The `Empty` cases in
    `prevPointAndBlockNo` were wrong. Recent PRs have addressed this: #1544 and
    #1589.

  * Issue #1543 (closed) -- A bracket in `registeredStream` was spoiled by an
    interruptible operation. Thomas' PR re-designed the vulnerability away. I
    think this is unrelated to the other changes; it was lurking and happened
    to pop up here just because I've been running hundreds of thousands of
    tests.

Issues only in the test infrastructure:

  * Issue #1489: let nodes lead when they join. This bug slipped in recently,
    when I added cloning of `BlockchainTime`s as part of the
    restarting/rekeying loop in the test infrastructure.

  * _PBFT reference simulator_. (Was masked by 1489.) Model competing 1-block
    chains in Ref.PBFT and use its results where applicable instead of the
    .Expectations module. Check that the PBFT threadnet and Ref.PBFT.simulate
    results agree on `Ref.Nominal` slots.

    The Ref.PBFT module had been making assumptions that were accurate given
    the guards in RealPBFT generators, given the `k >= n` regime. But outside
    of that regime, when the security parameter exceeds the node count, it
    wasn't enough. Also, it couldn't be compared against the PBFT threadnet
    because of those assumptions.

  * _PBFT reference simulator_. (Cascade of above.) Add
    `definitelyEnoughBlocks` to confirm the "at least k blocks in 2k slots"
    invariant in `genRealPBFTNodeJoinPlan`. The existing guards in the RealPBFT
    generators are intentionally insufficient by themselves; this way we can
    optimize them to avoid O(n^2) complexity without risking divergence from
    the `suchThat`s.

  * _Origin corner-case_. (Was masked by 1489.) Discard DualPBFT tests that
    forge in slot 0. The current `cardano-ledger-specs` doesn't allow for that.
    My hypothesis is that `cvsLastSlot` would need to be able to represent
    origin.

  * _Dlg cert tx_. Adjust `genNodeRekeys` to respect "If node X rekeys in slot
    S and Y leads slot S+1, then either the topology must connect X and Y
    directly, or Y must join before slot S."

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add
    `rekeyOracle` and use it to determine which epoch number to record in the
    dlg cert. The correct value depends on which block the dlg cert tx will end
    up in, not when we first add it to our mempool.

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add the
    dlg cert tx to the mempool each time the ledger changes.

Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working byron Required for a Byron mainnet: replace the old core nodes with cardano-node consensus issues related to ouroboros-consensus mempool
Projects
None yet
Development

No branches or pull requests

3 participants