Debug reapplyTxSameState: unexpected error: MockInvalidInputs #1505
[Edit: never mind the mention of the invariant in the following. The repro is for PBFT, not RealPBFT.]
Hmmmmm. I just noticed that this repro necessarily violates the "at least k blocks in 2k slots" Byron invariant. That's usually enough to explain weird behavior. But I haven't connected the dots to how that could cause a
(I suppose this test case arose while I was rewriting the generators on my WIP PR but before I had finished doing so. As far as I know, they currently do not create test cases that violate the "enough blocks" Byron invariant.)
If I understand correctly, nodes 1-2 are stuck in a stand-off (hence the invariant violation) and the only thing (confirm?) they're actively doing is adding txs to their mempool at the onset of each slot.
|
This just occurred on my WIP PR for Issue 1489, which includes the one-line reversion that resets the max mempool size to
Edit: And another.
Observation: both of these have two nodes that both lead in the slot they join.
Edit: And another.
Edit: All four of those fail as expected on master after cherry-picking only the basic |
I started debugging this one.
I'm not sure yet, but my leading theory is that there's a race condition involving this line: It's looking like it removes a transaction that a subsequent one depends on, but then that leaves a dangling UTxO reference. |
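To make the suspected failure mode concrete, here is a minimal sketch of a toy mempool in which removing one transaction leaves a later one with a dangling input. Everything in it (`Tx`, `applyTx`, `removeFast`, `removeSafe`, the list-based UTxO) is a hypothetical stand-in for illustration, not the actual mempool code; it only assumes that a "fast path" means skipping revalidation of the remaining transactions.

```haskell
import Data.List ((\\))

-- Toy model, not the real mempool: an input names (creating tx id, output index).
type TxId = Int
type TxIn = (TxId, Int)
type UTxO = [TxIn]

data Tx = Tx
  { txId      :: TxId
  , txInputs  :: [TxIn]
  , txOutputs :: Int      -- number of outputs this tx creates
  } deriving (Eq, Show)

-- Apply one tx: every input must currently be unspent.
applyTx :: UTxO -> Tx -> Either String UTxO
applyTx utxo tx
  | all (`elem` utxo) (txInputs tx) =
      Right ([(txId tx, i) | i <- [0 .. txOutputs tx - 1]] ++ (utxo \\ txInputs tx))
  | otherwise = Left ("dangling input in tx " ++ show (txId tx))

-- Full revalidation: replay the txs in order, dropping any that no longer apply.
revalidate :: UTxO -> [Tx] -> [Tx]
revalidate utxo0 = reverse . fst . foldl step ([], utxo0)
  where
    step (kept, utxo) tx = case applyTx utxo tx of
      Right utxo' -> (tx : kept, utxo')
      Left _      -> (kept, utxo)

-- Hypothetical fast path: drop the named txs and trust earlier validation.
removeFast :: [TxId] -> [Tx] -> [Tx]
removeFast ids = filter ((`notElem` ids) . txId)

-- Safe variant: drop the named txs, then revalidate what is left.
removeSafe :: UTxO -> [TxId] -> [Tx] -> [Tx]
removeSafe ledger ids = revalidate ledger . removeFast ids

-- tx2 spends the only output of tx1.
ledger0 :: UTxO
ledger0 = [(0, 0)]

tx1, tx2 :: Tx
tx1 = Tx 1 [(0, 0)] 1
tx2 = Tx 2 [(1, 0)] 1

main :: IO ()
main = do
  print (removeFast [1] [tx1, tx2])          -- tx2 survives with a dangling input
  print (removeSafe ledger0 [1] [tx1, tx2])  -- tx2 is dropped as well
```

In this toy model the fast path keeps `tx2` even though its input `(1, 0)` no longer exists once `tx1` is gone, while the revalidating variant drops it.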
But wouldn't that just happen on the next validation? Or does something go wrong at that point (perhaps a "did we already validate this?" check incorrectly saying "yeah, this is fine?"). |
I had a quick look at this and my first suspicion is that |
Also, I think that the idea of A simple fix would be to return
    data LedgerTip blk
      = NoAppliedBlock !(WithOrigin SlotNo)
        -- ^ No block has been applied to the ledger. When the ledger hasn't been
        -- "ticked", the argument will be 'Origin', otherwise, it will be @'At'
        -- slotNo@ where @slotNo@ is the ticked slot number.
      | AppliedBlock !(HeaderHash blk) !SlotNo
        -- ^ A block with the given hash has been applied. The 'SlotNo' will be
        -- that of the block, unless the ledger has been "ticked" with a more
        -- recent slot number.
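As a rough illustration of how such a type could be consumed, the sketch below defines simplified stand-ins and two helpers. It is an assumption-laden toy: `SlotNo` and `WithOrigin` are redefined locally, the tip is parameterised directly by a hash type instead of `HeaderHash blk`, and `ledgerTipSlot` and `tickTip` are hypothetical names that do not come from the codebase.

```haskell
import Data.Word (Word64)

-- Local stand-ins for the real types (assumptions for this sketch only).
newtype SlotNo = SlotNo Word64 deriving (Eq, Ord, Show)

data WithOrigin t = Origin | At t deriving (Eq, Ord, Show)

-- Simplified: parameterised by the hash type rather than by the block type.
data LedgerTip hash
  = NoAppliedBlock !(WithOrigin SlotNo)
  | AppliedBlock !hash !SlotNo
  deriving (Eq, Show)

-- Hypothetical helper: the slot the ledger state is currently "at".
ledgerTipSlot :: LedgerTip hash -> WithOrigin SlotNo
ledgerTipSlot (NoAppliedBlock s) = s
ledgerTipSlot (AppliedBlock _ s) = At s

-- Hypothetical helper: ticking only ever moves the recorded slot forward.
tickTip :: SlotNo -> LedgerTip hash -> LedgerTip hash
tickTip s (NoAppliedBlock prev) = NoAppliedBlock (max prev (At s))
tickTip s (AppliedBlock h prev) = AppliedBlock h (max prev s)

main :: IO ()
main = do
  let fresh   = NoAppliedBlock Origin :: LedgerTip String
      applied = AppliedBlock "hash-a" (SlotNo 3)
  print (ledgerTipSlot (tickTip (SlotNo 5) fresh))
  print (ledgerTipSlot (tickTip (SlotNo 5) applied))
```

The point of the two constructors is that a ticked ledger with no applied block and a ledger whose last block sits at an older slot can both report a slot number without pretending a block exists there.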
|
Yes, @edsko and I noticed that as part of Issue #1297, which I paused work on when I found this error case (and others). It seems that my work here is coming full circle. @mrBliss Thanks for sharing your helpful thoughts; that's where I was headed with my #1297 approach. Edsko and I had determined to attempt a middle ground, in part because the existing Byron ledger state type cannot correctly represent all of the cases near |
Edit 1: Hmm. I don't know why the Edit 2: Aha! That Edit 3: Should the validation after OK, here's an explanation of the simplest repro's failure. @edsko @mrBliss @intricate
|
@nfrisby Excellent debugging! I agree that |
Ok, so, @mrBliss and I discussed this. You indeed identified the bug. Let's summarize what's going on. Suppose the mempool contains two transactions
I think this is still okay, isn't it? Sure, it might be possible that when you add a transaction into the mempool, it might already have expired with respect to the wall clock, but it will nonetheless be valid with respect to the ticked ledger state slot no. I don't think this could cause trouble, but if you think otherwise, I'm all ears :) We have a ticket for changing this (IntersectMBO/ouroboros-consensus#744) but so far I'm regarding this as strictly an enhancement only. |
@edsko I agree that Issue 1298 is an enhancement: it might save some churn but that's it. In particular, the |
Hurray! :D |
1506: Fix cloneBlockchainTime r=nfrisby a=nfrisby

Fixes #1489. Fixes #1524.

Fixing Issue #1489 (let nodes lead when they join) and letting k vary in the range [2 .. 10], since the dual-ledger tests do that now, revealed several Issues.

Issues in the library, not just the test infrastructure:
* Issue #1505 -- `removeTxs` cannot use the fast path when validating after removing or else we might have dangling tx input references. Was fixed by #1565.
* Issue #1511, bullet 1 (closed) -- The `Empty` cases in `prevPointAndBlockNo` were wrong. Recent PRs have addressed this: #1544 and #1589.
* Issue #1543 (closed) -- A bracket in `registeredStream` was spoiled by an interruptible operation. Thomas' PR re-designed the vulnerability away. I think this is unrelated to the other changes; it was lurking and happened to pop up here just because I've been running hundreds of thousands of tests.

Issues only in the test infrastructure:
* Issue #1489: let nodes lead when they join. This bug slipped in recently, when I added cloning of `BlockchainTime`s as part of the restarting/rekeying loop in the test infrastructure.
* _PBFT reference simulator_. (Was masked by 1489.) Model competing 1-block chains in Ref.PBFT and use its results where applicable instead of the .Expectations module. Check that the PBFT threadnet and Ref.PBFT.simulate results agree on `Ref.Nominal` slots. The Ref.PBFT module had been making assumptions that were accurate given the guards in RealPBFT generators, given the `k >= n` regime. But outside of that regime, when the node count exceeds the security parameter, it wasn't enough. Also, it couldn't be compared against the PBFT threadnet because of those assumptions.
* _PBFT reference simulator_. (Cascade of above.) Add `definitelyEnoughBlocks` to confirm the "at least k blocks in 2k slots" invariant in `genRealPBFTNodeJoinPlan`. The existing guards in the RealPBFT generators are intentionally insufficient by themselves; this way we can optimize them to avoid O(n^2) complexity without risking divergence from the `suchThat`s.
* _Origin corner-case_. (Was masked by 1489.) Discard DualPBFT tests that forge in slot 0. The current `cardano-ledger-specs` doesn't allow for that. My hypothesis is that `cvsLastSlot` would need to be able to represent origin.
* _Dlg cert tx_. Adjust `genNodeRekeys` to respect "If node X rekeys in slot S and Y leads slot S+1, then either the topology must connect X and Y directly, or Y must join before slot S."
* _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add `rekeyOracle` and use it to determine which epoch number to record in the dlg cert. The correct value depends on which block the dlg cert tx will end up in, not when we first add it to our mempool.
* _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add the dlg cert tx to the mempool each time the ledger changes.

Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
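For readers unfamiliar with the "at least k blocks in 2k slots" invariant, the following is a minimal sketch of one way to check it over a finite run of slots. It is not the `definitelyEnoughBlocks` function from the PR; the name `enoughBlocks` and the representation (a list of booleans, `True` meaning a block was forged in that slot) are assumptions made for illustration.

```haskell
import Data.List (tails)

-- Hypothetical check: every window of 2k consecutive slots must contain at
-- least k slots in which a block was forged. Runs shorter than 2k slots have
-- no full window and therefore pass vacuously.
enoughBlocks :: Int -> [Bool] -> Bool
enoughBlocks k slots = all ok windows
  where
    w       = 2 * k
    windows = [take w t | t <- tails slots, length t >= w]
    ok win  = length (filter id win) >= k

main :: IO ()
main = do
  -- k = 2: each window of 4 slots needs at least 2 blocks.
  print (enoughBlocks 2 [True, False, True, False, True])
  print (enoughBlocks 2 [True, False, False, False, True])
```

This naive version rescans every window, which is fine for a sketch but is the kind of quadratic cost the PR description says the generators' guards were written to avoid.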
This Issue is a bit of a placeholder; I'm not yet sure what the ultimate cause is.
While fixing Issue #1489, I started to see additional errors. One example is:
So far, I only have a repro. It looks like this:
Commit 3594250 is the aforementioned anticipated fix for Issue 1489. Commit d0e3c0d adds the actual Tx invalidity to the error message. Commit 8befbfb narrows `test-consensus` to just run this one test:
The only hint I have so far is that if I revert PR #1468, then the test no longer fails. The contents of that PR do not obviously cause such an error. I anticipate that it is incriminated only because it changes the blocking behavior of `addTxs` and there's a bug somewhere in `addTxs` regarding the logic about whether or not the ledger snapshot being used for validation is stale. (@edsko and I have found some similar issues in that module during separate work on Issue #1297.)
Edit: instead of reverting PR 1468, I can just set this field in `nodeArgs` in `Test.ThreadNet.Network.hs`.
The capacity was 3000 bytes before PR 1468, and the repro's test now passes if `0 < multiplier < 47`, and fails if it's `>= 47` -- does that support the `addTxs` blocking/fingerprinting theory?