Hangover logic includes previous block chain sends after restart #8652

Open
mhofman opened this issue Dec 12, 2023 · 1 comment
Labels
bug Something isn't working chain-incident

Comments

@mhofman
Member

mhofman commented Dec 12, 2023

Describe the bug

Our hangover logic aims to handle a one-block discrepancy in committed state between cosmos and swingset. It does so by saving the "chain sends" performed by JS during a block, and replaying those instead of executing if cosmic-swingset realizes it's asked to process the last recorded block.

There have been issues in the past with this logic in cosmic-swingset: incorrectly mutating some state during detected hangovers, or clearing chain sends during flush, which resulted in hangover recovery working only once.

We experienced a remaining issue where the saved chain sends of the block committed before a halt of the node are not cleared, and are included in the saved chain sends of the first block executed after the restart. If the node experiences a hangover at that point, it will incorrectly replay the chain sends of both blocks.
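To make the failure mode concrete, here is a toy model of recording and replaying chain sends. It is only a sketch: `cosmosStore`, `chainSend`, `executeBlock`, `replayBlock`, and `runScenario` are hypothetical stand-ins, not the cosmic-swingset implementation, and the cosmos side is reduced to an in-memory map.

```javascript
// Toy model only: every name below is a hypothetical stand-in.

// Simulated cosmos-side key/value store that answers "get" chain sends.
const cosmosStore = new Map();
const chainSend = key => (cosmosStore.has(key) ? cosmosStore.get(key) : null);

// Normal execution: perform each send and append [key, result] to the saved list.
const executeBlock = (saved, keys) => {
  for (const key of keys) saved.push([key, chainSend(key)]);
};

// Hangover: re-issue each saved send and require the recorded result.
const replayBlock = saved => {
  for (const [key, expected] of saved) {
    const actual = chainSend(key);
    if (actual !== expected) {
      throw Error(
        `replaying chain send ${key} resulted in ${JSON.stringify(actual)}; ` +
          `expected ${JSON.stringify(expected)}`,
      );
    }
  }
};

const runScenario = clearBeforeExecution => {
  cosmosStore.clear();
  const saved = [];
  // Block N: the queue tail is unset, so the recorded result is null.
  executeBlock(saved, ['actionQueue.tail']);
  // The node halts and restarts; cosmos then queues a message for block N+1.
  cosmosStore.set('actionQueue.tail', '1');
  // Without this step, block N's sends stay at the front of the saved list.
  if (clearBeforeExecution) saved.length = 0;
  // Block N+1: the recorded result is now '1'.
  executeBlock(saved, ['actionQueue.tail']);
  // JS state committed but cosmos did not, so block N+1 is replayed.
  replayBlock(saved);
  return saved.length;
};
```

In this model, `runScenario(false)` throws on the stale entry left over from block N, while `runScenario(true)` replays only the sends of block N+1.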

To Reproduce

Steps to reproduce the behavior:

  1. Start a local-chain
  2. Make sure it's idle, then submit some swingset message
  3. Interrupt the node before it commits the block in which the message is included
  4. Restart the node
  5. Interrupt the node after it commits the swing-store, but before it commits the cosmos DB (e.g. throw at the end of COMMIT_BLOCK after saveOutsideState)
  6. Restart the node
  7. Observe an error along the lines of:
    portHandler threw (Error#1)
    Error#1: fatal: replaying chain send [1,"{\"args\":[\"actionQueue.tail\"],\"method\":\"get\"}"] resulted in "\"1\""; expected null
    
      at replayChainSends (packages/cosmic-swingset/src/chain-main.js:235:15)
    

Expected behavior

The hangover logic should handle double halts like this one.

More specifically, the "chain sends" should be cleared once cosmic-swingset is sure that cosmos has correctly committed its DB. While we could rely on the new AFTER_COMMIT_BLOCK action, because of #6736 we likely should call await clearChainSends() in BEGIN_BLOCK if blockNeedsExecution(). Once we introduce execution between blocks, we'll have to be careful to clear chain sends at the appropriate time for that.
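A minimal sketch of that ordering follows. The names `clearChainSends` and `blockNeedsExecution` come from the text above, but their bodies, the `makeBeginBlock` wrapper, and the `getSavedHeight` helper are assumptions made for illustration. The reasoning it encodes: if cosmos asks for the block after the last one saved by JS, cosmos must have committed that last block, so its saved sends are no longer needed.

```javascript
// Hypothetical sketch: helper names mirror the issue text, bodies are assumed.
const makeBeginBlock = ({ getSavedHeight, clearChainSends }) => {
  // Decide whether this block must be executed or replayed from saved sends.
  const blockNeedsExecution = blockHeight => {
    const savedHeight = getSavedHeight();
    if (blockHeight === savedHeight) return false; // hangover: replay
    if (blockHeight === savedHeight + 1) return true; // fresh block: execute
    throw Error(`unexpected block height ${blockHeight}, saved ${savedHeight}`);
  };

  return async blockHeight => {
    if (!blockNeedsExecution(blockHeight)) return 'replay';
    // Cosmos moved past the saved block, so drop its sends before
    // recording any sends for the new block.
    await clearChainSends();
    return 'execute';
  };
};
```

Clearing at the start of the next executed block, rather than at commit time, means the saved sends remain available for however many restarts happen in between.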

Testing

Historically it has been difficult to test these integration issues between cosmos and cosmic-swingset. However, we realized we could simulate the cosmos driver by importing main from chain-main.js and providing a mock agcc.

Additional context

Experienced by validator-2 at block 2843573-2843574. See #8650

Screenshots

TBD: capture of faulty saved chain sends

@mhofman mhofman added the bug Something isn't working label Dec 12, 2023
@mhofman mhofman self-assigned this Dec 12, 2023
@mhofman mhofman changed the title Hangover logic saves previous block after restart Hangover logic includes previous block chain sends after restart Dec 12, 2023
@mhofman
Member Author

mhofman commented Apr 9, 2024

This issue is likely what caused a replay error when Nodes.Guru attempted to roll back after an AppHash:

Started Agoric Node.
4:45AM INF agd delegating to JS executable args=["ag-chain-cosmos","--home","/home/ubuntu/.agoric","start","--address","tcp://0.0.0.0:29658","--grpc-web.address","0.0.0.0:9391","--grpc.address","0.0.0.0:9390","--p2p.laddr","tcp://0.0.0.0:29656","--rpc.laddr","tcp://127.0.0.1:29657","--home","/home/ubuntu/.agoric"] binary=/home/ubuntu/agoric-sdk/packages/cosmic-swingset/src/entrypoint.js
4:46AM INF starting node with ABCI Tendermint in-process
4:46AM INF service start impl=multiAppConn module=proxy msg={}
4:46AM INF service start connection=query impl=committingClient module=abci-client msg={}
4:46AM INF service start connection=snapshot impl=committingClient module=abci-client msg={}
4:46AM INF service start connection=mempool impl=committingClient module=abci-client msg={}
4:46AM INF service start connection=consensus impl=committingClient module=abci-client msg={}
4:46AM INF service start impl=EventBus module=events msg={}
4:46AM INF service start impl=PubSub module=pubsub msg={}
4:46AM INF service start impl=IndexerService module=txindex msg={}
4:46AM INF ABCI Handshake App Info hash="�*����\x04��I\a\x0eJ�\x02G��\x02(g1V\x19�V\x14#S\x0e{�" height=14530318 module=consensus protocol-version=0 software-version=0.35.0-u14.1
4:46AM INF ABCI Replay Blocks appHeight=14530318 module=consensus stateHeight=14530318 storeHeight=14530319
4:46AM INF Replay last block using real app module=consensus
4:46AM INF minted coins from module account amount=8506555ubld from=mint module=x/bank
Loading slog sender modules: @agoric/telemetry/src/flight-recorder.js
2024-04-09T04:46:44.743Z launch-chain: Launching SwingSet kernel
2024-04-09T04:47:04.789Z launch-chain: Launched SwingSet kernel
2024-04-09T04:47:04.791Z block-manager: block 14530319 begin
portHandler threw (Error#1)
Error#1: fatal: replaying chain send [1,"{\"args\":[\"actionQueue.tail\"],\"method\":\"get\"}"] resulted in "null"; expected "1"
  at replayChainSends (packages/cosmic-swingset/src/chain-main.js:235:15)
  at doBlockingSend (packages/cosmic-swingset/src/launch-chain.js:1013:13)
panic: Error: fatal: replaying chain send [1,"{\"args\":[\"actionQueue.tail\"],\"method\":\"get\"}"] resulted in "null"; expected "1"
goroutine 99 [running]:
github.com/Agoric/agoric-sdk/golang/cosmos/x/swingset.EndBlock({{0x7f0c16d501f0, 0x7f0c17fe93c0}, {0x7f0c16d607a0, 0xc026ec6c80}, {{0xb, 0x0}, {0xc026d76f30, 0x8}, 0xddb70f, {0x1ab27c17, ...}, ...}, ...}, ...)
        /home/ubuntu/agoric-sdk/golang/cosmos/x/swingset/abci.go:66 +0x22d
github.com/Agoric/agoric-sdk/golang/cosmos/x/swingset.AppModule.EndBlock({{}, {{0x7f0c16d36ac0, 0xc00e9cef20}, {0x7f0c16d65cb0, 0xc001182010}, {{0x7f0c16d60010, 0xc001182010}, 0xc000134038, {0x7f0c16d36ac0, 0xc00e9cee90}, ...}, ...}, ...}, ...)
        /home/ubuntu/agoric-sdk/golang/cosmos/x/swingset/module.go:145 +0x65
github.com/cosmos/cosmos-sdk/types/module.(*Manager).EndBlock(_, {{0x7f0c16d501f0, 0x7f0c17fe93c0}, {0x7f0c16d607a0, 0xc026ec6c80}, {{0xb, 0x0}, {0xc026d76f30, 0x8}, 0xddb70f, ...}, ...}, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.46.16-alpha.agoric.2.1/types/module/module.go:509 +0x2a9
github.com/Agoric/agoric-sdk/golang/cosmos/app.(*GaiaApp).EndBlocker(...)
        /home/ubuntu/agoric-sdk/golang/cosmos/app/app.go:1009
github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).EndBlock(0xc000394a80, {0x40?})
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.46.16-alpha.agoric.2.1/baseapp/abci.go:208 +0x1fd
github.com/tendermint/tendermint/abci/client.(*committingClient).EndBlockSync(0xc03cfd8fc0, {0xc03cfd8fc0?})
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/abci/client/committing_client.go:341 +0xd2
github.com/tendermint/tendermint/proxy.(*appConnConsensus).EndBlockSync(0xc026e8ba00?, {0x20?})
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/proxy/app_conn.go:89 +0x1e
github.com/tendermint/tendermint/state.execBlockOnProxyApp({0x7f0c16d50688?, 0xc033c4cf00}, {0x7f0c16d59750, 0xc01bb6a0d0}, 0xc0004fb2c0, {0x7f0c16d60f70, 0xc01aa7a120}, 0xddb70e?)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/state/execution.go:327 +0x714
github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc01b9077e0, 0x7}}, {0xc01b9077e8, 0x8}, 0x204855, 0xddb70e, {{0xc01b11a140, ...}, ...}, ...}, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/state/execution.go:140 +0x16b
github.com/tendermint/tendermint/consensus.(*Handshaker).replayBlock(_, {{{0xb, 0x0}, {0xc01b9077e0, 0x7}}, {0xc01b9077e8, 0x8}, 0x204855, 0xddb70e, {{0xc01b11a140, ...}, ...}, ...}, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/consensus/replay.go:527 +0x23c
github.com/tendermint/tendermint/consensus.(*Handshaker).ReplayBlocksWithContext(_, {_, _}, {{{0xb, 0x0}, {0xc01b9077e0, 0x7}}, {0xc01b9077e8, 0x8}, 0x204855, ...}, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/consensus/replay.go:433 +0x75a
github.com/tendermint/tendermint/consensus.(*Handshaker).HandshakeWithContext(0xc00f29bc48, {0x7f0c16d50228, 0x7f0c17fe93c0}, {0x7f0c16d627d8?, 0xc0005c5c70?})
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/consensus/replay.go:274 +0x3d8
github.com/tendermint/tendermint/node.doHandshake({_, _}, {_, _}, {{{0xb, 0x0}, {0xc01b9077e0, 0x7}}, {0xc01b9077e8, 0x8}, ...}, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/node/node.go:330 +0x1a9
github.com/tendermint/tendermint/node.NewNodeWithContext({0x7f0c16d50228, 0x7f0c17fe93c0}, 0xc0014de280, {0x7f0c16d443f0, 0xc000407c20}, 0xc00e9ff330, {0x7f0c16d2ef48, 0xc014c2b5c0}, 0x1?, 0x7f0c16d23038, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/node/node.go:797 +0x577
github.com/tendermint/tendermint/node.NewNode(0x0?, {0x7f0c16d443f0?, 0xc000407c20?}, 0x0?, {0x7f0c16d2ef48?, 0xc014c2b5c0?}, 0x1?, 0x7f0c17fe93c0?, 0x0?, {0x7f0c16d50688, ...}, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cometbft@v0.34.30-alpha.agoric.1/node/node.go:719 +0xa9
github.com/cosmos/cosmos-sdk/server.startInProcess(_, {{0x0, 0x0, 0x0}, {0x7f0c16d6d460, 0xc0014a98c0}, 0x0, {0xc0014379a8, 0x8}, {0x7f0c16d65cb0, ...}, ...}, ...)
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.46.16-alpha.agoric.2.1/server/start.go:301 +0x6f5
github.com/cosmos/cosmos-sdk/server.StartCmd.func2.2()
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.46.16-alpha.agoric.2.1/server/start.go:147 +0x45
github.com/cosmos/cosmos-sdk/server.wrapCPUProfile.func2()
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.46.16-alpha.agoric.2.1/server/start.go:535 +0x22
created by github.com/cosmos/cosmos-sdk/server.wrapCPUProfile in goroutine 72
        /home/ubuntu/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.46.16-alpha.agoric.2.1/server/start.go:534 +0x22f
agd.service: Main process exited, code=killed, status=6/ABRT
agd.service: Failed with result 'signal'.
agd.service: Scheduled restart job, restart counter is at 3.
Stopped Agoric Node.
