Mixed cluster in CI #255

Closed
deepfire opened this issue Oct 17, 2019 · 13 comments

@deepfire
Contributor

deepfire commented Oct 17, 2019

Goal

This is a subgoal of #211

We want an integration test for the Legacy/OBFT to Shelley/OBFT transition phase.

This means that we want to run a cluster configuration with two segments -- nodes running cardano-sl and nodes running cardano-node -- with a number of cardano-byron-proxy instances connecting them.

Implementation

The entire cluster is supposed to run in a NixOS test on a single machine, to lighten load on CI & make the test faster.

The cluster has to share the same genesis, for obvious reasons, so cardano-sl nodes must use external genesis. Preferably we should use a genesis with configuration as close to mainnet as possible (to make the testing maximally relevant).

Deliverable

The final PR that enables this functionality is #269

@deepfire deepfire added this to Backlog in ActiveBoard via automation Oct 17, 2019
@deepfire deepfire moved this from Backlog to In progress in ActiveBoard Oct 17, 2019
@deepfire deepfire self-assigned this Oct 17, 2019
@deepfire
Contributor Author

deepfire commented Oct 17, 2019

Legacy cluster bringup trouble

We are trying to bring up a familiar cardano-sl cluster from scratch, with an external genesis and a configuration that resembles mainnet as closely as possible.

There are two problems, in two different contexts:

  1. Nodes not creating blocks, in a single-machine-multi-node context,
  2. Nodes failing to start, rejecting their (freshly-made) databases as malformed -- in a supposedly well-trodden (but not touched for a while) regular AWS deployment context.

Nodes not creating blocks, in a single-machine-multi-node context

Logs: local-cluster.txt

This setup diverges from the well-established path in a fair number of ways, mainly in the deployment part:

  1. the iohk-ops part (service definition) was heavily changed, out of the necessity to pack everything into a single machine,
  2. network-transport-tcp was modified to not do the claimed/DNS/numeric address check.

Configuration key: input-output-hk/cardano-sl@a8c04d5
Genesis: input-output-hk/cardano-sl@cbd3174#diff-ce43dde293d86f721d43e1b8e68edc67

Note the fixed startTime of 1000000000: for CI reasons, the system time is intended to be artificially reset to that point.
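For orientation, here is a minimal sketch of what that startTime corresponds to on the calendar. It assumes startTime is a count of seconds since the Unix epoch. The helper names `system_start_utc` and `seconds_until_start` are made up for illustration and are not part of cardano-sl.

```python
from datetime import datetime, timezone

# startTime value from the genesis discussed above,
# assumed here to be seconds since the Unix epoch.
START_TIME = 1_000_000_000

def system_start_utc(start_time: int) -> datetime:
    """Interpret a genesis startTime as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(start_time, tz=timezone.utc)

def seconds_until_start(start_time: int, now: datetime) -> float:
    """Positive while the system start is still in the future relative to `now`."""
    return (system_start_utc(start_time) - now).total_seconds()

print(system_start_utc(START_TIME).isoformat())  # 2001-09-09T01:46:40+00:00
```

Under that assumption, a test machine whose clock is not reset would see the system start as lying many years in the past.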

Branches in related repos:

  1. https://github.com/input-output-hk/cardano-sl/tree/serge/ci-genesis
  2. https://github.com/input-output-hk/iohk-ops/tree/serge/cardano-cluster
  3. https://github.com/input-output-hk/cardano-node/tree/serge/mainnet-ci
  4. https://github.com/deepfire/network-transport-tcp/tree/serge/drop-extra-check

Nodes failing to start, rejecting their (freshly-made) databases as malformed.

Logs: c-a-1.log

This setup is fairly traditional as it relies on regular AWS deployment, except it also includes the network-transport-tcp changes -- although the DB-related failure is arguably unrelated.

Of note is that the startTime was properly in the future: "Waiting 308 seconds for system start" can be seen in the logs.

Genesis & configuration: input-output-hk/cardano-sl@108464c

Branches in related repos:

  1. https://github.com/input-output-hk/cardano-sl/tree/serge/ci-genesis
  2. https://github.com/input-output-hk/iohk-ops/tree/serge/external-genesis
  3. https://github.com/deepfire/network-transport-tcp/tree/serge/drop-extra-check

@deepfire
Contributor Author

deepfire commented Oct 17, 2019

There is some evidence that the error is caused by DB state loss, so it is really node misbehavior:

  1. The error comes from Pos.DB.Block.GState.BlockExtra.getFirstGenesisBlockHash https://github.com/input-output-hk/cardano-sl/blob/51ad7c0503b1c52a75a6eb36096c407934136468/db/src/Pos/DB/Block/GState/BlockExtra.hs#L125
  2. Adding tracing shows evidence that this function returns different values for the same input:
  • for getFirstGenesisBlockHash, it is always called with the same argument:
    Untitled.txt
  • for its callee, resolveForwardLink:
resolveForwardLink = Just (AbstractHash 1a28c5b6d7b982396995008f856640cc68fbaf923ddbde42ac232b69d972863c), linkKey "e/fl/X A\160s\156\184\207\152\161v\169\144\248\169\v,\166\SYN\229A>#w\214\200HA\196k[`&\176"
[cardano-sl.MonadPseudoRandom:Error:ThreadId 149] [2019-10-17 15:33:47.21 UTC] rollbackSsc: most genesis block is passed to rollback
getFirstGenesisBlockHash: genesisHash: AbstractHash 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
resolveForwardLink = Nothing, linkKey "e/fl/X A\160s\156\184\207\152\161v\169\144\248\169\v,\166\SYN\229A>#w\214\200HA\196k[`&\176"
[cardano-sl.consolidate:Error:ThreadId 119] [2019-10-17 15:34:02.20 UTC] DBMalformed "Can't retrieve genesis block, maybe db is not initialized?"

I.e. we query the GStateDB twice with the same linkKey, and get a Just the first time and a Nothing the second.

So the next step is to find the mutator that does the damage.

Btw, as an additional piece of context, this action happens in the consolidation code.
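The Just-then-Nothing pattern can be reproduced with a toy model. This is only a sketch: a plain dictionary stands in for the GState store, the hashes are abbreviated from the trace, and `delete_forward_link` is a hypothetical placeholder for whatever mutator runs between the two queries.

```python
from typing import Dict, Optional

# Toy stand-in for the forward-link part of the GState key-value store.
ForwardLinks = Dict[str, str]

def resolve_forward_link(db: ForwardLinks, block_hash: str) -> Optional[str]:
    """Return the hash of the block following `block_hash`, or None (Haskell's Nothing)."""
    return db.get(block_hash)

def delete_forward_link(db: ForwardLinks, block_hash: str) -> None:
    """Hypothetical mutator: remove the forward link stored under `block_hash`."""
    db.pop(block_hash, None)

GENESIS = "41a0739c"  # abbreviated genesisHash from the trace
FIRST = "1a28c5b6"    # abbreviated hash returned by the first lookup

db: ForwardLinks = {GENESIS: FIRST}
before = resolve_forward_link(db, GENESIS)  # first query finds the link
delete_forward_link(db, GENESIS)            # something deletes it in between
after = resolve_forward_link(db, GENESIS)   # second query comes back empty
print(before, after)
```

The point of the model is that the same key can only flip from present to absent if something deletes it between the two reads, which is what makes a delete-side tracepoint the natural next step.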

@deepfire
Contributor Author

After adding a tracepoint that @intricate suggested (at input-output-hk/cardano-sl@016b38c#diff-a9e07ec6470d6fa7c4708aada702011eR185):

getFirstGenesisBlockHash: genesisHash: AbstractHash 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
rezzzolveForwardLink = Just (AbstractHash 1a28c5b6d7b982396995008f856640cc68fbaf923ddbde42ac232b69d972863c), linkKey "e/fl/X A\160s\156\184\207\152\161v\169\144\248\169\v,\166\SYN\229A>#w\214\200HA\196k[`&\176"
    previous block: 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
    epoch: #0
    difficulty: 0

[cardano-sl.node:Info:ThreadId 141] [2019-10-16 16:32:43.19 UTC] Difference between current slot and tip slot is: 0
[cardano-sl.node:Debug:ThreadId 137] [2019-10-16 16:32:43.19 UTC] There are no new confirmed update proposals for our application
[cardano-sl.node:Info:ThreadId 150] [2019-10-16 16:32:43.19 UTC] blundLocation: about to call getConsolidateCheckPoint
getFirstGenesisBlockHash: genesisHash: AbstractHash 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
rezzzolveForwardLink = Just (AbstractHash 1a28c5b6d7b982396995008f856640cc68fbaf923ddbde42ac232b69d972863c), linkKey "e/fl/X A\160s\156\184\207\152\161v\169\144\248\169\v,\166\SYN\229A>#w\214\200HA\196k[`&\176"
Rocks.Del: "e/fl/X A\160s\156\184\207\152\161v\169\144\248\169\v,\166\SYN\229A>#w\214\200HA\196k[`&\176"
[cardano-sl.node:Info:ThreadId 150] [2019-10-16 16:32:43.19 UTC] blundLocation: about to call getConsolidateCheckPoint
[cardano-sl.MonadPseudoRandom:Error:ThreadId 150] [2019-10-16 16:32:43.19 UTC] rollbackSsc: most genesis block is passed to rollback
^C

[cardano-sl.diffusion:Error:ThreadId 122] [2019-10-16 16:32:48.81 UTC] stopping with exception AsyncCancelled

I.e. the genesis point is erased from the DB:

Rocks.Del: "e/fl/X A\160s\156\184\207\152\161v\169\144\248\169\v,\166\SYN\229A>#w\214\200HA\196k[`&\176"
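A trace like this can also be checked mechanically. The sketch below assumes the line shapes seen above: a lookup line containing `ForwardLink = Just|Nothing ... linkKey "..."`, and a `Rocks.Del: "..."` line. The function name and the sample lines are illustrative only.

```python
import re
from typing import List, Tuple

# Line shapes assumed from the trace above; adjust if the tracepoints differ.
LOOKUP = re.compile(r'ForwardLink = (Just|Nothing).*linkKey "(.*)"')
DELETE = re.compile(r'^Rocks\.Del: "(.*)"')

def find_deletes_of_read_keys(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (line_index, key) for each Rocks.Del whose key was looked up earlier."""
    seen = set()
    hits = []
    for i, line in enumerate(lines):
        lookup = LOOKUP.search(line)
        if lookup:
            seen.add(lookup.group(2))
            continue
        delete = DELETE.match(line)
        if delete and delete.group(1) in seen:
            hits.append((i, delete.group(1)))
    return hits

# Shortened, made-up sample in the same shape as the real log lines.
sample = [
    'rezzzolveForwardLink = Just (AbstractHash 1a28c5b6), linkKey "e/fl/X-demo"',
    'Rocks.Del: "e/fl/X-demo"',
    'Rocks.Del: "e/other"',
]
hits = find_deletes_of_read_keys(sample)
print(hits)
```

Only deletes of keys that were previously resolved are reported, so unrelated deletions do not add noise.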

@deepfire
Contributor Author

https://github.com/input-output-hk/cardano-sl/blob/master/lib/src/Pos/Worker/Block.hs#L220 is the caller that drops the genesis point:

  1. Pos.Worker.Block.dropObftEbb
  2. Pos.DB.Block.Logic.rollbackBlocks
  3. Pos.DB.Block.Logic.Internal.rollbackBlocksUnsafe
  4. Pos.DB.Block.Slog.Logic.slogRollbackBlocks

@deepfire
Contributor Author

So, at least a naive first interpretation looks like this:

1. OBFT mode assumes that the EBB is always present,
2. during cluster startup this does not hold.
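The shape of that interpretation can be shown with a toy model. This is not cardano-sl code and not the change made in the eventual fix. The `Block` type, the `is_ebb` flag and both functions are hypothetical, and only contrast an unconditional drop with one that first checks its assumption.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    block_hash: str
    is_ebb: bool  # True for an epoch boundary block (toy flag, an assumption)

def drop_ebb_unguarded(chain: List[Block]) -> List[Block]:
    """Assume the tip is an EBB and roll it back unconditionally."""
    return chain[:-1]

def drop_ebb_guarded(chain: List[Block]) -> List[Block]:
    """Roll back the tip only when it really is an EBB; otherwise leave the chain alone."""
    if chain and chain[-1].is_ebb:
        return chain[:-1]
    return chain

# A freshly started chain whose tip is not the EBB the drop routine expects.
fresh = [Block("first-block", is_ebb=False)]
print(len(drop_ebb_unguarded(fresh)), len(drop_ebb_guarded(fresh)))  # 0 1
```

In the toy model the unguarded variant empties a fresh chain, while the guarded one leaves it untouched.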

@deepfire
Contributor Author

Fix for DBMalformed "Can't retrieve genesis block, maybe db is not initialized?" error is in input-output-hk/cardano-sl#4247

iohk-bors bot added a commit to input-output-hk/iohk-nix that referenced this issue Oct 25, 2019
206: cardano-lib:  mainnet CI genesis & configuration r=deepfire a=deepfire

Configuration and genesis for IntersectMBO/cardano-node#255

Co-authored-by: Kosyrev Serge <serge.kosyrev@iohk.io>
iohk-bors bot added a commit to input-output-hk/cardano-sl that referenced this issue Oct 28, 2019
4247: Single-machine multi-node mixed cluster CI prerequisites r=deepfire a=deepfire

This supplies the necessary changes for a mixed-cluster integration test, as per IntersectMBO/cardano-node#255 :

1. `mainnet_ci_full` genesis & configuration, starting in OBFT mode
2. fix for an OBFT EBB rollback issue, which was trying to erase EBB even if the chain was started in OBFT mode, leading to IntersectMBO/cardano-node#255 (comment)
3. change in `network-transport-tcp` to be more lenient regarding remote address claims: deepfire/network-transport-tcp@44f84a8.  This is necessary to avoid problems when starting multiple nodes on the same machine.
4. small improvements in genesis generation

Additionally, this resets the protocol version for the `shelley_staging_short_full` configuration to 0 -- a prerequisite for its respin.

*NOTE*: perhaps this PR should be split.  But then, this repository sees very little activity, so perhaps the separation wouldn't have much benefit.  I don't have a strong opinion myself.

Co-authored-by: Kosyrev Serge <serge.kosyrev@iohk.io>
@deepfire deepfire mentioned this issue Oct 28, 2019
3 tasks
iohk-bors bot added a commit to input-output-hk/cardano-sl that referenced this issue Oct 29, 2019
4247: Single-machine multi-node mixed cluster CI prerequisites r=deepfire a=deepfire
@deepfire
Contributor Author

The final deliverable is in: #269

@deepfire
Contributor Author

deepfire commented Nov 9, 2019

Documentation for running the CI cluster locally is at:

https://github.com/input-output-hk/cardano-node/blob/serge/mainnet-ci/scripts/README.org#ci-cluster

@deepfire
Contributor Author

deepfire commented Nov 19, 2019

Seeing #302 in chairman.

@deepfire
Contributor Author

input-output-hk/cardano-sl#4251 is a conditional blocker for merging of the cardano-sl dependency PR.

@deepfire
Contributor Author

input-output-hk/iohk-nix#237 -- a nasty caching bug that prevents local runs of the CI cluster (post-rebase).

iohk-bors bot added a commit that referenced this issue Nov 27, 2019
269: Mixed cluster CI r=deepfire a=deepfire

_One PR to bring them all_..
..or, the final deliverable of #255

Dependencies:
- [x] input-output-hk/cardano-sl#4252
- [x] input-output-hk/cardano-byron-proxy#70
- [x] #302

Co-authored-by: Kosyrev Serge <serge.kosyrev@iohk.io>
Co-authored-by: Marcin Szamotulski <profunctor@pm.me>
@deepfire
Contributor Author

#269 was merged.

ActiveBoard automation moved this from In progress to Done Nov 27, 2019