Cluster-based tests #211

Closed
deepfire opened this issue Sep 26, 2019 · 11 comments

@deepfire
Contributor

deepfire commented Sep 26, 2019

Context

We want cluster-based integration tests for the node.

Current scope (to be extended)

  1. Cluster consensus validation: Run a small demo cluster on CI together with chairman application #106
    • status: implementation mostly done
  2. Extend to mixed cluster of old and new nodes, in PBFT era/mode, both producing blocks, connected via proxies: Mixed cluster in CI #255
    • status: stuck debugging cardano-sl cluster startup
  3. cardano-byron-proxy test:
    1. Basic functionality
    2. Heap profiling
    3. Strictness check using @edsko 's WHNF checker.
  4. Cluster Tx submission

Implementation

NixOS tests that can run a cluster in a VM are a good foundation for many of those.

This basic functionality was merged in #177.

@deepfire deepfire self-assigned this Sep 26, 2019
@deepfire
Contributor Author

After I tried to make cardano-byron-proxy serve a static chain (with no new block announcements), @avieth suggested the following:

Byron proxy just plays the cardano-sl game: it can't download a chain unless it has the header hash of the tip. If you want it to download from a Byron peer that has a static chain, it can be done without much difficulty. Either:

  1. you know the hash of the tip that you want, and you can patch byron-proxy to request it in particular or
  2. patch cardano-sl to announce its tip header periodically even if it does not change

@deepfire
Contributor Author

deepfire commented Oct 14, 2019

So, the latest status is -- as per latest developments in https://github.com/input-output-hk/iohk-ops/tree/serge/cardano-cluster :

  1. The legacy node service definition was modified to provide an instanced systemd service, allowing several legacy nodes to run on a single system -- similar to what was done for the new node.
  2. A new legacy cluster configuration was created in https://github.com/input-output-hk/cardano-sl/tree/serge/ci-genesis -- its keys will be supplied in https://github.com/input-output-hk/cardano-node/tree/serge/mainnet-ci. This will be shared by all components of the mixed cluster: legacy segment, proxy and Byron Rewrite segment.
  3. NTP-to-local-clocks pinning was implemented, in line with @cleverca22's suggestion.
  4. The test legacy cluster still doesn't make blocks, despite full connectivity & not being in recovery mode. There are several suspicions on why that is so.
  5. In particular, @dcoutts suggested that we try starting the cluster in OBFT mode (it starts in Byron Classic mode, by default).

So the next piece of work is trying to figure out how to make the legacy cluster start in OBFT, without.

@deepfire
Contributor Author

deepfire commented Oct 14, 2019

There was a discussion on how to simplify scope item 2 (the mixed cluster), to try to avoid the issue with the cardano-sl cluster not starting when multiple nodes share a single localhost address.

The idea was to use an existing mainnet cluster as the source of blocks (which are necessary for the proxy to function, as per above).

Sadly, this breaks on two points (and a half):

  1. It just won't allow us simultaneous block creation on both sides of the proxy (since it'll be tied to mainnet) -- so this will have to be thrown away and re-done properly (in mainnet-independent fashion) anyway.
  2. It'll create problems with the relay taking a lot of time to sync (since it'll always be behind when it starts).
  3. De-isolation -- allowing the NixOS test environment to talk to the real mainnet -- isn't exactly trivial. While it's definitely doable, it'll still take some work. This isn't a blocker, of course, but it still makes the option less attractive -- it's now a comparable amount of work to the others.

@deepfire
Contributor Author

There is a simpler option to try with cardano-sl potentially being stuck due to all nodes sharing localhost -- we can employ VDE[1] to give distinct nodes distinct, routable IP addresses.

--

  1. https://github.com/virtualsquare/vde-2, available on NixOS.

@deepfire
Contributor Author

deepfire commented Oct 14, 2019

The VDE route almost worked.. except the routing itself became interesting -- the kernel was choosing the same route for all packets, since all tapX interfaces are local! ..

..and this leads to the same dreaded error as the previous attempt that used different loopback addresses -- network-transport-tcp sees a mismatch between the stated and the actual address, and fails: https://github.com/input-output-hk/network-transport-tcp/blob/2634e5e32178bb0456d800d133f8664321daa2ef/src/Network/Transport/TCP.hs#L1621

Duh! Should have expected that..
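
To make the failure mode concrete, here is a simplified Python illustration of this kind of check. It is not the library's code: it assumes the endpoint address has the form `host:port:endpoint-id`, and the function names are made up for the example. The point is only that a node announcing one address while its packets arrive from another gets rejected.

```python
def claimed_host(endpoint_address: str) -> str:
    """Return the host part of an address assumed to look like 'host:port:endpoint-id'."""
    host, _port, _endpoint_id = endpoint_address.rsplit(":", 2)
    return host


def peer_host_matches(endpoint_address: str, socket_peer_ip: str) -> bool:
    """Accept the connection only if the announced host equals the address seen on the socket."""
    return claimed_host(endpoint_address) == socket_peer_ip


# The node announces 10.1.0.2, but the kernel routed its packets out of another
# interface, so the receiver sees 10.1.0.1 as the source and refuses the peer.
print(peer_host_matches("10.1.0.2:3000:0", "10.1.0.2"))
print(peer_host_matches("10.1.0.2:3000:0", "10.1.0.1"))
```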

So I'm currently playing with source routing policies, which would make the kernel choose different interfaces depending on the source address: https://www.tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.rpdb.simple.html

UPDATE: I'm getting different source addresses now, however the problem now is, the mapping between TAP interfaces and the source addresses seems random 😂
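
For reference, the general shape of such a source routing policy (one routing table per interface, selected by source address, as in the linked HOWTO) can be sketched as a small Python generator of `ip` commands. The interface names, addresses and table numbers below are hypothetical, and this is not the exact configuration used in the test.

```python
from typing import List, Tuple


def source_routing_commands(nodes: List[Tuple[str, str]], first_table: int = 100) -> List[str]:
    """Build iproute2 commands that pin each source address to its own interface.

    `nodes` is a list of (interface, source_ip) pairs; every pair gets a
    dedicated routing table, numbered upwards from `first_table`.
    """
    commands = []
    for offset, (interface, source_ip) in enumerate(nodes):
        table = first_table + offset
        # Packets originating from this address are looked up in their own table...
        commands.append(f"ip rule add from {source_ip} table {table}")
        # ...whose only route sends them out of the matching interface.
        commands.append(f"ip route add default dev {interface} table {table}")
    return commands


# Hypothetical two-node layout.
for command in source_routing_commands([("tap0", "10.1.0.1"), ("tap1", "10.1.0.2")]):
    print(command)
```

Generating the rules from a single list keeps the interface-to-address mapping explicit in one place.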

@deepfire
Contributor Author

  1. Cutting out the network-transport-tcp address check did the trick -- the nodes agreed to connect/talk to each other.

However, that didn't resolve the problem with the cardano-sl nodes not making blocks.

So I started looking into switching the legacy nodes into OBFT mode right from the start (they currently start in Ouroboros Classic mode).

  1. Found the OBFT era being determined by the unlockStakeEpoch field of BlockVersionData: https://github.com/input-output-hk/cardano-sl/blob/master/chain/src/Pos/Chain/Update/BlockVersionData.hs#L148

  2. Regenerated genesis with unlockStakeEpoch being equal to the magic OBFT value -- and no MPC messages appear in cardano-sl's logs anymore, which suggests the change was effective.

No blocks, though..
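
A minimal sketch of the genesis tweak described above, in Python. It assumes the genesis spec is available as a dictionary with a top-level `blockVersionData` object, which is an assumption about the layout rather than something stated here. The magic OBFT value is not given in this thread, so it is passed in as a parameter instead of being hard-coded.

```python
import copy


def with_unlock_stake_epoch(genesis_spec: dict, epoch_value: str) -> dict:
    """Return a copy of the genesis spec with unlockStakeEpoch overridden.

    `epoch_value` must be the magic OBFT value taken from the cardano-sl
    sources; it is deliberately not hard-coded in this sketch.
    """
    patched = copy.deepcopy(genesis_spec)
    # Assumed layout: the field sits under a top-level "blockVersionData" object.
    patched.setdefault("blockVersionData", {})["unlockStakeEpoch"] = epoch_value
    return patched


# Hypothetical spec fragment; the placeholder stands in for the real magic value.
spec = {"blockVersionData": {"unlockStakeEpoch": "0"}}
print(with_unlock_stake_epoch(spec, "<magic OBFT value>"))
```

The copy keeps the original spec untouched, so the Classic and OBFT variants of a genesis can be generated side by side for comparison.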

@CodiePP CodiePP added this to In progress in ActiveBoard Oct 16, 2019
@deepfire
Contributor Author

deepfire commented Oct 16, 2019

Ok, I've gone with the supposedly well-oiled AWS setup of cardano-sl; however, it somehow manages to fare even worse than a cluster confined to a single machine (although, yes, there are other differences -- because the single-machine cluster required systemd service instancing and a lot of fiddling in general).

The error cardano-sl gives at cluster startup is (with some initial context):

Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Application: cardano-sl:1, last known block version 0.2.0, systemTag: linux64
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Genesis stakeholders (7 addresses, dust threshold 7 coin(s)): GenesisWStakeholders: {33111eddbb08270d: 1, 540fb9f1c0415491: 1, 6132662df7ccd698: 1, 773d6255ced70494: 1, 8dba875898ab11ac: 1, f7dedd2205451763: 1, f825bd9e9df8670d: 1}
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] GenesisDelegation (stakeholder ids): [773d6255ced70494 -> d5f8ce7d1937176c, 33111eddbb08270d -> aa84a9d0f69f2493, 8dba875898ab11ac -> ac68bdca1fae8f14, f7dedd2205451763 -> fcb3a4f1b35e5868, 540fb9f1c0415491 -> 98ca509664413dbf, 6132662df7ccd698 -> f050f7380f318dd4, f825bd9e9df8670d -> f3b7b1477a80fda3]
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] First genesis block hash: 1a28c5b6d7b98239, genesis seed is 76617361206f7061736120736b6f766f726f64612047677572646120626f726f64612070726f766f6461
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Current tip header: GenesisBlockHeader:
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     hash: 1a28c5b6d7b982396995008f856640cc68fbaf923ddbde42ac232b69d972863c
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     previous block: 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     epoch: #0
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     difficulty: 0
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Waiting 303 seconds for system start
...
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node.slotting:Notice:ThreadId 149] [2019-10-16 16:32:26.00 UTC] New slot has just started: 0th slot of 0th epoch
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node.slotting:Debug:ThreadId 149] [2019-10-16 16:32:26.00 UTC] Waiting for 19993571mcs before new slot
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Debug:ThreadId 142] [2019-10-16 16:32:26.00 UTC] Our tip header: GenesisBlockHeader:
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     hash: 1a28c5b6d7b982396995008f856640cc68fbaf923ddbde42ac232b69d972863c
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     previous block: 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     epoch: #0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     difficulty: 0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 142] [2019-10-16 16:32:26.00 UTC] Difference between current slot and tip slot is: 0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Debug:ThreadId 138] [2019-10-16 16:32:26.00 UTC] There are no new confirmed update proposals for our application
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.MonadPseudoRandom:Error:ThreadId 148] [2019-10-16 16:32:26.00 UTC] rollbackSsc: most genesis block is passed to rollback
Oct 16 16:32:51 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.consolidate:Error:ThreadId 119] [2019-10-16 16:32:51.27 UTC] DBMalformed "Can't retrieve genesis block, maybe db is not initialized?"

There is a lead, of course..
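
When digging through journal output like the above, it helps to pull out only the Error-severity lines. Here is a small Python sketch for that. The regular expression is inferred from the lines quoted in this comment (`[logger:Severity:ThreadId N] [timestamp] message`), so treat the format as an assumption rather than a specification of cardano-sl's logging.

```python
import re
from typing import Iterable, List, Tuple

# Matches the "[logger:Severity:ThreadId N] [timestamp] message" part of a line;
# the journald prefix in front of it is skipped by using search().
LOG_LINE = re.compile(
    r"\[(?P<logger>[^:\]]+):(?P<severity>\w+):ThreadId \d+\] "
    r"\[(?P<timestamp>[^\]]+)\] (?P<message>.*)"
)


def lines_with_severity(
    lines: Iterable[str], wanted: Tuple[str, ...] = ("Error",)
) -> List[Tuple[str, str, str]]:
    """Return (timestamp, logger, message) for every line whose severity is in `wanted`."""
    hits = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("severity") in wanted:
            hits.append(
                (match.group("timestamp"), match.group("logger"), match.group("message"))
            )
    return hits


sample = [
    "Oct 16 16:32:26 c-b-1 unit[2961]: [cardano-sl.node:Info:ThreadId 142] "
    "[2019-10-16 16:32:26.00 UTC] Difference between current slot and tip slot is: 0",
    "Oct 16 16:32:26 c-b-1 unit[2961]: [cardano-sl.MonadPseudoRandom:Error:ThreadId 148] "
    "[2019-10-16 16:32:26.00 UTC] rollbackSsc: most genesis block is passed to rollback",
]
for hit in lines_with_severity(sample):
    print(hit)
```

Continuation lines without a logger prefix (such as the indented tip header fields) do not match the pattern and are skipped.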

@deepfire
Contributor Author

For the sake of completeness -- the way genesis is generated is via https://github.com/input-output-hk/cardano-sl/blob/master/scripts/prepare-genesis/default.nix

@deepfire deepfire moved this from In progress to Backlog in ActiveBoard Oct 17, 2019
@Jimbo4350
Contributor

@deepfire can we close this?

@deepfire
Contributor Author

@Jimbo4350, I don't think so -- not all of the bullet items are done.

@CodiePP
Contributor

CodiePP commented Mar 5, 2020

will be moved to cardano-benchmarking

@CodiePP CodiePP closed this as completed Mar 5, 2020
ActiveBoard automation moved this from Backlog to Done Mar 5, 2020