Retry time fixes #4867

karknu · 2024-04-30T12:55:07Z

Description

Fix problems related to big ledger and public root peers retry time.

Checklist

Quality

Commit sequence makes sense and have useful messages, see ref.
New tests are added and existing tests are updated.
Self-reviewed the PR.

Maintenance

Linked an issue or added the PR to the current sprint of ouroboros-network project.
Added labels.
Updated changelog files.
The documentation has been properly updated, see ref.

Update the bigledger peers state in case of an exception. Previously the publicRoot state was updated instead. This ment that in case of an error the node would never attempt to increase the number of big ledger peers again.

The bootstrap peers and public root peers should be different. Therefor the failure of one kind shouldn't automatically carry over to the other kind.

Lower the maximum retry timer from above 1h to 4 minutes. Normally public roots have a ttl of 5s when taken from the ledger. But for bootstrap peers the ttl from DNS will be returned. Since this can be hours or days we cap it to 60s and depend on backoff counter to increase it further if needed.

When calculating publicRoot lookup results having only configured peers with IP addreses is not a DNS failure. Instead of a 3h TTL use 60s for IP addresses only (will be increased by backoff logic).

bolt12 · 2024-04-30T13:58:46Z

These all look sensible to me :)

bolt12 · 2024-04-30T15:10:56Z

ouroboros-network/src/Ouroboros/Network/PeerSelection/Governor/BigLedgerPeers.hs

+                              inProgressBigLedgerPeersReq = False,
+                              bigLedgerPeerBackoffs       = bigLedgerPeerBackoffs',
+                              bigLedgerPeerRetryTime      = bigLedgerPeerRetryTime'


Nice catch!

coot

Very nice.

If we are above the credits instead of failing instantly take a look if any of the events provide us with more credits. This makes the test suite more robust incase one kind of events happens close to another kind of events. For example a burst of peersharing activity just before the node starts to look for new public roots.

ouroboros-network/sim-tests-lib/Test/Ouroboros/Network/PeerSelection.hs

coot · 2024-05-02T19:58:37Z

One of the tests failed on Hydra

        node never uses non-trustable peers in clean state:                     FAIL (2.74s)
          *** Failed! Falsified (after 51 tests and 77 shrinks):
          GovernorMockEnvironment {peerGraph = PeerGraph [(PeerAddr 10,[],GovernorScripts {peerShareScript = Script (Nothing :| []), peerSharingScript = Script (PeerSharingDisabled :| []), connectionScript = Script ((Noop,NoDelay) :| [])})], localRootPeers = fromGroups [(HotValency {getHotValency = 1},WarmValency {getWarmValency = 1},fromList [(PeerAddr 10,(DoAdvertisePeer,IsTrustable))])], publicRootPeers = PublicRootPeers {getPublicConfigPeers = fromList [], getBootstrapPeers = fromList [], getLedgerPeers = fromList [], getBigLedgerPeers = fromList []}, targets = Script ((PeerSelectionTargets {targetNumberOfRootPeers = 0, targetNumberOfKnownPeers = 1, targetNumberOfEstablishedPeers = 0, targetNumberOfActivePeers = 0, targetNumberOfKnownBigLedgerPeers = 0, targetNumberOfEstablishedBigLedgerPeers = 0, targetNumberOfActiveBigLedgerPeers = 0},NoDelay) :| [(PeerSelectionTargets {targetNumberOfRootPeers = 0, targetNumberOfKnownPeers = 0, targetNumberOfEstablishedPeers = 0, targetNumberOfActivePeers = 0, targetNumberOfKnownBigLedgerPeers = 0, targetNumberOfEstablishedBigLedgerPeers = 0, targetNumberOfActiveBigLedgerPeers = 0},NoDelay)]), pickKnownPeersForPeerShare = Script (PickFirst :| []), pickColdPeersToPromote = Script (PickFirst :| []), pickWarmPeersToPromote = Script (PickFirst :| []), pickHotPeersToDemote = Script (PickFirst :| []), pickWarmPeersToDemote = Script (PickFirst :| []), pickColdPeersToForget = Script (PickFirst :| []), peerSharingFlag = PeerSharingDisabled, useBootstrapPeers = Script ((UseBootstrapPeers [RelayAccessDomain "test5" 48,RelayAccessDomain "test5" 65511,RelayAccessAddress "0.246.32.126" 44,RelayAccessDomain "test1" 65494,RelayAccessAddress "12d:21f:56:5bfe:1e1:bef7:c3:da85" 2,RelayAccessAddress "1e4:3395:1a8:1c05:145:be84:124:520c" 65523,RelayAccessAddress "1.82.19.128" 65531,RelayAccessDomain "test5" 48,RelayAccessDomain "test3" 65500,RelayAccessDomain "test5" 65525,RelayAccessAddress "e6:c8c5:119:9f6c:12d:3aa2:88:beaf" 22,RelayAccessDomain "test3" 65525,RelayAccessAddress "0.100.68.30" 65519,RelayAccessDomain "test2" 50,RelayAccessAddress "0.121.48.177" 65497,RelayAccessAddress "0.40.44.218" 16,RelayAccessDomain "test3" 39,RelayAccessDomain "test5" 65492,RelayAccessDomain "test3" 65512,RelayAccessAddress "0.79.102.117" 65533,RelayAccessDomain "test4" 35,RelayAccessAddress "1bc:1a5a:15d:7af1:19d:a8fc:fd:a338" 40,RelayAccessDomain "test5" 65510,RelayAccessAddress "1.118.216.216" 41,RelayAccessDomain "test4" 6,RelayAccessDomain "test4" 16,RelayAccessAddress "ad:49ba:1ed:fcb5:72:5dc6:147:be7f" 9,RelayAccessDomain "test2" 65491,RelayAccessDomain "test3" 65504,RelayAccessAddress "1ce:7887:1f5:8717:155:e24d:1e6:d5cb" 45,RelayAccessDomain "test1" 65497,RelayAccessAddress "0.7.117.105" 42,RelayAccessAddress "5f:ac20:11a:ff14:130:3e5e:69:3dcd" 65506,RelayAccessDomain "test4" 65500,RelayAccessDomain "test1" 65505,RelayAccessAddress "21:c6b1:29:ac56:36:60d2:ca:2eda" 11,RelayAccessDomain "test3" 65487,RelayAccessDomain "test5" 11],NoDelay) :| []), useLedgerPeers = Script ((DontUseLedgerPeers,NoDelay) :| []), ledgerStateJudgement = Script ((TooOld,NoDelay) :| [])}
          Last 20 signal values:
          Time 0s       : (False,(fromList [],fromList []))
          Time 0s       : (True,(fromList [],fromList []))
          Time 0s       : (True,(fromList [PeerAddr 10],fromList [PeerAddr 10]))
          
          Property violated at: Time 0s
          Invalid signal value:
          (True,(fromList [PeerAddr 10],fromList []))
          
          Use --quickcheck-replay=833314 to reproduce.
          Use -p '/node never uses non-trustable peers in clean state/' to rerun this test only.

karknu · 2024-05-03T07:35:31Z

-p '/node never uses non-trustable peers in clean state/'

This breakage seems to have been introduced in a38e5a1 , that is it exist in master.

bolt12 · 2024-05-03T10:47:32Z

I have looked into it and it seems it happens on an edge case when there's a trusted local root peer being in progress promote to cold and then the targets change to be all 0 (due to the quickcheck arbitrary script). This will clamp the local roots to the targetKnownPeer limit which will effectively make them empty. However the peer will stay in the knownPeer set. In other scenario the peer would have been shortly forgotten but since it is in the inProgressPromoteCold set it will linger and make the test fail

karknu added 3 commits April 30, 2024 14:50

Update the bigledger state not publicRoot state

e0fbe4b

Update the bigledger peers state in case of an exception. Previously the publicRoot state was updated instead. This ment that in case of an error the node would never attempt to increase the number of big ledger peers again.

Reset retry state when transition between LedgerStateJudgements

f8529ce

The bootstrap peers and public root peers should be different. Therefor the failure of one kind shouldn't automatically carry over to the other kind.

karknu added the bootstrap Issues / PRs related to bootstrap peers label Apr 30, 2024

karknu added 2 commits April 30, 2024 15:20

Using IP addresses is not a DNS failure

b26e5c0

When calculating publicRoot lookup results having only configured peers with IP addreses is not a DNS failure. Instead of a 3h TTL use 60s for IP addresses only (will be increased by backoff logic).

Update changelog

0ef9e22

karknu force-pushed the karknu/retry_fixes branch from d32f6a1 to 0ef9e22 Compare April 30, 2024 13:20

karknu marked this pull request as ready for review April 30, 2024 15:08

karknu requested a review from a team as a code owner April 30, 2024 15:08

karknu requested a review from bolt12 April 30, 2024 15:08

bolt12 approved these changes Apr 30, 2024

View reviewed changes

coot assigned karknu May 1, 2024

coot approved these changes May 1, 2024

View reviewed changes

coot added the outbound-governor Issues / PRs related to outbound-governor label May 1, 2024

karknu requested a review from bolt12 May 2, 2024 09:37

bolt12 approved these changes May 2, 2024

View reviewed changes

karknu enabled auto-merge May 2, 2024 11:56

coot reviewed May 2, 2024

View reviewed changes

ouroboros-network/sim-tests-lib/Test/Ouroboros/Network/PeerSelection.hs Show resolved Hide resolved

coot mentioned this pull request May 2, 2024

ouroboros-network changes for cardano-node-8.9.3 #4869

Merged

7 tasks

Fixing 'node never uses non-trustable peers in clean state' test failure

6e5fbcd

karknu added this pull request to the merge queue May 4, 2024

Merged via the queue into master with commit 3f545be May 4, 2024
13 checks passed

karknu deleted the karknu/retry_fixes branch May 4, 2024 07:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry time fixes #4867

Retry time fixes #4867

karknu commented Apr 30, 2024 •

edited by coot

Loading

bolt12 commented Apr 30, 2024

bolt12 Apr 30, 2024

coot left a comment

coot commented May 2, 2024

karknu commented May 3, 2024 •

edited

Loading

bolt12 commented May 3, 2024

Retry time fixes #4867

Retry time fixes #4867

Conversation

karknu commented Apr 30, 2024 • edited by coot Loading

Description

Checklist

Quality

Maintenance

bolt12 commented Apr 30, 2024

bolt12 Apr 30, 2024

Choose a reason for hiding this comment

coot left a comment

Choose a reason for hiding this comment

coot commented May 2, 2024

karknu commented May 3, 2024 • edited Loading

bolt12 commented May 3, 2024

karknu commented Apr 30, 2024 •

edited by coot

Loading

karknu commented May 3, 2024 •

edited

Loading