Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry time fixes #4867

Merged
merged 7 commits into from
May 4, 2024
Merged

Retry time fixes #4867

merged 7 commits into from
May 4, 2024

Conversation

karknu
Copy link
Contributor

@karknu karknu commented Apr 30, 2024

Description

Fix problems related to big ledger and public root peers retry time.

Checklist

Quality

  • Commit sequence makes sense and have useful messages, see ref.
  • New tests are added and existing tests are updated.
  • Self-reviewed the PR.

Maintenance

  • Linked an issue or added the PR to the current sprint of ouroboros-network project.
  • Added labels.
  • Updated changelog files.
  • The documentation has been properly updated, see ref.

karknu added 3 commits April 30, 2024 14:50
Update the bigledger peers state in case of an exception. Previously the
publicRoot state was updated instead. This ment that in case of an error
the node would never attempt to increase the number of big ledger peers
again.
The bootstrap peers and public root peers should be different. Therefor
the failure of one kind shouldn't automatically carry over to the other
kind.
Lower the maximum retry timer from above 1h to 4 minutes.
Normally public roots have a ttl of 5s when taken from the ledger. But
for bootstrap peers the ttl from DNS will be returned. Since this can be
hours or days we cap it to 60s and depend on backoff counter to increase
it further if needed.
@karknu karknu added the bootstrap Issues / PRs related to bootstrap peers label Apr 30, 2024
karknu added 2 commits April 30, 2024 15:20
When calculating publicRoot lookup results having only configured peers
with IP addreses is not a DNS failure. Instead of a 3h TTL use 60s for
IP addresses only (will be increased by backoff logic).
@karknu karknu force-pushed the karknu/retry_fixes branch from d32f6a1 to 0ef9e22 Compare April 30, 2024 13:20
@bolt12
Copy link
Contributor

bolt12 commented Apr 30, 2024

These all look sensible to me :)

@karknu karknu marked this pull request as ready for review April 30, 2024 15:08
@karknu karknu requested a review from a team as a code owner April 30, 2024 15:08
@karknu karknu requested a review from bolt12 April 30, 2024 15:08
Comment on lines +109 to +111
inProgressBigLedgerPeersReq = False,
bigLedgerPeerBackoffs = bigLedgerPeerBackoffs',
bigLedgerPeerRetryTime = bigLedgerPeerRetryTime'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

Copy link
Contributor

@coot coot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice.

@coot coot added the outbound-governor Issues / PRs related to outbound-governor label May 1, 2024
If we are above the credits instead of failing instantly take a look if
any of the events provide us with more credits. This makes the test
suite more robust incase one kind of events happens close to another
kind of events. For example a burst of peersharing activity just before
the node starts to look for new public roots.
@karknu karknu requested a review from bolt12 May 2, 2024 09:37
@karknu karknu enabled auto-merge May 2, 2024 11:56
@coot
Copy link
Contributor

coot commented May 2, 2024

One of the tests failed on Hydra

        node never uses non-trustable peers in clean state:                     FAIL (2.74s)
          *** Failed! Falsified (after 51 tests and 77 shrinks):
          GovernorMockEnvironment {peerGraph = PeerGraph [(PeerAddr 10,[],GovernorScripts {peerShareScript = Script (Nothing :| []), peerSharingScript = Script (PeerSharingDisabled :| []), connectionScript = Script ((Noop,NoDelay) :| [])})], localRootPeers = fromGroups [(HotValency {getHotValency = 1},WarmValency {getWarmValency = 1},fromList [(PeerAddr 10,(DoAdvertisePeer,IsTrustable))])], publicRootPeers = PublicRootPeers {getPublicConfigPeers = fromList [], getBootstrapPeers = fromList [], getLedgerPeers = fromList [], getBigLedgerPeers = fromList []}, targets = Script ((PeerSelectionTargets {targetNumberOfRootPeers = 0, targetNumberOfKnownPeers = 1, targetNumberOfEstablishedPeers = 0, targetNumberOfActivePeers = 0, targetNumberOfKnownBigLedgerPeers = 0, targetNumberOfEstablishedBigLedgerPeers = 0, targetNumberOfActiveBigLedgerPeers = 0},NoDelay) :| [(PeerSelectionTargets {targetNumberOfRootPeers = 0, targetNumberOfKnownPeers = 0, targetNumberOfEstablishedPeers = 0, targetNumberOfActivePeers = 0, targetNumberOfKnownBigLedgerPeers = 0, targetNumberOfEstablishedBigLedgerPeers = 0, targetNumberOfActiveBigLedgerPeers = 0},NoDelay)]), pickKnownPeersForPeerShare = Script (PickFirst :| []), pickColdPeersToPromote = Script (PickFirst :| []), pickWarmPeersToPromote = Script (PickFirst :| []), pickHotPeersToDemote = Script (PickFirst :| []), pickWarmPeersToDemote = Script (PickFirst :| []), pickColdPeersToForget = Script (PickFirst :| []), peerSharingFlag = PeerSharingDisabled, useBootstrapPeers = Script ((UseBootstrapPeers [RelayAccessDomain "test5" 48,RelayAccessDomain "test5" 65511,RelayAccessAddress "0.246.32.126" 44,RelayAccessDomain "test1" 65494,RelayAccessAddress "12d:21f:56:5bfe:1e1:bef7:c3:da85" 2,RelayAccessAddress "1e4:3395:1a8:1c05:145:be84:124:520c" 65523,RelayAccessAddress "1.82.19.128" 65531,RelayAccessDomain "test5" 48,RelayAccessDomain "test3" 65500,RelayAccessDomain "test5" 65525,RelayAccessAddress "e6:c8c5:119:9f6c:12d:3aa2:88:beaf" 22,RelayAccessDomain "test3" 65525,RelayAccessAddress "0.100.68.30" 65519,RelayAccessDomain "test2" 50,RelayAccessAddress "0.121.48.177" 65497,RelayAccessAddress "0.40.44.218" 16,RelayAccessDomain "test3" 39,RelayAccessDomain "test5" 65492,RelayAccessDomain "test3" 65512,RelayAccessAddress "0.79.102.117" 65533,RelayAccessDomain "test4" 35,RelayAccessAddress "1bc:1a5a:15d:7af1:19d:a8fc:fd:a338" 40,RelayAccessDomain "test5" 65510,RelayAccessAddress "1.118.216.216" 41,RelayAccessDomain "test4" 6,RelayAccessDomain "test4" 16,RelayAccessAddress "ad:49ba:1ed:fcb5:72:5dc6:147:be7f" 9,RelayAccessDomain "test2" 65491,RelayAccessDomain "test3" 65504,RelayAccessAddress "1ce:7887:1f5:8717:155:e24d:1e6:d5cb" 45,RelayAccessDomain "test1" 65497,RelayAccessAddress "0.7.117.105" 42,RelayAccessAddress "5f:ac20:11a:ff14:130:3e5e:69:3dcd" 65506,RelayAccessDomain "test4" 65500,RelayAccessDomain "test1" 65505,RelayAccessAddress "21:c6b1:29:ac56:36:60d2:ca:2eda" 11,RelayAccessDomain "test3" 65487,RelayAccessDomain "test5" 11],NoDelay) :| []), useLedgerPeers = Script ((DontUseLedgerPeers,NoDelay) :| []), ledgerStateJudgement = Script ((TooOld,NoDelay) :| [])}
          Last 20 signal values:
          Time 0s       : (False,(fromList [],fromList []))
          Time 0s       : (True,(fromList [],fromList []))
          Time 0s       : (True,(fromList [PeerAddr 10],fromList [PeerAddr 10]))
          
          Property violated at: Time 0s
          Invalid signal value:
          (True,(fromList [PeerAddr 10],fromList []))
          
          Use --quickcheck-replay=833314 to reproduce.
          Use -p '/node never uses non-trustable peers in clean state/' to rerun this test only.

@karknu
Copy link
Contributor Author

karknu commented May 3, 2024

-p '/node never uses non-trustable peers in clean state/'

This breakage seems to have been introduced in a38e5a1 , that is it exist in master.

@bolt12
Copy link
Contributor

bolt12 commented May 3, 2024

I have looked into it and it seems it happens on an edge case when there's a trusted local root peer being in progress promote to cold and then the targets change to be all 0 (due to the quickcheck arbitrary script). This will clamp the local roots to the targetKnownPeer limit which will effectively make them empty. However the peer will stay in the knownPeer set. In other scenario the peer would have been shortly forgotten but since it is in the inProgressPromoteCold set it will linger and make the test fail

@karknu karknu added this pull request to the merge queue May 4, 2024
Merged via the queue into master with commit 3f545be May 4, 2024
13 checks passed
@karknu karknu deleted the karknu/retry_fixes branch May 4, 2024 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bootstrap Issues / PRs related to bootstrap peers outbound-governor Issues / PRs related to outbound-governor
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants