Commit
Fix snapshot from follower thread leak
### What changes are proposed in this pull request?

When using Ratis, the primary master does not take a snapshot itself; instead, it requests a snapshot from a follower. This pull request fixes two issues with the current implementation.

1. When a snapshot is due, as determined by `alluxio.master.journal.checkpoint.period.entries`, Ratis calls `takeSnapshot()` in the JournalStateMachine each time a journal entry is committed, until a new snapshot is installed. On the primary master, `takeSnapshot()` runs asynchronously: it first requests snapshot information from each follower, then downloads a snapshot from one of them if a valid snapshot is available. If no valid snapshot is available (which is likely, since all nodes take snapshots at the same log index and generating a snapshot takes time), the request is repeated until one is available. Each request allocates a new gRPC connection, so the master may eventually crash or fail over from allocating too many threads. This is fixed by having the follower block until a valid snapshot is available before sending the reply (with a configurable timeout).

2. When the primary master asks a follower to start sending the snapshot, it expects the follower to start a new RPC to do so. If the follower never starts that RPC (for example, because of a failure or a network issue), the primary master waits for it indefinitely and never installs a new snapshot. This is fixed by adding a timeout for the follower to start the RPC. If the timeout expires, the next follower is tried, or the snapshot request protocol starts over if no other follower is available.

### Does this PR introduce any user facing changes?

No

pr-link: #15873
change-id: cid-2a9a8ef7f1a416299b9a70df2592ffc9a076f760
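The two fixes described above can be illustrated with a minimal, self-contained sketch. It is not the code from this commit: `SnapshotRequestSketch`, `FollowerSnapshotHolder`, `awaitSnapshot`, and `pickUploader` are hypothetical names, followers are represented by plain strings, and the RPC is stood in for by a `CompletableFuture`. The first part shows a follower blocking a request until a snapshot at or past the requested index exists, bounded by a timeout. The second part shows the primary giving each follower a bounded time to start its upload before moving on to the next one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Illustrative sketch only; all names here are assumptions, not Alluxio or Ratis APIs.
class SnapshotRequestSketch {

  /** Hypothetical follower-side holder for the index of the latest finished snapshot. */
  public static final class FollowerSnapshotHolder {
    private long latestIndex = -1;

    /** Called when the follower finishes writing a snapshot; wakes blocked requests. */
    public synchronized void publish(long index) {
      latestIndex = Math.max(latestIndex, index);
      notifyAll();
    }

    /**
     * Fix 1: block until a snapshot with index >= minIndex exists, instead of replying
     * "none available" immediately and forcing the primary to retry on a new connection.
     * Returns empty if the timeout expires or the thread is interrupted.
     */
    public synchronized Optional<Long> awaitSnapshot(long minIndex, long timeoutMs) {
      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      try {
        while (latestIndex < minIndex) {
          long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
          if (remainingMs <= 0) {
            return Optional.empty();
          }
          wait(remainingMs);
        }
        return Optional.of(latestIndex);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return Optional.empty();
      }
    }
  }

  /**
   * Fix 2: ask each follower in turn to start its upload. The future returned by
   * requestUpload completes once the follower has started; a follower that does not
   * start within startTimeoutMs is skipped. An empty result means the caller should
   * restart the snapshot request protocol.
   */
  public static Optional<String> pickUploader(List<String> followers,
      Function<String, CompletableFuture<Void>> requestUpload, long startTimeoutMs) {
    for (String follower : followers) {
      CompletableFuture<Void> started = requestUpload.apply(follower);
      try {
        started.get(startTimeoutMs, TimeUnit.MILLISECONDS);
        return Optional.of(follower);
      } catch (TimeoutException | ExecutionException e) {
        started.cancel(true); // give up on this follower and try the next one
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return Optional.empty();
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    FollowerSnapshotHolder holder = new FollowerSnapshotHolder();
    new Thread(() -> holder.publish(100)).start();
    System.out.println("follower reply: " + holder.awaitSnapshot(100, 1000));

    // follower-1 never starts its upload; follower-2 starts immediately.
    Optional<String> uploader = pickUploader(Arrays.asList("follower-1", "follower-2"),
        f -> f.equals("follower-2")
            ? CompletableFuture.<Void>completedFuture(null)
            : new CompletableFuture<Void>(),
        100);
    System.out.println("uploader: " + uploader);
  }
}
```

In this sketch the primary issues one blocking request per follower rather than a stream of short-lived ones, which is the property the first fix relies on to avoid piling up connections and threads. The timeout values are arbitrary placeholders.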
Showing 8 changed files with 332 additions and 53 deletions.