ActionScheduler will now use ActionListener instead of tokio::watch #1091
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: vale (waiting on @zbirenbaum)
34da4f3 to 8e8dfdc
Reviewable status: 0 of 1 LGTMs obtained, and 4 discussions need to be resolved
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 224 at r1 (raw file):
    };
    let Some(pending_txs) = maybe_pending_txs else {
        return; // Noone is waiting for this action anymore.
nit: Nobody
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 251 at r1 (raw file):
    let maybe_pending_txs = {
        let mut cache_check_actions = cache_check_actions.lock();
Comment and rename cache_check_actions
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 255 at r1 (raw file):
    };
    let Some(pending_txs) = maybe_pending_txs else {
        return; // Noone is waiting for this action anymore.
nit: Nobody
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 275 at r1 (raw file):
        )
    })?
    .err_tip(|| "While passing through CacheLookupScheduler::add_action")
"In CacheLookupScheduler::add_action"
Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved (waiting on @zbirenbaum)
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 224 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
nit: Nobody
Done.
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 251 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
Comment and rename cache_check_actions
Done.
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 255 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
nit: Nobody
Done.
nativelink-scheduler/src/cache_lookup_scheduler.rs
line 275 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
"In CacheLookupScheduler::add_action"
Done.
This lets the underlying scheduler intercept the Drop call, which makes it easier to clean up actions that are actively being listened to.
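A minimal sketch of that idea, using only the standard library and not NativeLink's actual types. The trait name `ActionListener` comes from the PR title; its `current_state` method, the `Scheduler` struct, and the `SchedulerListener` wrapper are hypothetical names chosen for illustration. The point is only that when the scheduler hands out its own listener type, it can run cleanup code in that type's `Drop` impl.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the listener trait named in the PR title.
trait ActionListener {
    /// Returns the latest known state of the action (simplified to a string).
    fn current_state(&self) -> String;
}

/// Hypothetical scheduler state: action name -> number of live listeners.
#[derive(Default)]
struct Scheduler {
    listeners: Mutex<HashMap<String, usize>>,
}

impl Scheduler {
    /// Registers interest in an action and returns a scheduler-owned listener.
    fn add_action(self: &Arc<Self>, action: &str) -> Box<dyn ActionListener> {
        *self
            .listeners
            .lock()
            .unwrap()
            .entry(action.to_string())
            .or_insert(0) += 1;
        Box::new(SchedulerListener {
            scheduler: Arc::clone(self),
            action: action.to_string(),
        })
    }

    /// Number of listeners currently attached to an action.
    fn listener_count(&self, action: &str) -> usize {
        self.listeners.lock().unwrap().get(action).copied().unwrap_or(0)
    }
}

struct SchedulerListener {
    scheduler: Arc<Scheduler>,
    action: String,
}

impl ActionListener for SchedulerListener {
    fn current_state(&self) -> String {
        format!("{}: queued", self.action)
    }
}

impl Drop for SchedulerListener {
    /// The scheduler sees every drop and can forget actions nobody watches.
    fn drop(&mut self) {
        let mut listeners = self.scheduler.listeners.lock().unwrap();
        let now_unwatched = match listeners.get_mut(&self.action) {
            Some(count) => {
                *count -= 1;
                *count == 0
            }
            None => false,
        };
        if now_unwatched {
            listeners.remove(&self.action);
        }
    }
}

fn main() {
    let scheduler = Arc::new(Scheduler::default());
    let a = scheduler.add_action("build");
    let b = scheduler.add_action("build");
    println!("{}", a.current_state());
    drop(a);
    println!("listeners left: {}", scheduler.listener_count("build"));
    drop(b);
    println!("listeners left: {}", scheduler.listener_count("build"));
}
```

Returning a boxed trait object keeps callers independent of the concrete listener type, so a different scheduler implementation could attach different cleanup logic without changing the call sites.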
8e8dfdc to 5bcd830
Reviewable status: complete! 1 of 1 LGTMs obtained
This is a significant overhaul of NativeLink's scheduler component. The new scheduler design is intended to enable a distributed scheduling system.

The new components & definitions:
* AwaitedActionDb - An interface that is easier to work with when dealing with key-value storage systems.
* MemoryAwaitedActionDb - An in-memory set of hashmaps & btrees used to satisfy the requirements of the AwaitedActionDb interface.
* ClientStateManager - A minimal interface required to satisfy the requirements of a client-facing scheduler.
* WorkerStateManager - A minimal interface required to satisfy the requirements of a worker-facing scheduler.
* MatchingEngineStateManager - A minimal interface required to satisfy the engine that matches queued jobs to workers.
* SimpleSchedulerStateManager - An implementation that satisfies ClientStateManager, WorkerStateManager & MatchingEngineStateManager, with all the logic of the previous "SimpleScheduler" moved behind each interface.
* ApiWorkerScheduler - A component that handles all knowledge about worker state, implements the WorkerScheduler interface, and translates its calls into the WorkerStateManager interface.
* SimpleScheduler - Translates calls of the ClientScheduler interface into ClientStateManager & MatchingEngineStateManager. This component currently always forwards calls to SimpleSchedulerStateManager and then to MemoryAwaitedActionDb. Future changes will make these inner components dynamic via config.

In addition, we have hardened the interactions between the different kinds of IDs in NativeLink. Most relevant is the separation & introduction of:
* OperationId - Represents an individual operation being requested to be executed; it is unique across all of time.
* ClientOperationId - An ID issued to the client when the client requests to execute a job. This ID points to an OperationId internally, but the client is never exposed to the OperationId.
* AwaitedActionHashKey - A key used to uniquely identify an action; it is not unique across time. This means that this key might have multiple OperationIds that have executed it at different points in time. This key is used as a "fingerprint" of an operation that the client wants to execute, and the scheduler may decide to join the stream onto an existing operation if this key has a hit.

Overall, these changes pave the way for more robust scheduler implementations. Most notably, distributed scheduler implementations will be easier to implement and will be introduced in follow-up PRs.

This commit was developed on a side branch and consisted of the following commits with corresponding code reviews:
54ed73c Add scheduler metrics back (TraceMachina#1171)
50fdbd7 fix formatting (TraceMachina#1170)
8926236 Merge in main and format (TraceMachina#1168)
9c2c7b9 key as u64 (TraceMachina#1166)
0192051 Cleanup unused code and comments (TraceMachina#1165)
080df5d Add versioning to AwaitedAction (TraceMachina#1163)
73c19c4 Fix sequence bug in new memory store manager (TraceMachina#1162)
6e50d2c New AwaitedActionDb implementation (TraceMachina#1157)
18db991 Fix test on running_actions_manager_test (TraceMachina#1141)
e50ef3c Rename workers to `worker_scheduler`
1fdd505 SimpleScheduler now uses config for action pruning (TraceMachina#1137)
eaaa872 Change encoding for items that are cachable (TraceMachina#1136)
d647056 Errors are now properly handles in subscription (TraceMachina#1135)
7c3e730 Restructure files to be more appropriate (TraceMachina#1131)
5e98ec9 ClientAwaitedAction now uses a channel to notify drops happened (TraceMachina#1130)
52beaf9 Cleanup unused structs (TraceMachina#1128)
e86fe08 Remove all uses of salt and put under ActionUniqueQualifier (TraceMachina#1126)
3b86036 Remove all need for workers to know about ActionId (TraceMachina#1125)
5482d7f Fix bazel build and test on dev (TraceMachina#1123)
ba52c7f Implement get_action_info to all ActionStateResult impls (TraceMachina#1118)
2fa4fee Remove MatchingEngineStateManager::remove_operation (TraceMachina#1119)
34dea06 Remove unused proto field (TraceMachina#1117)
3070a40 Remove metrics from new scheduler (TraceMachina#1116)
e95adfc StateManager will now cleanup actions on client disconnect (TraceMachina#1107)
6f8c001 Fix worker execution issues (TraceMachina#1114)
d353c30 rename set_priority to upgrade_priority (TraceMachina#1112)
0d93671 StateManager can now be notified of noone listeneing (TraceMachina#1093)
cfc0cf6 ActionScheduler will now use ActionListener instead of tokio::watch (TraceMachina#1091)
d70d31d QA fixes for scheduler-v2 (TraceMachina#1092)
f2cea0c [Refactor] Complete rewrite of SimpleScheduler
34d93b7 [Refactor] Move worker notification in SimpleScheduler under Workers
b9d9702 [Refactor] Moves worker logic back to SimpleScheduler
7a16e2e [Refactor] Move scheduler state behind mute
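The relationship between the three ID kinds described above can be sketched roughly as follows. This is an illustration under stated assumptions, not NativeLink's implementation: the `u64` representations, the `IdIndex` struct, and its `add_action`, `complete`, and `resolve` methods are all hypothetical. It shows only the described behaviour, where each client request gets its own ClientOperationId while requests with the same AwaitedActionHashKey may be joined onto one in-flight OperationId.

```rust
use std::collections::HashMap;

/// Unique across all of time; never exposed to the client.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct OperationId(u64);

/// Issued to the client; points to an OperationId internally.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ClientOperationId(u64);

/// Fingerprint of an action; NOT unique across time.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct AwaitedActionHashKey(u64);

/// Hypothetical in-memory index tying the three IDs together.
#[derive(Default)]
struct IdIndex {
    next_id: u64,
    client_to_operation: HashMap<ClientOperationId, OperationId>,
    in_flight: HashMap<AwaitedActionHashKey, OperationId>,
}

impl IdIndex {
    /// Returns a fresh number (a counter here, purely for the sketch).
    fn fresh(&mut self) -> u64 {
        self.next_id += 1;
        self.next_id
    }

    /// Joins an in-flight operation when the key has a hit, otherwise
    /// starts a new one. The client always receives its own new ID.
    fn add_action(&mut self, key: AwaitedActionHashKey) -> ClientOperationId {
        let operation_id = match self.in_flight.get(&key).copied() {
            Some(existing) => existing,
            None => {
                let id = OperationId(self.fresh());
                self.in_flight.insert(key, id);
                id
            }
        };
        let client_id = ClientOperationId(self.fresh());
        self.client_to_operation.insert(client_id, operation_id);
        client_id
    }

    /// Marks the operation for this key as finished, so a later request
    /// with the same key gets a new OperationId.
    fn complete(&mut self, key: AwaitedActionHashKey) {
        self.in_flight.remove(&key);
    }

    /// Internal lookup from the client-facing ID to the operation.
    fn resolve(&self, client_id: ClientOperationId) -> Option<OperationId> {
        self.client_to_operation.get(&client_id).copied()
    }
}

fn main() {
    let mut index = IdIndex::default();
    let key = AwaitedActionHashKey(0xfeed);
    let first = index.add_action(key);
    let second = index.add_action(key);
    println!("distinct client ids: {}", first != second);
    println!("joined: {}", index.resolve(first) == index.resolve(second));
    index.complete(key);
    let third = index.add_action(key);
    println!("new operation: {}", index.resolve(third) != index.resolve(first));
}
```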