Singularity performance improvements #1702

Merged
merged 25 commits into master from request_level_locks on Feb 13, 2018

Conversation

@ssalinas
Member

ssalinas commented Feb 1, 2018

This is a first pass at updates to the scheduler internals to improve the speed of offer processing, status updates, and other processes that contend for the scheduler lock. The main theme here is parallelizing things that were previously sequential.

The biggest change in this PR, which still needs additional testing, is the removal of the global scheduler lock in favor of many smaller locks. There are now three classes of locks: one for the scheduler state, one for accessing and processing offers, and a ConcurrentHashMap of request-level locks, so that only one process can update the state or tasks for a given request at a time. This change lets offers, status updates, deploy checks, pending request processing, and more run in parallel.
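As a rough illustration of the request-level locking idea, here is a minimal sketch in which a lock per request is created lazily from a ConcurrentHashMap (names here are hypothetical, not the actual SingularitySchedulerLock code):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class PerRequestLocks {
  // One lock per requestId; computeIfAbsent guarantees a single lock instance per key
  private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

  public <T> T callWithRequestLock(String requestId, Supplier<T> action) {
    ReentrantLock lock = locks.computeIfAbsent(requestId, id -> new ReentrantLock());
    lock.lock();
    try {
      // Only one caller can mutate this request's state or tasks at a time,
      // while work on other requests proceeds in parallel
      return action.get();
    } finally {
      lock.unlock();
    }
  }
}

Under a scheme like this, offers, status updates, and deploy checks that touch different requests never contend with each other.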

Other changes include:

  • More async handling of status updates. Instead of auto-acking each status update when we receive it and processing it fully before grabbing the next, we kick off an async process for each status update, with concurrency limited by the AsyncSemaphore class (copied here in the short term until we open source the library); see the sketch after this list
  • Parallelize many processes in offer processing and remove the extra while loop. Based on a first pass over the logic, it seemed we were looping through certain sections of offer processing many more times than needed. This still needs more testing to confirm there is no regression
  • Parallelize pending request and deploy check processing now that we are locking at the request level
  • Remove the task credit system. It was originally added for debugging, but given the disabled actions capabilities, the extra checks are unneeded overhead
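As a hedged sketch of the bounded-concurrency idea behind AsyncSemaphore (the real class may queue callbacks rather than block worker threads; all names here are illustrative):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedAsyncRunner {
  private final Semaphore permits;
  private final ExecutorService executor = Executors.newCachedThreadPool();

  public BoundedAsyncRunner(int maxConcurrent) {
    this.permits = new Semaphore(maxConcurrent);
  }

  // Kick off work asynchronously, but never let more than maxConcurrent
  // submissions (e.g. status updates) run at once
  public CompletableFuture<Void> submit(Runnable work) {
    return CompletableFuture.runAsync(() -> {
      try {
        permits.acquire(); // block this worker if too many are in flight
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      try {
        work.run(); // process the update, then ack upstream once it completes
      } finally {
        permits.release();
      }
    }, executor);
  }
}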

The request-level locking, along with our proxy-to-leader code, also opens us up to better UI experiences in the future. Instead of waiting for pollers, we can more easily execute many actions, like bounces or task kills, at request time and immediately return results to the UI.

Remaining TODOs include finishing the test fixes, running additional benchmarks of the new code, and cleaning up some rough edges (like hard-coded values that should instead be configurable).

@darcatron @baconmania @pschoenfelder

@ssalinas ssalinas changed the title from WIP - Singularity performance improvements to Singularity performance improvements Feb 7, 2018

@ssalinas ssalinas added the hs_staging label Feb 7, 2018

@ssalinas

Member

ssalinas commented Feb 7, 2018

Additional improvements:

  • Calculate more slave usage scores up front instead of recalculating them on each pending task loop, so that the number of calculations there is linear in the number of offers rather than numOffers * numPendingTasks (sketched below)
  • Also consider cached offers when resourceOffers is called, not only in the scheduler poller
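A minimal sketch of the up-front score calculation (types and names here are illustrative, not Singularity's actual scheduler classes):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UsageScoreCache {
  double computeUsageScore(String slaveId) {
    return 0.0; // stand-in for the real usage calculation
  }

  // Score each offered slave once per scheduling pass; every pending-task
  // loop then reads from this map instead of recomputing, so the cost is
  // linear in the number of offers rather than numOffers * numPendingTasks
  Map<String, Double> precomputeScores(List<String> offeredSlaveIds) {
    Map<String, Double> scores = new HashMap<>();
    for (String slaveId : offeredSlaveIds) {
      scores.computeIfAbsent(slaveId, this::computeUsageScore);
    }
    return scores;
  }
}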

Test updates:

  • Most tests rely on there being no offer cache in place. I now bind SingularityNoOfferCache by default, except for a specific test class that leverages the offer cache in order to test it
@@ -421,13 +421,14 @@ private void deleteScheduledTasks(final Collection<SingularityPendingTask> sched
}
private List<SingularityTaskId> getMatchingTaskIds(SingularityRequest request, SingularityDeployKey deployKey) {
List<SingularityTaskId> activeTaskIdsFroRequest = leaderCache.getActiveTaskIdsForRequest(deployKey.getRequestId());

@baconmania

baconmania Feb 7, 2018

Contributor

Minor typo

}
final long start = schedulerLock.lock(maybeTaskId.get().getRequestId(), "statusUpdate");
try {

@baconmania

baconmania Feb 7, 2018

Contributor

Does it make sense to use callWithRequestLock() here as well, and then make SingularitySchedulerLock#(lock|unlock) private?

@ssalinas

ssalinas Feb 7, 2018

Member

yes, I think I had a different exception in here at one point and didn't refactor back to callWithRequestLock, thanks for finding that
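A minimal sketch of what that refactor might look like (the actual callWithRequestLock signature in SingularitySchedulerLock may differ):

// Centralizes the lock/try/finally pattern so callers cannot skip the unlock
public void callWithRequestLock(String requestId, String name, Runnable action) {
  final long start = lock(requestId, name);
  try {
    action.run();
  } finally {
    unlock(requestId, name, start);
  }
}

// Caller side, replacing the manual lock/unlock in the diff above:
// schedulerLock.callWithRequestLock(maybeTaskId.get().getRequestId(), "statusUpdate",
//     () -> handleStatusUpdate(status));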

}
public long lock(String name) {
public long lock(String requestId, String name) {

@baconmania

baconmania Feb 7, 2018

Contributor

Could it be a good idea to accept the calling class here instead of a name, and then log out the class name in the trace logs below? Just to ensure that the names we pass when locking aren't arbitrary and are useful for tracing the flow of the scheduler.

@ssalinas

ssalinas Feb 7, 2018

Member

maybe even going as far as class#methodName could be useful for that logging
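For example (illustrative only, with a hypothetical caller), the lock name could encode class#method so the trace logs identify exactly where the lock was taken:

final String lockName = getClass().getSimpleName() + "#statusUpdate";
final long start = schedulerLock.lock(requestId, lockName);
try {
  // ... request-scoped work ...
} finally {
  schedulerLock.unlock(requestId, lockName, start);
}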

return System.currentTimeMillis();
}
public void unlock(String name, long start) {
LOG.info("{} - Unlocking ({})", name, JavaUtils.duration(start));
public void unlock(String requestId, String name, long start) {

@baconmania

baconmania Feb 7, 2018

Contributor

Should be safe to make lock() and unlock() private now, and only allow the rest of the scheduler to interact via runWithRequestLock().

@ssalinas ssalinas added the hs_qa label Feb 8, 2018

@baconmania

Contributor

baconmania commented Feb 13, 2018

🚢

@ssalinas ssalinas added the hs_stable label Feb 13, 2018

@ssalinas ssalinas merged commit 57ce9ba into master Feb 13, 2018

1 of 2 checks passed

continuous-integration/travis-ci/push: The Travis CI build failed
continuous-integration/travis-ci/pr: The Travis CI build passed

@ssalinas ssalinas deleted the request_level_locks branch Feb 13, 2018

@ssalinas ssalinas added this to the 0.19.0 milestone Feb 13, 2018
