
Better task balancing #1482

Merged
merged 73 commits into master from task-juggling on Jun 8, 2017

Conversation

@darcatron
Contributor

darcatron commented Mar 30, 2017

🚧 This is a WIP for task balancing.

The general idea is that a pendingTask will calculate the best offer by scoring each offer and choosing the highest score. Right now, the top score is 1.00, assuming the slave has nothing running on it.

Score is weighted based on two criteria: current resource usage by the same request type and current resource availability. I chose the weights based on what I thought might be important, but they can be changed. I thought mem would be the most important, so I weighted it a bit higher:

requestTypeCpuWeight = 0.20;
requestTypeMemWeight = 0.30;
freeCpuWeight = 0.20;
freeMemWeight = 0.30;

This only scores based on what's running on the slave. It does not look at the acceptedPendingTasks for an offer.
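The weighting above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; the class and parameter names are hypothetical, and it assumes all four inputs are normalized fractions in [0, 1]:

```java
// Sketch of the weighted offer scoring described above. Higher free resources
// and lower same-request-type usage on the slave yield a higher score.
public class OfferScoreSketch {
    static final double REQUEST_TYPE_CPU_WEIGHT = 0.20;
    static final double REQUEST_TYPE_MEM_WEIGHT = 0.30;
    static final double FREE_CPU_WEIGHT = 0.20;
    static final double FREE_MEM_WEIGHT = 0.30;

    // requestTypeCpuUsage / requestTypeMemUsage: fraction of the slave's
    // resources already used by tasks of the same request type.
    // freeCpu / freeMem: fraction of the slave's resources currently unused.
    static double score(double requestTypeCpuUsage, double requestTypeMemUsage,
                        double freeCpu, double freeMem) {
        return REQUEST_TYPE_CPU_WEIGHT * (1 - requestTypeCpuUsage)
             + REQUEST_TYPE_MEM_WEIGHT * (1 - requestTypeMemUsage)
             + FREE_CPU_WEIGHT * freeCpu
             + FREE_MEM_WEIGHT * freeMem;
    }

    public static void main(String[] args) {
        // An empty slave (no same-type usage, all resources free) scores ~1.00.
        System.out.println(score(0.0, 0.0, 1.0, 1.0));
        // A half-loaded slave scores lower.
        System.out.println(score(0.5, 0.5, 0.5, 0.5));
    }
}
```

Because the four weights sum to 1.0, the score naturally tops out at 1.00 for an idle slave, matching the comment above.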

@ssalinas

double score = score(offerHolder, stateCache, tasksPerOfferPerRequest, taskRequestHolder, getUsagesPerRequestTypePerSlave());
if (score > 0) {
// todo: can short circuit here if score is high enough
scorePerOffer.put(offerHolder, score);


@darcatron

darcatron Mar 30, 2017 Author Contributor

Thought we might want a value that's definitely good enough to just accept instead of continuing to evaluate

@ssalinas
Member

ssalinas commented Mar 30, 2017

I like the idea of a scoring system overall. Some comments on specific logic things I'll make later since this is still WIP. Overall comments though:

  • The time past due for a task should probably also factor into the scoring (i.e. if it was supposed to run 5 minutes ago, the definition of 'good enough' is different than if it was supposed to start 2 seconds ago).
  • There should probably also be a measure of how many offers we have looked at while trying to schedule this task. We may only get 1, 2, etc. offers at a time and not have a holistic view of resources. So if we have looked at a number of them already, the bar for 'good enough' should start to get lower.
  • We should definitely be aware of computation time. The offer processing loop is already one of our slower areas. Anything we can do to pre-process this data and require less calculation at offer evaluation time will be a big help

taskManager.createTaskAndDeletePendingTask(zkTask);
private double minScore(SingularityTaskRequest taskRequest) {
double minScore = 0.80;


@darcatron

darcatron Apr 5, 2017 Author Contributor

this can be adjusted as necessary. I thought an 80% match might be a good starting point, but we could def reduce it

@darcatron
Contributor Author

darcatron commented Apr 5, 2017

This has been updated as follows:

Before

All offers were scored, and among all scores above 0, the best-scoring offer was accepted for the task

After

All offers are still scored, but the minimum acceptable score depends on how many milliseconds the task is overdue and the number of offers the task has not accepted.

Currently, the overdue time and offer count are capped at 10 minutes and 20 attempts, respectively. The min score is based on the ratios curOverdueTime:maxOverdueTime and curAttempts:maxAttempts.

The offer attempt count includes any offer that was considered, not just offers that scored too low. So an offer that didn't have enough resources to satisfy the task will still be counted.

I picked these numbers based on generalizations. We will likely have to tune them.
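The reduction described above can be sketched as follows. The constants mirror the ones mentioned in this comment, but the exact way the two ratios are combined is an assumption, and the class name is hypothetical:

```java
// Hedged sketch: the bar for accepting an offer drops as a task becomes more
// overdue and as it passes on more offers. The multiplicative combination of
// the two ratios here is an assumption, not necessarily the PR's formula.
public class MinScoreSketch {
    static final double BASE_MIN_SCORE = 0.80;
    static final long MAX_OVERDUE_MILLIS = 10 * 60 * 1000; // 10 minute cap
    static final int MAX_OFFER_ATTEMPTS = 20;              // 20 attempt cap

    static double minScore(long overdueMillis, int offerAttempts) {
        double overdueRatio = Math.min(1.0, (double) overdueMillis / MAX_OVERDUE_MILLIS);
        double attemptRatio = Math.min(1.0, (double) offerAttempts / MAX_OFFER_ATTEMPTS);
        return Math.max(0.0, BASE_MIN_SCORE * (1.0 - overdueRatio) * (1.0 - attemptRatio));
    }

    public static void main(String[] args) {
        System.out.println(minScore(0, 0));              // fresh task: full bar
        System.out.println(minScore(5 * 60 * 1000, 10)); // halfway on both axes
        System.out.println(minScore(10 * 60 * 1000, 0)); // fully overdue: bar hits 0
    }
}
```

Either ratio reaching its cap drives the minimum score to 0, i.e. the task will take any offer that satisfies its resources.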

return SlaveMatchState.SLAVE_ATTRIBUTES_DO_NOT_MATCH;
}

final SlavePlacement slavePlacement = taskRequest.getRequest().getSlavePlacement().or(configuration.getDefaultSlavePlacement());

if (!taskRequest.getRequest().isRackSensitive() && slavePlacement == SlavePlacement.GREEDY) {
// todo: account for this or let this behavior continue?
return SlaveMatchState.NOT_RACK_OR_SLAVE_PARTICULAR;


@darcatron

darcatron Apr 5, 2017 Author Contributor

I didn't know if we would need to account for any rack sensitivity outside of the existing checks done in the scheduler

@darcatron
Contributor Author

darcatron commented Apr 14, 2017

@ssalinas Got some specific logic tests in. Let me know if there's a piece I missed that should be added. I'm going to continue to try to get some full logic tests in as well

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

public class SingularitySlaveUsage {

public static final String CPU_USED = "cpusUsed";
public static final String MEMORY_BYTES_USED = "memoryRssBytes";
public static final long BYTES_PER_MEGABYTE = 1024L * 1024L;


@ssalinas

ssalinas Apr 20, 2017 Member

was about to comment that there must be some type of easy class/enum for this like there is with TimeUnit, but apparently there isn't... weird...


@darcatron

darcatron Apr 20, 2017 Author Contributor

Yeah, I was sad to see there wasn't a lib method for this too 😢

}

public double getCpusUsedForRequestType(RequestType type) {
return usagePerRequestType.get(type).get(CPU_USED).doubleValue();


@ssalinas

ssalinas Apr 20, 2017 Member

Maybe another enum is more appropriate for CPU_USED/MEMORY_BYTES_USED ?


@darcatron

darcatron Apr 20, 2017 Author Contributor

agreed, mapping would be clearer then too 👍
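The enum suggested above could look something like this. The name ResourceUsageType is hypothetical; it would replace the CPU_USED / MEMORY_BYTES_USED string constants from SingularitySlaveUsage:

```java
// Hedged sketch of an enum for the usage metric keys, replacing the raw
// "cpusUsed" / "memoryRssBytes" string constants shown earlier in the PR.
public enum ResourceUsageType {
    CPU_USED("cpusUsed"),
    MEMORY_BYTES_USED("memoryRssBytes");

    private final String metricName;

    ResourceUsageType(String metricName) {
        this.metricName = metricName;
    }

    public String getMetricName() {
        return metricName;
    }
}
```

A usage map keyed by the enum (e.g. Map&lt;ResourceUsageType, Number&gt;, ideally an EnumMap) would then make lookups like usagePerRequestType.get(type).get(ResourceUsageType.CPU_USED) type-safe.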

return usagePerRequestType;
}

public double getCpusUsedForRequestType(RequestType type) {


@ssalinas

ssalinas Apr 20, 2017 Member

this and getMemBytesUsedForRequestType are unused methods

}

@Override
public void runActionOnPoll() {
final long now = System.currentTimeMillis();
Map<RequestType, Map<String, Number>> usagesPerRequestType = new HashMap<>();


@ssalinas

ssalinas Apr 20, 2017 Member

wouldn't we want this to be per-slave, not overall?


@darcatron

darcatron Apr 20, 2017 Author Contributor

This should be per slave. This poller loops through each slave and creates a new SingularitySlaveUsage with the stats for that slave

Map<SingularityOfferHolder, Double> scorePerOffer = new HashMap<>();
double minScore = minScore(taskRequestHolder.getTaskRequest(), offerMatchAttemptsPerTask, System.currentTimeMillis());

LOG.info("Minimum score for task {} is {}", taskRequestHolder.getTaskRequest().getPendingTask().getPendingTaskId().getId(), minScore);


@ssalinas

ssalinas Apr 20, 2017 Member

probably can be lower than info level here

continue;
}

double score = score(offerHolder, stateCache, tasksPerOfferPerRequest, taskRequestHolder, getSlaveUsage(currentSlaveUsages, offerHolder.getOffer().getSlaveId().getValue()));


@ssalinas

ssalinas Apr 20, 2017 Member

for clarity, maybe something like 'hostScore' here? The score is for the particular slave, not necessarily about the offer


@darcatron

darcatron Apr 20, 2017 Author Contributor

I'm not sure about the naming here. We do look at the slave's utilization to score the offer, but we are still scoring the offer itself since offers aren't uniquely 1:1 for a slave (e.g. 2 offers for the same slave).

The slave utilization weight will be the same for all offers on the same slave, but the offer resources will be different per offer. So, it seems to me that we're scoring the offer in this class rather than the slave itself

@VisibleForTesting
double minScore(SingularityTaskRequest taskRequest, Map<String, Integer> offerMatchAttemptsPerTask, long now) {
double minScore = 0.80;
int maxOfferAttempts = 20;


@ssalinas

ssalinas Apr 20, 2017 Member

another one that would be nice to have configurable

final SingularityTask task = mesosTaskBuilder.buildTask(offerHolder.getOffer(), offerHolder.getCurrentResources(), taskRequest, taskRequestHolder.getTaskResources(), taskRequestHolder.getExecutorResources());
@VisibleForTesting
double score(Offer offer, SingularityTaskRequest taskRequest, Optional<SingularitySlaveUsageWithId> maybeSlaveUsage) {
double requestTypeCpuWeight = 0.20;


@ssalinas

ssalinas Apr 20, 2017 Member

Let's make these configurable, maybe another object in the configuration yaml?


@darcatron

darcatron Apr 20, 2017 Author Contributor

Yup, I was in progress on this (now committed), but I kept the fields under SingularityConfiguration since I saw a lot of other stuff in there as well (e.g. caching). We could pull it into an OfferConfiguration file if you think that'd be better for organization

if (matchesResources && slaveMatchState.isMatchAllowed()) {
final SingularityTask task = mesosTaskBuilder.buildTask(offerHolder.getOffer(), offerHolder.getCurrentResources(), taskRequest, taskRequestHolder.getTaskResources(), taskRequestHolder.getExecutorResources());
@VisibleForTesting
double score(Offer offer, SingularityTaskRequest taskRequest, Optional<SingularitySlaveUsageWithId> maybeSlaveUsage) {


@ssalinas

ssalinas Apr 20, 2017 Member

Let's go over this one in-person, think we are getting close, just easier to chat than typing a novel in github ;)

@darcatron darcatron force-pushed the task-juggling branch from 2f8d3d0 to e145757 Apr 20, 2017
@darcatron darcatron force-pushed the task-juggling branch from e145757 to d7867e9 Apr 20, 2017
@darcatron darcatron added the hs_qa label May 18, 2017
@@ -31,22 +32,30 @@

private static final String SLAVE_PATH = ROOT_PATH + "/slaves";
private static final String TASK_PATH = ROOT_PATH + "/tasks";
private static final String USAGE_SUMMARY_PATH = ROOT_PATH + "/summary";


@darcatron

darcatron May 23, 2017 Author Contributor

@ssalinas can you take a look at this piece? Just wanna make sure I got this set up right

@@ -123,6 +140,10 @@ public SingularityCreateResult saveSpecificTaskUsage(String taskId, SingularityT
return save(getSpecificTaskUsagePath(taskId, usage.getTimestamp()), usage, taskUsageTranscoder);
}

public SingularityCreateResult saveSpecificClusterUtilization(SingularityClusterUtilization utilization) {
return save(getSpecificClusterUtilizationPath(utilization.getTimestamp()), utilization, clusterUtilizationTranscoder);


@ssalinas

ssalinas May 23, 2017 Member

is there any point at which we would want a history of these? If we don't need a history of summaries, we might as well save the data to the summary path (without timestamp) and just overwrite when there is new data.


@darcatron

darcatron May 23, 2017 Author Contributor

I didn't see a reason to at this point. Saw the others were saving them, so it seemed okay since we only save up to 5 points. I think it's safe to just have the one point. Keeps it simpler

@@ -95,6 +104,10 @@ private String getCurrentSlaveUsagePath(String slaveId) {
return ZKPaths.makePath(getSlaveUsagePath(slaveId), CURRENT_USAGE_NODE_KEY);
}

private String getSpecificClusterUtilizationPath(long timestamp) {


@ssalinas

ssalinas May 23, 2017 Member

why SpecificClusterUtilization instead of just getUsageSummaryPath or something like that?


@darcatron

darcatron May 23, 2017 Author Contributor

the "specific" keyword was the pattern the other items were using, so I kept it since the timestamp specifies which one to grab. I can drop the historical data and then rename it

} catch (InvalidSingularityTaskIdException e) {
LOG.error("Couldn't get SingularityTaskId for {}", taskUsage);
continue;
}


@darcatron

darcatron May 23, 2017 Author Contributor

added this try/catch for the potentially incorrect taskId types

@darcatron
Contributor Author

darcatron commented May 23, 2017

Before

Min score was configurable by the user and would be reduced as tasks were delayed and offer match attempts were rejected.

After

Min score is calculated based on the overall utilization of the cluster: memUsed / memTotal and cpuUsed / cpuTotal. This will help give us a min score closer to the actual offer scores. Task delays and match attempts will still reduce the min score.
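A rough sketch of the idea, assuming the min score tracks the score a typical slave would receive, i.e. roughly one minus the mean of the two utilization fractions. The averaging and the class name are assumptions; the PR's exact formula may differ:

```java
// Hedged sketch: a busier cluster yields lower-scoring offers, so the bar to
// accept one drops as mean cluster utilization rises.
public class ClusterMinScoreSketch {
    static double minScore(double memBytesUsed, double memBytesTotal,
                           double cpusUsed, double cpusTotal) {
        double meanUtilization = ((memBytesUsed / memBytesTotal)
                                + (cpusUsed / cpusTotal)) / 2.0;
        return 1.0 - meanUtilization;
    }

    public static void main(String[] args) {
        // A half-utilized cluster sets the bar at 0.5; an idle one at 1.0.
        System.out.println(minScore(512, 1024, 4, 8));
        System.out.println(minScore(0, 1024, 0, 8));
    }
}
```

This keeps the minimum score in the same range as the offer scores themselves, instead of pinning it to a fixed configured value like the earlier 0.80.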

@darcatron darcatron force-pushed the task-juggling branch from ee48576 to c34834e May 26, 2017
@@ -130,12 +131,13 @@ public SingularityMesosOfferScheduler(MesosConfiguration mesosConfiguration,

while (!pendingTaskIdToTaskRequest.isEmpty() && addedTaskInLastLoop && canScheduleAdditionalTasks(taskCredits)) {
addedTaskInLastLoop = false;
double maxTaskMillisPastDue = maxTaskMillisPastDue(SingularityScheduledTasksInfo.getInfo(taskManager.getPendingTasks(), configuration.getDeltaAfterWhichTasksAreLateMillis()).getMaxTaskLagMillis());


@darcatron

darcatron May 31, 2017 Author Contributor

I think we'll need a better name for this variable. It's the max possible task lag before we decide we're going to take any offer we can get. I feel maxTaskMillisPastDue has some confusing overlap with maxTaskLag, which is the current highest lag time for pending tasks

@darcatron
Contributor Author

darcatron commented May 31, 2017

Before

We had a configured value for the max possible task lag before we accepted any offer that matched a task. The default was 5 minutes.

After

The max possible task lag (maxTaskMillisPastDue) is determined by a simple point-slope formula that uses the maxTaskLag and the existing deltaAfterWhichTasksAreLateMillis:
maxTaskMillisPastDue = (-180,000 / deltaAfterWhichTasksAreLateMillis) * maxTaskLag + 180,000, where 180,000 is 3 minutes in milliseconds.

A task with no lag will result in a 3 min maxTaskMillisPastDue. Any lag will linearly reduce the maxTaskMillisPastDue. The minimum possible maxTaskMillisPastDue is 1 millisecond.

Since we reduce the minScore by how long a task is past due: higher maxTaskLag => lower maxTaskMillisPastDue => lower minScore

It's also important to note that we will actually start accepting any matching offer (minScore == 0) before the maxTaskLag reaches deltaAfterWhichTasksAreLateMillis. We can use this formula to see what the values would be for different deltas:
180,000 = l + (180,000 / d) * l, where l is the task lag at which we'll accept any offer and d is deltaAfterWhichTasksAreLateMillis.
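The two formulas above translate directly to code. This is a minimal sketch (class and method names are illustrative, not the PR's); the second method solves the crossover equation for l:

```java
// Sketch of the point-slope lag formula; 180_000 ms = 3 minutes.
public class TaskLagSketch {
    static final double THREE_MINUTES_MILLIS = 180_000;

    static double maxTaskMillisPastDue(long maxTaskLagMillis,
                                       long deltaAfterWhichTasksAreLateMillis) {
        double value = (-THREE_MINUTES_MILLIS / deltaAfterWhichTasksAreLateMillis)
                       * maxTaskLagMillis + THREE_MINUTES_MILLIS;
        return Math.max(1, value); // minimum possible value is 1 ms
    }

    // Lag at which any matching offer is accepted (minScore == 0), from
    // solving 180,000 = l + (180,000 / d) * l for l.
    static double acceptAnyOfferLag(long deltaMillis) {
        return THREE_MINUTES_MILLIS * deltaMillis / (deltaMillis + THREE_MINUTES_MILLIS);
    }

    public static void main(String[] args) {
        System.out.println(maxTaskMillisPastDue(0, 30_000));  // no lag: 3 min
        System.out.println(acceptAnyOfferLag(30_000));        // crossover < 30s delta
    }
}
```

As the text notes, acceptAnyOfferLag(d) is always strictly less than d, so the scheduler starts taking any matching offer before the lag reaches the lateness delta.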

darcatron added 2 commits Jun 1, 2017
@ssalinas
Member

ssalinas commented Jun 8, 2017

@darcatron I think this one is good to merge. Going to get this in master so we can continue to build off of the resource usage updates without endless merge conflicts 👍 . Thanks for all the work on this one

@ssalinas ssalinas merged commit af223b1 into master Jun 8, 2017
2 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
@ssalinas ssalinas deleted the task-juggling branch Jun 8, 2017