
Better task balancing #1482

Merged: 73 commits from task-juggling into master on Jun 8, 2017

Conversation

@matush-v (Contributor) commented Mar 30, 2017

🚧 This is a WIP for task balancing.

The general idea is that a pendingTask will calculate the best offer by scoring each offer and choosing the best score. Right now, the top score is 1.00, assuming the slave has nothing on it.

Score is weighted based on two criteria: the slave's current resource usage for the same request type and its current resource availability. I chose the weights based on what I thought might be important, but they can be changed. I thought memory would be the most important, so I weighted it a bit higher:

requestTypeCpuWeight = 0.20;
requestTypeMemWeight = 0.30;
freeCpuWeight = 0.20;
freeMemWeight = 0.30;

This only scores based on what's running on the slave. It does not look at the acceptedPendingTasks for an offer.
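
To make the weighting concrete, here's a minimal sketch of how these four weights could combine into a 0-to-1 score. This is an illustration of the idea only, not the exact implementation in this PR; the parameter names are made up for the example.

// Illustrative sketch only -- not the exact implementation in this PR.
public class OfferScoreSketch {
  static final double REQUEST_TYPE_CPU_WEIGHT = 0.20;
  static final double REQUEST_TYPE_MEM_WEIGHT = 0.30;
  static final double FREE_CPU_WEIGHT = 0.20;
  static final double FREE_MEM_WEIGHT = 0.30;

  static double score(double totalCpus, double totalMemBytes,
                      double cpusUsedByRequestType, double memBytesUsedByRequestType,
                      double freeCpus, double freeMemBytes) {
    double score = 0;
    // less usage on this slave by the same request type => higher score
    score += REQUEST_TYPE_CPU_WEIGHT * (1 - cpusUsedByRequestType / totalCpus);
    score += REQUEST_TYPE_MEM_WEIGHT * (1 - memBytesUsedByRequestType / totalMemBytes);
    // more free resources on the slave => higher score
    score += FREE_CPU_WEIGHT * (freeCpus / totalCpus);
    score += FREE_MEM_WEIGHT * (freeMemBytes / totalMemBytes);
    // an empty slave scores 0.20 + 0.30 + 0.20 + 0.30 = 1.00
    return score;
  }
}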

@ssalinas

double score = score(offerHolder, stateCache, tasksPerOfferPerRequest, taskRequestHolder, getUsagesPerRequestTypePerSlave());
if (score > 0) {
  // todo: can short circuit here if score is high enough
  scorePerOffer.put(offerHolder, score);
@matush-v (Author):

Thought we might want to have a value that's definitely good enough to just accept instead of continuing to evaluate

@ssalinas (Member):

I like the idea of a scoring system overall. Some comments on specific logic things I'll make later since this is still WIP. Overall comments though:

  • The time past due for a task should probably also factor into the scoring (i.e. if it was supposed to run 5 minutes ago, the definition of 'good enough' is different than if it was supposed to start 2 seconds ago).
  • There should probably also be a measure of how many offers we have looked at while trying to schedule this task. We may only get 1 or 2 offers at a time and not have a holistic view of resources. So if we have looked at a number of them already, the bar for 'good enough' should start to get lower.
  • We should definitely be aware of computation time. The offer processing loop is already one of our slower areas. Anything we can do to pre-process this data and require less calculation at offer evaluation time will be a big help


taskManager.createTaskAndDeletePendingTask(zkTask);
private double minScore(SingularityTaskRequest taskRequest) {
  double minScore = 0.80;
@matush-v (Author):

This can be adjusted as necessary. I thought an 80% match might be a good starting point, but we could definitely reduce it

@matush-v (Author) commented Apr 5, 2017

This has been updated as follows:

Before

All offers would be scored, and of all scores above 0, the best-scoring offer would be accepted for the task

After

All offers are still scored, but the minimum acceptable score depends on how many milliseconds the task is overdue and how many offers the task has already been matched against without accepting one.

Currently, the overdue time and offer count are capped at 10 minutes and 20 attempts, respectively. The min score is reduced based on the ratios curOverdueTime:maxOverdueTime and curAttempts:maxAttempts.

The offer attempt count includes any offer that was considered, not just offers that scored too low. So an offer that didn't have enough resources to satisfy the task is still counted.

I picked these numbers based on generalizations; we will likely have to tune them
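
For illustration, here's roughly how those two ratios could lower the minimum score. This is a hedged sketch: the actual reduction logic lives in the scheduler's minScore(), and the way the ratios combine below is my assumption, not necessarily the PR's.

import java.util.concurrent.TimeUnit;

// Illustrative sketch only -- how the two ratios combine is an assumption.
public class MinScoreSketch {
  static final double STARTING_MIN_SCORE = 0.80;                        // from the PR
  static final long MAX_OVERDUE_MILLIS = TimeUnit.MINUTES.toMillis(10); // 10 min cap
  static final int MAX_OFFER_ATTEMPTS = 20;                             // 20 attempt cap

  static double minScore(long overdueMillis, int offerAttempts) {
    double minScore = STARTING_MIN_SCORE;
    // the longer a task is overdue, and the more offers it has already been
    // evaluated against, the lower the bar for accepting an offer
    minScore -= Math.min(1.0, (double) overdueMillis / MAX_OVERDUE_MILLIS);
    minScore -= Math.min(1.0, (double) offerAttempts / MAX_OFFER_ATTEMPTS);
    return Math.max(0, minScore);
  }
}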

return SlaveMatchState.SLAVE_ATTRIBUTES_DO_NOT_MATCH;
}

final SlavePlacement slavePlacement = taskRequest.getRequest().getSlavePlacement().or(configuration.getDefaultSlavePlacement());

if (!taskRequest.getRequest().isRackSensitive() && slavePlacement == SlavePlacement.GREEDY) {
  // todo: account for this or let this behavior continue?
  return SlaveMatchState.NOT_RACK_OR_SLAVE_PARTICULAR;
@matush-v (Author):

I didn't know if we would need to account for any rack sensitivity outside of the existing checks done in the scheduler

@matush-v (Author):

@ssalinas Got some specific logic tests in. Let me know if there's a piece I missed that should be added. I'm going to continue to try to get some full logic tests in as well

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

public class SingularitySlaveUsage {

  public static final String CPU_USED = "cpusUsed";
  public static final String MEMORY_BYTES_USED = "memoryRssBytes";
  public static final long BYTES_PER_MEGABYTE = 1024L * 1024L;
@ssalinas (Member):

was about to comment that there must be some type of easy class/enum for this like there is with TimeUnit, but apparently there isn't... weird...

@matush-v (Author):

Yeah, I was sad to see there wasn't a lib method for this too 😢

}

public double getCpusUsedForRequestType(RequestType type) {
return usagePerRequestType.get(type).get(CPU_USED).doubleValue();
@ssalinas (Member):

Maybe another enum is more appropriate for CPU_USED/MEMORY_BYTES_USED ?

@matush-v (Author):

agreed, mapping would be clearer then too 👍
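
For example, the string constants could become an enum along these lines (a sketch of the suggestion, with illustrative names, not code from this PR):

// Sketch of the enum suggestion; names are illustrative.
public enum ResourceUsageType {
  CPU_USED,
  MEMORY_BYTES_USED
}

usagePerRequestType could then be a Map<RequestType, Map<ResourceUsageType, Number>>, so lookups like usagePerRequestType.get(type).get(ResourceUsageType.CPU_USED) are type-checked instead of keyed by raw strings.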

return usagePerRequestType;
}

public double getCpusUsedForRequestType(RequestType type) {
@ssalinas (Member):

this and getMemBytesUsedForRequestType are unused methods

}

@Override
public void runActionOnPoll() {
  final long now = System.currentTimeMillis();
  Map<RequestType, Map<String, Number>> usagesPerRequestType = new HashMap<>();
@ssalinas (Member):

wouldn't we want this to be per-slave, not overall?

@matush-v (Author):

This should be per slave. The poller loops through each slave and creates a new SingularitySlaveUsage with the stats for that slave.
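
Roughly, the per-slave shape described above looks like this (a hypothetical sketch; getSlaves(), collectUsageFor(), saveCurrentSlaveUsage(), and the SingularitySlaveUsage constructor arguments are illustrative, not the PR's actual helpers):

// Hypothetical sketch of the per-slave usage poll described above.
public void runActionOnPoll() {
  final long now = System.currentTimeMillis();
  for (SingularitySlave slave : getSlaves()) {
    // a fresh map per slave, so usage is tracked per slave rather than cluster-wide
    Map<RequestType, Map<String, Number>> usagesPerRequestType = collectUsageFor(slave, now);
    saveCurrentSlaveUsage(slave, new SingularitySlaveUsage(usagesPerRequestType));
  }
}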

Map<SingularityOfferHolder, Double> scorePerOffer = new HashMap<>();
double minScore = minScore(taskRequestHolder.getTaskRequest(), offerMatchAttemptsPerTask, System.currentTimeMillis());

LOG.info("Minimum score for task {} is {}", taskRequestHolder.getTaskRequest().getPendingTask().getPendingTaskId().getId(), minScore);
@ssalinas (Member):

probably can be lower than info level here

continue;
}

double score = score(offerHolder, stateCache, tasksPerOfferPerRequest, taskRequestHolder, getSlaveUsage(currentSlaveUsages, offerHolder.getOffer().getSlaveId().getValue()));
@ssalinas (Member):

for clarity, maybe something like 'hostScore' here? The score is for the particular slave, not necessarily about the offer

@matush-v (Author):

I'm not sure about the naming here. We do look at the slave's utilization to score the offer, but we are still scoring the offer itself, since offers aren't 1:1 with a slave (e.g. there can be 2 offers for the same slave).

The slave utilization weight will be the same for all offers on the same slave, but the offer resources will be different per offer. So, it seems to me that we're scoring the offer in this class rather than the slave itself

@VisibleForTesting
double minScore(SingularityTaskRequest taskRequest, Map<String, Integer> offerMatchAttemptsPerTask, long now) {
  double minScore = 0.80;
  int maxOfferAttempts = 20;
@ssalinas (Member):

another one that would be nice to have configurable

final SingularityTask task = mesosTaskBuilder.buildTask(offerHolder.getOffer(), offerHolder.getCurrentResources(), taskRequest, taskRequestHolder.getTaskResources(), taskRequestHolder.getExecutorResources());
@VisibleForTesting
double score(Offer offer, SingularityTaskRequest taskRequest, Optional<SingularitySlaveUsageWithId> maybeSlaveUsage) {
double requestTypeCpuWeight = 0.20;
@ssalinas (Member):

Let's make these configurable, maybe another object in the configuration yaml?

@matush-v (Author):

Yup, I was in progress on this (now committed), but I kept the fields under SingularityConfiguration since I saw a lot of other stuff in there as well (e.g. caching). We could pull it into an OfferConfiguration class if you think that'd be better for organization
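
For reference, the weights as configuration fields might look roughly like this (a sketch; the actual field names and defaults in the PR may differ):

import com.fasterxml.jackson.annotation.JsonProperty;

// Sketch only: scoring weights as configuration fields, overridable from the yaml.
public class OfferScoringConfigurationSketch {
  @JsonProperty
  private double requestTypeCpuWeight = 0.20;
  @JsonProperty
  private double requestTypeMemWeight = 0.30;
  @JsonProperty
  private double freeCpuWeight = 0.20;
  @JsonProperty
  private double freeMemWeight = 0.30;

  public double getRequestTypeCpuWeight() { return requestTypeCpuWeight; }
  public double getRequestTypeMemWeight() { return requestTypeMemWeight; }
  public double getFreeCpuWeight() { return freeCpuWeight; }
  public double getFreeMemWeight() { return freeMemWeight; }
}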

if (matchesResources && slaveMatchState.isMatchAllowed()) {
final SingularityTask task = mesosTaskBuilder.buildTask(offerHolder.getOffer(), offerHolder.getCurrentResources(), taskRequest, taskRequestHolder.getTaskResources(), taskRequestHolder.getExecutorResources());
@VisibleForTesting
double score(Offer offer, SingularityTaskRequest taskRequest, Optional<SingularitySlaveUsageWithId> maybeSlaveUsage) {
@ssalinas (Member):

Let's go over this one in-person, think we are getting close, just easier to chat than typing a novel in github ;)

@matush-v added the hs_qa label on May 18, 2017
@@ -31,22 +32,30 @@

private static final String SLAVE_PATH = ROOT_PATH + "/slaves";
private static final String TASK_PATH = ROOT_PATH + "/tasks";
private static final String USAGE_SUMMARY_PATH = ROOT_PATH + "/summary";
@matush-v (Author):

@ssalinas can you take a look at this piece? Just wanna make sure I got this set up right

@@ -123,6 +140,10 @@ public SingularityCreateResult saveSpecificTaskUsage(String taskId, SingularityT
return save(getSpecificTaskUsagePath(taskId, usage.getTimestamp()), usage, taskUsageTranscoder);
}

public SingularityCreateResult saveSpecificClusterUtilization(SingularityClusterUtilization utilization) {
return save(getSpecificClusterUtilizationPath(utilization.getTimestamp()), utilization, clusterUtilizationTranscoder);
@ssalinas (Member):

is there any point at which we would want a history of these? If we don't need a history of summaries, we might as well save the data to the summary path (without timestamp) and just overwrite when there is new data.

@matush-v (Author):

I didn't see a reason to at this point. I saw the others were saving them, so it seemed okay since we only save up to 5 data points. I think it's safe to just have the one point; it keeps things simpler

@@ -95,6 +104,10 @@ private String getCurrentSlaveUsagePath(String slaveId) {
return ZKPaths.makePath(getSlaveUsagePath(slaveId), CURRENT_USAGE_NODE_KEY);
}

private String getSpecificClusterUtilizationPath(long timestamp) {
@ssalinas (Member):

why SpecificClusterUtilization instead of just getUsageSummaryPath or something like that?

@matush-v (Author):

The "specific" keyword was the pattern the other items were using, so I kept that since the timestamp specifies which one to grab. I can drop the historical data and then rename it

} catch (InvalidSingularityTaskIdException e) {
  LOG.error("Couldn't get SingularityTaskId for {}", taskUsage);
  continue;
}
@matush-v (Author):

Added this try/catch for the potentially incorrect taskId types

@matush-v (Author):

Before

Min score was configurable by the user and would be reduced as tasks were delayed and offer match attempts were rejected.

After

Min score is now calculated based on the overall utilization of the cluster: memUsed / memTotal and cpuUsed / cpuTotal. This will help give us a min score closer to the actual offer scores. Task delays and match attempts will still reduce the min score.
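
One plausible reading of that baseline as code (how the two ratios combine here is an assumption, not the PR's exact logic):

// Illustrative only: derive the min-score baseline from cluster utilization.
static double minScoreBaseline(double memUsed, double memTotal, double cpuUsed, double cpuTotal) {
  double memUtilization = memUsed / memTotal;
  double cpuUtilization = cpuUsed / cpuTotal;
  // a busier cluster yields lower real offer scores, so lower the bar to match
  return 1.0 - Math.max(memUtilization, cpuUtilization);
}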

@@ -130,12 +131,13 @@ public SingularityMesosOfferScheduler(MesosConfiguration mesosConfiguration,

while (!pendingTaskIdToTaskRequest.isEmpty() && addedTaskInLastLoop && canScheduleAdditionalTasks(taskCredits)) {
addedTaskInLastLoop = false;
double maxTaskMillisPastDue = maxTaskMillisPastDue(SingularityScheduledTasksInfo.getInfo(taskManager.getPendingTasks(), configuration.getDeltaAfterWhichTasksAreLateMillis()).getMaxTaskLagMillis());
@matush-v (Author) commented May 31, 2017:

I think we'll need a better name for this variable. It's the max possible task lag before we decide we're going to take any offer we can get. I feel maxTaskMillisPastDue has some confusing overlap with maxTaskLag, which is the current highest lag time for pending tasks

@matush-v (Author):

Before

We had a configured value for the max possible task lag before we accepted any offer that matched a task. The default was 5 minutes.

After

The max possible task lag (maxTaskMillisPastDue) is determined by a simple point-slope formula that uses maxTaskLag and the existing deltaAfterWhichTasksAreLateMillis:

maxTaskMillisPastDue = (-180000 / deltaAfterWhichTasksAreLateMillis) * maxTaskLag + 180000

where 180,000 ms is 3 minutes.

A task with no lag will result in a 3 min maxTaskMillisPastDue. Any lag will linearly reduce the maxTaskMillisPastDue. The minimum possible maxTaskMillisPastDue is 1 millisecond.

Since we reduce the minScore by how long a task is past due: higher maxTaskLag => lower maxTaskMillisPastDue => lower minScore

It's also important to note that we will actually start accepting any matching offer (minScore == 0) before the maxTaskLag reaches deltaAfterWhichTasksAreLateMillis. We can use this formula to see what the values would be for different deltas:

180000 = l + (180000 / d) * l

where l is the task lag at which we'll accept any offer and d = deltaAfterWhichTasksAreLateMillis.
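
Solving for l gives l = (180000 * d) / (d + 180000). As a worked example (my arithmetic; d = 30,000 ms is chosen purely for illustration): l = (180000 * 30000) / 210000 ≈ 25,714 ms, so with a 30-second delta we would start accepting any matching offer once a task is roughly 25.7 seconds overdue.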

@ssalinas (Member) commented Jun 8, 2017

@darcatron I think this one is good to merge. Going to get this in master so we can continue to build off of the resource usage updates without endless merge conflicts 👍. Thanks for all the work on this one

@ssalinas merged commit af223b1 into master on Jun 8, 2017
@ssalinas deleted the task-juggling branch on June 8, 2017 at 16:48