Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource usage endpoint #1570

Merged
merged 35 commits into from Jul 14, 2017
Merged

resource usage endpoint #1570

merged 35 commits into from Jul 14, 2017

Conversation

@darcatron
Copy link
Contributor

@darcatron darcatron commented Jun 20, 2017

  • add some unit tests

Added endpoint to get resource usage data for each tracked request and the cluster as a whole.
Increased the number of points collected form 5 -> 15 and increased the interval between each point (poll frequency x interval) so we have a wider span of data

@darcatron darcatron requested a review from ssalinas Jun 20, 2017
@@ -562,6 +566,26 @@ public SingularityState getState(Optional<Boolean> skipCache, Optional<Boolean>
return Optional.of(response.getAs(SingularityTaskReconciliationStatistics.class));
}

public SingularityClusterUtilization getClusterUtilization() {

This comment has been minimized.

@ssalinas

ssalinas Jun 22, 2017
Member

I would follow the format of the other get calls here. You'll see there is a getSingle method where the http call building and deserialization is done for you

@VisibleForTesting
void clearOldUsage(List<SingularityTaskUsage> taskUsages, String taskId) {
if (taskUsages.size() + 1 > configuration.getNumUsageToKeep()) {
long minMillisApart = configuration.getUsageIntervalMultiplier() * configuration.getCheckUsageEveryMillis();

This comment has been minimized.

@ssalinas

ssalinas Jun 22, 2017
Member

not sure of the purpose for this one, are you trying to make sure to keep a certain time period of data points rather than a count?

This comment has been minimized.

@darcatron

darcatron Jun 22, 2017
Author Contributor

Yeah, the goal here is to increase the interval between each point (poll frequency * interval) so we have a wider span of data. I thought that would be more telling of usage since a task could be in a bad state (fail, be modified, etc.) for a few consecutive runs. The interval defaults to 3 so that's essentially 45 minutes of data (15 points 3 min apart) rather than 15 minutes (15 points 1 min apart)

includeUtilization = false;
}

if (unusedMemBytes / memoryBytesReserved >= minUnderUtilizedPct) {

This comment has been minimized.

@ssalinas

ssalinas Jun 22, 2017
Member

is there a downside to return all the data values instead of having mins that have to be hit? I feel like we'd want this to be a full report

This comment has been minimized.

@darcatron

darcatron Jun 22, 2017
Author Contributor

I added this min percentage to give us a goal of where we want our utilization to be. It's defaulted to 5% so that means any task using less than 95% of resources allocated would be considered in our report.

Without a min percent, it'd skew the counts and averages. Every task not using all the requested resources, no matter how small, will be counted as under-utilized. If some task is using 98 mb out of the 100 mb they requested, that seems very reasonable so it didn't make sense to count it as an under-utilized task.

With that said, I'm okay with changing it to consider all tasks if you think that would be more useful

double unusedCpu = cpuReserved - utilization.getAvgCpuUsed();
long unusedMemBytes = memoryBytesReserved - utilization.getMemBytesTotal();

if (unusedCpu / cpuReserved >= minUnderUtilizedPct) {

This comment has been minimized.

@ssalinas

ssalinas Jun 22, 2017
Member

similar here, why not just report the over/under for everything rather than filtering it down?

maxUnderUtilizedMemBytes = Math.max(unusedMemBytes, maxUnderUtilizedMemBytes);
minUnderUtilizedMemBytes = Math.min(unusedMemBytes, minUnderUtilizedMemBytes);
} else if (!includeUtilization) {
it.remove();

This comment has been minimized.

@ssalinas

ssalinas Jun 22, 2017
Member

not filtering things down also means you'd get to remove this. Generally better practice and more readable in code to not add or remove from the collection your iterating over.

This comment has been minimized.

@darcatron

darcatron Jun 22, 2017
Author Contributor

yeah, i wasn't a fan of the iterator here either, but that was the only way to safely remove the item from the list in the loop. If we decide to keep the min percentage, I could do a second cleanup loop rather than do it within the loop

return numTasks;
}

public double getAvgMemBytesUsed() {

This comment has been minimized.

@ssalinas

ssalinas Jun 22, 2017
Member

having these getters here means they would be included as json fields. Since they are easily calculated, and not used anywhere else in the code maybe exclude these? (think two extra fields x 3000+ requests all in one json blob)

This comment has been minimized.

@darcatron

darcatron Jun 22, 2017
Author Contributor

these are actually used to determine each unused resource:
long unusedMemBytes = (long) (memoryBytesReserved - utilization.getAvgMemBytesUsed());

I could drop the fields and do an inline calculation though?:
long unusedMemBytes = (long) (memoryBytesReserved - (utilization.getMemBytesTotal() / utilization.getNumTasks()));

This comment has been minimized.

@PtrTeixeira

PtrTeixeira Jun 22, 2017
Contributor

Could just do @JsonIgnoreand keep them on the object, but not the serialization

Copy link
Contributor

@PtrTeixeira PtrTeixeira left a comment

I don't know whether this has anything to do with your changes, and is probably just because those tests were fragile to start off with, but it looks like the tests SingularityMesosSchedulerTest (these aren't a big problem; these look like a problem with double comparisons) and Whitney's old tests in SingularityUsageTest have started breaking.

@darcatron
Copy link
Contributor Author

@darcatron darcatron commented Jun 22, 2017

@PtrTeixeira I've fixed some of those tests locally already and am working on getting mine written and the others fixed as well

@darcatron darcatron force-pushed the use-it-or-lose-it branch from bb6fb6e to 5df329a Jun 23, 2017
@ssalinas ssalinas modified the milestone: 0.17.0 Jun 23, 2017
darcatron added 4 commits Jun 24, 2017
sad
@darcatron darcatron force-pushed the use-it-or-lose-it branch from 27fb95f to 98d5d38 Jun 25, 2017
@darcatron darcatron force-pushed the use-it-or-lose-it branch from 63b8dd8 to 3234d43 Jun 27, 2017
@darcatron
Copy link
Contributor Author

@darcatron darcatron commented Jul 13, 2017

@ssalinas looking good in qa. okay to get it into prod?

@darcatron darcatron added the hs_stable label Jul 13, 2017
@ssalinas
Copy link
Member

@ssalinas ssalinas commented Jul 14, 2017

Going to merge this one. Any further updates can be done in future PRs

@ssalinas ssalinas merged commit 2b00479 into master Jul 14, 2017
1 of 2 checks passed
1 of 2 checks passed
continuous-integration/travis-ci/push The Travis CI build failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ssalinas ssalinas deleted the use-it-or-lose-it branch Jul 14, 2017
@baconmania baconmania modified the milestones: 0.18.0, 0.17.0 Sep 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.