[WIP] Use Spot Fleet Rather Than Auto Scaling Groups #750
Something I thought of last night is that we need to dynamically set |
infrastructure/instances.tf
Outdated
```
resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role      = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  allocation_strategy = "diversified"
  target_capacity     = 100
```
If I understand this correctly, the units of this capacity are arbitrary based on what we decide right? Looking at the weights assigned to the various instance types, it seems like 10 capacity units is roughly equal to 1TB of RAM, making our target_capacity 10 TB of RAM.
If this is accurate I think it'd be good to record this in a comment so if we add additional instance types it's easy to remember what weight to assign them or if we want to increase our capacity by 5 TB we know we need to up this by 50 capacity units.
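A hedged sketch of what recording that convention in a comment could look like (the 10-units-per-TB ratio and weights are the ones discussed above; the AMI value and exact block layout are placeholders, not the PR's actual code):

```
# Capacity units are arbitrary; by convention here, 10 units ~= 1 TB of RAM,
# so target_capacity = 100 means roughly 10 TB of RAM across the fleet.
# To add ~5 TB of capacity, increase target_capacity by 50.
resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role      = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  allocation_strategy = "diversified"
  target_capacity     = 100

  launch_specification {
    ami               = "ami-xxxxxxxx"  # placeholder
    instance_type     = "x1.16xlarge"   # ~976 GB RAM
    weighted_capacity = 10              # ~1 TB RAM => 10 units
    spot_price        = "${var.spot_price}"
  }
}
```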
```
# Client Specific
instance_type = "x1.16xlarge"
weighted_capacity = 10 # via https://aws.amazon.com/ec2/instance-types/
spot_price = "${var.spot_price}"
```
We seem to be using the same spot_price for every instance type.
`spot_price` isn't the value you actually pay; it's the highest value you're willing to go to. For us, that's always the spot price for the biggest class, since we're not really concerned about the small variances for smaller classes.
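Since `spot_price` is only a bid ceiling, sharing a single variable across every launch specification is fine; a minimal sketch (the variable name matches the diff above, but the description is illustrative):

```
# Maximum hourly bid for any spot instance, sized for the largest
# instance class. Smaller classes simply never reach this ceiling,
# and we always pay the market spot price, not this value.
variable "spot_price" {
  description = "Upper bound on the hourly spot bid for fleet instances"
}
```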
```
user_data = "${data.template_file.nomad_client_script_smusher.rendered}"

resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role      = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  allocation_strategy = "diversified"
```
Are you sure we want to use `diversified` instead of `lowestPrice`? It seems like lowestPrice would be overall cheaper, but could have more volatility if all our instances are of a single type and then that type's capacity gets snapped up.
However I think that what might be most appropriate for our use case is to follow the directions of the "Configuring Spot Fleet for Cost Optimization and Diversification" section:

> To create a fleet of Spot Instances that is both cheap and diversified, use the lowestPrice allocation strategy in combination with InstancePoolsToUseCount. Spot Fleet automatically deploys the cheapest combination of instance types and Availability Zones based on the current Spot price across the number of Spot pools that you specify. This combination can be used to avoid the most expensive Spot Instances.

What I couldn't determine from the documentation is what happens if there's no spot capacity for some instance types in the `diversified` strategy. Do the spot requests just sit and wait to be fulfilled or do they give up and use what is actually available? It seems to insinuate that at least...
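If we did want to try the quoted cost-optimization approach, a rough sketch of the change might be (assuming the Terraform AWS provider in use supports the `instance_pools_to_use_count` argument, which only applies with `lowestPrice`; the pool count of 3 is an illustrative guess, not a recommendation from the PR):

```
resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role              = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  allocation_strategy         = "lowestPrice"
  instance_pools_to_use_count = 3  # spread across the 3 cheapest pools
  target_capacity             = 100

  # launch_specification blocks unchanged from the diversified version...
}
```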
My understanding is that `lowestPrice` is for riding the variance in spot price for a single instance type across regions and AZs. Since we're only operating on a single AZ and our limiting factor is capacity, I think that `diversified` is the correct strategy here.
infrastructure/instances.tf
Outdated
```
}

##
# c5d.18xlarge
```
C5's are compute optimized, not RAM optimized. I would have thought R5's would be more appropriate here.
I've skimmed this and at surface level it seems reasonable. However I don't know anything about spot fleets yet and I don't currently have the mental clarity to do the research on all the various settings here and think
…into miserlou/spotfleet
I think we should also include
…into miserlou/spotfleet
So since r5d.24xlarge has 768 GB and 20000/768 = 26.04, I think we should also set …

Also, we had discussed a strategy for trying to rotate volumes semi-fairly. I forget what it was, but has that been included in this PR yet?

It doesn't; I think that might be out of scope, as we might want to try a few strategies.

Is there an initial strategy we want to try first? Or is that just what it's already doing?

This strategy is AWS-generated chaos. The question is whether we need CCDL-generated chaos, or whether to try to statefully tame the chaos.

Cool, sounds pretty good to me. Earlier you mentioned some dev stack testing. Have you done that, or should this just be tested in staging?

I've tested this once before myself and I'm doing another test now. Curious how the allocation will come in with these adjustments.
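For illustration, if 20000/768 rounds to 26 capacity units, the corresponding launch specification might look like this (a hypothetical block based on the arithmetic in this comment, not the PR's actual code; the AMI is a placeholder):

```
launch_specification {
  ami               = "ami-xxxxxxxx"   # placeholder
  instance_type     = "r5d.24xlarge"   # 768 GB RAM
  weighted_capacity = 26               # 20000 / 768 = 26.04, rounded down
  spot_price        = "${var.spot_price}"
}
```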
README.md
Outdated
```
terraform taint <your-entity-from-state-list>
```

And then rerun `deploy.sh` with the same parameters you originally ran it with.
The merge conflict ended up causing this section to be repeated. In my branch I moved it down below the "Autoscaling and Setting Spot Prices" section, so this one can just be removed.
infrastructure/instances.tf
Outdated
```
# autoscaling_group_name = "${aws_autoscaling_group.clients.name}"
# depends_on = ["aws_instance.nomad_server_1"]
# }
```
Can we clean up this dead code? If we ever need to go back to it it'll be in the git history.
infrastructure/logging.tf
Outdated
```
# "${aws_autoscaling_policy.clients_scale_down.arn}"
# ]

# }
```
Can we clean up this dead code? If we ever need to go back to it it'll be in the git history.
Ok, I'd prefer to keep dead code out of the repo, but other than that this looks good to me.
Good work!
Needs more testing, tuning, and cost prediction, but seems like it's working so far! Spot fleets are cool!