
[WIP] Use Spot Fleet Rather Than Auto Scaling Groups #750

Merged: 16 commits from miserlou/spotfleet into dev on Oct 30, 2018
Conversation

@Miserlou (Contributor)

Needs more testing, tuning, and cost prediction, but seems like it's working so far! Spot fleets are cool!

@ghost assigned Miserlou Oct 19, 2018
@kurtwheeler (Contributor)

Something I thought of last night is that we need to dynamically set MAX_DOWNLOADER_JOBS_PER_NODE based on the amount of RAM the host instance has, since the fleet will give us instances with different amounts. I've started experimenting with the psutil package as a way to do this; the tricky part will be making sure we're reading the RAM of the host and not the amount of RAM allocated to the Docker container.

resource "aws_spot_fleet_request" "cheap_ram" {
iam_fleet_role = "${aws_iam_role.data_refinery_spot_fleet.arn}"
allocation_strategy = "diversified"
target_capacity = 100
@kurtwheeler (Contributor)

If I understand this correctly, the units of this capacity are arbitrary based on what we decide, right? Looking at the weights assigned to the various instance types, it seems like 10 capacity units is roughly equal to 1 TB of RAM, making our target_capacity 10 TB of RAM.

If this is accurate, I think it'd be good to record it in a comment, so that if we add additional instance types it's easy to remember what weight to assign them, and if we want to increase our capacity by 5 TB we know to bump this value by 50 capacity units.
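For reference, a minimal sketch of the kind of comment being suggested; the 10-units-per-TB convention is inferred from the weights in this PR, not something AWS defines:

```hcl
# Capacity units are arbitrary. Convention used in this file: 10 units ~ 1 TB
# of RAM, i.e. weighted_capacity ~ instance RAM in GB / 100. A target_capacity
# of 100 is therefore roughly 10 TB of RAM across the fleet, and adding 5 TB
# of capacity means raising it by 50 units.
resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role      = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  allocation_strategy = "diversified"
  target_capacity     = 100

  # ... launch specifications, one per instance type, each weighted per the
  # convention above ...
}
```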

# Client Specific
instance_type = "x1.16xlarge"
weighted_capacity = 10 # via https://aws.amazon.com/ec2/instance-types/
spot_price = "${var.spot_price}"
@kurtwheeler (Contributor)

We seem to be using the same spot_price for every instance type.

@Miserlou (Contributor, Author)

spot_price isn't the value that you actually pay; it's the highest price you're willing to go to. For us, that's always the spot price for the biggest instance class; we're not really concerned about the small variances for the smaller classes.
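As an illustration of that point (a sketch only; the default below is a placeholder, not the value from this PR), the ceiling bid can be declared once and shared by every launch specification, since what is actually paid is the current Spot market price rather than the bid:

```hcl
variable "spot_price" {
  description = "Maximum hourly bid for any instance in the fleet, sized for the largest class. AWS charges the current Spot market price, not this ceiling."
  default     = "10.00" # placeholder value for illustration only
}
```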

user_data = "${data.template_file.nomad_client_script_smusher.rendered}"
resource "aws_spot_fleet_request" "cheap_ram" {
iam_fleet_role = "${aws_iam_role.data_refinery_spot_fleet.arn}"
allocation_strategy = "diversified"
@kurtwheeler (Contributor)

Are you sure we want to use diversified instead of lowestPrice? It seems like lowestPrice would be cheaper overall, but it could have more volatility if all our instances are of a single type and that type's capacity gets snapped up.

However, I think what might be most appropriate for our use case is to follow the directions of the "Configuring Spot Fleet for Cost Optimization and Diversification" section:

To create a fleet of Spot Instances that is both cheap and diversified, use the lowestPrice allocation strategy in combination with InstancePoolsToUseCount. Spot Fleet automatically deploys the cheapest combination of instance types and Availability Zones based on the current Spot price across the number of Spot pools that you specify. This combination can be used to avoid the most expensive Spot Instances.

What I couldn't determine from the documentation is what happens if there's no spot capacity for some instance types in the diversified strategy. Do the spot requests just sit and wait to be fulfilled or do they give up and use what is actually available? It seems to insinuate that at least...
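For reference, a minimal sketch of that combined configuration, assuming an AWS provider version that supports the instance_pools_to_use_count argument on aws_spot_fleet_request; the pool count here is illustrative:

```hcl
resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role  = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  target_capacity = 100

  # Deploy the cheapest combination of instance types and AZs, spread across
  # several Spot pools instead of concentrating in a single one.
  allocation_strategy         = "lowestPrice"
  instance_pools_to_use_count = 3 # illustrative

  # ... launch specifications unchanged ...
}
```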

@Miserlou (Contributor, Author)

My understanding is that lowestPrice is for riding the variance in spot price for a single instance type across regions and AZs. Since we're only operating in a single AZ and our limiting factor is capacity, I think diversified is the correct strategy here.

}

##
# c5d.18xlarge
@kurtwheeler (Contributor)

C5s are compute optimized, not RAM optimized. I would have thought R5s would be more appropriate here.

@kurtwheeler (Contributor) left a comment

I've skimmed this and at surface level it seems reasonable. However, I don't know anything about spot fleets yet, and I don't currently have the mental clarity to do the research on all the various settings here and think them through.

@kurtwheeler (Contributor)

I think we should also include x1e.8xlarge as an option for the spot fleet because it has more RAM than the r5d.24xlarge.
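A sketch of what that additional launch specification might look like; the x1e.8xlarge has 976 GiB of RAM, so the same weight as the x1.16xlarge already in the file seems appropriate, and the remaining attributes would mirror the existing specifications:

```hcl
launch_specification {
  instance_type     = "x1e.8xlarge"
  weighted_capacity = 10 # 976 GiB of RAM, same weight as x1.16xlarge
  spot_price        = "${var.spot_price}"
  user_data         = "${data.template_file.nomad_client_script_smusher.rendered}"
  # ... AMI, subnet, security groups, etc. as in the other specifications ...
}
```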

@ghost added the in progress label Oct 26, 2018
@ghost added the in progress label Oct 29, 2018
@kurtwheeler (Contributor)

So since the r5d.24xlarge has 768 GB and 20000/768 = 26.04, I think we should also set MAX_CLIENTS to 26 in infrastructure/environments/prod.tfvars.

Also we had discussed a strategy for trying to rotate volumes semi-fairly. I forget what it was, but has that been included in this PR yet?
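For the MAX_CLIENTS suggestion above, a minimal sketch of the change, assuming the tfvars variable is spelled max_clients (the exact name in the repo may differ):

```hcl
# infrastructure/environments/prod.tfvars
max_clients = 26 # floor(20000 GB of target RAM / 768 GB per r5d.24xlarge)
```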

@Miserlou (Contributor, Author)

It doesn't; I think that might be out of scope, as we might want to try a few strategies.

@kurtwheeler (Contributor)

Is there an initial strategy we want to try first? Or is that just what it's already doing?

@Miserlou (Contributor, Author)

This strategy is AWS-generated chaos. The question is whether we need CCDL-generated chaos, or whether we should try to statefully tame the chaos.

@kurtwheeler (Contributor)

Cool, sounds pretty good to me. Earlier you mentioned some dev stack testing. Have you done that or should this just be tested in staging?

@Miserlou (Contributor, Author)

I've tested this once before myself, and I'm doing another test now. Curious to see how the allocation comes in with these adjustments.

README.md Outdated
terraform taint <your-entity-from-state-list>
```

And then rerun `deploy.sh` with the same parameters you originally ran it with.
@kurtwheeler (Contributor) Oct 29, 2018

The merge conflict ended up causing this section to be repeated. In my branch I moved it down below the "Autoscaling and Setting Spot Prices" section, so this one can just be removed.

# autoscaling_group_name = "${aws_autoscaling_group.clients.name}"
# depends_on = ["aws_instance.nomad_server_1"]
# }

@kurtwheeler (Contributor)

Can we clean up this dead code? If we ever need to go back to it, it'll be in the git history.

# "${aws_autoscaling_policy.clients_scale_down.arn}"
# ]

# }
@kurtwheeler (Contributor)

Can we clean up this dead code? If we ever need to go back to it, it'll be in the git history.

@kurtwheeler (Contributor) left a comment

OK, I'd prefer to keep dead code out of the repo, but other than that this looks good to me.

Good work!

@Miserlou merged commit 5caa571 into dev Oct 30, 2018
kurtwheeler pushed a commit that referenced this pull request on Jan 10, 2019: [WIP] Use Spot Fleet Rather Than Auto Scaling Groups
@wvauclain deleted the miserlou/spotfleet branch on July 11, 2019