
[WIP] Use Spot Fleet Rather Than Auto Scaling Groups #750

Merged: 16 commits from miserlou/spotfleet into dev on Oct 30, 2018
Conversation

@Miserlou (Contributor)

Needs more testing, tuning, and cost prediction, but seems like it's working so far! Spot fleets are cool!

@ghost assigned Miserlou Oct 19, 2018
@kurtwheeler (Contributor)

Something I thought of last night is that we need to dynamically set MAX_DOWNLOADER_JOBS_PER_NODE based on the amount of RAM the host instance has, since the fleet will give us instances with different amounts. I've started experimenting with the psutil package as a way to do this; the tricky part will be making sure we're reading the RAM of the host and not the amount of RAM allocated to the Docker container.

resource "aws_spot_fleet_request" "cheap_ram" {
iam_fleet_role = "${aws_iam_role.data_refinery_spot_fleet.arn}"
allocation_strategy = "diversified"
target_capacity = 100
@kurtwheeler (Contributor)

If I understand this correctly, the units of this capacity are arbitrary based on what we decide, right? Looking at the weights assigned to the various instance types, it seems like 10 capacity units is roughly equal to 1 TB of RAM, making our target_capacity 10 TB of RAM.

If this is accurate, I think it'd be good to record it in a comment, so that if we add additional instance types it's easy to remember what weight to assign them, and if we want to increase our capacity by 5 TB we know to bump this value by 50 capacity units.
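For reference, a minimal sketch of the kind of comment being suggested; the 10-units-per-TB convention is inferred from the weights in this PR, not something AWS defines:

```hcl
# Capacity units are arbitrary. Convention used in this file: 10 units ~ 1 TB
# of RAM, i.e. weighted_capacity ~ instance RAM in GB / 100. A target_capacity
# of 100 is therefore roughly 10 TB of RAM across the fleet, and adding 5 TB
# of capacity means raising it by 50 units.
resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role      = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  allocation_strategy = "diversified"
  target_capacity     = 100

  # ... launch specifications, one per instance type, each weighted per the
  # convention above ...
}
```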

# Client Specific
instance_type = "x1.16xlarge"
weighted_capacity = 10 # via https://aws.amazon.com/ec2/instance-types/
spot_price = "${var.spot_price}"
@kurtwheeler (Contributor)

We seem to be using the same spot_price for every instance type.

@Miserlou (Contributor, Author)

spot_price isn't the value that you actually pay; it's the highest price you're willing to go to. For us, that's always the spot price for the biggest instance class; we're not really concerned about the small variances for the smaller classes.
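As an illustration of that point (a sketch only; the default below is a placeholder, not the value from this PR), the ceiling bid can be declared once and shared by every launch specification, since what is actually paid is the current Spot market price rather than the bid:

```hcl
variable "spot_price" {
  description = "Maximum hourly bid for any instance in the fleet, sized for the largest class. AWS charges the current Spot market price, not this ceiling."
  default     = "10.00" # placeholder value for illustration only
}
```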

user_data = "${data.template_file.nomad_client_script_smusher.rendered}"
resource "aws_spot_fleet_request" "cheap_ram" {
iam_fleet_role = "${aws_iam_role.data_refinery_spot_fleet.arn}"
allocation_strategy = "diversified"
@kurtwheeler (Contributor)

Are you sure we want to use diversified instead of lowestPrice? It seems like lowestPrice would be cheaper overall, but it could have more volatility if all our instances are of a single type and that type's capacity gets snapped up.

However, I think what might be most appropriate for our use case is to follow the directions of the "Configuring Spot Fleet for Cost Optimization and Diversification" section:

To create a fleet of Spot Instances that is both cheap and diversified, use the lowestPrice allocation strategy in combination with InstancePoolsToUseCount. Spot Fleet automatically deploys the cheapest combination of instance types and Availability Zones based on the current Spot price across the number of Spot pools that you specify. This combination can be used to avoid the most expensive Spot Instances.

What I couldn't determine from the documentation is what happens if there's no spot capacity for some instance types in the diversified strategy. Do the spot requests just sit and wait to be fulfilled or do they give up and use what is actually available? It seems to insinuate that at least...
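For reference, a minimal sketch of that combined configuration, assuming an AWS provider version that supports the instance_pools_to_use_count argument on aws_spot_fleet_request; the pool count here is illustrative:

```hcl
resource "aws_spot_fleet_request" "cheap_ram" {
  iam_fleet_role  = "${aws_iam_role.data_refinery_spot_fleet.arn}"
  target_capacity = 100

  # Deploy the cheapest combination of instance types and AZs, spread across
  # several Spot pools instead of concentrating in a single one.
  allocation_strategy         = "lowestPrice"
  instance_pools_to_use_count = 3 # illustrative

  # ... launch specifications unchanged ...
}
```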

@Miserlou (Contributor, Author)

My understanding is that lowestPrice is for riding the variance in spot price for a single instance type across regions and AZs. Since we're only operating in a single AZ and our limiting factor is capacity, I think diversified is the correct strategy here.

}

##
# c5d.18xlarge
@kurtwheeler (Contributor)

C5s are compute optimized, not RAM optimized. I would have thought R5s would be more appropriate here.

@kurtwheeler (Contributor) left a comment

I've skimmed this and at surface level it seems reasonable. However, I don't know anything about spot fleets yet, and I don't currently have the mental clarity to do the research on all the various settings here and think them through.

@kurtwheeler (Contributor)

I think we should also include x1e.8xlarge as an option for the spot fleet because it has more RAM than the r5d.24xlarge.
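A sketch of what that additional launch specification might look like; the x1e.8xlarge has 976 GiB of RAM, so the same weight as the x1.16xlarge already in the file seems appropriate, and the remaining attributes would mirror the existing specifications:

```hcl
launch_specification {
  instance_type     = "x1e.8xlarge"
  weighted_capacity = 10 # 976 GiB of RAM, same weight as x1.16xlarge
  spot_price        = "${var.spot_price}"
  user_data         = "${data.template_file.nomad_client_script_smusher.rendered}"
  # ... AMI, subnet, security groups, etc. as in the other specifications ...
}
```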

@ghost added the in progress label Oct 26, 2018
@ghost added the in progress label Oct 29, 2018
@kurtwheeler (Contributor)

So since the r5d.24xlarge has 768 GB and 20000/768 = 26.04, I think we should also set MAX_CLIENTS to 26 in infrastructure/environments/prod.tfvars.

Also we had discussed a strategy for trying to rotate volumes semi-fairly. I forget what it was, but has that been included in this PR yet?
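For the MAX_CLIENTS suggestion above, a minimal sketch of the change, assuming the tfvars variable is spelled max_clients (the exact name in the repo may differ):

```hcl
# infrastructure/environments/prod.tfvars
max_clients = 26 # floor(20000 GB of target RAM / 768 GB per r5d.24xlarge)
```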

@Miserlou (Contributor, Author)

It doesn't; I think that might be out of scope, as we might want to try a few strategies.

@kurtwheeler (Contributor)

Is there an initial strategy we want to try first? Or is that just what it's already doing?

@Miserlou (Contributor, Author)

This strategy is AWS-generated chaos. The question is whether we need CCDL-generated chaos, or whether we should try to statefully tame the chaos.

@kurtwheeler (Contributor)

Cool, sounds pretty good to me. Earlier you mentioned some dev stack testing. Have you done that or should this just be tested in staging?

@Miserlou (Contributor, Author)

I've tested this once before myself, and I'm doing another test now. Curious to see how the allocation comes in with these adjustments.

README.md Outdated
terraform taint <your-entity-from-state-list>
```

And then rerun `deploy.sh` with the same parameters you originally ran it with.
@kurtwheeler (Contributor) Oct 29, 2018

The merge conflict ended up causing this section to be repeated. In my branch I moved it down below the "Autoscaling and Setting Spot Prices" section, so this one can just be removed.

# autoscaling_group_name = "${aws_autoscaling_group.clients.name}"
# depends_on = ["aws_instance.nomad_server_1"]
# }

@kurtwheeler (Contributor)

Can we clean up this dead code? If we ever need to go back to it, it'll be in the git history.

# "${aws_autoscaling_policy.clients_scale_down.arn}"
# ]

# }
@kurtwheeler (Contributor)

Can we clean up this dead code? If we ever need to go back to it, it'll be in the git history.

@kurtwheeler (Contributor) left a comment

OK, I'd prefer to keep dead code out of the repo, but other than that this looks good to me.

Good work!

@Miserlou merged commit 5caa571 into dev Oct 30, 2018
kurtwheeler pushed a commit that referenced this pull request on Jan 10, 2019: [WIP] Use Spot Fleet Rather Than Auto Scaling Groups
@wvauclain deleted the miserlou/spotfleet branch on July 11, 2019