Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should EC2 default to non-spot when spot is not available? #193

Open
moltar opened this issue Dec 15, 2022 · 7 comments · May be fixed by #518
Open

Should EC2 default to non-spot when spot is not available? #193

moltar opened this issue Dec 15, 2022 · 7 comments · May be fixed by #518

Comments

@moltar
Copy link

moltar commented Dec 15, 2022

The following error is produced in SFN, when the spot is not available:

{
  "resourceType": "aws-sdk:ec2",
  "resource": "runInstances.waitForTaskToken",
  "error": "Ec2.Ec2Exception",
  "cause": "There is no Spot capacity available that matches your request. (Service: Ec2, Status Code: 500, Request ID: 5b80a392-c6eb-42d1-99e9-d36db1909cf4)"
}

Catching this error, and then defaulting to non-spot would be optimal.

Otherwise, GitHub runners continue running, waiting to be completed and nothing happens.

@kichik
Copy link
Member

kichik commented Dec 15, 2022

Good idea. Maybe retry spot a bit and the fallback if the user enabled this option.

@omichowdhury
Copy link

It would also be cool to give a list of instance types to increase the pool of eligible servers.
To be honest AWS should build that in to runInstance

@kichik
Copy link
Member

kichik commented Dec 20, 2022

I believe they have that with their fleet option. I think I tried using that for EC2 but it didn't work as smoothly as RunInstances. Might be worth another look.

@omichowdhury
Copy link

Ah ok. Tbh we'll probably move to Fargate runners eventually, EC2 was just the easiest to get running. Would much rather use a Dockerfile than EC2 Image Builder.

@kichik
Copy link
Member

kichik commented Jan 15, 2023

I think this might be easy to implement once #216 is merged.

@pharindoko
Copy link
Contributor

I think it would make sense to implement both ideas mentioned above...

  1. Default to non-spot instance if no spot instance available
  2. Change instanceType for the runners to a list to add multiple instance-types
    @kichik
    What`s your opinion ?

@kichik
Copy link
Member

kichik commented Mar 20, 2023

Falling back to non-spot should be very possible although won't be perfect. I think at first just any Ec2Exception can fallback to non-spot (Step Functions don't give us the option of filtering on the error message which is not guaranteed to be stable anyway). Then maybe in the future if it has too many false positives, we can consider adding a Lambda in between to check the error message. Parsing the error message will also be a challenge because we have to try all subnets first. Might just have to replace the whole thing with one big Lambda if we really need to check the error message.

Multiple instance types will be harder as RunInstances doesn't support it and the other options didn't work too well when I tried them. If someone can figure out a good way to switch to fleet, I'm happy to hear.

My priority right now is to overhaul runner customization, so if anyone wants to get a PR going for this one, I'd be happy to accept it. At the very least we should be able to quickly get a fallback to non-spot after trying all the subnets with spot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants