Switch to GitHub hosted runners #1323
I know very little about the self-hosted setup you've been using, but have you considered disk I/O as the bottleneck on AWS? I find it is one of the most overlooked aspects for this type of thing, and by default, AWS is set to burst mode, which means you get fast performance for a short amount of time and then things slow to a crawl. Provisioned IOPS are faster and guaranteed. I don't know how they play with spot instances, which are themselves not ideal for a CI system in my opinion.
I would be surprised if GitHub is spinning up a new instance every time; they are almost certainly using containers of some kind, and the time it takes to spin up a new container is way less than spinning up a new instance (the same is true on AWS).
Both of these numbers seem crazy to me. Like, wouldn't it be way cheaper to just use a self-hosted runner on a machine we physically own at this point? (The biggest issue there is the maintenance cost, so probably not.)
Yes, and I agree that's almost certainly the problem. I wrote what I know about it, and requested help with it, back in August.

We're using GP2 volumes, which I think avoids the burst behavior. But maybe it also limits the peak performance we get. In any case, I'm pretty sure the main problem isn't the performance of the EBS volume itself (our working area is on a local SSD anyway, not on an EBS volume), but rather the time to "warm up" the EBS volume by streaming the system disk from the snapshot on S3 for a brand-new VM.

"For volumes that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume." (from here: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ebs-initialize.html)
Well, I don't know, but this page seems to say otherwise: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#using-a-github-hosted-runner

"When the job begins, GitHub automatically provisions a new VM for that job. All steps in the job execute on the VM, allowing the steps in that job to share information using the runner's filesystem. You can run workflows directly on the VM or in a Docker container. When the job has finished, the VM is automatically decommissioned."

They might have sufficient scale to have instances already spun up and ready to run before they're requested. We could do that too, but it would be costly. But again, 2 minutes versus 5 minutes of startup time isn't really what I'm worried about here.
Yes, it's a lot of money, but it's not obvious what we can do about it. On-premise hardware is always cheaper (by a lot!) than the cloud if you don't count the cost of maintaining the on-premise hardware. Whether we would save money with that approach would probably come down mostly to how reliable the hardware is.
Sorry, I missed that you were using snapshots. Yes, that is crazy slow in my experience and I think you are correct that it is likely the primary issue.
This is a question I struggle with as well. Re: Azure: putting the zip on Azure for this should be straightforward; we can do that early January if you want to try it out and see.
I don't know of a way to avoid using snapshots, though. I'd love to hear about it if there is such a way.
I'm out the first week, back on the 8th. Would love to set that up then.
As another experiment, I added this line to the start of the self-hosted build:
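(The command itself isn't preserved here. Per the AWS EBS-initialization docs linked earlier, the warm-up amounts to reading every block of the volume once. A sketch of such a step, assuming a Linux runner with `dd` available and `/dev/xvda` as the device name, neither of which is necessarily what was actually used:)

```yaml
- name: Warm up the EBS volume (pulls every block down from S3)
  run: sudo dd if=/dev/xvda of=/dev/null bs=1M
```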
This is mentioned on the AWS initialization page linked earlier as a way to make sure the EBS volume is warmed up and gets its full performance. It takes an hour to run that command, so this is totally unworkable, but I expected the build performance to be really good after it completed, at least. Strangely enough, though, the build performance is still much slower than on the GitHub-hosted runners. After the hour-long warmup, building cesium-native took 7m 46s, but it took only 4m 42s on the GitHub-hosted instance. Similarly, building the plugin itself took 41m 28s self-hosted instead of 25m 28s on GitHub. So unless I've somehow completely failed at using that warm-up command, EBS initialization isn't the whole story.
GitHub just announced that the default, free runner for public repos has been upgraded to 4 vCPUs, 16 GiB of RAM, and 150 GiB of storage. So it's probably possible to do Unreal builds for free now, and hopefully even with decent performance. 🤩
@kring Great update. Would this require downloading and unpacking Unreal Engine during the build? I suppose we can store it as an image, right?
Yes, it would. We definitely need to get the Unreal images on Azure in order to avoid huge costs.
Not sure what you mean here?
We can't create an AMI-type image with Unreal already unpacked to reduce the time for the download+unpack step on GitHub Actions runners.
There's no way to use a custom image on GitHub-hosted runners, AFAIK. A (Windows) container could be a possibility, but that'll only be a win if the container image is smaller than the ZIP, because we'd still have to download it (again, AFAIK).
Based on just one sample, the upgraded small runners seem much slower than the large ones (which isn't too surprising, of course). It's harder to compare to the self-hosted runners. When the self-hosted runners are at their best, they're significantly faster than this (under an hour rather than the 1 hour 24 minutes we see here). However, when they're slower (for unknown reasons!), they're significantly slower than this. Considering the upgraded small runners are free, less maintenance than self-hosted, and likely to be at least pretty consistent in how long they take to do a build, that's probably a win overall.
To save memory.
Because there's not enough space on that volume.
The GH Actions runner doesn't have the version of Visual Studio that Unreal wants, so Unreal (for some reason) chooses an _older_ one, rather than a newer one. That causes linker errors while packaging, because an older Visual Studio can't link code compiled with a newer Visual Studio. So this commit installs the version of the compiler that Unreal wants and that was used to build the Cesium for Unreal plugin.
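A sketch of what such a step can look like (this is illustrative, not the exact commit: the MSVC component ID, 14.34 here, and the Enterprise install path are assumptions that would need to match the toolset version Unreal actually wants):

```yaml
- name: Install the MSVC toolset that Unreal expects
  shell: pwsh
  run: |
    # Assumed install path and component ID; Start-Process -Wait blocks
    # until the Visual Studio Installer finishes modifying the install.
    $installer = "C:\Program Files (x86)\Microsoft Visual Studio\Installer\vs_installer.exe"
    Start-Process -Wait -FilePath $installer -ArgumentList @(
      'modify',
      '--installPath', '"C:\Program Files\Microsoft Visual Studio\2022\Enterprise"',
      '--add', 'Microsoft.VisualStudio.Component.VC.14.34.17.4.x86.x64',
      '--quiet', '--norestart'
    )
```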
This is working well and ready for review. It's using all the standard runner types, which are free for open source, so this should save us a lot of money. @mramato, if you can hook me up with some Azure account credentials, I should be able to drive our CI cost to zero with minimal effort. Right now it might still be semi-high because of the massive amount of data we download from S3 on every build.

The Windows and Linux builds are pretty performant. I had to do some kinda crazy things (uninstall stuff we don't need, mostly) to make room for Unreal Engine on the Linux VMs, though, because they're very low on disk space, but it's working well enough.

The macOS builds are slow, especially because we can only run 5 at a time and a single Unreal commit needs 6. I tried at one point to use the new M1 runners. They had amazing performance for building cesium-native, but they're so stupidly constrained on memory (only 7 GB, versus 14 GB for the macOS Intel runners!) that our Unreal builds took approximately forever. A lot of the problem is that Clang (unlike Visual Studio) uses silly amounts of memory when compiling the templates in the metadata system. It'd be nice to do something about this, but it won't be quick or easy (I know because I've already spent a fair bit of time trying).
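The disk-space trick mentioned above boils down to deleting big preinstalled toolchains before unpacking UE. Roughly like this (the exact set of things removed in this PR may differ; these are just well-known large directories on the Ubuntu runner image):

```yaml
- name: Free disk space for Unreal Engine
  if: runner.os == 'Linux'
  run: |
    # Remove large preinstalled toolchains we don't need (illustrative list).
    sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
    df -h  # show the reclaimed space
```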
The correct version is already installed now that we're using the macos-12 runner instead of macos-14.
I'm merging this because, as imperfect as it may be, it's better than what's in main. And lots of other branches are failing in dodgy ways that are very likely to be fixed by this one.
~~This is just a test, don't merge it.~~ It's ready now.

We've been using self-hosted runners to do Unreal builds for a while now. Mostly it works fine, but it can be a hassle to maintain the system that manages the infrastructure, and we sometimes see truly awful performance from builds (for unknown reasons).
So this PR switches to using GitHub-hosted large runners instead. Only the UE 5.1 Windows build is hooked up for the moment. Running on a generic build image rather than our custom ones requires some extra steps to happen during the build:

1. Installing the version of Visual Studio's compiler that Unreal wants.
2. Downloading and unpacking Unreal Engine from S3.
These add time to the build (about 15-20 minutes), and (2) also adds significant cost because UE is huge and downloading it from S3 on each build is expensive. Hosting it on Azure instead should fix that, though (since GitHub runners run on Azure, and same-cloud downloads are usually free).
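To make the cost concrete, the download+unpack step is essentially the sketch below (the bucket and paths are made up for illustration; pointing it at Azure blob storage instead would mostly just change the URL):

```yaml
- name: Download and unpack Unreal Engine
  shell: pwsh
  run: |
    # Hypothetical bucket/object names; this is the S3-egress-heavy step.
    aws s3 cp s3://example-bucket/UnrealEngine-5.1.zip C:\UE.zip --no-progress
    7z x C:\UE.zip -oC:\UE
```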
Overall this works really well. Time to build start is much shorter (~2 minutes instead of 5-6). The build itself is somewhere between a little and a lot faster, which is pretty mind-boggling, because the build in this PR does a lot more work, and our self-hosted instances are similar to or faster than the GitHub-hosted instances. But, as mentioned above, we truly have no idea why the performance of our EC2 instances is so astoundingly slow in the self-hosted case. It could be that we're doing something wrong. Or perhaps AWS performance in this sort of use case (spin up a new instance, run a single build, shut it down) is just really terrible compared to what GitHub gets with Azure? More on the build slowness here: #1192
So this looks pretty viable from a purely technical perspective. From a cost perspective, though, I'm worried.
Our self-hosted Windows instances cost 78 to 99.2 cents per hour on-demand (depending on exactly which instance we use). We use spot instances, so the actual cost is lower (that 99.2 cents is currently 59.94 cents as a spot instance). These are all 8-core machines with 32+ GB of memory and a local SSD.
The 8-core, 32 GB GitHub-hosted Windows instances instead cost 6.4 cents per minute, or 384 cents per hour. This is almost 4 times the on-demand price of our most expensive runner type, and about 6.4 times its current spot price. Even if it cuts build times in half, we'll still be paying a lot more. And, again, it's mind-boggling that the GitHub-hosted instances are faster! The self-hosted runners are more powerful machines doing less work! And yet, they are.
To ballpark it a bit: each Unreal commit kicks off 15 builds (5 platforms times 3 UE versions), plus a test and a package job for each version (6 more total). At roughly an hour per job, that could easily be 21 hours of compute time, or $80.64 in total, per commit! Yikes. If we could get similar performance with the self-hosted setup, the cost would be only $12.59 per commit (based on spot prices).
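Spelling out that arithmetic (the only assumption is the roughly-an-hour-per-job figure):

$$21 \text{ jobs} \times 1\ \text{hour} \times \$3.84/\text{hour} = \$80.64 \qquad \text{vs.} \qquad 21 \times \$0.5994/\text{hour} \approx \$12.59$$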
CC @mramato