Add AWS-native k8s Deployment #116
I thought about how to deal with kubernetes, and I think the best way to handle this is to make a new implementation of the scheduler and have the JSON config file pick up the implementation and use it. I don't remember all the details, but I do remember that a special scheduler (and possibly a modified worker impl) is likely to give the best experience, since you'd be able to talk natively to k8s, especially when it comes to auto-scaling.
After looking into this a bit more, I think the configuration part should be fairly straightforward. Getting autoscaling actually good is probably a bit harder, but it seems doable as well.
I'd be interested to see what you are thinking for implementing it. I think the best way to proceed here is to write a proof-of-concept that is throwaway code, then throw it over to reviewers for an overview of how it will work. This kind of task is going to be tricky to do in stages because there is a high chance of late-stage design issues that are impossible to see early on. So my recommendation is to make a really janky version that fully works, not high-quality code, that does whatever it takes to make it work; then we can discuss how to best put it in production. I'm not sure I have a full view in my mind of how this would work with k8s. For example, should the scheduler talk directly to the cluster controller, or should an external scaling agent monitor the scheduler for scale ups/downs? It gets even more complicated if lambda support is added.
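To make the second option a bit more concrete, here's a minimal sketch of what an external scaling agent could look like. Everything in it is hypothetical: the scheduler metrics endpoint (URL and JSON shape), the deployment name, and the namespace don't exist in turbo-cache today, and it shells out to `kubectl` rather than committing to a particular k8s client library.

```typescript
// scale-agent.ts -- hypothetical external scaling agent (proof-of-concept style).
// Assumes the scheduler exposes a metrics endpoint reporting queued actions;
// the endpoint, JSON shape, deployment name, and namespace are all made up here.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const SCHEDULER_METRICS_URL = "http://turbo-cache-scheduler:50061/metrics"; // hypothetical
const WORKER_DEPLOYMENT = "turbo-cache-worker";                             // hypothetical
const NAMESPACE = "turbo-cache";                                            // hypothetical
const ACTIONS_PER_WORKER = 8;

async function desiredReplicas(): Promise<number> {
  // Poll the (hypothetical) scheduler metrics endpoint for queue depth.
  const res = await fetch(SCHEDULER_METRICS_URL);
  const metrics = (await res.json()) as { queued_actions: number };
  return Math.max(1, Math.ceil(metrics.queued_actions / ACTIONS_PER_WORKER));
}

async function scaleWorkers(replicas: number): Promise<void> {
  // Shell out to kubectl so we don't depend on a specific k8s client library.
  await run("kubectl", [
    "scale", `deployment/${WORKER_DEPLOYMENT}`,
    `--replicas=${replicas}`,
    "-n", NAMESPACE,
  ]);
}

async function loop(): Promise<void> {
  for (;;) {
    try {
      await scaleWorkers(await desiredReplicas());
    } catch (err) {
      console.error("scaling iteration failed:", err);
    }
    await new Promise((resolve) => setTimeout(resolve, 30_000)); // poll every 30s
  }
}

void loop();
```

The alternative design (scheduler talks to the cluster controller directly) would fold this logic into the scheduler implementation itself instead of running it as a sidecar/agent.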
Fortunately @rsenghaas expressed interest in helping me out, and we'll go over the codebase over the weekend and brainstorm a bit. @allada A few random thoughts we're considering:
(*) I'm not sure autoscaling is actually the best way to go. At least naive autoscaling probably won't do. We've had bad experiences with that in the past with remote execution. The RBE providers we tested seemed to optimize for ease of use and ease of scale, but not for raw performance. Dynamically scaling worker nodes means that artifacts need to be allocated from cache nodes to those worker nodes. This can become a huge waste of power and resources if it's unclear what the underlying hardware actually looks like. E.g. splitting 2 threads from a 50-thread job out to a new machine could trigger gigantic data transfers. Anything that doesn't run a worker process on the same physical machine as the cache could cause a significant latency increase. Sure, clouds have their own backbone networks and such, but that's still orders of magnitude slower than having the data on the same physical machine. However, scaling nodes instead of pods could be a different story. In any case, I'm fairly certain that we won't be able to abstract away the hardware in any way, and it could be quite tricky to get this right. I love it 😆

Additionally, somewhat unrelated to turbo-cache itself, but relevant for @rsenghaas (and @jaroeichler ?) for our tests: We should add an S3-compatible storage backend to the
Where to start...
FYI, at one point I experimented with using an S3 filesystem, but realized that to do this efficiently I'd need to write the FUSE filesystem myself. The idea is that you could send
Hahaha this conversation is escalating quickly 🤣
Agreed. A CLI is only useful for experimenting and config files should be the "default" usage pattern.
This is also what I was thinking. If I'm not mistaken it should be possible to more-or-less leave the
Actually I don't care about final images, just about a CI-friendly way to build a k8s-friendly image. We usually build containers with nix directly from upstream sources and use incremental builds for the image layers similarly to how it works in Bazel. E.g. our remote execution image is built like that. No need to distribute a prebuilt image if you can fully reproduce it with the exact same final hash 😄 This might be a bit too incompatible with "most" setups though as it requires tooling to handle such image builds. I do think k8s-specialized images in general seem like the lowest-hanging fruit at the moment though and I think it makes sense to focus on that first.
First, there is the convenience that wasm basically obsoletes the entire container build pipeline. You can build the wasm module with a regular Rust toolchain and you're done and can immediately deploy it. Second, since wasm modules start up orders of magnitude faster than a container:
Hmm, that does sound reasonable. I don't have an issue with autoscaling per se, I just feel like current implementations are too inefficient and don't take { hardware | latency | physical distance | usage patterns } into account well. This is a really hard thing to get right though. It might actually be an interesting research topic to use ML for something like that, and apparently a lot of research has already gone into optimizing ANN search like this. But this also seems like more of a "luxury" problem, and first we need to get unoptimized distribution working at all 😅 Just in case you didn't stumble upon it already: https://github.com/erikbern/ann-benchmarks 😊
Fortunately we have a bunch of cloud credits that we need to burn within the next 9 months, so I think testing something like this should be fine for us 😇 This also seems like something really fun to implement, so I'll play around with this. Our primary use case at the moment is building LLVM, and I'd like to build it as frequently as possible. However, while LLVM gets a commit every ~10 min or so and commits tend to cause mass rebuilds, we'll only have very few build jobs running for that. I believe one full build is ~10k actions; not sure how many files, but it's probably a lot smaller than your use case. If I remember correctly, inefficient cache/remote-exec usage brought us to ~200GB of network transfer across ~50 jobs for from-scratch builds, but that should've been at most ~20 gigs and it should've happened on a single machine without any network transfer. Unfortunately the cloud config for those builds was outside of our control. We've now pivoted to a different approach: The
- Node affinity/label: Encourage deployments to go to the same node. For instance, enforce that a remote execution node always comes with a cache deployment that's potentially pre-populated before the remote execution service starts getting requests.
- Node taint: Prevent a deployment from being scheduled to a tainted node, for instance to keep a remote execution job from being run on an edge node. Or to keep an in-memory cache from being deployed to a "slow" storage node. Or to keep S3 storage from being deployed to an in-memory cache node. Endless possibilities 🤣

AFAIK affinities and labels can both be set and changed dynamically via e.g. Pulumi, Terraform, or operators. I'll have to look into whether this is actually useful though, and even if it is, it's probably a rather late-stage optimization.
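For reference, a minimal Pulumi sketch of where these knobs would live in a worker deployment. The node label, taint key, pod labels, and image are hypothetical placeholders, not anything turbo-cache defines today.

```typescript
import * as k8s from "@pulumi/kubernetes";

const worker = new k8s.apps.v1.Deployment("turbo-cache-worker", {
  spec: {
    replicas: 2,
    selector: { matchLabels: { app: "turbo-cache-worker" } },
    template: {
      metadata: { labels: { app: "turbo-cache-worker" } },
      spec: {
        affinity: {
          // Only schedule onto nodes we've labelled as execution nodes...
          nodeAffinity: {
            requiredDuringSchedulingIgnoredDuringExecution: {
              nodeSelectorTerms: [{
                matchExpressions: [{
                  key: "turbo-cache/node-role",       // hypothetical node label
                  operator: "In",
                  values: ["remote-exec"],
                }],
              }],
            },
          },
          // ...and co-locate with a cache pod on the same physical node.
          podAffinity: {
            requiredDuringSchedulingIgnoredDuringExecution: [{
              labelSelector: { matchLabels: { app: "turbo-cache-cas" } }, // hypothetical
              topologyKey: "kubernetes.io/hostname",
            }],
          },
        },
        // Tolerate the taint that keeps everything else off these nodes.
        tolerations: [{
          key: "turbo-cache/dedicated",               // hypothetical taint key
          operator: "Equal",
          value: "remote-exec",
          effect: "NoSchedule",
        }],
        containers: [{
          name: "worker",
          image: "ghcr.io/example/turbo-cache:latest", // hypothetical image
        }],
      },
    },
  },
});

export const workerName = worker.metadata.name;
```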
Sounds very interesting! It also sounds very custom-tailored to the use case of remote caching, and I'm not sure whether that's a good or a bad thing. It sounds like it could improve performance quite significantly, but I wonder whether that's really worth the effort and complexity compared to just "brute-force storing in memory". Just regarding monitoring, could this be a good use case for an eBPF/XDP module? I've always wanted to play with that but never had a good use case for it (until now?) 😄
To further #5, you may find some ideas for the
The other issue I've noticed is the process that compilers use to generate debug symbols. Bazel usually requires the entire file to be re-compiled if a variable name or a comment is changed, even though the stripped binary is actually the same. A much higher cache hit rate could be achieved by generating the debug symbols in a separate task, similar to
Sorry for opening the discussion on this old(er) thread, but I think it would help a lot for adoption of TurboCache if a Helm chart were provided. From the AWS example I can tell that the topology for a multi-node cluster is somewhat complex, so setting up the services, deployments, and storage would be much simpler with a Helm chart.
I agree that a Helm chart is essentially a hard requirement for k8s users and would help adoption in that space a lot. It's very much planned to implement this. We're working towards this already, but it'll likely take some time until it's ready. Things that I'd consider "blockers" until we can implement a Helm chart:
This is only for the cache portion though. Note that I didn't include any form of user accounts etc., since I think it's better to figure this out once we have actual deployments running. We also probably want to split the current all-in-one binary into smaller binaries to reduce image size, but I wouldn't consider this a "hard blocker" (#394 works towards making this easier). We're working on support for OpenTelemetry, but I wouldn't consider it a blocker for an initial Helm release (#387 works towards that, as it lets us export metrics via an otel subscriber). Release-wise, I think our informal "roadmap" until the release of a Helm chart looks roughly like this:
@C0mbatwombat This is more of a sketch at the moment. If you see anything that strikes you as unreasonable or "could be improved" or "not as important", please let me know. Feedback from k8s users is very valuable to us ❤️
Makes sense, happy to see that you have a plan, thanks for all the hard work!
@blakehatch is this ticket closed by #116? I think we could open another ticket for tracking GKE and/or Azure's Kubernetes, but this one will be first.
@kubevalet (an alter GitHub ego) and @blakehatch are working on this one. Edited the name. |
I'd like to run turbo-cache in a k8s cluster deployed with Pulumi so that we can automatically set it up for users as part of rules_ll. Simple YAML manifests would be usable for users of raw k8s, Terraform, and Pulumi.
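Roughly the kind of program I have in mind, as a minimal sketch: a single CAS deployment plus a Service in front of it, fed by a ConfigMap. The image name, port, invocation, and config contents are placeholders rather than real turbo-cache defaults, and a full setup would also need the scheduler and workers as in the AWS example.

```typescript
import * as k8s from "@pulumi/kubernetes";

const appLabels = { app: "turbo-cache-cas" };

// Placeholder config; a real deployment would mount an actual turbo-cache JSON config.
const config = new k8s.core.v1.ConfigMap("turbo-cache-config", {
  data: { "cas.json": "{ /* turbo-cache JSON config goes here */ }" },
});

const cas = new k8s.apps.v1.Deployment("turbo-cache-cas", {
  spec: {
    replicas: 1,
    selector: { matchLabels: appLabels },
    template: {
      metadata: { labels: appLabels },
      spec: {
        containers: [{
          name: "cas",
          image: "ghcr.io/example/turbo-cache:latest",   // hypothetical image
          args: ["/config/cas.json"],                     // hypothetical invocation
          ports: [{ containerPort: 50051 }],              // hypothetical gRPC port
          volumeMounts: [{ name: "config", mountPath: "/config" }],
        }],
        volumes: [{ name: "config", configMap: { name: config.metadata.name } }],
      },
    },
  },
});

// Expose the gRPC endpoint inside the cluster.
const svc = new k8s.core.v1.Service("turbo-cache-cas", {
  spec: {
    selector: appLabels,
    ports: [{ port: 50051, targetPort: 50051 }],
  },
});

export const casName = cas.metadata.name;
export const serviceName = svc.metadata.name;
```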
I'd be willing to work on this ☺️