"Stop burning GPU dollars. Start slicing."
In the era of ubiquitous AI, GPU scarcity is no longer the only bottleneck; GPU waste is. Most development, CI/CD, and inference workloads request a full NVIDIA GPU but use less than 15% of its capacity.
- Cloud Bills: You pay for 100% of a GPU while your workloads use a fraction.
- Scheduling Bottlenecks: Pending Pods waiting for a "Full GPU" while existing GPUs sit idle.
- Developer Friction: Teams manually editing YAMLs to share resources.
CastSlice is a lightweight, non-invasive Kubernetes Mutating Webhook that automatically converts "Whole GPU" requests into "Fractional/Shared GPU" slices based on smart policy.
It sits in your Kubernetes control plane, intercepts Pod creation, and performs on-the-fly resource transformation without changing a single line of your application code.
| Feature | The "Old" Way | The CastSlice Way |
|---|---|---|
| Cost | Full GPU per Pod | Shared GPU across multiple Pods |
| Concurrency | 1 Pod per GPU | Multiple Pods per GPU |
| Developer UX | Manual YAML changes | Zero-touch. Just add an annotation. |
| Vendor Lock-in | Locked to specific CSP tools | Cloud Agnostic. Works on EKS, GKE, AKS, or On-prem. |
CastSlice transparently rewrites `nvidia.com/gpu` resource requests into `nvidia.com/gpu-shared` for Pods that opt in via an annotation.
```
Pod CREATE request
        │
        ▼
Kubernetes API server
        │ (forwards to webhook)
        ▼
CastSlice webhook
        │
        ├── castops.io/optimize: "true" annotation present?
        │        │ YES                      │ NO
        │        ▼                          ▼
        │   resolve slice ratio        allow unchanged
        │   (slice-ratio > workload-type > default: 1)
        │        │
        │   remove nvidia.com/gpu
        │   add nvidia.com/gpu-shared: <ratio>
        │        │
        ▼        ▼
JSON Patch returned → Pod scheduled with shared GPU
```
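For an opted-in Pod, the admission response carries a JSON Patch along these lines (a simplified illustration, not the webhook's exact output; the paths depend on the container index, and `~1` is the JSON Pointer escape for `/` inside the resource name):

```json
[
  { "op": "remove", "path": "/spec/containers/0/resources/limits/nvidia.com~1gpu" },
  { "op": "add", "path": "/spec/containers/0/resources/limits/nvidia.com~1gpu-shared", "value": "2" }
]
```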
Annotations:
| Annotation | Value | Effect |
|---|---|---|
| `castops.io/optimize` | `"true"` | Enable GPU slice optimization (required) |
| `castops.io/workload-type` | `training` / `inference` / `batch` / `dev` | Select a preset slice ratio |
| `castops.io/slice-ratio` | `"N"` (positive integer) | Override the slice count directly |
Preset ratios by workload type:
| Workload Type | GPU Slices | Use Case |
|---|---|---|
| `training` | 4 | Model training jobs (higher GPU share) |
| `inference` | 2 | Serving / Triton inference servers |
| `batch` | 2 | Batch preprocessing and feature extraction |
| `dev` | 1 | Development and debugging (default) |
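The precedence rule (explicit `slice-ratio` wins over `workload-type`, which falls back to a default of 1) can be sketched in Go; this is a simplified standalone illustration of the documented behavior, not the webhook's actual source:

```go
package main

import (
	"fmt"
	"strconv"
)

// Preset slice counts per workload type, mirroring the table above.
var presets = map[string]int{
	"training":  4,
	"inference": 2,
	"batch":     2,
	"dev":       1,
}

// resolveSliceRatio applies the documented precedence:
// explicit slice-ratio > workload-type preset > default of 1.
func resolveSliceRatio(annotations map[string]string) int {
	if raw, ok := annotations["castops.io/slice-ratio"]; ok {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 {
			return n
		}
	}
	if wt, ok := annotations["castops.io/workload-type"]; ok {
		if n, found := presets[wt]; found {
			return n
		}
	}
	return 1 // no annotation: single dev-style slice
}

func main() {
	fmt.Println(resolveSliceRatio(map[string]string{"castops.io/workload-type": "inference"})) // 2
	fmt.Println(resolveSliceRatio(map[string]string{"castops.io/slice-ratio": "8"}))           // 8
	fmt.Println(resolveSliceRatio(nil))                                                        // 1
}
```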
- Go 1.24+
- A Kubernetes cluster (KinD / Minikube for local testing)
- cert-manager for TLS certificate injection
```bash
# Install cert-manager (if not already present)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

# Wait for cert-manager to be ready
kubectl rollout status deployment/cert-manager -n cert-manager

# Deploy CastSlice
kubectl apply -f https://github.com/castops/cast-slice/releases/latest/download/install.yaml

# Create the TLS certificate for the webhook (issued by cert-manager)
kubectl apply -f config/cert/certificate.yaml

# Wait for the webhook pod to be ready
kubectl rollout status deployment/cast-slice -n cast-slice
```

Add the `castops.io/optimize` annotation and optionally specify the workload type:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  selector:
    matchLabels:
      app: ollama-inference
  template:
    metadata:
      labels:
        app: ollama-inference
      annotations:
        castops.io/optimize: "true"
        castops.io/workload-type: "inference" # → gpu-shared: 2
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          resources:
            limits:
              nvidia.com/gpu: 1 # CastSlice rewrites this based on workload type
```

For fine-grained control, use an explicit ratio:
```yaml
annotations:
  castops.io/optimize: "true"
  castops.io/slice-ratio: "8" # explicit override → gpu-shared: 8
```

```bash
# Check the mutated pod
kubectl get pod -o yaml | grep gpu-shared

# training workload:      nvidia.com/gpu-shared: "4"
# inference workload:     nvidia.com/gpu-shared: "2"
# dev workload (default): nvidia.com/gpu-shared: "1"
```

CastSlice exposes Prometheus metrics on `:8080/metrics` via the standard controller-runtime metrics server.
| Metric | Type | Description |
|---|---|---|
| `castslice_requests_total` | Counter | Total admission requests processed |
| `castslice_mutations_total` | Counter | Pods mutated with GPU slice rewrites |
| `castslice_noop_total` | Counter | Pods allowed without mutation |
| `castslice_errors_total` | Counter | Requests rejected with an error |
```bash
# Port-forward the metrics service
kubectl port-forward svc/cast-slice-metrics 8080:8080 -n cast-slice

# Scrape metrics
curl http://localhost:8080/metrics | grep castslice
```

Example output:
```
# HELP castslice_errors_total Total number of admission requests rejected with an error.
# TYPE castslice_errors_total counter
castslice_errors_total 0
# HELP castslice_mutations_total Total number of Pods mutated with GPU slice rewrites.
# TYPE castslice_mutations_total counter
castslice_mutations_total 42
# HELP castslice_noop_total Total number of Pods allowed without mutation (no annotation or no GPU limits).
# TYPE castslice_noop_total counter
castslice_noop_total 158
# HELP castslice_requests_total Total number of admission requests processed by the CastSlice webhook.
# TYPE castslice_requests_total counter
castslice_requests_total 200
```
A ready-to-use Grafana dashboard is included at config/monitoring/grafana-dashboard.yaml. It provides 5 panels:
- Webhook Request Rate — overall admission throughput
- GPU Slice Mutations Rate — how many Pods per second get GPU sharing enabled
- No-op Rate — Pods passing through unchanged
- Error Rate — invalid annotation rejections (alert if non-zero)
- Mutation Efficiency — fraction of requests resulting in a GPU slice (higher = more GPU sharing)
Import via kubectl (auto-loads if kube-prometheus-stack sidecar dashboards are enabled):
```bash
kubectl apply -f config/monitoring/grafana-dashboard.yaml
```

Manual import: Grafana → Dashboards → Import → paste the JSON from the ConfigMap's `castslice-finops-dashboard.json` key.
The `cast-slice-metrics` Service is deployed with standard Prometheus annotations (`prometheus.io/scrape: "true"`), so annotation-based Prometheus auto-discovery picks it up automatically. No additional scrape config is required for most setups.
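For clusters where annotation-based discovery is not already wired up, a minimal scrape job along these lines will pick the Service up (a sketch; the job name and any port/path relabeling are assumptions to adapt to your setup):

```yaml
scrape_configs:
  - job_name: castslice
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints whose Service carries prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```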
```
cast-slice/
├── main.go                           # Manager + Webhook registration
├── TODOS.md                          # Planned improvements and deferred work
├── internal/
│   └── webhook/
│       ├── pod_webhook.go            # Mutating webhook handler
│       ├── metrics.go                # Prometheus counter definitions
│       └── pod_webhook_test.go       # Unit tests
├── config/
│   ├── deploy/deployment.yaml        # Namespace, SA, Deployment, Services (webhook + metrics)
│   ├── webhook/mutating_webhook.yaml # MutatingWebhookConfiguration
│   └── monitoring/
│       └── grafana-dashboard.yaml    # FinOps Grafana dashboard ConfigMap
└── docs/
    ├── local-testing.md              # How to test without a real GPU
    ├── node-mock.yaml                # Mock node labels
    └── test-pod.yaml                 # Test Pod that triggers the webhook
```
```bash
# Build the binary
go build -o cast-slice .

# Run unit tests
go test ./...
```

```bash
# Apply workload manifests
kubectl apply -f config/deploy/deployment.yaml
kubectl apply -f config/webhook/mutating_webhook.yaml
```

See `docs/local-testing.md` for a step-by-step guide on mocking GPU capacity and validating webhook behavior.
- v0.1.0: Basic Mutating Webhook (Static Slicing).
- v0.2.0: Smart Slicing (Dynamic ratios based on workload type).
- v0.3.0: FinOps Dashboard (Live GPU utilization metrics).
- v0.4.0: Policy Engine (Namespace-level and label-based rules).
- v0.5.0: Multi-GPU Support (Cross-node GPU sharing).
We're looking for FinOps-minded engineers to help optimize GPU infrastructure for the AI era.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Built by CastOps - Engineering the Future of AI Infrastructure.