CastSlice

"Stop burning GPU dollars. Start slicing."



🔴 The Problem in 2026

In the era of ubiquitous AI, GPU scarcity is no longer the only bottleneck; GPU waste is. Most development, CI/CD, and inference workloads request a full NVIDIA GPU but use less than 15% of its capacity.

  • Cloud Bills: You pay for 100% of a GPU while your workloads use a fraction.
  • Scheduling Bottlenecks: Pending Pods waiting for a "Full GPU" while existing GPUs sit idle.
  • Developer Friction: Teams manually editing YAMLs to share resources.

🟢 The CastSlice Solution

CastSlice is a lightweight, non-invasive Kubernetes Mutating Webhook that automatically converts "Whole GPU" requests into "Fractional/Shared GPU" slices based on smart policy.

It sits in your K8s Control Plane, intercepts Pod creation, and performs on-the-fly resource transformation—without changing a single line of your application code.


✨ Key Features

| Feature | The "Old" Way | The CastSlice Way |
| --- | --- | --- |
| Cost | Full GPU per Pod | Shared GPU across multiple Pods |
| Concurrency | 1 Pod per GPU | Multiple Pods per GPU |
| Developer UX | Manual YAML changes | Zero-touch. Just add an annotation. |
| Vendor Lock-in | Locked to specific CSP tools | Cloud agnostic. Works on EKS, GKE, AKS, or on-prem. |

🛠 How It Works

CastSlice transparently rewrites `nvidia.com/gpu` resource requests into `nvidia.com/gpu-shared` resource requests for Pods that opt in via an annotation.

```
Pod CREATE request
       │
       ▼
 Kubernetes API server
       │ (forwards to webhook)
       ▼
 CastSlice webhook
       │
       ├── castops.io/optimize: "true" annotation present?
       │        │ YES                       │ NO
       │        ▼                           ▼
       │  resolve slice ratio           allow unchanged
       │  (slice-ratio > workload-type > default: 1)
       │        │
       │  remove nvidia.com/gpu
       │  add    nvidia.com/gpu-shared: <ratio>
       │        │
       ▼        ▼
 JSON Patch returned → Pod scheduled with shared GPU
```

Annotations:

| Annotation | Value | Effect |
| --- | --- | --- |
| `castops.io/optimize` | `"true"` | Enable GPU slice optimization (required) |
| `castops.io/workload-type` | `training` / `inference` / `batch` / `dev` | Select a preset slice ratio |
| `castops.io/slice-ratio` | `"N"` (positive integer) | Override the slice count directly |

Preset ratios by workload type:

| Workload Type | GPU Slices | Use Case |
| --- | --- | --- |
| `training` | 4 | Model training jobs (higher GPU share) |
| `inference` | 2 | Serving / Triton inference servers |
| `batch` | 2 | Batch preprocessing and feature extraction |
| `dev` | 1 | Development and debugging (default) |
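The precedence described above (explicit `slice-ratio`, then `workload-type` preset, then the `dev` default of 1) can be sketched in Go. The function and map names are illustrative, not the webhook's actual code:

```go
package main

import (
	"fmt"
	"strconv"
)

// Preset slice ratios per workload type, mirroring the table above.
var presetRatios = map[string]int{
	"training":  4,
	"inference": 2,
	"batch":     2,
	"dev":       1,
}

// resolveSliceRatio applies the precedence rule:
// explicit slice-ratio > workload-type preset > default (1).
func resolveSliceRatio(annotations map[string]string) int {
	if raw, ok := annotations["castops.io/slice-ratio"]; ok {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 {
			return n // valid explicit override wins
		}
	}
	if wt, ok := annotations["castops.io/workload-type"]; ok {
		if ratio, found := presetRatios[wt]; found {
			return ratio
		}
	}
	return 1 // default: dev-style single slice
}

func main() {
	fmt.Println(resolveSliceRatio(map[string]string{"castops.io/workload-type": "inference"})) // 2
	fmt.Println(resolveSliceRatio(map[string]string{"castops.io/slice-ratio": "8"}))          // 8
	fmt.Println(resolveSliceRatio(nil))                                                       // 1
}
```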

🚀 Quick Start

Prerequisites

  • Go 1.24+
  • A Kubernetes cluster (KinD / Minikube for local testing)
  • cert-manager for TLS certificate injection

1. Install CastSlice

```bash
# Install cert-manager (if not already present)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

# Wait for cert-manager to be ready
kubectl rollout status deployment/cert-manager -n cert-manager

# Deploy CastSlice
kubectl apply -f https://github.com/castops/cast-slice/releases/latest/download/install.yaml

# Create the TLS certificate for the webhook (issued by cert-manager)
kubectl apply -f config/cert/certificate.yaml

# Wait for the webhook pod to be ready
kubectl rollout status deployment/cast-slice -n cast-slice
```

2. Deploy an Optimized Workload

Add the `castops.io/optimize` annotation and optionally specify the workload type:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  template:
    metadata:
      annotations:
        castops.io/optimize: "true"
        castops.io/workload-type: "inference"  # → gpu-shared: 2
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        resources:
          limits:
            nvidia.com/gpu: 1  # CastSlice rewrites this based on workload type
```

For fine-grained control, use an explicit ratio:

```yaml
annotations:
  castops.io/optimize: "true"
  castops.io/slice-ratio: "8"   # explicit override → gpu-shared: 8
```

3. Verify It's Working

```bash
# Check the mutated pod
kubectl get pod -o yaml | grep gpu-shared
# training workload: nvidia.com/gpu-shared: "4"
# inference workload: nvidia.com/gpu-shared: "2"
# dev workload (default): nvidia.com/gpu-shared: "1"
```

📊 Metrics & Monitoring (v0.3.0)

CastSlice exposes Prometheus metrics on :8080/metrics via the standard controller-runtime metrics server.

Exposed Metrics

| Metric | Type | Description |
| --- | --- | --- |
| `castslice_requests_total` | Counter | Total admission requests processed |
| `castslice_mutations_total` | Counter | Pods mutated with GPU slice rewrites |
| `castslice_noop_total` | Counter | Pods allowed without mutation |
| `castslice_errors_total` | Counter | Requests rejected with an error |

Access the Metrics Endpoint

```bash
# Port-forward the metrics service
kubectl port-forward svc/cast-slice-metrics 8080:8080 -n cast-slice

# Scrape metrics
curl http://localhost:8080/metrics | grep castslice
```

Example output:

```
# HELP castslice_errors_total Total number of admission requests rejected with an error.
# TYPE castslice_errors_total counter
castslice_errors_total 0
# HELP castslice_mutations_total Total number of Pods mutated with GPU slice rewrites.
# TYPE castslice_mutations_total counter
castslice_mutations_total 42
# HELP castslice_noop_total Total number of Pods allowed without mutation (no annotation or no GPU limits).
# TYPE castslice_noop_total counter
castslice_noop_total 158
# HELP castslice_requests_total Total number of admission requests processed by the CastSlice webhook.
# TYPE castslice_requests_total counter
castslice_requests_total 200
```

Grafana Dashboard

A ready-to-use Grafana dashboard is included at `config/monitoring/grafana-dashboard.yaml`. It provides five panels:

  • Webhook Request Rate — overall admission throughput
  • GPU Slice Mutations Rate — how many Pods per second get GPU sharing enabled
  • No-op Rate — Pods passing through unchanged
  • Error Rate — invalid annotation rejections (alert if non-zero)
  • Mutation Efficiency — fraction of requests resulting in a GPU slice (higher = more GPU sharing)
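The Mutation Efficiency panel, for example, corresponds to a ratio of the counters above. A sketch of the PromQL (the dashboard's actual query may differ):

```
sum(rate(castslice_mutations_total[5m]))
  /
sum(rate(castslice_requests_total[5m]))
```

A value near 1 means almost every admission request results in a GPU slice; a low value suggests most Pods are not opting in.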

Import via kubectl (auto-loads if kube-prometheus-stack sidecar dashboards are enabled):

```bash
kubectl apply -f config/monitoring/grafana-dashboard.yaml
```

Manual import: Grafana → Dashboards → Import → paste the JSON from the ConfigMap's `castslice-finops-dashboard.json` key.

Prometheus Scrape Configuration

The `cast-slice-metrics` Service is deployed with the standard Prometheus annotation (`prometheus.io/scrape: "true"`), so annotation-based Prometheus service discovery picks it up automatically. No additional scrape config is required for most setups.
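For reference, the annotated Service looks roughly like this sketch; the `prometheus.io/port` annotation and the selector labels are assumptions based on the `:8080/metrics` endpoint described above, not the repository's exact manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cast-slice-metrics
  namespace: cast-slice
  annotations:
    prometheus.io/scrape: "true"   # stated above
    prometheus.io/port: "8080"     # assumption: matches the metrics port
spec:
  selector:
    app: cast-slice                # assumption: illustrative selector
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
```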


📁 Project Structure

```
cast-slice/
├── main.go                          # Manager + Webhook registration
├── TODOS.md                         # Planned improvements and deferred work
├── internal/
│   └── webhook/
│       ├── pod_webhook.go           # Mutating webhook handler
│       ├── metrics.go               # Prometheus counter definitions
│       └── pod_webhook_test.go      # Unit tests
├── config/
│   ├── deploy/deployment.yaml       # Namespace, SA, Deployment, Services (webhook + metrics)
│   ├── webhook/mutating_webhook.yaml# MutatingWebhookConfiguration
│   └── monitoring/
│       └── grafana-dashboard.yaml   # FinOps Grafana dashboard ConfigMap
└── docs/
    ├── local-testing.md             # How to test without a real GPU
    ├── node-mock.yaml               # Mock node labels
    └── test-pod.yaml                # Test Pod that triggers the webhook
```

🧪 Development

Build from Source

```bash
go build -o cast-slice .
```

Run Tests

```bash
go test ./...
```

Manual Deployment

```bash
# Apply workload manifests
kubectl apply -f config/deploy/deployment.yaml
kubectl apply -f config/webhook/mutating_webhook.yaml
```

Local Testing Without a GPU

See docs/local-testing.md for a step-by-step guide on mocking GPU capacity and validating webhook behavior.


🏗 Roadmap

  • v0.1.0: Basic Mutating Webhook (Static Slicing).
  • v0.2.0: Smart Slicing (Dynamic ratios based on workload type).
  • v0.3.0: FinOps Dashboard (Live GPU utilization metrics).
  • v0.4.0: Policy Engine (Namespace-level and label-based rules).
  • v0.5.0: Multi-GPU Support (Cross-node GPU sharing).

🤝 Contributing

We're looking for FinOps-minded engineers to help optimize GPU infrastructure for the AI era.

  1. Fork the Project
  2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
  3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
  4. Push to the Branch (`git push origin feature/AmazingFeature`)
  5. Open a Pull Request

📄 License

Distributed under the MIT License. See LICENSE for more information.


Built by CastOps - Engineering the Future of AI Infrastructure.
