llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends like vLLM to bring cutting-edge research to the cloud.
- User Friendly: you can quickly deploy an LLM service with minimal configuration.
- High Performance: llmaz integrates with vLLM by default for high-performance inference; support for other backends is on the way.
- Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster Autoscaler or Karpenter to support elastic scenarios.
- Accelerator Fungibility (WIP): llmaz supports serving the same LLM with various accelerators to optimize cost and performance (see the sketch after this list).
- SOTA Inference (WIP): llmaz supports the latest research, like Speculative Decoding or Splitwise, on Kubernetes.
- Multi-host Support: llmaz supports both single-host and multi-host scenarios with LWS from day 1.
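
Although accelerator fungibility is still in progress, the intended shape is already visible in the Model API: the inferenceFlavors list (shown in the quick start below) can declare several candidate accelerators for one model. A minimal sketch, assuming the list accepts multiple entries; the second flavor here is purely illustrative:

```yaml
apiVersion: llmaz.io/v1alpha1
kind: Model
metadata:
  name: opt-125m
spec:
  familyName: opt
  dataSource:
    modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4   # preferred GPU type
    requests:
      nvidia.com/gpu: 1
  - name: a100 # illustrative second flavor, an assumption for this sketch
    requests:
      nvidia.com/gpu: 1
```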
Read the Installation for guidance.
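
For orientation, a typical install is sketched below. This assumes llmaz ships a Helm chart; the repo URL, chart name, and namespace here are assumptions, so treat the Installation guide as authoritative:

```bash
# Assumed Helm chart location -- verify against the Installation guide.
helm repo add inftyai https://inftyai.github.io/llmaz
helm repo update
helm install llmaz inftyai/llmaz --namespace llmaz-system --create-namespace
```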
Once Models (e.g. facebook/opt-125m) are published, you can quickly deploy a Playground to serve the model. First, declare the Model:
```yaml
apiVersion: llmaz.io/v1alpha1
kind: Model
metadata:
  name: opt-125m
spec:
  familyName: opt
  dataSource:
    modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1
```
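
Then deploy a Playground that claims the model by name: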
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
```
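
Apply both manifests and wait for the backing pod (opt-125m-0, as used below) to become ready. A minimal sketch using only standard kubectl; the manifest file names are illustrative:

```bash
kubectl apply -f model.yaml -f playground.yaml  # file names are illustrative
kubectl get pod opt-125m-0 -w                   # wait until the pod reports Ready
```

Once the pod is ready, port-forward the service and query it: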
```bash
# Expose the inference service locally (keep this running;
# issue the curl requests below from a second terminal).
kubectl port-forward pod/opt-125m-0 8080:8080

# List the models being served.
curl http://localhost:8080/v1/models

# Request a completion.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
```
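
These endpoints follow the OpenAI API schema, so existing OpenAI-compatible clients should work by pointing their base URL at the forwarded port.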
Refer to examples to learn more.
- Gateway support for traffic routing
- Serverless support for cloud-agnostic users
- CLI tool support
- Model training and fine-tuning in the long term
🚀 All kinds of contributions are welcome! Please follow Contributing.
🎉 Thanks to all these contributors.