llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends like vLLM to bring cutting-edge research to the cloud.
- User Friendly: you can quickly deploy an LLM service with minimal configuration.
- High Performance: llmaz integrates with vLLM by default for high-performance inference; support for other backends is on the way.
- Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster Autoscaler or Karpenter to support elastic scenarios.
- Accelerator Fungibility (WIP): llmaz supports serving the same LLM with various accelerators to optimize cost and performance (see the sketch after this list).
- SOTA Inference (WIP): llmaz supports the latest research, like Speculative Decoding or Splitwise, on Kubernetes.
- Multi-host Support: llmaz supports both single-host and multi-host scenarios with LWS from day 1.
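
Although accelerator fungibility is still in progress, the intended shape is already visible in the Model API: the inferenceFlavors list (shown in the quick start below) can declare several candidate accelerators for one model. A minimal sketch, assuming the list accepts multiple entries; the second flavor here is purely illustrative:

```yaml
apiVersion: llmaz.io/v1alpha1
kind: Model
metadata:
  name: opt-125m
spec:
  familyName: opt
  dataSource:
    modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4   # preferred GPU type
    requests:
      nvidia.com/gpu: 1
  - name: a100 # illustrative second flavor, an assumption for this sketch
    requests:
      nvidia.com/gpu: 1
```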
Read the Installation for guidance.
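
For orientation, a typical install is sketched below. This assumes llmaz ships a Helm chart; the repo URL, chart name, and namespace here are assumptions, so treat the Installation guide as authoritative:

```bash
# Assumed Helm chart location -- verify against the Installation guide.
helm repo add inftyai https://inftyai.github.io/llmaz
helm repo update
helm install llmaz inftyai/llmaz --namespace llmaz-system --create-namespace
```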
Once Models (e.g. facebook/opt-125m) are published, you can quickly deploy a Playground to serve the model. First, declare the Model:
```yaml
apiVersion: llmaz.io/v1alpha1
kind: Model
metadata:
  name: opt-125m
spec:
  familyName: opt
  dataSource:
    modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1
```
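
Then deploy a Playground that claims the model by name: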
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
```
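
Apply both manifests and wait for the backing pod (opt-125m-0, as used below) to become ready. A minimal sketch using only standard kubectl; the manifest file names are illustrative:

```bash
kubectl apply -f model.yaml -f playground.yaml  # file names are illustrative
kubectl get pod opt-125m-0 -w                   # wait until the pod reports Ready
```

Once the pod is ready, port-forward the service and query it: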
```bash
# Expose the inference service locally (keep this running;
# issue the curl requests below from a second terminal).
kubectl port-forward pod/opt-125m-0 8080:8080

# List the models being served.
curl http://localhost:8080/v1/models

# Request a completion.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
```
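
These endpoints follow the OpenAI API schema, so existing OpenAI-compatible clients should work by pointing their base URL at the forwarded port.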
Refer to examples to learn more.
- Gateway support for traffic routing
- Serverless support for cloud-agnostic users
- CLI tool support
- Model training and fine-tuning in the long term
🚀 All kinds of contributions are welcome! Please follow Contributing.
🎉 Thanks to all these contributors.