Practical AI infrastructure and orchestration lab built with Rust, Slurm, K3s, llama.cpp, RAG, and LoRA workflows on ARM64 hardware.
This repository combines:
- Rust-based AI orchestration
- RAG pipelines
- MCP and HTTP tools
- llama.cpp inference
- Slurm batch scheduling
- K3s orchestration
- NFS shared storage
- LoRA fine-tuning workflows
- GPU-backed inference on edge hardware
The goal is not to build a polished SaaS product.
The goal is to understand how modern AI infrastructure layers fit together in practice using real heterogeneous hardware.
This repository serves as a hands-on engineering lab for:
- AI infrastructure
- distributed systems
- GPU inference
- retrieval systems
- orchestration
- batch scheduling
- model specialization
- OpenAI-compatible serving
The platform is intentionally built incrementally and kept observable.
Main design goals:
- swappable model backends
- explicit infrastructure layers
- reproducible workflows
- minimal hidden magic
- practical deployment experience
| Device | Role |
|---|---|
| Laptop | development + builds + control |
| Raspberry Pi 4 | Slurm controller + K3s control-plane |
| Jetson Orin Nano | GPU worker + inference + NFS server |
Main external API:
http://100.109.72.92:30080
Internal inference services:
http://llama-cpp:8000
http://llama-cpp-lora:8003
http://llama-cpp-embed:8001
Docker registry:
192.168.178.103:5000
Shared NFS root:
/home/roman/nfs
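The llama.cpp runtimes speak an OpenAI-compatible HTTP API, so any standard client can talk to them directly. A minimal sketch against the in-cluster service name, assuming the usual `/v1/chat/completions` route and the `reqwest`/`serde_json` crates (the model field is illustrative):

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Internal K3s service name; from outside the cluster, go through the NodePort API instead.
    let url = "http://llama-cpp:8000/v1/chat/completions";

    let body = json!({
        "model": "default",   // llama.cpp servers generally ignore or echo this field
        "messages": [
            { "role": "user", "content": "What is Slurm?" }
        ],
        "max_tokens": 128
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post(url)
        .json(&body)
        .send()?
        .json()?;

    // The reply text sits at the usual OpenAI-style location.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```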
```
              Client
                 │
                 ▼
       ┌───────────────────┐
       │ ai-platform-host  │
       │ Rust orchestrator │
       └─────────┬─────────┘
                 │
     ┌───────────┼─────────────┐
     ▼           ▼             ▼
┌─────────┐ ┌──────────┐ ┌───────────┐
│   RAG   │ │  Tools   │ │ llama.cpp │
│Retrieval│ │ MCP/HTTP │ │ runtimes  │
└────┬────┘ └────┬─────┘ └─────┬─────┘
     │           │             │
     └───────────┼─────────────┘
                 ▼
        ┌─────────────────┐
        │   Shared NFS    │
        │ /home/roman/nfs │
        └────────┬────────┘
                 │
                 ▼
        ┌─────────────────┐
        │   Slurm + GPU   │
        │   batch jobs    │
        └─────────────────┘
```
| Service | Purpose |
|---|---|
| llama-cpp | base inference |
| llama-cpp-lora | LoRA behavior specialization |
| llama-cpp-embed | embeddings |
| ai-platform-host | orchestration layer |
```
Client
  ↓
Rust host
  ↓
retrieval / tools / routing
  ↓
llama.cpp runtime
  ↓
response
```
Responsibilities:
- HTTP API
- orchestration
- retrieval
- tool execution
- backend selection
- response routing
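A stripped-down sketch of that flow, with purely illustrative types and helper names (not the actual ai-platform-host API), showing how the profile and response mode could steer retrieval and backend selection:

```rust
// Sketch of the request path: retrieve context in agent mode, pick a
// backend from the profile, then generate. All names here are illustrative.

struct ChatRequest {
    message: String,
    llm_profile: String,   // "default" | "lora"
    response_mode: String,  // "agent" | "direct"
}

fn backend_url(profile: &str) -> &'static str {
    match profile {
        "lora" => "http://llama-cpp-lora:8003",
        _ => "http://llama-cpp:8000",
    }
}

fn handle(req: &ChatRequest) -> String {
    let backend = backend_url(&req.llm_profile);

    let prompt = if req.response_mode == "agent" {
        // agent mode: ground the answer with retrieval (tool calls omitted here)
        let context = retrieve_context(&req.message);
        format!("Context:\n{context}\n\nQuestion: {}", req.message)
    } else {
        // direct mode: pass the message straight to the runtime
        req.message.clone()
    };

    generate(backend, &prompt)
}

// Placeholders standing in for the real retrieval and llama.cpp calls.
fn retrieve_context(_query: &str) -> String { String::from("...") }
fn generate(_backend: &str, prompt: &str) -> String { format!("answer to: {prompt}") }

fn main() {
    let req = ChatRequest {
        message: "What is Slurm?".into(),
        llm_profile: "default".into(),
        response_mode: "agent".into(),
    };
    println!("{}", handle(&req));
}
```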
```
Dataset / documents
  ↓
Slurm job
  ↓
Apptainer container
  ↓
artifacts / adapters
  ↓
NFS shared storage
```
Responsibilities:
- LoRA training
- RAG indexing
- artifact generation
- future automation jobs
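As a sketch of how such a job could be submitted programmatically, the snippet below shells out to `sbatch` from Rust; the job script path, name, and log location are illustrative, not the repository's actual scripts:

```rust
// Sketch: submit a LoRA training job to Slurm by invoking sbatch.
// Paths are placeholders pointing at the shared NFS layout used in this lab.
use std::process::Command;

fn main() -> std::io::Result<()> {
    let output = Command::new("sbatch")
        .arg("--job-name=lora-train")
        .arg("--output=/home/roman/nfs/logs/lora-train-%j.log") // %j = Slurm job id
        .arg("/home/roman/nfs/lora/jobs/train_lora.sbatch")     // hypothetical job script
        .output()?;

    if output.status.success() {
        // sbatch prints something like "Submitted batch job 42"
        println!("{}", String::from_utf8_lossy(&output.stdout).trim());
    } else {
        eprintln!("{}", String::from_utf8_lossy(&output.stderr));
    }
    Ok(())
}
```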
```
workspace/
├── host/            # Rust orchestration layer
├── tools_server/    # MCP / HTTP tools
├── llm_client/      # LLM abstraction layer
├── shared_types/    # shared models
├── indexer/         # RAG indexing
├── knowledge_base/  # markdown KB
├── artifacts/       # generated RAG artifacts
├── lora/            # LoRA datasets + training
└── infra/           # Slurm + K3s + NFS
```
- Raspberry Pi → controller
- Jetson → worker
- CPU + GPU scheduling
- Apptainer integration
- Jetson → NFS server
- Raspberry Pi + K3s → clients
Shared root:
/home/roman/nfs
Used for:
- models
- datasets
- LoRA adapters
- RAG artifacts
- logs
- Apptainer images
| Node | Role |
|---|---|
| Raspberry Pi | control-plane |
| Jetson | GPU worker |
Current workloads:
- ai-platform-host
- llama.cpp runtimes
- embedding services
- warmup jobs
```
/home/roman/nfs/
├── models/
│   ├── gguf/
│   └── huggingface/
├── rag/
│   ├── knowledge_base/
│   ├── artifacts/
│   └── images/
├── lora/
│   ├── datasets/
│   ├── adapters/
│   ├── images/
│   └── jobs/
└── logs/
```
Current RAG implementation intentionally avoids external vector databases.
Artifacts are stored as JSON:
```
artifacts/rag/
├── chunks.json
├── embeddings.json
└── manifest.json
```
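As an illustration of how thin this storage layer is, the artifacts can be modeled as plain `serde` structs; the field names below are assumptions for illustration, not the indexer's actual schema:

```rust
// Rough shape of the JSON artifacts as serde structs (field names assumed).
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Chunk {
    id: usize,
    source: String, // originating markdown file in the knowledge base
    text: String,
}

#[derive(Serialize, Deserialize)]
struct EmbeddedChunk {
    id: usize,
    embedding: Vec<f32>, // vector produced by llama-cpp-embed
}

#[derive(Serialize, Deserialize)]
struct Manifest {
    embedding_model: String,
    chunk_count: usize,
    created_at: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let chunks: Vec<Chunk> =
        serde_json::from_str(&std::fs::read_to_string("artifacts/rag/chunks.json")?)?;
    println!("loaded {} chunks", chunks.len());
    Ok(())
}
```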
Current retrieval flow:
```
User query
  ↓
embeddings
  ↓
similarity search
  ↓
context injection
  ↓
generation
```
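The similarity-search step needs nothing beyond cosine similarity over the stored vectors. A self-contained sketch (query embedding generation omitted):

```rust
// Score a query embedding against every stored chunk embedding with cosine
// similarity and return the indices of the top-k matches for context injection.

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn top_k(query: &[f32], embeddings: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine(query, e)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let query = vec![0.1, 0.9, 0.0];
    let embeddings = vec![vec![0.1, 0.8, 0.1], vec![0.9, 0.0, 0.1]];
    println!("{:?}", top_k(&query, &embeddings, 1)); // -> [0]
}
```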
Design goals:
- debuggable
- observable
- easy to rebuild
- minimal dependencies
Current LoRA pipeline supports:
- CPU training
- GPU training
- Slurm scheduling
- Apptainer execution
- GGUF conversion
- llama.cpp runtime loading
Flow:
```
Dataset
  ↓
Trainer
  ↓
LoRA adapter
  ↓
GGUF conversion
  ↓
llama.cpp runtime
```
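The last step amounts to starting a llama.cpp server with the converted adapter applied on top of the base model. A sketch that launches `llama-server` from Rust, assuming its standard `--model`/`--lora` flags; the GGUF file names are placeholders:

```rust
// Sketch: start a llama.cpp server with a GGUF LoRA adapter loaded.
// Paths follow the shared NFS layout above but the file names are hypothetical.
use std::process::Command;

fn main() -> std::io::Result<()> {
    let status = Command::new("llama-server")
        .args([
            "--model", "/home/roman/nfs/models/gguf/base-model.gguf",       // hypothetical base model
            "--lora",  "/home/roman/nfs/lora/adapters/style-adapter.gguf",  // hypothetical adapter
            "--host",  "0.0.0.0",
            "--port",  "8003",
        ])
        .status()?;

    println!("llama-server exited with {status}");
    Ok(())
}
```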
Current interpretation of LoRA:
- style specialization
- formatting behavior
- future tool-call discipline
NOT:
- primary factual memory
- replacement for retrieval
RAG still provides factual grounding.
The Rust orchestration layer supports multiple runtime profiles.
Current routing:
| Profile | Runtime |
|---|---|
| default | llama-cpp |
| lora | llama-cpp-lora |
Current response modes:
| Mode | Behavior |
|---|---|
| agent | retrieval + tools |
| direct | direct generation |
Example:
```json
{
  "message": "What is Slurm?",
  "llm_profile": "lora",
  "response_mode": "direct"
}
```
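For illustration, the same request sent from Rust; the `/chat` path is an assumed route, so substitute whatever endpoint ai-platform-host actually exposes on the NodePort:

```rust
// Sketch: post the example request to the external API via reqwest.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::blocking::Client::new()
        .post("http://100.109.72.92:30080/chat") // hypothetical route on the NodePort service
        .json(&json!({
            "message": "What is Slurm?",
            "llm_profile": "lora",
            "response_mode": "direct"
        }))
        .send()?
        .text()?;
    println!("{resp}");
    Ok(())
}
```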
Implemented and working:

- GPU inference ✅
- GPU LoRA training ✅
- Slurm scheduling ✅
- Apptainer GPU execution ✅
- OpenAI-compatible APIs ✅
- Rust orchestration ✅
- K3s deployment ✅
- RAG retrieval ✅
- embedding generation ✅
- LoRA runtime routing ✅
- direct response mode ✅
- MCP tools ✅
The Jetson Orin Nano has limited VRAM/RAM, so the inference deployments are scaled down during GPU LoRA training:
```bash
kubectl scale deployment llama-cpp -n ai-platform --replicas=0
kubectl scale deployment llama-cpp-lora -n ai-platform --replicas=0
kubectl scale deployment llama-cpp-embed -n ai-platform --replicas=0
```

After training:

```bash
kubectl scale deployment llama-cpp -n ai-platform --replicas=1
kubectl scale deployment llama-cpp-lora -n ai-platform --replicas=1
kubectl scale deployment llama-cpp-embed -n ai-platform --replicas=1
```

Design principles:

- separate online/offline workloads
- keep systems observable
- prefer explicit infrastructure
- avoid unnecessary abstraction
- use standard APIs
- build incrementally
- optimize later
Possible future directions:
- automated Slurm workflows
- structured tool-call tuning
- monitoring/metrics
- vector database integration
- distributed training experiments
- lightweight UI
- multi-model routing
This repository now demonstrates a practical miniature AI platform with:
- distributed compute
- shared storage
- K3s orchestration
- Slurm scheduling
- GPU inference
- GPU fine-tuning
- RAG pipelines
- LoRA specialization
- Rust orchestration
- OpenAI-compatible serving
built incrementally on real ARM hardware infrastructure.