🧠 AI Platform Lab

Practical AI infrastructure and orchestration lab built with Rust, Slurm, K3s, llama.cpp, RAG, and LoRA workflows on ARM64 hardware.

This repository combines:

  • Rust-based AI orchestration
  • RAG pipelines
  • MCP and HTTP tools
  • llama.cpp inference
  • Slurm batch scheduling
  • K3s orchestration
  • NFS shared storage
  • LoRA fine-tuning workflows
  • GPU-backed inference on edge hardware

The goal is not to build a polished SaaS product, but to understand how modern AI infrastructure layers fit together in practice on real heterogeneous hardware.


🎯 Project Goals

This repository serves as a hands-on engineering lab for:

  • AI infrastructure
  • distributed systems
  • GPU inference
  • retrieval systems
  • orchestration
  • batch scheduling
  • model specialization
  • OpenAI-compatible serving

The platform is intentionally built incrementally and kept observable.

Main design goals:

  • swappable model backends
  • explicit infrastructure layers
  • reproducible workflows
  • minimal hidden magic
  • practical deployment experience

🖥 Hardware Layout

Device            Role
Laptop            development + builds + control
Raspberry Pi 4    Slurm controller + K3s control-plane
Jetson Orin Nano  GPU worker + inference + NFS server

🌐 Runtime Endpoints

Main external API:

http://100.109.72.92:30080

Internal inference services:

http://llama-cpp:8000
http://llama-cpp-lora:8003
http://llama-cpp-embed:8001
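
From inside the cluster, these runtimes can be exercised directly. A minimal smoke test, assuming the llama.cpp servers expose their standard OpenAI-compatible routes (the model name is illustrative):

curl http://llama-cpp:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'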

Docker registry:

192.168.178.103:5000
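
Images are built on the laptop and pushed here before K3s pulls them. A hedged sketch (the image name is hypothetical, and the Docker daemon must trust this plain-HTTP registry via insecure-registries):

# build for the ARM64 nodes and push to the local registry
docker buildx build --platform linux/arm64 \
  -t 192.168.178.103:5000/ai-platform-host:latest --push .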

Shared NFS root:

/home/roman/nfs

🧭 High-Level Architecture

                Client
                  │
                  ▼
        ┌──────────────────┐
        │ ai-platform-host │
        │ Rust orchestrator│
        └────────┬─────────┘
                 │
     ┌───────────┼────────────┐
     ▼           ▼            ▼
┌──────────┐ ┌──────────┐ ┌───────────┐
│ RAG      │ │ Tools    │ │ llama.cpp │
│ Retrieval│ │ MCP/HTTP │ │ runtimes  │
└────┬─────┘ └────┬─────┘ └─────┬─────┘
     │            │             │
     └────────────┼─────────────┘
                  ▼
         ┌────────────────┐
         │ Shared NFS     │
         │ /home/roman/nfs│
         └───────┬────────┘
                 ▼
         ┌────────────────┐
         │ Slurm + GPU    │
         │ batch jobs     │
         └────────────────┘

🔀 Current Runtime Layout

Service           Purpose
llama-cpp         base inference
llama-cpp-lora    LoRA behavior specialization
llama-cpp-embed   embeddings
ai-platform-host  orchestration layer

🧠 Online vs Offline Split

🟢 Online Serving Plane

Client
  ↓
Rust host
  ↓
retrieval / tools / routing
  ↓
llama.cpp runtime
  ↓
response

Responsibilities:

  • HTTP API
  • orchestration
  • retrieval
  • tool execution
  • backend selection
  • response routing

🔵 Offline Batch Plane

Dataset / documents
  ↓
Slurm job
  ↓
Apptainer container
  ↓
artifacts / adapters
  ↓
NFS shared storage

Responsibilities:

  • LoRA training
  • RAG indexing
  • artifact generation
  • future automation jobs
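
Day-to-day interaction with this plane uses standard Slurm commands; the batch script name below is hypothetical:

sbatch lora_train.sbatch          # submit a training job
squeue -u $USER                   # watch queued/running jobs
sacct -u $USER --starttime today  # review completed jobs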

📦 Workspace Structure

workspace/
├── host/              # Rust orchestration layer
├── tools_server/      # MCP / HTTP tools
├── llm_client/        # LLM abstraction layer
├── shared_types/      # shared models
├── indexer/           # RAG indexing
├── knowledge_base/    # markdown KB
├── artifacts/         # generated RAG artifacts
├── lora/              # LoRA datasets + training
└── infra/             # Slurm + K3s + NFS

🧩 Current Infrastructure State

Slurm

  • Raspberry Pi → controller
  • Jetson → worker
  • CPU + GPU scheduling
  • Apptainer integration

NFS

  • Jetson → NFS server
  • Raspberry Pi + K3s nodes → clients

Shared root:

/home/roman/nfs

Used for:

  • models
  • datasets
  • LoRA adapters
  • RAG artifacts
  • logs
  • Apptainer images
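
Clients attach the share with a plain NFS mount; a sketch using a placeholder for the Jetson's address:

# one-off mount (or add the equivalent line to /etc/fstab)
sudo mount -t nfs <jetson-ip>:/home/roman/nfs /home/roman/nfs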

K3s

Node Role
Raspberry Pi control-plane
Jetson GPU worker

Current workloads:

  • ai-platform-host
  • llama.cpp runtimes
  • embedding services
  • warmup jobs
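
Workload state can be checked with ordinary kubectl queries against the ai-platform namespace (the same one used by the scaling commands under Operational Notes):

kubectl get pods -n ai-platform -o wide   # pod placement across nodes
kubectl get deploy -n ai-platform         # current replica counts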

📦 Storage Layout

/home/roman/nfs/
├── models/
│   ├── gguf/
│   └── huggingface/
├── rag/
│   ├── knowledge_base/
│   ├── artifacts/
│   └── images/
├── lora/
│   ├── datasets/
│   ├── adapters/
│   ├── images/
│   └── jobs/
└── logs/

🧠 RAG Architecture

The current RAG implementation intentionally avoids external vector databases.

Artifacts are stored as JSON:

artifacts/rag/
├── chunks.json
├── embeddings.json
└── manifest.json

Current retrieval flow:

User query
  ↓
embeddings
  ↓
similarity search
  ↓
context injection
  ↓
generation
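
The embeddings step maps onto the llama-cpp-embed runtime; a hedged example, assuming it runs llama.cpp's server with embeddings enabled:

curl http://llama-cpp-embed:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What is Slurm?"}'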

Design goals:

  • debuggable
  • observable
  • easy to rebuild
  • minimal dependencies

🔥 LoRA Workflow

The current LoRA pipeline supports:

  • CPU training
  • GPU training
  • Slurm scheduling
  • Apptainer execution
  • GGUF conversion
  • llama.cpp runtime loading

Flow:

Dataset
  ↓
Trainer
  ↓
LoRA adapter
  ↓
GGUF conversion
  ↓
llama.cpp runtime
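
A sketch of that flow as a Slurm batch script. The trainer invocation and all file names under /home/roman/nfs/lora are hypothetical; --nv passes the Jetson GPU into the Apptainer container, and convert_lora_to_gguf.py ships with llama.cpp:

#!/bin/bash
#SBATCH --job-name=lora-train
#SBATCH --gres=gpu:1
#SBATCH --output=/home/roman/nfs/logs/%x-%j.out

# train the adapter inside the Apptainer image stored on NFS
apptainer exec --nv /home/roman/nfs/lora/images/trainer.sif \
  python train.py \
    --dataset /home/roman/nfs/lora/datasets/style.jsonl \
    --output  /home/roman/nfs/lora/adapters/style

# convert the adapter to GGUF so the llama.cpp runtime can load it
# (a --base pointing at the base model may also be required)
apptainer exec /home/roman/nfs/lora/images/trainer.sif \
  python convert_lora_to_gguf.py /home/roman/nfs/lora/adapters/style \
    --outfile /home/roman/nfs/lora/adapters/style.gguf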

Current interpretation of LoRA:

  • style specialization
  • formatting behavior
  • future tool-call discipline

NOT:

  • primary factual memory
  • replacement for retrieval

RAG still provides factual grounding.


🔀 Runtime Routing

The Rust orchestration layer supports multiple runtime profiles.

Current routing:

Profile  Runtime
default  llama-cpp
lora     llama-cpp-lora

Current response modes:

Mode    Behavior
agent   retrieval + tools
direct  direct generation

Example request body:

{
  "message": "What is Slurm?",
  "llm_profile": "lora",
  "response_mode": "direct"
}
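
Sent to the external endpoint, that body looks roughly like this (the /chat path is an assumption; the real route is defined by the Rust host):

curl http://100.109.72.92:30080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is Slurm?", "llm_profile": "lora", "response_mode": "direct"}'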

🧪 Current Working Features

Implemented and working:

  • GPU inference ✔
  • GPU LoRA training ✔
  • Slurm scheduling ✔
  • Apptainer GPU execution ✔
  • OpenAI-compatible APIs ✔
  • Rust orchestration ✔
  • K3s deployment ✔
  • RAG retrieval ✔
  • embedding generation ✔
  • LoRA runtime routing ✔
  • direct response mode ✔
  • MCP tools ✔

⚠️ Operational Notes

The Jetson Orin Nano has limited memory, shared between the CPU and GPU.

Scale the inference services down before starting GPU LoRA training:

kubectl scale deployment llama-cpp -n ai-platform --replicas=0
kubectl scale deployment llama-cpp-lora -n ai-platform --replicas=0
kubectl scale deployment llama-cpp-embed -n ai-platform --replicas=0

After training, scale them back up:

kubectl scale deployment llama-cpp -n ai-platform --replicas=1
kubectl scale deployment llama-cpp-lora -n ai-platform --replicas=1
kubectl scale deployment llama-cpp-embed -n ai-platform --replicas=1
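
Since the three deployments are always scaled together, a small loop makes this less error-prone:

for d in llama-cpp llama-cpp-lora llama-cpp-embed; do
  kubectl scale deployment "$d" -n ai-platform --replicas=1   # 0 before training
done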

🧭 Design Principles

  • separate online/offline workloads
  • keep systems observable
  • prefer explicit infrastructure
  • avoid unnecessary abstraction
  • use standard APIs
  • build incrementally
  • optimize later

🗺 Roadmap

Possible future directions:

  • automated Slurm workflows
  • structured tool-call tuning
  • monitoring/metrics
  • vector database integration
  • distributed training experiments
  • lightweight UI
  • multi-model routing

📌 Summary

This repository now demonstrates a practical miniature AI platform with:

  • distributed compute
  • shared storage
  • K3s orchestration
  • Slurm scheduling
  • GPU inference
  • GPU fine-tuning
  • RAG pipelines
  • LoRA specialization
  • Rust orchestration
  • OpenAI-compatible serving

built incrementally on real ARM64 hardware.
