ArKan

ArKan is a high-performance Rust implementation of Kolmogorov-Arnold Networks (KAN), optimized for latency-critical workloads (low-latency inference).

The library was built for integration into game solvers (for example, Poker AI / MCTS), which need to run thousands of single-sample inferences per second without the overhead typical of large ML frameworks.

Theory: What is KAN?

Classical multilayer perceptrons (MLPs) fix the activation functions on the nodes (neurons) and learn linear weights. Kolmogorov-Arnold Networks (KAN) do the opposite:

Nodes perform simple summation.
Edges (connections) carry learnable nonlinear activation functions.

Mathematical Model

The model is based on the Kolmogorov-Arnold representation theorem. For a layer with N_in inputs and N_out outputs, the transformation is:

x[l+1, j] = Σᵢ φ[l,j,i](x[l, i])      where i = 1..N_in

Here φ[l,j,i] is a learnable 1D function that connects the i-th neuron of the input layer to the j-th neuron of the output layer.

Implementation in ArKan (B-Splines)

In this library the φ functions are parameterized with B-splines (basis splines). This makes it possible to change the shape of an activation function locally while keeping it smooth.

Each edge function in ArKan has the form:

φ(x) = Σᵢ cᵢ · Bᵢ(x)      where i = 1..(G+p)

Bᵢ(x): B-spline basis functions.
cᵢ: learnable coefficients.
G: grid size.
p: spline order.

Key Features

Zero-Allocation Inference: The entire forward pass runs on a pre-allocated buffer (Workspace). No allocations in the hot path.
Zero-Allocation Training: The full training step (forward + backward + SGD/Adam) also runs without allocations once the Workspace is warmed up.
SIMD-Optimized B-Splines: B-spline basis evaluation is vectorized (AVX2/AVX-512 via the wide crate).
Cache-Friendly Layout: Weights are stored in [Output][Input][Basis] order for sequential memory access and fewer cache misses.
Standalone: Minimal dependencies (rayon, wide). Does not pull in torch or burn, which makes it well suited to embedding.
Quantization Ready: The architecture is prepared for quantized weights (baked models) for further acceleration.
GPU Acceleration (wgpu): Optional GPU backend with WGSL compute shaders for parallel forward/backward passes.

GPU Backend (Optional)

ArKan includes an optional GPU backend built on wgpu, providing acceleration through WebGPU/Vulkan/Metal/DX12.

Installation

[dependencies]
arkan = { version = "0.3.0", features = ["gpu"] }

Usage

use arkan::{KanConfig, KanNetwork};
use arkan::gpu::{WgpuBackend, WgpuOptions, GpuNetwork};
use arkan::optimizer::{Adam, AdamConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize the GPU backend
    let backend = WgpuBackend::init(WgpuOptions::default())?;
    println!("GPU: {}", backend.adapter_name());

    // Create the CPU network
    let config = KanConfig::preset();
    let mut cpu_network = KanNetwork::new(config.clone());

    // Create the GPU network from the CPU network
    let mut gpu_network = GpuNetwork::from_cpu(&backend, &cpu_network)?;
    let mut workspace = gpu_network.create_workspace(64)?;

    // Forward inference
    let input = vec![0.5f32; config.input_dim];
    let output = gpu_network.forward_single(&input, &mut workspace)?;

    // Training with the Adam optimizer
    let mut optimizer = Adam::new(&cpu_network, AdamConfig::with_lr(0.001));
    let target = vec![1.0f32; config.output_dim];

    let loss = gpu_network.train_step_mse(
        &input, &target, 1,
        &mut workspace, &mut optimizer, &mut cpu_network
    )?;

    println!("Loss: {}", loss);
    Ok(())
}

GPU Features

Feature	Status
Forward inference	✅
Forward training (saves activations)	✅
Backward pass	✅ (GPU shaders)
Adam/SGD optimizer	✅
Weight sync CPU↔GPU	✅
Multi-layer networks	✅
Batch processing	✅
train_step_with_options	✅
Gradient clipping	✅
Weight decay	✅

GPU Limitations (wgpu 0.23)

No DeviceLost propagation: wgpu 0.23 does not expose DeviceLost errors, so a GPU crash may show up as a hang rather than a proper error.
Memory limits: MAX_VRAM_ALLOC = 2GB per buffer. Exceeding this returns a BatchTooLarge error.
Vec4 alignment: Weights are padded to vec4 (4-element) boundaries for shader efficiency.
CPU fallback: If no GPU is available, backend initialization fails cleanly with an AdapterNotFound error, which lets the caller fall back to the CPU network.

Running GPU Tests and Benchmarks

# GPU parity tests
cargo test --features gpu --test gpu_parity -- --ignored

# GPU benchmarks (Windows PowerShell)
$env:ARKAN_GPU_BENCH="1"; cargo bench --bench gpu_forward --features gpu

# GPU benchmarks (Linux/macOS)
ARKAN_GPU_BENCH=1 cargo bench --bench gpu_forward --features gpu

GPU Performance vs PyTorch CUDA

Comparison of ArKan GPU (wgpu/Vulkan) with PyTorch KAN implementations on CUDA:

Implementation	Forward (batch=64)	Train (Adam)	Notes
fast-kan (CUDA)	0.58 ms	1.78 ms	RBF approximation (fastest)
ArKan (wgpu)	1.18 ms	3.04 ms	WebGPU/Vulkan, native training
efficient-kan (CUDA)	1.62 ms	3.70 ms	Native B-spline
ArKan-style (CUDA)	3.63 ms	N/A	Reference implementation

Conclusion: ArKan (wgpu) has 27% lower forward latency than efficient-kan (1.18 ms vs 1.62 ms) and 18% lower training-step latency (3.04 ms vs 3.70 ms). fast-kan, which uses an RBF approximation, remains about 2x faster on the forward pass.

Benchmarks (CPU)

Comparison of ArKan (Rust) with an optimized, vectorized PyTorch implementation (CPU).

Test Setup:

Config: Input 21, Output 24, Hidden [64, 64], Grid 5, Spline Order 3.
ArKan: cargo bench --bench forward (AVX2/Rayon enabled).
PyTorch: Optimized vectorized implementation (no Python loops).

Batch Size	ArKan (Time)	ArKan (Throughput)	PyTorch (Time)	PyTorch (Throughput)	Conclusion
1	26.7 µs	0.79 M elems/s	1.45 ms	0.01 M elems/s	Rust is 54x faster (Low Latency)
16	427 µs	0.79 M elems/s	2.58 ms	0.13 M elems/s	Rust is 6.0x faster
64	1.70 ms	0.79 M elems/s	4.30 ms	0.31 M elems/s	Rust is 2.5x faster
256	6.82 ms	0.79 M elems/s	11.7 ms	0.46 M elems/s	Rust is 1.7x faster

Performance Analysis

Small Batch Dominance: For single requests (batch=1), ArKan is 54x faster than PyTorch because it avoids interpreter and framework-abstraction overhead: about 37,000 inferences per second versus about 700 for PyTorch.
Mid-Batch Performance: On medium batches (16-64), ArKan keeps a 2.5x-6.0x advantage.
Throughput Scaling: At batch 256, ArKan is still 1.7x faster thanks to its zero-allocation architecture and efficient cache use. The gap narrows with batch size because ArKan's throughput stays flat at 0.79 M elems/s while PyTorch amortizes its fixed per-call overhead.
Zero-Allocation Training: The entire training loop (forward + backward + update) runs without allocations once the Workspace is warmed up.

Comparison with Analogues (Prior Art)

ArKan occupies the niche of specialized high-performance inference.

Crate	Purpose	Difference from ArKan
`burn-efficient-kan`	Part of the Burn ecosystem. Well suited to training on GPU.	ArKan is a lightweight library with an optional GPU backend via wgpu. Minimal dependencies in the base configuration.
`fekan`	Feature-rich (CLI, dataset loaders). General-purpose library.	ArKan was designed from the start for SIMD (AVX2), parallelism, and GPU acceleration.
`rusty_kan`	Basic implementation, educational project.	ArKan focuses on production-ready optimizations: workspace, batching, GPU.

Quick Start

Install from crates.io:

[dependencies]
arkan = "0.3.0"

Usage example (see also examples/basic.rs and examples/training.rs):

use arkan::{KanConfig, KanNetwork};

fn main() {
    // 1. Configuration (Poker Solver preset)
    let config = KanConfig::preset();

    // 2. Network initialization
    let network = KanNetwork::new(config.clone());

    // 3. Create the Workspace (memory is allocated once)
    let mut workspace = network.create_workspace(64); // Max batch size = 64

    // 4. Data preparation
    let inputs = vec![0.0f32; 64 * config.input_dim];
    let mut outputs = vec![0.0f32; 64 * config.output_dim];

    // 5. Inference (Zero allocations here!)
    network.forward_batch(&inputs, &mut outputs, &mut workspace);

    println!("Inference done. Output[0]: {}", outputs[0]);
}

Architecture

KanLayer: Implements a KAN layer and stores the spline coefficients. It evaluates only a local window of order+1 basis functions per input, which keeps the working set small and CPU-cache friendly.
Workspace: The key structure for performance. Holds aligned (AlignedBuffer) buffers for intermediate results and is reused between calls.
spline: Module implementing the Cox-de Boor algorithm. Contains SIMD intrinsics.

License

Dual-licensed under MIT and Apache-2.0.

ArKan (English Version)

ArKan is a high-performance Rust implementation of Kolmogorov-Arnold Networks (KAN), optimized for latency-critical workloads (low-latency inference).

The library was built for integration into game solvers (e.g., Poker AI / MCTS), which need to run thousands of single-sample inferences per second without the overhead typical of large ML frameworks.

Theory: What is KAN?

Classical multilayer perceptrons (MLPs) fix the activation functions on the nodes (neurons) and learn linear weights. Kolmogorov-Arnold Networks (KAN) do the opposite:

Nodes perform simple summation.
Edges contain learnable non-linear activation functions.

Mathematical Model

The model is based on the Kolmogorov-Arnold representation theorem. For a layer with N_in inputs and N_out outputs, the transformation is:

x[l+1, j] = Σᵢ φ[l,j,i](x[l, i])      where i = 1..N_in

Where φ[l,j,i] is a learnable 1D function connecting the i-th input neuron to the j-th output neuron.
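
To make the formula concrete, here is a minimal sketch of a single layer in plain Rust. It is illustrative only and is not ArKan's implementation: `ToyKanLayer` and `Edge` are hypothetical names, and each edge function is an arbitrary boxed closure rather than a B-spline.

```rust
/// A learnable 1D function on one edge (hypothetical alias for this sketch).
pub type Edge = Box<dyn Fn(f32) -> f32>;

/// One toy KAN layer: `edges[j][i]` is phi[l,j,i], the function on the
/// edge from input i to output j.
pub struct ToyKanLayer {
    edges: Vec<Vec<Edge>>,
}

impl ToyKanLayer {
    pub fn new(edges: Vec<Vec<Edge>>) -> Self {
        Self { edges }
    }

    /// Computes x[l+1, j] = sum over i of phi[l,j,i](x[l, i]).
    /// Nodes only sum; all nonlinearity lives on the edges.
    pub fn forward(&self, input: &[f32]) -> Vec<f32> {
        self.edges
            .iter()
            .map(|row| {
                row.iter()
                    .zip(input)
                    .map(|(phi, &x)| phi(x))
                    .sum::<f32>()
            })
            .collect()
    }
}
```

For example, a layer whose first output uses the edges x² and 2x, and whose second output uses x + 1 and -x, maps the input [3.0, 0.5] to [10.0, 3.5].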

Implementation in ArKan (B-Splines)

In this library, φ functions are parameterized using B-Splines. This allows modifying the shape of the activation function locally while maintaining smoothness.

Each edge function in ArKan has the form:

φ(x) = Σᵢ cᵢ · Bᵢ(x)      where i = 1..(G+p)

Bᵢ(x) — B-spline basis functions.
cᵢ — learnable coefficients.
G — grid size.
p — spline order.
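
The definition above can be sketched with the textbook Cox-de Boor recursion. This is a scalar reference version under stated assumptions, not ArKan's SIMD kernel: the grid is assumed uniform on [lo, hi] and extended by p knots on each side, and p is treated as the polynomial degree so that the number of basis functions comes out as G + p. The function names are hypothetical.

```rust
/// Uniform knot vector over [lo, hi] with `g` intervals, extended by `p`
/// knots on each side. Produces g + 2p + 1 knots, which define g + p
/// basis functions of degree p.
pub fn uniform_knots(lo: f32, hi: f32, g: usize, p: usize) -> Vec<f32> {
    let h = (hi - lo) / g as f32;
    (0..=g + 2 * p)
        .map(|k| lo + (k as f32 - p as f32) * h)
        .collect()
}

/// Cox-de Boor recursion for the i-th basis function of degree p at x.
pub fn basis(i: usize, p: usize, knots: &[f32], x: f32) -> f32 {
    if p == 0 {
        // Degree 0: indicator of the half-open knot interval.
        return if knots[i] <= x && x < knots[i + 1] { 1.0 } else { 0.0 };
    }
    let left_den = knots[i + p] - knots[i];
    let right_den = knots[i + p + 1] - knots[i + 1];
    let left = if left_den > 0.0 {
        (x - knots[i]) / left_den * basis(i, p - 1, knots, x)
    } else {
        0.0
    };
    let right = if right_den > 0.0 {
        (knots[i + p + 1] - x) / right_den * basis(i + 1, p - 1, knots, x)
    } else {
        0.0
    };
    left + right
}

/// phi(x) = sum_i c_i * B_i(x), where coeffs.len() == g + p.
pub fn phi(coeffs: &[f32], p: usize, knots: &[f32], x: f32) -> f32 {
    coeffs
        .iter()
        .enumerate()
        .map(|(i, &c)| c * basis(i, p, knots, x))
        .sum::<f32>()
}
```

With G = 5 and p = 3 this gives 12 knots and 8 coefficients per edge. Setting every coefficient to 1 is a quick sanity check: inside the grid the basis functions sum to 1, so φ(x) should evaluate to approximately 1.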

Key Features

Zero-Allocation Inference: The entire forward pass runs on a pre-allocated buffer (Workspace). No allocations in the Hot Path.
Zero-Allocation Training: The full training step (forward + backward + SGD/Adam) also runs without allocations on a warmed-up Workspace.
SIMD-Optimized B-Splines: B-spline basis evaluation is vectorized (AVX2/AVX-512 via the wide crate).
Cache-Friendly Layout: Weights are stored in [Output][Input][Basis] format for sequential memory access and minimal cache misses.
Standalone: Minimal dependencies (rayon, wide). No torch or burn bloat, ideal for embedding.
Quantization Ready: Architecture is ready for quantized weights (baked models) for further acceleration.
GPU Acceleration (wgpu): Optional GPU backend with WGSL compute shaders for parallel forward/backward passes.
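
The [Output][Input][Basis] layout can be illustrated with a flat buffer and an index formula. This is a sketch of the general idea, assuming a plain `Vec<f32>` without padding or alignment; `FlatWeights` and its methods are hypothetical names, not ArKan's types.

```rust
/// Flat coefficient storage in [Output][Input][Basis] order (sketch).
pub struct FlatWeights {
    pub n_in: usize,
    pub n_basis: usize,
    pub data: Vec<f32>,
}

impl FlatWeights {
    pub fn zeros(n_out: usize, n_in: usize, n_basis: usize) -> Self {
        Self { n_in, n_basis, data: vec![0.0; n_out * n_in * n_basis] }
    }

    /// Offset of the first coefficient of the edge (output o, input i).
    pub fn edge_offset(&self, o: usize, i: usize) -> usize {
        (o * self.n_in + i) * self.n_basis
    }

    /// Coefficients of one edge: a contiguous slice of length n_basis.
    pub fn edge(&self, o: usize, i: usize) -> &[f32] {
        let start = self.edge_offset(o, i);
        &self.data[start..start + self.n_basis]
    }

    /// Output o for one sample, given dense basis values laid out as
    /// `basis[i * n_basis + b] = B_b(x_i)`. Because all weights feeding
    /// output o are contiguous, this is a single linear scan.
    pub fn output(&self, o: usize, basis: &[f32]) -> f32 {
        let start = self.edge_offset(o, 0);
        let row = &self.data[start..start + self.n_in * self.n_basis];
        row.iter().zip(basis).map(|(w, b)| w * b).sum::<f32>()
    }
}
```

The design point is that the innermost index (basis) is the one iterated most often, so consecutive reads touch consecutive addresses.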

GPU Backend (Optional)

ArKan includes an optional GPU backend using wgpu for WebGPU/Vulkan/Metal/DX12 acceleration.

Installation

[dependencies]
arkan = { version = "0.3.0", features = ["gpu"] }

Usage

use arkan::{KanConfig, KanNetwork};
use arkan::gpu::{WgpuBackend, WgpuOptions, GpuNetwork};
use arkan::optimizer::{Adam, AdamConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize GPU backend
    let backend = WgpuBackend::init(WgpuOptions::default())?;
    println!("GPU: {}", backend.adapter_name());

    // Create CPU network
    let config = KanConfig::preset();
    let mut cpu_network = KanNetwork::new(config.clone());

    // Create GPU network from CPU network
    let mut gpu_network = GpuNetwork::from_cpu(&backend, &cpu_network)?;
    let mut workspace = gpu_network.create_workspace(64)?;

    // Forward inference
    let input = vec![0.5f32; config.input_dim];
    let output = gpu_network.forward_single(&input, &mut workspace)?;

    // Training with Adam optimizer
    let mut optimizer = Adam::new(&cpu_network, AdamConfig::with_lr(0.001));
    let target = vec![1.0f32; config.output_dim];

    let loss = gpu_network.train_step_mse(
        &input, &target, 1, 
        &mut workspace, &mut optimizer, &mut cpu_network
    )?;

    println!("Loss: {}", loss);
    Ok(())
}

GPU Features

Feature	Status
Forward inference	✅
Forward training (saves activations)	✅
Backward pass	✅ (GPU shaders)
Adam/SGD optimizer	✅
Weight sync CPU↔GPU	✅
Multi-layer networks	✅
Batch processing	✅
train_step_with_options	✅
Gradient clipping	✅
Weight decay	✅

Weight Synchronization

// Sync weights from CPU to GPU (after loading a model)
gpu_network.sync_weights_cpu_to_gpu(&cpu_network)?;

// Sync weights from GPU to CPU (for saving/export)
gpu_network.sync_weights_gpu_to_cpu(&mut cpu_network)?;

Training with Options

use arkan::TrainOptions;

let opts = TrainOptions {
    max_grad_norm: Some(1.0),  // Gradient clipping
    weight_decay: 0.01,         // AdamW-style weight decay
};

let loss = gpu_network.train_step_with_options(
    &input, &target, None, batch_size,
    &mut workspace, &mut optimizer, &mut cpu_network,
    &opts
)?;
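
For readers unfamiliar with the two options, the following standalone sketch shows what gradient clipping and AdamW-style weight decay conventionally mean. It assumes clipping by the global L2 norm and decay applied directly to the weights; whether ArKan implements them exactly this way is not confirmed by this README, and both function names are hypothetical.

```rust
/// Scales `grads` in place so that their global L2 norm is at most
/// `max_norm`. Returns the norm measured before clipping.
pub fn clip_grad_norm(grads: &mut [f32], max_norm: f32) -> f32 {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
    norm
}

/// Decoupled (AdamW-style) weight decay: shrinks each weight by a
/// factor of (1 - lr * weight_decay), independently of the gradient.
pub fn apply_weight_decay(weights: &mut [f32], lr: f32, weight_decay: f32) {
    let factor = 1.0 - lr * weight_decay;
    for w in weights.iter_mut() {
        *w *= factor;
    }
}
```

With max_norm = 1.0, a gradient of [3.0, 4.0] (norm 5) is rescaled to roughly [0.6, 0.8], preserving its direction.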

GPU Limitations (wgpu 0.23)

No DeviceLost propagation: wgpu 0.23 does not expose DeviceLost errors, so a GPU crash may show up as a hang rather than a proper error.
Memory limits: MAX_VRAM_ALLOC = 2GB per buffer. Exceeding this returns a BatchTooLarge error.
Vec4 alignment: Weights are padded to vec4 (4-element) boundaries for shader efficiency.
CPU fallback: If no GPU is available, backend initialization fails cleanly with an AdapterNotFound error, which lets the caller fall back to the CPU network.
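
The padding and size-limit rules above amount to simple arithmetic, sketched below. Several details are assumptions made for illustration: the cap is interpreted as 2 GiB, padding is applied per row, elements are f32, and the `BatchTooLarge` struct and `buffer_bytes` function are hypothetical stand-ins rather than ArKan's actual types.

```rust
/// Per-buffer cap. The README states 2GB; 2 GiB is assumed here.
pub const MAX_VRAM_ALLOC: u64 = 2 * 1024 * 1024 * 1024;

/// Rounds an element count up to the next multiple of 4 (one vec4).
pub fn pad_to_vec4(len: usize) -> usize {
    (len + 3) / 4 * 4
}

/// Hypothetical error mirroring the BatchTooLarge error named above.
#[derive(Debug, PartialEq)]
pub struct BatchTooLarge {
    pub requested_bytes: u64,
}

/// Size in bytes of a buffer holding `rows` rows of `row_len` f32
/// values, with each row padded to a vec4 boundary.
pub fn buffer_bytes(rows: usize, row_len: usize) -> Result<u64, BatchTooLarge> {
    let bytes = rows as u64 * pad_to_vec4(row_len) as u64 * 4;
    if bytes > MAX_VRAM_ALLOC {
        Err(BatchTooLarge { requested_bytes: bytes })
    } else {
        Ok(bytes)
    }
}
```

Under these assumptions a batch of 64 rows with 21 values each pads to 24 values per row and needs 64 × 24 × 4 = 6144 bytes.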

Choosing Backend

// High-performance GPU (default)
let backend = WgpuBackend::init(WgpuOptions::default())?;

// Compute-optimized (larger buffers)
let backend = WgpuBackend::init(WgpuOptions::compute())?;

// Low-memory/integrated GPU
let backend = WgpuBackend::init(WgpuOptions::low_memory())?;

// Force specific adapter
let opts = WgpuOptions {
    force_adapter_name: Some("NVIDIA".to_string()),
    ..Default::default()
};
let backend = WgpuBackend::init(opts)?;

Running GPU Tests and Benchmarks

# GPU parity tests
cargo test --features gpu --test gpu_parity -- --ignored

# GPU benchmarks (Windows PowerShell)
$env:ARKAN_GPU_BENCH="1"; cargo bench --bench gpu_forward --features gpu

# GPU benchmarks (Linux/macOS)
ARKAN_GPU_BENCH=1 cargo bench --bench gpu_forward --features gpu
ARKAN_GPU_BENCH=1 cargo bench --bench gpu_backward --features gpu

GPU Performance vs PyTorch CUDA

Comparison of ArKan GPU (wgpu/Vulkan) with PyTorch KAN implementations on CUDA:

Implementation	Forward (batch=64)	Train (Adam)	Notes
fast-kan (CUDA)	0.58 ms	1.78 ms	RBF approximation (fastest)
ArKan (wgpu)	1.18 ms	3.04 ms	WebGPU/Vulkan, native training
efficient-kan (CUDA)	1.62 ms	3.70 ms	Native B-spline
ArKan-style (CUDA)	3.63 ms	N/A	Reference implementation

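Conclusion: ArKan (wgpu) has 27% lower forward latency than efficient-kan (1.18 ms vs 1.62 ms) and 18% lower training-step latency (3.04 ms vs 3.70 ms). fast-kan, which uses an RBF approximation, remains about 2x faster on the forward pass.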

Benchmarks (CPU)

Comparison of ArKan (Rust) vs. optimized vectorized PyTorch implementation (CPU).

Test Setup:

Config: Input 21, Output 24, Hidden [64, 64], Grid 5, Spline Order 3.
ArKan: cargo bench --bench forward (AVX2/Rayon enabled).
PyTorch: Optimized vectorized implementation (no Python loops).

Batch Size	ArKan (Time)	ArKan (Throughput)	PyTorch (Time)	PyTorch (Throughput)	Conclusion
1	26.7 µs	0.79 M elems/s	1.45 ms	0.01 M elems/s	Rust is 54x faster (Low Latency)
16	427 µs	0.79 M elems/s	2.58 ms	0.13 M elems/s	Rust is 6.0x faster
64	1.70 ms	0.79 M elems/s	4.30 ms	0.31 M elems/s	Rust is 2.5x faster
256	6.82 ms	0.79 M elems/s	11.7 ms	0.46 M elems/s	Rust is 1.7x faster

Performance Analysis

Small Batch Dominance: For single requests (batch=1), ArKan is 54x faster than PyTorch because it avoids interpreter and framework-abstraction overhead: about 37,000 inferences per second versus about 700 for PyTorch.
Mid-Batch Performance: On medium batches (16-64), ArKan keeps a 2.5x-6.0x advantage.
Throughput Scaling: At batch 256, ArKan is still 1.7x faster thanks to its zero-allocation architecture and efficient cache use. The gap narrows with batch size because ArKan's throughput stays flat at 0.79 M elems/s while PyTorch amortizes its fixed per-call overhead.
Zero-Allocation Training: The entire training loop (forward + backward + update) runs without allocations on a warmed-up Workspace.
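
The derived figures in the table can be reproduced from the raw timings. The sketch below assumes that "elems" counts input elements (batch × input dimension, with input dimension 21 from the test setup); that reading is inferred from the numbers being consistent with it, not stated by the benchmark. The helper names are hypothetical.

```rust
/// Input elements processed per second for one batched call.
pub fn throughput_elems_per_sec(batch: usize, input_dim: usize, seconds: f64) -> f64 {
    (batch * input_dim) as f64 / seconds
}

/// Single-sample inferences per second, given the latency of one call.
pub fn inferences_per_sec(seconds_per_call: f64) -> f64 {
    1.0 / seconds_per_call
}

/// How many times faster the candidate is than the baseline.
pub fn speedup(baseline_seconds: f64, candidate_seconds: f64) -> f64 {
    baseline_seconds / candidate_seconds
}
```

For batch=1, `throughput_elems_per_sec(1, 21, 26.7e-6)` is about 0.79 M elems/s, `inferences_per_sec(26.7e-6)` is about 37,000, and `speedup(1.45e-3, 26.7e-6)` is about 54.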

Comparison with Analogues (Prior Art)

ArKan occupies the niche of specialized high-performance inference.

Crate	Purpose	Difference from ArKan
`burn-efficient-kan`	Part of the Burn ecosystem.	ArKan is lightweight with optional GPU via wgpu. Minimal dependencies in base config.
`fekan`	Rich functionality, general-purpose library.	ArKan is designed with SIMD, parallelism, and GPU acceleration from the start.
`rusty_kan`	Basic implementation, educational project.	ArKan focuses on production-ready optimizations: workspace, batching, GPU.

Quick Start

Install from crates.io:

[dependencies]
arkan = "0.3.0"

Usage Example (see also examples/basic.rs and examples/training.rs):

use arkan::{KanConfig, KanNetwork};

fn main() {
    // 1. Configuration (Poker Solver preset)
    let config = KanConfig::preset();

    // 2. Network initialization
    let network = KanNetwork::new(config.clone());

    // 3. Create Workspace (memory allocated once)
    let mut workspace = network.create_workspace(64); // Max batch size = 64

    // 4. Data preparation
    let inputs = vec![0.0f32; 64 * config.input_dim];
    let mut outputs = vec![0.0f32; 64 * config.output_dim];

    // 5. Inference (Zero allocations here!)
    network.forward_batch(&inputs, &mut outputs, &mut workspace);

    println!("Inference done. Output[0]: {}", outputs[0]);
}

Architecture

KanLayer: Implements a KAN layer and stores the spline coefficients. It evaluates only a local window of order+1 basis functions per input, which keeps the working set small and CPU-cache friendly.
Workspace: The key structure for performance. Holds aligned (AlignedBuffer) buffers for intermediate results and is reused between calls.
spline: Module implementing the Cox-de Boor algorithm. Contains SIMD intrinsics.
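
The local-window idea can be sketched as follows: for a degree-p spline only p + 1 basis functions are non-zero at any x, so an edge needs just p + 1 coefficients per evaluation. The code uses the standard iterative form of the Cox-de Boor recursion within one knot span, writes into a caller-provided buffer, and keeps its scratch space on the stack so that no heap allocation occurs. It is a scalar illustration under those assumptions, not ArKan's kernel; `MAX_DEGREE` and all function names are hypothetical.

```rust
/// Largest supported degree for the stack scratch arrays (assumed).
pub const MAX_DEGREE: usize = 8;

/// Index `s` of the knot span with knots[s] <= x < knots[s + 1],
/// clamped to the spans inside the grid.
pub fn find_span(knots: &[f32], p: usize, x: f32) -> usize {
    let last = knots.len() - p - 2;
    let mut s = p;
    while s < last && x >= knots[s + 1] {
        s += 1;
    }
    s
}

/// Fills `window` (length p + 1) with the non-zero basis values at x
/// and returns the index of the first coefficient they apply to.
pub fn basis_window(knots: &[f32], p: usize, x: f32, window: &mut [f32]) -> usize {
    assert!(p <= MAX_DEGREE && window.len() == p + 1);
    let span = find_span(knots, p, x);
    let mut left = [0.0f32; MAX_DEGREE + 1];
    let mut right = [0.0f32; MAX_DEGREE + 1];
    window[0] = 1.0;
    for j in 1..=p {
        left[j] = x - knots[span + 1 - j];
        right[j] = knots[span + j] - x;
        let mut saved = 0.0;
        for r in 0..j {
            let temp = window[r] / (right[r + 1] + left[j - r]);
            window[r] = saved + right[r + 1] * temp;
            saved = left[j - r] * temp;
        }
        window[j] = saved;
    }
    span - p
}

/// Evaluates one edge using only the p + 1 coefficients in the window.
pub fn eval_edge(coeffs: &[f32], knots: &[f32], p: usize, x: f32, window: &mut [f32]) -> f32 {
    let first = basis_window(knots, p, x, window);
    coeffs[first..first + p + 1]
        .iter()
        .zip(window.iter())
        .map(|(c, b)| c * b)
        .sum::<f32>()
}
```

With Grid 5 and Spline Order 3, each evaluation reads 4 of the 8 coefficients on an edge, and the `window` buffer can be allocated once and reused across calls in the same way the Workspace is.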

License

Dual-licensed under MIT and Apache-2.0.

LutwigStack/ArKan