KLLM (Key-Light Large Model) is a C++17 header-only runtime providing high-performance transform and sketch primitives with fused microkernels, a tiny IR + planner, a streaming pipeline, and int8/int4 quantization.
- Header-only, zero external deps by default; builds on Linux and Android/Termux
- AVX2/NEON-optimized FWHT, scalar fallback
- GPU support removed: streamlined CPU-only path for maximum portability
- Streaming pipeline v2.1: Transform → Sketch → Route → Quantize with slab buffering
- Parallel sketch, routing, and blockwise quantization
- Tiny IR + planner that fuses Transform+Relu
- int8/int4 blockwise quantization, dequant helpers
- Robust status handling, thread pool, affinity (Linux)
- Parallel SketchEngine and RoutingEngine (across buckets)
- Parallel blockwise quantization (int8/int4)
- Pipeline buffer reuse to reduce allocations
- GPU code paths removed; simpler build and predictable performance on CPU
- Autograd memory fix: safer graph ownership (parents now held as shared_ptr) to avoid leaks and UAF
- New runtime knobs: `enable_simd`, `enable_matmul_blocked`, `matmul_block_{m,n,k}`, `release_tls_fused_buffers`
Performance snapshot (this environment):
- FWHT 1M: ~2.64–5.22 ms (parallel vs single)
- Fused FWHT-scale-add 1M: ~6.5 ms
- Pipeline v2.1 int8 1M: ~24–30 ms
- Pipeline v2.1 int4 1M: ~23–28 ms
Your mileage varies by CPU.
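The FWHT timed above is the standard O(n log n) in-place butterfly. As a library-independent reference (the library's SIMD kernels presumably compute the same result, just faster), a scalar version looks like:

```cpp
#include <cstddef>

// In-place Walsh-Hadamard transform; n must be a power of two.
// Each pass combines pairs (a, b) into (a + b, a - b) at stride h.
void fwht_scalar(float* x, std::size_t n) {
    for (std::size_t h = 1; h < n; h <<= 1) {
        for (std::size_t i = 0; i < n; i += h << 1) {
            for (std::size_t j = i; j < i + h; ++j) {
                float a = x[j], b = x[j + h];
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

Since H·H = n·I, applying the transform twice returns n times the input, which is why an inverse variant only needs an extra division by n.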
- kklm.h: single public header
- examples/main.cpp: usage demo
- examples/miko.cpp: toy self-play chess learner with emoji board and save/load
- bench/bench.cpp: micro-benchmarks
- test.cpp: correctness tests
Dependencies: clang++ (or g++), Linux or Android/Termux.
# x86-64
clang++ -std=c++17 -O3 -march=native -mtune=native -fPIC -Wall -Wextra -Wpedantic -Werror \
-I. examples/main.cpp -o kllm_demo
clang++ -std=c++17 -O3 -march=native -mtune=native -fPIC -Wall -Wextra -Wpedantic -Werror \
-I. examples/miko.cpp -o miko
clang++ -std=c++17 -O3 -march=native -mtune=native -fPIC -Wall -Wextra -Wpedantic -Werror \
-I. bench/bench.cpp -o kllm_bench
clang++ -std=c++17 -O3 -march=native -mtune=native -fPIC -Wall -Wextra -Wpedantic -Werror \
-I. test.cpp -o kllm_test
# aarch64 (Termux)
clang++ -std=c++17 -O3 -march=armv8-a+simd -mtune=native -fPIC -Wall -Wextra -Wpedantic -Werror \
-I. examples/miko.cpp -o miko

Run:
./kllm_demo
./kllm_bench
./kllm_test
./miko

GPU support: removed; the runtime is CPU-only.
Include the header:
#include "kklm.h"

Config/tuning (CPU):
kllm::set_num_threads(8); // threads (0 uses hardware_concurrency)
kllm::set_parallel_threshold(1<<14); // size threshold for parallel paths
kllm::set_large_slab_bytes(1024*1024); // pipeline slab size
// Direct config access (advanced):
kllm::global_config().enable_simd = true; // runtime SIMD on/off
kllm::global_config().enable_matmul_blocked = true; // blocked GEMM on/off
kllm::global_config().matmul_block_m = 64; // tile sizes
kllm::global_config().matmul_block_n = 128;
kllm::global_config().matmul_block_k = 128;
kllm::global_config().release_tls_fused_buffers = true; // free fused TLS buffers each call

Core transforms:
// FWHT in-place; length must be power-of-two
kllm::fwht_inplace(ptr, n);
kllm::fwht_inplace_parallel(ptr, n, pool);
kllm::fwht_inplace_inverse(ptr, n);
// Fused microkernels
kllm::fused_fwht_scale_add(x, n, scale, y);
kllm::fused_fwht_bias_relu(x, bias, n, out);

Pipelines and quant:
kllm::PipelineTelemetry T{};
std::vector<int8_t> q8; std::vector<float> scales;
// Transform -> Sketch -> Pointwise -> Route -> Quantize
kllm::run_pipeline_v21_to_int8(input, sketch_size, q8, scales, T, kllm::PointwiseOp::kRelu);

Sketch and routing:
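The Sketch stage of the pipeline above compresses a long vector with a count sketch, whose core update is y[h(i)] += s(i)·x[i] for a random bucket hash h and random sign s. A minimal single-row sketch, independent of the library (the hash here is a toy integer mix; `CountSketch(buckets, rows)` presumably combines several properly hashed rows):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One row of a count sketch: each input index gets a bucket and a +/-1 sign
// from a cheap multiplicative mix; real code would use pairwise-independent hashes.
std::vector<float> count_sketch_row(const float* x, std::size_t n, std::size_t buckets) {
    std::vector<float> y(buckets, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t z = (i + 1) * 0x9E3779B97F4A7C15ull;  // toy hash
        std::size_t bucket = static_cast<std::size_t>(z >> 32) % buckets;
        float sign = (z & 1) ? 1.0f : -1.0f;
        y[bucket] += sign * x[i];
    }
    return y;
}
```

Because the sketch is linear in x, sketching a sum equals summing the sketches, which is what lets it slot between the transform and quantize stages.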
kllm::CountSketch cs(1<<12, 3);
cs.apply(x.data(), x.size(), y.data());

Rewards/metrics:
float mse = kllm::reward_mse(pred, target);
float cos = kllm::reward_cosine_similarity(a, b);
float acc = kllm::reward_top1_accuracy(logits, labels, num_classes);
float f1 = kllm::reward_f1_binary(pred_labels, true_labels);
float bleu = kllm::reward_bleu_1_4(seq_pred, seq_ref);
float rouge = kllm::reward_rouge_l(seq_pred, seq_ref);

Build a tiny MLP classifier using the new minimal nn API:
using namespace kllm::nn;
// No API changes required; internal graph ownership improved to prevent leaks.

Available ops and layers:
- Tensors: `tensor(values, shape, requires_grad)`, `zeros(shape)`, `ones(shape)`, `full(shape, value)`, `randu(shape)`, `randn(shape)`
- Ops: `add`, `mul`, `matmul`, `add_bias`; activations: `relu`, `leaky_relu`, `elu`, `selu`, `gelu`; softmax: `softmax_lastdim`; losses: `mse_loss`, `cross_entropy_logits`
- Modules: `Linear`, `Sequential`, `Dropout`, `LayerNorm`, `BatchNorm1d`; utilities: `collect_parameters`, `summary`
- Optimizers: `SGD(params, lr[, weight_decay])`, `Adam(params, lr)`, `RMSprop`, `Adagrad`, `Adadelta`
- Training helpers: gradient clipping via `clip_grad_norm(params, max_norm)`; LR schedulers: `StepLR`, `CosineAnnealingLR`
- Data: `TensorDataset::from`, `TensorDatasetReg::from` (regression), `DataLoader`, `DataLoaderReg`
- Trainer: `Trainer(cfg).fit(model, loader, optimizer)`, `fit_regression(model, loaderReg, optimizer, epochs)`
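The optimizer constructors above follow familiar conventions. As a library-independent reference (an assumption about the semantics, not kllm's actual code), plain SGD with L2 weight decay updates each parameter as p -= lr · (g + wd · p):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One SGD step with coupled L2 weight decay: p -= lr * (grad + wd * p).
// Whether kllm's SGD couples decay this way is an assumption; check kklm.h.
void sgd_step(std::vector<float>& params, const std::vector<float>& grads,
              float lr, float weight_decay = 0.0f) {
    for (std::size_t i = 0; i < params.size(); ++i)
        params[i] -= lr * (grads[i] + weight_decay * params[i]);
}
```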
Notes:
- Each `nn::Value` exposes a `data()` method for raw access; internal storage is now `values`.
- `Module::parameters()` returns a vector; `collect_parameters(model)` gathers recursively.
Custom module implementation (example):
struct MyReluModule : kllm::nn::Module {
kllm::nn::ValuePtr forward(const kllm::nn::ValuePtr &x) override { return kllm::nn::relu(x); }
std::vector<kllm::nn::ValuePtr> parameters() override { return {}; }
};

Autograd notes:
- Backprop builds a DAG of `Value` parents. Call `loss->backward()` to populate `.grad` for tensors with `requires_grad=true`.
- Gradients flow through ops via stored `backward_fn` lambdas; use `zero_grad()` on parameters before each step.
- For stability, you can use `clip_grad_norm(params, max_norm)` before the optimizer `step()`.
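Gradient clipping by global norm computes the L2 norm over all parameter gradients and, if it exceeds `max_norm`, rescales every gradient by `max_norm / norm`. A library-independent sketch of that logic (the return-the-pre-clip-norm convention is an assumption borrowed from other frameworks):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scale all gradients so their combined L2 norm is at most max_norm.
// Returns the norm measured before clipping.
float clip_grad_norm_ref(std::vector<std::vector<float>>& grads, float max_norm) {
    float sq = 0.0f;
    for (const auto& g : grads)
        for (float v : g) sq += v * v;
    float norm = std::sqrt(sq);
    if (norm > max_norm && norm > 0.0f) {
        float scale = max_norm / norm;
        for (auto& g : grads)
            for (float& v : g) v *= scale;
    }
    return norm;
}
```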
This API is intentionally compact for easy use on mobile/Termux while staying header-only.
FWHT 1M floats: ~2.64–5.22 ms
FWHT(par,4) 1M floats: ~2.64–3.66 ms
Fused FWHT-scale-add 1M: ~6.5 ms
NTT 262k uint32: ~4.5 ms
CountSketch 1M -> 262k: ~4.3 ms
BlockDiag float 1024x(16x16): ~0.049 ms
BlockDiag int8 1024x(16x16): ~0.046 ms
LowRank 4096x4096 (r=64): ~0.00003–0.00004 ms
Pipeline v2.1 int8 1M: ~24–30 ms (slabs=4)
Pipeline v2.1 int4 1M: ~23–28 ms (slabs=4)
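The int8 pipeline numbers above rest on standard blockwise absmax quantization: each block stores int8 codes plus one float scale, with scale = max|x| / 127 and dequantization q · scale. A minimal sketch (block size and round-to-nearest are assumptions; the library's kernels may differ):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Absmax-quantize x in blocks of `block` floats to int8, one scale per block.
void quantize_int8_blockwise(const float* x, std::size_t n, std::size_t block,
                             std::vector<std::int8_t>& q, std::vector<float>& scales) {
    q.resize(n);
    scales.clear();
    for (std::size_t b = 0; b < n; b += block) {
        std::size_t end = std::min(n, b + block);
        float amax = 0.0f;
        for (std::size_t i = b; i < end; ++i) amax = std::max(amax, std::fabs(x[i]));
        float scale = amax > 0.0f ? amax / 127.0f : 1.0f;  // avoid divide-by-zero
        scales.push_back(scale);
        for (std::size_t i = b; i < end; ++i) {
            int v = static_cast<int>(std::lround(x[i] / scale));
            q[i] = static_cast<std::int8_t>(std::max(-127, std::min(127, v)));
        }
    }
}
```

The int4 path is the same idea with scale = max|x| / 7 and two codes packed per byte.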
Notes:
- `enable_simd=false` forces scalar paths for maximum portability/testing.
- Blocked matmul uses runtime tiling; adjust `matmul_block_*` to match your cache sizes.
Tips:
- Increase `set_large_slab_bytes(1<<20)` and set `set_num_threads(6–12)`.
- Keep `routing_bucket_size` near L2-sized tiles (default 256).
- Some Android ROMs restrict CPU affinity; use `set_current_thread_affinity_status()` to inspect errors.
- OpenCL may not be available on many devices; the CPU path is optimized for NEON.
You may use, copy, and modify this code for personal, educational, or research purposes only. You may NOT sell or distribute this software or derivatives for commercial purposes without explicit permission.