LEAP is a high-performance, distributed Large Language Model (LLM) inference engine written in C++20. It enables the execution of massive models (e.g., Llama 3 70B) on clusters of consumer-grade devices or heterogeneous servers by splitting the model layer-wise across a Ring Topology.
LEAP solves the "VRAM Wall" problem. Instead of requiring a single expensive GPU with massive memory (like an H100), LEAP pipelines the inference process across multiple smaller devices connected via standard Ethernet. It achieves near-native performance through aggressive optimizations:
- Zero-Copy Networking: A custom Linux Kernel Module (`leap_kmod`) bypasses the OS network stack.
- SIMD Acceleration: Hand-written AVX2 (x86) and NEON (ARM) kernels for maximum CPU throughput.
- Asymmetric Pipelining: Supports nodes with different compute capabilities working in unison.
- Home Lab Clusters: Chain together spare hardware (MacBooks, Gaming PCs, Mini PCs) to run state-of-the-art 70B+ models that no single device could fit.
- Edge AI: Deploy powerful intelligence on a stack of embedded devices (e.g., NVIDIA Jetson, Rockchip) in environments with limited power and connectivity.
- Privacy-Focused Local Inference: Run sensitive workloads entirely on-premise without relying on cloud APIs or expensive enterprise hardware.
- Cost-Effective Scaling: Utilize fragmented, heterogeneous resources in a data center without needing expensive interconnects like InfiniBand or NVLink.
LEAP prioritizes low latency and high throughput through aggressive optimizations, including SIMD (AVX2/NEON), custom quantization, and a specialized Linux Kernel Module for zero-copy networking.
LEAP consists of four main components:
- Exporter: Converts PyTorch/Safetensors models into LEAP's optimized binary format.
- Tokenizer: A standalone, pure C++ tokenizer (BPE) compatible with Llama 3 / Tiktoken.
- Inference Engine: The core runtime supporting Float32 and Int8 Quantized execution.
- Transport Layer: A modular networking interface supporting TCP, UDP, and Kernel-Bypass modes.
The exporter is responsible for preparing models for inference. It reads .safetensors checkpoints, transforms weights
into the correct memory layout, and serializes them.
- Source Format: Supports HuggingFace-style `.safetensors` (sharded or single file).
- Weight Transformation: Automatically permutes weights for Rotary Positional Embeddings (RoPE) to match the inference engine's expected interleaved format.
- Quantization: Supports post-training quantization to Int8 (Symmetric).
- Float32 (`float32_export`):
  - Magic: `0x616B3432` ("ak42")
  - Version: `1`
  - Content: Full-precision weights.
- Int8 (`int8_export`):
  - Magic: `0x616B3432`
  - Version: `2`
  - Quantization Scheme: Symmetric block-wise quantization.
    - Group Size: Configurable (default 64). If `dim` is not divisible, it backs off to 32.
    - Storage: Weights stored as `int8` + `float` scale per block.
    - Norms: Layer norms and output norms are kept in FP32 for stability.
  - Pipeline: Uses a lookahead pipeline (async quantize + sync write) to maximize export speed.
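As a rough illustration of these header fields, the sketch below validates the magic and version of an exported file; the struct layout beyond the two documented fields is an assumption, not LEAP's actual on-disk format.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical header check, based only on the documented magic/version fields.
// Any further layout (config fields, weight offsets) is an assumption.
struct LeapHeader {
    uint32_t magic;    // expected: 0x616B3432 ("ak42")
    int32_t  version;  // 1 = Float32, 2 = Int8
};

bool check_header(std::FILE* f) {
    LeapHeader h{};
    if (std::fread(&h, sizeof(h), 1, f) != 1) return false;
    if (h.magic != 0x616B3432u) return false;   // not a LEAP model
    return h.version == 1 || h.version == 2;    // FP32 or Int8 export
}
```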
LEAP uses a custom BPE tokenizer implementation designed for zero-dependency inference.
- Wraps the `tiktoken` C++ library to load BPE models (like Llama 3's `tokenizer.model`).
- Special Tokens: Explicitly handles Llama 3 special tokens (`<|begin_of_text|>`, `<|eot_id|>`, etc.) and reserves slots for fine-tuning tokens.
- Output Format: A simple binary format containing the vocabulary, scores, and token strings.
- Pure C++: No runtime dependency on Python or HuggingFace libraries.
- Algorithm: Implements BPE merge using a Priority Queue (O(N log N)) to greedily merge adjacent token pairs based on score/rank.
- Byte Fallback: Supports raw byte tokens (`<0xHH>`) for robust handling of arbitrary binary data / UTF-8 fragments.
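To make the merge rule concrete, here is a simplified scan-based sketch of greedy BPE merging. LEAP's tokenizer uses a priority queue to reach O(N log N); the `ranks` map and the function signature below are illustrative assumptions.

```cpp
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Repeatedly merge the adjacent pair whose concatenation has the lowest BPE
// rank (lower = merged earlier), until no mergeable pair remains.
std::vector<std::string> bpe_merge(std::vector<std::string> tokens,
                                   const std::unordered_map<std::string, int>& ranks) {
    while (tokens.size() > 1) {
        int best_rank = std::numeric_limits<int>::max();
        size_t best_i = 0;
        // Find the best-ranked adjacent pair (linear scan; LEAP uses a priority queue).
        for (size_t i = 0; i + 1 < tokens.size(); ++i) {
            auto it = ranks.find(tokens[i] + tokens[i + 1]);
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i = i;
            }
        }
        if (best_rank == std::numeric_limits<int>::max()) break;  // no merge left
        tokens[best_i] += tokens[best_i + 1];
        tokens.erase(tokens.begin() + best_i + 1);
    }
    return tokens;
}
```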
The core engine implements a standard Llama-2/3 architecture: RMSNorm, SwiGLU FeedForward, and Rotary Positional Embeddings (RoPE). It is built for maximum efficiency on CPU, utilizing two distinct transformer implementations.
The FloatTransformer is the baseline implementation offering maximum precision. It is designed around zero-copy memory mapping and rigorous loop optimization.
- Zero-Copy Loading: The model weights are accessed directly via `mmap`. The file is mapped into the process's virtual address space, and pointers (`FloatTransformerWeights`) are set to specific offsets. This eliminates parsing overhead and allows the OS to manage page caching.
- State Management: To prevent dynamic memory allocation during the inference loop, a `FloatRunState` structure pre-allocates all necessary buffers:
  - `x`, `xb`: Activation buffers (size `dim`).
  - `q`, `k`, `v`: Query/Key/Value vectors.
  - `key_cache`, `value_cache`: KV cache for past tokens (size `n_layers * seq_len * kv_dim`).
  - `att`, `logits`: Attention scores and output probabilities.
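A minimal sketch of the zero-copy loading idea described above, assuming a simple header followed by packed FP32 weights; LEAP's real offsets and struct names may differ.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map the model file read-only and return a pointer to the weight region.
// The OS pages weights in on demand; nothing is parsed or copied.
float* map_weights(const char* path, size_t header_bytes, size_t& out_floats) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st{};
    fstat(fd, &st);

    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return nullptr;

    out_floats = (st.st_size - header_bytes) / sizeof(float);
    // Weight pointers are simply offsets into the mapping.
    return reinterpret_cast<float*>(static_cast<char*>(base) + header_bytes);
}
```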
For every token, the engine performs the following operations per layer:
- RMSNorm: Normalizes the input vector `x`.
  - Optimization: Uses SIMD (AVX2/NEON) to compute the sum of squares. A Newton-Raphson iteration is applied to refine the reciprocal square root estimate (`vrsqrteq_f32` / `_mm_rsqrt_ss`) for high precision without the cost of a full `sqrt` call.
- QKV Projection: FP32 matrix multiplication to generate Query, Key, and Value vectors.
- RoPE (Rotary Positional Embeddings):
  - Rotates `Q` and `K` vectors to encode position information.
  - Optimization: Pairs of elements `(x, y)` are transformed into `(x*cos - y*sin, x*sin + y*cos)`. This is fully vectorized using SIMD shuffles (`vrev64q_f32` on NEON, `_mm256_permute_ps` on AVX2) to process multiple heads simultaneously.
- Multi-Head Attention (MHA/GQA):
  - Score: `Attn = Q @ K.T` (dot product).
  - Softmax: Exponentiates and normalizes scores.
  - Weighted Sum: `Output = Attn @ V`.
- Feed Forward (SwiGLU):
  - Computes `Gate = w1(x)` and `Up = w3(x)`.
  - Applies SiLU activation: `val = val * (1 / (1 + exp(-val)))`.
  - Computes `Down = w2(Gate * Up)`.
  - Residual connection: `x += Down`.
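For reference, scalar versions of two of these steps are sketched below; the actual kernels are SIMD-vectorized (AVX2/NEON), and the epsilon value and signatures here are assumptions.

```cpp
#include <cmath>
#include <cstddef>

// RMSNorm: out[i] = x[i] * weight[i] / sqrt(mean(x^2) + eps)
void rmsnorm(float* out, const float* x, const float* weight, size_t dim) {
    float ss = 0.0f;
    for (size_t i = 0; i < dim; ++i) ss += x[i] * x[i];  // sum of squares (SIMD in LEAP)
    // eps value is an assumption; LEAP uses a rsqrt estimate + Newton-Raphson step here
    float inv = 1.0f / std::sqrt(ss / dim + 1e-5f);
    for (size_t i = 0; i < dim; ++i) out[i] = x[i] * inv * weight[i];
}

// SwiGLU combine: silu(gate) * up, computed before the w2 (down) projection.
void swiglu(float* gate, const float* up, size_t hidden_dim) {
    for (size_t i = 0; i < hidden_dim; ++i) {
        float v = gate[i];
        v = v * (1.0f / (1.0f + std::exp(-v)));  // SiLU activation
        gate[i] = v * up[i];                     // elementwise gate * up
    }
}
```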
The QuantizedTransformer reduces memory usage by 4x and increases compute throughput by utilizing integer arithmetic. It implements a Weight-Int8, Activation-Int8 (W8A8) scheme with dynamic activation quantization.
- Block-wise Quantization: To maintain accuracy, weights are not quantized globally but in small groups (default `group_size=64`).
- `QuantizedTensor` Structure:
  - `int8_t* q`: Pointer to the raw compressed 8-bit integer weights.
  - `float* s`: Pointer to the scaling factors (one float per `group_size` weights).
- Dynamic Activation Quantization:
  - Before every matrix multiplication, the input activation vector `x` (FP32) is dynamically quantized to Int8.
  - The range of `x` is measured, and a scale factor is computed to map floats to `[-127, 127]`.
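A scalar sketch of the dynamic activation quantization step, assuming activations are quantized per `group_size` block like the weights (the exact grouping in LEAP is not spelled out here); names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// For each block: find the max magnitude, derive a scale mapping into
// [-127, 127], then round each value to int8 and record the block's scale.
void quantize_activations(const float* x, int8_t* q, float* scales,
                          size_t n, size_t group_size) {
    for (size_t g = 0; g < n / group_size; ++g) {
        const float* xg = x + g * group_size;
        float max_abs = 0.0f;
        for (size_t i = 0; i < group_size; ++i)
            max_abs = std::fmax(max_abs, std::fabs(xg[i]));

        float scale = max_abs / 127.0f;   // float value represented by one int8 step
        scales[g] = scale;
        float inv = (scale > 0.0f) ? 1.0f / scale : 0.0f;
        for (size_t i = 0; i < group_size; ++i)
            q[g * group_size + i] = static_cast<int8_t>(std::lround(xg[i] * inv));
    }
}
```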
The critical performance driver is the integer dot product.
- Operation: The dot product of a quantized weight vector `w` and activation `x` is computed per block as
  $\text{Result} = s_w \cdot s_x \cdot \sum_i (w_i \cdot x_i)$,
  where $w_i, x_i$ are `int8` values and $s_w, s_x$ are the `float` scales of the block (see the scalar sketch below).
- SIMD Kernels:
  - ARM NEON: Uses the `vdotq_s32` instruction (dot product), which performs 4-way int8 multiplication and accumulation into a 32-bit integer in a single instruction.
  - AVX2: Uses `_mm256_madd_epi16` (multiply-add). It widens int8 values to int16, multiplies them, and adds adjacent pairs, reducing register pressure.
- Dequantization: The accumulated 32-bit integer result is converted back to FP32 and multiplied by the combined scale factor (`weight_scale * activation_scale`) to recover the true value.
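The scalar reference below shows the group-wise integer dot product and dequantization described above; the real kernels replace the inner loop with `vdotq_s32` / `_mm256_madd_epi16`, and the signature is illustrative.

```cpp
#include <cstddef>
#include <cstdint>

// W8A8 dot product: accumulate int8 products in int32 per group, then apply
// the combined weight*activation scale to recover the FP32 partial sum.
float q8_dot(const int8_t* w, const float* w_scales,
             const int8_t* x, const float* x_scales,
             size_t n, size_t group_size) {
    float result = 0.0f;
    for (size_t g = 0; g < n / group_size; ++g) {
        int32_t acc = 0;  // integer accumulator (one per block)
        for (size_t i = 0; i < group_size; ++i) {
            size_t idx = g * group_size + i;
            acc += static_cast<int32_t>(w[idx]) * static_cast<int32_t>(x[idx]);
        }
        // Dequantize the block's partial sum with the combined scale.
        result += static_cast<float>(acc) * w_scales[g] * x_scales[g];
    }
    return result;
}
```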
- Memory Bandwidth: Fetching 1 byte per weight instead of 4 relieves the bottleneck on memory-bound CPUs.
- Cache Efficiency: 4x more parameters fit into the L1/L2/L3 caches.
- Compute: Integer MAC (Multiply-Accumulate) instructions often have higher throughput and lower latency than their FP32 counterparts.
LEAP supports distributed inference via a Ring Topology. The model is split layer-wise across multiple nodes.
- Flow: `Prev Node -> [Receive] -> Compute Layers -> [Send] -> Next Node`
- Protocol: A lightweight binary protocol sending a `PacketHeader` followed by the raw activation tensor (embedding size × float32).
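A hypothetical layout for such a header is sketched below; the field order, widths, and any additional fields in LEAP's real `PacketHeader` are assumptions based only on the fields named in this document.

```cpp
#include <cstdint>

// Hypothetical wire header preceding each activation tensor.
#pragma pack(push, 1)
struct PacketHeader {
    uint32_t magic;       // LEAP_MAGIC, identifies LEAP traffic
    uint32_t seq_id;      // which tensor/frame this transfer belongs to
    uint32_t chunk_id;    // chunk index within the frame (used on the UDP path)
    uint32_t total_size;  // total payload bytes of the full tensor
};
#pragma pack(pop)
```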
- Reliability: Guaranteed delivery via standard TCP/IP.
- Optimizations:
  - `TCP_NODELAY`: Nagle's algorithm disabled for lowest latency.
  - Large Buffers: 8 MB socket buffers (`SO_RCVBUF` / `SO_SNDBUF`).
  - Multipart Send: Uses `writev` to send header + data in a single syscall, avoiding an extra copy (see the sketch below).
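A sketch of these TCP-path settings, assuming a connected socket and an opaque header blob; the option values mirror the list above, but the function itself is illustrative rather than LEAP's actual transport code.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstddef>

// Configure the socket and push header + tensor with one writev() call, so no
// intermediate buffer is needed to join them. Error handling omitted.
void send_activation(int sock, const void* header, size_t header_len,
                     const float* tensor, size_t tensor_bytes) {
    int one = 1;
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));  // disable Nagle batching

    int buf = 8 * 1024 * 1024;                                      // 8 MB socket buffers
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf));
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf));

    iovec iov[2];
    iov[0] = { const_cast<void*>(header), header_len };             // header first
    iov[1] = { const_cast<float*>(tensor), tensor_bytes };          // raw activation tensor
    writev(sock, iov, 2);                                           // single syscall
}
```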
- Type: Best-effort datagram transport.
- Chunking: Splits large tensors into `LEAP_CHUNK_SIZE` (1400-byte) packets to avoid IP fragmentation.
- Reassembly: Receiver tracks `seq_id` and `chunk_id` to reassemble frames.
- Note: Currently lacks retransmission logic; assumes a high-quality LAN environment.
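A minimal sketch of the chunked send path under these assumptions (1400-byte payload cap, a small per-chunk header carrying `seq_id`/`chunk_id`); the header layout and magic value are placeholders, not LEAP's real constants.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <algorithm>
#include <cstdint>
#include <cstring>

constexpr size_t LEAP_CHUNK_SIZE = 1400;  // keep each datagram under the Ethernet MTU

struct ChunkHeader {                       // mirrors the hypothetical PacketHeader above
    uint32_t magic;                        // placeholder for LEAP_MAGIC
    uint32_t seq_id;                       // frame (tensor) identifier
    uint32_t chunk_id;                     // chunk index within the frame
    uint32_t total_size;                   // total bytes of the full tensor
};

void send_chunks(int sock, const sockaddr_in& dest, uint32_t seq_id,
                 const uint8_t* data, size_t total) {
    uint8_t packet[sizeof(ChunkHeader) + LEAP_CHUNK_SIZE];
    uint32_t chunk_id = 0;
    for (size_t off = 0; off < total; off += LEAP_CHUNK_SIZE, ++chunk_id) {
        size_t len = std::min(LEAP_CHUNK_SIZE, total - off);
        ChunkHeader hdr{0x4C454150u /* placeholder magic */, seq_id, chunk_id,
                        static_cast<uint32_t>(total)};
        std::memcpy(packet, &hdr, sizeof(hdr));
        std::memcpy(packet + sizeof(hdr), data + off, len);
        sendto(sock, packet, sizeof(hdr) + len, 0,
               reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
    }
}
```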
The LEAP Kernel Module (leap_kmod) is the high-performance heart of the distributed pipeline. It implements a Zero-Copy, Kernel-Bypass networking stack specifically tailored for tensor transmission. By intercepting packets before the standard Linux network stack processes them, it drastically reduces latency and CPU context switches.
The module exposes a character device (`/dev/leap_tensor`) that userspace maps into memory.
- Split Ring Buffer (16MB):
- Lower 8MB (RX Banks): Divided into 64 "Banks" (128KB each). This allows up to 64 concurrent packet streams (or out-of-order sequences) to be reassembled in parallel without locking the entire buffer.
- Upper 8MB (TX Buffer): A contiguous staging area for outgoing tensors.
- Synchronization:
  - Bitmaps & Spinlocks: Each RX bank is protected by a fine-grained spinlock (`rx_locks`) and uses a bitmap to track received chunks. This keeps lock contention minimal even under high packet loads.
  - Wait Queues: Userspace threads sleep on a wait queue (`leap_wait_queue`) and are woken only when a full bank is complete, avoiding busy-waiting.
- Netfilter Interception: The module registers a Netfilter hook at `NF_INET_PRE_ROUTING` with highest priority (`NF_IP_PRI_FIRST`).
- Packet Inspection: Every incoming UDP packet is inspected. If it matches the configured `listening_port` and contains the `LEAP_MAGIC` header:
  - The kernel extracts `seq_id` and `chunk_id`.
  - It calculates the exact offset in the memory-mapped RX buffer: `Bank_Idx * Bank_Size + Chunk_Id * Chunk_Size`.
- Direct DMA Copy: The payload is copied directly from the socket buffer (`skb`) into the userspace-mapped RAM.
- Verdict: The hook returns `NF_STOLEN`, telling the OS to stop processing the packet immediately. The standard network stack never sees it.
- Completion Notification: Once all chunks for a bank are received, the module sets a bit in the `data_ready_map` and wakes up the userspace process.
- Preparation: Userspace writes the tensor data + header into the mapped TX Buffer (Upper 8MB).
- Trigger: Userspace calls `ioctl(LEAP_IOCTL_SEND)`.
- Kernel Transmission:
  - The kernel constructs UDP/IP headers pointing to the data already in RAM.
  - It uses `kernel_sendmsg` with a `kvec` pointing to the userspace buffer, allowing the NIC to DMA-read directly from the application's memory view (zero-copy).
Control flow is managed via ioctl:
- `LEAP_IOCTL_WAIT_DATA`: Sleep until a bank is ready. Returns the `bank_idx`.
- `LEAP_IOCTL_SEND`: Trigger transmission of data in the TX buffer.
- `LEAP_IOCTL_SET_DEST`: Configure the destination IP for the ring topology.
- `LEAP_IOCTL_SET_PORT`: Bind to a specific local UDP port.
- `LEAP_IOCTL_GET_BANK_SRC`: Retrieve the source IP/port of the sender (for auto-discovery).
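A userspace sketch of how this interface could be driven, assuming the real ioctl request codes come from the module's header (the placeholder constant below is not the real value, and whether `bank_idx` is returned via an out-parameter or the return value is also an assumption).

```cpp
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

// Placeholder: the real request code is defined by the kernel module's header.
constexpr unsigned long LEAP_IOCTL_WAIT_DATA = 0;

constexpr size_t RING_SIZE = 16 * 1024 * 1024;  // 8 MB RX banks + 8 MB TX buffer
constexpr size_t BANK_SIZE = 128 * 1024;        // one of the 64 RX banks

void receive_loop() {
    int fd = open("/dev/leap_tensor", O_RDWR);
    auto* ring = static_cast<uint8_t*>(
        mmap(nullptr, RING_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    for (;;) {
        int bank_idx = 0;
        // Sleeps on the module's wait queue; wakes when a bank is fully reassembled.
        if (ioctl(fd, LEAP_IOCTL_WAIT_DATA, &bank_idx) < 0) break;

        const uint8_t* tensor = ring + bank_idx * BANK_SIZE;  // zero-copy view into the RX bank
        // ... feed `tensor` into the next layer's computation ...
        (void)tensor;
    }

    munmap(ring, RING_SIZE);
    close(fd);
}
```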
Usage: Requires `insmod src/kernel/leap_transport.ko`. This mode provides the lowest possible latency for Ethernet-based clusters.
- CMake 3.25+
- C++20 Compiler (GCC 10+, Clang 12+, MSVC 19.30+)
- OpenMP
- LibTorch (required for the `export` tool)
Clone the repository recursively to fetch all dependencies (e.g., nlohmann_json, safetensors).
```bash
git clone --recursive https://github.com/Harikeshav-R/LEAP.git
cd LEAP
```
The Exporter tool relies on LibTorch to read tensors. Important: You only need the CPU version.
Option A: Direct Download
Download the cxx11 ABI CPU version from the PyTorch website. Extract it to `third-party/libtorch`.
```bash
wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.4.0%2Bcpu.zip
unzip libtorch-*.zip -d third-party/
```
Option B: Via Python (Recommended for systems that do not have a pre-built LibTorch binary available)
If a pre-built LibTorch binary isn't available for your architecture (e.g., Raspberry Pi, Jetson Orin), install the CPU version of PyTorch via pip and point CMake to it.
```bash
# Create a virtual env (optional but recommended)
python3 -m venv venv
source venv/bin/activate

# Install CPU-only torch to save space and avoid CUDA dependency issues
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
If you used Option A (Direct Download):
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=$(pwd)/third-party/libtorch
```
If you used Option B (Python):
```bash
# Get the torch path
TORCH_PATH=$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=$TORCH_PATH
```
Then build all targets:
```bash
cmake --build build --target all --config Release -- -j$(nproc)
```
To enable the zero-copy kernel transport:
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_KERNEL_MODULE=ON [CMAKE_PREFIX_PATH args...]
cmake --build build --config Release -- -j$(nproc)
```
Note: You will need Linux kernel headers installed (e.g., `sudo apt install linux-headers-$(uname -r)`).
Converts PyTorch/Safetensors checkpoints into the LEAP binary format.
Usage:
```bash
./export [options] <output_filepath>
```
Positional Arguments:
- `output_filepath`: The destination path for the `.bin` model file.
Options:
- `--meta-llama <path>`: (Required) Path to the directory containing the Llama model (`params.json` and `.safetensors` files).
- `--version <1|2>`: `1` = Float32 export (default), `2` = Int8 quantized export.
Example:
```bash
# Export Llama-3-8B to Int8
./export llama3-8b-int8.bin --meta-llama /models/Meta-Llama-3-8B --version 2
```
Converts a HuggingFace/Tiktoken tokenizer model (usually `tokenizer.model`) into LEAP's binary format.
Usage:
```bash
./tokenizer [options] <model_path>
```
Positional Arguments:
- `model_path`: Path to the tokenizer model file.
Options:
- `-o, --output <path>`: Optional output file path for the binary export.
If no output path is specified, the result is saved as `tokenizer.bin` in the same directory as the input.
The main runtime for generating text. Supports single-node and distributed ring inference.
Usage:
```bash
./inference [options] <model_path>
```
Positional Arguments:
- `model_path`: (Required) Path to the exported LEAP model (`.bin`).
Options:
- `-t, --tokenizer <path>`: Path to the tokenizer file (default: `tokenizer.bin`).
- `-p, --prompt <text>`: Initial prompt to start generation.
- `-c, --chat`: Enable interactive chat mode (Llama 3 instruction template).
- `-s, --system <text>`: System prompt (chat mode only).
- `-n, --n-predict <int>`: Maximum number of tokens to generate (default: 4096).
- `--temp <float>`: Temperature for sampling (default: 1.0). Higher = more creative, lower = more deterministic.
- `--top-p <float>`: Top-P (nucleus) sampling threshold (default: 0.9).
- `--seed <int>`: Random seed for reproducibility (default: 0 = random).
- `--role <single|master|worker>`: Node role (default: `single`).
  - `single`: Runs the entire model locally.
  - `master`: Runs the first part of the model and manages the prompt/generation loop.
  - `worker`: Runs the subsequent parts of the model in a loop.
- `--split <int>`:
  - Master: Layer index where the model is split (the master runs layers 0 to `split-1`).
  - Worker: Layer index to start processing from.
- `--end <int>`: (Worker only) Layer index to stop processing at (exclusive). 0 = process until the last layer.
- `--transport <tcp|udp|kernel>`: Transport protocol (default: `tcp`).
- `--host <ip>`: Local interface to bind to (default: `0.0.0.0`).
- `--port <int>`: Local port to listen on (default: `9999`).
- `--next-host <ip>`: IP address of the next node in the ring (required for distributed).
- `--next-port <int>`: Port of the next node in the ring (default: `9999`).
Master Node (192.168.1.1):
```bash
./inference --role master --split 16 --next-host 192.168.1.2 --prompt "Once upon a time" model.bin
```
Worker Node (192.168.1.2):
```bash
./inference --role worker --split 16 --next-host 192.168.1.1 model.bin
```
LEAP is designed for Llama-architecture models.
- Supported: Llama 2, Llama 3, Llama 3.1, CodeLlama, TinyLlama.
- Key Features: Grouped Query Attention (GQA), RoPE, RMSNorm, SwiGLU.
For maximum inference speed, ensure OpenMP threads are pinned to physical cores.
```bash
# Example for an 8-core CPU
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
./inference ...
```
Note: On hybrid CPUs (Intel 12th Gen and later), bind threads to P-cores for best results.
Error: Linker errors complaining about `std::string` symbols.
Fix: Ensure you downloaded the cxx11 ABI version of LibTorch. If you used `pip install torch`, this is usually handled automatically, but ensure your system GCC matches the one PyTorch was built with.
Error: `insmod: ERROR: could not insert module ...: Operation not permitted`
Fix: Secure Boot may block unsigned modules. You must either sign the module or disable Secure Boot.
Error: `Exec format error`
Fix: The module was compiled for a different kernel version. Re-run `make` inside `src/kernel/` or rebuild via CMake.
Error: Distributed nodes connect but no activations arrive (the pipeline hangs).
Fix: Check your firewall (`ufw status`, `iptables`). LEAP uses UDP ports (default 9999). Ensure ICMP isn't blocked, since that breaks path MTU discovery if you aren't using the kernel module (which handles fragmentation manually via chunks).