Quadtrix.cpp: a local LLM stack you can actually read #67

Eamon2009 · 2026-06-01T18:45:03Z

Eamon2009
Jun 1, 2026
Maintainer

Quadtrix.cpp: A Local Language Model Implementation in C++ and CUDA

Quadtrix.cpp is a local language model system implemented in C++ and CUDA, with a PyTorch reference implementation included in the same repository. The repository contains two execution paths: a native C++ and CUDA training and inference engine in main.cu, and a PyTorch reference script in train_quadtrix.py. Both paths implement the same model architecture. A checkpoint trained in PyTorch can be exported and loaded into the native path for inference.

---

Model Architecture

The model is a decoder-only transformer with the following layer ordering:

tokens + positions
        ↓
     embeddings
        ↓
  N × transformer block
        ↓
    final LayerNorm
        ↓
      lm head
        ↓
      logits

Each transformer block uses pre-normalization ordering:

x = x + self_attention(layer_norm(x));
x = x + feed_forward(layer_norm(x));

In pre-normalization, the layer norm is applied to the input of each sublayer rather than to its output. The residual path therefore stays an unnormalized identity connection, so gradients flow directly through the residual stream, which improves training stability at increased depth.
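The ordering above can be sketched in a few lines of C++. This is a minimal illustration, not code from main.cu: the names `Vec`, `Sublayer`, `layer_norm`, `pre_norm_residual`, and `transformer_block` are hypothetical, the layer norm omits the learned gain and bias, and the sketch operates on a single vector rather than a batch of sequences.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;                    // hypothetical alias
using Sublayer = std::function<Vec(const Vec&)>;   // attention or feed-forward

// Normalize one vector to zero mean and unit variance.
// Assumption: no learned gain/bias, to keep the sketch short.
Vec layer_norm(const Vec& x, float eps = 1e-5f) {
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= static_cast<float>(x.size());
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<float>(x.size());
    const float inv_std = 1.0f / std::sqrt(var + eps);
    Vec y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = (x[i] - mean) * inv_std;
    return y;
}

// Pre-norm residual step: x + f(layer_norm(x)).
// The input x is added back unchanged, so the residual path is an identity.
Vec pre_norm_residual(const Vec& x, const Sublayer& f) {
    const Vec h = f(layer_norm(x));
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + h[i];
    return out;
}

// One block: attention sublayer first, then the feed-forward sublayer.
Vec transformer_block(const Vec& x, const Sublayer& self_attention,
                      const Sublayer& feed_forward) {
    return pre_norm_residual(pre_norm_residual(x, self_attention), feed_forward);
}
```

If both sublayers return zeros, `transformer_block` returns its input unchanged, which is the identity-path property that pre-normalization relies on.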

Training Runs

Run 1 — CUDA / bf16 · 10.84M Parameters

run_20260508_110726 · CUDA · bf16 · 10.84M params · 14.1M train tokens / 1.6M val · 8,000 steps · 82m 42s · best val 2.3918

Training loss decreases from approximately 11 at step 0 to a final value of 2.2825. The best recorded validation loss is 2.3918. Loss reduction is most rapid in the first 500 steps and continues at a reduced rate through step 8,000 with no recorded instability events.

Throughput after the initial warmup period reaches a peak of 19.6k tokens per second and remains at that level for the duration of the run.

The gradient norm reaches a maximum of 2.25 at step 1,200 and stabilizes in the range of 1.75–2.0 for the remainder of training. Validation loss does not increase at any recorded evaluation interval.


Params	10.84M
Device	CUDA / bf16
Train tokens	14.1M
Steps	8,000
Wall time	82m 42s
Peak throughput	19.6k tok/s
Best val loss	2.3918
Final train loss	2.2825

Run 2 — Native CPU · 6.68M Parameters

run_20260430_192930 · CPU · 6.68M params · 7.1M train tokens · 7,000 steps · 86m 26s · best val 2.9971

The best validation loss is 2.9971, recorded at step 6,800. Training and validation loss curves remain close throughout the run. The generalization gap, defined as validation loss minus training loss, oscillates near zero for most of training, with a maximum of +0.15 at step 3,900. The final generalization gap is +0.0567.


Params	6.68M
Device	CPU
Train tokens	7.1M
Steps	7,000
Wall time	86m 26s
Best val loss	2.9971
Final gen. gap	+0.0567

Run 3 — PyTorch CPU · 6.68M Parameters

run_20260530_165216 · PyTorch · CPU · 6.68M params · batch=16 · block=32 · lr=1e-3 · 6,000 steps · 77m 16s · best val 4.1319

This run uses the PyTorch reference path with the same parameter count and device class as Run 2. The best validation loss is 4.1319. Training loss reaches approximately 3.0 at step 6,000. The generalization gap decreases initially, then widens to a recorded peak of 8.965 before settling at 0.048 at step 5,500. While the gap is widening, the model is fitting the training distribution faster than the validation distribution.

The gradient norm peaks at 2.2433 at step 3,395 and stabilizes at a mean of 1.337. Mean throughput is 791 tokens per second. Mean step time is 656.7 ms, with 113 recorded spikes above 937 ms, which are likely attributable to garbage collection or OS scheduling events.
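As a rough cross-check of these figures, the throughput implied by the mean step time can be computed from the batch and block sizes. The sketch below assumes each step processes exactly batch × block tokens with no gradient accumulation; the function names are hypothetical and not part of the repository.

```cpp
// Tokens processed per optimizer step.
// Assumption: one batch of `batch` sequences of length `block`, no accumulation.
constexpr long tokens_per_step(long batch, long block) { return batch * block; }

// Throughput implied by a mean step time given in milliseconds.
constexpr double tokens_per_second(long batch, long block, double step_ms) {
    return static_cast<double>(tokens_per_step(batch, block)) / (step_ms / 1000.0);
}
```

With batch=16, block=32, and a 656.7 ms step, this gives 512 tokens per step and roughly 780 tokens per second. That is slightly below the reported mean of 791, which is what one would expect if the reported figure averages per-step rates: the mean of 1/t is never smaller than 1 divided by the mean of t.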


Params	6.68M
Device	PyTorch CPU
Batch / block	16 / 32
LR	1e-3
Steps	6,000
Wall time	77m 16s
Mean throughput	791 tok/s
Mean step time	656.7 ms
Best val loss	4.1319
Peak grad norm	2.2433

Run Comparison

Run	Params	Device	Best val loss	Wall time	Throughput (tok/s)
CUDA / bf16	10.84M	CUDA	2.3918	82m 42s	19,600 (peak)
Native CPU	6.68M	CPU	2.9971	86m 26s	not reported
PyTorch CPU	6.68M	PyTorch CPU	4.1319	77m 16s	791 (mean)

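The CUDA run produces the lowest validation loss at the highest throughput, although it also uses a larger model and about twice the training tokens of Run 2. The native CPU run produces a lower validation loss than the PyTorch CPU run at the same model size and device class; the two runs differ in step count (7,000 vs. 6,000), so this is not a controlled comparison. The gradient norm peaks at about 2.25 in the two runs that report it (Runs 1 and 3) and then stabilizes. This is consistent with equivalent underlying implementations but does not establish it on its own.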

Tensor Runtime

The implementation uses a custom tensor layer with no external library dependencies. The base structure is:

struct Tensor {
    std::vector<int> shape;
    std::vector<float> data;
};

Operations implemented on this structure include elementwise arithmetic, softmax, layer normalization, tiled matrix multiplication, batched matrix multiplication, transposition, and concatenation. CPU paths are parallelized with OpenMP and accelerated with AVX and SSE intrinsics.
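To illustrate what a tiled matrix multiplication over this structure might look like, here is a minimal single-threaded sketch. It is not the code from main.cu: the function name `matmul_tiled`, the row-major layout, and the default tile size are assumptions, and the OpenMP and SIMD paths are left out. The `Tensor` struct is repeated so the block compiles on its own.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Tensor {
    std::vector<int> shape;
    std::vector<float> data;
};

// C[M,N] = A[M,K] * B[K,N], assuming row-major storage.
// The loops walk tile x tile blocks so each inner loop nest touches a small,
// cache-friendly region of A, B, and C.
Tensor matmul_tiled(const Tensor& a, const Tensor& b, int tile = 32) {
    const int M = a.shape[0], K = a.shape[1], N = b.shape[1];
    Tensor c{{M, N}, std::vector<float>(static_cast<std::size_t>(M) * N, 0.0f)};
    for (int i0 = 0; i0 < M; i0 += tile)
        for (int k0 = 0; k0 < K; k0 += tile)
            for (int j0 = 0; j0 < N; j0 += tile)
                for (int i = i0; i < std::min(i0 + tile, M); ++i)
                    for (int k = k0; k < std::min(k0 + tile, K); ++k) {
                        const float aik = a.data[i * K + k];  // reused across j
                        for (int j = j0; j < std::min(j0 + tile, N); ++j)
                            c.data[i * N + j] += aik * b.data[k * N + j];
                    }
    return c;
}
```

The innermost loop runs over contiguous elements of B and C, which is the access pattern a compiler or hand-written AVX code can vectorize most easily.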

Backpropagation

The C++ implementation includes a full analytical backward pass. Gradients are derived and implemented explicitly for linear layers, layer normalization, ReLU, dropout, softmax, batched matrix multiplication, attention, feed-forward blocks, embeddings, and cross-entropy loss.

The attention backward pass reconstructs the gradient chain from the output projection through softmax, through the causal mask, to the Q, K, and V projections. All activations required by the backward pass are saved during the forward pass. No activation is recomputed and no computation graph is constructed.
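The softmax and causal-mask portion of that chain can be sketched for a single query row. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, the mask is applied by skipping keys after the query position, and `row` must be smaller than the number of scores.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward: causal softmax for query position `row`.
// Keys j > row are masked out and keep probability 0.
std::vector<float> causal_softmax_row(const std::vector<float>& scores,
                                      std::size_t row) {
    std::vector<float> p(scores.size(), 0.0f);
    float max_s = scores[0];
    for (std::size_t j = 1; j <= row; ++j) max_s = std::max(max_s, scores[j]);
    float sum = 0.0f;
    for (std::size_t j = 0; j <= row; ++j) {
        p[j] = std::exp(scores[j] - max_s);  // subtract max for stability
        sum += p[j];
    }
    for (std::size_t j = 0; j <= row; ++j) p[j] /= sum;
    return p;
}

// Backward: given the saved probabilities p and the upstream gradient dp,
//   dscores[j] = p[j] * (dp[j] - sum_k p[k] * dp[k]).
// Masked entries have p[j] == 0, so their gradient is zero without an
// explicit mask in the backward pass.
std::vector<float> softmax_row_backward(const std::vector<float>& p,
                                        const std::vector<float>& dp) {
    float dot = 0.0f;
    for (std::size_t k = 0; k < p.size(); ++k) dot += p[k] * dp[k];
    std::vector<float> ds(p.size());
    for (std::size_t j = 0; j < p.size(); ++j) ds[j] = p[j] * (dp[j] - dot);
    return ds;
}
```

The backward function only needs the probabilities saved by the forward pass, which matches the save-everything, recompute-nothing approach described above.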

Optimizer

The optimizer is AdamW, implemented from scratch:

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad²
m_hat = m / (1 - beta1^t)
v_hat = v / (1 - beta2^t)
p = p - lr * m_hat / (sqrt(v_hat) + eps)
p = p * (1 - lr * weight_decay)

Weight decay is applied directly to the parameter values rather than through the gradient. This is the decoupled AdamW formulation. It differs from L2 regularization, in which the decay term is added to the gradient and is therefore rescaled by the adaptive per-parameter step size.
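The update rule above translates almost line for line into C++. The following is a minimal sketch, not the optimizer from main.cu: the struct and function names are hypothetical, and the default hyperparameter values are common choices assumed for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct AdamWState {
    std::vector<float> m, v;  // first and second moment estimates
    int t = 0;                // number of steps taken so far
};

struct AdamWConfig {          // default values are assumptions, not the repo's
    float lr = 1e-3f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f;
    float weight_decay = 0.01f;
};

// One AdamW step over a flat parameter vector, in the order given above:
// Adam update first, then decoupled weight decay on the parameter itself.
void adamw_step(std::vector<float>& p, const std::vector<float>& grad,
                AdamWState& s, const AdamWConfig& c) {
    if (s.m.empty()) {  // lazily allocate moment buffers on the first call
        s.m.assign(p.size(), 0.0f);
        s.v.assign(p.size(), 0.0f);
    }
    ++s.t;
    const float bias1 = 1.0f - static_cast<float>(std::pow(c.beta1, s.t));
    const float bias2 = 1.0f - static_cast<float>(std::pow(c.beta2, s.t));
    for (std::size_t i = 0; i < p.size(); ++i) {
        s.m[i] = c.beta1 * s.m[i] + (1.0f - c.beta1) * grad[i];
        s.v[i] = c.beta2 * s.v[i] + (1.0f - c.beta2) * grad[i] * grad[i];
        const float m_hat = s.m[i] / bias1;  // bias-corrected first moment
        const float v_hat = s.v[i] / bias2;  // bias-corrected second moment
        p[i] -= c.lr * m_hat / (std::sqrt(v_hat) + c.eps);
        p[i] *= 1.0f - c.lr * c.weight_decay;  // decay never touches grad
    }
}
```

With a zero gradient the Adam term vanishes and the parameter still shrinks by a factor of `1 - lr * weight_decay`, which shows that the decay acts independently of the adaptive scaling.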

Generation and Chat Interface

The following command-line options are available after training:

# generate tokens indefinitely (Ctrl-C to stop)
./quadtrix data/input.txt --generate
 
# interactive chat
./quadtrix data/input.txt --chat
 
# control response length
./quadtrix data/input.txt --chat --chat-tokens 300

Token generation is autoregressive. The context window is capped at block_size. Tokens older than the context window are discarded.
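The sliding context window can be sketched as follows. This is an illustration only: `crop_context` and `generate` are hypothetical names, and `next_token` stands in for whatever the engine does to run the model and sample a token from the logits.

```cpp
#include <cstddef>
#include <vector>

// Return the most recent `block_size` tokens, or all tokens if fewer exist.
std::vector<int> crop_context(const std::vector<int>& tokens,
                              std::size_t block_size) {
    if (tokens.size() <= block_size) return tokens;
    return std::vector<int>(
        tokens.end() - static_cast<std::ptrdiff_t>(block_size), tokens.end());
}

// Autoregressive loop: each new token is predicted from the cropped context
// and appended to the full history. `next_token` is a placeholder callback.
template <typename NextTokenFn>
std::vector<int> generate(std::vector<int> tokens, std::size_t block_size,
                          std::size_t n_new, NextTokenFn next_token) {
    for (std::size_t i = 0; i < n_new; ++i) {
        const std::vector<int> context = crop_context(tokens, block_size);
        tokens.push_back(next_token(context));
    }
    return tokens;
}
```

The full token history is kept for output, but the model only ever sees the last `block_size` entries, so anything older has no influence on later predictions.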

Planned Development

Flash attention — the attention computation is the primary bottleneck at longer sequence lengths.
KV cache — the full attention matrix is recomputed at each generation step in the current implementation.
Larger default configurations — a 124M parameter reference configuration is a stated target.
Multi-GPU support — ZeRO stage 1 sharding is partially implemented.
External evaluation benchmarks — cross-run comparison currently relies on validation loss alone.

codeaddict-119 · 2026-06-01T19:54:20Z

codeaddict-119
Jun 1, 2026
Collaborator
