Replies: 1 comment
-
|
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment

Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Quadtrix.cpp: A Local Language Model Implementation in C++ and CUDA
Quadtrix.cpp is a local language model system implemented in C++ and CUDA, with a PyTorch reference implementation included in the same repository. The repository contains two execution paths: a native C++ and CUDA training and inference engine in
main.cu, and a PyTorch reference script intrain_quadtrix.py. Both paths implement the same model architecture. A checkpoint trained in PyTorch can be exported and loaded into the native path for inference.
---Model Architecture
The model is a decoder-only transformer with the following layer ordering:
Each transformer block uses pre-normalization ordering:
In pre-normalization, the layer norm is applied before the sublayer rather than after. This preserves gradient flow through the residual stream and improves training stability at increased depth.
Training Runs
Run 1 — CUDA / bf16 · 10.84M Parameters
Throughput after the initial warmup period reaches a peak of 19.6k tokens per second and remains at that level for the duration of the run.
The gradient norm reaches a maximum of 2.25 at step 1,200 and stabilizes in the range of 1.75–2.0 for the remainder of training. Validation loss does not increase at any recorded evaluation interval.
Run 2 — Native CPU · 6.68M Parameters
The best validation loss is 2.9971, recorded at step 6,800. Training and validation loss curves remain close throughout the run. The generalization gap, defined as validation loss minus training loss, oscillates near zero for most of training, with a maximum of +0.15 at step 3,900. The final generalization gap is +0.0567.
Run 3 — PyTorch CPU · 6.68M Parameters
The gradient norm peaks at 2.2433 at step 3,395 and stabilizes at a mean of 1.337. Mean throughput is 791 tokens per second. Mean step time is 656.7 ms, with 113 recorded spikes above 937 ms attributable to garbage collection or OS scheduling events.
Run Comparison
The CUDA run produces the lowest validation loss at the highest throughput. The native CPU run produces a lower validation loss than the PyTorch CPU run at the same model size and device class. The gradient norm across all three runs peaks in the range of 2.0–2.25 and stabilizes, which is consistent with equivalent underlying implementations.
Tensor Runtime
The implementation uses a custom tensor layer with no external library dependencies. The base structure is:
Operations implemented on this structure include elementwise arithmetic, softmax, layer normalization, tiled matrix multiplication, batched matrix multiplication, transposition, and concatenation. CPU paths are parallelized with OpenMP and accelerated with AVX and SSE intrinsics.
Backpropagation
The C++ implementation includes a full analytical backward pass. Gradients are derived and implemented explicitly for linear layers, layer normalization, ReLU, dropout, softmax, batched matrix multiplication, attention, feed-forward blocks, embeddings, and cross-entropy loss.
The attention backward pass reconstructs the gradient chain from the output projection through softmax, through the causal mask, to the Q, K, and V projections. All activations required by the backward pass are saved during the forward pass. No activation is recomputed and no computation graph is constructed.
Optimizer
The optimizer is AdamW, implemented from scratch:
Weight decay is applied directly to the parameter values rather than through the gradient. This is the AdamW formulation and differs from L2 regularization, in which weight decay interacts with the adaptive gradient scaling.
Generation and Chat Interface
The following command-line options are available after training:
Token generation is autoregressive. The context window is capped at
block_size. Tokens older than the context window are discarded.Planned Development
Beta Was this translation helpful? Give feedback.
All reactions