CPU Implementation (C++17): Fully custom from-scratch implementation with zero external dependencies. Features a hand-rolled Tensor class with manual gradient tracking, explicit forward/backward passes through transformer blocks (multi-head self-attention, feedforward MLPs, layer normalization), AdamW optimizer with momentum and weight decay, and cross-entropy loss. All operations—matrix multiplications, softmax, ReLU activations, attention scoring—are raw C++ arithmetic with no framework overhead.
GPU Implementation (PyTorch + CUDA): Architecturally identical transformer (same layer configs, attention heads, embedding dimensions) but delegates tensor operations to PyTorch's CUDA backend for GPU parallelization. The model structure remains unchanged; only the compute substrate shifts from CPU to GPU tensor cores.
Training Pipeline: Both versions follow standard autoregressive training: tokenize input text → forward pass through embedding + N transformer blocks → compute cross-entropy loss on next-token predictions → backpropagation to compute gradients → AdamW weight updates. Repeated over batches until convergence.
GPU Extension Limitation: A native CUDA implementation (custom kernels for matrix ops, attention, etc.) requires access to NVIDIA hardware for development and testing, which is currently unavailable. The PyTorch version therefore serves as a GPU-accelerated alternative without requiring hand-written CUDA kernels.
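For orientation, the native C++ training loop reduces to roughly the sketch below. The class and function names (`DataLoader`, `GPTLanguageModel`, `cross_entropy`, `get_batch`, the `AdamW` helper) are modelled on the headers listed later in this README, but the signatures here are illustrative, not the repository's actual API.

```cpp
// Illustrative training-loop sketch (assumed API; see config/ and include/ for
// the real declarations). Mirrors: get batch -> forward -> loss -> backward -> AdamW.
#include "config/config.h"
#include "dataloader.h"
#include "gpt.h"
#include "backward.h"

int main() {
    DataLoader data("data/input.txt");              // char tokeniser + 90/10 split
    GPTLanguageModel model(data.vocab_size());      // embeddings + N transformer blocks
    AdamW opt(model.parameters(), LEARNING_RATE);   // optimizer state from backward.h

    for (int iter = 0; iter < MAX_ITERS; ++iter) {
        Batch b = data.get_batch("train");          // x: [B, T] tokens, y: shifted targets
        Tensor logits = model.forward(b.x);         // forward pass
        float loss = cross_entropy(logits, b.y);    // next-token prediction loss
        model.backward(logits, b.y);                // hand-derived gradients, layer by layer
        opt.step();                                 // AdamW update (momentum + weight decay)
        opt.zero_grad();
        (void)loss;                                 // printed / logged in the real pipeline
    }
}
```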
| s.no | time | val_loss | params | Description | Date | Contributors |
|---|---|---|---|---|---|---|
| 0 | 39.4 min | 1.3145 | 0.82M | Quadtrix CPU baseline, small data (200K chars), fragmented output | 2026 | @Eamon2009 |
| 1 | 61.3 min | 0.7176 | 10.82M | Quadtrix Colab large-scale run, coherent paragraphs, strong convergence | 2026 | @Eamon2009 |
| 2 | 6.1 min | 0.9250 | 1.99M | Quadtrix T4 optimized run, fast training, stable learning, basic coherence | 2026 | @Eamon2009 |
| 3 | 76.2 min | 1.6371 | ~0.82M | Quadtrix.cpp Extended CPU training (3000 iters) | 2026 | @Eamon2009 |
| Device | Technical Execution Pathway |
|---|---|
| CPU | Utilizes vectorized instructions (AVX/SSE) and multi-threading for sequential or small-batch inference. |
| CUDA | Leverages NVIDIA’s parallel computing platform for high-throughput training and inference on discrete GPUs. |
| iGPU | Targets Integrated GPUs (e.g., Intel Iris, AMD Radeon, or Apple Silicon M-series) via backends like Metal (MPS), DirectML, or oneAPI/SYCL, optimizing for power-efficient local execution. |
Quadtrix.cpp is a transformer learning laboratory. Write your own backprop, debug attention matrices, export to bare-metal C. If you've read the Attention paper and want to implement it rather than just call model.fit(), this is for you.
Philosophy: Frameworks hide the fundamentals. This project reveals them. Every gradient, every checkpoint, every matrix multiply lives in code you can step through with a debugger.
Parallel tracks: Native C++ training path (educational), PyTorch path (faster iteration), DirectML path (Windows iGPU), pure C inference (deployment), web frontend (chat UI).
# Native C++ path - train from scratch
g++ -std=c++17 -O2 -I. -Iinclude -o quadtrix main.cpp
./quadtrix data/input.txt
# PyTorch path - faster experimentation
pip install torch tiktoken numpy
python engine/main.py
# Interactive chat
./quadtrix data/input.txt --chat
python engine/inference.py

Quadtrix is a self-contained C++17 implementation of a GPT-style transformer language model trained at the character level. It implements the full pipeline — tokenisation, forward pass, analytical backpropagation, and autoregressive generation — in a single dependency-free codebase.
The project is a faithful port of the Python/PyTorch training loop written in C++ with hand-derived gradients for every layer: linear projections, multi-head causal self-attention, layer normalisation, feed-forward blocks, softmax, ReLU, dropout, and cross-entropy loss.
Training run v1.0 used a 31M-character children's story corpus, reaching a validation loss of 1.6371 in 76 minutes on CPU.
Quadtrix uses a decoder-only transformer (GPT-2 style) with pre-layer-normalisation residual blocks.
Input tokens [B, T]
│
▼
Token Embedding [vocab_size × n_embd]
+
Position Embedding [block_size × n_embd]
│
▼
┌─────────────────────────────────┐
│ Transformer Block × n_layer │
│ │
│ ┌──────────────────────────┐ │
│ │ LayerNorm (pre-LN) │ │
│ └──────────┬───────────────┘ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Multi-Head Causal Attn │ │
│ │ n_head × head_size │ │
│ │ + causal mask │ │
│ │ + dropout │ │
│ └──────────┬───────────────┘ │
│ ▼ (residual +) │
│ ┌──────────────────────────┐ │
│ │ LayerNorm (pre-LN) │ │
│ └──────────┬───────────────┘ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Feed-Forward MLP │ │
│ │ Linear → ReLU → Linear │ │
│ │ (4× expansion) +dropout │ │
│ └──────────┬───────────────┘ │
│ ▼ (residual +) │
└─────────────────────────────────┘
│
▼
LayerNorm → Linear → Logits [B, T, vocab_size]
│
▼
Cross-Entropy Loss / Softmax + Multinomial Sampling
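In code, each block of the diagram collapses to two residual additions. A minimal sketch follows; the member names `ln1`, `ln2`, `sa`, and `ffwd` are assumed for illustration, and the real definitions live in include/block.h.

```cpp
// Pre-LN residual block: normalise first, then add the sub-layer's output back in.
// x has shape [B, T, n_embd]; sa is multi-head causal attention, ffwd is the MLP.
Tensor Block::forward(const Tensor& x) {
    Tensor h   = x + sa.forward(ln1.forward(x));     // attention sub-layer + residual
    Tensor out = h + ffwd.forward(ln2.forward(h));   // feed-forward sub-layer + residual
    return out;
}
```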
| Parameter | Value | Notes |
|---|---|---|
| batch_size | 4 | Sequences per step |
| block_size | 64 | Context window (tokens) |
| n_embd | 128 | Embedding dimension |
| n_head | 4 | Attention heads |
| n_layer | 4 | Transformer blocks |
| dropout | 0.2 | Applied to attn weights + proj |
| learning_rate | 3e-4 | AdamW, β₁=0.9, β₂=0.999 |
| max_iters | 3000 | |
| eval_interval | 200 | |
| Total params | 0.83 M | (826,985) |
The model was trained on a 31.4M-character corpus of short children's stories (data/input.txt), split 90/10 into train and validation sets.
| Set | Tokens |
|---|---|
| Train | 28,311,139 |
| Validation | 3,145,683 |
| Vocabulary | 105 characters |
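The character-level tokenisation behind these counts is deliberately simple. Below is a self-contained sketch of the idea, for illustration only; the project's actual implementation is in include/dataloader.h.

```cpp
// Build stoi/itos from the unique characters in the corpus, encode the text,
// and split 90/10 into train and validation token streams.
#include <fstream>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    std::ifstream f("data/input.txt");
    std::string text((std::istreambuf_iterator<char>(f)),
                      std::istreambuf_iterator<char>());

    std::set<char> charset(text.begin(), text.end());           // 105 unique characters
    std::vector<char> itos(charset.begin(), charset.end());     // id -> char
    std::map<char, int> stoi;                                    // char -> id
    for (std::size_t i = 0; i < itos.size(); ++i) stoi[itos[i]] = static_cast<int>(i);

    std::vector<int> tokens;
    tokens.reserve(text.size());
    for (char c : text) tokens.push_back(stoi[c]);               // encode corpus

    std::size_t n = static_cast<std::size_t>(0.9 * tokens.size());
    std::vector<int> train(tokens.begin(), tokens.begin() + n);  // ~28.3M tokens
    std::vector<int> val(tokens.begin() + n, tokens.end());      //  ~3.1M tokens
}
```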
Training used full analytical backpropagation — hand-derived gradients through every operator (cross-entropy → lm_head → layernorm → MHA → FFN → embeddings) without any automatic differentiation library.
The gradient computation follows this chain:
dLoss/dLogits → dW_lmhead → d(LayerNorm_f)
→ for each Block (reverse):
d(FFN residual) → d(LN2) → d(fc2) → d(ReLU) → d(fc1)
d(MHA residual) → d(LN1) → d(proj)
→ for each Head:
d(wei@V) → d(softmax) → d(scale) → d(Q@Kᵀ)
→ d(Wk), d(Wq), d(Wv)
→ d(tok_emb), d(pos_emb)
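As a concrete example of the first link in that chain, the gradient of the loss with respect to the logits is softmax(logits) minus the one-hot target, scaled by 1/(B·T). Here is a plain-array sketch of that step; it is illustrative only, as the project expresses the same computation through its Tensor helpers.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// d_logits[r][v] = (softmax(logits[r])[v] - 1{v == target[r]}) / (B*T)
void cross_entropy_backward(const std::vector<float>& logits,  // [BT, V], row-major
                            const std::vector<int>& targets,   // [BT]
                            std::vector<float>& d_logits,      // [BT, V], output
                            int BT, int V) {
    for (int r = 0; r < BT; ++r) {
        const float* row = &logits[r * V];
        float maxv = *std::max_element(row, row + V);           // numerical stability
        float denom = 0.f;
        for (int v = 0; v < V; ++v) denom += std::exp(row[v] - maxv);
        for (int v = 0; v < V; ++v) {
            float p = std::exp(row[v] - maxv) / denom;          // softmax probability
            d_logits[r * V + v] = (p - (v == targets[r] ? 1.f : 0.f)) / BT;
        }
    }
}
```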
Training run: Quadtrix v1.0
------------------------------------------------------------
Quadtrix v1.0 2026
iter train val elapsed eta
──────────────────────────────────────────────────────
0/3000 4.6523 4.6570 15s — [saved]
200/3000 2.4876 2.4478 427s 5976s [saved]
400/3000 2.2965 2.3334 783s 5091s [saved]
600/3000 2.2971 2.2572 1105s 4418s [saved]
800/3000 2.2424 2.2018 1331s 3660s [saved]
1000/3000 2.1570 2.2009 1569s 3138s [saved]
1200/3000 2.0914 2.0577 1791s 2687s [saved]
1400/3000 1.9575 2.0151 2013s 2301s [saved]
1600/3000 1.9409 1.9532 2317s 2028s [saved]
1800/3000 1.8233 1.8250 2673s 1782s [saved]
2000/3000 1.7386 1.7724 2999s 1500s [saved]
2200/3000 1.6850 1.7256 3353s 1219s [saved]
2400/3000 1.7298 1.7403 3697s 924s
2600/3000 1.7204 1.6680 4031s 620s [saved]
2800/3000 1.5717 1.6471 4347s 310s [saved]
3000/3000 1.7055 1.6371 4571s 0s [saved]
──────────────────────────────────────────────────────
done
| time : 4571.1s (76.2 min)
| best val : 1.6371
| weights : best_model.bin
------------------------------------------------------------
The model crosses 2.0 val loss around iteration 1200 and continues to improve steadily to 1.6371 — a drop of 3.02 nats from random initialisation.
Weights are loaded from best_model.bin and generation proceeds autoregressively: at each step the model consumes the last block_size tokens, computes logits over the vocabulary, applies softmax, and samples the next token from the resulting distribution.
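A condensed view of that loop is sketched below. The `next_token_probs` helper is hypothetical, standing in for the model's forward pass plus softmax on the last position; the real generate() lives in include/gpt.h and may differ in detail.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Autoregressive sampling: crop context to BLOCK_SIZE, get next-token
// probabilities, draw one token with a multinomial sample, append, repeat.
std::vector<int> generate(GPTLanguageModel& model, std::vector<int> ctx,
                          int max_new_tokens, std::mt19937& rng) {
    for (int i = 0; i < max_new_tokens; ++i) {
        int start = std::max(0, static_cast<int>(ctx.size()) - BLOCK_SIZE);
        std::vector<int> idx(ctx.begin() + start, ctx.end());    // last block_size tokens

        std::vector<float> probs = model.next_token_probs(idx);  // assumed helper: softmax over vocab
        std::discrete_distribution<int> dist(probs.begin(), probs.end());
        ctx.push_back(dist(rng));                                 // sample next token
    }
    return ctx;
}
```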
The following samples were produced by the trained v1.0 checkpoint.
The model operates at the character level and was trained on children's stories — expect emergent narrative structure, invented words, and short-form prose.
You: hi how are you
Quadtrix:
said. You big his's reat cably hawere cast wen. Anted the on feer
the girl the heri beg that's appy and happecing apy his. Whe mommmy
had alowith of thad very the rand a coove the ses and the
You: what is life
Quadtrix:
st happe. It happ a liked back abp happy thing flongs way.
"Lily lood take maked a fiside apie?" Tom and abed Timm.
A na starhe of her that they was is they," that and Timmy
flauch seer aing time tri
You: Timmy is a
Quadtrix:
bog the scated justo prove the bret you. Timmy nevery some
the gecid. Her neplay to bet starked a way, that litked cliend.
They shat corsee get.
Ben it lows mured she happlay mom and had says."
Tom
You: Once upon a time
Quadtrix:
, and said askiced and so owas said sri
The his brickerys and stew hhat and saw and stark a din't
She stingry and asked day. Timmy watch and played to cones.
It eve smors and he skiseed auch pived el make
usidpy the nis the girl Mian in the she foll pot stwed note ful to
snid sids funded araving many to them big strear birng and That the
momn,'s saw and icked aging for the cor dad and spzy ited thankeft
the beecia rear who rah gight sad, and and help ceated and blut the
waye! Shearted started, "Yes. I wout she feare scal the dingred asked
not dearve had s g. It was hat startedry like his
in the the was give grin Lily.
Tim ould and hoppen rand tce to the - faind her time. As and Ben't
the sise askep. It every and sticked Lia loshe wentimed toohld the
cookes and he gagayss in hen greveryby.
One day, here stomed trreave one up in Annamecy it noted Mise with
that make tret a like.
Tom."
T'vey, ""As smie, a, "I's wurre and not make day, but tway it?
Lily. The stach, says eveere they am and then a to happprosh apper,
and his plh? That you obo. The garded rike, nothis to fring they
is his ared to shing itsed and old neved the pretoy beard shappy
hingse they him at happy his stroughts have nex's by.
- The model has learned basic English word-level shapes after 3000 iterations despite training at the character level.
- Names (`Timmy`, `Lily`, `Tom`, `Ben`, `Mia`) appear consistently; they are likely the most frequent names in the story corpus.
- Dialogue punctuation (`"`, `'`, `,`) is placed plausibly.
- Sentence-level structure emerges: the model produces recognisable story beats ("One day,", "It was", "and said").
- At 0.83M parameters and `block_size=64`, the context window is narrow; longer-range coherence will improve with a larger `block_size` and more iterations.
Quadtrix.cpp/
├── config/
│ └── config.h # All hyperparameters — edit here to retrain
├── include/
│ ├── tensor.h # CPU float tensor: 2D/3D ops, matmul, softmax, etc.
│ ├── linear.h # nn.Linear equivalent (weight + optional bias)
│ ├── embedding.h # nn.Embedding (token + position lookup tables)
│ ├── layernorm.h # nn.LayerNorm (γ/β, ε=1e-5)
│ ├── attention.h # Head (causal mask, scaled dot-product) + MultiHeadAttention
│ ├── feedforward.h # FeedForward: Linear → ReLU → Linear → Dropout
│ ├── block.h # Transformer Block (pre-LN, dual residual)
│ ├── gpt.h # GPTLanguageModel, cross_entropy, generate()
│ ├── dataloader.h # Char tokeniser (stoi/itos), train/val split, get_batch()
│ └── backward.h # Full analytical backprop + AdamW state
├── data/
│ └── input.txt # Training corpus (plain text)
├── main.cpp # Training pipeline
└── README.md
Requirements: GCC ≥ 9 or Clang ≥ 10, C++17, no external dependencies.
# Compile
g++ -std=c++17 -O2 -I. -o quadtrix main.cpp
# Or use Make
make
# Train on your own text
./quadtrix data/input.txt
# Generate only (loads best_model.bin)
./quadtrix data/input.txt --generate

All hyperparameters live in config/config.h. Rebuild after any change.
// config/config.h
static const int BATCH_SIZE = 4; // sequences per gradient step
static const int BLOCK_SIZE = 64; // context window length
static const int MAX_ITERS = 3000; // total training iterations
static const int EVAL_INTERVAL = 200; // evaluate every N steps
static const float LEARNING_RATE = 3e-4f; // AdamW learning rate
static const int EVAL_ITERS = 10; // batches per eval estimate
static const int N_EMBD = 128; // embedding dimension
static const int N_HEAD = 4; // number of attention heads
static const int N_LAYER = 4; // number of transformer blocks
static const float DROPOUT       = 0.2f;  // dropout probability

| Target | Suggestion |
|---|---|
| Better coherence | ↑ block_size (256–512), ↑ n_embd (256+) |
| Faster training | ↑ batch_size, compile with -O3 -march=native |
| Smaller model | ↓ n_layer (2), ↓ n_embd (64) |
| More parameters | ↑ n_embd (512), ↑ n_layer (6–8) |
The goal is full transparency. Every multiply-accumulate, every softmax row, every gradient derivation is readable in the source. There is no framework between the math and the metal.
Rather than automatic differentiation, Quadtrix implements explicit backward passes for each operator. The derivations follow the standard formulations:
- Cross-entropy: `d_logits = softmax(logits) − one_hot(target)`, scaled by `1/(B·T)`
- Linear: `dX = dOut @ Wᵀ`, `dW += Xᵀ @ dOut`, `db += Σ dOut`
- LayerNorm: Ba et al. (2016) three-term formula via saved `μ`, `σ⁻¹`, `x̂`
- Softmax: `d_pre[i] = s[i] * (d[i] − Σⱼ s[j] d[j])`
- ReLU: `dX[i] = dOut[i] if pre_relu[i] > 0 else 0`
- Attention: product rule through `Q @ Kᵀ`; the causal mask zeros upper-triangle gradients
- Embeddings: scatter-add for tokens, batch-sum for positions
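As one concrete instance, the linear-layer rule from the list above can be written as explicit loops. This is a sketch only; the repository expresses the same computation through its Tensor matmul helpers.

```cpp
#include <vector>

// Backward for Y = X @ W + b over N rows:
//   dX = dOut @ W^T,  dW += X^T @ dOut,  db += column-sums of dOut
void linear_backward(const std::vector<float>& X,     // [N, in]
                     const std::vector<float>& W,     // [in, out]
                     const std::vector<float>& dOut,  // [N, out]
                     std::vector<float>& dX,          // [N, in]
                     std::vector<float>& dW,          // [in, out], accumulated
                     std::vector<float>& db,          // [out],     accumulated
                     int N, int in, int out) {
    for (int n = 0; n < N; ++n)
        for (int i = 0; i < in; ++i) {
            float acc = 0.f;
            for (int o = 0; o < out; ++o) acc += dOut[n * out + o] * W[i * out + o];
            dX[n * in + i] = acc;                                // dX = dOut @ W^T
        }
    for (int i = 0; i < in; ++i)
        for (int o = 0; o < out; ++o) {
            float acc = 0.f;
            for (int n = 0; n < N; ++n) acc += X[n * in + i] * dOut[n * out + o];
            dW[i * out + o] += acc;                              // dW += X^T @ dOut
        }
    for (int o = 0; o < out; ++o)
        for (int n = 0; n < N; ++n) db[o] += dOut[n * out + o];  // db += Σ dOut
}
```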
The upper-triangular mask is applied before softmax by setting future positions to -1e30. During backprop these positions receive zero gradient: the masked entries have effectively zero softmax output, so s[i] * (...) = 0.
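Concretely, the masking step amounts to the following sketch; the project's version operates on its own Tensor type in include/attention.h.

```cpp
#include <vector>

// Causal mask on a [T, T] matrix of raw attention scores: future positions
// (column j > row i) are set to -1e30 so softmax gives them ~0 weight, which
// in turn makes their gradient contribution vanish in the backward pass.
void apply_causal_mask(std::vector<float>& scores, int T) {
    for (int i = 0; i < T; ++i)
        for (int j = i + 1; j < T; ++j)
            scores[i * T + j] = -1e30f;   // upper triangle = future tokens
}
```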
Both the attention weight matrix and the projection output have independent dropout masks sampled during each forward pass. The same masks are stored and reused in the backward pass (d = d * mask / (1 - p)).
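A minimal version of that mask bookkeeping, written as an inverted-dropout sketch for illustration; in the project the masks live inside the attention and projection layers rather than in a standalone struct.

```cpp
#include <random>
#include <vector>

// Inverted dropout: sample a Bernoulli keep-mask in the forward pass, store it,
// and reuse the identical mask and 1/(1-p) scaling in the backward pass.
struct Dropout {
    float p;                    // drop probability (0.2 in this project)
    std::vector<float> mask;    // saved for backward
    std::mt19937 rng{42};

    void forward(std::vector<float>& x) {
        std::bernoulli_distribution keep(1.0 - p);
        mask.resize(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            mask[i] = keep(rng) ? 1.0f : 0.0f;
            x[i] = x[i] * mask[i] / (1.0f - p);
        }
    }
    void backward(std::vector<float>& d) const {
        for (std::size_t i = 0; i < d.size(); ++i)
            d[i] = d[i] * mask[i] / (1.0f - p);   // d = d * mask / (1 - p)
    }
};
```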
The training report visualizes three critical dynamics:
Loss curves (left panel): Cross-entropy decreases from 4.5 to 1.6 over 3000 iterations. Training and validation losses track closely, indicating effective learning without severe overfitting.
Wall-clock efficiency (middle panel): Validation loss falls nearly linearly with elapsed time, indicating consistent compute utilization and efficient batching.
Generalization gap (right panel): Difference between validation and training loss oscillates around zero with peak divergence of 0.0754. This healthy pattern suggests the model learns general patterns rather than memorizing training data.
Final metrics:
- Validation loss: 1.6371 (iteration 3000)
- Training parameters: 0.83M params, 105 vocab tokens, 28.3M training / 3.1M validation tokens
- Architecture:
n_layer=4, n_embd=128
The following table compares three distinct training runs across different architectures and datasets, demonstrating empirical scaling law behavior:
| Metric | Run 1: Character-Level | Run 2: Small Scale | Run 3: Large Scale |
|---|---|---|---|
| Architecture | |||
| Parameters | 0.83M | 2.00M | 19.17M |
| Layers | 4 | 4 | 4 |
| Embedding Dim | 128 | 200 | 200 |
| Attention Heads | 4 | 4 | 4 |
| Context Length | 64 | 200 | 200 |
| Dataset | |||
| Corpus | `tinystories` | `tinystories` | Children's Stories |
| Vocab Size | 105 (char) | 110 (char) | ~50K (BPE) |
| Training Tokens | 28.3M | 79.6M | Unknown |
| Validation Tokens | 3.1M | 8.8M | Unknown |
| Data Volume | - | 88.5 MB | - |
| Training Config | |||
| Total Iterations | 3,000 | 5,000 | 5,000 |
| Hardware | CPU/CUDA | CUDA (Tesla T4) | Unknown |
| Wall-Clock Time | ~76 min | 5.97 min | Unknown |
| Throughput | - | ~838 iter/min | - |
| Final Performance | |||
| Train Loss | 1.5632 | 0.9045 | Unknown |
| Val Loss | 1.6371 | 0.9301 | Unknown |
| Generalization Gap | 0.0739 | 0.0256 | Unknown |
| Peak Gap | 0.0754 @ iter 2800 | Unknown | Unknown |
| Convergence | |||
| Initial Loss | 4.5 | 4.6946 | ~5.0 |
| Loss Reduction | 65.7% | 80.2% | ~80% |
| Saved Checkpoints | Every 200 iters | Every 200 iters | Multiple |
| Best Iteration | 3000 | 4999 | Unknown |
1. Parameter Count vs Performance
The relationship between model size and loss follows the expected power law:
L(N) ∝ N^(-α)
Where N is the parameter count; α ≈ 0.076 is the exponent reported in large-scale scaling-law studies (Kaplan et al., 2020), while a naive two-point fit to our own runs gives a much steeper value (worked out below the data points):
- 0.83M params → Val Loss 1.6371
- 2.00M params → Val Loss 0.9301 (43.2% reduction for 2.4× params)
- 19.17M params → Expected Val Loss ~0.65-0.75 (extrapolated)
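For reference, a rough two-point fit using only the two measured runs (and ignoring that they also differ in data volume, context length, and iteration count) gives:

α = ln(L₁/L₂) / ln(N₂/N₁) = ln(1.6371 / 0.9301) / ln(2.00M / 0.83M) ≈ 0.565 / 0.880 ≈ 0.64

The steep fitted exponent reflects that Run 2 also saw ~2.8× more tokens and a roughly 3× longer context, so the improvement cannot be attributed to parameter count alone.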
2. Data Efficiency
Token scaling shows diminishing returns:
- Run 1: 28.3M tokens @ 1.6371 loss
- Run 2: 79.6M tokens @ 0.9301 loss (2.8× data → 43% loss reduction)
This suggests these runs are capacity-limited rather than data-limited: adding parameters yields better returns than adding data alone.
3. Compute Efficiency
Run 2 achieved superior performance despite shorter wall-clock time (5.97 min vs 76 min), highlighting the importance of:
- Hardware acceleration (Tesla T4 CUDA)
- Larger batch processing
- Optimized data pipeline
4. Generalization Dynamics
Both runs show healthy train/val convergence:
- Run 1: Final gap of 0.0739 (4.5% relative)
- Run 2: Final gap of 0.0256 (2.8% relative)
Smaller gap in Run 2 suggests better regularization or more diverse training data per parameter.
5. Neural Scaling Law Projection
Extrapolating from our empirical data:
| Target Loss | Estimated Params | Estimated Tokens | Expected Compute |
|---|---|---|---|
| 1.0 | ~1.5M | ~50M | ~2-3 min (T4) |
| 0.8 | ~3-4M | ~100M | ~8-12 min (T4) |
| 0.6 | ~15-20M | ~300M | ~40-60 min (T4) |
| 0.5 | ~40-50M | ~1B | ~3-5 hours (T4) |
Chinchilla-optimal ratio: For compute-efficient training at this scale, target D ≈ 20 × N (roughly 20 training tokens per parameter).
- Scaling works: A 2.4× increase in parameters cut validation loss by ~43% between Runs 1 and 2
- Hardware matters: GPU acceleration provides 12× speedup with better loss
- Small models saturate quickly: Beyond 5K iterations, gains diminish without more capacity
- Character-level is competitive: At small scale, character models perform reasonably despite simpler tokenization
- Generalization is healthy: Both runs avoid severe overfitting, suggesting good regularization defaults
| Project | Focus | Language | Autograd |
|---|---|---|---|
| nanoGPT | Minimal PyTorch GPT | Python | PyTorch |
| llama2.c | Inference only | C | None |
| minGPT | Educational PyTorch | Python | PyTorch |
| Quadtrix.cpp | Training + inference, multi-backend | C++/Python/C | Manual + PyTorch |
Requirements:
- C++17 compiler (GCC 7+, Clang 5+, MSVC 2017+)
- Python 3.8+ (for PyTorch paths)
- CMake 3.15+ (for LibTorch experiments)
Minimal build (C++ only):
g++ -std=c++17 -O2 -I. -Iinclude -o quadtrix main.cpp

With LibTorch:
# Download libtorch from pytorch.org
cmake -S . -B build -DCMAKE_PREFIX_PATH=/path/to/libtorch
cmake --build build --config Release

Python environment:
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install torch tiktoken numpy

- Architecture based on "Attention Is All You Need" (Vaswani et al., 2017)
- GPT-2 (Radford et al., 2019).
MIT
Quadtrix.cpp · val loss 1.6371 · 0.83M params · 76 min on CPU
