A from-scratch GPT implementation in Ruby. The goal is to make every layer of a modern language model legible — not just usable, but understandable — by building it up from first principles, in historical order, without hiding the math behind framework magic.
This is not a production ML system. It is a pedagogical one. The code tells the story of how neural networks work.
The codebase is organized as a dependency ladder. Each gem is complete and useful on its own, and each one builds on the last. You can stop at any rung and have a working, comprehensible system.
```
rinzler-autograd   ← scalar autograd (understand backprop on one number)
        ↓
rinzler-tensor     ← matrix autograd (make it practical)
        ↓
rinzler-nn         ← building blocks (Linear, LayerNorm, Embedding)
        ↓
rinzler-optim      ← AdamW optimizer
        ↓
rinzler-tokenizer  ← BPE tokenizer
        ↓
rinzler-gpt        ← GPT model + training loop
        ↓
rinzler-vulkan     ← GPU compute backend (optional)
```
Scalar reverse-mode automatic differentiation. A Value wraps a single float and records every operation performed on it. Calling .backward on the loss walks the computation graph in reverse and accumulates gradients at every node via the chain rule.
This is the Karpathy micrograd approach — implement it once at the scalar level and you understand backprop completely.
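A minimal usage sketch, assuming a micrograd-style API (a `Value` constructor around a float, overloaded arithmetic operators, `#backward`, and `#grad`); the require path and exact names in rinzler-autograd may differ:

```ruby
# Sketch only: Value API assumed to mirror micrograd; names may differ.
require "rinzler/autograd"

x = Rinzler::Autograd::Value.new(2.0)
w = Rinzler::Autograd::Value.new(-3.0)
b = Rinzler::Autograd::Value.new(1.0)

y    = x * w + b   # y = -5.0; each operation is recorded in the graph
loss = y * y       # loss = 25.0

loss.backward      # reverse pass: chain rule from loss back to every input
puts w.grad        # d(loss)/dw = 2 * y * x = -20.0
```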
The same autograd idea applied to Numo::DFloat arrays. Instead of tracking individual numbers, every operation (add, mul, dot, bmm, softmax, log_softmax, layer_norm, etc.) records how to backpropagate through a whole matrix at once. This is what makes training practical.
Supports a selectable backend:
```ruby
Rinzler::Tensor.backend = :vulkan  # route dot/bmm through GPU
Rinzler::Tensor.backend = :cpu     # default
```

Key ops: `dot`, `bmm`, `transpose_last2`, `softmax`, `log_softmax`, `sum`, `mean`, `reshape`, `slice_cols`, `concat_cols`.
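The same idea at matrix granularity, as a hedged sketch (constructor and method names are assumptions based on the op list above):

```ruby
# Sketch only: require path, Tensor.new, and #grad are assumptions.
require "numo/narray"
require "rinzler/tensor"

a = Rinzler::Tensor.new(Numo::DFloat.new(2, 3).rand)
b = Rinzler::Tensor.new(Numo::DFloat.new(3, 4).rand)

c    = a.dot(b)    # one recorded op covers the whole matrix product
loss = c.sum
loss.backward      # dL/dA = dL/dC · Bᵀ is applied in a single step

p a.grad.shape     # => [2, 3]
```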
Neural network primitives built on top of rinzler-tensor:
- `Linear` — learned weight matrix + optional bias; supports 2D and 3D inputs (flattens leading dims automatically)
- `LayerNorm` — Pre-LN normalization; numerically stable; 3D-aware
- `Embedding` — learned lookup table; handles batched index input; gradient accumulates correctly for repeated indices
- `Parameter` — thin wrapper that marks a tensor as trainable
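How the pieces are meant to compose, as a sketch (constructor signatures and `#call` are assumptions; see the gem for the real API):

```ruby
# Sketch only: names and signatures are illustrative.
require "rinzler/nn"

vocab_size, d_model = 1000, 64
token_ids = [[1, 5, 42, 7]]                              # one batch of 4 token ids

emb  = Rinzler::NN::Embedding.new(vocab_size, d_model)   # learned lookup table
ln   = Rinzler::NN::LayerNorm.new(d_model)               # pre-LN normalization
proj = Rinzler::NN::Linear.new(d_model, 4 * d_model)     # weight + optional bias

x = emb.call(token_ids)      # [B, T] indices -> [B, T, d_model]
h = proj.call(ln.call(x))    # 3D input: leading dims flattened automatically
```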
Optimizers and LR schedulers.
Optimizers: SGD, SGDMomentum, RMSprop, Adam, AdamW. All support clip_grad_norm!(max_norm) for gradient clipping. AdamW supports checkpoint save/load of moment state so training resumes exactly.
Schedulers wrap any optimizer and adjust lr each step: LinearWarmup (ramp from 0 → base over N steps), CosineWithWarmup (warmup then cosine decay). Both expose the same interface as an optimizer.
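For reference, the learning-rate curve these schedulers produce is the standard warmup-then-cosine shape. A self-contained sketch of the formula (the exact decay rinzler-optim implements may differ in detail):

```ruby
# Linear ramp to base_lr over warmup_steps, then cosine decay to zero.
def lr_at(step, base_lr: 3e-4, warmup_steps: 500, total_steps: 50_000)
  return base_lr * step / warmup_steps.to_f if step < warmup_steps

  progress = (step - warmup_steps) / (total_steps - warmup_steps).to_f
  base_lr * 0.5 * (1.0 + Math.cos(Math::PI * progress))
end

lr_at(250)     # => 1.5e-4 (halfway through warmup)
lr_at(50_000)  # => 0.0    (fully decayed)
```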
Byte-Pair Encoding tokenizer. Character-stream approach (no word splitting) — tokens naturally include spaces, which is correct for Ruby code and prose where whitespace is semantic. decode is just tokens.join.
Trained in-session from the corpus; the vocabulary is saved alongside model checkpoints.
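A usage sketch (class and method names are assumptions; the character-stream behaviour is as described above):

```ruby
# Sketch only: train in-session, then round-trip a string.
require "rinzler/tokenizer"

tok = Rinzler::Tokenizer::BPE.new
tok.train(File.read("rinzler-gpt/corpus/corpus.txt"), vocab_size: 1000)

ids = tok.encode("def hello\n  puts 'hi'\nend\n")
tok.decode(ids)   # == the original string, since decode is just tokens.join
```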
GPT-2 style transformer:
- Pre-LayerNorm architecture
- Multi-head causal self-attention with learned position embeddings (causal mask precomputed once at construction)
- GELU activation in FFN layers (tanh approximation, as used in GPT-2; see the snippet after this list)
- Configurable depth/width via `Config`
- Vectorized cross-entropy loss through `log_softmax` (full gradient path preserved)
- Binary checkpoint format: weights and optimizer moments as raw float bytes + JSON sidecar for metadata. Legacy JSON-only format still loads.
- Graceful shutdown: `SIGINT`/`SIGTERM` sets a flag, training finishes the current step and saves a checkpoint before exiting.
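The tanh approximation referenced in the list is the published GPT-2 form of GELU; a standalone version for reference (the constant 0.044715 is part of the approximation, not a tunable):

```ruby
def gelu(x)
  0.5 * x * (1.0 + Math.tanh(Math.sqrt(2.0 / Math::PI) * (x + 0.044715 * x**3)))
end

gelu(1.0)   # => ~0.8412, close to the exact erf-based GELU
```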
All scripts run from the monorepo root so Bundler resolves inter-gem dependencies correctly:
```bash
# Find the fastest batch_size × OMP_NUM_THREADS combo before committing to a long run
bundle exec ruby rinzler-gpt/autotune.rb --vulkan

# Fresh run
bundle exec ruby rinzler-gpt/train.rb \
  --corpus "rinzler-gpt/corpus/*.txt" \
  --steps 50000 \
  --vocab-size 1000 \
  --warmup-steps 500 \
  --cosine \
  --div-crit 50 \
  --gen-every 500 \
  --vulkan

# Resume
bundle exec ruby rinzler-gpt/train.rb \
  --corpus "rinzler-gpt/corpus/*.txt" \
  --steps 100000 \
  --vocab-size 1000 \
  --warmup-steps 500 \
  --cosine \
  --div-crit 50 \
  --resume rinzler-gpt/runs/4/checkpoint_step15000.json \
  --vulkan
```

Key flags:
| Flag | Default | Description |
|---|---|---|
| `--steps N` | 1000 | Training steps |
| `--lr N` | 3e-4 | Learning rate |
| `--batch-size N` | 8 | Sequences per step |
| `--context N` | 128 | Max sequence length |
| `--d-model N` | 64 | Embedding dimension |
| `--layers N` | 4 | Transformer blocks |
| `--vocab-size N` | 500 | BPE merge count |
| `--warmup-steps N` | 0 | LR warmup steps |
| `--cosine` | off | Cosine decay after warmup (requires `--warmup-steps`) |
| `--clip-grad N` | 1.0 | Gradient clipping max norm |
| `--no-clip-grad` | — | Disable gradient clipping |
| `--gen-every N` | 200 | Steps between text samples |
| `--save-every N` | 500 | Steps between checkpoints |
| `--div-warn N` | 20 | Warn when train/val gap exceeds N% |
| `--div-crit N` | — | Stop when train/val gap exceeds N% |
| `--vulkan` | off | Use GPU backend (see note below) |
| `--resume PATH` | — | Resume from checkpoint (.json or .bin) |
| `--corpus PATTERN` | corpus.txt | Repeatable glob |
Autotune (rinzler-gpt/autotune.rb):
Benchmarks batch_size × OMP_NUM_THREADS combinations over a short fixed run, measures tokens/sec, and emits the optimal train.rb invocation. Accepts --batch-sizes "4,8,16,32", --threads "1,2,4", --steps N, --vulkan.
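A rough illustration of the sweep (flag names are the documented ones; the step count, context length, and tokens/sec math here are simplified assumptions):

```ruby
# Sketch only: time a short fixed run per combination, rank by tokens/sec.
require "benchmark"

steps, context = 20, 128

results = [4, 8, 16, 32].product([1, 2, 4]).map do |batch, threads|
  elapsed = Benchmark.realtime do
    system({ "OMP_NUM_THREADS" => threads.to_s },
           "bundle", "exec", "ruby", "rinzler-gpt/train.rb",
           "--steps", steps.to_s, "--batch-size", batch.to_s)
  end
  [batch, threads, steps * batch * context / elapsed]
end

batch, threads, tps = results.max_by { |_, _, t| t }
puts "best: OMP_NUM_THREADS=#{threads} --batch-size #{batch} (~#{tps.round} tok/s)"
```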
Corpus (rinzler-gpt/corpus/):
- `corpus.txt` — _why's Poignant Guide to Ruby
- `learn-to-program.txt` — Chris Pine's Learn to Program
- `pickaxe6_clean.txt` — Programming Ruby 4 (cleaned from PDF)
Optional GPU compute backend via Vulkan compute shaders. Implements tiled 16×16 GEMM on the GPU using a GLSL compute shader compiled to SPIR-V at gem build time.
When to use it: Vulkan only helps once matrices are large enough that the GPU's speedup outweighs the Ruby→GPU→Ruby transfer cost. Benchmarked crossover on integrated AMD graphics: ~n=512. For the default small model (d_model=64), Vulkan is 2.4× slower than CPU — every matmul pays the transfer cost for matrices that OpenBLAS handles faster natively. Only add --vulkan if you scale up to d_model ≥ 256 or larger batch sizes, where the vocabulary projection ([B×T, d_model] × [d_model, vocab_size]) becomes the dominant cost. Run autotune.rb with and without --vulkan to confirm before committing to a long run.
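A quick way to check the crossover on your own hardware, sketched against the backend switch shown earlier (the Tensor constructor and `#dot` are assumptions):

```ruby
# Sketch only: compare square matmul timings under each backend.
require "benchmark"
require "numo/narray"
require "rinzler/tensor"

[64, 256, 512, 1024].each do |n|
  a = Rinzler::Tensor.new(Numo::DFloat.new(n, n).rand)
  b = Rinzler::Tensor.new(Numo::DFloat.new(n, n).rand)

  cpu, gpu = %i[cpu vulkan].map do |backend|
    Rinzler::Tensor.backend = backend
    Benchmark.realtime { 10.times { a.dot(b) } }
  end

  puts format("n=%4d  cpu %.3fs  vulkan %.3fs", n, cpu, gpu)
end
```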
Requirements: vulkan-headers, vulkan-icd-loader (or vulkan-radeon), shaderc (for glslc).
Build:
```bash
cd rinzler-vulkan
bundle install
bundle exec rake compile
```

What's implemented:

- Full training loop: forward pass → loss → backward pass → AdamW step
- Batched training (multiple sequences per step)
- BPE tokenization with 1000-merge vocabulary
- Binary checkpoint format (raw floats + JSON sidecar) with backwards-compatible JSON-only loader
- Checkpoint save/resume: model weights + AdamW moment state + tokenizer
- Multi-corpus support (glob patterns, merged at load time)
- GPU acceleration via Vulkan (dot + bmm backend)
- Text generation with temperature sampling (see the sketch after this list)
- Linear LR warmup (`LinearWarmup`, `CosineWithWarmup` schedulers)
- Gradient clipping (`clip_grad_norm!`)
- GELU activation (GPT-2 tanh approximation)
- Causal mask precomputed at model construction
- Graceful `SIGINT`/`SIGTERM` shutdown with checkpoint save
- Divergence monitor: configurable warn/stop thresholds on train/val gap
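The temperature sampling listed above follows the usual scheme: scale logits by 1/temperature, softmax, then draw from the resulting distribution. A self-contained sketch (the model's actual sampler may differ in detail):

```ruby
# Standalone temperature sampling over a plain array of logits.
def sample(logits, temperature: 0.8)
  scaled = logits.map { |l| l / temperature }
  max    = scaled.max                            # subtract max for stability
  exps   = scaled.map { |l| Math.exp(l - max) }
  sum    = exps.sum
  probs  = exps.map { |e| e / sum }

  r, acc = rand, 0.0
  probs.each_with_index do |p, i|
    acc += p
    return i if r <= acc
  end
  probs.size - 1
end

sample([2.0, 1.0, 0.1])   # usually 0; lowering the temperature sharpens the pick
```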
Run 8 is in progress: 1000-merge BPE vocabulary, full corpus (_why + Chris Pine + Pickaxe), 100k steps. At step 5k, generation shows emerging Ruby sentence structure and correct use of method/block/argument terminology. Running on CPU at ~0.51s/step (~14h total).
Known limitations:
- No KV cache — generation is O(T²) per token
- Single-GPU only, no multi-device
- Vulkan path serializes data through Ruby arrays (no zero-copy)
- Model is small by current standards (64d); coherent generation but limited range
Planned:
- KV cache for O(T) generation (currently O(T²) per token)
- `CosineWithWarmup` resume support: `total_steps` should account for `start_step` on resume
Requirements:
- Ruby 4.0+
- numo-narray (local fork at `../numo-narray`)
- Vulkan SDK (optional, for `rinzler-vulkan`)