A character-level decoder-only Transformer (GPT) implemented in BASH.
Yes, BASH.
The script trains on a small list of names and generates new ones. All model state, intermediate values, and gradients live inside the shell script. Floating-point arithmetic is handled through bc -l, which is exactly as sensible as it sounds.
This is either educational or a poor life choice, depending on your mood.
> [!IMPORTANT]
> **Performance Notice:** This original version (v1) is hand-written for maximum educational transparency but is extremely slow (~7 hours for a training run) due to BASH process-forking overhead.
>
> If you actually want to see the model train in a reasonable amount of time (~11 minutes), please see the v2 README for the optimised version using bc coprocesses and AI-assisted refactoring.
bashGPT is a minimal GPT-style language model written as a single BASH program.
It includes:
- character-level tokenisation
- token embeddings
- positional embeddings
- causal multi-head self-attention
- RMSNorm
- an MLP with ReLU
- softmax
- negative log-likelihood loss
- Adam optimiser
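As a flavour of the first ingredient, character-level tokenisation in pure shell might look something like this. This is a hedged sketch: the function and variable names here are illustrative, not the script's actual identifiers.

```bash
# A hedged sketch of character-level tokenisation in shell.
names=(emma olivia ava)

# Collect the sorted set of unique characters across all names.
chars=$(printf '%s\n' "${names[@]}" | grep -o . | LC_ALL=C sort -u | tr -d '\n')

declare -A stoi   # character  -> integer id
declare -a itos   # integer id -> character
for ((i = 0; i < ${#chars}; i++)); do
  c=${chars:i:1}
  stoi[$c]=$i
  itos[$i]=$c
done

# Encode a word as space-separated token ids.
encode() {
  local word=$1 j
  local -a out=()
  for ((j = 0; j < ${#word}; j++)); do
    out+=("${stoi[${word:j:1}]}")
  done
  printf '%s\n' "${out[*]}"
}

encode ava
```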
The current model is intentionally tiny.
- 1 layer
- 2 attention heads
- 4-dimensional embeddings
- context window of 8 tokens
That keeps the runtime barely civilised.
I wanted to understand the mechanism, not just the tooling. BASH gave me the lowest possible cognitive load, which meant I could focus on the raw math. It also introduced several new and unnecessary problems, which turned out to be part of the lesson.
Most Transformer code is hidden behind tensor libraries, kernels, and enough abstraction to make the mechanism harder to see. This script leaves the mechanism exposed.
Each value is tracked directly. Each gradient is stored explicitly. The forward pass is assembled step by step. The backward pass is walked by hand. If you want to inspect the moving parts of a GPT without leaving the terminal, this is one way to do it.
It is also a slightly unreasonable way to spend time, which is part of the appeal.
Each scalar in the computation graph gets an ID. Data, gradients, parent nodes, and local gradients are stored in BASH associative arrays.
There are no tensors here. Everything is scalar math.
The backward pass uses an iterative topological traversal. No recursive autograd. No hidden engine. Just explicit graph bookkeeping in shell.
Non-integer math is delegated to bc -l with fixed precision. This works well enough for a proof of concept and slowly enough to build character.
On first run, the script downloads the names dataset automatically. It tokenises the data at character level, trains for 100 steps, and then samples 20 outputs.
NOTE: Ideally, you want 1000+ steps before the samples become at all useful. 100 was chosen because, well... BASH. If you want a more performant version, try v2.
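Sampling can also be done in pure shell. Here is a hedged sketch of drawing a token index from a probability list using $RANDOM; it is not the script's actual sampler, just the idea.

```bash
# Multinomial sampling in shell: walk the cumulative distribution until
# a uniform draw falls below it. Names here are illustrative.
sample() {
  local r p cum=0 i=0
  r=$(echo "scale=8; $RANDOM / 32768" | bc -l)   # uniform in [0, 1)
  for p in "$@"; do
    cum=$(echo "$cum + $p" | bc -l)
    if [ "$(echo "$r < $cum" | bc -l)" = "1" ]; then
      echo "$i"
      return
    fi
    i=$((i + 1))
  done
  echo "$((i - 1))"   # guard against probabilities summing to < 1
}
sample 0.1 0.2 0.7   # prints 0, 1, or 2
```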
Expect this to be slow.
BASH is many things. A high-performance numerical runtime is not one of them. Training happens one scalar operation at a time, with a lot of process overhead and very little mercy shown to the CPU.
Expect the outputs to be limited too. The model is small, the training run is short, and the numerical behaviour is less stable than you would get from standard ML tooling.
Still, it trains. It samples. It behaves like a language model. In BASH.
That is the whole point.
This project has very clear constraints.
- no batching
- no tensors
- no useful scaling story
- no competition with standard frameworks on speed, stability, or ergonomics
Some implementation choices are simplified so the whole thing can remain understandable inside a shell script.
```bash
chmod +x bashGPT.bash
./bashGPT.bash
```

Requirements:

- BASH 4+
- bc
- curl
On the first run, the script downloads the training data automatically.
- This repository is for studying mechanism, not for training serious models.
- If you want performance, use literally almost anything other than BASH.
- If you want to inspect the gradient state directly, it is sitting in BASH associative arrays:
```bash
for id in $(printf '%s\n' "${!values_grad[@]}" | sort -n | head -20); do
  printf '%s\t%s\n' "$id" "${values_grad[$id]}"
done
```

- If you want to watch a tiny GPT crawl out of a shell script and somehow function, this could be useful.
Inspired by Andrej Karpathy's makemore and his lectures on neural networks. His teaching made this project possible. The conceptual debt to Karpathy's work is real, and gratefully acknowledged.
The included dataset (input.txt) contains, as an example, the 32K most common names taken from ssa.gov for the year 2018, slightly neatened.
The implementation is an original rewrite in BASH. No Python code or GPU was harmed in the making of this script. The CPU is a different matter altogether.
The philosophical implications of this project are explored in A Philosophical Approach to IT Architecture.