A visual, interactive, stage-by-stage explainer of exactly what happens when you send a prompt to a large language model. Built for learners who want to see under the hood.
- 🚀 Real Ollama Integration — Connects directly to your local `ollama serve` to visualize actual tokenization and generation.
- 🔥 Native PyTorch Backend — An optional Python FastAPI backend that runs small Hugging Face models (such as GPT-2 or SmolLM) to extract real mathematical data like attention weights and per-token embeddings.
- 🎨 Beautiful Dark Theme — A sleek, modern UI designed for clarity, built with CSS variables.
- 🔄 Smart Mock Mode — Automatically falls back to high-quality mock data if no backends are running, perfect for quick demos or learning on the go.
- 📱 Responsive & Accessible — Navigate through stages using your keyboard arrow keys.
- 📚 Built-in Learning — "Why does this matter?" explainers on every single stage.
- Raw Input — See how your text is broken down into UTF-8 bytes with hex codes.
- Tokenizer — Watch the text split into tokens with IDs, featuring color-coded BPE classifications.
- Embeddings — View high-dimensional vectors projected to 2D using PCA. Inspect per-token dimensions and cosine similarities!
- Self-Attention — The core of the Transformer. Visualize attention weights in a heatmap or arc view. Analyze head entropy and layer-by-layer attention stats.
- Feed-Forward Network — Watch token data flow through the MLP layers with animated node activations.
- Layer Stack — Trace a token's journey through all 32 layers of the deep neural network.
- Softmax & Temperature — Interactively tweak Temperature, Top-K, and Top-P to see how the probability distribution bends and warps in real-time.
- Autoregressive Generation — Watch the model generate text token-by-token with live streaming stats (tokens/sec, latency sparklines).
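The knobs in the Softmax & Temperature stage can be sketched in a few lines of NumPy. This is an illustrative sketch of the standard technique, not the app's actual code, and the logit values are made up:

```python
import numpy as np

def sample_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k and top-p (nucleus) filtering to logits."""
    logits = np.asarray(logits, dtype=float) / temperature  # <1 sharpens, >1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                    # softmax

    order = np.argsort(probs)[::-1]                         # most likely first
    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:
        keep[order[top_k:]] = False                         # drop all but the top-k tokens
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        # keep the smallest prefix whose cumulative mass reaches top_p
        cutoff = np.searchsorted(cum, top_p) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()                              # renormalize survivors

logits = [2.0, 1.0, 0.5, -1.0]        # toy next-token logits
p = sample_distribution(logits, temperature=0.7, top_k=3)
```

Lowering the temperature concentrates mass on the top token, while top-k/top-p zero out the tail before renormalizing, which is exactly the bending and warping the stage animates.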
You can run this project in Standard Mode (Node.js + Ollama) or Advanced Mode (adding the Python PyTorch backend for deep internal data extraction).
- Node.js v18+ — Download
- Ollama — Download (pull at least one model: `ollama pull llama3`)
- (Optional) Python 3.9+ — For the PyTorch Native backend.
```bash
# 1. Clone or extract the project
git clone https://github.com/yourusername/llm-internals-explorer.git
cd llm-internals-explorer

# 2. Install dependencies (installs root, server, and client packages)
npm install
npm --prefix client install
npm --prefix server install

# 3. Start Ollama in a separate terminal
ollama serve

# 4. Start the development server
npm run dev
```

The app will be available at http://localhost:5173. The Node proxy server runs on http://localhost:3001.
To visualize real attention matrices and continuous vector embeddings, you can spin up the PyTorch server directly from the main app interface.
```bash
# 1. First time only: Install Python dependencies
cd pytorch_server
pip install -r requirements.txt
```

Once dependencies are installed, just run `npm run dev` and click the + Start PyTorch button in the app header to launch the companion backend automatically in a new window!
The project uses a dual-backend architecture to provide both conversational generation and deep internal tensor inspection.
```
llm-internals-explorer/
├── client/                 # React + Vite frontend
│   ├── src/stages/         # The 8 interactive stage components
│   ├── src/components/     # Shared UI (Nav, Toggles, Tour)
│   ├── src/workers/        # Web Workers (e.g., PCA calculations)
│   └── src/context/        # Global AppContext state
├── server/                 # Node.js / Express proxy
│   ├── index.js            # Routes to Ollama & Python backend
│   └── mock/               # Fallback data when offline
└── pytorch_server/         # Native PyTorch API
    ├── main.py             # FastAPI server extracting tensors
    └── requirements.txt    # Python dependencies
```
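The 2D projection used in the Embeddings stage (computed in a web worker in the app itself) boils down to classic PCA. A minimal NumPy sketch of the same math, using made-up toy vectors:

```python
import numpy as np

def pca_2d(vectors):
    """Project high-dimensional token embeddings onto their top 2 principal axes."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                 # center each dimension
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                    # coordinates in the top-2 subspace

# Toy "embeddings": 5 tokens in a 4-dimensional space
emb = np.array([[1.0, 0.2, 0.0, 0.1],
                [0.9, 0.1, 0.1, 0.0],
                [0.0, 1.0, 0.9, 0.2],
                [0.1, 0.9, 1.0, 0.1],
                [0.5, 0.5, 0.5, 0.5]])
coords = pca_2d(emb)                       # shape (5, 2), ready to scatter-plot
```

Real embeddings have hundreds or thousands of dimensions per token, which is why the app offloads this to a worker thread.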
The Node server at localhost:3001 intelligently proxies your frontend requests:
- `/api/generate`, `/api/model-info`, `/api/tags` ➡️ Local Ollama Server
- `/api/pytorch/attention`, `/api/pytorch/embeddings` ➡️ Local PyTorch Python Server
- Ollama's API Limits: Ollama does not natively expose internal layer attentions or per-token continuous embeddings. Our solution is the `pytorch_server` backend. If you only use Ollama, Stages 3 and 4 will gracefully fall back to realistic mock data.
- Logprobs Requirement: Visualizing alternate token paths in Stages 7 and 8 requires Ollama v0.1.33+.
- Heavy Tensors: The PyTorch backend extracts massive tensors (`output_attentions=True`). On older machines, extracting data for long prompts might take a few seconds.
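For intuition about what those tensors contain, here is a pure-NumPy sketch of single-head scaled dot-product attention plus the per-head entropy statistic shown in the Self-Attention stage. The shapes are toy-sized and the code is illustrative, not the backend's actual implementation:

```python
import numpy as np

def attention_weights(Q, K):
    """Row-stochastic attention matrix: softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def head_entropy(weights):
    """Mean Shannon entropy (nats) of each query's attention distribution."""
    eps = 1e-12                                    # avoid log(0)
    return float(-(weights * np.log(weights + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
seq_len, d = 6, 8                                  # 6 tokens, 8-dim head
Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))
W = attention_weights(Q, K)                        # shape (seq_len, seq_len)
H = head_entropy(W)                                # higher = more diffuse attention
```

A real model stores one such matrix per head per layer per prompt, so for long prompts the full `output_attentions` payload grows quadratically in sequence length, which is why extraction can be slow.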
We welcome contributions! Feel free to improve visualizations, add new stages, or optimize performance. Some open ideas:
- Add KV-cache visualization stage.
- Show gradient flow during autoregressive generation.
- Support for multiple PyTorch models side-by-side.
- Export raw mathematically accurate attention matrices as CSV.
This project is licensed under the GNU GPLv3 License.
Don't forget to take the guided tour in the app to get familiar with the interface!