A private, entirely local Large Language Model (LLM) dashboard running directly in your browser using WebGPU and WebAssembly.
This project leverages the @mlc-ai/web-llm engine to execute high-performance models like Llama-3.2 and Phi-3.5 entirely on the client side. No data ever leaves your device, and no API keys are required.
Built with the help of Gemini CLI. This is something to play with if you are learning about LLMs and AI. You don't need a backend to run it; it's all in the browser to keep things simple.
- High-Density UI: A compact, glassmorphic React interface optimized for 1920x1080 screens, featuring a sidebar-driven chat history and professional dashboard aesthetics.
- Hardware Accelerated: Uses WebGPU for near-native inference speeds on supported hardware.
- Privacy First: 100% local execution; your conversations never touch a server.
- Persistent Memory: Uses IndexedDB (via Dexie.js) to automatically save and restore your chat sessions, model preferences, and history across browser restarts.
- Markdown & Code Support: Rich text rendering with syntax-highlighted code blocks and one-click "Copy to Clipboard" functionality.
- Optimized for Compatibility:
  - Uses q4f32 (32-bit float) model variants to bypass `shader-f16` requirements on Linux/Mesa drivers.
  - Implements conservative KV Cache management to support integrated GPUs with strict compute limits.
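The persistence layer in the app goes through Dexie.js, but the shape of a stored session can be sketched in plain TypeScript. Everything below (the `ChatSession` type, the `touchSession` helper, the in-memory `Map`) is an illustrative stand-in, not the app's actual schema:

```typescript
// Hypothetical shape of a persisted chat session; the real app stores
// records like this in IndexedDB via Dexie.js rather than a Map.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatSession {
  id: string;
  model: string;          // selected model preference
  messages: ChatMessage[];
  updatedAt: number;      // used to restore the most recent session on reload
}

// In-memory stand-in for the IndexedDB table, for illustration only.
const sessions = new Map<string, ChatSession>();

function touchSession(id: string, model: string, msg: ChatMessage): ChatSession {
  const s = sessions.get(id) ?? { id, model, messages: [], updatedAt: 0 };
  s.messages.push(msg);
  s.updatedAt = Date.now();
  sessions.set(id, s);
  return s;
}
```

With Dexie, the same record type would back an IndexedDB table instead of the `Map`, which is what lets sessions survive browser restarts.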
The application includes a pre-configured model registry, allowing you to scale performance based on your available VRAM:
| Model Name | VRAM Req. | Best For... |
|---|---|---|
| Llama 3.2 1B (Default) | ~800MB | High-quality instructions and balanced performance. |
| Llama 3.2 3B (Pro) | ~2.5GB | Complex reasoning and high-intelligence tasks. |
| Phi 3.5 Mini (1k Context) | ~3.2GB | Logic-heavy tasks; optimized with a compact context window. |
| Qwen 2.5 0.5B | ~350MB | Extremely fast responses for simple utility tasks. |
| SmolLM2 135M | ~100MB | Ultra-lightweight testing and minimal resource usage. |
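The registry above can also drive simple automatic model selection. This sketch mirrors the table's VRAM figures; the `pickModel` helper and the registry object are illustrative, not the app's actual registry:

```typescript
// Illustrative registry mirroring the table above (VRAM in MB),
// ordered from smallest to largest.
const REGISTRY = [
  { name: "SmolLM2 135M", vramMB: 100 },
  { name: "Qwen 2.5 0.5B", vramMB: 350 },
  { name: "Llama 3.2 1B", vramMB: 800 },
  { name: "Llama 3.2 3B", vramMB: 2500 },
  { name: "Phi 3.5 Mini", vramMB: 3200 },
];

// Hypothetical helper: pick the largest model whose VRAM requirement
// fits the available budget; fall back to the smallest model.
function pickModel(budgetMB: number): string {
  const fitting = REGISTRY.filter((m) => m.vramMB <= budgetMB);
  return (fitting.length > 0 ? fitting[fitting.length - 1] : REGISTRY[0]).name;
}
```

For example, a 1 GB budget selects Llama 3.2 1B, while anything above ~3.2 GB selects Phi 3.5 Mini.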
The "Hardware Lab" is a real-time diagnostic suite providing deep visibility into the LLM's interaction with your GPU.
- Decode Speed: Real-time generation speed in Tokens per Second.
- Prefill Speed: How fast the model processes your prompt before generating.
- TTFT (Time To First Token): The "reaction time" of the model in milliseconds.
- TPOT (Time Per Output Token): The average latency between individual token generations.
- Context Saturation: Real-time percentage of the KV Cache used (alerts in Red if OOM is imminent).
- P:R Ratio (Prompt-to-Response): Analyzes the efficiency of your prompting relative to the generated length.
- Memory/Token: Precise measurement of VRAM consumed per generated token (typically ~0.20 KB/T).
- Local Savings: Estimated cost savings compared to using commercial GPT-4o APIs.
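The latency metrics above follow directly from three timestamps and a token count. A minimal sketch (the `computeTiming` function and its exact formulas are my own illustration, not the dashboard's internals):

```typescript
interface GenTiming {
  ttftMs: number;    // Time To First Token
  tpotMs: number;    // average Time Per Output Token
  decodeTps: number; // decode speed in tokens per second
}

// Hypothetical helper computing dashboard-style metrics from raw
// timestamps (in ms) and the number of generated tokens.
function computeTiming(
  promptSentAt: number,
  firstTokenAt: number,
  finishedAt: number,
  tokens: number
): GenTiming {
  const decodeMs = finishedAt - firstTokenAt;
  return {
    ttftMs: firstTokenAt - promptSentAt,
    tpotMs: decodeMs / tokens,
    decodeTps: tokens / (decodeMs / 1000),
  };
}
```

For example, `computeTiming(0, 250, 2250, 100)` yields a TTFT of 250 ms, a TPOT of 20 ms, and a decode speed of 50 tokens/second.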
Fine-tune the engine's behavior in real-time without restarting the model:
- KV Cache Management: Slider to adjust memory pages (lower = more stability, higher = longer context retention).
- Memory Depth (Context Slicing): Control exactly how many previous messages are sent to the AI to prevent crashes on low-VRAM devices.
- Creativity (Temperature): Direct control over the model's sampling randomness.
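Memory Depth amounts to plain array slicing before each request. In this sketch (`sliceContext` is a hypothetical name), the system prompt is always kept and only the last `depth` turns are sent to the model:

```typescript
interface Msg {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical context-slicing helper: keep any system prompt plus the
// most recent `depth` messages, dropping older turns so the KV Cache on
// low-VRAM devices is not exhausted.
function sliceContext(history: Msg[], depth: number): Msg[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-depth)];
}
```

Because the slice happens per request, lowering the depth takes effect immediately without reloading the model.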
WebGPU on Linux requires a healthy Vulkan environment. Ensure your system meets these prerequisites:
- Drivers: Use the proprietary NVIDIA or AMD drivers (open-source Mesa drivers work but may require specific browser flags).
- Vulkan Loader: Ensure the Vulkan loader is installed:

  ```shell
  sudo apt install libvulkan1 mesa-vulkan-drivers
  ```
- Verification: Confirm Vulkan is active by running `vulkaninfo --summary`.
- Clone the repository and navigate to the project folder, then install the dependencies:

  ```shell
  npm install
  ```

- Start the development server:

  ```shell
  npm run dev
  ```

- Run lint:

  ```shell
  npm run lint
  ```

- Run the tests:

  ```shell
  npm run test
  ```

Due to strict shader validation in certain Linux/Mesa drivers, it is recommended to launch Chrome with these flags:
```shell
google-chrome --user-data-dir=/tmp/chrome-test \
  --enable-dawn-features=allow_unsafe_apis \
  --enable-webgpu-developer-features \
  http://localhost:5173
```

- `--user-data-dir`: Uses a fresh profile to avoid corrupted GPU shader caches.
- `--enable-dawn-features=allow_unsafe_apis`: Bypasses strict WGSL shader validation.
- `--enable-webgpu-developer-features`: Required for development.
- Framework: React 19 + TypeScript
- Database: Dexie.js (IndexedDB wrapper for persistent chat history)
- AI Engine: @mlc-ai/web-llm
- Rendering: React-Markdown + Syntax Highlighting
- Build Tool: Vite 7 (with `vite-plugin-wasm` and `vite-plugin-top-level-await`)
- Initial Load: The first run will download model shards (~100MB to ~3GB depending on model) and cache them in your browser.
- Hardware Limits: This app is pre-configured for devices with a `maxComputeInvocationsPerWorkgroup` limit of 256. If your hardware is more powerful, you can increase `maxNumPages` in the Hardware Lab.
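The Context Saturation readout reduces to a ratio of tokens held in the KV Cache to total page capacity. A sketch assuming a fixed tokens-per-page value (the 64-token page size and the `kvSaturation` name are assumptions for illustration; the real page size depends on the engine configuration):

```typescript
// Hypothetical saturation calculation: the engine's KV Cache is paged,
// so total capacity is maxNumPages * tokens-per-page. The 64-token page
// size here is an assumption, not the engine's actual value.
function kvSaturation(usedTokens: number, maxNumPages: number, tokensPerPage = 64): number {
  const capacity = maxNumPages * tokensPerPage;
  return Math.min(100, (usedTokens / capacity) * 100);
}
```

Under this assumption, 512 cached tokens against 16 pages is 50% saturation; a red alert would correspond to this value approaching 100%.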
```bibtex
@misc{ruan2024webllmhighperformanceinbrowserllm,
  title={WebLLM: A High-Performance In-Browser LLM Inference Engine},
  author={Charlie F. Ruan and Yucheng Qin and Xun Zhou and Ruihang Lai and Hongyi Jin and Yixin Dong and Bohan Hou and Meng-Shiun Yu and Yiyan Zhai and Sudeep Agarwal and Hangrui Cao and Siyuan Feng and Tianqi Chen},
  year={2024},
  howpublished={\url{https://github.com/mlc-ai/web-llm}},
  note={GitHub repository}
}
```
