A private, entirely local Large Language Model (LLM) dashboard running directly in your browser using WebGPU and WebAssembly.
This project leverages the @mlc-ai/web-llm engine to execute high-performance models like Llama-3.2 and Phi-3.5 entirely on the client side. No data ever leaves your device, and no API keys are required.
Built with the help of Gemini CLI. This is something to play with if you are learning about LLMs and AI. You don't need a backend to run it; it's all in the browser to keep things simple.
- High-Density UI: A compact, glassmorphic React interface optimized for 1920x1080 screens, featuring a sidebar-driven chat history and professional dashboard aesthetics.
- Hardware Accelerated: Uses WebGPU for near-native inference speeds on supported hardware.
- Privacy First: 100% local execution; your conversations never touch a server.
- Persistent Memory: Uses IndexedDB (via Dexie.js) to automatically save and restore your chat sessions, model preferences, and history across browser restarts.
- Markdown & Code Support: Rich text rendering with syntax-highlighted code blocks and one-click "Copy to Clipboard" functionality.
- Optimized for Compatibility:
  - Uses q4f32 (32-bit float) model variants to bypass `shader-f16` requirements on Linux/Mesa drivers.
  - Implements conservative KV Cache management to support integrated GPUs with strict compute limits.
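The persistence layer in the app goes through Dexie.js, but the shape of a stored session can be sketched in plain TypeScript. Everything below (the `ChatSession` type, the `touchSession` helper, the in-memory `Map`) is an illustrative stand-in, not the app's actual schema:

```typescript
// Hypothetical shape of a persisted chat session; the real app stores
// records like this in IndexedDB via Dexie.js rather than a Map.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatSession {
  id: string;
  model: string;          // selected model preference
  messages: ChatMessage[];
  updatedAt: number;      // used to restore the most recent session on reload
}

// In-memory stand-in for the IndexedDB table, for illustration only.
const sessions = new Map<string, ChatSession>();

function touchSession(id: string, model: string, msg: ChatMessage): ChatSession {
  const s = sessions.get(id) ?? { id, model, messages: [], updatedAt: 0 };
  s.messages.push(msg);
  s.updatedAt = Date.now();
  sessions.set(id, s);
  return s;
}
```

With Dexie, the same record type would back an IndexedDB table instead of the `Map`, which is what lets sessions survive browser restarts.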
The application includes a pre-configured model registry, allowing you to scale performance based on your available VRAM:
| Model Name | VRAM Req. | Best For... |
|---|---|---|
| Llama 3.2 1B (Default) | ~800MB | High-quality instructions and balanced performance. |
| Llama 3.2 3B (Pro) | ~2.5GB | Complex reasoning and high-intelligence tasks. |
| Phi 3.5 Mini (1k Context) | ~3.2GB | Logic-heavy tasks; optimized with a compact context window. |
| Qwen 2.5 0.5B | ~350MB | Extremely fast responses for simple utility tasks. |
| SmolLM2 135M | ~100MB | Ultra-lightweight testing and minimal resource usage. |
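The registry above can also drive simple automatic model selection. This sketch mirrors the table's VRAM figures; the `pickModel` helper and the registry object are illustrative, not the app's actual registry:

```typescript
// Illustrative registry mirroring the table above (VRAM in MB),
// ordered from smallest to largest.
const REGISTRY = [
  { name: "SmolLM2 135M", vramMB: 100 },
  { name: "Qwen 2.5 0.5B", vramMB: 350 },
  { name: "Llama 3.2 1B", vramMB: 800 },
  { name: "Llama 3.2 3B", vramMB: 2500 },
  { name: "Phi 3.5 Mini", vramMB: 3200 },
];

// Hypothetical helper: pick the largest model whose VRAM requirement
// fits the available budget; fall back to the smallest model.
function pickModel(budgetMB: number): string {
  const fitting = REGISTRY.filter((m) => m.vramMB <= budgetMB);
  return (fitting.length > 0 ? fitting[fitting.length - 1] : REGISTRY[0]).name;
}
```

For example, a 1 GB budget selects Llama 3.2 1B, while anything above ~3.2 GB selects Phi 3.5 Mini.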
The "Hardware Lab" is a real-time diagnostic suite providing deep visibility into the LLM's interaction with your GPU.
- Decode Speed: Real-time generation speed in Tokens per Second.
- Prefill Speed: How fast the model processes your prompt before generating.
- TTFT (Time To First Token): The "reaction time" of the model in milliseconds.
- TPOT (Time Per Output Token): The average latency between individual token generations.
- Context Saturation: Real-time percentage of the KV Cache used (alerts in Red if OOM is imminent).
- P:R Ratio (Prompt-to-Response): Analyzes the efficiency of your prompting relative to the generated length.
- Memory/Token: Precise measurement of VRAM consumed per generated token (typically ~0.20 KB/T).
- Local Savings: Estimated cost savings compared to using commercial GPT-4o APIs.
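The latency metrics above follow directly from three timestamps and a token count. A minimal sketch (the `computeTiming` function and its exact formulas are my own illustration, not the dashboard's internals):

```typescript
interface GenTiming {
  ttftMs: number;    // Time To First Token
  tpotMs: number;    // average Time Per Output Token
  decodeTps: number; // decode speed in tokens per second
}

// Hypothetical helper computing dashboard-style metrics from raw
// timestamps (in ms) and the number of generated tokens.
function computeTiming(
  promptSentAt: number,
  firstTokenAt: number,
  finishedAt: number,
  tokens: number
): GenTiming {
  const decodeMs = finishedAt - firstTokenAt;
  return {
    ttftMs: firstTokenAt - promptSentAt,
    tpotMs: decodeMs / tokens,
    decodeTps: tokens / (decodeMs / 1000),
  };
}
```

For example, `computeTiming(0, 250, 2250, 100)` yields a TTFT of 250 ms, a TPOT of 20 ms, and a decode speed of 50 tokens/second.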
Fine-tune the engine's behavior in real-time without restarting the model:
- KV Cache Management: Slider to adjust memory pages (lower = more stability, higher = longer context retention).
- Memory Depth (Context Slicing): Control exactly how many previous messages are sent to the AI to prevent crashes on low-VRAM devices.
- Creativity (Temperature): Direct control over the model's sampling randomness.
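Memory Depth amounts to plain array slicing before each request. In this sketch (`sliceContext` is a hypothetical name), the system prompt is always kept and only the last `depth` turns are sent to the model:

```typescript
interface Msg {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical context-slicing helper: keep any system prompt plus the
// most recent `depth` messages, dropping older turns so the KV Cache on
// low-VRAM devices is not exhausted.
function sliceContext(history: Msg[], depth: number): Msg[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-depth)];
}
```

Because the slice happens per request, lowering the depth takes effect immediately without reloading the model.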
WebGPU on Linux requires a healthy Vulkan environment. Ensure your system meets these prerequisites:
- Drivers: Use the proprietary NVIDIA or AMD drivers (open-source Mesa drivers work but may require specific browser flags).
- Vulkan Loader: Ensure the Vulkan loader is installed:

  ```shell
  sudo apt install libvulkan1 mesa-vulkan-drivers
  ```
- Verification: Confirm Vulkan is active by running `vulkaninfo --summary`.
- Clone the repository and navigate to the project folder, then install the dependencies:

  ```shell
  npm install
  ```

- Start the development server:

  ```shell
  npm run dev
  ```

- Run lint:

  ```shell
  npm run lint
  ```

- Run the tests:

  ```shell
  npm run test
  ```

Due to strict shader validation in certain Linux/Mesa drivers, it is recommended to launch Chrome with these flags:
```shell
google-chrome --user-data-dir=/tmp/chrome-test \
  --enable-dawn-features=allow_unsafe_apis \
  --enable-webgpu-developer-features \
  http://localhost:5173
```

- `--user-data-dir`: Uses a fresh profile to avoid corrupted GPU shader caches.
- `--enable-dawn-features=allow_unsafe_apis`: Bypasses strict WGSL shader validation.
- `--enable-webgpu-developer-features`: Required for development.
- Framework: React 19 + TypeScript
- Database: Dexie.js (IndexedDB wrapper for persistent chat history)
- AI Engine: @mlc-ai/web-llm
- Rendering: React-Markdown + Syntax Highlighting
- Build Tool: Vite 7 (with `vite-plugin-wasm` and `vite-plugin-top-level-await`)
- Initial Load: The first run will download model shards (~100MB to ~3GB depending on model) and cache them in your browser.
- Hardware Limits: This app is pre-configured for devices with a `maxComputeInvocationsPerWorkgroup` limit of 256. If your hardware is more powerful, you can increase `maxNumPages` in the Hardware Lab.
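The Context Saturation readout reduces to a ratio of tokens held in the KV Cache to total page capacity. A sketch assuming a fixed tokens-per-page value (the 64-token page size and the `kvSaturation` name are assumptions for illustration; the real page size depends on the engine configuration):

```typescript
// Hypothetical saturation calculation: the engine's KV Cache is paged,
// so total capacity is maxNumPages * tokens-per-page. The 64-token page
// size here is an assumption, not the engine's actual value.
function kvSaturation(usedTokens: number, maxNumPages: number, tokensPerPage = 64): number {
  const capacity = maxNumPages * tokensPerPage;
  return Math.min(100, (usedTokens / capacity) * 100);
}
```

Under this assumption, 512 cached tokens against 16 pages is 50% saturation; a red alert would correspond to this value approaching 100%.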
```bibtex
@misc{ruan2024webllmhighperformanceinbrowserllm,
  title={WebLLM: A High-Performance In-Browser LLM Inference Engine},
  author={Charlie F. Ruan and Yucheng Qin and Xun Zhou and Ruihang Lai and Hongyi Jin and Yixin Dong and Bohan Hou and Meng-Shiun Yu and Yiyan Zhai and Sudeep Agarwal and Hangrui Cao and Siyuan Feng and Tianqi Chen},
  year={2024},
  howpublished={\url{https://github.com/mlc-ai/web-llm}},
  note={GitHub repository}
}
```
