Random-Exception/BrowserLLM

🚀 Browser LLM: High-Performance Local Dashboard

A private, entirely local Large Language Model (LLM) dashboard that runs directly in your browser using WebGPU and WebAssembly.

This project leverages the @mlc-ai/web-llm engine to execute high-performance models like Llama-3.2 and Phi-3.5 entirely on the client side. No data ever leaves your device, and no API keys are required.

Built with the help of Gemini CLI. This is something to play with if you are learning about LLMs and AI. There is no backend to run; everything happens in the browser to keep things simple.

🛠 Features

  • High-Density UI: A compact, glassmorphic React interface optimized for 1920x1080 screens, featuring a sidebar-driven chat history and professional dashboard aesthetics.
  • Hardware Accelerated: Uses WebGPU for near-native inference speeds on supported hardware.
  • Privacy First: 100% local execution; your conversations never touch a server.
  • Persistent Memory: Uses IndexedDB (via Dexie.js) to automatically save and restore your chat sessions, model preferences, and history across browser restarts.
  • Markdown & Code Support: Rich text rendering with syntax-highlighted code blocks and one-click "Copy to Clipboard" functionality.
  • Optimized for Compatibility:
    • Uses q4f32 model variants (4-bit quantized weights with 32-bit float compute) to bypass shader-f16 requirements on Linux/Mesa drivers.
    • Implements conservative KV Cache management to support integrated GPUs with strict compute limits.

🧠 Available Models

The application includes a pre-configured model registry, allowing you to scale performance based on your available VRAM:

Model Name                   VRAM Req.   Best For
Llama 3.2 1B (Default)       ~800MB      High-quality instructions and balanced performance.
Llama 3.2 3B (Pro)           ~2.5GB      Complex reasoning and high-intelligence tasks.
Phi 3.5 Mini (1k Context)    ~3.2GB      Logic-heavy tasks; optimized with a compact context window.
Qwen 2.5 0.5B                ~350MB      Extremely fast responses for simple utility tasks.
SmolLM2 135M                 ~100MB      Ultra-lightweight testing and minimal resource usage.
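
As a sketch, the registry can be thought of as a small table of VRAM requirements plus a helper that picks the largest model that fits. The names and numbers below mirror the table above, but the `ModelEntry` shape and `pickModel` helper are illustrative, not the app's actual code:

```typescript
// Illustrative model registry; entries mirror the table above.
interface ModelEntry {
  name: string;
  vramMB: number; // approximate VRAM requirement
}

// Sorted ascending by VRAM requirement.
const MODEL_REGISTRY: ModelEntry[] = [
  { name: "SmolLM2 135M", vramMB: 100 },
  { name: "Qwen 2.5 0.5B", vramMB: 350 },
  { name: "Llama 3.2 1B", vramMB: 800 },
  { name: "Llama 3.2 3B", vramMB: 2500 },
  { name: "Phi 3.5 Mini", vramMB: 3200 },
];

// Pick the largest model that fits in the available VRAM,
// falling back to the smallest entry if nothing fits.
function pickModel(availableVramMB: number): ModelEntry {
  const fitting = MODEL_REGISTRY.filter((m) => m.vramMB <= availableVramMB);
  return fitting.length > 0 ? fitting[fitting.length - 1] : MODEL_REGISTRY[0];
}
```

For example, a device reporting ~1GB of free VRAM would land on Llama 3.2 1B, the default.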

📟 Hardware Laboratory & Real-time Telemetry

The "Hardware Lab" is a real-time diagnostic suite providing deep visibility into the LLM's interaction with your GPU.

Inference Performance (T/s)

  • Decode Speed: Real-time generation speed in Tokens per Second.
  • Prefill Speed: How fast the model processes your prompt before generating.
  • TTFT (Time To First Token): The "reaction time" of the model in milliseconds.
  • TPOT (Time Per Output Token): The average latency between individual token generations.
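
As an illustration of how these figures relate, all four metrics can be derived from three timestamps and a token count. The field and function names here are assumptions for the sketch, not the dashboard's internals:

```typescript
// Timing data for one generation; names are illustrative.
interface GenerationTiming {
  promptStartMs: number; // prefill begins
  firstTokenMs: number;  // first output token emitted
  endMs: number;         // generation finished
  outputTokens: number;  // total tokens generated
}

function inferenceMetrics(t: GenerationTiming) {
  // TTFT: "reaction time" from prompt submission to the first token.
  const ttftMs = t.firstTokenMs - t.promptStartMs;
  // Decode phase: everything after the first token.
  const decodeMs = t.endMs - t.firstTokenMs;
  // TPOT: average latency between consecutive output tokens.
  const tpotMs = t.outputTokens > 1 ? decodeMs / (t.outputTokens - 1) : 0;
  // Decode speed in tokens per second.
  const decodeTps = decodeMs > 0 ? ((t.outputTokens - 1) / decodeMs) * 1000 : 0;
  return { ttftMs, tpotMs, decodeTps };
}
```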

Efficiency & Economics

  • Context Saturation: Real-time percentage of the KV cache in use (shown in red when an out-of-memory condition is imminent).
  • P:R Ratio (Prompt-to-Response): Analyzes the efficiency of your prompting relative to the generated length.
  • Memory/Token: Precise measurement of VRAM consumed per generated token (typically ~0.20 KB/T).
  • Local Savings: Estimated cost savings compared to using commercial GPT-4o APIs.
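
For intuition, the first two gauges reduce to simple ratios, and the savings figure is a price-times-volume estimate. The sketch below uses placeholder per-token prices and assumed function names, not the app's actual rates or code:

```typescript
// KV cache usage as a percentage, clamped at 100.
function contextSaturation(usedTokens: number, kvCacheCapacity: number): number {
  return Math.min(100, (usedTokens / kvCacheCapacity) * 100);
}

// Prompt-to-Response ratio: prompt tokens per generated token.
function prRatio(promptTokens: number, responseTokens: number): number {
  return responseTokens > 0 ? promptTokens / responseTokens : Infinity;
}

// Estimated savings vs. a commercial API, in USD.
// The per-million-token prices are assumed placeholders, not quoted rates.
function localSavingsUSD(
  promptTokens: number,
  responseTokens: number,
  inputPricePerM = 2.5,
  outputPricePerM = 10
): number {
  return (promptTokens * inputPricePerM + responseTokens * outputPricePerM) / 1_000_000;
}
```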

Interactive "Lab Mode" Controls

Fine-tune the engine's behavior in real-time without restarting the model:

  • KV Cache Management: Slider to adjust memory pages (Lower = stability, Higher = long-term memory).
  • Memory Depth (Context Slicing): Control exactly how many previous messages are sent to the AI to prevent crashes on low-VRAM devices.
  • Creativity (Temperature): Direct control over the model's sampling randomness.
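
Context slicing can be pictured as keeping the system prompt plus only the last N messages of the conversation. A minimal sketch, under an assumed message shape:

```typescript
// Assumed message shape for the sketch.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep any system prompt(s), plus only the most recent `depth` messages.
// Everything older is dropped before the request reaches the engine,
// capping KV cache growth on low-VRAM devices.
function sliceContext(history: ChatMessage[], depth: number): ChatMessage[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-depth)];
}
```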

Screenshot

Example prompt: "Name some fruit and their benefits"

🐧 Linux GPU Requirements

WebGPU on Linux requires a healthy Vulkan environment. Ensure your system meets these prerequisites:

  1. Drivers: Use the proprietary NVIDIA or AMD drivers (open-source Mesa drivers work but may require specific browser flags).
  2. Vulkan Loader: Ensure the Vulkan loader is installed:
    sudo apt install libvulkan1 mesa-vulkan-drivers
  3. Verification: Confirm Vulkan is active by running vulkaninfo --summary.

🐧 Install and run

  1. Clone the repository, navigate to the project folder, and install dependencies:

    git clone https://github.com/Random-Exception/BrowserLLM.git
    cd BrowserLLM
    npm install
  2. Start the development server:

    npm run dev
  3. Run the linter:

    npm run lint
  4. Run the tests:

    npm run test

🐧 Running Chrome

Due to strict shader validation in certain Linux/Mesa drivers, it is recommended to launch Chrome with these flags:

google-chrome --user-data-dir=/tmp/chrome-test \
              --enable-dawn-features=allow_unsafe_apis \
              --enable-webgpu-developer-features \
              http://localhost:5173
  • --user-data-dir: Uses a fresh profile to avoid corrupted GPU shader caches.
  • --enable-dawn-features=allow_unsafe_apis: Bypasses strict WGSL shader validation.
  • --enable-webgpu-developer-features: Required for development.

🧠 Technical Stack

  • Framework: React 19 + TypeScript
  • Database: Dexie.js (IndexedDB wrapper for persistent chat history)
  • AI Engine: @mlc-ai/web-llm
  • Rendering: React-Markdown + Syntax Highlighting
  • Build Tool: Vite 7 (with vite-plugin-wasm and vite-plugin-top-level-await)

📊 Performance Notes

  • Initial Load: The first run will download model shards (~100MB to ~3GB depending on model) and cache them in your browser.
  • Hardware Limits: This app is pre-configured for devices with a maxComputeInvocationsPerWorkgroup limit of 256. If your hardware is more powerful, you can increase maxNumPages in the Hardware Lab.

Citation

@misc{ruan2024webllmhighperformanceinbrowserllm,
      title={WebLLM: A High-Performance In-Browser LLM Inference Engine}, 
      author={Charlie F. Ruan and Yucheng Qin and Xun Zhou and Ruihang Lai and Hongyi Jin and Yixin Dong and Bohan Hou and Meng-Shiun Yu and Yiyan Zhai and Sudeep Agarwal and Hangrui Cao and Siyuan Feng and Tianqi Chen},
      year={2024},
      howpublished={\url{https://github.com/mlc-ai/web-llm}},
      note={GitHub repository}
}
