Chrome ships a small local language model, Gemini Nano, built into the browser. It's exposed through a JavaScript API called the Prompt API (LanguageModel). No API key, no server, no external network calls once the model is downloaded. Everything runs on the user's machine.
This repo covers how to turn it on and what the API can do. It also includes scripts that automate setup and verification end to end.
git clone https://github.com/Ar9av/gemini-nano-chrome.git
cd gemini-nano-chrome
npm startOne command: launches Chrome with the right flags set (no manual chrome://flags clicking), starts the OpenAI-compatible API server, serves the chat UI, and opens it in a new tab. First run downloads the ~4 GB model, so give it a few minutes; npm start again afterward reuses the same Chrome instance and is instant.
Chat UI: http://localhost:8123
API server: http://localhost:8788/v1/chat/completions
Everything below explains what that command is doing and how to use each piece directly, in case you want more control than one script gives you.
| Chrome | 138+ (Dev or Canary channel recommended; this repo was verified on 149) |
| OS | Windows 10/11, macOS 13+, Linux, or ChromeOS on a Chromebook Plus |
| Storage | 22 GB free |
| GPU | 4 GB+ VRAM, or |
| CPU | 16 GB+ RAM and 4+ cores |
| Network | Unmetered connection for the one-time ~4 GB model download |
- Open
chrome://flags/#optimization-guide-on-device-modeland set it to Enabled BypassPerfRequirement - Open
chrome://flags/#prompt-api-for-gemini-nanoand set it to Enabled - Relaunch Chrome (the flags page has a button for this)
That's the whole setup. Everything else happens through JavaScript.
Open DevTools on any page and run:
const availability = await LanguageModel.availability();
console.log(availability); // "unavailable" | "downloadable" | "downloading" | "available"
const session = await LanguageModel.create();
const answer = await session.prompt("What are you, in one sentence?");
console.log(answer);
session.destroy();The first create() call starts the model download. availability() only reports state; it doesn't advance the download on its own. See examples/basic-prompt.js for a version that also reports download progress.
web/index.html is a self-contained chat interface, no build step, no server required for the model itself. It uses LanguageModel directly from the page, with one session kept alive across the whole conversation so follow-up questions have real context. npm start serves this automatically; to serve it on its own instead:
cd web && python3 -m http.server 8123Open http://localhost:8123 in Chrome (with the flags from above already enabled). It shows the model-download progress bar on first run, then a chat thread: type a message, get a streamed reply, "New chat" resets the session.
A plain file:// open works in some Chrome versions too, but the Prompt API expects a secure context, so serving it over http://localhost is the reliable option.
Want the same chat as an installable extension instead, no server, no npm start at all? See extension/.
Constrain the response to a JSON Schema with responseConstraint, and you get back parseable JSON instead of free-form prose:
const session = await LanguageModel.create();
const schema = {
type: "object",
properties: {
sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
confidence: { type: "number" },
},
required: ["sentiment", "confidence"],
};
const result = await session.prompt(
'Classify the sentiment of: "This new on-device AI is shockingly fast and works offline."',
{ responseConstraint: schema }
);
JSON.parse(result); // { sentiment: "positive", confidence: 0.95 }Full example: examples/structured-output.js.
const session = await LanguageModel.create();
const stream = session.promptStreaming("List 3 advantages of on-device AI.");
for await (const chunk of stream) console.log(chunk);Full example: examples/streaming.js.
Gemini Nano only runs inside a browser tab, so a Python script or an existing OpenAI client can't call it directly. server/index.js bridges that gap: a small HTTP server that speaks the OpenAI chat completions format on one side and drives a real Chrome tab over CDP on the other.
node server/index.js # listens on http://localhost:8788
PORT=8080 node server/index.jsIt launches Chrome itself on the first request if nothing is running yet, the same way tools/enable-flags.js does. npm start runs this alongside the chat UI in one process; use node server/index.js directly when you only need the API, not the browser-based chat.
| Endpoint | Behavior |
|---|---|
POST /v1/chat/completions |
Standard request/response shape, stream: true for SSE |
GET /v1/models |
Lists a single model, gemini-nano |
GET /health |
Reports LanguageModel.availability() |
curl http://localhost:8788/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "What are you, in one sentence?"}]
}'That also means it works with the official OpenAI SDKs, by pointing them at the local server instead of api.openai.com:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8788/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemini-nano",
messages=[{"role": "user", "content": "What are you, in one sentence?"}],
)
print(response.choices[0].message.content)A plain-fetch version with streaming is in examples/openai-client.js.
response_format with a json_schema maps to responseConstraint, so the structured-output example above works the same way through the server:
curl http://localhost:8788/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Classify the sentiment of: \"This works great.\""}],
"response_format": {
"type": "json_schema",
"json_schema": { "name": "sentiment", "schema": {
"type": "object",
"properties": { "sentiment": {"type": "string"}, "confidence": {"type": "number"} },
"required": ["sentiment", "confidence"]
}}
}
}'The full message history (system/user/assistant) maps to a real Prompt API session via initialPrompts, so multi-turn conversations carry context correctly. Beyond that:
temperatureandtop_kare honored only when both are set together, and only take effect outside flags-only dev mode (Origin Trial or extension context). In plain flags mode the request still succeeds, with the default sampling instead.stopsequences are matched client-side against the streamed text. Once matched, the server stops forwarding further tokens to you, but Chrome keeps generating internally until it finishes.usagenumbers come fromsession.contextUsagebefore and after the call, which is an estimate, not an exact token count from Gemini Nano's own tokenizer.- No image input, no
max_tokens, non(multiple completions), no/v1/completionsor/v1/embeddings. Gemini Nano's Prompt API doesn't expose the hooks those need. - Each request creates a fresh session built from the messages you sent. The server itself holds no conversation state between requests.
If you already ran tools/enable-flags.js or tools/smoke-test.js with a custom PROFILE_DIR, set the same value when starting the server, or it launches a separate Chrome profile and downloads the model again:
PROFILE_DIR=/tmp/my-profile node server/index.jsClicking through chrome://flags by hand works fine once, but it gets old if you're testing repeatedly or want a reproducible setup. tools/ has two Node scripts that drive a real Chrome instance over the Chrome DevTools Protocol (CDP) instead. (npm start and server/index.js use the same underlying module, tools/chrome.js, to launch Chrome automatically.)
node tools/enable-flags.js # launches Chrome in a throwaway profile with both flags set
node tools/smoke-test.js # triggers the download, waits for it, and runs a real promptenable-flags.js launches Chrome with --user-data-dir pointed at a fresh profile, so it never touches your normal browser session or settings, plus --remote-debugging-port. It then sets the flags by clicking the same dropdowns you'd click by hand, driven through Runtime.evaluate. Chrome's internal pages like chrome://flags and chrome://on-device-internals are built from web components whose content lives in shadow DOM, so the script walks shadow roots recursively to find the controls.
smoke-test.js connects to that instance, calls LanguageModel.create(), and polls chrome://on-device-internals for the install progress while it waits. That confirms the ~4 GB download completes instead of watching a "downloading" status with no further detail.
Both scripts need Node 22+ (for the built-in fetch and WebSocket globals) and no dependencies.
# defaults to /Applications/Google Chrome.app on macOS, google-chrome on Linux
CHROME_BIN=/path/to/chrome PROFILE_DIR=/tmp/my-profile node tools/enable-flags.jsOnce installed, chrome://on-device-internals (enable internal debug pages first via chrome://chrome-urls) shows the model in use:
Model Name: v3Nano
Backend Type: GPU (highest quality)
Folder size: ~4,072 MiB
These numbers come from measuring this build directly. LanguageModel.params() isn't available here, so context window and quota behavior come from session.contextWindow / session.contextUsage and from hitting the limit and reading the error.
Context window: 9216 tokens, shared between input and output. Track usage with session.contextUsage / session.contextWindow. Going over it doesn't truncate silently:
await session.prompt(hugeText); // hugeText tokenizes to >9216
// throws QuotaExceededError: "The input is too large."
// err.requested has the token count that was attemptedGeneration speed: ~50 tokens/sec, measured with promptStreaming() and session.contextUsage deltas across three 300-word-essay runs (524, 525, 513 completion tokens; 47.3, 51.1, 51.5 tok/s). Short answers tend a bit higher, 56+ tok/s on a two-sentence explanation.
Time to first token: ~80-90ms once the model is warm. A prompt near the context window limit (~3000 input tokens) takes ~2.9s to first token. That's prefill time, roughly 1000 tokens/sec for processing input, much faster than the token-by-token decoding used for generating output.
Tested on:
| Machine | MacBook Pro, Apple M4 Pro |
| CPU | 14 cores (10 performance + 4 efficiency) |
| GPU | 20 cores |
| RAM | 24 GB unified memory |
| Backend | GPU, highest quality mode (chrome://on-device-internals reports this as the selected backend automatically) |
Generation speed scales with the GPU backend Chrome selects for your hardware. Lower-end devices, or devices that fall back to CPU inference, will be slower; there's no documented way to query expected throughput ahead of time short of running a prompt and timing it, the way the numbers above were produced.
The same on-device model backs several task-specific APIs, each scoped to a narrower job than the general-purpose Prompt API:
| API | Purpose |
|---|---|
| Summarizer | Condense text into headlines, summaries, or key points |
| Writer | Generate new text from a prompt |
| Rewriter | Revise existing text per instructions |
| Proofreader | Check spelling/grammar and suggest corrections |
| Translator | Translate between languages |
| Language Detector | Identify the language of a string |
Translator and Language Detector are desktop-only; the rest share the Prompt API's platform requirements above.
Flags only work for you, locally. For a production site, you need either:
- An Origin Trial token registered for your origin, or
- Distribution as a Chrome Extension, which gets stable access without a trial
Either way, treat the API as progressive enhancement: check availability() and provide a fallback for browsers or devices where it returns "unavailable".
LanguageModel is not defined: one of the two flags isn't set, or Chrome wasn't relaunched after setting them. Check chrome://version to confirm which flags are active.
availability() stays on "downloading" forever: expected if you never called create(). The model downloads only in response to a create() call from a JavaScript context.
LanguageModel.params is not a function: some Chrome builds don't expose this method yet, despite it appearing in the docs. Check Object.getOwnPropertyNames(LanguageModel) for what your version supports before relying on it.
chrome://on-device-internals says debugging pages are disabled: visit chrome://chrome-urls first and click "Enable internal debugging pages."
MIT, see LICENSE.