v0.2.0 Release Notes

The first follow-up to the initial release brings a new mid-tier model, opt-in step-by-step reasoning, a completely rebuilt download pipeline, a real settings panel for managing local model storage, and a layout that finally adapts to your window size.

New Models & Context Defaults

Gemma 4 12B added: A dense mid-tier variant (mlx-community/gemma-4-12B-it-4bit, ~11 GB on disk) now sits between E4B and the 26B MoE. Default 32K context, native max 256K. Sharper reasoning than E4B without the MoE's RAM cliff. Comfortable on a 24 GB Mac; tight at 16 GB.
Context window defaults rebalanced: E4B drops 32K → 16K to leave KV-cache headroom on 8 GB Macs. The 26B MoE drops 128K → 64K to keep its boot-time reservation manageable. 31B rises 32K → 64K to align with the other dense variants. All five models still expose their full native max (128K for E2B/E4B, 256K for 12B / 26B MoE / 31B) via the Context Window slider.
Disk sizes refined across the registry: E2B ~4 GB, E4B ~5.5 GB, 12B 11 GB, 26B MoE ~16 GB, 31B ~18 GB. The "27B MoE" label was also corrected to "26B MoE" to match Google's naming.

Thinking Mode (Opt-In)

Gemma 4's built-in step-by-step reasoning is now plumbed end-to-end:

Settings → Thinking → Enabled flips enable_thinking: true on every chat request. The model emits its reasoning trace before the final answer.
Collapsible Thinking block appears above the answer. The header cycles through "Thinking → Considering → Planning → Pondering → Reasoning → Sketching" while the trace streams in, then collapses to "Thought process" once the final answer begins. Click any time to re-expand.
Full markdown inside the block — syntax-highlighted code, KaTeX math, GFM tables, clickable links — same renderer as the main answer.
No context-window leak: Past thinking traces are stripped at the IPC boundary on every subsequent turn (defense in depth: Gemma 4's chat template also strips them server-side via its strip_thinking() macro). Reasoning lives in the UI only.

Rebuilt Download Pipeline

The first-launch download was replaced end-to-end after the deprecated huggingface-cli stopped working:

Manifest-driven byte progress: Total size is fetched from HfApi.model_info() before the first byte streams, so the progress bar is correct from frame zero — no more "stuck at 0%".
Resume detection: Partial blobs from interrupted previous attempts are detected on next launch; the UI swaps the verb from "Downloading" to "Resuming" and only fetches what's missing.
Cancel button: Interrupt a download mid-stream and drop back to the model picker — no orphaned Python children, no half-written blobs in an indeterminate state.
Authoritative completion marker: A .gemx-verified file is written only after snapshot_download(..., local_files_only=True) integrity-checks the cache. Subsequent launches trust this marker exactly — no more "model present but didn't load" edge cases.

Self-Healing MLX Runtime

Auto-upgrade on version mismatch: When the bundled mlx-vlm is too old to load a model (e.g., gemma4_unified requires ≥ 0.6.1), the app catches the typed error, runs pip install --upgrade, and retries boot once. No restart, no manual venv surgery.
Lifecycle hardening: A single-flight server wrapper handles dev hot-reload double-mounts cleanly. Orphaned port holders from previous force-quits are reclaimed via lsof before the new server binds. On app quit, the Python child gets a synchronous SIGKILL so nothing leaks across runs.

Settings Panel Expansion

Downloaded Models: Every cached model is now listed with size and active/default tags, and one-click delete for any that aren't currently loaded. Reclaim disk space without leaving the app.
Context Window: Power-of-2 slider from 4K up to the model's native max, with a Reset button to fall back to the default.
Tavily API Key: Paste your free Tavily key (1,000 searches/month, no card required) for faster, more reliable web research. DuckDuckGo remains the fallback when no key is set.
HuggingFace Token: Optional token field that speeds up gated-repo downloads.
Thinking: Toggle described above.

Responsive Layout

The whole app now adapts to its window size instead of holding a fixed-width column:

Minimum window 700×500 (was 820×560) — usable in split-screen and on smaller displays.
Sidebar tiers: w-60 on roomy windows, w-52 between 640–768px, auto-collapsed below 640px on first launch (⌘B and the chevron still toggle manually).
Settings goes horizontal at ≥ 1024px: The modal becomes a wide 2-column layout — HF Token, Web Search, Thinking, and Context Window on the left, Downloaded Models on the right — instead of a tall scrolling stack.
Setup screens go horizontal at ≥ 1024px: The welcome screen splits into intro / model picker; the in-progress screen splits into stages + error / download progress + cancel.
Chat area scales padding, the header model dropdown shrinks gracefully, and the composer wraps cleanly at very narrow widths.

UX Polish

Manual stop now shows a clear "Response stopped" indicator instead of leaving the bubble blank.
Generation errors show a clean "Something went wrong while generating the response." card instead of streaming raw stack traces into chat.
Welcome card padding fixed for short windows.
LM Studio comparison updated to reflect their new mlx-vlm support. GemX's remaining differentiators: on-device Whisper voice input, the multi-step web research workflow with inline [N] citations, and the MIT open-source license.

Please see the README in the main repository for complete installation instructions, system requirements, updated model memory specifications, and troubleshooting steps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GemX-v0.2.0

Choose a tag to compare

Sorry, something went wrong.