GemX-v0.2.0
v0.2.0 Release Notes
The first follow-up to the initial release brings a new mid-tier model, opt-in step-by-step reasoning, a completely rebuilt download pipeline, a real settings panel for managing local model storage, and a layout that finally adapts to your window size.
New Models & Context Defaults
- Gemma 4 12B added: A dense mid-tier variant (
mlx-community/gemma-4-12B-it-4bit, ~11 GB on disk) now sits between E4B and the 26B MoE. Default 32K context, native max 256K. Sharper reasoning than E4B without the MoE's RAM cliff. Comfortable on a 24 GB Mac; tight at 16 GB. - Context window defaults rebalanced: E4B drops 32K → 16K to leave KV-cache headroom on 8 GB Macs. The 26B MoE drops 128K → 64K to keep its boot-time reservation manageable. 31B rises 32K → 64K to align with the other dense variants. All five models still expose their full native max (128K for E2B/E4B, 256K for 12B / 26B MoE / 31B) via the Context Window slider.
- Disk sizes refined across the registry: E2B ~4 GB, E4B ~5.5 GB, 12B 11 GB, 26B MoE ~16 GB, 31B ~18 GB. The "27B MoE" label was also corrected to "26B MoE" to match Google's naming.
Thinking Mode (Opt-In)
Gemma 4's built-in step-by-step reasoning is now plumbed end-to-end:
- Settings → Thinking → Enabled flips
enable_thinking: trueon every chat request. The model emits its reasoning trace before the final answer. - Collapsible Thinking block appears above the answer. The header cycles through "Thinking → Considering → Planning → Pondering → Reasoning → Sketching" while the trace streams in, then collapses to "Thought process" once the final answer begins. Click any time to re-expand.
- Full markdown inside the block — syntax-highlighted code, KaTeX math, GFM tables, clickable links — same renderer as the main answer.
- No context-window leak: Past thinking traces are stripped at the IPC boundary on every subsequent turn (defense in depth: Gemma 4's chat template also strips them server-side via its
strip_thinking()macro). Reasoning lives in the UI only.
Rebuilt Download Pipeline
The first-launch download was replaced end-to-end after the deprecated huggingface-cli stopped working:
- Manifest-driven byte progress: Total size is fetched from
HfApi.model_info()before the first byte streams, so the progress bar is correct from frame zero — no more "stuck at 0%". - Resume detection: Partial blobs from interrupted previous attempts are detected on next launch; the UI swaps the verb from "Downloading" to "Resuming" and only fetches what's missing.
- Cancel button: Interrupt a download mid-stream and drop back to the model picker — no orphaned Python children, no half-written blobs in an indeterminate state.
- Authoritative completion marker: A
.gemx-verifiedfile is written only aftersnapshot_download(..., local_files_only=True)integrity-checks the cache. Subsequent launches trust this marker exactly — no more "model present but didn't load" edge cases.
Self-Healing MLX Runtime
- Auto-upgrade on version mismatch: When the bundled
mlx-vlmis too old to load a model (e.g.,gemma4_unifiedrequires ≥ 0.6.1), the app catches the typed error, runspip install --upgrade, and retries boot once. No restart, no manual venv surgery. - Lifecycle hardening: A single-flight server wrapper handles dev hot-reload double-mounts cleanly. Orphaned port holders from previous force-quits are reclaimed via
lsofbefore the new server binds. On app quit, the Python child gets a synchronous SIGKILL so nothing leaks across runs.
Settings Panel Expansion
- Downloaded Models: Every cached model is now listed with size and active/default tags, and one-click delete for any that aren't currently loaded. Reclaim disk space without leaving the app.
- Context Window: Power-of-2 slider from 4K up to the model's native max, with a Reset button to fall back to the default.
- Tavily API Key: Paste your free Tavily key (1,000 searches/month, no card required) for faster, more reliable web research. DuckDuckGo remains the fallback when no key is set.
- HuggingFace Token: Optional token field that speeds up gated-repo downloads.
- Thinking: Toggle described above.
Responsive Layout
The whole app now adapts to its window size instead of holding a fixed-width column:
- Minimum window 700×500 (was 820×560) — usable in split-screen and on smaller displays.
- Sidebar tiers:
w-60on roomy windows,w-52between 640–768px, auto-collapsed below 640px on first launch (⌘B and the chevron still toggle manually). - Settings goes horizontal at ≥ 1024px: The modal becomes a wide 2-column layout — HF Token, Web Search, Thinking, and Context Window on the left, Downloaded Models on the right — instead of a tall scrolling stack.
- Setup screens go horizontal at ≥ 1024px: The welcome screen splits into intro / model picker; the in-progress screen splits into stages + error / download progress + cancel.
- Chat area scales padding, the header model dropdown shrinks gracefully, and the composer wraps cleanly at very narrow widths.
UX Polish
- Manual stop now shows a clear "Response stopped" indicator instead of leaving the bubble blank.
- Generation errors show a clean "Something went wrong while generating the response." card instead of streaming raw stack traces into chat.
- Welcome card padding fixed for short windows.
- LM Studio comparison updated to reflect their new
mlx-vlmsupport. GemX's remaining differentiators: on-device Whisper voice input, the multi-step web research workflow with inline[N]citations, and the MIT open-source license.
Please see the README in the main repository for complete installation instructions, system requirements, updated model memory specifications, and troubleshooting steps.