Phone-native AI Coding Harness
The agent harness runs on the phone. Models can be remote; the coding loop, files, previews, runtime routing, and shipping controls stay in MobileCode.
MobileCode is not a phone skin over a remote IDE: it puts the agent loop, tool state, files, preview, and release control plane on the phone itself.
Watch 15s Short · Watch 9:16 Promo Video · README Motion Cover · HTML Principle Video · Download v0.1.68 apps · GitHub Pages Demo
Demo Lab · 2048 Demo · GitHub Test · Dual app build · Download app builds · Release QA
15-second Remotion teaser with voiceover. Full explainer covers demand, pain, RuntimeProvider, and GitHub-first shipping.
MobileCode is packaged as a phone-native coding harness: a small mobile companion for writing, previewing, running agent tasks, and shipping artifacts from the same handheld workspace.
If your Markdown viewer does not embed video, open mobilecode-product-walkthrough.mp4.
| Runs on the phone | Remote by choice | GitHub-first shipping |
|---|---|---|
| Agent trace, tool selection, runtime routing, local files, WebView preview, result cards | Model provider, optional Cloud Runtime, external Termux/Helper backends | Repo discovery, Contents API commits, Pages publish, Actions builds, release artifacts |
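MobileCode's first principle is simple: a phone is a poor place to cram a full desktop build environment, but a very good place to host the local harness for AI coding.
It is not a mobile shell for Codex Remote, Claude Remote, or a cloud IDE. The model can come from a cloud provider, but conversation, tool orchestration, runtime selection, file persistence, WebView preview, GitHub publishing, and recovery prompts all close the loop inside the phone app.
It hands the heaviest parts to external platforms and keeps the parts closest to the user on the phone: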
| Layer | MobileCode does | External layer does |
|---|---|---|
| Phone-native harness | Chat, tool trace, role cards, file cards, preview, runtime diagnostics, settings | None |
| Local runtime | Helper / Termux / WebViewOnly through RuntimeProvider | Shell, logs, small local tasks |
| GitHub-first workspace | Repo Hub, watchlist, remote-linked folders, Pages publish cards | Repos, Contents API commits, Actions builds, artifacts |
| Web artifacts | Generate HTML, run publish readiness checks, open browser/WebView | GitHub Pages hosting |
| Heavy builds | Show workflow status, jobs, artifacts | GitHub Actions APK/Web/release builds |
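The latest PhoneWorld research moves the bottleneck for phone-use agents from "can the model tap through a phone" to "who can provide controllable environments, tasks, verifiers, trajectories, and training/evaluation harnesses at scale". This is not a direct endorsement of MobileCode, but it clearly points in one direction: the core asset for the next stage of phone agents is an executable, reproducible, verifiable harness.
MobileCode enters the same trend through AI coding: models can be remote and heavy builds can go to GitHub Actions, but sessions, tool traces, files, HTML/Markdown preview, runtime routing, GitHub publishing, build artifacts, and result evidence need to form a closed loop on the phone.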
- Paper: PhoneWorld: Scaling Phone-Use Agent Environments
- Local PDF: docs/research/phoneworld-scaling-phone-use-agent-environments-2605.29486.pdf
- MobileCode analysis: PhoneWorld and the Mobile Harness Era
- Long-term roadmap: Mobile Harness Long-Term Roadmap
- ICLR draft: PDF · TeX
- Anonymous supplement boundary: include/exclude and redaction gate
- Current anonymous supplement: paper/iclr-mobile-harness/build/mobile-harness-anonymous-supplement.zip (staged file count and byte size are emitted by the supplement script)
- Benchmark seed: MobileHarnessBench
- v1 task bank: 200 MobileHarnessBench candidate tasks
- v2 task bank: 1000 MobileHarnessBench candidate tasks
- v2 quality audit: machine audit report
- Verifier contracts: machine-readable catalog · coverage readiness
- Baseline protocol: comparison readiness
- Baseline run contract: result schema readiness
- Baseline scaffold: not-run scaffold manifest
- Baseline T0 dry run: not-counted dry-run manifest
- Baseline pilot pack: prompt and evidence templates
- Baseline pilot readiness: non-counted readiness gate
- Core claim readiness: positioning claim boundary
- Evidence maturity: claim maturity matrix
- Evaluation protocol readiness: E1-E5 machine-checkable protocol
- Method presentation readiness: visuals, algorithms, modules and formulas gate
- Bibliography readiness: verified related-work metadata
- Threats to validity: review risk matrix
- Page-limit readiness: compiled PDF page boundary
- Reproducibility checklist: command-to-artifact matrix
- Submission readiness: draft upload gate
- Paper claim ledger: claim-to-evidence map
- Mobile-tier readiness: Android/iOS readiness probe
- Mobile evidence pack: T2/T3 capture templates · execution playbook
- Draft frozen subset: planning manifest · readiness report
- Mobile test strategy: Android/iOS benchmark tiers
- Simulator launcher reference: simutil
- MobileCode Skill Spec: SKILL.md + WebView script + permission + verifier contract
- Harness Task Registry: task metadata for Tools, sheets, routes, skills and benchmark evidence
- v0 dry run evidence: 2026-06-06 representative run
- smoke-v2 T0 evidence: 2026-06-06 60-task smoke run
Recent on-device AI applications are moving from plain chat demos toward task galleries, skill packages, tool bridges, model/runtime management and benchmark views. Google AI Edge Gallery is a useful public example of this product shape: it organizes on-device models around tasks, custom tasks, skills, MCP tooling and benchmark surfaces. MobileCode adopts the pattern but changes the object of evaluation. Instead of becoming a general model gallery, MobileCode turns phone-native AI coding into a harness: incoming files, artifact editing, HTML/Markdown preview, GitHub delivery, runtime routing, verifier contracts and evidence reports.
The practical design consequence is now explicit in the repo:
- Skills use SKILL.md, scripts/index.html, permission tokens and verifier contracts.
- Tools, sheets and pages are promoted into a Harness Task Registry instead of remaining one-off buttons.
- Benchmark Lab is becoming an in-app surface for MobileHarnessBench status, task tiers and evidence boundaries.
- Claims remain evidence-bound: T0 fixture runs, mobile tiers, GitHub sandbox delivery and baseline comparison are reported separately.
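As a rough illustration of the skill layout in the first bullet, the sketch below checks that a skill folder contains SKILL.md and scripts/index.html and that a manifest declares permissions and a verifier. The manifest keys (`permissions`, `verifier`) and the function name `check_skill_package` are hypothetical and chosen only for this example; the authoritative format is the MobileCode Skill Spec.

```python
from pathlib import Path

# Files named in the skill description above.
REQUIRED_FILES = ("SKILL.md", "scripts/index.html")
# Hypothetical manifest keys, used only for this illustration.
REQUIRED_KEYS = ("permissions", "verifier")


def check_skill_package(root: Path, manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the package looks complete."""
    problems = []
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")
    for key in REQUIRED_KEYS:
        # An absent or empty value is treated as undeclared.
        if not manifest.get(key):
            problems.append(f"missing manifest key: {key}")
    return problems
```

Returning a list of problems rather than a boolean lets a registry screen show every gap at once instead of failing on the first.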
These thumbnails are generated from the live GitHub Pages demos with just-thumbnail, so the README shows rendered pages rather than mock claims.
| Scene | What to try | Link |
|---|---|---|
| Demo Lab | A static landing page for published mobile demos | Open demo lab |
| 2048 Web | Touch-first generated HTML game, useful for WebView and mobile layout checks | Play 2048 |
| GitHub Test | Verify token identity, repo access, and Pages readiness from a browser | Open GitHub test |
| Repo Hub | Watch repos, map them to mobilecode_projects/github/<owner>/<repo>/, inspect Actions, edit files through GitHub API | mobile_agent/lib/screens/github_repo_hub_screen.dart |
| Published Work Card | After Pages publish, show Pages URL, repo URL, local file path, browser open, copy/share, and redeploy actions | mobile_agent/lib/screens/home_screen.dart |
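The Repo Hub row maps each watched repo to mobilecode_projects/github/<owner>/<repo>/. A minimal sketch of that mapping follows. The function name `local_repo_folder` and the validation rules are assumptions for illustration; the app itself is written in Flutter, so this is not its actual code.

```python
from pathlib import PurePosixPath

# Folder prefix taken from the Repo Hub row above.
ROOT = PurePosixPath("mobilecode_projects/github")


def local_repo_folder(full_name: str) -> PurePosixPath:
    """Map 'owner/repo' to mobilecode_projects/github/<owner>/<repo>."""
    owner, sep, repo = full_name.partition("/")
    # Reject anything that is not exactly two plain path segments.
    if not sep or not owner or not repo or "/" in repo or ".." in (owner, repo):
        raise ValueError(f"expected 'owner/repo', got {full_name!r}")
    return ROOT / owner / repo


print(local_repo_folder("octocat/hello-world"))
# mobilecode_projects/github/octocat/hello-world
```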
flowchart LR
A["User prompt on phone"] --> B["AI generates HTML / code artifact"]
B --> C["Local WebView preview"]
C --> D["HTML publish readiness check"]
D --> E["GitHub Pages publish"]
E --> F["Shareable work card"]
B --> G["GitHub Repo Hub"]
G --> H["Contents API edit + commit"]
G --> I["GitHub Actions workflow_dispatch"]
I --> J["Jobs, logs, artifacts"]
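The "HTML publish readiness check" step in the flow above could look roughly like this sketch. The two rules it applies (a non-empty `<title>` and a viewport meta tag) are assumptions chosen because they matter for mobile pages; the README does not list the checks MobileCode actually runs.

```python
from html.parser import HTMLParser


class _ReadinessScan(HTMLParser):
    """Collects the page title and whether a viewport meta tag is present."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.has_viewport = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (dict(attrs).get("name") or "").lower() == "viewport":
            self.has_viewport = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def publish_readiness(html: str) -> list[str]:
    """Return a list of problems; an empty list means the page passes these checks."""
    scan = _ReadinessScan()
    scan.feed(html)
    scan.close()
    problems = []
    if not scan.title.strip():
        problems.append("missing <title>")
    if not scan.has_viewport:
        problems.append("missing viewport meta tag")
    return problems
```

A check like this runs on the generated string alone, so it can gate the Pages publish step without any runtime backend.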
- Runtime abstraction: RuntimeProvider, RuntimeManager, Helper, External Termux, planned Embedded Lite, Cloud, and WebViewOnly fallback.
- MobileCode Helper prototype: health, execute, streaming logs, task stop, task state, preflight checks.
- Chat and agent process UI: model call progress, stop control, trace cards, generated artifact cards.
- HTML-first generation: built-in HTML/UI skill context, publish readiness checks, WebView preview, browser open, GitHub Pages publish.
- GitHub-first workspace: repo list, watchlist, language/Pages/local filters, local existence status, Remote-linked folder marker.
- GitHub Actions surface: workflows, latest run status, jobs/steps, workflow dispatch, artifact zip download record.
- API-backed file flow: browse remote tree, read text files, edit, commit via GitHub Contents API, reload on SHA conflict.
- Extension management: Roles, Skill, MCP, Memory, Agent, Hook Registry surfaces for role-based workflows.
- Observability: RR AgentView, pending role approvals, Token Usage/cache-hit statistics, searchable/sortable LiteLLM-style pricing with manual snapshot checks, and Device Telemetry htop-style phone health.
- Lark CLI connector: opt-in diagnostics and structured dry-run action model.
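The API-backed file flow in the list above (edit, commit via the GitHub Contents API, reload on SHA conflict) can be sketched as follows. The request shape follows GitHub's "create or update file contents" endpoint. The helper names, the injected `send` and `reload_remote` callables, and the assumption that a stale SHA surfaces as HTTP 409 are illustrative only, not MobileCode's implementation.

```python
import base64
import json


def contents_put_request(owner, repo, path, text, message, sha=None, branch=None):
    """Build (method, url, body) for a Contents API file create/update."""
    body = {
        "message": message,
        # The API expects the new file content base64-encoded.
        "content": base64.b64encode(text.encode("utf-8")).decode("ascii"),
    }
    if sha is not None:
        body["sha"] = sha  # blob SHA of the file being replaced
    if branch is not None:
        body["branch"] = branch
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    return "PUT", url, json.dumps(body)


def commit_or_reload(send, reload_remote, request):
    """Send the commit; on a conflict, return the freshly reloaded remote state.

    `send(method, url, body)` returns an HTTP status code and `reload_remote()`
    re-reads the remote file. Both are injected so the sketch needs no network.
    """
    status = send(*request)
    if status == 409:  # assumed signal for a stale SHA
        return "conflict", reload_remote()
    return ("committed" if status in (200, 201) else "failed"), status
```

On a conflict the sketch hands the reloaded file back instead of retrying blindly, so the user can re-apply the edit rather than silently overwrite someone else's commit.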
flowchart TB
UI["Flutter App\nChat · Files · Preview · Settings"] --> RM["RuntimeManager"]
RM --> H["MobileCode Helper\nAndroid foreground service / daemon"]
RM --> T["External Termux\nfallback shell"]
RM --> W["WebViewOnly\npreview-only fallback"]
RM --> C["Cloud Runtime\nheavy tasks later"]
UI --> GH["GitHub Deep Service"]
GH --> Repo["Repos / Contents API"]
GH --> Pages["GitHub Pages"]
GH --> Actions["GitHub Actions"]
Actions --> Artifacts["APK / Web / release artifacts"]
Open:
cd app
npm install
npm run build
Local Flutter SDK is required:
cd mobile_agent
flutter pub get
flutter create --platforms=android,ios .
flutter build apk --release
For release QA, prefer GitHub Actions so the build is reproducible:
python scripts/generate_mobile_harness_task_bank.py
python scripts/run_mobile_harness_bench.py --task-set representative-v0 --run-id 2026-06-06-v0-dry-run
python scripts/run_mobile_harness_bench.py --task-set smoke-v2 --run-id 2026-06-06-smoke-v2-t0
python scripts/audit_mobile_harness_task_bank.py
python scripts/collect_mobile_harness_mobile_tier_evidence.py
python scripts/generate_mobile_harness_mobile_evidence_pack.py
python scripts/generate_mobile_harness_frozen_subset.py
python scripts/generate_mobile_harness_verifier_contract_readiness.py
python scripts/generate_mobile_harness_baseline_protocol.py
python scripts/generate_mobile_harness_baseline_run_contract.py
python scripts/generate_mobile_harness_baseline_scaffold.py
python scripts/generate_mobile_harness_baseline_dry_run.py
python scripts/generate_mobile_harness_baseline_pilot_pack.py
python scripts/generate_mobile_harness_baseline_pilot_readiness.py
python scripts/generate_mobile_harness_claim_ledger.py
python scripts/generate_mobile_harness_core_claim_readiness.py
python scripts/generate_mobile_harness_evidence_maturity_matrix.py
python scripts/generate_mobile_harness_evaluation_protocol_readiness.py
python scripts/generate_mobile_harness_method_presentation_readiness.py
python scripts/generate_mobile_harness_bibliography_readiness.py
python scripts/generate_mobile_harness_threats_to_validity.py
python scripts/generate_mobile_harness_page_limit_readiness.py
python scripts/generate_mobile_harness_reproducibility_checklist.py
python scripts/generate_mobile_harness_submission_readiness.py
python scripts/validate_mobile_harness_bench.py
python scripts/prepare_mobile_harness_supplement.py
The current benchmark data contains 25 v0 seed tasks, a 200-task v1 candidate bank and a 1000-task v2 candidate bank. v2 raises the taxonomy from five categories to six by adding runtime orchestration, plus mobile profiles, test oracles and Android/iOS test tiers. The representative run covers five tasks across file intake, code edit, preview verification, GitHub delivery and harness evidence. The smoke-v2 T0 run covers 60 tasks, with 50 fixture-level passes and 10 typed GitHub-delivery blocks. These T0 runs do not replace Android/iOS device evidence.
The mobile-tier readiness probe records whether the local machine can collect Android/iOS evidence; the current probe is blocked because this environment lacks adb and Xcode tools. The mobile evidence pack prepares 48 Android T2 / iOS T3 task templates, device metadata templates, run manifest templates and an execution playbook, but keeps counts_as_mobile_experiment=false. The draft frozen subset fixes the planned 60-task paper subset but explicitly sets counts_as_final_paper_subset=false until mobile/GitHub sandbox evidence is attached.
The machine-readable verifier catalog defines 12 verifier contracts and the readiness report checks all 1225 current task definitions across v0/v1/v2, but it does not claim full implementation or mobile-device verifier coverage.
The baseline protocol defines three comparison flows and seven metrics. The baseline run contract defines the future baseline-run.json evidence shape. The scaffold emits three scaffold_not_run baseline runs with 60 not_run entries each. The T0 baseline dry-run emits one dry_run_not_counted blocked task per baseline. The pilot pack locks prompts plus model/intervention/evidence templates for the first real pilot. The pilot readiness report says the package is ready for non-counted execution but not ready for counted baseline results. None of these count as baseline results.
- The paper claim ledger maps draft paper claims to concrete artifacts.
- The core claim readiness report checks the control-plane positioning without counting it as an experiment.
- The evidence maturity matrix marks current counted paper evidence at T0 only while keeping mobile and baseline results open.
- The evaluation protocol readiness report binds E1-E5 to concrete task sets, evidence tiers and a 7-metric formula contract.
- The method presentation readiness report checks that the draft contains reviewable visuals, algorithms, module interfaces, formulas and evidence-boundary language.
- The bibliography readiness report verifies current related-work metadata.
- The threats-to-validity matrix tracks six review risks.
- The page-limit readiness report records the compiled PDF page boundary.
- The reproducibility checklist maps 16 draft commands to expected artifacts while keeping full empirical reproduction false.
- The submission readiness gate keeps the draft explicitly not upload-ready until venue metadata, real mobile evidence, counted baselines and the final supplement are complete.
The v2 quality audit checks machine-readable coverage and uniqueness but does not replace human review or real mobile runs. The supplement script stages a local anonymized reviewer package under paper/iclr-mobile-harness/build/ and keeps generated files out of git.
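The task counts and evidence-boundary flags described above lend themselves to a small consistency check. The sketch below re-derives the totals quoted in this README and filters manifests by their counts_as_* flags. The dictionary layout, the key names inside `SMOKE_V2_T0`, and the `is_counted` rule are assumptions for illustration; the real manifests are produced by the scripts listed above and may be shaped differently.

```python
# Task bank sizes as stated in this README.
TASK_BANKS = {"v0": 25, "v1": 200, "v2": 1000}
# smoke-v2 T0 outcome as stated in this README (key names are hypothetical).
SMOKE_V2_T0 = {"fixture_pass": 50, "github_delivery_blocked": 10}


def total_tasks(banks: dict) -> int:
    """Sum the task definitions across all banks."""
    return sum(banks.values())


def is_counted(manifest: dict) -> bool:
    """A manifest counts only if it has counts_as_* flags and all of them are true."""
    flags = [v for k, v in manifest.items() if k.startswith("counts_as_")]
    return bool(flags) and all(flags)


print(total_tasks(TASK_BANKS))    # 1225
print(sum(SMOKE_V2_T0.values()))  # 60
print(is_counted({"counts_as_mobile_experiment": False}))  # False
```

Treating a manifest with no flag at all as not counted keeps the default conservative, in line with the evidence-bound claims policy.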
MobileCode does not try to become a full Termux clone. The long-term model is:
Flutter App
-> RuntimeProvider abstraction
-> MobileCode Helper
-> External Termux fallback
-> Embedded Lite runtime later
-> Cloud runtime for heavy builds
-> GitHub Pages + GitHub Actions for shipping
That keeps the phone lightweight while still letting users produce shareable web pages, inspect repos, commit small changes, and build APKs through GitHub Actions.
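The layered model above amounts to an ordered fallback: try the richest available backend first and degrade toward preview-only. A minimal sketch of that routing idea follows. The actual RuntimeProvider and RuntimeManager live in the Flutter app; the Python field names (`is_available`, `can_run_shell`) and the function `pick_runtime` are hypothetical stand-ins for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RuntimeProvider:
    name: str
    is_available: Callable[[], bool]  # e.g. a health probe
    can_run_shell: bool               # WebViewOnly can preview but not execute


def pick_runtime(providers: Sequence[RuntimeProvider], needs_shell: bool) -> Optional[RuntimeProvider]:
    """Return the first available provider, in priority order, that fits the task."""
    for provider in providers:
        if needs_shell and not provider.can_run_shell:
            continue
        if provider.is_available():
            return provider
    return None


# Priority order mirrors the list above; availability is hard-coded for the demo.
providers = [
    RuntimeProvider("helper", lambda: False, True),
    RuntimeProvider("termux", lambda: True, True),
    RuntimeProvider("webview-only", lambda: True, False),
]
print(pick_runtime(providers, needs_shell=True).name)  # termux
```

Because callers only see the provider interface, a backend can be added or replaced without touching the chat or preview code, which is the point of keeping the runtime behind RuntimeProvider.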
.
├─ app/ React/Vite product site
├─ docs/ GitHub Pages demos, QA docs, runtime docs
├─ mobile_agent/ Flutter app source
│ ├─ lib/screens/ Home, GitHub Repo Hub, Skill/MCP/Agent/Memory UI
│ ├─ lib/services/ Runtime, GitHub, Pages, Helper, skill services
│ └─ assets/ Role avatars and icons
├─ mobile-coding-*.md Product and architecture analysis
└─ README.md Project homepage
Current candidate: v0.1.68-mobile-harness-d2dd9a7.
See:
- Latest dual app build - Android APK, iOS simulator app, and iOS unsigned archive all completed successfully.
- Release assets - Android APK plus iOS simulator/archive artifacts.
- Version Policy
- Release QA Checklist
- Helper Runtime Protocol
- Production Hardening Notes
| Priority | Next focus | Stop condition |
|---|---|---|
| P0 | Pass Mobile Runtime CI, Android APK build, Android smoke test for the pushed commit | APK artifact is downloadable and app launches |
| P1 | Smooth Repo Hub file edit conflict handling and artifact download UX | User can recover from SHA conflicts and find downloaded artifacts |
| P2 | Expand API-backed workspace into selected repo file import/export | Phone can edit selected repo files without true clone |
| Later | Helper APK maturity, queue recovery, PTY, cloud heavy builds | Runtime remains replaceable behind RuntimeProvider |
This repository is actively moving toward a deployable mobile coding workspace. The Android build path is GitHub Actions-first; local machines without Flutter/Android SDK should use CI artifacts instead of local builds.
No license file is included yet. Add a LICENSE before treating this as a reusable open-source distribution.
