v0.2.0: Foundation Hardening (5 axes engineering-complete)
After 44 merged PRs since v0.1.4 (44 days, 127 unit tests, 0 open issues at release time), JAMES exits the v0.2 cycle with five of six axes engineering-complete. Axis 6's remaining gate (a second user running the bench end-to-end on their own corpus) is now in a self-feedback and recruitment phase; it is not code work.
This is the "trustworthy enough to recommend to one other person" milestone described in ROADMAP.md v0.2.0.
Summary (translated from Korean)
JAMES has completed the engineering side of five v0.2 Foundation Hardening axes. For the sixth axis (real-data validation), most of the code work is done; the remaining gate is "a second user runs the bench on their own corpus," which is a user-recruitment and self-feedback step rather than code.
This release ships with 44 PRs merged since v0.1.4, 127 unit tests passing, and 0 open issues.
What's new (axis-by-axis)
Axis 1: Architecture Separation ✅
- `core/reasoning_engine.py` (51 KB monolith) split into:
  - `core/reasoning/engine.py`: orchestration only (16 KB)
  - `core/reasoning/pipeline.py`: RAG retrieval pipeline
  - `core/reasoning/modes.py`: 4 mode handlers (chat / wiki_edit / self_evolve / coding) + new meta mode
- `core/memory_*.py` consolidated into the `core/memory/` package with a documented public API.
Axis 2: Evaluation Harness ✅
- STEP 7 13-query regression suite locked at `eval/regression/step7_*.json` with byte-identical security-block invariants and graph-paths bands. Runner: `python scripts/bench.py --suite=step7 --check`.
- RAGAS integrated as the third-party harness: context_precision / context_recall / faithfulness / answer_relevancy. Live `/query/` driver. Baseline + drift check at `eval/ragas/baseline.json`.
- PR-contract: every change to `core/{retrieval,graph,reasoning}` must paste bench numbers (`CLAUDE.md` rule 2 + `CONTRIBUTING.md`).
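The baseline + drift check can be pictured roughly as follows. This is a minimal sketch, assuming the baseline file is a flat JSON object mapping each RAGAS metric name to a score and that a fixed absolute tolerance decides what counts as drift. The `find_drift` and `load_baseline` helpers, the `tolerance` default, and the JSON layout are illustrative assumptions, not the actual contents of `eval/ragas/baseline.json` or `scripts/bench.py`.

```python
import json
from pathlib import Path

# The four RAGAS metrics named in these release notes.
METRICS = ("context_precision", "context_recall", "faithfulness", "answer_relevancy")


def find_drift(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: (baseline, current)} for metrics that dropped by more than tolerance.

    Hypothetical helper: the real drift rule and tolerance may differ.
    """
    drifted = {}
    for name in METRICS:
        base, cur = baseline[name], current[name]
        if base - cur > tolerance:  # only regressions count as drift in this sketch
            drifted[name] = (base, cur)
    return drifted


def load_baseline(path: str) -> dict:
    """Read a baseline file; assumes a flat JSON object of metric -> score."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

An empty result from `find_drift` would correspond to a passing check; any entry would be a reason to paste the numbers into the PR and investigate.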
Axis 3: Observability / Tracing ✅
- `trace_id` ContextVar at the API edge, propagated end-to-end through `core/observability::log_stage`.
- Per-trace JSONL files at `reports/trace/<YYYY-MM-DD>/<trace_id>.jsonl` covering `auth → retrieve → graph → tool → answer → complete` stages.
- `JAMES_TRACE_STDOUT` console mirror: default ON for the single-user operator workflow (set `=0` to silence).
- `GET /admin/trace/{trace_id}` full pipeline replay endpoint.
- `GET /admin/metrics?window_hours=24` per-stage p50/p90/p99/max latency histograms.
- 7-day auto-prune via `JAMES_TRACE_RETENTION_DAYS` env (default 7, clamped to [1, 365]).
PRs: #67, #71, #75, #82, #83, #84.
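The trace propagation described above can be sketched with the standard-library `contextvars` module. This is an illustration only: the notes confirm the `trace_id` ContextVar, the `log_stage` name, and the file layout, but the `start_trace` helper, the `root` parameter, and the record fields (`stage`, `ts`) are assumptions.

```python
import contextvars
import json
import time
import uuid
from pathlib import Path

# One ContextVar per request context; set at the API edge, read by every stage.
trace_id_var = contextvars.ContextVar("trace_id", default="")


def start_trace() -> str:
    """Assign a fresh trace id for the current request context (hypothetical helper)."""
    trace_id = uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


def log_stage(stage: str, root: Path = Path("reports/trace"), **fields) -> Path:
    """Append one JSON line for `stage` to <root>/<YYYY-MM-DD>/<trace_id>.jsonl."""
    trace_id = trace_id_var.get()
    day_dir = root / time.strftime("%Y-%m-%d")
    day_dir.mkdir(parents=True, exist_ok=True)
    path = day_dir / f"{trace_id}.jsonl"
    record = {"trace_id": trace_id, "stage": stage, "ts": time.time(), **fields}
    # Explicit utf-8 so Korean text in fields survives on Windows consoles and files.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Because a ContextVar follows the current async task, stages called from the same request see the same id without passing it through every function signature.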
Axis 4: Security Boundary ✅
- `core/policy_engine.py` is now the single source of role / sensitivity / capability decisions. Removing it would break 6+ production modules.
- Capability tokens at every tool call site; no direct fs path strings.
- Multimodal trust quarantine: image / video / audio / web inputs flagged and sanitized at a single ingestion chokepoint before joining the LLM context.
- Risky-coding hard-refuse policy at `pre_check`. Queries that ask the model to produce destructive shell / SQL / git commands receive the same byte-identical 26-char block as prompt-injection attempts (q11 / q12 invariants in STEP 7).
PRs: #50, #53, #54, #56, #57, #58, #59, #60, #61, #63, #70.
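To make the "byte-identical block" idea concrete, here is a hedged sketch of a hard-refuse check. Only the `pre_check` name and the 26-character length come from these notes; the `BLOCK_MESSAGE` text is a placeholder chosen to be 26 characters long, and the regex patterns are hypothetical examples rather than the project's actual policy.

```python
import re

# Placeholder refusal text, chosen only so that it is 26 characters long;
# the actual block string used by JAMES is not shown in these notes.
BLOCK_MESSAGE = "REQUEST BLOCKED BY POLICY."

# Hypothetical patterns for destructive shell / SQL / git requests.
RISKY_PATTERNS = [
    re.compile(r"\brm\s+-rf\b", re.IGNORECASE),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
    re.compile(r"\bgit\s+push\s+.*--force\b", re.IGNORECASE),
]


def pre_check(query: str):
    """Return the fixed block message for risky queries, else None."""
    if any(p.search(query) for p in RISKY_PATTERNS):
        return BLOCK_MESSAGE
    return None
```

Returning one constant for every refused query is what lets a regression suite compare the response byte for byte instead of scoring free-form refusal text.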
Axis 5: Controlled Self-Evolution ✅
- Opt-in env flag `JAMES_ENABLE_EVOLUTION=0` (default off). `JAMES_AUTO_APPROVE` requires `JAMES_DEV_MODE=1` or the server refuses to start.
- Every approved patch is recorded with `approver_username` / `approver_role` / `approved_at` / `approval_method` in the lifecycle JSONL.
- Bench eval gate at `/admin/patch/approve`: after `patch_apply()` succeeds, `scripts/bench.py --check` runs in a subprocess (`asyncio.to_thread`). Regression triggers auto-rollback via `restore_latest()` and a `ROLLED_BACK` lifecycle entry.
- Mid-deploy crash recovery tested for byte-identical restore.
- `GET /admin/patch/audit?since=&approver=&outcome=&limit=` operator-facing query endpoint over `james_patch_log.jsonl`.
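The bench eval gate above follows an apply, check, rollback shape that can be sketched as below. This is an assumption-laden illustration: `approve_patch` and `run_bench_check` are hypothetical names, the two callables stand in for the real `patch_apply()` and `restore_latest()` (whose signatures are not shown in these notes), and the `"APPLIED"` label is invented for the success case; only `ROLLED_BACK` appears in the source.

```python
import asyncio
import subprocess


def run_bench_check(cmd) -> bool:
    """Run the bench command in a subprocess; exit code 0 is treated as 'no regression'."""
    result = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8")
    return result.returncode == 0


async def approve_patch(apply_patch, restore_latest, bench_cmd) -> str:
    """Apply a patch, gate it on the bench, and roll back on regression.

    apply_patch and restore_latest are injected stand-ins for the real
    patch_apply() / restore_latest(); their signatures are assumed.
    """
    apply_patch()
    # subprocess.run blocks, so it runs in a worker thread to keep the event loop free.
    passed = await asyncio.to_thread(run_bench_check, bench_cmd)
    if passed:
        return "APPLIED"  # hypothetical label for the success outcome
    restore_latest()
    return "ROLLED_BACK"
```

In this sketch `bench_cmd` would be something like `["python", "scripts/bench.py", "--check"]`; passing the command in keeps the gate testable without a live server.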
Axis 6: Real-Data Validation 🟡
- Wiki corpus at 161 entities (concept 62 / org 57 / person 11 / document 31), hard-deduped.
- 13-query STEP 7 suite spans retrieve / relation / multi-hop / compare / dedup / lang-mix / negative / security / meta categories.
- Edge cases discovered + closed via real-data feedback: #5, #6, #7, #8, #11, #14, #20.
- Remaining: a second user running the bench end-to-end on their own corpus. This is the v0.2 → v0.3 gate and is now in recruitment phase.
User-feedback fixes (UX)
This cycle also folded in three direct user-feedback items:
- Answer flow: replaced the rigid `자료 기반` / `추론` ("source-based" / "inference") two-section template with a Claude-style natural prose flow (core answer → evidence → further thoughts). PR #74.
- Debug visibility: `JAMES_TRACE_STDOUT` defaults ON so operators see per-stage JSONL lines without env-var setup. PR #75.
- Meta-mode routing: chat-page inventory queries (for example, the Korean equivalents of "What materials do you have?" / "Show me what data there is") now route to a dedicated `handle_meta` instead of hallucinating via retrieval. PR #76.
Production bug fixes
- Windows path-check in `patch_applier.py`: `str(Path(target)).startswith(".")` on Windows normalized away the leading `./`, silently rejecting every legitimate self-evolution sandbox patch. Single-line fix. PR #78.
- Korean encoding in 4 sites: three self-test files wrote Korean comments via `open(..., "w")` without `encoding="utf-8"`, landing as cp949 on Windows; `bench_gate.subprocess.run` also decoded captured output via locale (cp949) instead of utf-8. PR #80.
- cp949 console crash: `ensure_utf8_console()` wired into the server entry, admin scripts, and tests so emoji-bearing print statements don't crash on default Windows consoles. PR #36 + ongoing test additions.
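The first two bugs are easy to reproduce in isolation. The sketch below shows why the string-prefix check could never see the leading dot, then one possible structural alternative and an explicit-encoding write. The `is_inside` and `write_text_utf8` helpers are hypothetical; the actual one-line fix in PR #78 and the changes in PR #80 are not shown in these notes.

```python
from pathlib import Path, PureWindowsPath

# pathlib drops a leading "./" when it normalizes a path, so a string
# prefix test on the result never sees the dot.
assert str(PureWindowsPath("./sandbox/patch.py")) == "sandbox\\patch.py"
assert not str(PureWindowsPath("./sandbox/patch.py")).startswith(".")


def is_inside(target: str, sandbox_root: Path) -> bool:
    """Structural containment check instead of a string-prefix check.

    Hypothetical alternative; not the fix that shipped.
    """
    resolved = (sandbox_root / target).resolve()
    return resolved.is_relative_to(sandbox_root.resolve())  # Python 3.9+


def write_text_utf8(path: Path, text: str) -> None:
    """Always pass encoding= so Windows does not fall back to the locale codec (e.g. cp949)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
```

The same explicit-encoding habit applies to `subprocess.run(..., text=True, encoding="utf-8")` when decoding captured output.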
Breaking / behavior changes
- Default `JAMES_TRACE_STDOUT=1`: every server startup now prints per-stage JSONL lines to stdout. Set `JAMES_TRACE_STDOUT=0` to silence. (PR #75)
- Default `JAMES_TRACE_RETENTION_DAYS=7`: `reports/trace/` directories older than 7 days are removed on server startup. Set higher if you need longer audit retention. (PR #84)
- Answer style: `response_style` API param + `JAMES_RESPONSE_STYLE` env are now no-ops (kept for back-compat); all answers use the natural-flow prompt. (PR #74)
- STEP 7 baseline is `step7-v3` (was v1 in v0.1.4): q12 promoted from flaky to byte-identical block, q13 added for meta-mode. Old `--check` runs against `step7-v1` will fail; rerun against the new baseline.
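For operators deciding on a retention value, the startup prune can be pictured like this. It is a sketch under stated assumptions: the env name, the default of 7, the [1, 365] clamp, and the dated directory layout come from these notes, while the function names, the fallback on unparsable input, and the exact "older than" boundary are guesses.

```python
import os
import shutil
from datetime import date, timedelta
from pathlib import Path


def retention_days(env=os.environ) -> int:
    """Read JAMES_TRACE_RETENTION_DAYS, defaulting to 7 and clamping to [1, 365]."""
    raw = env.get("JAMES_TRACE_RETENTION_DAYS", "7")
    try:
        days = int(raw)
    except ValueError:
        days = 7  # assumption: fall back to the default on unparsable input
    return max(1, min(365, days))


def prune_traces(root: Path, days: int, today: date) -> list:
    """Delete <root>/<YYYY-MM-DD>/ directories older than `days`; return removed names."""
    cutoff = today - timedelta(days=days)
    removed = []
    for child in sorted(root.iterdir()):
        try:
            day = date.fromisoformat(child.name)
        except ValueError:
            continue  # skip anything that is not a dated directory
        if child.is_dir() and day < cutoff:
            shutil.rmtree(child)
            removed.append(child.name)
    return removed
```

Since deletion happens at startup, anyone who needs traces for audits should raise the value before the first restart on v0.2.0.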
How to upgrade
```bash
git pull origin main
git checkout v0.2.0
pip install -r requirements.txt           # cryptography 47.0.0, pynvml 12.x

# new env knobs (optional):
export JAMES_TRACE_STDOUT=0               # silence per-stage console mirror
export JAMES_TRACE_RETENTION_DAYS=14      # keep 2 weeks of traces
export JAMES_EVOLUTION_GATE=0             # disable bench gate during patch deploy (debug only)

# verify:
python -m unittest discover -s tests      # 127 tests, ~6s
python scripts/bench.py --suite=step7     # live STEP 7 against your wiki
```
What's next
v0.2.1 cycle: self-feedback + second-user recruitment for Axis 6. No more code-only PRs unless a real-use bug surfaces.
v0.3.0: platform skeleton, comprising `core/plugins/base.py` (4 plugin types), a `JAMES_PLUGINS` loader, `packs/general/` dogfood, `JAMES_WORKSPACE` for multi-instance hosting, and `docs/VERSIONING.md` with a 12-month deprecation policy. Required before any domain pack work.
See ROADMAP.md for the full v0.3 → v0.4 → v1.0 gate definitions.
🤖 Generated with Claude Code