
v0.2.0: Foundation Hardening (5 axes engineering-complete)


@Hashevolution Hashevolution released this 08 May 00:56
· 687 commits to main since this release
4e0c48f

v0.2.0: Foundation Hardening

After 44 merged PRs since v0.1.4 (44 days, 127 unit tests, 0 open issues at release time), JAMES exits the v0.2 cycle with five of six axes engineering-complete. Axis 6's remaining gate (a second user running the bench end-to-end on their own corpus) is now in a self-feedback and recruitment phase; it is not code work.

This is the "trustworthy enough to recommend to one other person" milestone described in ROADMAP.md v0.2.0.


Korean summary (translated)

JAMES has completed the five v0.2 Foundation Hardening axes from an engineering standpoint. Code work for the sixth axis (real-data validation) is mostly complete. The remaining gate is "a second user runs the bench on their own corpus", which is a user-recruitment and self-feedback step rather than code work.

This release ships with 44 PRs merged since v0.1.4, 127 unit tests passing, and 0 open issues.


What's new (axis-by-axis)

Axis 1: Architecture Separation ✅

core/reasoning_engine.py (51 KB monolith) split into:

  • core/reasoning/engine.py: orchestration only (16 KB)
  • core/reasoning/pipeline.py: RAG retrieval pipeline
  • core/reasoning/modes.py: 4 mode handlers (chat / wiki_edit / self_evolve / coding) + new meta mode

core/memory_*.py consolidated into core/memory/ package with documented public API.

PRs: #35, #37, #38, #39, #50.

Axis 2: Evaluation Harness ✅

  • STEP 7 13-query regression suite locked at eval/regression/step7_*.json with byte-identical security-block invariants and graph-paths bands. Runner: python scripts/bench.py --suite=step7 --check.
  • RAGAS integrated as the third-party harness, reporting context_precision / context_recall / faithfulness / answer_relevancy. Live /query/ driver. Baseline + drift check at eval/ragas/baseline.json.
  • PR-contract: every change to core/{retrieval,graph,reasoning} must paste bench numbers (CLAUDE.md rule 2 + CONTRIBUTING.md).
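
As a rough illustration of what a baseline drift check over the four RAGAS metrics could look like, here is a minimal sketch. The function name `find_drift`, the flat metric-to-score mapping, the 0.05 tolerance, and the sample scores are all assumptions made for this example; the actual layout of eval/ragas/baseline.json and the real thresholds are not described in these notes.

```python
from typing import Dict, List

# Hypothetical tolerance; the release notes do not state the real threshold.
DEFAULT_TOLERANCE = 0.05


def find_drift(baseline: Dict[str, float],
               current: Dict[str, float],
               tolerance: float = DEFAULT_TOLERANCE) -> List[str]:
    """Return the metrics that fell more than `tolerance` below the baseline."""
    regressions = []
    for name, base_value in baseline.items():
        value = current.get(name)
        # A metric missing from the current run is treated as a regression.
        if value is None or value < base_value - tolerance:
            regressions.append(name)
    return sorted(regressions)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    baseline = {"context_precision": 0.80, "context_recall": 0.75,
                "faithfulness": 0.90, "answer_relevancy": 0.85}
    current = dict(baseline, faithfulness=0.70)
    print(find_drift(baseline, current))
```

A gate built this way fails only on drops, so an improved score never blocks a PR; whether the real check is one-sided is not stated here.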

PRs: #43, #51, #52, #64, #66.

Axis 3: Observability / Tracing ✅

  • trace_id ContextVar at the API edge, propagated end-to-end through core/observability::log_stage.
  • Per-trace JSONL files at reports/trace/<YYYY-MM-DD>/<trace_id>.jsonl covering auth → retrieve → graph → tool → answer → complete stages.
  • JAMES_TRACE_STDOUT console mirror: default ON for the single-user operator workflow (set =0 to silence).
  • GET /admin/trace/{trace_id} full pipeline replay endpoint.
  • GET /admin/metrics?window_hours=24 per-stage p50/p90/p99/max latency histograms.
  • 7-day auto-prune via JAMES_TRACE_RETENTION_DAYS env (default 7, clamped to [1, 365]).
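
The trace propagation described in this list can be sketched with the standard-library `contextvars` module. This is an illustrative approximation, not the code in core/observability: the helper `start_trace`, the `root` parameter, and the record fields (`ts`, plus arbitrary keyword fields) are assumptions, and the real `log_stage` signature may differ.

```python
import contextvars
import json
import time
import uuid
from pathlib import Path

# Holds the trace id for the current request; "-" means no request is in flight.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


def start_trace() -> str:
    """Call once at the API edge to mint a trace id and bind it to the context."""
    trace_id = uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


def log_stage(stage: str, root: Path = Path("reports/trace"), **fields) -> Path:
    """Append one JSON line for `stage` to <root>/<YYYY-MM-DD>/<trace_id>.jsonl."""
    trace_id = trace_id_var.get()
    day_dir = root / time.strftime("%Y-%m-%d")
    day_dir.mkdir(parents=True, exist_ok=True)
    record = {"trace_id": trace_id, "stage": stage, "ts": time.time(), **fields}
    path = day_dir / f"{trace_id}.jsonl"
    # Explicit UTF-8 so non-ASCII fields survive on Windows consoles and files.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Because a `ContextVar` is copied into each asyncio task, downstream stages can call `log_stage("retrieve", hits=3)` without the trace id being threaded through every function signature.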

PRs: #67, #71, #75, #82, #83, #84.
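
For readers wondering how a per-stage p50/p90/p99/max summary like the one served by GET /admin/metrics might be computed, the sketch below uses the nearest-rank percentile definition over (stage, latency) samples. The function names and the input shape are hypothetical, and the project may use a different percentile method or real histogram buckets.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group latency samples by stage and report p50/p90/p99/max for each."""
    by_stage = defaultdict(list)
    for stage, latency_ms in samples:
        by_stage[stage].append(latency_ms)
    summary = {}
    for stage, values in by_stage.items():
        values.sort()
        summary[stage] = {
            "p50": percentile(values, 50),
            "p90": percentile(values, 90),
            "p99": percentile(values, 99),
            "max": values[-1],
        }
    return summary
```

Filtering by `window_hours` would happen before this step, by dropping samples older than the window.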

Axis 4: Security Boundary ✅

  • core/policy_engine.py is now the single source of role / sensitivity / capability decisions. Removing it would break 6+ production modules.
  • Capability tokens at every tool call site β€” no direct fs path strings.
  • Multimodal trust quarantine: image / video / audio / web inputs flagged and sanitized at a single ingestion chokepoint before joining the LLM context.
  • Risky-coding hard-refuse policy at pre_check. Queries that ask the model to produce destructive shell / SQL / git commands receive the same byte-identical 26-char block as prompt-injection attempts (q11 / q12 invariants in STEP 7).
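
The idea of a single byte-identical block shared by risky-coding and prompt-injection queries can be illustrated as follows. Everything here is an assumption for the sake of the example: the `BLOCK_MESSAGE` text is a placeholder (the actual 26-character string is not given in these notes), the patterns are deliberately tiny, and the real `pre_check` in core/policy_engine.py may have a different signature.

```python
import re
from typing import Optional

# Placeholder text; the real block string is not quoted in the release notes.
BLOCK_MESSAGE = "[blocked by security policy]"

# Illustrative patterns only; a production policy would be broader and tested.
_RISKY_PATTERNS = [
    re.compile(r"\brm\s+-rf\b", re.IGNORECASE),                # destructive shell
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE), # destructive SQL
    re.compile(r"\bgit\s+push\s+.*--force\b", re.IGNORECASE),  # destructive git
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
]


def pre_check(query: str) -> Optional[str]:
    """Return the fixed block message if the query matches a pattern, else None."""
    for pattern in _RISKY_PATTERNS:
        if pattern.search(query):
            return BLOCK_MESSAGE
    return None
```

Returning one constant for every category is what makes a byte-identical regression invariant possible: the bench can compare the response against a single expected string.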

PRs: #50, #53, #54, #56, #57, #58, #59, #60, #61, #63, #70.

Axis 5: Controlled Self-Evolution ✅

  • Self-evolution is opt-in via the JAMES_ENABLE_EVOLUTION env flag (default 0, off). JAMES_AUTO_APPROVE requires JAMES_DEV_MODE=1 or the server refuses to start.
  • Every approved patch is recorded with approver_username / approver_role / approved_at / approval_method in the lifecycle JSONL.
  • Bench eval gate at /admin/patch/approve: after patch_apply() succeeds, scripts/bench.py --check runs in a subprocess (asyncio.to_thread). Regression triggers auto-rollback via restore_latest() and a ROLLED_BACK lifecycle entry.
  • Mid-deploy crash recovery tested for byte-identical restore.
  • GET /admin/patch/audit?since=&approver=&outcome=&limit= operator-facing query endpoint over james_patch_log.jsonl.
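
The apply, bench, and rollback sequence described for /admin/patch/approve can be sketched as below. This is a simplified stand-in, not the project's handler: `gated_apply`, its callable parameters, and the "APPLIED" event name are hypothetical, while `ROLLED_BACK` and the use of `asyncio.to_thread` around a subprocess follow the description above. It needs Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import subprocess
from typing import Callable, List, Sequence


async def gated_apply(apply_patch: Callable[[], None],
                      rollback: Callable[[], None],
                      bench_cmd: Sequence[str],
                      lifecycle: List[str]) -> bool:
    """Apply a patch, run the bench in a subprocess, and roll back on failure."""
    apply_patch()
    lifecycle.append("APPLIED")  # hypothetical event name
    # subprocess.run blocks, so it runs in a worker thread to keep the event loop free.
    result = await asyncio.to_thread(
        subprocess.run, list(bench_cmd),
        capture_output=True, text=True, encoding="utf-8", errors="replace",
    )
    if result.returncode != 0:
        # A non-zero exit code is treated as a regression.
        rollback()
        lifecycle.append("ROLLED_BACK")
        return False
    return True
```

In this sketch the command would be something like `[sys.executable, "scripts/bench.py", "--check"]`, with `rollback` wired to the restore step. Decoding captured output explicitly as UTF-8 avoids the locale-dependent decoding issue listed under the production bug fixes.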

PRs: #69, #77, #78, #79.
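
The startup rule for the evolution flags (auto-approve is refused outside dev mode) could be enforced with a check along these lines. The function name and the exact accepted values ("1" meaning on) are assumptions; only the three env variable names come from the notes above.

```python
import os
from typing import Mapping


def check_evolution_flags(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if self-evolution is enabled; raise if the flags conflict."""
    auto_approve = env.get("JAMES_AUTO_APPROVE", "0") == "1"
    dev_mode = env.get("JAMES_DEV_MODE", "0") == "1"
    if auto_approve and not dev_mode:
        # Failing at startup is safer than silently ignoring the flag.
        raise RuntimeError("JAMES_AUTO_APPROVE=1 requires JAMES_DEV_MODE=1")
    # Off unless explicitly switched on.
    return env.get("JAMES_ENABLE_EVOLUTION", "0") == "1"
```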

Axis 6: Real-Data Validation 🟡

  • Wiki corpus at 161 entities (concept 62 / org 57 / person 11 / document 31), hard-deduped.
  • 13-query STEP 7 suite spans retrieve / relation / multi-hop / compare / dedup / lang-mix / negative / security / meta categories.
  • Edge cases discovered + closed via real-data feedback: #5, #6, #7, #8, #11, #14, #20.
  • Remaining: a second user running the bench end-to-end on their own corpus. This is the v0.2 → v0.3 gate and is now in recruitment phase.

User-feedback fixes (UX)

This cycle also folded in three direct user-feedback items:

  1. Answer flow: replaced the rigid two-section template (📚 "source-based" / 💡 "reasoning", originally 자료 기반 / 추론) with a Claude-style natural prose flow (core answer → evidence → additional perspective). PR #74.
  2. Debug visibility: JAMES_TRACE_STDOUT defaults ON so operators see per-stage JSONL lines without env-var setup. PR #75.
  3. Meta-mode routing: chat-page inventory queries such as "What materials do you have?" (어떤 자료 있어?) or "Show me what data there is" (데이터 뭐 있는지 보여줘) now route to a dedicated handle_meta instead of hallucinating via retrieval. PR #76.

Production bug fixes

  • Windows path-check in patch_applier.py: str(Path(target)).startswith(".") on Windows normalized away the leading ./, silently rejecting every legitimate self-evolution sandbox patch. Single-line fix. PR #78.
  • Korean encoding in 4 sites: three self-test files wrote Korean comments via open(..., "w") without encoding="utf-8", landing as cp949 on Windows; bench_gate.subprocess.run also decoded captured output via locale (cp949) instead of utf-8. PR #80.
  • cp949 console crash: ensure_utf8_console() wired into the server entry, admin scripts, and tests so emoji-bearing print statements don't crash on default Windows consoles. PR #36 + ongoing test additions.
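
The path-check bug above comes from how pathlib parses paths: a leading "./" is dropped, so a string-prefix test on str(path) never sees the dot. The sketch below reproduces that behaviour with `PureWindowsPath` and shows one lexical alternative. `is_inside` is a hypothetical helper, not the one-line fix from PR #78, and it does not resolve symlinks.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_inside(target: str, sandbox: str, flavour=PurePosixPath) -> bool:
    """True if `target` is `sandbox` or lies beneath it (purely lexical check)."""
    target_path, sandbox_path = flavour(target), flavour(sandbox)
    if ".." in target_path.parts:
        # Reject parent-directory hops instead of trying to interpret them.
        return False
    return target_path == sandbox_path or sandbox_path in target_path.parents


# pathlib drops the leading "./" when the path is parsed.
print(str(PureWindowsPath("./sandbox/patch.py")))                     # sandbox\patch.py
print(str(PureWindowsPath("./sandbox/patch.py")).startswith("."))     # False
print(is_inside("./sandbox/patch.py", "./sandbox", PureWindowsPath))  # True
```

Comparing path components rather than string prefixes also avoids false matches such as "sandbox_old" passing a check for "sandbox".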

Breaking / behavior changes

  • Default JAMES_TRACE_STDOUT=1: every server startup now prints per-stage JSONL lines to stdout. Set JAMES_TRACE_STDOUT=0 to silence. (PR #75)
  • Default JAMES_TRACE_RETENTION_DAYS=7: reports/trace/ directories older than 7 days are removed on server startup. Set higher if you need longer audit retention. (PR #84)
  • Answer style: response_style API param + JAMES_RESPONSE_STYLE env are now no-ops (kept for back-compat); all answers use the natural-flow prompt. (PR #74)
  • STEP 7 baseline is step7-v3 (was v1 in v0.1.4): q12 promoted from flaky to byte-identical block, q13 added for meta-mode. Old --check runs against step7-v1 will fail; rerun against the new baseline.
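
To make the trace-retention behaviour concrete, here is a small sketch of reading JAMES_TRACE_RETENTION_DAYS with the documented default of 7 and clamp to [1, 365], then selecting dated directories that have aged out. The function names are hypothetical, and falling back to the default on an unparsable value is an assumption; the notes do not say how invalid input is handled.

```python
import os
from datetime import date, timedelta
from typing import Iterable, List, Mapping


def retention_days(env: Mapping[str, str] = os.environ,
                   default: int = 7, lo: int = 1, hi: int = 365) -> int:
    """Read the retention setting and clamp it to [lo, hi]."""
    try:
        days = int(env.get("JAMES_TRACE_RETENTION_DAYS", ""))
    except ValueError:
        days = default  # assumption: unset or invalid values use the default
    return max(lo, min(hi, days))


def expired_day_dirs(dir_names: Iterable[str], today: date, days: int) -> List[str]:
    """Return YYYY-MM-DD directory names strictly older than `days` days."""
    cutoff = today - timedelta(days=days)
    expired = []
    for name in dir_names:
        try:
            day = date.fromisoformat(name)
        except ValueError:
            continue  # ignore anything that is not a dated directory
        if day < cutoff:
            expired.append(name)
    return expired
```

A startup hook would list the subdirectories of reports/trace/, pass their names through `expired_day_dirs`, and delete the result, for example with `shutil.rmtree`.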

How to upgrade

git pull origin main
git checkout v0.2.0
pip install -r requirements.txt   # cryptography 47.0.0, pynvml 12.x

# new env knobs (optional):
export JAMES_TRACE_STDOUT=0          # silence per-stage console mirror
export JAMES_TRACE_RETENTION_DAYS=14 # keep 2 weeks of traces
export JAMES_EVOLUTION_GATE=0        # disable bench gate during patch deploy (debug only)

# verify:
python -m unittest discover -s tests   # 127 tests, ~6s
python scripts/bench.py --suite=step7  # live STEP 7 against your wiki

What's next

v0.2.1 cycle: self-feedback + second-user recruitment for Axis 6. No more code-only PRs unless a real-use bug surfaces.

v0.3.0 is the platform skeleton: core/plugins/base.py (4 plugin types), JAMES_PLUGINS loader, packs/general/ dogfood, JAMES_WORKSPACE for multi-instance hosting, docs/VERSIONING.md with 12-month deprecation policy. Required before any domain pack work.

See ROADMAP.md for the full v0.3 → v0.4 → v1.0 gate definitions.


🤖 Generated with Claude Code