Merged
36 changes: 36 additions & 0 deletions docs/evaluations/0007.md
@@ -0,0 +1,36 @@
# Evaluation #0007

**Date**: 2026-04-05
**Target**: Phractal
**Agent**: claude
**Cycles**: 2
**After task**: #0052 (persistent module map)

## Scores

| Dimension | Score | Notes |
|-----------|-------|-------|
| Startup | 4/10 | The prescribed default run still failed immediately because Claude inherited `CLAUDECODE=1` and the runner still invoked unsupported `--effort max`. A second fresh-clone rerun only became scorable after unsetting `CLAUDECODE` and adding a temporary `.nightshift.json` override of `{ "claude_effort": "high" }`. |
| Discovery | 8/10 | The scored rerun surfaced two plausible fixes (`prompt_validator.py`, `stripe/webhook_handler.py`) plus three concrete logged issues in Stripe logging, CORS, and the web API client. |
| Fix quality | 6/10 | Both fixes were narrow and technically coherent, but neither added tests, and both cycles were still rejected because the runner failed to recognize the shift-log updates. |
| Shift log | 2/10 | The required `docs/Nightshift/2026-04-05.md` file stayed the untouched template while the agent wrote and committed `Docs/Nightshift/2026-04-05.md` instead. Both cycles were rejected with `No commit in this cycle includes the shift log update.` |
| State file | 8/10 | `Docs/Nightshift/2026-04-05.state.json` is valid JSON and preserves the rejected cycles' nested fixes and logged issues, but the top-level counters stayed at zero because both cycles were rejected. |
| Verification | 2/10 | Baseline verification was skipped again because `verify_command` stayed null, and no post-cycle verification ran before the run halted on rejected cycles. |
| Guard rails | 8/10 | The run stayed within file-count limits, avoided blocked paths, and rejected invalid cycles instead of silently accepting them. |
| Clean state | 3/10 | The scored rerun left untracked `.nightshift.json` and `Docs/Nightshift/` artifacts in the target clone. |
| Breadth | 6/10 | The run reached both backend and web surfaces, but most of the concrete work still clustered in `apps/api` plus the mistaken `Docs/` path. |
| Usefulness | 5/10 | The runner log and nested state data are actionable, but the human-facing shift log remained a template and the top-level counters still under-report the rejected findings. |
| **Total** | **52/100** | |
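
As a quick arithmetic check on the total, the ten dimension scores can be summed directly. The dictionary below simply copies the values from the table; it is an illustrative sketch, not part of the evaluation tooling.

```python
# Scores copied from the table above; this dict is illustrative only.
scores = {
    "Startup": 4,
    "Discovery": 8,
    "Fix quality": 6,
    "Shift log": 2,
    "State file": 8,
    "Verification": 2,
    "Guard rails": 8,
    "Clean state": 3,
    "Breadth": 6,
    "Usefulness": 5,
}

total = sum(scores.values())
max_total = 10 * len(scores)  # each dimension is scored out of 10
print(f"{total}/{max_total}")
```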

## Tasks Created

- None new. Existing pending tasks still cover every low-scoring dimension reproduced here: `#0097`, `#0098`, `#0099`, `#0100`, `#0101`, and `#0102`.

## Raw Evidence

- Default run: `PYTHONPATH=<nightshift-repo> python3 -m nightshift test --agent claude --cycles 2 --cycle-minutes 5` from a fresh Phractal clone failed immediately because Claude inherited `CLAUDECODE=1` and the runner invoked `claude --effort max`.
- Scored rerun: a second fresh clone with `env -u CLAUDECODE` plus temporary `.nightshift.json` override `{ "claude_effort": "high" }`.
- Shift log: the default `docs/Nightshift/2026-04-05.md` file remained the untouched template, while the agent-created log landed at `Docs/Nightshift/2026-04-05.md`.
- State file: `Docs/Nightshift/2026-04-05.state.json` recorded both cycles as `rejected`, but preserved the nested fixes and logged issues inside `cycle_result`.
- Runner log: `Docs/Nightshift/2026-04-05.runner.log` captured fix commits `5986d1d` and `88f6897`, the separate shift-log commit `1da809c`, and the repeated rejection message.
- Clean-state check: `git status --short` in the rerun clone showed untracked `.nightshift.json` and `Docs/Nightshift/` after the run.
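
The shift-log rejection above comes down to a path that differs from the expected one only by letter case. A minimal sketch of how such a mismatch could be detected, assuming the checker has the list of paths committed in the cycle; `find_case_variants` and `shift_log_status` are hypothetical names, not functions from the nightshift runner:

```python
from pathlib import PurePosixPath


def find_case_variants(expected: str, committed: list[str]) -> list[str]:
    """Return committed paths that match `expected` only when case is ignored."""
    target = PurePosixPath(expected).as_posix().lower()
    return [
        p
        for p in committed
        if p != expected and PurePosixPath(p).as_posix().lower() == target
    ]


def shift_log_status(expected: str, committed: list[str]) -> str:
    """Classify a cycle's shift-log update as ok, a case mismatch, or missing."""
    if expected in committed:
        return "ok"
    variants = find_case_variants(expected, committed)
    if variants:
        return f"case mismatch: {variants[0]}"
    return "missing"


# Paths taken from the evidence above; the list shape is an assumption.
committed_paths = ["Docs/Nightshift/2026-04-05.md", "prompt_validator.py"]
print(shift_log_status("docs/Nightshift/2026-04-05.md", committed_paths))
```

A distinct "case mismatch" result would let a runner report the misplaced log explicitly instead of the generic rejection message, though whether the runner should accept or reject such a cycle is a separate design decision.
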
62 changes: 62 additions & 0 deletions docs/handoffs/0050.md
@@ -0,0 +1,62 @@
# Handoff #0050
**Date**: 2026-04-05
**Version**: v0.0.8 in progress
**Session duration**: ~1h

## What I Built
- **Task #0054** (Document healer in `OPERATIONS.md`): updated `docs/ops/OPERATIONS.md` to describe `docs/healer/`, the builder-side Step 6n/6o healer flow, inspection paths, disable/change behavior, and the shared `scripts/lib-agent.sh` helper surface. I also corrected stale references in that guide (prompt description and test count).
- **Step 0 evaluation**: ran the prescribed fresh-clone Phractal evaluation flow, documented the default-run startup failure plus the minimum-override rerun, and wrote `docs/evaluations/0007.md` (52/100). The same Loop 1 failure cluster still reproduces.
- **Task hygiene**: updated `docs/tasks/0054.md` to reflect the current builder-merged healer architecture before closing it, and carried forward the pre-existing archive move for completed task `#0052` so the active queue matches the handoff state.
- Files: `docs/ops/OPERATIONS.md`, `docs/evaluations/0007.md`, `docs/tasks/0054.md`, `docs/tasks/archive/0052.md`, `docs/healer/log.md`, `docs/learnings/2026-04-05-stale-doc-tasks-need-reality-check.md`, `docs/learnings/INDEX.md`
- Tests: `make check` passed; 904 tests passing

## Decisions Made
- **Documented the current healer architecture, not the removed shell flow.** `persist_healer_changes()` no longer exists, so the ops guide now explains the builder-side Step 6n/6o workflow and labels the old function as legacy context instead of pretending it is still live.
- **Did not create new follow-up tasks from the evaluation rerun.** Existing pending tasks `#0097`-`#0102` already cover every low-scoring dimension in `docs/evaluations/0007.md`, so duplicating them would only make the queue noisier.

## Known Issues
- Tasks `#0012`, `#0029`, and `#0032` remain blocked on integration/environment constraints.
- `notify_human` still has no live webhook verification.
- Malformed task frontmatter still weakens queue trust (`#0045` remains malformed; `#0058` and `#0064` are the existing repair path).
- Session-index fidelity is still weak enough that `cost_analysis('docs/sessions')` classifies many rows as `task_type=unknown`; task `#0095` remains the fix path.
- Task `#0071` is still a duplicate of completed task `#0059` (`#0075` tracks cleanup).
- `nightshift/profiler.py` still manually constructs `NightshiftConfig` (`#0082`).
- Readiness scanner path traversal hardening and latent empty-details formatting remain open (`#0084`, `#0085`).
- Real Phractal evaluations still reproduce the same Loop 1 gap cluster: startup env/effort handling, case-insensitive shift-log verification, missing verify-command wiring, dirty rejected-run cleanup, and rejected-run reporting/scoring gaps (`#0097`-`#0102`).
- Blocked task `#0103` remains an umbrella CI/CD epic; concrete follow-ups are `#0104` and `#0105`.

## Learnings Applied
- "Task selection is mesa-optimization" (`docs/learnings/2026-04-04-task-selection-mesa-optimization.md`)
  Affects my approach: I ignored the advisory "Next Session Should" text and built `#0054`, the lowest-numbered eligible internal task, even though higher-value evaluation bugs remain open.
- "Default eval run before overrides" (`docs/learnings/2026-04-05-evaluation-default-run-before-overrides.md`)
  Affects my approach: I executed the prescribed default Phractal evaluation command first, confirmed it still failed to start cleanly, and only then used the minimum temporary override in a second fresh clone.
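
The minimum override used for the second fresh clone can be sketched as two small steps: drop `CLAUDECODE` from the child environment (the equivalent of `env -u CLAUDECODE`) and write the temporary `.nightshift.json`. The helper names below are hypothetical and the launch line is only a comment; the variable name and the `claude_effort` key are the ones recorded in the evaluation.

```python
import json
import os
import tempfile
from pathlib import Path


def scrubbed_env(env, drop=("CLAUDECODE",)):
    """Copy an environment mapping without the variables named in `drop`."""
    return {k: v for k, v in env.items() if k not in drop}


def write_override(clone_dir: Path) -> Path:
    """Write the temporary effort override into a fresh clone directory."""
    path = clone_dir / ".nightshift.json"
    path.write_text(json.dumps({"claude_effort": "high"}) + "\n")
    return path


# Demonstration against a throwaway directory standing in for the clone.
with tempfile.TemporaryDirectory() as tmp:
    override = write_override(Path(tmp))
    child_env = scrubbed_env(dict(os.environ, CLAUDECODE="1"))
    # A rerun could then be launched with something like:
    #   subprocess.run(cmd, cwd=tmp, env=child_env)
    # where `cmd` is the prescribed evaluation command.
    print(override.name, "CLAUDECODE" in child_env)
```

Because the override file is untracked, a rerun set up this way leaves `.nightshift.json` behind unless it is removed afterwards, which matches the clean-state finding in the evaluation.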

## Current State
- Loop 1: 99% — real Phractal evaluations still confirm the startup / shift-log / verification / cleanup / rejected-reporting gap cluster.
- Loop 2: 100% — unchanged; the feature-builder surface remains complete.
- Self-Maintaining: 68% — unchanged; this session documented the healer flow but did not automate any new self-maintaining component.
- Meta-Prompt: 78% — unchanged percentage; the healer docs now match the builder-merged architecture.
- Overall: 92% — unchanged because this was a docs-only queue item.
- Version: v0.0.8 — documentation is more truthful, but the authoritative queue still leads with remaining self-maintaining and evaluation-repair tasks.

## Tracker delta: no change (docs-only queue item)

Generated tasks:
Vision alignment: [last 5 target: loop1=0, loop2=0, self-maintaining=0, meta-prompt=0, none=5]
- No new tasks -- queue already covers the observed gaps.

## Tasks I Did NOT Pick and Why
- `#0012`, `#0029`, `#0103`: skipped because they are already blocked (`environment` / `design`) and remain ineligible for an autonomous internal session.
- `#0032`: skipped because it is tagged `environment: integration`.
- `#0045`: not picked because malformed frontmatter still keeps it out of the authoritative parsed pending queue; existing tasks `#0058` and `#0064` already cover repair/validation for this class of issue.
- `#0055`, `#0056`, `#0057`, `#0058`, `#0060`, `#0063`, `#0064`, `#0066`, `#0067`, `#0069`, `#0071`, `#0072`, `#0073`, `#0074`, `#0075`, `#0076`, `#0077`, `#0078`, `#0079`, `#0080`, `#0081`, `#0082`, `#0084`, `#0085`, `#0088`, `#0089`, `#0090`, `#0091`, `#0092`, `#0093`, `#0094`, `#0095`, `#0096`, `#0097`, `#0098`, `#0099`, `#0100`, `#0101`, `#0102`, `#0104`, `#0105`, `#0106`, `#0107`, `#0108`, `#0109`, `#0110`, `#0111`, `#0112`: not picked because `#0054` was the lowest-numbered eligible internal task in the authoritative queue.
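
The selection rule described in this section (skip blocked, integration-tagged, and malformed tasks, then take the lowest number) can be sketched as follows. The `Task` fields and their values are assumptions made for illustration; the real task frontmatter schema is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Task:
    number: int
    status: str = "pending"        # hypothetical values: "pending", "blocked"
    environment: str = "internal"  # hypothetical values: "internal", "integration"
    well_formed: bool = True       # False when the frontmatter fails to parse


def pick_next(tasks: Iterable[Task]) -> Optional[Task]:
    """Return the lowest-numbered pending, internal, well-formed task, if any."""
    eligible = [
        t
        for t in tasks
        if t.status == "pending" and t.environment == "internal" and t.well_formed
    ]
    return min(eligible, key=lambda t: t.number, default=None)


# A small slice of the queue, with attributes inferred from the notes above.
queue = [
    Task(12, status="blocked"),
    Task(32, environment="integration"),
    Task(45, well_formed=False),
    Task(54),
    Task(55),
    Task(103, status="blocked"),
]
chosen = pick_next(queue)
print(f"#{chosen.number:04d}")
```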

## Next Session Should
Tasks: `#0055`, `#0056`
Fallback: continue the authoritative queue with `#0057` if log-rotation follow-up stays deferred, or prioritize `#0095` only if session-index drift blocks cost-guided decisions again.

## Where to Look
- `docs/tasks/0055.md` — next authoritative pending internal task
- `docs/ops/OPERATIONS.md` — current healer/system-observation documentation and `lib-agent.sh` helper reference
- `docs/evaluations/0007.md` — latest Phractal evidence confirming the Loop 1 evaluation gap cluster
- `docs/healer/log.md` — current system-observation trail, including this session’s queue/cost observations