Merged
3 changes: 3 additions & 0 deletions docs/changelog/v0.0.8.md
@@ -14,10 +14,13 @@ Closing the self-maintaining gap: auto-release, auto-changelog, evaluation CLI,

## Fixed
- **[fix]** Shell scripts in `scripts/` now use ASCII-only section dividers and restart/status text, removing box-drawing and em-dash characters that violated repo conventions and rendered inconsistently across terminals/filesystems. (task #0038)
- **[meta]** Corrected the authoritative Step 0 evaluation command in `docs/prompt/evolve.md` so fresh-clone Phractal evaluations pass `--repo-dir /tmp/nightshift-eval` from the Nightshift repo root instead of accidentally targeting the Nightshift checkout. (task `#0117`)

## Removed

## Internal
- **[test]** Added regression coverage for the Step 0 evaluation command contract in `docs/prompt/evolve.md` and for `nightshift.evaluation.run_test_shift()` passing `--repo-dir` explicitly to the CLI.
- **[meta]** Recorded `docs/evaluations/0009.md`; the corrected default evaluation command now starts cleanly and writes artifacts under the Phractal clone without any rerun overrides, while the existing shift-log, verification, and cleanup gaps remain.
- **[meta]** Added `scripts/list-tasks.sh` plus `make tasks` so sessions can print the active queue in priority order and surface malformed task files instead of silently skipping them. (task `#0057`)
- **[test]** Added 4 regression tests covering sorted task summaries, empty-queue output, malformed-task reporting, and the `make tasks` target. Test suite is now 927 passing.
- **[meta]** Recorded another real Phractal evaluation in `docs/evaluations/0008.md`; the default Claude run still fails when `CLAUDECODE` is inherited, and the rerun reproduced the same shift-log, verification, and cleanup gaps already tracked by `#0097`-`#0102`.
36 changes: 36 additions & 0 deletions docs/evaluations/0009.md
@@ -0,0 +1,36 @@
# Evaluation #0009

**Date**: 2026-04-05
**Target**: Phractal
**Agent**: claude
**Cycles**: 2
**After task**: #0057 (Task queue summary command)

## Scores

| Dimension | Score | Notes |
|-----------|-------|-------|
| Startup | 8/10 | The corrected default command started cleanly from the Nightshift repo root, created the worktree under `/tmp/nightshift-eval`, and completed both cycles without any env/config overrides. Baseline verification still skipped because `verify_command` remained null. |
| Discovery | 8/10 | The run found a concrete security issue in `apps/api/app/middleware/fast_auth.py`: the auth cache key hashed only a token prefix/suffix, creating collision risk across distinct JWTs. |
| Fix quality | 6/10 | The proposed fix was tightly scoped and the verification command demonstrated the collision before/after behavior, but the work never landed as accepted output because both cycles were rejected. |
| Shift log | 3/10 | The session now clearly targeted the Phractal clone, but the human-facing `docs/Nightshift/2026-04-05.md` file stayed on the template text while the rejected-cycle metadata still points at `Docs/Nightshift/2026-04-05.md`. |
| State file | 8/10 | `docs/Nightshift/2026-04-05.state.json` is valid JSON and preserves both rejected cycles, the nested fix details, and the halt reason. Top-level counters still stay at zero because rejected cycles do not roll findings upward. |
| Verification | 2/10 | Baseline verification was skipped again because `verify_command` stayed null, and post-cycle verification still failed on the existing rejected-cycle checks before any repo verification could pass. |
| Guard rails | 8/10 | The run stayed inside the cloned worktree, respected file-count limits, avoided blocked paths, and rejected invalid cycles instead of silently accepting them. |
| Clean state | 4/10 | The clone was left with untracked `Docs/Nightshift/` artifacts after the rejected run, but unlike the previous reruns there was no leftover temporary `.nightshift.json` override. |
| Breadth | 5/10 | The work stayed concentrated in one backend middleware file plus the mistaken `Docs/` path, so the run still explored only a narrow slice of the repo. |
| Usefulness | 6/10 | The runner log and nested state data are actionable, and the corrected command now proves the evaluation is scoring Phractal rather than Nightshift. The day-team-facing shift log is still a template, which limits reviewability. |
| **Total** | **58/100** | |
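
The Discovery finding can be illustrated with a minimal sketch. The function names, the 8-character slice width, and the sample tokens below are hypothetical and are not taken from `fast_auth.py`; the sketch only shows why a cache key built from a token's prefix and suffix can collide across distinct tokens, while a key hashed from the full token does not.

```python
import hashlib


def weak_cache_key(token: str, width: int = 8) -> str:
    """Hypothetical key that hashes only the first and last `width` characters."""
    material = token[:width] + token[-width:]
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def full_cache_key(token: str) -> str:
    """Key derived from the entire token, so distinct tokens get distinct keys."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


# Two made-up tokens that differ only in the middle segment.
token_a = "aaaaaaaa.payload-for-alice.zzzzzzzz"
token_b = "aaaaaaaa.payload-for-bob.zzzzzzzz"

print(weak_cache_key(token_a) == weak_cache_key(token_b))  # True: the keys collide
print(full_cache_key(token_a) == full_cache_key(token_b))  # False: the keys differ
```

In a real auth cache, a collision of the first kind could let one token's cached result be served for another token, which is why the finding was scored as a concrete security issue.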

## Tasks Created

- None new. Existing pending tasks still cover every low-scoring dimension reproduced here: `#0098`, `#0099`, `#0100`, `#0101`, and `#0102`.

## Raw Evidence

- Default run: `PYTHONPATH=<nightshift-repo> python3 -m nightshift test --agent claude --cycles 2 --cycle-minutes 5 --repo-dir /tmp/nightshift-eval` from the Nightshift repo root completed a scorable run without any rerun or overrides.
- Targeting proof: the run created `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.md`, `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.state.json`, and `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.runner.log`, confirming the evaluation targeted the cloned repo instead of the Nightshift checkout.
- Shift log: `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.md` stayed on the template text while rejected-cycle verification data recorded `Docs/Nightshift/2026-04-05.md` as the touched path.
- State file: `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.state.json` recorded both cycles as `rejected`, stored the middleware security fix details, and halted with `Failed verification threshold reached.`
- Runner log: `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.runner.log` captured fix commit `8819fa0`, shift-log commit `0c961b8`, and the final JSON reporting `apps/api/app/middleware/fast_auth.py` plus `docs/Nightshift/2026-04-05.md` as touched files.
- Clean-state check: `git status --short` in `/tmp/nightshift-eval` showed only `?? Docs/Nightshift/` after the run.
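
The `Docs/` versus `docs/` discrepancy in the evidence above is a case-only path mismatch. The following is a minimal sketch of how such a mismatch could be detected and labelled; the helper names are hypothetical and this is not Nightshift's actual verification logic.

```python
from pathlib import PurePosixPath


def same_path_ignoring_case(recorded: str, expected: str) -> bool:
    """Return True when two relative paths differ at most in letter case."""
    return PurePosixPath(recorded.casefold()) == PurePosixPath(expected.casefold())


def classify_touched_path(recorded: str, expected: str) -> str:
    """Label a recorded path as an exact match, a case-only mismatch, or different."""
    if PurePosixPath(recorded) == PurePosixPath(expected):
        return "exact"
    if same_path_ignoring_case(recorded, expected):
        return "case-mismatch"
    return "different"


print(classify_touched_path("Docs/Nightshift/2026-04-05.md", "docs/Nightshift/2026-04-05.md"))
# case-mismatch
```

Reporting the case-only mismatch as its own category, rather than as a plain failure, would make this class of rejection easier to spot in the state file.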
64 changes: 64 additions & 0 deletions docs/handoffs/0053.md
@@ -0,0 +1,64 @@
# Handoff #0053
**Date**: 2026-04-05
**Version**: v0.0.8 in progress
**Session duration**: ~1h

## What I Built
- **Task #0117** (Step 0 evaluation targeting fix): corrected `docs/prompt/evolve.md` so the prescribed fresh-clone evaluation command now passes `--repo-dir /tmp/nightshift-eval` from the Nightshift repo root instead of silently targeting the Nightshift checkout.
- **Regression hardening**: added 2 tests covering the literal Step 0 prompt contract and `nightshift.evaluation.run_test_shift()` passing `--repo-dir` explicitly.
- **Evaluation bookkeeping + fragile-path validation**: wrote `docs/evaluations/0009.md` from a fresh-clone default Phractal run that now targets the cloned repo correctly, and generated follow-up task `#0120` after finding that `scripts/list-tasks.sh` is still not directly executable.
- Files: `docs/prompt/evolve.md`, `tests/test_nightshift.py`, `docs/evaluations/0009.md`, `docs/tasks/0117.md`, `docs/tasks/0120.md`, `docs/tasks/.next-id`, `docs/changelog/v0.0.8.md`, `docs/vision-tracker/TRACKER.md`, `docs/learnings/2026-04-05-prompt-contracts-need-tests.md`, `docs/learnings/INDEX.md`, `docs/healer/log.md`
- Tests: +2 new, 929 total passing
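
A prompt-contract check of the kind described above might look like the following sketch. It is not the test added in `tests/test_nightshift.py`: the helper names and the inline sample prompt text are hypothetical, and only the command-line arguments are taken from the evaluation evidence.

```python
import shlex


def eval_commands(prompt_text: str) -> list[str]:
    """Collect lines that invoke the evaluation CLI (`-m nightshift test`)."""
    return [
        line.strip()
        for line in prompt_text.splitlines()
        if "-m nightshift test" in line
    ]


def passes_repo_dir(command: str, expected_dir: str) -> bool:
    """Check that the command passes `--repo-dir <expected_dir>` as adjacent arguments."""
    args = shlex.split(command)
    return any(
        flag == "--repo-dir" and value == expected_dir
        for flag, value in zip(args, args[1:])
    )


# Hypothetical prompt excerpts: one aligned with the evaluator, one drifted.
ALIGNED = (
    "Step 0:\n"
    "    python3 -m nightshift test --agent claude --cycles 2 "
    "--cycle-minutes 5 --repo-dir /tmp/nightshift-eval\n"
)
DRIFTED = (
    "Step 0:\n"
    "    python3 -m nightshift test --agent claude --cycles 2 --cycle-minutes 5\n"
)

for label, text in (("aligned", ALIGNED), ("drifted", DRIFTED)):
    ok = all(passes_repo_dir(cmd, "/tmp/nightshift-eval") for cmd in eval_commands(text))
    print(label, ok)
```

Parsing the command with `shlex.split` instead of matching a substring keeps the check from passing when `--repo-dir` appears with a different value.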

## Decisions Made
- **Treated the prompt diff as real, not malicious.** `nightshift/evaluation.py` already passed `--repo-dir`, so the problem was control-doc drift in `docs/prompt/evolve.md`; I aligned the prompt and added tests rather than changing the evaluator.
- **Did not rerun the evaluation with overrides.** The corrected default command produced a scorable fresh-clone Phractal run on its own, so per the evaluation guide I kept the evidence from that run and skipped the second-clone override path.

## Known Issues
- Tasks `#0012`, `#0029`, and `#0032` still require integration/external resources.
- Task `#0103` remains blocked on design; concrete follow-ups are `#0104` and `#0105`.
- Malformed task frontmatter still weakens queue trust (`#0045` is still malformed; `#0058` and `#0064` remain the repair path).
- Session-index fidelity is still poor: `docs/sessions/index.md` is effectively empty, and `cost_analysis('docs/sessions')` still classifies 20 of 31 sessions as `task_type=unknown`; task `#0095` remains the fix path.
- Real Phractal evaluations still reproduce the same Loop 1 gap cluster: shift-log verification that does not handle the `Docs/` vs `docs/` path-case mismatch, missing verify-command wiring, dirty rejected-run cleanup, and rejected-run reporting/scoring gaps (`#0098`-`#0102`).
- `scripts/list-tasks.sh` still fails on direct invocation with `permission denied`; task `#0120` now tracks the follow-up.
- `nightshift/profiler.py` still manually constructs `NightshiftConfig` (`#0082`).
- Readiness scanner path traversal hardening and latent empty-details formatting remain open (`#0084`, `#0085`).
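
A `permission denied` error on direct invocation is consistent with a missing execute bit, although the handoff does not confirm the cause. As a hedged sketch, assuming a POSIX filesystem, a repo-health check for that condition could look like this; the helper name is hypothetical and the script here is a temporary stand-in, not the real `scripts/list-tasks.sh`.

```python
import os
import stat
import tempfile


def is_directly_executable(path: str) -> bool:
    """Return True if any execute permission bit is set on a regular file."""
    mode = os.stat(path).st_mode
    return stat.S_ISREG(mode) and bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in script created only for this demonstration.
    script = os.path.join(tmp, "list-tasks.sh")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write("#!/bin/sh\necho tasks\n")

    os.chmod(script, 0o644)  # readable but not executable
    before = is_directly_executable(script)

    os.chmod(script, 0o755)  # execute bits set
    after = is_directly_executable(script)

print(before, after)
```

A check like this could run in the test suite so that scripts advertised for direct invocation fail fast when the mode bits regress.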

## Learnings Applied
- "Default eval run before overrides" (`docs/learnings/2026-04-05-evaluation-default-run-before-overrides.md`)
  Affects my approach: I ran the corrected Step 0 command first against a fresh Phractal clone, kept that evidence because it was already scorable, and did not spend the session on an unnecessary override rerun.

## Current State
- Loop 1: 99% — the latest evaluation now truthfully targets Phractal without overrides, but the same shift-log/verification/cleanup/reporting failures still keep real-repo runs below 60/100.
- Loop 2: 100% — unchanged and complete.
- Self-Maintaining: 68% — unchanged percentage; prompt/evaluation drift is tighter, but automation/queue observability gaps remain.
- Meta-Prompt: 78% — unchanged percentage; the authoritative Step 0 prompt is now aligned with the evaluator and covered by regression tests.
- Overall: 92% — unchanged because this session fixed control-doc drift rather than moving a tracker bucket.
- Version: v0.0.8 — still in progress; urgent README accuracy work (`#0118`) and the outstanding self-maintaining/evaluation backlog still remain.

## Tracker delta: 92% -> 92%

## Evaluate
Run evaluation against Phractal for the changes merged this session.

Generated tasks:
Vision alignment: [last 5 target: loop1=1, loop2=0, self-maintaining=1, meta-prompt=1, none=2]
- `#0120`: Make scripts/list-tasks.sh directly executable or stop advertising direct invocation (dimension: repo health, vision: self-maintaining, priority: low)

## Tasks I Did NOT Pick and Why
- `#0012`, `#0029`: skipped because they remain blocked on integration/environment.
- `#0032`: skipped because it is tagged `environment: integration`.
- `#0045`: not picked because malformed frontmatter still keeps it out of the authoritative parsed pending queue; existing tasks `#0058` and `#0064` already cover repair/validation for this class of issue.
- `#0103`: skipped because it is already blocked on design.
- `#0058`, `#0060`, `#0063`, `#0064`, `#0066`, `#0067`, `#0069`, `#0071`, `#0072`, `#0073`, `#0074`, `#0076`, `#0077`, `#0078`, `#0079`, `#0081`, `#0082`, `#0084`, `#0085`, `#0088`, `#0089`, `#0090`, `#0091`, `#0092`, `#0093`, `#0094`, `#0095`, `#0097`, `#0098`, `#0099`, `#0100`, `#0101`, `#0102`, `#0104`, `#0105`, `#0106`, `#0107`, `#0108`, `#0109`, `#0110`, `#0111`, `#0112`, `#0113`, `#0114`, `#0115`, `#0116`, `#0118`, `#0119`: not picked because `#0117` was the lowest-numbered eligible urgent internal task in the authoritative queue.

## Next Session Should
Tasks: `#0118`, `#0119`
Fallback: continue the authoritative queue with `#0063` once the urgent README/re-verification items are cleared, or reality-check `#0097` first if another evaluation shows startup behavior has changed again.

## Where to Look
- `docs/tasks/0118.md` — next authoritative urgent task
- `docs/prompt/evolve.md` — corrected Step 0 evaluation command
- `tests/test_nightshift.py` — regression coverage for the prompt contract and `run_test_shift()`
- `docs/evaluations/0009.md` — fresh-clone evidence that the corrected default evaluation now targets Phractal
- `docs/tasks/0120.md` — follow-up from fragile automation-path validation