feat: add final judge and RewriteWorkflow orchestrator #64
Move all evaluation LLM calls from LLMStructuredColumnConfig to custom columns so we can pass Pydantic validation context with expected IDs per row. DD's correction loop retries when the LLM skips answers. Also addresses PR #61 review: shared parsers module, repair unit tests, COL_NEEDS_REPAIR to constants, consistent field() validation.
…ine-eval-repair' into lipikaramaswamy/feat/rewrite-engine-final-judge
…ine-eval-repair' into lipikaramaswamy/feat/rewrite-engine-final-judge
…ipikaramaswamy/feat/rewrite-engine-final-judge
…, use a new rewritten text colname in repair workflow
…ipikaramaswamy/feat/rewrite-engine-final-judge
…air __next rename
andreatgretel
left a comment
with #68 handling the public wiring, the main thing I am still wondering about in this PR is the short-output case in RewriteWorkflow. right now we can warn on a row-count mismatch and keep going, but the loop still seems to assume the joined columns are present afterwards. maybe this wants either stricter failure semantics here, or a more explicit way of carrying partial-row failures through the loop.
…h exist in our workflows, when lengths differ; added test_evaluate_dropping_rows_degrades_gracefully
…rewrite-engine-final-judge
Addressed in the latest commit.
…latten tests
- _join_new_columns aligns on RECORD_ID_COLUMN when adapter drops rows instead of crashing or skipping the join
- RECORD_ID_COLUMN included in all seed lists for stable ID across calls
- _join_judge_columns preserves all rows on partial judge failure, defaulting missing rows to needs_human_review=True
- Initial evaluate runs before repair loop (max_repair_iterations=0 fix)
- Flatten test_rewrite_workflow.py from classes to module-level functions
- Add tests: judge partial row loss, evaluate row drop degradation
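The row-preserving join described in this commit can be sketched roughly as below. This is only an illustration with plain dicts, not the project's code: `join_judge_columns`, the `"record_id"` key, and the `judge_privacy` field are hypothetical stand-ins for `_join_judge_columns` and `RECORD_ID_COLUMN`, whose real signatures are not shown here.

```python
RECORD_ID = "record_id"  # hypothetical stand-in for RECORD_ID_COLUMN


def join_judge_columns(rows, judged_rows, judge_fields):
    """Left-join judge output onto rows by record ID.

    Every input row is kept, in its original order. Rows the judge
    dropped get None scores and needs_human_review=True.
    """
    judged_by_id = {r[RECORD_ID]: r for r in judged_rows}
    joined = []
    for row in rows:
        out = dict(row)
        judged = judged_by_id.get(row[RECORD_ID])
        if judged is None:
            # Judge lost this row: fill defaults and escalate to a human.
            for field in judge_fields:
                out[field] = None
            out["needs_human_review"] = True
        else:
            for field in judge_fields:
                out[field] = judged.get(field)
            out.setdefault("needs_human_review", False)
        joined.append(out)
    return joined


rows = [
    {RECORD_ID: 1, "text": "a"},
    {RECORD_ID: 2, "text": "b"},
    {RECORD_ID: 3, "text": "c"},
]
# The judge returned rows out of order and dropped record 2.
judged = [{RECORD_ID: 3, "judge_privacy": 9}, {RECORD_ID: 1, "judge_privacy": 7}]
result = join_judge_columns(rows, judged, ["judge_privacy"])
```

Keying on the record ID rather than on row position is what lets the join tolerate both reordering and missing rows.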
Thanks all for your review, this was a hairy one. Merging now :)
Summary
- FinalJudgeWorkflow (engine/rewrite/final_judge.py) -- holistic LLM evaluation using LLMJudgeColumnConfig with three rubrics (privacy, quality, naturalness) on a 1-10 scale, ported verbatim from the research repo. needs_human_review flagging based on objective metrics only (failed rewrite, utility below threshold, leakage above threshold, any HIGH-sensitivity leak); judge scores are informational -- not used for automated decisions (deferred to feat(rewrite): judge rubric refinement #37).
- RewriteWorkflow (engine/rewrite/rewrite_workflow.py) -- top-level orchestrator chaining all 6 sub-workflows: domain classification, sensitivity disposition, QA generation, rewrite generation, evaluate-repair loop, and final judge. Evaluate-repair loop runs up to max_repair_iterations (from EvaluationCriteria), exits early when all rows pass, and only sends failing rows to repair. Final judge is non-critical (failure logged, defaults applied). Fast path skips all LLM calls when no entities detected.
- Schema cleanup -- removed JudgeScoreSchema / JudgeEvaluationSchema (redundant with LLMJudgeColumnConfig + Score rubrics).
- Row split/merge helpers extracted as free functions (_split_by_entities, _merge_and_reorder, _apply_passthrough_defaults) to prepare for refactor: extract shared entity-row split/reorder/recombine helper #60.
Config gaps closed
- evaluation.max_repair_iterations -- consumed by evaluate-repair loop
- evaluation.flag_utility_below -- consumed by needs_human_review
- evaluation.flag_leakage_mass_above -- consumed by needs_human_review
Type of Change
Testing
Test plan
- FinalJudgeWorkflow (prompt, rubrics, human review flagging)
- RewriteWorkflow (fast path, call order, failed records, attrs, judge failure tolerance, repair loop)
Related Issues
Closes #35
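
For readers skimming the Summary, the evaluate-repair loop and the needs_human_review flagging might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the merged implementation: the `Criteria` dataclass stands in for EvaluationCriteria (its default values are invented), and the row keys (`passed`, `utility`, `leakage_mass`, `rewrite_failed`, `high_sensitivity_leak`) and the toy `evaluate` / `repair` callables are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    # Hypothetical stand-in for EvaluationCriteria; defaults are invented.
    max_repair_iterations: int = 2
    flag_utility_below: float = 0.5
    flag_leakage_mass_above: float = 0.2


def needs_human_review(row, criteria):
    """Flag a row from objective metrics only; judge scores are ignored."""
    return (
        row.get("rewrite_failed", False)
        or row["utility"] < criteria.flag_utility_below
        or row["leakage_mass"] > criteria.flag_leakage_mass_above
        or row.get("high_sensitivity_leak", False)
    )


def evaluate_repair_loop(rows, evaluate, repair, criteria):
    """Evaluate once, then repair only failing rows, up to the limit."""
    # The initial evaluation runs even when max_repair_iterations == 0.
    rows = [evaluate(r) for r in rows]
    for _ in range(criteria.max_repair_iterations):
        if all(r["passed"] for r in rows):
            break  # early exit: nothing left to repair
        # Passing rows are left untouched; failing rows are repaired and re-evaluated.
        rows = [r if r["passed"] else evaluate(repair(r)) for r in rows]
    return rows


def toy_evaluate(row):
    # Assumption for the demo: a row passes when leakage is at most 0.2.
    return {**row, "passed": row["leakage_mass"] <= 0.2}


def toy_repair(row):
    # Assumption for the demo: each repair halves the leakage.
    return {**row, "leakage_mass": row["leakage_mass"] / 2}


criteria = Criteria()
rows = [
    {"id": 1, "utility": 0.9, "leakage_mass": 0.1},
    {"id": 2, "utility": 0.9, "leakage_mass": 0.6},
]
final = evaluate_repair_loop(rows, toy_evaluate, toy_repair, criteria)
flags = [needs_human_review(r, criteria) for r in final]
```

In this toy run, row 2 needs both repair iterations before it passes, while row 1 is never sent to repair. Setting `max_repair_iterations=0` still evaluates every row once, which is the behavior the "Initial evaluate runs before repair loop" fix describes.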