feat: add evaluate and repair workflows#61
Conversation
c6a904f to
784b653
Compare
asteier2026
left a comment
There was a problem hiding this comment.
Only one small comment in evaluate, in the re-answer prompt line 84
andreatgretel
left a comment
There was a problem hiding this comment.
The workflow structure is solid and the metric helpers are clean and well-tested. Two concerns on the evaluation logic:
-
compute_leakage_massandcompute_any_high_leakeddon't validate that the returned answer IDs match the QA IDs. If the evaluator drops a HIGH-sensitivity answer, the leak isn't counted and the row can skip repair. For a privacy gate, this should fail closed. -
compute_utility_scoreaverages whatever compare scores it gets without checking coverage. A partial compare response inflates the score.
Also flagged: _COL_NEEDS_REPAIR should probably move to constants.py before the orchestrator PR, and the repair module could use more test coverage.
|
Move all evaluation LLM calls from LLMStructuredColumnConfig to custom columns so we can pass Pydantic validation context with expected IDs per row. DD's correction loop retries when the LLM skips answers. Also addresses PR #61 review: shared parsers module, repair unit tests, COL_NEEDS_REPAIR to constants, consistent field() validation.
Extracted to |
|
nice progress on this PR. one thing i’d consider as follow-up, borrowing a bit from DD’s patterns, is tightening the workflow contract around the custom columns:
none of that needs to block this PR beyond the concrete bugs already called out, but it feels like the next step that would make these workflows more DD-native and harder to misuse later. |
First two done: |
andreatgretel
left a comment
There was a problem hiding this comment.
Approving. The blocking issues I called out are addressed on the current head.
One follow-up to keep in mind: the placeholder-collision fix looks complete in the evaluate prompts, but _render_repair_prompt() still appears to have the same sequential .replace() risk if user text contains one of the later placeholder tokens. I do not think that needs to hold this PR, but it would be good to clean up next.
…, use a new rewritten text colname in repair workflow
Decided to fix here. Added Other changes I've made since your review:
|
* feat: introduce evaluate and repair workflows * update vars after merging main, add warning * fix: remove unused schema: * feat(rewrite): add final judge workflow and rewrite orchestrator * fix: enforce per-row answer coverage with context-validated parsers Move all evaluation LLM calls from LLMStructuredColumnConfig to custom columns so we can pass Pydantic validation context with expected IDs per row. DD's correction loop retries when the LLM skips answers. Also addresses PR #61 review: shared parsers module, repair unit tests, COL_NEEDS_REPAIR to constants, consistent field() validation. * fix: remove unused logging import * fix: reject duplicate/extra answer IDs, complete required_columns, safe replace order * use re.sub for replacmeents in prompts * refactor: use render_template instead of str replace * fix: (looking forward to wiring) make custom cols output model_dump(), use a new rewritten text colname in repair workflow * add tests for parsers * feat(rewrite): strip seed dataframes before adapter calls, handle repair __next rename * fix: update pre gen seed col' * rewrite generation gets a stripped seed dataset * avoid failed row crash, log warning * fix: update _join_new_columns to align on RECORD_ID_COLUMN, which much exist in our workflows, when lengths differ; added test_evaluate_dropping_rows_degrades_gracefully * lint-fix * fix(rewrite): graceful partial-row failure, judge row preservation, flatten tests - _join_new_columns aligns on RECORD_ID_COLUMN when adapter drops rows instead of crashing or skipping the join - RECORD_ID_COLUMN included in all seed lists for stable ID across calls - _join_judge_columns preserves all rows on partial judge failure, defaulting missing rows to needs_human_review=True - Initial evaluate runs before repair loop (max_repair_iterations=0 fix) - Flatten test_rewrite_workflow.py from classes to module-level functions - Add tests: judge partial row loss, evaluate row drop degradation * fix: harden repair and judge row-loss handling * lintfix
* feat: introduce evaluate and repair workflows * update vars after merging main, add warning * fix: remove unused schema: * feat(rewrite): add final judge workflow and rewrite orchestrator * fix: enforce per-row answer coverage with context-validated parsers Move all evaluation LLM calls from LLMStructuredColumnConfig to custom columns so we can pass Pydantic validation context with expected IDs per row. DD's correction loop retries when the LLM skips answers. Also addresses PR #61 review: shared parsers module, repair unit tests, COL_NEEDS_REPAIR to constants, consistent field() validation. * fix: remove unused logging import * fix: reject duplicate/extra answer IDs, complete required_columns, safe replace order * use re.sub for replacmeents in prompts * refactor: use render_template instead of str replace * fix: (looking forward to wiring) make custom cols output model_dump(), use a new rewritten text colname in repair workflow * add tests for parsers * feat(rewrite): strip seed dataframes before adapter calls, handle repair __next rename * fix: update pre gen seed col' * rewrite generation gets a stripped seed dataset * avoid failed row crash, log warning * feat: wire rewrite mode into Anonymizer interface and display * fix: update _join_new_columns to align on RECORD_ID_COLUMN, which much exist in our workflows, when lengths differ; added test_evaluate_dropping_rows_degrades_gracefully * lint-fix * fix(rewrite): graceful partial-row failure, judge row preservation, flatten tests - _join_new_columns aligns on RECORD_ID_COLUMN when adapter drops rows instead of crashing or skipping the join - RECORD_ID_COLUMN included in all seed lists for stable ID across calls - _join_judge_columns preserves all rows on partial judge failure, defaulting missing rows to needs_human_review=True - Initial evaluate runs before repair loop (max_repair_iterations=0 fix) - Flatten test_rewrite_workflow.py from classes to module-level functions - Add tests: judge partial row loss, evaluate row drop degradation * fix(rewrite): validate replace aliases in rewrite mode, fix judge score parsing * fix: harden repair and judge row-loss handling * lintfix * fix: address review bugs and harden public column contract - compute_grouped_entities: replace misleading True-if-else-None with explicit boolean - failed records: log count at WARNING unconditionally; detail stays DEBUG - _join_judge_columns: pre-compute known_ids set outside loop to avoid O(n^2) Index reconstruction - test_rewrite_workflow: use COL_REWRITTEN_TEXT_NEXT constant - _build_user_dataframe: replace 'all non-underscore' filter with explicit per-mode whitelists; rewrite mode no longer leaks final_entities and {text_col}_with_spans into result.dataframe - harden original_text_column attr access from .get() to [] in three places * fix(rewrite): harden pipeline against parquet round-trips, auto-derive seeds, merge workflow calls Root cause fix: _parse_raw_wrapper in shared.py silently returned empty schemas for JSON-stringified inputs, causing generate_map_only to see zero entities and skip replacement-map generation entirely. All rows got empty COL_REPLACEMENT_MAP_FOR_PROMPT through the whole pipeline. Parquet round-trip hardening: - shared.py:30-34 -- json.loads handling in _parse_raw_wrapper - parsers.py:50-75 -- normalize_payload() (renamed from _to_dict): recursive normalizer for BaseModel, numpy arrays/scalars, JSON strings - parsers.py:79-131 -- all parse_* functions call normalize_payload() at entry - rewrite_generation.py:170 -- normalize_payload() on replacement map before validation - rewrite_generation.py:217 -- normalize_payload() on entities_by_value before structure check - qa_generation.py:55,289 -- parse_sensitivity_disposition() replaces raw model_validate() - repair.py:58-66 -- _replacement_map_is_empty() normalizes before checking - repair.py:175 -- explicit bool() cast for numpy booleans on COL_ANY_HIGH_LEAKED - qa_generation.py:290 -- .model_dump() output for cross-column payload normalization Auto-derive seed columns (eliminates manual _SEED_* lists): - workflow_utils.py:14-37 -- derive_seed_columns() introspects .required_columns and .side_effect_columns from column configs to compute the minimal external dependency set; RECORD_ID_COLUMN always included - workflow_utils.py:40-54 -- select_seed_cols() JSON-serializes non-string object columns for parquet survival - rewrite_workflow.py -- all 5 _SEED_* constants deleted; every adapter call site uses derive_seed_columns(columns, df) + select_seed_cols() Merge DD workflow calls (6 -> 4 adapter calls): - rewrite_workflow.py:233-240 -- replace-map-generation hoisted to orchestrator (step 1) - rewrite_workflow.py:243-269 -- domain + disposition + QA + rewrite merged into single "rewrite-pipeline" adapter call (step 2, 12 columns) - rewrite_workflow.py:384-394 -- re-evaluate only repaired rows (not all rows) - rewrite_generation.py -- simplified to pure column factory; run() deleted, RewriteGenerationResult removed, _has_entities moved to rewrite_workflow.py Evaluate answer normalization: - evaluate.py:260-303 -- _normalize_answer_items() fills missing IDs with conservative defaults (privacy: "yes"/leaked, quality: "unknown", compare: 0.0), drops duplicates and extras, preserves expected ordering - evaluate.py:199-209 -- quality answer parser uses normalizer before validation - evaluate.py:221-231 -- privacy answer parser uses normalizer before validation - evaluate.py:242-257 -- QA compare parser uses normalizer before validation Disposition ID tolerance: - schemas/rewrite.py:173-178 -- _validate_ids_sequential replaced with _normalize_ids that silently renumbers (LLM duplicate/gap IDs no longer crash) - sensitivity_disposition.py:149 -- prompt relaxed from "must be sequential" to "cosmetic" Replace workflow observability: - llm_replace_workflow.py:20-21 -- logger + _RAW_REPLACEMENT_MAP_COL for debugging - llm_replace_workflow.py:97 -- stash raw replacement map before filtering - llm_replace_workflow.py:150-154 -- warn on unexpected replacement map type - llm_replace_workflow.py:174-188 -- debug/warn logging for filter results per record Adapter edge case: - adapter.py:85 -- min(preview_num_records, len(input)) prevents DD seed wrap-around Tests: 333 pass, 0 lint errors, e2e verified (10/10 records, 0 failures) - test_schemas.py -- 6 new: from_raw JSON strings, disposition ID renumbering - test_workflow_utils.py -- 8 new: derive_seed_columns + select_seed_cols - test_evaluate.py -- 2 new: _normalize_answer_items coverage - test_parsers.py -- 2 new: numpy array payloads - test_repair.py -- 1 new: numpy array replacement map - test_rewrite_workflow.py -- 3 new: _has_entities, updated mocks for merged workflow calls - test_rewrite_generation.py -- simplified to column-factory tests only (run() tests removed) Made-with: Cursor
Summary
Implements Step 5 of the rewrite pipeline: evaluate rewritten text for quality and privacy, then repair if privacy leaks are detected.
EvaluateWorkflow(engine/rewrite/evaluate.py) — 6-column NDD pipeline: LLM quality re-answer, LLM privacy re-answer, QA pairing, LLM quality comparison, metric computation (utility_score, leakage_mass, any_high_leaked), and repair-needs determination.RepairWorkflow(engine/rewrite/repair.py) — 2-column NDD pipeline: leaked-items formatting and LLM repair viaPydanticResponseRecipeinside a custom column generator with dynamically-resolved model alias.Schema changes:
PrivacyAnswerenum enforces yes/no only — privacy questions are binary ("can entity X be deduced?"), so "unknown" is a hedge that doesn't serve the pipeline;format_for_rewrite_context()now usesprotected_entitiesinstead ofmedium_and_high_sensitivity_entities— the rewriter and evaluator must operate on the same entity set, otherwise LOW-sensitivity entities flagged for protection (due to combinatorial re-identification risk) would be evaluated but never shown to the rewriter.Design decision - separate workflows instead of one combined pipeline, so the orchestrator can run evaluate on all rows, filter to failing rows, run repair on just those, and re-evaluate only repaired rows (no wasted LLM calls on passing rows)
Prompts ported from the research repo with XML section tags,
<<PLACEHOLDER>>+.replace()convention, and schema field names validated at import time via_field().Type of Change
Testing
Related Issues
Closes #34