Skip to content

feat: add evaluate and repair workflows#61

Merged
lipikaramaswamy merged 8 commits into
mainfrom
lipikaramaswamy/feat/rewrite-engine-eval-repair
Mar 25, 2026
Merged

feat: add evaluate and repair workflows#61
lipikaramaswamy merged 8 commits into
mainfrom
lipikaramaswamy/feat/rewrite-engine-eval-repair

Conversation

@lipikaramaswamy
Copy link
Copy Markdown
Collaborator

Summary

Implements Step 5 of the rewrite pipeline: evaluate rewritten text for quality and privacy, then repair if privacy leaks are detected.

EvaluateWorkflow (engine/rewrite/evaluate.py) — 6-column NDD pipeline: LLM quality re-answer, LLM privacy re-answer, QA pairing, LLM quality comparison, metric computation (utility_score, leakage_mass, any_high_leaked), and repair-needs determination.

RepairWorkflow (engine/rewrite/repair.py) — 2-column NDD pipeline: leaked-items formatting and LLM repair via PydanticResponseRecipe inside a custom column generator with dynamically-resolved model alias.

Schema changes: PrivacyAnswer enum enforces yes/no only — privacy questions are binary ("can entity X be deduced?"), so "unknown" is a hedge that doesn't serve the pipeline; format_for_rewrite_context() now uses protected_entities instead of medium_and_high_sensitivity_entities — the rewriter and evaluator must operate on the same entity set, otherwise LOW-sensitivity entities flagged for protection (due to combinatorial re-identification risk) would be evaluated but never shown to the rewriter.

Design decision - separate workflows instead of one combined pipeline, so the orchestrator can run evaluate on all rows, filter to failing rows, run repair on just those, and re-evaluate only repaired rows (no wasted LLM calls on passing rows)

Prompts ported from the research repo with XML section tags, <<PLACEHOLDER>> + .replace() convention, and schema field names validated at import time via _field().

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • Tests pass locally
  • Added/updated tests for changes

Related Issues

Closes #34

@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner March 20, 2026 06:52
@lipikaramaswamy lipikaramaswamy changed the title feat: feat: add evaluate and repair workflows Mar 20, 2026
@lipikaramaswamy lipikaramaswamy force-pushed the lipikaramaswamy/feat/rewrite-engine-eval-repair branch from c6a904f to 784b653 Compare March 20, 2026 07:34
Comment thread src/anonymizer/engine/rewrite/evaluate.py
Copy link
Copy Markdown
Contributor

@asteier2026 asteier2026 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one small comment in evaluate, in the re-answer prompt line 84

Copy link
Copy Markdown
Collaborator

@andreatgretel andreatgretel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow structure is solid and the metric helpers are clean and well-tested. Two concerns on the evaluation logic:

  1. compute_leakage_mass and compute_any_high_leaked don't validate that the returned answer IDs match the QA IDs. If the evaluator drops a HIGH-sensitivity answer, the leak isn't counted and the row can skip repair. For a privacy gate, this should fail closed.

  2. compute_utility_score averages whatever compare scores it gets without checking coverage. A partial compare response inflates the score.

Also flagged: _COL_NEEDS_REPAIR should probably move to constants.py before the orchestrator PR, and the repair module could use more test coverage.

Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated
Comment thread src/anonymizer/engine/rewrite/evaluate.py
Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated
Comment thread tests/engine/test_repair.py
Comment thread src/anonymizer/engine/rewrite/repair.py Outdated
@andreatgretel
Copy link
Copy Markdown
Collaborator

repair.py:69-74 and evaluate.py:163-168

def _parse_privacy_answers(raw: Any) -> list[PrivacyAnswerItemSchema]:
    if isinstance(raw, PrivacyAnswersSchema):
        return raw.answers
    if isinstance(raw, dict):
        return PrivacyAnswersSchema.model_validate(raw).answers
    raise TypeError(...)

_parse_privacy_answers and _parse_privacy_qa are identical between evaluate and repair - maybe extract to a shared module or add classmethods on the schemas?

Move all evaluation LLM calls from LLMStructuredColumnConfig to custom
columns so we can pass Pydantic validation context with expected IDs
per row. DD's correction loop retries when the LLM skips answers.
Also addresses PR #61 review: shared parsers module, repair unit tests,
COL_NEEDS_REPAIR to constants, consistent field() validation.
@lipikaramaswamy
Copy link
Copy Markdown
Collaborator Author

repair.py:69-74 and evaluate.py:163-168

def _parse_privacy_answers(raw: Any) -> list[PrivacyAnswerItemSchema]:
    if isinstance(raw, PrivacyAnswersSchema):
        return raw.answers
    if isinstance(raw, dict):
        return PrivacyAnswersSchema.model_validate(raw).answers
    raise TypeError(...)

_parse_privacy_answers and _parse_privacy_qa are identical between evaluate and repair - maybe extract to a shared module or add classmethods on the schemas?

Extracted to rewrite/parsers.py. The shared parse helpers (parse_privacy_answers, parse_privacy_qa, parse_quality_qa, parse_quality_answers, parse_quality_compare, parse_sensitivity_disposition) and the field() validator live there. Both evaluate.py and repair.py import from it.

Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated
Comment thread src/anonymizer/engine/schemas/rewrite.py
Comment thread src/anonymizer/engine/rewrite/repair.py Outdated
Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated
@andreatgretel
Copy link
Copy Markdown
Collaborator

nice progress on this PR. one thing i’d consider as follow-up, borrowing a bit from DD’s patterns, is tightening the workflow contract around the custom columns:

  • make required_columns exhaustive for anything the generator actually reads, since DD uses that metadata for DAG ordering and missing-input checks
  • keep pushing validation to the parser/schema boundary, not the metric helpers. missing IDs are fixed now, but duplicate / extra IDs would be good to reject there too
  • add one tiny workflow-level preview/smoke test that runs the evaluate/repair columns together on a row, so metadata issues show up before orchestrator wiring

none of that needs to block this PR beyond the concrete bugs already called out, but it feels like the next step that would make these workflows more DD-native and harder to misuse later.

@lipikaramaswamy
Copy link
Copy Markdown
Collaborator Author

nice progress on this PR. one thing i’d consider as follow-up, borrowing a bit from DD’s patterns, is tightening the workflow contract around the custom columns:

  • make required_columns exhaustive for anything the generator actually reads, since DD uses that metadata for DAG ordering and missing-input checks
  • keep pushing validation to the parser/schema boundary, not the metric helpers. missing IDs are fixed now, but duplicate / extra IDs would be good to reject there too
  • add one tiny workflow-level preview/smoke test that runs the evaluate/repair columns together on a row, so metadata issues show up before orchestrator wiring

none of that needs to block this PR beyond the concrete bugs already called out, but it feels like the next step that would make these workflows more DD-native and harder to misuse later.

First two done: required_columns is now exhaustive across all 7 generators (audited), and duplicate/extra IDs are rejected at the schema boundary via _validate_id_coverage(). The workflow-level smoke test we'll pick up in the orchestrator PR.

Copy link
Copy Markdown
Collaborator

@andreatgretel andreatgretel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. The blocking issues I called out are addressed on the current head.

One follow-up to keep in mind: the placeholder-collision fix looks complete in the evaluate prompts, but _render_repair_prompt() still appears to have the same sequential .replace() risk if user text contains one of the later placeholder tokens. I do not think that needs to hold this PR, but it would be good to clean up next.

@lipikaramaswamy
Copy link
Copy Markdown
Collaborator Author

lipikaramaswamy commented Mar 25, 2026

Approving. The blocking issues I called out are addressed on the current head.

One follow-up to keep in mind: the placeholder-collision fix looks complete in the evaluate prompts, but _render_repair_prompt() still appears to have the same sequential .replace() risk if user text contains one of the later placeholder tokens. I do not think that needs to hold this PR, but it would be good to clean up next.

Decided to fix here. Added render_template() to parsers.py — single-pass regex substitution that replaces all placeholders simultaneously. Both evaluate and repair prompts now use it, so no .replace() chain ordering risk anywhere. Also added test_parsers.py with a dedicated cross-contamination test that verifies <<SCORE>> inside user text doesn't get substituted. (and tracking #66)

Other changes I've made since your review:

  • .model_dump() on all three evaluate custom column outputs (DD's sanitize_record() needs JSON-serializable values)
  • Repair column writes to COL_REWRITTEN_TEXT_NEXT instead of overwriting the seed column (DD rejects generated columns with the same name as seed columns)

@lipikaramaswamy lipikaramaswamy merged commit 4e519f9 into main Mar 25, 2026
5 checks passed
@lipikaramaswamy lipikaramaswamy deleted the lipikaramaswamy/feat/rewrite-engine-eval-repair branch March 25, 2026 16:14
lipikaramaswamy added a commit that referenced this pull request Mar 26, 2026
* feat: introduce evaluate and repair workflows

* update vars after merging main, add warning

* fix: remove unused schema:

* feat(rewrite): add final judge workflow and rewrite orchestrator

* fix: enforce per-row answer coverage with context-validated parsers
Move all evaluation LLM calls from LLMStructuredColumnConfig to custom
columns so we can pass Pydantic validation context with expected IDs
per row. DD's correction loop retries when the LLM skips answers.
Also addresses PR #61 review: shared parsers module, repair unit tests,
COL_NEEDS_REPAIR to constants, consistent field() validation.

* fix: remove unused logging import

* fix: reject duplicate/extra answer IDs, complete required_columns, safe replace order

* use re.sub for replacmeents in prompts

* refactor: use render_template instead of str replace

* fix: (looking forward to wiring) make custom cols output model_dump(), use a new rewritten text colname in repair workflow

* add tests for parsers

* feat(rewrite): strip seed dataframes before adapter calls, handle repair __next rename

* fix: update pre gen seed col'

* rewrite generation gets a stripped seed dataset

* avoid failed row crash, log warning

* fix: update _join_new_columns to align on RECORD_ID_COLUMN, which much exist in our workflows, when lengths differ; added test_evaluate_dropping_rows_degrades_gracefully

* lint-fix

* fix(rewrite): graceful partial-row failure, judge row preservation, flatten tests
- _join_new_columns aligns on RECORD_ID_COLUMN when adapter drops rows
  instead of crashing or skipping the join
- RECORD_ID_COLUMN included in all seed lists for stable ID across calls
- _join_judge_columns preserves all rows on partial judge failure,
  defaulting missing rows to needs_human_review=True
- Initial evaluate runs before repair loop (max_repair_iterations=0 fix)
- Flatten test_rewrite_workflow.py from classes to module-level functions
- Add tests: judge partial row loss, evaluate row drop degradation

* fix: harden repair and judge row-loss handling

* lintfix
lipikaramaswamy added a commit that referenced this pull request Mar 30, 2026
* feat: introduce evaluate and repair workflows

* update vars after merging main, add warning

* fix: remove unused schema:

* feat(rewrite): add final judge workflow and rewrite orchestrator

* fix: enforce per-row answer coverage with context-validated parsers
Move all evaluation LLM calls from LLMStructuredColumnConfig to custom
columns so we can pass Pydantic validation context with expected IDs
per row. DD's correction loop retries when the LLM skips answers.
Also addresses PR #61 review: shared parsers module, repair unit tests,
COL_NEEDS_REPAIR to constants, consistent field() validation.

* fix: remove unused logging import

* fix: reject duplicate/extra answer IDs, complete required_columns, safe replace order

* use re.sub for replacmeents in prompts

* refactor: use render_template instead of str replace

* fix: (looking forward to wiring) make custom cols output model_dump(), use a new rewritten text colname in repair workflow

* add tests for parsers

* feat(rewrite): strip seed dataframes before adapter calls, handle repair __next rename

* fix: update pre gen seed col'

* rewrite generation gets a stripped seed dataset

* avoid failed row crash, log warning

* feat: wire rewrite mode into Anonymizer interface and display

* fix: update _join_new_columns to align on RECORD_ID_COLUMN, which much exist in our workflows, when lengths differ; added test_evaluate_dropping_rows_degrades_gracefully

* lint-fix

* fix(rewrite): graceful partial-row failure, judge row preservation, flatten tests
- _join_new_columns aligns on RECORD_ID_COLUMN when adapter drops rows
  instead of crashing or skipping the join
- RECORD_ID_COLUMN included in all seed lists for stable ID across calls
- _join_judge_columns preserves all rows on partial judge failure,
  defaulting missing rows to needs_human_review=True
- Initial evaluate runs before repair loop (max_repair_iterations=0 fix)
- Flatten test_rewrite_workflow.py from classes to module-level functions
- Add tests: judge partial row loss, evaluate row drop degradation

* fix(rewrite): validate replace aliases in rewrite mode, fix judge score parsing

* fix: harden repair and judge row-loss handling

* lintfix

* fix: address review bugs and harden public column contract

  - compute_grouped_entities: replace misleading True-if-else-None with
    explicit boolean
  - failed records: log count at WARNING unconditionally; detail stays DEBUG
  - _join_judge_columns: pre-compute known_ids set outside loop to avoid
    O(n^2) Index reconstruction
  - test_rewrite_workflow: use COL_REWRITTEN_TEXT_NEXT constant
  - _build_user_dataframe: replace 'all non-underscore' filter with explicit
    per-mode whitelists; rewrite mode no longer leaks final_entities and
    {text_col}_with_spans into result.dataframe
  - harden original_text_column attr access from .get() to [] in three places

* fix(rewrite): harden pipeline against parquet round-trips, auto-derive seeds, merge workflow calls

Root cause fix: _parse_raw_wrapper in shared.py silently returned empty
schemas for JSON-stringified inputs, causing generate_map_only to see
zero entities and skip replacement-map generation entirely. All rows
got empty COL_REPLACEMENT_MAP_FOR_PROMPT through the whole pipeline.

Parquet round-trip hardening:
- shared.py:30-34          -- json.loads handling in _parse_raw_wrapper
- parsers.py:50-75         -- normalize_payload() (renamed from _to_dict): recursive
                              normalizer for BaseModel, numpy arrays/scalars, JSON strings
- parsers.py:79-131        -- all parse_* functions call normalize_payload() at entry
- rewrite_generation.py:170 -- normalize_payload() on replacement map before validation
- rewrite_generation.py:217 -- normalize_payload() on entities_by_value before structure check
- qa_generation.py:55,289  -- parse_sensitivity_disposition() replaces raw model_validate()
- repair.py:58-66          -- _replacement_map_is_empty() normalizes before checking
- repair.py:175            -- explicit bool() cast for numpy booleans on COL_ANY_HIGH_LEAKED
- qa_generation.py:290     -- .model_dump() output for cross-column payload normalization

Auto-derive seed columns (eliminates manual _SEED_* lists):
- workflow_utils.py:14-37  -- derive_seed_columns() introspects .required_columns and
                              .side_effect_columns from column configs to compute the minimal
                              external dependency set; RECORD_ID_COLUMN always included
- workflow_utils.py:40-54  -- select_seed_cols() JSON-serializes non-string object columns
                              for parquet survival
- rewrite_workflow.py       -- all 5 _SEED_* constants deleted; every adapter call site
                              uses derive_seed_columns(columns, df) + select_seed_cols()

Merge DD workflow calls (6 -> 4 adapter calls):
- rewrite_workflow.py:233-240  -- replace-map-generation hoisted to orchestrator (step 1)
- rewrite_workflow.py:243-269  -- domain + disposition + QA + rewrite merged into single
                                  "rewrite-pipeline" adapter call (step 2, 12 columns)
- rewrite_workflow.py:384-394  -- re-evaluate only repaired rows (not all rows)
- rewrite_generation.py        -- simplified to pure column factory; run() deleted,
                                  RewriteGenerationResult removed, _has_entities moved
                                  to rewrite_workflow.py

Evaluate answer normalization:
- evaluate.py:260-303      -- _normalize_answer_items() fills missing IDs with conservative
                              defaults (privacy: "yes"/leaked, quality: "unknown", compare: 0.0),
                              drops duplicates and extras, preserves expected ordering
- evaluate.py:199-209      -- quality answer parser uses normalizer before validation
- evaluate.py:221-231      -- privacy answer parser uses normalizer before validation
- evaluate.py:242-257      -- QA compare parser uses normalizer before validation

Disposition ID tolerance:
- schemas/rewrite.py:173-178 -- _validate_ids_sequential replaced with _normalize_ids
                                 that silently renumbers (LLM duplicate/gap IDs no longer crash)
- sensitivity_disposition.py:149 -- prompt relaxed from "must be sequential" to "cosmetic"

Replace workflow observability:
- llm_replace_workflow.py:20-21  -- logger + _RAW_REPLACEMENT_MAP_COL for debugging
- llm_replace_workflow.py:97     -- stash raw replacement map before filtering
- llm_replace_workflow.py:150-154 -- warn on unexpected replacement map type
- llm_replace_workflow.py:174-188 -- debug/warn logging for filter results per record

Adapter edge case:
- adapter.py:85            -- min(preview_num_records, len(input)) prevents DD seed wrap-around

Tests: 333 pass, 0 lint errors, e2e verified (10/10 records, 0 failures)
- test_schemas.py          -- 6 new: from_raw JSON strings, disposition ID renumbering
- test_workflow_utils.py   -- 8 new: derive_seed_columns + select_seed_cols
- test_evaluate.py         -- 2 new: _normalize_answer_items coverage
- test_parsers.py          -- 2 new: numpy array payloads
- test_repair.py           -- 1 new: numpy array replacement map
- test_rewrite_workflow.py -- 3 new: _has_entities, updated mocks for merged workflow calls
- test_rewrite_generation.py -- simplified to column-factory tests only (run() tests removed)

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(rewrite): engine — evaluate & repair

3 participants