feat: add evaluate and repair workflows by lipikaramaswamy · Pull Request #61 · NVIDIA-NeMo/Anonymizer

lipikaramaswamy · 2026-03-20T06:52:58Z

Summary

Implements Step 5 of the rewrite pipeline: evaluate rewritten text for quality and privacy, then repair if privacy leaks are detected.

EvaluateWorkflow (engine/rewrite/evaluate.py) — 6-column NDD pipeline: LLM quality re-answer, LLM privacy re-answer, QA pairing, LLM quality comparison, metric computation (utility_score, leakage_mass, any_high_leaked), and repair-needs determination.

RepairWorkflow (engine/rewrite/repair.py) — 2-column NDD pipeline: leaked-items formatting and LLM repair via PydanticResponseRecipe inside a custom column generator with dynamically-resolved model alias.

Schema changes: PrivacyAnswer enum enforces yes/no only — privacy questions are binary ("can entity X be deduced?"), so "unknown" is a hedge that doesn't serve the pipeline; format_for_rewrite_context() now uses protected_entities instead of medium_and_high_sensitivity_entities — the rewriter and evaluator must operate on the same entity set, otherwise LOW-sensitivity entities flagged for protection (due to combinatorial re-identification risk) would be evaluated but never shown to the rewriter.

Design decision - separate workflows instead of one combined pipeline, so the orchestrator can run evaluate on all rows, filter to failing rows, run repair on just those, and re-evaluate only repaired rows (no wasted LLM calls on passing rows)

Prompts ported from the research repo with XML section tags, <<PLACEHOLDER>> + .replace() convention, and schema field names validated at import time via _field().

Type of Change

Testing

Tests pass locally
Added/updated tests for changes

Related Issues

Closes #34

asteier2026

Only one small comment in evaluate, in the re-answer prompt line 84

andreatgretel

The workflow structure is solid and the metric helpers are clean and well-tested. Two concerns on the evaluation logic:

compute_leakage_mass and compute_any_high_leaked don't validate that the returned answer IDs match the QA IDs. If the evaluator drops a HIGH-sensitivity answer, the leak isn't counted and the row can skip repair. For a privacy gate, this should fail closed.
compute_utility_score averages whatever compare scores it gets without checking coverage. A partial compare response inflates the score.

Also flagged: _COL_NEEDS_REPAIR should probably move to constants.py before the orchestrator PR, and the repair module could use more test coverage.

andreatgretel · 2026-03-23T21:31:56Z

repair.py:69-74 and evaluate.py:163-168

def _parse_privacy_answers(raw: Any) -> list[PrivacyAnswerItemSchema]:
    if isinstance(raw, PrivacyAnswersSchema):
        return raw.answers
    if isinstance(raw, dict):
        return PrivacyAnswersSchema.model_validate(raw).answers
    raise TypeError(...)

_parse_privacy_answers and _parse_privacy_qa are identical between evaluate and repair - maybe extract to a shared module or add classmethods on the schemas?

Move all evaluation LLM calls from LLMStructuredColumnConfig to custom columns so we can pass Pydantic validation context with expected IDs per row. DD's correction loop retries when the LLM skips answers. Also addresses PR #61 review: shared parsers module, repair unit tests, COL_NEEDS_REPAIR to constants, consistent field() validation.

lipikaramaswamy · 2026-03-24T02:29:46Z

repair.py:69-74 and evaluate.py:163-168
def _parse_privacy_answers(raw: Any) -> list[PrivacyAnswerItemSchema]:
    if isinstance(raw, PrivacyAnswersSchema):
        return raw.answers
    if isinstance(raw, dict):
        return PrivacyAnswersSchema.model_validate(raw).answers
    raise TypeError(...)
_parse_privacy_answers and _parse_privacy_qa are identical between evaluate and repair - maybe extract to a shared module or add classmethods on the schemas?

Extracted to rewrite/parsers.py. The shared parse helpers (parse_privacy_answers, parse_privacy_qa, parse_quality_qa, parse_quality_answers, parse_quality_compare, parse_sensitivity_disposition) and the field() validator live there. Both evaluate.py and repair.py import from it.

andreatgretel · 2026-03-24T19:07:33Z

nice progress on this PR. one thing i’d consider as follow-up, borrowing a bit from DD’s patterns, is tightening the workflow contract around the custom columns:

make required_columns exhaustive for anything the generator actually reads, since DD uses that metadata for DAG ordering and missing-input checks
keep pushing validation to the parser/schema boundary, not the metric helpers. missing IDs are fixed now, but duplicate / extra IDs would be good to reject there too
add one tiny workflow-level preview/smoke test that runs the evaluate/repair columns together on a row, so metadata issues show up before orchestrator wiring

none of that needs to block this PR beyond the concrete bugs already called out, but it feels like the next step that would make these workflows more DD-native and harder to misuse later.

…fe replace order

lipikaramaswamy · 2026-03-24T20:13:05Z

nice progress on this PR. one thing i’d consider as follow-up, borrowing a bit from DD’s patterns, is tightening the workflow contract around the custom columns:

make required_columns exhaustive for anything the generator actually reads, since DD uses that metadata for DAG ordering and missing-input checks

keep pushing validation to the parser/schema boundary, not the metric helpers. missing IDs are fixed now, but duplicate / extra IDs would be good to reject there too

add one tiny workflow-level preview/smoke test that runs the evaluate/repair columns together on a row, so metadata issues show up before orchestrator wiring

none of that needs to block this PR beyond the concrete bugs already called out, but it feels like the next step that would make these workflows more DD-native and harder to misuse later.

First two done: required_columns is now exhaustive across all 7 generators (audited), and duplicate/extra IDs are rejected at the schema boundary via _validate_id_coverage(). The workflow-level smoke test we'll pick up in the orchestrator PR.

andreatgretel

Approving. The blocking issues I called out are addressed on the current head.

One follow-up to keep in mind: the placeholder-collision fix looks complete in the evaluate prompts, but _render_repair_prompt() still appears to have the same sequential .replace() risk if user text contains one of the later placeholder tokens. I do not think that needs to hold this PR, but it would be good to clean up next.

…, use a new rewritten text colname in repair workflow

lipikaramaswamy · 2026-03-25T01:15:50Z

Approving. The blocking issues I called out are addressed on the current head.

One follow-up to keep in mind: the placeholder-collision fix looks complete in the evaluate prompts, but _render_repair_prompt() still appears to have the same sequential .replace() risk if user text contains one of the later placeholder tokens. I do not think that needs to hold this PR, but it would be good to clean up next.

Decided to fix here. Added render_template() to parsers.py — single-pass regex substitution that replaces all placeholders simultaneously. Both evaluate and repair prompts now use it, so no .replace() chain ordering risk anywhere. Also added test_parsers.py with a dedicated cross-contamination test that verifies <<SCORE>> inside user text doesn't get substituted. (and tracking #66)

Other changes I've made since your review:

.model_dump() on all three evaluate custom column outputs (DD's sanitize_record() needs JSON-serializable values)
Repair column writes to COL_REWRITTEN_TEXT_NEXT instead of overwriting the seed column (DD rejects generated columns with the same name as seed columns)

* feat: introduce evaluate and repair workflows * update vars after merging main, add warning * fix: remove unused schema: * feat(rewrite): add final judge workflow and rewrite orchestrator * fix: enforce per-row answer coverage with context-validated parsers Move all evaluation LLM calls from LLMStructuredColumnConfig to custom columns so we can pass Pydantic validation context with expected IDs per row. DD's correction loop retries when the LLM skips answers. Also addresses PR #61 review: shared parsers module, repair unit tests, COL_NEEDS_REPAIR to constants, consistent field() validation. * fix: remove unused logging import * fix: reject duplicate/extra answer IDs, complete required_columns, safe replace order * use re.sub for replacmeents in prompts * refactor: use render_template instead of str replace * fix: (looking forward to wiring) make custom cols output model_dump(), use a new rewritten text colname in repair workflow * add tests for parsers * feat(rewrite): strip seed dataframes before adapter calls, handle repair __next rename * fix: update pre gen seed col' * rewrite generation gets a stripped seed dataset * avoid failed row crash, log warning * fix: update _join_new_columns to align on RECORD_ID_COLUMN, which much exist in our workflows, when lengths differ; added test_evaluate_dropping_rows_degrades_gracefully * lint-fix * fix(rewrite): graceful partial-row failure, judge row preservation, flatten tests - _join_new_columns aligns on RECORD_ID_COLUMN when adapter drops rows instead of crashing or skipping the join - RECORD_ID_COLUMN included in all seed lists for stable ID across calls - _join_judge_columns preserves all rows on partial judge failure, defaulting missing rows to needs_human_review=True - Initial evaluate runs before repair loop (max_repair_iterations=0 fix) - Flatten test_rewrite_workflow.py from classes to module-level functions - Add tests: judge partial row loss, evaluate row drop degradation * fix: harden repair and judge row-loss handling * lintfix

* feat: introduce evaluate and repair workflows * update vars after merging main, add warning * fix: remove unused schema: * feat(rewrite): add final judge workflow and rewrite orchestrator * fix: enforce per-row answer coverage with context-validated parsers Move all evaluation LLM calls from LLMStructuredColumnConfig to custom columns so we can pass Pydantic validation context with expected IDs per row. DD's correction loop retries when the LLM skips answers. Also addresses PR #61 review: shared parsers module, repair unit tests, COL_NEEDS_REPAIR to constants, consistent field() validation. * fix: remove unused logging import * fix: reject duplicate/extra answer IDs, complete required_columns, safe replace order * use re.sub for replacmeents in prompts * refactor: use render_template instead of str replace * fix: (looking forward to wiring) make custom cols output model_dump(), use a new rewritten text colname in repair workflow * add tests for parsers * feat(rewrite): strip seed dataframes before adapter calls, handle repair __next rename * fix: update pre gen seed col' * rewrite generation gets a stripped seed dataset * avoid failed row crash, log warning * feat: wire rewrite mode into Anonymizer interface and display * fix: update _join_new_columns to align on RECORD_ID_COLUMN, which much exist in our workflows, when lengths differ; added test_evaluate_dropping_rows_degrades_gracefully * lint-fix * fix(rewrite): graceful partial-row failure, judge row preservation, flatten tests - _join_new_columns aligns on RECORD_ID_COLUMN when adapter drops rows instead of crashing or skipping the join - RECORD_ID_COLUMN included in all seed lists for stable ID across calls - _join_judge_columns preserves all rows on partial judge failure, defaulting missing rows to needs_human_review=True - Initial evaluate runs before repair loop (max_repair_iterations=0 fix) - Flatten test_rewrite_workflow.py from classes to module-level functions - Add tests: judge partial row loss, evaluate row drop degradation * fix(rewrite): validate replace aliases in rewrite mode, fix judge score parsing * fix: harden repair and judge row-loss handling * lintfix * fix: address review bugs and harden public column contract - compute_grouped_entities: replace misleading True-if-else-None with explicit boolean - failed records: log count at WARNING unconditionally; detail stays DEBUG - _join_judge_columns: pre-compute known_ids set outside loop to avoid O(n^2) Index reconstruction - test_rewrite_workflow: use COL_REWRITTEN_TEXT_NEXT constant - _build_user_dataframe: replace 'all non-underscore' filter with explicit per-mode whitelists; rewrite mode no longer leaks final_entities and {text_col}_with_spans into result.dataframe - harden original_text_column attr access from .get() to [] in three places * fix(rewrite): harden pipeline against parquet round-trips, auto-derive seeds, merge workflow calls Root cause fix: _parse_raw_wrapper in shared.py silently returned empty schemas for JSON-stringified inputs, causing generate_map_only to see zero entities and skip replacement-map generation entirely. All rows got empty COL_REPLACEMENT_MAP_FOR_PROMPT through the whole pipeline. Parquet round-trip hardening: - shared.py:30-34 -- json.loads handling in _parse_raw_wrapper - parsers.py:50-75 -- normalize_payload() (renamed from _to_dict): recursive normalizer for BaseModel, numpy arrays/scalars, JSON strings - parsers.py:79-131 -- all parse_* functions call normalize_payload() at entry - rewrite_generation.py:170 -- normalize_payload() on replacement map before validation - rewrite_generation.py:217 -- normalize_payload() on entities_by_value before structure check - qa_generation.py:55,289 -- parse_sensitivity_disposition() replaces raw model_validate() - repair.py:58-66 -- _replacement_map_is_empty() normalizes before checking - repair.py:175 -- explicit bool() cast for numpy booleans on COL_ANY_HIGH_LEAKED - qa_generation.py:290 -- .model_dump() output for cross-column payload normalization Auto-derive seed columns (eliminates manual _SEED_* lists): - workflow_utils.py:14-37 -- derive_seed_columns() introspects .required_columns and .side_effect_columns from column configs to compute the minimal external dependency set; RECORD_ID_COLUMN always included - workflow_utils.py:40-54 -- select_seed_cols() JSON-serializes non-string object columns for parquet survival - rewrite_workflow.py -- all 5 _SEED_* constants deleted; every adapter call site uses derive_seed_columns(columns, df) + select_seed_cols() Merge DD workflow calls (6 -> 4 adapter calls): - rewrite_workflow.py:233-240 -- replace-map-generation hoisted to orchestrator (step 1) - rewrite_workflow.py:243-269 -- domain + disposition + QA + rewrite merged into single "rewrite-pipeline" adapter call (step 2, 12 columns) - rewrite_workflow.py:384-394 -- re-evaluate only repaired rows (not all rows) - rewrite_generation.py -- simplified to pure column factory; run() deleted, RewriteGenerationResult removed, _has_entities moved to rewrite_workflow.py Evaluate answer normalization: - evaluate.py:260-303 -- _normalize_answer_items() fills missing IDs with conservative defaults (privacy: "yes"/leaked, quality: "unknown", compare: 0.0), drops duplicates and extras, preserves expected ordering - evaluate.py:199-209 -- quality answer parser uses normalizer before validation - evaluate.py:221-231 -- privacy answer parser uses normalizer before validation - evaluate.py:242-257 -- QA compare parser uses normalizer before validation Disposition ID tolerance: - schemas/rewrite.py:173-178 -- _validate_ids_sequential replaced with _normalize_ids that silently renumbers (LLM duplicate/gap IDs no longer crash) - sensitivity_disposition.py:149 -- prompt relaxed from "must be sequential" to "cosmetic" Replace workflow observability: - llm_replace_workflow.py:20-21 -- logger + _RAW_REPLACEMENT_MAP_COL for debugging - llm_replace_workflow.py:97 -- stash raw replacement map before filtering - llm_replace_workflow.py:150-154 -- warn on unexpected replacement map type - llm_replace_workflow.py:174-188 -- debug/warn logging for filter results per record Adapter edge case: - adapter.py:85 -- min(preview_num_records, len(input)) prevents DD seed wrap-around Tests: 333 pass, 0 lint errors, e2e verified (10/10 records, 0 failures) - test_schemas.py -- 6 new: from_raw JSON strings, disposition ID renumbering - test_workflow_utils.py -- 8 new: derive_seed_columns + select_seed_cols - test_evaluate.py -- 2 new: _normalize_answer_items coverage - test_parsers.py -- 2 new: numpy array payloads - test_repair.py -- 1 new: numpy array replacement map - test_rewrite_workflow.py -- 3 new: _has_entities, updated mocks for merged workflow calls - test_rewrite_generation.py -- simplified to column-factory tests only (run() tests removed) Made-with: Cursor

lipikaramaswamy requested a review from a team as a code owner March 20, 2026 06:52

lipikaramaswamy changed the title ~~feat:~~ feat: add evaluate and repair workflows Mar 20, 2026

lipikaramaswamy added 2 commits March 20, 2026 00:25

feat: introduce evaluate and repair workflows

b99b5bd

update vars after merging main, add warning

784b653

lipikaramaswamy force-pushed the lipikaramaswamy/feat/rewrite-engine-eval-repair branch from c6a904f to 784b653 Compare March 20, 2026 07:34

asteier2026 reviewed Mar 20, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/evaluate.py

asteier2026 reviewed Mar 20, 2026

View reviewed changes

asteier2026 approved these changes Mar 20, 2026

View reviewed changes

lipikaramaswamy mentioned this pull request Mar 23, 2026

feat: add final judge and RewriteWorkflow orchestrator #64

Merged

10 tasks

andreatgretel requested changes Mar 23, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated

Comment thread src/anonymizer/engine/rewrite/evaluate.py

Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated

Comment thread tests/engine/test_repair.py

Comment thread src/anonymizer/engine/rewrite/repair.py Outdated

lipikaramaswamy requested a review from andreatgretel March 24, 2026 16:37

mvansegbroeck requested a review from johnnygreco March 24, 2026 17:20

andreatgretel reviewed Mar 24, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated

andreatgretel reviewed Mar 24, 2026

View reviewed changes

Comment thread src/anonymizer/engine/schemas/rewrite.py

andreatgretel reviewed Mar 24, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/repair.py Outdated

andreatgretel reviewed Mar 24, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/evaluate.py Outdated

fix: reject duplicate/extra answer IDs, complete required_columns, sa…

22ad899

…fe replace order

lipikaramaswamy requested a review from andreatgretel March 24, 2026 20:15

andreatgretel approved these changes Mar 24, 2026

View reviewed changes

lipikaramaswamy added 3 commits March 24, 2026 17:44

use re.sub for replacmeents in prompts

b84d652

fix: (looking forward to wiring) make custom cols output model_dump()…

8467bac

…, use a new rewritten text colname in repair workflow

add tests for parsers

551f3b7

lipikaramaswamy mentioned this pull request Mar 25, 2026

refactor: migrate prompt .replace() chains to render_template() #66

Closed

move constants smh:

8547985

lipikaramaswamy merged commit 4e519f9 into main Mar 25, 2026
5 checks passed

lipikaramaswamy deleted the lipikaramaswamy/feat/rewrite-engine-eval-repair branch March 25, 2026 16:14

Conversation

lipikaramaswamy commented Mar 20, 2026

Summary

Type of Change

Testing

Related Issues

Uh oh!

Uh oh!

asteier2026 left a comment

Choose a reason for hiding this comment

Uh oh!

andreatgretel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel commented Mar 23, 2026

Uh oh!

lipikaramaswamy commented Mar 24, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel commented Mar 24, 2026

Uh oh!

lipikaramaswamy commented Mar 24, 2026

Uh oh!

andreatgretel left a comment

Choose a reason for hiding this comment

Uh oh!

lipikaramaswamy commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lipikaramaswamy commented Mar 25, 2026 •

edited

Loading