feat: add rewrite generation workflow #49

Merged
lipikaramaswamy merged 8 commits into main from lipikaramaswamy/feat/rewrite-engine-rewrite-generation on Mar 19, 2026

@lipikaramaswamy
Collaborator

@lipikaramaswamy lipikaramaswamy commented Mar 17, 2026

Summary

Implements RewriteGenerationWorkflow for the rewrite pipeline. This stage runs after sensitivity disposition and handles records with detected entities by generating replacement maps, preparing rewrite prompt inputs, invoking the rewriter model, and extracting rewritten text for downstream evaluation.

  • Reuses LlmReplaceWorkflow.generate_map_only() to produce COL_REPLACEMENT_MAP
  • Builds COL_REWRITE_DISPOSITION_BLOCK from COL_SENSITIVITY_DISPOSITION
  • Filters COL_REPLACEMENT_MAP_FOR_PROMPT to replace-method entities only
  • Produces COL_FULL_REWRITE through LLMStructuredColumnConfig and extracts COL_REWRITTEN_TEXT
  • Preserves passthrough behavior, df.attrs, failed-record accumulation, and original row order

Design Decisions

Structured cross-column payloads are normalized at the custom-column boundary rather than assuming live schema instances. The rewrite prompt uses the disposition block directly instead of a generic entity-context layer.
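
As an illustration of normalizing at the custom-column boundary, here is a minimal sketch. The helper name `normalize_payload` and the set of accepted input shapes (dict, JSON string, object exposing `model_dump()`, dataclass instance) are assumptions made for this example, not code from the PR.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any


def normalize_payload(value: Any) -> dict:
    """Coerce a cross-column payload into a plain dict (hypothetical helper)."""
    if value is None:
        return {}
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        # Payloads may arrive serialized; unparsable text becomes an empty dict.
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    if hasattr(value, "model_dump"):
        # Assumption: schema objects expose a pydantic-style model_dump().
        return value.model_dump()
    if is_dataclass(value) and not isinstance(value, type):
        return asdict(value)
    return {}


@dataclass
class ExampleDisposition:
    """Stand-in for a live schema instance; not a real class from the repo."""
    protection_method_suggestion: str
```

With this shape, downstream code can call `.get()` on the result without caring whether the column held a schema instance, a dict, or serialized JSON.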

Testing

Tests pass locally and cover passthrough behavior, replacement-map filtering, prompt construction, schema/dict output extraction, failed-record propagation, attribute propagation, and mixed-row order preservation.

Related Issues

Closes #33

@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner March 17, 2026 02:45
@lipikaramaswamy lipikaramaswamy changed the title fix: preserve row order and harden rewrite generation parsing feat: add rewrite generation workflow Mar 17, 2026
@lipikaramaswamy lipikaramaswamy changed the title feat: add rewrite generation workflow feat: add rewrite generation workflow Mar 17, 2026
Comment thread src/anonymizer/engine/constants.py Outdated
Comment thread src/anonymizer/engine/rewrite/rewrite_generation.py Outdated
Comment thread tests/engine/test_rewrite_generation.py
Comment thread src/anonymizer/engine/rewrite/rewrite_generation.py
Comment thread src/anonymizer/engine/rewrite/rewrite_generation.py Outdated
Contributor

@asteier2026 asteier2026 left a comment

The wording is not all that different from the GitLab prompt. Recall that in GitLab, even though we said to use the Replacement Map only for entities tagged Replace and to generalize entities tagged Generalize, the Replacement Map was still often used for entities tagged Generalize. Remember the plan to feed into the rewrite prompt only the portion of the Replacement Map that contains direct identifiers.

@lipikaramaswamy
Collaborator Author

@asteier2026 We already handle it in this PR: _filter_replacement_map_for_prompt filters COL_REPLACEMENT_MAP down to only entities where the disposition says protection_method_suggestion="replace", and the prompt references this filtered version (COL_REPLACEMENT_MAP_FOR_PROMPT) rather than the raw map. The <replacement_map> section is also wrapped in a Jinja conditional so it's omitted entirely when there are no "replace"-method entities, which means the LLM never sees replacement values for generalize/remove/paraphrase entities.
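
A rough sketch of the filtering described above, for readers following the thread. The data shapes are assumptions: the replacement map is modeled as a flat original-to-synthetic dict and the disposition as a list of dicts with a hypothetical `entity_value` key. Only `protection_method_suggestion` comes from the PR. The PR omits the section with a Jinja conditional; a plain Python check stands in for it here.

```python
from typing import Any


def filter_replacement_map_for_prompt(
    replacement_map: dict[str, str],
    disposition: list[dict[str, Any]],
) -> dict[str, str]:
    """Keep only replacements whose entity is marked for the "replace" method."""
    replace_values = {
        entry.get("entity_value")  # hypothetical key name
        for entry in disposition
        if entry.get("protection_method_suggestion") == "replace"
    }
    return {
        original: synthetic
        for original, synthetic in replacement_map.items()
        if original in replace_values
    }


def render_replacement_section(filtered_map: dict[str, str]) -> str:
    """Emit the <replacement_map> block, or nothing when the map is empty."""
    if not filtered_map:
        return ""
    body = "\n".join(f"{orig} -> {synth}" for orig, synth in filtered_map.items())
    return f"<replacement_map>\n{body}\n</replacement_map>"
```

Under these assumptions, an entity tagged `generalize` never reaches the prompt even if the raw map holds a synthetic value for it.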

Collaborator

@andreatgretel andreatgretel left a comment

Approving - non-blocking stuff to track:

  • The orchestrator sets compute_grouped_entities=True only when config.replace is not None. This workflow needs COL_ENTITIES_BY_VALUE, so when rewrite gets wired in that condition needs to cover config.rewrite too or you'll get a KeyError. Worth a TODO somewhere.
  • This is the third workflow doing split-on-has-entities / reorder / recombine, and the row-order column name already diverged (_anonymizer_row_order vs _row_order). Might be time to pull that into a shared helper before the next one copies it again.
  • _has_entities swallows all parse errors silently (except Exception: return False) - a logger.debug would help when rows unexpectedly skip rewriting. Also does full schema validation just to check emptiness when a dict .get() would do.
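
The second and third points could be addressed along these lines. This is only a sketch under stated assumptions: rows are modeled as plain dicts rather than DataFrame rows, the payload is assumed to carry an `entities` key, and the names `has_entities`, `process_with_passthrough`, and `ROW_ORDER_KEY` are hypothetical, not the repo's.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

# One shared name, so workflows cannot drift apart on it again.
ROW_ORDER_KEY = "_row_order"

Row = dict[str, Any]


def has_entities(raw: Any) -> bool:
    """Cheap emptiness check via .get(), with a debug log on parse failure."""
    try:
        payload = json.loads(raw) if isinstance(raw, str) else raw
        return bool(payload.get("entities"))  # "entities" key is an assumption
    except (AttributeError, TypeError, ValueError) as exc:
        logger.debug("Unparsable entities payload, row skips rewriting: %r", exc)
        return False


def process_with_passthrough(
    rows: list[Row],
    entities_key: str,
    transform: Callable[[list[Row]], list[Row]],
) -> list[Row]:
    """Split on has-entities, transform one side, recombine in original order.

    The transform must keep ROW_ORDER_KEY on every row it returns.
    """
    to_process: list[Row] = []
    passthrough: list[Row] = []
    for position, row in enumerate(rows):
        tagged = {**row, ROW_ORDER_KEY: position}
        target = to_process if has_entities(row.get(entities_key)) else passthrough
        target.append(tagged)
    processed = transform(to_process) if to_process else []
    combined = sorted(processed + passthrough, key=lambda r: r[ROW_ORDER_KEY])
    return [{k: v for k, v in r.items() if k != ROW_ORDER_KEY} for r in combined]
```

A DataFrame version would follow the same split, tag, sort, and drop steps, and would also need to carry `df.attrs` across the recombine.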

@lipikaramaswamy lipikaramaswamy merged commit 2b6c451 into main Mar 19, 2026
5 checks passed
@lipikaramaswamy lipikaramaswamy deleted the lipikaramaswamy/feat/rewrite-engine-rewrite-generation branch March 19, 2026 22:32

Development

Successfully merging this pull request may close these issues.

feat(rewrite): engine — rewrite generation

3 participants