feat: add rewrite generation workflow #49
asteier2026
left a comment
The wording is not all that different from the GitLab prompt. But recall that in GitLab, even though we said to use the Replacement Map only for entities tagged with Replace and to generalize entities tagged with generalize, the Replacement Map was often still used for entities tagged with generalize. Remember the plan to feed into the rewrite prompt only the portion of the Replacement Map that contains direct identifiers.
@asteier2026 We already handle it in this PR: |
andreatgretel
left a comment
Approving - non-blocking stuff to track:
- The orchestrator sets `compute_grouped_entities=True` only when `config.replace is not None`. This workflow needs `COL_ENTITIES_BY_VALUE`, so when rewrite gets wired in, that condition needs to cover `config.rewrite` too or you'll get a `KeyError`. Worth a TODO somewhere.
- This is the third workflow doing split-on-has-entities / reorder / recombine, and the row-order column name already diverged (`_anonymizer_row_order` vs `_row_order`). Might be time to pull that into a shared helper before the next one copies it again.
- `_has_entities` swallows all parse errors silently (`except Exception: return False`) - a `logger.debug` would help when rows unexpectedly skip rewriting. Also does full schema validation just to check emptiness when a dict `.get()` would do.
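For what it's worth, the shared-helper and `_has_entities` points could be handled together. The following is only a rough sketch under stated assumptions: rows are modeled as plain dicts instead of DataFrame rows, the entities payload is assumed to be a dict (or JSON string) with an `entities` key, and `ROW_ORDER_KEY`, `has_entities`, `split_on_entities`, and `recombine` are hypothetical names, not code from this PR.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Hypothetical single shared name, so workflows stop diverging on it.
ROW_ORDER_KEY = "_row_order"

Row = dict[str, Any]


def has_entities(raw: Any) -> bool:
    """Return True when the payload carries at least one entity.

    Uses a plain dict lookup instead of full schema validation, and logs
    parse failures at debug level instead of swallowing them silently.
    """
    try:
        payload = json.loads(raw) if isinstance(raw, str) else raw
        return bool(payload.get("entities"))
    except (AttributeError, TypeError, ValueError) as exc:
        logger.debug("Unparseable entities payload, row skips rewriting: %r", exc)
        return False


def split_on_entities(rows: list[Row], entities_key: str) -> tuple[list[Row], list[Row]]:
    """Tag each row with its original position, then split on has_entities."""
    with_entities: list[Row] = []
    passthrough: list[Row] = []
    for position, row in enumerate(rows):
        tagged = {**row, ROW_ORDER_KEY: position}
        if has_entities(row.get(entities_key)):
            with_entities.append(tagged)
        else:
            passthrough.append(tagged)
    return with_entities, passthrough


def recombine(processed: list[Row], passthrough: list[Row]) -> list[Row]:
    """Merge both halves back into the original order and drop the helper key."""
    ordered = sorted(processed + passthrough, key=lambda row: row[ROW_ORDER_KEY])
    return [{k: v for k, v in row.items() if k != ROW_ORDER_KEY} for row in ordered]
```

A workflow would call `split_on_entities`, process only the first list, and hand both lists to `recombine`, so the ordering logic lives in one place.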
Summary
Implements `RewriteGenerationWorkflow` for the rewrite pipeline. This stage runs after sensitivity disposition and handles records with detected entities by generating replacement maps, preparing rewrite prompt inputs, invoking the `rewriter` model, and extracting rewritten text for downstream evaluation.
- Calls `LlmReplaceWorkflow.generate_map_only()` to produce `COL_REPLACEMENT_MAP`
- Builds `COL_REWRITE_DISPOSITION_BLOCK` from `COL_SENSITIVITY_DISPOSITION`
- Restricts `COL_REPLACEMENT_MAP_FOR_PROMPT` to `replace`-method entities only
- Generates `COL_FULL_REWRITE` through `LLMStructuredColumnConfig` and extracts `COL_REWRITTEN_TEXT`
- Preserves `df.attrs`, failed-record accumulation, and original row order
Design Decisions
Structured cross-column payloads are normalized at the custom-column boundary rather than assuming live schema instances. The rewrite prompt uses the disposition block directly instead of a generic entity-context layer.
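To illustrate the boundary normalization together with the replacement-map filtering, here is a minimal sketch. It is not the code in this PR: the payload shapes (a disposition given as a list of entries with `value` and `method` keys, and a replacement map given as an original-to-replacement dict) are assumptions, and `to_plain` and `filter_replacement_map` are hypothetical names.

```python
import json
from typing import Any


def to_plain(payload: Any) -> Any:
    """Normalize a cross-column payload to plain Python data.

    Accepts a schema instance exposing model_dump() (as Pydantic v2 models
    do), a JSON string, or data that is already plain.
    """
    if hasattr(payload, "model_dump"):
        return payload.model_dump()
    if isinstance(payload, str):
        return json.loads(payload)
    return payload


def filter_replacement_map(replacement_map: Any, disposition: Any) -> dict[str, str]:
    """Keep only map entries whose entity is dispositioned as 'replace'.

    Assumed shapes (hypothetical): disposition is a list of
    {"value": ..., "method": ...} entries; the map is {original: replacement}.
    """
    mapping = to_plain(replacement_map) or {}
    entries = to_plain(disposition) or []
    replace_values = {
        entry["value"] for entry in entries if entry.get("method") == "replace"
    }
    return {orig: repl for orig, repl in mapping.items() if orig in replace_values}
```

With this shape, entities marked for generalization never reach the prompt's replacement map, so the model has nothing to substitute for them even if it ignores the instructions.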
Testing
Tests pass locally and cover passthrough behavior, replacement-map filtering, prompt construction, schema/dict output extraction, failed-record propagation, attribute propagation, and mixed-row order preservation.
Related Issues
Closes #33