feat: add QA generation workflow for rewrite pipeline #48
Implements QaGenerationWorkflow with 5-column pipeline:
1. Format sensitivity disposition → JSON entity protection block
2. LLM meaning unit extraction (PII-safe semantic units)
3. Serialize meaning units → JSON string
4. LLM quality QA generation from meaning units
5. Template-based privacy QA generation (no LLM)

Moves generate_privacy_qa_from_disposition from schemas/rewrite.py into qa_generation.py where it belongs as business logic. Prompt uses category-based PII rules (closest to research) with XML section headers and original text as input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
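Steps 3 and 5 of the pipeline above involve no LLM call, so they can be illustrated in isolation. This is a minimal sketch, not the code from this PR: the dict shapes, the `protect` flag, the question template, and both function names are assumptions made for illustration only.

```python
import json


def serialize_meaning_units(units):
    """Step 3 (sketch): dump meaning-unit dicts to a JSON string for prompt injection."""
    return json.dumps(units, ensure_ascii=False, indent=2)


def privacy_qa_from_disposition(disposition):
    """Step 5 (sketch): build one templated question per protected entity, with no LLM."""
    qa_items = []
    for entity in disposition:
        # Hypothetical flag: entities that stay in the clear need no leakage check.
        if not entity.get("protect", False):
            continue
        qa_items.append(
            {
                "question": (
                    f"Can the {entity['category']} '{entity['value']}' "
                    "be recovered from the rewritten text?"
                ),
                "expected_answer": "no",
            }
        )
    return qa_items


# Hypothetical inputs; the real schemas are defined in the PR, not here.
units = [{"id": 1, "text": "The patient was discharged after two days."}]
disposition = [
    {"category": "name", "value": "Jane Doe", "protect": True},
    {"category": "city", "value": "Springfield", "protect": False},
]

print(serialize_meaning_units(units))
print(privacy_qa_from_disposition(disposition))
```

Because the templated questions embed the entity values, the resulting strings carry the same PII as the disposition they were built from.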
worth a docstring note that this field enforces
one thing to keep in mind when you get to the artifact storage layer:
asteier2026
left a comment
The meaning unit prompt was changed too dramatically from what it was in GitLab. It took a lot of work to tweak it to do the right thing. If we want to do something different, I think it should be post-P0, as it will require exhaustive testing.
The trace dataframe already contains the raw text, tagged text, entity values, and sensitivity disposition, all of which carry the same PII. The privacy QA question strings don't introduce new exposure. When we build an artifact storage layer, we'll probably need to handle the entire trace dataframe as sensitive, not just this column. But that's probably a service consideration; Anonymizer is just a library for now...
andreatgretel
left a comment
re-reviewed the latest diff and the earlier comments look addressed. focused checks passed too. approving.
Summary
Implements QAGenerationWorkflow for the rewrite pipeline. This stage runs after sensitivity disposition and produces the QA artifacts used downstream to evaluate rewrite quality and privacy leakage.
src/anonymizer/engine/rewrite/qa_generation.py
QAGenerationWorkflow.columns()
_format_disposition_block() for prompt-safe disposition serialization
_serialize_meaning_units() for quality-QA prompt injection
generate_privacy_qa_from_disposition() and _generate_privacy_qa_column() for template-based privacy QA
LLMStructuredColumnConfig using the meaning_extractor alias
LLMStructuredColumnConfig using the qa_generator alias
SensitivityDispositionSchema
Design Decisions
Structured cross-column payloads are normalized at the custom-column boundary with model_validate(...) rather than assuming live schema instances. Prompt sections use the same XML-style layout introduced in #45.
Dependencies
Depends on #45 for domain classification and sensitivity disposition.
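The boundary normalization described under Design Decisions can be sketched as follows. This is a stdlib-only stand-in, not the PR's code: `MeaningUnit`, its `validate` classmethod, and `custom_column` are hypothetical names, and `validate` only mimics the way pydantic's `model_validate` accepts either a dict or an existing instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeaningUnit:
    """Hypothetical stand-in for a structured schema; not the PR's actual model."""

    id: int
    text: str

    @classmethod
    def validate(cls, payload):
        """Return an instance whether the payload is already one or a plain dict."""
        if isinstance(payload, cls):
            return payload
        if not isinstance(payload, dict):
            raise TypeError(
                f"cannot build {cls.__name__} from {type(payload).__name__}"
            )
        return cls(id=int(payload["id"]), text=str(payload["text"]))


def custom_column(raw_cell):
    """Normalize at the boundary, then work with a typed object."""
    unit = MeaningUnit.validate(raw_cell)
    return unit.text.upper()


# The same column function handles a deserialized dict and a live instance.
print(custom_column({"id": 1, "text": "hello"}))
print(custom_column(MeaningUnit(id=2, text="hi")))
```

Validating on entry means a column function does not need to know whether the upstream cell survived as a schema object or was flattened to a dict along the way.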
Testing
Tests pass locally and cover column ordering, model alias wiring, disposition/meaning-unit serialization, privacy-QA generation, and prompt column references.
Related Issues
Closes #32