feat: add QA generation workflow for rewrite pipeline #48

Merged
lipikaramaswamy merged 7 commits into main from lipikaramaswamy/feat/rewrite-engine-qa-generation
Mar 18, 2026

Conversation

@lipikaramaswamy
Collaborator

@lipikaramaswamy lipikaramaswamy commented Mar 16, 2026

Summary

Implements QAGenerationWorkflow for the rewrite pipeline. This stage runs after sensitivity disposition and produces the QA artifacts used downstream to evaluate rewrite quality and privacy leakage.

  • Adds src/anonymizer/engine/rewrite/qa_generation.py
  • QAGenerationWorkflow.columns()
  • _format_disposition_block() for prompt-safe disposition serialization
  • _serialize_meaning_units() for quality-QA prompt injection
  • generate_privacy_qa_from_disposition() and _generate_privacy_qa_column() for template-based privacy QA
  • Meaning unit extraction runs through LLMStructuredColumnConfig using the meaning_extractor alias
  • Quality QA generation runs through LLMStructuredColumnConfig using the qa_generator alias
  • Privacy QA is generated without an LLM call from protected entities in SensitivityDispositionSchema
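
The template-based privacy QA step can be illustrated with a minimal sketch. This is not the code in qa_generation.py: the `EntityDisposition` dataclass, its field names, and `privacy_questions` are hypothetical stand-ins, and the question wording is modeled on the example quoted in the review discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityDisposition:
    """Hypothetical stand-in for one entry of the sensitivity disposition."""

    entity_label: str  # e.g. "first_name"
    entity_value: str  # e.g. "Alice"
    protect: bool      # True when the entity must not survive the rewrite


# Assumed template, modeled on the example question quoted in the review.
QUESTION_TEMPLATE = "Can the {label} '{value}' be deduced from the rewritten text?"


def privacy_questions(dispositions: list[EntityDisposition]) -> list[str]:
    """Build one question per protected entity, without any LLM call."""
    return [
        QUESTION_TEMPLATE.format(label=d.entity_label, value=d.entity_value)
        for d in dispositions
        if d.protect
    ]


dispositions = [
    EntityDisposition("first_name", "Alice", protect=True),
    EntityDisposition("city", "Paris", protect=False),
]
questions = privacy_questions(dispositions)
```

Because the questions are a pure function of the disposition, this step is deterministic and cheap to unit-test.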

Design Decisions

Structured cross-column payloads are normalized at the custom-column boundary with model_validate(...) rather than assuming live schema instances. Prompt sections use the same XML-style layout introduced in #45.
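
A minimal sketch of the normalization idea, assuming Pydantic v2. The `MeaningUnit` and `MeaningUnits` models and the `normalize_units` helper are hypothetical; the real schemas and field names are not shown on this page.

```python
from pydantic import BaseModel


class MeaningUnit(BaseModel):
    """Hypothetical schema for one meaning unit."""

    text: str


class MeaningUnits(BaseModel):
    """Hypothetical container passed between columns."""

    units: list[MeaningUnit]


def normalize_units(payload: object) -> MeaningUnits:
    """Accept a plain dict or an existing instance and return a validated model."""
    return MeaningUnits.model_validate(payload)


# A payload that crossed a column boundary may arrive as a plain dict...
from_dict = normalize_units({"units": [{"text": "The patient was discharged."}]})
# ...or as a live schema instance; both paths yield the same validated model.
from_instance = normalize_units(from_dict)
```

Validating at the boundary means downstream code can rely on attribute access regardless of how the upstream column serialized its output.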

Dependencies

Depends on #45 for domain classification and sensitivity disposition.

Testing

Tests pass locally and cover column ordering, model alias wiring, disposition/meaning-unit serialization, privacy-QA generation, and prompt column references.

Related Issues

Closes #32

Implements QaGenerationWorkflow with a 5-column pipeline:
1. Format sensitivity disposition → JSON entity protection block
2. LLM meaning unit extraction (PII-safe semantic units)
3. Serialize meaning units → JSON string
4. LLM quality QA generation from meaning units
5. Template-based privacy QA generation (no LLM)
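
Steps 1 and 3 above are plain serialization and can be sketched as follows. The function names, the dict keys, and the `<entity_protection>` tag are assumptions made for illustration; the actual helpers are `_format_disposition_block()` and `_serialize_meaning_units()`, whose bodies are not shown here.

```python
import json


def format_disposition_block(dispositions: list[dict]) -> str:
    """Serialize the disposition as JSON inside an XML-style prompt section."""
    body = json.dumps(dispositions, indent=2, ensure_ascii=False)
    # The tag name is hypothetical; only the XML-style layout comes from the PR.
    return f"<entity_protection>\n{body}\n</entity_protection>"


def serialize_meaning_units(units: list[dict]) -> str:
    """Turn meaning units into a JSON string suitable for prompt injection."""
    return json.dumps(units, ensure_ascii=False)


dispositions = [
    {"entity_label": "first_name", "entity_value": "Alice", "protect": True}
]
units = [{"text": "A patient was discharged after treatment."}]

block = format_disposition_block(dispositions)
units_json = serialize_meaning_units(units)
```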

Moves generate_privacy_qa_from_disposition from schemas/rewrite.py
into qa_generation.py where it belongs as business logic.

Prompt uses category-based PII rules (closest to research) with
XML section headers and original text as input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner March 16, 2026 23:05
Comment thread src/anonymizer/engine/rewrite/qa_generation.py Outdated
Comment thread src/anonymizer/engine/constants.py
@andreatgretel
Collaborator

src/anonymizer/engine/schemas/rewrite.py:167

sensitivity_disposition: list[EntityDispositionSchema] = Field(min_length=1)

worth a docstring note that this field enforces min_length=1 by design - the expectation is that the orchestrator short-circuits before this step if no entities were detected. it's a non-obvious contract; a future caller constructing this schema directly (e.g. in tests) will get a confusing ValidationError on an empty list with no hint as to why
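
A sketch of what such a docstring note could look like, assuming Pydantic v2. The containing class name `DispositionSketch` and the `entity_value` field are hypothetical; only the `sensitivity_disposition` field line is taken from the diff.

```python
from pydantic import BaseModel, Field, ValidationError


class EntityDispositionSchema(BaseModel):
    """Minimal hypothetical fields; the real schema has more."""

    entity_value: str


class DispositionSketch(BaseModel):
    """Disposition for one record.

    `sensitivity_disposition` requires at least one entry by design. The
    orchestrator is expected to short-circuit before this step when no
    entities were detected, so an empty list indicates a caller error.
    """

    sensitivity_disposition: list[EntityDispositionSchema] = Field(min_length=1)


# Constructing the schema directly with an empty list is rejected.
try:
    DispositionSketch(sensitivity_disposition=[])
    rejected = False
except ValidationError:
    rejected = True
```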

@andreatgretel
Collaborator

one thing to keep in mind when you get to the artifact storage layer: trace_dataframe contains _privacy_qa, and each question in there embeds the raw entity value verbatim (e.g. "Can the first_name 'Alice' be deduced from the rewritten text?"). so if trace_dataframe gets written to artifact_path as parquet/csv, those files would contain PII in the question strings. probably worth a scrub step or a separate privacy-safe artifact before persisting.
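
One possible shape for such a scrub step, as a rough sketch only. The helper name `scrub_questions` and the `[REDACTED]` placeholder are hypothetical, and plain substring replacement will not catch misspellings or inflected variants of an entity value.

```python
def scrub_questions(
    questions: list[str],
    entity_values: list[str],
    placeholder: str = "[REDACTED]",
) -> list[str]:
    """Replace every known entity value in each question with a placeholder."""
    # Replace longer values first so "Alice Smith" is handled before "Alice".
    ordered = sorted({v for v in entity_values if v}, key=len, reverse=True)
    scrubbed = []
    for question in questions:
        for value in ordered:
            question = question.replace(value, placeholder)
        scrubbed.append(question)
    return scrubbed


raw = ["Can the first_name 'Alice' be deduced from the rewritten text?"]
safe = scrub_questions(raw, ["Alice"])
```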

Comment thread src/anonymizer/engine/rewrite/qa_generation.py Outdated
Comment thread src/anonymizer/engine/rewrite/qa_generation.py
Contributor

@asteier2026 asteier2026 left a comment

The meaning unit prompt was changed too dramatically from what it was in GitLab. It was a lot of work to tweak it to do the right thing. If we want to do something different, I think it should be post-P0, as it will require exhaustive testing.

@lipikaramaswamy
Collaborator Author

> one thing to keep in mind when you get to the artifact storage layer: trace_dataframe contains _privacy_qa, and each question in there embeds the raw entity value verbatim (e.g. "Can the first_name 'Alice' be deduced from the rewritten text?"). so if trace_dataframe gets written to artifact_path as parquet/csv, those files would contain PII in the question strings. probably worth a scrub step or a separate privacy-safe artifact before persisting.

The trace dataframe already contains the raw text, tagged text, entity values, and sensitivity disposition, all of which carry the same PII. The privacy QA question strings don't introduce new exposure. When we build an artifact storage layer, we'll probably need to handle the entire trace dataframe as sensitive, not just this column. But that's probably a service consideration; Anonymizer is just a library for now.

Collaborator

@andreatgretel andreatgretel left a comment

re-reviewed the latest diff and the earlier comments look addressed. focused checks passed too. approving.

@lipikaramaswamy lipikaramaswamy merged commit ad7dfa7 into main Mar 18, 2026
5 checks passed
@lipikaramaswamy lipikaramaswamy deleted the lipikaramaswamy/feat/rewrite-engine-qa-generation branch March 18, 2026 23:34

Development

Successfully merging this pull request may close these issues.

feat(rewrite): engine — QA generation

3 participants