feat: add QA generation workflow for rewrite pipeline by lipikaramaswamy · Pull Request #48 · NVIDIA-NeMo/Anonymizer

lipikaramaswamy · 2026-03-16T23:05:41Z

Summary

Implements QAGenerationWorkflow for the rewrite pipeline. This stage runs after sensitivity disposition and produces the QA artifacts used downstream to evaluate rewrite quality and privacy leakage.

Adds src/anonymizer/engine/rewrite/qa_generation.py
QAGenerationWorkflow.columns()
_format_disposition_block() for prompt-safe disposition serialization
_serialize_meaning_units() for quality-QA prompt injection
generate_privacy_qa_from_disposition() and _generate_privacy_qa_column() for template-based privacy QA
Meaning unit extraction runs through LLMStructuredColumnConfig using the meaning_extractor alias
Quality QA generation runs through LLMStructuredColumnConfig using the qa_generator alias
Privacy QA is generated without an LLM call from protected entities in SensitivityDispositionSchema
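The template-based privacy QA step can be sketched roughly as below. This is an illustration only, not the code in qa_generation.py: the `Entity` dataclass, its `protected` flag, and the function name `privacy_questions` are hypothetical stand-ins. Only the question wording is taken from the example quoted in the review comments on this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for one entry of the sensitivity disposition."""

    label: str       # e.g. "first_name"
    value: str       # raw entity value, e.g. "Alice"
    protected: bool  # assumed flag: True if the entity must not leak


def privacy_questions(entities: list[Entity]) -> list[str]:
    """Build one leakage question per protected entity, with no LLM call."""
    return [
        f"Can the {e.label} '{e.value}' be deduced from the rewritten text?"
        for e in entities
        if e.protected
    ]


questions = privacy_questions(
    [
        Entity("first_name", "Alice", protected=True),
        Entity("city", "Paris", protected=False),
    ]
)
```

Because the questions come from a fixed template, the output is deterministic for a given disposition, which is presumably why no model alias is needed for this column.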

Design Decisions

Structured cross-column payloads are normalized at the custom-column boundary with model_validate(...) rather than assuming live schema instances. Prompt sections use the same XML-style layout introduced in #45.
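A minimal sketch of the normalization idea, assuming a value crossing a column boundary may arrive either as a schema instance or as a plain dict. The PR relies on Pydantic's `model_validate(...)` for this. The dataclass and the `normalize_unit` helper below are dependency-free stand-ins, and the `text` field is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class MeaningUnit:
    """Hypothetical schema; the real fields are not shown on this page."""

    text: str


def normalize_unit(payload: Any) -> MeaningUnit:
    """Return a MeaningUnit whether the payload is an instance or a mapping."""
    if isinstance(payload, MeaningUnit):
        return payload  # already a live instance, pass it through
    if isinstance(payload, Mapping):
        return MeaningUnit(text=str(payload["text"]))
    raise TypeError(f"cannot normalize {type(payload).__name__}")


unit = normalize_unit({"text": "The patient was discharged."})
```

Normalizing once at the boundary means the rest of the column function can rely on attribute access instead of branching on the payload type.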

Dependencies

Depends on #45 for domain classification and sensitivity disposition.

Testing

Tests pass locally and cover column ordering, model alias wiring, disposition/meaning-unit serialization, privacy-QA generation, and prompt column references.

Related Issues

Closes #32

Implements QaGenerationWorkflow with 5-column pipeline:

1. Format sensitivity disposition → JSON entity protection block
2. LLM meaning unit extraction (PII-safe semantic units)
3. Serialize meaning units → JSON string
4. LLM quality QA generation from meaning units
5. Template-based privacy QA generation (no LLM)

Moves generate_privacy_qa_from_disposition from schemas/rewrite.py into qa_generation.py where it belongs as business logic. Prompt uses category-based PII rules (closest to research) with XML section headers and original text as input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
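Steps 1 and 3 of the pipeline are plain serialization. A rough sketch, assuming the disposition and meaning units are available as lists of dicts. The key names (`entity_label`, `entity_value`, `protect`, `text`) and both function names are illustrative and may not match qa_generation.py.

```python
import json


def format_disposition_block(disposition: list[dict]) -> str:
    """Step 1: keep only the fields the prompt needs and emit JSON."""
    block = [
        {
            "label": d["entity_label"],
            "value": d["entity_value"],
            "protect": d["protect"],
        }
        for d in disposition
    ]
    return json.dumps(block, ensure_ascii=False, indent=2)


def serialize_meaning_units(units: list[dict]) -> str:
    """Step 3: turn extracted meaning units into a JSON string for the QA prompt."""
    return json.dumps([u["text"] for u in units], ensure_ascii=False)


block_json = format_disposition_block(
    [{"entity_label": "first_name", "entity_value": "Alice", "protect": True}]
)
units_json = serialize_meaning_units(
    [{"text": "The patient was discharged."}, {"text": "Follow-up is due in two weeks."}]
)
```

Emitting strings here keeps the downstream prompt templates simple: each prompt references a single pre-rendered column rather than iterating over structured objects.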

andreatgretel · 2026-03-17T14:19:59Z

src/anonymizer/engine/schemas/rewrite.py:167
sensitivity_disposition: list[EntityDispositionSchema] = Field(min_length=1)

worth a docstring note that this field enforces min_length=1 by design - the expectation is that the orchestrator short-circuits before this step if no entities were detected. it's a non-obvious contract; a future caller constructing this schema directly (e.g. in tests) will get a confusing ValidationError on an empty list with no hint as to why
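The contract described here can be sketched as follows. It is a dependency-free approximation: the real schema uses Pydantic's `Field(min_length=1)`, whereas this stand-in uses a dataclass check, and `Disposition` and `run_stage` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Disposition:
    """Stand-in for a schema whose entity list must be non-empty."""

    entities: tuple[str, ...]

    def __post_init__(self) -> None:
        if not self.entities:
            raise ValueError(
                "sensitivity_disposition must contain at least one entity; "
                "callers are expected to skip this stage when none were detected"
            )


def run_stage(detected: list[str]) -> Optional[Disposition]:
    """Hypothetical orchestrator step: short-circuit before building the schema."""
    if not detected:
        return None  # nothing to protect, so the schema is never constructed
    return Disposition(entities=tuple(detected))
```

A test that builds the schema directly with an empty list hits the error, while a caller going through the orchestrator path never does, which is the non-obvious part a docstring would spell out.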

andreatgretel · 2026-03-17T14:20:09Z

one thing to keep in mind when you get to the artifact storage layer: trace_dataframe contains _privacy_qa, and each question in there embeds the raw entity value verbatim (e.g. "Can the first_name 'Alice' be deduced from the rewritten text?"). so if trace_dataframe gets written to artifact_path as parquet/csv, those files would contain PII in the question strings. probably worth a scrub step or a separate privacy-safe artifact before persisting.
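A scrub step of the kind suggested here could look roughly like this. It is a naive sketch using plain substring replacement. The function name, the `[REDACTED]` placeholder, and the assumption that the raw entity values are available as a list are all hypothetical.

```python
def scrub_questions(
    questions: list[str],
    entity_values: list[str],
    placeholder: str = "[REDACTED]",
) -> list[str]:
    """Replace each known raw entity value in the question strings."""
    # Replace longer values first so "Alice Smith" is not left as "[REDACTED] Smith".
    ordered = sorted((v for v in entity_values if v), key=len, reverse=True)
    scrubbed = []
    for question in questions:
        for value in ordered:
            question = question.replace(value, placeholder)
        scrubbed.append(question)
    return scrubbed


safe = scrub_questions(
    ["Can the first_name 'Alice' be deduced from the rewritten text?"],
    ["Alice"],
)
```

Substring replacement can over-redact when a short entity value also appears inside an unrelated word, so a real implementation would likely need something more careful.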

asteier2026

The meaning unit prompt was changed too dramatically from what it was in GitLab. It was a lot of work to tweak it to do the right thing. If we want to do something different I think it should be post P0 as it will require exhaustive testing.

lipikaramaswamy · 2026-03-18T07:33:49Z

> one thing to keep in mind when you get to the artifact storage layer: trace_dataframe contains _privacy_qa, and each question in there embeds the raw entity value verbatim (e.g. "Can the first_name 'Alice' be deduced from the rewritten text?"). so if trace_dataframe gets written to artifact_path as parquet/csv, those files would contain PII in the question strings. probably worth a scrub step or a separate privacy-safe artifact before persisting.

The trace dataframe already contains the raw text, tagged text, entity values, and sensitivity disposition — all of which carry the same PII. The privacy QA question strings don't introduce new exposure. When we build an artifact storage layer, we'll probably need to handle the entire trace dataframe as sensitive, not just this column. But that's probably a service consideration, Anonymizer is just a library for now...

andreatgretel

re-reviewed the latest diff and the earlier comments look addressed. focused checks passed too. approving.

lipikaramaswamy requested a review from a team as a code owner March 16, 2026 23:05

lipikaramaswamy added 3 commits March 16, 2026 16:12

fix: lint fix

b1da1b1

refactor: derive domain key from schema and add _jinja key access

15575e9

fix: normalize QA generation inputs and support keyed jinja access

35c0a63

andreatgretel reviewed Mar 17, 2026

Comment thread src/anonymizer/engine/rewrite/qa_generation.py Outdated

andreatgretel reviewed Mar 17, 2026

Comment thread src/anonymizer/engine/constants.py

asteier2026 reviewed Mar 17, 2026

Comment thread src/anonymizer/engine/rewrite/qa_generation.py Outdated

asteier2026 reviewed Mar 17, 2026

Comment thread src/anonymizer/engine/rewrite/qa_generation.py

asteier2026 reviewed Mar 17, 2026

lipikaramaswamy mentioned this pull request Mar 17, 2026

feat: domain classification and sensitivity disposition workflows #45

Merged

1 task

lipikaramaswamy added 2 commits March 18, 2026 00:19

PR fb

a38be55

Merge branch 'main' into lipikaramaswamy/feat/rewrite-engine-qa-gener…

363d4d9

…ation

update SensitivityDispositionSchema docstring

c53377a

lipikaramaswamy requested review from andreatgretel and asteier2026 March 18, 2026 07:41

andreatgretel approved these changes Mar 18, 2026

asteier2026 approved these changes Mar 18, 2026

lipikaramaswamy merged commit ad7dfa7 into main Mar 18, 2026
5 checks passed

lipikaramaswamy deleted the lipikaramaswamy/feat/rewrite-engine-qa-generation branch March 18, 2026 23:34

lipikaramaswamy merged 7 commits into main from lipikaramaswamy/feat/rewrite-engine-qa-generation

3 participants
