feat: domain classification and sensitivity disposition workflows by lipikaramaswamy · Pull Request #45 · NVIDIA-NeMo/Anonymizer

lipikaramaswamy · 2026-03-15T19:56:51Z

Summary

Implements the first two rewrite pipeline steps as column factories, part of the
broader single-workflow rewrite architecture tracked in #30.

engine/rewrite/domain_classification.py — DomainClassificationWorkflow
- Classifies input text into one of 24 domains via LLMStructuredColumnConfig
- _enrich_domain custom column looks up per-domain guidance from DOMAIN_SUPPLEMENT_MAP
- _DOMAIN_LIST (for the prompt) and DOMAIN_SUPPLEMENT_MAP (for enrichment) are intentionally separate
engine/rewrite/sensitivity_disposition.py — SensitivityDispositionWorkflow
- Produces a structured per-entity protection plan against SensitivityDispositionSchema
engine/constants.py — added COL_DOMAIN_SUPPLEMENT
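
Because _DOMAIN_LIST and DOMAIN_SUPPLEMENT_MAP are kept separate, a small consistency check can flag drift between them. The sketch below assumes every prompt domain is meant to have supplement guidance, which the PR does not state. The three domain names, the guidance strings, and the find_domain_mismatches helper are hypothetical placeholders, not the project's actual 24 domains or test code.

```python
# Hypothetical placeholder data; the real module lists 24 domains.
_DOMAIN_LIST = ("healthcare", "finance", "legal")

DOMAIN_SUPPLEMENT_MAP = {
    "healthcare": "placeholder guidance for healthcare text",
    "finance": "placeholder guidance for finance text",
    "legal": "placeholder guidance for legal text",
}


def find_domain_mismatches(domain_list, supplement_map):
    """Report domains present in one structure but absent from the other."""
    listed = set(domain_list)
    mapped = set(supplement_map)
    return {
        # Domains the prompt offers but that have no guidance entry.
        "missing_supplement": sorted(listed - mapped),
        # Guidance entries the prompt never offers to the model.
        "unlisted_supplement": sorted(mapped - listed),
    }


mismatches = find_domain_mismatches(_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP)
```

In a test suite, asserting that both lists in the result are empty would catch a domain added to one structure but not the other.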

Design Decisions

Column factory pattern: workflows expose columns() -> list[ColumnConfigT] only.
No run() method — all steps will be collected and passed to a single NddAdapter.run_workflow() call in the top-level RewriteWorkflow (tracked in #30).
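
A minimal sketch of the column factory idea, using only the standard library. ColumnConfig, the two *Sketch classes, the column names, and collect_columns are hypothetical stand-ins rather than the project's ColumnConfigT or NddAdapter API. The point is only that each step contributes columns and execution happens once, elsewhere.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ColumnConfig:
    """Hypothetical stand-in for the project's ColumnConfigT."""

    name: str
    kind: str


class ColumnFactory(Protocol):
    """A workflow step exposes columns() and nothing else; it has no run()."""

    def columns(self) -> list[ColumnConfig]: ...


class DomainClassificationSketch:
    def columns(self) -> list[ColumnConfig]:
        return [
            ColumnConfig("domain", "llm_structured"),
            ColumnConfig("domain_supplement", "custom"),
        ]


class SensitivityDispositionSketch:
    def columns(self) -> list[ColumnConfig]:
        return [ColumnConfig("sensitivity_disposition", "llm_structured")]


def collect_columns(steps: list[ColumnFactory]) -> list[ColumnConfig]:
    """Flatten every step's columns, in step order, for one downstream run."""
    collected: list[ColumnConfig] = []
    for step in steps:
        collected.extend(step.columns())
    return collected


all_columns = collect_columns(
    [DomainClassificationSketch(), SensitivityDispositionSketch()]
)
```

Under this shape, a top-level workflow would own the only execution call, and steps stay cheap to test because they just return configuration.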

Prompt section headers: standardized to XML tags (<privacy_goal>, <input_tagged_text>, etc.),
since XML provides the clearest semantic structure across several model families (gpt-oss, claude, nemotron).

Data summary label: standardized to Dataset description: in prompts. Python param
stays data_summary to match AnonymizerConfig.
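
To illustrate the two prompt conventions above, here is a hedged sketch of a prompt builder. The function name, the section order, the placement of the Dataset description: line outside the XML sections, and the sample tagged text are all assumptions. Only the tag names, the label text, and the data_summary parameter name come from this PR.

```python
from typing import Optional


def build_disposition_prompt(
    privacy_goal: str,
    tagged_text: str,
    data_summary: Optional[str] = None,
) -> str:
    """Assemble a prompt whose sections are delimited by XML tags."""
    sections = []
    if data_summary:
        # The Python parameter keeps the name data_summary; only the label
        # shown to the model reads "Dataset description:".
        sections.append(f"Dataset description: {data_summary}")
    sections.append(f"<privacy_goal>\n{privacy_goal}\n</privacy_goal>")
    sections.append(f"<input_tagged_text>\n{tagged_text}\n</input_tagged_text>")
    return "\n\n".join(sections)


prompt = build_disposition_prompt(
    privacy_goal="Remove direct identifiers.",
    tagged_text="<PERSON>Ada</PERSON> visited the clinic.",  # hypothetical tagging format
    data_summary="Synthetic clinic visit notes",
)
```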

Trust DataDesigner output types: _enrich_domain accesses .domain directly on the
DomainClassificationSchema Pydantic object — no defensive dict/fallback handling since
LLMStructuredColumnConfig guarantees a valid schema instance.
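
The sketch below shows what "trusting the output type" could look like in practice. It uses a standard-library dataclass as a stand-in for the DomainClassificationSchema Pydantic model. The row shape, the column name values, and the guidance strings are hypothetical, and the real _enrich_domain signature is not shown in this PR description.

```python
from dataclasses import dataclass

COL_DOMAIN = "domain"  # hypothetical column name
COL_DOMAIN_SUPPLEMENT = "domain_supplement"  # hypothetical value for the constant

# Placeholder guidance; the real map covers the project's own domains.
DOMAIN_SUPPLEMENT_MAP = {
    "healthcare": "Treat diagnoses and provider names as sensitive.",
    "finance": "Treat account numbers and balances as sensitive.",
}


@dataclass(frozen=True)
class DomainClassification:
    """Stand-in for the DomainClassificationSchema Pydantic model."""

    domain: str


def enrich_domain(row: dict) -> dict:
    """Attach per-domain guidance to a row that already holds a classification."""
    classification = row[COL_DOMAIN]
    # Direct attribute access: no isinstance checks and no dict fallback,
    # because the upstream structured column is trusted to yield the schema type.
    supplement = DOMAIN_SUPPLEMENT_MAP[classification.domain]
    return {**row, COL_DOMAIN_SUPPLEMENT: supplement}


enriched = enrich_domain(
    {"text": "...", COL_DOMAIN: DomainClassification("finance")}
)
```

If the upstream guarantee were ever broken, this version would fail loudly with an AttributeError or KeyError instead of silently producing an empty supplement.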

Follow-ups

Align entity detection prompts (_get_validation_prompt, _get_augment_prompt, _get_latent_prompt) to XML section headers and Dataset description: label (TODO in
sensitivity_disposition.py)

Related Issues

Closes #31

andreatgretel · 2026-03-17T13:27:47Z

src/anonymizer/engine/schemas/__init__.py
__all__ = [
    ...
    "SensitivityLevel",
]

generate_privacy_qa_from_disposition is defined in schemas/rewrite.py and already tested, but it's missing from __all__ here - every other rewrite type is exported so this looks like an oversight. downstream orchestration code will need this function and shouldn't have to go through the submodule directly

andreatgretel

clean PR overall - the column-factory pattern is well-suited for the single-NddAdapter.run_workflow() call, and the Pydantic validation on LLM outputs is thorough. left a few inline nits: one missing export in schemas/__all__, an unused constant, a minor defensive-handling inconsistency in _enrich_domain, and a suggestion around exhaustiveness coverage for _DOMAIN_LIST. nothing blocking - approve.

lipikaramaswamy · 2026-03-17T18:30:04Z

PR #48 moves this function to its final resting place in the pipeline, so we handle exports there rather than adding another interim export from schemas.__init__ on this branch.

andreatgretel

looks good, ship it!

lipikaramaswamy added 4 commits March 10, 2026 15:15

feat: add rewrite foundation schemas and model roles

7d032c1

feat: add methods on sensitivity disposition class required downstream

3196c53

Merge remote-tracking branch 'origin/main' into lipikaramaswamy/feat/rewrite-engine-domain-disposition

946309d

domain classification and sensitivity disposition workflows

530a04b

lipikaramaswamy requested a review from a team as a code owner March 15, 2026 19:56

lipikaramaswamy changed the title ~~feat: domain classification and sensitivity disposition workflows~~ feat: domain classification and sensitivity disposition workflows Mar 15, 2026

fix: remove comments on model names, they are already clear

b96db08

lipikaramaswamy mentioned this pull request Mar 16, 2026

feat: add QA generation workflow for rewrite pipeline #48

Merged