feat: domain classification and sensitivity disposition workflows #45

Merged
lipikaramaswamy merged 8 commits into main from lipikaramaswamy/feat/rewrite-engine-domain-disposition on Mar 17, 2026
@lipikaramaswamy
Collaborator

@lipikaramaswamy lipikaramaswamy commented Mar 15, 2026

Summary

Implements the first two rewrite pipeline steps as column factories, part of the
broader single-workflow rewrite architecture tracked in #30.

  • engine/rewrite/domain_classification.py: DomainClassificationWorkflow
    • Classifies input text into one of 24 domains via LLMStructuredColumnConfig
    • _enrich_domain custom column looks up per-domain guidance from DOMAIN_SUPPLEMENT_MAP
    • _DOMAIN_LIST (for the prompt) and DOMAIN_SUPPLEMENT_MAP (for enrichment) are intentionally separate
  • engine/rewrite/sensitivity_disposition.py: SensitivityDispositionWorkflow
    • Produces a structured per-entity protection plan against SensitivityDispositionSchema
  • engine/constants.py: added COL_DOMAIN_SUPPLEMENT

Design Decisions

Column factory pattern: workflows expose columns() -> list[ColumnConfigT] only.
No run() method — all steps will be collected and passed to a single NddAdapter.run_workflow() call in the top-level RewriteWorkflow (tracked in #30).
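
The column factory pattern could look roughly like this. `ColumnConfig`, the two `...Step` classes, the column names, and `collect_columns` are all hypothetical stand-ins; the real workflows return `ColumnConfigT` objects and the collection happens in the future `RewriteWorkflow`:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ColumnConfig:
    """Hypothetical stand-in for the real ColumnConfigT types."""
    name: str


class ColumnFactory(Protocol):
    """Anything that can describe its columns without running them."""
    def columns(self) -> list[ColumnConfig]: ...


class DomainClassificationStep:
    def columns(self) -> list[ColumnConfig]:
        return [ColumnConfig("domain"), ColumnConfig("domain_supplement")]


class SensitivityDispositionStep:
    def columns(self) -> list[ColumnConfig]:
        return [ColumnConfig("sensitivity_disposition")]


def collect_columns(steps: list[ColumnFactory]) -> list[ColumnConfig]:
    """Flatten every step's columns, in order, for one downstream run call."""
    return [col for step in steps for col in step.columns()]
```

Because no step owns a `run()` method, the caller decides when and how the combined column list is executed.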

Prompt section headers: standardized to XML tags (<privacy_goal>, <input_tagged_text>, etc.),
as XML provides the clearest semantic structure across several model families (gpt-oss, claude, nemotron).

Data summary label: standardized to Dataset description: in prompts. Python param
stays data_summary to match AnonymizerConfig.
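
A rough sketch of the two prompt conventions above (XML section tags and the `Dataset description:` label). The helper names and the overall prompt layout are assumptions for illustration; only the tag names and the label come from this PR:

```python
def xml_section(tag: str, body: str) -> str:
    """Wrap a prompt section in a matching XML open/close tag pair."""
    return f"<{tag}>\n{body.strip()}\n</{tag}>"


def build_prompt(data_summary: str, privacy_goal: str, tagged_text: str) -> str:
    """Assemble a prompt; section order here is an assumption."""
    # The Python parameter stays `data_summary`; only the prompt label changes.
    return "\n\n".join([
        f"Dataset description: {data_summary}",
        xml_section("privacy_goal", privacy_goal),
        xml_section("input_tagged_text", tagged_text),
    ])
```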

Trust DataDesigner output types: _enrich_domain accesses .domain directly on the
DomainClassificationSchema Pydantic object — no defensive dict/fallback handling since
LLMStructuredColumnConfig guarantees a valid schema instance.
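
To illustrate the "trust the typed output" choice, here is a sketch using a frozen dataclass as a stand-in for the Pydantic `DomainClassificationSchema`. The `row` dict shape, the column names, and the `SUPPLEMENTS` entry are assumptions, not the real `_enrich_domain` signature:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainResult:
    """Stand-in for DomainClassificationSchema (a Pydantic model in the PR)."""
    domain: str


SUPPLEMENTS = {"finance": "Treat account numbers as sensitive."}  # hypothetical


def enrich(row: dict) -> dict:
    """Add the per-domain guidance to a row without mutating the input."""
    # Direct attribute access: no isinstance checks, no dict fallback.
    result: DomainResult = row["domain_result"]
    return {**row, "domain_supplement": SUPPLEMENTS[result.domain]}
```

The trade-off is that a malformed upstream value fails loudly with an `AttributeError` or `KeyError` instead of being silently papered over.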

Follow-ups

  • Align entity detection prompts (_get_validation_prompt, _get_augment_prompt, _get_latent_prompt) to XML section headers and Dataset description: label (TODO in
    sensitivity_disposition.py)

Related Issues

Closes #31

@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner March 15, 2026 19:56
@lipikaramaswamy lipikaramaswamy changed the title feat: domain classification and sensitivity disposition workflows feat: domain classification and sensitivity disposition workflows Mar 15, 2026
Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
Comment thread tests/engine/test_domain_classification.py
Comment thread src/anonymizer/engine/rewrite/domain_classification.py
Comment thread src/anonymizer/engine/rewrite/sensitivity_disposition.py Outdated
Comment thread src/anonymizer/engine/rewrite/domain_classification.py
Comment thread src/anonymizer/engine/constants.py Outdated
Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
Comment thread src/anonymizer/engine/rewrite/domain_classification.py
@andreatgretel
Collaborator

src/anonymizer/engine/schemas/__init__.py

__all__ = [
    ...
    "SensitivityLevel",
]

generate_privacy_qa_from_disposition is defined in schemas/rewrite.py and already tested, but it's missing from __all__ here - every other rewrite type is exported so this looks like an oversight. downstream orchestration code will need this function and shouldn't have to go through the submodule directly
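
A check of this kind can be automated. The sketch below builds a throwaway module to mimic the situation described; `missing_exports` is a hypothetical helper, not something in the repository:

```python
import types


def missing_exports(module: types.ModuleType, expected: set[str]) -> set[str]:
    """Names that should be public but are absent from the module's __all__."""
    return expected - set(getattr(module, "__all__", ()))


# Throwaway module standing in for schemas/__init__.py.
schemas = types.ModuleType("schemas")
schemas.__all__ = ["SensitivityLevel"]
```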

Collaborator

@andreatgretel andreatgretel left a comment

clean PR overall - the column-factory pattern is well-suited for the single-NddAdapter.run_workflow() call, and the Pydantic validation on LLM outputs is thorough. left a few inline nits: one missing export in schemas/__all__, an unused constant, a minor defensive-handling inconsistency in _enrich_domain, and a suggestion around exhaustiveness coverage for _DOMAIN_LIST. nothing blocking - approve.

@lipikaramaswamy
Collaborator Author

lipikaramaswamy commented Mar 17, 2026

src/anonymizer/engine/schemas/__init__.py

__all__ = [
    ...
    "SensitivityLevel",
]

generate_privacy_qa_from_disposition is defined in schemas/rewrite.py and already tested, but it's missing from __all__ here - every other rewrite type is exported so this looks like an oversight. downstream orchestration code will need this function and shouldn't have to go through the submodule directly

PR #48 moves this function to its final resting place in the pipeline, so we handle exports there rather than adding another interim export from schemas.__init__ on this branch.

@lipikaramaswamy lipikaramaswamy merged commit a9a7eae into main Mar 17, 2026
5 checks passed
@lipikaramaswamy lipikaramaswamy deleted the lipikaramaswamy/feat/rewrite-engine-domain-disposition branch March 17, 2026 18:50
Collaborator

@andreatgretel andreatgretel left a comment

looks good, ship it!

Development

Successfully merging this pull request may close these issues.

feat(rewrite): engine — domain classification + sensitivity disposition

3 participants