Skip to content

feat: unify rewrite domain metadata into a single source#143

Merged
memadi-nv merged 6 commits into
mainfrom
memadi/feature/unify-rewrite-domain-metadata
May 14, 2026
Merged

feat: unify rewrite domain metadata into a single source#143
memadi-nv merged 6 commits into
mainfrom
memadi/feature/unify-rewrite-domain-metadata

Conversation

@memadi-nv
Copy link
Copy Markdown
Contributor

@memadi-nv memadi-nv commented Apr 29, 2026

Summary

Refactors rewrite-pipeline domain metadata into a single source of truth and updates the domain taxonomy.

Refactor

  • Collapse three parallel structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into one DomainMetadata dataclass + one DOMAIN_METADATA tuple keyed by Domain.
  • New _build_domain_index raises RuntimeError at import time on duplicate or missing entries — drift fails fast instead of surfacing later as a mid-pipeline KeyError.
  • Rename rewrite_supplementquality_supplement (the field has always been quality guidance, never rewrite-specific).
  • privacy_supplement is str | None and defaults to quality_supplement via __post_init__, so domains without dedicated privacy guidance can omit it. Today only LEGAL has a distinct privacy supplement.

Taxonomy

24 prior Domain values → 21 new ones:

The Domain enum's serialized values have changed. Pipelines or downstream
consumers holding rows with prior values will fail validation. Migration:

Old value New value
SECURITY_INFOSEC SECURITY_INFOSEC
FINANCIAL FINANCIAL
LEGAL LEGAL
META_TEXT META_TEXT
ENTERTAINMENT_MEDIA ENTERTAINMENT_MEDIA
OTHER OTHER
BIOGRAPHY BIOGRAPHY_PROFILE
CLINICAL_EHR_MEDICAL MEDICAL_CLINICAL
HR_PEOPLE_OPS HR_EMPLOYMENT
MANAGEMENT_OPERATIONS BUSINESS_OPERATIONS
NEWS_JOURNALISM NEWS_PUBLIC_AFFAIRS
SCIENTIFIC_ACADEMIC RESEARCH_SCIENTIFIC
TECHNICAL_ENGINEERING_SOFTWARE TECHNICAL_SOFTWARE_ENGINEERING
EDUCATIONAL_PEDAGOGICAL EDUCATION
FICTION_CREATIVE CREATIVE_FICTION
ECONOMIC ECONOMIC_ANALYSIS
POLICY_REGULATORY_COMPLIANCE POLICY_REGULATORY
MARKETING_ADVERTISING, PRODUCT_REVIEW MARKETING_COMMERCIAL
SOCIAL_CULTURAL_OPED, SOCIAL_MEDIA SOCIAL_COMMENTARY
CHAT_EMAIL_CSAT dropped (re-classify)
PROCEDURAL_INSTRUCTIONAL dropped (re-classify)
TRANSCRIPTS_INTERVIEWS dropped (re-classify)
None INSURANCE
None GOVERNMENT_PUBLIC_RECORDS
CHAT_EMAIL_CSAT dropped
PROCEDURAL_INSTRUCTIONAL dropped
TRANSCRIPTS_INTERVIEWS dropped

Tests

  • Refreshed existing tests for the new field name and renamed identifiers.
  • Added two error-path tests for _build_domain_index covering duplicate entries and missing enum coverage.
  • make test — 647/647 pass; no new typecheck diagnostics in edited files.

Related Issues

Closes #55

Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from a team as a code owner April 29, 2026 23:19
Signed-off-by: memadi <memadi@nvidia.com>
Comment thread src/anonymizer/engine/rewrite/domain_classification.py
Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from lipikaramaswamy May 7, 2026 00:57
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR consolidates three parallel domain-metadata structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into a single DomainMetadata dataclass and DOMAIN_METADATA tuple, with import-time validation via _build_domain_index that catches drift between the enum and the metadata table immediately. The Domain enum is simultaneously retaxonomized from 24 to 21 members (renames, merges, two additions, three removals), making this a breaking change for any persisted domain-column values.

  • Structural consolidation: DomainMetadata(domain, classification_description, quality_supplement, privacy_supplement) replaces three separate dicts; _build_domain_index raises RuntimeError at import time on duplicates or missing coverage, turning a formerly silent runtime KeyError into a loud startup failure.
  • Enum rename/merge/drop: 14 members renamed, 2 merged into existing values, 3 removed, 2 added (INSURANCE, GOVERNMENT_PUBLIC_RECORDS); any pipeline row carrying an old string value (e.g. "BIOGRAPHY", "CHAT_EMAIL_CSAT") will fail Pydantic validation after deployment.
  • Test coverage: Error-path tests for _build_domain_index are present but use a manual try/except/else pattern rather than pytest.raises.

Confidence Score: 4/5

Safe to merge for greenfield data; any existing pipeline data carrying old domain strings will break at validation time and needs a migration or compatibility layer before deploying.

The structural refactor and import-time validation are sound. The Domain enum rename is intentional and well-documented, but it is a hard break for persisted rows — there is no alias or migration layer, so deploying against existing data will produce Pydantic ValidationError failures in _enrich_domain and _enrich_domain_privacy for every row carrying an old string value.

src/anonymizer/engine/schemas/rewrite.py — verify that all upstream datasets have been migrated or that this is intentionally a clean-slate deployment before merging.

Important Files Changed

Filename Overview
src/anonymizer/engine/schemas/rewrite.py Domain enum renaming: 24 → 21 members, removing 3 (CHAT_EMAIL_CSAT, PROCEDURAL_INSTRUCTIONAL, TRANSCRIPTS_INTERVIEWS), adding INSURANCE and GOVERNMENT_PUBLIC_RECORDS. Breaking serialization change for any persisted rows (previously flagged in thread).
src/anonymizer/engine/rewrite/domain_classification.py Three parallel data structures collapsed into DomainMetadata dataclass + DOMAIN_METADATA tuple, with import-time validation via _build_domain_index. Minor ambiguity in FINANCIAL classification description overlapping with ECONOMIC_ANALYSIS.
tests/engine/test_domain_classification.py Tests updated to new field names; error-path tests for _build_domain_index added using try/except/else rather than the pytest.raises idiom.
tests/engine/test_rewrite_workflow.py Mechanical rename of "BIOGRAPHY" → "BIOGRAPHY_PROFILE" throughout test fixtures; no logic changes.

Reviews (3): Last reviewed commit: "address feedback" | Re-trigger Greptile

Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
Comment thread tests/engine/test_domain_classification.py
Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
memadi-nv added 2 commits May 7, 2026 12:55
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Comment thread src/anonymizer/engine/schemas/rewrite.py
Comment thread src/anonymizer/engine/rewrite/domain_classification.py
Comment thread src/anonymizer/engine/rewrite/domain_classification.py
@memadi-nv memadi-nv requested a review from asteier2026 May 7, 2026 22:37
Copy link
Copy Markdown
Contributor

@asteier2026 asteier2026 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Copy Markdown
Collaborator

@lipikaramaswamy lipikaramaswamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from an engineering perspective. I reviewed the refactor/taxonomy changes and ran the local test suite.

I didn't however run a live model-backed end-to-end rewrite pipeline locally, so my approval assumes you validated a full pipeline run with the new taxonomy.

Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv
Copy link
Copy Markdown
Contributor Author

Looks good from an engineering perspective. I reviewed the refactor/taxonomy changes and ran the local test suite.

I didn't however run a live model-backed end-to-end rewrite pipeline locally, so my approval assumes you validated a full pipeline run with the new taxonomy.

Tested with a single record of the following datasets => returned DOMAIN in rewrite:

  • NVIDIA_synthetic_biographies => BIOGRAPHY_PROFILE (Renamed domain from BIOGRAPHY to BIOGRAPHY_PROFILE in this branch)
  • TAB_legal_sample25 => LEGAL (only domain were quality_supplement and privacy_supplement are different)
  • Created a dataset with insurance context => INSURANCE (New DOMAIN)

All returned domains are as expected.

@memadi-nv memadi-nv merged commit e98989c into main May 14, 2026
11 checks passed
@memadi-nv memadi-nv deleted the memadi/feature/unify-rewrite-domain-metadata branch May 14, 2026 03:54
asteier2026 pushed a commit that referenced this pull request May 15, 2026
* unify domain metadata

Signed-off-by: memadi <memadi@nvidia.com>

* more clear docstring

Signed-off-by: memadi <memadi@nvidia.com>

* address feedback

Signed-off-by: memadi <memadi@nvidia.com>

* address feedback-update parameter name

Signed-off-by: memadi <memadi@nvidia.com>

* address feedback

Signed-off-by: memadi <memadi@nvidia.com>

* address feedback

Signed-off-by: memadi <memadi@nvidia.com>

---------

Signed-off-by: memadi <memadi@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor: unify rewrite domain metadata into a single source of truth

3 participants