feat: unify rewrite domain metadata into a single source #143
Signed-off-by: memadi <memadi@nvidia.com>
Greptile Summary
This PR consolidates three parallel domain-metadata structures (
Confidence Score: 4/5
Safe to merge for greenfield data; any existing pipeline data carrying old domain strings will break at validation time and needs a migration or compatibility layer before deploying.
The structural refactor and import-time validation are sound. The Domain enum rename is intentional and well-documented, but it is a hard break for persisted rows: there is no alias or migration layer, so deploying against existing data will produce Pydantic ValidationError failures in _enrich_domain and _enrich_domain_privacy for every row carrying an old string value.
src/anonymizer/engine/schemas/rewrite.py: verify that all upstream datasets have been migrated or that this is intentionally a clean-slate deployment before merging.
lipikaramaswamy left a comment
Looks good from an engineering perspective. I reviewed the refactor/taxonomy changes and ran the local test suite.
However, I did not run a live model-backed end-to-end rewrite pipeline locally, so my approval assumes you validated a full pipeline run with the new taxonomy.
Tested with a single record of the following datasets => returned DOMAIN in rewrite:
All returned domains are as expected.
* unify domain metadata
  Signed-off-by: memadi <memadi@nvidia.com>
* more clear docstring
  Signed-off-by: memadi <memadi@nvidia.com>
* address feedback
  Signed-off-by: memadi <memadi@nvidia.com>
* address feedback-update parameter name
  Signed-off-by: memadi <memadi@nvidia.com>
* address feedback
  Signed-off-by: memadi <memadi@nvidia.com>
* address feedback
  Signed-off-by: memadi <memadi@nvidia.com>
---------
Signed-off-by: memadi <memadi@nvidia.com>
Summary
Refactors rewrite-pipeline domain metadata into a single source of truth and updates the domain taxonomy.
Refactor
* Consolidates three parallel domain-metadata structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into one DomainMetadata dataclass + one DOMAIN_METADATA tuple keyed by Domain.
* _build_domain_index raises RuntimeError at import time on duplicate or missing entries, so drift fails fast instead of surfacing later as a mid-pipeline KeyError.
* Renamed rewrite_supplement → quality_supplement (the field has always been quality guidance, never rewrite-specific).
* privacy_supplement is str | None and defaults to quality_supplement via __post_init__, so domains without dedicated privacy guidance can omit it. Today only LEGAL has a distinct privacy supplement.
Taxonomy
24 prior Domain values → 21 new ones.
The Domain enum's serialized values have changed. Pipelines or downstream consumers holding rows with prior values will fail validation. Migration:

| Prior value | New value |
| --- | --- |
| SECURITY_INFOSEC | SECURITY_INFOSEC |
| FINANCIAL | FINANCIAL |
| LEGAL | LEGAL |
| META_TEXT | META_TEXT |
| ENTERTAINMENT_MEDIA | ENTERTAINMENT_MEDIA |
| OTHER | OTHER |
| BIOGRAPHY | BIOGRAPHY_PROFILE |
| CLINICAL_EHR_MEDICAL | MEDICAL_CLINICAL |
| HR_PEOPLE_OPS | HR_EMPLOYMENT |
| MANAGEMENT_OPERATIONS | BUSINESS_OPERATIONS |
| NEWS_JOURNALISM | NEWS_PUBLIC_AFFAIRS |
| SCIENTIFIC_ACADEMIC | RESEARCH_SCIENTIFIC |
| TECHNICAL_ENGINEERING_SOFTWARE | TECHNICAL_SOFTWARE_ENGINEERING |
| EDUCATIONAL_PEDAGOGICAL | EDUCATION |
| FICTION_CREATIVE | CREATIVE_FICTION |
| ECONOMIC | ECONOMIC_ANALYSIS |
| POLICY_REGULATORY_COMPLIANCE | POLICY_REGULATORY |
| MARKETING_ADVERTISING, PRODUCT_REVIEW | MARKETING_COMMERCIAL |
| SOCIAL_CULTURAL_OPED, SOCIAL_MEDIA | SOCIAL_COMMENTARY |
| None | INSURANCE |
| None | GOVERNMENT_PUBLIC_RECORDS |

CHAT_EMAIL_CSAT, PROCEDURAL_INSTRUCTIONAL, TRANSCRIPTS_INTERVIEWS: prior values with no replacement listed.
Tests
* Tests for _build_domain_index covering duplicate entries and missing enum coverage.
* make test: 647/647 pass; no new typecheck diagnostics in edited files.
Related Issues
Closes #55