chore: add ko_KR locale to nemotron personas datasets #572
Register Korean (ko_KR, 2.66 GB) as an available managed persona dataset locale, update related CLI/repository tests, and document the new locale and its NGC download command.
Remove stale per-locale fields that no longer exist in any managed parquet (commune, departement, prefecture), drop district from the India-specific section since it's already listed in Core Fields, rename digital_skills → digital_skill to match the actual ja_JP column, and add sections for ko_KR, en_SG, and the en_US/en_SG shared ethnic_background. Corrects the religion-family membership to include en_SG.
The test asserts all 9 locales were downloaded but only enumerates 8 in its per-locale checks — fr_FR has been missing since before the ko_KR addition. Align the enumeration with the count.
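One way to keep a count assertion and a per-locale enumeration from drifting apart is to derive both from the same registry. The sketch below is only an illustration, not the repository's actual test: `SIZES` is a hypothetical stand-in for `NEMOTRON_PERSONAS_DATASET_SIZES` that lists just the two locales whose sizes are stated in this PR, and `missing_locales` / `check_all_downloaded` are made-up helper names.

```python
# Hypothetical stand-in for the real registry; only two of the nine locales
# are listed because only their sizes appear in this PR.
SIZES = {"fr_FR": "3.87 GB", "ko_KR": "2.66 GB"}


def missing_locales(downloaded: list[str], registry: dict[str, str]) -> set[str]:
    """Return registry locales that are absent from the downloaded list."""
    return set(registry) - set(downloaded)


def check_all_downloaded(downloaded: list[str], registry: dict[str, str]) -> None:
    # The count and the per-locale check both come from the registry, so adding
    # a locale there cannot leave one of the two assertions stale.
    assert len(downloaded) == len(registry)
    assert not missing_locales(downloaded, registry)
```

With this shape, a download list that has the right length but omits a locale (the situation described above for `fr_FR`) fails on the second assertion instead of passing silently.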
Docs preview: https://08d71afb.dd-docs-preview.pages.dev
PR #572 Review —
Greptile Summary
This PR adds
| Filename | Overview |
|---|---|
| docs/concepts/person_sampling.md | Adds ko_KR locale, Singapore-specific, and English-shared field sections; regroups religion/India fields; drops removed fields (commune, departement, prefecture, digital_skills). Minor gap: district removed from India section but not added to universal table. |
| packages/data-designer-config/src/data_designer/config/utils/constants.py | Adds ko_KR entry (2.66 GB) to NEMOTRON_PERSONAS_DATASET_SIZES and bumps fr_FR from 2.71 GB to 3.87 GB; straightforward registry update. |
| packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py | Reorganises PII_FIELDS and PERSONA_FIELDS: adds ko_KR health/household fields and family_persona, adds en_SG-specific fields and ethnic_background, drops commune/departement/prefecture/digital_skills, renames digital_skills → digital_skill, promotes district to universal block. |
| packages/data-designer/tests/cli/repositories/test_persona_repository.py | Bumps locale counts to 9, expands locale set assertion to include ko_KR, and adds ko_KR size/dataset-name test case. |
| packages/data-designer/tests/cli/controllers/test_download_controller.py | Bumps expected locale count from 8 to 9, adds ko_KR and fr_FR to downloaded-locales assertions. |
| packages/data-designer/tests/cli/services/test_download_service.py | Bumps expected locale count from 8 to 9 and adds ko_KR assertion; clean mechanical update. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[NEMOTRON_PERSONAS_DATASET_SIZES\nconstants.py] -->|9 locales incl. ko_KR| B[PersonaRepository\n_registry]
B --> C[DownloadService\nget_available_locales]
B --> D[DownloadController\n_determine_locales]
C --> E[CLI download command]
D --> E
E -->|all_locales=True| F[NGC download\nfor each locale]
G[PII_FIELDS / PERSONA_FIELDS\ndataset_based_person_fields.py] -->|field allow-list| H[Person sampler\nfilter/select columns]
H --> I[Generated dataset\nwith locale-specific fields]
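The right-hand branch of the flowchart (field allow-list feeding a column filter) can be pictured with a short sketch. This is an assumption-laden illustration rather than the engine's real code: `ALLOWED_FIELDS` is a hypothetical, heavily abbreviated stand-in for `PII_FIELDS` / `PERSONA_FIELDS`, and `select_allowed` is an invented helper name.

```python
# Hypothetical, abbreviated allow-list; the real PII_FIELDS / PERSONA_FIELDS
# contain many more entries and are split into two lists.
ALLOWED_FIELDS = ["district", "ethnic_background", "family_persona", "digital_skill"]


def select_allowed(columns: list[str], allow_list: list[str]) -> list[str]:
    """Keep only columns on the allow-list, preserving the dataset's column order."""
    allowed = set(allow_list)
    return [column for column in columns if column in allowed]
```

Under this model, a column that was dropped from the allow-list (such as `commune`) is simply filtered out, while a column present in a parquet file but missing from the allow-list never reaches the generated dataset.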
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/concepts/person_sampling.md
Line: 238-244
Comment:
**`district` removed from docs but promoted to universal in code**
In `dataset_based_person_fields.py`, `district` was moved from the India-specific block into the "Universal demographic fields (present in every managed locale)" comment section. However, the doc update removes `district` from the **India Locales Fields** section without adding it to the universal fields table (lines ~181–192). After this PR `district` is not documented under any locale section.
If `district` is now truly universal, it should appear in the universal PII fields table. If it isn't universal, the code comment classification needs to be corrected.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (2): Last reviewed commit: "docs: add ko_KR to locale parameter list"
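The kind of gap raised in this comment (a field classified in the code but absent from the docs) could be caught mechanically. The following is a minimal sketch under stated assumptions: `undocumented_fields` is a hypothetical helper, and the field list and markdown text would have to be loaded from `dataset_based_person_fields.py` and `docs/concepts/person_sampling.md` by the caller.

```python
import re


def undocumented_fields(code_fields: list[str], markdown: str) -> list[str]:
    """Return fields from the code that never appear as a whole word in the docs."""
    return [
        field
        for field in code_fields
        # \b keeps digital_skill from matching inside digital_skills.
        if not re.search(rf"\b{re.escape(field)}\b", markdown)
    ]
```

The whole-word match matters here because the PR renames `digital_skills` to `digital_skill`; a plain substring check would treat the old doc entry as covering the new field.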
Nemotron personas schema audit
I created a temporary audit workspace at
Result: PASS
Runtime-generated fields were excluded from raw parquet exactness because they are added by DataDesigner at generation time:
One caveat:
Thanks @andreatgretel, good catch. I added
andreatgretel
left a comment
Looks good - clean locale addition with thorough schema reconciliation. Johnny's parquet audit confirms all 9 locales match the updated field lists. Ship it.
📋 Summary
- Adds `ko_KR` (South Korea) to the Nemotron-Personas locale registry alongside the existing 8 locales, registering its dataset size (2.66 GB) and NGC resource name so users can download it via the CLI
- Updates `PII_FIELDS`/`PERSONA_FIELDS` in `dataset_based_person_fields.py` to reflect the fields actually present in the installed parquet schemas: adds `ko_KR`-specific fields (health indicators, household/economic status, `family_persona`), `en_SG`-specific fields (`industry`, `preferred_english_name`), shared `ethnic_background` (`en_US`/`en_SG`), and drops fields no longer produced (`commune`, `departement`, `prefecture`, `digital_skills`); `digital_skill` replaces `digital_skills`
- Bumps `fr_FR` dataset size to `3.87 GB` to match the current NGC artifact
- Updates `docs/concepts/person_sampling.md` persona and PII field tables with the new field layout (adds Korea-Specific and Singapore-Specific sections, regroups religion/India fields)
- `ko_KR` and `fr_FR`