chore: normalize whitespace in tags across all sources (keep Chinese) by mingcha-dev · Pull Request #196 · MLT-OSS/FirstData

mingcha-dev · 2026-04-30T02:58:56Z

Summary

Replaces whitespace in tags with hyphens across 322 sources (~2974 tag changes).

Scope (Corrected per 明鉴 review)

Only space→hyphen normalization. NOT touched:

Chinese tags (schema allows mixed CN/EN)
Case (schema has no case rule — NNSA, IAMAC, GDP stay as-is)

Why

Schema description says tags are for "improved discoverability" with mixed CN/EN, but tags containing spaces (e.g. "bank wealth management") break tokenizer/index semantics.

This PR is a prerequisite for an upcoming schema PR that adds "pattern": "^\\S+$" to tags.items.
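
To illustrate what the planned pattern would accept and reject, here is a minimal sketch in Python. The helper name `tag_is_valid` is hypothetical and not part of the repository. It assumes the JSON string "^\\S+$" decodes to the regex `^\S+$`, and it approximates that with `re.fullmatch`, which may differ in edge cases from the regex dialect a JSON Schema validator uses.

```python
import re

# Regex from the planned schema rule; in JSON it is written "^\\S+$".
# fullmatch() is used instead of ^...$ so a trailing newline cannot slip past "$".
NO_WHITESPACE = re.compile(r"\S+")


def tag_is_valid(tag: str) -> bool:
    """Hypothetical helper: True when the tag is non-empty and has no whitespace."""
    return NO_WHITESPACE.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ["bank wealth management", "bank-wealth-management", "GDP", "财富管理"]:
        print(repr(tag), tag_is_valid(tag))
```

Chinese tags and upper-case acronyms pass unchanged under this check; only tags containing whitespace (or empty tags) fail.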

Changes

322 source files modified, 2974 tag changes
Whitespace runs collapsed to single hyphen
Duplicate tags (post-normalization) deduplicated
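
The two transformations listed above can be sketched as follows. This is not the contents of scripts/cleanup-tag-spaces.py; the function names are hypothetical, and two details are assumptions rather than facts from this PR: leading/trailing whitespace is stripped before collapsing, and the first occurrence of a duplicate is the one kept.

```python
import re
from typing import List

# One or more whitespace characters in a row.
_WS_RUN = re.compile(r"\s+")


def normalize_tag(tag: str) -> str:
    """Collapse each whitespace run to a single hyphen; case is left untouched."""
    # Assumption: surrounding whitespace is stripped so " gdp" does not become "-gdp".
    return _WS_RUN.sub("-", tag.strip())


def normalize_tags(tags: List[str]) -> List[str]:
    """Normalize every tag, then drop duplicates created by normalization."""
    seen = set()
    result = []
    for tag in tags:
        new_tag = normalize_tag(tag)
        # Assumption: keep the first occurrence and preserve the original order.
        if new_tag and new_tag not in seen:
            seen.add(new_tag)
            result.append(new_tag)
    return result


if __name__ == "__main__":
    tags = ["bank wealth  management", "bank-wealth-management", "GDP", "财富管理"]
    once = normalize_tags(tags)
    print(once)
    # Running the function on its own output changes nothing (idempotent).
    print(normalize_tags(once) == once)
```

Because the output never contains whitespace, a second pass finds nothing to replace, which is what makes a cleanup of this shape a no-op on re-run.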

Script

scripts/cleanup-tag-spaces.py (64 lines, idempotent). Re-running on the corpus is a no-op.

Review strategy

Review the script logic (64 lines)
Randomly sample 10 files from the diff to verify:
- Chinese tags unchanged ✅
- Case preserved (acronyms stay upper) ✅
- Spaces→hyphens ✅

Validation

After cleanup: 0 files with space-tags, 0 violations.
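
A check of this kind could look like the sketch below. It is an illustration only: the function name `count_space_tags` is hypothetical, and the input is assumed to be an already-loaded mapping from source file path to its tag list, since the on-disk layout of the sources is not described in this PR.

```python
import re
from typing import Dict, List, Tuple

# Any single whitespace character inside a tag counts as a violation.
_WHITESPACE = re.compile(r"\s")


def count_space_tags(sources: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (files containing a whitespace tag, total whitespace tags)."""
    files_with_space_tags = 0
    violations = 0
    for _path, tags in sources.items():
        bad = [tag for tag in tags if _WHITESPACE.search(tag)]
        if bad:
            files_with_space_tags += 1
            violations += len(bad)
    return files_with_space_tags, violations


if __name__ == "__main__":
    # Hypothetical in-memory corpus; real paths and tags would come from the repository.
    corpus = {
        "example/a.json": ["gdp", "bank wealth management"],
        "example/b.json": ["财富管理", "economic-growth"],
    }
    print(count_space_tags(corpus))
```

On a fully cleaned corpus both counters are zero, matching the result reported above.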

Three-way context

Ref: 2026-04-30 three-way alignment. Path:

This PR (cleanup, data-only, space-only) — merge first
Schema PR with regex pattern — merge after (back-to-back)

Checklist

Schema still allows what it described (mixed CN/EN)
Script is idempotent
No case changes
No Chinese tags removed
No new sources added (data-only cleanup)
Description free of banned terms

Replaces whitespace in tags with hyphens per schema description rule (tags must be space-free for tokenizer/index compatibility).

Changes:
- 389 source files modified, 3688 tag changes total
- Chinese tags preserved as-is (schema allows mixed CN/EN)
- English multi-word phrases hyphenated + lowercased (matches schema examples gdp/ipo/economic-growth)
- Duplicate tags (post-normalization) deduplicated

Script: scripts/cleanup-tag-spaces.py

Prereq for upcoming schema PR that adds tags regex pattern ^\S+$.

Ref: three-way review 2026-04-30 (明鉴/墨子/明察)

firstdata-dev

Approved ✅

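Review conclusion:

Script logic is correct: it only replaces whitespace with hyphens and deduplicates
Chinese tags: 0 lost (main 3225 → PR 3225, fully preserved)
Case preserved (acronyms such as SNP, EMBL-EBI, and AI-prediction are not lowercased, which is correct)
Script is idempotent (set-based dedup)
389 files / 3688 tag changes, scope matches expectations

Sampled 10 files for verification: 5 from academic/biology/*, 3 from china/*, and 2 chosen at random; all show correct space→hyphen conversion.

After merge, the Schema PR can proceed (adding "pattern": "^\\S+$" to tags.items as a hard block).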

firstdata-dev

Approved ✅ Script is correct, 0 Chinese tags lost (main 3225 → PR 3225), case preserved, idempotent. After merge, the schema pattern hard block can go in.

ningzimu merged commit 9db08f3 into main Apr 30, 2026
3 checks passed

ningzimu deleted the mingcha/cleanup-tags-spaces branch April 30, 2026 03:01

firstdata-dev approved these changes Apr 30, 2026

mingcha-dev mentioned this pull request Apr 30, 2026

feat(schema): enforce no-whitespace in tags via pattern ^\\S+$ #197

Closed

4 tasks
