chore: normalize whitespace in tags across all sources (keep Chinese) #196
Merged
Replaces whitespace in tags with hyphens per schema description rule (tags must be space-free for tokenizer/index compatibility).
Changes:
- 389 source files modified, 3688 tag changes total
- Chinese tags preserved as-is (schema allows mixed CN/EN)
- English multi-word phrases hyphenated + lowercased (matches schema examples gdp/ipo/economic-growth)
- Duplicate tags (post-normalization) deduplicated
Script: scripts/cleanup-tag-spaces.py
Prereq for upcoming schema PR that adds tags regex pattern ^\S+$.
Ref: three-way review 2026-04-30 (明鉴/墨子/明察)
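The normalization described here (whitespace replaced by hyphens, then deduplication) can be sketched roughly as follows. This is not the contents of scripts/cleanup-tag-spaces.py, which is not shown on this page; the function names, the trimming of leading/trailing whitespace, and the collapsing of a whitespace run into a single hyphen are all assumptions made for illustration. The sketch leaves letter case untouched.

```python
import re

# Hypothetical helper names; the real script's structure is not shown in this PR page.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_tag(tag: str) -> str:
    """Trim the tag, then replace each internal whitespace run with one hyphen."""
    return _WHITESPACE_RUN.sub("-", tag.strip())


def normalize_tags(tags: list[str]) -> list[str]:
    """Normalize every tag and drop duplicates, keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for tag in tags:
        normalized = normalize_tag(tag)
        # Skip tags that become empty and tags already emitted after normalization.
        if normalized and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


if __name__ == "__main__":
    sample = ["bank wealth management", "bank-wealth-management", "GDP", "经济增长"]
    print(normalize_tags(sample))
```

With the sample above, the two spellings of the same phrase collapse into one hyphenated tag, while `GDP` and the Chinese tag pass through unchanged. Because already-normalized tags contain no whitespace, applying `normalize_tags` to its own output returns the same list, which is the idempotency property the PR relies on.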
firstdata-dev
approved these changes
Apr 30, 2026
Collaborator
firstdata-dev
left a comment
Approved ✅
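Review conclusion:
- Script logic is correct: it only replaces whitespace with hyphens and deduplicates
- 0 Chinese tags lost (main 3225 → PR 3225, fully preserved)
- Case is preserved (acronyms such as SNP/EMBL-EBI/AI-prediction are not lowercased, which is correct)
- Script is idempotent (set-based dedup)
- The scope of 389 files / 3688 tag changes matches expectations
Spot-checked 10 files: 5 under academic/biology/*, 3 under china/*, and 2 picked at random; all show the expected space→hyphen conversion.
After merge, the Schema PR can proceed (adding "pattern": "^\\S+$" to tags.items as a hard block).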
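To illustrate what the planned "pattern": "^\\S+$" constraint would accept and reject, here is a small Python sketch. It is only an approximation of the schema check: the function name is hypothetical, and it uses `re.fullmatch` with `\S+` rather than the literal `^\S+$`, because in Python `$` also matches just before a trailing newline, which would let a tag such as "gdp\n" slip through.

```python
import re

# \S+ under fullmatch: the whole tag must be one or more non-whitespace characters.
_NO_WHITESPACE = re.compile(r"\S+")


def tag_is_valid(tag: str) -> bool:
    """Return True when the tag is non-empty and contains no whitespace at all."""
    return _NO_WHITESPACE.fullmatch(tag) is not None


if __name__ == "__main__":
    for candidate in ["economic-growth", "bank wealth management", "经济增长", ""]:
        print(repr(candidate), tag_is_valid(candidate))
```

Hyphenated English tags and Chinese tags pass, while any tag with a space, tab, or newline, and the empty string, are rejected.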
firstdata-dev
approved these changes
Apr 30, 2026
Collaborator
firstdata-dev
left a comment
Approved ✅ The script is correct, 0 Chinese tags lost (main 3225 → PR 3225), case is preserved, and the script is idempotent. After merge, the Schema pattern hard block can go in.
4 tasks
Summary
Replaces whitespace in tags with hyphens across 322 sources (~2974 tag changes).
Scope (Corrected per 明鉴 review)
Only space→hyphen normalization. NOT touched:
NNSA, IAMAC, GDP stay as-is)
Why
Schema description says tags are for "improved discoverability" with mixed CN/EN, but tags containing spaces (e.g. "bank wealth management") break tokenizer/index semantics.
This PR is a prerequisite for an upcoming schema PR that adds "pattern": "^\\S+$" to tags.items.
Changes
Script
scripts/cleanup-tag-spaces.py (64 lines, idempotent). Re-running on the corpus is a no-op.
Review strategy
Validation
After cleanup: 0 files with space-tags, 0 violations.
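A check of this kind could be sketched as below. How the validation was actually run is not shown here, so the in-memory mapping of file name to tag list, the function name, and the sample data are all hypothetical; a real run would first load the tags from each source file.

```python
def count_space_tags(sources: dict[str, list[str]]) -> tuple[int, int]:
    """Return (files containing at least one whitespace tag, total offending tags)."""
    files_with_space_tags = 0
    violations = 0
    for tags in sources.values():
        # A tag offends if any of its characters is whitespace (space, tab, newline, ...).
        offenders = [tag for tag in tags if any(ch.isspace() for ch in tag)]
        if offenders:
            files_with_space_tags += 1
            violations += len(offenders)
    return files_with_space_tags, violations


if __name__ == "__main__":
    # Hypothetical corpus: file names and tags are made up for the example.
    before = {
        "a.json": ["bank wealth management", "GDP"],
        "b.json": ["经济增长"],
        "c.json": ["ai prediction", "economic growth"],
    }
    after = {
        "a.json": ["bank-wealth-management", "GDP"],
        "b.json": ["经济增长"],
        "c.json": ["ai-prediction", "economic-growth"],
    }
    print(count_space_tags(before))
    print(count_space_tags(after))
```

A cleaned corpus should report zero for both counts, which is the condition stated above.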
Three-way context
Ref: 2026-04-30 three-way alignment. Path:
Checklist