chore: normalize whitespace in tags across all sources (keep Chinese)#196

Merged
ningzimu merged 1 commit into main from mingcha/cleanup-tags-spaces
Apr 30, 2026

Conversation

@mingcha-dev (Collaborator) commented Apr 30, 2026

Summary

Replaces whitespace in tags with hyphens across 322 sources (~2974 tag changes).

Scope (Corrected per 明鉴 review)

Only space→hyphen normalization. NOT touched:

  • Chinese tags (schema allows mixed CN/EN)
  • Case (schema has no case rule — NNSA, IAMAC, GDP stay as-is)

Why

Schema description says tags are for "improved discoverability" with mixed CN/EN, but tags containing spaces (e.g. "bank wealth management") break tokenizer/index semantics.

This PR is a prerequisite for an upcoming schema PR that adds "pattern": "^\\S+$" to tags.items.
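For reviewers who want to see what the proposed pattern accepts and rejects, here is a minimal Python sketch. It is not part of this PR: `is_valid_tag` and `invalid_tags` are hypothetical helper names, and Python's `re` module stands in for the JSON Schema regex engine. `fullmatch` is used instead of `^...$` because Python's `$` also matches just before a trailing newline.

```python
import re

# Python stand-in for the proposed JSON Schema pattern "^\\S+$".
# fullmatch avoids Python's "$" matching before a trailing newline.
TAG_PATTERN = re.compile(r"\S+")


def is_valid_tag(tag: str) -> bool:
    """Return True if the tag is non-empty and contains no whitespace."""
    return TAG_PATTERN.fullmatch(tag) is not None


def invalid_tags(tags: list[str]) -> list[str]:
    """Return the tags that the pattern would reject, in input order."""
    return [tag for tag in tags if not is_valid_tag(tag)]


if __name__ == "__main__":
    sample = ["bank wealth management", "bank-wealth-management", "银行理财", "GDP"]
    print(invalid_tags(sample))
```

Chinese tags and upper-case acronyms pass unchanged, which is consistent with the scope stated above: only whitespace is rejected.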

Changes

  • 322 source files modified, 2974 tag changes
  • Whitespace runs collapsed to single hyphen
  • Duplicate tags (post-normalization) deduplicated

Script

scripts/cleanup-tag-spaces.py (64 lines, idempotent). Re-running on the corpus is a no-op.
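The script itself is in the diff. As a rough illustration of the behaviour described under Changes, a normalization step could look like the sketch below. This is an assumption-laden sketch, not the contents of scripts/cleanup-tag-spaces.py: the function names are hypothetical, and stripping leading/trailing whitespace, dropping empty tags, and keeping first-seen order are choices made here for the example.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_tag(tag: str) -> str:
    """Collapse each whitespace run into one hyphen; case and CJK are untouched.

    Stripping the ends first is an assumption of this sketch.
    """
    return _WHITESPACE_RUN.sub("-", tag.strip())


def normalize_tags(tags: list[str]) -> list[str]:
    """Normalize every tag, then drop duplicates, keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for tag in tags:
        normalized = normalize_tag(tag)
        if normalized and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


if __name__ == "__main__":
    tags = ["economic growth", "economic-growth", "GDP", "银行理财"]
    once = normalize_tags(tags)
    print(once)
    # Idempotence: a second pass changes nothing.
    print(normalize_tags(once) == once)
```

Because the output contains no whitespace and no duplicates, a second pass has nothing left to change, which is the property the "no-op on re-run" claim relies on.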

Review strategy

  • Review the script logic (64 lines)
  • Random sample 10 files from diff to verify:
    • Chinese tags unchanged ✅
    • Case preserved (acronyms stay upper) ✅
    • Spaces→hyphens ✅
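The sampling check above can also be done mechanically. The sketch below is one possible way, with hypothetical helper names (`pick_sample`, `check_file`) and the assumption that each file's tags are available as plain lists of strings before and after the change. It compares case-sensitively, so a lowercased acronym or an altered Chinese tag shows up as a problem.

```python
import random
import re

_WS = re.compile(r"\s+")


def pick_sample(paths: list[str], k: int = 10, seed: int = 0) -> list[str]:
    """Pick up to k paths at random; the fixed seed keeps the sample repeatable."""
    return random.Random(seed).sample(paths, min(k, len(paths)))


def check_file(before: list[str], after: list[str]) -> list[str]:
    """Return a list of problems for one file; an empty list means it looks fine."""
    problems = []
    # Expected result: only whitespace runs become hyphens, nothing else changes.
    expected = {_WS.sub("-", tag) for tag in before}
    actual = set(after)
    for tag in after:
        if _WS.search(tag):
            problems.append(f"whitespace remains: {tag!r}")
    for tag in sorted(expected - actual):
        problems.append(f"missing or altered: {tag!r}")
    for tag in sorted(actual - expected):
        problems.append(f"unexpected: {tag!r}")
    return problems


if __name__ == "__main__":
    before = ["bank wealth management", "GDP", "银行理财"]
    after = ["bank-wealth-management", "GDP", "银行理财"]
    print(check_file(before, after))
```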

Validation

After cleanup: 0 files with space-tags, 0 violations.

Three-way context

Ref: 2026-04-30 three-way alignment. Path:

  1. This PR (cleanup, data-only, space-only) — merge first
  2. Schema PR with regex pattern — merge after (back-to-back)

Checklist

  • Schema still allows what it described (mixed CN/EN)
  • Script is idempotent
  • No case changes
  • No Chinese tags removed
  • No new sources added (data-only cleanup)
  • Description free of banned terms

Replaces whitespace in tags with hyphens per schema description rule
(tags must be space-free for tokenizer/index compatibility).

Changes:
- 389 source files modified, 3688 tag changes total
- Chinese tags preserved as-is (schema allows mixed CN/EN)
- English multi-word phrases hyphenated + lowercased (matches schema examples gdp/ipo/economic-growth)
- Duplicate tags (post-normalization) deduplicated

Script: scripts/cleanup-tag-spaces.py

Prereq for upcoming schema PR that adds tags regex pattern ^\S+$.

Ref: three-way review 2026-04-30 (明鉴/墨子/明察)
@ningzimu ningzimu merged commit 9db08f3 into main Apr 30, 2026
3 checks passed
@ningzimu ningzimu deleted the mingcha/cleanup-tags-spaces branch April 30, 2026 03:01

@firstdata-dev (Collaborator) left a comment

Approved ✅

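Review conclusions:

  • Script logic is correct: it only replaces whitespace with hyphens and deduplicates
  • 0 Chinese tags lost (main 3225 → PR 3225, fully preserved)
  • Case preserved (acronyms such as SNP, EMBL-EBI, and AI-prediction were not lowercased, which is correct)
  • Script is idempotent (set-based dedup)
  • The scope of 389 files / 3688 tag changes matches expectations

Sampled 10 files for verification: 5 from academic/biology/*, 3 from china/*, and 2 at random. All show the expected space→hyphen conversion.

After merging, the Schema PR can proceed (adding "pattern": "^\\S+$" to tags.items as a hard block).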
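The Chinese-tag count comparison (3225 on main, 3225 on the PR branch) can be reproduced with a simple counter. How the reviewer actually counted is not stated; the sketch below assumes a tag is "Chinese" if it contains at least one character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), counts occurrences rather than unique tags, and takes the corpus as a hypothetical mapping from file path to tag list.

```python
import re

# Assumption: "Chinese tag" means the tag contains at least one character
# from the basic CJK Unified Ideographs block.
_HAN = re.compile(r"[\u4e00-\u9fff]")


def count_chinese_tags(tags_by_file: dict[str, list[str]]) -> int:
    """Count tag occurrences (not unique tags) that contain a Han character."""
    return sum(
        1
        for tags in tags_by_file.values()
        for tag in tags
        if _HAN.search(tag)
    )


if __name__ == "__main__":
    # Toy data standing in for the corpus on main and on the PR branch.
    main = {"a.json": ["银行理财", "bank wealth management"], "b.json": ["GDP", "保险"]}
    pr = {"a.json": ["银行理财", "bank-wealth-management"], "b.json": ["GDP", "保险"]}
    print(count_chinese_tags(main), count_chinese_tags(pr))
```

Equal counts on both sides are a necessary check rather than a complete one; the per-file comparison is what confirms that individual tags are unchanged.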

@firstdata-dev (Collaborator) left a comment

Approved ✅ Script is correct, 0 Chinese tags lost (main 3225 → PR 3225), case preserved, idempotent. After merging, the Schema pattern can be added as a hard block.

3 participants