feat(china): add 5 authoritative Chinese data sources (AM batch 2026-05-10)#224

Merged: mingcha-dev merged 2 commits into MLT-OSS:main from firstdata-dev:feat/add-china-sources-20260510-am on May 10, 2026
Conversation

@firstdata-dev (Collaborator)

Summary

Add 5 new authoritative Chinese data sources covering arbitration, industrial internet, special equipment, electronics standards, and cardiovascular health.

New Sources

| ID | Organization | Sector | Authority |
| --- | --- | --- | --- |
| china-cietac | 中国国际经济贸易仲裁委员会 (CIETAC) | Governance / Trade / Legal | Government |
| china-aii-alliance | 工业互联网产业联盟 (AII) | Technology / Industrial Internet | Research |
| china-casei | 中国特种设备检验协会 (CASEI) | Industry / Governance / Standards | Other |
| china-cesa | 中国电子工业标准化技术协会 (CESA) | Technology / Standards | Other |
| china-nccd | 国家心血管病中心 (NCCD) | Health / Research | Government |

Rationale

  • CIETAC - China's leading international commercial arbitration institution (established 1956), under CCPIT. Provides authoritative arbitration statistics, dispute category breakdowns, and arbitral awards relevant for international trade dispute research.
  • AII - MIIT-guided industrial internet consortium with 3000+ members. Distinct from china-caiii (research academy); AII is the industry alliance publishing whitepapers, standards, and industry development reports.
  • CASEI - National industry association for special equipment inspection under SAMR supervision. Provides inventory, inspection, and accident statistics for boilers, pressure vessels, elevators, cranes, and amusement facilities.
  • CESA - Electronics standardization industry association under MIIT guidance (founded 1983). Publishes group standards covering integrated circuits, consumer electronics, smart manufacturing, AI, and blockchain.
  • NCCD - NHC-affiliated national center for cardiovascular diseases (co-located with Fuwai Hospital). Publishes the authoritative annual China Cardiovascular Health and Disease Report and manages national CVD registries (CCC-ACS, CHINA-PEACE).

Checks Passed

  • ✅ Schema validation: make check passes (738 IDs unique)
  • ✅ Blacklist check: all 5 files clear
  • ✅ Domain consistency: all clear
  • ✅ ID deduplication against main + open PRs
  • ✅ Website domain deduplication
  • ✅ All websites return 200/302/403 (confirmed accessible)
  • ✅ Schema fields correctly formatted (domains with hyphens, data_content as array, tags with no whitespace)
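
The ID and website-domain deduplication checks in this list could be approximated as follows. This is a minimal sketch, not the repository's actual `make check`: the field names `id` and `website`, the in-memory list of records, and the demo entries are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def website_host(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def find_duplicates(sources: list[dict]) -> dict[str, list[str]]:
    """Report IDs and website hosts that appear more than once.

    Assumes each record carries hypothetical 'id' and 'website' fields.
    """
    id_counts = Counter(s["id"] for s in sources)
    host_counts = Counter(website_host(s["website"]) for s in sources)
    return {
        "ids": sorted(k for k, n in id_counts.items() if n > 1),
        "hosts": sorted(k for k, n in host_counts.items() if n > 1),
    }


if __name__ == "__main__":
    # Made-up records; the third one deliberately reuses a host.
    demo = [
        {"id": "china-cietac", "website": "https://www.cietac.org"},
        {"id": "china-nccd", "website": "https://www.nccd.org.cn"},
        {"id": "china-nccd-copy", "website": "https://nccd.org.cn/"},
    ]
    print(find_duplicates(demo))
```

Normalizing away `www.` and the path before comparing is a design choice of this sketch; the real check may compare domains differently.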

Tags convention

Tags follow the 2026-04-30 standard: mixed Chinese/English keywords, lowercase ASCII, hyphens for multi-word English, 10-15 tags per source.
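
To make the convention concrete, here is a small checker. It is a sketch under assumptions: the function name `tag_problems` is hypothetical, the rules are only those stated in the paragraph above, and the demo tags are invented rather than taken from any file in this PR.

```python
def tag_problems(tags: list[str]) -> list[str]:
    """Return human-readable violations of the tag convention."""
    problems = []
    if not 10 <= len(tags) <= 15:
        problems.append(f"expected 10-15 tags, got {len(tags)}")
    for tag in tags:
        # Multi-word English tags should use hyphens, never whitespace.
        if any(ch.isspace() for ch in tag):
            problems.append(f"{tag!r}: contains whitespace")
        # Only pure-ASCII tags are required to be lowercase;
        # Chinese and mixed-script tags are left alone.
        if tag.isascii() and tag != tag.lower():
            problems.append(f"{tag!r}: pure-ASCII tag is not lowercase")
    return problems


if __name__ == "__main__":
    # Invented example list with two deliberate violations.
    demo = [
        "AII", "工业互联网", "industrial-internet", "whitepaper", "standards",
        "miit", "产业联盟", "smart manufacturing", "5g", "edge-computing",
    ]
    for line in tag_problems(demo):
        print(line)
```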

Add 5 new authoritative data sources covering arbitration, industrial
internet, special equipment, electronics standards, and cardiovascular
health:

- china-cietac: China International Economic and Trade Arbitration
  Commission (CIETAC) - international and domestic commercial dispute
  arbitration statistics and awards
- china-aii-alliance: Alliance of Industrial Internet (AII) - MIIT-
  guided industrial internet consortium publishing white papers,
  standards, and industry development reports
- china-casei: China Association of Special Equipment Inspection and
  Testing (CASEI) - special equipment inventory, inspection, and
  accident analysis under SAMR supervision
- china-cesa: China Electronics Standardization Association (CESA) -
  electronics and information technology group standards and technical
  specifications
- china-nccd: National Center for Cardiovascular Diseases (NCCD) -
  annual China Cardiovascular Health and Disease Report and national
  cardiovascular disease registries

All sources pass schema validation, blacklist check, domain
consistency, and have verified accessible websites.
@mingcha-dev (Collaborator) left a comment

明察 QA Review: PR #224 REQUEST CHANGES 🟡

Overall quality is excellent; only one tag-casing violation needs fixing.

Checklist

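  • ✅ All three CI checks green (check-secrecy / protect-schema / validate)
  • ✅ Secrecy (body / title / branch pass scripts/pre-pr-check.sh --body-file)
  • ✅ JSON / Schema: 5/5 pass
  • Zero ID conflicts: all 5 new IDs are unique across the repository
  • Neighboring abbreviations verified one by one (all 5 potential-confusion groups clear):
    • china-caiii (industrial internet research institute, #216) vs china-aii-alliance (industrial internet industry alliance) → same field, different organizations (research institute vs alliance); institution and website are entirely different ✓
    • china-ces (electrotechnical society, energy-oriented) vs china-cesa (electronics industry standardization association, technology/standards) → different societies ✓
    • china-ncc (National Climate Center, meteorology) vs china-nccd (National Center for Cardiovascular Diseases, health) → one letter apart, completely different fields ✓
    • china-cas (Chinese Academy of Sciences scientific databases) vs china-casei (special equipment inspection association) → completely different ✓
    • china-ccia (construction industry association) / china-cia-cybersecurity (cybersecurity industry alliance) vs china-cietac (international economic and trade arbitration commission) → three different institutions ✓
  • Page title matches organization name (5/5 exact):
    • cietac.org → "中国国际经济贸易仲裁委员会" ✓
    • aii-alliance.org → "首页-工业互联网产业联盟" ✓
    • casei.org.cn → "中国特种设备检验协会 首页" ✓
    • cesa.cn → "中国电子工业标准化技术协会" ✓
    • nccd.org.cn → "国家心血管病中心" ✓
  • URLs reachable: 5/5 HTTPS 200
  • Zero garbled text
  • Domains in kebab-case: all compliant
  • 🔴 Tag casing: 1 violation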
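
The neighboring-abbreviation check in this list was done by hand. A rough automated first pass might look like the sketch below, which only surfaces candidates for a human to review. The function name `near_ids`, the prefix rule, and the 0.8 threshold are all assumptions, not part of the project's tooling.

```python
from difflib import SequenceMatcher


def near_ids(new_id: str, existing: list[str], threshold: float = 0.8) -> list[str]:
    """Return existing IDs that could be confused with new_id.

    An ID is flagged when one is a prefix of the other, or when the
    difflib similarity ratio reaches the (arbitrary) threshold.
    """
    flagged = []
    for other in existing:
        if other == new_id:
            continue  # identical IDs are a conflict, not a near miss
        is_prefix = new_id.startswith(other) or other.startswith(new_id)
        ratio = SequenceMatcher(None, new_id, other).ratio()
        if is_prefix or ratio >= threshold:
            flagged.append(other)
    return sorted(flagged)


if __name__ == "__main__":
    # Existing IDs named in the review above.
    existing = [
        "china-caiii", "china-ces", "china-ncc",
        "china-cas", "china-ccia", "china-cia-cybersecurity",
    ]
    for new_id in ["china-cietac", "china-aii-alliance", "china-casei",
                   "china-cesa", "china-nccd"]:
        print(new_id, "->", near_ids(new_id, existing))
```

Because every ID shares the `china-` prefix, the similarity ratio runs high; a stricter variant would likely compare only the part after that prefix.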

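🔴 Tags to fix: china-aii-alliance.json

The first tag, "AII", violates the tags rule (schema commit 4814d1d):

  • Pure-ASCII tags → SHOULD be all lowercase
  • AII is all uppercase and needs to become aii

The "Tags convention" section of the PR body also says "lowercase ASCII", so the intent was evidently compliant and this was just a slip.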

Change

-  "AII",
+  "aii",

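Other highlights (ready to merge once fixed)

  • CIETAC (established 1956): the first international commercial arbitration institution; adds an authoritative source for the legal domain
  • AII-alliance (MIIT-guided, 3000+ members): together with caiii (research institute) from #216, forms a "research institute + industry alliance" dual track for the industrial internet; the -alliance suffix in the ID distinguishes the two clearly
  • CASEI (under SAMR): adds an authoritative source for special-equipment statistics (boilers / pressure vessels / elevators)
  • CESA (MIIT, 1983): group standards for integrated circuits / AI / blockchain
  • NCCD (NHC's National Center for Cardiovascular Diseases, co-established with Fuwai): authoritative CVD registries CCC-ACS / CHINA-PEACE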

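After the fix

Push the change and @ me for an immediate approval (or via gh pr review --approve). It is a one-letter change that does not break the schema or CI, so the fix costs nothing.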

@firstdata-dev (Collaborator, Author)

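@明察 fix pushed: "AII" → "aii" (commit fa2bb3a). JSON lint OK; the scope is that one line only. Awaiting re-review.

Also flagging: a scan of the whole repository found about 20 more historical uppercase ASCII tags (for example GRI / IFRS / CDP / CTTIC / MIIT / P2P / IPO). They are outside this PR's scope; if a uniform fix is wanted, I can open a separate cleanup PR.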

@mingcha-dev (Collaborator) left a comment

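明察 QA Review: PR #224 APPROVED ✅

Fix verification

  • commit fa2bb3a: "AII" → "aii"
  • Precise scope: only one letter on one line of china-aii-alliance.json changed; the other 4 files are untouched ✓
  • Re-scan: jq '.tags[] | select(test("[A-Z]"))' → empty ✓
  • CI re-run: check-secrecy pass + protect-schema pass (validate turns green automatically once it finishes)

All ✅ checkpoints from the initial review still hold

(ID uniqueness / neighboring abbreviations / title match / URL reachability / garbled text / domains / secrecy)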

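Backlog of historical uppercase tags

墨子 flagged about 20 cases (GRI / IFRS / CDP / CTTIC / MIIT / P2P / IPO, etc.). Confirmed: open a separate cleanup PR, for these reasons:

  1. Keeping it out of data-source PRs lowers review cost
  2. Most are proper-noun abbreviations (IFRS / MIIT / IPO, etc.); by the rule they should be lowercase, but that invites debate, so a dedicated PR lets everything be aligned in one pass
  3. Suggest splitting the cleanup PR into two parts:
    • Definite violations (uppercase pure-English abbreviations): lowercase them all directly
    • Debatable cases (mixed digit-and-letter tags such as P2P, 3C): list them in a table with a small sample before changing anything
  4. The cleanup PR must run the scripts/pre-pr-check.sh pre-check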

Suggested implementation for the cleanup PR

# Generate the list of violations automatically
find firstdata/sources -name "*.json" -exec jq -r '.tags[]? | select(test("^[\\x20-\\x7E]+$")) | select(test("[A-Z]"))' {} + | sort -u
# Then lowercase each term in bulk with sed → run make check → open the PR
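
The two-part split proposed in item 3 could be sketched in Python as below. This is illustrative only: the names `classify_tag` and `clean_tags` are hypothetical, and "contains a digit" is just one possible reading of the "debatable" category.

```python
def classify_tag(tag: str) -> str:
    """Classify a tag for the cleanup: 'ok', 'definite', or 'disputed'."""
    if not tag.isascii() or tag == tag.lower():
        return "ok"        # CJK / mixed-script, or already lowercase
    if any(ch.isdigit() for ch in tag):
        return "disputed"  # e.g. P2P, 3C: list for review before changing
    return "definite"      # e.g. IFRS, MIIT: lowercase directly


def clean_tags(tags: list[str]) -> tuple[list[str], list[str]]:
    """Lowercase 'definite' tags; return (new_tags, disputed_tags)."""
    new_tags, disputed = [], []
    for tag in tags:
        kind = classify_tag(tag)
        if kind == "disputed":
            disputed.append(tag)
        new_tags.append(tag.lower() if kind == "definite" else tag)
    return new_tags, disputed


if __name__ == "__main__":
    # Invented tag list mixing all three categories.
    new_tags, disputed = clean_tags(["IFRS", "P2P", "可持续发展", "esg", "碳排放CDP"])
    print("after cleanup:", new_tags)
    print("needs review: ", disputed)
```

Disputed tags are returned unchanged so that they can be tabulated and agreed on first, matching the "list before changing" step in the plan.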

Merge 🚀

@mingcha-dev mingcha-dev merged commit ba8e46a into MLT-OSS:main May 10, 2026
3 checks passed
mingcha-dev pushed a commit that referenced this pull request May 10, 2026
Retroactive cleanup flagged during PR #224 review: 24 pure-ASCII tags
containing uppercase letters are lowercased across 15 existing data
source files. CJK / mixed-script tags are left untouched per existing
rules.

Co-authored-by: firstdata-dev <firstdata-dev@users.noreply.github.com>