
Transfusion/cjkvi-ids-unicode


Unicode-only IDS data

IDS data in the format of CHISE which avoids, to the maximum extent possible, the use of any entity references.

This dataset is intended for the Radically project, a component-based CJK character search engine which currently only handles characters encoded in UCS/UTF-8. Being able to look up characters by the semantic equivalents of their subcomponents, even when those equivalents are not structurally exact, is extremely useful.

Examples of IDSes containing entity references which have exact semanto-structural Unicode counterparts:

U+877E 蝾 ⿱虫&AJ1-17775; (荣)
U+8C50 豐 ⿱&CDP-8D51;豆 (𠁳)
U+4049 䁉 ⿱&hanaJU+2BF09;目 (𫼉)
U+FA1F 﨟 ⿱艹&M-29726; (𦝲)
U-00027F3C 𧼼 ⿺走&GT-40124; (若)
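As a rough illustration of what "containing entity references" means in practice, the sketch below pulls the entity names out of an IDS string. It assumes an entity reference is `&`, a name with no whitespace or `;`, then `;`, as in the examples above. The names `ENTITY_REF`, `find_entity_references` and `is_unicode_only` are made up for this example and are not part of this repository.

```python
import re

# Assumption: an entity reference looks like "&NAME;" where NAME has no
# whitespace, "&" or ";". This matches the examples listed above.
ENTITY_REF = re.compile(r"&([^;&\s]+);")


def find_entity_references(ids: str) -> list:
    """Return the names of all entity references found in an IDS string."""
    return ENTITY_REF.findall(ids)


def is_unicode_only(ids: str) -> bool:
    """True when the IDS string contains no entity references at all."""
    return ENTITY_REF.search(ids) is None
```

For instance, `find_entity_references("⿱虫&AJ1-17775;")` yields the single name `AJ1-17775`, while `is_unicode_only("⿱虫荣")` holds for the substituted form.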

Substitution heuristics

A list of entity reference formats may be found in these papers: "Multiple-policy Character Annotation based on CHISE", pg. 16, and "Possibilities of integration between a glyph-image database for Kanji characters and a character ontology", pg. 5.

The current entity prefixes being handled are (with examples):

  • &A-IWDSU+777F; (A-, alias prefix, strip and attempt to resolve the remainder of the string)
  • &R-HD-JA-376E; (R-, ??, strip and attempt to resolve the remainder of the string)
  • &A-compU+5DDB; (compU+, ?? strip and convert to UCS/UTF-8 character)
  • &CDP-8C66; (CDP-, Chinese Document Processing lab's internal code for various components in 「漢字構形資料庫」. Resolve to the nearest 関連字 using GlyphWiki's dump.)
  • &U-i003+51AC; (U-, variants of Unicode characters; strip all the variant information and convert to UCS/UTF-8 character)
  • &C4-212F; (C[1-7]-, CNS11643 charset characters, resolve with the tables found in https://gitlab.chise.org/CHISE/xemacs-chise)
  • &GT-K01770; (GT- or GT-K, a font with a large collection of CJK chars and their variants. Resolve to the nearest 関連字 using GlyphWiki's dump.) More information:
  • &R-HD-JA-376E; (HD-, organized by the 汎用電子情報交換環境整備プログラム . Sub-prefixes usually correspond to well-known charsets; HD-JA corresponds to JIS X0208, HD-JD JIS X0213 plane 1, HD-KS 戸籍統一文字, etc. Resolve using lookup tables.)
  • &JX2-793E; (JX[1-3]-, corresponds to various versions of the JIS charset. Refer to the papers above. Resolve using lookup tables.)
  • &MJ000778; (MJ, 「文字情報基盤」codepoints. Lookup tables are available on the moji.or.jp website.)
  • &G0-4056; (G0-, ??? Resolve to the nearest 関連字 using GlyphWiki's dump.)
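The simplest of the heuristics above (stripping `A-`/`R-` and converting `compU+` or `U-` names directly to a character) can be sketched as follows. This is only a minimal illustration based on the example names in the list; the regular expression and the function names are assumptions made for this example, and the table-driven prefixes (CDP-, GT-, HD-, etc.) are deliberately left unresolved.

```python
import re
from typing import Optional

# Assumed shapes, based on the examples above:
#   compU+5DDB   -> code point 5DDB
#   U-i003+51AC  -> variant info "i003+" is dropped, code point 51AC
_CODEPOINT = re.compile(r"^(?:comp)?U[+-](?:[A-Za-z0-9]+\+)?([0-9A-Fa-f]{4,6})$")


def strip_alias_prefixes(name: str) -> str:
    """Strip leading "A-" / "R-" prefixes so the remainder can be resolved."""
    while name.startswith(("A-", "R-")):
        name = name[2:]
    return name


def to_unicode_char(name: str) -> Optional[str]:
    """Convert a compU+/U- style entity name to a character, else None."""
    match = _CODEPOINT.match(strip_alias_prefixes(name))
    if match is None:
        return None  # needs a lookup table or GlyphWiki instead
    return chr(int(match.group(1), 16))
```

Names that come back as `None` (for example `CDP-8C66`) would fall through to whichever table-based or GlyphWiki-based resolver handles that prefix.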

cjkvi_ids_unicode.data_access.EntityResolver is subclassed for each of these sources, and fed into cjkvi_ids_unicode.unified_resolver.resolve_entity_references.
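To give a feel for the "one resolver per source, tried in turn" arrangement, here is a self-contained sketch. It does not reproduce the actual `EntityResolver` interface or `resolve_entity_references` signature, which are not documented here; `Resolver`, `TableResolver` and `resolve_all` are hypothetical names, and the table entry is taken from the examples at the top of this README.

```python
import re
from typing import Dict, Optional, Sequence

ENTITY_REF = re.compile(r"&([^;&\s]+);")


class Resolver:
    """Hypothetical base class: map an entity name to a character or None."""

    def resolve(self, name: str) -> Optional[str]:
        raise NotImplementedError


class TableResolver(Resolver):
    """Hypothetical resolver backed by a plain lookup table."""

    def __init__(self, table: Dict[str, str]) -> None:
        self._table = table

    def resolve(self, name: str) -> Optional[str]:
        return self._table.get(name)


def resolve_all(ids: str, resolvers: Sequence[Resolver]) -> str:
    """Replace each entity reference using the first resolver that succeeds."""

    def replace(match):
        for resolver in resolvers:
            char = resolver.resolve(match.group(1))
            if char is not None:
                return char
        return match.group(0)  # leave unresolved references untouched

    return ENTITY_REF.sub(replace, ids)
```

With `TableResolver({"CDP-8D51": "𠁳"})` as the only resolver, `resolve_all("⿱&CDP-8D51;豆", ...)` gives `⿱𠁳豆`, and references no resolver recognises are left in place.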

Notes on GlyphWiki resolution

An algorithm determining the nearest 関連字 using GlyphWiki's dump_newest_only.txt may be found in cjkvi_ids_unicode.data_access.GlyphWiki. It was invented by me and is known to be incorrect in certain cases.

Essentially, it recursively searches the related column until u3013 is reached (u3013 being the placeholder character), and then checks if the glyph begins with the signature KAGE engine bounding box, 99:0:0:0:0:200:200, which indicates that it is fully defined in terms of another glyph.

The logic is that non-suffixed UCS characters, e.g. u2667e, should be aliased to hyphenated characters such as u2667e-j; Simplified Chinese characters are thus aliased to -g. This falls apart due to data inconsistency and not all 表外 characters being related to a UCS character.
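One possible reading of the procedure described above is sketched below. It is not the code in `cjkvi_ids_unicode.data_access.GlyphWiki`. It assumes the dump has already been loaded into a dict mapping glyph name to a `(related, data)` pair, follows the related column iteratively (with a cycle guard) rather than recursively, and accepts the final glyph only if it has a plain UCS name and is an alias. All names here are hypothetical, and the rows used in the usage note are made-up toy data.

```python
import re
from typing import Dict, Optional, Tuple

PLACEHOLDER = "u3013"                   # "no related character" marker
ALIAS_SIGNATURE = "99:0:0:0:0:200:200"  # glyph defined wholly by another glyph
UCS_NAME = re.compile(r"^u([0-9a-f]{4,6})$")  # non-suffixed name, e.g. u2667e


def nearest_related(name: str,
                    glyphs: Dict[str, Tuple[str, str]]) -> Optional[str]:
    """Follow the related column until the placeholder, then check the alias."""
    seen = set()
    while name in glyphs and name not in seen:
        seen.add(name)  # guard against cycles in inconsistent data
        related, data = glyphs[name]
        if related == PLACEHOLDER:
            match = UCS_NAME.match(name)
            if match and data.startswith(ALIAS_SIGNATURE):
                return chr(int(match.group(1), 16))
            return None  # not a non-suffixed UCS alias: give up
        name = related
    return None  # unknown glyph name or a cycle
```

Given toy rows such as `{"some-glyph": ("u2667e", "1:0:0:10:10:190:10"), "u2667e": ("u3013", "99:0:0:0:0:200:200:u2667e-j")}`, looking up `some-glyph` walks to `u2667e` and returns the character at U+2667E. Glyphs whose chain ends somewhere other than an aliased UCS name yield `None`, which is where the data inconsistencies mentioned above show up.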

Licenses

SPDX-License-Identifier: MIT OR GPL-2.0-or-later

  • The products of this repository (*IDS*.txt and *IDS*.json) are licensed under the GNU General Public License v2.0 or later.
  • The code written by me is dual-licensed under the GNU General Public License v2.0 or later, and the MIT license.
    • If you adapt and use it for its evident intended purpose (fetching and processing IDSs from CHISE and the Kanji Database Project), you must release your changes under the GPLv2+.
    • I do not see how it is very useful for anything other than its evident intended purpose, but if, say, you take the lookup table logic and the HTML generation logic on their own stackoverflow-style with no relation to the aforementioned sources, you may use such snippets under the MIT license. I'd still like to know how you find it useful, in such cases.
