# ZhengMa Character Conversion: Background

## 1 Initial Notes & Resources

### 1.1 Zheng Ma Tutorial

We should begin by making clear our object of study.  According to [this dictionary entry](https://chinese.yabla.com/chinese-english-pinyin-dictionary.php?define=zhengma), the encoding system under discussion is properly written as follows.

> - Traditional: 鄭碼
> - Simplified: 郑码
> - Pinyin: **Zhèng mǎ**
> - Zheng coding
>     - original Chinese character coding based on component shapes, created by Zheng Yili 鄭易里|郑易里[Zheng4 Yi4 li3], underlying most stroke-based Chinese input methods
>     - also called common coding 字根通用碼|字根通用码[zi4 gen1 tong1 yong4 ma3]

Note that the **[Arch Chinese Dictionary](https://www.archchinese.com/chinese_english_dictionary.html)** seems to give Zheng Ma codes for individual characters, which makes it a quick and dirty way to look them up.  For those wishing to understand how the encoding works, a useful quick introduction to the mechanics of the ZhengMa input method can be found in [this Wikibooks resource](https://en.wikibooks.org/wiki/Zhengma_Input).

### 1.2 Windows Data Resources


[This StackExchange thread](https://chinese.stackexchange.com/questions/83/learning-resources-for-zhengma-input-method) has a nice discussion of resources for learning about the ZhengMa input method and how to use it.  Most importantly, it mentions what specific file in the Microsoft Windows OS contains the encoding information: 

> On my computer it is found at `C:\Program Files(x86)\Windows NT\TableTextService`; it is called `TableTextServiceSimplifiedZhengMa.txt`

And [here](https://github.com/Furzoom/wubi/blob/master/TableTextServiceSimplifiedZhengMa.txt) I've managed to find a copy of that encoding file.  That's helpful!

I just noted, however, some discrepancies between the ZhengMa input method description mentioned [above](https://en.wikibooks.org/wiki/Zhengma_Input) and the [Windows file](https://github.com/Furzoom/wubi/blob/master/TableTextServiceSimplifiedZhengMa.txt) I downloaded.  In particular, the description mentions how to arrive at ZM codes for various strings of several characters, e.g. for 4 characters:

| Phrases | Phrase code | Character normal codes | Character short codes |
| :-- | :-- | :-- | :-- |
| 生态系统 | mgmz | mc+gdsw+mzvv+zszr | mc+gsw+mzv+zs |
| 高等教育 | smbs | sjld+mbds+bmym+szq | sjl+ms+bmm+szq |

and for more than 4 characters:

| Phrases | Phrase code | Character normal codes | Character short codes |
| :-- | :-- | :-- | :-- |
| 新石器时代 | sgjk | sufp+ga+jjjj+kds+nhs | sf+ga+jjg+kd+nh |
| 合成洗涤剂 | ohvv | odaj+hmy+vmrd+vrf+sonk | oaj+h+vmr+vrf+snk |
| 中华人民共和国 | jnoy | jivv+nred+od+yybh+eao+mfj+jdcs | |
| 全国工商业联合会 | ojbs | odc+jdcs+bi+suld+ku+ceug+odaj+odbz | |
| 中国有色金属工业总公司 | jjgr | jivv+jdcs+gdq+ryia+pa+xmil+bi+ku+udjw+ozs+yaj | |
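
Every row in the two tables above is consistent with one simple pattern: the phrase code is the first letter of the normal code of each of the first four characters.  The sketch below only checks that pattern against rows copied from the tables; it is an observation about these examples, not a statement of the official ZhengMa rule, and the function name `phrase_code` is made up for illustration.

```python
def phrase_code(normal_codes):
    """Join the first letter of each of the first four characters' normal codes.

    Assumption: this is the pattern the table rows happen to follow,
    not a confirmed statement of the official rule.
    """
    if len(normal_codes) < 4:
        raise ValueError("this sketch only covers phrases of four or more characters")
    return "".join(code[0] for code in normal_codes[:4])


# Rows copied from the tables above: phrase -> character normal codes.
rows = {
    "生态系统": "mc+gdsw+mzvv+zszr",
    "新石器时代": "sufp+ga+jjjj+kds+nhs",
    "中华人民共和国": "jivv+nred+od+yybh+eao+mfj+jdcs",
}

for phrase, codes in rows.items():
    print(phrase, phrase_code(codes.split("+")))
# 生态系统 mgmz
# 新石器时代 sgjk
# 中华人民共和国 jnoy
```

Under this reading, characters past the fourth contribute nothing to the phrase code, which is why a 5-character phrase still gets a 4-letter code.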

But when I search the Windows file, I find

> "sgjk"="相互影响"

which, in addition to having different characters than those in the table, is a 4- rather than a 5-character string!  And I don't find `mgmz` at all!  So that makes me wonder:

* How complete is the Windows database?
* How universal are the codes for multi-character phrases?
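
To make lookups like these repeatable, here is a minimal sketch of a parser for entries of the form `"code"="text"`, the shape quoted above.  It assumes that every entry sits on its own line in exactly that shape and that anything else can be skipped; the function name and the sample lines are hypothetical, and the real file's text encoding would need to be checked before reading it from disk.

```python
import re
from collections import defaultdict

# Matches lines shaped like "sgjk"="相互影响" (an assumption based on the quoted entry).
ENTRY = re.compile(r'^"([a-z]+)"="(.+)"$')


def parse_windows_table(lines):
    """Map each code to the list of strings it produces, in file order."""
    table = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:  # lines that are not entries are ignored
            code, text = match.groups()
            table[code].append(text)
    return table


sample = ['"sgjk"="相互影响"', '"av"="切"', "a line that is not an entry"]
table = parse_windows_table(sample)

print(table.get("sgjk"))  # ['相互影响']
print(table.get("mgmz"))  # None
print(len(table))         # 2
```

With the whole file loaded this way, the completeness question becomes a matter of counting codes and checking which phrases from the Wikibooks tables are missing.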

### 1.3 `fcitx` Zheng Ma Resources

For comparison, I also found [this file](https://github.com/fcitx/fcitx-table-extra/blob/master/tables/zhengma-large.txt), called `zhengma-large.txt`, that's part of the Ubuntu package [`fcitx-table-extra`](https://github.com/fcitx/fcitx-table-extra), corresponding to [`fcitx`](https://github.com/fcitx).

But there, for example, I only find `mgmz` as part of the following entry:

> mgmzs 生态系统

And for `sgjk`, I find

> sgjkn 新石器时代\
> sgjk 𠝒\
> sgjk 𠝒

So I don't really know what's going on there.  This time the strings look right, but the codes have an extra letter... making them **5 letters long!**  I thought the ZhengMa encoding tried to keep everything to 4 letters...

Moreover, if we look at `av` in this file, we find

> ^av 一

... but in the previous file, we find

> "av"="切"

So it seems that these don't agree, even on simple glyphs.
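
A comparison like this can be automated.  The sketch below parses lines in the `code text` shape seen in `zhengma-large.txt` and reports codes that both tables contain but map to entirely different strings.  It strips a leading `^` from codes, which is an assumption about what that marker means, and both function names are made up for illustration.

```python
def parse_fcitx_lines(lines):
    """Map each code to the set of strings listed for it.

    Assumptions: one "code text" pair per line, separated by whitespace,
    and a leading "^" is a marker rather than part of the code.
    """
    table = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and anything not shaped like an entry
        code, text = parts
        table.setdefault(code.lstrip("^"), set()).add(text)
    return table


def disagreements(a, b):
    """Codes found in both tables whose candidate sets have nothing in common."""
    return {c: (a[c], b[c]) for c in a.keys() & b.keys() if a[c].isdisjoint(b[c])}


# Entries quoted in these notes.
fcitx = parse_fcitx_lines(["^av 一", "mgmzs 生态系统", "sgjkn 新石器时代"])
windows = {"av": {"切"}, "sgjk": {"相互影响"}}

print(disagreements(fcitx, windows))  # {'av': ({'一'}, {'切'})}
```

Note that `sgjk` is not reported here, because the fcitx sample only contains `sgjkn`; codes that exist in just one table would need a separate check.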

### 1.4 IBus Zheng Ma Data

[This StackExchange thread](https://chinese.stackexchange.com/questions/43465/incomplete-list-of-free-chinese-input-methods-in-current-use) serves as a useful resource.  It lists a number of Chinese input methods (including both 4-corner and Zheng Ma), and it points to websites that have more information.

In particular, for the Zheng Ma encoding, it points to [this website](http://www.zmfans.cn/bbs) and [this GitHub repo](https://github.com/acevery/ibus-table-zhengma) related to the [IBus input method](https://code.google.com/archive/p/ibus/) project.  The latter contains [this file](https://github.com/acevery/ibus-table-zhengma/blob/master/tables/zhengma.txt), called `zhengma.txt`, which is another data store of Zheng Ma codes and their corresponding characters.

### 1.5 RIME Zheng Ma Data

The [RIME input system](https://rime.im/) for writing Chinese characters includes the file `zhengma.dict.yaml`, located [here](https://github.com/Openvingen/rime-zhengma/blob/master/zhengma.dict.yaml), as part of the [Zheng Ma extension](https://github.com/Openvingen/rime-zhengma).

Judging from a few simple codes, like `a`, `aa`, etc., this file seems to share some of the same codes as `zhengma.txt` and `zhengma-large.txt` above.  But we find some disagreement with the Windows file `TableTextServiceSimplifiedZhengMa.txt`, even just looking at the character represented by the code `a`.

Moreover, the first handful of lines shows a number of instances where the same code corresponds to different character strings, undercutting the idea that Zheng Ma codes are (nearly) unique:

```yaml
a	一
a	下
a	平
aa	一下
aa	一天
aaac	一无可取
aaag	无可无不可
aaal	百无一用
aaam	万无一失
aaam	天下无敌
aaar	可丁可卯
aaav	可歌可泣
aaaw	天下一家
aaax	天下无双
aaax	天下无难事
aabk	天无二日
```

Of course, the Zheng Ma encoding isn't *strictly* unique.  This really amounts to a question of how frequent such instances are in the rest of the file.  In addition, it's a question of whether this occurs only with strings of multiple Chinese characters, or with individual characters as well.
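
Both questions can be answered by grouping entries by code and counting.  The following is a minimal sketch, assuming tab-separated `code<TAB>text` lines as in the excerpt above; the real file's header lines and any extra columns would need handling, and the function and key names are made up for illustration.

```python
from collections import defaultdict


def collision_stats(lines):
    """Count codes shared by several entries, split by entry length."""
    by_code = defaultdict(set)
    for line in lines:
        if line.startswith("#") or "\t" not in line:
            continue  # skip comments and anything without a tab
        code, text = line.rstrip("\n").split("\t")[:2]
        by_code[code].add(text)

    shared = single = phrase = 0
    for texts in by_code.values():
        if len(texts) > 1:
            shared += 1
        # len() counts code points, so one rare character still counts as 1.
        if sum(1 for t in texts if len(t) == 1) > 1:
            single += 1
        if sum(1 for t in texts if len(t) > 1) > 1:
            phrase += 1
    return {
        "codes": len(by_code),
        "shared": shared,
        "shared_single_char": single,
        "shared_phrase": phrase,
    }


sample = [
    "a\t一", "a\t下", "a\t平",
    "aa\t一下", "aa\t一天",
    "aaac\t一无可取",
    "aaam\t万无一失", "aaam\t天下无敌",
]
print(collision_stats(sample))
# {'codes': 4, 'shared': 3, 'shared_single_char': 1, 'shared_phrase': 2}
```

Even in this small sample, the shared codes include one shared between single characters (`a`), so the ambiguity is not limited to multi-character strings.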

### 1.6 IBM Data?

Does IBM have a separate source file for this?  That's what [this page](https://www.ibm.com/docs/en/aix/7.2?topic=methods-simplified-chinese-input-method-zim-ucs) seems to suggest; it appears to belong to the documentation for AIX 7.2 (whatever that is).  They say the following:

> ZIM-UCS features the following characteristics:
> 
> - The following commonly used input methods exist:
>     - **Intelligent ABC**
>         - An input method based on the phonetic representation of Chinese characters.
>     - **Pin Yin Input Method**
>         - An input method based on the phonetic representation of Chinese characters. A Chinese character is divided into one or several phonemes according to its pronunciation. 
>     - **Wu Bi (Five Strike) Input Method**
>         - An input method based on the grapheme representation of Chinese characters. According to the WuBi grapheme input method, Chinese characters are classified into three levels: stroke, radical and single-character.
>     - **Zheng Ma**
>         - An input method based on the grapheme representation of Chinese word. 
>     - **Biao Xing Ma Input Method**
>         - An input method in which a Chinese character is divided into several components,or radicals. When coding a character, these radicals are presented with the corresponding English letters.
>     - **Internal Code Input Method**
>         - An input method in accordance with the code table defined in GB18030 (Chinese Internal Code Specification) and UCS-2 (Unicode System Version 2).
> 
> - Half-width and full-width character input. Supports ASCII characters in both single-byte and multibyte modes.
> - Auxiliary window to support all the candidate lists. For example, Intelligent ABC generate a list of possible characters that contain the same sound symbols (*radicals*). Users select the desired characters by pressing the conversion key.
> - Over-the-spot pre-editing drawing area. Allows entry of radicals in reverse video area that temporarily covers the text line. The complete character is sent to the editor by pressing the conversion key.
> 
> The UCS-ZIM files are in the **/usr/lib/nls/loc** directory.
> 
> The UCS-ZIM keymap is in the **/usr/lib/nls/loc/ZH_CN.UTF-8.imkeymap** directory.

Now I guess I have to decipher that...  The home page seems to be [here](https://www.ibm.com/docs/en/aix/7.2), for the documentation at least.  Evidently AIX is a proprietary brand of UNIX developed by IBM, according to [this Wikipedia article](https://en.wikipedia.org/wiki/IBM_AIX).  Interestingly, AIX is the older of the two: it first appeared in 1986, while Linux was only released in 1991.