# ZhengMa Character Conversion: Working Draft

## 1 Initial Notes & Resources

### 1.1 Zheng Ma Tutorial

We should begin by making clear our object of study.  Properly [written](https://chinese.yabla.com/chinese-english-pinyin-dictionary.php?define=zhengma), we're discussing the following encoding system.

> - Traditional: 鄭碼
> - Simplified: 郑码
> - Pinyin: **Zhèng mǎ**
> - Zheng coding
>     - original Chinese character coding based on component shapes, created by Zheng Yili 鄭易里|郑易里[Zheng4 Yi4 li3], underlying most stroke-based Chinese input methods
>     - also called common coding 字根通用碼|字根通用码[zi4 gen1 tong1 yong4 ma3]

Note that the **[Arch Chinese Dictionary](https://www.archchinese.com/chinese_english_dictionary.html)** seems to give Zheng Ma codes for individual characters, which makes for quick-and-dirty lookups.  For those wishing to understand how the encoding works, a useful quick introduction to the mechanics of the ZhengMa input method can be found in [this Wikibooks resource](https://en.wikibooks.org/wiki/Zhengma_Input).

### 1.2 Windows Data Resources


[This StackExchange thread](https://chinese.stackexchange.com/questions/83/learning-resources-for-zhengma-input-method) has a nice discussion of resources for learning about the ZhengMa input method and how to use it.  Most importantly, it mentions what specific file in the Microsoft Windows OS contains the encoding information: 

> On my computer it is found at `C:\Program Files(x86)\Windows NT\TableTextService`; it is called `TableTextServiceSimplifiedZhengMa.txt`

And [here](https://github.com/Furzoom/wubi/blob/master/TableTextServiceSimplifiedZhengMa.txt) I've managed to find a copy of that encoding file.  That's helpful!

I just noted, however, some discrepancies between the ZhengMa input method description mentioned [above](https://en.wikibooks.org/wiki/Zhengma_Input) and the [Windows file](https://github.com/Furzoom/wubi/blob/master/TableTextServiceSimplifiedZhengMa.txt) I downloaded.  In particular, the description mentions how to arrive at ZM codes for various strings of several characters, e.g. for 4 characters:

| Phrases | Phrase code | Character normal codes | Character short codes |
| :-- | :-- | :-- | :-- |
| 生态系统 | mgmz | mc+gdsw+mzvv+zszr | mc+gsw+mzv+zs |
| 高等教育 | smbs | sjld+mbds+bmym+szq | sjl+ms+bmm+szq |

and for more than 4 characters:

| Phrases | Phrase code | Character normal codes | Character short codes |
| :-- | :-- | :-- | :-- |
| 新石器时代 | sgjk | sufp+ga+jjjj+kds+nhs | sf+ga+jjg+kd+nh |
| 合成洗涤剂 | ohvv | odaj+hmy+vmrd+vrf+sonk | oaj+h+vmr+vrf+snk |
| 中华人民共和国 | jnoy | jivv+nred+od+yybh+eao+mfj+jdcs | |
| 全国工商业联合会 | ojbs | odc+jdcs+bi+suld+ku+ceug+odaj+odbz | |
| 中国有色金属工业总公司 | jjgr | jivv+jdcs+gdq+ryia+pa+xmil+bi+ku+udjw+ozs+yaj | |

But when I search the Windows file, I find

> "sgjk"="相互影响"

which, in addition to having different characters than those in the table, is a 4- rather than a 5-character string!  And I don't find `mgmz` at all!  So that makes me wonder

* How complete is the Windows database?
* How universal are the codes for multi-character phrases?
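
One way to start probing the first question is to search the table by phrase as well as by code.  The sketch below assumes the `"code"="text"` line format quoted above and runs on four sample lines (all of which appear somewhere in these notes) rather than the real file; `find_entries` is a hypothetical helper name.

```python
import re

# Sample lines in the Windows table format; all four entries appear
# elsewhere in these notes. The real file is UTF-16 and far longer.
sample_lines = [
    '"a"="工"',
    '"aa"="式"',
    '"av"="切"',
    '"sgjk"="相互影响"',
]

entry_pattern = re.compile(r'^"(\w+)"="(\w+)"')


def find_entries(lines, code=None, text=None):
    """Return (code, text) pairs matching an exact code and/or an exact text."""
    hits = []
    for line in lines:
        m = entry_pattern.match(line.strip())
        if not m:
            continue
        zm_code, cjk = m.group(1), m.group(2)
        if code is not None and zm_code != code:
            continue
        if text is not None and cjk != text:
            continue
        hits.append((zm_code, cjk))
    return hits


print(find_entries(sample_lines, code="sgjk"))
print(find_entries(sample_lines, text="新石器时代"))
```

Searching by text tells us whether a phrase is missing from the table altogether or merely filed under a different code.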

### 1.3 `fcitx` Zheng Ma Resources

For comparison, I also found [this file](https://github.com/fcitx/fcitx-table-extra/blob/master/tables/zhengma-large.txt), called `zhengma-large.txt`, that's part of the Ubuntu package [`fcitx-table-extra`](https://github.com/fcitx/fcitx-table-extra), corresponding to [`fcitx`](https://github.com/fcitx).

But there, for example, I only find `mgmz` as part of the following entry:

> mgmzs 生态系统

And for `sgjk`, I find

> sgjkn 新石器时代\
> sgjk 𠝒\
> sgjk 𠝒

So I don't really know what's going on there.  This time the strings look right, but the codes have an extra letter... making them **5 letters long!**  I thought the ZhengMa encoding tried to keep every code to 4 letters...

Moreover, if we look at `av` in this file, we find

> ^av 一

... but in the previous file, we find

> "av"="切"

So it seems that these don't agree, even on simple glyphs.

### 1.4 IBus Zheng Ma Data

[This StackExchange thread](https://chinese.stackexchange.com/questions/43465/incomplete-list-of-free-chinese-input-methods-in-current-use) serves as a useful resource.  It lists a number of Chinese input methods (including both 4-corner and Zheng Ma), and it points to websites that have more information.

In particular, for the Zheng Ma encoding, it points to [this website](http://www.zmfans.cn/bbs) and [this GitHub repo](https://github.com/acevery/ibus-table-zhengma) related to the [IBus input method](https://code.google.com/archive/p/ibus/) project.  The latter contains [this file](https://github.com/acevery/ibus-table-zhengma/blob/master/tables/zhengma.txt) called `zhengma.txt` which has another data store of the Zheng Ma codes and their corresponding characters.

### 1.5 RIME Zheng Ma Data

The [RIME input system](https://rime.im/) for writing Chinese characters includes the file `zhengma.dict.yaml`, located [here](https://github.com/Openvingen/rime-zhengma/blob/master/zhengma.dict.yaml), as part of the [Zheng Ma extension](https://github.com/Openvingen/rime-zhengma).

This file seems to share some of the same codes as `zhengma.txt` and `zhengma-large.txt` above, looking at a few simple codes, like `a`, `aa`, etc.  But we find some disagreement with the Windows file `TableTextServiceSimplifiedZhengMa.txt`, even just looking at the character represented by the code `a`.

Moreover, the first handful of lines shows a number of instances where the same code corresponds to different character strings, undercutting the idea that Zheng Ma codes are (nearly) unique:

```yaml
a	一
a	下
a	平
aa	一下
aa	一天
aaac	一无可取
aaag	无可无不可
aaal	百无一用
aaam	万无一失
aaam	天下无敌
aaar	可丁可卯
aaav	可歌可泣
aaaw	天下一家
aaax	天下无双
aaax	天下无难事
aabk	天无二日
```

Of course, the Zheng Ma encoding isn't *strictly* unique.  This really amounts to a question of how frequent such instances are in the rest of the file.  In addition, it's a question of whether this occurs only with strings of multiple Chinese characters, or with individual characters as well.
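
Both questions can be answered by grouping texts by code and counting.  Here's a minimal sketch run on some of the RIME lines listed above; `collision_summary` is a hypothetical helper, and on the real file the pairs would come from parsing `zhengma.dict.yaml`.

```python
from collections import defaultdict

# Some of the (code, text) pairs from zhengma.dict.yaml, as listed above.
rime_sample = [
    ("a", "一"), ("a", "下"), ("a", "平"),
    ("aa", "一下"), ("aa", "一天"),
    ("aaac", "一无可取"),
    ("aaam", "万无一失"), ("aaam", "天下无敌"),
    ("aaax", "天下无双"), ("aaax", "天下无难事"),
]


def collision_summary(pairs):
    """Group texts by code and report which codes are shared."""
    by_code = defaultdict(list)
    for code, text in pairs:
        by_code[code].append(text)

    # Codes assigned to more than one text
    shared = {c: t for c, t in by_code.items() if len(t) > 1}

    # Shared codes where at least two candidates are single characters
    single_char = {
        c: [x for x in t if len(x) == 1]
        for c, t in shared.items()
        if sum(len(x) == 1 for x in t) > 1
    }
    return by_code, shared, single_char


by_code, shared, single_char = collision_summary(rime_sample)
print(len(by_code), "codes,", len(shared), "shared")
print("shared among single characters:", single_char)
```

Even in this small sample the code `a` is shared by three single characters, so collisions are not limited to multi-character strings.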

### 1.6 IBM Data?

Does IBM have a separate source file for this?  That's what [this page](https://www.ibm.com/docs/en/aix/7.2?topic=methods-simplified-chinese-input-method-zim-ucs) seems to suggest, which seems to refer to AIX 7.2 (whatever that is).  They say the following:

> ZIM-UCS features the following characteristics:
> 
> - The following commonly used input methods exist:
>     - **Intelligent ABC**
>         - An input method based on the phonetic representation of Chinese characters.
>     - **Pin Yin Input Method**
>         - An input method based on the phonetic representation of Chinese characters. A Chinese character is divided into one or several phonemes according to its pronunciation. 
>     - **Wu Bi (Five Strike) Input Method**
>         - An input method based on the grapheme representation of Chinese characters. According to the WuBi grapheme input method, Chinese characters are classified into three levels: stroke, radical and single-character.
>     - **Zheng Ma**
>         - An input method based on the grapheme representation of Chinese word. 
>     - **Biao Xing Ma Input Method**
>         - An input method in which a Chinese character is divided into several components,or radicals. When coding a character, these radicals are presented with the corresponding English letters.
>     - **Internal Code Input Method**
>         - An input method in accordance with the code table defined in GB18030 (Chinese Internal Code Specification) and UCS-2 (Unicode System Version 2).
> 
> - Half-width and full-width character input. Supports ASCII characters in both single-byte and multibyte modes.
> - Auxiliary window to support all the candidate lists. For example, Intelligent ABC generate a list of possible characters that contain the same sound symbols (*radicals*). Users select the desired characters by pressing the conversion key.
> - Over-the-spot pre-editing drawing area. Allows entry of radicals in reverse video area that temporarily covers the text line. The complete character is sent to the editor by pressing the conversion key.
> 
> The UCS-ZIM files are in the **/usr/lib/nls/loc** directory.
> 
> The UCS-ZIM keymap is in the **/usr/lib/nls/loc/ZH_CN.UTF-8.imkeymap** directory.

Now I guess I have to decipher that...  The home page for the documentation, at least, seems to be [here](https://www.ibm.com/docs/en/aix/7.2).  Evidently AIX is IBM's proprietary version of UNIX, according to [this Wikipedia article](https://en.wikipedia.org/wiki/IBM_AIX).  Interestingly, AIX first appeared in 1986, while Linux only appeared in 1991.

## 2 Gathering Data

Let's see if we can try to get some of this data into memory.

In [1]:
# If running in Google Colab
#from google.colab import drive
#drive.mount('/content/gdrive')
#
#path_prefix = "/content/gdrive/My Drive/Colab Notebooks/zhengma/raw/"

In [2]:
# If running on local system
path_prefix = "../raw/"

In [3]:
import pandas as pd
import re

In [4]:
def read_zm(file, file_encoding='utf-16', head=5, re_pattern=r"^\"(\w+)\"=\"(\w+)\"", verbose=False, find_character=None):
    # Read the Zheng Ma data files
    # Create a dictionary database, where 
    #   keys are the ZM codes, and
    #   values are the corresponding CJK characters
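    # Input:
    #   file: filename
    #   file_encoding: what flavor of UTF encoding, or other?
    #   head: show how many (key, value) pairs?
    #   re_pattern: give a raw-string with the regex for reading codes & characters from the file
    #   verbose: if True, report skipped blank lines and lines with no regex match
    #   find_character: optional CJK character; report each entry containing it as it's read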
    # Output:
    #   zm_codes: dictionary of (code, character) pairs
    #   print:
    #     how many lines read
    #     how many lines with characters
    #     head-number of (key, value) pairs in zm_codes
    
    data_pattern = re.compile(re_pattern)
    
    zm_codes = {}
    line_count = 0
    cjk_count = 0
    head_count = 0
    
    with open(file, encoding=file_encoding) as fi:
        for line in fi:
            row = line.strip()

            if not row:
                # Optionally let me know if we've skipped a line
                if verbose:
                    print('Skipping line: {}'.format(row))
                continue
            
            line_count += 1
            
            m = data_pattern.match(line)
            if m:
                zm_code, cjk_char = m.group(1), m.group(2)

                # It turns out that some ZM codes are used
                # for more than one CJK character string.
                # So we need to make sure not to overwrite earlier characters
                # by making the new ZM code string unique.
                # (Example: the code yi in the RIME database)
                # So append '-' and then add a number suffix
                # ... but make sure that new code isn't already there...
                while zm_code in zm_codes.keys():
                    if '-' not in zm_code:
                        zm_code += '-'
                    
                    base_code, n_suffix = zm_code.split('-')
                    
                    # Take the numerical suffix and add 1
                    # Right after the bare '-' is appended, n_suffix is '' and int('') would raise ValueError
                    zm_code = base_code + '-' + str(int(0 if n_suffix in (None, '') else n_suffix) + 1)

                    # Next loop... see if this incremented code is itself already in the keys
                    # If not, done.  If it is, increment again.
                
                # We should now have a zm_code not in the keys
                zm_codes[zm_code] = cjk_char

                cjk_count += 1

                # In case we want to make sure that we read in
                # a certain CJK character from the database
                if find_character:
                    if str(find_character) in cjk_char:
                        print('Found one instance of {}: \nLine: {:>}\t{:>}\t{:>}'.format(str(find_character), cjk_count, zm_code, cjk_char))
            else:
                # Optionally let me know if there was no regex match
                if verbose:
                    print('No pattern match in line {:>}: {}'.format(line_count, row))
                continue

   
    print('\nTotal lines read:  {:>10}'.format(line_count))
    print('Total codes found: {:>10}\n'.format(cjk_count))

    if head > 0:
        # If you want to see some of the ZM codes and CJK characters read in
        print('Some of the initial codes:\n')
        for code_idx, cjk_string in zm_codes.items():
            if head_count < head:
                print('{}:\t{}'.format(code_idx, cjk_string))
                head_count += 1
            else:
                break

    return zm_codes

### 2.1 Windows Data

In [5]:
windows_filename = "TableTextServiceSimplifiedZhengMa.txt"
zm_data_file_windows = path_prefix + windows_filename

In [6]:
zm_codes_windows = read_zm(zm_data_file_windows)


Total lines read:       59587
Total codes found:      59506

Some of the initial codes:

a:	工
aa:	式
aaa:	工
aaaa:	工
aaaa-1:	恭恭敬敬


For the explanation of using the `list` of dictionary `items()` to get the dictionary keys as row elements, see [this reference](https://www.stackvidhya.com/convert-dictionary-to-pandas-dataframe-python/).

In [7]:
df_zm_windows = pd.DataFrame(list(zm_codes_windows.items()), columns=['ZM Codes', 'MS Characters'])

In [8]:
df_zm_windows.head()

Unnamed: 0,ZM Codes,MS Characters
0,a,工
1,aa,式
2,aaa,工
3,aaaa,工
4,aaaa-1,恭恭敬敬


### 2.2 `fcitx` Data

In [9]:
fcitx_filename = "zhengma-large.txt"
zm_data_file_fcitx = path_prefix + fcitx_filename

In [10]:
zm_codes_fcitx = read_zm(zm_data_file_fcitx, re_pattern=r"^(\^\w+|\w+)\s(\S+)", file_encoding='utf-8', find_character='黿')
#zm_codes_fcitx = read_zm(zm_data_file_fcitx, re_pattern=r"^(\^\w+|\w+)\s(\S+)", file_encoding='utf-8')

Found one instance of 黿: 
Line: 20733	^br-47	黿
Found one instance of 黿: 
Line: 43433	bdrw-1	黿

Total lines read:      151070
Total codes found:     151058

Some of the initial codes:

^av:	一
^ai:	丁
^az:	丂
^hd:	七
^ia:	丄


If we look closer at the data, there's a bunch of codes preceded by a `'^'`.  This group appears to be a little strange.  For example, there seem to be upwards of 73 appearances of the code '^br', one of them corresponding to the character '黿'.


But after this group, the rest of the file seems to contain "normal" ZM codes without the preceding `'^'`.  Those start with `a` corresponding to `一`, which seems to be in accord with the other databases.

But there are several codes which seem only to correspond to a missing character:

```
bii 址
bii 𧈫
bii 𧈬
bij 坫
bij 𢀛
bik 墟
bil 盐
bio 𪤳
biq 𪣯
bix 五
```

It's not clear if those are supposed to be blanks or not.  The file's encoding is UTF-8, so I'm not sure if that's an error brought about by "clipping" the underlying bytes of the characters.
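
One way to tell real-but-unrendered characters from damaged bytes is to look at the code points.  My guess (an assumption, not something verified against the whole file) is that the "missing" entries are characters outside the Basic Multilingual Plane, in the CJK extension blocks, which many fonts simply don't cover.  The sketch below uses only the standard library; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(ch):
    """Return (code point, plane label, Unicode name) for one character."""
    cp = ord(ch)
    plane = "supplementary" if cp > 0xFFFF else "BMP"
    # unicodedata.name raises ValueError for unnamed code points
    # unless a default is supplied.
    name = unicodedata.name(ch, "<no name in this Unicode database>")
    return "U+{:04X}".format(cp), plane, name


# Two ordinary characters and two of the "missing" ones from the list above
for ch in "址𧈫坫𢀛":
    print(ch, *describe(ch))
```

Note also that `open(..., encoding='utf-8')` uses strict error handling by default, so truncated byte sequences would have raised a `UnicodeDecodeError` inside `read_zm`.  Since the read succeeded, clipped bytes seem an unlikely explanation.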

In [11]:
df_zm_fcitx = pd.DataFrame(list(zm_codes_fcitx.items()), columns=['ZM Codes', 'fcitx Characters'])

In [12]:
df_zm_fcitx.head()

Unnamed: 0,ZM Codes,fcitx Characters
0,^av,一
1,^ai,丁
2,^az,丂
3,^hd,七
4,^ia,丄


In [13]:
df_zm_fcitx[df_zm_fcitx['ZM Codes'].isin(['bdrw', 'bdrw-', 'bdrw--'])]

Unnamed: 0,ZM Codes,fcitx Characters
43431,bdrw,远


### 2.3 IBus Data

In [14]:
ibus_filename = "zhengma.txt"
zm_data_file_ibus = path_prefix + ibus_filename

In [15]:
zm_codes_ibus = read_zm(zm_data_file_ibus, re_pattern=r"^(\w+)\s(\w+)\s\d+", file_encoding='utf-8')


Total lines read:      151125
Total codes found:     123383

Some of the initial codes:

a:	一
b:	地
c:	现
c-1:	現
d:	的


In [16]:
df_zm_ibus = pd.DataFrame(list(zm_codes_ibus.items()), columns=['ZM Codes', 'IBus Characters'])

In [17]:
df_zm_ibus.head()

Unnamed: 0,ZM Codes,IBus Characters
0,a,一
1,b,地
2,c,现
3,c-1,現
4,d,的


### 2.4 RIME Data

In [18]:
rime_filename = "zhengma.dict.yaml"
zm_data_file_rime = path_prefix + rime_filename

In [19]:
# If you need to hunt down a specific character as it's being read in,
# use the following line:
# zm_codes_rime = read_zm(zm_data_file_rime, re_pattern=r"^([a-z]+)\s+(\w+)", file_encoding='utf-8', find_character='也')
#
# Otherwise, if you want a less verbose read-in, use this line:
zm_codes_rime = read_zm(zm_data_file_rime, re_pattern=r"^([a-z]+)\s+(\w+)", file_encoding='utf-8')


Total lines read:       81485
Total codes found:      81463

Some of the initial codes:

a:	一
a-1:	下
a-2:	平
aa:	一下
aa-1:	一天


In [20]:
df_zm_rime = pd.DataFrame(list(zm_codes_rime.items()), columns=['ZM Codes', 'RIME Characters'])

In [21]:
df_zm_rime.head()

Unnamed: 0,ZM Codes,RIME Characters
0,a,一
1,a-1,下
2,a-2,平
3,aa,一下
4,aa-1,一天


### 2.5 Merging Data

Now let's try to line up all this data.  For each Zheng Ma code, let's see what each database has.  [Here](https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes)'s a handy reference for merging several `pandas` DataFrames at once.

In [22]:
import functools as ft

#### 2.5.1 Getting the Data into One Place

Let's first try to get all the different databases into a single DataFrame.

In [23]:
dfs = [df_zm_windows, df_zm_fcitx, df_zm_ibus, df_zm_rime]

In [24]:
df_zm_merged = ft.reduce(lambda  left,right: pd.merge(left,right,on=['ZM Codes'], how='outer'), dfs).fillna('')

In [25]:
df_zm_merged.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
0,a,工,一,一,一
1,aa,式,一下,一下,一下
2,aaa,工,,,
3,aaaa,工,,,
4,aaaa-1,恭恭敬敬,,,
5,aaab,工作,,,
6,aaad,工期,,,
7,aaae,黄花菜,,,
8,aaah,葡萄牙,,,
9,aaal,花花世界,,,百无一用


In [26]:
df_zm_merged.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
238449,zzwv,,,,出逃
238450,zzww-1,,,,缝缝补补
238451,zzww-2,,,,结结实实
238452,zzxe-1,,,,出展
238453,zzxh,,,,出戏
238454,zzxs,,,,出尽
238455,zzxu-1,,,,红绿灯
238456,zzyf,,,,缝纫机
238457,zzyh-1,,,,乡民
238458,zzyu,,,,纨绔子弟


#### 2.5.2 Looking at Details of the Data

Let's take a moment to see what the data looks like in detail, finding where things are similar and where they diverge.  Let's start with the shortest multi-letter codes: those exactly two letters long.

For using `apply()` to gain access to properties of the `values` of the specified DataFrame column, see [this StackOverflow thread](https://stackoverflow.com/questions/19937362/filter-string-data-based-on-its-string-length).

In [27]:
df_zm_short_codes = df_zm_merged[df_zm_merged['ZM Codes'].apply(lambda x: len(str(x)) == 2)]

In [28]:
df_zm_short_codes.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
1,aa,式,一下,一下,一下
142,ab,节,一起,一起,一起
199,ac,芭,平静,平静,平静
229,ad,基,于,于,于
457,ae,菜,开,开,开
534,af,革,末,末,末
733,ag,七,无,无,无
842,ah,牙,形成,形成,形成
916,ai,东,丁,丁,丁
1063,aj,划,可,可,可


In [29]:
df_zm_short_codes.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
87158,zg,,出面,出面,出面
87159,zh,,纯,纯,纯
87160,zi,,如此,如此,如此
87161,zj,,如,如,如
87162,zk,,细,细,细
87163,zl,,组,组,组
87164,zm,,女,女,女
87165,zn,,以便,以便,她们
87166,zo,,以,以,以
87167,zp,,以后,以后,以后


Let's have a look at how many codes there are with `'-'` in them, i.e. how many codes map to more than one specific character.

In [30]:
df_zm_hyphenated = df_zm_merged[df_zm_merged['ZM Codes'].str.contains('-')]

In [31]:
df_zm_hyphenated.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
4,aaaa-1,恭恭敬敬,,,
28,aadn-1,慝,,,
31,aady-1,工矿,,,
38,aaff-1,蓬蓬勃勃,,,
45,aagk-1,工事,,,
55,aahw-1,工龄,,,
59,aaig-1,工艺水平,,,
107,aatk-1,工种,,,
131,aaww-1,世世代代,,,开开心心
132,aaww-2,草菅人命,,,严严实实


In [32]:
df_zm_hyphenated.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
238427,zzuu-1,,,,熊熊燃烧
238435,zzvs-1,,,,始发站
238436,zzvw-2,,,,出演
238438,zzwa-1,,,,女娲补天
238439,zzwb-1,,,,娓娓道来
238442,zzwe-1,,,,出塞
238445,zzwj-1,,,,丝绸之路
238446,zzwk-1,,,,幽冥
238447,zzwr-1,,,,纠察
238450,zzww-1,,,,缝缝补补


In [33]:
df_zm_hyphenated.shape

(72600, 5)

Let's also try to collect those codes along with the other instances of the ones they're duplicating.

In [34]:
hyphenated_codes = df_zm_merged[df_zm_merged['ZM Codes'].str.contains('-')]['ZM Codes'].tolist()
dehyphenated_codes = [x.split('-')[0] for x in hyphenated_codes]
hyphen_adjacent_codes = hyphenated_codes + dehyphenated_codes

df_zm_hyphen_adjacent = df_zm_merged[df_zm_merged['ZM Codes'].isin(hyphen_adjacent_codes)]

In [35]:
df_zm_hyphen_adjacent.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
0,a,工,一,一,一
1,aa,式,一下,一下,一下
3,aaaa,工,,,
4,aaaa-1,恭恭敬敬,,,
27,aadn,葚,,,
28,aadn-1,慝,,,
30,aady,落落大方,,,
31,aady-1,工矿,,,
37,aaff,苷,,,
38,aaff-1,蓬蓬勃勃,,,


As we can see with `aadn` and `aadn-1`, representing 葚 and 慝 respectively, it is **not the case that repeated codes always correspond to multiple-character strings**.

In [36]:
df_zm_hyphen_adjacent.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
238438,zzwa-1,,,,女娲补天
238439,zzwb-1,,,,娓娓道来
238441,zzwe,,,,出赛
238442,zzwe-1,,,,出塞
238444,zzwj,,,,出宫
238445,zzwj-1,,,,丝绸之路
238446,zzwk-1,,,,幽冥
238447,zzwr-1,,,,纠察
238450,zzww-1,,,,缝缝补补
238451,zzww-2,,,,结结实实


In [37]:
df_zm_hyphen_adjacent.shape

(103260, 5)

Let's see specifically where the columns differ among themselves.  Using the method of [this StackOverflow thread](https://stackoverflow.com/questions/22701799/pandas-dataframe-find-rows-where-all-columns-equal), we can try checking where each column is individually equal (or not) to the first column.

In [38]:
df_zm_different = df_zm_merged[~df_zm_merged.eq(df_zm_merged.iloc[:, 0], axis=0).all(1)]

In [39]:
df_zm_different.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
0,a,工,一,一,一
1,aa,式,一下,一下,一下
2,aaa,工,,,
3,aaaa,工,,,
4,aaaa-1,恭恭敬敬,,,
5,aaab,工作,,,
6,aaad,工期,,,
7,aaae,黄花菜,,,
8,aaah,葡萄牙,,,
9,aaal,花花世界,,,百无一用


In [40]:
df_zm_different.shape

(238469, 5)

In [41]:
df_zm_same = df_zm_merged[df_zm_merged.eq(df_zm_merged.iloc[:, 0], axis=0).all(1)]

In [42]:
df_zm_same.shape

(0, 5)
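
A fairer version of this test would compare only the four character columns with one another, since `df_zm_merged.iloc[:, 0]` is the `ZM Codes` column and a character string never equals its own code.  The sketch below does that on a toy frame in the same layout; the `a` and `zm` rows come from the outputs above, while the `xx` row is made up so that a full match exists.  I haven't run this on the real data, so no counts are claimed.

```python
import pandas as pd

# Toy merged frame in the df_zm_merged layout ('xx' is a made-up row).
toy = pd.DataFrame(
    {
        "ZM Codes": ["a", "zm", "xx"],
        "MS Characters": ["工", "", "好"],
        "fcitx Characters": ["一", "女", "好"],
        "IBus Characters": ["一", "女", "好"],
        "RIME Characters": ["一", "女", "好"],
    }
)

char_cols = [c for c in toy.columns if c != "ZM Codes"]
chars = toy[char_cols]

# True where every character column equals the first character column
all_same = chars.eq(chars.iloc[:, 0], axis=0).all(axis=1)


def agree_where_present(row):
    """True if at least two databases have an entry and all entries match."""
    present = [v for v in row if v != ""]
    return len(present) > 1 and len(set(present)) == 1


# Variant: ignore databases with no entry ('' after fillna)
same_where_present = chars.apply(agree_where_present, axis=1)

print(toy[all_same]["ZM Codes"].tolist())
print(toy[same_where_present]["ZM Codes"].tolist())
```

The second variant is probably the more useful one here, because the Windows table is much smaller than the others and leaves many cells empty.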

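Hmmm... that's a bummer.  At face value, there are *no* codes for which *all* the databases have the same CJK characters.  (Strictly speaking, column 0 is `ZM Codes` itself, so this test compares the characters against their own codes and could never succeed; agreement among the databases remains untested here.)  What about the specific code examples we had in the lessons mentioned above?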

| Phrases | Phrase code | Character normal codes | Character short codes |
| :-- | :-- | :-- | :-- |
| 生态系统 | mgmz | mc+gdsw+mzvv+zszr | mc+gsw+mzv+zs |
| 高等教育 | smbs | sjld+mbds+bmym+szq | sjl+ms+bmm+szq |
| 新石器时代 | sgjk | sufp+ga+jjjj+kds+nhs | sf+ga+jjg+kd+nh |
| 合成洗涤剂 | ohvv | odaj+hmy+vmrd+vrf+sonk | oaj+h+vmr+vrf+snk |
| 中华人民共和国 | jnoy | jivv+nred+od+yybh+eao+mfj+jdcs | |
| 全国工商业联合会 | ojbs | odc+jdcs+bi+suld+ku+ceug+odaj+odbz | |
| 中国有色金属工业总公司 | jjgr | jivv+jdcs+gdq+ryia+pa+xmil+bi+ku+udjw+ozs+yaj | |


In [43]:
tutorial_examples = [
    ['mgmz', '生态系统'],
    ['smbs', '高等教育'],
    ['sgjk', '新石器时代'],
    ['ohvv', '合成洗涤剂'],
    ['jnoy', '中华人民共和国'],
    ['ojbs', '全国工商业联合会'],
    ['jjgr', '中国有色金属工业总公司'],
]
tutorial_codes = [x[0] for x in tutorial_examples]
df_zm_examples = df_zm_merged[df_zm_merged['ZM Codes'].isin(tutorial_codes)]
df_zm_examples.shape

(4, 5)

In [44]:
df_zm_examples.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
21937,jnoy,电炉,𡄆,𡄆,中华人民共和国
40372,sgjk,相互影响,𠝒,𠝒,新石器时代
217649,mgmz,,,,生态系统
226396,smbs,,,,高等教育


### 2.6 Important Conclusions

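So, under the exact 4-letter codes, only the RIME database reproduces any of the tutorial's multiple-character examples (four of the seven: `jnoy`, `sgjk`, `mgmz`, and `smbs`); **the Windows, `fcitx`, and IBus databases do not recapitulate them.**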

Moreover, as we saw above, `aadn` and `aadn-1` represent 葚 and 慝, respectively, so that **repeated codes do _not_ always correspond to multiple-character strings**.

## 3 Test Implementation: Character by Character

So let's just get something that works, in the sense that it takes us from characters to codes and back again.  For pretty-printing dictionaries, cf. [this post](https://datagy.io/python-pretty-print-dictionary/).

In [45]:
import pprint

In [46]:
test_string1 = '三人行必有我師'
test_string2 = '性相近也习相远也'
test_columns = ['MS Characters', 'fcitx Characters', 'IBus Characters', 'RIME Characters']

In [47]:
def characters_to_codes(cjk_string, zm_dataframe, cjk_columns=('MS Characters',), zm_column='ZM Codes'):
    # Input: 
    #   cjk_string: string of CJK characters
    #   zm_dataframe: database of Zheng Ma codes as a pandas DataFrame
    #   cjk_columns: columns to check for characters
    #   zm_column: name of column containing Zheng Ma codes
    # Output: 
    #   nested dictionary {character: {column: Series of matching Zheng Ma codes}}

    characters = cjk_string.strip().replace(' ', '')

    codes = {}

    for character in characters:
        codes[character] = {}

        for column in cjk_columns:
            # A character can match several rows, so each entry is a Series, not a single code.
            # For the 'MS Characters' column, for example, CJK character '三' returns **3 rows**: dg, dgg, dggg
            codes[character][column] = zm_dataframe[zm_dataframe[column] == character][zm_column]
    
    return codes
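
Scanning the whole DataFrame once per character and per column gets slow for longer strings.  One alternative, sketched here under simplifying assumptions, is to build a reverse index (text to list of codes) once per column and then look characters up in it.  The toy frame is loosely based on rows from these notes (its blank cells are a simplification), and `build_reverse_index` is a hypothetical helper.

```python
from collections import defaultdict

import pandas as pd

# Toy frame in the df_zm_merged layout; blank cells are a simplification.
toy = pd.DataFrame(
    {
        "ZM Codes": ["dg", "dgg", "dggg", "cd", "^cd"],
        "MS Characters": ["三", "三", "三", "", ""],
        "fcitx Characters": ["扩大", "", "㩡", "三", "三"],
    }
)


def build_reverse_index(df, column, code_column="ZM Codes", skip_caret=True):
    """Map each entry in `column` to the list of codes that produce it."""
    index = defaultdict(list)
    for code, text in zip(df[code_column], df[column]):
        if not text:
            continue  # this database has no entry for this code
        if skip_caret and code.startswith("^"):
            continue  # leave out the '^'-prefixed group
        index[text].append(code)
    return index


ms_index = build_reverse_index(toy, "MS Characters")
fcitx_index = build_reverse_index(toy, "fcitx Characters")

print([ms_index.get(ch, []) for ch in "三人"])
print(fcitx_index.get("三", []))
```

Using `.get(ch, [])` keeps a missing character as an empty list instead of silently adding a key to the `defaultdict`.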

### 3.1 Finding Multiple Codes for Multiple Characters

In [48]:
df_zm_quicklook = df_zm_merged[df_zm_merged[test_columns[0]] == test_string1[0]]
df_zm_quicklook.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
5919,dg,三,扩大,扩大,扩大
5974,dgg,三,,,
5980,dggg,三,㩡,㩡,


This is a little surprising.  Within the Microsoft file for the Zheng Ma encoding, the character `'三'` has three different possible ZM codes: `'dg'`, `'dgg'`, or `'dggg'`.  I don't quite get why that's the case.

If you run the same little test on the columns containing the `fcitx`, IBus, or RIME versions, you don't get the multiple codes... *for this character*.

But for the character `'師'`, you **don't get any match at all** in the Microsoft or RIME databases.  Note that the [entry in the Arch Chinese Dictionary](https://www.archchinese.com/chinese_english_dictionary.html?find=%E5%B8%AB) also lacks a ZM code for this character.  It does, however, correspond to the code `'myal'` in the `fcitx` and IBus databases.

In [49]:
test_codes_output1 = characters_to_codes(test_string1, df_zm_merged, cjk_columns=test_columns)

# Pretty-Print the resulting dictionary
pprint.pprint(test_codes_output1)

{'三': {'IBus Characters': 4290    cd
Name: ZM Codes, dtype: object,
       'MS Characters': 5919      dg
5974     dgg
5980    dggg
Name: ZM Codes, dtype: object,
       'RIME Characters': 4290    cd
Name: ZM Codes, dtype: object,
       'fcitx Characters': 4290      cd
59515    ^cd
Name: ZM Codes, dtype: object},
 '人': {'IBus Characters': 30262    od
Name: ZM Codes, dtype: object,
       'MS Characters': 50562         w
53811    wwww-3
Name: ZM Codes, dtype: object,
       'RIME Characters': 30262    od
Name: ZM Codes, dtype: object,
       'fcitx Characters': 30262     od
59692    ^od
Name: ZM Codes, dtype: object},
 '師': {'IBus Characters': 142532    myal
Name: ZM Codes, dtype: object,
       'MS Characters': Series([], Name: ZM Codes, dtype: object),
       'RIME Characters': Series([], Name: ZM Codes, dtype: object),
       'fcitx Characters': 63645     ^my-6
142532     myal
Name: ZM Codes, dtype: object},
 '必': {'IBus Characters': 96348    wzm
Name: ZM Codes, dtype: object,
      

Actually, let's see if we can find where all those characters are contained across any of the columns.  For some of the techniques used, see [this post](https://kanoki.org/2022/02/04/pandas-search-a-string-in-dataframe-across-all-columns/) and [this StackOverflow thread](https://stackoverflow.com/questions/26640129/search-for-string-in-all-pandas-dataframe-columns-and-filter).

In [50]:
# This is a *very* time-intensive search, as
# it returns any row that merely *contains* the string,
# not only exact matches.
# So only uncomment it if you really need it.
#
# search_string1 = '|'.join([letter for letter in test_string1])
# search_string2 = test_string1[0]
# df_zm_scavenger = df_zm_merged[df_zm_merged.apply(lambda row: row.astype(str).str.contains(search_string2, case=False).any(), axis=1)]
# df_zm_scavenger.head(20)

In [51]:
# df_zm_scavenger.shape

I think [this post on `pandas` DataFrames and masking](https://www.shecancode.io/blog/filter-a-pandas-dataframe-by-a-partial-string-or-pattern-in-8-ways) is particularly helpful.

In [52]:
individual_characters = [cjk for cjk in test_string2]

mask_ms = df_zm_merged[test_columns[0]].isin(individual_characters)
mask_fcitx = df_zm_merged[test_columns[1]].isin(individual_characters)
mask_ibus = df_zm_merged[test_columns[2]].isin(individual_characters)
mask_rime = df_zm_merged[test_columns[3]].isin(individual_characters)

df_zm_find_string = df_zm_merged[mask_ms | mask_fcitx | mask_ibus | mask_rime]

df_zm_find_string.head(100)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
3426,bn,也,增值,增值,增值
3445,bnhn,也,,,
10975,fl,协,相,相,相
11642,fqp,远,,,
11647,fqpv,远,,,
29447,ntg,性,儣,儣,
29457,ntgg,性,,,
29643,nu,习,伪,伪,伪
29659,nud-1,习,,,
37973,rp,近,然后,然后,然后


We want to capture, in addition, any codes that

- match the codes in `df_zm_find_string`, but also
- possibly contain `'-'` followed by some number at the end of the letter string.

That will help us understand if ZM assigns those characters *unique* codes or not.

But for simplicity, we'll omit codes with a leading `'^'`, since that seems to be a project-specific database modification.

In [53]:
simple_codes = df_zm_find_string['ZM Codes'].tolist()

# Look for codes with the same base, but a hyphen and numerical suffix
# ... and for codes with the same base, but no hyphen (if the original has a hyphen)
augmented_codes = []

for x in simple_codes:
    if '^' in x:
        continue
    elif '-' in x:
        base_code, n_suffix = x.split('-')
    else:
        base_code = x

    # Make sure the new list has the basic ZM code
    if base_code not in augmented_codes:
        augmented_codes.append(base_code)

    # then add codes with that base and possible numerical suffixes
    for n in range(100):
        possible_code = base_code + '-' + str(n)

        if possible_code not in augmented_codes:
            augmented_codes.append(possible_code)

#print(augmented_codes)

In principle, we should check for more than 100 possible numbers after the '-', since (in the case of '^br') we can easily find 70 or more characters assigned to a given code.  But these "large" variants seem confined to codes preceded by '^', and the collection of '^' codes seems already to duplicate items also assigned to other "regular" (not '^'-initial) codes.  So for practical purposes, we'll consider only the "augmented codes" above.
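
If the cap of 100 ever becomes a worry, one option is to strip the numeric suffix from every code in the DataFrame and compare base codes directly, which handles any suffix size.  This is a sketch on a toy frame, not a drop-in replacement: the `nud-123`, `^nu`, and `nux` rows are invented for illustration, and `with_alternates` is a hypothetical helper.

```python
import pandas as pd

# Toy frame in the df_zm_merged layout; 'nud-123', '^nu' and 'nux' are made up.
toy = pd.DataFrame(
    {
        "ZM Codes": ["nu", "nud", "nud-1", "nud-123", "^nu", "nux"],
        "MS Characters": ["习", "买", "习", "", "", ""],
    }
)


def with_alternates(df, seed_codes, code_column="ZM Codes"):
    """Rows whose code, minus any '-<number>' suffix, is one of the seed bases."""
    # Reduce the seeds to their bases, skipping '^'-prefixed codes
    bases = {
        code.split("-")[0] for code in seed_codes if not code.startswith("^")
    }
    # Strip a trailing '-<digits>' from every code in the frame
    stripped = df[code_column].str.replace(r"-\d+$", "", regex=True)
    return df[stripped.isin(bases)]


print(with_alternates(toy, ["nu", "nud-1"])["ZM Codes"].tolist())
```

Because the comparison is on exact base codes, `nux` is not swept in just for starting with `nu`, and `^nu` stays out as well.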

In [54]:
mask_alt_codes = df_zm_merged['ZM Codes'].isin(augmented_codes)
df_zm_find_string_alternates = df_zm_merged[mask_alt_codes | mask_ms | mask_fcitx | mask_ibus | mask_rime]

df_zm_find_string_alternates.head(100)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
3426,bn,也,增值,增值,增值
3445,bnhn,也,,,
10975,fl,协,相,相,相
11642,fqp,远,,,
11647,fqpv,远,,,
29447,ntg,性,儣,儣,
29457,ntgg,性,,,
29643,nu,习,伪,伪,伪
29658,nud,买,,,
29659,nud-1,习,,,


### 3.2 Comments on the Process thus Far

I've started taking a closer look at the Microsoft, `fcitx`, IBus, and RIME databases for the characters in our proposed quote:

> 性相近也习相远也

First, for reference, here's a table of the codes obtained through the [stl56 website](http://www.stl56.com/zhengma/).

| Code | Character | Number |
| :-- | --: | --: |
| umc | 性 | 0 |
| flvv | 相 | 1 |
| pdw | 近 | 2 |
| yi | 也 | 3 |
| yt | 习 | 4 |
| flvv | 相 | 5 |
| bdrw | 远 | 6 |
| yi | 也 | 7 |


#### 3.2.1 Some Initial Issues


The table below, from our databases, collects all rows where *any* column contains a single character from the target string above.  It also includes any rows for *other* characters that have the *same* ZM code as one of the characters in our quote.  That is, it looks not just at what the codes are for the characters we want, but also at whether those same codes give us unwanted characters too.

The table below comes from the implementation *before* I gave each new instance of a duplicated code a numerical suffix.  I'm just keeping this around for historical reasons.  At that point the routine to create new, suffixed codes looked like this:

```python
            m = data_pattern.match(line)
            if m:
                zm_code, cjk_char = m.group(1), m.group(2)

                # It turns out that some ZM codes are used
                # for more than one CJK character string.
                # So we need to make sure not to overwrite earlier characters
                # by making the new ZM code string unique.
                # (Example: the code yi in the RIME database)
                if zm_code in zm_codes.keys():
                    zm_code += '-'
                zm_codes[zm_code] = cjk_char

                cjk_count += 1
```

With that code, if, say, `'yi'` already existed, it would create a new key `'yi-'` and assign a character to that.  But on the *next* pass, it would be looking for `'yi'` again, find it, add `'-'` to it, and then potentially overwrite the `'yi-'` that it had just created.  So we'd correctly recognize when one code was used for more than one character (we'd have `'yi'` and `'yi-'` in the keys), but we would not necessarily count correctly *how many* characters corresponded to that code.
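To see the overwrite concretely, here is a minimal stand-alone sketch of the old behaviour.  The helper name `add_code_old` is hypothetical; the three entries sharing `'bdrw'` are the ones that appear in the `fcitx` column of the tables in this section.

```python
def add_code_old(zm_codes, zm_code, cjk_char):
    """Old behaviour: append a bare '-' when the code is already taken."""
    if zm_code in zm_codes:
        zm_code += '-'
    zm_codes[zm_code] = cjk_char

# Three characters sharing the code 'bdrw'
old_codes = {}
for code, char in [('bdrw', '远'), ('bdrw', '黿'), ('bdrw', '𪓣')]:
    add_code_old(old_codes, code, char)

print(old_codes)   # two keys survive; '黿' has been overwritten
```

Only `'bdrw'` and `'bdrw-'` remain, so the duplication is detected but the count is one short.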

Since then, I've updated the routine to contain the following code:

```python
            m = data_pattern.match(line)
            if m:
                zm_code, cjk_char = m.group(1), m.group(2)

                # It turns out that some ZM codes are used
                # for more than one CJK character string.
                # So we need to make sure not to overwrite earlier characters
                # by making the new ZM code string unique.
                # (Example: the code yi in the RIME database)
                # So append '-' and then add a number suffix
                # ... but make sure that new code isn't already there...
                while zm_code in zm_codes.keys():
                    if '-' not in zm_code:
                        zm_code += '-'
                    
                    base_code, n_suffix = zm_code.split('-')
                    
                    # Take the numerical suffix and add 1
                    # Right after '-' is appended, n_suffix is '' and int('') would fail, so treat that as 0
                    zm_code = base_code + '-' + str(int(0 if n_suffix in (None, '') else n_suffix) + 1)

                    # Next loop... see if this incremented code is itself already in the keys
                    # If not, done.  If it is, increment again.
                
                # We should now have a zm_code not in the keys
                zm_codes[zm_code] = cjk_char

                cjk_count += 1
```

The idea here is to give a unique numerical suffix to each code, and check if the code-with-number combination is already in the keys.  Keep going until you get a number that isn't there, then assign the character to that unrepresented key.
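The same idea can be sketched as a small stand-alone helper.  This is a restructured illustration, not the notebook's exact loop: `unique_code` is a hypothetical name, and it assumes base codes never contain `'-'` themselves.

```python
def unique_code(zm_codes, zm_code):
    """Return zm_code itself, or zm_code-N with the smallest unused N >= 1."""
    if zm_code not in zm_codes:
        return zm_code
    n = 1
    while '{}-{}'.format(zm_code, n) in zm_codes:
        n += 1
    return '{}-{}'.format(zm_code, n)

# Same three entries that tripped up the earlier routine
new_codes = {}
for code, char in [('bdrw', '远'), ('bdrw', '黿'), ('bdrw', '𪓣')]:
    new_codes[unique_code(new_codes, code)] = char

print(new_codes)
```

This time all three characters survive, under `'bdrw'`, `'bdrw-1'`, and `'bdrw-2'`.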


|   | ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	| 
| --------:	| -------------	| ----------------	| ---------------	| ---------------	| -   |
| 3392	| bn	| 也 (3, 7)	| 增值	| 增值	| 增值	|  
| 3409	| bnhn	| 也 (3, 7)	|	|	|	|  
| 10830	| fl	| 协	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 11489	| fqp	| 远 (6)	|	|	|	|  
| 11494	| fqpv	| 远 (6)	|	|	|	|  
| 28938	| ntg	| 性 (0)	| 儣	| 儣	|	|  
| 28946	| ntgg	| 性 (0)	|	|	|	|  
| 29122	| nu	| 习 (4)	| 伪	| 伪	| 伪	|  
| 29137	| nud	| 买	|	|	|	|  
| 29138	| nud-	| 习 (4)	|	|	|	|  
| 37285	| rp	| 近 (2)	| 然后	| 然后	| 然后	|  
| 37341	| rpk	| 近 (2)	| 鱕	| 鱕	|	|  
| 39786	| sh	| 相 (1, 5)	| 亡	| 亡	| 亡	|  
| 39805	| shg	| 相 (1, 5)	|	|	|	|  
| 46676	| um	| 商	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 46683	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 55940	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 57279	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 57280	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 58297	| \^yi	|	| 也 (3, 7)	|	|	|  
| 58298	| \^yt	|	| 习 (4)	|	|	|  
| 58761	| \^yi-	|	| 㢭	|	|	|  
| 59135	| \^yt-	|	| 䧪	|	|	|  
| 59600	| wp	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 59986	| brw	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 61544	| flv	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 64402	| ntg-	|	| 𠆲	| 𠆲	|	|  
| 64795	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 67433	| wbr	|	| 冠	| 冠	| 远 (6)	|  
| 67434	| wbr-	|	| 远 (6)	| 远 (6)	|	|  
| 67640	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 71191	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 71192	| **bdrw-**	|	| 𪓣	| 𪓣	|	|  
| 86721	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 134229	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 193680	| **yi-**	|	|	|	| 那些   |

So we can **disregard the table above** for the purposes of calculation.  It's just there as a double-check on the argument going forward.



Below is the **updated table**, after accounting for overwritten codes with '-'.

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 3426	| bn	| 也 (3, 7)	| 增值	| 增值	| 增值	|  
| 3445	| bnhn	| 也 (3, 7)	|	|	|	|  
| 10975	| fl	| 协	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 11642	| fqp	| 远 (6)	|	|	|	|  
| 11647	| fqpv	| 远 (6)	|	|	|	|  
| 29447	| ntg	| 性 (0)	| 儣	| 儣	|	|  
| 29457	| ntgg	| 性 (0)	|	|	|	|  
| 29643	| nu	| 习 (4)	| 伪	| 伪	| 伪	|  
| 29658	| nud	| 买	|	|	|	|  
| 29659	| nud-1	| 习 (4)	|	|	|	|  
| 37973	| rp	| 近 (2)	| 然后	| 然后	| 然后	|  
| 38030	| rpk	| 近 (2)	| 鱕	| 鱕	|	|  
| 40567	| sh	| 相 (1, 5)	| 亡	| 亡	| 亡	|  
| 40586	| shg	| 相 (1, 5)	|	|	|	|  
| 47652	| um	| 商	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 58539	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 59601	| ^yi	|	| 也 (3, 7)	|	|	|  
| 59602	| ^yt	|	| 习 (4)	|	|	|  
| 64153	| ^um-6	|	| 性 (0)	|	|	|  
| 69993	| ^fl-48	|	| 相 (1, 5)	|	|	|  
| 76352	| ^wp-19	|	| 近 (2)	|	|	|  
| 76363	| ^wb-49	|	| 远 (6)	|	|	|  
| 87147	| wp	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 87572	| brw	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 89351	| flv	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 92545	| ntg-1	|	| 𠆲	| 𠆲	|	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 95992	| wbr	|	| 冠	| 冠	| 远 (6)	|  
| 95993	| wbr-1	|	| 远 (6)	| 远 (6)	|	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 100145	| **bdrw-1**	|	| 黿	| 黿	|	|  
| 100146	| **bdrw-2**	|	| 𪓣	| 𪓣	|	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


The table can be a little confusing to sift through, and I'm still trying to make sure I see all the details, but for now here are some points to note.

* Where you see a `'-'` followed by a number in a ZM code, that's something I had to insert.  There are codes where a single ZM code represents two or more different Chinese characters.  To make sure I didn't overwrite a previous correspondence while gathering the data, I added `'-'` and a number to any ZM code that was already in the database and had a Chinese character assigned.
	* In short, treat ZM codes with `'-'` followed by a number as if they didn't have them: e.g. `'yi-1'` and `'yi'` are the *same code*.
* I've **boldfaced** the specific ZM codes that Le and Qifan have already used in the quote.
* I've numbered the characters in our quote from 0 to 7, so that we could keep track of them in the data table.
	* If you focus on a specific column, you can look for 0, 1, 2, ..., 7 in order and verify that each database represented here does in fact contain all the characters.
* You can also see that **no database (column) _uniquely_ assigns one code to one character**.
	* In each database, each character appears at least twice (... except for 也 (3, 7) and 习 (4) in the RIME database).  That means at least two codes can represent the same character: e.g. `'pdw'` and `'wpd'` both represent 近 (2).
	* And you see many `'-'`s, which means that frequently the same code can represent *two* characters (or character strings): e.g. 
		* `'wbr'` can represent 远 (6) or 冠 in the `fcitx` and IBus databases;
		* `'yi'` can represent 也 (3, 7) or 那些 in the RIME database;
		* `'nud'` can represent 习 (4) or 买 in the Microsoft database.
* **Practical Upshot: we need a _heuristic_ to resolve the ambiguities**, regardless of the database we choose.
	* This gets us back to the situation with the 4-corner codes: there we resolved the ambiguities through considering frequency.  Here we might try something else.
	* We could **try using the longest code available for each character**, assuming that shorter codes are "shortcuts".
		* We will still run into trouble with the code `'bdrw'`, which represents 远 (6) but also 黿 and yet another character that, for some reason I haven't understood yet, doesn't render in the `fcitx` and IBus databases.
			* But we can see on the [stl56 website](http://www.stl56.com/zhengma/) that even there `'bdrw'` corresponds to two characters, if I'm understanding the output properly: 远 (6) and 黿.
			* We could try using the code `'wbrd'` (I'm not sure what our heuristic would be for choosing that code over the other, since they're the same length).  But the website doesn't render any characters for that code, if I'm understanding correctly.
	* We could **try using the shortest code available for each character**.
		* This seems to work for the Microsoft database, though it doesn't give the codes y'all got from the website.
		* This doesn't seem to work for the character 远 (6) in the `fcitx`, IBus, and RIME databases, since that could have code `'brw'` or `'wbr'`.
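The two candidate heuristics can be compared mechanically.  The sketch below is a rough check rather than the real lookup: the rows are hand-copied from the RIME column of the updated table, and `strip_suffix` and `candidates` are hypothetical helper names.

```python
import re

def strip_suffix(code):
    """Drop a trailing '-N' added during data gathering: 'yi-1' -> 'yi'."""
    return re.sub(r'-\d+$', '', code)

# (code, RIME entry) pairs hand-copied from the updated table above
rime_rows = [
    ('fl', '相'), ('um', '性'), ('umc', '性'), ('yi', '也'), ('yt', '习'),
    ('wp', '近'), ('brw', '远'), ('flv', '相'), ('pdw', '近'), ('wbr', '远'),
    ('wpd', '近'), ('bdrw', '远'), ('flvv', '相'), ('wbrd', '远'), ('yi-1', '那些'),
]

def candidates(character, rows, pick):
    """All codes for `character` whose length equals pick(lengths); pick is max or min.

    Assumes the character occurs in `rows` (pick() fails on an empty sequence).
    """
    codes = sorted({strip_suffix(code) for code, entry in rows if entry == character})
    target = pick(len(code) for code in codes)
    return [code for code in codes if len(code) == target]

for character in '性相近也习远':
    longest = candidates(character, rime_rows, max)
    shortest = candidates(character, rime_rows, min)
    print(character, longest, shortest)
```

Any character that prints more than one candidate in a column is one where that heuristic alone can't decide.  On these rows, "longest" ties for 近 and 远, and "shortest" ties for 远.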




#### 3.2.2 An Initial Heuristic

If we assume we're using the RIME database, then we could perhaps use the heuristic below:

* For *en*coding, choose the **longest available code** for a given character.
* For *de*coding, choose only **individual characters** for a given code.

If we restrict the table to rows where RIME has characters, we're left with the following.  The ZM codes that Le and Qifan have from the website are boldfaced.


**Old** table, so **disregard**:

|   | ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	| 
| --------:	| -------------	| ----------------	| ---------------	| ---------------	| -   |
| 10830	| fl	| 协	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 46676	| um	| 商	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 46683	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 55940	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 57279	| **yt**	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 57280	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 59600	| wp	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 59986	| brw	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 61544	| flv	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 64795	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 67433	| wbr	|	| 冠	| 冠	| 远 (6)	|  
| 67434	| wbr-	|	| 远 (6)	| 远 (6)	|	|  
| 67640	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 71191	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 86721	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 134229	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 193680	| **yi-**	|	|	|	| 那些   |


**New** table, so use this:

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 10975	| fl	| 协	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 47652	| um	| 商	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 58539	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 87147	| wp	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 87572	| brw	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 89351	| flv	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 95992	| wbr	|	| 冠	| 冠	| 远 (6)	|  
| 95993	| wbr-1	|	| 远 (6)	| 远 (6)	|	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 100145	| **bdrw-1**	|	| 黿	| 黿	|	|  
| 100146	| **bdrw-2**	|	| 𪓣	| 𪓣	|	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


So we encode with the longest code possible and we get this:


**Old** table, so **disregard**...

|   | ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	| 
| --------:	| -------------	| ----------------	| ---------------	| ---------------	| -   |
| 10830	| ~~fl~~	| 协	| 相 (1, 5)	| 相 (1, 5)	| ~~相 (1, 5)~~	|  
| 46676	| ~~um~~	| 商	| 性 (0)	| 性 (0)	| ~~性 (0)~~	|  
| 46683	| **umc**	| 疫	| 性 (0)	| 性 (0)	| **性 (0)**	|  
| 55940	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 57279	| **yt**	| 放	| 习 (4)	| 习 (4)	| **习 (4)**	|  
| 57280	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 59600	| ~~wp~~	|	| 近 (2)	| 近 (2)	| ~~近 (2)~~	|  
| 59986	| ~~brw~~	|	| 远 (6)	| 远 (6)	| ~~远 (6)~~	|  
| 61544	| ~~flv~~	|	| 相 (1, 5)	| 相 (1, 5)	| ~~相 (1, 5)~~	|  
| 64795	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 67433	| wbr	|	| 冠	| 冠	| 远 (6)	|  
| 67434	| wbr-	|	| 远 (6)	| 远 (6)	|	|  
| 67640	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 71191	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 86721	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| **相 (1, 5)**	|  
| 134229	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 193680	| **yi-**	|	|	|	| 那些   |


**New** table, so use this...

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 10975	| ~~fl~~	| 协	| 相 (1, 5)	| 相 (1, 5)	| ~~相 (1, 5)~~	|  
| 47652	| ~~um~~	| 商	| 性 (0)	| 性 (0)	| ~~性 (0)~~	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 58539	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 87147	| ~~wp~~	|	| 近 (2)	| 近 (2)	| ~~近 (2)~~	|  
| 87572	| ~~brw~~	|	| 远 (6)	| 远 (6)	| ~~远 (6)~~	|  
| 89351	| ~~flv~~	|	| 相 (1, 5)	| 相 (1, 5)	| ~~相 (1, 5)~~	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 95992	| ~~wbr~~	|	| 冠	| 冠	| ~~远 (6)~~	|  
| 95993	| ~~wbr-1~~	|	| 远 (6)	| 远 (6)	| ~~...~~	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 100145	| **bdrw-1**	|	| 黿	| 黿	|	|  
| 100146	| **bdrw-2**	|	| 𪓣	| 𪓣	|	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


Hmmm... so that tells us how to distinguish between the **boldfaced** *CJK characters* to get a unique code.  But that actually doesn't tell us how to distinguish between the characters labeled by question marks (?).  We need a heuristic to distinguish between `pdw` and `wpd` for 近 (2) (??), and between `bdrw` and `wbrd` for 远 (6) (???).



If we **assume** we've done that somehow, then that would get us to the following.


**Old** table, so **disregard**...

|   | ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	| 
| --------:	| -------------	| ----------------	| ---------------	| ---------------	| -   |
| 46683	| **umc**	| 疫	| 性 (0)	| 性 (0)	| **性 (0)**	|  
| 55940	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 57279	| **yt**	| 放	| 习 (4)	| 习 (4)	| **习 (4)**	|  
| 64795	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 71191	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 86721	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| **相 (1, 5)**	|  
| 193680	| **yi-**	|	|	|	| 那些   |


**New** table, so use this...

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


And there it would seem we'd just have to contend with the ambiguity of the code `yi` (and `yi-1`).  And our heuristic of throwing away any string of more than one CJK character would seem to do the trick.

#### 3.2.3 Another Potential Way Out

Eric, Le, and Qifan suggest the following possible resolution to the issues raised above:

> Since we make and assign our oligomers manually, it is easy to make the strand represent only "pdw" and "bdrw" and never "wpd" or "wbrd." Therefore, the input should be unique, and the program does not need to distinguish between two codes that represent the same Chinese character. 

Some thoughts:

* The ZM encoding *as a whole*... I think... *needs* to allow both `pdw` and `bdrw`... as well as `wpd` and `wbrd`.  They're all in the database for a reason.  Specifically, the codes derive from analyzing a given Chinese character into a collection of basic shapes, the so-called *primary* (and perhaps *secondary*) *roots*.
    * Sometimes the order of decomposition is ambiguous, and the rules I've encountered so far do not seem always to stipulate priority.  So some characters will be able to be decomposed in multiple ways.  To assist the typist, who at times does this decomposition on the fly, the ZM encoding tries to accept a variety of decomposition orders.
        * So in a sense, we can't "eliminate" or "omit" any particular codes for a given character.  Otherwise we won't be representing ZM.  ZM has this ambiguity built in.
            * You'll note that the difference between `bdrw` and `wbrd` can't be one of a "shortcut".  And it's no accident that one is a permutation of the other: that's the ambiguity of the decomposition order.  I think that, in part, this comes from ambiguity as to when to write the lines on the left and bottom of the character.
        * Perhaps this means my idea earlier of "using the longest code" or similar heuristics for *en*coding might be in error.  If we do that, we're removing by fiat some of the ambiguity that the authors of the ZM encoding built into the system.
    * Side note: this is **different** from the situation in 4-corner codes.  That system looks only at the shapes that appear in the 4 corners of a character.  Because visual elements recur in Chinese characters, the set of necessary shapes is small.  And the assignment of a code to a character is unambiguous, largely because the process of assessing shapes doesn't require further analysis or decomposition.  However, also due to the recurrence of shapes, *many* characters will get assigned to the same code; so reversing the process for decoding becomes quite ambiguous without additional context.
        * Evidently with the ZM method, we have ambiguity in *both* directions.  But it seems like, on an intuitive level, that ambiguity might be "bigger" in the *en*coding phase.
* But in our case, in a sense, **we are the typist**.  So we get to **choose the code we want to use to represent a character**.
    * So perhaps it's **OK after all** for us to say that, given 近 (2) we'll only write `pdw` and never `wpd`, and likewise given 远 (6) we'll only write `bdrw` and never `wbrd`.
    * Then we'd only have the ambiguity for *de*coding of `yi` corresponding to 也 or 那些, and we omit strings of more than one character.
* Well, maybe it's not so easy.
    * Perhaps there are **two approaches to rendering an encoding**.
        * **We're the typist.**
        * **We might be implementing an encoding as a whole.**
    * Sometimes these are the same thing, sometimes not.
        * It seems like in the 4-corner codes, being able to encode any given character is the same as implementing the 4-corner system as a whole.
        * In the ZM scenario, we could encode any given character without ever implementing the whole ZM encoding.
            * For example, we could always decide we're going to encode 性 as `umc` instead of `um`.
                * As a typist, we're free to always prefer `umc` to `um`.
                * But if we're trying to encode ZM as a whole, we need to have a map between input and storage that allows a typist to use **either `umc` or `um`**.
                    * Maybe the latter is what we can say we're doing (implicitly) in this project (in general), but in *this particular quote* we have to make the typist's choice of which code to use.
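The difference between the two approaches can be sketched with two small tables.  This is a toy illustration only: the names `accepted_codes`, `preferred_code`, `encode`, and `decode` are hypothetical, the entries are hand-copied from the RIME column above, and it assumes no code in the toy table maps to two characters.

```python
# Whole-encoding view: every code the database accepts for each character
accepted_codes = {
    '性': ['um', 'umc'],
    '近': ['wp', 'pdw', 'wpd'],
    '远': ['brw', 'wbr', 'bdrw', 'wbrd'],
}

# Typist's view: exactly one code per character, fixed in advance
preferred_code = {'性': 'umc', '近': 'pdw', '远': 'bdrw'}

def encode(text):
    """Always emit the single preferred code for each character."""
    return [preferred_code[ch] for ch in text]

# Decoding stays permissive: any accepted code maps back to its character
decode_table = {code: ch for ch, codes in accepted_codes.items() for code in codes}

def decode(codes):
    return ''.join(decode_table[code] for code in codes)

print(encode('性近远'))
print(decode(['um', 'wpd', 'wbrd']))
```

Encoding is then deterministic because we made the typist's choice, while decoding still honours every variant that ZM allows.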

Out of curiosity, how often does the RIME database specifically leave us in the 也 situation of a single code representing more than one CJK string?

In [55]:
df_zm_hyphenated_rime = df_zm_hyphenated[df_zm_hyphenated['RIME Characters'] != '']

In [56]:
df_zm_hyphenated_rime.shape

(9172, 5)

So, roughly 9000 times we have a hyphenated (i.e. repeated) code in our overall database for which the RIME column is not empty: this should generally mean that the duplication applies to RIME as well.

In how many of these rows with hyphenated codes is the RIME entry only a single character?

In [57]:
df_zm_hyphenated_rime_unique = df_zm_hyphenated_rime[df_zm_hyphenated_rime['RIME Characters'].apply(lambda x: len(str(x)) == 1)]

In [58]:
df_zm_hyphenated_rime_unique.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
13798,ggyy-1,一方,厄,厄,厄
25571,lkai-1,回荡,睼,睼,罡
25805,llyy-1,逻辑设计,岂,岂,屺
31423,pewy-1,家禽,铹,铹,铹
38447,rrrr-1,拉拉扯扯,比,比,毙
38448,rrrr-2,拖拖拉拉,毙,毙,比
50935,wdai-1,做东,寔,寔,宁
51053,wdtg-1,倚重,过头,过头,实
87086,c-1,,現,現,理
87087,o-1,,會,會,很


In [59]:
df_zm_hyphenated_rime_unique.shape

(926, 5)

OK, so this happens about 900 times.  Interesting.
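One caveat on reading that number: a suffixed row with a single-character entry only causes trouble for decoding if the *same base code* has at least one other single-character entry.  The sketch below shows one way that stricter count could be made.  The rows are toy data loosely modelled on the output above (the base `rrrr` entry is invented), and the variable names are hypothetical.

```python
from collections import defaultdict

# Toy (code, RIME entry) rows; suffixes follow the 'base-N' scheme used above
rows = [
    ('yi', '也'), ('yi-1', '那些'),
    ('rrrr', '拉拉扯扯'), ('rrrr-1', '毙'), ('rrrr-2', '比'),
    ('umc', '性'),
]

entries_by_code = defaultdict(list)
for code, entry in rows:
    base = code.split('-')[0]
    if entry:                      # skip empty cells
        entries_by_code[base].append(entry)

# Codes that still have two or more single-character candidates
ambiguous = {
    base: [e for e in entries if len(e) == 1]
    for base, entries in entries_by_code.items()
    if sum(len(e) == 1 for e in entries) > 1
}
print(ambiguous)
```

Here `yi` drops out, since discarding multi-character strings leaves only 也, while `rrrr` remains genuinely ambiguous.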

## 4 Converting between Characters & Codes

So let's try to get a working implementation of a routine that converts from Chinese characters to Zheng Ma codes, and then back again.

### 4.1 Characters to Codes

We'll start simple: give me a character string, and I'll give you the code for each character.  We've tried this before, just returning any row where the character is the one we want.  But there could be several codes for a given character.  So we need to decide how to get only one code.  We can try taking the longest code.

First, a little sanity-check using Python's `max()` function.

In [60]:
little_list = ['pwd', 'cd', 'pmdb', 'wpd', 'ab', 'bpmd']
longest = max(little_list, key=len)
longest

'pmdb'

In [61]:
longests = [c for c in little_list if len(c) == len(longest)]
sorted_longests = sorted(longests)
sorted_longests

['bpmd', 'pmdb']
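As an aside, the two steps (longest, then alphabetically first) can be folded into a single call by giving `min()` a tuple key.  This is just an alternative sketch on the same toy list; the name `best` is hypothetical.

```python
little_list = ['pwd', 'cd', 'pmdb', 'wpd', 'ab', 'bpmd']

# Sort key: longer codes first (negative length), then alphabetical order
best = min(little_list, key=lambda code: (-len(code), code))
print(best)   # bpmd
```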

In [62]:
def characters_to_codes_simplistic(cjk_string, zm_dataframe, db_column='RIME Characters', zm_column='ZM Codes'):
    # Input: 
    #   string of CJK characters
    #   database of Zheng Ma codes as a pandas DataFrame
    #   name of column to check for characters
    #   name of column containing Zheng Ma codes
    # Output: 
    #   list of [character, ZM code] pairs
    #     - In case of multiple code correspondences, choose the longest

    characters = cjk_string.strip().replace(' ', '')

    codes = []

    for character in characters:
        # Find any rows in the desired column that have the desired character
        # Take the ZM codes in those rows as a list
        possible_codes = zm_dataframe[zm_dataframe[db_column] == character][zm_column].tolist()

        # Choose the **longest code** in that list of ZM codes
        max_code = max(possible_codes, key=len) if possible_codes else None
        # There could be several, so order alphabetically and pick the first
        desired_codes = [c for c in possible_codes if len(c) == len(max_code)] if max_code else None
        desired_code = sorted(desired_codes)[0] if desired_codes else 'N/A: no match'
        codes.append([character, desired_code])
    
    return codes

In [63]:
new_test_string1 = '三人行必有我師'
new_test_string2 = '性相近也习相远也'

In [64]:
new_test_codes_output1 = characters_to_codes_simplistic(new_test_string1, df_zm_merged)
print(new_test_codes_output1)

[['三', 'cd'], ['人', 'od'], ['行', 'oi'], ['必', 'wzm'], ['有', 'gdq'], ['我', 'mdhm'], ['師', 'N/A: no match']]


In [65]:
new_test_codes_output2 = characters_to_codes_simplistic(new_test_string2, df_zm_merged)
print(new_test_codes_output2)

[['性', 'umc'], ['相', 'flvv'], ['近', 'pdw'], ['也', 'yi'], ['习', 'yt'], ['相', 'flvv'], ['远', 'bdrw'], ['也', 'yi']]


Nice.  So that worked.  At least basically.

### 4.2 Codes to Characters

This is going to be a little dicey.  This time, you give me a list of codes, and I return you a list of characters.  The trick is, a given code could correspond to more than one Chinese character string.  So we need a heuristic: take only the single-character string.  Of course, there might be more than one, which could get us into hot water...

In [66]:
def codes_to_characters_simplistic(code_list, zm_dataframe, db_column='RIME Characters', zm_column='ZM Codes'):
    # Input: 
    #   list of ZM codes
    #   database of Zheng Ma codes as a pandas DataFrame
    #   name of column to check for characters
    #   name of column containing Zheng Ma codes
    # Output: 
    #   string of CJK characters
    #     - In case of multiple character correspondences for a code, choose the shortest string

    cjk_string = ''

    for code in code_list:
        # Make sure the code is a valid ZM code:
        #   - fewer than 5 letters
        #   - no spaces
        if ' ' not in code:
            if len(code) < 5:
                # Get the characters for that code
                possible_characters = zm_dataframe[zm_dataframe[zm_column] == code][db_column].tolist()
                # Remove any empty strings
                viable_characters = [x for x in possible_characters if (len(x) > 0)]
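                # Add the shortest string (hopefully 1 character)
                # ... watch out: there might be more than one minimum...
                # ... in that case min() returns the first one it finds in the list
                # ... and min() raises ValueError on an empty list, so guard against that
                if viable_characters:
                    cjk_string += min(viable_characters, key=len)
                else:
                    print('No characters found for code: {}'.format(code))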
            else:
                print('Code too long: {}'.format(code))
        else:
            print('Code should not contain spaces: {}'.format(code))
    
    return cjk_string

In [67]:
new_test_codes1 = [ x[-1] for x in new_test_codes_output1]
new_test_codes2 = [ y[-1] for y in new_test_codes_output2]

In [68]:
new_test_string_output1 = codes_to_characters_simplistic(new_test_codes1, df_zm_merged)
print(new_test_string_output1)

Code should not contain spaces: N/A: no match
三人行必有我


In [69]:
new_test_string_output2 = codes_to_characters_simplistic(new_test_codes2, df_zm_merged)
print(new_test_string_output2)

性相近也习相远也


Nice!  That seems to have worked... I think...
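To be a bit more sure than "I think", a round-trip check can report exactly which characters fail to come back.  The sketch below uses toy dictionaries in place of the DataFrame lookups (hand-copied from the results above), and it is deliberately stricter than `min()`: it accepts a code only if there is exactly one single-character candidate.  The names `char_to_code`, `code_to_chars`, and `round_trip_failures` are hypothetical.

```python
# Toy tables standing in for the database lookups
char_to_code = {'性': 'umc', '相': 'flvv', '近': 'pdw', '也': 'yi', '习': 'yt', '远': 'bdrw'}
code_to_chars = {
    'umc': ['性'], 'flvv': ['相'], 'pdw': ['近'], 'yt': ['习'],
    'yi': ['也', '那些'],   # RIME lists both under 'yi'
    'bdrw': ['远'],
}

def round_trip_failures(text):
    """Return (index, character, code, decoded) for every character that does not survive."""
    failures = []
    for index, character in enumerate(text):
        code = char_to_code.get(character)
        options = code_to_chars.get(code, [])
        singles = [entry for entry in options if len(entry) == 1]
        # Accept only an unambiguous single-character match
        decoded = singles[0] if len(singles) == 1 else None
        if decoded != character:
            failures.append((index, character, code, decoded))
    return failures

print(round_trip_failures('性相近也习相远也'))   # []
print(round_trip_failures('性師'))
```

An empty list means the whole string survives the trip; a character with no code, like 師 here, shows up with `None` in place of its code.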

### 4.3 Unicode Comparison

Now let's do the same procedure, but for Unicode.

In [70]:
def characters_to_unicodes(cjk_string):
    # Read a string of CJK characters
    # Convert each character to its code point, written in hexadecimal
    return [hex(ord(x)) for x in cjk_string]

Now for a little sanity check.

In [71]:
new_test_unicodes_output1 = characters_to_unicodes(new_test_string1)
print(new_test_unicodes_output1)

['0x4e09', '0x4eba', '0x884c', '0x5fc5', '0x6709', '0x6211', '0x5e2b']


In [72]:
new_test_unicodes_output2 = characters_to_unicodes(new_test_string2)
print(new_test_unicodes_output2)

['0x6027', '0x76f8', '0x8fd1', '0x4e5f', '0x4e60', '0x76f8', '0x8fdc', '0x4e5f']


And let's create a function to go in the opposite direction: from Unicode code points to Unicode characters.

In [73]:
def unicodes_to_characters(code_list):
	# Read a list of hexadecimal code points as strings (with '0x' prefix)
	# Parse each as a base-16 integer, then get the corresponding Unicode character
	return [chr(int(code, 16)) for code in code_list]

Now let's check.

In [74]:
new_test_unicodes1 = new_test_unicodes_output1
new_test_unicodes2 = new_test_unicodes_output2

In [75]:
new_test_unicode_string_output1 = unicodes_to_characters(new_test_unicodes1)
print(new_test_unicode_string_output1)

['三', '人', '行', '必', '有', '我', '師']


In [76]:
new_test_unicode_string_output2 = unicodes_to_characters(new_test_unicodes2)
print(new_test_unicode_string_output2)

['性', '相', '近', '也', '习', '相', '远', '也']


It worked!
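For comparison with the ZM round trip, the Unicode version can be checked with plain assertions, including the character 師 that had no match in the RIME column.  This is a self-contained sketch; `to_code_points` and `from_code_points` are hypothetical stand-ins for the two functions above, with the second one joining its result into a string.

```python
def to_code_points(text):
    return [hex(ord(ch)) for ch in text]

def from_code_points(codes):
    # int(code, 16) accepts the '0x' prefix that hex() produces
    return ''.join(chr(int(code, 16)) for code in codes)

quote = '性相近也习相远也'
assert from_code_points(to_code_points(quote)) == quote

# Unlike the ZM lookup, this needs no table, so 師 round-trips too
assert from_code_points(to_code_points('師')) == '師'
print(to_code_points('師'))   # ['0x5e2b']
```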