# ZhengMa Character Conversion: Tests

## 3 Test Implementation: Character by Character

So let's just get something that works, in the sense that it takes us from characters to codes and back again.  For pretty-printing dictionaries, cf. [this post](https://datagy.io/python-pretty-print-dictionary/).

In [1]:
# If running in Google Colab
#from google.colab import drive
#drive.mount('/content/gdrive')
#
#path_prefix = "/content/gdrive/My Drive/Colab Notebooks/zhengma/raw/"
#data_prefix = '/content/gdrive/My Drive/Colab Notebooks/zhengma/data/'

In [2]:
# If running on local system
path_prefix = "../raw/"
data_prefix = '../data/'

In [3]:
import pickle 

# Load pickle    
with open(data_prefix + 'df_zm_merged.pkl', 'rb') as pickle_file:
    df_zm_merged = pickle.load(pickle_file)

df_zm_hyphenated = df_zm_merged[df_zm_merged['ZM Codes'].str.contains('-')]

In [4]:
import pprint

In [5]:
test_string1 = '三人行必有我師'
test_string2 = '性相近也习相远也'
test_columns = ['MS Characters', 'fcitx Characters', 'IBus Characters', 'RIME Characters']

In [6]:
def characters_to_codes(cjk_string, zm_dataframe, cjk_columns=('MS Characters',), zm_column='ZM Codes'):
    # Input:
    #   string of CJK characters
    #   database of Zheng Ma codes as a pandas DataFrame
    #   list (or tuple) of columns to check for characters
    #   name of column containing Zheng Ma codes
    # Output:
    #   nested dictionary: {character: {column: pandas Series of Zheng Ma codes}}

    characters = cjk_string.strip().replace(' ', '')

    codes = {}

    for character in characters:
        codes[character] = {}

        for column in cjk_columns:
            # Note: a lookup can match more than one row (see section 3.1),
            # so each entry is a pandas Series rather than a single code.
            # For the 'MS Characters' column, for example, CJK character '三' returns **3 rows**: dg, dgg, dggg
            codes[character][column] = zm_dataframe[zm_dataframe[column] == character][zm_column]
    
    return codes

### 3.1 Finding Multiple Codes for Multiple Characters

In [7]:
df_zm_quicklook = df_zm_merged[df_zm_merged[test_columns[0]] == test_string1[0]]
df_zm_quicklook.head(20)

| | ZM Codes | MS Characters | fcitx Characters | IBus Characters | RIME Characters |
| --- | --- | --- | --- | --- | --- |
| 5919 | dg | 三 | 扩大 | 扩大 | 扩大 |
| 5974 | dgg | 三 | | | |
| 5980 | dggg | 三 | 㩡 | 㩡 | |


This is a little surprising.  Within the Microsoft file for the Zheng Ma encoding, the character `'三'` has three different possible ZM codes: `'dg'`, `'dgg'`, or `'dggg'`.  I don't quite get why that's the case.

If you run the same little test on the columns containing the `fcitx`, IBus, or RIME versions, you don't get the multiple codes... *for this character*.

But for the character `'師'`, the Microsoft and RIME databases return **no match at all**.  The [entry in the Arch Chinese Dictionary](https://www.archchinese.com/chinese_english_dictionary.html?find=%E5%B8%AB) also lacks a ZM code for this character.  It does, however, correspond to the code `'myal'` in the `fcitx` and IBus databases.

In [8]:
test_codes_output1 = characters_to_codes(test_string1, df_zm_merged, cjk_columns=test_columns)

# Pretty-Print the resulting dictionary
pprint.pprint(test_codes_output1)

{'三': {'IBus Characters': 4290    cd
Name: ZM Codes, dtype: object,
       'MS Characters': 5919      dg
5974     dgg
5980    dggg
Name: ZM Codes, dtype: object,
       'RIME Characters': 4290    cd
Name: ZM Codes, dtype: object,
       'fcitx Characters': 4290      cd
59515    ^cd
Name: ZM Codes, dtype: object},
 '人': {'IBus Characters': 30262    od
Name: ZM Codes, dtype: object,
       'MS Characters': 50562         w
53811    wwww-3
Name: ZM Codes, dtype: object,
       'RIME Characters': 30262    od
Name: ZM Codes, dtype: object,
       'fcitx Characters': 30262     od
59692    ^od
Name: ZM Codes, dtype: object},
 '師': {'IBus Characters': 142532    myal
Name: ZM Codes, dtype: object,
       'MS Characters': Series([], Name: ZM Codes, dtype: object),
       'RIME Characters': Series([], Name: ZM Codes, dtype: object),
       'fcitx Characters': 63645     ^my-6
142532     myal
Name: ZM Codes, dtype: object},
 '必': {'IBus Characters': 96348    wzm
Name: ZM Codes, dtype: object,
      

Actually, let's see if we can find where all those characters are contained across any of the columns.  For some of the techniques used, see [this post](https://kanoki.org/2022/02/04/pandas-search-a-string-in-dataframe-across-all-columns/) and [this StackOverflow thread](https://stackoverflow.com/questions/26640129/search-for-string-in-all-pandas-dataframe-columns-and-filter).

In [9]:
# This is a *very* time-intensive search, as
# it returns any row that merely *contains* the string,
# not only exact matches.
# So only uncomment it if you really need it.
#
# search_string1 = '|'.join([letter for letter in test_string1])
# search_string2 = test_string1[0]
# df_zm_scavenger = df_zm_merged[df_zm_merged.apply(lambda row: row.astype(str).str.contains(search_string2, case=False).any(), axis=1)]
# df_zm_scavenger.head(20)

In [10]:
# df_zm_scavenger.shape

I think [this post on `pandas` DataFrames and masking](https://www.shecancode.io/blog/filter-a-pandas-dataframe-by-a-partial-string-or-pattern-in-8-ways) is particularly helpful.

In [11]:
individual_characters = [cjk for cjk in test_string2]

mask_ms = df_zm_merged[test_columns[0]].isin(individual_characters)
mask_fcitx = df_zm_merged[test_columns[1]].isin(individual_characters)
mask_ibus = df_zm_merged[test_columns[2]].isin(individual_characters)
mask_rime = df_zm_merged[test_columns[3]].isin(individual_characters)

df_zm_find_string = df_zm_merged[mask_ms | mask_fcitx | mask_ibus | mask_rime]

df_zm_find_string.head(100)

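| | ZM Codes | MS Characters | fcitx Characters | IBus Characters | RIME Characters |
| --- | --- | --- | --- | --- | --- |
| 3426 | bn | 也 | 增值 | 增值 | 增值 |
| 3445 | bnhn | 也 | | | |
| 10975 | fl | 协 | 相 | 相 | 相 |
| 11642 | fqp | 远 | | | |
| 11647 | fqpv | 远 | | | |
| 29447 | ntg | 性 | 儣 | 儣 | |
| 29457 | ntgg | 性 | | | |
| 29643 | nu | 习 | 伪 | 伪 | 伪 |
| 29659 | nud-1 | 习 | | | |
| 37973 | rp | 近 | 然后 | 然后 | 然后 |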


We want to capture, in addition, any codes that

- match the codes in `df_zm_find_string`, but also
- possibly contain `'-'` followed by some number at the end of the letter string.

That will help us understand if ZM assigns those characters *unique* codes or not.

But for simplicity, we'll omit codes with a leading `'^'`, since that seems to be a project-specific database modification.

In [12]:
simple_codes = df_zm_find_string['ZM Codes'].tolist()

# Look for codes with the same base, but a hyphen and numerical suffix
# ... and for codes with the same base, but no hyphen (if the original has a hyphen)
augmented_codes = []

for x in simple_codes:
    if '^' in x:
        continue
    elif '-' in x:
        base_code, n_suffix = x.split('-')
    else:
        base_code = x

    # Make sure the new list has the basic ZM code
    if base_code not in augmented_codes:
        augmented_codes.append(base_code)

    # then add codes with that base and possible numerical suffixes
    for n in range(100):
        possible_code = base_code + '-' + str(n)

        if possible_code not in augmented_codes:
            augmented_codes.append(possible_code)

#print(augmented_codes)

In principle, we should check for more than 100 possible numbers after the '-', since we can easily find 70 or more characters assigned to a given code (as in the case of '^br').  But these "large" variants seem confined to codes preceded by '^', and the collection of '^' codes seems already to duplicate items also assigned to other "regular" (not '^'-initial) codes.  So for practical purposes, we'll consider only the "augmented codes" above.
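
If the limit of 100 ever did become a problem, one way around it might be to compare *base* codes instead of enumerating every possible suffix.  The sketch below is only an illustration of that idea: the helper `base_code`, the pattern name, and the `all_codes` list (including the made-up `'bn-150'`) are assumptions of mine, and it presumes that every suffix has the form `'-'` followed by digits.

```python
import re

# A trailing '-<digits>' is the suffix inserted while building the database
SUFFIX_PATTERN = re.compile(r'-\d+$')


def base_code(zm_code):
    """Strip a trailing numerical suffix, e.g. 'nud-1' -> 'nud'."""
    return SUFFIX_PATTERN.sub('', zm_code)


# A few of the codes found for the target characters
simple_codes = ['bn', 'nud-1', 'ntg']
wanted_bases = {base_code(code) for code in simple_codes if '^' not in code}

# Hypothetical slice of the full code list; 'bn-150' is invented to show
# that a suffix above 99 is still caught
all_codes = ['bn', 'bn-150', 'nud', 'nud-1', 'ntg', 'ntg-1', 'ntgg', '^nu']
matches = [code for code in all_codes
           if '^' not in code and base_code(code) in wanted_bases]

print(matches)  # ['bn', 'bn-150', 'nud', 'nud-1', 'ntg', 'ntg-1']
```

The trade-off is that every code in the database gets passed through the regular expression once, rather than being compared against a precomputed list.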

In [13]:
mask_alt_codes = df_zm_merged['ZM Codes'].isin(augmented_codes)
df_zm_find_string_alternates = df_zm_merged[mask_alt_codes | mask_ms | mask_fcitx | mask_ibus | mask_rime]

df_zm_find_string_alternates.head(100)

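| | ZM Codes | MS Characters | fcitx Characters | IBus Characters | RIME Characters |
| --- | --- | --- | --- | --- | --- |
| 3426 | bn | 也 | 增值 | 增值 | 增值 |
| 3445 | bnhn | 也 | | | |
| 10975 | fl | 协 | 相 | 相 | 相 |
| 11642 | fqp | 远 | | | |
| 11647 | fqpv | 远 | | | |
| 29447 | ntg | 性 | 儣 | 儣 | |
| 29457 | ntgg | 性 | | | |
| 29643 | nu | 习 | 伪 | 伪 | 伪 |
| 29658 | nud | 买 | | | |
| 29659 | nud-1 | 习 | | | |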


### 3.2 Comments on the Process thus Far

I've started taking a closer look at the Microsoft, `fcitx`, IBus, and RIME databases for the characters in our proposed quote:

> 性相近也习相远也

First, for reference, here's a table of the codes obtained through the [stl56 website](http://www.stl56.com/zhengma/).

| Code | Character | Number |
| :-- | --: | --: |
| umc | 性 | 0 |
| flvv | 相 | 1 |
| pdw | 近 | 2 |
| yi | 也 | 3 |
| yt | 习 | 4 |
| flvv | 相 | 5 |
| bdrw | 远 | 6 |
| yi | 也 | 7 |


#### 3.2.1 Some Initial Issues


The table below, from our databases, collects all rows where *any* column contains a single character from the target string above.  It also includes any rows for *other* characters that have the *same* ZM code as one of the characters in our quote.  That is, it shows not just the codes for the characters we want, but also whether those same codes give us unwanted characters too.

An earlier version of this table came from the implementation *before* I gave each new instance of a duplicated code a numerical suffix.  I'm just keeping this around for historical reasons.  At that point the routine to create new, suffixed codes looked like this:

```python
            m = data_pattern.match(line)
            if m:
                zm_code, cjk_char = m.group(1), m.group(2)

                # It turns out that some ZM codes are used
                # for more than one CJK character string.
                # So we need to make sure not to overwrite earlier characters
                # by making the new ZM code string unique.
                # (Example: the code yi in the RIME database)
                if zm_code in zm_codes.keys():
                    zm_code += '-'
                zm_codes[zm_code] = cjk_char

                cjk_count += 1
```

With that code, if, say, `'yi'` already existed, it would create a new key `'yi-'` and assign a character to that.  But on the *next* pass, it would be looking for `'yi'` again, find it, add `'-'` to it, and then potentially overwrite the `'yi-'` that it had just created.  So we'd correctly recognize when one code was used for more than one character (we'd have `'yi'` and `'yi-'` in the keys), but we would not necessarily count correctly *how many* characters corresponded to that code.
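
To make the failure concrete, here is a minimal sketch of that first routine, stripped of the file parsing and run on three made-up strings that share one code.  The function name `build_codes_naive` and the placeholder strings `'A'`, `'B'`, `'C'` are mine, purely for illustration.

```python
def build_codes_naive(pairs):
    """Mimic the first routine: append a bare '-' to a code that is already taken."""
    zm_codes = {}
    for zm_code, cjk_char in pairs:
        if zm_code in zm_codes:
            zm_code += '-'
        # If 'yi-' already exists, this assignment silently replaces it
        zm_codes[zm_code] = cjk_char
    return zm_codes


# Three placeholder strings that all share the code 'yi'
sample_pairs = [('yi', 'A'), ('yi', 'B'), ('yi', 'C')]
naive_codes = build_codes_naive(sample_pairs)

print(naive_codes)       # {'yi': 'A', 'yi-': 'C'}  ('B' has been overwritten)
print(len(naive_codes))  # 2 keys, although 3 strings were read
```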

Since then, I've updated the routine to contain the following code:

```python
            m = data_pattern.match(line)
            if m:
                zm_code, cjk_char = m.group(1), m.group(2)

                # It turns out that some ZM codes are used
                # for more than one CJK character string.
                # So we need to make sure not to overwrite earlier characters
                # by making the new ZM code string unique.
                # (Example: the code yi in the RIME database)
                # So append '-' and then add a number suffix
                # ... but make sure that new code isn't already there...
                while zm_code in zm_codes.keys():
                    if '-' not in zm_code:
                        zm_code += '-'
                    
                    base_code, n_suffix = zm_code.split('-')
                    
                    # Take the numerical suffix and add 1
                    # But if n_suffix is None, int(n_suffix) is undefined
                    zm_code = base_code + '-' + str(int(0 if n_suffix in (None, '') else n_suffix) + 1)

                    # Next loop... see if this incremented code is itself already in the keys
                    # If not, done.  If it is, increment again.
                
                # We should now have a zm_code not in the keys
                zm_codes[zm_code] = cjk_char

                cjk_count += 1
```

The idea here is to give a unique numerical suffix to each code, and check if the code-with-number combination is already in the keys.  Keep going until you get a number that isn't there, then assign the character to that unrepresented key.
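
The same idea can be written as a small stand-alone helper, which makes it easier to test outside the parsing loop.  This is a sketch rather than the routine actually used in the project: the name `unique_code` is hypothetical, and it assumes that the raw codes read from the files never contain a `'-'` of their own.

```python
def unique_code(zm_code, zm_codes):
    """Return zm_code if it is free, else zm_code with the first free '-<n>' suffix."""
    if zm_code not in zm_codes:
        return zm_code

    n = 1
    while f'{zm_code}-{n}' in zm_codes:
        n += 1
    return f'{zm_code}-{n}'


zm_codes = {}
# Three placeholder strings that all share the code 'yi'
for zm_code, cjk_char in [('yi', 'A'), ('yi', 'B'), ('yi', 'C')]:
    zm_codes[unique_code(zm_code, zm_codes)] = cjk_char

print(zm_codes)  # {'yi': 'A', 'yi-1': 'B', 'yi-2': 'C'}
```

Nothing is overwritten here, so the number of keys matches the number of strings read.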



Below is the **updated table**, after accounting for overwritten codes with '-'.

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 3426	| bn	| 也 (3, 7)	| 增值	| 增值	| 增值	|  
| 3445	| bnhn	| 也 (3, 7)	|	|	|	|  
| 10975	| fl	| 协	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 11642	| fqp	| 远 (6)	|	|	|	|  
| 11647	| fqpv	| 远 (6)	|	|	|	|  
| 29447	| ntg	| 性 (0)	| 儣	| 儣	|	|  
| 29457	| ntgg	| 性 (0)	|	|	|	|  
| 29643	| nu	| 习 (4)	| 伪	| 伪	| 伪	|  
| 29658	| nud	| 买	|	|	|	|  
| 29659	| nud-1	| 习 (4)	|	|	|	|  
| 37973	| rp	| 近 (2)	| 然后	| 然后	| 然后	|  
| 38030	| rpk	| 近 (2)	| 鱕	| 鱕	|	|  
| 40567	| sh	| 相 (1, 5)	| 亡	| 亡	| 亡	|  
| 40586	| shg	| 相 (1, 5)	|	|	|	|  
| 47652	| um	| 商	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 58539	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 59601	| ^yi	|	| 也 (3, 7)	|	|	|  
| 59602	| ^yt	|	| 习 (4)	|	|	|  
| 64153	| ^um-6	|	| 性 (0)	|	|	|  
| 69993	| ^fl-48	|	| 相 (1, 5)	|	|	|  
| 76352	| ^wp-19	|	| 近 (2)	|	|	|  
| 76363	| ^wb-49	|	| 远 (6)	|	|	|  
| 87147	| wp	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 87572	| brw	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 89351	| flv	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 92545	| ntg-1	|	| 𠆲	| 𠆲	|	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 95992	| wbr	|	| 冠	| 冠	| 远 (6)	|  
| 95993	| wbr-1	|	| 远 (6)	| 远 (6)	|	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 100145	| **bdrw-1**	|	| 黿	| 黿	|	|  
| 100146	| **bdrw-2**	|	| 𪓣	| 𪓣	|	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


The table can be a little confusing to sift through, and I'm still trying to make sure I see all the details, but for now here are some points to note.

* Where you see a `'-'` followed by a number in a ZM code, that's something I had to insert.  There are codes where a single ZM code represents two or more different Chinese characters.  To make sure I didn't overwrite a previous correspondence while gathering the data, I added `'-'` and a number to any ZM code that was already in the database and had a Chinese character assigned.
	* In short, treat ZM codes with `'-'` followed by a number as if they didn't have them: e.g. `'yi-1'` and `'yi'` are the *same code*.
* I've **boldfaced** the specific ZM codes that Le and Qifan have already used in the quote.
* I've numbered the characters in our quote from 0 to 7, so that we could keep track of them in the data table.
	* If you focus on a specific column, you can look for 0, 1, 2, ..., 7 in order and verify that each database represented here does in fact contain all the characters.
* You can also see that **no database (column) _uniquely_ assigns one code to one character**.
	* In each database, each character appears at least twice (... except for 也 (3, 7) and 习 (4) in the RIME database).  That means at least two codes can represent the same character: e.g. `'pdw'` and `'wpd'` both represent 近 (2).
	* And you see many `'-'`s, which means that frequently the same code can represent *two* characters (or character strings): e.g. 
		* `'wbr'` can represent 远 (6) or 冠 in the `fcitx` and IBus databases;
		* `'yi'` can represent 也 (3, 7) or 那些 in the RIME database;
		* `'nud'` can represent 习 (4) or 买 in the Microsoft database.
* **Practical Upshot: we need a _heuristic_ to resolve the ambiguities**, regardless of the database we choose.
	* This gets us back to the situation with the 4-corner codes: there we resolved the ambiguities through considering frequency.  Here we might try something else.
	* We could **try using the longest code available for each character**, assuming that shorter codes are "shortcuts".
		* We will still run into trouble with the code `'bdrw'`, representing 远 (6), but also '黿' and yet another character which, for some reason I haven't understood yet, doesn't render in the `fcitx` and IBus databases.
			* But we can see on the [stl56 website](http://www.stl56.com/zhengma/) that even there `'bdrw'` corresponds to two characters, if I'm understanding the output properly: 远 (6) and 黿.
			* We could try using the code `'wbrd'` (I'm not sure what our heuristic would be for choosing that code over the other, since they're the same length).  But the website doesn't render any characters for that code, if I'm understanding correctly.
	* We could **try using the shortest code available for each character**.
		* This seems to work for the Microsoft database, though it doesn't give the codes y'all got from the website.
		* This doesn't seem to work for the character 远 (6) in the `fcitx`, IBus, and RIME databases, since that could have code `'brw'` or `'wbr'`.




#### 3.2.2 An Initial Heuristic

If we assume we're using the RIME database, then we could perhaps use the heuristic below:

* For *en*coding, choose the **longest available code** for a given character.
* For *de*coding, choose only **individual characters** for a given code.

If we look at the table focusing only where RIME has characters, we're left with the following.  The ZM codes that Le and Qifan have from the website are boldfaced.


**New**, updated table:

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 10975	| fl	| 协	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 47652	| um	| 商	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 58539	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 87147	| wp	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 87572	| brw	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 89351	| flv	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 95992	| wbr	|	| 冠	| 冠	| 远 (6)	|  
| 95993	| wbr-1	|	| 远 (6)	| 远 (6)	|	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 100145	| **bdrw-1**	|	| 黿	| 黿	|	|  
| 100146	| **bdrw-2**	|	| 𪓣	| 𪓣	|	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


So we encode with the longest code possible and we get this:


**New**, updated table...

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 10975	| ~~fl~~	| 协	| 相 (1, 5)	| 相 (1, 5)	| ~~相 (1, 5)~~	|  
| 47652	| ~~um~~	| 商	| 性 (0)	| 性 (0)	| ~~性 (0)~~	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 58539	| yta	| 旗	| 习 (4)	| 习 (4)	|	|  
| 87147	| ~~wp~~	|	| 近 (2)	| 近 (2)	| ~~近 (2)~~	|  
| 87572	| ~~brw~~	|	| 远 (6)	| 远 (6)	| ~~远 (6)~~	|  
| 89351	| ~~flv~~	|	| 相 (1, 5)	| 相 (1, 5)	| ~~相 (1, 5)~~	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 95992	| ~~wbr~~	|	| 冠	| 冠	| ~~远 (6)~~	|  
| 95993	| ~~wbr-1~~	|	| 远 (6)	| 远 (6)	| ~~...~~	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 100145	| **bdrw-1**	|	| 黿	| 黿	|	|  
| 100146	| **bdrw-2**	|	| 𪓣	| 𪓣	|	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


Hmmm... so that tells us how to distinguish between the **boldfaced** *CJK characters* to get a unique code.  But that actually doesn't tell us how to distinguish between the characters labeled by question marks (?).  We need a heuristic to distinguish between `pdw` and `wpd` for 近 (2) (??), and between `bdrw` and `wbrd` for 远 (6) (???).
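
The tie is easy to reproduce.  Below is a minimal sketch of the "longest available code" rule, run on RIME entries copied from the tables above.  The helper name `longest_codes` and the small `rime_sample` dictionary are mine, just for illustration; the real lookup would go through the DataFrame.

```python
def longest_codes(character, code_table):
    """Return every code of maximal length whose entry is exactly `character`."""
    candidates = [code for code, entry in code_table.items() if entry == character]
    if not candidates:
        return []
    max_length = max(len(code) for code in candidates)
    return sorted(code for code in candidates if len(code) == max_length)


# RIME entries for four of the characters in the quote
rime_sample = {
    'um': '性', 'umc': '性',
    'fl': '相', 'flv': '相', 'flvv': '相',
    'wp': '近', 'pdw': '近', 'wpd': '近',
    'brw': '远', 'wbr': '远', 'bdrw': '远', 'wbrd': '远',
}

print(longest_codes('性', rime_sample))  # ['umc']
print(longest_codes('相', rime_sample))  # ['flvv']
print(longest_codes('近', rime_sample))  # ['pdw', 'wpd']   -> still a tie
print(longest_codes('远', rime_sample))  # ['bdrw', 'wbrd'] -> still a tie
```

Returning a list rather than a single code keeps the ties visible instead of hiding them behind an arbitrary pick.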



If we **assume** we've done that somehow, then that would get us to the following.


**New**, updated table, so use this...

|	| ZM Codes	| MS Characters	| fcitx Characters	| IBus Characters	| RIME Characters	|  
| ------	| --------	| -------------	| ----------------	| ---------------	| ---------------	|  
| 47659	| **umc**	| 疫	| 性 (0)	| 性 (0)	| 性 (0)	|  
| 57165	| **yi**	| 就	| 也 (3, 7)	| 也 (3, 7)	| 也 (3, 7)	|  
| 58538	| yt	| 放	| 习 (4)	| 习 (4)	| 习 (4)	|  
| 92955	| **pdw**	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 96222	| wpd	|	| 近 (2)	| 近 (2)	| 近 (2) (??)	|  
| 100144	| **bdrw**	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 117572	| **flvv**	|	| 相 (1, 5)	| 相 (1, 5)	| 相 (1, 5)	|  
| 172887	| wbrd	|	| 远 (6)	| 远 (6)	| 远 (6) (???)	|  
| 235543	| **yi-1**	|	|	|	| 那些   |


And there it would seem we'd just have to contend with the ambiguity of the code `yi` (and `yi-1`).  And our heuristic of throwing away any string of more than one CJK character would seem to do the trick.
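
As a sanity check, here is a minimal sketch of that decoding rule.  The function name `decode_single` and the sample dictionary are assumptions for illustration; the sketch treats a code and its numerically suffixed variants as the same ZM code and keeps only the entries that are a single character long.

```python
import re


def decode_single(zm_code, code_table):
    """Collect entries stored under zm_code or 'zm_code-<n>', keeping single characters only."""
    pattern = re.compile(re.escape(zm_code) + r'(-\d+)?')
    entries = [entry for code, entry in code_table.items() if pattern.fullmatch(code)]
    return [entry for entry in entries if len(entry) == 1]


# RIME entries taken from the table above
rime_sample = {'yi': '也', 'yi-1': '那些', 'yt': '习'}

print(decode_single('yi', rime_sample))  # ['也']
print(decode_single('yt', rime_sample))  # ['习']
```

If the returned list ever holds more than one character, the rule has not resolved the ambiguity for that code and something else (frequency, context) would be needed.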

#### 3.2.3 Another Potential Way Out

Eric, Le, and Qifan suggest the following possible resolution to the issues raised above:

> Since we make and assign our oligomers manually, it is easy to make the strand represent only "pdw" and "bdrw" and never "wpd" or "wbrd." Therefore, the input should be unique, and the program does not need to distinguish between two codes that represent the same Chinese character. 

Some thoughts:

* The ZM encoding *as a whole*... I think... *needs* to allow both `pdw` and `bdrw`... as well as `wpd` and `wbrd`.  They're all in the database for a reason.  Specifically, the codes derive from analyzing a given Chinese character into a collection of basic shapes, the so-called *primary* (and perhaps *secondary*) *roots*.
    * Sometimes the order of decomposition is ambiguous, and the rules I've encountered so far do not seem always to stipulate priority.  So some characters will be able to be decomposed in multiple ways.  To assist the typist, who at times does this decomposition on the fly, the ZM encoding tries to accept a variety of decomposition orders.
        * So in a sense, we can't "eliminate" or "omit" any particular codes for a given character.  Otherwise we won't be representing ZM.  ZM has this ambiguity built in.
            * You'll note that the difference between `bdrw` and `wbrd` can't be one of a "shortcut".  And it's no accident that one is a permutation of the other: that's the ambiguity of the decomposition order.  I think that, in part, this comes from ambiguity as to when to write the lines on the left and bottom of the character.
        * Perhaps this means my idea earlier of "using the longest code" or similar heuristics for *en*coding might be in error.  If we do that, we're removing by fiat some of the ambiguity that the authors of the ZM encoding built into the system.
    * Side note: this is **different** from the situation in 4-corner codes.  That system looks only at the shapes that appear in the 4 corners of a character.  Because visual elements recur in Chinese characters, the set of necessary shapes is small.  And the assignment of a code to a character is unambiguous, largely because the process of assessing shapes doesn't require further analysis or decomposition.  However, also due to the recurrence of shapes, *many* characters will get assigned to the same code; so reversing the process for decoding becomes quite ambiguous without additional context.
        * Evidently with the ZM method, we have ambiguity in *both* directions.  But it seems like, on an intuitive level, that ambiguity might be "bigger" in the *en*coding phase.
* But in our case, in a sense, **we are the typist**.  So we get to **choose the code we want to use to represent a character**.
    * So perhaps it's **OK after all** for us to say that, given 近 (2) we'll only write `pdw` and never `wpd`, and likewise given 远 (6) we'll only write `bdrw` and never `wbrd`.
    * Then we'd only have the ambiguity for *de*coding of `yi` corresponding to 也 or 那些, and we omit strings of more than one character.
* Well, maybe it's not so easy.
    * Perhaps there are **two approaches to rendering an encoding**.
        * **We're the typist.**
        * **We might be implementing an encoding as a whole.**
    * Sometimes these are the same thing, sometimes not.
        * It seems like in the 4-corner codes, being able to encode any given character is the same as implementing the 4-corner system as a whole.
        * In the ZM scenario, we could encode any given character without ever implementing the whole ZM encoding.
            * For example, we could always decide we're going to encode 性 as `umc` instead of `um`.
                * As a typist, we're free to always prefer `umc` to `um`.
                * But if we're trying to encode ZM as a whole, we need to have a map between input and storage that allows a typist to use **either `umc` or `um`**.
                    * Maybe the latter is what we can say we're doing (implicitly) in this project (in general), but in *this particular quote* we have to make the typist's choice of which code to use.
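
The difference between the two approaches can be sketched with a pair of dictionaries.  Everything here is illustrative: the names `accepted_codes`, `preferred_code`, `decode_map`, and `encode` are mine, the code lists are the RIME entries from the tables above, and the preferred codes are the ones suggested for our quote.

```python
# "Encoding as a whole": every code the database accepts for a character
accepted_codes = {
    '性': ['um', 'umc'],
    '近': ['wp', 'pdw', 'wpd'],
    '远': ['brw', 'wbr', 'bdrw', 'wbrd'],
}

# "We're the typist": one code per character, fixed by us for this quote
preferred_code = {'性': 'umc', '近': 'pdw', '远': 'bdrw'}

# A decoder for the whole encoding has to accept every listed code
decode_map = {code: character
              for character, codes in accepted_codes.items()
              for code in codes}


def encode(text):
    """Encode as the typist would: always the preferred code."""
    return [preferred_code[character] for character in text]


print(encode('性近远'))   # ['umc', 'pdw', 'bdrw']
print(decode_map['wpd'])  # 近 (accepted on input, never produced by encode)
```

In this picture the typist's choice is just a restriction of the full map: whatever `encode` produces is still valid ZM input, but not every valid ZM input will ever show up in our strands.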

Out of curiosity, how often does the RIME database specifically leave us in the 也 situation of a single code representing more than one CJK string?

In [14]:
df_zm_hyphenated_rime = df_zm_hyphenated[df_zm_hyphenated['RIME Characters'] != '']

In [15]:
df_zm_hyphenated_rime.shape

(9172, 5)

So, roughly 9000 times we have a hyphenated (i.e. repeated) code in our overall database for which the RIME column is not empty: this should generally mean that the duplication applies to RIME as well.

How many of these hyphenated codes correspond to only a single character in the RIME column?

In [16]:
df_zm_hyphenated_rime_unique = df_zm_hyphenated_rime[df_zm_hyphenated_rime['RIME Characters'].apply(lambda x: len(str(x)) == 1)]

In [17]:
df_zm_hyphenated_rime_unique.head(20)

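| | ZM Codes | MS Characters | fcitx Characters | IBus Characters | RIME Characters |
| --- | --- | --- | --- | --- | --- |
| 13798 | ggyy-1 | 一方 | 厄 | 厄 | 厄 |
| 25571 | lkai-1 | 回荡 | 睼 | 睼 | 罡 |
| 25805 | llyy-1 | 逻辑设计 | 岂 | 岂 | 屺 |
| 31423 | pewy-1 | 家禽 | 铹 | 铹 | 铹 |
| 38447 | rrrr-1 | 拉拉扯扯 | 比 | 比 | 毙 |
| 38448 | rrrr-2 | 拖拖拉拉 | 毙 | 毙 | 比 |
| 50935 | wdai-1 | 做东 | 寔 | 寔 | 宁 |
| 51053 | wdtg-1 | 倚重 | 过头 | 过头 | 实 |
| 87086 | c-1 | | 現 | 現 | 理 |
| 87087 | o-1 | | 會 | 會 | 很 |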


In [18]:
df_zm_hyphenated_rime_unique.shape

(926, 5)

OK, so this happens about 900 times.  Interesting.
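
One caveat: the count above looks only at the suffixed rows, so it doesn't say whether the *unsuffixed* row for the same code also holds a single character.  To see which codes remain ambiguous after multi-character strings are dropped, one could group the entries by base code.  The sketch below does this on a toy dictionary; the helper `group_by_base` is hypothetical, and the sample mixes a RIME pair (`yi`) with an `fcitx` pair (`wbr`) from the earlier tables purely for illustration.

```python
from collections import defaultdict


def group_by_base(code_table):
    """Group entries under the part of each code that precedes any '-<n>' suffix."""
    groups = defaultdict(list)
    for code, entry in code_table.items():
        groups[code.split('-')[0]].append(entry)
    return dict(groups)


# Toy sample: 'yi' pair as in RIME, 'wbr' pair as in fcitx
sample = {'yi': '也', 'yi-1': '那些', 'wbr': '冠', 'wbr-1': '远', 'umc': '性'}
groups = group_by_base(sample)

# Keep only the codes that still map to two or more single characters
still_ambiguous = {}
for base, entries in groups.items():
    singles = [entry for entry in entries if len(entry) == 1]
    if len(singles) > 1:
        still_ambiguous[base] = singles

print(groups)           # {'yi': ['也', '那些'], 'wbr': ['冠', '远'], 'umc': ['性']}
print(still_ambiguous)  # {'wbr': ['冠', '远']}
```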