# ZhengMa Character Conversion: Data

## 2 Gathering Data

Let's try to get some of this data into memory.

In [47]:
# If running in Google Colab
#from google.colab import drive
#drive.mount('/content/gdrive')
#
#path_prefix = "/content/gdrive/My Drive/Colab Notebooks/zhengma/raw/"
#data_prefix = '/content/gdrive/My Drive/Colab Notebooks/zhengma/data/'

In [48]:
# If running on local system
path_prefix = "../raw/"
data_prefix = '../data/'

In [49]:
import pandas as pd
import re

In [50]:
def read_zm(file, file_encoding='utf-16', head=5, re_pattern=r"^\"(\w+)\"=\"(\w+)\"", verbose=False, find_character=None):
    # Read a Zheng Ma data file and build a dictionary
    # whose keys are the ZM codes and whose values are
    # the corresponding CJK character strings.
    # Input:
    #   file: filename
    #   file_encoding: encoding of the file (e.g. 'utf-16', 'utf-8')
    #   head: how many (key, value) pairs to print at the end (0 for none)
    #   re_pattern: raw-string regex with two groups: the code and the characters
    #   verbose: if True, report skipped blank lines and lines with no regex match
    #   find_character: if given, report every entry whose string contains it
    # Output:
    #   zm_codes: dictionary of (code, character) pairs
    #   print:
    #     how many non-blank lines were read
    #     how many lines yielded a (code, character) pair
    #     the first `head` (key, value) pairs in zm_codes

    data_pattern = re.compile(re_pattern)
    
    zm_codes = {}
    line_count = 0
    cjk_count = 0
    head_count = 0
    
    with open(file, encoding=file_encoding) as fi:
        for line in fi:
            row = line.strip()

            if not row:
                # Optionally let me know if we've skipped a blank line
                # (row is empty here, so there is nothing to echo)
                if verbose:
                    print('Skipping blank line')
                continue
            
            line_count += 1
            
            m = data_pattern.match(line)
            if m:
                zm_code, cjk_char = m.group(1), m.group(2)

                # It turns out that some ZM codes are used
                # for more than one CJK character string.
                # So we need to make sure not to overwrite earlier characters
                # by making the new ZM code string unique.
                # (Example: the code yi in the RIME database)
                # So append '-' and then add a number suffix
                # ... but make sure that new code isn't already there...
                while zm_code in zm_codes:
                    if '-' not in zm_code:
                        zm_code += '-'
                    
                    base_code, n_suffix = zm_code.split('-')
                    
                    # Take the numerical suffix and add 1.
                    # A freshly appended '-' leaves n_suffix as '', which counts as 0.
                    zm_code = base_code + '-' + str(int(n_suffix or 0) + 1)

                    # Next loop... see if this incremented code is itself already in the keys
                    # If not, done.  If it is, increment again.
                
                # We should now have a zm_code not in the keys
                zm_codes[zm_code] = cjk_char

                cjk_count += 1

                # In case we want to make sure that we read in
                # a certain CJK character from the database
                if find_character:
                    if str(find_character) in cjk_char:
                        print('Found one instance of {}: \nLine: {:>}\t{:>}\t{:>}'.format(str(find_character), cjk_count, zm_code, cjk_char))
            else:
                # Optionally let me know if there was no regex match
                if verbose:
                    print('No pattern match in line {:>}: {}'.format(line_count, row))
                continue

   
    print('\nTotal lines read:  {:>10}'.format(line_count))
    print('Total codes found: {:>10}\n'.format(cjk_count))

    if head > 0:
        # If you want to see some of the ZM codes and CJK characters read in
        print('Some of the initial codes:\n')
        for code_idx, cjk_string in zm_codes.items():
            if head_count < head:
                print('{}:\t{}'.format(code_idx, cjk_string))
                head_count += 1
            else:
                break

    return zm_codes
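
The collision handling inside `read_zm` can be looked at in isolation. The following is a minimal sketch, not part of the original notebook: the helper name `unique_code` is hypothetical, and the sample pairs are chosen to mirror the `aaaa` / `aaaa-1` entries that show up in the Windows output.

```python
def unique_code(code, existing):
    """Return `code` if unused, else `code-N` with the smallest unused N >= 1."""
    if code not in existing:
        return code
    n = 1
    while f"{code}-{n}" in existing:
        n += 1
    return f"{code}-{n}"


# Illustrative (code, string) pairs; 'aaaa' appears twice.
pairs = [("aaaa", "工"), ("aaaa", "恭恭敬敬"), ("aaab", "工作")]

codes = {}
for code, chars in pairs:
    # Never overwrite an earlier entry: pick a fresh key instead.
    codes[unique_code(code, codes)] = chars

print(codes)
```

Because the regex groups for the code never match `'-'`, a hyphen in a key can only come from this suffixing step, which is what makes it usable later as a marker for repeated codes.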

### 2.1 Windows Data

In [51]:
windows_filename = "TableTextServiceSimplifiedZhengMa.txt"
zm_data_file_windows = path_prefix + windows_filename

In [52]:
zm_codes_windows = read_zm(zm_data_file_windows)


Total lines read:       59587
Total codes found:      59506

Some of the initial codes:

a:	工
aa:	式
aaa:	工
aaaa:	工
aaaa-1:	恭恭敬敬


For the explanation of using the `list` of dictionary `items()` to get the dictionary keys as row elements, see [this reference](https://www.stackvidhya.com/convert-dictionary-to-pandas-dataframe-python/).

In [53]:
df_zm_windows = pd.DataFrame(list(zm_codes_windows.items()), columns=['ZM Codes', 'MS Characters'])

In [54]:
df_zm_windows.head()

Unnamed: 0,ZM Codes,MS Characters
0,a,工
1,aa,式
2,aaa,工
3,aaaa,工
4,aaaa-1,恭恭敬敬


### 2.2 `fcitx` Data

In [55]:
fcitx_filename = "zhengma-large.txt"
zm_data_file_fcitx = path_prefix + fcitx_filename

In [56]:
zm_codes_fcitx = read_zm(zm_data_file_fcitx, re_pattern=r"^(\^\w+|\w+)\s(\S+)", file_encoding='utf-8', find_character='黿')
#zm_codes_fcitx = read_zm(zm_data_file_fcitx, re_pattern=r"^(\^\w+|\w+)\s(\S+)", file_encoding='utf-8')

Found one instance of 黿: 
Line: 20733	^br-47	黿
Found one instance of 黿: 
Line: 43433	bdrw-1	黿

Total lines read:      151070
Total codes found:     151058

Some of the initial codes:

^av:	一
^ai:	丁
^az:	丂
^hd:	七
^ia:	丄


Looking closer at the data, a group of codes is preceded by a `'^'`.  This group is a little strange: for example, the code `^br` seems to appear upwards of 73 times, with one of those appearances corresponding to the character `黿`.


But after this group, the rest of the file seems to contain "normal" ZM codes without the preceding `'^'`.  Those start with `a` corresponding to `一`, which seems to be in accord with the other databases.
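
To see how the pattern passed to `read_zm` for this file treats the two kinds of codes, here is a small sketch. The sample lines are made up to resemble the entries shown above; whether the real file separates fields with a space or a tab is an assumption (`\s` accepts either).

```python
import re

# Same pattern as the one passed to read_zm for the fcitx file.
fcitx_pattern = re.compile(r"^(\^\w+|\w+)\s(\S+)")

# Hypothetical sample lines, shaped like the entries in the output above.
samples = ["^av 一", "a 一", "bdrw 黿", "# not a data line"]

for line in samples:
    m = fcitx_pattern.match(line)
    # groups() is (code, characters); None means the line would be skipped.
    print(repr(line), "->", m.groups() if m else None)
```

The first alternative, `\^\w+`, keeps the caret as part of the captured code, which is why keys such as `^av` end up in the dictionary next to the plain codes.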

There are also several codes with entries that seem to correspond only to a missing character:

```
bii 址
bii 𧈫
bii 𧈬
bij 坫
bij 𢀛
bik 墟
bil 盐
bio 𪤳
biq 𪣯
bix 五
```

It's not clear if those are supposed to be blanks or not.  The file's encoding is UTF-8, so I'm not sure if that's an error brought about by "clipping" the underlying bytes of the characters.
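
One way to investigate is to look at the code points directly. The sketch below is an assumption-laden check, not a diagnosis from the source: it supposes that the entries might be valid characters outside the Basic Multilingual Plane that the display font simply lacks, and the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return (code point label, plane, UTF-8 byte length, Unicode name) for one character."""
    cp = ord(ch)
    plane = "BMP" if cp <= 0xFFFF else "supplementary"
    name = unicodedata.name(ch, "<no name in this Unicode database>")
    return f"U+{cp:04X}", plane, len(ch.encode("utf-8")), name


# Characters copied from the listing above.
for ch in ["址", "𧈫", "𧈬", "𢀛", "𪤳", "𪣯", "五"]:
    print(ch, *describe(ch), sep="\t")
```

If an entry reports a well-formed code point with a four-byte UTF-8 encoding, the data itself is intact and the blank is more likely a rendering issue than clipped bytes.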

In [57]:
df_zm_fcitx = pd.DataFrame(list(zm_codes_fcitx.items()), columns=['ZM Codes', 'fcitx Characters'])

In [58]:
df_zm_fcitx.head()

Unnamed: 0,ZM Codes,fcitx Characters
0,^av,一
1,^ai,丁
2,^az,丂
3,^hd,七
4,^ia,丄


In [59]:
df_zm_fcitx[df_zm_fcitx['ZM Codes'].isin(['bdrw', 'bdrw-', 'bdrw--'])]

Unnamed: 0,ZM Codes,fcitx Characters
43431,bdrw,远


### 2.3 IBus Data

In [60]:
ibus_filename = "zhengma.txt"
zm_data_file_ibus = path_prefix + ibus_filename

In [61]:
zm_codes_ibus = read_zm(zm_data_file_ibus, re_pattern=r"^(\w+)\s(\w+)\s\d+", file_encoding='utf-8')


Total lines read:      151125
Total codes found:     123383

Some of the initial codes:

a:	一
b:	地
c:	现
c-1:	現
d:	的


In [62]:
df_zm_ibus = pd.DataFrame(list(zm_codes_ibus.items()), columns=['ZM Codes', 'IBus Characters'])

In [63]:
df_zm_ibus.head()

Unnamed: 0,ZM Codes,IBus Characters
0,a,一
1,b,地
2,c,现
3,c-1,現
4,d,的


### 2.4 RIME Data

In [64]:
rime_filename = "zhengma.dict.yaml"
zm_data_file_rime = path_prefix + rime_filename

In [65]:
# If you need to hunt down a specific character as it's being read in,
# use the following line:
# zm_codes_rime = read_zm(zm_data_file_rime, re_pattern=r"^([a-z]+)\s+(\w+)", file_encoding='utf-8', find_character='也')
#
# Otherwise, if you want a less verbose read-in, use this line:
zm_codes_rime = read_zm(zm_data_file_rime, re_pattern=r"^([a-z]+)\s+(\w+)", file_encoding='utf-8')


Total lines read:       81485
Total codes found:      81463

Some of the initial codes:

a:	一
a-1:	下
a-2:	平
aa:	一下
aa-1:	一天


In [66]:
df_zm_rime = pd.DataFrame(list(zm_codes_rime.items()), columns=['ZM Codes', 'RIME Characters'])

In [67]:
df_zm_rime.head()

Unnamed: 0,ZM Codes,RIME Characters
0,a,一
1,a-1,下
2,a-2,平
3,aa,一下
4,aa-1,一天


### 2.5 Merging Data

Now let's try to line up all this data.  For each Zheng Ma code, let's see what each database has.  [Here](https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes)'s a handy reference for merging several `pandas` DataFrames at once.
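
As a small-scale illustration of that approach, here is a sketch on three toy DataFrames. The column names `A Characters`, `B Characters` and `C Characters` and the rows are placeholders, not the real data.

```python
import functools as ft

import pandas as pd

# Toy frames sharing the 'ZM Codes' key; the rows are illustrative only.
df_a = pd.DataFrame({"ZM Codes": ["a", "aa"], "A Characters": ["工", "式"]})
df_b = pd.DataFrame({"ZM Codes": ["a", "b"], "B Characters": ["一", "地"]})
df_c = pd.DataFrame({"ZM Codes": ["b", "c"], "C Characters": ["地", "现"]})

# Fold the list pairwise: an outer join keeps every code seen in any frame.
merged = ft.reduce(
    lambda left, right: pd.merge(left, right, on="ZM Codes", how="outer"),
    [df_a, df_b, df_c],
).fillna("")  # codes missing from a frame become empty strings

print(merged)
```

With `how='outer'`, a code that exists in only one source still gets a row, so the merged frame can be much longer than any single input.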

In [68]:
import functools as ft

#### 2.5.1 Getting the Data into One Place

Let's first try to get all the different databases into a single DataFrame.

In [69]:
dfs = [df_zm_windows, df_zm_fcitx, df_zm_ibus, df_zm_rime]

In [70]:
df_zm_merged = ft.reduce(lambda left, right: pd.merge(left, right, on=['ZM Codes'], how='outer'), dfs).fillna('')

In [71]:
df_zm_merged.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
0,a,工,一,一,一
1,aa,式,一下,一下,一下
2,aaa,工,,,
3,aaaa,工,,,
4,aaaa-1,恭恭敬敬,,,
5,aaab,工作,,,
6,aaad,工期,,,
7,aaae,黄花菜,,,
8,aaah,葡萄牙,,,
9,aaal,花花世界,,,百无一用


In [72]:
df_zm_merged.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
238449,zzwv,,,,出逃
238450,zzww-1,,,,缝缝补补
238451,zzww-2,,,,结结实实
238452,zzxe-1,,,,出展
238453,zzxh,,,,出戏
238454,zzxs,,,,出尽
238455,zzxu-1,,,,红绿灯
238456,zzyf,,,,缝纫机
238457,zzyh-1,,,,乡民
238458,zzyu,,,,纨绔子弟


#### 2.5.2 Looking at Details of the Data

Let's take a moment to see what the data looks like in detail, finding where things are similar and where they diverge.  Let's start with the short codes, i.e. those exactly two characters long.

For using `apply()` to gain access to properties of the `values` of the specified DataFrame column, see [this StackOverflow thread](https://stackoverflow.com/questions/19937362/filter-string-data-based-on-its-string-length).
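
Filtering a column by string length can be written either with `apply()` or with the vectorized `.str.len()` accessor. A minimal sketch on a toy column (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ZM Codes": ["a", "aa", "ab", "aaa", "aa-1"]})

# Element-wise Python function applied to each value.
by_apply = df[df["ZM Codes"].apply(lambda x: len(str(x)) == 2)]

# Vectorized string accessor; gives the same mask for string data.
by_str = df[df["ZM Codes"].str.len() == 2]

print(by_apply["ZM Codes"].tolist())
print(by_str["ZM Codes"].tolist())
```

Note that a suffixed code such as `aa-1` has length 4, so a length-2 filter leaves out the repeated entries for a two-letter code.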

In [73]:
df_zm_short_codes = df_zm_merged[df_zm_merged['ZM Codes'].apply(lambda x: len(str(x)) == 2)]

In [74]:
df_zm_short_codes.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
1,aa,式,一下,一下,一下
142,ab,节,一起,一起,一起
199,ac,芭,平静,平静,平静
229,ad,基,于,于,于
457,ae,菜,开,开,开
534,af,革,末,末,末
733,ag,七,无,无,无
842,ah,牙,形成,形成,形成
916,ai,东,丁,丁,丁
1063,aj,划,可,可,可


In [75]:
df_zm_short_codes.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
87158,zg,,出面,出面,出面
87159,zh,,纯,纯,纯
87160,zi,,如此,如此,如此
87161,zj,,如,如,如
87162,zk,,细,细,细
87163,zl,,组,组,组
87164,zm,,女,女,女
87165,zn,,以便,以便,她们
87166,zo,,以,以,以
87167,zp,,以后,以后,以后


Let's have a look at how many codes there are with `'-'` in them, i.e. how many codes map to more than one specific character.

In [76]:
df_zm_hyphenated = df_zm_merged[df_zm_merged['ZM Codes'].str.contains('-')]

In [77]:
df_zm_hyphenated.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
4,aaaa-1,恭恭敬敬,,,
28,aadn-1,慝,,,
31,aady-1,工矿,,,
38,aaff-1,蓬蓬勃勃,,,
45,aagk-1,工事,,,
55,aahw-1,工龄,,,
59,aaig-1,工艺水平,,,
107,aatk-1,工种,,,
131,aaww-1,世世代代,,,开开心心
132,aaww-2,草菅人命,,,严严实实


In [78]:
df_zm_hyphenated.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
238427,zzuu-1,,,,熊熊燃烧
238435,zzvs-1,,,,始发站
238436,zzvw-2,,,,出演
238438,zzwa-1,,,,女娲补天
238439,zzwb-1,,,,娓娓道来
238442,zzwe-1,,,,出塞
238445,zzwj-1,,,,丝绸之路
238446,zzwk-1,,,,幽冥
238447,zzwr-1,,,,纠察
238450,zzww-1,,,,缝缝补补


In [79]:
df_zm_hyphenated.shape

(72600, 5)

Let's also try to collect those codes along with the other instances of the ones they're duplicating.

In [80]:
hyphenated_codes = df_zm_merged[df_zm_merged['ZM Codes'].str.contains('-')]['ZM Codes'].tolist()
dehyphenated_codes = [x.split('-')[0] for x in hyphenated_codes]
hyphen_adjacent_codes = hyphenated_codes + dehyphenated_codes

df_zm_hyphen_adjacent = df_zm_merged[df_zm_merged['ZM Codes'].isin(hyphen_adjacent_codes)]

In [81]:
df_zm_hyphen_adjacent.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
0,a,工,一,一,一
1,aa,式,一下,一下,一下
3,aaaa,工,,,
4,aaaa-1,恭恭敬敬,,,
27,aadn,葚,,,
28,aadn-1,慝,,,
30,aady,落落大方,,,
31,aady-1,工矿,,,
37,aaff,苷,,,
38,aaff-1,蓬蓬勃勃,,,


As we can see with `aadn` and `aadn-1`, representing 葚 and 慝 respectively, it is **not the case that repeated codes always correspond to multiple-character strings**.
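
The same check can be sketched without `pandas` by grouping each suffixed code with its base code. The entries below are copied from the Windows column in the output above; the grouping approach itself is only an illustration.

```python
from collections import defaultdict

# (code, string) pairs taken from the Windows column shown above.
entries = {
    "aaaa": "工", "aaaa-1": "恭恭敬敬",
    "aadn": "葚", "aadn-1": "慝",
    "aady": "落落大方", "aady-1": "工矿",
}

# Group every string under the code it was originally typed with.
groups = defaultdict(list)
for code, chars in entries.items():
    base = code.split("-")[0]
    groups[base].append(chars)

for base, strings in groups.items():
    singles = [s for s in strings if len(s) == 1]
    print(base, strings, "single-character entries:", len(singles))
```

Here `aadn` collects two single characters, while `aady` collects two multi-character strings, so a repeated code says nothing by itself about string length.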

In [82]:
df_zm_hyphen_adjacent.tail(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
238438,zzwa-1,,,,女娲补天
238439,zzwb-1,,,,娓娓道来
238441,zzwe,,,,出赛
238442,zzwe-1,,,,出塞
238444,zzwj,,,,出宫
238445,zzwj-1,,,,丝绸之路
238446,zzwk-1,,,,幽冥
238447,zzwr-1,,,,纠察
238450,zzww-1,,,,缝缝补补
238451,zzww-2,,,,结结实实


In [83]:
df_zm_hyphen_adjacent.shape

(103260, 5)

Let's see specifically where the columns differ among themselves.  Using the method of [this StackOverflow thread](https://stackoverflow.com/questions/22701799/pandas-dataframe-find-rows-where-all-columns-equal), we can try checking where each column is individually equal (or not) to the first column.

In [84]:
df_zm_different = df_zm_merged[~df_zm_merged.eq(df_zm_merged.iloc[:, 0], axis=0).all(1)]

In [85]:
df_zm_different.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
0,a,工,一,一,一
1,aa,式,一下,一下,一下
2,aaa,工,,,
3,aaaa,工,,,
4,aaaa-1,恭恭敬敬,,,
5,aaab,工作,,,
6,aaad,工期,,,
7,aaae,黄花菜,,,
8,aaah,葡萄牙,,,
9,aaal,花花世界,,,百无一用


In [86]:
df_zm_different.shape

(238469, 5)

In [87]:
df_zm_same = df_zm_merged[df_zm_merged.eq(df_zm_merged.iloc[:, 0], axis=0).all(1)]

In [88]:
df_zm_same.shape

(0, 5)
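
Because `df_zm_merged.iloc[:, 0]` is the `ZM Codes` column, the `.eq(...)` test above compares every column against the code string itself. One way to compare only the character columns with one another is to drop `ZM Codes` first. The following is a minimal sketch on a toy DataFrame; the codes `x1` to `x3` and their rows are made up for illustration, and this is not the notebook's own method.

```python
import pandas as pd

df = pd.DataFrame({
    "ZM Codes": ["x1", "x2", "x3"],  # hypothetical codes
    "MS Characters": ["甲", "乙", "丙"],
    "fcitx Characters": ["甲", "乙", ""],
    "IBus Characters": ["甲", "丁", ""],
    "RIME Characters": ["甲", "乙", ""],
})

# Comparison as written above: the reference column is 'ZM Codes'.
original = df.eq(df.iloc[:, 0], axis=0).all(axis=1)

# Comparison restricted to the character columns.
chars = df[[c for c in df.columns if c != "ZM Codes"]]
all_same = chars.eq(chars.iloc[:, 0], axis=0).all(axis=1)

print(original.tolist())
print(all_same.tolist())
print(df[all_same])
```

In this variant a row counts as "same" only when all four sources carry an identical string, so a code missing from any one source (an empty string after `fillna('')`) lands in the "different" group.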

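Hmmm... that's a bummer.  At face value, there are *no* codes for which *all* the databases have the same CJK characters.  That said, the first column in this comparison is `ZM Codes` itself, so an empty result is expected no matter how closely the databases agree, and it should not be read as evidence about the characters.  What about the specific code examples we had in the lessons mentioned above?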

| Phrases | Phrase code | Character normal codes | Character short codes |
| :-- | :-- | :-- | :-- |
| 生态系统 | mgmz | mc+gdsw+mzvv+zszr | mc+gsw+mzv+zs |
| 高等教育 | smbs | sjld+mbds+bmym+szq | sjl+ms+bmm+szq |
| 新石器时代 | sgjk | sufp+ga+jjjj+kds+nhs | sf+ga+jjg+kd+nh |
| 合成洗涤剂 | ohvv | odaj+hmy+vmrd+vrf+sonk | oaj+h+vmr+vrf+snk |
| 中华人民共和国 | jnoy | jivv+nred+od+yybh+eao+mfj+jdcs | |
| 全国工商业联合会 | ojbs | odc+jdcs+bi+suld+ku+ceug+odaj+odbz | |
| 中国有色金属工业总公司 | jjgr | jivv+jdcs+gdq+ryia+pa+xmil+bi+ku+udjw+ozs+yaj | |


In [89]:
tutorial_examples = [
    ['mgmz', '生态系统'],
    ['smbs', '高等教育'],
    ['sgjk', '新石器时代'],
    ['ohvv', '合成洗涤剂'],
    ['jnoy', '中华人民共和国'],
    ['ojbs', '全国工商业联合会'],
    ['jjgr', '中国有色金属工业总公司'],
]
tutorial_codes = [x[0] for x in tutorial_examples]
df_zm_examples = df_zm_merged[df_zm_merged['ZM Codes'].isin(tutorial_codes)]
df_zm_examples.shape

(4, 5)

In [90]:
df_zm_examples.head(20)

Unnamed: 0,ZM Codes,MS Characters,fcitx Characters,IBus Characters,RIME Characters
21937,jnoy,电炉,𡄆,𡄆,中华人民共和国
40372,sgjk,相互影响,𠝒,𠝒,新石器时代
217649,mgmz,,,,生态系统
226396,smbs,,,,高等教育


#### 2.5.3 Exporting Gathered Data

Let's take a moment to export this collected data in some formats useful for later processing.

In [91]:
import pickle 

# Write pickle
with open(data_prefix + 'df_zm_merged.pkl', 'wb') as pickle_file:
    pickle.dump(df_zm_merged, pickle_file)

In [92]:
df_zm_merged.to_csv(data_prefix + 'df_zm_merged.csv')
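
For later processing, the exported files can be read back in. The following is a minimal round-trip sketch on a toy DataFrame written to a temporary directory; the file names `toy.pkl` and `toy.csv` are placeholders. The `index_col=0` and `keep_default_na=False` arguments are choices made here, so that the saved index is not read back as an `Unnamed: 0` column and the empty strings left by `fillna('')` are not turned back into NaN.

```python
import os
import pickle
import tempfile

import pandas as pd

df = pd.DataFrame({
    "ZM Codes": ["a", "aa"],
    "MS Characters": ["工", "式"],
    "RIME Characters": ["一", ""],  # '' stands in for a missing entry
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")

    # Write both formats, as in the cells above.
    with open(pkl_path, "wb") as f:
        pickle.dump(df, f)
    df.to_csv(csv_path)

    # Read them back.
    with open(pkl_path, "rb") as f:
        from_pickle = pickle.load(f)
    from_csv = pd.read_csv(csv_path, index_col=0, keep_default_na=False)

print(from_pickle.equals(df))
print(list(from_csv.columns))
print(from_csv["RIME Characters"].tolist())
```

The pickle preserves the DataFrame exactly but is Python-specific, while the CSV is portable and needs these read options to come back in the same shape.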

### 2.6 Important Conclusions

So these databases, generally speaking, **do not recapitulate the multiple-character examples from the tutorial.**

Moreover, as we saw above, `aadn` and `aadn-1` represent 葚 and 慝, respectively, so that **repeated codes do _not_ always correspond to multiple-character strings**.