# ZhengMa Character Conversion: Converter

## 4 Converting between Characters & Codes

So let's build a working routine that converts Chinese characters to Zheng Ma codes, and then back again.

In [1]:
# If running in Google Colab
#from google.colab import drive
#drive.mount('/content/gdrive')
#
#path_prefix = "/content/gdrive/My Drive/Colab Notebooks/zhengma/raw/"
#data_prefix = '/content/gdrive/My Drive/Colab Notebooks/zhengma/data/'

In [2]:
# If running on local system
path_prefix = "../raw/"
data_prefix = '../data/'

In [3]:
import pickle

# Load the merged Zheng Ma table from its pickle file
with open(data_prefix + 'df_zm_merged.pkl', 'rb') as pickle_file:
    df_zm_merged = pickle.load(pickle_file)

### 4.1 Characters to Codes

We'll start simple: give me a character string, and I'll give you the code for each character. We've tried this before, just returning any row where the character is the one we want. But there could be several codes for a given character, so we need to decide how to get only one. We can try taking the longest code and, if several codes tie for longest, the alphabetically first of them.

In [4]:
def characters_to_codes_simplistic(cjk_string, zm_dataframe, db_column='RIME Characters', zm_column='ZM Codes'):
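    # Input:
    #   cjk_string: string of CJK characters (spaces are ignored)
    #   zm_dataframe: database of Zheng Ma codes as a pandas DataFrame
    #   db_column: name of column to check for characters
    #   zm_column: name of column containing Zheng Ma codes
    # Output:
    #   list of [character, code] pairs, one per input character
    #     - In case of multiple code correspondences, choose the longest
    #       (ties are broken alphabetically)
    #     - 'N/A: no match' if the character is not found in db_column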

    characters = cjk_string.strip().replace(' ', '')

    codes = []

    for character in characters:
        # Find any rows in the desired column that have the desired character
        # Take the ZM codes in those rows as a list
        possible_codes = zm_dataframe[zm_dataframe[db_column] == character][zm_column].tolist()

        # Choose the **longest code** in that list of ZM codes
        max_code = max(possible_codes, key=len) if possible_codes else None
        # There could be several, so order alphabetically and pick the first
        desired_codes = [c for c in possible_codes if len(c) == len(max_code)] if max_code else None
        desired_code = sorted(desired_codes)[0] if desired_codes else 'N/A: no match'
        codes.append([character, desired_code])
    
    return codes

In [5]:
new_test_string1 = '三人行必有我師'
new_test_string2 = '性相近也习相远也'

In [6]:
new_test_codes_output1 = characters_to_codes_simplistic(new_test_string1, df_zm_merged)
print(new_test_codes_output1)

[['三', 'cd'], ['人', 'od'], ['行', 'oi'], ['必', 'wzm'], ['有', 'gdq'], ['我', 'mdhm'], ['師', 'N/A: no match']]


In [7]:
new_test_codes_output2 = characters_to_codes_simplistic(new_test_string2, df_zm_merged)
print(new_test_codes_output2)

[['性', 'umc'], ['相', 'flvv'], ['近', 'pdw'], ['也', 'yi'], ['习', 'yt'], ['相', 'flvv'], ['远', 'bdrw'], ['也', 'yi']]


Nice. So that worked. At least basically: every character got a code except 師, which has no match in the 'RIME Characters' column.
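
The selection rule used above (longest code first, then alphabetical order among ties) can also be written as a single `min` call with a compound key. Here is a minimal sketch of that idea; the helper name `pick_code` and the candidate list are made up for illustration and are not taken from the Zheng Ma table.

```python
def pick_code(possible_codes, fallback='N/A: no match'):
    # Hypothetical helper: same rule as characters_to_codes_simplistic,
    # i.e. the longest code wins and ties are broken alphabetically.
    if not possible_codes:
        return fallback
    # Key: negative length puts longer codes first,
    # then the code itself breaks ties alphabetically.
    return min(possible_codes, key=lambda code: (-len(code), code))

# Illustrative candidates only (not real entries from the table)
example_codes = ['ab', 'abcd', 'abca', 'a']
print(pick_code(example_codes))  # abca
print(pick_code([]))             # N/A: no match
```

This avoids building the intermediate list of longest codes, but the result should be the same as the two-step version in the function above.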

### 4.2 Codes to Characters

This is going to be a little dicey.  This time, you give me a list of codes, and I return you a list of characters.  The trick is, a given code could correspond to more than one Chinese character string.  So we need a heuristic: take only the single-character string.  Of course, there might be more than one, which could get us into hot water...

In [8]:
def codes_to_characters_simplistic(code_list, zm_dataframe, db_column='RIME Characters', zm_column='ZM Codes'):
    # Input: 
    #   list of ZM codes
    #   database of Zheng Ma codes as a pandas DataFrame
    #   name of column to check for characters
    #   name of column containing Zheng Ma codes
    # Output: 
    #   string of CJK characters
    #     - In case of multiple character correspondences for a code, choose...

    cjk_string = ''

    for code in code_list:
        # Make sure the code is a valid ZM code:
        #   - fewer than 5 letters
        #   - no spaces
        if ' ' not in code:
            if len(code) < 5:
                # Get the characters for that code
                possible_characters = zm_dataframe[zm_dataframe[zm_column] == code][db_column].tolist()
                # Remove any empty strings
                viable_characters = [x for x in possible_characters if (len(x) > 0)]
                # Add the shortest string (hopefully 1 character)
                # ... watch out: there might be more than one minimum;
                # ... min() returns the first minimal item it finds in the list
                if viable_characters:
                    cjk_string += min(viable_characters, key=len)
                else:
                    # min() raises ValueError on an empty list, so report instead
                    print('No match for code: {}'.format(code))
            else:
                print('Code too long: {}'.format(code))
        else:
            print('Code should not contain spaces: {}'.format(code))
    
    return cjk_string

In [9]:
new_test_codes1 = [ x[-1] for x in new_test_codes_output1]
new_test_codes2 = [ y[-1] for y in new_test_codes_output2]

In [10]:
new_test_string_output1 = codes_to_characters_simplistic(new_test_codes1, df_zm_merged)
print(new_test_string_output1)

Code should not contain spaces: N/A: no match
三人行必有我


In [11]:
new_test_string_output2 = codes_to_characters_simplistic(new_test_codes2, df_zm_merged)
print(new_test_string_output2)

性相近也习相远也


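Nice! That seems to have worked... I think... The second string round-trips exactly; the first comes back without 師, because its placeholder 'N/A: no match' was rejected as a code.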
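
On the question raised in the comments of `codes_to_characters_simplistic`: Python's built-in `min` returns the first minimal item it encounters, so when several candidates share the shortest length, the result depends on the order of rows in the table. Here is a small sketch that makes such ties visible instead of silently picking one; the helper name `shortest_candidates` and the sample list are hypothetical.

```python
def shortest_candidates(candidates):
    # Hypothetical helper: return every non-empty candidate that shares
    # the minimum length, keeping the original order.
    viable = [c for c in candidates if len(c) > 0]
    if not viable:
        return []
    shortest = min(len(c) for c in viable)
    return [c for c in viable if len(c) == shortest]

# Illustrative data only: pretend one code maps to all of these strings
sample = ['相近', '也', '习', '']
print(min([c for c in sample if c], key=len))  # first of the shortest: 也
print(shortest_candidates(sample))             # all of the shortest: ['也', '习']
```

If the returned list has more than one entry, the code is ambiguous at the single-character level and needs some extra rule (or a human) to decide.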

### 4.3 Unicode Comparison

Now let's do the same round trip, but with Unicode code points instead of Zheng Ma codes.

In [12]:
def characters_to_unicodes(cjk_string):
    # Read a string of CJK characters
    # Convert each character to its code point, written as a hexadecimal string
    return [hex(ord(x)) for x in cjk_string]

Now for a little sanity check.

In [13]:
new_test_unicodes_output1 = characters_to_unicodes(new_test_string1)
print(new_test_unicodes_output1)

['0x4e09', '0x4eba', '0x884c', '0x5fc5', '0x6709', '0x6211', '0x5e2b']


In [14]:
new_test_unicodes_output2 = characters_to_unicodes(new_test_string2)
print(new_test_unicodes_output2)

['0x6027', '0x76f8', '0x8fd1', '0x4e5f', '0x4e60', '0x76f8', '0x8fdc', '0x4e5f']


And let's create a function to go in the opposite direction: from Unicode code points to Unicode characters.

In [15]:
def unicodes_to_characters(code_list):
    # Read a list of hexadecimal codes as strings (with '0x' prefix)
    # Parse each one as a base-16 integer, then get the corresponding Unicode character
    return [chr(int(code, 16)) for code in code_list]

Now let's check.

In [16]:
new_test_unicodes1 = new_test_unicodes_output1
new_test_unicodes2 = new_test_unicodes_output2

In [17]:
new_test_unicode_string_output1 = unicodes_to_characters(new_test_unicodes1)
print(new_test_unicode_string_output1)

['三', '人', '行', '必', '有', '我', '師']


In [18]:
new_test_unicode_string_output2 = unicodes_to_characters(new_test_unicodes2)
print(new_test_unicode_string_output2)

['性', '相', '近', '也', '习', '相', '远', '也']


It worked! Both strings come back intact, including 師, though as lists of characters rather than as strings.
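
Code points are more commonly written in the `U+XXXX` form than with Python's `0x` prefix. As a minimal sketch, the two helpers below do the same round trip in that notation and join the result back into a single string. The names `to_u_plus` and `from_u_plus` are invented here for illustration.

```python
def to_u_plus(cjk_string):
    # Format each character's code point as U+XXXX (at least 4 hex digits)
    return ['U+{:04X}'.format(ord(ch)) for ch in cjk_string]

def from_u_plus(code_list):
    # Drop the 'U+' prefix, parse the rest as base 16, and join into a string
    return ''.join(chr(int(code[2:], 16)) for code in code_list)

codes = to_u_plus('三人行')
print(codes)               # ['U+4E09', 'U+4EBA', 'U+884C']
print(from_u_plus(codes))  # 三人行
```

Returning a string from the reverse direction makes it easy to compare the output directly against the original input.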