# Pinyin Syllable Count Generator Working Draft Notes
This document records the experiments behind the Pinyin Syllable Count Generator function. The function takes pinyin text of any length or word count, counts the syllables in the supplied Latin characters, and returns a list of integers giving the number of syllables in each word.

The initial stage of development tests three ways of storing the reference data to compare processing performance: a Python list, a NumPy array, and a pandas DataFrame.

In [1]:
import pandas as pd
import numpy as np

Pinyin is a structured system with accepted combinations of "initials" (the initial sound of a syllable, e.g., the first consonant) and "finals" (the ending sound of a syllable, e.g., vowels that sometimes end with a consonant). One syllable in pinyin can be as small as one letter and as large as six letters. Given this structure, detecting a syllable can occur in one of two ways:
1. A sequence of pinyin letters is compared to a list of accepted pinyin syllables. If the sequence appears in the list, or if the sequence is the start of a longer accepted syllable ("tia" by itself is not an accepted syllable, but "tiao" and "tian" are), then the next letter of the evaluated pinyin is added to the sequence and the sequence is reevaluated. If the sequence does not appear in the list and is not the start of a longer accepted syllable, then the last character of the sequence becomes the first character of the next syllable, and the syllable count is incremented.
2. A sequence of letters is processed incrementally with the initials identified first, and then the finals after the appearance of the first vowel. The identified initial and final values act as indexes to a table (either a NumPy Array or Pandas DataFrame) of boolean values that evaluate the initial and final pairing. If the table returns False, then the final is compared to a list of other final values to determine whether it can be part of a more extended final. If it is, then the function continues onto the next incremented letter. Otherwise, the function determines that the current letter is the start of the next syllable and increments the number of syllables for that word.
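The first approach can be sketched in a few lines. This is only an illustration under stated assumptions, not the function developed later in these notes: `count_syllables` and the six-entry `sample` set are hypothetical names and data, and the prefix set is one possible way to implement the "part of a longer accepted term" check.

```python
def count_syllables(word, syllables):
    """Greedy syllable count for one word, following approach 1."""
    if not word:
        return 0
    # Every prefix of every accepted syllable, so that a partial
    # sequence such as "tia" is allowed to keep growing.
    prefixes = {s[:k] for s in syllables for k in range(1, len(s) + 1)}
    count = 1
    current = ''
    for letter in word.lower():
        if current + letter in prefixes:
            # Still a syllable, or the start of a longer one: keep extending.
            current += letter
        else:
            # The letter cannot extend the sequence, so it starts a new syllable.
            count += 1
            current = letter
    return count


# Hypothetical excerpt standing in for the full list of accepted syllables.
sample = {'tian', 'tiao', 'an', 'men', 'bei', 'jing'}
print(count_syllables('tiananmen', sample))  # tian | an | men
print(count_syllables('beijing', sample))    # bei | jing
```

Because the loop always extends the current sequence as far as it can, an ambiguous spelling is read as the longest possible syllable, which is why an apostrophe is needed to force a break in such cases.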

Given these possibilities, two metrics need to be considered to settle on the proper method for the evaluation function: speed and size. I hypothesize that comparing a sequence of letters to a list of terms can be faster; however, the list itself may be excessively large. Evaluating sequences by their combination of initials and finals, however, may take longer but results in a much smaller package. The following exercises test each method on a smaller scale to get a general sense of the speed vs. size issue.
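One caveat when measuring size: for a Python list, `sys.getsizeof` reports the list object itself (its header and references), not the strings it refers to. The sketch below contrasts the two figures on a hypothetical five-item excerpt; `deep_size` is an illustrative helper, not part of the tests that follow.

```python
import sys


def deep_size(strings):
    """Size in bytes of a list plus the string objects it refers to."""
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in strings)


sample = ['a', 'ai', 'ao', 'an', 'ang']  # hypothetical excerpt of the syllable list
shallow = sys.getsizeof(sample)          # the list container only
deep = deep_size(sample)                 # container plus each string

print('shallow:', shallow, 'bytes')
print('deep:   ', deep, 'bytes')
```

The exact numbers depend on the Python build, so none are quoted here. The point is only that the shallow figure understates the footprint of a list of strings.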

## Setup for Testing

The code below imports two CSV files. The first contains the list of acceptable pinyin syllables (e.g., "yang," "wei"). The second contains a grid of 1 values in which the first column holds the initials (row headers) and the first row holds the finals (column headers). Empty cells are treated as 0, and the 1 and 0 values become the boolean values True and False, respectively, in the array and DataFrame forms.
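To make the grid layout concrete, here is a stdlib-only sketch that parses a small table the same way the import code reads the file: first column as initials, first row as finals, and a cell of "1" as True. The three-row `SAMPLE` text is a hypothetical excerpt, and the nested list stands in for the NumPy array; the contents of the real `pinyinDF.csv` are not reproduced here.

```python
import csv
import io

# Hypothetical excerpt in the assumed layout of the grid file.
SAMPLE = """INDEX,a,ai,i
ø,1,1,
b,1,1,1
j,,,1
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
finals = rows[0][1:]                      # first row, minus the corner cell
initials = [row[0] for row in rows[1:]]   # first column, minus the corner cell
# A cell is True only when it holds '1'; empty cells become False.
table = [[cell == '1' for cell in row[1:]] for row in rows[1:]]

print(finals)
print(initials)
print(table[initials.index('j')][finals.index('i')])  # is "ji" an accepted pairing?
```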

In [19]:
import csv
import sys

## LIST ##
# Import list of all pinyin combinations
smList = []
with open("pinyinList.csv") as f:
    r = csv.reader(f)
    for row in r:
        smList += row
        
## NUMPY/DATAFRAME ##
# List of initials, finals for Numpy evaluation
PY = np.genfromtxt('pinyinDF.csv', delimiter=',', dtype=str)
initList = list(PY[:,0][1:])
finList = list(PY[0][1:])

# Import Numpy array from CSV
smAr = PY[1:][:,1:] == '1'

# Import dataframe from CSV 
smDF = pd.read_csv('pinyinDF.csv', header=0, index_col=0, dtype={'INDEX':str}
                  ).fillna(0).astype('bool')

# Print examples of each form of reference data.
print('Python List:')
print(str(smList[0:30]))
print('Displaying 30 of ' + str(len(smList)) + ' possible pinyin combinations ' + '\n')
print('——————————————————\n')
print('# Header lists for use in indexing the array/DataFrame')
print('Initials List:')
print(initList)
print('Finals List:')
print(str(finList) + '\n')
print('——————————————————\n')
print('Numpy Array:')
print(str(smAr[0]))
print('Displaying 1 of ' + str(len(smAr)) + ' rows within the array' + '\n')
print('Pandas DataFrame:')
print(smDF.head(3))

Python List:
['a', 'ai', 'ao', 'an', 'ang', 'e', 'ei', 'en', 'eng', 'er', 'o', 'ou', 'yi', 'ya', 'yao', 'ye', 'you', 'yan', 'yang', 'yin', 'ying', 'yong', 'wu', 'wa', 'wo', 'wei', 'wai', 'wan', 'wen', 'wang']
Displaying 30 of 407 possible pinyin combinations 

——————————————————

# Header lists for use in indexing the array/DataFrame
Initials List:
['ø', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l', 'z', 'c', 's', 'zh', 'ch', 'sh', 'r', 'j', 'q', 'x', 'g', 'k', 'h', 'y', 'w']
Finals List:
['a', 'ai', 'ao', 'an', 'ang', 'e', 'ei', 'en', 'eng', 'er', 'o', 'ong', 'ou', 'i', 'ia', 'iao', 'ie', 'iu', 'ian', 'iang', 'in', 'ing', 'iong', 'u', 'ua', 'uo', 'ue', 'ui', 'uai', 'uan', 'uang', 'un', 'v', 've']

——————————————————

Numpy Array:
[ True  True  True  True  True  True  True  True  True  True  True False
  True False False False False False False False False False False False
 False False False False False False False False False False]
Displaying 1 of 24 rows within the array

Pandas DataFrame:

## Performing the Tests
To compare the size and speed of each method, this test works with two examples: a short two-letter term that is an acceptable pinyin syllable ("ba"), and a longer five-letter term that is not ("quang"). The IPython %timeit magic measures the processing time of each method over runs of 100,000 loops.

### Methods
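1. Python list: the "in" operator searches the list of acceptable pinyin syllables for the supplied example.
2. NumPy array: the example is split into its initial and final, their positions are looked up in the lists of initials and finals, and those positions are used to find the matching boolean in the NumPy array.
3. Pandas DataFrame: the same initial and final are used to find the boolean in the DataFrame in three ways: chained column-then-row indexing (labeled "slice" in the code), label-based loc, and position-based iloc.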
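Outside a notebook, the first two methods can be approximated with the standard library alone. The sketch below is a hedged stand-in for the real test: the miniature `syllables`, `initials`, `finals`, and `table` values are hypothetical, a nested list replaces the NumPy array, and `timeit.timeit` replaces the %timeit magic.

```python
import timeit

# Hypothetical miniature of the reference data; the real lists come from the CSV files.
syllables = ['a', 'ai', 'ba', 'bai', 'qu', 'quan']
initials = ['ø', 'b', 'q']
finals = ['a', 'ai', 'u', 'uan', 'uang']
table = [
    [True,  True,  False, False, False],  # ø
    [True,  True,  False, False, False],  # b
    [False, False, True,  True,  False],  # q
]


def in_list(term):
    """Method 1: linear search through the list of syllables."""
    return term in syllables


def in_table(initial, final):
    """Method 2: convert the initial and final to positions, then index the table."""
    return table[initials.index(initial)][finals.index(final)]


cases = [
    ('list, "ba"', lambda: in_list('ba')),
    ('list, "quang"', lambda: in_list('quang')),
    ('table, "b" + "a"', lambda: in_table('b', 'a')),
    ('table, "q" + "uang"', lambda: in_table('q', 'uang')),
]
for label, func in cases:
    seconds = timeit.timeit(func, number=100_000)
    print(f'{label}: {seconds / 100_000 * 1e9:.0f} ns per loop')
```

With data this small the timings say little about the full 407-syllable list; the sketch only shows the shape of the two lookups being compared.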

In [3]:
# Variables used for testing
smEval1 = 'ba'
smInitial1 = 'b'
smFinal1 = 'a'

smEval2 = 'quang'
smInitial2 = 'q'
smFinal2 = 'uang'


# Subject 1: Short, acceptable term
print('# Results of \"ba\" (short, valid):')

# Method 1: Python List
print('Python List: ' + str(sys.getsizeof(smList)))
%timeit -n 100000 smEval1 in smList

# Method 2: Numpy Array
print('\nNumpy array: ' + str(sys.getsizeof(initList) +
                              sys.getsizeof(finList) +
                              sys.getsizeof(smAr)) +
      ', without initials and finals lists: ' + str(sys.getsizeof(smAr)))
%timeit -n 100000 smAr[initList.index(smInitial1),finList.index(smFinal1)] != True

# Method 3: Pandas DataFrame
# 3a: slice
print('\nDataFrame: ' + str(sys.getsizeof(smDF)))
print('slice')
%timeit -n 100000 smDF[smFinal1][smInitial1] != True

# 3b: loc
print('loc')
%timeit -n 100000 smDF.loc[smInitial1,smFinal1] != True

# 3c: iloc
print('iloc')
%timeit -n 100000 smDF.iloc[initList.index(smInitial1),finList.index(smFinal1)] != True



# Subject 2: Longer, unacceptable term
print('\n\n# Results of \"quang\" (longer, invalid):')

# Method 1: Python List
print('Python List: ' + str(sys.getsizeof(smList)))
%timeit -n 100000 smEval2 in smList

# Method 2: Numpy Array
print('\nNumpy array: ' + str(sys.getsizeof(initList) +
                              sys.getsizeof(finList) +
                              sys.getsizeof(smAr)) +
      ', without initials and finals lists: ' + str(sys.getsizeof(smAr)))
%timeit -n 100000 smAr[initList.index(smInitial2),finList.index(smFinal2)] != True

# Method 3: Pandas DataFrame
# 3a: slice
print('\nDataFrame: ' + str(sys.getsizeof(smDF)))
print('slice')
%timeit -n 100000 smDF[smFinal2][smInitial2] != True

# 3b: loc
print('loc')
%timeit -n 100000 smDF.loc[smInitial2,smFinal2] != True

# 3c: iloc
print('iloc')
%timeit -n 100000 smDF.iloc[initList.index(smInitial2),finList.index(smFinal2)] != True

# Results of "ba" (short, valid):
Python List: 3760
399 ns ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numpy array: 1672, without initials and finals lists: 928
1.23 µs ± 7.34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

DataFrame: 2427
slice
8.57 µs ± 39.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
loc
6.99 µs ± 51.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
iloc
8.52 µs ± 97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# Results of "quang" (longer, invalid):
Python List: 3760
3.68 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numpy array: 1672, without initials and finals lists: 928
1.78 µs ± 34.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

DataFrame: 3067
slice
8.61 µs ± 49.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
loc
7.12 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
iloc
9.2 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [181]:
# import necessary packages for function
import numpy as np
import csv

# List of pinyin initials, finals for Numpy evaluation
PY = np.genfromtxt('pinyinDF.csv', delimiter=',', dtype=str)
initPYList = list(PY[:,0][1:])
finPYList = list(PY[0][1:])

# List of Wade-Giles initials, finals
WG = np.genfromtxt('wadegilesDF.csv', delimiter=',', dtype=str)
initWGList = list(WG[:,0][1:])
finWGList = list(WG[0][1:])

# Import Numpy arrays from CSV
arPY = PY[1:][:,1:] == '1'
arWG = WG[1:][:,1:] == '1'

# sylCount definition with parameters:
    # test = Print breadcrumbs of each evaluation step
    # errorReport = Return error messages as strings, NaN for non-errors
    # CBDB = Report names whose family name is repeated at the start of the given name
def sylCount(pinyin, test=False, errorReport=False, CBDB=False):
    '''Count the syllables (one per Chinese character) in each word of the inputted pinyin text.'''
    
    # Vowel list for reference
    vowel = ['a','e','i','o','u','v']
    
    # Replace ideographic spaces with regular space, also trim leading and following spaces.
    try:
        pinyin = pinyin.replace('\u3000',' ').strip()
    except AttributeError:  # non-string input such as None or NaN
        errorMsg = 'Error: null value'
        if test == True: print(errorMsg)
        if errorReport == True: return errorMsg
        return [0]
    
    # Make sure no symbols get through.
    if not pinyin.replace(' ','').replace('\'','').isalpha():
        errorMsg = 'Error: invalid non-alpha characters in name'
        if test == True: print(errorMsg)
        if errorReport == True: return errorMsg
        return [0]
    
    # CBDB testing purposes ONLY: if the family name is duplicated in the given name
    if CBDB == True:
        dupTest = pinyin.split()
        if len(dupTest) > 1:
            dupTestLen = [len(dupTest[0]), len(dupTest[1])]
            if (dupTest[0] == dupTest[1][0:dupTestLen[0]] and 
                dupTestLen[1] > dupTestLen[0] and 
                dupTest[1][len(dupTest[0])] != '\''):
                errorMsg = 'Error: duplicate family name in given name'
                if test == True: print(errorMsg)
                if errorReport == True: return errorMsg 
                return [0]

    # Variables used for evaluation
    initial = ''
    final = ''
    newSyl = False
    syl = 1
    sylCount = []
    exFin = ''
    
    if test == True: print('# Testing ' + pinyin + ' #')
    
    for i,n in enumerate(str.lower(pinyin)):

        if n == '\'':
            step = '0. new syl'
            notes = 'apostrophe'
            newSyl = True
        
        # If current character is a space...
        elif n == ' ':
            step = '0. new word'
            if pinyin[i+1] == ' ':
                notes = 'discard extra space'
            else:
                sylCount.append(syl)
                newSyl = True
                notes = 'space'

        # If final has already been set, regardless of current character
        elif final:
            step = 'b. ' + initial + '|' + final + n
            if (final[-1] == 'e' and n == 'r' or 
                final[-1] == 'n'):
                notes = final[-1] + n
                if i != len(pinyin) - 1 and pinyin[i+1] != ' ':
                    newSyl = True
                    exFin = final[-1] + n
                    notes += ' - not \"' + exFin + '\" final'
                else:
                    if (notes == 'er' and len(final) > 1) or n in vowel:
                        syl += 1
                        notes += ' - extra syl at end'

            else:
                notes = 'fin'
                try:
                    result = arPY[
                        initPYList.index(initial),
                        finPYList.index(final + n)
                    ]
                    notes += ' - array lookup ' + initial + final + n + ': ' + str(result)
                except ValueError:  # initial or final not found in the header lists
                    result = False
                    notes += ' - array error'

                if not result:
                    # Finals that start with the current sequence (f avoids shadowing the loop index i)
                    listFins = [f for f in finPYList if f.startswith(final + n)]
                    if not listFins or not any(
                        arPY[
                            initPYList.index(initial),
                            finPYList.index(f)
                        ] for f in listFins
                    ):
                        newSyl = True
                        notes += ' - new syl'
                    else:
                        final += n
                        notes += ' - definitely more'
                else:
                    final += n
                    notes += ' - maybe more'
          
        # If there is no final set (work on initial)
        else:
            step = 'a. ' + initial + n
            notes = 'init'
            if n not in vowel:
                if initial + n not in initPYList:
                    errorMsg = 'Error: ' + initial + n + ' - quitting, invalid initial'
                    if test == True: print(errorMsg)
                    if errorReport == True: return errorMsg
                    return [0]
                initial += n
                notes += ' - cons'
            elif not initial:
                initial = 'ø'
                final = n
                notes += ' - starting vowel'
            else:
                final = n
                notes += ' - vowel'
                
        # Print debug notes if parameter is true
        if test == True: print(step + ': ' + notes)
        
        # Processes for next syllable
        if newSyl:
            if n in vowel:
                if exFin:
                    initial = 'n'
                else:
                    initial = 'ø'
                syl += 1
                final = n
            elif n != ' ':
                if (exFin and pinyin[i+1].lower() not in vowel) or n == '\'':
                    initial = ''
                    if exFin == 'er' and len(final) > 1:
                        syl+= 1
                else:
                    initial = n
                exFin = False
                final = ''
                syl += 1
            else:
                initial = ''
                final = ''
                syl = 1
            if test == True: print('new syl: ' + str(syl))
            newSyl = False
    
    sylCount.append(syl)
    
    if errorReport == True:
        return np.nan
    else:
        return sylCount

In [193]:
import pandas as pd

# Import CBDB data
names = pd.read_csv('allCBDBnames.csv', index_col=0, names=['EngName','ChiName'], dtype=str, header=0)

# Run sylCount function on 'EngName' column
names['syl'] = names['EngName'].apply(sylCount, CBDB=True, errorReport=True)

# Add 'check' column to validate results
def tryAdd(x):
    # sum() raises TypeError for NaN and for error-message strings
    try:
        return sum(x)
    except TypeError:
        return None
names['check'] = names['ChiName'].str.len() == names['syl'].apply(tryAdd)

# Filter results to show false negatives
names[~names['check']]

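| | EngName | ChiName | syl | check |
|---|---|---|---|---|
| 0 | Wei Xiang | 未詳 | NaN | False |
| 1 | An Dun | 安惇 | NaN | False |
| 2 | An Fang | 安邡 | NaN | False |
| 3 | An Tao | 安燾 | NaN | False |
| 4 | Zha Dao | 查道 | NaN | False |
| 5 | Zha Yue | 查籥 | NaN | False |
| 6 | Chai Chengwu | 柴成務 | NaN | False |
| 7 | Chai Tianyin | 柴天因 | NaN | False |
| 8 | Chang Jin | 常僅 | NaN | False |
| 9 | Chang Renzhi | 常任秩 | NaN | False |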


In [192]:
names[names['syl'].apply(lambda x: x==[0])]

| | EngName | ChiName | syl | check |
|---|---|---|---|---|
| 26 | Chen Ji(2) | 陳機 | [0] | False |
| 31 | Chen Jian(2) | 陳戩 | [0] | False |
| 32 | Chen Zhi(5) | 陳致 | [0] | False |
| 37 | Chen Jing(5) | 陳靖 | [0] | False |
| 41 | Chen Zhu(2) | 陳鑄 | [0] | False |
| 42 | Chen Ju(3) | 陳鉅 | [0] | False |
| 48 | Chen Xiang(2) | 陳相 | [0] | False |
| 54 | Chen Hong(2) | 陳洪 | [0] | False |
| 58 | Chen Ruxi(2) | 陳汝錫 | [0] | False |
| 83 | Chen Shu(2) | 陳恕 | [0] | False |


In [199]:
len(names[names['syl'].notnull()])

53459

In [201]:
names.groupby('syl').agg('count')

| syl | EngName | ChiName | check |
|---|---|---|---|
| Error: bb - quitting, invalid initial | 4 | 4 | 4 |
| Error: cc - quitting, invalid initial | 3 | 3 | 3 |
| Error: dd - quitting, invalid initial | 4 | 4 | 4 |
| Error: duplicate family name in given name | 2051 | 2051 | 2051 |
| Error: ff - quitting, invalid initial | 1 | 1 | 1 |
| Error: gf - quitting, invalid initial | 1 | 1 | 1 |
| Error: gg - quitting, invalid initial | 1 | 1 | 1 |
| Error: hh - quitting, invalid initial | 1 | 1 | 1 |
| Error: hj - quitting, invalid initial | 1 | 1 | 1 |
| Error: hy - quitting, invalid initial | 1 | 1 | 1 |
