# Pinyin Syllable Count Generator Working Draft Notes
This document records the experiments behind the construction of the Pinyin Syllable Count Generator module. The module takes two columns of string data: one with unbroken text in Chinese, the other with its pinyin counterpart. It estimates the number of syllables in the Chinese text from the number of syllables it can detect in the supplied Latin characters. The results are delivered as integers, one syllable count per word, separated by commas.
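As a small illustration of the output described above, the sketch below joins a list of per-word counts into a comma-separated string. The helper name `format_counts` is hypothetical and not part of the module; the input `[3, 2]` is the result this notebook reports for 'Baerda Qingsheng'.

```python
def format_counts(counts):
    """Join per-word syllable counts into a comma-separated string.

    `format_counts` is a hypothetical helper used only for illustration.
    """
    return ",".join(str(c) for c in counts)


# Counts for a two-word name with 3 and 2 syllables.
print(format_counts([3, 2]))
```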

The Pandas package will be essential for the vector manipulation and storage.

In [4]:
import pandas as pd
import numpy as np
import csv
import sys

Pinyin is a structured system with accepted combinations of "initials" (the initial sound of a syllable, e.g. the first consonant) and "finals" (the ending sound of a syllable, e.g. a vowel, optionally followed by a consonant). One syllable in pinyin can be as small as one letter and as large as six letters. Because the set of accepted combinations is fixed, a syllable can be detected in one of two ways:
1. A sequence of pinyin letters can be compared to a list of accepted pinyin syllables. If the sequence appears in the list, or if the sequence is part of a longer accepted term ("tia" by itself is not an accepted syllable, however "tiao" and "tian" are), then an additional letter from the evaluated word can be added to the sequence and reevaluated. If the sequence does not appear in the list and is not part of a longer accepted term, then the last character of the sequence becomes the first character of the next syllable and the syllable count is incremented.
2. A sequence of pinyin letters is split into an initial and a final, and the pair is evaluated against a DataFrame of initials and finals, using the .loc indexer to spot a match if the combination is accepted or a null value if it is not.
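The first approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `SAMPLE_SYLLABLES` is a small hypothetical stand-in for the full list of accepted syllables, and the function names are invented for this example. It handles one word at a time and ignores ambiguous cases where a greedy match would merge two syllables.

```python
# Hypothetical sample; the real list would hold every accepted pinyin syllable.
SAMPLE_SYLLABLES = {"ba", "er", "da", "qi", "qin", "qing",
                    "she", "shen", "sheng", "tiao", "tian"}


def is_prefix_of_syllable(seq, syllables):
    """True if seq is an accepted syllable or the start of a longer one."""
    return any(s.startswith(seq) for s in syllables)


def count_syllables_by_list(word, syllables=SAMPLE_SYLLABLES):
    """Count syllables in a single pinyin word by greedy prefix matching."""
    word = word.lower()
    if not word:
        return 0
    count = 1
    seq = ""
    for letter in word:
        candidate = seq + letter
        if is_prefix_of_syllable(candidate, syllables):
            # Still a syllable, or part of a longer one: keep extending.
            seq = candidate
        else:
            # The letter cannot extend the sequence, so it starts a new syllable.
            seq = letter
            count += 1
    return count


print(count_syllables_by_list("Baerda"))
print(count_syllables_by_list("Qingsheng"))
```

Note that "tia" is accepted as a running sequence here only because "tiao" and "tian" begin with it, which mirrors the rule described in the first item.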

Given these possibilities, two metrics need to be considered in order to settle on the proper method for the evaluation function: speed and size. Comparing a sequence to a list of terms may be faster, but the list itself may be excessively large. Evaluating sequences by their combination of initials and finals may take longer but will result in a much smaller package. The exercises below test each method on a small scale to get a general sense of the speed vs. size trade-off.
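The trade-off can be illustrated with the standard library alone. The sketch below uses hypothetical miniature data (three initials and three finals, not the contents of pinyinList.csv or pinyinDF.csv) and a plain nested list in place of a Numpy array or DataFrame. One caveat worth keeping in mind: `sys.getsizeof` reports only the size of the container itself, not of the objects it holds, so the hypothetical `deep_size` helper adds the direct elements as well.

```python
import sys
import timeit

# Hypothetical miniature data, for illustration only.
initials = ["b", "p", "m"]
finals = ["a", "ai", "e"]
# grid[i][j] is True when initials[i] + finals[j] is treated as accepted.
grid = [
    [True, True, False],  # ba, bai
    [True, True, False],  # pa, pai
    [True, True, True],   # ma, mai, me
]
# Flat list of accepted syllables derived from the same grid.
syllables = [ini + fin
             for i, ini in enumerate(initials)
             for j, fin in enumerate(finals)
             if grid[i][j]]


def in_list(initial, final):
    """Method 1: membership test against the flat list."""
    return (initial + final) in syllables


def in_grid(initial, final):
    """Method 2: look up the initial/final pair in the boolean grid."""
    return grid[initials.index(initial)][finals.index(final)]


def deep_size(container):
    """Size of the container plus its direct elements (one level deep)."""
    return sys.getsizeof(container) + sum(sys.getsizeof(x) for x in container)


list_time = timeit.timeit(lambda: in_list("m", "e"), number=10000)
grid_time = timeit.timeit(lambda: in_grid("m", "e"), number=10000)

print("list: size", deep_size(syllables), "time", list_time)
print("grid: size", deep_size(grid), "time", grid_time)
```

Timings and sizes will vary by machine and Python version, so no expected figures are given here.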

## Setup for small tests

An example sequence of letters will be tested with both methods.

In [5]:
## LIST ##
# Import list of all pinyin combinations
smList = []
with open("pinyinList.csv") as f:
    r = csv.reader(f)
    for row in r:
        smList += row
        
## NUMPY/DATAFRAME ##
# List of initials, finals for Numpy evaluation
initList = ['ø','b','p','m','f','d','t','n','l','z','c','s',
            'zh','ch','sh','r','j','q','x','g','k','h','y','w']
finList = ['a','ai','ao','an','ang','e','ei','en','eng','er',
           'o','ong','ou','i','ia','iao','ie','iu','ian','iang',
           'in','ing','iong','u','ua','uo','ue','ui','uai','uan','uen','uang','un','v','ve']

# Import Numpy array from CSV
smAr = np.genfromtxt('pinyinDF.csv', delimiter=',', skip_header=1, filling_values=0
                    ).astype('bool_')
smAr = np.delete(smAr, 0, 1)

# Import dataframe from CSV 
smDF = pd.read_csv('pinyinDF.csv', header=0, index_col=0, dtype={'INDEX':str})
smDF = smDF.fillna(0).astype('bool')

print('Python List:')
print(str(smList[50:55]) + '\n')
print('Numpy Array:')
print(str(smAr[0])  + '\n')
print('Pandas DataFrame:')
print(smDF.head(3))

Python List:
['bu', 'pa', 'pai', 'pao', 'pan']

Numpy Array:
[ True  True  True  True  True  True  True  True  True  True  True False
  True  True  True  True  True  True  True  True  True  True  True False
 False False False False False False False False False False False]

Pandas DataFrame:
      a    ai    ao    an   ang      e    ei    en   eng     er  ...     uo  \
ø  True  True  True  True  True   True  True  True  True   True  ...  False   
b  True  True  True  True  True  False  True  True  True  False  ...  False   
p  True  True  True  True  True  False  True  True  True  False  ...  False   

      ue     ui    uai    uan    uen   uang     un      v     ve  
ø  False  False  False  False  False  False  False  False  False  
b  False  False  False  False  False  False  False  False  False  
p  False  False  False  False  False  False  False  False  False  

[3 rows x 35 columns]


In [3]:
# Method 1: Python List
print('Python List: ' + str(sys.getsizeof(smList)))
%timeit -n 100000 smEval in smList

# Method 2: Numpy Array
print('\nNumpy array: ' + str(sys.getsizeof(initList) +
                              sys.getsizeof(finList) +
                              sys.getsizeof(smAr)) +
      ', without initials and finals lists: ' + str(sys.getsizeof(smAr)))
%timeit -n 100000 smAr[initList.index(smInitial),finList.index(smFinal)] != True

# Method 3: Pandas DataFrame
# 3a: slice
print('\nDataFrame: ' + str(sys.getsizeof(smDF)))
print('slice')
%timeit -n 100000 smDF[smFinal][smInitial] != True

# 3b: loc
print('loc')
%timeit -n 100000 smDF.loc[smInitial,smFinal] != True

print('iloc')
# 3c: iloc
%timeit -n 100000 smDF.iloc[initList.index(smInitial),finList.index(smFinal)] != True

Python List: 3760
1.32 µs ± 8.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numpy array: 1552, without initials and finals lists: 952
1.48 µs ± 37.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

DataFrame: 2443
slice
9.05 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
loc
7.71 µs ± 43.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
iloc
9.08 µs ± 51.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [24]:
def sylCount(pinyin, test=False):
    '''Return a list of syllable counts, one per space-separated word of the pinyin text.

    Each syllable is assumed to correspond to one Chinese character.'''

    # Lists for reference
    consonant = ['b','c','ch','d','f','g','h','j','k','l','m','n',
                 'p','q','r','s','sh','t','w','x','y','z','zh']
    vowel = ['a','e','i','o','u','v']
    
    # Variables used for pinyin evaluation
    
    pinyin = pinyin.strip()

    initial = ''
    final = ''
    newSyl = False
    syl = 1
    sylCount = []
    er = False
    
    
    for i,n in enumerate(str.lower(pinyin)):
        
        endOfWord = i == len(pinyin) - 1 or pinyin[i+1] == ' '
        
        if n == ' ':
            step = '0. space'
            if pinyin[i+1] == ' ':
                notes = 'discard extra space'
            else:
                sylCount.append(syl)
                newSyl = True
                notes = 'new word'

        elif final:
            step = 'b. ' + initial + '|' + final + n
            if final[-1] == 'e' and n == 'r':
                notes = 'er'
                if not endOfWord:
                    newSyl = True
                    er = True
                    notes += ' - not \"er\" final'
                else:
                    if len(final) > 1:
                        syl += 1
                        notes += ' - extra syl at end'
            else:
                notes = 'fin'
                try:
                    result = smAr[initList.index(initial),
                                  finList.index(final + n)]
                    notes += ' - trying: ' + str(result)
                except (ValueError, IndexError):
                    # initial or final + n is not in the reference lists
                    result = False
                    notes += ' - array error'
                    
                if not result and not any(i for i in finList if (final+n) in i):
                    newSyl = True
                    notes += ' - new syl'
                else:
                    final += n
                    notes += ' - maybe more'

        else:
            step = 'a. ' + initial + n
            notes = 'init'
            if n not in vowel:
                initial += n
                notes += ' - cons'
            elif not initial:
                initial = 'ø'
                final = n
                notes += ' - starting vowel'
            else:
                final = n
                notes += ' - vowel'
                
        if test == True: print(step + ': ' + notes)
        
        if newSyl:
            if n in vowel:
                initial = 'ø'
                final = n
                syl += 1
            elif n != ' ':
                if er and pinyin[i+1] not in vowel:
                    initial = ''
                    if len(final) > 1:
                        syl+= 1
                    er = False
                else:
                    initial = n
                final = ''
                syl += 1
            else:
                initial = ''
                final = ''
                syl = 1
            if test == True: print('new syl: ' + str(syl))
            newSyl = False
    
    sylCount.append(syl)
    
    return sylCount

In [25]:
print(sylCount('Baerda Qingsheng', True))

a. b: init - cons
a. ba: init - vowel
b. b|ae: fin - array error - new syl
new syl: 2
b. ø|er: er - not "er" final
new syl: 3
a. d: init - cons
a. da: init - vowel
0. space: new word
new syl: 1
a. q: init - cons
a. qi: init - vowel
b. q|in: fin - trying: True - maybe more
b. q|ing: fin - trying: True - maybe more
b. q|ings: fin - array error - new syl
new syl: 2
a. sh: init - cons
a. she: init - vowel
b. sh|en: fin - trying: True - maybe more
b. sh|eng: fin - trying: True - maybe more
[3, 2]


In [26]:
names = pd.read_csv('Chinese Names.csv')
names['syl'] = names['EngName'].apply(sylCount, test=False)
names['check'] = names['ChiName'].str.len() == names['syl'].apply(sum)
names[~names['check']]

Unnamed: 0,EngName,ChiName,syl,check
198,Ai Airen,愛仁,"[1, 2]",False
199,Ai Aixinga,愛興阿,"[1, 3]",False
433,Aixinjueluo Bukuliyongshun,布庫里雍順,"[4, 5]",False
455,Aixinjueluo Chongshan,充善,"[4, 2]",False
500,Aixinjueluo Duoergun,多尔衮,"[4, 3]",False
504,Aixinjueluo Eerdeng,愛新覺羅爾登,"[4, 3]",False
534,Aixinjueluo Fulin,福临,"[4, 2]",False
535,Aixinjueluo Fuman,福滿,"[4, 2]",False
592,Aixinjueluo Hongli,弘历,"[4, 2]",False
615,Aixinjueluo Huangtaiji,皇太极,"[4, 3]",False


In [27]:
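# Rows that fail the check and overcount by less than two syllables.
# Combining both conditions in one mask avoids chained boolean indexing.
names[((names['syl'].apply(sum) - names['ChiName'].str.len()) < 2) & ~names['check']]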

Unnamed: 0,EngName,ChiName,syl,check
198,Ai Airen,愛仁,"[1, 2]",False
199,Ai Aixinga,愛興阿,"[1, 3]",False
504,Aixinjueluo Eerdeng,愛新覺羅爾登,"[4, 3]",False
648,Aixinjueluo Jishen,愛新覺羅紳,"[4, 2]",False
854,Aixinjueluo Suku,愛新覺羅庫,"[4, 2]",False
858,Aixinjueluo Tai,愛新覺羅卓泰,"[4, 1]",False
965,Aixinjueluo Yishu,愛新覺羅奕雨澍,"[4, 2]",False
1023,Aixinjueluo Yuen,愛新覺羅裕恩,"[4, 1]",False
1260,An Anduha,安篤哈,"[1, 3]",False
1261,An Anfu,安福,"[1, 2]",False


In [16]:
names[names['EngName'].str.contains('yuen')]

Unnamed: 0,EngName,ChiName,syl,check
40089,Fangyuyuenai,防禦岳鼐,[4],True
