# Pinyin Syllable Count Generator Working Draft Notes
This document records the experiments behind the Pinyin Syllable Count Generator module. The module takes two columns of string data: one with unbroken Chinese text, the other with its pinyin counterpart. It estimates the number of syllables in the Chinese text from the number of syllables it can detect in the supplied Latin characters. The results are delivered as integers separated by commas, one count per word (for example, `[4, 3]` for a two-word name).

The pandas package handles the vector manipulation and storage.

In [1]:
import pandas as pd
import numpy as np
import csv
import sys

Pinyin is a structured system with accepted combinations of "initials" (the initial sound of a syllable, e.g. the first consonant) and "finals" (the ending sound of a syllable, e.g. a vowel group that may end in a consonant such as n or ng). One syllable in pinyin can be as small as one letter and as large as six letters. Detecting a syllable can therefore occur in one of two ways:
1. A sequence of pinyin letters can be compared to a list of accepted pinyin syllables. If the sequence appears in the list, or if the sequence is part of a longer accepted term ("tia" by itself is not an accepted syllable, however "tiao" and "tian" are), then an additional letter from the evaluated word can be added to the sequence and reevaluated. If the sequence does not appear in the list and is not part of a longer accepted term, then the last character of the sequence becomes the first character of the next syllable and the syllable count is incremented.
2. A sequence of pinyin letters is evaluated against a DataFrame of initials and finals, utilizing the .loc command to spot matches if they exist or null values if they don't.
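
The first approach can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the `SYLLABLES` set is a hypothetical toy subset of the full syllable list, and the function names are invented for this example. It uses a prefix test (`startswith`) for the "part of a longer accepted term" check.

```python
# Toy subset of accepted syllables (hypothetical); the real list is much longer.
SYLLABLES = {"tiao", "tian", "ti", "an", "a", "men", "me", "en", "e", "ni", "hao"}

def is_prefix_of_syllable(seq, syllables):
    """Return True if seq is an accepted syllable or the start of a longer one."""
    return any(s.startswith(seq) for s in syllables)

def count_syllables(word, syllables=SYLLABLES):
    """Greedily extend the current sequence; start a new syllable when it cannot grow."""
    count = 1
    seq = ""
    for ch in word.lower():
        if is_prefix_of_syllable(seq + ch, syllables):
            # Still an accepted syllable, or part of a longer one: keep growing.
            seq += ch
        else:
            # The new letter becomes the first letter of the next syllable.
            count += 1
            seq = ch
    return count

print(count_syllables("tiananmen"))
print(count_syllables("nihao"))
```

Because the sequence always grows as far as it can, a greedy count like this may miscount ambiguous spellings (for instance, "xian" read as one syllable rather than "xi" + "an"), so it should be treated as an approximation.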

Given these possibilities, two metrics need to be considered in order to settle on the proper method for the evaluation function: speed and size. Comparing a sequence to a list of terms can be faster, but the list itself may be excessively large. Evaluating sequences by their combination of initials and finals may take longer but will result in a much smaller package. The following exercises test each method on a small scale to get a general sense of the speed vs. size trade-off.
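
As a rough sketch of how both metrics might be measured outside a notebook, the snippet below uses the standard `timeit` module and `sys.getsizeof` on toy data. The lists, the boolean table, and the helper names are hypothetical stand-ins for the real CSV contents, so the numbers it prints say nothing about the full data set.

```python
import sys
import timeit

import numpy as np

# Toy data (hypothetical subset of the real syllable inventory).
syllable_list = ["ba", "beng", "bi", "ta", "te", "teng", "ti"]
initials = ["b", "t"]
finals = ["a", "e", "eng", "i"]
# table[i, j] is True when initials[i] + finals[j] is an accepted syllable.
table = np.array([[True, False, True, True],
                  [True, True, True, True]])

def in_list(seq):
    """Method 1: membership test against the flat syllable list."""
    return seq in syllable_list

def in_table(initial, final):
    """Method 2: look up the initial/final combination in the boolean table."""
    return bool(table[initials.index(initial), finals.index(final)])

# sys.getsizeof reports only the container itself for a list (not the strings
# it points to), so these figures are a lower bound for the list-based methods.
list_size = sys.getsizeof(syllable_list)
table_size = sys.getsizeof(table) + sys.getsizeof(initials) + sys.getsizeof(finals)

list_time = timeit.timeit(lambda: in_list("teng"), number=10_000)
table_time = timeit.timeit(lambda: in_table("t", "eng"), number=10_000)

print(f"list:  {list_size} bytes, {list_time:.4f} s per 10,000 lookups")
print(f"table: {table_size} bytes, {table_time:.4f} s per 10,000 lookups")
```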

## Setup for small tests

An example sequence of letters will be tested with both methods.

In [2]:
# Sample sequence
smEval = 'teng'

## LIST ##
# Import list of all pinyin combinations
smList = []
with open("pinyinList.csv") as f:
    r = csv.reader(f)
    for row in r:
        smList += row
        
## NUMPY/DATAFRAME ##
# Variables for evaluation
smInitial = 't'
smFinal = 'eng'

# List of initials, finals for Numpy evaluation
initList = ['ø','b','p','m','f','d','t','n','l','z','c','s',
            'zh','ch','sh','r','j','q','x','g','k','h','y','w']
finList = ['a','ai','ao','an','ang','e','ei','en','eng','er',
           'o','ong','ou','i','ia','iao','ie','iu','ian','iang',
           'in','ing','iong','u','ua','uo','ue','ui','uai','uan','uen','uang','un','v','ve']

# Import Numpy array from CSV
smAr = np.genfromtxt('pinyinDF.csv', delimiter=',', skip_header=1, filling_values=0
                    ).astype('bool_')
smAr = np.delete(smAr, 0, 1)

# Import dataframe from CSV 
smDF = pd.read_csv('pinyinDF.csv', header=0, index_col=0, dtype={'INDEX':str})
smDF = smDF.fillna(0).astype('bool')

print('Python List:')
print(str(smList[50:55]) + '\n')
print('Numpy Array:')
print(str(smAr[0])  + '\n')
print('Pandas DataFrame:')
print(smDF.head(3))

Python List:
['bu', 'pa', 'pai', 'pao', 'pan']

Numpy Array:
[ True  True  True  True  True  True  True  True  True  True  True False
  True  True  True  True  True  True  True  True  True  True  True False
 False False False False False False False False False False False]

Pandas DataFrame:
      a    ai    ao    an   ang      e    ei    en   eng     er  ...     uo  \
ø  True  True  True  True  True   True  True  True  True   True  ...  False   
b  True  True  True  True  True  False  True  True  True  False  ...  False   
p  True  True  True  True  True  False  True  True  True  False  ...  False   

      ue     ui    uai    uan    uen   uang     un      v     ve  
ø  False  False  False  False  False  False  False  False  False  
b  False  False  False  False  False  False  False  False  False  
p  False  False  False  False  False  False  False  False  False  

[3 rows x 35 columns]


In [3]:
# Method 1: Python List
print('Python List: ' + str(sys.getsizeof(smList)))
%timeit -n 100000 smEval in smList

# Method 2: Numpy Array
print('\nNumpy array: ' + str(sys.getsizeof(initList) +
                              sys.getsizeof(finList) +
                              sys.getsizeof(smAr)) +
      ', without initials and finals lists: ' + str(sys.getsizeof(smAr)))
%timeit -n 100000 smAr[initList.index(smInitial),finList.index(smFinal)] != True

# Method 3: Pandas DataFrame
# 3a: slice
print('\nDataFrame: ' + str(sys.getsizeof(smDF)))
print('slice')
%timeit -n 100000 smDF[smFinal][smInitial] != True

# 3b: loc
print('loc')
%timeit -n 100000 smDF.loc[smInitial,smFinal] != True

# 3c: iloc
print('iloc')
%timeit -n 100000 smDF.iloc[initList.index(smInitial),finList.index(smFinal)] != True

Python List: 3760
1.32 µs ± 8.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numpy array: 1552, without initials and finals lists: 952
1.48 µs ± 37.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

DataFrame: 2443
slice
9.05 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
loc
7.71 µs ± 43.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
iloc
9.08 µs ± 51.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [67]:
def sylCountAudit(pinyin):
    '''Count the number of Chinese characters based on inputted pinyin text.'''

    # Lists for reference
    consonant = ['b','c','ch','d','f','g','h','j','k','l','m','n',
                 'p','q','r','s','sh','t','w','x','y','z','zh']
    vowel = ['a','e','i','o','u','v']
    
    # Variables used for pinyin evaluation
    
    pinyin = pinyin.strip()
    
    pEval = ''
    initial = ''
    final = ''
    result = False
    initAr = []
    space = False
    syl = 1
    sylCount = []
    
    
    for i,n in enumerate(str.lower(pinyin)):
        
        if n == ' ':
            if not space:
                sylCount.append(syl)
                initial = ''
                final = ''
                syl = 1
            if pinyin[i+1] == ' ':
                space = True
                print('0. ' + ' space - double space')
            else:
                print('0. ' + ' space - new word')
                space = False
            print('0. ' + initial + '|' + final + n + ' = space - new word')
        elif final: 
            if pinyin[i-1:i+1] == 'er':
                try:
                    t = pinyin[i+1]
                except IndexError:
                    syl += 1
                else:
                    if t not in vowel:
                        if final == 'e':
                            syl += 1
                        else:
                            syl += 2
                    initial = ''
                    final = ''
                    print('1c.' + initial + '|' + final + n + ' = fin - \"er\", new syl = ' + str(syl))
                    continue
            try:
                result = smAr[initList.index(initial),finList.index(final + n)]
                print('1a. ' + initial + '|' + final + n + ' = init - trying: ' + str(result))
            except ValueError:
                result = False
                print('1b. ' + initial + '|' + final + n + ' = fin - error')
            if not result and not any(i for i in finList if (final+n) in i):
                syl += 1
                if n in vowel:
                    initial = 'ø'
                    final = n
                else:
                    initial = n
                    final = ''
                result = False
                print('2a. ' + initial + '|' + final + ' = fin - new syl = ' + str(syl))
            else:
                final += n
                print('2b. ' + initial + '|' + final + ' = fin - more')
        elif n not in vowel:
            initial += n
            print('3. ' + initial + ' = init - cons')
        elif not initial:
            initial = 'ø'
            final = n
            print('4. ' + initial + '|' + final + ' = init - vowel')
        else:
            final = n
            print('5. ' + initial + '|' + final + ' = fin - vowel')
    
    sylCount.append(syl)
    
    return sylCount

In [91]:
def sylCount(pinyin):
    '''Count the number of Chinese characters based on inputted pinyin text.'''

    # Lists for reference
    consonant = ['b','c','ch','d','f','g','h','j','k','l','m','n',
                 'p','q','r','s','sh','t','w','x','y','z','zh']
    vowel = ['a','e','i','o','u','v']
    
    # Variables used for pinyin evaluation
    
    pinyin = pinyin.strip()
    
    pEval = ''
    initial = ''
    final = ''
    result = False
    initAr = []
    space = False
    syl = 1
    sylCount = []
    
    
    for i,n in enumerate(str.lower(pinyin)):
        
        if n == ' ':
            if not space:
                sylCount.append(syl)
                initial = ''
                final = ''
                syl = 1
            if pinyin[i+1] == ' ':
                space = True
                #print('0. ' + ' space - double space')
            else:
                #print('0. ' + ' space - new word')
                space = False
            #print('0. ' + initial + '|' + final + n + ' = space - new word')
        elif final: 
            if pinyin[i-1:i+1] == 'er':
                try:
                    t = pinyin[i+1]
                except IndexError:
                    syl += 1
                    continue
                else:
                    if t not in vowel:
                        syl += 1
                        initial = ''
                        final = ''
                        #print('1c.' + initial + '|' + final + n + ' = fin - \"er\", new syl = ' + str(syl))
                        continue
            try:
                result = smAr[initList.index(initial),finList.index(final + n)]
                #print('1a. ' + initial + '|' + final + n + ' = init - trying: ' + str(result))
            except ValueError:
                result = False
                #print('1b. ' + initial + '|' + final + n + ' = fin - error')
            if not result and not any(i for i in finList if (final+n) in i):
                syl += 1
                if n in vowel:
                    initial = 'ø'
                    final = n
                else:
                    initial = n
                    final = ''
                result = False
                #print('2a. ' + initial + '|' + final + ' = fin - new syl = ' + str(syl))
            else:
                final += n
                #print('2b. ' + initial + '|' + final + ' = fin - more')
        elif n not in vowel:
            initial += n
            #print('3. ' + initial + ' = init - cons')
        elif not initial:
            initial = 'ø'
            final = n
            #print('4. ' + initial + '|' + final + ' = init - vowel')
        else:
            final = n
            #print('5. ' + initial + '|' + final + ' = fin - vowel')
    
    sylCount.append(syl)
    
    return sylCount

In [92]:
sylCount('Aixinjueluo Jierhalang')

[4, 3]

In [66]:
pinyin = 'eywhadh'
result = False
for i, n in enumerate(pinyin):
    if pinyin[i-1:i+1] == 'ha':
        result = True
    print(str(result) + str(i))
    print(len(pinyin))

False0
7
False1
7
False2
7
False3
7
True4
7
True5
7
True6
7


In [45]:
smAr[initList.index('w'),finList.index('eiz')]

ValueError: 'eiz' is not in list

In [7]:
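def sylCount(word):
    '''List-based draft: count the syllables in a single word.

    Relies on varList, which is not defined in these notes; it is assumed
    to be the list of accepted pinyin syllables (compare smList above).
    '''
    tEval = ''
    syl = 1

    for n in word:
        tEval = tEval + n
        # Start a new syllable once the sequence is neither an accepted
        # syllable nor a substring of one.
        if (tEval not in varList) and not any(i for i in varList if tEval in i):
            syl += 1
            tEval = n
    return syl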

In [86]:
names = pd.read_csv('Chinese Names.csv')
names[names['ChiName'].str.len() > 5]

Unnamed: 0,EngName,ChiName
102,A Shinaaboshe,阿史那阿波設
104,A Shinachuoluokehan,阿史那啜羅可汗
106,A Shinahuseluo,阿史那斛瑟羅
107,A Shinaqiminkehan,阿史那啟民可汗
120,Abilule Suofeiyinga,阿畢魯勒索費英阿
123,Aboerjijite Aleqinga,阿博爾濟吉特阿勒清阿
152,Ademishimenggu,阿的彌失蒙古
153,Ademishitiemuer,阿的迷失帖木兒
156,Adouhalamanhudoupuhuayaozhu,阿都哈剌蠻護都普花咬住
170,Aerhaihesaer,阿兒孩合撒兒


In [87]:
names['syl'] = names['EngName'].apply(sylCount)
names['check'] = names['ChiName'].str.len() == names['syl'].apply(sum)
names[~names['check']]

Unnamed: 0,EngName,ChiName,syl,check
19,Duoerzhi,朶兒只,[4],False
22,Eerdeng,額爾登,[4],False
23,Eerte,額爾忒,[4],False
42,Laier,萊兒,[3],False
123,Aboerjijite Aleqinga,阿博爾濟吉特阿勒清阿,"[7, 4]",False
136,Achaer,阿察兒,[4],False
164,Aerbai,阿爾拜,[4],False
165,Aerbanga,阿爾邦阿,[5],False
166,Aerbena,阿爾本阿,[5],False
167,Aerbu,阿爾布,[4],False


In [88]:
names[names['EngName'].str.contains('er') & ~names['check']]

Unnamed: 0,EngName,ChiName,syl,check
19,Duoerzhi,朶兒只,[4],False
22,Eerdeng,額爾登,[4],False
23,Eerte,額爾忒,[4],False
42,Laier,萊兒,[3],False
123,Aboerjijite Aleqinga,阿博爾濟吉特阿勒清阿,"[7, 4]",False
136,Achaer,阿察兒,[4],False
164,Aerbai,阿爾拜,[4],False
165,Aerbanga,阿爾邦阿,[5],False
166,Aerbena,阿爾本阿,[5],False
167,Aerbu,阿爾布,[4],False
