# Tang History Database - Name Syllable Challenge
**Jeff Heller, Data and Project Coordinator - East Asian Studies, Princeton University**

*April 26, 2019*

A biography in the Old and New Tang History typically begins with the figure's full name (family and given name) in the introductory paragraph. From that point on, however, the figure is referred to only by their given name. Unlike English names, Chinese names are written without a space between the family and given name (e.g. Wenren Suian, 聞人遂安), so it is difficult to tell where the given name appears in the text after its initial mention. However, each syllable in a Chinese name is written as its own character. This process therefore counts the syllables in the romanized (English) family and given names and uses those counts to determine the person's Chinese given name.

Please be aware that this notebook is structured unconventionally: it details the process of its creation in reverse order.
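
As a minimal sketch of the approach described in the introduction, assume the syllable count of the romanized family name is already known. Because each syllable corresponds to one character, the given name is whatever remains of the Chinese full name once the family-name characters are removed. The helper `split_chinese_name` and its parameter names are hypothetical and are not part of the notebook.

```python
def split_chinese_name(chi_name, family_syllables):
    """Split a Chinese full name into (family, given) by character count.

    Hypothetical helper: assumes one character per syllable and that the
    family name is written first.
    """
    return chi_name[:family_syllables], chi_name[family_syllables:]


# Names taken from the data shown in this notebook.
print(split_chinese_name('安祿山', 1))      # An Lushan
print(split_chinese_name('阿史那社爾', 3))  # Ashina Sheer
```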

## The Completed Function and Results
The success of the project depends heavily on the romanization of Chinese names. Romanization does not employ "ye olde European" spelling trickery with multiple silent letters (e.g. "esque" pronounced as *esk* instead of *esk-uh*, "Champagne" pronounced as *sham-payn* instead of *sham-pog-nuh*). A consonant paired with a vowel therefore always designates a new syllable, which greatly aids in counting the syllables in a name. Most adjacent vowels are likewise pronounced together as a single syllable:
> "Bai" will produce *bye,* "Bao" will produce *bow.* 

However, this rule does not hold for certain vowel combinations that span two syllables and are therefore written in Chinese as two characters:
> "Huaien" will produce *hoowhy-en,* "Sheer" will produce *shee-eer.*

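To illustrate why these combinations need special handling, here is a minimal sketch of a naive counter that treats every run of consecutive vowels as a single syllable. The function name `naive_vowel_groups` is hypothetical and is not the function used in this project. The naive count is right for "Bai" and "Bao" (one syllable each), but it also returns 1 for "Huaien" and "Sheer", each of which is written with two characters.

```python
import re


def naive_vowel_groups(word):
    """Count runs of consecutive vowels (hypothetical, naive approach)."""
    return len(re.findall(r'[aeiou]+', word.lower()))


for name in ['Bai', 'Bao', 'Huaien', 'Sheer']:
    print(name, naive_vowel_groups(name))
```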
Therefore it was important to identify the concurrent vowel combinations that appear in our data and label each with the number of syllables it represents; that process is detailed in the later part of this report. First, the finished function is displayed below:

In [156]:
import sys

# verbose option provided to audit function process. Mark True to utilize.
verbose = False

def count_syllables(word):
    '''Based on countsyl (https://github.com/akkana/scripts/blob/master/countsyl)'''

    # The parameter is converted to all lowercase letters. Capitalized letters
    # will not be identified correctly if left in.
    word = word.lower()
    
    vowels = ['a', 'e', 'i', 'o', 'u']

    # The following variables specifically address the vowel combinations found
    # in the data. Variable oneGang (kept for reference; only twoGang is used
    # below) has been verified as complete using the following link:
    # https://www.fluentu.com/blog/chinese/2018/02/28/chinese-vowels/
    oneGang = ['ai','ao','ei','ia','ie','io','iu','ou','ua','ue','ui','uo','iao','uai']
    twoGang = ['oe','aoe','iai','uaie','uia']

    # Initialization of variables used below.
    syl = 0
    prev = ''
    concurVowel = ''
    
    for c in word:
        
        # Each letter is evaluated. If a letter is a vowel, it is compared with
        # the previous letter, and if both are vowels then the new vowel is added
        # next to the previous in concurVowel, otherwise it is overwritten.
        if prev in vowels and c in vowels:
            concurVowel += c
        elif c in vowels:
            concurVowel = c
            syl += 1
            
        # Special conditionals for cases not following above logic.
        elif prev == 'e' and c == 'r' and len(concurVowel) > 1:
            syl += 1
        elif prev == 'v' and c != 'e':
            syl += 1
            
        # If any combination of concurrent vowels appear in twoGang, add a syllable.
        if len(concurVowel) > 1 and concurVowel in twoGang:
            syl += 1
            concurVowel = ''
            
        # In all instances, move the current letter to the prev variable for
        # future comparisons.
        prev = c
        
    # In case none of the logic above applies, mark as one syllable.
    if not syl:
        syl = 1

    return syl

In [157]:
import pandas as pd

syllableTest = pd.read_csv('TangEngNames.csv', header=None, names=['familyName','chiName'])
syllableTest.loc[syllableTest['familyName'].str.split().str.len() == 2, 'givenName'] = syllableTest['familyName'].str.split().str[-1]
syllableTest['familyName'] = syllableTest['familyName'].str.split().str[0]
syllableTest = syllableTest[['familyName', 'givenName', 'chiName']]
syllableTest.head()

Unnamed: 0,familyName,givenName,chiName
0,An,Lushan,安祿山
1,An,Qingxu,安慶緒
2,An,Sishun,安思順
3,An,Taiqing,安太清
4,An,Xinggui,安興貴


In [158]:
def repeatVowel(word):
    '''Returns the last run of two or more adjacent vowels in a word ('' if none).'''

    # The parameter was renamed from `str` so it no longer shadows the built-in.
    word = word.lower()
    vowels = ['a', 'e', 'i', 'o', 'u']
    concurVowel = ''
    saveResult = ''

    # Track whether the previous letter was a vowel while scanning the word.
    prevIsVowel = False
    for c in word:
        isVowel = c in vowels
        if prevIsVowel and isVowel:
            # Extend the current run and remember it as the result.
            concurVowel += c
            saveResult = concurVowel
        elif isVowel:
            # A vowel that follows a consonant starts a new run.
            concurVowel = c
        else:
            concurVowel = ''
        prevIsVowel = isVowel
    return saveResult

In [159]:
syllableTest['familySyl'] = syllableTest['familyName'].apply(count_syllables)
syllableTest['givenSyl'] = syllableTest['givenName'].apply(count_syllables)
syllableTest['familyMulti'] = syllableTest['familyName'].apply(repeatVowel)
syllableTest['givenMulti'] = syllableTest['givenName'].apply(repeatVowel)
display(syllableTest)

Unnamed: 0,familyName,givenName,chiName,familySyl,givenSyl,familyMulti,givenMulti
0,An,Lushan,安祿山,1,2,,
1,An,Qingxu,安慶緒,1,2,,
2,An,Sishun,安思順,1,2,,
3,An,Taiqing,安太清,1,2,,ai
4,An,Xinggui,安興貴,1,2,,ui
5,An,Xiuren,安修仁,1,2,,iu
6,Ashina,Sheer,阿史那社爾,3,2,,ee
7,Ashina,Sunishi,阿史那蘇尼失,3,3,,
8,Ashina,Zhong,阿史那忠,3,1,,
9,Bai,Juyi,白居易,1,2,ai,


In [160]:
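# A name verifies when the family and given syllable counts add up to the
# number of characters in the Chinese name.
syllableTest['Verify'] = ( syllableTest['familySyl']
                          + syllableTest['givenSyl']
                          == syllableTest['chiName'].apply(len)
                         )
# Show only the names that fail verification.
syllableTest.loc[~syllableTest['Verify']]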

Unnamed: 0,familyName,givenName,chiName,familySyl,givenSyl,familyMulti,givenMulti,Verify
46,Chen,Ziang,陳子昂,1,1,,ia,False
1039,Tian,Jian,田季安,1,1,ia,ia,False
1077,Wang,Jun?,王君㚟,1,1,,,False
1398,Yao,Ziang,藥子昂,1,1,ao,ia,False


In [45]:
syllableTest.loc[syllableTest['lastMulti'].str.contains('iai')]

Unnamed: 0,firstName,lastName,firstSyl,lastSyl,firstMulti,lastMulti
205,Fang,Yiai,1,2,,iai


In [237]:
# pd.concat is used because Series.append was removed in pandas 2.0.
pd.concat([syllableTest.firstDouble, syllableTest.lastDouble]).sort_values().unique()

array(['', 'ai', 'ao', 'ee', 'ei', 'ia', 'ie', 'io', 'iu', 'oe', 'ou',
       'ua', 'ue', 'ui', 'uo'], dtype=object)

In [238]:
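def multiVowel(word):
    '''Return the last run of three or more consecutive vowels in a word ('' if none).'''

    # Note: the word is not lowercased here, so a capitalized vowel is
    # treated as a non-vowel.
    vowels = ['a', 'e', 'i', 'o', 'u']

    groupVowel = ''
    cList = ''
    vCount = 0
    for c in word:
        if c in vowels:
            # Extend the current run of vowels.
            vCount += 1
            cList += c
        else:
            # Any other character ends the run.
            vCount = 0
            cList = ''
        if vCount > 2:
            groupVowel = cList
    return groupVowel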

In [239]:
syllableTest['firstMulti'] = syllableTest['firstName'].apply(multiVowel)
syllableTest['lastMulti'] = syllableTest['lastName'].apply(multiVowel)
display(syllableTest)

Unnamed: 0,firstName,lastName,firstSyl,lastSyl,firstDouble,lastDouble,firstMulti,lastMulti
0,An,Lushan,1,2,,,,
1,An,Qingxu,1,2,,,,
2,An,Sishun,1,2,,,,
3,An,Taiqing,1,2,,ai,,
4,An,Xinggui,1,2,,ui,,
5,An,Xiuren,1,2,,iu,,
6,Ashina,Sheer,3,2,,ee,,
7,Ashina,Sunishi,3,3,,,,
8,Ashina,Zhong,3,1,,,,
9,Bai,Juyi,1,2,ai,,,


In [240]:
# pd.concat is used because Series.append was removed in pandas 2.0.
pd.concat([syllableTest.firstMulti, syllableTest.lastMulti]).sort_values().unique()

array(['', 'aoe', 'iai', 'iao', 'uai', 'uaie', 'uia'], dtype=object)

In [241]:
oneGang = ['iao', 'uai']
twoGang = ['aoe', 'iai', 'uaie', 'uia']

In [242]:
syllableTest.loc[syllableTest['lastMulti'].str.contains('uaie')]

Unnamed: 0,firstName,lastName,firstSyl,lastSyl,firstDouble,lastDouble,firstMulti,lastMulti
192,Dugu,Huaien,2,2,,ie,,uaie
953,Pugu,Huaien,2,2,,ie,,uaie


In [243]:
import numpy as np
syllableTest["lastSyl"] = np.where(syllableTest["lastMulti"].isin(twoGang), syllableTest["lastSyl"] + 1, syllableTest["lastSyl"])

In [244]:
syllableTest.loc[syllableTest['lastMulti'].str.contains('aoe')]

Unnamed: 0,firstName,lastName,firstSyl,lastSyl,firstDouble,lastDouble,firstMulti,lastMulti
1408,Yu,Chaoen,1,3,,oe,,aoe
