# Tang History Database - Name Syllable Challenge
**Jeff Heller, Data and Project Coordinator - East Asian Studies, Princeton University**

*April 26, 2019*

A biography in the Old and New Tang History typically introduces a figure by full name (family name followed by given name) in its opening paragraph. From that point on, however, the figure is referred to only by given name. Unlike their romanized forms, Chinese names are written without a space between family and given name (e.g. Wenren Suian, 聞人遂安), so it is difficult to tell which characters make up the given name and, therefore, where it reappears later in the text. However, each syllable of a Chinese name is written as exactly one character. This process therefore counts the syllables in the romanized family and given names and uses those counts to determine the person's Chinese given name.
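
As a minimal sketch of that last step, assuming the syllable count of the family name is already known, the Chinese full name can be sliced into its two parts. The helper name `split_chinese_name` is hypothetical and does not appear elsewhere in this notebook.

```python
def split_chinese_name(chi_name, family_syl):
    """Split a Chinese full name into (family, given).

    Assumes one character per syllable, so the first `family_syl`
    characters are the family name and the rest are the given name.
    """
    return chi_name[:family_syl], chi_name[family_syl:]


# "Wenren" is two syllables, so the family name takes two characters.
print(split_chinese_name('聞人遂安', 2))  # ('聞人', '遂安')
# "An" is one syllable, so the family name takes one character.
print(split_chinese_name('安祿山', 1))    # ('安', '祿山')
```

The given-name half is the string that would then be searched for in the body of the biography.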

Please be aware that this notebook is structured unconventionally: it presents the finished result first and then details the process of its creation in reverse order.

## The Completed Function and Results
The success of the project depends heavily on the romanization of Chinese names. Pinyin does not employ "ye olde European" spelling trickery with silent letters (e.g. "esque" pronounced as *esk* instead of *esk-uh*, "Champagne" pronounced as *sham-payn* instead of *sham-pog-nuh*). As a result, a consonant followed by a vowel generally marks a new syllable, which greatly simplifies counting the syllables in a name. Adjacent vowels usually stay within a single syllable:
> "Bai" will produce *bye*, "Bao" will produce *bow*.

However, this behavior is not universal: certain vowel combinations span two syllables and are written in Chinese as two characters:
> "Huaien" will produce *hoowhy-en*, "Sheer" will produce *shee-eer*.

It was therefore important to catalog the concurrent vowels that appear in our data and label each with its syllable count; that process is detailed in the later part of this report. First, the finished function is displayed below:

### The `count_syllables` Function

In [69]:
# verbose option provided to audit function process.
# Mark True to utilize.
verbose = False

def count_syllables(word):
    '''Count the syllables in a romanized Chinese name.

    Based on countsyl
    (https://github.com/akkana/scripts/blob/master/countsyl).'''

    # The parameter is converted to all lowercase letters. Capitalized letters
    # will not be identified correctly if left in.
    word = word.lower()
    
    vowels = ['a', 'e', 'i', 'o', 'u']

    # The following variables specifically address found vowel combinations.
    # Variable oneGang has been verified as complete using the following link:
    # https://www.fluentu.com/blog/chinese/2018/02/28/chinese-vowels/
    oneGang = ['ai','ao','ei','ia','ie','io','iu','ou',
               'ua','ue','ui','uo','iao','uai']
    twoGang = ['oe','aoe','iai','uaie','uia']

    # Initialization of variables used below.
    syl = 0
    prev = ''
    concurVowel = ''
    
    for c in word:
        
        # Each letter is evaluated. If a letter is a vowel, it is compared with
        # the previous letter. If both are vowels then the new vowel is added
        # next to the previous in concurVowel, otherwise it is overwritten.
        if prev in vowels and c in vowels:
            concurVowel += c
        elif c in vowels:
            concurVowel = c
            syl += 1
            
        # Special conditionals for cases not following above logic.
        elif prev == 'e' and c == 'r' and len(concurVowel) > 1:
            syl += 1
        elif prev == 'v' and c != 'e':
            syl += 1
            
        # If any combination of vowels appears in twoGang, add a syllable.
        if len(concurVowel) > 1 and concurVowel in twoGang:
            syl += 1
            concurVowel = ''
            
        # In all instances, move the current letter to the prev variable for
        # future comparisons.
        prev = c
        
    # In case none of the logic above applies, mark as one syllable.
    if not syl:
        syl = 1

    return syl
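
To see what the two lists do in isolation, the sketch below classifies each run of vowels in a name. It is a simplification, not a replacement for `count_syllables`: the lists are copied from `oneGang` and `twoGang` above, while the helper names and the regular-expression approach are assumptions made for illustration.

```python
import re

# Cluster lists copied from oneGang and twoGang in count_syllables.
ONE_SYLLABLE = {'ai', 'ao', 'ei', 'ia', 'ie', 'io', 'iu', 'ou',
                'ua', 'ue', 'ui', 'uo', 'iao', 'uai'}
TWO_SYLLABLES = {'oe', 'aoe', 'iai', 'uaie', 'uia'}


def cluster_syllables(cluster):
    """Return 1 or 2 for a known vowel run, or None if it is in neither list."""
    if len(cluster) == 1 or cluster in ONE_SYLLABLE:
        return 1
    if cluster in TWO_SYLLABLES:
        return 2
    return None


def cluster_report(name):
    """Return (cluster, syllables) pairs for every run of vowels in a name."""
    runs = re.findall(r'[aeiou]+', name.lower())
    return [(run, cluster_syllables(run)) for run in runs]


for name in ['Bai', 'Huaien', 'Sheer']:
    print(name, cluster_report(name))
# Bai [('ai', 1)]
# Huaien [('uaie', 2)]
# Sheer [('ee', None)]
```

The `None` for "Sheer" reflects that 'ee' is in neither list; in `count_syllables` that name is instead caught by the separate condition on an 'e' followed by 'r'.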

### Import the Data

With the function defined, the data is imported and organized into a workable format.

In [70]:
import pandas as pd

# Import from csv file and set initial columns (familyName includes given)
syllableTest = pd.read_csv('TangEngNames.csv',
                           header=None,
                           names=['CBDB_ID','familyName','chiName']
                          )

# Split given name from family name
syllableTest.loc[
    syllableTest['familyName'].str.split().str.len() == 2, 'givenName'] = (
    syllableTest['familyName'].str.split().str[-1]
)

# Adjust familyName to be only family name
syllableTest['familyName'] = syllableTest['familyName'].str.split().str[0]

# Reorder columns
syllableTest = syllableTest[['CBDB_ID','familyName', 'givenName', 'chiName']]
syllableTest.head()

|   | CBDB_ID | familyName | givenName | chiName |
|---|---|---|---|---|
| 0 | 379873 | An | Lushan | 安祿山 |
| 1 | 380462 | An | Qingxu | 安慶緒 |
| 2 | 378063 | An | Sishun | 安思順 |
| 3 | 378621 | An | Taiqing | 安太清 |
| 4 | 162825 | An | Xinggui | 安興貴 |


### Run the `count_syllables` Function on the Data

With the data organized, the goal is clearer: the number of syllables counted in the Pinyin should match the number of characters in the `chiName` column.
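
The comparison relies on Python's `len` counting characters, not bytes, in a string. A small sketch of the check for a single row, using counts that appear in the tables of this notebook (the function name `counts_match` is hypothetical):

```python
def counts_match(family_syl, given_syl, chi_name):
    """True when the syllable counts add up to the number of characters."""
    return family_syl + given_syl == len(chi_name)


print(len('阿史那社爾'))                  # 5
print(counts_match(3, 2, '阿史那社爾'))   # True  (Ashina Sheer)
print(counts_match(1, 1, '陳子昂'))       # False (Chen Ziang)
```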

Below, two new columns hold the results of applying the `count_syllables` function to the `familyName` and `givenName` columns.

In [71]:
syllableTest['familySyl'] = syllableTest['familyName'].apply(count_syllables)
syllableTest['givenSyl'] = syllableTest['givenName'].apply(count_syllables)
syllableTest.head()

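|   | CBDB_ID | familyName | givenName | chiName | familySyl | givenSyl |
|---|---|---|---|---|---|---|
| 0 | 379873 | An | Lushan | 安祿山 | 1 | 2 |
| 1 | 380462 | An | Qingxu | 安慶緒 | 1 | 2 |
| 2 | 378063 | An | Sishun | 安思順 | 1 | 2 |
| 3 | 378621 | An | Taiqing | 安太清 | 1 | 2 |
| 4 | 162825 | An | Xinggui | 安興貴 | 1 | 2 |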


### Verify the Results

To test the accuracy of the function, we will take the `familySyl` and `givenSyl` column values and add them together. If the sum equals the count of Chinese characters in the `chiName` column, the `Verify` column receives the value of `True`.

In [72]:
syllableTest['Verify'] = ( syllableTest['familySyl']
                          + syllableTest['givenSyl']
                          == syllableTest['chiName'].apply(len)
                         )
syllableTest.loc[syllableTest['Verify'] == False]

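|   | CBDB_ID | familyName | givenName | chiName | familySyl | givenSyl | Verify |
|---|---|---|---|---|---|---|---|
| 46 | 31322 | Chen | Ziang | 陳子昂 | 1 | 1 | False |
| 1039 | 379910 | Tian | Jian | 田季安 | 1 | 1 | False |
| 1077 | 378272 | Wang | Jun? | 王君㚟 | 1 | 1 | False |
| 1398 | 192209 | Yao | Ziang | 藥子昂 | 1 | 1 | False |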


### Manually Adjust the Failures

The four figures with challenging names are adjusted manually below, selected by their `CBDB_ID` number. This value identifies them more reliably than their index in the DataFrame. The process is then checked for accuracy by re-running the verification test from the previous step.

In [73]:
# Select rows and column in a single .loc call so the update is applied to
# syllableTest itself rather than to an intermediate object.
syllableTest.loc[syllableTest['CBDB_ID'] == 31322, 'givenSyl'] += 1
syllableTest.loc[syllableTest['CBDB_ID'] == 379910, 'givenSyl'] += 1
syllableTest.loc[syllableTest['CBDB_ID'] == 378272, 'givenSyl'] += 1
syllableTest.loc[syllableTest['CBDB_ID'] == 192209, 'givenSyl'] += 1

syllableTest['Verify'] = ( syllableTest['familySyl']
                          + syllableTest['givenSyl']
                          == syllableTest['chiName'].apply(len)
                         )
syllableTest.loc[syllableTest['Verify'] == False]

|   | CBDB_ID | familyName | givenName | chiName | familySyl | givenSyl | Verify |
|---|---|---|---|---|---|---|---|


The resulting table above is empty, signifying that every figure's syllable count now matches the number of characters in their Chinese name.
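
One caveat with the `+= 1` adjustments is that re-running the cell adds another syllable each time. A possible alternative, sketched here in plain Python and independent of the DataFrame, is to record the corrected value for each `CBDB_ID` so that applying the fix twice gives the same result. The names `MANUAL_GIVEN_SYL` and `corrected_given_syl` are hypothetical.

```python
# Corrected given-name syllable counts, keyed by CBDB_ID.
MANUAL_GIVEN_SYL = {
    31322: 2,   # Chen Ziang
    379910: 2,  # Tian Jian
    378272: 2,  # Wang Jun?
    192209: 2,  # Yao Ziang
}


def corrected_given_syl(cbdb_id, counted):
    """Return the manual value if one exists, otherwise the counted value."""
    return MANUAL_GIVEN_SYL.get(cbdb_id, counted)


print(corrected_given_syl(31322, 1))  # 2 (manual override applied)
print(corrected_given_syl(31322, 2))  # 2 (applying it again changes nothing)
print(corrected_given_syl(30936, 1))  # 1 (Chu Liang has no override)
```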

### Exporting the Results
A new DataFrame, `syllableFinal`, is created with only the figure's serial number in the database (`CBDB_ID`) and the syllable counts for family and given name. It is then exported as a CSV file.

In [75]:
syllableFinal = pd.DataFrame(data={
    'CBDB_ID':syllableTest['CBDB_ID'],
    'familySyl':syllableTest['familySyl'],
    'givenSyl':syllableTest['givenSyl']
})
syllableFinal = syllableFinal.set_index('CBDB_ID')
syllableFinal.to_csv('syllableResults.csv')
syllableFinal.head()

| CBDB_ID | familySyl | givenSyl |
|---|---|---|
| 379873 | 1 | 2 |
| 380462 | 1 | 2 |
| 378063 | 1 | 2 |
| 378621 | 1 | 2 |
| 162825 | 1 | 2 |


## Explaining the Failures
### "Ziang"
The process shows that the `count_syllables` function counts the syllables accurately for 1605 of the 1609 figures. However, the remaining four have to be adjusted manually in the database because of the complexities of their names. Consider the following rows, which contain the 'iang' combination in the `givenName` column.

In [47]:
syllableTest.loc[syllableTest['givenName'].str.contains('iang')].head()

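|   | CBDB_ID | familyName | givenName | chiName | familySyl | givenSyl | Verify |
|---|---|---|---|---|---|---|---|
| 46 | 31322 | Chen | Ziang | 陳子昂 | 1 | 1 | False |
| 53 | 145635 | Chou | Shiliang | 仇士良 | 1 | 2 | True |
| 54 | 30936 | Chu | Liang | 褚亮 | 1 | 1 | True |
| 55 | 30938 | Chu | Suiliang | 褚遂良 | 1 | 2 | True |
| 56 | 32007 | Chu | Wuliang | 褚無量 | 1 | 2 | True |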


The issue is clear when comparing the first row to the other four. "Ziang" is written with two characters in Chinese and is therefore two syllables, whereas "Liang" is one syllable. Adapting the function to compensate for so few errors would be time wasted.
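
For context, standard Hanyu Pinyin marks this kind of boundary with an apostrophe ("Zi'ang"), which the romanized names in this data do not carry. The sketch below assumes a hypothetical version of the data that kept the apostrophes and shows how they would separate the two cases. It deliberately simplifies by counting every run of vowels as one syllable, so it is not a substitute for `count_syllables`.

```python
import re


def count_with_apostrophes(name):
    """Count syllables, treating each apostrophe as a syllable boundary.

    Simplification: within each part, every run of vowels is one syllable.
    """
    parts = name.lower().split("'")
    return sum(len(re.findall(r'[aeiou]+', part)) for part in parts)


print(count_with_apostrophes("Zi'ang"))  # 2
print(count_with_apostrophes("Ziang"))   # 1 (boundary lost without the apostrophe)
print(count_with_apostrophes("Liang"))   # 1
```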

### "Jian"
"Jian" is even stranger. The following rows all have it as the given name.

In [48]:
syllableTest.loc[syllableTest['givenName'] == 'Jian']

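|   | CBDB_ID | familyName | givenName | chiName | familySyl | givenSyl | Verify |
|---|---|---|---|---|---|---|---|
| 169 | 92854 | Du | Jian | 杜兼 | 1 | 1 | True |
| 227 | 94339 | Gao | Jian | 高儉 | 1 | 1 | True |
| 341 | 145810 | Hun | Jian | 渾瑊 | 1 | 1 | True |
| 482 | 194774 | Li | Jian | 李兼 | 1 | 1 | True |
| 483 | 382421 | Li | Jian | 李建 | 1 | 1 | True |
| 728 | 32971 | Linghu | Jian | 令狐建 | 2 | 1 | True |
| 884 | 93963 | Meng | Jian | 孟簡 | 1 | 1 | True |
| 1024 | 162124 | Tang | Jian | 唐鑒 | 1 | 1 | True |
| 1025 | 142459 | Tang | Jian | 唐儉 | 1 | 1 | True |
| 1039 | 379910 | Tian | Jian | 田季安 | 1 | 1 | False |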


In most cases, "Jian" is one syllable. However, there is one instance where it is two. Again, modifying the function to address this single error would be wasted effort and would risk false positives on any names added in the future.

### "Jun?"

In [49]:
syllableTest.loc[syllableTest['givenName'] == 'Jun?']

|   | CBDB_ID | familyName | givenName | chiName | familySyl | givenSyl | Verify |
|---|---|---|---|---|---|---|---|
| 1077 | 378272 | Wang | Jun? | 王君㚟 | 1 | 1 | False |


The question mark in the name represents an unidentified character. Because of this, we cannot get an accurate representation of the name in Pinyin. Given how rare this issue is, it will not be addressed.
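
Although the case is not handled, such names could at least be flagged before counting. The following is a minimal sketch, assuming that a valid romanized name consists of letters only; the function name `has_unknown_character` is hypothetical.

```python
def has_unknown_character(name):
    """True when a romanized name contains anything other than letters."""
    return not name.isalpha()


print(has_unknown_character('Jun?'))    # True
print(has_unknown_character('Lushan'))  # False
```

Note that this check would also flag names containing spaces, hyphens, or apostrophes, so it is only suitable for data formatted like the names shown in this notebook.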

## Postscript: Additional Tools
### The `repeatVowel` Function
To help identify concurrent vowel combinations, the `repeatVowel` function was written to pull adjacent vowels out of each name and store them in a DataFrame column, so that the unique combinations could be listed and evaluated.

In [50]:
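def repeatVowel(word):
    '''Returns the last run of two or more adjacent vowels in a word,
    or an empty string if the word has none.'''

    # The parameter is named word so that it does not shadow the built-in str.
    word = word.lower()
    vowels = ['a', 'e', 'i', 'o', 'u']
    concurVowel = ''
    saveResult = ''

    # Walk the word one letter at a time, tracking whether the previous
    # letter was a vowel.
    prevIsVowel = False
    for c in word:
        isVowel = c in vowels
        if prevIsVowel and isVowel:
            # Extend the current run and remember it.
            concurVowel += c
            saveResult = concurVowel
        elif isVowel:
            # A vowel after a consonant starts a new run.
            concurVowel = c
        else:
            # A consonant ends the current run.
            concurVowel = ''
        prevIsVowel = isVowel
    return saveResult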
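
Because `repeatVowel` returns a single string, a name with more than one vowel run contributes only its last run to the list. If every run is wanted, a regular expression is one possible alternative. This is a sketch, not part of the original process; `all_vowel_clusters` is a hypothetical name and "Huaiguang" is an illustrative input that is not taken from the tables shown here.

```python
import re


def all_vowel_clusters(word):
    """Return every run of two or more adjacent vowels, in order."""
    return re.findall(r'[aeiou]{2,}', word.lower())


print(all_vowel_clusters('Huaiguang'))  # ['uai', 'ua']
print(all_vowel_clusters('Xinggui'))    # ['ui']
print(all_vowel_clusters('Lushan'))     # []
```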

The constructed function is then run on both the family and given names.

In [51]:
syllableTest['familyMulti'] = syllableTest['familyName'].apply(repeatVowel)
syllableTest['givenMulti'] = syllableTest['givenName'].apply(repeatVowel)
display(syllableTest)

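|   | CBDB_ID | familyName | givenName | chiName | familySyl | givenSyl | Verify | familyMulti | givenMulti |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 379873 | An | Lushan | 安祿山 | 1 | 2 | True |  |  |
| 1 | 380462 | An | Qingxu | 安慶緒 | 1 | 2 | True |  |  |
| 2 | 378063 | An | Sishun | 安思順 | 1 | 2 | True |  |  |
| 3 | 378621 | An | Taiqing | 安太清 | 1 | 2 | True |  | ai |
| 4 | 162825 | An | Xinggui | 安興貴 | 1 | 2 | True |  | ui |
| 5 | 378334 | An | Xiuren | 安修仁 | 1 | 2 | True |  | iu |
| 6 | 376419 | Ashina | Sheer | 阿史那社爾 | 3 | 2 | True |  | ee |
| 7 | 377887 | Ashina | Sunishi | 阿史那蘇尼失 | 3 | 3 | True |  |  |
| 8 | 384386 | Ashina | Zhong | 阿史那忠 | 3 | 1 | True |  |  |
| 9 | 32227 | Bai | Juyi | 白居易 | 1 | 2 | True | ai |  |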


Finally, the list of vowel combinations is generated from the results and sorted alphabetically.

In [52]:
# Series.append was removed in pandas 2.0; pd.concat combines the two columns.
pd.concat(
    [syllableTest.familyMulti, syllableTest.givenMulti]
).sort_values().unique()

array(['', 'ai', 'ao', 'aoe', 'ee', 'ei', 'ia', 'iai', 'iao', 'ie', 'io',
       'iu', 'ou', 'ua', 'uai', 'uaie', 'ue', 'ui', 'uia', 'uo'],
      dtype=object)
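
As a quick cross-check, the combinations found above can be compared against the `oneGang` and `twoGang` lists used in `count_syllables`. The sketch below copies all three lists by hand from this notebook; the lowercase variable names are chosen for illustration.

```python
# Combinations reported by repeatVowel (copied from the output above).
found = ['', 'ai', 'ao', 'aoe', 'ee', 'ei', 'ia', 'iai', 'iao', 'ie', 'io',
         'iu', 'ou', 'ua', 'uai', 'uaie', 'ue', 'ui', 'uia', 'uo']

# Lists copied from count_syllables.
one_gang = ['ai', 'ao', 'ei', 'ia', 'ie', 'io', 'iu', 'ou',
            'ua', 'ue', 'ui', 'uo', 'iao', 'uai']
two_gang = ['oe', 'aoe', 'iai', 'uaie', 'uia']

# Found combinations that neither list covers (ignoring the empty string).
uncovered = sorted(set(found) - set(one_gang) - set(two_gang) - {''})
print(uncovered)  # ['ee']

# Listed combinations that never show up in the found output.
unused = sorted((set(one_gang) | set(two_gang)) - set(found))
print(unused)  # ['oe']
```

The only uncovered combination is 'ee', which `count_syllables` handles through its separate condition on an 'e' followed by 'r'.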