# Summary

* Implemented a dependency-parsing-based algorithm that simulates Mandarin Tone 3 Sandhi (T3S). The pattern of T3S realization it produces depends on the syntactic structure of the expression.

* On the test expressions below, the current implementation produces output that is at least as natural as that of Google Text-to-Speech (gTTS).

# Background

* Mandarin is a tonal language with four distinctive tones: T1 (high), T2 (low-high), T3 (low), and T4 (high-low).

* T3S is a phonological process by which a T3 is changed into a T2 when it is immediately followed by another T3.

* When the input expression consists of more than two successive T3s, the patterns of realization of T3S depend on the syntactic structure of the expression.

* Over-application of T3S, that is, applying it where it is not required, is acceptable (in fast speech) but makes the output less natural.
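
The basic rule can be sketched as a single left-to-right scan over a string of tone digits. This is only an illustration of the rule stated above, not the parsing-based algorithm, and `naive_t3s` is a hypothetical helper name.

```python
def naive_t3s(tones):
    """Change a T3 to T2 whenever the following tone is still T3 (left-to-right scan)."""
    out = list(tones)
    for i in range(len(out) - 1):
        if out[i] == '3' and out[i + 1] == '3':
            out[i] = '2'
    return ''.join(out)

print(naive_t3s('33'))    # 23
print(naive_t3s('3333'))  # 2223
```

A blind scan always turns 3333 into 2223. That is the expected pattern for 五百碗酒, but not the most natural one for 想买米酒 (2323), which is why syntactic structure has to be consulted.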

In [None]:
!pip install -U spacy
!python -m spacy download zh_core_web_lg # https://spacy.io/models/zh

import zh_core_web_lg
nlp = zh_core_web_lg.load()

In [23]:
from spacy import displacy

In [81]:
sent = nlp('五百碗酒') # 'five-hundred bowls of wine' 
                    # Underlying tones: 3333; Expected surface tones: 2223 (other patterns not acceptable)
for token in sent:
    print(token.text, token.pos_, token.dep_, list(token.ancestors))

displacy.render(sent, style='dep', jupyter=True)

五百 NUM nummod [酒]
碗 NUM mark:clf [五百, 酒]
酒 NOUN ROOT []


In [34]:
sent = nlp('想买米酒') # 'want to buy rice wine'
                    # Underlying tones: 3333; Expected surface tones: 2323 (most natural) among other acceptable patterns
for token in sent:
    print(token.text, token.pos_, token.dep_, list(token.ancestors))

displacy.render(sent, style='dep', jupyter=True)

想 VERB ROOT []
买 VERB ccomp [想]
米酒 NOUN dobj [买, 想]


In [87]:
sent = nlp('我很想买米酒') # 'I really want to buy rice wine'
                        # Underlying tones: 333333; Expected surface tones: 322323 (most natural) among other acceptable patterns
for token in sent:
    print(token.text, token.pos_, token.dep_, list(token.ancestors))

displacy.render(sent, style='dep', jupyter=True)

我 PRON nsubj [想]
很 ADV advmod [想]
想 VERB ROOT []
买 VERB ccomp [想]
米酒 NOUN dobj [买, 想]


In [89]:
sent = nlp('我想买五百碗酒') # 'I want to buy five-hundred bowls of wine'
                        # Underlying tones: 3333333; Expected surface tones: 3232223 (most natural) among other acceptable patterns
for token in sent:
    print(token.text, token.pos_, token.dep_, list(token.ancestors))

displacy.render(sent, style='dep', jupyter=True)

我 PRON nsubj [想]
想 VERB ROOT []
买 VERB ccomp [想]
五百 NUM nummod [酒, 买, 想]
碗 NUM mark:clf [五百, 酒, 买, 想]
酒 NOUN dobj [买, 想]


In [90]:
sent = nlp('老李很想买五百碗米酒') # 'Old Li really wants to buy five-hundred bowls of rice wine'
                              # Underlying tones: 3333333333; Expected surface tones: 2322322323 (most natural) among other acceptable patterns
for token in sent:
    print(token.text, token.pos_, token.dep_, list(token.ancestors))

displacy.render(sent, style='dep', jupyter=True)

老李 PROPN nsubj [想]
很 ADV advmod [想]
想 VERB ROOT []
买 VERB ccomp [想]
五百 NUM nummod [米酒, 买, 想]
碗 NUM mark:clf [五百, 米酒, 买, 想]
米酒 NOUN dobj [买, 想]


# Implementation

In [91]:
def T3S(text, tone_list):
    """Apply Tone 3 Sandhi to tone_list in place and print the result.

    tone_list holds one tone digit (as a string) per character of text,
    so the code assumes every character of text carries exactly one tone.
    """
    sent = nlp(text)

    # Pass 1: apply T3S within tokens
    for token in sent:
        for i in range(len(token) - 1):
            # token.idx is the character offset of the token within text
            if tone_list[token.idx + i] == tone_list[token.idx + i + 1] == '3':
                tone_list[token.idx + i] = '2'

    # Pass 2: apply T3S to structurally adjacent T3s (a token and a linearly adjacent ancestor),
    # with the possibility of acceptable over-application
    for token in sent:
        for token_anc in token.ancestors:
            # token_anc immediately follows token
            if (token_anc.i == token.i + 1) and (tone_list[token_anc.idx] == tone_list[token_anc.idx - 1] == '3'):
                tone_list[token_anc.idx - 1] = '2'
            # token immediately follows token_anc
            elif (token.i == token_anc.i + 1) and (tone_list[token.idx] == tone_list[token.idx - 1] == '3'):
                tone_list[token.idx - 1] = '2'

    # Pass 3: apply T3S to any remaining adjacent T3s, left to right
    for i in range(len(tone_list) - 1):
        if tone_list[i] == tone_list[i + 1] == '3':
            tone_list[i] = '2'

    print(tone_list)
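
To trace the three passes without loading the spaCy model, the same logic can be sketched in plain Python. The names `ancestors` and `t3s_sketch` are hypothetical. The parse is supplied by hand, as one string of tone digits per token plus the index of each token's head (the root is its own head), copied from the parser output shown in the cells above.

```python
def ancestors(i, heads):
    """Return the indices of token i's head, the head's head, and so on up to the root."""
    result = []
    while heads[i] != i:  # the root token is its own head
        i = heads[i]
        result.append(i)
    return result

def t3s_sketch(token_tones, heads):
    tones = [t for tok in token_tones for t in tok]
    # Character offset at which each token starts
    starts, offset = [], 0
    for tok in token_tones:
        starts.append(offset)
        offset += len(tok)

    # Pass 1: within tokens
    for start, tok in zip(starts, token_tones):
        for k in range(start, start + len(tok) - 1):
            if tones[k] == tones[k + 1] == '3':
                tones[k] = '2'

    # Pass 2: across the boundary between a token and a linearly adjacent ancestor
    for i in range(len(token_tones)):
        for a in ancestors(i, heads):
            if abs(a - i) == 1:
                boundary = starts[max(a, i)]  # first syllable of the later token
                if tones[boundary] == tones[boundary - 1] == '3':
                    tones[boundary - 1] = '2'

    # Pass 3: any remaining adjacent T3s, left to right
    for k in range(len(tones) - 1):
        if tones[k] == tones[k + 1] == '3':
            tones[k] = '2'
    return ''.join(tones)

# 想 | 买 | 米酒, with 买 headed by 想 and 米酒 headed by 买
print(t3s_sketch(['3', '3', '33'], [0, 0, 1]))  # 2323
# 五百 | 碗 | 酒, with 五百 headed by 酒 and 碗 headed by 五百
print(t3s_sketch(['33', '3', '3'], [2, 0, 2]))  # 2223
```

In the first example, Pass 1 handles 米酒 and Pass 2 handles the 想/买 boundary, so nothing is left for Pass 3. In the second, Pass 2 alone changes both boundaries around 碗.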

# Results

In [None]:
! pip install pinyin # https://pypi.org/project/pinyin/
import pinyin

In [104]:
text = '五百碗酒' # 'five-hundred bowls of wine' 
                # Underlying tones: 3333; Expected surface tones: 2223 (other patterns unacceptable)
py = pinyin.get(text, format = 'numerical')
tone_list = [s for s in py if s.isnumeric()]
T3S(text, tone_list)

['2', '2', '2', '3']


In [105]:
text = '想买米酒' # 'want to buy rice wine'
                # Underlying tones: 3333; Expected surface tones: 2323 (most natural) among other acceptable patterns
py = pinyin.get(text, format = 'numerical')
tone_list = [s for s in py if s.isnumeric()]
T3S(text, tone_list)

['2', '3', '2', '3']


In [106]:
text = '我很想买米酒' # 'I really want to buy rice wine'
                  # Underlying tones: 333333; Expected surface tones: 322323 (most natural) among other acceptable patterns
py = pinyin.get(text, format = 'numerical')
tone_list = [s for s in py if s.isnumeric()]
T3S(text, tone_list)

['3', '2', '2', '3', '2', '3']


In [107]:
text = '我想买五百碗酒' # 'I want to buy five-hundred bowls of wine'
                  # Underlying tones: 3333333; Expected surface tones: 3232223 (most natural) among other acceptable patterns
py = pinyin.get(text, format = 'numerical')
tone_list = [s for s in py if s.isnumeric()]
T3S(text, tone_list)

['2', '2', '3', '2', '2', '2', '3']


In [108]:
text = '老李很想买五百碗米酒' # 'Old Li really wants to buy five-hundred bowls of rice wine'
                        # Underlying tones: 3333333333; Expected surface tones: 2322322323 (most natural) among other acceptable patterns
py = pinyin.get(text, format = 'numerical')
tone_list = [s for s in py if s.isnumeric()]
T3S(text, tone_list)

['2', '3', '2', '2', '3', '2', '2', '3', '2', '3']


# gTTS

* gTTS features more instances of acceptable over-application of T3S.
* The current implementation of T3S can help make the output more natural. 

In [None]:
!pip install gTTS 
import gtts
import IPython

In [97]:
tts = gtts.gTTS('五百碗酒', lang='zh') # 'five-hundred bowls of wine'
tts.save('五百碗酒.mp3')
IPython.display.Audio('五百碗酒.mp3') # Output: 2223 (acceptable)

In [109]:
tts = gtts.gTTS('想买米酒', lang='zh') # 'want to buy rice wine'
tts.save('想买米酒.mp3')
IPython.display.Audio('想买米酒.mp3') # Output: 2323 (most natural)

In [111]:
tts = gtts.gTTS('我很想买米酒', lang='zh') # 'I really want to buy rice wine'
tts.save('我很想买米酒.mp3')
IPython.display.Audio('我很想买米酒.mp3') # Output: 222323 (acceptable); cf. 322323 (most natural)

In [113]:
tts = gtts.gTTS('我想买五百碗酒', lang='zh') # 'I want to buy five-hundred bowls of wine'
tts.save('我想买五百碗酒.mp3')
IPython.display.Audio('我想买五百碗酒.mp3') # Output: 2232223 (acceptable); cf. 3232223 (most natural)

In [115]:
tts = gtts.gTTS('老李很想买五百碗米酒', lang='zh') # 'Old Li really wants to buy five-hundred bowls of rice wine'
tts.save('老李很想买五百碗米酒.mp3')
IPython.display.Audio('老李很想买五百碗米酒.mp3') # Output: 2222322323 (acceptable); cf. 2322322323 (most natural)
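
The comparison can be made concrete by tallying over-applications, counted here as syllables realized as T2 where the most natural pattern keeps T3. This is a rough sketch: the patterns are copied by hand from the cells above, the gTTS patterns are the ones noted in the comments rather than measured automatically, and `over_applications` is a hypothetical helper.

```python
# (most natural, this implementation, gTTS) for each test expression
patterns = {
    '五百碗酒': ('2223', '2223', '2223'),
    '想买米酒': ('2323', '2323', '2323'),
    '我很想买米酒': ('322323', '322323', '222323'),
    '我想买五百碗酒': ('3232223', '2232223', '2232223'),
    '老李很想买五百碗米酒': ('2322322323', '2322322323', '2222322323'),
}

def over_applications(natural, observed):
    """Count syllables realized as T2 where the most natural pattern keeps T3."""
    return sum(1 for n, o in zip(natural, observed) if n == '3' and o == '2')

impl_total = sum(over_applications(nat, impl) for nat, impl, _ in patterns.values())
gtts_total = sum(over_applications(nat, g) for nat, _, g in patterns.values())
print('T3S implementation:', impl_total)  # 1
print('gTTS:', gtts_total)                # 3
```

On these five expressions the implementation over-applies once (in 我想买五百碗酒) and gTTS three times.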