Approaches:
- Rule-based: Advantage: fast and straightforward. Disadvantage: hard to pick the correct matching word based on rules alone.
- Statistical: improves on the rule-based method by using statistics to help; depends mainly on the dataset.
Main method (used by most people): combine the rule-based and statistical methods. Procedure:
- English <==> English Phonetic symbols
- English Phonetic symbols <==> Chinese Pinyin
- Chinese Pinyin <==> Transliteration
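The three-step procedure above can be sketched as a chain of lookups. A minimal sketch, assuming toy tables: every dictionary below is a tiny hand-made sample for illustration, not a real linguistic resource.

```python
# Toy sketch of the three-stage pipeline. All three tables are invented
# samples (an assumption), standing in for the real dict/rule resources.

CMU_SAMPLE = {"tom": ["T", "AA", "M"]}                   # English -> phonemes
PHONEMES_TO_PINYIN = {("T", "AA", "M"): ["tang", "mu"]}  # phonemes -> Pinyin
PINYIN_TO_HANZI = {"tang": "汤", "mu": "姆"}              # Pinyin -> characters

def transliterate(word):
    # Chain the three stages: word -> phonemes -> Pinyin -> characters.
    phonemes = CMU_SAMPLE[word.lower()]
    pinyin = PHONEMES_TO_PINYIN[tuple(phonemes)]
    return "".join(PINYIN_TO_HANZI[syl] for syl in pinyin)

print(transliterate("Tom"))  # -> 汤姆
```

The hard research problems are exactly the two middle tables: phoneme-to-Pinyin and Pinyin-to-character mapping.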
Things that may need improvement:
- Chinese Pinyin datasets usually lack tone data, and a single toneless Pinyin syllable can match several characters, e.g. wo = 我, 窝, 握, etc.
- Matching English phonetic symbols with Chinese Pinyin
- Could combine with other techniques such as web mining
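The toneless-Pinyin ambiguity can be illustrated with a small candidate-ranking sketch. The frequency numbers are invented for illustration; real scores could come from corpus counts or web mining as suggested above.

```python
# One toneless Pinyin syllable maps to several characters; ranking by a
# score (hypothetical frequencies here) is one way to disambiguate.

CANDIDATES = {"wo": ["我", "窝", "握"]}
FREQ = {"我": 0.05, "窝": 0.10, "握": 0.02}  # invented numbers

def ranked(pinyin):
    # Sort the candidate characters by descending score.
    return sorted(CANDIDATES[pinyin], key=lambda c: FREQ.get(c, 0.0), reverse=True)

print(ranked("wo"))  # -> ['窝', '我', '握']
```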
Experiment design
- Decide which datasets to use: CMU Pronouncing Dictionary, LDC2005T34, LDC2013T06 √
- Preprocess the data into a suitable format √
- Decide on an evaluation method
- Build a baseline with a standard rule-based method
- Try a statistical method (a language model) to see whether the baseline can be improved
- Try other methods, including RNN, RNN with LSTM, seq2seq, word2vec, etc.
Run experiments with analysis, adjust plans when necessary, and add further improvements once accuracy is satisfactory.
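For the still-undecided evaluation method, one candidate is top-1 word accuracy plus mean reciprocal rank over ranked candidate lists. A minimal sketch with made-up predictions; the metric choice itself remains open:

```python
# Sketch of a possible evaluation: top-1 accuracy and mean reciprocal
# rank (MRR) over ranked candidate lists per source word.

def evaluate(predictions, gold):
    """predictions: one ranked candidate list per word; gold: references."""
    correct, rr_sum = 0, 0.0
    for cands, ref in zip(predictions, gold):
        if cands and cands[0] == ref:
            correct += 1                       # top-1 hit
        if ref in cands:
            rr_sum += 1.0 / (cands.index(ref) + 1)  # reciprocal rank
    n = len(gold)
    return correct / n, rr_sum / n

# Invented example: first word correct at rank 1, second word missed.
acc, mrr = evaluate([["汤姆", "唐"], ["安娜"]], ["汤姆", "安那"])
print(acc, mrr)  # -> 0.5 0.5
```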
2017/9/3 New issue: aligning English phoneme symbols with Chinese Pinyin turns out to be a hard problem.
- The CMU dict gives phonemes that are not yet combined into syllables. Need a good strategy to combine them.
- Chinese characters have good resources for conversion to Pinyin in any needed format.
- Alignment of Chinese Pinyin with English phonemes seems to be a hard problem. Possible solutions: 1. a simple rule-based method (some early papers use this); 2. use a model to do the alignment. Check the paper "How to Speak a Language without Knowing It"; the model they use (a WFST) is exactly what I want.
- The CMU dict gives some bad phoneme transcriptions. Ignore them?
- Training data size problem: most papers have only limited resources, e.g. not over 10k pairs.
- Try asking Trevor for his opinion.
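For the phoneme-combination problem in the first point, one assumed heuristic (my own sketch, not taken from any paper) is to close a syllable-like group at every ARPAbet vowel, attaching trailing consonants to the last group:

```python
# Heuristic grouping of CMU/ARPAbet phonemes into syllable-like units:
# each group ends at a vowel; leftover consonants join the last group.

ARPABET_VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
                  "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def group_syllables(phonemes):
    groups, current = [], []
    for p in phonemes:
        base = p.rstrip("012")      # strip CMU stress digits (AA1 -> AA)
        current.append(base)
        if base in ARPABET_VOWELS:  # a vowel closes the current group
            groups.append(current)
            current = []
    if current:                     # trailing consonants, e.g. final M
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups

print(group_syllables(["T", "AA1", "M"]))  # -> [['T', 'AA', 'M']]
```

This ignores consonant clusters that should start the next syllable, so it is only a starting point before trying a learned alignment model.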