Approaches:
- Rule-based: Advantage: fast and straightforward. Disadvantage: hard to pick the correct matching word based on rules alone.
- Statistical: improves on the rule-based method by using statistics to help; depends mainly on the dataset.
Main method (used by most people): combine the rule-based and statistical methods. Procedure:
- English <==> English Phonetic symbols
- English Phonetic symbols <==> Chinese Pinyin
- Chinese Pinyin <==> Transliteration
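The three-step procedure above can be sketched as a chain of lookups. A minimal sketch, assuming toy tables: every dictionary below is a tiny hand-made sample for illustration, not a real linguistic resource.

```python
# Toy sketch of the three-stage pipeline. All three tables are invented
# samples (an assumption), standing in for the real dict/rule resources.

CMU_SAMPLE = {"tom": ["T", "AA", "M"]}                   # English -> phonemes
PHONEMES_TO_PINYIN = {("T", "AA", "M"): ["tang", "mu"]}  # phonemes -> Pinyin
PINYIN_TO_HANZI = {"tang": "汤", "mu": "姆"}              # Pinyin -> characters

def transliterate(word):
    # Chain the three stages: word -> phonemes -> Pinyin -> characters.
    phonemes = CMU_SAMPLE[word.lower()]
    pinyin = PHONEMES_TO_PINYIN[tuple(phonemes)]
    return "".join(PINYIN_TO_HANZI[syl] for syl in pinyin)

print(transliterate("Tom"))  # -> 汤姆
```

The hard research problems are exactly the two middle tables: phoneme-to-Pinyin and Pinyin-to-character mapping.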
Things that may need improvement:
- Chinese Pinyin datasets usually lack tone data, and a single toneless Pinyin syllable can match several characters, e.g. wo = 我, 窝, 握, etc.
- Matching English phonetic symbols with Chinese Pinyin
- Could combine with other techniques such as web mining
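The toneless-Pinyin ambiguity can be illustrated with a small candidate-ranking sketch. The frequency numbers are invented for illustration; real scores could come from corpus counts or web mining as suggested above.

```python
# One toneless Pinyin syllable maps to several characters; ranking by a
# score (hypothetical frequencies here) is one way to disambiguate.

CANDIDATES = {"wo": ["我", "窝", "握"]}
FREQ = {"我": 0.05, "窝": 0.10, "握": 0.02}  # invented numbers

def ranked(pinyin):
    # Sort the candidate characters by descending score.
    return sorted(CANDIDATES[pinyin], key=lambda c: FREQ.get(c, 0.0), reverse=True)

print(ranked("wo"))  # -> ['窝', '我', '握']
```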
Experiment design
- Decide which datasets to use: CMU Pronouncing Dictionary, LDC2005T34, LDC2013T06 √
- Preprocess the data into a suitable format √
- Decide on an evaluation method
- Build a baseline with a standard rule-based method
- Try a statistical method (a language model) to see whether the baseline can be improved
- Try other methods, including RNN, RNN with LSTM, seq2seq, word2vec, etc.
Run experiments with analysis, adjust plans when necessary, and add further improvements once accuracy is satisfactory.
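For the still-undecided evaluation method, one candidate is top-1 word accuracy plus mean reciprocal rank over ranked candidate lists. A minimal sketch with made-up predictions; the metric choice itself remains open:

```python
# Sketch of a possible evaluation: top-1 accuracy and mean reciprocal
# rank (MRR) over ranked candidate lists per source word.

def evaluate(predictions, gold):
    """predictions: one ranked candidate list per word; gold: references."""
    correct, rr_sum = 0, 0.0
    for cands, ref in zip(predictions, gold):
        if cands and cands[0] == ref:
            correct += 1                       # top-1 hit
        if ref in cands:
            rr_sum += 1.0 / (cands.index(ref) + 1)  # reciprocal rank
    n = len(gold)
    return correct / n, rr_sum / n

# Invented example: first word correct at rank 1, second word missed.
acc, mrr = evaluate([["汤姆", "唐"], ["安娜"]], ["汤姆", "安那"])
print(acc, mrr)  # -> 0.5 0.5
```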
2017/9/3 New issue: aligning English phoneme symbols with Chinese Pinyin turns out to be a hard problem.
- The CMU dict gives phonemes that are not yet combined into syllables. Need a good strategy to combine them.
- Chinese characters have good resources for conversion to Pinyin in any needed format.
- Alignment of Chinese Pinyin with English phonemes seems to be a hard problem. Possible solutions: 1. a simple rule-based method (some early papers use this); 2. use a model to do the alignment. Check the paper "How to Speak a Language without Knowing It"; the model they use (a WFST) is exactly what I want.
- The CMU dict gives some bad phoneme transcriptions. Ignore them?
- Training data size problem: most papers have only limited resources, e.g. not over 10k pairs.
- Try asking Trevor for his opinion.
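For the phoneme-combination problem in the first point, one assumed heuristic (my own sketch, not taken from any paper) is to close a syllable-like group at every ARPAbet vowel, attaching trailing consonants to the last group:

```python
# Heuristic grouping of CMU/ARPAbet phonemes into syllable-like units:
# each group ends at a vowel; leftover consonants join the last group.

ARPABET_VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
                  "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def group_syllables(phonemes):
    groups, current = [], []
    for p in phonemes:
        base = p.rstrip("012")      # strip CMU stress digits (AA1 -> AA)
        current.append(base)
        if base in ARPABET_VOWELS:  # a vowel closes the current group
            groups.append(current)
            current = []
    if current:                     # trailing consonants, e.g. final M
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups

print(group_syllables(["T", "AA1", "M"]))  # -> [['T', 'AA', 'M']]
```

This ignores consonant clusters that should start the next syllable, so it is only a starting point before trying a learned alignment model.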