Deep Speech 2 on PaddlePaddle: Plan & Task Breakdown #44

xinghai-sun · 2017-05-17T16:32:22Z

We are planning to build Deep Speech 2 (DS2) [1], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:

- Release a basic distributed implementation of DS2 on PaddlePaddle.
- Contribute a Deep Speech chapter to the PaddlePaddle Book.

Intensive system optimization and a low-latency inference library (details in [1]) are not covered in this first-stage plan.

Tasks

We roughly break down the project into 14 tasks:

1. Develop an audio data provider:
- JSON file list generator.
- Audio file format transformer.
- Spectrogram feature extraction, power normalization etc.
- Batch data reader with SortaGrad.
- Data augmentation (optional).
- Prepare (one or more) public English data sets & baseline.
- Add audio data provider and preprocessor for speech recognition datasets. Paddle#2226
- Add audio data augmentation process to audio data provider. Paddle#2227
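As a rough illustration of the "batch data reader with SortaGrad" item, the sketch below reads a JSON-lines file list and serves the first epoch in order of increasing utterance length, then shuffles the minibatches in later epochs, as described in [1]. The field names (`audio_filepath`, `duration`, `text`) and the function names are assumptions made for this example, not the project's actual interface.

```python
import json
import random


def load_manifest(lines):
    """Parse a JSON-lines file list: one utterance record per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def sortagrad_batches(manifest, batch_size, epoch, seed=0):
    """Return a list of minibatches for the given epoch.

    Utterances are always grouped by similar duration. In epoch 0 the
    batches are served shortest-first (SortaGrad); in later epochs the
    order of the batches is shuffled.
    """
    ordered = sorted(manifest, key=lambda utt: utt["duration"])
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if epoch > 0:
        # Seeded so that a given epoch is reproducible.
        random.Random(seed + epoch).shuffle(batches)
    return batches


# Hypothetical file list content (field names are assumptions).
example_lines = [
    '{"audio_filepath": "a.wav", "duration": 3.2, "text": "hello world"}',
    '{"audio_filepath": "b.wav", "duration": 1.1, "text": "hi"}',
    '{"audio_filepath": "c.wav", "duration": 2.0, "text": "good day"}',
    '{"audio_filepath": "d.wav", "duration": 5.4, "text": "see you later"}',
]
manifest = load_manifest(example_lines)
first_epoch = sortagrad_batches(manifest, batch_size=2, epoch=0)
```

Grouping by duration also keeps padding within a batch small, which is relevant while the model only accepts fixed-length inputs.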
2. Create a simplified DS2 model configuration:
- With only fixed-length (by padding) audio sequences (otherwise need Task 3).
- With only bidirectional-GRU (otherwise need Task 4).
- With only greedy decoder (otherwise need Task 5, 6).
- Add simplified model configuration for DeepSpeech2. Paddle#2231
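For reference, the greedy (best-path) decoder that the simplified configuration relies on can be sketched in a few lines: take the most likely symbol in each frame, collapse consecutive repeats, and drop blanks. This is a framework-independent sketch; the function name and the input layout (one probability list per frame) are assumptions for illustration.

```python
def ctc_greedy_decode(probs, alphabet, blank):
    """Best-path CTC decoding.

    probs:    list of frames, each a list of per-symbol probabilities.
    alphabet: sequence mapping symbol index -> character.
    blank:    index of the CTC blank symbol.
    """
    # Most likely symbol index in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in probs]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame
        # (collapse repeats) and is not the blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Frame-wise argmax here is: a, a, blank, a, b, b
example_probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]
greedy_text = ctc_greedy_decode(example_probs, alphabet="ab", blank=2)
```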
3. Add support for variable-shaped dense-vector (image) batches of input data:
- Update DenseScanner in dataprovider_converter.py, etc.
- Support variable-length input feature for convolution operation. Paddle#2198
4. Develop a new lookahead-row-convolution layer (see [1] for details):
- Lookahead convolution windows.
- Within-row convolution, without kernels shared across rows.
- Add lookahead row convolution layer. Paddle#2228
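To make the two properties above concrete, here is a minimal pure-Python sketch of the forward computation: each output frame combines the current frame with the next tau frames (the lookahead window), and every feature row has its own weights (no kernel sharing across rows). Frames past the end of the sequence are treated as zeros. The function name and the list-of-lists layout are assumptions for illustration, not the planned PaddlePaddle layer interface.

```python
def lookahead_row_conv(x, w):
    """Lookahead row convolution (forward pass only).

    x: T x D input, a list of T frames with D features each.
    w: (tau + 1) x D weights; w[j][i] scales feature i taken j steps ahead.
    Returns a T x D output where
        out[t][i] = sum_j w[j][i] * x[t + j][i],  j = 0 .. tau,
    with frames beyond the end of the sequence treated as zero.
    """
    num_frames = len(x)
    num_features = len(x[0]) if x else 0
    out = []
    for t in range(num_frames):
        row = [0.0] * num_features
        for j, w_j in enumerate(w):
            if t + j >= num_frames:
                break  # nothing to look ahead to: implicit zero padding
            for i in range(num_features):
                row[i] += w_j[i] * x[t + j][i]
        out.append(row)
    return out


# Three frames, two features, lookahead of one future frame (tau = 1).
example_x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
example_w = [[1.0, 1.0], [0.5, 0.25]]
row_conv_out = lookahead_row_conv(example_x, example_w)
```

Because the layer only needs a small, fixed amount of future context, it can be stacked on unidirectional recurrent layers, which is why the standard configuration pairs it with a unidirectional GRU.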
5. Build a KenLM n-gram language model for beam search decoding:
- Use the KenLM toolkit: Kneser-Ney smoothed, 5-gram, with pruning etc.
- Prepare the corpus & train the model.
- Create inference interfaces pluggable into CTC beam search (for Task 6).
- Build n-gram language model for DeepSpeech2, and add inference interfaces insertable to CTC decoder. Paddle#2229
6. Develop a beam search decoder with CTC + LM + WORDCOUNT:
- Beam search with CTC.
- Beam search with external custom scorer (e.g. LM).
- Try to design a more general beam search interface.
- Add CTC-LM-beam-search decoder. Paddle#2230
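One possible shape for such a decoder is sketched below: a CTC prefix beam search whose beams are ranked by the CTC log-probability plus the value returned by an optional external scorer. The helper `lm_wordcount_scorer` shows how a scorer of the form alpha * log p_LM + beta * word_count, as used in [1], could be plugged in. All names are hypothetical, `lm_logprob` stands in for whatever language model interface Task 5 produces, and probabilities are kept in the linear domain for readability, which would underflow on long utterances.

```python
import math
from collections import defaultdict


def lm_wordcount_scorer(lm_logprob, alpha, beta):
    """Build scorer(text) = alpha * lm_logprob(text) + beta * word_count."""
    def score(text):
        return alpha * lm_logprob(text) + beta * len(text.split())
    return score


def ctc_beam_search(probs, alphabet, blank, beam_size=8, scorer=None):
    """CTC prefix beam search with an optional external scorer.

    probs: list of frames, each a list of per-symbol probabilities.
    Returns a list of (text, score) pairs, best first.
    """
    def to_text(prefix):
        return "".join(alphabet[i] for i in prefix)

    def total(prefix, p_blank, p_nonblank):
        p = p_blank + p_nonblank
        ctc = math.log(p) if p > 0.0 else float("-inf")
        return ctc + (scorer(to_text(prefix)) if scorer else 0.0)

    # prefix -> (prob. of paths ending in blank, prob. ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:
                    candidates[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == c:
                    # Repeated symbol: stays the same prefix unless a
                    # blank separated the two occurrences.
                    candidates[prefix][1] += p_nb * p
                    candidates[prefix + (c,)][1] += p_b * p
                else:
                    candidates[prefix + (c,)][1] += (p_b + p_nb) * p
        ranked = sorted(candidates.items(),
                        key=lambda kv: total(kv[0], kv[1][0], kv[1][1]),
                        reverse=True)
        beams = {prefix: (v[0], v[1]) for prefix, v in ranked[:beam_size]}
    return [(to_text(prefix), total(prefix, p_b, p_nb))
            for prefix, (p_b, p_nb) in beams.items()]


example_probs = [[0.6, 0.1, 0.3], [0.6, 0.1, 0.3]]
plain = ctc_beam_search(example_probs, alphabet="ab", blank=2)
```

Keeping the scorer as a plain callable is one way to approach the "more general beam search interface" item: the search itself does not need to know whether the score comes from an n-gram model, a word count bonus, or something else.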
7. Develop a Word Error Rate (WER) evaluator:
- Update ctc_error_evaluator (CER) to support WER.
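The relationship between the two metrics can be sketched as follows: both are a Levenshtein edit distance divided by the reference length, computed over characters for CER and over whitespace-separated words for WER. This is a standalone sketch with assumed function names; it does not reflect the internals of PaddlePaddle's existing evaluator.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        current = [i]
        for j, h in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: edit distance over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For Mandarin, where text is not whitespace-delimited, the character-level rate is usually the more natural metric, so keeping both is useful.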
8. Prepare an internal dataset for Mandarin (optional):
- Dataset, baseline, evaluation details.
- Particular data preprocessing for Mandarin.
- Might require cooperation with the Department of Speech.
- Prepare internal speech recognition dataset for Mandarin. Paddle#2232
9. Create a standard DS2 model configuration:
- With variable-length audio sequences (need Task 3).
- With unidirectional-GRU + row-convolution (need Task 4).
- With CTC-LM beam search decoder (need Task 5, 6).
10. Make it run perfectly on clusters.
11. Experiments and benchmarking (for accuracy, not efficiency):
- With public English dataset.
- With internal (Baidu) Mandarin dataset (optional).
12. Time profiling and optimization.
13. Prepare docs.
14. Prepare a PaddlePaddle Book chapter with a simplified version.

Task Dependency

Tasks parallelizable within phases:

| Roadmap | Description | Parallelizable Tasks |
|---|---|---|
| Phase I | Basic model & components | Task 1 ~ Task 8 |
| Phase II | Standard model & benchmarking & profiling | Task 9 ~ Task 12 |
| Phase III | Documentation | Task 13 ~ Task 14 |

An issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcome!

Possible Future Work

- Efficiency improvement
- Accuracy improvement
- Low-latency inference library
- Large-scale benchmarking

References

[1] Dario Amodei et al., Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. ICML 2016.

shanyi15 · 2018-08-15T10:11:39Z

Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!

xinghai-sun mentioned this issue May 22, 2017

Overall plan and design doc for DeepSpeech2 on PaddlePaddle. PaddlePaddle/Paddle#2233

Closed

shanyi15 closed this as completed Aug 15, 2018
