Make ds2 run on PaddleCloud and optimize performance #144
1. Refine data_utils/data.py to read bytes from tar file
2. Add scripts to submit paddle cloud job for ds2 training
Great!
Could you please add a detailed tutorial for how to quickly start a paddle cloud version of DS2?
deep_speech_2/data_utils/data.py
Outdated
import paddle.v2 as paddle
import tarfile
Put `import tarfile` before `import paddle`. Widely used modules come first.
Fixed.
deep_speech_2/data_utils/data.py
Outdated
@@ -64,7 +68,6 @@ def __init__(self,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
Why remove `use_dB_normalization`?
I didn't mean to remove `use_dB_normalization`. Maybe I made a mistake during the git merge?
Fixed.
deep_speech_2/pcloud_split_data.py
Outdated
@@ -0,0 +1,47 @@
import os
Add a short file-level docstring, like the other files.
Move pcloud_split_data.py and pcloud_prepare_data.py (rename pclound_data.py to this) into deep_speech_2/cloud.
Fixed.
deep_speech_2/pcloud_split_data.py
Outdated
import argparse


def split_data(inManifest, tar_path, outManifest):
`inManifest` --> `in_manifest`, `outManifest` --> `out_manifest`.
Keep the variable naming style consistent.
Fixed.
deep_speech_2/pcloud_split_data.py
Outdated
def split_data(inManifest, tar_path, outManifest):
    trainer_id = 1
    trainer_count = 2
    #with open("/trainer_id", "r") as f:
Remove L9-L12, or remove L7-L8?
Please remove experimental or temporary code from the pull request.
Fixed.
sound_file = json_data['audio_filepath']
filename = os.path.basename(sound_file)
out_tar.add(sound_file, arcname=filename)
json_data['audio_filepath'] = filename
Once the cloud tar file is generated, the filepath in out_manifest should be modified to "tar:%s#%s" immediately, instead of putting the modification in pcloud_split_data.py.
When we generate the tar file locally, we don't know the real path of the tar file on the cloud, so the filepath in out_manifest can't be modified to "tar:%s#%s" immediately.
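To make the point of the exchange above concrete, here is a minimal sketch of the later rewrite step, assuming the manifest is one JSON object per line with an `audio_filepath` field (as in the snippet under review) and that the cloud location of the tar file is supplied once it is known. `to_tar_uri` and `rewrite_manifest_lines` are hypothetical helper names, not functions from this PR.

```python
import json


def to_tar_uri(tar_path, filename):
    """Build the "tar:<tar_path>#<filename>" string discussed in the review."""
    return "tar:%s#%s" % (tar_path, filename)


def rewrite_manifest_lines(lines, cloud_tar_path):
    """Point each manifest entry at a member of the tar file on the cloud.

    `lines` holds JSON strings whose 'audio_filepath' is a bare filename,
    i.e. the arcname that was used when the tar was built locally.
    """
    rewritten = []
    for line in lines:
        entry = json.loads(line)
        entry["audio_filepath"] = to_tar_uri(cloud_tar_path,
                                             entry["audio_filepath"])
        rewritten.append(json.dumps(entry))
    return rewritten
```

Keeping the rewrite separate from packing means the local step never has to guess the cloud path.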
deep_speech_2/pcloud_split_data.py
Outdated
help="Input manifest path. (default: %(default)s)")
parser.add_argument(
    "--data_tar_path",
    default='datasets/dev.tar',
Put the data in ./cloud
Fixed.
deep_speech_2/pcloud_split_data.py
Outdated
parser.add_argument(
    "--in_manifest_path",
    default='datasets/dev.mani',
Put the file in ./cloud
Fixed.
deep_speech_2/pcloud_train.sh
Outdated
@@ -0,0 +1,32 @@
#setted by user
Move pcloud_submit.sh into ./cloud
Fixed.
Remove this file since it has been added in ./cloud
Ok. I forgot to delete this file.
deep_speech_2/pcloud_train.sh
Outdated
@@ -0,0 +1,32 @@
#setted by user
TRAIN_MANI='/pfs/dlnel/home/yanxu05@baidu.com/wanghaoshuang/data/ds2_data/demo.mani'
Could you prepare a configuration identical to the single-machine version, instead of using demo data?
Also, please replace strings such as "wanghaoshuang". We need to provide a clean version that users can download and run end to end after only minor changes.
In addition, please document the cloud usage carefully in README.md.
@xinghai-sun
Fixed.
- The default data is now the same as in the single-machine version.
- The data has been moved to the public path provided by paddle cloud, with no personal names in it.
- Added instructions to deep_speech_2/cloud/README.md for running ds2 with public data. Shall I cover other usages in a separate PR later?
deep_speech_2/cloud/pcloud_submit.sh
Outdated
@@ -0,0 +1,17 @@
DS2_PATH=../
tar -czf deepspeech.tar.gz ${DS2_PATH}
No need for this?
Yes. The latest paddlecloud client doesn't need this.
deep_speech_2/cloud/README.md
Outdated
I0727 05:01:50.454787 25 GradientMachine.cpp:85] Initing parameters..
I0727 05:01:50.690007 25 GradientMachine.cpp:92] Init parameters done.
``` | ||
[More optins and cmd aoubt paddle cloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md)
--> For more information, please refer to PaddleCloud
Fixed.
deep_speech_2/cloud/README.md
Outdated
**Step3:** Get logs from paddle cloud by cmd: `paddlecloud logs -n 10000 deepspeech20170727130129`.

``` | ||
$ paddlecloud logs -n 10000 deepspeech20170727130129
Why paste such logs?
deep_speech_2/cloud/README.md
Outdated
## Run DS2 by public data

**Step1: ** Make sure current dir is `models/deep_speech_2/cloud/`
How to prepare data in cloud for users? Please prepare a script helping users to pack and upload data.
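One way such a packing step could look, as a rough sketch built on the four lines quoted earlier in this review (`out_tar.add(sound_file, arcname=filename)` and the basename rewrite). The function name `pack_data`, its parameters, and the JSON-lines manifest layout are assumptions for illustration. Uploading is left out.

```python
import json
import os
import tarfile


def pack_data(in_manifest_path, tar_path, out_manifest_path):
    """Pack the audio files listed in a manifest into one tar file.

    Each output manifest entry keeps only the file's basename, which is
    also its member name inside the tar.
    """
    with tarfile.open(tar_path, "w") as out_tar, \
            open(in_manifest_path) as in_manifest, \
            open(out_manifest_path, "w") as out_manifest:
        for line in in_manifest:
            if not line.strip():
                continue  # tolerate blank lines in the manifest
            json_data = json.loads(line)
            sound_file = json_data["audio_filepath"]
            filename = os.path.basename(sound_file)
            out_tar.add(sound_file, arcname=filename)
            json_data["audio_filepath"] = filename
            out_manifest.write(json.dumps(json_data) + "\n")
```

Because members are stored by basename, two audio files with the same name in different directories would collide. A real script would need to guard against that.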
@@ -0,0 +1,61 @@
"""
rename the file to "update_data.py"
help="Data tar file path. (default: %(default)s)")
parser.add_argument(
    "--out_manifest_path",
    default='./cloud/data/dev.mani.split',
--> manifest.train.local ?
I renamed it to `local.train.manifest`. Use `manifest` as the suffix to indicate the format of this file?
deep_speech_2/pcloud_train.sh
Outdated
@@ -0,0 +1,32 @@
#setted by user
Remove this file since it has been added in ./cloud
deep_speech_2/data_utils/data.py
Outdated
def _read_soundbytes(self, filepath):
    """
    Read bytes from file.
merge L240 and L239
deep_speech_2/data_utils/data.py
Outdated
@@ -215,6 +225,46 @@ def vocab_list(self):
    """
    return self._speech_featurizer.vocab_list

def _parse_tar(self, file):
    """
Merge L228 and L229
deep_speech_2/data_utils/data.py
Outdated
def _process_utterance(self, filename, transcript):
    """Load, augment, featurize and normalize for speech data."""
    speech_segment = SpeechSegment.from_bytes(
The original `_process_utterance` can be reused, since its `filename` parameter can accept a file object. I suggest adding a method, e.g. `_filepath_to_object`, that handles the incoming filename before this function is called: if it is a tar-style path, extract the file from the tar and return its file object.
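A minimal sketch of the helper suggested above, assuming the "tar:<tar_path>#<member>" path format mentioned earlier in this review. It is written as a standalone function rather than a method and is not the code that was eventually merged. It reopens the archive on every call, which is simple but slow for large tar files.

```python
import io
import tarfile


def filepath_to_object(filepath):
    """Return a readable binary file object for a plain or tar-style path.

    Paths of the form "tar:<tar_path>#<member>" are read out of the tar
    archive; anything else is opened as an ordinary file.
    """
    if filepath.startswith("tar:"):
        tar_path, member = filepath[len("tar:"):].split("#", 1)
        with tarfile.open(tar_path) as tar:
            # Copy the bytes out so the result stays usable after the
            # archive is closed.
            return io.BytesIO(tar.extractfile(member).read())
    return open(filepath, "rb")
```

Caching the opened archive and its member index per tar path would avoid the repeated scan, at the cost of keeping file handles open.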
1. Refine data_utils/data.py, reuse process_utterance function.
2. Modified README.
3. Implement uploading data in cloud/upload_data.py
4. Merge branch 'develop' of https://github.com/PaddlePaddle/models into ds2_pcloud
deep_speech_2/cloud/README.md
Outdated
``` | ||
- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.
Remove L19.
Fixed.
deep_speech_2/cloud/README.md
Outdated
": "nor is mister ..."} | ||
``` | ||
- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.

- `.This file` --> `. This file`. Don't forget a dot mark.
- TRAIN_MANIFEST --> `TRAIN_MANIFEST`
Fixed.
deep_speech_2/cloud/README.md
Outdated
- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.

- `VOCAB_FILE`: Absolute path of vocabulary file in local filesytem.
- `MEAN_STD_FILE`: Absolute path of vocabulary file in local filesytem.
vocabulary file --> normalizer's statistic file
Fixed.
deep_speech_2/cloud/README.md
Outdated
>Note: Make sure current directory is `models/deep_speech_2/cloud/`

## Step1 Configure data set
Step1 --> Step-1
Fixed.
deep_speech_2/cloud/README.md
Outdated
## Step1 Configure data set

You can configure your input data and output path in pcloud_submit.sh:
--> Configure your input data and output path in `pcloud_submit.sh`:
Fixed.
deep_speech_2/cloud/upload_data.py
Outdated
ret = 1
# train data
if args.train_manifest_path != "":
    ret = call(['paddlecloud', 'ls', cloud_train_manifest])
Do we need `try except` here?
A missing cloud file will not raise an exception here, so `try except` is unnecessary. If the exception comes from something like `PaddleCloud not installed`, we should let it fail fast and handle the exception where this script is called.
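The distinction drawn in this reply can be sketched as follows, assuming (as the reply states) that a missing cloud file shows up as a non-zero exit status rather than an exception. `command_succeeds` is a hypothetical helper name.

```python
import subprocess


def command_succeeds(cmd):
    """Run `cmd` and report whether it exited with status 0.

    A non-zero exit status is an ordinary result, not an exception. If the
    executable itself is missing, subprocess.call raises OSError
    (FileNotFoundError on Python 3), which is deliberately not caught here
    so that the caller fails fast.
    """
    return subprocess.call(cmd) == 0
```

With this shape, a check like `command_succeeds(['paddlecloud', 'ls', path])` only answers whether the path exists, while a missing client surfaces as an error at the call site.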
deep_speech_2/cloud/upload_data.py
Outdated
local_test_manifest)
call(
    ['paddlecloud', 'cp', local_test_manifest, cloud_test_manifest])
call(['paddlecloud', 'cp', local_test_tar, cloud_test_tar])
Write a function to do cloud fs cp.
Fixed.
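For illustration, a wrapper of the kind requested could be as small as the sketch below. The `paddlecloud cp` invocation is taken from the snippet under review. The name `cloud_cp` and the injectable `runner` parameter are assumptions added so the helper can be exercised without the client installed.

```python
import subprocess


def cloud_cp(src, dst, runner=subprocess.call):
    """Copy `src` to `dst` with `paddlecloud cp` and return the exit status.

    `runner` is a hypothetical hook: it defaults to subprocess.call and can
    be swapped for a stub in tests.
    """
    return runner(["paddlecloud", "cp", src, dst])
```

The repeated `call([...])` lines then collapse to one `cloud_cp(local_path, cloud_path)` per file.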
deep_speech_2/data_utils/data.py
Outdated
@@ -82,12 +89,15 @@ def __init__(self,
self._num_threads = num_threads
self._rng = random.Random(random_seed)
self._epoch = 0
# for caching tar files info
self.tar2info = {}
remove L93-94, not used.
Fixed.
    return f, result

def _get_file_object(self, file):
    """Get file object by file path.
Add a blank line after L239.
Fixed.
deep_speech_2/data_utils/data.py
Outdated
    return local_data.tar2object[tarpath].extractfile(
        local_data.tar2info[tarpath][filename])
else:
    return open(file)
open(file, 'r')
Steps:
fix #143