
Make ds2 run on PaddleCloud and optimize performance #144

Merged
merged 12 commits into PaddlePaddle:develop from ds2_pcloud on Aug 14, 2017

Conversation

wanghaoshuang
Contributor

@wanghaoshuang wanghaoshuang commented Jul 3, 2017

  1. Refine data_utils/data.py to read bytes from a tar file
  2. Add scripts to submit a PaddleCloud job for DS2 training

Steps:

  1. Pack train and test data according to the user's manifest file
  2. Upload packed data and manifest file to PaddleCloud
  3. Submit PaddleCloud job

fix #143

1. Refine data_utils/data.py to read bytes from tar file
2. Add scripts to submit paddle cloud job for ds2 training
Contributor

@xinghai-sun xinghai-sun left a comment

Great!
Could you please add a detailed tutorial on how to quickly start a paddle cloud version of DS2?

import paddle.v2 as paddle
import tarfile
Contributor

Put `import tarfile` before `import paddle`. The more widely used module comes first.

Contributor Author

Fix.

@@ -64,7 +68,6 @@ def __init__(self,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
Contributor

Why remove `use_dB_normalization`?

Contributor Author

I didn't mean to remove `use_dB_normalization`. Maybe I made a mistake during the git merge?

Contributor Author

Fix.

@@ -0,0 +1,47 @@
import os
Contributor

Add a simple file docstring, like the other files.

Contributor

Move pcloud_split_data.py and pcloud_prepare_data.py (rename pcloud_data.py to this) into deep_speech_2/cloud.

Contributor Author

Fixed.

import argparse


def split_data(inManifest, tar_path, outManifest):
Contributor

inManifest --> in_manifest, outManifest --> out_manifest.
Keep the variable naming style consistent.

Contributor Author

Fixed.

def split_data(inManifest, tar_path, outManifest):
    trainer_id = 1
    trainer_count = 2
    #with open("/trainer_id", "r") as f:
Contributor

Remove L9-L12, or remove L7-L8?
Please remove experimental or temporary code from the pull request.

Contributor Author

Fixed.

sound_file = json_data['audio_filepath']
filename = os.path.basename(sound_file)
out_tar.add(sound_file, arcname=filename)
json_data['audio_filepath'] = filename
Contributor

Once the cloud tar file is generated, the filepath in out_manifest should be modified to "tar:%s#%s" immediately, instead of putting the modification in pcloud_split_data.py.

Contributor Author

When we generate the tar file locally, we don't know the real path of the tar file on the cloud, so the filepath in out_manifest can't be modified to "tar:%s#%s" immediately.
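
As an illustration, a minimal sketch of the deferred rewrite described above, performed once the tar's location on the cloud filesystem is known (the function and argument names here are illustrative, not the actual API of pcloud_split_data.py):

```
import json


def rewrite_manifest(in_manifest, out_manifest, cloud_tar_path):
    """Rewrite each manifest entry to the "tar:%s#%s" form once the tar
    file's real path on the cloud filesystem is known."""
    with open(in_manifest) as fin, open(out_manifest, 'w') as fout:
        for line in fin:
            entry = json.loads(line)
            # At this point 'audio_filepath' holds only the name inside the tar.
            entry['audio_filepath'] = "tar:%s#%s" % (cloud_tar_path,
                                                     entry['audio_filepath'])
            fout.write(json.dumps(entry) + '\n')
```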

help="Input manifest path. (default: %(default)s)")
parser.add_argument(
"--data_tar_path",
default='datasets/dev.tar',
Contributor

Put the data in ./cloud

Contributor Author

Fixed.


parser.add_argument(
"--in_manifest_path",
default='datasets/dev.mani',
Contributor

Put the file in ./cloud

Contributor Author

Fixed.

@@ -0,0 +1,32 @@
# set by user
Contributor

Move pcloud_submit.sh into ./cloud

Contributor Author

Fixed.

Contributor

Remove this file since it has been added in ./cloud

Contributor Author

Ok. I forgot to delete this file.

@@ -0,0 +1,32 @@
# set by user
TRAIN_MANI='/pfs/dlnel/home/yanxu05@baidu.com/wanghaoshuang/data/ds2_data/demo.mani'
Contributor

Could you prepare a configuration identical to the single-machine version, instead of using the demo data?
Also, please replace strings such as "wanghaoshuang"; we need to provide a clean version that users can download and run end to end after only minor changes.
In addition, please describe the cloud usage carefully in README.md.

Contributor Author

@xinghai-sun
Fixed.

  1. The default data is now the same as in the single-machine version.
  2. The data has been moved to the public path provided by PaddleCloud, so no personal names appear in it.
  3. Instructions for running DS2 with the public data have been added to deep_speech_2/cloud/README.md. Shall I cover more usage patterns in a follow-up PR?

@@ -0,0 +1,17 @@
DS2_PATH=../
tar -czf deepspeech.tar.gz ${DS2_PATH}
Contributor

No need for this?

Contributor Author

Yes. The latest paddlecloud client doesn't need this.

I0727 05:01:50.454787 25 GradientMachine.cpp:85] Initing parameters..
I0727 05:01:50.690007 25 GradientMachine.cpp:92] Init parameters done.
```
[More optins and cmd aoubt paddle cloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md)
Contributor

--> For more information, please refer to PaddleCloud

Contributor Author

Fixed.

**Step3:** Get logs from paddle cloud by cmd: `paddlecloud logs -n 10000 deepspeech20170727130129`.

```
$ paddlecloud logs -n 10000 deepspeech20170727130129
Contributor

Why paste such logs?


## Run DS2 by public data

**Step1: ** Make sure current dir is `models/deep_speech_2/cloud/`
Contributor

How should users prepare data in the cloud? Please provide a script that helps users pack and upload data.
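
A rough sketch of such a pack-and-upload helper, in the spirit of what later became cloud/upload_data.py (the function name and arguments below are illustrative; only the manifest packing pattern and the `paddlecloud cp` call are taken from this PR):

```
import json
import os
import tarfile
from subprocess import call


def pack_and_upload(manifest_path, local_tar_path, cloud_dir):
    """Pack every audio file referenced by a manifest into a tar archive,
    then copy the tar and the manifest to the PaddleCloud filesystem."""
    with tarfile.open(local_tar_path, 'w') as out_tar:
        with open(manifest_path) as f:
            for line in f:
                json_data = json.loads(line)
                sound_file = json_data['audio_filepath']
                # Store only the basename inside the tar, as in the diff above.
                out_tar.add(sound_file, arcname=os.path.basename(sound_file))
    call(['paddlecloud', 'cp', local_tar_path, cloud_dir])
    call(['paddlecloud', 'cp', manifest_path, cloud_dir])
```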

@@ -0,0 +1,61 @@
"""
Contributor

rename the file to "update_data.py"

help="Data tar file path. (default: %(default)s)")
parser.add_argument(
"--out_manifest_path",
default='./cloud/data/dev.mani.split',
Contributor

--> manifest.train.local ?

Contributor Author

I renamed it to local.train.manifest. Is it OK to use manifest as the suffix to indicate the format of this file?

@@ -0,0 +1,32 @@
# set by user
Contributor

Remove this file since it has been added in ./cloud


def _read_soundbytes(self, filepath):
"""
Read bytes from file.
Contributor

Merge L240 and L239 (put the docstring summary on the same line as the opening quotes).

@@ -215,6 +225,46 @@ def vocab_list(self):
"""
return self._speech_featurizer.vocab_list

def _parse_tar(self, file):
"""
Contributor

Merge L228 and L229


def _process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data."""
speech_segment = SpeechSegment.from_bytes(
Contributor

The original _process_utterance can be reused, since its filename parameter also accepts a file object. I suggest adding a method, e.g. _filepath_to_object, called before this function to handle the incoming filename: if it is a tar-style path, extract the member from the tar file and return its file object.
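
A minimal sketch of that suggestion, assuming the "tar:<tar_path>#<member>" path convention used elsewhere in this PR (the name _filepath_to_object is the reviewer's suggestion; the merged code ended up with a similar helper named _get_file_object, and caching of open tar handles is omitted here):

```
import tarfile


def _filepath_to_object(filepath):
    """Return something _process_utterance can consume: a file object
    extracted from the tar for "tar:<tar_path>#<member>" paths, or the
    plain path otherwise."""
    if filepath.startswith('tar:'):
        tar_path, member = filepath[len('tar:'):].split('#', 1)
        return tarfile.open(tar_path).extractfile(member)
    return filepath
```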

@wanghaoshuang wanghaoshuang changed the title Make ds2 run on paddle cloud Make ds2 run on paddle cloud and optimize performance Aug 11, 2017
@wanghaoshuang wanghaoshuang changed the title Make ds2 run on paddle cloud and optimize performance Make ds2 run on PaddleCloud and optimize performance Aug 11, 2017
1. Refine data_utils/data.py, reuse process_utterance function.
2. Modified README.
3. Implement uploading data in cloud/upload_data.py
4. Merge branch 'develop' of https://github.com/PaddlePaddle/models into ds2_pcloud
```

- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.

Contributor

Remove L19.

Contributor Author

Fixed.

": "nor is mister ..."}
```

- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.
Contributor

  1. .This file --> . This file. Don't forget the space after the dot mark.
  2. TRAIN_MANIFEST --> `TRAIN_MANIFEST`

Contributor Author

Fixed.

- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.

- `VOCAB_FILE`: Absolute path of vocabulary file in local filesytem.
- `MEAN_STD_FILE`: Absolute path of vocabulary file in local filesytem.
Contributor

vocabulary file --> normalizer's statistic file

Contributor Author

Fixed.


>Note: Make sure current directory is `models/deep_speech_2/cloud/`

## Step1 Configure data set
Contributor

Step1 --> Step-1

Contributor Author

Fixed.


## Step1 Configure data set

You can configure your input data and output path in pcloud_submit.sh:
Contributor

--> Configure your input data and output path in pcloud_submit.sh:

Contributor Author

Fixed.

ret = 1
# train data
if args.train_manifest_path != "":
    ret = call(['paddlecloud', 'ls', cloud_train_manifest])
Contributor

Do we need a try/except here?

Contributor Author

This call doesn't throw an exception just because the cloud file doesn't exist, so a try/except isn't necessary here. For exceptions caused by things like PaddleCloud not being installed, we should let it fail fast and handle the exception where this script is invoked.

local_test_manifest)
call(
['paddlecloud', 'cp', local_test_manifest, cloud_test_manifest])
call(['paddlecloud', 'cp', local_test_tar, cloud_test_tar])
Contributor

Write a function to do cloud fs cp.
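
For example, a minimal helper of the kind being requested (the name pcloud_cp is illustrative):

```
from subprocess import call


def pcloud_cp(local_path, cloud_path):
    """Copy a local file to the PaddleCloud filesystem via the paddlecloud CLI."""
    return call(['paddlecloud', 'cp', local_path, cloud_path])
```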

Contributor Author

Fixed.

@@ -82,12 +89,15 @@ def __init__(self,
self._num_threads = num_threads
self._rng = random.Random(random_seed)
self._epoch = 0
# for caching tar files info
self.tar2info = {}
Contributor

Remove L93-L94; they are not used.

Contributor Author

Fixed.

return f, result

def _get_file_object(self, file):
"""Get file object by file path.
Contributor

Add a blank line after L239.

Contributor Author

Fixed.

    return local_data.tar2object[tarpath].extractfile(
        local_data.tar2info[tarpath][filename])
else:
    return open(file)
Contributor

open(file, 'r')
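
Putting the fragments above together, a rough sketch of how the cached lookup might fit (tar2object, tar2info and _parse_tar are names that appear in this diff; the threading.local-based local_data cache is an assumption):

```
import tarfile
import threading

# Assumption: per-thread cache of open tar handles and member info,
# matching the tar2object/tar2info names seen in the snippets above.
local_data = threading.local()


def _parse_tar(file_path):
    """Open a tar file once and index its members by name."""
    f = tarfile.open(file_path)
    return f, dict((info.name, info) for info in f.getmembers())


def _get_file_object(file):
    """Return a file object for a plain path or a "tar:<path>#<member>" path."""
    if file.startswith('tar:'):
        tarpath, filename = file.split(':', 1)[1].split('#', 1)
        if not hasattr(local_data, 'tar2object'):
            local_data.tar2object, local_data.tar2info = {}, {}
        if tarpath not in local_data.tar2info:
            obj, infos = _parse_tar(tarpath)
            local_data.tar2object[tarpath] = obj
            local_data.tar2info[tarpath] = infos
        return local_data.tar2object[tarpath].extractfile(
            local_data.tar2info[tarpath][filename])
    return open(file, 'r')
```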

@xinghai-sun xinghai-sun merged commit 65fdaca into PaddlePaddle:develop Aug 14, 2017
@wanghaoshuang wanghaoshuang deleted the ds2_pcloud branch August 14, 2017 07:37