
Make ds2 run on PaddleCloud and optimize performance #144

Merged
merged 12 commits into PaddlePaddle:develop from ds2_pcloud on Aug 14, 2017

Conversation

wanghaoshuang
Contributor

@wanghaoshuang wanghaoshuang commented Jul 3, 2017

  1. Refine data_utils/data.py to read bytes from a tar file
  2. Add scripts to submit a PaddleCloud job for DS2 training

Steps:

  1. Pack train and test data according to the user's manifest file
  2. Upload packed data and manifest file to PaddleCloud
  3. Submit PaddleCloud job

fix #143

1. Refine data_utils/data.py to read bytes from tar file
2. Add scripts to submit paddle cloud job for ds2 training
Contributor

@xinghai-sun xinghai-sun left a comment

Great!
Could you please add a detailed tutorial on how to quickly start a paddle cloud version of DS2?

import paddle.v2 as paddle
import tarfile
Contributor

Put `import tarfile` before `import paddle`. The more widely used module comes first.

Contributor Author

Fix.

@@ -64,7 +68,6 @@ def __init__(self,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
Contributor

Why remove `use_dB_normalization`?

Contributor Author

I didn't mean to remove `use_dB_normalization`. Maybe I made a mistake during the git merge?

Contributor Author

Fix.

@@ -0,0 +1,47 @@
import os
Contributor

Add a simple file docstring, like the other files.

Contributor

Move pcloud_split_data.py and pcloud_prepare_data.py (rename pcloud_data.py to this) into deep_speech_2/cloud.

Contributor Author

Fixed.

import argparse


def split_data(inManifest, tar_path, outManifest):
Contributor

inManifest --> in_manifest, outManifest --> out_manifest.
Keep the variable naming style consistent.

Contributor Author

Fixed.

def split_data(inManifest, tar_path, outManifest):
    trainer_id = 1
    trainer_count = 2
    #with open("/trainer_id", "r") as f:
Contributor

Remove L9-L12, or remove L7-L8?
Please remove experimental or temporary code from the pull request.

Contributor Author

Fixed.

sound_file = json_data['audio_filepath']
filename = os.path.basename(sound_file)
out_tar.add(sound_file, arcname=filename)
json_data['audio_filepath'] = filename
Contributor

Once the cloud tar file is generated, the filepath in out_manifest should be modified to "tar:%s#%s" immediately, instead of putting the modification in pcloud_split_data.py.

Contributor Author

When we generate the tar file locally, we don't know the real path of the tar file on the cloud, so the filepath in out_manifest can't be modified to "tar:%s#%s" immediately.
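
As an illustration, a minimal sketch of the deferred rewrite described above, performed once the tar's location on the cloud filesystem is known (the function and argument names here are illustrative, not the actual API of pcloud_split_data.py):

```
import json


def rewrite_manifest(in_manifest, out_manifest, cloud_tar_path):
    """Rewrite each manifest entry to the "tar:%s#%s" form once the tar
    file's real path on the cloud filesystem is known."""
    with open(in_manifest) as fin, open(out_manifest, 'w') as fout:
        for line in fin:
            entry = json.loads(line)
            # At this point 'audio_filepath' holds only the name inside the tar.
            entry['audio_filepath'] = "tar:%s#%s" % (cloud_tar_path,
                                                     entry['audio_filepath'])
            fout.write(json.dumps(entry) + '\n')
```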

help="Input manifest path. (default: %(default)s)")
parser.add_argument(
"--data_tar_path",
default='datasets/dev.tar',
Contributor

Put the data in ./cloud

Contributor Author

Fixed.


parser.add_argument(
"--in_manifest_path",
default='datasets/dev.mani',
Contributor

Put the file in ./cloud

Contributor Author

Fixed.

@@ -0,0 +1,32 @@
# set by user
Contributor

Move pcloud_submit.sh into ./cloud

Contributor Author

Fixed.

Contributor

Remove this file since it has been added in ./cloud

Contributor Author

Ok. I forgot to delete this file.

@@ -0,0 +1,32 @@
# set by user
TRAIN_MANI='/pfs/dlnel/home/yanxu05@baidu.com/wanghaoshuang/data/ds2_data/demo.mani'
Contributor

Could you prepare a configuration identical to the single-machine version, instead of using the demo data?
Also, please replace strings such as "wanghaoshuang"; we need to provide a clean version that users can download and run end to end after only minor changes.
In addition, please describe the cloud usage carefully in README.md.

Contributor Author

@xinghai-sun
Fixed.

  1. The default data is now the same as in the single-machine version.
  2. The data has been moved to the public path provided by PaddleCloud, so no personal names appear in it.
  3. Instructions for running DS2 with the public data have been added to deep_speech_2/cloud/README.md. Shall I cover more usage patterns in a follow-up PR?

@@ -0,0 +1,17 @@
DS2_PATH=../
tar -czf deepspeech.tar.gz ${DS2_PATH}
Contributor

No need for this?

Contributor Author

Yes. The latest paddlecloud client doesn't need this.

I0727 05:01:50.454787 25 GradientMachine.cpp:85] Initing parameters..
I0727 05:01:50.690007 25 GradientMachine.cpp:92] Init parameters done.
```
[More optins and cmd aoubt paddle cloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md)
Contributor

--> For more information, please refer to PaddleCloud

Contributor Author

Fixed.

**Step3:** Get logs from paddle cloud by cmd: `paddlecloud logs -n 10000 deepspeech20170727130129`.

```
$ paddlecloud logs -n 10000 deepspeech20170727130129
Contributor

Why paste such logs?


## Run DS2 by public data

**Step1: ** Make sure current dir is `models/deep_speech_2/cloud/`
Contributor

How should users prepare data in the cloud? Please provide a script that helps users pack and upload data.
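
A rough sketch of such a pack-and-upload helper, in the spirit of what later became cloud/upload_data.py (the function name and arguments below are illustrative; only the manifest packing pattern and the `paddlecloud cp` call are taken from this PR):

```
import json
import os
import tarfile
from subprocess import call


def pack_and_upload(manifest_path, local_tar_path, cloud_dir):
    """Pack every audio file referenced by a manifest into a tar archive,
    then copy the tar and the manifest to the PaddleCloud filesystem."""
    with tarfile.open(local_tar_path, 'w') as out_tar:
        with open(manifest_path) as f:
            for line in f:
                json_data = json.loads(line)
                sound_file = json_data['audio_filepath']
                # Store only the basename inside the tar, as in the diff above.
                out_tar.add(sound_file, arcname=os.path.basename(sound_file))
    call(['paddlecloud', 'cp', local_tar_path, cloud_dir])
    call(['paddlecloud', 'cp', manifest_path, cloud_dir])
```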

@@ -0,0 +1,61 @@
"""
Contributor

rename the file to "update_data.py"

help="Data tar file path. (default: %(default)s)")
parser.add_argument(
"--out_manifest_path",
default='./cloud/data/dev.mani.split',
Contributor

--> manifest.train.local ?

Contributor Author

I renamed it to local.train.manifest. Is it OK to use manifest as the suffix to indicate the format of this file?

@@ -0,0 +1,32 @@
# set by user
Contributor

Remove this file since it has been added in ./cloud


def _read_soundbytes(self, filepath):
"""
Read bytes from file.
Contributor

Merge L240 and L239 (put the docstring summary on the same line as the opening quotes).

@@ -215,6 +225,46 @@ def vocab_list(self):
"""
return self._speech_featurizer.vocab_list

def _parse_tar(self, file):
"""
Contributor

Merge L228 and L229


def _process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data."""
speech_segment = SpeechSegment.from_bytes(
Contributor

The original _process_utterance can be reused, since its filename parameter also accepts a file object. I suggest adding a method, e.g. _filepath_to_object, called before this function to handle the incoming filename: if it is a tar-style path, extract the member from the tar file and return its file object.
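
A minimal sketch of that suggestion, assuming the "tar:<tar_path>#<member>" path convention used elsewhere in this PR (the name _filepath_to_object is the reviewer's suggestion; the merged code ended up with a similar helper named _get_file_object, and caching of open tar handles is omitted here):

```
import tarfile


def _filepath_to_object(filepath):
    """Return something _process_utterance can consume: a file object
    extracted from the tar for "tar:<tar_path>#<member>" paths, or the
    plain path otherwise."""
    if filepath.startswith('tar:'):
        tar_path, member = filepath[len('tar:'):].split('#', 1)
        return tarfile.open(tar_path).extractfile(member)
    return filepath
```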

@wanghaoshuang wanghaoshuang changed the title Make ds2 run on paddle cloud Make ds2 run on paddle cloud and optimize performance Aug 11, 2017
@wanghaoshuang wanghaoshuang changed the title Make ds2 run on paddle cloud and optimize performance Make ds2 run on PaddleCloud and optimize performance Aug 11, 2017
1. Refine data_utils/data.py, reuse process_utterance function.
2. Modified README.
3. Implement uploading data in cloud/upload_data.py
4. Merge branch 'develop' of https://github.com/PaddlePaddle/models into ds2_pcloud
```

- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.

Contributor

Remove L19.

Contributor Author

Fixed.

": "nor is mister ..."}
```

- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.
Contributor

  1. .This file --> . This file. Don't forget the space after the dot mark.
  2. TRAIN_MANIFEST --> `TRAIN_MANIFEST`

Contributor Author

Fixed.

- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem.This file has format like TRAIN_MANIFEST.

- `VOCAB_FILE`: Absolute path of vocabulary file in local filesytem.
- `MEAN_STD_FILE`: Absolute path of vocabulary file in local filesytem.
Contributor

vocabulary file --> normalizer's statistic file

Contributor Author

Fixed.


>Note: Make sure current directory is `models/deep_speech_2/cloud/`

## Step1 Configure data set
Contributor

Step1 --> Step-1

Contributor Author

Fixed.


## Step1 Configure data set

You can configure your input data and output path in pcloud_submit.sh:
Contributor

--> Configure your input data and output path in pcloud_submit.sh:

Contributor Author

Fixed.

ret = 1
# train data
if args.train_manifest_path != "":
    ret = call(['paddlecloud', 'ls', cloud_train_manifest])
Contributor

Do we need a try/except here?

Contributor Author

This call doesn't throw an exception just because the cloud file doesn't exist, so a try/except isn't necessary here. For exceptions caused by things like PaddleCloud not being installed, we should let it fail fast and handle the exception where this script is invoked.

local_test_manifest)
call(
['paddlecloud', 'cp', local_test_manifest, cloud_test_manifest])
call(['paddlecloud', 'cp', local_test_tar, cloud_test_tar])
Contributor

Write a function to do cloud fs cp.
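
For example, a minimal helper of the kind being requested (the name pcloud_cp is illustrative):

```
from subprocess import call


def pcloud_cp(local_path, cloud_path):
    """Copy a local file to the PaddleCloud filesystem via the paddlecloud CLI."""
    return call(['paddlecloud', 'cp', local_path, cloud_path])
```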

Contributor Author

Fixed.

@@ -82,12 +89,15 @@ def __init__(self,
self._num_threads = num_threads
self._rng = random.Random(random_seed)
self._epoch = 0
# for caching tar files info
self.tar2info = {}
Contributor

Remove L93-L94; they are not used.

Contributor Author

Fixed.

return f, result

def _get_file_object(self, file):
"""Get file object by file path.
Contributor

Add a blank line after L239.

Contributor Author

Fixed.

    return local_data.tar2object[tarpath].extractfile(
        local_data.tar2info[tarpath][filename])
else:
    return open(file)
Contributor

open(file, 'r')
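
Putting the fragments above together, a rough sketch of how the cached lookup might fit (tar2object, tar2info and _parse_tar are names that appear in this diff; the threading.local-based local_data cache is an assumption):

```
import tarfile
import threading

# Assumption: per-thread cache of open tar handles and member info,
# matching the tar2object/tar2info names seen in the snippets above.
local_data = threading.local()


def _parse_tar(file_path):
    """Open a tar file once and index its members by name."""
    f = tarfile.open(file_path)
    return f, dict((info.name, info) for info in f.getmembers())


def _get_file_object(file):
    """Return a file object for a plain path or a "tar:<path>#<member>" path."""
    if file.startswith('tar:'):
        tarpath, filename = file.split(':', 1)[1].split('#', 1)
        if not hasattr(local_data, 'tar2object'):
            local_data.tar2object, local_data.tar2info = {}, {}
        if tarpath not in local_data.tar2info:
            obj, infos = _parse_tar(tarpath)
            local_data.tar2object[tarpath] = obj
            local_data.tar2info[tarpath] = infos
        return local_data.tar2object[tarpath].extractfile(
            local_data.tar2info[tarpath][filename])
    return open(file, 'r')
```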

@xinghai-sun xinghai-sun merged commit 65fdaca into PaddlePaddle:develop Aug 14, 2017
@wanghaoshuang wanghaoshuang deleted the ds2_pcloud branch August 14, 2017 07:37