Sampler that splits utterances into buckets #83
Conversation
Thanks for this @EgorLakomkin, has this shown an improvement in memory usage when profiled or checking nvidia-smi once this kicks in? EDIT: initial tests have shown improvements in memory usage over random sampling across the entire dataset. Just going to add a few comments!
I am also testing on librispeech+ted+voxforge for a couple of epochs to measure if memory blows up as in random sampling.
I've run a few epochs of librispeech 1k hrs training with this branch and still ended up with a random OOM. I think this is a step in the right direction, but I wonder if we can't take it a step further to try to guarantee a maximum sequence length.
Strange, mine is working quite well. Are you limiting the maximum length of the utterances?
Nope. Not limiting the length anywhere (where are you doing that, by the way?). I think the longest utterance in the LibriSpeech corpus is something like 35 seconds.
There is an option in the merge_manifests script to limit the minimum or maximum length. 35 seconds is indeed long; I guess for such long utterances the only solution would be to decrease the batch size.
I have tried several times and also caught an OOM. Set-up is batch_size 16, 5-layer GRU, 1024 hidden size, max length 15. Now I have added one more thing: del loss & del out references during training. I have seen somewhere that it might help with OOMs. The same set-up is working for me now without OOMs.
Also saw that in the forums recently. Glad that's helping.
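For anyone hitting the same OOMs, here is a minimal sketch of what dropping those references could look like inside the training loop. The model/criterion/optimizer/loader names are placeholders for illustration, not the exact code in this repo:

```python
def train_one_epoch(model, criterion, optimizer, train_loader):
    """Illustrative training loop that explicitly drops `out` and `loss`
    after each step so the autograd graph and activations can be freed
    before the next (possibly longer) batch is processed."""
    for inputs, targets, input_sizes, target_sizes in train_loader:
        out = model(inputs)                                   # forward pass
        loss = criterion(out, targets, input_sizes, target_sizes)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Drop the references; without this the tensors stay alive until the
        # names are rebound on the next iteration.
        del out
        del loss
```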
data/bucketing_sampler.py
self.bins_to_samples[bin_id].append(idx)

class BucketingSampler(Sampler):
    """
Mind adding a little description as to what this bucket sampler does?
I have added a few comments on SpectrogramDatasetWithLength.
train.py
@@ -52,6 +53,7 @@
parser.add_argument('--tensorboard', dest='tensorboard', action='store_true', help='Turn on tensorboard graphing')
parser.add_argument('--log_dir', default='visualize/deepspeech_final', help='Location of tensorboard log')
parser.add_argument('--log_params', dest='log_params', action='store_true', help='Log parameter values and gradients')
parser.add_argument('--bucketing', dest='bucketing', action='store_true', help='Split utterances into buckets and sample from them')
I think 'bucketing' should replace sorta grad given the memory savings. Would it make sense to make this the default and replace this parameter with --no_sorta_grad, which swaps out the sampler before starting the training process?
Concur
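For illustration, the suggested default could look roughly like this in train.py. Flag, class, and variable names here (BucketingSampler, train_dataset, collate_fn, etc.) are assumptions, not the PR's final code:

```python
import argparse

from torch.utils.data import DataLoader

parser = argparse.ArgumentParser()
# ... existing arguments ...
parser.add_argument('--no_sorta_grad', dest='no_sorta_grad', action='store_true',
                    help='Disable length-based sampling and shuffle batches randomly')
args = parser.parse_args()

# train_dataset, collate_fn and BucketingSampler are assumed to be defined elsewhere.
if args.no_sorta_grad:
    # Fall back to plain random sampling over the whole dataset.
    train_loader = DataLoader(train_dataset, batch_size=args.batch_size,
                              shuffle=True, collate_fn=collate_fn,
                              num_workers=args.num_workers)
else:
    # Default: bucket utterances by length so each batch has similar durations.
    sampler = BucketingSampler(train_dataset)
    train_loader = DataLoader(train_dataset, batch_size=args.batch_size,
                              sampler=sampler, collate_fn=collate_fn,
                              num_workers=args.num_workers)
```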
Thought I'd help out: finished the review and dealt with merge commits on the main branch. Let me know what you think!
Looks good! Thank you :)
Pulling this in now, and thanks for the contribution!
I think it might be a solution to #30. Instead of random sampling, we could first split the whole dataset into several buckets depending on the length of the utterance and then randomly sample from each of the buckets. As a result, we can have batches with similar-length utterances.
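A minimal sketch of the idea (this assumes the dataset exposes per-utterance durations through an `audio_lengths` attribute; it illustrates the approach rather than reproducing the exact code in this PR):

```python
from collections import defaultdict

import numpy as np
from torch.utils.data.sampler import Sampler


class BucketingSampler(Sampler):
    """Buckets utterances by length and shuffles only within each bucket,
    so every batch drawn by the DataLoader contains utterances of roughly
    the same duration and per-batch padding/memory stays bounded."""

    def __init__(self, data_source, bucket_size_seconds=1.0):
        self.data_source = data_source
        self.bins_to_samples = defaultdict(list)
        # `audio_lengths` (duration in seconds per utterance) is an assumed attribute.
        for idx, length in enumerate(data_source.audio_lengths):
            bin_id = int(length // bucket_size_seconds)
            self.bins_to_samples[bin_id].append(idx)

    def __iter__(self):
        # Walk buckets from shortest to longest, shuffling inside each bucket.
        for bin_id in sorted(self.bins_to_samples):
            indices = list(self.bins_to_samples[bin_id])
            np.random.shuffle(indices)
            for idx in indices:
                yield idx

    def __len__(self):
        return len(self.data_source)
```

Passed as the `sampler=` argument of a DataLoader, consecutive indices then come from the same bucket, so each batch only pads to roughly that bucket's length instead of to the longest utterance in the whole dataset.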