Sampler that splits utterances into buckets #83

Merged · 5 commits · Jun 15, 2017

Conversation

@EgorLakomkin (Contributor)

I think this might be a solution to #30. Instead of random sampling, we could first split the whole dataset into several buckets by utterance length and then randomly sample from each bucket. As a result, batches contain utterances of similar length.
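
As a rough sketch of the idea (not the exact code in this PR; the `audio_lengths` attribute and the `n_buckets` argument are assumptions made for illustration):

```python
import numpy as np
from collections import defaultdict
from torch.utils.data.sampler import Sampler

class BucketingSampler(Sampler):
    """Group utterances of similar length into buckets and shuffle within each bucket.

    Sketch only: assumes the dataset exposes per-utterance lengths as `audio_lengths`.
    """

    def __init__(self, data_source, n_buckets=10):
        lengths = np.asarray(data_source.audio_lengths)
        # Interior quantile edges split the length distribution into n_buckets bins
        edges = np.quantile(lengths, np.linspace(0, 1, n_buckets + 1)[1:-1])
        self.bins_to_samples = defaultdict(list)
        for idx, bin_id in enumerate(np.digitize(lengths, edges)):
            self.bins_to_samples[bin_id].append(idx)

    def __iter__(self):
        # Yield indices bucket by bucket, shuffled inside each bucket, so that
        # consecutive batches contain utterances of similar length.
        for bin_id in sorted(self.bins_to_samples):
            np.random.shuffle(self.bins_to_samples[bin_id])
            yield from self.bins_to_samples[bin_id]

    def __len__(self):
        return sum(len(v) for v in self.bins_to_samples.values())
```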

@SeanNaren (Owner) commented on Jun 12, 2017

Thanks for this @EgorLakomkin. Has this shown an improvement in memory usage when profiled, or when checking nvidia-smi once it kicks in?

EDIT: initial tests have shown improvements in memory usage over random sampling across the entire dataset. Just going to add a few comments!

@SeanNaren mentioned this pull request on Jun 12, 2017
@EgorLakomkin (Contributor, Author) commented on Jun 13, 2017

I am also testing on LibriSpeech + TED + VoxForge for a couple of epochs to see whether memory blows up as it does with random sampling.

@ryanleary (Collaborator)

I've run a few epochs of LibriSpeech 1k-hour training with this branch and still ended up with a random OOM.

I think this is a step in the right direction, but I wonder if we can't take it a step further to try to guarantee a maximum sequence length.

@EgorLakomkin (Contributor, Author)

Strange, mine is working quite well.
Did you limit the length of the utterances? I tried with a maximum of 15 seconds.

@ryanleary (Collaborator)

Nope. Not limiting the length anywhere (where are you doing that, by the way?). I think the longest utterance in the LibriSpeech corpus is something like 35 seconds.

@EgorLakomkin (Contributor, Author)

There is an option in the merge_manifests script to limit the minimum or maximum length. 35 seconds is indeed long; I guess for such long utterances the only solution would be to decrease the batch size.
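
(The exact merge_manifests flags aren't shown in this thread. As an illustration of what such a duration cap does, a standalone filter over a `wav_path,transcript_path`-style manifest might look like the sketch below; the flag names and the use of the soundfile package are assumptions, not the project's actual options.)

```python
import argparse
import soundfile as sf  # assumption: durations are read with the soundfile package

def filter_manifest(in_path, out_path, max_duration=15.0):
    """Copy only the manifest lines whose audio is at most max_duration seconds long."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            wav_path = line.strip().split(",")[0]
            info = sf.info(wav_path)
            if info.frames / info.samplerate <= max_duration:
                fout.write(line)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-manifest", required=True)
    parser.add_argument("--out-manifest", required=True)
    parser.add_argument("--max-duration", type=float, default=15.0)
    args = parser.parse_args()
    filter_manifest(args.in_manifest, args.out_manifest, args.max_duration)
```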

@EgorLakomkin (Contributor, Author)

I have tried several times and also caught an OOM. The set-up is batch_size 16, 5 GRU layers, 1024 hidden size, max length 15 seconds.

Now I have added one more thing: deleting the loss and out references (del loss, del out) during training. I have seen somewhere that it might help with OOMs. The same set-up is now working for me without OOMs.
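
For context, the change boils down to dropping the Python references that keep the graph and activations alive before the next (possibly longer) batch is allocated. A minimal sketch of a training step with that change; `model`, `criterion`, `optimizer`, and `loader` are placeholders rather than the exact train.py code:

```python
avg_loss = 0.0
for inputs, targets, input_sizes, target_sizes in loader:
    out = model(inputs)
    loss = criterion(out, targets, input_sizes, target_sizes)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    avg_loss += float(loss)  # keep only the scalar value

    # Explicitly release the tensors that hold on to the graph/activation memory
    # before the next batch is allocated.
    del loss
    del out
```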

@ryanleary (Collaborator)

Also saw that in the forums recently. Glad that's helping.

        self.bins_to_samples[bin_id].append(idx)

class BucketingSampler(Sampler):
    """
@SeanNaren (Owner)

Mind adding a little description as to what this bucket sampler does?

@EgorLakomkin (Contributor, Author)

I have added a few comments on SpectrogramDatasetWithLength.

train.py Outdated
@@ -52,6 +53,7 @@
parser.add_argument('--tensorboard', dest='tensorboard', action='store_true', help='Turn on tensorboard graphing')
parser.add_argument('--log_dir', default='visualize/deepspeech_final', help='Location of tensorboard log')
parser.add_argument('--log_params', dest='log_params', action='store_true', help='Log parameter values and gradients')
parser.add_argument('--bucketing', dest='bucketing', action='store_true', help='Split utterances into buckets and sample from them')
@SeanNaren (Owner)

I think 'bucketing' should replace sorta grad given the memory savings. Would it make sense to make this the default, and replace this parameter with --no_sorta_grad, which swaps the sampler before the training process starts?
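
A sketch of the shape of that change (illustrative only, not the merged diff):

```python
# Bucketing becomes the default sampler; this flag opts back into plain random
# sampling from the first epoch instead of the length-ordered start.
parser.add_argument('--no_sorta_grad', dest='no_sorta_grad', action='store_true',
                    help='Skip the length-ordered first epoch and sample randomly from the start')
```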

@ryanleary (Collaborator)

Concur

@SeanNaren (Owner)

Thought I'd help out, finished the review and dealt with merge commits on the main branch. Let me know what you think!

@EgorLakomkin (Contributor, Author)

Looks good! Thank you :)

@SeanNaren merged commit cea3b01 into SeanNaren:master on Jun 15, 2017
@SeanNaren (Owner)

Pulling this in now, and thanks for the del fix; it seems like it's been doing good things for memory!
