
reimplement sentence based max len padding #67

Closed
Waino opened this issue May 13, 2024 · 0 comments · Fixed by #69
Labels
enhancement New feature or request good first issue Good for newcomers

Comments


Waino commented May 13, 2024

#55 implements fixed size minibatches with both a fixed number of sentences and a fixed number of tokens per sentence, with truncating/padding to enforce the length.

This seems to yield good throughput and GPU utilization under some conditions, but it is sensitive to the --max_length parameter: a value that is too short fills the batches efficiently but truncates sentences, making long sentences impossible to translate, while a value that is too long leads to excessive padding, wasting compute.

The implementation in #55 is built on top of the spiral LookAheadBucketing, which was removed in #66. It should be reimplemented as a standalone component.
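As a rough illustration of what the standalone component needs to do, here is a minimal sketch of truncating/padding every sentence to a fixed length. The function and constant names (`pad_to_max_length`, `PAD_ID`) are hypothetical and not the project's actual API:

```python
# Illustrative sketch only; names are hypothetical, not the project's API.
PAD_ID = 0  # hypothetical padding token id


def pad_to_max_length(sentences, max_length, pad_id=PAD_ID):
    """Truncate each sentence to max_length tokens, then pad short
    sentences with pad_id so every row has exactly max_length tokens."""
    batch = []
    for tokens in sentences:
        truncated = tokens[:max_length]  # too long: truncate (loses tokens)
        # too short: pad (padding positions waste compute)
        padded = truncated + [pad_id] * (max_length - len(truncated))
        batch.append(padded)
    return batch
```

Combined with a fixed number of sentences per batch, every minibatch then has the fixed shape (batch_size, max_length), which is what makes throughput predictable.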

@Waino Waino added enhancement New feature or request good first issue Good for newcomers labels May 13, 2024
@Waino Waino mentioned this issue May 13, 2024
Waino added a commit that referenced this issue May 20, 2024
Fixed size batching entails using
- "batch_type: sents" to fix the batch dimension to batch_size, and
- "pad_to_max_length: true" together with "max_length" to fix the
  sequence length dimension.

Closes #67
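As a sketch, the options named in the commit message might appear together in a training config like this (option names are taken from the commit message; the surrounding values are purely illustrative):

```yaml
# Fixed-size batching: every minibatch is batch_size x max_length.
batch_type: sents        # fixes the batch dimension to batch_size sentences
batch_size: 64           # illustrative value
pad_to_max_length: true  # pad every sentence up to max_length
max_length: 128          # illustrative value; fixes the sequence length dimension
```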
@Waino Waino closed this as completed in #69 May 20, 2024