Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: --disable_group_texts 1 keep short samples #649

Merged
merged 1 commit into from
Sep 22, 2023

Conversation

research4pan
Copy link
Contributor

@research4pan research4pan commented Sep 22, 2023

--disable_group_texts 1 would discard short samples with length smaller than model_max_length or --block_size, and long samples will be cut into blocks, with last block being discarded. Now the behavior is changed to what is expected:

  • The short samples will be right padded to model_max_length
  • The long samples will be truncated to model_max_length

Along with the corresponding cache mechanism (new cache will be generated once --disable_group_texts argument is different from previous caches)

`--disable_group_texts 1` would discard short samples with length smaller than
`model_max_length` or `--block_size`. Now the behavior is changed to what is
expected:
    - The short samples will be right padded to `model_max_length`
    - The long samples will be truncated to `model_max_length`

Along with the corresponding cache mechanism (new cache will be generated once
`--disable_group_texts` argument is different from previous caches)
Copy link
Member

@SHUMKASHUN SHUMKASHUN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@SHUMKASHUN SHUMKASHUN merged commit 6fae940 into main Sep 22, 2023
2 checks passed
@research4pan research4pan deleted the rpan-disable-group-text branch March 31, 2024 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants