Skip to content

fix: do not remove newlines after final eos_token in data processing#948

Merged
terrykong merged 7 commits intomainfrom
tk/rstrip-newline
Aug 20, 2025
Merged

fix: do not remove newlines after final eos_token in data processing#948
terrykong merged 7 commits intomainfrom
tk/rstrip-newline

Conversation

@terrykong
Copy link
Collaborator

What does this PR do ?

Currently the data processing will strip the final newline after an eos token which may nudge training to not produce a newline after eos token. The tokenizer that this could be an issue for is qwen which always append a newline after eos_token.

Issues

List issues that this PR closes (syntax):

closes #932

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested review from ashors1 and yuki-97 August 20, 2025 07:07
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yuki Huang <48991475+yuki-97@users.noreply.github.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
@terrykong terrykong added this pull request to the merge queue Aug 20, 2025
Merged via the queue into main with commit 38e9ef1 Aug 20, 2025
19 checks passed
@terrykong terrykong deleted the tk/rstrip-newline branch August 20, 2025 23:04
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
…948)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Yuki Huang <48991475+yuki-97@users.noreply.github.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 28, 2025
…VIDIA-NeMo#948)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Yuki Huang <48991475+yuki-97@users.noreply.github.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025
…VIDIA-NeMo#948)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Yuki Huang <48991475+yuki-97@users.noreply.github.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…VIDIA-NeMo#948)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Yuki Huang <48991475+yuki-97@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove rstrip of \n in the final message

2 participants