dehardcode test string #8865

JimmyZhang12 · 2024-04-09T22:58:14Z

What does this PR do ?

By default, to_word_list_format uses '<extra_id_1>' as its test string to map a string to its tokens. Because '<extra_id_1>' may not map correctly, this PR adds an option to specify the test string.
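
To illustrate the idea, here is a minimal sketch of the "test string" trick, not the actual NeMo code. A word is encoded after a known prefix, and the prefix tokens are stripped so that what remains is the word's in-context tokenization. `ToyTokenizer`, `in_context_tokens`, and the parameter name `ref_str` are all hypothetical and exist only for this example; tokens are shown as strings rather than integer ids.

```python
from typing import List


class ToyTokenizer:
    """Greedy longest-match tokenizer that prepends a dummy space marker to
    every input, similar in spirit to SentencePiece. Illustrative only."""

    def __init__(self, vocab: List[str]):
        # Longer pieces are tried first so that matching is greedy.
        self.vocab = sorted(vocab, key=len, reverse=True)

    def encode(self, text: str) -> List[str]:
        text = "▁" + text.replace(" ", "▁")
        pieces = []
        while text:
            # Fall back to a single character when no vocab piece matches.
            piece = next((p for p in self.vocab if text.startswith(p)), text[0])
            pieces.append(piece)
            text = text[len(piece):]
        return pieces


def in_context_tokens(word: str, tokenizer, ref_str: str = "<extra_id_1>") -> List[str]:
    """Return the tokens `word` gets when it follows other text.

    `ref_str` is the test string (hypothetical parameter name)."""
    ref_ids = tokenizer.encode(ref_str)
    ids = tokenizer.encode(ref_str + word)
    if ids[: len(ref_ids)] == ref_ids:
        # The prefix was tokenized on its own, so it can be stripped.
        return ids[len(ref_ids):]
    # The prefix merged with `word`; fall back to the plain encoding.
    return tokenizer.encode(word)


# Tokenizer where '<extra_id_1>' is a single piece: the default works.
tok_special = ToyTokenizer(["▁<extra_id_1>", "<extra_id_1>", "▁END", "END"])
print(in_context_tokens("END", tok_special))

# Tokenizer where '<extra_id_1>' is split and '>' merges with the word:
# the default test string fails, a different one succeeds.
tok_split = ToyTokenizer(["▁<", "extra", "_", "id", "1", ">", ">END", "▁END", "END"])
print(in_context_tokens("END", tok_split))
print(in_context_tokens("END", tok_split, ref_str="x"))
```

With the second toy tokenizer, the default test string silently falls back to the standalone encoding of the word, which is the kind of mismatch a configurable test string is meant to avoid.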

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

for more information, see https://pre-commit.ci

odelalleau

Just a minor request to update the comment.

Other than that, LGTM. I just want to point out that using <extra_id_1> as a stop word with the Llama2 tokenizer is a bad idea: as you mentioned, it may be tokenized in various ways when merged with what comes before. With NeMo's text generation we at least fall back to the (slower) string matching (here), but I don't know whether TRT-LLM has such string matching. One solution could be to postprocess the output to make sure we stop on the first stop word. (This has nothing to do with this PR; just mentioning it in case it is relevant to your work.)
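
The postprocessing idea can be sketched as plain string matching on the decoded output. This is only an illustration under stated assumptions: `truncate_at_first_stop_word` and its `keep_stop_word` flag are hypothetical names and are not part of NeMo or TRT-LLM.

```python
from typing import Iterable


def truncate_at_first_stop_word(
    text: str, stop_words: Iterable[str], keep_stop_word: bool = False
) -> str:
    """Cut `text` at the earliest occurrence of any stop word.

    Works on decoded text, so it does not depend on how the tokenizer
    happened to split the stop word."""
    best_idx, best_word = None, ""
    for word in stop_words:
        if not word:
            continue  # An empty stop word would match everywhere.
        idx = text.find(word)
        if idx != -1 and (best_idx is None or idx < best_idx):
            best_idx, best_word = idx, word
    if best_idx is None:
        return text  # No stop word found; leave the output untouched.
    end = best_idx + len(best_word) if keep_stop_word else best_idx
    return text[:end]


generated = "Sure, here it is.<extra_id_1>User: thanks"
print(truncate_at_first_stop_word(generated, ["<extra_id_1>"]))
```

Because the match runs on the final string, it is slower than token-level stop criteria but catches stop words that were tokenized in an unexpected way.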

nemo/export/trt_llm/nemo_utils.py

Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com>

oyilmaz-nvidia

LGTM.

JimmyZhang12 · 2024-04-20T22:28:54Z

Can this be merged @oyilmaz-nvidia ?

oyilmaz-nvidia

LGTM, thanks!

oyilmaz-nvidia · 2024-05-02T17:06:49Z

jenkins

* dehardcode test string

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo_utils.py

Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com>

---------

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com>
Co-authored-by: Jimmy Zhang <jiemingz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>

JimmyZhang12 force-pushed the export_wordlist_fix branch from 3619adb to 635a28b on April 9, 2024 22:59

dehardcode test string

845dc17

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

JimmyZhang12 force-pushed the export_wordlist_fix branch from 52b7d8d to 845dc17 on April 9, 2024 23:07

[pre-commit.ci] auto fixes from pre-commit.com hooks

78ac1ba

for more information, see https://pre-commit.ci

odelalleau reviewed Apr 10, 2024

nemo/export/trt_llm/nemo_utils.py

JimmyZhang12 added 2 commits April 16, 2024 12:38

Update nemo_utils.py

ec29505

Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com>

Merge branch 'main' into export_wordlist_fix

9c54ba6

oyilmaz-nvidia approved these changes Apr 16, 2024

Merge branch 'main' into export_wordlist_fix

494e4a1

odelalleau approved these changes Apr 20, 2024

oyilmaz-nvidia and others added 3 commits April 22, 2024 15:48

Merge branch 'main' into export_wordlist_fix

37ead2b

Merge branch 'main' into export_wordlist_fix

c743937

Merge branch 'main' into export_wordlist_fix

7678ff4

oyilmaz-nvidia approved these changes May 2, 2024

Merge branch 'main' into export_wordlist_fix

0a04163

oyilmaz-nvidia added the Run CICD label May 2, 2024

oyilmaz-nvidia approved these changes May 3, 2024

oyilmaz-nvidia merged commit 57c55f3 into NVIDIA:main May 3, 2024
130 checks passed
