@vinay-raman vinay-raman commented Feb 5, 2025

Description

Fix bugs in the retriever SDG notebook.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

vinay-raman and others added 4 commits February 5, 2025 10:18
Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: viraman <viraman@nvidia.com>
@ryantwolf ryantwolf changed the title from "NV bug 5025154 fix" to "Fix bugs in retriever sdg notebook" Feb 10, 2025
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
…iraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>
…dia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
if __name__ == "__main__":
    dask_client = get_client()
    main()
    # dask_client.cancel(dask_client.futures, force=True)
Contributor

Remove commented out code

Contributor Author

yes fixed

Contributor

It's still there.

Contributor

What is the difference between this filter.py, generate.py, and main.py? They look nearly identical. They also all look to be CLI scripts but only main.py is mentioned in the README.

Contributor Author

@vinay-raman vinay-raman Feb 13, 2025

README has both generate.py and filter.py mentioned, please have a look.
This is needed if the user just needs to generate data or filter pre-generated data.

Contributor

@ryantwolf ryantwolf Feb 13, 2025

Instead of adding two files, can you just add two command line arguments like so?

  • --generate-only
  • --filter-only

With two new files, it's very hard to tell if the differences between them are correct or buggy. CLI args in a single file make it much easier to maintain.
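A minimal sketch of what the single-file approach suggested here could look like, using argparse from the Python standard library. Only the two flag names come from the comment above; the function names, the stage names, and the placeholder print are hypothetical and are not taken from the actual main.py in this PR.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with two mutually exclusive mode flags (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Illustrative single entry point for generation and filtering"
    )
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--generate-only",
        action="store_true",
        help="Run only the generation stage",
    )
    mode.add_argument(
        "--filter-only",
        action="store_true",
        help="Run only the filtering stage on pre-generated data",
    )
    return parser


def select_stages(args: argparse.Namespace) -> List[str]:
    """Return the stage names to run, in order; both stages run by default."""
    if args.generate_only:
        return ["generate"]
    if args.filter_only:
        return ["filter"]
    return ["generate", "filter"]


def main(argv: Optional[List[str]] = None) -> List[str]:
    """Parse the flags and run the selected stages (placeholder body)."""
    args = build_parser().parse_args(argv)
    stages = select_stages(args)
    for stage in stages:
        # Placeholder: a real script would call its pipeline code here.
        print(f"running stage: {stage}")
    return stages
```

Putting the two flags in a mutually exclusive group makes argparse reject a command line that passes both, so the shared setup code lives in one place and only the stage selection differs between modes.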

Contributor Author

resolved, please check the latest commit

Contributor

I don't see this yet.

if __name__ == "__main__":
    dask_client = get_client()
    main()
    # dask_client.cancel(dask_client.futures, force=True)
Contributor

Remove commented out code.

Contributor Author

yes fixed

Contributor

It's still there.

Contributor

@ryantwolf ryantwolf left a comment

Two more minor things. The big one, though, is to add the two CLI args to main.py instead of keeping two separate files, generate.py and filter.py.

  "output_type": "stream",
  "text": [
-  "Generator model used = mistralai/mixtral-8x22b-instruct-v0.1\n"
+  "Generator model used = nvdev/mistralai/mixtral-8x22b-instruct-v0.1\n"
Contributor

Revert model name.

Contributor Author

resolved

Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: viraman <viraman@nvidia.com>
…Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: viraman <viraman@nvidia.com>
Contributor

@ryantwolf ryantwolf left a comment

@vinay-raman I don't think your changes were pushed or something.

if __name__ == "__main__":
    dask_client = get_client()
    main()
    # dask_client.cancel(dask_client.futures, force=True)
Contributor

It's still there.

if __name__ == "__main__":
    dask_client = get_client()
    main()
    # dask_client.cancel(dask_client.futures, force=True)
Contributor

It's still there.

Contributor

I don't see this yet.

Signed-off-by: viraman <viraman@nvidia.com>
@ryantwolf ryantwolf merged commit a46fb87 into NVIDIA-NeMo:main Feb 19, 2025
6 checks passed
ryantwolf pushed a commit that referenced this pull request Feb 19, 2025
* Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* fixed qa bug 5008113, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* bug fixes for generator, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* fixed precommit, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* fixed filters, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* fixed all issues, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* fixed bug with document id, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* check if filtering pipeline is present, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* fixed notebook, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* added functionality to filter pre-generated datasets, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* separated generation & filtering pipelines, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* fixed pre-commit, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* minor changes, Signed-off by viraman@nvidia.com

Signed-off-by: Vinay Raman <viraman@nvidia.com>

* fixed Ryan Wolf's comments, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* fixed minor bugs in configs, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* removed commented code in main.py, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* added CLI flags for generation & filtering removed code duplication, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* minor fix to quickstart notebook, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

* removed filter.py & generate.py, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>

---------

Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
ko3n1g pushed a commit that referenced this pull request Feb 20, 2025
ryantwolf added a commit that referenced this pull request Feb 20, 2025
Co-authored-by: vinay-raman <98057837+vinay-raman@users.noreply.github.com>
jnke2016 pushed a commit to jnke2016/Curator that referenced this pull request Nov 12, 2025