Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NeMo Integrate #125

Merged
merged 69 commits into from
Feb 3, 2023
Merged

NeMo Integrate #125

merged 69 commits into from
Feb 3, 2023

Conversation

cat-state
Copy link
Collaborator

@cat-state cat-state commented Dec 5, 2022

Currently implemented:

  • Training for ILQL
  • Generation using ILQL (i.e validation)
  • Inference script for ILQL

Future issues:

  • More reusable nemo abstraction
  • bf16 appears unstable compared to fp16

@cat-state cat-state marked this pull request as ready for review January 16, 2023 18:16
trlx/trainer/nemo_ilql_trainer.py Outdated Show resolved Hide resolved
return torch.utils.data.DataLoader(
dataset,
batch_sampler=batch_sampler,
# For some reason this causes a crash when using >0 workers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If _reconfigure for generate is uncommented, as currently is, num_workers=self.cfg.data.num_workers seems to work fine. Do you think we should change it back from 0?

trlx/trainer/nemo/gpt.py Outdated Show resolved Hide resolved
@cat-state cat-state mentioned this pull request Feb 2, 2023
5 tasks
@jon-tow
Copy link
Collaborator

jon-tow commented Feb 2, 2023

I think this PR is in a really good state right now. Any updates on the remaining issues you've listed?

  • Verify checkpointing
  • ILQL loss spikes near the start of training (this doesn't seem to be the case now?)

@cat-state
Copy link
Collaborator Author

I think this PR is in a really good state right now. Any updates on the remaining issues you've listed?

* [ ]  Verify checkpointing

* [ ]  ILQL loss spikes near the start of training (this doesn't seem to be the case now?)

Yeah, I think the last one isn't a case anymore. I'll add a script showing how to load a checkpoint and infer it.

@cat-state
Copy link
Collaborator Author

I added the inference script and made the checkpointing save to a reloadable name (doesn't work with metrics with / in their name)

Copy link
Collaborator

@jon-tow jon-tow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving some very minor nits

trlx/data/configs.py Outdated Show resolved Hide resolved
trlx/trainer/nemo/README.md Outdated Show resolved Hide resolved
trlx/trainer/nemo/gpt.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@jon-tow jon-tow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work 🥳 🥳 🥳 Thanks @cat-state !!

@jon-tow jon-tow merged commit b70bc92 into main Feb 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants