NeMo Integrate #125
Conversation
return torch.utils.data.DataLoader(
    dataset,
    batch_sampler=batch_sampler,
    # For some reason this causes a crash when using >0 workers
If _reconfigure for generate is uncommented, as it currently is, num_workers=self.cfg.data.num_workers seems to work fine. Do you think we should change it back from 0?
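For context, a minimal standalone sketch of the pattern in the diff above, with a made-up dataset and sampler (not the PR's actual code), keeping num_workers=0 as the diff does:

```python
# Hypothetical sketch of a BatchSampler-driven DataLoader, assuming a toy
# TensorDataset in place of the PR's real dataset.
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

dataset = TensorDataset(torch.arange(10).float())
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)

# num_workers=0 loads batches in the main process; values > 0 spawn worker
# processes, which is where the crash mentioned in the diff was reported.
loader = DataLoader(dataset, batch_sampler=batch_sampler, num_workers=0)
batches = [b for (b,) in loader]
print(len(batches))  # 3 batches: sizes 4, 4, 2
```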
I think this PR is in a really good state right now. Any updates on the remaining issues you've listed?
Yeah, I think the last one isn't an issue anymore. I'll add a script showing how to load a checkpoint and run inference with it.
I added the inference script and made the checkpointing save to a reloadable name (doesn't work with metrics with
Leaving some very minor nits
Amazing work 🥳 🥳 🥳 Thanks @cat-state !!
Currently implemented:
Future issues:
bf16 appears unstable compared to fp16
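One common source of bf16/fp16 differences is their precision/range trade-off: bf16 keeps fp32's exponent range but has only 7 mantissa bits, while fp16 has 10 mantissa bits but overflows above ~65504. A hedged, generic illustration in plain PyTorch (not trlx/NeMo code):

```python
# Generic dtype comparison, unrelated to the PR's training config.
import torch

big = torch.tensor(70000.0)
print(big.to(torch.float16))   # overflows to inf in fp16 (max ~65504)
print(big.to(torch.bfloat16))  # stays finite in bf16, coarsely rounded

near_one = torch.tensor(1.0005)
print(near_one.to(torch.bfloat16))  # rounds to 1.0: bf16 step near 1 is 2**-7
print(near_one.to(torch.float16))   # resolved: fp16 step near 1 is 2**-10
```

Whether this trade-off explains the instability seen here would need profiling of the actual training run.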