Wrap all fp8 extra states in LocalNonpersistentObject #9422
Conversation
Signed-off-by: Jan Baczek <jbaczek@nvidia.com>
Force-pushed from 7f822fa to 549c541
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
@@ -1850,7 +1850,7 @@ def sharded_state_dict(self, prefix: str = '') -> Dict[str, Any]:

        # WAR: This is a temporary fix to skip loading FP8 parameters for Dot Product Attention
        def skip_fp8_load(x):
            if isinstance(x, ShardedObject) and 'fused_attention' in x.key and '_extra_state' in x.key:
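The diff above shows only the guard condition of the WAR. The sketch below fills in the rest of the pattern under stated assumptions: the class definitions are minimal hypothetical stand-ins (in NeMo the real types come from `megatron.core.dist_checkpointing.mapping`), and the body of `skip_fp8_load` is inferred from the PR title, not copied from the source.

```python
# Minimal stand-ins for megatron.core.dist_checkpointing.mapping types;
# in NeMo the real import would be:
#   from megatron.core.dist_checkpointing.mapping import (
#       ShardedObject, LocalNonpersistentObject)
from dataclasses import dataclass
from typing import Any


@dataclass
class ShardedObject:
    key: str
    data: Any = None


@dataclass
class LocalNonpersistentObject:
    # Wraps a value so distributed checkpointing keeps it local and
    # neither loads it from nor saves it to the checkpoint.
    obj: Any


def skip_fp8_load(x):
    """WAR: replace fused-attention FP8 extra states with local
    non-persistent objects so they are skipped on checkpoint load.
    (Body inferred from the PR title; hypothetical sketch.)"""
    if isinstance(x, ShardedObject) and 'fused_attention' in x.key and '_extra_state' in x.key:
        x = LocalNonpersistentObject(x.data)
    return x
```

In the real code this callback would be applied over the sharded state dict (e.g. via a map-in-place utility) before loading, so only the matching entries are swapped out.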
I'm concerned that this PR basically makes the GPT models ignore the FP8 state for all layers in the checkpoint.
In the meantime I prepared a thorough solution, with an almost merged MCore branch and a corresponding NeMo branch.
@jbaczek could you check (in theory or in practice) if this would solve your problem?
The required flag to set would be model.dist_ckpt_load_strictness=log_all
I think that non-strict loading would solve the problem. I see that this branch has already been merged into MCore. When should we expect the repositories to sync, so that I can use the public implementation?
I asked for a public sync; ideally it should be available later today, but we don't have an official ETA.
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
What does this PR do?
This PR generalizes FP8 extra state wrapping for all tensors.
Collection: nlp
Before your PR is "Ready for review"
Pre checks:
PR Type:
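The generalization the description mentions — wrapping all FP8 extra states, not only the fused-attention ones — can be sketched as below. Everything here is a hypothetical illustration: `map_inplace` stands in for the traversal utility NeMo actually uses (Megatron-Core's dist-checkpointing helpers such as `dict_list_map_inplace`), and the stand-in classes mirror `ShardedObject` and `LocalNonpersistentObject` only loosely.

```python
# Hypothetical sketch: wrap EVERY '_extra_state' entry in a (nested)
# sharded state dict in a local non-persistent wrapper, so none of the
# FP8 extra states are loaded from the checkpoint.
from dataclasses import dataclass
from typing import Any


@dataclass
class ShardedObject:
    key: str
    data: Any = None


@dataclass
class LocalNonpersistentObject:
    obj: Any


def map_inplace(fn, d):
    # Recursively apply fn to every leaf of a possibly nested dict
    # (stand-in for megatron.core's dict_list_map_inplace).
    for k, v in d.items():
        if isinstance(v, dict):
            map_inplace(fn, v)
        else:
            d[k] = fn(v)
    return d


def wrap_fp8_extra_states(sharded_state_dict):
    def wrap(x):
        # Generalized check: any extra state, not just fused attention.
        if isinstance(x, ShardedObject) and '_extra_state' in x.key:
            return LocalNonpersistentObject(x.data)
        return x
    return map_inplace(wrap, sharded_state_dict)
```

Compared with the earlier fused-attention-only WAR, the only change is dropping the `'fused_attention' in x.key` condition, which is what lets the wrapping apply uniformly to all tensors' extra states.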