feat: support override HF model name in convert_megatron_to_hf #2202

Merged
yuki-97 merged 1 commit into NVIDIA-NeMo:main from dhineshkumar-r:checkpoint-conversion-hf-model-override on Apr 4, 2026

Conversation

@dhineshkumar-r (Contributor)

What does this PR do?

Enables a way to override hf_model_name when converting checkpoints from Megatron to HF format. This is useful for models like GPT-OSS, whose base checkpoint precision (mxfp4) differs from the export precision (bfloat16) supported in Megatron-Bridge (Ref).
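
A minimal sketch of the idea (illustrative only; the actual wiring inside convert_megatron_to_hf.py may differ, and the config key shown is hypothetical): default to the HF model name recorded in the Megatron checkpoint's config, but let a CLI flag override it.

  import argparse

  import yaml

  parser = argparse.ArgumentParser()
  parser.add_argument("--config", required=True, help="Path to the Megatron checkpoint's config.yaml")
  parser.add_argument(
      "--hf-model-name",
      default=None,
      help="Optional override for the HF model name recorded in the config",
  )
  args = parser.parse_args()

  with open(args.config) as f:
      cfg = yaml.safe_load(f)

  # "policy.model_name" is a placeholder key for illustration; the real config.yaml
  # layout may differ. The override takes precedence when provided.
  hf_model_name = args.hf_model_name or cfg["policy"]["model_name"]
  print(f"Loading HF config/tokenizer from: {hf_model_name}")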

Issues

List issues that this PR closes (syntax):

closes #2124

Usage

If openai/gpt-oss-20b is fine-tuned in bfloat16 precision and the checkpoints are stored in Megatron format, the override argument can be used to pass the supported unsloth/gpt-oss-20b-BF16 HF model name, so that the config corresponding to bf16 precision is used.

uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
  --config <path_to_gpt_oss_20b_megatron_ckpt>/config.yaml \
  --hf-model-name unsloth/gpt-oss-20b-BF16 \
  --megatron-ckpt-path <path_to_gpt_oss_20b_megatron_ckpt>/policy/weights/iter_xxxxx \
  --hf-ckpt-path <path_to_save_hf_ckpt>
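
After conversion, the exported directory can be loaded like any other HF checkpoint. A quick, illustrative sanity check (not part of the converter; assumes enough host memory to hold the bf16 weights):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  ckpt_dir = "<path_to_save_hf_ckpt>"

  tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
  model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)

  inputs = tokenizer("Hello", return_tensors="pt")
  with torch.no_grad():
      out = model(**inputs)
  print(out.logits.shape)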

Before your PR is "Ready for review"

Pre checks:

  • [Y] Make sure you read and followed Contributor guidelines
  • [N] Did you write any new necessary tests?
  • [N] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • [Y] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • Although there are no unit tests, the change was verified to work as expected locally by running the above-mentioned command with a gpt-oss-20b checkpoint fine-tuned in bfloat16 precision. Did not notice any missing-layer warnings (see the key-diff sketch below for this kind of check).
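
A quick way to check for missing layers (hypothetical sanity check, not part of this PR): diff the parameter names recorded in the converted checkpoint's safetensors index against those of the reference HF model. This assumes both checkpoints are sharded safetensors with a model.safetensors.index.json file.

  import json

  from huggingface_hub import hf_hub_download

  converted_index = "<path_to_save_hf_ckpt>/model.safetensors.index.json"
  reference_index = hf_hub_download("unsloth/gpt-oss-20b-BF16", "model.safetensors.index.json")

  with open(converted_index) as f:
      converted_keys = set(json.load(f)["weight_map"])
  with open(reference_index) as f:
      reference_keys = set(json.load(f)["weight_map"])

  print("missing from converted ckpt  :", sorted(reference_keys - converted_keys))
  print("unexpected in converted ckpt :", sorted(converted_keys - reference_keys))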

@dhineshkumar-r requested a review from a team as a code owner on April 3, 2026 06:35
@copy-pr-bot (Bot) commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yuki-97 (Contributor) left a comment

thanks @dhineshkumar-r, changes lgtm.

could you help add what you mentioned in the PR description to docs/design-docs/checkpointing.md? this will help people who run into the same situation.

@dhineshkumar-r force-pushed the checkpoint-conversion-hf-model-override branch 2 times, most recently from 5951c90 to bed5b3b on April 3, 2026 22:38
@dhineshkumar-r requested a review from a team as a code owner on April 3, 2026 22:38
@github-actions (Bot) added the Documentation label (Improvements or additions to documentation) on Apr 3, 2026
@dhineshkumar-r force-pushed the checkpoint-conversion-hf-model-override branch 2 times, most recently from 4d22f5a to d162d13 on April 3, 2026 22:49
@dhineshkumar-r (Contributor, Author) commented Apr 3, 2026

Done. Please take a look.
cc: @yuki-97

@yuki-97 changed the title from "Argument to override HF model name when converting from megatron to H…" to "feat: support override HF model name in convert_megatron_to_hf" on Apr 4, 2026
@yuki-97 previously approved these changes on Apr 4, 2026
@yuki-97 added the CI:Lfast label (Runs a fast test suite and re-use nightly `main` container, but sync dependencies to PRs version) on Apr 4, 2026
@yuki-97 (Contributor) commented Apr 4, 2026

/ok to test d162d13

@yuki-97 enabled auto-merge (squash) on April 4, 2026 11:01
@yuki-97 (Contributor) commented Apr 4, 2026

hi @dhineshkumar-r, there's a lint check failure, could you use pre-commit run --all-files to fix it?

auto-merge was automatically disabled April 4, 2026 14:26

Head branch was pushed to by a user without write access

@dhineshkumar-r force-pushed the checkpoint-conversion-hf-model-override branch from d162d13 to 33bdeec on April 4, 2026 14:26
Argument to override HF model name when converting from megatron to HF format.

Signed-off-by: Dhineshkumar Ramasubbu <dhineshkumar.ramasubbu@gmail.com>
@dhineshkumar-r force-pushed the checkpoint-conversion-hf-model-override branch from 33bdeec to 3ea26b7 on April 4, 2026 14:29
@dhineshkumar-r (Contributor, Author)

Yes, I don't see it failing anymore. Please let me know if anything else is needed.

@yuki-97 (Contributor) left a comment

thanks, let me re-trigger CI.

@yuki-97 (Contributor) commented Apr 4, 2026

/ok to test 3ea26b7

@yuki-97 enabled auto-merge (squash) on April 4, 2026 14:36
@yuki-97 merged commit fe3c4fc into NVIDIA-NeMo:main on Apr 4, 2026
27 checks passed

Labels

  • CI:Lfast: Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)
  • community-request
  • Documentation: Improvements or additions to documentation

Development

Successfully merging this pull request may close these issues.

Many layers missing when converting GPTOSS Megatron checkpoint to HF format

3 participants