
NotImplementedError: aten::equal on meta tensors during multi-GPU model init with transformers >= 5.4.0 #1765

@sharonyu-115

Description


Describe the bug

Issue discovered while working on NVIDIA-NeMo/RL#2212

When loading an HF model with tie_word_embeddings=True (e.g., Qwen/Qwen3-0.6B) on multi-GPU, model initialization crashes with:

NotImplementedError: aten::equal: attempted to run this operator with Meta tensors,
but there was no fake impl or Meta kernel registered.

The crash occurs because _build_model wraps the entire _init_model call — including HF's from_pretrained — inside an init_empty_weights() context (meta device). This means that by the time HF's _finalize_model_loading calls tie_weights(missing_keys=...), the model parameters are still meta tensors. Transformers v5.4.0 added a torch.equal() call inside tie_weights to compare tied parameter values (HF PR #44497), and torch.equal does not support meta tensors.
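The failing operation can be reproduced in isolation. The sketch below (a minimal illustration, not the actual transformers code path) builds two meta tensors, as the tied parameters are at the time `tie_weights` runs, and shows that `torch.equal` has no meta kernel:

```python
import torch

# Two parameters still on the meta device (shape/dtype only, no storage),
# standing in for the tied embedding and lm_head weights.
a = torch.empty(4, device="meta")
b = torch.empty(4, device="meta")

assert a.is_meta and b.is_meta

# transformers >= 5.4.0 compares tied parameters by value inside
# tie_weights(); aten::equal is data-dependent, so there is no meta
# kernel for it and the call raises NotImplementedError.
try:
    torch.equal(a, b)
    raised = False
except NotImplementedError:
    raised = True

print("torch.equal raised NotImplementedError:", raised)
```

A guard along the lines of `if not param.is_meta:` before the value comparison would sidestep the crash, but the root cause here is that the comparison runs while the model is still inside the meta-device context.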

Call chain

_build_model (auto_model.py:359)
  with [no_init_weights(), init_empty_weights()]:     ← meta device context wraps everything
    _init_model (model_init.py:396)
      _from_pretrained_parent_class (auto_model.py:205)
        HF AutoModelForCausalLM.from_pretrained
          model.__init__()                             ← meta tensors created here
          _load_pretrained_model()                     ← weights loaded, but STILL META (inside init_empty_weights)
          _finalize_model_loading (modeling_utils.py:4290)
            tie_weights(missing_keys=...)
              torch.equal(source_param, target_param)  ← CRASH: meta tensors don't support this
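The key point in the chain above is that weight tying itself works fine on meta tensors; only the value comparison fails. A torch-only sketch (using `torch.device("meta")` as a stand-in for accelerate's `init_empty_weights()` context) shows that parameters constructed and tied inside the context remain meta:

```python
import torch
import torch.nn as nn

# Stand-in for init_empty_weights(): construct the module with every
# parameter on the meta device, as _build_model's context does.
with torch.device("meta"):
    emb = nn.Embedding(10, 4)            # input embeddings
    head = nn.Linear(4, 10, bias=False)  # lm_head

# Tying by parameter assignment (what tie_weights effectively does)
# succeeds on meta tensors...
head.weight = emb.weight

# ...but both parameters are still meta, so the torch.equal() value
# comparison added in transformers v5.4.0 would crash here.
print("tied:", head.weight is emb.weight)
print("still meta:", head.weight.is_meta)
```

This is why the crash only appears once the comparison was added: nothing before v5.4.0 needed to read the tied parameters' values while still inside the meta context.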

Steps/Code to reproduce bug

Run the existing qwen3_0p6b_hellaswag.yaml SFT recipe on multiple GPUs.

automodel examples/llm_finetune/qwen/qwen3_0p6b_hellaswag.yaml --nproc-per-node 2

Impact: any model with tie_word_embeddings=True in its config.json will trigger this crash when loaded via the HF fallback path (i.e., not a custom-registered model) on multi-GPU.
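To triage which checkpoints are affected, it is enough to inspect the `tie_word_embeddings` key in the checkpoint's config.json (sketch below uses an inline excerpt with hypothetical values rather than downloading a real config):

```python
import json

# Excerpt of a checkpoint's config.json (hypothetical values). Any model
# whose config sets tie_word_embeddings to true hits this code path when
# loaded via the HF fallback on multi-GPU.
config_json = '{"model_type": "qwen3", "tie_word_embeddings": true}'

cfg = json.loads(config_json)
affected = cfg.get("tie_word_embeddings", False)
print("would trigger the meta-tensor crash:", affected)
```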

Additional context

Error log from my reproduction with automodel SFT on dfw: /lustre/fsw/portfolios/coreai/users/shuangy/src/NeMo-RL/nemo-rl/meta_tensor_issue_reproduce.log
