Convert megatron lm ckpt to nemo #5517

Closed
bew-pbwt opened this issue Nov 29, 2022 · 11 comments
Labels: bug (Something isn't working), stale

bew-pbwt commented Nov 29, 2022

Describe the bug

I have tried converting a Megatron-LM checkpoint to NeMo (PyTorch .pt to .nemo file) with model_optim_rng.pt from nvidia/megatron_bert_345m, using the script at https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
unzip megatron_bert_345m_v0.1_uncased.zip

python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

It seems like torch has some issue with the GPU:

 [NeMo W 2022-11-29 02:35:59 optimizers:77] Could not import distributed_fused_adam optimizer from Apex
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[NeMo W 2022-11-29 02:36:11 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-11-29 02:36:11 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-11-29 02:36:11 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
Traceback (most recent call last):
  File "megatron_lm_ckpt_to_nemo.py", line 476, in <module>
    assert world_size % args.tensor_model_parallel_size == 0
AssertionError
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
   warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9735) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
megatron_lm_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-29_02:36:16
  host      : e792987d9454
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9735)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Inside the Docker container, all GPUs are detected (nvidia-smi).
[screenshot: nvidia-smi output]

Expected behavior

A .nemo LM model should be produced.

Environment overview (please complete the following information)

  • Environment location: nvcr.io/nvidia/nemo:22.08 docker
  • Method of NeMo install: NeMo is pre-installed in the image

sudo docker pull nvcr.io/nvidia/nemo:22.08 && sudo nvidia-docker run -it -v --shm-size=16g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:22.08

Additional context

GPU: Tesla V100
server: nvidia DGX1

bew-pbwt added the bug (Something isn't working) label Nov 29, 2022

yidong72 (Collaborator) commented Dec 2, 2022

Since this model checkpoint uses model parallelism:

--tensor_model_parallel_size 2
--pipeline_model_parallel_size 2

you need to launch 4 processes (--nproc_per_node=4) to run it. Please also make sure you have 4 GPUs available for the conversion.
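
For reference, a minimal sketch of a launch whose world size matches the requested model parallelism (world_size = 2 tensor-parallel x 2 pipeline-parallel = 4), reusing the paths and output name from the report above; it is not verified against this checkpoint:

python -m torch.distributed.launch --nproc_per_node=4 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

The assertion in the traceback (world_size % tensor_model_parallel_size == 0) is exactly what fails when only one process is launched with tensor_model_parallel_size=2.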

bew-pbwt (Author) commented Dec 6, 2022

I already tried that, but all of the following produced the same error as above:

python -m torch.distributed.launch --nproc_per_node=8 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1

python -m torch.distributed.launch --nproc_per_node=2 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1

python -m torch.distributed.launch --nproc_per_node=4 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

python -m torch.distributed.launch --nproc_per_node=8 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

yidong72 (Collaborator) commented Dec 6, 2022

You need to figure out the model parallel size of your original BERT model, i.e. the correct tensor model parallel and pipeline model parallel sizes, and set them accordingly. Also, checkpoint_folder should exclude mp_rank_00.
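
For reference, a sketch of the layout this implies for the downloaded checkpoint and the matching arguments, assuming the script resolves the mp_rank_XX subdirectory itself:

release/
└── mp_rank_00/
    └── model_optim_rng.pt

--checkpoint_folder release/
--checkpoint_name model_optim_rng.pt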

bew-pbwt (Author) commented Dec 6, 2022

We just downloaded the checkpoint from NVIDIA to verify this issue. As a reference from https://github.com/NVIDIA/Megatron-LM/, it uses megatron_bert_345m. If you download the same checkpoint to your PC, it ends up in the release/mp_rank_00 folder. If I take out mp_rank_00, the script looks in the wrong folder.

areoll commented Dec 6, 2022

Hello @yidong72,

For reference, we are simply using the NVIDIA pre-trained BERT 345M model to track down the issue. Based on your suggestion, we tried all combinations of tensor_model_parallel_size, pipeline_model_parallel_size, and nproc_per_node, but the same error still comes up.

For the BERT 345M checkpoint, we found information indicating it was trained on a single GPU, so our understanding is tensor_model_parallel_size=1, pipeline_model_parallel_size=0, and nproc_per_node=1. Is that correct? Do you have any other comments? Please advise.

Thanks,
Areoll

yidong72 (Collaborator) commented Dec 6, 2022

If it is a single-GPU model, use tensor_model_parallel_size=1, pipeline_model_parallel_size=1, and nproc_per_node=1.
Please check the r1.6.0 branch; the following notebook shows an example of converting the checkpoint:
https://github.com/NVIDIA/NeMo/blob/r1.6.0/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb
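
For reference, a minimal sketch of the single-GPU conversion under these settings; the output file name is illustrative, and the checkpoint_folder follows the earlier note about excluding mp_rank_00:

python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path megatron_bert_345m.nemo \
    --model_type bert \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1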

bew-pbwt (Author) commented Dec 7, 2022

Hi @yidong72,
I already tried that, but branch r1.6.0 has an issue with 'MixedFusedLayerNorm':

 /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions
 warnings.warn(
[NeMo W 2022-12-07 11:34:01 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-12-07 11:34:01 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-12-07 11:34:02 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
 initializing tensor model parallel with size 1
 initializing pipeline model parallel with size 1
 initializing data parallel with size 1
[NeMo I 2022-12-07 11:34:04 mg:424] loading checkpoint /workspace/data/NeMo/examples/nlp/language_modeling/MegatronBERT.pt
converted 332.59M parameters
[NeMo W 2022-12-07 11:34:05 mg:384] the checkpoint version is 0
[NeMo I 2022-12-07 11:34:05 megatron_init:204] Rank 0 has data parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:207] All data parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:208] Ranks 0 has data parallel rank: 0
[NeMo I 2022-12-07 11:34:05 megatron_init:216] Rank 0 has model parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:217] All model parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:227] Rank 0 has tensor model parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:231] All tensor model parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:232] Rank 0 has tensor model parallel rank: 0
[NeMo I 2022-12-07 11:34:05 megatron_init:246] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:258] Rank 0 has embedding group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:264] All pipeline model parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:265] Rank 0 has pipeline model parallel rank 0
[NeMo I 2022-12-07 11:34:05 megatron_init:266] All embedding group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:267] Rank 0 has embedding rank: 0
[NeMo I 2022-12-07 11:34:05 tokenizer_utils:204] Getting Megatron tokenizer for pretrained model name: megatron-bert-345m-cased, custom vocab file: None, and merges file: None
[NeMo I 2022-12-07 11:34:05 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: bert-large-cased, vocab_file: /root/.cache/torch/megatron/megatron-bert-345m-cased_vocab, merges_files: None, special_tokens_dict: {}, and use_fast: False
Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo I 2022-12-07 11:34:10 megatron_base_model:185] Padded vocab_size: 29056, original vocab_size: 28996, dummy tokens: 60.
Traceback (most recent call last):
  File "megatron_lm_ckpt_to_nemo.py", line 515, in <module>
    convert(local_rank, rank, world_size, args)
  File "megatron_lm_ckpt_to_nemo.py", line 485, in convert
    model = load_model(MegatronBertModel, checkpoint, strict=False, trainer=trainer)
  File "megatron_lm_ckpt_to_nemo.py", line 262, in load_model
    model = ptl_load_state(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 250, in _load_state
    obj = cls(**_cls_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_bert_model.py", line 67, in __init__
    self.model = BertModel(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron/bert_model.py", line 179, in __init__
    self.language_model, self._language_model_key = get_language_model(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/language_model.py", line 93, in get_language_model
    language_model = TransformerLanguageModel(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/language_model.py", line 444, in __init__
    self.encoder = ParallelTransformer(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1744, in __init__
    self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1744, in <listcomp>
    self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1675, in build_layer
    return ParallelTransformerLayer(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1512, in __init__
    super(ParallelTransformerLayer, self).__init__(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1132, in __init__
    self.input_layernorm = get_layer_norm(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/fused_layer_norm.py", line 62, in get_layer_norm
    return MixedFusedLayerNorm(hidden_size, eps, sequence_parallel_enbaled=sequence_parallel)
NameError: name 'MixedFusedLayerNorm' is not defined
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13402) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
megatron_lm_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-12-07_11:34:16
  host      : 39497a957367
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13402)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


yidong72 (Collaborator) commented Dec 7, 2022

You need a Docker image built from the r1.6.0 branch to work with the r1.6.0 branch code:
https://github.com/NVIDIA/NeMo/blob/r1.6.0/Dockerfile
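
For reference, a minimal sketch of building and entering a matching container from that Dockerfile; the image tag and mount path are illustrative:

git clone --branch r1.6.0 https://github.com/NVIDIA/NeMo.git
cd NeMo
docker build -t nemo:r1.6.0 .
docker run --gpus all -it --rm --shm-size=16g -v $(pwd):/workspace/NeMo nemo:r1.6.0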

bew-pbwt (Author) commented

Hi @yidong72,

Thank you for your help. I can now convert it.

github-actions bot commented

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Jan 12, 2023
github-actions bot commented

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (stale) Jan 19, 2023