Convert megatron lm ckpt to nemo #5517

Closed
bew-pbwt opened this issue Nov 29, 2022 · 11 comments
Labels: bug (Something isn't working), stale

bew-pbwt commented Nov 29, 2022

Describe the bug

I have tried converting a Megatron-LM checkpoint to NeMo (PyTorch .pt to .nemo file) with model_optim_rng.pt from nvidia/megatron_bert_345m, using the script at https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
unzip megatron_bert_345m_v0.1_uncased.zip

python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

It seems like torch has some issue with the GPU:

 [NeMo W 2022-11-29 02:35:59 optimizers:77] Could not import distributed_fused_adam optimizer from Apex
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[NeMo W 2022-11-29 02:36:11 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-11-29 02:36:11 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-11-29 02:36:11 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
Traceback (most recent call last):
  File "megatron_lm_ckpt_to_nemo.py", line 476, in <module>
    assert world_size % args.tensor_model_parallel_size == 0
AssertionError
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
   warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9735) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
megatron_lm_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-29_02:36:16
  host      : e792987d9454
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9735)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Inside the Docker container, all GPUs are detected (nvidia-smi).
[screenshot: nvidia-smi output]

Expected behavior

A .nemo LM model should be produced.

Environment overview (please complete the following information)

  • Environment location: nvcr.io/nvidia/nemo:22.08 docker
  • Method of NeMo install: NeMo is pre-installed in the image

sudo docker pull nvcr.io/nvidia/nemo:22.08 && sudo nvidia-docker run -it -v --shm-size=16g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:22.08

Additional context

GPU: Tesla V100
server: nvidia DGX1

bew-pbwt added the bug (Something isn't working) label Nov 29, 2022

yidong72 (Collaborator) commented Dec 2, 2022

Since this model checkpoint uses model parallelism:

--tensor_model_parallel_size 2
--pipeline_model_parallel_size 2

you need to launch 4 processes (--nproc_per_node=4) to run it. Please also make sure you have 4 GPUs available for the conversion.
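
For reference, a minimal sketch of a launch whose world size matches the requested model parallelism (world_size = 2 tensor-parallel x 2 pipeline-parallel = 4), reusing the paths and output name from the report above; it is not verified against this checkpoint:

python -m torch.distributed.launch --nproc_per_node=4 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

The assertion in the traceback (world_size % tensor_model_parallel_size == 0) is exactly what fails when only one process is launched with tensor_model_parallel_size=2.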

bew-pbwt (Author) commented Dec 6, 2022

I already tried that, but all of the following produced the same error as above:

python -m torch.distributed.launch --nproc_per_node=8 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1

python -m torch.distributed.launch --nproc_per_node=2 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1

python -m torch.distributed.launch --nproc_per_node=4 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

python -m torch.distributed.launch --nproc_per_node=8 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/mp_rank_00/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path all-thai-lm.nemo \
    --model_type bert \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2

yidong72 (Collaborator) commented Dec 6, 2022

You need to figure out the model parallel size of your original BERT model, i.e. the correct tensor model parallel and pipeline model parallel sizes, and set them accordingly. Also, checkpoint_folder should exclude mp_rank_00.
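
For reference, a sketch of the layout this implies for the downloaded checkpoint and the matching arguments, assuming the script resolves the mp_rank_XX subdirectory itself:

release/
└── mp_rank_00/
    └── model_optim_rng.pt

--checkpoint_folder release/
--checkpoint_name model_optim_rng.pt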

bew-pbwt (Author) commented Dec 6, 2022

We just downloaded the checkpoint from NVIDIA to verify this issue. As a reference from https://github.com/NVIDIA/Megatron-LM/, it uses megatron_bert_345m. If you download the same checkpoint to your PC, it ends up in the release/mp_rank_00 folder. If I take out mp_rank_00, the script looks in the wrong folder.

areoll commented Dec 6, 2022

Hello @yidong72,

For reference, we are simply using the NVIDIA pre-trained BERT 345M model to track down the issue. Based on your suggestion, we tried all combinations of tensor_model_parallel_size, pipeline_model_parallel_size, and nproc_per_node, but the same error still comes up.

For the BERT 345M checkpoint, we found information indicating it was trained on a single GPU, so our understanding is tensor_model_parallel_size=1, pipeline_model_parallel_size=0, and nproc_per_node=1. Is that correct? Do you have any other comments? Please advise.

Thanks,
Areoll

yidong72 (Collaborator) commented Dec 6, 2022

If it is a single-GPU model, use tensor_model_parallel_size=1, pipeline_model_parallel_size=1, and nproc_per_node=1.
Please check the r1.6.0 branch; the following notebook shows an example of converting the checkpoint:
https://github.com/NVIDIA/NeMo/blob/r1.6.0/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb
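
For reference, a minimal sketch of the single-GPU conversion under these settings; the output file name is illustrative, and the checkpoint_folder follows the earlier note about excluding mp_rank_00:

python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
    --checkpoint_folder release/ \
    --checkpoint_name model_optim_rng.pt \
    --nemo_file_path megatron_bert_345m.nemo \
    --model_type bert \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1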

bew-pbwt (Author) commented Dec 7, 2022

Hi @yidong72,
I already tried that, but branch r1.6.0 has an issue with 'MixedFusedLayerNorm':

 /opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions
 warnings.warn(
[NeMo W 2022-12-07 11:34:01 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-12-07 11:34:01 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-12-07 11:34:02 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
 initializing tensor model parallel with size 1
 initializing pipeline model parallel with size 1
 initializing data parallel with size 1
[NeMo I 2022-12-07 11:34:04 mg:424] loading checkpoint /workspace/data/NeMo/examples/nlp/language_modeling/MegatronBERT.pt
converted 332.59M parameters
[NeMo W 2022-12-07 11:34:05 mg:384] the checkpoint version is 0
[NeMo I 2022-12-07 11:34:05 megatron_init:204] Rank 0 has data parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:207] All data parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:208] Ranks 0 has data parallel rank: 0
[NeMo I 2022-12-07 11:34:05 megatron_init:216] Rank 0 has model parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:217] All model parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:227] Rank 0 has tensor model parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:231] All tensor model parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:232] Rank 0 has tensor model parallel rank: 0
[NeMo I 2022-12-07 11:34:05 megatron_init:246] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:258] Rank 0 has embedding group: [0]
[NeMo I 2022-12-07 11:34:05 megatron_init:264] All pipeline model parallel group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:265] Rank 0 has pipeline model parallel rank 0
[NeMo I 2022-12-07 11:34:05 megatron_init:266] All embedding group ranks: [[0]]
[NeMo I 2022-12-07 11:34:05 megatron_init:267] Rank 0 has embedding rank: 0
[NeMo I 2022-12-07 11:34:05 tokenizer_utils:204] Getting Megatron tokenizer for pretrained model name: megatron-bert-345m-cased, custom vocab file: None, and merges file: None
[NeMo I 2022-12-07 11:34:05 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: bert-large-cased, vocab_file: /root/.cache/torch/megatron/megatron-bert-345m-cased_vocab, merges_files: None, special_tokens_dict: {}, and use_fast: False
Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo I 2022-12-07 11:34:10 megatron_base_model:185] Padded vocab_size: 29056, original vocab_size: 28996, dummy tokens: 60.
Traceback (most recent call last):
  File "megatron_lm_ckpt_to_nemo.py", line 515, in <module>
    convert(local_rank, rank, world_size, args)
  File "megatron_lm_ckpt_to_nemo.py", line 485, in convert
    model = load_model(MegatronBertModel, checkpoint, strict=False, trainer=trainer)
  File "megatron_lm_ckpt_to_nemo.py", line 262, in load_model
    model = ptl_load_state(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 250, in _load_state
    obj = cls(**_cls_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_bert_model.py", line 67, in __init__
    self.model = BertModel(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron/bert_model.py", line 179, in __init__
    self.language_model, self._language_model_key = get_language_model(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/language_model.py", line 93, in get_language_model
    language_model = TransformerLanguageModel(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/language_model.py", line 444, in __init__
    self.encoder = ParallelTransformer(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1744, in __init__
    self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1744, in <listcomp>
    self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1675, in build_layer
    return ParallelTransformerLayer(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1512, in __init__
    super(ParallelTransformerLayer, self).__init__(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1132, in __init__
    self.input_layernorm = get_layer_norm(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/fused_layer_norm.py", line 62, in get_layer_norm
    return MixedFusedLayerNorm(hidden_size, eps, sequence_parallel_enbaled=sequence_parallel)
NameError: name 'MixedFusedLayerNorm' is not defined
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13402) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
megatron_lm_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-12-07_11:34:16
  host      : 39497a957367
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13402)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


yidong72 (Collaborator) commented Dec 7, 2022

You need a Docker image built from the r1.6.0 branch to work with the r1.6.0 branch code:
https://github.com/NVIDIA/NeMo/blob/r1.6.0/Dockerfile
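
For reference, a minimal sketch of building and entering a matching container from that Dockerfile; the image tag and mount path are illustrative:

git clone --branch r1.6.0 https://github.com/NVIDIA/NeMo.git
cd NeMo
docker build -t nemo:r1.6.0 .
docker run --gpus all -it --rm --shm-size=16g -v $(pwd):/workspace/NeMo nemo:r1.6.0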

bew-pbwt (Author) commented

Hi @yidong72,

Thank you for your help. I can now convert it.

github-actions bot commented

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Jan 12, 2023
github-actions bot commented

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (stale) Jan 19, 2023