Convert megatron lm ckpt to nemo #5517
Comments
Since this model checkpoint uses model parallelism, you need to launch 4 processes.
I already tried that, but every attempt produced the same error as above.
You need to figure out the model parallel size of your original BERT model, i.e. the correct tensor model parallel and pipeline model parallel sizes, and set them properly. Also, the …
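One way to work out those sizes is to inspect the checkpoint's directory layout: Megatron-LM typically saves one sub-directory per model-parallel partition (`mp_rank_00`, `mp_rank_01`, …, or `mp_rank_XX_YYY` when pipeline parallelism is also used). A minimal sketch of that idea (the helper name and the exact layout assumption are mine, not from the conversion script):

```python
import re
from pathlib import Path

def infer_parallel_sizes(ckpt_dir):
    """Guess (tensor_parallel, pipeline_parallel) sizes from a
    Megatron-LM checkpoint directory by counting mp_rank_* folders.
    Assumes the conventional mp_rank_XX / mp_rank_XX_YYY naming."""
    tp_ranks, pp_ranks = set(), set()
    for sub in Path(ckpt_dir).iterdir():
        m = re.fullmatch(r"mp_rank_(\d{2})(?:_(\d{3}))?", sub.name)
        if m:
            tp_ranks.add(int(m.group(1)))
            if m.group(2) is not None:
                pp_ranks.add(int(m.group(2)))
    # No mp_rank_* folders at all would mean a non-parallel checkpoint.
    return max(len(tp_ranks), 1), max(len(pp_ranks), 1)
```

For the single-file 345M checkpoint there is only one partition, so this would report (1, 1).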
We just downloaded the checkpoint from NVIDIA to verify this issue and tried again.
Hello @yidong72, For reference, we simply used the NVIDIA pre-trained BERT 345M model and tried to find out the issue. Following your suggestion, we tried every combination of tensor_model_parallel_size, pipeline_model_parallel_size, and nproc_per_node, but the same error still comes up. For the BERT 345M checkpoint, we found information indicating it was trained on a single GPU, so presumably tensor_model_parallel_size=1, pipeline_model_parallel_size=0, and nproc_per_node=1. Is that correct? Do you have any other comments? Please advise. Thanks,
If it is a single-GPU model, …
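As a rule of thumb, the number of processes launched must equal tensor_model_parallel_size × pipeline_model_parallel_size, so a single-GPU checkpoint converts with one process. A hedged sketch of assembling the launch command (the flag names follow NeMo's megatron_ckpt_to_nemo.py as I recall them; the paths are placeholders, so verify everything against `--help` on your branch):

```python
def nproc_for(tp_size: int, pp_size: int) -> int:
    # One launched process per model-parallel partition.
    return tp_size * pp_size

def conversion_cmd(tp_size: int = 1, pp_size: int = 1) -> list:
    # Flag names and paths below are assumptions for illustration,
    # not verified against a specific NeMo release.
    return [
        "torchrun", f"--nproc_per_node={nproc_for(tp_size, pp_size)}",
        "examples/nlp/language_modeling/megatron_ckpt_to_nemo.py",
        "--checkpoint_folder", "/path/to/checkpoints",
        "--checkpoint_name", "model_optim_rng.pt",
        "--nemo_file_path", "bert_345m.nemo",
        "--model_type", "bert",
        "--tensor_model_parallel_size", str(tp_size),
        "--pipeline_model_parallel_size", str(pp_size),
    ]

print(" ".join(conversion_cmd()))
```

With tp_size=1 and pp_size=1 this launches a single process, matching the single-GPU 345M checkpoint.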
Hi @yidong72
You need to have the 1.6 branch Docker image to work with the 1.6 branch code.
Hi @yidong72 Thank you for your help; now I can convert it.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug
I tried converting a Megatron-LM checkpoint to NeMo (a PyTorch .pt file to a .nemo file) using model_optim_rng.pt from nvidia/megatron_bert_345m and the https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling/megatron_ckpt_to_nemo.py script.
It seems like torch has some issue with the GPU, even though inside the Docker container nvidia-smi can detect all GPUs.
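To narrow down whether PyTorch itself can see the GPUs inside the container (nvidia-smi talks to the driver directly, so it can succeed even when torch cannot), a quick diagnostic sketch:

```python
import importlib.util

def cuda_report() -> str:
    """Report what PyTorch sees; degrades gracefully when torch
    is absent (e.g. when run outside the container)."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check runs anywhere
    return (f"torch {torch.__version__}: "
            f"cuda_available={torch.cuda.is_available()}, "
            f"device_count={torch.cuda.device_count()}")

print(cuda_report())
```

If cuda_available comes back False while nvidia-smi works, the mismatch is usually a CPU-only torch build or a missing `--gpus all` / nvidia-container-toolkit setting on the container.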
Expected behavior
Should get nemo LM model
Environment overview (please complete the following information)
Additional context
GPU: Tesla V100
server: nvidia DGX1