Megatron Finetuning hangs with more than 1 A100 GPU #8655

dkajtoch · 2024-03-14T09:36:56Z

dkajtoch
Mar 14, 2024

I am trying to run an example that looks similar to this one: https://github.com/NVIDIA/NeMo/blob/48b8204d57e59c8790aaa6eaa20384b046b1a574/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py

I am using the Docker container nvcr.io/nvidia/nemo:24.01.framework with torchrun command. My initial model is a converted Mistral 7B to Nemo format. Execution looks like this:

torchrun --standalone --nproc-per-node 2 .../megatron_gpt_finetuning.py model.tensor_model_parallel_size=2 model.pipeline_model_parallel_size=1

When I use L4 GPU everything is ok (but end up with Cuda OOM) also H100 works. On the other hand, when I switch to A100 80GB initialization hangs before checkpoint loading. Below is the screenshot for L4. For A100 it hangs before (I never see a blue message) "Loading distributed checkpoint ...". Any ideas how to fix it?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Megatron Finetuning hangs with more than 1 A100 GPU #8655

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Megatron Finetuning hangs with more than 1 A100 GPU #8655

dkajtoch Mar 14, 2024

Replies: 0 comments

dkajtoch
Mar 14, 2024