
Fine tuning only runs on CPU #29

Open
diabeticpilot opened this issue Mar 11, 2024 · 4 comments
@diabeticpilot

Hello,

I am running this on a few 2x 4090 cloud instances on Vast to test and benchmark. Most machines work without issues; however, on certain machines I have noticed that the GPUs are never used and the fine-tuning runs on the CPU only. Llama 2 70B gets 15-18s/it on most instances; on the ones where the GPUs are not used, it is 800s/it.

nvidia-smi is showing no active processes and 0% on both GPUs. Any idea on how to troubleshoot or fix this issue?

Here is how I am running it and all the settings:

export CUDA_VISIBLE_DEVICES=1,0
python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 \
  --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true \
  --dataset alpaca --reentrant_checkpointing true

Performance:
[42:45<2887:27:12, 803.50s/it]

nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:41:00.0 Off |                  Off |
| 30%   29C    P8              20W / 450W |  10717MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:61:00.0 Off |                  Off |
| 30%   30C    P8              24W / 450W |  11015MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
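A quick, generic sanity check on an affected instance (not specific to this repo) is to confirm that PyTorch can see the GPUs at all before launching train.py:

nvidia-smi -L                       # list the GPUs this instance actually exposes
echo $CUDA_VISIBLE_DEVICES          # check that any exported indices exist in that list
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count(), torch.version.cuda)"

If is_available() comes back False or device_count() is 0, the problem is in the environment/driver setup rather than in the training settings.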

@johnowhitaker
Contributor

I think on some shared machines export CUDA_VISIBLE_DEVICES=1,0 might reference cards other than the ones you're assigned. (Don't quote me on this, but I think I just hit a similar issue.) Removing that and running just the training script in a new shell where CUDA_VISIBLE_DEVICES isn't defined worked in my case.
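A minimal sketch of that suggestion (untested; it just clears the variable and reuses the flags from the first post):

unset CUDA_VISIBLE_DEVICES    # or open a fresh shell where it was never exported
python train.py --model_name meta-llama/Llama-2-70b-hf ...    # same flags as in the first post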

@js-2024

js-2024 commented Apr 15, 2024

I'm having the same issue on Linux Mint with 7x3090. The behavior is almost identical to what diabeticpilot described above, right down to the GPUs loading a little under 12GB of VRAM each, then going dormant while the CPU maxes out. Around 128GB of CPU RAM was allocated.

@zhksh

zhksh commented Apr 23, 2024


Same here: alternating usage of GPU (4x3090) and CPU (24 cores maxed out) while training llama-3-8b, ~45s/it.
No idea what's going on, but the loss is logged right after CPU usage drops and the GPU takes over; it feels like inference is done on the CPU and backprop on the GPU.

@zhksh

zhksh commented Apr 23, 2024

OK, sorry: --use_cpu_offload false helps. I assumed "false" was the default.
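For reference, the launch command from the first post with offload disabled would look like this (a sketch; all other flags assumed unchanged):

python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 \
  --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload false \
  --dataset alpaca --reentrant_checkpointing true

Presumably --use_cpu_offload true is what was pushing work onto the CPU, which would match the maxed-out cores and ~128GB of host RAM reported above.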
