Multi-GPU incremental training: after the dataset is processed, not even tokenization runs #3723

pan-xi · 2024-05-13T10:30:48Z

Image used: nvcr.io/nvidia/pytorch:24.01-py3
Single-GPU training works fine.
Command executed:
bash examples/lora_multi_gpu/ds_zero3.sh (bash examples/lora_multi_gpu/single_node.sh behaves the same way)

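In the YAML file, only the model name was changed.

In the bash script, only the number of processes was changed.

The config file was not modified.

After the dataset is processed, the run hangs and no tokenizer output appears. accelerate and DeepSpeed both behave the same way.
Corresponding nvidia-smi status: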

hiyouga added the pending label (This problem is yet to be addressed.) on May 13, 2024

hiyouga added the wontfix label (This will not be worked on) and removed the pending label (This problem is yet to be addressed.) on May 29, 2024

hiyouga closed this as not planned (won't fix, can't repro, duplicate, stale) on May 29, 2024
