Fewer dataset samples loaded than expected, and the model trained on the reduced set outputs nothing #3800

pangpang-xuan · 2024-05-18T04:59:41Z

Reminder

I have read the README and searched the existing issues.

Reproduction

dataset=Bilingual-code-v10-en
output_dir=outputs/Bilingual-code-v10-en-LLaMA3-8B-5epoch
ds_config=configs/deepspeed/ds_config_zero2.json #zero2deepspeed
model_name_or_path=/home/LLM/LLaMA3-8B
template=custom_template  # placeholder for a user-defined template
date +"%Y-%m-%d %H:%M:%S"
torchrun --nnodes ${NODES} \
    --nproc_per_node ${NUM_GPUS} \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    src/train_bash.py \
    --deepspeed ${ds_config} \
    --stage sft \
    --do_train \
    --finetuning_type lora \
    --lora_target all \
    --lora_rank 64 \
    --model_name_or_path ${model_name_or_path} \
    --template ${template} \
    --dataset ${dataset} \
    --output_dir ${output_dir} \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_strategy epoch \
    --learning_rate 3e-4 \
    --num_train_epochs 5.0 \
    --warmup_ratio 0.1 \
    --plot_loss \
    --fp16 \
    --flash_attn \
    --seed 42 \
    --ddp_timeout 1800000 \
    --dataloader_num_workers 1 \
    --cutoff_len 2048 \
    --quantization_bit 4 >> $OUTPUT_LOG 2>&1

Expected behavior

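The dataset should contain 94822 samples in total, and I checked that the token length of every sample is below the cutoff value, yet only 60646 of them were loaded for training.
The dataset format is shown in the first image, and the number of samples loaded for the actual training run is shown in the second image. The model trained on those 60646 samples produces empty output for every input.
The fine-tuned LoRA weight files are shown in the third image, and the full model weights after merging the LoRA adapter are shown in the fourth image.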

System Info

No response

Others

No response

hiyouga · 2024-05-18T05:45:58Z

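When fewer samples are loaded than the dataset actually contains, it is usually because the dataset includes malformed samples.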
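One way to look for malformed samples before training is a quick structural audit of the raw records. The following is a minimal sketch, not the loader used by the training script: the field names `instruction` and `output` and the helper names `classify` and `audit_samples` are assumptions for illustration and should be adapted to the dataset's own columns.

```python
from collections import Counter

# Hypothetical field names; adjust them to the columns your dataset uses.
REQUIRED_FIELDS = ("instruction", "output")


def classify(sample, required_fields=REQUIRED_FIELDS):
    """Return "ok" or a short label for the first structural problem found."""
    if not isinstance(sample, dict):
        return "not_an_object"
    for field in required_fields:
        if field not in sample:
            return f"missing:{field}"
        if not isinstance(sample[field], str):
            return f"not_a_string:{field}"
    return "ok"


def audit_samples(samples, required_fields=REQUIRED_FIELDS):
    """Count how many samples fall under each label."""
    return Counter(classify(s, required_fields) for s in samples)


# Toy records standing in for the real file (for example, the list
# obtained by parsing the dataset's JSON file).
demo = [
    {"instruction": "Write a loop.", "output": "for i in range(3): pass"},
    {"instruction": "No answer field."},
    {"instruction": "Wrong type.", "output": None},
    "just a string",
]

for label, count in audit_samples(demo).most_common():
    print(label, count)
```

Any label other than `ok` points at records worth inspecting by hand before starting a long training run.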

pangpang-xuan · 2024-05-18T11:01:18Z

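> When fewer samples are loaded than the dataset actually contains, it is usually because the dataset includes malformed samples.

OK, thank you, I understand the cause now: the loader only keeps samples whose prompt and output are both not "".
Some of my samples have "" as their output, which is why the sample count dropped. Thanks.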
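The rule described above (a sample is kept only when both its prompt and its output are not "") can be reproduced offline to predict how many samples will survive loading. This is a sketch of the behavior as described in this thread, not the project's actual code; the key names `prompt` and `output` and the function `split_by_empty_fields` are assumptions for illustration.

```python
def split_by_empty_fields(samples, prompt_key="prompt", output_key="output"):
    """Split samples into (kept, dropped) using the rule described above.

    A sample is kept only if both fields are present and not the empty
    string. The key names are hypothetical defaults.
    """
    kept, dropped = [], []
    for sample in samples:
        prompt = sample.get(prompt_key, "")
        output = sample.get(output_key, "")
        if prompt != "" and output != "":
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


# Toy records standing in for the real dataset.
demo = [
    {"prompt": "Reverse a list.", "output": "xs[::-1]"},
    {"prompt": "Explain recursion.", "output": ""},
    {"prompt": "", "output": "orphan answer"},
]

kept, dropped = split_by_empty_fields(demo)
print(f"kept={len(kept)} dropped={len(dropped)}")
```

With the numbers reported in this issue, 94822 - 60646 = 34176 samples would land in `dropped`. Inspecting that list shows exactly which records need a non-empty output before retraining.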

hiyouga added the "pending" label (This problem is yet to be addressed.) May 18, 2024

pangpang-xuan closed this as completed May 18, 2024

hiyouga added the "solved" label (This problem has been already solved.) and removed the "pending" label (This problem is yet to be addressed.) May 18, 2024
