
During fine-tuning, loss is 0 when training with fp16 but normal with fp32 #1370

@zhangxiangchn

Description


Model

jina-embeddings-v2-base-zh

Fine-tuning reference

Modified from the scripts under FlagEmbedding/examples/finetune/embedder/encoder_only/

Fine-tuning command:

export WANDB_MODE=disabled
train_data="/xxx/xxx/data/finetune_data_score_v2.jsonl"
num_train_epochs=4
per_device_train_batch_size=256
num_gpus=2
if [ -z "$HF_HUB_CACHE" ]; then
export HF_HUB_CACHE="$HOME/.cache/huggingface/hub"
fi
model_args="
--model_name_or_path /SharedNFS/LLM_model/jina-embeddings-v2-base-zh
--cache_dir $HF_HUB_CACHE
--trust_remote_code True
"
data_args="
--train_data $train_data
--cache_path ~/.cache
--train_group_size 15
--query_max_len 32
--passage_max_len 32
--pad_to_multiple_of 8
--query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: '
--query_instruction_format '{}{}'
--knowledge_distillation True
"
training_args="
--output_dir ./knowledge_distillation_agent_minedHN_score_test_encoder_only_base_jina-embeddings-v2-base-zh
--overwrite_output_dir
--learning_rate 1e-5
--fp16 \ # the only difference between the two runs is whether fp16 is specified here
--num_train_epochs $num_train_epochs
--per_device_train_batch_size $per_device_train_batch_size
--dataloader_drop_last True
--warmup_ratio 0.1
--gradient_checkpointing
--deepspeed ../../ds_stage0.json
--logging_steps 1
--save_steps 1000
--negatives_cross_device
--temperature 0.02
--sentence_pooling_method mean
--normalize_embeddings True
--kd_loss_type kl_div
"
cmd="torchrun --nproc_per_node $num_gpus
-m FlagEmbedding.finetune.embedder.encoder_only.base
$model_args
$data_args
$training_args
"
echo $cmd
eval $cmd
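
A note on precision headroom at this temperature: with --temperature 0.02, cosine similarities in [-1, 1] become logits in [-50, 50], while float16 overflows to inf above roughly 65504. The real training loss goes through a numerically stable softmax, so this alone does not prove where the fp16 run breaks; it is only a minimal standalone sketch of how little dynamic range fp16 leaves here:

import torch

# Cosine similarities lie in [-1, 1]; dividing by temperature 0.02
# stretches them to [-50, 50]. float16 tops out at 65504, so exp() of
# any logit above ~11 overflows to inf, and below ~-17 underflows to 0.
sim = torch.tensor([0.9, 0.5, -0.9], dtype=torch.float16)
logits = sim / 0.02          # roughly [45., 25., -45.]
print(torch.exp(logits))     # tensor([inf, inf, 0.], dtype=torch.float16)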

Log output (fp16):

/data/anaconda3/envs/c-mteb/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
0%| | 1/452 [00:04<32:47, 4.36s/it]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.01}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.02}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.03}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.04}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.04}
1%|▏ | 6/452 [00:19<23:28, 3.16s/it]
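
In the fp16 run, loss, grad_norm, and learning_rate all stay pinned at 0, which is consistent with every backward pass producing non-finite gradients so that the fp16 loss scaler keeps skipping optimizer steps (an inference, not confirmed). A minimal debugging sketch to locate the first module whose fp16 output turns inf/NaN; register_nonfinite_hooks is a hypothetical helper, not part of FlagEmbedding:

import torch

def register_nonfinite_hooks(model):
    # Attach a forward hook to every submodule and report any output
    # tensor containing inf/NaN, pointing at the layer where fp16
    # precision first breaks down.
    handles = []
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    print(f"non-finite output in {name} ({t.dtype})")
        return hook
    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each when done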

Log output (fp32):

/data/anaconda3/envs/c-mteb/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
0%| | 1/452 [00:07<59:15, 7.88s/it]
{'loss': 5.9738, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 5.2496, 'learning_rate': 1.8104259678004022e-06, 'epoch': 0.02}
{'loss': 5.9845, 'learning_rate': 2.8694572692954448e-06, 'epoch': 0.03}
{'loss': 5.684, 'learning_rate': 3.6208519356008044e-06, 'epoch': 0.04}
{'loss': 5.5646, 'learning_rate': 4.203678918349396e-06, 'epoch': 0.04}
1%|▏ | 6/452 [00:42<52:31, 7.07s/it]

One more question: around what loss value can the model be considered more or less converged?
