change the val dataset sampler from sequential to deterministically shuffled #29

yangapku · 2022-12-11T12:59:07Z

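Before this change, finetuning read the validation set with a distributed sequential sampler (the training set uses a shuffled distributed sampler). If the validation set's original annotations pair one text with multiple images (as in MUGE), then with few GPUs (e.g. one or two), several image-text pairs in a single GPU's local batch can easily share the same text. The validation in-batch accuracy we currently compute during training uses the simplest mechanism, treating each sample's own image and text as the ground truth. It cannot correctly handle this one-to-many text-to-image case inside a local batch, so the reported validation in-batch accuracy comes out too low (see issue #28). Training and convergence of the model itself are not affected at all, and the final model's Recall metrics are completely unaffected; only the in-batch accuracy printed during finetuning is too low. We therefore also shuffle the validation set with a fixed random seed to avoid this special case, so that in-batch accuracy correctly reflects the training trend.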
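To make the effect concrete, here is a minimal, standard-library-only sketch. It is not the code from this PR: the toy dataset (200 texts with 5 images each, stored contiguously), the batch size of 8, and the helper names `deterministic_shuffle` and `count_batches_with_duplicate_text` are all assumptions made for illustration. It counts how many local batches contain two samples with the same text under sequential order versus a fixed-seed shuffle.

```python
import random


def deterministic_shuffle(num_samples, seed=0):
    """Return a permutation of range(num_samples) that is identical on every call.

    A private random.Random(seed) is used so the global RNG state is untouched.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return indices


def count_batches_with_duplicate_text(indices, text_ids, batch_size):
    """Count local batches in which at least two samples share the same text."""
    count = 0
    for start in range(0, len(indices), batch_size):
        batch_texts = [text_ids[i] for i in indices[start:start + batch_size]]
        if len(set(batch_texts)) < len(batch_texts):
            count += 1
    return count


# Hypothetical validation set: each text id appears 5 times in a row,
# mimicking annotations where one text is paired with several images.
text_ids = [t for t in range(200) for _ in range(5)]
batch_size = 8

sequential = list(range(len(text_ids)))
shuffled = deterministic_shuffle(len(text_ids), seed=0)

dup_sequential = count_batches_with_duplicate_text(sequential, text_ids, batch_size)
dup_shuffled = count_batches_with_duplicate_text(shuffled, text_ids, batch_size)

print("batches with duplicate text (sequential):", dup_sequential)
print("batches with duplicate text (shuffled):  ", dup_shuffled)
```

In this toy setup every sequential batch contains a repeated text, because any 8 consecutive samples span at most three groups of 5, while the fixed-seed shuffle leaves far fewer such batches and gives the same order on every evaluation. In PyTorch, a similar effect could plausibly be obtained with `torch.utils.data.distributed.DistributedSampler(dataset, shuffle=True, seed=...)` without calling `set_epoch` on it, since that sampler derives its permutation from the seed plus the epoch; whether this PR uses exactly that approach is not shown in this conversation.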

yangapku

Approve

change the val dataset sampler from sequential to determisticly shuffled

1bc9ddf

yangapku commented Dec 11, 2022

yangapku self-assigned this Dec 11, 2022

yangapku merged commit 1924b1b into master Dec 11, 2022

yangapku deleted the dev_modify_val_sampler branch December 11, 2022 13:04

yangapku mentioned this pull request Dec 11, 2022

Unable to reproduce the results (复现不出结果) #28

Closed

yangapku changed the title ~~change the val dataset sampler from sequential to determisticly shuffled~~ change the val dataset sampler from sequential to deterministically shuffled Dec 11, 2022
