change the val dataset sampler from sequential to deterministically shuffled #29
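Before this change, finetuning read the validation set with a distributed sequential sampler (the training set uses a shuffled distributed sampler). If the original annotations of the validation set pair one text with multiple images (as in MUGE), then with few GPUs (e.g. one or two) the local batch on a single GPU can easily contain several image-text pairs that share the same text. The validation in-batch accuracy we compute during training uses the simplest mechanism, treating each sample's own image-text pair as the ground truth. It cannot correctly handle this one-to-many text-to-image case inside a local batch, so the reported validation in-batch accuracy comes out too low (see issue #28). Model training and convergence are not affected in any way, and the final Recall metrics of the model are not affected either; only the in-batch accuracy printed during finetuning is too low.

We therefore also shuffle the validation set with a fixed random seed to avoid this special case, so that in-batch accuracy correctly reflects the training trend.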
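The effect described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: the helper names `shard_indices` and `count_text_collisions`, the toy dataset, and the batch size are all hypothetical, and the sketch assumes that samples sharing a text are stored contiguously in the validation set.

```python
import random


def shard_indices(num_samples, num_replicas, rank, shuffle, seed=0):
    """Return the dataset indices one rank would read (hypothetical helper)."""
    indices = list(range(num_samples))
    if shuffle:
        # Fixed seed: every rank and every run sees the same permutation.
        random.Random(seed).shuffle(indices)
    # Strided split so the ranks read disjoint subsets.
    return indices[rank::num_replicas]


def count_text_collisions(text_ids, indices, batch_size):
    """Count samples whose text also appears elsewhere in the same local batch."""
    collisions = 0
    for start in range(0, len(indices), batch_size):
        batch = [text_ids[i] for i in indices[start:start + batch_size]]
        collisions += sum(1 for t in batch if batch.count(t) > 1)
    return collisions


# Toy validation set (assumption): 200 texts, each paired with 5 images,
# stored contiguously, loosely mimicking a one-text-to-many-images annotation.
text_ids = [t for t in range(200) for _ in range(5)]

sequential = shard_indices(len(text_ids), num_replicas=1, rank=0, shuffle=False)
shuffled = shard_indices(len(text_ids), num_replicas=1, rank=0, shuffle=True, seed=0)

seq_collisions = count_text_collisions(text_ids, sequential, batch_size=10)
shuf_collisions = count_text_collisions(text_ids, shuffled, batch_size=10)
print("sequential:", seq_collisions, "shuffled:", shuf_collisions)
```

With sequential reading on a single rank, every local batch in this toy setup is made up entirely of samples that share a text with a neighbour, which is the situation where a self-as-ground-truth in-batch accuracy is penalized. Shuffling with a fixed seed spreads those samples across batches while keeping the validation order reproducible between runs.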