Model Zoo

Note

For all pretraining and finetuning, we adopt sparse/uniform sampling.
#Frame $=$ #input_frame $\times$ #crop $\times$ #clip
#input_frame is the number of frames fed to the model per inference
#crop is the number of spatial crops (e.g., 3 for left/right/center)
#clip is the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices)
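
As a rough illustration of the notation above, the following Python sketch parses a #Frame entry such as 16x3x4 and shows one possible way to draw several sparse/uniform clips with different start indices. The function names `total_frames` and `sparse_clip_indices`, and the exact offset rule, are assumptions made for this example; the sampler actually used in this repository may differ.

```python
def total_frames(spec: str) -> int:
    """Parse a '#input_frame x #crop x #clip' string and return the product."""
    input_frames, crops, clips = (int(part) for part in spec.split("x"))
    return input_frames * crops * clips


def sparse_clip_indices(video_len: int, num_frames: int, num_clips: int):
    """Hypothetical sparse/uniform sampler.

    The video is split into `num_frames` equal segments. Clip k takes one
    frame per segment, shifted by k / num_clips of a segment, so each clip
    covers the whole video but starts at a different index.
    """
    segment = video_len / num_frames
    clips = []
    for k in range(num_clips):
        offset = segment * k / num_clips
        indices = [
            min(int(i * segment + offset), video_len - 1)
            for i in range(num_frames)
        ]
        clips.append(indices)
    return clips


if __name__ == "__main__":
    print(total_frames("16x3x4"))
    for clip in sparse_clip_indices(video_len=64, num_frames=16, num_clips=4):
        print(clip[:4], "...")
```

Under this reading, 16x3x4 corresponds to 12 views per video (3 crops for each of 4 clips), with 16 frames in each view.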

Model	Pretrain-I	Pretrain-V	Finetuned	#Frame	Weight
ViViM-T	deit,IN1K	-	K400	16x3x4	🤗 HF link
ViViM-S	deit,IN1K	-	K400	16x3x4	🤗 HF link

Method	#Frame	Top-1 Acc (%)	Top-5 Acc (%)	Script
ViViM-T (Ours)	16	77.51	93.27	script.sh
ViViM-S (Ours)	16	80.47	94.75	script.sh