Hi, thank you for sharing the model weights. I evaluated the model StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln on the val_unseen split (R2R_VLNCE_v1-3 preprocessed) using 4 × L20 GPUs and followed your evaluation pipeline closely.
However, I observed that the results are significantly lower than those reported in your paper. Please see the attached screenshots for comparison:
Interestingly, I ran the same evaluation setup for two other baselines, and their results are consistent with the paper. I was wondering if the released checkpoints for StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln are the same as those used to produce the final results reported in the paper?
I have attached my final results and am looking forward to your advice and clarification. Thank you!
result.json