
Evaluation results are significantly lower than reported in the paper #9

@Eku127

Description


Hi, thank you for sharing the model weights. I evaluated the model StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln on the val_unseen split (R2R_VLNCE_v1-3 preprocessed) using 4 × L20 GPUs and followed your evaluation pipeline closely.

However, I observed that the results are significantly lower than those reported in your paper. Please see the attached screenshots for comparison:

[Screenshots attached: Result 1, Result 2]

Interestingly, I ran the same evaluation setup on two other baselines, and their results are consistent with the paper. I was wondering whether the released checkpoint for StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln is the same one used to produce the final results reported in the paper?

I have attached my final results and am looking forward to your advice and clarification. Thank you!

result.json
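In case it helps pin down where the numbers diverge, here is a minimal sketch of how I averaged the per-episode metrics into split-level figures. The record layout (field names like `success`, `spl`, `ne`, `oracle_success`) is an assumption for illustration, not the actual schema of the attached result.json:

```python
import json

# Hypothetical per-episode records; the real result.json layout may differ.
episodes = [
    {"episode_id": "1", "success": 1.0, "spl": 0.82, "ne": 1.4, "oracle_success": 1.0},
    {"episode_id": "2", "success": 0.0, "spl": 0.0,  "ne": 6.1, "oracle_success": 0.0},
]

def aggregate(records, keys=("success", "spl", "ne", "oracle_success")):
    """Average each metric over all episodes, as VLN-CE benchmarks report them."""
    n = len(records)
    return {k: sum(r[k] for r in records) / n for k in keys}

summary = aggregate(episodes)
print({k: round(v, 3) for k, v in summary.items()})
# → {'success': 0.5, 'spl': 0.41, 'ne': 3.75, 'oracle_success': 0.5}
```

If the released checkpoint is correct, a discrepancy at this stage would have to come from the per-episode records themselves (e.g. a different episode count or split) rather than the averaging.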
