
Evaluation results are significantly lower than reported in the paper #9

@Eku127

Description


Hi, thank you for sharing the model weights. I evaluated the model StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln on the val_unseen split (R2R_VLNCE_v1-3 preprocessed) using 4 × L20 GPUs and followed your evaluation pipeline closely.

However, I observed that the results are significantly lower than those reported in your paper. Please see the attached screenshots for comparison:

[Screenshots attached: Result 1, Result 2]

Interestingly, I ran the same evaluation setup on two other baselines, and their results are consistent with the paper. I was wondering whether the released checkpoint for StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln is the same one used to produce the final results reported in the paper?

I have attached my final results and am looking forward to your advice and clarification. Thank you!

result.json
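In case it helps pin down where the numbers diverge, here is a minimal sketch of how I averaged the per-episode metrics into split-level figures. The record layout (field names like `success`, `spl`, `ne`, `oracle_success`) is an assumption for illustration, not the actual schema of the attached result.json:

```python
import json

# Hypothetical per-episode records; the real result.json layout may differ.
episodes = [
    {"episode_id": "1", "success": 1.0, "spl": 0.82, "ne": 1.4, "oracle_success": 1.0},
    {"episode_id": "2", "success": 0.0, "spl": 0.0,  "ne": 6.1, "oracle_success": 0.0},
]

def aggregate(records, keys=("success", "spl", "ne", "oracle_success")):
    """Average each metric over all episodes, as VLN-CE benchmarks report them."""
    n = len(records)
    return {k: sum(r[k] for r in records) / n for k in keys}

summary = aggregate(episodes)
print({k: round(v, 3) for k, v in summary.items()})
# → {'success': 0.5, 'spl': 0.41, 'ne': 3.75, 'oracle_success': 0.5}
```

If the released checkpoint is correct, a discrepancy at this stage would have to come from the per-episode records themselves (e.g. a different episode count or split) rather than the averaging.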
