Reproduction of QA Task Issues #14

Open
gaohan-cmd opened this issue Apr 27, 2024 · 4 comments

Comments

@gaohan-cmd

Hello there! I'm interested in your work, but I'm getting different results when reproducing the paper, so I'd like to consult with you.

  1. In the QA task, is the 74.80 result obtained by running bash eval.generalist.sh evaluated on ll3da-generalist/checkpoint.pth rather than ll3da-generalist/checkpoint_best.pth?
  2. Similarly, is the fine-tuned 76.79 result evaluated with eval.scanqa.sh on ll3da-scanqa-tuned/checkpoint.pth or ll3da-scanqa-tuned/checkpoint_best.pth?
  3. Another question concerns the visual prompts in Table 4 and Table 8. What is the difference between them, and why does Table 4 only reach 74.80 after adding visual prompts? Did the fine-tuning phase also use both the text and visual prompts? The fine-tuned result is only 76.69, which does not reach the 82.91 in Table 8. Which part of the code corresponds to Table 8? So far I have only found the following click-related code in unified_scanqa.py (see the screenshots below).

[Three screenshots showing the click-related code in unified_scanqa.py]

@ch3cook-fdu
Contributor

  1. We always evaluate our method with the checkpoint_best.pth (a quick way to confirm which checkpoint a file corresponds to is sketched after this list).
  2. The evaluations are similar to 1. However, there is a slight difference between our released codebase and the main paper: the reported results are trained with all the 3D-LLM data, regardless of duplicates, whereas we drop the duplicates in the released codebase.
  3. Table 8 shows the effectiveness of "test-time" visual prompts, while the other tables evaluate the model with text-only interactions.
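
As a quick sanity check that checkpoint.pth and checkpoint_best.pth really differ before re-running the evaluation scripts, one can inspect the two files directly. A minimal sketch, assuming they are ordinary PyTorch checkpoint dicts; the ckpts/opt-1.3b/ path prefix and the model/epoch keys are assumptions, so print the keys to see what the codebase actually stores:

```python
import torch

# Load both checkpoints on CPU; no GPU is needed just to inspect them.
# (On newer PyTorch you may need weights_only=False if extra Python objects are stored.)
last = torch.load("ckpts/opt-1.3b/ll3da-generalist/checkpoint.pth", map_location="cpu")
best = torch.load("ckpts/opt-1.3b/ll3da-generalist/checkpoint_best.pth", map_location="cpu")

# Top-level keys show what gets saved (weights, optimizer state, epoch, ...).
print("keys:", sorted(last.keys()), "|", sorted(best.keys()))

# If an 'epoch' field is stored (assumed key name), it tells which training
# step each file was written at.
print("epochs:", last.get("epoch"), best.get("epoch"))

# Compare raw weights to confirm the two files are not identical.
sd_last = last.get("model", last)
sd_best = best.get("model", best)
shared = [k for k in sd_last if k in sd_best and torch.is_tensor(sd_last[k])]
print("identical weights:", all(torch.equal(sd_last[k], sd_best[k]) for k in shared))
```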

@gaohan-cmd
Author


Thank you very much for your response! But I still have some questions:

For answer two, does "The reported results are trained with all the 3D-LLM data" mean that when I run bash scripts/opt-1.3b/train.generalist.sh I only need to use the datasets unified_3dllm_scene_description, unified_3dllm_embodied_dialogue, and unified_3dllm_embodied_planning, and that the rest of the datasets are only used during fine-tuning?
Regarding answer three, how are the "test-time" visual prompts implemented in the code? Are they the operations like click and _encode_box_coords in unified_scanqa.py? How can I easily control whether visual prompts or text prompts are used during testing?

@ch3cook-fdu
Contributor

More comment on:

Q2 - No. "All the 3D-LLM data" refers to using the entire ScanNet portion of 3D-LLM before data cleansing, which might contain duplicated training samples. We have not released this copy of the data.
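
For a concrete sense of what dropping duplicates can look like in practice, here is a hedged sketch (not the authors' unreleased cleansing code) that removes repeated QA entries by content; the field names scene_id / question / answers are assumptions about the annotation schema:

```python
import json

def drop_duplicates(entries):
    """Keep the first occurrence of each (scene, question, answer) triple.

    The field names below are assumptions for illustration; adapt them to the
    actual schema of the 3D-LLM / LL3DA annotation files.
    """
    seen = set()
    unique = []
    for entry in entries:
        key = (
            entry.get("scene_id"),
            entry.get("question"),
            json.dumps(entry.get("answers"), sort_keys=True),
        )
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

# Usage sketch: load an annotation file, dedupe, and report how many entries were dropped.
# with open("annotations.json") as f:
#     data = json.load(f)
# cleaned = drop_duplicates(data)
# print(len(data) - len(cleaned), "duplicates removed")
```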

Q3 - For the quantitative results in row 2 of Table 8, we naively use all the object-id annotations for both training and evaluation, since the original annotations select more objects than are actually related to the question. We have not released that code either. Note that the text instructions are required while the visual prompts are optional; they are only adopted in tasks like ScanQA, 3D dense captioning, and 3D open-vocabulary detection.
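
To make the "text required, visual optional" distinction concrete, below is a minimal sketch of the general idea behind click/box-style prompts at test time: quantize a region into a short coordinate string and attach it to the text instruction only when a flag is set. This is an illustration under assumptions, not the actual _encode_box_coords from unified_scanqa.py (the released code may route visual prompts through the model rather than the text string); the grid size, prompt template, and field layout here are made up for the example.

```python
import numpy as np

GRID = 255  # assumed quantization resolution for coordinates

def encode_box_coords(box, scene_min, scene_max):
    """Quantize a (cx, cy, cz, w, h, l) box into a whitespace-separated token string.

    Mimics the spirit of _encode_box_coords in unified_scanqa.py, but the exact
    normalization and format here are assumptions for illustration.
    """
    box = np.asarray(box, dtype=np.float32)
    scene_min = np.asarray(scene_min, dtype=np.float32)
    scene_max = np.asarray(scene_max, dtype=np.float32)
    extent = np.maximum(scene_max - scene_min, 1e-6)
    center = (box[:3] - scene_min) / extent          # normalize center into [0, 1]
    size = box[3:] / extent                          # normalize size by scene extent
    coords = np.clip(np.concatenate([center, size]), 0.0, 1.0)
    return " ".join(str(int(round(float(c) * GRID))) for c in coords)

def build_prompt(question, click_box=None, scene_min=None, scene_max=None,
                 use_visual_prompt=False):
    """The text instruction is always present; the visual prompt is appended only on demand."""
    prompt = f"### human: {question} ### assistant:"  # hypothetical template
    if use_visual_prompt and click_box is not None:
        region = encode_box_coords(click_box, scene_min, scene_max)
        prompt = f"Region of interest: {region}. " + prompt
    return prompt

# Toggling visual prompts off reduces to a plain text-only query:
q = "What color is the chair next to the table?"
print(build_prompt(q, use_visual_prompt=False))
print(build_prompt(q, click_box=[1.2, 0.5, 0.4, 0.6, 0.9, 0.6],
                   scene_min=np.zeros(3), scene_max=np.array([8.0, 6.0, 3.0]),
                   use_visual_prompt=True))
```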

@gaohan-cmd
Author


OK, thank you for your answer 😊
