
Code release for models #2

Open
KairosXu opened this issue Dec 13, 2023 · 7 comments

Comments

@KairosXu

Thanks for your nice work!
But when I tried to run the code to train LL3DA, I found that the "models" module was missing. Is that correct? If so, could you tell me when your team will release the full code and the training/evaluation scripts? Hope for your reply soon!

@ch3cook-fdu
Contributor

Thanks for your interest in our work! We will gradually upload the code, weights, and training/evaluation scripts starting in late December. Please stay tuned.

@KairosXu
Author

Sorry to bother you again. Given the excellent performance LL3DA has achieved, we would like to conduct further research based on your nice work. Could you please release your model checkpoints and training/evaluation code as soon as possible?
Thanks, and hope for your reply soon!

@ch3cook-fdu
Contributor

Thank you for your recognition of our work, and sorry for the delay. As we are validating the reproducibility of our code and its extensibility to different large language model backends, it may take a few days. Once the verification is done, we will release everything as soon as possible!

@KairosXu
Author

Sorry to bother you again. Here are some questions about the Interact3D module.

  1. Since the original Q-Former architecture in BLIP-2 requires the input feature dimension to be 1408, does the scene feature produced by the scene encoder keep the same dimension?
  2. I noticed that you add an extra visual prompt compared to 3D-LLM, so I would like to ask how the Interact3D architecture is organized, and how the self-attention handles this additional input in the module.
  3. Does your pipeline also need the text instructions at inference time, or does it only need the 3D features and visual prompts, like BLIP-2? If the former, does the text instruction act as a condition, and how does that work?

Hope for your reply soon!

@ch3cook-fdu
Contributor

Thanks for your interest!

  1. In practice, you can customize the encoder_hidden_size within InstructBlipQFormerConfig for our multimodal transformer. We also adopt an FFN to project the scene features.

InstructBlipQFormerConfig(
    num_hidden_layers=6,
    encoder_hidden_size=self.encoder_hidden_size
)

  2. We pad the visual prompts with 0s and set the attention_mask for self-attention. You can look into https://huggingface.co/docs/transformers/model_doc/instructblip#transformers.InstructBlipQFormerModel for more information on the implementation; a rough sketch is also given after this list.

  3. Yes, we need text instructions for inference; the visual prompts are optional. Text instructions play two roles in our architecture: 1) conditional feature aggregation in the multi-modal transformer, and 2) conditional text generation in the LLM.
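To make the pieces above concrete, here is a minimal sketch (not the authors' code) of how projected scene features, zero-padded visual prompts, and a tokenized text instruction could be fed to an instruction-aware Q-Former using Hugging Face's InstructBlipQFormerConfig / InstructBlipQFormerModel. The feature widths, the two-layer FFN, the number of queries, the prompt padding budget, and all variable names are illustrative assumptions; only the two class names and their documented arguments come from the transformers library.

# Minimal sketch, NOT the authors' implementation: dims, shapes, and names are
# assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import InstructBlipQFormerConfig, InstructBlipQFormerModel

encoder_hidden_size = 768  # assumed width that scene features are projected to
config = InstructBlipQFormerConfig(
    num_hidden_layers=6,
    encoder_hidden_size=encoder_hidden_size,
)
qformer = InstructBlipQFormerModel(config)

batch, scene_dim, n_scene, n_prompt, max_prompt, n_query = 2, 256, 1024, 4, 8, 32

# FFN projecting raw scene-encoder features to encoder_hidden_size
# (the reply only says "an FFN"; this two-layer MLP is an assumption).
scene_proj = nn.Sequential(
    nn.Linear(scene_dim, encoder_hidden_size),
    nn.ReLU(),
    nn.Linear(encoder_hidden_size, encoder_hidden_size),
)
scene_feat = scene_proj(torch.randn(batch, n_scene, scene_dim))

# Visual prompts (e.g. click/box embeddings); unused slots are padded with 0s
# and masked out so attention ignores them.
visual_prompt = torch.zeros(batch, max_prompt, encoder_hidden_size)
visual_prompt[:, :n_prompt] = torch.randn(batch, n_prompt, encoder_hidden_size)
prompt_mask = torch.zeros(batch, max_prompt, dtype=torch.long)
prompt_mask[:, :n_prompt] = 1

# Cross-attention inputs: scene tokens followed by (padded) prompt tokens.
encoder_hidden_states = torch.cat([scene_feat, visual_prompt], dim=1)
encoder_attention_mask = torch.cat(
    [torch.ones(batch, n_scene, dtype=torch.long), prompt_mask], dim=1
)

# Learnable queries plus the tokenized instruction; the instruction conditions the
# feature aggregation here (role 1), and the aggregated queries would later be passed
# to the LLM as a prefix for conditional generation (role 2, not shown).
query_embeds = torch.randn(batch, n_query, config.hidden_size)
instruction_ids = torch.randint(0, config.vocab_size, (batch, 12))  # placeholder token ids
attention_mask = torch.cat(
    [torch.ones(batch, n_query, dtype=torch.long), torch.ones_like(instruction_ids)],
    dim=1,
)

outputs = qformer(
    input_ids=instruction_ids,
    attention_mask=attention_mask,
    query_embeds=query_embeds,
    encoder_hidden_states=encoder_hidden_states,
    encoder_attention_mask=encoder_attention_mask,
)
scene_queries = outputs.last_hidden_state[:, :n_query]  # what an LLM projector would consume

Padding the prompt slots with zeros and masking them out keeps the batch shape fixed whether or not a sample carries a visual prompt, which is why samples without prompts can share the same forward pass.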

@gujiaqivadin

Hello @ch3cook-fdu, thanks for your paper and code! Any news on the update of the main training/testing code?

@ch3cook-fdu
Contributor

Thrilled to announce that our paper is accepted to CVPR 2024! The code is now released!

Please stay tuned for our further updates!
