Hello, authors! I have a question about choosing a dataset format and the corresponding weights. I am working on a classification task whose input is multiple images plus a prompt. If the images are treated as video frames, there are two options: SD format (single <image> + single <Users>, where <image> represents all images) and DC mode (single <image> + multiple <Users>). As I understand it, the difference lies in how the prompt is used: DC mode suits cases where each picture has its own detailed prompt, while SD mode suits cases where all pictures share one unified prompt. Is my understanding correct?
In addition, I previously used the Image-MPT7B weights in SD mode, but it seems that the Video-LLaMA7B-DenseCaption weights, in DC or SD mode, are a better fit for video-frame input. Is that also correct?
Yes, that's pretty much correct! I suggest using DC mode with the video-pretrained weights. As you can see in our web demo, the backend model there is Video-LLaMA7B-DC.
Remember to put the multiple images in as frames, along the F dimension of [B, T, F, C, H, W] (inspect vision_x in a debugger to see the actual dimensions during your training).
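To make the frame layout concrete, here is a minimal sketch of stacking several images along the F dimension. It uses NumPy only to show the shapes. The helper name `stack_frames`, the 3x224x224 image size, and the choice of B=1 and T=1 are assumptions for illustration, not something taken from the repository, so check the real shape of vision_x in your own run.

```python
import numpy as np

def stack_frames(images):
    """Stack N images of shape [C, H, W] into one [B=1, T=1, F=N, C, H, W] array.

    Hypothetical helper for illustration; B=1 and T=1 are assumed here.
    """
    if not images:
        raise ValueError("need at least one image")
    frames = np.stack(images, axis=0)  # [F, C, H, W]
    return frames[None, None, ...]     # add B and T axes -> [1, 1, F, C, H, W]

# Four dummy images standing in for the preprocessed inputs (assumed size).
images = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(4)]
vision_x = stack_frames(images)
print(vision_x.shape)  # (1, 1, 4, 3, 224, 224)
```

With this layout the four images sit at F=4, so the model sees them as frames of one video rather than as four separate in-context images.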
I would also suggest trying both templates: