Problems of using RADIO in LLaVA setting #15
Hello, thank you for your interest in RADIO! Using the weights that we released for RADIO v1, we found that the magnitude of the activations is somewhat larger than usual, with standard deviations in the many tens vs. the usual one. We thus tried adding a LayerNorm at the output of RADIO v1 by setting
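For illustration, a minimal sketch of such a LayerNorm wrapper (assumptions: the encoder returns a `(summary, spatial_features)` pair as in the RADIO README, and `embed_dim` matches the encoder width; this is not necessarily the exact configuration referenced above):

```python
import torch.nn as nn

class NormalizedRadio(nn.Module):
    """Wrap a RADIO encoder so its spatial features are normalized to
    roughly unit standard deviation before reaching the LLaVA projector."""

    def __init__(self, radio_encoder, embed_dim):
        super().__init__()
        self.encoder = radio_encoder
        self.ln = nn.LayerNorm(embed_dim)

    def forward(self, images):
        # RADIO returns a (summary, spatial_features) pair; only the
        # spatial features are fed to the multimodal projector downstream.
        summary, features = self.encoder(images)
        return summary, self.ln(features)
```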
Note that we used the val_all set for GQA; I realize most papers report on the testdev set. Sorry about that! Our training procedure was exactly that of LLaVA 1.5, i.e., we ran pre-training (multimodal alignment) followed by instruction tuning.

We think RADIO is well suited, particularly for tasks that require better spatial understanding. RADIO is flexible about the input image dimensions, which will allow you to try out different resolutions or variations of input pre-processing, such as removing the padding around rectangular images in order to process non-square inputs (see the sketch below).

We also find that some of the common benchmarks are not sensitive enough to the vision encoder: for example, performance on SQA is mainly a function of how good the LLM is. Similarly, the OCR hints in TextVQA make it possible to answer most questions without even looking at the image. Please let us know if the feature normalization helps on your end! Thank you.
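As one example of such pre-processing, an aspect-ratio-preserving resize that snaps both sides to a multiple of the patch size avoids the square padding that CLIP-L-336 requires. This is only a sketch; the patch size of 16 and the target long side of 432 are assumptions to be checked against your checkpoint:

```python
import torchvision.transforms.functional as TF

def resize_to_patch_multiple(img, patch=16, long_side=432):
    """Resize a PIL image so both sides are multiples of `patch`,
    preserving the aspect ratio instead of padding to a square."""
    w, h = img.size
    scale = long_side / max(w, h)
    new_w = max(patch, round(w * scale / patch) * patch)
    new_h = max(patch, round(h * scale / patch) * patch)
    return TF.resize(img, (new_h, new_w))
```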
Thanks for your reply! But we still have a few points of confusion:
Hello, when you set

To be specific, in LLaVA my code for instantiating a RADIO vision encoder looks like:
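A minimal sketch along those lines, assuming the `radio_model` torch.hub entrypoint and the `radio_v1` version tag from the RADIO README (treat both as assumptions and adjust to the checkpoint you are evaluating):

```python
import torch

# Assumption: entrypoint and version string taken from the RADIO README.
vision_tower = torch.hub.load(
    'NVlabs/RADIO', 'radio_model',
    version='radio_v1', progress=True,
)
vision_tower.eval()
```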
If you are using the HuggingFace model, I believe the same can be achieved by simply adding the
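As a hedged sketch of that HuggingFace route (assuming the `nvidia/RADIO` checkpoint and that its forward also returns a `(summary, spatial_features)` pair), a parameter-free LayerNorm can be applied to the returned features:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

model = AutoModel.from_pretrained('nvidia/RADIO', trust_remote_code=True)
pixel_values = torch.randn(1, 3, 432, 432)  # stand-in for a preprocessed image batch
summary, features = model(pixel_values)
# Parameter-free normalization over the channel dimension.
features = F.layer_norm(features, features.shape[-1:])
```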
For the tokens fed into the LLM, I did try variations: (a) clearly performs worse; however, the difference between (b) and (c) is rather small.
First, thanks for your great work!
We're now trying to replace the vision encoder in LLaVA, i.e., CLIP-L-336, with RADIO. Under the default LLaVA 1.5 settings, we pretrain a multimodal projection MLP and then run instruction tuning to finetune a Vicuna-7B-1.5 model with LoRA. The results are shown below; both experiments use the same settings.
| Vision encoder | GQA | SQA | TextVQA | VQA v2 |
| --- | --- | --- | --- | --- |
| CLIP-L-336 | 62.9 | 68.4 | 58.6 | 77.3 |
| RADIO | 59.2 | 68.3 | 51.7 | 74.0 |
Unfortunately, we do not observe an improvement from using RADIO, which differs from the results in the paper. Below are my questions:
Your prompt reply would be greatly appreciated, thanks!