Unable to reproduce SQA results for llava-1.5 #115
Comments
Thank you for reporting the issue. I will try to look into this error later.
I encountered the same problem when reproducing llava-1.6-mistral-7b results on ScienceQA. I found the cause may be the following lines in lmms-eval/lmms_eval/models/llava.py, Lines 361 to 371 in efb5295.
Although the comment there says "The above for loop has bug" when the input has no visuals, the loop above actually runs normally and already adds the question input, so this block appends a duplicate. After removing these lines of code, the scienceqa-full result changes from 36.3 to 76.8.
Hi @GoGoJoestar, I think your fix is correct. We previously used flattened visuals instead of batched visuals in that loop, which caused an error when handling empty visuals. I will remove these lines.
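For readers hitting the same issue, here is a minimal, self-contained sketch of the duplication pattern being discussed. It is an illustration only, not the code at efb5295; `build_prompt` and the variable names are stand-ins:

```python
# Sketch of the duplication bug described above (illustrative only;
# build_prompt and the variable names are stand-ins, not the repo's code).
def build_prompt(context, visual):
    # Stand-in for the real prompt construction in llava.py.
    return f"<image>\n{context}" if visual else context

contexts = ["What is 2 + 2?"]   # a pure-text (no-visual) batch
batched_visuals = [[]]          # one empty visual list per context
flattened_visuals = [v for vs in batched_visuals for v in vs]  # -> []

question_input = []
for visual, context in zip(batched_visuals, contexts):
    # With batched visuals this loop runs even when `visual` is empty,
    # so pure-text questions are already handled here.
    question_input.append(build_prompt(context, visual))

# Leftover workaround for the old flattened-visuals loop: with the loop
# above fixed, it appends every pure-text question a second time.
if len(flattened_visuals) == 0:
    for context in contexts:
        question_input.append(build_prompt(context, None))

print(question_input)  # ['What is 2 + 2?', 'What is 2 + 2?'] -- duplicated
```

Deleting the trailing `if` block, as proposed, leaves each pure-text question appearing exactly once.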
…volvingLMMs-Lab#115)

* Resolve conflict when merging kr_ego with internal_main_dev
* Fix the bug of file overwrite
* Optimize the inference of the videochatgpt dataset
* Resolve conflict
* Delete repeated line
* Reformat the code
* Rename the file name for inference results
* Group the same tasks together for cvrr and videochatgpt
* Group the same tasks together for videochatgpt and cvrr
* Reformat the code
* Fix the bug of videochatgpt_consistency multiprocessing
* Rename the metric from submission to subtask
* Fix the bug of consistency where different answers are generated in pred2
* Add accuracy to the evaluation of cvrr
* Add accuracy metric to the cvrr dataset
* Remove duplicate rows when merging from the main branch
* Refactor videochatgpt_gen and videochatgpt_temporal for correct score parsing
* Enable the webm video loader for llavavid as required by the cvrr dataset
* Refactor process_results function to handle full_docs in the videochatgpt task
* Add tqdm to consistency gpt_eval
* Refactor cvrr for correct aggregate logic
* Change backend to decord for videochatgpt eval
* Fix for mkv video path
* Add perceptiontest dataset test split
* Double-check and optimize the code in egoschema
* Rename metric name of perceptiontest
* Add perceptiontest_validation dataset
* Remove egoschema aggregate function name
* Add tempcompass mc dataset
* Remove redundant files

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
I was attempting to reproduce llava-1.5's results on ScienceQA but could not match the reported numbers.
Command:
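(The exact command was not captured on this page; a typical invocation, following the lmms-eval README, would look like the sketch below. The model path and task name are assumptions, not the reporter's actual arguments.)

```bash
# Sketch of a typical lmms-eval run for this setup (model path and task
# name are assumptions based on the README, not the reporter's command).
accelerate launch --num_processes=8 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks scienceqa_full \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```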
Config:
The results I got:
This is far from what's reported: for example, SQA-IMG is reported as 71.6 in the llava-1.5 paper, and SQA overall is around 70.4 in the Excel sheet.
What could be wrong?