
Unable to reproduce SQA results for llava-1.5 #115

Open
clairez-cerebras opened this issue Jun 15, 2024 · 3 comments
Labels
help wanted (Extra attention is needed)

Comments

@clairez-cerebras

I was attempting to reproduce llava-1.5's results on ScienceQA but was not able to match the reported numbers.
Command:

python -m accelerate.commands.launch --num_processes=1 -m lmms_eval --config ./configs/eval_scienceqa_llava1.5.yaml

Config:

- model: llava
  model_args: pretrained=liuhaotian/llava-v1.5-7b,use_flash_attention_2=False,model_name=llava
  tasks: scienceqa_full
  batch_size: 1
  log_samples: true
  log_samples_suffix: llava1.5_sqa
  output_path: "./logs/"

The results I got:

|     Tasks      |Version|Filter|n-shot|  Metric   |Value |   |Stderr|
|----------------|-------|------|-----:|-----------|-----:|---|-----:|
|scienceqa_full  |N/A    |none  |     0|exact_match|0.3699|±  |0.0097|
| - scienceqa    |Yaml   |none  |     0|exact_match|0.3744|±  |0.0074|
| - scienceqa_img|Yaml   |none  |     0|exact_match|0.3604|±  |0.0107|

|    Groups    |Version|Filter|n-shot|  Metric   |Value |   |Stderr|
|--------------|-------|------|-----:|-----------|-----:|---|-----:|
|scienceqa_full|N/A    |none  |     0|exact_match|0.3699|±  |0.0097|

which is far from what is reported: for example, SQA-IMG is reported as 71.6 in the llava-1.5 paper, and SQA overall is reported as around 70.4 in the Excel sheet.
What could be wrong?

Luodian added the help wanted (Extra attention is needed) label Jun 15, 2024
@kcz358 (Contributor) commented Jun 17, 2024

Thank you for reporting the issue. I will try to look into this error later.

@GoGoJoestar

I encountered the same problem when reproducing llava-1.6-mistral-7b results on ScienceQA. I found that the cause may be the following lines in models/llava.py:

# The above for loop has bugs. When there is no visuals, e.g. pure text,
# there will be no for loop execute resulting in an empty question_input (because no visuals)
# Scenario 1 won't even be execute
if len(flattened_visuals) == 0:
    for context in contexts:
        question = context
        conv = conv_templates[self.conv_template].copy()
        conv.append_message(conv.roles[0], question)
        conv.append_message(conv.roles[1], None)
        prompt_question = conv.get_prompt()
        question_input.append(prompt_question)

Although the comment says "The above for loop has bugs" when the input has no visuals, in fact the preceding loop runs normally and already appends a prompt_question to the question_input list, and then these lines append a prompt_question again. As a result, each input without visuals generates two answers, leading to an order mismatch between questions and answers.

After removing these lines, the scienceqa_full result changes from 36.3 to 76.8.
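
For reference, a minimal sketch (not the repository's exact code) of how the prompt building can run as a single loop over the per-request, batched visuals, which makes the extra "if len(flattened_visuals) == 0" block unnecessary. The names batched_visuals, contexts, conv_templates, DEFAULT_IMAGE_TOKEN, and question_input are assumed to match the surrounding code in models/llava.py:

# Sketch only: build exactly one prompt per request, with or without images,
# so text-only requests are handled in the same loop and never duplicated.
question_input = []
for visual, context in zip(batched_visuals, contexts):
    if visual:
        # image request: prepend one image token per image
        image_tokens = " ".join([DEFAULT_IMAGE_TOKEN] * len(visual))
        question = image_tokens + "\n" + context
    else:
        # pure-text request: use the context as-is
        question = context
    conv = conv_templates[self.conv_template].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    question_input.append(conv.get_prompt())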

@kcz358 (Contributor) commented Jun 20, 2024

Hi @GoGoJoestar, I think your fix is correct. We previously used flattened visuals instead of batched visuals in the preceding loop, which causes an error when handling empty visuals. I will remove these lines.
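
To illustrate the difference (variable names here are hypothetical, not taken from the repository): flattening drops the per-request grouping, so a request with zero images never appears in a loop over the flattened list, which is why the text-only case needed special handling at all.

# Hypothetical example: flattening loses the empty (text-only) request.
batched_visuals = [["img_a", "img_b"], [], ["img_c"]]  # one entry per request; the 2nd is text-only
flattened_visuals = [img for vis in batched_visuals for img in vis]  # ["img_a", "img_b", "img_c"]

# Iterating over batched_visuals visits all 3 requests, including the empty one,
# while iterating over flattened_visuals visits 3 images and skips the text-only request.
assert len(batched_visuals) == 3
assert len(flattened_visuals) == 3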

lorenzomammana pushed a commit to lorenzomammana/lmms-eval that referenced this issue Jun 20, 2024
…volvingLMMs-Lab#115)

* Resolve conflict when merge the kr_ego with internal_main_dev

* fix the bug of file overwrite

* Optimize the inference of videochatgpt dataset

* Resolve conflict

* delete repeated line

* reformat the code

* rename the file name for inference results

* group the same task together for cvrr and videochatgpt

* group the same task together for videochatgpt and cvrr

* reformat the code

* fix the bug of videochatgpt_consistency multiprocessing

* Rename the metric from submission to subtask

* fix the bug of consistency where different answers are generated in pred2

* add accuracy into the evaluation of cvrr

* add accuracy metric to cvrr dataset

* remove duplicate rows when merging from main branch

* Refactor videochatgpt_gen and videochatgpt_temporal for correct score parsing

* enable the webm video loader for llavavid as required in cvrr dataset

* Refactor process_results function to handle full_docs in videochatgpt task

* add tqdm to consistency gpt_eval

* Refactor the cvrr for correct aggregate logic

* change backend to decord for videochatgpt eval

* Fix for mkv video path

* add perceptiontest dataset test split

* doublecheck and optimize the code in egoschema

* rename metric name of perceptiontest

* add perceptiontest_validation dataset

* remove egoschema aggregate function name

* add temcompass mc dataset

* remove redundant files

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>