
Unable to reproduce SQA results for llava-1.5 #115

Open
clairez-cerebras opened this issue Jun 15, 2024 · 3 comments
Labels
help wanted (Extra attention is needed)

Comments

@clairez-cerebras

I was attempting to reproduce llava-1.5's results on ScienceQA but was not able to match the reported numbers.
Command:

python -m accelerate.commands.launch --num_processes=1 -m lmms_eval --config ./configs/eval_scienceqa_llava1.5.yaml

Config:

- model: llava
  model_args: pretrained=liuhaotian/llava-v1.5-7b,use_flash_attention_2=False,model_name=llava
  tasks: scienceqa_full
  batch_size: 1
  log_samples: true
  log_samples_suffix: llava1.5_sqa
  output_path: "./logs/"

The results I got:

|     Tasks      |Version|Filter|n-shot|  Metric   |Value |   |Stderr|
|----------------|-------|------|-----:|-----------|-----:|---|-----:|
|scienceqa_full  |N/A    |none  |     0|exact_match|0.3699|±  |0.0097|
| - scienceqa    |Yaml   |none  |     0|exact_match|0.3744|±  |0.0074|
| - scienceqa_img|Yaml   |none  |     0|exact_match|0.3604|±  |0.0107|

|    Groups    |Version|Filter|n-shot|  Metric   |Value |   |Stderr|
|--------------|-------|------|-----:|-----------|-----:|---|-----:|
|scienceqa_full|N/A    |none  |     0|exact_match|0.3699|±  |0.0097|

which is far from what is reported: for example, SQA-IMG is reported as 71.6 in the llava-1.5 paper, and SQA overall is reported as around 70.4 in the Excel sheet.
What could be wrong?

Luodian added the help wanted (Extra attention is needed) label Jun 15, 2024
@kcz358 (Contributor) commented Jun 17, 2024

Thank you for reporting the issue. I will try to look into this error later.

@GoGoJoestar

I encountered the same problem when reproducing llava-1.6-mistral-7b results on ScienceQA. I found that the cause may be the following lines in models/llava.py:

# The above for loop has bugs. When there is no visuals, e.g. pure text,
# there will be no for loop execute resulting in an empty question_input (because no visuals)
# Scenario 1 won't even be execute
if len(flattened_visuals) == 0:
    for context in contexts:
        question = context
        conv = conv_templates[self.conv_template].copy()
        conv.append_message(conv.roles[0], question)
        conv.append_message(conv.roles[1], None)
        prompt_question = conv.get_prompt()
        question_input.append(prompt_question)

Although the comment says "The above for loop has bugs" when the input has no visuals, in fact the preceding loop runs normally and already appends a prompt_question to the question_input list, and then these lines append a prompt_question again. As a result, each input without visuals generates two answers, leading to an order mismatch between questions and answers.

After removing these lines, the scienceqa_full result changes from 36.3 to 76.8.
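
For reference, a minimal sketch (not the repository's exact code) of how the prompt building can run as a single loop over the per-request, batched visuals, which makes the extra "if len(flattened_visuals) == 0" block unnecessary. The names batched_visuals, contexts, conv_templates, DEFAULT_IMAGE_TOKEN, and question_input are assumed to match the surrounding code in models/llava.py:

# Sketch only: build exactly one prompt per request, with or without images,
# so text-only requests are handled in the same loop and never duplicated.
question_input = []
for visual, context in zip(batched_visuals, contexts):
    if visual:
        # image request: prepend one image token per image
        image_tokens = " ".join([DEFAULT_IMAGE_TOKEN] * len(visual))
        question = image_tokens + "\n" + context
    else:
        # pure-text request: use the context as-is
        question = context
    conv = conv_templates[self.conv_template].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    question_input.append(conv.get_prompt())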

@kcz358 (Contributor) commented Jun 20, 2024

Hi @GoGoJoestar, I think your fix is correct. We previously used flattened visuals instead of batched visuals in the preceding loop, which causes an error when handling empty visuals. I will remove these lines.
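
To illustrate the difference (variable names here are hypothetical, not taken from the repository): flattening drops the per-request grouping, so a request with zero images never appears in a loop over the flattened list, which is why the text-only case needed special handling at all.

# Hypothetical example: flattening loses the empty (text-only) request.
batched_visuals = [["img_a", "img_b"], [], ["img_c"]]  # one entry per request; the 2nd is text-only
flattened_visuals = [img for vis in batched_visuals for img in vis]  # ["img_a", "img_b", "img_c"]

# Iterating over batched_visuals visits all 3 requests, including the empty one,
# while iterating over flattened_visuals visits 3 images and skips the text-only request.
assert len(batched_visuals) == 3
assert len(flattened_visuals) == 3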

lorenzomammana pushed a commit to lorenzomammana/lmms-eval that referenced this issue Jun 20, 2024
…volvingLMMs-Lab#115)

* Resolve conflict when merge the kr_ego with internal_main_dev

* fix the bug of file overwrite

* Optimize the inference of videochatgpt dataset

* Resolve conflict

* delete repeated line

* reformat the code

* rename the file name for inference results

* group the same task together for cvrr and videochatgpt

* group the same task together for videochatgpt and cvrr

* reformat the code

* fix the bug of videochatgpt_consistency multiprocessing

* Rename the metric from submission to subtask

* fix the bug of consistency where different answers are generated in pred2

* add accuracy into the evaluation of cvrr

* add accuracy metric to cvrr dataset

* remove duplicate rows when merging from main branch

* Refactor videochatgpt_gen and videochatgpt_temporal for correct score parsing

* enable the webm video loader for llavavid as required in cvrr dataset

* Refactor process_results function to handle full_docs in videochatgpt task

* add tqdm to consistency gpt_eval

* Refactor the cvrr for correct aggregate logic

* change backend to decord for videochatgpt eval

* Fix for mkv video path

* add perceptiontest dataset test split

* doublecheck and optimize the code in egoschema

* rename metric name of perceptiontest

* add perceptiontest_validation dataset

* remove egoschema aggregate function name

* add temcompass mc dataset

* remove redundant files

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>