Inconsistent Results on EgoSchema #4
Hi Qingkai,
Thanks for reaching out!
The short answer is that the GPT models got updated. Even models with the
same name (e.g. gpt-3.5-turbo-1106) keep changing.
The GPT models are silently updated, and they are trained to avoid answering
sensitive questions. I believe recent updates make GPT more conservative,
i.e. it refuses to answer questions it is not sure about. That might be
why your "num_valids" is much lower than "num_total".
Our output files (
https://drive.google.com/file/d/1d7a-FuQzdfQ7ZAzU5Y8HJpog1gm_sye_/view?usp=drive_link)
were generated 3 months ago. If you check our standard_qa_1106.json, the
"num_valids" is 500.
Our metric is strict because it counts invalid examples (e.g. refusals) as
wrong. However, it also makes sense to assume random guesses on the invalid
examples; EgoSchema is 5-way multiple choice, so a random guess is correct
20% of the time. Under that assumption, the standard prompt accuracy should
be (0.532 * 500 + 0.2 * (500 - 453)) / 500 = 55.1%, which is already very
close to 55.2%.
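To make the adjustment concrete, here is a small Python sketch (not the repo's evaluation code) that recomputes both numbers from the counts in your standard prompt run:

```python
# Counts reported in your standard_qa_1106.json run
num_total = 500
num_valids = 453
num_corrects = 266

# Strict metric: every invalid (e.g. refused) answer counts as wrong
strict_acc = num_corrects / num_total                            # 0.532

# Adjusted metric: assume a random guess on each invalid example;
# EgoSchema is 5-way multiple choice, so a guess is right 20% of the time
num_invalids = num_total - num_valids
adjusted_acc = (num_corrects + 0.2 * num_invalids) / num_total   # ~0.551

print(f"strict: {strict_acc:.3f}, adjusted: {adjusted_acc:.3f}")
```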
For the (C, Q) —> S prompt, I think the GPT model update is the main
reason. Also, we used "--temperature 1.0" so there might be some
randomness.
In our paper we used gpt-3.5-turbo-0613 because most of our experiments
were finished before gpt-3.5-turbo-1106 was released. That is why the
numbers in our paper differ from those in the repo.
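If you want to make reruns as comparable as possible, one option is to pin an explicit model snapshot and reduce sampling randomness. A minimal sketch, assuming the official openai Python client rather than the repo's own wrapper:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",   # pin an explicit snapshot, not the "gpt-3.5-turbo" alias
    messages=[{"role": "user", "content": "your prompt here"}],
    temperature=0,                # greedy-ish decoding instead of --temperature 1.0
    seed=0,                       # best-effort reproducibility on 1106+ snapshots
)
print(response.choices[0].message.content)
```

Even with a pinned snapshot and a fixed seed, outputs can still drift when the backend is updated (the response's system_fingerprint changes), which is consistent with the numbers shifting over time.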
Hope this email answers your questions!
Best,
Ce
On Tue, Apr 9, 2024 at 8:28 AM, Qingkai Fang wrote:
Hi, thanks for your great work! I tried to reproduce your results on
EgoSchema but found some inconsistencies. Specifically, I tried to reproduce
the results with the standard prompt and the (C, Q) —> S prompt using the
following commands:
Standard prompt:

```
python main.py --model gpt-3.5-turbo-1106 --output_base_path output/egoschema --output_filename standard_qa_1106.json
```

Results:

```
"num_total": 500,
"num_valids": 453,
"num_corrects": 266,
"acc": 0.532,
```
(C, Q) —> S prompt:

```
python main.py --model gpt-3.5-turbo-1106 --task sum --prompt_type sum_q --num_words_in_sum 500 --temperature 1.0 --output_base_path output/egoschema --output_filename sum_q_500_1106.json
python main.py --model gpt-3.5-turbo-1106 --prompt_type qa_sum --data_path output/egoschema/sum_q_500_1106_data.json --output_base_path output/egoschema --output_filename qa_sum_q_500_1106.json
```

Results:

```
"num_total": 500,
"num_valids": 493,
"num_corrects": 278,
"acc": 0.556,
```
However, these results differ from the ones reported in the README:

LaViLa + gpt-3.5-turbo-1106, standard prompt: 55.2
LaViLa + gpt-3.5-turbo-1106, (C, Q) —> S prompt: 58.8
I have not modified any code and used the captions you released. Are there
any possible reasons for the inconsistency? I also noticed that the results
in the README are slightly different from those in the paper. Could you
please tell me the reason behind this? Thank you!
Best regards
Hi Ce, thanks for your quick response! I now understand the reason for the inconsistent results. Thank you again for your great work! Best regards,