Inconsistent Results on EgoSchema #4
Hi Qingkai,
Thanks for reaching out!
The short answer is that the GPT models got updated. Even models with the
same name (e.g. gpt-3.5-turbo-1106) keep changing.
The GPT models are silently updated, and they are trained to avoid answering
sensitive questions. I believe recent updates make GPT more conservative,
i.e. it refuses to answer questions it is not sure about. That might be
why your "num_valids" is much lower than "num_total".
Our output files (
https://drive.google.com/file/d/1d7a-FuQzdfQ7ZAzU5Y8HJpog1gm_sye_/view?usp=drive_link)
were generated 3 months ago. If you check our standard_qa_1106.json, the
"num_valids" is 500.
Our metric is strict because it counts invalid examples (e.g. refusals) as
wrong. However, it also makes sense to assume random guesses on the invalid
examples; EgoSchema is 5-way multiple choice, so a random guess is correct
20% of the time. Under that assumption, the standard prompt accuracy should
be (0.532 * 500 + 0.2 * (500 - 453)) / 500 = 55.1%, which is already very
close to 55.2%.
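To make the adjustment concrete, here is a small Python sketch (not the repo's evaluation code) that recomputes both numbers from the counts in your standard prompt run:

```python
# Counts reported in your standard_qa_1106.json run
num_total = 500
num_valids = 453
num_corrects = 266

# Strict metric: every invalid (e.g. refused) answer counts as wrong
strict_acc = num_corrects / num_total                            # 0.532

# Adjusted metric: assume a random guess on each invalid example;
# EgoSchema is 5-way multiple choice, so a guess is right 20% of the time
num_invalids = num_total - num_valids
adjusted_acc = (num_corrects + 0.2 * num_invalids) / num_total   # ~0.551

print(f"strict: {strict_acc:.3f}, adjusted: {adjusted_acc:.3f}")
```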
For the (C, Q) —> S prompt, I think the GPT model update is the main
reason. Also, we used "--temperature 1.0" so there might be some
randomness.
In our paper we used gpt-3.5-turbo-0613 because most of our experiments
were finished before gpt-3.5-turbo-1106 was released. That is why the
numbers in our paper differ from those in the repo.
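If you want to make reruns as comparable as possible, one option is to pin an explicit model snapshot and reduce sampling randomness. A minimal sketch, assuming the official openai Python client rather than the repo's own wrapper:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",   # pin an explicit snapshot, not the "gpt-3.5-turbo" alias
    messages=[{"role": "user", "content": "your prompt here"}],
    temperature=0,                # greedy-ish decoding instead of --temperature 1.0
    seed=0,                       # best-effort reproducibility on 1106+ snapshots
)
print(response.choices[0].message.content)
```

Even with a pinned snapshot and a fixed seed, outputs can still drift when the backend is updated (the response's system_fingerprint changes), which is consistent with the numbers shifting over time.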
Hope this email answers your questions!
Best,
Ce
On Tue, Apr 9, 2024 at 8:28 AM, Qingkai Fang wrote:
Hi, thanks for your great work! I tried to reproduce your results on
EgoSchema but found some inconsistencies. Specifically, I tried to reproduce
the results with the standard prompt and the (C, Q) —> S prompt using the
following commands:
Standard prompt:

```
python main.py --model gpt-3.5-turbo-1106 --output_base_path output/egoschema --output_filename standard_qa_1106.json
```

Results:

```
"num_total": 500,
"num_valids": 453,
"num_corrects": 266,
"acc": 0.532,
```
(C, Q) —> S prompt:

```
python main.py --model gpt-3.5-turbo-1106 --task sum --prompt_type sum_q --num_words_in_sum 500 --temperature 1.0 --output_base_path output/egoschema --output_filename sum_q_500_1106.json
python main.py --model gpt-3.5-turbo-1106 --prompt_type qa_sum --data_path output/egoschema/sum_q_500_1106_data.json --output_base_path output/egoschema --output_filename qa_sum_q_500_1106.json
```

Results:

```
"num_total": 500,
"num_valids": 493,
"num_corrects": 278,
"acc": 0.556,
```
However, these results differ from the ones reported in the README:

LaViLa + gpt-3.5-turbo-1106, standard prompt: 55.2
LaViLa + gpt-3.5-turbo-1106, (C, Q) —> S prompt: 58.8
I have not modified any code and used the captions you released. Are there
any possible reasons for the inconsistency? I also noticed that the results
in the README are slightly different from those in the paper. Could you
please tell me the reason behind this? Thank you!
Best regards
Hi Ce, thanks for your quick response! I now understand the reason for the inconsistent results. Thank you again for your great work! Best regards,