
Recreate results found in table 1 #4

Open · theblackcat102 opened this issue May 2, 2024 · 5 comments

@theblackcat102 commented May 2, 2024
Hi, I wanted to check: is running launchw.sh the command that recreates the numbers in table 1? I'm trying to rerun REMEMBERER with gpt-3.5-instruct-0913, because davinci-003 is no longer accessible on the OpenAI platform.

But the results I got are quite low, with only a 0.07 success rate:

[2024-05-02 12:45:31,856 INFO webshop/186-MainProcess] END! TaskIdx: 99, TaskId: 99, #Steps: 4(0), Reward: 0.50, Succeds: False
[2024-05-02 12:45:31,856 INFO webshop/189-MainProcess] ──────────8.44──────────0.254──────────0.070──────────
[2024-05-02 12:45:31,857 INFO webshop/497-MainProcess] ━━━━━━━━━━━━━━━━━━━Epoch 0━━━━━━━━━━━━━━━━━━━━
[2024-05-02 12:45:31,857 INFO webshop/498-MainProcess] Size: 4, Avg AD Size: 1

I was wondering if there are any params I didn't set right in launchw.sh?

This was the command found in launchw.sh:

python webshop.py --log-dir logs \
                  --observation-mode text_rich \
                  --load-replay history-pools/init_pool.wq.yaml \
                  --load-replay history-pools/init_pool.wq.yaml \
                  --save-replay history-pools/init_pool.wqu."$date_str".%d.a.yaml \
                  --save-replay history-pools/init_pool.wqu."$date_str".%d.b.yaml \
                  --item-capacity 500 \
                  --action-capacity 20 \
                  --matcher pgpat+insrel \
                  --prompt-template prompts/ \
                  --max-tokens 200 \
                  --stop "Discouraged" \
                  --request-timeout 10. \
                  --starts-from 0 \
                  --epochs 3 \
                  --trainseta 0 \
                  --trainsetb 10 \
                  --testseta 0 \
                  --testsetb 100
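
(Note: the command references "$date_str", which must be defined earlier in launchw.sh; a minimal sketch, assuming a date-based tag — the exact format is an assumption — would be:

date_str=$(date +%Y-%m-%d)  # assumed format; launchw.sh may define this differently

and the %d in the --save-replay paths is presumably a placeholder expanded by webshop.py itself.)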

@zdy023 (Collaborator) commented May 2, 2024

Hello, thanks for your question. Our recent results on another task set also show a performance decrease for gpt-3.5-instruct compared to text-davinci-003 on decision-making tasks; this may be attributable to variation in the base capabilities of the GPT models. That said, it is strange that your history memory size is only 4 after an epoch of training on 10 tasks. Could you please double-check your training process? I don't see any unusual arguments in your launch command.

@theblackcat102 (Author)

@zdy023 It seems I didn't run with the --train argument; now I get a much larger history memory size. However, with the new arguments the success rate is actually lower (0.022 compared to 0.070). Is this normal on your side?
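
(For anyone else reproducing this, a minimal sketch of the corrected launch; it assumes --train is a plain boolean flag on webshop.py — the flag name is taken from this thread, not verified against the script:

# identical to the launchw.sh command above, plus the previously missing flag
python webshop.py --train ...  # '...' stands for the full argument list shown earlier
)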

@zdy023 (Collaborator) commented May 6, 2024

Hello, I don't think this is a normal result. I haven't yet run experiments on WebShop with gpt-instruct myself. I will follow your setting and try to reproduce the results in the coming weeks when I have time.

@zdy023 (Collaborator) commented May 10, 2024

@theblackcat102 Hello, just to be sure: are you using the model gpt-3.5-turbo-instruct? I don't see a model named gpt-3.5-instruct-0913 in OpenAI's online documentation.

@zdy023 (Collaborator) commented May 24, 2024

Hello, we ran experiments with gpt-3.5-turbo-instruct and obtained an average score of 0.54 and a success rate of 0.22. This is about half the performance of text-davinci-003, which is consistent with our observation on the WikiHow task set.

We plan to test more recent models in the following weeks. Once the results are ready, we will update them in the repository.
