
Is there something wrong with 'google/gemma-1.1-2b-it' ? #1854

Closed
rangehow opened this issue May 18, 2024 · 4 comments · Fixed by #1857

Comments

@rangehow

I tested google/gemma-1.1-2b-it on gsm8k with the following command:
CUDA_VISIBLE_DEVICES=3 lm_eval --model vllm --model_args pretrained=gemma-1.1-2b-it,dtype=auto,gpu_memory_utilization=0.8 --tasks gsm8k --batch_size auto
My result is:
[screenshot of gsm8k results]

I think this is a weird result, since gemma-2b reportedly scores close to 18 while google/gemma-1.1-2b-it here gets less than 10...
Any idea? 😢

@rangehow
Author

The problem seems related to gemma and vllm. Using a naive inference method I can get a roughly normal result like this (note that I use 8-shot here):
[screenshot of 8-shot results]
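(For reference, the closest equivalent run inside the harness itself, not necessarily the naive setup above, which may have been a plain generation loop, would be an 8-shot run with the hf backend, roughly:

# 8-shot gsm8k with the HF backend instead of vLLM (sketch; local model path assumed as above)
lm_eval --model hf --model_args pretrained=gemma-1.1-2b-it,dtype=auto --tasks gsm8k --num_fewshot 8 --batch_size auto
)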

@haileyschoelkopf
Collaborator

Hi! Could you rerun with add_bos_token = True? The HF model type adds a BOS token, and notes this in the logs--just added in #1857 a change to the default behavior so that VLLM should match this.

For reasons unclear to me, Gemma performance is dramatically lower when it does not receive a BOS token.
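
For example, adapting the original command in this issue, the flag goes into --model_args (a sketch, keeping your other settings as-is):

CUDA_VISIBLE_DEVICES=3 lm_eval --model vllm \
    --model_args pretrained=gemma-1.1-2b-it,dtype=auto,gpu_memory_utilization=0.8,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto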

@rangehow
Author

> Hi! Could you rerun with add_bos_token = True? The HF model type adds a BOS token, and notes this in the logs--just added in #1857 a change to the default behavior so that VLLM should match this.
>
> For reasons unclear to me, Gemma performance is dramatically lower when it does not receive a BOS token.

Thanks, happy to know about this flag, I will try it later. But I still have some questions about gemma: one is that the -it version is dramatically worse than the base version (10 points lower on gsm8k), the other is that the gemma model gets a lower score on gsm8k-cot than on gsm8k.

However, I have tried a gsm8k script from gemma's official github, which does show considerable benefit from the CoT prompt. I tried to align the hf-eval and deepmind scripts, but in vain. Would you like to help check these two strange questions? Thanks for your brilliant work here.
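
For concreteness, the two scores could be compared in a single run by passing both tasks, something like (assuming the hf backend and the BOS flag suggested above):

lm_eval --model hf \
    --model_args pretrained=google/gemma-1.1-2b-it,add_bos_token=True \
    --tasks gsm8k,gsm8k_cot \
    --batch_size auto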

@rangehow
Author

rangehow commented May 27, 2024

Hi @haileyschoelkopf, I have tried add_bos_token=True like this:

lm_eval --model vllm --model_args pretrained=gemma-2b,add_bos_token=True --tasks gsm8k-cot --batch_size auto

[screenshot of gsm8k-cot results]
This score still does not match the official score of gemma-2b (approximately 18); however, the script from the gemma repository can match it: https://github.com/google-deepmind/gemma/blob/main/colabs/gsm8k_eval.ipynb
I modified it to use vllm and got 19.18119787717968.

Using the lm_eval harness hf model to test gsm8k_cot gives 16.76 (19.26), which is quite similar to my result.
So there might exist other problems with vllm+gemma.
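
One way to isolate whether the remaining gap comes from the vllm backend rather than the prompt would be to run the same task with both backends under otherwise identical settings, roughly:

# hf backend (reference)
lm_eval --model hf --model_args pretrained=gemma-2b,add_bos_token=True --tasks gsm8k_cot --batch_size auto
# vllm backend (suspected discrepancy)
lm_eval --model vllm --model_args pretrained=gemma-2b,add_bos_token=True --tasks gsm8k_cot --batch_size auto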
