
Is there something wrong with 'google/gemma-1.1-2b-it' ? #1854

Closed
rangehow opened this issue May 18, 2024 · 4 comments · Fixed by #1857

Comments

@rangehow

I tested google/gemma-1.1-2b-it on gsm8k with the following command:
CUDA_VISIBLE_DEVICES=3 lm_eval --model vllm --model_args pretrained=gemma-1.1-2b-it,dtype=auto,gpu_memory_utilization=0.8 --tasks gsm8k --batch_size auto
My result is:
[screenshot of gsm8k results]

I think this is a weird result, since gemma-2b reportedly scores close to 18 while google/gemma-1.1-2b-it here gets less than 10...
Any idea? 😢

@rangehow
Author

The problem seems related to gemma and vllm. Using a naive inference method I can get a roughly normal result like this (note that I use 8-shot here):
[screenshot of 8-shot results]
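(For reference, the closest equivalent run inside the harness itself, not necessarily the naive setup above, which may have been a plain generation loop, would be an 8-shot run with the hf backend, roughly:

# 8-shot gsm8k with the HF backend instead of vLLM (sketch; local model path assumed as above)
lm_eval --model hf --model_args pretrained=gemma-1.1-2b-it,dtype=auto --tasks gsm8k --num_fewshot 8 --batch_size auto
)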

@haileyschoelkopf
Collaborator

Hi! Could you rerun with add_bos_token = True? The HF model type adds a BOS token, and notes this in the logs--just added in #1857 a change to the default behavior so that VLLM should match this.

For reasons unclear to me, Gemma performance is dramatically lower when it does not receive a BOS token.
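
For example, adapting the original command in this issue, the flag goes into --model_args (a sketch, keeping your other settings as-is):

CUDA_VISIBLE_DEVICES=3 lm_eval --model vllm \
    --model_args pretrained=gemma-1.1-2b-it,dtype=auto,gpu_memory_utilization=0.8,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto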

@rangehow
Author

> Hi! Could you rerun with add_bos_token = True? The HF model type adds a BOS token, and notes this in the logs--just added in #1857 a change to the default behavior so that VLLM should match this.
>
> For reasons unclear to me, Gemma performance is dramatically lower when it does not receive a BOS token.

Thanks, happy to know about this flag, I will try it later. But I still have some questions about gemma: one is that the -it version is dramatically worse than the base version (10 points lower on gsm8k), the other is that the gemma model gets a lower score on gsm8k-cot than on gsm8k.

However, I have tried a gsm8k script from gemma's official github, which does show considerable benefit from the CoT prompt. I tried to align the hf-eval and deepmind scripts, but in vain. Would you like to help check these two strange questions? Thanks for your brilliant work here.
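
For concreteness, the two scores could be compared in a single run by passing both tasks, something like (assuming the hf backend and the BOS flag suggested above):

lm_eval --model hf \
    --model_args pretrained=google/gemma-1.1-2b-it,add_bos_token=True \
    --tasks gsm8k,gsm8k_cot \
    --batch_size auto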

@rangehow
Author

rangehow commented May 27, 2024

Hi @haileyschoelkopf, I have tried add_bos_token=True like this:

lm_eval --model vllm --model_args pretrained=gemma-2b,add_bos_token=True --tasks gsm8k-cot --batch_size auto

[screenshot of gsm8k-cot results]
This score still does not match the official score of gemma-2b (approximately 18); however, the script from the gemma repository can match it: https://github.com/google-deepmind/gemma/blob/main/colabs/gsm8k_eval.ipynb
I modified it to use vllm and got 19.18119787717968.

Using the lm_eval harness hf model to test gsm8k_cot gives 16.76 (19.26), which is quite similar to my result.
So there might exist other problems with vllm+gemma.
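
One way to isolate whether the remaining gap comes from the vllm backend rather than the prompt would be to run the same task with both backends under otherwise identical settings, roughly:

# hf backend (reference)
lm_eval --model hf --model_args pretrained=gemma-2b,add_bos_token=True --tasks gsm8k_cot --batch_size auto
# vllm backend (suspected discrepancy)
lm_eval --model vllm --model_args pretrained=gemma-2b,add_bos_token=True --tasks gsm8k_cot --batch_size auto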
