Poor Benchmark Results (Needs Addressed) #30

Open
MarkSchmidty opened this issue Apr 20, 2023 · 6 comments
Comments

@MarkSchmidty
Contributor

MarkSchmidty commented Apr 20, 2023

As seen in this popular spreadsheet by @lhl, StableLM-Alpha-7B currently scores below 5-year-old 1GB models with 700M parameters, and well below its architectural cousin GPT-J-6B, which was trained on only 300B tokens.

This is a serious issue which needs to be addressed.


Edit:
@abacaj posted these 3B results on Twitter:
[image: 3B benchmark results]

MarkSchmidty changed the title from "Very Poor Benchmarks (Isse…" to "Very Poor Benchmarks (Needs Addressed)" Apr 20, 2023
MarkSchmidty changed the title from "Very Poor Benchmarks (Needs Addressed)" to "📉 Very Poor Benchmarks (Needs Addressed)" Apr 20, 2023
@jon-tow
Collaborator

jon-tow commented Apr 20, 2023

We're well aware of this (I was one of the core devs of lm-eval, so we perform downstream benchmarking the same way 😄). There are a few things we believe are going on here, and hopefully we can pin them down in our upcoming write-up.
For the time being, you should find that rewriting the contexts into a dialog prompt format (e.g. `Question:` -> `User:`) improves scores.
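
For anyone who wants to try this locally, here is a minimal sketch of what that remapping could look like. This is not the team's actual evaluation patch: the `Answer:` -> `Assistant:` mapping and the helper name are assumptions; only the `Question:` -> `User:` substitution comes from the comment above.

```python
# Minimal sketch: remap QA-style prompt prefixes into a dialog format
# before scoring. Only "Question:" -> "User:" is from the comment above;
# the "Answer:" -> "Assistant:" counterpart is an assumption.

PREFIX_MAP = {
    "Question:": "User:",
    "Answer:": "Assistant:",  # assumed counterpart to the "Question:" remap
}


def to_dialog_prompt(context: str) -> str:
    """Rewrite a benchmark context into a dialog-style prompt."""
    for old, new in PREFIX_MAP.items():
        context = context.replace(old, new)
    return context


if __name__ == "__main__":
    # Example: a QA-style context before and after remapping.
    example = "Question: How do you open a stuck jar lid?\nAnswer:"
    print(to_dialog_prompt(example))
    # User: How do you open a stuck jar lid?
    # Assistant:
```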

MarkSchmidty changed the title from "📉 Very Poor Benchmarks (Needs Addressed)" to "Poor Benchmark Results (Needs Addressed)" Apr 20, 2023
@MarkSchmidty
Contributor Author

Okay, I made the issue title less alarming since you've chimed in.

Open communication about the issue and what's being done to address it would be appreciated by many. This thread may be a good place to reach the more technical users and devs who are keeping tabs.

@lhl

lhl commented Apr 20, 2023

I dropped a line to the lm@stability address mentioned in the announcement to ask whether there's anything I'm doing wrong with the benchmarks. I was curious why evals weren't included with the model card even as an alpha release (or at least a note that the low benchmark scores were a known issue), but I'll be following with interest.

As a foundational model, I'm curious what's going on with dialog prompt formatting. I grepped through the tasks and `Question:` is used by the QA tasks, so that would impact piqa, but what about hellaswag (completions) or winogrande (its own format)?

@Ph0rk0z

Ph0rk0z commented Apr 21, 2023

Not gonna lie, I chatted with it and it's pretty bad. The longer context does work, though.

I've never gone OOM on a 7B before.

@MohamedAliRashad

Any updates on this?

@mallorbc

mallorbc commented Apr 25, 2023

@jon-tow Will using that prompt format help for the base model? Or perhaps you're talking about the tuned model?
