Poor Benchmark Results (Needs Addressed) #30

Open
MarkSchmidty opened this issue Apr 20, 2023 · 6 comments
Comments

@MarkSchmidty
Contributor

MarkSchmidty commented Apr 20, 2023

As seen in this popular spreadsheet by @lhl, StableLM-Alpha-7B currently scores below 5-year-old 1GB models with 700M parameters, and well below its architectural cousin GPT-J-6B, which was trained on only 300B tokens.

This is a serious issue which needs to be addressed.


Edit:
@abacaj posted these 3B results on Twitter:
[image: 3B benchmark results]

MarkSchmidty changed the title from "Very Poor Benchmarks (Isse…" to "Very Poor Benchmarks (Needs Addressed)" Apr 20, 2023
MarkSchmidty changed the title from "Very Poor Benchmarks (Needs Addressed)" to "📉 Very Poor Benchmarks (Needs Addressed)" Apr 20, 2023
@jon-tow
Collaborator

jon-tow commented Apr 20, 2023

We're well aware of this (I was one of the core devs of lm-eval, so we perform downstream benchmarking the same way 😄). There are a few things we believe are going on here, and hopefully we can pin them down in our upcoming write-up.
For the time being, you should find that rewriting the contexts into a dialog prompt format (e.g. `Question:` -> `User:`) improves scores.
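
For anyone who wants to try this locally, here is a minimal sketch of what that remapping could look like. This is not the team's actual evaluation patch: the `Answer:` -> `Assistant:` mapping and the helper name are assumptions; only the `Question:` -> `User:` substitution comes from the comment above.

```python
# Minimal sketch: remap QA-style prompt prefixes into a dialog format
# before scoring. Only "Question:" -> "User:" is from the comment above;
# the "Answer:" -> "Assistant:" counterpart is an assumption.

PREFIX_MAP = {
    "Question:": "User:",
    "Answer:": "Assistant:",  # assumed counterpart to the "Question:" remap
}


def to_dialog_prompt(context: str) -> str:
    """Rewrite a benchmark context into a dialog-style prompt."""
    for old, new in PREFIX_MAP.items():
        context = context.replace(old, new)
    return context


if __name__ == "__main__":
    # Example: a QA-style context before and after remapping.
    example = "Question: How do you open a stuck jar lid?\nAnswer:"
    print(to_dialog_prompt(example))
    # User: How do you open a stuck jar lid?
    # Assistant:
```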

MarkSchmidty changed the title from "📉 Very Poor Benchmarks (Needs Addressed)" to "Poor Benchmark Results (Needs Addressed)" Apr 20, 2023
@MarkSchmidty
Contributor Author

Okay, I made the issue title less alarming since you've chimed in.

Open communication about the issue and what's being done to address it would be appreciated by many. This thread may be a good place to reach the more technical users and devs who are keeping tabs.

@lhl

lhl commented Apr 20, 2023

I dropped a line to the lm@stability address mentioned in the announcement to ask whether there's anything I'm doing wrong with the benchmarks. I was curious why evals weren't included with the model card even as an alpha release (or at least a note that the low benchmark scores were a known issue), but I'll be following with interest.

As a foundational model, I'm curious what's going on with dialog prompt formatting. I grepped through the tasks and `Question:` is used by the QA tasks, so that would impact piqa, but what about hellaswag (completions) or winogrande (its own format)?

@Ph0rk0z

Ph0rk0z commented Apr 21, 2023

Not gonna lie, I chatted with it and it's pretty bad. The longer context does work, though.

I've never gone OOM on a 7B before.

@MohamedAliRashad

Any updates on this?

@mallorbc

mallorbc commented Apr 25, 2023

@jon-tow Will using that prompt format help for the base model? Or perhaps you're talking about the tuned model?
