MultiMedQA #1198
Conversation
Would also be great to have the README.md for the tasks.
@lintangsutawika I didn't know the tasks have their own README... is there some sort of template or example of that?
Yup, you can check here https://github.com/EleutherAI/lm-evaluation-harness/blob/main/templates/new_yaml_task/README.md
Sorry for the delay, let me know if this README looks fine
Looks great!
* multimedqa
* Update medqa.yaml
* move to benchmarks folder
* add README.md

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
This PR implements the MultiMedQA suite of tasks:

* MedQA (USMLE)
* MedMCQA
* PubMedQA
* MMLU clinical topics (anatomy, clinical_knowledge, college_biology, college_medicine, medical_genetics, professional_medicine)
Note that MultiMedQA technically also includes long-form answer tasks (LiveQA, MedicationQA, HealthSearchQA). However, those tasks are scored by human expert evaluation rather than automatic metrics, and are therefore omitted here. Other papers likewise focus only on the multiple-choice QA tasks when evaluating on MultiMedQA.
Benchmark results for Llama-2-7b:
Using vLLM, we can easily run the benchmarks on larger models, like Llama-2-70b:
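For reference, here is a minimal sketch of what such a run could look like through the harness's Python API with the vLLM backend. The model path, `tensor_parallel_size`, and `multimedqa` group name are illustrative assumptions, not the exact invocation behind the numbers above:

```python
# Minimal sketch, assuming lm-eval's v0.4-style Python API and the vLLM backend.
# The model path and tensor_parallel_size are illustrative; adjust to your hardware.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=meta-llama/Llama-2-70b-hf,"
        "tensor_parallel_size=4,"  # assumed 4-GPU node
        "dtype=auto"
    ),
    tasks=["multimedqa"],  # the group name registered by this PR
)

# results["results"] maps each task name to its metrics (e.g. acc, acc_norm).
for task, metrics in results["results"].items():
    print(task, metrics)
```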
(It would be nice if instead of saying `stem` it said `multimedqa`, but I can't seem to figure that out... not a huge problem though, just an aesthetic issue.)

Work done in collaboration with @jbdel and @katielink at @MedARC-AI.