
MultiMedQA #1198

Merged: 5 commits merged into EleutherAI:main on Jan 11, 2024

Conversation

tmabraham (Contributor)

This PR implements the MultiMedQA suite of tasks:

  1. Adds the MedQA 4-options task
  2. Adds the MedMCQA task
  3. Adds a MultiMedQA group that includes the two tasks above, plus PubMedQA and a selection of medical MMLU tasks (a sketch of the group config follows below)

Note that MultiMedQA technically also includes long-form answer tasks (LiveQA, MedicationQA, HealthSearchQA). However, those tasks are graded by expert human evaluation, so they are omitted here. Other papers likewise focus only on the multiple-choice QA tasks when evaluating on MultiMedQA.
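The merged `_multimedqa.yaml` itself isn't shown in this thread; as a rough sketch, a group config in lm-evaluation-harness's YAML format looks something like the following (task list inferred from the results tables below; the actual file may differ):

```yaml
# Rough sketch of a MultiMedQA group config (illustrative, not the
# verbatim _multimedqa.yaml from this PR).
group: multimedqa
task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - mmlu_anatomy
  - mmlu_clinical_knowledge
  - mmlu_college_biology
  - mmlu_college_medicine
  - mmlu_medical_genetics
  - mmlu_professional_medicine
```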

Benchmark results for Llama-2-7b:

|         Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------|-------|------|-----:|--------|-----:|---|-----:|
|stem                    |N/A    |none  |     0|acc     |0.3803|±  |0.0944|
|                        |       |none  |     0|acc_norm|0.3382|±  |0.0001|
| - medmcqa              |Yaml   |none  |     0|acc     |0.3438|±  |0.0073|
|                        |       |none  |     0|acc_norm|0.3438|±  |0.0073|
| - medqa_4options       |Yaml   |none  |     0|acc     |0.3197|±  |0.0131|
|                        |       |none  |     0|acc_norm|0.3197|±  |0.0131|
| - anatomy              |Yaml   |none  |     0|acc     |0.4296|±  |0.0428|
| - clinical_knowledge   |Yaml   |none  |     0|acc     |0.4377|±  |0.0305|
| - college_biology      |Yaml   |none  |     0|acc     |0.4444|±  |0.0416|
| - college_medicine     |Yaml   |none  |     0|acc     |0.4277|±  |0.0377|
| - medical_genetics     |Yaml   |none  |     0|acc     |0.4700|±  |0.0502|
| - professional_medicine|Yaml   |none  |     0|acc     |0.4338|±  |0.0301|
| - pubmedqa             |Yaml   |none  |     0|acc     |0.7140|±  |0.0202|

Using vLLM, we can easily run the benchmarks on larger models like Llama-2-70b (an example invocation is sketched after the table):

|         Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------|-------|------|-----:|--------|-----:|---|-----:|
|stem                    |N/A    |none  |     0|acc     |0.5493|±  |0.0966|
|                        |       |none  |     0|acc_norm|0.4993|±  |0.0008|
| - medmcqa              |Yaml   |none  |     0|acc     |0.4808|±  |0.0077|
|                        |       |none  |     0|acc_norm|0.4808|±  |0.0077|
| - medqa_4options       |Yaml   |none  |     0|acc     |0.5601|±  |0.0139|
|                        |       |none  |     0|acc_norm|0.5601|±  |0.0139|
| - anatomy              |Yaml   |none  |     0|acc     |0.5556|±  |0.0429|
| - clinical_knowledge   |Yaml   |none  |     0|acc     |0.7019|±  |0.0282|
| - college_biology      |Yaml   |none  |     0|acc     |0.7847|±  |0.0344|
| - college_medicine     |Yaml   |none  |     0|acc     |0.6879|±  |0.0353|
| - medical_genetics     |Yaml   |none  |     0|acc     |0.7200|±  |0.0451|
| - professional_medicine|Yaml   |none  |     0|acc     |0.7684|±  |0.0256|
| - pubmedqa             |Yaml   |none  |     0|acc     |0.7440|±  |0.0195|

(It would be nice if the group were reported as multimedqa instead of stem, but I can't figure out how to do that. Not a huge problem, just an aesthetic issue.)
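For reference, an invocation along these lines should run the suite with the vLLM backend (a sketch assuming the v0.4-style `lm_eval` CLI and that the group is registered as `multimedqa`; `tensor_parallel_size` depends on your GPU count):

```bash
# Sketch: zero-shot MultiMedQA on Llama-2-70b via the vLLM backend.
# tensor_parallel_size=8 is an assumption for an 8-GPU node; adjust as needed.
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-2-70b-hf,tensor_parallel_size=8,dtype=auto \
    --tasks multimedqa \
    --num_fewshot 0 \
    --batch_size auto
```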

Work done in collaboration with @jbdel and @katielink at @MedARC-AI.

haileyschoelkopf self-assigned this on Dec 22, 2023
lintangsutawika (Contributor)

The `stem` label seems to be a bug: it's the display alias for the `mmlu_stem` sub-group. I'll make a fix. Functionally, though, it doesn't change the evals, and the results JSON should reflect the task names correctly.
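For context, the MMLU subtask configs carry a display alias for their sub-group, roughly like this (a representative excerpt, not a verbatim file; the `stem` alias is what leaks into the multimedqa table):

```yaml
# Representative excerpt from an MMLU subtask config (illustrative).
# The subtask belongs to the mmlu_stem sub-group, whose display alias
# "stem" is what shows up as the group row in the results table.
group: mmlu_stem
group_alias: stem
task: mmlu_college_biology
task_alias: college biology
```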

lintangsutawika (Contributor) left a review comment:

It would also be great to have a README.md for the tasks.

Review thread on lm_eval/tasks/multimedqa/_multimedqa.yaml (outdated; resolved)
tmabraham (Contributor, Author)

@lintangsutawika I didn't know the tasks had their own READMEs... is there some sort of template or example for that?

lintangsutawika (Contributor)

haileyschoelkopf removed their assignment on Dec 23, 2023
tmabraham (Contributor, Author)

Sorry for the delay; let me know if this README looks fine.

lintangsutawika (Contributor)

Looks great!

lintangsutawika merged commit 818c056 into EleutherAI:main on Jan 11, 2024
7 of 8 checks passed
wx-zhang pushed a commit to wx-zhang/lm-evaluation-harness that referenced this pull request on Jan 18, 2024:
* multimedqa

* Update medqa.yaml

* move to benchmarks folder

* add README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>