[FEAT] Add gpqa diamond#17

Merged
nhhoang96 merged 13 commits into main from scratch/gpqa
Apr 18, 2026

@shruthan
Collaborator

📌 Description

Adds GPQA Diamond Audio.

Of the 198 GPQA Diamond samples, the 155 that are speakable have been converted to speech for evaluation. The audio is available at ServiceNow-AI/gpqa_audio.

On the parallel text run, GPT 4o mini scores 39 (reported 40.8 at https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/).

On audio:
GPT 4o mini: 28.9 ± 0.86 (5 runs)
Voxtral Small: 27.1
Phi 4 Multimodal Instruct: 22.58
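
For readers who want to sanity-check the numbers above, here is a minimal sketch (not code from this PR) of how an accuracy percentage and a mean ± spread over repeated runs could be computed. The function names are hypothetical, the per-run values in the usage note are made up for illustration, and the use of the sample standard deviation is an assumption; the PR does not say which spread measure was used.

```python
import statistics


def accuracy_pct(num_correct: int, num_samples: int) -> float:
    """Accuracy as a percentage of correctly answered samples."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return 100.0 * num_correct / num_samples


def nearest_count(score_pct: float, num_samples: int) -> int:
    """Whole number of correct answers closest to a reported percentage."""
    return round(score_pct * num_samples / 100.0)


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over repeated runs.

    Assumption: sample (n - 1) standard deviation; a single run has spread 0.
    """
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread
```

With 155 samples, the reported Phi 4 Multimodal Instruct score of 22.58 is consistent with `nearest_count(22.58, 155)` giving 35 correct answers, and 27.1 for Voxtral Small with 42 correct answers, although the PR does not state the raw counts. Calling `summarize_runs` on illustrative values such as `[28.0, 29.0, 30.0]` returns a mean of 29.0 and a spread of 1.0.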

🛠️ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality including new tasks)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactor / Code cleanup
  • Maintenance / Chore / Task
  • Other (please describe):

✅ How Has This Been Tested?

  • Unit tests
  • Integration tests
  • Manual testing

Test Results / Screenshots (if applicable):

📸 Screenshots / Demos

📋 Checklist

  • Code follows project style guidelines
  • Tests have been added/updated (if applicable)
  • Documentation has been updated (if applicable)
  • Linked relevant issue(s)
  • Self-reviewed my code

🙌 Additional Notes

Collaborator

@akshaykalkunte akshaykalkunte left a comment

LGTM

@shruthan
Collaborator Author

After further filtering of samples for audio quality, the dataset now has 147 samples.
Scores are largely similar, except for Voxtral Small, which now scores 29.93.

oluwanifemibamgbose and others added 11 commits September 24, 2025 20:29
updating turn handling for multi-turn evals
#22)

* added phonetics, speech_disorder, and speech_enhancement tasks - still in need of full model scoring. Fixed small inconsistency bug in config by changing judge_properties to judge_settings.

* Update the correct HF path for noise_detection task

* updated scores

---------

Co-authored-by: hoang <huuhoang.nguyen@servicenow.com>
@nhhoang96 nhhoang96 self-requested a review April 18, 2026 15:48
Collaborator

@nhhoang96 nhhoang96 left a comment

LGTM

Resolving documentation conflicts before merging to main

@nhhoang96 nhhoang96 merged commit 5011dd8 into main Apr 18, 2026
@nhhoang96 nhhoang96 deleted the scratch/gpqa branch April 18, 2026 15:53

7 participants