chore: add golden dataset for eval #411
Conversation
Force-pushed from d56b25d to 386b762
Just putting this here as a friendly suggestion; take it or leave it 😄 Long-term open-source maintainability can improve drastically when PRs have, at a minimum, a one- or two-sentence description.
I recently found this extremely beneficial in the Cloud SQL repos (example). GitHub also recommends this themselves.
llm_demo/evaluation/eval_golden.py (Outdated)

goldens = [
    {
        "Search Airport Tool": [
Curious about the structure -- why is this a list -> object {key: object{}}?
Wouldn't it be better to just use an object at the top level, with key -> object?
Each evaluation data object consists of a query, tool call, output, etc.
Since I am trying to categorize them (e.g. we provide 2 test entries for 'Search Airport Tool' so that we can test two different queries that trigger that tool, and if we want to add more queries in the future we can just append them to that category's list), we keep a list of evaluation data per category (a.k.a. key).
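For illustration, the two shapes being discussed might look like the sketch below. This is a hedged guess: the "query"/"output" field names and the example values are assumptions, not the actual schema in eval_golden.py.

```python
# Current shape (as in the PR): a list of single-key dicts,
# category name -> list of evaluation cases.
# Field names ("query", "output") are illustrative assumptions.
goldens = [
    {
        "Search Airport Tool": [
            {"query": "Which airport is in Denver?", "output": "DEN"},
            {"query": "What airport is SFO?", "output": "San Francisco International"},
        ],
    },
]

# Reviewer's suggested shape: one top-level dict,
# category name -> list of evaluation cases.
goldens_flat = {
    "Search Airport Tool": [
        {"query": "Which airport is in Denver?", "output": "DEN"},
        {"query": "What airport is SFO?", "output": "San Francisco International"},
    ],
}

# Adding another query for an existing category is an append in either shape:
goldens_flat["Search Airport Tool"].append(
    {"query": "Find the airport in Seattle.", "output": "SEA"}
)
```

The flat dict avoids one level of nesting and gives direct lookup by category, while the list-of-dicts form preserves insertion order of categories on older Python versions; functionally both support "append a new query to a category".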
+1 to Jack's comment. We could do a better job on PR descriptions.
Adding golden datasets that will be used in LLM system evaluation.
The golden dataset is separated into multiple types of queries:
- queries that use a specific tool
- airline-related queries (no tool calling; the answer is within the prompt)
- assistant-related questions (no tool calling; the answer is within the prompt)
- out-of-context questions (no tool calling)
- multi-tool selections (the agent selects multiple tools before returning a final answer to the user)
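Since the dataset is grouped by query type, an evaluation harness would typically flatten it back into (category, case) pairs so results can be reported per category. A minimal sketch, assuming the list-of-single-key-dicts shape; the category names and fields here are illustrative, not copied from the PR:

```python
# Hypothetical golden dataset in the category-keyed shape described above.
goldens = [
    {"Search Airport Tool": [{"query": "Which airport is in Denver?"}]},
    {"Out of Context": [{"query": "What's a good pasta recipe?"}]},
]

def iter_goldens(goldens):
    # Yield (category, case) pairs so the evaluator can run every case
    # and break down results by query type.
    for entry in goldens:
        for category, cases in entry.items():
            for case in cases:
                yield category, case

for category, case in iter_goldens(goldens):
    print(category, "->", case["query"])
```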