
chore: add golden dataset for eval #411

Merged
merged 5 commits into evaluation on Jul 11, 2024
Conversation

@Yuan325 (Collaborator) commented Jun 13, 2024

Adding golden datasets that will be used in LLM system evaluation.

The golden dataset is separated into multiple types of queries:

  • queries that use a specific tool
  • airline-related queries (no tool calling; the answer is within the prompt)
  • assistant-related questions (no tool calling; the answer is within the prompt)
  • out-of-context questions (no tool calling)
  • multi-tool selections (the agent selects multiple tools before returning a final answer to the user)

@Yuan325 Yuan325 requested a review from a team as a code owner June 13, 2024 21:12
@Yuan325 Yuan325 force-pushed the eval-dataset branch 3 times, most recently from d56b25d to 386b762 Compare June 13, 2024 23:20
@jackwotherspoon (Collaborator)

Just putting this here as a friendly suggestion; take it or leave it 😄

Long-term open-source maintainability improves drastically when PRs include at least a one- or two-sentence description (especially feat PRs, but really all types) covering what you are adding and why.

The benefits:

  • Helps your reviewer quickly understand the PR
  • Helps future maintainers follow along with PRs
  • Lets you search for old PRs more easily in the future
  • Encourages external open-source contributions, since external users better understand the features and history

I have recently found this extremely beneficial in the Cloud SQL repos (example).

GitHub also recommends this themselves.

llm_demo/evaluation/eval_golden.py (outdated; resolved)

goldens = [
{
"Search Airport Tool": [
Collaborator

Curious about the structure -- why is this a list of objects, {key: object{}}?

Wouldn't it be better to just use an object at the top level with key -> object?

Collaborator Author

Each evaluation data object consists of a query, tool call, output, etc.

Since I am trying to group them into categories (e.g. we provide 2 test data points for 'Search Airport Tool' so that we can test 2 different queries that trigger that tool; if we want to add more queries in the future, we can just append them to that category's list), each category (a.k.a. key) maps to a list of evaluation data objects.
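To make the shape being discussed concrete, here is a minimal sketch of the category-keyed layout. The field names (`query`, `tool_calls`, `output`) and the entries themselves are illustrative assumptions, not the actual contents of eval_golden.py:

```python
# Hypothetical golden dataset: a list of single-key dicts, where each key is a
# query category mapped to a list of evaluation data objects for that category.
goldens = [
    {
        "Search Airport Tool": [  # two queries that trigger the same tool
            {
                "query": "Where is SFO located?",
                "tool_calls": ["search_airports"],
                "output": "San Francisco International Airport is ...",
            },
            {
                "query": "Find airports in Seattle.",
                "tool_calls": ["search_airports"],
                "output": "Seattle-Tacoma International Airport ...",
            },
        ]
    },
    {
        "Out of Context": [  # no tool calling expected
            {
                "query": "What's the best pizza in town?",
                "tool_calls": [],
                "output": "Sorry, I can only help with flight-related questions.",
            },
        ]
    },
]

# Adding a new test case for an existing category is a single append:
for entry in goldens:
    for category, cases in entry.items():
        print(f"{category}: {len(cases)} case(s)")
```

The alternative the reviewer suggests would flatten this to one dict (`{"Search Airport Tool": [...], "Out of Context": [...]}`), which gives direct key lookup; the list-of-dicts form shown here preserves an explicit ordering of categories.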

@kurtisvg (Collaborator)

+1 to Jack's comment. We could do a better job on PR descriptions.

@Yuan325 Yuan325 changed the title feat: add golden dataset for eval chore: add golden dataset for eval Jun 21, 2024
@Yuan325 Yuan325 requested a review from kurtisvg June 24, 2024 17:24
llm_demo/evaluation/eval_golden.py (outdated; resolved)
@Yuan325 Yuan325 merged commit cf071b8 into evaluation Jul 11, 2024
4 checks passed
@Yuan325 Yuan325 deleted the eval-dataset branch July 11, 2024 20:07
Yuan325 added a commit that referenced this pull request Jul 26, 2024
3 participants