chore: add golden dataset for eval #411
Conversation
Force-pushed from d56b25d to 386b762
Just putting this here as a friendly suggestion; take it or leave it 😄 Long-term open-source maintainability can improve drastically when PRs have, at a minimum, a one- or two-sentence description.
I recently found this extremely beneficial in the Cloud SQL repos (example). GitHub also recommends this themselves.
llm_demo/evaluation/eval_golden.py (Outdated)

goldens = [
    {
        "Search Airport Tool": [
Curious about the structure -- why is this a list -> object {key: object{}}?
Wouldn't it be better to just use an object at the top level, with key -> object?
Each evaluation data object consists of a query, tool call, output, etc.
Since I am trying to categorize them (e.g. we provide 2 test entries for 'Search Airport Tool' so that we can test two different queries that trigger that tool, and if we want to add more queries in the future we can just append them to that category's list), we keep a list of evaluation data per category (a.k.a. key).
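For illustration, the two shapes being discussed might look like the sketch below. This is a hedged guess: the "query"/"output" field names and the example values are assumptions, not the actual schema in eval_golden.py.

```python
# Current shape (as in the PR): a list of single-key dicts,
# category name -> list of evaluation cases.
# Field names ("query", "output") are illustrative assumptions.
goldens = [
    {
        "Search Airport Tool": [
            {"query": "Which airport is in Denver?", "output": "DEN"},
            {"query": "What airport is SFO?", "output": "San Francisco International"},
        ],
    },
]

# Reviewer's suggested shape: one top-level dict,
# category name -> list of evaluation cases.
goldens_flat = {
    "Search Airport Tool": [
        {"query": "Which airport is in Denver?", "output": "DEN"},
        {"query": "What airport is SFO?", "output": "San Francisco International"},
    ],
}

# Adding another query for an existing category is an append in either shape:
goldens_flat["Search Airport Tool"].append(
    {"query": "Find the airport in Seattle.", "output": "SEA"}
)
```

The flat dict avoids one level of nesting and gives direct lookup by category, while the list-of-dicts form preserves insertion order of categories on older Python versions; functionally both support "append a new query to a category".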
+1 to Jack's comment. We could do a better job on PR descriptions.
Adding golden datasets that will be used in LLM system evaluation.
The golden dataset is separated into multiple types of queries:
- queries that use a specific tool
- airline-related queries (no tool calling; the answer is within the prompt)
- assistant-related questions (no tool calling; the answer is within the prompt)
- out-of-context questions (no tool calling)
- multi-tool selections (the agent selects multiple tools before returning a final answer to the user)
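Since the dataset is grouped by query type, an evaluation harness would typically flatten it back into (category, case) pairs so results can be reported per category. A minimal sketch, assuming the list-of-single-key-dicts shape; the category names and fields here are illustrative, not copied from the PR:

```python
# Hypothetical golden dataset in the category-keyed shape described above.
goldens = [
    {"Search Airport Tool": [{"query": "Which airport is in Denver?"}]},
    {"Out of Context": [{"query": "What's a good pasta recipe?"}]},
]

def iter_goldens(goldens):
    # Yield (category, case) pairs so the evaluator can run every case
    # and break down results by query type.
    for entry in goldens:
        for category, cases in entry.items():
            for case in cases:
                yield category, case

for category, case in iter_goldens(goldens):
    print(category, "->", case["query"])
```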