Add llm as judge mt-bench dataset and metrics #791

Merged · 164 commits · May 20, 2024
Changes from all commits
Commits (164)
e035f36
add mt_bench_single_turn_gpt4_judge dataset
OfirArviv May 2, 2024
aa884cc
added typings to model_response_assessment task field
OfirArviv May 2, 2024
7621fbb
fixed output_format in mt_bench template
OfirArviv May 2, 2024
9f642ec
fixed output_format in mt_bench template
OfirArviv May 2, 2024
e88c3af
add llama3 format
OfirArviv May 2, 2024
ffd946b
temporary changes to the inference engines
OfirArviv May 2, 2024
bad7244
add llama3_bam_mt_bench_prompt llm-as-judge metric
OfirArviv May 2, 2024
4633961
add assert to openai model recipe
OfirArviv May 4, 2024
44f6de0
update genai and openai inference apis
OfirArviv May 5, 2024
3d71c1e
add model_response_assessment_chat task
OfirArviv May 5, 2024
e9a1376
add ChatTemplate
OfirArviv May 5, 2024
4bb70b4
add model_response_assessment.json
OfirArviv May 5, 2024
88508bf
fix model_response_assessment.json
OfirArviv May 5, 2024
747efb3
add template and task of chat llm as judge
OfirArviv May 5, 2024
859144c
mt bench templates
OfirArviv May 5, 2024
1488ec1
mt bench templates
OfirArviv May 5, 2024
4ebdc40
model assessment tasks
OfirArviv May 5, 2024
0ff6f13
add InterleaveListsToDialogOperator operator
OfirArviv May 5, 2024
51fda0c
update dialog template
OfirArviv May 5, 2024
6b67b59
update mt bench template
OfirArviv May 5, 2024
00ab418
update mt bench template update
OfirArviv May 5, 2024
d70db84
update chat template
OfirArviv May 5, 2024
932f70a
add mt bench datasets
OfirArviv May 5, 2024
2e6b91e
small fixes
OfirArviv May 5, 2024
08cee34
update metrics
OfirArviv May 5, 2024
cd04870
update metrics
OfirArviv May 5, 2024
d4990b0
Merge branch 'main' into users/ofir/add_llm_as_judge_dataset_and_metrics
OfirArviv May 5, 2024
8233c04
delete old files
OfirArviv May 5, 2024
5622625
Merge remote-tracking branch 'origin/users/ofir/add_llm_as_judge_data…
OfirArviv May 5, 2024
3f214d8
update test requirements file
OfirArviv May 5, 2024
6287678
Merge branch 'main' into users/ofir/add_llm_as_judge_dataset_and_metrics
OfirArviv May 5, 2024
5c6fee9
update test requirements file
OfirArviv May 5, 2024
518d016
update llama3 metric with correct format
OfirArviv May 5, 2024
e4f92f8
add model assessment tasks with reference
OfirArviv May 5, 2024
c9948de
update tasks
OfirArviv May 5, 2024
9a04e32
clear catalog
OfirArviv May 5, 2024
ea46120
add tasks
OfirArviv May 5, 2024
3f12332
update task
OfirArviv May 5, 2024
d4ef4f6
update templates
OfirArviv May 6, 2024
99773b8
update
OfirArviv May 6, 2024
9b868d3
update
OfirArviv May 6, 2024
1d4a937
update
OfirArviv May 6, 2024
cf815ee
add mt bench pairwise processor
OfirArviv May 7, 2024
049a322
remove old file
OfirArviv May 7, 2024
7cebc9e
update
OfirArviv May 7, 2024
6ef225c
add model assessment pairwise comparison tasks
OfirArviv May 7, 2024
eea313a
add pairwise templates
OfirArviv May 7, 2024
b3bcf76
fix pairwise templates
OfirArviv May 7, 2024
2e9c0d6
fix mt bench pairwise processor
OfirArviv May 7, 2024
ecfeb1d
fix template
OfirArviv May 7, 2024
85c7c4b
add mt-bench pairwise dataset
OfirArviv May 7, 2024
0b8af6a
llm as judge metric cards
OfirArviv May 7, 2024
b6879c6
add llama3 metrics
OfirArviv May 7, 2024
5507ea8
Merge branch 'main' into users/ofir/add_llm_as_judge_dataset_and_metrics
OfirArviv May 7, 2024
507c4ec
update
OfirArviv May 8, 2024
0475505
update
OfirArviv May 9, 2024
f210ba1
update prepare test python version
OfirArviv May 12, 2024
785e2d3
clean catalog
OfirArviv May 12, 2024
dc1d901
update templates
OfirArviv May 12, 2024
2785535
update tasks
OfirArviv May 12, 2024
ef0beb2
update tasks
OfirArviv May 12, 2024
cff58a5
update templates
OfirArviv May 12, 2024
a40b64a
update cards
OfirArviv May 12, 2024
3ffbfac
update cards
OfirArviv May 12, 2024
77be8a2
update templates
OfirArviv May 12, 2024
34612ad
add cards
OfirArviv May 12, 2024
b4e59d2
add cards for llm as judge metric
OfirArviv May 12, 2024
2e487a4
add cards for llm as judge metric
OfirArviv May 12, 2024
767575d
add metrics
OfirArviv May 12, 2024
9aeaf20
merge
OfirArviv May 12, 2024
62b09f7
merge
OfirArviv May 12, 2024
f6078a1
add mt bench generation datasets
OfirArviv May 12, 2024
40de15e
fix
OfirArviv May 12, 2024
7a67fd3
fix
OfirArviv May 12, 2024
83fd39e
fix
OfirArviv May 12, 2024
b2485b7
fix
OfirArviv May 12, 2024
e783614
update python to 3.9 for catalog testing
OfirArviv May 13, 2024
a647b98
remove old catalog items
OfirArviv May 13, 2024
a4364d1
update llm as a judge
OfirArviv May 13, 2024
b00d153
update readme
OfirArviv May 13, 2024
ab72f42
update tests
OfirArviv May 15, 2024
f284a2f
update dynamic cards for llm as judge
OfirArviv May 15, 2024
0a7e182
update llm as judge metric
OfirArviv May 15, 2024
c89de53
update tests
OfirArviv May 15, 2024
0e32579
add the ability to strip_system_prompt_and_format_from_inputs
OfirArviv May 15, 2024
95bbd9b
update tests
OfirArviv May 15, 2024
5a81128
update
OfirArviv May 15, 2024
9b7712a
update
OfirArviv May 15, 2024
58494fe
update
OfirArviv May 15, 2024
c4181f3
update
OfirArviv May 16, 2024
6279905
update
OfirArviv May 16, 2024
345cf7f
update
OfirArviv May 16, 2024
b68f3fa
update
OfirArviv May 16, 2024
ef0b44d
update
OfirArviv May 16, 2024
76fe80e
update
OfirArviv May 16, 2024
133f493
update
OfirArviv May 16, 2024
d2ff421
update
OfirArviv May 16, 2024
0902187
update
OfirArviv May 16, 2024
679a988
add phi3 format
OfirArviv May 16, 2024
a16fbac
update readme
OfirArviv May 16, 2024
eaeb84f
update readme
OfirArviv May 16, 2024
23bf027
update readme
OfirArviv May 16, 2024
9def31e
update readme
OfirArviv May 16, 2024
83391ea
update readme
OfirArviv May 16, 2024
231dc9a
update readme
OfirArviv May 16, 2024
b4716be
update readme
OfirArviv May 16, 2024
f76dc3b
update readme
OfirArviv May 16, 2024
d1475e0
update readme
OfirArviv May 16, 2024
a9345b6
update readme
OfirArviv May 16, 2024
ea230cb
update readme
OfirArviv May 16, 2024
7c37b36
update readme
OfirArviv May 16, 2024
60dd6a8
update readme
OfirArviv May 16, 2024
483e0e2
update readme
OfirArviv May 16, 2024
3094377
update readme
OfirArviv May 16, 2024
f7a146c
update readme
OfirArviv May 16, 2024
8538e74
update readme
OfirArviv May 16, 2024
4eb8322
update readme
OfirArviv May 16, 2024
ec46bb8
update readme
OfirArviv May 16, 2024
704554a
update readme
OfirArviv May 16, 2024
31ac967
merge
OfirArviv May 16, 2024
a177505
update cards with LiteralEval
OfirArviv May 19, 2024
c40166d
update cards with LiteralEval
OfirArviv May 19, 2024
fa6b440
make llm judge dynamic fields
OfirArviv May 19, 2024
948a343
add json
OfirArviv May 19, 2024
1dd773b
update
OfirArviv May 19, 2024
f15ce3d
update metric
OfirArviv May 19, 2024
e74acd4
update
OfirArviv May 19, 2024
6f5396b
fix
OfirArviv May 19, 2024
1dc2623
update readme
OfirArviv May 19, 2024
f664126
update
OfirArviv May 19, 2024
deb5642
update
OfirArviv May 19, 2024
73c010b
update
OfirArviv May 19, 2024
4e1095d
update
OfirArviv May 19, 2024
8ba40e9
update
OfirArviv May 19, 2024
9a0cc4d
update
OfirArviv May 19, 2024
273fe22
update
OfirArviv May 19, 2024
0475982
update
OfirArviv May 19, 2024
338c70d
update
OfirArviv May 19, 2024
69b3068
update
OfirArviv May 19, 2024
804d605
update
OfirArviv May 19, 2024
53b6316
update
OfirArviv May 19, 2024
be35899
update
OfirArviv May 19, 2024
2a56811
update
OfirArviv May 19, 2024
c155c17
update
OfirArviv May 19, 2024
4b01950
Update llm_as_judge.rst
yoavkatz May 19, 2024
f38da96
update
OfirArviv May 19, 2024
3e1a9d2
Merge remote-tracking branch 'origin/users/ofir/add_llm_as_judge_data…
OfirArviv May 19, 2024
3506d92
Merge branch 'main' into users/ofir/add_llm_as_judge_dataset_and_metrics
OfirArviv May 19, 2024
5b848f3
update
OfirArviv May 19, 2024
e7905fa
Merge remote-tracking branch 'origin/users/ofir/add_llm_as_judge_data…
OfirArviv May 19, 2024
a0b1c75
update
OfirArviv May 19, 2024
0890aa1
update
OfirArviv May 19, 2024
7e271ce
update
OfirArviv May 19, 2024
f30ebf4
Update llm_as_judge.rst (#847)
yoavkatz May 19, 2024
bae0934
update
OfirArviv May 19, 2024
635adbe
update
OfirArviv May 19, 2024
a04e319
update
OfirArviv May 19, 2024
23a6997
update
OfirArviv May 19, 2024
5a50544
update
OfirArviv May 19, 2024
4c6aa39
update
OfirArviv May 19, 2024
a14bf0a
update
OfirArviv May 19, 2024
eea4288
update
OfirArviv May 19, 2024
09360f4
small fix
OfirArviv May 20, 2024
109d501
small fix
OfirArviv May 20, 2024
4 changes: 3 additions & 1 deletion .github/workflows/catalog_consistency.yml
@@ -12,13 +12,15 @@ jobs:
     runs-on: ubuntu-latest
     env:
       OS: ubuntu-latest
+      GENAI_KEY: "dummy"
+      UNITXT_ALLOW_PASSING_DATA_TO_REMOTE_API: "True"
 
     steps:
       - uses: actions/checkout@v4
 
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.8'
+          python-version: '3.9'
           cache: 'pip' # caching pip dependencies
       - run: pip install -r requirements/base.rqr
       - run: pip install -r requirements/tests.rqr
4 changes: 3 additions & 1 deletion .github/workflows/catalog_preparation.yml
@@ -12,13 +12,15 @@ jobs:
     runs-on: ubuntu-latest
     env:
       OS: ubuntu-latest
+      GENAI_KEY: "dummy"
+      UNITXT_ALLOW_PASSING_DATA_TO_REMOTE_API: "True"
 
     steps:
       - uses: actions/checkout@v4
 
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.8'
+          python-version: '3.9'
           cache: 'pip' # caching pip dependencies
       - run: pip install -r requirements/base.rqr
       - run: pip install -r requirements/tests.rqr
2 changes: 1 addition & 1 deletion .github/workflows/library_tests.yml
@@ -18,7 +18,7 @@ jobs:
 
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.8'
+          python-version: '3.9'
           cache: 'pip' # caching pip dependencies
       - run: pip install -r requirements/base.rqr
       - run: pip install -r requirements/tests.rqr
379 changes: 309 additions & 70 deletions docs/docs/llm_as_judge.rst

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions prepare/cards/dynamic_cards_for_llm_judges/llm_as_judge_metrics.py
@@ -0,0 +1,15 @@
from unitxt.blocks import TaskCard
from unitxt.catalog import add_to_catalog

tasks = [
    "tasks.response_assessment.rating.single_turn",
    "tasks.response_assessment.rating.single_turn_with_reference",
]
for task in tasks:
    card = TaskCard(loader=None, preprocess_steps=[], task=task)
    sub_task = ".".join(task.split(".")[-2:])
    add_to_catalog(
        card,
        f"cards.dynamic_cards_for_llm_judges.{sub_task}",
        overwrite=True,
    )
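The loop's name mangling is easy to miss: sub_task keeps only the last two segments of the task name, so this script registers exactly two catalog entries. A minimal trace of that derivation (plain Python, no unitxt needed):

task = "tasks.response_assessment.rating.single_turn"
sub_task = ".".join(task.split(".")[-2:])
print(sub_task)
# rating.single_turn
print(f"cards.dynamic_cards_for_llm_judges.{sub_task}")
# cards.dynamic_cards_for_llm_judges.rating.single_turn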
42 changes: 42 additions & 0 deletions prepare/cards/mt_bench/generation/english_single_turn.py
@@ -0,0 +1,42 @@
from unitxt.blocks import (
    TaskCard,
)
from unitxt.catalog import add_to_catalog
from unitxt.loaders import LoadHF
from unitxt.operators import (
    AddFields,
    CopyFields,
    RenameFields,
)
from unitxt.splitters import RenameSplits
from unitxt.test_utils.card import test_card

card = TaskCard(
    loader=LoadHF(path="dim/mt_bench_en", split="train"),
    preprocess_steps=[
        RenameSplits({"train": "test"}),
        CopyFields(field_to_field={"turns/0": "turns"}),
        RenameFields(
            field_to_field={
                "turns": "input",
                "category": "group",
            }
        ),
        AddFields(
            fields={
                "output": "None",
                "type_of_input": "question",
                "type_of_output": "answer",
            }
        ),
    ],
    task="tasks.generation",
    templates=["templates.empty"],
)

test_card(card, demos_taken_from="test", strict=False)
add_to_catalog(
    card,
    "cards.mt_bench.generation.english_single_turn",
    overwrite=True,
)
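To make the preprocessing concrete, here is a plain-Python sketch of what these steps do to one record. The raw instance is illustrative, inferred from the field names the operators reference, not a verified row of dim/mt_bench_en:

# Illustrative sketch only, not the operators' actual implementation.
raw = {"turns": ["What is 2+2?", "Now square the result."], "category": "math"}

instance = dict(raw)
instance["turns"] = instance["turns"][0]      # CopyFields: "turns/0" -> "turns"
instance["input"] = instance.pop("turns")     # RenameFields: turns -> input
instance["group"] = instance.pop("category")  # RenameFields: category -> group
instance.update(                              # AddFields: constant fields
    {"output": "None", "type_of_input": "question", "type_of_output": "answer"}
)
print(instance)
# {'input': 'What is 2+2?', 'group': 'math', 'output': 'None',
#  'type_of_input': 'question', 'type_of_output': 'answer'}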
42 changes: 42 additions & 0 deletions prepare/cards/mt_bench/generation/japanese_single_turn.py
@@ -0,0 +1,42 @@
from unitxt.blocks import (
    TaskCard,
)
from unitxt.catalog import add_to_catalog
from unitxt.loaders import LoadHF
from unitxt.operators import (
    AddFields,
    CopyFields,
    RenameFields,
)
from unitxt.splitters import RenameSplits
from unitxt.test_utils.card import test_card

card = TaskCard(
    loader=LoadHF(path="shi3z/MTbenchJapanese", split="train"),
    preprocess_steps=[
        RenameSplits({"train": "test"}),
        CopyFields(field_to_field={"turns/0": "turns"}),
        RenameFields(
            field_to_field={
                "turns": "input",
                "category": "group",
            }
        ),
        AddFields(
            fields={
                "output": "None",
                "type_of_input": "question",
                "type_of_output": "answer",
            }
        ),
    ],
    task="tasks.generation",
    templates=["templates.empty"],
)

test_card(card, demos_taken_from="test", strict=False)
add_to_catalog(
    card,
    "cards.mt_bench.generation.japanese_single_turn",
    overwrite=True,
)
@@ -0,0 +1,62 @@
from unitxt.blocks import (
    TaskCard,
)
from unitxt.catalog import add_to_catalog
from unitxt.loaders import LoadHF
from unitxt.operators import (
    FilterByCondition,
    InterleaveListsToDialogOperator,
    MapInstanceValues,
    RenameFields,
)
from unitxt.processors import LiteralEval
from unitxt.splitters import RenameSplits
from unitxt.test_utils.card import test_card

card = TaskCard(
    loader=LoadHF(
        path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train"
    ),
    preprocess_steps=[
        RenameSplits({"train": "test"}),
        FilterByCondition(values={"turn": 2}, condition="eq"),
        FilterByCondition(values={"reference": "[]"}, condition="eq"),
        FilterByCondition(
            values={"winner": ["model_1", "tie", "model_2"]}, condition="in"
        ),
        MapInstanceValues(
            mappers={
                "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"}
            }
        ),
        RenameFields(
            field_to_field={
                "category": "group",
            }
        ),
        LiteralEval("model_input", to_field="model_input"),
        LiteralEval("model_1_output", to_field="model_1_output"),
        LiteralEval("model_2_output", to_field="model_2_output"),
        InterleaveListsToDialogOperator(
            user_turns_field="model_input",
            assistant_turns_field="model_1_output",
            to_field="dialog_a",
        ),
        InterleaveListsToDialogOperator(
            user_turns_field="model_input",
            assistant_turns_field="model_2_output",
            to_field="dialog_b",
        ),
    ],
    task="tasks.response_assessment.pairwise_comparison.multi_turn",
    templates=[
        "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_shuffle"
    ],
)

test_card(card, demos_taken_from="test", strict=False)
add_to_catalog(
    card,
    "cards.mt_bench.response_assessment.pairwise_comparison.multi_turn_gpt4_judgement",
    overwrite=True,
)
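The two InterleaveListsToDialogOperator steps are the heart of this card: each zips the shared user turns with one model's answers into a dialog. A rough sketch of that interleaving idea follows; the exact turn structure unitxt emits (tuples vs. role/content dicts) is an assumption here, not taken from the operator's source:

# Illustrative sketch of the interleaving behind InterleaveListsToDialogOperator.
def interleave_to_dialog(user_turns, assistant_turns):
    """Alternate user and assistant turns into a single dialog list."""
    dialog = []
    for user, assistant in zip(user_turns, assistant_turns):
        dialog.append({"role": "user", "content": user})       # assumed shape
        dialog.append({"role": "assistant", "content": assistant})
    return dialog

model_input = ["Write a haiku about rain.", "Now translate it to French."]
model_1_output = ["Rain taps the window...", "La pluie frappe la fenetre..."]
print(interleave_to_dialog(model_input, model_1_output))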
@@ -0,0 +1,64 @@
from unitxt.blocks import (
    TaskCard,
)
from unitxt.catalog import add_to_catalog
from unitxt.loaders import LoadHF
from unitxt.operators import (
    FilterByCondition,
    InterleaveListsToDialogOperator,
    MapInstanceValues,
    RenameFields,
)
from unitxt.processors import LiteralEval
from unitxt.splitters import RenameSplits
from unitxt.test_utils.card import test_card

card = TaskCard(
    loader=LoadHF(
        path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train"
    ),
    preprocess_steps=[
        RenameSplits({"train": "test"}),
        FilterByCondition(values={"turn": 2}, condition="eq"),
        FilterByCondition(values={"reference": "[]"}, condition="ne"),
        FilterByCondition(
            values={"winner": ["model_1", "tie", "model_2"]}, condition="in"
        ),
        MapInstanceValues(
            mappers={
                "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"}
            }
        ),
        RenameFields(field_to_field={"category": "group"}),
        LiteralEval("model_input", to_field="model_input"),
        LiteralEval("model_1_output", to_field="model_1_output"),
        LiteralEval("model_2_output", to_field="model_2_output"),
        LiteralEval("reference", to_field="reference"),
        InterleaveListsToDialogOperator(
            user_turns_field="model_input",
            assistant_turns_field="model_1_output",
            to_field="dialog_a",
        ),
        InterleaveListsToDialogOperator(
            user_turns_field="model_input",
            assistant_turns_field="model_2_output",
            to_field="dialog_b",
        ),
        InterleaveListsToDialogOperator(
            user_turns_field="model_input",
            assistant_turns_field="reference",
            to_field="reference_dialog",
        ),
    ],
    task="tasks.response_assessment.pairwise_comparison.multi_turn_with_reference",
    templates=[
        "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_reference_with_shuffle"
    ],
)

test_card(card, demos_taken_from="test", strict=False, loader_limit=1000)
add_to_catalog(
    card,
    "cards.mt_bench.response_assessment.pairwise_comparison.multi_turn_with_reference_gpt4_judgement",
    overwrite=True,
)
@@ -0,0 +1,58 @@
from unitxt.blocks import (
    TaskCard,
)
from unitxt.catalog import add_to_catalog
from unitxt.loaders import LoadHF
from unitxt.operators import (
    CopyFields,
    FilterByCondition,
    MapInstanceValues,
    RenameFields,
)
from unitxt.processors import LiteralEval
from unitxt.splitters import RenameSplits
from unitxt.test_utils.card import test_card

card = TaskCard(
    loader=LoadHF(
        path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train"
    ),
    preprocess_steps=[
        RenameSplits({"train": "test"}),
        FilterByCondition(values={"turn": 1}, condition="eq"),
        FilterByCondition(values={"reference": "[]"}, condition="eq"),
        FilterByCondition(
            values={"winner": ["model_1", "tie", "model_2"]}, condition="in"
        ),
        MapInstanceValues(
            mappers={
                "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"}
            }
        ),
        RenameFields(
            field_to_field={
                "model_input": "question",
                "model_1_output": "answer_a",
                "model_2_output": "answer_b",
                "category": "group",
            }
        ),
        LiteralEval("question", to_field="question"),
        CopyFields(field_to_field={"question/0": "question"}),
        LiteralEval("answer_a", to_field="answer_a"),
        CopyFields(field_to_field={"answer_a/0": "answer_a"}),
        LiteralEval("answer_b", to_field="answer_b"),
        CopyFields(field_to_field={"answer_b/0": "answer_b"}),
    ],
    task="tasks.response_assessment.pairwise_comparison.single_turn",
    templates=[
        "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_shuffle"
    ],
)

test_card(card, demos_taken_from="test", strict=False)
add_to_catalog(
    card,
    "cards.mt_bench.response_assessment.pairwise_comparison.single_turn_gpt4_judgement",
    overwrite=True,
)
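The repeated LiteralEval-then-CopyFields pairs handle a storage quirk: the judgment dataset appears to store each turn list as a string, so the value is first parsed and then the first (and, after the turn==1 filter, only) turn is extracted. A plain-Python equivalent, with an assumed raw value:

import ast

# Assumed raw value: a list serialized as a string (illustrative, not actual data).
raw_question = '["What is the capital of France?"]'
parsed = ast.literal_eval(raw_question)  # LiteralEval: parse the string into a list
question = parsed[0]                     # CopyFields "question/0": take the first turn
print(question)
# What is the capital of France?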
@@ -0,0 +1,61 @@
from unitxt.blocks import (
    TaskCard,
)
from unitxt.catalog import add_to_catalog
from unitxt.loaders import LoadHF
from unitxt.operators import (
    CopyFields,
    FilterByCondition,
    MapInstanceValues,
    RenameFields,
)
from unitxt.processors import LiteralEval
from unitxt.splitters import RenameSplits
from unitxt.test_utils.card import test_card

card = TaskCard(
    loader=LoadHF(
        path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train"
    ),
    preprocess_steps=[
        RenameSplits({"train": "test"}),
        FilterByCondition(values={"turn": 1}, condition="eq"),
        FilterByCondition(values={"reference": "[]"}, condition="ne"),
        FilterByCondition(
            values={"winner": ["model_1", "tie", "model_2"]}, condition="in"
        ),
        MapInstanceValues(
            mappers={
                "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"}
            }
        ),
        RenameFields(
            field_to_field={
                "model_input": "question",
                "model_1_output": "answer_a",
                "model_2_output": "answer_b",
                "reference": "reference_answer",
                "category": "group",
            }
        ),
        LiteralEval("question", to_field="question"),
        CopyFields(field_to_field={"question/0": "question"}),
        LiteralEval("answer_a", to_field="answer_a"),
        CopyFields(field_to_field={"answer_a/0": "answer_a"}),
        LiteralEval("answer_b", to_field="answer_b"),
        CopyFields(field_to_field={"answer_b/0": "answer_b"}),
        LiteralEval("reference_answer", to_field="reference_answer"),
        CopyFields(field_to_field={"reference_answer/0": "reference_answer"}),
    ],
    task="tasks.response_assessment.pairwise_comparison.single_turn_with_reference",
    templates=[
        "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_reference_with_shuffle"
    ],
)

test_card(card, demos_taken_from="test", strict=False, loader_limit=1000)
add_to_catalog(
    card,
    "cards.mt_bench.response_assessment.pairwise_comparison.single_turn_with_reference_gpt4_judgement",
    overwrite=True,
)
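Finally, a minimal sketch of consuming one of the new catalog entries. The card and template names come from this diff; everything else assumes the recipe-string form of unitxt's load_dataset and network access to the judgment dataset:

from unitxt import load_dataset

# Recipe string: card and template as registered by the scripts above.
dataset = load_dataset(
    "card=cards.mt_bench.response_assessment.pairwise_comparison.single_turn_gpt4_judgement,"
    "template=templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_shuffle"
)
print(dataset["test"][0]["source"])  # the rendered judge prompt for the first instance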