Add llm as judge mt-bench dataset and metrics #791

OfirArviv · 2024-05-05T11:57:46Z

No description provided.

* add mt_bench_single_turn_gpt4_judge dataset * added typings to model_response_assessment task field * fixed output_format in mt_bench template * fixed output_format in mt_bench template * add llama3 format * temporal changes to the inference engines * add llama3_bam_mt_bench_prompt llm-as-judge metric * add assert to openai model recipe * update genai and openai inference apis * add model_response_assessment_chat task * add ChatTemplate * add model_response_assessment.json * fix model_response_assessment.json * add template and task of chat llm as judge * mt bench templates * mt bench templates * model assessment tasks * add InterleaveListsToDialogOperator operator * update dialog template * update mt bench template * update mt bench template update * update chat template * add mt bench datasets * small fixes * update metrics * update metrics * delete old files * update test requirements file * update test requirements file * update llam3 metric with correct format * add model assestmnt tasks with reference * update tasks * clear catalog * add tasks * update task * update templates * update * update * update * add mt bench pairwise proccessor * remove odl file * update * add model assesment pairwise comparison tass * add pairwise templates * fix pairwise templates * fix mt bench pairwise processor * fix template * add mt-bench pairwise dataset * llm as judge metric cards * add llama3 metrics * update * update * update prepare test python version * clean catalog * update templates * update tasks * update tasks * update templates * update cards * update cards * update templates * add cards * add cards for llm as judge metric * add cards for llm as judge metric * add metrics * merge * add mt becnh generation datasets * fix * fix * fix * fix * update python to 3.9 for catalog testing * remove old catalog items * update llm as a judge * update readme * update tests * update dynamic cards for llm as judge * update llm as jusge etric * update tests * add the ability to 
strip_system_prompt_and_format_from_inputs * update tests * update * update * update * update * update * update * update * update * update * update * update * update * add phi3 format * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update readme * update cards with LiteralEval * update cards with LiteralEval * make llm judge dynamic fields * add json * update * update metric * update * fix * update readme * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * Update llm_as_judge.rst * update * update * update * update * update * Update llm_as_judge.rst (#847) * update * update * update * update * update * update * update * update * small fix * small fix --------- Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

OfirArviv added 26 commits May 2, 2024 17:09

add mt_bench_single_turn_gpt4_judge dataset

e035f36

added typings to model_response_assessment task field

aa884cc

fixed output_format in mt_bench template

7621fbb

fixed output_format in mt_bench template

9f642ec

add llama3 format

e88c3af

temporal changes to the inference engines

ffd946b

add llama3_bam_mt_bench_prompt llm-as-judge metric

bad7244

add assert to openai model recipe

4633961

update genai and openai inference apis

44f6de0

add model_response_assessment_chat task

3d71c1e

add ChatTemplate

e9a1376

add model_response_assessment.json

4bb70b4

fix model_response_assessment.json

88508bf

add template and task of chat llm as judge

747efb3

mt bench templates

859144c

mt bench templates

1488ec1

model assessment tasks

4ebdc40

add InterleaveListsToDialogOperator operator

0ff6f13

update dialog template

51fda0c

update mt bench template

6b67b59

update mt bench template update

00ab418

update chat template

d70db84

add mt bench datasets

932f70a

small fixes

2e6b91e

update metrics

08cee34

update metrics

cd04870

OfirArviv marked this pull request as ready for review May 5, 2024 12:26

OfirArviv added 3 commits May 5, 2024 15:26

Merge branch 'main' into users/ofir/add_llm_as_judge_dataset_and_metrics

d4990b0

delete old files

8233c04

Merge remote-tracking branch 'origin/users/ofir/add_llm_as_judge_dataset_and_metrics' into users/ofir/add_llm_as_judge_dataset_and_metrics

5622625

OfirArviv and others added 26 commits May 19, 2024 14:40

update

69b3068

update

804d605

update

53b6316

update

be35899

update

2a56811

update

c155c17

Update llm_as_judge.rst

4b01950

update

f38da96

Merge remote-tracking branch 'origin/users/ofir/add_llm_as_judge_dataset_and_metrics' into users/ofir/add_llm_as_judge_dataset_and_metrics

3e1a9d2

Merge branch 'main' into users/ofir/add_llm_as_judge_dataset_and_metrics

3506d92

update

5b848f3

Merge remote-tracking branch 'origin/users/ofir/add_llm_as_judge_dataset_and_metrics' into users/ofir/add_llm_as_judge_dataset_and_metrics

e7905fa

update

a0b1c75

update

0890aa1

update

7e271ce

Update llm_as_judge.rst (#847)

f30ebf4

update

bae0934

update

635adbe

update

a04e319

update

23a6997

update

5a50544

update

4c6aa39

update

a14bf0a

update

eea4288

small fix

09360f4

small fix

109d501

elronbandel approved these changes May 20, 2024

elronbandel merged commit d854109 into main May 20, 2024
7 checks passed

elronbandel deleted the users/ofir/add_llm_as_judge_dataset_and_metrics branch May 20, 2024 09:10
