Benjams/add rag task card and metric #1044

benjaminsznajder · 2024-07-22T16:09:29Z

No description provided.

.gitignore

src/unitxt/catalog/templates/rag/end_to_end/json_predictions.json

matanor

@benjaminsznajder looks good IMO, left a few small questions and comments

prepare/tasks/rag/rag_end_to_end.py

prepare/cards/rag/end_to_end/clapnq.py

prepare/metrics/rag.py

prepare/templates/rag/end_to_end.py

elronbandel

A style comment (no need to adopt, totally personal preference here): when a variable name is longer or more complicated than the string it holds, it should be replaced with the string itself.
For example:

ClapNqBenchmark.ID -> "id"

Variables can break the linear flow of reading a file: to understand exactly what is happening, the reader has to go back and forth to the variable definitions.
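
A minimal sketch of the style point above. Only the name `ClapNqBenchmark.ID` and the value "id" come from the comment; the `QUESTION` constant, the two helper functions, and the field mapping itself are hypothetical and are not taken from the code in this PR.

```python
class ClapNqBenchmark:
    # Constants whose names are longer than the strings they hold.
    ID = "id"
    QUESTION = "question"  # hypothetical constant, for illustration only


def mapping_with_constants() -> dict:
    # The reader must look up each constant to learn the actual field names.
    return {
        ClapNqBenchmark.ID: "question_id",
        ClapNqBenchmark.QUESTION: "question",
    }


def mapping_with_literals() -> dict:
    # Same mapping, readable top to bottom without jumping to definitions.
    return {
        "id": "question_id",
        "question": "question",
    }
```

Both functions build the same dictionary; the only difference is how much the reader has to look up elsewhere in the file.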

elronbandel

LGTM. It may need some tests, but those can be added in a separate PR.

* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027) Fix data classes not support field overriding in fields containing types or functions Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added seed to LLM as judges for consistent results (#1029) Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * replace type and __type__ in type error (#1035) Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add rag_end_to_end metrics Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add rag_end_to_end metrics Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add task rag_end_to_end Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add card for clapnq end_to_end Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add sandbox_benjams Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add subset Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add a reduction of clap_nq Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add a reduction of clap_nq Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * remove constants Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * rename sandbox_benjams to sandbox Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * remove sandbox Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add string to context id in rag (#1036) * allow strings (hash) as context id Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * save to catalog Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> --------- Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fixed issues with fresh install (#1037) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add validation to tldr, remove shuffle from billsum (#1038) * add validation to tldr, 
remove shuffle from billsum (shuffled by the SplitRandomMix) Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> * fix formatting Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> --------- Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011) * Remove confidence interval calculation for meteor metric by default added a new metric with interval calculations Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added error mesage when metrics not a list Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added error mesage when post processors are not a list Signed-off-by: Yoav Katz <katz@il.ibm.com> * Changed Rouge to be HuggingfaceBulkMetric to avoid recalculation of metric on every resample Signed-off-by: Yoav Katz <katz@il.ibm.com> * added meteor as an HuggingFaceInstanceMetric Signed-off-by: dafnapension <dafnashein@yahoo.com> * removed meteor_with_confidence_intervals.json Signed-off-by: dafnapension <dafnashein@yahoo.com> * fixed test_metric_utils.py by better concentrating on rougeL only Signed-off-by: dafnapension <dafnashein@yahoo.com> * comment about rounded floats in tested scores Signed-off-by: dafnapension <dafnashein@yahoo.com> * while generating metric meteor, compmare against HF implementation Signed-off-by: dafnapension <dafnashein@yahoo.com> * added a test comparing new Rouge with HF Rouge, nd per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances Signed-off-by: dafnapension <dafnashein@yahoo.com> * implemented Meteor and Rouge with inhouse code Signed-off-by: dafnapension <dafnashein@yahoo.com> * download quietly, and import in prepare Signed-off-by: dafnapension <dafnashein@yahoo.com> * trying to avoid .secrets.baseline Signed-off-by: dafnapension <dafnashein@yahoo.com> * secret.baseline how do I get rid of it? 
Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add CloseTextSampler and FixedIndicesSampler (#1034) * Add CloseTextSampler That returns demos that are textually close to the current instance. Signed-off-by: Yoav Katz <katz@il.ibm.com> * Make sampler call pass current instance Added end 2 end test of sampler that depends on output Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added FixedIndicesSampler(Sampler): Selects a fix set of samples based on a list of indices from the demo pool Signed-off-by: Yoav Katz <katz@il.ibm.com> * Made splitter currently use random_generators Signed-off-by: Yoav Katz <katz@il.ibm.com> * Changed all Sample randomization To use common code to create randomizer per instance Signed-off-by: Yoav Katz <katz@il.ibm.com> * Updated demos in test After a non backward compatible change Signed-off-by: Yoav Katz <katz@il.ibm.com> * Updated demos in test After a non backward compatible change Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * changed input and output of templates to "input_fields" and "reference_ fields" - Non backward compatible (#1030) * changed input and output of templates to "input_fields" and "reference_ fields" . This is to continue the work done on tasks. 
Signed-off-by: Yoav Katz <katz@il.ibm.com> * Fixed type hint Signed-off-by: Yoav Katz <katz@il.ibm.com> * Documentation update Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * FinQA - filter problematic examples (#1039) filter problematic examples Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Arena hard elad2 (#1026) * bug fixes in PairwiseChoiceTemplate * add arena hard regex parser operator * update mt bench card common * update mt bench card common * add reward bench * update metric to pairwise comarison task * arena hard tasks and cards * update mt bench template * add duplicate stream operator * add PairwiseComparativeRatingTemplate * add card * add card * add template * add winrate metrics * add comparative rating task * add ExtractArenaHardNumericalJudgment * add arena hard cards * add arena hard template * add weighted winrate metrics * delete file * update PairwiseComparativeRatingTemplate * add metric * add metric * update * update * update * fix template bug * update * llama 3 update * update * update * update jsons * update * update * update * update * update * update * update * update * update * update * update * update * update * update * fix * fix * fix * update * update * update * bluebench related changes * fix type issue Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * update * update * update * prometheus1 * update * fix * fix * merge with arena_branch Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * rebuild catalog Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * add debugging to clapnq * Reproduce all artifacts * Add missing artifacts to catalog * Add secrets baseline Signed-off-by: Elad Venezian <eladv@il.ibm.com> * Fix bugs with catalog creation * Remove areana hard examples from tests, since they don't pass * Add missing metadata to test mock * Add data_classification_policy and recipe_metadata to the steams 
tests * Fix test failures * Update multi_turn_gpt4_judgement.py * Update multi_turn_with_reference_gpt4_judgement.py * Update docs/docs/examples.rst Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * Update docs/docs/examples.rst Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> * bug fix in LoadFromHFSpace * revert * revert * update examples * add coment to expain change * update to new params usage * pr fixes * pr fixes * update * update * update * update * update * update * Update prepare/templates/rag/response_generation.py Co-authored-by: Yotam Perlitz <perlitz@gmail.com> * Update prepare/templates/rag/response_generation.py Co-authored-by: Yotam Perlitz <perlitz@gmail.com> * update * cr fixes * llmaj format fix * llmaj format fix --------- Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Elad Venezian <eladv@il.ibm.com> Co-authored-by: ofirarviv <ofir.arviv@ibm.com> Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com> Co-authored-by: michal <shmueli@il.ibm.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Co-authored-by: Yotam Perlitz <perlitz@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * demo's target prefix is now taken from demo instance (#1031) * demo's target prefix is now taken from demo instance Signed-off-by: dafnapension <dafnashein@yahoo.com> * do not pop fields out of demo instances. Traditionally done for main instance, but not allowed for demo instance that should serve also other main instances in the stream Signed-off-by: dafnapension <dafnashein@yahoo.com> * simplified test-case per @yoavkatz idea. 
Still eagering samples different demos than non-eagering Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * remove the reduced clap_nq Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * define an empty template for rag end_to_end Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Implement metrics ensemble (#1047) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add load_json_predictions as processor in the template Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add the processors/load_json_predictions.json generated to the catalog Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add flores101 (#1053) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added example for selection of demos (#1052) * Added example for selection of demos Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added example doc Signed-off-by: Yoav Katz <katz@il.ibm.com> * Update docs/docs/examples.rst * Update docs/docs/examples.rst --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fix - building test is not working. The reason is that opendatasets points to kaggle without version, and currently kaggle-1.6.15 fails. 
We fix the version of kaggle to be 1.6.14 as a fix Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add overwrite Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Update introduction.rst - - copy edits (grammar, consistency, clarity) (#1063) Signed-off-by: welisheva22 <welisheva22@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fix typo in japanese_llama system prompt (issue #964) (#1056) Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Allow assigning None in overwrites when fetching artifacts with modifications (#1062) allow =None in overwrites for fetch Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Make sure preparation times printed fully and nicely (#1046) Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * numeric nlg - template changes (#1041) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add judge input to the metric (#1064) * add judge input to the metric * add judge input to the metric * fix * fix test Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Unitxt capitalization adding_dataset.rst (#1057) making Unitxt capitalization consistent in text Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fixed the score_ci inconsistency issue (#1065) * suggested fix for score_ci inconsistency issue Signed-off-by: dafnapension <dafnashein@yahoo.com> * unify with the update, and thus simplified the check Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: dafnapension <dafnashein@yahoo.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Use of conventional python types in input definition of tasks and metrics (#1045) * Fix data 
classes not support field overriding in fields containing types or functions Signed-off-by: elronbandel <elron.bandel@ibm.com> * Make tasks types python types Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix errors Signed-off-by: elronbandel <elron.bandel@ibm.com> * Some fixes Signed-off-by: elronbandel <elron.bandel@ibm.com> * More fixes Signed-off-by: elronbandel <elron.bandel@ibm.com> * Update catalog Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix cards Signed-off-by: elronbandel <elron.bandel@ibm.com> * Revert change Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix typing in docs with new convention Signed-off-by: elronbandel <elron.bandel@ibm.com> * refactor of new asset to new convention Signed-off-by: elronbandel <elron.bandel@ibm.com> * Update secrets baseline Signed-off-by: elronbandel <elron.bandel@ibm.com> --------- Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added prediction type to llm as jusdge to avoid warning (#1072) * Added prediction type to llm as jusdge to avoid warning Clarified the sandalone llm as judge example Signed-off-by: Yoav Katz <katz@il.ibm.com> * Removed accidentally added file Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fixed clapnq to check with reasonable error values Also updated rag tasks to use new typing (instead of string types) Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fix the type hint Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * update catalog Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add metric "metrics.rag.retrieval_at_k" to catalog (#1074) * add metric "metrics.rag.retrieval_at_k" to catalog this is a wrapper around the retrieval_at_k for the ragas scheme * add corresponding json file for the new metric --------- Co-authored-by: Elron Bandel 
<elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * merge - resolve conflict Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> --------- Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> Signed-off-by: dafnapension <dafnashein@yahoo.com> Signed-off-by: Elad Venezian <eladv@il.ibm.com> Signed-off-by: welisheva22 <welisheva22@gmail.com> Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Co-authored-by: Yotam Perlitz <perlitz@gmail.com> Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com> Co-authored-by: Alon H <alonh@users.noreply.github.com> Co-authored-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com> Co-authored-by: Elad <eladv@il.ibm.com> Co-authored-by: ofirarviv <ofir.arviv@ibm.com> Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com> Co-authored-by: michal <shmueli@il.ibm.com> Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com> Co-authored-by: welisheva22 <welisheva22@gmail.com> Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com> Co-authored-by: Yoav Katz <katz@il.ibm.com> Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>

benjaminsznajder requested a review from matanor July 22, 2024 16:09

benjaminsznajder closed this Jul 22, 2024

benjaminsznajder reopened this Jul 22, 2024

benjaminsznajder requested review from yoavkatz, perlitz and elronbandel July 22, 2024 16:19

elronbandel reviewed Jul 22, 2024

.gitignore (outdated, resolved)

benjaminsznajder requested a review from elronbandel July 22, 2024 21:08

yoavkatz reviewed Jul 23, 2024

src/unitxt/catalog/templates/rag/end_to_end/json_predictions.json (resolved)

matanor reviewed Jul 23, 2024

benjaminsznajder requested review from yoavkatz and matanor July 23, 2024 08:12

benjaminsznajder mentioned this pull request Jul 23, 2024

Add end-to-end RAG example #966

Open

benjaminsznajder linked an issue Jul 23, 2024 that may be closed by this pull request

Add RAG end-to-end task and metrics to unitxt + clapNQ dataset for RAG end-to-end task #1048

Closed

elronbandel reviewed Jul 28, 2024

prepare/templates/rag/end_to_end.py (resolved)

elronbandel reviewed Jul 29, 2024

elronbandel approved these changes Jul 29, 2024

elronbandel and others added 13 commits July 29, 2024 23:26

Added seed to LLM as judges for consistent results (#1029)

3c146b1

Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

replace type and __type__ in type error (#1035)

ab7c806

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add rag_end_to_end metrics

408519d

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add rag_end_to_end metrics

9e379c3

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Add task rag_end_to_end

236ee4b

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add card for clapnq end_to_end

d520274

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add sandbox_benjams

ae8dbd6

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add subset

70bcda3

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add a reduction of clap_nq

3e67e36

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add a reduction of clap_nq

0c68ff0

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

remove constants

514e573

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

rename sandbox_benjams to sandbox

c2d6695

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

eladven and others added 25 commits July 29, 2024 23:26

remove the reduced clap_nq

cfb7405

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

define an empty template for rag end_to_end

af1e649

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Implement metrics ensemble (#1047)

11c9186

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add load_json_predictions as processor in the template

cbb47c2

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add the processors/load_json_predictions.json generated to the catalog

4a3633c

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Add flores101 (#1053)

955d54d

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

fix - building test is not working. The reason is that opendatasets p…

6181b17

…oints to kaggle without version, and currently kaggle-1.6.15 fails. We fix the version of kaggle to be 1.6.14 as a fix Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add overwrite

261de93

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Update introduction.rst - - copy edits (grammar, consistency, clarity) (

1f859f2

#1063) Signed-off-by: welisheva22 <welisheva22@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Fix typo in japanese_llama system prompt (issue #964) (#1056)

d95990e

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Allow assigning None in overwrites when fetching artifacts with modif…

9af8e18

…ications (#1062) allow =None in overwrites for fetch Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Make sure preparation times printed fully and nicely (#1046)

5521e70

Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

numeric nlg - template changes (#1041)

6852c70

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

add judge input to the metric (#1064)

d9c004f

* add judge input to the metric * add judge input to the metric * fix * fix test Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Unitxt capitalization adding_dataset.rst (#1057)

aeb0d5b

making Unitxt capitalization consistent in text Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

Fixed clapnq to check with reasonable error values

947f5ef

Also updated rag tasks to use new typing (instead of string types) Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

fix the type hint

aeb4d31

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

update catalog

6995e47

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

benjaminsznajder force-pushed the benjams/add_rag_task_card_and_metric branch from 6755b6e to 810b803 Compare July 29, 2024 20:27

merge - resolve conflict

51ac829

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

benjaminsznajder merged commit 29d2d93 into main Jul 29, 2024
8 checks passed

benjaminsznajder deleted the benjams/add_rag_task_card_and_metric branch July 29, 2024 21:20
