Unitxt 1.9.0
What's Changed
The most important changes are:
- Addition of LLM as a Judge metrics and tasks, both for evaluating LLMs as judges and for using them to evaluate other tasks. Read more in the LLM as a Judge Tutorial.
- Addition of RAG response generation tasks and datasets as part of an effort to add comprehensive RAG evaluation to unitxt.
- Renaming FormTask to Task for simplicity
- Major improvements to documentation and tutorials
Breaking Changes 🚨
- Ensure consistent evaluation of CI across implementations [Might change previous results] by @dafnapension in #844
- Fix default format so it will be the same as formats.empty in catalog. Impacts runs that did not specify a format by @yoavkatz in #848
- LoadJson operator moved from unitxt.processors to unitxt.struct_data_operators
- Fixed YesNoTemplate and Diverse LabelSampler, to support binary task typing. YesNoTemplate now expects the class field to contain a string and not a list of strings with one element by @yoavkatz in #836
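
For data prepared for earlier versions, the YesNoTemplate change above means a class field that used to hold a one-element list should now hold a plain string. The following is a minimal migration sketch, not part of unitxt: the helper name `normalize_class_field` and the example instance are hypothetical, and the field name `class` is taken from the note above.

```python
# Hypothetical migration helper (not a unitxt API): converts a class field
# stored as a one-element list into the plain string that YesNoTemplate
# expects as of this release.
# The LoadJson move is an import-path change only, per the note above:
#   from unitxt.struct_data_operators import LoadJson


def normalize_class_field(instance, field="class"):
    """Return a copy of `instance` whose `field` value is a plain string."""
    value = instance[field]
    if isinstance(value, list):
        if len(value) != 1:
            raise ValueError(
                f"expected exactly one class label, got {len(value)}"
            )
        value = value[0]
    if not isinstance(value, str):
        raise TypeError(f"class label must be a string, got {type(value).__name__}")
    # Copy so the caller's original instance is left untouched.
    return {**instance, field: value}


old_style = {"text": "some input", "class": ["toxic"]}
new_style = normalize_class_field(old_style)
```

Instances that already hold a string pass through unchanged, so the helper can be applied to mixed old and new data.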
Bug Fixes
- Change processor type for to_list_by_comma_from_references by @antonpibm in #815
- Handle empty text in Literal Eval by @antonpibm in #819
- Fix clash between dir names and artifact names in catalog website by @elronbandel in #825
- Fix a mistake in NER typing by @yoavkatz in #832
- Fix catalog reference by @elronbandel in #838
- Fix default format by @yoavkatz in #848
- Fixed YesNoTemplate and Diverse LabelSampler, to support binary task typing. by @yoavkatz in #836
New Features
- Support prediction regex match by setting the operator as a postproce… by @antonpibm in #792
- Add sample score output in test card by @yoavkatz in #803
- Support for loading dictionaries by @pawelknes in #784
- Add ability to fuse, split, MultiStreamScoreMean, and merge all by @dafnapension in #767
- Changed default log verbosity to "info" instead of "debug" by @yoavkatz in #822
- Skip artifact prepare and verify in catalog consistency tests by @elronbandel in #839
- Add separation between eager streams and regular streams by @elronbandel in #846
- Add precision and recall scores to f1_binary, max_f1_binary by @lilacheden in #824
- Rename task by @elronbandel in #850
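
As background for the precision and recall scores now reported alongside f1_binary in the list above, here is a minimal sketch of how the three quantities relate for binary labels. It uses the standard textbook definitions and is not the unitxt implementation; the function name `binary_precision_recall_f1` and the sample labels are hypothetical.

```python
# Illustrative only: standard binary precision/recall/F1 definitions,
# not the code used by unitxt's f1_binary metric.


def binary_precision_recall_f1(references, predictions):
    """Compute precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives

    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


scores = binary_precision_recall_f1(
    references=[1, 0, 1, 1],
    predictions=[1, 1, 0, 1],
)
```

Reporting precision and recall next to F1 makes it easier to see whether a low F1 comes from false positives or from missed positives.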
New Assets
- Add basic format for llama3 models by @arielge in #812
- Adding literal eval processor by @antonpibm in #813
- Add RAG (response generation part) tasks and datasets by @perlitz in #811
- Add 5 legalbench tasks (the 5 existing in HELM) by @perlitz in #827
- Add financebench by @perlitz in #828
- Add billsum dataset by @perlitz in #830
- Add tldr dataset by @perlitz in #831
- Add Attaq500 by @naamaz in #835
- Add llm as judge mt-bench dataset and metrics by @OfirArviv in #791
Documentation
- Documentation review by @yoavkatz in #805
- Added documentation for global and huggingface metrics by @yoavkatz in #807
- Touch up docs by @elronbandel in #809
- Remove the contents from main menu by @elronbandel in #810
- Add tags docs by @elronbandel in #814
- Reviewing Unitxt tutorials by @michal-jacovi in #817
- Fix the link to the operators tutorial by @elronbandel in #821
- More documentation changes in metrics by @yoavkatz in #820
- Update adding_task.rst by @michal-jacovi in #823
- Fix missing mandatory new line in the beginning of code block in documentation by @elronbandel in #829
- Add description, homepage, and citation obtained from HF with datasets.load_dataset_builder by @dafnapension in #818
- Updated documentation by @yoavkatz in #849
New Contributors
- @antonpibm made their first contribution in #792
- @michal-jacovi made their first contribution in #817
Full Changelog: 1.8.1...1.9.0