Unitxt 1.7.4
In the 1.7.4 release, we've made significant improvements to unitxt, further enhancing its developer friendliness. This update marks a step towards our goal of offering a well-documented and user-friendly library. A key feature of this release is the introduction of a type verification mechanism, designed to enhance the developer experience by increasing transparency and preemptively addressing errors.
4 Most Important Changes:
Add Description and Tags to unitxt Artifacts (1/4)
You can now enrich unitxt artifacts with descriptions and tags. These additions aim to enhance the upcoming online catalog, enabling users to search and filter artifacts by tags for an improved browsing experience.
For instance, to add context to a TaskCard:
TaskCard(
    ...,
    __description__="This is the WNLI dataset",
    __tags__={"task": "nli", "license": "apache2.0"},
)
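To illustrate the kind of tag-based filtering the catalog is meant to enable, here is a minimal sketch. It does not use the unitxt API: the `ARTIFACTS` list, the `cards.example_qa` entry, and the `filter_by_tags` helper are hypothetical stand-ins, and the real catalog may store and query tags differently.

```python
# Hypothetical in-memory stand-in for catalog entries (not the unitxt catalog format).
ARTIFACTS = [
    {
        "name": "cards.wnli",
        "__description__": "This is the WNLI dataset",
        "__tags__": {"task": "nli", "license": "apache2.0"},
    },
    {
        "name": "cards.example_qa",  # made-up entry for illustration only
        "__description__": "A made-up QA card",
        "__tags__": {"task": "qa", "license": "mit"},
    },
]


def filter_by_tags(artifacts, required):
    """Return the artifacts whose __tags__ contain every required key/value pair."""
    return [
        artifact
        for artifact in artifacts
        if all(artifact.get("__tags__", {}).get(k) == v for k, v in required.items())
    ]


nli_names = [a["name"] for a in filter_by_tags(ARTIFACTS, {"task": "nli"})]
print(nli_names)
```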
See more in #725
Metrics and Postprocessors Override Through Recipe (2/4)
Metrics and postprocessors can now be specified directly through the recipe, overriding those in the dataset card.
For example, to use "metrics.rouge" instead of "metrics.accuracy" for WNLI:
load_dataset("card=cards.wnli, ... , metrics=[metrics.rouge]")
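The override semantics can be sketched in a few lines of plain Python. This is not unitxt's actual recipe parser: `parse_recipe` and `resolve_metrics` are hypothetical helpers, and the only assumption taken from the text above is that a `metrics` entry in the recipe replaces the metrics defined in the card.

```python
def parse_recipe(query):
    """Split 'k=v,k=[a,b]' into a dict, keeping bracketed lists together."""
    parts, depth, current = [], 0, ""
    for ch in query:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma ends the current key=value pair.
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)

    result = {}
    for part in parts:
        key, value = part.split("=", 1)
        key, value = key.strip(), value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        result[key] = value
    return result


def resolve_metrics(card_metrics, recipe_args):
    """Metrics given in the recipe take precedence over the card's metrics."""
    return recipe_args.get("metrics", card_metrics)


args = parse_recipe("card=cards.wnli,metrics=[metrics.rouge]")
effective = resolve_metrics(["metrics.accuracy"], args)
print(effective)
```

When the recipe does not mention `metrics`, the card's own metrics are used unchanged.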
See more in #663
Metrics Type Validation (3/4: ⚠️ Breaking Change ⚠️ )
Context: Type checking is a central part of the effort to make unitxt more developer friendly: it guides developers more effectively and catches issues before they surface at run time.
Previously, metrics individually determined if predictions and references were correctly typed, with many lacking such checks.
Now, Metric incorporates universal code to verify the types of predictions/references and to determine if a metric supports single or multiple references per prediction.
Each metric now has two new parameters:
# Set 'prediction_type' to the expected types of predictions and references, e.g., "List[str]", "List[Dict]", "string".
# Defaults to None, which triggers a warning for now; future versions of unitxt will treat this as an error.
prediction_type: str = None
# Indicates if a metric allows multiple references per prediction; otherwise, it supports only one reference per prediction.
single_reference_per_prediction: bool = False
Incompatibility Notice: Any existing post-processing pipeline that violates the type schema will now raise an error.
Important: By default, unitxt handles multiple references per prediction, as seen in HF datasets (each prediction is a string, and its references are a list of strings), and applies post-processing accordingly. For some metrics, such as those measuring retrieval, each prediction and each reference is itself a list of document IDs. In scenarios like few-shot learning, this adjustment ensures metrics correctly handle lists of lists.
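A minimal sketch of what such a universal check could look like is shown below. It is not the code added in #667: `is_of_type` and `check_types` are hypothetical helpers, and for brevity they take `typing` objects such as `List[str]` rather than the type strings that `prediction_type` accepts.

```python
from typing import Any, List, get_args, get_origin


def is_of_type(obj, expected):
    """Recursively check obj against a (possibly parameterized) list/dict type."""
    if expected is Any:
        return True
    origin = get_origin(expected)
    if origin is None:  # plain class such as str or int
        return isinstance(obj, expected)
    args = get_args(expected)
    if origin is list:
        return isinstance(obj, list) and (
            not args or all(is_of_type(item, args[0]) for item in obj)
        )
    if origin is dict:
        if not isinstance(obj, dict):
            return False
        return not args or all(
            is_of_type(k, args[0]) and is_of_type(v, args[1]) for k, v in obj.items()
        )
    raise TypeError(f"Unsupported type in this sketch: {expected}")


def check_types(predictions, references, prediction_type,
                single_reference_per_prediction=False):
    """Raise ValueError if any prediction or reference violates prediction_type."""
    for pred, refs in zip(predictions, references):
        if not is_of_type(pred, prediction_type):
            raise ValueError(f"Prediction {pred!r} is not of type {prediction_type}")
        if not isinstance(refs, list):
            raise ValueError(f"References {refs!r} must be a list")
        if single_reference_per_prediction and len(refs) != 1:
            raise ValueError(f"Expected exactly one reference, got {len(refs)}")
        for ref in refs:
            if not is_of_type(ref, prediction_type):
                raise ValueError(f"Reference {ref!r} is not of type {prediction_type}")


# String predictions, each with a list of string references.
check_types(["yes"], [["yes", "yeah"]], str)
# Retrieval-style: each prediction and each reference is a list of document IDs.
check_types([["d1", "d2"]], [[["d1"], ["d3"]]], List[str])
```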
See more in #667
Dialog Processing Capabilities (4/4)
Dialog data is essential for tasks like dialog completion, dialog summarization, etc. Thus, we've made an initial attempt to support dialog processing in unitxt. The challenges were twofold: (1) dialog is influenced by the system format, and (2) dialog consists of multiple turns, each potentially considered as the final turn for evaluation. To address these, we've introduced a new class of dialog processing operators, which you can review here:
https://unitxt.readthedocs.io/en/latest/unitxt.dialog_operators.html
You can review an example of card construction utilizing a few dialog processing tools here: https://github.com/IBM/unitxt/blob/main/prepare/cards/coqa.py
This card's usage can be demonstrated with the following recipe:
card=cards.coqa.completion,template=templates.completion.abstractive.standard,format=formats.textual_assistant
Resulting in this input data:
Write the best response to the dialog.
<|user|>
The Vatican Apostolic Library.... The Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail....from this period, though some are very significant.
When was the Vat formally opened?
<|assistant|>
It was formally established in 1475
<|user|>
what is the library for?
<|assistant|>
research
<|user|>
for what subjects?
And this target:
history, and law
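The idea that each turn can serve as the final turn for evaluation can be sketched as follows. This is not the implementation of the unitxt dialog operators: `render_history` and `dialog_to_instances` are hypothetical helpers, the `<|user|>` and `<|assistant|>` markers are copied from the example above, and the instruction line and context passage are omitted for brevity.

```python
def render_history(turns):
    """Render completed (user, assistant) turns using the role markers shown above."""
    lines = []
    for user, assistant in turns:
        lines += ["<|user|>", user, "<|assistant|>", assistant]
    return lines


def dialog_to_instances(turns):
    """Create one instance per turn, treating that turn as the final one."""
    instances = []
    for i, (user, assistant) in enumerate(turns):
        # History holds all earlier turns; the current user message stays unanswered.
        lines = render_history(turns[:i]) + ["<|user|>", user]
        instances.append({"source": "\n".join(lines), "target": assistant})
    return instances


turns = [
    ("When was the Vat formally opened?", "It was formally established in 1475"),
    ("what is the library for?", "research"),
    ("for what subjects?", "history, and law"),
]
instances = dialog_to_instances(turns)
print(instances[-1]["source"])
print(instances[-1]["target"])
```

The last instance corresponds to the input and target shown above; the earlier instances evaluate the model on the first and second turns with a shorter history.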
See more in #640
All Changes In Unitxt 1.7.4
Breaking Changes
- Add generic mechanism to check prediction and reference types in metrics by @yoavkatz in #667. See the explanation in the previous sections for why this change is breaking.
New Features
- Add ability to fuse sources with disjoint splits by @yoavkatz in #707
- Allow max reduction type in metric to find the best overall score over all instances by @yoavkatz in #709
- Add string operators module with many standard string operators by @elronbandel in #721
- Allow disabling per group f1 scores in customF1 by @yoavkatz in #719
- Add improved type inference capabilities, inferring type_string from a given object, and infer_type therefrom via parse_type_string by @dafnapension in #706
- Add description and tags to every catalog artifact by @elronbandel in #725
- allow contexts not to be entered to metric by @perlitz in #653
- Add control over metrics and postprocessors through the recipe by @elronbandel in #663
- Add coqa and dialog processing capabilities by @elronbandel in #640
- Add pandas_load_args for LoadCSV by @elronbandel in #696
- Add safe and complete type parsing function to type_utils, for allowing better type checking. by @elronbandel in #688
- Add deprecation decorator for warning and errors for deprecation of functions and classes by @elronbandel in #689
- Add choices shuffling to MultipleChoiceTemplate by @elronbandel in #678
- Make settings utils type sensitive by @elronbandel in #674
New Assets
- Add intl to korean and arabic + improved packaged dependency checks by @pklpriv in #698
- Added BERT Score with new embedding model "distilbert-base-uncased" by @shivangibithel in #703
- Grammatical error correction task by @arielge in #718
- Add trec dataset by @elronbandel in #723
- Add templates for flan text similarity by @elronbandel in #728
- Add metrics for binary tasks with float predictions by @lilacheden in #654
- Add mistral format by @elronbandel in #660
- Added new metric for unsorted_list_exact_match by @yoavkatz in #685
- Add flan wnli truthfulness format by @elronbandel in #665
- DuplicateInstances operator by @pawelknes in #682
- introduce arabic to normalized sacrebleu by @pklpriv in #638
- 20newsgroup from sklearn by @ilyashnil in #659
- Add match_closest_option post processor for multiple choice qa by @elronbandel in #679
- Duplicate instance operator - new functionality by @pawelknes in #687
- Add babi qa dataset by @elronbandel in #666
Asset Fixes
- Add missing instruction in labrador zero shot format by @alonh in #716
- Fix title template for classification by @elronbandel in #722
- prevent cohere4ai using judge as default by @perlitz in #664
- fix summarization template by @gitMichal in #652
Bug Fixes
- Fix handling of boolean environment variables by @arielge in #711
- Handle all env variables with expected types by @arielge in #714
- Properly define the abstract fields by @elronbandel in #724
- Fix places not using general settings or logger by @elronbandel in #656
- removal of dpath by @dafnapension in #680
- fix: LoadFromIBMCloud empty data_dir breaks processing by @jezekra1 in #668
- Fix bug in references with none by @elronbandel in #677
- Validating that the prepare dir is consistent with catalog by @eladven in #683
New Contributors
- @shivangibithel made their first contribution in #703
- @jezekra1 made their first contribution in #668
- @pklpriv made their first contribution in #638
- @pawelknes made their first contribution in #682
Full Changelog: 1.7.1...1.7.4