NLU benchmark for BERTNLU and MILU on multiwoz21, tm1, tm2, tm3
To illustrate how easily the model can be applied to any dataset in our unified format, we report its performance on several such datasets. We follow README.md and the config files in unified_datasets/ to generate predictions.json, then evaluate it with ../evaluate_unified_datasets.py. Note that we use almost the same hyper-parameters for all datasets, which may not be optimal.
Acc: the fraction of utterances for which all dialogue acts are correctly predicted
F1: the F1 measure of the dialogue act predictions over the whole corpus
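As a rough illustration of these two metrics, the sketch below computes them over a toy list of gold and predicted dialogue acts. The flat (intent, domain, slot, value) tuple representation and the sample field names are assumptions made for this example, not the actual schema of predictions.json.

```python
# Minimal sketch of Acc and F1 for dialogue act prediction.
# The tuple representation and the sample layout are assumptions for
# this example, not the real predictions.json format.

def to_tuples(dialogue_acts):
    """Flatten a list of dialogue act dicts into hashable tuples."""
    return {(da["intent"], da["domain"], da["slot"], da.get("value", ""))
            for da in dialogue_acts}

def evaluate(samples):
    """samples: list of dicts with 'golden' and 'predicted' dialogue act lists."""
    acc_hits, tp, fp, fn = 0, 0, 0, 0
    for sample in samples:
        gold = to_tuples(sample["golden"])
        pred = to_tuples(sample["predicted"])
        # Acc: the utterance only counts if the prediction matches gold exactly
        acc_hits += int(gold == pred)
        # F1: micro-averaged over individual dialogue acts in the corpus
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    acc = acc_hits / len(samples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1

if __name__ == "__main__":
    toy = [{
        "golden": [{"intent": "inform", "domain": "hotel", "slot": "area", "value": "north"}],
        "predicted": [{"intent": "inform", "domain": "hotel", "slot": "area", "value": "north"},
                      {"intent": "request", "domain": "hotel", "slot": "price", "value": ""}],
    }]
    print(evaluate(toy))  # Acc = 0.0 (extra predicted act), F1 = 2/3
```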
Describe the feature
Provide a unified, fast evaluation script for the benchmark, using unified datasets (#11) and standard dataloaders (#14).
Evaluation metrics:
Expected behavior
The scripts should evaluate different models of a module in the same way, so that the results are comparable.
Additional context
The previous evaluation scripts (convlab2/$module/evaluate.py) could serve as a reference, but they do not use batch inference.
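To make the batch-inference point concrete, here is a minimal sketch of how a fast evaluation loop could feed utterances to a model in batches through a standard PyTorch DataLoader instead of one at a time. The DummyNLU wrapper, dataset class, and batch size are illustrative assumptions, not the interface of the existing convlab2 scripts or of BERTNLU/MILU.

```python
# Sketch of batched evaluation with a standard PyTorch DataLoader.
# DummyNLU is a placeholder for a real NLU model such as BERTNLU or MILU;
# the point is that each forward pass handles a whole batch rather than
# a single utterance.
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class UtteranceDataset(Dataset):
    def __init__(self, utterances):
        self.utterances = utterances

    def __len__(self):
        return len(self.utterances)

    def __getitem__(self, idx):
        return self.utterances[idx]

class DummyNLU(nn.Module):
    """Stand-in for a real NLU model: maps a batch of utterances to act logits."""
    def __init__(self, num_acts=10):
        super().__init__()
        self.num_acts = num_acts

    def forward(self, batch):
        # A real model would tokenize and encode the batch here.
        return torch.zeros(len(batch), self.num_acts)

def predict_in_batches(model, utterances, batch_size=32):
    loader = DataLoader(UtteranceDataset(utterances), batch_size=batch_size,
                        collate_fn=list)  # keep raw strings, no tensor stacking
    model.eval()
    predictions = []
    with torch.no_grad():
        for batch in loader:
            logits = model(batch)          # one forward pass per batch
            predictions.extend((logits > 0).tolist())
    return predictions

if __name__ == "__main__":
    utts = [f"utterance {i}" for i in range(100)]
    preds = predict_in_batches(DummyNLU(), utts)
    print(len(preds))  # 100 prediction vectors, produced batch by batch
```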