[Feature] Benchmark Evaluation Script #15

Open
3 of 5 tasks
zqwerty opened this issue Dec 20, 2021 · 1 comment

zqwerty commented Dec 20, 2021

Describe the feature
Provide a unified, fast evaluation script for the benchmark, built on the unified datasets (#11) and standard dataloaders (#14).
Evaluation metrics:

  • NLU: Acc, dialog act tuple P/R/F1
  • DST: JGA, slot Acc, slot P/R/F1 (see the sketch after this list)
  • NLG: BLEU, SER
  • E2E static: Inform, success, BLEU, combined score
  • Simulation: success rate, inform P/R/F1, avg. turn
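
As a rough illustration of the DST metrics in the list above, here is a minimal sketch of how JGA and slot P/R/F1 could be computed. The per-turn format (`state` / `golden_state` dicts keyed by `domain-slot` names) is an assumption for illustration, not the final unified format:

```python
# Minimal sketch of JGA and slot P/R/F1; the turn format ("state" / "golden_state"
# dicts keyed by "domain-slot") is assumed for illustration only.
def dst_metrics(turns):
    joint_correct = tp = fp = fn = 0
    for turn in turns:
        pred = {k: v for k, v in turn["state"].items() if v}
        gold = {k: v for k, v in turn["golden_state"].items() if v}
        joint_correct += int(pred == gold)          # JGA: exact match of the full state
        pred_pairs, gold_pairs = set(pred.items()), set(gold.items())
        tp += len(pred_pairs & gold_pairs)          # slot-level counts over the corpus
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"JGA": joint_correct / len(turns),
            "slot_precision": precision, "slot_recall": recall, "slot_f1": f1}
```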

Expected behavior
The scripts should evaluate different models of a module in the same way, so that their results are directly comparable.

Additional context
The previous evaluation scripts (convlab2/$module/evaluate.py) can serve as a reference, but they do not use batch inference.
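
For context, a batched inference loop for the new scripts might look roughly like the following; `iter_batches` and `predict_batch` are hypothetical placeholders, not an existing ConvLab-2 API:

```python
# Hypothetical sketch of batch inference; `iter_batches` and `predict_batch`
# are illustrative placeholders, not part of the existing convlab2 codebase.
def run_inference(model, dataloader, batch_size=64):
    predictions = []
    for batch in dataloader.iter_batches(batch_size):
        # One forward pass per batch instead of one per utterance, which is the
        # main speed-up over convlab2/$module/evaluate.py.
        predictions.extend(model.predict_batch(batch))
    return predictions
```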


zqwerty commented Mar 3, 2022

NLU benchmark for BERTNLU and MILU on multiwoz21, tm1, tm2, tm3

To illustrate that it is easy to apply a model to any dataset in our unified format, we report performance on several such datasets. We follow README.md and the config files in unified_datasets/ to generate predictions.json, then evaluate it using ../evaluate_unified_datasets.py. Note that we use almost the same hyper-parameters across datasets, which may not be optimal.

  • Acc: whether all dialogue acts of an utterance are correctly predicted.
  • F1: F1 measure of the dialogue act predictions over the corpus.
| Model | MultiWOZ 2.1 (Acc / F1) | Taskmaster-1 (Acc / F1) | Taskmaster-2 (Acc / F1) | Taskmaster-3 (Acc / F1) |
|---|---|---|---|---|
| T5-small | 77.8 / 86.5 | 74.0 / 52.5 | 80.0 / 71.4 | 87.2 / 83.1 |
| T5-small (context=3) | 82.0 / 90.3 | 76.2 / 56.2 | 82.4 / 74.3 | 89.0 / 85.1 |
| BERTNLU | 74.5 / 85.9 | 72.8 / 50.6 | 79.2 / 70.6 | 86.1 / 81.9 |
| BERTNLU (context=3) | 80.6 / 90.3 | 74.2 / 52.7 | 80.9 / 73.3 | 87.8 / 83.8 |
| MILU | 72.9 / 85.2 | 72.9 / 49.2 | 79.1 / 68.7 | 85.4 / 80.3 |
| MILU (context=3) | 76.6 / 87.9 | 72.4 / 48.5 | 78.9 / 68.4 | 85.1 / 80.1 |
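
For reference, the Acc and F1 defined above could be computed from predictions.json along these lines; the field names assumed here (`dialogue_acts` for gold labels, `predictions` for model output, each a list of act tuples) are illustrative and may differ from the actual file produced by the scripts:

```python
import json

# Illustrative computation of utterance-level Acc and corpus-level act-tuple F1.
# The predictions.json schema assumed here ("dialogue_acts" = gold, "predictions"
# = model output, both lists of dialogue act tuples) is a guess for illustration.
def nlu_metrics(path):
    with open(path) as f:
        samples = json.load(f)
    acc_hits = tp = fp = fn = 0
    for sample in samples:
        gold = {tuple(act) for act in sample["dialogue_acts"]}
        pred = {tuple(act) for act in sample["predictions"]}
        acc_hits += int(gold == pred)   # Acc: all acts of the utterance correct
        tp += len(gold & pred)          # F1: over all act tuples in the corpus
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": acc_hits / len(samples),
            "precision": precision, "recall": recall, "f1": f1}
```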
