[Feature] Benchmark Evaluation Script #15

Open
3 of 5 tasks
zqwerty opened this issue Dec 20, 2021 · 1 comment

zqwerty commented Dec 20, 2021

Describe the feature
Provide a unified, fast evaluation script for the benchmark, built on the unified datasets (#11) and standard dataloaders (#14).
Evaluation metrics:

  • NLU: Acc, dialog act tuple P/R/F1
  • DST: JGA, slot Acc, slot P/R/F1 (see the sketch after this list)
  • NLG: BLEU, SER
  • E2E static: Inform, success, BLEU, combined score
  • Simulation: success rate, inform P/R/F1, avg. turn
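
As a rough illustration of the DST metrics in the list above, here is a minimal sketch of how JGA and slot P/R/F1 could be computed. The per-turn format (`state` / `golden_state` dicts keyed by `domain-slot` names) is an assumption for illustration, not the final unified format:

```python
# Minimal sketch of JGA and slot P/R/F1; the turn format ("state" / "golden_state"
# dicts keyed by "domain-slot") is assumed for illustration only.
def dst_metrics(turns):
    joint_correct = tp = fp = fn = 0
    for turn in turns:
        pred = {k: v for k, v in turn["state"].items() if v}
        gold = {k: v for k, v in turn["golden_state"].items() if v}
        joint_correct += int(pred == gold)          # JGA: exact match of the full state
        pred_pairs, gold_pairs = set(pred.items()), set(gold.items())
        tp += len(pred_pairs & gold_pairs)          # slot-level counts over the corpus
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"JGA": joint_correct / len(turns),
            "slot_precision": precision, "slot_recall": recall, "slot_f1": f1}
```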

Expected behavior
The scripts should evaluate different models of a module in the same way, so that their results are directly comparable.

Additional context
The previous evaluation scripts (convlab2/$module/evaluate.py) can serve as a reference, but they do not use batch inference.
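
For context, a batched inference loop for the new scripts might look roughly like the following; `iter_batches` and `predict_batch` are hypothetical placeholders, not an existing ConvLab-2 API:

```python
# Hypothetical sketch of batch inference; `iter_batches` and `predict_batch`
# are illustrative placeholders, not part of the existing convlab2 codebase.
def run_inference(model, dataloader, batch_size=64):
    predictions = []
    for batch in dataloader.iter_batches(batch_size):
        # One forward pass per batch instead of one per utterance, which is the
        # main speed-up over convlab2/$module/evaluate.py.
        predictions.extend(model.predict_batch(batch))
    return predictions
```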


zqwerty commented Mar 3, 2022

NLU benchmark for BERTNLU and MILU on multiwoz21, tm1, tm2, tm3

To illustrate that it is easy to apply a model to any dataset in our unified format, we report performance on several such datasets. We follow README.md and the config files in unified_datasets/ to generate predictions.json, then evaluate it using ../evaluate_unified_datasets.py. Note that we use almost the same hyper-parameters across datasets, which may not be optimal.

  • Acc: whether all dialogue acts of an utterance are correctly predicted.
  • F1: F1 measure of the dialogue act predictions over the corpus.
| Model | MultiWOZ 2.1 (Acc / F1) | Taskmaster-1 (Acc / F1) | Taskmaster-2 (Acc / F1) | Taskmaster-3 (Acc / F1) |
|---|---|---|---|---|
| T5-small | 77.8 / 86.5 | 74.0 / 52.5 | 80.0 / 71.4 | 87.2 / 83.1 |
| T5-small (context=3) | 82.0 / 90.3 | 76.2 / 56.2 | 82.4 / 74.3 | 89.0 / 85.1 |
| BERTNLU | 74.5 / 85.9 | 72.8 / 50.6 | 79.2 / 70.6 | 86.1 / 81.9 |
| BERTNLU (context=3) | 80.6 / 90.3 | 74.2 / 52.7 | 80.9 / 73.3 | 87.8 / 83.8 |
| MILU | 72.9 / 85.2 | 72.9 / 49.2 | 79.1 / 68.7 | 85.4 / 80.3 |
| MILU (context=3) | 76.6 / 87.9 | 72.4 / 48.5 | 78.9 / 68.4 | 85.1 / 80.1 |
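
For reference, the Acc and F1 defined above could be computed from predictions.json along these lines; the field names assumed here (`dialogue_acts` for gold labels, `predictions` for model output, each a list of act tuples) are illustrative and may differ from the actual file produced by the scripts:

```python
import json

# Illustrative computation of utterance-level Acc and corpus-level act-tuple F1.
# The predictions.json schema assumed here ("dialogue_acts" = gold, "predictions"
# = model output, both lists of dialogue act tuples) is a guess for illustration.
def nlu_metrics(path):
    with open(path) as f:
        samples = json.load(f)
    acc_hits = tp = fp = fn = 0
    for sample in samples:
        gold = {tuple(act) for act in sample["dialogue_acts"]}
        pred = {tuple(act) for act in sample["predictions"]}
        acc_hits += int(gold == pred)   # Acc: all acts of the utterance correct
        tp += len(gold & pred)          # F1: over all act tuples in the corpus
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": acc_hits / len(samples),
            "precision": precision, "recall": recall, "f1": f1}
```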
