Merge pull request #37 from ConvLab/unified_dataset
add camrest dataset in unified data format
Showing 4 changed files with 2,537 additions and 296 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,61 @@ | ||
# README | ||
# Dataset Card for Camrest | ||
|
||
## Features | ||
- **Repository:** https://www.repository.cam.ac.uk/handle/1810/260970 | ||
- **Paper:** https://aclanthology.org/D16-1233/ | ||
- **Leaderboard:** None | ||
- **Who transforms the dataset:** Qi Zhu (zhuq96 at gmail dot com) | ||
|
||
- Annotations: dialogue act, character-level span for non-categorical slots. | ||
### Dataset Summary | ||
|
||
Statistics: | ||
The Cambridge restaurant dialogue domain dataset was collected for developing neural-network-based dialogue systems. Two papers have been published based on this dataset: 1. A Network-based End-to-End Trainable Task-oriented Dialogue System 2. Conditional Generation and Snapshot Learning in Neural Dialogue Systems. The dataset was collected through a Wizard-of-Oz experiment on Amazon MTurk. Each dialogue contains a goal label and several exchanges between a customer and the system. Each user turn is labelled with a set of slot-value pairs that give a coarse representation of the dialogue state (the `slu` field). There are 676 dialogues in total; most are finished, but some are not. | ||
|
||
| | \# dialogues | \# utterances | avg. turns | avg. tokens | \# domains | | ||
| ----- | ------------ | ------------- | ---------- | ----------- | ---------- | | ||
| train | 406 | 2936 | 7.23 | 11.36 | 1 | | ||
| dev | 135 | 941 | 6.97 | 11.99 | 1 | | ||
| test | 135 | 935 | 6.93 | 11.87 | 1 | | ||
- **How to get the transformed data from original data:** | ||
- Run `python preprocess.py` in the current directory. This requires the original data in `../../camrest/`. | ||
- **Main changes of the transformation:** | ||
- Add dialogue act annotations according to the state changes. This step was done by ConvLab-2, and we use the processed dialogue acts here. | ||
- Rename `pricerange` to `price range`. | ||
- Add character-level span annotations for non-categorical slots. | ||
- **Annotations:** | ||
- user goal, dialogue acts, state. | ||
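
The slot rename and the character-level span annotation listed above can be pictured with a minimal sketch. This is not the code in `preprocess.py`: the helper names `rename_slot` and `find_span`, the case-insensitive substring match, the exclusive end offset, and the example utterance are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def rename_slot(slot: str) -> str:
    """Return the unified slot name; only `pricerange` is renamed."""
    return {"pricerange": "price range"}.get(slot, slot)


def find_span(utterance: str, value: str) -> Optional[Tuple[int, int]]:
    """Locate `value` in `utterance` and return (start, end) character offsets.

    Assumption: matching is a case-insensitive substring search and `end`
    is exclusive. Returns None when the value does not occur in the text.
    """
    start = utterance.lower().find(value.lower())
    if start == -1:
        return None
    return start, start + len(value)


# Hypothetical example: a food value mentioned in a user utterance.
utterance = "Is there an Italian place nearby?"
span = find_span(utterance, "italian")
if span is not None:
    start, end = span
    print(rename_slot("pricerange"), "|", utterance[start:end])
```

A value that cannot be located in the utterance yields `None` here, which is one way a non-categorical value could end up without a span annotation.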
|
||
## Main changes | ||
### Supported Tasks and Leaderboards | ||
|
||
- domain is set to **restaurant** | ||
- ignore some rare pair | ||
- 3 values are not found in original utterances | ||
- **dontcare** values in non-categorical slots are calculated in `evaluate.py` so `da_match` in evaluation is lower than actual number. | ||
NLU, DST, Policy, NLG, E2E, User simulator | ||
|
||
## Original data | ||
### Languages | ||
|
||
camrest used in convlab2, included in `data/` path | ||
English | ||
|
||
### Data Splits | ||
|
||
| split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) | | ||
| ---------- | --------- | ---------- | ------- | ---------- | ----------- | --------------------- | -------------------- | ---------------------------- | ------------------------------- | | ||
| train | 406 | 3342 | 8.23 | 10.6 | 1 | 100 | 100 | 100 | 99.83 | | ||
| validation | 135 | 1076 | 7.97 | 11.26 | 1 | 100 | 100 | 100 | 100 | | ||
| test | 135 | 1070 | 7.93 | 11.01 | 1 | 100 | 100 | 100 | 100 | | ||
| all | 676 | 5488 | 8.12 | 10.81 | 1 | 100 | 100 | 100 | 99.9 | | ||
|
||
1 domains: ['restaurant'] | ||
- **cat slot match**: the percentage of categorical slot values that appear among the possible values in the ontology. | ||
- **non-cat slot span**: the percentage of non-categorical slot values that have a span annotation. | ||
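
As a rough illustration of the two percentages defined above, the sketch below computes them over toy inputs. The function names, the input layout (slot-value pairs, a dict of possible values, and annotation dicts with `start`/`end` keys), the rounding to two decimals, and the choice to report 100 for empty input are assumptions, not the repository's actual statistics code.

```python
from typing import Dict, Iterable, List, Tuple


def cat_slot_match(pairs: List[Tuple[str, str]],
                   possible_values: Dict[str, Iterable[str]]) -> float:
    """Percentage of categorical (slot, value) pairs whose value is listed
    among the possible values for that slot."""
    if not pairs:
        return 100.0  # assumption: nothing to mismatch
    hits = sum(value in possible_values.get(slot, ()) for slot, value in pairs)
    return round(100 * hits / len(pairs), 2)


def non_cat_slot_span(annotations: List[dict]) -> float:
    """Percentage of non-categorical values that carry both span offsets."""
    if not annotations:
        return 100.0  # assumption: nothing left unannotated
    hits = sum("start" in a and "end" in a for a in annotations)
    return round(100 * hits / len(annotations), 2)


# Illustrative values only; they are not taken from the dataset's ontology.
possible_values = {"price range": ["cheap", "moderate", "expensive"],
                   "area": ["north", "south", "east", "west", "centre"]}
pairs = [("price range", "cheap"), ("area", "north"), ("area", "downtown")]
annotations = [{"slot": "food", "value": "italian", "start": 12, "end": 19},
               {"slot": "name", "value": "some restaurant"}]

print(cat_slot_match(pairs, possible_values))  # 2 of 3 values are listed
print(non_cat_slot_span(annotations))          # 1 of 2 values has a span
```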
|
||
### Citation | ||
|
||
``` | ||
@inproceedings{wen-etal-2016-conditional, | ||
title = "Conditional Generation and Snapshot Learning in Neural Dialogue Systems", | ||
author = "Wen, Tsung-Hsien and Ga{\v{s}}i{\'c}, Milica and Mrk{\v{s}}i{\'c}, Nikola and Rojas-Barahona, Lina M. and Su, Pei-Hao and Ultes, Stefan and Vandyke, David and Young, Steve", | ||
booktitle = "Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing", | ||
month = nov, | ||
year = "2016", | ||
address = "Austin, Texas", | ||
publisher = "Association for Computational Linguistics", | ||
url = "https://aclanthology.org/D16-1233", | ||
doi = "10.18653/v1/D16-1233", | ||
pages = "2153--2162", | ||
} | ||
``` | ||
|
||
### Licensing Information | ||
|
||
[**CC BY 4.0**](https://creativecommons.org/licenses/by/4.0/) |
Binary file not shown.