Merge pull request #37 from ConvLab/unified_dataset
add camrest dataset in unified data format
zqwerty committed Mar 15, 2022
2 parents ea9b776 + d643960 commit 7700747
Showing 4 changed files with 2,537 additions and 296 deletions.
69 changes: 53 additions & 16 deletions data/unified_datasets/camrest/README.md
@@ -1,24 +1,61 @@
# README
# Dataset Card for Camrest

## Features
- **Repository:** https://www.repository.cam.ac.uk/handle/1810/260970
- **Paper:** https://aclanthology.org/D16-1233/
- **Leaderboard:** None
- **Who transforms the dataset:** Qi Zhu (zhuq96 at gmail dot com)

- Annotations: dialogue act, character-level span for non-categorical slots.
### Dataset Summary

Statistics:
The Cambridge restaurant dialogue dataset was collected for developing neural-network-based dialogue systems. Two papers were published on this dataset: (1) A Network-based End-to-End Trainable Task-oriented Dialogue System and (2) Conditional Generation and Snapshot Learning in Neural Dialogue Systems. The data was collected in a Wizard-of-Oz setup on Amazon MTurk. Each dialogue contains a goal label and several exchanges between a customer and the system. Each user turn is labelled with a set of slot-value pairs that give a coarse representation of the dialogue state (the `slu` field). There are 676 dialogues in total; most were completed, but some were not.

| | \# dialogues | \# utterances | avg. turns | avg. tokens | \# domains |
| ----- | ------------ | ------------- | ---------- | ----------- | ---------- |
| train | 406 | 2936 | 7.23 | 11.36 | 1 |
| dev | 135 | 941 | 6.97 | 11.99 | 1 |
| test | 135 | 935 | 6.93 | 11.87 | 1 |
- **How to get the transformed data from original data:**
  - Run `python preprocess.py` in the current directory. The original data must be available at `../../camrest/`.
- **Main changes of the transformation:**
  - Add dialogue act annotation according to the state change. This step was done by ConvLab-2, and the processed dialogue acts are used here.
  - Rename `pricerange` to `price range`.
  - Add character-level span annotation for non-categorical slots.
- **Annotations:**
  - user goal, dialogue acts, state.
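
The slot renaming and character-level span annotation listed above can be illustrated with a minimal sketch. This is not the code in `preprocess.py`; the helper names `rename_slot` and `find_span`, the case-insensitive matching, and the exclusive end offset are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def rename_slot(slot: str) -> str:
    """Map an original slot name to its renamed form (hypothetical helper).

    Only `pricerange` is renamed; every other slot name passes through.
    """
    return {"pricerange": "price range"}.get(slot, slot)


def find_span(utterance: str, value: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of `value` in `utterance`.

    Assumptions: matching is case-insensitive, `end` is exclusive, and
    only the first occurrence is reported. Returns None if not found.
    """
    start = utterance.lower().find(value.lower())
    if start == -1:
        return None
    return start, start + len(value)


utterance = "I want Italian food"
span = find_span(utterance, "italian")
if span is not None:
    start, end = span
    print(rename_slot("pricerange"), "|", utterance[start:end])
```

A value that cannot be located in the utterance yields `None`, which is one way a span-coverage figure below 100 could arise.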

## Main changes
### Supported Tasks and Leaderboards

- domain is set to **restaurant**
- ignore some rare pair
- 3 values are not found in original utterances
- **dontcare** values in non-categorical slots are calculated in `evaluate.py` so `da_match` in evaluation is lower than actual number.
NLU, DST, Policy, NLG, E2E, User simulator

## Original data
### Languages

camrest used in convlab2, included in `data/` path
English

### Data Splits

| split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
| ---------- | --------- | ---------- | ------- | ---------- | ----------- | --------------------- | -------------------- | ---------------------------- | ------------------------------- |
| train | 406 | 3342 | 8.23 | 10.6 | 1 | 100 | 100 | 100 | 99.83 |
| validation | 135 | 1076 | 7.97 | 11.26 | 1 | 100 | 100 | 100 | 100 |
| test | 135 | 1070 | 7.93 | 11.01 | 1 | 100 | 100 | 100 | 100 |
| all | 676 | 5488 | 8.12 | 10.81 | 1 | 100 | 100 | 100 | 99.9 |
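
As a quick consistency check on the table above, `avg_utt` appears to be the number of utterances divided by the number of dialogues, rounded to two decimals. The sketch below recomputes it from the table's own figures; the dictionary layout and the function name `avg_utt` are illustrative assumptions.

```python
# (dialogues, utterances, reported avg_utt), copied from the table above.
splits = {
    "train": (406, 3342, 8.23),
    "validation": (135, 1076, 7.97),
    "test": (135, 1070, 7.93),
    "all": (676, 5488, 8.12),
}


def avg_utt(dialogues: int, utterances: int) -> float:
    """Average utterances per dialogue, rounded to two decimals."""
    return round(utterances / dialogues, 2)


# Recompute each split's average and compare it with the reported value.
mismatches = [
    name
    for name, (dialogues, utterances, reported) in splits.items()
    if avg_utt(dialogues, utterances) != reported
]
print("mismatched splits:", mismatches)
```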

1 domain: ['restaurant']
- **cat slot match**: the percentage of categorical slot values that appear among the possible values in the ontology.
- **non-cat slot span**: the percentage of non-categorical slot values that have a span annotation.
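
The two metrics defined above could be computed roughly as follows. This is a sketch under stated assumptions, not the project's evaluation code: the function names, the `start`/`end` keys used to detect a span, and the choice to report 100 for an empty input are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def cat_slot_match(values: Iterable[str], possible_values: Set[str]) -> float:
    """Percentage of categorical values found among the ontology's possible values."""
    values = list(values)
    if not values:
        return 100.0  # assumption: nothing to check counts as a full match
    hits = sum(value in possible_values for value in values)
    return round(100 * hits / len(values), 2)


def non_cat_slot_span(annotations: List[Dict]) -> float:
    """Percentage of non-categorical values that carry a span annotation.

    Assumption: a span is present when both `start` and `end` keys exist.
    """
    if not annotations:
        return 100.0  # assumption: same convention as above
    hits = sum("start" in a and "end" in a for a in annotations)
    return round(100 * hits / len(annotations), 2)


# Toy data, not taken from the dataset.
areas = ["north", "south", "nowhere"]
acts = [
    {"slot": "food", "value": "italian", "start": 7, "end": 14},
    {"slot": "food", "value": "thai"},
]
print(cat_slot_match(areas, {"north", "south", "east", "west", "centre"}))
print(non_cat_slot_span(acts))
```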

### Citation

```
@inproceedings{wen-etal-2016-conditional,
title = "Conditional Generation and Snapshot Learning in Neural Dialogue Systems",
author = "Wen, Tsung-Hsien and Ga{\v{s}}i{\'c}, Milica and Mrk{\v{s}}i{\'c}, Nikola and Rojas-Barahona, Lina M. and Su, Pei-Hao and Ultes, Stefan and Vandyke, David and Young, Steve",
booktitle = "Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2016",
address = "Austin, Texas",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D16-1233",
doi = "10.18653/v1/D16-1233",
pages = "2153--2162",
}
```

### Licensing Information

[**CC BY 4.0**](https://creativecommons.org/licenses/by/4.0/)
Binary file modified data/unified_datasets/camrest/data.zip
Binary file not shown.