More work on documentation (#10)
Matt Henderson committed Mar 12, 2019
1 parent b67c2b6 commit ee0547b
Showing 9 changed files with 221 additions and 31 deletions.
1 change: 1 addition & 0 deletions BENCHMARKS.md
TODO: write results for baselines and encoder model
122 changes: 121 additions & 1 deletion README.md

# conversational-datasets

*A collection of large datasets for conversational response selection.*

This repository provides tools to create reproducible datasets for training and evaluating models of conversational response. This includes:

* [Reddit](reddit) - 3.7 billion comments structured in threaded conversations.
* [OpenSubtitles](opensubtitles) - over 400 million lines from movie and television subtitles (also available in other languages).
* [Amazon QA](amazon_qa) - almost 4 million question-response pairs in the context of Amazon products.

Machine learning methods work best with large datasets such as these. At PolyAI we train models of conversational response on huge conversational datasets, and then adapt these models to domain-specific tasks in conversational AI. This general approach of pre-training large models on huge datasets has long been popular in the image community, and is now taking off in the NLP community.

Rather than providing the raw processed data, we provide scripts and instructions to generate the data yourself. This allows you to view and potentially manipulate the pre-processing and filtering. The instructions define standard datasets that can be used to define reproducible evaluations in research papers.

## Benchmarks

Benchmark results for each of the datasets can be found in [`BENCHMARKS.md`](BENCHMARKS.md).

## Conversational Dataset Format

This repo contains scripts for creating datasets in a standard format -
any dataset in this format is referred to elsewhere as simply a
*conversational dataset*.

Datasets are stored as [tensorflow record files](https://www.tensorflow.org/tutorials/load_data/tf_records) containing serialized [tensorflow example](https://www.tensorflow.org/tutorials/load_data/tf_records#data_types_for_tfexample) protocol buffers.
The training set is stored as one collection of tensorflow record files, and
the test set as another.

The train/test split is always deterministic, so that whenever the dataset is generated, the same train/test split is created.
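For intuition, a common way to implement this kind of deterministic split (a sketch of the general idea only, not necessarily the exact scheme each script uses; the key shown is purely illustrative) is to hash a stable per-example key and compare it against the desired test fraction:

```python
import hashlib


def in_test_set(key, test_fraction=0.1):
    # Hashing a stable key gives the same assignment every time the dataset
    # is generated, so the train/test split never changes between runs.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < int(test_fraction * 100)


print(in_test_set("reddit/thread_abc/comment_123"))  # Always the same answer.
```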

Each tensorflow example contains a conversational context, and a response that goes with the context. For example:

```javascript
{
'context/1': "Hello, how are you?",
'context/0': "I am fine. And you?",
'context': "Great. What do you think of the weather?",
'response': "It doesn't feel like February."
}
```

Explicitly, each example contains a number of string features:

* A `context` feature, the most recent text in the conversational context.
* A `response` feature, the text that is in direct response to the `context`.
* A number of *extra context features*, `context/0`, `context/1` etc., going
back in time through the conversation. They are named in reverse order, so that `context/i` always refers to the `i^th` most recent extra context; this means no padding is needed, and datasets with different numbers of extra contexts can be mixed.

Depending on the dataset, there may be some extra features also included in
each example. E.g. in Reddit, the author of the context and response are
identified using additional features.
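For illustration, the snippet below reconstructs the full conversation in chronological order from a parsed example (a sketch only, assuming the features have been parsed into a plain Python dict of strings; `get_context_turns` is a hypothetical helper, not part of this repository):

```python
def get_context_turns(example):
    # Gather context/0, context/1, ... until one is missing.
    extra = []
    i = 0
    while "context/{}".format(i) in example:
        extra.append(example["context/{}".format(i)])
        i += 1
    # context/i is the i-th most recent extra context, so reverse the list
    # and append the most recent `context` feature last.
    return list(reversed(extra)) + [example["context"]]


example = {
    "context/1": "Hello, how are you?",
    "context/0": "I am fine. And you?",
    "context": "Great. What do you think of the weather?",
    "response": "It doesn't feel like February.",
}
assert get_context_turns(example) == [
    "Hello, how are you?",
    "I am fine. And you?",
    "Great. What do you think of the weather?",
]
```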

### Reading conversational datasets

The [`tools/tfrutil.py`](tools/tfrutil.py) script demonstrates how to
read a conversational dataset in Python, using tensorflow functions.

Below is some example tensorflow code for reading a conversational dataset
into a tensorflow graph:

```python
import tensorflow as tf

# One possible input pipeline (a sketch; adjust the file pattern, batch size
# and number of extra contexts to match your dataset).
num_extra_contexts = 10
batch_size = 100
pattern = "gs://your-bucket/dataset/train-*.tfrecords"

dataset = tf.data.TFRecordDataset(tf.gfile.Glob(pattern))
dataset = dataset.batch(batch_size)


def _parse_function(serialized_examples):
    parse_spec = {
        "context": tf.FixedLenFeature([], tf.string),
        "response": tf.FixedLenFeature([], tf.string),
    }
    parse_spec.update({
        "context/{}".format(i): tf.FixedLenFeature(
            [], tf.string, default_value="")
        for i in range(num_extra_contexts)
    })
    return tf.parse_example(serialized_examples, parse_spec)


dataset = dataset.map(_parse_function)
iterator = dataset.make_one_shot_iterator()
tensor_dict = iterator.get_next()
```

## Getting Started

Conversational datasets are created using [Apache Beam pipeline](https://beam.apache.org/) scripts, run on [Google Dataflow](https://cloud.google.com/dataflow/). Apache Beam requires Python 2.7, so set up a Python 2.7 virtual environment:

```
python2.7 -m virtualenv venv
. venv/bin/activate
pip install -r requirements.txt
```

The Dataflow scripts write conversational datasets to Google Cloud Storage, so you will need to [create a bucket](https://cloud.google.com/storage/docs/creating-buckets) to save the dataset to.
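For example, a bucket can be created with `gsutil` (a sketch; the bucket name and location are placeholders to replace with your own):

```
BUCKET="your-bucket"
gsutil mb -l us-central1 gs://${BUCKET?}
```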

Lastly, you will need to [set up authentication](
https://cloud.google.com/docs/authentication/getting-started) by creating a service account with access to Dataflow and Cloud Storage, and set `GOOGLE_APPLICATION_CREDENTIALS`:

```
export GOOGLE_APPLICATION_CREDENTIALS={{json file key location}}
```

This should be enough to follow the instructions for creating each individual dataset.

## Datasets

### Reddit

### OpenSubtitles

### Amazon QA

## Evaluation

Of course you may evaluate your models in any way you like.
However, when publishing results, we encourage you to include the
1-of-100 ranking accuracy, which is becoming a research community standard.

The 1-of-100 ranking accuracy is a *Recall@k* metric. In general, *Recall@k*
takes *N* responses to a given conversational context, where only one response is relevant, and indicates whether the relevant response occurs in the top *k* ranked candidate responses.
The 1-of-100 metric is obtained when *k=1* and *N=100*.
This effectively means that, for each query, we indicate whether the correct response is the top-ranked response among 100 candidates. The final score is the average across all queries.

The 1-of-100 metric is computed using random batches of 100 examples, so that the responses from other examples in the batch are used as random negative candidates. This allows for efficiently computing the metric across many examples in batches. While it is not guaranteed that the random negatives will indeed be 'true' negatives, it still provides a useful evaluation signal that correlates with downstream tasks.

The following tensorflow code shows how this metric can be computed for a dot-product style encoder model, where the score for each context and response is a dot product between corresponding vectors:

```python
# Encode the contexts and responses as vectors using tensorflow ops.
# The following are both [100, encoding_size] matrices.
context_encodings = _encode_contexts(tensor_dict['context'])
response_encodings = _encode_responses(tensor_dict['response'])

scores = tf.matmul(
    context_encodings, response_encodings,
    transpose_b=True)  # A [100, 100] matrix.

batch_size = tf.shape(context_encodings)[0]

accuracy_1_of_100 = tf.metrics.accuracy(
    labels=tf.range(batch_size),
    predictions=tf.argmax(scores, 1)
)
```

See also the [baselines](baselines) for example code computing the 1-of-100 metric.

The following papers use *Recall@k* in the context of retrieval-based dialogue:

* [*Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus*](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698), Lowe et al. Dialogue and Discourse 2017.

* [*Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network*](http://aclweb.org/anthology/P18-1103), Zhou et al. ACL 2018.

* [*Improving Response Selection in Multi-turn Dialogue Systems by Incorporating Domain Knowledge*](http://aclweb.org/anthology/K18-1048),
Chaudhuri et al. CoNLL 2018.

The following papers use the 1-of-100 ranking accuracy in particular:

* [*Conversational Contextual Cues: The Case of Personalization and History for Response Ranking*](http://arxiv.org/abs/1606.00372), Al-Rfou et al. arXiv pre-print 2016.

* [*Efficient Natural Language Response Suggestion for Smart Reply*](http://arxiv.org/abs/1705.00652), Henderson et al. arXiv pre-print 2017.

* [*Universal Sentence Encoder*](https://arxiv.org/abs/1803.11175), Cer et al. arXiv pre-print 2018.

* [*Learning Semantic Textual Similarity from Conversations*](http://aclweb.org/anthology/W18-3022), Yang et al. Workshop on Representation Learning for NLP 2018.



## Citations

## Contributing
31 changes: 23 additions & 8 deletions amazon_qa/README.md
# Amazon QA Data

This dataset is based on a corpus extracted by McAuley et al., who scraped questions and answers from Amazon. The dataset is described at http://jmcauley.ucsd.edu/data/amazon/qa/ as well as in the following papers:

*Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems*. Mengting Wan, Julian McAuley. International Conference on Data Mining (ICDM), 2016. [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/icdm16c.pdf)

*Addressing complex and subjective product-related queries with customer reviews*. Julian McAuley, Alex Yang. World Wide Web (WWW), 2016. [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/www16b.pdf)

The script in this directory processes this corpus, filters long and short texts, and creates a conversational dataset.

## Statistics

Below are some statistics of the conversational dataset:

* Input files: 38
* Number of QA dictionaries: 1,569,513

# Create the conversational dataset

Below are instructions for generating the Amazon QA conversational dataset.

## Downloading Amazon QA dataset

First you must download the input data from http://jmcauley.ucsd.edu/data/amazon/qa/. In total there are 38 `.json.gz` files to download. Unzip them all and copy them to your Google Cloud Storage bucket:

```
cd amazon_qa
gunzip *
BUCKET="your-bucket"
gsutil -m cp -r * gs://${BUCKET?}/amazon_qa/raw/
```

Note that while the files are named `.json`, they are not actually valid
JSON, but rather Python dictionaries in string format.
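For example, each line of a raw file can be parsed with `ast.literal_eval` rather than `json.loads` (a sketch, assuming one dictionary per line; the file name is just an example):

```python
import ast

# Each line of the raw files is a Python dictionary literal, not JSON.
with open("qa_Appliances.json") as f:
    for line in f:
        qa_dict = ast.literal_eval(line)
        print(sorted(qa_dict.keys()))
```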

## Run the dataflow script

Run the following command to process the raw input data into a conversational
dataset:

```
PROJECT="your-google-cloud-project"
DATADIR="gs://${BUCKET?}/amazon_qa/$(date +"%Y%m%d")"
python amazon_qa/create_data.py \
  --file_pattern gs://${BUCKET?}/amazon_qa/raw/* \
  --output_dir ${DATADIR} \
  --runner DataflowRunner --temp_location ${DATADIR}/temp \
  --staging_location ${DATADIR}/staging \
  --project ${PROJECT?}
```

Once the above is running, you can continue to monitor it in the terminal, or quit the process and follow the running job on the
[dataflow admin page](https://console.cloud.google.com/dataflow).

Please confirm that the statistics reported on the dataflow job page agree with the statistics reported above, to ensure you have a correct version of the dataset.

The dataset will be saved in the `$DATADIR` directory as sharded train and test sets: `gs://your-bucket/amazon_qa/YYYYMMDD/train-*-of-00100.tfrecords` and
`gs://your-bucket/amazon_qa/YYYYMMDD/test-*-of-00010.tfrecords`.

You can then use [`tools/tfrutil.py`](/tools/tfrutil.py) to inspect the files. For example:

```
python tools/tfrutil.py pp ${DATADIR?}/test-00000-of-00010.tfrecords
```
1 change: 1 addition & 0 deletions baselines/README.md
TODO: add code for baselines.
File renamed without changes.
50 changes: 41 additions & 9 deletions opensubtitles/README.md
# OpenSubtitles Data

This dataset uses movie and television subtitles data from OpenSubtitles. The
script in this directory uses the corpus collected by Lison and Tiedemann. See http://opus.nlpl.eu/OpenSubtitles-v2018.php and the following citation:

*OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.* P. Lison and J. Tiedemann. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

The data is available in 62 different languages.

Consecutive lines in the subtitle data are used to create conversational examples.
There is no guarantee that different lines correspond to different
speakers, but the data nevertheless contains a lot of interesting examples
for modelling the mapping from conversational contexts to responses.

The script filters short and long lines, and strips some text such as
character names and auditory description text.
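As a rough illustration of how examples are formed (a sketch of the idea only; the real filtering and cleaning logic lives in the script in this directory), consecutive lines pair up into context/response examples:

```python
lines = [
    "Hello.",
    "How are you?",
    "I'm fine, thanks.",
]

# Each line is treated as a response to the line before it; speaker turns
# are not annotated in the subtitle data.
examples = [
    {"context": context, "response": response}
    for context, response in zip(lines, lines[1:])
]
```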

## Statistics

Below are statistics for the English dataset:

* Input files: 4,415
* Number of examples: 320,233,703
* Train set size: 286,655,424

Typical metrics for the Dataflow job:

* Elapsed time: 25m (225 workers)


# Create the conversational dataset

Below are instructions for creating the conversational dataset from the
OpenSubtitles corpus.

## Download the OpenSubtitles data

First, download monolingual raw text data for the target language.

Visit http://opus.nlpl.eu/OpenSubtitles-v2018.php, and find the *Statistics and TMX/Moses Downloads* table. Click on the language ID in the first column
to get the monolingual plain text file (untokenized).

For English the correct link is:

http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz
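For example, the English file can be downloaded with `wget` (a sketch; the output file name simply matches what the commands below expect):

```
wget -O en.txt.gz "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz"
```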

Extract the data, split it into shards, and upload them to your Google Cloud Storage bucket:

```
gunzip -k en.txt.gz
BUCKET="your-bucket"
gsutil -m cp -r lines gs://${BUCKET?}/opensubtitles/raw/
```

Note that the exact split command is important, as the train/test split is
computed using the file names.

## Run the dataflow script

Now you can run the dataflow script to read the text files and generate
conversational examples:

```
PROJECT="your-google-cloud-project"
python opensubtitles/create_data.py \
  --project ${PROJECT?}
```

Once the above is running, you can continue to monitor it in the terminal, or quit the process and follow the running job on the
[dataflow admin page](https://console.cloud.google.com/dataflow).

Please confirm that the statistics reported on the dataflow job page agree with the statistics reported above, to ensure you have a correct version of the dataset.

The dataset will be saved in the `$DATADIR` directory as sharded train and test sets: `gs://your-bucket/opensubtitles/YYYYMMDD/train-*-of-01000.tfrecords` and
`gs://your-bucket/opensubtitles/YYYYMMDD/test-*-of-00100.tfrecords`.

You can then use [`tools/tfrutil.py`](/tools/tfrutil.py) to inspect the files. For example:

```
python tools/tfrutil.py pp ${DATADIR?}/test-00000-of-00100.tfrecords
```
