More work on documentation (#10)
Matt Henderson committed Mar 12, 2019
1 parent b67c2b6 commit ee0547b
Showing 9 changed files with 221 additions and 31 deletions.
1 change: 1 addition & 0 deletions BENCHMARKS.md
TODO: write results for baselines and encoder model
122 changes: 121 additions & 1 deletion README.md

# conversational-datasets

*A collection of large datasets for conversational response selection.*

This repository provides tools to create reproducible datasets for training and evaluating models of conversational response. This includes:

* [Reddit](reddit) - 3.7 billion comments structured in threaded conversations.
* [OpenSubtitles](opensubtitles) - over 400 million lines from movie and television subtitles (also available in other languages).
* [Amazon QA](amazon_qa) - almost 4 million question-response pairs in the context of Amazon products.

Machine learning methods work best with large datasets such as these. At PolyAI we train models of conversational response on huge conversational datasets, and then adapt these models to domain-specific tasks in conversational AI. This general approach of pre-training large models on huge datasets has long been popular in the image community, and is now taking off in the NLP community.

Rather than providing the raw processed data, we provide scripts and instructions to generate the data yourself. This allows you to view and potentially manipulate the pre-processing and filtering. The instructions define standard datasets that can be used to define reproducible evaluations in research papers.

## Benchmarks

Benchmark results for each of the datasets can be found in [`BENCHMARKS.md`](BENCHMARKS.md).

## Conversational Dataset Format

This repo contains scripts for creating datasets in a standard format -
any dataset in this format is referred to elsewhere as simply a
*conversational dataset*.

Datasets are stored as [tensorflow record files](https://www.tensorflow.org/tutorials/load_data/tf_records) containing serialized [tensorflow example](https://www.tensorflow.org/tutorials/load_data/tf_records#data_types_for_tfexample) protocol buffers.
The training set is stored as one collection of tensorflow record files, and
the test set as another.

The train/test split is always deterministic, so that whenever the dataset is generated, the same train/test split is created.
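For intuition, a common way to implement this kind of deterministic split (a sketch of the general idea only, not necessarily the exact scheme each script uses; the key shown is purely illustrative) is to hash a stable per-example key and compare it against the desired test fraction:

```python
import hashlib


def in_test_set(key, test_fraction=0.1):
    # Hashing a stable key gives the same assignment every time the dataset
    # is generated, so the train/test split never changes between runs.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < int(test_fraction * 100)


print(in_test_set("reddit/thread_abc/comment_123"))  # Always the same answer.
```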

Each tensorflow example contains a conversational context, and a response that goes with the context. For example:

```javascript
{
'context/1': "Hello, how are you?",
'context/0': "I am fine. And you?",
'context': "Great. What do you think of the weather?",
'response': "It doesn't feel like February."
}
```

Explicitly, each example contains a number of string features:

* A `context` feature, the most recent text in the conversational context.
* A `response` feature, the text that is in direct response to the `context`.
* A number of *extra context features*, `context/0`, `context/1` etc., going
back in time through the conversation. They are named in reverse order, so that `context/i` always refers to the `i^th` most recent extra context; this means no padding is needed, and datasets with different numbers of extra contexts can be mixed.

Depending on the dataset, there may be some extra features also included in
each example. E.g. in Reddit, the author of the context and response are
identified using additional features.
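For illustration, the snippet below reconstructs the full conversation in chronological order from a parsed example (a sketch only, assuming the features have been parsed into a plain Python dict of strings; `get_context_turns` is a hypothetical helper, not part of this repository):

```python
def get_context_turns(example):
    # Gather context/0, context/1, ... until one is missing.
    extra = []
    i = 0
    while "context/{}".format(i) in example:
        extra.append(example["context/{}".format(i)])
        i += 1
    # context/i is the i-th most recent extra context, so reverse the list
    # and append the most recent `context` feature last.
    return list(reversed(extra)) + [example["context"]]


example = {
    "context/1": "Hello, how are you?",
    "context/0": "I am fine. And you?",
    "context": "Great. What do you think of the weather?",
    "response": "It doesn't feel like February.",
}
assert get_context_turns(example) == [
    "Hello, how are you?",
    "I am fine. And you?",
    "Great. What do you think of the weather?",
]
```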

### Reading conversational datasets

The [`tools/tfrutil.py`](tools/tfrutil.py) script demonstrates how to
read a conversational dataset in Python, using tensorflow functions.

Below is some example tensorflow code for reading a conversational dataset
into a tensorflow graph:

```python
import tensorflow as tf

# One possible input pipeline (a sketch; adjust the file pattern, batch size
# and number of extra contexts to match your dataset).
num_extra_contexts = 10
batch_size = 100
pattern = "gs://your-bucket/dataset/train-*.tfrecords"

dataset = tf.data.TFRecordDataset(tf.gfile.Glob(pattern))
dataset = dataset.batch(batch_size)


def _parse_function(serialized_examples):
    parse_spec = {
        "context": tf.FixedLenFeature([], tf.string),
        "response": tf.FixedLenFeature([], tf.string),
    }
    parse_spec.update({
        "context/{}".format(i): tf.FixedLenFeature(
            [], tf.string, default_value="")
        for i in range(num_extra_contexts)
    })
    return tf.parse_example(serialized_examples, parse_spec)


dataset = dataset.map(_parse_function)
iterator = dataset.make_one_shot_iterator()
tensor_dict = iterator.get_next()
```

## Getting Started

Conversational datasets are created using [Apache Beam pipeline](https://beam.apache.org/) scripts, run on [Google Dataflow](https://cloud.google.com/dataflow/). Apache Beam requires Python 2.7, so set up a Python 2.7 virtual environment:

```
python2.7 -m virtualenv venv
. venv/bin/activate
pip install -r requirements.txt
```

The Dataflow scripts write conversational datasets to Google Cloud Storage, so you will need to [create a bucket](https://cloud.google.com/storage/docs/creating-buckets) to save the dataset to.
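For example, a bucket can be created with `gsutil` (a sketch; the bucket name and location are placeholders to replace with your own):

```
BUCKET="your-bucket"
gsutil mb -l us-central1 gs://${BUCKET?}
```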

Lastly, you will need to [set up authentication](
https://cloud.google.com/docs/authentication/getting-started) by creating a service account with access to Dataflow and Cloud Storage, and set `GOOGLE_APPLICATION_CREDENTIALS`:

```
export GOOGLE_APPLICATION_CREDENTIALS={{json file key location}}
```

This should be enough to follow the instructions for creating each individual dataset.

## Datasets

### Reddit

### OpenSubtitles

### Amazon QA

## Evaluation

Of course you may evaluate your models in any way you like.
However, when publishing results, we encourage you to include the
1-of-100 ranking accuracy, which is becoming a research community standard.

The 1-of-100 ranking accuracy is a *Recall@k* metric. In general, *Recall@k*
takes *N* responses to a given conversational context, where only one response is relevant, and indicates whether the relevant response occurs in the top *k* ranked candidate responses.
The 1-of-100 metric is obtained when *k=1* and *N=100*.
This effectively means that, for each query, we indicate whether the correct response is the top-ranked response among 100 candidates. The final score is the average across all queries.

The 1-of-100 metric is computed using random batches of 100 examples, so that the responses from other examples in the batch are used as random negative candidates. This allows for efficiently computing the metric across many examples in batches. While it is not guaranteed that the random negatives will indeed be 'true' negatives, it still provides a useful evaluation signal that correlates with downstream tasks.

The following tensorflow code shows how this metric can be computed for a dot-product style encoder model, where the score for each context and response is a dot product between corresponding vectors:

```python
# Encode the contexts and responses as vectors using tensorflow ops.
# The following are both [100, encoding_size] matrices.
context_encodings = _encode_contexts(tensor_dict['context'])
response_encodings = _encode_responses(tensor_dict['response'])

scores = tf.matmul(
    context_encodings, response_encodings,
    transpose_b=True)  # A [100, 100] matrix.

batch_size = tf.shape(context_encodings)[0]

accuracy_1_of_100 = tf.metrics.accuracy(
    labels=tf.range(batch_size),
    predictions=tf.argmax(scores, 1)
)
```

See also the [baselines](baselines) for example code computing the 1-of-100 metric.

The following papers use *Recall@k* in the context of retrieval-based dialogue:

* [*Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus*](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698), Lowe et al. Dialogue and Discourse 2017.

* [*Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network*](http://aclweb.org/anthology/P18-1103), Zhou et al. ACL 2018.

* [*Improving Response Selection in Multi-turn Dialogue Systems by Incorporating Domain Knowledge*](http://aclweb.org/anthology/K18-1048),
Chaudhuri et al. CoNLL 2018.

The following papers use the 1-of-100 ranking accuracy in particular:

* [*Conversational Contextual Cues: The Case of Personalization and History for Response Ranking*](http://arxiv.org/abs/1606.00372), Al-Rfou et al. arXiv pre-print 2016.

* [*Efficient Natural Language Response Suggestion for Smart Reply*](http://arxiv.org/abs/1705.00652), Henderson et al. arXiv pre-print 2017.

* [*Universal Sentence Encoder*](https://arxiv.org/abs/1803.11175), Cer et al. arXiv pre-print 2018.

* [*Learning Semantic Textual Similarity from Conversations*](http://aclweb.org/anthology/W18-3022), Yang et al. Workshop on Representation Learning for NLP 2018.



## Citations

## Contributing
31 changes: 23 additions & 8 deletions amazon_qa/README.md
# Amazon QA Data

This dataset is based on a corpus extracted by McAuley et al., who scraped questions and answers from Amazon. The dataset is described at http://jmcauley.ucsd.edu/data/amazon/qa/ as well as in the following papers:

*Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems*. Mengting Wan, Julian McAuley. International Conference on Data Mining (ICDM), 2016. [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/icdm16c.pdf)

*Addressing complex and subjective product-related queries with customer reviews*. Julian McAuley, Alex Yang. World Wide Web (WWW), 2016. [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/www16b.pdf)

The script in this directory processes this corpus, filters long and short texts, and creates a conversational dataset.

## Statistics

Below are some statistics of the conversational dataset:

* Input files: 38
* Number of QA dictionaries: 1,569,513

# Create the conversational dataset

Below are instructions for generating the Amazon QA conversational dataset.

## Downloading Amazon QA dataset

First you must download the input data from http://jmcauley.ucsd.edu/data/amazon/qa/. In total there are 38 `.json.gz` files to download. Unzip them all and copy them to your Google Cloud Storage bucket:

```
cd amazon_qa
gunzip *
BUCKET="your-bucket"
gsutil -m cp -r * gs://${BUCKET?}/amazon_qa/raw/
```

Note that while the files are named `.json`, they are not actually valid
JSON, but rather Python dictionaries in string format.
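For example, each line of a raw file can be parsed with `ast.literal_eval` rather than `json.loads` (a sketch, assuming one dictionary per line; the file name is just an example):

```python
import ast

# Each line of the raw files is a Python dictionary literal, not JSON.
with open("qa_Appliances.json") as f:
    for line in f:
        qa_dict = ast.literal_eval(line)
        print(sorted(qa_dict.keys()))
```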

## Run the dataflow script

Run the following command to process the raw input data into a conversational
dataset:

```
PROJECT="your-google-cloud-project"
DATADIR="gs://${BUCKET?}/amazon_qa/$(date +"%Y%m%d")"
python amazon_qa/create_data.py \
  --file_pattern gs://${BUCKET?}/amazon_qa/raw/* \
  --output_dir ${DATADIR} \
  --runner DataflowRunner --temp_location ${DATADIR}/temp \
  --staging_location ${DATADIR}/staging \
  --project ${PROJECT?}
```

Once the above is running, you can continue to monitor it in the terminal, or quit the process and follow the running job on the
[dataflow admin page](https://console.cloud.google.com/dataflow).

Please confirm that the statistics reported on the dataflow job page agree with the statistics reported above, to ensure you have a correct version of the dataset.

The dataset will be saved in the `$DATADIR` directory as sharded train and test sets: `gs://your-bucket/amazon_qa/YYYYMMDD/train-*-of-00100.tfrecords` and
`gs://your-bucket/amazon_qa/YYYYMMDD/test-*-of-00010.tfrecords`.

You can then use [`tools/tfrutil.py`](/tools/tfrutil.py) to inspect the files. For example:

```
python tools/tfrutil.py pp ${DATADIR?}/test-00000-of-00010.tfrecords
```
1 change: 1 addition & 0 deletions baselines/README.md
TODO: add code for baselines.
File renamed without changes.
50 changes: 41 additions & 9 deletions opensubtitles/README.md
# OpenSubtitles Data

This dataset uses movie and television subtitles data from OpenSubtitles. The
script in this directory uses the corpus collected by Lison and Tiedemann. See http://opus.nlpl.eu/OpenSubtitles-v2018.php and the following citation:

*OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.* P. Lison and J. Tiedemann. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

The data is available in 62 different languages.

Consecutive lines in the subtitle data are used to create conversational examples.
There is no guarantee that different lines correspond to different
speakers, but the data nevertheless contains a lot of interesting examples
for modelling the mapping from conversational contexts to responses.

The script filters short and long lines, and strips some text such as
character names and auditory description text.
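As a rough illustration of how examples are formed (a sketch of the idea only; the real filtering and cleaning logic lives in the script in this directory), consecutive lines pair up into context/response examples:

```python
lines = [
    "Hello.",
    "How are you?",
    "I'm fine, thanks.",
]

# Each line is treated as a response to the line before it; speaker turns
# are not annotated in the subtitle data.
examples = [
    {"context": context, "response": response}
    for context, response in zip(lines, lines[1:])
]
```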

## Statistics

Below are statistics for the English dataset:

* Input files: 4,415
* Number of examples: 320,233,703
* Train set size: 286,655,424

Typical metrics for the Dataflow job:

* Elapsed time: 25m (225 workers)


# Create the conversational dataset

Below are instructions for creating the conversational dataset from the
OpenSubtitles corpus.

## Download the OpenSubtitles data

First, download monolingual raw text data for the target language.

Visit http://opus.nlpl.eu/OpenSubtitles-v2018.php, and find the *Statistics and TMX/Moses Downloads* table. Click on the language ID in the first column
to get the monolingual plain text file (untokenized).

For English the correct link is:

http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz
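For example, the English file can be downloaded with `wget` (a sketch; the output file name simply matches what the commands below expect):

```
wget -O en.txt.gz "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz"
```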

Extract the data, split it into shards, and upload them to your Google Cloud Storage bucket:

```
gunzip -k en.txt.gz
BUCKET="your-bucket"
gsutil -m cp -r lines gs://${BUCKET?}/opensubtitles/raw/
```

Note that the exact split command is important, as the train/test split is
computed using the file names.

## Run the dataflow script

Now you can run the dataflow script to read the text files and generate
conversational examples:

```
PROJECT="your-google-cloud-project"
python opensubtitles/create_data.py \
  --project ${PROJECT?}
```

Once the above is running, you can continue to monitor it in the terminal, or quit the process and follow the running job on the
[dataflow admin page](https://console.cloud.google.com/dataflow).

Please confirm that the statistics reported on the dataflow job page agree with the statistics reported above, to ensure you have a correct version of the dataset.

The dataset will be saved in the `$DATADIR` directory as sharded train and test sets: `gs://your-bucket/opensubtitles/YYYYMMDD/train-*-of-01000.tfrecords` and
`gs://your-bucket/opensubtitles/YYYYMMDD/test-*-of-00100.tfrecords`.

You can then use [`tools/tfrutil.py`](/tools/tfrutil.py) to inspect the files. For example:

```
python tools/tfrutil.py pp ${DATADIR?}/test-00000-of-00100.tfrecords
```
