Tools for using Maluuba's NewsQA Dataset (public version)
Latest commit b0601a6 May 22, 2018

Maluuba NewsQA

Tools for using Maluuba's news question and answer data.

You can find more information about the dataset here.

Data Description

The combined dataset is organized into columns that hold the story text and the answers derived from several crowdsourcers.

Column Name: Description

  • story_id: The identifier for the story. Comes from the member name in the CNN stories package.
  • story_text: The text for the story.
  • question: A question about the story.
  • answer_char_ranges (in combined-newsqa-data-*.csv): The raw data collected for character based indices to answers in story_text. E.g. 196:228|196:202,217:228|None. Answers from different crowdsourcers are separated by |; within each answer, multiple selections from the same crowdsourcer are separated by ,. None means the crowdsourcer thought there was no answer to the question in the story. The start is inclusive and the end is exclusive. The end may point to whitespace after a token.
  • answer_token_ranges (in newsqa-data-tokenized-*.csv): Word based indices to answers in story_text. E.g. 196:202,217:228. Multiple selections from the same answer are separated by ,. The start is inclusive and the end is exclusive. The end may point to whitespace after a token.
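The answer_char_ranges format described above can be unpacked with a few lines of Python. This is a minimal sketch, not code from this repository: the function names parse_answer_char_ranges and extract_answers are hypothetical, and representing a None answer as an empty list is a choice made here for illustration.

```python
def parse_answer_char_ranges(raw):
    """Parse a raw answer_char_ranges value into one list of spans per crowdsourcer."""
    per_crowdsourcer = []
    for part in raw.split('|'):
        if part == 'None':
            # This crowdsourcer thought the story had no answer.
            per_crowdsourcer.append([])
            continue
        spans = []
        for selection in part.split(','):
            start, end = selection.split(':')
            spans.append((int(start), int(end)))
        per_crowdsourcer.append(spans)
    return per_crowdsourcer


def extract_answers(story_text, spans):
    """Return the answer text for each (start, end) character span."""
    # Start is inclusive and end is exclusive, which matches Python slicing.
    # The end may point to whitespace after a token, so strip it.
    return [story_text[start:end].strip() for start, end in spans]


ranges = parse_answer_char_ranges('196:228|196:202,217:228|None')
print(ranges)
# [[(196, 228)], [(196, 202), (217, 228)], []]
```

Each inner list holds the selections of one crowdsourcer, so extract_answers(story_text, ranges[1]) would return the two pieces of text selected by the second crowdsourcer.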

There are some other fields in combined-newsqa-data-*.csv for raw data collected when crowdsourcing such as the validation of collected data.

Requirements

Run either the Docker steps or do the manual set up.

The dataset for character based indices will be the combined-newsqa-data-*.csv file in the root.

The dataset for token based indices will be maluuba/newsqa/newsqa-data-tokenized-*.csv.
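For the token based file, answers can be recovered by indexing into the list of tokens. The sketch below is illustrative only: the function names are hypothetical, and it assumes that tokens in the tokenized story_text are separated by single spaces, which you should check against your own generated file.

```python
def parse_token_ranges(raw):
    """Parse a value such as '196:202,217:228' into (start, end) token index pairs."""
    spans = []
    for selection in raw.split(','):
        start, end = selection.split(':')
        spans.append((int(start), int(end)))
    return spans


def tokens_for_answer(story_text, raw_ranges):
    """Return the text of each selected token span."""
    # ASSUMPTION: tokens in the tokenized story_text are separated by single spaces.
    tokens = story_text.split(' ')
    # Start is inclusive and end is exclusive, as with Python slices.
    return [' '.join(tokens[start:end]) for start, end in parse_token_ranges(raw_ranges)]


story = 'The cat sat on the mat .'
print(tokens_for_answer(story, '1:3,4:6'))
# ['cat sat', 'the mat']
```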

(Recommended) Docker Set Up

These steps handle packaging the dataset and running the tests.

  • Clone this repo.
  • Download the questions and answers from here to the maluuba/newsqa folder. No need to extract anything.
  • Download the CNN stories from here to the maluuba/newsqa folder (for legal and technical reasons, we can't distribute this to you).
  • In the root of this repo, run:
docker build -t maluuba/newsqa .
docker run --rm -it -v ${PWD}:/usr/src/newsqa --name newsqa maluuba/newsqa

You now have the datasets. See combined-newsqa-data-*.csv or maluuba/newsqa/newsqa-data-tokenized-*.csv.
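Once the files exist, they can be read with any CSV tool. The following is a minimal sketch using only the Python 3 standard library. The helper names find_dataset and read_rows are hypothetical, and UTF-8 encoding is an assumption about the packaged files.

```python
import csv
import glob
import io


def find_dataset(pattern):
    """Return the first file matching a glob pattern, or None if there is none."""
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def read_rows(path, limit=None):
    """Read rows as dicts keyed by column name, optionally stopping after `limit` rows."""
    rows = []
    # ASSUMPTION: the packaged CSV is UTF-8 encoded.
    with io.open(path, encoding='utf-8', newline='') as f:
        for i, row in enumerate(csv.DictReader(f)):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return rows


# Run from the root of the repo, where the combined file is written.
path = find_dataset('combined-newsqa-data-*.csv')
if path is not None:
    for row in read_rows(path, limit=3):
        print(row['story_id'], row['question'])
```

The same helpers should work for the tokenized file by passing 'maluuba/newsqa/newsqa-data-tokenized-*.csv' as the pattern.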

Troubleshooting Docker Set Up

If you run into issues, such as the tokenization not unpacking, you may need to give Docker at least 4 GB of memory.

Manual Set Up

  • Clone this repo.
  • Download the questions and answers from here to the maluuba/newsqa folder. No need to extract anything.
  • Download the CNN stories from here to the maluuba/newsqa folder (for legal and technical reasons, we can't distribute this to you).
  • Use Python 2.7 to package the dataset. Python 2.7 was originally used to handle the stories, and they got encoded strangely; once the dataset is packaged by these scripts, you should be able to load the files with whatever tools you'd like. You can create a Conda environment like so:
conda create --name newsqa python=2.7 "pandas>=0.19.2"
  • Install the requirements in your environment:
conda activate newsqa && pip install --requirement requirements.txt
  • (Optional - Tokenization) To tokenize the data, you must install a JDK (Java Development Kit) so that you can compile and run Java code.
  • (Optional - Tokenization) To tokenize the data, you must get some JAR files. We use some libraries from Stanford. You just need to put the English option of version 3.6.0 in the maluuba/newsqa folder.

Package the Dataset

Tokenize and Split

To tokenize the dataset and split it into train, dev, and test sets that match the paper, run:

python maluuba/newsqa/data_generator.py

The warnings from the tokenizer are normal.

Testing

To make sure that everything was extracted correctly, run:

python -m unittest discover .

All tests should pass.

PEP8

The code in this repository complies with PEP8 standards with a maximum line length of 99 characters.

Legal

Notice: CNN articles are used here by permission from The Cable News Network (CNN). CNN does not waive any rights of ownership in its articles and materials. CNN is not a partner of, nor does it endorse, Maluuba or its activities.

Terms: See LICENSE.txt.