# Chapter 6: Write

## Question-Answer Generator

The goal of this notebook is to build a model that generates question-answer pairs from a block of text. This project is based on the [`qgen-workshop` TensorFlow codebase](https://github.com/Maluuba/qgen-workshop). The model consists of two components:

- An RNN that identifies candidate answers in a block of text.

- An encoder-decoder network that generates a question for each candidate answer identified by the first model.

An _encoder-decoder_ network is a type of RNN architecture that maps an input sequence to a new output sequence. Applications include machine translation, question generation, and text summarization. The encoder RNN compresses the input sequence into a vector, and the decoder RNN generates a novel sequence from that vector.
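
To make the encode-then-decode flow concrete, here is a minimal sketch in plain Python. It is not the model used in this notebook: the vocabulary, the hidden size, and every function name are hypothetical, and the weights are random rather than trained, so the generated tokens are meaningless. It only illustrates how the encoder folds a sequence into one vector and how the decoder unrolls a new sequence from it.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and hidden size, chosen only for illustration.
VOCAB = ["<sos>", "<eos>", "who", "what", "is", "the", "answer"]
HIDDEN = 4  # size of the hidden state / context vector


def rand_matrix(rows, cols):
    """Small random weights; a trained model would learn these."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_hid, x, h):
    """One vanilla RNN step: h' = tanh(W_in x + W_hid h)."""
    a = matvec(w_in, x)
    b = matvec(w_hid, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def one_hot(index):
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]


# Untrained, randomly initialised parameters (illustrative only).
enc_in, enc_hid = rand_matrix(HIDDEN, len(VOCAB)), rand_matrix(HIDDEN, HIDDEN)
dec_in, dec_hid = rand_matrix(HIDDEN, len(VOCAB)), rand_matrix(HIDDEN, HIDDEN)
dec_out = rand_matrix(len(VOCAB), HIDDEN)


def encode(token_ids):
    """Fold the whole input sequence into one fixed-size vector."""
    h = [0.0] * HIDDEN
    for t in token_ids:
        h = rnn_step(enc_in, enc_hid, one_hot(t), h)
    return h


def decode(context, max_len=5):
    """Greedily emit tokens, starting from the encoder's vector."""
    h, token, output = context, VOCAB.index("<sos>"), []
    for _ in range(max_len):
        h = rnn_step(dec_in, dec_hid, one_hot(token), h)
        scores = matvec(dec_out, h)
        token = scores.index(max(scores))  # pick the highest-scoring token
        if VOCAB[token] == "<eos>":
            break
        output.append(VOCAB[token])
    return output


source = [VOCAB.index(w) for w in ["the", "answer", "is"]]
context = encode(source)
generated = decode(context)
```

The only link between the two halves is `context`: the decoder never sees the input tokens directly, which is why the encoder's final state has to summarise the whole sequence.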

## Question-Answer Dataset

### Set Up

Below we download and preprocess the data for the model. The data comes from the [Maluuba News QA GitHub repository](https://github.com/Maluuba/newsqa). We follow the repository's manual setup instructions; the relevant code is below:

In [1]:
!git clone https://github.com/Maluuba/newsqa

Cloning into 'newsqa'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 132 (delta 4), reused 4 (delta 2), pack-reused 119
Receiving objects: 100% (132/132), 610.20 KiB | 2.45 MiB/s, done.
Resolving deltas: 100% (53/53), done.


For legal reasons, you have to download the data yourself from [Microsoft's website](https://msropendata.com/datasets/939b1042-6402-4697-9c15-7a28de7e1321) and state what you are using the data for. To avoid re-uploading the data each time we get a new Colab kernel, we store it in Google Drive and mount the drive below.

In [0]:
# Mount drive.

from google.colab import drive

drive.mount('/content/gdrive/')
base_dir = '/content/gdrive/My Drive/gdl_models/qa/'

In [0]:
import os
import subprocess

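def load_or_copy(filename):
  """Copy a file from Drive if cached there; otherwise cache the local copy in Drive."""
  drive_path = os.path.join(base_dir, filename)
  if os.path.isfile(drive_path):
    # Cached copy found: bring it into the working directory.
    print('File exists in drive')
    subprocess.call(['cp', drive_path, '.'])
  else:
    # No cached copy: assume the file was uploaded locally and save it to Drive.
    print('File exists locally')
    subprocess.call(['cp', filename, drive_path])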
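
The same caching pattern can be sketched with `shutil` instead of shelling out to `cp`, which also lets the helper fail loudly when the file is missing from both places. This is an illustrative variant, not the notebook's code: `sync_with_cache` and its parameters are hypothetical names, and temporary directories stand in for Drive and the Colab working directory.

```python
import os
import shutil
import tempfile


def sync_with_cache(filename, cache_dir, work_dir="."):
    """Restore `filename` from the cache if present, otherwise seed the cache.

    Returns "cache" when the file was restored from `cache_dir`, and
    "local" when the local copy was pushed into `cache_dir`.
    """
    cache_path = os.path.join(cache_dir, filename)
    local_path = os.path.join(work_dir, filename)
    if os.path.isfile(cache_path):
        shutil.copy(cache_path, local_path)
        return "cache"
    if not os.path.isfile(local_path):
        raise FileNotFoundError(
            f"{filename} is in neither {cache_dir} nor {work_dir}"
        )
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copy(local_path, cache_path)
    return "local"


# Temporary directories stand in for Drive and the Colab working directory.
with tempfile.TemporaryDirectory() as fake_drive, \
        tempfile.TemporaryDirectory() as fake_colab:
    with open(os.path.join(fake_colab, "example.txt"), "w") as f:
        f.write("data")
    first = sync_with_cache("example.txt", fake_drive, fake_colab)   # seeds the cache
    second = sync_with_cache("example.txt", fake_drive, fake_colab)  # restores from it
```

Returning a status string instead of printing makes the helper easy to check in a test, and raising `FileNotFoundError` avoids the silent failure that `cp` would produce on a fresh kernel with no uploaded file.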

In [24]:
load_or_copy('newsqa.tar.gz')

File exists in drive


In [0]:
!mv newsqa.tar.gz newsqa/maluuba/newsqa/newsqa.tar.gz

Now we also download the CNN stories we will use to train the model. You can download the stories [here](https://cs.nyu.edu/~kcho/DMQA/).

In [30]:
load_or_copy('cnn_stories.tgz')

File exists in drive


In [0]:
!mv cnn_stories.tgz newsqa/maluuba/newsqa/cnn_stories.tgz

Now we upload the Java dependency, the Stanford POS tagger, which can be found [here](https://nlp.stanford.edu/software/stanford-postagger-2015-12-09.zip).

In [132]:
load_or_copy('stanford-postagger-2015-12-09.zip')

File exists in drive


In [0]:
!cp -r stanford-postagger-2015-12-09.zip newsqa/maluuba/newsqa/

Now we run the data processing script in the repository, then run the repository's tests to make sure the data was processed correctly.

In [0]:
!cd newsqa && python2 maluuba/newsqa/data_generator.py

In [139]:
!cd newsqa && python2 -m unittest discover .

[INFO] 2020-05-04 00:25:51,936 - data_processing.py::__init__
Loading dataset from `/content/newsqa/maluuba/newsqa/newsqa-data-v1.csv`...
[INFO] 2020-05-04 00:25:51,936 - data_processing.py::load_combined
Loading data from `/content/newsqa/maluuba/newsqa/newsqa-data-v1.csv`...
[INFO] 2020-05-04 00:25:52,456 - data_processing.py::__init__
Loading stories from `/content/newsqa/maluuba/newsqa/cnn_stories.tgz`...
Getting story texts: 100% 12.7k/12.7k [00:12<00:00, 1.05k stories/s] 
Setting story texts: 100% 120k/120k [00:03<00:00, 37.0k questions/s] 
[INFO] 2020-05-04 00:26:07,792 - data_processing.py::__init__
Done loading dataset.
Checking for possible corruption: 100% 120k/120k [00:01<00:00, 103k questions/s]
.[INFO] 2020-05-04 00:26:09,045 - data_processing.py::dump
Packaging dataset to `/content/newsqa/combined-newsqa-data-v1.json`.
Building json: 100% 120k/120k [00:05<00:00, 21.7k questions/s] 
Checking for possible corruption: 100% 12.7k/12.7k [00:00<00:00, 18.7k stories/s]
Gatherin
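
The two shell commands above do not stop the notebook if a step fails. One hedged alternative is to run each step through `subprocess.run` with `check=True`, so that a non-zero exit code raises an exception. The helper name `run_step` is hypothetical, and the command below is a stand-in so that the sketch runs anywhere; in this notebook the commands would be the data generator and the `unittest` discovery, run with `cwd='newsqa'`.

```python
import subprocess
import sys


def run_step(command, cwd=None):
    """Run a command and return its stdout; raise CalledProcessError on failure."""
    result = subprocess.run(
        command, cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


# Stand-in command (assumption): replace with the real processing commands.
output = run_step([sys.executable, "-c", "print('done')"])
```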

Now let's save the data to Drive.

In [0]:
def upload_to_drive(filepath):
  """Copy a file to Drive."""
  drive_path = base_dir + filepath.split('/')[-1]
  subprocess.call(['cp', filepath, drive_path])

In [0]:
upload_to_drive('newsqa/split_data/train.csv')

In [0]:
upload_to_drive('newsqa/split_data/test.csv')

In [0]:
upload_to_drive('newsqa/split_data/dev.csv')
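
The three uploads above can also be written as one loop that confirms each file arrived. The sketch below assumes nothing about Drive itself: `copy_to_dir` is a hypothetical helper, and temporary directories with placeholder files stand in for `newsqa/split_data` and the Drive folder.

```python
import os
import shutil
import tempfile


def copy_to_dir(filepaths, target_dir):
    """Copy each file into target_dir and return the destination paths."""
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for path in filepaths:
        destination = os.path.join(target_dir, os.path.basename(path))
        shutil.copy(path, destination)
        copied.append(destination)
    return copied


# Placeholder files stand in for the real train/test/dev splits.
with tempfile.TemporaryDirectory() as split_dir, \
        tempfile.TemporaryDirectory() as fake_drive:
    paths = []
    for name in ("train.csv", "test.csv", "dev.csv"):
        path = os.path.join(split_dir, name)
        with open(path, "w") as f:
            f.write("placeholder\n")
        paths.append(path)
    copied = copy_to_dir(paths, fake_drive)
    # Verify that every copy exists before the temporary directories vanish.
    all_present = all(os.path.isfile(p) for p in copied)
    copied_names = sorted(os.path.basename(p) for p in copied)
```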