### Build an Arc-I Model

<img src="https://raw.githubusercontent.com/NTMC-Community/MatchZoo/2.0/artworks/matchzoo-logo.png" alt="logo" style="width:600px;float: center"/>

This tutorial shows how to train the *Arc-I* model [Hu et al. 2014](http://papers.nips.cc/paper/5550-convolutional-neural-network-architectures-for-matching-natural-language-sentences.pdf) with [MatchZoo](https://github.com/faneshion/MatchZoo). We use [WikiQA](https://aclweb.org/anthology/D15-1237) as the example benchmark dataset.

*Walking through this notebook takes approximately 90 minutes.*

-------

**TL;DR**

The following code block illustrates the main workflow for training an Arc-I model.

```python
from matchzoo import preprocessor
from matchzoo import generators
from matchzoo import models

train, test = ... # prepare your training data and test data.

arci_preprocessor = preprocessor.ArcIPreprocessor()
processed_tr = arci_preprocessor.fit_transform(train, stage='train')
processed_te = arci_preprocessor.fit_transform(test, stage='predict')


generator_tr = generators.PointGenerator(processed_tr)
generator_te = generators.PointGenerator(processed_te)
# Example, train with generator, test with the first batch.
X_te, y_te = generator_te[0]

arci_model = models.ArcIModel()
arci_model.guess_and_fill_missing_params()
arci_model.build()
arci_model.compile()
arci_model.fit_generator(generator_tr)
# Make predictions
predictions = arci_model.predict([X_te.text_left, X_te.text_right])
```

-----

MatchZoo expects a list of *quintuples* as training data:

```python
train = [('qid0', 'did0', 'query 0', 'document 0', 'label 0'),
         ('qid0', 'did1', 'query 0', 'document 1', 'label 1'),
          ...,
         ('qid1', 'did2', 'query 1', 'document 2', 'label 3')]
```

The corresponding columns are `(text_left_id, text_right_id, text_left, text_right, label)`. For information retrieval tasks, *text_left* is referred to as the *query* and *text_right* as the *document*.

For the test case, MatchZoo expects a list of *quadruples* (labels are not needed) as input:

```python
test = [('qid9', 'did5', 'query 9', 'document 5'),
         ...,
        ('qid2', 'did7', 'query 2', 'document 7')]
```
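
The only difference between the two formats is the trailing label column. As a minimal sketch (the helper `strip_labels` is hypothetical and not part of MatchZoo), labelled quintuples can be turned into unlabelled quadruples like this:

```python
# Hypothetical helper (not part of MatchZoo): drop the label column so that
# labelled quintuples can be reused as unlabelled test quadruples.
def strip_labels(quintuples):
    """Return (id_left, id_right, text_left, text_right) for every record."""
    quadruples = []
    for record in quintuples:
        if len(record) != 5:
            raise ValueError("expected 5 fields, got {}: {!r}".format(len(record), record))
        quadruples.append(tuple(record[:4]))
    return quadruples


labelled = [('qid0', 'did0', 'query 0', 'document 0', 1),
            ('qid0', 'did1', 'query 0', 'document 1', 0)]
unlabelled = strip_labels(labelled)
print(unlabelled[0])
```

Raising on records with the wrong number of fields catches malformed rows early, before they reach the preprocessor.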

### Table of Contents

+ Prepare WikiQA dataset
    - Download
    - Load
    - Adjustment
+ Preprocessing
+ Data Generator
+ Model Training
    - Initialize
    - Hyper-Parameters
    - Make Prediction
    - Model Persistence
+ Reference

### Prepare WikiQA dataset

#### Download

We take WikiQA as the example benchmark dataset to show the usage of MatchZoo. First, you need to download and uncompress the data. The following script downloads the dataset into the `MatchZoo/data/WikiQA` folder; you can change the directory in the script.

If you already have WikiQA dataset downloaded on your machine, skip the following script.

In [None]:
!rm -rf ../../data/WikiQA
!mkdir -p ../../data/WikiQA
!wget -P ../../data/WikiQA https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
!unzip -d ../../data/WikiQA ../../data/WikiQA/WikiQACorpus.zip
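
Whether the download can be skipped can also be checked programmatically. The following is a minimal sketch, assuming the three file names used later in this tutorial; the helper `wikiqa_is_present` is hypothetical.

```python
from pathlib import Path

# The three file names are the ones this tutorial reads later on.
EXPECTED_FILES = ('WikiQA-train.tsv', 'WikiQA-dev.tsv', 'WikiQA-test.tsv')


def wikiqa_is_present(corpus_dir):
    """Return True if every expected WikiQA file exists in `corpus_dir`."""
    corpus_dir = Path(corpus_dir)
    return all((corpus_dir / name).is_file() for name in EXPECTED_FILES)


if wikiqa_is_present('../../data/WikiQA/WikiQACorpus'):
    print('WikiQA found, the download can be skipped.')
else:
    print('WikiQA not found, run the download script.')
```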

#### Load & Adjustment

The *train/dev/test* files of WikiQA are *WikiQA-train.tsv*, *WikiQA-dev.tsv*, *WikiQA-test.tsv* under the uncompressed folder WikiQACorpus. The data format of WikiQA is as follows:

`QuestionID\tQuestion\tDocumentID\tDocumentTitle\tSentenceID\tSentence\tLabel`

We can convert this format to the expected input format of MatchZoo.

In [2]:
data_folder = '../../data/WikiQA/WikiQACorpus/'

def read_data(path, stage):
    """Read a WikiQA TSV file into MatchZoo-style tuples."""
    output_list = []
    with open(path, encoding="utf-8") as fin:
        next(fin, None)  # Skip the header row.
        for line in fin:
            # Strip the line ending so the last column (the label) stays clean.
            tok = line.rstrip('\r\n').split('\t')
            if stage == 'test':
                output_list.append((tok[0], tok[4], tok[1], tok[5]))  # qid, did, q, d
            else:
                output_list.append((tok[0], tok[4], tok[1], tok[5], tok[6]))  # qid, did, q, d, label
    return output_list

train = read_data(data_folder + 'WikiQA-train.tsv', stage='train')
dev   = read_data(data_folder + 'WikiQA-dev.tsv', stage='dev')
test  = read_data(data_folder + 'WikiQA-test.tsv', stage='test')
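
As an alternative sketch, the same column mapping can be expressed with the standard `csv` module, which addresses columns by header name instead of position. The sample rows below are made up for illustration, and converting the label to `int` is a choice of this sketch rather than something the original `read_data` does.

```python
import csv
import io

# Two made-up rows in the WikiQA column layout (header + data); not real corpus content.
sample = (
    "QuestionID\tQuestion\tDocumentID\tDocumentTitle\tSentenceID\tSentence\tLabel\n"
    "Q1\thow are caves formed\tD1\tCave\tD1-0\tA cave is a natural void.\t0\n"
    "Q1\thow are caves formed\tD1\tCave\tD1-1\tCaves form by weathering of rock.\t1\n"
)


def parse_wikiqa(fin, with_label=True):
    """Yield MatchZoo-style tuples from an open WikiQA TSV file."""
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters.
    reader = csv.DictReader(fin, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        record = (row['QuestionID'], row['SentenceID'], row['Question'], row['Sentence'])
        if with_label:
            record += (int(row['Label']),)
        yield record


rows = list(parse_wikiqa(io.StringIO(sample)))
print(rows[1])
```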

### Preprocessing


#### Download

We first download the pre-trained GloVe word embeddings. Be patient: depending on your network connection, this can take a long time. Alternatively, you can download the file with another tool and put **glove.840B.300d.txt** in the directory **../../data/embedding**.

In [3]:
!mkdir -p ../../data/embedding
!wget -O ../../data/embedding/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip 
!unzip -o ../../data/embedding/glove.840B.300d.zip -d ../../data/embedding



#### Preprocessing with pre-trained embeddings

The preprocessor runs in three lines of code.  
To use **randomly initialized embeddings**, keep the default value of *embedding_file*. The preprocessing code is then:
```python
from matchzoo import preprocessor
arci_preprocessor = preprocessor.ArcIPreprocessor()
processed_tr = arci_preprocessor.fit_transform(train, stage='train')
processed_te = arci_preprocessor.fit_transform(test, stage='predict')
```

In [4]:
# Initialize an Arc-I preprocessor with the pre-trained GloVe embeddings.
from matchzoo import preprocessor
arci_preprocessor = preprocessor.ArcIPreprocessor(embedding_file='../../data/embedding/glove.840B.300d.txt')
processed_tr = arci_preprocessor.fit_transform(train, stage='train')
processed_te = arci_preprocessor.fit_transform(test, stage='predict')

Using TensorFlow backend.
Start building vocabulary & fitting parameters.
2118it [00:00, 3250.69it/s]
18841it [00:08, 2100.09it/s]
Some words are not shown in term_index(29924). Total number of words are 29925.
2196017it [00:55, 39852.02it/s]
Start processing input data for train stage.
2118it [00:00, 2754.38it/s]
18841it [00:10, 1820.31it/s]
Start processing input data for predict stage.
633it [00:00, 2620.76it/s]
5961it [00:03, 1834.15it/s]


**What is `processed_tr`?**

`processed_tr` is a **MatchZoo DataPack** data structure (see `matchzoo/datapack.py`). It contains:

1. A two-column `pandas DataFrame` called `left` that holds the pre-processed left records: `id_left` and the processed `text_left`.
2. A two-column `pandas DataFrame` called `right` that holds the pre-processed right records: `id_right` and the processed `text_right`.
3. A two-column `pandas DataFrame` called `relation` that holds the index mapping between `id_left` and `id_right`.
4. A `context` property (a dictionary) that holds all the parameters fitted during pre-processing.
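
To make the layout concrete, here is a minimal sketch in which plain dictionaries stand in for the three DataFrames. The ids and token indices are made up, and this is an illustration of the idea rather than MatchZoo's actual DataPack implementation.

```python
# Plain dictionaries stand in for the three DataFrames; the layout is an
# illustration, not MatchZoo's actual DataPack implementation.
left = {'Q1': [12, 7, 3]}                       # id_left -> processed text_left
right = {'D1-0': [5, 9], 'D1-1': [3, 7, 12]}    # id_right -> processed text_right
relation = [('Q1', 'D1-0'), ('Q1', 'D1-1')]     # (id_left, id_right) pairs

# Storing each text once and joining through `relation` avoids repeating
# the query for every candidate document.
pairs = [(left[id_left], right[id_right]) for id_left, id_right in relation]
print(pairs)
```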


The `fit_transform` method is a sequential combination of two methods:

1. Fit the parameters with the `fit` function. This only happens when `stage='train'`.
2. Transform the data into the expected format.

Because nothing is fitted in the predict stage, the preprocessing code for the test data can also be written as:

```python
processed_te = arci_preprocessor.transform(test, stage='predict')
```
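
The fit/transform split can be sketched with a toy class. `ToyPreprocessor` is a hypothetical stand-in, not MatchZoo code; it assumes whitespace tokenization and reserves index 0 for unknown words.

```python
class ToyPreprocessor:
    """Hypothetical stand-in that mimics the fit/transform split described above."""

    def __init__(self):
        self.context = {}

    def fit(self, texts):
        # Learn a vocabulary; index 0 is reserved here for unknown words.
        vocab = sorted({word for text in texts for word in text.split()})
        self.context['term_index'] = {word: i + 1 for i, word in enumerate(vocab)}
        return self

    def transform(self, texts):
        term_index = self.context['term_index']
        return [[term_index.get(word, 0) for word in text.split()] for text in texts]

    def fit_transform(self, texts, stage):
        # Parameters are only fitted in the train stage.
        if stage == 'train':
            self.fit(texts)
        return self.transform(texts)


toy = ToyPreprocessor()
train_ids = toy.fit_transform(['deep text matching', 'text ranking'], stage='train')
test_ids = toy.fit_transform(['deep ranking models'], stage='predict')
print(train_ids, test_ids)
```

Because the vocabulary is fitted on the training texts only, the word `models` in the test text maps to the unknown index.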

As described, the fitted parameters are stored in the `context` property. To access the context, call:

```python
print(processed_tr.context)
```
An example:

In [5]:
print('vocab size: ', len(processed_tr.context['term_index']))

vocab size:  29924


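**What has been stored in the `context`?**

We stored `term_index` and `input_shapes` in the context property. When a pre-trained embedding file is used, the embedding matrix is also available as `embedding_mat`, which is used later in this tutorial.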


**What is `arci_preprocessor` actually doing?**

The `arci_preprocessor` calls a sequence of `process_units`. Each `process_unit` is designed to perform one atomic operation on the input data. For instance, `arci_preprocessor` calls:


1. TokenizeUnit: Perform tokenization on raw input data.
2. LowercaseUnit: Transform all tokens into lower case.
3. PuncRemovalUnit: Remove all punctuation.
4. StopRemovalUnit: Remove all stop words.

Depending on whether pre-trained embeddings are used, the preprocessor either loads the embeddings from an embedding file or initializes them randomly.
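
The chain of units can be sketched as a list of small functions applied in order. The functions below are simplified stand-ins for the units named above, not MatchZoo's implementations, and the stop word list is made up.

```python
import string

# A tiny, made-up stop word list; MatchZoo's own list is not reproduced here.
STOP_WORDS = {'a', 'an', 'the', 'is', 'of', 'how', 'are'}


def tokenize(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


def remove_punctuation(tokens):
    table = str.maketrans('', '', string.punctuation)
    stripped = (token.translate(table) for token in tokens)
    # Drop tokens that consisted of punctuation only.
    return [token for token in stripped if token]


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def run_units(text, units):
    """Apply each unit in order, feeding the output of one into the next."""
    result = text
    for unit in units:
        result = unit(result)
    return result


units = [tokenize, lowercase, remove_punctuation, remove_stop_words]
print(run_units('How are glacier caves formed ?', units))
```

Keeping each unit to one operation makes it easy to reorder the chain or drop a step, for example to keep stop words for short queries.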


----

### Data Generation

For memory efficiency, we expect you to use a **generator** to produce batches of data on the fly. For example, we can create a **PointGenerator** as follows:

In [6]:
from matchzoo import generators
from matchzoo import tasks
generator_tr = generators.PointGenerator(inputs=processed_tr, task=tasks.Ranking(), batch_size=64, stage='train')
generator_te = generators.PointGenerator(inputs=processed_te, task=tasks.Ranking(), batch_size=64, stage='predict')

To get the first batch of training data, call `X_train, y_train = generator_tr[0]`.
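
Indexing a generator to get a batch can be sketched with a small class that implements `__len__` and `__getitem__`. `ToyPointGenerator` is hypothetical and only illustrates the access pattern; MatchZoo's generators also handle tasks, stages, and the DataPack format.

```python
import math


class ToyPointGenerator:
    """Hypothetical indexable batch source; `generator[i]` returns batch i."""

    def __init__(self, pairs, labels, batch_size):
        self.pairs = pairs
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.pairs) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.pairs[start:stop], self.labels[start:stop]


pairs = [('q0', 'd0'), ('q0', 'd1'), ('q1', 'd2'), ('q1', 'd3'), ('q2', 'd4')]
labels = [1, 0, 0, 1, 0]
generator = ToyPointGenerator(pairs, labels, batch_size=2)
X_train, y_train = generator[0]
print(len(generator), X_train, y_train)
```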

**What is PointGenerator?**

**PointGenerator** follows the pointwise approach: it is assumed that each query-document pair in the training data has a numerical or ordinal score. The problem can then be approximated by a regression or classification problem: given a single query-document pair, predict its score.

Many existing supervised machine learning algorithms can be used directly for this purpose. Ordinal regression and classification algorithms also fit the pointwise approach when they predict the score of a single query-document pair and that score takes a small, finite number of values.

**What is PairGenerator?**

In the pairwise approach, the problem is approximated by a classification problem: learning a binary classifier that can tell which document is better in a given pair of documents.

In MatchZoo, **PairGenerator** generates one positive and `num_neg` negative examples per pair. For example, to train an Arc-I model for document ranking, we use `num_neg=4`.
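
The idea of one positive plus `num_neg` negatives can be sketched as follows. This is a hypothetical illustration, not MatchZoo's sampling code: it assumes a document is positive when its label is greater than zero, and it samples negatives with replacement.

```python
import random


def sample_pairs(relations, num_neg, seed=0):
    """Build (query, positive, negatives) groups from (query, doc, label) rows.

    Hypothetical sketch: a document counts as positive when label > 0.
    """
    rng = random.Random(seed)
    by_query = {}
    for query, doc, label in relations:
        bucket = by_query.setdefault(query, {'pos': [], 'neg': []})
        bucket['pos' if label > 0 else 'neg'].append(doc)

    groups = []
    for query, docs in by_query.items():
        if not docs['neg']:
            continue  # Nothing to contrast the positive with.
        for positive in docs['pos']:
            # Sample with replacement so short negative lists still work.
            negatives = [rng.choice(docs['neg']) for _ in range(num_neg)]
            groups.append((query, positive, negatives))
    return groups


relations = [('q0', 'd0', 1), ('q0', 'd1', 0), ('q0', 'd2', 0),
             ('q1', 'd3', 0), ('q1', 'd4', 0)]
groups = sample_pairs(relations, num_neg=4)
print(groups)
```

Queries without any positive document, such as `q1` here, produce no training groups.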

**What is ListGenerator?**

This generator supports the listwise approach, which tries to directly optimize the value of the evaluation measures, averaged over all queries in the training data.

Choose the appropriate generator based on your `task`.

----

### Train Your Arc-I Model

To train an Arc-I model, we need to create an instance of `ArcIModel`:

In [7]:
from matchzoo import models
arci_model = models.ArcIModel()

Then we need to set the hyper-parameters of our Arc-I model. In general, there are **two types of hyper-parameters**:

**Required parameters**: For Arc-I, if you are using pre-trained embeddings, you must set the embedding matrix (`embedding_mat`) and its size manually.

In [8]:
# The fitted parameters are stored in the `context` property of the processed training data.
from matchzoo import losses
from matchzoo import tasks
arci_model.params['task'] = tasks.Ranking()
arci_model.embedding_mat = processed_tr.context['embedding_mat']
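
How an embedding matrix relates to `term_index` can be sketched as below. This is a hypothetical illustration, not MatchZoo's loader: it assumes row 0 is a zero padding row (consistent with the all-zero first row in the printed parameters and with `vocab_size` being one larger than `len(term_index)`), and it fills words missing from the embedding file with uniform random values.

```python
import random


def build_embedding_matrix(term_index, embedding_lines, dim, random_scale=0.2, seed=0):
    """Return one vector per index in 0..max(term_index.values()).

    Hypothetical sketch: row 0 (padding) stays zero, words found in the
    embedding file get their pre-trained vector, and the rest are drawn
    uniformly from [-random_scale, random_scale].
    """
    rng = random.Random(seed)
    pretrained = {}
    for line in embedding_lines:
        parts = line.rstrip().split(' ')
        if len(parts) == dim + 1:  # Skip malformed lines.
            pretrained[parts[0]] = [float(value) for value in parts[1:]]

    size = max(term_index.values()) + 1
    matrix = [[0.0] * dim for _ in range(size)]
    for word, index in term_index.items():
        if word in pretrained:
            matrix[index] = pretrained[word]
        else:
            matrix[index] = [rng.uniform(-random_scale, random_scale) for _ in range(dim)]
    return matrix


# Made-up 3-dimensional vectors in the "word v1 v2 ..." text layout used by GloVe files.
glove_lines = ['cave 0.1 0.2 0.3', 'rock -0.4 0.5 0.6']
term_index = {'cave': 1, 'rock': 2, 'glacier': 3}
embedding_mat = build_embedding_matrix(term_index, glove_lines, dim=3)
print(len(embedding_mat), len(embedding_mat[0]))
```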

**Tunable parameters**: For Arc-I, you're allowed to tune these parameters:

```python
from matchzoo import tasks

params = {
    'input_shapes': [(32,), (32,)],  # Lengths of the two texts to match.
    'optimizer': 'adam',  # Optimizer to use. See keras optimizers.
    'trainable_embedding': False,  # Whether to fine-tune the embeddings.
    'num_blocks': 1,  # Number of convolution layers.
    'left_kernel_count': [32],  # Number of filters in convolution. See keras Conv1D.
    'left_kernel_size': [3],  # The length of the 1D convolution window. See keras Conv1D.
    'right_kernel_count': [32],  # Number of filters in convolution. See keras Conv1D.
    'right_kernel_size': [3],  # The length of the 1D convolution window. See keras Conv1D.
    'activation': 'relu',  # Activation function in convolution. See keras Conv1D.
    'left_pool_size': [2],  # Size of the max pooling windows for the left text. See keras MaxPooling1D.
    'right_pool_size': [2],  # Size of the max pooling windows for the right text. See keras MaxPooling1D.
    'padding': 'same',  # Padding mode. See keras padding in Conv1D.
    'dropout_rate': 0.0,  # Probability of an element to be zeroed. See keras Dropout.
    'embedding_random_scale': 0.2,  # The range of the randomly initialized embedding.
    'task': tasks.Classification,  # Default Classification, you can use tasks.Ranking.
    'loss': 'categorical_crossentropy',  # categorical_crossentropy, see keras loss.
    'metric': 'acc',  # Accuracy by default, see keras metric.
}
```
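
To see how the kernel, pooling, and padding settings interact, the sequence length after each block can be worked out by hand. The sketch below assumes the Keras defaults of stride 1 for `Conv1D` and non-overlapping, unpadded windows for `MaxPooling1D`, and it assumes that each branch is flattened after the last block; the helper names are hypothetical.

```python
def conv1d_length(length, kernel_size, padding):
    """Output length of a stride-1 1D convolution."""
    if padding == 'same':
        return length
    if padding == 'valid':
        return length - kernel_size + 1
    raise ValueError('unsupported padding: {}'.format(padding))


def pool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size


def branch_features(length, kernel_counts, kernel_sizes, pool_sizes, padding):
    """Flattened feature count after stacking conv + pool blocks on one text."""
    for kernel_size, pool_size in zip(kernel_sizes, pool_sizes):
        length = pool1d_length(conv1d_length(length, kernel_size, padding), pool_size)
    return length * kernel_counts[-1]


left = branch_features(32, [32], [3], [2], 'same')
print(left)  # 16 positions x 32 filters = 512
```

With the settings listed above, each text of length 32 keeps its length through the convolution, is halved by pooling, and yields 512 features per branch under these assumptions.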

In [9]:
arci_model.guess_and_fill_missing_params()
print('arci parameters: ', arci_model.params)

arci parameters:  name                          ArcIModel
model_class                   <class 'matchzoo.models.arci_model.ArcIModel'>
input_shapes                  [(32,), (32,)]
task                          <matchzoo.tasks.ranking.Ranking object at 0x7f5ffb7c0f98>
metrics                       ['mae']
loss                          mse
optimizer                     adam
trainable_embedding           False
embedding_dim                 300
vocab_size                    29925
num_blocks                    1
left_kernel_count             [32]
left_kernel_size              [3]
right_kernel_count            [32]
right_kernel_size             [3]
activation                    relu
left_pool_size                [2]
right_pool_size               [2]
padding                       same
dropout_rate                  0.0
embedding_mat                 [[ 0.          0.          0.         ...  0.          0.
   0.        ]
 [-0.42932     0.047699   -0.31364    ... -0.44607    -0.056797
  -0.6577 

#### Model Training

Once all the parameters are set, build, compile, and train the model:

In [10]:
arci_model.build()
arci_model.compile()
# Fit the arci model on generator.
arci_model.fit_generator(generator_tr, steps_per_epoch=200, epochs=10)
# Make predictions on the first batch of test data
X_te, y_te = generator_te[0]
predictions = arci_model.predict([X_te.text_left, X_te.text_right])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Make Prediction

In [11]:
for id_left, id_right, pred in zip(X_te.id_left, X_te.id_right, predictions):
    print("{}/{} is predicted as {}".format(id_left, id_right, pred))

Q1842/D674-2 is predicted as [0.07940455]
Q2073/D1956-0 is predicted as [0.11562713]
Q153/D153-3 is predicted as [0.15267068]
Q1572/D1490-4 is predicted as [-0.01647591]
Q2953/D2730-2 is predicted as [0.04182901]
Q1918/D1812-7 is predicted as [0.06530898]
Q1118/D1076-5 is predicted as [0.03028291]
Q607/D596-3 is predicted as [0.1093749]
Q2663/D2486-5 is predicted as [-0.12719628]
Q802/D776-0 is predicted as [0.02008557]
Q2383/D2237-2 is predicted as [0.18150336]
Q2677/D2053-15 is predicted as [0.13989908]
Q1714/D1625-8 is predicted as [0.02511153]
Q1163/D240-9 is predicted as [0.12640688]
Q1707/D1618-2 is predicted as [-0.00618813]
Q1389/D1326-5 is predicted as [-0.01539764]
Q91/D91-2 is predicted as [-0.01772615]
Q519/D511-1 is predicted as [0.22273019]
Q724/D704-13 is predicted as [-0.04181942]
Q2822/D2621-1 is predicted as [0.0660775]
Q2575/D2407-23 is predicted as [-0.02538727]
Q1799/D1701-8 is predicted as [0.21721181]
Q1211/D1160-7 is predicted as [0.03142788]
Q2723/D2538-4 is pr
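
For a ranking task, the scores are usually compared within each query. The following minimal sketch groups predictions by `id_left` and sorts the documents by descending score. The ids and scores are made up, and scores are passed as plain floats (with the model output above, that would mean taking `pred[0]` first).

```python
from collections import defaultdict


def rank_by_query(id_lefts, id_rights, scores):
    """Group documents by query and sort them by descending score."""
    grouped = defaultdict(list)
    for id_left, id_right, score in zip(id_lefts, id_rights, scores):
        grouped[id_left].append((id_right, float(score)))
    return {
        query: sorted(docs, key=lambda item: item[1], reverse=True)
        for query, docs in grouped.items()
    }


# Made-up ids and scores, shaped like the printed predictions.
ranking = rank_by_query(
    ['Q1', 'Q1', 'Q2', 'Q1'],
    ['D1-0', 'D1-1', 'D2-0', 'D1-2'],
    [0.08, 0.22, -0.01, 0.15],
)
print(ranking['Q1'])
```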

#### Model Persistence

You can persist your trained model with `model.save()` and restore it with the `load_model` function:

```python
from matchzoo import engine
# Save the model to dir.
arci_model.save('/your-model-saved-path')
# And load the model from dir.
loaded_model = engine.load_model('/your-model-saved-path')
```

### Reference

[Hu et al. 2014] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. "Convolutional neural network architectures for matching natural language sentences." In Advances in neural information processing systems (NIPS), pp. 2042-2050. 2014.