### Build an Arc-I Model

<img src="https://raw.githubusercontent.com/NTMC-Community/MatchZoo/2.0/artworks/matchzoo-logo.png" alt="logo" style="width:600px;float: center"/>

This tutorial shows how to train the *Arc-I* model ([Hu et al. 2014](http://papers.nips.cc/paper/5550-convolutional-neural-network-architectures-for-matching-natural-language-sentences.pdf)) with [MatchZoo](https://github.com/faneshion/MatchZoo) for a **classification task**. We use [QuoraQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) as the example benchmark dataset.

*Walking through this notebook takes approximately 90 minutes.*

-------


**TL;DR**

The following code block illustrates the main workflow for training an Arc-I model.

```python
from matchzoo import preprocessor
from matchzoo import generators
from matchzoo import models

train, test = ... # prepare your training data and test data.

arci_preprocessor = preprocessor.ArcIPreprocessor()
processed_tr = arci_preprocessor.fit_transform(train, stage='train')
processed_te = arci_preprocessor.fit_transform(test, stage='predict')


generator_tr = generators.PointGenerator(processed_tr)
generator_te = generators.PointGenerator(processed_te)
# Example, train with generator, test with the first batch.
X_te, y_te = generator_te[0]

arci_model = models.ArcIModel()
arci_model.guess_and_fill_missing_params()
arci_model.build()
arci_model.compile()
arci_model.fit_generator(generator_tr)
# Make predictions
predictions = arci_model.predict([X_te.text_left, X_te.text_right])
```

-----

MatchZoo expects a list of *quintuples* as training data:

```python
train = [('qid0', 'did0', 'query 0', 'document 0', 'label 0'),
         ('qid0', 'did1', 'query 0', 'document 1', 'label 1'),
          ...,
         ('qid1', 'did2', 'query 1', 'document 2', 'label 3')]
```

The corresponding columns are `(text_left_id, text_right_id, text_left, text_right, label)`. In an Information Retrieval task, *text_left* is referred to as the *query* and *text_right* as the *document*.

For the test case, MatchZoo expects a list of *quadruples* (labels are not needed) as input:

```python
test = [('qid9', 'did5', 'query 9', 'document 5'),
         ...,
        ('qid2', 'did7', 'query 2', 'document 7')]
```
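
Before handing data to a preprocessor, it can help to check that every row has the expected number of fields. The helper below is a minimal sketch in plain Python; `check_rows` is a hypothetical name introduced here for illustration and is not part of MatchZoo.

```python
def check_rows(rows, stage):
    """Check that each row has the arity described above.

    'train' rows are 5-tuples (two ids, two texts, one label);
    'predict' rows are 4-tuples (no label).
    This helper is hypothetical and not a MatchZoo API.
    """
    expected = 5 if stage == 'train' else 4
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                'row {} has {} fields, expected {}'.format(i, len(row), expected))
    return len(rows)


train_rows = [('qid0', 'did0', 'query 0', 'document 0', 1),
              ('qid0', 'did1', 'query 0', 'document 1', 0)]
test_rows = [('qid9', 'did5', 'query 9', 'document 5')]

print(check_rows(train_rows, stage='train'))
print(check_rows(test_rows, stage='predict'))
```

A row with a missing label in the training list would raise a `ValueError` that names the offending row index.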

### Table of Contents

+ Prepare QuoraQP dataset
    - Download
    - Load
    - Adjustment
+ Preprocessing
+ Data Generator
+ Model Training
    - Initialize
    - Hyper-Parameters
    - Make Prediction
    - Model Persistence
+ Reference

### Prepare Quora Question Pair dataset

#### Download

We take QuoraQP as the example benchmark dataset to show the usage of MatchZoo. First, download the dataset, which is currently available via [kaggle](https://www.kaggle.com/c/quora-question-pairs/data). After unzipping the archive, you'll get:

- train.csv
- test.csv

#### Load & Adjustment

The QuoraQP train and test files are *train.csv* and *test.csv* in the uncompressed QuoraQP folder.

We convert them to the input format that MatchZoo expects.

In [5]:
data_folder = '../../data/quoraqp/'

import pandas as pd
import numpy as np

def read_data(file_path, stage):
    df = pd.read_csv(file_path, dtype={'question1': 'str', 'question2': 'str'}, keep_default_na=False)
    print(df.describe())
    if stage == 'train':
        df = df[['qid1', 'qid2', 'question1', 'question2', 'is_duplicate']]
    elif stage == 'predict':
        # copy the slice so that adding the id columns below does not write to a view
        df = df[['question1', 'question2']].copy()
        # assign ids to left and right
        q_a = pd.unique(df.values.ravel())
        # add index for each
        q_a = dict(map(lambda t: (t[1], t[0]), enumerate(q_a)))
        # assign id
        df['qid1'] = df['question1'].map(q_a)
        df['qid2'] = df['question2'].map(q_a)
        # change the order of columns
        cols = ['qid1', 'qid2', 'question1', 'question2']
        df = df[cols]
    # convert dataframe into list of tuples
    qa_pairs = [tuple(x) for x in df.values]
    return qa_pairs
    

train = read_data(data_folder + 'train.csv', stage='train')
test = read_data(data_folder + 'test.csv', stage='predict')

                  id           qid1           qid2   is_duplicate
count  404290.000000  404290.000000  404290.000000  404290.000000
mean   202144.500000  217243.942418  220955.655337       0.369198
std    116708.614503  157751.700002  159903.182629       0.482588
min         0.000000       1.000000       2.000000       0.000000
25%    101072.250000   74437.500000   74727.000000       0.000000
50%    202144.500000  192182.000000  197052.000000       0.000000
75%    303216.750000  346573.500000  354692.500000       1.000000
max    404289.000000  537932.000000  537933.000000       1.000000


        test_id question1 question2
count   3563475   3563475   3563475
unique  2607940   2211009   2227400
top     1303969     What      What 
freq          2      2033      2016
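
The id assignment in the `predict` branch can be hard to follow inside the pandas code. The sketch below reproduces the same idea with the standard library on a tiny in-memory CSV; the rows are made up for illustration and only the column names follow *test.csv*.

```python
import csv
import io

# Tiny in-memory stand-in for test.csv (made-up rows, same column names).
csv_text = (
    'test_id,question1,question2\n'
    '0,How do I learn Python?,What is the best way to learn Python?\n'
    '1,What is MatchZoo?,How do I learn Python?\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Give every distinct question string one integer id, in first-seen order.
question_ids = {}
for row in rows:
    for column in ('question1', 'question2'):
        question_ids.setdefault(row[column], len(question_ids))

# Build (qid1, qid2, question1, question2) quadruples.
quadruples = [
    (question_ids[r['question1']], question_ids[r['question2']],
     r['question1'], r['question2'])
    for r in rows
]
for quad in quadruples:
    print(quad)
```

Because the ids are keyed on the question text, a question that appears in several rows (here "How do I learn Python?") always receives the same id.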


### Preprocessing


#### Download

We first download the pre-trained GloVe word embeddings. Be patient: the archive is about 2 GB, so the download time depends on your network connection. Alternatively, you can download it with another tool and put **glove.840B.300d.txt** in the directory **../../data/embedding**.

In [50]:
!mkdir -p ../../data/embedding
!wget -O ../../data/embedding/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip 
!unzip -o ../../data/embedding/glove.840B.300d.zip -d ../../data/embedding

--2018-10-20 20:20:26--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2018-10-20 20:20:27--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: '../../data/embedding/glove.840B.300d.zip.5'


2018-10-20 21:13:22 (670 KB/s) - '../../data/embedding/glove.840B.300d.zip.5' saved [2176768927/2176768927]

Archive:  ../../data/embedding/glove.840B.300d.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zip
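
Once the text file is in place, each line holds a token followed by its space-separated vector components. The sketch below parses that layout with the standard library; `load_glove` is a hypothetical helper (not a MatchZoo function), and the sample uses made-up 3-dimensional vectors instead of the real 300-dimensional ones. Taking the vector from the end of the line is a defensive choice for tokens that themselves contain spaces.

```python
import io


def load_glove(lines, dim):
    """Parse GloVe-style text lines into a dict of token -> list of floats.

    The last `dim` fields are the vector; everything before them is the
    token, so tokens that contain spaces still parse correctly.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip('\n').split(' ')
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        word = ' '.join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return vectors


# Made-up 3-dimensional sample; the real file has 300 values per line.
sample = io.StringIO('the 0.1 0.2 0.3\nquestion -0.5 0.0 0.25\n')
glove = load_glove(sample, dim=3)
print(glove['question'])
```

For the real file, the same function could be called on an open file handle with `dim=300`; loading all vectors this way needs several gigabytes of memory, so filtering by the vocabulary while reading is usually preferable.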

#### Preprocessing with pre-trained embeddings

Preprocessing takes three lines of code.  
To use **randomly initialized embeddings**, keep the default value of *embedding_file*; the preprocessing code is then:
```python
from matchzoo import preprocessor
arci_preprocessor = preprocessor.ArcIPreprocessor()
processed_tr = arci_preprocessor.fit_transform(train, stage='train')
processed_te = arci_preprocessor.fit_transform(test, stage='predict')
```

In [None]:
# Initialize an Arc-I preprocessor with the pre-trained GloVe embeddings.
from matchzoo import preprocessor
arci_preprocessor = preprocessor.ArcIPreprocessor(embedding_file='../../data/embedding/glove.840B.300d.txt')
processed_tr = arci_preprocessor.fit_transform(train, stage='train')
processed_te = arci_preprocessor.fit_transform(test, stage='predict')

**What is `processed_tr`?**

`processed_tr` is a **MatchZoo DataPack** data structure (see `matchzoo/datapack.py`). It contains:
1. A two-column `pandas DataFrame` called `left`, which stores `id_left` (the index) and the processed `text_left` of every record.
2. A two-column `pandas DataFrame` called `right`, which stores `id_right` (the index) and the processed `text_right` of every record.
3. A two-column `pandas DataFrame` called `relation`, which stores the index mapping between `id_left` and `id_right`.
4. A `context` property (a dictionary) that holds all the parameters fitted during preprocessing.


The `fit_transform` method is a sequential combination of two methods:

1. Fit parameters using the `fit` function. This only happens when `stage='train'`.
2. Transform the data into the expected format.

So the previous preprocessing code in test stage can also be written as:

```python
processed_te = arci_preprocessor.transform(test, stage='predict')
```
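
To make the split between fitting and transforming concrete, here is a minimal sketch of a preprocessor with the same three methods. `ToyPreprocessor` is a hypothetical stand-in written for this tutorial; it does not reproduce MatchZoo's implementation and only assumes that fitting learns a term index from the training texts.

```python
class ToyPreprocessor:
    """Hypothetical stand-in illustrating fit / transform / fit_transform."""

    def __init__(self):
        self.context = {}

    def fit(self, rows):
        # Learn a term -> index mapping from both text columns.
        # Indices start at 1 so that 0 stays free for padding.
        term_index = {}
        for row in rows:
            for text in (row[2], row[3]):
                for term in text.lower().split():
                    term_index.setdefault(term, len(term_index) + 1)
        self.context['term_index'] = term_index
        return self

    def transform(self, rows):
        index = self.context['term_index']

        def encode(text):
            # Terms never seen during fitting are dropped in this sketch.
            return [index[t] for t in text.lower().split() if t in index]

        return [(row[0], row[1], encode(row[2]), encode(row[3])) for row in rows]

    def fit_transform(self, rows, stage):
        if stage == 'train':
            self.fit(rows)  # fitting only happens on training data
        return self.transform(rows)


pre = ToyPreprocessor()
toy_train = [('q0', 'd0', 'how are you', 'how do you do', 1)]
toy_test = [('q9', 'd5', 'are you new', 'you do')]
print(pre.fit_transform(toy_train, stage='train'))
print(pre.fit_transform(toy_test, stage='predict'))
```

The second call reuses the term index learned from the training rows, which is why the unseen word "new" disappears from the encoded test text.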

As described, the fitted parameters are stored in the `context` property. To access the context, call:

```python
print(processed_tr.context)
```
An example:

In [8]:
print('vocab size: ', len(processed_tr.context['term_index']))

vocab size:  3707


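**What has been stored in the `context`?**

We stored `term_index` and `input_shapes` in the `context` property. When a pre-trained embedding file is supplied, the context also holds `embedding_mat`, which is read back when the model is configured.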


**What is `arci_preprocessor` actually doing?**

The `arci_preprocessor` calls a sequence of `process_units`. Each `process_unit` is designed to perform one atomic operation on the input data. For instance, `arci_preprocessor` calls:


1. TokenizeUnit: Perform tokenization on raw input data.
2. LowercaseUnit: Transform all tokens into lower case.
3. PuncRemovalUnit: Remove all punctuation.
4. StopRemovalUnit: Remove all the stopwords.
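
The four steps above can be sketched as plain functions applied in order. This is an illustration of the chaining idea only: the function names, the whitespace tokenizer, and the tiny stopword set are assumptions made here, not the behavior of MatchZoo's units.

```python
import string

STOPWORDS = {'the', 'is', 'a', 'an', 'of', 'to'}  # tiny illustrative list


def tokenize(text):
    # Whitespace tokenization, chosen for simplicity.
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_punctuation(tokens):
    # Strip ASCII punctuation characters and drop tokens that become empty.
    table = str.maketrans('', '', string.punctuation)
    stripped = (t.translate(table) for t in tokens)
    return [t for t in stripped if t]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def run_units(text, units):
    # Feed the output of each unit into the next one.
    data = text
    for unit in units:
        data = unit(data)
    return data


units = [tokenize, lowercase, remove_punctuation, remove_stopwords]
print(run_units('What is the capital of France?', units))
```

Keeping each unit to a single operation makes the order explicit: lowercasing has to run before stopword removal here, because the stopword set is lowercase.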

Depending on whether pre-trained embeddings are used, the preprocessor either loads the embeddings from an embedding file or initializes them randomly.
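
The sketch below shows one way such an embedding matrix could be assembled: terms that have a pre-trained vector copy it, and the remaining terms are drawn uniformly at random. `build_embedding_matrix` is a hypothetical helper; the zero row at index 0 (reserved for padding), the requirement that term indices run from 1 to the vocabulary size, and the uniform range are all assumptions of this sketch rather than confirmed MatchZoo behavior.

```python
import random


def build_embedding_matrix(term_index, pretrained, dim, scale=0.2, seed=0):
    """Return a (len(term_index) + 1) x dim matrix as a list of rows.

    Row 0 stays all zeros for padding (an assumption of this sketch).
    Terms found in `pretrained` copy their vector; all other terms are
    drawn uniformly from [-scale, scale].
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(term_index) + 1)]
    for term, idx in term_index.items():
        if term in pretrained:
            matrix[idx] = list(pretrained[term])
        else:
            matrix[idx] = [rng.uniform(-scale, scale) for _ in range(dim)]
    return matrix


toy_index = {'what': 1, 'capital': 2}
toy_vectors = {'what': [0.1, 0.2, 0.3]}  # made-up pre-trained vector
embedding = build_embedding_matrix(toy_index, toy_vectors, dim=3)
print(len(embedding), embedding[0], embedding[1])
```

Passing an empty `pretrained` dictionary gives the fully random case, with every non-padding row sampled from the same range.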


----

### Data Generation

For memory efficiency, we expect you to use a **generator** that produces batches of data on the fly. For example, we can create a **PointGenerator** as follows:

In [9]:
from matchzoo import generators
from matchzoo import tasks
generator_tr = generators.PointGenerator(inputs=processed_tr, task=tasks.Classification(), batch_size=64, stage='train')
generator_te = generators.PointGenerator(inputs=processed_te, task=tasks.Classification(), batch_size=64, stage='predict')

To get the first batch of training data, call `X_train, y_train = generator_tr[0]`.
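
Indexing a generator to obtain a batch can be illustrated with a small class that implements `__len__` and `__getitem__`. `ToyPointGenerator` is a hypothetical sketch of that access pattern and does not mirror the internals of MatchZoo's `PointGenerator`.

```python
import math


class ToyPointGenerator:
    """Hypothetical index-able batch generator, not the MatchZoo class."""

    def __init__(self, pairs, labels, batch_size):
        self.pairs = pairs
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches; the last one may be smaller.
        return math.ceil(len(self.pairs) / self.batch_size)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.batch_size
        stop = start + self.batch_size
        return self.pairs[start:stop], self.labels[start:stop]


toy_pairs = [('left {}'.format(k), 'right {}'.format(k)) for k in range(5)]
toy_labels = [0, 1, 0, 0, 1]
gen = ToyPointGenerator(toy_pairs, toy_labels, batch_size=2)
X_first, y_first = gen[0]
print(len(gen), X_first, y_first)
```

Only the slice for the requested batch is materialized on each access, which is what keeps memory use bounded for large datasets.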

**What is PointGenerator?**

**PointGenerator** implements the pointwise approach. It assumes that each query-document pair in the training data has a numerical or ordinal score, so the problem can be approximated by regression or classification: given a single query-document pair, predict its score.

A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal regression and classification algorithms can also be used in the pointwise approach to predict the score of a single query-document pair when that score takes a small, finite number of values.

**What is PairGenerator?**

In the pairwise approach, the problem is approximated by classification: learning a binary classifier that can tell which document is better in a given pair of documents.

In MatchZoo, **PairGenerator** generates one positive and `num_neg` negative examples per pair. As an example, to train an Arc-I model (for document ranking), we use `num_neg=4`.
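
As a rough sketch of how one positive can be grouped with `num_neg` negatives, the function below samples negatives for each relevant document of a query. `make_pairs`, its input layout, and the sampling with replacement are assumptions made for illustration; MatchZoo's `PairGenerator` may select negatives differently.

```python
import random


def make_pairs(relations, num_neg, seed=0):
    """Build (query_id, positive_doc, [negative_docs]) training groups.

    `relations` is a list of (query_id, doc_id, label) where label > 0
    means relevant. Negatives are sampled with replacement in this sketch.
    """
    rng = random.Random(seed)
    by_query = {}
    for qid, did, label in relations:
        bucket = by_query.setdefault(qid, {'pos': [], 'neg': []})
        bucket['pos' if label > 0 else 'neg'].append(did)

    groups = []
    for qid, bucket in by_query.items():
        if not bucket['neg']:
            continue  # no negatives to contrast against
        for pos in bucket['pos']:
            negs = [rng.choice(bucket['neg']) for _ in range(num_neg)]
            groups.append((qid, pos, negs))
    return groups


toy_relations = [('q0', 'd0', 1), ('q0', 'd1', 0), ('q0', 'd2', 0), ('q1', 'd3', 1)]
groups = make_pairs(toy_relations, num_neg=4)
print(groups)
```

Query `q1` produces no group because it has no negative document, which illustrates why pairwise training needs both relevant and non-relevant documents for the same query.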

**What is ListGenerator?**

This generator tries to directly optimize the value of evaluation measures, averaged over all queries in the training data.

Choose the appropriate generator based on your `task`.

----

### Train Your Arc-I Model

To train an Arc-I model, we need to create an instance of ArcIModel:

In [10]:
from matchzoo import models
arci_model = models.ArcIModel()

Then we need to set the hyper-parameters of our Arc-I model. In general, there are **two types of hyper-parameters**:

**Required parameters**: For Arc-I, if you are using pre-trained embeddings, you must set `embedding_mat` and its size manually.

In [11]:
# The fitted parameters are stored in the `context` property during the training stage.
from matchzoo import losses
from matchzoo import tasks
arci_model.params['task'] = tasks.Classification()
arci_model.embedding_mat = processed_tr.context['embedding_mat']

**Tunable parameters**: For Arc-I, you're allowed to tune these parameters:

```python
from matchzoo import tasks

params = {
    'input_shapes': [(32,), (32,)],  # Lengths of the two texts to match.
    'optimizer': 'adam',  # Optimizer name. See keras optimizers.
    'trainable_embedding': False,  # Whether to fine-tune the embeddings.
    'num_blocks': 1,  # Number of convolution layers.
    'left_kernel_count': [32],  # Number of filters in convolution. See keras Conv1D.
    'left_kernel_size': [3],  # The length of the 1D convolution window. See keras Conv1D.
    'right_kernel_count': [32],  # Number of filters in convolution. See keras Conv1D.
    'right_kernel_size': [3],  # The length of the 1D convolution window. See keras Conv1D.
    'activation': 'relu',  # Activation function in convolution. See keras Conv1D.
    'left_pool_size': [2],  # Size of the max pooling windows for the left text. See keras MaxPooling1D.
    'right_pool_size': [2],  # Size of the max pooling windows for the right text. See keras MaxPooling1D.
    'padding': 'same',  # Padding mode. See keras padding in Conv1D.
    'dropout_rate': 0.0,  # Probability of an element to be zeroed. See keras Dropout.
    'embedding_random_scale': 0.2,  # The range of the randomly initialized embeddings.
    'task': tasks.Classification(),  # Classification by default; you can also use tasks.Ranking.
    'loss': 'categorical_crossentropy',  # See keras losses.
    'metrics': ['acc'],  # Accuracy by default. See keras metrics.
}
```
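
To see how these values interact, the following sketch computes the sequence length after each convolution and pooling block and the size of the flattened representation of each text. It assumes a convolution stride of 1, max pooling with a stride equal to the pool size and no pooling padding, and that the two flattened sides are concatenated before the final layers, as in the Arc-I architecture of Hu et al. 2014. The helper names are hypothetical.

```python
def conv_pool_length(length, kernel_size, pool_size, padding='same'):
    """Sequence length after one Conv1D (stride 1) + MaxPooling1D block."""
    if padding == 'same':
        conv_len = length
    elif padding == 'valid':
        conv_len = length - kernel_size + 1
    else:
        raise ValueError('unsupported padding: {}'.format(padding))
    # Pooling with stride == pool_size and no padding (assumed here).
    return (conv_len - pool_size) // pool_size + 1


def flattened_size(length, kernel_counts, kernel_sizes, pool_sizes, padding):
    """Size of one text's representation after all blocks are flattened."""
    for kernel_size, pool_size in zip(kernel_sizes, pool_sizes):
        length = conv_pool_length(length, kernel_size, pool_size, padding)
    return length * kernel_counts[-1]


left = flattened_size(32, [32], [3], [2], 'same')
right = flattened_size(32, [32], [3], [2], 'same')
print(left, right, left + right)
```

With the values listed above, each text of length 32 shrinks to 16 positions with 32 filters, which gives 512 features per side and 1024 after concatenation.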

In [12]:
arci_model.guess_and_fill_missing_params()
print('arci parameters: ', arci_model.params)

arci parameters:  name                          ArcIModel
model_class                   <class 'matchzoo.models.arci_model.ArcIModel'>
input_shapes                  [(32,), (32,)]
task                          <matchzoo.tasks.classification.Classification object at 0x7f73a4500320>
metrics                       ['acc']
loss                          categorical_crossentropy
optimizer                     adam
trainable_embedding           False
embedding_dim                 300
vocab_size                    3708
num_blocks                    1
left_kernel_count             [32]
left_kernel_size              [3]
right_kernel_count            [32]
right_kernel_size             [3]
activation                    relu
left_pool_size                [2]
right_pool_size               [2]
padding                       same
dropout_rate                  0.0
embedding_mat                 [[ 0.         0.         0.        ...  0.         0.         0.       ]
 [-1.522      0.31011    0.17451   ...  

#### Model Training

To train the model once all the parameters are set, call:

In [13]:
arci_model.build()
arci_model.compile()
# Fit the arci model on generator.
arci_model.fit_generator(generator_tr, steps_per_epoch=200, epochs=10)
# Make predictions on the first batch of test data
X_te, y_te = generator_te[0]
predictions = arci_model.predict([X_te.text_left, X_te.text_right])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
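
A quick calculation shows how much data this training run touches. The sketch assumes that every step consumes one batch of 64 pairs drawn from the full training set of 404,290 rows reported by `describe()` above; whether batches are shuffled or repeated is not considered here.

```python
import math

train_pairs = 404290      # rows in train.csv, from the describe() output
batch_size = 64           # as passed to PointGenerator
steps_per_epoch = 200
epochs = 10

# Total number of training pairs fed to the model over all epochs.
samples_seen = batch_size * steps_per_epoch * epochs
# Steps one epoch would need to cover the whole training set once.
steps_for_full_pass = math.ceil(train_pairs / batch_size)

print(samples_seen)
print(steps_for_full_pass)
print(round(samples_seen / train_pairs, 3))
```

Under these assumptions the ten short epochs together cover roughly a third of one full pass over the training data, so a larger `steps_per_epoch` is worth trying when a longer run is affordable.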


#### Make Prediction

In [14]:
for id_left, id_right, pred in zip(X_te.id_left, X_te.id_right, predictions):
    print("{}/{} is predicted as {}".format(id_left, id_right, pred))

1264/1265 is predicted as [0.17494294 0.825057  ]
566/567 is predicted as [0.4681414 0.5318586]
956/957 is predicted as [9.99893427e-01 1.06538624e-04]
1188/1189 is predicted as [0.53175515 0.46824485]
806/807 is predicted as [0.6687734  0.33122656]
796/797 is predicted as [0.9432251  0.05677492]
1886/1887 is predicted as [0.99815995 0.00184011]
834/835 is predicted as [0.8919989  0.10800117]
42/43 is predicted as [9.9982929e-01 1.7065427e-04]
1566/1567 is predicted as [9.9985564e-01 1.4437299e-04]
800/801 is predicted as [0.998869   0.00113099]
1238/1239 is predicted as [9.9998116e-01 1.8888943e-05]
1906/1907 is predicted as [9.9990618e-01 9.3757066e-05]
1274/1275 is predicted as [9.9951077e-01 4.8926729e-04]
1344/1345 is predicted as [0.97073054 0.02926948]
596/597 is predicted as [0.50249165 0.4975083 ]
1224/1225 is predicted as [0.157969   0.84203106]
1340/1341 is predicted as [0.75901395 0.24098605]
646/647 is predicted as [0.5194546 0.4805454]
574/575 is predicted as [9.9930036e-
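
Each prediction is a pair of class probabilities. A hard label can be obtained by taking the index of the larger value, as sketched below on the first three predictions printed above. Interpreting index 1 as "duplicate" is an assumption about how the labels were one-hot encoded.

```python
# First three predictions copied from the output above.
predictions_sample = [
    [0.17494294, 0.825057],
    [0.4681414, 0.5318586],
    [9.99893427e-01, 1.06538624e-04],
]


def to_label(probs):
    """Index of the largest probability (0 or 1 for two classes)."""
    return max(range(len(probs)), key=probs.__getitem__)


labels = [to_label(p) for p in predictions_sample]
print(labels)
```

For the three sample rows this yields class 1 for the first two pairs and class 0 for the third.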

#### Model Persistence

You can persist your trained model with `model.save()` and restore it with the `load_model` function:

```python
from matchzoo import engine
# Save the model to dir.
arci_model.save('/your-model-saved-path')
# And load the model from dir.
engine.load_model('/your-model-saved-path')
```

### Reference

[Hu et al. 2014] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. "Convolutional neural network architectures for matching natural language sentences." In Advances in neural information processing systems (NIPS), pp. 2042-2050. 2014.