### Build a Convolutional Deep Semantic Structured Model (CDSSM)

<img src="https://github.com/NTMC-Community/MatchZoo/blob/2.0/artworks/matchzoo-logo.png?raw=True" alt="logo" style="width:600px;float: center"/>

This is a tutorial on training the *Convolutional-DSSM* (*CDSSM*) model [Shen et al. 2014](https://www.microsoft.com/en-us/research/publication/a-latent-semantic-model-with-convolutional-pooling-structure-for-information-retrieval/) with [MatchZoo](https://github.com/NTMC-Community/MatchZoo). We use [WikiQA](https://aclweb.org/anthology/D15-1237) as the example benchmark dataset to show the usage.

Beyond *DSSM* [Huang et al. 2013](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf), *CDSSM* uses a convolutional structure to extract both local contextual features at the word-n-gram level and global contextual features at the sentence level.
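
To make the idea of local and global features concrete, here is a minimal sketch in plain Python. It is not MatchZoo's implementation: `sliding_windows`, `max_pool`, and `toy_local_features` are hypothetical names, and the "filter" is a trivial stand-in for the learned convolution weights. It only shows the data flow: one word-n-gram window per token position, a feature vector per window, and a max over positions that yields one sentence-level vector.

```python
def sliding_windows(tokens, window=3, pad="<s>"):
    """Return one word-n-gram per token position, padding both ends."""
    half = window // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + window] for i in range(len(tokens))]


def toy_local_features(ngram):
    """Hypothetical stand-in for a learned filter: one number per word."""
    return [len(word) for word in ngram]


def max_pool(feature_rows):
    """Element-wise max over all positions: one global vector per sentence."""
    return [max(column) for column in zip(*feature_rows)]


tokens = ["how", "are", "glaciers", "formed"]
windows = sliding_windows(tokens)                 # local context per position
local = [toy_local_features(w) for w in windows]  # local feature vectors
global_vector = max_pool(local)                   # sentence-level vector
print(windows[0], global_vector)
```

In the real model the local features come from trained convolution weights applied to word-hashed inputs; only the window-then-pool structure carries over from this sketch.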

*Walking through this notebook takes approximately 40 minutes.*

-------

**TL;DR**

The following code block illustrates the main workflow of how to train a CDSSM model. 

```python
from matchzoo import preprocessor
from matchzoo import generators
from matchzoo import models

train, test = ... # prepare your training data and test data.

# prepare the data preprocessor
# the preprocessor outputs data in the format required by CDSSM
cdssm_preprocessor = preprocessor.CDSSMPreprocessor()
processed_tr = cdssm_preprocessor.fit_transform(train, stage='train')
processed_te = cdssm_preprocessor.fit_transform(test, stage='predict')
# required data information stored in `context`
input_shapes = processed_tr.context['input_shapes']

# data generator controls data batch format
generator_tr = generators.PointGenerator(processed_tr)
generator_te = generators.PointGenerator(processed_te)

cdssm_model = models.CDSSMModel()
# set model parameters
cdssm_model.params['input_shapes'] = input_shapes
# fill the missing model parameters with default values
cdssm_model.guess_and_fill_missing_params()
cdssm_model.build()
cdssm_model.compile()
# train
cdssm_model.fit_generator(generator_tr)
# predict the first batch in test set
X_te, y_te = generator_te[0]
predictions = cdssm_model.predict([X_te.text_left, X_te.text_right])
```

-----

MatchZoo expects a list of *quintuples* as training data:

```python
train = [('qid0', 'did0', 'query 0', 'document 0', 'label 0'),
         ('qid0', 'did1', 'query 0', 'document 1', 'label 1'),
          ...,
         ('qid1', 'did2', 'query 1', 'document 2', 'label 3')]
```

The corresponding columns are `(text_left_id, text_right_id, text_left, text_right, label)`. For an Information Retrieval task, *text_left* is the *query* and *text_right* is the *document*.

For the test case, MatchZoo expects a list of *quadruples* (labels are not needed) as input:

```python
test = [('qid9', 'did5', 'query 9', 'document 5'),
         ...,
        ('qid2', 'did7', 'query 2', 'document 7')]
```
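
If your evaluation data is already labelled, the quadruple format can be derived from the quintuple format by dropping the last column. The helper below is a small sketch under that assumption; `drop_labels` is a hypothetical name, not a MatchZoo function, and the rows are made-up placeholders.

```python
def drop_labels(quintuples):
    """Turn (id_left, id_right, text_left, text_right, label) rows into 4-tuples."""
    for row in quintuples:
        if len(row) != 5:
            raise ValueError("expected 5 columns, got {}".format(len(row)))
    return [tuple(row[:4]) for row in quintuples]


labelled = [
    ('qid0', 'did0', 'query 0', 'document 0', '1'),
    ('qid0', 'did1', 'query 0', 'document 1', '0'),
]
unlabelled = drop_labels(labelled)
print(unlabelled)
```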

### Table of Contents

+ Prepare WikiQA dataset
+ Preprocessing
+ Data Generation
+ Model Training
    - Initialize
    - Hyper-Parameters
    - Model Training
    - Make Prediction
    - Model Persistence
+ Reference

### Prepare WikiQA dataset

We take WikiQA as the example benchmark dataset to show the usage of MatchZoo.

First, you need to download and uncompress the data. We provide the following script to download the dataset into the `MatchZoo/data/WikiQA` folder and convert it to the format MatchZoo requires. You can change the dataset path according to your needs.

The WikiQA dataset has the following format:

`QuestionID\tQuestion\tDocumentID\tDocumentTitle\tSentenceID\tSentence\tLabel`

In [2]:
import os

data_folder = '../../data/WikiQA/'

if not os.path.exists(data_folder):
    cmd = 'mkdir -p {}\n'.format(data_folder) \
          +'cd {}\n'.format(data_folder) \
          +'wget https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip\n' \
          +'unzip WikiQACorpus.zip\n'
    print ('download WikiQA data... \n', cmd)
    os.system(cmd)
    
# Convert the dataset into the format MatchZoo requires.
def read_data(path, stage):
    """Read a WikiQA .tsv file into a list of MatchZoo tuples."""
    output_list = []
    with open(path) as fin:
        for index, line in enumerate(fin):
            # Skip the header line of the dataset file.
            if index == 0:
                continue
            # Strip the line ending so the label does not keep a trailing '\n'.
            tok = line.rstrip('\r\n').split('\t')
            if stage == 'test':
                output_list.append((tok[0], tok[4], tok[1], tok[5]))  # qid, did, q, d
            else:
                output_list.append((tok[0], tok[4], tok[1], tok[5], tok[6]))  # qid, did, q, d, label
    return output_list

train = read_data(data_folder + 'WikiQACorpus/WikiQA-train.tsv', stage='train')
dev   = read_data(data_folder + 'WikiQACorpus/WikiQA-dev.tsv', stage='dev')
test  = read_data(data_folder + 'WikiQACorpus/WikiQA-test.tsv', stage='test')
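
As an alternative to splitting lines by hand, the same column mapping can be sketched with the standard `csv` module, which addresses columns by header name instead of by position. This assumes the header row uses exactly the column names listed above; `read_rows` and the sample row are hypothetical and only serve as an in-memory illustration.

```python
import csv
import io

FIELDS = ["QuestionID", "Question", "DocumentID", "DocumentTitle",
          "SentenceID", "Sentence", "Label"]

# A made-up two-line file: the header plus one record.
sample = (
    "\t".join(FIELDS) + "\n"
    "Q0\tsample question\tD0\tSample title\tD0-0\tSample sentence .\t1\n"
)


def read_rows(handle):
    """Yield (qid, did, q, d, label) tuples from a tab-separated file object."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(r["QuestionID"], r["SentenceID"], r["Question"],
             r["Sentence"], r["Label"]) for r in reader]


rows = read_rows(io.StringIO(sample))
print(rows)
```

`csv.QUOTE_NONE` keeps quotation marks inside sentences as literal characters, which matters for free text such as questions and answers.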

### Preprocessing

You can pre-process your data to fit CDSSM input in just three lines.

In [6]:
# Initialize a cdssm preprocessor.
from matchzoo import preprocessor
cdssm_preprocessor = preprocessor.CDSSMPreprocessor()
processed_tr = cdssm_preprocessor.fit_transform(train, stage='train')
processed_te = cdssm_preprocessor.fit_transform(test, stage='predict')

**What is `processed_tr`?**

`processed_tr` is a **MatchZoo DataPack** data structure (see `matchzoo/datapack.py`). It contains:

1. A *two-column* `pandas DataFrame` called `left` that hosts the pre-processed left-hand records: the index `id_left` and the processed text `text_left`.
2. A *two-column* `pandas DataFrame` called `right` that hosts the pre-processed right-hand records: the index `id_right` and the processed text `text_right`.
3. A *two-column* `pandas DataFrame` called `relation` that hosts the index mapping between `id_left` and `id_right`.
4. A `context` property (a dictionary) that holds all the parameters fitted during pre-processing.
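
The layout above can be pictured with plain Python containers. The sketch below is not the DataPack class itself: dictionaries and a list stand in for the three DataFrames, and `iter_pairs` is a hypothetical helper. It shows why the split is useful: each text is stored once, and `relation` joins the two sides by id.

```python
# Stand-ins for the `left` and `right` frames: id -> processed text.
left = {"qid0": ["query", "0"]}
right = {"did0": ["document", "0"], "did1": ["document", "1"]}

# Stand-in for the `relation` frame: which left id goes with which right id.
relation = [("qid0", "did0"), ("qid0", "did1")]

# Stand-in for `context`: parameters fitted during pre-processing.
context = {"term_index": {"query": 1, "document": 2, "0": 3, "1": 4}}


def iter_pairs(left, right, relation):
    """Join both sides through the relation, one text pair per row."""
    for id_left, id_right in relation:
        yield id_left, id_right, left[id_left], right[id_right]


pairs = list(iter_pairs(left, right, relation))
print(pairs)
```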

The `fit_transform` method chains two steps:

1. Fit the parameters using the `fit` function. This only happens when `stage='train'`.
2. Transform the data into the expected format.
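
The two steps can be sketched with a toy preprocessor. `ToyPreprocessor` is a hypothetical class written for illustration; it only mirrors the control flow described above (fit when `stage='train'`, then always transform) and does not reproduce what `CDSSMPreprocessor` computes.

```python
class ToyPreprocessor:
    """Maps whitespace tokens to integer ids learned from the training data."""

    def __init__(self):
        self.context = {}

    def fit(self, rows):
        vocab = sorted({tok for row in rows for tok in row.split()})
        # Index 0 is reserved for tokens never seen during fitting.
        self.context["term_index"] = {tok: i + 1 for i, tok in enumerate(vocab)}
        return self

    def transform(self, rows):
        index = self.context["term_index"]
        return [[index.get(tok, 0) for tok in row.split()] for row in rows]

    def fit_transform(self, rows, stage):
        if stage == "train":  # parameters are only fitted on training data
            self.fit(rows)
        return self.transform(rows)


toy = ToyPreprocessor()
train_ids = toy.fit_transform(["a b", "b c"], stage="train")
test_ids = toy.fit_transform(["c d"], stage="predict")
print(train_ids, test_ids)
```

Because nothing is fitted at prediction time, the unseen token `d` falls back to index 0 instead of extending the vocabulary.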

So the previous three lines of code can also be written as:

```python
# Initialize a cdssm preprocessor.
from matchzoo import preprocessor
cdssm_preprocessor = preprocessor.CDSSMPreprocessor()
processed_tr = cdssm_preprocessor.fit_transform(train, stage='train')
# We do not need to fit any parameters during the testing stage.
# So we can call transform directly.
processed_te = cdssm_preprocessor.transform(test, stage='predict')
```

As described, the fitted parameters are stored in the `context` property. To access the context, just call:

```python
print(processed_tr.context)
```

Another example, printing the size of the fitted vocabulary:

In [3]:
print('vocab size: ', len(processed_tr.context['term_index']))

vocab size:  9643


**What has been stored in the `context`?**

We stored `input_shapes` in the context property, alongside fitted values such as the `term_index` used above. Since the CDSSM model's input shape is dynamic (it depends on the tri-letters generated from the user's training data), you **must** set the model's input shape manually. We'll discuss this in the model training section.

**What is `cdssm_preprocessor` actually doing?**

The `cdssm_preprocessor` calls a sequence of `process_units`. Each `process_unit` is designed to perform one atomic operation on the input data. For instance, `cdssm_preprocessor` calls:

1. TokenizeUnit: Perform tokenization on the raw input data.
2. LowercaseUnit: Transform all tokens into lower case.
3. PuncRemovalUnit: Remove all punctuation.
4. StopRemovalUnit: Remove all stopwords.
5. NgramLetterUnit: Create letter n-grams (tri-letters by default) as input data. For example, the token `test` will be transformed to `['#te', 'tes', 'est', 'st#']`.
6. VocabularyUnit: Create the vocabulary to get the dimensionality of the `tri-letters`.
7. WordHashingUnit: Create the `WordHashing` layer as described in the paper.
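
Steps 5 to 7 can be sketched in a few lines of plain Python. The functions below are hypothetical illustrations, not the MatchZoo process units: `letter_ngrams` reproduces the `test` example from the list, `build_term_index` plays the role of the vocabulary, and `word_hash` turns a token into a count vector over that vocabulary.

```python
def letter_ngrams(token, n=3):
    """Split a token into letter n-grams, with '#' marking word boundaries."""
    marked = "#" + token + "#"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def build_term_index(tokens, n=3):
    """Assign one vector position to every distinct letter n-gram."""
    grams = sorted({g for t in tokens for g in letter_ngrams(t, n)})
    return {gram: position for position, gram in enumerate(grams)}


def word_hash(token, term_index, n=3):
    """Count the token's letter n-grams at their vocabulary positions."""
    vector = [0] * len(term_index)
    for gram in letter_ngrams(token, n):
        if gram in term_index:  # n-grams unseen during fitting are ignored
            vector[term_index[gram]] += 1
    return vector


print(letter_ngrams("test"))  # ['#te', 'tes', 'est', 'st#']
term_index = build_term_index(["test", "tea"])
print(word_hash("test", term_index))
```

The vector length equals the number of distinct tri-letters, which is why the model's input dimensionality depends on the training data.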

----

### Data Generation

For memory efficiency, we expect you to use a **generator** that produces batches of data on the fly. For example, we can create a **PointGenerator** as follows:

In [4]:
from matchzoo import generators
from matchzoo import tasks
generator_tr = generators.PointGenerator(inputs=processed_tr, task=tasks.Ranking(), batch_size=64, stage='train')
generator_te = generators.PointGenerator(inputs=processed_te, task=tasks.Ranking(), batch_size=64, stage='predict')

To get the first batch of training data, just call `X_train, y_train = generator_tr[0]`.
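
Indexing a generator by batch number can be sketched as follows. `ToyBatchGenerator` is a hypothetical class, not the MatchZoo generator: it only shows how `__len__` and `__getitem__` let a consumer request batch `i` without materializing all batches at once.

```python
import math


class ToyBatchGenerator:
    """Serve (inputs, labels) batches by index, slicing them on demand."""

    def __init__(self, pairs, labels, batch_size=64):
        self.pairs = pairs
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches; the last one may be smaller than batch_size.
        return math.ceil(len(self.pairs) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.pairs[start:stop], self.labels[start:stop]


pairs = [("q0", "d0"), ("q0", "d1"), ("q1", "d2"), ("q1", "d3"), ("q2", "d4")]
labels = [1, 0, 0, 1, 0]
toy_generator = ToyBatchGenerator(pairs, labels, batch_size=2)
X_first, y_first = toy_generator[0]
print(len(toy_generator), X_first, y_first)
```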

**What is PointGenerator?**

**PointGenerator** implements the pointwise approach: each query-document pair in the training data is assumed to have a numerical or ordinal score. The problem can then be approximated by a regression or classification problem: given a single query-document pair, predict its score.

A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal regression and classification algorithms can also be used in the pointwise approach when they predict the score of a single query-document pair and that score takes a small, finite number of values.

**What is PairGenerator?**

In the pairwise approach, the problem is approximated by a classification problem: learning a binary classifier that can tell which document is better in a given pair of documents.

In MatchZoo, **PairGenerator** pairs one positive example with `num_neg` negative examples. As an example, to train a CDSSM model (for document ranking), we use `num_neg=4`.
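
One way to picture this grouping is the sketch below. It is an assumption about the general technique, not MatchZoo's sampling code: `make_pairwise_groups` is a hypothetical helper, labels are taken to be integers with values above zero meaning relevant, and negatives are drawn with replacement from the same query.

```python
import random


def make_pairwise_groups(relation, num_neg=4, seed=0):
    """Group each positive document with `num_neg` sampled negatives.

    relation: iterable of (id_left, id_right, label) rows.
    """
    rng = random.Random(seed)
    by_query = {}
    for id_left, id_right, label in relation:
        bucket = by_query.setdefault(id_left, {"pos": [], "neg": []})
        bucket["pos" if label > 0 else "neg"].append(id_right)

    groups = []
    for id_left, bucket in by_query.items():
        if not bucket["neg"]:
            continue  # nothing to contrast the positives with
        for positive in bucket["pos"]:
            negatives = [rng.choice(bucket["neg"]) for _ in range(num_neg)]
            groups.append((id_left, positive, negatives))
    return groups


relation = [("q0", "d0", 1), ("q0", "d1", 0), ("q0", "d2", 0), ("q1", "d3", 1)]
groups = make_pairwise_groups(relation, num_neg=4)
print(groups)
```

Queries without any negative document are skipped here, because a pairwise loss needs at least one document to rank below the positive one.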

**What is ListGenerator?**

This generator serves the listwise approach, which tries to directly optimize the value of evaluation measures, averaged over all queries in the training data.

Choose the appropriate generator based on your `task`.

----

### Train Your CDSSM Model

To train a CDSSM model, we need to create an instance of CDSSMModel:

In [5]:
from matchzoo import models
cdssm_model = models.CDSSMModel()

Then, we need to set hyper-parameters to our CDSSM Model. In general, there are **two types of hyper-parameters**:

**Required parameters**: For CDSSM, since the `input_shapes` depend on the dimensionality of fitted training data, you're required to set this parameter manually!

In [6]:
# The fitted parameters are stored in the `context` property of the processed training data.
from matchzoo import losses
from matchzoo import tasks
input_shapes = processed_tr.context['input_shapes']
cdssm_model.params['input_shapes'] = input_shapes
cdssm_model.params['task'] = tasks.Ranking()

**Tunable parameters**: For CDSSM, you're allowed to tune these parameters:

```python
from matchzoo import tasks

params = {'w_initializer': 'glorot_normal', # see keras weight_initializer.
          'b_initializer': 'zeros', # see keras bias_initializer.
          'dim_fan_out': 128, # Dimension of output layer.
          'dim_hidden': 300, # Dimension of hidden layer.
          'contextual_window': 3, # Convolution window size.
          'strides': 1, # An integer or tuple/list of a single integer, specifying the stride length of the convolution.
          'padding': 'same', # One of 'valid' or 'same' (case-insensitive).
          'activation_hidden': 'tanh', # Activation function of hidden layer, see keras activation.
          'num_hidden_layers': 2, # Number of hidden layers.
          'optimizer': 'sgd', # By default, we're using sgd, see keras optimizer.
          'task': tasks.Classification, # Default Classification, you can use tasks.Ranking
          'loss': 'categorical_crossentropy', # categorical_crossentropy, see keras loss.
          'metric': 'acc', # Accuracy by default, see keras metric.
         }
```
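
To see how `contextual_window`, `strides`, and `padding` interact, the sketch below computes the number of output positions of a one-dimensional convolution using the usual Keras conventions for `'same'` and `'valid'` padding. `conv1d_output_length` is a hypothetical helper written for this tutorial, and the input length of 5 is only an example value.

```python
import math


def conv1d_output_length(input_length, window, stride=1, padding="same"):
    """Number of positions a 1-D convolution produces along the sequence axis."""
    padding = padding.lower()
    if padding == "same":
        # The input is padded so that every stride step yields one output.
        return math.ceil(input_length / stride)
    if padding == "valid":
        # Only windows that fit entirely inside the input are used.
        return max(0, (input_length - window) // stride + 1)
    raise ValueError("padding must be 'same' or 'valid'")


for padding in ("same", "valid"):
    for stride in (1, 2):
        length = conv1d_output_length(5, window=3, stride=stride, padding=padding)
        print(padding, stride, length)
```

With `padding='same'` and `strides=1`, the convolution keeps one output per input position, so the pooling layer sees the whole sequence.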

As with the **required parameters**, use `cdssm_model.params['parameter-name'] = parameter-value` to set the hyper-parameters. If you want to keep the default values for everything else, just use:

In [7]:
cdssm_model.guess_and_fill_missing_params()
print('cdssm parameters: ', cdssm_model.params)

cdssm parameters:  name                          CDSSMModel
model_class                   <class 'matchzoo.models.cdssm_model.CDSSMModel'>
input_shapes                  [(5, 28932), (5, 28932)]
task                          <matchzoo.tasks.ranking.Ranking object at 0x7fd6c7fa1e80>
metrics                       ['mae']
loss                          mse
optimizer                     adam
w_initializer                 glorot_normal
b_initializer                 zeros
dim_fan_out                   128
dim_hidden                    300
contextual_window             3
strides                       1
padding                       same
activation_hidden             tanh
num_hidden_layers             1
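
The printed parameters list `mse` as the loss and `mae` as the metric. As a reminder of what these two quantities measure, here is a plain-Python sketch; the labels and scores are made-up values, and the functions are illustrative rather than the Keras implementations.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; large errors are penalized more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences, in the same unit as the labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


labels = [1.0, 0.0, 0.0, 1.0]
scores = [0.5, 0.5, 0.0, 1.0]
mse = mean_squared_error(labels, scores)
mae = mean_absolute_error(labels, scores)
print(mse, mae)  # 0.125 0.25
```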


#### Model Training

Once all the parameters are set, train the model by calling:

In [9]:
cdssm_model.build()
cdssm_model.compile()
# Fit the cdssm model on generator.
cdssm_model.fit_generator(generator_tr, steps_per_epoch=10, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd6af7ceb00>

In [10]:
# Make predictions on the first batch of test data
X_te, y_te = generator_te[0]
predictions = cdssm_model.predict([X_te.text_left, X_te.text_right])
for id_left, id_right, pred in zip(X_te.id_left, X_te.id_right, predictions):
    print("{}/{} is predicted as {}".format(id_left, id_right, pred))

Q1583/D1501-14 is predicted as [0.28110725]
Q1569/D1487-3 is predicted as [0.2935608]
Q3012/D2780-5 is predicted as [0.28088737]
Q878/D849-16 is predicted as [0.05994327]
Q388/D386-3 is predicted as [0.63909245]
Q744/D722-1 is predicted as [0.49063012]
Q599/D588-6 is predicted as [0.17586084]
Q455/D448-11 is predicted as [-0.03900433]
Q2803/D2604-2 is predicted as [-0.0858564]
Q1157/D1112-1 is predicted as [-0.1775243]
Q2144/D2021-2 is predicted as [-0.19360174]
Q817/D791-0 is predicted as [-0.40400195]
Q411/D408-10 is predicted as [0.09464573]
Q2185/D2059-1 is predicted as [-0.19726811]
Q1233/D1181-8 is predicted as [0.04308429]
Q800/D774-6 is predicted as [-0.11885771]
Q2723/D2538-3 is predicted as [-0.3144391]
Q2836/D2633-2 is predicted as [0.10870139]
Q2841/D2638-0 is predicted as [0.22389692]
Q2382/D230-14 is predicted as [0.68754077]
Q2618/D858-5 is predicted as [-0.28881052]
Q2766/D1764-20 is predicted as [-0.00923205]
Q2385/D2239-10 is predicted as [0.08014339]
Q1012/D976-15 is
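
For a ranking task, the per-pair scores are usually turned into one ordered document list per query. The sketch below shows one way to do this; `rank_by_query` is a hypothetical helper, the ids and scores are made up, and the scores are assumed to be plain floats (a one-element prediction array would need to be flattened first).

```python
from collections import defaultdict


def rank_by_query(id_lefts, id_rights, scores):
    """Group scored pairs by query and sort documents by descending score."""
    grouped = defaultdict(list)
    for id_left, id_right, score in zip(id_lefts, id_rights, scores):
        grouped[id_left].append((id_right, float(score)))
    return {
        id_left: sorted(docs, key=lambda item: item[1], reverse=True)
        for id_left, docs in grouped.items()
    }


ranked = rank_by_query(
    ["q0", "q0", "q1", "q0"],
    ["d0", "d1", "d2", "d3"],
    [0.2, 0.7, 0.1, -0.3],
)
print(ranked["q0"])
```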

#### Model Persistence

You can persist your trained model using the model's `save()` method and the `load_model` function:

```python
from matchzoo import engine
# Save the model to dir.
cdssm_model.save('/your-model-saved-path')
# And load the model back from the same directory.
loaded_model = engine.load_model('/your-model-saved-path')
```

### Reference

[Huang et al. 2013] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proc. CIKM. ACM, 2333–2338.

[Shen et al. 2014] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proc. CIKM. ACM, 101–110.