### Build a Deep Structured Semantic Model (DSSM)

<img src="https://github.com/NTMC-Community/MatchZoo/blob/2.0/artworks/matchzoo-logo.png?raw=True" alt="logo" style="width:600px;float: center"/>

This tutorial shows how to train a *Deep Structured Semantic Model* (DSSM) [Huang et al. 2013](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/DSSM_cikm13_talk_v4.pdf) with [MatchZoo](https://github.com/faneshion/MatchZoo). We use [WikiQA](https://aclweb.org/anthology/D15-1237) as the example benchmark dataset.

Features:

1. Using the tri-letter based word hashing for scalable word representation.
2. Using the deep neural net to extract high-level semantic representations.
3. Using the click signal to guide the learning.



*Walking through this notebook takes approximately 30 minutes.*

-------

**TL;DR**

The following code block illustrates the main workflow of how to train a DSSM model. 

```python
from matchzoo import preprocessor
from matchzoo import generators
from matchzoo import models

train, test = ... # prepare your training data and test data.

dssm_preprocessor = preprocessor.DSSMPreprocessor()
processed_tr = dssm_preprocessor.fit_transform(train, stage='train')
processed_te = dssm_preprocessor.fit_transform(test, stage='predict')
# DSSM expects the dimensionality of the letter-trigram vocabulary as its input shape.
# The fitted parameters are stored in `context` while preprocessing the training data.
input_shapes = processed_tr.context['input_shapes']

generator_tr = generators.PointGenerator(processed_tr)
generator_te = generators.PointGenerator(processed_te)
# Example, train with generator, test with the first batch.
X_te, y_te = generator_te[0]

dssm_model = models.DSSMModel()
dssm_model.params['input_shapes'] = input_shapes
dssm_model.guess_and_fill_missing_params()
dssm_model.build()
dssm_model.compile()
dssm_model.fit_generator(generator_tr)
# Make predictions
predictions = dssm_model.predict([X_te.text_left, X_te.text_right])
```

-----

MatchZoo expects the training data as a list of *quintuples*:

```python
train = [('qid0', 'did0', 'query 0', 'document 0', 'label 0'),
         ('qid0', 'did1', 'query 0', 'document 1', 'label 1'),
          ...,
         ('qid1', 'did2', 'query 1', 'document 2', 'label 3')]
```

The corresponding columns are `(text_left_id, text_right_id, text_left, text_right, label)`. In an information retrieval task, *text_left* is the *query* and *text_right* is the *document*.

For testing, MatchZoo expects a list of *quadruples* (labels are not needed) as input:

```python
test = [('qid9', 'did5', 'query 9', 'document 5'),
         ...,
        ('qid2', 'did7', 'query 2', 'document 7')]
```

### Table of Contents

+ Prepare WikiQA dataset
    - Download
    - Load
    - Adjustment
+ Preprocessing
+ Data Generator
+ Model Training
    - Initialize
    - Hyper-Parameters
    - Make Prediction
    - Model Persistence
+ Reference

### Prepare WikiQA dataset

#### Download

We take WikiQA as the example benchmark dataset to show the usage of MatchZoo. First, you need to download and uncompress the data. The following script downloads the dataset into the `MatchZoo/data/WikiQA` folder; you can change the directory in the script.

If you already have the WikiQA dataset on your machine, skip the following script.

In [1]:
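import os

# Shell script: create the data folder, download the archive, and unpack it.
cmd = 'mkdir -p ../../data/WikiQA/\n' \
      + 'cd ../../data/WikiQA/\n' \
      + 'wget https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip\n' \
      + 'unzip WikiQACorpus.zip\n'
print('download WikiQA data... ', cmd)
# os.system returns the shell's exit status; 0 means success.
os.system(cmd)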

download WikiQA data...  mkdir -p ../../data/WikiQA/
cd ../../data/WikiQA/
wget https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
unzip WikiQACorpus.zip



256
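
The `256` shown above is the value returned by `os.system`. On Unix this is an encoded wait status, and 256 corresponds to exit code 1, which suggests the shell script did not finish cleanly on the machine where the notebook was run. If `wget` or `unzip` is not available, a standard-library alternative could look like the sketch below. The helper names `extract_zip` and `download_wikiqa` are hypothetical and not part of MatchZoo; the URL and the target folder are copied from the script above.

```python
import os
import urllib.request
import zipfile

# URL copied from the shell script above.
WIKIQA_URL = ('https://download.microsoft.com/download/E/5/F/'
              'E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip')


def extract_zip(zip_path, target_dir):
    """Unpack `zip_path` into `target_dir` and return the archived names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


def download_wikiqa(target_dir='../../data/WikiQA/', url=WIKIQA_URL):
    """Download the archive (skipped if already present) and unpack it."""
    os.makedirs(target_dir, exist_ok=True)
    zip_path = os.path.join(target_dir, 'WikiQACorpus.zip')
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    return extract_zip(zip_path, target_dir)


# download_wikiqa()  # Uncomment to run; this needs network access.
```

Because the download is skipped when the zip file already exists, the function can be called again safely after an interrupted extraction.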

#### Load & Adjustment

The *train/dev/test* files of WikiQA are *WikiQA-train.tsv*, *WikiQA-dev.tsv*, *WikiQA-test.tsv* under the uncompressed folder WikiQACorpus. The data format of WikiQA is as follows:

`QuestionID\tQuestion\tDocumentID\tDocumentTitle\tSentenceID\tSentence\tLabel`

We can convert this format to the expected input format of MatchZoo.

In [5]:
data_folder = '../../data/WikiQA/WikiQACorpus/'

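def read_data(path, stage):
    """Read a WikiQA .tsv file into a list of MatchZoo input tuples."""
    output_list = []
    with open(path) as fin:
        next(fin, None)  # Skip the header row.
        for line in fin:
            # Strip the line ending so the label column does not end with '\n'.
            tok = line.rstrip('\r\n').split('\t')
            if stage == 'test':
                # qid, did, q, d (no label at test time)
                output_list.append((tok[0], tok[4], tok[1], tok[5]))
            else:
                # qid, did, q, d, label
                output_list.append((tok[0], tok[4], tok[1], tok[5], tok[6]))
    return output_list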

train = read_data(data_folder + 'WikiQA-train.tsv', stage='train')
dev   = read_data(data_folder + 'WikiQA-dev.tsv', stage='dev')
test  = read_data(data_folder + 'WikiQA-test.tsv', stage='test')

### Preprocessing

You can pre-process your DSSM input in three lines of code:

In [6]:
# Initialize a dssm preprocessor.
from matchzoo import preprocessor
dssm_preprocessor = preprocessor.DSSMPreprocessor()
processed_tr = dssm_preprocessor.fit_transform(train, stage='train')
processed_te = dssm_preprocessor.fit_transform(test, stage='predict')

Start building vocabulary & fitting parameters.
2118it [00:00, 3440.95it/s]
18841it [00:09, 2079.73it/s]
Start processing input data for train stage.
2118it [00:00, 2442.61it/s]
18841it [00:11, 1580.16it/s]
Start processing input data for predict stage.
633it [00:00, 2211.88it/s]
5961it [00:03, 1574.15it/s]


**What is `processed_tr`?**

`processed_tr` is a **MatchZoo DataPack** data structure (see `matchzoo/datapack.py`). It contains:

1. A two-column `pandas DataFrame` called `left` that holds the pre-processed left-hand records, that is, the index (`id_left`) and the processed text (`text_left`).
2. A two-column `pandas DataFrame` called `right` that holds the pre-processed right-hand records, that is, the index (`id_right`) and the processed text (`text_right`).
3. A two-column `pandas DataFrame` called `relation` that holds the index mapping between `id_left` and `id_right`.
4. A `context` property (a dictionary) that holds all the parameters fitted during pre-processing.
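
To make this layout more tangible, here is a minimal sketch that splits input tuples into three tables. It uses plain Python dictionaries and lists as stand-ins for the `pandas DataFrame` objects, the function name `split_records` is hypothetical, and keeping the label next to the id pair in `relation` is an assumption of this sketch rather than a statement about `DataPack`.

```python
def split_records(records):
    """Split (id_left, id_right, text_left, text_right, label) tuples
    into left, right, and relation tables."""
    left, right, relation = {}, {}, []
    for id_left, id_right, text_left, text_right, label in records:
        left[id_left] = text_left      # One entry per unique left text.
        right[id_right] = text_right   # One entry per unique right text.
        # Assumption: the label travels with the id pair.
        relation.append((id_left, id_right, label))
    return left, right, relation


records = [('qid0', 'did0', 'query 0', 'document 0', 1),
           ('qid0', 'did1', 'query 0', 'document 1', 0),
           ('qid1', 'did2', 'query 1', 'document 2', 1)]
left, right, relation = split_records(records)
print(left)
print(right)
print(relation)
```

Storing each text once and referring to it by id avoids repeating the same query text for every candidate document.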

The `fit_transform` method chains two steps:

1. Fit the parameters using the `fit` function. This only happens when `stage='train'`.
2. Transform the data into the expected format.
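
The two steps can be illustrated with a toy preprocessor. `ToyPreprocessor` is a hypothetical stand-in written for this tutorial, not MatchZoo's `DSSMPreprocessor`; it only lowercases, splits on whitespace, and maps tokens to integer ids, and reserving index 0 for unseen tokens is an assumption of the sketch.

```python
class ToyPreprocessor:
    """Hypothetical stand-in that shows the fit/transform split."""

    def __init__(self):
        self.context = {}

    def fit(self, texts):
        vocab = sorted({tok for text in texts for tok in text.lower().split()})
        # Assumption: index 0 is reserved for tokens unseen during fitting.
        self.context['term_index'] = {tok: i + 1 for i, tok in enumerate(vocab)}
        return self

    def transform(self, texts):
        index = self.context['term_index']
        return [[index.get(tok, 0) for tok in text.lower().split()]
                for text in texts]

    def fit_transform(self, texts, stage):
        # Parameters are fitted on training data only.
        if stage == 'train':
            self.fit(texts)
        return self.transform(texts)


prep = ToyPreprocessor()
train_ids = prep.fit_transform(['How are glaciers formed', 'glaciers are ice'],
                               stage='train')
test_ids = prep.fit_transform(['how are caves formed'], stage='predict')
print(train_ids)
print(test_ids)
```

The token `caves` never appeared in the training texts, so it falls back to the reserved index instead of growing the vocabulary at prediction time.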

So the previous three lines of code can also be written as:

```python
# Initialize a dssm preprocessor.
from matchzoo import preprocessor
dssm_preprocessor = preprocessor.DSSMPreprocessor()
processed_tr = dssm_preprocessor.fit_transform(train, stage='train')
# We do not need to fit any parameters during the testing stage.
# So we can call transform directly.
processed_te = dssm_preprocessor.transform(test, stage='predict')
```

As described, the fitted parameters are stored in the `context` property. To access the context, just call:

```python
print(processed_tr.context)
```
An example:

In [7]:
print('vocab size: ', len(processed_tr.context['term_index']))

vocab size:  9643


**What has been stored in the `context`?**

We stored `input_shapes` in the context property. The input shape of a DSSM model is dynamic (it depends on the tri-letters generated from the user's training data), so you **must** set the model's input shape manually. We'll discuss this in the model training section.

**What is `dssm_preprocessor` actually doing?**

The `dssm_preprocessor` calls a sequence of `process_units`. Each `process_unit` is designed to perform one atomic operation on the input data. For instance, `dssm_preprocessor` calls:

1. TokenizeUnit: Perform tokenization on raw input data.
2. LowercaseUnit: Transform all tokens into lower case.
3. PuncRemovalUnit: Remove all punctuation.
4. StopRemovalUnit: Remove all the stopwords.
5. NgramLetterUnit: Create letter n-grams (tri-letters by default) as input data. For example, the token `test` will be transformed to `['#te', 'tes', 'est', 'st#']`.
6. VocabularyUnit: Create vocabulary to get the dimensionality of `tri-letters`.
7. WordHashingUnit: Create `WordHashing` layer as described in the paper.
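
The tri-letter and word hashing steps can be sketched in a few lines of plain Python. The function names below are hypothetical and this is not MatchZoo's implementation; in particular, representing a text as a vector of tri-letter counts and ignoring tri-letters that are missing from the vocabulary are assumptions of the sketch.

```python
def letter_trigrams(token, n=3):
    """Pad the token with '#' and slide a window of n letters over it."""
    padded = '#' + token + '#'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_trigram_index(tokens):
    """Assign one position to every distinct tri-letter seen in the tokens."""
    trigrams = sorted({tri for tok in tokens for tri in letter_trigrams(tok)})
    return {tri: i for i, tri in enumerate(trigrams)}


def word_hashing(tokens, trigram_index):
    """Represent a list of tokens as a vector of tri-letter counts."""
    vector = [0] * len(trigram_index)
    for tok in tokens:
        for tri in letter_trigrams(tok):
            if tri in trigram_index:  # Assumption: unseen tri-letters are ignored.
                vector[trigram_index[tri]] += 1
    return vector


print(letter_trigrams('test'))
index = build_trigram_index(['test', 'text'])
print(word_hashing(['test'], index))
```

The length of the resulting vector equals the number of distinct tri-letters, which is why the model's input shape can only be known after fitting on the training data.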

----

### Data Generation

For memory efficiency, we expect you to use a **generator** to produce batches of data on the fly. For example, we can create a **PointGenerator** as follows:

In [8]:
from matchzoo import generators
from matchzoo import tasks
generator_tr = generators.PointGenerator(inputs=processed_tr, task=tasks.Ranking(), batch_size=64, stage='train')
generator_te = generators.PointGenerator(inputs=processed_te, task=tasks.Ranking(), batch_size=64, stage='predict')

To get the first batch of training data, just call `X_train, y_train = generator_tr[0]`.

**What is PointGenerator?**

**PointGenerator** implements the pointwise approach. It is assumed that each query-document pair in the training data has a numerical or ordinal score. The problem can then be approximated by a regression or classification problem: given a single query-document pair, predict its score.

A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal regression and classification algorithms can also be used in the pointwise approach when they are used to predict the score of a single query-document pair and that score takes a small, finite number of values.
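
As a rough illustration of what indexing a pointwise generator means, the sketch below serves `(pairs, labels)` batches by position. `ToyPointGenerator` is a hypothetical class for this tutorial; the real `PointGenerator` takes a `DataPack` and a task, and its internals are not shown here.

```python
import math


class ToyPointGenerator:
    """Hypothetical batch generator: each sample is one scored pair."""

    def __init__(self, pairs, labels, batch_size=2):
        self.pairs = pairs
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.pairs) / self.batch_size)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.batch_size
        stop = start + self.batch_size
        return self.pairs[start:stop], self.labels[start:stop]


pairs = [('qid0', 'did0'), ('qid0', 'did1'), ('qid1', 'did2')]
labels = [1, 0, 1]
gen = ToyPointGenerator(pairs, labels, batch_size=2)
X_first, y_first = gen[0]
print(len(gen), X_first, y_first)
```

Only one batch is materialized per access, which is where the memory saving comes from.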

**What is PairGenerator?**

**PairGenerator** implements the pairwise approach. The problem is approximated by a classification problem: learning a binary classifier that can tell which document is better in a given pair of documents.

In MatchZoo, **PairGenerator** generates one positive and `num_neg` negative examples per pair. As an example, to train a DSSM model (for document ranking), we use `num_neg=4`.
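
One way to picture "one positive and `num_neg` negatives" is the sampling sketch below. The function `make_pairwise_groups` is hypothetical, and the sampling strategy (uniform, with replacement, skipping queries that have no negative document) is an assumption; MatchZoo's own generator may sample differently.

```python
import random


def make_pairwise_groups(relation, num_neg=4, seed=0):
    """Group each positive document with `num_neg` sampled negatives.

    `relation` is a list of (id_left, id_right, label) tuples.
    """
    rng = random.Random(seed)
    by_query = {}
    for id_left, id_right, label in relation:
        docs = by_query.setdefault(id_left, {'pos': [], 'neg': []})
        docs['pos' if label > 0 else 'neg'].append(id_right)

    groups = []
    for id_left, docs in by_query.items():
        if not docs['neg']:
            continue  # Nothing to contrast the positive documents with.
        for pos in docs['pos']:
            # Sample with replacement so short negative lists still work.
            negs = [rng.choice(docs['neg']) for _ in range(num_neg)]
            groups.append((id_left, pos, negs))
    return groups


relation = [('qid0', 'did0', 1), ('qid0', 'did1', 0),
            ('qid0', 'did2', 0), ('qid1', 'did3', 1)]
groups = make_pairwise_groups(relation, num_neg=4)
print(groups)
```

In this example `qid1` contributes no group, because it has a positive document but no negative one to compare against.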

**What is ListGenerator?**

**ListGenerator** serves the listwise approach, which tries to directly optimize the value of the evaluation measures, averaged over all queries in the training data.

Choose the appropriate generator based on your `task`.

----

### Train Your DSSM Model

To train a DSSM model, we need to create an instance of DSSMModel:

In [9]:
from matchzoo import models
dssm_model = models.DSSMModel()

Then, we need to set the hyper-parameters of our DSSM model. In general, there are **two types of hyper-parameters**:

**Required parameters**: For DSSM, since the `input_shapes` depend on the dimensionality of fitted training data, you're required to set this parameter manually!

In [10]:
# The fitted parameters were stored in the `context` property of the processed training data (`processed_tr`).
from matchzoo import tasks
input_shapes = processed_tr.context['input_shapes']
dssm_model.params['input_shapes'] = input_shapes
dssm_model.params['task'] = tasks.Ranking()

**Tunable parameters**: For DSSM, you're allowed to tune these parameters:

```python
from matchzoo import tasks

params = {'w_initializer': 'glorot_normal', # see keras weight_initializer.
          'b_initializer': 'zeros', # see keras bias_initializer.
          'dim_fan_out': 128, # Dimension of output layer.
          'dim_hidden': 300, # Dimension of hidden layer.
          'activation_hidden': 'tanh', # Activation function of hidden layer, see keras activation.
          'num_hidden_layers': 2, # Number of hidden layers.
          'optimizer': 'sgd', # By default, we're using sgd, see keras optimizer.
          'task': tasks.Classification, # Default Classification, you can use tasks.Ranking
          'loss': 'categorical_crossentropy', # categorical_crossentropy, see keras loss.
          'metric': 'acc', # Accuracy by default, see keras metric.
         }
```

As with the **required parameters**, use `dssm_model.params['parameter-name'] = parameter-value` to set the hyper-parameters. If you want to keep the default values for everything else, just use:

In [11]:
dssm_model.guess_and_fill_missing_params()
print('dssm parameters: ', dssm_model.params)

dssm parameters:  name                          DSSMModel
model_class                   <class 'matchzoo.models.dssm_model.DSSMModel'>
input_shapes                  [(9644,), (9644,)]
task                          <matchzoo.tasks.ranking.Ranking object at 0x127b428d0>
metrics                       ['mae']
loss                          mse
optimizer                     adam
w_initializer                 glorot_normal
b_initializer                 zeros
dim_fan_out                   128
dim_hidden                    300
activation_hidden             tanh
num_hidden_layers             2
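
To give a feel for what `dim_hidden`, `num_hidden_layers`, `dim_fan_out`, and `activation_hidden` control, here is a minimal pure-Python forward pass. It is a sketch under stated assumptions, not MatchZoo's implementation: it assumes two separate towers with the same architecture, `tanh` on every layer, cosine similarity as the matching score (as in the original paper), and a Glorot-style uniform initialization as a stand-in for `glorot_normal`. All function names are hypothetical.

```python
import math
import random


def init_layer(dim_in, dim_out, rng):
    """Glorot-style uniform weights (stand-in initializer) and zero biases."""
    limit = math.sqrt(6.0 / (dim_in + dim_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(dim_in)]
               for _ in range(dim_out)]
    return weights, [0.0] * dim_out


def build_tower(dim_input, dim_hidden, num_hidden_layers, dim_fan_out, rng):
    """Stack the hidden layers followed by one output layer."""
    dims = [dim_input] + [dim_hidden] * num_hidden_layers + [dim_fan_out]
    return [init_layer(d_in, d_out, rng) for d_in, d_out in zip(dims, dims[1:])]


def encode(vector, tower, activation=math.tanh):
    """Apply activation(W x + b) layer by layer."""
    for weights, biases in tower:
        vector = [activation(sum(w * x for w, x in zip(row, vector)) + b)
                  for row, b in zip(weights, biases)]
    return vector


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


rng = random.Random(0)
# Tiny dimensions so the example runs instantly; the tutorial uses 300 and 128.
left_tower = build_tower(7, dim_hidden=5, num_hidden_layers=2, dim_fan_out=4, rng=rng)
right_tower = build_tower(7, dim_hidden=5, num_hidden_layers=2, dim_fan_out=4, rng=rng)
query_vec = [1, 1, 0, 1, 1, 0, 0]   # Made-up tri-letter count vectors.
doc_vec = [1, 0, 1, 0, 0, 1, 1]
score = cosine(encode(query_vec, left_tower), encode(doc_vec, right_tower))
print(score)
```

The input dimension (7 here) plays the role of `input_shapes`: it must match the size of the tri-letter vocabulary fitted during preprocessing.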


#### Model Training

Once all the parameters are set, build, compile, and train the model:

In [12]:
dssm_model.build()
dssm_model.compile()
# Fit the dssm model on generator.
dssm_model.fit_generator(generator_tr, steps_per_epoch=200, epochs=10)
# Make predictions on the first batch of test data
X_te, y_te = generator_te[0]
predictions = dssm_model.predict([X_te.text_left, X_te.text_right])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [13]:
for id_left, id_right, pred in zip(X_te.id_left, X_te.id_right, predictions):
    print("{}/{} is predicted as {}".format(id_left, id_right, pred))

Q1391/D1328-2 is predicted as [0.30649]
Q523/D515-6 is predicted as [0.26608124]
Q1389/D1326-2 is predicted as [0.12608601]
Q333/D332-10 is predicted as [0.20222421]
Q1583/D1501-8 is predicted as [-0.13000704]
Q2201/D2074-11 is predicted as [0.52017313]
Q2028/D1914-2 is predicted as [0.15043251]
Q1937/D453-1 is predicted as [-0.24049379]
Q1697/D1610-0 is predicted as [0.2737229]
Q432/D427-8 is predicted as [0.01019711]
Q2684/D2504-5 is predicted as [-0.21924268]
Q2741/D1331-17 is predicted as [-0.07659646]
Q2499/D1349-19 is predicted as [0.15225966]
Q2228/D2099-2 is predicted as [-0.04368623]
Q69/D69-3 is predicted as [0.07977355]
Q2199/D2072-0 is predicted as [0.20606047]
Q2009/D1895-8 is predicted as [0.04554044]
Q2537/D2372-0 is predicted as [0.2289976]
Q2640/D2464-2 is predicted as [0.1054885]
Q1103/D1061-8 is predicted as [-0.10978747]
Q2843/D2640-11 is predicted as [0.14241193]
Q1689/D1603-2 is predicted as [-0.13612278]
Q1409/D1342-12 is predicted as [0.07591998]
Q1707/D1618-0 i
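
For a ranking task, the per-pair scores are usually turned into one ordered list of documents per query. The sketch below does this with the standard library only; `rank_by_query` is a hypothetical helper, and the ids and scores in the example are made up for illustration. With the notebook's variables, you would presumably pass `X_te.id_left`, `X_te.id_right`, and `predictions[:, 0]`.

```python
from collections import defaultdict


def rank_by_query(ids_left, ids_right, scores):
    """Group scored pairs by query and sort each group by descending score."""
    ranked = defaultdict(list)
    for id_left, id_right, score in zip(ids_left, ids_right, scores):
        ranked[id_left].append((id_right, float(score)))
    for docs in ranked.values():
        docs.sort(key=lambda item: item[1], reverse=True)
    return dict(ranked)


# Made-up ids and scores, for illustration only.
ranking = rank_by_query(['q1', 'q1', 'q2', 'q1'],
                        ['d1', 'd2', 'd3', 'd4'],
                        [0.2, 0.9, 0.5, -0.1])
for query_id, docs in ranking.items():
    print(query_id, docs)
```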

#### Model Persistence

You can persist your trained model using the `save()` method and load it back with the `load_model` function:

```python
from matchzoo import engine
# Save the model to dir.
dssm_model.save('/your-model-saved-path')
# And load the model back from the same dir.
loaded_model = engine.load_model('/your-model-saved-path')
```

### Reference

[Huang et al. 2013] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proc. CIKM. ACM, 2333–2338.