# Build a Deep Structured Semantic Model (DSSM)


This tutorial shows how to build the *Deep Structured Semantic Model* (DSSM, [Huang et al. 2013]) with MatchZoo. We use WikiQA as the example benchmark dataset.

## Quick Start

The following code block illustrates the main workflow for training a DSSM model.

In [None]:
from matchzoo import preprocessor
from matchzoo import generators
from matchzoo import models

train, test = ... # prepare your training data and test data.

dssm_preprocessor = preprocessor.DSSMPreprocessor()
processed_tr = dssm_preprocessor.fit_transform(train, stage='train')
processed_te = dssm_preprocessor.fit_transform(test, stage='test')
# DSSM expects the dimensionality of the letter-trigrams as its input shape.
# The fitted parameters are stored in `context` while preprocessing the training data.
input_shapes = processed_tr.context['input_shapes']

generator_tr = generators.PointGenerator(processed_tr)
generator_te = generators.PointGenerator(processed_te)
# Example, train with generator, test with the first batch.
X_te, y_te = generator_te[0]

dssm_model = models.DSSMModel()
dssm_model.params['input_shapes'] = input_shapes
dssm_model.guess_and_fill_missing_params()
dssm_model.build()
dssm_model.compile()
dssm_model.fit_generator(generator_tr)
# Make predictions
predictions = dssm_model.predict([X_te.text_left, X_te.text_right])

## Expected Input

MatchZoo expects a list of *quintuples* as training input for the DSSM model:

In [None]:
train = [('qid0', 'did0', 'query 0', 'document 0', 'label 0'),
         ('qid0', 'did1', 'query 0', 'document 1', 'label 1'),
          ...,
         ('qid1', 'did2', 'query 1', 'document 2', 'label 3')]

The corresponding columns are `(text_left_id, text_right_id, text_left, text_right, label)`. For an information retrieval task, *text_left* is the *query* and *text_right* is the *document*.

For testing, MatchZoo expects a list of *quadruples* (there are no labels) as input:

In [None]:
test = [('qid9', 'did5', 'query 9', 'document 5'),
         ...,
        ('qid2', 'did7', 'query 2', 'document 7')]
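
The two record layouts above can be checked before preprocessing. The following is a minimal sketch in plain Python; `check_records` and `drop_labels` are hypothetical helper names used only for illustration and are not part of MatchZoo.

```python
# Number of fields expected per record for each stage (taken from the text above).
EXPECTED_FIELDS = {'train': 5, 'test': 4}


def check_records(records, stage):
    """Raise ValueError if any record has the wrong number of fields."""
    expected = EXPECTED_FIELDS[stage]
    for position, record in enumerate(records):
        if len(record) != expected:
            raise ValueError(
                'record %d has %d fields, expected %d for stage %r'
                % (position, len(record), expected, stage))


def drop_labels(records):
    """Turn labelled quintuples into unlabelled quadruples."""
    return [record[:4] for record in records]


sample_train = [('qid0', 'did0', 'query 0', 'document 0', '1'),
                ('qid0', 'did1', 'query 0', 'document 1', '0')]
check_records(sample_train, stage='train')

sample_test = drop_labels(sample_train)
check_records(sample_test, stage='test')
```

Failing early on a malformed record is usually cheaper than discovering it partway through a long preprocessing run.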

We take WikiQA as the example benchmark dataset to show the usage of MatchZoo. First, download the data and uncompress it into 'MatchZoo/data/WikiQA/'.

In [22]:
import os

cmd = 'mkdir -p ../../data/WikiQA/\n' \
      +'cd ../../data/WikiQA/\n' \
      +'wget https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip\n' \
      +'unzip WikiQACorpus.zip\n'
print ('download WikiQA data... ', cmd)
os.system(cmd)

download WikiQA data...  mkdir -p ../../data/WikiQA/
cd ../../data/WikiQA/
wget https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
unzip WikiQACorpus.zip



256
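
Note that `os.system` does not raise when a shell command fails; it only returns a status value. On Unix, a return value of 256 encodes exit status 1, so a non-zero result like the one above is worth investigating. As an alternative, here is a minimal sketch that uses only the Python standard library and raises an exception on failure. The helper name `download_and_extract` is hypothetical.

```python
import os
import urllib.request
import zipfile


def download_and_extract(url, target_dir):
    """Download a zip archive from `url` and unpack it into `target_dir`."""
    os.makedirs(target_dir, exist_ok=True)
    zip_path = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(zip_path):
        # urlretrieve raises an exception if the download fails.
        urllib.request.urlretrieve(url, zip_path)
    # ZipFile raises zipfile.BadZipFile if the archive is corrupt.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))


# Usage (needs network access), with the URL from the cell above:
# download_and_extract(
#     'https://download.microsoft.com/download/E/5/F/'
#     'E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip',
#     '../../data/WikiQA/')
```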

The train/dev/test files of WikiQA are WikiQA-train.tsv/WikiQA-dev.tsv/WikiQA-test.tsv under the uncompressed folder WikiQACorpus. The data format of WikiQA is as follows:

`QuestionID\tQuestion\tDocumentID\tDocumentTitle\tSentenceID\tSentence\tLabel`

We can convert this format to the input format of MatchZoo.

In [3]:
data_folder = '../../data/WikiQA/WikiQACorpus/'

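def read_data(path):
    """Read a WikiQA .tsv file into a list of MatchZoo quintuples."""
    output_list = []
    with open(path) as fin:
        for index, line in enumerate(fin):
            if index == 0:
                continue  # Skip the header row.
            tok = line.split('\t')
            # qid, did, q, d, label (the label keeps its trailing newline).
            output_list.append((tok[0], tok[4], tok[1], tok[5], tok[6]))
    return output_list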

train = read_data(data_folder + 'WikiQA-train.tsv')
print ('train', len(train), train[0:10])
dev = read_data(data_folder + 'WikiQA-dev.tsv')
test = read_data(data_folder + 'WikiQA-test.tsv')

train 20360 [('Q1', 'D1-0', 'how are glacier caves formed?', 'A partly submerged glacier cave on Perito Moreno Glacier .', '0\n'), ('Q1', 'D1-1', 'how are glacier caves formed?', 'The ice facade is approximately 60 m high', '0\n'), ('Q1', 'D1-2', 'how are glacier caves formed?', 'Ice formations in the Titlis glacier cave', '0\n'), ('Q1', 'D1-3', 'how are glacier caves formed?', 'A glacier cave is a cave formed within the ice of a glacier .', '1\n'), ('Q1', 'D1-4', 'how are glacier caves formed?', 'Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice.', '0\n'), ('Q2', 'D2-0', 'How are the directions of the velocity and force vectors related in a circular motion', 'In physics , circular motion is a movement of an object along the circumference of a circle or rotation along a circular path.', '0\n'), ('Q2', 'D2-1', 'How are the directions of the velocity and force vectors related in a circular motion', 'It can be u
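
As the output shows, splitting lines by hand leaves a trailing `'\n'` on every label. A possible alternative is to let the standard `csv` module do the parsing. The sketch below assumes the header row of the file uses exactly the column names listed above; the inline sample row is illustrative data, not a verbatim line from the corpus.

```python
import csv
import io


def read_wikiqa(handle):
    """Parse an open WikiQA .tsv handle into (qid, did, q, d, label) tuples."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    return [(row['QuestionID'], row['SentenceID'], row['Question'],
             row['Sentence'], row['Label'])
            for row in reader]


# Illustrative in-memory sample so the sketch runs without the corpus.
sample = io.StringIO(
    'QuestionID\tQuestion\tDocumentID\tDocumentTitle\t'
    'SentenceID\tSentence\tLabel\n'
    'Q1\thow are glacier caves formed?\tD1\tGlacier cave\tD1-3\t'
    'A glacier cave is a cave formed within the ice of a glacier .\t1\n')
records = read_wikiqa(sample)
print(records[0])
```

With a real file, open it with `open(path, newline='')` and pass the handle to `read_wikiqa`. Referring to columns by name also avoids the positional indices used in `read_data`.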

## Preprocessing

You can pre-process your DSSM input in three lines of code:

In [4]:
# Initialize a dssm preprocessor.
from matchzoo import preprocessor
dssm_preprocessor = preprocessor.DSSMPreprocessor()
processed_tr = dssm_preprocessor.fit_transform(train, stage='train')
processed_te = dssm_preprocessor.fit_transform(test, stage='test')

Using TensorFlow backend.
2018-08-18 15:44:26,946 - matchzoo.preprocessor.dssm_preprocessor - INFO - Start building vocabulary & fitting parameters.
100%|██████████| 20360/20360 [26:09<00:00, 12.98it/s]
2018-08-18 16:10:36,295 - matchzoo.preprocessor.dssm_preprocessor - INFO - Start processing input data for train stage.
100%|██████████| 20360/20360 [25:19<00:00, 13.40it/s]
2018-08-18 16:35:55,860 - matchzoo.preprocessor.dssm_preprocessor - INFO - Start processing input data for test stage.
100%|██████████| 6165/6165 [07:52<00:00, 13.04it/s]


You might wonder what *processed_tr* is. It is a **MatchZoo DataPack** (see matchzoo/datapack.py). It contains a *pandas DataFrame* that hosts all the pre-processed records, and a `context` property (a dictionary) that holds all the parameters fitted during pre-processing. The `fit_transform` method combines two steps:

1. Fit parameters using the `fit` function. This only happens when `stage='train'`.
2. Transform data into expected format.

So the previous three lines of code can also be written as:

In [None]:
# Initialize a dssm preprocessor.
from matchzoo import preprocessor
dssm_preprocessor = preprocessor.DSSMPreprocessor()
processed_tr = dssm_preprocessor.fit_transform(train, stage='train')
# We do not need to fit any parameters during the testing stage.
# So we can call transform directly.
processed_te = dssm_preprocessor.transform(test, stage='test')
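
The fit/transform split described above can be illustrated with a toy preprocessor. This is a minimal sketch, not the MatchZoo implementation: `ToyPreprocessor` is a hypothetical class that works on whitespace tokens and maps unseen tokens to index 0.

```python
class ToyPreprocessor:
    """Hypothetical stand-in that mimics the fit/transform pattern."""

    def __init__(self):
        self.term_index = {}

    def fit(self, texts):
        # Learn the vocabulary; indices start at 1 so that 0 can mean "unseen".
        for text in texts:
            for token in text.lower().split():
                if token not in self.term_index:
                    self.term_index[token] = len(self.term_index) + 1
        return self

    def transform(self, texts):
        # Apply the fitted vocabulary without changing it.
        return [[self.term_index.get(token, 0)
                 for token in text.lower().split()]
                for text in texts]

    def fit_transform(self, texts, stage):
        # Parameters are only fitted during the training stage.
        if stage == 'train':
            self.fit(texts)
        return self.transform(texts)


toy = ToyPreprocessor()
train_ids = toy.fit_transform(['how are caves formed', 'caves are cold'],
                              stage='train')
test_ids = toy.fit_transform(['how are glaciers formed'], stage='test')
print(train_ids)  # [[1, 2, 3, 4], [3, 2, 5]]
print(test_ids)   # [[1, 2, 0, 4]]
```

The token `glaciers` never appeared in the training texts, so it maps to 0 and the vocabulary stays unchanged after the test stage.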

As described, the fitted parameters are stored in the **context** property. For example, the `term_index` entry maps each tri-letter to its integer index:

In [22]:
print('vocab size: ', processed_tr.context['term_index'])

vocab size:  {'tet': 1, 'par': 2, '#2d': 3, 'σσα': 4, '#56': 5, '077': 6, 'eus': 7, 'guy': 8, 'mla': 9, 'npb': 10, '37t': 11, 'eah': 12, 'efs': 13, 'uop': 14, 'rpa': 15, 'tyn': 16, '#ph': 17, 'del': 18, 'vy#': 19, 'odb': 20, 'dex': 21, 'use': 22, 'bus': 23, 'kōb': 24, 'adl': 25, 'fud': 26, 'cot': 27, 'bpa': 28, 'rl_': 29, 'ewe': 30, 'gea': 31, 'erm': 32, 'ews': 33, 'een': 34, 'nsm': 35, '47#': 36, 'ano': 37, 'lty': 38, 'oso': 39, '866': 40, 'clr': 41, 'ixi': 42, 'wür': 43, '#בח': 44, 'odh': 45, 'wac': 46, 'pwr': 47, '#5c': 48, '437': 49, 'æmi': 50, 'fal': 51, 'nik': 52, 'jet': 53, 'cov': 54, 'ocs': 55, '#mb': 56, 'wch': 57, 'ar#': 58, 'dgn': 59, 'rot': 60, 'faa': 61, 'fit': 62, 'akk': 63, 'umo': 64, '080': 65, 'fta': 66, 'ise': 67, '08#': 68, '011': 69, '814': 70, '#69': 71, 'sos': 72, 'kbu': 73, 'ín#': 74, 'ke#': 75, 'mba': 76, 'img': 77, 'zwa': 78, 'vi#': 79, 'ruf': 80, 'zor': 81, '#tu': 82, '8co': 83, 'aor': 84, 'bme': 85, 'nig': 86, 'dé#': 87, 'ees': 88, 'xot': 89, '#94': 90, 'sly'

What else is stored in the context? It also holds `input_shapes`. Because the DSSM input shape is dynamic (it depends on the tri-letters generated from the user's training data), you **must** set the model's input shape manually. We discuss this in the model training section.

What is `dssm_preprocessor` actually doing? It calls a sequence of `process_units`. Each `process_unit` is designed to perform one atomic operation on the input data. For instance, `dssm_preprocessor` calls:

1. TokenizeUnit: Perform tokenization on raw input data.
2. LowercaseUnit: Transform all tokens into lower case.
3. PuncRemovalUnit: Remove all punctuation.
4. StopRemovalUnit: Remove all the stopwords.
5. NgramLetterUnit: Create n-gram-letters (by default we're creating tri-letters) as input data. For example, the token `test` will be transformed to `['#te', 'tes', 'est', 'st#']`.
6. VocabularyUnit: Create vocabulary to get the dimensionality of `tri-letters`.
7. WordHashingUnit: Create **WordHashing** layer as described in the paper.
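
The last three steps can be sketched in a few lines of Python and NumPy. This is a minimal illustration under stated assumptions, not MatchZoo's code: the function names are hypothetical, indices start at 0 for simplicity (the fitted `term_index` shown earlier starts at 1), and the vector stores tri-letter counts, although a binary indicator would be another reasonable choice.

```python
import numpy as np


def letter_ngrams(token, n=3):
    """Split a token into letter n-grams, padding with '#' on both sides."""
    padded = '#' + token + '#'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_term_index(tokens):
    """Assign a column index to every distinct tri-letter."""
    term_index = {}
    for token in tokens:
        for gram in letter_ngrams(token):
            term_index.setdefault(gram, len(term_index))
    return term_index


def word_hashing(tokens, term_index):
    """Encode tokens as a fixed-length vector of tri-letter counts."""
    vector = np.zeros(len(term_index))
    for token in tokens:
        for gram in letter_ngrams(token):
            if gram in term_index:  # Tri-letters unseen during fitting are skipped.
                vector[term_index[gram]] += 1
    return vector


print(letter_ngrams('test'))  # ['#te', 'tes', 'est', 'st#']

index = build_term_index(['test', 'best'])
encoded = word_hashing(['best'], index)
print(len(index), encoded)
```

In this sketch the length of the vector equals the number of distinct tri-letters seen while fitting, which is why the input dimensionality can only be known after the training data has been processed.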

## Data Generation

For memory efficiency, we expect you to use a **generator** to produce batches of data on the fly. For example, we can create a **PointGenerator** as follows:

In [6]:
from matchzoo import generators
generator_tr = generators.PointGenerator(processed_tr, batch_size=100)
generator_te = generators.PointGenerator(processed_te, batch_size=100)

To get the first batch of training data, just call `X_train, y_train = generator_tr[0]`.
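
To show what indexing a generator by batch number means, here is a minimal sketch of an index-based batch generator in NumPy. `ToyBatchGenerator` is a hypothetical class for illustration only; it does not reproduce the behaviour of `PointGenerator` beyond returning one `(X, y)` pair per index.

```python
import math

import numpy as np


class ToyBatchGenerator:
    """Hypothetical generator that slices arrays into batches on demand."""

    def __init__(self, features, labels, batch_size=100):
        self.features = np.asarray(features)
        self.labels = np.asarray(labels)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches; the last one may be smaller than batch_size.
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.features[start:stop], self.labels[start:stop]


gen = ToyBatchGenerator(np.arange(250).reshape(250, 1), np.zeros(250),
                        batch_size=100)
X_first, y_first = gen[0]
X_last, y_last = gen[len(gen) - 1]
print(len(gen), X_first.shape, X_last.shape)  # 3 (100, 1) (50, 1)
```

Only the requested slice is materialised per call, so memory use is bounded by the batch size rather than by the size of the dataset.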

## Train Your DSSM Model

To train a DSSM model, we need to create an instance of DSSMModel:

In [7]:
from matchzoo import models
dssm_model = models.DSSMModel()

Then we need to set the hyper-parameters of our DSSM model. In general, there are two types of parameters:

**Required parameters**: For DSSM, since the `input_shapes` depend on the dimensionality of fitted training data, you're required to set this parameter.

In [15]:
# The fitted parameters are stored in the `context` property of the processed DataPack.
input_shapes = processed_tr.context['input_shapes']
dssm_model.params['input_shapes'] = input_shapes

**Tunable parameters**: For DSSM, you're allowed to tune these parameters:

In [None]:
import matchzoo

params = {'w_initializer': 'glorot_normal', # Weight initializer, see keras documentation.
          'b_initializer': 'zeros', # Bias initializer, see keras documentation.
          'dim_fan_out': 128, # Dimension of output layer.
          'dim_hidden': 300, # Dimension of hidden layer.
          'activation_hidden': 'tanh', # Activation function of hidden layer, see keras documentation.
          'num_hidden_layers': 2, # Number of hidden layers.
          'optimizer': 'sgd', # By default, we're using sgd, see keras documentation.
          'task': matchzoo.tasks.Classification, # Default Classification, you can use matchzoo.engine.Ranking
          'loss': 'categorical_crossentropy', # categorical_crossentropy by default, see keras documentation.
          'metrics': ['acc'], # Accuracy by default, see keras documentation.
         }

As with the **required parameters**, use `dssm_model.params['parameter-name'] = parameter-value` to set the hyper-parameters. If you want to keep the default value for everything else, just use

In [16]:
dssm_model.guess_and_fill_missing_params()
print('dssm parameters: ', dssm_model.params)

dssm parameters:  name                          DSSMModel
model_class                   <class 'matchzoo.models.dssm_model.DSSMModel'>
input_shapes                  [(9609,), (9609,)]
task                          <matchzoo.tasks.classification.Classification object at 0x7fc63ea8f9b0>
metrics                       ['acc']
loss                          categorical_crossentropy
optimizer                     sgd
w_initializer                 glorot_normal
b_initializer                 zeros
dim_fan_out                   128
dim_hidden                    300
activation_hidden             tanh
num_hidden_layers             2
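
The idea of filling in whatever the user did not set can be sketched with plain dictionaries. This is not how `guess_and_fill_missing_params` is implemented; `fill_missing` is a hypothetical helper, and the default values are copied from the parameter listing above.

```python
# Defaults copied from the tutorial's parameter listing.
DEFAULTS = {
    'w_initializer': 'glorot_normal',
    'b_initializer': 'zeros',
    'dim_fan_out': 128,
    'dim_hidden': 300,
    'activation_hidden': 'tanh',
    'num_hidden_layers': 2,
    'optimizer': 'sgd',
    'loss': 'categorical_crossentropy',
    'metrics': ['acc'],
}


def fill_missing(params, defaults=DEFAULTS):
    """Return a copy of `params` where unset entries take their default."""
    filled = dict(defaults)
    # Values explicitly set by the user win over the defaults.
    filled.update({key: value for key, value in params.items()
                   if value is not None})
    return filled


user_params = {'input_shapes': [(9609,), (9609,)], 'dim_hidden': 512}
filled = fill_missing(user_params)
print(filled['dim_hidden'], filled['dim_fan_out'])  # 512 128
```

A required parameter such as `input_shapes` has no sensible default, which is why it has to be provided before the rest is filled in.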


Once all the parameters are set, build, compile, and train the model:

In [20]:
dssm_model.build()
dssm_model.compile()
# Fit the dssm model on generator.
dssm_model.fit_generator(generator_tr, steps_per_epoch=200, epochs=10)
# Make predictions on the first batch of test data
X_te, y_te = generator_te[0]
predictions = dssm_model.predict([X_te.text_left, X_te.text_right])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
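
To give an intuition for what the hyper-parameters control, here is a minimal NumPy sketch of a DSSM-style forward pass: each text goes through its own stack of `tanh` layers, and the two output vectors are compared with cosine similarity as in [Huang et al. 2013]. Several things here are assumptions made for illustration: the two towers use separate weights, the output layer also uses `tanh`, the initialisation only approximates `glorot_normal`, and the dimensions are shrunk so the example runs quickly. How MatchZoo turns the similarity into a classification output is not covered.

```python
import numpy as np


def init_tower(dim_input, dim_hidden, num_hidden_layers, dim_fan_out, rng):
    """Create (weights, bias) pairs for one tower."""
    sizes = [dim_input] + [dim_hidden] * num_hidden_layers + [dim_fan_out]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        scale = np.sqrt(2.0 / (fan_in + fan_out))  # Glorot-style standard deviation.
        weights = rng.normal(0.0, scale, size=(fan_in, fan_out))
        layers.append((weights, np.zeros(fan_out)))
    return layers


def tower_forward(x, layers):
    """Apply every dense layer followed by tanh."""
    for weights, bias in layers:
        x = np.tanh(x @ weights + bias)
    return x


def cosine_similarity(a, b, eps=1e-12):
    """Row-wise cosine similarity between two batches of vectors."""
    numerator = np.sum(a * b, axis=1)
    denominator = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return numerator / denominator


rng = np.random.default_rng(0)
dim_input = 50  # 9609 in the tutorial run; smaller here to keep the sketch fast.
left_tower = init_tower(dim_input, dim_hidden=16, num_hidden_layers=2,
                        dim_fan_out=8, rng=rng)
right_tower = init_tower(dim_input, dim_hidden=16, num_hidden_layers=2,
                         dim_fan_out=8, rng=rng)

# Random binary vectors stand in for word-hashed queries and documents.
text_left = rng.integers(0, 2, size=(4, dim_input)).astype(float)
text_right = rng.integers(0, 2, size=(4, dim_input)).astype(float)

scores = cosine_similarity(tower_forward(text_left, left_tower),
                           tower_forward(text_right, right_tower))
print(scores.shape)  # (4,)
```

Each score lies between -1 and 1, with higher values indicating that the two texts are mapped closer together in the learned semantic space.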


You can also persist your trained model and load it again later:

In [None]:
from matchzoo import engine
# Save the model to desktop.
dssm_model.save('/your-path-to-desktop/Desktop/')
# And load the model from desktop.
engine.load_model('/your-path-to-desktop/Desktop/')

## Reference

[Huang et al. 2013] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proc. CIKM. ACM, 2333–2338.