<img src="https://github.com/NTMC-Community/MatchZoo/blob/2.0/artworks/matchzoo-logo.png?raw=True" alt="logo" style="width:600px;float: center"/>

# Get Started with MatchZoo

In this tutorial, we will train a Deep Semantic Structured Model (DSSM) [Huang et al. 2013](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/DSSM_cikm13_talk_v4.pdf) model with [MatchZoo](https://github.com/faneshion/MatchZoo), and use [WikiQA](https://aclweb.org/anthology/D15-1237) as our dataset.

## Download and Unzip the Dataset

In [1]:
import sys
from pathlib import Path
if not Path('../data/WikiQA').exists():
    !mkdir -p ../data/WikiQA
    !wget -P ../data/WikiQA https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
    !unzip -o -d ../data/WikiQA ../data/WikiQA/WikiQACorpus.zip
elif input('WikiQA already exists, download again?(Y/N)').lower() == 'y':
    !rm -rf ../data/WikiQA/
    !mkdir -p ../data/WikiQA
    !wget -P ../data/WikiQA https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
    !unzip -o -d ../data/WikiQA ../../data/WikiQA/WikiQACorpus.zip

--2018-11-07 11:27:30--  https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
Resolving download.microsoft.com... 23.55.115.136
Connecting to download.microsoft.com|23.55.115.136|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7094233 (6.8M) [application/octet-stream]
Saving to: ‘../data/WikiQA/WikiQACorpus.zip’


2018-11-07 11:28:20 (144 KB/s) - ‘../data/WikiQA/WikiQACorpus.zip’ saved [7094233/7094233]

Archive:  ../data/WikiQA/WikiQACorpus.zip
   creating: ../data/WikiQA/WikiQACorpus/emnlp-table/
  inflating: ../data/WikiQA/WikiQACorpus/emnlp-table/WikiQA.CNN.dev.rank  
  inflating: ../data/WikiQA/WikiQACorpus/emnlp-table/WikiQA.CNN.test.rank  
  inflating: ../data/WikiQA/WikiQACorpus/emnlp-table/WikiQA.CNN-Cnt.dev.rank  
  inflating: ../data/WikiQA/WikiQACorpus/emnlp-table/WikiQA.CNN-Cnt.test.rank  
  inflating: ../data/WikiQA/WikiQACorpus/eval.py  
  inflating: ../data/WikiQA/WikiQACorpus/Guidelines_Phase1.pdf

## Load the Dataset
MatchZoo expect a list of *Quintuple* as training data. The corresponded columns are `(text_left_id, text_right_id, text_left, text_right, label)`. For Information Retrieval task, `text_left` is referred as `query`, and `text_right` is document.

For the test case, MatchZoo expect a list of *Quadruple* (we do not need labels) as input.

MatchZoo expect a list of *Quintuple* as training data:

```python
train = [('qid0', 'did0', 'query 0', 'document 0', 'label 0'),
         ('qid0', 'did1', 'query 0', 'document 1', 'label 1'),
          ...,
         ('qid1', 'did2', 'query 1', 'document 2', 'label 3')]
```

The corresponded columns are `(text_left_id, text_right_id, text_left, text_right, label)`. For Information Retrieval task, *text_left* is referred as *query*, and *text_right* is document.

For the test case, MatchZoo expect a list of *Quadruple* (we do not need labels) as input:

```python
test = [('qid9', 'did5', 'query 9', 'document 5'),
         ...,
        ('qid2', 'did7', 'query 2', 'document 7')]
```

In [1]:
def read_data(path, stage):
    def scan_file():
        with open(path) as in_file:
            next(in_file)  # skip header
            for l in in_file:
                yield l.strip().split('\t')
    if stage == 'train':
        return [(qid, did, q, d, label) for qid, q, _, _, did, d, label in scan_file()]
    elif stage == 'predict':
        return [(qid, did, q, d) for qid, q, _, _, did, d, _ in scan_file()]

train = read_data('../data/WikiQA/WikiQACorpus/WikiQA-train.tsv', stage='train')
predict  = read_data('../data/WikiQA/WikiQACorpus/WikiQA-test.tsv', stage='predict')

In [2]:
print(train[0])
print(predict[0])

('Q1', 'D1-0', 'how are glacier caves formed?', 'A partly submerged glacier cave on Perito Moreno Glacier .', '0')
('Q0', 'D0-0', 'HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US', 'African immigration to the United States refers to immigrants to the United States who are or were nationals of Africa .')


## Preprocessing

In [2]:
import matchzoo

In [1]:
from matchzoo import preprocessor
dssm_preprocessor = preprocessor.DSSMPreprocessor()
datapack_train = dssm_preprocessor.fit_transform(train)
datapack_predict = dssm_preprocessor.fit_transform(predict)

Using TensorFlow backend.


NameError: name 'train' is not defined

In [6]:
type(datapack_train)

matchzoo.datapack.DataPack

In [7]:
# pre-processed records including index and processed text to store `text_left` and `id_left`
datapack_train.left.head()

Unnamed: 0_level_0,text_left
id_left,Unnamed: 1_level_1
Q1,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
Q2,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
Q5,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
Q6,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
Q7,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ..."


In [8]:
# pre-processed records including index and processed text to store `text_right` and `id_right`
datapack_train.right.head()

Unnamed: 0_level_0,text_right
id_right,Unnamed: 1_level_1
D1-0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
D1-1,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
D1-2,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
D1-3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
D1-4,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [9]:
# pre-processed records including index and index mapping `id_left` and `id_right`
datapack_train.relation.head()

Unnamed: 0,id_left,id_right,label
0,Q1,D1-0,0
1,Q1,D1-1,0
2,Q1,D1-2,0
3,Q1,D1-3,1
4,Q1,D1-4,0


In [10]:
# other information stored during the pre-processing process
datapack_train.context.keys()

dict_keys(['term_index', 'input_shapes'])

In [11]:
# vocabulary size
len(datapack_train.context['term_index'])

9643

In [12]:
# since DSSM models' input shapes are dynamic
# (depend on the generated tri-letters)
# so we have to calculate shapes during the pre-processing process
datapack_train.context['input_shapes']

[(9644,), (9644,)]

## Data Generation

In [13]:
from matchzoo import generators
from matchzoo import tasks
generator_train = generators.PointGenerator(
    inputs=datapack_train, task=tasks.Ranking(), batch_size=64, stage='train')
generator_predict = generators.PointGenerator(
    inputs=datapack_predict, task=tasks.Ranking(), batch_size=64, stage='predict')

## Training

In [14]:
from matchzoo import models, load_model
from matchzoo import losses
from matchzoo import tasks
from matchzoo import metrics
dssm_model = models.DSSMModel()

In [15]:
# handle dynamic input shapes of DSSM
input_shapes = datapack_train.context['input_shapes']
dssm_model.params['input_shapes'] = input_shapes

In [16]:
dssm_model.params['task'] = tasks.Ranking()
dssm_model.params['task'].metrics = ['mae', 'map']

In [17]:
dssm_model.guess_and_fill_missing_params()
print(dssm_model.params)

name                          DSSMModel
model_class                   <class 'matchzoo.models.dssm_model.DSSMModel'>
input_shapes                  [(9644,), (9644,)]
task                          <matchzoo.tasks.ranking.Ranking object at 0x12297e240>
optimizer                     adam
w_initializer                 glorot_normal
b_initializer                 zeros
dim_fan_out                   128
dim_hidden                    300
activation_hidden             tanh
num_hidden_layers             2


In [18]:
dssm_model.build()
dssm_model.compile()
dssm_model.fit_generator(generator_train, steps_per_epoch=20, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x12298e0b8>

In [19]:
X, Y = generator_train[0]
dssm_model.evaluate(X, Y)



{'loss': 0.11642323434352875,
 'mean_absolute_error': 0.17739450931549072,
 'mean_average_precision(0)': 0.12698412698412698}

In [20]:
X_predict, _ = generator_predict[0]
pred = dssm_model.predict(X)
for id_left, id_right, pred, _ in zip(X_predict.id_left, X_predict.id_right, pred, range(10)):
    print("{}/{} is predicted as {}".format(id_left, id_right, pred))

Q2970/D2744-15 is predicted as [0.1555992]
Q2841/D2638-13 is predicted as [0.02346]
Q2618/D858-20 is predicted as [0.16804412]
Q907/D876-1 is predicted as [0.0600482]
Q1240/D1187-2 is predicted as [0.24149144]
Q1275/D1219-5 is predicted as [0.05737108]
Q2435/D2284-5 is predicted as [-0.00719811]
Q2810/D2610-14 is predicted as [0.11822987]
Q1688/D1602-6 is predicted as [0.05062293]
Q1275/D1219-17 is predicted as [0.04619268]


#### Model Persistence

You can persist your trained model using `model.save()` and `load_model` function:

In [21]:
dssm_model.save('/tmp/my_dssm_model')
loaded_dssm_model = load_model('/tmp/my_dssm_model')

In [22]:
(loaded_dssm_model.predict(X) == dssm_model.predict(X)).all()

True

## Reference

[Huang et al. 2013] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proc. CIKM. ACM, 2333–2338.