<img src="https://github.com/NTMC-Community/MatchZoo/blob/2.0/artworks/matchzoo-logo.png?raw=True" alt="logo" style="width:600px;float: center"/>

In [1]:
import matchzoo as mz

Using TensorFlow backend.


# Prepare Data

In [2]:
train_data_pack = mz.datasets.wiki_qa.load_data(stage='train', task='ranking')
test_data_pack = mz.datasets.wiki_qa.load_data(stage='test', task='ranking')

In [3]:
type(train_data_pack)

matchzoo.data_pack.data_pack.DataPack

`DataPack` is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A `DataPack` consists of three `pandas.DataFrame` objects:

In [4]:
train_data_pack.left.head()

id_left,text_left
Q1,how are glacier caves formed?
Q2,How are the directions of the velocity and for...
Q5,how did apollo creed die
Q6,how long is the term for federal judges
Q7,how a beretta model 21 pistols magazines works


In [5]:
train_data_pack.right.head()

id_right,text_right
D1-0,A partly submerged glacier cave on Perito More...
D1-1,The ice facade is approximately 60 m high
D1-2,Ice formations in the Titlis glacier cave
D1-3,A glacier cave is a cave formed within the ice...
D1-4,"Glacier caves are often called ice caves , but..."


In [6]:
train_data_pack.relation.head()

Unnamed: 0,id_left,id_right,label
0,Q1,D1-0,0
1,Q1,D1-1,0
2,Q1,D1-2,0
3,Q1,D1-3,1
4,Q1,D1-4,0


It is also possible to convert a `DataPack` into a single `pandas.DataFrame` that holds all information.

In [7]:
train_data_pack.frame().head()

Unnamed: 0,id_left,text_left,id_right,text_right,label
0,Q1,how are glacier caves formed?,D1-0,A partly submerged glacier cave on Perito More...,0
1,Q1,how are glacier caves formed?,D1-1,The ice facade is approximately 60 m high,0
2,Q1,how are glacier caves formed?,D1-2,Ice formations in the Titlis glacier cave,0
3,Q1,how are glacier caves formed?,D1-3,A glacier cave is a cave formed within the ice...,1
4,Q1,how are glacier caves formed?,D1-4,"Glacier caves are often called ice caves , but...",0


However, a flat `pandas.DataFrame` like this repeats each text once per pair, so it consumes much more memory when the same texts appear in many rows. That is exactly why MatchZoo uses `DataPack`. For more details about data handling, consult `matchzoo/tutorials/data_handling.ipynb`.
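
The memory argument can be illustrated with a small sketch. It uses plain dictionaries as stand-ins for the `left`, `right`, and `relation` tables; the helper `to_flat_rows` is hypothetical and not part of MatchZoo. The character counts show how much text a flat table would hold if every cell stored its own copy.

```python
# Toy stand-ins for the three tables of a DataPack (not the MatchZoo API).
left = {"Q1": "how are glacier caves formed?"}
right = {
    "D1-1": "The ice facade is approximately 60 m high",
    "D1-2": "Ice formations in the Titlis glacier cave",
}
relation = [("Q1", "D1-1", 0), ("Q1", "D1-2", 0)]


def to_flat_rows(left, right, relation):
    """Join the three tables into one row per (left, right) pair."""
    return [
        {
            "id_left": id_left,
            "text_left": left[id_left],
            "id_right": id_right,
            "text_right": right[id_right],
            "label": label,
        }
        for id_left, id_right, label in relation
    ]


flat_rows = to_flat_rows(left, right, relation)

# Characters of text held by each layout, counting every cell separately.
normalized_chars = sum(map(len, left.values())) + sum(map(len, right.values()))
flat_chars = sum(len(r["text_left"]) + len(r["text_right"]) for r in flat_rows)
print(normalized_chars, flat_chars)
```

The gap between the two counts grows with the number of candidate answers per question, because the question text is repeated in every row of the flat layout.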

# Preprocessing

MatchZoo preprocessors are used to convert a raw `DataPack` into a `DataPack` that is ready to be fed into a model.

In [8]:
preprocessor = mz.preprocessors.NaivePreprocessor()

Using a preprocessor takes two steps: first `fit`, then `transform`. `fit` changes only the preprocessor's inner state, not the input `DataPack`.

In [9]:
preprocessor.fit(train_data_pack)

Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit: 100%|██████████| 2117/2117 [00:00<00:00, 7907.52it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit: 100%|██████████| 18828/18828 [00:03<00:00, 5474.79it/s]
Processing text_left with extend: 100%|██████████| 2117/2117 [00:00<00:00, 793223.30it/s]
Processing text_right with extend: 100%|██████████| 18828/18828 [00:00<00:00, 694678.49it/s]
Building VocabularyUnit from a datapack.: 100%|██████████| 418540/418540 [00:00<00:00, 2738062.29it/s]


<matchzoo.preprocessors.naive_preprocessor.NaivePreprocessor at 0x11c793f60>

`fit` will gather all information it needs into its `context`. In the above example, we can see a `VocabularyUnit` is built during the fitting process using `train_data_pack`.

In [10]:
preprocessor.context

{'vocab_unit': <matchzoo.processor_units.processor_units.VocabularyUnit at 0x11bf4deb8>}

`VocabularyUnit` is a `StatefulProcessorUnit` that has a similar `fit`/`transform` interface. Once a `VocabularyUnit` is fit, it stores a mapping from `term` to `index`, and the reverse mapping, in its `state`.

The `NaivePreprocessor` already handles `VocabularyUnit` internally, so we do not have to worry about that. Just access it through the `NaivePreprocessor`'s `context`.

In [11]:
vocab_unit = preprocessor.context['vocab_unit']
print(vocab_unit.state['term_index']['match'])
print(vocab_unit.state['term_index']['zoo'])
print(vocab_unit.state['index_term'][1])
print(vocab_unit.state['index_term'][2])

25726
15364
geocacher
monolayer
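
To make the `fit`/`transform` split concrete, here is a minimal sketch of a stateful vocabulary unit. `ToyVocabulary` is a hypothetical class written for illustration, not MatchZoo's `VocabularyUnit`; reserving index 0 for padding and unseen terms is an assumption of this sketch.

```python
class ToyVocabulary:
    """Illustrative fit/transform unit; not MatchZoo's VocabularyUnit."""

    def __init__(self):
        self.state = {}

    def fit(self, tokens):
        # Assumption: index 0 is reserved for padding and unseen terms,
        # so real terms are numbered from 1 in sorted order.
        terms = sorted(set(tokens))
        term_index = {term: i for i, term in enumerate(terms, start=1)}
        index_term = {i: term for term, i in term_index.items()}
        index_term[0] = ""
        self.state = {"term_index": term_index, "index_term": index_term}
        return self

    def transform(self, tokens):
        # Only reads the state; terms not seen during fit map to 0.
        term_index = self.state["term_index"]
        return [term_index.get(token, 0) for token in tokens]


vocab = ToyVocabulary().fit(["how", "are", "glacier", "caves", "formed"])
ids = vocab.transform(["how", "are", "ice", "caves", "formed"])
print(ids)
print([vocab.state["index_term"][i] for i in ids])
```

Because `transform` never writes to `state`, the same fitted unit can be applied to both the training and the test data.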


Once `fit`, the preprocessor has enough information to `transform`. `transform` changes neither the preprocessor's inner state nor the input `DataPack`; it returns a transformed `DataPack`.

In [12]:
train_data_pack_processed = preprocessor.transform(train_data_pack)
test_data_pack_processed = preprocessor.transform(test_data_pack)

Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 2117/2117 [00:00<00:00, 7850.86it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 18828/18828 [00:03<00:00, 4775.68it/s]
Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 630/630 [00:00<00:00, 8378.21it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 5914/5914 [00:01<00:00, 4734.85it/s]


In [13]:
train_data_pack_processed.left.head()

id_left,text_left
Q1,"[6248, 3232, 23623, 26906, 18581, 0, 0, 0, 0, ..."
Q2,"[6248, 3232, 11296, 9779, 4231, 11296, 25020, ..."
Q5,"[6248, 8466, 5344, 22570, 26752, 0, 0, 0, 0, 0..."
Q6,"[6248, 18206, 6559, 11296, 12243, 22211, 11936..."
Q7,"[6248, 18788, 4030, 11359, 12567, 17504, 6486,..."


As we can see, `text_left` is now in the sequence form that neural networks expect.

Just to make sure we have the correct sequence:

In [14]:
print('Before:', train_data_pack.left.loc['Q1']['text_left'])
sequence = train_data_pack_processed.left.loc['Q1']['text_left']
print('After:', sequence)
print('Translated:', '_'.join([vocab_unit.state['index_term'][i] for i in sequence]))

Before: how are glacier caves formed?
After: [6248, 3232, 23623, 26906, 18581, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Translated: how_are_glacier_caves_formed_________________________
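
The chain of units reported in the progress output (tokenize, lowercase, remove punctuation, look up vocabulary indices, fix the length) can be sketched in plain Python. All function names below are hypothetical and the tiny `term_index` is made up for the example. Right-padding with 0 mirrors the output shown above; keeping the first `length` tokens when truncating is an assumption.

```python
import string


def tokenize(text):
    # Whitespace tokenization, a simplification for this sketch.
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


def remove_punctuation(tokens):
    table = str.maketrans("", "", string.punctuation)
    stripped = [token.translate(table) for token in tokens]
    return [token for token in stripped if token]


def fix_length(ids, length, pad_value=0):
    # Right-pad short sequences and keep the first `length` items of long ones.
    return (ids + [pad_value] * length)[:length]


def chain(text, term_index, length):
    tokens = remove_punctuation(lowercase(tokenize(text)))
    ids = [term_index.get(token, 0) for token in tokens]
    return fix_length(ids, length)


term_index = {"how": 1, "are": 2, "glacier": 3, "caves": 4, "formed": 5}
sequence = chain("How are glacier caves formed?", term_index, length=8)
print(sequence)
```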


For more details about preprocessing, consult `matchzoo/tutorials/preprocessing.ipynb`.

# Build Model

MatchZoo provides many built-in text matching models.

In [15]:
mz.models.list_available()

[matchzoo.models.naive_model.NaiveModel,
 matchzoo.models.dssm_model.DSSMModel,
 matchzoo.models.cdssm_model.CDSSMModel,
 matchzoo.models.dense_baseline_model.DenseBaselineModel,
 matchzoo.models.arci_model.ArcIModel,
 matchzoo.models.knrm_model.KNRMModel,
 matchzoo.models.duet_model.DUETModel,
 matchzoo.models.drmmtks_model.DRMMTKSModel,
 matchzoo.models.drmm.DRMM]

In [16]:
model = mz.models.DenseBaselineModel()

The model is initialized with a hyperparameter table in which only some values are filled in.

In [17]:
print(model.params)

name                          None
model_class                   <class 'matchzoo.models.dense_baseline_model.DenseBaselineModel'>
input_shapes                  None
task                          None
optimizer                     None
with_multi_layer_perceptron   True
mlp_num_units                 256
mlp_num_layers                None
mlp_num_fan_out               None
mlp_activation_func           None


In [18]:
model.params['name'] = 'My First Model'
model.params['mlp_num_units'] = 3
print(model.params)

name                          My First Model
model_class                   <class 'matchzoo.models.dense_baseline_model.DenseBaselineModel'>
input_shapes                  None
task                          None
optimizer                     None
with_multi_layer_perceptron   True
mlp_num_units                 3
mlp_num_layers                None
mlp_num_fan_out               None
mlp_activation_func           None


Use `guess_and_fill_missing_params` to automatically fill in the remaining hyperparameters. This involves some guessing, so the values it fills in could be wrong. For example, the default task is `Ranking`; if we do not manually set it to `Classification` for data packs prepared for classification, the shape of the model output will not match the data.

In [19]:
model.guess_and_fill_missing_params()
print(model.params)

Parameter "task" set to Ranking Task.
Parameter "input_shapes" set to [(30,), (30,)].
Parameter "optimizer" set to adam.
Parameter "mlp_num_layers" set to 3.
Parameter "mlp_num_fan_out" set to 32.
Parameter "mlp_activation_func" set to relu.
name                          My First Model
model_class                   <class 'matchzoo.models.dense_baseline_model.DenseBaselineModel'>
input_shapes                  [(30,), (30,)]
task                          Ranking Task
optimizer                     adam
with_multi_layer_perceptron   True
mlp_num_units                 3
mlp_num_layers                3
mlp_num_fan_out               32
mlp_activation_func           relu


In [20]:
model.params.completed()

True
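
The idea of a partially filled table that is completed by guessing can be sketched as follows. `ToyParamTable` and its `guess_and_fill_missing` method are hypothetical and only mimic the behavior shown above: values set by the user are kept, and only entries that are still `None` receive a guess.

```python
class ToyParamTable:
    """Illustrative parameter table; not MatchZoo's implementation."""

    def __init__(self, **params):
        self._params = dict(params)

    def __getitem__(self, key):
        return self._params[key]

    def __setitem__(self, key, value):
        # Only known parameters may be set, which catches misspelled names.
        if key not in self._params:
            raise KeyError(key)
        self._params[key] = value

    def completed(self):
        return all(value is not None for value in self._params.values())

    def guess_and_fill_missing(self, guesses):
        # Fill only the entries that are still None; return what was filled.
        filled = {
            key: guesses[key]
            for key, value in self._params.items()
            if value is None and key in guesses
        }
        self._params.update(filled)
        return filled


params = ToyParamTable(name=None, task=None, optimizer=None, mlp_num_units=256)
params["name"] = "My First Model"
params["mlp_num_units"] = 3
was_complete = params.completed()
filled = params.guess_and_fill_missing(
    {"task": "ranking", "optimizer": "adam", "mlp_num_units": 64}
)
print(filled, params.completed())
```

Returning the filled entries makes the guesses easy to review, which matters because a wrong guess (such as the task) silently changes the model.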

With all parameters filled in, we can now build and compile the model.

In [21]:
model.build()
model.compile()
model.backend.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_left (InputLayer)          (None, 30)           0                                            
__________________________________________________________________________________________________
text_right (InputLayer)         (None, 30)           0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 60)           0           text_left[0][0]                  
                                                                 text_right[0][0]                 
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 3)            183         concatenate_1[0][0]              
__________

For more details about models, consult `matchzoo/tutorials/models.ipynb`.

# Train, Evaluate, Predict

A `DataPack` can `unpack` itself into data that can be directly used to train a MatchZoo model.

In [22]:
x, y = train_data_pack_processed.unpack()
test_x, test_y = test_data_pack_processed.unpack()

In [23]:
model.fit(x, y, batch_size=32, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x11e42d710>

An alternative way to train a model is to use a `DataGenerator`. This can be useful for delaying expensive preprocessing steps or for doing real-time data augmentation. For more details about `DataGenerator`, consult `matchzoo/tutorials/data_handling.ipynb`.

In [24]:
data_generator = mz.DataGenerator(train_data_pack_processed, batch_size=32)

In [25]:
model.fit_generator(data_generator, epochs=5, use_multiprocessing=True, workers=4)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x11e3be940>
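
The batching idea behind a generator can be sketched with a small class that produces one batch per index. `ToyBatchGenerator` is hypothetical; its `__len__`/`__getitem__` interface is loosely modeled on `keras.utils.Sequence`, and the dictionary-of-lists layout for `x` is an assumption made for the example.

```python
import math


class ToyBatchGenerator:
    """Illustrative batch generator; not MatchZoo's DataGenerator."""

    def __init__(self, x, y, batch_size=32):
        self.x = x
        self.y = y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.y) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        stop = start + self.batch_size
        # Slicing happens here, so work is deferred until a batch is requested.
        batch_x = {name: values[start:stop] for name, values in self.x.items()}
        return batch_x, self.y[start:stop]


x = {
    "text_left": [[i] for i in range(5)],
    "text_right": [[i * 10] for i in range(5)],
}
y = [0, 1, 0, 0, 1]
generator = ToyBatchGenerator(x, y, batch_size=2)
batches = list(generator)
print(len(generator), batches[-1])
```

Any expensive per-example work placed inside `__getitem__` runs only when a batch is requested, which is where delayed preprocessing or augmentation would go.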

In [26]:
model.evaluate(test_x, test_y)



{'loss': 1.246552992989769, 'mean_absolute_error': 0.20679176925755233}

In [27]:
model.predict(test_x)

array([[0.07851854],
       [0.07851854],
       [0.07851854],
       ...,
       [0.07851854],
       [0.07851854],
       [0.07851854]], dtype=float32)

# Automation

MatchZoo strives for ease of use, and the `matchzoo.auto` package is a perfect example of that.

`matchzoo.auto.prepare` handles the interaction among data, model, and preprocessor automatically. For example, some models, such as `DSSM`, have dynamic input shapes that depend on the result of word hashing. Other models have an embedding layer whose dimension depends on the data's vocabulary size. `prepare` takes care of all that and returns a properly prepared model, data, and preprocessor for you.

In [28]:
model_ok, train_ok, preprocessor_ok = mz.auto.prepare(
    model=mz.models.DSSMModel(),
    data_pack=train_data_pack[:100]
)
test_ok = preprocessor_ok.transform(test_data_pack, verbose=0)
model_ok.fit(*train_ok.unpack(), batch_size=32)
model_ok.evaluate(*test_ok.unpack())

Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => StopRemovalUnit => NgramLetterUnit: 100%|██████████| 13/13 [00:00<00:00, 2538.57it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => StopRemovalUnit => NgramLetterUnit: 100%|██████████| 100/100 [00:00<00:00, 1842.84it/s]
Processing text_left with extend: 100%|██████████| 13/13 [00:00<00:00, 19267.12it/s]
Processing text_right with extend: 100%|██████████| 100/100 [00:00<00:00, 55894.24it/s]
Building VocabularyUnit from a datapack.: 100%|██████████| 8523/8523 [00:00<00:00, 1948122.78it/s]
Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => StopRemovalUnit => NgramLetterUnit => WordHashingUnit: 100%|██████████| 13/13 [00:00<00:00, 2217.76it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => StopRemovalUnit => NgramLetterUnit => WordHashingUnit: 100%

Parameter "name" set to DSSMModel.
Parameter "mlp_num_layers" set to 3.
Parameter "mlp_num_units" set to 64.
Parameter "mlp_num_fan_out" set to 32.
Parameter "mlp_activation_func" set to relu.
Epoch 1/1


{'loss': 0.10139248185858686, 'mean_absolute_error': 0.2107424148942537}
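
Why the input shape depends on the data can be seen from a small sketch of letter n-gram word hashing, the technique the `NgramLetterUnit` and `WordHashingUnit` names above refer to. The functions below are hypothetical and simplified: words are wrapped in `#` markers, split into letter trigrams, and counted into a vector with one slot per trigram seen while fitting.

```python
def letter_ngrams(word, n=3):
    # Mark word boundaries, then slide a window of n letters.
    marked = f"#{word}#"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def build_ngram_index(words, n=3):
    # One vector slot per distinct n-gram seen in the fitting data.
    grams = sorted({gram for word in words for gram in letter_ngrams(word, n)})
    return {gram: i for i, gram in enumerate(grams)}


def word_hash(words, ngram_index, n=3):
    # Count n-grams into a fixed-size vector; unseen n-grams are ignored.
    vector = [0] * len(ngram_index)
    for word in words:
        for gram in letter_ngrams(word, n):
            if gram in ngram_index:
                vector[ngram_index[gram]] += 1
    return vector


train_words = ["glacier", "caves"]
ngram_index = build_ngram_index(train_words)
vector = word_hash(["caves"], ngram_index)
input_shape = (len(ngram_index),)
print(letter_ngrams("cave"), input_shape)
```

The vector length equals the number of distinct n-grams in the fitting data, so the model's input shape is only known after the preprocessor has been fit. That is the kind of dependency `prepare` is described as resolving.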

For more details about automation, consult `matchzoo/tutorials/automation.ipynb`.

# Full Example

In [33]:
model_classes = [
    mz.models.DSSMModel,
    mz.models.DUETModel,
]

In [34]:
task = mz.tasks.Ranking(metrics=['ap', 'ndcg'])
results = []
for model_class in model_classes:
    print(model_class)
    model = model_class()
    model.params['task'] = task
    model_ok, train_ok, preprocessor_ok = mz.auto.prepare(
        model=model,
        data_pack=train_data_pack[:2000],
        verbose=0
    )
    test_ok = preprocessor_ok.transform(test_data_pack, verbose=0)
    callback = mz.engine.callbacks.EvaluateAllMetrics(
        model_ok,
        *test_ok.unpack(),
        batch_size=1024,
        verbose=0
    )
    history = model_ok.fit(*train_ok.unpack(), batch_size=32, epochs=30, callbacks=[callback])
    results.append({'name': model_ok.params['name'], 'history': history})

<class 'matchzoo.models.dssm_model.DSSMModel'>
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
<class 'matchzoo.models.duet_model.DUETModel'>
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
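
For readers unfamiliar with the two ranking metrics requested above, here is a minimal sketch of average precision and NDCG for a single query. These are textbook definitions written for illustration; MatchZoo's own implementations may differ in details such as cutoffs or the gain function. The scores are made up, while the labels are those of `Q1` shown earlier.

```python
import math


def rank_by_score(labels, scores):
    # Reorder the labels from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked_labels):
    # Mean of precision@rank taken at each relevant position.
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def dcg(ranked_labels, k=None):
    # Gain 2^label - 1, discounted by log2(rank + 1).
    top = ranked_labels[:k] if k else ranked_labels
    return sum(
        (2 ** label - 1) / math.log2(rank + 1)
        for rank, label in enumerate(top, start=1)
    )


def ndcg(ranked_labels, k=None):
    # Normalize by the DCG of the ideal (label-sorted) ordering.
    ideal = dcg(sorted(ranked_labels, reverse=True), k)
    return dcg(ranked_labels, k) / ideal if ideal else 0.0


labels = [0, 0, 0, 1, 0]                 # labels of Q1's five candidates
scores = [0.2, 0.1, 0.3, 0.25, 0.05]     # hypothetical model scores
ranked = rank_by_score(labels, scores)
print(ranked, average_precision(ranked), ndcg(ranked))
```

In practice both metrics are computed per query and then averaged over all queries.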


In [None]:
import bokeh.palettes
from bokeh.plotting import figure
from bokeh.io import export_png
from bokeh.layouts import column
from bokeh.models.tools import HoverTool

# One chart per metric recorded in the training history.
charts = {
    metric: figure(
        title=str(metric),
        sizing_mode='scale_width',
        width=800, height=400
    ) for metric in results[0]['history'].history.keys()
}
hover_tool = HoverTool(tooltips=[
    ("x", "$x"),
    ("y", "$y")
])
for metric, sub_chart in charts.items():
    # Draw one line per model, each in its own palette color.
    for result, color in zip(results, bokeh.palettes.Category10[10]):
        x = result['history'].epoch
        y = result['history'].history[metric]
        # Newer bokeh releases spell the legend keyword `legend_label`.
        sub_chart.line(
            x, y, color=color, line_width=2, alpha=0.5, legend=result['name'])
    # Attach the hover tool once per chart rather than once per line.
    sub_chart.add_tools(hover_tool)

# PNG export relies on bokeh's optional export dependencies.
export_png(column(*charts.values()), "quick_start_chart.png")

![chart](./quick_start_chart.png)