<img src="https://github.com/NTMC-Community/MatchZoo/blob/2.0/artworks/matchzoo-logo.png?raw=True" alt="logo" style="width:600px;float: center"/>

In [1]:
import matchzoo as mz

Using TensorFlow backend.


# Prepare Data

In [2]:
train_data_pack = mz.datasets.wiki_qa.load_data(stage='train', task='ranking')
test_data_pack = mz.datasets.wiki_qa.load_data(stage='test', task='ranking')

In [3]:
type(train_data_pack)

matchzoo.data_pack.data_pack.DataPack

`DataPack` is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A `DataPack` consists of three `pandas.DataFrame` objects:

In [4]:
train_data_pack.left.head()

id_left,text_left
Q1,how are glacier caves formed?
Q2,How are the directions of the velocity and for...
Q5,how did apollo creed die
Q6,how long is the term for federal judges
Q7,how a beretta model 21 pistols magazines works


In [5]:
train_data_pack.right.head()

id_right,text_right
D1-0,A partly submerged glacier cave on Perito More...
D1-1,The ice facade is approximately 60 m high
D1-2,Ice formations in the Titlis glacier cave
D1-3,A glacier cave is a cave formed within the ice...
D1-4,"Glacier caves are often called ice caves , but..."


In [6]:
train_data_pack.relation.head()

,id_left,id_right,label
0,Q1,D1-0,0.0
1,Q1,D1-1,0.0
2,Q1,D1-2,0.0
3,Q1,D1-3,1.0
4,Q1,D1-4,0.0


It is also possible to convert a `DataPack` into a single `pandas.DataFrame` that holds all information.

In [7]:
train_data_pack.frame().head()

,id_left,text_left,id_right,text_right,label
0,Q1,how are glacier caves formed?,D1-0,A partly submerged glacier cave on Perito More...,0.0
1,Q1,how are glacier caves formed?,D1-1,The ice facade is approximately 60 m high,0.0
2,Q1,how are glacier caves formed?,D1-2,Ice formations in the Titlis glacier cave,0.0
3,Q1,how are glacier caves formed?,D1-3,A glacier cave is a cave formed within the ice...,1.0
4,Q1,how are glacier caves formed?,D1-4,"Glacier caves are often called ice caves , but...",0.0


However, such a `pandas.DataFrame` consumes much more memory when the texts contain many duplicates, which is exactly why MatchZoo uses `DataPack`. For more details about data handling, consult `matchzoo/tutorials/data_handling.ipynb`.
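To make the duplication argument concrete, here is a minimal sketch that uses plain dicts and lists as stand-ins for the three tables. It is not MatchZoo code, and it counts characters as a rough proxy for memory rather than measuring real memory usage.

```python
# Toy illustration only: plain dicts/lists stand in for the three DataFrames.
left = {"Q1": "how are glacier caves formed?"}
right = {
    "D1-1": "The ice facade is approximately 60 m high",
    "D1-2": "Ice formations in the Titlis glacier cave",
}
relation = [("Q1", "D1-1", 0.0), ("Q1", "D1-2", 0.0)]

# Flattening into one table repeats the question text once per relation row.
flat = [
    (id_left, left[id_left], id_right, right[id_right], label)
    for id_left, id_right, label in relation
]


def text_chars(texts):
    """Total number of characters across an iterable of strings."""
    return sum(len(t) for t in texts)


packed_chars = text_chars(left.values()) + text_chars(right.values())
flat_chars = text_chars(row[1] for row in flat) + text_chars(row[3] for row in flat)

print("packed:", packed_chars, "flat:", flat_chars)
```

With only two relation rows the flat layout already stores the question twice. The gap grows with the number of candidate answers per question.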

# Preprocessing

MatchZoo preprocessors are used to convert a raw `DataPack` into a `DataPack` that is ready to be fed into a model.

In [8]:
preprocessor = mz.preprocessors.NaivePreprocessor()

There are two steps to using a preprocessor: first `fit`, then `transform`. `fit` only changes the preprocessor's inner state, not the input `DataPack`.

In [9]:
preprocessor.fit(train_data_pack)

Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit: 100%|██████████| 2118/2118 [00:00<00:00, 7976.32it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit: 100%|██████████| 18841/18841 [00:03<00:00, 5108.73it/s]
Processing text_left with extend: 100%|██████████| 2118/2118 [00:00<00:00, 774366.79it/s]
Processing text_right with extend: 100%|██████████| 18841/18841 [00:00<00:00, 744169.82it/s]
Building VocabularyUnit from a datapack.: 100%|██████████| 418382/418382 [00:00<00:00, 2621816.78it/s]


<matchzoo.preprocessors.naive_preprocessor.NaivePreprocessor at 0x1175cdc18>

`fit` gathers all the information the preprocessor needs into its `context`. In the above example, we can see that a `VocabularyUnit` is built during the fitting process using `train_data_pack`.

In [10]:
preprocessor.context

{'vocab_unit': <matchzoo.processor_units.processor_units.VocabularyUnit at 0x1176779b0>}

`VocabularyUnit` is a `StatefulProcessorUnit` that has a similar `fit`/`transform` interface. Once a `VocabularyUnit` has been fitted, it stores a mapping from `term` to `index`, and the reverse mapping, in its `state`.

The `NaivePreprocessor` already handles `VocabularyUnit` internally, so we do not have to worry about that. Just access it through the `NaivePreprocessor`'s `context`.

In [11]:
vocab_unit = preprocessor.context['vocab_unit']
print(vocab_unit.state['term_index']['match'])
print(vocab_unit.state['term_index']['zoo'])
print(vocab_unit.state['index_term'][1])
print(vocab_unit.state['index_term'][2])

28216
23428
terminate
outnumbering
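The two-way mapping can be pictured with a small stand-in class. This is a hypothetical sketch, not the `VocabularyUnit` source: the class name, the alphabetical index assignment, and the treatment of unknown terms are all assumptions made for illustration.

```python
class ToyVocabulary:
    """Hypothetical stand-in for a stateful vocabulary unit (not MatchZoo code)."""

    def __init__(self):
        self.state = {"term_index": {}, "index_term": {}}

    def fit(self, tokens):
        # Assumption: index 0 is reserved for padding, so real terms start at 1.
        for index, term in enumerate(sorted(set(tokens)), start=1):
            self.state["term_index"][term] = index
            self.state["index_term"][index] = term
        return self

    def transform(self, tokens):
        # Assumption: unknown terms map to 0; a real implementation may differ.
        return [self.state["term_index"].get(t, 0) for t in tokens]


vocab = ToyVocabulary().fit(["how", "are", "glacier", "caves", "formed"])
ids = vocab.transform(["glacier", "caves"])
print(ids)                                           # [4, 2]
print([vocab.state["index_term"][i] for i in ids])   # ['glacier', 'caves']
```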


Once fitted, the preprocessor has enough information to `transform`. `transform` changes neither the preprocessor's inner state nor the input `DataPack`; it returns a new, transformed `DataPack`.

In [12]:
train_data_pack_processed = preprocessor.transform(train_data_pack)
test_data_pack_processed = preprocessor.transform(test_data_pack)

Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 2118/2118 [00:00<00:00, 8359.18it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 18841/18841 [00:04<00:00, 4489.64it/s]
Processing text_left with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 633/633 [00:00<00:00, 8564.03it/s]
Processing text_right with chain_transform of TokenizeUnit => LowercaseUnit => PuncRemovalUnit => VocabularyUnit => FixedLengthUnit: 100%|██████████| 5961/5961 [00:01<00:00, 4806.61it/s]


In [13]:
train_data_pack_processed.left.head()

id_left,text_left
Q1,"[20911, 4358, 3064, 5563, 8253, 0, 0, 0, 0, 0,..."
Q2,"[20911, 4358, 2973, 17066, 14927, 2973, 15603,..."
Q5,"[20911, 16734, 13868, 21434, 1795, 0, 0, 0, 0,..."
Q6,"[20911, 2056, 4196, 2973, 28142, 24847, 9954, ..."
Q7,"[20911, 308, 17685, 14739, 17293, 6575, 1022, ..."


As we can see, `text_left` is already in the sequence form that neural networks expect.

Just to make sure we have the correct sequence:

In [14]:
print('Before:', train_data_pack.left.loc['Q1']['text_left'])
sequence = train_data_pack_processed.left.loc['Q1']['text_left']
print('After:', sequence)
print('Translated:', '_'.join([vocab_unit.state['index_term'][i] for i in sequence]))

Before: how are glacier caves formed?
After: [20911, 4358, 3064, 5563, 8253, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Translated: how_are_glacier_caves_formed_________________________
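The chain reported in the progress output (tokenize, lowercase, remove punctuation, look up vocabulary indices, pad to a fixed length) can be sketched in plain Python. `ToyPreprocessor` and its helper are hypothetical names, and using 0 as the padding index is an assumption based on the zeros in the output above. The sketch also shows that `fit` and `transform` leave the raw input untouched.

```python
import string

FIXED_LENGTH = 30  # matches the sequence length seen in the output above
PAD_INDEX = 0      # assumption: 0 is the padding index, as the zeros above suggest


def tokenize_lower_strip(text):
    """Split on whitespace, lowercase, and strip punctuation from each token."""
    tokens = [t.lower().strip(string.punctuation) for t in text.split()]
    return [t for t in tokens if t]


class ToyPreprocessor:
    """Hypothetical fit/transform preprocessor; not the NaivePreprocessor source."""

    def __init__(self):
        self.context = {}

    def fit(self, texts):
        # Only the preprocessor's own state changes here.
        terms = sorted({tok for text in texts for tok in tokenize_lower_strip(text)})
        self.context["term_index"] = {t: i for i, t in enumerate(terms, start=1)}
        return self

    def transform(self, texts):
        # Builds and returns new lists; neither the state nor the input is modified.
        term_index = self.context["term_index"]
        out = []
        for text in texts:
            ids = [term_index.get(t, PAD_INDEX) for t in tokenize_lower_strip(text)]
            ids = ids[:FIXED_LENGTH]
            out.append(ids + [PAD_INDEX] * (FIXED_LENGTH - len(ids)))
        return out


raw = ["how are glacier caves formed?", "how did apollo creed die"]
prep = ToyPreprocessor().fit(raw)
processed = prep.transform(raw)
print(processed[0][:6])   # [9, 2, 8, 3, 7, 0]
print(raw[0])             # the raw text is unchanged
```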


For more details about preprocessing, consult `matchzoo/tutorials/preprocessing.ipynb`.

# Build Model

MatchZoo provides many built-in text matching models.

In [15]:
mz.models.list_available()

[matchzoo.models.naive_model.NaiveModel,
 matchzoo.models.dssm_model.DSSMModel,
 matchzoo.models.cdssm_model.CDSSMModel,
 matchzoo.models.dense_baseline_model.DenseBaselineModel,
 matchzoo.models.arci_model.ArcIModel,
 matchzoo.models.knrm_model.KNRMModel,
 matchzoo.models.duet_model.DUETModel,
 matchzoo.models.drmmtks_model.DRMMTKSModel]

In [16]:
model = mz.models.DenseBaselineModel()

The model is initialized with a hyperparameter table in which only some of the values are filled in.

In [17]:
print(model.params)

name                          None
model_class                   <class 'matchzoo.models.dense_baseline_model.DenseBaselineModel'>
input_shapes                  None
task                          None
optimizer                     None
num_dense_layers              1
num_dense_units               512


In [18]:
model.params['name'] = 'My First Model'
model.params['num_dense_layers'] = 3
print(model.params)

name                          My First Model
model_class                   <class 'matchzoo.models.dense_baseline_model.DenseBaselineModel'>
input_shapes                  None
task                          None
optimizer                     None
num_dense_layers              3
num_dense_units               512


Use `guess_and_fill_missing_params` to automatically fill in the other hyperparameters. This involves some guessing, so the values it fills in could be wrong. For example, the default task is `Ranking`; if we do not manually set it to `Classification` for data packs prepared for classification, the shape of the model output will not match the data.

In [19]:
model.guess_and_fill_missing_params()
print(model.params)

Parameter "task" set to Ranking Task.
Parameter "input_shapes" set to [(30,), (30,)].
Parameter "optimizer" set to adam.
name                          My First Model
model_class                   <class 'matchzoo.models.dense_baseline_model.DenseBaselineModel'>
input_shapes                  [(30,), (30,)]
task                          Ranking Task
optimizer                     adam
num_dense_layers              3
num_dense_units               512


In [20]:
model.params.completed()

True
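The "fill only what is missing" behaviour described above can be illustrated with a toy parameter table. Everything in this sketch (`ToyParamTable`, `fill_missing`, the string values) is hypothetical and far simpler than MatchZoo's own parameter table. It only shows why setting a value by hand before guessing protects it from a wrong default.

```python
class ToyParamTable:
    """Hypothetical parameter table; the real MatchZoo class is richer than this."""

    def __init__(self, **params):
        self._params = dict(params)

    def __getitem__(self, key):
        return self._params[key]

    def __setitem__(self, key, value):
        if key not in self._params:
            raise KeyError(f"unknown parameter: {key}")
        self._params[key] = value

    def fill_missing(self, guesses):
        """Fill only parameters that are still None; return the names filled."""
        filled = []
        for key, value in guesses.items():
            if key in self._params and self._params[key] is None:
                self._params[key] = value
                filled.append(key)
        return filled

    def completed(self):
        return all(v is not None for v in self._params.values())


params = ToyParamTable(name=None, task=None, optimizer=None, num_dense_layers=1)
params["name"] = "My First Model"
params["task"] = "classification"  # set by hand, so the guess below leaves it alone
filled = params.fill_missing({"task": "ranking", "optimizer": "adam"})
print(filled, params["task"], params.completed())
```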

With all parameters filled in, we can now build and compile the model.

In [21]:
model.build()
model.compile()
model.backend.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_left (InputLayer)          (None, 30)           0                                            
__________________________________________________________________________________________________
text_right (InputLayer)         (None, 30)           0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 60)           0           text_left[0][0]                  
                                                                 text_right[0][0]                 
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 512)          31232       concatenate_1[0][0]              
__________
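As a quick sanity check, the `Param #` value for `dense_1` in the summary above can be reproduced by hand: a fully connected layer with a bias has one weight per input-output pair plus one bias per unit. The helper name below is made up for this check.

```python
def dense_param_count(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units


concat_dim = 30 + 30  # text_left and text_right are concatenated
print(dense_param_count(concat_dim, 512))  # 31232, matching dense_1 above
```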

For more details about models, consult `matchzoo/tutorials/models.ipynb`.

# Train, Evaluate, Predict

A `DataPack` can `unpack` itself into data that can be directly used to train a MatchZoo model.

In [22]:
x, y = train_data_pack_processed.unpack()
test_x, test_y = test_data_pack_processed.unpack()

In [23]:
model.fit(x, y, batch_size=32, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1192bc0b8>

An alternative way to train a model is to use a `DataGenerator`. This can be useful for delaying expensive preprocessing steps or for doing real-time data augmentation. For more details about `DataGenerator`, consult `matchzoo/tutorials/data_handling.ipynb`.

In [24]:
data_generator = mz.DataGenerator(train_data_pack_processed, batch_size=32)

In [25]:
model.fit_generator(data_generator, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1192bc358>
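The general idea behind generator-based training is that batches are produced lazily, one at a time, instead of being materialized up front. The sketch below is a plain Python generator, not the `DataGenerator` implementation. It assumes that the inputs are a dict keyed by input name (as the `text_left` and `text_right` layers in the model summary suggest), and the toy values are made up.

```python
def batches(x, y, batch_size):
    """Yield (inputs, labels) slices lazily, one batch at a time."""
    for start in range(0, len(y), batch_size):
        stop = start + batch_size
        yield {name: values[start:stop] for name, values in x.items()}, y[start:stop]


# Made-up toy data: five examples with two token ids per text.
x_toy = {"text_left": [[1, 2]] * 5, "text_right": [[3, 4]] * 5}
y_toy = [0.0, 0.0, 1.0, 0.0, 1.0]

sizes = [len(labels) for _, labels in batches(x_toy, y_toy, batch_size=2)]
print(sizes)  # [2, 2, 1]
```

Because each batch is built only when requested, any per-batch work (such as augmentation) could be placed inside the loop body.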

In [26]:
model.evaluate(test_x, test_y)



{'loss': 141.9180912843876, 'mean_absolute_error': 10.981327577731712}

In [27]:
model.predict(test_x)

array([[10.7579565],
       [11.024283 ],
       [12.02349  ],
       ...,
       [ 6.1800456],
       [ 7.6900096],
       [ 6.036018 ]], dtype=float32)
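For a ranking task, raw scores usually become useful once they are grouped by query and sorted. The sketch below assumes each predicted score can be paired with the `id_left` and `id_right` of the relation row it came from. The scores and the `Q2`/`D2-*` identifiers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (id_left, id_right, score) triples; the scores are made up.
scored = [
    ("Q1", "D1-0", 0.2),
    ("Q1", "D1-3", 0.9),
    ("Q1", "D1-4", 0.5),
    ("Q2", "D2-0", 0.1),
    ("Q2", "D2-1", 0.7),
]

# Group candidate documents by query.
by_query = defaultdict(list)
for id_left, id_right, score in scored:
    by_query[id_left].append((score, id_right))

# Sort each group from highest to lowest score.
rankings = {
    query: [doc for _, doc in sorted(pairs, reverse=True)]
    for query, pairs in by_query.items()
}
print(rankings["Q1"])  # ['D1-3', 'D1-4', 'D1-0']
```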

# Automation

MatchZoo strives for ease of use, and the `matchzoo.auto` package is a good example of that.

`prepare` handles the interaction between the data and the model automatically. Some models, such as `DSSM`, have dynamic input shapes that depend on the result of word hashing. Other models have an embedding layer whose dimension depends on the data's vocabulary size. `prepare` takes care of all that and returns a properly prepared model, data pack, and preprocessor.

In [28]:
for model_cls in mz.models.list_available():
    print(model_cls)
    model_ok, train_ok, preprocessor_ok = mz.auto.prepare(
        model=model_cls(), data_pack=train_data_pack, verbose=0)
    test_ok = preprocessor_ok.transform(test_data_pack, verbose=0)
    model_ok.fit(*train_ok.unpack())
    model_ok.evaluate(*test_ok.unpack())

<class 'matchzoo.models.naive_model.NaiveModel'>
Epoch 1/1
<class 'matchzoo.models.dssm_model.DSSMModel'>
Epoch 1/1
<class 'matchzoo.models.cdssm_model.CDSSMModel'>


ValueError: Error when checking input: expected text_left to have 3 dimensions, but got array with shape (20360, 30)
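To see why a word-hashing model needs data-dependent input shapes, consider letter-trigram hashing, the scheme usually associated with DSSM. The sketch below is an illustration under that assumption, not MatchZoo's implementation, and the function names are hypothetical. The number of distinct trigrams, and therefore the width of a bag-of-trigrams input vector, depends on the texts that were seen.

```python
def letter_trigrams(word):
    """Return the letter trigrams of a word wrapped in '#' boundary markers."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def trigram_vocabulary(texts):
    """Collect every distinct trigram seen in the given texts."""
    return sorted({
        tri
        for text in texts
        for word in text.lower().split()
        for tri in letter_trigrams(word)
    })


small = trigram_vocabulary(["glacier caves"])
large = trigram_vocabulary(["glacier caves", "apollo creed"])

print(letter_trigrams("cave"))  # ['#ca', 'cav', 'ave', 've#']
print(len(small), len(large))   # more text gives a wider trigram vocabulary
```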

For more details about automation, consult `matchzoo/tutorials/automation.ipynb`.