Reproduction of Benchmark Results #99
Comments
Could you give more details, and which model are you running? |
I am following the readme. I cloned the repo, ran run_data.sh under data/WikiQA, then entered the readme command for training, i.e. "python matchzoo/main.py --phase train --model_file examples/wikiqa/config/drmm_wikiqa.config", followed by the testing command "python matchzoo/main.py --phase predict --model_file examples/wikiqa/config/drmm_wikiqa.config". My DRMM results for NDCG@3, NDCG@5, and MAP are roughly half the values shown in the table: around 0.29, 0.34, and 0.357 respectively. |
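For reference, the metrics under discussion (ndcg@3, ndcg@5, map) follow the standard IR definitions; a minimal self-contained sketch is below. The helper names are mine for illustration, not MatchZoo's evaluation code, so small implementation details (e.g. the gain function) may differ from what the toolkit actually computes.

```python
import math

def ndcg_at_k(rels, k):
    """NDCG@k for a ranked list of graded relevance labels."""
    def dcg(labels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(labels[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def average_precision(rels):
    """AP for one query: mean precision at each relevant rank position."""
    hits, precisions = 0, []
    for i, r in enumerate(rels):
        if r > 0:
            hits += 1
            precisions.append(hits / (i + 1))
    return sum(precisions) / len(precisions) if precisions else 0.0

print(ndcg_at_k([1, 0, 1], k=3))     # ~0.9197
print(average_precision([1, 0, 1]))  # ~0.8333
```

MAP is the mean of `average_precision` over all queries, which is what the reported `map` numbers aggregate.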
Similar question here. This is something I expected, but clearly the benchmark in the readme is way higher than typical IR benchmarks, so I guess there might be some error in the evaluation functions. I tried to read the source code of MatchZoo and figured there are some differences w.r.t. the paper's implementation; take the CDSSM model. MatchZoo has:
seq.add(Dense(self.config['hidden_sizes'][0], activation='relu', input_shape=(input_dim,)))
seq.add(Dropout(self.config['dropout_rate']))
while the original paper was using a different activation, and with:
wordhashing = Embedding(self.config['vocab_size'], self.config['embed_size'], weights=[self.config['embed']], trainable=self.embed_trainable)
the word hashing layer in the CDSSM implementation is represented as an Embedding layer. Also, the config uses
"optimizer": "adam",
while the original paper is using a different optimizer.
I guess these changes could bring some unexpected results; I would say let's be careful when using it for scientific purposes. |
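For context on the word-hashing difference mentioned above: the (C)DSSM papers describe word hashing as a fixed letter-trigram decomposition rather than a learned embedding. A minimal sketch of that decomposition (my own helper names, not MatchZoo code):

```python
def letter_trigrams(word):
    """Decompose a word into letter trigrams with '#' boundary markers,
    as in the DSSM word-hashing scheme."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def hash_vector(word, trigram_vocab):
    """Bag-of-trigrams count vector over a fixed trigram vocabulary."""
    index = {t: i for i, t in enumerate(trigram_vocab)}
    counts = [0] * len(trigram_vocab)
    for tri in letter_trigrams(word):
        if tri in index:
            counts[index[tri]] += 1
    return counts

print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

Replacing this fixed hashing with a trainable Embedding layer (as in the quoted MatchZoo code) changes the input representation, which is one plausible source of result differences.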
Thanks @millanbatra for the questions and feedback! The choice of hyper-parameters has a great impact on the results on WikiQA. But it is still very strange that "for the benchmark results of WikiQA, the reproduced values for NDCG@3, NDCG@5, and MAP are roughly half of the values shown in the table". The gap is too large; with the correct settings it shouldn't be. I pasted the raw output log file from my side when I ran aNMM of MatchZoo on WikiQA. You can see the results I got are very close to the results in our readme file. I suggest you compare your model's hyper-parameter configuration with mine to debug your model. Thanks @bwanglzu for pointing out the differences between the implementations in MatchZoo and the descriptions in some papers. Yes, this is possible. We tried to implement the most important components/novel parts of these neural models, but it is still possible that there are some differences between our current implementation details and some details described in the papers. For critical differences, we will fix them in the next version of MatchZoo. But for differences like dropout, I think it is fine to keep them: you can adjust the dropout rate to control the network as you want. What do you think? @faneshion @pl8787 @bwanglzu Stay tuned. My raw output logs:
|
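On the dropout point above: the rate is indeed just a tunable knob. A plain-Python sketch of inverted dropout, the scheme most modern frameworks apply at training time (illustrative only, not MatchZoo's Keras `Dropout` layer):

```python
import random

def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`, scaling survivors by
    1/(1 - rate) so the expected sum is preserved (training-time only;
    at inference the layer is the identity)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
print(inverted_dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
```

Note that a `dropout_rate` of 0.9, as in the DSSM config below, zeroes 90% of activations if interpreted as a drop probability, which is worth double checking against the intended meaning in the config.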
Below is the output on my system after running:
Output of Train:
Output of Predict:
|
@yangliuy Agree, just leave stuff such as dropout as it is. Any plans for the next version? |
My values match @aneesh-joshi's almost exactly: ndcg@3=0.36468, map=0.404962, and ndcg@5=0.440946. |
@millanbatra @aneesh-joshi I noticed that your training loss didn't decrease as in my output; that's why your results are bad. The vocab_size is different as well; you can compare your settings with mine. It is also possible that some code changes between "12-02-2017" and "05-19-2018" introduced bugs into MatchZoo (my run was performed on "12-02-2017"). It is very strange that your training loss stayed the same until the last iteration. We'll double check this part later. |
@yangliuy For dssm, the training loss is decreasing as expected. Maybe this is a drmm-specific issue; I will need to verify with other models. Ran the following.
Train: python matchzoo/main.py --phase train --model_file examples/wikiqa/config/dssm_wikiqa.config
Predict: python matchzoo/main.py --phase predict --model_file examples/wikiqa/config/dssm_wikiqa.config
Output for train:
Using TensorFlow backend.
2018-05-20 11:55:29.012697: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
{
"inputs": {
"test": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"predict": {
"phase": "PREDICT",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"train": {
"relation_file": "./data/WikiQA/relation_train.txt",
"input_type": "Triletter_PairGenerator",
"batch_size": 100,
"batch_per_iter": 5,
"dtype": "dssm",
"phase": "TRAIN",
"query_per_iter": 50,
"use_iter": false
},
"share": {
"vocab_size": 3314,
"embed_size": 1,
"target_mode": "ranking",
"text1_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"text2_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"word_triletter_map_file": "./data/WikiQA/word_triletter_map.txt"
},
"valid": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_valid.txt",
"dtype": "dssm"
}
},
"global": {
"optimizer": "adam",
"num_iters": 400,
"save_weights_iters": 10,
"learning_rate": 0.0001,
"test_weights_iters": 400,
"weights_file": "examples/wikiqa/weights/dssm.wikiqa.weights",
"model_type": "PY",
"display_interval": 10
},
"outputs": {
"predict": {
"save_format": "TREC",
"save_path": "predict.test.wikiqa.txt"
}
},
"losses": [
{
"object_name": "rank_hinge_loss",
"object_params": {
"margin": 1.0
}
}
],
"metrics": [
"ndcg@3",
"ndcg@5",
"map"
],
"net_name": "DSSM",
"model": {
"model_py": "dssm.DSSM",
"setting": {
"dropout_rate": 0.9,
"hidden_sizes": [
300
]
},
"model_path": "./matchzoo/models/"
}
}
[Embedding] Embedding Load Done.
[Input] Process Input Tags. [u'train'] in TRAIN, [u'test', u'valid'] in EVAL.
[./data/WikiQA/corpus_preprocessed.txt]
Data size: 24106
[Dataset] 1 Dataset Load Done.
{u'relation_file': u'./data/WikiQA/relation_train.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_PairGenerator', u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'batch_size': 100, u'batch_per_iter': 5, u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'TRAIN', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32), u'query_per_iter': 50, u'use_iter': False}
[./data/WikiQA/relation_train.txt]
Instance size: 20360
Pair Instance Count: 8995
[Triletter_PairGenerator] init done
{u'relation_file': u'./data/WikiQA/relation_test.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32)}
[./data/WikiQA/relation_test.txt]
Instance size: 2341
List Instance Count: 237
[Triletter_ListGenerator] init done
{u'relation_file': u'./data/WikiQA/relation_valid.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32)}
[./data/WikiQA/relation_valid.txt]
Instance size: 1126
List Instance Count: 122
[Triletter_ListGenerator] init done
[DSSM] init done
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 188.4258 MB Resident: 192948 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 188.4258 MB Resident: 192948 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 188.7344 MB Resident: 193264 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 188.7344 MB Resident: 193264 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Dot [shape]: [None, 1]
[Memory] Total Memory Use: 188.9414 MB Resident: 193476 Shared: 0 UnshareData: 0 UnshareStack: 0
[Model] Model Compile Done.
[05-20-2018 11:55:29] [Train:train] Iter:0 loss=0.875981
[05-20-2018 11:55:30] [Eval:test] Iter:0 ndcg@3=0.530968 map=0.547287 ndcg@5=0.598319
[05-20-2018 11:55:31] [Eval:valid] Iter:0 ndcg@3=0.559904 map=0.579712 ndcg@5=0.626929
[05-20-2018 11:55:31] [Train:train] Iter:1 loss=0.846194
[05-20-2018 11:55:32] [Eval:test] Iter:1 ndcg@3=0.530223 map=0.549429 ndcg@5=0.604287
[05-20-2018 11:55:32] [Eval:valid] Iter:1 ndcg@3=0.557536 map=0.581286 ndcg@5=0.628172
[05-20-2018 11:55:33] [Train:train] Iter:2 loss=0.832642
[05-20-2018 11:55:34] [Eval:test] Iter:2 ndcg@3=0.530487 map=0.551394 ndcg@5=0.606368
[05-20-2018 11:55:34] [Eval:valid] Iter:2 ndcg@3=0.557536 map=0.580841 ndcg@5=0.625001
[05-20-2018 11:55:34] [Train:train] Iter:3 loss=0.807873
[05-20-2018 11:55:35] [Eval:test] Iter:3 ndcg@3=0.534706 map=0.552465 ndcg@5=0.606953
[05-20-2018 11:55:36] [Eval:valid] Iter:3 ndcg@3=0.558610 map=0.577279 ndcg@5=0.622544
[05-20-2018 11:55:36] [Train:train] Iter:4 loss=0.792922
[05-20-2018 11:55:37] [Eval:test] Iter:4 ndcg@3=0.538373 map=0.559824 ndcg@5=0.610877
[05-20-2018 11:55:37] [Eval:valid] Iter:4 ndcg@3=0.554511 map=0.576596 ndcg@5=0.621976
[05-20-2018 11:55:38] [Train:train] Iter:5 loss=0.780318
[05-20-2018 11:55:39] [Eval:test] Iter:5 ndcg@3=0.537771 map=0.559040 ndcg@5=0.612908
[05-20-2018 11:55:39] [Eval:valid] Iter:5 ndcg@3=0.559683 map=0.579048 ndcg@5=0.630220
[05-20-2018 11:55:39] [Train:train] Iter:6 loss=0.751999
[05-20-2018 11:55:40] [Eval:test] Iter:6 ndcg@3=0.539328 map=0.558607 ndcg@5=0.612648
[05-20-2018 11:55:41] [Eval:valid] Iter:6 ndcg@3=0.560562 map=0.581029 ndcg@5=0.631679
[05-20-2018 11:55:41] [Train:train] Iter:7 loss=0.732407
.
.
.
[05-20-2018 12:07:58] [Train:train] Iter:395 loss=0.001485
[05-20-2018 12:07:59] [Eval:test] Iter:395 ndcg@3=0.552114 map=0.571677 ndcg@5=0.613771
[05-20-2018 12:07:59] [Eval:valid] Iter:395 ndcg@3=0.555092 map=0.560439 ndcg@5=0.610904
[05-20-2018 12:07:59] [Train:train] Iter:396 loss=0.000499
[05-20-2018 12:08:00] [Eval:test] Iter:396 ndcg@3=0.553672 map=0.573834 ndcg@5=0.615328
[05-20-2018 12:08:01] [Eval:valid] Iter:396 ndcg@3=0.555092 map=0.560405 ndcg@5=0.610904
[05-20-2018 12:08:01] [Train:train] Iter:397 loss=0.001926
[05-20-2018 12:08:02] [Eval:test] Iter:397 ndcg@3=0.552114 map=0.574506 ndcg@5=0.616589
[05-20-2018 12:08:03] [Eval:valid] Iter:397 ndcg@3=0.558117 map=0.564390 ndcg@5=0.613929
[05-20-2018 12:08:03] [Train:train] Iter:398 loss=0.000582
[05-20-2018 12:08:04] [Eval:test] Iter:398 ndcg@3=0.552114 map=0.574556 ndcg@5=0.616589
[05-20-2018 12:08:05] [Eval:valid] Iter:398 ndcg@3=0.554019 map=0.563980 ndcg@5=0.616532
[05-20-2018 12:08:05] [Train:train] Iter:399 loss=0.001208
[05-20-2018 12:08:06] [Eval:test] Iter:399 ndcg@3=0.552114 map=0.574365 ndcg@5=0.615588
[05-20-2018 12:08:07] [Eval:valid] Iter:399 ndcg@3=0.554019 map=0.563980 ndcg@5=0.616532
Output for predict:
Using TensorFlow backend.
2018-05-20 12:08:16.542890: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
{
"inputs": {
"test": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"predict": {
"phase": "PREDICT",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"train": {
"relation_file": "./data/WikiQA/relation_train.txt",
"input_type": "Triletter_PairGenerator",
"batch_size": 100,
"batch_per_iter": 5,
"dtype": "dssm",
"phase": "TRAIN",
"query_per_iter": 50,
"use_iter": false
},
"share": {
"vocab_size": 3314,
"embed_size": 1,
"target_mode": "ranking",
"text1_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"text2_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"word_triletter_map_file": "./data/WikiQA/word_triletter_map.txt"
},
"valid": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_valid.txt",
"dtype": "dssm"
}
},
"global": {
"optimizer": "adam",
"num_iters": 400,
"save_weights_iters": 10,
"learning_rate": 0.0001,
"test_weights_iters": 400,
"weights_file": "examples/wikiqa/weights/dssm.wikiqa.weights",
"model_type": "PY",
"display_interval": 10
},
"outputs": {
"predict": {
"save_format": "TREC",
"save_path": "predict.test.wikiqa.txt"
}
},
"losses": [
{
"object_name": "rank_hinge_loss",
"object_params": {
"margin": 1.0
}
}
],
"metrics": [
"ndcg@3",
"ndcg@5",
"map"
],
"net_name": "DSSM",
"model": {
"model_py": "dssm.DSSM",
"setting": {
"dropout_rate": 0.9,
"hidden_sizes": [
300
]
},
"model_path": "./matchzoo/models/"
}
}
[Embedding] Embedding Load Done.
[Input] Process Input Tags. [u'predict'] in PREDICT.
[./data/WikiQA/corpus_preprocessed.txt]
Data size: 24106
[Dataset] 1 Dataset Load Done.
{u'relation_file': u'./data/WikiQA/relation_test.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'PREDICT', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32)}
[./data/WikiQA/relation_test.txt]
Instance size: 2341
List Instance Count: 237
[Triletter_ListGenerator] init done
[DSSM] init done
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 169.1758 MB Resident: 173236 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 169.1758 MB Resident: 173236 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 169.5430 MB Resident: 173612 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 169.5430 MB Resident: 173612 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Dot [shape]: [None, 1]
[Memory] Total Memory Use: 169.7461 MB Resident: 173820 Shared: 0 UnshareData: 0 UnshareStack: 0
[05-20-2018 12:08:16] [Predict] @ predict [Predict] results: ndcg@3=0.552114 map=0.574365 ndcg@5=0.615588
Will check and report on other models. |
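The `rank_hinge_loss` with `margin: 1.0` in the configs above corresponds to the standard pairwise hinge loss. A sketch of that objective (not the MatchZoo implementation itself, which operates on Keras tensors):

```python
def rank_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Pairwise hinge loss: penalize whenever a positive document is
    not scored at least `margin` above its paired negative document."""
    losses = [max(0.0, margin - (p - n))
              for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)

# Positive already beats negative by >= margin -> zero loss.
print(rank_hinge_loss([2.0], [0.5]))  # 0.0
# Tied scores incur the full margin.
print(rank_hinge_loss([1.0], [1.0]))  # 1.0
```

A healthy run should drive this toward zero, which matches the `loss=0.001...` values in the final iterations of the log above; a flat loss curve suggests the pairs or the gradient flow are broken.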
@mandroid6 |
@changlinzhang The class ListGenerator_Feats(ListBasicGenerator) has nothing to do with class DRMM_ListGenerator(ListBasicGenerator). The problem probably comes from two places:
Previously I was thinking the reason is the loss function, until I realized every model is using the so-called rank_hinge_loss. Probably it's because of this line: out_ = Dot(axes=[1, 1])([z, q_w]), since we're expecting cosine similarity between two vectors, which was also mentioned in the paper. Could you change it to out_ = Dot(axes=[1, 1], normalize=True)([z, q_w]) and run the script again? I am not able to use my dev computer at the moment. |
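To illustrate why `normalize=True` matters here: an unnormalized dot product grows with vector magnitude, while cosine similarity is bounded in [-1, 1]. A plain-Python sketch (not Keras's `Dot` layer, which does the same on tensors when `normalize=True`):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product of L2-normalized inputs, i.e. cosine similarity."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    return dot(u, v) / (norm_u * norm_v)

z, q = [1.0, 2.0], [2.0, 4.0]
print(dot(z, q))     # 10.0 -- scales with magnitude
print(cosine(z, q))  # ~1.0 -- parallel vectors
```

If a model's scores should behave like cosine similarity (as the paper describes), using the raw dot product changes the score scale, which can interact badly with a fixed hinge margin.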
I reverted to commit e564565 |
@aneesh-joshi Thank you for providing the results! Your results are very close to the results in the readme file, so this confirms my earlier guess: some code changes introduced bugs into MatchZoo over the recent 5 months. I will discuss this with @faneshion. I think he will reply to you soon; he is quite busy these days :) |
Here is a better side by side comparison:
|
Some of the models do a lot worse than expected.
@aneesh-joshi Have you optimized the hyper-parameters of the models, or did you run them with the default settings? According to your results, most metrics match those reported in our readme file; only CDSSM and ARC-II have gaps. You need to optimize the hyper-parameters on the validation data of WikiQA. |
@yangliuy |
@millanbatra @aneesh-joshi Can you try again? |
@bwanglzu |
As mentioned in #106
I couldn't add it yet. I will also paste the MZ reported values soon.
I just compared your results with the benchmark in the readme, and they seem reasonable. Let's focus on #147 and I'll close this one. Thanks for everyone's effort, especially @aneesh-joshi! |
When running through the procedure described in the readme for the benchmark results of WikiQA, the reproduced values for NDCG@3, NDCG@5, and MAP are roughly half of the values shown in the table. Could you provide insight as to why this may be occurring?