Hi,
I read your paper and found it very interesting. I was wondering whether you have any ablation results comparing a 6-layer BERT-of-Theseus (compressed from a 12-layer BERT) against a 6-layer BERT trained from scratch. If not, do you have any intuition for whether module replacement from a larger model would surpass that same smaller model trained from scratch?
Many thanks
David
Yes, a model trained from scratch performs considerably worse than one obtained with a reasonable model compression method; that is the basic premise of model compression. We did not include this ablation because we compare against stronger baselines.
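For anyone skimming this thread, here is a minimal sketch of the module-replacement idea being discussed. This is not the repository's actual implementation; the class and argument names are made up for illustration. During fine-tuning, each trainable successor block randomly stands in for the two frozen predecessor layers it will eventually replace, and at inference only the successors are kept.

```python
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    """Sketch of Theseus-style module replacement (hypothetical names).

    Each successor block corresponds to two predecessor layers, so a
    12-layer predecessor is compressed into a 6-layer successor.
    """

    def __init__(self, predecessor_blocks, successor_blocks, replace_prob=0.5):
        super().__init__()
        assert len(predecessor_blocks) == 2 * len(successor_blocks)
        self.predecessors = nn.ModuleList(predecessor_blocks)  # frozen teacher layers
        self.successors = nn.ModuleList(successor_blocks)      # trainable student layers
        self.replace_prob = replace_prob
        for p in self.predecessors.parameters():
            p.requires_grad = False

    def forward(self, hidden):
        for i, successor in enumerate(self.successors):
            replace = self.training and torch.rand(()) < self.replace_prob
            if replace or not self.training:
                # Successor block stands in for its two predecessor layers;
                # at inference only the compressed successors are used.
                hidden = successor(hidden)
            else:
                # Otherwise run the two original predecessor layers.
                hidden = self.predecessors[2 * i](hidden)
                hidden = self.predecessors[2 * i + 1](hidden)
        return hidden
```

In the paper the replacement rate is also scheduled to increase over the course of training; the constant probability above is only to keep the sketch short.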
Thanks for the reply! I was hoping that was the case. Even if it weren't, a secondary motivation for model compression could be to compress pre-trained models that would be infeasible to retrain from scratch, even if the resulting performance were only equivalent to a smaller model trained from scratch; but it seems that is not the case here.