Hi,
I read your paper and found it very interesting. I was wondering whether you have any ablation results comparing a 6-layer BERT-of-Theseus (compressed from a 12-layer BERT) against a 6-layer BERT trained from scratch. If not, do you have any intuition for whether module replacement from a larger model would surpass that same smaller model trained from scratch?
Many thanks
David
Yes, a model trained from scratch performs considerably worse than one obtained with a reasonable model compression method; that is the basic premise of model compression. We did not include this ablation because we compare against stronger baselines.
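For anyone skimming this thread, here is a minimal sketch of the module-replacement idea being discussed. This is not the repository's actual implementation; the class and argument names are made up for illustration. During fine-tuning, each trainable successor block randomly stands in for the two frozen predecessor layers it will eventually replace, and at inference only the successors are kept.

```python
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    """Sketch of Theseus-style module replacement (hypothetical names).

    Each successor block corresponds to two predecessor layers, so a
    12-layer predecessor is compressed into a 6-layer successor.
    """

    def __init__(self, predecessor_blocks, successor_blocks, replace_prob=0.5):
        super().__init__()
        assert len(predecessor_blocks) == 2 * len(successor_blocks)
        self.predecessors = nn.ModuleList(predecessor_blocks)  # frozen teacher layers
        self.successors = nn.ModuleList(successor_blocks)      # trainable student layers
        self.replace_prob = replace_prob
        for p in self.predecessors.parameters():
            p.requires_grad = False

    def forward(self, hidden):
        for i, successor in enumerate(self.successors):
            replace = self.training and torch.rand(()) < self.replace_prob
            if replace or not self.training:
                # Successor block stands in for its two predecessor layers;
                # at inference only the compressed successors are used.
                hidden = successor(hidden)
            else:
                # Otherwise run the two original predecessor layers.
                hidden = self.predecessors[2 * i](hidden)
                hidden = self.predecessors[2 * i + 1](hidden)
        return hidden
```

In the paper the replacement rate is also scheduled to increase over the course of training; the constant probability above is only to keep the sketch short.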
Thanks for the reply! I was hoping that was the case. Even if it weren't, a secondary motivation for model compression could be to compress pre-trained models that would be infeasible to retrain from scratch, even if the resulting performance were only equivalent to a smaller model trained from scratch; but it seems that is not the case here.