
Comparison against six layer BERT #10

Closed
david-macleod opened this issue Sep 2, 2020 · 2 comments

Comments

@david-macleod

Hi

I read your paper and found it very interesting. I was wondering whether you have any ablation results comparing a 6-layer BERT-of-Theseus (compressed from a 12-layer BERT) against a 6-layer BERT trained from scratch? If not, do you have any intuition for whether module replacement from a larger model would surpass the performance of that same smaller model trained from scratch?

Many thanks
David
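
For context, here is a minimal sketch of the module replacement being discussed (not the authors' implementation): the 12 predecessor layers are grouped into six 2-layer modules, each paired with one trainable successor layer; during training each module is stochastically swapped for its successor, and at inference only the 6-layer successor is used. The class name, layer sizes, and the use of generic `nn.TransformerEncoderLayer` in place of real `BertLayer` modules (which also take attention masks) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    """Toy module-replacement encoder: 12 frozen predecessor layers grouped
    into 6 two-layer modules, each paired with one trainable successor layer."""

    def __init__(self, predecessor_layers, successor_layers, replace_rate=0.5):
        super().__init__()
        assert len(predecessor_layers) == 2 * len(successor_layers)
        self.prd = nn.ModuleList(predecessor_layers)   # frozen 12-layer predecessor
        self.scc = nn.ModuleList(successor_layers)     # trainable 6-layer successor
        self.replace_rate = replace_rate
        for p in self.prd.parameters():
            p.requires_grad = False                    # only successors are updated

    def forward(self, hidden):
        for i, scc_layer in enumerate(self.scc):
            # During training, each module is independently swapped for its
            # successor with probability replace_rate; at inference only the
            # compact successor is used.
            if (not self.training) or torch.rand(1).item() < self.replace_rate:
                hidden = scc_layer(hidden)
            else:
                hidden = self.prd[2 * i + 1](self.prd[2 * i](hidden))
        return hidden

# Usage with generic Transformer layers standing in for BERT layers:
prd = [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True) for _ in range(12)]
scc = [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True) for _ in range(6)]
encoder = TheseusEncoder(prd, scc, replace_rate=0.5)
out = encoder(torch.randn(2, 16, 768))  # (batch, seq_len, hidden)
```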

@JetRunner
Owner

Hi David, thanks for your interest in our work.

Yes, of course a model trained from scratch performs much worse than one produced by any model compression method. This is the basic idea of model compression, and we do not include this ablation because we have stronger baselines.

@david-macleod
Author

Thanks for the reply! I was hoping that would be the case. Even if it weren't, a secondary motivation for model compression could be to compress pre-trained models that would be infeasible to retrain from scratch, even if the resulting performance were only equivalent to a smaller model trained from scratch; but it seems that is not the case.
