
Few-shot NLU: learning rate for model parameters vs. embedding parameters #7

Closed
nelson-liu opened this issue Apr 4, 2021 · 2 comments


@nelson-liu

nelson-liu commented Apr 4, 2021

Hi!

Thanks for the interesting paper and for releasing this nice codebase! I had a quick question about the learning rate used for the few-shot NLU experiments. The paper mentions (Section 4.2) that:

> We perform grid search of hyper-parameters and take the best combination on Ddev or Ddev32. Specifically, we take learning rates from 1e-5, 2e-5, 3e-5 and batch sizes from 16, 32

However, it seems that the model is updated with a fixed learning rate of 1e-5 in the code (https://github.com/THUDM/P-tuning/blob/main/PT-Fewshot/pet/wrapper.py#L312), and the learning rate taken from the CLI is only used for the embedding parameters.

Given that the paper and code seem to differ in this regard, I'm not sure whether this is a bug in the code (i.e., the model and the embedding parameters should both use the LR taken from the CLI) or whether the paper omits this detail (i.e., in reality, the LR grid search is only done on the embedding parameters, and 1e-5 is always used for the model). Could you clarify which approach was taken in your experiments?
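For reference, the behavior described above (a fixed backbone learning rate, with the CLI learning rate applied only to the prompt embeddings) is the kind of setup PyTorch expresses with optimizer parameter groups. This is a minimal hypothetical sketch, not the repo's actual `wrapper.py` code; `TinyPromptModel` and `cli_lr` are illustrative stand-ins:

```python
import torch
import torch.nn as nn

class TinyPromptModel(nn.Module):
    """Stand-in for a prompt-tuned LM: a frozen-ish backbone plus prompt embeddings."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)               # stand-in for the pretrained LM
        self.prompt_embeddings = nn.Embedding(4, 8)   # stand-in for prompt embeddings

model = TinyPromptModel()
cli_lr = 3e-5  # illustrative value, as if passed on the command line

# Two parameter groups with different learning rates:
optimizer = torch.optim.AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},              # fixed backbone LR
    {"params": model.prompt_embeddings.parameters(), "lr": cli_lr},   # LR from the CLI
])

print(optimizer.param_groups[0]["lr"], optimizer.param_groups[1]["lr"])
```

Each group is stepped with its own learning rate by a single `optimizer.step()` call, which matches the pattern of updating the backbone and the prompt embeddings at different rates.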

Thanks again!

@nelson-liu (Author)

Ah, re-reading that passage, am I correct that the grid search is not used in the few-shot setting (and the default hyperparameters from PET are used)?

@zheng-yanan (Contributor)

> Ah, re-reading that passage, am I correct that the grid search is not used in the few-shot setting (and the default hyperparameters from PET are used)?

Hi!

Yes, in the few-shot setting, the hyperparameters from PET are used, and we additionally select the prompt-related hyperparameters. We actually experimented with using both the same and different learning rates for the backbone and the prompt embeddings, and found that using different learning rates yields better performance in the few-shot setting. The grid search mentioned in the paper was used in the fully supervised setting.
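The selection procedure described above can be sketched as: keep the backbone learning rate fixed and pick the prompt-embedding learning rate that scores best on the dev set. This is a hedged illustration only; `evaluate_on_dev` is a hypothetical stand-in for a full training run plus dev evaluation, and the candidate values are illustrative:

```python
import random

def evaluate_on_dev(backbone_lr, embedding_lr):
    # Hypothetical placeholder: in practice, train the model with these
    # learning rates and return the resulting dev-set score.
    random.seed(hash((backbone_lr, embedding_lr)) % (2 ** 32))
    return random.random()

backbone_lr = 1e-5                              # fixed for the backbone
candidate_embedding_lrs = [1e-5, 5e-5, 1e-4]    # illustrative search space

# Select the prompt-embedding LR with the best dev score.
best_embedding_lr = max(
    candidate_embedding_lrs,
    key=lambda lr: evaluate_on_dev(backbone_lr, lr),
)
print(best_embedding_lr)
```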

Thank you.
