Larger models and training on the Pile #29
@Taytay Again, please accept my apologies for the late reply :)
No, can't see any reason :)
Nope either. I've tested training the model for longer and the loss continued to go down very nicely. It's true that Rouge-L on SNI is the same after 20 and 24 hours, but the validation loss after 20 and 24 hours is much better than after 16 hours. It could be that Rouge-L on SNI caps at 41 for this model size, but I doubt it - SNI is still quite a small dataset, their proposed fine-tuning recipe is 2 epochs, and it's very easy to overfit. My bet is that if you instead evaluate on the entire Flan Collection, which is way larger, you'd see 24H > 20H > 16H. If you're interested in a TinyT5-like endeavour and would like some help, feel free to reach out - I'd be very interested :)
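For context on the Rouge-L numbers discussed above: Rouge-L scores a prediction against a reference by the length of their longest common subsequence. Here is a minimal self-contained sketch of the metric over whitespace tokens (a simplification - SNI's official evaluation script applies its own text normalization, and the names here are illustrative):

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(prediction, reference):
    # Rouge-L F1 over whitespace tokens.
    p, r = prediction.split(), reference.split()
    lcs = lcs_len(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A score of "41" on SNI then corresponds to an average Rouge-L F1 of 0.41 across the benchmark's test tasks (scaled by 100).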
Oh cool - I'll give this some thought! EleutherAI is cooking up a modern T5 right now as well, apparently: https://huggingface.co/collections/EleutherAI/pile-t5-65a76a0d0022dd270b385a66 https://github.com/EleutherAI/improved-t5 They haven't said much about it other than confirming on Twitter that they're working on it. I continue to be of the opinion that someone is going to get SOTA results with a T5 model and some modern techniques.
This is very interesting
I was first tipped off to it here: Their confirmation in this thread: And you know, if we WERE to do a "TinyModernT5" effort, this paper also points to a way to reduce pre-training costs by 40-50% by changing the objective and learning rate schedule:
From "SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection"
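For readers unfamiliar with the objective SpacTor builds on: T5's span corruption drops contiguous spans from the input, replaces each with a sentinel token, and asks the decoder to produce the sentinels followed by the dropped tokens. Here is a rough self-contained sketch of that masking step (the function, its defaults, and the sampling scheme are simplifications for illustration, not T5's exact implementation; SpacTor's replaced-token-detection component is not shown):

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3, seed=0):
    """Simplified T5-style span corruption: drop roughly noise_density of
    the tokens in contiguous spans. Each dropped span becomes a sentinel
    in the inputs; the targets are each sentinel followed by its span."""
    rng = random.Random(seed)
    n_to_drop = max(1, round(len(tokens) * noise_density))
    inputs, targets = [], []
    i, sentinel, dropped = 0, 0, 0
    while i < len(tokens):
        # Start a span here with a probability tuned so that, on average,
        # noise_density of the tokens end up inside dropped spans.
        if dropped < n_to_drop and rng.random() < noise_density / mean_span_len:
            span = min(mean_span_len, n_to_drop - dropped, len(tokens) - i)
            mark = f"<extra_id_{sentinel}>"
            inputs.append(mark)
            targets.append(mark)
            targets.extend(tokens[i:i + span])
            i += span
            dropped += span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

SpacTor's reported savings come from combining this objective with an ELECTRA-style replaced-token-detection loss in an initial phase of training, then reverting to plain span corruption.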
One more thing to watch: He said that he is considering open-sourcing it. 🤞 ❤️
After seeing the excitement around TinyLlama, it makes me want to pre-train some T5 models in a similar fashion. If you are able to achieve these equivalent results in a fraction of the time on C4, it seems like throwing some modern datasets and more compute at it should yield even better results - do you see any reason why this wouldn't be the case? Or does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?