
Larger models and training on the Pile #29

Closed
Taytay opened this issue Jan 22, 2024 · 5 comments

@Taytay

Taytay commented Jan 22, 2024

After seeing the excitement around TinyLlama, it makes me want to pre-train some T5 models in a similar fashion. If you are able to achieve these equivalent results in a fraction of the time on C4, it seems like throwing some modern datasets and more compute at it should yield even better results... do you see any reason why this wouldn't be the case? Or does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?

@PiotrNawrot
Owner

@Taytay Again, please accept my apologies for the late reply :)

do you see any reason why this wouldn't be the case?

No, can't see any reason :)

does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?

Nope to that either. I've tested training the model for longer and the loss continued to go down very nicely. It's true that Rouge-L on SNI is the same after 20 and 24H, but the validation loss after 20 and 24H is much better than after 16H. It could be the case that Rouge-L on SNI caps at 41 for this model size, but I doubt it - SNI is still quite a small dataset, their proposed fine-tuning recipe is 2 epochs, and it's very easy to overfit.

My bet is that if you instead evaluate on the entire Flan Collection, which is way larger, you would see 24H > 20H > 16H.

If you are interested in doing a TinyT5-like endeavour and would like some help, feel free to reach out - I'd be very interested :)

@Taytay
Author

Taytay commented Feb 7, 2024

Oh cool - I'll give this some thought!

EleutherAI is apparently cooking up a modern T5 right now as well.

https://huggingface.co/collections/EleutherAI/pile-t5-65a76a0d0022dd270b385a66

https://github.com/EleutherAI/improved-t5

They haven't said much about it other than confirming on Twitter they are working on it. I continue to be of the opinion that someone is going to get SOTA results with a T5 model and some modern techniques.

@PiotrNawrot
Owner

This is very interesting

@Taytay
Author

Taytay commented Feb 7, 2024

I was first tipped off to it here:
https://x.com/andersonbcdefg/status/1750570453532577883?s=20

Their confirmation in this thread:
https://x.com/thetaytay/status/1753780417365250199?s=20

And you know, if we WERE to do a "TinyModernT5" effort, this paper also points to a way to reduce pre-training costs by 40-50% by changing the objective and learning rate schedule:

In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.

From "SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection"

@Taytay
Author

Taytay commented Feb 7, 2024

One more thing to watch:
In this comment, @b-albar claims that he has a custom T5 implementation with his FlashAttention patch and a "few other tricks":
huggingface/transformers#26350 (comment)

He said that he is considering open sourcing it. 🤞 ❤️
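As context for why such a patch is non-trivial: T5 adds a learned relative-position bias to the attention logits, which is the main obstacle to using the stock FlashAttention kernel. Below is a minimal sketch of the common workaround (purely illustrative, not @b-albar's implementation): pass the bias as an additive float mask to PyTorch's scaled_dot_product_attention, which then typically dispatches to the memory-efficient backend rather than the pure FlashAttention kernel.

```python
import torch
import torch.nn.functional as F

def t5_attention_sdpa(q, k, v, position_bias, dropout_p=0.0):
    """T5-style attention via torch.nn.functional.scaled_dot_product_attention.

    q, k, v:        (batch, heads, seq, head_dim)
    position_bias:  (batch or 1, heads, q_seq, k_seq) additive bias, as in T5.

    Notes (sketch assumptions):
    - T5 does not scale logits by 1/sqrt(head_dim), hence scale=1.0
      (the `scale` kwarg requires PyTorch >= 2.1).
    - A float attn_mask usually rules out the FlashAttention backend and
      falls back to the memory-efficient / math kernels.
    """
    return F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=position_bias,  # additive relative-position bias (padding mask can be merged in)
        dropout_p=dropout_p,
        scale=1.0,                # match T5's unscaled dot-product attention
    )

# Toy shapes just to show the call works.
b, h, s, d = 2, 8, 16, 64
q = torch.randn(b, h, s, d)
k = torch.randn(b, h, s, d)
v = torch.randn(b, h, s, d)
bias = torch.zeros(b, h, s, s)
out = t5_attention_sdpa(q, k, v, bias)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```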
