Larger models and training on the Pile #29
@Taytay Again, please accept my apologies for the late reply :)
No, can't see any reason :)
Nope either. I've tested training the model for longer and the loss continued to go down very nicely. It's true that Rouge-L on SNI is the same after 20 and 24 hours, but the validation loss after 20 and 24 hours is much better than after 16 hours. It could be that Rouge-L on SNI caps at 41 for this model size, but I doubt it - SNI is still quite a small dataset, their proposed fine-tuning recipe is 2 epochs, and it's very easy to overfit. My bet is that if you instead evaluate on the entire Flan Collection, which is way larger, you'd see 24H > 20H > 16H. If you're interested in a TinyT5-like endeavour and would like some help, feel free to reach out - I'd be very interested :)
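For context on the Rouge-L numbers discussed above: Rouge-L scores a prediction against a reference by the length of their longest common subsequence. Here is a minimal self-contained sketch of the metric over whitespace tokens (a simplification - SNI's official evaluation script applies its own text normalization, and the names here are illustrative):

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(prediction, reference):
    # Rouge-L F1 over whitespace tokens.
    p, r = prediction.split(), reference.split()
    lcs = lcs_len(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A score of "41" on SNI then corresponds to an average Rouge-L F1 of 0.41 across the benchmark's test tasks (scaled by 100).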
Oh cool - I'll give this some thought! EleutherAI is cooking up a modern T5 right now as well, apparently: https://huggingface.co/collections/EleutherAI/pile-t5-65a76a0d0022dd270b385a66 https://github.com/EleutherAI/improved-t5 They haven't said much about it other than confirming on Twitter that they're working on it. I continue to be of the opinion that someone is going to get SOTA results with a T5 model and some modern techniques.
This is very interesting
I was first tipped off to it here: Their confirmation in this thread: And you know, if we WERE to do a "TinyModernT5" effort, this paper also points to a way to reduce pre-training costs by 40-50% by changing the objective and learning rate schedule:
From "SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection"
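For readers unfamiliar with the objective SpacTor builds on: T5's span corruption drops contiguous spans from the input, replaces each with a sentinel token, and asks the decoder to produce the sentinels followed by the dropped tokens. Here is a rough self-contained sketch of that masking step (the function, its defaults, and the sampling scheme are simplifications for illustration, not T5's exact implementation; SpacTor's replaced-token-detection component is not shown):

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3, seed=0):
    """Simplified T5-style span corruption: drop roughly noise_density of
    the tokens in contiguous spans. Each dropped span becomes a sentinel
    in the inputs; the targets are each sentinel followed by its span."""
    rng = random.Random(seed)
    n_to_drop = max(1, round(len(tokens) * noise_density))
    inputs, targets = [], []
    i, sentinel, dropped = 0, 0, 0
    while i < len(tokens):
        # Start a span here with a probability tuned so that, on average,
        # noise_density of the tokens end up inside dropped spans.
        if dropped < n_to_drop and rng.random() < noise_density / mean_span_len:
            span = min(mean_span_len, n_to_drop - dropped, len(tokens) - i)
            mark = f"<extra_id_{sentinel}>"
            inputs.append(mark)
            targets.append(mark)
            targets.extend(tokens[i:i + span])
            i += span
            dropped += span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

SpacTor's reported savings come from combining this objective with an ELECTRA-style replaced-token-detection loss in an initial phase of training, then reverting to plain span corruption.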
One more thing to watch: He said that he is considering open-sourcing it. 🤞 ❤️
After seeing the excitement around TinyLlama, it makes me want to pre-train some T5 models in a similar fashion. If you are able to achieve these equivalent results in a fraction of the time on C4, it seems like throwing some modern datasets and more compute at it should yield even better results - do you see any reason why this wouldn't be the case? Or does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?