A toolkit for large-scale distributed training
With LARSER, we trained the DeBERTa 1.5B model without model parallelism. DeBERTa 1.5B is the SOTA model on the GLUE and SuperGLUE leaderboards, and it is the first model to surpass both the T5 11B model and human performance on the SuperGLUE leaderboard.
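Training without model parallelism means relying on data parallelism alone: every worker holds a full model replica, computes gradients on its shard of the batch, and the gradients are averaged (an all-reduce) before the update. A minimal toy sketch of that pattern, assuming a one-parameter linear model — this is illustrative only and not LARSER's actual API:

```python
# Toy illustration of data parallelism (no model parallelism):
# each "worker" sees a shard of the batch, gradients are averaged,
# and every replica applies the same update. Not LARSER's real API.

def grad(w, xs, ys):
    """Gradient of mean squared error for a 1-parameter model y = w*x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, batch, n_workers, lr=0.01):
    xs, ys = batch
    shard = len(xs) // n_workers
    # Each worker computes a local gradient on its shard of the batch...
    grads = [
        grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_workers)
    ]
    # ...then gradients are averaged (an all-reduce in a real system)
    # and the same update is applied to every replica.
    g = sum(grads) / n_workers
    return w - lr * g

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with true w = 2
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, (xs, ys), n_workers=2)
```

In a real setup, frameworks such as PyTorch DistributedDataParallel perform the gradient all-reduce across GPUs; the sketch only mimics the communication pattern in a single process.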
- Add documentation and usage examples