This repository aims to implement several forms of the Transformer model, including seq2seq (the original architecture from the "Attention Is All You Need" paper), encoder-only, decoder-only, and unified Transformer models.
These models are not meant to be state of the art on any task. Instead, they serve to sharpen my own programming skills and to provide a reference for people who share a love of deep learning and machine intelligence.
This work is inspired by, and would not be possible without the open-source repos of NanoGPT, ViT, MAE, CLIP, and OpenCLIP. A huge thanks to them for open-sourcing their models!
This repository also maintains a paper list of recent progress in Transformer models.
This repository features a list of designs:
- Transformer Architectures:
  - Encoder-only
  - Decoder-only
  - Encoder-Decoder
  - Unified (In Progress)
- Attention Modules:
  - Unmasked Attention (Transformer, BERT)
  - Causal Masked Attention (Transformer, GPT)
  - Prefix Causal Attention (T5)
  - Sliding-Window Attention (Mistral)
- Position Embedding:
  - Fixed Position Embedding (Transformer)
  - Learnable Position Embedding (Transformer, BERT)
  - Rotary Position Embedding (RoFormer)
  - Extrapolable Position Embedding (Length-Extrapolatable Transformer)
- Sampling:
  - Temperature-based Sampler
  - Top-k Sampler
  - Nucleus (top-p) Sampler
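The attention modules listed above differ mainly in which key positions each query position is allowed to see. The following is a minimal sketch of those four masking patterns in plain Python, not the code used in this repository. The function names `build_attention_mask` and `render`, the `kind` strings, and the `prefix_len` and `window` parameters are all hypothetical and chosen for illustration. A real implementation would typically build these masks as tensors.

```python
def build_attention_mask(seq_len, kind="causal", prefix_len=0, window=None):
    """Return a seq_len x seq_len grid of booleans.

    mask[i][j] is True when query position i may attend to key position j.
    All names here are illustrative assumptions, not this repository's API.
    """
    if kind == "sliding_window" and (window is None or window < 1):
        raise ValueError("sliding_window needs a positive window size")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if kind == "unmasked":
                # Every position sees every other position.
                allowed = True
            elif kind == "causal":
                # A position sees itself and everything before it.
                allowed = j <= i
            elif kind == "prefix_causal":
                # The first prefix_len positions are visible to everyone;
                # the remainder is causal.
                allowed = j <= i or j < prefix_len
            elif kind == "sliding_window":
                # Causal, but limited to the most recent `window` positions.
                allowed = j <= i and i - j < window
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            row.append(allowed)
        mask.append(row)
    return mask


def render(mask):
    """Draw the mask as text: 'x' means visible, '.' means masked out."""
    return "\n".join("".join("x" if a else "." for a in row) for row in mask)
```

For example, `print(render(build_attention_mask(4, "sliding_window", window=2)))` draws a band two positions wide along the diagonal, while `"prefix_causal"` with `prefix_len=2` makes the first two columns fully visible.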
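The three samplers listed above can be viewed as transformations applied to the next-token distribution before a token is drawn. The following is a rough sketch using only the Python standard library, not the code used in this repository. The function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) and the example logits are assumptions made for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Example with made-up logits: temperature first, then top-k, then top-p.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=0.8)
filtered = top_p_filter(top_k_filter(probs, k=3), p=0.9)
token = sample(filtered, rng=random.Random(0))
```

Chaining the filters in this order is one common choice rather than the only one. Passing a seeded `random.Random` instance makes the draw reproducible.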
Currently working on implementing DINO, a variant of ViT trained in a self-supervised manner.
| Model | Implemented | Trained | Evaluated |
|---|---|---|---|
| Transformer | Yes | No | No |
| GPT | Yes | No | No |
| BERT | Yes | Yes | No |
| ViT | Yes | No | No |
| MAE | No | No | No |
| CLIP | No | No | No |
DISCLAIMER: Because Transformers are popular and versatile, many course assignments involve implementing part or all of a Transformer model. This repository was developed purely for self-study and may serve as a reference for implementing a Transformer model. However, directly copying from this repo is strictly prohibited and violates the code of conduct at most academic institutions.
For those who need a refresher on what the Transformer is or what its detailed architecture looks like, please refer to this well-illustrated blog post: http://nlp.seas.harvard.edu/annotated-transformer/#background
Here is a poem generated by LLaMA2, an open-source LLM released by Meta AI:
Attention is all you need,
To understand what's said and read.
Transformers learn relations,
Through multi-head attentions.
Encoder, decoder architecture,
Learns features for good imagery.
Training on large datasets,
Its performance quickness gets.
Built on top of sequence to sequence,
Its parallel computing saves time to flex.
Understanding language, text and voice,
With deep learning that gave it its poise.
Task agnostic, wide usability,
Driving progress in AI agility.
Pushing NLP to new heights,
Transformers show their might.