Transformer 8x67M. Total parameters: 332M. Activated parameters: 105M.
- Heads and QKV are computed in parallel in multi-head attention.
- GQA: grouped-query attention / MQA: multi-query attention
- RoPE: rotary position embedding
- kv_cache
- SwiGLU
- MoE: mixture of experts
- RMSNorm: root mean square layer normalization (a minimal sketch of SwiGLU and RMSNorm follows this list)
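As a reference for two of the components above, here is a minimal sketch of SwiGLU and RMSNorm, assuming PyTorch. The class names and arguments are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root mean square layer normalization: scale x by 1/rms(x), then a learned gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: (SiLU(x W1) * (x W3)) W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Note that a SwiGLU block uses three projections instead of the two in a standard FFN, which is why its hidden dimension is often reduced to keep the parameter count comparable.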
You can create a conda environment with `conda env create -f environment.yml`, or with the following commands:

```bash
conda create -n transformer python=3.12
conda activate transformer
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
pip install tokenizers evaluate sacrebleu wandb
```
Download the checkpoint and extract it with `tar Jxvf checkpoint.tar.xz`.
Run `python inference.py` to run inference on the test dataset.
Run `python translate.py` to enter your own sentence and translate it.
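Decoding during inference typically reuses the per-layer kv_cache listed among the features above. A minimal, illustrative sketch of one attention step with such a cache, assuming PyTorch (the function name, tensor shapes, and cache layout are assumptions, not the repository's API):

```python
import torch

def attend_with_cache(q, k_new, v_new, cache=None):
    """One decoding step of scaled dot-product attention with a KV cache.

    q, k_new, v_new: (batch, heads, 1, head_dim) for the current token.
    cache: dict holding previously computed "k" and "v", or None on the first step.
    """
    if cache is not None:
        k = torch.cat([cache["k"], k_new], dim=2)  # append along the sequence axis
        v = torch.cat([cache["v"], v_new], dim=2)
    else:
        k, v = k_new, v_new
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    out = torch.softmax(scores, dim=-1) @ v
    return out, {"k": k, "v": v}  # the returned cache is reused at the next step
```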
You can run `python -m dataset.prepare-wmt14en2de` to build your own tokenizer and datasets, or download my dataset and extract it with `tar Jxvf dataset.tar.xz`.
Sentence pairs in the training dataset are sorted by the sum of the source and target token lengths.
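For illustration, the sorting could look like the sketch below; the `(source_ids, target_ids)` pair representation is an assumption, not necessarily how the dataset code stores examples.

```python
def sort_by_total_length(pairs):
    """Sort (source_ids, target_ids) pairs by their combined token count."""
    return sorted(pairs, key=lambda pair: len(pair[0]) + len(pair[1]))

pairs = [([1, 2, 3], [4, 5]), ([7], [8]), ([1, 2], [3, 4, 5, 6])]
print(sort_by_total_length(pairs))  # shortest combined pair first
```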
Run `python train.py` to train the Transformer.