Instead of taking the softmax over the last dimension of the attention scores, which corresponds to the keys, we take it over the penultimate dimension, which corresponds to the queries. This is a pitfall that leaks information: it allows the transformer to receive information from future tokens in the masked sequence. For more information, take a look at this blog post, where I explain in detail why this happens.
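To make the pitfall concrete, here is a minimal, self-contained sketch contrasting the two normalisations. It is illustrative only, not this repo's actual implementation; the tensor names and shapes are assumptions.

import torch
import torch.nn.functional as F

T, d = 4, 8                        # sequence length, head dimension
q = torch.randn(T, d)              # queries
k = torch.randn(T, d)              # keys
v = torch.randn(T, d)              # values

scores = q @ k.T / d ** 0.5        # attention scores, shape (queries, keys)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))

# Correct: each query's row is normalised over the keys it may attend to.
attn_ok = F.softmax(scores, dim=-1)

# Degenerate: each key's column is normalised over the queries. The
# denominator now sums over future query positions as well, so every
# attention weight, and hence every output, depends on tokens that the
# causal mask was supposed to hide.
attn_leaky = F.softmax(scores, dim=-2)

out_ok = attn_ok @ v
out_leaky = attn_leaky @ v

Note that the causal mask still zeroes out the upper triangle of the weights in both cases; the leak comes purely from sharing the softmax denominator along the query axis.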
You can install our flavour of torch with degenerate attention either by installing from source or by making a local installation and then porting over the degenerate attention code files. For the second option to work, we need to enforce a consistent version of torch, so we pin it to version 2.0.1.
# from source
. setup_from_src.sh
# local installation
. setup_from_cpy.sh
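If you take the local installation route, a quick check along these lines confirms that the pinned torch version is the one actually in your environment (a hedged sketch; the setup script above may already handle this):

import torch
# The ported degenerate attention files assume torch 2.0.1 (see above).
assert torch.__version__.startswith("2.0.1"), torch.__version__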
We will set up multiple tasks on which to train different transformers with degenerate attention. The first task is language modelling on a variety of natural language datasets.
To run a language model, there are a variety of hyperparameters that you can set; please refer to the original README for more information. Here is an example where we train a transformer on the wikitext2 dataset, first with degenerate attention and then with vanilla attention.
cd nlp
# degenerate attention
python main.py --cuda --epochs 6 --model Transformer --lr 5 --wandb --degenerate
# vanilla attention
python main.py --cuda --epochs 6 --model Transformer --lr 5 --wandb