Disclaimer: This project is in its early stages but, due to a lack of time and resources, may not continue further. Please take this research into your own hands and cite this page if you do. This was done by an individual on free Colab runtimes.
Official repository for Efficient Linear-Time Attention Transformers (ELiTA). Implementation in TensorFlow 2 (PyTorch on the way). This is a re-working of both the Attention and Feed-Forward elements of a Transformer, resulting in faster and cheaper computation while keeping performance the same, if not better. Yes, it is another linear-attention Transformer, but with its Fourier-like general positional information, its light and far more easily scalable Feed-Forward, and the extensive ideas in the Other Methods section, I believe more can come of this than of its many counterparts. Results are compared against SoTA Transformers with RoPE, SwiGLU, etc.
The above results are preliminary, on wikitext with models of <300K params, sequence length 256 and batch size 128, using SentencePiece and Adam(min(1e-3, 1e-2/sqrt(step)), 0.9, 0.99). (In fact, the diagram shows the poorest performance of ELiTA, as it does not have the
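For clarity, the quoted learning-rate rule min(1e-3, 1e-2/sqrt(step)) can be written as a small Keras schedule. This is only a sketch of the stated setup, not code from this repo:

```python
import tensorflow as tf

class InverseSqrtCappedLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    """min(1e-3, 1e-2 / sqrt(step)), as quoted in the setup above."""
    def __call__(self, step):
        step = tf.cast(tf.maximum(step, 1), tf.float32)  # guard against step 0
        return tf.minimum(1e-3, 1e-2 / tf.sqrt(step))

# Adam with beta_1 = 0.9, beta_2 = 0.99, matching the text.
optimizer = tf.keras.optimizers.Adam(InverseSqrtCappedLR(), beta_1=0.9, beta_2=0.99)
```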
A full paper will hopefully be released at some point. Base code is available in this repo.
The goal is to make Transformers cheaper, so that more powerful LLMs can be developed with less reliance on masses of A100s and TPUv4 pods. Though that is perhaps a long way off, scaling to much larger sequence lengths and parameter counts should be easier, and better performance from small models should be more attainable.
The attention mechanism achieves linear time by using cumulative sums, which, though not an original idea, are used here in a new way. (This means it is Decoder-only.) As in original attention, there is a (never fully materialised) square array of logits which are softmaxed across rows and summed with the values, but here, instead of the logit being
To get a grasp on the generality of this mechanism, picture this. Take a trainable (N, N) matrix, whose entries are
The original (N, N) matrix, the
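The exact ELiTA logit construction is not reproduced here, so the following is only a generic TensorFlow sketch of the cumulative-sum trick that makes causal attention linear in sequence length. The q_feat/k_feat feature maps and the normalisation are illustrative assumptions, not the ELiTA formulation (which, as noted later, drops separate Q and K kernels):

```python
import tensorflow as tf

def cumsum_causal_linear_attention(q_feat, k_feat, v):
    """Illustrative cumulative-sum linear attention (not the exact ELiTA form).

    q_feat, k_feat: non-negative feature maps, shape (batch, seq, d)
    v:              values,                    shape (batch, seq, d)
    Running (cumulative) sums let every position attend to all previous
    positions without ever materialising the full (seq, seq) logit matrix,
    giving O(seq) rather than O(seq^2) time -- hence decoder-only / causal.
    """
    kv = tf.cumsum(tf.einsum('bsd,bse->bsde', k_feat, v), axis=1)   # (b, s, d, d)
    z = tf.cumsum(k_feat, axis=1)                                    # (b, s, d)
    num = tf.einsum('bsd,bsde->bse', q_feat, kv)                     # weighted values
    den = tf.einsum('bsd,bsd->bs', q_feat, z)[..., None] + 1e-6      # row normaliser
    return num / den
```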
The Feed-Forward magic works by splitting the dimensions of the down-scale kernel into the model width and a linear scale factor, and summing across each of them with separate inputs. These inputs replace the up-scale, and are simple linear layers with swish on top, of sizes width and linear scale factor respectively. More concretely, it is an einsum(i,j,ijk->k), using the two inputs i and j and the full-size kernel ijk. We use a scale factor of 8 to account for the loss of generality, but this does not have a negative impact, as described below.
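A minimal sketch of this factored Feed-Forward in TensorFlow 2, assuming a model width d and scale factor s = 8; the class name, initialiser and bias-free kernel are my own choices, not taken from the repo:

```python
import tensorflow as tf
from tensorflow.keras import layers

class FeedForward2(layers.Layer):
    """Sketch of the factored feed-forward described above."""
    def __init__(self, width, scale=8, **kwargs):
        super().__init__(**kwargs)
        self.proj_i = layers.Dense(width, activation='swish')  # size: model width
        self.proj_j = layers.Dense(scale, activation='swish')  # size: scale factor
        # Down-scale kernel with its first dimension split into (width, scale).
        self.kernel = self.add_weight(
            name='kernel', shape=(width, scale, width),
            initializer='glorot_uniform')

    def call(self, x):
        i = self.proj_i(x)   # (..., width)
        j = self.proj_j(x)   # (..., scale)
        # Contract both inputs against the split kernel in one einsum.
        return tf.einsum('...i,...j,ijk->...k', i, j, self.kernel)
```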
The reasoning is that LLMs use the up-scale as a memory search and the down-scale as memory storage. Instead of simply approximating the double linear transformation (with an activation in between), I am making that memory search more efficient; literally, across and along instead of just along.
The equation for
If you keep all the model dimensions the same (as was done in the wikitext experiment above), along with layers, heads, etc., there will be a small (~5%) increase in parameter count due to the scale factor of 8, which I find to be a good value for this to work; but the parameters saved in attention (as the Q and K kernels no longer really exist) should make up for this.
I have done considerable other research on this same goal using the ELiTA base. For example, if you think of the Transformer as determining the movement through embedding space forward in a sequence (if you weight-tie, then the Transformer is predicting the next token in embedding space), then wrapping the same, very dynamic model around itself many times lets you achieve more with the same number of parameters, as it is simply taking many steps to make the prediction (think diffusion, too). Effectively, you are giving it more layers with no extra parameters. Training like this has given good results when you wrap three times and add a bias depending on which step you are on. Unfortunately, unlike a diffusion model, I am at the moment training all steps together, which is very memory-intensive (though not as much, of course, as simply giving the model three times the layers).
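A rough sketch of this wrap-around idea, assuming a generic backbone callable and a trainable per-step bias; the step-conditioning details here are my assumption, not necessarily how the repo implements it:

```python
import tensorflow as tf
from tensorflow.keras import layers

class WrappedBackbone(layers.Layer):
    """Apply the same backbone `steps` times, adding a step-dependent bias."""
    def __init__(self, backbone, width, steps=3, **kwargs):
        super().__init__(**kwargs)
        self.backbone = backbone
        self.steps = steps
        self.step_bias = self.add_weight(
            name='step_bias', shape=(steps, width), initializer='zeros')

    def call(self, x):
        # All steps are run (and back-propagated through) together, which is
        # what makes this memory-intensive, as noted above.
        for s in range(self.steps):
            x = self.backbone(x + self.step_bias[s])
        return x
```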
To this end I have also tried to train a mapping from embedding space to a metric tensor, using the Geodesic Equation from relativity to model the movement, and using the initial velocity as a hidden state carried across tokens; but this did not prove too successful, due to the memory constraints of the Christoffel Symbols. (The hidden state works by taking the final velocity of a token in the sequence after taking a few steps through curved space, and using it as the initial velocity for the next token, like a momentum of meaning. The very first initial velocity is trainable.) However, I am convinced that this research should not be final, as training only the curvature of the space seems a lot less difficult than training the rules of movement from any point to any other. (In other words, you are trying to accurately learn a mapping of
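For reference, the geodesic equation being borrowed is the standard one, with the Christoffel symbols computed from the learned metric g (identifying embedding coordinates with x^mu is the idea described above):

```math
\frac{d^2 x^{\mu}}{d\tau^{2}}
  + \Gamma^{\mu}_{\alpha\beta}\,\frac{dx^{\alpha}}{d\tau}\frac{dx^{\beta}}{d\tau} = 0,
\qquad
\Gamma^{\mu}_{\alpha\beta}
  = \tfrac{1}{2}\, g^{\mu\nu}\left(\partial_{\alpha} g_{\nu\beta}
  + \partial_{\beta} g_{\nu\alpha} - \partial_{\nu} g_{\alpha\beta}\right)
```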
You can also add the serious bulk of parameters as a normal Feed-Forward at the end, over the logits after weight-tying. In my research I scale from 5000 logits to 8192 and back to 5000, with swish activation. This acts as a look-up table on top of the pre-trained backbone (with the three-fold repetition), giving 8192 situations in which common errors of the backbone can be fixed and more factual information can be stored.
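A sketch of this post-weight-tying head, assuming the 5000-token vocabulary and 8192 hidden units quoted above; whether the original logits are added back residually is my assumption, left as a comment:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, hidden = 5000, 8192

# 5000 -> 8192 -> 5000 over the post-weight-tying logits, swish in between.
logit_head = tf.keras.Sequential([
    layers.Dense(hidden, activation='swish'),
    layers.Dense(vocab_size),
])

# refined_logits = logit_head(logits)            # as described above
# refined_logits = logits + logit_head(logits)   # residual variant (my assumption)
```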
Attention2
Inputs:
Parameters:
Output:
Feed-Forward2
Inputs:
Parameters:
Output:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
{
  name: Efficient Linear-Time Attention Transformers,
  author: ACO Sharma,
  date: 08/2023
}