# TransformerXL

Our vanilla transformer showed improvements, but its attention was confined to the current block, so it had no access to anything earlier in the song.

It also used only absolute positional encodings, so tokens knew where they were in the sequence but not where they were relative to other tokens.

[TransformerXL](https://research.google/blog/transformer-xl-unleashing-the-potential-of-attention-models/) tackles both of these problems by:
1. Using a 'memory' for keys and values from the previous block, allowing information to propagate through time.
2. Employing relative positional encoding.

For the memory to be of any use, consecutive blocks of a song must be fed in order, so our loading and batching strategy will once again need revisiting.
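
To make the memory idea (point 1) a bit more concrete, here is a minimal sketch in plain Python, with no tensors, of which positions a query can see once the previous block is attached. The name `attention_span` and the parameters `block_len` and `mem_len` are made up for illustration. It assumes the memory is simply the previous block placed in front of the current one on the key axis, with causal masking.

```python
def attention_span(block_len, mem_len):
    """For each query in the current block, list the key positions it may
    attend to and how far back each one is.

    Keys are indexed over [memory | current block], so any key index below
    mem_len is a cached position from the previous block.
    """
    spans = []
    for i in range(block_len):
        q_pos = mem_len + i                    # query position on the joined key axis
        keys = list(range(q_pos + 1))          # causal: everything up to and including itself
        distances = [q_pos - j for j in keys]  # relative offsets, 0 means "this token"
        spans.append((keys, distances))
    return spans


spans = attention_span(block_len=4, mem_len=4)
keys, distances = spans[0]
print(keys)       # [0, 1, 2, 3, 4]
print(distances)  # [4, 3, 2, 1, 0]
```

Even the first token of the current block can look back over the whole memory, and what it sees is expressed as a distance rather than an absolute index, which is where relative positional encoding comes in.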

# Coding A Paper

Luckily I found [this walkthrough](https://www.youtube.com/playlist?list=PLam9sigHPGwOe8VDoS_6VT4jjlgs9Uepb) in the style of Karpathy's makemore videos.

## Notes

### Ep.2 Keeping GPUs Busy

We need to keep our music blocks contiguous across batches, e.g. for a batch size of four:


| Chunk 1 | Batch 1 | Batch 2 | Batch 3 | Chunk 2 | Batch 4 | Batch 5 | Batch 6 |
|--------|---------|---------|---------|--------|---------|---------|---------|
| Song 1 | Block 1 | Block 2 | Block 3 | Song 5 | Block 1 | Block 2 | Block 3 |
| Song 2 | Block 1 | Block 2 | Block 3 | Song 6 | Block 1 | Block 2 | Block 3 |
| Song 3 | Block 1 | Block 2 | Block 3 | Song 7 | Block 1 | Block 2 | Block 3 |
| Song 4 | Block 1 | Block 2 | Block 3 | Song 8 | Block 1 | Block 2 | Block 3 |

Note that the above shows songs that are all the same length, which of course isn't what we have in reality.

This means we have to do one of the following:
- Crop long songs
- Pad short songs
- Connect them in a ragged way

The video takes the cropping approach: pick a 'chunk' size that is a whole number of blocks, then crop each song to a multiple of that chunk size, e.g.

```python
blocks = 3
block_size = 256
chunk_size = blocks * block_size
chunk_size  # 768
```

So take the song length modulo the chunk size and crop that remainder off the song.
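
As a worked example of the cropping arithmetic, the sketch below uses a hypothetical helper `crop_to_chunks` on a plain list of tokens. The song length of 2000 is made up for illustration.

```python
def crop_to_chunks(song, chunk_size):
    """Drop trailing tokens so the song length is a whole number of chunks."""
    remainder = len(song) % chunk_size
    return song[:len(song) - remainder]


chunk_size = 3 * 256        # 768, as above
song = list(range(2000))    # stand-in for a tokenised song (made-up length)
cropped = crop_to_chunks(song, chunk_size)
print(len(cropped))                # 1536
print(len(cropped) // chunk_size)  # 2
```

One consequence of this approach is that a song shorter than one chunk is cropped to nothing and so drops out of the dataset.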

Use `reshape` (or `view`?) to rearrange a song into chunks, then `concat` to join the songs into one list of chunks, then `chunk` to split into batches.
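
The three steps above can be sketched with plain Python lists, which keeps the example dependency-free; this is my own reading of the layout, not the video's code. The helper names `song_to_chunks` and `make_batches` are hypothetical, and the tiny sizes are chosen only so the output is readable.

```python
def song_to_chunks(song, chunk_size):
    """Crop a song to a whole number of chunks and split it ("reshape")."""
    usable = len(song) - len(song) % chunk_size
    return [song[i:i + chunk_size] for i in range(0, usable, chunk_size)]


def make_batches(songs, batch_size, blocks, block_size):
    chunk_size = blocks * block_size
    # "concat": one flat list of chunks across all songs.
    chunks = [c for song in songs for c in song_to_chunks(song, chunk_size)]
    batches = []
    # Take batch_size chunks at a time; each chunk becomes one row.
    # An incomplete final group is dropped.
    for start in range(0, len(chunks) - batch_size + 1, batch_size):
        group = chunks[start:start + batch_size]
        # "chunk": cut the group along the time axis, one batch per block.
        for b in range(blocks):
            lo = b * block_size
            batches.append([row[lo:lo + block_size] for row in group])
    return batches


# Four toy songs of six tokens each; song s uses tokens s*100 .. s*100+5.
songs = [list(range(s * 100, s * 100 + 6)) for s in range(4)]
batches = make_batches(songs, batch_size=2, blocks=3, block_size=2)
print(batches[0])  # [[0, 1], [100, 101]]
print(batches[1])  # [[2, 3], [102, 103]]
print(batches[3])  # [[200, 201], [300, 301]]
```

Within a group, each row of one batch continues the same row of the previous batch. At a group boundary (here between `batches[2]` and `batches[3]`) the rows switch to different chunks, so any memory carried across that boundary would belong to unrelated music.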

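Following the above, each row of batch 1 should be the direct precursor of the same row of batch 2: for example, the first row holds Song 1's block 1 in batch 1 and Song 1's block 2 in batch 2.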
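
A quick way to sanity-check this on toy data is sketched below. The helper `rows_continue` is hypothetical and only makes sense when tokens are consecutive integers, so it is a debugging aid rather than something to run on real music.

```python
def rows_continue(prev_batch, next_batch):
    """True if each row of next_batch starts one token after the same row
    of prev_batch ends. Assumes tokens are consecutive integers (toy data)."""
    return all(nxt[0] == prev[-1] + 1
               for prev, nxt in zip(prev_batch, next_batch))


batch_1 = [[0, 1], [100, 101]]
batch_2 = [[2, 3], [102, 103]]
print(rows_continue(batch_1, batch_2))  # True
print(rows_continue(batch_2, batch_1))  # False
```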

Data and labels per chunk are the same as in a vanilla transformer - labels are data offset by one.
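
A minimal sketch of the offset-by-one relationship, using a hypothetical helper `split_inputs_labels` on a plain list of tokens:

```python
def split_inputs_labels(tokens):
    """Return (data, labels) where labels[i] is the token that follows data[i]."""
    return tokens[:-1], tokens[1:]


data, labels = split_inputs_labels([10, 11, 12, 13, 14])
print(data)    # [10, 11, 12, 13]
print(labels)  # [11, 12, 13, 14]
```

Slicing this way loses one position, so to get inputs that are a full chunk long you would need one extra token per chunk. How the video handles that is not covered in these notes, so treat this as an assumption to check.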

 