# Transformer API

Simple transformers can be built with the transformer types of *NNHelferlein*.
The implementation follows the paper by *Vaswani et al., 2017*
(figure from *Vaswani et al. NIPS (2017)*, http://arxiv.org/abs/1706.03762)
and is wrapped into the types
`NNHelferlein.Transformer` and `NNHelferlein.TokenTransformer`:

<img src="assets/80-vaswani-fig-1.png" width="400">

In [1]:
using Knet, NNHelferlein
using LinearAlgebra

## Playground data

For the experiments a tiny but endearing dataset is used and prepared with *NNHelferlein* tools:

In [2]:
de = ["Ich liebe Julia",
      "Peter liebt Python",
      "Susi liebt sie alle",
      "Ich programmiere immer in Julia"]
en = ["I love Julia",
      "Peter loves Python",
      "Susi loves them all",
      "I always code Julia"]

de_vocab = WordTokenizer(de)
d = de_vocab(de, add_ctls=true)
d = pad_sequence.(d, 8)
d = truncate_sequence.(d, 8)

4-element Vector{Vector{Int32}}:
 [1, 7, 9, 6, 2, 3, 3, 3]
 [1, 10, 5, 13, 2, 3, 3, 3]
 [1, 16, 5, 14, 11, 2, 3, 3]
 [1, 7, 12, 8, 15, 6, 2, 3]
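
The effect of padding and truncating can be reproduced in plain Julia. The following is only a minimal sketch: `fit_length` is a hypothetical helper (not part of *NNHelferlein*), and the token codes 1, 2 and 3 for `<start>`, `<end>` and `<pad>` are inferred from the output above.

```julia
# Hypothetical helper, not part of NNHelferlein: bring a token
# sequence to a fixed length by truncating or by appending <pad>.
# Token 3 is assumed to be <pad> (inferred from the output above).
function fit_length(seq::Vector{<:Integer}, len::Int; pad=3)
    if length(seq) >= len
        return seq[1:len]                                   # truncate
    end
    return vcat(seq, fill(eltype(seq)(pad), len - length(seq)))  # pad
end

tokens = Int32[1, 7, 9, 6, 2]      # <start> Ich liebe Julia <end>
fitted = fit_length(tokens, 8)
@show fitted
```

The result has the same layout as the first row of the output above: the sentence tokens, framed by `<start>` and `<end>`, followed by `<pad>` tokens up to length 8.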

In [3]:
en_vocab = WordTokenizer(en)
e = en_vocab(en, add_ctls=true)
e = pad_sequence.(e, 8)
e = truncate_sequence.(e, 8)

4-element Vector{Vector{Int32}}:
 [1, 7, 10, 5, 2, 3, 3, 3]
 [1, 9, 6, 11, 2, 3, 3, 3]
 [1, 12, 6, 14, 13, 2, 3, 3]
 [1, 7, 15, 8, 5, 2, 3, 3]

In [4]:
mbs = sequence_minibatch(d, e, 2)
x,y = first(mbs)
@show length(mbs)
@show x
@show y;

length(mbs) = 2
x = Int32[1 1; 7 10; 9 5; 6 13; 2 2; 3 3; 3 3; 3 3]
y = Int32[1 1; 7 9; 10 6; 5 11; 2 2; 3 3; 3 3; 3 3]


## The Transformer

Transformers can be constructed with the types `NNHelferlein.Transformer` and
`NNHelferlein.TokenTransformer`. The first is more general and expects tensors
of embedded data as input. The `TokenTransformer` works on sequences of
integer tokens.

We set up a `TokenTransformer` with 5 layers, an embedding depth of 128
and 4 heads. The vocabulary sizes are taken from the
vocab objects of type `WordTokenizer`. We briefly test it with the first minibatch:

In [5]:
tt =  TokenTransformer(5, 128, 4, de_vocab, en_vocab, drop_rate=0.1);

In [6]:
@show size(tt(x,y))       # raw output
@show size(tt.α)          # attention factors (for 4 heads)
tt(x,y, embedded=false)   # generated sequence

size(tt(x, y)) = (15, 8, 2)
size(tt.α) = (8, 8, 4, 2)


8×2 Matrix{Int64}:
 9  9
 9  9
 9  9
 9  9
 9  9
 9  9
 9  9
 9  9
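
The attention factors have one 8×8 matrix per head and per sequence. As an illustration of where such a matrix comes from, here is a minimal sketch of scaled dot-product attention for a single head and a single sequence, as described in the Vaswani paper. It is not the *NNHelferlein* implementation; the function names and the per-head depth of 32 (128 divided by 4 heads) are assumptions for this example.

```julia
# Column-wise softmax (numerically stabilised).
function softmax_cols(a)
    e = exp.(a .- maximum(a, dims=1))
    return e ./ sum(e, dims=1)
end

# Scaled dot-product attention for one head and one sequence.
# q, k, v are (depth, seq_len) matrices; each column is one position.
function dot_prod_attn(q, k, v)
    depth = size(k, 1)
    scores = (k' * q) ./ sqrt(Float32(depth))  # (n_keys, n_queries)
    α = softmax_cols(scores)                   # each column sums to 1
    return v * α, α                            # context vectors, attention factors
end

q = randn(Float32, 32, 8)
k = randn(Float32, 32, 8)
v = randn(Float32, 32, 8)
c, α = dot_prod_attn(q, k, v)
@show size(c)
@show size(α)
```

Each column of `α` holds the weights with which one query position attends to all key positions.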

## Signatures for training and prediction:

Additional signatures are necessary for training and prediction. These methods
add functionality for
+ creation of padding masks
+ shifting in- and out-sequences by one, to be able to train the *next*
  position of the sequence
+ loss calculation for training

The default *NNHelferlein* encoding of the `WordTokenizer` is used for `<start>`, `<end>` and `<pad>`.
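
The shift by one position can be illustrated with the target minibatch `y` printed above. This is a sketch in plain Julia; `padding_mask` is a hypothetical stand-in that marks `<pad>` positions with 1.0 and assumes token 3 is `<pad>`. It is not the library function `mk_padding_mask`.

```julia
# First target minibatch from above; each column is one sequence.
y = Int32[1 1; 7 9; 10 6; 5 11; 2 2; 3 3; 3 3; 3 3]

y_in    = y[1:end-1, :]    # decoder input, starts with <start>
y_teach = y[2:end, :]      # teaching output, shifted by one position

# Hypothetical stand-in for a padding mask: 1.0 at <pad> positions.
padding_mask(seq; pad=3) = Float32.(seq .== pad)

@show y_in[:, 1]
@show y_teach[:, 1]
@show padding_mask(y_teach)[:, 1]
```

At every position `t`, the decoder sees `y_in[1:t]` and is trained to predict `y_teach[t]`, which is the next token of the original sequence.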

In [7]:
mutable struct AllYouNeed
    t::TokenTransformer
    vocab_enc
    vocab_dec
    
    AllYouNeed(n_layers, depth, heads, x_vocab, y_vocab; drop_rate=0.1) = 
        new(TokenTransformer(n_layers, depth, heads, x_vocab, y_vocab; drop_rate),
        x_vocab,
        y_vocab)
end

In [8]:
function (ayn::AllYouNeed)(x,y)   # calc loss
    
    y_in = y[1:end-1,:]       # shift y against teaching output
    y_teach = y[2:end,:]
        
    x_mask = mk_padding_mask(x)
    y_mask = mk_padding_mask(y_in)
        
    o = ayn.t(x, y_in)
        
    o_mask = (mk_padding_mask(y_teach) .== 0.0) |> Array{Float32}
    y_m = y_teach .* o_mask .|> Int   # make class ID 0 for padded positions
    loss = nll(o, y_m, average=true)  # Xentropy loss of unmasked positions only
    
    return loss
end

In [9]:
translate = AllYouNeed(5, 128, 4, de_vocab, en_vocab, drop_rate=0.1)
translate(x,y)

3.7183423f0
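
The loss above relies on positions with class ID 0 being left out, as the comments in the cell state. What that means can be sketched in plain Julia. `masked_nll` is a hypothetical function written for this illustration only; it is not Knet's `nll`.

```julia
# Mean negative log-likelihood over all positions with a label > 0.
# scores: (n_classes, seq_len, batch) raw outputs
# labels: (seq_len, batch) class IDs, 0 marks a padded position
function masked_nll(scores, labels)
    total, n = 0.0, 0
    for b in axes(labels, 2), t in axes(labels, 1)
        c = labels[t, b]
        c == 0 && continue                    # padded position: skip
        s = scores[:, t, b]
        m = maximum(s)
        logp = s .- (m + log(sum(exp.(s .- m))))   # log-softmax
        total -= logp[c]
        n += 1
    end
    return total / n
end

scores = zeros(Float32, 15, 7, 2)             # uniform prediction, 15 classes
labels = [7 9; 10 6; 5 11; 2 2; 0 0; 0 0; 0 0]
@show masked_nll(scores, labels)
@show log(15)
```

For a uniform prediction over 15 classes the result equals `log(15)`, independent of the number of padded positions.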

### Accuracy

Calculating a meaningful accuracy is a little bit tricky for transformers, because
the target sequences *in* and *out* are shifted against each other:

In [10]:
function tt_acc(mdl; data=nothing)

    tac = Float32(0.0)
    for (x,y) in data
        y_in = y[1:end-1,:]       # decoder input
        y_teach = y[2:end,:]      # expected output, shifted by one
        o = mdl.t(x, y_in, embedded=false)   # predicted token sequence

        tac += hamming_acc(o, y_teach, vocab=mdl.vocab_dec)
    end

    return tac / length(data)     # mean accuracy over all minibatches
end

tt_acc (generic function with 1 method)

In [11]:
tt_acc(translate, data=mbs)

0.0625
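
The idea of a position-wise accuracy can be sketched as follows. `token_acc` is a hypothetical function for illustration; it simply ignores `<pad>` positions (token 3 assumed) and may count differently from the `hamming_acc` used above.

```julia
# Fraction of matching tokens, counted only where the target is not <pad>.
function token_acc(pred, target; pad=3)
    keep = target .!= pad
    return sum((pred .== target) .& keep) / sum(keep)
end

# Made-up prediction for the shifted targets of the first minibatch:
target = [7 9; 10 6; 5 11; 2 2; 3 3; 3 3; 3 3]
pred   = [7 9; 10 9; 5 11; 2 9; 9 9; 9 9; 9 9]
@show token_acc(pred, target)
```

Here 6 of the 8 non-padded target positions are predicted correctly.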

## Now we can train:

In [12]:
translate = AllYouNeed(5, 128, 4, de_vocab, en_vocab, drop_rate=0.1)

ayn = tb_train!(translate, Adam, mbs, epochs=100,
                lr=1e-9, lr_decay=2e-4, lrd_steps=5, lrd_linear=true,
                tb_name="I_love_WARMUP",
                acc_fun=tt_acc, eval_size=1, eval_freq=1)
ayn = tb_train!(translate, Adam, mbs, epochs=300,
                lr=1e-4, lr_decay=1e-5, lrd_steps=5, lrd_linear=true,
                tb_name="I_love_TRAIN",
                acc_fun=tt_acc, eval_size=1, eval_freq=1);

Training 100 epochs with 2 minibatches/epoch.
Evaluation is performed every 2 minibatches with 2 mbs.
Watch the progress with TensorBoard at:
/home/andreas/Documents/Projekte/2022-NNHelferlein_KnetML/NNHelferlein/examples/logs/I_love_WARMUP/2023-05-12T14-44-55


Progress:  21%|████████▋                                |  ETA: 0:02:56


Setting learning rate to η=5.00e-05 in epoch 20.5


Progress:  40%|████████████████▋                        |  ETA: 0:01:12


Setting learning rate to η=1.00e-04 in epoch 40.5


Progress:  60%|████████████████████████▋                |  ETA: 0:00:34


Setting learning rate to η=1.50e-04 in epoch 60.5


Progress:  80%|████████████████████████████████▋        |  ETA: 0:00:14


Setting learning rate to η=2.00e-04 in epoch 80.5


Progress: 100%|█████████████████████████████████████████| Time: 0:00:54


Training finished with:
Training loss:       1.2147009
Training accuracy:   0.4375
Training 300 epochs with 2 minibatches/epoch.
Evaluation is performed every 2 minibatches with 2 mbs.
Watch the progress with TensorBoard at:
/home/andreas/Documents/Projekte/2022-NNHelferlein_KnetML/NNHelferlein/examples/logs/I_love_TRAIN/2023-05-12T14-45-53


Progress:  20%|████████▎                                |  ETA: 0:00:24


Setting learning rate to η=7.75e-05 in epoch 60.5


Progress:  40%|████████████████▍                        |  ETA: 0:00:18


Setting learning rate to η=5.50e-05 in epoch 120.5


Progress:  60%|████████████████████████▋                |  ETA: 0:00:12


Setting learning rate to η=3.25e-05 in epoch 180.5


Progress:  80%|████████████████████████████████▊        |  ETA: 0:00:06


Setting learning rate to η=1.00e-05 in epoch 240.5


Progress: 100%|█████████████████████████████████████████| Time: 0:00:29


Training finished with:
Training loss:       0.014779562
Training accuracy:   1.0
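
The learning rates reported in the log are consistent with a linear interpolation from `lr` to `lr_decay` in `lrd_steps` equally spaced values. The sketch below reproduces the numbers under that assumption; `linear_schedule` is a hypothetical helper and not how `tb_train!` is necessarily implemented.

```julia
# Assumption: with lrd_linear=true the rate moves from `lr` to
# `lr_decay` in `lrd_steps` equally spaced values.
linear_schedule(lr, lr_end, steps) = collect(range(lr, stop=lr_end, length=steps))

warmup = linear_schedule(1e-9, 2e-4, 5)   # first call: warm-up
decay  = linear_schedule(1e-4, 1e-5, 5)   # second call: decay
@show warmup
@show decay
```

The first run thus acts as a warm-up with an increasing learning rate, and the second run decays the rate down to 1e-5.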


Of course, this only proves that the transformer
can overfit a small dataset.

Please have a look at the example `80-transformer.ipynb` to see how to
work with a more realistic dataset.