# Self-Attention doesn't work (by itself) #

The basic version of self-attention that we implemented is not used in practice. Why? It has some drawbacks. The main one is that it relies on the softmax function, which is a kind of normalized exponential. The key word here is exponential: the function is very sensitive to large input values (in the vector examples from the previous posts, if you try `np.exp(x) / np.sum(np.exp(x))` with large values you will see some `inf`). Large inputs also push softmax into a saturated region where the gradient is close to zero, so we need a mechanism to scale the softmax input. The other drawback is that nothing in that self-attention operation is a parameter, so there is nothing it can learn by itself. What if we want to nudge the results in a direction that is convenient for us? We can't, so that is another thing to think about.
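A small sketch of that sensitivity, in plain Python so it runs without any dependency. The helper names `softmax` and `softmax_self_grad` are mine, not from the earlier posts; the second one uses the standard softmax identity $ \partial p_i / \partial x_i = p_i (1 - p_i) $.

```python
import math

def softmax(xs):
    """Naive softmax: exponentiate, then normalize."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_self_grad(ps):
    """Diagonal of the softmax Jacobian: d p_i / d x_i = p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in ps]

small = softmax([1.0, 2.0, 3.0])
large = softmax([10.0, 20.0, 30.0])  # same direction, 10x the magnitude

print([round(p, 4) for p in small])   # spread over the three entries
print([round(p, 4) for p in large])   # almost all the mass on the last entry
print(max(softmax_self_grad(small)))  # a usable gradient
print(max(softmax_self_grad(large)))  # a gradient close to zero

# Very large inputs overflow the exponential outright
# (NumPy would return inf here, math.exp raises instead).
try:
    softmax([1000.0, 2000.0, 3000.0])
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
```

The second input is just the first one multiplied by 10, which is the kind of growth that scaling the softmax input is meant to undo.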

Let's start by setting the parameters.

Each vector in the input sequence is used three times in the basic attention function. First, it is used to compute the weights for its own output $ y_i $. Second, it is used to compute the weights for the outputs of the other vectors, the $ y_j $. Third, once the weights are known, it is multiplied by those weights as part of the sum that produces each output vector.

As the researchers explain in the paper "Attention Is All You Need":

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."

We can obtain each of these vectors (query, key and value) by applying a linear transformation to the original input vector $ x_i $.
This is where the parameters come in: for each role (query, key and value) we have a square matrix ( $ W_q, W_k, W_v $ ) that performs this linear transformation.

For each input $ x_i $ we get $ q_i $, a query vector, $ k_i $, a key vector, and $ v_i $, a value vector.

They are defined as follows:
$ q_i=W_q x_i \\ $ 
$ k_i=W_k x_i \\ $
$ v_i=W_v x_i \\ $
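
As a sketch of these three projections in PyTorch (the dimension `k = 4`, the sequence length and the variable names are arbitrary choices for illustration), each matrix can be an `nn.Linear` layer without bias:

```python
import torch
from torch import nn

torch.manual_seed(0)

k = 4                      # dimension of each input vector (arbitrary)
x = torch.randn(3, k)      # a sequence of 3 input vectors x_i

# One learnable square matrix per role, no bias term.
W_q = nn.Linear(k, k, bias=False)
W_k = nn.Linear(k, k, bias=False)
W_v = nn.Linear(k, k, bias=False)

q, key, v = W_q(x), W_k(x), W_v(x)   # each has shape (3, k)

# nn.Linear stores the matrix so that W(x) == x @ W.weight.T,
# which is the row-vector form of q_i = W_q x_i.
manual_q = x @ W_q.weight.T
print(q.shape, torch.allclose(q, manual_q, atol=1e-6))
```

Because the three matrices are parameters, training can now change what each vector asks for, what it offers, and what it contributes.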

Now we can use the three transformed vectors as inputs to the attention operation. To avoid killing the gradient we will scale the softmax input, but which factor should we use?
Well, we want to keep the dot products from growing too large, and they grow as a function of the dimension, right?
The usual trick is to normalize with respect to some metric. Since we are working with vectors, let's use their Euclidean length.
As a particular case, take a vector with $c$ components, each with a value of $ u $. The Euclidean distance from the origin to that point in $c$-dimensional space is $u\sqrt{c} $, so we have found our scale factor: $\sqrt{c}$.
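
A quick numeric check of that argument in plain Python. The first part verifies the $u\sqrt{c}$ length. The second part is an extra illustration that is not in the text above: assuming the components of a query and a key are independent standard normal samples, the spread of their dot product also grows like $\sqrt{c}$, and dividing by $\sqrt{c}$ brings it back to roughly 1. The sizes `c = 64` and `n = 2000` are arbitrary.

```python
import math
import random
import statistics

random.seed(0)

c, u = 64, 0.5
constant_vector = [u] * c
length = math.sqrt(sum(x * x for x in constant_vector))
print(length, u * math.sqrt(c))   # both are the Euclidean length

def random_dot(dim):
    """Dot product of two vectors with independent N(0, 1) components."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(dim))

n = 2000
dots = [random_dot(c) for _ in range(n)]
raw_std = statistics.pstdev(dots)
scaled_std = statistics.pstdev([d / math.sqrt(c) for d in dots])
print(raw_std, scaled_std)        # roughly sqrt(c) and roughly 1
```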

Here is the corresponding fragment of the original paper:

<img src="../images/Post4/AIAYN3.2.1.png"> 
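
In case the image does not load: assuming the fragment is Section 3.2.1 of the paper, as the file name suggests, its central formula is scaled dot-product attention. Here $Q$, $K$ and $V$ stack the queries, keys and values as rows, and $d_k$ is the key dimension, which plays the role of $c$ above:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$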


In addition, they specify multi-head attention. It is usually motivated by the fact that words play different roles in a sentence. Take for example "the dog eats meat": with a single head, each output is just a weighted sum over the whole sequence, so if we swap "dog" and "meat" the attention mechanism returns the same vector for "eats", even though the meaning has changed. That is usually not accurate enough, so we use several heads, each with its own projections, and concatenate the results.

<img src="../images/Post4/AIAYN3.2.2.png"> 

## Scaled dot-product and multi-head attention ##

<img src="../images/Post4/AIAYNFig2.png"> 

## Let's implement it  ##

We can combine all the heads into one matrix, so that a single matmul computes the queries of every head (and likewise for the keys and the values).
Let's implement it as a torch module so that we can use it in the next posts.
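
Before the full module, a small sketch of what "all heads in one matrix" means. It assumes, as the module does, that every head works on the full `k`-dimensional input; the sizes and variable names are arbitrary. One `nn.Linear(k, k * h)` followed by a `view` gives the same result as `h` separate `k`-by-`k` projections:

```python
import torch
from torch import nn

torch.manual_seed(0)

b, t, k, h = 2, 5, 4, 3           # batch, sequence length, dimension, heads
x = torch.randn(b, t, k)

# One big projection that computes the queries of every head at once.
toqueries = nn.Linear(k, k * h, bias=False)
all_heads = toqueries(x).view(b, t, h, k)      # (b, t, h, k)

# The same thing head by head: rows i*k:(i+1)*k of the big weight
# matrix act as the k-by-k matrix of head i.
for i in range(h):
    W_i = toqueries.weight[i * k:(i + 1) * k]  # (k, k)
    head_i = x @ W_i.T                         # (b, t, k)
    assert torch.allclose(all_heads[:, :, i, :], head_i, atol=1e-6)

print(all_heads.shape)
```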

In [1]:
import torch
from torch import nn
import torch.nn.functional as F

# Assuming a default of 8 heads, as in the paper
class SelfAttentionWithTricks(nn.Module):
    def __init__(self, k, heads=8):
        super().__init__()
        self.k, self.heads = k, heads
        # Compute all the heads at once.
        # nn.Linear applies a linear transformation; its parameters are:
        #   in_features  - size of each input sample
        #   out_features - size of each output sample
        #   bias         - if set to False, the layer will not learn an additive bias (default: True)
        self.tokeys    = nn.Linear(k, k * heads, bias=False)
        self.toqueries = nn.Linear(k, k * heads, bias=False)
        self.tovalues  = nn.Linear(k, k * heads, bias=False)
        # Put everything together: project the concatenated heads back to k dimensions
        self.unifyheads = nn.Linear(heads * k, k)

    # Implement the module's forward function; here we compute keys, queries and values
    def forward(self, x):
        b, t, k = x.size()
        h = self.heads
        # Reshape so that the tensor has a separate dimension for the heads
        queries = self.toqueries(x).view(b, t, h, k)
        keys    = self.tokeys(x).view(b, t, h, k)
        values  = self.tovalues(x).view(b, t, h, k)
        # The head and batch dimensions are not next to each other, so we transpose
        # to get the shape required by the batched dot product. contiguous() lays the
        # values out in memory in the new order (effectively a new tensor with the
        # same values), which view() requires.
        keys    = keys.transpose(1, 2).contiguous().view(b * h, t, k)
        queries = queries.transpose(1, 2).contiguous().view(b * h, t, k)
        values  = values.transpose(1, 2).contiguous().view(b * h, t, k)

        # Next we multiply queries and keys to get the weights.
        # Instead of dividing the product by sqrt(k), we can scale before the
        # multiplication by dividing both queries and keys by the fourth root of k.
        queries = queries / (k ** (1/4))
        keys    = keys / (k ** (1/4))

        # Multiply them: weights has shape (b * h, t, t)
        weights = torch.bmm(queries, keys.transpose(1, 2))
        # and of course apply softmax over the key dimension
        weights = F.softmax(weights, dim=2)
        # Now we multiply the weights by the values and reshape
        out = torch.bmm(weights, values).view(b, h, t, k)
        # Finally, to get back a k-dimensional vector, we use the unifyheads
        # projection to map the h * k dimensions to k dimensions
        out = out.transpose(1, 2).contiguous().view(b, t, h * k)
        return self.unifyheads(out)
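
A small check of the fourth-root trick used in `forward`, as a sketch with arbitrary sizes: dividing both queries and keys by $ k^{1/4} $ before the product gives the same weights as dividing the product by $ \sqrt{k} $, because $ k^{1/4} \cdot k^{1/4} = \sqrt{k} $.

```python
import torch

torch.manual_seed(0)

bh, t, k = 6, 5, 16               # (batch * heads), sequence length, dimension
queries = torch.randn(bh, t, k)
keys = torch.randn(bh, t, k)

# Scale each factor by the fourth root of k before multiplying...
scaled_before = torch.bmm(queries / k ** (1 / 4),
                          (keys / k ** (1 / 4)).transpose(1, 2))
# ...or scale the raw dot products by the square root of k afterwards.
scaled_after = torch.bmm(queries, keys.transpose(1, 2)) / k ** (1 / 2)

print(torch.allclose(scaled_before, scaled_after, atol=1e-5))
```

Scaling before the product keeps the intermediate dot products small as well, which is one plausible reason to prefer it.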

So far we have only built the self-attention part. A transformer is an architecture based on this operation, so in the next posts we'll implement a transformer out of blocks that use it.

If you are looking for a more detailed explanation, have a look at Peter Bloem's blog. I used it as inspiration and as a guide to write all these posts:
http://peterbloem.nl/blog/transformers
And of course, the original paper, "Attention Is All You Need": https://arxiv.org/pdf/1706.03762.pdf