
In [1]:
import torch
from torch import nn
import math
import torch.nn.functional as F
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
import copy

The encoder is composed of a stack of $N = 6$ layers.

In [2]:
def clones(module, N):
  "Produce N identical layers"
  return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
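
A quick check of what `clones` does, as a minimal sketch: each copy is a deep copy, so the N layers start with identical weights but do not share parameters. The layer type and sizes below are arbitrary values chosen for illustration.

```python
import copy

import torch
from torch import nn


def clones(module, N):
    "Produce N identical layers"
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


layers = clones(nn.Linear(4, 4), 3)  # sizes are illustrative only

# The copies start out with equal weights...
same_values = torch.equal(layers[0].weight, layers[1].weight)

# ...but they are separate tensors, so changing one leaves the others untouched.
with torch.no_grad():
    layers[0].weight.add_(1.0)
still_same = torch.equal(layers[0].weight, layers[1].weight)

print(len(layers), same_values, still_same)
```

This independence matters for the encoder stack: every layer has the same structure but learns its own parameters.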

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $LayerNorm(x + Sublayer(x))$, where $Sublayer(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$.


[Layer Normalization](https://arxiv.org/abs/1607.06450) looks like this:

$\begin{aligned} L N(\mathbf{z} ; \boldsymbol{\alpha}, \boldsymbol{\beta}) &=\frac{(\mathbf{z}-\mu)}{\sigma} \odot \boldsymbol{\alpha}+\boldsymbol{\beta} \\ \mu=\frac{1}{D} \sum_{i=1}^D z_i, \quad \sigma &=\sqrt{\frac{1}{D} \sum_{i=1}^D\left(z_i-\mu\right)^2} \end{aligned}$

$\alpha$ = gain vector, initialized to ones (multiplicative) \\
$\beta$ = bias vector, initialized to zeros (additive) \\
$\mu$ = mean \\
$\sigma$ = standard deviation

We will first do it without the alpha and beta vectors:

In [3]:
# Manual implementation of LayerNorm

x = torch.arange(0,10,1).float().view(2,5)
D = len(x.view(-1))
x_flat = x.view(-1)
eps = 1e-5

mu = 1/D * sum(x_flat)
sigma = (1/D * sum((x_flat-mu)**2)).sqrt()
std = np.sqrt(np.mean(abs((x.numpy() - x.numpy().mean())**2))) # this is how numpy computes it

# Implementation from the paper:

#STD = x.std(-1, keepdim=True)
#MEAN = x.mean(-1, keepdim=True)
#ALFA = nn.Parameter(torch.ones(2, 5))
#BETA = nn.Parameter(torch.zeros(2, 5))
#LANO = ALFA * (x - MEAN) / (STD + eps) + BETA

LN = ((x_flat - mu)/sigma).view(x.shape)

LayerNorma = torch.nn.LayerNorm(x.shape, elementwise_affine=True)
torch_layernorm = LayerNorma(x)

# tinygrad's implementation:
#y = x - x.mean(axis=-1, keepdim=True)
#NLN = y.div((y*y).mean(axis=-1, keepdim=True).add(eps).sqrt())

#print(f'mine {sigma}, pytorch {torch.var(x)}, numpy {np.std(x.numpy())}, numpy replica {std}')
#print(mu, torch.mean(x), np.mean(x.numpy()))
#print(LN)
#print(torch_layernorm)
#print(STD, sigma, torch.std(x))
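
The manual computation above can be checked against PyTorch, as a minimal sketch. It assumes that `F.layer_norm` normalizes over the trailing dimensions given in `normalized_shape` using the biased variance, with `eps` added inside the square root, so the two results should agree closely rather than exactly.

```python
import torch
import torch.nn.functional as F

x = torch.arange(0, 10, 1).float().view(2, 5)

# Normalize over the whole tensor, as in the cell above (no alpha/beta).
mu = x.mean()
sigma = ((x - mu) ** 2).mean().sqrt()  # population (biased) standard deviation
manual = (x - mu) / sigma

# PyTorch's functional layer norm over the same shape, without affine parameters.
reference = F.layer_norm(x, tuple(x.shape), eps=1e-5)

max_diff = (manual - reference).abs().max().item()
print(max_diff)
```

The remaining difference comes only from `eps`, which the manual version leaves out.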

In [4]:
class LayerNorm(nn.Module):
  "Construct a layer normalization module"
  def __init__(self, features, eps=1e-6):
    super().__init__()
    self.a_2 = nn.Parameter(torch.ones(features))   # gain (alpha)
    self.b_2 = nn.Parameter(torch.zeros(features))  # bias (beta)
    self.eps = eps

  def forward(self, x):
    # Normalize over the last (feature) dimension
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
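
A short sketch of how this class behaves next to the built-in `nn.LayerNorm`. The class is restated here with the bias parameter named `b_2`, matching its use in `forward`; tensor sizes are illustrative. One difference worth knowing: `x.std` applies Bessel's correction by default (divides by $D - 1$), while `nn.LayerNorm` uses the biased variance (divides by $D$), so the outputs differ by roughly a factor of $\sqrt{(D-1)/D}$.

```python
import torch
from torch import nn


class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)  # unbiased by default (divides by D - 1)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


torch.manual_seed(0)
D = 8
x = torch.randn(2, 3, D)  # (batch, positions, features); sizes are illustrative

ours = LayerNorm(D)(x)
builtin = nn.LayerNorm(D)(x)  # uses the biased variance (divides by D)

# Each feature vector is normalized independently: mean ~0, unbiased std ~1.
row_means = ours.mean(-1)
row_stds = ours.std(-1)

# Overall scale of our output relative to the built-in one.
ratio = (ours.norm() / builtin.norm()).item()
print(row_means.abs().max().item(), row_stds.mean().item(), ratio)
```

In practice the learned gain can absorb this constant factor, so either convention works.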

The output of each sub-layer is $LayerNorm(x + Sublayer(x))$, where $Sublayer(x)$ is the function implemented by the sub-layer itself.

In [5]:
class SublayerConnection(nn.Module):
  """A residual connection followed by a LayerNorm.
  Note that the norm is applied first rather than last."""
  def __init__(self, size, dropout):
    super().__init__()
    self.norm = LayerNorm(size)
    self.dropout = nn.Dropout(dropout)

  def forward(self, x, sublayer):
    "Apply the residual connection to any sublayer"
    return x + self.dropout(sublayer(self.norm(x)))
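
The paper's formula normalizes after the residual sum (post-norm), while the class above normalizes the sub-layer input (pre-norm). A minimal sketch of both orderings follows. It uses `nn.LayerNorm` as a stand-in for the custom class and a hypothetical `nn.Linear` sub-layer, neither of which comes from the original notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 16                            # illustrative size, not the paper's 512
norm = nn.LayerNorm(d_model)            # stand-in for the LayerNorm class above
sublayer = nn.Linear(d_model, d_model)  # hypothetical sub-layer
dropout = nn.Dropout(0.1).eval()        # eval mode: dropout is a no-op here

x = torch.randn(2, 5, d_model)

# Post-norm, as written in the paper: LayerNorm(x + Sublayer(x))
post = norm(x + dropout(sublayer(x)))

# Pre-norm, as in SublayerConnection: x + Sublayer(LayerNorm(x))
pre = x + dropout(sublayer(norm(x)))

print(post.shape, pre.shape)
```

With pre-norm the residual path stays unnormalized, which is why the `Encoder` below applies one final `LayerNorm` to the output of the stack.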

In [6]:
class Encoder(nn.Module):
  "Core encoder is a stack of N = 6 layers."
  def __init__(self, layer, N):
    super().__init__()
    self.layers = clones(layer, N)
    self.norm = LayerNorm(layer.size)

  def forward(self, x, mask):
    "Pass the input and mask through each layer"
    for layer in self.layers:
      x = layer(x, mask)
    return self.norm(x)

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

**Attention**

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention “Scaled Dot-Product Attention”. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

\begin{equation}
\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
\end{equation}

In [7]:
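def attention(Q, K, V, mask=None, dropout=None):
  "Scaled Dot-Product Attention"
  d_k = Q.size(-1) # dimension d_k
  scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k) # compute dot products and divide each by sqrt(d_k)
  if mask is not None:
    scores = scores.masked_fill(mask == 0, -1e9) # masked positions get ~0 weight after softmax
  p_attn = scores.softmax(dim=-1) # apply softmax
  if dropout is not None:
    p_attn = dropout(p_attn) # dropout on the attention weights
  return torch.matmul(p_attn, V), p_attn # final multiplication with V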
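
A small usage sketch of scaled dot-product attention with toy tensors. The function is restated with an optional `mask` argument, which is an assumption based on how `MultiHeadAttention` calls it further down. The tensor sizes and the lower-triangular mask are illustrative choices.

```python
import math

import torch


def attention(Q, K, V, mask=None):
    "Scaled dot-product attention with an optional mask (sketch)"
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a very negative score, so softmax ~ 0.
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    return torch.matmul(p_attn, V), p_attn


torch.manual_seed(0)
Q = torch.randn(1, 4, 8)   # (batch, queries, d_k); sizes are illustrative
K = torch.randn(1, 4, 8)   # (batch, keys, d_k)
V = torch.randn(1, 4, 6)   # (batch, keys, d_v)

# Lower-triangular mask: query i may only attend to keys 0..i.
mask = torch.tril(torch.ones(1, 4, 4))

out, p_attn = attention(Q, K, V, mask=mask)
print(out.shape)           # one d_v-dimensional output per query
print(p_attn.sum(dim=-1))  # each row of weights sums to 1
```

Each output row is a weighted average of the rows of `V`, with weights taken from the corresponding row of `p_attn`.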

**Multi-Head Attention**

Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

\begin{align}
\operatorname{MultiHead}(Q, K, V) &= \operatorname{Concat}\left(\operatorname{head}_1, \ldots, \operatorname{head}_h\right) W^O \\
\text{where } \operatorname{head}_i &= \operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)
\end{align}

Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.

In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{\text{model}} / h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
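
The dimension bookkeeping above can be sketched with plain tensor reshapes. The batch size and sequence length are illustrative. The parameter counts only compare the $h$ per-head projection matrices against a single full-size one, which is one rough way to read the "similar cost" remark.

```python
import torch

h, d_model = 8, 512
d_k = d_model // h                 # 64, as in the paper
nbatches, seq_len = 2, 10          # illustrative sizes

x = torch.randn(nbatches, seq_len, d_model)

# Split the model dimension into h heads of size d_k: (batch, h, seq, d_k)
heads = x.view(nbatches, seq_len, h, d_k).transpose(1, 2)

# Undo the split: back to (batch, seq, h * d_k) == (batch, seq, d_model)
merged = heads.transpose(1, 2).contiguous().view(nbatches, seq_len, h * d_k)

# h projections of shape (d_model, d_k) hold as many weights as one (d_model, d_model)
per_head_params = h * d_model * d_k
single_head_params = d_model * d_model

print(d_k, heads.shape, torch.equal(merged, x))
print(per_head_params, single_head_params)
```

This is also why a single `nn.Linear(d_model, d_model)` followed by a `view` can stand in for all $h$ per-head projections at once.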

Figure 2:

<img src="https://miro.medium.com/max/640/1*LpDpZojgoKTPBBt8wdC4nQ.png" height=400 width=700 alt="attention"/> 

In [38]:
class MultiHeadAttention(nn.Module):
  def __init__(self, h, d_model, dropout=0.1):
    "Take in model size and number of heads"
    super().__init__()
    assert d_model % h == 0 # d_k = dmodel/h = 64
    self.d_k = d_model // h
    self.h = h # = 8
    self.linears = clones(nn.Linear(d_model, d_model), 4)
    self.attn = None
    self.dropout = nn.Dropout(p=dropout) # dropout passed to attention() and applied to the attention weights

  def forward(self, Q, K, V, mask=None):
    "Implements figure 2"
    if mask is not None:
      #Same mask applied to all h heads
      mask = mask.unsqueeze(1)
    nbatches = Q.size(0)

    # 1) Do all the linear projections in batch from d_model => h x d_k
    Q, K, V = [
        lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        for lin, x in zip(self.linears, (Q, K, V))]

    # 2) Apply attention on all the projected vectors in batch
    x, self.attn = attention( Q, K, V, mask=mask, dropout=self.dropout)

    # 3) Concat using a view and apply a final linear
    x = (x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)) # contiguous() lays the tensor out contiguously in memory, which view() needs after transpose
    del Q
    del K
    del V
    return self.linears[-1](x)

**Position-wise Feed-Forward Networks**


In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

\begin{equation}
\operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2
\end{equation}

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{model} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.
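
The "two convolutions with kernel size 1" remark can be checked numerically. As a minimal sketch with illustrative sizes, the code below copies the weights of an `nn.Linear` into an `nn.Conv1d` with `kernel_size=1` and compares the outputs at every position.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, d_ff = 8, 32            # illustrative; the paper uses 512 and 2048

linear = nn.Linear(d_model, d_ff)
conv = nn.Conv1d(d_model, d_ff, kernel_size=1)

# Copy the linear weights into the convolution:
# (d_ff, d_model) -> (d_ff, d_model, 1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(2, 5, d_model)   # (batch, positions, d_model)

out_linear = linear(x)
# Conv1d expects (batch, channels, positions), hence the transposes.
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)

print(torch.allclose(out_linear, out_conv, atol=1e-5))
```

Both views say the same thing: one set of weights is applied independently to the vector at each position.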

In [None]:
class PositionwiseFFN(nn.Module):
  "Implements FFN equation"
  def __init__(self, d_model, d_ff, dropout=0.1):
    super().__init__()
    self.w_1 = nn.Linear(d_model, d_ff)
    self.w_2 = nn.Linear(d_ff, d_model)
    self.dropout = nn.Dropout(dropout)

  def forward(self, x):
    return self.w_2(self.dropout(self.w_1(x).relu()))

**Embeddings and Softmax**

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to ([cite](https://arxiv.org/abs/1608.05859)). In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.
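
A minimal sketch of the two ideas in this paragraph: scaling the embeddings by $\sqrt{d_{model}}$ and sharing the weight matrix with the pre-softmax linear layer. The names `Embeddings`, `lut` and `generator` are hypothetical and do not come from this notebook. The sizes are illustrative.

```python
import math

import torch
from torch import nn


class Embeddings(nn.Module):
    "Token embedding scaled by sqrt(d_model) (sketch)"
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)


d_model, vocab = 16, 10          # illustrative sizes
emb = Embeddings(d_model, vocab)

# Pre-softmax projection that shares its weight matrix with the embedding.
generator = nn.Linear(d_model, vocab, bias=False)
generator.weight = emb.lut.weight   # both have shape (vocab, d_model)

tokens = torch.LongTensor([[1, 2, 5]])
vectors = emb(tokens)               # (1, 3, d_model)
logits = generator(vectors)         # (1, 3, vocab)
print(vectors.shape, logits.shape)
```

Because `nn.Linear(d_model, vocab)` stores its weight as `(vocab, d_model)`, it has the same shape as the embedding table and can be tied to it without a transpose.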

In [47]:
embedding = nn.Embedding(10, 3, padding_idx=0)
input = torch.LongTensor([[0,2,0,5]])
embedding(input)

tensor([[[ 0.0000,  0.0000,  0.0000],
         [-1.7401, -1.5656, -0.3643],
         [ 0.0000,  0.0000,  0.0000],
         [ 1.5436, -0.6005,  0.8174]]], grad_fn=<EmbeddingBackward0>)

In [None]:
class EncoderLayer(nn.Module):
  "Encoder is made up of self-attn and feed forward"
  def __init__(self, size, self_attn, feed_forward, dropout):
    super().__init__()
    self.self_attn = self_attn
    self.feed_forward = feed_forward
    self.sublayer = clones(SublayerConnection(size, dropout), 2)
    self.size = size

  def forward(self, x, mask):
    x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
    return self.sublayer[1](x, self.feed_forward)

The decoder is also composed of a stack of $N=6$ identical layers.

In [None]:
class Decoder(nn.Module):
  "Generic N layer decoder with masking"
  def __init__(self, layer, N):
    super().__init__()
    self.layers = clones(layer, N)
    self.norm = LayerNorm(layer.size)

  def forward(self, x, memory, src_mask, tgt_mask):
    for layer in self.layers:
      x = layer(x, memory, src_mask, tgt_mask)
    return self.norm(x)
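
The notebook does not show how `tgt_mask` is built. A common choice, assumed here, is a lower-triangular mask so that each target position can only attend to itself and earlier positions. The helper name `subsequent_mask` is hypothetical.

```python
import torch


def subsequent_mask(size):
    "Mask out subsequent positions (hypothetical helper)"
    # True where attention is allowed: position i may see positions 0..i.
    return torch.tril(torch.ones(1, size, size)).bool()


tgt_mask = subsequent_mask(4)
print(tgt_mask[0].int())
```

The leading dimension of size 1 lets the same mask broadcast over every example in a batch.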