<a href="https://colab.research.google.com/github/DanteNoguez/modelos/blob/main/notebooks/attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
from torch import nn
import math
import torch.nn.functional as F
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
import copy

The encoder is composed of a stack of $N = 6$ layers.

In [None]:
def clones(module, N):
  "Produce N identical layers"
  return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
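
A quick check of what `clones` produces may help here. This is a minimal sketch, assuming a small `nn.Linear` as the module to copy (an illustrative choice, not part of the original model): each clone starts with identical weights but owns its own parameters, so the copies can diverge during training.

```python
import copy
import torch
from torch import nn

def clones(module, N):
  "Produce N identical layers"
  return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

# nn.Linear(4, 4) is only an illustrative module to copy
layers = clones(nn.Linear(4, 4), 3)

# Every clone starts with the same weight values...
same_values = torch.equal(layers[0].weight, layers[1].weight)
# ...but lives in its own memory, so the clones do not share parameters.
shared_storage = layers[0].weight.data_ptr() == layers[1].weight.data_ptr()

with torch.no_grad():
  layers[0].weight.add_(1.0)  # modify only the first clone

# After the in-place update, the first clone no longer matches the second.
still_equal = torch.equal(layers[0].weight, layers[1].weight)

print(same_values, shared_storage, still_equal)
```

Using `copy.deepcopy` rather than repeating the same object matters: a list like `[module] * N` would register one set of weights N times.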

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $LayerNorm(x + Sublayer(x))$, where $Sublayer(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$.


Layer Normalization looks like this:

$\begin{aligned} L N(\mathbf{z} ; \boldsymbol{\alpha}, \boldsymbol{\beta}) &=\frac{(\mathbf{z}-\mu)}{\sigma} \odot \boldsymbol{\alpha}+\boldsymbol{\beta} \\ \mu=\frac{1}{D} \sum_{i=1}^D z_i, \quad \sigma &=\sqrt{\frac{1}{D} \sum_{i=1}^D\left(z_i-\mu\right)^2} \end{aligned}$

$\alpha$ = gain vector, initialized to ones (multiplicative) \\
$\beta$ = bias vector, initialized to zeros (additive) \\
$\mu$ = mean \\
$\sigma$ = standard deviation

We will do it without the alpha and beta vectors:

In [None]:
# Manual implementation of LayerNorm

x = torch.arange(0,10,1).float().view(2,5)
D = len(x.view(-1))
x_flat = x.view(-1)
eps = 1e-5

mu = x_flat.sum() / D                          # mean over all elements
sigma = (((x_flat - mu)**2).sum() / D).sqrt()  # biased (population) standard deviation
std = np.sqrt(np.mean(abs((x.numpy() - x.numpy().mean())**2))) # this is how numpy does it

# Implementation from the paper:

#STD = x.std(-1, keepdim=True)
#MEAN = x.mean(-1, keepdim=True)
#ALFA = nn.Parameter(torch.ones(2, 5))
#BETA = nn.Parameter(torch.zeros(2, 5))
#LANO = ALFA * (x - MEAN) / (STD + eps) + BETA

LN = ((x_flat - mu)/sigma).view(x.shape)

LayerNorma = torch.nn.LayerNorm(x.shape, elementwise_affine=True)
torch_layernorm = LayerNorma(x)

# tinygrad implementation:
#y = x - x.mean(axis=-1, keepdim=True)
#NLN = y.div((y*y).mean(axis=-1, keepdim=True).add(eps).sqrt())

#print(f'mine {sigma}, pytorch {torch.std(x, unbiased=False)}, numpy {np.std(x.numpy())}, numpy replica {std}')
#print(mu, torch.mean(x), np.mean(x.numpy()))
#print(LN)
#print(torch_layernorm)
#print(STD, sigma, torch.std(x))

In [None]:
class LayerNorm(nn.Module):
  "Construct a Layer Normalization module"
  def __init__(self, features, eps=1e-6):
    super().__init__()
    self.a_2 = nn.Parameter(torch.ones(features))   # gain (alpha), multiplicative
    self.b_2 = nn.Parameter(torch.zeros(features))  # bias (beta), additive
    self.eps = eps

  def forward(self, x):
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
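
The class above calls `x.std`, which in PyTorch defaults to the unbiased (Bessel-corrected) estimator, and adds `eps` outside the square root. `torch.nn.LayerNorm` instead uses the biased variance with `eps` inside the root, so the two results differ by a factor of roughly $\sqrt{(D-1)/D}$. The following sketch compares both conventions on a random tensor (the shape and `eps` value are arbitrary choices for illustration):

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
D = 8
x = torch.randn(2, D)
eps = 1e-6

mean = x.mean(-1, keepdim=True)

# Convention of the class above: unbiased std, eps added outside the root
ln_unbiased = (x - mean) / (x.std(-1, keepdim=True) + eps)

# Convention of torch.nn.LayerNorm: biased variance, eps inside the root
var_biased = x.var(-1, unbiased=False, keepdim=True)
ln_biased = (x - mean) / torch.sqrt(var_biased + eps)

# A fresh nn.LayerNorm has gain 1 and bias 0, so it is a pure normalization
reference = nn.LayerNorm(D, eps=eps)(x)

matches_biased = torch.allclose(ln_biased, reference, atol=1e-5)
matches_unbiased = torch.allclose(ln_unbiased, reference, atol=1e-5)

# Rescaling by sqrt(D / (D - 1)) recovers the biased result (up to eps effects)
rescaled = ln_unbiased * math.sqrt(D / (D - 1))
matches_after_rescale = torch.allclose(rescaled, ln_biased, atol=1e-4)

print(matches_biased, matches_unbiased, matches_after_rescale)
```

For $d_{model} = 512$ the factor is very close to 1, so the difference is small in practice, but it is enough to make an exact comparison against `torch.nn.LayerNorm` fail.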

The output of each sub-layer is $LayerNorm(x + Sublayer(x))$, where $Sublayer(x)$ is the function implemented by the sub-layer itself.

In [None]:
class SublayerConnection(nn.Module):
  """A residual connection followed by a LayerNorm.
  We put the norm first instead of at the end."""
  def __init__(self, size, dropout):
    super().__init__()
    self.norm = LayerNorm(size)
    self.dropout = nn.Dropout(dropout)

  def forward(self, x, sublayer):
    "Apply the residual connection to any sublayer"
    return x + self.dropout(sublayer(self.norm(x)))
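
Note that the quoted text describes a post-norm arrangement, $LayerNorm(x + Sublayer(x))$, while the class above normalizes the input first and adds the residual afterwards. The sketch below contrasts the two orderings. It assumes `nn.LayerNorm` in place of the custom class, uses a plain `nn.Linear` as a hypothetical stand-in for the sublayer, and omits dropout:

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 16                            # small size chosen for illustration
x = torch.randn(2, 5, d_model)          # (batch, sequence, features)
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # hypothetical stand-in for attention or feed-forward

# Post-norm, as written in the quoted text: LayerNorm(x + Sublayer(x))
post_norm = norm(x + sublayer(x))

# Pre-norm, as in the class above (dropout omitted): x + Sublayer(LayerNorm(x))
pre_norm = x + sublayer(norm(x))

same_shape = post_norm.shape == pre_norm.shape
same_values = torch.allclose(post_norm, pre_norm)
print(same_shape, same_values)
```

Both orderings preserve the input shape, which is what lets the residual sum work, but they produce different values: the post-norm output is normalized, whereas the pre-norm output keeps the unnormalized residual stream.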

In [None]:
class Encoder(nn.Module):
  "Core encoder is a stack of N = 6 layers."
  def __init__(self, layer, N):
    super().__init__()
    self.layers = clones(layer, N)
    self.norm = LayerNorm(layer.size)

  def forward(self, x, mask):
    "Pass the input and mask through each layer"
    for layer in self.layers:
      x = layer(x, mask)
    return self.norm(x)
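
To illustrate the contract `Encoder` expects from its layers, here is a minimal sketch with a hypothetical `ToyLayer` (not part of the original model). It assumes only what the class above relies on: the layer exposes a `size` attribute and its `forward` accepts `(x, mask)`. `nn.LayerNorm` stands in for the custom `LayerNorm` to keep the example self-contained.

```python
import copy
import torch
from torch import nn

class ToyLayer(nn.Module):
  "Hypothetical stand-in layer: exposes `size` and ignores the mask."
  def __init__(self, size):
    super().__init__()
    self.size = size
    self.proj = nn.Linear(size, size)

  def forward(self, x, mask):
    return x + self.proj(x)

class Encoder(nn.Module):
  "Same structure as above, with nn.LayerNorm as a stand-in for the custom LayerNorm."
  def __init__(self, layer, N):
    super().__init__()
    self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(N)])
    self.norm = nn.LayerNorm(layer.size)

  def forward(self, x, mask):
    for layer in self.layers:
      x = layer(x, mask)
    return self.norm(x)

encoder = Encoder(ToyLayer(512), N=6)
x = torch.randn(2, 10, 512)  # (batch, sequence, d_model)
out = encoder(x, mask=None)  # the toy layer does not use the mask

# Each of the 6 clones contributes its own weights, plus the final norm
n_params = sum(p.numel() for p in encoder.parameters())
print(out.shape, len(encoder.layers), n_params)
```

The output keeps the input shape `(batch, sequence, d_model)`, which is what allows identical layers to be stacked.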

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.