# 11.6 Self-Attention and Positional Encoding

## Author: Isaac Reyes


In deep learning, we often use CNNs or RNNs to encode sequences. Now, with attention mechanisms in mind, imagine feeding a sequence of tokens into an attention mechanism such that at every step, each token has its own query, key, and value.
When computing a token's representation at the next layer, the token can attend (via its query vector) to any other token (matching based on their key vectors). Using the full set of query-key compatibility scores, we can compute,
for each token, a representation by building the appropriate weighted sum over the other tokens. Because every token attends to every other token (unlike the case where decoder steps attend to encoder steps), such architectures are typically described as self-attention
models (Lin et al., 2017; Vaswani et al., 2017), and elsewhere as intra-attention models (Cheng et al., 2016; Parikh et al., 2016; Paulus et al., 2017). In this section, we will discuss sequence encoding using self-attention, including the use of additional information
about the sequence order.
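
To make the weighted-sum idea concrete, here is a minimal sketch in plain Perl (no MXNet). It assumes that queries, keys, and values are the token vectors themselves, with no learned projections, and that compatibility is a scaled dot product; the helper names `dot`, `softmax`, and `self_attention` are hypothetical and are not part of `d2l`.

```perl
use strict;
use warnings;
use List::Util qw(sum max);

# Dot product of two equal-length vectors (array refs).
sub dot {
    my ($u, $v) = @_;
    return sum(map { $u->[$_] * $v->[$_] } 0 .. $#$u);
}

# Numerically stable softmax over a list of scores.
sub softmax {
    my @scores = @_;
    my $max    = max(@scores);
    my @exp    = map { exp($_ - $max) } @scores;
    my $total  = sum(@exp);
    return map { $_ / $total } @exp;
}

# Hypothetical helper: every token acts as query, key, and value
# (an assumption made for illustration; real layers use learned projections).
sub self_attention {
    my ($tokens) = @_;
    my $d = scalar @{ $tokens->[0] };
    my @outputs;
    for my $query (@$tokens) {
        # Query-key compatibility scores, normalized into attention weights.
        my @weights = softmax(map { dot($query, $_) / sqrt($d) } @$tokens);
        # Weighted sum of the value vectors.
        my @out = (0) x $d;
        for my $j (0 .. $#$tokens) {
            for my $k (0 .. $d - 1) {
                $out[$k] += $weights[$j] * $tokens->[$j][$k];
            }
        }
        push @outputs, \@out;
    }
    return \@outputs;
}

my $tokens = [[1, 0], [0, 1], [1, 1]];
my $out    = self_attention($tokens);
for my $i (0 .. $#$out) {
    printf "token %d -> [%s]\n", $i, join(', ', map { sprintf('%.3f', $_) } @{ $out->[$i] });
}
```

The output has one vector per input token, with the same dimension as the input, which is the shape-preserving property used in the next subsection.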

We import the libraries and everything else we need:


In [1]:
use strict; 
use warnings; 
use Data::Dump qw(dump); 
use List::Util qw(zip min max sum);
use d2l; 
IPerl->load_plugin('Chart::Plotly'); 

### 11.6.1. Self-Attention

Using multi-head attention, the following code snippet computes the self-attention of a tensor with shape
(batch size, number of time steps or sequence length in tokens, $d$). The output tensor has the same shape.

In [2]:
my ($num_hiddens, $num_heads) = (100, 5);
my $attention = new d2l::MultiHeadAttention($num_hiddens, $num_heads, 0.5);
$attention->initialize();
my ($batch_size, $num_queries, $valid_lens) = (2, 4, mx->nd->array([3, 2]));
my $X = mx->nd->ones([$batch_size, $num_queries, $num_hiddens]);
d2l->check_shape($attention->forward($X, $X, $X, $valid_lens), [$batch_size, $num_queries, $num_hiddens]);

1

### 11.6.2. Comparing CNNs, RNNs, and Self-Attention

Let's compare architectures for mapping a sequence of $n$ tokens to another sequence of equal length, where each input or output token is represented by a $d$-dimensional vector. Specifically, we will consider CNNs, RNNs, and self-attention. We will compare their computational complexity, sequential operations, and maximum path lengths. Note that sequential operations prevent parallel computation, while a shorter path between any combination of sequence positions makes it easier to learn long-range dependencies within the sequence.
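
As a rough illustration of this comparison, the sketch below tabulates the commonly quoted leading-order estimates for a single layer: $\mathcal{O}(knd^2)$ computation for a CNN with kernel size $k$, $\mathcal{O}(nd^2)$ for an RNN, and $\mathcal{O}(n^2d)$ for self-attention, together with the number of sequential operations and the maximum path length. Constants are dropped, and the helper `layer_costs` and the example values of `n`, `d`, and `k` are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: leading-order estimates (constants dropped) for one
# layer over $n tokens of dimension $d; $k is the assumed CNN kernel size.
sub layer_costs {
    my ($n, $d, $k) = @_;
    my %costs = (
        cnn            => { compute => $k * $n * $d**2, sequential => 1,  max_path => $n / $k },
        rnn            => { compute => $n * $d**2,      sequential => $n, max_path => $n },
        self_attention => { compute => $n**2 * $d,      sequential => 1,  max_path => 1 },
    );
    return \%costs;
}

my $costs = layer_costs(1000, 512, 3);
for my $name (qw(cnn rnn self_attention)) {
    my $c = $costs->{$name};
    printf "%-15s compute=%.2e sequential=%d max_path=%.0f\n",
        $name, @{$c}{qw(compute sequential max_path)};
}
```

Under these estimates, self-attention has the shortest maximum path and no sequential bottleneck, but its computation grows quadratically with the sequence length, so it becomes more expensive than an RNN layer once $n$ exceeds $d$.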

### 11.6.3. Positional Encoding

In [3]:
package  PositionalEncoding{
      use base qw(AI::MXNet::Gluon::Block); #@save
      use List::Util qw(zip);
      sub new {
        my ($class, $num_hiddens,$dropout,%args) = (splice(@_,0,3), d2l->get_arguments(max_len=>1000,\@_));
        my  $self = $class->SUPER::new();     
        $self->{dropout} = mx->gluon->nn->Dropout($dropout);
        map {$self->register_child($self->{$_})} ('dropout');
          # Angles i / 10000^(2j/d): one row per position i, one column per sine/cosine pair j.
          my $X = mx->nd->arange(stop=>$args{max_len})->reshape([-1,1]) /
            (10000 ** (mx->nd->arange(start=>0, stop=>$num_hiddens, step=>2) / $num_hiddens));
          $self->{P} = mx->nd->concat(
            (map { $_->expand_dims(1) } 
                map { @$_ } 
                    zip \@{mx->nd->sin($X)->T}, \@{mx->nd->cos($X)->T}
            ),
            dim => 1
          )->expand_dims(0);

        return bless ($self, $class);
    }
    sub forward {
        my ($self, $X) = @_;
        # Add the positional encoding of the first $X->shape->[1] positions to the input.
        $X = $X + $self->{P}->slice('X', [0, $X->shape->[1] -1], 'X')->as_in_context($X->context);

        return $self->{dropout}->($X);
    }
1;
}       

1

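In the positional encoding matrix, rows correspond to positions and columns to encoding dimensions. The offset between columns 6 and 7 (and likewise between columns 8 and 9) is due to the alternation of sine and cosine functions.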


In [4]:
my ($encoding_dim ,$num_steps) = (32 , 60);
my $pos_encoding = new PositionalEncoding($encoding_dim, 0);
$pos_encoding->initialize();
my $X = $pos_encoding->forward(mx->nd->zeros([1, $num_steps,$encoding_dim]));
my $P = $pos_encoding->{P}->slice('X', [0, $X->shape->[1] -1], 'X');

<AI::MXNet::NDArray 1x60x32 @cpu(0)>

In [5]:
# Plot columns 6 through 9:
my $squeeze_p = $P->slice(0, 'X', [6, 9])->squeeze(0);
d2l->plot(mx->nd->arange(stop => $num_steps), $squeeze_p->T, 
          xlabel => 'Row (position)', figsize => [6, 2.5],
          legend => [map { "Col $_" } 6..9]);
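
The alternation of sine and cosine can also be checked numerically without MXNet. The sketch below assumes the usual sinusoidal form, in which even columns hold a sine and odd columns hold a cosine of the same angle; the helper `pos_encoding` is hypothetical, and the default base of 10000 is an assumption taken from Vaswani et al. (2017).

```perl
use strict;
use warnings;

# Hypothetical helper: encoding entry for position $i and column $col.
# Even columns use sine, odd columns use cosine at the same frequency.
sub pos_encoding {
    my ($i, $col, $d, $base) = @_;
    $base //= 10000;    # assumed base, as in Vaswani et al. (2017)
    my $pair  = int($col / 2);
    my $angle = $i / $base ** (2 * $pair / $d);
    return $col % 2 ? cos($angle) : sin($angle);
}

my $d = 32;
for my $i (0 .. 3) {
    my @cells = map { sprintf('col%d=% .4f', $_, pos_encoding($i, $_, $d)) } 6 .. 9;
    printf "pos %d: %s\n", $i, join('  ', @cells);
}
```

Columns 6 and 7 share one frequency and columns 8 and 9 share a lower one, so within each pair the two curves are the same wave shifted by a quarter period.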

#### 11.6.3.1. Absolute Positional Information

In [6]:
for my $i (0..7) {
    printf "%d in binary is %03b\n", $i, $i;
}
my $P_expanded = $P->slice([0])->expand_dims(axis=>0);


0 in binary is 000
1 in binary is 001
2 in binary is 010
3 in binary is 011
4 in binary is 100
5 in binary is 101
6 in binary is 110
7 in binary is 111


<AI::MXNet::NDArray 1x1x60x32 @cpu(0)>

In [7]:
d2l->show_heatmaps(
    $P_expanded,
    xlabel => 'Column (encoding dimension)',
    ylabel => 'Row (position)',
    figsize => [3.5, 4],
    cmap => 'Purples'
);

#### 11.6.3.2. Relative Positional Information

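Besides capturing absolute positional information, the above positional encoding also allows a model to easily learn to attend by relative positions. This is because for any fixed position offset $\delta$, the positional encoding at position $i + \delta$ can be represented by a linear projection of that at position $i$.

Writing $\omega_j$ for the angular frequency of the $j$-th sine/cosine pair, so that $(p_{i, 2j}, p_{i, 2j+1}) = (\sin(i \omega_j), \cos(i \omega_j))$, the angle-addition identities give

$$
\begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \end{bmatrix}
\begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \end{bmatrix}
=
\begin{bmatrix} \sin\left((i + \delta) \omega_j\right) \\ \cos\left((i + \delta) \omega_j\right) \end{bmatrix}
=
\begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \end{bmatrix},
$$

where the $2 \times 2$ projection matrix does not depend on any position index $i$.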
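
The claim can be verified numerically with a short sketch. It assumes a single sine/cosine pair with an arbitrary angular frequency `$omega` and an arbitrary offset `$delta`; the helpers `pair_at` and `shift_pair` are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# Assumed setup: one sine/cosine pair with angular frequency $omega,
# i.e. the pair at position $i is (sin($i * $omega), cos($i * $omega)).
sub pair_at {
    my ($i, $omega) = @_;
    return (sin($i * $omega), cos($i * $omega));
}

# Apply the 2x2 projection for offset $delta. It uses only $delta and
# $omega, never the position index itself.
sub shift_pair {
    my ($s, $c, $delta, $omega) = @_;
    my ($cd, $sd) = (cos($delta * $omega), sin($delta * $omega));
    return ($cd * $s + $sd * $c, -$sd * $s + $cd * $c);
}

my ($omega, $delta) = (0.1, 5);    # arbitrary illustrative values
my $max_err = 0;
for my $i (0 .. 59) {
    my @projected = shift_pair(pair_at($i, $omega), $delta, $omega);
    my @direct    = pair_at($i + $delta, $omega);
    for my $k (0, 1) {
        my $err = abs($projected[$k] - $direct[$k]);
        $max_err = $err if $err > $max_err;
    }
}
printf "max abs error over 60 positions: %.2e\n", $max_err;
```

The same matrix maps every position to the one `$delta` steps later, so the reported error stays at the level of floating-point rounding.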

### Summary

### Exercises 

Not to be completed.