# Overview

Please see the [Encoder In Transformers architecture](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture) and [Decoder In Transformers architecture](https://www.kaggle.com/code/aisuko/decoder-in-transformers-architecture) notebooks to become familiar with **self-attention** in the transformer architecture.

In this notebook, we focus on the scaled dot-product attention mechanism (referred to as self-attention), which remains the most popular and most widely used attention mechanism in practice. Other types of attention mechanisms exist; see the reviews [2020 Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732) and [2023 A Survey on Efficient Training of Transformers](https://arxiv.org/abs/2302.01107), and the [FlashAttention](https://arxiv.org/abs/2205.14135) paper.

# Embedding an Input Sentence

From the "Encoder in Transformers architecture" notebook, we know that the first steps are tokenizing the input and embedding the tokens.

In [1]:
inputs="According to the news, it it hard to say Melbourne is safe now"

input_ids={s:i for i,s in enumerate(sorted(inputs.replace(',','').split()))}
input_ids

{'According': 0,
 'Melbourne': 1,
 'hard': 2,
 'is': 3,
 'it': 5,
 'news': 6,
 'now': 7,
 'safe': 8,
 'say': 9,
 'the': 10,
 'to': 12}
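
The dictionary above skips the indices 4 and 11. A likely explanation is that `it` and `to` each appear twice in the sentence, so `enumerate` counts both copies while the dictionary keeps only the last index per word. The sketch below reproduces this and compares it with a deduplicated vocabulary; the names `vocab_with_gaps` and `vocab_dense` are illustrative and not part of the notebook, and the `set`-based variant is an alternative shown for contrast, not the notebook's method.

```python
inputs = "According to the news, it it hard to say Melbourne is safe now"
words = inputs.replace(",", "").split()

# Notebook approach: enumerate over the sorted word list, duplicates included.
# A repeated word is overwritten, so it keeps the index of its last occurrence.
vocab_with_gaps = {s: i for i, s in enumerate(sorted(words))}

# Alternative (an assumption for illustration): deduplicate before indexing,
# which yields contiguous indices 0..10.
vocab_dense = {s: i for i, s in enumerate(sorted(set(words)))}

print(len(words), len(vocab_with_gaps), max(vocab_with_gaps.values()))  # 13 11 12
print(len(vocab_dense), max(vocab_dense.values()))                      # 11 10
```

With the notebook's approach the largest index is 12, so an embedding table needs at least 13 rows even though only 11 distinct words exist.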

Next, let's convert the sentence into tokens by replacing each word with its integer index.

In [2]:
import torch

input_tokens=torch.tensor([input_ids[s] for s in inputs.replace(',','').split()])
input_tokens

tensor([ 0, 12, 10,  6,  5,  5,  2, 12,  9,  1,  3,  8,  7])

Now, using the integer-vector representation of the input sentence, we can use an embedding layer to **encode the inputs** into real-valued vector embeddings. Here, we use a 16-dimensional embedding, so each input word is represented by a 16-dimensional vector. Since the sentence consists of 13 words, this results in a 13x16 embedding matrix.

In [3]:
# fix the random seed so the results are reproducible
torch.manual_seed(123)
embed=torch.nn.Embedding(13,16)
embedded_sentence=embed(input_tokens).detach()
embedded_sentence

tensor([[ 3.3737e-01, -1.7778e-01, -3.0353e-01, -5.8801e-01,  3.4861e-01,
          6.6034e-01, -2.1964e-01, -3.7917e-01,  7.6711e-01, -1.1925e+00,
          6.9835e-01, -1.4097e+00,  1.7938e-01,  1.8951e+00,  4.9545e-01,
          2.6920e-01],
        [-9.7969e-01, -2.1126e+00, -2.7214e-01, -3.5100e-01,  1.1152e+00,
         -6.1722e-01, -2.2708e+00, -1.3819e+00,  1.1721e+00, -4.3716e-01,
         -4.0527e-01,  7.0864e-01,  9.5331e-01, -1.3035e-02, -1.3009e-01,
         -8.7660e-02],
        [ 6.8508e-01,  2.0024e+00, -5.4688e-01,  1.6014e+00, -2.2577e+00,
         -1.8009e+00,  7.0147e-01,  5.7028e-01, -1.1766e+00, -2.0524e+00,
          1.1318e-01,  1.4353e+00,  8.8307e-02, -1.2037e+00,  1.0964e+00,
          2.4210e+00],
        [-2.2150e+00, -1.3193e+00, -2.0915e+00,  9.6285e-01, -3.1861e-02,
         -4.7896e-01,  7.6681e-01,  2.7468e-02,  1.9929e+00,  1.3708e+00,
         -5.0087e-01, -2.7928e-01, -2.0628e+00,  6.3745e-03, -9.8955e-01,
          7.0161e-01],
        [ 2.5529e-01

In [4]:
embedded_sentence.shape

torch.Size([13, 16])
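
To make the embedding step more concrete, here is a minimal sketch showing that, for a plain `torch.nn.Embedding` with default options, the layer behaves like a lookup table: each token index selects one row of the weight matrix. The token tensor is copied from the output above; the names `manual_lookup`, `same`, and `repeated_rows_match` are illustrative only.

```python
import torch

torch.manual_seed(123)
embed = torch.nn.Embedding(13, 16)  # 13 rows (indices 0..12), 16 dimensions each

# Token indices of the sentence, as printed earlier in the notebook.
input_tokens = torch.tensor([0, 12, 10, 6, 5, 5, 2, 12, 9, 1, 3, 8, 7])

# Forward pass through the embedding layer.
embedded_sentence = embed(input_tokens).detach()

# Equivalent manual lookup: index the weight matrix by token id.
manual_lookup = embed.weight[input_tokens].detach()

same = torch.equal(embedded_sentence, manual_lookup)

# Positions 1 and 7 both hold token 12 ("to"), so they share the same vector.
repeated_rows_match = torch.equal(embedded_sentence[1], embedded_sentence[7])

print(same, repeated_rows_match, embedded_sentence.shape)
```

Because the lookup depends only on the token index, repeated words receive identical vectors at this stage; nothing in the embedding itself encodes word position.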