# Embedding an Input Sentence

For simplicity, the dictionary `dc` is restricted to the words that occur in the input sentence. In real-world applications the vocabulary would be much larger.

In [1]:
sentence = 'Life is short, eat dessert first'

dc = {s:i for i, s in enumerate(sorted(sentence.replace(',', '').split()))}
dc #This is the dictionary (vocabulary)

{'Life': 0, 'dessert': 1, 'eat': 2, 'first': 3, 'is': 4, 'short': 5}

Next, convert the sentence into a sequence of integers by looking up each word in `dc`.

In [8]:
import torch

sentence_int = torch.tensor([dc[s] for s in sentence.replace(',', '').split()])
sentence_int

tensor([0, 4, 5, 2, 1, 3])
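
As a quick sanity check on the mapping above, the integer sequence can be decoded back into words by inverting the dictionary. This is only a sketch: `index_to_word` is a hypothetical helper name that does not appear in the original notebook, and plain Python lists are used instead of tensors to keep it dependency-free.

```python
sentence = 'Life is short, eat dessert first'
words = sentence.replace(',', '').split()

# Same vocabulary construction as in the notebook: sorted words -> indices.
dc = {s: i for i, s in enumerate(sorted(words))}

# Hypothetical helper (not in the original): index -> word lookup.
index_to_word = {i: s for s, i in dc.items()}

encoded = [dc[s] for s in words]               # words -> integers
decoded = [index_to_word[i] for i in encoded]  # integers -> words

print(encoded)
print(decoded)
```

If the round trip returns the original word list, every word in the sentence has a unique index in `dc`.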

An embedding layer can now encode the inputs: it maps the integer-vector representation of the input sentence to a real-valued embedding vector for each word.

In [9]:
torch.manual_seed(123)
embed = torch.nn.Embedding(6, 16) # Here 6 words, and each word is represented by a 16 dimensional vector
embedded_sentence = embed(sentence_int).detach() # detach() returns a tensor cut off from the computation graph, so no gradients are tracked for it
print(embedded_sentence)
print(embedded_sentence.shape)



tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196, -0.3792,
          0.7671, -1.1925,  0.6984, -1.4097,  0.1794,  1.8951,  0.4954,  0.2692],
        [ 0.5146,  0.9938, -0.2587, -1.0826, -0.0444,  1.6236, -2.3229,  1.0878,
          0.6716,  0.6933, -0.9487, -0.0765, -0.1526,  0.1167,  0.4403, -1.4465],
        [ 0.2553, -0.5496,  1.0042,  0.8272, -0.3948,  0.4892, -0.2168, -1.7472,
         -1.6025, -1.0764,  0.9031, -0.7218, -0.5951, -0.7112,  0.6230, -1.3729],
        [-1.3250,  0.1784, -2.1338,  1.0524, -0.3885, -0.9343, -0.4991, -1.0867,
          0.8805,  1.5542,  0.6266, -0.1755,  0.0983, -0.0935,  0.2662, -0.5850],
        [-0.0770, -1.0205, -0.1690,  0.9178,  1.5810,  1.3010,  1.2753, -0.2010,
          0.4965, -1.5723,  0.9666, -1.1481, -1.1589,  0.3255, -0.6315, -2.8400],
        [ 0.8768,  1.6221, -1.4779,  1.1331, -1.2203,  1.3139,  1.0533,  0.1388,
          2.2473, -0.8036, -0.2808,  0.7697, -0.6596, -0.7979,  0.1838,  0.2293]])
torch.Size([6, 16])
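
To make the lookup concrete, the sketch below checks that calling the embedding layer gives the same result as indexing rows of its weight matrix, and the same result as multiplying one-hot vectors by that matrix. The variable names `via_layer`, `via_indexing`, and `via_matmul` are illustrative and not part of the original notebook.

```python
import torch

torch.manual_seed(123)
embed = torch.nn.Embedding(6, 16)                # 6 words, 16 dimensions each
sentence_int = torch.tensor([0, 4, 5, 2, 1, 3])  # integer sequence from above

# 1) Regular call to the embedding layer.
via_layer = embed(sentence_int).detach()

# 2) Row lookup in the weight matrix (shape 6 x 16).
via_indexing = embed.weight[sentence_int].detach()

# 3) One-hot vectors (6 x 6) multiplied by the weight matrix.
one_hot = torch.nn.functional.one_hot(sentence_int, num_classes=6).float()
via_matmul = (one_hot @ embed.weight).detach()

print(torch.equal(via_layer, via_indexing))
print(torch.allclose(via_layer, via_matmul))
```

In other words, the embedding layer is a trainable lookup table: row `j` of `embed.weight` is the vector for the word with index `j`.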


# Weight Matrices

Self-attention uses three weight matrices, $W_q$, $W_k$, and $W_v$, to project the embedded sentence into query, key, and value vectors.

The query, key, and value sequences are obtained by matrix multiplication between the respective weight matrix $W$ and the embedded inputs $x$:

Query sequence: $q^{(i)} = W_q x^{(i)}$ for $i \in [1, T]$

Key sequence: $k^{(i)} = W_k x^{(i)}$ for $i \in [1, T]$

Value sequence: $v^{(i)} = W_v x^{(i)}$ for $i \in [1, T]$

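The index $i$ refers to the token position in the input sequence, which has length $T$.

Note the shapes of the projection matrices:

$W_q$ and $W_k$ have shape $d_k \times d$

$W_v$ has shape $d_v \times d$

$d$ = size of each word vector $x$ (here $d = 16$)

The query and key vectors must have the same number of elements ($d_q = d_k$) because their dot product is computed, while the value dimension $d_v$ can be chosen freely. For this code, $d_q = d_k = 24$ and $d_v = 28$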

In [10]:
torch.manual_seed(123)
d = embedded_sentence.shape[1]
d_q, d_k, d_v = 24, 24, 28

W_query = torch.nn.Parameter(torch.randn(d_q, d))
W_key = torch.nn.Parameter(torch.randn(d_k, d))
W_value = torch.nn.Parameter(torch.randn(d_v, d))
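
The projections defined by the formulas above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: `x` is a random stand-in for `embedded_sentence` with the same shape, and the weights are plain tensors rather than `torch.nn.Parameter` objects, so the numbers will not match the notebook's. The names `q_2`, `queries`, `keys`, and `values` are illustrative.

```python
import torch

torch.manual_seed(123)
T, d = 6, 16                   # sequence length and embedding size
d_q, d_k, d_v = 24, 24, 28

# Assumption: random stand-in for embedded_sentence, shape (T, d).
x = torch.randn(T, d)

W_query = torch.randn(d_q, d)
W_key = torch.randn(d_k, d)
W_value = torch.randn(d_v, d)

# Single token: q^(2) = W_q x^(2); the second token has Python index 1.
q_2 = W_query @ x[1]           # shape (d_q,)

# Whole sequence at once: each row i holds W x^(i).
queries = x @ W_query.T        # shape (T, d_q)
keys = x @ W_key.T             # shape (T, d_k)
values = x @ W_value.T         # shape (T, d_v)

print(q_2.shape)
print(queries.shape, keys.shape, values.shape)
```

Because the inputs are stored as rows of `x`, the per-token formula $W x^{(i)}$ becomes a multiplication by the transposed weight matrix when applied to the whole sequence.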