# Token Embedding

- Algorithm 1

![img](../assets/algorithm_1.png)

In Algorithm 1, token embedding takes an input $v$, which is a token ID.  
Its output is $e$, the vector representation of that token, looked up in $W_{e}$, the token embedding matrix.

## Token Embedding Matrix

Before implementing Algorithm 1, we need to build $W_{e}$, the token embedding matrix.  
By its definition, $W_{e}$ needs two arguments:
$$
W_{e}\in\mathbb{R}^{d_{e} \times N_{V}}
$$
- $d_{e}$ : dimension of the embedding vector
- $N_{V}$ : vocabulary size

In [None]:
def generate_token_embedding_matrix(d, v):
    ...
    return embed_matrix

In [None]:
embed_dimension = 10
num_vocab = 3
embed_matrix = generate_token_embedding_matrix(embed_dimension, num_vocab)
embed_matrix
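
The exercise above leaves the function body open. As a minimal sketch of one possible way to fill it in, the block below stores the matrix as a plain list of lists with `d` rows and `v` columns, matching $W_{e}\in\mathbb{R}^{d_{e} \times N_{V}}$. The uniform random initialization and the optional `seed` argument are assumptions made for illustration; the tutorial does not prescribe either.

```python
import random


def generate_token_embedding_matrix(d, v, seed=None):
    # Assumption: entries are drawn uniformly from [0, 1).
    # `seed` is a hypothetical extra argument that makes the result reproducible.
    rng = random.Random(seed)
    # d rows (embedding dimensions) by v columns (one column per token ID).
    return [[rng.random() for _ in range(v)] for _ in range(d)]


embed_matrix = generate_token_embedding_matrix(10, 3, seed=0)
print(len(embed_matrix), len(embed_matrix[0]))  # 10 3
```

With this layout, each column holds the embedding of one token ID, so looking up a token means reading one column.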

## Embed a token to a vector
Define a function that implements Algorithm 1.  
The function takes two arguments, `token_id` and `embed_matrix`.

In [None]:
def token_embedding(token_id, embed_matrix):
    ...
    return output_vector

In [None]:
sample_token_id = 1
sample_output_vector = token_embedding(sample_token_id, embed_matrix)
sample_output_vector
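
One possible way to write the lookup, assuming the matrix is a list of `d` rows with one column per token ID, is to read entry `token_id` from every row. This is a sketch rather than the tutorial's reference solution; the bounds check and the toy matrix values are assumptions added for illustration.

```python
def token_embedding(token_id, embed_matrix):
    # Column lookup e = W_e[:, v]: take entry `token_id` from every row.
    # Assumption: an out-of-range ID is an error (a negative index would
    # otherwise silently wrap around in a Python list).
    if not 0 <= token_id < len(embed_matrix[0]):
        raise IndexError("token_id is outside the vocabulary")
    return [row[token_id] for row in embed_matrix]


# Toy 2 x 3 matrix (d_e = 2, N_V = 3) with made-up values.
toy_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]
print(token_embedding(1, toy_matrix))  # [0.2, 0.5]
```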

## In a sentence

In practice, however, we do not map only a single token.
We need to map every token in a sentence.

A sample input looks like this:

```
sample_sentence = [0, 1, 2, 3, ...]
len(sample_sentence)
# n
```

* `n`: length of the sentence

In [None]:
sample_sentence = [0, 1, 2]
len(sample_sentence)

Define a function that maps every token in a sentence to its embedding vector.  
Use the `token_embedding` function defined earlier.

In [None]:
def sentence_token_embedding(sentence, embed_matrix):
    ...
    return embed_sentence

The embedded sentence should look like this:

```
embed_sentence = sentence_token_embedding(sentence, embed_matrix)
embed_sentence
# [
#     [0.xxx, 0.xxx, 0.xxx, ...],
#     [0.xxx, 0.xxx, 0.xxx, ...],
#     [0.xxx, 0.xxx, 0.xxx, ...],
#     ...
# ]
```

In [None]:
embed_token_sentence = sentence_token_embedding(sample_sentence, embed_matrix)
embed_token_sentence

The length of the embedded sentence is `n`, the same as the length of the sentence.

In [None]:
len(embed_token_sentence)

The dimension of each embedded vector is `d`, the value passed in as a parameter.

In [None]:
len(embed_token_sentence[0])
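
The two shape checks above can be illustrated with a small self-contained sketch. It assumes the list-of-lists layout (rows are embedding dimensions, columns are token IDs) and uses made-up matrix values; the function bodies are one possible implementation, not the tutorial's reference solution.

```python
def token_embedding(token_id, embed_matrix):
    # Read the column that belongs to `token_id`.
    return [row[token_id] for row in embed_matrix]


def sentence_token_embedding(sentence, embed_matrix):
    # Map each token ID in the sentence to its embedding vector, in order.
    return [token_embedding(token_id, embed_matrix) for token_id in sentence]


# Toy 2 x 3 matrix (d = 2, vocabulary of 3 tokens) with made-up values.
toy_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]
embedded = sentence_token_embedding([0, 1, 2], toy_matrix)
print(len(embedded), len(embedded[0]))  # 3 2
```

The result has one vector per token (`n = 3`), and each vector has `d = 2` entries.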

# Positional Embedding

- Algorithm 2

![img](../assets/algorithm_2.png)

In Algorithm 2, positional embedding takes an input $l$, which represents the position of a token in the sequence.  
Its output is $e_{p}$, the vector representation of that position, looked up in $W_{p}$, the positional embedding matrix.

## Positional Embedding Matrix
Before implementing Algorithm 2, we need to build $W_{p}$, the positional embedding matrix.  
By its definition, $W_{p}$ needs two arguments:
$$
W_{p}\in\mathbb{R}^{d_{e} \times l_{max}}
$$
- $d_{e}$ : dimension of the embedding vector
- $l_{max}$ : maximum context length

Definition of $l_{max}$:

![img](../assets/chunking.png)

### Hardcoded Embedding Matrix
In this tutorial, `generate_positional_embedding_matrix` implements the hard-coded method of the original Transformer.

> Not all transformers make use of learned positional embeddings, some use a hard-coded mapping.  
> Such hardcoded positional embeddings can (theoretically) handle arbitrarily long sequences.  
> The original [Transformer](https://arxiv.org/abs/1706.03762) uses:  
>$$
>W_{p}[2i-1,t] = \sin({t/l^{2i/d_{e}}_{max}})
>$$
>$$
>W_{p}[2i,t] = \cos({t/l^{2i/d_{e}}_{max}})
>$$
> for $0 < i \le d_{e}/2$.

* $t$ represents the position of the token

For example, assume that the embedding vector has 10 dimensions.  
In the paper's notation, $d_{e}=10 \rightarrow 0 < i \le 5$.

Also assume that $l_{max}=5$.  
Applying the expression above, each row of the positional embedding matrix is calculated as follows:

- $i=1 \rightarrow W_{p}[1,t]=\sin(t/5^{2/10}), W_{p}[2,t]=\cos(t/5^{2/10})$
- $i=2 \rightarrow W_{p}[3,t]=\sin(t/5^{4/10}), W_{p}[4,t]=\cos(t/5^{4/10})$
- $i=3 \rightarrow W_{p}[5,t]=\sin(t/5^{6/10}), W_{p}[6,t]=\cos(t/5^{6/10})$
- $i=4 \rightarrow W_{p}[7,t]=\sin(t/5^{8/10}), W_{p}[8,t]=\cos(t/5^{8/10})$
- $i=5 \rightarrow W_{p}[9,t]=\sin(t/5^{10/10}), W_{p}[10,t]=\cos(t/5^{10/10})$

In this tutorial, we assume that position $t$ ranges over $0\le t < l_{max}$.  
So the function takes only two arguments, `d` and `l_max`.

In [None]:
def generate_positional_embedding_matrix(d, l_max):
    ...
    return pos_matrix

In [None]:
l_max = 5
pos_matrix = generate_positional_embedding_matrix(embed_dimension, l_max)
pos_matrix
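
As a hedged sketch of how the quoted formula could be turned into code, the block below fills a `d` by `l_max` list of lists using only the standard library. The formula's rows are 1-indexed, so row $2i-1$ becomes Python index `2*i - 2` and row $2i$ becomes index `2*i - 1`. Requiring an even `d` is an assumption made here so that every row belongs to a sine/cosine pair.

```python
import math


def generate_positional_embedding_matrix(d, l_max):
    # Assumption: d is even, so each i in 1..d/2 owns exactly two rows.
    if d % 2 != 0:
        raise ValueError("this sketch assumes an even embedding dimension")
    pos_matrix = [[0.0] * l_max for _ in range(d)]
    for i in range(1, d // 2 + 1):      # 0 < i <= d/2
        scale = l_max ** (2 * i / d)    # l_max^(2i/d), as in the quoted formula
        for t in range(l_max):          # 0 <= t < l_max
            pos_matrix[2 * i - 2][t] = math.sin(t / scale)  # 1-based row 2i-1
            pos_matrix[2 * i - 1][t] = math.cos(t / scale)  # 1-based row 2i
    return pos_matrix


pos_matrix = generate_positional_embedding_matrix(10, 5)
print(len(pos_matrix), len(pos_matrix[0]))  # 10 5
print(pos_matrix[0][0], pos_matrix[1][0])   # 0.0 1.0
```

At position $t=0$ every sine row is 0 and every cosine row is 1, which is a quick sanity check for any implementation of this scheme.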

# Embedding

The positional embedding of a token is usually added to the token embedding to form the token's initial embedding.  
For the $t$-th token of a sequence $x$, the embedding is:
$$
e=W_{e}[:,x[t]]+W_{p}[:,t]
$$

Define a function that adds the token embedding vector and the positional embedding vector for a given `token_id` and `position`.

In [None]:
def embed_by_position(token_id, position, embed_matrix, pos_matrix):
    ...
    return result_vector

In [None]:
position = 0
token_id = 0
embed_vector = embed_by_position(token_id, position, embed_matrix, pos_matrix)
embed_vector
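
One possible implementation, sketched under the assumption that both matrices are lists of `d` rows, reads the token's column from the token embedding matrix and the position's column from the positional embedding matrix, then adds them element by element. Note that the token matrix is indexed by `token_id` while the positional matrix is indexed by `position`. The toy values are made up so the sums are easy to verify by hand.

```python
def embed_by_position(token_id, position, embed_matrix, pos_matrix):
    # Element-wise sum of the token's column and the position's column.
    return [
        e_row[token_id] + p_row[position]
        for e_row, p_row in zip(embed_matrix, pos_matrix)
    ]


# Toy matrices with made-up values: d = 2, vocabulary of 3, l_max = 2.
toy_embed = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
toy_pos = [
    [10.0, 20.0],
    [30.0, 40.0],
]
print(embed_by_position(2, 1, toy_embed, toy_pos))  # [23.0, 46.0]
```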

As explained before, in practice we need to map every token in a sentence.  
Define a function that maps every token in a sentence to its combined embedding.  
Use the `embed_by_position` function defined earlier.

In [None]:
def sentence_embedding(sentence, embed_matrix, pos_matrix):
    ...
    return embed_sentence

In [None]:
embed_sentence = sentence_embedding(sample_sentence, embed_matrix, pos_matrix)
embed_sentence
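
A minimal sketch of the sentence-level function could pair each token ID with its index in the sentence and reuse `embed_by_position`. Rejecting sentences longer than `l_max` is an assumption made here; the tutorial does not say whether such input should be truncated, chunked, or treated as an error. The matrix values are made up for illustration.

```python
def embed_by_position(token_id, position, embed_matrix, pos_matrix):
    # Element-wise sum of the token's column and the position's column.
    return [
        e_row[token_id] + p_row[position]
        for e_row, p_row in zip(embed_matrix, pos_matrix)
    ]


def sentence_embedding(sentence, embed_matrix, pos_matrix):
    # Assumption: a sentence longer than l_max is an error in this sketch.
    l_max = len(pos_matrix[0])
    if len(sentence) > l_max:
        raise ValueError("sentence is longer than the maximum context length")
    # The position of a token is its index in the sentence.
    return [
        embed_by_position(token_id, position, embed_matrix, pos_matrix)
        for position, token_id in enumerate(sentence)
    ]


# Toy matrices with made-up values: d = 2, vocabulary of 3, l_max = 3.
toy_embed = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
toy_pos = [
    [10.0, 20.0, 30.0],
    [40.0, 50.0, 60.0],
]
print(sentence_embedding([2, 0], toy_embed, toy_pos))
# [[13.0, 46.0], [21.0, 54.0]]
```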