<a href="https://colab.research.google.com/github/RCortez25/PhD/blob/main/LLM/2.%20Token%20embeddings/Token_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch

# Toy example

Let's create a simple example to generate token embeddings.

In [2]:
# We have the text "Quick fox is in the house"

# Say their token ids are [4, 0, 3, 2, 5, 1]

# Create a tensor that stores the token ids [2, 3, 5, 1]
# That is [in, is, the, house] to be transformed into token embeddings
input_ids = torch.tensor([2, 3, 5, 1])
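
The ids above are consistent with numbering the unique lowercase words in alphabetical order. The notebook does not say how the ids were chosen, so the sketch below is only one assumed scheme that happens to reproduce them; the names `vocabulary` and `token_ids` are illustrative.

```python
# Assumed scheme: sort the unique lowercase words and number them from 0
text = "Quick fox is in the house"
words = text.lower().split()
vocabulary = {word: idx for idx, word in enumerate(sorted(set(words)))}
print(vocabulary)
# {'fox': 0, 'house': 1, 'in': 2, 'is': 3, 'quick': 4, 'the': 5}

# Encode the full sentence with this vocabulary
token_ids = [vocabulary[word] for word in words]
print(token_ids)  # [4, 0, 3, 2, 5, 1]

# The subset used as input_ids: "in", "is", "the", "house"
subset_ids = [vocabulary[word] for word in ["in", "is", "the", "house"]]
print(subset_ids)  # [2, 3, 5, 1]
```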

In [3]:
# Say we have a vocabulary of 6 words, those given in the sentence above
vocabulary_size = 6

# We want to embed the words in a 3-dimensional space. Other dimensions
# would work too, but we'll keep it 3-dimensional
output_dimension = 3

In [5]:
torch.manual_seed(42)

# Create an embedding layer to vectorize the words
# PyTorch's nn.Embedding is designed for this task: it maps token ids to
# embedding vectors
embedding_layer = torch.nn.Embedding(vocabulary_size, output_dimension)

# Take a look at the created tensor
print(embedding_layer.weight)

Parameter containing:
tensor([[ 1.9269,  1.4873, -0.4974],
        [ 0.4396, -0.7581,  1.0783],
        [ 0.8008,  1.6806,  0.3559],
        [-0.6866,  0.6105,  1.3347],
        [-0.2316,  0.0418, -0.2516],
        [ 0.8599, -0.3097, -0.3957]], requires_grad=True)


This created a matrix of shape `(vocabulary_size, output_dimension)`, where each row is the vector for one token id in the vocabulary and each column corresponds to one dimension of the embedding space. The values are randomly initialized.
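
As a quick check of the shape described above, the following sketch rebuilds the same layer and inspects it. It relies only on the `weight`, `num_embeddings` and `embedding_dim` attributes of `torch.nn.Embedding`; the variable `weight` is just a local name.

```python
import torch

torch.manual_seed(42)
vocabulary_size, output_dimension = 6, 3
embedding_layer = torch.nn.Embedding(vocabulary_size, output_dimension)

weight = embedding_layer.weight
print(weight.shape)  # torch.Size([6, 3])

# The layer also exposes both sizes as attributes
print(embedding_layer.num_embeddings, embedding_layer.embedding_dim)  # 6 3

# Row i is the vector for token id i, so a single row has 3 entries
print(weight[2].shape)  # torch.Size([3])
```

By default, PyTorch draws these initial values from a standard normal distribution, which is why they are small numbers scattered around zero.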

Note that `requires_grad=True` indicates that these values are trainable parameters: PyTorch tracks gradients for them, so they are adjusted while the network is trained.
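
To see what "adjusted via training" means in practice, here is a minimal sketch of one gradient step. The loss (a plain sum of the looked-up values) and the learning rate of 0.1 are assumptions chosen only to make the effect visible; a real LLM would use a language-modeling loss.

```python
import torch

torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(6, 3)
input_ids = torch.tensor([2, 3, 5, 1])

# Stand-in loss (an assumption for illustration only): sum of the looked-up values
loss = embedding_layer(input_ids).sum()
loss.backward()

# Rows 1, 2, 3 and 5 were looked up, so they receive a gradient of ones;
# rows 0 and 4 were not used and receive zeros
print(embedding_layer.weight.grad)

# One plain SGD step moves only the rows that received a nonzero gradient
weights_before = embedding_layer.weight.detach().clone()
optimizer = torch.optim.SGD(embedding_layer.parameters(), lr=0.1)
optimizer.step()
changed_rows = (embedding_layer.weight.detach() != weights_before).any(dim=1)
print(changed_rows)  # tensor([False,  True,  True,  True, False,  True])
```

Only the vectors of tokens that actually appear in the input are updated by this step.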

Let's get the vector of each word in the vocabulary.

In [6]:
print(f'Vector for the word "fox": {embedding_layer.weight[0]}')
print(f'Vector for the word "house": {embedding_layer.weight[1]}')
print(f'Vector for the word "in": {embedding_layer.weight[2]}')
print(f'Vector for the word "is": {embedding_layer.weight[3]}')
print(f'Vector for the word "quick": {embedding_layer.weight[4]}')
print(f'Vector for the word "the": {embedding_layer.weight[5]}')

Vector for the word "fox": tensor([ 1.9269,  1.4873, -0.4974], grad_fn=<SelectBackward0>)
Vector for the word "house": tensor([ 0.4396, -0.7581,  1.0783], grad_fn=<SelectBackward0>)
Vector for the word "in": tensor([0.8008, 1.6806, 0.3559], grad_fn=<SelectBackward0>)
Vector for the word "is": tensor([-0.6866,  0.6105,  1.3347], grad_fn=<SelectBackward0>)
Vector for the word "quick": tensor([-0.2316,  0.0418, -0.2516], grad_fn=<SelectBackward0>)
Vector for the word "the": tensor([ 0.8599, -0.3097, -0.3957], grad_fn=<SelectBackward0>)


Now, we can obtain the vectors for the `input_ids` defined above.

In [8]:
print(embedding_layer(input_ids))

tensor([[ 0.8008,  1.6806,  0.3559],
        [-0.6866,  0.6105,  1.3347],
        [ 0.8599, -0.3097, -0.3957],
        [ 0.4396, -0.7581,  1.0783]], grad_fn=<EmbeddingBackward0>)


Once again, these values are random for now; they are optimized during the training of the LLM.
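
The lookup above is just row selection from the weight matrix. The sketch below checks this by comparing the layer's output with direct indexing; the 2 x 4 batch `batch_ids` is a made-up input added to show that the same lookup works on batched ids.

```python
import torch

torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(6, 3)
input_ids = torch.tensor([2, 3, 5, 1])

# Calling the layer returns the same rows as indexing the weight matrix
looked_up = embedding_layer(input_ids)
indexed = embedding_layer.weight[input_ids]
print(torch.equal(looked_up, indexed))  # True
print(looked_up.shape)  # torch.Size([4, 3])

# Hypothetical batch of two sequences: the output gains a trailing
# dimension of size 3 (the embedding dimension)
batch_ids = torch.tensor([[2, 3, 5, 1], [4, 0, 3, 2]])
print(embedding_layer(batch_ids).shape)  # torch.Size([2, 4, 3])
```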

# Embedding layer = Linear layer

The embedding layer creates a matrix whose shape is determined by the size of the vocabulary and the dimensionality of the space in which the words are embedded.

In [9]:
# Create a matrix for a vocabulary of 4 words to embed them into a
# 5-dimensional space

embedding_layer = torch.nn.Embedding(4, 5)
print(embedding_layer.weight)

Parameter containing:
tensor([[-0.2234,  1.7174,  0.3189, -0.4245, -0.8140],
        [-0.7360, -0.8371, -0.9224,  1.8113,  0.1606],
        [ 0.3672,  0.1754, -1.1845,  1.3835, -1.2024],
        [ 0.7078, -1.0759,  0.5357,  1.1754,  0.5612]], requires_grad=True)


In [13]:
# Retrieve the vectors for token ids 0 and 2
input_ids = torch.tensor([0, 2])
print(embedding_layer(input_ids))

tensor([[-0.2234,  1.7174,  0.3189, -0.4245, -0.8140],
        [ 0.3672,  0.1754, -1.1845,  1.3835, -1.2024]],
       grad_fn=<EmbeddingBackward0>)


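Now, the embedding layer is mathematically equivalent to a linear layer without bias applied to one-hot encoded token ids: multiplying a one-hot vector by the weight matrix simply selects one row of it. In this case, such a linear layer has 4 inputs (the vocabulary size) and 5 neurons (the dimension of the embedding space).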
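
The equivalence can be checked numerically. The sketch below one-hot encodes the ids, copies the embedding weights into a bias-free `torch.nn.Linear`, and compares both outputs. Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, so the embedding matrix has to be transposed; the names `linear_layer`, `from_linear` and `from_embedding` are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(4, 5)
input_ids = torch.tensor([0, 2])

# One-hot encode the ids: shape (2, 4), a single 1 per row
one_hot = F.one_hot(input_ids, num_classes=4).float()

# Linear layer with 4 inputs and 5 outputs, no bias, sharing the same values.
# nn.Linear keeps its weight as (out_features, in_features) = (5, 4)
linear_layer = torch.nn.Linear(4, 5, bias=False)
with torch.no_grad():
    linear_layer.weight.copy_(embedding_layer.weight.T)

from_linear = linear_layer(one_hot)
from_embedding = embedding_layer(input_ids)
print(torch.allclose(from_linear, from_embedding))  # True
```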

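So, what's the benefit of using PyTorch's embedding layer? It is computationally more efficient: it looks up rows by index directly, instead of building one-hot vectors and multiplying them by the weight matrix, where almost all of the multiplications are by zero.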
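
A rough back-of-the-envelope count illustrates the difference. All three sizes below are hypothetical values picked for illustration, not figures from this notebook, and the count ignores hardware details.

```python
# Hypothetical sizes, chosen only for illustration
vocabulary_size = 50_000
embedding_dim = 768
num_tokens = 1_024

# One-hot route: build a (num_tokens, vocabulary_size) matrix, then multiply
# it by the (vocabulary_size, embedding_dim) weight matrix
one_hot_elements = num_tokens * vocabulary_size
matmul_multiply_adds = num_tokens * vocabulary_size * embedding_dim

# Lookup route: copy one row of embedding_dim values per token
lookup_elements_copied = num_tokens * embedding_dim

print(f"{one_hot_elements:,}")        # 51,200,000
print(f"{matmul_multiply_adds:,}")    # 39,321,600,000
print(f"{lookup_elements_copied:,}")  # 786,432
```

Under these assumptions the one-hot route does `vocabulary_size` times more arithmetic than the lookup copies values, and it also has to store the large one-hot matrix.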