# PLAYING WITH TOKEN EMBEDDINGS

## Imported Trained Models

### Word2Vec

<div class="alert alert-info">
Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality'.
</div>

In [5]:
import gensim.downloader as api
model = api.load("word2vec-google-news-300")

In [6]:
word_vectors = model

# Let us see what the embedding vectors look like
print(word_vectors['computer']) # Example: Accessing the vector for the word 'computer'

[ 1.07421875e-01 -2.01171875e-01  1.23046875e-01  2.11914062e-01
 -9.13085938e-02  2.16796875e-01 -1.31835938e-01  8.30078125e-02
  2.02148438e-01  4.78515625e-02  3.66210938e-02 -2.45361328e-02
  2.39257812e-02 -1.60156250e-01 -2.61230469e-02  9.71679688e-02
 -6.34765625e-02  1.84570312e-01  1.70898438e-01 -1.63085938e-01
 -1.09375000e-01  1.49414062e-01 -4.65393066e-04  9.61914062e-02
  1.68945312e-01  2.60925293e-03  8.93554688e-02  6.49414062e-02
  3.56445312e-02 -6.93359375e-02 -1.46484375e-01 -1.21093750e-01
 -2.27539062e-01  2.45361328e-02 -1.24511719e-01 -3.18359375e-01
 -2.20703125e-01  1.30859375e-01  3.66210938e-02 -3.63769531e-02
 -1.13281250e-01  1.95312500e-01  9.76562500e-02  1.26953125e-01
  6.59179688e-02  6.93359375e-02  1.02539062e-02  1.75781250e-01
 -1.68945312e-01  1.21307373e-03 -2.98828125e-01 -1.15234375e-01
  5.66406250e-02 -1.77734375e-01 -2.08984375e-01  1.76757812e-01
  2.38037109e-02 -2.57812500e-01 -4.46777344e-02  1.88476562e-01
  5.51757812e-02  5.02929

In [7]:
print(word_vectors['cat'].shape) # Example: Checking the shape of the vector for 'cat'

(300,)


### Similar words

KING + WOMAN - MAN = ?

In [8]:
# Example of using most_similar to find similar words
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=10))

[('queen', 0.7118193507194519), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321839332581), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.518113374710083), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411403656006)]
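
To make the arithmetic behind this query concrete, here is a minimal sketch on hypothetical 3-dimensional toy vectors (not values from the Google News model). It assumes the usual recipe: scale each vector to unit length, add the positive words, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The helper names `unit` and `toy_analogy` are made up for this illustration.

```python
import numpy as np

# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.3, 0.3]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def toy_analogy(positive, negative, vectors):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)  # skip the query words themselves
    scores = {
        word: float(np.dot(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = toy_analogy(["king", "woman"], ["man"], toy_vectors)
print(ranking)  # 'queen' ranks first for these toy vectors
```

The query words are left out of the ranking, which matches the real output above, where 'king', 'woman', and 'man' do not appear among the results.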


Let us check the similarity between a few pairs of words using the pre-trained Word2Vec model.

In [9]:
# Example: Finding similarity between word pairs
print(f"Similarity between 'woman' and 'man': {word_vectors.similarity('woman', 'man')}")
print(f"Similarity between 'king' and 'queen': {word_vectors.similarity('king', 'queen')}")
print(f"Similarity between 'uncle' and 'aunt': {word_vectors.similarity('uncle', 'aunt')}")
print(f"Similarity between 'boy' and 'girl': {word_vectors.similarity('boy', 'girl')}")
print(f"Similarity between 'father' and 'mother': {word_vectors.similarity('father', 'mother')}")
print(f"Similarity between 'brother' and 'sister': {word_vectors.similarity('brother', 'sister')}")
print(f"Similarity between 'paper' and 'water': {word_vectors.similarity('paper', 'water')}")

Similarity between 'woman' and 'man': 0.7664012312889099
Similarity between 'king' and 'queen': 0.6510956883430481
Similarity between 'uncle' and 'aunt': 0.7643473744392395
Similarity between 'boy' and 'girl': 0.8543272018432617
Similarity between 'father' and 'mother': 0.7901482582092285
Similarity between 'brother' and 'sister': 0.7160383462905884
Similarity between 'paper' and 'water': 0.11408083885908127
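
The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. A value of 1.0 means the vectors point in the same direction, and values near 0 mean the directions are unrelated. A minimal sketch with made-up vectors (the function name `cosine_similarity` and the three vectors are assumptions for illustration, not part of the model):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up example vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice as long
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a (dot product is 0)

sim_parallel = cosine_similarity(a, b)
sim_orthogonal = cosine_similarity(a, c)
print(sim_parallel)    # very close to 1.0
print(sim_orthogonal)  # 0.0
```

Because the lengths are divided out, `b` scores the same as `a` itself would, even though it is twice as long.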


Next, let us find the words most similar to a given word.

In [10]:
print(word_vectors.most_similar('tower', topn=5))

[('towers', 0.8531750440597534), ('skyscraper', 0.6417425870895386), ('Tower', 0.639177143573761), ('spire', 0.594687819480896), ('responded_Understood_Atlasjet', 0.5931612253189087)]


### Now let's look at the distance between vectors

In [12]:
import numpy as np

# Word pairs to compare

word1 = 'man'
word2 = 'woman'

word3 = 'semiconductor'
word4 = 'earthworm'

word5 = 'nephew'
word6 = 'niece'

# Calculate the vector differences
vector_diff1 = word_vectors[word1] - word_vectors[word2]
vector_diff2 = word_vectors[word3] - word_vectors[word4]
vector_diff3 = word_vectors[word5] - word_vectors[word6]

# Calculate the magnitudes of the vector differences
magnitude_diff1 = np.linalg.norm(vector_diff1)
magnitude_diff2 = np.linalg.norm(vector_diff2)
magnitude_diff3 = np.linalg.norm(vector_diff3)

print(f"Euclidean distance between '{word1}' and '{word2}': {magnitude_diff1}")
print(f"Euclidean distance between '{word3}' and '{word4}': {magnitude_diff2}")
print(f"Euclidean distance between '{word5}' and '{word6}': {magnitude_diff3}")

Euclidean distance between 'man' and 'woman': 1.7279510498046875
Euclidean distance between 'semiconductor' and 'earthworm': 5.6670427322387695
Euclidean distance between 'nephew' and 'niece': 1.9557794332504272
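
The numbers above are norms of difference vectors, that is, Euclidean distances between raw, unnormalized embeddings, so they depend on vector length as well as direction. The sketch below uses random stand-in vectors (an assumption for illustration, not real embeddings) to show how this distance relates to cosine similarity. After scaling both vectors to unit length, the squared distance equals 2 - 2 * cosine similarity. Scaling the raw vectors changes the distance but not the cosine.

```python
import numpy as np

# Random stand-in vectors (an assumption for illustration, not real embeddings).
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Scale both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
unit_distance = float(np.linalg.norm(a_unit - b_unit))
identity_holds = bool(np.isclose(unit_distance ** 2, 2 - 2 * cosine(a, b)))
print(identity_holds)

# Tripling both raw vectors triples their distance but leaves the cosine unchanged.
raw_distance = float(np.linalg.norm(a - b))
scaled_distance = float(np.linalg.norm(3 * a - 3 * b))
print(np.isclose(scaled_distance, 3 * raw_distance))
print(np.isclose(cosine(3 * a, 3 * b), cosine(a, b)))
```

This is one reason to compare raw distances with care: a large distance can come from long vectors as well as from unrelated directions.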


# CREATING TOKEN EMBEDDINGS

<div class="alert alert-block alert-success">
    
Let's illustrate how the token ID to embedding vector conversion works with a hands-on
example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1:</div>

In [14]:
import torch

input_ids = torch.tensor([2, 3, 5, 1])

<div class="alert alert-block alert-success">
    
For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of
only 6 words (instead of the 50,257 words in the BPE tokenizer vocabulary), and we want
to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):

</div>

<div class="alert alert-block alert-success">
    
Using vocabulary_size and output_dimension, we can instantiate an embedding layer in PyTorch,
setting the random seed to 123 for reproducibility:

</div>

In [15]:
vocabulary_size = 6
output_dimension = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocabulary_size, output_dimension)

<div class="alert alert-block alert-info">
    
The following print statement shows the embedding layer's underlying
weight matrix:
    
</div>

In [16]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


<div class="alert alert-block alert-info">
    
We can see that the weight matrix of the embedding layer contains small, random values.
These values are optimized during LLM training, as we will see in upcoming chapters.
Moreover, the weight matrix has six rows and three columns: one row for each of the six
possible tokens in the vocabulary and one column for each of the three embedding dimensions.
    
</div>
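
As a quick check of the shape described above, the following sketch rebuilds the same small layer and inspects its weight matrix. The last line simply multiplies the vocabulary size and embedding size quoted earlier to give a rough sense of scale. It is plain arithmetic on those two numbers, not a statement about how any particular model stores its weights.

```python
import torch

vocabulary_size = 6
output_dimension = 3
embedding_layer = torch.nn.Embedding(vocabulary_size, output_dimension)

weight_shape = tuple(embedding_layer.weight.shape)
num_parameters = embedding_layer.weight.numel()
print(weight_shape)    # (6, 3): one row per token, one column per dimension
print(num_parameters)  # 18 = 6 * 3

# The same product for the larger sizes mentioned in the text.
large_table_entries = 50_257 * 12_288
print(large_table_entries)  # 617558016
```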

<div class="alert alert-block alert-success">
    
Now that we have instantiated the embedding layer, let's apply it to a token ID to obtain the
embedding vector:

</div>

In [17]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">
    
If we compare the embedding vector for token ID 3 to the previous embedding matrix, we
see that it is identical to the 4th row (Python starts with a zero index, so it's the row
corresponding to index 3). In other words, the embedding layer is essentially a look-up
operation that retrieves rows from the embedding layer's weight matrix via a token ID.
    
</div>
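
The look-up view can be checked directly. The sketch below (variable names such as `via_indexing` and `via_matmul` are chosen for this illustration) compares the layer's output for token ID 3 with plain row indexing into the weight matrix, and with multiplying a one-hot vector by the weight matrix, which selects the same row.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)
token_id = torch.tensor([3])

# 1) The embedding layer itself.
via_layer = embedding_layer(token_id)

# 2) Plain row indexing into the weight matrix.
via_indexing = embedding_layer.weight[token_id]

# 3) A one-hot row vector times the weight matrix picks out the same row.
one_hot = torch.nn.functional.one_hot(token_id, num_classes=6).float()
via_matmul = one_hot @ embedding_layer.weight

same_as_indexing = torch.equal(via_layer, via_indexing)
same_as_matmul = torch.allclose(via_layer, via_matmul)
print(same_as_indexing)  # True
print(same_as_matmul)    # True
```

The indexing form avoids building a mostly-zero one-hot vector and multiplying by it, which is why a look-up is the natural way to think about this layer.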

<div class="alert alert-block alert-success">
    
We have seen how to convert a single token ID into a three-dimensional
embedding vector. Let's now apply that to all four input IDs we defined earlier
(torch.tensor([2, 3, 5, 1])):

</div>

In [19]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">
    
Each row in this output matrix is obtained via a lookup operation from the embedding
weight matrix.
    
</div>
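
The same lookup extends to batches of sequences. As a sketch (the second row of `batch_ids` is a made-up sequence), passing a 2-D tensor of token IDs returns one embedding vector per ID, with the embedding dimension appended as the last axis:

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)

# Two sequences of four token IDs each; the second row is made up for illustration.
batch_ids = torch.tensor([[2, 3, 5, 1],
                          [0, 4, 4, 2]])

batch_embeddings = embedding_layer(batch_ids)
output_shape = tuple(batch_embeddings.shape)
print(output_shape)  # (2, 4, 3): batch size, sequence length, embedding dimension

# A repeated token ID yields identical vectors, wherever it appears.
repeated_match = torch.equal(batch_embeddings[1, 1], batch_embeddings[1, 2])  # ID 4 twice
cross_row_match = torch.equal(batch_embeddings[0, 0], batch_embeddings[1, 3])  # ID 2 in both rows
print(repeated_match)   # True
print(cross_row_match)  # True
```

The lookup depends only on the token ID, not on where the token sits in the sequence.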