In [3]:
from torchtext.vocab import GloVe
import torch

glove = GloVe(dim=300)

In [4]:
from torch import sin, cos

def positional_embedding(word, pos):
    model_dims = 300

    # Sinusoidal positional encoding: sine on even indices, cosine on odd ones
    positional_encoding = torch.zeros(model_dims)
    for i in range(0, model_dims // 2):
        angle = torch.tensor(pos / (10000 ** (2 * i / model_dims)))
        positional_encoding[2 * i] = sin(angle)
        positional_encoding[2 * i + 1] = cos(angle)

    # Add out of place: glove[word] can return a view into glove.vectors,
    # so an in-place += would modify the stored embedding on every call
    return glove[word] + positional_encoding


In [5]:
print(torch.nn.functional.cosine_similarity(glove["puppy"].unsqueeze(0), glove["stochastic"].unsqueeze(0)).item())
print(torch.nn.functional.cosine_similarity(glove["puppy"].unsqueeze(0), glove["puppies"].unsqueeze(0)).item())


-0.029923180118203163
0.8881815075874329


There's not that much I can meaningfully test right now for the positional embedding function, but experimenting with the above shows roughly what I'd expect from the standard embedding function. I'll call this done for now.
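One thing that can be checked without GloVe at all is the positional part on its own. Here's a rough sketch in plain Python (no torch, so it runs anywhere) that mirrors the sine/cosine loop above and asserts a few properties that should hold. The function name `sinusoidal_encoding` is made up for this sketch and isn't used elsewhere in the notebook.

```python
import math

def sinusoidal_encoding(pos: int, model_dims: int = 300) -> list[float]:
    """Return the sinusoidal positional encoding for a single position."""
    encoding = [0.0] * model_dims
    for i in range(model_dims // 2):
        angle = pos / (10000 ** (2 * i / model_dims))
        encoding[2 * i] = math.sin(angle)      # even slots hold sines
        encoding[2 * i + 1] = math.cos(angle)  # odd slots hold cosines
    return encoding

first = sinusoidal_encoding(0)
later = sinusoidal_encoding(7)

# At position 0 every sine slot is 0 and every cosine slot is 1.
assert all(v == 0.0 for v in first[0::2])
assert all(v == 1.0 for v in first[1::2])
# Every entry stays in [-1, 1], and different positions give different vectors.
assert all(-1.0 <= v <= 1.0 for v in later)
assert first != later
```

These checks only cover the encoding itself, not what happens once it's added to a word vector.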

Update: Gonna experiment with getting a dictionary out of glove below. I need this dictionary because my model will use it to generate output probabilities for the next word.
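For context, here's roughly how I expect the dictionary to be used at the output end: the model produces one score per vocabulary entry, a softmax turns the scores into probabilities, and the index-to-string mapping turns the winning index back into a word. This is a toy sketch in plain Python. The four-word `toy_itos` list and the scores in `toy_logits` are invented stand-ins, not real GloVe data or model output.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical four-word vocabulary standing in for glove.itos
toy_itos = ["the", "puppy", "barks", "stochastic"]
# Made-up scores, one per vocabulary entry
toy_logits = [1.0, 3.0, 0.5, -2.0]

probs = softmax(toy_logits)
# Pick the most probable index and map it back to a word
best_index = max(range(len(probs)), key=probs.__getitem__)
next_word = toy_itos[best_index]
print(next_word)
```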

In [6]:
# Integer-to-string mapping
vocab_itos = glove.itos

# String-to-integer mapping
vocab_stoi = glove.stoi

# You can check the size of the vocabulary
vocab_size = len(vocab_itos)

# Access a word by index
word_at_index_10 = vocab_itos[10]

# Find index of a word
index_of_word = vocab_stoi['hello']


In [7]:
vocab_size

2196017

A dictionary size of over 2 million means the final layer of my model needs one output per word. If that layer projects from the 300-dimensional model width, it will have roughly 2.2 million × 300 ≈ 659 million weights. I have to imagine that's going to be extremely expensive computationally. Therefore, I've decided to build a custom dictionary using Shakespeare instead. Later I may implement my own custom embeddings, which is another potential solution to this problem.
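To put a number on that cost, here's a quick back-of-the-envelope calculation. It assumes the final layer is a single dense layer mapping the 300-dimensional model width to one score per vocabulary word, with a bias and float32 storage. The 30,000-word vocabulary is a hypothetical size for comparison, not a measured Shakespeare count.

```python
def output_layer_params(vocab_size: int, model_dims: int, bias: bool = True) -> int:
    """Parameter count of a dense layer mapping model_dims -> vocab_size."""
    weights = vocab_size * model_dims
    return weights + (vocab_size if bias else 0)

glove_vocab = 2_196_017  # vocab_size reported above
model_dims = 300         # embedding width used in this notebook

full = output_layer_params(glove_vocab, model_dims)
small = output_layer_params(30_000, model_dims)  # hypothetical smaller vocabulary

print(f"{full:,} parameters with the GloVe vocabulary")
print(f"{small:,} parameters with a 30,000-word vocabulary")
# 4 bytes per float32 parameter
print(f"{full * 4 / 2**30:.2f} GiB of float32 parameters for the GloVe case")
```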

Slightly rethinking the approach to the above, but in the meantime I need to create a function that can encode a full block of text.
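For reference, a custom dictionary built from a corpus could look something like the sketch below. It's only a rough outline: the regex word splitter is a stand-in for a real tokenizer, and the `<unk>` placeholder at index 0 is a convention I'm assuming here, not something from GloVe.

```python
from collections import Counter
import re

def build_vocab(text: str, min_count: int = 1) -> tuple[list[str], dict[str, int]]:
    """Build index-to-string and string-to-index mappings from raw text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Reserve index 0 for unknown words; order the rest by frequency
    itos = ["<unk>"] + [w for w, c in counts.most_common() if c >= min_count]
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

sample = "To be, or not to be, that is the question."
itos, stoi = build_vocab(sample)
print(len(itos), stoi["question"])
```

Raising `min_count` drops rare words, which is one way to keep the vocabulary (and the output layer) small.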

In [8]:
# from torch import Tensor

# def encode_text(text: str) -> Tensor:


So I've learned that GloVe doesn't have built-in tokenization, and that's gonna be a huge pain. So I'm starting a new notebook with a simpler approach using a different embedding library. Goodbye forever!

Sike, I'm just gonna use an off-the-shelf tokenizer (spaCy) and connect it to GloVe now. This is a bad approach because the tokenizer may produce tokens that GloVe has no embedding for, but I'm gonna change this all later when I switch to embeddings that are trained as part of the model, so too bad!
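To make the missing-embedding problem concrete, here's a small sketch of a lookup that falls back to a placeholder vector and records which tokens were missing. Everything in it is a toy: `toy_embeddings` is an invented two-dimensional table standing in for GloVe, and using an all-zero vector as the fallback is an assumption for this sketch.

```python
def lookup_with_fallback(tokens, embeddings, dims):
    """Return one vector per token plus the tokens that had no embedding."""
    unk_vector = [0.0] * dims  # assumed fallback: an all-zero vector
    vectors, missing = [], []
    for token in tokens:
        if token in embeddings:
            vectors.append(embeddings[token])
        else:
            vectors.append(unk_vector)
            missing.append(token)
    return vectors, missing

# Hypothetical 2-dimensional embeddings standing in for GloVe
toy_embeddings = {"puppy": [0.1, 0.9], "runs": [0.4, 0.2]}
vectors, missing = lookup_with_fallback(
    ["puppy", "zorbles", "runs"], toy_embeddings, dims=2
)
print(missing)
```

Logging the `missing` list on real text would show how often the tokenizer and the embedding table disagree.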

In [9]:
import spacy

# Load the language model
tokenizer = spacy.load("en_core_web_sm")

# Tokenize a sentence
sentence = "This is a sample sentence."
tokens = tokenizer(sentence)

# Inspect the tokenizer's vocabulary; iterating over it yields Lexeme objects
print(len(tokenizer.vocab))
for lexeme in tokenizer.vocab:
    print(lexeme)


766
<spacy.lexeme.Lexeme object at 0x000001A4843A9780>
<spacy.lexeme.Lexeme object at 0x000001A4A2F3E180>
...

In [13]:
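import spacy

# Load the spaCy pipeline once; reloading it on every call is slow
tokenizer = spacy.load("en_core_web_sm")

def encode_string(text):
    tokens = tokenizer(text)

    # One row per token: GloVe vector plus positional encoding
    output = torch.zeros(size=[len(tokens), 300])
    for i, token in enumerate(tokens):
        output[i] = positional_embedding(token.text, i)

    return output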

In [19]:
encodings = encode_string("I drive an elephant every day during my daily commute. It gets 62 MPG.")
encodings.shape

torch.Size([16, 300])