# A simple word embedding experiment

See:
- Page 386 of Data Science from Scratch

## Plan
- data
    - hand-made sentences with a fixed structure: `"the" + color + noun + verb + adverb + adjective`.
    - what do we predict, and what is the target?
        - prediction: a vector of scores over the vocabulary, computed from the input word's embedding. The prediction task is only a proxy for improving the parameters of the embedding matrix.
        - target: a specific word in our vocabulary (a word that appears next to the input word).
- architecture
    - embedding layer + linear layer + softmax cross entropy
    - PyTorch's autograd should take care of the gradients
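
To make the "softmax cross entropy" step of the plan concrete, here is a minimal pure-Python sketch. The helper names `softmax` and `cross_entropy` and the example scores are illustrative assumptions, not part of the notebook; in the notebook itself PyTorch computes this loss.

```python
import math

def softmax(scores):
    # subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, one_hot_target):
    # with a one-hot target this reduces to -log(probability of the target word)
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

# hypothetical scores over a 3-word vocabulary; the target is word 0
scores = [2.0, 0.5, -1.0]
target = [1.0, 0.0, 0.0]
loss = cross_entropy(scores, target)
```

The loss is small when the score of the target word dominates the others, and it equals `log(vocabulary size)` when all scores are equal.
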
## Data
Our data is handmade. We feed the network training pairs `(word, nearby_word)` and minimize the softmax cross-entropy loss (`nn.CrossEntropyLoss` in PyTorch).
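
As a rough illustration of what a training pair looks like, the sketch below builds `(word, nearby_word)` pairs from one sentence, assuming that "nearby" means the word immediately to the right. `make_pairs` is a hypothetical helper used only for this example.

```python
def make_pairs(sentence):
    # split on whitespace and pair every word with the word that follows it
    tokens = sentence.split()
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

pairs = make_pairs("the red car was quite fast")
for word, nearby_word in pairs:
    print(word, "->", nearby_word)
```

A sentence with six words yields five pairs, so the number of training examples grows linearly with the amount of text.
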

In [6]:
import random
import torch.nn as nn
import torch

# let's start with the data
# we will create some random sentences with:
color = ['blue','red','green','yellow','white']
noun  = ['cat','dog','car','boat','house']
verb  = ['is','was','seems','looks']
adverb = ['quite','absurdly','extremely']
adjective = ['slow','fast','big','small']

# join the words following "the" + color + noun + verb + adverb + adjective
num_sentences = 100
# vocabulary size: the five word lists plus the article "the"
NUM_WORDS = len(color)+len(noun)+len(verb)+len(adverb)+len(adjective) + 1

sentences = [" ".join(['the',random.choice(color),random.choice(noun),random.choice(verb),random.choice(adverb),random.choice(adjective),'.']) for _ in range(num_sentences)]
random.sample(sentences,10)


['the red car was quite fast .',
 'the yellow dog looks quite slow .',
 'the yellow boat seems extremely slow .',
 'the blue car was extremely slow .',
 'the blue car is quite fast .',
 'the yellow boat seems quite fast .',
 'the white house seems absurdly small .',
 'the red dog is extremely slow .',
 'the white boat looks quite big .',
 'the white house seems quite big .']

In [9]:
import re

# map indices to words and vice versa
idx_to_word = {}
word_to_idx = {}

def get_one_hot(word):
    return [1.0 if i==word_to_idx[word] else 0.0 for i in range(NUM_WORDS)]

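# 'the' appears in every sentence, so it must be part of the vocabulary
joined_words = ['the'] + color + noun + verb + adverb + adjective
NUM_WORDS = len(joined_words)  # keep the vocabulary size in sync with the mapping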
for idx, word in enumerate(joined_words):
    idx_to_word[idx] = word
    word_to_idx[word] = idx

# we use a regex to tokenize the sentences and obtain the training pairs
training_pairs = []
for sentence in sentences:
    # the `r` prefix makes this a raw string, so the backslashes reach the regex engine unchanged
    tokenized_sentence = re.findall(r'\b\w+\b', sentence)
    for i in range(len(tokenized_sentence)-1):
        training_pairs.append((tokenized_sentence[i], get_one_hot(tokenized_sentence[i+1])))

# sanity check: every target is a one-hot vector of length NUM_WORDS
assert all(len(target) == NUM_WORDS and sum(target) == 1.0 for _, target in training_pairs)

## Architecture

<div style="text-align: center;">
    <img src="images/word_embedding_architecture.jpg" alt="image.png" style="width: 50%;"/>
    <figcaption>simple word embedding architecture.</figcaption>
</div>
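
One way to read the embedding layer in this architecture: indexing a row of the embedding matrix gives the same vector as multiplying the word's one-hot vector by that matrix. The sketch below checks this in plain Python; the sizes and the helpers `one_hot` and `vec_mat` are assumptions chosen for illustration.

```python
import random

random.seed(0)
vocab_size, embed_dim = 4, 3
# hypothetical embedding matrix: one row of embed_dim numbers per word
E = [[random.gauss(0, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]

def one_hot(idx, size):
    return [1.0 if i == idx else 0.0 for i in range(size)]

def vec_mat(v, M):
    # row vector times matrix
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

idx = 2
looked_up = E[idx]                                # direct row lookup
projected = vec_mat(one_hot(idx, vocab_size), E)  # one-hot times matrix
```

Both routes give the same vector, which is why an embedding layer can be implemented as a plain row lookup instead of a matrix product.
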

In [None]:
# cosine similarity
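
This cell was left empty. A minimal sketch of cosine similarity in plain Python could look like the following; the function name and the example vectors are assumptions, not code from the notebook.

```python
import math

def cosine_similarity(v, w):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Once the model is trained, the same function could be applied to two rows of the embedding matrix (converted to lists) to check whether words that appear in similar positions, such as two colors, end up with similar vectors.
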


In [10]:
from tqdm import trange, tqdm
import matplotlib.pyplot as plt

class Embedding(nn.Module):
    # autograd tracks self.embedding because it is registered as an nn.Parameter
    def __init__(self,word_elements,embed_dimension):
        super().__init__()
        # initialize tensor with normal distribution of shape emb = (num_words,embed_dimension)
        self.embedding = nn.Parameter(torch.randn((word_elements, embed_dimension), requires_grad=True))
    
    def forward(self,word):
        """
        word: word string
        we will get the string id and return its embedding
        """
        return self.embedding[word_to_idx[word]]

# include the linear layer in our model | careful with the dimensions
EMBEDDING_SIZE = 10  
model = nn.Sequential(
    Embedding(NUM_WORDS, EMBEDDING_SIZE),
    nn.Linear(EMBEDDING_SIZE, NUM_WORDS) 
)

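# Training loop: one optimizer step per training pair
epochs = len(training_pairs)
optim = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-2)
loss_function = nn.CrossEntropyLoss()

# keep track of the loss and accuracy at every step
losses = []
accuracies = []

for i in (t := trange(epochs)):
    X, target = training_pairs[i]  # X: input word (str), target: one-hot list
    target = torch.tensor(target)  # float tensor of shape (NUM_WORDS,)
    optim.zero_grad()
    prediction = model(X)  # logits of shape (NUM_WORDS,)
    # order matters for the cross-entropy loss: (logits, target)
    # we add a batch dimension of 1; one-hot (probability) targets need PyTorch 1.10 or newer
    loss = loss_function(prediction.unsqueeze(0), target.unsqueeze(0))
    loss.backward()
    optim.step()

    # for a single pair the accuracy is either 1.0 (right) or 0.0 (wrong)
    accuracy = float(prediction.argmax() == target.argmax())
    loss = loss.item()
    losses.append(loss)
    accuracies.append(accuracy)

    t.set_description(f"Loss: {loss:.2f}, accuracy: {accuracy:.2f}")

plt.plot(losses)
plt.plot(accuracies, alpha = 0.5)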
