# A simple word embedding experiment

See:
- Page 386 of Data Science from Scratch

## Plan
- data
    - hand-made sentences with a fixed structure: `"the" + color + noun + verb + adverb + adjective`.
    - what do we predict and what is our target?
        - prediction: for an input word, the model outputs a collection of numbers that training adjusts toward the target. The prediction itself is only a proxy: its real purpose is to improve the parameters of our embedding matrix.
        - target: the specific word in our vocabulary that appears near the input word.
- architecture
    - embedding layer + linear layer + softmax cross entropy
    - I expect PyTorch to take care of the gradients (autograd)

## Data
Our data is handmade. We will feed the network training pairs `(word, nearby_word)` and try to minimize the softmax cross-entropy loss.
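To make the loss concrete, here is a minimal pure-Python sketch of softmax cross entropy for a single training pair. The function names and the example scores are made up for illustration; the network later in the notebook relies on PyTorch for this step instead.

```python
import math

def softmax(logits):
    # subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, target_idx):
    # negative log-probability that the model assigns to the target word
    probs = softmax(logits)
    return -math.log(probs[target_idx])

# hypothetical scores over a 3-word vocabulary
logits = [2.0, 0.5, -1.0]
loss_if_target_is_0 = softmax_cross_entropy(logits, 0)  # target has the highest score
loss_if_target_is_2 = softmax_cross_entropy(logits, 2)  # target has the lowest score
print(loss_if_target_is_0, loss_if_target_is_2)
```

The loss is small when the target word already has the highest score and large when it has the lowest, which is the signal that pushes the parameters during training.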

In [1]:
import random
import torch.nn as nn
# let's start with the data
# we will create some random sentences with:
color = ['blue','red','green','yellow','white']
noun  = ['cat','dog','car','boat','house']
verb  = ['is','was','seems','looks']
adverb = ['quite','absurdly','extremely']
adjective = ['slow','fast','big','small']

# joining all these words following "the" + color + noun + verb + adverb + adjective.
num_sentences = 100

sentences = [" ".join(['the',random.choice(color),random.choice(noun),random.choice(verb),random.choice(adverb),random.choice(adjective),'.']) for _ in range(num_sentences)]
random.sample(sentences,10)


['the white cat is absurdly fast .',
 'the red house seems quite small .',
 'the red dog looks extremely slow .',
 'the red boat seems absurdly slow .',
 'the red dog seems quite small .',
 'the red cat seems extremely fast .',
 'the green house seems extremely small .',
 'the red cat looks extremely fast .',
 'the blue car seems absurdly small .',
 'the green dog looks extremely big .']

## Architecture

<div style="text-align: center;">
    <img src="images/word_embedding_architecture.jpg" alt="image.png" style="width: 50%;"/>
    <figcaption>simple word embedding architecture.</figcaption>
</div>

In [4]:
import re

# now we assign indexes to words and vice versa so we can map between the two
idx_to_word = {}
word_to_idx = {}

joined_words = color + noun + verb + adverb + adjective
for idx, word in enumerate(joined_words):
    idx_to_word[idx] = word
    word_to_idx[word] = idx

training_pairs = []
# we use a regex to tokenize the sentences and obtain the training pairs
for sentence in sentences:
    # the `r` prefix makes this a raw string, so the backslashes are passed to the regex engine as-is
    tokens = re.findall(r'\b\w+\b', sentence)
    # pair each word with the word that follows it
    for i in range(len(tokens) - 1):
        training_pairs.append((tokens[i], tokens[i + 1]))

training_pairs[:5]

[('the', 'green'), ('green', 'cat'), ('cat', 'looks'), ('looks', 'absurdly'), ('absurdly', 'big')]
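The pairs above treat only the immediately following word as "nearby". If a wider notion of nearby is wanted, a window-based variant could look like the sketch below. Both the helper name `window_pairs` and the `window` parameter are assumptions for illustration, not something the notebook uses.

```python
def window_pairs(tokens, window=1):
    """Pair every token with each neighbour at most `window` positions away."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                pairs.append((word, tokens[j]))
    return pairs

example_pairs = window_pairs(['the', 'red', 'cat'], window=1)
print(example_pairs)
```

With `window=1` this yields both the next-word pairs used above and their mirrored previous-word pairs.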


In [10]:
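import torch
from tqdm import tqdm

class Embedding(nn.Module):
    def __init__(self, num_words, embed_dimension):
        super().__init__()
        # nn.Parameter marks the tensor as requiring grad and registers it with the module;
        # initialized from a normal distribution, shape (num_words, embed_dimension)
        self.emb = nn.Parameter(torch.randn(num_words, embed_dimension))

    def forward(self, x):
        # x holds word indexes (words are mapped through word_to_idx before the call);
        # return the embedding row of each index
        return self.emb[x]

# 'the' appears in the pairs but not in joined_words, so index any missing token
for word in dict.fromkeys(w for pair in training_pairs for w in pair):
    if word not in word_to_idx:
        word_to_idx[word] = len(word_to_idx)
        idx_to_word[word_to_idx[word]] = word

num_words = len(word_to_idx)
embed_dimension = 5  # arbitrary choice

# the linear layer outputs one score per vocabulary word, as nn.CrossEntropyLoss expects
model = nn.Sequential(
    Embedding(num_words, embed_dimension),
    nn.Linear(embed_dimension, num_words)
)

# turn the (word, nearby_word) pairs into index tensors
inputs = torch.tensor([word_to_idx[w] for w, _ in training_pairs])
targets = torch.tensor([word_to_idx[w] for _, w in training_pairs])

epochs = 100  # arbitrary choice
optim = torch.optim.Adam(model.parameters(), lr=0.01)  # embedding and linear parameters
loss = nn.CrossEntropyLoss()

progress = tqdm(range(epochs))
for _ in progress:
    optim.zero_grad()  # set optimizer grad to zero
    out = model(inputs)  # scores of shape (num_pairs, num_words)
    loss_calc = loss(out, targets)  # targets are the indexes of the nearby words
    loss_calc.backward()  # compute gradients before stepping
    optim.step()
    progress.set_postfix(loss=loss_calc.item())  # show the loss in the progress bar

# TODO: once it is trained, plot everything in 2D to see if it actually learned something about the semantics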
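For the 2D plot mentioned in the TODO, one possible approach is to project the embedding matrix onto its first two principal components. The sketch below assumes PCA via SVD is an acceptable projection. The helper name `project_to_2d` is hypothetical, and the matrix is a random stand-in, not trained embeddings.

```python
import torch

def project_to_2d(embeddings):
    """Project the rows of an (n, d) tensor onto their first two principal components."""
    centered = embeddings - embeddings.mean(dim=0, keepdim=True)
    # rows of vh are the principal directions, ordered by decreasing singular value
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:2].T

torch.manual_seed(0)
words = ['cat', 'dog', 'car', 'boat', 'house']
fake_embeddings = torch.randn(len(words), 5)  # random stand-in for a trained matrix
coords = project_to_2d(fake_embeddings)

for word, (x, y) in zip(words, coords.tolist()):
    print(f"{word:>6}: ({x:+.2f}, {y:+.2f})")
```

With a trained model, the detached embedding matrix would take the place of `fake_embeddings`, and the resulting coordinates could be passed to a scatter plot with one label per word.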
