# Word Embeddings
Based on the Word Embeddings tutorial by Robert Guthrie: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#sphx-glr-beginner-nlp-word-embeddings-tutorial-py

Word embeddings are dense vector representations of words.

Word embeddings compress information about a word into a short vector of real numbers. Compare this to sparse vectors such as one-hot encodings, where each vector is as long as the vocabulary and contains a single nonzero entry.

In [1]:
sentence = "the quick brown fox jumped over the lazy dog"
words = sentence.split(' ')
print(words)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']


Let's look at the individual words in our vocabulary:

In [2]:
vocab1 = list(set(words))
print(vocab1)

['quick', 'dog', 'brown', 'over', 'the', 'jumped', 'lazy', 'fox']


In [3]:
# Number of words in our vocabulary
len(vocab1)

8

# One Hot Encoding

The vocabulary size is shown above. Now we can one-hot encode the vocabulary words. The good news is that PyTorch (as of December 2018) has a built-in one-hot encoding function, `torch.nn.functional.one_hot`.

In [4]:
# Convert words to indexes
word_to_ix1 = {word: i for i, word in enumerate(vocab1)}
print(word_to_ix1)

{'quick': 0, 'dog': 1, 'brown': 2, 'over': 3, 'the': 4, 'jumped': 5, 'lazy': 6, 'fox': 7}


In [5]:
import torch
from torch.nn.functional import one_hot

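# Look up the index of every vocabulary word
words = torch.tensor([word_to_ix1[w] for w in vocab1], dtype=torch.long)

# Pass num_classes explicitly so the vector length always matches the vocabulary size
one_hot_encoding = one_hot(words, num_classes=len(vocab1))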
print(vocab1)
print(one_hot_encoding)

['quick', 'dog', 'brown', 'over', 'the', 'jumped', 'lazy', 'fox']
tensor([[1, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 1]])


The issue with one-hot encoding is that the vectors are very large and
very sparse: each vector is as long as the vocabulary, and every entry but
one is zero. For example, the popular data set WikiText-103 has about 267,000
words in its vocabulary, so each one-hot vector would hold around 267,000
entries, all zeros except for a single one.

We should try to find a smaller encoding for our dataset. Let's try a denser vector using a
Word Embedding.
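
To make the contrast with one-hot encoding concrete, here is a minimal sketch showing that an embedding lookup gives the same result as multiplying a one-hot vector by a weight matrix, only without materializing the sparse vector. The embedding size of 3 and the word index 4 are illustrative assumptions, and the weights are random (untrained).

```python
import torch
import torch.nn as nn
from torch.nn.functional import one_hot

torch.manual_seed(0)

vocab_size = 8      # same size as the toy vocabulary above
embedding_dim = 3   # illustrative value, chosen only for this sketch

# The embedding layer stores one row of embedding_dim numbers per word
embedding = nn.Embedding(vocab_size, embedding_dim)

word_index = torch.tensor([4], dtype=torch.long)  # arbitrary word index

# Dense lookup: select row 4 of the embedding matrix
dense_vector = embedding(word_index)

# Same result via one-hot: a (1 x 8) one-hot row times the (8 x 3) weight matrix
one_hot_vector = one_hot(word_index, num_classes=vocab_size).float()
via_matmul = one_hot_vector @ embedding.weight

print(dense_vector.shape)                        # torch.Size([1, 3])
print(torch.allclose(dense_vector, via_matmul))  # True
```

Each word is now described by 3 numbers instead of 8, and that gap grows with the vocabulary size.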

# Word Embedding Example

In [6]:
# Context is the number of words we are using as a context for the next word we want to predict
CONTEXT_SIZE = 2

# Embedding dimension is the size of the embedding vector
EMBEDDING_DIM = 10

# Size of the hidden layer
HIDDEN_DIM = 256

In [7]:
# We will use the "Tomorrow, and tomorrow, and tomorrow" soliloquy from Shakespeare's Macbeth
test_sentence = """Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.
""".lower().split()

In [8]:
# Build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

# Build the vocabulary and a word -> index lookup table
vocab2 = list(set(test_sentence))
word_to_ix2 = {word: i for i, word in enumerate(vocab2)}

[(['tomorrow,', 'and'], 'tomorrow,'), (['and', 'tomorrow,'], 'and'), (['tomorrow,', 'and'], 'tomorrow,')]


# N-Gram Language Model

An N-gram is a contiguous sequence of N words taken from a text. N-grams are useful here because the first N-1 words give us context from which a deep learning classifier can learn to predict the last word.

For a detailed post visit: https://www.microsoft.com/developerblog/2015/11/29/feature-representation-for-text-analyses/
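
As a small illustration of the N-gram definition, the sketch below extracts N-grams of any size from a token list. The helper name `ngrams` and the variable `example_trigrams` are hypothetical and not part of the notebook.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumped over the lazy dog".split()

bigrams = ngrams(tokens, 2)
example_trigrams = ngrams(tokens, 3)

print(bigrams[:3])            # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(len(example_trigrams))  # 7
```

The trigrams built earlier in this notebook are the same idea with N = 3, split into two context words and one target word.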

Here's what a diagram of our n-gram deep learning model would look like:


<img src="../images/network_next_word.png" style="width: 800px;"/>

# ReLU
Rectifier activation function: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
<img src="https://miro.medium.com/max/357/1*oePAhrm74RNnNEolprmTaQ.png" />
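
As a rough sketch of what the rectifier computes, the snippet below compares a manual version, written with `torch.clamp`, against `F.relu`. The input values are made up for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # made-up inputs

# ReLU keeps positive values and replaces negative values with zero
manual = torch.clamp(x, min=0)
builtin = F.relu(x)

print(builtin.tolist())              # [0.0, 0.0, 0.0, 0.5, 2.0]
print(torch.equal(manual, builtin))  # True
```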


# Softmax function
<a href="https://www.researchgate.net/figure/Softmax-function-image_fig1_325856086"><img src="https://www.researchgate.net/profile/Shen_Leixian/publication/325856086/figure/fig1/AS:723221292789765@1549440801787/Softmax-function-image.png" alt="Softmax function image"/></a>
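
A minimal sketch of the softmax computation, assuming three made-up class scores: each score is exponentiated and divided by the sum of the exponentials, so the results are positive and sum to one. It also shows how `F.log_softmax`, which the model below uses, relates to softmax.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 0.1]])  # made-up scores for three classes

# Softmax: exponentiate each score, then divide by the sum of the exponentials
manual = torch.exp(scores) / torch.exp(scores).sum(dim=1, keepdim=True)
probs = F.softmax(scores, dim=1)

print(torch.allclose(manual, probs))  # True
print(probs.sum().item())             # very close to 1.0

# log_softmax returns the logarithm of the softmax probabilities
log_probs = F.log_softmax(scores, dim=1)
print(torch.allclose(log_probs, torch.log(probs)))  # True
```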

# Training is based on preceding words
Predict the probability of a word given the words that precede it.

In [9]:
# Imports used by the model and the training loop
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        # One embedding_dim-sized vector per vocabulary word
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The context embeddings are concatenated before the hidden layer
        self.linear1 = nn.Linear(context_size * embedding_dim, HIDDEN_DIM)
        # One output score per vocabulary word
        self.linear2 = nn.Linear(HIDDEN_DIM, vocab_size)

    def forward(self, inputs):
        # Look up the context embeddings and flatten them into a single row
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        # Log-probabilities over the vocabulary, as expected by nn.NLLLoss
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

In [10]:
learning_rate = 0.001
losses = []
loss_function = nn.NLLLoss()  # negative log likelihood
model = NGramLanguageModeler(len(vocab2), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
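
The setup above pairs `nn.NLLLoss` with a model that returns log-softmax outputs. The sketch below, using made-up scores and a made-up target index, illustrates what that combination computes and that it matches `nn.CrossEntropyLoss` applied to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

scores = torch.randn(1, 5)   # made-up raw scores for a 5-word vocabulary
target = torch.tensor([2])   # made-up index of the correct word

log_probs = F.log_softmax(scores, dim=1)
nll = nn.NLLLoss()(log_probs, target)

# NLLLoss picks out the log-probability of the target word and negates it
print(torch.allclose(nll, -log_probs[0, 2]))  # True

# CrossEntropyLoss applies log_softmax and NLLLoss in one step
ce = nn.CrossEntropyLoss()(scores, target)
print(torch.allclose(nll, ce))                # True
```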

In [17]:
from tqdm import tqdm

for epoch in range(25):
    total_loss = 0

    iterator = tqdm(trigrams)
    for context, target in iterator:
        # e.g. (['tomorrow,', 'and'], 'tomorrow,')
        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix2[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix2[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
        iterator.set_postfix(loss=float(loss))
    losses.append(total_loss)
    # add progress bar with epochs

100%|██████████| 73/73 [00:00<00:00, 432.11it/s, loss=3.6] 
100%|██████████| 73/73 [00:00<00:00, 507.88it/s, loss=3.57]
100%|██████████| 73/73 [00:00<00:00, 508.73it/s, loss=3.54]
100%|██████████| 73/73 [00:00<00:00, 497.76it/s, loss=3.51]
100%|██████████| 73/73 [00:00<00:00, 481.95it/s, loss=3.49]
100%|██████████| 73/73 [00:00<00:00, 534.10it/s, loss=3.46]
100%|██████████| 73/73 [00:00<00:00, 475.73it/s, loss=3.43]
100%|██████████| 73/73 [00:00<00:00, 503.46it/s, loss=3.4] 
100%|██████████| 73/73 [00:00<00:00, 480.42it/s, loss=3.37]
100%|██████████| 73/73 [00:00<00:00, 509.81it/s, loss=3.34]
100%|██████████| 73/73 [00:00<00:00, 489.63it/s, loss=3.31]
100%|██████████| 73/73 [00:00<00:00, 497.34it/s, loss=3.28]
100%|██████████| 73/73 [00:00<00:00, 509.76it/s, loss=3.25]
100%|██████████| 73/73 [00:00<00:00, 510.08it/s, loss=3.22]
100%|██████████| 73/73 [00:00<00:00, 521.02it/s, loss=3.19]
100%|██████████| 73/73 [00:00<00:00, 507.04it/s, loss=3.16]
100%|██████████| 73/73 [00:00<00:00, 507

In [12]:
# Switch to evaluation mode. eval() returns the model itself,
# so the notebook also displays its structure here
model.eval()

NGramLanguageModeler(
  (embeddings): Embedding(59, 10)
  (linear1): Linear(in_features=20, out_features=256, bias=True)
  (linear2): Linear(in_features=256, out_features=59, bias=True)
)

Let's try this out!

In [13]:
with torch.no_grad():
    context = ['tomorrow,', 'and']
    context_idxs = torch.tensor([word_to_ix2[w] for w in context], dtype=torch.long)
    pred = model(context_idxs)
    print(pred)
    # The predicted word is the one with the highest log-probability
    index_of_prediction = torch.argmax(pred, dim=1).item()
    print(vocab2[index_of_prediction])

tensor([[-4.2813, -4.3018, -4.1416, -3.9168, -4.2167, -3.9121, -3.8985, -4.1464,
         -4.3967, -3.6064, -3.8017, -3.9178, -4.3452, -4.3658, -4.1338, -3.7384,
         -3.9032, -4.4974, -4.2858, -4.3245, -4.1646, -4.4215, -3.3084, -4.0751,
         -4.1101, -4.3610, -4.2177, -4.1962, -4.4140, -4.0597, -3.0100, -3.8150,
         -4.3852, -4.2511, -4.0806, -4.0710, -4.0321, -4.4454, -4.2866, -4.3894,
         -3.9729, -3.6467, -3.8749, -4.4524, -4.0009, -4.0418, -4.3280, -3.9617,
         -4.4341, -4.4680, -3.9659, -4.5848, -4.1615, -4.1848, -4.3031, -4.0144,
         -4.1116, -4.1997, -4.4411]])
tomorrow,
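
The values printed above are log-probabilities, which is why they are all negative. As a hedged sketch of how to read such an output, the snippet below converts log-probabilities back to probabilities and lists the top three candidates. `toy_vocab` and `toy_log_probs` are hypothetical stand-ins for `vocab2` and `pred`, filled with random values rather than real model output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for `vocab2` and the model output `pred`
toy_vocab = ['tomorrow,', 'and', 'creeps', 'in', 'this']
toy_log_probs = F.log_softmax(torch.randn(1, len(toy_vocab)), dim=1)

# exp() turns log-probabilities back into probabilities that sum to 1
probs = toy_log_probs.exp()

# topk returns the k largest values and their indices along the given dimension
top_probs, top_idxs = torch.topk(probs, k=3, dim=1)
for p, ix in zip(top_probs[0].tolist(), top_idxs[0].tolist()):
    print(f"{toy_vocab[ix]}: {p:.3f}")
```

With the trained model, the same steps would apply to `pred` and `vocab2` directly.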


# Next Steps
* RNN/LSTM/BiLSTM
* GloVe word embeddings: https://nlp.stanford.edu/projects/glove/
* fastai word-embeddings-workshop notebook: https://github.com/fastai/word-embeddings-workshop/blob/master/Word%20Embeddings.ipynb
* ELMo: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

# Exercise: Continuous Bag of Words
Continuous Bag of Words (CBOW) is a model that tries to predict a word based on a few words before and after it.

In [14]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the list of words
vocab3 = list(set(raw_text))
vocab_size = len(vocab3)

word_to_ix3 = {word: i for i, word in enumerate(vocab3)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


In [15]:
# Create your model and train it. Here is a helper function that prepares
# the data for use by your module

def make_context_vector(context, word_to_ix3):
    idxs = [word_to_ix3[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

make_context_vector(data[0][0], word_to_ix3)  # example

tensor([44,  4, 12, 42])

In [16]:
class CBOW(nn.Module):
    def __init__(self):
        pass

    def forward(self, inputs):
        pass
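
If you get stuck, here is one possible sketch of a CBOW model, not the only valid solution. It assumes the context embeddings are summed before a single linear layer. The class name `CBOWSketch`, the vocabulary size of 50, and the embedding size of 10 are illustrative assumptions. In the exercise you would pass `vocab_size` from the cell above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOWSketch(nn.Module):
    """Hypothetical CBOW variant: sum the context embeddings, then classify."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: 1-D tensor holding the indices of the context words
        embeds = self.embeddings(inputs)        # (context_len, embedding_dim)
        summed = embeds.sum(dim=0).view(1, -1)  # (1, embedding_dim)
        # Log-probabilities over the vocabulary for the center word
        return F.log_softmax(self.linear(summed), dim=1)

toy_vocab_size = 50  # placeholder, use the real vocab_size in the exercise
sketch_model = CBOWSketch(toy_vocab_size, embedding_dim=10)

# Four context indices, shaped like the output of make_context_vector
context_vector = torch.tensor([44, 4, 12, 42], dtype=torch.long)
log_probs = sketch_model(context_vector)
print(log_probs.shape)  # torch.Size([1, 50])
```

Training can then follow the same loop as the N-gram model: `nn.NLLLoss` on the log-probabilities, `loss.backward()`, and `optimizer.step()`.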

# Glossary

* word embedding -- a dense vector representation of a word
* one-hot encoding -- a sparse vector representation of a word: all zeros except for a single one at the word's index
* vocabulary -- the set of distinct words in your dataset
* tokenization -- the process of breaking down bodies of text into words
* ReLU function -- an activation function for neural networks that returns max(0, x), so its output is never negative
* softmax function -- an activation function that maps a vector of scores to a probability distribution
* negative log likelihood -- a loss function that penalizes a low predicted probability for the correct class; used in conjunction with log-softmax
* loss function -- also known as a "cost function"; a function that measures how far the model's predictions are from the targets
* Stochastic Gradient Descent (SGD) -- "an iterative method for optimizing an objective function" (https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
* learning rate -- the step size taken in one iteration of stochastic gradient descent
* autograd -- PyTorch's automatic differentiation package, which performs the backpropagation gradient calculations automatically so that a "backward" function does not need to be defined by the programmer
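
To tie the last three glossary entries together, here is a minimal sketch of one gradient descent step done by hand on a toy objective. The objective `(w - 3)^2` and the learning rate of 0.1 are made up for illustration. SGD applies the same update rule, but with a gradient estimated from a single example or a mini-batch.

```python
import torch

# A single parameter w and the toy objective f(w) = (w - 3)^2
w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1  # illustrative value

loss = (w - 3.0) ** 2
loss.backward()            # autograd computes d(loss)/dw = 2 * (w - 3)
grad_value = w.grad.item()
print(grad_value)          # -6.0

# One gradient descent step: w <- w - learning_rate * gradient
with torch.no_grad():
    w -= learning_rate * w.grad
w.grad.zero_()             # clear the gradient, as model.zero_grad() does above

print(w.item())            # roughly 0.6, closer to the minimum at w = 3
```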


References: 
* https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
* https://github.com/fastai/word-embeddings-workshop/blob/master/Word%20Embeddings.ipynb