
# Read from here:

### https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

What we really want is a notion of similarity between words.

Word embeddings are a technique to combat the sparsity of linguistic data by connecting the dots between what we have seen and what we haven’t. The features that encode this similarity should be learned by the deep network itself rather than designed by the programmer.

SO WHY NOT LET THE WORD EMBEDDINGS BE THE PARAMETERS IN OUR MODEL WHICH ARE THEN UPDATED DURING TRAINING 

Word embeddings will probably not be interpretable. 

# Word Embeddings : Encoding Lexical Semantics


In [1]:
"""
Word embeddings in Python (PyTorch)

We need to define an index for each word
when using embeddings. These will be the keys
into a lookup table.

Thus the embeddings are stored as a |V| x D matrix,
where |V| is the vocabulary size and D is the
dimensionality of the embeddings.

word with index i --> ith row of the matrix

The mapping from words to indices is a dictionary
named word_to_ix.

"""

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}

print(word_to_ix)

embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings

# lookup_tensor holds the indices of the words to look up,
# stored as a long (int64) tensor
lookup_tensor = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)

# returns one row of the embedding matrix per index, so despite its name
# hello_embed holds the vectors for both "hello" and "world"
hello_embed = embeds(lookup_tensor)

print(hello_embed)
print(hello_embed.shape)

{'hello': 0, 'world': 1}
tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519],
        [-0.1661, -1.5228,  0.3817, -1.0276, -0.5631]])
torch.Size([2, 5])
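
To make the "lookup table" picture concrete, here is a minimal sketch (same toy vocabulary as above) showing that calling the layer with an index returns the matching row of `embeds.weight`, and that this matrix is an ordinary trainable parameter. The `sum()` used as a stand-in loss is only an assumption for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # |V| = 2 rows, D = 5 columns

# The layer's only parameter is a single |V| x D matrix.
print(embeds.weight.shape)  # torch.Size([2, 5])

# Looking up an index returns the corresponding row of that matrix.
idx = torch.tensor([word_to_ix["world"]], dtype=torch.long)
world_embed = embeds(idx)
same_row = torch.equal(world_embed[0], embeds.weight[word_to_ix["world"]])
print(same_row)  # True

# Stand-in loss (illustrative only): gradients reach just the rows that were looked up.
world_embed.sum().backward()
print(embeds.weight.grad)
```

Only the row for "world" receives a non-zero gradient here, which is why embeddings for words that never appear in training stay at their initial values.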


In [2]:
"""
A short description of the embedding layer.

The constructor arguments are (num_embeddings, embedding_dim, 
               padding_idx=None, max_norm=None, 
               norm_type=2, scale_grad_by_freq=False, 
               sparse=False, _weight=None)

Read more here:
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding

num_embeddings = size of the dictionary (vocabulary)
embedding_dim = dimensionality of each embedding vector

You can also load a pretrained embedding:
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding.from_pretrained

"""
print(embeds)

Embedding(2, 5)
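
A small sketch of two of the options mentioned above: loading an existing matrix with `nn.Embedding.from_pretrained` and reserving a padding row with `padding_idx`. The weight values are made up for illustration and do not come from any real pretrained model.

```python
import torch
import torch.nn as nn

# Hypothetical "pretrained" vectors: 3 words, 4 dimensions each.
pretrained = torch.tensor([[0.0, 0.0, 0.0, 0.0],
                           [0.1, 0.2, 0.3, 0.4],
                           [0.5, 0.6, 0.7, 0.8]])

# freeze=True (the default) keeps the vectors fixed during training.
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
print(frozen.weight.requires_grad)  # False

# freeze=False lets the optimizer fine-tune the loaded vectors.
tunable = nn.Embedding.from_pretrained(pretrained, freeze=False)
print(tunable.weight.requires_grad)  # True

# The lookup returns the loaded rows unchanged.
row_two = frozen(torch.tensor([2], dtype=torch.long))[0]

# padding_idx marks one row as padding: it starts as zeros and is not updated.
padded = nn.Embedding(num_embeddings=3, embedding_dim=4, padding_idx=0)
print(padded.weight[0])
```

Freezing is the usual choice when the pretrained vectors should be used as fixed features; unfreezing treats them as an initialization only.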


# N-Gram Language Modeling

In an n-gram language model, given a sequence of words w, we want to compute P( w_i | w_(i-1), w_(i-2), ..., w_(i-n+1) ), where w_i is the ith word of the sequence.
<br/>
We compute the loss on some training examples and then update the parameters with backpropagation.
<br/>

In [3]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

# We will use Shakespeare Sonnet 2

test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# we should tokenize the input, but we will ignore that for now
# build a list of tuples. 
# Each tuple is ([ word_i-2, word_i-1 ], target word)
# THIS IS WHAT IS MEANT BY CONTEXT 
# WE WILL CONSIDER ONLY TWO PREVIOUS WORDS HENCE CONTEXT SIZE = 2

trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

# print the first 3, just so you can see what they look like
print(trigrams[:3])

# get the unique words 
vocab = set(test_sentence)

word_to_ix = {word: i for i, word in enumerate(vocab)}

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
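
The trigram construction above hardcodes a context of two words. The sketch below generalizes it to any context size; `make_ngrams` is a hypothetical helper name, not part of the tutorial. It also sorts the vocabulary before indexing, since the iteration order of a `set` of strings can change between interpreter runs, which makes `word_to_ix` (and therefore saved embeddings) hard to reproduce.

```python
def make_ngrams(tokens, context_size):
    """Return (context, target) pairs using `context_size` preceding words."""
    return [
        (tokens[i:i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


tokens = "When forty winters shall besiege thy brow,".split()

trigrams = make_ngrams(tokens, context_size=2)
print(trigrams[0])    # (['When', 'forty'], 'winters')
print(len(trigrams))  # 5

# Sorting gives the same word -> index mapping on every run.
word_to_ix = {word: i for i, word in enumerate(sorted(set(tokens)))}
print(word_to_ix)
```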


In [4]:
# total number of tokens in the text
print(len(test_sentence))

# number of unique tokens (vocabulary size)
print(len(set(test_sentence)))

print(type(trigrams))
print("length of trigrams: {}".format(len(trigrams)))

for i in range(len(trigrams)):
    print(trigrams[i])
    if i+1==3:
        break

print('the length of vocab')
print(len(vocab))

"""
for i, word in enumerate(vocab):
    print(i, word)
"""

print(word_to_ix)

115
97
<class 'list'>
length of trigrams: 113
(['When', 'forty'], 'winters')
(['forty', 'winters'], 'shall')
(['winters', 'shall'], 'besiege')
the length of vocab
97
{'the': 0, "totter'd": 1, 'trenches': 50, 'When': 51, 'Were': 2, 'a': 3, "excuse,'": 52, 'small': 53, 'to': 91, 'and': 5, 'days;': 6, 'old': 7, 'lusty': 9, 'an': 10, 'own': 11, 'beauty': 12, 'thine': 55, 'were': 56, 'thou': 13, 'lies,': 57, 'eyes,': 58, 'besiege': 61, 'all': 15, 'sunken': 16, 'of': 18, "'This": 62, 'Where': 48, 'where': 63, 'praise.': 64, 'much': 20, 'If': 21, 'treasure': 34, 'being': 65, 'within': 67, 'in': 68, 'my': 4, 'Will': 69, "feel'st": 60, 'all-eating': 22, 'proud': 23, 'sum': 70, 'field,': 24, 'mine': 92, 'make': 25, "deserv'd": 71, 'it': 72, 'deep': 17, 'dig': 73, 'when': 76, 'thy': 28, 'more': 93, 'forty': 74, 'shame,': 75, 'held:': 26, 'made': 27, 'worth': 77, 'answer': 78, 'Then': 79, 'Proving': 29, 'new': 30, "beauty's": 31, 'livery': 80, 'Thy': 33, 'his': 81, 'succession': 87, 'cold.': 82, '

In [26]:
"""
A little explanation.
The embedding layer converts the input word indices into vectors,
which are concatenated and passed through --> linear1 --> relu 
--> linear2 --> log_softmax

Observe that the linear1 layer maps (context_size * embedding_dim) --> (128),
which means it takes the word embeddings of the two context words 
together as one input and converts them to a 128-dim vector.

In this example the context size is two words, which are used to 
predict the third word. 
"""

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        print("self.embeddings.shape Embedding(vocab size,dimension):",self.embeddings)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        print("self.linear1.shape: ",self.linear1)
        self.linear2 = nn.Linear(128, vocab_size)
        print("self.linear2.shape :",self.linear2)
        
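    def forward(self, inputs):
        # raw lookup result, one row per context word; it is also returned
        # so that it can be inspected from the training loop
        dummy_out = self.embeddings(inputs)
        print("dummy_out shape: ", dummy_out)
        # concatenate the context embeddings into a single row
        embeds = dummy_out.view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs, dummy_out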


In [27]:
losses = []
loss_function = nn.NLLLoss()

model = NGramLanguageModeler(
    vocab_size=len(vocab),
    embedding_dim= EMBEDDING_DIM,
    context_size= CONTEXT_SIZE)

optimizer = optim.SGD(model.parameters(), lr=0.001)

self.embeddings.shape Embedding(vocab size,dimension): Embedding(97, 10)
self.linear1.shape:  Linear(in_features=20, out_features=128, bias=True)
self.linear2.shape : Linear(in_features=128, out_features=97, bias=True)
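
To see how the three layers printed above fit together, here is a sketch that traces tensor shapes through the same pipeline outside of the class. The two context indices are arbitrary placeholders, not real words from the sonnet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

VOCAB_SIZE, EMBEDDING_DIM, CONTEXT_SIZE = 97, 10, 2

embeddings = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
linear1 = nn.Linear(CONTEXT_SIZE * EMBEDDING_DIM, 128)
linear2 = nn.Linear(128, VOCAB_SIZE)

# Two arbitrary word indices stand in for a real context.
context_idxs = torch.tensor([3, 41], dtype=torch.long)

looked_up = embeddings(context_idxs)   # one row per context word
flat = looked_up.view((1, -1))         # rows concatenated into a single row
hidden = F.relu(linear1(flat))
log_probs = F.log_softmax(linear2(hidden), dim=1)

for name, t in [("looked_up", looked_up), ("flat", flat),
                ("hidden", hidden), ("log_probs", log_probs)]:
    print(name, tuple(t.shape))

# exp(log_probs) is a probability distribution over the vocabulary.
total_prob = log_probs.exp().sum().item()
print(total_prob)
```

The shapes go (2, 10) → (1, 20) → (1, 128) → (1, 97), so the output holds one log probability per vocabulary word.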


In [28]:
print(model)

for params in model.named_parameters():
    print(params[0])

NGramLanguageModeler(
  (embeddings): Embedding(97, 10)
  (linear1): Linear(in_features=20, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=97, bias=True)
)
embeddings.weight
linear1.weight
linear1.bias
linear2.weight
linear2.bias


In [None]:
import pdb

for epoch in range(10):
    
    total_loss = 0
    
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model 
        # (i.e, turn the words into integer indices and wrap 
        # them in tensors)
        print(context, target)
        
        context_idxs = torch.tensor([word_to_ix[w] for w in context], 
                                    dtype=torch.long)
        
        # as context size is "2"
        print(context_idxs.shape)
        
        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        
        """
        Let us study this part a bit more.
        dummy_output is a tensor of shape
        context_size X embedding_dim,
        
        which is reshaped into a tensor of shape
        1 X (context_size * embedding_dim)
        before it goes to the linear1 layer.
        """
        _, dummy_output = model(context_idxs)
        print("dummy_output shape: {}".format(dummy_output.shape))
        pdb.set_trace()
        dummy_output = dummy_output.view((1, -1))
        print("dummy_output shape: {}".format(dummy_output.shape))
        pdb.set_trace()
        
        log_probs, _ = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor(
            [word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
        
    losses.append(total_loss)
    
print(losses)  # The loss decreased every iteration over the training data!


['When', 'forty'] winters
torch.Size([2])
dummy_out shape:  tensor([[-0.1773, -1.2818,  0.0191,  0.0305, -0.5312, -0.0357,  0.1381,
         -1.3279, -2.3791,  0.0060],
        [-0.4342,  0.2437,  1.6761, -0.1521, -0.6779,  1.2820, -1.0480,
          0.5994, -1.8145,  1.0090]])
dummy_output shape: torch.Size([2, 10])
> <ipython-input-30-ed1c49076720>(40)<module>()
-> dummy_output = dummy_output.view((1, -1))
(Pdb) c
dummy_output shape: torch.Size([1, 20])
> <ipython-input-30-ed1c49076720>(44)<module>()
-> log_probs, _ = model(context_idxs)
(Pdb) dummy_output
tensor([[-0.1773, -1.2818,  0.0191,  0.0305, -0.5312, -0.0357,  0.1381,
         -1.3279, -2.3791,  0.0060, -0.4342,  0.2437,  1.6761, -0.1521,
         -0.6779,  1.2820, -1.0480,  0.5994, -1.8145,  1.0090]])
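
The loop above is instrumented for exploration: it runs the model twice per example and stops in `pdb` on every step. For an actual training run, a streamlined version might look like the sketch below. It assumes a shortened text (the first two lines of the sonnet), a model class renamed to `NGramLM` that returns only the log probabilities, and a learning rate of 0.01. All three are choices made for this example, not the notebook's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

tokens = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,""".split()

ngrams = [(tokens[i:i + CONTEXT_SIZE], tokens[i + CONTEXT_SIZE])
          for i in range(len(tokens) - CONTEXT_SIZE)]
word_to_ix = {w: i for i, w in enumerate(sorted(set(tokens)))}


class NGramLM(nn.Module):  # hypothetical name; same layers, no debug output
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)


model = NGramLM(len(word_to_ix), EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr chosen for this sketch

losses = []
for epoch in range(10):
    total_loss = 0.0
    for context, target in ngrams:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)

        model.zero_grad()                    # clear gradients from the previous example
        log_probs = model(context_idxs)      # single forward pass
        loss = loss_function(log_probs, target_idx)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)

print(losses)
```

After training, the learned vectors are available as `model.embeddings.weight`, one row per entry of `word_to_ix`.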


# Computing Word Embeddings: Continuous Bag-of-Words


"""
Let us explain it in a little detail.

CBOW predicts a word given the context of a few words before and a few
words after the target word.

Unlike language modeling, it is not a sequential model and does not have
to be probabilistic.

Typically, CBOW is used to quickly train word embeddings, and these 
embeddings are used to initialize the embeddings of some more complicated 
model. Usually, this is referred to as pretraining embeddings. It 
almost always helps performance a couple of percent.

The model is like this:

given a target word w_i and a context window of N words on each side,
i.e., w_(i-1), ..., w_(i-N) and w_(i+1), ..., w_(i+N),
refer to all context words collectively as C.

CBOW tries to minimize 
-log P(w_i|C) = -log softmax ( A * \sum_(w \in C) q_w + b)

where q_w is the embedding of word w. In other words: collect all the
context words, look up their embeddings, add them up, multiply the sum
by A, and add the bias b to get the final representation.

From this, compute the probability of the target word.

The final dimension is the size of the vocabulary (one score per
candidate target word).
The loss is NLL loss on the log-softmax output, or equivalently
cross-entropy loss on the raw scores.

"""
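
The CBOW objective described above can be written out with plain tensors, which makes the dimensions explicit. This is a sketch with randomly initialized `q`, `A` and `b`, and made-up sizes and indices; nothing here is trained.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)

V, D = 6, 4  # hypothetical vocabulary size and embedding dimension

q = torch.randn(V, D)   # one embedding q_w per word, stored as rows
A = torch.randn(V, D)   # maps the summed context vector to |V| scores
b = torch.randn(V)

context = torch.tensor([0, 1, 3, 4])  # indices of the words in C
target = 2                            # index of the target word w_i

summed = q[context].sum(dim=0)        # sum of q_w over w in C, shape (D,)
scores = A @ summed + b               # shape (V,): one score per vocabulary word
log_probs = F.log_softmax(scores, dim=0)
loss = -log_probs[target]             # -log P(w_i | C)

print(tuple(scores.shape))
print(loss.item())
```

Because the scores have one entry per vocabulary word, the output dimension is indeed |V|, and the quantity computed here is the same value `F.cross_entropy` would return for these scores and target.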


In [9]:

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}


In [10]:

"""
collecting the data in terms of the context
"""

data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
    
print(data[:5])

for i in range(len(data)):
    if i==5:
        break
    print(data[i])


[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]
(['We', 'are', 'to', 'study'], 'about')
(['are', 'about', 'study', 'the'], 'to')
(['about', 'to', 'the', 'idea'], 'study')
(['to', 'study', 'idea', 'of'], 'the')
(['study', 'the', 'of', 'a'], 'idea')


In [11]:

class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass


In [12]:

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


print(data[0][0], data[0][1])

# data[0][0] is the input 
# data[0][1] is the target

make_context_vector(data[0][0], word_to_ix)  # example


['We', 'are', 'to', 'study'] about


tensor([41, 46, 23,  4])
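
One possible way to complete the exercise is sketched below: a `CBOW` module that sums the context embeddings and applies a single linear layer (playing the role of A and b), followed by a training loop. The embedding dimension, learning rate, epoch count and the shortened text are assumptions made for this example; the tutorial leaves these choices open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

CONTEXT_SIZE = 2    # 2 words to the left, 2 to the right
EMBEDDING_DIM = 10  # assumption: not fixed by the notebook for CBOW

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.""".split()

word_to_ix = {w: i for i, w in enumerate(sorted(set(raw_text)))}

data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = raw_text[i - CONTEXT_SIZE:i] + raw_text[i + 1:i + CONTEXT_SIZE + 1]
    data.append((context, raw_text[i]))


def make_context_vector(context, word_to_ix):
    return torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)  # A and b

    def forward(self, inputs):
        # sum the context embeddings, then score every vocabulary word
        summed = self.embeddings(inputs).sum(dim=0).view((1, -1))
        return F.log_softmax(self.linear(summed), dim=1)


model = CBOW(len(word_to_ix), EMBEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr chosen for this sketch

losses = []
for epoch in range(20):
    total_loss = 0.0
    for context, target in data:
        model.zero_grad()
        log_probs = model(make_context_vector(context, word_to_ix))
        target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)
        loss = loss_function(log_probs, target_idx)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)

print(tuple(log_probs.shape))
print(losses[0], losses[-1])
```

Summing (rather than concatenating) the context embeddings is what makes this a "bag" of words: the order of the context words does not affect the prediction.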