In [3]:
from collections import defaultdict
import numpy as np

# **Word2vec**
Word2vec generates a separate vector for each word in the corpus. For example, the word **“happy”** can be represented as a 4-dimensional vector **[0.24, 0.45, 0.11, 0.49]** and **“sad”** as **[0.88, 0.78, 0.45, 0.91]**.


This whole process is called **word embedding**. The **motive** is to turn words into vectors of numbers so that linear algebra operations can be performed on them.
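As a small illustration of what "linear algebra operations" can mean here, the sketch below takes the two example vectors for “happy” and “sad” and computes their difference and cosine similarity. The helper name `cosine_similarity` is an assumption made for this example; it is not part of the notebook's class.

```python
import numpy as np

# Illustrative 4-dimensional vectors taken from the example above
happy = np.array([0.24, 0.45, 0.11, 0.49])
sad = np.array([0.88, 0.78, 0.45, 0.91])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

difference = sad - happy                      # element-wise subtraction
similarity = cosine_similarity(happy, sad)    # scalar between -1 and 1

print("Difference:", difference)
print("Cosine similarity:", round(similarity, 3))
```

Once words are numbers, comparing two words is an ordinary vector computation.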


# Types
- CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words
- Skip-gram: works well with a small amount of training data and represents even rare words or phrases well
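The two types differ mainly in what predicts what. The sketch below, written under the assumption of a plain token list and a symmetric window, builds the training pairs each type would see. The function names `skipgram_pairs` and `cbow_pairs` are hypothetical and used only for this illustration.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: the target word predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """(context_words, target) pairs: the neighbours jointly predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["i", "want", "to", "be"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
print(sg)
print(cb)
```

Skip-gram produces one pair per (target, neighbour) combination, while CBOW produces one sample per target position.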




<img src="https://miro.medium.com/max/800/0*o2FCVrLKtdcxPQqc.png">

# **Implementation** (Skipgram)
The implementation follows six steps.



* 1. **Data preparation**
   In reality, text data is unstructured and can be “dirty”. Cleaning it involves steps such as removing stop words and punctuation, converting text to lowercase (depending on your use case), and replacing digits.
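A minimal sketch of such a cleaning step is shown here. The stop-word set, the `<num>` placeholder and the function name `clean_text` are assumptions chosen for illustration; a real project would pick these to suit its own data.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"a", "to", "be", "in", "the"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, replace digits, strip punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "<num>", text)      # replace digit runs with a placeholder
    text = re.sub(r"[^\w\s<>]", " ", text)    # punctuation becomes whitespace
    return [tok for tok in text.split() if tok not in stop_words]

cleaned = clean_text("I want to be a Data Scientist, in 2 years!")
print(cleaned)
```

The toy corpus in the next cell only needs lowercasing, so the notebook skips the other steps.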
                               

In [4]:
# Initial corpus
text = "I want to be a genomic data scientist here in TU Kaiserslautern under MEC"


# Preprocessing the corpus
corpus = [[word.lower() for word in text.split()]]
print("Corpus is:", corpus)

Corpus is: [['i', 'want', 'to', 'be', 'a', 'genomic', 'data', 'scientist', 'here', 'in', 'tu', 'kaiserslautern', 'under', 'mec']]


* 2. **Hyperparameters**

In [27]:
# Creating a dictionary for the hyperparameters
hyperparameters= {'window_size': 2,
                  'embedding_size': 14,
                  'lr': 0.01,
                  'epochs': 300
                  }

**Sliding window** (`window_size`)
  Context words are the words neighbouring the target word. With a window size of 2, up to two words on each side of the target count as context.


<img src="https://miro.medium.com/max/560/1*tD7P83Bl7dB91iNwYHEmEg.png">
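To make the window concrete, the sketch below lists the context words for a few positions in the toy sentence with `window_size = 2`. The helper `context_window` is a hypothetical name for this example; the class further down does the same selection inline.

```python
sentence = ['i', 'want', 'to', 'be', 'a', 'genomic', 'data', 'scientist',
            'here', 'in', 'tu', 'kaiserslautern', 'under', 'mec']
window_size = 2

def context_window(sentence, i, window_size):
    """Words within window_size positions of index i, excluding the target itself."""
    start = max(0, i - window_size)
    end = min(len(sentence), i + window_size + 1)
    return [sentence[j] for j in range(start, end) if j != i]

for i in (0, 1, 5):
    print(sentence[i], "->", context_window(sentence, i, window_size))
```

Words near the start or end of the sentence get fewer context words because the window is clipped at the sentence boundary.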


**[n]** (`embedding_size`) This is the dimension of the word embedding. It typically ranges from 100 to 300 depending on your vocabulary size, and dimensions beyond 300 tend to have diminishing benefit. This toy example uses 14. Note that the dimension is also the size of the hidden layer.

**[epochs]** (`epochs`) This is the number of training epochs. In each epoch, we cycle through all training samples.

**[learning_rate]** (`lr`) The learning rate controls the amount of adjustment made to the weights with respect to the loss gradient.
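As a rough sketch of what these values imply for the toy corpus, the snippet below derives the weight-matrix shapes, the number of trainable parameters and the number of weight updates. It assumes the vocabulary of 14 unique words from the corpus above and one training sample per word position, which matches how the class below builds its training data.

```python
hyperparameters = {'window_size': 2, 'embedding_size': 14, 'lr': 0.01, 'epochs': 300}

vocab_size = 14                       # unique words in the toy corpus
n = hyperparameters['embedding_size']

w1_shape = (vocab_size, n)            # input layer -> hidden layer
w2_shape = (n, vocab_size)            # hidden layer -> output layer
n_parameters = vocab_size * n + n * vocab_size

n_samples = 14                        # one training sample per word position
n_updates = hyperparameters['epochs'] * n_samples

print("w1:", w1_shape, "w2:", w2_shape)
print("Trainable parameters:", n_parameters)
print("Weight updates over all epochs:", n_updates)
```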

<br>


* 3. **Generation of Training Data**


<img src="https://miro.medium.com/max/800/1*vunPUSipHyot3vvwLcND_w.png">



In [28]:
class word2vec():
  def __init__(self):
    self.window = hyperparameters['window_size']
    self.epochs = hyperparameters['epochs']
    self.learning_rate = hyperparameters['lr']
    self.embedding_size = hyperparameters['embedding_size']


  def generate_training_data(self, hyperparameters, corpus):
   
    # Find unique words using dictionary
    word_count = defaultdict(int)
    
    # We have only one row here and we have approximately 14 words in it
    for row in corpus:
      for word in row:
        word_count[word]+=1
    
    # Unique words
    self.u_counts = len(word_count.keys())
    print("Unique words = ", self.u_counts)

    # Creating a lookup dictionary or vocabulary
    self.words_list = list(word_count.keys())
    #print(self.words_list)


    # Generate word:index
    self.word_index = dict((word, i) for i, word in enumerate(self.words_list))
    #print("Generated word:index", self.word_index)

    # Generate index:word
    self.index_word =dict((i, word) for i, word in enumerate(self.words_list))
    #print("Generated word:index", self.index_word)


    training_data=[]

    # Cycle through each sentence in corpus
    for sentence in corpus:
      sent_len = len(sentence)

      # Cycle through each word in sentence
      for i, word in enumerate(sentence):
        # print("Petrit: ",i, word)
        # Convert target word to one-hot
        w_target = self.word2onehot(sentence[i])
        print("One hot encoded vectors: ", w_target)
        w_context = []

        # Collect the context words inside the window around position i
        for j in range(i - self.window, i + self.window + 1):
          # Criteria for a context word:
          # 1. The target word cannot be its own context word (j != i)
          # 2. The index must be at least 0 (j >= 0), otherwise Python wraps around to the end of the list
          # 3. The index must be at most sent_len - 1 (j <= sent_len - 1), otherwise the list index is out of range

          if j != i and j <= sent_len-1 and j >= 0:

            # Append the one-hot representation of word to w_context
            w_context.append(self.word2onehot(sentence[j]))

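            # print(sentence[i], sentence[j])
        # training_data contains a one-hot representation of the target word and its context words
        training_data.append([w_target, w_context])
    # The context lists have different lengths near the sentence edges, so an explicit
    # object dtype is needed; recent NumPy versions raise an error on ragged input otherwise
    return np.array(training_data, dtype=object)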



  def word2onehot(self, word):

    # Initializing one hot vector with 14 values
    word_vec = [0 for i in range (0, self.u_counts)]
    #print(word_vec)

    # Now we have to get the index of the word
    word_index = self.word_index[word]

    # Change the value to 1 according to the index
    word_vec[word_index] = 1

    return word_vec

  def initialize_weights(self, training_data):
      self.w1  = np.random.uniform(-1,1, (self.u_counts, self.embedding_size))
      self.w2 = np.random.uniform(-1, 1, (self.embedding_size, self.u_counts))

  def train(self, training_data):
    losses=[]
    count=0
    for i in range (self.epochs):
      # Initialize loss to 0 at the start of every epoch
      
      self.loss = 0
      
      # Forward pass
      for w_target, w_context in training_data:

        count+=1

        # print(w_target, "and ", w_context)
        # 1. predicted y using softmax (y_pred) 2. matrix of hidden layer (h) 3. output layer before softmax (u)
        pred, hidden, s_wio_activation = self.feed_forward_pass(w_target)
        
        # Backward Pass
        # Calculate error
        # 1. For a target word, calculate difference between y_pred and each of the context words
        # 2. Sum up the differences using np.sum to give us the error for this particular target word


        EI = np.sum([np.subtract(pred, word) for word in w_context], axis=0)
        


        #print("The error value is:", EI, "count = ", count)
        # Backpropagation
        # We use SGD to backpropagate errors - calculate loss on the output layer 


        self.backpropagation(EI, hidden, w_target)
        #losses.append(self.loss)
        # Calculate loss
        # There are 2 parts to the loss function
        # Part 1: negative sum of the output scores (before softmax) at the positions of the context words
        # Part 2: number of context words * log of the sum of exponentials of all output scores (u)
        # Note: word.index(1) returns the position of the 1 in a one-hot context word vector
        # Note: s_wio_activation[word.index(1)] is the output score (before softmax) for that context word
        
        self.loss += -np.sum([s_wio_activation[word.index(1)] for word in w_context]) + len(w_context) * np.log(np.sum(np.exp(s_wio_activation)))
        
      print('Epoch:', i, "Loss:", self.loss)



  def backpropagation(self, e, h, x):
    # https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.outer.html
    # Column vector e (EI) is the sum of prediction errors over all context words of the current center word
    # Going backwards, we take the derivative of the loss with respect to w2
    # h - shape (embedding_size,), e - shape (vocab_size,), dl_dw2 - shape (embedding_size, vocab_size)

    dl_dw2 = np.outer(h, e)
    #print("H SHape", h.shape, "e", e.shape, "dl_w2", dl_dw2.shape )
    # x - shape (vocab_size,), w2 - shape (embedding_size, vocab_size)
    # np.dot(w2, e) - shape (embedding_size,), dl_dw1 - shape (vocab_size, embedding_size)
    dl_dw1 = np.outer(x, np.dot(self.w2, e.T))

    # Update weights
    self.w1 = self.w1 - (self.learning_rate * dl_dw1)
    self.w2 = self.w2 - (self.learning_rate * dl_dw2)
    
  def feed_forward_pass(self, x):
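    # w1 has shape (vocab_size, embedding_size), so it is transposed before multiplying
    # by the one-hot vector x; the result is the row of w1 that belongs to the target word
    # Hidden layer, computed without any activation function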
    h = np.dot(self.w1.T, x)

    # Getting Output layer without activation
    u = np.dot(self.w2.T, h )

    # Passing output through softmax
    y_context = self.softmax(u)

    return y_context, h, u

  def softmax(self, x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
  
  def backp_error_loss(self, training_loss):

    for i in range (self.epochs):
      self.loss


* 4. **Model Training**

The diagram below shows the skip-gram architecture.

<img src="https://miro.medium.com/max/560/1*uuVrJhSF89KGJPltvJ4xhg.png">

**Forward pass of the training**

In [29]:
# Initialize the class object
wvec = word2vec()

# Pass the arguments and get the ndarray of one hot encoded vectors
training_data = wvec.generate_training_data(hyperparameters, corpus)
wvec.initialize_weights(training_data)
wvec.train(training_data)

Unique words =  14
One hot encoded vectors:  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
One hot encoded vectors:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Epoch: 0 Loss: 

A more computational view of the feed-forward pass:

<img src="https://miro.medium.com/max/800/1*JHqzFok6Vz60HqoDf0OogQ.png">
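The same forward pass can be sketched outside the class with deliberately small, hypothetical sizes (a 5-word vocabulary and 3 embedding dimensions) so that the shapes are easy to tell apart. The random seed and the chosen target index are assumptions for this example only.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D score vector."""
    e_u = np.exp(u - np.max(u))
    return e_u / e_u.sum()

vocab_size, embedding_size = 5, 3    # hypothetical small sizes for readability
rng = np.random.default_rng(42)
w1 = rng.uniform(-1, 1, (vocab_size, embedding_size))
w2 = rng.uniform(-1, 1, (embedding_size, vocab_size))

x = np.zeros(vocab_size)
x[2] = 1                     # one-hot target word at index 2

h = w1.T @ x                 # (embedding_size,): equals row 2 of w1
u = w2.T @ h                 # (vocab_size,): one raw score per vocabulary word
y_pred = softmax(u)          # (vocab_size,): probabilities that sum to 1

print("h:", h.shape, "u:", u.shape, "y_pred sum:", y_pred.sum())
```

Because `x` is one-hot, the product `w1.T @ x` simply selects one row of `w1`, which is why the rows of `w1` end up serving as the word embeddings.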

**Training — Error, Backpropagation and Loss**

**Error**: With `y_pred`, `h` and `u` available, we calculate the error for this particular set of target and context words by summing the differences between `y_pred` and each of the context words in `w_c` (`w_context` in the code above).

<img src="https://miro.medium.com/max/560/1*pp5kV6uF7S0exTujskhbZw.png">
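A tiny worked example of this error calculation follows. The prediction vector and the two context words are made-up values over a hypothetical 4-word vocabulary; only the summation itself mirrors the `train` method above.

```python
import numpy as np

# Hypothetical softmax output over a 4-word vocabulary
y_pred = np.array([0.1, 0.6, 0.2, 0.1])

# One-hot vectors of the two context words (vocabulary indices 1 and 2)
w_context = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]

# Sum of (prediction - context word) over all context words
EI = np.sum([np.subtract(y_pred, word) for word in w_context], axis=0)
print(EI)
```

Each entry equals the number of context words times the predicted probability, minus how often that word actually appears as context. The entries sum to zero because `y_pred` and every one-hot vector each sum to 1.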


**Backpropagation**
Next, we adjust the weights by calling the `backpropagation` function with the error `EI`, the hidden layer `h` and the one-hot vector of the target word `w_target`.

To update the weights, we multiply the gradients (`dl_dw1` and `dl_dw2`) by the learning rate and subtract the result from the current weights (`w1` and `w2`).


<img src="https://miro.medium.com/max/560/1*ZWoH_NpUFGCPuHmtXUW5AA.png">
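One way to gain confidence in these update formulas is a numerical gradient check. The sketch below uses hypothetical small sizes and randomly initialised weights, recomputes the loss from the `train` method, compares one analytic gradient entry against a central-difference estimate, and then applies a single update step. The names `loss_fn`, `V` and `N` are assumptions for this example.

```python
import numpy as np

def softmax(u):
    e_u = np.exp(u - np.max(u))
    return e_u / e_u.sum()

def loss_fn(w1, w2, x, context_idx):
    """Skip-gram loss for one target word, same formula as in train()."""
    u = w2.T @ (w1.T @ x)
    return -np.sum(u[context_idx]) + len(context_idx) * np.log(np.sum(np.exp(u)))

rng = np.random.default_rng(0)
V, N = 5, 3                            # hypothetical vocabulary and embedding sizes
w1 = rng.uniform(-1, 1, (V, N))
w2 = rng.uniform(-1, 1, (N, V))
x = np.eye(V)[0]                       # target word: index 0
context_idx = [1, 2]                   # context words: indices 1 and 2

# Analytic gradients, same formulas as backpropagation()
h = w1.T @ x
y_pred = softmax(w2.T @ h)
EI = np.sum([y_pred - np.eye(V)[c] for c in context_idx], axis=0)
dl_dw2 = np.outer(h, EI)
dl_dw1 = np.outer(x, w2 @ EI)

# Numerical gradient of a single w2 entry by central differences
eps = 1e-6
w2_plus, w2_minus = w2.copy(), w2.copy()
w2_plus[1, 2] += eps
w2_minus[1, 2] -= eps
numeric = (loss_fn(w1, w2_plus, x, context_idx)
           - loss_fn(w1, w2_minus, x, context_idx)) / (2 * eps)
print("analytic:", dl_dw2[1, 2], "numeric:", numeric)

# One update step with a small learning rate should lower the loss for this sample
lr = 0.01
loss_before = loss_fn(w1, w2, x, context_idx)
loss_after = loss_fn(w1 - lr * dl_dw1, w2 - lr * dl_dw2, x, context_idx)
print("loss before:", loss_before, "after:", loss_after)
```

If the analytic and numeric values agree to several decimal places, the outer-product formulas for `dl_dw1` and `dl_dw2` are consistent with the loss being minimised.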