### Word2Vec Implementation from Scratch

# Introduction
 Word2Vec is a popular technique for word embeddings, which captures the meaning of words by placing them in a continuous vector space.
 In this exercise, you will implement Word2Vec using NumPy and complete the missing parts of the code.
We will represent each word as a one-hot vector, meaning each word in the vocabulary is mapped to a unique binary vector with only one active (1) position.

import necessary libraries

In [67]:
import numpy as np
from collections import defaultdict

In [123]:
### Adjust the hyperparameters if needed ###
settings = {
	'window_size': 2,
	'n': 4,
	'epochs': 150,
	'learning_rate': 0.0001
}

In [124]:
class word2vec:
    def __init__(self, settings):
        """
        Initialize the Word2Vec model with given hyperparameters.
        """
        ## Start code
        self.n = settings['n']
        self.lr = settings['learning_rate']
        self.epochs = settings['epochs']
        self.window = settings['window_size']
        ## End code

    def generate_training_data(self, corpus):
        """
        Generate training data from the given corpus.
        This function processes the input corpus to create training examples for the Word2Vec model.
        - It first counts the occurrences of each word in the corpus.
        - Then, it creates a vocabulary of unique words and assigns each word a unique index.
        - Finally, it generates training pairs consisting of target words and their surrounding context words.
        """
        ## Start code
        # Each word frequency in the corpus
        word_counts = defaultdict(int)
        for sentence in corpus:
          for word in sentence:
            word_counts[word] += 1


        self.v_count = len(word_counts.keys()) # Total Vocab len
        self.words_list = list(word_counts.keys()) # Vocabs
        self.word_index = {word: i for i, word in enumerate(self.words_list)} # Each vocab index
        self.index_word = {i: word for word, i in self.word_index.items()} # #Each index vocab
        training_data = []
        # Generating Word Context pairs in each sentence of corpus.
        # We process Each word in the sentecne and gets its index in word_index list and look at window before and after that vocab.
        # At the end we add the pairs into our training_data
        for sentence in corpus:
          for i, word in enumerate(sentence):
            id = self.word_index[word]
            start = max(0, i - self.window)
            end = min(len(sentence), i + self.window + 1)
            for j in range(start, end):
              if i != j:
                curr_word = sentence[j]
                training_data.append([id, self.word_index[curr_word]])

        training_data = np.array(training_data)
        return training_data
        ## End code

    def word2onehot(self, word):
        """
        Convert a word into a one-hot encoded vector.
        Output:
        - A one-hot vector of length equal to the vocabulary size.
        """
        ## Start code
        # one hot for i : [0 0 0 0 .. 1 (i-th index) 0 .. 0]
        word_vec = np.zeros(self.v_count)
        word_vec[self.word_index[word]] = 1
        return word_vec
        ## End code

    def train(self, training_data):
        """
        Train the model using the given training data.
        This function initializes the weight matrices and performs forward and backward propagation.
        - Initializes weight matrices w1 (input to hidden) and w2 (hidden to output)
        - Iterates through training data and performs forward pass
        - Computes the error and updates weights using backpropagation
        - Tracks and prints loss for each epoch
        """
        ## Start code
        # Two matrices for weights (initialized values will be random)
        self.w1 = np.random.uniform(-1, 1, (self.v_count, self.n))
        self.w2 = np.random.uniform(-1, 1, (self.n, self.v_count))

        for epoch in range(self.epochs):
          loss = 0
          for target, context in training_data:
            x = self.word2onehot(self.index_word[target])
            y_true = self.word2onehot(self.index_word[context])
            y_pred, h, u = self.forward_pass(x)
            e = y_true - y_pred
            self.backprop(e, h, x)

            loss += -np.sum(y_true * np.log(y_pred + 1e-9))
          print('Epoch:', epoch + 1, 'Loss:', loss.round(4))

        ## End code

    def softmax(self, x):
        """
        Apply softmax function.
        This function normalizes the input values into probabilities, ensuring that they sum to 1.
        - It exponentiates each value in x to ensure non-negativity.
        - It divides each exponentiated value by the sum of all exponentiated values to normalize them into a probability distribution.
        Output:
        - A probability distribution where the sum of all elements equals 1.
        """
        ## Start code
        e_x = np.exp(x - np.max(x))
        e_x = e_x / np.sum(e_x)
        return e_x
        ## End code

    def forward_pass(self, x):
        """
        Forward pass through the network.
        This function takes a one-hot encoded word vector as input and performs the following steps:
        - Computes the hidden layer by multiplying the input vector with the first weight matrix.
        - Computes the output layer values by multiplying the hidden layer with the second weight matrix.
        - Applies the softmax function to get the probability distribution over the vocabulary.
        Output:
        - The predicted probability distribution (y_c), hidden layer activations (h), and raw scores before softmax (u).
        """
        ## Start code
        h = np.dot(x, self.w1)
        u = np.dot(h, self.w2)
        y_c = self.softmax(u)
        return y_c, h, u
        ## End code

    def backprop(self, e, h, x):
        """
        Backpropagation step to update weights.
        This function updates the weight matrices using gradient descent.
        - Computes the gradient of the loss with respect to the second weight matrix (w2).
        - Computes the gradient of the loss with respect to the first weight matrix (w1).
        - Updates w1 and w2 using the learning rate and computed gradients.
        """
        ## Start code
        dl_dw2 = np.outer(h, e)
        dl_dw1 = np.outer(x, np.dot(self.w2, e))
        self.w1 -= self.lr * dl_dw1
        self.w2 -= self.lr * dl_dw2
        ## End code

    def word_vec(self, word):
        """
        Retrieve the word vector for a given word.
        """
        ## Start code
        v_w = self.w1[self.word_index[word]]

        return v_w
        ## End code

    def vec_sim(self, word, top_n):
        """
        Find top N most similar words based on cosine similarity.
        """
        ## Start code
        v_w1 = self.word_vec(word)
        word_sim = {}


        for other_word in self.words_list:
          if other_word != word:
            v_w2 = self.word_vec(other_word)
            similarity = np.dot(v_w1, v_w2) / (np.linalg.norm(v_w1) * np.linalg.norm(v_w2)) # Cosine Similairity
            word_sim[other_word] = similarity

        words_sorted = sorted(word_sim.items(), key=lambda x: x[1], reverse=True)
        for word, sim in words_sorted[:top_n]:
            print(word, sim)
        ## End code

In [125]:
text = "Natural language processing and machine learning open up fascinating possibilities, allowing machines to analyze,\
 understand, and respond to human language in ways that were once thought impossible."

In [129]:
corpus = [[word.lower() for word in text.split()]]

w2v = word2vec(settings)

training_data = w2v.generate_training_data(corpus)

w2v.train(training_data)

Epoch: 1 Loss: 336.4113
Epoch: 2 Loss: 336.4363
Epoch: 3 Loss: 336.4614
Epoch: 4 Loss: 336.4864
Epoch: 5 Loss: 336.5115
Epoch: 6 Loss: 336.5367
Epoch: 7 Loss: 336.5618
Epoch: 8 Loss: 336.587
Epoch: 9 Loss: 336.6121
Epoch: 10 Loss: 336.6373
Epoch: 11 Loss: 336.6626
Epoch: 12 Loss: 336.6878
Epoch: 13 Loss: 336.713
Epoch: 14 Loss: 336.7383
Epoch: 15 Loss: 336.7636
Epoch: 16 Loss: 336.7889
Epoch: 17 Loss: 336.8142
Epoch: 18 Loss: 336.8396
Epoch: 19 Loss: 336.865
Epoch: 20 Loss: 336.8904
Epoch: 21 Loss: 336.9158
Epoch: 22 Loss: 336.9412
Epoch: 23 Loss: 336.9666
Epoch: 24 Loss: 336.9921
Epoch: 25 Loss: 337.0176
Epoch: 26 Loss: 337.0431
Epoch: 27 Loss: 337.0686
Epoch: 28 Loss: 337.0942
Epoch: 29 Loss: 337.1197
Epoch: 30 Loss: 337.1453
Epoch: 31 Loss: 337.1709
Epoch: 32 Loss: 337.1965
Epoch: 33 Loss: 337.2222
Epoch: 34 Loss: 337.2478
Epoch: 35 Loss: 337.2735
Epoch: 36 Loss: 337.2992
Epoch: 37 Loss: 337.3249
Epoch: 38 Loss: 337.3507
Epoch: 39 Loss: 337.3764
Epoch: 40 Loss: 337.4022
Epoch: 41 Lo

In [130]:
word = "machine"
vec = w2v.word_vec(word)
print(word, vec)

# Find similar words
w2v.vec_sim("machine", 3)

machine [ 0.9836496   1.00943166 -0.70331904  0.47491823]
machines 0.7059625750623806
fascinating 0.5771269101843289
thought 0.5369287864606412


# Code Logic
## Word2Vec (By Skipgram)
I used Skipgram to implement a code for getting word2vec and similarity by cosine similarity and also used a simple neural network for training this neural network. In skipgram model learns that if there is a central word which words will be around it based on the corpus trained on it.

# Results

<table>
<tr>
<td>n</td>
<td>window</td>
<td>learning rate</td>
<td>epochs</td>
<td>Last Loss</td>
<td>3 Sims</td>
</tr>
<tr>
<td>5</td>
<td>2</td>
<td>0.01</td>
<td>50</td>
<td>2072.3266</td>
<td>fascinating - to - language</td>
</tr>
<tr>
<td>10</td>
<td>4</td>
<td>0.001</td>
<td>50</td>
<td>1422.4607</td>
<td>learning - up - that</td>
</tr>
<tr>
<td>10</td>
<td>6</td>
<td>0.0005</td>
<td>100</td>
<td>1931.6857</td>
<td>allowing - learning - open</td>
</tr>
<tr>
<td>8</td>
<td>5</td>
<td>0.0005</td>
<td>100</td>
<td>880.6461</td>
<td>analyze - open - that</td>
</tr>
<tr>
<td>4</td>
<td>2</td>
<td>0.0001</td>
<td>150</td>
<td>340.3658</td>
<td>machines - fascinating - thoughts</td>
</tr>
</table>

Based on my experiments the best hyperparameters are <i><b>n = 4, window_size = 2, learning_rate = 0.0001 and epochs = 150</i></b>
