# DATA PREPARATION

- [Read more from here](https://towardsdatascience.com/a-word2vec-implementation-using-numpy-and-python-d256cf0e5f28)
- [Theory from here](http://www.claudiobellei.com/2018/01/06/backprop-word2vec/#skipgram)

The training data needs to be in the following format. 

Example:

    Window size = 2, Vocab size = 9


    We will set the indicies as 1 according to the word_to_index dict i.e natural : 0,  so we set the 0th index as 1 to denote natural

    Target word = best    
    Context words = (way,to)
    Target_word_one_hot_vector = [1, 0, 0, 0, 0, 0, 0, 0, 0]
    Context_word_one_hot_vector = [0, 1, 1, 0, 0, 0, 0, 0, 0]
    
    Target word = way    
    Context words = (best,to,success)
    Target_word_one_hot_vector = [0, 1, 0, 0, 0, 0, 0, 0, 0]
    Context_word_one_hot_vector= [1, 0, 1, 1, 0, 0, 0, 0, 0]
    


Thus we need to take the text 

1. Encode it into hot encoded vectors
2. In this case we are using Skip-gram which build model that, tries to learn the context words for each of the target words.



In [1]:
from my_utils.word_2_vec_dataprep import *


def prepare_training_data(text):
    word_to_index,index_to_word,corpus,vocab_size,length_of_corpus = generate_dictionary_data(text)
    return vocab_size, generate_training_data(corpus, 3, vocab_size= vocab_size, word_to_index=word_to_index,length_of_corpus=length_of_corpus)



def sample_training_data(target_word_vec,context_word_vec,index_to_word):
    print(f"Vocab has " + str(len(target_word_vec)) + " words")
    print("Target Word")
    for idx,val in enumerate(target_word_vec):
        if val == 1:
            print(f"{idx} : {index_to_word[idx]}")

    print("Context Words")
    for idx,val in enumerate(context_word_vec):
        if val == 1:
            print(f"{idx} : {index_to_word[idx]}")




# TAKE TRAINING DATA

In [11]:
text = []
# with open('data/jef_archer.txt') as f:
#     for line in f:
#         text.append(line)
        

text = "Abel dies soon after, and bequeathes everything to his daughter Florentyna, except his silver band of authority, which he leaves to his grandson, whom Florentyna and Richard have named Harry Clifton has joined the British Navy and has assumed the identity of Tom Bradshaw after his ship sinks in order to solve some of his problems".split()

vocab_size,training_data = prepare_training_data(text=text)

print(f"Corpus size : {len(text)}")
print(f"Vocab size : {vocab_size}")

Corpus size : 56
Vocab size : 44


# TRAINING NOW THAT TRAINING DATA IS READY

In [12]:
from my_utils.micrograd import *

neural_network = MLP(vocab_size, [6, vocab_size])

In [13]:
neural_network.represent()

Layer: 0
Has 6 neurons
Each neuron has 45 inputs

Layer: 1
Has 44 neurons
Each neuron has 7 inputs



In [17]:
import numpy as np

# # Put training data in right form 

learning_rate = 0.1

xs = [x for x,_ in training_data]
ys = [y for _,y in training_data]

# Perform forward propagation for all x values with current neural network
# Store predictions in yout

for i in range(1,10):
    yout = [neural_network(x) for x in xs]
    def substract(arr1,arr2):
        return sum(np.square(arr1-arr2))

    err = sum([ substract(y_pred,y) for y_pred,y in zip(yout,ys)]) 

    print(f"Iteration {i}; Error : {err}")

    err.backward()

    all_params = neural_network.parameters()

    for param in all_params:
        param.data =- param.grad * learning_rate/i**2
        param.grad = 0.0 
    

Iteration 1; Error : Value(data=2789.618727530883)
Iteration 2; Error : Value(data=626.5161021165817)
Iteration 3; Error : Value(data=1148.9076093187746)
Iteration 4; Error : Value(data=596.1076855262487)
Iteration 5; Error : Value(data=496.68296883171655)
Iteration 6; Error : Value(data=274.75579685626974)
Iteration 7; Error : Value(data=313.05758375010834)
Iteration 8; Error : Value(data=298.0044847762304)
Iteration 9; Error : Value(data=304.9200292189766)


In [7]:
print(all_params[0].get_node_label())

{W_111=-0.5616756035848232 | grad=-0.8953650036662133 | grad_updates=56}
