<a href="https://colab.research.google.com/github/PhatHuynhTranSon99/Neural-Network-From-Scratch/blob/main/Recurrent_neural_network_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recurrent Neural Network from scratch

In this notebook, I will show how to implement a simple recurrent neural network (RNN) from scratch. The motivating task is sentiment analysis, where we try to predict whether the sentiment of a sentence is positive or negative.

For example:



*   I love it -> Positive with label 1
*   I do not like it -> Negative with label 0
*   It is great -> Positive with label 1
*   It is awful -> Negative with label 0



## Library import

In this section, we import only the libraries this notebook needs. Since the implementation is written entirely from scratch, the most important one is NumPy.

In [None]:
import numpy as np

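Another important library is spaCy, which we will use to obtain a vector for each word in a sentence. We load the en_core_web_sm model for this. Note that spaCy's small models do not ship with static pretrained word vectors; a token's `.vector` attribute instead returns a context-sensitive tensor produced by the pretrained pipeline, and that is what we will use as the word embedding.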

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

## Dataset collection and manipulation

Collecting data is one of the most important phases of any data science task. For sentiment analysis, many cleaned datasets with a large number of examples are available online. In this notebook, however, we will use only a small dataset that I found on GitHub ([rnn-from-scratch](https://github.com/vzhou842/rnn-from-scratch/blob/master/data.py)) as a foundation.

You can apply what you learn in this notebook to larger or more complex datasets, as the principles remain the same.

### Collection

In [None]:
# Training data used for the training phase
# of the RNN
train_data = {
  'good': True,
  'bad': False,
  'happy': True,
  'sad': False,
  'not good': False,
  'not bad': True,
  'not happy': False,
  'not sad': True,
  'very good': True,
  'very bad': False,
  'very happy': True,
  'very sad': False,
  'i am happy': True,
  'this is good': True,
  'i am bad': False,
  'this is bad': False,
  'i am sad': False,
  'this is sad': False,
  'i am not happy': False,
  'this is not good': False,
  'i am not bad': True,
  'this is not sad': True,
  'i am very happy': True,
  'this is very good': True,
  'i am very bad': False,
  'this is very sad': False,
  'this is very happy': True,
  'i am good not bad': True,
  'this is good not bad': True,
  'i am bad not good': False,
  'i am good and happy': True,
  'this is not good and not happy': False,
  'i am not at all good': False,
  'i am not at all bad': True,
  'i am not at all happy': False,
  'this is not at all sad': True,
  'this is not at all happy': False,
  'i am good right now': True,
  'i am bad right now': False,
  'this is bad right now': False,
  'i am sad right now': False,
  'i was good earlier': True,
  'i was happy earlier': True,
  'i was bad earlier': False,
  'i was sad earlier': False,
  'i am very bad right now': False,
  'this is very good right now': True,
  'this is very sad right now': False,
  'this was bad earlier': False,
  'this was very good earlier': True,
  'this was very bad earlier': False,
  'this was very happy earlier': True,
  'this was very sad earlier': False,
  'i was good and not bad earlier': True,
  'i was not good and not happy earlier': False,
  'i am not at all bad or sad right now': True,
  'i am not at all good or happy right now': False,
  'this was not happy and not good earlier': False,
}

# Test data for validating the RNN model
# This contains examples that the RNN has not seen before
test_data = {
  'this is happy': True,
  'i am good': True,
  'this is not happy': False,
  'i am not good': False,
  'this is not bad': True,
  'i am not sad': True,
  'i am very good': True,
  'this is very bad': False,
  'i am very sad': False,
  'this is bad not good': False,
  'this is good and happy': True,
  'i am not good and not happy': False,
  'i am not at all sad': True,
  'this is not at all good': False,
  'this is not at all bad': True,
  'this is good right now': True,
  'this is sad right now': False,
  'this is very bad right now': False,
  'this was good earlier': True,
  'i was not happy and not good earlier': False,
}

### Manipulation

In this section, we will convert the raw examples into training data by representing them as matrices and vectors.

To elaborate, each word in a sentence will be encoded as a vector (vectorized), and each label will be encoded as a number (0 for negative sentiment and 1 for positive sentiment). Hence, a sentence becomes a list of vectors, one per word.

There are many ways to encode each word, including one-hot encoding. However, to make our model more robust despite the limited number of training examples, I will take advantage of pretrained word embeddings.

This should help the RNN train better, as the embeddings themselves already carry knowledge about the words.
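For comparison, here is a minimal sketch of the one-hot alternative mentioned above. The mini-corpus and the names `vocabulary`, `word_to_index`, and `one_hot_encode` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical mini-corpus, used only to build a vocabulary for this sketch
sentences = ["i am happy", "i am not happy", "this is good"]

# Map every distinct word to an integer index (sorted for reproducibility)
vocabulary = sorted({word for s in sentences for word in s.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot_encode(sentence):
  """Return one vector per word, with a single 1 at that word's index."""
  vectors = []
  for word in sentence.split():
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0
    vectors.append(vector)
  return vectors

encoded = one_hot_encode("i am not happy")
print(len(encoded), encoded[0].shape)
```

Every pair of distinct one-hot vectors is orthogonal, so this encoding cannot express that "good" and "happy" are related. Dense embeddings can, which is useful when the training set is this small.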

### Word embeddings

This is how we can obtain word embeddings using spaCy's en_core_web_sm model:

In [None]:
sentence = "hello world this is awesome"

# First we create a document object using nlp 
document = nlp(sentence)

# Each item in the document object is a token with a
# vectorized representation
word_hello = document[0]
word_hello_embedding = word_hello.vector

word_hello_embedding is a vector that encodes information about the word 'hello'. We can print it and also check its dimension.

In [None]:
print(f"\"Hello\"'s vectorized representation is {word_hello_embedding}")
print(f"Its dimension is {word_hello_embedding.shape}")

"Hello"'s vectorized representation is [-1.6272194  -0.33563375  0.945545   -0.4469183   2.6902642   4.3016396
 -0.824273    3.1982849   0.5284388   3.6185837   1.3415287  -2.4555066
  0.65796274  2.110478    2.577197    1.9148287   0.6069482   0.9331498
 -2.5915394  -3.3500705  -3.4974782   1.8654583  -2.3845963   0.9036485
  1.4803083  -3.5128365  -2.5116596  -2.5202007   1.7190666   3.5116007
 -3.3501587   2.204627   -3.0264146   1.4101822   3.1886137   3.4279332
  1.4341421  -1.1750133   0.50860566  0.93580085 -1.9668213   1.6744696
 -3.6765428  -1.734254    1.3900673  -3.8862624  -0.50333697 -1.6206884
 -0.03179216 -0.58700883 -0.13928567 -1.9868772  -0.15296161 -0.3285142
 -2.6088018   0.82431364  2.9109895   2.4748793  -2.1238127  -2.6898267
  3.409523   -1.2409576  -2.057255   -0.11251724 -1.0778928   0.7698482
  1.998522   -3.7546642  -1.5513041  -2.1098228  -0.05553401  1.3901733
  2.742693    1.7499138  -2.35433     1.7996001   1.0548267   1.4774419
 -1.785715    2.662733   

As you can see, the vector has length 96, which is rather small compared with modern word embeddings. If possible, we can use spaCy's en_core_web_md model instead, which provides vectors of length 300 and therefore carries more information.

However, training will then take longer, since backpropagation has to work with larger vectors.
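To get a rough feel for that extra cost, the sketch below counts the entries of the input-to-hidden weight matrix of a vanilla RNN for both embedding lengths. It assumes the matrix has shape (hidden size × embedding length); `hidden_size = 64` and the helper `input_weight_count` are hypothetical values and names chosen only for this illustration.

```python
# Rough cost sketch for the input-to-hidden product of a vanilla RNN.
# hidden_size = 64 is a hypothetical value chosen only for illustration.
hidden_size = 64

def input_weight_count(embedding_size, hidden_size):
  """Number of entries in the (hidden_size x embedding_size) input matrix."""
  return hidden_size * embedding_size

small = input_weight_count(96, hidden_size)    # vector length seen with en_core_web_sm
medium = input_weight_count(300, hidden_size)  # vector length of en_core_web_md

print(small, medium, medium / small)
```

Under these assumptions, the input matrix grows by a factor of 300 / 96 ≈ 3.1, and so does the work needed at every time step to multiply by it and to compute its gradient.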

Now, let's convert the sentences in train_data and test_data into vectorized form.

In [None]:
processed_train_data = []

for x_as_sentence, y_as_boolean in train_data.items():
  # Convert x into a list of word embeddings, one per token
  x_as_doc = nlp(x_as_sentence)
  x_as_embeddings = [token.vector for token in x_as_doc]

  # Convert y to a binary value (True -> 1, False -> 0)
  y_as_binary = int(y_as_boolean)

  # Insert into processed_train_data
  processed_train_data.append(
      {
          'x': x_as_embeddings,
          'y': y_as_binary
      }
  )
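The same conversion applies to any dictionary of labelled sentences, so it can be wrapped in a small helper. The sketch below is one possible way to do that. The names `vectorize_dataset` and `fake_embed` are hypothetical, and `fake_embed` returns random vectors only so that the example runs without spaCy.

```python
import numpy as np

def vectorize_dataset(data, embed):
  """Turn {sentence: bool} into a list of {'x': [vectors], 'y': 0 or 1}."""
  processed = []
  for sentence, label in data.items():
    processed.append({'x': embed(sentence), 'y': int(label)})
  return processed

# Stand-in embedder so the sketch runs without spaCy: one random
# 96-dimensional vector per whitespace-separated word (NOT real embeddings).
rng = np.random.default_rng(0)

def fake_embed(sentence):
  return [rng.standard_normal(96) for _ in sentence.split()]

sample = vectorize_dataset({'not bad': True, 'very sad': False}, fake_embed)
print(len(sample), len(sample[0]['x']), sample[0]['y'], sample[1]['y'])
```

In this notebook, the embedder would instead be something like `lambda s: [token.vector for token in nlp(s)]`, and the helper could be called once with train_data and once with test_data.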

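Let's verify that the conversion works by inspecting one entry of processed_train_data. Note that index 5 selects the sixth entry, 'not bad', rather than the first.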

In [None]:
first_example = processed_train_data[5]
print(first_example['x'])
print(first_example['y'])

[array([-1.1862527 , -0.5168541 ,  0.55137575, -1.3239889 , -1.2699106 ,
        4.0138464 ,  1.5577921 ,  2.3985279 , -2.5761452 ,  2.9888492 ,
        1.6186955 ,  2.5658243 ,  3.6894712 ,  1.4838963 ,  0.6751787 ,
        6.75658   , -1.1680346 , -1.6799572 , -2.970725  ,  1.1116401 ,
       -2.140803  , -0.7070384 , -1.9781781 , -1.3197064 , -2.3057742 ,
        2.8701203 ,  1.167284  ,  1.6811383 , -0.21117121, -0.7254225 ,
       -0.9812827 , -0.5519656 , -1.1937029 , -1.8989229 ,  1.7492497 ,
        1.4318465 , -2.4354227 , -2.029803  , -3.1974297 ,  1.2980125 ,
       -1.034874  ,  0.44772184, -1.740391  , -1.3432336 , -2.2001183 ,
        5.4227986 , -1.520893  , -1.8390274 , -0.068088  , -2.1983423 ,
        2.6347175 , -1.2741227 , -0.25415123,  2.5880084 ,  1.8729777 ,
       -1.3763707 , -0.82773346,  3.4450068 ,  2.5435286 , -2.715124  ,
        2.4196951 , -1.1905601 ,  1.6580374 ,  0.94533134,  2.1511202 ,
       -1.2368288 , -2.337473  , -3.74741   ,  5.251529  , -1.0