
## Recurrent Neural Networks

### Basics of RNN

RNNs work well with sequential inputs, that is, inputs that are ordered in time.
For this reason, they are widely used in NLP.

**Formal Definition**

A recurrent neural network is a class of artificial neural networks where connections between nodes form a directed or undirected graph along a temporal sequence.

## Detailed explanation with a sample use case

Let's consider an NLP use case: classifying an email as spam or ham.
The title or subject line of each email forms the dataset.
Let us consider a sentence X = "you have an offer". The sentence X can be represented as word vectors using any of the previously discussed approaches, such as Bag of Words (BOW), Word2Vec, or TF-IDF.
Let the vectorized representation be X = <X1, X2, X3, X4>, one vector per word.

### Forward propagation over time

At time instant t1, the input to the hidden layer of a recurrent neural network is the vector for word X1 together with the weights W.
The hidden layer produces an output o1.
The key factor that distinguishes an RNN from a traditional ANN is that, at the next time instant, the same hidden layer receives the word vector X2 together with the previous output o1, each multiplied by its own weights and summed.
Thus the sequence order is preserved, because the hidden layer's input at a time instant t depends on its output at the previous time instant.

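At the next time instant t2, the next word vector is considered, and o2 is obtained as a function of the weighted previous output o1 and the corresponding input X2: o2 = f(X2·W + o1·W'), where W' denotes the recurrent weights applied to the previous output and f is the activation function.
The same weights are reused at every time instant (weight recurrence), so the network can be unrolled over as many time steps as the sentence has words.
Finally, a softmax output layer with a cross-entropy loss can be used to classify the sentence as spam or ham (labels 0 and 1).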
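
The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the word vectors and weights are random stand-ins, tanh is assumed as the activation function, and the names `rnn_forward`, `W_rec` and `bias` are hypothetical and not taken from any library.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, W, W_rec, bias):
    """Run a simple RNN over a sequence and return the output at every time step.

    o_t = tanh(W x_t + W_rec o_{t-1} + bias), with the initial output set to zeros.
    """
    previous = [0.0] * len(bias)            # no history before the first word
    outputs = []
    for x in inputs:                        # one iteration per time instant
        from_input = matvec(W, x)
        from_history = matvec(W_rec, previous)
        previous = [math.tanh(a + b + c)
                    for a, b, c in zip(from_input, from_history, bias)]
        outputs.append(previous)
    return outputs

random.seed(0)
input_size, hidden_size = 3, 2
# Random stand-ins for the word vectors X1..X4 of "you have an offer"
X = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(4)]
W = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(hidden_size)]
W_rec = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
bias = [0.0] * hidden_size

outputs = rnn_forward(X, W, W_rec, bias)
for t, o in enumerate(outputs, start=1):
    print(f"o{t} = {o}")
```

Note that the same `W` and `W_rec` are used at every step; only the input word and the previous output change.
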

### Backward propagation over time

The purpose of backpropagation is to reduce the loss.
To reduce the loss, we have to update the weights.
Each weight is updated using the derivative of the loss with respect to that weight, which is computed with the chain rule back through every time step.
The derivative, scaled by a learning rate, is subtracted from the original weight to obtain the updated weight.
Training stops once the loss stops decreasing, ideally near a minimum. Reaching the global minimum (a loss of zero) is not guaranteed in practice.
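
The update rule can be illustrated on a toy problem. The sketch below assumes a single weight and a made-up quadratic loss with its minimum at w = 3. The learning rate of 0.1 and the 50 steps are arbitrary choices for illustration, not values from this tutorial.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # initial weight
learning_rate = 0.1   # step size (arbitrary illustrative value)
for step in range(50):
    w = w - learning_rate * gradient(w)   # subtract the scaled derivative

print(w, loss(w))
```

In an RNN the gradient is far more involved, because it is accumulated across all time steps, but the update itself has this same form.
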
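
### Problems with RNN

In an RNN, every weight update during backpropagation depends on gradients that flow back through all the previous time steps.

Main issue - vanishing gradient: with a sigmoid activation, the derivative is at most 0.25, so the product of these derivatives over many time steps becomes negligibly small. The weights associated with the earlier time steps then barely change, and training does not converge.

With an activation function such as ReLU, whose derivative is 1 for positive inputs, repeated multiplication by the recurrent weights can instead make the gradient grow very large. This is the exploding gradient problem.

To overcome these issues, LSTM (Long Short-Term Memory) RNNs are used.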
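
A small numeric experiment makes both effects visible. It assumes the best case for the sigmoid (its derivative peaks at 0.25) and a hypothetical recurrent weight of 1.5 for the exploding case. Real networks multiply matrices rather than scalars, so this is only an intuition aid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_derivative = sigmoid_derivative(0.0)   # the derivative peaks at z = 0
recurrent_weight = 1.5                     # hypothetical weight larger than 1

for steps in (1, 5, 10, 20):
    vanishing = max_derivative ** steps    # upper bound on a product of sigmoid derivatives
    exploding = recurrent_weight ** steps  # growth when each step multiplies by 1.5
    print(steps, vanishing, exploding)
```
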

### LSTM Recurrent Neural Network

Link: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Consider a use case of text generation.

Say the model has generated the sentence "My firstname is Hema".
To generate a similar sentence with a slight difference, the model has to understand that there is a change in context.
E.g. "My surname is Priya". Here the name has changed, and the model needs to forget the old information and remember the new information.
This is where LSTM comes into the picture.
An LSTM generally consists of four important components:

- Input gate
- Memory cell
- Forget gate
- Output gate

**Memory cell**

It holds the vector that the network currently remembers and allows parts of it to be forgotten. For example, if the original vector is [1, 1, 2, 1, 3], the updated vector might be [1, 1, 0, 0, 3]. The information related to the 3rd and 4th words has then been forgotten by the hidden layer.

**Forget Gate**

When the context changes, the output vector changes and the parts of the previous vector that are no longer relevant are removed (forgotten). Thus, that part of the state is lost.

**Input Gate**

Y = WX + B, where X is the input to the hidden layer. The input gate controls how much of this new information is added to the memory cell.

**Output Gate**

Finally, the information in the memory cell, combined with the weights, is carried over to the output layer.
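
The way the four components interact can be sketched with the commonly used LSTM equations (the ones described in the linked article). To stay readable, the sketch assumes a scalar input, hidden state and memory cell. The function `lstm_step`, the gate names and all weight values are hypothetical choices for illustration, not a library API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a scalar input, hidden state and memory cell.

    `params` maps each gate name to a (w_x, w_h, b) tuple of weights.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))           # how much old memory to keep
    write = sigmoid(pre_activation("input"))             # how much new information to store
    candidate = math.tanh(pre_activation("candidate"))   # the new information itself
    expose = sigmoid(pre_activation("output"))           # how much memory to pass on

    c = forget * c_prev + write * candidate              # updated memory cell
    h = expose * math.tanh(c)                            # output for the next step
    return h, c

# Two hypothetical weight settings that differ only in the forget gate's bias
wipe = {"forget": (0.0, 0.0, -20.0), "input": (0.0, 0.0, 0.0),
        "candidate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}
keep = dict(wipe, forget=(0.0, 0.0, 20.0))

_, c_wiped = lstm_step(x=0.5, h_prev=0.0, c_prev=5.0, params=wipe)
_, c_kept = lstm_step(x=0.5, h_prev=0.0, c_prev=5.0, params=keep)
print(c_wiped, c_kept)
```

With the strongly negative forget bias the old memory value of 5.0 is almost entirely discarded, while the positive bias carries it forward and adds the new information on top.
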

# Word Embeddings
There are already a number of methods for converting words to numerical values or vectors, e.g. BOW (Bag of Words) and TF-IDF.
However, these methods have disadvantages, such as the lack of semantic information.
One-hot encoding is another such method.

**One-hot encoding**

Consider a sentence s1 - "I like eating apples"
and a sentence s2 - "I like eating mangoes".
With one-hot encoding, every word in the vocabulary gets a vector with a single 1 at that word's index and 0 everywhere else. For the vocabulary [I, like, eating, apples, mangoes], "apples" is [0, 0, 0, 1, 0] and "mangoes" is [0, 0, 0, 0, 1].
Comparing the two sentences position by position gives [1, 1, 1, 0], as only the last word differs (based on index).
The encoding tells us nothing more than that the last words differ. Over a large corpus, every pair of distinct words looks equally dissimilar, so the semantics are lost (i.e. the relation between the words apples and mangoes is not captured).
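
The loss of semantics can be checked numerically. The sketch below assumes a five-word vocabulary built from the two example sentences and compares per-word one-hot vectors with cosine similarity. The helper names are hypothetical.

```python
import math

vocabulary = ["I", "like", "eating", "apples", "mangoes"]

def one_hot_vector(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1 if word == entry else 0 for entry in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

apples = one_hot_vector("apples")
mangoes = one_hot_vector("mangoes")
like = one_hot_vector("like")

print(cosine_similarity(apples, mangoes))   # two fruits
print(cosine_similarity(apples, like))      # a fruit and a verb
```

Both similarities come out the same, so the encoding cannot express that "apples" is closer in meaning to "mangoes" than to "like".
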
This is where word embeddings come into the picture.

**Word embeddings**

Word embeddings can also be described as a feature-based representation.
For a large corpus, say one with a vocabulary of 10,000 words, a set of 300 features can be chosen and every word represented by a vector of dimension 300.
With one-hot encoding, each vector would instead have a dimension of 10,000.
In the previous example, the words apples and mangoes both fall under a fruit-like feature while their individual vector values may differ, thus preserving the semantics.
Word embeddings provide a dense representation of words and their relative meanings. They are an improvement over the sparse representations used in simpler bag-of-words models. Word embeddings can be learned from text data and reused across projects. They can also be learned as part of fitting a neural network on text data.
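
To make the feature-based idea concrete, here is a sketch with hand-written 3-feature vectors. The feature meanings (fruit-like, action-like, person-like) and every number are invented for illustration. Learned embeddings would have many more dimensions, and their features are usually not interpretable this way.

```python
import math

# Invented 3-feature embeddings: [fruit-like, action-like, person-like]
embeddings = {
    "apples":  [0.95, 0.05, 0.01],
    "mangoes": [0.90, 0.10, 0.02],
    "eating":  [0.10, 0.92, 0.05],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

fruit_pair = cosine_similarity(embeddings["apples"], embeddings["mangoes"])
mixed_pair = cosine_similarity(embeddings["apples"], embeddings["eating"])
print(fruit_pair, mixed_pair)
```

Unlike one-hot vectors, these dense vectors give the two fruits a much higher similarity than the fruit and the verb.
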

**Implementation of word embeddings**

Keras provides an `Embedding` layer. The sentences are first tokenized and each word is mapped to an integer index (the Keras `one_hot` function does this by hashing the word), and then the number of features (the embedding dimension) is specified to obtain the word embeddings.

In [1]:
import tensorflow as tf
tf.__version__

'2.17.0'

In [2]:
from tensorflow.keras.preprocessing.text import one_hot

In [3]:
sentences = [
             "I am a good girl",
             "I am an engineer",
             "The sum rises in the east",
             "Live life to the fullest",
             "I am a developer",
             "I need a cup of tea",
             "I can understand",
             "my work is good",
             "I like apples",
             "I don't like mangoes"
]

In [4]:
#define the vocabulary size
vocab_size = 1000


In [5]:
onehot_repr = [one_hot(sentence, vocab_size) for sentence in sentences]
print(onehot_repr)

#each word is replaced by an integer index between 1 and vocab_size - 1
#the outer list has one entry per sentence, and each inner list has one index per word
# e.g. 567 => I, 730 => am

[[567, 730, 999, 659, 755], [567, 730, 579, 850], [223, 461, 16, 44, 223, 874], [982, 799, 587, 223, 608], [567, 730, 999, 659], [567, 277, 999, 892, 698, 819], [567, 910, 370], [416, 661, 508, 659], [567, 261, 79], [567, 593, 261, 889]]
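
As far as the Keras documentation describes it, `one_hot` does not produce true one-hot vectors. It hashes each word to an integer index, which is why repeated words such as "I" and "am" get the same number in every sentence. The sketch below imitates that idea with an MD5 hash so that the result is repeatable. It is not the Keras implementation, so its indices will not match the ones printed above, and `stable_index` and `encode` are hypothetical names.

```python
import hashlib

def stable_index(word, vocab_size):
    """Map a word to an integer in [1, vocab_size - 1] using an MD5 hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

def encode(sentence, vocab_size):
    """Lowercase, split on whitespace and hash every word to an index."""
    return [stable_index(word, vocab_size) for word in sentence.lower().split()]

vocab_size = 1000
first = encode("I am a good girl", vocab_size)
second = encode("I am an engineer", vocab_size)
print(first)
print(second)
```

Because the index comes from a hash, two different words can collide on the same index. Choosing a `vocab_size` well above the number of distinct words makes this less likely.
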


In [6]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences #to make length of sentence equal
from tensorflow.keras.models import Sequential
import numpy as np

In [7]:

#Embedding representation

sent_length=8
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[  0   0   0 567 730 999 659 755]
 [  0   0   0   0 567 730 579 850]
 [  0   0 223 461  16  44 223 874]
 [  0   0   0 982 799 587 223 608]
 [  0   0   0   0 567 730 999 659]
 [  0   0 567 277 999 892 698 819]
 [  0   0   0   0   0 567 910 370]
 [  0   0   0   0 416 661 508 659]
 [  0   0   0   0   0 567 261  79]
 [  0   0   0   0 567 593 261 889]]
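
What `pad_sequences` does with `padding='pre'` can be imitated in plain Python. The function below is a simplified stand-in written for illustration, not the Keras code. It pads with zeros on the left and, for sequences longer than `maxlen`, keeps only the last `maxlen` items.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                              # keep the last `maxlen` items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

sample = [[567, 730, 999, 659, 755], [567, 910, 370]]
for row in pad_pre(sample, maxlen=8):
    print(row)
```
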


In [8]:
#define the number of features (embedding dimension)

dim=10

#define the model
model=Sequential()
#input_length is deprecated in Keras 3 (bundled with recent TensorFlow releases), so it is omitted
model.add(Embedding(vocab_size,dim))
model.compile('adam','mse')
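
Conceptually, an embedding layer is a lookup table with one row of `dim` numbers per vocabulary index. The sketch below mimics an untrained layer with a table of small random values (the exact range is an assumption) and looks up the first padded sentence. It is a plain-Python illustration, not the Keras layer itself.

```python
import random

random.seed(42)
vocab_size, dim = 1000, 10

# One row of `dim` small random numbers per index, like an untrained embedding layer
table = [[random.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(indices):
    """Replace every integer index with its row from the table."""
    return [table[i] for i in indices]

padded_sentence = [0, 0, 0, 567, 730, 999, 659, 755]   # first row of embedded_docs
vectors = embed(padded_sentence)
print(len(vectors), len(vectors[0]))
print(vectors[0] == vectors[1] == vectors[2])
```

The three padding positions all use index 0, so they map to the same row. The same effect explains why a sentence's leading vectors are identical in the output of `model.predict`.
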




In [9]:
print(model.predict(embedded_docs))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 270ms/step
[[[ 4.87896465e-02 -1.56032331e-02  4.89363559e-02 -3.27865630e-02
    2.46693380e-02  2.61197127e-02  4.76915762e-03 -6.66989014e-03
    2.38736011e-02  3.01277302e-02]
  [ 4.87896465e-02 -1.56032331e-02  4.89363559e-02 -3.27865630e-02
    2.46693380e-02  2.61197127e-02  4.76915762e-03 -6.66989014e-03
    2.38736011e-02  3.01277302e-02]
  [ 4.87896465e-02 -1.56032331e-02  4.89363559e-02 -3.27865630e-02
    2.46693380e-02  2.61197127e-02  4.76915762e-03 -6.66989014e-03
    2.38736011e-02  3.01277302e-02]
  [ 2.66976282e-03 -4.73067537e-02  4.03023884e-03 -2.41728425e-02
   -9.38636065e-03 -3.15554291e-02 -1.87604185e-02 -4.96016257e-02
   -3.23021039e-02  3.21363322e-02]
  [ 1.96955241e-02 -3.91077250e-04  9.55078751e-03 -2.01512501e-03
   -8.70171934e-03  2.33808421e-02  2.26479657e-02  2.14028694e-02
    1.53617971e-02  6.81634992e-03]
  [-1.83971412e-02 -2.30427142e-02  1.92599036e-02  2.41428502e-02
    2.2730

In [10]:
print(model.predict(embedded_docs)[0])

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step
[[ 0.04878965 -0.01560323  0.04893636 -0.03278656  0.02466934  0.02611971
   0.00476916 -0.00666989  0.0238736   0.03012773]
 [ 0.04878965 -0.01560323  0.04893636 -0.03278656  0.02466934  0.02611971
   0.00476916 -0.00666989  0.0238736   0.03012773]
 [ 0.04878965 -0.01560323  0.04893636 -0.03278656  0.02466934  0.02611971
   0.00476916 -0.00666989  0.0238736   0.03012773]
 [ 0.00266976 -0.04730675  0.00403024 -0.02417284 -0.00938636 -0.03155543
  -0.01876042 -0.04960163 -0.0323021   0.03213633]
 [ 0.01969552 -0.00039108  0.00955079 -0.00201513 -0.00870172  0.02338084
   0.02264797  0.02140287  0.0153618   0.00681635]
 [-0.01839714 -0.02304271  0.0192599   0.02414285  0.02273014  0.00549482
   0.02964698 -0.02442052  0.02247221  0.01864943]
 [ 0.03020063 -0.04920372 -0.04778533  0.04046493 -0.03640121  0.04129915
  -0.00594747 -0.03485905  0.02245817  0.00199797]
 [ 0.01112071 -0.01309968 -0.00229196  0.01274223 -0.