## Getting ready to implement a recurrent neural network

A recurrent neural network (RNN) is a distinctive kind of network because it can retain information about earlier inputs, which makes it well suited to problems that deal with sequential data, such as time series forecasting, speech recognition, machine translation, and audio and video sequence prediction.

In an RNN, data flows in such a way that at each node the network learns from both the current and the previous inputs. The weights are shared over time, which reflects that we are performing the same task at each step, just with different inputs.
This reduces the total number of parameters we need to learn.
For example, if the activation function is tanh, the weight at the recurrent neuron is $W_{hh}$, and the weight at the input neuron is $W_{xh}$, we can write the equation for the state $h_t$ at time $t$ as:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$$

The gradient at each output depends on the calculations of the current time step and also the previous time steps.
For example, in order to calculate the gradient at t=4, we would need to backpropagate 3 steps and sum up the gradients.
This is known as Backpropagation Through Time (BPTT).
During BPTT while iterating over the training examples, we modify the weights in order to reduce error.
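
To make the weight sharing and the dependence on previous time steps concrete, here is a minimal base R sketch of the forward pass of a simple RNN. The sizes, the random weights, and the names `W_xh`, `W_hh`, and `states` are illustrative assumptions, not part of the recipe. BPTT walks back through exactly this loop to compute the gradients.

```r
# Minimal forward pass of a simple RNN in base R (all values are illustrative).
set.seed(42)
n_input  <- 3   # size of each input vector x_t
n_hidden <- 4   # size of the hidden state h_t
n_steps  <- 4   # length of the input sequence

# One set of weights, shared by every time step.
W_xh <- matrix(rnorm(n_hidden * n_input,  sd = 0.5), n_hidden, n_input)
W_hh <- matrix(rnorm(n_hidden * n_hidden, sd = 0.5), n_hidden, n_hidden)

# A toy input sequence with one column per time step.
x <- matrix(rnorm(n_input * n_steps), n_input, n_steps)

h <- rep(0, n_hidden)                    # initial state h_0
states <- matrix(0, n_hidden, n_steps)   # one column per state h_1 ... h_4
for (t in seq_len(n_steps)) {
  # The new state mixes the previous state and the current input.
  h <- as.vector(tanh(W_hh %*% h + W_xh %*% x[, t]))
  states[, t] <- h
}
print(round(states, 3))
```

Because the state at step 4 is computed from the state at step 3, which in turn depends on steps 2 and 1, the gradient at step 4 has contributions from all the earlier steps.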



RNNs can handle various combinations of input and output types through the different architectures they support, which are mainly:

One-to-Many: One input mapped to a sequence with multiple steps as an output. Example: music generation.


Many-to-One: A sequence of inputs mapped to a class or quantity prediction. Example: sentiment classification.


Many-to-Many: A sequence of inputs mapped to a sequence of outputs. Example: language translation, named entity recognition.
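
The difference between these architectures is mostly a matter of which inputs are fed in and which states are read out. The following toy sketch uses a scalar RNN written in base R; the function `rnn_states()` and its weights are hypothetical and only serve to show the shapes involved. The one-to-many case is simplified by feeding zeros after the first input.

```r
# Toy scalar RNN that returns the hidden state at every time step.
rnn_states <- function(x, w_x = 0.8, w_h = 0.5) {
  h <- 0
  states <- numeric(length(x))
  for (t in seq_along(x)) {
    h <- tanh(w_h * h + w_x * x[t])
    states[t] <- h
  }
  states
}

x <- c(0.2, -0.1, 0.7, 0.4)              # an input sequence of four steps
states <- rnn_states(x)

many_to_many <- states                    # one output per input step
many_to_one  <- states[length(states)]    # only the final state is used

# One-to-many (simplified): a single input starts the sequence,
# and the later steps receive no new input (zeros here).
one_to_many <- rnn_states(c(0.9, 0, 0, 0))

length(many_to_many)
length(many_to_one)
length(one_to_many)
```

The sentiment classifier built in this recipe is a many-to-one model: it reads a whole review and uses only the final state to predict a single label.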








### Getting ready...

In this section, we are going to use the IMDB dataset, which contains movie reviews and the sentiment associated with each of them. We can import this dataset using a built-in function from the keras library. The movie reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the third most frequent word in the data. Now let us import the required library and the dataset.

In [7]:
library(keras)

In [8]:
imdb <- dataset_imdb(num_words = 1000)

Let's divide the dataset into training and test sets.

In [9]:
train_x <- imdb$train$x
train_y <- imdb$train$y
test_x <- imdb$test$x
test_y <- imdb$test$y

Let's have a look at the number of reviews in the training and test data.

In [5]:
# number of samples in train and test set
cat(length(train_x), 'train sequences\n')
cat(length(test_x), 'test sequences')

25000 train sequences
25000 test sequences

We see that there are 25,000 reviews each in the training and test sets. Let's look at the structure of the training data.

In [13]:
str(train_y)

 int [1:25000] 1 0 0 1 0 0 1 0 1 0 ...


In [14]:
str(train_x)

List of 25000
 $ : int [1:218] 1 14 22 16 43 530 973 2 2 65 ...
 $ : int [1:189] 1 194 2 194 2 78 228 5 6 2 ...
 $ : int [1:141] 1 14 47 8 30 31 7 4 249 108 ...
 $ : int [1:550] 1 4 2 2 33 2 4 2 432 111 ...
 $ : int [1:147] 1 249 2 7 61 113 10 10 13 2 ...
 $ : int [1:43] 1 778 128 74 12 630 163 15 4 2 ...
 $ : int [1:123] 1 2 365 2 5 2 354 11 14 2 ...
 $ : int [1:562] 1 4 2 716 4 65 7 4 689 2 ...
 $ : int [1:233] 1 43 188 46 5 566 264 51 6 530 ...
 $ : int [1:130] 1 14 20 47 111 439 2 19 12 15 ...
 $ : int [1:450] 1 785 189 438 47 110 142 7 6 2 ...
 $ : int [1:99] 1 54 13 2 14 20 13 69 55 364 ...
 $ : int [1:117] 1 13 119 954 189 2 13 92 459 48 ...
 $ : int [1:238] 1 259 37 100 169 2 2 11 14 418 ...
 $ : int [1:109] 1 503 20 33 118 481 302 26 184 52 ...
 $ : int [1:129] 1 6 964 437 7 58 43 2 11 6 ...
 $ : int [1:163] 1 2 2 11 4 2 9 4 2 4 ...
 $ : int [1:752] 1 33 4 2 7 4 2 194 2 2 ...
 $ : int [1:212] 1 13 28 64 69 4 2 7 319 14 ...
 $ : int [1:177] 1 2 26 9 6 2 731 939 44 6 ...
 $ : in

Above, we can see that our training set consists of a list of reviews and a vector of sentiment labels. Let's look at the first review and the number of words in it.

In [6]:
train_x[[1]]
cat("Number of words in the first review is",length(train_x[[1]]))

Number of words in the first review is 218

Please note that when we imported our dataset, we set the value of the argument num_words to 1000. This means that only the 1,000 most frequent words are kept in the encoded reviews. Let's look at the maximum encoded value in our list of reviews.

In [7]:
cat("Maximum encoded value in train ",max(sapply(train_x, max)),"\n")
cat("Maximum encoded value in test ",max(sapply(test_x, max)))

Maximum encoded value in train  999 
Maximum encoded value in test  999
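
The effect of the num_words argument can be pictured with a small base R sketch. The encoded values below are made up for illustration, and the replacement rule (indexes at or above num_words become the out-of-vocabulary index 2) is our reading of how the dataset loader behaves, not code taken from keras.

```r
num_words <- 1000L
oov_index <- 2L   # index reserved for out-of-vocabulary words

# A made-up encoded review containing two rare words (1622 and 4468).
encoded <- c(1L, 14L, 22L, 1622L, 530L, 4468L, 999L)

# Any index that falls outside the kept vocabulary is replaced.
capped <- ifelse(encoded >= num_words, oov_index, encoded)
print(capped)
max(capped)
```

This is why the maximum encoded value reported above is 999 rather than 1000, and why the index 2 appears so often in the encoded reviews.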

### How to do it...

In the previous section, we got familiar with the data; now let's get into its details. We know that the words of each review are encoded according to a word index, in which words are indexed by overall frequency in the dataset. Let us import this word index.

In [8]:
word_index = dataset_imdb_word_index()

Let's look at the head of the word index.

In [9]:
head(word_index)

We see that it is a list of key-value pairs, where the key is the word and the value is the integer to which it is mapped. Let's see how many unique words we have in our word index.

In [10]:
length(word_index)

Let us create a reversed list of key-value pairs from the word index. We will use this to decode the reviews in the IMDB dataset.

In [11]:
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index
head(reverse_word_index)

Now let us decode the first review and look at it. Note that the word encodings are offset by three because 0, 1, and 2 are reserved for padding, the start of the sequence, and out-of-vocabulary words, respectively.

In [12]:
decoded_review <- sapply(train_x[[1]], function(index) {
  word <- if (index >= 3) reverse_word_index[[as.character(index -3)]]
  if (!is.null(word)) word else "?"
})

cat(decoded_review)

? this film was just brilliant casting ? ? story direction ? really ? the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same ? ? as myself so i loved the fact there was a real ? with this film the ? ? throughout the film were great it was just brilliant so much that i ? the film as soon as it was released for ? and would recommend it to everyone to watch and the ? ? was amazing really ? at the end it was so sad and you know what they say if you ? at a film it must have been good and this definitely was also ? to the two little ? that played the ? of ? and paul they were just brilliant children are often left out of the ? ? i think because the stars that play them all ? up are such a big ? for the whole film but these children are amazing and should be ? for what they have done don't you think the whole story was so ? because it was true and was ? life after all that was ? with us all
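
The decoding logic can be tried out on a toy word index without downloading anything. The four-word index below is hypothetical, and `decode()` is an illustrative helper that adds an explicit membership check so that unknown indexes fall back to "?".

```r
# Toy word index: word -> frequency rank (hypothetical values).
word_index <- list(the = 1L, film = 2L, was = 3L, brilliant = 4L)

# Reverse it: the names are the ranks (as strings), the values are the words.
reverse_word_index <- names(word_index)
names(reverse_word_index) <- unlist(word_index)

decode <- function(encoded) {
  vapply(encoded, function(index) {
    key <- as.character(index - 3L)   # undo the offset of three
    if (index > 3L && key %in% names(reverse_word_index)) {
      reverse_word_index[[key]]
    } else {
      "?"   # padding, start marker, or out-of-vocabulary word
    }
  }, character(1))
}

# 1 = start marker, 2 = out-of-vocabulary; 4..7 are the ranks 1..4 shifted by 3.
encoded <- c(1L, 4L, 5L, 6L, 2L, 7L)
cat(decode(encoded), "\n")
```

The question marks in the decoded review above arise in the same way: they stand for the start marker and for words outside the 1,000 most frequent ones.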

Let us pad the sequences so that every review has a length of 80.

In [13]:
train_x <- pad_sequences(train_x, maxlen = 80)
test_x <- pad_sequences(test_x, maxlen = 80)

cat('x_train shape:', dim(train_x), '\n')
cat('x_test shape:', dim(test_x), '\n')

x_train shape: 25000 80 
x_test shape: 25000 80 


Now let us look at the first review after padding it.

In [14]:
train_x[1,]

Let us create our neural network for sentiment classification and view its summary.

In [15]:
model <- keras_model_sequential()
model %>%
  layer_embedding(input_dim = 1000, output_dim = 128) %>% 
  layer_simple_rnn(units = 32) %>% 
  layer_dense(units = 1, activation = 'sigmoid')

summary(model)

________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
embedding (Embedding)               (None, None, 128)               256000      
________________________________________________________________________________
simple_rnn (SimpleRNN)              (None, 32)                      5152        
________________________________________________________________________________
dense (Dense)                       (None, 1)                       33          
Total params: 261,185
Trainable params: 261,185
Non-trainable params: 0
________________________________________________________________________________


Now let us compile the model created above and train it.

In [16]:
# compile model
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

# train model
model %>% fit(
  train_x,train_y,
  batch_size = 32,
  epochs = 10,
  validation_split = .2
)

In [42]:
# save_model_hdf5(model,"simple_rnn.h5")

Let us evaluate the model and print its test loss and accuracy.

In [43]:
scores <- model %>% evaluate(
  test_x, test_y,
  batch_size = 32
)

cat('Test score:', scores[[1]],'\n')
cat('Test accuracy', scores[[2]])

Test score: 0.8564493 
Test accuracy 0.71648

### How it works...

Step 1:

In this example, we use the IMDB reviews dataset that is built into the keras library. In the first step, we load the training and testing datasets. The data has been mapped to sequences of integer values, with each integer representing a particular word in a dictionary. This dictionary has a rich collection of words, arranged by how frequently each word is used in the corpus. You can see that the dictionary is a list of key-value pairs, with keys representing the words and values representing the index of each word in the dictionary. To discard the words that are not frequently used, we set a threshold of 1000; that is, we keep only the 1,000 most frequent words in our training dataset and ignore the rest.

Step 2:

In this step, we show how to reconstruct the text of a review from its integer encoding, using a reversed word index that maps each integer back to its word.

Step 3:

In this step, we prepare the data to feed into the model. Since we cannot directly pass lists of integers of varying length into the model, we convert them into uniformly shaped tensors. To make the length of all the reviews uniform, we can follow either of these two approaches:

1. One-hot encoding: We one-hot encode all the reviews to convert them into tensors of the same length. The size of the matrix will be "number of words  *  number of reviews". This approach is computationally heavy.

2. Padding the reviews: Alternatively, we can pad all the reviews so that they all have the same length. This creates an integer tensor of shape "num_examples  *  max_length". The maxlen argument of pad_sequences() caps the number of words that we keep in each review.

Since the second approach is less memory- and computation-intensive, we go with it, that is, padding the sequences.
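
The following base R sketch illustrates what padding does to sequences of different lengths and how the two approaches compare in size. The helper `pad_pre()` is hypothetical; it mimics the default behaviour of pad_sequences() as we understand it (zeros added at the front, long sequences cut from the front) and is not the keras implementation.

```r
# Pad (or truncate) one integer sequence to exactly `maxlen` entries.
pad_pre <- function(ids, maxlen, value = 0L) {
  n <- length(ids)
  if (n >= maxlen) {
    ids[(n - maxlen + 1):n]             # keep the last `maxlen` entries
  } else {
    c(rep(value, maxlen - n), ids)      # add zeros at the front
  }
}

reviews <- list(c(1L, 14L, 22L),
                c(1L, 194L, 2L, 78L, 228L, 5L, 6L))

# One row per review, one column per position.
padded <- t(vapply(reviews, pad_pre, integer(5), maxlen = 5))
print(padded)

# Number of cells each approach needs for 25,000 reviews:
one_hot_cells <- 25000 * 1000   # one column per vocabulary word
padded_cells  <- 25000 * 80     # one column per position, maxlen = 80
one_hot_cells / padded_cells
```

The short review is left-padded with zeros, while the long one loses its first two entries, so both end up as rows of the same integer matrix.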

Step 4:

In the next step, we define a keras sequential model and configure its layers. The first layer is the embedding layer, which is used to generate context from the word sequences in our data and to provide information about relevant features. In an embedding, words are represented by dense vectors, where each vector is the projection of a word into a continuous vector space. This projection is learned from the text and is based on the words that surround a particular word. The position of a word in the vector space is referred to as its embedding. When we use an embedding, we represent each word of a review in terms of some latent factors. For example, the word "brilliant" can be represented by a vector such as [.32, .02, .48, .21, .56, .15]. This is computationally efficient when using massive datasets, since it reduces the dimensionality. The embedded vectors are also updated during the training of the deep neural network, which helps in identifying similar words in a multi-dimensional space. Word embeddings also reflect how words are related to each other semantically. For example, "talking" and "talked" can be thought of as related in the same way as "swimming" and "swam".
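
Mechanically, an embedding layer can be thought of as a lookup table with one row per word index. The sketch below shows that lookup in base R with a tiny, randomly initialised matrix. The sizes and the name `embedding_matrix` are illustrative assumptions; in the real model the values are learned during training.

```r
set.seed(1)
vocab_size <- 10   # hypothetical tiny vocabulary (word indexes 0..9)
embed_dim  <- 4    # length of each word vector

# One row per word index; random here, learned in a real model.
embedding_matrix <- matrix(rnorm(vocab_size * embed_dim), vocab_size, embed_dim)

# A padded review of length 5 (0 is the padding index).
review <- c(0L, 0L, 1L, 7L, 3L)

# R is 1-based, so the vector for word index i lives in row i + 1.
embedded <- embedding_matrix[review + 1L, , drop = FALSE]
dim(embedded)   # 5 time steps, each a vector of length 4
```

The RNN layer that follows receives one such vector per time step instead of a bare integer.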



The embedding layer takes three main arguments:
- input_dim: This is the size of the vocabulary in the text data. In our example, the text data is integer encoded to values between 0 and 999, so the size of the vocabulary is 1000 words.
- output_dim: This is the size of the vector space in which words will be embedded. We have specified it as 128.
- input_length: This is the length of the input sequences, as we would define for any input layer of a keras model. This argument is required if we are going to connect Flatten and then Dense layers after the embedding layer; we did not need to set it here.

In the next layer, we define a simple RNN with 32 hidden units. If $m$ is the number of input dimensions and $n$ is the number of hidden units in the RNN layer, then the number of trainable parameters is given by:

$$n^2 + nm + n$$

The last layer is densely connected with a single output node, and here we use the sigmoid activation function since this is a binary classification task.
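
As a quick check, the parameter counts of the recurrent and dense layers can be reproduced by hand. The variable names below are illustrative; the arithmetic counts the recurrent weights, the input weights, and the biases of the simple RNN, followed by the weights and bias of the single output node.

```r
input_dim <- 128   # size of each embedded word vector fed to the RNN
units     <- 32    # number of hidden units in the RNN layer

# Recurrent weights + input weights + biases.
rnn_params <- units * units + units * input_dim + units
rnn_params     # 5152

# One weight per hidden unit plus one bias for the output node.
dense_params <- units * 1 + 1
dense_params   # 33
```

Both values agree with the simple_rnn and dense rows of the model summary shown earlier.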

Step 5:

Next, we compile the model. We specify "binary_crossentropy" as the loss function since we are dealing with binary classification, and this loss function is preferred for models that output probabilities. The optimizer used is "adam". We then fit the model to our training data, holding out 20% of it for validation.
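
To see why this loss suits probabilistic outputs, here is a minimal base R sketch of binary cross-entropy. The function and the example predictions are illustrative, and the small `eps` clipping is an assumption added to avoid taking the logarithm of zero.

```r
# Mean binary cross-entropy between 0/1 labels and predicted probabilities.
binary_crossentropy <- function(y_true, y_prob, eps = 1e-7) {
  y_prob <- pmin(pmax(y_prob, eps), 1 - eps)   # keep log() finite
  -mean(y_true * log(y_prob) + (1 - y_true) * log(1 - y_prob))
}

y_true <- c(1, 0, 1, 0)
confident_right <- c(0.9, 0.1, 0.8, 0.2)
confident_wrong <- c(0.1, 0.9, 0.2, 0.8)

loss_right <- binary_crossentropy(y_true, confident_right)
loss_wrong <- binary_crossentropy(y_true, confident_wrong)
c(right = loss_right, wrong = loss_wrong)
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalised heavily, which is the behaviour we want when training a sigmoid output.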

Step 6:

In this step, we evaluate the model on the test data to see how well it performs on reviews it has not seen during training.
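
The accuracy metric reported during evaluation can be reproduced by hand from predicted probabilities. The labels and probabilities below are made up for illustration, and the 0.5 threshold is the usual convention for a sigmoid output rather than something set explicitly in this recipe.

```r
y_true <- c(1, 0, 0, 1, 1, 0)                      # made-up true labels
y_prob <- c(0.81, 0.35, 0.62, 0.44, 0.93, 0.08)    # made-up sigmoid outputs

# Turn probabilities into class predictions.
y_pred <- as.integer(y_prob > 0.5)

# Accuracy is the share of predictions that match the labels.
accuracy <- mean(y_pred == y_true)
accuracy

# A confusion table shows where the mistakes are.
table(actual = y_true, predicted = y_pred)
```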



### There is more...

By now, you are aware of how backpropagation through time (BPTT) works in an RNN. We traverse the network backwards, calculating the gradients of the error with respect to the weights at each step. As we move closer to the early layers of the network, these gradients can become very small, which makes the neurons in these layers learn very slowly.
For an accurate model, it is crucial that the early layers are trained well, since these layers are responsible for learning simple patterns from the input and passing the relevant information on to the following layers.
RNNs often face this challenge when we train huge networks with many dependencies within the layers.
This challenge is referred to as the vanishing gradient problem; it makes the network learn too slowly and reduces the accuracy of the results.
It is often advised to use the ReLU activation function to mitigate the vanishing gradient problem in large networks.
Another very common way to deal with this issue is to use the long short-term memory (LSTM) model, which we will talk about in the following recipe.
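
The vanishing gradient effect can be observed numerically with a toy scalar RNN in base R. The recurrent weight of 0.5, the random inputs, and the variable names are assumptions chosen for illustration. The sketch tracks how strongly the state at step t still depends on the initial state.

```r
set.seed(7)
n_steps <- 30
w <- 0.5                 # recurrent weight of the toy scalar RNN
x <- rnorm(n_steps)      # random inputs

h <- 0
grad <- 1                # derivative of h_t with respect to h_0
grad_history <- numeric(n_steps)
for (t in seq_len(n_steps)) {
  h <- tanh(w * h + x[t])
  # Chain rule through one more step: d tanh(a) / d a = 1 - tanh(a)^2.
  grad <- grad * w * (1 - h^2)
  grad_history[t] <- abs(grad)
}
signif(grad_history[c(1, 5, 10, 30)], 3)
```

Each step multiplies the gradient by a factor of at most 0.5, so after 30 steps almost nothing of the signal from the first step remains. This is why a simple RNN struggles to learn long-range dependencies.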

Another challenge that RNNs encounter is the exploding gradient problem.
In this case, we see large gradient values, which make the weight updates too large and the training unstable. In some cases, gradients can also become NaN due to numerical overflow in computations.
When this happens, the weights in the network change by huge margins within a short time during training.
The most commonly used remedy for this problem is gradient clipping, which prevents the gradients from increasing beyond a specified threshold.
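
Clipping by norm, one common form of gradient clipping, can be sketched in a few lines of base R. The helper `clip_by_norm()` and the example gradients are illustrative and are not taken from keras.

```r
# Rescale a gradient vector so that its Euclidean norm is at most `max_norm`.
clip_by_norm <- function(grad, max_norm) {
  grad_norm <- sqrt(sum(grad^2))
  if (grad_norm > max_norm) grad * (max_norm / grad_norm) else grad
}

g_large <- c(30, -40)     # norm 50, far above the threshold
g_small <- c(0.3, -0.4)   # norm 0.5, already below the threshold

clip_by_norm(g_large, max_norm = 1)   # rescaled to norm 1, same direction
clip_by_norm(g_small, max_norm = 1)   # returned unchanged
```

The direction of the update is preserved and only its size is limited. In keras, the optimizer functions such as optimizer_adam() should accept clipnorm and clipvalue arguments for this purpose; check the documentation of your installed version.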


### See also...