# What Are Recurrent Neural Networks (RNNs)?

RNNs are a powerful and robust type of neural network. Unlike feed-forward networks, they keep an internal memory (a hidden state that is carried from one step to the next), which makes them one of the most promising algorithms for sequential data.

Like many other deep learning algorithms, recurrent neural networks are relatively old. They were initially created in the 1980s, but only in recent years have we seen their true potential. An increase in computational power, the massive amounts of data that we now have to work with, and the invention of long short-term memory (LSTM) in the 1990s have brought RNNs to the foreground.

Because of their internal memory, RNNs can remember important things about the input they received, which allows them to be very precise in predicting what’s coming next. This is why they’re the preferred algorithm for sequential data like time series, speech, text, financial data, audio, video, weather and much more. Recurrent neural networks can form a much deeper understanding of a sequence and its context compared to other algorithms.
 
# How Do Recurrent Neural Networks Work?

To understand RNNs properly, you’ll need a working knowledge of “normal” feed-forward neural networks and sequential data. 

Sequential data is ordered data in which related items follow each other. Examples are financial data or DNA sequences. The most popular type of sequential data is perhaps time series data, which is a series of data points listed in time order.

![](2024-03-01-22-08-31.png)
 
# Recurrent vs. Feed-Forward Neural Networks


RNNs and feed-forward neural networks get their names from the way they channel information.

In a feed-forward neural network, the information only moves in one direction — from the input layer, through the hidden layers, to the output layer. The information moves straight through the network.

Feed-forward neural networks have no memory of the input they receive and are bad at predicting what’s coming next. Because a feed-forward network only considers the current input, it has no notion of order in time. It simply can’t remember anything about what happened in the past except its training.

In an RNN, the information cycles through a loop. When it makes a decision, it considers the current input and also what it has learned from the inputs it received previously.

The two images below illustrate the difference in information flow between an RNN and a feed-forward neural network.

![](2024-03-01-22-09-19.png)

A standard RNN has only a short-term memory. Combined with LSTM units, it also gains a long-term memory (more on that later).

Another good way to illustrate the concept of a recurrent neural network’s memory is to explain it with an example: Imagine you have a normal feed-forward neural network and give it the word “neuron” as an input and it processes the word character by character. By the time it reaches the character “r,” it has already forgotten about “n,” “e” and “u,” which makes it almost impossible for this type of neural network to predict which character would come next.

A recurrent neural network, however, is able to remember those characters because of its internal memory. It produces output, copies that output and loops it back into the network. 

Simply put: Recurrent neural networks add the immediate past to the present.

Therefore, an RNN has two inputs: the present and the recent past. This is important because the sequence of data contains crucial information about what is coming next, which is why an RNN can do things other algorithms can’t.
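
The idea of "two inputs" can be sketched in a few lines of plain Python. This is a minimal illustration, not a full RNN: it assumes a single scalar hidden state, a `tanh` activation, and arbitrary fixed weights `w_x` and `w_h` (all hypothetical choices made for this example). Both sequences below end with the same input, yet the final states differ because the second input to each step is the state left behind by the past.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: combine the present input with the recent past.

    h_t = tanh(w_x * x_t + w_h * h_prev + b)
    The weights are arbitrary illustrative values, not trained ones.
    """
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence):
    """Feed a sequence through the step function, carrying the state along."""
    h = 0.0  # empty memory before the first input
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h)
        states.append(h)
    return states

# Same final input (0.0), different history.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])

print("with a past input :", states_a)
print("without past input:", states_b)
```

A feed-forward network given only the last input would produce the same output in both cases, since it never sees `h_prev`.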

A feed-forward neural network, like all other deep learning algorithms, assigns a weight matrix to its inputs and then produces the output. Note that RNNs apply weights to the current input and also to the previous hidden state. Furthermore, a recurrent neural network adjusts these weights through gradient descent and backpropagation through time.

# Types of Recurrent Neural Networks


    One to One
    One to Many
    Many to One
    Many to Many

Also note that while feed-forward neural networks map one input to one output, RNNs can map one to many, many to many (translation) and many to one (classifying a voice).
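
As a rough sketch of these mappings, the same recurrent loop can be read out in different ways. The snippet below assumes a scalar hidden state and arbitrary weights (hypothetical values for illustration only); the helper names `many_to_one`, `many_to_many` and `one_to_many` are made up for this example.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Return the hidden state after every time step (scalar toy RNN)."""
    h = 0.0
    states = []
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h)
        states.append(h)
    return states

def many_to_one(inputs):
    """Read only the last state, e.g. to classify a whole sequence."""
    return run_rnn(inputs)[-1]

def many_to_many(inputs):
    """Read one output per input step, e.g. to label every element."""
    return run_rnn(inputs)

def one_to_many(x, steps):
    """Feed a single input, then let the state evolve on its own."""
    return run_rnn([x] + [0.0] * (steps - 1))

inputs = [1.0, 0.5, -0.3]
print("many to one :", many_to_one(inputs))
print("many to many:", many_to_many(inputs))
print("one to many :", one_to_many(1.0, 4))
```

In Keras, the difference between the first two is typically controlled by the `return_sequences` argument of the recurrent layer.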

![](2024-03-01-22-10-18.png)

# Recurrent Neural Networks and Backpropagation Through Time

To understand the concept of backpropagation through time (BPTT), you’ll need to understand the concepts of forward and backpropagation first. We could spend an entire article discussing these concepts, so I will attempt to provide as simple a definition as possible.

## What Is Backpropagation?

Backpropagation (BP or backprop) is known as a workhorse algorithm in machine learning. It is used for calculating the gradient of an error function with respect to a neural network’s weights. The algorithm works its way backwards through the layers, applying the chain rule to find the partial derivatives of the error with respect to the weights. These derivatives are then used to adjust the weights and decrease the error during training.

In neural networks, you do forward propagation to get the output of your model and compare it with the expected output to get the error. Backpropagation then goes backwards through the network to find the partial derivatives of the error with respect to the weights, which enables you to subtract a fraction of these values (scaled by the learning rate) from the weights.

Those derivatives are then used by gradient descent, an algorithm that can iteratively minimize a given function. Then it adjusts the weights up or down, depending on which decreases the error. That is exactly how a neural network learns during the training process.

So, with backpropagation you basically try to tweak the weights of your model while training.
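
To make the loop of "forward pass, error, derivative, weight update" concrete, here is a minimal sketch with a single weight. It assumes a toy model `y = w * x`, a mean squared error, and a learning rate of 0.05; the data and all names are invented for this example.

```python
# Toy data generated by the rule y = 2 * x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mse(w):
    """Forward pass plus error: mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w):
    """Partial derivative of the error with respect to the weight."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # initial weight
learning_rate = 0.05  # step size (assumed value)

for step in range(50):
    # Gradient descent: move the weight against the gradient.
    w -= learning_rate * mse_gradient(w)

print("learned weight:", w)
print("final error   :", mse(w))
```

In a real network the derivative is not written by hand for each weight; backpropagation computes all of them in one backward pass.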

The image below illustrates the concept of forward propagation and backpropagation in a feed-forward neural network:

![](2024-03-01-22-11-06.png)

BPTT is basically just a fancy buzzword for doing backpropagation on an unrolled recurrent neural network. Unrolling is a visualization and conceptual tool, which helps you understand what’s going on within the network. Most of the time when implementing a recurrent neural network in the common programming frameworks, backpropagation is automatically taken care of, but you need to understand how it works to troubleshoot problems that may arise during the development process.

You can view an RNN as a sequence of neural networks that you train one after another with backpropagation.

The image below illustrates an unrolled RNN. To the left of the equal sign is the RNN with its loop; to the right of the equal sign, the same RNN is unrolled. Note there is no cycle after the equal sign since the different time steps are visualized and information is passed from one time step to the next. This illustration also shows why an RNN can be seen as a sequence of neural networks.

An unrolled version of an RNN:

![](2024-03-01-22-12-01.png)

If you do BPTT, the conceptualization of unrolling is required since the error of a given time step depends on the previous time step.

Within BPTT the error is backpropagated from the last to the first time step, while unrolling all the time steps. This allows calculating the error for each time step, which allows updating the weights. Note that BPTT can be computationally expensive when you have a high number of time steps.
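
The following is a small sketch of BPTT under simplifying assumptions: a scalar hidden state, a `tanh` activation, and a loss of `0.5 * (h_T - target) ** 2` on the last time step only. The function names and numbers are hypothetical. The backward loop walks from the last time step to the first, and the result is compared against a numerical (finite-difference) gradient as a sanity check.

```python
import math

def forward(xs, w_x, w_h):
    """Unrolled forward pass; hs[0] is the initial state, hs[t] the state after xs[t-1]."""
    hs = [0.0]
    for x_t in xs:
        hs.append(math.tanh(w_x * x_t + w_h * hs[-1]))
    return hs

def loss_fn(xs, target, w_x, w_h):
    """Squared error on the final hidden state (assumed loss for this sketch)."""
    return 0.5 * (forward(xs, w_x, w_h)[-1] - target) ** 2

def bptt(xs, target, w_x, w_h):
    """Backpropagate the error from the last time step to the first."""
    hs = forward(xs, w_x, w_h)
    grad_wx = 0.0
    grad_wh = 0.0
    dh = hs[-1] - target                   # dLoss/dh at the last step
    for t in range(len(xs), 0, -1):
        dpre = dh * (1.0 - hs[t] ** 2)     # through tanh
        grad_wx += dpre * xs[t - 1]        # the same weights are shared by every step,
        grad_wh += dpre * hs[t - 1]        # so their gradients accumulate
        dh = dpre * w_h                    # pass the error to the previous step
    return grad_wx, grad_wh

xs, target, w_x, w_h = [1.0, 0.5, -0.3], 0.2, 0.5, 0.8
grad_wx, grad_wh = bptt(xs, target, w_x, w_h)

# Numerical check with central differences.
eps = 1e-6
num_wx = (loss_fn(xs, target, w_x + eps, w_h) - loss_fn(xs, target, w_x - eps, w_h)) / (2 * eps)
num_wh = (loss_fn(xs, target, w_x, w_h + eps) - loss_fn(xs, target, w_x, w_h - eps)) / (2 * eps)

print("BPTT gradients     :", grad_wx, grad_wh)
print("numerical gradients:", num_wx, num_wh)
```

The backward loop has one iteration per time step, which is why the cost of BPTT grows with the sequence length.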

 
# Common Problems of Recurrent Neural Networks

While RNNs have been a difference-maker in the deep learning space, there are some issues to keep in mind: 

    Exploding gradients: During backpropagation through time, the gradients can grow very large, which leads to huge, unstable weight updates. Fortunately, this problem can usually be handled by clipping (truncating or squashing) the gradients.
     
    Vanishing gradients: These occur when the values of a gradient become too small, so the model stops learning or takes far too long to learn. The LSTM architecture, introduced by Sepp Hochreiter and Juergen Schmidhuber, largely mitigates this problem.
     
    Complex training process: Because RNNs process data sequentially, training cannot easily be parallelized across time steps, which can make it tedious.
     
    Difficulty with long sequences: The longer the sequence, the harder it is for an RNN to retain information from early time steps.
     
    Inefficient methods: RNNs process data sequentially, which can be a slow and inefficient approach.
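
The first two problems in the list above can be illustrated with a rough sketch. It assumes a simplified linear scalar RNN, in which the error passed back through `steps` time steps is multiplied by the recurrent weight each time; `backprop_factor` and `clip_by_norm` are hypothetical helper names written for this example, and clipping by norm is one common way of "squashing" gradients.

```python
import math

def backprop_factor(w_h, steps):
    """Factor applied to an error sent back `steps` steps in a linear scalar RNN."""
    return abs(w_h) ** steps

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A recurrent weight below 1 makes the gradient vanish, above 1 makes it explode.
print("w_h = 0.5, 50 steps:", backprop_factor(0.5, 50))
print("w_h = 1.5, 50 steps:", backprop_factor(1.5, 50))

# Clipping keeps the direction of the gradient but limits its size.
clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
print("clipped gradient:", clipped)
```

Keras optimizers expose this kind of clipping through their `clipnorm` and `clipvalue` arguments, so it rarely has to be written by hand.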

 
# Benefits of Recurrent Neural Networks 

The upsides of recurrent neural networks overshadow any challenges. Here are a few reasons RNNs have accelerated machine learning innovation:  

    Extensive memory: RNNs can remember previous inputs and outputs, and this ability is enhanced with the help of LSTM networks.
     
    Greater accuracy: Because RNNs take earlier inputs into account, they can make more accurate predictions on sequential data than models that see only the current input.
     
    Sequential data expertise: RNNs understand the temporal aspect of data, making them ideal for processing sequential data.
     
    Wide versatility: RNNs can handle sequential data like time series, audio and speech, giving them a broad range of applications.  

 
# Recurrent Neural Networks and Long Short-Term Memory (LSTM) 

Long short-term memory networks (LSTMs) are an extension of RNNs that extends their memory. They are therefore well suited to learning from important experiences that are separated by very long time lags.

## What Is Long Short-Term Memory (LSTM)?

LSTM units are used as the building blocks for the layers of an RNN, which is then often called an LSTM network. LSTMs assign data “weights,” which help the network decide whether to let new information in, forget information or give it enough importance to impact the output.

LSTMs enable RNNs to remember inputs over a long period of time. This is because LSTMs store information in a memory, much like the memory of a computer. The LSTM can read, write and delete information from its memory.

This memory can be seen as a gated cell, with gated meaning the cell decides whether or not to store or delete information (i.e., if it opens the gates or not), based on the importance it assigns to the information. The assigning of importance happens through weights, which are also learned by the algorithm. This simply means that it learns over time what information is important and what is not.

In a long short-term memory cell you have three gates: input, forget and output gate. These gates determine whether or not to let new input in (input gate), delete the information because it isn’t important (forget gate), or let it impact the output at the current time step (output gate). Below is an illustration of an LSTM cell with its three gates:

![](2024-03-01-22-12-39.png)

The gates in an LSTM are analog: each is a sigmoid whose output ranges from zero to one, rather than a hard on/off switch. Because sigmoids are differentiable, the error can be backpropagated through the gates.
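
A single LSTM step can be sketched as follows. To stay readable, this assumes scalar input, hidden state and cell state, and uses the commonly described gate equations; the parameter dictionary and its values are hypothetical and chosen by hand, not learned. With a forget gate close to one and an input gate close to zero, the cell keeps its stored value almost unchanged.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each gate name to a tuple (w_x, w_h, b).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))          # how much new information to let in
    f = sigmoid(pre_activation("forget"))         # how much old memory to keep
    o = sigmoid(pre_activation("output"))         # how much memory to expose
    g = math.tanh(pre_activation("candidate"))    # candidate new memory content

    c_t = f * c_prev + i * g                      # update the cell state (the memory)
    h_t = o * math.tanh(c_t)                      # output at the current time step
    return h_t, c_t, (i, f, o)

# Hand-picked parameters: keep the memory (forget ~ 1), ignore new input (input ~ 0).
params = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h_t, c_t, gates = lstm_step(x_t=0.7, h_prev=0.1, c_prev=1.0, params=params)
print("gates (input, forget, output):", gates)
print("cell state before: 1.0, after:", c_t)
```

In a trained network the gate parameters are learned, so the cell decides for itself when to store, keep or discard information.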

LSTMs largely mitigate the problem of vanishing gradients because the gated cell state lets gradients flow across many time steps without shrinking too quickly, which keeps the training relatively short and the accuracy high.

In [None]:
import tensorflow as tf

# Check for the presence of GPUs (stable API; the experimental alias is equivalent)
gpus = tf.config.list_physical_devices('GPU')

if not gpus:
    print("No GPU devices found.")
else:
    print("Available GPU devices:")
    for gpu in gpus:
        print(gpu)

In [6]:
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())

Available GPU devices:
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


In [7]:
#import required libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [8]:
#read dataset
with open('C:\\Users\\manas\\Downloads\\sherlock-holm.es_stories_plain-text_advs.txt' , 'r', encoding='utf-8') as file:
    text = file.read()

In [9]:
#Tokenizer process
tokenizer = Tokenizer()
#fit
tokenizer.fit_on_texts([text])
#assign length of word index
total_words = len(tokenizer.word_index) + 1

In [10]:
#check the tokens
tokenizer.word_index

{'the': 1,
 'and': 2,
 'i': 3,
 'to': 4,
 'of': 5,
 'a': 6,
 'in': 7,
 'that': 8,
 'it': 9,
 'he': 10,
 'you': 11,
 'was': 12,
 'his': 13,
 'is': 14,
 'my': 15,
 'have': 16,
 'as': 17,
 'with': 18,
 'had': 19,
 'which': 20,
 'at': 21,
 'for': 22,
 'but': 23,
 'me': 24,
 'not': 25,
 'be': 26,
 'we': 27,
 'from': 28,
 'there': 29,
 'this': 30,
 'said': 31,
 'upon': 32,
 'so': 33,
 'holmes': 34,
 'him': 35,
 'her': 36,
 'she': 37,
 "'": 38,
 'very': 39,
 'your': 40,
 'been': 41,
 'all': 42,
 'on': 43,
 'no': 44,
 'what': 45,
 'one': 46,
 'then': 47,
 'were': 48,
 'by': 49,
 'are': 50,
 'an': 51,
 'would': 52,
 'out': 53,
 'when': 54,
 'up': 55,
 'man': 56,
 'could': 57,
 'has': 58,
 'do': 59,
 'into': 60,
 'mr': 61,
 'who': 62,
 'little': 63,
 'will': 64,
 'if': 65,
 'some': 66,
 'now': 67,
 'see': 68,
 'down': 69,
 'should': 70,
 'our': 71,
 'or': 72,
 'they': 73,
 'may': 74,
 'well': 75,
 'am': 76,
 'us': 77,
 'over': 78,
 'more': 79,
 'think': 80,
 'room': 81,
 'know': 82,
 'shall': 83

In [11]:
#declare ngrams
input_sequences = []
#split the sentence from '\n'
for line in text.split('\n'):
    #get tokens
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In [12]:
sentence_tokens = input_sequences[3] # [1, 1561, 5, 129, 34]
# index_word is the reverse of word_index (token id -> word)
sentence = [tokenizer.index_word[token] for token in sentence_tokens]
print(sentence)

['the', 'adventures', 'of', 'sherlock', 'holmes']


In [13]:
#maximum sentence length
max_sequence_len = max([len(seq) for seq in input_sequences])
# input sequences
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

In [14]:
X = input_sequences[:, :-1]
y = input_sequences[:, -1]
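
The data preparation in the cells above can be traced by hand on a single line of text. The sketch below is plain Python and only imitates what the notebook does: `prefix_sequences` and `pad_pre` are hypothetical helpers written for this illustration (the second mirrors the effect of `pad_sequences(..., padding='pre')` on short lists), and the token ids are the ones shown for "the adventures of sherlock holmes".

```python
def prefix_sequences(tokens):
    """Every prefix of length >= 2: the last token of each prefix is the word to predict."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]

def pad_pre(seq, length, pad_value=0):
    """Left-pad a sequence with zeros up to the given length."""
    return [pad_value] * (length - len(seq)) + list(seq)

tokens = [1, 1561, 5, 129, 34]  # "the adventures of sherlock holmes"

sequences = prefix_sequences(tokens)
max_len = max(len(seq) for seq in sequences)
padded = [pad_pre(seq, max_len) for seq in sequences]

# Inputs are all tokens but the last; the label is the last token.
X_toy = [row[:-1] for row in padded]
y_toy = [row[-1] for row in padded]

for features, label in zip(X_toy, y_toy):
    print(features, "->", label)
```

Each row therefore asks the model to predict the next word from the words before it, with zeros filling the unused positions on the left.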

In [15]:
#convert one-hot-encode
y = np.array(tf.keras.utils.to_categorical(y, num_classes=total_words))

In [16]:
#create model
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(LSTM(150))
model.add(Dense(total_words, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 17, 100)           820000    
                                                                 
 lstm (LSTM)                 (None, 150)               150600    
                                                                 
 dense (Dense)               (None, 8200)              1238200   
                                                                 
Total params: 2,208,800
Trainable params: 2,208,800
Non-trainable params: 0
_________________________________________________________________
None


In [17]:
#compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#fit the model
model.fit(X, y, epochs=100, verbose=1)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x2c1ff5cfa00>

In [18]:
# seed text to continue
seed_text = "I will close the door if"
# number of words to generate
next_words = 7

for _ in range(next_words):
    # convert the current text to token ids
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # left-pad to the input length the model was trained on
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # pick the most probable next token (greedy decoding)
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)
    # map the token id back to a word; id 0 is padding and has no word
    output_word = tokenizer.index_word.get(int(predicted[0]), "")
    seed_text += " " + output_word

print(seed_text)

I will close the door if we were out to a man if
