# **Deep Learning Fundamentals: RNN**

# What are Recurrent Neural Networks?

Recurrent neural networks (RNNs) are a variety of artificial neural network (ANN) distinguished by an internal memory: a hidden state that carries information from earlier inputs forward to later ones.

The genesis of RNNs can be traced as far back as the Ising model, devised in 1925, which describes ferromagnetism in terms of interacting spins on a lattice.

Shunichi Amari's work on geometric structure in 1972 gave mathematicians the tools to understand neural networks.

The Hopfield network and David Rumelhart's work in the 1980s helped refine this model further, culminating in the creation of LSTM networks in 1997.

Because they possess an internal memory, RNNs are well suited to natural language processing and to forecasting sequences such as stock prices or the weather.

# How do RNNs work?

RNNs operate on sequential data: ordered data in which each item is related to the items around it. Some examples are financial time series, protein sequences and DNA sequences.

![](./Images/2024-03-01-22-08-31.png)

RNNs are best understood by contrast with feed-forward neural networks, the simplest kind of ANN.

A feed-forward neural network channels information in only one direction: from the input layer, through the hidden layers, to the output layer. The information moves straight through the network.

Feed-forward neural networks therefore have no memory of the inputs they received earlier, which makes them poor at predictions that depend on what came before. Because a feed-forward network only considers the current input, it has no notion of order in time.

In an RNN, however, information travels in a loop. When a decision is made, the network considers the current input together with what it learned from the inputs that preceded it.
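The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the layer Keras uses later in this article: the sizes, the random weights and the names `rnn_step` and `run_rnn` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 3, 4          # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden (the loop)
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def run_rnn(sequence):
    """Feed a sequence one item at a time, carrying the hidden state along."""
    h = np.zeros(hidden_size)
    for x_t in sequence:
        h = rnn_step(x_t, h)
    return h

sequence = rng.normal(size=(5, input_size))
h_forward = run_rnn(sequence)
h_reversed = run_rnn(sequence[::-1])
print(h_forward)
print(h_reversed)  # differs from h_forward: the order of the inputs matters
```

Feeding the same items in reverse order gives a different final state, which is exactly the sensitivity to order that a feed-forward network lacks.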

![](./Images/2024-03-01-22-09-19.png)

# Types of Recurrent neural networks

The types of recurrent neural networks, grouped by the shape of their input and output, are:

- One to One
- One to Many
- Many to One
- Many to Many

![](./Images/2024-03-03-06-38-23.png)

# RNN and Backpropagation through time (BPTT)

In neural networks, each neuron can be thought of as a block of mathematical calculations with an input and an output. What makes each of these neurons different is the weights attached to it. Adjusting those weights allows the neural network to explore different solutions to reduce error.

Forward propagation refers to each layer of neurons passing its output to the next layer as input. Once the final layer has produced its output, the error against the expected result is calculated.

In backpropagation, the model works backwards from that error through the neurons to find the partial derivatives of the error with respect to each weight. These derivatives (gradients) are then used to adjust the weights so that the error decreases.

![](./Images/2024-03-01-22-11-06.png)

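For development and troubleshooting purposes, RNNs are sometimes "unrolled" into one copy of the network per time step, which makes it possible to visualize what is going on in the model and make adjustments. Backpropagation through time (BPTT) is backpropagation applied to this unrolled network, with the gradients for the shared weights summed over all time steps.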
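To make the idea concrete, here is a small sketch of backpropagation through time on a deliberately tiny RNN with a single scalar weight for the hidden state and one for the input. The model, the toy sequence and the function names are assumptions chosen for illustration; the finite-difference comparison at the end is only a sanity check on the hand-written gradients.

```python
import numpy as np

def forward(w, u, xs):
    """Run a scalar RNN h_t = tanh(w * h_{t-1} + u * x_t) and keep every state."""
    hs = [0.0]                      # h_0
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    """Walk the unrolled network backwards, summing gradients of the shared weights."""
    hs = forward(w, u, xs)
    dw, du = 0.0, 0.0
    dh = hs[-1] - target            # dLoss/dh_T
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)   # back through tanh
        dw += da * hs[t - 1]           # the same w is used at every step
        du += da * xs[t - 1]
        dh = da * w                    # pass the gradient on to h_{t-1}
    return dw, du

xs, target = [0.5, -1.0, 0.8, 0.3], 0.2   # toy sequence and target
w, u = 0.7, -0.4
dw, du = bptt(w, u, xs, target)

# Compare with a finite-difference estimate of the same gradients
eps = 1e-6
dw_num = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
du_num = (loss(w, u + eps, xs, target) - loss(w, u - eps, xs, target)) / (2 * eps)
print(dw, dw_num)
print(du, du_num)
```

Note how the gradient passed back at each step is multiplied by `w` and by the slope of tanh; repeated over many steps, this product is what makes gradients shrink or blow up.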

![](./Images/2024-03-01-22-12-01.png)

# Pros and Cons of RNN

**Pros**

- It can process input of any length.
- Model size doesn't increase with increase in input size.
- Computation will take into account older information.
- Weights are shared across time.

**Cons**

- Computation is slow, because computation is sequential and cannot be done in parallel.
- Model has a short term memory and therefore has trouble accessing information from a long time ago.
- Model cannot consider any future input for the current state.

# Limitations of RNN

- **Exploding gradients:** This happens when the gradients grow very large as they are propagated back through many time steps, so the weights receive huge, destabilizing updates. The problem is usually handled by truncating or squashing (clipping) the gradients.
- **Vanishing gradients:** This happens when the gradient values become so small that the model makes little to no change to its weights. The model then either stops learning or takes far too long to learn. LSTM was designed to mitigate this issue.
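Squashing an exploding gradient is often done by clipping its norm. The sketch below shows the idea in NumPy; the function name `clip_by_norm` and the threshold are assumptions for this example. In Keras, optimizers accept a `clipnorm` argument for the same purpose.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploding = np.array([30.0, -40.0])   # norm 50
small = np.array([0.3, -0.4])         # norm 0.5

clipped = clip_by_norm(exploding, max_norm=5.0)
untouched = clip_by_norm(small, max_norm=5.0)
print(clipped)     # norm is now 5
print(untouched)   # unchanged
```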

# Long Short Term Memory

LSTM networks are an extension of RNNs designed to mitigate the vanishing gradient problem. LSTM cells are used as the basic building blocks of an RNN layer and have gates that decide which information to keep and which to discard. This flow of information is controlled by three gates:

1. **Forget Gate**

The Forget Gate decides which information in the cell state carried over from the previous time step should be discarded. The decision is made by a sigmoid function, which outputs a value between 0 (forget completely) and 1 (keep completely) for each element of the cell state. Essentially, the Forget Gate helps the model to selectively "forget" certain pieces of information, making room for new inputs.

2. **Input Gate**

The Input Gate is responsible for determining how much new information should be added to the cell state. This involves two steps:

- The sigmoid function decides which values to let through (each gate value ranges from 0 to 1).
- The tanh function produces candidate values, ranging from -1 to 1, to be added to the cell state. The candidates are multiplied by the sigmoid output before being added.

3. **Output Gate**

The Output Gate is tasked with deciding which portion of the current cell state contributes to the output. Similar to the Input Gate, this is a two-step process:

- The sigmoid function determines which values should be let through (ranging from 0 to 1).
- The tanh function squashes the cell state into the range -1 to 1. This result is then multiplied by the output of the sigmoid function to give the cell's output.

In essence, the Output Gate regulates the information that is passed on to the model's output, and considers both the relevance and significance of the current cell state. 
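The three gates can be put together in a short NumPy sketch of a single LSTM time step. This follows the standard LSTM equations rather than the internals of any particular library; the sizes, the random weights and the name `lstm_step` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the four gate pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * n:1 * n])        # forget gate: what to drop from c_prev
    i = sigmoid(z[1 * n:2 * n])        # input gate: how much new information to admit
    g = np.tanh(z[2 * n:3 * n])        # candidate values in [-1, 1]
    o = sigmoid(z[3 * n:4 * n])        # output gate: what to expose
    c_t = f * c_prev + i * g           # updated cell state
    h_t = o * np.tanh(c_t)             # hidden state / output
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4         # hypothetical sizes
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
print(c)
```

The cell state is updated by addition (`f * c_prev + i * g`) rather than by repeated squashing, which is the design choice that helps gradients survive over long sequences.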

![](./Images/2024-03-03-06-59-28.png)

![](./Images/2024-03-01-22-12-39.png)

# RNN Project: Using Sherlock Holmes to predict a sentence

For this project, we will demonstrate the natural language processing capabilities of RNNs using a story from Arthur Conan Doyle's Sherlock Holmes series: The Final Problem, which sees Sherlock meeting his match in Professor Moriarty.

**_Why use this data set?_**

Correct utilization of AI depends as much on good data as it does on the appropriate model.

A story like The Final Problem covers many of the common uses of natural language:

- Dialogue
- Internal monologue (thinking)
- Exposition (description of things like events, actions, feelings)

Therefore, using stories as datasets can provide the RNN model with a wide variety of sentences in a wide variety of contexts within a self-contained unit.

# Step 0: Enabling GPU support

Since we will be training for a large number of epochs (100) with a fairly large LSTM layer (150 units), training would take a long time on the default hardware allocation.

By default, TensorFlow trains the model on the CPU. By using CUDA, we can greatly increase our training speed and afford bigger datasets and more epochs.

To enable GPU support, we require the CUDA Toolkit and cuDNN from Nvidia.

However, TensorFlow 2.10 is the last version of TensorFlow that supports GPU usage on native Windows. Later versions require WSL2 (Windows Subsystem for Linux), which runs a Linux environment inside Windows.

Therefore, to use the GPU with TensorFlow natively, we will create a custom Anaconda environment.

To create a custom Anaconda environment, we open the Anaconda Prompt with administrator privileges to ensure sufficient write permissions.

The command to create a custom Anaconda environment is:

 > conda create -n py310 python=3.10

Now we activate our new environment:

 > conda activate py310

We are now inside our environment and can install the needed libraries, starting with the CUDA Toolkit and cuDNN:

 > conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0

And finally, we install the desired version of TensorFlow:
 
 > python -m pip install "tensorflow<2.11"

To test whether the GPU is being detected, we use the following program:

In [2]:
import tensorflow as tf

# Check for the presence of GPUs
gpus = tf.config.experimental.list_physical_devices('GPU')

if not gpus:
    print("No GPU devices found.")
else:
    print("Available GPU devices:")
    for gpu in gpus:
        print(gpu)

Available GPU devices:
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


We have now confirmed that our GPU is present and visible to TensorFlow. We don't need to tell the program to use the GPU explicitly, as TensorFlow assigns operations to the available hardware automatically.

We can confirm that the GPU is being used by training a sample model and watching the resource usage in Task Manager:

![](./Images/2024-03-03-06-04-32.png)

The speed improvement is massive: with the GPU, an epoch can take as little as one third of the time it takes on the CPU. For example, here are the epoch times on the CPU, a Ryzen 7 7840HS:

![](./Images/2024-03-03-06-05-42.png)

And here are the epoch times for the same program with GPU acceleration on the RTX 4060 laptop GPU:

![](./Images/2024-03-03-06-06-32.png)

Now we can continue with our project.

# Step 1: Importing all the necessary libraries

For this project, we will import TensorFlow, an open-source machine learning library from Google. We will build the model with Keras, TensorFlow's high-level API for building and training neural networks.

From TensorFlow we will import the following:

- **Tokenizer:** to convert the data into tokens with unique indices
- **pad_sequences:** to equalize the length of all sequences of tokens
- **Sequential, Embedding, LSTM, Dense:** to build our model

We will also import NumPy for working with data in the form of arrays.

In [24]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Step 2: loading our dataset.

We use the **with** statement together with the **open** function to open the file, read it, and have it closed automatically once its contents have been read.

The **'r'** here signifies that we're only reading the contents of the file, and not making changes.

**encoding='utf-8'** specifies the character encoding of the file, which in our case is UTF-8, a very common, widely supported encoding format.

**as data** binds the opened file to a variable called **data**, so the file is accessible inside the indented code block.

**text = data.read()** reads the entire file as a single string.

**_Note:_** We're reading the entire file as a single string, which means the computer loads _the entire file at once_ into memory. Our dataset is made up of 42,667 characters, for a memory cost of roughly 42 kilobytes. With 16 gigabytes of RAM and 8 gigabytes of dedicated VRAM (video RAM) on the system, this is a non-issue for us. Commercially used datasets, however, can range in size from a few hundred gigabytes to a terabyte or more, and dedicated memory management methods might be required. One such method is iterating with **for line in data**, which reads the file one line at a time so that each line can be processed as it arrives; this is commonly used for reading CSV files.

In [1]:
with open('.\\TheFinalProblem.txt' , 'r', encoding='utf-8') as data:
    text = data.read()
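For comparison, here is a small sketch of the line-by-line approach mentioned in the note. It writes a throwaway file first so that it can run on its own; the sample text and the variable names are assumptions for the example, not part of the project.

```python
import os
import tempfile

# Write a small stand-in file so the example is self-contained
sample = "first line\nsecond line\nthird line\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    path = tmp.name

line_count = 0
char_count = 0
with open(path, "r", encoding="utf-8") as data:
    for line in data:              # only one line is held in memory at a time
        line_count += 1
        char_count += len(line)

os.remove(path)
print(line_count, char_count)
```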

# Step 3: creating tokens

This is the first pre-processing step for training our model. Here, the **Tokenizer** class builds a vocabulary from the text and maps each distinct word to an integer index.

**tokens.fit_on_texts([text])** updates the vocabulary of the Tokenizer based on the text we just provided.

**wordlength = len(tokens.word_index) + 1** stores the size of the vocabulary. **word_index** is a dictionary that maps each word in the dataset to a unique number. We add 1 because **word_index** starts counting at 1: index 0 is reserved (we use it for padding later), so the model needs one more slot than there are words.

In [4]:
tokens = Tokenizer()
tokens.fit_on_texts([text])
wordlength = len(tokens.word_index) + 1
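To see roughly what the word index looks like, the sketch below builds one by hand with the standard library. It is a simplified stand-in, not the Keras implementation: the regular expression, the sample sentence and the name `build_word_index` are assumptions for illustration.

```python
import re
from collections import Counter

def build_word_index(text):
    """Map each word to an integer, most frequent first, starting at 1."""
    words = re.findall(r"[a-z']+", text.lower())   # crude lowercase word split
    counts = Counter(words)
    # most_common() orders by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sample_text = "the game is afoot and the game is on"   # hypothetical sample
sample_index = build_word_index(sample_text)
vocab_size = len(sample_index) + 1                     # +1 for the reserved index 0
print(sample_index)
print(vocab_size)
```

Because no word is ever assigned index 0, that value stays free to act as padding.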

# Step 3.5: checking our tokens

We can inspect the tokens we just created and their indices with **tokens.word_index**, which displays the word-to-index dictionary, most frequent words first (the output below is truncated).

In [5]:
tokens.word_index

{'the': 1,
 'i': 2,
 'to': 3,
 'of': 4,
 'and': 5,
 'a': 6,
 'that': 7,
 'in': 8,
 'he': 9,
 'it': 10,
 'my': 11,
 'was': 12,
 'you': 13,
 'his': 14,
 'which': 15,
 'have': 16,
 'had': 17,
 'is': 18,
 'me': 19,
 'as': 20,
 'at': 21,
 'for': 22,
 'but': 23,
 'we': 24,
 'with': 25,
 'an': 26,
 'be': 27,
 'him': 28,
 'from': 29,
 'on': 30,
 'upon': 31,
 'by': 32,
 'will': 33,
 'said': 34,
 'been': 35,
 'would': 36,
 'there': 37,
 'not': 38,
 'could': 39,
 'our': 40,
 "'": 41,
 'this': 42,
 'holmes': 43,
 'no': 44,
 'so': 45,
 'all': 46,
 'what': 47,
 'one': 48,
 'watson': 49,
 'man': 50,
 'up': 51,
 'has': 52,
 'if': 53,
 'over': 54,
 'moriarty': 55,
 'your': 56,
 'then': 57,
 'were': 58,
 'should': 59,
 'when': 60,
 'only': 61,
 'now': 62,
 'into': 63,
 'are': 64,
 'do': 65,
 'out': 66,
 'see': 67,
 'away': 68,
 'may': 69,
 'down': 70,
 'who': 71,
 'last': 72,
 'some': 73,
 'time': 74,
 'two': 75,
 'professor': 76,
 'way': 77,
 'well': 78,
 'never': 79,
 'us': 80,
 'am': 81,
 'must': 82,

# Step 4: creating our N-grams

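An N-gram is a contiguous sequence of N items (here, words) taken in order from a text. N-grams are used in a variety of contexts, such as protein sequencing, DNA sequencing and language models. In this project, each sequence we build is a prefix of a line: its first two tokens, its first three tokens, and so on.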

![](./Images/2024-03-03-00-14-17.png)
_source: https://botpenguin.com/glossary/n-gram_

Here, **inputsequence** is a new empty list in which we'll store our n-grams.

**for line in text.split('\n')** loops over every line of the text, splitting at the newline character **'\n'**.

**tokenlist = tokens.texts_to_sequences([line])[0]** converts each line into a sequence of token indices. **texts_to_sequences** takes a list of texts and returns a list of sequences; since we pass a single line, we take the first (and only) result with **[0]**.

**for i in range(1, len(tokenlist))** is the inner loop, which iterates from the second token (index 1) to the last token.

**ngramsequence = tokenlist[:i+1]** creates an n-gram sequence containing the tokens from the beginning of the line up to and including the current index.

**inputsequence.append(ngramsequence)** adds the n-gram sequence to the **inputsequence** list.

In [6]:
inputsequence = []
for line in text.split('\n'):
    tokenlist = tokens.texts_to_sequences([line])[0]
    for i in range(1, len(tokenlist)):
        ngramsequence = tokenlist[:i+1]
        inputsequence.append(ngramsequence)
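The effect of the inner loop is easiest to see on a single short line. In the sketch below, the token indices and the name `prefix_sequences` are made up for illustration.

```python
def prefix_sequences(token_ids):
    """Return every prefix of token_ids that has at least two tokens."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

line_tokens = [43, 12, 6, 50]          # hypothetical token indices for one line
prefixes = prefix_sequences(line_tokens)
for seq in prefixes:
    print(seq[:-1], "->", seq[-1])     # context tokens -> token to predict
```

A line of four tokens yields three training sequences, and a line with a single token yields none.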

# Step 5: Setting the maximum sequence length


**maxsequencelen = max([len(seq) for seq in inputsequence])** calculates the length of the longest sequence in **inputsequence**.

**inputsequence = np.array(pad_sequences(inputsequence, maxlen=maxsequencelen, padding='pre'))** makes all the sequences the same length by adding zeros at the beginning (**pre**) of the shorter ones, and arranges them into a NumPy array.

In [7]:
maxsequencelen = max([len(seq) for seq in inputsequence])
inputsequence = np.array(pad_sequences(inputsequence, maxlen=maxsequencelen, padding='pre'))
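Pre-padding can be reproduced by hand to see what it does to the data. The helper below is a simplified sketch that assumes no sequence is longer than the target length; `pad_pre` and the toy sequences are names invented for this example.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros to maxlen (sequences are assumed to fit)."""
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        if seq:                                   # skip empty sequences
            padded[row, -len(seq):] = seq
    return padded

toy_sequences = [[43, 12], [43, 12, 6], [43, 12, 6, 50]]   # hypothetical prefixes
toy_maxlen = max(len(seq) for seq in toy_sequences)
toy_padded = pad_pre(toy_sequences, toy_maxlen)
print(toy_padded)
```

Padding at the front keeps the most recent words, and the word to predict, aligned at the right-hand edge of every row.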

# Step 6: Preparing input and target data

Now we split **inputsequence** into two separate variables, **X** (input) and **y** (target).

**X** contains all the columns of **inputsequence** except the last.

**y** contains only the last column of **inputsequence**.

Therefore, **X** contains the input sequences of tokens and **y** contains the target tokens: the tokens to be "predicted" while training the model.

In [9]:
X = inputsequence[:, :-1]
y = inputsequence[:, -1]
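On a toy array, the two slices look like this. The values are hypothetical and stand in for padded sequences of token indices.

```python
import numpy as np

toy = np.array([
    [0, 0, 43, 12],
    [0, 43, 12, 6],
    [43, 12, 6, 50],
])                                   # hypothetical padded sequences

toy_X = toy[:, :-1]                  # every column except the last: the context
toy_y = toy[:, -1]                   # the last column: the token to predict
print(toy_X)
print(toy_y)
```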

# Step 7: One-hot encoding

One-hot encoding is a method of representing categorical data as binary vectors: vectors whose only values are 0 and 1, with a single 1 marking the position of the category that is present. This way we can represent individual words as vectors and keep them distinct from each other.

For example:

Let's take a vocabulary of 5 words: ["apple", "banana", "orange", "grape", "kiwi"]. Each word in this vocabulary is assigned a unique index:

    "apple" is assigned index 0,
    "banana" is assigned index 1,
    "orange" is assigned index 2,
    "grape" is assigned index 3,
    "kiwi" is assigned index 4.

Now, if we have a sentence like "banana is a fruit," and we want to represent each word in this sentence using one-hot encoding, the matrix would end up with the following binary vectors:

    "banana": [0, 1, 0, 0, 0] (because it's at index 1)
    "is": [0, 0, 0, 0, 0] (in this case, it's not in the vocabulary of fruits)
    "a": [0, 0, 0, 0, 0]
    "fruit": [0, 0, 0, 0, 0]

Each vector has the length of the vocabulary (five words in this case), and contains all zeros except for a 1 at the index corresponding to the word. Words outside the vocabulary have no index, so their vectors are all zeros.

In our code, **tf.keras.utils.to_categorical(y, num_classes=wordlength)** converts the variable **y** into a binary matrix.

To set the number of classes, we use **num_classes=wordlength**, the vocabulary size we calculated earlier.

In [10]:
y = np.array(tf.keras.utils.to_categorical(y, num_classes=wordlength))
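The fruit example above can be reproduced with plain NumPy. This is a sketch of the idea rather than of how Keras implements it; the helper name `one_hot` is an assumption for the example.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Turn integer class indices into rows with a single 1 each."""
    return np.eye(num_classes, dtype=int)[indices]

vocabulary = ["apple", "banana", "orange", "grape", "kiwi"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

encoded = one_hot([word_to_index["banana"], word_to_index["kiwi"]], len(vocabulary))
print(encoded)
```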

# Step 8: building our model

So far, we have:

- Read our input data.
- Converted each word into a token and given it a distinct index number.
- Created N-grams: sequences of words, each represented as a sequence of token indices.
- Determined a maximum sequence length and padded the sequences, where necessary, so that all of them have the same length.
- Split our sequence data into an input sequence (all but the last word of each sequence) and a target (the last word of each sequence), so the model can learn to predict the next word.
- One-hot encoded the target data, so the model can tell which word index is the correct one.

Now that we have all the pieces we need to train the model, we build the model.

**model = Sequential()** initializes a sequential model, where each layer has one input and one output tensor.

**model.add(Embedding(wordlength, 100, input_length=maxsequencelen-1))** adds an embedding layer to the model, which converts each integer index into a dense vector (a vector in which most values are non-zero) of a fixed size. Here each vector has 100 dimensions. During training, the values are adjusted so that words used in similar contexts end up with similar vectors, which lets the model start to capture their usage. The layer expects inputs of length **maxsequencelen-1**, which is why we padded all sequences to the same length.

**model.add(LSTM(150))** adds the LSTM layer, which has 150 units.

**model.add(Dense(wordlength, activation='softmax'))** adds a dense output layer to the model with one output per word in the vocabulary, as set by **wordlength**. The activation function is **softmax**, commonly used for multi-class classification problems. It normalizes the output values into a probability distribution over the entire vocabulary, from which we can pick the most likely next word.
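As a small illustration of what softmax does to the output layer's raw scores, here is a NumPy sketch. The scores are made up, and the four entries stand in for a four-word vocabulary.

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    shifted = logits - np.max(logits)      # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical raw scores for 4 words
probabilities = softmax(scores)
predicted_index = int(np.argmax(probabilities))
print(probabilities)
print(predicted_index)
```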

The structure of our model is: 

![](./Images/2024-03-03-02-09-50.png)

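To get a summary of the model, we use **print(model.summary())**. Since **model.summary()** prints the table itself and returns **None**, the output ends with an extra **None** line.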

In [11]:
model = Sequential()
model.add(Embedding(wordlength, 100, input_length=maxsequencelen-1))
model.add(LSTM(150))
model.add(Dense(wordlength, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 16, 100)           177100    
                                                                 
 lstm (LSTM)                 (None, 150)               150600    
                                                                 
 dense (Dense)               (None, 1771)              267421    
                                                                 
Total params: 595,121
Trainable params: 595,121
Non-trainable params: 0
_________________________________________________________________
None


Here we can see that there are no non-trainable parameters: every weight in the model will be updated during training.
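The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the variable names are chosen for this example, and the vocabulary size is read off the Dense layer's output shape in the summary.

```python
vocab_size = 1771      # wordlength, from the Dense output shape in the summary
embedding_dim = 100
lstm_units = 150

embedding_params = vocab_size * embedding_dim
# 4 gate blocks (forget, input, candidate, output), each with input weights,
# recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 177100 150600 267421 595121
```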

We are now ready to compile the model, and fit it to the data we just prepared.

# Step 9: compiling and fitting our model

We're compiling our model according to the following parameters:

- **Loss function: Categorical Crossentropy**, a common function used for multi-class classification problems

- **Optimizer: Adam**, a well known and widely used optimizer, known for its efficiency and effectiveness in Deep learning

- **Metrics: accuracy**, which will measure what percentage of the predicted words were actually correct. 
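To give a feel for the loss and the metric, the sketch below computes both by hand for two made-up predictions over a three-word vocabulary. The function names and numbers are assumptions for illustration; Keras computes the equivalent quantities internally during training.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class is the true class."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# Two hypothetical samples over a 3-word vocabulary
y_true = np.array([[0, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.1, 0.8, 0.1],     # confident and correct
                   [0.6, 0.3, 0.1]])    # wrong: the true word only gets 0.1

batch_loss = categorical_crossentropy(y_true, y_pred)
batch_accuracy = accuracy(y_true, y_pred)
print(batch_loss, batch_accuracy)
```

The wrong prediction contributes far more to the loss than the correct one, which is what pushes the model to raise the probability of the true word.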

We're fitting our model with the following parameters:

- **Epochs = 100**, which means the model will make 100 full passes over the training data.

- **Verbose = 1** which will show a progress bar for each epoch as it iterates through them.


In [12]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=100, verbose=1)

Epoch 1/100
Epoch 2/100
Epoch 3/100
...

<keras.callbacks.History at 0x1c2f9e907c0>

# Step 10: Moment of truth

It's time to feed the model a seed text (the text it will use as input) and ask it to predict the next words.

We will also ask it to predict a certain number of words, in this case 10.

This seed text will be sent through a prediction loop, which: 

- Converts the current seedtext into a sequence of tokens.
- Pads the sequence to match the model's input length.
- Uses the trained model to predict the next word.
- Converts the predicted word index back into the actual word using the **tokens** word index.
- Appends the predicted word to the seedtext.

In [23]:
seedtext = "What happens we do this"
nextwords = 10

for _ in range(nextwords):
    token_list = tokens.texts_to_sequences([seedtext])[0]
    token_list = pad_sequences([token_list], maxlen=maxsequencelen-1, padding='pre')
    predicted = np.argmax(model.predict(token_list), axis=-1)
    output_word = ""
    for word, index in tokens.word_index.items():
        if index == predicted[0]:
            output_word = word
            break
    seedtext += " " + output_word

print(seedtext)



What happens we do this then we must place me to the strand end of
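The prediction loop above scans the whole word index for every predicted word. One possible alternative is to build the reverse mapping once and look words up directly, as in the sketch below. The small `word_index` dictionary is a hypothetical stand-in for **tokens.word_index**, and `decode` is a name invented for the example.

```python
# Hypothetical slice of a word index; the real one comes from tokens.word_index
word_index = {"the": 1, "i": 2, "to": 3, "holmes": 43, "moriarty": 55}

# Build the reverse mapping once instead of scanning word_index for every prediction
index_to_word = {index: word for word, index in word_index.items()}

def decode(predicted_index):
    """Return the word for an index, or an empty string if it is unknown (e.g. padding 0)."""
    return index_to_word.get(predicted_index, "")

decoded_words = [decode(i) for i in (43, 3, 55, 0)]
print(" ".join(word for word in decoded_words if word))
```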


# Result:

Our final result is: _What happens we do this then we must place me to the strand end of_. It reads like a semi-coherent sentence and is cut off abruptly at the end, simply because we asked the model to stop once it had produced its quota of words, not once it had a complete idea to present.

This is a prime example of the fact that *Artificial Intelligence does what we tell it to do, and not what we want it to do*.

As the amount of data used to train this model grows, the coherence of its output should improve.

# References:

- CS 230 - Recurrent Neural Networks Cheatsheet. (n.d.). https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
- Wikipedia contributors. (2024b, February 10). Long short-term memory. Wikipedia. https://en.wikipedia.org/wiki/Long_short-term_memory
- İlaslan, D. (2023, September 11). Next Word Prediction using LSTM with TensorFlow - Düzgün İlaslan - Medium. Medium. https://medium.com/@ilaslanduzgun/next-word-prediction-using-lstm-with-tensorflow-e2a8f63b613c
- Donges, N. (2024, February 28). A complete guide to Recurrent Neural networks (RNNs). Built In. https://builtin.com/data-science/recurrent-neural-networks-and-lstm
- Kharwal, A. (2022, January 3). Stock Price Prediction with LSTM. Thecleverprogrammer. https://thecleverprogrammer.com/2022/01/03/stock-price-prediction-with-lstm/
- Wikipedia contributors. (2024c, February 12). Recurrent neural network. Wikipedia. https://en.wikipedia.org/wiki/Recurrent_neural_network