<a name='0'></a>

# Recurrent Neural Networks

In the previous parts, we learned about core computer vision tasks such as image classification, object detection, and image segmentation, along with prominent visual architectures such as ConvNets. Those are the things you really want to know well, but deep learning and computer vision don't stop there. In this notebook, we will learn about another kind of neural network architecture, Recurrent Neural Networks (RNNs), which are most popular in Natural Language Processing (NLP), and about their applications in computer vision.

***Outline***:

- [1. Introduction to Recurrent Neural Networks](#1)
- [2. The Downsides of Vanilla RNNs](#2)
- [3. Other Recurrent Networks: LSTMs and GRUs](#3)
- [4. Recurrent Networks for Computer Vision](#4)
- [5. Final Notes](#5)
- [6. References and Further Learning](#6)

<a name='1'></a>

## 1. Introduction to Recurrent Neural Networks

So far, all the neural network architectures we have seen take input data of a fixed size (for example, an image) and produce an output of a fixed size (for example, a category label) in a single forward pass. These networks are normally called feed-forward networks. Image classification architectures are a good example (feed-forward networks are sometimes confused with fully connected networks, but the two are not the same thing). Feed-forward networks work well for data without recurrence or sequential behavior, such as images, but they are poorly suited to sequential data such as text, video, and audio.

Furthermore, unlike static images, sequential data have variable lengths. Recurrent Neural Networks (RNNs) are well suited to processing data of variable length. From an architecture point of view, RNNs let us design neural networks that operate over a sequence of information and make a prediction at every timestep, taking all previous timesteps into account.


Most RNN applications fall into 3 main categories, which are discussed briefly below:

- **One to many**, where a recurrent network maps a single, fixed-size input to a sequence of outputs. An example is image captioning, where we take an image (of fixed shape) and generate its caption (a sequence of words).

- **Many to one**, where a recurrent network takes a sequence of inputs, such as text, and produces a single output. Examples are video classification, where given a video (a sequence of images) we produce its category or class label, and sentiment analysis, where given a piece of text we produce a label (positive or negative) that describes the sentiment of the input.

- **Many to many**, where both the input and the output of the recurrent network are sequences. One example is video captioning, where a recurrent model takes a video (a sequence of frames) and produces a caption (a sequence of words) that describes what the video is about. A similar example is video classification at the frame level. Per-frame video classification sounds like many-to-one, but it is many-to-many: the model classifies every frame, taking all previous frames (timesteps) into account, so it produces one label per frame. An NLP example of this category is machine translation, where a sequence of words in one language is translated into another language.

The image below illustrates the above categories.

![image](https://drive.google.com/uc?export=view&id=1eS5U-N_7mBgDvJh4T66yVVMMaP3SaT0-)


Recurrent Neural Networks are built from RNN cells. At each timestep, a cell takes the current input and the previous hidden state and produces an updated hidden state. Mathematically speaking, an RNN cell can be represented as:

$$
h_t = f_W(h_{t-1}, x_t)
$$

Here, $h_t$ is the internal state of the cell (the hidden state vector), $f_W$ is a function with weight parameters $W$, $h_{t-1}$ is the previous hidden state, and $x_t$ is the current input vector. The image below illustrates the computation of an RNN cell.

![image](https://drive.google.com/uc?export=view&id=1Qu60NgDj40PS0NbGVsBP7W5Y4hs4bq1V)
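The update above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the common choice $f_W(h_{t-1}, x_t) = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$; the names `rnn_cell`, `W_hh`, `W_xh`, and `b` and the sizes are illustrative choices, not something fixed by this notebook.

```python
import numpy as np

def rnn_cell(h_prev, x_t, W_hh, W_xh, b):
    """One vanilla RNN step: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3          # arbitrary sizes for illustration

# Randomly initialized weights (a trained network would learn these).
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

h_prev = np.zeros(hidden_size)          # initial hidden state
x_t = rng.normal(size=input_size)       # current input vector

h_t = rnn_cell(h_prev, x_t, W_hh, W_xh, b)
print(h_t.shape)                        # (4,)
```

The new hidden state has the same size as the previous one, which is what lets the cell be applied again at the next timestep.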

A typical RNN applies the same cell, with exactly the same computation and weights, over many timesteps. As we saw previously, at every timestep the hidden (internal) state is updated from the current input and the hidden state of the previous timestep. The image below shows the unrolled form of an RNN.


![image](https://drive.google.com/uc?export=view&id=1ySyf5uGzaXZxDLkuwXGtrsD5DeisfEC7)
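Unrolling amounts to a loop over the sequence that reuses one set of weights. The sketch below assumes a tanh cell and arbitrary sizes (all names are illustrative), and shows how the many-to-many and many-to-one cases differ only in which hidden states are kept.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, num_steps = 4, 3, 5   # arbitrary sizes

# One set of weights, shared by every timestep.
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

xs = rng.normal(size=(num_steps, input_size))  # a sequence of 5 input vectors
h = np.zeros(hidden_size)                      # initial hidden state
hidden_states = []

for x_t in xs:                                 # unroll over time
    h = np.tanh(W_hh @ h + W_xh @ x_t + b)     # same computation at every step
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)        # shape: (num_steps, hidden_size)

many_to_many = hidden_states                   # one state per timestep
many_to_one = hidden_states[-1]                # only the final state
print(many_to_many.shape, many_to_one.shape)   # (5, 4) (4,)
```

Because the loop runs once per input, the same code handles sequences of any length. In a full model, an output layer would typically map these hidden states to predictions.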


<a name='2'></a>

## 2. The Downsides of Vanilla RNNs

In theory, it looks like RNNs are all we need for processing all kinds of sequential data, but in practice, RNNs struggle to learn long-term dependencies because of unstable gradients.

When an RNN is fed a long sequence, it is likely to struggle to learn from it. This is caused by how RNNs are wired. If you look at the unrolled RNN, you will notice that there are many timesteps, and at every timestep the RNN cell takes the previous hidden state and the input vector. Also remember that the hidden state of the RNN cell is computed with a `tanh()` function, whose output is always between -1 and 1 and whose derivative is never greater than 1. So, if we have many timesteps, the gradient that flows back through them (a product of the derivatives of each hidden state $h_t$ with respect to the previous hidden state $h_{t-1}$) is very likely to approach zero. That results in the [vanishing gradients problem](https://www.youtube.com/watch?v=qhXZsFVxGKo). Vanishing gradients are a [common problem](https://twitter.com/Jeande_d/status/1436277279697539074?s=20&t=otiGUpWenwSdGSpAlZdExg) in large neural networks. In ordinary networks, the problem is alleviated by avoiding sigmoid or tanh in the early layers and by using residual connections, but sadly, vanilla RNNs are designed around tanh.
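The shrinking effect can be seen in a scalar toy RNN. This is a minimal sketch, assuming a single hidden unit with recurrent weight `w = 0.9` and random inputs (both arbitrary choices). By the chain rule, the gradient of $h_t$ with respect to $h_0$ is the product of the local derivatives $w(1 - h_t^2)$, one per timestep.

```python
import math
import random

random.seed(0)
w = 0.9            # recurrent weight (assumed value)
h = 0.0            # initial hidden state
grad = 1.0         # running value of d h_t / d h_0
history = []

for t in range(50):
    x_t = random.uniform(-1.0, 1.0)   # random input at this timestep
    h = math.tanh(w * h + x_t)
    grad *= w * (1.0 - h * h)         # local derivative d h_t / d h_{t-1}
    history.append(abs(grad))

# Gradient magnitude after 1, 10, and 50 timesteps.
print(history[0], history[9], history[49])
```

Every factor has magnitude at most 0.9 here, so the product can only shrink as the sequence gets longer. The earliest inputs therefore receive almost no learning signal.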

One way of alleviating the vanishing gradients problem in RNNs is to remove the tanh non-linearity, but that introduces another problem: exploding gradients. If the hidden state vectors are not constrained by any non-linearity, their gradients can become extremely large, especially when the weights are greater than 1 in magnitude (more precisely, when the largest eigenvalue of the weight matrix is greater than 1 in magnitude). Think about it this way: the hidden state is multiplied by the same weight matrix at every timestep, so if the weights are greater than 1, the repeated multiplication ends up producing very large gradients. We can limit exploding gradients by clipping or scaling down the gradients, but that alone does not solve the unstable gradients problem.

Furthermore, the absence of a non-linearity can also lead to vanishing gradients when the weights are less than 1 in magnitude: multiplying a number by a value smaller than 1 always shrinks it, and doing so over and over keeps shrinking it further.
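Both linear cases, and gradient clipping, can be sketched with a scalar recurrence. The sketch assumes $h_t = w \cdot h_{t-1} + x_t$ with no non-linearity, so the gradient of $h_T$ with respect to $h_0$ is simply $w^T$. The weights, the sequence length, and the helper `clip_by_norm` are illustrative choices, not a specific library API.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (direction unchanged)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

num_steps = 50                    # assumed sequence length

# Linear recurrence h_t = w * h_{t-1} + x_t, so d h_T / d h_0 = w ** T.
grow = 1.5 ** num_steps           # weight above 1: the gradient explodes
shrink = 0.5 ** num_steps         # weight below 1: the gradient vanishes
print(f"{grow:.1e} {shrink:.1e}")

# Clipping caps the size of an exploded gradient but keeps its direction.
raw_grad = grow * np.array([3.0, 4.0])
clipped = clip_by_norm(raw_grad, max_norm=5.0)
print(np.linalg.norm(raw_grad) > 1e8, clipped)
```

Clipping only limits how large an update can be. It does nothing for the vanishing case, which is why it does not fully solve the unstable gradients problem.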

There is a lot of theory behind RNNs, but the above is enough for our purposes. If you want to learn more about RNNs, check the further learning section at the end of this notebook. Finally, RNNs are implemented in most deep learning frameworks, so you will rarely have to build one yourself. They are available in [Keras](https://keras.io/api/layers/recurrent_layers/simple_rnn/) and [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).

Due to the vanishing and exploding gradients problems, vanilla RNNs are rarely used in practice. Let's see their successors in the next section.


<a name='3'></a>

## 3. Other Recurrent Networks: LSTMs and GRUs

LSTMs (Long Short-Term Memory networks) are successors to, and a special version of, Recurrent Neural Networks (RNNs). They work surprisingly well in practice thanks to their ability to handle long sequences.

LSTMs are RNNs with a slightly different design. The image below illustrates the difference between vanilla RNNs and LSTMs in terms of their design and gradient flow.

![image](https://drive.google.com/uc?export=view&id=1zGByZwNZIu56Mx2AekTdHFIgx3E0ieMV)

Looking at the image above, there are at least 2 main differences between RNNs and LSTMs. The first is the horizontal line at the top of the LSTM cell (the cell state), which largely contributes to the uninterrupted flow of information. The second is the 3 additional sigmoid gates. Christopher Olah explains in detail what those additional gates do in his fantastic blog post on [understanding LSTM networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/), which is a highly recommended read!
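For readers who prefer code to diagrams, here is a minimal NumPy sketch of one LSTM step. It assumes the standard formulation with forget, input, and output gates plus tanh candidate values. Stacking all four weight blocks into a single matrix `W`, and the names used, are implementation choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step in the standard formulation.

    W has shape (4 * hidden_size, hidden_size + input_size) and stacks the
    weights of the forget, input, and output gates and the candidate values.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * n:1 * n])    # forget gate: what to keep from c_prev
    i = sigmoid(z[1 * n:2 * n])    # input gate: how much new content to write
    o = sigmoid(z[2 * n:3 * n])    # output gate: what to expose as h_t
    g = np.tanh(z[3 * n:4 * n])    # candidate values
    c_t = f * c_prev + i * g       # cell state: the horizontal line in the figure
    h_t = o * np.tanh(c_t)         # hidden state passed to the next timestep
    return h_t, c_t

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3     # arbitrary sizes
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):   # a sequence of 6 inputs
    h, c = lstm_cell(x_t, h, c, W, b)
print(h.shape, c.shape)            # (4,) (4,)
```

Note that the cell state is updated additively (`f * c_prev + i * g`) rather than being squashed through tanh at every step, which is what allows information and gradients to flow across many timesteps.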


[GRUs, or Gated Recurrent Units](https://arxiv.org/abs/1406.1078), are another special version of RNNs. They can be seen as simplified LSTMs that also work well in practice and are more compute-efficient than LSTMs. For more about RNNs and their variants, check this [survey paper](https://arxiv.org/pdf/1503.04069.pdf) (this is completely optional, and you probably don't need to read it since this is not an NLP class).

Let's summarize what we have seen so far about Recurrent Neural Networks, or RNNs. RNNs are like normal feed-forward networks with a loop. Thanks to this looping mechanism, RNNs can handle sequential data such as text, video, and time series. LSTMs are a special version of RNNs and are the ones used in practice. In general, RNNs are used in Natural Language Processing, but they are also used in Computer Vision. In the next section, we will review the applications of recurrent networks in Computer Vision.
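To make the efficiency difference concrete, here is a rough parameter count. It assumes the standard formulations, in which an LSTM has 4 weight blocks (3 gates plus the candidate values) and a GRU has 3 (2 gates plus the candidate values). The helper `gated_rnn_params` and the layer sizes are hypothetical, and library implementations may count slightly differently (PyTorch, for instance, keeps two bias vectors per block).

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Weights and biases for num_blocks blocks that each read [h_prev, x_t]."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 128, 256    # assumed layer sizes

rnn = gated_rnn_params(input_size, hidden_size, num_blocks=1)   # vanilla RNN
gru = gated_rnn_params(input_size, hidden_size, num_blocks=3)
lstm = gated_rnn_params(input_size, hidden_size, num_blocks=4)
print(rnn, gru, lstm)                 # 98560 295680 394240
```

Under these assumptions, a GRU layer has three quarters of the parameters of an LSTM layer of the same size.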





<a name='4'></a>

## 4. Recurrent Networks for Computer Vision

Recurrent networks are most popular in sequence modelling and Natural Language Processing, but they are also used in tasks that involve both vision and language, such as image captioning and visual question answering. Also, although it's rare, recurrent networks can be used in core vision tasks such as image classification. We will see a recent example of that later.

Most past work in vision-language used some form of recurrent network (mostly LSTMs). An example is the image captioning task, where given an image, a model predicts its caption. To predict an image caption, you typically have a ConvNet that extracts features from the image and an LSTM that takes the extracted features and produces the caption. The image below, taken from the paper [Show and Tell: A Neural Image Caption Generator (Vinyals et al.)](https://arxiv.org/pdf/1411.4555v2.pdf), shows how we can predict a caption with a ConvNet and an LSTM.

![image](https://drive.google.com/uc?export=view&id=1Z5bOb7GIv2GmiL1qOMhPGR4kMlmNCgIz)
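The decoding side of such a pipeline can be sketched as a one-to-many loop. This is a toy sketch under several assumptions, not the exact setup of the paper: a random vector stands in for the ConvNet features, the image initializes the hidden state, a simple tanh cell replaces the LSTM for brevity, and the vocabulary and weights are made up and untrained, so the generated words are meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "<end>", "a", "dog", "on", "grass"]   # toy vocabulary
feat_size, hidden_size, embed_size = 8, 16, 5             # arbitrary sizes

# Untrained, random parameters (a real model would learn these).
W_init = rng.normal(scale=0.3, size=(hidden_size, feat_size))   # image -> h_0
embed = rng.normal(scale=0.3, size=(len(vocab), embed_size))    # word vectors
W_hh = rng.normal(scale=0.3, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.3, size=(hidden_size, embed_size))
W_out = rng.normal(scale=0.3, size=(len(vocab), hidden_size))   # h -> word scores

def generate_caption(image_features, max_len=10):
    """Greedy decoding: feed each predicted word back in as the next input."""
    h = np.tanh(W_init @ image_features)      # condition the decoder on the image
    word = vocab.index("<start>")
    caption = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_xh @ embed[word])
        word = int(np.argmax(W_out @ h))      # pick the highest-scoring word
        if vocab[word] == "<end>":
            break
        caption.append(vocab[word])
    return caption

image_features = rng.normal(size=feat_size)   # stand-in for ConvNet features
print(generate_caption(image_features))
```

The single fixed-size input (the image features) produces a variable-length output, because the loop stops either at the `<end>` token or at `max_len`.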

Below are some examples of image captioning model results.

![image](https://drive.google.com/uc?export=view&id=1BS1XVM54xjvsw9uf6IWTUsQAdtMD23B_)


Another vision-language task that uses recurrent networks is visual question answering (VQA), where given an image and a question, a model produces the answer to that question. Most early VQA algorithms used ConvNets and LSTMs. You can learn more about VQA [here](http://vqa.cloudcv.org0). There is also a [demo](http://vqa.cloudcv.org) that allows you to play with VQA models and try them on your own images.

![image](https://drive.google.com/uc?export=view&id=11wqu5gPQUMx-C-ObfKW7nRaz2mZf7CdQ)

We will learn more about vision-language tasks and other recent algorithms in later parts.


Lastly, as we alluded to at the beginning of this section, recurrent networks are also used in core vision tasks such as image classification. A recent example is the paper [Sequencer: Deep LSTM for Image Classification (Tatsunami and Taki, 2022)](https://arxiv.org/abs/2205.01972). Sequencer is an LSTM-based network that achieves excellent accuracy on image classification benchmarks, comparable to existing state-of-the-art networks such as modern CNNs and Vision Transformers (ViT). Sequencer uses bidirectional LSTMs as its core element. A bidirectional LSTM combines two LSTM layers, one reading the sequence forward and one reading it backward. In TensorFlow, you wrap a [normal LSTM layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) in a [Bidirectional layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional), and in PyTorch, you simply set the `bidirectional` argument to `True` in the [LSTM layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). The image below shows the Sequencer architecture. You can read more about it in its [paper](https://arxiv.org/pdf/2205.01972.pdf).


![image](https://drive.google.com/uc?export=view&id=1fERZS9aaOc_LJsWdzONhrV6tvM4oiIYO)
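The bidirectional idea itself is simple to sketch: run one recurrent layer left to right, run a second one right to left, and concatenate their states at each timestep. The NumPy sketch below uses simple tanh cells in place of LSTM cells to stay short; the helper names and sizes are illustrative assumptions.

```python
import numpy as np

def run_rnn(xs, W_hh, W_xh):
    """Run a simple tanh RNN over xs and return every hidden state."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
num_steps, input_size, hidden_size = 6, 3, 4   # arbitrary sizes

def make_weights():
    return (rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
            rng.normal(scale=0.5, size=(hidden_size, input_size)))

# Each direction has its own, independent weights.
fwd_weights, bwd_weights = make_weights(), make_weights()
xs = rng.normal(size=(num_steps, input_size))

forward = run_rnn(xs, *fwd_weights)                # reads t = 0 .. T-1
backward = run_rnn(xs[::-1], *bwd_weights)[::-1]   # reads t = T-1 .. 0, then re-aligned
outputs = np.concatenate([forward, backward], axis=1)
print(outputs.shape)                               # (6, 8)
```

Each timestep now sees context from both the past and the future of the sequence, at the cost of doubling the output feature size.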


That's it for Recurrent Neural Networks in Computer Vision. In the 2010s, when RNNs were still attractive (before [Attention](https://arxiv.org/abs/1706.03762) came onto the scene), people also applied RNNs to other visual tasks such as image generation (an example is found [here](https://arxiv.org/pdf/1502.04623.pdf)), but my guess is that hardly anybody does so anymore, given the [recent](https://imagen.research.google) [progress](https://openai.com/dall-e-2/) in image generative models and in computer vision in general.





<a name='5'></a>

## 5. Final Notes

This notebook was about recurrent networks and their applications in Computer Vision. RNNs are mostly used in sequence modelling and NLP, but for a long time (and still today) they have also found applications in visual recognition, especially in joint vision and language tasks such as image captioning and visual question answering. However, it's important to note that with the rise of the [Transformer](https://arxiv.org/abs/1706.03762), RNNs are rarely used in either NLP or Computer Vision today, and in those rare cases it is LSTMs that are used rather than vanilla RNNs.

One last thing: [1D convolution](https://wandb.ai/ayush-thakur/dl-question-bank/reports/Intuitive-understanding-of-1D-2D-and-3D-convolutions-in-convolutional-neural-networks---VmlldzoxOTk2MDA) is also used for processing sequential data, but just like RNNs, it's rare to see people using [Conv1D](https://keras.io/api/layers/convolution_layers/convolution1d/) in practice today.
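As a small illustration of what a 1D convolution does to a sequence, the sketch below slides a 2-element filter over a short signal with NumPy. The signal and the filter values are made up for the example; a Conv1D layer would learn its filter values during training.

```python
import numpy as np

signal = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])   # a short sequence
kernel = np.array([-1.0, 1.0])                         # assumed filter: first difference

# Deep learning "convolution" slides the kernel without flipping it,
# which NumPy calls cross-correlation.
output = np.correlate(signal, kernel, mode="valid")
print(output)    # [1. 2. 3. 4. 5.]
```

Each output value depends only on a small window of neighbouring timesteps, whereas a recurrent network carries a hidden state across the whole sequence.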

<a name='6'></a>

## 6. References and Further Learning

* [Recurrent Networks by Justin Johnson at University of Michigan](https://www.youtube.com/watch?v=dUzLD91Sj-o&list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&index=13)

* [Deep Sequence Modeling - MIT Intro to Deep Learning](https://www.youtube.com/watch?v=QvkQ1B3FBqA&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=2)

* [Understanding LSTM Networks blog by Christopher Olah](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

* [CS231n - Recurrent Neural Networks Notes](https://cs231n.github.io/rnn/)

* [The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

* [Natural Language Processing - Complete Machine Learning Package](https://nyandwi.com/machine_learning_complete/31_intro_to_nlp_and_text_preprocessing/)

## [BACK TO TOP](#0)