# ___Long Short Term Memory (LSTM)___

_Sequence prediction problems have been around for a long time. They are considered some of the hardest problems to solve in the data science industry. They cover a wide range of tasks: from predicting sales to finding patterns in stock market data, from understanding movie plots to recognizing your way of speaking, from language translation to predicting the next word on your iPhone’s keyboard._

_With the recent breakthroughs in data science, Long Short-Term Memory networks (LSTMs) have been observed to be among the most effective solutions for many of these sequence prediction problems._

_LSTMs have an edge over conventional feed-forward neural networks and RNNs in many ways. This is because they can selectively remember patterns for long durations of time._

### ___Limitations of RNNs___
_Recurrent Neural Networks work just fine when we are dealing with short-term dependencies. That is when applied to problems like:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/10015332/b2_9.png'/>

_RNNs turn out to be quite effective. This is because the problem does not depend on the wider context of the statement. The RNN need not remember what was said before this, or what it meant; all it needs to know is that in most cases the sky is blue. Thus the prediction would be:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/10015342/b2_10-300x76.png'/>

_However, vanilla RNNs fail to understand the context behind an input. Something that was said long before cannot be recalled when making predictions in the present. Let’s understand this through an example:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/10015352/b2_11.png'/>

_Here, we can understand that since the author has worked in Spain for 20 years, it is very likely that he may possess a good command over Spanish. But, to make a proper prediction, the RNN needs to remember this context. The relevant information may be separated from the point where it is needed, by a huge load of irrelevant data. This is where a Recurrent Neural Network fails!_

_The reason behind this is the problem of __Vanishing Gradient__._

_A vanilla RNN retains information only over short spans: if the information is needed a few steps later it may still be recoverable, but once many more words have been fed in, it gets lost. This issue can be addressed with a slightly tweaked version of the RNN, the Long Short-Term Memory network._
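
_To make the vanishing gradient concrete, here is a minimal sketch using a scalar RNN of the form `h_t = tanh(w * h_prev + x)`. The function name, the weight `w = 0.9` and the starting state are illustrative assumptions, not values from any real model. It multiplies the per-step derivatives together, which is what backpropagation through time does._

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

short_range = gradient_through_time(w=0.9, steps=3)
long_range = gradient_through_time(w=0.9, steps=50)
print(short_range, long_range)
```

_Every factor in the product is smaller than 1 here, so the gradient that links the output to an input 50 steps back is far smaller than the one that links it to an input 3 steps back._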

### ___LSTM (Long Short-Term Memory)___

<img src='https://miro.medium.com/proxy/1*goJVQs-p9kgLODFNyhl9zA.gif' width = 600/>

_Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used._

_LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!_

_All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer._

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png' width = 600/>

_LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way._

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png' width = 600/>

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM2-notation.png' width = 600/>

_In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations._

#### ___The Core Idea Behind LSTMs___
_The key to LSTMs is the cell state, the horizontal line running through the top of the diagram._

_The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged._

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-C-line.png' width =600/>

_The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates._

_Each LSTM recurrent unit also maintains a vector called the __Internal Cell State__ which conceptually describes the information that was chosen to be retained by the previous LSTM recurrent unit._

_Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation._

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-gate.png' width = 100/>

_The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”_
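
_A gate can be sketched in a few lines of plain Python. The helper names `sigmoid` and `apply_gate`, and the example numbers, are assumptions made for illustration; in a real LSTM the scores come from a learned layer rather than being set by hand._

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def apply_gate(scores, values):
    """Scale each value by sigmoid(score), a number strictly between 0 and 1."""
    return [sigmoid(s) * v for s, v in zip(scores, values)]

values = [2.0, -1.0, 4.0]
scores = [-20.0, 0.0, 20.0]   # strongly closed, half open, strongly open
gated = apply_gate(scores, values)
print(gated)
```

_The first value is almost fully blocked, the second is halved, and the third passes through almost unchanged._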

_An LSTM has three of these gates, to protect and control the cell state._

<img src='https://miro.medium.com/max/700/1*S0rXIeO_VoUVOyrYHckUWg.gif' width = 600/>

#### ___Step-by-Step LSTM Walk Through___

_Every LSTM module will have 3 gates named as __Forget gate, Input gate, Output gate__._

<img src='https://miro.medium.com/max/700/0*G474BVfgtu5ZE4ai'/>

##### ___Forget Gate___
>___Decides how much of the past you should remember.___

_This gate decides which information should be removed from the cell state at a particular time step. The decision is made by a sigmoid function: it looks at the previous hidden state $h_{t-1}$ and the current input $x_{t}$, and outputs a number between 0 (omit this) and 1 (keep this) for each number in the cell state $C_{t-1}$._

<img src='https://miro.medium.com/max/700/0*wvDTn9i0Q6ieTiUH.png'/>

_Example: let's say $h_{t-1}$ → Roufa and Manoj play basketball well._

_$x_{t}$ → Manoj is really good at web designing._

* _The forget gate realizes that there might be a change in context after encountering the first full stop._
* _It compares this with the current input $x_{t}$._
* _It is important to note that the next sentence talks about Manoj, so the information about Roufa is omitted._

<img src='https://miro.medium.com/max/700/1*GjehOa513_BgpDDP6Vkw2Q.gif' width = 600/>
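
_The forget gate described above can be sketched as follows. The weights `W_f`, bias `b_f` and all numeric values are hypothetical, chosen only so that one cell-state entry is mostly kept and the other mostly forgotten._

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state entry."""
    concat = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]

h_prev = [0.2, -0.4]
x_t = [1.0]
W_f = [[0.5, 0.1, 2.0],     # hypothetical weights
       [-1.0, 0.3, -3.0]]
b_f = [0.0, 0.0]
c_prev = [0.8, -0.6]

f_t = forget_gate(W_f, b_f, h_prev, x_t)
kept = [f * c for f, c in zip(f_t, c_prev)]  # what survives of the old cell state
print(f_t, kept)
```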

##### ___Update Gate/Input Gate___
> ___Decides how much of this unit is added to the current state.___

<img src='https://miro.medium.com/max/700/0*uesHvKaIW6A1Ac5Q.png'/>

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png' width = 700/>

_The sigmoid function decides which values to let through, producing outputs between 0 and 1, and the tanh function weights the values that are passed according to their level of importance, on a scale from -1 to 1._

_Example: Manoj is good at web designing; yesterday he told me that he is a university topper._

* _The input gate analyses which information is important._
* _"Manoj is good at web designing" and "he is a university topper" are important._
* _"Yesterday he told me that" is not important, hence it is not stored._

_Once this three-step process is done, only information that is important and not redundant is added to the cell state._

<img src='https://miro.medium.com/max/700/1*TTmYy7Sy8uUXxUXfzmoKbA.gif' width = 600/>
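
_Putting the forget and input gates together, the cell state update is an element-wise combination of the old state and the new candidate values. The sketch below uses hand-picked gate values (an assumption for illustration; in practice they come from sigmoid and tanh layers) to show both extremes._

```python
import math

def update_cell_state(c_prev, f_t, i_t, g_t):
    """c_t = f_t * c_prev + i_t * g_t, applied element-wise."""
    return [f * c + i * g for c, f, i, g in zip(c_prev, f_t, i_t, g_t)]

c_prev = [1.0, -2.0]
f_t = [1.0, 0.0]   # keep the first entry, forget the second
i_t = [0.0, 1.0]   # write nothing to the first entry, write fully to the second
g_t = [math.tanh(0.5), math.tanh(3.0)]  # candidate values in (-1, 1)

c_t = update_cell_state(c_prev, f_t, i_t, g_t)
print(c_t)
```

_The first entry is carried over untouched, while the second is replaced entirely by its candidate value._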

##### ___Output Gate___
> ___Decides which part of the current cell makes it to the output.___

<img src='https://miro.medium.com/max/700/0*vsF6h5KAmP5o8sAV.png' />

_The sigmoid function decides which values to let through, producing outputs between 0 and 1. The tanh function scales the cell state values to the range -1 to 1 according to their level of importance, and the result is multiplied by the output of the sigmoid._

_Example: Manoj is good at web designing and is a university topper, so the merit student _______________ was awarded the University Gold Medal._

* _There could be many choices for the blank. This final gate fills it with Manoj._

<img src='https://miro.medium.com/max/700/1*VOXRGhOShoWWks6ouoDN3Q.gif' width = 600/>
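
_The three gates can be combined into a single time step. The following is a minimal sketch in plain Python, not a production implementation: the parameter layout (a dictionary with keys `f`, `i`, `g`, `o`), the sizes and the constant weights are all assumptions chosen to keep the example short._

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step; params maps 'f', 'i', 'g', 'o' to (W, b) pairs."""
    z = h_prev + x_t  # concatenated [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]     # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]     # input gate
    g = [math.tanh(a) for a in affine(*params["g"], z)]   # candidate values
    o = [sigmoid(a) for a in affine(*params["o"], z)]     # output gate
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

# Hypothetical parameters: hidden size 2, input size 1, every weight 0.1, zero biases.
hidden, inputs = 2, 1
params = {k: ([[0.1] * (hidden + inputs) for _ in range(hidden)], [0.0] * hidden)
          for k in "figo"}

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(params, h, c, x)
print(h, c)
```

_Note that the hidden state `h` is always bounded between -1 and 1, because it is a sigmoid output multiplied by a tanh output, whereas the cell state `c` is not squashed between steps._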

### ___Backpropagation in LSTM___

_Back-propagation in a Long Short Term Memory network follows the same procedure as in a basic Recurrent Neural Network; the main difference lies in the mathematics of the gradient chain._

_Let $\overline{y}_{t}$ be the predicted output at each time step and $y_{t}$ be the actual output at each time step. Then the error at each time step is given by:_

$$E_{t} = -y_{t}log(\overline{y}_{t})$$

_The total error is thus given by the summation of errors at all time steps._

$$E = \sum _{t} E_{t}$$

$$E = \sum _{t} -y_{t}log(\overline{y}_{t})$$
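
_As a small worked example of this total error, the sketch below sums the cross-entropy over two time steps. It assumes one-hot targets over three classes and made-up predicted probabilities; the function name is hypothetical._

```python
import math

def sequence_cross_entropy(targets, predictions):
    """Sum over time steps of -sum_k y_{t,k} * log(y_hat_{t,k})."""
    total = 0.0
    for y_t, y_hat_t in zip(targets, predictions):
        # Only classes with a non-zero target contribute to the error.
        total += -sum(y * math.log(p) for y, p in zip(y_t, y_hat_t) if y > 0)
    return total

targets = [[1, 0, 0], [0, 1, 0]]                      # one-hot target per time step
predictions = [[0.7, 0.2, 0.1], [0.25, 0.5, 0.25]]    # predicted probabilities
E = sequence_cross_entropy(targets, predictions)
print(E)  # equals -log(0.7) - log(0.5)
```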

_Similarly, the value $\frac{\partial E}{\partial W}$ can be calculated as the summation of the gradients at each time step._

$$\frac{\partial E}{\partial W} = \sum _{t} \frac{\partial E_{t}}{\partial W}$$

_Using the chain rule and the fact that $\overline{y}_{t}$ is a function of $h_{t}$, which in turn is a function of $c_{t}$, the following expression arises:_

$$\frac{\partial E_{t}}{\partial W} = \frac{\partial E_{t}}{\partial \overline{y}_{t}}\frac{\partial \overline{y}_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial c_{t}}\frac{\partial c_{t}}{\partial c_{t-1}}\frac{\partial c_{t-1}}{\partial c_{t-2}}.......\frac{\partial c_{0}}{\partial W}$$

_Thus the total error gradient is given by the following:_

$$\frac{\partial E}{\partial W} = \sum _{t} \frac{\partial E_{t}}{\partial \overline{y}_{t}}\frac{\partial \overline{y}_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial c_{t}}\frac{\partial c_{t}}{\partial c_{t-1}}\frac{\partial c_{t-1}}{\partial c_{t-2}}.......\frac{\partial c_{0}}{\partial W}$$

_Note that the gradient equation involves a chain of $\partial c_{t}$ for an LSTM Back-Propagation while the gradient equation involves a chain of $\partial h_{t}$ for a basic Recurrent Neural Network._

### ___How does LSTM solve the problem of vanishing and exploding gradients?___

_Recall the expression for $c_{t}$._

$$c_{t} = i * g + f * c_{t-1}$$

_The value of the gradients is controlled by the chain of derivatives starting from $\frac{\partial c_{t}}{\partial c_{t-1}}$. Expanding this value using the expression for $c_{t}$:_

$$\frac{\partial c_{t}}{\partial c_{t-1}} = \frac{\partial c_{t}}{\partial f}\frac{\partial f}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial c_{t-1}} + \frac{\partial c_{t}}{\partial i}\frac{\partial i}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial c_{t-1}} + \frac{\partial c_{t}}{\partial g}\frac{\partial g}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial c_{t-1}} + f$$

_For a basic RNN, the term $\frac{\partial h_{t}}{\partial h_{t-1}}$ tends, after a certain time, to stay either consistently greater than 1 or consistently less than 1. Multiplying many such factors together is the root cause of the vanishing and exploding gradients problem. In an LSTM, the term $\frac{\partial c_{t}}{\partial c_{t-1}}$ does not follow a fixed pattern and can take any positive value at any time step. Thus, it is not guaranteed that over a very large number of time steps the product will converge to 0 or diverge. If the gradient starts converging towards zero, the weights of the gates can be adjusted to bring the term closer to 1. Since the network adjusts only these weights during training, it learns when to let the gradient converge to zero and when to preserve it._
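
_A rough numerical illustration of this argument: in $c_{t} = i * g + f * c_{t-1}$, the direct contribution of $c_{t-1}$ to $c_{t}$ is the forget gate value $f$. The sketch below considers only this direct path and ignores the indirect paths through $h_{t-1}$, which is a simplifying assumption. The gate values are made up._

```python
def cell_state_gradient(forget_values):
    """Direct-path d c_T / d c_0: the product of the forget-gate activations."""
    grad = 1.0
    for f in forget_values:
        grad *= f
    return grad

remembering = cell_state_gradient([0.99] * 50)   # gate stays nearly open
forgetting = cell_state_gradient([0.5] * 50)     # gate half closed at every step
print(remembering, forgetting)
```

_When the network learns to keep the forget gate close to 1, the gradient survives 50 steps largely intact; when it learns to close the gate, the gradient is deliberately cut off._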

### ___Variants on Long Short Term Memory___

_One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state._

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-peepholes.png' width = 600/>

_Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older._

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-tied.png' width = 600/>
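
_In the notation used elsewhere in this document (forget gate $f$, candidate values $g$), coupling the gates amounts to tying the input gate to the forget gate as $i = 1 - f$. Under that assumption the cell state update becomes:_

$$c_{t} = f * c_{t-1} + (1 - f) * g$$

_Each entry of the cell state is then a weighted average of its old value and the new candidate, so whatever fraction is forgotten is exactly the fraction that is written._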

_A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular._

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png' width = 600/>

_These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014)._

### ___Applications___

_LSTM models need to be trained on a training dataset before they are deployed in real-world applications. Some of the most demanding applications are discussed below:_

* ___Language modelling or text generation___ _, which involves predicting the next words when a sequence of words is fed as input. Language models can operate at the character level, n-gram level, sentence level or even paragraph level._

* ___Image processing___ _, which here means analysing a picture and summarizing it in a sentence (image captioning). This requires a dataset with a good number of pictures and their corresponding descriptive captions. A model that has already been trained is used to extract features from the images in the dataset; this is the photo data. The captions are then processed so that only the most suggestive words remain; this is the text data. The model is fitted on these two types of data. Its task is to generate a descriptive sentence for the picture one word at a time, taking as input the image and the words it has previously predicted._

* ___Speech and Handwriting Recognition___

* ___Music generation___ _, which is quite similar to text generation: LSTMs predict musical notes instead of text by analyzing a combination of notes fed as input._

* ___Language Translation___ _, which involves mapping a sequence in one language to a sequence in another language. Similar to image processing, a dataset containing phrases and their translations is first cleaned, and only a part of it is used to train the model. An encoder-decoder LSTM model is used, which first converts the input sequence to a vector representation (encoding) and then decodes it into its translated version._

### ___Drawbacks___

_As the saying goes, everything in this world comes with its own advantages and disadvantages. LSTMs, too, have a few drawbacks, which are discussed below:_

* _LSTMs became popular because they could solve the problem of vanishing gradients. But it turns out, they fail to remove it completely. The problem lies in the fact that the data still has to move from cell to cell for its evaluation. Moreover, the cell has become quite complex now with the additional features (such as forget gates) being brought into the picture._


* _They require a lot of resources and time to get trained and become ready for real-world applications. In technical terms, they need high memory-bandwidth because of linear layers present in each cell which the system usually fails to provide for. Thus, hardware-wise, LSTMs become quite inefficient._


* _With the rise of data mining, developers are looking for a model that can remember past information for a longer time than LSTMs. The source of inspiration for such kind of model is the human habit of dividing a given piece of information into small parts for easy remembrance._


* _LSTMs are sensitive to different random weight initializations and, in this respect, behave much like a feed-forward neural net. They prefer small weight initializations._


* _LSTMs are prone to overfitting and it is difficult to apply the dropout algorithm to curb this issue. Dropout is a regularization method where input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training a network._

# ___Gated Recurrent Unit (GRUs)___

## ___What is Gated Recurrent Unit?___

_A Gated Recurrent Unit (GRU), as its name suggests, is a variant of the RNN architecture, and uses gating mechanisms to control and manage the flow of information between cells in the neural network. GRUs were introduced only in 2014 by Cho, et al. and can be considered a relatively new architecture, especially when compared to the widely-adopted LSTM, which was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber._

_The structure of the GRU allows it to adaptively capture dependencies from large sequences of data without discarding information from earlier parts of the sequence. This is achieved through its gating units, similar to the ones in LSTMs, which solve the vanishing/exploding gradient problem of traditional RNNs. These gates are responsible for regulating the information to be kept or discarded at each time step._

<img src='https://cdn-images-1.medium.com/max/1000/1*dhq14CzJijlqjf7IlDB0uw.png' width= 600/>

_GRUs got rid of the cell state and use the hidden state to transfer information. A GRU also has only two gates, a __reset gate__ and an __update gate__. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago without it being washed out over time, or to remove information which is irrelevant to the prediction._

<img src='https://miro.medium.com/max/700/1*jhi5uOm9PvZfmxvfaCektw.png' width = 400/>

_The information which is stored in the Internal Cell State in an LSTM recurrent unit is incorporated into the hidden state of the Gated Recurrent Unit. This collective information is passed onto the next Gated Recurrent Unit. The different gates of a GRU are as described below:_

* ___Update Gate(z)___ _: It determines how much of the past knowledge needs to be passed along into the future. It plays the combined role of the Forget Gate and the Input Gate in an LSTM recurrent unit, since a single value decides both how much old state is kept and how much new content is written._


* ___Reset Gate(r)___ _: It determines how much of the past knowledge to forget when computing the new candidate state. It has no exact counterpart in an LSTM recurrent unit._


* ___Current Memory Gate($\overline{h}_{t}$)___ _: It is often overlooked during a typical discussion on Gated Recurrent Unit Network. It is incorporated into the Reset Gate just like the Input Modulation Gate is a sub-part of the Input Gate and is used to introduce some non-linearity into the input and to also make the input Zero-mean. Another reason to make it a sub-part of the Reset gate is to reduce the effect that previous information has on the current information that is being passed into the future._

### ___Working of a Gated Recurrent Unit___

* _Take the current input and the previous hidden state as input vectors._
* _Calculate the values of the update and reset gates by following the steps given below:_
    * _For each gate, calculate the parameterized current input and previous hidden state vectors by multiplying each vector by the respective weight matrix for that gate, and add the two results._
    * _Apply the sigmoid activation function for each gate element-wise on the parameterized vectors._

* _The process of calculating the Current Memory Gate is a little different. First, the Hadamard product of the Reset Gate and the previous hidden state vector is calculated. Then this vector is parameterized and added to the parameterized current input vector, and tanh is applied to the sum._
$$\overline{h}_{t} = tanh(W * x_{t}+W * (r_{t} * h_{t-1}))$$
* _To calculate the current hidden state, first define a vector of ones with the same dimensions as the hidden state. This vector will be called ones and is denoted mathematically by 1. First calculate the Hadamard product of the update gate and the previous hidden state vector. Then generate a new vector by subtracting the update gate from ones, and calculate the Hadamard product of this new vector with the current memory gate. Finally, add the two vectors to get the current hidden state vector._
$$h_{t} = z_{t} * h_{t-1} + (1-z_{t}) * \overline{h}_{t}$$
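
_The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: separate input weights (`Wz`, `Wr`, `Wh`) and recurrent weights (`Uz`, `Ur`, `Uh`) are used where the equations above write a single $W$, biases are omitted, and all numeric values are made up._

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(p, h_prev, x_t):
    """One GRU step following h_t = z * h_prev + (1 - z) * candidate."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    reset_h = [rk * hk for rk, hk in zip(r, h_prev)]   # Hadamard product r * h_prev
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x_t), matvec(p["Uh"], reset_h))]
    return [zk * hk + (1.0 - zk) * ck for zk, hk, ck in zip(z, h_prev, cand)]

# Hypothetical parameters: hidden size 2, input size 1.
p = {name: [[0.5], [0.5]] for name in ("Wz", "Wr", "Wh")}
p.update({name: [[0.1, -0.1], [0.2, 0.3]] for name in ("Uz", "Ur", "Uh")})

h = [0.0, 0.0]
for x in ([1.0], [-1.0], [0.5]):
    h = gru_step(p, h, x)
print(h)
```

_Because the new hidden state is a weighted average of the previous hidden state and a tanh output, every entry stays between -1 and 1._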

### ___GRUs vs Long Short Term Memory (LSTM) RNNs___

<img src='https://blog.floydhub.com/content/images/2019/07/image11.jpg' width = 600/>

_The main differences between GRUs and the popular LSTMs (nicely explained by Chris Olah) are the number of gates and the maintenance of a cell state. Unlike GRUs, LSTMs have 3 gates (input, forget, output) and maintain an internal memory cell state, which makes them more flexible but less efficient in terms of memory and time. Both networks are good at addressing the vanishing gradient problem, which is required for efficiently tracking long-term dependencies, so choosing between them is usually done using the following rule of thumb: first train an LSTM, since it has more parameters and is a bit more flexible, then train a GRU, and if there is no sizable difference between the performance of the two, use the simpler and more efficient GRU._
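
_The difference in size can be estimated with a quick count. The sketch below assumes each gate or candidate layer is one affine map over the concatenated previous hidden state and input, with a single bias vector: four such blocks for an LSTM and three for a GRU. The layer sizes are arbitrary, and specific libraries may count slightly differently (for example, by keeping two bias vectors per block)._

```python
def gated_rnn_parameters(input_size, hidden_size, num_blocks):
    """Weights and biases for num_blocks affine maps over [h_{t-1}, x_t]."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

lstm_params = gated_rnn_parameters(input_size=128, hidden_size=256, num_blocks=4)
gru_params = gated_rnn_parameters(input_size=128, hidden_size=256, num_blocks=3)
print(lstm_params, gru_params)  # 394240 295680
```

_Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size._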

### ___Backpropagation in GRUs___

_The Back-Propagation Through Time Algorithm for a Gated Recurrent Unit Network is similar to that of a Long Short Term Memory Network and differs only in the differential chain formation._

_Let $\overline{y}_{t}$ be the predicted output at each time step and $y_{t}$ be the actual output at each time step. Then the error at each time step is given by:_

$$E_{t} = -y_{t}log(\overline{y}_{t})$$

_The total error is thus given by the summation of errors at all time steps._

$$E = \sum _{t} E_{t}$$

$$E = \sum _{t} -y_{t}log(\overline{y}_{t})$$

_Similarly, the value $\frac{\partial E}{\partial W}$ can be calculated as the summation of the gradients at each time step._

$$\frac{\partial E}{\partial W} = \sum _{t} \frac{\partial E_{t}}{\partial W}$$

_Using the chain rule and the fact that $\overline{y}_{t}$ is a function of $h_{t}$, which in turn is a function of $\overline{h}_{t}$, the following expression arises:_

$$\frac{\partial E_{t}}{\partial W} = \frac{\partial E_{t}}{\partial \overline{y}_{t}}\frac{\partial \overline{y}_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}}......\frac{\partial h_{0}}{\partial W}$$

_Thus the total error gradient is given by the following:_

$$\frac{\partial E}{\partial W} = \sum _{t}\frac{\partial E_{t}}{\partial \overline{y}_{t}}\frac{\partial \overline{y}_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}}......\frac{\partial h_{0}}{\partial W}$$

### ___How do Gated Recurrent Units solve the problem of vanishing gradients?___

_The value of the gradients is controlled by the chain of derivatives starting from $\frac{\partial h_{t}}{\partial h_{t-1}}$. Recall the expression for $h_{t}$:_

$$h_{t} = z_{t} * h_{t-1} + (1-z_{t}) * \overline{h}_{t}$$

_Using the above expression, the value for $\frac{\partial {h}_{t}}{\partial {h}_{t-1}}$ is:_

$$\frac{\partial h_{t}}{\partial h_{t-1}} = z + (1-z)\frac{\partial \overline{h}_{t}}{\partial h_{t-1}}$$

_Recall the expression for $\overline{h}_{t}$:_

$$\overline{h}_{t} = tanh(W * x_{t}+W * (r_{t} * h_{t-1}))$$

_Using the above expression to calculate the value of $\frac{\partial \overline{h_{t}}}{\partial h_{t-1}}$:_

$$\frac{\partial \overline{h_{t}}}{\partial h_{t-1}} = \frac{\partial (tanh(W * x_{t}+W * (r_{t} * h_{t-1})))}{\partial h_{t-1}}  \Rightarrow \frac{\partial \overline{h_{t}}}{\partial h_{t-1}} = (1-\overline{h}_{t}^{2})(W * r)$$ 
 
_Since both the update and reset gates use the sigmoid function as their activation function, their values lie between 0 and 1. It is instructive to consider the extreme cases in which each gate is either 0 or 1._

___Case 1(z = 1):___

_In this case, irrespective of the value of r, the term $\frac{\partial h_{t}}{\partial h_{t-1}}$ is equal to z, which in turn is equal to 1._

___Case 2A(z=0 and r=0):___

_In this case, the term $\frac{\partial \overline{h_{t}}}{\partial h_{t-1}}$ is equal to 0, and therefore so is $\frac{\partial h_{t}}{\partial h_{t-1}}$: the unit has discarded the previous state entirely._

___Case 2B(z=0 and r=1):___

_In this case, the term $\frac{\partial \overline{h_{t}}}{\partial h_{t-1}}$ is equal to $(1-\overline{h}_{t}^{2})(W)$, and since z = 0 this is also the value of $\frac{\partial h_{t}}{\partial h_{t-1}}$. This value is controlled by the weight matrix, which is trainable, and thus the network learns to adjust the weights in such a way that the term comes closer to 1._

_Thus the Back-Propagation Through Time algorithm adjusts the respective weights in such a manner that the value of the chain of derivatives is as close to 1 as possible._
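
_The three cases can be checked numerically with a scalar version of the derivative. In the sketch below, the function name and the values chosen for the weight `w`, the reset gate and the candidate state are illustrative assumptions._

```python
def gru_hidden_gradient(z, r, w, h_candidate):
    """Scalar d h_t / d h_{t-1} = z + (1 - z) * (1 - h_candidate^2) * w * r."""
    return z + (1.0 - z) * (1.0 - h_candidate ** 2) * w * r

case_1 = gru_hidden_gradient(z=1.0, r=0.3, w=0.8, h_candidate=0.5)    # r is irrelevant
case_2a = gru_hidden_gradient(z=0.0, r=0.0, w=0.8, h_candidate=0.5)
case_2b = gru_hidden_gradient(z=0.0, r=1.0, w=0.8, h_candidate=0.5)
print(case_1, case_2a, case_2b)
```

_Case 1 gives exactly 1, Case 2A gives 0, and Case 2B gives $(1 - 0.5^{2}) \times 0.8 = 0.6$, a value that depends on the trainable weight._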

# ___Bi-Directional LSTM___

_Many applications are sequential in nature. One input follows another in time. Dependencies among these give us important clues as to how they should be processed. Since Recurrent Neural Networks (RNNs) model the flow of time, they're suited for these applications._

_An RNN has the limitation that it processes inputs in strict temporal order. This means the current input has the context of previous inputs but not of future ones. A Bidirectional RNN (BRNN) duplicates the RNN processing chain so that inputs are processed in both forward and reverse time order. This allows a BRNN to look at future context as well._

_Two common variants of RNN include GRU and LSTM. LSTM does better than RNN in capturing long-term dependencies. __Bidirectional LSTM (BiLSTM)__ in particular is a popular choice in NLP._

<img src='https://valiancesolutions.com/wp-content/uploads/2019/07/blogpost-8.png'/>

_Consider the phrase, ```'He said, "Teddy ___"```. From these three opening words it's difficult to conclude if the sentence is about Teddy bears or Teddy Roosevelt. This is because the context that clarifies Teddy comes later. RNNs (including GRUs and LSTMs) are able to obtain the context only in one direction, from the preceding words. They're unable to look ahead into future words._

<img src='https://valiancesolutions.com/wp-content/uploads/2019/07/blogpost15.gif'/>

___Bidirectional RNNs solve this problem by processing the sequence in both directions. Typically, two separate RNNs are used: one for forward direction and one for reverse direction. This results in a hidden state from each RNN, which are usually concatenated to form a single hidden state.___

_The final hidden state goes to a decoder, such as a fully connected network followed by softmax. Depending on the design of the neural network, the output from a BRNN can either be the complete sequence of hidden states or the state from the last time step. If a single hidden state is given to the decoder, it comes from the last states of each RNN._

_The idea of Bidirectional Recurrent Neural Networks (RNNs) is straightforward._

<img src='https://valiancesolutions.com/wp-content/uploads/2019/07/blogpost-13.gif'/>

_It involves duplicating the first recurrent layer in the network so that there are now two layers side-by-side, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the second._
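
_A minimal sketch of this idea, using a scalar tanh RNN in place of an LSTM to keep the code short (an assumption made for illustration; the weights and input values are also made up). One copy reads the sequence as-is, the other reads a reversed copy, and the two hidden states for each time step are paired up._

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Scalar RNN h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(inputs, fwd, bwd):
    """Pair forward and backward hidden states for each time step."""
    forward = run_rnn(inputs, *fwd)
    backward = run_rnn(inputs[::-1], *bwd)[::-1]   # realign to the original time order
    return list(zip(forward, backward))

sequence = [1.0, 0.0, 0.0, -1.0]
states = bidirectional(sequence, fwd=(1.0, 0.5), bwd=(1.0, 0.5))
print(states[0])
```

_At the first time step, the forward state has seen only the first input, while the backward state already summarizes everything that follows. Changing the last input changes the backward state at the first step but leaves the forward state there untouched._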

_To overcome the limitations of a regular RNN, the bidirectional recurrent neural network (BRNN) was proposed; it can be trained using all available input information in the past and future of a specific time frame._

_The idea is to split the state neurons of a regular RNN into a part that is responsible for the positive time direction (forward states) and a part for the negative time direction (backward states)._

<img src='https://www.i2tutorials.com/wp-content/media/2019/05/Deep-Dive-into-Bidirectional-LSTM-i2tutorials.jpg'/>

### ___Applications of Bidirectional RNN___

_BiLSTM has become a popular architecture for many NLP tasks. An early application of BiLSTM was in the domain of speech recognition. Other applications include sentence classification, sentiment analysis, review generation, or even medical event detection in electronic health records._

_BiLSTM has been used for POS tagging and Word Sense Disambiguation (WSD). For Named Entity Recognition (NER), Lample et al. used word representations that captured both character-level characteristics and word-level context. These were fed into a BiLSTM encoder layer. The sequence of hidden states was decoded by a CRF layer._

_For lemmatization, one study used two-layer bidirectional GRUs for the encoder. The decoder was a conditional GRU plus another GRU layer. Another study used a two-layer BiLSTM encoder and a one-layer LSTM decoder. A stack of four BiLSTMs has been used for Semantic Role Labelling (SRL)._

_Beyond NLP, BiLSTM has been applied to image processing applications such as OCR._


### ___Limitations of Bidirectional RNN___
_One limitation with BRNN is that the entire sequence must be available before we can make predictions. For some applications such as real-time speech recognition, the entire utterance may not be available and BRNN may not be adequate._

_In the case of language models, the task is to predict the next word given preceding words. BRNN is clearly not suitable since it expects future words as well. Applying BRNN in this application will give poor accuracy. Moreover, BRNN is slower than RNN since results of the forward pass must be available for the backward pass to proceed. Gradients will therefore have a long dependency chain._

_LSTMs capture long-term dependencies better than RNN and also solve the exploding/vanishing gradient problem. However, stacking many layers of BiLSTM creates the vanishing gradient problem._