<a href="https://colab.research.google.com/github/DavoodSZ1993/Dive-into-Deep-Learning-Notes-/blob/main/10_modernRNNs_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install d2l==1.0.0-alpha1.post0 --quiet


* RNNs face two numerical stability problems: exploding and vanishing gradients.
* For exploding gradients, gradient clipping was introduced in the previous chapter.
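
As a reminder of what gradient clipping looks like in PyTorch, here is a minimal sketch using `torch.nn.utils.clip_grad_norm_`. The tiny model, the artificially scaled loss, and `max_norm = 1.0` are assumptions chosen only so that the gradients are large enough to be clipped; they are not taken from the previous chapter.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy setup: a small RNN and a loss scaled up to produce large gradients.
model = nn.RNN(input_size=4, hidden_size=8)
x = torch.randn(50, 2, 4)             # (sequence length, batch_size, input_size)
output, _ = model(x)
loss = (output ** 2).sum() * 1e3
loss.backward()

max_norm = 1.0
# Rescales all gradients in place so that their global L2 norm is at most max_norm.
# The return value is the global norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print('gradient norm before clipping:', float(norm_before))
print('gradient norm after clipping: ', float(norm_after))
```

Clipping bounds the size of each update but does nothing for vanishing gradients, which is the problem the gated architectures below target.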

## 10.1 Long Short-Term Memory (LSTM)

* Long short-term memory (LSTM) was introduced to address the vanishing gradient problem in RNNs.
* Standard RNNs suffer from short-term memory because of vanishing gradients, especially on long sequences.
* Gated RNN variants such as the LSTM and the GRU can preserve important information from earlier parts of the sequence and carry it forward.

### LSTMs

Standard RNNs and LSTMs both use recurrent units to learn from sequential data, but the structure of the unit is very different. The recurrent unit of a standard RNN computes the following, where $\phi$ is the activation function (typically tanh):
$$
\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h)
$$
An LSTM employs several gates to decide what information to keep or discard. It also adds a **cell state**, which acts as the long-term memory of the LSTM. The recurrent unit of an LSTM is more complex than that of a standard RNN and contains the following parts:
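
As a minimal sketch of this recurrence, the step below applies it once with tanh as the activation. The sizes (batch of 3, input size 10, hidden size 20) and the small random weights are illustrative assumptions, not values from these notes.

```python
import torch

torch.manual_seed(0)

n, d, h = 3, 10, 20                   # batch size, input size, hidden size (illustrative)
X_t = torch.randn(n, d)               # input at time step t
H_prev = torch.zeros(n, h)            # previous hidden state H_{t-1}

W_xh = torch.randn(d, h) * 0.01       # input-to-hidden weights
W_hh = torch.randn(h, h) * 0.01       # hidden-to-hidden weights
b_h = torch.zeros(h)                  # hidden bias

# One recurrent step: H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h)
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
print(H_t.shape)                      # torch.Size([3, 20])
```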

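* **Hidden state** ($\mathbf{H}_{t-1}$) & new inputs ($\mathbf{X}_t$): Both are fed into every gate. Multiplying each by its own weight matrix and summing, as in the formulas below, is equivalent to concatenating them and applying one stacked weight matrix.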

* **Forget gate**: Controls what information should be forgotten.
$$
\mathbf{F}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f)
$$
Since the sigmoid function outputs values between 0 and 1, the forget gate decides, element by element, which values in **the cell state** are forgotten (multiplied by 0), remembered (multiplied by 1), or partially remembered (multiplied by a value between 0 and 1).

* **Input gate**: Identifies which elements should be added to the cell state.
$$
\mathbf{I}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i)
$$
Intuitively, the input gate determines how much of the input node's value is added to the current memory cell internal state.

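* **Input node** (cell state candidate): Proposes the new values that may be written to the cell state.
$$
\tilde{\mathbf{C}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c)
$$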

* **Cell state** ($\mathbf{C}_t$): First, the previous cell state ($\mathbf{C}_{t-1}$) is multiplied element-wise by the forget gate, which determines how much of the old cell internal state is retained. Then the element-wise product of the input gate and the input node (the cell state candidate) is added, which governs how much new data is taken into account.
$$
\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t
$$

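* **Output gate**: Controls how much of the cell state is exposed as the hidden state.
$$
\mathbf{O}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o)
$$

Finally, the hidden state ($\mathbf{H}_t$) is updated as follows, where $\odot$ denotes the element-wise product:
$$
\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t)
$$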
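
The gate equations above can be sketched as a single LSTM time step. This is a from-scratch illustration under stated assumptions, not how `torch.nn.LSTM` is implemented internally: the function name `lstm_step`, the helper `make_gate_params`, the parameter layout, and the sizes are all hypothetical. The output gate is computed with the same sigmoid form as the forget and input gates.

```python
import torch

def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM time step (hypothetical helper, for illustration only)."""
    (W_xi, W_hi, b_i,
     W_xf, W_hf, b_f,
     W_xo, W_ho, b_o,
     W_xc, W_hc, b_c) = params
    I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)   # input gate
    F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)   # forget gate
    O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)   # output gate
    C_tilde = torch.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)  # input node (candidate)
    C_t = F_t * C_prev + I_t * C_tilde                      # new cell state
    H_t = O_t * torch.tanh(C_t)                             # new hidden state
    return H_t, C_t

def make_gate_params(d, h):
    """Small random weights and zero bias for one gate (hypothetical helper)."""
    return [torch.randn(d, h) * 0.01, torch.randn(h, h) * 0.01, torch.zeros(h)]

torch.manual_seed(0)
n, d, h = 3, 10, 20                   # batch size, input size, hidden size (illustrative)
params = [p for _ in range(4) for p in make_gate_params(d, h)]

X_t = torch.randn(n, d)
H_prev, C_prev = torch.zeros(n, h), torch.zeros(n, h)
H_t, C_t = lstm_step(X_t, H_prev, C_prev, params)
print(H_t.shape, C_t.shape)           # torch.Size([3, 20]) torch.Size([3, 20])
```

Because the cell state is updated additively and gated element-wise, a forget gate close to 1 lets information in $\mathbf{C}_{t-1}$ pass to $\mathbf{C}_t$ almost unchanged.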



* Class `torch.nn.LSTM(input_size, hidden_size, **kwargs)`: Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
* `input_size`: The number of expected features in the input `x`.
* `hidden_size`: The number of features in the hidden state `h`.

### Inputs: `input`, (hidden state `h_0`, cell state `c_0`)
* `input`: tensor of shape ($L, H_{in}$) for unbatched input, ($L, N, H_{in}$) when `batch_first=False`, or ($N, L, H_{in}$) when `batch_first=True`, containing the features of the input sequence. Here $N$ is the batch size, $L$ is the sequence length, and $H_{in}$ is the input size.
* `h_0`: tensor of shape ($D$ * *num_layers*, $H_{out}$) for unbatched input, or ($D$ * *num_layers*, $N$, $H_{out}$), containing the initial hidden state for each element in the input sequence. Defaults to zeros if (`h_0`, `c_0`) is not provided. Here $D$ is 2 if `bidirectional=True` and 1 otherwise, and $H_{out}$ is the hidden size.
* `c_0`: tensor of shape ($D$ * *num_layers*, $H_{cell}$) for unbatched input, or ($D$ * *num_layers*, $N$, $H_{cell}$), containing the initial cell state for each element in the input sequence. Defaults to zeros if (`h_0`, `c_0`) is not provided. Here $H_{cell}$ is the hidden size.

### Outputs: output, (`h_n`, `c_n`)
* `output`: Tensor of shape ($L$, $D$ * $H_{out}$) for unbatched input, ($L$, $N$, $D$ * $H_{out}$) when `batch_first=False`, or ($N$, $L$, $D$ * $H_{out}$) when `batch_first=True`, containing the output features (`h_t`) from the last layer of the LSTM, for each t.
* `h_n`: Tensor of shape ($D$ * *num_layers*, $H_{out}$) for unbatched input, or ($D$ * *num_layers*, $N$, $H_{out}$) containing the final hidden state for each element in the sequence.
* `c_n`: Tensor of shape ($D$ * *num_layers*, $H_{cell}$) for unbatched input or ($D$ * *num_layers*, $N$, $H_{cell}$) containing the final cell state for each element in the sequence.

In [5]:
import torch
from torch import nn

rnn = nn.LSTM(10, 20)                # (input_size, hidden_size)
x = torch.randn(5, 3, 10)            # (sequence length, batch_size, input_size)
h0 = torch.randn(1, 3, 20)           # (D * num_layers, batch_size, hidden_size)
c0 = torch.randn(1, 3, 20)           # (D * num_layers, batch_size, hidden_size)

# Passing (h0, c0) is optional; both default to zeros when omitted.
output, (hn, cn) = rnn(x, (h0, c0))

print('output shape: ', output.shape) # (sequence length, batch_size, D * hidden_size) == (5, 3, 20)
print('h_n shape:', hn.shape)         # (D * num_layers, batch_size, hidden_size) == (1, 3, 20)
print('c_n shape:', cn.shape)         # (D * num_layers, batch_size, hidden_size) == (1, 3, 20)

output shape:  torch.Size([5, 3, 20])
h_n shape: torch.Size([1, 3, 20])
c_n shape: torch.Size([1, 3, 20])
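
To see how $D$ and *num_layers* change these shapes, here is a small sketch with a two-layer bidirectional LSTM and `batch_first=True`. The sizes are illustrative assumptions. Note that `batch_first` only affects `input` and `output`; `h_n` and `c_n` keep the ($D$ * *num_layers*, $N$, $H_{out}$) layout.

```python
import torch
from torch import nn

# Illustrative sizes: batch size, sequence length, input size, hidden size.
N, L, H_in, H_out = 3, 5, 10, 20

lstm = nn.LSTM(H_in, H_out, num_layers=2, bidirectional=True, batch_first=True)
x = torch.randn(N, L, H_in)           # batch_first=True -> (N, L, H_in)

# No (h_0, c_0) is passed, so both default to zeros.
output, (h_n, c_n) = lstm(x)

print('output shape:', output.shape)  # (N, L, D * H_out) == (3, 5, 40)
print('h_n shape:', h_n.shape)        # (D * num_layers, N, H_out) == (4, 3, 20)
print('c_n shape:', c_n.shape)        # (D * num_layers, N, H_cell) == (4, 3, 20)
```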


## 10.2 Gated Recurrent Units

* Class `torch.nn.GRU(input_size, hidden_size, **kwargs)`: Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.

### Inputs: `input`, `h_0`
* `input`: tensor of shape ($L, H_{in}$) for unbatched input, ($L, N, H_{in}$) when `batch_first=False` or ($N,L,H_{in}$) when `batch_first=True` containing the features of the input sequence. Where $L$ is sequence length, $N$ is batch size, and $H_{in}$ is input size.
* `h_0`: Tensor of shape ($D$ * *num_layers*, $H_{out}$) or ($D$ * *num_layers*, $N$, $H_{out}$) containing the initial hidden state for the input sequence. Here $D$ is 2 if `bidirectional=True` and 1 otherwise, and $H_{out}$ is the hidden size.

### Outputs: `output`, `h_n`
* `output`: tensor of shape ($L$, $D$ * $H_{out}$) for unbatched input, ($L$, $N$, $D$ * $H_{out}$) when `batch_first=False` or ($N$, $L$, $D$ * $H_{out}$) when `batch_first=True` containing the output features (`h_t`) from the last layer of the GRU, for each t.
* `h_n`: tensor of shape ($D$ * *num_layers*, $H_{out}$) or ($D$ * *num_layers*, $N$, $H_{out}$) containing the final hidden state for the input sequence.
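
The shapes above can be checked with a short sketch that mirrors the LSTM example. The sizes are illustrative assumptions. Unlike the LSTM, the GRU takes and returns only a hidden state, with no cell state.

```python
import torch
from torch import nn

torch.manual_seed(0)

gru = nn.GRU(10, 20)                  # (input_size, hidden_size)
x = torch.randn(5, 3, 10)             # (sequence length, batch_size, input_size)
h0 = torch.randn(1, 3, 20)            # (D * num_layers, batch_size, hidden_size)

output, hn = gru(x, h0)

print('output shape:', output.shape)  # (sequence length, batch_size, D * hidden_size) == (5, 3, 20)
print('h_n shape:', hn.shape)         # (D * num_layers, batch_size, hidden_size) == (1, 3, 20)

# For a single-layer, unidirectional GRU, the last time step of `output`
# holds the same values as the final hidden state.
same_last_state = torch.allclose(output[-1], hn[0])
print('output[-1] matches h_n:', same_last_state)
```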