# Long Short-Term Memory (LSTM)

## Introduction
- Long Short-Term Memory (LSTM) is a particular case of a recurrent neural network (RNN).
    - This type of neural network is used to process sequences of data, where *the order matters*.
    - A common application of RNNs and LSTMs is natural language processing. 
        - Words in a sentence have an order, and changing that order can completely change the meaning of a sentence
    - Another application where the order of data matters is time series. 
        - Time series are sequences of data points equally spaced in time, and their order cannot be changed. 
        - The data point observed at 9 a.m. must come before the data point at 10 a.m. and after the data point at 8 a.m.

## Recurrent Neural Network (RNN)
- A recurrent neural network (RNN) is especially adapted to processing sequences of data. 
- It uses a hidden state that is fed back into the network so it can use past information as an input when processing the next element of a sequence.
<p align="center"><img src="../../assets/img/rnn-architecture.png" width=600><br>Expanded illustration of an RNN</p>

- In the expanded illustration, the RNN takes an input, denoted as $x_t$, and produces an output, denoted as $y_t$. 
    - The first element of the sequence, $x_0$, is fed to the RNN, which computes a hidden state, denoted as $h_0$. 
    - This hidden state acts as memory and is fed back to the RNN as an input along with $x_1$ to produce the output $y_1$.
        - That way, the network effectively uses past information computed for previous elements of the sequence to inform the output for the next element of the sequence.
- Drawback: in practice, the RNN forgets past information that is far back in the sequence, so it suffers from short-term memory.
    - This short-term memory is due to the vanishing gradient
- Subtypes of RNNs to address the drawback: long short-term memory (LSTM) and gated recurrent unit (GRU)
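
The feedback loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a simple `tanh` RNN cell; the function `rnn_step`, the weight names `W_x`, `W_h`, `b`, and the sizes are hypothetical choices for the example and do not come from these notes. The output layer that would turn each hidden state into $y_t$ is left out.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a simple RNN cell: combine the current input with the past hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                 # illustrative sizes (assumption)
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))    # 5 elements, processed in order
h = np.zeros(hidden_size)                      # initial hidden state
hidden_states = []
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)          # h is fed back in at the next step
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)
print(hidden_states.shape)
```

Because `h` is reused at every iteration, each hidden state depends on all earlier elements of the sequence, which is the memory mechanism the illustration shows.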

### Vanishing Gradient
- RNNs suffer from short-term memory, meaning that information from an early element in the sequence will stop having an impact further into the sequence.
    - This short-term memory is due to the **vanishing gradient**.
- The gradient tells the network how to change its weights during training. 
    - If the gradient is large, the weights change by a large magnitude. 
    - If the gradient is small, the weights do not change significantly. 
    - The vanishing gradient problem refers to what happens when the gradient *becomes very small*, sometimes close to 0. 
        - This in turn means that the weights of the network do not get updated, and the *network stops learning*.
- For example, if an RNN is processing 24 hours of hourly data, the points at hours 9, 10, and 11 might still impact the output at hour 12, but any point prior to hour 9 might not contribute at all to the network’s learning, because the gradient gets very small for those early data points.
- Solution: the long short-term memory (LSTM) architecture, which uses the cell state as an additional way of keeping past information in memory for a long time.
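
To see why the gradient shrinks for early data points, consider a toy scalar RNN $h_t = \tanh(w \cdot h_{t-1} + x_t)$. This is only a sketch under simplifying assumptions, not a full backpropagation-through-time implementation: the function name, the weight `w=0.9`, and the constant inputs are arbitrary illustrative values. By the chain rule, the sensitivity of the current hidden state to the initial one is a product of one factor per time step, and each factor has magnitude below 1 here.

```python
import numpy as np

def gradient_wrt_initial_state(w, inputs, h0=0.0):
    """Track d h_t / d h_0 over time for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad, grads = h0, 1.0, []
    for x_t in inputs:
        h = np.tanh(w * h + x_t)
        grad *= w * (1.0 - h ** 2)   # chain rule factor d h_t / d h_{t-1}
        grads.append(grad)
    return np.array(grads)

inputs = np.full(24, 0.5)            # 24 hourly points (illustrative values)
grads = gradient_wrt_initial_state(w=0.9, inputs=inputs)
for hour in (1, 6, 12, 24):
    print(f"after {hour:2d} steps: {grads[hour - 1]:.3e}")
```

The printed values decrease at every step, so by the end of the 24 hours the first data point has almost no influence on the weight update.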

## Long Short-Term Memory (LSTM)
- Long short-term memory (LSTM) is a deep learning architecture that is a subtype of RNN
- The LSTM architecture adds a **cell state** to the RNN architecture to mitigate the vanishing gradient problem
    - The previous cell state, denoted as $C_{t-1}$, is fed into the LSTM along with the current element of the sequence $x_t$ and the previous hidden state $h_{t-1}$. 
        - This cell state is what allows the network to keep past information in the network for *a longer time*.
- The LSTM is made up of three gates:
    - The `forget gate` determines what information from past steps is still relevant.
    - The `input gate` determines what information from the current step is relevant.
    - The `output gate` uses the past information stored in the network’s memory to process the current element of the sequence. Its result is sent to the output layer and passed on to the processing of the next element of the sequence.
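
For reference, the three gates are commonly written as the equations below in the standard LSTM formulation. The notation is an assumption added here and is not used elsewhere in these notes: $W_\ast$ and $b_\ast$ are learned weights and biases, $f_t$, $i_t$, and $o_t$ are the forget, input, and output gate activations, $\tilde{C}_t$ is the candidate value from the `tanh` branch, $\sigma$ is the sigmoid function, $\odot$ is pointwise multiplication, and $[h_{t-1}, x_t]$ is the combination of the past hidden state and the current element.

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

The first term of the $C_t$ equation, $f_t \odot C_{t-1}$, corresponds to the updated cell state $C'_{t-1}$ produced by the forget gate.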

<p align="center"><img src="../../assets/img/lstm-architecture.png" width=600><br>The architecture of an LSTM neuron</p>

### Forget Gate
- The `forget gate` is the first gate in an LSTM cell. 
- Its role is to determine, based on the past hidden state $h_{t-1}$ and the current value of the sequence $x_t$, what information in the past cell state $C_{t-1}$ should be forgotten or kept in the network.
<p align="center"><img src="../../assets/img/lstm-forget-gate.png" width=400><br>The forget gate in an LSTM cell</p>

- Step 1: the past hidden state $h_{t-1}$ (which carries information from past values) and the present value of the sequence $x_t$ are fed into the forget gate. 
- Step 2: $h_{t-1}$ and $x_t$ are combined into $[h_{t-1} + x_t]$ and duplicated. 
    - One copy goes straight to the input gate
    - The other copy is sent through a `sigmoid` activation function $f(x)=\frac{1}{1 + e^{-x}}$
        - The sigmoid function outputs values between 0 and 1, which determine which information to keep (values close to 1) or to forget (values close to 0). 
        - That output is then combined with the previous cell state $C_{t-1}$ using pointwise multiplication. 
            - This results in an **updated cell state** that we call $C'_{t-1}$.
- Step 3: the updated cell state $C'_{t-1}$ and a copy of the combination of the past hidden state and the current element of the sequence $[h_{t-1} + x_t]$ are sent to the `input gate`.
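
The three steps above can be sketched in NumPy. This is a minimal sketch assuming the standard formulation, in which the combination $[h_{t-1} + x_t]$ is a concatenation that passes through learned weights before the sigmoid; the names `forget_gate`, `W_f`, and `b_f`, as well as the sizes and random values, are hypothetical and not part of these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, C_prev, W_f, b_f):
    """Scale the previous cell state by a sigmoid gate computed from h_prev and x_t."""
    combined = np.concatenate([h_prev, x_t])   # combination of h_{t-1} and x_t
    f_t = sigmoid(W_f @ combined + b_f)        # close to 0 = forget, close to 1 = keep
    C_updated = f_t * C_prev                   # pointwise multiplication -> C'_{t-1}
    return C_updated, combined                 # both are passed on to the input gate

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3                 # illustrative sizes (assumption)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
C_prev = rng.normal(size=hidden_size)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

C_updated, combined = forget_gate(h_prev, x_t, C_prev, W_f, b_f)
print(C_prev)
print(C_updated)
```

Each entry of `C_updated` has the same sign as the matching entry of `C_prev` but a magnitude that is no larger, since the gate can only keep or shrink what is already stored in the cell state.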

### Input Gate
- The role of the `input gate` is to determine which information is relevant from the current element of the sequence. 
<p align="center"><img src="../../assets/img/lstm-input-gate.png" width=400><br>The input gate in an LSTM cell</p>

- Step 1: The combination of the past hidden state and the current element of the sequence $[h_{t-1} + x_t]$ coming from the forget gate is fed into the input gate, where it is copied again.
    - One copy leaves the input gate and goes toward the `output gate`.
    - A second copy is sent through the `sigmoid` activation function to determine which information will be kept or forgotten.
    - A third copy is sent through the hyperbolic tangent (`tanh`) function.
- Step 2: The outputs of the `sigmoid` and `tanh` functions are combined using pointwise multiplication, and the result is combined with the updated cell state coming from the forget gate $C'_{t-1}$ using pointwise addition. 
    - This operation generates the **final cell state** $C_t$.
    - This adds information from the current element in the sequence to the long-term memory of the network. 
    - This newly updated cell state $C_t$ is then sent to the output gate.
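
The two steps above can be sketched in the same way. Again, this is a hedged sketch that assumes the standard formulation with learned weights on each branch; `input_gate`, `W_i`, `b_i`, `W_c`, and `b_c` are hypothetical names, and the inputs are random placeholders standing in for the values coming from the forget gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(combined, C_updated, W_i, b_i, W_c, b_c):
    """Add gated information from the current step to the cell state."""
    i_t = sigmoid(W_i @ combined + b_i)        # sigmoid branch: what to keep
    candidate = np.tanh(W_c @ combined + b_c)  # tanh branch: candidate values
    return C_updated + i_t * candidate         # pointwise multiply, then add -> C_t

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3                 # illustrative sizes (assumption)
combined = rng.normal(size=hidden_size + input_size)   # stands in for the combined input
C_updated = rng.normal(size=hidden_size)                # stands in for C'_{t-1}
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

C_t = input_gate(combined, C_updated, W_i, b_i, W_c, b_c)
print(C_t)
```

Since the sigmoid output lies between 0 and 1 and the `tanh` output between -1 and 1, each entry of the cell state changes by at most 1 in magnitude per step.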

### Output Gate
- The role of the `output gate` is to compute the new hidden state $h_t$, which is either output to the output layer or sent to the processing of the next element in the sequence. It does so by combining the **final cell state** $C_t$ with the combination of the past hidden state and the current element of the sequence $[h_{t-1} + x_t]$.
<p align="center"><img src="../../assets/img/lstm-output-gate.png" width=400><br>The output gate in an LSTM cell</p>

- Step 1:
    - The combination of the past hidden state and the current element of the sequence $[h_{t-1} + x_t]$ is sent through the `sigmoid` function.
    - The final cell state $C_t$ goes through the `tanh` function. 
    - The resulting values from the `tanh` and `sigmoid` functions are then combined using pointwise multiplication, generating an **updated hidden state** $h_t$.
        - :star: This is the step where past information, represented by the cell state $C_t$, is used to process the information of the present element of the sequence.
- Step 2: 
    - The current **updated hidden state** $h_t$ is then sent out of the output gate. 
        - This will either be sent to the output layer of the network or to the next LSTM neuron treating the next element of the sequence. 
    - The cell state $C_t$ is also passed to the next LSTM neuron.
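
Putting the three gates together, one full LSTM step can be sketched as follows. This is an illustrative sketch assuming the standard formulation with one learned weight matrix and bias per branch, not a production implementation; `lstm_step`, the `params` dictionary and its keys, the sizes, and the random inputs are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step: forget gate, input gate, then output gate."""
    combined = np.concatenate([h_prev, x_t])                 # combination of h_{t-1} and x_t

    # Forget gate: decide what to keep from the previous cell state.
    f_t = sigmoid(params["W_f"] @ combined + params["b_f"])
    C_updated = f_t * C_prev                                 # C'_{t-1}

    # Input gate: add information from the current element.
    i_t = sigmoid(params["W_i"] @ combined + params["b_i"])
    candidate = np.tanh(params["W_c"] @ combined + params["b_c"])
    C_t = C_updated + i_t * candidate                        # final cell state

    # Output gate: combine the sigmoid branch with tanh of the final cell state.
    o_t = sigmoid(params["W_o"] @ combined + params["b_o"])
    h_t = o_t * np.tanh(C_t)                                 # updated hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                               # illustrative sizes (assumption)
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):                 # a sequence of 6 elements
    h, C = lstm_step(x_t, h, C, params)                      # h and C go to the next step
print(h)
print(C)
```

Both `h` and `C` are carried from one iteration to the next, mirroring how the hidden state and the cell state are passed to the LSTM neuron that processes the next element of the sequence.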

