# **LSTMs: From GRU Memory Limitation to Long-Term Memory Design**

In a GRU, memory is carried by the hidden state $h_t$.
The update rule is given by:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Rearranging the terms, we obtain:

$$h_t = z_t (\tilde{h}_t - h_{t-1}) + h_{t-1}$$

Here, the **new memory** is $h_t$ and the **previous memory** is $h_{t-1}$.
This can be interpreted as:

$$\text{new memory} = \text{old memory} + \text{some change}$$

However, an important issue appears at this point.
Even when the update gate is very small, for example $z_t = 0.00001$ (intuitively meaning “do not update the old memory”), the old memory is still scaled by $1 - z_t$:

$$h_t = 0.99999 \cdot h_{t-1} + 0.00001 \cdot \tilde{h}_t$$

This means that only **99.999%** of the previous memory is retained at this step.
After $T$ such steps, this multiplicative decay compounds to $(0.99999)^T$:

$$(0.99999)^{100} \approx 0.999, \qquad (0.99999)^{100000} \approx 0.37$$

The loss is small over 100 steps, but it grows with sequence length, so critical information is gradually lost.
This is why the GRU design struggles to retain memory over very long time spans.
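The compounding can be checked numerically. The sketch below assumes the update gate stays at a constant $z_t = 0.00001$ on every step and ignores the small $z_t \tilde{h}_t$ contribution; the helper name `retained_fraction` is made up for this illustration.

```python
def retained_fraction(z: float, steps: int) -> float:
    """Fraction of the original memory left after `steps` updates,
    assuming the update gate is held at the constant value `z`."""
    return (1.0 - z) ** steps


# Print the surviving fraction for increasingly long sequences.
for steps in (100, 10_000, 100_000, 1_000_000):
    print(f"{steps:>9} steps -> {retained_fraction(1e-5, steps):.6f}")
```

Under this assumption the retained fraction is roughly 0.999 after 100 steps, about 0.90 after 10,000 steps, and about 0.37 after 100,000 steps.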

Observing the equation again:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

we see that **both memory removal and memory update are controlled by a single gate**, $z_t$: writing more new content always means erasing more old content.
This coupling motivates the design change introduced in the LSTM, where the two responsibilities are split across separate gates so that retention and writing can be controlled independently.

---

## **Solution: Redesigning Memory Update**

The goal is to update memory such that:

* **100% of the old memory can be retained**
* the update is **linear**, so no forced decay of old information occurs
* the flow of information remains fully controlled by the neural network

The update must satisfy:

$$\text{new memory} = \text{old memory} \quad \text{(must be possible)}$$

and also allow:

$$\text{new memory} = \text{old memory} + \text{some information}$$

where this new information depends on the current input and the current context.

Introducing gating control, this formulation becomes:

$$\text{new memory} = f_t \cdot \text{old memory} + i_t \cdot g_t$$

This is the **long-term memory** formulation: it carries information about what happened many time steps earlier.

The **current hidden state** is then defined as:

$$\text{current hidden state} = o_t \cdot \tanh(\text{new memory})$$

which represents **short-term memory**: it reflects what happened most recently.

---

## **Gate Definitions and Roles**

* $f_t$ → forget gate
* $i_t$ → input gate
* $g_t$ → candidate gate (the candidate memory content; it uses `tanh` rather than a sigmoid)
* $o_t$ → output gate

The long-term memory is defined as:

$$c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$$

With $f_t = 1$ and $i_t = 0$ this reduces to $c_t = c_{t-1}$, so **100% retention of old memory is possible at any time step**, and the update is linear in $c_{t-1}$.
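A minimal sketch of this retention property, with the gate values set by hand rather than computed by a network (an assumption made purely for illustration; `cell_update` is a hypothetical helper name):

```python
import numpy as np


def cell_update(c_prev, f, i, g):
    """Long-term memory update: c_t = f * c_{t-1} + i * g."""
    return f * c_prev + i * g


c0 = np.array([0.5, -1.2, 2.0])    # arbitrary starting memory
c = c0.copy()
rng = np.random.default_rng(0)

# 1000 steps with the forget gate fully open and the input gate fully closed.
for _ in range(1000):
    g = np.tanh(rng.normal(size=3))  # candidate content, ignored because i = 0
    c = cell_update(c, f=1.0, i=0.0, g=g)

print(np.array_equal(c, c0))
```

Because the old memory is multiplied by exactly 1 and nothing is added, the memory after 1000 steps is identical to the starting memory.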

---

### **Forget Gate**

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$

Given the current input $x_t$ and previous context $h_{t-1}$, this gate decides, for each component of $c_{t-1}$, how much should be kept.
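As a rough NumPy sketch of the forget gate: the sizes, the random parameter values, and the example memory `c_prev` below are all assumptions chosen for illustration, and the parameter names follow the subscripted form used in the summary equations.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


# Hypothetical sizes and randomly initialised parameters (illustration only).
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden_size, input_size))
U_f = rng.normal(size=(hidden_size, hidden_size))
b_f = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)     # current input
h_prev = np.zeros(hidden_size)        # previous hidden state
c_prev = np.array([1.0, -2.0, 0.5])   # previous long-term memory

# One value in (0, 1) per memory component.
f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)

# Element-wise scaling: each component of the old memory is kept in proportion to f_t.
kept = f_t * c_prev
print(f_t, kept)
```

Each entry of `f_t` scales the matching entry of `c_prev`, so the gate can keep one component almost fully while shrinking another.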

---

### **Input Gate**

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$

This gate determines **how much new information** is added to the old memory.
It only controls the **strength** of the update.
The actual content to be written is decided by $g_t$.

Thus:

$$i_t \cdot g_t$$

represents the update applied to the long-term memory.

---

### **Candidate Gate**

$$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$$

Given the current input and previous context, this gate determines **what new information should be stored** in long-term memory.

The use of `tanh` keeps the candidate memory **bounded and stable**, while the strength of the update is controlled by $i_t$.
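The split between strength ($i_t$) and content ($g_t$) can be sketched as follows. The sizes and random parameters are assumptions made for illustration, not values from any trained model.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


# Hypothetical sizes and random parameters, chosen only for illustration.
input_size, hidden_size = 4, 3
rng = np.random.default_rng(1)
x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)

W_i, W_g = (rng.normal(size=(hidden_size, input_size)) for _ in range(2))
U_i, U_g = (rng.normal(size=(hidden_size, hidden_size)) for _ in range(2))
b_i, b_g = np.zeros(hidden_size), np.zeros(hidden_size)

i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)   # strength of the write, in (0, 1)
g_t = np.tanh(W_g @ x_t + U_g @ h_prev + b_g)   # content to write, in [-1, 1]

update = i_t * g_t   # what actually gets added to the long-term memory
print(update)
```

Since `i_t` lies between 0 and 1, the update can never be larger in magnitude than the candidate itself; the input gate only attenuates what $g_t$ proposes.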

---

## **Long-Term Memory Update**

$$c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$$

* $f_t$ → how much of the old memory to keep
* $i_t$ → how much new information to add
* $g_t$ → content of the new memory

The roles of $f_t$ and $i_t$ were played by a **single update gate $z_t$ in GRU**, where:

* $(1 - z_t)$ controlled erasure
* $z_t$ controlled update

---

## **From GRU to LSTM (Conceptual Mapping)**

$$(1 - z_t) \rightarrow f_t$$

$$z_t \rightarrow i_t$$

$$h_{t-1} \rightarrow c_{t-1} \quad \text{(long-term memory)}$$

$$\tilde{h}_t \rightarrow g_t$$

**Note:** This mapping is conceptual.
The LSTM allows **independent control** of retention and update, which the GRU's single update gate cannot provide.
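The mapping can be illustrated with a small sketch that compares only the memory-update lines of the two designs (it leaves out the GRU reset gate and the LSTM output gate). The function names and the numeric values are hypothetical, picked by hand.

```python
import numpy as np


def gru_update(h_prev, z, h_tilde):
    """GRU memory update: one gate z controls both erasure and writing."""
    return (1.0 - z) * h_prev + z * h_tilde


def lstm_cell_update(c_prev, f, i, g):
    """LSTM memory update: separate forget gate f and input gate i."""
    return f * c_prev + i * g


old = np.array([0.8, -0.3])    # previous memory
cand = np.array([0.1, 0.9])    # candidate content
z = np.array([0.25, 0.6])      # GRU update gate

# Tying the LSTM gates as f = 1 - z and i = z reproduces the GRU update.
coupled = lstm_cell_update(old, f=1.0 - z, i=z, g=cand)
print(np.allclose(coupled, gru_update(old, z, cand)))

# With independent gates the LSTM can keep all of the old memory AND
# write all of the candidate, which the coupled form cannot express.
accumulated = lstm_cell_update(old, f=1.0, i=1.0, g=cand)
print(accumulated)
```

With the gates tied, the two updates coincide; with $f_t = 1$ and $i_t = 1$ the candidate is simply added on top of the unchanged old memory.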

---

## **Output Gate and Hidden State**

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

Given the current input and recent context, this gate determines **how much of the long-term memory should be exposed**.

The hidden state is then computed as:

$$h_t = o_t \cdot \tanh(c_t)$$

This represents **short-term memory** and is returned as the visible state of the network.
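A small sketch of this last step, using hand-picked values for $c_t$ and $o_t$ (assumptions for illustration; in a real cell $o_t$ comes from the sigmoid above):

```python
import numpy as np

c_t = np.array([5.0, -0.2, 0.0])   # long-term memory; entries may grow beyond [-1, 1]
o_t = np.array([0.9, 0.5, 0.1])    # hand-picked output gate values in (0, 1)

# tanh squashes the memory into (-1, 1); the gate then decides how much to expose.
h_t = o_t * np.tanh(c_t)
print(h_t)
```

Even though the first memory entry is 5.0, the exposed hidden state stays below 1 in magnitude, and a zero memory entry yields a zero hidden-state entry regardless of the gate.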


## **Summary**

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

### **Notation**

* $\sigma(\cdot)$: sigmoid activation  
* $\tanh(\cdot)$: hyperbolic tangent  
* $\odot$: element-wise (Hadamard) product  
* $W_*, U_*, b_*$: trainable parameters
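Putting the summary equations together, one LSTM step could be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names `init_params` and `lstm_step`, the dictionary layout, the sizes, and the small random initialisation are all made up for the example.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_params(input_size, hidden_size, seed=0):
    """Random W_*, U_* and zero b_* for the four parts f, i, g, o (illustrative)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "g", "o"):
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
        params[f"U_{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step, following the summary equations line by line."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    g_t = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # candidate
    c_t = f_t * c_prev + i_t * g_t                                 # long-term memory
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    h_t = o_t * np.tanh(c_t)                                       # hidden state
    return h_t, c_t


# Run the cell over a short random sequence, starting from zero states.
input_size, hidden_size, seq_len = 4, 3, 5
params = init_params(input_size, hidden_size)
xs = np.random.default_rng(1).normal(size=(seq_len, input_size))

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)

print(h, c)
```

The loop threads both states through time: `c` carries the long-term memory and `h` is the visible short-term state that also feeds the gates at the next step.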