---
title: "Understanding the LSTM Cell"
author: "Kirtan Gangani"
date: "2025-07-10"
categories: [Deep Learning, LSTM]
format:
  html:
    code-fold: false
jupyter: python3
---

# What is LSTM?

<div>
Imagine trying to understand a long conversation where you forget the beginning halfway through. Traditional Recurrent Neural Network (RNN) models often struggle with this "memory" problem when dealing with sequences of data like text, speech, or time series. As sequences get longer, information from earlier steps gets "diluted" or "forgotten". More precisely, the gradients that flow back to early time steps during training become too small to drive learning, which is known as the vanishing gradient problem. This makes it hard for an RNN to learn long-term dependencies: it cannot effectively connect information from much earlier in a sequence to the current decision.

This is where Long Short-Term Memory (LSTM) networks come in. LSTMs are a special type of RNN designed to overcome this limitation, allowing a model to remember important information over long periods. They were introduced to mitigate the vanishing and exploding gradient problems faced by standard RNNs.

This is achieved by a "cell state" that acts like a conveyor belt, carrying information through the network so that it can be preserved over long sequences. This is the LSTM's "long-term memory." What is written to and read from the cell state is controlled by an internal structure called "gates". These gates act like learned filters that control the flow of information into and out of the memory cell. The gates are explained in the following sections.
</div>
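
To get a feel for why gradients vanish in a plain RNN, the sketch below multiplies together the factors a gradient picks up as it flows back through a tanh RNN with a single scalar state. The recurrent weight `w = 0.9` and the constant pre-activation `0.5` are illustrative assumptions, not values from a trained model.

```python
import math

def backprop_factor(w, pre_activation):
    # For h_t = tanh(w * h_{t-1} + ...), d h_t / d h_{t-1} = w * (1 - tanh(a)^2)
    return w * (1.0 - math.tanh(pre_activation) ** 2)

def gradient_after(steps, w=0.9, pre_activation=0.5):
    # Assumes the same factor at every step, so the product is a simple power
    return backprop_factor(w, pre_activation) ** steps

for steps in (1, 10, 50):
    print(steps, gradient_after(steps))
```

Because each factor is below 1 here, the product shrinks quickly as the number of steps grows, which is the effect the text calls the vanishing gradient.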

# LSTM Cell Diagram

![LSTM-chain](./images/lstm-cell.png)

: LSTM cell. Source: [Turing's Blog](https://www.turing.com/kb/comprehensive-guide-to-lstm-rnn)

# LSTM Architecture
 
LSTMs have three types of gates (the forget gate, the input gate, and the output gate) that regulate information flow. Each gate uses the sigmoid activation function $\sigma$, which outputs values between 0 and 1. This range lets a gate control how much information passes through: 1 means let all of the information pass and 0 means let none of it pass.
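
As a small illustration of this gating idea, the snippet below applies a sigmoid to a few hand-picked pre-activation values and uses the result to scale a signal. The values `-6.0`, `0.0`, `6.0` and the signal `2.0` are arbitrary choices for demonstration.

```python
import math

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

signal = 2.0
for z in (-6.0, 0.0, 6.0):
    gate = sigmoid(z)
    # The gate value scales how much of the signal gets through
    print(f"z={z:+.1f}  gate={gate:.3f}  passed={gate * signal:.3f}")
```

A strongly negative pre-activation nearly closes the gate, a strongly positive one nearly opens it, and zero lets half of the signal through.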

## Forget Gate

The forget gate's primary role is to decide what information from the previous cell state ($c_{t-1}$) should be discarded or "forgotten". It takes the current input ($x_t$) and the hidden state from the previous time step ($h_{t-1}$) as inputs. These are concatenated, multiplied by the weight matrix $W_f$, shifted by the bias $b_f$, and passed through a sigmoid activation function.

\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f)
\end{aligned}

The output $f_t$ is a vector with values between 0 and 1. It is multiplied element-wise with the previous cell state ($c_{t-1}$). A value of 1 means "keep all of this information", while a value of 0 means "forget all of this information". A value of 0.5 keeps half of the corresponding cell state entry.
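
The forget gate equation can be sketched in plain Python as follows. The weights, bias, and states are made-up toy values (hidden size 2, input size 1), and `forget_gate` is a hypothetical helper name used only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    concat = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]

# Toy values, chosen only for illustration
W_f = [[0.5, -0.3, 0.8],
       [-1.0, 0.2, 0.1]]
b_f = [1.0, 0.0]
h_prev = [0.1, -0.2]
x_t = [0.5]
c_prev = [2.0, -1.0]

f_t = forget_gate(W_f, b_f, h_prev, x_t)
kept = [f * c for f, c in zip(f_t, c_prev)]  # element-wise f_t * c_{t-1}
print(f_t, kept)
```

Each entry of `kept` is the matching entry of `c_prev` scaled down by a factor between 0 and 1.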

## Input Gate

The input gate is responsible for determining which new information from the current input should be stored in the cell state. It works in two parts:

### Candidate Memory ($\tilde{c}_t$)

Before the input gate itself comes the candidate memory, often denoted $\tilde{c}_t$. Its purpose is to propose new information that could be added to the cell state. Unlike the gates, which use sigmoid, the candidate memory uses a hyperbolic tangent (tanh) activation function. The tanh function outputs values between -1 and 1, allowing for both positive and negative contributions to the cell state.

\begin{aligned}
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c)
\end{aligned}
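
A minimal sketch of the candidate memory computation is shown below. The weights and inputs are toy values picked so that one unit proposes a positive update and the other a negative one; `candidate_memory` is a hypothetical helper name.

```python
import math

def candidate_memory(W_c, b_c, h_prev, x_t):
    concat = h_prev + x_t  # [h_{t-1}, x_t] as a flat list
    return [math.tanh(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_c, b_c)]

# Toy values, chosen only for illustration
W_c = [[1.0, 0.0, 2.0],
       [0.0, -1.0, -2.0]]
b_c = [0.0, 0.0]
h_prev = [0.5, 0.5]
x_t = [1.0]

c_tilde = candidate_memory(W_c, b_c, h_prev, x_t)
print(c_tilde)  # first entry is positive, second is negative
```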

### Input gate ($i_t$)

The input gate then decides how much of this newly proposed candidate memory $\tilde{c}_t$ should actually be added to the cell state. Similar to the forget gate, it takes the current input ($x_t$) and the previous hidden state ($h_{t-1}$) and passes them through a sigmoid activation function.

\begin{aligned}
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i)
\end{aligned}

The output $i_t$ acts as a filter for the candidate memory $\tilde{c}_t$.
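
To show the filtering effect on its own, the snippet below starts from assumed pre-activation values (standing in for $W_i [h_{t-1}, x_t] + b_i$) and assumed candidate values for a 3-unit cell, rather than computing them from weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed values for illustration only
pre_activations = [-4.0, 0.0, 4.0]
c_tilde = [0.9, -0.6, 0.9]

i_t = [sigmoid(z) for z in pre_activations]
contribution = [i * c for i, c in zip(i_t, c_tilde)]  # element-wise i_t * c~_t
print(i_t, contribution)
```

The first candidate value is almost entirely blocked, the second is halved, and the third passes nearly unchanged.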

### Cell state Update ($c_t$)

This is where the magic happens for updating the long-term memory of the network. The new cell state ($c_t$) is a combination of two components:  
1. The information from the previous cell state ($c_{t-1}$) that the forget gate ($f_t$) decided to keep.  
2. The new candidate memory ($\tilde{c}_t$) that the input gate ($i_t$) decided to add.

\begin{aligned}
c_t &= f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t
\end{aligned}

These two parts are added element-wise to form the updated cell state ($c_t$).  
This updated cell state $c_t$ then carries the network's long-term memory forward to the next time step.
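
The update rule can be tried out with hand-picked gate values. In this sketch all four vectors are assumed inputs chosen to show three regimes; `update_cell_state` is a hypothetical helper name.

```python
def update_cell_state(f_t, c_prev, i_t, c_tilde):
    # c_t = f_t * c_{t-1} + i_t * c~_t, applied element by element
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]

# Assumed values for illustration only
f_t = [1.0, 0.0, 0.5]
c_prev = [2.0, 2.0, 2.0]
i_t = [0.0, 1.0, 0.5]
c_tilde = [0.8, 0.8, -0.4]

c_t = update_cell_state(f_t, c_prev, i_t, c_tilde)
print(c_t)
```

The first unit keeps its old value untouched, the second is fully overwritten by the candidate, and the third blends half of the old value with half of the candidate.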

## Output Gate

Finally, the output gate controls how much of the updated cell state ($c_t$) will be exposed as the current hidden state ($h_t$). The hidden state is the output of the LSTM cell at the current time step and is also passed on to the next time step.

First, the output gate determines which parts of the cell state are relevant for the current hidden state. It uses the current input ($x_t$) and the previous hidden state ($h_{t-1}$) passed through a sigmoid function:

\begin{aligned}
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o)
\end{aligned}

Next, the updated cell state ($c_t$) is passed through a tanh activation function. This scales the cell state values to between -1 and 1, making them ready to be filtered.

Finally, the result of the tanh operation on $c_t$ is element-wise multiplied by the output gate's activation ($o_t$) to produce the new hidden state ($h_t$):

\begin{aligned}
h_t &= o_t \cdot \tanh(c_t)
\end{aligned}

The hidden state $h_t$ serves as the output of the LSTM block for the current time step and is also used as an input to the gates in the next time step.
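
Putting the equations of this section together, one LSTM time step might be sketched in plain Python as below. This is a teaching sketch under stated assumptions, not how a deep learning library implements it: `lstm_step`, `init_params`, the parameter dictionary layout, and the random initialization range are all hypothetical choices made for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    # Computes W v + b for a list-of-rows matrix W
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    concat = h_prev + x_t  # [h_{t-1}, x_t]

    def layer(name, activation):
        z = affine(params["W_" + name], params["b_" + name], concat)
        return [activation(v) for v in z]

    f_t = layer("f", sigmoid)        # forget gate
    i_t = layer("i", sigmoid)        # input gate
    c_tilde = layer("c", math.tanh)  # candidate memory
    o_t = layer("o", sigmoid)        # output gate

    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    # Small random weights and zero biases; an arbitrary choice for the demo
    rng = random.Random(seed)
    params = {}
    for name in "fico":
        params["W_" + name] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        params["b_" + name] = [0.0] * hidden_size
    return params

params = init_params(input_size=2, hidden_size=3)
h, c = [0.0] * 3, [0.0] * 3
for x_t in ([1.0, 0.0], [0.0, 1.0], [0.5, 0.5]):
    # h and c from one step become h_{t-1} and c_{t-1} of the next
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

The loop at the end shows the recurrence: the same parameters are reused at every time step, and only the hidden state and cell state change.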

# Advantages and Disadvantages of LSTMs

While LSTMs have revolutionized sequence modeling, it's important to understand their strengths and limitations.

## Advantages of LSTMs

* **Mitigating the Vanishing Gradient Problem:** This is their most significant advantage. LSTMs largely address the vanishing gradient problem inherent in traditional RNNs, allowing them to learn and retain information over long sequences. This is primarily because the cell state is updated additively and the gates regulate what flows into and out of it.
* **Capturing Long-Term Dependencies:** Thanks to their ability to maintain a persistent cell state, LSTMs can connect information from distant past steps to make decisions in the present. This is crucial for tasks like understanding context in long sentences or predicting future values in time series based on historical trends.
* **Handling Variable-Length Sequences:** LSTMs are naturally designed to process sequences of varying lengths, making them highly versatile for real-world data like text (sentences of different lengths), speech (utterances of different durations), and time series.
* **Robustness to Noise (to some extent):** The gating mechanism allows LSTMs to selectively filter out irrelevant or noisy information, focusing on the most important features in the sequence.
* **Wide Applicability:** LSTMs have found immense success across a broad spectrum of domains, including:
    * **Natural Language Processing (NLP):** Machine translation, sentiment analysis, text summarization, named entity recognition, language modeling.
    * **Speech Recognition:** Converting spoken words into text.
    * **Time Series Forecasting:** Predicting stock prices, weather patterns, energy consumption.
    * **Video Processing:** Action recognition, video captioning.
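
The first advantage can be made concrete with a rough calculation. Along the direct path through the cell state, the update $c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t$ scales the gradient by $f_t$ at each step. The sketch below multiplies these factors over 50 steps for two assumed, constant forget gate values. It deliberately ignores the indirect paths through $h_{t-1}$, so it is a simplification rather than a full gradient computation.

```python
import math

def cell_state_gradient(forget_values):
    # Product of per-step factors d c_t / d c_{t-1} = f_t along the direct path
    return math.prod(forget_values)

# Assumed constant gate values over 50 time steps
mostly_open = [0.99] * 50
half_open = [0.5] * 50

print(cell_state_gradient(mostly_open))
print(cell_state_gradient(half_open))
```

When the network learns to keep the forget gate close to 1, the gradient along the cell state survives many steps. When the gate sits lower, the gradient still decays, which is consistent with the goal of forgetting that information.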

## Disadvantages of LSTMs

* **Computational Cost:** LSTMs are more computationally intensive and slower to train than standard RNNs. In the formulation above, an LSTM cell has four sets of weights and biases ($W_f$, $W_i$, $W_c$, $W_o$ and the matching bias vectors), so it has roughly four times as many parameters as a standard RNN with the same input and hidden sizes, plus the extra element-wise operations for the gates.
* **Complex Architecture:** The multiple gates and intricate interactions within an LSTM cell make them more complex to understand and debug compared to simpler models. While powerful, this complexity can be a barrier for newcomers.
* **Difficulty with Very Long Sequences:** Although LSTMs significantly mitigate the long-term dependency problem, they can still struggle with *extremely* long sequences. As the sequence length increases, the computational burden grows, and even LSTMs can start to lose efficiency or effectiveness.
* **Limited Parallelization:** The inherent sequential nature of LSTMs (processing one time step after another) makes it challenging to fully parallelize their training across multiple computing cores or GPUs, especially during the forward and backward passes within a sequence. This is a key area where newer architectures like Transformers have shown significant advantages.
* **Hyperparameter Tuning:** LSTMs often require careful tuning of hyperparameters (e.g., number of hidden units, learning rate, dropout) to achieve optimal performance, which can be a time-consuming process.
* **Outshone by Transformers in Many NLP Tasks:** For many state-of-the-art NLP tasks, transformer architectures (which use attention mechanisms instead of recurrence) have largely surpassed LSTMs in performance, particularly for very long sequences and tasks requiring complex contextual understanding.
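
To put a number on the computational cost point, the sketch below counts parameters for the LSTM formulation used in this post (one weight matrix over $[h_{t-1}, x_t]$ and one bias vector per gate) and for a standard RNN with a single such layer. The sizes 128 and 256 are arbitrary, and deep learning libraries may count biases differently, so treat the figures as an estimate.

```python
def rnn_param_count(input_size, hidden_size):
    # One weight matrix over [h_{t-1}, x_t] plus one bias vector
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_param_count(input_size, hidden_size):
    # Forget gate, input gate, candidate memory, and output gate
    return 4 * rnn_param_count(input_size, hidden_size)

input_size, hidden_size = 128, 256
print("RNN: ", rnn_param_count(input_size, hidden_size))
print("LSTM:", lstm_param_count(input_size, hidden_size))
```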