# LSTM 

A Long Short-Term Memory (LSTM) network is a recurrent neural network built around a
special type of cell that is designed to handle memory in a more effective and
controlled way.

In a traditional RNN, memory is stored in a single hidden state that gets updated
at every time step. This makes the network prone to forgetting important
information from earlier in the sequence, especially when dealing with long
inputs.
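
To make this forgetting concrete, here is a minimal sketch with a scalar RNN of the form `h_t = tanh(w * h_prev + u * x_t)`. The weights `w` and `u` and the input values are made up for illustration and do not come from a trained model. The sensitivity of the final state to the initial state is a product of per-step factors `w * (1 - h_t^2)`, which shrinks quickly when `|w| < 1`.

```python
import math

# Hypothetical scalar RNN: h_t = tanh(w * h_prev + u * x_t)
w, u = 0.5, 1.0
inputs = [0.3, -0.1, 0.8, 0.2, -0.5] * 4  # 20 made-up time steps

h = 0.0
sensitivity = 1.0  # d h_t / d h_0, accumulated with the chain rule
for x in inputs:
    h = math.tanh(w * h + u * x)
    # d tanh(z)/dz = 1 - tanh(z)^2, and dz/dh_prev = w
    sensitivity *= w * (1.0 - h * h)

print(f"d h_20 / d h_0 = {sensitivity:.3e}")
```

With these values the result is below 1e-6, because every factor in the product is at most 0.5. After 20 steps the initial state has almost no influence on the final one.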

The LSTM cell addresses this limitation by introducing a structured memory
mechanism that allows the network to retain important information over long
periods of time while discarding irrelevant details.

The key idea behind the LSTM cell is the separation of memory into two parts:
a long-term memory called the *cell state* and a short-term memory called the
*hidden state*. This separation allows the network to preserve important
information while still producing useful outputs at each time step.

To control how information flows into, out of, and through the cell state, the
LSTM uses gating mechanisms. These gates act like filters that decide what
information should be remembered, forgotten, or exposed as output. Because these
decisions are made using smooth, learnable functions, the LSTM can be trained
effectively using gradient-based optimization methods.

Overall, the LSTM cell is designed to remember what is important, forget what is
not, and maintain stable learning across long sequences.


## The Cell State

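The most important component of an LSTM cell is the **cell state**. This is where
long-term information is stored. Unlike the hidden state in a traditional RNN,
which is recomputed through a weight matrix and a nonlinearity at every step, the
cell state is changed only through gated, elementwise updates, so it can remain
nearly unchanged whenever the gates allow it.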

You can think of the cell state as a memory line that runs straight through the
entire sequence. Information can flow along this line with very little
interference, which allows important details from earlier time steps to be
preserved for a long duration.

What makes the cell state special is that it is not repeatedly overwritten.
Instead, it is updated in a controlled manner using gating mechanisms. These
gates decide which parts of the previous memory should be kept and which parts
should be modified.

Because the cell state follows a mostly linear path through time, gradients can
flow backward during training without shrinking too much. This mitigates the
vanishing gradient problem that affects traditional recurrent neural networks.

In simple terms, the cell state acts as the long-term memory of the LSTM. It
stores information that is relevant across many time steps and provides a stable
foundation on which the rest of the LSTM operations are built.


## The Forget Gate: Deciding What to Remove from Memory

As an LSTM processes a sequence, not all past information remains useful forever.
Some details become irrelevant as new inputs arrive. The purpose of the forget
gate is to decide which parts of the existing memory should be discarded.

At each time step, the forget gate looks at two things: the current input and the
previous hidden state. Based on these, it produces a set of values between 0 and
1, one for each element of the cell state. These values act as control signals
for the cell state.

A value close to 1 means that the corresponding information in the cell state
should be kept almost unchanged. A value close to 0 means that the information
should be largely forgotten. In this way, the forget gate selectively cleans the
memory.

This mechanism is crucial because memory that is never updated or cleaned can
become noisy and harmful to learning. By allowing the network to forget
irrelevant information, the LSTM maintains a meaningful and focused long-term
memory.

In essence, the forget gate ensures that the cell state contains only information
that continues to be useful for understanding the sequence.
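
The effect of the forget gate can be sketched with a few hand-picked numbers. The pre-activation values and the previous cell state below are made up for illustration. In a real LSTM the pre-activations are computed from the current input and the previous hidden state with learned weights.

```python
import torch

# Pre-activation values chosen by hand for illustration; in a real LSTM they
# would be computed from the current input and the previous hidden state.
pre_activation = torch.tensor([6.0, 0.0, -6.0])
F = torch.sigmoid(pre_activation)        # approx. [0.9975, 0.5000, 0.0025]

C_prev = torch.tensor([2.0, -1.0, 3.0])  # made-up previous cell state
kept = F * C_prev                        # elementwise: keep, halve, nearly erase

print(F)
print(kept)
```

The first entry of the cell state passes through almost unchanged, the second is halved, and the third is reduced to nearly zero.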


## The Input Gate: Deciding What New Information to Store

After deciding what information to remove from memory, the LSTM must determine
what new information is important enough to be stored. This responsibility lies
with the input gate.

The input gate controls how much new information should be written into the cell
state. It does not directly decide the content of the information, but rather how
strongly new information should influence the memory.

At the same time, the LSTM creates a set of candidate values that represent new
information derived from the current input and the previous hidden state. These
candidate values describe what could potentially be added to the memory.

The input gate then acts as a filter on this candidate information. Only the
portions that are considered important are allowed to enter the cell state,
while the rest are suppressed.

This two-step process ensures that the LSTM does not blindly store everything it
encounters. Instead, it carefully selects and integrates only relevant new
information, helping the network maintain a meaningful and efficient long-term
memory.
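
The two steps can be sketched as follows. The sizes, the helper `make_weights`, and the random weights are assumptions made for illustration. Only the structure matters: a sigmoid gate with values in (0, 1) multiplied elementwise with tanh candidate values in (-1, 1).

```python
import torch

torch.manual_seed(0)
batch_size, num_inputs, num_hiddens = 2, 3, 4

X = torch.randn(batch_size, num_inputs)        # current input
H_prev = torch.randn(batch_size, num_hiddens)  # previous hidden state

def make_weights():
    # Hypothetical small random weights and a zero bias, for illustration only
    return (torch.randn(num_inputs, num_hiddens) * 0.1,
            torch.randn(num_hiddens, num_hiddens) * 0.1,
            torch.zeros(num_hiddens))

W_xi, W_hi, b_i = make_weights()  # input gate
W_xc, W_hc, b_c = make_weights()  # candidate memory

I = torch.sigmoid(X @ W_xi + H_prev @ W_hi + b_i)     # how much to write, in (0, 1)
C_tilde = torch.tanh(X @ W_xc + H_prev @ W_hc + b_c)  # what to write, in (-1, 1)
write = I * C_tilde                                   # gated contribution to the cell state

print(write.shape)
```

Because every gate value lies between 0 and 1, each entry of `write` is never larger in magnitude than the corresponding candidate value.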


## Updating the Cell State: Combining Old and New Memory

Once the LSTM has decided what information to forget and what new information to
store, it updates the cell state. This step is the heart of the LSTM mechanism
because it determines how memory evolves over time.

The update process combines two sources of information. First, the existing cell
state is adjusted by removing the parts marked as unimportant by the forget
gate. Second, new information selected by the input gate is added to the memory.

This update happens in a smooth and controlled way rather than through abrupt
overwriting. As a result, important information can persist across many time
steps while still allowing the memory to adapt when needed.

Because the cell state follows a mostly linear path through time, this update
structure allows gradients to flow backward more effectively during training.
This is a key reason why LSTM networks are able to learn long-term dependencies
that traditional recurrent neural networks struggle with.

In essence, updating the cell state is what allows the LSTM to balance stability
and flexibility in its memory, remembering what matters while remaining open to
new information.
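
Written as equations, the update looks like this. The symbols are chosen here to match the variable names in the example code further down: `F_t` is the forget gate, `I_t` the input gate, `\tilde{C}_t` the candidate values, and `\odot` is elementwise multiplication.

```latex
\begin{aligned}
C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \\
\left.\frac{\partial C_t}{\partial C_{t-1}}\right|_{\text{direct}} &= \operatorname{diag}(F_t)
\end{aligned}
```

The second line considers only the direct path from the old cell state to the new one. It leaves out the indirect dependence of the gates on the previous hidden state, so it is a simplification. Along that direct path, the gradient across several steps is a product of forget gate values, which stays large wherever the forget gate stays close to 1.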


## The Output Gate: Controlling What the LSTM Reveals

After updating its internal memory, the LSTM must decide what information should
be exposed as output at the current time step. This decision is handled by the
output gate.

The output gate examines the current input and the previous hidden state to
determine which parts of the updated cell state are relevant for the present
moment. Not all information stored in memory needs to be visible at every step.

To produce the output, the cell state is first passed through a tanh function so
that its values lie between -1 and 1. The output gate then filters this
transformed memory, allowing only the most relevant information to pass through.

The result of this process is the new hidden state. This hidden state serves two
important purposes: it is passed to the next time step in the sequence, and it is
used to generate predictions or outputs for the current step.

In simple terms, the output gate ensures that the LSTM shares only what is
necessary, keeping the rest of the memory protected for future use.
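
A small sketch of this step, with a made-up cell state and hand-picked gate pre-activations (none of these numbers come from a trained model):

```python
import torch

C = torch.tensor([5.0, -0.2, 1.5])  # made-up updated cell state
# Hand-picked pre-activations: open the gate for the first two entries, close it for the third
O = torch.sigmoid(torch.tensor([6.0, 6.0, -6.0]))

squashed = torch.tanh(C)  # bounded in (-1, 1), even for the large entry 5.0
H = O * squashed          # new hidden state

print(H)
```

The first two entries of the memory are exposed almost fully, while the third stays close to zero in the hidden state even though the cell state still holds 1.5 at that position.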


![](https://d2l.ai/_images/lstm-0.svg)

## Example
In this example, we manually implement the internal computations of an LSTM cell
to understand how it processes sequential data, updates memory, and produces
outputs at each time step.


### Step 1: Import Libraries


In [2]:
import torch
from torch import nn


### Step 2: Initialize LSTM Parameters
This cell defines a function that creates all trainable parameters of the LSTM
cell. It includes weight matrices and bias vectors for the forget gate, input
gate, candidate memory, and output gate. It also creates the weights and bias of
an output layer that maps the hidden state to `vocab_size` values.

Each of the four blocks has:

- input-to-hidden weights,

- hidden-to-hidden weights,

- a bias term.

Weights are drawn from a normal distribution and scaled by 0.01 so that they start
small, and biases start at zero. All parameters are marked to require gradients so
that they can be updated during training. The function returns all parameters as
a list.

In [13]:
def get_lstm_params(vocab_size, num_hiddens, device):
    num_inputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    # Forget gate
    W_xf, W_hf, b_f = three()
    # Input gate
    W_xi, W_hi, b_i = three()
    # Candidate memory
    W_xc, W_hc, b_c = three()
    # Output gate
    W_xo, W_ho, b_o = three()

    # Output layer
    W_hq = normal((num_hiddens, vocab_size))
    b_q = torch.zeros(vocab_size, device=device)

    params = [W_xf, W_hf, b_f,
              W_xi, W_hi, b_i,
              W_xc, W_hc, b_c,
              W_xo, W_ho, b_o,
              W_hq, b_q]

    for param in params:
        param.requires_grad_(True)

    return params
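
As a quick sanity check on the shapes above, the total number of trainable values can be worked out by hand. This is a sketch using the sizes from the test step of this example (`vocab_size = 10`, `num_hiddens = 16`). The helper name `count_lstm_params` is made up for illustration.

```python
def count_lstm_params(vocab_size, num_hiddens):
    num_inputs = vocab_size
    # One block = input-to-hidden weights + hidden-to-hidden weights + bias
    per_block = num_inputs * num_hiddens + num_hiddens * num_hiddens + num_hiddens
    # Four blocks: forget gate, input gate, candidate memory, output gate
    recurrent = 4 * per_block
    # Output layer: W_hq and b_q
    output_layer = num_hiddens * vocab_size + vocab_size
    return recurrent + output_layer

print(count_lstm_params(10, 16))  # 4 * 432 + 170 = 1898
```

The same number should come out of `sum(p.numel() for p in params)` for a list returned by `get_lstm_params` with these sizes.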

### Step 3: Initialize Hidden State and Cell State
This cell defines a function that initializes the hidden state and cell
state of the LSTM. Both are set to zero at the start of a sequence, indicating
that the model begins with no prior memory.

The hidden state represents short-term information, while the cell state stores
long-term memory.

In [10]:
def init_lstm_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),
            torch.zeros((batch_size, num_hiddens), device=device))


### Step 4: LSTM Forward Computation
This cell contains the core logic of the LSTM. It processes the input sequence one
time step at a time and updates the hidden and cell states using LSTM gate
operations.

At each time step:

- the forget gate controls what old memory is kept,

- the input gate selects new information to add,

- the cell state is updated,

- the output gate determines what information is exposed.

The hidden state is then transformed into an output vector. Outputs from all time
steps are collected and returned along with the final states.

In [11]:
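def lstm(inputs, state, params):
    # inputs: (num_steps, batch_size, vocab_size)
    # state: tuple (H, C), each of shape (batch_size, num_hiddens)
    (H, C) = state
    (W_xf, W_hf, b_f,
     W_xi, W_hi, b_i,
     W_xc, W_hc, b_c,
     W_xo, W_ho, b_o,
     W_hq, b_q) = params

    outputs = []

    # Iterate over time steps; X has shape (batch_size, vocab_size)
    for X in inputs:
        # Forget gate: how much of the old cell state to keep
        F = torch.sigmoid(X @ W_xf + H @ W_hf + b_f)
        # Input gate: how much of the candidate memory to write
        I = torch.sigmoid(X @ W_xi + H @ W_hi + b_i)
        # Candidate memory: proposed new content, in (-1, 1)
        C_tilde = torch.tanh(X @ W_xc + H @ W_hc + b_c)
        # Cell state update: keep part of the old memory, add part of the candidate
        C = F * C + I * C_tilde
        # Output gate: how much of the squashed cell state to expose
        O = torch.sigmoid(X @ W_xo + H @ W_ho + b_o)
        # New hidden state
        H = O * torch.tanh(C)
        # Output layer: map the hidden state to vocab_size values
        Y = H @ W_hq + b_q
        outputs.append(Y)

    # Concatenate along dim 0: (num_steps * batch_size, vocab_size)
    return torch.cat(outputs, dim=0), (H, C)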
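
One way to gain confidence in these gate equations is to compare them against PyTorch's built-in `nn.LSTM`. The sketch below does not reuse the functions defined above. Instead it reads the weights out of an `nn.LSTM` module and repeats the same gate computations by hand. It relies on `nn.LSTM` storing its four blocks stacked in the order input gate, forget gate, candidate, output gate, with weights of shape `(4 * num_hiddens, num_inputs)`, as described in the PyTorch documentation. There is no output layer here, so the comparison is made on hidden states.

```python
import torch
from torch import nn

torch.manual_seed(0)
num_inputs, num_hiddens, batch_size, num_steps = 10, 16, 2, 5

ref = nn.LSTM(num_inputs, num_hiddens)  # expects (num_steps, batch_size, num_inputs)

# Split the stacked parameters into four blocks: input, forget, candidate, output
W_x = ref.weight_ih_l0.detach().chunk(4, dim=0)  # each (num_hiddens, num_inputs)
W_h = ref.weight_hh_l0.detach().chunk(4, dim=0)  # each (num_hiddens, num_hiddens)
b = (ref.bias_ih_l0 + ref.bias_hh_l0).detach().chunk(4, dim=0)

X_seq = torch.randn(num_steps, batch_size, num_inputs)
H = torch.zeros(batch_size, num_hiddens)
C = torch.zeros(batch_size, num_hiddens)

manual = []
for X in X_seq:
    # nn.LSTM stores weights transposed relative to the manual code, hence .T
    I = torch.sigmoid(X @ W_x[0].T + H @ W_h[0].T + b[0])
    F = torch.sigmoid(X @ W_x[1].T + H @ W_h[1].T + b[1])
    C_tilde = torch.tanh(X @ W_x[2].T + H @ W_h[2].T + b[2])
    O = torch.sigmoid(X @ W_x[3].T + H @ W_h[3].T + b[3])
    C = F * C + I * C_tilde
    H = O * torch.tanh(C)
    manual.append(H)
manual = torch.stack(manual)  # (num_steps, batch_size, num_hiddens)

with torch.no_grad():
    expected, (H_ref, C_ref) = ref(X_seq)  # initial states default to zeros

max_diff = (manual - expected).abs().max().item()
print(f"largest difference in hidden states: {max_diff:.2e}")
```

The difference should be at the level of floating-point rounding. A large difference would usually point to a wrong gate order or a missing transpose.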


### Step 5: Test the Implementation
This cell tests the LSTM implementation using dummy input data. It initializes the
parameters and memory states, creates a random input sequence, and passes it
through the LSTM function.

The purpose of this cell is to verify that the LSTM runs correctly and produces
outputs with the expected shape. The model is not trained here; this step only
confirms that the implementation is logically and syntactically correct.

In [12]:
device = torch.device('cpu')

vocab_size = 10
num_hiddens = 16
batch_size = 2
num_steps = 5

params = get_lstm_params(vocab_size, num_hiddens, device)
state = init_lstm_state(batch_size, num_hiddens, device)

inputs = torch.randn(num_steps, batch_size, vocab_size, device=device)

outputs, state = lstm(inputs, state, params)

outputs.shape


torch.Size([10, 10])

The output is produced at every time step from the hidden state of the LSTM.
Its shape is `(num_steps * batch_size, vocab_size)`, here `(5 * 2, 10) = (10, 10)`,
because the per-step outputs are concatenated along the first dimension. The
values appear random because the model has not been trained.
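
If the outputs are needed per time step, the flattened tensor can be reshaped. The sketch below uses random stand-in tensors in place of the real per-step outputs, so it only demonstrates the shapes and the ordering of rows.

```python
import torch

num_steps, batch_size, vocab_size = 5, 2, 10

# Stand-ins for the per-step outputs Y collected inside lstm()
per_step = [torch.randn(batch_size, vocab_size) for _ in range(num_steps)]

flat = torch.cat(per_step, dim=0)  # (num_steps * batch_size, vocab_size)
by_step = flat.reshape(num_steps, batch_size, vocab_size)

print(flat.shape)     # torch.Size([10, 10])
print(by_step.shape)  # torch.Size([5, 2, 10])
```

Rows are ordered by time step first, so `by_step[t]` holds the outputs for all sequences in the batch at step `t`.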



## Task for the reader
Train the LSTM on the same task using different sequence lengths. Analyze how
increasing or decreasing the sequence length affects learning and output quality.