<div style="background-color: black; color: white; padding: 10px;text-align: center;">
  <strong>Date Published:</strong> November 29, 2025 <strong>Author:</strong> Adnan Alaref
</div>

# 📝 Build LSTM from Scratch in NumPy (With Gates Mechanism)

Welcome to this beginner-friendly notebook where we implement a **Long Short-Term Memory (LSTM)** network entirely from scratch using **NumPy**, without relying on high-level deep learning frameworks like PyTorch or TensorFlow.

LSTMs are a type of **recurrent neural network (RNN)** designed to learn from **sequential data** (like time series, text, or speech). Unlike simple RNNs, LSTMs can capture **long-term dependencies** thanks to their **gate mechanism**, which controls the flow of information through the network.

---

## 🔹 What You'll Learn

- How to **build a custom LSTM class** from scratch.
- How to handle both **single sequences** and **batch sequences**.
- How the **input, forget, and output gates** work internally.
- How to extract and inspect **intermediate gate values**.
- How to test **our custom LSTM layer.**

---

## 🔹 Why This Notebook is Beginner-Friendly

- Everything is implemented using **NumPy**: no complex frameworks.
- Step-by-step explanation of **LSTM equations**.
- Easy-to-understand code for **forward pass** and gate computation.
- Examples with **single sequences** and **batch processing**.
- Gate values are accessible for **debugging and analysis**.

---


# **LSTM Equations.**

For timestep `t`, with input `x_t`, previous hidden state `h_{t-1}`, and previous cell state `c_{t-1}`:

## 🔹 **Forward Pass:**

$$
\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &\text{Forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &\text{Input gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) &\text{Candidate cell state} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &\text{Output gate} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &\text{Cell state update} \\
h_t &= o_t \odot \tanh(c_t) &\text{Hidden state output}
\end{align*}
$$

Where:

- $\sigma$ = sigmoid activation, $\sigma(z) = 1 / (1 + e^{-z})$
- $\tanh$ = hyperbolic tangent
- $\odot$ = element-wise multiplication
- $W_*, U_*, b_*$ = input weights, recurrent weights, and biases of each gate
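To make the equations concrete, here is a minimal sketch of a single timestep written exactly in the column-vector form used above (`W x_t + U h_{t-1} + b`). The `lstm_step` helper and the dict layout keyed by `'f'`, `'i'`, `'c'`, `'o'` are assumptions made for this illustration only; they are not part of the class built later in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep (hypothetical helper, column-vector convention)."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell state
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                          # cell state update
    h_t = o_t * np.tanh(c_t)                                    # hidden state output
    return h_t, c_t

input_dim, hidden_dim = 3, 4

# Sanity check with all-zero parameters: every gate equals sigmoid(0) = 0.5
# and the candidate equals tanh(0) = 0, so c_t = 0.5 * c_prev.
W = {k: np.zeros((hidden_dim, input_dim)) for k in 'fico'}
U = {k: np.zeros((hidden_dim, hidden_dim)) for k in 'fico'}
b = {k: np.zeros(hidden_dim) for k in 'fico'}

x_t = np.ones(input_dim)
h_prev = np.zeros(hidden_dim)
c_prev = np.ones(hidden_dim)

h_zero, c_zero = lstm_step(x_t, h_prev, c_prev, W, U, b)
print("c_t:", c_zero)  # every entry is 0.5
print("h_t:", h_zero)  # every entry is 0.5 * tanh(0.5)
```

With zero weights the forget gate keeps exactly half of the old cell state, which is an easy way to see the role of $f_t$ in the cell update.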

---

## 🔹 Backward Pass (BPTT)

Gradients for timestep `t`, where `d v` denotes the gradient of the loss with respect to `v`:

$$
\begin{align*}
d o_t &= d h_t \odot \tanh(c_t) \\
d c_t &= d h_t \odot o_t \odot (1 - \tanh^2(c_t)) + d c_{t+1} \odot f_{t+1} \\
d f_t &= d c_t \odot c_{t-1} \\
d i_t &= d c_t \odot \tilde{c}_t \\
d \tilde{c}_t &= d c_t \odot i_t
\end{align*}
$$

Pre-activation derivatives:

$$
\begin{align*}
d a_f &= d f_t \odot f_t \odot (1 - f_t) \\
d a_i &= d i_t \odot i_t \odot (1 - i_t) \\
d a_o &= d o_t \odot o_t \odot (1 - o_t) \\
d a_c &= d \tilde{c}_t \odot (1 - \tilde{c}_t^2)
\end{align*}
$$

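Gradients w.r.t. the weights, accumulated over all timesteps. These use the row-vector convention of the code in this notebook, where each pre-activation is computed as $a_* = x_t W_* + h_{t-1} U_* + b_*$, and $da_*$ is the pre-activation gradient at timestep $t$:

$$
\begin{align*}
dW_* &= \sum_t x_t^T \cdot da_* \\
dU_* &= \sum_t h_{t-1}^T \cdot da_* \\
db_* &= \sum_t da_*
\end{align*}
$$

The gradient passed back to the previous hidden state, which is added to whatever gradient $h_{t-1}$ receives directly from the loss, is:

$$
d h_{t-1} = \sum_{* \in \{f, i, c, o\}} da_* \cdot U_*^T
$$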
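The backward equations above can be checked numerically. The sketch below is a minimal example under stated assumptions: it covers a single timestep, uses the row-vector convention (`x @ W`), takes the loss to be the sum of the entries of `h_t` (so `d h_t` is all ones and there is no contribution from a later timestep), and compares the analytic gradients with central finite differences. The helper names `step_forward`, `step_backward`, and `numerical_grad` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_forward(x, h_prev, c_prev, p):
    """Single LSTM timestep; returns h_t, c_t and the gate activations."""
    f = sigmoid(x @ p['Wf'] + h_prev @ p['Uf'] + p['bf'])
    i = sigmoid(x @ p['Wi'] + h_prev @ p['Ui'] + p['bi'])
    o = sigmoid(x @ p['Wo'] + h_prev @ p['Uo'] + p['bo'])
    g = np.tanh(x @ p['Wc'] + h_prev @ p['Uc'] + p['bc'])  # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c, (f, i, o, g)

def step_backward(dh, dc_from_next, x, h_prev, c_prev, c, gates):
    """Analytic gradients for one timestep.

    dc_from_next stands for the term dc_{t+1} * f_{t+1} in the equations.
    """
    f, i, o, g = gates
    tanh_c = np.tanh(c)
    do = dh * tanh_c
    dc = dh * o * (1 - tanh_c ** 2) + dc_from_next
    df, di, dg = dc * c_prev, dc * g, dc * i
    # Pre-activation gradients
    da = {'f': df * f * (1 - f), 'i': di * i * (1 - i),
          'o': do * o * (1 - o), 'c': dg * (1 - g ** 2)}
    grads = {}
    for k, v in da.items():
        grads['W' + k] = x.T @ v
        grads['U' + k] = h_prev.T @ v
        grads['b' + k] = v.sum(axis=0)  # sum over the batch
    return grads

rng = np.random.default_rng(0)
batch, input_dim, hidden_dim = 2, 3, 4
p = {}
for k in 'fico':
    p['W' + k] = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
    p['U' + k] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
    p['b' + k] = rng.normal(scale=0.5, size=hidden_dim)
x = rng.normal(size=(batch, input_dim))
h_prev = rng.normal(size=(batch, hidden_dim))
c_prev = rng.normal(size=(batch, hidden_dim))

h, c, gates = step_forward(x, h_prev, c_prev, p)
grads = step_backward(np.ones_like(h), np.zeros_like(c), x, h_prev, c_prev, c, gates)

def numerical_grad(name, eps=1e-6):
    """Central finite differences of loss = h.sum() w.r.t. p[name]."""
    num = np.zeros_like(p[name])
    for idx in np.ndindex(*p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        plus = step_forward(x, h_prev, c_prev, p)[0].sum()
        p[name][idx] = old - eps
        minus = step_forward(x, h_prev, c_prev, p)[0].sum()
        p[name][idx] = old  # restore the parameter
        num[idx] = (plus - minus) / (2 * eps)
    return num

max_err = max(np.abs(grads[n] - numerical_grad(n)).max() for n in p)
print(f"max abs difference between analytic and numerical gradients: {max_err:.2e}")
```

A very small difference suggests the single-step formulas are implemented consistently. A full BPTT implementation would additionally carry `d c_{t+1} ⊙ f_{t+1}` and the recurrent hidden-state gradient backward through the sequence.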

---

## 🔹 LSTM Flow Diagram

```
 x_t ──► [W* + U* + b*] ──► Gates (f_t, i_t, o_t, c̃_t)
                             │
                             ▼
                   Cell state c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
                             │
                             ▼
                    Hidden state h_t = o_t ⊙ tanh(c_t)

```
---

Next, we will implement the **LSTM class in NumPy**, run **forward passes**, and inspect **gate values** step by step to see how the network processes sequential data.


# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 1: Import Libraries.</div>

```python
import numpy as np

import warnings
# Silence FutureWarning noise so the notebook output stays readable
warnings.filterwarnings(action='ignore', category=FutureWarning)
```

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 2: Create a Custom LSTM Layer.</div>

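Parameters used by the class (shapes follow the row-vector convention `x_t @ W` used in the code):

- `initial_states`: tuple of initial states `(h_0, c_0)`, each of shape `(hidden_dim,)` for a single sequence or `(batch_size, hidden_dim)` for a batch
- `Wf, Wi, Wc, Wo`: input weight matrices for the forget, input, candidate, and output gates, respectively, each of shape `(input_dim, hidden_dim)`
- `Uf, Ui, Uc, Uo`: recurrent weight matrices for the forget, input, candidate, and output gates, respectively, each of shape `(hidden_dim, hidden_dim)`
- `bf, bi, bc, bo`: bias vectors for the forget, input, candidate, and output gates, respectively, each of shape `(hidden_dim,)`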
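As a quick illustration of how these shapes fit together, the following sketch computes the forget-gate pre-activation for a small batch. The dimensions and random values are arbitrary choices for this example; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_dim, hidden_dim = 3, 5, 8  # arbitrary example sizes

x_t = rng.normal(size=(batch_size, input_dim))   # one timestep of a batch
h_prev = np.zeros((batch_size, hidden_dim))      # previous hidden state
Wf = rng.normal(size=(input_dim, hidden_dim))    # input weights
Uf = rng.normal(size=(hidden_dim, hidden_dim))   # recurrent weights
bf = np.zeros(hidden_dim)                        # bias, broadcast over the batch

# (3, 5) @ (5, 8) + (3, 8) @ (8, 8) + (8,) -> (3, 8)
a_f = x_t @ Wf + h_prev @ Uf + bf
f_t = 1.0 / (1.0 + np.exp(-a_f))  # sigmoid squashes each entry into (0, 1)

print(a_f.shape)  # (3, 8)
```

Because the bias has shape `(hidden_dim,)`, NumPy broadcasting adds the same bias vector to every row of the batch.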


```python
import numpy as np

def sigmoid(x):
  # Clip the input so np.exp cannot overflow for large |x|
  return 1 / (1 + np.exp(-np.clip(x, -50, 50)))

def tanh(x):
  return np.tanh(np.clip(x, -50, 50))

class LSTM:
  def __init__(self, hidden_dim, input_dim) -> None:
    self.hidden_dim = hidden_dim
    self.input_dim = input_dim
    self._init_params()

  def _init_params(self):
    """Uniform initialization scaled by layer size for stability"""
    limit_U = 1 / np.sqrt(self.hidden_dim)  # Recurrent weights
    limit_W = np.sqrt(6 / (self.input_dim + self.hidden_dim))  # Input weights (Xavier/Glorot uniform)

    # Input → Hidden weights, each of shape (input_dim, hidden_dim)
    self.Wf, self.Wi, self.Wc, self.Wo = [
      np.random.uniform(-limit_W, limit_W, (self.input_dim, self.hidden_dim))
      for _ in range(4)
    ]

    # Hidden → Hidden weights, each of shape (hidden_dim, hidden_dim)
    self.Uf, self.Ui, self.Uc, self.Uo = [
      np.random.uniform(-limit_U, limit_U, (self.hidden_dim, self.hidden_dim))
      for _ in range(4)
    ]

    # Biases
    self.bf, self.bi, self.bc, self.bo = [
      np.zeros(self.hidden_dim) for _ in range(4)
    ]

  def _transform(self, Wx, x_t, Uh, h_t, b):
    # Pre-activation: x_t W + h_{t-1} U + b (row-vector convention)
    return np.dot(x_t, Wx) + np.dot(h_t, Uh) + b

  def _gate(self, Wx, x_t, Uh, h_t, b):
    t = self._transform(Wx, x_t, Uh, h_t, b)
    return sigmoid(t)

  def forward(self, inputs, initial_states=None, return_gates=False):
    # Handle batch vs single sequence
    single_sequence = inputs.ndim == 2
    if single_sequence:
      # Single sequence: (seq_len, input_dim)
      inputs = inputs[:, np.newaxis, :]  # Add batch dimension
    elif inputs.ndim != 3:
      raise ValueError(
        "inputs must have shape (seq_len, input_dim) or (seq_len, batch_size, input_dim)"
      )

    # Batch: (seq_len, batch_size, input_dim)
    seq_length, batch_size = inputs.shape[:2]

    if initial_states is None:
      h_t = np.zeros((batch_size, self.hidden_dim))
      c_t = np.zeros((batch_size, self.hidden_dim))
    else:
      h_t, c_t = initial_states
      # Ensure states have batch dimension
      if h_t.ndim == 1:
        h_t = h_t[np.newaxis, :]
        c_t = c_t[np.newaxis, :]

    outputs = []
    if return_gates:
      gates = {
        'f_t': [], 'i_t': [], 'o_t': [], 'fused_state': []
      }

    for t in range(seq_length):
      x_t = inputs[t]  # (batch_size, input_dim)

      # Gates
      f_t = self._gate(self.Wf, x_t, self.Uf, h_t, self.bf)
      i_t = self._gate(self.Wi, x_t, self.Ui, h_t, self.bi)
      o_t = self._gate(self.Wo, x_t, self.Uo, h_t, self.bo)

      # Candidate state
      fused_state = tanh(self._transform(self.Wc, x_t, self.Uc, h_t, self.bc))

      # Cell update
      c_t = f_t * c_t + i_t * fused_state

      # Hidden update
      h_t = o_t * tanh(c_t)

      if return_gates:
        gates['f_t'].append(f_t.copy())
        gates['i_t'].append(i_t.copy())
        gates['o_t'].append(o_t.copy())
        gates['fused_state'].append(fused_state.copy())
      outputs.append(h_t.copy())

    outputs = np.array(outputs)  # (seq_len, batch_size, hidden_dim)
    # Remove the batch dimension only if the caller passed a single 2-D sequence;
    # a 3-D input with batch_size == 1 keeps its batch axis
    if single_sequence:
      outputs = outputs[:, 0, :]
      h_t = h_t[0]
      c_t = c_t[0]

    if return_gates:
      for k, v in gates.items():
        gates[k] = np.array(v)  # convert list → array
        if single_sequence:
          gates[k] = gates[k][:, 0, :]  # squeeze batch dim
      return outputs, (h_t, c_t), gates
    else:
      return outputs, (h_t, c_t)

  def get_parameters(self):
    """Return all parameters for inspection or saving"""
    return {
      'Wf': self.Wf, 'Wi': self.Wi, 'Wc': self.Wc, 'Wo': self.Wo,
      'Uf': self.Uf, 'Ui': self.Ui, 'Uc': self.Uc, 'Uo': self.Uo,
      'bf': self.bf, 'bi': self.bi, 'bc': self.bc, 'bo': self.bo
    }
```
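Because `forward` accepts `initial_states` and returns the final `(h_t, c_t)`, a long sequence can be processed in chunks by feeding the final states of one chunk into the next. The sketch below illustrates the idea with a compact functional stand-in, `run_lstm`, which is a hypothetical helper written only to keep this example self-contained; it mirrors the forward equations but is not the class above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_lstm(inputs, params, h, c):
    """Run an LSTM over inputs of shape (seq_len, input_dim), starting from (h, c)."""
    outputs = []
    for x_t in inputs:
        # Pre-activations for the forget, input, candidate, and output paths
        a = {k: x_t @ params['W' + k] + h @ params['U' + k] + params['b' + k]
             for k in 'fico'}
        f, i, o = sigmoid(a['f']), sigmoid(a['i']), sigmoid(a['o'])
        c = f * c + i * np.tanh(a['c'])
        h = o * np.tanh(c)
        outputs.append(h)
    return np.array(outputs), (h, c)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 5, 8, 10
params = {}
for k in 'fico':
    params['W' + k] = rng.normal(scale=0.3, size=(input_dim, hidden_dim))
    params['U' + k] = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))
    params['b' + k] = np.zeros(hidden_dim)

inputs = rng.normal(size=(seq_len, input_dim))
h0 = np.zeros(hidden_dim)
c0 = np.zeros(hidden_dim)

# Whole sequence in one call
full_out, (h_full, c_full) = run_lstm(inputs, params, h0, c0)

# Same sequence in two chunks, carrying the states across the boundary
out_a, (h_a, c_a) = run_lstm(inputs[:4], params, h0, c0)
out_b, (h_b, c_b) = run_lstm(inputs[4:], params, h_a, c_a)
chunked_out = np.concatenate([out_a, out_b], axis=0)

print("outputs match:", np.allclose(full_out, chunked_out))
```

The same pattern should apply to the class above by passing the returned `final_states` tuple as `initial_states` in the next call.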

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 3: Test the code.</div>

```python
# Example usage and testing
np.random.seed(42)

# Test 1: Single sequence
print("=== Single Sequence Test ===")
lstm = LSTM(hidden_dim=8, input_dim=5)
inputs = np.random.randn(10, 5)  # (seq_len, input_dim)

outputs, final_states = lstm.forward(inputs)
print(f"Input shape: {inputs.shape}")
print(f"Output shape: {outputs.shape}")
print(f"Final hidden state shape: {final_states[0].shape}")
print("\nFinal Hidden State:", final_states[0])
print("\nFinal Cell State:", final_states[1])

# Test 2: Batch processing
print("\n=== Batch Processing Test ===")
batch_inputs = np.random.randn(10, 3, 5)  # (seq_len, batch_size, input_dim)
batch_outputs, batch_final = lstm.forward(batch_inputs)
print(f"Batch input shape: {batch_inputs.shape}")
print(f"Batch output shape: {batch_outputs.shape}")
print("\nBatch Final Hidden State:", batch_final[0])
print("\nBatch Final Cell State:", batch_final[1])

# Test 3: With gate analysis
print("\n=== Gate Analysis Test ===")
outputs, final_states, gates = lstm.forward(inputs, return_gates=True)
print("Gate shapes:")
for gate_name, gate_values in gates.items():
  print(f"  {gate_name}: {gate_values.shape}")

# Test 4: Layer parameters (kept in a separate name so the gate values from Test 3 are not overwritten)
print("\n=== Gates Parameters ===")
params = lstm.get_parameters()
print(f"\nInput-gate weights (Wi):\n{params['Wi']}")
```

```
=== Single Sequence Test ===
Input shape: (10, 5)
Output shape: (10, 8)
Final hidden state shape: (8,)

Final Hidden State: [ 0.04481557 -0.04112659 -0.12080551  0.20438796 -0.13502488 -0.15720278
 -0.09887659 -0.04118645]

Final Cell State: [ 0.20153002 -0.17672744 -0.16868528  0.34533251 -0.28118538 -0.37809598
 -0.18890819 -0.12817226]

=== Batch Processing Test ===
Batch input shape: (10, 3, 5)
Batch output shape: (10, 3, 8)

Batch Final Hidden State: [[-0.22349013 -0.04674666 -0.00363955 -0.24551477  0.07669256  0.19765127
  -0.11742645  0.08119563]
 [-0.14642205  0.21918959  0.00480341 -0.02662936  0.55483706  0.19328816
   0.12706896  0.19456277]
 [-0.23451168  0.31953675  0.20554991 -0.18352124 -0.08959388  0.08293861
  -0.01321995  0.23430946]]

Batch Final Cell State: [[-0.47784877 -0.1111659  -0.01011088 -0.37148196  0.13802839  0.46185492
  -0.22905192  0.25104552]
 [-0.39467202  0.35677115  0.01108424 -0.04925037  0.99174794  0.36442915
   0.27596584  0.46523864]
 [-0.3040
```

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Thanks & Upvote ❤️</div>