<a href="https://colab.research.google.com/github/Ivyson/Neural-Network-XOR/blob/main/notebooks/CNN%2BLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CNN + LSTM

We now combine two architectures we've already designed: Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN). LSTM handles **temporal** data, while CNN handles **spatial or positional** data. With this in mind, we can build a new architecture that processes data that is both spatial and temporal, such as video.

A video is basically a sequence of images, called **frames**. If we feed a single frame to a CNN, it will classify that frame on its own. But for video, we don't want to just look at one frame; we want to consider multiple frames together. For example, recognizing if a person is dancing or making a specific hand gesture is hard from one frame alone. We need to see how their stance changes over time.
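
To make this concrete, here is a minimal sketch, assuming a toy grayscale clip stored as a NumPy array of shape `(n_frames, height, width)`. The helper names `make_clip` and `dot_column` are made up for this illustration. A clip and its time-reversed copy contain exactly the same frames, so a summary that ignores frame order cannot tell them apart, while the frame-to-frame motion can.

```python
import numpy as np

def make_clip(n_frames: int = 5, size: int = 8) -> np.ndarray:
    """Toy clip: one bright pixel moving left to right along the middle row."""
    clip = np.zeros((n_frames, size, size))
    for t in range(n_frames):
        clip[t, size // 2, t] = 1.0
    return clip

def dot_column(frame: np.ndarray) -> int:
    """Column index of the brightest pixel in a single frame."""
    return int(np.unravel_index(np.argmax(frame), frame.shape)[1])

clip = make_clip()
reversed_clip = clip[::-1]

# Order-free summary: average over time. Identical for both clips.
same_average = np.allclose(clip.mean(axis=0), reversed_clip.mean(axis=0))

# Order-aware summary: how the dot's column changes between consecutive frames.
motion = np.diff([dot_column(f) for f in clip])
reversed_motion = np.diff([dot_column(f) for f in reversed_clip])

print(same_average)      # True
print(motion)            # [1 1 1 1]
print(reversed_motion)   # [-1 -1 -1 -1]
```

The two clips only differ in the sign of the motion, which is exactly the kind of information a single frame cannot carry.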

Here, instead of using one frame, we take **n frames** and process them together. Each frame goes through a CNN to produce a **feature vector**, which captures spatial information like edges, textures, and objects. Then, the sequence of feature vectors is fed into an LSTM, which looks at how things change over time.

The **order matters**: CNN first, then LSTM. An LSTM expects one flat vector per time step, so if we started with the LSTM, the 2D spatial layout of each frame would be discarded before the CNN could exploit it, and the model wouldn't work well.

So the pipeline looks like this:

**Frames $\rightarrow$ CNN $\rightarrow$ Feature vectors $\rightarrow$ LSTM $\rightarrow$ Output**
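
The shape bookkeeping behind this pipeline can be sketched as follows. This is a minimal illustration, not the implementation used in this notebook: `toy_cnn` is a hypothetical stand-in (2x2 average pooling followed by flattening) for a real CNN, and `frames_to_features` is an assumed helper name. The idea is to fold the time axis into the batch axis so the CNN sees ordinary image batches, then unfold it again so the LSTM receives one feature vector per time step.

```python
import numpy as np

def toy_cnn(images: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a CNN: 2x2 average pooling, then flatten.

    images: (batch, height, width, channels) -> (batch, features)
    Assumes height and width are even.
    """
    b, h, w, c = images.shape
    pooled = images.reshape(b, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))
    return pooled.reshape(b, -1)

def frames_to_features(video: np.ndarray) -> np.ndarray:
    """Apply the same CNN to every frame of every clip.

    video: (batch, n_frames, height, width, channels)
    returns: (batch, n_frames, features), ready for an LSTM.
    """
    batch, n_frames, h, w, c = video.shape
    frames = video.reshape(batch * n_frames, h, w, c)   # fold time into batch
    features = toy_cnn(frames)                          # (batch * n_frames, F)
    return features.reshape(batch, n_frames, -1)        # unfold time again

video = np.random.rand(2, 6, 8, 8, 3)   # 2 clips, 6 frames of 8x8 RGB
sequence = frames_to_features(video)
print(sequence.shape)                   # (2, 6, 48)
```

Because the same CNN weights are applied to every frame, the number of CNN parameters does not grow with the number of frames.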

In this notebook, we won't go deep into math since that's already covered in previous notebooks. The focus is just on combining the two architectures, and any extra workarounds will be explained as needed.

The architecture looks as follows:
![CNN-LSTM architecture](https://www.researchgate.net/publication/364039225/figure/fig4/AS:11431281414868010@1746027141215/CNN-LSTM-architecture.tif)


Coding time. Remember the philosophy: build it from scratch.

Before we start, recall the LSTM equations. For each time step $t$:


$$
\begin{aligned}
f_t &= \sigma \left(W_f \cdot \left[h_{t-1}, x_t \right] + b_f \right) \quad \text{(forget gate)} \\
i_t &= \sigma \left(W_i \cdot [h_{t-1}, x_t] + b_i \right) \quad \text{(input gate)} \\
\tilde{c_t} &= \tanh \left(W_c \cdot \left[h_{t-1}, x_t \right] + b_c \right) \quad \text{(candidate cell)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c_t} \quad \text{(cell state)} \\
o_t &= \sigma \left(W_o \cdot [h_{t-1}, x_t] + b_o \right) \quad \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) \quad \text{(hidden state)}
\end{aligned}
$$
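
As a quick sanity check of these equations, here is a minimal single-step sketch in NumPy. The function name `lstm_step` and the dictionary layout of the parameters are assumptions made for this illustration. The weights act on the concatenation $[h_{t-1}, x_t]$ as written above, stored as `(hidden + input, hidden)` matrices so that a batch of row vectors can be multiplied from the left.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    x_t: (batch, input_size); h_prev, c_prev: (batch, hidden_size)
    W[k]: (hidden_size + input_size, hidden_size), b[k]: (hidden_size,)
    for k in 'f', 'i', 'c', 'o'.
    """
    concat = np.concatenate([h_prev, x_t], axis=1)   # [h_{t-1}, x_t]
    f_t = sigmoid(concat @ W['f'] + b['f'])          # forget gate
    i_t = sigmoid(concat @ W['i'] + b['i'])          # input gate
    c_tilde = np.tanh(concat @ W['c'] + b['c'])      # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde               # cell state
    o_t = sigmoid(concat @ W['o'] + b['o'])          # output gate
    h_t = o_t * np.tanh(c_t)                         # hidden state
    return h_t, c_t

batch, input_size, hidden_size = 2, 3, 4
# With all-zero parameters every gate equals sigmoid(0) = 0.5 and the
# candidate equals tanh(0) = 0, so c_t = 0.5 * c_prev and h_t = 0.5 * tanh(c_t).
W = {k: np.zeros((hidden_size + input_size, hidden_size)) for k in 'fico'}
b = {k: np.zeros(hidden_size) for k in 'fico'}

x_t = np.random.randn(batch, input_size)
h_prev = np.zeros((batch, hidden_size))
c_prev = np.ones((batch, hidden_size))

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
print(c_t[0, 0])   # 0.5
```

The all-zero parameters are chosen only because they make the result easy to verify by hand; a real layer starts from small random weights.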


In [None]:
import numpy as np
from typing import Tuple, Optional, List, Dict, Any, Callable, Union
import pickle
import scipy.signal  # convolve2d/correlate2d live in the scipy.signal submodule, so import it explicitly
# Note: scipy's 2D convolution is computationally expensive for video processing; consider im2col or FFT-based convolution instead.

ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.999
ADAM_EPSILON = 1e-8
LEAKY_RELU_ALPHA = 0.01

def mean_squared_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Calculates Mean Squared Error loss."""
    return np.mean(np.sum((y_true - y_pred) ** 2, axis=1))

def mean_squared_error_gradient(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Calculates the gradient of Mean Squared Error loss."""
    return 2 * (y_pred - y_true) / y_true.shape[0]

def categorical_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Calculates Categorical Cross-Entropy loss."""
    # Clip predictions to avoid log(0)
    epsilon = 1e-9
    y_pred_clipped = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred_clipped), axis=1))

def categorical_cross_entropy_gradient(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """
    Calculates the gradient of CCE loss w.r.t y_pred.
    When combined with Softmax, the gradient w.r.t the Softmax *input* simplifies.
    """
    return (y_pred - y_true) / y_true.shape[0]



LOSS_FUNCTIONS: Dict[str, Callable] = {
    'mse': mean_squared_error,
    'categorical_crossentropy': categorical_cross_entropy,
}

LOSS_GRADIENTS: Dict[str, Callable] = {
    'mse': mean_squared_error_gradient,
    'categorical_crossentropy': categorical_cross_entropy_gradient,
}



def _sigmoid(x: np.ndarray) -> np.ndarray:
    """Numerically stable sigmoid (the input is clipped to avoid overflow in exp)."""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def _sigmoid_derivative(output: np.ndarray) -> np.ndarray:
    """Derivative of sigmoid (using its output)."""
    return output * (1 - output)

def _relu(x: np.ndarray) -> np.ndarray:
    """Rectified Linear Unit activation."""
    return np.maximum(0, x)

def _relu_derivative(output: np.ndarray) -> np.ndarray:
    """Derivative of ReLU."""
    return np.where(output > 0, 1, 0) # The derivative at 0 is undefined; 0 is used there by convention

def _leaky_relu(x: np.ndarray) -> np.ndarray:
    """Leaky Rectified Linear Unit activation."""
    return np.where(x > 0, x, x * LEAKY_RELU_ALPHA)

def _leaky_relu_derivative(output: np.ndarray) -> np.ndarray:
    """Derivative of Leaky ReLU."""
    return np.where(output > 0, 1, LEAKY_RELU_ALPHA)

def _softmax(x: np.ndarray) -> np.ndarray:
    """Softmax activation function."""
    Soft_epsilon = 1e-9
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / (np.sum(exp_x, axis=-1, keepdims=True) + Soft_epsilon) # Add epsilon to stabilise the division..

def _softmax_derivative_cross_entropy(output: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """
    Computes the gradient of the cross-entropy loss with respect to the
    inputs of the softmax function (often denoted dL/dz).
    This combined gradient is simply (output - y_true).
    Note: This function isn't the derivative of softmax itself, but
          the combined gradient needed for backprop when using softmax + cross-entropy.
    In this notebook, `categorical_cross_entropy_gradient` already returns
    (y_pred - y_true) / batch_size, so the backward pass of the Softmax
    Activation layer simply passes that gradient through unchanged.
    """
    return output - y_true

def _linear(x: np.ndarray) -> np.ndarray:
    """
    Linear activation (identity): returns the input unchanged.
    """
    return x

def _linear_derivative(output: np.ndarray) -> np.ndarray:
    """
    Derivative of linear activation.
    (d/dx)x = 1
    """
    return np.ones_like(output)

ACTIVATION_FUNCTIONS: Dict[str, Callable[[np.ndarray], np.ndarray]] = {
    'sigmoid': _sigmoid,
    'relu': _relu,
    'leaky_relu': _leaky_relu,
    'softmax': _softmax,
    'linear': _linear,
}

ACTIVATION_DERIVATIVES: Dict[str, Callable[[np.ndarray], np.ndarray]] = {
    'sigmoid': _sigmoid_derivative,
    'relu': _relu_derivative,
    'leaky_relu': _leaky_relu_derivative,
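    'softmax': lambda output: output * (1-output),  # diagonal of the softmax Jacobian only; Activation.backward does not use it and passes the gradient through for softmax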
    'linear': _linear_derivative,
}

class Layer:
    """Base class for all network layers."""
    def __init__(self):
        self.input: Optional[np.ndarray] = None
        self.output: Optional[np.ndarray] = None
        self._has_weights = False # Flag to indicate if layer has trainable weights

    def forward(self, input_data: np.ndarray) -> np.ndarray:
        """
        Perform the forward pass.
        Each layer type (Conv2D, Dense, Flatten, ...) overrides this with its own forward propagation; the base class provides no implementation.
        """
        raise NotImplementedError

    def backward(self, output_gradient: np.ndarray, learning_rate: float, **kwargs) -> np.ndarray:
        """
        Perform the backward pass.
        kwargs may carry optimizer state such as the Adam timestep `t`.
        Currently only `t` is read; any other keys are ignored.
        """
        raise NotImplementedError

    def has_weights(self) -> bool:
        """
        Check if the layer has trainable weights.
        Returns True if the weights are trainable, else false.
        """
        return self._has_weights


class Activation(Layer):
    """Applies an activation function element-wise."""
    def __init__(self, activation_name: str):
        """
        Initialise activation layer.

        :param activation_name: Name of the activation function
                                ('sigmoid', 'relu', 'leaky_relu', 'softmax', 'linear').
        """
        super().__init__()
        if activation_name not in ACTIVATION_FUNCTIONS:
            raise ValueError(f"Unknown activation function: '{activation_name}'")
        self.activation_name = activation_name
        self.activation_func = ACTIVATION_FUNCTIONS[activation_name]
        self.activation_derivative = ACTIVATION_DERIVATIVES.get(activation_name)
        if self.activation_derivative is None and self.activation_name != 'softmax':
             raise ValueError(f"Derivative for '{activation_name}' not found.")


    def forward(self, input_data: np.ndarray) -> np.ndarray:
        """Perform the forward pass applying the activation function."""
        self.input = input_data
        self.output = self.activation_func(input_data)
        return self.output

    def backward(self, output_gradient: np.ndarray, learning_rate: Optional[float] = None, **kwargs) -> np.ndarray:
        """
        Perform the backward pass through the activation function.
        :param output_gradient: Gradient from the next layer.
        :param learning_rate: Not used for activation layer.
        :param kwargs: May include y_true for Softmax+CCE simplification.
        :return: Gradient with respect to the input of this layer.
        """
        if self.activation_name == 'softmax':
            return output_gradient
        elif self.activation_derivative:
            # Apply chain rule: dL/dx = dL/dy * dy/dx, where y = activation_func(x).
            # dy/dx is the activation derivative, evaluated here from the output y.
            return output_gradient * self.activation_derivative(self.output)
        else:
            raise RuntimeError(f"Cannot perform backward pass for {self.activation_name} without derivative.")



class Conv2D(Layer):
    """2D Convolutional Layer."""
    def __init__(self, input_shape: Tuple[int, int, int], kernel_size: Tuple[int, int], depth: int, padding_mode: str = 'valid', stride: int = 1):
        """
        Initialize convolutional layer.
        The stride and padding arguments were added recently and are not tested yet.
        Only padding_mode='valid' with stride=1 matches the output shape computed below.

        :param input_shape: Shape of the input volume (height, width, channels).
        :param kernel_size: Size of the convolution kernel (height, width).
        :param depth: Number of kernels/filters (output depth).
        :param padding_mode: Mode passed to scipy.signal.convolve2d in the forward pass.
        :param stride: Step size of the convolution (stored, but not applied yet).
        """
        super().__init__()
        self._has_weights = True

        # Validate before unpacking so that a bad argument gives a clear error message
        if not (isinstance(kernel_size, tuple) and len(kernel_size) == 2):
            raise ValueError("kernel_size must be a tuple of two integers (height, width).")
        if not (isinstance(input_shape, tuple) and len(input_shape) == 3):
            raise ValueError("input_shape must be a tuple of three integers (height, width, channels).")

        self.input_height, self.input_width, self.input_channels = input_shape
        self.kernel_height, self.kernel_width = kernel_size
        self.depth = depth # Number of output filters
        self.padding_mode = padding_mode # Read by forward()
        self.stride = stride

        # Xavier initialization
        self.kernels_shape = (self.kernel_height, self.kernel_width, self.input_channels, self.depth)
        limit = np.sqrt(6 / (np.prod(kernel_size) * self.input_channels + np.prod(kernel_size) * self.depth))
        self.kernels = np.random.uniform(-limit, limit, self.kernels_shape)
        self.biases = np.zeros(self.depth) # One bias per output filter

        # Adam optimizer state (first and second moment estimates)
        self.m_kernels = np.zeros_like(self.kernels)
        self.v_kernels = np.zeros_like(self.kernels)
        self.m_biases = np.zeros_like(self.biases)
        self.v_biases = np.zeros_like(self.biases)

        self.output_height = self.input_height - self.kernel_height + 1
        self.output_width = self.input_width - self.kernel_width + 1
        if self.output_height <= 0 or self.output_width <= 0:
            raise ValueError(f"Kernel size {kernel_size} is too large for input shape {input_shape[:2]}.")
        self.output_shape = (self.output_height, self.output_width, self.depth)

    def forward(self, input_data: np.ndarray) -> np.ndarray:
        """
        Perform the forward pass using convolution.

        Note: This implementation uses scipy.signal.convolve2d, which is
              computationally expensive for large inputs/kernels compared to
              optimized approaches such as im2col.

        :param input_data: Input data of shape (batch_size, height, width, channels).
        :return: Output feature map of shape (batch_size, new_height, new_width, depth).
        """
        self.input = input_data
        batch_size = input_data.shape[0]

        # Initialize output array
        self.output = np.zeros((batch_size, *self.output_shape))

        for i in range(batch_size):
            for d in range(self.depth):
                output_feature_map = np.zeros((self.output_height, self.output_width))
                for c in range(self.input_channels):
                    # Kernel shape: (kH, kW, InChannels, OutDepth)
                    kernel_slice = self.kernels[:, :, c, d]
                    input_slice = self.input[i, :, :, c]
                    output_feature_map += scipy.signal.convolve2d(
                        input_slice, kernel_slice, mode=self.padding_mode
                    )
                # Add bias
                self.output[i, :, :, d] = output_feature_map + self.biases[d]

        return self.output

    def _adam_update(self, param: np.ndarray, grad: np.ndarray, m: np.ndarray, v: np.ndarray,
                     learning_rate: float, t: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Helper function to perform Adam update."""
        m = ADAM_BETA1 * m + (1 - ADAM_BETA1) * grad
        v = ADAM_BETA2 * v + (1 - ADAM_BETA2) * (grad ** 2)

        m_hat = m / (1 - ADAM_BETA1 ** t)
        v_hat = v / (1 - ADAM_BETA2 ** t)

        # Update
        param -= learning_rate * m_hat / (np.sqrt(v_hat) + ADAM_EPSILON)
        return param, m, v

    def backward(self, output_gradient: np.ndarray, learning_rate: float, **kwargs) -> np.ndarray:
        """
        Perform the backward pass to compute gradients and update weights using Adam.

        Note: This implementation uses scipy.signal correlate2d/convolve2d,
              which can be slow.

        :param output_gradient: Gradient from the next layer, shape (batch_size, out_h, out_w, depth).
        :param learning_rate: Learning rate for the optimizer.
        :param kwargs: Expected to contain 't' (Adam timestep).
        :return: Gradient with respect to the input of this layer.
        """
        if 't' not in kwargs:
            raise ValueError("Adam timestep 't' is required for backward pass.")
        t = kwargs['t']

        batch_size = output_gradient.shape[0]
        kernels_gradient = np.zeros_like(self.kernels)
        biases_gradient = np.zeros_like(self.biases)
        input_gradient = np.zeros_like(self.input)

        for i in range(batch_size):
            for d in range(self.depth):
                biases_gradient[d] += np.sum(output_gradient[i, :, :, d])

                for c in range(self.input_channels):
                    input_slice = self.input[i, :, :, c]
                    output_grad_slice = output_gradient[i, :, :, d]
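                    # The forward pass is a true convolution (flipped kernel), so the kernel
                    # gradient is the 180-degree rotation of the valid cross-correlation.
                    kernels_gradient[:, :, c, d] += np.rot90(scipy.signal.correlate2d(
                        input_slice, output_grad_slice, mode='valid'
                    ), 2)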

                    kernel_slice = self.kernels[:, :, c, d]
                    rotated_kernel = np.rot90(kernel_slice, 2) # Rotate 180 degrees
                    input_gradient[i, :, :, c] += scipy.signal.convolve2d(
                        output_grad_slice, rotated_kernel, mode='full'
                    )

        # Update kernels and biases using Adam
        self.kernels, self.m_kernels, self.v_kernels = self._adam_update(
            self.kernels, kernels_gradient, self.m_kernels, self.v_kernels, learning_rate, t
        )
        self.biases, self.m_biases, self.v_biases = self._adam_update(
            self.biases, biases_gradient, self.m_biases, self.v_biases, learning_rate, t
        )

        return input_gradient


class MaxPool2D(Layer):
    """2D Max Pooling Layer."""
    def __init__(self, pool_size: Tuple[int, int] = (2, 2), stride: Optional[Tuple[int, int]] = None):
        """
        Initialize max pooling layer.

        :param pool_size: Size of the pooling window (height, width).
        :param stride: Step size for pooling. If None, defaults to pool_size.
        """
        super().__init__()
        self.pool_height, self.pool_width = pool_size
        self.stride_h, self.stride_w = stride if stride is not None else pool_size
        self.max_indices: Optional[np.ndarray] = None

    def forward(self, input_data: np.ndarray) -> np.ndarray:
        """
        Perform the forward pass using max pooling.

        :param input_data: Input data of shape (batch_size, height, width, channels).
        :return: Output after max pooling.
        """
        self.input = input_data
        batch_size, h_in, w_in, channels = input_data.shape

        h_out = (h_in - self.pool_height) // self.stride_h + 1
        w_out = (w_in - self.pool_width) // self.stride_w + 1
        if h_out <= 0 or w_out <= 0:
            raise ValueError(f"Pool size {self.pool_height, self.pool_width} with stride {self.stride_h, self.stride_w} "
                             f"is too large for input shape {h_in, w_in}.")

        output = np.zeros((batch_size, h_out, w_out, channels))

        self.max_indices = np.zeros((batch_size, h_out, w_out, channels, 2), dtype=int)

        for b in range(batch_size):
            for c in range(channels):
                for i in range(h_out):
                    for j in range(w_out):
                        h_start = i * self.stride_h
                        h_end = h_start + self.pool_height
                        w_start = j * self.stride_w
                        w_end = w_start + self.pool_width

                        pool_region = input_data[b, h_start:h_end, w_start:w_end, c]

                        max_val = np.max(pool_region)
                        max_pos_relative = np.unravel_index(np.argmax(pool_region), pool_region.shape)

                        output[b, i, j, c] = max_val
                        self.max_indices[b, i, j, c] = max_pos_relative

        self.output = output
        return output

    def backward(self, output_gradient: np.ndarray, learning_rate: Optional[float] = None, **kwargs) -> np.ndarray:
        """
        Perform the backward pass for max pooling.

        Distributes the gradient only to the locations where the max value was originally found.

        :param output_gradient: Gradient from the next layer.
        :param learning_rate: Not used for pooling layer.
        :return: Gradient with respect to the input of this layer.
        """
        if self.input is None or self.max_indices is None:
            raise RuntimeError("Forward pass must be called before backward pass.")

        batch_size, h_out, w_out, channels = output_gradient.shape
        input_gradient = np.zeros_like(self.input)

        for b in range(batch_size):
            for c in range(channels):
                for i in range(h_out):
                    for j in range(w_out):
                        # Window coordinates
                        h_start = i * self.stride_h
                        w_start = j * self.stride_w

                        h_max_rel, w_max_rel = self.max_indices[b, i, j, c]

                        h_abs = h_start + h_max_rel
                        w_abs = w_start + w_max_rel

                        input_gradient[b, h_abs, w_abs, c] += output_gradient[b, i, j, c]

        return input_gradient




class Flatten(Layer):
    """Flattens the input volume into a vector."""
    def __init__(self):
        super().__init__()
        self.original_shape: Optional[Tuple[int, ...]] = None

    def forward(self, input_data: np.ndarray) -> np.ndarray:
        """
        Perform the forward pass, flattening the input.

        :param input_data: Input data of shape (batch_size, height, width, channels) or similar.
        :return: Flattened data of shape (batch_size, height * width * channels).
        """
        self.input = input_data
        self.original_shape = input_data.shape
        batch_size = input_data.shape[0]

        flattened_dim = np.prod(input_data.shape[1:])

        self.output = input_data.reshape(batch_size, flattened_dim)
        return self.output

    def backward(self, output_gradient: np.ndarray, learning_rate: Optional[float] = None, **kwargs) -> np.ndarray:
        """
        Perform the backward pass, reshaping the gradient back to the original input shape.

        :param output_gradient: Gradient from the next layer (flattened).
        :param learning_rate: Not used for flatten layer.
        :return: Gradient with respect to the input (reshaped).
        """
        if self.original_shape is None:
             raise RuntimeError("Forward pass must be called before backward pass.")

        return output_gradient.reshape(self.original_shape)



class Dense(Layer):
    """Dense (fully connected) layer."""
    def __init__(self, input_size: int, output_size: int):
        """
        Initialize dense layer. Activation should be applied by a subsequent Activation layer.

        :param input_size: Number of input features (neurons in the previous layer).
        :param output_size: Number of output features (neurons in this layer).
        """
        super().__init__()
        self._has_weights = True
        self.input_size = input_size
        self.output_size = output_size

        # Xavier
        limit = np.sqrt(6 / (input_size + output_size))
        self.weights = np.random.uniform(-limit, limit, (input_size, output_size))
        self.biases = np.zeros(output_size) # One bias per output neuron

        # Adam
        self.m_weights = np.zeros_like(self.weights)
        self.v_weights = np.zeros_like(self.weights)
        self.m_biases = np.zeros_like(self.biases)
        self.v_biases = np.zeros_like(self.biases)


    def forward(self, input_data: np.ndarray) -> np.ndarray:
        """
        Perform the forward pass (linear transformation Wx + b).

        :param input_data: Input data of shape (batch_size, input_size).
        :return: Output of shape (batch_size, output_size).
        """
        self.input = input_data
        self.output = np.dot(input_data, self.weights) + self.biases
        return self.output

    def _adam_update(self, param: np.ndarray, grad: np.ndarray, m: np.ndarray, v: np.ndarray,
                     learning_rate: float, t: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Helper function to perform Adam update."""
        m = ADAM_BETA1 * m + (1 - ADAM_BETA1) * grad
        v = ADAM_BETA2 * v + (1 - ADAM_BETA2) * (grad ** 2)

        m_hat = m / (1 - ADAM_BETA1 ** t)
        v_hat = v / (1 - ADAM_BETA2 ** t)

        # Update parameter
        param -= learning_rate * m_hat / (np.sqrt(v_hat) + ADAM_EPSILON)
        return param, m, v

    def backward(self, output_gradient: np.ndarray, learning_rate: float, **kwargs) -> np.ndarray:
        """
        Perform the backward pass for the dense layer using Adam optimiser.

        Note: `output_gradient` is dL/dz, where z is the output of this Dense layer
              (before any activation is applied). It is produced by the backward pass
              of the layer that follows, typically an Activation layer.

        :param output_gradient: Gradient from the next layer (typically an Activation layer).
                                Shape: (batch_size, output_size).
        :param learning_rate: Learning rate for the optimizer.
        :param kwargs: Expected to contain 't' (Adam timestep).
        :return: Gradient with respect to the input of this layer (dL/dx).
                 Shape: (batch_size, input_size).
        """
        if 't' not in kwargs:
            raise ValueError("Adam timestep 't' is required for backward pass.")
        if self.input is None:
             raise RuntimeError("Forward pass must be called before backward pass.")
        t = kwargs['t']

        weights_gradient = np.dot(self.input.T, output_gradient)

        biases_gradient = np.sum(output_gradient, axis=0)
        input_gradient = np.dot(output_gradient, self.weights.T)
        self.weights, self.m_weights, self.v_weights = self._adam_update(
            self.weights, weights_gradient, self.m_weights, self.v_weights, learning_rate, t
        )
        self.biases, self.m_biases, self.v_biases = self._adam_update(
            self.biases, biases_gradient, self.m_biases, self.v_biases, learning_rate, t
        )

        return input_gradient

class LSTM(Layer):
    """LSTM Layer (supports sequence inputs)."""
    def __init__(self, input_size: int, hidden_size: int):
        """
        Initialises LSTM.
        :param input_size: Number of input features.
        :param hidden_size: Number of hidden units.
        """
        super().__init__()
        self._has_weights = True
        self.input_size = input_size
        self.hidden_size = hidden_size

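        # Uniform init scaled by fan-in + fan-out. Note that this limit is smaller than the
        # Xavier/Glorot limit sqrt(6 / (fan_in + fan_out)) used in Conv2D and Dense.
        # Maybe later on use a different weight initialiser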
        limit = np.sqrt(1.0 / (input_size + hidden_size))
        self.W_f = np.random.uniform(-limit, limit, (input_size + hidden_size, hidden_size))
        self.W_i = np.random.uniform(-limit, limit, (input_size + hidden_size, hidden_size))
        self.W_c = np.random.uniform(-limit, limit, (input_size + hidden_size, hidden_size))
        self.W_o = np.random.uniform(-limit, limit, (input_size + hidden_size, hidden_size))

        self.b_f = np.zeros(hidden_size)
        self.b_i = np.zeros(hidden_size)
        self.b_c = np.zeros(hidden_size)
        self.b_o = np.zeros(hidden_size)

        # Adam optimiser states
        self.m, self.v = {}, {}
        for name in ["W_f","W_i","W_c","W_o","b_f","b_i","b_c","b_o"]:
            self.m[name] = np.zeros_like(getattr(self, name))
            self.v[name] = np.zeros_like(getattr(self, name))

    def forward(self, input_seq: np.ndarray) -> np.ndarray:
        """
        Forward pass for the whole sequence.

        :param input_seq: Input shape (batch_size, seq_len, input_size)
        :return: Hidden states for each step (batch_size, seq_len, hidden_size)
        """
        batch_size, seq_len, _ = input_seq.shape
        self.input = input_seq

        # Initialize hidden and cell states
        self.h = np.zeros((batch_size, seq_len, self.hidden_size))
        self.c = np.zeros((batch_size, seq_len, self.hidden_size))

        self.cache = []  # store gates for backward

        h_t = np.zeros((batch_size, self.hidden_size))
        c_t = np.zeros((batch_size, self.hidden_size))

        for t in range(seq_len):
            x_t = input_seq[:, t, :]
            concat = np.concatenate([h_t, x_t], axis=1)

            f_t = _sigmoid(np.dot(concat, self.W_f) + self.b_f)
            i_t = _sigmoid(np.dot(concat, self.W_i) + self.b_i)
            c_hat_t = np.tanh(np.dot(concat, self.W_c) + self.b_c)
            c_t = f_t * c_t + i_t * c_hat_t
            o_t = _sigmoid(np.dot(concat, self.W_o) + self.b_o)
            h_t = o_t * np.tanh(c_t)

            self.h[:, t, :] = h_t
            self.c[:, t, :] = c_t
            self.cache.append((concat, f_t, i_t, c_hat_t, c_t, o_t, h_t))

        self.output = self.h
        return self.output

    def backward(self, output_gradient: np.ndarray, learning_rate: float, **kwargs) -> np.ndarray:
        """
        Backward pass through LSTM (simplified, no peepholes).
        :param output_gradient: Gradient w.r.t. hidden states (batch, seq_len, hidden_size)
        :return: Gradient w.r.t. input sequence (batch, seq_len, input_size)
        """
        if 't' not in kwargs:
            raise ValueError("Adam timestep 't' is required for backward pass.")
        t_step = kwargs['t']

        batch_size, seq_len, _ = output_gradient.shape
        dx = np.zeros((batch_size, seq_len, self.input_size))
        dh_next = np.zeros((batch_size, self.hidden_size))
        dc_next = np.zeros((batch_size, self.hidden_size))

        # Gradients for parameters
        grads = {name: np.zeros_like(getattr(self, name)) for name in self.m}

        for t in reversed(range(seq_len)):
            concat, f_t, i_t, c_hat_t, c_t, o_t, h_t = self.cache[t]

            dh = output_gradient[:, t, :] + dh_next
            do = dh * np.tanh(c_t) * o_t * (1 - o_t)
            dc = dh * o_t * (1 - np.tanh(c_t)**2) + dc_next
            di = dc * c_hat_t * i_t * (1 - i_t)
            dc_hat = dc * i_t * (1 - c_hat_t**2)
            # At t = 0 the previous cell state is the all-zero initial state, so df is zero
            df = dc * self.c[:, t-1, :] * f_t * (1 - f_t) if t > 0 else 0

            dconcat = (np.dot(do, self.W_o.T) +
                       np.dot(di, self.W_i.T) +
                       np.dot(dc_hat, self.W_c.T) +
                       (np.dot(df, self.W_f.T) if isinstance(df, np.ndarray) else 0))

            dh_next = dconcat[:, :self.hidden_size]
            dc_next = dc * f_t
            dx[:, t, :] = dconcat[:, self.hidden_size:]

            grads["W_o"] += np.dot(concat.T, do)
            grads["W_i"] += np.dot(concat.T, di)
            grads["W_c"] += np.dot(concat.T, dc_hat)
            if isinstance(df, np.ndarray):
                grads["W_f"] += np.dot(concat.T, df)

            grads["b_o"] += np.sum(do, axis=0)
            grads["b_i"] += np.sum(di, axis=0)
            grads["b_c"] += np.sum(dc_hat, axis=0)
            if isinstance(df, np.ndarray):
                grads["b_f"] += np.sum(df, axis=0)

        # Adam update
        for name in grads:
            self.__dict__[name], self.m[name], self.v[name] = self._adam_update(
                self.__dict__[name], grads[name], self.m[name], self.v[name], learning_rate, t_step
            )

        return dx

    def _adam_update(self, param, grad, m, v, lr, t):
        m = ADAM_BETA1 * m + (1 - ADAM_BETA1) * grad
        v = ADAM_BETA2 * v + (1 - ADAM_BETA2) * (grad ** 2)
        m_hat = m / (1 - ADAM_BETA1 ** t)
        v_hat = v / (1 - ADAM_BETA2 ** t)
        param -= lr * m_hat / (np.sqrt(v_hat) + ADAM_EPSILON)
        return param, m, v
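
The `softmax` branch of `Activation.backward` passes the incoming gradient through unchanged, which relies on `categorical_cross_entropy_gradient` already returning the gradient with respect to the softmax input. The following sketch checks that identity numerically with central finite differences. It is self-contained and re-declares small `softmax` and `cross_entropy` helpers (assumed names, simplified versions without the epsilon terms used above) rather than importing the notebook's functions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, z):
    """Mean categorical cross-entropy of softmax(z) against one-hot y_true."""
    return -np.mean(np.sum(y_true * np.log(softmax(z)), axis=1))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))                 # logits: 4 samples, 3 classes
y_true = np.eye(3)[rng.integers(0, 3, 4)]   # random one-hot targets

# Analytic gradient w.r.t. the logits: (softmax(z) - y_true) / batch_size
analytic = (softmax(z) - y_true) / z.shape[0]

# Numerical gradient by central differences, one logit at a time
numeric = np.zeros_like(z)
eps = 1e-6
for idx in np.ndindex(*z.shape):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[idx] += eps
    z_minus[idx] -= eps
    numeric[idx] = (cross_entropy(y_true, z_plus) - cross_entropy(y_true, z_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```

The same finite-difference pattern can be pointed at any layer's `backward` output to catch sign or orientation mistakes before training.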


