# Implement Long Short-Term Memory (LSTM) Network

## Task: Implement Long Short-Term Memory (LSTM) Network

Your task is to implement an LSTM network that processes a sequence of inputs and produces the final hidden state and cell state after processing all inputs.

Write a class `LSTM` with the following methods:

- `__init__(self, input_size, hidden_size)`: Initializes the LSTM with random weights and zero biases.
- `forward(self, x, initial_hidden_state, initial_cell_state)`: Processes a sequence of inputs and returns the hidden states at each time step, as well as the final hidden state and cell state.

The LSTM should compute the forget gate, input gate, candidate cell state, and output gate at each time step to update the hidden state and cell state.

Example
```python
import numpy as np

input_sequence = np.array([[1.0], [2.0], [3.0]])
initial_hidden_state = np.zeros((1, 1))
initial_cell_state = np.zeros((1, 1))

lstm = LSTM(input_size=1, hidden_size=1)
outputs, final_h, final_c = lstm.forward(input_sequence, initial_hidden_state, initial_cell_state)

print(final_h)

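# Expected output (approximate) when the weights and biases are fixed as in Test Case 1
# (Wf = Wi = Wo = [[0.5, 0.5]], Wc = [[0.3, 0.3]], all biases 0.1).
# With the default random initialization the value will vary.
# [[0.73698596]]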
```

## Understanding Long Short-Term Memory Networks (LSTMs)

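Long Short-Term Memory networks (LSTMs) are a type of recurrent neural network (RNN) designed to capture long-term dependencies in sequential data. In addition to the hidden state of a plain RNN, an LSTM maintains a separate cell state and uses gates to control what is written to it, kept in it, and read from it.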

## LSTM Gates and Their Functions

For each time step $t$, the LSTM updates its cell state $c_t$ and hidden state $h_t$ using the current input $x_t$, the previous cell state $c_{t-1}$, and the previous hidden state $h_{t-1}$. The LSTM architecture consists of several gates that control the flow of information:

- **Forget Gate ($f_t$)**: This gate decides what information to discard from the cell state. It looks at the previous hidden state $h_{t-1}$ and the current input $x_t$, and outputs a number between 0 and 1 for each number in the cell state. A 1 represents "keep this" while a 0 represents "forget this".

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$ 

- **Input Gate ($i_t$)**: This gate decides which new information will be stored in the cell state. It consists of two parts:
    - A sigmoid layer that decides which values we'll update.
    - A tanh layer that creates a vector of new candidate values $\tilde{c}_t$ that could be added to the state.
    
    $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
    $$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$ 

- **Cell State Update ($c_t$)**: This step updates the old cell state ($c_{t-1}$) into the new cell state ($c_t$). It multiplies the old state by the forget gate output, then adds the product of the input gate and the new candidate values.
 
$$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t$$

- **Output Gate ($o_t$)**: This gate decides what parts of the cell state we're going to output. It uses a sigmoid function to determine which parts of the cell state to output, and then multiplies it by a tanh of the cell state to get the final output.
 
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \cdot \tanh(c_t)$$

Where:

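- $W_f, W_i, W_c, W_o$ are the weight matrices for the forget gate, input gate, candidate cell state, and output gate, respectively.
- $b_f, b_i, b_c, b_o$ are the corresponding bias vectors.
- $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input.
- $\sigma$ denotes the sigmoid activation function.
- $\cdot$ denotes a matrix-vector product in the gate equations (for example $W_f \cdot [h_{t-1}, x_t]$) and element-wise multiplication in the updates of $c_t$ and $h_t$.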
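As a rough illustration of the equations above, a single time step might be sketched in NumPy as follows. The function name `lstm_step`, the `params` dictionary, and the random toy weights are assumptions made for this example and are not part of the interface the task asks for. The concatenation order $[h_{t-1}, x_t]$ follows the formulas; `@` is the matrix product and `*` is the element-wise product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. x_t: (input_size, 1); h_prev, c_prev: (hidden_size, 1)."""
    # Stack [h_{t-1}, x_t] into a single column vector.
    concat = np.vstack((h_prev, x_t))
    f_t = sigmoid(params["Wf"] @ concat + params["bf"])      # forget gate
    i_t = sigmoid(params["Wi"] @ concat + params["bi"])      # input gate
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate values
    o_t = sigmoid(params["Wo"] @ concat + params["bo"])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # element-wise cell state update
    h_t = o_t * np.tanh(c_t)             # element-wise hidden state update
    return h_t, c_t

# Toy dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 2
params = {}
for gate in "fico":
    params["W" + gate] = rng.standard_normal((hidden_size, hidden_size + input_size))
    params["b" + gate] = np.zeros((hidden_size, 1))

h_1, c_1 = lstm_step(rng.standard_normal((input_size, 1)),
                     np.zeros((hidden_size, 1)),
                     np.zeros((hidden_size, 1)),
                     params)
print(h_1.shape, c_1.shape)  # (2, 1) (2, 1)
```

Because each weight matrix has `hidden_size + input_size` columns, the product with the stacked vector always yields a `(hidden_size, 1)` column, which is what lets the gates multiply the cell state element by element.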

## Implementation Steps

1. Initialization: Start with the initial cell state $c_0$ and hidden state $h_0$.
2. Sequence Processing: For each input $x_t$ in the sequence:
    - Compute the forget gate $f_t$, input gate $i_t$, candidate cell state $\tilde{c}_t$, and output gate $o_t$.
    - Update the cell state $c_t$ and hidden state $h_t$.
3. Final Output: After processing all inputs, the final hidden state $h_T$ (where $T$ is the length of the sequence) contains information from the entire sequence.

## Example Calculation

Given:

- Inputs: $x_1 = 1.0$, $x_2 = 2.0$, $x_3 = 3.0$
- Initial states: $c_0 = 0.0$, $h_0 = 0.0$
- Simplified weights (for demonstration): $W_f = W_i = W_c = W_o = 0.5$ (every entry of each weight matrix)
- All biases: $b_f = b_i = b_c = b_o = 0.1$

Compute:

First time step ($t = 1$):

$$
f_1 = \sigma(0.5 \times 0.0 + 0.5 \times 1.0 + 0.1) = \sigma(0.6) = 0.6457 \\
i_1 = \sigma(0.5 \times 0.0 + 0.5 \times 1.0 + 0.1) = \sigma(0.6) = 0.6457 \\
\tilde{c}_1 = \tanh(0.5 \times 0.0 + 0.5 \times 1.0 + 0.1) = \tanh(0.6) = 0.5370 \\
c_1 = f_1 \times c_0 + i_1 \times \tilde{c}_1 = 0.6457 \times 0.0 + 0.6457 \times 0.5370 = 0.3467 \\
o_1 = \sigma(0.5 \times 0.0 + 0.5 \times 1.0 + 0.1) = \sigma(0.6) = 0.6457 \\
h_1 = o_1 \times \tanh(c_1) = 0.6457 \times \tanh(0.3467) = 0.6457 \times 0.3335 = 0.2153
$$

Second time step ($t = 2$): calculations omitted for brevity, but they follow the same pattern using $x_2 = 2.0$ and the previous states $h_1$ and $c_1$.

Third time step ($t = 3$): calculations omitted for brevity, but they follow the same pattern using $x_3 = 3.0$ and the previous states $h_2$ and $c_2$.

The final hidden state $h_3$ is the result of these calculations.
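The hand calculation can be checked, and carried through the two omitted steps, with a few lines of plain Python. This is only a sketch under the assumption that "weights of 0.5" means every weight entry is 0.5, so that $W \cdot [h_{t-1}, x_t] = 0.5\,h_{t-1} + 0.5\,x_t$. The variable names are chosen for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.5, 0.1   # assumed: every weight entry is 0.5, every bias is 0.1
h, c = 0.0, 0.0   # initial states h_0 and c_0
states = []       # (h_t, c_t) after each time step

for t, x in enumerate([1.0, 2.0, 3.0], start=1):
    # All four gates share the same weights and bias in this toy setup,
    # so they also share the same pre-activation value.
    z = w * h + w * x + b
    f = i = o = sigmoid(z)
    c_tilde = math.tanh(z)
    c = f * c + i * c_tilde
    h = o * math.tanh(c)
    states.append((h, c))
    print(f"t={t}: gates={f:.4f}  c_tilde={c_tilde:.4f}  c={c:.4f}  h={h:.4f}")
```

Note that these demonstration weights differ from the ones used in the test cases (which set `Wc` to 0.3), so the final value printed here is not expected to match the test's expected output.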

## Applications

LSTMs are extensively used in various sequence modeling tasks, including machine translation, speech recognition, and time series forecasting, where capturing long-term dependencies is crucial.

```python
import numpy as np

class LSTM:
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Each weight matrix acts on the concatenation [h_{t-1}, x_t],
        # hence input_size + hidden_size columns
        self.Wf = np.random.randn(hidden_size, input_size + hidden_size)
        self.Wi = np.random.randn(hidden_size, input_size + hidden_size)
        self.Wc = np.random.randn(hidden_size, input_size + hidden_size)
        self.Wo = np.random.randn(hidden_size, input_size + hidden_size)

        # Biases start at zero, one column vector per gate
        self.bf = np.zeros((hidden_size, 1))
        self.bi = np.zeros((hidden_size, 1))
        self.bc = np.zeros((hidden_size, 1))
        self.bo = np.zeros((hidden_size, 1))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def forward(self, x, initial_hidden_state, initial_cell_state):
        """
        Processes a sequence of inputs and returns the hidden states, final hidden state, and final cell state.

        x has shape (sequence_length, input_size); both initial states have shape (hidden_size, 1).
        """
        h = initial_hidden_state
        c = initial_cell_state
        outputs = []

        for t in range(len(x)):
            # Current input as a column vector, stacked below the previous hidden state
            xt = x[t].reshape(-1, 1)
            concat = np.vstack((h, xt))

            # Forget Gate
            ft = self.sigmoid(np.dot(self.Wf, concat) + self.bf)

            # Input Gate and candidate cell state
            it = self.sigmoid(np.dot(self.Wi, concat) + self.bi)
            c_tilde = np.tanh(np.dot(self.Wc, concat) + self.bc)

            # Cell State Update (element-wise)
            c = ft * c + it * c_tilde

            # Output Gate
            ot = self.sigmoid(np.dot(self.Wo, concat) + self.bo)

            # Hidden State (element-wise)
            h = ot * np.tanh(c)

            outputs.append(h)

        # outputs has shape (sequence_length, hidden_size, 1)
        return np.array(outputs), h, c
```
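One practical caveat with `1 / (1 + np.exp(-x))`: for large negative `x`, `np.exp(-x)` exceeds the float64 range and NumPy emits an overflow warning, even though the result still rounds to 0. A possible workaround, sketched here as an optional assumption rather than something the task requires, is to exponentiate only non-positive values. The name `stable_sigmoid` is hypothetical.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) lies in (0, 1], so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, rewrite the sigmoid as exp(x) / (1 + exp(x)); exp(x) lies in [0, 1).
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

z = np.array([[-1000.0], [0.0], [1000.0]])
print(stable_sigmoid(z))
```

For the small pre-activations in the test cases, both versions should give the same values, so the simpler formula is sufficient there.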

```python
import numpy as np

# Test Case 1
input_sequence = np.array([[1.0], [2.0], [3.0]])
initial_hidden_state = np.zeros((1, 1))
initial_cell_state = np.zeros((1, 1))
lstm = LSTM(input_size=1, hidden_size=1)
# Set weights and biases for reproducibility
lstm.Wf = np.array([[0.5, 0.5]])
lstm.Wi = np.array([[0.5, 0.5]])
lstm.Wc = np.array([[0.3, 0.3]])
lstm.Wo = np.array([[0.5, 0.5]])
lstm.bf = np.array([[0.1]])
lstm.bi = np.array([[0.1]])
lstm.bc = np.array([[0.1]])
lstm.bo = np.array([[0.1]])
outputs, final_h, final_c = lstm.forward(input_sequence, initial_hidden_state, initial_cell_state)
print('Test Case 1: Accepted' if np.allclose(final_h, np.array([[0.73698596]])) else 'Test Case 1: Error')
print('Input:')
print('import numpy as np\ninput_sequence = np.array([[1.0], [2.0], [3.0]])\ninitial_hidden_state = np.zeros((1, 1))\ninitial_cell_state = np.zeros((1, 1))\nlstm = LSTM(input_size=1, hidden_size=1)\n# Set weights and biases for reproducibility\nlstm.Wf = np.array([[0.5, 0.5]])\nlstm.Wi = np.array([[0.5, 0.5]])\nlstm.Wc = np.array([[0.3, 0.3]])\nlstm.Wo = np.array([[0.5, 0.5]])\nlstm.bf = np.array([[0.1]])\nlstm.bi = np.array([[0.1]])\nlstm.bc = np.array([[0.1]])\nlstm.bo = np.array([[0.1]])\noutputs, final_h, final_c = lstm.forward(input_sequence, initial_hidden_state, initial_cell_state)\nprint(final_h)')
print()
print('Output:')
print(final_h)
print()
print('Expected:')
print('[[0.73698596]]')
print()
print()

# Test Case 2
input_sequence = np.array([[0.1, 0.2], [0.3, 0.4]])
initial_hidden_state = np.zeros((2, 1))
initial_cell_state = np.zeros((2, 1))
lstm = LSTM(input_size=2, hidden_size=2)
# Set weights and biases for reproducibility
lstm.Wf = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
lstm.Wi = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
lstm.Wc = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
lstm.Wo = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
lstm.bf = np.array([[0.1], [0.2]])
lstm.bi = np.array([[0.1], [0.2]])
lstm.bc = np.array([[0.1], [0.2]])
lstm.bo = np.array([[0.1], [0.2]])
outputs, final_h, final_c = lstm.forward(input_sequence, initial_hidden_state, initial_cell_state)
print('Test Case 2: Accepted' if np.allclose(final_h, np.array([[0.16613133], [0.40299449]])) else 'Test Case 2: Error')
print('Input:')
print('import numpy as np\ninput_sequence = np.array([[0.1, 0.2], [0.3, 0.4]])\ninitial_hidden_state = np.zeros((2, 1))\ninitial_cell_state = np.zeros((2, 1))\nlstm = LSTM(input_size=2, hidden_size=2)\n# Set weights and biases for reproducibility\nlstm.Wf = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])\nlstm.Wi = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])\nlstm.Wc = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])\nlstm.Wo = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])\nlstm.bf = np.array([[0.1], [0.2]])\nlstm.bi = np.array([[0.1], [0.2]])\nlstm.bc = np.array([[0.1], [0.2]])\nlstm.bo = np.array([[0.1], [0.2]])\noutputs, final_h, final_c = lstm.forward(input_sequence, initial_hidden_state, initial_cell_state)\nprint(final_h)')
print()
print('Output:')
print(final_h)
print()
print('Expected:')
print('[[0.16613133], [0.40299449]]')
```

Test Case 1: Accepted
Input:
import numpy as np
input_sequence = np.array([[1.0], [2.0], [3.0]])
initial_hidden_state = np.zeros((1, 1))
initial_cell_state = np.zeros((1, 1))
lstm = LSTM(input_size=1, hidden_size=1)
# Set weights and biases for reproducibility
lstm.Wf = np.array([[0.5, 0.5]])
lstm.Wi = np.array([[0.5, 0.5]])
lstm.Wc = np.array([[0.3, 0.3]])
lstm.Wo = np.array([[0.5, 0.5]])
lstm.bf = np.array([[0.1]])
lstm.bi = np.array([[0.1]])
lstm.bc = np.array([[0.1]])
lstm.bo = np.array([[0.1]])
outputs, final_h, final_c = lstm.forward(input_sequence, initial_hidden_state, initial_cell_state)
print(final_h)

Output:
[[0.73698596]]

Expected:
[[0.73698596]]


Test Case 2: Accepted
Input:
import numpy as np
input_sequence = np.array([[0.1, 0.2], [0.3, 0.4]])
initial_hidden_state = np.zeros((2, 1))
initial_cell_state = np.zeros((2, 1))
lstm = LSTM(input_size=2, hidden_size=2)
# Set weights and biases for reproducibility
lstm.Wf = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
