
## Forget Gate in LSTM

| Aspect                   | Details                                                                                                                                                                                                                                                                |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Definition**           | The forget gate in an LSTM cell determines which information from the previous cell state should be discarded (forgotten) and which should be retained.                                                                                                                |
| **Purpose**              | Controls the flow of past information by applying a sigmoid activation function, producing values between 0 (completely forget) and 1 (completely retain).                                                                                                             |
| **Mathematical Formula** | $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ <br> - $f_t$: Forget gate output <br> - $\sigma$: Sigmoid activation <br> - $W_f$: Weight matrix for forget gate <br> - $h_{t-1}$: Previous hidden state <br> - $x_t$: Current input <br> - $b_f$: Bias for forget gate |
| **Working**              | 1. Takes the previous hidden state ($h_{t-1}$) and current input ($x_t$). <br> 2. Concatenates them, multiplies by the weight matrix $W_f$, adds the bias $b_f$, and passes the result through a sigmoid activation. <br> 3. Produces a vector of values between 0 and 1 indicating how much of each element in the previous cell state $C_{t-1}$ to keep. |
| **Example Scenario**     | In a language model, if part of the previous context is irrelevant to the next word prediction, a trained forget gate can output values close to 0 for the cell-state elements that carry that context, effectively removing it from memory. |
| **Use Cases**            | - Machine translation <br> - Text summarization <br> - Speech recognition <br> - Time-series forecasting                                                                                                                                                               |
| **Interview Q&A**        | **Q:** Why does the forget gate use sigmoid activation instead of ReLU or tanh? <br> **A:** Sigmoid outputs values between 0 and 1, so each output acts directly as a "retention ratio" for one element of the cell state. tanh outputs values between -1 and 1, which would allow sign flips, and ReLU has no upper bound, so neither can be read as a fraction to keep. |

---
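The output-range argument in the Interview Q&A row can be checked numerically. The snippet below is a small sketch that evaluates the three activations on the same sample of pre-activation values and prints the minimum and maximum of each; the sample range of -10 to 10 is an arbitrary choice for illustration.

```python
import numpy as np

# Arbitrary sample of pre-activation values (assumption for illustration)
z = np.linspace(-10.0, 10.0, 201)

sigmoid = 1 / (1 + np.exp(-z))   # Stays strictly between 0 and 1
tanh = np.tanh(z)                # Stays between -1 and 1, so it can go negative
relu = np.maximum(0.0, z)        # Non-negative but grows without an upper bound

for name, values in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(f"{name:8s} min={values.min():.3f} max={values.max():.3f}")
```

Only the sigmoid values can be multiplied into a cell state as a "fraction to keep" without flipping signs or amplifying entries.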

### Python Example – Forget Gate in LSTM

```python
import numpy as np

# Sample input values
x_t = np.array([0.5, 0.1])      # Current input
h_t_minus_1 = np.array([0.2, 0.4])  # Previous hidden state

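# Weight and bias initialization (random for example)
# W_f has shape (hidden size, hidden size + input size) so that
# W_f @ [h_(t-1), x_t] matches the formula above
W_f = np.random.randn(2, 4)   # Weight matrix for forget gate
b_f = np.random.randn(2)      # Bias vector

# Concatenate h_(t-1) and x_t
concat_input = np.concatenate((h_t_minus_1, x_t))

# Sigmoid function: squashes any real number into (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forget gate output: f_t = sigmoid(W_f . [h_(t-1), x_t] + b_f)
f_t = sigmoid(W_f @ concat_input + b_f)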

print("Forget Gate Output:", f_t)
```
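
The example above stops at the gate output. As a minimal sketch of the next step, the snippet below applies a forget-gate vector to a previous cell state $C_{t-1}$ by element-wise multiplication. The values of `f_t` and `C_prev` are hypothetical, hand-picked for illustration rather than taken from a trained model, and the new information that a full LSTM cell adds to the cell state afterwards is left out.

```python
import numpy as np

# Hypothetical values chosen for illustration (not from a trained model)
f_t = np.array([1.0, 0.5, 0.0])      # Forget gate output: keep all, keep half, forget all
C_prev = np.array([2.0, -3.0, 4.0])  # Previous cell state C_(t-1)

# Element-wise product: each gate value scales the matching cell-state entry
retained = f_t * C_prev

print("Retained cell state:", retained)
```

A gate value of 1 passes the entry through unchanged, 0.5 halves it, and 0 erases it, which is the "retention ratio" behaviour described in the table.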

---