<a href="https://colab.research.google.com/github/HosseinEyvazi/Deep-Learning/blob/main/RNN_LSTM_GRU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



## 1) RNN (Vanilla Recurrent Neural Network)
**What it is:** A sequence model that processes tokens step-by-step, carrying a hidden state as short-term memory.[2]

**Inputs / outputs (per time step \(t\)):**[2]
- Inputs: current input vector \(x_t\) and previous hidden state \(h_{t-1}\).[2]
- Outputs: new hidden state \(h_t\); a task output \(y_t\) is usually produced by an extra output layer on top of \(h_t\).[2]

**Core update idea:** \(h_t\) is computed from \(x_t\) and \(h_{t-1}\) by a learned affine map followed by a nonlinearity (most commonly tanh), so information is repeatedly transformed across time.[2]
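The update above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming a tanh nonlinearity and the common form \(h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\); the names `rnn_step`, `W_xh`, `W_hh`, `b_h` and the sizes are illustrative, not taken from a specific library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5   # illustrative sizes (assumption)

W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))  # a toy input sequence
h = np.zeros(hidden_size)                    # initial hidden state h_0
hidden_states = []
for x_t in xs:
    # The same weights are reused at every time step.
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)      # shape (seq_len, hidden_size)
```

The loop makes the sequential dependency explicit: step \(t\) cannot start before step \(t-1\) has produced \(h_{t-1}\).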

**Vanishing gradient (why RNN struggles on long dependencies):** During backpropagation through time, gradients involve products across many time steps; with repeated nonlinearities and recurrent weights, gradients can shrink toward zero, making long-range learning hard.[3][2]
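The shrinking product can be observed numerically. The sketch below accumulates the Jacobian \(\partial h_t / \partial h_0\) of a tanh RNN and tracks its spectral norm. It assumes a recurrent matrix deliberately scaled to spectral norm 0.9 so that the shrinkage is guaranteed; with larger recurrent weights the same product can grow instead. The function name `gradient_norms` and all sizes are illustrative.

```python
import numpy as np

def gradient_norms(W_hh, W_xh, xs):
    """Spectral norm of d h_t / d h_0 for t = 1..T in a bias-free tanh RNN."""
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)
    J = np.eye(hidden_size)          # accumulated Jacobian d h_t / d h_0
    norms = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        # One-step Jacobian d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_hh
        J = (np.diag(1.0 - h**2) @ W_hh) @ J
        norms.append(np.linalg.norm(J, 2))
    return np.array(norms)

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 8, 3, 50   # illustrative sizes (assumption)

# Assumption: orthogonal matrix scaled by 0.9, so every one-step Jacobian
# has spectral norm at most 0.9 (the tanh derivative is at most 1).
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W_hh = 0.9 * Q
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
xs = rng.normal(size=(seq_len, input_size))

norms = gradient_norms(W_hh, W_xh, xs)
print(f"step 1: {norms[0]:.3e}  step 10: {norms[9]:.3e}  step 50: {norms[-1]:.3e}")
```

Under these assumptions the norm after 50 steps is bounded by \(0.9^{50} \approx 0.005\), so a loss at step 50 has almost no influence on what happened at step 0.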

## 2) LSTM (Long Short-Term Memory)
**What it adds:** LSTM introduces a separate **cell state** \(c_t\) that acts like a dedicated memory path in addition to the hidden state \(h_t\).[4]

**Inputs / outputs (per time step \(t\)):**[4]
- Inputs: \(x_t\), \(h_{t-1}\), and previous cell state \(c_{t-1}\).[4]
- Outputs: updated hidden state \(h_t\) and updated cell state \(c_t\); \(y_t\) typically comes from an output layer on \(h_t\).[4]

**What the gates do (intuitively):** Gates are sigmoid-controlled “valves” that decide how much information passes.[5]
- **Forget gate:** decides how much of \(c_{t-1}\) to keep.[5]
- **Input gate:** decides how much new information to write into the cell state.[5]
- **Output gate:** decides how much of the cell state is exposed as the hidden state \(h_t\).[5]
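The three gates above can be sketched as a single LSTM step in NumPy. This is a minimal sketch, assuming the common formulation without peephole connections; the stacked weight matrix `W`, the gate ordering inside it, and the names `lstm_step`, `f`, `i`, `o`, `g` are illustrative choices rather than any library's layout.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update.

    W has shape (4 * hidden, input + hidden) and stacks the weights for the
    forget gate, input gate, output gate, and candidate, in that order
    (an assumed layout for this sketch).
    """
    hidden = h_prev.shape[0]
    a = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(a[0 * hidden:1 * hidden])   # forget gate: how much of c_{t-1} to keep
    i = sigmoid(a[1 * hidden:2 * hidden])   # input gate: how much new content to write
    o = sigmoid(a[2 * hidden:3 * hidden])   # output gate: how much of the cell to expose
    g = np.tanh(a[3 * hidden:4 * hidden])   # candidate content
    c_t = f * c_prev + i * g                # additive cell-state update
    h_t = o * np.tanh(c_t)                  # hidden state read out from the cell
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4              # illustrative sizes (assumption)
W = rng.normal(scale=0.5, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

x_t = rng.normal(size=input_size)
h_t, c_t = lstm_step(x_t, np.zeros(hidden_size), np.zeros(hidden_size), W, b)
```

Note that `c_t` is not squashed by the output gate, which is why it can carry values across many steps while `h_t` stays bounded in \((-1, 1)\).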

**Why LSTM helps vanishing gradients:** The cell-state update is built to allow a more stable gradient path over time (the “memory highway”), so long-term dependencies are learned more reliably than in vanilla RNNs.[6][4]
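A rough way to see this, assuming the standard update \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\) (the symbols \(f_t\), \(i_t\), and \(\tilde{c}_t\) for the forget gate, input gate, and candidate are notation introduced here) and ignoring the indirect dependence of the gates on \(c_{t-1}\) through \(h_{t-1}\):

\[
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t),
\qquad
\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t).
\]

For comparison, a tanh RNN has \(\partial h_t / \partial h_{t-1} = \operatorname{diag}(1 - h_t^2)\, W_{hh}\), where \(W_{hh}\) (also notation introduced here) is the recurrent weight matrix. The cell-state path contains no repeated weight matrix and no squashing derivative, so when the forget gates stay close to 1 the product stays close to the identity. When they are well below 1 the gradient still decays, which is why this is a mitigation rather than a guarantee.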

## 3) GRU (Gated Recurrent Unit)
**What it is:** A simpler gated alternative to LSTM that does not keep a separate cell state; it applies its gates directly to the hidden state.[7][8]

**Inputs / outputs (per time step \(t\)):**[7]
- Inputs: \(x_t\) and \(h_{t-1}\).[7]
- Outputs: \(h_t\); \(y_t\) is usually produced by an output layer on top of \(h_t\).[7]

**Two gates (what they do):**[8][7]
- **Reset gate (\(r_t\))**: controls how much of the past \(h_{t-1}\) is used when forming the candidate new state (i.e., when you want to “ignore” older context).[8][7]
- **Update gate (\(z_t\))**: controls the interpolation between keeping the old state and replacing it with the candidate new state.[7]
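The two gates above can be sketched as one GRU step in NumPy. This is a minimal sketch under one assumed convention, \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\); some papers and libraries swap the roles of \(z_t\) and \(1 - z_t\), or apply the reset gate after the recurrent matrix product. The names `gru_step`, `W_r`, `W_z`, `W_h`, and `h_cand` are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU update; each W_* has shape (hidden, input + hidden)."""
    xh = np.concatenate([x_t, h_prev])
    r = sigmoid(W_r @ xh + b_r)     # reset gate: how much past enters the candidate
    z = sigmoid(W_z @ xh + b_z)     # update gate: how much of the candidate to take
    # The candidate state sees the past only through the reset-scaled h_{t-1}.
    h_cand = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]) + b_h)
    # Interpolate between the old state and the candidate (assumed convention).
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4      # illustrative sizes (assumption)
shape = (hidden_size, input_size + hidden_size)
W_r, W_z, W_h = (rng.normal(scale=0.5, size=shape) for _ in range(3))
b_r = b_z = b_h = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)
h_prev = rng.normal(scale=0.1, size=hidden_size)
h_t = gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h)
```

With this convention, an update gate near 0 copies \(h_{t-1}\) forward almost unchanged, and a reset gate near 0 makes the candidate depend almost only on \(x_t\).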

**Vanishing gradient:** GRU mitigates vanishing gradients by creating a more direct path for information/gradients through gated updates (often similar benefits to LSTM, with fewer parameters).[6][7]
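The "fewer parameters" point can be made concrete with a back-of-the-envelope count. The sketch below assumes each gate or candidate is one affine map from \([x_t, h_{t-1}]\) to the hidden size with a single bias vector, giving 1 such block for a vanilla RNN, 3 for a GRU (two gates plus candidate), and 4 for an LSTM (three gates plus candidate). The helper `recurrent_param_count` and the sizes are hypothetical, and some frameworks keep separate input and recurrent biases, so their reported counts can be slightly higher.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Weights plus biases for `num_blocks` affine maps [x_t, h_{t-1}] -> hidden."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 128, 256   # illustrative sizes (assumption)
counts = {
    "RNN": recurrent_param_count(input_size, hidden_size, 1),
    "GRU": recurrent_param_count(input_size, hidden_size, 3),
    "LSTM": recurrent_param_count(input_size, hidden_size, 4),
}
for name, n in counts.items():
    print(f"{name:>4}: {n:,} parameters")
# RNN: 98,560   GRU: 295,680   LSTM: 394,240
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width.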

## 4) One-page cheat sheet (incl. Transformers)
Transformers process sequences largely in parallel via attention (instead of step-by-step recurrence), so every position can attend to every other position directly; this is a major reason they train efficiently and handle long-range dependencies well.[9][10]

| Model | Per-step inputs | Per-step outputs | Memory mechanism | Vanishing gradient handling |
|---|---|---|---|---|
| RNN | \(x_t, h_{t-1}\) [2] | \(h_t\) (and \(y_t\) via output head) [2] | Hidden state only [2] | Often severe on long sequences [2][3] |
| LSTM | \(x_t, h_{t-1}, c_{t-1}\) [4] | \(h_t, c_t\) (and \(y_t\) via output head) [4] | Cell state + gates [4][5] | Strong mitigation via cell-state path [6][4] |
| GRU | \(x_t, h_{t-1}\) [7] | \(h_t\) (and \(y_t\) via output head) [7] | Gated hidden state (no separate \(c_t\)) [7][8] | Mitigated via gated update path [6][7] |
| Transformer | Whole sequence (parallel) [9][10] | Whole sequence representations [9][10] | Attention (no recurrence) [9][10] | Not the classic recurrent vanishing issue [9][10] |
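To make the Transformer row concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. It assumes a single head, no masking, and no positional encoding; the names `self_attention`, `W_q`, `W_k`, `W_v` and the sizes are illustrative. The point is only that all positions are handled by a few matrix products, with no loop over time steps.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a whole sequence.

    X has shape (seq_len, d_model); returns (outputs, attention_weights).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4               # illustrative sizes (assumption)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_k)) for _ in range(3))

out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)                  # (5, 4) (5, 5)
```

The `(seq_len, seq_len)` score matrix is also why the cost of plain self-attention grows quadratically with sequence length.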




## 8 ready-to-say interview answers

### 1) “What is an RNN?”
“An RNN is a sequence model that processes data step-by-step. At each time step it takes the current input \(x_t\) and the previous hidden state \(h_{t-1}\), then produces a new hidden state \(h_t\). That hidden state is the model’s internal summary of the past, and we usually map it to a task output \(y_t\) with a final linear/softmax head.”

### 2) “Why do vanilla RNNs fail on long sequences?”
“Because of vanishing gradients in backpropagation through time: the gradient becomes a product of many Jacobians across timesteps. With saturating nonlinearities and recurrent multiplication, that product often shrinks toward zero, so early timesteps barely learn. Practically, RNNs tend to remember only short context.”

### 3) “What is LSTM and what are its inputs/outputs?”
“LSTM is an RNN variant designed to keep long-term memory. Per time step it takes \(x_t\), previous hidden state \(h_{t-1}\), and previous cell state \(c_{t-1}\). It outputs a new hidden state \(h_t\) and a new cell state \(c_t\). Typically \(h_t\) goes to the prediction head for \(y_t\), while \(c_t\) is mainly internal memory.”

### 4) “What is the cell state, and why does it help?”
“The cell state \(c_t\) is a dedicated memory track. It gets updated in an additive way—keep some old memory and add some new candidate memory—so it creates a more stable path for information and gradients over time. That’s why LSTMs handle long-range dependencies better than vanilla RNNs.”

### 5) “Explain LSTM gates like I’m not deep into math.”
“Gates are soft switches (values between 0 and 1). The **forget gate** decides what to erase from the old cell memory. The **input gate** decides what new information to write into the cell. The **output gate** decides what part of the cell memory should be exposed as the hidden state \(h_t\) for the current step’s output.”

### 6) “What is GRU and how is it different from LSTM?”
“GRU is a simpler gated RNN. It has no separate cell state \(c_t\); it uses only the hidden state \(h_t\). Compared to LSTM’s three gates, GRU has two gates, reset and update, so it has fewer parameters for the same hidden size and can train faster while often performing similarly.”

### 7) “What does the reset gate do in GRU?”
“The reset gate controls how much of the previous hidden state \(h_{t-1}\) is used when forming the candidate new state. If reset is low, the model ‘ignores’ much of the past for that candidate—useful when context changes. If reset is high, it incorporates the past strongly.”

### 8) “When would you choose RNN vs LSTM vs GRU (practically)?”
“For short sequences or a quick baseline, vanilla RNN can be fine but usually underperforms on long context. For strong long-term dependencies, LSTM is a safe choice. If I need a lighter model with faster training/inference and similar performance, I’ll try GRU first. In practice, I’d validate both GRU and LSTM with the same training setup and pick based on metrics and latency constraints.”
