## Quiz Questions Explained

---

### Question 1: First-Order Markov Assumption

* **The Question:** This question asks you to identify the mathematical expression for the first-order Markov assumption, a foundational concept in sequence modeling. ⛓️
* **Correct Answer Explained:**
    * **C. $p(x_1, ..., x_n) = p(x_n|x_{n-1}) p(x_{n-1}|x_{n-2}) ... p(x_2|x_1) p(x_1)$**. The Markov assumption simplifies the chain rule of probability. The general chain rule factors the joint probability into conditionals in which each state depends on *all* previous states ($p(x_t | x_{1:t-1})$). The **first-order Markov assumption** replaces each of these with $p(x_t | x_{t-1})$: the current state $x_t$ depends *only* on the immediately preceding state $x_{t-1}$. This equation applies that assumption at every step of the sequence.
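
The factorization in answer C can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical two-state weather chain; the state names and all probabilities are made up and are not part of the quiz. Under the first-order assumption, the joint probability is the initial probability times one transition probability per step.

```python
# Hypothetical two-state Markov chain; all numbers are illustrative only.
initial = {"sunny": 0.6, "rainy": 0.4}             # p(x_1)
transition = {                                     # p(x_t | x_{t-1})
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def joint_probability(sequence):
    """p(x_1, ..., x_n) = p(x_1) * product over t of p(x_t | x_{t-1})."""
    prob = initial[sequence[0]]
    # Each factor looks only at the immediately preceding state.
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= transition[prev][curr]
    return prob

p = joint_probability(["sunny", "sunny", "rainy"])  # 0.6 * 0.8 * 0.2
print(p)
```

Without the Markov assumption, the third factor would need a table indexed by both earlier states, and the table size would grow with every additional step.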

---

### Question 2: Calculating Parameters in a Simple RNN

* **The Question:** This asks you to calculate the total number of trainable parameters in a single, simple Recurrent Neural Network (RNN) cell.
* **Correct Answer Explained:**
    * **D. $10 \times 5 + 5 + 5 \times 5$**. The core RNN equation is $h_t = \tanh(h_{t-1}W + x_tU + b)$. Let's break down the parameters:
        * **Weight matrix U:** Connects the input $x_t$ to the hidden state. Its shape is `input_dim x hidden_dim`. Here, that's $10 \times 5$ parameters.
        * **Weight matrix W:** Connects the previous hidden state $h_{t-1}$ to the current hidden state. Its shape is `hidden_dim x hidden_dim`. Here, that's $5 \times 5$ parameters.
        * **Bias b:** Added to the hidden state calculation. Its shape is `hidden_dim`. Here, that's $5$ parameters.
        * **Total parameters** = $(10 \times 5) + (5 \times 5) + 5 = 50 + 25 + 5 = 80$.
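
The same count can be written as a small helper. This is a sketch that follows the single-bias equation $h_t = \tanh(h_{t-1}W + x_tU + b)$ used above; the function name is hypothetical.

```python
def simple_rnn_param_count(input_dim, hidden_dim):
    """Trainable parameters of h_t = tanh(h_{t-1} W + x_t U + b)."""
    u_params = input_dim * hidden_dim    # U: input -> hidden
    w_params = hidden_dim * hidden_dim   # W: hidden -> hidden
    b_params = hidden_dim                # b: one bias per hidden unit
    return u_params + w_params + b_params

total = simple_rnn_param_count(input_dim=10, hidden_dim=5)
print(total)  # 80
```

Note that some libraries parameterize the cell differently. PyTorch's `nn.RNN`, for example, keeps two bias vectors per layer, so it would report 85 parameters for the same sizes.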

---

### Question 3: RNN Equations and Shared Weights

* **The Question:** This tests your understanding of the equations for an unrolled RNN and the concept of shared weights.
* **Correct Answers Explained:**
    * **B. $h_0 = \tanh(x_0U + b), h_1 = \tanh(h_0W + x_1U + b)$**. This is correct because RNNs use the **same weight matrices (U, W) and bias (b) at every time step**. The matrix `U` that processes input $x_0$ is the exact same matrix that processes input $x_1$. This parameter sharing is what allows RNNs to handle sequences of variable length and is a core feature of their design. The equation for $h_0$ has no $W$ term because there is no earlier hidden state (equivalently, the initial hidden state is taken to be zero).
    * **D. Assume the task is classification using softmax function, then $y_0 = \text{softmax}(h_0V + c)$**. This correctly describes how an output can be generated at a specific time step. The hidden state $h_0$ is passed through a separate linear layer (with its own weights `V` and bias `c`) followed by a softmax activation function to produce class probabilities.
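
The two answers can be traced with a short pure-Python sketch. The sizes, the random initialization, and the helper names (`vec_mat`, `add`, `random_matrix`) are assumptions for illustration; only the equations come from the quiz. The point to notice is that `U`, `W`, and `b` are created once and reused inside the loop.

```python
import math
import random

def vec_mat(v, m):
    """Row vector v (length n) times matrix m (n x k), giving a length-k list."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
input_dim, hidden_dim, num_classes = 3, 4, 2   # hypothetical sizes

# One set of parameters, created once and shared by every time step.
U = random_matrix(input_dim, hidden_dim, rng)
W = random_matrix(hidden_dim, hidden_dim, rng)
b = [0.0] * hidden_dim
V = random_matrix(hidden_dim, num_classes, rng)  # separate output layer
c = [0.0] * num_classes

xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]          # x_0 and x_1

h_prev = [0.0] * hidden_dim   # zero initial state, so h_0 = tanh(x_0 U + b)
hidden_states = []
for x_t in xs:
    pre_activation = add(vec_mat(h_prev, W), vec_mat(x_t, U), b)
    h_prev = [math.tanh(a) for a in pre_activation]
    hidden_states.append(h_prev)

# Output at time step 0: y_0 = softmax(h_0 V + c)
y_0 = softmax(add(vec_mat(hidden_states[0], V), c))
print(hidden_states, y_0)
```

Because the loop body never changes, the same code handles a sequence of two inputs or two hundred without adding parameters.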

---

### Question 4: Matching RNN Topologies to Applications

* **The Question:** This is a matching exercise to connect different RNN architectures (topologies) with common real-world applications.
* **Correct Matching Explained:**
    * **A (Many-to-one) -> 2 (Sentiment analysis)**. In sentiment analysis, the model reads a sequence of words (many inputs) and produces a single output at the end (the overall sentiment).
    * **B (One-to-many) -> 4 (Image captioning)**. In image captioning, the model takes a single input (an image, processed by a CNN) and generates a sequence of words (many outputs).
    * **C (Many-to-many, delayed) -> 3 (Machine translation)**. In translation, the model typically needs to read the entire source sentence (many inputs) before it can start generating the translated sentence (many outputs). This is often called an encoder-decoder structure.
    * **D (Many-to-many, synced) -> 1 (Video classification at frame level)**. Here, the task is to classify each frame of a video. The model receives a sequence of frames (many inputs) and produces a corresponding classification for each frame (many outputs).
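
The four topologies differ only in when inputs are consumed and when outputs are emitted. The sketch below makes that visible with a toy scalar update (`step`) standing in for a real RNN cell; the update rule and function names are hypothetical, and feeding `0.0` during generation is a simplification.

```python
def step(h, x):
    """Toy recurrent update standing in for a real RNN cell."""
    return 0.5 * h + x

def many_to_one(inputs):
    """Read the whole sequence, return only the final state."""
    h = 0.0
    for x in inputs:
        h = step(h, x)
    return h

def one_to_many(x, n_outputs):
    """Consume one input, then emit one output per generation step."""
    h = step(0.0, x)
    outputs = []
    for _ in range(n_outputs):
        outputs.append(h)
        h = step(h, 0.0)
    return outputs

def many_to_many_delayed(inputs, n_outputs):
    """Encoder reads everything first; decoder then generates."""
    h = many_to_one(inputs)
    outputs = []
    for _ in range(n_outputs):
        h = step(h, 0.0)
        outputs.append(h)
    return outputs

def many_to_many_synced(inputs):
    """Emit one output for every input, step by step."""
    h, outputs = 0.0, []
    for x in inputs:
        h = step(h, x)
        outputs.append(h)
    return outputs

frames = [1.0, 2.0, 3.0]
print(many_to_one(frames))              # one value for the whole sequence
print(len(many_to_many_synced(frames))) # one value per input
```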

---

### Questions 5-8: Tensor Shapes in a Stacked RNN

* **The Question:** This series of questions tracks the shape of a data tensor as it passes through a multi-layer (stacked) RNN. The input is a mini-batch of shape `[2, 5]` (batch size = 2, sequence length = 5).
* **Correct Answers Explained:**
    * **Q5 (Embedding Layer Output):** An embedding layer converts integer token IDs into dense vectors. It adds a new dimension for the embedding size, which is 10 here.
        * Shape: `[batch_size, seq_len, embed_size]` -> **`[2, 5, 10]`**.
    * **Q6 (Hidden Layer 1 Output):** The first GRU layer takes the embedded sequence as input. Its output will have the same batch size and sequence length, but the last dimension will be the size of its hidden state.
        * Shape: `[batch_size, seq_len, hidden_size_1]` -> **`[2, 5, 15]`**.
    * **Q7 (Hidden Layer 2 Output):** The second GRU layer takes the output sequence from the first hidden layer as its input.
        * Shape: `[batch_size, seq_len, hidden_size_2]` -> **`[2, 5, 20]`**.
    * **Q8 (Hidden Layer 3 Output):** The third GRU layer takes the output sequence from the second hidden layer.
        * Shape: `[batch_size, seq_len, hidden_size_3]` -> **`[2, 5, 25]`**.
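
The rule behind Q5 to Q8 is that the batch size and sequence length are preserved, and only the last dimension changes. A minimal sketch, assuming every recurrent layer returns its full output sequence (the function name is hypothetical; the sizes are the ones from the answers above):

```python
def trace_stacked_rnn_shapes(batch_size, seq_len, embed_size, hidden_sizes):
    """Return the tensor shape after the input, the embedding, and each layer."""
    shapes = [(batch_size, seq_len)]                    # token IDs
    shapes.append((batch_size, seq_len, embed_size))    # after embedding
    for hidden_size in hidden_sizes:
        # Each layer keeps batch and time, and replaces the feature size.
        shapes.append((batch_size, seq_len, hidden_size))
    return shapes

shapes = trace_stacked_rnn_shapes(2, 5, embed_size=10, hidden_sizes=[15, 20, 25])
names = ["input", "embedding", "hidden 1", "hidden 2", "hidden 3"]
for name, shape in zip(names, shapes):
    print(f"{name:10s} {shape}")
```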

---

### Questions 9-12: Code Analysis of an RNN Model

* **The Question:** This set of questions requires you to trace the tensor shapes through a PyTorch `nn.Module` that defines a stacked GRU model. The input shape is `[32, 5]` (batch, sequence length) and the embedding size is 64.
* **Correct Answers Explained:**
    * **Q9 (Shape of h1):** `h1` is the output of the embedding layer.
        * Shape: `[batch, seq_len, embed_size]` -> **`[32, 5, 64]`**.
    * **Q10 (Shape of h2):** `h2` is the output of `self.gru1`, which is defined with a hidden size of 16.
        * Shape: `[batch, seq_len, hidden_size]` -> **`[32, 5, 16]`**.
    * **Q11 (Shape of h4):** `h4` is the output of `self.gru3`. `gru3` takes the output of `gru2` (which has a hidden size of 8) as input and is defined with a hidden size of 16.
        * Shape: `[batch, seq_len, hidden_size]` -> **`[32, 5, 16]`**.
    * **Q12 (Shape of h5):** `h5` is the result of applying `nn.Flatten()` to `h4`. With its default arguments (`start_dim=1`), this collapses all dimensions except the batch dimension into a single vector per example.
        * Input shape: `[32, 5, 16]`. Output shape: `[batch, 5 * 16]` -> **`[32, 80]`**.
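
The quiz code itself is not reproduced here, so the following is a hypothetical reconstruction that matches the sizes described above. The class name, the vocabulary size of 1000, and the attribute name `embedding` are assumptions; `batch_first=True` is assumed so that the batch dimension comes first, as in the answers.

```python
import torch
import torch.nn as nn

class StackedGRU(nn.Module):
    """Hypothetical reconstruction; vocab_size and some names are assumed."""

    def __init__(self, vocab_size=1000, embed_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru1 = nn.GRU(embed_size, 16, batch_first=True)
        self.gru2 = nn.GRU(16, 8, batch_first=True)
        self.gru3 = nn.GRU(8, 16, batch_first=True)
        self.flatten = nn.Flatten()

    def forward(self, x):
        h1 = self.embedding(x)   # [batch, seq_len, 64]
        h2, _ = self.gru1(h1)    # nn.GRU returns (output sequence, final state)
        h3, _ = self.gru2(h2)
        h4, _ = self.gru3(h3)
        h5 = self.flatten(h4)    # [batch, seq_len * 16]
        return h1, h2, h3, h4, h5

model = StackedGRU()
tokens = torch.randint(0, 1000, (32, 5))   # random token IDs, shape [32, 5]
shapes = [tuple(h.shape) for h in model(tokens)]
print(shapes)
```

Printing the shapes this way is a quick check when tracing a model by hand, since each GRU's input size must equal the previous layer's hidden size.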

---

### Question 13: LSTM Forget Gate

* **The Question:** This asks you to identify the properties of the **forget gate** in an LSTM cell.
* **Correct Answers Explained:**
    * **A. We apply sigmoid activation when computing forget gate**. The purpose of a gate is to control information flow. A sigmoid function squashes its output to a range between 0 (block everything) and 1 (let everything pass), which makes it a natural choice for a gating mechanism.
    * **C. It controls the proportion of information in long-term memory to be carried forward**. The forget gate's output is multiplied with the previous cell state, $c_{t-1}$ (the long-term memory), to decide which parts of the old memory to discard and which to keep.
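
In equation form, the two statements above look roughly as follows. This uses the row-vector convention from Question 2; the parameter names $W_f$, $U_f$, $b_f$, the input gate $i_t$, and the candidate memory $\tilde{c}_t$ are the usual LSTM notation and are assumed here rather than taken from the quiz.

$$
f_t = \sigma(h_{t-1} W_f + x_t U_f + b_f), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
$$

Each entry of $f_t$ lies between 0 and 1, so $f_t \odot c_{t-1}$ scales every component of the old cell state independently: values near 1 keep that component, values near 0 erase it.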

---

### Question 14: LSTM Output Gate

* **The Question:** This asks you to identify the properties of the **output gate** in an LSTM cell.
* **Correct Answers Explained:**
    * **A. We apply sigmoid activation when computing output gate**. Like the forget gate, the output gate also uses a sigmoid to control the flow of information.
    * **C. It controls the proportion of information in current long-term memory to be carried forward to new short-term memory**. The output gate decides which parts of the newly updated cell state (long-term memory, $c_t$) will be passed on to become the new hidden state (short-term memory, $h_t$). It essentially controls what the next layer or time step gets to "see."
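
Written out with the row-vector convention from Question 2, the output gate takes roughly this form. The parameter names $W_o$, $U_o$, $b_o$ are the usual LSTM notation and are assumed here rather than taken from the quiz.

$$
o_t = \sigma(h_{t-1} W_o + x_t U_o + b_o), \qquad h_t = o_t \odot \tanh(c_t)
$$

The $\tanh$ squashes the updated cell state into $(-1, 1)$, and $o_t$ then decides, component by component, how much of it is exposed as $h_t$.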

## Revision Notes: Key Takeaways

### 1. RNN Fundamentals 🧠
* **Sequential Data:** RNNs are designed for data where order matters, like text, time series, or video frames.
* **Markov Assumption:** The idea that the future state depends only on the present (or a limited history), not the entire past. This is a simplifying assumption that underlies many sequence models.
* **Recurrence & Shared Weights:** The core of an RNN is a loop where the same set of weights (`U` for input, `W` for hidden state) is applied at every time step. This allows the model to handle sequences of any length.

---

### 2. RNN Architectures & Applications 🏗️
* RNNs can be structured in different ways depending on the task:
    * **Many-to-one:** Sequence input, single output (e.g., **Sentiment Analysis**).
    * **One-to-many:** Single input, sequence output (e.g., **Image Captioning**).
    * **Many-to-many:** Sequence input, sequence output. This can be "synced" (e.g., **Video Frame Classification**) or "delayed" (e.g., **Machine Translation**).

---

### 3. Data Flow and Tensor Shapes in RNNs
* **Input:** Text data usually starts as a 2D tensor of token IDs: `[batch_size, sequence_length]`.
* **Embedding Layer:** This is the first layer for NLP tasks. It transforms the token IDs into dense vectors, adding a dimension: `[batch_size, sequence_length, embedding_size]`.
* **RNN/GRU/LSTM Layers:** Each recurrent layer processes the sequence and outputs a new sequence of hidden states. The shape is: `[batch_size, sequence_length, hidden_size]`. The `hidden_size` is a hyperparameter you choose.
* **Stacked RNNs:** In a stacked RNN, the output sequence of one layer becomes the input sequence for the next layer.

---

### 4. LSTMs: Solving the Memory Problem
* LSTMs (Long Short-Term Memory networks) were designed to overcome the difficulty simple RNNs have retaining information across long sequences, which is largely caused by vanishing gradients. They use a system of gates to regulate information flow.
* **Cell State ($c_t$):** The "long-term memory." It acts as a conveyor belt, carrying information through the sequence with minimal changes.
* **Hidden State ($h_t$):** The "short-term memory" and the output for the current time step.
* **Key Gates (all use sigmoid):**
    * **Forget Gate:** Decides what information to throw away from the old cell state ($c_{t-1}$).
    * **Input Gate:** Decides what new information to store in the current cell state ($c_t$).
    * **Output Gate:** Decides what part of the cell state ($c_t$) to reveal as the hidden state ($h_t$).
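
The three gates can be seen working together in a single LSTM step. The sketch below is a minimal scalar version (hidden size 1) using the standard LSTM update; the parameter dictionary, its key names, and the chosen values are hypothetical and picked only to make the gate behaviour easy to read.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p holds hypothetical weights w_*, u_*, b_*."""
    f_t = sigmoid(p["w_f"] * h_prev + p["u_f"] * x_t + p["b_f"])  # forget gate
    i_t = sigmoid(p["w_i"] * h_prev + p["u_i"] * x_t + p["b_i"])  # input gate
    o_t = sigmoid(p["w_o"] * h_prev + p["u_o"] * x_t + p["b_o"])  # output gate
    c_tilde = math.tanh(p["w_c"] * h_prev + p["u_c"] * x_t + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde   # keep part of old memory, add new
    h_t = o_t * math.tanh(c_t)           # reveal part of the cell state
    return h_t, c_t

# Illustrative parameters: forget gate nearly open, input gate nearly closed.
params = {key: 0.0 for key in (
    "w_f", "u_f", "w_i", "u_i", "w_o", "u_o", "w_c", "u_c", "b_o", "b_c")}
params["b_f"] = 10.0    # sigmoid(10) is close to 1: keep the old memory
params["b_i"] = -10.0   # sigmoid(-10) is close to 0: admit almost nothing new

h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=params)
print(c_t, h_t)
```

With these settings the cell state passes through almost unchanged, which is the "conveyor belt" behaviour described above, while the output gate (here at 0.5) exposes only half of $\tanh(c_t)$ as the hidden state.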