## Quiz Questions Explained

---

### Question 1: Word2Vec Skip-gram Model

* **The Question:** This question tests the fundamental concept of the **Skip-gram** architecture in Word2Vec.
* **Correct Answer Explained:**
    * **D. 'Away' is used to predict 'the', 'mouse', 'runs', 'from', 'the', 'cat'**. The Skip-gram model takes a center word (the target) and tries to predict the words in its surrounding context. In the sentence "the mouse runs away from the cat", 'away' is the fourth of seven words, so a window of size 7 (three words on each side of the center word, plus the center word itself) covers the whole sentence. All six other words therefore become the context words that the model tries to predict.
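
The pairing described above can be sketched in a few lines. This is only an illustration: the helper name `skipgram_pairs` and the `radius` parameter (the number of words taken on each side) are assumptions, not part of the quiz.

```python
def skipgram_pairs(tokens, radius):
    """Return (center, context) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - radius)
        hi = min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the mouse runs away from the cat".split()
pairs = skipgram_pairs(sentence, radius=3)

# Context words that 'away' is asked to predict
away_contexts = [ctx for center, ctx in pairs if center == "away"]
print(away_contexts)  # ['the', 'mouse', 'runs', 'from', 'the', 'cat']
```

Each pair is one training example: the first element is the network input and the second is the word to predict.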

---

### Question 2: Word2Vec CBOW Model

* **The Question:** This question asks about the Continuous Bag-of-Words (CBOW) architecture, which is the inverse of Skip-gram.
* **Correct Answer Explained:**
    * **A. 'The', 'mouse', 'runs', 'from', 'the', 'cat' are used to predict 'away'**. The CBOW model takes a collection of context words (the "bag of words") and uses them to predict a single target word in the middle. Here, all the surrounding words are used as context to predict the center word, 'away'.
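
For comparison with Skip-gram, here is a minimal sketch of how CBOW training examples could be built from the same sentence. The helper name `cbow_examples` and the `radius` parameter are assumptions made for illustration.

```python
def cbow_examples(tokens, radius):
    """Return (context_words, target) examples, one per position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - radius):i]
        right = tokens[i + 1:i + radius + 1]
        examples.append((left + right, target))
    return examples


sentence = "the mouse runs away from the cat".split()
examples = cbow_examples(sentence, radius=3)

context, target = examples[3]  # position of 'away'
print(context, "->", target)
```

Unlike Skip-gram, each position yields a single example whose input is the whole bag of context words.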

---

### Question 3: Analogical Reasoning with Word Vectors

* **The Question:** This tests your understanding of how word embeddings capture semantic relationships, allowing for vector arithmetic to solve analogies.
* **Correct Answer Explained:**
    * **A. King**. The relationship is given by the equation `vec(?) - vec(prince) = vec(queen) - vec(princess)`. By rearranging the terms, we get `vec(?) ≈ vec(prince) + vec(queen) - vec(princess)`. This can be interpreted as starting with the concept of a 'prince', adding the concept of a 'queen', and removing the concept of a 'princess'. This vector arithmetic effectively captures the gender and royalty relationships, resulting in the vector closest to 'king'.
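
The rearranged equation can be tried on toy vectors. The 2-D vectors below are invented for illustration (first coordinate roughly "seniority of royal rank", second roughly "gender"); real embeddings are learned and have many more dimensions.

```python
import numpy as np

# Hypothetical hand-made vectors, not learned embeddings
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "prince": np.array([0.5, 1.0]),
    "princess": np.array([0.5, -1.0]),
    "mouse": np.array([-1.0, 0.1]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# vec(?) ≈ vec(prince) + vec(queen) - vec(princess)
query = vectors["prince"] + vectors["queen"] - vectors["princess"]

# Exclude the three words used to build the query, then take the nearest
candidates = {w: v for w, v in vectors.items()
              if w not in {"prince", "queen", "princess"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # king
```

Excluding the query words from the candidates is a common convention, since the nearest vector is otherwise often one of the inputs.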

---

### Question 4: One-Hot Encoding

* **The Question:** This is a basic question on how to represent words as one-hot vectors before they are fed into an embedding model.
* **Correct Answer Explained:**
    * **A. [0,0,0,0,1,0,0,0]**. One-hot encoding represents a word as a sparse vector with a length equal to the size of the vocabulary. The vector is all zeros except for a single '1' at the index corresponding to that word. The vocabulary has 8 unique words, and the dictionary assigns the word 'dog' the index 4. Because indices start at zero, the '1' sits at index 4, which is the fifth position in the vector.
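
A short sketch of the encoding follows. Only the entry for 'dog' is taken from the explanation above; the helper name `one_hot` is an assumption, and the rest of the dictionary is left out because it is not given here.

```python
import numpy as np


def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at the given index."""
    vec = np.zeros(vocab_size, dtype=int)
    vec[index] = 1
    return vec


word_to_index = {"dog": 4}  # the only mapping stated in the explanation
dog = one_hot(word_to_index["dog"], vocab_size=8)
print(dog.tolist())  # [0, 0, 0, 0, 1, 0, 0, 0]
```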

---

### Question 5: Vocabulary and Embedding Size

* **The Question:** This question distinguishes between the vocabulary size (dictionary size) and the embedding size, which are two key parameters in Word2Vec.
* **Correct Answers Explained:**
    * **B. Dictionary size N=10k**. The dictionary or vocabulary size is the number of **unique words** in the corpus, which is given as 10k. The total number of sentences (1 billion) is irrelevant for the vocabulary size.
    * **D. The embedding size is unknown with given information**. The embedding size (the dimensionality of the word vectors) is a **hyperparameter** that the user chooses. It is not determined by the vocabulary size, corpus size, or context window size.

---

### Question 6: Drawbacks of the Skip-gram Algorithm

* **The Question:** This question asks you to identify the known limitations of the standard Skip-gram model.
* **Correct Answers Explained:**
    * **A. The softmax output layer is computationally expensive**. The final layer of the standard Skip-gram model is a softmax function over the entire vocabulary. If the vocabulary is large (e.g., 10k+ words), calculating this is a major performance bottleneck.
    * **B. The values of prediction probabilities are small and hard to distinguish among classes**. Because the softmax normalizes over thousands of words, the probability assigned to any single correct context word is often extremely small, making the gradient updates less effective.
    * **D. Skip-gram depends on the context window size**. The choice of window size is a critical hyperparameter that affects the resulting embeddings. A smaller window tends to capture more syntactic relationships (e.g., "New" and "York"), while a larger window captures more semantic/topical relationships (e.g., "politics" and "government"). This dependency is a design choice but can be seen as a drawback as it needs careful tuning.
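
Points A and B can be made concrete with a small experiment. The scores below are random stand-ins for the output layer of an untrained model, so the printed numbers illustrate scale only.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()  # the normalizer touches every vocabulary entry


N = 10_000  # vocabulary size, as in the "10k" example above
rng = np.random.default_rng(0)
scores = rng.normal(size=N)  # assumed stand-in for the model's raw scores

probs = softmax(scores)
print("sum of probabilities:", probs.sum())
print("largest single probability:", probs.max())
print("uniform baseline 1/N:", 1 / N)
```

Every training step has to compute all `N` exponentials just to normalize, and the probability mass is spread so thinly that individual values stay close to the `1/N` baseline.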

---

### Question 7: Skip-gram with Negative Sampling

* **The Question:** This question explores how **Negative Sampling** optimizes the Skip-gram training process.
* **Correct Answers Explained:**
    * **C. We apply sigmoid activation function at the output layer**. Negative sampling changes the problem from one multi-class classification task (predicting the right word out of thousands) into a series of binary classification tasks (is this word pair a true context pair or not?). A sigmoid function outputs a probability between 0 and 1 for each binary decision.
    * **F. Given a pair target word tw and context word cw, we need to maximize the probability to predict the pair of target word and context word as a positive pair**. For a true pair taken from the text, the model is trained to output a value close to 1 (a positive pair).
    * **E. ...we need to randomly sample some negative words and minimize the probabilities to predict the pairs of target word and negative words as a positive pair**. For each true pair, we also generate several "negative" pairs by combining the target word with random words from the vocabulary. The model is trained to output a value close to 0 for these fake pairs.
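
The three answers combine into a single loss per training pair, sketched below. The function name, the vector sizes, and the random vectors are assumptions for illustration; how the negative words are drawn is left out.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one true pair plus k negative pairs (lower is better).

    Minimizing it pushes sigmoid(target . context) toward 1 and
    sigmoid(target . negative) toward 0 for each sampled negative.
    """
    positive_term = -np.log(sigmoid(target_vec @ context_vec))
    negative_term = -np.sum(np.log(sigmoid(-(negative_vecs @ target_vec))))
    return positive_term + negative_term


rng = np.random.default_rng(0)
d, k = 100, 5  # assumed embedding size and number of negatives
target_vec = rng.normal(scale=0.1, size=d)
context_vec = rng.normal(scale=0.1, size=d)
negative_vecs = rng.normal(scale=0.1, size=(k, d))

loss = negative_sampling_loss(target_vec, context_vec, negative_vecs)
print(loss)
```

Only `k + 1` dot products are needed per pair, instead of one score for every word in the vocabulary.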

---

### Question 8: Skip-gram Objective Function

* **The Question:** This question asks you to interpret the mathematical objective function for the Skip-gram model.
* **Correct Answers Explained:**
    * **B. It is the objective function of skip-gram algorithm**. The formula shows a product over the entire corpus (from $t=1$ to $T$). At each position $t$, it tries to predict the context words ($w_{t+j}$) given the center word ($w_t$). This is the definition of Skip-gram.
    * **D. t represents for the index of target (centre) word**. The outer product iterates from $t=1$ to $T$, where $w_t$ is the center word at each step.
    * **E. We need to maximize this objective function to find parameter theta**. The objective is a likelihood function. Training the model involves finding the parameters $\theta$ that maximize the probability of observing the context words given the target words across the entire corpus.
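
The formula itself is not reproduced in these notes. A standard way to write the Skip-gram likelihood, consistent with the description above, is the following; the symbol $C$ for the context radius is borrowed from Question 9 and is an assumption here.

$$
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-C \le j \le C \\ j \ne 0}} p(w_{t+j} \mid w_t; \theta)
$$

In practice the logarithm is usually taken, which turns the products into sums without changing the maximizing $\theta$:

$$
\log L(\theta) = \sum_{t=1}^{T} \; \sum_{\substack{-C \le j \le C \\ j \ne 0}} \log p(w_{t+j} \mid w_t; \theta)
$$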

---

### Question 9: CBOW Objective Function

* **The Question:** This question asks you to interpret the objective function for the CBOW model.
* **Correct Answers Explained:**
    * **A. It is the objective function of CBOW algorithm**. The formula shows that for each position $t$, the model calculates the probability of the center word $w_t$ given its surrounding context words ($w_{t-C}, ..., w_{t+C}$). This is the definition of CBOW.
    * **E. 2C + 1 is the window size**. In the formula, $C$ represents the context radius (how many words to look at on each side of the target word). The total window includes $C$ words before, $C$ words after, and the center word itself, making the total size $2C + 1$.
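
The formula is not reproduced in these notes. One common way to write a CBOW likelihood that matches the description above is sketched below; whether the quiz used a product or a sum of logarithms is not stated, so this exact form is an assumption.

$$
L(\theta) = \prod_{t=1}^{T} p(w_t \mid w_{t-C}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+C}; \theta)
$$

The conditioning set contains $2C$ context words; adding the center word $w_t$ gives the window size $2C + 1$.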

---

### Question 10: Skip-gram Model Guts 🧠

* **The Question:** This question tests your understanding of the internal workings of a Skip-gram model, including its weight matrices and data flow.
* **Correct Answers Explained:**
    * **A. Shape of U is [200,100] and shape of V is [100,200]**. Matrix **U** is the input word embedding matrix. It maps a one-hot vector of `vocab_size` (200) to a dense vector of `embedding_size` (100), so its shape is `[200, 100]`. Matrix **V** is the output context word matrix. It maps the hidden state (`embedding_size`=100) back to scores for the entire vocabulary (`vocab_size`=200), so its shape is `[100, 200]`.
    * **C. Input to the network is one-hot vector 1_5**. In Skip-gram, the input is the target (center) word, which has an index of 5. This is represented as a one-hot vector.
    * **F. The hidden value h is the row 5 of the matrix U**. Multiplying a one-hot vector (all zeros except for a '1' at index 5) by the matrix U selects the row of U at index 5. In practice this product is implemented as a direct row lookup, which is much cheaper than a full matrix multiplication.
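
The shapes and the row lookup can be checked with a short NumPy sketch. The weights are random placeholders rather than trained values, and the softmax output layer assumes the standard model without negative sampling.

```python
import numpy as np

vocab_size, embedding_size = 200, 100
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(vocab_size, embedding_size))  # input embeddings
V = rng.normal(scale=0.1, size=(embedding_size, vocab_size))  # output weights

target_index = 5
x = np.zeros(vocab_size)
x[target_index] = 1.0  # one-hot vector 1_5

h = x @ U       # hidden value, shape (100,)
scores = h @ V  # one score per vocabulary word, shape (200,)

# Softmax over the vocabulary gives p(context word | target word)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(h.shape, scores.shape)
print(np.allclose(h, U[target_index]))  # True: h is row 5 of U
```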

---

### Question 11: CBOW Model Guts

* **The Question:** This is similar to the previous question but focuses on the CBOW model's internal workings.
* **Correct Answers Explained:**
    * **B. Shape of U is [100,150] and shape of V is [150,100]**. Following the same logic, U maps `vocab_size` (100) to `embedding_size` (150), and V maps `embedding_size` (150) back to `vocab_size` (100).
    * **D. Input to the network is (1_10 + 1_20 + 1_30 + 1_40)/4**. In CBOW, the inputs are the one-hot vectors of all the context words (indices 10, 20, 30, 40), which are then averaged to create a single input vector.
    * **E. The hidden value h is the average of rows 10, 20, 30, 40 of the matrix U**. The hidden state is the averaged input vector multiplied by U. This simplifies to looking up the embedding vectors (rows) for each context word in U and then averaging those vectors.
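
The same check for CBOW is sketched below with random placeholder weights. It confirms that averaging the one-hot inputs and multiplying by U gives the same hidden value as averaging the four rows of U directly.

```python
import numpy as np

vocab_size, embedding_size = 100, 150
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(vocab_size, embedding_size))
V = rng.normal(scale=0.1, size=(embedding_size, vocab_size))

context_indices = [10, 20, 30, 40]

# Input: (1_10 + 1_20 + 1_30 + 1_40) / 4
x = np.zeros(vocab_size)
x[context_indices] = 1.0 / len(context_indices)

h = x @ U                                  # via matrix multiplication
h_lookup = U[context_indices].mean(axis=0)  # via row lookup and averaging
scores = h @ V                             # scores for the center word

print(h.shape, scores.shape)
print(np.allclose(h, h_lookup))  # True
```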

---

### Question 12: A Skip-gram Calculation Walkthrough

* **The Question:** This question provides a diagram of a Skip-gram forward pass and asks you to interpret the results.
* **Correct Answers Explained:**
    * **A. Vocabulary size is 8 and embedding size is 3**. The one-hot input vectors (`1_t`) are of length 8, meaning there are 8 words in the vocabulary (N=8). The hidden vector `h` has 3 elements, which means the embedding dimension is 3 (d=3).
    * **E. p1 = 0.12, p2 = 0.11**. This requires reading the final probability vector `p(cw|tw)`.
        * $p_1 = p(\text{the} | \text{quick})$: We need the probability at the index corresponding to the word 'the'. Assuming the indices from previous questions where 'the' has index 6, we look at the 7th value in the probability vector, which is **0.12**.
        * $p_2 = p(\text{brown} | \text{quick})$: We need the probability at the index for 'brown'. There is an inconsistency in the word-to-index mapping between the questions, but the correct answer indicates the value is **0.11**. This corresponds to the 5th value in the probability vector (index 4).

## Revision Notes: Key Takeaways

### 1. Word Representation: From Sparse to Dense ➡️
* **One-Hot Encoding:** Represents words as sparse vectors of vocabulary length with a single '1'. It's simple but inefficient and captures no semantic relationship (all vectors are orthogonal).
* **Word Embeddings (like Word2Vec):** Represent words as short, dense vectors (e.g., 300 dimensions). Words with similar meanings have similar vectors, allowing for the capture of semantic relationships.

### 2. Word2Vec Architectures 🏗️
* **Core Idea:** A word's meaning is defined by the company it keeps (the words that appear around it).
* **Skip-gram:** **Predicts context words from a center word**. It is slower to train than CBOW but is commonly reported to represent infrequent words well, even with smaller training sets.
* **CBOW (Continuous Bag-of-Words):** **Predicts a center word from its context words**. It's faster to train and slightly better for frequent words.

### 3. Key Concepts in Word2Vec
* **Analogical Reasoning:** Because embeddings capture relationships, vector arithmetic is possible (e.g., `vec(king) - vec(man) + vec(woman) ≈ vec(queen)`).
* **Vocabulary Size (N):** The number of unique words in the corpus.
* **Embedding Size (d):** A hyperparameter defining the length of the word vectors.
* **Weight Matrices:**
    * **Input Matrix (U):** Shape `[N x d]`. Each row is the embedding for a word.
    * **Output Matrix (V):** Shape `[d x N]`. Used to generate scores for context words.

### 4. Making Training Feasible: Negative Sampling ✅
* **The Problem:** The standard softmax layer in Word2Vec is computationally very expensive because it has to calculate probabilities over the entire vocabulary.
* **The Solution (Negative Sampling):** It reframes the task from a massive multi-class problem to many simple binary classification problems.
    * For a true `(target, context)` pair, the model learns to predict **1 (positive)**.
    * For several fake `(target, random_word)` pairs, the model learns to predict **0 (negative)**.
    * This uses a **sigmoid** activation function instead of softmax and is much faster.