# CS 224n Assignment #2: word2vec (44 Points)

## 1. Written: Understanding word2vec (26 points)
The goal of the skip-gram `word2vec` algorithm is to accurately learn the probability distribution $P(O|C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o|C = c)$, which is the probability that word $o$ is an ‘outside’ word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.  

In `word2vec`, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:  
$$\begin{align*}
P(O = o|C = c) = \frac{\exp (u_o^\top v_c)}{\sum_{w\in Vocab}\exp (u_w^\top v_c)} \tag{1}
\end{align*}$$  

Here, $u_o$ is the ‘outside’ vector representing outside word $o$, and $v_c$ is the ‘center’ vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in$ Vocabulary.  
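As a concrete illustration of Equation $(1)$, here is a minimal sketch in plain Python. The function name `softmax_prob`, the list-of-columns layout `U_cols`, and the toy numbers are assumptions made for this example and are not part of the assignment code.

```python
import math

def softmax_prob(U_cols, v_c):
    """Return P(O = w | C = c) for every word w, following Equation (1).

    U_cols: list of outside vectors u_w, one per vocabulary word (hypothetical layout).
    v_c: the center vector.
    """
    scores = [sum(u_i * v_i for u_i, v_i in zip(u_w, v_c)) for u_w in U_cols]
    # Subtracting the max score avoids overflow and leaves the result unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 3 words with 2-dimensional embeddings (made-up values).
U_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_c = [0.5, -0.5]
y_hat = softmax_prob(U_cols, v_c)
```

The dot products here are 0.5, -0.5, and 0.0, so the first word receives the highest probability and the second the lowest.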

Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:  
$$\begin{align*}
J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o|C = c). \tag{2}
\end{align*}$$  

We can view this loss as the cross-entropy between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an ‘outside word’ for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O| C = c)$ given by our model in equation $(1)$.

### (a)
(3 points) Show that the naive-softmax loss given in Equation $(2)$ is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that

$$\begin{align*}
-\sum_{w\in Vocab} y_w \log (\hat{y}_w) = -\log (\hat{y}_o). \tag{3}
\end{align*}$$  

Your answer should be one line.  

### $\color{red}{Answer}$  
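$$-\sum_{w\in Vocab} y_w \log (\hat{y}_w) = -1 \cdot \log (\hat{y}_o) - \sum_{w \ne o} 0 \cdot \log (\hat{y}_w) = -\log (\hat{y}_o), \quad \mbox{since } y_w = \begin{cases} 1, & \mbox{if }w = o \\ 
0, & \mbox{elsewhere} \end{cases}$$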
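The identity in Equation $(3)$ can also be checked numerically. The sketch below uses a made-up predicted distribution and a hypothetical helper `cross_entropy`; it only illustrates that the one-hot $y$ selects a single term of the sum.

```python
import math

def cross_entropy(y, y_hat):
    # Terms with y_w == 0 contribute nothing, so they are skipped.
    return -sum(y_w * math.log(p) for y_w, p in zip(y, y_hat) if y_w != 0)

y_hat = [0.2, 0.7, 0.1]   # made-up predicted distribution
o = 1                     # index of the true outside word
y = [1.0 if w == o else 0.0 for w in range(len(y_hat))]

ce = cross_entropy(y, y_hat)
nll = -math.log(y_hat[o])
```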

### (b)  
(5 points) Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$. Note that in this course, we expect your final answers to follow the shape convention. This means that the partial derivative of any function $f(x)$ with respect to $x$ should have the same shape as $x$. For this subpart, please present your answer in vectorized form. In particular, you may not refer to specific elements of $y$, $\hat{y}$, and $U$ in your final answer (such as $y_1, y_2, \ldots$).

### $\color{red}{Answer}$  
Let $V$ denote the number of words in the vocabulary and $N$ the size of the word embeddings (in this answer $V$ is a count, not the center-vector matrix). Then the word vectors $v_c, u_w \in \mathbb{R}^{N}$ and the matrix $U \in \mathbb{R}^{N \times V}$. The true distribution $y$ and the predicted distribution $\hat{y}$ are vectors of length $V$. Following the shape convention, the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$ should satisfy $\frac{\partial J}{\partial v_c} \in \mathbb{R}^N$.  

$$\begin{align*}
J_{\text{naive-softmax}}(v_c, o, U) &= -\log P(O = o|C = c) \\
&= -\log\frac{\exp (u_o^\top v_c)}{\sum_{w\in Vocab}\exp (u_w^\top v_c)} \\
&= - u_o^\top v_c +  \log \sum_{w\in Vocab}\exp (u_w^\top v_c) \end{align*}$$  
$$\begin{align*}
\frac{\partial J(v_c, o, U)}{\partial v_c} &= -u_o + \frac{\frac{\partial}{\partial v_c}\sum_{w\in Vocab}\exp (u_w^\top v_c)}{\sum_{w\in Vocab}\exp (u_w^\top v_c)} \\
&= -u_o + \frac{1}{\sum_{w\in Vocab} \exp (u_w^\top v_c)}\cdot \sum_{x\in Vocab} \frac{\partial}{\partial v_c} \exp (u_x^\top v_c) \\
&= -u_o + \frac{1}{\sum_{w\in Vocab} \exp (u_w^\top v_c)}\cdot \sum_{x\in Vocab} \exp (u_x^\top v_c) \cdot \frac{\partial}{\partial v_c} (u_x^\top v_c) \\
&= -u_o + \frac{1}{\sum_{w\in Vocab} \exp (u_w^\top v_c)}\cdot \sum_{x\in Vocab} \exp (u_x^\top v_c) \cdot u_x \\
&= -u_o + \sum_{x\in Vocab} \frac{\exp (u_x^\top v_c)}{\sum_{w\in Vocab} \exp (u_w^\top v_c)} \cdot u_x \\
&= -u_o + \sum_{x \in Vocab} P(O = x|C = c)u_x \\
&= -u_o + \sum_{w \in Vocab} \hat{y}_w u_w \\
&= -\sum_{w \in Vocab}y_w u_w + \sum_{w \in Vocab} \hat{y}_w u_w \\
&= U(\hat{y} - y)
\end{align*}$$  
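A finite-difference check is a simple way to gain confidence in the result $U(\hat{y} - y)$. The following is a minimal sketch in plain Python; the helper names (`dot`, `softmax`, `loss`, `grad_v_c`, `numeric_grad`), the list-of-columns layout `U_cols`, and the toy values are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def loss(v_c, o, U_cols):
    # Naive-softmax loss of Equation (2).
    return -math.log(softmax([dot(u, v_c) for u in U_cols])[o])

def grad_v_c(v_c, o, U_cols):
    # Analytic gradient U (y_hat - y): a weighted sum of the columns of U.
    y_hat = softmax([dot(u, v_c) for u in U_cols])
    diff = [p - (1.0 if w == o else 0.0) for w, p in enumerate(y_hat)]
    return [sum(diff[w] * U_cols[w][i] for w in range(len(U_cols)))
            for i in range(len(v_c))]

def numeric_grad(f, x, eps=1e-6):
    # Central differences, one coordinate at a time.
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# Toy setup: 4 vocabulary words, 2-dimensional embeddings (made-up values).
U_cols = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.6]]
v_c = [0.3, -0.7]
o = 1

analytic = grad_v_c(v_c, o, U_cols)
numeric = numeric_grad(lambda v: loss(v, o, U_cols), v_c)
```

The two gradients should agree up to the error of the finite-difference approximation, and `analytic` has the same length as `v_c`, as the shape convention requires.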

### (c)
(5 points) Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to  each of the ‘outside’ word vectors, $u_w$’s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \ne o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$. In this subpart, you may use specific elements within these terms as well, such as $(y_1, y_2, \ldots)$.  

### $\color{red}{Answer}$  
$$\begin{align*}
\frac{\partial J(v_c, o, U)}{\partial u_w} &= - \frac{\partial}{\partial u_w}(u_o^\top v_c) +  \frac{\partial}{\partial u_w}\log \sum_{x\in Vocab}\exp (u_x^\top v_c) \\ 
&= -\frac{\partial}{\partial u_w}(u_o^\top v_c) + \frac{1}{\sum_{x\in Vocab} \exp (u_x^\top v_c)}\cdot \frac{\partial}{\partial u_w} \sum_{x\in Vocab} \exp (u_x^\top v_c) \\
&= -\frac{\partial}{\partial u_w}(u_o^\top v_c) + \frac{1}{\sum_{x\in Vocab} \exp (u_x^\top v_c)}\cdot \frac{\partial}{\partial u_w} \exp (u_w^\top v_c) \quad \mbox{(only the } x = w \mbox{ term depends on } u_w \mbox{)} \\
&= -\frac{\partial}{\partial u_w}(u_o^\top v_c) + \frac{\exp (u_w^\top v_c)}{\sum_{x\in Vocab} \exp (u_x^\top v_c)} \cdot v_c \\
&= -y_w v_c + \hat{y}_w v_c \quad \mbox{(the first term is } -v_c \mbox{ if } w = o \mbox{ and } 0 \mbox{ otherwise)} \\
&= (\hat{y}_w -y_w) v_c \\
&= \begin{cases} (\hat{y}_o - 1) v_c, & \mbox{if }w = o \\ 
\hat{y}_w v_c, & \mbox{if }w \ne o \end{cases}
\end{align*}$$  

### (d)
(1 point) Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $U$. Please write your answer in terms of $\frac{\partial J(v_c, o, U)}{\partial u_1}, \frac{\partial J(v_c, o, U)}{\partial u_2}, \ldots, \frac{\partial J(v_c, o, U)}{\partial u_{\lvert Vocab \rvert}}$. The solution should be one or two lines long.  

### $\color{red}{Answer}$  
$$\begin{align*}
\frac{\partial J(v_c, o, U)}{\partial u_k} &= (\hat{y}_k -y_k) v_c,\ \mbox{ for } k = 1, 2, \ldots, |Vocab|\\
\therefore \frac{\partial J(v_c, o, U)}{\partial U} &= \left[\frac{\partial J(v_c, o, U)}{\partial u_1}, \frac{\partial J(v_c, o, U)}{\partial u_2}, \ldots, \frac{\partial J(v_c, o, U)}{\partial u_{|Vocab|}} \right] = v_c(\hat{y} - y)^\top \in \mathbb{R}^{N \times |Vocab|}
\end{align*}$$  
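Since the outside vectors are the columns of $U$, the gradient with respect to $U$ has one column $(\hat{y}_k - y_k) v_c$ per word, that is, the outer product of $v_c$ with $(\hat{y} - y)$. The sketch below, in plain Python with hypothetical helper names and made-up values, builds that matrix as a nested list with one row per embedding dimension and checks every entry by finite differences.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def loss(v_c, o, U_cols):
    return -math.log(softmax([dot(u, v_c) for u in U_cols])[o])

def grad_U(v_c, o, U_cols):
    """Return dJ/dU with entry [i][w] = v_c[i] * (y_hat[w] - y[w])."""
    y_hat = softmax([dot(u, v_c) for u in U_cols])
    diff = [p - (1.0 if w == o else 0.0) for w, p in enumerate(y_hat)]
    return [[v_i * d for d in diff] for v_i in v_c]

# Toy setup: 4 vocabulary words, 2-dimensional embeddings (made-up values).
U_cols = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.6]]
v_c = [0.3, -0.7]
o = 1
G = grad_U(v_c, o, U_cols)

# Finite-difference check on every entry of U.
eps = 1e-6
max_err = 0.0
for w in range(len(U_cols)):
    for i in range(len(v_c)):
        U_plus = [list(u) for u in U_cols]
        U_minus = [list(u) for u in U_cols]
        U_plus[w][i] += eps
        U_minus[w][i] -= eps
        numeric = (loss(v_c, o, U_plus) - loss(v_c, o, U_minus)) / (2 * eps)
        max_err = max(max_err, abs(numeric - G[i][w]))
```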

### (e)
(3 points) The sigmoid function is given by Equation 4:  
$$\begin{align*}
\sigma (x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x + 1} \tag{4}
\end{align*}$$  
Please compute the derivative of $\sigma (x)$ with respect to $x$, where $x$ is a scalar. Hint: you may want to write your answer in terms of $\sigma (x)$.  

### $\color{red}{Answer}$  
$$\begin{align*}
\frac{d \sigma (x)}{dx} &= \frac{d}{dx}\frac{1}{1+e^{-x}} \\
&= \frac{(-1)\cdot e^{-x}\cdot (-1)}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2} \\
&= \frac{1+e^{-x}}{(1+e^{-x})^2} - \frac{1}{(1+e^{-x})^2} \\
&= \frac{1}{1+e^{-x}}(1 - \frac{1}{1+e^{-x}}) \\
\therefore &= \sigma(x)(1-\sigma(x))
\end{align*}$$  
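The identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ is easy to sanity-check numerically. A minimal sketch, with hypothetical function names `sigmoid` and `sigmoid_grad` chosen for this example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative written in terms of sigmoid(x) itself.
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against central differences at a few sample points.
eps = 1e-6
errors = []
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    errors.append(abs(numeric - sigmoid_grad(x)))
```

At $x = 0$ the derivative reaches its maximum value of $0.25$.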

### (f)
(4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \ldots, w_K$ and their outside vectors as $u_1, \ldots, u_K$. For this question, assume that the $K$ negative samples are distinct. In other words, $i \ne j$ implies $w_i \ne w_j$ for $i,j \in \{1, \ldots, K \}$. Note that $o \notin \{w_1, \ldots, w_K \}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:  
$$\begin{align*}
J_{\text{neg-sample}}(v_c, o, U) = -\log (\sigma(u_o^\top v_c)) - \sum_{k=1}^K\log (\sigma(-u_k^\top v_c)) \tag{5}
\end{align*}$$  
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.  

Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [ 1, K ]$. After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (e) to help compute the necessary gradients here.

### $\color{red}{Answer}$  
$$\begin{align*}
\mbox{(i) } \frac{\partial J_{\text{neg-sample}}}{\partial v_c} &= - \frac{\partial}{\partial v_c}\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^K \frac{\partial}{\partial v_c} \log(\sigma(-u_k^\top v_c)) \\
&= - \frac{1}{\sigma(u_o^\top v_c)}\frac{\partial}{\partial v_c}\sigma(u_o^\top v_c) - \sum_{k=1}^K \frac{1}{\sigma(-u_k^\top v_c)}\frac{\partial}{\partial v_c}\sigma(-u_k^\top v_c) \\
&= -(1-\sigma(u_o^\top v_c))\frac{\partial}{\partial v_c}u_o^\top v_c - \sum_{k=1}^K (1-\sigma(-u_k^\top v_c))\frac{\partial}{\partial v_c}(-u_k^\top v_c)\\
&= -(1-\sigma(u_o^\top v_c))u_o + \sum_{k=1}^K (1-\sigma(-u_k^\top v_c))u_k \\
\mbox{(ii) } \frac{\partial J_{\text{neg-sample}}}{\partial u_o} &= - \frac{\partial}{\partial u_o}\log(\sigma(u_o^\top v_c)) \\ 
&= -(1-\sigma(u_o^\top v_c))v_c \\
\mbox{(iii) } \frac{\partial J_{\text{neg-sample}}}{\partial u_k} &= - \frac{\partial}{\partial u_k}\sum_{x=1}^K  \log(\sigma(-u_x^\top v_c)) \\
&= - (1-\sigma(-u_k^\top v_c))\frac{\partial}{\partial u_k}(-u_k^\top v_c)\\
&= (1-\sigma(-u_k^\top v_c))v_c
\end{align*}$$  

The naive-softmax loss must normalize over the entire vocabulary, costing $O(|Vocab|)$ dot products and exponentials per training pair, whereas the negative-sampling loss only involves the true outside word and the $K$ sampled words, costing $O(K)$ with $K \ll |Vocab|$.  
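The three gradients above can be sketched and verified in a few lines of plain Python. The names `neg_sample_loss`, `neg_sample_grads`, and `numeric_grad`, as well as the toy vectors, are assumptions made for this illustration. Note that the loops run over the $K$ negative samples only, never over the whole vocabulary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sample_loss(v_c, u_o, U_neg):
    # Equation (5): one term for the true outside word, one per negative sample.
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in U_neg:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

def neg_sample_grads(v_c, u_o, U_neg):
    a = 1.0 - sigmoid(dot(u_o, v_c))           # weight for the true outside word
    grad_v = [-a * u for u in u_o]
    grad_u_o = [-a * v for v in v_c]
    grad_u_neg = []
    for u_k in U_neg:
        b = 1.0 - sigmoid(-dot(u_k, v_c))      # weight for this negative sample
        grad_v = [g + b * u for g, u in zip(grad_v, u_k)]
        grad_u_neg.append([b * v for v in v_c])
    return grad_v, grad_u_o, grad_u_neg

def numeric_grad(f, x, eps=1e-6):
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# Toy values: 2-dimensional embeddings, K = 2 distinct negative samples.
v_c = [0.3, -0.7]
u_o = [0.4, 0.3]
U_neg = [[-0.5, 0.1], [0.0, 0.6]]

grad_v, grad_u_o, grad_u_neg = neg_sample_grads(v_c, u_o, U_neg)
num_v = numeric_grad(lambda v: neg_sample_loss(v, u_o, U_neg), v_c)
num_u_o = numeric_grad(lambda u: neg_sample_loss(v_c, u, U_neg), u_o)
num_u_0 = numeric_grad(lambda u: neg_sample_loss(v_c, u_o, [u, U_neg[1]]), U_neg[0])
```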

### (g)
(2 points) Now we will repeat the previous exercise, but without the assumption that the $K$ sampled words are distinct. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, \ldots, w_K$ and their outside vectors as $u_1, \ldots, u_K$. In this question, you may not assume that the words are distinct. In other words, $w_i = w_j$ may be true when $i \ne j$ is true. Note that $o \notin \{ w_1, \ldots, w_K \}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:  
$$\begin{align*}
J_{\text{neg-sample}}(v_c, o, U) = -\log (\sigma(u_o^\top v_c)) - \sum_{k=1}^K\log (\sigma(-u_k^\top v_c)) \tag{6}
\end{align*}$$  
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.  

Compute the partial derivative of $J_{\text{neg-sample}}$ with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $v_c$ and $u_k$, where $k \in [1, K]$. Hint: break up the sum in the loss function into two sums: a sum over all sampled words equal to $u_k$ and a sum over all sampled words not equal to $u_k$.  

### $\color{red}{Answer}$  
$$\begin{align*}
\frac{\partial J_{\text{neg-sample}}}{\partial u_k} &= - \frac{\partial}{\partial u_k} \sum_{x=1}^K \log(\sigma(-u_x^\top v_c)) \\
&= - \frac{\partial}{\partial u_k}  \sum_{\substack{u_x = u_k \\ x = 1, \ldots, K}} \log (\sigma(-u_x^\top v_c)) - \frac{\partial}{\partial u_k} \sum_{\substack{u_x \ne u_k \\ x = 1, \ldots, K}} \log(\sigma(-u_x^\top v_c)) \\
&= \sum_{\substack{u_x = u_k \\ x = 1, \ldots, K}} (1-\sigma(-u_k^\top v_c))v_c \\
&= \left( \sum_{x=1}^K \mathbb{I}\{ u_x = u_k \} \right) (1-\sigma(-u_k^\top v_c))v_c
\end{align*}$$  
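In other words, the gradient for a repeated negative sample is the distinct-sample gradient scaled by the number of times that word was drawn. The sketch below illustrates this with a hypothetical index list `neg_idx` in which word 2 appears twice; all names and values are assumptions made for this example.

```python
import math
from collections import Counter

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_loss(v_c, o, neg_idx, U_cols):
    # Equation (6), with negative samples given as (possibly repeated) word indices.
    loss = -math.log(sigmoid(dot(U_cols[o], v_c)))
    for k in neg_idx:
        loss -= math.log(sigmoid(-dot(U_cols[k], v_c)))
    return loss

def grad_neg_vector(v_c, k, neg_idx, U_cols):
    # Multiply the single-sample gradient by how often word k was sampled.
    count = Counter(neg_idx)[k]
    b = 1.0 - sigmoid(-dot(U_cols[k], v_c))
    return [count * b * v for v in v_c]

# Toy setup (made-up values): word 2 is drawn twice, word 3 once.
U_cols = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.6]]
v_c = [0.3, -0.7]
o = 1
neg_idx = [2, 3, 2]

analytic = grad_neg_vector(v_c, 2, neg_idx, U_cols)

# Finite-difference check with respect to the outside vector of word 2.
eps = 1e-6
numeric = []
for i in range(len(v_c)):
    U_plus = [list(u) for u in U_cols]
    U_minus = [list(u) for u in U_cols]
    U_plus[2][i] += eps
    U_minus[2][i] -= eps
    numeric.append((neg_loss(v_c, o, neg_idx, U_plus)
                    - neg_loss(v_c, o, neg_idx, U_minus)) / (2 * eps))
```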

### (h)
(3 points) Suppose the center word is $c=w_t$ and the context window is $[w_{t-m}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of `word2vec`, the total loss for the context window is:  

$$\begin{align*}
J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m\\ j \ne 0}} J(v_c, w_{t+j}, U) \tag{7}
\end{align*}$$  
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$.  $J(v_c, w_{t+j}, U)$ could be  $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or  $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.  

Write down three partial derivatives:  
$$\begin{align*}
\text{(i) } & \partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)/\partial U \\
\text{(ii) } & \partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)/\partial v_c \\
\text{(iii) } & \partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)/\partial v_w \mbox{ when } w \ne c
\end{align*}$$  

Write your answers in terms of $\partial J(v_c, w_{t+j}, U)/\partial U$ and $\partial J(v_c, w_{t+j}, U)/\partial v_c$. This is very simple – each solution should be one line.  

### $\color{red}{Answer}$  

$$\begin{align*}
\text{(i) } & \partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)/\partial U = \sum_{\substack{-m \le j \le m\\ j \ne 0}}\partial J(v_c, w_{t+j}, U)/\partial U \\ 
\text{(ii) } & \partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)/\partial v_c = \sum_{\substack{-m \le j \le m\\ j \ne 0}}\partial J(v_c, w_{t+j}, U)/\partial v_c \\
\text{(iii) } & \partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)/\partial v_w = 0 \mbox{ when } w \ne c
\end{align*}$$ 
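These three results translate directly into an accumulation loop over the context window. The following is a minimal sketch in plain Python, assuming the naive-softmax loss for each pair; the function names, the per-word list layout for `V_cols` and `U_cols`, and the toy values are all hypothetical choices for this illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def naive_softmax_grads(v_c, o, U_cols):
    """Return (dJ/dv_c, list of dJ/du_w per word) for one (center, outside) pair."""
    y_hat = softmax([dot(u, v_c) for u in U_cols])
    diff = [p - (1.0 if w == o else 0.0) for w, p in enumerate(y_hat)]
    grad_v = [sum(diff[w] * U_cols[w][i] for w in range(len(U_cols)))
              for i in range(len(v_c))]
    grad_u = [[d * v_i for v_i in v_c] for d in diff]
    return grad_v, grad_u

def skipgram_grads(center, outside_words, V_cols, U_cols):
    n = len(V_cols[center])
    grad_V = [[0.0] * n for _ in V_cols]   # stays zero for every w != center
    grad_U = [[0.0] * n for _ in U_cols]
    for o in outside_words:
        g_v, g_u = naive_softmax_grads(V_cols[center], o, U_cols)
        # (ii): gradients w.r.t. v_c add up over the window.
        grad_V[center] = [a + b for a, b in zip(grad_V[center], g_v)]
        # (i): gradients w.r.t. U add up over the window.
        for w, g in enumerate(g_u):
            grad_U[w] = [a + b for a, b in zip(grad_U[w], g)]
    return grad_V, grad_U

# Toy setup: 4 words, 2-dimensional embeddings, window containing words 1 and 3.
V_cols = [[0.3, -0.7], [0.1, 0.2], [-0.4, 0.5], [0.6, 0.0]]
U_cols = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.6]]
center = 0
outside_words = [1, 3]
grad_V, grad_U = skipgram_grads(center, outside_words, V_cols, U_cols)
```

Only the entry of `grad_V` for the center word is ever updated, which mirrors result (iii).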

__*Once you’re done*__: Given that you computed the derivatives of $J(v_c, w_{t+j}, U)$ with respect to all the model parameters $U$ and $V$ in parts (a) to (c), you have now computed the derivatives of the full loss function $J_{\text{skip-gram}}$ with  respect to all parameters. You’re ready to implement `word2vec`!  

## 2. Coding: Implementing word2vec (18 points)  
- Implement methods in `word2vec.py`  
- Implement SGD optimizer in `sgd.py`
- Train word vectors using Stanford Sentiment Treebank (SST) dataset  

![png](https://drive.google.com/uc?id=1n7uFtEPHFuX9rppVw_duB7pKQrD-vu-J)