# Softmax for word2vec
Compute the cost and gradients of the softmax cross-entropy loss in the word2vec model.
* Parameters: predicted, outputVectors, target

\begin{align}
\hat y_{o} =p(o|c) & = \frac {e^{u_{o}^\top v_{c}}} {\sum_{w=1}^{V} e^{u_{w}^\top v_{c}}} \\
J_{softmax-CE}(o, v_{c}, U) & = CE(y_{o},\hat y_{o}) = -y_{o} \log \hat y_{o} \\
& = -\log e^{u_{o}^\top v_{c}} + \log \sum_{w=1}^{V} e^{u_{w}^\top v_{c}} \\
& = -u_{o}^\top v_{c}  + \log \sum_{w=1}^{V} e^{u_{w}^\top v_{c}} \\
\frac {\partial{J}} {\partial{v_{c}}} & = -u_{o} + \sum_{w=1}^{V} \hat y_{w} u_{w} \\
\frac {\partial{J}} {\partial{u_{o}}} & = (\hat y_{o} -1) v_{c} \\
\frac {\partial{J}} {\partial{u_{w}}} & = \hat y_{w} v_{c}, \quad \text {for all}\; w \neq o
\end{align}

In the following code, **predicted** is the vector $v_{c}$, **outputVectors** is the matrix $U$ (one row $u_{w}^\top$ per word), **target** is the index $o$, **cost** is $J_{softmax-CE}(o, v_{c}, U)$, **gradPred** is $\frac {\partial{J}} {\partial{v_{c}}}$, and **grad** is the matrix whose rows are $\frac {\partial{J}} {\partial{u_{o}}}$ (row $o$) and $\frac {\partial{J}} {\partial{u_{w}}}$ (every other row $w$).
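
The three gradients can also be written in a compact matrix form, which is how the code below computes them. As an assumption introduced here for illustration, $y$ denotes the one-hot vector with $y_{o} = 1$ and all other entries $0$, $\hat y$ is the vector of all $\hat y_{w}$, and $U$ stacks the $u_{w}^\top$ as rows:

\begin{align}
\frac {\partial{J}} {\partial{v_{c}}} & = -u_{o} + \sum_{w=1}^{V} \hat y_{w} u_{w} = U^\top (\hat y - y) \\
\frac {\partial{J}} {\partial{U}} & = (\hat y - y)\, v_{c}^\top
\end{align}

Row $o$ of $\frac {\partial{J}} {\partial{U}}$ is $(\hat y_{o} - 1) v_{c}^\top$ and every other row $w$ is $\hat y_{w} v_{c}^\top$, which matches the per-vector gradients above.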

In [None]:
import numpy as np

In [None]:
target = 1  # zero-based index o of the target word (second row of outputVectors)

In [None]:
predicted = np.array([0.26726124, 0.53452248, 0.80178373])
predicted.shape

In [None]:
# Reshape v_c into a column vector of shape (3, 1)
predicted = np.reshape(predicted, (-1, 1))
predicted.shape

In [None]:
outputVectors = np.array([
       [0.26726124, 0.53452248, 0.80178373],
       [0.45584231, 0.56980288, 0.68376346],
       [0.50257071, 0.57436653, 0.64616234],
       [0.52342392, 0.57576631, 0.62810871]])
outputVectors.shape

In [None]:
# Scores u_w^T v_c for every word w, shape (4, 1)
output = np.dot(outputVectors, predicted)
output.shape

In [None]:
# Softmax denominator: sum over w of exp(u_w^T v_c)
sum_exp = np.sum(np.exp(output))
sum_exp
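
Exponentiating raw scores is fine for the small values used here, but it can overflow when scores are large. The following is a minimal sketch of the usual remedy, subtracting the maximum score before exponentiating. The helper name `log_sum_exp` and the example arrays are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def log_sum_exp(scores):
    """Hypothetical helper: log(sum(exp(scores))) with a max-shift for stability."""
    scores = np.asarray(scores, dtype=float)
    shift = np.max(scores)
    # log sum exp(s) = shift + log sum exp(s - shift); the largest exponent is now 0
    return shift + np.log(np.sum(np.exp(scores - shift)))

# On small scores the shifted and unshifted forms agree
small = np.array([0.5, 1.0, 1.5])
naive = np.log(np.sum(np.exp(small)))
stable = log_sum_exp(small)

# On large scores np.exp(1000.0) would overflow, but the shifted form stays finite
large = np.array([1000.0, 1001.0, 1002.0])
stable_large = log_sum_exp(large)
```

With this helper, the cost could be computed as `log_sum_exp(output) - output[target]` without forming `sum_exp` explicitly.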

In [None]:
# J = -u_o^T v_c + log(sum_exp); [0] extracts the scalar from the shape-(1,) result
cost = (np.log(sum_exp) - output[target])[0]
cost

In [None]:
# Softmax probabilities y_hat_w for every word w, shape (4, 1)
y_hats = np.exp(output) / sum_exp
y_hats.shape

In [None]:
# In-place update: y_hats now holds y_hat - y (no longer probabilities)
y_hats[target] -= 1

In [None]:
# dJ/dv_c = -u_o + sum_w y_hat_w u_w, shape (3, 1)
gradPred = np.dot(outputVectors.T, y_hats)
gradPred.shape

In [None]:
# Row w holds dJ/du_w; row `target` holds dJ/du_o. Shape (4, 3)
grad = np.dot(y_hats, predicted.T)
grad.shape
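
The cells above can be collected into a single function and checked against a numerical gradient. This is a sketch under stated assumptions, not the notebook's own implementation: the names `softmax_cost_and_gradients` and `numerical_gradient` are hypothetical, `predicted` is passed as a 1-D array rather than a column vector, and the scores are shifted by their maximum for numerical stability (which leaves the softmax unchanged).

```python
import numpy as np

def softmax_cost_and_gradients(predicted, target, output_vectors):
    """Hypothetical wrapper around the steps shown in the cells above.

    predicted: v_c, shape (d,); target: index o; output_vectors: U, shape (V, d).
    Returns (cost, grad_pred of shape (d,), grad of shape (V, d)).
    """
    scores = output_vectors @ predicted       # u_w^T v_c for every w
    scores = scores - np.max(scores)          # stability shift; softmax is unchanged
    exp_scores = np.exp(scores)
    y_hat = exp_scores / np.sum(exp_scores)   # softmax probabilities
    cost = -np.log(y_hat[target])             # cross-entropy with a one-hot target
    delta = y_hat.copy()
    delta[target] -= 1.0                      # y_hat - y
    grad_pred = output_vectors.T @ delta      # dJ/dv_c
    grad = np.outer(delta, predicted)         # row w is dJ/du_w
    return cost, grad_pred, grad

def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of the scalar f() with respect to array x.

    f takes no arguments and reads x, which is perturbed in place and restored.
    """
    num_grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f()
        x[idx] = old - h
        f_minus = f()
        x[idx] = old                          # restore the original entry
        num_grad[idx] = (f_plus - f_minus) / (2 * h)
    return num_grad

# Same data as in the cells above
v_c = np.array([0.26726124, 0.53452248, 0.80178373])
U = np.array([
    [0.26726124, 0.53452248, 0.80178373],
    [0.45584231, 0.56980288, 0.68376346],
    [0.50257071, 0.57436653, 0.64616234],
    [0.52342392, 0.57576631, 0.62810871]])
o = 1

cost, grad_pred, grad = softmax_cost_and_gradients(v_c, o, U)
cost_fn = lambda: softmax_cost_and_gradients(v_c, o, U)[0]
max_err_pred = np.max(np.abs(grad_pred - numerical_gradient(cost_fn, v_c)))
max_err_U = np.max(np.abs(grad - numerical_gradient(cost_fn, U)))
```

If the analytic gradients are correct, both `max_err_pred` and `max_err_U` should be close to zero, limited only by finite-difference and floating-point error.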