## 1. Using word2vec for Deep Representation Learning on Graphs

The Skip-gram model was introduced in the first word2vec paper as a log-linear model. Its maximum log-likelihood objective can be expressed as:

$$
L = \frac{1}{T} \sum_{t=1}^T \sum_{-j_c \leq j \leq j_c,\, j \neq 0} \log q(w_{t + j} | w_t)
$$

where $j_c$ is the size of the context window and $q(w_{t + j} | w_t)$ is the softmax $\sigma$ of the inner product of the word vector $\vec{w_t}$ and the context vector $\vec{c_{t + j}}$:

$$
q(w_{t + j} | w_t) = \sigma (\vec{w_t} \times \vec{c_{t + j}}) = \frac{e^{\vec{w_t} \times \vec{c_{t + j}}}}{\sum_{k=1}^K e^{\vec{w_t} \times \vec{c_k}}}
$$

The sum in the denominator runs over all $K$ candidate context words. This particular form of the softmax function is that of multinomial logistic regression, a log-linear model.
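The softmax over inner products can be sketched in a few lines of plain Python. This is only an illustration: the 2-dimensional embeddings and the helper names `softmax` and `context_probabilities` are assumptions made for the example, not part of the exercise.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_probabilities(w_t, contexts):
    """q(c_k | w_t) for every candidate context vector c_k."""
    # Inner product of the word vector with each context vector.
    scores = [sum(a * b for a, b in zip(w_t, c_k)) for c_k in contexts]
    return softmax(scores)

# Hypothetical embeddings: one target word and K = 3 candidate contexts.
w_t = [0.5, -1.0]
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = context_probabilities(w_t, contexts)
print(q)
```

The context whose vector has the largest inner product with `w_t` receives the largest probability, and the probabilities sum to one.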

#### Task a

One way to fit the logistic / softmax function to data is to minimize the cross-entropy. The cross-entropy is the negative expectation of $\log q$ under the distribution $p$, and can be formulated as

$$
H(p, q) = - E_p (\log q)
$$

where $E_p (\log q)$ is the expected value of $\log q$ with respect to the distribution $p$, or alternatively as the entropy of $p$ plus the Kullback-Leibler divergence $D_{KL} (p || q)$ of $p$ from $q$.

$$
H(p, q) = H(p) + D_{KL} (p || q)
$$

Show that these two definitions of the cross-entropy are equivalent (refer to Exercise Sheet 04 for a definition of the KL divergence).

$$
\begin{align*}
H(p, q) &= H(p) + D_{KL} (p || q) = \\
&= - \sum_{i \in \Omega} p(i) \log p(i) - \sum_{i \in \Omega} p(i) \log \frac{q(i)}{p(i)} = \\
&= - \sum_{i \in \Omega} p(i) \log p(i) - \sum_{i \in \Omega} p(i) \log q(i) + \sum_{i \in \Omega} p(i) \log p(i) = \\
&= - \sum_{i \in \Omega} p(i) \log q(i) = \\
&= - E_p (\log q)
\end{align*}
$$
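The identity can also be checked numerically. The sketch below uses two arbitrary example distributions `p` and `q` (chosen here purely for illustration) and compares both definitions of the cross-entropy.

```python
import math

def entropy(p):
    """H(p) = -sum_i p(i) log p(i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(i) log(p(i) / q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -E_p[log q]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Arbitrary example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two values agree up to floating-point error
```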

#### Task b

When used as a loss function, the cross-entropy takes the names cross-entropy loss, log loss, or logistic loss. The cross-entropy loss has a unique relationship to logistic regression. For example, the gradient of the cross-entropy loss for logistic regression has the same form as the gradient of the squared error loss for linear regression. We can minimize this loss function with gradient descent in the neural networks discussed in the lectures.

It is often useful to frame the problem as maximizing an objective instead of minimizing a loss. Maximum likelihood estimation aims to maximize the average log likelihood of the observed outcomes. The estimated probability of outcome $x$ is $q(x)$ while its true probability is $p(x)$. For $N$ conditionally independent trials, in which outcome $x$ occurs $N p(x)$ times, the likelihood of the parameters $\beta$ of $q(x)$ is given by the probability of the observed outcomes under the estimated $q(x)$:

$$
L = \prod_x q(x)^{N p(x)}
$$

Show that the average log likelihood is equivalent to the negative cross-entropy loss. What does this mean for the Skip-gram model?

$$
\begin{align*}
\frac{1}{N} \log L &= \frac{1}{N} \log \prod_x q(x)^{N p(x)} = \\
&= \frac{1}{N} \sum_x \log q(x)^{N p(x)} = \\
&= \frac{1}{N} N \sum_x p(x) \log q(x) = \\
&= \sum_x p(x) \log q(x) = \\
&= - H(p, q)
\end{align*}
$$

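This connects the cross-entropy loss, maximum likelihood estimation, and the negative log likelihood: maximizing the average log likelihood is the same as minimizing the cross-entropy loss. For the Skip-gram model, it means that training maximizes the likelihood of the observed context words given the target words or, equivalently, minimizes the cross-entropy between the observed word-context distribution and the model's softmax output.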
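As a quick sanity check of this equivalence, the following sketch computes both quantities for a small made-up sample. The outcome counts and the model distribution `q` are assumptions for illustration only.

```python
import math

# Hypothetical observation counts N * p(x) for three outcomes.
counts = {"a": 5, "b": 3, "c": 2}
# Hypothetical model distribution q(x).
q = {"a": 0.4, "b": 0.4, "c": 0.2}

N = sum(counts.values())
p = {x: n / N for x, n in counts.items()}  # empirical distribution p(x)

# log L = sum_x N p(x) log q(x)
log_likelihood = sum(n * math.log(q[x]) for x, n in counts.items())
avg_log_likelihood = log_likelihood / N

# H(p, q) = -sum_x p(x) log q(x)
cross_entropy = -sum(p[x] * math.log(q[x]) for x in p)

print(avg_log_likelihood, -cross_entropy)  # equal up to rounding
```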

#### Task c

Consider an undirected graph with no self-loops or multi-edges, with adjacency matrix $A$. Write down the probability $Prob(i \to j)$ of walking from node $i$ to node $j$ in an infinite random walk.

In a single step of a random walk, the transition probability from node $i$ to node $j$ is

$$
P_{ij} = \frac{A_{ij}}{d_i}
$$

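where $d_i = \sum_k A_{ik}$ is the degree of node $i$. We are interested in an infinite random walk, however. On a connected, non-bipartite graph the probability of being at node $j$ then no longer depends on the start node $i$ and converges to the stationary probability: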

$$
Prob(i \to j) = \pi_j = \frac{d_j}{\sum_{k=1}^{|V|} d_k}
$$
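The stationary distribution can be verified on a small example. The graph below (a triangle with one pendant node) is a hypothetical choice that is connected and non-bipartite; the sketch checks that $\pi P = \pi$ and that a walk started at a single node converges to $\pi$.

```python
# Hypothetical undirected graph: triangle 0-1-2 plus pendant node 3 attached to 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
n = len(A)
degrees = [sum(row) for row in A]

# One-step transition matrix P_ij = A_ij / d_i.
P = [[A[i][j] / degrees[i] for j in range(n)] for i in range(n)]

# Candidate stationary distribution pi_j = d_j / sum_k d_k.
pi = [d / sum(degrees) for d in degrees]

def step(dist, P):
    """One random-walk step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    size = len(P)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]

# Start at node 0 with certainty and iterate the walk.
dist = [1.0, 0.0, 0.0, 0.0]
for _ in range(200):
    dist = step(dist, P)

print(pi)
print(dist)  # approaches pi regardless of the start node
```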

#### Task d

We can rewrite the objective of the Skip-gram model to consider the frequency $Pr(w, c)$ that a word-context pair $(w, c)$ occurs in an infinitely large corpus.

$$
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \log q(w_{t + j}|w_t) = \sum_{h, l \in V} Pr(h \to l) \log \sigma (\vec{w_h} \times \vec{c_l}) = \sum_{h, l \in V} Pr(h \to l) \log \sigma (x_{hl})
$$

where $x_{hl} = \vec{w_h} \times \vec{c_l}$ and $V$ is the set of vertices. Consider its corresponding cross-entropy loss $H$ and argue that

$$
\frac{\partial}{\partial x_{ij}} H = - \frac{\partial}{\partial x_{ij}} \sum_{h, l \in V} \delta_{i, h} Pr(h \to l) \log \sigma (\vec{w_h} \times \vec{c_l}) = - \frac{\partial}{\partial x_{ij}} \sum_{l \in V} Pr(i \to l) \log \sigma (\vec{w_i} \times \vec{c_l})
$$

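The softmax normalizes over the contexts of a single word, so only the terms with $h = i$ depend on $x_{ij}$; the derivatives of all other terms vanish. This is what the Kronecker delta encodes: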

$$
\delta_{i, h} = \begin{cases}
       1 &\quad\text{if } i = h\\
       0 &\quad\text{else } i \neq h \\
     \end{cases}
$$

$$
\begin{align*}
- \frac{\partial}{\partial x_{ij}} \sum_{h, l \in V} \delta_{i, h} Pr(h \to l) \log \sigma (\vec{w_h} \times \vec{c_l}) &= 
- \frac{\partial}{\partial x_{ij}} \left( 
  \sum_{l \in V, h = i} 1 \times Pr(h \to l) \log \sigma (\vec{w_h} \times \vec{c_l}) + 
  \sum_{h, l \in V, h \neq i} 0 \times  Pr(h \to l) \log \sigma (\vec{w_h} \times \vec{c_l}) \right) = \\
&= - \frac{\partial}{\partial x_{ij}} \sum_{l \in V, h = i} 1 \times Pr(h \to l) \log \sigma (\vec{w_h} \times \vec{c_l}) = \\
&= - \frac{\partial}{\partial x_{ij}} \sum_{l \in V} Pr(i \to l) \log \sigma (\vec{w_i} \times \vec{c_l})
\end{align*}
$$

#### Task e

Evaluate the derivative and substitute in your solution for $Pr(i \to j)$. Thereby solve for the $\sigma (\vec{w_i} \times \vec{c_j})$ that minimizes $H$. Considering your solution, argue the utility of the softmax function $\sigma$. Defining $\sigma_{ij} = \sigma (\vec{w_i} \times \vec{c_j})$, the following derivative may be useful:

$$
\frac{\partial}{\partial x_{ij}} \log \sigma_{il} = \delta_{jl} - \sigma_{ij}
$$

where

$$
\delta_{j, l} = \begin{cases}
       1 &\quad\text{if } j = l\\
       0 &\quad\text{if } j \neq l \\
     \end{cases}
$$

$$
\begin{align*}
\frac{\partial}{\partial x_{ij}} H &= - \frac{\partial}{\partial x_{ij}} \sum_{l \in V} Pr(i \to l) \log \sigma (\vec{w_i} \times \vec{c_l}) = \\
&=  - \sum_{l \in V, l = j} Pr(i \to l) (1 - \sigma_{ij}) - \sum_{l \in V, l \neq j} Pr(i \to l) (- \sigma_{ij}) = \\
&= - Pr(i \to j) (1 - \sigma_{ij}) + \sigma_{ij} \sum_{l \in V, l \neq j} Pr(i \to l) = \\
&= - Pr(i \to j) (1 - \sigma_{ij}) + \sigma_{ij} (1 - Pr(i \to j)) = \\
&= - Pr(i \to j) + \sigma_{ij} = 0 \\
\Rightarrow \sigma_{ij} &= Pr(i \to j) = \pi_j = \frac{d_j}{\sum_{k=1}^{|V|} d_k}
\end{align*}
$$

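So this value minimizes $H$: the cross-entropy is convex in the $x_{ij}$, so the stationary point is a minimum. The softmax is useful here because its outputs are positive and sum to one over $j$, so each row $\sigma_{i \cdot}$ is always a valid probability distribution and can match $Pr(i \to \cdot)$ exactly.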
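The gradient $\partial H / \partial x_{ij} = \sigma_{ij} - Pr(i \to j)$ can be checked against finite differences, and plain gradient descent on the logits can be used to confirm that the softmax output approaches the target distribution. The target `p`, the starting logits `x`, the step size, and the iteration count below are arbitrary choices made for this sketch.

```python
import math

def softmax(x):
    """Numerically stable softmax over one row of logits."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, x):
    """H = -sum_l p_l * log softmax(x)_l for one row of logits x."""
    s = softmax(x)
    return -sum(pl * math.log(sl) for pl, sl in zip(p, s))

p = [0.25, 0.25, 0.375, 0.125]  # example target distribution Pr(i -> .)
x = [0.3, -1.2, 0.8, 0.0]       # arbitrary starting logits x_{i.}

# Analytic gradient: sigma_ij - Pr(i -> j).
analytic = [s - pl for s, pl in zip(softmax(x), p)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for j in range(len(x)):
    up = list(x)
    down = list(x)
    up[j] += eps
    down[j] -= eps
    numeric.append((cross_entropy(p, up) - cross_entropy(p, down)) / (2 * eps))

# Gradient descent on the logits drives softmax(x) towards p.
for _ in range(5000):
    grad = [s - pl for s, pl in zip(softmax(x), p)]
    x = [v - 0.5 * g for v, g in zip(x, grad)]
fitted = softmax(x)

print(analytic)
print(numeric)
print(fitted)
```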

#### Task f

$x_{ij} = \vec{w_i} \times \vec{c_j}$ can be considered as the entries of a matrix product: $\vec{w_i} \times \vec{c_j} = \sum_{k = 1}^d W_{ik} C_{kj} = M_{ij}$. The matrices $W$ and $C$ are known as the word and context embeddings, respectively, and it is $W$ that is usually used as the resulting word (node) embeddings for the Skip-gram model (DeepWalk / node2vec). Argue what this simplified version of DeepWalk implicitly factorizes, and argue the utility of the Skip-gram model / similar NLP methods for embedding walks on graphs vs [any form of] explicit matrix factorization.

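1. In the simple case the matrix $T$ of walk probabilities, $T_{ij} = Pr(i \to j)$, is factorized. By Task e, the row-wise softmax of $M = W \times C$ reproduces $T$ at the optimum, so with $X$ the one-hot encoding of the input nodes (the identity matrix when every node is an input):

$$
T = \sigma (X \times W \times C)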
$$

2. From the perspective of representation learning, node2vec and random-walk techniques capture the structure of the network, and the resulting embeddings are easy to use in downstream ML tasks such as link prediction and node classification. For pure dimensionality reduction, on the other hand, explicit matrix factorization techniques are the better choice because of their good interpretability.
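To make the implicit factorization concrete, the sketch below constructs one exact solution by hand rather than by training. It assumes the Task c/e result $\sigma_{ij} = \pi_j$ and a hypothetical degree sequence; under these assumptions a rank-1 choice ($d = 1$) with $W_{i1} = 1$ and $C_{1j} = \log \pi_j$ already satisfies $\sigma(W \times C) = T$. The helper `matmul` is written out only to keep the example dependency-free.

```python
import math

def softmax(x):
    """Numerically stable softmax over one row of logits."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(W, C):
    """Plain matrix product M_ij = sum_k W_ik * C_kj."""
    return [
        [sum(W[i][k] * C[k][j] for k in range(len(C))) for j in range(len(C[0]))]
        for i in range(len(W))
    ]

# Hypothetical degree sequence of a small undirected graph.
degrees = [2, 2, 3, 1]
n = len(degrees)
pi = [d / sum(degrees) for d in degrees]

# Hand-built rank-1 embeddings: W is n x 1, C is 1 x n.
W = [[1.0] for _ in range(n)]
C = [[math.log(p_j) for p_j in pi]]

M = matmul(W, C)
sigma = [softmax(row) for row in M]  # row-wise softmax of M = W C

for row in sigma:
    print(row)  # every row reproduces pi
```

In this simplified setting every node receives the same word embedding, which illustrates that the model only recovers as much structure as the target probabilities $Pr(i \to j)$ contain.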