# Semi-supervised learning using GANs, Part 1

## Understanding the loss function

Suppose that we have $N$ training examples $\mathbf{x}_i$ with labels $y_i$, $i = 1 \ldots N$, drawn from $K$ classes $1 \ldots K$. We also have $M$ examples generated by the GAN, which we label with an additional class $K+1$. We will now explain how we can train the model with softmax cross-entropy loss, just as in the fully supervised case. The loss for semi-supervised learning is given by the sum of two terms, one which uses the labelled data and another which uses the generated examples.

$$L = -\mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x},y)} \log(p_{\text{model}}(y|\mathbf{x})) - \mathbb{E}_{\mathbf{x} \sim p_\text{G}(\mathbf{x})} \log(p_{\text{model}}(y = K+1|\mathbf{x}))$$

### Loss function using labelled data

To begin with, let us be clear about how the first term of the loss relates to the usual average cross-entropy loss function that we use for supervised learning:

$$L' = -\frac{1}{N}\sum_i\log(p_{\text{model}}(y = y_i|\mathbf{x} = \mathbf{x}_i))$$

We can show that $L'$ is an approximation of the first term of $L$, using the training set distribution $\hat{p}_\text{data}$ as an estimate for the true data distribution. The class probabilities predicted by the model are given by

$$p_{\text{model}}(y = k|\mathbf{x}) = \text{SoftMax}(\mathbf{l})_k$$

where $\mathbf{l} = \{l_1 \ldots l_K\}$ is the vector of logits output by the model with input $\mathbf{x}$. Let us approximate the joint probability of $\mathbf{x}$ and $y$ under the data distribution by the empirical distribution of the training set, where $r(\mathbf{x})$ is the true class of $\mathbf{x}$:

$$\hat{p}_\text{data}(\mathbf{x}, y) = \hat{p}_\text{data}(y|\mathbf{x})\hat{p}_\text{data}(\mathbf{x}) = \frac{1}{N}\cdot\mathbb{1}_{y=r(\mathbf{x})} = \left\{\begin{array}{ll}
                  \frac{1}{N},\text{ }y = r(\mathbf{x}) \\
                  0,\text{  }y\neq r(\mathbf{x})\\
                \end{array}
              \right.$$



Although $\mathbf{x}$ is actually continuous we approximate it by the discrete datapoints $\mathbf{x}_i$ so that the expectation with respect to $\hat{p}_\text{data}(\mathbf{x})$ is written as a sum rather than an integral. Note also that $r(\mathbf{x}_i) = y_i$. We can see that 

$$-\mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)} \log(p_{\text{model}}(y|\mathbf{x})) 
\\ \approx-\mathbb{E}_{(\mathbf{x},y) \sim \hat{p}_\text{data}(\mathbf{x}, y)} \log(p_{\text{model}}(y|\mathbf{x}))
\\=-\mathbb{E}_{\hat{p}_\text{data}(\mathbf{x})}\left[\mathbb{E}_{\hat{p}_\text{data}(y|\mathbf{x})}\log(p_{\text{model}})\right]
\\=\mathbb{E}_{\hat{p}_\text{data}(\mathbf{x})}H(\hat{p}_{\text{data}}(y|\mathbf{x}),p_{\text{model}})
\\=-\sum_\mathbf{x} \sum_y \frac{1}{N}\cdot\mathbb{1}_{y=r(\mathbf{x})}\log(p_{\text{model}}(y|\mathbf{x})) 
\\=-\sum \limits_{i=1}^{N} \sum \limits_{k=1}^{K} \frac{1}{N}\cdot\mathbb{1}_{k=y_i}\log(p_{\text{model}}(y = k|\mathbf{x} = \mathbf{x}_i) ) 
\\=-\frac{1}{N}\sum \limits_{i=1}^{N} \log(p_{\text{model}}(y = y_i|\mathbf{x} = \mathbf{x}_i) ) = L'$$
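The equivalence between the indicator-weighted double sum and the plain average can be checked numerically. The snippet below is a minimal sketch in plain Python; the function names and the toy logits are illustrative assumptions, and labels are 0-based here (unlike the $1 \ldots K$ convention in the text).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_loss(logits_batch, labels):
    """Average cross-entropy L' over N labelled examples."""
    n = len(labels)
    return -sum(math.log(softmax(l)[y]) for l, y in zip(logits_batch, labels)) / n

def supervised_loss_double_sum(logits_batch, labels):
    """The same loss written as the double sum over i and k with the indicator."""
    n = len(labels)
    total = 0.0
    for l, y in zip(logits_batch, labels):
        for k, p in enumerate(softmax(l)):
            indicator = 1.0 if k == y else 0.0  # 1_{k = y_i}
            total += indicator * math.log(p) / n
    return -total

# Toy data (assumed values): N = 3 examples, K = 3 classes.
logits_batch = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]]
labels = [0, 2, 1]

loss_direct = supervised_loss(logits_batch, labels)
loss_double = supervised_loss_double_sum(logits_batch, labels)
print(loss_direct, loss_double)
```

Both functions return the same value because the indicator keeps only the $k = y_i$ term of the inner sum.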

### Loss function using generated data

By defining $p_G(\mathbf{x}, y) = \mathbb{1}_{y=K+1}\cdot p_G(\mathbf{x})$, it is easy to see that the additional term in the loss function

$$\mathbb{E}_{\mathbf{x} \sim p_\text{G}(\mathbf{x})} \log(p_{\text{model}}(y = K+1|\mathbf{x})) = \mathbb{E}_{\mathbf{x} \sim p_\text{G}(\mathbf{x})} \sum_{k=1}^{K+1}\log(p_{\text{model}}(y = k|\mathbf{x}))\,p_G(y = k|\mathbf{x}) = \mathbb{E}_{(\mathbf{x},y) \sim p_\text{G}(\mathbf{x}, y)} \log(p_{\text{model}}(y|\mathbf{x}))$$

also represents (up to sign) an average softmax cross-entropy loss, which can be approximated over the $M$ generated examples $\mathbf{x}_j$, each with label $y_j = K+1$, by the following:

$$\frac{1}{M}\sum_j\log(p_{\text{model}}(y = y_j|\mathbf{x} = \mathbf{x}_j))$$
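As an illustration, this Monte Carlo estimate can be computed directly from the logits of the generated examples. The sketch below assumes each logit vector has $K+1$ entries with the extra "generated" class stored last; the helper names and numbers are hypothetical.

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed with the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def generated_term(logits_batch):
    """Estimate of E_{x ~ p_G} log p_model(y = K+1 | x) over M generated examples.

    Assumption: the last logit of each vector corresponds to class K+1.
    """
    m = len(logits_batch)
    return sum(log_softmax(l)[-1] for l in logits_batch) / m

# Toy logits (assumed values) for M = 2 generated examples, K = 3.
fake_logits = [[0.0, 0.0, 0.0, 2.0], [1.0, -1.0, 0.5, 0.0]]

estimate = generated_term(fake_logits)
loss_contribution = -estimate  # this term enters the loss L with a minus sign
print(estimate, loss_contribution)
```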

### Interpretation of the loss function

From now onwards, for clarity and consistency with the paper, we will write the true loss using $p_\text{data}$ rather than the approximation $\hat{p}_\text{data}$ that is used in practice. Now let us define an indicator variable $I_K = \mathbb{1}_{y\leq K}$. Then for the pair of variables $(y, I_K)$ we have:

$$(y, I_K) = \left\{\begin{array}{ll}
                  (y, 1),\text{ }y < K+1 \\
                  (y, 0),\text{  }y=K+1\\
                \end{array}
              \right.$$

It is then easy to see that the joint probability of $(y, I_K)$ conditioned on $\mathbf{x}$ can be expressed as

$$p_{\text{model}}(y, y < K+1|\mathbf{x}) = p_{\text{model}}(y, I_K = 1|\mathbf{x}) = \left\{
                \begin{array}{ll}
                  p_{\text{model}}(y|\mathbf{x}),\text{ }y < K+1 \\
                  0,\text{  }y=K+1\\
                \end{array}
              \right.$$

In the first term of the loss it is always the case that $y < K + 1$, since this term only considers samples from the data distribution, so using the identity above we can rewrite it as

$$\mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)} \log(p_{\text{model}}(y|\mathbf{x})) = \mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)} \log(p_{\text{model}}(y, I_K=1|\mathbf{x})) 
\\= \mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)} \log(p_{\text{model}}(y|I_K=1, \mathbf{x})\cdot p_{\text{model}}(I_K=1|\mathbf{x})) 
\\= \mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)} \left[\log(p_{\text{model}}(y|I_K=1, \mathbf{x})) + \log( p_{\text{model}}(I_K=1|\mathbf{x}))\right]$$

Since the value of $I_K$ is known, $p_{\text{model}}(I_K=1|\mathbf{x})$ is constant with respect to $y$, so we only need to take the expectation of the second log term over $\mathbf{x}$:

$$= \mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)}\log(p_{\text{model}}(y|I_K=1, \mathbf{x})) + \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})}\mathbb{E}_{y \sim p_\text{data}(y|\mathbf{x})}\log( p_{\text{model}}(I_K=1|\mathbf{x})) 
\\= \mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)}\log(p_{\text{model}}(y|I_K=1, \mathbf{x})) + \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})}\log( p_{\text{model}}(I_K=1|\mathbf{x})) 
\\= \mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)}\log(p_{\text{model}}(y|y < K + 1, \mathbf{x})) + \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})}\log( 1 - p_{\text{model}}(y = K + 1|\mathbf{x})) $$
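The pointwise identity behind this decomposition, $\log p_{\text{model}}(y|\mathbf{x}) = \log p_{\text{model}}(y|y < K+1, \mathbf{x}) + \log(1 - p_{\text{model}}(y = K+1|\mathbf{x}))$ for $y < K+1$, can be verified numerically. The following is a small sketch with made-up logits (0-based class indices, class $K+1$ stored last); it also checks that the conditional distribution is simply a softmax over the first $K$ logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits (assumed values): K = 3 real classes plus the generated class.
logits = [1.2, -0.3, 0.7, 0.4]
K = len(logits) - 1

p = softmax(logits)
p_fake = p[K]            # p_model(y = K+1 | x)
p_real = 1.0 - p_fake    # p_model(I_K = 1 | x)

# Largest deviation between the two sides of the identity over y = 1..K.
max_gap = 0.0
for y in range(K):
    p_cond = p[y] / p_real  # p_model(y | y < K+1, x)
    lhs = math.log(p[y])
    rhs = math.log(p_cond) + math.log(1.0 - p_fake)
    max_gap = max(max_gap, abs(lhs - rhs))

# The conditional distribution equals a softmax over the first K logits only.
p_first_k = softmax(logits[:K])
print(max_gap)
```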


Thus the overall loss function can be written as 

$$L = -\mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}} \log(p_{\text{model}}(y|\mathbf{x})) - \mathbb{E}_{\mathbf{x} \sim p_\text{G}} \log(p_{\text{model}}(y = K+1|\mathbf{x}))
\\=-\mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)}\log(p_{\text{model}}(y|y < K + 1, \mathbf{x})) \\- \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})}\log( 1 - p_{\text{model}}(y = K + 1|\mathbf{x})) - \mathbb{E}_{\mathbf{x} \sim p_\text{G}} \log(p_{\text{model}}(y = K+1|\mathbf{x}))
\\=L_{supervised} + L_{unsupervised}$$

$$ L_{supervised} = -\mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x}, y)}\log(p_{\text{model}}(y|y < K + 1, \mathbf{x}))
$$

$$L_{unsupervised} = - \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})}\log( 1 - p_{\text{model}}(y = K + 1|\mathbf{x})) - \mathbb{E}_{\mathbf{x} \sim p_\text{G}} \log(p_{\text{model}}(y = K+1|\mathbf{x}))$$

### Unsupervised loss as GAN loss

Recall that the discriminator $D$ in the GAN takes as input $\mathbf{x}$, which is either a real example $\mathbf{x}_{\text{data}}$ or the output $G(\mathbf{z})$ of the generator $G$, and predicts the probability that the input is real, i.e. $D(\mathbf{x}) = p_{model}(\mathbf{x} \text{ is real})$. For the model above, the predicted probability that an input $\mathbf{x}$ belongs to class $K+1$ is equivalently the predicted probability that $\mathbf{x}$ is fake, which means that:

$$p_{model}(\mathbf{x} \text{ is real}) = 1 - p_{\text{model}}(y = K + 1|\mathbf{x})$$

Letting $D(\mathbf{x}) = 1 - p_{\text{model}}(y = K + 1|\mathbf{x})$ and substituting this in $L_{unsupervised}$ recovers the GAN loss function:

$$L_{unsupervised} = - \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})}\log(D(\mathbf{x})) - \mathbb{E}_{\mathbf{x} \sim p_\text{G}} \log(1-D(\mathbf{x})) \\= - \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})}\log(D(\mathbf{x})) - \mathbb{E}_{\mathbf{z} \sim \text{noise}} \log(1-D(G(\mathbf{z})))$$
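To make the correspondence concrete, here is a minimal sketch that evaluates $L_{unsupervised}$ from classifier logits by setting $D(\mathbf{x}) = 1 - p_{\text{model}}(y = K+1|\mathbf{x})$. The function names and toy logits are assumptions for illustration, with the class $K+1$ logit stored last.

```python
import math

def discriminator(logits):
    """D(x) = 1 - p_model(y = K+1 | x), computed from K+1 logits (fake class last)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return sum(exps[:-1]) / sum(exps)

def unsupervised_loss(real_logits, fake_logits):
    """Monte Carlo estimate of -E_data log D(x) - E_G log(1 - D(x))."""
    real_term = sum(math.log(discriminator(l)) for l in real_logits) / len(real_logits)
    fake_term = sum(math.log(1.0 - discriminator(l)) for l in fake_logits) / len(fake_logits)
    return -real_term - fake_term

# An uninformed model with K = 1 and equal logits gives D = 0.5 everywhere.
loss_uninformed = unsupervised_loss([[0.0, 0.0]], [[0.0, 0.0]])

# A model that separates real from generated inputs (assumed toy logits, K = 2).
loss_confident = unsupervised_loss([[5.0, 5.0, -5.0]], [[-5.0, -5.0, 5.0]])
print(loss_uninformed, loss_confident)
```

With $D = 0.5$ on every input the loss equals $2\log 2$, the familiar value of the GAN discriminator loss for a discriminator that cannot tell real from generated samples.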

### Optimization of the loss function

First let us explicitly write out the expectations in the loss function:

$$L = -\mathbb{E}_{(\mathbf{x},y) \sim p_\text{data}(\mathbf{x},y)} \log(p_{\text{model}}(y|\mathbf{x})) - \mathbb{E}_{\mathbf{x} \sim p_\text{G}(\mathbf{x})} \log(p_{\text{model}}(y = K+1|\mathbf{x}))
\\=-\int d\mathbf{x} \sum\limits_{k=1}^{K}\log(p_{\text{model}}(y=k|\mathbf{x}))\cdot p_{\text{data}}(\mathbf{x}, y=k) -\int d\mathbf{x} \log(p_{\text{model}}(y = K+1|\mathbf{x}))\cdot p_\text{G}(\mathbf{x})$$

Now we will define a joint distribution $q(\mathbf{x},y)$ that combines $p_\text{data}$ and $p_\text{G}$:

$$q(\mathbf{x},y) = q(y|\mathbf{x}) q(\mathbf{x}) \propto \left\{\begin{array}{ll}
                   p_{\text{data}}(\mathbf{x},y) \text{,  }y < K+1 \\
                   p_\text{G}(\mathbf{x})\text{,  }y=K+1\\
                \end{array}
              \right.$$
             


Using this we can now rewrite the loss function in terms of the cross entropy between $q(y|\mathbf{x})$ and $p_{\text{model}}(y|\mathbf{x})$, i.e. $H(q(y|\mathbf{x}),p_{\text{model}})$:

$$L = -\int d\mathbf{x} \left( \sum\limits_{k=1}^{K}\log(p_{\text{model}}(y=k|\mathbf{x}))p_{\text{data}}(\mathbf{x}, y=k) + \log(p_{\text{model}}(y = K+1|\mathbf{x}))p_\text{G}(\mathbf{x}) \right) 
\\ \propto -\int d\mathbf{x} \left( \sum\limits_{k=1}^{K+1}\log(p_{\text{model}}(y=k|\mathbf{x}))\cdot q(\mathbf{x}, y=k)\right) 
\\ \propto -\int d\mathbf{x} \cdot q(\mathbf{x})\left( \sum\limits_{k=1}^{K+1}\log(p_{\text{model}}(y=k|\mathbf{x}))\cdot q(y=k|\mathbf{x})\right) 
\\ \propto -\mathbb{E}_{q(\mathbf{x})}\left[\mathbb{E}_{q(y|\mathbf{x})}\log(p_{\text{model}}(y|\mathbf{x}))\right] = \mathbb{E}_{q(\mathbf{x})}H(q(y|\mathbf{x}),p_{\text{model}})$$

Using the fact that the cross-entropy $H(q, p)$ is minimized over $p$ when $p = q$, we can see that the optimal model distribution is given by $p^*_{\text{model}}(y=k|\mathbf{x}) = q(y=k|\mathbf{x}) \implies p^*_{\text{model}}(y=k|\mathbf{x}) \propto q(\mathbf{x}, y=k).$

We can derive the expressions given in the paper for $\exp(l_k(\mathbf{x}))$ by noting that $p_{\text{model}}(y = k|\mathbf{x}) = \text{SoftMax}(\mathbf{l})_k \propto \exp(l_k(\mathbf{x}))$. Thus when $p_{\text{model}} = p_{\text{model}}^*$, $\exp(l_k(\mathbf{x})) \propto q(\mathbf{x}, y=k)$ so that

$$\exp(l_j(\mathbf{x})) = \left\{\begin{array}{ll}
                  c(\mathbf{x}) p_{\text{data}}(\mathbf{x},y=j) \text{,  }j < K+1 \\
                  c(\mathbf{x}) p_\text{G}(\mathbf{x})\text{,  }j=K+1\\
                \end{array}
              \right.$$
              
The proportionality above is with respect to $y$ only, so in the expression for $\exp(l_j(\mathbf{x}))$ above we have to include a term $c(\mathbf{x})$ which is a function of $\mathbf{x}$ but constant with respect to $y$.
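The role of $c(\mathbf{x})$ can be illustrated at a single fixed input $\mathbf{x}$: whatever positive value it takes, the softmax of the optimal logits reproduces $q(y|\mathbf{x})$. The sketch below uses made-up density values; a direct consequence it also checks is that the implied real-versus-generated probability is $p_\text{data}(\mathbf{x}) / (p_\text{data}(\mathbf{x}) + p_\text{G}(\mathbf{x}))$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed density values at one fixed x, with K = 2 real classes.
p_data_xy = [0.10, 0.30]  # p_data(x, y = k) for k = 1..K
p_g_x = 0.20              # p_G(x)

def optimal_logits(c):
    """Logits with exp(l_j) = c * p_data(x, y=j) for j <= K and c * p_G(x) for j = K+1."""
    return [math.log(c * v) for v in p_data_xy] + [math.log(c * p_g_x)]

# q(y | x): normalise the unnormalised joint over the K+1 classes.
q_unnorm = p_data_xy + [p_g_x]
q_cond = [v / sum(q_unnorm) for v in q_unnorm]

probs_a = softmax(optimal_logits(1.0))  # c(x) = 1
probs_b = softmax(optimal_logits(7.5))  # a different, arbitrary c(x)
d_star = 1.0 - probs_a[-1]              # probability that x is real
print(q_cond, probs_a, probs_b, d_star)
```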

### Using only $K$ outputs

It turns out that the classifier with $K+1$ outputs is over-parameterised. To see why we don't need $K+1$ outputs, first note that subtracting the same value from each of the logits does not change the softmax values:

$$\text{SoftMax}(\mathbf{l}(\mathbf{x}) - f(\mathbf{x}))_j = \frac{\exp(l_j(\mathbf{x}) - f(\mathbf{x}))}{\sum\limits_{k=1}^{K+1}\exp(l_k(\mathbf{x}) - f(\mathbf{x}))}
= \frac{A(\mathbf{x})\exp(l_j(\mathbf{x}))}{A(\mathbf{x})\sum\limits_{k=1}^{K+1}\exp(l_k(\mathbf{x}))} = \text{SoftMax}(\mathbf{l}(\mathbf{x}))_j
$$

where $A(\mathbf{x}) = \exp(-f(\mathbf{x}))$.

Let $l'_j(\mathbf{x}) = l_j(\mathbf{x}) - f(\mathbf{x})$ and $f(\mathbf{x}) = l_{K+1}(\mathbf{x})$ which amounts to setting $l'_{K+1}(\mathbf{x}) = 0$ for all $\mathbf{x}$. Then we can also express $D(\mathbf{x})$ without using $l_{K+1}$ as:

$$D(\mathbf{x}) = \frac{Z(\mathbf{x})}{Z(\mathbf{x}) + 1}$$

$$Z(\mathbf{x}) = \sum\limits_{k=1}^{K}\exp(l'_k(\mathbf{x}))$$
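A quick numerical check of this reparameterisation: shifting the logits so that the last one is zero leaves the softmax unchanged, and $Z/(Z+1)$ computed from the remaining $K$ shifted logits agrees with $1 - p_{\text{model}}(y = K+1|\mathbf{x})$. The sketch below uses assumed toy logits and hypothetical function names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_from_k_logits(shifted_logits):
    """D(x) = Z / (Z + 1), where Z sums exp over the K shifted logits l'_k."""
    z = sum(math.exp(l) for l in shifted_logits)
    return z / (z + 1.0)

# Toy logits (assumed values) for a classifier with K + 1 = 4 outputs.
logits = [1.5, -0.2, 0.3, 0.8]

# Subtract f(x) = l_{K+1}(x) from every logit, so that l'_{K+1} = 0.
shifted = [l - logits[-1] for l in logits]

d_full = 1.0 - softmax(logits)[-1]               # uses all K + 1 logits
d_k = discriminator_from_k_logits(shifted[:-1])  # uses only the K shifted logits
print(d_full, d_k)
```

Since $Z/(Z+1)$ is the logistic sigmoid of $\log Z(\mathbf{x})$, an implementation could equally compute $D(\mathbf{x})$ as a sigmoid applied to the log-sum-exp of the $K$ logits, which tends to be better behaved for large logits.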