## Generative Adversarial Networks (GANs)

Generative Adversarial Nets (2014). Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.  
https://papers.nips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf  
https://www.youtube.com/watch?v=eyxmSmjmNS0

GANs are models with two parts, a **Generator** and a **Discriminator**. The goal is to produce realistic images: the Generator creates an image, and the Discriminator classifies each image it sees as either real or fake.

Generative models obtain p(y|x) by learning the joint p(x,y) = p(x|y)p(y). Discriminative models learn p(y|x) directly. We can sample from generative models because they learn p(x|y) and p(y), i.e. a distribution over the data x itself.
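
As a small illustration of the generative route, the sketch below recovers p(y|x) from p(x|y)p(y) with Bayes' rule and also draws samples from the joint. The probability tables are hypothetical values chosen only for this example.

```python
import random

# Hypothetical toy problem: x is a binary feature, y is a binary label.
p_y = {0: 0.6, 1: 0.4}                      # prior p(y)
p_x_given_y = {0: {0: 0.9, 1: 0.1},         # p(x | y=0)
               1: {0: 0.2, 1: 0.8}}         # p(x | y=1)

def joint(x, y):
    """p(x, y) = p(x | y) * p(y)."""
    return p_x_given_y[y][x] * p_y[y]

def posterior(y, x):
    """p(y | x) obtained from the joint via Bayes' rule."""
    evidence = sum(joint(x, label) for label in p_y)
    return joint(x, y) / evidence

def sample(rng):
    """Draw (x, y) from the joint: first y ~ p(y), then x ~ p(x | y)."""
    y = 1 if rng.random() < p_y[1] else 0
    x = 1 if rng.random() < p_x_given_y[y][1] else 0
    return x, y
```

A discriminative model would only provide something like `posterior`; the `sample` function is what the generative model adds.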

GANs are trained by a zero-sum game between the Generator and the Discriminator.

![](GAN.png)

#### Generator

The generator takes random noise from some probability distribution as input and tries to generate a realistic output image.

$G(z,\theta_g)$, where $\theta_g$ are the generator weights and $z \sim{N(0,1)} \text{ or } z \sim{U(-1,1)}$ is a noise sample from a Normal or Uniform distribution.
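
A minimal sketch of the two noise distributions and a stand-in generator. The one-layer `toy_generator`, its weight layout, and the function names are assumptions made for illustration; a real generator is a deep network.

```python
import math
import random

def sample_noise(batch_size, dim, kind="normal", rng=random):
    """Draw a batch of latent vectors z from N(0,1) or U(-1,1)."""
    if kind == "normal":
        draw = lambda: rng.gauss(0.0, 1.0)
    elif kind == "uniform":
        draw = lambda: rng.uniform(-1.0, 1.0)
    else:
        raise ValueError("kind must be 'normal' or 'uniform'")
    return [[draw() for _ in range(dim)] for _ in range(batch_size)]

def toy_generator(z, weights, bias):
    """Hypothetical one-layer G(z, theta_g): each output is tanh(w . z + b)."""
    return [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
            for row, b in zip(weights, bias)]
```

For example, `toy_generator(sample_noise(1, 2)[0], [[0.5, -0.3], [1.0, 0.2]], [0.0, 0.1])` maps one 2-dimensional noise vector to a 2-value "image" with entries in [-1, 1].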


#### Discriminator

The Discriminator is fed two kinds of input in alternation: real images from the training dataset and fake samples produced by the Generator. It classifies each input image as real or fake (i.e. produced by the Generator).

$D(x,\theta_d)$  

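Input: $G(z)$ with $z \sim{p_z(z)}$ (a fake sample, distributed as $p_g$) or $x \sim{p_{data}(x)}$ (a real sample)  
Output: the probability that the input is real (1 = real, 0 = fake)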



### Loss Functions

#### Discriminator Loss Function

The Discriminator wants to maximize the following objective (equivalently, minimize its negative, which is the binary cross-entropy loss).

$$L^{(D)} = \mathbb{E}_{x\sim{p_{data}}}log(D(x)) + \mathbb{E}_{z\sim{p_z}}log(1 - D(G(z)))$$


$\mathbb{E}_{x\sim{p_{data}}}log(D(x))$ is the term for inputs sampled from the real data.

$\mathbb{E}_{z\sim{p_z}}log(1 - D(G(z)))$ is the term for inputs sampled from the Generator.


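#### Generator Loss Function

The Generator wants to minimize the term that depends on it, i.e. to make the Discriminator assign a high probability of being real to generated samples:

$$L^{(G)} = \mathbb{E}_{z\sim{p_z}}log(1 - D(G(z)))$$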

#### GAN Objective Function

Combining the Generator and Discriminator loss functions gives the minimax objective:

$$\underset{G}{\mathrm{min}}\text{ }\underset{D}{\mathrm{max}}V(G,D) = \mathbb{E}_{x\sim{p_{data}}}log(D(x)) + \mathbb{E}_{z\sim{p_z}}log(1 - D(G(z)))$$
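
In practice the expectations are estimated by averaging over mini-batches. The sketch below does this from lists of discriminator outputs; the function names and the idea of passing precomputed outputs `d_real` and `d_fake` are assumptions made for illustration.

```python
import math

def value_function(d_real, d_fake):
    """Mini-batch estimate of V(G, D).

    d_real: values D(x) for real samples, each in (0, 1)
    d_fake: values D(G(z)) for generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Mini-batch estimate of E_z[log(1 - D(G(z)))], which G minimizes."""
    return sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
```

A discriminator that is confidently right (D(x) near 1, D(G(z)) near 0) pushes `value_function` toward 0 from below, while a generator that fools it (D(G(z)) near 1) makes `generator_loss` strongly negative.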

### Training

These networks are hard to train. They are trained in alternation: the Discriminator is updated while the Generator is held fixed, then the Generator is updated while the Discriminator is held fixed, and this is repeated over multiple epochs.

#### Training Loop

1. Repeat for k steps, where k is a hyperparameter (the original paper uses k = 1):  
    - Sample a mini-batch of m noise samples $(z^{(1)},z^{(2)},...,z^{(m)})$ and transform them with the Generator into fake samples $G(z^{(i)})$
    - Sample a mini-batch of m samples from the real data, $(x^{(1)},x^{(2)},...,x^{(m)})$
    - Update the discriminator weights $\theta_d$ by **ascending** the stochastic gradient of its objective:
$$\nabla_{\theta_d}\frac{1}{m}\sum_{i=1}^m[log(D(x^{(i)})) + log(1 - D(G(z^{(i)})))]$$
    - The generator weights $\theta_g$ are locked; only the discriminator weights $\theta_d$ are updated.
    
2. Sample a mini-batch of m noise samples $(z^{(1)},z^{(2)},...,z^{(m)})$ and transform them with the Generator
3. Update the generator weights $\theta_g$ by **descending** the stochastic gradient of its loss:
$$\nabla_{\theta_g}\frac{1}{m}\sum_{i=1}^m[ log(1 - D(G(z^{(i)})))]$$
    - The discriminator weights $\theta_d$ are locked; only the generator weights $\theta_g$ are updated.
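
The loop above can be sketched on a one-dimensional toy problem where the gradients are simple enough to write by hand. Everything specific here is an assumption for illustration: the linear generator G(z) = a*z + b, the logistic discriminator D(x) = sigmoid(w*x + c), real data drawn from N(4, 1), and the learning rate. It is not the setup from the paper, and such a simple D is not expected to make the generator match the data distribution exactly.

```python
import math
import random

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def d_objective(theta_d, reals, fakes):
    """(1/m) * sum[log D(x) + log(1 - D(G(z)))] with D(x) = sigmoid(w*x + c)."""
    w, c = theta_d
    return sum(math.log(sigmoid(w * x + c)) + math.log(1.0 - sigmoid(w * g + c))
               for x, g in zip(reals, fakes)) / len(reals)

def d_gradient(theta_d, reals, fakes):
    """Gradient of d_objective with respect to (w, c)."""
    w, c = theta_d
    gw = gc = 0.0
    for x, g in zip(reals, fakes):
        dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
        gw += (1.0 - dx) * x - dg * g
        gc += (1.0 - dx) - dg
    return gw / len(reals), gc / len(reals)

def g_loss(theta_g, theta_d, noise):
    """(1/m) * sum log(1 - D(G(z))) with G(z) = a*z + b."""
    (a, b), (w, c) = theta_g, theta_d
    return sum(math.log(1.0 - sigmoid(w * (a * z + b) + c)) for z in noise) / len(noise)

def g_gradient(theta_g, theta_d, noise):
    """Gradient of g_loss with respect to (a, b); D's weights are treated as constants."""
    (a, b), (w, c) = theta_g, theta_d
    ga = gb = 0.0
    for z in noise:
        dg = sigmoid(w * (a * z + b) + c)
        ga += -dg * w * z
        gb += -dg * w
    return ga / len(noise), gb / len(noise)

def train(steps=2000, m=64, k=1, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta_g = [1.0, 0.0]   # generator weights (a, b)
    theta_d = [0.0, 0.0]   # discriminator weights (w, c)
    for _ in range(steps):
        for _ in range(k):
            # Step 1: update D by gradient ascent, G locked.
            noise = [rng.gauss(0.0, 1.0) for _ in range(m)]
            fakes = [theta_g[0] * z + theta_g[1] for z in noise]
            reals = [rng.gauss(4.0, 1.0) for _ in range(m)]
            gw, gc = d_gradient(theta_d, reals, fakes)
            theta_d[0] += lr * gw
            theta_d[1] += lr * gc
        # Steps 2-3: fresh noise, update G by gradient descent, D locked.
        noise = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ga, gb = g_gradient(theta_g, theta_d, noise)
        theta_g[0] -= lr * ga
        theta_g[1] -= lr * gb
    return theta_g, theta_d
```

Calling `train()` returns the final `(a, b)` and `(w, c)`. Early in training the discriminator learns a positive `w` (real samples are larger), which in turn pushes the generator offset `b` upward toward the real data.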

#### Training Tricks 


Training GANs is notoriously difficult. Below are a few tricks (i.e. heuristics) to try:

* Use tanh as the last activation in the generator, instead of the sigmoid.
* Sample points from the latent space using a Gaussian, not a uniform, distribution.
* Introduce randomness in several ways:
    - Use dropout in the discriminator.
    - Add some random noise to the labels for the discriminator.
* Use LeakyReLU instead of a ReLU activation to ease sparsity constraints by allowing small negative activation values.
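
Two of these tricks are easy to show in isolation. The sketch below contrasts ReLU with LeakyReLU and adds noise to the discriminator's 0/1 labels. The slope of 0.2 and the noise range of 0.1 are assumed values for illustration, not recommendations from the source.

```python
import random

def relu(a):
    """Standard ReLU: negative inputs become exactly 0."""
    return max(0.0, a)

def leaky_relu(a, negative_slope=0.2):
    """LeakyReLU: negative inputs are scaled down instead of zeroed."""
    return a if a > 0 else negative_slope * a

def noisy_labels(labels, noise=0.1, rng=random):
    """Perturb hard targets: 1 -> [1 - noise, 1], 0 -> [0, noise]."""
    out = []
    for y in labels:
        u = rng.uniform(0.0, noise)
        out.append(y - u if y == 1 else y + u)
    return out
```

For a negative pre-activation such as -1.0, `relu` returns 0.0 while `leaky_relu` returns -0.2, so a small gradient can still flow through the unit.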


#### Sketch of proof that the Objective Function converges to $p_{data}(x) = p_g(x)$

For the complete derivation, see https://www.youtube.com/watch?v=7G4_Y5rsvi8

$$V(G,D) = \mathbb{E}_{x\sim{p_{data}}}log(D(x)) + \mathbb{E}_{z\sim{p_z}}log(1 - D(G(z))) \\
=\int_xp_{data}(x)log(D(x))dx +  \int_zp_z(z)log(1 - D(G(z)))dz $$

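Rewrite the second integral in terms of x, using the change of variables $z=G^{-1}(x)$ (this step assumes G is invertible) and the density of the generated samples, $p_g(x) = p_z(G^{-1}(x))\left|(G^{-1})'(x)\right|$:

$$=\int_xp_{data}(x)log(D(x))dx +  \int_xp_z(G^{-1}(x))log(1 - D(x))\left|(G^{-1})'(x)\right|dx \\
=\int_xp_{data}(x)log(D(x))dx + \int_xp_g(x)log(1 - D(x))dx$$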



For a fixed G, maximize over D:

$$\underset{D}{\mathrm{max}}V(G,D) = \underset{D}{\mathrm{max}}\int_x\left[p_{data}(x)log(D(x)) + p_g(x)log(1 - D(x))\right]dx$$

The integrand can be maximized pointwise, i.e. for each x set the derivative with respect to D(x) to 0 and solve:

$$\frac{\partial}{\partial D(x)}\left[p_{data}(x)log(D(x)) + p_g(x)log(1-D(x))\right] = 0 \\
\Rightarrow \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} =0 \\
\Rightarrow D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$
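
The closed form can be checked numerically at a single point x. The sketch below compares it with a brute-force grid search over D(x) in (0, 1); the function names and the grid size are assumptions for illustration.

```python
import math

def pointwise_objective(d, p_data, p_g):
    """Integrand at one x: p_data * log(D) + p_g * log(1 - D)."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)

def optimal_d(p_data, p_g):
    """Closed-form maximizer D*(x) = p_data / (p_data + p_g)."""
    return p_data / (p_data + p_g)

def grid_argmax(p_data, p_g, n=9999):
    """Brute-force maximizer over an evenly spaced grid inside (0, 1)."""
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda d: pointwise_objective(d, p_data, p_g))
```

For any positive densities, `grid_argmax(p_data, p_g)` should agree with `optimal_d(p_data, p_g)` up to the grid spacing.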

Substituting the optimal discriminator, $D^*_G(x)$, gives the loss function for G:

$$L(G) = \underset{D}{\mathrm{max}}V(G,D) \\
= \underset{D}{\mathrm{max}}\int_xp_{data}log(D(x)) + p_g(x)log(1 - D(x))dx $$

$$ = \int_xp_{data}log(D^*_G(x)) + p_g(x)log(1 - D^*_G(x))dx $$

$$ = \int_xp_{data}log\left(\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right) + p_g(x)log\left(1 - \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right)dx $$

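Use $1 - D^*_G(x) = \frac{p_g(x)}{p_{data}(x) + p_g(x)}$, then multiply and divide the argument of each log by 2. Since $p_{data}$ and $p_g$ each integrate to 1, the two factors of $\frac{1}{2}$ contribute $-2log(2)$:

$$ = \int_xp_{data}(x)log\left(\frac{p_{data}(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right) + p_g(x)log\left(\frac{p_g(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right)dx -2log(2)$$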

Each integral is a KL-divergence (their sum is twice the Jensen-Shannon divergence between $p_{data}$ and $p_g$), so the loss function for G is:

$$L(G)= KL[p_{data}(x)||\frac{p_{data}(x) + p_g(x)}{2} ] + KL[p_g(x)||\frac{p_{data}(x) + p_g(x)}{2}] - 2log2$$
    

KL-divergence is always $\ge$ 0, therefore the minimum is -2log(2), reached when both KL terms are 0.

$$ \underset{G}{\mathrm{min}}L(G) = 0 + 0 - 2log(2) = -2log2 \\
KL[p_{data}||\frac{p_{data}(x) + p_g(x)}{2} ] = 0 \\
\text{when } p_{data} = \frac{p_{data}(x) + p_g(x)}{2} $$

Since KL-divergence equals 0 only when the two distributions are equal, at the minimum this implies:

$$\Rightarrow p_{data}(x) = p_g(x)$$

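At this optimum the Discriminator cannot tell real from fake: $D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} = 0.5$, so it assigns probability 0.5 to every image, real or fake.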
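
The result can be sanity-checked on discrete distributions, where the integrals become sums. The sketch below evaluates L(G) once by plugging in $D^*_G$ directly and once through the KL form; the function names and the example distributions in the usage note are assumptions for illustration.

```python
import math

def kl(p, q):
    """KL[p || q] for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def loss_at_optimal_d(p_data, p_g):
    """L(G) = sum_x [p_data log D* + p_g log(1 - D*)], D* = p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue  # x has zero probability under both distributions
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total

def loss_via_kl(p_data, p_g):
    """The same quantity written as KL[p_data || m] + KL[p_g || m] - 2 log 2."""
    mix = [(pd + pg) / 2 for pd, pg in zip(p_data, p_g)]
    return kl(p_data, mix) + kl(p_g, mix) - 2 * math.log(2)
```

With, say, `p = [0.2, 0.3, 0.5]` and `q = [0.5, 0.3, 0.2]`, both functions agree on `(p, q)`, and `loss_at_optimal_d(p, p)` reaches the lower bound of -2log(2).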



### References

Vasilev, Slater, Spacagna, Roelants, Zocca (2019). Python Deep Learning, 2nd Edition.