# GANs - notes based on [Coursera Course](https://www.coursera.org/learn/build-basic-generative-adversarial-networks-gans)

## Introduction

> :camera: Images used in these notes come from the lecture notes of the course [Generative Adversarial Networks (GANs) Specialization](https://www.coursera.org/learn/build-basic-generative-adversarial-networks-gans/) on Coursera.

A Generative Adversarial Network consists of two models, a **discriminator** and a **generator**, which compete with each other.

### Discriminator

- Task: Grade the generator by evaluating the generated image as fake or real.
- Input: Image (generated or real) / Input features X.
- Output: Probability of class y (number between 0 and 1).

#### Discriminator - details

![Alt text](./figures/discriminator-2.png)

### Generator

- Task: Generate fake images that will look like real ones.
- Input: Random noise (random features).
- Output: Generated image.

  ![Alt text](./figures/generator-2.png)
  ![Alt text](./figures/generator-1.png)


### BCE Cost Function

- Measures how far the prediction is from the actual label (0 or 1).
- **Close to zero** when the label and the prediction are **similar** (a perfect model would have 0 loss). Approaches **infinity** when the label and the prediction are **different.**
- Output range: from 0 to infinity (the loss is never negative).
- Used when we want to classify things into two classes (here real and fake images).
- For example, if the real label is 0, a prediction of 0.98 would be bad and the loss would be high.
- Log loss penalizes wrong predictions, and penalizes confident wrong predictions the most.

Formula:
$$BCE = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log{(\hat{y}_i)} + (1 - y_i)\log{(1- \hat{y}_i)} \right]$$

- $y_i$ - ground truth label of the i-th sample (0 or 1).
- $\hat{y}_i$ - predicted probability for the i-th sample (between 0 and 1).
- $n$ - number of samples.
- $\log$ - natural logarithm.

#### Binary Cross Entropy calculation example

| i-th sample | Ground truth label $y$ | Predicted label $\hat{y}$ |
| ----------- | ---------------------- | ------------------------- |
| 1           | 1                      | 0.9                       |
| 2           | 1                      | 0.1                       |
| 3           | 0                      | 0.3                       |

$BCE = - \frac{(1 \cdot \log{0.9} + (1 - 1)\cdot \log{(1- 0.9)} ) + (1 \cdot \log{0.1} + (1 - 1)\cdot \log{(1- 0.1)} ) + (0 \cdot \log{0.3} + (1 - 0)\cdot \log{(1 - 0.3)} )}{3} = -\frac{\log{(0.9)} + \log{(0.1)} + \log{(0.7)}}{3} \approx - \frac{-0.105 -2.303 -0.357}{3} \approx 0.92$
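
The same calculation can be checked with a few lines of plain Python. This is a minimal sketch: the function name `binary_cross_entropy` is made up for this note, and it assumes the natural logarithm and predictions strictly between 0 and 1.

```python
import math


def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy, computed with the natural logarithm."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Only one of the two terms is non-zero for a label of 0 or 1.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 1, 0]               # ground truth labels from the table
predictions = [0.9, 0.1, 0.3]    # predicted probabilities from the table

bce = binary_cross_entropy(labels, predictions)
print(round(bce, 2))  # 0.92
```

Most of the loss comes from the second sample, where the model is confident and wrong (label 1, prediction 0.1).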

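Coursera formula:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h(x^{(i)}, \theta) + (1 - y^{(i)})\log(1 - h(x^{(i)}, \theta))\right]$$

- $\theta$ - parameters.
- $m$ - number of samples.
- $h(x^{(i)}, \theta)$ - prediction of the model for the i-th sample.
- $y^{(i)}\log h(x^{(i)}, \theta)$ - term that contributes only when the label is 1.
- $(1 - y^{(i)})\log(1 - h(x^{(i)}, \theta))$ - term that contributes only when the label is 0.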


![Alt text](./figures/gan-1.png)

### Resources for this topic
* [Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) chapter 4 (logistic regression)

### GAN Challenges

#### Confusing loss function values :confused:

Intuitively, the lower the generator's cost function, the better the generated images should be. In reality it is not so straightforward. The generator is graded by the discriminator, which improves over time and detects even very realistic fake images. Sometimes **the generator's loss grows despite an improvement in the quality of the generated images.**

#### Hyperparameters

There are a lot of hyperparameters in GANs. Moreover, GANs are **extremely sensitive** to changes in them. That's why hyperparameter tuning is a challenging task.

#### Mode collapse

It occurs when the generator finds a **small number of samples** (instead of exploring the whole training data) that successfully **fool the discriminator.** As a result, the generator becomes stuck in a particular pattern and fails to generate diverse images.

#### Overcoming challenges :sunglasses:

To overcome the problems described above, a WGAN or WGAN-GP (Wasserstein GAN with Gradient Penalty) architecture should be considered.

## DCGAN (Deep Convolutional GAN)

<details>
<summary>
<font size="3" color="green">
<b>GAN Architecture Scheme</b>
</font>
</summary>
<div>
<img src = "layers.png" width=800>
</div>

</details>


Architecture guidelines for stable Deep Convolutional GANs (cited from the [DCGAN paper](https://arxiv.org/abs/1511.06434)):

- Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
- Use batchnorm in both the generator and the discriminator.
- Remove fully connected hidden layers for deeper architectures.
- Use ReLU activation in generator for all layers except for the output, which uses Tanh.
- Use LeakyReLU activation in the discriminator for all layers.



## Wasserstein GAN

> Also called WGAN. It is an **upgraded version of the GAN** that introduces a different cost function and minimizes an approximation of the Earth Mover's (EM) distance.
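
To build intuition for the Earth Mover's distance, here is a small sketch for the simplest possible case: two one-dimensional histograms over the same unit-spaced bins. In that special case the distance equals the sum of absolute differences between the cumulative sums. The function name `earth_movers_distance_1d` and the example histograms are made up for this note; a WGAN does not compute the distance this way, it only approximates it through the critic.

```python
from itertools import accumulate


def earth_movers_distance_1d(p, q):
    """EMD between two histograms defined over the same unit-spaced bins.

    Both histograms must carry the same total mass.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    # Mass that has to cross each bin boundary = difference of the cumulative sums.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


real = [0.0, 0.5, 0.5, 0.0]
fake_shifted = [0.5, 0.5, 0.0, 0.0]   # same shape, moved one bin to the left
fake_equal = [0.0, 0.5, 0.5, 0.0]

print(earth_movers_distance_1d(real, fake_shifted))  # 1.0
print(earth_movers_distance_1d(real, fake_equal))    # 0.0
```

The distance shrinks smoothly as the generated distribution moves towards the real one, which is what makes it useful as a training signal.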

### Resources for this topic

- [Towards Data Science](https://jonathan-hui.medium.com/gan-wasserstein-gan-wgan-gp-6a1a2aa1b490)

### WGAN - Pros and Cons

<div style="text-align: center;">
  <table style="display: inline-block;">
    <tr style="background-color: lightgray;">
      <th>Pros</th>
      <th>Cons</th>
    </tr>
    <tr>
      <td>Better stability</td>
      <td>Longer training</td>
    </tr>
    <tr>
      <td>Meaningful loss (which is correlated with convergence and quality of samples)</td>
      <td>???</td>
    </tr>
    <tr>
      <td>Helps with mode collapse and the vanishing gradient problem</td>
      <td>???</td>
    </tr>
  </table>
</div>

> See the article [Read-through: Wasserstein GAN](https://www.alexirpan.com/2017/02/22/wasserstein-gan.html) by Alexander Irpan to better understand the math behind WGAN.

- Mode Collapse - when the model collapses and generates images of only one class or only specific classes.

- Wasserstein Loss - approximates the Earth Mover's Distance
<div style="text-align:center;">
    <img src="image-1.png" alt="Your Image" width="600">
</div>

- Critic - it tries to maximize the distance between the real distribution and the fake distribution.
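- Intermediate image - a random mix (interpolation) of a real and a generated image, used when calculating the gradient penalty.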


### BCE Loss vs Wasserstein Loss

| BCE Loss                                                                                                                                                          | Wasserstein Loss                                                                                                                            |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **Discriminator** outputs values between **0 and 1** (classifies fake and real as 0 and 1). This is because of the sigmoid activation function in the last layer. | **Critic** outputs **any number** (scores images with real numbers). It is **not bounded!** There is no sigmoid function in the last layer. |
| $-[\mathbb{E}\log{(d(x))} + \mathbb{E}\log{(1 - d(g(z)))}]$                                                                                                       | $\underset{g}{\min} \: \underset{c}{\max} \: \mathbb{E}(c(x)) - \mathbb{E}(c(g(z)))$                                                        |
|                                                                                                                                                                   | Helps with mode collapse and vanishing gradient problem.                                                                                    |
| Measures how badly, on average, observations are classified by the discriminator as fake or real.                                                                 | Approximates the **Earth Mover's Distance**.                                                                                                |
| There is **no special condition**.                                                                                                                                | **Condition:** function needs to be 1-L Continuous $$\|\nabla \text{critic}(\text{image})\|_2 \le 1$$                                       |
| Uses $0$ and $1$ as labels.                                                                                                                                       | Uses $1$ and $-1$ as labels.                                                                                                                |
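
The two loss formulas can be compared on a tiny hand-made batch. This is only a sketch in plain Python: the function names and the example scores are assumptions for illustration, and the BCE version uses the standard form $-[\mathbb{E}\log d(x) + \mathbb{E}\log(1 - d(g(z)))]$.

```python
import math


def mean(values):
    return sum(values) / len(values)


def bce_discriminator_loss(d_real, d_fake):
    """-[E log d(x) + E log(1 - d(g(z)))]; inputs are probabilities in (0, 1)."""
    return -(mean([math.log(p) for p in d_real])
             + mean([math.log(1 - p) for p in d_fake]))


def wasserstein_critic_loss(c_real, c_fake):
    """Negative of E c(x) - E c(g(z)); minimizing it maximizes the gap."""
    return -(mean(c_real) - mean(c_fake))


def wasserstein_generator_loss(c_fake):
    """The generator wants the critic to score its fakes as high as possible."""
    return -mean(c_fake)


# Discriminator outputs are bounded by the sigmoid, critic scores are not.
d_real, d_fake = [0.9, 0.8], [0.2, 0.1]
c_real, c_fake = [3.0, 5.0], [-2.0, -4.0]

print(round(bce_discriminator_loss(d_real, d_fake), 3))
print(wasserstein_critic_loss(c_real, c_fake))     # -7.0
print(wasserstein_generator_loss(c_fake))          # 3.0
```

Note that the critic loss can become arbitrarily negative, which is why the Lipschitz condition from the table is needed to keep the scores under control.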

### Lipschitz continuity :small_red_triangle:
> :bulb: It is a **necessary restriction for the critic** used in WGAN. The critic has to be a [continuous](https://en.wikipedia.org/wiki/Continuous_function) [1-Lipschitz function](https://en.wikipedia.org/wiki/Lipschitz_continuity).

The critic is a function which transforms an image into a prediction.

The critic is a 1-Lipschitz function if for any two different images $x_1$ and $x_2$:
$$\frac{|C(x_1) - C(x_2)|}{\lVert x_1 - x_2 \rVert} \le 1$$
In this formula $|C(x_1) - C(x_2)|$ is the absolute difference between the critic's predictions and $\lVert x_1 - x_2 \rVert$ is the distance between the pixel values of the two images.

> :bulb: In other words, we **restrict how fast** the critic's predictions can change.
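
The condition can be probed numerically on simple one-dimensional functions, which stand in for the critic here. This is only an illustration: the helper `max_slope` is a made-up name, and random sampling can show that a function violates the condition but can never prove that it satisfies it.

```python
import math
import random


def max_slope(func, samples, low=-5.0, high=5.0, seed=0):
    """Largest |f(x1) - f(x2)| / |x1 - x2| found over random pairs of points."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(samples):
        x1, x2 = rng.uniform(low, high), rng.uniform(low, high)
        if x1 == x2:
            continue  # the ratio is undefined for identical points
        worst = max(worst, abs(func(x1) - func(x2)) / abs(x1 - x2))
    return worst


tolerance = 1e-9  # guards against floating point rounding

# sin never changes faster than its input, so no sampled pair exceeds slope 1.
print(max_slope(math.sin, 10_000) <= 1.0 + tolerance)           # True
# f(x) = 2x has slope 2 everywhere, so it is not 1-Lipschitz.
print(max_slope(lambda x: 2 * x, 10_000) <= 1.0 + tolerance)    # False
```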





### Weight clipping :scissors:

### Gradient penalty :tired_face:

Calculating the gradient penalty can be broken into two functions: (1) compute the gradient with respect to the images and (2) compute the gradient penalty given the gradient.

$$(\|\nabla c(\hat{x}) \|_2 - 1)^2$$

$$\hat{x} = \epsilon x + (1 - \epsilon) g(z)$$

- $\hat{x}$ - mixed image.
- $x$ - real image.
- $g(z)$ - generated (fake) image.
- $\epsilon$ - random mixing coefficient, sampled uniformly from $[0, 1]$ (not a small constant).
- $c(\hat{x})$ - critic's score on the mixed image.
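
The two steps (gradient, then penalty) can be sketched without any deep learning library. Everything below is an assumption made for illustration: the images are flat lists of two pixels, the critic is a toy linear function, $\epsilon$ is drawn uniformly from $[0, 1)$, and the gradient is estimated with finite differences. A real implementation would obtain the gradient from automatic differentiation (for example `torch.autograd.grad` in PyTorch) and average the penalty over the batch.

```python
import math
import random


def mix_images(real, fake, epsilon):
    """x_hat = epsilon * x + (1 - epsilon) * g(z), computed pixel by pixel."""
    return [epsilon * r + (1 - epsilon) * f for r, f in zip(real, fake)]


def numerical_gradient(critic, image, step=1e-6):
    """Central finite-difference estimate of the critic's gradient w.r.t. each pixel."""
    grad = []
    for i in range(len(image)):
        up = image[:i] + [image[i] + step] + image[i + 1:]
        down = image[:i] + [image[i] - step] + image[i + 1:]
        grad.append((critic(up) - critic(down)) / (2 * step))
    return grad


def gradient_penalty(gradient):
    """(||gradient||_2 - 1)^2 for a single mixed image."""
    norm = math.sqrt(sum(g * g for g in gradient))
    return (norm - 1) ** 2


# Toy linear critic c(x) = w . x, so its gradient is w at every point.
weights = [3.0, 4.0]


def toy_critic(image):
    return sum(w * p for w, p in zip(weights, image))


real, fake = [1.0, 0.0], [0.0, 1.0]
epsilon = random.Random(0).random()
mixed = mix_images(real, fake, epsilon)

grad = numerical_gradient(toy_critic, mixed)   # step (1): gradient w.r.t. the image
penalty = gradient_penalty(grad)               # step (2): penalty from the gradient
print(round(penalty, 3))  # 16.0, because ||w||_2 = 5 and (5 - 1)^2 = 16
```

The penalty is zero only when the gradient norm is exactly 1, so adding it to the critic's loss pushes the critic towards the 1-Lipschitz condition instead of enforcing it strictly.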

![Alt text](image-2.png)


# Conditional GAN & Controllable Generation
> ⚠️ Conditional GANs **require a labeled dataset!**

| Conditional                            | Unconditional                                  |
| -------------------------------------- | ---------------------------------------------- |
| Examples from the classes you want.    | Examples from random classes.                  |
| Training dataset has to be annotated.  | Training dataset doesn't need to be annotated. |

### ➡️ Generator Input
| Component       | Description                                                                                   | Purpose                                         |
| --------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------- |
| Noise vector    | One-dimensional vector of random numbers.                                                     | Providing randomness in the generation process. |
| Class vector    | One-hot encoded vector telling the model which class the generated instance should belong to. | Controlling the generation process.             |
| Combined vector | Noise vector concatenated with the class vector.                                              | Input passed to the generator.                  |

> Size of the class vector is equal to the number of classes.
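
A minimal sketch of how the combined vector could be built, using plain Python lists. The helper names (`one_hot`, `generator_input`) and the sizes (`z_dim = 4`, three classes) are assumptions chosen for the example.

```python
import random


def one_hot(class_index, n_classes):
    """Vector of zeros with a single 1.0 at the position of the class."""
    vector = [0.0] * n_classes
    vector[class_index] = 1.0
    return vector


def generator_input(noise, class_index, n_classes):
    """Concatenate the noise vector with the one-hot class vector."""
    return noise + one_hot(class_index, n_classes)


rng = random.Random(0)
z_dim, n_classes = 4, 3
noise = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]

combined = generator_input(noise, class_index=2, n_classes=n_classes)
print(len(combined))       # 7 = z_dim + n_classes
print(combined[z_dim:])    # [0.0, 0.0, 1.0]
```

Keeping the noise fixed and changing only `class_index` asks the generator for a different class with the same source of randomness.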

### ➡️ Discriminator Input
> The classes are passed to the discriminator as one-hot matrices (one extra image channel per class).
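
One way to picture this: every class gets its own channel of the same height and width as the image, filled with ones for the selected class and zeros for all others. The sketch below uses nested lists in `[channel][row][column]` order; the helper names and the 2 x 2 image are made up for the example.

```python
def one_hot_channels(class_index, n_classes, height, width):
    """One height x width matrix per class: ones for the chosen class, zeros otherwise."""
    return [
        [[1.0 if c == class_index else 0.0] * width for _ in range(height)]
        for c in range(n_classes)
    ]


def discriminator_input(image_channels, class_index, n_classes):
    """Append the one-hot class channels to the image channels."""
    height = len(image_channels[0])
    width = len(image_channels[0][0])
    return image_channels + one_hot_channels(class_index, n_classes, height, width)


image = [[[0.5, 0.1], [0.2, 0.9]]]   # 1 channel, 2 x 2 pixels
combined = discriminator_input(image, class_index=1, n_classes=3)

print(len(combined))   # 4 = 1 image channel + 3 class channels
print(combined[2])     # [[1.0, 1.0], [1.0, 1.0]] -> channel of the selected class
```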

## Controllable Generation vs. Conditional Generation
| Compared feature | Controllable | Conditional |
| ---- | ------------ | ----------- |
| Examples | Examples with the features you desire. | Examples of the classes you desire. |
| Training dataset | Training dataset doesn't need to be annotated. | Training dataset has to be annotated. |
| Method | Manipulate the z vector input. | Append a class vector to the input. |
| ⚠️ Challenges | When trying to control one feature, others that are correlated might change. | |

> ⚠️ It is not always possible to control a **single** output feature in isolation, because correlated features tend to change together.

### Transpose Convolution `torch.nn.ConvTranspose2d`
> ⚠️ The name can be confusing: the operation neither transposes the image nor is it an ordinary convolution. A good explanation is given by [Shubham Singh](https://www.youtube.com/watch?v=U3C8l6w-wn0) on YouTube.

* Transposed convolution multiplies the kernel by each pixel of the input image (as a scalar) and sums the overlapping results into the output.
* The output tensor is usually larger than the input tensor.
* It is used to upscale images.
* Takes the same parameters as standard convolution: `kernel_size`, `padding` and `stride`.

$$N_h = s_h(M_h - 1) + k - 2p$$
Where:
- $N_h$ - Height of the output image in pixels.
- $M_h$ - Height of the input image in pixels.
- $s_h$ - Stride along the height (skipping parameter).
- $p$ - Padding.
- $k$ - Kernel size.
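
The formula and the "multiply the kernel by each pixel" idea can be checked with a one-dimensional sketch in plain Python. Both function names are made up for this note, and the code ignores extras that `torch.nn.ConvTranspose2d` also supports (such as `output_padding` and `dilation`).

```python
def transposed_conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """N = s * (M - 1) + k - 2p."""
    return stride * (input_size - 1) + kernel_size - 2 * padding


def transposed_conv_1d(signal, kernel, stride=1, padding=0):
    """1D transposed convolution.

    Each input value scales the whole kernel, and the scaled copies are
    summed into the output at offsets that are `stride` apart.
    """
    full_size = stride * (len(signal) - 1) + len(kernel)
    output = [0.0] * full_size
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            output[i * stride + j] += value * weight
    # Padding trims `padding` entries from both ends of the full result.
    return output[padding:full_size - padding]


signal = [1.0, 2.0, 3.0]
kernel = [1.0, 1.0]

result = transposed_conv_1d(signal, kernel, stride=2)
print(result)  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
print(len(result) == transposed_conv_output_size(3, 2, stride=2))  # True
```

With a stride of 2 and a kernel of ones, every input value is simply repeated twice, so the three input values are upscaled to six output values.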

<details>
<summary>
<font size="3" color="green">
<b>Visual Example of Transpose Convolution</b>
</font>
</summary>
<img src="./figures/transpose.png">

</details>




---

# OTHER

# Glossary

- GAN - Generative Adversarial Network.
- Discriminator - Minimizes cost.
- Generator - Maximizes cost.
- BCE Loss - measures how badly, on average, observations are classified by the discriminator as fake or real.
- Real / Fake Distribution -
- Earth Mover's Distance - a measure of **how different two distributions are**, obtained by estimating the effort it takes to make the generated distribution equal to the real one.
- Vanishing Gradient Problem -



# Practical notes

- It is convenient to write helper functions that create blocks of the neural network, which are later used as layers in the class implementation.
- The magnitude of a gradient is also called its norm.

### Problems

- I didn't really understand how the gradient in Wasserstein GAN is calculated.

---

### Questions (theory) - DO NOT PUBLISH

> **What is the primary goal of the discriminator, in a probabilistic sense?**
> The discriminator finds the probability of class y (real or fake) given input features x.

> **What is the primary goal of the generator, in a probabilistic sense?**
> Model the features x conditioned on class y: P( x | y ).

> **How does the discriminator learn over time?**
> By getting feedback on whether its classifications were correct.

> **What should the skill levels of the discriminator and generator be?**
> Both should be at similar skill levels.

> **What is the difference between upsampling and transposed convolution layers?**
> Upsampling infers pixels using a predefined method, while transposed convolution learns a filter.

> **What do the discriminator and critic have in common?**
> They both want to maximize the difference between the expected values of the predictions for real and fake.

> **What points on a function are considered for the evaluation of 1-Lipschitz continuity?**
> All points on the function. The slope cannot be greater than 1 at any point on a function in order for it to be 1-Lipschitz continuous.

> **When is a function 1-Lipschitz continuous?**
> When its gradient norm is less than or equal to 1 at all points.

> **Why do you use an intermediate image for calculating the gradient penalty?**
> Since checking the critic's gradient at each possible point of the feature space is virtually impossible, you can approximate this by using interpolated images.

> **What is a soft way to restrict the critic to be 1-Lipschitz?**
> Adding a regularization term to the loss, as in L2 regularization of the weights. By using a gradient penalty, you are not strictly enforcing 1-Lipschitz continuity, but encouraging it.

> **How does the generator learn what class to generate (in Conditional GANs)?**
> The discriminator is checking if an image looks real or fake based on (conditioned on) a certain class.

> **How is adding the class information different for the discriminator and generator, and why (in Conditional GANs)?**
> For the discriminator, the class information is appended as a channel or some other method, whereas for the generator, the class is encoded by appending a one-hot vector to the noise to form a long vector input.

> **What is a key difference between controllable generation and conditional generation?**
> Controllable generation is done after training by modifying the z vectors passed to the generator while conditional generation is done during training and requires a labelled dataset.

> **How are controllable generation and interpolation similar?**
> They both change features by adapting values of the z vector.

> **When does controllable generation commonly fail?**
> When features strongly correlate with each other and z values don’t correspond to clear mappings on their images.

> **How can you use a classifier for controllable generation?**
> You can calculate the gradient of the z vectors along certain features through the classifier to find the direction to move the z vectors.

> **What is the purpose of disentangling models?**
> To make the values in a z vector correspond to meaningful features.



# Resources

### YouTube
- [Practical GAN YouTube tutorials](https://www.youtube.com/watch?v=OXWvrRLzEaU&list=PLhhyoLH6IjfwIp8bZnzX8QR30TRcHO8Va) by Aladdin Persson
- [WGAN implementation from scratch (with gradient penalty)](https://www.youtube.com/watch?v=pG0QZ7OddX4)
- Understand [Transpose convolution](https://www.youtube.com/watch?v=U3C8l6w-wn0) by Shubham Singh - theory explanation and Python implementation.

### Papers

- Interactive paper [Deconvolution and Checkerboard Artifacts](https://distill.pub/2016/deconv-checkerboard/)
- [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/abs/1511.06434)

# Ideas

- Easy GAN project (MNIST or something at a similar difficulty level), max 4h of work.
- Fast SQL task during breaks
- The course exercise notebooks are well structured. They are a good place to take inspiration from for my own project and/or Czarna Magia Bootcamp.
