In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create an Anaconda environment
<br>
```bash
conda create -n ml python=3.7.5 jupyter
```
Install the fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

## Generative adversarial network (GAN)

Previously we considered different approaches to training models:
- Train a model with one objective
- Train a model with multiple objectives (multi-task learning), for example localization together with classification

Multi-task learning
<img src="images/od/localization_classification.png" height="1000" width="1000">

Now let's consider multi-model learning
<br>
We train different models simultaneously

## GAN

Suppose we have a model that generates data from an input vector $z$


Recall auto-encoders:
<img src="images/ssl/ae_1.png" height="800" width="800">

But let's generate directly from latent vectors that are sampled from noise
<br>
But how can we be sure that the generated data is "appropriate"?

Consider model:
$$G:Z \to X$$
<br>
Call this model the <b>generator</b>; it generates data from $z$ (for instance, a ConvNet with up-sampling layers)
<br>
To check that the model generates something meaningful, we have to evaluate the generated data
<br>
Consider a second model
$$D:X \to [0, 1]$$
<br>
Call this model the <b>discriminator</b> and train it to classify images as real or generated; $D(x)$ is the predicted probability that $x$ is real
<br>
In other words, the model consumes real and generated data and discriminates between them

That is, the discriminator is a binary classifier
<br>
Now, how can we train both models?
<br>
Let's play a game:
<br>
Train the generator to produce more realistic data, and train the discriminator to distinguish generated data from real data

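Consider the objective:
$$
\min_{G}\max_{D}\ \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_{z}}[\log(1 - D(G(z)))]
$$
<br>
The discriminator maximizes this value by assigning high probability to real data and low probability to generated data, while the generator minimizes it by making $D(G(z))$ large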
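
A minimal sketch of how the value in this objective can be estimated from a batch, in plain Python. The function name `gan_value` and the discriminator outputs are assumptions chosen only for illustration; a real setup would get these probabilities from a trained network.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs, not taken from any trained model.
confident_d = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])  # D separates well
fooled_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])       # D cannot tell
print(confident_d, fooled_d)
```

The value is higher when the discriminator separates real from generated samples, which is what the discriminator wants, and lower when the generator fools it.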

For more details:
<img src="images/ssl/gan_1.png" height="800" width="800">

Let's look into the pseudo-code:
<img src="images/ssl/gan_2.png" height="800" width="800">

Results of generated images:
<img src="images/ssl/gan_3.png" height="800" width="800">

## Deep convolutional GANs (DCGAN)

The first big results with GANs came from deep convolutional GANs (DCGAN)
<br>
This does not mean that GANs did not use ConvNets before
<br>
But DCGAN introduced several engineering ideas

<img src="images/ssl/dcgan_1.png" height="800" width="800">

<img src="images/ssl/dcgan_2.png" height="800" width="800">

<img src="images/ssl/dcgan_3.png" height="800" width="800">

<img src="images/ssl/dcgan_4.png" height="800" width="800">

<img src="images/ssl/dcgan_5.png" height="800" width="800">

<img src="images/ssl/dcgan_6.png" height="800" width="800">

<img src="images/ssl/dcgan_7.png" height="800" width="800">

<img src="images/ssl/dcgan_8.png" height="800" width="800">

## More techniques

More techniques were applied to GANs:
- Label smoothing
- Feature matching

Label smoothing:
<br>
Instead of using the hard-labels
$$
\{0, 1\}
$$
<br>
Use smooth labels:
$$
\{0.01, 0.99\}
$$
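
A small sketch of label smoothing for the discriminator targets, in plain Python. The helper names `smooth_labels` and `binary_cross_entropy` are hypothetical; only the smoothing constant 0.01 comes from the values above.

```python
import math

def smooth_labels(labels, eps=0.01):
    """Map hard labels {0, 1} to smooth labels {eps, 1 - eps}."""
    return [y * (1.0 - eps) + (1 - y) * eps for y in labels]

def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy between predicted probabilities and targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)

hard = [1, 0, 1]
soft = smooth_labels(hard)
print(soft)

# With a smooth target of 0.99, an over-confident prediction is penalized.
matched = binary_cross_entropy([0.99], [0.99])
overconfident = binary_cross_entropy([0.999], [0.99])
print(matched, overconfident)
```

With smooth targets the loss is smallest when the prediction equals the target, so the discriminator is no longer pushed towards extreme outputs.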

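Feature matching:
$$
\left\|\mathbb{E}_{x \sim p_{data}}f(x) - \mathbb{E}_{z \sim p_{z}}f(G(z))\right\|^2
$$
<br>
Here $f$ denotes the features (for example, activations of an intermediate layer of the discriminator)
<br>
Matching features layer by layer is the idea behind the <b>perceptual loss</b>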
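
A minimal sketch of the feature matching term in plain Python: the expectations are replaced by batch means. The feature vectors below are made-up numbers, and the function names are hypothetical.

```python
def mean_features(batch):
    """Average a batch of feature vectors element-wise."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]

def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between mean real and mean generated features."""
    mu_real = mean_features(real_features)
    mu_fake = mean_features(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))

real = [[1.0, 2.0], [3.0, 4.0]]   # f(x) for two real samples
fake = [[2.0, 2.0], [2.0, 2.0]]   # f(G(z)) for two generated samples
loss = feature_matching_loss(real, fake)
print(loss)
```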

## Progressive growing GAN

This is the training technique we talked about before:
- Train with small images
- After some iterations, increase the image size and train more
- Repeat the process several times
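
The steps above can be sketched as a simple resolution schedule. Doubling the image size at each stage, and the concrete sizes and iteration counts, are assumptions made for this example; the training loop itself is not shown.

```python
def resolution_schedule(start, final, iters_per_stage):
    """Return (resolution, iterations) stages, doubling the image size each time."""
    stages = []
    size = start
    while size <= final:
        stages.append((size, iters_per_stage))
        size *= 2
    return stages

schedule = resolution_schedule(start=4, final=64, iters_per_stage=1000)
for size, iters in schedule:
    # A real implementation would resize the data and grow both networks here.
    print(f"train at {size}x{size} for {iters} iterations")
```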


## Multi-objectives

We can also add more layers and layer blocks like "lego" bricks and experiment
<br>
We can have different models and costs and experiment with them

For instance, we can combine an auto-encoder with a GAN generator and discriminators
<br>
We can have different generators and discriminators
<br>
We can have cycles for transferring images between domains

## Cycle GAN

One of the techniques with multiple generators and discriminators is the CycleGAN approach
- We have two domains $X$ and $Y$
- We have two generators $G:X \to Y$ and $F:Y \to X$
- Two discriminators $D_X$ and $D_Y$

Let's look at the results for different domains:
<img src="images/ssl/cg_1.jpg" height="800" width="800">

Cycle consistency loss:
$$
C = \mathbb{E}_{x \sim p_{data}(x)}\|F(G(x)) - x\|_1 + \mathbb{E}_{y \sim p_{data}(y)}\|G(F(y)) - y\|_1
$$
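
A toy sketch of the cycle consistency loss in plain Python, using the L1 distance and batch means in place of the expectations. The "generators" here are hypothetical hand-written functions on short vectors, standing in for neural networks.

```python
def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cycle_consistency_loss(G, F, xs, ys):
    """Mean L1 reconstruction error of both cycles: x -> G -> F and y -> F -> G."""
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward

def G(x):                       # X -> Y
    return [v + 1.0 for v in x]

def F_exact(y):                 # Y -> X, exact inverse of G
    return [v - 1.0 for v in y]

def F_biased(y):                # Y -> X, imperfect inverse of G
    return [v - 0.5 for v in y]

xs = [[0.0, 1.0], [2.0, 3.0]]
ys = [[1.0, 2.0], [3.0, 4.0]]
loss_exact = cycle_consistency_loss(G, F_exact, xs, ys)
loss_biased = cycle_consistency_loss(G, F_biased, xs, ys)
print(loss_exact, loss_biased)
```

The loss is zero only when each generator undoes the other, which is what ties the two unpaired domains together.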

The key idea is that it does not need paired data:
<br>
We can just have two domains and do style transfer without additional labeling

<img src="images/ssl/cgpr_1.jpeg" height="800" width="800">

## Worth mentioning

BiGAN: a GAN that also learns representations

A GAN can be used for classification, or its adversarial loss can be added as an additional loss

## Self-supervised learning

Self-supervised learning is machine learning in which the labels are simply generated by modifying the original data, without a human in the loop:
- Generative models:
 - Auto-encoders, where the labels are the original images
- Discriminative models:
 - The labels are meta-information about how the original data was modified

Self-supervised representation learning is self-supervised learning in which a model is first trained on a pretext task and then used for a downstream task

## Representation learning

Deep models are used for feature extraction.
<br>
Train a deep model on modified data (pretext task)
<br>
Use it without the last layers as a feature extractor and train another model on top of it (downstream task)

## Advantages of self-supervised learning

- Labeling is expensive and time-consuming, and mistakes are almost inevitable.
- A classification label carries little information compared with the image itself, which may contain many other objects
- Multi-task learning adds more difficulties and hyper-parameters
- Multi-object classifiers need more labeling

For deep learning:
<img src="images/ssl/dl_cacke_1.png" height="800" width="800">

## Generative models

Generative models learn to generate data from a representation (latent vectors)


## Auto-encoders

Auto-encoders learn from the original images: they encode the data and then reconstruct it
<img src="images/ssl/ae_1.png" height="800" width="800">

## Denoising auto-encoders

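If we corrupt the input (augmentation, for example added noise) and train the model to reconstruct the original image, the model has to extract robust features from the images: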
<img src="images/ssl/dae_1.png" height="800" width="800">
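
A small sketch of how a training pair for a denoising auto-encoder could be built, in plain Python. Gaussian noise is only one possible corruption and is an assumption here, as are the helper name `add_gaussian_noise` and the noise level.

```python
import random

def add_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of a flat list of pixel intensities in [0, 1]."""
    noisy = [p + rng.gauss(0.0, sigma) for p in image]
    return [min(1.0, max(0.0, p)) for p in noisy]   # clip back to the valid range

rng = random.Random(0)                  # fixed seed so the example is repeatable
clean = [0.0, 0.25, 0.5, 0.75, 1.0]     # toy "image"
noisy = add_gaussian_noise(clean, sigma=0.1, rng=rng)
training_pair = (noisy, clean)          # (model input, reconstruction target)
print(training_pair)
```

No human labeling is involved: the target is simply the original image.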

## Context encoders

<img src="images/ssl/ce_1.png" height="800" width="800">

<img src="images/ssl/ce_2.png" height="800" width="800">

<img src="images/ssl/ce_3.png" height="800" width="800">

<img src="images/ssl/ce_4.png" height="800" width="800">

<img src="images/ssl/ce_5.png" height="800" width="800">

<img src="images/ssl/ce_6.png" height="800" width="800">

## Predicting one view from another

<img src="images/ssl/pv_1.png" height="800" width="800">

<img src="images/ssl/pv_2.png" height="800" width="800">

<img src="images/ssl/pv_3.png" height="800" width="800">

<img src="images/ssl/pv_4.png" height="800" width="800">

<img src="images/ssl/pv_5.png" height="800" width="800">

<img src="images/ssl/pv_6.png" height="800" width="800">

<img src="images/ssl/pv_7.png" height="800" width="800">

<img src="images/ssl/pv_8.png" height="800" width="800">

<img src="images/ssl/pv_9.png" height="800" width="800">

DeOldify
<img src="images/ssl/pv_10.jpg" height="800" width="800">

## Relative position of image patches

<img src="images/ssl/rp_1.png" height="800" width="800">

<img src="images/ssl/rp_2.png" height="800" width="800">

<img src="images/ssl/rp_3.png" height="800" width="800">

<img src="images/ssl/rp_4.png" height="800" width="800">

## Solving jigsaw puzzles

<img src="images/ssl/jp_1.png" height="800" width="800">

<img src="images/ssl/jp_2.png" height="800" width="800">

## Rotation classifier

To predict the rotation, the model has to learn many features that describe the position and orientation of objects
<img src="images/ssl/rc_1.png" height="800" width="800">
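
A minimal sketch of how the labels for a rotation pretext task can be generated, in plain Python. Using the four rotations by multiples of 90 degrees is an assumption of this example, and the function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_examples(image):
    """Return [(rotated_image, label)], where label k is the number of 90-degree turns."""
    examples = []
    current = image
    for k in range(4):
        examples.append((current, k))
        current = rotate90(current)
    return examples

image = [[1, 2],
         [3, 4]]
examples = rotation_examples(image)
for rotated, label in examples:
    print(label, rotated)
```

Each unlabeled image yields four labeled training examples, and the classifier is trained to predict `label` from `rotated`.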

<img src="images/ssl/rc_2.png" height="800" width="800">

<img src="images/ssl/rc_3.png" height="800" width="800">

<img src="images/ssl/rc_4.png" height="800" width="800">

<img src="images/ssl/rc_5.png" height="800" width="800">

<img src="images/ssl/rc_6.png" height="800" width="800">

## Deep metric learning

Instead of just classifying among different classes:
$$
P(c|x)
$$
<br>
Let's train the model to distinguish between classes in terms of a metric (recall the kernel methods):
<br>
The model outputs vectors $v$, and we train it to make $v_1, v_2$ "close" (under some chosen metric) if the inputs belong to the same class and "far" otherwise

## Siamese network

The model outputs a $d$-dimensional vector (not a probability)
$$
f:X\to\mathbb{R}^d
$$
<br>
The idea is to use two samples: anchor and positive, or anchor and negative
<br>
Extract the vectors with the model
<br>
Make anchors and positives close, and anchors and negatives far apart

Model consumes two images (with the shared weights) and outputs the vectors:
<img src="images/ssl/siames_1.png" height="800" width="800">

Optimize for Euclidean distance
<img src="images/ssl/siames_2.png" height="800" width="800">

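Objective for the Siamese model:
$$
C = \sum_{i=1}^{N}\left[(1 - y_i)\frac{1}{2}d_i^2 + y_i\frac{1}{2}\left(\max(0, m-d_i)\right)^2\right]
$$
<br>
Where $d_i$ is the distance between the two vectors of pair $i$, $y_i = 0$ if the pair belongs to the same class and $y_i = 1$ otherwise, and $m$ is the margin (the code below averages over the batch and drops the constant factor $\frac{1}{2}$)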

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v1, v2, label, margin):
    d = F.pairwise_distance(v1, v2)  # Euclidean distance; label 0 = same class, 1 = different
    return torch.mean((1 - label) * d.pow(2) + label * torch.clamp(margin - d, min=0.0).pow(2))
```

Then we can use the single model to extract a vector from a new image and compare it with the vectors extracted earlier
<br>
If the distance is more than the margin, the images belong to different classes
<br>
Otherwise they belong to the same class
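
The decision rule above can be sketched in plain Python. The stored vectors, the query and the margin value are made-up numbers, and the names `gallery` and `same_class` are hypothetical; in practice the vectors would come from the trained model.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_class(v1, v2, margin):
    """Decision rule: same class if the distance does not exceed the margin."""
    return euclidean(v1, v2) <= margin

# Hypothetical vectors extracted earlier, plus a vector for a new image.
gallery = {"person_a": [0.0, 0.0], "person_b": [3.0, 4.0]}
query = [0.3, 0.4]
matches = [name for name, vec in gallery.items() if same_class(query, vec, margin=1.0)]
print(matches)
```

Adding a new class only requires storing one more vector in `gallery`; the model itself is not re-trained.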

## Triplet loss

FaceNet is the paper that took a different approach to distance (metric) learning:
<br>
Use triplets instead of pairs: anchors, positives and negatives
<br>
Make anchors and positives close and anchors and negatives far apart at the same time

<img src="images/ssl/triplet_1.jpeg" height="800" width="800">

<img src="images/ssl/triplet_2.png" height="800" width="800">

Objective for the triplet model:
$$
C = \sum_{i=1}^{N}\left[\|f^a_i - f^p_i\|^2 - \|f^a_i - f^n_i\|^2 + \alpha\right]_+
$$
<br>
Where $\alpha$ is a margin and $[\cdot]_+ = \max(0, \cdot)$
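
A minimal sketch of the triplet objective in plain Python. Both distances are squared and each term is clipped at zero, as in the FaceNet paper; the toy vectors and the value of `alpha` are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchors, positives, negatives, alpha):
    """Sum over triplets of max(0, ||a - p||^2 - ||a - n||^2 + alpha)."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(0.0, squared_distance(a, p) - squared_distance(a, n) + alpha)
    return total

anchor, positive = [0.0, 0.0], [0.0, 1.0]
far_negative, near_negative = [3.0, 0.0], [1.0, 0.0]
easy = triplet_loss([anchor], [positive], [far_negative], alpha=0.5)
hard = triplet_loss([anchor], [positive], [near_negative], alpha=0.5)
print(easy, hard)
```

Triplets whose negative is already far away contribute nothing to the loss, which is why selecting hard triplets matters during training.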

Drawbacks:
- We need labeled data
- The data has to be arranged into triplets
<br>
Advantages:
- We don't need re-training for a new class
- It's an example of so-called one-shot learning

There are also techniques for training:
<br>
Hard mining
<br>
Offline vs online hard mining, etc

## Predicting neighbouring context

<img src="images/ssl/nc_1.png" height="800" width="800">

## Word embeddings with dimensionality reduction

<img src="images/ssl/we_1.png" height="800" width="800">

<img src="images/ssl/we_2.png" height="800" width="800">

<img src="images/ssl/we_3.png" height="800" width="800">

<img src="images/ssl/we_4.png" height="800" width="800">

<img src="images/ssl/we_5.png" height="800" width="800">

## Word embeddings (Word2Vec)

<img src="images/ssl/wv_2.png" height="800" width="800">

<img src="images/ssl/wv_3.png" height="800" width="800">

<img src="images/ssl/wv_4.png" height="800" width="800">

<img src="images/ssl/wv_5.png" height="800" width="800">

<img src="images/ssl/wv_6.png" height="800" width="800">

<img src="images/ssl/wv_7.png" height="800" width="800">

<img src="images/ssl/wv_8.png" height="800" width="800">

## Contrastive predictive coding (CPC)

<img src="images/ssl/cpc1_1.png" height="800" width="800">

<img src="images/ssl/cpc1_2.png" height="800" width="800">

<img src="images/ssl/cpc1_3.png" height="800" width="800">

<img src="images/ssl/cpc1_4.png" height="800" width="800">

<img src="images/ssl/cpc1_5.png" height="800" width="800">

<img src="images/ssl/cpc1_6.png" height="800" width="800">

<img src="images/ssl/cpc1_7.png" height="800" width="800">

<img src="images/ssl/cpc1_8.png" height="800" width="800">

<img src="images/ssl/cpc1_9.png" height="800" width="800">

<img src="images/ssl/cpc1_10.png" height="800" width="800">

<img src="images/ssl/cpc1_11.png" height="800" width="800">

<img src="images/ssl/cpc1_12.png" height="800" width="800">

## CPC2

<img src="images/ssl/cpc2_1.png" height="800" width="800">

<img src="images/ssl/cpc2_2.png" height="800" width="800">

<img src="images/ssl/cpc2_3.png" height="800" width="800">

<img src="images/ssl/cpc2_4.png" height="800" width="800">

<img src="images/ssl/cpc2_5.png" height="800" width="800">

<img src="images/ssl/cpc2_6.png" height="800" width="800">

<img src="images/ssl/cpc2_7.png" height="800" width="800">

<img src="images/ssl/cpc2_8.png" height="800" width="800">

<img src="images/ssl/cpc2_9.png" height="800" width="800">

<img src="images/ssl/cpc2_10.png" height="800" width="800">

<img src="images/ssl/cpc2_11.png" height="800" width="800">

<img src="images/ssl/cpc2_12.png" height="800" width="800">

<img src="images/ssl/cpc2_13.png" height="800" width="800">

<img src="images/ssl/cpc2_14.png" height="800" width="800">

<img src="images/ssl/cpc2_15.png" height="800" width="800">

<img src="images/ssl/cpc2_16.png" height="800" width="800">

<img src="images/ssl/cpc2_17.png" height="800" width="800">

<img src="images/ssl/cpc2_18.png" height="800" width="800">

## Instance discrimination

<img src="images/ssl/id_1.png" height="800" width="800">

<img src="images/ssl/id_2.png" height="800" width="800">

## Momentum contrast (MoCo)

<img src="images/ssl/moco1_1.png" height="800" width="800">

<img src="images/ssl/moco1_2.png" height="800" width="800">

<img src="images/ssl/moco1_3.png" height="800" width="800">

<img src="images/ssl/moco1_4.png" height="800" width="800">

<img src="images/ssl/moco1_5.png" height="800" width="800">

<img src="images/ssl/moco1_6.png" height="800" width="800">

## SimCLR

<img src="images/ssl/simclr_1.png" height="800" width="800">

<img src="images/ssl/simclr_2.png" height="800" width="800">

<img src="images/ssl/simclr_3.png" height="800" width="800">

<img src="images/ssl/simclr_4.png" height="800" width="800">

## MoCo2

<img src="images/ssl/moco2_1.png" height="800" width="800">

<img src="images/ssl/moco2_2.png" height="800" width="800">

<img src="images/ssl/moco2_3.png" height="800" width="800">

## Image super-resolution

Recall denoising autoencoders
<br>
Let's now upscale an image from a low resolution to a higher resolution

<img src="images/ssl/sr_1.jpeg" height="800" width="800">

<img src="images/ssl/sr_2.jpg" height="800" width="800">

Use high-resolution images for training
<br>
Downscale the high-resolution images
<br>
Use the downscaled images as training inputs and the original images as labels for the auto-encoder

<img src="images/ssl/sr_3.png" height="800" width="800">

Another approach would be:
<br>
Down-sample high-resolution images
<br>
Up-sample them with chosen interpolation
<br>
Use the up-sampled images as the training inputs
<br>
Use original (high-resolution) images as labels
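
Both recipes above can be sketched in plain Python on a tiny image. Averaging 2x2 blocks for down-sampling and nearest-neighbour repetition for up-sampling are assumptions of this example; any other interpolation could be used instead.

```python
def downscale2x(image):
    """Halve height and width by averaging each 2x2 block (sizes must be even)."""
    h, w = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def upscale2x_nearest(image):
    """Double height and width by repeating every pixel (nearest neighbour)."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

high_res = [[0.0, 0.0, 1.0, 1.0],
            [0.0, 0.0, 1.0, 1.0],
            [1.0, 1.0, 0.0, 0.0],
            [1.0, 1.0, 0.0, 0.0]]
low_res = downscale2x(high_res)
pair_a = (low_res, high_res)                      # input is smaller than the label
pair_b = (upscale2x_nearest(low_res), high_res)   # input has the same size as the label
print(low_res)
```

In the second recipe the model keeps the image size fixed and only has to restore detail, so it does not need up-sampling layers of its own.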

<img src="images/ssl/sr_4.jpg" height="800" width="800">

## Perceptual loss

Use a pre-trained model, for example:
<br>
ResNet50
<br>
InceptionV3
<br>
VGG16

Calculate the features of the label images at several layers:
$$p:X \to Y$$
<br>
$$
p = p_n\circ p_{n-1} \circ \dots \circ p_1
$$

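Use some of these features as targets for the up-sampling layers
<br>
$$
\sum_{i=1}^{k}\|u^l_i(X) - p^s_i(X)\|^2
$$
<br>
Here $u^l_i$ is the output of the $i$-th selected up-sampling layer and $p^s_i$ is the matching feature map of the pre-trained model
<br>
Add this value to the last-layer loss ($L_2$ distance to the original image)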
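
A small sketch of this combined loss in plain Python. The feature vectors and pixel values are made-up numbers and the function names are hypothetical; in practice the features would come from the up-sampling layers and from a frozen pre-trained network.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def perceptual_loss(upsampling_features, pretrained_features, output, target):
    """Sum of per-layer squared feature differences plus the pixel-level L2 term."""
    feature_term = sum(squared_l2(u, p)
                       for u, p in zip(upsampling_features, pretrained_features))
    pixel_term = squared_l2(output, target)
    return feature_term + pixel_term

# Two hypothetical layers with flattened feature maps of different sizes.
u_feats = [[1.0, 2.0], [0.0, 0.0, 1.0]]
p_feats = [[1.0, 0.0], [0.0, 1.0, 1.0]]
output = [0.5, 0.5]   # flattened model output
target = [0.0, 1.0]   # flattened original image
loss = perceptual_loss(u_feats, p_feats, output, target)
print(loss)
```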

VGG (named after the Visual Geometry Group) was presented at the 2014 ImageNet competition (ILSVRC) and achieved strong results (it did not win the classification task; GoogLeNet did), and it has a simple architecture.
VGG-16 is still used as
- A backbone for many models
- The feature extractor in the perceptual loss for style transfer and auto-encoder (super-resolution, variational auto-encoder) models

From my experience, VGG-16 features are fairly invariant to position, color and size because of the pooling layers and the convolutional stack.

It is one of the best feature extractors for background jobs:
- High-dimensional features
- Spark jobs for image clustering

Perceptual loss:
<img src="images/od/perceptual_1.jpg" height="800" width="800">

VGG-16 architecture:
<img src="images/od/vgg_16_1.png" height="800" width="800">

<img src="images/ssl/pl_1.jpg" height="800" width="800">

<img src="images/ssl/pl_2.jpg" height="800" width="800">

Perceptual loss was developed for the style-transfer models
<img src="images/ssl/st_1.png" height="800" width="800">

<img src="images/ssl/st_2.jpeg" height="800" width="800">

<img src="images/ssl/st_3.png" height="800" width="800">

<img src="images/ssl/st_4.png" height="800" width="800">

## Questions

<img src="images/od/questions_2.jpg" height="800" width="800">

## Thank you