In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create anaconda environment
<br>
```bash
conda create -n ml python=3.7.5 jupyter
```
Install fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

## Transfer learning

Deep models to generate representation vectors:
$$
m : X \to \mathbb{R}^d
$$
<br>

Map elements of some space $X$ into Euclidean space such that each vector keeps as much information about the input as possible, while discarding just enough information to generalize better

<img src="images/sslng_1/tl_7.png" height="800" width="800">

<img src="images/sslng_1/tl_8.png" height="800" width="800">

Use pre-trained models as building blocks for other models
- Word2Vec
- Backbones for object detection or semantic segmentation
- Backbones for NLP tasks
- Clustering or similarity search

## Self-supervised learning

Labeling data is expensive:
- Labeling images for classification
- Labels tend to be biased and contain errors
- ImageNet is one of the
- Labeling images for detection and segmentation is at least twice as expensive and is also biased

Self-supervised learning is machine learning in which the labels are simply generated by modifying the original data, without a human in the loop:
- Generative models:
 - Auto-encoders, where the labels are the original images
- Discriminative models:
 - The labels are meta-information about the modification applied to the original data

Self-supervised representation learning is self-supervised learning in which a model is trained on a pretext task and then used for a downstream task

## Representation learning

Deep models for feature extraction.
<br>
Train a deep model on modified data (pretext task)
<br>
Use it without its last layers as a feature extractor and train another model on top of it (downstream task)
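The two stages above can be sketched in NumPy. This is only an illustration under stated assumptions: `W_frozen` is a hypothetical stand-in for a pretrained encoder with its last layers removed (here just a fixed random projection with ReLU), and the downstream model is a small logistic regression trained on synthetic labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained encoder without its last layers.
# In practice these weights would come from training on the pretext task.
W_frozen = rng.normal(size=(20, 8)) / np.sqrt(20)

def extract_features(x):
    """Map raw inputs to representation vectors; W_frozen is never updated."""
    return np.maximum(x @ W_frozen, 0.0)

# Toy downstream task: binary labels on synthetic inputs.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
feats = extract_features(X)

# Downstream model: logistic regression on top of the frozen features.
w = np.zeros(feats.shape[1])
b = 0.0

def log_loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

initial_loss = log_loss(w, b)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    grad = p - y                          # gradient of the loss w.r.t. the logits
    w -= 0.5 * feats.T @ grad / len(y)    # only the head is trained
    b -= 0.5 * grad.mean()
final_loss = log_loss(w, b)
```

Only `w` and `b` change during training; the feature extractor stays fixed, which is what makes the representation reusable across downstream tasks.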

## Predicting neighbouring context

<img src="images/sslng_1/nc_1.png" height="800" width="800">

## Word embeddings with dimensionality reduction

<img src="images/sslng_1/we_1.png" height="800" width="800">

## Word embeddings (Word2Vec)

<img src="images/sslng_1/wv_2.png" height="800" width="800">

<img src="images/sslng_1/wv_3.png" height="800" width="800">

<img src="images/sslng_1/wv_4.png" height="800" width="800">

Take neighboring words from a pre-defined window (6, 12, 18, etc.), for instance "Our dried voices, when
    We whisper together
    Are quiet and meaningless
    As wind in dry grass
    Or rats' feet over broken glass
    In our dry cellar"
<br>

$$
p(w_i|c) = \frac{e^{v_i \cdot v_k}}{\sum_{j=1}^{|V|}{e^{v_i \cdot v_j}}}
$$
<br>
If we maximize that
$$
\max_{w_i \in W}p(w_i|c)
$$
<br>
This implies that we maximize $v_i \cdot v_k$ and minimize $v_i \cdot v_j$ for all other $v_j \in V$, which brings vectors from the same context close to each other
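A minimal sketch of this softmax, assuming a toy vocabulary and random embeddings (the words, the dimension 4 and the step size 0.1 are all illustrative choices, not values from Word2Vec):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["our", "dried", "voices", "when", "we", "whisper", "together"]
V = rng.normal(size=(len(vocab), 4))   # one embedding per word (hypothetical values)

def context_probs(i):
    """Softmax over the dot products of v_i with every vector in the vocabulary."""
    scores = V @ V[i]
    scores = scores - scores.max()     # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

i, k = vocab.index("voices"), vocab.index("whisper")
before = context_probs(i)[k]
V[k] += 0.1 * V[i]                     # raises v_i . v_k, other dot products stay the same
after = context_probs(i)[k]
```

Moving $v_k$ towards $v_i$ increases the numerator more than the denominator, so `after` is larger than `before`.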


Noise contrastive estimation: instead of a softmax over a huge number of negative samples, we create a binary classifier that decides whether a pair of vectors comes from the same class (yes or no)
<br>
In our case, the same class means that the pair of vectors comes from the same image under different augmentations (or from an augmented image and its source image), or from parts of the same image
<br>
All other images or patches can be considered images from different classes
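The binary-classifier view can be sketched as a logistic loss over pairs. This is a simplification (closer to negative sampling than to full noise contrastive estimation, which also corrects for the noise distribution); the vectors, the noise scale and the number of negatives are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(anchor, positive, negatives):
    """Binary cross-entropy: the positive pair is labelled 1, each negative pair 0."""
    pos_term = -np.log(sigmoid(anchor @ positive))
    neg_term = -np.sum(np.log(sigmoid(-(negatives @ anchor))))   # 1 - sigmoid(x) = sigmoid(-x)
    return pos_term + neg_term

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.1 * rng.normal(size=8)   # e.g. another augmentation of the same image
negatives = rng.normal(size=(5, 8))            # e.g. vectors of five other images

loss_good = pair_loss(anchor, positive, negatives)
loss_bad = pair_loss(anchor, -positive, negatives)   # positive pushed away from the anchor
```

The loss is lower when the positive vector lies close to the anchor, which is the behaviour the classifier is trained to produce.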

## Contrastive learning

Imagine two normalized vectors $v_1 = \frac{z_1}{||z_1||}$ and $v_2 = \frac{z_2}{||z_2||}$
<br>

Let's compute the scalar product $v_1 \cdot v_2 = ||v_1|| \cdot ||v_2|| \cdot \cos{(v_1, v_2)}$
<br>

Because they are normalized, $||v_1|| \cdot ||v_2|| = 1$ and therefore only $\cos(v_1, v_2)$ matters
<br>

Recall that $\cos{90^{\circ}} = 0$ and $\cos{0^{\circ}} = 1$
<br>

So a higher cosine means that the angle is smaller and therefore (recall that the vectors are normalized) the vectors are closer to each other:

$$
v_1 \cdot v_2 \text{ is higher } \implies v_1 \text{ is closer to } v_2
$$
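A small numerical check of this statement, using three hand-picked vectors (the values are arbitrary). For unit vectors, $||v_1 - v_2||^2 = 2 - 2\, v_1 \cdot v_2$, so a larger dot product always means a smaller Euclidean distance.

```python
import numpy as np

def normalize(z):
    return z / np.linalg.norm(z)

v1 = normalize(np.array([3.0, 4.0]))
v2 = normalize(np.array([4.0, 3.0]))    # small angle to v1
v3 = normalize(np.array([-4.0, 3.0]))   # orthogonal to v1

cos_12 = v1 @ v2                        # dot product of unit vectors = cosine
cos_13 = v1 @ v3
dist_12 = np.linalg.norm(v1 - v2)
dist_13 = np.linalg.norm(v1 - v3)
```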

<img src="images/sslng_1/sim_1.png" height="800" width="800">

## Geometric and algorithmic perspective

<img src="images/sslng_1/nce_det_1.png" height="800" width="800">

<img src="images/sslng_1/nce_det_2.png" height="800" width="800">

<img src="images/sslng_1/nce_det_3.png" height="800" width="800">

## Instance discrimination

<img src="images/sslng_1/id_1.png" height="800" width="800">

<img src="images/sslng_1/id_2.png" height="800" width="800">

For images we have the same approach:
- CPC
- MoCo
- SimCLR
- CPC v2
- MoCo v2
- SwAV
- etc

## Momentum contrast (MoCo)

<img src="images/sslng_1/moco1_1.png" height="800" width="800">

<img src="images/sslng_1/moco1_2.png" height="800" width="800">

## Self-supervised learning for language models

#### Local vs global context - self-attention layers

Examples of synonyms

The encoder and decoder networks:
<img src="images/sslng_1/transf_1.png" height="600" width="600">

The encoder consists of several networks (layers) stacked together, and so does the decoder ($6$ each in the paper):
<img src="images/sslng_1/transf_2.png" height="600" width="600">

The architecture of each layer is similar: (multi-head) self-attention followed by a feed-forward layer, for both encoder and decoder; in addition, the decoder has an encoder-decoder attention layer, as in sequence-to-sequence models:
<img src="images/sslng_1/transf_3.png" height="600" width="600">

#### Self-attention

First step: create three different vectors (query, key and value) from each input vector (word embedding) using three different weight matrices

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. They don’t HAVE to be smaller; this is an architecture choice to make the computation of multi-headed attention (mostly) constant
<img src="images/sslng_1/selfatt_1.png" height="600" width="600">

Second step: calculate the score of each query against every key:

<br>
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2
<img src="images/sslng_1/selfatt_2.png" height="600" width="600">

Third step: divide the scores by 8, the square root of the dimension of the key vectors (64 in the paper)
<br>
Fourth step: apply SoftMax to the results:
<img src="images/sslng_1/selfatt_3.png" height="600" width="600">

Fifth step: multiply each value vector by its SoftMax score, which yields probability-weighted value vectors
<br>
Sixth step: sum the weighted value vectors to get the output for the first embedding:
<img src="images/sslng_1/selfatt_4.png" height="600" width="600">

SoftMax function:
$$
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \ \ \ \ \text{ for } i = 1, \dotsc , K \text{ and } \mathbf z=(z_1,\dotsc,z_K) \in \mathbb{R}^K.
$$

The resulting vector is one we can send along to the feed-forward neural network
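The six steps can be traced for the first word in a short NumPy sketch. The dimensions 512 and 64 follow the text; the embeddings and weight matrices are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_words = 512, 64, 3

X = rng.normal(size=(n_words, d_model))                    # word embeddings (placeholder values)
W_q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)   # placeholder weight matrices
W_k = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

# Step 1: query, key and value vectors for every word.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 2: score the first word's query against every key (q1.k1, q1.k2, ...).
scores = K @ Q[0]

# Step 3: divide by sqrt(d_k) = 8.
scaled = scores / np.sqrt(d_k)

# Step 4: SoftMax turns the scores into weights that sum to 1.
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()

# Steps 5 and 6: weight each value vector and sum them.
z1 = (weights[:, None] * V).sum(axis=0)
```

`z1` is the 64-dimensional output for the first position; repeating the same computation with `Q[1]`, `Q[2]`, ... gives the outputs for the other words.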

All of the steps above can be done as matrix calculations:
<img src="images/sslng_1/selfatt_5.png" height="600" width="600">

Then calculate attention outputs
<img src="images/sslng_1/selfatt_6.png" height="600" width="600">

Instead of a single self-attention, multi-head attention is applied, with different weights for each head ($8$ heads in the original paper):
<img src="images/sslng_1/selfatt_7.png" height="600" width="600">

Per embedding, different outputs are generated:
<img src="images/sslng_1/selfatt_8.png" height="600" width="600">

The outputs are then concatenated horizontally and an additional weight matrix is used to produce a single matrix:
<img src="images/sslng_1/selfatt_9.png" height="600" width="600">
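A matrix-form sketch of multi-head attention with 8 heads of dimension 64. All weights are random placeholders, and the names `attention` and `W_o` are chosen here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for all positions at once."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 4, 512, 8
d_k = d_model // n_heads                     # 64

X = rng.normal(size=(n_words, d_model))
heads = []
for _ in range(n_heads):
    # Each head has its own query, key and value weight matrices.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

# Concatenate the heads horizontally and project back to d_model.
W_o = rng.normal(size=(n_heads * d_k, d_model)) / np.sqrt(d_model)
Z = np.concatenate(heads, axis=1) @ W_o
```

Because each head works in 64 dimensions, eight heads cost roughly the same as one full-width attention, which is the design choice mentioned earlier.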

Here is the big picture; performance improves with multi-head attention (compare with feature maps):
<img src="images/sslng_1/selfatt_10.png" height="600" width="600">

#### Positional encoding

The transformer architecture has no notion of word order (1st word, 2nd word, ...), so positional encoding is applied. For $d_{model}$-dimensional embeddings:
$$
\text{PE}(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),
$$
<br>
and
$$
\text{PE}(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
generate $d_{model}$-dimensional vectors in $\mathbb{R}^{d_{model}}$ with encoded positional information
<br>
In the paper $d_{model}=512$, so $i \in [0, 255]$
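The two formulas translate directly into NumPy. The function name and the `max_len` parameter are choices made for this sketch.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]             # shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # shape (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions 2i + 1
    return pe

pe = positional_encoding(max_len=50)
```

Row `pe[pos]` is the vector added to the embedding of the word at position `pos`.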

<br>
This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
<br>

The main point is that positional encodings should be distinguishable and periodic, so that sentences longer than any seen during training can still be encoded.
<br>

From the paper: "We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$."
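The linearity claim follows from the angle-addition formulas. Writing $\omega_i = 1 / 10000^{2i/d_{model}}$ (a shorthand introduced here), each sine/cosine pair at position $pos + k$ is a rotation of the pair at position $pos$, and the rotation matrix depends only on the offset $k$:
$$
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
$$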

These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention:
<img src="images/sslng_1/selfatt_11.png" height="600" width="600">

For example:
<img src="images/sslng_1/selfatt_12.png" height="600" width="600">

In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible:
<img src="images/sslng_1/selfatt_13.png" height="600" width="600">

Residual connections:
<img src="images/sslng_1/selfatt_14.png" height="600" width="600">

#### Different types of normalization (not only batch normalization exists)

Instead of batch normalization, let's compute the mean and standard deviation per instance, layer, group, etc.:
<img src="images/sslng_1/norm_1.png" height="1000" width="1000">
<br>
For images, anything besides batch normalization usually gives no improvement and sometimes hurts performance because of the channel structure. For the transformer architecture, because of the inter-text context and (multi-head) attention, layer normalization significantly improves performance
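The variants differ only in the axes over which the mean and variance are computed. A sketch on a tensor of shape (batch, channels, height, width), leaving out the learned scale and shift and the group variant; the helper name `normalize` and the tensor sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))   # (N, C, H, W)

def normalize(x, axes, eps=1e-5):
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

batch_norm = normalize(x, axes=(0, 2, 3))      # statistics per channel, across the batch
layer_norm = normalize(x, axes=(1, 2, 3))      # statistics per sample, across all channels
instance_norm = normalize(x, axes=(2, 3))      # statistics per sample and per channel
```

Layer and instance normalization do not depend on the other samples in the batch, which matters for variable-length sequences.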

More detailed illustration of layer normalization:
<img src="images/sslng_1/norm_2.png" height="1000" width="1000">

Layer normalization in the transformer is applied after self-attention: the input embeddings $X$ and the output of the layer are summed to preserve the original information, and the sum is normalized:
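A sketch of this add-and-normalize step, with random placeholders for the embeddings and the attention output and without the learned scale and shift parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one token's vector) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))    # input embeddings for 3 words
Z = rng.normal(size=(3, 512))    # placeholder for the self-attention output
out = layer_norm(X + Z)          # residual sum, then layer normalization
```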
<img src="images/sslng_1/selfatt_15.png" height="600" width="600">

This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
<img src="images/sslng_1/selfatt_16.png" height="1000" width="1000">

#### Decoder

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence:
<img src="images/sslng_1/selfatt_17.gif" height="1000" width="1000">

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

<img src="images/sslng_1/selfatt_18.gif" height="1000" width="1000">

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
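The only difference from self-attention is where the three matrices come from. A single-head sketch with random placeholder inputs and weights (the sequence lengths 6 and 2 are arbitrary):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
encoder_out = rng.normal(size=(6, d_model))   # 6 source words, output of the encoder stack
decoder_in = rng.normal(size=(2, d_model))    # 2 target positions, from the layer below

W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

Q = decoder_in @ W_q      # queries come from the decoder
K = encoder_out @ W_k     # keys come from the encoder output
V = encoder_out @ W_v     # values come from the encoder output

weights = softmax(Q @ K.T / np.sqrt(d_k))   # each target position attends over the source words
context = weights @ V
```

The attention weights have one row per target position and one column per source word, so the decoder can focus on different parts of the input at each step.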

## BERT (Bidirectional Encoder Representations from Transformers)

Train the transformer and use the encoder as a generator of representation embeddings

The additional layers for SimCLR, MoCo v2, etc.

## Output of transformer

The transformer ends with a fully connected layer followed by a SoftMax activation; it outputs a vector of vocabulary length whose probabilities encode the predicted word (the target is the one-hot encoded word):
<img src="images/sslng_1/transf_4.png" height="1000" width="1000">
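A toy sketch of this output layer. The six-word vocabulary and the logits are made up for illustration, and comparing against the one-hot target with cross-entropy is an assumption about the training loss.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # hypothetical vocabulary
logits = np.array([0.2, 1.5, 0.1, 3.0, 0.4, -1.0])       # hypothetical final-layer outputs

# SoftMax over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted = vocab[int(np.argmax(probs))]

# Cross-entropy against the one-hot target reduces to -log of the target's probability.
target = np.zeros(len(vocab))
target[vocab.index("thanks")] = 1.0
loss = -np.sum(target * np.log(probs))
```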

<img src="images/sslng_1/mask_lm_1.png" height="800" width="800">

<img src="images/sslng_1/ce_1.png" height="800" width="800">

<img src="images/sslng_1/ce_5.png" height="800" width="800">

<img src="images/sslng_1/mask_lm_2.png" height="800" width="800">

<img src="images/sslng_1/nw_lm_1.jpg" height="800" width="800">

<img src="images/sslng_1/nw_lm_2.png" height="800" width="800">

Next-word probability given the previous words:
$$
P(w_k | w_1, w_2, \dots, w_{k-1})
$$
<br>

and for masked language model:
$$
P(w_j | w_1, w_2, \dots, w_{j-1}, w_{j+1}, \dots, w_{k})
$$
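Both objectives generate their training targets from the raw text alone. A sketch in plain Python; the function names and the `[MASK]` placeholder token are assumptions made for this example.

```python
tokens = ["our", "dried", "voices", "when", "we", "whisper", "together"]

def next_word_examples(tokens):
    """(previous words, target) pairs for P(w_k | w_1, ..., w_{k-1})."""
    return [(tokens[:k], tokens[k]) for k in range(1, len(tokens))]

def masked_example(tokens, j, mask="[MASK]"):
    """(words with position j hidden, target) for the masked language model."""
    return tokens[:j] + [mask] + tokens[j + 1:], tokens[j]

causal = next_word_examples(tokens)
masked_input, masked_target = masked_example(tokens, j=2)
```

The next-word model sees only the left context, while the masked model sees the words on both sides of the hidden position.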

## Self-supervised learning on graphs

<img src="images/sslng_1/graph_ms_1.gif" height="800" width="800">

<img src="images/sslng_1/graph_ms_2.gif" height="800" width="800">

<img src="images/sslng_1/graph_ms_3.png" height="800" width="800">

<img src="images/sslng_1/graph_ms_4.gif" height="800" width="800">

## Conclusion

The representation part of a deep learning model reduces dimensionality and converts an input (row / word / text / image / node / graph) into a vector that preserves representational information about the input:

$$
h: X \to \mathbb{R}^d
$$
<br>

These vectors can then be used for other training objectives
<br>

Train the model on a pretext task with automatically generated targets, such that $h$ accumulates enough information for downstream tasks

Most effective pretext tasks:
- Contrastive learning for computer vision
- Masked-word prediction + next-word prediction for NLP
- Contrastive learning with MI maximization for graphs

We keep the weights of the $h$ model and train another model on top of it
<br>

Or fine-tune the entire model

Worth mentioning:
- Contrastive clustering
- Mutual information maximization

## Questions

<img src="images/sslng_1/questions_1.jpg" height="600" width="600">

## Thank you