<a id="visualization"></a>
# Understanding CNN representations - Visualization

As convolutional networks grew deeper and deeper, it became important to investigate:
- what these networks learn
- what the successive layers of representation look like in practice
- whether there is any point in adding layers at all

To answer these questions, visualizing the inner workings of CNNs became a growing field of experimentation.


-----------------------------------------
## ZFNet - the first tries

[Source](https://arxiv.org/abs/1311.2901)

Zeiler and Fergus built this winning model in 2013 through an in-depth analysis of AlexNet (which led them to use smaller filters and smaller strides in the first layers). Their most important contribution, however, was the use of a "deconvnet" to visualize intermediate representations (Zeiler, M., Taylor, G., and Fergus, R. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011).

### The main idea: Deconvolution

The deconvolution operator (more precisely called "transposed convolution") is used to lossily reconstruct the representation that was the input of a convolutional layer.
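To make the name "transposed convolution" concrete, here is a minimal 1D sketch in NumPy. It assumes a "valid" strided convolution written as a matrix `C`; the helper `conv_matrix` and the example kernel are illustrative choices, not taken from the ZFNet paper. Multiplying by `C.T` maps back to the input size, but it does not recover the original input.

```python
import numpy as np

def conv_matrix(kernel, n_in, stride=1):
    """Matrix C such that C @ x is the 'valid' strided cross-correlation of x with kernel."""
    k = len(kernel)
    n_out = (n_in - k) // stride + 1
    C = np.zeros((n_out, n_in))
    for i in range(n_out):
        C[i, i * stride:i * stride + k] = kernel
    return C

kernel = np.array([1.0, 2.0, 3.0])
x = np.arange(7, dtype=float)            # toy input of length 7
C = conv_matrix(kernel, n_in=7, stride=2)

y = C @ x                                # forward pass: length 7 -> length 3
x_back = C.T @ y                         # transposed convolution: length 3 -> length 7

print(y.shape, x_back.shape)
print("exact inverse?", np.allclose(x_back, x))
```

The transposed operator restores the shape, not the values, which is why the reconstruction is only approximate.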

<img src="https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/padding_strides_transposed.gif" width=400 heigth=400>

- **Unpooling:** In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. 
- **Rectification:** The convnet uses ReLU non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which should also be positive), we pass the reconstructed signal through a ReLU non-linearity.
- **Filtering:** The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
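The unpooling and rectification steps above can be sketched as follows. This is a simplified single-channel NumPy illustration with non-overlapping pooling regions; the function names and the toy feature map are assumptions made for the example, not the authors' code.

```python
import numpy as np

def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each maximum was (the 'switches')."""
    h, w = fmap.shape
    pooled = np.zeros((h // size, w // size))
    switches = np.zeros((h // size, w // size), dtype=int)
    for i in range(h // size):
        for j in range(w // size):
            region = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = np.argmax(region)      # flat index inside the region
            pooled[i, j] = region.max()
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Place each pooled value back at its recorded location; everything else stays zero."""
    h, w = pooled.shape
    out = np.zeros((h * size, w * size))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(int(switches[i, j]), size)
            out[i * size + di, j * size + dj] = pooled[i, j]
    return out

fmap = np.array([[1., 5., 2., 0.],
                 [3., 4., 1., 7.],
                 [0., 2., 9., 1.],
                 [6., 1., 3., 2.]])
pooled, switches = max_pool_with_switches(fmap)
recon = np.maximum(unpool(pooled, switches), 0.0)   # unpool, then rectify (ReLU)
print(recon)
```

Only the four maxima survive, each at its original position, which is what "preserving the structure of the stimulus" means here.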



### Deconvnet structure for visualization:

<img src="http://drive.google.com/uc?export=view&id=1BprM4pxtQHLABOjPiegMeZuucBhZGjEa" width=300 height=300>

<img src="https://cdn-images-1.medium.com/max/1600/1*KyfQTpv1hYDg8ABXNt0FVg.jpeg" width=400 heigth=400>


Projections of activations layer by layer into pixel space:

<img src="http://drive.google.com/uc?export=view&id=1aB5vreWKi3eFekLywq9hZF8Zbgs_frtE" width=700 height=700>

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (R1,C1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird's legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).

A nice summary of the ZFNet paper can be found [here](https://medium.com/coinmonks/paper-review-of-zfnet-the-winner-of-ilsvlc-2013-image-classification-d1a5a0c45103).


## Methods for visualization

Based on ["Visualizing what ConvNets learn"](https://cs231n.github.io/understanding-cnn/)

### Visualization of activations in early layers
**Layer Activations**. The most straight-forward visualization technique is to show the activations of the network during the forward pass. For ReLU networks, the activations usually start out looking relatively blobby and dense, but as the training progresses the activations usually become more sparse and localized. One dangerous pitfall that can be easily noticed with this visualization is that some activation maps may be all zero for many different inputs, which can indicate dead filters, and can be a symptom of high learning rates.
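The "dead filter" check described above is easy to automate once activations have been collected. The sketch below assumes the post-ReLU activations are stored in an array of shape `(n_inputs, n_filters, H, W)`; the data is synthetic and the function name is hypothetical.

```python
import numpy as np

def dead_filter_mask(activations, eps=0.0):
    """activations: (n_inputs, n_filters, H, W), post-ReLU.
    A filter is flagged as dead if it never exceeds eps on any input."""
    max_per_filter = activations.max(axis=(0, 2, 3))
    return max_per_filter <= eps

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(16, 8, 6, 6)), 0.0)   # synthetic post-ReLU activations
acts[:, 3] = 0.0                                         # simulate one dead filter
dead = dead_filter_mask(acts)
print("dead filters:", np.flatnonzero(dead))
```

In practice the mask should be computed over many different inputs, since a filter that is silent on one image is not necessarily dead.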


<img src="https://cs231n.github.io/assets/cnnvis/act2.jpeg" width=400 heigth=400>

<img src="https://cs231n.github.io/assets/cnnvis/act1.jpeg" width=400 heigth=400>

Typical-looking activations on the first CONV layer (left), and the 5th CONV layer (right) of a trained AlexNet looking at a picture of a cat. Every box shows an activation map corresponding to some filter. Notice that the activations are sparse (most values are zero, in this visualization shown in black) and mostly local.


### Visualization of weights
**Conv/FC Filters**. The second common strategy is to visualize the weights. These are usually most interpretable on the first CONV layer which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network. The weights are useful to visualize because well-trained networks usually display nice and smooth filters without any noisy patterns. Noisy patterns can be an indicator of a network that hasn’t been trained for long enough, or possibly a very low regularization strength that may have led to overfitting.
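A common way to look at first-layer weights is to rescale them to the displayable range and tile them into one image. The following is a minimal sketch, assuming the weights come in a `(n_filters, H, W, C)` layout; random numbers stand in for trained weights, and the shape `(96, 11, 11, 3)` mirrors AlexNet's first layer only for illustration.

```python
import numpy as np

def filters_to_grid(weights, pad=1):
    """Tile filters of shape (n_filters, H, W, C) into one image scaled to [0, 1]."""
    n, h, w, c = weights.shape
    lo, hi = weights.min(), weights.max()
    scaled = (weights - lo) / (hi - lo + 1e-12)       # shared scale across all filters
    cols = int(np.ceil(np.sqrt(n)))
    rows = int(np.ceil(n / cols))
    grid = np.ones((rows * (h + pad) + pad, cols * (w + pad) + pad, c))  # white background
    for idx in range(n):
        r, col = divmod(idx, cols)
        top, left = pad + r * (h + pad), pad + col * (w + pad)
        grid[top:top + h, left:left + w] = scaled[idx]
    return grid

rng = np.random.default_rng(0)
weights = rng.normal(size=(96, 11, 11, 3))            # stand-in for trained weights
grid = filters_to_grid(weights)
print(grid.shape)   # the result can be passed to any image display function
```

Random weights like these produce exactly the "noisy pattern" look that the text warns about; trained filters should look smoother.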

<img src="https://cs231n.github.io/assets/cnnvis/filt1.jpeg" width=400 heigth=400>

### Using prototypical images

We can scan through the training set, and choose for every unit the picture that activates it the most, thus getting a "prototypical" image. 
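Finding the "prototypical" images is an argsort over recorded activations. A minimal sketch, assuming that each unit's activation has already been reduced to a single number per image (for example the maximum over spatial positions); the small matrix below is made-up toy data.

```python
import numpy as np

def top_activating_images(unit_activations, k=3):
    """unit_activations: (n_images, n_units).
    Returns an (n_units, k) array of image indices, strongest activation first."""
    order = np.argsort(-unit_activations, axis=0)     # descending per unit
    return order[:k].T

acts = np.array([[0.1, 2.0],
                 [0.9, 0.3],
                 [0.5, 1.5],
                 [0.7, 0.0]])                         # 4 images, 2 units (toy values)
top = top_activating_images(acts, k=2)
print(top)   # row u lists the images that excite unit u the most
```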

<img src="https://cs231n.github.io/assets/cnnvis/pool5max.jpeg" width=600 heigth=600>

"Maximally activating images for some POOL5 (5th pool layer) neurons of an AlexNet. The activation values and the receptive field of the particular neuron are shown in white. (In particular, note that the POOL5 neurons are a function of a relatively large portion of the input image!) It can be seen that some neurons are responsive to upper bodies, text, or specular highlights."

An interesting parallel: in certain recurrent nets (see next class), a single specific neuron was found to signal, for example, the sentiment of a text. Visualizing activations can thus be quite insightful!

<img src="https://openai.com/content/images/2017/04/low_res_maybe_faster.gif" width=600 heigth=600>

Source: [Unsupervised Sentiment Neuron](https://blog.openai.com/unsupervised-sentiment-neuron/)

### Mapping the full space with the training images

We can use the CNN's inner representation vectors and put them into a common space, do a dimensionality reduction (typically t-SNE), and then place the original images at the resulting locations, thus getting a feeling about how the inner space of the CNN is structured.

<img src="https://cs231n.github.io/assets/cnnvis/tsne.jpeg" width=900 heigth=900>

### Covering some parts of the input

We can cover part of the picture with an occluding patch and slide this patch over the image, "convolution style", recording the probability of a given class at each position (e.g. if I cover this part, is the picture still classified as a dog?).
This gives insight into the key features used to recognize that class.
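The occlusion experiment can be sketched in a few lines. The example below assumes a grayscale image and a black-box function that returns the class probability; `toy_predict` is a hypothetical stand-in for a real network that only "looks" at the center of the image.

```python
import numpy as np

def occlusion_map(image, predict_proba, patch=2, fill=0.0):
    """Slide an occluding patch over the image and record the class probability at each position."""
    h, w = image.shape
    heat = np.zeros((h - patch + 1, w - patch + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[i, j] = predict_proba(occluded)
    return heat

def toy_predict(img):
    """Hypothetical classifier: probability is the mean brightness of the central 2x2 region."""
    return float(img[2:4, 2:4].mean())

image = np.ones((6, 6))
heat = occlusion_map(image, toy_predict)
print(np.round(heat, 2))   # low values mark regions the classifier depends on
```

The probability drops to its minimum exactly where the patch covers the region the toy classifier relies on, which is the effect shown in the heatmaps of the figure.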

<img src="https://cs231n.github.io/assets/cnnvis/occlude.jpeg" width=55%>

### Perturbing a single pixel (closely related to covering some parts of the input)
_It is this area of research that led to the development of adversarial examples._

Taken to its logical extreme, this gives the ["one pixel attack"](https://arxiv.org/abs/1710.08864).

<img src="http://itslab.csce.kyushu-u.ac.jp/~vargas/images/understanding.png" width=55%>

- Pixel perturbations are optimized with differential evolution (DE).
- DE is a population-based optimization algorithm for solving complex multi-modal optimization problems. It belongs to the general class of evolutionary algorithms (EA). Moreover, it has mechanisms in the population selection phase that keep diversity, so that in practice it is expected to efficiently find higher-quality solutions than gradient-based methods or even other kinds of EAs.
- Specifically, during each iteration another set of candidate solutions (children) is generated from the current population (parents). Each child is then compared with its corresponding parent and survives if it is fitter (has a higher fitness value) than the parent. By comparing only a parent with its own child, the goals of keeping diversity and improving fitness values can be achieved simultaneously.
- DE does not use gradient information and therefore does not require the objective function to be differentiable or previously known. Thus, it can be used on a wider range of optimization problems than gradient-based methods (e.g. non-differentiable, dynamic, noisy, among others).

The use of DE for generating adversarial images has the following main advantages:

- **Higher probability of finding global optima:** DE is a meta-heuristic which is relatively less subject to local minima than gradient descent or greedy search algorithms (this is in part due to diversity-keeping mechanisms and the use of a set of candidate solutions).
- **Requires less information from the target system:** DE does not require the optimization problem to be differentiable, as is required by classical optimization methods such as gradient descent and quasi-Newton methods. This is critical when generating adversarial images, since 1) there are networks that are not differentiable, and 2) calculating gradients requires much more information about the target system, which is hardly realistic in many cases.
- **Simplicity:** The approach proposed here is independent of the classifier used. For the attack to take place it is sufficient to know the probability labels. There are many DE variations/improvements, such as self-adaptive and multi-objective ones, among others. The current work can be further improved by taking these variations/improvements into account.
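The parent/child selection scheme described above can be illustrated with a toy one-pixel attack. Everything here is an assumption made for the sketch: the "classifier" is a small logistic model whose weights the attack never reads, the image is 8x8 grayscale, a candidate is `(row, col, value)`, and the population size and number of generations are far smaller than in a real attack. Children are produced by the classic DE mutation `x_r1 + F * (x_r2 - x_r3)` without crossover.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8
image = rng.uniform(0.3, 0.7, size=(H, W))    # toy grayscale "image"
hidden_w = rng.normal(size=(H, W))            # classifier internals, never read by the attack

def true_class_prob(img):
    """Hypothetical black-box classifier: only the output probability is visible."""
    return 1.0 / (1.0 + np.exp(-np.sum(hidden_w * (img - 0.5))))

def apply_pixel(img, cand):
    """cand = (row, col, value); returns a copy of img with exactly one pixel overwritten."""
    r = int(np.clip(np.rint(cand[0]), 0, H - 1))
    c = int(np.clip(np.rint(cand[1]), 0, W - 1))
    out = img.copy()
    out[r, c] = np.clip(cand[2], 0.0, 1.0)
    return out

def fitness(cand):
    # The attacker wants a LOW true-class probability, so lower is better here.
    return true_class_prob(apply_pixel(image, cand))

def one_pixel_de(pop_size=40, generations=30, F=0.5):
    pop = np.column_stack([rng.uniform(0, H - 1, pop_size),
                           rng.uniform(0, W - 1, pop_size),
                           rng.uniform(0, 1, pop_size)])
    fit = np.array([fitness(p) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            r1, r2, r3 = rng.choice(pop_size, size=3, replace=False)
            child = pop[r1] + F * (pop[r2] - pop[r3])   # mutation, no gradient needed
            f_child = fitness(child)
            if f_child < fit[i]:                        # child replaces only its own parent
                pop[i], fit[i] = child, f_child
    best = int(np.argmin(fit))
    return pop[best], fit[best]

best, adv_prob = one_pixel_de()
base_prob = true_class_prob(image)
adv_image = apply_pixel(image, best)
print(f"true-class probability: {base_prob:.3f} -> {adv_prob:.3f}")
print("pixels changed:", np.count_nonzero(adv_image != image))
```

Note that the search only ever calls `true_class_prob`, which matches the point above that probability labels are sufficient for the attack.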


And the famous reprogramming paper: [Adversarial Reprogramming of Neural Networks](https://arxiv.org/pdf/1806.11146.pdf)

<img src="https://venturebeat.com/wp-content/uploads/2018/07/Capture-boring.png?fit=400%2C312&strip=all" width=60%>

Some more interesting visualization methods can be found [here](https://distill.pub/2019/activation-atlas/).

# Representation learning revisited

The canonical example we gave for the effectiveness of deep networks is a hierarchical application of some "kernels", some folding of the input space that enables good separation.

<img src="http://drive.google.com/uc?export=view&id=1tQu8JagtQKjd7xVbB5uDBA0CebjQcZ2B" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1q6TEXhcZ0hU9nv4CycGcNJyUb9RqC_Xy" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1UFV35b84geZTymaTKQBpLXva8efafloW" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1jAyFn9iKhjSADG-YViN73goVic1x8iu5" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1XSrsBdnan08LVjcVjiJwRwn6_u3HyzvA" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1Aqx6qLy9pVt1-p2IG_CM0SKLnh7-Y2cI" width=50%>


But what does this mean in practice? 


## Connection with "representation" techniques

<img src="https://sebastianraschka.com/images/blog/2014/linear-discriminant-analysis/lda_1.png" width=75%>

For classical "representation learning" techniques, we have observed that they search for a transformation of the original data space that makes it more "informative".

The baseline approach is **principal component analysis**, in which the search for linear transformations (rotations) of the representation space is guided by the **variance pattern** of the data, yielding "axes" that explain most of the original variance. The drawback of this method is that it uses **linear transformations** and **does not utilize the category information** of the classification problem.
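As a reminder of how little machinery PCA needs, here is a minimal sketch via the singular value decomposition. The data is synthetic (points scattered along the direction `(1, 2)` with a little noise), and the function name `pca` is just a label chosen for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project centered data onto the directions of largest variance (rows of Vt)."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return Xc @ components.T, components, explained_ratio[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])   # points near the line y = 2x

Z, comps, ratio = pca(X, n_components=1)
print("first axis:", np.round(comps[0], 3))
print("explained variance ratio:", np.round(ratio, 4))
```

The recovered axis is (up to sign) the direction of the line, and no label information was used anywhere.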

Going one step further, **linear discriminant analysis** tries to mitigate this second drawback by finding transformations that are **explicitly useful for classification**, though the constraint of linear separation remains, as the name implies.
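The difference between the two methods shows up clearly on a synthetic two-class problem where the direction of largest variance is useless for classification. The sketch below uses the two-class Fisher discriminant (a special case of LDA); the data and function names are assumptions made for the illustration.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Two-class Fisher discriminant: w is proportional to Sw^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (len(X0) - 1) * np.cov(X0, rowvar=False) + (len(X1) - 1) * np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

def first_principal_axis(X):
    """Direction of largest variance, ignoring the labels."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
n = 200
# Both classes are widely spread along x, but they differ only along y.
X0 = np.column_stack([5.0 * rng.normal(size=n), 0.3 * rng.normal(size=n) - 1.0])
X1 = np.column_stack([5.0 * rng.normal(size=n), 0.3 * rng.normal(size=n) + 1.0])

w_lda = fisher_direction(X0, X1)
w_pca = first_principal_axis(np.vstack([X0, X1]))
print("LDA direction:", np.round(w_lda, 3))
print("PCA direction:", np.round(w_pca, 3))
```

PCA follows the high-variance x axis, while the Fisher direction points along y, where the classes actually separate.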

On the other hand, more recent **non-linear** embedding methods such as [**t-SNE**](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) (t-distributed Stochastic Neighbor Embedding) and, later, [UMAP](https://arxiv.org/abs/1802.03426) successfully relax the constraint of linear embedding and try to preserve the neighborhood structure of complex manifolds after projecting them to lower dimensionality.

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_t_sne_perplexity_001.png" width=75%>

(By the way, the map of training images shown earlier was projected to 2D with t-SNE; visualization of this kind is its original use case.)

But among other drawbacks (computational and generalization-related), these techniques do not learn **target-specific** representations of the data.


We can state that the best "embedding" of a dataset is through its **salient features with respect to a given problem**.

And we can demonstrate that convnets do exactly this:

## They DO learn feature hierarchies!

How do we know? We use the visualization techniques to track it!

<img src="http://drive.google.com/uc?export=view&id=1kH7OYoulhmwPbGeIU_gpBNKKfBWn-s0Y">


## Generality of learned features

If we examine in more detail the representations learned for given tasks, we can see that:

1. Distinct patterns are learned for separate classes
2. The low-level features are not that distinct
3. If we learn on a broad set of classes, maybe our features will be more general?

<img src="https://deliveryimages.acm.org/10.1145/2010000/2001295/figs/f4.jpg" width=85%>



## Detour: Learned representations in "shallow" networks: word2vec

If we take a look at what happened in the area of NLP, we can observe some pretty parallel developments:

**"Don't count, predict!"**


<img src="http://drive.google.com/uc?export=view&id=1uu00eAJi3-3tH2Iz8IMP9vz3iWtgdmqM" width=75%>


With the publication of "Distributed Representations of Words and Phrases and their Compositionality" by [Mikolov et al. 2013](https://arxiv.org/pdf/1310.4546.pdf), a _huge_ shift occurred in the NLP community: it moved away from frequency-based methods and introduced prediction-based methods for building efficient language models (at first at the word level).

### Schematics

<img src="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/models.png" width=65%>

<img src="https://i.stack.imgur.com/igSuE.png" width=45%>

(It is important to note that the use of "hierarchical softmax", an idea that predates word2vec, was popularized by this research, since output layers as wide as the vocabulary $v$ consumed an extreme amount of computation (by 2013 standards) for a vocabulary of 300k. Thanks to this, efficient CPU implementations of word2vec are available "out of the box", for example in [Gensim](https://radimrehurek.com/gensim/models/word2vec.html).)
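To see why a tree-structured output is cheaper, here is a minimal sketch of hierarchical softmax over a tiny vocabulary. It makes simplifying assumptions: the tree is a complete binary tree over 8 words (word2vec itself uses a Huffman tree), inner nodes are indexed heap-style starting at 1, and all vectors are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_prob(word, h, node_vecs, depth):
    """P(word | h) as a product of left/right decisions along the path from the root."""
    node, p = 1, 1.0
    for level in reversed(range(depth)):
        go_right = (word >> level) & 1            # bits of the word index encode the path
        s = sigmoid(node_vecs[node] @ h)          # probability of turning right at this node
        p *= s if go_right else 1.0 - s
        node = 2 * node + go_right
    return p

rng = np.random.default_rng(0)
depth, dim = 3, 5
V = 2 ** depth                                    # vocabulary of 8 words
node_vecs = rng.normal(size=(V, dim))             # rows 1..V-1 are inner nodes; row 0 is unused
h = rng.normal(size=dim)                          # hidden representation of the context

probs = [hs_prob(w, h, node_vecs, depth) for w in range(V)]
print("sum of probabilities:", round(sum(probs), 6))
print("dot products per word:", depth, "instead of", V)
```

The probabilities form a valid distribution, yet each one needs only `depth = log2(V)` dot products; for a vocabulary of 300k that is on the order of 18 instead of 300,000.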

### Advantage

The real advantage of these "word embeddings" (which have been the workhorse of NLP ever since) was not that they were useful for predicting the next words "autocomplete style" (as we have seen before in our training), but rather that they serve as general dense vector representations, "embeddings", of words. The main breakthrough of Mikolov et al. was to discover the deep structure that these vector spaces exhibit after training!
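The best-known form of this structure is vector arithmetic on word analogies. The sketch below only illustrates the mechanics: the 3-dimensional vectors are hand-made toy values chosen so that the analogy works, not trained embeddings, and the helper names are hypothetical.

```python
import numpy as np

# Hand-made toy vectors (NOT trained): axis 1 loosely encodes "female", axis 2 "royal".
vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.2, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("man", "king", "woman")          # king - man + woman
print(result)   # → queen
```

With trained embeddings the same nearest-neighbor search is run over the whole vocabulary, for example through Gensim's `most_similar` method.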


<img src="http://drive.google.com/uc?export=view&id=1heogQhMfvtiOSfPtKvmc2OtyGadAsOmd" width=60%>


A good analysis of this topic can be found in [Marek Rei's blogpost](http://www.marekrei.com/blog/dont-count-predict/).

(Naturally, progress did not stop here, you can read up on successive generations of vector embedding for NLP [here](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a).)

### Why can this be?

For the model to solve the prediction task effectively, it has to come up with a representation that captures the salient features of the data in the most compact way; that is, it lossily memorizes and compresses the data, and in doing so captures its main features.

**It turns out, that the decisive advantage of deep learning based methods is exactly this: the "byproduct" of learning hierarchic, meaningful features during training.** Throughout this class, we will examine the effects and possibilities arising from this. 

Remember:
**Representation is everything!**

(And observe: **Sometimes ONE hidden layer was enough**!)

Big question:

**What can we do when we have some domain independent representations?**