In [1]:
%run Latex_macros.ipynb


$$
\newcommand{\kernel}{\mathbf{k}}
$$

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
from IPython.display import Image

import cnn_helper
%aimport cnn_helper
cnnh = cnn_helper.CNN_Helper()

# How does a Deep Learning Classifier work ?

We will present the outputs of a very high accuracy classifier for the ImageNet dataset.

ImageNet:
- Large database of hand-labelled images
    - 14MM images, 22K classes
    - [High level categories](http://image-net.org/about-stats)

- Annual competition
    - no Deep Learning prior to 2012
    - drove innovation in Deep Learning post 2012

    - Training data: 1.2MM images
    - 200 dogs and cats !


<img src="images/ImageNet_progress.jpg" width=900>

Here is the classifier's response to a cat image:
<table>
    <tr>
        <td><img src="images/cat7_classified.png" width=600></td>
    </tr>
</table>

High confidence.

How does the classifier "recognize" this as a "tiger cat" ?

Maybe: by its parts ?

<table>
    <tr>
        <center>How does it work: Parts ?</center>
    </tr>
    <tr>
        <td><img src="images/cat7_shuffle_classified_1.png" width=450></td>
        <td><img src="images/cat7_shuffle_classified_2.png" width=450></td>
    </tr>
</table>

Maybe parts, but certainly not arranged properly !
- CNN filter looking for presence/absence of features
- not necessarily location

<table>
    <tr>
        <center>How does it work: Parts ?</center>
    </tr>
    <tr>
        <td><img src="images/cat7_occlude_250.png" width=900></td>
    </tr>
    <tr>
        <td><img src="images/cat7_occlude_200.png" width=900></td>
    </tr>
</table>


Probably not.  Covering up (occluding) various parts still results in correct classification.

What about its shape ?

<table>
    <tr>
        <center>How does it work: Shape ?</center>
    </tr>
    <tr>
        <td><img src="images/cat7_classified.png" width=450></td>
        <td><img src="images/cat7_silhouette_classified.png" width=450></td>
    </tr>
</table>


Probably not.

Maybe: texture ?

 <table>
    <tr>
        <center>How does it work: Texture ?</center>
    </tr>
    <tr>
        <td><img src="images/cat7_classified.png" width=450></td>
        <td><img src="images/cat7-elephant1_classified.png" width=450></td>
    </tr>
</table>

Perhaps it's the texture.

# What is a feature map looking for ?

Up until now, our understanding of the workings of a NN has been limited
- each layer is a transformation
    - from representation ("synthetic features") given by output of layer $l-1$
    - to a new latent representation, the output of layer $l$
- for classification
    - the final layer is a logistic regression
        - pattern matching the features of penultimate layer
        
We now begin a quest to understand *what* these transformations are accomplishing.

Much of the presentation is based on a very influential paper
by [Zeiler and Fergus](https://arxiv.org/abs/1311.2901)
- NYU PhD candidate and advisor !



## The first layer

It is relatively easy to understand the transformation of the first layer, as its inputs are
from our problem domain, which makes them interpretable.

For a Dense (FC) layer, we can view the weights as the pattern in the input domain being searched for.

This applies as well to the filters in the first convolutional layer.

<table>
    <center>Layer 1 filters</center>
    <tr>
        <td><img src="images/img_on_page_-004-112.jpg" width=800></td>
    </tr>
</table>

- Each square in the grid represents values of a single filter ("template") in Convolutional Layer $1$
- The templates seem to represent simple geometric shapes
    - lines in various orientations
    - colors
    - shading

## Beyond the first layer

Interpretation of filters/weights beyond the first layer is difficult:
- layer $\ll$ takes features from layer $\ll-1$
- which are synthesized and not necessarily interpretable

What we can hope to do
- somehow map the representation created by layer $\ll >1$ into the inputs (layer 0 output)

The methods fall into two classes
- input dependent
- input independent

# Interpretation of activations

Recall that the output of each layer is a transformed representation of the input.

So input $\x^\ip$ gets transformed to $\y_\llp^\ip$ at layer $\ll$.

We will give several methods to try to discern the meaning of $\y_\llp$.

# PCA


- Feed all $n$ examples in $\X$ into the NN
    - layer $\ll$ representation of $\x^\ip: \y^\ip_\llp$
- Compute PCA of the collection of representations $[ \y^\ip_\llp | 1 \le i \le n ]$
- Is there some property $p$ that the representation captures ? To check
    - project example $i$ onto the first few PC's
    - label the projected point with $p(\x^\ip)$
    - see whether clusters form with similar values of property $p$
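
The PCA procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to produce the figures in these notes: the array `reps` is a hypothetical stand-in for the layer $\ll$ representations (a real Convolutional Layer output would first be flattened to one row per example), and `labels` plays the role of the property $p$.

```python
import numpy as np

def pca_project(reps, n_components=2):
    """Project rows of `reps` (n examples x d features) onto the top principal components."""
    centered = reps - reps.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical stand-in for layer representations: two well-separated groups
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)                     # property p of each example
reps = rng.normal(size=(100, 16)) + 6.0 * labels[:, None]

projected = pca_project(reps)                      # shape (100, 2)
# Do examples with the same value of p land near each other along the first PC ?
pc1_means = [projected[labels == k, 0].mean() for k in (0, 1)]
```

If the two entries of `pc1_means` are far apart relative to the spread within each group, the first PC separates the examples by $p$, which is the kind of clustering we look for in the plots.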


Here is a Convolutional Neural Network applied to MNIST digit classification.

<table>
    <tr>
        <center>MNIST CNN</center>
    </tr>
    <tr>
        <td><img src="images/mnist_cnn_pca_0.jpg" width=800></td> 
    </tr>
</table>

And a projection of the representation produced by the first Convolutional Layer onto the first 2 PC's

<table>
    <tr>
        <center>MNIST CNN Conv1 PCA</center>
    </tr>
    <tr>
        <td><img src="images/mnist_cnn_pca_1.jpg" width=800></td> 
    </tr>
</table>

The property we are postulating is useful because "similar" digits are clustered together
- Is the layer recognizing features that group digits ?
- Left to right: strong vertical ("1", "7") to less vertical ?
- Bottom to top: digits *without* "curved tops" to those with tops ?

Let's examine the representation after the second Convolutional Layer.
<table>
    <tr>
        <center>MNIST CNN Conv2 PCA</center>
    </tr>
    <tr>
        <td><img src="images/mnist_cnn_pca_2.jpg" width=800></td> 
    </tr>
</table>

# Interpretation: What is the role of a single neuron/single feature map ?

Rather than interpreting $\y_\llp$ in its entirety, perhaps we can discern the meaning of a single element
$j$ of $\y_\llp$.

For a given layer $\ll$, the layer output $\y_\llp$ consists of many features.

Can we discern what role feature $\y_{\llp,j}$ plays ?

**Note**
If $\y_\llp$ is of dimension $n_\llp > 2$ then $\y_{\llp,j}$ denotes a single *feature map*
spanning a multi-dimensional space.

So when we refer to the value of $\y_{\llp,j}$, we mean a summary (e.g., max, average) of all values
in the feature map.

# Maximally Activating Examples
- Feed all $n$ examples in $\X$ into the NN
- Measure the response $\y_{\llp,j}$ 
- Do the examples with largest/smallest responses share some common property $p$ ?

If so, then perhaps $\y_{\llp,j}$ encodes a feature measuring the strength of $p$

Let
- $\y_{\llp,j}^\ip$ denote the response of feature $\y_{\llp,j}$ to input $\x^\ip$.
- $[ i_1, i_2, \ldots, i_n ]$ denote the permutation of $[1, \ldots, n]$ that sorts the responses $[ \y_{\llp,j}^\ip | 1 \le i \le n]$
    - $\y_{\llp,j}^{(i_1)} \le \y_{\llp,j}^{(i_2)} \dots \le \y_{\llp,j}^{(i_n)}$
- Is $p(\x^{(i_k)})$ true for all $k > T$ for some index $T \le n$ ?   
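
A minimal sketch of the sorting step, assuming the feature maps for feature $j$ have already been computed for every example. The array `feature_maps` and the helper names are hypothetical; each feature map is reduced to a scalar by its maximum, one of the summaries mentioned above.

```python
import numpy as np

def summarize_feature_map(fmap, how="max"):
    """Reduce one feature map (any shape) to a scalar response."""
    return float(fmap.max() if how == "max" else fmap.mean())

def top_activating(feature_maps, k=3):
    """Indices of the k examples with the largest summarized response, largest first."""
    responses = np.array([summarize_feature_map(f) for f in feature_maps])
    order = np.argsort(responses)          # i_1, ..., i_n: ascending responses
    return order[::-1][:k], responses

# Hypothetical feature maps of shape (n, height, width) for feature j of some layer
rng = np.random.default_rng(0)
feature_maps = rng.uniform(0.0, 1.0, size=(20, 4, 4))
feature_maps[7, 2, 1] = 5.0                # example 7 responds most strongly
feature_maps[3, 0, 0] = 3.0                # example 3 responds second most strongly

top, responses = top_activating(feature_maps, k=2)
```

The examples indexed by `top` are the ones we would then inspect by eye for a shared property $p$.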

<table>
    <tr>
        <center>MNIST CNN maximally activating 8's</center>
    </tr>
    <tr>
        <td><img src="images/mnist_cnn_max_activating_8.jpg" width=800></td> 
    </tr>
</table>

Interesting !  Do we have a problem with certain 8's ?

Much lower probability when
- 8 is thin versus thick
- tilted left versus right

## Occlusion

Maximally activating inputs are very coarse: they identify concepts at the level of entire input.
    
But, it's reasonable to suspect that some elements of the input are more important to the concept than others.

In particular, a CNN has a "receptive field" which defines the input elements that contribute to the layer output.

Close to the input layer, the receptive field is narrow, so it's clear that the "features" being identified are small in span.

Occlusion is one way of identifying the elements of the input layer that most affect the latent
representation.  

We will describe this in terms of a 2D input, but we can generalize.

Let
- $\y_{\llp,j}^\ip$ denote the response of feature $\y_{\llp,j}$ to input $\x^\ip$.
- Place an occluding square over some portion of input $\x^\ip$ and measure the change in $\y_{\llp,j}$
- Do this for each location in input $\x^\ip$ and create a "heat map" of changes in response $\y_{\llp,j}$ 
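
The occlusion loop can be sketched as follows. This is an illustration under simplifying assumptions: `response_fn` is a hypothetical stand-in for the mapping from input to $\y_{\llp,j}$ (in practice, a forward pass through the trained NN), and the occluding square is filled with zeros.

```python
import numpy as np

def occlusion_heatmap(image, response_fn, size=2, fill=0.0):
    """Change in response when a size x size square covers each location of a 2D image."""
    base = response_fn(image)
    rows = image.shape[0] - size + 1
    cols = image.shape[1] - size + 1
    heat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = image.copy()
            occluded[r:r + size, c:c + size] = fill
            heat[r, c] = response_fn(occluded) - base
    return heat

# Hypothetical feature: responds only to the brightness of the central 2x2 region
def center_response(img):
    return float(img[2:4, 2:4].sum())

image = np.ones((6, 6))
heat = occlusion_heatmap(image, center_response, size=2)
```

Entries of `heat` are most negative where the square covers the region the feature depends on, and zero where the square misses it entirely.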

The number on top is the percent decrease in $\y_{(L),j}$, the logit for digit 8.

<table>
    <tr>
        <center>Occluding 8</center>
    </tr>
    <tr>
        <td><img src="images/mnist_cnn_occlude_8.jpg" width=800></td> 
    </tr>
</table>

Not what we expected !  

The mere presence of the square changes the classification probability
greatly, even when we are not blocking the "waist" of the 8.

Here is the change in response of a single feature map in layer 5 of an image classifier (Zeiler and Fergus).

The chosen feature map is the one with the highest activation level in the layer.

You can see that it is responding to "faces".

<table>
    <tr>
        <th><center>Input image</center></th>
        <th><center>Activation of one filter at layer 5</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-007-139.png" width=400></td>
        <td><img src="images/img_on_page_-007-148.png" width=400></td>
    </tr>
</table>

Zeiler and Fergus also measured the change in activation of $\y_{(L),j}^\ip$, the logit corresponding to the correct
class ("Afghan Hound").

<table>
      <tr>
        <th><center>Input image</center></th>
        <th><center>Change in logit for "Afghan hound"</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-007-139.png" width=400></td>
        <td><img src="images/img_on_page_-007-145.png" width=400></td>
    </tr>
</table>

- When the dog is masked, the logit drops a lot (gets colder: blue)
- When the two faces are masked, the logit increases (gets hotter: red)
    - perhaps the faces were competing with the dog for possible classification ?

This is super interesting because "Face" is *not* in the set of target classes for the training set!

This suggests that the CNN 
- found it useful to recognize faces
- even though Face is not an output class
- perhaps because Face is *correlated with* some other target class

The authors also found this to be true for Text.
- not an output class
- but strongly correlated with Book, which is a target class

# Gradient Ascent: Inverting a Neural Network

Our use of NN thus far has been to find weights $\W$
- that maximize the chance of correctly predicting $\hat{y} = \y^\ip$ given input $\x^\ip$.

$$
\W = \argmin{\W} \loss
$$

where $\loss$ is a measure of the "distance" between correct prediction $\y^\ip$ and actual prediction $\hat{\y}$.
- typically MSE for regression
- cross entropy for classification

We solved for the loss minimizing $\W$ by Gradient Descent on the loss function
- Find the gradient of the loss with respect to weights
$$
\frac{\partial \loss}{\partial \W}
$$
- update weights $\W$ in the negative direction of the gradient

Let's suppose our trained NN 
classifies inputs $\x$ among classes in set $C$.

Let $\W$ be the weights that we obtained.

Suppose we *freeze* $\W$ and present the NN
with the following input/target pair
$$
(\x^{(i')}, \y^{(i')}) = (\x^{(0)}, \y^{(0)})
$$

where
- $\x^{(0)}$ is random noise, or all zero
- $\y^{(0)} \in C$

What does the following derivative do ?

$$
\frac{\partial \loss}{\partial \x}
$$

This gradient (with respect to the input layer $\x$)
- shows us the direction in which to modify $\x$
- so that the NN will classify the modified input as $\y^{(0)}$
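
One way to make the contrast with training concrete is to write the two update rules side by side. The step size $\alpha > 0$ is an assumption introduced here for illustration; it is not defined elsewhere in these notes.

$$
\begin{array}{ll}
\text{training (input fixed):} & \W \leftarrow \W - \alpha \frac{\partial \loss}{\partial \W} \\
\text{inversion (weights frozen):} & \x \leftarrow \x - \alpha \frac{\partial \loss}{\partial \x}
\end{array}
$$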



Our optimization becomes
$$
\begin{array}{l}
\x = \argmin{\x} \loss^\ip \\
\text{subject to} \\
\hat{\y}^\ip = \y^{(0)}
\end{array}
$$

That is, we are
- starting with $\x = \x^{(0)} $
- modifying $\x$
- to derive an input $\x$ that causes the NN to predict $\hat{y} = \y^{(0)}$ with highest probability

We are inverting the output.

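We refer to this process as *Gradient Ascent* on the input: for a cross entropy loss, descending the loss with respect to $\x$ is the same as ascending the probability that the NN assigns to $\y^{(0)}$.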
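
Here is a minimal sketch of the idea on a toy model, assuming a single linear layer followed by softmax and cross entropy loss. The weights `W`, the bias `b`, the step size `lr` and the number of steps are all hypothetical; a real NN would obtain the gradient with respect to $\x$ by back propagation through every layer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def invert_class(W, b, target, steps=500, lr=0.05):
    """Gradient descent on cross entropy loss with respect to the input x, weights frozen."""
    x = np.zeros(W.shape[1])               # x^(0): all zeros
    onehot = np.eye(W.shape[0])[target]
    for _ in range(steps):
        probs = softmax(W @ x + b)
        grad_x = W.T @ (probs - onehot)    # d loss / d x for softmax + cross entropy
        x -= lr * grad_x                   # only x changes; W and b are never updated
    return x, softmax(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))                # hypothetical frozen weights: 3 classes, 8 inputs
b = np.zeros(3)
x_inv, probs = invert_class(W, b, target=2)
```

After the loop, `probs` puts most of its mass on the target class, and `x_inv` is the synthetic input that the frozen model associates with that class.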

Note that the final $\x$ is not unique
- it is conditioned on the initial  $\x^{(0)}$

We will show how to abuse the NN by choice of $\x^{(0)}$

The inverted $\x$ can be thought of as *the canonical* example of an input with class $\y^{(0)}$ 

We can use this $\x$ as a way of "discovering" the input pattern that the NN is seeking to match
when classifying an input with label $\y^{(0)}$.

Note that we don't have to invert ultimate output $\hat{y} = \y_{(L)}$.

We can invert any single activation in any layer $\y_\llp$.

Thus we are "discovering" what any individual unit in any layer is seeking to match.

# What is a CNN looking for: Zeiler and Fergus

 

## Input dependent methods


### Saliency maps/guided back propagation/deconvolution

A method similar to occlusion is to just compute the derivative
of $\y_{\llp,w,h,c}$ with respect to each coordinate $(w,h)$ in input $\x^\ip$:

$$
\frac{\partial \y_{\llp,w,h,c}}{\partial \x^\ip_{w,h}}
$$

There is a group of closely related methods along these lines that were discovered independently.

The objective of each of these methods is to identify the elements of the input example
that most affect an activation (or summary).

All the techniques use variants of back propagation on 
$
\frac{\partial \y_{\llp,w,h,c}}{\partial \x^\ip_{w,h}}
$

Observe that input elements $\x^\ip_{w,h}$ that don't contribute to $\y_{\llp,w,h,c}$ have zero 
derivatives.

For those input elements $\x^\ip_{w,h}$ that do contribute to $\y_{\llp,w,h,c}$ 
the various methods compute slight variations of the true derivative.
Some examples
- back-propagate only *positive* intermediate values
    - the "derivative" identifies only strong positive activations
- ignore the channel dimension by back propagating only the maximum (across channels) of the gradient
    - so identifies influential elements of the input but can't associate them with a particular concept ?

It is not so important to understand the details as the concept: identification of the elements 
of input $\x^\ip$ that most affect the feature map being probed.
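
To illustrate the concept, the sketch below estimates the derivative of one activation with respect to every input element. It deliberately uses central finite differences rather than back propagation, so it is not any of the named methods. The `kernel` and the `activation` function are hypothetical stand-ins for a trained feature.

```python
import numpy as np

def saliency_map(x, activation_fn, eps=1e-4):
    """Central finite-difference estimate of d activation / d x for every input element."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        up, down = x.copy(), x.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (activation_fn(up) - activation_fn(down)) / (2 * eps)
    return grad

# Hypothetical activation: ReLU of a 3x3 filter applied at the top-left corner of the input
kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])

def activation(x):
    return max(0.0, float((x[:3, :3] * kernel).sum()))

x = np.zeros((5, 5))
x[:, 0] = 1.0                              # bright left edge, so the ReLU is active
sal = saliency_map(x, activation)
```

Inside the $3 \times 3$ receptive field, `sal` reproduces the filter weights; everywhere else it is zero, matching the observation that input elements that don't contribute have zero derivatives.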

#### Deconvolution and Maximally Activating Patches

The Zeiler and Fergus paper uses a form of guided back propagation called *de-convolution* (also known as Convolution Transpose).
Its main features
- ignores signs of activations
    - so looks for "strong" signals, either positive or negative
- it is able to go "backwards" through max-pooling layers by recording how the forward pass flowed in "switches"

What Deconvolution will do is find the sub-areas (called "patches") of the input that influence
a feature map.

- Similar to Occlusion in that it identifies sub-areas of the input
- Different in that it starts at the feature map and works backwards
    - rather than starting at the input and working forwards

For each layer beyond the first, Zeiler and Fergus show the 9 patches that maximally activate
a feature map.
That is
- a feature map is summarized by the largest absolute value
- we find the $N = 9$ inputs responsible for the largest summary of the chosen feature map
- we identify the sub-areas (patches) of these images
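
The selection of maximally activating patches can be sketched for the simplest case: a first-layer filter, whose receptive field is just the filter's footprint on the input. This is an assumption-laden simplification and not the deconvolution procedure of the paper; for deeper layers the patch has to be found by projecting back through the network. The `kernel` and `images` below are hypothetical.

```python
import numpy as np

def feature_map(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel (stride 1, no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = (image[r:r + kh, c:c + kw] * kernel).sum()
    return out

def top_patches(images, kernel, n=9):
    """For the n images with the largest |activation|, return (image index, patch)."""
    found = []
    for i, img in enumerate(images):
        fmap = np.abs(feature_map(img, kernel))          # summary: largest absolute value
        r, c = np.unravel_index(np.argmax(fmap), fmap.shape)
        patch = img[r:r + kernel.shape[0], c:c + kernel.shape[1]]
        found.append((fmap[r, c], i, patch))
    found.sort(key=lambda t: t[0], reverse=True)
    return [(i, patch) for _, i, patch in found[:n]]

rng = np.random.default_rng(0)
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])            # hypothetical vertical-edge filter
images = rng.uniform(0.4, 0.6, size=(5, 6, 6))           # low-contrast backgrounds
images[3, 2:4, 1:3] = [[1.0, 0.0], [1.0, 0.0]]           # image 3 contains a strong edge
best = top_patches(images, kernel, n=1)
```

Here `best` holds the index of the image containing the strong edge, together with the patch that matches the filter.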

What we find is
- closest to the input, where the receptive field is smallest, primitive geometric features are matched
    - lines
    - shades of color
- layer $(\ll+1)$ combines features of layer $\ll$ into increasingly complex templates for matching
    - edges combined into corners; corners into squares
- at higher layers, "concepts" start to emerge
    - dog face
    - text
- as the layers get closer to the classifier "head", the templates become more specialized for the specific task
    - this has implications for Transfer Learning
        - too close to the classifier results in task-specific features that may not transfer as well as shallower features
        

From Figure 2.  Best viewed under magnification.
(we have intentionally scaled the right image to be larger in order to better see its elements)

Here are the filters for the first layer, and the patches that activate them.

Note: for each filter, the 9 most activating patches are shown.

So each grid element on the left corresponds to a grid element on the right organized as $3 \times 3$ patches.

For example, the filter at row 7, column 4 (on the left) seems to respond to checked patches (on the right).

<table>
    <center>Layer 1 filters</center>
    <tr>
        <th><center>Filter</center></th> <th><center>Strongly activating image patch for each filter</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-004-112.jpg" width=300></td>
        <td><img src="images/img_on_page_-004-114.jpg" width=900></td>
    </tr>
</table>

From Figure 2.  Best viewed under magnification.

Layer 3.

Left: Each grid element shows the top 9 activations (organized as a $3 \times 3$ grid) of some feature map,
projected to input space.

Right: Each grid element is the patch of the activating image.

<table>
    <tr>
        <th><center>Layer 3 feature map projected to image, for multiple filters</center></th>
        <th><center>Strongly activating image patch for each filter</center></th>
    </tr>
    <tr>
        <td><img src="images/img_on_page_-004-115.jpg" width=600></td>
        <td><img src="images/img_on_page_-004-117.jpg" width=600></td>
    </tr>
</table>

Magnified view of layer 3 maximally activating patches.

<img src="images/img_on_page_-004-117.jpg" width=1000>

## Input independent methods

All of the preceding methods were specific to a particular input $\mathbf{x}^{(i)}$.

We now describe a method independent of the input.

The objective will be to construct a synthetic input $\mathbf{x}'$
that maximizes the value of the activation $\y_{\llp,w,h,c}$ of a chosen feature map.

One can then try to interpret the activation $\y_{\llp,w,h,c}$ (or a summary of the activations in a layer $\ll$) as attempting to match the synthetic input.
Again, if the synthetic input is readily identifiable, we can attempt to infer that the layer
is searching for this pattern.
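
A minimal sketch of constructing such a synthetic input, assuming the chosen activation is the pre-activation of a single linear filter at one location. The `kernel`, the step size `lr` and the unit-norm constraint on the input are assumptions made for this illustration; without some constraint the activation could be increased without bound simply by scaling the input.

```python
import numpy as np

def maximize_activation(kernel, shape, row, col, steps=100, lr=0.5, seed=0):
    """Projected gradient ascent on the input for one pre-activation of a linear filter.

    The input is kept at unit norm so the activation cannot grow without bound.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)
    x /= np.linalg.norm(x)                          # start from normalized random noise
    kh, kw = kernel.shape
    grad = np.zeros(shape)
    grad[row:row + kh, col:col + kw] = kernel       # d activation / d x (constant: filter is linear)
    for _ in range(steps):
        x += lr * grad                              # ascend the activation
        x /= np.linalg.norm(x)                      # project back onto the unit sphere
    return x

kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])               # hypothetical vertical-edge filter
x_syn = maximize_activation(kernel, shape=(6, 6), row=1, col=2)
```

In this toy case the synthetic input converges to the filter itself (rescaled to unit norm) placed inside the receptive field, with zeros elsewhere: the input that maximizes the activation is the template being matched.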

### Final word on gradient ascent, with constraints: Cool cost functions

Neural style transfer

# Further exploration

There is a nice video by [Yosinski](https://youtu.be/AgkfIQ4IGaM) which examines the behavior of
a Neural Network's layers on video images rather than stills.


In [4]:
print("Done")

Done
