# patternnet and patternattribution

this is a notebook accompanying a paper review of [this paper](https://arxiv.org/pdf/1705.05598.pdf), with notes on mendeley.

## useful links

+ [the original paper](https://arxiv.org/pdf/1705.05598.pdf)
+ [a github repo implementing these](https://github.com/albermax/innvestigate)
    + [the specific py file that implements them](https://github.com/albermax/innvestigate/blob/master/innvestigate/analyzer/pattern_based.py)
+ cited previous explainability techniques
    + [Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.](https://arxiv.org/abs/1312.6034): artificial image generation, class saliency maps
    + [Jason Yosinski, Jeff Clune, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. In ICML Workshop on Deep Learning, 2015.](http://yosinski.com/media/papers/Yosinski__2015__ICML_DL__Understanding_Neural_Networks_Through_Deep_Visualization__.pdf): one tool for visualizing layer activations, another for visualizing layer features via regularized optimization in image space
    + [Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pp. 3387–3395, 2016.](https://arxiv.org/abs/1605.09304): deep generator network (dgn) for creating an image which highly activates a neuron -- activation maximization (am)
    + [David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.](http://www.jmlr.org/papers/volume11/baehrens10a/baehrens10a.pdf): "decision vectors" measuring first-order effect of each input feature on classification output **very interesting**
    + [Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4498753/): display pixel-level contributions to overall prediction value (applications to bag of words as well)
    + [Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222, 2017.](https://arxiv.org/abs/1512.02479): deep taylor decomposition
    + [Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.](https://arxiv.org/abs/1311.2901): deconvnet to generate images, focus on intermediate layer pixel-level visualization (I think)
    + [Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In ICLR, 2015.](https://arxiv.org/abs/1412.6806): new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches. made because deconv without pooling sucks
    + [Luisa M Zintgraf, Taco S Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. In ICLR, 2017.](https://arxiv.org/abs/1702.04595): prediction difference analysis, looks like it is a discrete first-order effect size visualization
    + [Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML 2017, 2017](https://arxiv.org/abs/1703.01365): axiomatic statements about vis: sensitivity and implementation invariance, and a resulting method: "integrated gradients", another first-order approach
    + [Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.](https://arxiv.org/abs/1706.03825): smoothgrad (a method which sharpens gradient-based sensitivity maps)

## linear models

let's reproduce the signal/distractor example:

In [1]:
import numpy as np
import plotly.graph_objs as go
import plotly.offline

plotly.offline.init_notebook_mode(connected=True)

In [2]:
def distractor_vector(theta):
    return np.array([
        [np.cos(theta)],
        [np.sin(theta)]
    ])


def distractor(theta):
    return (np.random.randn(500) * distractor_vector(theta)).transpose()

In [3]:
np.random.seed(1337)
signal_x0 = np.linspace(-1, 1, 500)
signal = np.zeros((500, 2))
signal[:, 0] = signal_x0
signal[:10]

array([[-1.        ,  0.        ],
       [-0.99599198,  0.        ],
       [-0.99198397,  0.        ],
       [-0.98797595,  0.        ],
       [-0.98396794,  0.        ],
       [-0.97995992,  0.        ],
       [-0.9759519 ,  0.        ],
       [-0.97194389,  0.        ],
       [-0.96793587,  0.        ],
       [-0.96392786,  0.        ]])

In [4]:
distractor(theta=np.pi / 2)[:10]

array([[-4.30578044e-17, -7.03187310e-01],
       [-3.00211363e-17, -4.90282363e-01],
       [-1.97054444e-17, -3.21814330e-01],
       [-1.07467577e-16, -1.75507872e+00],
       [ 1.26545491e-17,  2.06664470e-01],
       [-1.23154436e-16, -2.01126457e+00],
       [-3.41217648e-17, -5.57250708e-01],
       [ 2.06485865e-17,  3.37217008e-01],
       [ 9.48388508e-17,  1.54883597e+00],
       [-8.39334069e-17, -1.37073656e+00]])

In [5]:
X = distractor(theta=np.pi / 2)
data = [
    go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode='markers',
        marker={
            'color': signal[:, 0],
            'colorscale': 'Blues'
        },
    ),
]
plotly.offline.iplot(data)

In [6]:
def make_x(theta):
    return signal + distractor(theta)
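
a quick numerical check of why this toy setup is interesting. the sketch below is my own self-contained version, not code from the paper: it assumes data of the form `x = a_s * y + a_d * eps`, with the signal direction `a_s` along the first axis and the distractor direction `a_d` at angle `theta`. the names `a_s`, `a_d`, `eps` and `w` are chosen here for illustration. a weight vector that recovers `y` has to be orthogonal to the distractor and scaled so that `w @ a_s == 1`, which means it generally does *not* point along the signal direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
theta = np.pi / 4  # distractor angle, an arbitrary choice for this check

y = np.linspace(-1, 1, n)                        # the value we want to recover
a_s = np.array([1.0, 0.0])                       # signal direction
a_d = np.array([np.cos(theta), np.sin(theta)])   # distractor direction
eps = rng.standard_normal(n)                     # distractor magnitude per sample

# each row is one sample: signal part plus distractor part
X = np.outer(y, a_s) + np.outer(eps, a_d)

# direction orthogonal to the distractor, rescaled so that w @ a_s == 1
w_dir = np.array([np.cos(theta - np.pi / 2), np.sin(theta - np.pi / 2)])
w = w_dir / (w_dir @ a_s)

y_hat = X @ w

print("w =", w)
print("w @ a_d =", w @ a_d)
print("recovers y:", np.allclose(y_hat, y))
```

note that `w_dir @ a_s` equals `sin(theta)`, so the rescaling breaks down when `theta` is `0` or `pi`: the distractor is then parallel to the signal and no linear filter can separate the two.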

In [7]:
def stupid_arrow(theta, name):
    return go.Scatter(
        x=[0, 2 * np.cos(theta)],
        y=[0, 2 * np.sin(theta)],
        mode='lines',
        name=name,
        line={'width': 4}
    )

def make_distractor_plot(theta=0):
    d = distractor(theta)
    X = signal + d
    # W is drawn orthogonal to the distractor direction (theta - pi / 2) so that it cancels the distractor

    data = [
        go.Scatter(
            x=X[:, 0],
            y=X[:, 1],
            mode='markers',
            marker={
                'color': signal[:, 0],
                'colorscale': 'Blues'
            },
        ),
        # signal, distractor, and weight "arrows"
        stupid_arrow(0, 'signal'),
        stupid_arrow(theta, 'distractor'),
        stupid_arrow(theta - np.pi / 2, 'W'),
    ]
    layout = go.Layout(
        height=800,
        width=800,
        xaxis={'range': [-4, 4]},
        yaxis={
            'range': [-4, 4],
            'scaleanchor': 'x',
        }
    )
    figure = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(figure)

In [8]:
from ipywidgets import interactive
interactive(make_distractor_plot, theta=(0.0, 2 * np.pi, np.pi / 20))


## categorizing visualizations

the authors group explanation methods into three types, according to *what* each one attempts to visualize; I'll try to summarize the categories in my own words here

1. **functional**: the model is treated as a function of the inputs $f(\mathbf{x})$, and we look at the gradient of that function evaluated at our input points
1. **signal**: given an output, trace back which components of the input actually caused it (the signal, as opposed to the distractor). I believe this is a global calculation, not a local one. the input image is then rescaled according to the overall signal weighting
1. **attribution**: decompose any activation (output or intermediate neurons) in terms of contributions from inputs

the authors' final summary statement:

> Summarizing, the **function** extracts the **signal** from the data by removing the distractor. The **attribution** of output values to input dimensions shows how much an individual component of the signal contributes to the output, which is what LRP calls relevance.
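
to make the three categories concrete, here is a minimal sketch for a single linear neuron `y = w @ x`. it reflects my reading of the paper's linear estimator, where the pattern is `a = cov(x, y) / var(y)`, the signal estimate is `a * y`, and the attribution is `w * a` elementwise. the toy data, the fixed weight vector `w`, and all variable names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta = np.pi / 4  # distractor angle, an arbitrary choice

# toy data: signal along the first axis plus a distractor at angle theta
y_true = rng.uniform(-1, 1, n)
eps = rng.standard_normal(n)
a_s = np.array([1.0, 0.0])
a_d = np.array([np.cos(theta), np.sin(theta)])
X = np.outer(y_true, a_s) + np.outer(eps, a_d)

# a linear "model" that cancels the distractor: w @ a_d == 0 and w @ a_s == 1
w = np.array([1.0, -1.0])
y = X @ w

# function: the gradient of y with respect to x is just w
function = w

# pattern: a = cov(x, y) / var(y), estimated from the data
Xc = X - X.mean(axis=0)
yc = y - y.mean()
a = (Xc * yc[:, None]).mean(axis=0) / (yc ** 2).mean()

# signal: the part of each input that the neuron responds to
signal_estimate = np.outer(y, a)

# attribution: elementwise product of weights and pattern
attribution = w * a

print("function (w):  ", function)
print("pattern (a):   ", a)
print("attribution:   ", attribution)
```

the gradient `w` points away from the signal direction because it has to cancel the distractor, while the estimated pattern `a` lands close to `a_s`. by construction `w @ a` equals 1, so feeding the signal estimate back through the neuron returns `y` unchanged.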

## implementations

as listed in the useful links section above, there is [an implementation](https://github.com/albermax/innvestigate). let's mess around with that.