# Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels

In [2]:
import os
import sys

import torch
from torch import tensor
import torchvision

# Make the project root importable so that `models` can be found.
project_dir = os.path.split(os.getcwd())[0]
if project_dir not in sys.path:
    sys.path.append(project_dir)

from models.Paper import CNN

## Matrix-Based Mutual Information

Rényi’s $\alpha$-order entropy is a generalization of Shannon’s entropy. For a random variable $X$ with probability density function (PDF) $f(x)$ over a finite set $\chi$, Rényi’s $\alpha$-order entropy is defined as

\begin{align}
H_{\alpha}(f) = \frac{1}{1-\alpha} \log \int_{\chi} f^{\alpha}(x) \, dx
\end{align}

Equation 1 has been widely applied in machine learning (Principe, 2010), and the particular case of $\alpha = 2$ is commonly combined with Parzen window density estimation. However, accurately estimating PDFs in high-dimensional data, which is typically the case for deep neural networks (DNNs), is a challenging task. To avoid the problem of high-dimensional PDF estimation, Giraldo et al. \[3\] proposed a non-parametric framework that estimates entropy directly from data using infinitely divisible kernels, and whose estimator has properties similar to those of Rényi’s $\alpha$-order entropy.

**Definition**: Let $x_i \in \mathcal{X} , \ i=1,2, ..., N$ denote data points and let $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ be an infinitely divisible positive definite kernel. Given the kernel matrix $K \in \mathbb{R}^{N \times N}$ with elements $(K)_{ij} = k(x_i, x_j)$ and the matrix $A$, $(A)_{ij}= \frac{1}{N} \frac{(K)_{ij}}{\sqrt{(K)_{ii} (K)_{jj}}}$, the matrix-based Rényi's $\alpha$-order entropy is given by

\begin{align}
S_{\alpha}(A) = \frac{1}{1-\alpha} \log_{2} \left( tr(A^{\alpha})\right) = \frac{1}{1-\alpha} \log_{2} \left[ \sum_{i=1}^{N} \lambda_i (A)^\alpha \right]
\end{align}

where $tr(\cdot)$ denotes the trace and $\lambda_i(A)$ denotes the $i^{th}$ eigenvalue of $A$.

The matrix-based Rényi entropy in the previous equation has the same functional form as the corresponding statistical quantity in a reproducing kernel Hilbert space (RKHS). Essentially, it projects the marginal distribution onto an RKHS in order to measure entropy and mutual information.

In addition to the definition of matrix-based entropy, Giraldo et al. \[3\] define the **joint entropy** between $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ as

\begin{align}
\ S_\alpha(A_{\mathcal{X}}, A_{\mathcal{Y}}) =  S_\alpha \left( \frac{A_{\mathcal{X}} \circ A_{\mathcal{Y}}}{tr(A_{\mathcal{X}} \circ A_{\mathcal{Y}}) } \right)
\end{align}

where $x_i$ and $y_i$ are two different representations of the same object and $\circ$ denotes the Hadamard (element-wise) product. Finally, the **Mutual Information (MI)** is defined as

\begin{align}
\ I_\alpha (A_{\mathcal{X}}, A_{\mathcal{Y}}) = S_\alpha (A_{\mathcal{X}}) + S_\alpha (A_{\mathcal{Y}}) - S_\alpha (A_{\mathcal{X}}, A_{\mathcal{Y}})
\end{align}

This definition of MI has the same form as Shannon's identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$.
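As a minimal sketch of the three definitions above, the following plain-Python cell evaluates them for $\alpha = 2$. In that case $tr(A^2)$ equals the sum of the squared entries of the symmetric matrix $A$, so no eigendecomposition is needed. The helper names (`normalize_gram`, `renyi2_entropy`, `joint_entropy`, `mutual_information`) and the toy Gram matrices are assumptions made for illustration; they do not come from \[3\].

```python
import math

def normalize_gram(K):
    """Return A with A_ij = K_ij / (N * sqrt(K_ii * K_jj)), so that tr(A) = 1."""
    n = len(K)
    return [[K[i][j] / (n * math.sqrt(K[i][i] * K[j][j])) for j in range(n)]
            for i in range(n)]

def renyi2_entropy(A):
    """S_2(A) = -log2(tr(A^2)); for symmetric A, tr(A^2) is the sum of squared entries."""
    trace_a2 = sum(a * a for row in A for a in row)
    return -math.log2(trace_a2)

def joint_entropy(A_x, A_y):
    """S_2 of the Hadamard product of A_x and A_y, rescaled to unit trace."""
    n = len(A_x)
    hadamard = [[A_x[i][j] * A_y[i][j] for j in range(n)] for i in range(n)]
    trace = sum(hadamard[i][i] for i in range(n))
    return renyi2_entropy([[h / trace for h in row] for row in hadamard])

def mutual_information(A_x, A_y):
    """I_2(A_x, A_y) = S_2(A_x) + S_2(A_y) - S_2(A_x, A_y)."""
    return renyi2_entropy(A_x) + renyi2_entropy(A_y) - joint_entropy(A_x, A_y)

# Toy Gram matrices for N = 4 samples (hypothetical data).
K_identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # all samples distinct
K_blocks = [[1.0, 1.0, 0.0, 0.0],
            [1.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 1.0],
            [0.0, 0.0, 1.0, 1.0]]  # two groups of identical samples

A_id = normalize_gram(K_identity)
A_bl = normalize_gram(K_blocks)

print(renyi2_entropy(A_id))            # 2.0
print(renyi2_entropy(A_bl))            # 1.0
print(mutual_information(A_bl, A_bl))  # 1.0
```

The values are in bits: four fully distinct samples give $\log_2 4 = 2$, two groups give $\log_2 2 = 1$, and the MI of a representation with itself equals its own entropy.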

## Tensor-Based Mutual Information

The output of a convolutional layer is represented as a tensor $\mathbb{X}_i \in \mathbb{R}^C \otimes \mathbb{R}^H \otimes \mathbb{R}^W$ for a data point *i*, where $H$ and $W$ represent the spatial dimensions of the convolution output and $C$ the number of filters, or *channels*. Wickstrøm *et al.* \[1\] propose to utilize tensor kernels \[2\] to produce a kernel matrix, $\textbf{K} \in \mathbb{R}^{N \times N}$, for the output of a convolutional layer. The ***radial basis function* (RBF) kernel** is represented as

\begin{align}
\ k_{ten}(X_i, X_j) = e^{-\frac{1}{\sigma^2} || X_i - X_j ||_{F}^{2}}
\end{align}

where $|| . ||_{F}$ denotes the Frobenius norm and $\sigma$ is the kernel width parameter.

In practice, the kernel in the previous equation is computed by reshaping each tensor into a vector of length $C \cdot H \cdot W$ and replacing the Frobenius norm with the Euclidean norm. The two norms coincide, since both sum the squared entries.
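A small check of the vectorization step, written as a sketch in plain Python: the squared Frobenius norm of the difference of two $C \times H \times W$ tensors is compared with the squared Euclidean norm of the difference of their flattened versions. The nested lists stand in for tensors, and the helper `flatten` and the toy values are assumptions for illustration.

```python
def flatten(t):
    """Recursively flatten a nested list (a stand-in for a C x H x W tensor)."""
    if isinstance(t, list):
        return [v for sub in t for v in flatten(sub)]
    return [t]

# Two toy tensors with C = 2, H = 2, W = 2.
X_i = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [2.0, 0.0]]]
X_j = [[[0.0, 2.0], [1.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]]]

# Squared Frobenius norm: sum over channel, row and column indices.
frobenius_sq = sum((X_i[c][h][w] - X_j[c][h][w]) ** 2
                   for c in range(2) for h in range(2) for w in range(2))

# Squared Euclidean norm of the flattened difference.
x_i, x_j = flatten(X_i), flatten(X_j)
euclidean_sq = sum((a - b) ** 2 for a, b in zip(x_i, x_j))

print(frobenius_sq, euclidean_sq)  # 10.0 10.0
```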

This tensor-based approach makes it possible to estimate entropy and MI in deep neural networks (DNNs). The estimation is done by replacing the matrix $A$ with

\begin{align}
\ (A_{ten})_{ij} = \frac{1}{N} \frac{(K_{ten})_{ij}}{\sqrt{(K_{ten})_{ii} (K_{ten})_{jj}}} 
\ = \frac{1}{N} k_{ten}(X_i, X_j)
\end{align}

In [11]:
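def RBF(x, y, sigma):
    # NOTE: x and y are expanded copies of one flattened tensor, so each scalar
    # entry is treated as its own sample. sqrt((x - y)**2) is the element-wise
    # absolute difference |x - y|; the kernel defined above uses the squared
    # norm in the exponent instead.
    euclidean_norm = torch.sqrt((x - y)**2)
    print(euclidean_norm)
    return torch.exp(-euclidean_norm/(sigma**2))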

rand_t = torch.rand((1,3,2,1), dtype=torch.float32)
rand_t_flatten = torch.flatten(rand_t, 1)
# print(torch.numel(rand_t_flatten))
# print(rand_t_flatten.T)

A = rand_t_flatten.expand((torch.numel(rand_t_flatten), torch.numel(rand_t_flatten)))
print(A)
B = rand_t_flatten.T.expand((torch.numel(rand_t_flatten), torch.numel(rand_t_flatten)))
print(B)
RBF(A, B, 0.1)

# print(rand_t_flatten.expand((2,rand_t_flatten.shape[1])))

tensor([[0.5925, 0.1063, 0.4847, 0.9905, 0.2020, 0.8786],
        [0.5925, 0.1063, 0.4847, 0.9905, 0.2020, 0.8786],
        [0.5925, 0.1063, 0.4847, 0.9905, 0.2020, 0.8786],
        [0.5925, 0.1063, 0.4847, 0.9905, 0.2020, 0.8786],
        [0.5925, 0.1063, 0.4847, 0.9905, 0.2020, 0.8786],
        [0.5925, 0.1063, 0.4847, 0.9905, 0.2020, 0.8786]])
tensor([[0.5925, 0.5925, 0.5925, 0.5925, 0.5925, 0.5925],
        [0.1063, 0.1063, 0.1063, 0.1063, 0.1063, 0.1063],
        [0.4847, 0.4847, 0.4847, 0.4847, 0.4847, 0.4847],
        [0.9905, 0.9905, 0.9905, 0.9905, 0.9905, 0.9905],
        [0.2020, 0.2020, 0.2020, 0.2020, 0.2020, 0.2020],
        [0.8786, 0.8786, 0.8786, 0.8786, 0.8786, 0.8786]])
tensor([[0.0000, 0.4862, 0.1078, 0.3980, 0.3905, 0.2861],
        [0.4862, 0.0000, 0.3784, 0.8842, 0.0956, 0.7722],
        [0.1078, 0.3784, 0.0000, 0.5058, 0.2827, 0.3939],
        [0.3980, 0.8842, 0.5058, 0.0000, 0.7886, 0.1120],
        [0.3905, 0.0956, 0.2827, 0.7886, 0.0000, 0.6766],
        [0.2

tensor([[1.0000e+00, 7.6818e-22, 2.0813e-05, 5.1738e-18, 1.0947e-17, 3.7732e-13],
        [7.6818e-22, 1.0000e+00, 3.6909e-17, 3.9744e-39, 7.0170e-05, 2.8985e-34],
        [2.0813e-05, 3.6909e-17, 1.0000e+00, 1.0768e-22, 5.2599e-13, 7.8532e-18],
        [5.1738e-18, 3.9744e-39, 1.0768e-22, 1.0000e+00, 5.6640e-35, 1.3712e-05],
        [1.0947e-17, 7.0170e-05, 5.2599e-13, 5.6640e-35, 1.0000e+00, 4.1307e-30],
        [3.7732e-13, 2.8985e-34, 7.8532e-18, 1.3712e-05, 4.1307e-30, 1.0000e+00]])
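
The cell above treats every scalar entry of a single flattened tensor as its own sample and puts the absolute difference in the exponent, so it does not yet match the kernel defined earlier. A dependency-free sketch of that definition, with one flattened vector per sample and the squared Euclidean distance in the exponent, could look as follows. The function names `rbf_kernel_matrix` and `normalized_kernel` and the toy batch are illustrative assumptions.

```python
import math

def rbf_kernel_matrix(samples, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2) for flattened samples x_i."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [[math.exp(-squared_distance(u, v) / sigma ** 2) for v in samples]
            for u in samples]

def normalized_kernel(K):
    """A_ij = K_ij / N, which is valid because the RBF kernel has K_ii = 1."""
    n = len(K)
    return [[k / n for k in row] for row in K]

# Three samples, each standing in for a flattened C*H*W activation tensor.
batch = [[0.0, 0.0, 0.0, 0.0],
         [0.1, 0.0, 0.0, 0.0],
         [1.0, 1.0, 1.0, 1.0]]

K = rbf_kernel_matrix(batch, sigma=1.0)
A = normalized_kernel(K)

for row in K:
    print(["%.4f" % k for k in row])
```

The result is an $N \times N$ matrix (here $3 \times 3$), one row and column per sample, rather than one per tensor entry. With torch tensors, the same pairwise squared distances can be obtained from `torch.cdist(x, x) ** 2` applied to `torch.flatten(batch, 1)`.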

## Choosing the Kernel Width

As the RBF kernel above shows, $\sigma$ is the kernel width parameter, and choosing it well is critical. Wickstrøm et al. choose $\sigma$ using an optimality criterion that they propose in \[1\]. A good kernel matrix should reveal the class structures present in the data. This can be accomplished by maximizing the so-called kernel alignment loss \[4\] between the kernel matrix of a given layer, $K_\sigma$, and the label kernel matrix, $K_y$.

The kernel alignment loss is defined as:

\begin{align}
\ A(K_a, K_b) = \frac{\left< K_a, K_b  \right>_F}{||K_a||_F ||K_b||_F}
\end{align}

where $\| \cdot \|_F$ and $\left< \cdot, \cdot \right>_F$ denote the Frobenius norm and inner product, respectively. Thus, they choose the optimal $\sigma$ as

\begin{align}
\sigma^* = \arg\max_{\sigma} \ A(K_\sigma, K_y)
\end{align}
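
One way to read this criterion is as a grid search: build $K_\sigma$ for each candidate width and keep the width with the highest alignment to $K_y$. The sketch below does this in plain Python. The candidate grid, the toy data, and the choice of an ideal same-class indicator matrix for $K_y$ are assumptions for illustration and are not taken from \[1\].

```python
import math

def rbf_kernel_matrix(samples, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma ** 2)
             for v in samples] for u in samples]

def alignment(Ka, Kb):
    """Frobenius inner product of Ka and Kb divided by the product of their norms."""
    inner = sum(a * b for ra, rb in zip(Ka, Kb) for a, b in zip(ra, rb))
    norm_a = math.sqrt(sum(a * a for row in Ka for a in row))
    norm_b = math.sqrt(sum(b * b for row in Kb for b in row))
    return inner / (norm_a * norm_b)

def best_sigma(samples, K_y, candidates):
    """Return the candidate sigma whose kernel matrix aligns best with K_y."""
    return max(candidates,
               key=lambda s: alignment(rbf_kernel_matrix(samples, s), K_y))

# Toy data: two well-separated classes with two points each.
samples = [[0.0], [0.2], [5.0], [5.2]]
labels = [0, 0, 1, 1]
# Assumed label kernel: 1 if two samples share a class, 0 otherwise.
K_y = [[1.0 if a == b else 0.0 for b in labels] for a in labels]

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
sigma_star = best_sigma(samples, K_y, candidates)
print(sigma_star)  # 1.0
```

Very small widths make $K_\sigma$ close to the identity and very large widths make it close to a constant matrix; both align poorly with the block structure of $K_y$, which is why an intermediate width wins here.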

To stabilize the $\sigma$ values across mini-batches, they employ an exponential moving average, such that in layer $l$ at iteration $t$ they have

\begin{align}
\sigma_{l,t} = \beta \sigma_{l,t-1} + (1-\beta) \sigma_{l,t}^*
\end{align}

where $\beta \in [0,1]$ and $\sigma_{l,1} = \sigma_{l,1}^*$.
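
The update rule can be sketched for a single layer as follows. The function name `smooth_sigma`, the default `beta=0.9`, and the per-batch values are hypothetical; only the recursion and its initialization follow the equation above.

```python
def smooth_sigma(sigma_stars, beta=0.9):
    """Exponential moving average of per-batch optimal widths for one layer."""
    smoothed = []
    for t, sigma_star in enumerate(sigma_stars):
        if t == 0:
            # Initialization: sigma_{l,1} = sigma*_{l,1}
            smoothed.append(sigma_star)
        else:
            # sigma_{l,t} = beta * sigma_{l,t-1} + (1 - beta) * sigma*_{l,t}
            smoothed.append(beta * smoothed[-1] + (1 - beta) * sigma_star)
    return smoothed

per_batch_optimum = [2.0, 4.0, 4.0]  # hypothetical sigma* values for three mini-batches
print(smooth_sigma(per_batch_optimum, beta=0.5))  # [2.0, 3.0, 3.5]
```

A larger `beta` makes the width react more slowly to the optimum found on any single mini-batch.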

In [3]:
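def kernelAligmentLoss(x, y):
    # Frobenius inner product of x and y divided by the product of their
    # Frobenius norms (torch.norm defaults to the Frobenius norm).
    return (torch.sum(x*y))/(torch.norm(x) * torch.norm(y))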

In [4]:
a = torch.tensor([[1,2,3,4], [5,6,7,8]], dtype=torch.float32)
b = torch.tensor([[10,20,2,1], [5, 10, 3, 2]], dtype=torch.float32)

# print(a*b)
c = torch.sum(a*b)
print(c)

d = torch.norm(a, p='fro') * torch.norm(b, p='fro')
print(d)

print(c/d)

# print(a.T.shape)
# print(a.shape)
# torch.matmul(a, a.T)

# torch.norm(a, p='fro')

print(kernelAligmentLoss(a, b))

tensor(182.)
tensor(362.1768)
tensor(0.5025)
tensor(0.5025)


In [138]:
dataset = torchvision.datasets.MNIST("../datasets/MNIST/", train=True, download=True)

a = next(iter(dataset))


model = CNN()
a = torch.rand((2,1,28,28))
first_conv = model.conv_layers[0](a)
weight = model.conv_layers[0].weight
print(a.shape)
print(weight.shape)
print(first_conv.shape)
b = model.conv_layers(a)


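softmax = torch.nn.Softmax(dim=1)
# NOTE: `d` is not assigned in this cell; the 2 x 10 output recorded below
# suggests it held the network's output for `a` from an earlier run.
print(softmax(d))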


torch.Size([2, 1, 28, 28])
torch.Size([4, 1, 3, 3])
torch.Size([2, 4, 26, 26])
tensor([[0.0776, 0.1045, 0.1015, 0.1151, 0.0810, 0.1208, 0.0765, 0.0725, 0.1094,
         0.1409],
        [0.1278, 0.0929, 0.0940, 0.0841, 0.1215, 0.0712, 0.1093, 0.1429, 0.0891,
         0.0671]], grad_fn=<SoftmaxBackward>)


In [10]:
batch = torch.rand((10, 4, 12, 12), dtype=torch.float32)
minibatch = batch[:5]
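minibatch_flatten = torch.flatten(minibatch, 1)

# Euclidean norm of each flattened sample (one value per row), not a pairwise distance.
distance = torch.norm(minibatch_flatten, dim=1)

print(distance)

# Number of candidate values in a grid over [0, 100) with step 100/75.
print(len(torch.arange(0, 100, (100-0)/75)))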

tensor([13.9208, 13.8435, 14.3359, 13.7861, 14.2711])
75


In [7]:
class MutualInformation(torch.nn.Module):
    '''
        param step: number of steps in order to reduce the number of possible sigma
        values.

        param sigma_values: number of possible sigma values for optimizing process.
    '''
    def __init__(self, sigma_values=75, step=150):
        super().__init__()
        # TODO
        self.sigma_values = sigma_values
        self.step = step

    def forward(self, x, y):
        # TODO
        # Mean Euclidean distance between consecutive samples in the batch.
        mean_distance_x = torch.stack([torch.dist(x[i-1], x[i]) for i in range(1, len(x))]).mean()
        mean_distance_y = torch.stack([torch.dist(y[i-1], y[i]) for i in range(1, len(y))]).mean()

        return

    def RBF(self, x, sigma):
        '''
            Tensor Based Radial Basis Function (RBF) Kernel

            x holds one sample per row; a 1-D input is treated as N scalar samples.
        '''
        if len(x.shape) < 2:
            x = x.reshape(len(x), 1)

        x = torch.flatten(x, 1)
        # Squared Euclidean distance between every pair of flattened samples.
        squared_distance = (x.unsqueeze(1) - x.unsqueeze(0)).pow(2).sum(dim=2)
        return torch.exp(-squared_distance/(sigma**2))

    def kernelAligmentLoss(self, x, y):
        '''
            Kernel Alignment Loss Function.

            This function is used in order to obtain the optimal sigma parameter from
            RBF kernel.
        '''
        return (torch.sum(x*y))/(torch.norm(x) * torch.norm(y))

    def optimizeSigmaValue(self, x):
        '''
            This function is used in order to obtain the optimal kernel width for
            an l DNN layer
        '''
        # TODO
        # note:
        #  input kernel width : 8
        #  label kernel width : 0.1
        distance = torch.stack([torch.dist(x[i-1], x[i]) for i in range(1, len(x))]).mean().item()
        # Candidate widths between 0.1x and 10x the mean distance.
        sigma_values = torch.arange(distance*0.1, distance*10, (distance*10 - distance*0.1)/self.sigma_values)
        return

In [14]:
def RBF(x, sigma):
    # x = x.reshape(1, len(x))
    print(x.shape)
    a = x.expand((torch.numel(x), torch.numel(x)))
    b = x.T.expand((torch.numel(x), torch.numel(x)))
    euclidean_norm = torch.sqrt((a[:,] - b[:,])**2)
    return torch.exp(-euclidean_norm/(sigma**2))

rand_t = torch.rand((2,2,6,1), dtype=torch.float32)
rand_t_flatten = torch.flatten(rand_t, 1)
rand_y = torch.zeros((2, 4), dtype=torch.float32)
rand_y[0,1] = 1
rand_y[1,3] = 1

print(rand_y.shape)
print(rand_y.shape)

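# Raises the RuntimeError recorded below: RBF expands its input to
# (numel, numel) = (24, 24), which only works for a single row, while
# rand_t_flatten has shape (2, 12).
p = RBF(rand_t_flatten[:], 0.1)
# q = RBF(rand_y[0,:], 0.1)
# for i in range(1, len(rand_t_flatten)):
#     print("entered")
#     p = torch.stack((p, RBF(rand_t_flatten[i,:], 0.1)))


print(p.shape)
# print(q.shape)  # q is only defined in the commented-out line above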

# a = torch.rand((2,2))
# b = torch.rand((2,2))
# c = torch.stack((a,b))
# print(c.shape)

# A = rand_t_flatten.expand((torch.numel(rand_t_flatten), torch.numel(rand_t_flatten)))
# print(A)
# B = rand_t_flatten.T.expand((torch.numel(rand_t_flatten), torch.numel(rand_t_flatten)))
# print(B)
# RBF(A, B, 0.1)

# print(rand_t_flatten.expand((2,rand_t_flatten.shape[1])))

torch.Size([2, 4])
torch.Size([2, 4])
torch.Size([2, 12])


RuntimeError: The expanded size of the tensor (24) must match the existing size (12) at non-singleton dimension 1.  Target sizes: [24, 24].  Tensor sizes: [2, 12]

In [128]:
a = torch.rand((4, 2000))
a = torch.rand((4, 3, 12, 12))

print(a[0][0].flatten().shape)

# b=[]
# for i in range(1, len(a)):
#     b.append(torch.dist(a[i-1], a[i]))

distance = torch.tensor([torch.dist(a[i-1], a[i]) for i in range(1, len(a))]).mean()
label = torch.zeros((4, 4))
label[0, 1] = 1
label[1, 0] = 1
label[2, 3] = 1
label[3, 2] = 1
print(label)

def kernelAligmentLoss(x, y):
    return (torch.sum(x*y))/(torch.norm(x) * torch.norm(y))

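kernel_label = RBF(label[0], 0.1)
kernel_data = RBF(a[0][0].flatten(), distance-10)
print(kernel_label.shape)
print(kernel_data.shape)
# Raises the RuntimeError recorded below: the two kernel matrices are 4 x 4 and
# 144 x 144, while the alignment needs kernels of the same shape (one row and
# one column per sample).
print(kernelAligmentLoss(kernel_data, kernel_label))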

# print(distance)
# sigma_values = torch.arange(distance*0.1, distance*10, (distance*10 - distance*0.1)/75)
# print(sigma_values)

torch.Size([144])
tensor([[0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 0., 1.],
        [0., 0., 1., 0.]])
torch.Size([4, 4])
torch.Size([144, 144])


RuntimeError: The size of tensor a (144) must match the size of tensor b (4) at non-singleton dimension 1

# References

\[1\] Wickstrøm, K., Løkse, S., Kampffmeyer, M., Yu, S., Principe, J., & Jenssen, R. (2019). Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels. arXiv preprint arXiv:1909.11396.

\[2\] Marco Signoretto, Lieven De Lathauwer, and Johan AK Suykens. A kernel-based framework to tensorial data analysis. Neural Networks, 24(8):861–874, 2011.

\[3\] Luis Gonzalo Sanchez Giraldo, Murali Rao, and José Carlos Príncipe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 61:535–548, 2012.

\[4\] Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz S Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, pp. 367–373, 2002.