# Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels

In [2]:
import torch

# Matrix-Based Mutual Information
## Matrix-Based Rényi's $\alpha$-order Entropy

Rényi's $\alpha$-order entropy is a generalization of Shannon's entropy. For a random variable with probability density function $f$ over a domain $X$, it is defined as:

\begin{align}
H_{\alpha}(f) = \frac{1}{1-\alpha} \log \int_{X} f^{\alpha}(x)\,dx
\end{align}

This measure is more flexible than Shannon's entropy because of the parameter $\alpha$; Shannon's entropy is recovered in the limit $\alpha \rightarrow 1$. It has been widely applied in *machine learning*. In particular, the case $\alpha = 2$, $H_2$, combines well with Parzen window density estimation to estimate the Probability Density Function (PDF) of a random variable. However, estimating PDFs of high-dimensional data is a challenging task, and high-dimensional data is the typical setting in Deep Learning. To avoid high-dimensional PDF estimation, Giraldo et al. \[3\] proposed a non-parametric framework that estimates entropy directly from data using infinitely divisible kernels, with properties similar to those of Rényi's $\alpha$-order entropy.

Giraldo's definition: Let $x_i \in \mathbb{X}, \ i = 1, 2, ..., N$ denote data points and let $k : \mathbb{X} \times \mathbb{X} \rightarrow \mathbb{R}$ be a kernel. Given the kernel matrix $\textbf{K} \in \mathbb{R}^{N \times N}$ where $(K)_{ij} = k(x_i, x_j)$ and the matrix $A$ with $(A)_{ij} = \frac{1}{N} \frac{(K)_{ij}}{\sqrt{(K)_{ii} (K)_{jj}}}$, the matrix-based **Rényi's $\alpha$-order entropy** is given by

\begin{align}
S_{\alpha}(A) = \frac{1}{1 - \alpha} \log_{2} \left( \operatorname{tr}(A^\alpha) \right) = \frac{1}{1 - \alpha} \log_{2} \left[\sum_{i=1}^{N} \lambda_i (A)^{\alpha}\right]
\end{align}

where $\operatorname{tr}(\cdot)$ denotes the trace and $\lambda_i (A)$ denotes the $i^{th}$ eigenvalue of $A$.

The matrix-based Rényi's entropy in the previous equation has the same functional form as the statistical quantity in a Reproducing Kernel Hilbert Space (RKHS). This means that we are **projecting the marginal distributions to an RKHS to measure entropy and mutual information** \[1\].
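The definition above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the function names `normalized_gram` and `renyi_entropy` and the default `alpha=2.0` are assumptions made for this example. The eigenvalues are clamped at zero to guard against small negative values caused by round-off.

```python
import torch

def normalized_gram(K):
    """A_ij = K_ij / (N * sqrt(K_ii * K_jj)), so that tr(A) = 1."""
    d = torch.sqrt(torch.diagonal(K))
    return K / (d[:, None] * d[None, :]) / K.shape[0]

def renyi_entropy(A, alpha=2.0):
    """S_alpha(A) = 1 / (1 - alpha) * log2(sum_i lambda_i(A)^alpha)."""
    eigvals = torch.linalg.eigvalsh(A)        # A is symmetric, so eigvalsh applies
    eigvals = torch.clamp(eigvals, min=0.0)   # remove tiny negative round-off
    return torch.log2(torch.sum(eigvals ** alpha)) / (1.0 - alpha)

N = 4
# Identity kernel: all points look distinct, entropy is about log2(N)
print(renyi_entropy(normalized_gram(torch.eye(N))))
# Constant kernel: all points look identical, entropy is about 0
print(renyi_entropy(normalized_gram(torch.ones(N, N))))
```

The two toy kernels mark the extremes: the entropy ranges from roughly $0$ (all samples indistinguishable) to $\log_2 N$ (all samples distinct).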

The **joint entropy** between $x \in \mathbb{X}$ and $y \in \mathbb{Y}$ is defined by Giraldo et al. \[3\] as

\begin{align}
S_{\alpha}(A_x, A_y) = S_{\alpha} \left( \frac{A_x \circ A_y}{tr(A_x \circ A_y)} \right)
\end{align}

where $x_i$ and $y_i$ are two different representations of the same object and $\circ$ denotes the Hadamard (element-wise) product. Finally, the **Mutual Information (MI)** is defined as

\begin{align}
I_{\alpha}(A_x, A_y) = S_{\alpha}(A_x) + S_{\alpha}(A_y) - S_{\alpha}(A_x, A_y)
\end{align}

As we can see, this definition of MI has the same form as Shannon's formulation, $I(X;Y) = H(X) + H(Y) - H(X,Y)$.
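The joint entropy and MI definitions translate directly into code. The sketch below is illustrative only: the function names and the two toy matrices are assumptions chosen so that the result can be checked by hand, and the inputs are expected to be already normalized to unit trace.

```python
import torch

def renyi_entropy(A, alpha=2.0):
    """Matrix-based Renyi entropy of a unit-trace matrix A."""
    eigvals = torch.clamp(torch.linalg.eigvalsh(A), min=0.0)
    return torch.log2(torch.sum(eigvals ** alpha)) / (1.0 - alpha)

def joint_entropy(A_x, A_y, alpha=2.0):
    """S_alpha of the Hadamard product, renormalized to unit trace."""
    H = A_x * A_y                       # element-wise (Hadamard) product
    return renyi_entropy(H / torch.trace(H), alpha)

def mutual_information(A_x, A_y, alpha=2.0):
    """I_alpha = S_alpha(A_x) + S_alpha(A_y) - S_alpha(A_x, A_y)."""
    return (renyi_entropy(A_x, alpha) + renyi_entropy(A_y, alpha)
            - joint_entropy(A_x, A_y, alpha))

N = 4
A_distinct = torch.eye(N) / N           # every sample distinct
A_constant = torch.ones(N, N) / N       # every sample identical
mi_self = mutual_information(A_distinct, A_distinct)
mi_const = mutual_information(A_distinct, A_constant)
print(mi_self, mi_const)
```

In this toy case a representation shares all of its entropy with itself ($\log_2 N$ bits), while a constant representation shares none.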

# Tensor-Based Mutual Information

The output of a convolutional layer is represented as a tensor $\mathbb{X}_i \in \mathbb{R}^C \otimes \mathbb{R}^H \otimes \mathbb{R}^W$ for a data point $i$, where $H$ and $W$ represent the spatial dimensions of the convolution output and $C$ the number of filters, or *channels*. Wickstrøm *et al.* \[1\] propose to use tensor kernels \[2\] to produce a kernel matrix, $\textbf{K} \in \mathbb{R}^{N \times N}$, for the output of a convolutional layer. The ***radial basis function* (RBF) kernel** for tensors is

\begin{align}
\ k_{ten}(X_i, X_j) = e^{-\frac{1}{\sigma^2} || X_i - X_j ||_{F}^{2}}
\end{align}

where $|| . ||_{F}$ denotes the Frobenius norm and $\sigma$ is the kernel width parameter.

In practice, the tensor kernel in the previous equation is computed by reshaping each tensor into a vector in $\mathbb{R}^{C \cdot H \cdot W}$ and replacing the Frobenius norm with the Euclidean norm.

This tensor-based approach makes it possible to estimate entropy and MI in Deep Neural Networks (DNNs). The estimation is done by replacing the matrix $A$ with

\begin{align}
\ (A_{ten})_{ij} = \frac{1}{N} \frac{(K_{ten})_{ij}}{\sqrt{(K_{ten})_{ii} (K_{ten})_{jj}}} 
\ = \frac{1}{N} k_{ten}(X_i, X_j)
\end{align}
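
The second equality holds because the RBF kernel of a tensor with itself is always one, so the square-root normalization has no effect. Written out as a short check:

\begin{align}
(K_{ten})_{ii} = k_{ten}(X_i, X_i) = e^{-\frac{1}{\sigma^2} || X_i - X_i ||_{F}^{2}} = e^{0} = 1
\quad \Rightarrow \quad
\sqrt{(K_{ten})_{ii} (K_{ten})_{jj}} = 1
\end{align}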

In [3]:
def RBF(x, y, sigma):
    # Element-wise absolute difference |x - y|: each scalar entry is treated as one sample.
    # Note: the kernel equation above uses the squared distance ||x - y||^2 instead.
    abs_diff = torch.sqrt((x - y) ** 2)
    print(abs_diff)
    return torch.exp(-abs_diff / (sigma ** 2))

# One random tensor with C=3, H=2, W=1, flattened to shape (1, 6)
rand_t = torch.rand((1, 3, 2, 1), dtype=torch.float32)
rand_t_flatten = torch.flatten(rand_t, 1)

# Repeat the 6 values along rows (A) and along columns (B) to form all pairs
n = torch.numel(rand_t_flatten)
A = rand_t_flatten.expand((n, n))
print(A)
B = rand_t_flatten.T.expand((n, n))
print(B)
RBF(A, B, 0.12)

tensor([[0.5729, 0.3119, 0.8194, 0.4816, 0.5006, 0.6670],
        [0.5729, 0.3119, 0.8194, 0.4816, 0.5006, 0.6670],
        [0.5729, 0.3119, 0.8194, 0.4816, 0.5006, 0.6670],
        [0.5729, 0.3119, 0.8194, 0.4816, 0.5006, 0.6670],
        [0.5729, 0.3119, 0.8194, 0.4816, 0.5006, 0.6670],
        [0.5729, 0.3119, 0.8194, 0.4816, 0.5006, 0.6670]])
tensor([[0.5729, 0.5729, 0.5729, 0.5729, 0.5729, 0.5729],
        [0.3119, 0.3119, 0.3119, 0.3119, 0.3119, 0.3119],
        [0.8194, 0.8194, 0.8194, 0.8194, 0.8194, 0.8194],
        [0.4816, 0.4816, 0.4816, 0.4816, 0.4816, 0.4816],
        [0.5006, 0.5006, 0.5006, 0.5006, 0.5006, 0.5006],
        [0.6670, 0.6670, 0.6670, 0.6670, 0.6670, 0.6670]])
tensor([[0.0000, 0.2610, 0.2465, 0.0912, 0.0723, 0.0942],
        [0.2610, 0.0000, 0.5075, 0.1698, 0.1887, 0.3552],
        [0.2465, 0.5075, 0.0000, 0.3378, 0.3188, 0.1524],
        [0.0912, 0.1698, 0.3378, 0.0000, 0.0190, 0.1854],
        [0.0723, 0.1887, 0.3188, 0.0190, 0.0000, 0.1664],
        [0.0

tensor([[1.0000e+00, 1.3431e-08, 3.6708e-08, 1.7706e-03, 6.6124e-03, 1.4464e-03],
        [1.3431e-08, 1.0000e+00, 4.9302e-16, 7.5856e-06, 2.0312e-06, 1.9427e-11],
        [3.6708e-08, 4.9302e-16, 1.0000e+00, 6.4994e-11, 2.4272e-10, 2.5378e-05],
        [1.7706e-03, 7.5856e-06, 6.4994e-11, 1.0000e+00, 2.6777e-01, 2.5610e-06],
        [6.6124e-03, 2.0312e-06, 2.4272e-10, 2.6777e-01, 1.0000e+00, 9.5642e-06],
        [1.4464e-03, 1.9427e-11, 2.5378e-05, 2.5610e-06, 9.5642e-06, 1.0000e+00]])
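
The cell above compares the scalar entries of a single flattened tensor and exponentiates their absolute difference. A minimal sketch that follows the kernel equation more closely, assuming a batch of shape `(N, C, H, W)` and squared distances between flattened samples, might look like the following. The function name `tensor_rbf_gram`, the batch shape, and `sigma=1.0` are illustrative assumptions, and `torch.cdist` is swapped in for the explicit broadcasting used above.

```python
import torch

def tensor_rbf_gram(X, sigma):
    """K_ij = exp(-||X_i - X_j||_F^2 / sigma^2) for a batch X of shape (N, C, H, W)."""
    flat = torch.flatten(X, start_dim=1)           # (N, C*H*W), one row per sample
    sq_dist = torch.cdist(flat, flat, p=2.0) ** 2  # pairwise squared Euclidean distances
    return torch.exp(-sq_dist / sigma ** 2)

torch.manual_seed(0)
X = torch.rand(5, 3, 2, 1)        # hypothetical batch: N=5, C=3, H=2, W=1
K = tensor_rbf_gram(X, sigma=1.0)
A_ten = K / X.shape[0]            # the diagonal of K is 1, so normalization reduces to 1/N
print(A_ten)
```

Here each row of the flattened batch is one data point, so the kernel matrix is $N \times N$ regardless of the layer's output size.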

## Choosing the Kernel Width

As the *RBF kernel* shows, $\sigma$ is the kernel width parameter, and setting it is always critical. Wickstrøm et al. choose $\sigma$ using an optimality criterion that they propose \[1\]: a good kernel matrix should reveal the class structures present in the data. This can be accomplished by maximizing the so-called kernel alignment loss \[4\] between the kernel matrix of a given layer, $K_\sigma$, and the label kernel matrix, $K_y$.

The kernel alignment loss is defined as:

\begin{align}
\ A(K_a, K_b) = \frac{\left< K_a, K_b  \right>_F}{||K_a||_F ||K_b||_F}
\end{align}

where $|| . ||_F$ and $\left< . , . \right>_F$ denote the Frobenius norm and the Frobenius inner product, respectively. Thus, they choose the optimal $\sigma$ as

\begin{align}
\sigma^* = \arg\max_\sigma \ A(K_\sigma, K_y)
\end{align}
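
The alignment criterion can be sketched as a search over candidate widths. Several parts of this example are assumptions rather than details given in the text: the grid search itself, the candidate values, the function names, and the label kernel $K_y$, which is taken here to be 1 for samples of the same class and 0 otherwise.

```python
import torch

def kernel_alignment(K_a, K_b):
    """A(K_a, K_b) = <K_a, K_b>_F / (||K_a||_F * ||K_b||_F)."""
    return torch.sum(K_a * K_b) / (torch.linalg.norm(K_a) * torch.linalg.norm(K_b))

def select_sigma(X, y, candidates):
    """Return the candidate sigma whose RBF kernel best aligns with the label kernel."""
    flat = torch.flatten(X, start_dim=1)
    sq_dist = torch.cdist(flat, flat, p=2.0) ** 2
    K_y = (y[:, None] == y[None, :]).float()   # assumed label kernel: 1 if same class
    scores = [kernel_alignment(torch.exp(-sq_dist / s ** 2), K_y) for s in candidates]
    return candidates[int(torch.argmax(torch.stack(scores)))]

torch.manual_seed(0)
y = torch.tensor([0, 0, 0, 1, 1, 1])
# Hypothetical activations: two well-separated classes with small within-class noise
X = torch.randn(6, 3, 2, 1) * 0.1 + y.float().view(-1, 1, 1, 1) * 5.0
sigma_star = select_sigma(X, y, candidates=[0.1, 1.0, 5.0, 100.0])
print(sigma_star)
```

A very small width makes the kernel matrix close to the identity and a very large width makes it close to constant; neither reflects the class structure, so an intermediate width is expected to score highest on this toy data.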

To stabilize the $\sigma$ values across mini-batches, they employ an exponential moving average, such that in layer $l$ at iteration $t$ they have

\begin{align}
\sigma_{l,t} = \beta \sigma_{l,t-1} + (1-\beta) \sigma_{l,t}^*
\end{align}

where $\beta \in [0,1]$ and $\sigma_{l,1} = \sigma_{l,1}^*$.
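
As a small worked example of the moving average for a single layer, the sketch below applies the update to a list of per-iteration optimal widths. The function name `smooth_sigmas`, the input values, and `beta=0.5` are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def smooth_sigmas(sigma_stars, beta=0.9):
    """sigma_t = beta * sigma_{t-1} + (1 - beta) * sigma*_t, with sigma_1 = sigma*_1."""
    smoothed = []
    for t, s in enumerate(sigma_stars):
        if t == 0:
            smoothed.append(s)   # first iteration: use the optimal value directly
        else:
            smoothed.append(beta * smoothed[-1] + (1.0 - beta) * s)
    return smoothed

history = smooth_sigmas([1.0, 2.0, 2.0], beta=0.5)
print(history)  # [1.0, 1.5, 1.75]
```

A larger $\beta$ gives more weight to the previous value, so the width reacts more slowly to a change in $\sigma^*$ between mini-batches.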

# References

\[1\] Wickstrøm, K., Løkse, S., Kampffmeyer, M., Yu, S., Principe, J., & Jenssen, R. (2019). Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels. arXiv preprint arXiv:1909.11396.

\[2\] Marco Signoretto, Lieven De Lathauwer, and Johan AK Suykens. A kernel-based framework to tensorial data analysis. Neural Networks, 24(8):861–874, 2011.

\[3\] Luis Gonzalo Sanchez Giraldo, Murali Rao, and José Carlos Príncipe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 61:535–548, 2012.

\[4\] Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz S Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, pp. 367–373, 2002.
