One way to reduce the effective complexity of a network with a large number of weights is to constrain the weights within certain groups to be equal, as in convolutional neural networks (CNNs). However, this approach is only applicable to particular problems in which the form of the constraints can be specified in advance. <font color='red'> Here we consider a form of *soft weight sharing* in which the hard constraint of equal weights is replaced by a form of regularization in which groups of weights are encouraged to have similar values.</font>

Weight decay can be interpreted as a soft weight sharing scheme that encourages all weights to be close to $0$. From a probabilistic perspective, it corresponds to a zero-mean isotropic Gaussian prior over the weights.

$$p(\mathbf{w}) = \prod_i p(w_i) = \prod_i\mathcal{N}(w_i|0, \sigma^2) $$
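As a short worked step, taking the negative logarithm of this prior makes the link to weight decay explicit. Here $W$ denotes the total number of weights, a symbol introduced only for this illustration.

$$-\ln p(\mathbf{w}) = \frac{1}{2\sigma^2}\sum_i w_i^2 + \frac{W}{2}\ln\left(2\pi\sigma^2\right)$$

The first term is the usual quadratic weight-decay penalty, and the second term does not depend on $\mathbf{w}$.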

In a more general case, we can assume that the weights are drawn from a mixture of $M$ Gaussian distributions, which corresponds to a soft division of the weights into $M$ groups.

$$p(\mathbf{w}) = \prod_i p(w_i) = \prod_i\left(\sum_{j=1}^M \pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)\right) \tag{5.136,5.137} $$

where the term inside the parentheses is the Gaussian mixture density given in (2.193), and $\pi_j$ are the mixing coefficients. As with weight decay, taking the negative logarithm leads to a regularization function of the form

$$\Omega(\mathbf{w}) = -\sum_i\ln\left(\sum_{j=1}^M\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right) \tag{5.138}$$

The total error function is then given by

$$\tilde{E}(\mathbf{w}) = E(\mathbf{w})+\lambda\Omega(\mathbf{w}) \tag{5.139}$$
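The regularizer (5.138) and the total error (5.139) can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names (`gaussian_pdf`, `mixture_penalty`, `total_error`) and the toy values are assumptions made for illustration, and the data error $E(\mathbf{w})$ is passed in as a plain number.

```python
import math

def gaussian_pdf(w, mu, sigma):
    """Density of N(w | mu, sigma^2)."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_penalty(weights, pis, mus, sigmas):
    """Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2), as in (5.138)."""
    total = 0.0
    for w in weights:
        density = sum(p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas))
        total -= math.log(density)
    return total

def total_error(data_error, weights, pis, mus, sigmas, lam):
    """E~(w) = E(w) + lambda * Omega(w), as in (5.139)."""
    return data_error + lam * mixture_penalty(weights, pis, mus, sigmas)

# Toy example (illustrative values): two groups of weights near +1 and -1.
weights = [0.9, 1.1, -1.0, 0.05]
pis, mus, sigmas = [0.5, 0.5], [1.0, -1.0], [0.2, 0.2]
print(mixture_penalty(weights, pis, mus, sigmas))
```

With a single zero-mean component, `mixture_penalty` reduces to the weight-decay penalty plus a constant, which is a quick way to sanity-check it.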

------------------

# Derivative with respect to weight

$$\begin{align*}
\frac{\partial \tilde{E}}{\partial w_i} 
&= \frac{\partial E}{\partial w_i} + \lambda\frac{\partial\left[ -\sum_n\ln\left(\sum_j \pi_j \mathcal{N}(w_n|\mu_j,\sigma_j^2)\right)\right]}{\partial w_i}\\
&= \frac{\partial E}{\partial w_i} + \lambda\frac{\partial\left[ -\ln\left(\sum_j \pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)\right]}{\partial w_i}\\
&= \frac{\partial E}{\partial w_i} + \lambda\frac{\partial\left[ -\ln\left(\sum_j \pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)\right]}{\partial\left(\sum_j \pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)}
\sum_j \left\{\frac{\partial \left(\pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)}{\partial\left(-(w_i-\mu_j)^2/(2\sigma_j^2)\right)}
\frac{\partial\left(-(w_i-\mu_j)^2/(2\sigma_j^2)\right)}{\partial w_i}\right\}\\
&= \frac{\partial E}{\partial w_i} + \lambda\frac{-1}{\left(\sum_j \pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)}
\sum_j \left\{\pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\frac{-(w_i-\mu_j)}{\sigma_j^2}\right\}\\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_j\frac{1}{\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)}
 \pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\frac{(w_i-\mu_j)}{\sigma_j^2}\\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_j\gamma_j(w_i)\frac{w_i-\mu_j}{\sigma_j^2}\qquad \text{where }\gamma_j(w)=\frac{\pi_j\mathcal{N}(w|\mu_j,\sigma_j^2)}{\sum_k\pi_k\mathcal{N}(w|\mu_k,\sigma_k^2)} \tag{5.140,5.141}
\end{align*}$$

where $\gamma_j(w)$, defined as in (2.192), denotes the posterior probability that $w$ was generated by the $j^{th}$ Gaussian component of the mixture.

Network training seeks a stationary point of the total error, at which $\frac{\partial \tilde{E}}{\partial w_i}=0$. The effect of the regularization term is therefore to pull each weight towards the center of each Gaussian $j$, with a force proportional to the posterior probability of that Gaussian for the given weight.
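The regularizer's contribution to (5.140) can be checked numerically. The sketch below, with hypothetical function names and illustrative toy values, computes the posterior probabilities $\gamma_j(w)$ and the derivative of $\Omega$ with respect to a single weight, then compares it with a central finite difference.

```python
import math

def gaussian_pdf(w, mu, sigma):
    """Density of N(w | mu, sigma^2)."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(w, pis, mus, sigmas):
    """gamma_j(w) from (5.141): posterior probability of each component given w."""
    joint = [p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)]
    norm = sum(joint)
    return [v / norm for v in joint]

def penalty_single(w, pis, mus, sigmas):
    """Contribution of one weight to Omega: -ln sum_j pi_j N(w | mu_j, sigma_j^2)."""
    return -math.log(sum(p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)))

def penalty_grad_weight(w, pis, mus, sigmas):
    """d Omega / d w_i = sum_j gamma_j(w_i) (w_i - mu_j) / sigma_j^2."""
    gammas = responsibilities(w, pis, mus, sigmas)
    return sum(g * (w - m) / s ** 2 for g, m, s in zip(gammas, mus, sigmas))

# Illustrative values: a weight at 0.6 sits close to the component centered at 1.0.
pis, mus, sigmas = [0.3, 0.7], [-1.0, 1.0], [0.5, 0.4]
w, eps = 0.6, 1e-6
analytic = penalty_grad_weight(w, pis, mus, sigmas)
numeric = (penalty_single(w + eps, pis, mus, sigmas)
           - penalty_single(w - eps, pis, mus, sigmas)) / (2 * eps)
print(analytic, numeric)
```

The derivative is negative here, so a gradient-descent step increases the weight, moving it towards the nearby center at $1.0$.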


------------
# Derivative with respect to mean

$$\begin{align*}
\frac{\partial \tilde{E}}{\partial \mu_j}
&= \frac{\partial E}{\partial \mu_j} + \lambda\frac{\partial\left[ -\sum_i\ln\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right]}{\partial \mu_j}\\
&= 0 -\lambda\sum_i\frac{\partial\left[\ln\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right]}{\partial \mu_j}\\
&= - \lambda\sum_i\frac{\partial\left[ \ln\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right]}{\partial\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)}
\frac{\partial \sum_k \left(\pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)}{\partial\left(-(w_i-\mu_j)^2/(2\sigma_j^2)\right)}
\frac{\partial\left(-(w_i-\mu_j)^2/(2\sigma_j^2)\right)}{\partial \mu_j}\\
&= - \lambda\sum_i\frac{1}{\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)}
 \pi_j \mathcal{N}(w_i|\mu_j,\sigma_j^2)\frac{(w_i-\mu_j)}{\sigma_j^2}\\
&= \lambda\sum_i\gamma_j(w_i)\frac{\mu_j-w_i}{\sigma_j^2} \tag{5.142}
\end{align*}$$

This gradient pushes $\mu_j$ towards an average of the weight values, weighted by the posterior probabilities $\gamma_j(w_i)$ that the respective weights were generated by component $j$.
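To make the "weighted average" reading concrete, the sketch below (hypothetical function names, illustrative toy values) evaluates the derivative of $\Omega$ with respect to $\mu_j$ alongside the responsibility-weighted mean of the weights. The derivative equals $\left(\sum_i\gamma_j(w_i)\right)(\mu_j-\bar{w}_j)/\sigma_j^2$, where $\bar{w}_j$ is that weighted mean, so its sign tells us on which side of the weighted mean $\mu_j$ currently lies.

```python
import math

def gaussian_pdf(w, mu, sigma):
    """Density of N(w | mu, sigma^2)."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(w, pis, mus, sigmas):
    """gamma_j(w): posterior probability of each component given w."""
    joint = [p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)]
    norm = sum(joint)
    return [v / norm for v in joint]

def penalty_grad_mean(weights, pis, mus, sigmas, j):
    """d Omega / d mu_j = sum_i gamma_j(w_i) (mu_j - w_i) / sigma_j^2, i.e. (5.142) without lambda."""
    return sum(
        responsibilities(w, pis, mus, sigmas)[j] * (mus[j] - w) / sigmas[j] ** 2
        for w in weights
    )

def weighted_mean(weights, pis, mus, sigmas, j):
    """Responsibility-weighted average of the weights for component j."""
    gammas = [responsibilities(w, pis, mus, sigmas)[j] for w in weights]
    return sum(g * w for g, w in zip(gammas, weights)) / sum(gammas)

# Illustrative values: the centers start between the two clusters of weights.
weights = [0.9, 1.1, 1.3, -1.0, -0.7]
pis, mus, sigmas = [0.5, 0.5], [0.5, -0.5], [0.4, 0.4]
for j in range(len(mus)):
    print(j, penalty_grad_mean(weights, pis, mus, sigmas, j),
          weighted_mean(weights, pis, mus, sigmas, j))
```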


-----------
# Derivative with respect to variance

$$\begin{align*}
\frac{\partial \tilde{E}}{\partial \sigma_j}
&= \frac{\partial E}{\partial \sigma_j} + \lambda\frac{\partial\left[ -\sum_i\ln\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right]}{\partial \sigma_j}\\
&= \lambda\sum_i\gamma_j(w_i)\left(\frac{1}{\sigma_j}-\frac{\partial\left(-(w_i-\mu_j)^2/(2\sigma_j^2)\right)}{\partial \sigma_j}\right)\\
&= \lambda\sum_i\gamma_j(w_i)\left(\frac{1}{\sigma_j}-\frac{(w_i-\mu_j)^2}{\sigma_j^3}\right) \tag{5.143}
\end{align*}$$

which drives $\sigma_j^2$ towards the weighted average of the squared deviations of the weights around the corresponding center $\mu_j$, where the weighting coefficients are again given by the posterior probability that each weight was generated by the $j^{th}$ Gaussian component.

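In practice, we need to keep the variance positive. This can be done by introducing a set of auxiliary variables $\xi_j$ (distinct from the $\eta_j$ used for the mixing coefficients) and performing the minimization with respect to the $\xi_j$

$$\sigma_j^2 = \exp(\xi_j) \tag{5.144}$$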
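A minimal sketch of how the exponential parameterization in (5.144) combines with (5.143): by the chain rule, $\partial\sigma_j/\partial(\ln\sigma_j^2)=\sigma_j/2$, so the derivative with respect to the auxiliary variable is $\sigma_j/2$ times the derivative with respect to $\sigma_j$. This chain-rule step, the function names, and the name `log_vars` for the auxiliary variables are assumptions of this illustration; the functions return derivatives of $\Omega$, that is, without the factor $\lambda$.

```python
import math

def gaussian_pdf(w, mu, sigma):
    """Density of N(w | mu, sigma^2)."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(w, pis, mus, sigmas):
    """gamma_j(w): posterior probability of each component given w."""
    joint = [p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)]
    norm = sum(joint)
    return [v / norm for v in joint]

def penalty_grad_sigma(weights, pis, mus, sigmas, j):
    """d Omega / d sigma_j = sum_i gamma_j(w_i) (1/sigma_j - (w_i - mu_j)^2 / sigma_j^3)."""
    return sum(
        responsibilities(w, pis, mus, sigmas)[j]
        * (1.0 / sigmas[j] - (w - mus[j]) ** 2 / sigmas[j] ** 3)
        for w in weights
    )

def penalty_grad_log_var(weights, pis, mus, log_vars, j):
    """d Omega / d log_var_j, where sigma_j^2 = exp(log_var_j) keeps the variance positive."""
    sigmas = [math.exp(0.5 * v) for v in log_vars]
    return 0.5 * sigmas[j] * penalty_grad_sigma(weights, pis, mus, sigmas, j)

# Illustrative values: component 0 is too narrow for its weights, component 1 too wide.
weights = [0.5, 1.0, 1.5, -1.0, -1.05]
pis, mus = [0.5, 0.5], [1.0, -1.0]
log_vars = [math.log(0.04), math.log(0.25)]
for j in range(len(mus)):
    print(j, penalty_grad_log_var(weights, pis, mus, log_vars, j))
```

Because the update acts on the unconstrained variable, no step size can make the variance negative.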



----------------
# Derivative with respect to coefficient

For the derivatives with respect to the mixing coefficients $\pi_j$, we need to take account of the constraints

$$\sum_j \pi_j = 1,\qquad 0\leqslant \pi_j\leqslant 1 \tag{5.145}$$

which can be done by expressing the mixing coefficients in terms of a set of auxiliary variables $\eta_j$ using the softmax function given by

$$\pi_j = \frac{\exp(\eta_j)}{\sum_k \exp(\eta_k)} \tag{5.146}$$

The derivative of the regularized error function with respect to $\eta_j$ then takes the form

$$\begin{align*}
\frac{\partial \tilde{E}}{\partial \eta_j}
&= \frac{\partial E}{\partial \eta_j} + \lambda\frac{\partial\left[ -\sum_i\ln\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right]}{\partial \eta_j}\\
&= \lambda\sum_i\frac{\partial\left[ -\ln\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right]}{\partial\left(\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)}
\frac{\partial \sum_k \left(\pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)}{\partial \eta_j}\\
&= - \lambda\sum_i\frac{1}{\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)}
\sum_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\frac{\partial\pi_k}{\partial \eta_j}\\
&= - \lambda\sum_i\frac{1}{\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)}
\sum_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)
\left\{\begin{array}{ll}
(-\pi_k\pi_j) &k\neq j\\
\pi_j(1-\pi_j) &k=j
\end{array}\right.\\
&= - \lambda\sum_i\frac{1}{\sum_k \pi_k \mathcal{N}(w_i|\mu_k,\sigma_k^2)}
\left(\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)-\sum_k \pi_k\pi_j\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\\
&= \lambda\sum_i\{\pi_j-\gamma_j(w_i)\} \tag{5.147}
\end{align*}$$


We see that $\pi_j$ is therefore driven towards the average posterior probability for the $j^{th}$ Gaussian component.
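The softmax parameterization and the resulting derivative can be sketched as follows. The function names and toy values are assumptions for illustration, and `penalty_grad_eta` returns the derivative of $\Omega$ itself, so the prefactor $\lambda$ carried through the derivation above is left out.

```python
import math

def gaussian_pdf(w, mu, sigma):
    """Density of N(w | mu, sigma^2)."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def softmax(etas):
    """pi_j = exp(eta_j) / sum_k exp(eta_k), as in (5.146)."""
    top = max(etas)  # subtracting the maximum avoids overflow and leaves pi unchanged
    exps = [math.exp(e - top) for e in etas]
    norm = sum(exps)
    return [e / norm for e in exps]

def responsibilities(w, pis, mus, sigmas):
    """gamma_j(w): posterior probability of each component given w."""
    joint = [p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)]
    norm = sum(joint)
    return [v / norm for v in joint]

def penalty_grad_eta(weights, etas, mus, sigmas, j):
    """d Omega / d eta_j = sum_i (pi_j - gamma_j(w_i))."""
    pis = softmax(etas)
    return sum(pis[j] - responsibilities(w, pis, mus, sigmas)[j] for w in weights)

# Illustrative values: three weights near +1, one near -1, equal initial coefficients.
weights = [0.9, 1.0, 1.1, -1.0]
etas, mus, sigmas = [0.0, 0.0], [1.0, -1.0], [0.3, 0.3]
grads = [penalty_grad_eta(weights, etas, mus, sigmas, j) for j in range(len(etas))]
print(softmax(etas), grads)
```

In this toy setting the derivative for the first component is negative, so gradient descent raises $\eta_1$ and hence $\pi_1$, reflecting that this component explains more of the weights than its current coefficient suggests. The derivatives also sum to zero over $j$, as expected from the sum-to-one constraint.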


---------------------

# The learning process

The learning process is simple. 

1. Pick initial values for these parameters ($w_i, \pi_j, \mu_j, \sigma_j$).
2. Compute the derivatives using the equations above.
3. Use these derivatives to update the parameters.
4. Repeat steps 2 and 3 until a stationary point is found.

Following this procedure, the parameters $w_i, \pi_j, \mu_j, \sigma_j$ can be determined adaptively in the learning process.
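The steps above can be sketched as plain gradient descent on all parameters at once. This is a toy illustration under stated assumptions, not a definitive implementation: the data error is taken to be $E(\mathbf{w})=\frac{1}{2}\sum_i (w_i-t_i)^2$ for some made-up targets $t_i$, the variances are parameterized as $\sigma_j^2=\exp(\cdot)$ through a variable called `log_vars`, the mixing coefficients use the softmax of $\eta_j$, and all function names, the learning rate and the step count are hypothetical choices.

```python
import math

def gaussian_pdf(w, mu, sigma):
    """Density of N(w | mu, sigma^2)."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def unpack(etas, log_vars):
    """Recover (pi_j, sigma_j) from the unconstrained variables."""
    top = max(etas)
    exps = [math.exp(e - top) for e in etas]
    norm = sum(exps)
    return [e / norm for e in exps], [math.exp(0.5 * v) for v in log_vars]

def total_error(weights, targets, etas, mus, log_vars, lam):
    """E~ = E + lambda * Omega with the toy data error E = 0.5 * sum_i (w_i - t_i)^2."""
    pis, sigmas = unpack(etas, log_vars)
    data = 0.5 * sum((w - t) ** 2 for w, t in zip(weights, targets))
    omega = -sum(
        math.log(sum(p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)))
        for w in weights
    )
    return data + lam * omega

def gradients(weights, targets, etas, mus, log_vars, lam):
    """Gradients of E~ with respect to w_i, eta_j, mu_j and ln(sigma_j^2), in that order."""
    pis, sigmas = unpack(etas, log_vars)
    comps = range(len(mus))
    gam = []  # gam[i][j] = gamma_j(w_i)
    for w in weights:
        joint = [p * gaussian_pdf(w, m, s) for p, m, s in zip(pis, mus, sigmas)]
        norm = sum(joint)
        gam.append([v / norm for v in joint])
    g_w = [
        (w - t) + lam * sum(gam[i][j] * (w - mus[j]) / sigmas[j] ** 2 for j in comps)
        for i, (w, t) in enumerate(zip(weights, targets))
    ]
    g_eta = [lam * sum(pis[j] - g[j] for g in gam) for j in comps]
    g_mu = [
        lam * sum(g[j] * (mus[j] - w) / sigmas[j] ** 2 for g, w in zip(gam, weights))
        for j in comps
    ]
    # Chain rule through sigma_j^2 = exp(log_var_j): d sigma_j / d log_var_j = sigma_j / 2.
    g_lv = [
        lam * 0.5 * sum(g[j] * (1.0 - ((w - mus[j]) / sigmas[j]) ** 2) for g, w in zip(gam, weights))
        for j in comps
    ]
    return g_w, g_eta, g_mu, g_lv

def train(weights, targets, etas, mus, log_vars, lam=0.1, lr=0.01, steps=200):
    """Steps 1 to 4: start from initial values, then repeat (compute gradients, update)."""
    params = [list(weights), list(etas), list(mus), list(log_vars)]
    for _ in range(steps):
        grads = gradients(params[0], targets, params[1], params[2], params[3], lam)
        params = [[p - lr * g for p, g in zip(ps, gs)] for ps, gs in zip(params, grads)]
    return params

# Toy problem: the data error alone would place the weights exactly at the targets.
targets = [0.9, 1.2, 1.0, -1.1, -0.8, -1.0]
init = (list(targets), [0.0, 0.0], [0.5, -0.5], [math.log(0.25), math.log(0.25)])
before = total_error(init[0], targets, init[1], init[2], init[3], 0.1)
w, etas, mus, log_vars = train(init[0], targets, init[1], init[2], init[3])
after = total_error(w, targets, etas, mus, log_vars, 0.1)
print(before, after)
print([round(x, 3) for x in w], [round(m, 3) for m in mus])
```

A fixed step count stands in for a proper stationarity test here; in a real network the data-error gradient would come from backpropagation rather than from a closed-form expression.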