# Weight Initialization
Let's look at some activation statistics, e.g. for a 10-layer net with 500 neurons in each layer, using tanh non-linearities and weights drawn from a scaled normal distribution.

With small weights (`W = np.random.randn(fan_in, fan_out) * 0.01`), all activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about backward pass for a W*X gate
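
One way to explore the hint is a minimal sketch of the backward pass for a `W*X` gate. The helper `matmul_backward`, the shapes, and the `1e-6` shrink factor are assumptions chosen for illustration and are not part of the notebook. The point it illustrates is that the gradient on `W` is `X.T @ dout`, so it scales with the incoming activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def matmul_backward(X, W, dout):
    """Backward pass of the gate out = X @ W (hypothetical helper)."""
    dX = dout @ W.T   # gradient flowing back to the layer input
    dW = X.T @ dout   # gradient on the weights, proportional to X
    return dX, dW

W = rng.standard_normal((50, 50)) * 0.01
dout = rng.standard_normal((200, 50))      # stand-in upstream gradient

X_healthy = rng.standard_normal((200, 50))
X_collapsed = X_healthy * 1e-6             # activations that have shrunk to ~0

_, dW_healthy = matmul_backward(X_healthy, W, dout)
_, dW_collapsed = matmul_backward(X_collapsed, W, dout)

print('std of dW with healthy input:   %e' % np.std(dW_healthy))
print('std of dW with collapsed input: %e' % np.std(dW_collapsed))
```

When the activations feeding a layer are close to zero, the weight gradient for that layer is close to zero as well, so the layer barely updates.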

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

In [None]:
D = np.random.randn(1000, 500)
hidden_layer_sizes = [500] * 10
nonlinearities = ['tanh'] * len(hidden_layer_sizes)

In [None]:
act = {'relu': lambda x: np.maximum(0, x), 'tanh': lambda x: np.tanh(x)}
Hs = {}
for i in range(len(hidden_layer_sizes)):
    X = D if i == 0 else Hs[i-1]
    fan_in = X.shape[1]
    fan_out = hidden_layer_sizes[i]
    
    # Almost all neurons completely saturated at either -1 or 1. Gradients will be all zero.
    # W = np.random.randn(fan_in, fan_out) * 1.0
    
    # Activations shrink toward zero in the deeper layers.
    # W = np.random.randn(fan_in, fan_out) * 0.01
    
    # Xavier initialization [Glorot and Bengio, 2010]
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
    
    H = np.dot(X, W)
    H = act[nonlinearities[i]](H)
    Hs[i] = H
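
The three weight scales in the loop above (two of them commented out) can also be compared side by side. The following is a rough sketch under stated assumptions: the helper `final_layer_std`, its parameters, and the `schemes` dictionary are hypothetical names introduced here, and a seeded `np.random.default_rng` replaces the notebook's `np.random.randn` so that runs are repeatable.

```python
import numpy as np

def final_layer_std(weight_scale, num_layers=10, width=500, num_inputs=1000, seed=0):
    """Run a tanh net forward and return the std of the last layer's activations.

    weight_scale maps fan_in to the multiplier applied to the randn weights.
    """
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((num_inputs, width))   # input data, unit variance
    for _ in range(num_layers):
        fan_in = H.shape[1]
        W = rng.standard_normal((fan_in, width)) * weight_scale(fan_in)
        H = np.tanh(H @ W)
    return np.std(H)

# Hypothetical labels for the three schemes tried in the notebook.
schemes = {
    'randn * 1.0': lambda fan_in: 1.0,
    'randn * 0.01': lambda fan_in: 0.01,
    'randn / sqrt(fan_in)': lambda fan_in: 1.0 / np.sqrt(fan_in),
}
results = {name: final_layer_std(scale) for name, scale in schemes.items()}
for name, std in results.items():
    print('%-22s last-layer std %e' % (name, std))
```

With the `0.01` scale the last-layer std should come out many orders of magnitude smaller than with the other two, while the `1.0` scale pushes it close to 1 because nearly every unit saturates.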

In [None]:
print('input layer had mean %f and std %f' % (np.mean(D), np.std(D)))
layer_means = [np.mean(H) for i,H in Hs.items()]
layer_stds = [np.std(H) for i,H in Hs.items()]
for i,H in Hs.items():
    print('hidden layer %d had mean %f and std %f' % (i+1, layer_means[i], layer_stds[i]))

In [None]:
plt.figure()
plt.subplot(1, 2, 1)
plt.plot(list(Hs.keys()), layer_means, 'ob-')
plt.title('layer mean')
plt.subplot(1, 2, 2)
plt.plot(list(Hs.keys()), layer_stds, 'or-')
plt.title('layer std')

In [None]:
plt.figure()
for i,H in Hs.items():
    plt.subplot(1, len(Hs), i+1)
    plt.hist(H.ravel(), 30, range=(-1, 1))

## References
To learn more about the plotting functions:
* [matplotlib.pyplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot)