#### About

> Gradient initialization

Gradient initialization, more precisely parameter (weight) initialization, is an important step in training deep neural networks: it sets the initial values of the model parameters, which strongly affect how quickly training converges and how well the final model performs. Here is an explanation of some common initialization methods used in deep learning:

1 Zero initialization

With zero initialization, all model parameters (weights and biases) are initialized to zeros. Mathematically, this can be expressed as:


W[l] = 0, where W[l] is the weight matrix of layer l
b[l] = 0, where b[l] is the bias vector of layer l


However, initializing all parameters to zero causes a "symmetry problem": every neuron in a layer computes the same output and receives the same gradient, so all neurons are updated with the same values during training and never learn different features. The layer then behaves like a single neuron, resulting in poor model performance.





In [1]:
import numpy as np

def initialize_parameters_zeros(layers_dims):
    """Return zero-initialized weights W1..W(L-1) and biases b1..b(L-1) for the given layer sizes."""
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters

layers_dims = [3, 4, 2] # example network architecture
parameters = initialize_parameters_zeros(layers_dims)
print(parameters)

{'W1': array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]]), 'b1': array([[0.],
       [0.],
       [0.],
       [0.]]), 'W2': array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]]), 'b2': array([[0.],
       [0.]])}
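
The symmetry problem can be checked numerically. The sketch below is an illustration rather than part of the original notebook: it assumes a two-layer network with a tanh hidden layer, a sigmoid output and a cross-entropy loss, and both the helper `hidden_layer_gradient` and the random data are hypothetical. Every weight is set to the same constant, and the code checks whether all hidden units end up with the same activations and the same gradient.

```python
import numpy as np

def hidden_layer_gradient(W1, b1, W2, b2, X, Y):
    """Forward and backward pass of a 2-layer net (tanh hidden, sigmoid output).

    Returns the hidden activations A1 and the gradient dW1 of the
    cross-entropy loss with respect to W1.
    """
    m = X.shape[1]                                # number of examples
    A1 = np.tanh(W1 @ X + b1)                     # hidden activations, shape (hidden units, m)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))    # sigmoid outputs, shape (outputs, m)
    dZ2 = A2 - Y                                  # gradient at the output pre-activation
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)          # backpropagate through tanh
    dW1 = dZ1 @ X.T / m
    return A1, dW1

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))                       # 3 features, 5 examples
Y = rng.integers(0, 2, size=(2, 5)).astype(float)     # 2 binary targets per example

for c in (0.0, 0.5):                                  # zero init and another constant init
    W1, b1 = np.full((4, 3), c), np.zeros((4, 1))
    W2, b2 = np.full((2, 4), c), np.zeros((2, 1))
    A1, dW1 = hidden_layer_gradient(W1, b1, W2, b2, X, Y)
    same_activations = np.allclose(A1, A1[0])         # every row equals the first row
    same_gradients = np.allclose(dW1, dW1[0])
    print(f"c={c}: same activations={same_activations}, same gradients={same_gradients}")
```

With c = 0.0 the gradient of W1 is exactly zero as well, because the error is backpropagated through W2, which is itself zero.
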


2 Random initialization

Random initialization sets the parameters to random values drawn from a given distribution. This breaks the symmetry between neurons, so that each one can learn a different feature. Common choices are a Gaussian (normal) distribution with a fixed standard deviation and Xavier/Glorot initialization, which scales the standard deviation according to the layer size.


2.1 Gaussian (normal) initialization


In Gaussian initialization, the weights are initialized with random values drawn from a Gaussian distribution with mean 0 and a specified standard deviation (sigma), while the biases are set to zero. Mathematically, this can be expressed as:

W[l] = np.random.randn(layers_dims[l], layers_dims[l-1]) * sigma,
b[l] = np.zeros((layers_dims[l], 1)),
where W[l] is the weight matrix of layer l,
b[l] is the bias vector of layer l,
sigma is the standard deviation,
np.random.randn generates random samples from a standard normal distribution




In [2]:
import numpy as np

def initialize_parameters_gaussian(layers_dims, sigma):
    """Return weights drawn from N(0, sigma^2) and zero biases for the given layer sizes."""
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * sigma
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters


layers_dims = [3, 4, 2] # network architecture
sigma = 0.01 # standard deviation
parameters = initialize_parameters_gaussian(layers_dims, sigma)
print(parameters)


{'W1': array([[-0.00695358,  0.00745988, -0.00420675],
       [-0.00317408,  0.00562514, -0.00397311],
       [-0.00289508,  0.00873246, -0.00480993],
       [ 0.00251623, -0.00502894,  0.01279217]]), 'b1': array([[0.],
       [0.],
       [0.],
       [0.]]), 'W2': array([[-0.00066028, -0.00999176,  0.00600391,  0.01159534],
       [-0.00347197, -0.0055126 ,  0.00232397, -0.00591203]]), 'b2': array([[0.],
       [0.]])}
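
The choice of sigma matters more as the network gets deeper. As a rough illustration, the sketch below pushes random inputs through a stack of tanh layers and reports how large the activations of the last layer are. The helper `activation_stats`, the tanh activation, the width of 256 and the depth of 10 are all assumptions made for this example and do not come from the original notebook.

```python
import numpy as np

def activation_stats(sigma, width=256, depth=10, n_samples=500, seed=0):
    """Propagate random inputs through `depth` tanh layers with N(0, sigma^2) weights.

    Returns (std of the last layer's activations, mean absolute activation).
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((width, n_samples))   # unit-variance inputs
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma
        A = np.tanh(W @ A)                        # biases are zero, as in the initializer above
    return A.std(), np.abs(A).mean()

for sigma in (0.01, 1.0):
    std, mean_abs = activation_stats(sigma)
    print(f"sigma={sigma}: std={std:.2e}, mean |activation|={mean_abs:.2f}")
```

In this setup a small sigma shrinks the activations toward zero layer by layer, while a large sigma pushes tanh into saturation, where its gradient is close to zero. Initializations that tie the standard deviation to the layer size are meant to avoid both extremes.
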


2.2 Xavier/Glorot initialization

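Xavier/Glorot initialization is a popular method for initializing the parameters of deep neural networks. Its purpose is to keep the variances of the activations and gradients roughly the same across layers during training. The variant shown here scales a standard normal sample by sqrt(1 / fan_in), where fan_in = layers_dims[l-1] is the number of inputs to the layer; the original formulation by Glorot and Bengio uses a variance of 2 / (fan_in + fan_out) instead. Mathematically, this variant can be expressed as: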

W[l] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(1 / layers_dims[l-1]),
where W[l] is the weight matrix of layer l,
np.sqrt(1 / layers_dims[l-1]) is the scaling factor



In [3]:
import numpy as np

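def initialize_parameters_xavier(layers_dims):
    """Return weights drawn from N(0, 1/fan_in) and zero biases, where fan_in is the previous layer's size."""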
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(1 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters


layers_dims = [3, 4, 2] # network architecture
parameters = initialize_parameters_xavier(layers_dims)
print(parameters)


{'W1': array([[-0.11098635, -0.24493321, -0.67057116],
       [-0.28679694, -0.28776603,  0.24651697],
       [ 0.46407216,  0.85892708, -0.06777731],
       [-1.41570239,  0.40839064, -0.06516387]]), 'b1': array([[0.],
       [0.],
       [0.],
       [0.]]), 'W2': array([[-0.33810429,  0.68169883, -0.03199655, -0.01838845],
       [ 0.88044223, -0.02293936, -0.07781658,  0.71352702]]), 'b2': array([[0.],
       [0.]])}
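
The reason for the sqrt(1 / fan_in) factor can be seen on a single linear layer: if the inputs have unit variance and the weights are independent with mean 0 and variance 1 / fan_in, each output is a sum of fan_in terms of variance 1 / fan_in, so its variance is again about 1. The following sketch checks this empirically. The layer sizes and the unscaled N(0, 1) comparison are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, n_samples = 512, 256, 1000   # illustrative sizes, not from the original

X = rng.standard_normal((fan_in, n_samples))  # inputs with unit variance
W_xavier = rng.standard_normal((fan_out, fan_in)) * np.sqrt(1 / fan_in)
W_unscaled = rng.standard_normal((fan_out, fan_in))   # plain N(0, 1) weights for comparison

var_xavier = (W_xavier @ X).var()
var_unscaled = (W_unscaled @ X).var()
print(f"input variance:             {X.var():.3f}")
print(f"output variance (Xavier):   {var_xavier:.3f}")
print(f"output variance (unscaled): {var_unscaled:.1f}")
```

Without the scaling factor, the output variance grows roughly in proportion to fan_in, which is what makes a fixed sigma hard to choose for layers of different sizes.
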


3 He initialization

He initialization is another popular parameter initialization method for deep neural networks, designed for ReLU activation functions. Like Xavier initialization, it aims to keep the variances of the activations and gradients roughly equal across layers during training. Because ReLU zeroes out negative pre-activations (roughly half of them at initialization), the weight variance is doubled to compensate. Mathematically, He initialization can be expressed as:

W[l] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1]),
where W[l] is the weight matrix of layer l,
np.sqrt(2 / layers_dims[l-1]) is the scaling factor



In [4]:
import numpy as np

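def initialize_parameters_he(layers_dims):
    """Return weights drawn from N(0, 2/fan_in) and zero biases, where fan_in is the previous layer's size."""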
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters


layers_dims = [3, 4, 2]
parameters = initialize_parameters_he(layers_dims)
print(parameters)


{'W1': array([[-1.98472076e-02,  1.87633010e-02, -8.23628512e-02],
       [-8.91678901e-01, -8.56789613e-01, -1.77551319e-01],
       [ 9.49111713e-01, -5.85277205e-01, -9.94537159e-04],
       [-7.57846243e-01, -6.68402924e-01, -1.53590788e+00]]), 'b1': array([[0.],
       [0.],
       [0.],
       [0.]]), 'W2': array([[ 1.07533685, -0.2003852 ,  0.15137989, -0.41358273],
       [-0.19039654,  0.11565043,  0.04322966, -0.97289426]]), 'b2': array([[0.],
       [0.]])}
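
To see why the factor 2 matters for ReLU, the sketch below compares the two scaling factors on a stack of ReLU layers and measures how much of the input signal (mean squared activation) survives. The helper `relu_signal_ratio`, the width of 512 and the depth of 10 are hypothetical choices for this illustration, not part of the original notebook.

```python
import numpy as np

def relu_signal_ratio(scale_numerator, width=512, depth=10, n_samples=500, seed=0):
    """Ratio of the mean squared activation after `depth` ReLU layers to that of the input.

    Weights are drawn as randn * sqrt(scale_numerator / fan_in):
    scale_numerator=1 is the Xavier variant used above, 2 is He.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((width, n_samples))
    input_power = np.mean(A ** 2)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(scale_numerator / width)
        A = np.maximum(0.0, W @ A)          # ReLU, zero biases
    return np.mean(A ** 2) / input_power

ratio_he = relu_signal_ratio(2)
ratio_xavier = relu_signal_ratio(1)
print(f"He     (2 / fan_in): {ratio_he:.3f}")
print(f"Xavier (1 / fan_in): {ratio_xavier:.5f}")
```

Since both runs use the same seed and ReLU is positively homogeneous, the only difference between them is a factor of 1/2 per layer in the mean squared activation, which compounds to 1/1024 over 10 layers.
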


Choosing how to initialize the parameters is a critical step in training deep neural networks, and the initialization method can significantly affect model convergence and performance. It is worth testing different initialization methods and choosing the one that works best for your particular model and task.