# Initialization

## Zero initialization

In [3]:
import numpy as np

def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layers_dims: list containing the size of each layer.
    
    Returns:
    parameters: dict containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1: weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1: bias vector of shape (layers_dims[1], 1)
                    ...
                    WL: weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL: bias vector of shape (layers_dims[L], 1)
    """
    
    parameters = {}
    L = len(layers_dims)  # number of entries in layers_dims, input layer included
    
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    
    return parameters

Initializing all weights to zero ($W^{[l]} = 0$) causes the following:
1. Every hidden unit in a layer receives the same input and computes the same output, so the units are symmetric.
2. Because the units are symmetric, their gradients are identical.
3. Identical gradients update the parameters to identical values, so the symmetry persists after every step.

As a result, each layer behaves like a single unit, the predictions get stuck, and the network fails to learn. Weights $W^{[l]}$ should therefore be initialized randomly to break symmetry.

Initializing the biases $b^{[l]}$ to zero is fine. Symmetry is still broken as long as $W^{[l]}$ is initialized randomly.
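
The symmetry argument can be checked numerically. The sketch below is only an illustration and is not part of the original notebook: it assumes a toy 3-4-1 network with sigmoid activations, a cross-entropy loss, made-up data, and an arbitrary learning rate, then runs a few gradient descent steps starting from zero weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed for illustration): 3 features, 8 examples, binary labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))
Y = np.array([[1, 1, 1, 0, 1, 0, 1, 1]], dtype=float)
m = X.shape[1]

# Zero initialization for a hypothetical 3 -> 4 -> 1 network.
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

learning_rate = 0.5  # arbitrary choice for the demo
for _ in range(20):
    # Forward pass.
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)

    # Backward pass for the cross-entropy loss.
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m

    # Gradient descent update.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Every hidden unit (row of W1) has received the same updates,
# and so has every outgoing weight in W2.
rows_identical = np.allclose(W1, W1[0])
outgoing_identical = np.allclose(W2, W2[0, 0])
print("W1 rows identical:", rows_identical)
print("W2 entries identical:", outgoing_identical)
```

The weights do move away from zero during training, but all four hidden units stay copies of each other, so the layer has no more capacity than a single unit.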

## Random initialization

In [5]:
def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layers_dims: list containing the size of each layer.
    
    Returns:
    parameters: dict containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1: weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1: bias vector of shape (layers_dims[1], 1)
                    ...
                    WL: weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL: bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}
    L = len(layers_dims)  # number of entries in layers_dims, input layer included
    
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters

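Initialize the weights with small random values. Large initial weights slow down optimization: with saturating activations such as sigmoid or tanh, for example, they push units into flat regions where gradients are close to zero.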
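
To make the effect of the scale concrete, here is a minimal sketch that compares the average local gradient of a sigmoid layer for two weight scales. The sigmoid activation, the layer sizes, and the scale of 10 are assumptions chosen for illustration; only the 0.01 factor comes from the function above.

```python
import numpy as np

def sigmoid(z):
    # The tanh form avoids overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

rng = np.random.default_rng(1)
n_in, n_out, m = 100, 50, 200          # assumed toy layer and batch sizes
X = rng.standard_normal((n_in, m))

mean_grad = {}
for scale in (0.01, 10.0):
    W = rng.standard_normal((n_out, n_in)) * scale
    A = sigmoid(W @ X)
    # Local derivative of the sigmoid is a * (1 - a), with a maximum of 0.25.
    mean_grad[scale] = float(np.mean(A * (1.0 - A)))
    print(f"scale={scale:>5}: mean sigmoid derivative = {mean_grad[scale]:.4f}")
```

With the small scale the derivative stays near its maximum of 0.25, while the large scale saturates most units and leaves very little gradient to propagate.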

`rand()` vs `randn()`:
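- `np.random.rand()` draws samples from a uniform distribution over $[0, 1)$, so every value is non-negative.
- `np.random.randn()` draws samples from a standard normal distribution (mean 0, variance 1).

`np.random.randn()` is the usual choice: its samples are centered on 0 and take both signs, and after scaling by a small factor such as 0.01, large values are rare.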
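
A quick way to see the difference is to sample from both functions and compare summary statistics. This is a small sketch with an arbitrary sample size and seed; the variable names are made up for the example.

```python
import numpy as np

np.random.seed(0)                      # fixed seed so the run is repeatable
uniform = np.random.rand(100_000)      # uniform over [0, 1)
normal = np.random.randn(100_000)      # standard normal

print("rand : min=%.3f max=%.3f mean=%.3f"
      % (uniform.min(), uniform.max(), uniform.mean()))
print("randn: min=%.3f max=%.3f mean=%.3f std=%.3f"
      % (normal.min(), normal.max(), normal.mean(), normal.std()))

# Scaling by 0.01, as in initialize_parameters_random, shrinks the spread
# while keeping the values centered on 0.
small = normal * 0.01
print("randn * 0.01: std=%.4f" % small.std())
```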

## He initialization

Also known as: Kaiming initialization \
Use cases: ReLU and its variants

$$
\sigma^2 = \frac{2}{n^{[l-1]}}
$$

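Rationale: ReLU zeroes out about half of its inputs, which roughly halves the magnitude of the signal passed to the next layer. Using a weight variance of $2 / n^{[l-1]}$ instead of $1 / n^{[l-1]}$, where $n^{[l-1]}$ is the number of units in the previous layer, compensates for this. It keeps the scale of the activations roughly constant from layer to layer and mitigates vanishing gradients.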
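
The rationale can be illustrated by pushing random inputs through a stack of ReLU layers and measuring the mean squared activation at the end. The helper `final_mean_square`, the layer width, the depth, and the batch size are hypothetical choices for this sketch, not part of the original notebook.

```python
import numpy as np

def final_mean_square(scale_fn, width=512, depth=8, m=256, seed=0):
    """Mean squared activation after `depth` ReLU layers of equal width."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((width, m))    # input with mean square close to 1
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        A = np.maximum(0.0, W @ A)         # ReLU, biases left at zero
    return float(np.mean(A ** 2))

# He scaling: standard deviation sqrt(2 / n_prev).
ms_he = final_mean_square(lambda n_prev: np.sqrt(2.0 / n_prev))
# Fixed small scale, as in initialize_parameters_random.
ms_small = final_mean_square(lambda n_prev: 0.01)

print(f"He scaling   : mean square after 8 layers = {ms_he:.3e}")
print(f"0.01 scaling : mean square after 8 layers = {ms_small:.3e}")
```

Under these assumptions the He-scaled stack keeps the signal at roughly its input magnitude, whereas the fixed 0.01 scale shrinks it by many orders of magnitude, which is the forward-pass counterpart of vanishing gradients.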

In [7]:
def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layers_dims: list containing the size of each layer.
    
    Returns:
    parameters: dict containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1: weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1: bias vector of shape (layers_dims[1], 1)
                    ...
                    WL: weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL: bias vector of shape (layers_dims[L], 1)
    """
    
    parameters = {}
    L = len(layers_dims) - 1 # int representing the num of layers
     
    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2/layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))       
        
    return parameters