## Regularization

Regularization reduces overfitting, typically by driving the weights toward smaller values.

Regularization hurts training-set performance because it limits the network's ability to overfit the training set. It still helps the system overall, because it typically gives better test accuracy.

## L2 regularization

**L2 regularization** is a standard way to avoid overfitting.

Rationale: L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights.
1. By penalizing the squared values of the weights in the cost function $J$, you drive all the weights to smaller values.
2. With smaller weights, the output changes more smoothly as the input changes.

Remarks:
- You can tune the value of the regularization hyperparameter $\lambda$ using a dev set.
- L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it can also _oversmooth_, resulting in a model with high bias.
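
The remark about tuning $\lambda$ on a dev set can be sketched on a much smaller problem. The example below swaps the neural network for L2-regularized linear regression (ridge regression), because it has a closed-form solution and keeps the sketch short. The synthetic data, the candidate list, and the helper names `fit_ridge` and `mse` are assumptions made for illustration, and the arrays here use shape (examples, features), unlike the (features, examples) layout used elsewhere in these notes.

```python
import numpy as np

def fit_ridge(X, y, lambd):
    """Closed-form L2-regularized least squares: solve (X^T X + lambd * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lambd * np.eye(n_features), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (illustrative only): a small training set and a larger dev set.
rng = np.random.default_rng(0)
true_w = rng.normal(size=10)
X_train = rng.normal(size=(30, 10))
y_train = X_train @ true_w + rng.normal(scale=1.0, size=30)
X_dev = rng.normal(size=(200, 10))
y_dev = X_dev @ true_w + rng.normal(scale=1.0, size=200)

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
train_errors, dev_errors = [], []
for lambd in candidates:
    w = fit_ridge(X_train, y_train, lambd)
    train_errors.append(mse(X_train, y_train, w))
    dev_errors.append(mse(X_dev, y_dev, w))

# Pick the lambda with the lowest dev-set error, not the lowest training error.
best_lambd = candidates[int(np.argmin(dev_errors))]
print("best lambda on dev set:", best_lambd)
```

The training error can only grow as $\lambda$ increases, which is the high-bias direction mentioned above, so the choice has to be made on data the model was not fit to.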

#### Cost

Modify your cost function by adding a regularization term to the cost $J$, from:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)}$$
To:
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{\lambda}{2m} \sum\limits_{l=1}^{L} \lVert W^{[l]} \rVert_F^2}_\text{L2 regularization cost}$$

import numpy as np

def compute_cost(AL, Y):
    """
    Arguments:
    AL: probability vector corresponding to your label predictions, shape (1, number of examples)
    Y: true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost: cross-entropy cost
    """
    
    m = Y.shape[1]

    # compute loss from aL and y.
    logprobs = np.multiply(Y, np.log(AL)) + np.multiply((1-Y), np.log(1-AL))
    cost = -(1/m)*np.sum(logprobs)
    
    cost = np.squeeze(cost) # e.g. turns [[71]] into 71
    return cost

def compute_cost_with_regularization(A_final, Y, parameters, lambd):
    """    
    Arguments:
    A_final: post-activation output of the final layer (prediction), shape (output size, number of examples)
    Y: "true" labels vector, of shape (output size, number of examples)
    parameters: dict containing W1, b1, ..., WL, bL
    lambd: regularization hyperparameter

    Returns:
    cost: regularized cost
    """
    m = Y.shape[1]

    # cross-entropy cost
    cross_entropy_cost = compute_cost(A_final, Y)

    # L2 regularization cost
    L2_regularization_cost = 0
    L = len(parameters) // 2  # num of layers
    
    for l in range(1, L + 1):
        W = parameters[f'W{l}']
        L2_regularization_cost += np.sum(np.square(W))
    
    L2_regularization_cost *= lambd / (2 * m)

    # overall cost
    cost = cross_entropy_cost + L2_regularization_cost
    return cost
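
As a quick sanity check on the L2 term, the sketch below confirms that summing the squared entries of each weight matrix matches the squared Frobenius norm from the formula. The two-layer `parameters` dict and the values of `lambd` and `m` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-layer parameter dict, laid out like `parameters` above.
parameters = {
    "W1": rng.normal(size=(3, 2)), "b1": np.zeros((3, 1)),
    "W2": rng.normal(size=(1, 3)), "b2": np.zeros((1, 1)),
}
lambd, m = 0.7, 5  # illustrative values

L = len(parameters) // 2  # number of layers

# Sum of squared weights, as computed in the loop above.
sum_of_squares = sum(np.sum(np.square(parameters[f"W{l}"])) for l in range(1, L + 1))

# The same quantity written as squared Frobenius norms.
frobenius = sum(np.linalg.norm(parameters[f"W{l}"], "fro") ** 2 for l in range(1, L + 1))

l2_cost = lambd / (2 * m) * sum_of_squares
print(np.isclose(sum_of_squares, frobenius), l2_cost)
```

Only the weight matrices enter the penalty; the biases are left unregularized, as in the function above.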

#### Gradient descent

The gradient descent update rule changes to:
$$
\begin{align*}
dW^{[l]} &= \text{(from backprop)} + \frac{\lambda}{m} W^{[l]} \\
W^{[l]} &:= W^{[l]} - \alpha \cdot dW^{[l]} \\
&:= W^{[l]} - \alpha \big[\text{(from backprop)} + \frac{\lambda}{m} W^{[l]}\big] \\
&:= W^{[l]} - \frac{\alpha \lambda}{m} W^{[l]} - \alpha \cdot \text{(from backprop)} \\
&:= \big(1 - \frac{\alpha \lambda}{m}\big) W^{[l]} - \alpha \cdot \text{(from backprop)}
\end{align*}
$$

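The factor $\big(1 - \frac{\alpha \lambda}{m}\big)$ is less than 1 for $\alpha, \lambda > 0$, so every update shrinks the weights before the backprop step is applied. This is why L2 regularization is also called _weight decay_.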
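
The derivation can be checked numerically. In this minimal sketch, `backprop_grad` is a random stand-in for the "(from backprop)" term, and the values of `alpha`, `lambd`, and `m` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
backprop_grad = rng.normal(size=(4, 3))  # stand-in for the "(from backprop)" term
alpha, lambd, m = 0.1, 0.5, 20           # illustrative values

# Form 1: add the L2 term to the gradient, then take a plain gradient step.
dW = backprop_grad + (lambd / m) * W
W_step = W - alpha * dW

# Form 2: shrink W by the decay factor, then apply the unregularized gradient.
decay = 1 - alpha * lambd / m
W_decay = decay * W - alpha * backprop_grad

print(decay, np.allclose(W_step, W_decay))
```

Both forms give the same weights, which is the equality stated in the last line of the derivation.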

def backward_propagation_with_regularization(X, Y, caches, lambd):
    """
    Implements the backward propagation with L2 regularization for an L-layer network.

    Arguments:
    X: input data, shape (input size, number of examples)
    Y: true labels, shape (output size, number of examples)
    caches: list of caches from forward propagation [(Z1, A0, W1, b1, A1), ..., (ZL, AL-1, WL, bL, AL)]
    lambd: L2 regularization hyperparameter

    Returns:
    grads: dict of gradients keyed "dZl", "dWl", "dbl" for l = 1..L
    """
    grads = {}
    m = X.shape[1]
    L = len(caches)

    # Lth-layer
    ZL, AL_prev, WL, bL, AL = caches[-1]
    dZL = AL - Y
    dWL = (1. / m) * np.dot(dZL, AL_prev.T) + (lambd / m) * WL
    dbL = (1. / m) * np.sum(dZL, axis=1, keepdims=True)
    dA_prev = np.dot(WL.T, dZL)

    grads[f"dZ{L}"] = dZL
    grads[f"dW{L}"] = dWL
    grads[f"db{L}"] = dbL

    # for L-1 hidden layers
    for l in reversed(range(1, L)):
        Z, A_prev, W, b, A = caches[l - 1]
        dZ = dA_prev * (A > 0)  # ReLU derivative
        dW = (1. / m) * np.dot(dZ, A_prev.T) + (lambd / m) * W
        db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
        dA_prev = np.dot(W.T, dZ)

        grads[f"dZ{l}"] = dZ
        grads[f"dW{l}"] = dW
        grads[f"db{l}"] = db

    return grads
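
A common way to validate a regularized gradient is a numerical gradient check. The sketch below does this for a single sigmoid layer rather than the full L-layer function above, to stay short. The helpers `cost_fn` and `analytic_dW`, the random data, and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(W, b, X, Y, lambd):
    """Cross-entropy cost plus the L2 term (lambd / 2m) * ||W||_F^2."""
    m = X.shape[1]
    A = sigmoid(W @ X + b)
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    return cross_entropy + lambd / (2 * m) * np.sum(np.square(W))

def analytic_dW(W, b, X, Y, lambd):
    """Backprop gradient plus the (lambd / m) * W regularization term."""
    m = X.shape[1]
    dZ = sigmoid(W @ X + b) - Y
    return dZ @ X.T / m + (lambd / m) * W

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 8))                        # (input size, number of examples)
Y = (rng.random(size=(1, 8)) > 0.5).astype(float)  # random binary labels
W = rng.normal(size=(1, 3))
b = np.zeros((1, 1))
lambd, eps = 0.7, 1e-6

# Central-difference estimate of dJ/dW, one weight at a time.
numeric_dW = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    numeric_dW[idx] = (cost_fn(W_plus, b, X, Y, lambd)
                       - cost_fn(W_minus, b, X, Y, lambd)) / (2 * eps)

max_abs_diff = float(np.max(np.abs(numeric_dW - analytic_dW(W, b, X, Y, lambd))))
print("max abs difference:", max_abs_diff)
```

If the `(lambd / m) * W` term were left out of the analytic gradient, the two estimates would disagree by about that amount.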

## Dropout regularization

Dropout is a regularization technique that randomly _drops_ (sets to zero) some neurons in each layer, each with probability $1-\text{keep\_prob}$.

If $\text{keep\_prob}=0.8$, each neuron has:
- an 80% chance of being kept, and
- a 20% chance of being zeroed out.

The most common implementation of dropout is _inverted dropout_.

Inverted dropout procedure:
1. For each layer $l$, generate a mask $d \sim \text{Bernoulli(keep\_prob)}$ with the same shape as that layer's activations.
2. Apply the mask using element-wise multiplication to drop units.
3. Scale the remaining activations by dividing them by $\text{keep\_prob}$.

Dividing by `keep_prob` ensures that the expected value of the activations stays the same as it would be without dropout.
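
The three steps above, and the effect of the scaling, can be sketched on a single activation matrix. The matrix `A` is random, strictly positive data standing in for real activations.

```python
import numpy as np

rng = np.random.default_rng(3)
keep_prob = 0.8
A = rng.random(size=(50, 10_000)) + 1.0   # hypothetical activations, all positive

D = rng.random(size=A.shape) < keep_prob  # step 1: Bernoulli(keep_prob) mask
A_dropped = A * D                         # step 2: zero out the dropped units
A_scaled = A_dropped / keep_prob          # step 3: rescale the surviving units

kept_fraction = float(D.mean())
mean_before = float(A.mean())
mean_unscaled = float(A_dropped.mean())
mean_scaled = float(A_scaled.mean())
print(kept_fraction, mean_before, mean_unscaled, mean_scaled)
```

Without step 3 the mean activation drops by roughly a factor of `keep_prob`; with it, the mean stays close to its value before dropout.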

Lower $\text{keep\_prob}$ means stronger regularization.

Typical probabilities per layer:

| Layer                   | keep_prob  | Remark                       |
|-------------------------|------------|------------------------------|
| Input                   | 0.9–1.0    | Rarely drop input features   |
| Early hidden layers     | 0.8–0.9    | Mild regularization          |
| Middle and deep layers  | 0.5–0.7    | Stronger regularization      |
| Output                  | 1.0        | No dropout for stable output |

Dropout is similar to L2 regularization in that it spreads the weights across inputs, which reduces weight norms. However, dropout is adaptive:
1. It randomly zeroes out neurons during training.
2. Each iteration therefore trains a smaller sub-network.
3. This reduces the network's reliance on any specific neuron.
4. As a result, it helps prevent overfitting.

A downside of dropout is that the cost function $J$ becomes non-deterministic, so it is harder to check that it decreases monotonically. One workaround is to first train with dropout turned off ($\text{keep\_prob}=1$) to confirm that the cost converges, and only then turn dropout on.

Use dropout when:
- the model overfits.
- the model has large fully connected layers, as is common in computer vision.


Remarks:
- Use dropout only during training. Don't use it at test time.
- Deep learning frameworks like TensorFlow, Keras, or PyTorch come with a dropout layer implementation.
- Apply dropout during both forward and backward propagation, using the same mask in both passes.
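
The first and last remarks can be sketched with a small helper. The function name `dropout` and its `training` flag are hypothetical and not taken from any framework; the sketch only shows that dropout is skipped at test time and that the backward pass reuses the mask from the forward pass.

```python
import numpy as np

def dropout(A, keep_prob, training, rng):
    """Inverted dropout on activations A; a no-op when training is False."""
    if not training or keep_prob >= 1.0:
        return A, None                        # test time: keep every unit, no scaling
    D = rng.random(size=A.shape) < keep_prob  # Bernoulli(keep_prob) mask
    return A * D / keep_prob, D               # return D so backprop can reuse it

rng = np.random.default_rng(4)
A = np.ones((4, 6))  # toy activations

A_train, D = dropout(A, keep_prob=0.5, training=True, rng=rng)
A_test, D_test = dropout(A, keep_prob=0.5, training=False, rng=rng)

# Backward pass: apply the same mask and scaling to the incoming gradient,
# so gradients flow only through the units kept in the forward pass.
dA = np.ones_like(A)
dA_masked = dA * D / 0.5

print(A_train)
print(A_test)
```

With `training=False` the activations pass through unchanged, so predictions at test time are deterministic.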

#### Forward propagation with dropout
Assumes ReLU activations for the hidden layers and a sigmoid for the output. The code calls `relu` and `sigmoid` helpers that are not defined in this section.

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Arguments:
    X: input data, shape (input size, number of examples)
    parameters: dict containing "W1", "b1", ..., "WL", "bL"
    keep_prob: prob of keeping a neuron active (scalar)
    
    Returns:
    AL: last activation (sigmoid output)
    caches: list of tuples for backpropagation (including dropout masks)
    """
    caches = []
    A = X
    L = len(parameters) // 2  # num of layers

    for l in range(1, L):
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']
        Z = np.dot(W, A) + b
        A = relu(Z)

        D = np.random.rand(A.shape[0], A.shape[1]) < keep_prob  # dropout mask
        A = np.multiply(A, D) # apply mask
        A /= keep_prob # scale the activation

        caches.append((Z, D, A, W, b))

    # final layer (no dropout, sigmoid activation)
    WL = parameters[f'W{L}']
    bL = parameters[f'b{L}']
    ZL = np.dot(WL, A) + bL
    AL = sigmoid(ZL)

    caches.append((ZL, None, AL, WL, bL))  # no dropout in output

    return AL, caches

#### Backward propagation with dropout
Implements backward propagation for an L-layer neural network with dropout, reusing the masks stored in `caches` during the forward pass.

def backward_propagation_with_dropout(X, Y, caches, keep_prob):
    """
    Arguments:
    X: input data, shape (input size, number of examples)
    Y: true labels, shape (output size, number of examples)
    caches: list of tuples from forward_propagation_with_dropout
            (Z, D, A, W, b) for hidden layers, (Z, None, A, W, b) for output layer
    keep_prob: dropout keep probability (scalar)

    Returns:
    grads: dict of gradients keyed "dWl" and "dbl" for l = 1..L
    """
    m = X.shape[1]
    L = len(caches)
    grads = {}

    # output layer (sigmoid, no dropout)
    ZL, _, AL, WL, bL = caches[-1]
    dZL = AL - Y
    grads[f'dW{L}'] = (1 / m) * np.dot(dZL, caches[-2][2].T if L > 1 else X.T)
    grads[f'db{L}'] = (1 / m) * np.sum(dZL, axis=1, keepdims=True)
    dA_prev = np.dot(WL.T, dZL)

    # loop through hidden layers (ReLU + dropout), in reverse
    for l in reversed(range(1, L)):
        Z, D, A, W, b = caches[l - 1]
        dA_prev = np.multiply(dA_prev, D) # apply dropout mask
        dA_prev /= keep_prob              # scale
        dZ = dA_prev * (Z > 0)            # ReLU derivative
        A_prev = caches[l - 2][2] if l > 1 else X
        grads[f'dW{l}'] = (1 / m) * np.dot(dZ, A_prev.T)
        grads[f'db{l}'] = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
        dA_prev = np.dot(W.T, dZ)

    return grads