# Regularization
---
Regularization helps prevent a model from overfitting by adding an extra penalty term to the loss function. For the cross-entropy cost, this changes the objective from:

$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
to:
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$$

Here $m$ is the batch size. The penalty shown is called `L2 regularization` because it sums the squares of the weights. `L1 regularization` instead sums their absolute values, $|W|$.

The added term increases the loss when there are too many weights or when individual weights become too large. The adjustable factor $\lambda$ controls how strongly the weights are penalized.
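To make the two penalties concrete, here is a minimal sketch that evaluates both on a toy weight matrix. The helper names `l2_penalty` and `l1_penalty` are hypothetical (they are not part of `deepNN`), and the $\frac{\lambda}{m}$ scaling of the L1 term is an assumed convention, since the text above only gives its $|W|$ form.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """L2 term of equation (2): (1/m) * (lambd/2) * sum of squared weights."""
    return (1 / m) * (lambd / 2) * sum(np.sum(np.square(W)) for W in weights)

def l1_penalty(weights, lambd, m):
    """L1 analogue: sum of absolute weights (the lambd/m scaling is an assumption)."""
    return (lambd / m) * sum(np.sum(np.abs(W)) for W in weights)

# Toy weights for a single layer; the values are illustrative only.
toy_weights = [np.array([[0.5, -2.0],
                         [0.0,  1.0]])]
m, lambd = 4, 0.2

l2 = l2_penalty(toy_weights, lambd, m)
l1 = l1_penalty(toy_weights, lambd, m)
print(f'L2 penalty: {l2}, L1 penalty: {l1}')

# Doubling lambda doubles the penalty, so it weighs more against the data loss.
l2_doubled = l2_penalty(toy_weights, 2 * lambd, m)
```

Note how the large weight `-2.0` dominates the L2 penalty because it is squared, while the L1 penalty grows only linearly with it.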

_**1. Why does penalizing weights help prevent overfitting?**_

Intuitively, minimizing the new loss function drives some of the weights close to zero, so the corresponding neurons have very little effect on the result, as if we were training a smaller neural network with fewer neurons.
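The shrinking effect can be sketched in isolation. The snippet below applies gradient descent steps that use only the gradient of the L2 term, $\frac{\lambda}{m}W$, and ignores the data term entirely. The function `l2_only_step`, the learning rate, and the weight values are assumptions made for illustration. In real training the cross-entropy gradient pushes back, so weights get smaller but rarely reach exactly zero.

```python
import numpy as np

def l2_only_step(W, lambd, m, learning_rate):
    """One gradient descent step on the L2 term alone (data gradient ignored)."""
    return W - learning_rate * (lambd / m) * W

W = np.array([[1.0, -2.0],
              [0.5,  3.0]])
lambd, m, learning_rate = 0.2, 1, 0.1
n_steps = 100

W_shrunk = W.copy()
for _ in range(n_steps):
    W_shrunk = l2_only_step(W_shrunk, lambd, m, learning_rate)

# Each step multiplies W by (1 - learning_rate * lambd / m), so the decay is geometric.
shrink_factor = (1 - learning_rate * lambd / m) ** n_steps
print(W_shrunk)
print(W * shrink_factor)
```

Both printed arrays agree: every weight keeps its sign and is scaled by the same factor, which is why L2 regularization is often described as weight decay.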

# Forward
---
In the forward pass, we only need to change the loss function. Let's review the cost function we built in `deepNN`.

In [1]:
import numpy as np
from model import deepNN

In [2]:
model = deepNN([2, 4, 1])

In [3]:
A = np.array([[.3, .5, .7]])
Y = np.array([[1, 1, 1]])

loss = model.compute_cost(A, Y)
print(f'loss: {loss}')

loss: 0.7512649762748712


In [13]:
def compute_loss(A, Y, parameters, reg=True, lambd=.2):
    """
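    Cross-entropy cost with optional L2 regularization.
    A: predictions of the output layer, shape (1, m)
    Y: labels, shape (1, m)
    parameters: dict with 'W1', 'b1', 'W2', ...
    reg: add the L2 term when True
    lambd: regularization strength (lambda in equation (2))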
    """
    assert A.shape == Y.shape
    n_layer = len(parameters)//2
    m = A.shape[1]
    s = np.dot(Y, np.log(A.T)) + np.dot(1-Y, np.log((1 - A).T))
    loss = -s/m
    if reg:
        p = 0
        for i in range(1, n_layer+1):
            p += np.sum(np.square(parameters['W'+str(i)]))
        loss += (1/m)*(lambd/2)*p
    return np.squeeze(loss)

In [6]:
model.weights_init()
model.params

{'W1': array([[ 0.00224882, -0.00683036],
        [-0.0155842 ,  0.00439355],
        [ 0.0026745 ,  0.00287223],
        [-0.00977243,  0.00515391]]),
 'b1': array([[0.],
        [0.],
        [0.],
        [0.]]),
 'W2': array([[-0.02002206,  0.00227708,  0.00470624,  0.00502016]]),
 'b2': array([[0.]])}

In [14]:
loss = compute_loss(A, Y, model.params)
print(f'loss: {loss}')

loss: 0.7512951351356093
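As a sanity check, the gap between the two printed losses should equal the L2 term of equation (2) evaluated on the printed weights. The sketch below recomputes that term by hand. The weights are copied from the rounded output of `model.params`, so the match is only expected up to rounding.

```python
import numpy as np

# Weights copied from the printed model.params (rounded to 8 decimals).
W1 = np.array([[ 0.00224882, -0.00683036],
               [-0.0155842 ,  0.00439355],
               [ 0.0026745 ,  0.00287223],
               [-0.00977243,  0.00515391]])
W2 = np.array([[-0.02002206,  0.00227708,  0.00470624,  0.00502016]])

m, lambd = 3, 0.2  # three examples in A, default lambd of compute_loss

# L2 term of equation (2): (1/m) * (lambd/2) * sum of all squared weights.
l2_term = (1 / m) * (lambd / 2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)))

# Difference between the regularized and unregularized losses printed above.
printed_gap = 0.7512951351356093 - 0.7512649762748712

print(f'L2 term:     {l2_term:.6e}')
print(f'printed gap: {printed_gap:.6e}')
```

The penalty is tiny here (on the order of $10^{-5}$) because the weights were just initialized to small values. It grows as the weights grow during training.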


# Backward
---
The backward propagation of `L2 regularization` is straightforward: we only need to add the gradient of the L2 term to each weight gradient.

$$ \underbrace{\frac{\partial{J}^{\text{L2 Reg}}}{\partial{W}}}_{\text{new gradient}} = \underbrace{ \frac{\partial{J}^{\text{old}}}{\partial{W}} }_{\text{old gradient}} + \frac{\lambda}{m}W$$
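Differentiating the L2 term $\frac{1}{m}\frac{\lambda}{2}\sum W^2$ of equation (2) with respect to a weight gives $\frac{\lambda}{m}W$, which keeps the sign of $W$. A finite-difference check is one way to confirm this. The helpers `l2_cost`, `l2_grad`, and `numerical_grad` below are hypothetical names written for this sketch, and the random weights are illustrative.

```python
import numpy as np

def l2_cost(W, lambd, m):
    """L2 term of equation (2) for a single weight matrix."""
    return (1 / m) * (lambd / 2) * np.sum(np.square(W))

def l2_grad(W, lambd, m):
    """Analytic gradient of l2_cost with respect to W."""
    return (lambd / m) * W

def numerical_grad(f, W, eps=1e-6):
    """Central finite-difference estimate of df/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))
lambd, m = 0.2, 3

analytic = l2_grad(W, lambd, m)
numeric = numerical_grad(lambda w: l2_cost(w, lambd, m), W)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(f'max |analytic - numeric| = {max_abs_diff:.2e}')
```

The two gradients agree up to floating-point error, including for negative weights, where the gradient is negative as well.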

In [15]:
def backward(params, cache, X, Y, lambd=0.2):
    """
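    Backward pass with L2 regularization.
    params: dict with 'W1', 'b1', 'W2', ...
    cache: dict of forward results 'A1', 'Z1', 'A2', ...
    X: input data, stored as cache['A0']
    Y: labels, shape (1, m)
    lambd: regularization strength
    Uses sigmoid_grad and relu_grad, which are not defined in this section.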
    """
    grad = {}
    n_layers = int(len(params)/2)
    m = Y.shape[1]
    cache['A0'] = X
    
    for l in range(n_layers, 0, -1):
        A, A_prev, Z = cache['A' + str(l)], cache['A' + str(l-1)], cache['Z' + str(l)]
        W = params['W'+str(l)]
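        if l == n_layers:
            # output layer: derivative of the cross-entropy cost w.r.t. A, then sigmoid
            dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)
            dZ = np.multiply(dA, sigmoid_grad(A, Z))
        else:
            # hidden layers: dA was propagated back from the layer above
            dZ = np.multiply(dA, relu_grad(A, Z))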
        
        # L2 regularization adds (lambd/m)*W to dW; all other terms stay the same
        dW = np.dot(dZ, A_prev.T)/m + (lambd/m)*W
        
        db = np.sum(dZ, axis=1, keepdims=True)/m
        dA = np.dot(W.T, dZ)

        grad['dW'+str(l)] = dW
        grad['db'+str(l)] = db
    
    return grad