## L2 Regularization

The standard way to avoid overfitting is called **L2 regularization**. It consists of appropriately modifying your cost function, from:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
To:
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$$

Let's modify your cost and observe the consequences.

**Exercise**: Implement `compute_cost_with_regularization()` which computes the cost given by formula (2). To calculate $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$  , use :
```python
np.sum(np.square(Wl))
```
Note that you have to do this for $W^{[1]}$, $W^{[2]}$ and $W^{[3]}$, then sum the three terms and multiply by $ \frac{1}{m} \frac{\lambda}{2} $.

In [1]:
import numpy as np

X = np.array([
    [0,0],
    [0,1],
    [1,0],
    [1,1]
])

Y = np.array([
    [0],
    [0],
    [0],
    [1]
])

m = X.shape[0]
lambd = 0.1
num_nodes = 400

W1 = np.random.randn(num_nodes,X.shape[1])*0.1
b1 = np.zeros((num_nodes,1))

W2 = np.random.randn(1,num_nodes)*0.1
b2 = np.zeros((1,X.shape[0]))

X = X.T
Y = Y.T

costs = []

for i in range(4000):
    # Foward Prop
    # LAYER 1
    Z1 = np.dot(W1,X) + b1
    A1 = 1/(1+np.exp(-Z1))
    # LAYER 2
    Z2 = np.dot(W2,A1) + b2
    A2 = 1/(1+np.exp(-Z2))
    
    # Back Prop
    dZ2 = A2 - Y
    dW2 = (1/m)*np.dot(dZ2,A1.T) + (lambd * W2) / m # CHANGED
    db2 = (1/m)*np.sum(dZ2,axis=1,keepdims=True)
    
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2))
    dW1 = (1/m)*np.dot(dZ1,X.T) + (lambd * W1) / m # CHANGED
    db1 = (1/m)*np.sum(dZ1,axis=1,keepdims=True)
    
    # Gradient Descent
    W2 = W2 - 0.01*dW2
    b2 = b2 - 0.01*db2
    
    W1 = W1 - 0.01*dW1
    b1 = b1 - 0.01*db1
    
    # Loss
    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2))) / (2 * m) # ADDED
    
    L = (-1/m)*np.sum(Y*np.log(A2) + (1-Y)*np.log(1-A2))
    L = np.squeeze(L) + L2_regularization_cost # CHANGED
    costs.append(L)
    if i%500 == 0:
        print("=======================================")
        print("Loss = ",L)
        print(Y,"===",A2)

Loss =  0.7135990346948016
[[0 0 0 1]] === [[0.29550276 0.29202402 0.29783112 0.29434383]]
Loss =  0.5988161747148047
[[0 0 0 1]] === [[0.17387406 0.2448307  0.2491679  0.33759626]]
Loss =  0.5577531479030945
[[0 0 0 1]] === [[0.125802   0.23623601 0.2397158  0.40195564]]
Loss =  0.5435489496253247
[[0 0 0 1]] === [[0.09152545 0.22563662 0.22836479 0.45704237]]
Loss =  0.5428776442021304
[[0 0 0 1]] === [[0.06693312 0.21410997 0.21617359 0.50454299]]
Loss =  0.5496013017444414
[[0 0 0 1]] === [[0.04986207 0.20280469 0.20431198 0.54415989]]
Loss =  0.5600653818053128
[[0 0 0 1]] === [[0.03828546 0.19256645 0.19363563 0.57614871]]
Loss =  0.571978062061931
[[0 0 0 1]] === [[0.03047377 0.18379608 0.18453673 0.60141174]]
