# Deep Neural Network

## Notations and demonstrations
- $m$ is denoted as the number of training examples
- $n$ is denoted as the number of the features
- $X$ is a matrix that each column is a training example and each row is a feature <br>
$
X = 
\begin{bmatrix}
    \displaystyle
    x^{(1)}_{1} & x^{(2)}_{1} & \dots & x^{(m)}_{1} \\
    x^{(1)}_{2} & x^{(2)}_{2} & \dots & x^{(m)}_{2} \\
    & \vdots & \vdots & \\
    x^{(1)}_{n} & x^{(2)}_{n} & \dots & x^{(m)}_{n} \\
\end{bmatrix}_{n_x \times m}
$
- $y$ is a vector denoted as the labels <br>
$
y = 
\begin{bmatrix}
    y_1 & y_2 & \dots & y_m
\end{bmatrix}
$
- $w$ is a vector of weights <br>
$
w = 
\begin{bmatrix}
\displaystyle
    w_1\\
    w_2\\
    \vdots\\
    w_n
\end{bmatrix}
$
<br> <br>
Neural networks might have multiple layers; each layers contains multiple perceptron and each themselves could be logistic function or ...; so for each we need different vector of weights that could be obtained via stacking each function's weights in a matrix;
<br>
- $W^{[1]}$ is a matrix for the first layer of the neural network; for example, if the first layer contains 4 logistic functions we have<br>
$
W^{[1]} =
\displaystyle
\begin{bmatrix}
    \dots & w_1^{[1]T} & \dots\\
    \dots & w_2^{[1]T} & \dots\\
    \dots & w_3^{[1]T} & \dots\\
    \dots & w_4^{[1]T} & \dots\\
\end{bmatrix}_{4, n}
$
<br><br>
Since we have multiple functions in each layer, we need multiple intercepts.
- $b$ is vector of intercepts; If we have 4 logistic function in the first layer, it would be like below <br>
$
b^{[1]} = 
\displaystyle
\begin{bmatrix}
    b_1^{[1]}\\
    b_2^{[1]}\\
    b_3^{[1]}\\
    b_4^{[1]}\\
\end{bmatrix}
$
<br><br>
For calculating $Z$ we have
$$
Z^{[1]} = 
\begin{bmatrix}
    \dots & w_1^{[1]T} & \dots\\
    \dots & w_2^{[1]T} & \dots\\
    \dots & w_3^{[1]T} & \dots\\
    \dots & w_4^{[1]T} & \dots\\
\end{bmatrix}_{4, n}
\begin{bmatrix}
    \displaystyle
    x^{(1)} & x^{(2)} & \dots & x^{(m)}
\end{bmatrix}_{n_x \times m} + 
\begin{bmatrix}
    b_1^{[1]}\\
    b_2^{[1]}\\
    b_3^{[1]}\\
    b_4^{[1]}\\
\end{bmatrix}
$$
The result will be
$$
Z^{[1]} = 
\begin{bmatrix}
    \displaystyle
    W_1^{[1]T}.x^{(1)} + b_1^{[1]}\\
    W_2^{[1]T}.x^{(2)} + b_2^{[1]}\\
    W_3^{[1]T}.x^{(3)} + b_3^{[1]}\\
    W_4^{[1]T}.x^{(4)} + b_4^{[1]}\\
\end{bmatrix}
=
\begin{bmatrix}
    \displaystyle
    Z_1^{[1]}\\
    Z_2^{[1]}\\
    Z_3^{[1]}\\
    Z_4^{[1]}\\
\end{bmatrix}_{4, m}
$$

Applying the activation function for the first layer containing 4 perceptrons we have
$$
a^{[1]} = 
\begin{bmatrix}
    \displaystyle
    a_1^{[1]}\\
    a_2^{[1]}\\
    a_3^{[1]}\\
    a_4^{[1]}\\
\end{bmatrix}
= G(Z^{[1]})
$$

## Activation functions
- Sigmoid
$$
    \begin{equation}
        a = g(z) = \displaystyle\frac{1}{1 + e^{-z}} \\
    \end{equation}
$$
<img src="https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg" width="300px" height="300px"/>

    - Derivative
    $$
        \begin{equation}
            g'(z) = \frac{dg(z)}{dz} = \frac{1}{1 + e^{-z}}(1 - \frac{1}{1 + e^{-z}}) \\ 
            g'(z) = g(z)(1 - g(z)) = a(1 - a)
        \end{equation}
    $$
    Calculations :
    $$
        \begin{equation}
            \frac{d}{dz}g(z) = \frac{0 - (-e^{-z})}{(1 + e^{-z})^2} = \frac{e^{-z}}{(1 + e^{-z})^2} = \\
                \frac{e^{-z} + 1 - 1}{1 + e^{-z}} \times \frac{1}{1 + e^{-z}} = \\
                (1 - \frac{1}{1 + e^{-z}})\frac{1}{1+e^{-z}}
        \end{equation}
    $$
    <br><br>

- tanh
$$
    \begin{equation}
        a = g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
    \end{equation}
$$
<img src="https://upload.wikimedia.org/wikipedia/commons/7/76/Sinh_cosh_tanh.svg" width="300px" height="300px"/>

    - Derivative
    $$
        \begin{equation}
        g'(z) = \frac{d}{dz}tanh(z) = 1 - (tanh(z))^2\\
        g'(z) = 1 - a^2
        \end{equation}
    $$
    Calculations:
    $$
        \begin{equation}
            \frac{d}{dz} tanh(z) = \frac{(e^z+e^{-z}) - (e^z-e^{-z})}{(e^z+e^{-z})^2} = \\
            1 - \frac{(e^z-e^{-z})}{e^z+e^{-z}} = 1 - (tanh(z))^2
        \end{equation}
    $$

<br><br>
- ReLU
$$
    a = max(0, z)
$$
<img src="https://upload.wikimedia.org/wikipedia/commons/4/42/ReLU_and_GELU.svg" width="300px" height="300px"/>

    - Derivative
    $$
    g'(z) = 
        \begin{equation}
            \begin{cases}
            0 & z < 0\\
            1 & z \ge 0
            \end{cases}
        \end{equation}
    $$
    
<br><br>
- leaky ReLU
$$
    a = max(0.01z, z)
$$

    - Derivative
    
    $$
    g'(z) = 
        \begin{equation}
            \begin{cases}
            0.01 & z < 0\\
            1 & z \ge 0
            \end{cases}
        \end{equation}
    $$
    

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Implement L-Layered Neural Network functions

### Initialization

In [2]:
def initialize_parameters(n_x, n_h, n_y):
    """ 
    arguments:
        n_x is the size of input layer
        n_h is the size of hidden layers
        n_y is the size of output layer
    returns:
        W1 is a matrix of weights for the first layer of the NN, shape=(n_h, n_x)
        b1 is a vector of intercepts for the first layer of NN, shape=(n_h, 1)
        W2 is a matrix of weights for the second layer of the NN, shape=(n_y, n_h)
        b2 is a vector of intercepts for the second layer of the NN, shape=(n_y, 1)
    """
    
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    
    params = {"W1":W1,
              "b1":b1,
              "W2":W2,
              "b2":b2}
    return params

In [3]:
# Test function
params = initialize_parameters(4, 5, 3)
for k, v in params.items():
    print(k, ":")
    print(v)

W1 :
[[-2.05635676e-02 -3.56257408e-03 -7.16440517e-03 -1.51355083e-02]
 [ 3.65237917e-03  1.12983080e-02  1.18881992e-02  1.11196012e-02]
 [ 1.65494760e-03 -1.51354204e-05  1.19140318e-03 -6.23067487e-03]
 [-1.28221001e-02 -3.90151885e-03  4.03808821e-03  2.24828296e-03]
 [-2.32608847e-03  1.69275231e-03  1.98555883e-03 -4.08751622e-03]]
b1 :
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]
W2 :
[[-0.01849866 -0.00469481 -0.01029114  0.01751907  0.01047541]
 [-0.01411954  0.00605058 -0.00198229 -0.00612071 -0.00748033]
 [ 0.00055922  0.00470869 -0.01111253 -0.00044706 -0.00909943]]
b2 :
[[0.]
 [0.]
 [0.]]


In [4]:
def initialize_parameters_deep(layer_dims):
    """
    arguments:
        layer_dims: A list that contains the size of each layer
        
    returns:
        dict: parameters for each layer
    """
    
    params = {}
    L = len(layer_dims) # number of layers
    
    for l in range(1, L):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
        
    return params

In [5]:
# Test function
params = initialize_parameters_deep([3, 5, 4, 1])
for k, v in params.items():
    print(k, ":")
    print(v)

W1 :
[[-0.0115657   0.00360858  0.00295114]
 [ 0.02942256 -0.00571661  0.00509639]
 [-0.01574587 -0.00864654  0.00670793]
 [ 0.00788212 -0.00356852 -0.00956044]
 [ 0.01647727 -0.00797142 -0.00980176]]
b1 :
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]
W2 :
[[ 7.31405629e-03 -2.98410968e-03  9.23645827e-05  2.11270800e-03
   1.43781742e-02]
 [-1.16380391e-02 -1.05307857e-02 -9.46954877e-03  5.07818544e-03
  -5.97416338e-03]
 [-6.67101356e-03 -1.82783248e-03 -9.49261239e-03 -2.01604609e-02
  -3.34573958e-03]
 [ 3.86915976e-03 -4.96848409e-03 -2.98154004e-03  1.24279286e-02
   5.13886513e-03]]
b2 :
[[0.]
 [0.]
 [0.]
 [0.]]
W3 :
[[-0.00921872  0.01960317 -0.00783147 -0.00435459]]
b3 :
[[0.]]


### Forward Propagation

#### Linear forward
Here we need to calculate $Z^{[l]} = W^{[l]}.A^{[l-1]} + b^{[l]}$

In [6]:
def linear_forward(A, W, b):
    """
    arguments:
        A: activations from previous layer or inputs, shape=(size of l-1, m)
        W: weights from layer l, shape=(size of layer l, size of layer l-1)
        b: intercepts from layer l, shape(size of layer l, 1)
        
    returns:
        Z: forward calculation
        cache: The same arguments useful for backward propagation
    """
    
    Z = np.matmul(W, A) + b
    cache = (A, W, b)
    
    return Z, cache

In [7]:
X = np.random.randn(3, 4)
Y = np.array([[1, 0, 1, 0]])
Z, cache = linear_forward(X, params['W1'], params['b1'])
print(Z)
print(cache[0])
print(cache[1])
print(cache[2])

[[ 0.01565665  0.00912571  0.00129837  0.02729514]
 [-0.0448115  -0.01547954  0.01633857 -0.06643729]
 [ 0.0219114   0.00930118 -0.00285427  0.02995895]
 [-0.00779827 -0.0105304  -0.0120543  -0.01986947]
 [-0.01994317 -0.016868   -0.01136909 -0.04092758]]
[[-1.48263878 -0.55166167  0.43448755 -2.16660949]
 [-0.11604555  0.33387478  0.61643479  0.54842977]
 [-0.36336679  0.52201538  1.38897524  0.08733081]]
[[-0.0115657   0.00360858  0.00295114]
 [ 0.02942256 -0.00571661  0.00509639]
 [-0.01574587 -0.00864654  0.00670793]
 [ 0.00788212 -0.00356852 -0.00956044]
 [ 0.01647727 -0.00797142 -0.00980176]]
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]


In [8]:
def sigmoid(Z):
    """
    arguments:
        Z: a matrix calculated in "linear_forward" step
        
    returns:
        A: activation function of sigmoid(Z)
        cache: The same arguments useful for backward propagation
    """
     
    A = 1 / (1 + np.exp(-Z))
    cache = Z
    
    return A, cache

In [9]:
def relu(Z):
    """
    arguments:
        Z: a matrix calculated in "linear_forward" step
        
    returns:
        A: activation function of relu(Z)
        cache: The same arguments useful for backward propagation
    """
     
    A = np.maximum(0, Z)
    cache = Z
    
    return A, cache

In [10]:
def linear_activation_forward(A_prev, W, b, activation):
    """
    arguments:
        A: activations from previous layer or inputs, shape=(size of l-1, m)
        W: weights from layer l, shape=(size of layer l, size of layer l-1)
        b: intercepts from layer l, shape(size of layer l, 1)
        activation: A string that specifies what type of activiation function to use (sigmoid or ReLU)
    
    returns:
        A: activation calculated
        cache: to be used for backward propagation
    """
    
    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    
    elif activation == 'relu':
        A, activation_cache = relu(Z)
    
    cache = (linear_cache, activation_cache)
    
    return A, cache

In [11]:
A, cache = linear_activation_forward(X, params["W1"], params["b1"], activation='relu')
print(A)
print(cache[0])
print(cache[1])

[[0.01565665 0.00912571 0.00129837 0.02729514]
 [0.         0.         0.01633857 0.        ]
 [0.0219114  0.00930118 0.         0.02995895]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]]
(array([[-1.48263878, -0.55166167,  0.43448755, -2.16660949],
       [-0.11604555,  0.33387478,  0.61643479,  0.54842977],
       [-0.36336679,  0.52201538,  1.38897524,  0.08733081]]), array([[-0.0115657 ,  0.00360858,  0.00295114],
       [ 0.02942256, -0.00571661,  0.00509639],
       [-0.01574587, -0.00864654,  0.00670793],
       [ 0.00788212, -0.00356852, -0.00956044],
       [ 0.01647727, -0.00797142, -0.00980176]]), array([[0.],
       [0.],
       [0.],
       [0.],
       [0.]]))
[[ 0.01565665  0.00912571  0.00129837  0.02729514]
 [-0.0448115  -0.01547954  0.01633857 -0.06643729]
 [ 0.0219114   0.00930118 -0.00285427  0.02995895]
 [-0.00779827 -0.0105304  -0.0120543  -0.01986947]
 [-0.01994317 -0.016868   -0.01136909 -0.04092758]]


In [12]:
def L_model_forward(X, parameters):
    """
    arguments:
        X: numpy array, shape=(input size, input examples i.e m)
        parameters: output of initialize_parameters_deep
        
    returns:
        AL: activation value from the output layer
        caches: Caches from calling Linear activation forward, the size is L since there are L layers
    """
    
    L = len(parameters) // 2
    caches = []
    A_prev = X
    
    # Do forward propagation with relu activation function for layers from 1 to l-1
    for l in range(1, L):
        Wl = parameters['W' + str(l)]
        bl = parameters['b' + str(l)]
        A, cache = linear_activation_forward(A_prev, Wl, bl, 'relu')
        caches.append(cache)
        A_prev = A
        
    # Do forward propagation with sigmoid activation function for layer L
    Wl = parameters['W' + str(L)]
    bl = parameters['b' + str(L)]
    A, cache = linear_activation_forward(A_prev, Wl, bl, 'sigmoid')
    caches.append(cache)
    return A, caches

In [13]:
AL, caches = L_model_forward(X, params)
print(AL)
print('\n')
print(caches)

[[0.49999973 0.49999984 0.5        0.49999952]]


[((array([[-1.48263878, -0.55166167,  0.43448755, -2.16660949],
       [-0.11604555,  0.33387478,  0.61643479,  0.54842977],
       [-0.36336679,  0.52201538,  1.38897524,  0.08733081]]), array([[-0.0115657 ,  0.00360858,  0.00295114],
       [ 0.02942256, -0.00571661,  0.00509639],
       [-0.01574587, -0.00864654,  0.00670793],
       [ 0.00788212, -0.00356852, -0.00956044],
       [ 0.01647727, -0.00797142, -0.00980176]]), array([[0.],
       [0.],
       [0.],
       [0.],
       [0.]])), array([[ 0.01565665,  0.00912571,  0.00129837,  0.02729514],
       [-0.0448115 , -0.01547954,  0.01633857, -0.06643729],
       [ 0.0219114 ,  0.00930118, -0.00285427,  0.02995895],
       [-0.00779827, -0.0105304 , -0.0120543 , -0.01986947],
       [-0.01994317, -0.016868  , -0.01136909, -0.04092758]])), ((array([[0.01565665, 0.00912571, 0.00129837, 0.02729514],
       [0.        , 0.        , 0.01633857, 0.        ],
       [0.0219114 , 0.009301

#### Compute Cost
Cost for sigmoid function is defined as
$$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right))$$

In [14]:
def compute_cost(AL, Y):
    """
    arguments:
        AL: nparray; the output of the neural network
        Y: array of labels
        
    returns:
        decimal number specifing the cost
    """
    m = Y.shape[-1]
    cost = (np.matmul(Y, np.log(AL).T) + np.matmul(1 - Y, np.log(1 - AL).T)) / -m
    return np.squeeze(cost)

In [15]:
# Test function
compute_cost(AL, Y)

array(0.69314699)

### Backward Propagation

In [16]:
def sigmoid_backward(dA, cache):
    """
    arguments:
        dA: post-activation gradient, of any shape
        cache: it is "Z" that we stored to use later in back prop
    
    returns:
        derivative of loss function with respect to Z
    """
    
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)
    
    return dZ

In [17]:
def relu_backward(dA, cache):
    """
    arguments:
        dA: post-activation gradient, of any shape
        cache: it is "Z" that we stored to use later in back prop
    
    returns:
        derivative of loss function with respect to Z
    """
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    
    return dZ

Let's say we have dZ, now we are to find:
$$ dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}$$
$$ db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}$$

In [18]:
def linear_backward(dZ, cache):
    """
    arguments:
        dZ: gradient of the cost with respect to the linear output i.e Z
        cache: the tupe (A_prev, W, b)
    
    returns:
        dW: gradient of the cost with respect to the weights i.e W
        db: gradient of the cost with respect to the intercepts i.e b
        dA_prev: gradient of the cost with respect to the activation function in the previous layer
    """
    
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = 1/m * np.matmul(dZ, A_prev.T)
    db = 1/m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.matmul(W.T, dZ)
    
    return dA_prev, dW, db

In [19]:
def linear_activation_backward(dA, cache, activation):
    """
    arguments:
        dA: post-activation gradient for current layer l
        cache: tuple contaning (linear_cache, activation_cache)
        activation: activation function go to be used in this layer
        
    returns:
        dW: gradient of the cost with respect to the weights i.e W
        db: gradient of the cost with respect to the intercepts i.e b
        dA_prev: gradient of the cost with respect to the activation function in the previous layer
    """
    
    linear_cache, activation_cache = cache
    
    if activation == 'relu':
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == 'sigmoid':
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        
    return dA_prev, dW, db

In [20]:
def L_model_backward(AL, Y, caches):
    """
    arguments:
        AL: predicted outputs from the neural network
        Y: correct values
        caches: cached values in each step of linear forward and linear activation forward
        
    returns:
        gradients that are --> dA_prev, dW, db for each layer
    """
    
    grads = {}
    Y = Y.reshape(AL.shape)
    m = AL.shape[1]
    L = len(caches)
    
    dAL = -np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)
    current_cache = caches[L - 1]
    dA_prev, dW, db = linear_activation_backward(dAL, current_cache, 'sigmoid')
    grads['dA' + str(L - 1)] = dA_prev
    grads['dW' + str(L)] = dW
    grads['db' + str(L)] = db
    
    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev, dW, db = linear_activation_backward(grads['dA' + str(l + 1)], current_cache, 'relu')
        grads['dA' + str(l)] = dA_prev
        grads['dW' + str(l + 1)] = dW
        grads['db' + str(l + 1)] = db
        
    return grads

In [22]:
# Test function
grads = L_model_backward(AL, Y, caches)
grads

{'dA2': array([[ 0.00460936,  0.00460936,  0.00460936,  0.00460936],
        [-0.00980159, -0.00980158, -0.00980158, -0.00980157],
        [ 0.00391574,  0.00391573,  0.00391573,  0.00391573],
        [ 0.00217729,  0.00217729,  0.00217729,  0.00217729]]),
 'dW3': array([[-4.83184693e-05,  0.00000000e+00,  0.00000000e+00,
         -2.98280767e-06]]),
 'db3': array([[-0.49999991]]),
 'dA1': array([[ 3.37131487e-05,  4.21374118e-05,  0.00000000e+00,
          4.21373848e-05],
        [-1.37548481e-05, -2.45726791e-05,  0.00000000e+00,
         -2.45726633e-05],
        [ 4.25741994e-07, -6.06594296e-06,  0.00000000e+00,
         -6.06593907e-06],
        [ 9.73824048e-06,  3.67974671e-05,  0.00000000e+00,
          3.67974435e-05],
        [ 6.62742404e-05,  7.74629952e-05,  0.00000000e+00,
          7.74629456e-05]]),
 'dW2': array([[6.00109934e-05, 0.00000000e+00, 7.04903916e-05, 0.00000000e+00,
         0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.000000

### Update Parameters
$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]}$$

In [23]:
def update_parameters(params, grads, learning_rate):
    """
    arguments
        params: is the parameters that we initialized
        grads: gradients that are calculated in "L_model_backward"
        learning_rate: a decimal illustrating how much of the gradient should effect params
    
    return:
        updated parameters
    """
    parameters = params.copy()
    L = len(parameters) // 2
    
    for l in range(1, L + 1):
        parameters['W' + str(l)] = params['W' + str(l)] - learning_rate * grads['dW' + str(l)]
        parameters['b' + str(l)] = params['b' + str(l)] - learning_rate * grads['db' + str(l)]
        
    return parameters

In [24]:
# Test function
update_parameters(params, grads, 0.01)

{'W1': array([[-0.01156529,  0.0036085 ,  0.00295111],
        [ 0.02942256, -0.00571661,  0.00509639],
        [-0.01574591, -0.00864652,  0.00670794],
        [ 0.00788212, -0.00356852, -0.00956044],
        [ 0.01647727, -0.00797142, -0.00980176]]),
 'b1': array([[-2.94969863e-07],
        [ 0.00000000e+00],
        [ 2.92653501e-08],
        [ 0.00000000e+00],
        [ 0.00000000e+00]]),
 'W2': array([[ 7.31345618e-03, -2.98410968e-03,  9.16596787e-05,
          2.11270800e-03,  1.43781742e-02],
        [-1.16380391e-02, -1.05307857e-02, -9.46954877e-03,
          5.07818544e-03, -5.97416338e-03],
        [-6.67101356e-03, -1.82783248e-03, -9.49261239e-03,
         -2.01604609e-02, -3.34573958e-03],
        [ 3.86896151e-03, -4.96848409e-03, -2.98175375e-03,
          1.24279286e-02,  5.13886513e-03]]),
 'b2': array([[-3.45702037e-05],
        [ 0.00000000e+00],
        [ 0.00000000e+00],
        [-1.08864587e-05]]),
 'W3': array([[-0.00921872,  0.01960317, -0.00783147, -0.0043545