## Problem 1

### Part 1

Consider a typical neural network setup, where each layer's output, $ h^l $, is computed using a weight matrix $ W^l $, a bias $ b^l $, and an activation function $ f^l $. 

Given that the output of layer $ l $ is defined by:
$$ h^l = f^l(W^l h^{l-1} + b^l) $$
where $ f^l $ is the activation function applied element-wise.

To show that:
$$ \frac{\partial h^l}{\partial b^l} = \left(f^l\right)^{\prime}(W^l h^{l-1} + b^l) $$
we can use the chain rule:

1. Start by noting the function composition in $ h^l $:
   $$ h^l = f^l(z^l) $$
   where $ z^l = W^l h^{l-1} + b^l $.

2. The derivative of $ h^l $ with respect to $ b^l $ is:
   $$ \frac{\partial h^l}{\partial b^l} = \frac{\partial f^l}{\partial z^l} \cdot \frac{\partial z^l}{\partial b^l} $$
   
   Applying the chain rule, we have:
$$\frac{\partial h^l}{\partial b^l} = \left(f^l\right)^{\prime}\left(W^l h^{l-1} + b^l\right) \cdot \frac{\partial (W^l h^{l-1} + b^l)}{\partial b^l}$$

3. Since $ z^l $ is linear in $ b^l $, the derivative $ \frac{\partial z^l}{\partial b^l} $ is straightforwardly 1 (keeping in mind that $ b^l $ is added element-wise to $ W^l h^{l-1} $).

4. Therefore, $ \frac{\partial h^l}{\partial b^l} $ simplifies to:
   $$ \frac{\partial h^l}{\partial b^l} = \left(f^l\right)^{\prime}(z^l) $$
   which is exactly:
   $$ \left(f^l\right)^{\prime}(W^l h^{l-1} + b^l) $$




### Part 2

Compute the gradient of the loss function $ L $ with respect to the biases $ b^l $ using the chain rule in the context of backpropagation. 

1. Recognize that the gradient of the loss $ L $ with respect to the biases $ b^l $ can be propagated from the output backwards using the chain rule:
   $$ \frac{\partial L}{\partial b^l} = \frac{\partial L}{\partial h^l} \cdot \frac{\partial h^l}{\partial b^l} $$

2. From Part 1, we know:
   $$ \frac{\partial h^l}{\partial b^l} = \left(f^l\right)^{\prime}(W^l h^{l-1} + b^l) $$

3. The derivative $ \frac{\partial L}{\partial h^l} $ is calculated during backpropagation as the "upstream gradient" from later layers, often denoted as $ \delta^l $. It represents how the change in $ h^l $ impacts the change in loss.

4. Combining these, the formula for $ \frac{\partial L}{\partial b^l} $ in terms of the gradients from later stages in the network is:
   $$ \frac{\partial L}{\partial b^l} = \delta^l \odot \left(f^l\right)^{\prime}(W^l h^{l-1} + b^l) $$
   where $ \odot $ denotes the element-wise product.

By backpropagating the gradient from the output layer back to the inputs using these steps, we ensure that the biases $ b^l $ are updated in a way that minimizes the loss $ L $, thereby improving the model's performance with each training iteration.


## Problem 2

_Back ground functions_

In [146]:
#implement the activation function
def act_func(x, type):
    """
    Compute the activation function for a given type.

    Parameters:
    - x (numpy.ndarray): Input data.
    - type (str): Type of activation function ('sigmoid' or 'ReLU').

    Returns:
    - numpy.ndarray: Output of the activation function.
    """
    if type == "sigmoid":
        # Compute the sigmoid activation function: 1 / (1 + exp(-x))
        return 1 / (1 + np.exp(-x))
    if type == "ReLU":
        # Compute the ReLU activation function: max(0, x)
        return np.maximum(0, x)

    
#get H's [sigma(WX + b)] and the Z's [WX + b]
def feed_forward(X, nl, act:list, parameters: dict):
    """
    Perform the feedforward pass through the neural network.

    Parameters:
    - X (numpy.ndarray): Input data.
    - nl (int): Number of layers in the neural network.
    - act (list): List of activation functions for each layer.
    - parameters (dict): Dictionary containing the parameters of the neural network.

    Returns:
    - forward: Dictionary containing the forward pass computations (ZL and HL)
    """
    p = parameters
    forward = {}
    forward["H0"] = X  # Input layer is the initial value of H0
    L = nl 
    for l in range(1, L + 1):
        # Calculate the linear transformation Zl = Wl * Hl-1 + Bl
        forward["Z" + str(l)] = np.dot(p["W" + str(l)], forward["H" + str(l - 1)]) \
                               + p["B" + str(l)]
        # Apply the activation function to compute Hl
        forward["H" + str(l)] = act_func(forward["Z" + str(l)], act[l-1])
    return forward

### Part 1a

In [147]:
def MSE(y, y_pred, lambd: float, parameters: list):
    """
    Calculate the Mean Squared Error (MSE) loss function 
    with L2 penalty (Ridge Regression).

    Parameters:
    - y (numpy.ndarray): The true target values.
    - y_pred (numpy.ndarray): The predicted values.
    - lambd (float): The regularization parameter for the L2 penalty.
    - parameters (dict): A dictionary containing the parameters of the neural network.

    Returns:
    - MSE (float): The MSE loss with L2 penalty.
    """
    # Calculate L2 penalty (sum of squares of all parameters)
    # For each layer, square the parameters and then sum their sums
    # parameters.values() returns the value for each key in the dict
    L2_penalty = np.sum([np.sum(param**2) for param in parameters.values()])
    # Compute MSE with L2 penalty
    MSE = (1/2) * np.linalg.norm(y - y_pred)**2 + lambd * L2_penalty
    return MSE

def Cross_entropy(y, y_pred, lambd: float, parameters: list):
    """
    TO DO: Compute the cross entropy loss
    """

### Part 1b

__To do__... math derivation

In [148]:
def MSE_grad_wrt_y_pred(y,y_pred):
    """
    Calculate the gradient of the Mean Squared Error (MSE) loss function 
    with respect to the predicted value (last output of neural network)

    Parameters:
    - y (numpy.ndarray): The true target values.
    - y_pred (numpy.ndarray): The predicted values.

    Returns:
    - numpy.ndarray: The gradient of the MSE loss with respect to the predicted values.
    """
    mse_grad = (y_pred - y)
    return mse_grad


def Cross_entrpy_grad_wrt_y_pred(y,y_pred):
    """
    Calculate the gradient of the cross entropy loss with L2 penalty 
    """
    

### Part 2a

In [149]:
def act_derivative(type, Zl):
    """
    Compute the derivative of the activation function with respect to its input.
    
    Parameters:
    - activation (str): Type of activation function ('sigmoid' or 'ReLU').
    - Zl (numpy.ndarray): Input array to the activation function (pre-activation values).

    Returns:
    - numpy.ndarray: Derivative of the activation function evaluated at Zl.
    """
    if type == "sigmoid":
        # Compute the sigmoid of Zl
        sigmoid = 1 / (1 + np.exp(-Zl))
        # Compute the derivative of the sigmoid function
        derivative = sigmoid * (1 - sigmoid)
        return derivative
    
    if type == "ReLU":
        # Compute derivative: 1 when input > 0, 0 otherwise
        derivative = np.where(Zl > 0, 1, 0)
        return derivative

### Part 3a

In [211]:
#mine updated with doc string and comments
def backward_propagation(nl, nh, Y, lambd, parameters, forward, act:list):
    """
    Perform backward propagation for a neural network to compute gradients 
    for all parameters.

    Args:
    nl (int): Number of layers in the neural network.
    nh (list): an int vec of length nl - that has the number of neurons in each layer
    Y (array): Observed values we try to predict
    lambd (float): Regularization parameter.
    parameters (dict): Dict containing parameters 'W' (weights) and 'B' (biases)
    forward (dict): Dictionary containing the forward pass computations (ZL and HL).
    act (list):a str vec of length nl - the activation function used in each layer 

    Returns:
    g (dict): A dictionary containing gradients of loss with respect to each parameter.
    """
    
    p = parameters
    f = forward
    L = nl
    g = {}
    
    HL = f["H"+str(L)] #Final prediction sigma(W^{L}H^{L-1} + B^{L})
    ZL = f["Z"+str(L)] #Last layer's Z (W^{L}H^{L-1} + B^{L})
    
    # Compute gradient of loss wrt last layer Z (dL_dHL*dHL_dZL)
    g["dLoss_dZ"+str(L)] = MSE_grad_wrt_y_pred(Y, HL) * act_derivative(act[-1], ZL) 
    
    # Deriative of last layer Z wrt its weights & biases (dZL_dWL, dZL_dBL)
    g["dZ"+str(L)+"_dW"+str(L)] = f["H"+str(L-1)] #just H^{L-1}
    g["dZ"+str(L)+"_dB"+str(L)] = np.ones((nh[-1], 1))#vec of 1's row's of last layer neurons
    
    # Calculate derivative with respect to weights and biases for the last layer
    # dLoss_dWL = dL_dHL*dHL_dZL*dZL_dWL, dLoss_dBL = dL_dHL*dHL_dZL*dZL_dBL
    g["dLoss_dW"+str(L)] = np.dot(g["dLoss_dZ"+str(L)], 
                                 g["dZ"+str(L)+"_dW"+str(L)].T) \
                           + lambd * p["W"+str(L)]  # Include regularization term
    g["dLoss_dB"+str(L)] = np.dot(g["dLoss_dZ"+str(L)],
                                 g["dZ"+str(L)+"_dB"+str(L)]) \
                           + lambd * p["B"+str(L)]  # Include regularization term
    
    #from L-1 to first layer, which is 1
    for l in reversed(range(1, L)):
        # Calculate gradient of Z in layer l+1 wrt Z in layer l
        # dZl+1_dZl = dZl+1_dHl*dHl_dZl 
        g["dZ_"+str(l+1)+"_dZ"+str(l)] = np.dot(p["W"+str(l+1)].T, 
                                                g["dLoss_dZ"+str(l+1)])
        
        # Propagate the loss gradient back from layer l+1 to layer l
        #dLoss_dZl = dLoss_dZl+1*dZl+1_dHl*dHl_dZl
        g["dLoss_dZ"+str(l)] = g["dZ_"+str(l+1)+"_dZ"+str(l)] * \
                                act_derivative(act[l], f["Z"+str(l)])
        
        # Deriative of Z wrt its weights & biases (for each layer)
        g["dZ"+str(l)+"_dW"+str(l)] = f["H" + str(l-1)]
        g["dZ"+str(l)+"_dB"+str(l)] = np.ones((nh[l], 1))

        # Calculate derivatives with respect to weights and biases in layer l
        g["dLoss_dW"+str(l)] = np.dot(g["dLoss_dZ"+str(l)], 
                                     g["dZ"+str(l)+ "_dW"+str(l)].T) \
                               + lambd * p["W" + str(l)]
        g["dLoss_dB"+str(l)] = np.dot(g["dLoss_dZ"+str(l)].T,
                                      g["dZ"+str(l)+ "_dB"+str(l)]) \
                                + lambd * p["B" + str(l)]
                                
        
                              

    return g


### Part 3b

_loading the neccesities from pset 8_

In [212]:
nl = 5
X = np.array([0.1, -0.2, 0.3, -0.4, 0.5]).reshape(-1, 1)
nh = [X.shape[0], 5, 4, 3, 5, 1]
act = ["ReLU", "sigmoid", "ReLU", "sigmoid", "ReLU"] 
Y = 3

def parameter(nl):
    """
    Parameters:
    - nl (int): Number of layers in the neural network.
    
    Returns:
    - parameters (dict): Dictionary containing all the weights and biases
    """
    parameters = {}
    for l in range(1, nl + 1): #1 to 5
        # Select the values all but the last one (which is the bias)
        parameters["W" + str(l)] = pd.read_csv(f'data/layer{l}.csv').values[:, :-1]
        # Select the values the last one only (the bias), and make it a column vector
        parameters["B" + str(l)] = pd.read_csv(f'data/layer{l}.csv').values[:, -1].reshape(-1,1)
    return parameters


_computing the back prop_

In [218]:
parameters = parameter(nl)
forward = feed_forward(X, nl, act, parameters)
lambd = 0.05 #random since now was given

grads = backward_propagation(nl, nh, Y, lambd, parameters, forward, act)

print("Gradient (MSE) of the weights in respect to Weights of the first hidden layer:")
display(pd.DataFrame(grads['dLoss_dW1']))
print("Gradient (MSE)of the bias in respect to Bias of the first hidden layer:")
display(pd.DataFrame(grads['dLoss_dB1']))

print("Gradient (Cross Entropy) of the weights in respect to Weights of the first hidden layer:")
#enter display here
print("Gradient (Cross Entropy) of the weights in respect to Bias of the first hidden layer:")
#enter display here

Gradient (MSE) of the weights in respect to Weights of the first hidden layer:


Unnamed: 0,0,1,2,3,4
0,-0.003123,-0.004122,0.007588,-0.000263,0.004643
1,0.000908,0.002457,0.00192,-4.2e-05,0.003862
2,-0.004173,0.003682,-0.003091,0.0047,0.000397
3,0.007976,0.00288,-0.011075,0.004108,-0.009949
4,0.001649,-0.00153,0.005629,0.002964,0.003106


Gradient (MSE)of the bias in respect to Bias of the first hidden layer:


Unnamed: 0,0
0,-0.000224
1,-0.000722
2,-0.007297
3,-0.002334
4,0.002147


Gradient (Cross Entropy) of the weights in respect to Weights of the first hidden layer:
Gradient (Cross Entropy) of the weights in respect to Bias of the first hidden layer:


*Appendix, Finding the dimensions of each*

In [216]:
for i in reversed(range(1, nl  + 1)):
    print("shape of dLoss_dZ" +str(i) + ":", grads["dLoss_dZ"+str(i)].shape)
    print("shape of dLoss_dW" +str(i) + ":", grads["dLoss_dW"+str(i)].shape)
    print("shape of dLoss_dB" +str(i) + ":", grads["dLoss_dB"+str(i)].shape, "\n")

shape of dLoss_dZ5: (1, 1)
shape of dLoss_dW5: (1, 5)
shape of dLoss_dB5: (1, 1) 

shape of dLoss_dZ4: (5, 1)
shape of dLoss_dW4: (5, 3)
shape of dLoss_dB4: (5, 1) 

shape of dLoss_dZ3: (3, 1)
shape of dLoss_dW3: (3, 4)
shape of dLoss_dB3: (3, 1) 

shape of dLoss_dZ2: (4, 1)
shape of dLoss_dW2: (4, 5)
shape of dLoss_dB2: (4, 1) 

shape of dLoss_dZ1: (5, 1)
shape of dLoss_dW1: (5, 5)
shape of dLoss_dB1: (5, 1) 



### Part 4a

blah blah blah

### Part 4b

In [None]:
def stochatic_gradient_descent(parameters, grads, data, batch_size, p, lambd): #p is the patience parameter
    

In [None]:
#Extra, helper cells

In [None]:
def parameter(nl, nh):
    """
    Initialize the parameters (weights and biases) of a neural network.

    Parameters:
    - nl (int): Number of layers in the neural network.
    - nh (list): List containing the number of neurons in each layer.

    Returns:
    - dict: Dictionary containing the initialized parameters.
    """
    parameters = {}
    for n in range(1, nl + 1):
        # Initialize weights randomly from a uniform distribution
        parameters["W" + str(n)] = np.random.rand(nh[n], nh[n-1])
        # Initialize biases as zeros
        parameters["B" + str(n)] = np.zeros((nh[n], 1))
    return parameters

In [117]:
nl = 5
X = np.array([0.1, -0.2, 0.3, -0.4, 0.5]).reshape(-1, 1)
nh = [X.shape[0], 5, 4, 3, 5, 1]
act = ["ReLU", "sigmoid", "ReLU", "sigmoid", "ReLU"] 
Y = 3

In [6]:
sigma = np.array([1, 2, 3, 0, 1 , -1, -2 , -34])
np.where(sigma < 0, 0, 1)

array([1, 1, 1, 1, 1, 0, 0, 0])

In [15]:
parameters = np.array([1, 1, 3, 4])

#np.sum(parameters**2)

#np.square(parameters)
#0.05*np.sum(parameters**2)
0.05*np.sum(np.square(parameters))

1.35

In [22]:
parameters = [
    np.array([[1, 1, 3, 4],  # First matrix
              [1, 2, 3, 4]]),
    np.array([[2, 1, 3, 4],  # Second matrix
              [1, 5, 3, 4]]),
    np.array([1, 2])         # Vector
]

np.sum([np.sum(param) for param in parameters])
#for param in parameters:
#    print(np.sum(param))

45

In [18]:
np.sum(parameters)

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


ValueError: could not broadcast input array from shape (2,4) into shape (2,)

In [84]:
for parm in parameters.values():
    print(parm.shape)

(5, 5)
(5, 1)
(4, 5)
(4, 1)
(3, 4)
(3, 1)
(5, 3)
(5, 1)
(1, 5)
(1, 1)


In [85]:
for parm in parameters:
    print(parm)

W1
B1
W2
B2
W3
B3
W4
B4
W5
B5


In [89]:
nh[-1]

1

In [95]:
np.ones(nh[-2]).reshape(-1,1).shape

(5, 1)

In [99]:
for l in reversed(range(1, 5)):
    print(l)

4
3
2
1


In [97]:
a = [1, 2, 3, 4, 5]

In [105]:
np.ones((nh[-2], 1))

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])

In [108]:
np.ones((nh[-2], 1)).shape

(5, 1)

In [169]:
forward 

{'H0': array([[ 0.1],
        [-0.2],
        [ 0.3],
        [-0.4],
        [ 0.5]]),
 'Z1': array([[ 0.0976316 ],
        [ 0.02795802],
        [-0.22286044],
        [-0.24213525],
        [ 0.09218016]]),
 'H1': array([[0.0976316 ],
        [0.02795802],
        [0.        ],
        [0.        ],
        [0.09218016]]),
 'Z2': array([[ 0.05258633],
        [-0.05628207],
        [ 0.0357589 ],
        [-0.10550538]]),
 'H2': array([[0.51314355],
        [0.4859332 ],
        [0.50893877],
        [0.47364809]]),
 'Z3': array([[-0.03414519],
        [ 0.23866093],
        [ 0.03396447]]),
 'H3': array([[0.        ],
        [0.23866093],
        [0.03396447]]),
 'Z4': array([[ 0.02747301],
        [ 0.03831927],
        [ 0.09527121],
        [-0.0263907 ],
        [ 0.0336014 ]]),
 'H4': array([[0.50686782],
        [0.50957864],
        [0.5237998 ],
        [0.49340271],
        [0.50839956]]),
 'Z5': array([[0.30069756]]),
 'H5': array([[0.30069756]])}

In [189]:
print(comp)

{'dLoss_dZ5': array([[-2.69930244]]), 'dZ5_dW5': array([[0.50686782],
       [0.50957864],
       [0.5237998 ],
       [0.49340271],
       [0.50839956]]), 'dZ5_dB5': array([[1.]]), 'dLoss_W5': array([[-1.36685405, -1.37821948, -1.40785475, -1.32604112, -1.3688231 ]]), 'dLoss_B5': array([[-2.69136827]]), 'dZ_5_dZ4': array([[-0.07209804],
       [ 0.14644256],
       [-0.32604005],
       [-0.31322776],
       [-0.18900884]]), 'dLoss_dZ4': array([[-0.07209804],
       [ 0.14644256],
       [-0.32604005],
       [-0.        ],
       [-0.18900884]]), 'dZ4_dW4': array([[0.        ],
       [0.23866093],
       [0.03396447]]), 'dZ4_dB4': array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]]), 'dLoss_dW4': array([[ 1.08630584e-02, -2.34751526e-02, -5.39637656e-03],
       [ 2.37754764e-03,  3.64073495e-02,  2.13050052e-03],
       [-3.54973215e-03, -8.00294807e-02, -1.17496709e-02],
       [ 3.05363177e-03,  5.52675816e-06,  5.89043498e-03],
       [-4.67048816e-03, -4.47373189e-

In [192]:
print(apre)
{'dLoss_dZ5': array([[-2.69930244]]), 'dZ5_dW5': array([[0.50686782],
       [0.50957864],
       [0.5237998 ],
       [0.49340271],
       [0.50839956]]), 'dZ5_dB5': array([[1.]]), 'dLoss_W5': array([[-1.36685405, -1.37821948, -1.40785475, -1.32604112, -1.3688231 ]]), 'dLoss_B5': array([[-2.69136827]]), 'dZ_5_dZ4': array([[-0.07209804],
       [ 0.14644256],
       [-0.32604005],
       [-0.31322776],
       [-0.18900884]]), 'dLoss_dZ4': array([[-0.07209804],
       [ 0.14644256],
       [-0.32604005],
       [-0.        ],
       [-0.18900884]]), 'dZ4_dW4': array([[0.        ],
       [0.23866093],
       [0.03396447]]), 'dZ4_dB4': array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]]), 'dLoss_dW4': array([[ 1.08630584e-02, -2.34751526e-02, -5.39637656e-03],
       [ 2.37754764e-03,  3.64073495e-02,  2.13050052e-03],
       [-3.54973215e-03, -8.00294807e-02, -1.17496709e-02],
       [ 3.05363177e-03,  5.52675816e-06,  5.89043498e-03],
       [-4.67048816e-03, -4.47373189e-02, -1.40374193e-02]]), 'dLoss_dB4': array([[-0.43773464],
       [-0.43903962],
       [-0.43538887],
       [-0.44222529],
       [-0.43885428]])

{'dLoss_dZ5': array([[-2.69930244]]), 'dZ5_dW5': array([[0.50686782],
       [0.50957864],
       [0.5237998 ],
       [0.49340271],
       [0.50839956]]), 'dZ5_dB5': array([[1.]]), 'dLoss_W5': array([[-1.36685405, -1.37821948, -1.40785475, -1.32604112, -1.3688231 ]]), 'dLoss_B5': array([[-2.69136827]]), 'dZ_5_dZ4': array([[-0.07209804],
       [ 0.14644256],
       [-0.32604005],
       [-0.31322776],
       [-0.18900884]]), 'dLoss_dZ4': array([[-0.07209804],
       [ 0.14644256],
       [-0.32604005],
       [-0.        ],
       [-0.18900884]]), 'dZ4_dW4': array([[0.        ],
       [0.23866093],
       [0.03396447]]), 'dZ4_dB4': array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]]), 'dLoss_dW4': array([[ 1.08630584e-02, -2.34751526e-02, -5.39637656e-03],
       [ 2.37754764e-03,  3.64073495e-02,  2.13050052e-03],
       [-3.54973215e-03, -8.00294807e-02, -1.17496709e-02],
       [ 3.05363177e-03,  5.52675816e-06,  5.89043498e-03],
       [-4.67048816e-03, -4.47373189e-

In [175]:
parameters['W5'].shape

(1, 5)

In [165]:
print(act_derivative("sigmoid", forward["Z2"]), act_derivative("sigmoid", forward["Z2"]).shape)

[[0.24982725]
 [0.24980213]
 [0.2499201 ]
 [0.24930558]] (4, 1)


In [127]:
import pandas as pd
pd.read_csv(f'data/layer1.csv').values[:, -1].shape

array([-0.00561287, -0.01557955, -0.14707524, -0.04781501,  0.04179416])

In [109]:
#mine

def backward_propagation(nl, nh, Y, lambd, parameters, forward, type:list):
    
    p = parameters
    f = forward
    L = nl
    g = {}
    
    HL = f["H"+str(L)] #Final prediction
    ZL = f["Z"+str(L)] #Last layer's Z (W^{L}H^{L-1} + B^{L})
    g["dLoss_Z"+str(L)]= MSE_grad_wrt_y_pred(Y,HL)*act_derivative(type[-1], ZL)
    g["dZ"+str(L)+"_W"+str(L)] = f["H"+str(L-1)]
    g["dZ"+str(L)+"_B"+str(L)]= np.ones((nh[-1],1))
    
    
    g["dLoss_W"+str(L)]=np.dot(g["dLoss_Z"+str(L)], 
                               g["dZ"+str(L)+"_W"+str(L)].T) \
                             + lambd*(p["W"+str(L)])
    g["dLoss_B"+str(L)]=np.dot(g["dLoss_Z"+str(L)],
                               g["dZ"+str(L)+"_B"+str(L)]) \
                             + lambd*(p["B"+str(L)])
    
    
    for l in reversed(range(1, L)):
        g["dZ_"+str(l+1)+"_Z"+str(l)]= np.dot(act_derivative(type[l],f["Z"+str(l)]),
                                              p["W"+str(l+1)].T)
        
        g["dLoss_Z"+str(l)]= np.dot(g["dLoss_Z"+str(l+1)],g["dZ_"+str(l+1)+"_Z"+str(l)])
        
        
        g["dZ"+str(l)+"_W"+str(l)] = f["H" + str(l-1)]
        g["dZ"+str(l)+"_B"+str(l)]= np.ones((nh[l],1))

        
        g["dLoss_W"+str(l)]= np.dot(g["dLoss_Z"+str(l)],g["dZ" +str(l)+ "_W" +str(l)].T) \
                                    + lambd*(p["W" + str(l)])
        g["dLoss_B"+str(l)] = np.dot(g["dLoss_Z"+str(l)],g["dZ" +str(l)+ "_B"+str(l)]) \
                                    + lambd*(p["B" + str(l)])
    return g