### Helper functions for a multi-layer neural network for binary classification
1. **Initialize parameters** (adjustable for Xavier and He initialization) -- initializes the layer parameters W and b and stores them in a dictionary
2. **Initialize Adam** (Adaptive Moment Estimation) gradient descent -- initializes two dictionaries of the same shape as the parameters W and b to hold the moment estimates during gradient descent. The first moment term is an exponentially decaying average of past gradients (i.e., like momentum) and the second moment term is an exponentially decaying average of past **squared** gradients (i.e., like RMSprop) 
3. **Linear forward propagation for one layer** -- computes the linear term Z for a layer based on the activation of the previous layer and the parameters W and b 
4. **Activation functions** -- encompasses functions to compute Sigmoid, tanh, ReLU, and leaky ReLU activations
5. **Activation for one layer** -- integrates the linear forward function and the activation functions to compute the activation for one layer based on the activation of the previous layer, the parameters W and b, as well as the choice of activation function. Also implements the option for dropout to prevent overfitting 
6. **Forward propagation** -- computes one forward pass through the network based on the feature vector input X. Adjustable for Sigmoid, tanh, ReLU and leaky ReLU activation for hidden layers
7. **Compute cost** -- computes the cross-entropy loss for each sample and averages it over all samples in the set/batch. Also implements L2 regularization (i.e., adding the sum of squared W parameters to the cost function to force a smaller range of W values and thus reduce overfitting) 
8. **Linear backward** -- for a single layer, computes the partial derivative terms dW and db, and the partial derivative of the previous layer activation dA_prev. dW is calculated with the L2 regularization term lambda. Also implements dropout (i.e., zeroing the dA_prev terms of those neurons that were "dropped out" on the forward pass and rescaling the remaining terms by 1/keep_prob)
9. **Activation function derivatives** -- computes partial derivatives for Sigmoid, tanh, ReLU and leaky ReLU activation functions 
10. **Linear activation backward** -- integrates the computation of the partial derivative of the cost function with respect to the activation in a given layer. Can handle a choice of Sigmoid, tanh, ReLU and leaky ReLU activation functions 
11. **Backward propagation** -- Runs backward propagation through all layers 
12. **Update parameters** -- update parameters W and b using Adam 
13. **Predict labels** -- Predict labels based on input vector X and learned parameters W and b 
14. **Calculate accuracy** -- Calculate accuracy of predictions against label vector (i.e., for training and dev set) 


*Functions are mostly generalizations of code introduced in Andrew Ng's Deep Learning Specialization

In [4]:
import numpy as np

#### 1. Initialize parameters

In [5]:
def initialize_parameters(layer_dims, init_tuning=1):
    """
    Returns a dictionary of randomly initialized parameters W and b
    Arguments: 
        layer_dims  --  array containing the dimensions of each layer [l0, l1, ..., ln],
                        where l0 is the size of the input layer
        init_tuning --  parameter that adjusts the initialization of W for each 
                        layer (Xavier initialization: 0.5, He initialization: 1)
                    
    Returns:
        parameters -- dict consisting of randomly initialized parameters W and b                 
    """ 
    parameters = {}
    depth = len(layer_dims)
    
    for l in range(1, depth): 
        n = layer_dims[l]
        n_prev = layer_dims[l-1]
        
        # balance: W's decrease on average with increasing size of previous layer to limit the 
        # range of Z 
        balance = np.sqrt(2 * init_tuning / n_prev) 
        
        parameters["W" + str(l)] = np.random.randn(n, n_prev) * balance
        parameters["b" + str(l)] = np.random.randn(n, 1)
        
    return parameters
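
To make the effect of `init_tuning` concrete, the following standalone sketch reuses the `balance` formula from above and checks the resulting variance of W empirically. The layer sizes, the helper name `init_scale`, and the seeded generator are assumptions made for illustration only; they are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, used here only for reproducibility
n_prev, n = 1000, 500            # hypothetical layer sizes

def init_scale(n_prev, init_tuning):
    # hypothetical helper: same formula as `balance` in initialize_parameters
    return np.sqrt(2 * init_tuning / n_prev)

W_xavier = rng.standard_normal((n, n_prev)) * init_scale(n_prev, 0.5)
W_he = rng.standard_normal((n, n_prev)) * init_scale(n_prev, 1)

# The variance of W should be close to 1/n_prev (Xavier) and 2/n_prev (He)
print(W_xavier.var() * n_prev)   # close to 1
print(W_he.var() * n_prev)       # close to 2
```

With `init_tuning=0.5` the factor under the square root reduces to `1 / n_prev`, and with `init_tuning=1` it is `2 / n_prev`, which is why a single parameter can switch between the two schemes.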

#### 2. Initialize Adam

In [6]:
# initialize Adam parameter dictionaries to store trailing averages 
def initialize_Adam(parameters): 
    """
    Returns two dictionaries for storing the moment estimates for Adam gradient descent
    Arguments: 
        parameters -- dict of parameters W and b 
    Returns:     
        v -- dictionary of zero arrays of the same shape as parameters (first moment: trailing average of gradients)
        s -- dictionary of zero arrays of the same shape as parameters (second moment: trailing average of squared gradients)
    """
        
    L = len(parameters) // 2
    v = {}
    s = {}
    
    # Initialize v and s with zero arrays matching the shape of each parameter
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
        s["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        s["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
        
    return v, s

#### 3. Linear forward propagation for one layer 

In [7]:

def linear_forward(A_prev, W, b):
    """
    Computes the linear term Z for a layer based on the activations of the previous layer 
    and the parameters W and b 
    
    Arguments:
        A_prev -- activations of previous layer, shape (n_prev, m) for m samples
        W -- weights of current layer, shape (n, n_prev)
        b -- biases of current layer, shape (n, 1)

    Returns:
        Z -- linear term of current layer, shape (n, m)
        linear_cache -- tuple (A_prev, W, b), kept for the backward pass
    """ 
    Z = np.dot(W, A_prev) + b
    assert(Z.shape == (W.shape[0], A_prev.shape[1]))
    linear_cache = (A_prev, W, b)

    return Z, linear_cache
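
A small worked example may help with the shapes involved. The sketch below uses made-up sizes and values (3 units in the previous layer, 2 in the current one, 4 samples) and shows how the bias column is broadcast across all samples.

```python
import numpy as np

n_prev, n, m = 3, 2, 4                       # hypothetical sizes
A_prev = np.arange(n_prev * m, dtype=float).reshape(n_prev, m)
W = np.ones((n, n_prev))                     # every unit simply sums its inputs
b = np.array([[1.0], [-1.0]])                # shape (n, 1), broadcast over the m columns

Z = np.dot(W, A_prev) + b

print(Z.shape)   # (2, 4)
print(Z)
# [[13. 16. 19. 22.]
#  [11. 14. 17. 20.]]
```

Each column of `Z` belongs to one sample, which is why the shape check compares against `(W.shape[0], A_prev.shape[1])`.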

#### 4. Activation functions

In [8]:
# Sigmoid - covers range 0 <-> +1 - exaggerates small differences in Z around zero. 
# Risk of saturation for large and small values of Z (i.e., diminishing slope)  
def sigmoid(Z):
    
    A = 1/(1+np.exp(-Z))
    assert(A.shape == Z.shape)
    
    return A

# tanh - same shape as sigmoid, but covering range -1 <-> +1 - more differentiation range
# Risk of saturation for large and small values of Z (i.e., diminishing slope)  
def tanh(Z):
     
    A = np.tanh(Z)   # built-in tanh; the explicit exp() formula overflows for large |Z|
    assert(A.shape == Z.shape)
    
    return A

# relu - sets all negative values to zero while leaving positive values unchanged
def relu(Z):
    
    A = np.maximum(0, Z)
    assert(A.shape == Z.shape)
    
    return A

# leaky relu - small positive slope for negative values 
def leaky_relu(Z):

    A = np.where(Z > 0, Z, 0.01*Z)
    assert (A.shape == Z.shape)
    
    return A      
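
One practical point about tanh is numerical range. A tanh written out with explicit exponentials can overflow for large inputs, whereas NumPy's built-in `np.tanh` saturates cleanly. The sketch below compares the two on a few made-up values; the variable names are illustrative only.

```python
import numpy as np

Z = np.array([[0.5, 20.0, 1000.0]])

# explicit formula: np.exp(1000.0) overflows to inf, and inf / inf is nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = (np.exp(Z) - np.exp(-Z)) / (np.exp(Z) + np.exp(-Z))

stable = np.tanh(Z)   # NumPy's built-in handles large |Z| without overflow

print(naive)    # last entry is nan
print(stable)   # last entry is 1.0
```

For moderate values of Z both versions agree, so the difference only shows up once the linear term becomes very large.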

#### 5. Activation for one layer

In [9]:

def linear_activation(A_prev, W, b, deep_activation, keep_prob):
    """
    Computes the activation for a layer based on the activation of the previous layer, the parameters
    W and b, as well as the choice of activation function
    
    Arguments:
        A_prev -- activation of previous layer 
        W, b -- parameters of layer 
        deep_activation -- type of activation function ("sigmoid", "tanh", "relu" or "leaky_relu")
        keep_prob -- probability that a node in the layer will not be dropped
    
    Returns:
        A -- activation matrix of current layer (after dropout)
        cache -- tuple (Z, linear_cache, D), where D is the dropout mask
    """
    
    # use linear forward to calculate Z for current layer 
    Z, linear_cache = linear_forward(A_prev, W, b)
    
    # use activation
    if deep_activation == "sigmoid":
        A = sigmoid(Z)
        
    elif deep_activation == "tanh":
        A = tanh(Z)
        
    elif deep_activation == "relu":
        A = relu(Z)
        
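    elif deep_activation == "leaky_relu":
        A = leaky_relu(Z)

    else:
        # fail early instead of hitting an undefined A further down
        raise ValueError("unknown activation: " + str(deep_activation))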
        
    # Implement dropout: 
    # -- create a matrix (mask) of the same shape as A with random values between 0-1
    D = np.random.rand(A.shape[0], A.shape[1])
    # -- set values in D that are larger than keep_prob to zero
    D = (D < keep_prob)   
    # -- zero out nodes from A by multiplying with mask D 
    A = A * D
    # -- rebalance A (so that the expected sum of all A terms is roughly the same as without dropout)
    A = A / keep_prob 
        
    assert(Z.shape == (W.shape[0], A_prev.shape[1]))
    cache = (Z, linear_cache, D)
        
    return A, cache  
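
The rescaling by `keep_prob` (often called inverted dropout) can be checked empirically. The following standalone sketch uses an all-ones activation matrix and a seeded generator, both chosen purely for illustration, and shows that the mean activation is roughly preserved after masking and rescaling.

```python
import numpy as np

rng = np.random.default_rng(1)   # seeded only so the numbers are reproducible
keep_prob = 0.8                  # hypothetical keep probability
A = np.ones((50, 10000))         # stand-in activations

D = rng.random(A.shape) < keep_prob   # boolean mask, True with probability keep_prob
A_drop = A * D / keep_prob            # zero out dropped units, rescale the rest

print(D.mean())        # fraction of units kept, close to 0.8
print(A_drop.mean())   # close to A.mean() == 1.0
```

With `keep_prob = 1` the mask is all True and the division leaves A unchanged, which is how dropout can be switched off for the output layer and for prediction.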

#### 6. Forward propagation 

In [10]:
def L_model_forward(X, parameters, deep_activation, keep_prob): 
    """
    Implements forward propagation with adjustable activation function for hidden layers
    
    Arguments:
        X -- input feature vector
        parameters -- dictionary of W and b 
        deep_activation -- activation function to use for hidden layers 
        keep_prob -- probability of NOT dropping a node in a layer in a given forward pass
    
    Returns:
        AL -- final layer activation (value between 0-1)
        caches -- list of caches (for each layer: Z, the linear cache (A_prev, W, b), and the dropout mask D)

    """
    caches = []
    L = len(parameters) // 2    # corresponds to number of layers NOT incl. input layer
    A = X
    
    # forward prop for deep layers
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation(A_prev, 
                                    parameters["W"+str(l)], 
                                    parameters["b"+str(l)],
                                    deep_activation,
                                    keep_prob)

        caches.append(cache)
    
    keep_prob_output = 1.
        
    # forward prop output layer
    AL, cache = linear_activation(A, 
                                    parameters["W"+str(L)], 
                                    parameters["b"+str(L)],
                                    "sigmoid",
                                    keep_prob_output)
    caches.append(cache)

    return AL, caches 

#### 7. Compute cost 

In [11]:
def compute_costs(AL, Y, parameters, lambd): 
    """
    Computes the cross-entropy loss for each sample and averages it over all samples in 
    the set/batch
    
    Arguments: 
        AL -- final layer output 
        Y -- label vector 
        parameters -- W and b 
        lambd -- L2 regularization term 
        
    Returns: 
        cost -- cost after forward pass
        
    """    
    m = Y.shape[1]
    L = len(parameters) // 2
    
    # cost function without regularization 
    cross_entropy_cost = (-1./ m) * np.sum(np.multiply(Y, np.log(AL)) + \
                                           np.multiply((1-Y), np.log( 1-AL)))
    
    # regularization - 1) sum over all W terms across all layers
    sum_W = 0 
    for l in range(1, L+1):
        sum_w = np.sum(np.square(parameters["W" + str(l)]))
        sum_W += sum_w 
    
    # regularization - 2) compute regularization cost term 
    L2_regularization_cost = 1/m * lambd/2 * sum_W
       
    # combine cross entropy cost with L2 regularization cost     
    cost = cross_entropy_cost + L2_regularization_cost

    cost = np.squeeze(cost)     # make sure to only get a single number
    assert(cost.shape == ())
    
    return cost 
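
As a sanity check on the cost formula, here is a tiny worked example with two samples and two weight matrices. All values are made up for illustration; the computation follows the same two terms as above (mean cross-entropy plus `lambd / (2 * m)` times the sum of squared weights).

```python
import numpy as np

AL = np.array([[0.9, 0.2]])          # hypothetical predicted probabilities
Y = np.array([[1, 0]])               # matching labels
weights = [np.array([[1.0, -2.0]]),  # hypothetical W1
           np.array([[3.0]])]        # hypothetical W2
lambd = 0.1
m = Y.shape[1]

# cross-entropy: -(ln 0.9 + ln 0.8) / 2
cross_entropy = (-1. / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))

# L2 term: (1/m) * (lambd/2) * (1 + 4 + 9)
l2_term = 1 / m * lambd / 2 * sum(np.sum(np.square(W)) for W in weights)

cost = cross_entropy + l2_term
print(cross_entropy)   # approx. 0.16425
print(l2_term)         # approx. 0.35
print(cost)            # approx. 0.51425
```

Note that `np.log` returns `-inf` when a prediction is exactly 0 or 1, so in practice some implementations clip `AL` slightly away from those values before taking the logarithm.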

#### 8. Linear backward 

In [12]:
def linear_backward(dZ, linear_cache, D_prev, lambd, keep_prob):
    """
    For a single layer, computes the partial derivative terms dW, db and the partial 
    derivative of the previous layer activation dA_prev
    
    Arguments: 
        dZ -- derivative of the cost function with respect to Z 
        linear_cache -- contains activation of previous layer, as well as W, b of current layer
        D_prev -- dropout mask (zeros and ones) applied to the previous layer on the forward pass 
        lambd -- lambda term of L2 regularization 
        keep_prob -- probability of NOT "dropping a neuron" in any given hidden layer 

    Returns: 
        dA_prev -- partial derivative of the previous layer activation 
        dW, db -- partial derivatives of the parameters W and b for the current layer 
    
    """
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]     # number of samples in set 

    dW = 1./m * np.dot(dZ, A_prev.T) + lambd/m *W
    db = 1./m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ) 
    
    # shut down same neurons as in forward pass 
    dA_prev = dA_prev * D_prev
    # rebalance dA_prev to not change overall sum of terms 
    dA_prev = dA_prev / keep_prob        
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db
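
The `dW` formula, including the `lambd/m * W` term, can be verified with a numerical gradient check. The sketch below is a standalone illustration under stated assumptions: it uses a stand-in quadratic cost (`toy_cost`, a hypothetical helper, not the cross-entropy cost of this notebook) for which `dZ` is simply `Z`, plus random made-up data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_prev, n, m, lambd = 3, 2, 5, 0.7          # hypothetical sizes and L2 strength
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n, n_prev))
b = rng.standard_normal((n, 1))

def toy_cost(W):
    # stand-in cost: mean over samples of 0.5 * Z**2, plus the L2 penalty
    Z = np.dot(W, A_prev) + b
    return 0.5 / m * np.sum(Z ** 2) + lambd / (2 * m) * np.sum(W ** 2)

# analytic gradient, same formula as dW in linear_backward (here dZ = Z)
dZ = np.dot(W, A_prev) + b
dW = 1. / m * np.dot(dZ, A_prev.T) + lambd / m * W

# numerical gradient by central differences, one entry of W at a time
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(n):
    for j in range(n_prev):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_num[i, j] = (toy_cost(W_plus) - toy_cost(W_minus)) / (2 * eps)

print(np.max(np.abs(dW - dW_num)))   # should be very small
```

The same pattern (perturb one parameter, compare against the analytic gradient) can be applied to the full network, at the price of one forward pass per perturbed parameter.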


#### 9. Activation function derivatives

In [13]:
def relu_backward(dA, Z):
    """
    backward propagation for a single RELU unit.
    """
    # slope dA/dZ is 1 for Z > 0 and 0 for Z <= 0; by the chain rule dZ = dA * dA/dZ
    dA_dZ = np.where(Z > 0, 1, 0) 
    dZ = dA * dA_dZ
    
    assert (dZ.shape == Z.shape)
    
    return dZ

def sigmoid_backward(dA, Z):
    """
    backward propagation for a single SIGMOID unit
    """
        
    a = 1/(1+np.exp(-Z))
    dZ = dA * a * (1-a)
    
    assert (dZ.shape == Z.shape)
    
    return dZ


def tanh_backward(dA, Z): 
    
    a = np.tanh(Z)   # built-in tanh; the explicit exp() formula overflows for large |Z|
    dZ = dA * (1 - a**2)
    
    assert (dZ.shape == Z.shape)
    
    return dZ

def leaky_relu_backward(dA, Z): 
    
    # slope dA/dZ is 1 for Z > 0 and 0.01 for Z <= 0; by the chain rule dZ = dA * dA/dZ
    dA_dZ = np.where(Z > 0, 1, 0.01) 
    dZ = dA * dA_dZ
    
    assert (dZ.shape == Z.shape)
    
    return dZ
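
The closed-form slopes used for sigmoid and tanh can be compared against finite differences. This is a minimal standalone sketch with made-up inputs; it redefines `sigmoid` locally so that it runs on its own.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

Z = np.array([[-2.0, -0.5, 0.3, 1.5]])   # hypothetical pre-activations
eps = 1e-6

# sigmoid: analytic slope a * (1 - a) vs. central difference
a = sigmoid(Z)
sigmoid_analytic = a * (1 - a)
sigmoid_numeric = (sigmoid(Z + eps) - sigmoid(Z - eps)) / (2 * eps)

# tanh: analytic slope 1 - a**2 vs. central difference
tanh_analytic = 1 - np.tanh(Z) ** 2
tanh_numeric = (np.tanh(Z + eps) - np.tanh(Z - eps)) / (2 * eps)

print(np.max(np.abs(sigmoid_analytic - sigmoid_numeric)))
print(np.max(np.abs(tanh_analytic - tanh_numeric)))
```

ReLU and leaky ReLU are piecewise linear, so their slopes are exact everywhere except at Z = 0, where the code above picks the slope of the negative branch.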


#### 10. Linear activation backward

In [14]:
def linear_activation_backward(dA, cache, D_prev, deep_activation, lambd, 
                               keep_prob):
    """
    Integrates the computation of the partial derivative of the cost function with respect 
    to the activation in a given layer. Can handle a choice of Sigmoid, tanh, ReLU and 
    leaky ReLU activation functions
    
    Arguments: 
        dA -- partial derivative of the cost with respect to the current layer activation 
        cache -- contains Z and the dropout mask D of the current layer, the activation of the previous layer, as well as W, b of the current layer
        D_prev -- dropout mask (zeros and ones) applied to the previous layer on the forward pass 
        deep_activation -- activation function of the layer, i.e., "sigmoid", "tanh", "relu" or "leaky_relu" 
        lambd -- lambda term of L2 regularization 
        keep_prob -- probability of NOT "dropping a neuron" in any given hidden layer 
    
    Returns: 
        dA_prev -- partial derivative of the previous layer activation 
        dW, db -- partial derivatives of the parameters W and b for the current layer 
    """
    Z, linear_cache, D = cache
    
    if deep_activation == "relu":
        dZ = relu_backward(dA, Z)
        
    elif deep_activation == "sigmoid":
        dZ = sigmoid_backward(dA, Z)
        
    elif deep_activation == "tanh":
        dZ = tanh_backward(dA, Z)
        
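    elif deep_activation == "leaky_relu":
        dZ = leaky_relu_backward(dA, Z)

    else:
        # fail early instead of hitting an undefined dZ further down
        raise ValueError("unknown activation: " + str(deep_activation))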
        
    dA_prev, dW, db = linear_backward(dZ, linear_cache, D_prev, lambd, keep_prob)
       
        
    return dA_prev, dW, db

#### 11. Backward propagation

In [15]:
def L_model_backward(AL, Y, caches, deep_activation, lambd, keep_prob):
    """
    Runs backward propagation through all layers 
    
    Arguments: 
        AL -- final layer activation (value between 0-1)
        Y -- label vector 
        caches -- list of caches (for each layer: Z, the linear cache (A_prev, W, b), and the dropout mask D)
        deep_activation -- activation function for hidden layers  
        lambd -- L2 regularization term 
        keep_prob -- probability that a node in a hidden layer will not be dropped
    
    Returns:
        grads -- dictionary of gradients for W and b (for all layers)
    """
    
    # capture dW and db for each layer 
    grads = {}  
    L = len(caches)     # equal to number of layers excl. input layer 
    
    # calculate dcost/dAL
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    
    # calculate and store gradients for final layer 
    current_cache = caches[L-1]     # e.g., the cache for layer 1 is at position 0
    previous_cache = caches[L-2]
    Z, linear_cache, D_prev = previous_cache
    dA_prev, dW, db = linear_activation_backward(dAL, current_cache, D_prev, 
                                                 "sigmoid", lambd, keep_prob)
    grads["dA" + str(L-1)] = dA_prev
    grads["dW" + str(L)] = dW
    grads["db" + str(L)] = db
    
    # calculate and store gradients for hidden layers 
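    for l in reversed(range(L-1)):
        # layer l+1: (e.g., RELU -> LINEAR) gradients
        current_cache = caches[l]
        if l > 0:
            # dropout mask applied to the activations of layer l on the forward pass
            Z, linear_cache, D_prev = caches[l-1]
            keep_prob_prev = keep_prob
        else:
            # the input X was never subject to dropout (and caches[-1] would wrap
            # around to the output layer), so leave dA0 unmasked and unscaled
            D_prev = 1.
            keep_prob_prev = 1.
        dA_prev, dW, db = linear_activation_backward(grads["dA"+str(l+1)], 
                                    current_cache, D_prev, deep_activation, 
                                    lambd, keep_prob_prev)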
        grads["dA" + str(l)] = dA_prev
        grads["dW" + str(l+1)] = dW
        grads["db" + str(l+1)] = db
            
    return grads

#### 12. Update parameters

In [16]:
def update_parameters_Adam(parameters, grads, v, s, t, learning_rate,
                                beta1, beta2,  epsilon = 1e-8):
    """
    Update parameters using Adam gradient descent
    
    Arguments:
        parameters -- dict with parameters W and b
        grads -- dict with grads dW and db
        v -- dict with trailing averages of the gradients (stored as: dW1, db1, etc.)
        s -- dict with trailing averages of the squared gradients (stored as: dW1, db1, etc.)
        t -- step counter used for bias correction (incremented by 1 inside the function before use)
        learning_rate -- the learning rate alpha
        beta1 -- Exponential decay hyperparameter for the first moment estimates 
        beta2 -- Exponential decay hyperparameter for the second moment estimates 
        epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
        parameters -- dict containing updated parameters 
        v -- updated Adam variable 
        s -- updated Adam variable
    """
    t += 1
    L = len(parameters) // 2                 
    # dictionaries for the bias-corrected trailing averages 
    v_corrected = {}                         
    s_corrected = {} 
    
    # Adam update on all parameters
    for l in range(L):
        # compute trailing average of the gradients 
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads["db" + str(l+1)]
        
        # Correct for bias (i.e., initialized trailing averages all zero) 
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1**t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1**t)
        
        # Moving average of the squared gradients 
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * np.square(grads["dW" + str(l+1)])
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * np.square(grads["db" + str(l+1)])
        
        # Compute bias-corrected second raw moment estimate
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2**t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2**t)
        
        # Update parameters (epsilon is added outside the square root, as in the standard Adam formulation)
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)
    
    return parameters, v, s
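
To see what the bias correction does, it helps to trace a single Adam step by hand. The sketch below is a standalone illustration with made-up gradients and hyperparameters; as an assumption, it uses the textbook form of the update with epsilon added outside the square root. On the very first step the corrected averages equal the gradient and its square, so each parameter moves by roughly `learning_rate`, regardless of the gradient's magnitude.

```python
import numpy as np

beta1, beta2, epsilon, learning_rate = 0.9, 0.999, 1e-8, 0.01   # hypothetical settings
grad = np.array([[0.5, -20.0]])      # two gradients of very different size
w = np.zeros_like(grad)
v = np.zeros_like(grad)              # first moment, starts at zero
s = np.zeros_like(grad)              # second moment, starts at zero
t = 1                                # first step

v = beta1 * v + (1 - beta1) * grad
s = beta2 * s + (1 - beta2) * np.square(grad)

v_hat = v / (1 - beta1 ** t)         # equals grad on the first step
s_hat = s / (1 - beta2 ** t)         # equals grad**2 on the first step

step = learning_rate * v_hat / (np.sqrt(s_hat) + epsilon)
w = w - step

print(v)      # uncorrected average is only 10% of the gradient
print(v_hat)  # bias-corrected average recovers the gradient
print(step)   # approx. [[ 0.01 -0.01]]
```

Without the division by `1 - beta1**t` and `1 - beta2**t`, the zero-initialized averages would bias the early updates toward zero.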

#### 13. Predict labels 

In [17]:
def predict(parameters, X, deep_activation):
    """
    Predict labels based on input vector X and learned parameters W and b 
    
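    Arguments:
        parameters -- a dict containing learned parameters  
        X -- input data of size (n_x, m)
        deep_activation -- activation function used for the hidden layers during training
    
    Returns:
        predictions -- binary predictions (0, 1) of size (1, m)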
    """
    # make sure no nodes are dropped for prediction
    keep_prob_predict = 1
    
    # Computes probabilities using forward prop, and classifies to 0/1 using 0.5 as the threshold.
    AL, caches = L_model_forward(X, parameters, deep_activation, 
                                 keep_prob_predict)
    predictions = AL
    predictions = np.where(predictions > 0.5, 1, 0)
    
    return predictions

#### 14. Calculate accuracy 

In [18]:
def accuracy(predictions, Y): 
    """
    returns accuracy (in %) of predictions vs Y
    
    Arguments: 
        predictions - binary predictions (0, 1) of shape (1, m)
        Y - binary truth vector (0, 1) of shape (1, m)
   
    Return: 
        accuracy(in %)
    """
    m = Y.shape[1]
    assert (predictions.shape == Y.shape)

    
    # |predictions - Y| has shape (1, m) and is 0 where prediction == Y and 1 otherwise
    Abs = abs(predictions - Y)
    accuracy = (1 - (np.sum(Abs) / m)) * 100 
    return accuracy 
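
A short worked example of the accuracy computation, using made-up predictions and labels: with one mismatch out of four samples the result is 75%. The sketch also shows an equivalent formulation via `np.mean`, which is an alternative and not the notebook's own code.

```python
import numpy as np

predictions = np.array([[1, 0, 1, 1]])   # hypothetical predictions, shape (1, m)
Y = np.array([[1, 0, 0, 1]])             # hypothetical labels, shape (1, m)
m = Y.shape[1]

# same formula as above: one minus the fraction of mismatches, in percent
acc = (1 - np.sum(np.abs(predictions - Y)) / m) * 100

# equivalent alternative: fraction of matching entries, in percent
acc_alt = np.mean(predictions == Y) * 100

print(acc)       # 75.0
print(acc_alt)   # 75.0
```

Both forms rely on `predictions` and `Y` having the same shape; a `(1, m)` array compared against an `(m, 1)` array would broadcast to `(m, m)` and give a misleading result, which is what the shape assertion guards against.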