# A 2-layer shallow network for binary classification
In this assignment you will build a two-layer network for the same cat vs. non-cat binary classification problem. First, let's import the required packages. Note that the functions 'flatten', 'load_train_data' and 'load_test_data' have been copied to 'assign1_utils.py'.

In [1]:
import numpy as np
import matplotlib # for plotting
from matplotlib import pyplot as plt # for plotting

from assign1_utils import load_train_data, load_test_data, flatten

%matplotlib inline 

The architecture to be implemented is as follows:
Input layer (I) -----> hidden layer (H) -----> output layer (O)

- Input features have shape: nx$^{[0]}$ x m (a batch of m samples, each of dimension nx$^{[0]}$). In this assignment, nx$^{[0]}$ = 64\*64\*3 = 12288.
- weights between I and H have shape: nx$^{[1]}$ x nx$^{[0]}$. nx$^{[1]}$ = 32.
- bias vector at H has shape: nx$^{[1]}$ x 1
- non-linearity at hidden layer is ReLU
- weights between H and O have shape: nx$^{[2]}$ x nx$^{[1]}$. nx$^{[2]}$ = 1.
- bias vector at O has shape: nx$^{[2]}$ x 1
- non-linearity at output layer is Sigmoid
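
The shapes listed above can be sanity-checked with a toy-sized sketch. The sizes used here (4 input features, 3 hidden units, 5 samples) and the all-zero parameters are illustrative assumptions, not the values used in the assignment. The point is only how the shapes combine and how a bias of shape (n, 1) broadcasts across the m columns.

```python
import numpy as np

# Toy sizes chosen for illustration only (the assignment uses 12288, 32 and 1).
n0, n1, n2, m = 4, 3, 1, 5

a0 = np.zeros((n0, m))    # input features: one column per sample
W1 = np.zeros((n1, n0))   # weights between I and H
b1 = np.zeros((n1, 1))    # bias at H
W2 = np.zeros((n2, n1))   # weights between H and O
b2 = np.zeros((n2, 1))    # bias at O

# ReLU and Sigmoid act elementwise and do not change shapes,
# so they are skipped in this shape-only check.
z1 = W1 @ a0 + b1         # (n1, m): b1 is broadcast across the m columns
z2 = W2 @ z1 + b2         # (n2, m): one output per sample

print(z1.shape, z2.shape)
```

If any of the shapes were inconsistent, the `@` product would raise a `ValueError`, which makes a check like this a cheap first debugging step.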

The implementation will follow the Python-style pseudo-code listed in my lecture notes.


First, you will complete the function that initializes the weights and biases and returns them. Weight matrices have to be initialized similarly to how the weight vector was initialized in logistic regression. Bias vectors have to be initialized to zeros.

In [2]:
def initialize_params(nx):
    """
      Function that initializes weights to scaled random std normal values and biases to zero and returns them
      
      nx: a list that contains the number of nodes in each layer in order. For an l-layer network, len(nx) = l+1 
          as it also includes the number of features in the input layer.
          
      returns W: list of numpy arrays of weight matrices
              b: list of numpy arrays of bias vectors
    """
    Wlist = []
    blist = []
    for i in range(1, len(nx)): 
        Wlist.append(...) # replace the ...; np.random.randn will be useful
        blist.append(...) # replace the ...; np.zeros will be useful
    return Wlist, blist


#uncomment the following two lines to test your function

#W, b = initialize_params([3, 2, 1])
#[print(f'Shape of W[{i}]: {W[i].shape}, Shape of b[{i}]: {b[i].shape}') for i in range(len(W))]

Now you will complete forward, backward, update_params and part of the main function. Functions f and df are already complete. Look at the code to understand what they do.
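
One way to convince yourself that a derivative function is correct is a finite-difference check. The sketch below is self-contained and uses its own hypothetical helper names (`sigmoid`, `sigmoid_grad`), not the notebook's `f` and `df`. It compares the analytic sigmoid derivative against a central-difference estimate.

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Analytic derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-3.0, 3.0, 7)   # a few test points
h = 1e-5                        # step size for the central difference

# Central difference: (f(z + h) - f(z - h)) / (2h)
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_err = np.max(np.abs(numeric - sigmoid_grad(z)))

print(f'largest absolute difference: {max_err:.2e}')
```

The same idea should carry over to the other non-linearities, with the caveat that ReLU is not differentiable at exactly z = 0, so test points there are best avoided.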

In [None]:
def f(z, fname = 'ReLU'):
    """
      computes and returns the non-linear function of z given the non-linearity
      
      z: numpy array of any shape on which the non-linearity will be applied elementwise
      fname: a string that is name of the non-linearity. Defaults to 'ReLU'. Other valid values are
             'Sigmoid', 'Tanh', and 'Linear'.
      
      returns f(z) f is the non-linear function whose name is fname
    """
    if fname == 'ReLU':
        return np.maximum(z, 0)
    elif fname == 'Sigmoid':
        return 1./(1+np.exp(-z))
    elif fname == 'Tanh':
        return np.tanh(z)
    elif fname == 'Linear':
        return z
    else:
        raise ValueError('Unknown non-linear function error')
        

def df(z, fname = 'ReLU'):
    """
      computes and returns the derivative of the non-linear function of z with respect to z
      
      z: numpy array of any shape 
      fname: a string that is name of the non-linearity. Defaults to 'ReLU'. Other valid values are
             'Sigmoid', 'Tanh', and 'Linear'.
      
      returns df/dz where f is the non-linear function of z. Name of the non-linear function is fname.
    """
    if fname == 'ReLU':
        return z>0
    elif fname == 'Sigmoid':
        sigma_z = 1./(1+np.exp(-z))
        return sigma_z * (1-sigma_z)
    elif fname == 'Tanh':
        return 1 - np.tanh(z)**2
    elif fname == 'Linear':
        return np.ones(z.shape)
    else:
        raise ValueError('Unknown non-linear function error')
        

def forward(a, W, b, fname = 'ReLU'):
    """
      Forward propagates a through the current layer given W and b
      a: input activation from previous layer l-1 of shape nx[l-1] x m
      W: weight matrix of shape nx[l] x nx[l-1]
      b: bias vector of shape nx[l] x 1
      fname: name of the non-linearity at current layer l
      
      returns anew: the output activation from current layer of shape nx[l] x m
              cache: a tuple that contains current layer's linear computation z, previous layer's activation a,
                     current layer's activation anew and weight matrix W
    """
    # Fill rhs in the following 3 lines. No extra lines of code required.
    
    z =                        # np.dot or np.matmul or @ operator will be useful. Also understand numpy 
                               # broadcasting for adding vector b to product of W and a
    anew =                    # function f defined above will be useful
    cache =                  # read the doc string for this function listed above and accordingly fill rhs
    return anew, cache


def backward(da, cache, fname = 'ReLU'):
    """
      Backward propagates da through the current layer given da, cache and the non-linearity at the current layer
      da: derivative of loss with respect to current layer's activation a; shape is nx[l] x m
      cache: a tuple that contains current layer's linear computation z, previous layer's activation aprev,
                     current layer's activation a and weight matrix W between previous layer l-1 and current layer l
      fname: name of the non-linearity at current layer l; this will be helpful for local gradient computation in 
             chain rule
      
      returns daprev: derivative of loss with respect to previous layer's activation aprev; shape is nx[l-1] x m
              dW: derivative of loss with respect to W; shape is nx[l] x nx[l-1]
              db: derivative of loss with respect to b; shape is nx[l] x 1
    """
    # Fill rhs in the following 5 lines. No extra lines of code required.
    
    z, aprev, a, W =            # extract from cache
    dz =                        # compute dz as incoming grad da * local grad. For local grad, function df defined 
                                # above will be useful
    dW =                       # np.dot or np.matmul or @ operator will be useful. Also .T will be useful for 
                              # transposing
    db =                     # np.sum will be useful
    daprev =                # np.dot or np.matmul or @ operator will be useful. Also .T will be useful for 
                           # transposing
    return daprev, dW, db

def update_params(Wlist, blist, dWlist, dblist, alpha):
    """
      Updates all the parameters using gradient descent rule
      
      Wlist: a list of all weight matrices to be updated
      blist: a list of bias vectors to be updated
      dWlist: a list of gradients of loss with respect to weight matrices
      dblist: a list of gradients of loss with respect to bias vectors
      alpha: learning rate
    """
    for i in range(len(Wlist)):
        Wlist[i] -=          # fill rhs
        blist[i] -=          # fill rhs
        
def main(): # main function to train the model
    
    # load train data
    a0, y = load_train_data()
    a0 = flatten(a0)
    a0 = a0/255. # normalize the data to [0, 1]    
    
    # set some hyperparameters and epsilon
    alpha = 0.01    
    miter = 2000
    epsilon = 1e-6
    num_layers = 2
    nx = [a0.shape[0], 32, 1]
    m = a0.shape[1]
    fname_list = ['ReLU', 'Sigmoid']
    
    # initialize weights and biases
    Wlist, blist =       # fill rhs 
    
    # initialize list of caches from each layer, gradients of weights at each layer, gradients of biases at
    # each layer to empty
    cache, dWlist, dblist = ([None]*num_layers for i in range(3))
    
    for i in range(miter):
        a = a0
        # forward propagate through each layer
        for l in range(num_layers):
            a, cache[l] =                        # Fill rhs. call forward function with 
                                                # appropriate arguments

        L =                                    # Fill rhs. compute loss L
        da =                                  # Fill rhs. compute da

        # backward propagate through each layer to compute gradients
        for l in range(num_layers-1, -1, -1):
            da, dWlist[l], dblist[l] =                   # Fill rhs. call backward function with 
                                                        # appropriate arguments

        # update_params
        update_params(...)          # Replace ...; call update_params function with appropriate arguments

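        if not i%100: # print loss every 100 iterations
            # np.asscalar was removed in NumPy 1.23; float(np.squeeze(...)) works for a size-1 array or a scalar
            print(f'Loss at iteration {i}:\t{float(np.squeeze(L)):.4f}')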
    
    return Wlist, blist

if __name__ == '__main__':
    Wlist, blist = main()

Let's now test the model.

In [None]:
fname_list = ['ReLU', 'Sigmoid']
num_layers = 2
def predict(a, Wlist, blist, fname_list):
    for l in range(num_layers):
        a, _ = forward(a, Wlist[l], blist[l], fname_list[l])
    predictions = np.zeros_like(a)
    predictions[a > 0.5] = 1
    return predictions

def test_model(a, y, Wlist, blist, fname_list):
    predictions = predict(a, Wlist, blist, fname_list)
    acc = np.mean(predictions == y)
    acc = float(acc) # np.asscalar was removed in NumPy 1.23
    return acc

x, y = load_train_data()
x = flatten(x)
x = x/255. # normalize the data to [0, 1]
print(f'train accuracy: {test_model(x, y, Wlist, blist, fname_list) * 100:.2f}%')

x, y = load_test_data()
x = flatten(x)
x = x/255. # normalize the data to [0, 1]
print(f'test accuracy: {test_model(x, y, Wlist, blist, fname_list) * 100:.2f}%')

# Questions
1. Why has the test accuracy not improved with this 2-layer network? Explain.
2. How does replacing ReLU with Sigmoid at the hidden layer affect the model?
3. Expand the 2-layer network to, say, a 4-layer network of your choice. How does this model compare to logistic regression and the 2-layer network?
4. Play with a few learning rates and explain your observations.
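
As a warm-up for the learning-rate question, the effect of the step size can be seen on a problem much simpler than the network. The sketch below is a toy illustration under stated assumptions: it minimises the one-dimensional loss L(w) = w**2 with plain gradient descent, and the function name `gradient_descent` and the three learning rates are chosen here for illustration only.

```python
def gradient_descent(alpha, w0=1.0, steps=20):
    """Run `steps` gradient descent updates on L(w) = w**2, starting from w0."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # dL/dw for L(w) = w**2
        w -= alpha * grad    # gradient descent update
    return w

for alpha in (0.01, 0.1, 1.1):
    print(f'alpha={alpha}: w after 20 steps = {gradient_descent(alpha):.4f}')
```

Each update multiplies w by (1 - 2 * alpha), so a small step shrinks w slowly, a moderate step shrinks it quickly, and a step above 1 makes |w| grow. The network's loss surface is far less simple, so treat this only as intuition for what to look for in your own runs.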

Note: All questions must be answered in the Jupyter notebook only. Wherever code is required, write and run the code in a code cell. For text, write and render it in a markdown cell.