# CS549 Machine Learning - Irfan Khan
# Assignment 7: Simple Neural Network - First Principles

**Total points: 10**

Updated assignment designed by Ex-Professor Yang Xu, Computer Science Dept, SDSU

In this assignment, you will implement a 2-layer shallow neural network model.

We will use the model to conduct the same binary classification task, i.e., classify two categories of the sign language dataset.

The input size is the number of pixels in an image (4096, matching the second dimension of `X_train` printed below). The size of the hidden layer is determined by a hyperparameter `n_h`, and the size of the output layer is 1. The provided `utils` file contains functions that are called to generate test data. Please don't change the `utils` file.

In [2]:
#Don't change code in this cell
import numpy as np
import matplotlib.pyplot as plt
from utils import *
# import importlib
# importlib.reload(utils)

%matplotlib inline
np.random.seed(1)

In [3]:
#Don't change code in this cell
# Load data
#Since data is in n x m format, convert into m x n format, m: sample size, n: number of features
X_train_orig, y_train_orig, X_test_orig, y_test_orig = load_data()
X_train = X_train_orig.T
y_train = y_train_orig.T
X_test = X_test_orig.T
y_test = y_test_orig.T

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(286, 4096)
(286, 1)
(125, 4096)
(125, 1)


**Expected output**

(286, 4096)<br>
(286, 1)<br>
(125, 4096)<br>
(125, 1)

### 1.1 Initialize parameters
**1 point**

The parameters associated with the hidden layer are $W^{[1]}$ and $b^{[1]}$, and the parameters associated with the output layer are $W^{[2]}$ and $b^{[2]}$.

We use **tanh** as the activation function for the hidden layer, and **sigmoid** for the output layer.

**Instructions:**
- Initialize parameters randomly
- Use `np.random.randn(size_out, size_in) * 0.01` to initialize $W^{[l]}$, in which `size_out` is the output size of the current layer, and `size_in` is the input size from the previous layer. Note that `np.random.randn` takes the dimensions as separate arguments, not as a tuple.
- Use `np.zeros()` to initialize $b^{[l]}$
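
The shape conventions in the instructions above can be checked on a toy layer. This is a minimal sketch, not the graded solution; the sizes and the seed are hypothetical values chosen only for illustration.

```python
import numpy as np

np.random.seed(0)          # hypothetical seed, for illustration only
size_in, size_out = 3, 4   # hypothetical layer sizes

# randn takes each dimension as a separate integer argument
W = np.random.randn(size_out, size_in) * 0.01
# zeros takes the whole shape as one tuple; a (1, size_out) row broadcasts over samples
b = np.zeros((1, size_out))

print(W.shape)  # (4, 3)
print(b.shape)  # (1, 4)
```

Scaling by 0.01 keeps the initial pre-activations small, so tanh starts in its near-linear region.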

In [4]:
def init_params(n_i, n_h, n_o):
    """
    Args:
    n_i -- size of input layer
    n_h -- size of hidden layer
    n_o -- size of output layer
    
    Return:
    params -- a dict object containing all parameters:
        W1 -- weight matrix of layer 1
        b1 -- bias vector of layer 1
        W2 -- weight matrix of layer 2
        b2 -- bias vector of layer 2
    """
    np.random.seed(2) # For deterministic repeatability, DO NOT change this line! 
    
    ### START your code ###
    W1 = np.random.randn(n_h, n_i) * 0.01
    b1 = np.zeros((1, n_h))
    W2 = np.random.randn(n_o, n_h) * 0.01
    b2 = np.zeros((1, n_o))
    
    ### END your code ###
    
    params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    
    
    return params

In [5]:
# Evaluate Task. Don't change code in this cell
ps = init_params(3, 4, 1)
print('W1 =', ps['W1'])
print('b1 =' ,ps['b1'])
print('W2 =', ps['W2'])
print('b2 =', ps['b2'])

W1 = [[-0.00416758 -0.00056267 -0.02136196]
 [ 0.01640271 -0.01793436 -0.00841747]
 [ 0.00502881 -0.01245288 -0.01057952]
 [-0.00909008  0.00551454  0.02292208]]
b1 = [[0. 0. 0. 0.]]
W2 = [[ 0.00041539 -0.01117925  0.00539058 -0.0059616 ]]
b2 = [[0.]]


**Expected output**

W1 = [[-0.00416758 -0.00056267 -0.02136196]<br>
 [ 0.01640271 -0.01793436 -0.00841747]<br>
 [ 0.00502881 -0.01245288 -0.01057952]<br>
 [-0.00909008  0.00551454  0.02292208]]<br>
b1 = [[0. 0. 0. 0.]]<br>
W2 = [[ 0.00041539 -0.01117925  0.00539058 -0.0059616 ]]<br>
b2 = [[0.]]<br>

### 1.2 Forward propagation

**2 points**

Use the following formulas to implement forward propagation:
- $z^{[1]} = XW^{[1]T} + b^{[1]}$
- $a^{[1]} = tanh(z^{[1]})$ --> use `np.tanh` function
- $z^{[2]} = a^{[1]}W^{[2]T} + b^{[2]}$
- $a^{[2]} = \sigma(z^{[2]})$ --> directly use the `sigmoid` function provided in `utils` package
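
To see how the shapes flow through these formulas, here is a small sketch on random data. The sizes are hypothetical, and `sigmoid` is defined locally as a stand-in because the version in `utils` is not shown here.

```python
import numpy as np

def sigmoid(z):
    # Local stand-in for the sigmoid provided by utils (assumed equivalent)
    return 1.0 / (1.0 + np.exp(-z))

m, n_in, n_h = 5, 3, 4  # hypothetical sizes: samples, input features, hidden units
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n_in))
W1 = rng.standard_normal((n_h, n_in)) * 0.01
b1 = np.zeros((1, n_h))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

z1 = X @ W1.T + b1   # (m, n_in) @ (n_in, n_h) -> (m, n_h)
a1 = np.tanh(z1)     # same shape as z1, values in (-1, 1)
z2 = a1 @ W2.T + b2  # (m, n_h) @ (n_h, 1) -> (m, 1)
a2 = sigmoid(z2)     # output activation, values in (0, 1)

print(z1.shape, a1.shape, z2.shape, a2.shape)
```

Because samples are stored as rows, each weight matrix is transposed before the product, and the `(1, size)` biases broadcast across all `m` rows.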

In [6]:
def forward_prop(X, params):
    """
    Args:
    X -- input data of shape (m,n_in)
    params -- a python dict object containing all parameters (output of init_params)
    
    Return:
    a2 -- the activation of the output layer
    cache -- a python dict containing all intermediate values for later use in backprop
             i.e., 'z1', 'a1', 'z2', 'a2'
    """
    m = X.shape[0]
    
    # Retrieve parameters from params
    ### START your code ###
    W1 = params['W1']
    b1 = params['b1']
    W2 = params['W2']
    b2 = params['b2']
    
    
    ### END your code ###
    
    # Implement forward propagation
    ### START your code ###
    z1 = np.dot(X, W1.T) + b1
    a1 = np.tanh(z1)
    
    z2 = np.dot(a1, W2.T) + b2
    a2 = sigmoid(z2)
    
    
    ### END your code ###
    
    assert a1.shape[0] == m
    assert a2.shape[0] == m
    
    cache = {'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2}
    
    return a2, cache

In [7]:
# Evaluate Task. Don't change code in this cell
X_tmp, params_tmp = forwardprop_testcase()

a2, cache = forward_prop(X_tmp, params_tmp)

print('mean(z1) =', np.mean(cache['z1']))
print('mean(a1) =', np.mean(cache['a1']))
print('mean(z2) =', np.mean(cache['z2']))
print('mean(a2) =', np.mean(cache['a2']))


mean(z1) = 0.0064157816283504174
mean(a1) = 0.006410368144939439
mean(z2) = -6.43251619627097e-05
mean(a2) = 0.49998391870952386


**Expected output**

mean(z1) = 0.0064157816283504174<br>
mean(a1) = 0.006410368144939439<br>
mean(z2) = -6.43251619627097e-05<br>
mean(a2) = 0.49998391870952386<br>
***

### 1.3 Backward propagation
**3 points**

Use the following formulas to implement backward propagation:
- $dz^{[2]} = \frac{1}{m}(a^{[2]} - y)$ --> $m$ is the number of examples
- $dW^{[2]} = dz^{[2]T}a^{[1]}$
- $db^{[2]} = \frac{1}{m}$ np.sum( $dz^{[2]}$, axis=0, keepdims=True)
- $da^{[1]} = dz^{[2]}W^{[2]}$
- $dz^{[1]} = da^{[1]}*g'(z^{[1]})$
    - Here $*$ denotes element-wise multiply
    - $g(z)$ is the tanh function, therefore its derivative $g'(z^{[1]}) = 1 - (g(z^{[1]}))^2 = 1 - (a^{[1]})^2$
- $dW^{[1]} = dz^{[1]T}X$
- $db^{[1]} = \frac{1}{m}$ np.sum( $dz^{[1]}$, axis=0, keepdims=True)
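
The tanh derivative identity used for $dz^{[1]}$ can be sanity-checked numerically. The sketch below compares $1 - \tanh^2(z)$ against a central finite difference; the grid of `z` values and the step `eps` are arbitrary choices for illustration.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)  # arbitrary test points
eps = 1e-6                     # arbitrary finite-difference step

# Analytic derivative: g'(z) = 1 - tanh(z)^2
analytic = 1.0 - np.tanh(z) ** 2
# Central difference approximation of the same derivative
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2.0 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

The same finite-difference idea extends to checking a full backprop implementation, one parameter at a time.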

In [None]:
def backward_prop(X, y, params, cache):
    """
    Args:
    X -- input data of shape (m,n_in)
    y -- input label of shape (m,1)
    params -- a python dict containing all parameters
    cache -- a python dict containing 'z1', 'a1', 'z2' and 'a2' (output of forward_prop)
    
    Return:
    grads -- a python dict containing the gradients w.r.t. all parameters,
             i.e., dW1, db1, dW2, db2
    """
    m = X.shape[0]
    
    # Retrieve parameters from params
    ### START your code ###
    W2 = params['W2']
    
    ### END your code ###
    
    # Retrieve intermediate values stored in cache
    ### START your code ###
    
    a1 = cache['a1']
    a2 = cache['a2']
    
    ### END your code ###
    
    # Implement backprop
    ### START your code ###
    dz2 = (a2 - y) / m
    dW2 = np.dot(dz2.T, a1)
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
    
    da1 = np.dot(dz2, W2)
    dz1 = da1 * (1 - np.power(a1, 2))
    dW1 = np.dot(dz1.T, X)
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
    
    ### END your code ###
    
    grads = {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    
    return grads

In [9]:
# Evaluate Task. Don't change code in this cell
X_tmp, y_tmp, params_tmp, cache_tmp = backprop_testcase()

grads = backward_prop(X_tmp, y_tmp, params_tmp, cache_tmp)
print('mean(dW1)', np.mean(grads['dW1']))
print('mean(db1)', np.mean(grads['db1']))
print('mean(dW2)', np.mean(grads['dW2']))
print('mean(db2)', np.mean(grads['db2']))



mean(dW1) -0.0001484446585247785
mean(db1) -5.676757938210491e-05
mean(dW2) -0.004079186018202939
mean(db2) 0.019996784000000004


**Expected output**

mean(dW1) -0.00014844465852477848<br>
mean(db1) -5.676757938210493e-05<br>
mean(dW2) -0.00407918601820294<br>
mean(db2) 0.019996784000000004<br>

***

### 1.4 Update parameters
**1 point**

Update $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$ accordingly:
- $W^{[1]} = W^{[1]} - \alpha\ dW^{[1]}$
- $b^{[1]} = b^{[1]} - \alpha\ db^{[1]}$
- $W^{[2]} = W^{[2]} - \alpha\ dW^{[2]}$
- $b^{[2]} = b^{[2]} - \alpha\ db^{[2]}$
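
As a worked instance of the update rule, the sketch below applies one gradient descent step to a toy parameter set. The values, the learning rate, and the key names `W`/`b` are hypothetical; it builds a new dict rather than modifying the input, which is one possible design choice and not a requirement of the assignment.

```python
import numpy as np

alpha = 0.1  # hypothetical learning rate
params = {'W': np.array([[1.0, -2.0]]), 'b': np.array([[0.5]])}
grads = {'dW': np.array([[0.2, -0.4]]), 'db': np.array([[1.0]])}

# One gradient descent step: theta <- theta - alpha * d_theta
updated = {
    'W': params['W'] - alpha * grads['dW'],
    'b': params['b'] - alpha * grads['db'],
}

print(updated['W'])
print(updated['b'])
```

Each parameter moves against its gradient, scaled by `alpha`: here `W` goes from `[1.0, -2.0]` to approximately `[0.98, -1.96]` and `b` from `0.5` to `0.4`.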

In [10]:
def update_params(params, grads, alpha):
    """
    Args:
    params -- a python dict containing all parameters
    grads -- a python dict containing the gradients w.r.t. all parameters (output of backward_prop)
    alpha -- learning rate
    
    Return:
    params -- a python dict containing all updated parameters
    """
    ### START your code ###
    # Update each parameter with its gradient and store it back in params
    params['W1'] = params['W1'] - alpha * grads['dW1']
    params['b1'] = params['b1'] - alpha * grads['db1']
    params['W2'] = params['W2'] - alpha * grads['dW2']
    params['b2'] = params['b2'] - alpha * grads['db2']
    ### END your code ###
    return params

In [11]:
# Evaluate Task. Don't change code in this cell.
params_tmp, grads_tmp = update_params_testcase()

params = update_params(params_tmp, grads_tmp, 0.01)
print('W1 =', params['W1'])
print('b1 =', params['b1'])
print('W2 =', params['W2'])
print('b2 =', params['b2'])

W1 = [[ 0.004169   -0.00056367 -0.02136304]
 [ 0.0163645  -0.01790747 -0.00838857]
 [ 0.00504726 -0.01246588 -0.01059348]
 [-0.00911046  0.0055289   0.0229375 ]]
b1 = [[-4.13852251e-07  1.12173654e-05 -5.39304763e-06  5.94305036e-06]]
W2 = [[ 0.00048642 -0.011058    0.00546531 -0.00606545]]
b2 = [[-0.00099984]]


**Expected output**

W1 = [[ 0.004169   -0.00056367 -0.02136304]<br>
 [ 0.0163645  -0.01790747 -0.00838857]<br>
 [ 0.00504726 -0.01246588 -0.01059348]<br>
 [-0.00911046  0.0055289   0.0229375 ]]<br>
b1 = [[-4.13852251e-07  1.12173654e-05 -5.39304763e-06  5.94305036e-06]]<br>
W2 = [[ 0.00048642 -0.011058    0.00546531 -0.00606545]]<br>
b2 = [[-0.00099984]]<br>

***

### 1.5 Integrated model
**1.5 points**

Integrate `init_params`, `forward_prop`, `backward_prop` and `update_params` into one model.
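
The training loop also reports a cost, the mean binary cross-entropy between the output activation and the labels. Below is a minimal sketch of that cost with clipping added so that `np.log` never receives exactly 0. The helper name `bce_cost` and the constant `eps` are hypothetical and not part of the assignment.

```python
import numpy as np

def bce_cost(a2, y, eps=1e-12):
    """Mean binary cross-entropy; eps is a hypothetical clipping constant."""
    a2 = np.clip(a2, eps, 1.0 - eps)  # keep log() away from exactly 0
    return float(-np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2)))

y = np.array([[1.0], [0.0]])
c_uninformed = bce_cost(np.array([[0.5], [0.5]]), y)  # equals log(2), about 0.693
c_perfect = bce_cost(np.array([[1.0], [0.0]]), y)     # close to 0, stays finite

print(c_uninformed)
print(c_perfect)
```

An untrained network with near-zero weights outputs roughly 0.5 for every sample, which is why the first printed cost in training is close to $\log 2 \approx 0.693$.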

In [20]:
def nn_model(X, y, n_h, num_iters=10000, alpha=0.01, verbose=False):
    """
    Args:
    X -- training data of shape (m,n_in)
    y -- training label of shape (m,1)
    n_h -- size of hidden layer
    num_iters -- number of iterations for gradient descent
    alpha -- learning rate
    verbose -- if True, print cost every 1000 steps
    
    Return:
    params -- parameters learned by the model. Use these to make predictions on new data
    """
    np.random.seed(3)
    m = X.shape[0]
    n_in = X.shape[1]
    n_out = 1
    
    # Initialize parameters and retrieve them, use init_params
    ### START your code ###
    params = init_params(n_in, n_h, n_out)
    ### END your code ###
    
    # Gradient descent loop
    for i in range(num_iters):
        ### START your code ###
        # Forward propagation
        a2, cache = forward_prop(X, params)
        
        # Backward propagation
        grads = backward_prop(X, y, params, cache)
        
        # Update parameters
        params = update_params(params, grads, alpha)
        
        # Compute cost
        cost = - (1/m) * np.sum(y * np.log(a2) + (1-y) * np.log(1-a2))
        ### END your code ###
        
        # Print cost
        if i % 1000 == 0 and verbose:
            print('Cost after iter {}: {}'.format(i, cost))
    
    return params

In [21]:
# Evaluate Task 1.5. Don't change code in this cell
X_tmp, y_tmp = nn_model_testcase()

params_tmp = nn_model(X_tmp, y_tmp, n_h=5, num_iters=5000, alpha=0.01)
print('W1 =', params_tmp['W1'])
print('b1 =', params_tmp['b1'])
print('W2 =', params_tmp['W2'])
print('b2 =', params_tmp['b2'])

W1 = [[ 0.33222292 -0.07076426 -0.11503028]
 [ 1.50266111  0.04429628 -0.20513728]
 [ 1.55493072  0.05030062 -0.21407279]
 [-1.58953216 -0.0567708   0.21934892]
 [ 0.43065024 -0.08839451 -0.10283363]]
b1 = [[-0.05502926 -0.33938949 -0.35337992  0.35972735 -0.07410474]]
W2 = [[ 0.39386751  1.923321    2.00150814 -2.04849592  0.50569197]]
b2 = [[-0.44031534]]


**Expected output**

W1 = [[ 0.33222292 -0.07076426 -0.11503028]<br>
 [ 1.50266111  0.04429628 -0.20513728]<br>
 [ 1.55493072  0.05030062 -0.21407279]<br>
 [-1.58953216 -0.0567708   0.21934892]<br>
 [ 0.43065024 -0.08839451 -0.10283363]]<br>
b1 = [[-0.05502926 -0.33938949 -0.35337992  0.35972735 -0.07410474]]<br>
W2 = [[ 0.39386751  1.923321    2.00150814 -2.04849592  0.50569197]]<br>
b2 = [[-0.44031534]]<br>

***

### 1.6 Predict
**1 point**

Use the learned parameters to make predictions on new data. 
- Compute $a^{[2]}$ by calling `forward_prop`. Note that the `cache` returned will not be used in making predictions.
- Convert $a^{[2]}$ into a vector of 0 and 1.
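
The conversion step can be sketched on a few made-up activations. The values in `a2` below are hypothetical, and the 0.5 threshold is the conventional choice for a sigmoid output.

```python
import numpy as np

a2 = np.array([[0.1], [0.5], [0.73], [0.49]])  # hypothetical sigmoid outputs

# Strict comparison: an activation of exactly 0.5 maps to class 0
pred = (a2 > 0.5).astype(float)

print(pred.ravel())  # [0. 0. 1. 0.]
```

The result keeps the `(m, 1)` column shape of `a2`, so it can be compared element-wise with labels of the same shape.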

In [15]:
def predict(X, params):
    """
    Args:
    X -- input data of shape (m,n_in)
    params -- a python dict containing the learned parameters
    
    Return:
    pred -- predictions of model on X, a vector of 0s and 1s
    """
    
   
    ### START your code ###
    a2, _ = forward_prop(X, params)
    pred = (a2 > 0.5).astype(float)
    
    ### END your code ###
    
    
    return pred

In [16]:
# Evaluate Task 1.6. Don't change code in this cell
# NOTE: the X_tmp and params_tmp are the ones generated in evaluating Task 1.5 (two cells above)
pred = predict(X_tmp, params_tmp)
print('predictions = ', pred)

predictions =  [[0.]
 [1.]
 [1.]
 [0.]
 [0.]]


**Expected output**

predictions =  [[0.]<br>
 [1.]<br>
 [1.]<br>
 [0.]<br>
 [0.]]<br>

***

### 1.7 Train and evaluate

**0.5 point**

Train the neural network model on X_train and y_train, and evaluate it on X_test and y_test.

You can use the code from the previous assignment for Logistic Regression and Evaluation Metrics to compute the accuracy of your predictions.
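
Accuracy is the fraction of predictions that match the labels. The sketch below uses made-up predictions and labels, and also shows a shape pitfall: comparing an `(m, 1)` array with an `(m,)` array broadcasts to `(m, m)` and silently gives a wrong number.

```python
import numpy as np

pred = np.array([[0.], [1.], [1.], [0.]])   # hypothetical predictions, shape (4, 1)
y_col = np.array([[0.], [1.], [0.], [0.]])  # hypothetical labels, shape (4, 1)
y_flat = y_col.ravel()                      # same labels, shape (4,)

# Matching shapes: element-wise comparison, then the mean of the matches
acc = np.mean(pred == y_col)
print(acc)  # 0.75

# Mismatched shapes: (4, 1) == (4,) broadcasts to a (4, 4) matrix
bad = (pred == y_flat)
print(bad.shape)  # (4, 4)
```

Checking that `pred.shape == y_test.shape` before computing the mean guards against this.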

In [22]:
# Train the model on X_train and y_train, and print cost
# DO NOT change the hyperparameters, so that your output matches the expected one.
X_train_orig, y_train_orig, X_test_orig, y_test_orig = load_data()
X_train = X_train_orig.T
y_train = y_train_orig.T
X_test = X_test_orig.T
y_test = y_test_orig.T

params = nn_model(X_train, y_train, n_h = 10, num_iters=10000, verbose=True)

# Make predictions on X_test
pred = predict(X_test, params)


# Compute accuracy (acc) by comparing predictions and y_test
### START YOUR CODE ###



acc = np.mean(pred == y_test)
### END YOUR CODE ###
print('Accuracy = {0:.2f}%'.format(acc * 100))


Cost after iter 0: 0.6931077265775999
Cost after iter 1000: 0.27178099586018784
Cost after iter 2000: 0.054659232836622856
Cost after iter 3000: 0.024305693413129683
Cost after iter 4000: 0.014585870325865133
Cost after iter 5000: 0.01012561547692077
Cost after iter 6000: 0.00764059619251161
Cost after iter 7000: 0.00608120133544274
Cost after iter 8000: 0.005021476713730624
Cost after iter 9000: 0.00425915815722728
Accuracy = 95.20%


**Expected output**

Cost after iter 0: 0.6931077265775999<br>
Cost after iter 1000: 0.24817555119209228<br>
Cost after iter 2000: 0.05465982816946285<br>
Cost after iter 3000: 0.02429722600673885<br>
Cost after iter 4000: 0.014580135588662868<br>
Cost after iter 5000: 0.010121506027869343<br>
Cost after iter 6000: 0.007637453828781526<br>
Cost after iter 7000: 0.006078683154881118<br>
Cost after iter 8000: 0.005019389206548974<br>
Cost after iter 9000: 0.004257383349387319<br>
Accuracy = 95.20%<br>
***<br>
Ignore any warnings