Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
ID = ""

---

# 01-Introduction to Python and Deep Learning

Reference:
- Coursera - Deep learning specialization

This lab introduces Python and numpy, and uses them to implement basic deep learning functions.

## About Jupyter Notebooks

Jupyter Notebooks are interactive coding environments embedded in a webpage. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run Cell" (denoted by a play symbol) in the upper bar of the notebook.

We will often specify "(≈ X lines of code)" in the comments to tell you about how much code you need to write. It is just a rough estimate, so don't feel bad if your code is longer or shorter.

Remember that you can add cells, but do not delete the cells that I have created.

### Exercise:

Set `test` to "Hello World" in the cell below, then run the two cells below to print "Hello World".

In [None]:
# Grade cell - do not remove
test = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# test function - do not remove

print("You write", test)

assert "Hello" in test, "Wording is incorrect"
assert "World" in test, "Wording is incorrect"
assert test[6] == 'W', "sequence is incorrect"

**Expected output**: You write Hello World

## Building basic functions with numpy
Numpy is the main package for scientific computing in Python. It is maintained by a large community (www.numpy.org). In this exercise you will learn several key numpy functions such as np.exp, np.log, and np.reshape. You will need to know how to use these functions for future assignments.

### sigmoid function

The sigmoid function is also known as the logistic function. Its equation is

$$sigmoid(x)=\frac{1}{1+e^{-x}}$$

The output graph is:

<img src="img/Sigmoid.png" title="Sigmoid graph" style="width: 400px;" />

Before using np.exp(), you will use math.exp() to implement the sigmoid function. You will then see why np.exp() is preferable to math.exp().

### Exercise

Build a function that returns the sigmoid of a real number x. Use math.exp(x) for the exponential function.

In [None]:
# Grade cell - do not remove

import math

def math_sigmoid(x):
    '''
    Compute sigmoid of x
    '''
    z = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return z

In [None]:
# test function - do not remove
print(math_sigmoid(5))

assert math_sigmoid(10) > 0.9999, "Calculate error"
assert math_sigmoid(-10) < 0.0001, "Calculate error"
assert math_sigmoid(0) == 0.5, "Calculate error"

**Expected output**: 0.9933071490757153

Actually, we rarely use the "math" library in deep learning because its functions only accept single real numbers as input. In deep learning we mostly use matrices and vectors. This is why numpy is more useful.

In [None]:
### One reason why we use "numpy" instead of "math" in Deep Learning ###
x = [1, 2, 3]
math_sigmoid(x) # this raises an error when you run it, because x is a list, not a real number.

In fact, if $x=(x_1,x_2,\dots,x_n)$ is a row vector, then `np.exp(x)` applies the exponential function to every element of $x$. The output is thus:

$$np.exp(x)=(e^{x_1},e^{x_2},\dots,e^{x_n})$$


In [None]:
import numpy as np

# example of np.exp
x = np.array([1, 2, 3])
print(np.exp(x)) # result is (exp(1), exp(2), exp(3))

Furthermore, if $x$ is a vector, then a Python operation such as $s=x+3$ or $s=\frac{1}{x}$ outputs $s$ as a vector of the same size as $x$.

In [None]:
# example of vector operation
x = np.array([1, 2, 3])
print (x + 3)

Any time you need more information on a numpy function, we encourage you to look at the [documentation](https://www.numpy.org).

You can also create a new cell in the notebook and write np.exp? (for example) to get quick access to the documentation.

### Exercise

Implement the sigmoid function using numpy.

**Hint**: $x$ could now be either a real number, a vector, or a matrix. The data structures we use in numpy to represent these shapes (vectors, matrices...) are called numpy arrays.

In [None]:
# Grade cell - do not remove

import numpy as np # you can access numpy functions by writing np.function() instead of numpy.function()

def sigmoid(x):
    """
    Compute the sigmoid of x
    """
    s = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return s

In [None]:
# test function - do not remove

x = np.array([1, 2, 3])

print(sigmoid(x))

assert sigmoid(x).shape[0] == 3, "Output is incorrect" 
assert sigmoid(np.array([1,2,3,4])).shape[0] == 4, "Output is incorrect"
assert sigmoid(100) > 0.99999, "Calculation is incorrect"

**Expected output**: [0.73105858 0.88079708 0.95257413]

## Sigmoid gradient

Next, let's compute the gradient of the sigmoid function, which is used when optimizing loss functions. The sigmoid gradient is given by

$$\nabla \sigma(x)= \sigma'(x) = \sigma(x)(1-\sigma(x))$$
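One way to convince yourself that this identity holds is to compare it against a numerical derivative. The sketch below is only an illustration: `logistic` is a hypothetical helper defined for this check (it is not the graded function), and the step size `eps` is an arbitrary choice.

```python
import numpy as np

def logistic(x):
    # Hypothetical helper for this check: elementwise 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 1.5])

# Analytic gradient from the identity sigma'(x) = sigma(x) * (1 - sigma(x)).
s = logistic(x)
analytic = s * (1.0 - s)

# Central finite difference as an independent estimate of the slope.
eps = 1e-6
numeric = (logistic(x + eps) - logistic(x - eps)) / (2 * eps)

print(analytic)
print(numeric)
print(np.allclose(analytic, numeric, atol=1e-8))
```

The two arrays should agree to many decimal places, and the gradient at $x=0$ is exactly $0.25$, the maximum slope of the sigmoid.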

### Exercise
Implement the function sigmoid_grad() to compute the gradient of the sigmoid function with respect to its input x. 

**Hint**:
1. Set s to be the sigmoid of x. You might find your sigmoid(x) function useful.
2. Compute $\sigma'(x) = s(1-s)$ 


In [None]:
# Grade cell - do not remove

def sigmoid_grad(x):
    """
    Compute the gradient (also called the slope or derivative) of the sigmoid function with respect to its input x.
    You can store the output of the sigmoid function into variables and then use it to calculate the gradient.
    """
    ds = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return ds

In [None]:
# test function - do not remove

x = np.array([1, 2, 3])
print ("sigmoid_gradient(x) = " + str(sigmoid_grad(x)))

assert sigmoid_grad(x).shape[0] == 3, "Output shape is incorrect"
assert sigmoid_grad(np.array([1,2,3,4])).shape[0] == 4, "Output is incorrect"
assert sigmoid_grad(2) > 0.1, "Gradient calculation is incorrect"
assert sigmoid_grad(30) < 0.000000000001, "Gradient calculation is incorrect"

**Expected Output**: sigmoid_gradient(x) = [0.19661193 0.10499359 0.04517666]

##  Reshaping arrays

Two common numpy functions used in deep learning are np.shape and np.reshape().

- <code>X.shape</code> is used to get the shape (dimension) of a matrix/vector $X$.
- <code>X.reshape(...)</code> is used to reshape $X$ into some other dimension.

In computer science, an image is represented by a 3D array of shape (length, height, depth). However, when you read an image as the input of an algorithm you convert it to a vector of shape (length*height*depth, 1).

<img src="img/image2vector_kiank.png" title="image2vector_kiank" style="width: 600px;" />
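To see how reshaping behaves, here is a small sketch on a toy array. The array `toy` and its size are made up for illustration. The `-1` argument tells numpy to infer that dimension from the total number of elements.

```python
import numpy as np

# A toy "image" of shape (2, 2, 3): 2x2 pixels with 3 channels each.
toy = np.arange(12).reshape(2, 2, 3)
print(toy.shape)

# -1 asks numpy to infer this dimension from the number of elements.
column = toy.reshape(-1, 1)
print(column.shape)

# Reshaping back recovers the original array because the element order is unchanged.
restored = column.reshape(toy.shape)
print(np.array_equal(restored, toy))
```

Reshaping never changes the data or its order, only how the same elements are indexed.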

### Exercise

Implement image2vector() that takes an input of shape (length, height, 3) and returns a vector of shape (length*height*3, 1).

*Do not hardcode the dimensions of the image as constants.*

In [None]:
# Grade cell - do not remove

def image2vector(image):
    """
    Convert image with 3 dimensions to become vector of (size, 1)
    """
    
    v = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return v

In [None]:
# test function - do not remove

image = np.array([[[ 0.67826139,  0.29380381],
        [ 0.90714982,  0.52835647],
        [ 0.4215251 ,  0.45017551]],

       [[ 0.92814219,  0.96677647],
        [ 0.85304703,  0.52351845],
        [ 0.19981397,  0.27417313]],

       [[ 0.60659855,  0.00533165],
        [ 0.10820313,  0.49978937],
        [ 0.34144279,  0.94630077]]])

print ("image2vector(image) = " + str(image2vector(image)))

import cv2

img = cv2.imread("lena.png")

print(img.shape)
vecimg = image2vector(img)
print(vecimg.shape)

assert image2vector(image).shape == (image.shape[0] * image.shape[1] * image.shape[2], 1), "Dimension is incorrect"

**Expected Output**:
image2vector(image) = [[0.67826139] \
 [0.29380381] \
 [0.90714982] \
 [0.52835647] \
 [0.4215251 ]\
 [0.45017551]\
 [0.92814219]\
 [0.96677647]\
 [0.85304703]\
 [0.52351845]\
 [0.19981397]\
 [0.27417313]\
 [0.60659855]\
 [0.00533165]\
 [0.10820313]\
 [0.49978937]\
 [0.34144279]\
 [0.94630077]]\
(512, 512, 3)\
(786432, 1)


## Normalizing rows

Normalization is a common technique in Machine Learning and Deep Learning. It keeps the values in a bounded range so that they do not overflow, and it often leads to better performance because gradient descent converges faster after normalization. For example, take the matrix $A$

$$A=\begin{bmatrix}8&1&2 \\ 3&9&5 \end{bmatrix}$$

We can compute the norm of each row with the <code>np.linalg.norm()</code> function. Here we normalize by row, so

$$||A|| = np.linalg.norm(A,axis=1,keepdims=True)$$

and normalize the matrix by

$$norm(A) = \frac{A}{||A||}$$


Note that you can divide matrices of different sizes and it works fine: this is called broadcasting.
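The shapes involved in this division can be inspected directly. The following is a minimal sketch using the matrix $A$ from above. The names `row_norms` and `unit_rows` are chosen here for illustration only.

```python
import numpy as np

A = np.array([[8, 1, 2],
              [3, 9, 5]])

# keepdims=True keeps the reduced axis, so the result has shape (2, 1), not (2,).
row_norms = np.linalg.norm(A, axis=1, keepdims=True)
print(row_norms.shape)

# (2, 3) / (2, 1): the single column of norms is broadcast across the 3 columns.
unit_rows = A / row_norms
print(unit_rows)

# Each row now has length 1.
print(np.linalg.norm(unit_rows, axis=1))
```

Without `keepdims=True` the norms would have shape `(2,)`, which does not broadcast against a `(2, 3)` matrix, so the division would raise an error.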

### Exercise

Implement <code>normalizeRows()</code> to normalize the rows of a matrix. After applying this function to an input matrix $x$, each row of $x$ should be a vector of unit length (meaning length 1).

In [None]:
# Grade cell - do not remove

def normalizeRows(x):
    """
    Implement a function that normalizes each row of the matrix x (to have unit length).
    """
    
    norm_x = None
    # YOUR CODE HERE
    raise NotImplementedError()

    return norm_x

In [None]:
# test function - do not remove

A = np.array([
    [8, 1, 2],
    [3, 9, 5]])

norm_A = normalizeRows(A)
print("normalizeRows(A) = " + str(norm_A))

sqr_A = norm_A * norm_A
sum_A = np.sum(sqr_A, axis = 1)
print("prove of A in each row", sum_A)

assert sum_A.shape == (2,), "normalize must do in row" 
assert np.round(sum_A[0], 1) == 1 and np.round(sum_A[1], 1) == 1, "Normalize incorrect"

**Expected Output**:
normalizeRows(A) = [[0.96308682 0.12038585 0.24077171]\
 [0.27975144 0.83925433 0.4662524 ]]\
prove of A in each row [1. 1.]

## Broadcasting and the softmax function

Broadcasting is a very important concept in numpy. It performs mathematical operations between arrays of different shapes.
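As a rough illustration of broadcasting, the sketch below adds a column, a row, and a scalar to the same matrix. The arrays are invented for this example.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # shape (2, 3)

col = np.array([[10.0], [20.0]])           # shape (2, 1): one value per row
row = np.array([100.0, 200.0, 300.0])      # shape (3,): one value per column

print(M + col)   # col is stretched across the 3 columns
print(M + row)   # row is stretched across the 2 rows
print(M * 2)     # a scalar is stretched to every element
```

In each case numpy behaves as if the smaller array had been repeated to match the larger one.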

The softmax function, also called the normalized exponential function, converts a vector into a probability distribution. The equation is

$$softmax(x) = softmax([x_1,x_2,\cdots,x_n])=
\begin{bmatrix}
\frac{e^{x_1}}{\sum_j e^{x_j}} & \frac{e^{x_2}}{\sum_j e^{x_j}} & \cdots & \frac{e^{x_n}}{\sum_j e^{x_j}}
\end{bmatrix}$$

When the input is a matrix, softmax must be applied to each row independently.

### Exercise
Implement a softmax function using numpy. You can think of softmax as a normalizing function used when your algorithm needs to classify two or more classes.

In [None]:
# Grade cell - do not remove

def softmax(x):
    """
    Calculates the softmax for each row of the input x.

    The code should work for a row vector and also for matrices of shape (m,n).
    """
    s = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return s

In [None]:
# test function - do not remove

x = np.array([
    [8, 1, 3, -2, 0],
    [7, 5, 4, 1 ,0]])

s = softmax(x)
print("softmax(x) = " + str(softmax(x)))

sum_s = np.sum(s, axis = 1)
print(sum_s)

assert s.shape == (2, 5), "Softmax is incorrect"
assert np.round(sum_s[0]) == 1 and np.round(sum_s[1]) == 1, "Softmax summation output is incorrect"

**Expected Output**:
softmax(x) = [[9.92033287e-01 9.04617263e-04 6.68426771e-03 4.50382415e-05\
  3.32790093e-04]\
 [8.41387525e-01 1.13869419e-01 4.18902183e-02 2.08559116e-03\
  7.67246110e-04]]\
[1. 1.]


## Vectorization

In deep learning, you need to deal with very large datasets. To make sure that your code is computationally efficient, you will use vectorization. For example, try to tell the difference between the following implementations of the dot/outer/elementwise product.

In [None]:
import time

x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

### CLASSIC DOT PRODUCT OF VECTORS IMPLEMENTATION ###
tic = time.process_time()
dot = 0
for i in range(len(x1)):
    dot+= x1[i]*x2[i]
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC OUTER PRODUCT IMPLEMENTATION ###
tic = time.process_time()
outer = np.zeros((len(x1),len(x2))) # we create a len(x1)*len(x2) matrix with only zeros
for i in range(len(x1)):
    for j in range(len(x2)):
        outer[i,j] = x1[i]*x2[j]
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC ELEMENTWISE IMPLEMENTATION ###
tic = time.process_time()
mul = np.zeros(len(x1))
for i in range(len(x1)):
    mul[i] = x1[i]*x2[i]
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC GENERAL DOT PRODUCT IMPLEMENTATION ###
W = np.random.rand(3,len(x1)) # Random 3*len(x1) numpy array
tic = time.process_time()
gdot = np.zeros(W.shape[0])
for i in range(W.shape[0]):
    for j in range(len(x1)):
        gdot[i] += W[i,j]*x1[j]
toc = time.process_time()
print ("gdot = " + str(gdot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

In [None]:
x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

### VECTORIZED DOT PRODUCT OF VECTORS ###
tic = time.process_time()
dot = np.dot(x1,x2)
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED OUTER PRODUCT ###
tic = time.process_time()
outer = np.outer(x1,x2)
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED ELEMENTWISE MULTIPLICATION ###
tic = time.process_time()
mul = np.multiply(x1,x2)
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED GENERAL DOT PRODUCT ###
tic = time.process_time()
dot = np.dot(W,x1)
toc = time.process_time()
print ("gdot = " + str(dot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

As you may have noticed, the vectorized implementation is much cleaner and more efficient. For bigger vectors/matrices, the differences in running time become even bigger.
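The gap is easier to see on larger inputs. The sketch below is an assumption-laden illustration, not a benchmark: the vector length `n` is arbitrary, and it uses `time.perf_counter()` (wall-clock time) instead of the `time.process_time()` used in the cells above. Timings depend on your machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = rng.random(n)
b = rng.random(n)

# Explicit Python loop.
tic = time.perf_counter()
loop_dot = 0.0
for i in range(n):
    loop_dot += a[i] * b[i]
loop_ms = 1000 * (time.perf_counter() - tic)

# Vectorized numpy call.
tic = time.perf_counter()
vec_dot = np.dot(a, b)
vec_ms = 1000 * (time.perf_counter() - tic)

print(f"loop:  {loop_dot:.4f} in {loop_ms:.2f} ms")
print(f"numpy: {vec_dot:.4f} in {vec_ms:.2f} ms")
```

Both versions compute the same value up to floating-point rounding. Only the running time differs.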

## Implement the L1 and L2 loss functions

The L1 and L2 losses are used to evaluate the performance of your model. The bigger your loss is, the more your predictions $\hat{y}$ differ from the true values $y$. In deep learning, optimization algorithms such as gradient descent train the model by minimizing the cost.

The L1 loss is defined as

$$L_1(\hat{y},y)=\sum_{i=0}^m |y^{(i)}-\hat{y}^{(i)}|$$

### Exercise

Implement the numpy vectorized version of the L1 loss. Use the function np.abs() to apply the equation.

In [None]:
# Grade cell - do not remove

def L1(yhat, y):
    loss = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return loss

In [None]:
# test function - do not remove

yhat = np.array([.9, 0.2, 0.1, .4, .9])
yhat2 = np.array([.1, 0.7, 0.4, 0.7, .8, 0.2])
y = np.array([1, 0, 0, 1, 1])
y2 = np.array([1, 0, 0, 1, 1, 0])
l1 = L1(yhat,y)
l2 = L1(yhat2,y2)
print("L1 of output 1 = " + str(l1))
print("L1 of output 2 = " + str(l2))

assert np.round(l1,1) == 1.1, "L1 loss is incorrect" 
assert np.round(l2,1) == 2.7, "L1 loss is incorrect" 

**Expected Output**:\
L1 of output 1 = 1.1 \
L1 of output 2 = 2.7

The L2 loss is defined as

$$L_2(\hat{y},y)=\sum_{i=0}^m (y^{(i)}-\hat{y}^{(i)})^2$$
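To make the difference between the absolute-value sum and the squared sum concrete, here is a small worked example. The vectors `y_true` and `y_pred` are invented so that the arithmetic can be checked by hand.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.5, 0.5, 1.0])

diff = y_true - y_pred             # [0.5, -0.5, 0.0]

l1_value = np.sum(np.abs(diff))    # 0.5 + 0.5 + 0.0
l2_value = np.sum(diff ** 2)       # 0.25 + 0.25 + 0.0

print(l1_value)  # 1.0
print(l2_value)  # 0.5
```

Because errors smaller than 1 shrink when squared, the squared sum here is smaller than the absolute sum. For errors larger than 1 the opposite holds, which is why the squared loss penalizes large errors more heavily.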

### Exercise

Implement the numpy vectorized version of the L2 loss.

In [None]:
# Grade cell - do not remove

def L2(yhat, y):
    loss = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return loss

In [None]:
# test function - do not remove

yhat = np.array([.9, 0.2, 0.1, .4, .9])
yhat2 = np.array([.1, 0.7, 0.4, 0.7, .8, 0.2])
y = np.array([1, 0, 0, 1, 1])
y2 = np.array([1, 0, 0, 1, 1, 0])
l1 = L2(yhat,y)
l2 = L2(yhat2,y2)
print("L2 of output 1 = " + str(l1))
print("L2 of output 2 = " + str(l2))

assert np.round(l1,2) == 0.43, "L2 loss is incorrect" 
assert np.round(l2,2) == 1.63, "L2 loss is incorrect" 

## Classification with a Neural Network from scratch

Now, let's create a neural network to recognize handwritten digits.

First, import the necessary libraries.

In [None]:
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

Then load the dataset and inspect the contents and shapes of X and y.

In [None]:
# Load data
data = load_digits()

y_indices = data.target
X = np.matrix(data.data)

print('y shape:', y_indices.shape)
print('X shape:', X.shape)

print('y data:', y_indices[0:15])
print('X data:', X[0:5])

data_size = X.shape[0]
x_area = X.shape[1]

Next, display the X data as images. Because the X data for each image is a 1D vector,
you need to convert it to an image of size 8x8.

### Exercise

Create a function that converts the X data of one image into a square image (8x8 for this dataset). A good function should derive the image size from the length of the vector instead of hardcoding it.

**Hint**: use np.reshape()

In [None]:
# Grade cell - do not remove

def convert_image(X_one_image):
    img = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return img

In [None]:
# test function - do not remove

img_0 = convert_image(X[0])
plt.imshow(img_0, 'gray')
plt.title('Example MNIST sample (category %d)' % y_indices[0])
plt.show()

img_5 = convert_image(X[5,:])
plt.imshow(img_5, 'gray')
plt.title('Example MNIST sample (category %d)' % y_indices[5])
plt.show()

test_v = np.empty([1,256])
test = convert_image(test_v)

assert img_0.shape == (8,8) and img_5.shape == (8,8), 'Image reshape is incorrect'
assert test.shape == (16,16), 'Image reshape is incorrect'
assert img_0[3,6] == X[0, 30] and img_5[4,2] == X[5, 34], 'Image reshape is incorrect'

**Expected Output**:

<img src="img/1expect.png" title="Expect value 0" style="width: 200px;" />
<img src="img/2expect.png" title="Expect value 5" style="width: 200px;" />

## One hot encoding

As you can see, the output y is a class index. To use it for classification in deep learning, you need to convert it to a one-hot vector.

One-hot encoding converts a categorical variable into a binary vector with a single 1 at the position of the category, a form that ML algorithms can work with directly.

Here, you need to convert each index value as follows:

$$0 \rightarrow [1, 0,0,0,0,0,0,0,0,0]$$
$$1  \rightarrow  [0, 1,0,0,0,0,0,0,0,0]$$
$$2  \rightarrow  [0, 0,1,0,0,0,0,0,0,0]$$
$$3  \rightarrow  [0, 0,0,1,0,0,0,0,0,0]$$
$$4  \rightarrow  [0, 0,0,0,1,0,0,0,0,0]$$
$$5  \rightarrow  [0, 0,0,0,0,1,0,0,0,0]$$
$$6  \rightarrow  [0, 0,0,0,0,0,1,0,0,0]$$
$$7  \rightarrow  [0, 0,0,0,0,0,0,1,0,0]$$
$$8  \rightarrow  [0, 0,0,0,0,0,0,0,1,0]$$
$$9  \rightarrow  [0, 0,0,0,0,0,0,0,0,1]$$
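One possible way to build such vectors, shown here on a toy 3-class problem, is to index the rows of an identity matrix. This is a sketch of one technique among several. The names `labels` and `num_classes` are made up for the example.

```python
import numpy as np

labels = np.array([2, 0, 1, 2])   # toy class indices
num_classes = 3

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing the identity matrix with the labels picks one row per label.
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# argmax along each row recovers the original indices.
print(np.argmax(one_hot, axis=1))
```

Each row contains exactly one 1, and its position equals the class index.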

### Exercise

Implement the one-hot conversion function.

In [None]:
# Grade cell - do not remove

def convert_to_one_hot(y, onehot_size):
    y_vect = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return y_vect

In [None]:
# test function - do not remove

y = convert_to_one_hot(y_indices, 10)
print(y.shape)
print(y[3])
assert y.shape[1] == 10 and y.shape[0] == 1797, "One hot size is incorrect"
assert y[14, 8] == 0 and y[177,1] == 1, "One hot value is incorrect"

**Expected Output**:\
(1797, 10)\
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]

## Normalize input feature

Now, normalize the input X. The normalization equation is

$$norm(X) = \frac{X-\bar{X}}{SD}$$
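The sketch below applies this formula to a small made-up matrix. It assumes that the mean and standard deviation are taken per feature (per column, `axis=0`), which the formula itself does not spell out. The last column is constant, so its standard deviation is 0 and the division produces `nan`, which `np.nan_to_num()` then replaces with 0.

```python
import numpy as np

data = np.array([[1.0, 5.0, 0.0],
                 [2.0, 7.0, 0.0],
                 [3.0, 9.0, 0.0]])

mean = data.mean(axis=0)   # per-column mean (assumption: one feature per column)
std = data.std(axis=0)     # the last column has standard deviation 0

# 0/0 yields nan and numpy warns about it; errstate silences the warning for this demo.
with np.errstate(divide="ignore", invalid="ignore"):
    scaled = (data - mean) / std

scaled = np.nan_to_num(scaled)   # replace nan with 0
print(scaled)
```

After scaling, each non-constant column has mean 0 and standard deviation 1, and the constant column is all zeros.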

### Exercise

Write the normalization code. If some values are nan, change them to zero using np.nan_to_num().

In [None]:
# Grade cell - do not remove

def normalize(X):
    XX = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return XX

In [None]:
# test function - do not remove

XX = normalize(X)

print(XX[0])

assert XX.shape == X.shape, "Normalize function is incorrect"
assert np.max(XX[0]) < 2 and np.min(XX[0]) > -2, "Data is not normalize"

**Expected Output**:\
[[ 0.         -0.33501649 -0.04308102  0.27407152 -0.66447751 -0.84412939\
  -0.40972392 -0.12502292 -0.05907756 -0.62400926  0.4829745   0.75962245\
  -0.05842586  1.12772113  0.87958306 -0.13043338 -0.04462507  0.11144272\
   0.89588044 -0.86066632 -1.14964846  0.51547187  1.90596347 -0.11422184\
  -0.03337973  0.48648928  0.46988512 -1.49990136 -1.61406277  0.07639777\
   1.54181413 -0.04723238  0.          0.76465553  0.05263019 -1.44763006\
  -1.73666443  0.04361588  1.43955804  0.         -0.06134367  0.8105536\
   0.63011714 -1.12245711 -1.06623158  0.66096475  0.81845076 -0.08874162\
  -0.03543326  0.74211893  1.15065212 -0.86867056  0.11012973  0.53761116\
  -0.75743581 -0.20978513 -0.02359646 -0.29908135  0.08671869  0.20829258\
  -0.36677122 -1.14664746 -0.5056698  -0.19600752]]

## Split data

In deep learning, you need to split your raw data into 3 sets:
1. Training set - data used to train the network.
2. Validation set - data used to test the network after each epoch (training loop).
3. Test set - data used to test the final network after training. It estimates the accuracy you can expect when the network is actually used.

When splitting the data into training, validation, and test sets, make sure that
1. The data is shuffled randomly.
2. The validation and test sets come from the same environment as the training set, but do not contain the same samples as the training set.
3. A large training set can make your model more accurate, but the validation and test sets must still cover the conditions you care about.

Normally, we split the data by percentage. The ratios are not fixed, and you can adjust them.
- 60% training, 20% validation, and 20% test for datasets of over 1 million samples
- 80% training, 10% validation, and 10% test otherwise.
- For very small datasets (~1000 samples), we can use the same data for the validation and test sets.

There are other tricks for splitting very small datasets, but we do not cover them here.
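A three-way split along these lines could be sketched as follows. Everything here is illustrative: the toy arrays `features` and `targets`, the 80/10/10 ratio, and the use of `rng.permutation` for shuffling are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100
features = np.arange(n_samples * 2).reshape(n_samples, 2)  # toy data
targets = np.arange(n_samples)

# Shuffle the indices once, then cut the shuffled order into three parts.
order = rng.permutation(n_samples)
n_train = int(0.8 * n_samples)
n_val = int(0.1 * n_samples)

tr_idx = order[:n_train]
va_idx = order[n_train:n_train + n_val]
te_idx = order[n_train + n_val:]

# Index features and targets with the same indices so the pairs stay aligned.
X_tr, y_tr = features[tr_idx], targets[tr_idx]
X_va, y_va = features[va_idx], targets[va_idx]
X_te, y_te = features[te_idx], targets[te_idx]

print(X_tr.shape, X_va.shape, X_te.shape)
```

Because all three index sets are slices of one shuffled permutation, no sample can appear in more than one set.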

### Exercise

Randomly split the data so that 60% forms the training set and the rest forms the test set.

**Hint**: use <code>np.arange</code> to create the index numbers from 0 to data_size - 1. You can shuffle the indices with <code>random.shuffle()</code>.

In [None]:
# Grade cell - do not remove

import random

percent_train = .6
# arange index number from 0 to data_size
idx = None
# random shuffle idx (1 line)

# calculate number of training set
m_train = None
# split train_idx and test_idx (uncomment these 2 lines)
# train_idx = idx[0:m_train]
# test_idx = idx[m_train:data_size+1]

# split to X_train and X_test
X_train = None
X_test = None

# split to y_train y_test and y_test_indices
y_train = None
y_test = None
y_test_indices = None

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# test function - do not remove

assert X_train.shape[0] == int(percent_train * data_size), "training size is incorrect"
assert data_size - m_train == X_test.shape[0], "test size is incorrect"
assert train_idx[0] != 0 and train_idx[25] != 25, "training indices are not shuffled"
assert X_train.shape == (m_train, XX.shape[1]) and y_train.shape[0] == m_train
assert X_test.shape == (data_size - m_train, XX.shape[1]) and y_test.shape[0] == data_size - m_train and y_test_indices.shape[0] == data_size - m_train

## General Architecture of the learning algorithm

It's time to design a simple algorithm to distinguish images of digits.

You will build a Logistic Regression, using a Neural Network mindset.

<img src="img/nn_mnist.jpeg" title="mnist neural network" style="width: 600px;" />

*Note*: the figure above is labeled SoftMin; it should read SoftMax.


### Exercise

Create the activation functions. You need to create 4 activation functions: ReLu, Tanh, Sigmoid, and Softmax.

The ReLu equation is

$$ReLu(x) = \max(0,x)$$

The Tanh equation is

$$Tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}$$

The Sigmoid equation is

$$Sigmoid(x) = \frac{1}{1+e^{-x}}$$

And the Softmax equation is

$$Softmax(x) = softmax([x_1,x_2,\cdots,x_n])=
\begin{bmatrix}
\frac{e^{x_1}}{\sum_j e^{x_j}} & \frac{e^{x_2}}{\sum_j e^{x_j}} & \cdots & \frac{e^{x_n}}{\sum_j e^{x_j}}
\end{bmatrix}$$

In [None]:
# Grade cell - do not remove

def ReLu(x):
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = ReLu(a)
print(y_hat)

assert y_hat.shape[0] == 5, "ReLu output is incorrect"
assert y_hat[3] > a[3] and y_hat[3] == 0, "ReLu output is incorrect"
assert y_hat[4] > a[4] and y_hat[4] == 0, "ReLu output is incorrect"
assert y_hat[0] == a[0], "ReLu output is incorrect"
assert y_hat[1] == a[1], "ReLu output is incorrect"
assert y_hat[2] == a[2], "ReLu output is incorrect"

**Expected Output**: [0.9 0.2 0.1 0.  0. ]

In [None]:
# Grade cell - do not remove

def Tanh(x):
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Tanh(a)
print(y_hat)

assert y_hat.shape[0] == 5, "Tanh output is incorrect"
assert np.round(y_hat[0],4) == 0.7163, "Tanh output is incorrect"
assert np.round(y_hat[1],4) == 0.1974, "Tanh output is incorrect"
assert np.round(y_hat[2],4) == 0.0997, "Tanh output is incorrect"
assert np.round(y_hat[3],4) == -0.2913, "Tanh output is incorrect"
assert np.round(y_hat[4],4) == -0.6044, "Tanh output is incorrect"

**Expected Output**: [ 0.71629787  0.19737532  0.09966799 -0.29131261 -0.60436778]

In [None]:
# Grade cell - do not remove

def Sigmoid(x):
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Sigmoid(a)
print(y_hat)

assert y_hat.shape[0] == 5, "sigmoid output is incorrect"
assert np.round(y_hat[0],4) == 0.7109, "sigmoid output is incorrect"
assert np.round(y_hat[1],4) == 0.5498, "sigmoid output is incorrect"
assert np.round(y_hat[2],4) == 0.5250, "sigmoid output is incorrect"
assert np.round(y_hat[3],4) == 0.4256, "sigmoid output is incorrect"
assert np.round(y_hat[4],4) == 0.3318, "sigmoid output is incorrect"

**Expected Output**: [0.7109495  0.549834   0.52497919 0.42555748 0.33181223]

In [None]:
# Grade cell - do not remove

def Softmax(x):
    """
    Calculates the softmax for each row of the input x.

    The code should work for a row vector and also for matrices of shape (m,n).
    """
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Softmax(a)
print(y_hat)

assert y_hat.shape[0] == 5, "Softmax output is incorrect"
assert np.round(y_hat[0],4) == 0.4083, "Softmax output is incorrect"
assert np.round(y_hat[1],4) == 0.2028, "Softmax output is incorrect"
assert np.round(y_hat[2],4) == 0.1835, "Softmax output is incorrect"
assert np.round(y_hat[3],4) == 0.1230, "Softmax output is incorrect"
assert np.round(y_hat[4],4) == 0.0824, "Softmax output is incorrect"

## Initializing parameters

Each layer has a weight matrix $W$ and a bias vector $b$. You can initialize them with small random numbers or with zeros. We create 3 layers as below.

In [None]:
h2 = 5
h1 = 6
W = [[], np.random.normal(0,0.1,[x_area,h1]),
         np.random.normal(0,0.1,[h1,h2]),
         np.random.normal(0,0.1,[h2,10])]
B = [[], np.random.normal(0,0.1,[h1,1]),
         np.random.normal(0,0.1,[h2,1]),
         np.random.normal(0,0.1,[10,1])]

act_funcs = [None, ReLu, Sigmoid, Softmax]

L = len(W)-1

##  Forward

For input $x^{(i)}$, the forward propagation in each layer can be calculated by
$$z^{(i)}=W^Tx^{(i)}+b$$
$$\hat{y}^{(i)}=a^{(i)}=act(z^{(i)})$$
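It helps to track the shapes in these two equations. The sketch below is only a shape walkthrough under stated assumptions: it uses plain numpy arrays with the `@` matrix-product operator, whereas the notebook's `X` is an `np.matrix` (for which `*` is the matrix product). All `*_demo` names are hypothetical, and ReLU is picked arbitrarily as the activation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3

# Weight layout (inputs, outputs), matching the W created in the cell above.
W_demo = rng.normal(0, 0.1, size=(n_in, n_out))
b_demo = np.zeros((n_out, 1))
x_demo = rng.random((n_in, 1))            # one sample as a column vector

# (n_out, n_in) @ (n_in, 1) + (n_out, 1) -> (n_out, 1)
z_demo = W_demo.T @ x_demo + b_demo

# Activation applied elementwise; ReLU is used here as an example.
a_demo = np.maximum(0, z_demo)

print(z_demo.shape, a_demo.shape)
```

The output of one layer has as many rows as the layer has units, which is what the next layer expects as its input.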

### Exercise

Create a forward_layer function that takes a user-defined activation function as input.

*Note*: If act_func is None, the layer uses the linear (identity) activation.

**Hint**: use <code>*</code> for multiplication 

In [None]:
# Grade cell - do not remove

def forward_layer(w, b, X, act_func):
    # z is linear function
    z = None
    # y_hat is output after activation function
    y_hat = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return z, y_hat

In [None]:
# test function - do not remove

X = np.array([[.9, 0.2, 0.1, -0.3, -0.7]]).T

w = np.array([[0.2, 0.1, 1, 3, 0.5]])
b = np.array([[1]])

z1, y_hat1 = forward_layer(w, b, X, None)
b = np.array([[0.5]])
z2, y_hat2 = forward_layer(w, 0.5, X, None)
print('Linear output of y_hat1', y_hat1, 'and', y_hat2)

assert y_hat1[2,0] == 1.1
assert np.round(y_hat2[3,0], 4) == -0.4

In [None]:
# test function - do not remove

X = np.array([.9, 0.2, 0.1, -0.3, -0.7])

w = np.array([0.2, 0.1, 1, 3, 0.5])

z1, y_hat1 = forward_layer(w, 1, X, ReLu)
z2, y_hat2 = forward_layer(w, 0.5, X, ReLu)
print('ReLu output of y_hat1', y_hat1, 'and', y_hat2)

assert y_hat1[3] > 0, "Forward layer is incorrect"
assert y_hat2[3] == 0, "Forward layer is incorrect"

In [None]:
# test function - do not remove

X = np.array([.9, 0.2, 0.1, -0.3, -0.7])

w = np.array([0.2, 0.1, 1, 3, 0.5])

z1, y_hat1 = forward_layer(w, 1, X, Tanh)
z2, y_hat2 = forward_layer(w, 0.5, X, Tanh)
print('Tanh output of y_hat1', y_hat1, 'and', y_hat2)

assert y_hat1.shape[0] == 5

In [None]:
# test function - do not remove

X = np.array([.9, 0.2, 0.1, -0.3, -0.7])

w = np.array([0.2, 0.1, 1, 3, 0.5])

z1, y_hat1 = forward_layer(w, 1, X, Sigmoid)
z2, y_hat2 = forward_layer(w, 0.5, X, Sigmoid)
print('Sigmoid output of y_hat1', y_hat1, 'and', y_hat2)

In [None]:
# test function - do not remove

X = np.array([[.9, 0.2, 0.1, -0.3, -0.7]]).T

w = np.array([[0.2, 0.1, 1, 3, 0.5], [0.3, 0.5, 0.1, -0.3, -0.5]])
b = np.array([-1,3])

z1, y_hat1 = forward_layer(w, b, X, Softmax)
print('Linear output of z1', z1)
print('Softmax output of y_hat1', y_hat1)

assert y_hat1.shape == (5, 2), "Forward layer is incorrect"

**Expected Output**:\
Linear output of y_hat1 0.050000000000000155 and -0.44999999999999984\
ReLu output of y_hat1 0.050000000000000155 and 0.0\
Tanh output of y_hat1 0.9975041607715679 and 0.822001229369054\
Sigmoid output of y_hat1 0.5124973964842104 and 0.389360766050778\
Linear output of z1 [-1.95  3.82]\
Softmax output of y_hat1 [0.00311005 0.99688995]

### Exercise

Implement the full forward propagation.

In [None]:
# Grade cell - do not remove

def forward_one_step(X, W, B, act_funcs):
    L = len(W)-1
    a = [X]
    z = [[]]
    delta = [[]]
    dW = [[]]
    db = [[]]
    for l in range(1,L+1):
        z_layer, a_layer = None, None
        # YOUR CODE HERE
        raise NotImplementedError()
        z.append(z_layer)
        a.append(a_layer)
        # Just to give arrays the right shape for the backprop step
        delta.append([]); dW.append([]); db.append([])
    return a, z, delta, dW, db

In [None]:
# test function - do not remove

x_this = X_train[0,:].T

a, z, delta, dW, db = forward_one_step(x_this, W, B, act_funcs)
print('size of a', len(a), 'a[3] =', a[3])
print('size of z', len(z), 'z[3] =', z[3])

assert len(a) == len(z) and len(a) == 4
assert a[0].shape == (64,1)
assert a[1].shape == (6,1)
assert a[2].shape == (5,1)
assert a[3].shape == (10,1)

**Expected Output** (your values may differ because the weights are initialized randomly):\
size of a 4 a[3] = [[0.09043188]\
 [0.11294963]\
 [0.06708026]\
 [0.11812721]\
 [0.11817167]\
 [0.10483097]\
 [0.09818055]\
 [0.08948474]\
 [0.10362842]\
 [0.09711466]]\
size of z 4 z[3] = [[-0.05075633]\
 [ 0.17158877]\
 [-0.34946347]\
 [ 0.21640886]\
 [ 0.2167852 ]\
 [ 0.096996  ]\
 [ 0.03145495]\
 [-0.06128506]\
 [ 0.0854584 ]\
 [ 0.02053917]]

## Loss function

With a softmax output, the loss function is the cross-entropy loss. You can calculate it as

$$\mathcal{L} = -\sum_{i=0}^n (y_i * \log\hat{y}_i)$$
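When $y$ is a one-hot vector, every term of this sum vanishes except the one for the true class. The small example below illustrates this with made-up values. The names `y_onehot` and `y_prob` are chosen for this sketch only.

```python
import numpy as np

y_onehot = np.array([0.0, 1.0, 0.0])   # true class is index 1
y_prob = np.array([0.2, 0.7, 0.1])     # predicted probabilities, summing to 1

# Elementwise product keeps only log(0.7); the other terms are multiplied by 0.
ce = -np.sum(y_onehot * np.log(y_prob))

print(ce)
print(-np.log(0.7))   # the same value, computed from the true-class term alone
```

The loss is small when the predicted probability of the true class is close to 1 and grows without bound as that probability approaches 0.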

### Exercise

Create the loss function for multi-class classification (cross-entropy loss).

In [None]:
# Grade cell - do not remove

def loss(y, yhat):
    l = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return l

In [None]:
# test function - do not remove

y_hat = np.array([0.4083291, 0.20277023, 0.18347409, 0.12298636, 0.08244022])
y = np.array([0, 1, 0, 0, 0])

l = loss(y, y_hat)
print(l)

assert np.round(l, 4) == 1.5957, "Loss function incorrect"

**Expected Output**: 1.5956818129123256

## Back propagation

Back propagation computes the gradients layer by layer, from the output layer back to the input:

$$\frac{\partial\mathcal{L}}{\partial z^{[l-1]} } =[W^{[l]}]^T \cdot \frac{ \partial\mathcal{L} }{\partial z^{[l]} } * {g^{[l-1]}}'(z^{[l-1]})$$

$$\frac{\partial\mathcal{L}}{\partial W^{[l]} } = \frac{ \partial\mathcal{L} }{\partial z^{[l]} } \cdot [a^{[l-1]}]^T$$

$$\frac{\partial\mathcal{L}}{\partial b^{[l]} } = \frac{ \partial\mathcal{L} }{\partial z^{[l]} }$$

where ${g^{[l-1]}}'$ is the derivative of the activation function of layer $l-1$.
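The sketch below applies these equations to a tiny network (sigmoid hidden layer, softmax output) and compares one gradient entry with a finite-difference estimate. All names ending in `_s` or `_sketch` are hypothetical. It assumes weights are stored with shape (inputs, outputs), i.e. $z = W^T a + b$, which is the layout suggested by the comments in this lab's skeleton code; with that layout the transposes move around, so that `dW = a_prev @ delta.T` and `delta_prev = (W @ delta) * g'(z_prev)`. It also uses the fact that for softmax combined with cross entropy the output error is $\hat{y} - y$.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_sketch(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward_s(x, W1, b1, W2, b2):
    a1 = sigmoid_sketch(W1.T @ x + b1)
    y_hat = softmax_sketch(W2.T @ a1 + b2)
    return a1, y_hat

def loss_s(x, y, W1, b1, W2, b2):
    return -np.sum(y * np.log(forward_s(x, W1, b1, W2, b2)[1]))

rng = np.random.default_rng(1)
x_s = rng.normal(size=(3, 1))
y_s = np.array([[0.0], [1.0]])
W1_s, b1_s = rng.normal(size=(3, 4)), np.zeros((4, 1))
W2_s, b2_s = rng.normal(size=(4, 2)), np.zeros((2, 1))

a1_s, y_hat_s = forward_s(x_s, W1_s, b1_s, W2_s, b2_s)

delta2 = y_hat_s - y_s                           # output error (softmax + cross entropy)
dW2_s, db2_s = a1_s @ delta2.T, delta2
delta1 = (W2_s @ delta2) * (a1_s * (1 - a1_s))   # sigmoid'(z1) = a1 * (1 - a1)
dW1_s, db1_s = x_s @ delta1.T, delta1

# Finite-difference check on a single weight
h = 1e-6
W_plus, W_minus = W1_s.copy(), W1_s.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
numeric_s = (loss_s(x_s, y_s, W_plus, b1_s, W2_s, b2_s)
             - loss_s(x_s, y_s, W_minus, b1_s, W2_s, b2_s)) / (2 * h)

# The analytic and numeric values should agree to several decimal places
print(dW1_s[0, 0], numeric_s)
```

A gradient check like this is a cheap way to catch a misplaced transpose before you start training.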

So first of all, we need the derivatives of the activation functions that we use.

The Linear_derivative ($dl$) function is
$$dl(x) = [1]$$

The ReLu_derivative ($dReLu$) function is

$$dReLu(x) = \begin{cases} 1 & \text{when } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

The Tanh_derivative ($dTanh$) function is

$$dTanh(x) = 1 - \tanh^2(x)$$

The Sigmoid_derivative ($ds$) function is

$$ds(x) = \text{sigmoid}(x)\,(1-\text{sigmoid}(x))$$
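One way to sanity-check these formulas is to compare them with a central finite difference, $f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$. The sketch below does that for all four activations. The helper `numeric_derivative` and the dictionaries are hypothetical names used only for this illustration. Note that the ReLU comparison is only meaningful away from $x = 0$, where the function has a kink.

```python
import numpy as np

def numeric_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x), applied element-wise
    return (f(x + h) - f(x - h)) / (2 * h)

def sigmoid_sketch(x):
    return 1.0 / (1.0 + np.exp(-x))

x_check = np.array([0.9, 0.2, 0.1, -0.3, -0.7])

# Activation functions and the analytic derivatives given by the formulas
functions = {
    "linear": lambda v: v,
    "relu": lambda v: np.maximum(0, v),
    "tanh": np.tanh,
    "sigmoid": sigmoid_sketch,
}
analytic = {
    "linear": np.ones_like(x_check),
    "relu": (x_check > 0).astype(float),
    "tanh": 1 - np.tanh(x_check) ** 2,
    "sigmoid": sigmoid_sketch(x_check) * (1 - sigmoid_sketch(x_check)),
}

agreement = {}
for name, f in functions.items():
    agreement[name] = np.allclose(analytic[name], numeric_derivative(f, x_check), atol=1e-6)
    print(name, agreement[name])
```

If an analytic derivative and its numeric estimate disagree, the formula (or its implementation) is the first place to look.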

### Exercise

Implement the derivative functions defined above.

In [None]:
# Grade cell - do not remove

def Linear_derivative(x):
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Linear_derivative(a)
print(y_hat)

assert y_hat.shape[0] == 5, "Linear_derivative output is incorrect"
assert y_hat[0] == 1, "Linear_derivative output is incorrect"
assert y_hat[1] == 1, "Linear_derivative output is incorrect"
assert y_hat[2] == 1, "Linear_derivative output is incorrect"
assert y_hat[3] == 1, "Linear_derivative output is incorrect"
assert y_hat[4] == 1, "Linear_derivative output is incorrect"

In [None]:
# Grade cell - do not remove

def ReLu_derivative(x):
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = ReLu_derivative(a)
print(y_hat)

assert y_hat.shape[0] == 5, "ReLu_derivative output is incorrect"
assert y_hat[0] == 1, "ReLu_derivative output is incorrect"
assert y_hat[1] == 1, "ReLu_derivative output is incorrect"
assert y_hat[2] == 1, "ReLu_derivative output is incorrect"
assert y_hat[3] == 0, "ReLu_derivative output is incorrect"
assert y_hat[4] == 0, "ReLu_derivative output is incorrect"

In [None]:
# Grade cell - do not remove

def Tanh_derivative(x):
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Tanh_derivative(a)
print(y_hat)

assert y_hat.shape[0] == 5, "Tanh_derivative output is incorrect"
assert np.round(y_hat[0],4) == 0.4869, "Tanh_derivative output is incorrect"
assert np.round(y_hat[1],4) == 0.9610, "Tanh_derivative output is incorrect"
assert np.round(y_hat[2],4) == 0.9901, "Tanh_derivative output is incorrect"
assert np.round(y_hat[3],4) == 0.9151, "Tanh_derivative output is incorrect"
assert np.round(y_hat[4],4) == 0.6347, "Tanh_derivative output is incorrect"

In [None]:
# Grade cell - do not remove

def Sigmoid_derivative(x):
    output = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return output

In [None]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Sigmoid_derivative(a)
print(y_hat)

assert y_hat.shape[0] == 5, "Sigmoid_derivative output is incorrect"
assert np.round(y_hat[0],4) == 0.2055, "Sigmoid_derivative output is incorrect"
assert np.round(y_hat[1],4) == 0.2475, "Sigmoid_derivative output is incorrect"
assert np.round(y_hat[2],4) == 0.2494, "Sigmoid_derivative output is incorrect"
assert np.round(y_hat[3],4) == 0.2445, "Sigmoid_derivative output is incorrect"
assert np.round(y_hat[4],4) == 0.2217, "Sigmoid_derivative output is incorrect"

### Exercise

Create the back propagation function.

In [None]:
# Grade cell - do not remove

def back_propagation(y, a, z, W, dW, db, act_deri):
    '''
    Backprop step. Note that derivative of multinomial cross entropy
    loss is the same as that of binary cross entropy loss. See
    https://levelup.gitconnected.com/killer-combo-softmax-and-cross-entropy-5907442f60ba
    for a nice derivation.
    '''
    L = len(W)-1
    
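    # Output-layer error for softmax with cross entropy: y_hat - y
    # (delta is not a parameter here; it is the list returned by the forward step,
    # read from the notebook's global scope)
    # delta[L] = None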
    for l in range(L,0,-1):
        # db = delta(l)
        db[l] = None
        
        # dW = a(l-1) * delta(l)
        dW[l] = None
        
        if l > 1:
            # recalculate delta in backward layer
            # dAct_func(z(l-1)) * (W(l) * delta(l))
            delta[l-1] = None
            
    # YOUR CODE HERE
    raise NotImplementedError()
    return dW, db

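Create the list of activation derivatives, one entry per layer. The last entry is only a placeholder: in the skeleton above, the output-layer delta is computed directly as y_hat - y, so no derivative function is applied at the output layer.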

In [None]:
act_deri = [None, ReLu_derivative, Sigmoid_derivative, Softmax]

In [None]:
# test function - do not remove

x_this = X_train[0,:].T
y_this = y_train[0,:]

a, z, delta, dW, db = forward_one_step(x_this, W, B, act_funcs)
dW, db = back_propagation(y_this, a, z, W, dW, db, act_deri)

lenW = [0, 64, 6, 5]
for i in range(4):
    assert len(dW[i]) == lenW[i]
    
print("dW", dW)
print("db", db)

## Update weight and bias

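During training, you update the weights and biases to reduce the loss and improve accuracy.
The weight and bias update equations are
$$
W_{new}^{(i)} = W_{old}^{(i)} - \alpha \, \delta W^{(i)}
$$
$$
B_{new}^{(i)} = B_{old}^{(i)} - \alpha \, \delta B^{(i)}
$$

where $\alpha$ is the learning rate, $i$ is the layer number of the network, and $\delta W^{(i)}$ and $\delta B^{(i)}$ are the gradients of the loss with respect to the weights and biases of layer $i$ (`dW` and `db` from back propagation).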
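As a small numeric illustration of the update rule, the sketch below applies one gradient descent step to a single toy layer. The names `update_sketch` and the `*_toy` variables are hypothetical, and the gradient values are made up so that the arithmetic is easy to follow. Building new lists rather than modifying the inputs in place is a design choice of this sketch, not a requirement of the exercise.

```python
import numpy as np

def update_sketch(W, B, dW, db, alpha):
    # Index 0 is a placeholder so that list positions match layer numbers
    W_new = [None] + [W[l] - alpha * dW[l] for l in range(1, len(W))]
    B_new = [None] + [B[l] - alpha * db[l] for l in range(1, len(B))]
    return W_new, B_new

alpha_toy = 0.1
W_toy = [None, np.array([[1.0, 2.0], [3.0, 4.0]])]
B_toy = [None, np.array([[0.5], [0.5]])]
dW_toy = [None, np.array([[10.0, 0.0], [0.0, -10.0]])]
db_toy = [None, np.array([[1.0], [-1.0]])]

W_up, B_up = update_sketch(W_toy, B_toy, dW_toy, db_toy, alpha_toy)

print(W_up[1])  # approximately [[0, 2], [3, 5]]
print(B_up[1])  # approximately [[0.4], [0.6]]
```

Each parameter moves against its gradient: a positive gradient lowers the value, a negative gradient raises it, and a zero gradient leaves it unchanged.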

### Exercise

Create the <code>update_step</code> function.

In [None]:
# Grade cell - do not remove

def update_step(W, B, dW, db, alpha):
    L = len(W)-1
    for l in range(1,L+1):
        # W[l] = None
        # B[l] = None
        # YOUR CODE HERE
        raise NotImplementedError()
    return W, B

In [None]:
# test function - do not remove

x_this = X_train[0,:].T
y_this = y_train[0,:]

alpha = 0.1

a, z, delta, dW, db = forward_one_step(x_this, W, B, act_funcs)
dW, db = back_propagation(y_this, a, z, W, dW, db, act_deri)

W_new, B_new = update_step(W, B, dW, db, alpha)
W_new_2, B_new_2 = update_step(W_new, B_new, dW, db, alpha)

result_w = np.array_equal(W, W_new)
result_w2 = np.array_equal(W_new, W_new_2)
assert W[2].shape == W_new[2].shape and W[1].shape == W_new_2[1].shape, "W_new shape must be the same"
assert not result_w and not result_w2, "Weight must be updated"

result_b = np.array_equal(B, B_new)
result_b2 = np.array_equal(B_new, B_new_2)
assert B[3].shape == B_new[3].shape and B[1].shape == B_new_2[1].shape, "b_new shape must be the same"
assert not result_b and not result_b2, "Bias must be updated"

## Put it together

### Exercise

Create the training code using the functions above.
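The overall pattern (shuffle, forward, loss, backward, update, repeated for each epoch) can be sketched on a much smaller problem. The example below trains a single softmax layer with stochastic gradient descent on made-up 2-D data. Everything in it is an assumption for illustration: the toy clusters, the names ending in `_toy` or `_s`, the learning rate, and the number of epochs. It is not the network or the data used in this lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): three well-separated 2-D clusters with one-hot labels
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
labels = np.repeat(np.arange(3), 20)
X_toy = centers[labels] + 0.3 * rng.normal(size=(60, 2))
Y_toy = np.eye(3)[labels]

def softmax_sketch(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

W_s = np.zeros((2, 3))  # stored as (inputs, outputs)
b_s = np.zeros((3, 1))
alpha_toy, epochs_toy = 0.1, 20

epoch_losses = []
for epoch in range(epochs_toy):
    total = 0.0
    # Visit the patterns in a new random order every epoch
    for i in rng.permutation(len(X_toy)):
        x = X_toy[i].reshape(-1, 1)
        y = Y_toy[i].reshape(-1, 1)
        y_hat = softmax_sketch(W_s.T @ x + b_s)  # forward step
        total += -np.sum(y * np.log(y_hat))      # cross-entropy loss of this pattern
        delta = y_hat - y                        # backward step
        W_s -= alpha_toy * (x @ delta.T)         # update weight
        b_s -= alpha_toy * delta                 # update bias
    epoch_losses.append(total)

# The accumulated loss of the last epoch should be lower than that of the first
print(epoch_losses[0], epoch_losses[-1])
```

Note that the loss is accumulated over all patterns within an epoch and stored once per epoch, which is what gives one point per epoch on a loss curve.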

In [None]:
# Grade cell - do not remove

cost_arr = [] 

alpha = 0.01
max_iter = 100
for iter in range(0, max_iter):
    loss_this_iter = 0
    # random permutation of the m_train pattern indices
    order = np.random.permutation(m_train)
    for i in range(0, m_train):
        # Grab the pattern order[i]
        x_this = X_train[order[i],:].T
        y_this = y_train[order[i],:]
        
        # Feed forward step
        a, z, delta, dW, db = None, None, None, None, None
        # calculate the loss of this pattern and accumulate it over the epoch
        loss_this_pattern = 0 #(calculate loss here)
        loss_this_iter = loss_this_iter + loss_this_pattern
        # back propagation
        dW, db = None, None
        # update weight, bias (1 line)

        # YOUR CODE HERE
        raise NotImplementedError()
            
    cost_arr.append(loss_this_iter[0,0])

In [None]:
# test function - do not remove

for i in range(max_iter):
    print('Epoch %d train loss %f' % (i + 1, cost_arr[i]))
assert len(cost_arr) == max_iter

### Take home Exercise

1. Plot the loss values on a graph using pyplot (10 points)
2. Create a prediction function to predict the test set that we separated out above, and calculate the accuracy of the predictions. (15 points)
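For the accuracy part, one common approach is to take the class with the highest predicted probability and compare it with the class marked in the one-hot label. The sketch below shows only that comparison step on made-up arrays. `accuracy_sketch` and the `*_demo` arrays are hypothetical, and it assumes one row per pattern, which may differ from how your predictions are laid out.

```python
import numpy as np

def accuracy_sketch(Y_true, Y_prob):
    # Predicted class = index of the largest probability in each row
    predicted = np.argmax(Y_prob, axis=1)
    actual = np.argmax(Y_true, axis=1)
    return np.mean(predicted == actual)

Y_true_demo = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [0, 1, 0]])
Y_prob_demo = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.3, 0.4, 0.3],   # predicted class 1, true class 2
                        [0.1, 0.6, 0.3]])

acc_demo = accuracy_sketch(Y_true_demo, Y_prob_demo)
print(acc_demo)  # 3 of 4 rows are correct, so 0.75
```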

In [None]:
# YOUR CODE HERE
raise NotImplementedError()