Before we begin, let's make sure everything works. Click inside the block below and press Ctrl+Enter to run (or use the Run button in the toolbar). If everything went well, you should see the text `All good!` below the block:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import h5py
from testCases_v2 import *
from lr_utils import load_dataset
%matplotlib inline


print("All good!")

# Introduction to Neural Networks with Cogito
## 1 Single Neuron

Today we'll learn how to create neural networks and use them to solve hard problems. A neural network is loosely inspired by how the brain works, so we'll start out by implementing a model of a single neuron: the perceptron. The notebook repeats a lot of the theory that was covered in the slides, so feel free to skip ahead to the coding parts.

### 1.1 The Perceptron

Let's begin by creating a simple approximation of a neuron. This simplification is called a perceptron. It has inputs, weights, an activation and an output.

![Perceptron](Images/Perceptron.png "perceptron")

To calculate the perceptron's output we compute a weighted sum of the input signals. We then apply an activation function $g$ to the sum to get the output $\hat{y}$:

$$ z = \sum_{i=1}^{n} x_iw_i $$

$$
\hat{y} = g(z) 
$$

There are a few activation functions we can choose from when building our perceptron. The activation function is there to simulate the firing (activation) of the neuron when the total input to the neuron is great enough. If the neuron got enough input, we say it got activated. The actual activation of the neuron varies depending on the type of activation function we use.

### 1.2 Activation
As mentioned above, we can choose between a large variety of activation functions. A few choices are the step function, a few types of sigmoid functions and a linear function:

$$ g_{step}(x) = \begin{cases}
1 & \text{if } x \geq 0 \\
0 & \text{if } x < 0 \\
\end{cases} $$

$$ g_{sigmoid}(x) = \frac{1}{1+e^{-x}} $$

$$ g_{linear}(x) = x $$
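
As a quick worked example (the input values $-2$ and $0.5$ are arbitrary and chosen only for illustration), this is what the three functions return:

$$
\begin{array}{c|ccc}
x & g_{step}(x) & g_{sigmoid}(x) & g_{linear}(x) \\ \hline
-2 & 0 & \approx 0.119 & -2 \\
0.5 & 1 & \approx 0.622 & 0.5
\end{array}
$$

The step function throws away how far the input is from zero, the sigmoid squeezes it into the range between 0 and 1, and the linear function passes it through unchanged.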

Let's implement some activation functions in Python!

#### Step function

$$ g_{step}(x) = \begin{cases}
1 & \text{if } x \geq 0 \\
0 & \text{if } x < 0 \\
\end{cases} $$

![](Images/stepfunction.svg)

In [None]:
# Implement the step activation function (≈ 1 line)
def step(z):
    ### START CODE HERE
    a = None
    ### END CODE HERE
    return a

In [None]:
print("step(-6):", step(-6))
print("step(0):", step(0))
print("step(18):", step(18))

**Expected Output**: 

<table>
  <tr>
    <td> step(-6) </td>
    <td> 0 </td> 
  </tr>
  <tr>
    <td> step(0) </td>
    <td> 1 </td> 
  </tr>
  <tr>
    <td> step(18) </td>
    <td> 1 </td> 
  </tr>
</table>

#### Sign Function

$$ g_{sign}(x) = \begin{cases}
1 & \text{if } x \geq 0 \\
-1 & \text{if } x < 0 \\
\end{cases} $$

In [None]:
# Implement the sign activation function (≈ 1 line)
def sign(z):
    ### START CODE HERE
    a = None
    ### END CODE HERE
    return a

In [None]:
print("sign(-6):", sign(-6))
print("sign(0):", sign(0))
print("sign(18):", sign(18))

**Expected Output**: 

<table>
  <tr>
    <td> sign(-6) </td>
    <td> -1 </td> 
  </tr>
  <tr>
    <td> sign(0) </td>
    <td> 1 </td> 
  </tr>
  <tr>
    <td> sign(18) </td>
    <td> 1 </td> 
  </tr>
</table>

### 1.3 Weights
When we create our neuron, we need some initial values for the weights ($w$ in the perceptron image above). They decide how much the neuron will "listen" to signals from the different input nodes. When building a neural network we want to initialize these with small values, but when we are building a perceptron we can just initialize them to zero.

$$ \boldsymbol w = \begin{bmatrix} 0 \\ 0 \end{bmatrix} $$

In [None]:
# Implement initialize weights where n is 
# the number of different weights (≈ 1 line)
def init_weights(n):
    ### START CODE HERE
    w = None
    ### END CODE HERE
    return w

In [None]:
print("len(init_weights(5)):", len(init_weights(5)))
w = init_weights(2)
print("w:", w)

**Expected Output**: 

<table>
  <tr>
    <td> len(init_weights(5)) </td>
    <td> 5 </td> 
  </tr>
  <tr>
    <td> w:  </td>
    <td> [0,0] </td> 
  </tr>
</table>

### 1.4 Output

Now it's time to get some output from our neuron. Remember that we take the weighted sum of the inputs and then apply an activation.

$ z = \sum_{i=1}^{n} x_iw_i $

$ \hat{y} = g(z - \theta) $

We see that the formula is a bit different. That is because we want to be able to move the threshold point of the activation function. I will get back to why, and to what the threshold is, later!

(Hint: you might find python's [zip()](https://docs.python.org/3.7/library/functions.html#zip) useful)

In [None]:
# Implement predict (≈ 2 lines)
def predict(x, w, theta, activation):
    ### START CODE HERE
    z = None
    y = activation(z - theta)
    ### END CODE HERE
    return y

In [None]:
print("predict([0, 0],[0.2, 0.2], 0.2, step):", predict([0, 0],[0.2, 0.2], 0.2, step))
print("predict([0, 1],[0.2, 0.2], 0.2, step):", predict([0, 1],[0.2, 0.2], 0.2, step))
print("predict([1, 0],[0.2, 0.2], 0.2, sign):", predict([1, 0],[0.2, 0.2], 0.2, sign))

**Expected Output**: 

<table>
  <tr>
    <td> predict([0, 0],[0.2, 0.2], 0.2, step) </td>
    <td> 0 </td> 
  </tr>
  <tr>
    <td> predict([0, 1],[0.2, 0.2], 0.2, step) </td>
    <td> 1 </td> 
  </tr>
  <tr>
    <td> predict([1, 0],[0.2, 0.2], 0.2, sign) </td>
    <td> 1 </td> 
  </tr>  
</table>

### 1.5 Learning

Now that we have a perceptron with input, activation and weights we _can_ use our neuron, but there is a problem. Since the weights are all zero, the perceptron will always output the same value. We need a way to adjust the weights such that the model learns to predict the right output.

In machine learning and in neural networks, we (usually) learn from examples. That means we show the algorithm an example of what we want it to learn, and then we correct it by saying if it got the example right or wrong. We can do the exact same thing with our neuron:

$$ \epsilon_{x_1} = y - \hat{y}$$

Here $ \epsilon_{x_1} $ simply means the error on our first example. This gives our model an indication of _how wrong_ it was. With this we can update our weights with the following rule:

$$ w_i \leftarrow w_i + \alpha \cdot x_i \cdot \epsilon $$

where $\alpha$ is what we call the learning rate, $w_i$ is weight number $i$, $x_i$ is input number $i$ of the current example and $\epsilon$ is the error on that example. The learning rate is just our step size when making adjustments towards a better model.
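
As a worked example (the numbers are picked for illustration and match the first test case further down): with $w = [0.2, 0.2]$, $x = [1, 0]$, an error of $1$ and $\alpha = 0.1$, the rule gives

$$
\begin{aligned}
w_1 &\leftarrow 0.2 + 0.1 \cdot 1 \cdot 1 = 0.3 \\
w_2 &\leftarrow 0.2 + 0.1 \cdot 0 \cdot 1 = 0.2
\end{aligned}
$$

Only the weight whose input was active gets changed. An input of 0 contributed nothing to the output, so its weight is left alone.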

In [None]:
# Implement a function to calculate the error (≈ 1 line)
def error(y, y_pred):
    ### START CODE HERE
    return None
    ### END CODE HERE

In [None]:
print("error(1, 0):", error(1, 0))
print("error(1, -1):", error(1, -1))
print("error(0, 1):", error(0, 1))

**Expected Output**: 

<table>
  <tr>
    <td> error(1, 0) </td>
    <td> 1 </td> 
  </tr>
  <tr>
    <td> error(1, -1) </td>
    <td> 2 </td> 
  </tr>
  <tr>
    <td> error(0, 1) </td>
    <td> -1 </td> 
  </tr>
</table>

Next we need to update the weights

(Hint: The x and w in this function are the arrays with the inputs and weights respectively)

In [None]:
# Implement update weights (≈ 1-2 lines)
def update_weight(x, w, error, learning_rate):
    ### START CODE HERE
    w = None
    ### END CODE HERE
    return w

In [None]:
print("update_weight([1, 0], [0.2, 0.2], 1, 0.1):", update_weight([1, 0],[0.2, 0.2], 1, 0.1))
print("update_weight([1, 0], [0.2, 0.2], -1, 0.2):", update_weight([1, 0], [0.2, 0.2], -1, 0.2))
print("update_weight([1, 0, 1], [0.2, 0.2, 0.1], -1, 0.1):", update_weight([1, 0, 1], [0.2, 0.2, 0.1], -1, 0.1))

**Expected Output**: 

<table>
  <tr>
    <td> update_weight([1, 0],[0.2, 0.2], 1, 0.1)</td>
    <td> [0.30000000000000004, 0.2] </td> 
  </tr>
  <tr>
    <td> update_weight([1, 0], [0.2, 0.2], -1, 0.2)</td>
    <td> [0.0, 0.2] </td> 
  </tr>
  <tr>
    <td> update_weight([1, 0, 1], [0.2, 0.2, 0.1], -1, 0.1) </td>
    <td> [0.1, 0.2, 0.0] </td> 
  </tr>
</table>

### 1.6 Putting it all together
Now it's time to put it all together and create a learning perceptron! We are going to use our perceptron to learn the **and**, **or** and **xor** functions. For those unfamiliar with these logic operators, here's a short recap:

> And, or and xor are logic operators. You give them two statements and get back a true or a false. An example of the and operator can be: "It is raining `and` I am wet". This is only true if it is both raining and I am wet. The `or` operator is true if it's raining *or* I am wet *or* both. The `xor` ("exclusive or") is true only if _either_ it is raining *or* I'm wet, but not if both!
> 
> In the table we can see the truth value of all the functions
![](Images/truth-table-and-or-xor.png)
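
The same truth table can be reproduced in a few lines of Python. This is only an illustrative sketch: it uses Python's bitwise operators `&`, `|` and `^` on the integers 0 and 1, and the variable names are made up for this example.

```python
# Build the truth table for and, or and xor from Python's own operators.
rows = []
for a in (0, 1):
    for b in (0, 1):
        rows.append((a, b, a & b, a | b, a ^ b))

print(" a  b | and  or  xor")
for a, b, and_, or_, xor_ in rows:
    print(f" {a}  {b} |  {and_}    {or_}    {xor_}")
```

The last three columns are the same values we will later use as the target vectors for the perceptron.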

But before we can learn anything we need to put everything together. Let's finish up the perceptron.

Implement the function perceptron, which takes in our input X, expected output Y, activation, number of epochs, learning rate, threshold and the weights. Here X is a matrix, not a vector. This lets us represent all the training examples in one structure.

$$ X = \begin{bmatrix}
0 & 0 \\
0 & 1 \\
1 & 0 \\
1 & 1 
\end{bmatrix} , 
Y_{and} = \begin{bmatrix}
0 \\
0 \\
0 \\
1 
\end{bmatrix} $$

$Y$ is a vector with the value of the truth table in the place corresponding to each example in $X$. *num_ephocs* is the number of epochs, which is how we in machine learning say how many times we want to iterate over the entire training set (all our examples, more on this later). We also pass in the weights, so that we can initialize the perceptron with weights we have trained before.

(Hint: Scroll up to see the functions we defined earlier, and try to see where they fit in)

In [None]:
# Implement the perceptron
def perceptron(X, Y, activation, num_ephocs=5, learning_rate=0.1, threshold = 0.1, w=None, printer=False):
    
    # initialize the weights
    if w is None:
        ### START CODE HERE (≈ 1 line)
        w = None
        ### END CODE HERE
        
    # iterate through the entire training set multiple times to learn
    for epoc in range(num_ephocs):
        ### START CODE HERE (≈ 4 lines)
        # for every example in the training set
        for _ in None:
            
            # calculate the predicted value
            y_pred = None
            
            # find the error
            err = None
            
            # update the weights
            w = None
            ### END CODE HERE
        
        if printer:
            pred = ""
            for x in X:
                pred += str(predict(x, w, threshold, activation)) + ","
            print("Epoch:", epoc + 1)
            print("Prediction\t [", pred[:-1] , "]", sep="")
            print("Weights\t\t", w, end="\n\n")
    return w

Since our perceptron returns its weights, we can use our predict function from earlier to test if our implementation is correct. Don't worry that our X matrix represents 0 as 0.01 and 1 as 0.99; this is just for the numerics underneath.

In [None]:
# Set up the X matrix
X = [[0.01,0.01],
     [0.01,0.99],
     [0.99,0.01],
     [0.99,0.99]]

# set up the different Y vectors
and_Y = [0, 0, 0, 1]
or_Y = [0, 1, 1, 1]
xor_Y = [0, 1, 1, 0]

# threshold
threshold = 0.2

We can now run our perceptron and evaluate the result!

In [None]:
and_weight = perceptron(X, and_Y, step, num_ephocs=2, threshold=threshold, printer=True)

We can see from the printout that it learns the **and** weights in just a few iterations! Let's see if we can learn **or**

In [None]:
or_weight = perceptron(X, or_Y, step,num_ephocs=2, threshold=threshold, printer=True)

The **or** weights are also easy to learn, but pay extra attention to the **xor**

In [None]:
xor_weight = perceptron(X, xor_Y, step, num_ephocs=5, threshold=threshold, printer=True)

### What's happening?!?

Even with 5 epochs it gets nowhere close to the answer and gets stuck after a few iterations. This is because it actually can't learn **xor**! To understand why, we need to take a look at what the perceptron actually does when we train the weights, and at a little something called linear separability.

When we train our perceptron and estimate the weights, it can be shown that what we actually do is fit a line through the plane (or, if each training example has more than two dimensions, a hyperplane through that space):

$$ x_1 w_1 + x_2 w_2 = \theta $$

We can then use this equation to calculate which side of the line a point is on. When we do this in the **and** case it looks something like this:

In [None]:
print(and_weight)

plt.plot([0, 0, 1], [0, 1, 0], "ro", markersize=20)
plt.plot([1], [1], "b+", markersize=20)
# The decision line x1*w1 + x2*w2 = threshold, solved for x2
intercept = threshold / and_weight[1]
slope = -and_weight[0] / and_weight[1]

xs = range(-3, 3)
ys = [intercept + (slope * x) for x in xs]

plt.plot(xs, ys)
plt.axis([-0.2, 1.2, -0.2, 1.2])
plt.show()

Here I have marked the positive case (`1 and 1 = 1`) as a plus sign and the negative cases as red dots. If you did everything correctly so far, you should see that the line separates the plus from the red dots. This is what the perceptron does: it finds a line that separates two different types of data. Now for the **or** case:

In [None]:
print(or_weight)

plt.plot([0], [0], "ro", markersize=20)
plt.plot([0, 1, 1], [1, 0, 1], "b+", markersize=20)
# The decision line x1*w1 + x2*w2 = threshold, solved for x2
intercept = threshold / or_weight[1]
slope = -or_weight[0] / or_weight[1]

xs = range(-3, 3)
ys = [intercept + (slope * x) for x in xs]

plt.plot(xs, ys)
plt.axis([-0.2, 1.2, -0.2, 1.2])
plt.show()

Here again the + signs are positive cases and the red dot is the negative case. We see that the same perceptron as in the **and** case has managed to learn a new line that separates the two classes. Let's see why the perceptron is not able to learn the **xor** case!

In [None]:
print(xor_weight)

plt.plot([0, 1], [0, 1], "ro", markersize=20)
plt.plot([0, 1], [1, 0], "b+", markersize=20)
# The decision line x1*w1 + x2*w2 = threshold, solved for x2
intercept = threshold / xor_weight[1]
slope = -xor_weight[0] / xor_weight[1]

xs = range(-5, 5)
ys = [intercept + (slope * x) for x in xs]

plt.plot(xs, ys)
plt.axis([-1.5, 2.5, -1.5, 2.5])
plt.show()

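When we look at the plot we clearly see that our perceptron, a.k.a. our line learner, cannot learn a straight line that separates the two cases: no single straight line puts both plus signs on one side and both red dots on the other. This is what it means for **xor** not to be linearly separable.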
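
A brute-force search makes the same point without any training. The sketch below is not part of the workshop code; `find_separating_line`, `points`, `targets` and `grid` are names invented for this illustration. It tries every combination of $w_1$, $w_2$ and $\theta$ on a coarse grid and checks whether the line classifies all four points correctly.

```python
from itertools import product

# The four input points and the target outputs for each logic function.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = {
    "and": [0, 0, 0, 1],
    "or": [0, 1, 1, 1],
    "xor": [0, 1, 1, 0],
}

# Candidate values for w1, w2 and theta: -2.0, -1.8, ..., 2.0
grid = [i / 5 for i in range(-10, 11)]


def find_separating_line(labels):
    """Return the first (w1, w2, theta) on the grid that classifies
    every point correctly, or None if no grid point does."""
    for w1, w2, theta in product(grid, repeat=3):
        predictions = [1 if x1 * w1 + x2 * w2 >= theta else 0
                       for x1, x2 in points]
        if predictions == labels:
            return (w1, w2, theta)
    return None


for name, labels in targets.items():
    print(name, "->", find_separating_line(labels))
```

The search finds a line for **and** and for **or**, but returns `None` for **xor**. A finer grid would not help: xor needs $\theta > 0$, $w_1 \geq \theta$ and $w_2 \geq \theta$, which forces $w_1 + w_2 \geq 2\theta > \theta$, so the point (1, 1) always ends up on the wrong side.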

## Part 2 Neural Networks (Multi layered perceptrons)

Now we are going to look at how we can solve the problem of the perceptron by creating larger networks of them. To best understand how this enables us to do amazing things you will need to understand a bit about matrix operations and some differentiation using the chain rule. That is because we need to combine these concepts to build the learning algorithm called backpropagation that enables us to train these more powerful models.

This notebook contains a lot of math, especially in the section about backprop. If you're too tired to look at that now, feel free to copy the solution when you get there. _But remember to take a look at it later, it's there because it's essential to understanding deep learning at a deeper level (pun intended)._

### 2.1 Network of Neurons

Our basic building block will be the perceptron, inspired by the neurons in your brain. Because nature is not quite binary, we want a smooth transition between the "on" and "off" states. So in the same way we used activation functions in the perceptron we will use them in the neural network. When building our neural network we'll be using a sigmoid function as the activation function.

![](Images/sigmoid_smoth.png)


In your brain, the electrical signals are collected by the dendrites and combined to form a
stronger electrical signal. If the signal is strong enough to pass the threshold, the
neuron fires a signal down the axon towards the terminals to pass onto the next
neuron’s dendrites. Almost as in this image.

![](Images/neural_network.png)

The thing to notice is that each neuron takes input from many before it, and also
provides signals to many more, if it happens to be firing. One way to replicate this in an artificial model is to have layers of neurons, with each connected to every other one in the preceding and subsequent
layer.

![](Images/neural_network2.png)

So far, it might not be obvious how this network can learn anything at all. However, if we add _weights_ to the connections between neurons, things start getting interesting. Now we can amplify some signals and silence others, simply by adjusting the weights up and down.

![](Images/neural_network3.png)

A core idea behind neural networks is that, given enough nodes, there exists some combination of weights that makes the network approximate almost any mathematical function. There exist smart methods for finding these weights, and these methods are based on the idea of showing the network many `input → output` examples and adjusting the weights to make the network "less wrong", step by step, until it is (hopefully) no longer wrong.

### Structure in a neural network

In this part of the notebook we will be using numpy. Numpy is a library that lets us quickly and efficiently do matrix operations in Python. You can read more about numpy [here](https://numpy.org/). One of the most useful features of a numpy array is its `shape` attribute (`numpy.ndarray.shape`), which gives you the dimensions of the array as a tuple.
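
A minimal illustration of `shape`, using a made-up array `M` that has nothing to do with the dataset:

```python
import numpy as np

# A hypothetical array with 3 rows and 5 columns
M = np.zeros((3, 5))

print(M.shape)     # (3, 5)
print(M.shape[0])  # 3, the number of rows
print(M.shape[1])  # 5, the number of columns
```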

**Exercise**: Define three variables:
- n_x: the size of the input layer
- n_h: the size of the hidden layer
- n_y: the size of the output layer

**Hint**: Use shapes of X and Y to find n_x and n_y.

In [None]:
def layer_sizes(X, Y, num_hidden_nodes):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    ### START CODE HERE ### (≈ 3 lines of code)
    n_x = None # size of input layer
    n_h = None # size of the hidden layer
    n_y = None # size of output layer
    ### END CODE HERE ###
    return (n_x, n_h, n_y)

In [None]:
X_assess, Y_assess, H_assess = layer_sizes_test_case()
(n_x, n_h, n_y) = layer_sizes(X_assess, Y_assess, H_assess)
print("The size of the input layer is: n_x = " + str(n_x))
print("The size of the hidden layer is: n_h = " + str(n_h))
print("The size of the output layer is: n_y = " + str(n_y))

**Expected Output** (these are not the sizes you will use for your network, they are just used to assess the function you've just coded).

<table style="width:20%">
  <tr>
    <td>**n_x**</td>
    <td> 5 </td> 
  </tr>
    <tr>
    <td>**n_h**</td>
    <td> 4 </td> 
  </tr>
    <tr>
    <td>**n_y**</td>
    <td> 2 </td> 
  </tr>
  
</table>

### 2.2 Follow the signal through the network (You can skip this part)

This next part presents a numerical example where we follow the signal through the network. You can skip this part without losing the continuity of the workshop.

Now we take a look at the signal and how it changes as we propagate it through the network. To ease the calculations we look at a small neural network with two nodes on the input layer, and two nodes in a _hidden layer_. A hidden layer is simply any other layer than the input layer. The last hidden layer is often called the output layer.

The input will be $x_{1} = 1.0, x_{2} = 0.5$, and the weights are:
* $ w_{1,1} = 0.9 $
* $ w_{1,2} = 0.2 $
* $ w_{2,1} = 0.3 $
* $ w_{2,2} = 0.8 $

![](Images/neural_network_signals.png)

We start by calculating the first node in layer two. We will call it $n^{[1]}_1$

The input to $n^{[1]}_1$, let's call it $z^{[1]}_1$, is:
$$
z^{[1]}_1 = w_{1,1} \cdot x_1 + w_{2,1} \cdot x_2 = 0.9 \cdot 1.0 + 0.3 \cdot 0.5  = 1.05
$$

Now the other node!
$$
z^{[1]}_2 = w_{1,2} \cdot x_1 + w_{2,2} \cdot x_2 = 0.2 \cdot 1.0 + 0.8 \cdot 0.5 = 0.6
$$

Now we need to apply the activation function in each of the nodes. I'm going to use $a^{[1]}_1$ for the activation in node 1 in layer 1:
$$
\begin{aligned}
a^{[1]}_1 &= \frac{1}{1+e^{-z^{[1]}_1}} = \frac {1}{1+e^{-1.05}} \approx 0.741
\\
a^{[1]}_2 &= \frac{1}{1+e^{-z^{[1]}_2}} = \frac {1}{1+e^{-0.6}} \approx 0.646
\end{aligned}
$$
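
The numbers above can be checked with a few lines of plain Python. This is only a sketch for verifying the arithmetic; the helper name `logistic` is made up here so that it does not clash with the `sigmoid` function you will implement later.

```python
import math

# Inputs and weights from the example above
x1, x2 = 1.0, 0.5
w11, w12, w21, w22 = 0.9, 0.2, 0.3, 0.8


def logistic(z):
    """Scalar sigmoid, 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


# Weighted sums into the two nodes of layer 1
z1 = w11 * x1 + w21 * x2
z2 = w12 * x1 + w22 * x2

# Activations of the two nodes
a1 = logistic(z1)
a2 = logistic(z2)

print("z:", round(z1, 3), round(z2, 3))  # z: 1.05 0.6
print("a:", round(a1, 3), round(a2, 3))  # a: 0.741 0.646
```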

#### Assignment
Calculate the output if the input is 0.5, 0.5 and the weights are
* $w_{1,1} = 1$
* $w_{1,2} = -1$
* $w_{2,1} = 5$
* $w_{2,2} = 10$

Notice anything interesting?

I know it seems like a lot of work to calculate just two outputs, but this is where matrices make things much easier!

### 2.3 Matrix multiplication and vectorisation (Can be skipped)

If you have taken and still remember TMA4110/TMA4115, this is going to be a walk in the park, but I still encourage you to get to know numpy. Numpy is THE Python library for doing matrix calculations efficiently.

If you have not done TMA4110/TMA4115 don't worry.

Matrices are useful in many ways. First, they allow us
to compress writing all those calculations into a very simple short form. The second benefit is that many
computer programming languages have heavily optimized matrix libraries that run much
faster than anything we can easily implement ourselves.

In short, matrices allow us to express the work we need to do concisely and
easily, and computers can get the calculations done quickly and efficiently.

A matrix is just a table, a rectangular grid of numbers. If you’ve used spreadsheets, you’re already comfortable with working with numbers arranged in a grid. We usually write them like this:

\begin{equation*}
{A} = \begin{bmatrix} 
    x_{11} & x_{12} & \dots & x_{1n} \\
    x_{21} & x_{22} & \dots & x_{2n} \\
    \vdots & \vdots & \ddots & \vdots \\
    x_{n1} & x_{n2} & \dots & x_{nn}
    \end{bmatrix} \in \newcommand{\R}{\mathbb{R}} \R^{n \times n}
\end{equation*}

This matrix is (n x n) big. That means it has n rows and n columns. A matrix that is (n x 1) has n rows and one column. We usually call a matrix that has either 1 row or 1 column a _vector_ and write it in a bold lower case letter:

$$
\mathbf{a} = \begin{bmatrix} 
    x_{11} \\
    x_{21} \\
    \vdots \\
    x_{n1} 
    \end{bmatrix} \in \R^{n \times 1}
$$

We can multiply two matrices together like this. The operation is called matrix multiplication; numpy calls it a dot product:

![](Images/matrix_mult.png)

Spend some time looking at this image. See how we combine a row in the first matrix with a column in the second matrix to create one element of the new matrix. You might also notice that we can only do this if the number of columns in the first matrix is the same as the number of rows in the second matrix. If you look closely you might also find that if we multiply a (2 x 3) matrix by a (3 x 2) matrix we end up with a (2 x 2) matrix.

When numpy complains that `ValueError: shapes (whatever) not aligned`, you probably got this wrong. That happens, look at the dimensions and try to make sure that the number of columns in the first matrix is the same as the number of rows in the second matrix.
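
The shape rule is easy to try out. The sketch below uses the `@` operator (Python's matrix-multiplication operator, which numpy arrays support) on two made-up matrices of ones; the exact wording of the error message differs a bit between `@` and `np.dot`, but both raise a `ValueError` when the shapes do not fit.

```python
import numpy as np

A = np.ones((2, 3))  # 2 rows, 3 columns
B = np.ones((3, 2))  # 3 rows, 2 columns

# Columns of A (3) match rows of B (3), so the product is (2 x 2)
print((A @ B).shape)  # (2, 2)

# Columns of A (3) do not match rows of A (2), so this fails
try:
    A @ A
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```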

The next image shows how we can use a matrix operation to represent a calculation in a neural network.

![](Images/matrix_mult2.png)

The first matrix contains the weights between nodes of two layers. The second matrix contains the signals of the first input layer. The answer we get by multiplying these two matrices is the combined weighted signal into the nodes of the second layer. Look carefully, and you’ll see this. The first node has the first input_1 moderated by the weight **w1,1** plus the second input_2 moderated by the weight **w2,1** . These are the values of **z** before the sigmoid activation function is applied.

This calculation can be expressed as:

$$
{z^{[1]}} = {W^{[1]}} \cdot {x_i}
$$

Here $z^{[1]}$ is the input to the hidden layer, and $W^{[1]}$ is the matrix containing the weights between the 0th and the 1st layer.
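
To connect this to the numerical example from section 2.2, here is a small sketch that computes the same two weighted sums with one matrix product. The variable names are chosen for this illustration only, and the layout of `W` (one row per receiving node) is the assumption that makes the product line up.

```python
import numpy as np

# Row j holds the weights going into node j of the next layer:
# [[w11, w21],
#  [w12, w22]]
W = np.array([[0.9, 0.3],
              [0.2, 0.8]])

# Input signals x1 and x2 as a column vector
x = np.array([[1.0],
              [0.5]])

z = W @ x
print(z)  # approximately [[1.05], [0.6]]
```

These are the same values of $z^{[1]}_1$ and $z^{[1]}_2$ that we calculated by hand earlier.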

If we were to implement the matrix operation ourselves, the result would probably not be the fastest matrix operation around. Numerical libraries use a lot of clever tricks, such as vector instructions on the processor (e.g. AVX), to implement a much faster variant. In machine learning with Python we usually use the numpy library to do all the matrix calculations.

Here is a simple assignment with some matrix multiplication. Take a look at the [Numpy docs](https://docs.scipy.org/doc/) and [np.dot()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.dot.html)

In [None]:
# First let's create two matrices
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])

### START CODE HERE (≈ 1 line)
C = None
### END CODE HERE
print("C =",C)

**Expected Output**: 

<table>
  <tr>
    <td> C =</td>
    <td> [[19 22]
 [43 50]] </td> 
  </tr>
</table>

### 2.4 Vectorized activation function

When we write code for machine learning tasks, we often want to implement a version of a function that takes in a whole vector or matrix instead of a single value. This is called a _vectorized_ version of the function. Next we need a vectorized version of our activation function. The function we'll use is often called the _logistic sigmoid function_; "sigmoid" refers to its characteristic _s_-shape.
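
To show what "vectorized" means in practice, here is a sketch that uses the step function from part 1 rather than the sigmoid. The names `step_scalar` and `step_vectorized` are made up for this example, and `np.where` is just one of several ways to write it.

```python
import numpy as np


def step_scalar(z):
    """Step function for a single number."""
    return 1 if z >= 0 else 0


def step_vectorized(z):
    """Step function applied to every element of an array at once."""
    return np.where(np.asarray(z) >= 0, 1, 0)


values = np.array([-6.0, 0.0, 18.0])

looped = [step_scalar(v) for v in values]
vectorized = step_vectorized(values)

print(looped)      # [0, 1, 1]
print(vectorized)  # [0 1 1]
```

Both versions give the same answers, but the vectorized one hands the whole array to numpy in a single call instead of looping in Python.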

#### Implementing the sigmoid activation function

![](Images/sigmoid.png)

The shape of the sigmoid function is given by
$$
sigmoid(z) = \frac{1}{1+e^{-z}}
$$

As we know from earlier, we often want to run the activation function on multiple values, so we want to implement this function in a vectorized way like this:
$$ \text{For } x \in \mathbb{R}^n \text{,     } sigmoid(x) = sigmoid\begin{pmatrix}
    x_1  \\
    x_2  \\
    ...  \\
    x_n  \\
\end{pmatrix} = \begin{pmatrix}
    \frac{1}{1+e^{-x_1}}  \\
    \frac{1}{1+e^{-x_2}}  \\
    ...  \\
    \frac{1}{1+e^{-x_n}}  \\
\end{pmatrix}\tag{1} $$

(Hint: check out [np.exp()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html))

In [None]:
# implement a vectorized version of the sigmoid activation
def sigmoid(z):
    """
    Compute the sigmoid of z

    Arguments:
    z -- A scalar or numpy array of any size

    Return:
    s -- sigmoid(z)
    """
    ### START CODE HERE (≈ 1 line)
    s = None
    ### END CODE HERE
    return s

In [None]:
x = np.array([1, 2, 3])
sigmoid(x)

**Expected Output**:

<table>
    <tr> 
        <td> sigmoid([1,2,3])</td> 
        <td> array([ 0.73105858,  0.88079708,  0.95257413]) </td> 
    </tr>
</table> 

### 2.5 Neural network as matrices

When we work with neural networks we usually have a dataset. That is, we have many examples with input data $x$ and corresponding "answers" $y$. Depending on the problem we are trying to solve, the x's and y's are vectors of various sizes. Each of the examples $x_i$ will be the same size as our input layer and the corresponding answer $y_i$ will have the same size as our output layer.

When we want to vectorize this, all we do is put the entire dataset (all of the $x_i$-s) into a matrix X:

\begin{equation*}
X = \begin{bmatrix} 
    x_{11} & x_{12} & \dots & x_{1m} \\
    x_{21} & x_{22} & \dots & x_{2m} \\
    \vdots & \vdots & \ddots \\
    x_{n1} & x_{n2} & \dots & x_{nm}
    \end{bmatrix} \in \newcommand{\R}{\mathbb{R}} \R^{n \times m}
\end{equation*}

Here the m is the number of examples we have, and n is the number of inputs to our neural network. This means that each column contains one example. This is a bit different than you will see other places, but it makes our computations a lot cleaner and easier.
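
As a small sketch of this layout (the example values and the name `X_columns` are made up for illustration), a list with one example per row can be turned into the column layout by transposing it:

```python
import numpy as np

# Four hypothetical training examples, each with n = 2 input values
examples = [[0, 0],
            [0, 1],
            [1, 0],
            [1, 1]]

# np.array(examples) has one example per row, shape (m, n) = (4, 2).
# Transposing puts one example per column, shape (n, m) = (2, 4).
X_columns = np.array(examples).T

print(X_columns.shape)   # (2, 4)
print(X_columns[:, 1])   # the second example: [0 1]
```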

Our Y matrix looks the same, only with y's instead of x'es.

The last matrix we need is $W^{[i]}$, the matrix that represents the weights from layer $i-1$ to layer $i$. If layer $i-1$ has $p$ nodes and layer $i$ has $j$ nodes, the W matrix is:

\begin{equation*}
W^{[i]} = \begin{bmatrix} 
    w_{1,1} & w_{2,1} & \dots & w_{p,1} \\
    w_{1,2} & w_{2,2} & \dots & w_{p,2} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{1, j} & w_{2,j} & \dots & w_{p,j}
    \end{bmatrix} \in \newcommand{\R}{\mathbb{R}} \R^{j \times p}
\end{equation*}

Notice here that each column holds the weights going from one node in the previous layer to each of the nodes in the next layer.

#### Initial weights

Before we start implementing the forward propagation of our neural network we need to initialize the weights. Unlike the perceptron where we can initialize the weights to all zero, we need to initialize the neural network weights to actual values.

We want to avoid feeding too large values into our sigmoid activation function. That is because we'll have to differentiate our activation, and a large input value will push the derivative close to zero. If we get too close to zero the network will no longer be able to learn, so we want to avoid that.

We know that the output from the previous layer will be between 0 and 1, because we use sigmoid as our activation function. These are summed together, so if we have more nodes in the previous layer, we want to have smaller weights.

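The standard approach is to initialize the weights to small random values, drawn from a Gaussian (normal) distribution whose standard deviation shrinks as the number of nodes in the previous layer grows:

$$ N\left(\mu=0, \sigma=\frac{1}{\sqrt{\text{number of nodes in prev. layer}}}\right) $$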

For those who haven't taken any statistics courses yet, this means that most of our weights will be in the green area of the plot below

![](Images/weights_sampling.png)

Numpy can generate a matrix with numbers sampled from the normal distribution using [numpy.random.normal($\mu, \sigma,$ matrix_size)](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html).
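
To see why it helps to shrink the weights as the previous layer grows, here is a rough simulation. It is a sketch under simplifying assumptions: the outputs of the previous layer are drawn uniformly between 0 and 1, the standard deviation is scaled as one over the square root of the number of incoming nodes, and the helper `spread_of_z` is invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def spread_of_z(n_prev, sigma, trials=2000):
    """Standard deviation of the weighted sum z over many random trials.

    n_prev -- number of nodes in the previous layer
    sigma  -- standard deviation used when sampling the weights
    """
    # Outputs of the previous layer: sigmoid outputs lie between 0 and 1
    a = rng.uniform(0.0, 1.0, size=(trials, n_prev))
    # One weight per incoming connection, sampled around zero
    w = rng.normal(0.0, sigma, size=(trials, n_prev))
    z = np.sum(a * w, axis=1)
    return np.std(z)


for n_prev in (4, 100, 400):
    unscaled = spread_of_z(n_prev, sigma=1.0)
    scaled = spread_of_z(n_prev, sigma=1.0 / np.sqrt(n_prev))
    print(n_prev, round(unscaled, 2), round(scaled, 2))
```

With a fixed standard deviation the spread of $z$ keeps growing with the size of the previous layer, which pushes the sigmoid into its flat regions. With the scaled standard deviation the spread stays roughly the same no matter how many nodes feed into the sum.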

In [None]:
def initialize_weights(n_x, n_h, n_y):
    """
    Argument:
    n_x -- number of nodes in the input layer
    n_h -- number of nodes in the hidden layer
    n_y -- number of output nodes
    
    Returns:
    params -- python dictionary containing the weights:
                    W1 -- weight matrix with size (n_h, n_x)
                    W2 -- weight matrix with size (n_y, n_h)
    """
    
    np.random.seed(2) # slight cheating: get the same random numbers every time you run this function
    
    ### START CODE HERE ### (≈ 2 lines of code)
    W1 = None
    W2 = None
    ### END CODE HERE ###
    
    # Make sure your code crashes if the matrix sizes are wrong
    assert (W1.shape == (n_h, n_x))
    assert (W2.shape == (n_y, n_h))
    
    parameters = {"W1": W1,
                  "W2": W2}
    
    return parameters

weights = initialize_weights(2, 4, 3)

print('W1 mean:', np.mean(weights['W1']))
print('W2 mean:', np.mean(weights['W2']))
print('W1 deviation:', np.std(weights['W1']))
print('W2 deviation:', np.std(weights['W2']))

#### Expected output
<table>
<tr><td>W1 mean</td><td>-0.384183451658</td></tr>
<tr><td>W2 mean</td><td>0.00667665903708</td></tr>
<tr><td>W1 deviation</td><td>0.821496588361</td></tr>
<tr><td>W2 deviation</td><td>0.486046904496</td></tr>
</table>

### 2.6 Learning the weights (backpropagation)

When we are training a neural network we need to find a way to use the information we have to adjust the weights in the model. Just as we did with the perceptron, we can calculate how wrong the network is by calculating an error between the true answer in our dataset and the predicted value we get from our network.

![](Images/error_prop.png)

In this more complex model we need a way to attribute part of the error in the output to each of the weights that contributed to it.

The idea is to split the error unevenly. We give more of the error to the incoming links which had greater link weights, because they contributed more to the error. 

![](Images/error_prop2.png)

We can extend this same idea to many more nodes. If we had 100 nodes
connected to an output node, we’d split the error across the 100 connections to
that output node in **proportion** to each link’s contribution to the error, indicated
by the size of the link’s weight.

We can even do this if we have multiple output nodes! 
![](Images/backprop.png)
(In the image the output is called $o_i$, but the standard convention is to use $y_{i \space prediction}$ or $\hat{y}_i$)

We can reuse the same error rule as we did in the perceptron model

$$
e = y_{expected} - \hat{y}
$$

We can then use this error as a measure of how good our model is. This error can also be used to tune our model to get better at predicting our dataset.

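We can then calculate the share of the error $e_1$ that belongs to $w_{1,1}$, where $w_{1,1}$ and $w_{2,1}$ are the two weights feeding into output node 1, by calculating

$$
\frac{w_{1,1}}{w_{1,1}+w_{2,1}} \cdot e_1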
$$
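
In code, splitting one output node's error across its incoming links could look like the sketch below. The function `split_error` and the numbers are hypothetical, and the sketch assumes the incoming weights are positive so that their sum is not zero.

```python
def split_error(error, incoming_weights):
    """Split an output node's error across its incoming links,
    in proportion to each link's weight."""
    total = sum(incoming_weights)
    return [error * w / total for w in incoming_weights]


# Hypothetical numbers: an error of 2.0 and incoming weights 3.0 and 1.0
shares = split_error(2.0, [3.0, 1.0])
print(shares)  # [1.5, 0.5]
```

The link with three times the weight receives three times the blame, and the shares add up to the original error.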

#### Propagating the error further back

We can expand this concept to a network with one hidden layer.

![](Images/backprop2.png)

Working back from the final output layer at the right hand side, we can see that
we use the errors in that output layer to guide the refinement of the link weights feeding into the final layer. We’ve labelled the output errors more generically as $e_{output}$ and the weights of the links between the hidden and output layer as $w_{ho}$. We worked out the specific errors associated with each link by splitting the output errors in proportion to the size of the link weights.

By showing this visually, we can see what we need to do for the new additional layer. We simply take those errors associated with the output of the hidden layer nodes, and split those again proportionally across the preceding links between the input and hidden layer $w_{ih}$.

![](Images/backprop3.png)

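The important thing here is to notice how the error $e_{hidden,1}$ depends on the errors propagated back from both output nodes. With $w_{j,k}$ denoting the weight on the link from hidden node $j$ to output node $k$, each output error is split over the links entering that output node, and we get the following equation

$$
e_{hidden,1} =  e_{output,1} \cdot \frac{w_{1,1}}{w_{1,1}+w_{2,1}} + e_{output,2} \cdot \frac{w_{1,2}}{w_{1,2}+w_{2,2}}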
$$
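
The same bookkeeping can be sketched for a whole layer at once. This is an illustration only: the array layout `W[j, k]` (row = hidden node, column = output node), the function name `hidden_errors` and the numbers are assumptions made for the example.

```python
import numpy as np

def hidden_errors(W, e_output):
    """Propagate output errors back to the hidden nodes.

    W[j, k] is the weight on the link from hidden node j to output node k.
    e_output[k] is the error at output node k.
    """
    W = np.asarray(W, dtype=float)
    e_output = np.asarray(e_output, dtype=float)
    # Share of each output node's error that belongs to each incoming link
    fractions = W / W.sum(axis=0, keepdims=True)
    # Each hidden node collects its share from every output node
    return fractions @ e_output

W = np.array([[2.0, 1.0],    # links leaving hidden node 1
              [3.0, 4.0]])   # links leaving hidden node 2
e_out = np.array([0.8, 0.5])
e_hid = hidden_errors(W, e_out)
print(e_hid)
```

Because the shares of every output node add up to 1, the hidden errors sum to the same total as the output errors.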

We can repeat the process for $e_{hidden,2}$, $e_{input,1}$ and $e_{input,2}$; the calculation is the same, just with different weights.

If we had even more layers, we’d repeatedly apply this same idea to each layer, working backwards from the final output layer. The error is propagated backwards through the network, hence the name _backpropagation_.

This final image shows this idea all the way back to the input layer.

![](Images/backprop4.png)

##### Assignment (optional)
Do the actual calculations. We know the answer is in the image, but we encourage you not to look at it before you are finished.

Now we hopefully have some understanding of how the error is propagated back through the network. Next on the list is to do it all again with matrices to get a good vectorization of the process!

#### Propagation with matrices

First, we need to propagate the input forwards through the network. This is very similar to what we did with the perceptron.

Let's define $A$ as the activation matrix, i.e. the output values from the hidden layer.

$$
Z^{[i]} = W^{[i]} \cdot A^{[i-1]}
$$

$$
A^{[i]} = \sigma (Z^{[i]})
$$
Where the $\sigma$ is the sigmoid activation function!

There we go! The forward computation of an entire layer of a neural network reduced to two simple formulas!

Remember that we had a theta in the perceptron that could change the intercept of our activation function. We also need this in the neural network. In our neural network we actually need such a bias in every node, so we represent it with the vector **b**. In the image we can see this bias represented as a node that always has the value +1, with connections to all the nodes in the next layer.

![](Images/bias.png)

If we represent the bias in this way we can use the same backpropagation as we use to learn the other weights. 

To add the bias to the calculations we just add it to the weighted sum. The bias vector for layer $i$ is given by

\begin{equation*}
b^{[i]} = \begin{bmatrix} 
    b^{[i]}_1\\
    b^{[i]}_2\\
    \vdots\\
    b^{[i]}_n
    \end{bmatrix}
\end{equation*}

This makes our formulas look like this:
$$
A^{[i]} = \sigma (W^{[i]} \cdot A^{[i-1]} + b^{[i]})
$$
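
A minimal sketch of this step for a single layer, with made-up sizes and random values. It assumes, as in the docstrings in this notebook, that the weight matrix has one row per node in the current layer and that each column of the activation matrix is one example, so no explicit transpose is needed in the code. The name `layer_forward` is invented for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, A_prev):
    """One layer: weighted sum plus bias, followed by the sigmoid activation."""
    Z = W @ A_prev + b      # b has shape (n, 1) and is broadcast over all examples
    return sigmoid(Z)

n_prev, n, m = 3, 2, 5      # nodes in previous layer, nodes in this layer, examples
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n_prev))
b = np.zeros((n, 1))
A_prev = rng.normal(size=(n_prev, m))

A = layer_forward(W, b, A_prev)
print(A.shape)  # (2, 5): one row per node, one column per example
```

Keeping the bias as a column vector of shape `(n, 1)` is what lets NumPy add the same bias to every example.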

Let's create an initialization function that prepares both the weights and the bias vectors as well:

In [None]:
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- number of nodes in the input layer
    n_h -- number of nodes in the hidden layer
    n_y -- number of output nodes
    
    Returns:
    params -- python dictionary containing the weights:
                    W1 -- weight matrix with size (n_h, n_x)
                    b1 -- bias vector of size (n_h, 1)
                    W2 -- weight matrix with size (n_y, n_h)
                    b2 -- bias vector of size (n_y, 1)
    """
    
    np.random.seed(2) # slight cheating: get the same random numbers every time you run this function
    
    ### START CODE HERE ### (≈ 2 lines of code)
    b1 = None
    b2 = None
    ### END CODE HERE ###
    
    # make sure your code crashes if the sizes are wrong
    assert (b1.shape == (n_h, 1))
    assert (b2.shape == (n_y, 1))
    
    # fetch the weights from earlier
    weights = initialize_weights(n_x, n_h, n_y)
    
    parameters = {**weights,
                  "b1": b1,
                  "b2": b2}
    
    return parameters

initialize_parameters(2, 4, 3)

Now let's implement the forward propagation in our neural network. Remember the equations for the forward propagation:

$$
Z^{[i]} = W^{[i]} \cdot A^{[i-1]} + b^{[i]}
$$

$$
A^{[i]} = \sigma (Z^{[i]})
$$

In [None]:
def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)
    
    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    ### START CODE HERE ### (≈ 4 lines of code)
    W1 = None
    b1 = None
    W2 = None
    b2 = None
    ### END CODE HERE ###
    
    # Implement Forward Propagation to calculate A2 (probabilities)
    ### START CODE HERE ### (≈ 4 lines of code)
    Z1 = None
    A1 = None
    Z2 = None
    A2 = None
    ### END CODE HERE ###
    
    assert(A2.shape == (1, X.shape[1]))
    
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache

In [None]:
X_assess, parameters = forward_propagation_test_case()
A2, cache = forward_propagation(X_assess, parameters)

# Note: we use the mean here just to make sure that your output matches ours. 
print(np.mean(cache['Z1']), np.mean(cache['A1']), np.mean(cache['Z2']), np.mean(cache['A2']))

**Expected Output**:
<table style="width:50%">
  <tr>
    <td> 0.262818640198 0.546706375956 -1.29856499368 0.214406626539 </td> 
  </tr>
</table>

### 2.7 Updating the weights (gradient descent)

This part of the notebook is quite math heavy. The important part is that you understand the general concept. You can always come back to this later, and if you are more of a visual learner we can recommend the neural network videos by [3blue1brown on YouTube](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi).

So far, we’ve got the errors propagated back to each layer of the network. Why did we do this? Because the error is used to guide how we adjust the link weights to improve the overall answer given by the neural network. The question is how do we use this error to update the weights? 

Because of the way a neural network is defined, there is no simple formula that tells us what the optimal weights should be. We actually need to search for good weights with an optimization method, and that can be hard, but it is possible.

Another problem is that the training data might not be sufficient to properly teach a network. The training data might have errors so our assumption that it is the perfect truth, something to learn from, is then flawed. The network itself might not have enough layers or nodes to model the right
solution to the problem.

What this means is we must take an approach that is realistic, and recognises these limitations. If we do that, we might find an approach which isn’t mathematically perfect but does actually give us better results because it doesn’t make false idealistic assumptions.

Let’s illustrate what we mean by this. Imagine a very complex landscape with peaks and troughs, and hills with treacherous bumps and gaps. It’s dark and you can’t see anything. You know you’re on the side of a hill and you need to get to the bottom. You don’t have a map, but you have a torch. What do you do? You’ll probably use the torch to look at the area close to your feet. You can see which bit of earth seems to be going downhill
and take small steps in that direction. In this way, you slowly work your way down the hill, step by step.

![](Images/gd_metafor.png)

This is known as **gradient descent** in maths. This method lets you reach your goal of getting down from the mountain. What’s the link between this really cool gradient descent method and neural networks? Well, if the complex difficult function is the error of the network, then going downhill to find the minimum means we’re minimizing the error. We’re improving the network’s output.

![](Images/gd.png)
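
To make the hill-walking picture concrete, here is a small sketch of gradient descent on a function with a single parameter. The function $f(x) = (x - 3)^2$, the starting point and the step size are arbitrary choices for illustration.

```python
def f(x):
    return (x - 3.0) ** 2

def slope(x):
    return 2.0 * (x - 3.0)          # derivative of f

x = 0.0                             # starting position on the "hill"
step_size = 0.1
for _ in range(100):
    x = x - step_size * slope(x)    # take a small step downhill

print(x, f(x))  # x ends up very close to the minimum at 3
```

Note that the loop never looks at the whole function. It only uses the local slope, just like the torch in the analogy.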

This image shows what it would look like if we had 2 parameters to tune. As you might have noticed, it is possible to get stuck in a pit high up on the mountain. That's beyond the scope of this guide, but there are things we can do to reduce that problem. And in the high-dimensional spaces we work with in deep learning (often millions of parameters), this tends to be much less of a problem in practice. With this said, let's improve our error function to better help the optimization.

There are some things to watch out for here. Look at the following table of target and actual values for three output nodes, together with candidates for an error function.

![](Images/error_functions.png)

The first candidate for an error function is simply $(target - actual)$, like we used before. That seems reasonable enough, right? Well if you look at the sum over the nodes to get an overall figure for how well the network is trained, you’ll see the sum is zero!

What happened? Clearly the network isn’t perfectly trained because the first two node outputs are different to the target values. The sum of zero suggests there is no error. This happens because the positive and negative errors cancel each other out. Even if they didn’t cancel out completely, you can see this is a bad measure of error.

We correct this by taking the absolute value of the difference. That means ignoring the sign, and is written |target - actual|. That could work, because nothing can ever cancel out. The reason this isn’t popular is because the slope isn’t continuous near the minimum and this makes gradient descent not work so well, because we can bounce around the V-shaped valley that this error function has. The slope doesn’t get smaller closer to the minimum, so our steps don’t get smaller, which means they risk overshooting.

The third option is to take the square of the difference $(target - actual)^2$ . There are several reasons why we prefer this third one over the second one:

* The algebra needed to work out the slope for gradient descent is easy enough.
* It is smooth and continuous making gradient descent work well - there are no gaps or abrupt jumps.
* The gradient gets smaller nearer the minimum, meaning the risk of overshooting the objective gets smaller if we use it to moderate the step sizes.
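
The difference between the three candidates is easy to check numerically. The numbers below are invented for the example (they are not read from the table in the image) and are chosen so that the raw differences cancel.

```python
import numpy as np

target = np.array([0.5, 0.75, 1.0])
actual = np.array([0.25, 1.0, 1.0])

diff = target - actual
print("sum of differences:         ", np.sum(diff))          # positive and negative errors cancel
print("sum of absolute differences:", np.sum(np.abs(diff)))
print("sum of squared differences: ", np.sum(diff ** 2))
```

Only the first candidate reports zero error here, even though two of the three outputs are wrong.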

#### Assignment Implement the error function

To build our neural network we need an error function that can help our optimization work. Implement the squared error function

$$
\epsilon_{error} = (y - \hat{y})^2
$$

When we have multiple examples we want to sum up all the errors so that we can find the overall error in our network. You can treat both `Y_actual` and `Y_pred` as vectors.

**Hint:** you might want to use np.sum and np.subtract

In [None]:
def compute_error(Y_actual, Y_pred):
    """
    Computes the squared error
    
    Arguments:
    Y_pred -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y_actual -- "true" labels vector of shape (1, number of examples)
    
    Returns:
    squared error
    """

    # Compute the error function
    ### START CODE HERE ### (≈ 1 line of code)
    error = None
    ### END CODE HERE ###
    
    error = np.squeeze(error)  # makes sure error comes out as a number ([[1]] => 1).
    
    return error

error = compute_error([[1.0,6.0],[1.0,6.0]], [[2.0,1.0],[2.0,1.0]])

print("Error:", error)

#### Expected output
<table>
<tr><td>Error</td><td>52.0</td></tr>
</table>

To do gradient descent, we now need to work out the slope of the error function
with respect to the weights. This requires calculus.

![](Images/gd2.png)

The graph is just like the one we saw before to emphasise that we’re not doing anything different. This time the function we’re trying to minimise is the neural network’s error. The parameter we’re trying to refine is a network link weight. In this simple example we’ve only shown one weight, but we know neural networks will have many more. Let's write out what we want mathematically:

$$
\frac{\partial E}{\partial w_{j,k}}
$$

This formula describes how the error changes when we change $w_{j,k}$. The following image shows the situation so that we can follow the computation more easily.

![](Images/gd3.png)

The first step is to expand the error function
$$
\frac{\partial E}{\partial w_{j,k}} = \frac{\partial}{\partial w_{j,k}} \sum_n (y_n - o_n)^2 
$$

Where $y_n$ is the target value for output node $n$ and $o_n$ is our network's prediction at that node.

This expression can be simplified because the output $o_n$ of a node only depends on the weights of the links feeding into that node. The weight $w_{j,k}$ therefore only affects the error at output node $k$, and every other term in the sum has a derivative of zero. This lets us get rid of the sum!

Now we get

$$
\frac{\partial E}{\partial w_{j,k}} = \frac{\partial}{\partial w_{j,k}} (y_k - o_k)^2 
$$

which is a lot more manageable. We now use the chain rule to break this expression into even more manageable pieces:

$$
\frac{\partial E}{\partial w_{j,k}} = \frac{\partial E}{\partial o_{k}} \cdot \frac{\partial o_k}{\partial w_{j,k}}
$$


The next part is easy, it's just a simple derivative of a squared function

$$
\frac{\partial E}{\partial w_{j,k}} = -2(y_k - o_k) \cdot \frac{\partial o_k}{\partial w_{j,k}}
$$

The next part is a bit harder, but not much. Remember that the $o_k$ is the output of node k which is the sigmoid activation applied to a sum of the incoming signals. Then we get

$$
\frac{\partial E}{\partial w_{j,k}} = -2(y_k - o_k) \cdot \frac{\partial }{\partial w_{j,k}} sigmoid\left(\sum_j w_{j,k} \cdot o_j + b\right)
$$

where $o_j$ is the output from the previous layer. Differentiating the sigmoid function is left as an exercise for later, but the result is:

$$
\frac{\partial}{\partial z} sigmoid(z) = sigmoid(z) \cdot (1 - sigmoid(z))
$$
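
If you want to convince yourself of this identity without doing the algebra, you can compare it with a numerical slope. This is only a sanity check at a few arbitrarily chosen points, not a proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 0.5, 3.0])
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
print(np.max(np.abs(numerical - sigmoid_derivative(z))))  # tiny, so the two agree
```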

Now let's put this result in to the error function:

$$
\frac{\partial E}{\partial w_{j,k}} = -2(y_k - o_k) \cdot sigmoid\left(\sum_j w_{j,k} \cdot o_j + b\right) \cdot \left(1 - sigmoid\left(\sum_j w_{j,k} \cdot o_j + b\right)\right) \cdot \frac{\partial}{\partial w_{j,k}} \left(\sum_j w_{j,k} \cdot o_j + b\right)
$$

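The last factor comes from applying the chain rule once more, this time to the expression inside the sigmoid. That last derivative is simple: only the term $w_{j,k} \cdot o_j$ depends on $w_{j,k}$, so it is just $o_j$. Before we write up the final answer we also drop the constant factor 2. We can do this because it only rescales the gradient, and that scale is absorbed by the learning rate we choose later.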

The final expression for the change in error when we change $w_{j,k}$ is
$$
\frac{\partial E}{\partial w_{j,k}} = -(y_k - o_k) \cdot sigmoid\left(\sum_j w_{j,k} \cdot o_j + b\right) \cdot \left(1 - sigmoid\left(\sum_j w_{j,k} \cdot o_j + b\right)\right) \cdot o_j
$$
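
As a sanity check of the derivation, the sketch below compares the result with a finite-difference estimate for a single sigmoid output node, using the sigmoid derivative $sigmoid(z) \cdot (1 - sigmoid(z))$ from above. All numbers are arbitrary, and because the constant factor 2 was dropped, the comparison is made against the slope of half the squared error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def half_squared_error(w, o_prev, b, y):
    o_k = sigmoid(np.dot(w, o_prev) + b)
    return 0.5 * (y - o_k) ** 2

o_prev = np.array([0.2, 0.9, 0.5])   # outputs o_j of the previous layer
w = np.array([0.4, -0.3, 0.8])       # weights w_{j,k} into output node k
b, y = 0.1, 1.0

# Gradient from the formula: -(y_k - o_k) * o_k * (1 - o_k) * o_j
o_k = sigmoid(np.dot(w, o_prev) + b)
analytic = -(y - o_k) * o_k * (1.0 - o_k) * o_prev

# Finite-difference estimate, one weight at a time
h = 1e-6
numerical = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numerical[j] = (half_squared_error(w + step, o_prev, b, y)
                    - half_squared_error(w - step, o_prev, b, y)) / (2 * h)

print(analytic)
print(numerical)
```

This kind of gradient check is also a handy way to debug your own backpropagation code later on.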


**Congratulations! We made it!** And now for the vectorized case we have provided the equations.

<img src="Images/grad_summary.png" style="width:600px;height:300px;">
Where the $g^{[i]}()$ is the activation function in layer i.

Now, it's _finally_ time to implement it:

#### Assignment Implement the backward propagation algorithm

Here we want you to use the vectorized equations of the gradients to implement the backward propagation step that gradient descent needs. Retrieve the parameters from the parameters dictionary.

In [None]:
def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.
    
    Arguments:
    parameters -- python dictionary containing our parameters W1, W2 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    
    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # First, retrieve W1 and W2 from the dictionary "parameters".
    ### START CODE HERE ### (≈ 2 lines of code)
    W1 = None
    W2 = None
    ### END CODE HERE ###
        
    # Retrieve also A1 and A2 from dictionary "cache".
    ### START CODE HERE ### (≈ 2 lines of code)
    A1 = None
    A2 = None
    ### END CODE HERE ###
    
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    ### START CODE HERE ### (≈ 6 lines of code, corresponding to 6 equations from the image above)
    dZ2 = None
    dW2 = None
    db2 = None
    dZ1 = None
    dW1 = None
    db1 = None
    ### END CODE HERE ###
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads

In [None]:
parameters, cache, X_assess, Y_assess = backward_propagation_test_case()

grads = backward_propagation(parameters, cache, X_assess, Y_assess)
print ("dW1 = "+ str(grads["dW1"]))
print ("db1 = "+ str(grads["db1"]))
print ("dW2 = "+ str(grads["dW2"]))
print ("db2 = "+ str(grads["db2"]))

**Expected output**:

<table style="width:80%">
  <tr>
    <td>**dW1**</td>
    <td> [[ 0.00301023 -0.00747267]
 [ 0.00257968 -0.00641288]
 [-0.00156892  0.003893  ]
 [-0.00652037  0.01618243]] </td> 
  </tr>
  
  <tr>
    <td>**db1**</td>
    <td>  [[ 0.00176201]
 [ 0.00150995]
 [-0.00091736]
 [-0.00381422]] </td> 
  </tr>
  
  <tr>
    <td>**dW2**</td>
    <td> [[ 0.00078841  0.01765429 -0.00084166 -0.01022527]] </td> 
  </tr>
  

  <tr>
    <td>**db2**</td>
    <td> [[-0.16655712]] </td> 
  </tr>
  
</table>  

### 2.8 Learning rate

The learning rate is a small number that controls how large a step we take in our gradient descent algorithm. Back to the hill analogy: if we take small steps, we'll probably get down to the bottom eventually, but it's going to take a long time. It's also easier to get stuck between two small rocks. If we jump around in seven-league boots, we risk stepping over the entire valley. Ideally, we want something in between.

In a plot, it looks like this:
![](Images/sgd.gif)

Too large steps look like this:
![](Images/sgd_bad.gif)

There's no agreement on some "perfect" value, so you have to try out different values and see what works for the problem you are trying to solve. Some good values to try are 0.2, 0.02 and 0.002.
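
The effect of the step size is easy to see on a toy problem. The sketch below minimizes $f(x) = x^2$ from the same starting point with three learning rates. The function and the values are chosen for illustration only and say nothing about what works best for a real network.

```python
def descend(learning_rate, steps=20, x=1.0):
    """Run gradient descent on f(x) = x**2 and return the final position."""
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x   # the slope of x**2 is 2x
    return x

for lr in (0.001, 0.3, 1.1):
    print(lr, descend(lr))
# 0.001 barely moves, 0.3 gets close to the minimum at 0, 1.1 overshoots further on every step
```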

### 2.9 Putting it all together

Now let's put the neural network together.

#### Assignment Update the weights

Here we want you to implement the update method for the gradient. Remember the update method will look something like this

$$
Parameter = Parameter - learning\_rate * \Delta Parameter
$$
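
As a sketch of this rule on a toy example (the dictionary keys mirror the ones used in this notebook, but the values and the helper name `gradient_step` are made up):

```python
import numpy as np

def gradient_step(parameters, grads, learning_rate):
    """Return new parameters after one gradient descent step."""
    return {name: value - learning_rate * grads["d" + name]
            for name, value in parameters.items()}

toy_parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
toy_grads = {"dW1": np.array([[0.5, 0.5]]), "db1": np.array([[-1.0]])}

updated = gradient_step(toy_parameters, toy_grads, learning_rate=0.5)
print(updated["W1"], updated["b1"])
```

A positive gradient pushes the parameter down and a negative gradient pushes it up, which is exactly the "step downhill" from the analogy.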

In [None]:
def update_parameters(parameters, grads, learning_rate = 0.02):
    """
    Updates parameters using the gradient descent update rule given above
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients 
    learning_rate -- step size used in the update rule
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """
    # Retrieve each parameter from the dictionary "parameters"
    ### START CODE HERE ### (≈ 4 lines of code)
    W1 = None
    b1 = None
    W2 = None
    b2 = None
    ### END CODE HERE ###
    
    # Retrieve each gradient from the dictionary "grads"
    ### START CODE HERE ### (≈ 4 lines of code)
    dW1 = None
    db1 = None
    dW2 = None
    db2 = None
    ### END CODE HERE ###
    
    # Update rule for each parameter
    ### START CODE HERE ### (≈ 4 lines of code)
    W1 = None
    b1 = None
    W2 = None
    b2 = None
    ### END CODE HERE ###
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters

In [None]:
parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

**Expected Output**:


<table style="width:80%">
  <tr>
    <td>**W1**</td>
    <td> [[-0.00615505  0.01694318]
 [-0.02313436  0.03151137]
 [-0.01691533 -0.01758272]
 [ 0.00937293 -0.0503442 ]]</td> 
  </tr>
  
  <tr>
    <td>**b1**</td>
    <td> [[ -8.99634857e-07]
 [  8.23198382e-06]
 [  6.08613736e-07]
 [ -2.55653636e-06]]</td> 
  </tr>
  
  <tr>
    <td>**W2**</td>
    <td> [[-0.01043155 -0.04026412  0.01609725  0.04445369]] </td> 
  </tr>
  

  <tr>
    <td>**b2**</td>
    <td> [[  9.17132841e-05]] </td> 
  </tr>
  
</table>  

#### Assignment Assemble the full Neural Network

Put together all the methods we have already created to create a neural network that can learn

**Hint:** use the methods you have created earlier. If you have skipped some code-blocks you might want to copy in the relevant code-blocks from the solutions.

In [None]:
def nn_model(X, Y, num_hidden_nodes, num_iterations = 10000, learning_rate=0.2, print_error=False, seed=3):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (1, number of examples)
    num_hidden_nodes -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    learning_rate -- step size used in the parameter update
    print_error -- if True, print the error every 100 iterations
    seed -- seed for the random number generator
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    
    np.random.seed(seed)
    n_x, n_h, n_y = layer_sizes(X, Y, num_hidden_nodes)
    
    # Initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
    ### START CODE HERE ### (≈ 5 lines of code)
    parameters = None
    W1 = None
    b1 = None
    W2 = None
    b2 = None
    ### END CODE HERE ###

    
    # Loop (gradient descent)

    for i in range(0, num_iterations):
         
        ### START CODE HERE ### (≈ 4 lines of code)
        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = None
        
        # Error function. Inputs: "Y, A2". Outputs: "error".
        error = None
 
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = None
 
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = None
        
        ### END CODE HERE ###
        
        # Print the error every 100 iterations
        if print_error and i % 100 == 0:
            print ("error after iteration %i: %f" %(i, error))

    return parameters

In [None]:
X_assess, Y_assess = nn_model_test_case()
parameters = nn_model(X_assess, Y_assess, 4, num_iterations=10000, print_error=True, seed=3)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

**Expected Output**:

<table style="width:90%">

<tr> 
    <td> 
        **error after iteration 0**
    </td>
    <td> 
        0.385911
    </td>
</tr>

<tr> 
    <td> 
        <center> $\vdots$ </center>
    </td>
    <td> 
        <center> $\vdots$ </center>
    </td>
</tr>

  <tr>
    <td>**W1**</td>
    <td> [[ -4.81151934  11.78731014]
 [ -6.73856549  15.30642094]
 [  2.12481212  -4.91014969]
 [  7.2665111  -10.83691947]]</td> 
  </tr>
  
  <tr>
    <td>**b1**</td>
    <td> [[ -6.11682924]
 [ -8.18818125]
 [ -3.49542532]
 [-10.72416314]] </td> 
  </tr>
  
  <tr>
    <td>**W2**</td>
    <td> [[-3.49268059 -3.99215598  2.72638945  4.90732591]] </td> 
  </tr>
  

  <tr>
    <td>**b2**</td>
    <td> [[ 0.27470882]] </td> 
  </tr>
  
</table>  

#### Predict on new data

We have now implemented a neural network with the power to predict. Let's also implement the actual prediction part of the network.

predictions = $\hat{y} = \mathbb 1 \text{{activation > 0.5}} = \begin{cases}
      1 & \text{if}\ activation > 0.5 \\
      0 & \text{otherwise}
    \end{cases}$  
    
As an example, if you would like to set the entries of a matrix X to 0 and 1 based on a threshold you would do: X_new = (X > threshold)
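
For instance, on a small made-up array of activations:

```python
import numpy as np

activations = np.array([[0.1, 0.5, 0.51, 0.97]])
labels = (activations > 0.5).astype(int)   # boolean mask converted to 0/1
print(labels)  # [[0 0 1 1]]
```

Note that a value of exactly 0.5 is not greater than the threshold, so it ends up in class 0.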

#### Assignment Prediction

Implement the prediction part of the network. 

**Hint:** you might want to reuse the forward propagation part of the network

In [None]:
def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    X -- input data of size (n_x, m)
    
    Returns
    predictions -- vector of predictions of our model (0 or 1 for each example)
    """
    
    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    ### START CODE HERE ### (≈ 2 lines of code)
    A2, cache = None
    predictions = None
    ### END CODE HERE ###
    
    return predictions

In [None]:
parameters, X_assess = predict_test_case()

predictions = predict(parameters, X_assess)
print("predictions mean = " + str(np.mean(predictions)))

**Expected Output**: 


<table style="width:40%">
  <tr>
    <td>**predictions mean**</td>
    <td> 1.0 </td> 
  </tr>
  
</table>

## 3 Using our neural network
Let's now use our neural network to classify cat images!

### 3.1 Overview of the problem set

**Problem Statement**: You are given a dataset ("data.h5") containing:

- a training set of m_train images labeled as cat (y=1) or non-cat (y=0)
- a test set of m_test images labeled as cat or non-cat
- each image is of shape (num_px, num_px, 3) where 3 is for the 3 channels (RGB). Thus, each image is square (height = num_px and width = num_px).

You will build a simple image-recognition algorithm that can correctly classify pictures as cat or non-cat.

Let's get more familiar with the dataset. Load the data by running the following code.

In [None]:
# Loading the data (cat/non-cat)
train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()

We added "_orig" at the end of image datasets (train and test) because we are going to preprocess them. After preprocessing, we will end up with train_set_x and test_set_x (the labels train_set_y and test_set_y don't need any preprocessing).

Each line of your train_set_x_orig and test_set_x_orig is an array representing an image. You can visualize an example by running the following code. Feel free also to change the `index` value and re-run to see other images. 

In [None]:
# Example of a picture
index = 7
plt.imshow(train_set_x_orig[index])
print ("y = " + str(train_set_y[:, index]) + ", it's a '" + classes[np.squeeze(train_set_y[:, index])].decode("utf-8") +  "' picture.")

Many software bugs in deep learning come from having matrix/vector dimensions that don't fit. If you can keep your matrix/vector dimensions straight you will go a long way toward eliminating many bugs. 

**Exercise:** Find the values for:

- m_train (number of training examples)
- m_test (number of test examples)
- num_px (= height = width of a training image)

Remember that `train_set_x_orig` is a numpy-array of shape (m_train, num_px, num_px, 3). For instance, you can access `m_train` by writing `train_set_x_orig.shape[0]`.

In [None]:
### START CODE HERE ### (≈ 3 lines of code)
m_train = None
m_test = None
num_px = None
### END CODE HERE ###

print ("Number of training examples: m_train = " + str(m_train))
print ("Number of testing examples: m_test = " + str(m_test))
print ("Height/Width of each image: num_px = " + str(num_px))
print ("Each image is of size: (" + str(num_px) + ", " + str(num_px) + ", 3)")
print ("train_set_x shape: " + str(train_set_x_orig.shape))
print ("train_set_y shape: " + str(train_set_y.shape))
print ("test_set_x shape: " + str(test_set_x_orig.shape))
print ("test_set_y shape: " + str(test_set_y.shape))

**Expected Output for m_train, m_test and num_px**: 
<table style="width:15%">
  <tr>
    <td>**m_train**</td>
    <td> 209 </td> 
  </tr>
  
  <tr>
    <td>**m_test**</td>
    <td> 50 </td> 
  </tr>
  
  <tr>
    <td>**num_px**</td>
    <td> 64 </td> 
  </tr>
  
</table>


For convenience, you should now reshape images of shape (num_px, num_px, 3) into a numpy-array of shape (num_px $*$ num_px $*$ 3, 1). After this, our training (and test) dataset is a numpy-array where each column represents a flattened image. There should be m_train (respectively m_test) columns.

**Exercise:** Reshape the training and test data sets so that images of size (num_px, num_px, 3) are flattened into single vectors of shape (num\_px $*$ num\_px $*$ 3, 1).

A trick when you want to flatten a matrix X of shape (a,b,c,d) to a matrix X_flatten of shape (b$*$c$*$d, a) is to use: 
```python
X_flatten = X.reshape(X.shape[0], -1).T      # X.T is the transpose of X
```
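
To see what this does, here is a small sketch with a fake dataset of 4 tiny 2x2 RGB images (the sizes are made up; the real dataset is larger):

```python
import numpy as np

# Fake dataset with layout (examples, height, width, channels)
fake_images = np.arange(4 * 2 * 2 * 3).reshape(4, 2, 2, 3)
flat = fake_images.reshape(fake_images.shape[0], -1).T

print(flat.shape)   # (12, 4): one column per image
# Column i holds exactly the pixels of image i
print(np.array_equal(flat[:, 1], fake_images[1].ravel()))  # True
```

Reshaping to `(a, -1)` first and transposing afterwards is what keeps the pixels of one image together in one column.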

In [None]:
# Reshape the training and test examples

### START CODE HERE ### (≈ 2 lines of code)
train_set_x_flatten = None
test_set_x_flatten =  None
### END CODE HERE ###

print ("train_set_x_flatten shape: " + str(train_set_x_flatten.shape))
print ("train_set_y shape: " + str(train_set_y.shape))
print ("test_set_x_flatten shape: " + str(test_set_x_flatten.shape))
print ("test_set_y shape: " + str(test_set_y.shape))
print ("sanity check after reshaping: " + str(train_set_x_flatten[0:5,0]))

**Expected Output**: 

<table style="width:35%">
  <tr>
    <td>**train_set_x_flatten shape**</td>
    <td> (12288, 209)</td> 
  </tr>
  <tr>
    <td>**train_set_y shape**</td>
    <td>(1, 209)</td> 
  </tr>
  <tr>
    <td>**test_set_x_flatten shape**</td>
    <td>(12288, 50)</td> 
  </tr>
  <tr>
    <td>**test_set_y shape**</td>
    <td>(1, 50)</td> 
  </tr>
  <tr>
  <td>**sanity check after reshaping**</td>
  <td>[17 31 56 22 33]</td> 
  </tr>
</table>

To represent color images, the red, green and blue channels (RGB) must be specified for each pixel, and so the pixel value is actually a vector of three numbers ranging from 0 to 255.


![](Images/imvectorkiank.png)


One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array. But for picture datasets, it is simpler and more convenient, and works almost as well, to just divide every row of the dataset by 255 (the maximum value of a pixel channel).


Let's standardize our dataset.

In [None]:
train_set_x = train_set_x_flatten/255.
test_set_x = test_set_x_flatten/255.

![](Images/2layerNN_kiank.png)

Now, it's time to learn the weights and make some predictions!

In [None]:
parameters = nn_model(X=train_set_x, Y=train_set_y, num_hidden_nodes=100, num_iterations=5000, print_error=True, seed=3)
pred = predict(parameters=parameters, X=test_set_x)
print("Predictions:", pred)

#### How did we do?

Let's try to find out how we did by comparing to the truth:

In [None]:
print("Correct answers:", np.sum(test_set_y == pred))
print("Wrong answers:", np.sum(test_set_y != pred))
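
The same comparison can be turned into an accuracy percentage. The sketch below uses small made-up arrays in place of `test_set_y` and `pred` so that it runs on its own; `accuracy` is a helper name invented here.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

toy_labels = np.array([[1, 0, 1, 1, 0]])
toy_pred = np.array([[1, 0, 0, 1, 1]])
print("Accuracy: {:.0%}".format(accuracy(toy_labels, toy_pred)))  # 3 of 5 correct
```

With your own results you would call it as `accuracy(test_set_y, pred)`.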

That's it, you have finished the entire workshop. Congratulations!
Next we challenge you to apply the neural network to a problem of your own.