# Classification with Neural Networks
In this chapter, you will understand the workings of a classifier and manually train one that operates on a single value. You will improve the classifier step by step and learn fundamental concepts about classification as you go along.
Finally, you will use automated backpropagation to train a multi layer neural network to emulate a logic gate.

## Introduction
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). [1]

A classification process requires a dataset that is split into different categories. A classifier can be trained on this dataset by learning the relationship between certain properties of the input data and the corresponding categories. 
To classify new data, the process is similar as in the chapter "Regression", however additional computational steps can be added depending on the application.
A common classification problem that can be solved by neural networks is image recognition (seen in Figure 1).



<img src="images/neural_network_classification.png" />
<p style="text-align: center;">
    Fig. 1 - Image recognition by a neural network
</p>

Run the cells below to import the needed libraries and define a ReLU and MSE Loss function and a SimpleNeuron Class.

In [1]:
# do not change
import numpy as np
from ipywidgets import interact, Layout, FloatSlider
import plotly.offline as plotly
import plotly.graph_objs as go
import time
import threading

In [2]:
# do not change
def relu(input_val):
    return np.where(input_val > 0, input_val, 0.0)

In [3]:
# do not change
def mean_squared_loss(predictions, solutions):
    total_squared_loss = np.sum(np.subtract(predictions, solutions)**2) #np allows to handle both values and lists
    mean_squared_loss = total_squared_loss/len(predictions)
    return mean_squared_loss

In [4]:
# do not change
class SimpleNeuron:
    def __init__(self, plot):
        self.plot = plot
        self.plot.register_neuron(self) #hey plot, remember me

    def set_values(self, weight, bias):
        self.weight = weight
        self.bias = bias
        self.plot.update() #hey plot, I have changed, redraw my output
        
    def get_weight(self):
        return self.weight
    
    def get_bias(self):
        return self.bias

    def compute(self, x):
        self.activation = np.dot(self.weight, x) + self.bias
        return self.activation

In [5]:
# do not change
# an Interactive Plot monitors the activation of a neuron or a neural network
class Interactive2DPlot:
    def __init__(
        self, points_red, points_blue, ranges, loss_function=mean_squared_loss, loss_string="Loss", width=800, height=400, margin=dict(t=0, l=170),
        draw_time=0.1
    ):
        self.idle = True
        self.points_red = points_red
        self.points_blue = points_blue
        self.draw_time = draw_time
        self.loss_function = loss_function
        self.loss_string = loss_string

        self.x = np.arange(ranges["x"][0], ranges["x"][1], 0.01)
        self.y = np.arange(ranges["y"][0], ranges["y"][1], 0.01)

        self.layout = go.Layout(
            xaxis=dict(title="Neck height in m", range=ranges["x"]),
            yaxis=dict(title="y", range=ranges["y"]),
            width=width,
            height=height,
            showlegend=False,
            margin=margin,
        )
        self.trace = go.Scatter(x=self.x, y=self.y)

        self.plot_points_red = go.Scatter(
            x=points_red["x"], y=points_red["y"], mode="markers", marker=dict(color='rgb(255, 0, 0)', size=10)
        )
        self.plot_points_blue = go.Scatter(
            x=points_blue["x"],
            y=points_blue["y"],
            mode="markers",
            marker=dict(color='rgb(0, 0, 255)', size=10, symbol="square"),
        )

        self.plot_point_new = go.Scatter(
            x=[], y=[], mode="markers", marker=dict(size=20, symbol="star", color='rgb(0,0,0)')
        )

        self.data = [self.trace, self.plot_points_red, self.plot_points_blue, self.plot_point_new]
        self.plot = go.FigureWidget(self.data, self.layout)

    def register_neuron(self, neuron):
        self.neuron = neuron

    def redraw(self):
        self.idle = False
        time.sleep(self.draw_time)
        self.plot.data[0].y = self.neuron.compute(self.x)
        self.idle = True

    def update(self):
        loss_red = self.loss_function(self.neuron.compute(self.points_red["x"]), self.points_red["y"])
        loss_blue = self.loss_function(self.neuron.compute(self.points_blue["x"]), self.points_blue["y"])
        print(self.loss_string,": {:0.3f}".format((loss_red + loss_blue) / 2))

        if self.idle:
            thread = threading.Thread(target=self.redraw)
            thread.start()

# From Regression to Classification

##  Linear Regression

You find yourself working on a farm with sheep and llamas grazing in seperate enclosures. However, last night the shepard forgot to close the gate between the two enclosures. The llamas and sheep now are mixed and have to be seperated again. You immediately come up with a machine learning based solution to separate the sheep from the llamas again: You assume that llamas can be distinguished from sheep by measuring the distance from the top of their head to their spine, since llamas have significantly longer necks. Using a LIDAR scanner, neck heights will be measured autonomously and the animals will be seperated using a food enticement and an electronic turnstile that only lets llamas through.


<img src="images/neck_heights.png" />
<p style="text-align: center;">
    Fig. 2 - Concept of neck height measurement
</p>

To collect sample data, you go out on the field with a measuring tape and measure the neck heights of some sheep and llamas. You specify two categories: '0' for sheep and '1' for llamas. (See table 1)

Most llamas are grown up and have long necks, but there are also some young llamas with smaller necks, but since their necks are still longer than the sheeps' you figure that this won't be a problem.

|  Animal | Neck height  | Category  |
|---------|--------------|-----------|
| Sheep #1| 0.05m        |0          |
| Sheep #2| 0.08m        |0          |
| Sheep #3| 0.13m        |0          |
| Sheep #4| 0.17m        |0          |
| Sheep #5| 0.20m        |0          |
| Llama #1| 0.35m        |1          |
| Llama #2| 0.68m        |1          |
| Llama #3| 0.74m        |1          |
| Llama #4| 0.83m        |1          |
| Llama #5| 0.95m        |1          |

<p style="text-align: center;">
    Table. 1 - Your data mining results
</p>





### Training a Linear Regression Neuron
For the sake of simplicity, you start by using a single neuron as a classifier. Run the two cells below to define the data mining points and to display a plot.

In [6]:
# do not change
points_sheep = dict(
              x=[ 0.05, 0.08, 0.13, 0.17, 0.20],
              y=[ 0, 0, 0, 0, 0]
             )

points_llamas = dict(
              x=[ 0.35, 0.68, 0.74, 0.83, 0.95],
              y=[ 1,  1, 1, 1, 1]
             )

ranges = dict(x=[-0.1, 1.25], y=[-0.5, 1.4])
slider_layout = Layout(width="90%")

In [7]:
# do not change
plot1 = Interactive2DPlot(points_sheep, points_llamas, ranges, loss_string="Mean Squared Loss")
neuron1 = SimpleNeuron(plot1)

interact(
    neuron1.set_values,
    weight=FloatSlider(min=-2, max=4, step=0.1, layout = slider_layout),
    bias=FloatSlider(min=-1, max=1, step=0.1, layout = slider_layout),
)

plot1.plot

interactive(children=(FloatSlider(value=0.0, description='weight', layout=Layout(width='90%'), max=4.0, min=-2…

FigureWidget({
    'data': [{'type': 'scatter',
              'uid': '56d117cb-42c4-4085-a60f-11486da06d5f',
 …

### Task: Question

**Question**:
What is the optimal weight and bias combination in the plot above?

**Answer:** The optimal weight and bias combination is weight = 1.3 and bias = 0.00, resulting in a mean squared loss of 0.053.

***
### Defining a Classifier
Now we want to use our trained neuron to classify new neck heights. To do that, we have to write a program that takes in a neck height and outputs what the trained neuron thinks about it. The classifier will also plot the new neck height. Run the box below to get the values from the task before.

In [8]:
# do not change
#a duplicate of plot 1, so you don't have to scroll
plot2 = Interactive2DPlot(points_llamas, points_sheep, ranges, loss_string="Mean Squared Loss") 
neuron2 = SimpleNeuron(plot2)
neuron2.set_values(neuron1.get_weight(), neuron1.get_bias()) #get your values from last task

plot2.plot

Mean Squared Loss : 0.500


FigureWidget({
    'data': [{'type': 'scatter',
              'uid': '1d76efd2-0fac-432d-8d32-f423a5e02ab0',
 …

### Task: Implement a Classifier
Complete the python code below.

In [9]:
new_neck_height = 0.35  #this shall be varied to answer the questions below

#classification_result = ??

#STUDENT CODE HERE

classification_result = neuron2.compute(new_neck_height)

#STUDENT CODE until HERE

plot2.plot.data[3].x = [new_neck_height] #update plot
plot2.plot.data[3].y = [classification_result]

print("Result:", classification_result)

Result: 0.0


### Task: Questions

**Question:**
What classification value does the smallest llama have? (run the cell above and change new_neck_height)


**Answer:** The smallest llama with a neck height of 0.35m has a classification value of 0.455.

**Question:**
What classification value does an animal with a neck height of 0.1m have?


**Answer:** ~0.13

**Question:**
What classification value does an animal with a neck height of 0.9m have?


**Answer:** ~1.17

**Question:**
Why is the classification value continuous, even though the training data had only two discrete values?


**Answer:** The output of artificial neurons is always a continuous function. The contious categories in between 0 and 1 are a generalization based on linear regression made by the network.

**Question:**
How do you interpret this continuous classification value? Try to describe it in a few words.


**Answer:** The continuous value shows the probability of the algorithm being right in its classification result.

**Question:**
You want to use your continuous result to create a discrete "llama" or "sheep" classifier. The decision should be approximately just as sensitive towards llamas as to sheep.
What threshold y-value would you choose?



**Answer:** 0.5

**Question:**
You want to add more data to your model to improve its performance. As you collect more data, you find a very small llama with a neck height of 0.25m in your dataset. After you train your model on the new data, your discrete classifier decides that this small llama is a sheep. What is the problem with applying thresholds to linear regression models for classification?


**Answer:** The problem with applying thresholds is that the algorithm simply classifies by a fixed value that doesn't consider exceptions.

***
## Logistic Regression

In machine learning, the go-to assumption for an unknown two-class probability distribution is a logistic distribution.[2]
Its cumulated function is the logistic function, of which the sigmoid function is the most used special case. (See Fig 3.)
The sigmoid function allows for a model that approximatly describes most natural occuring probability distributions.[3] (Further reading: see section "Further Reading" at the end of document)
<img src="images/sigmoid.png" />
<p style="text-align: center;">
    Fig. 3 - Sigmoid function
</p>



Run the cell below to define a sigmoid function.

In [10]:
# do not change
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

### Task: Complete Code and Train Neuron


Change the SigmoidNeuron class below to apply a sigmoid function to the final output

In [13]:
class SigmoidNeuron(SimpleNeuron): #inheriting from SimpleNeuron, 
                                   #all functions stay the same unless they are specified here

    def compute(self, x):
        # STUDENT CODE HERE
        
        self.activation = sigmoid(np.dot(self.weight, x) + self.bias)

        # STUDENT CODE until HERE
        return self.activation

In [14]:
# do not change
classification_plot_sig = Interactive2DPlot(points_llamas, points_sheep, ranges, loss_string="Mean Squared Loss")

our_sig_neuron = SigmoidNeuron(classification_plot_sig)

interact(
    our_sig_neuron.set_values,
    weight=FloatSlider(min=-50, max=200, step=0.1, layout = slider_layout),
    bias=FloatSlider(min=-50, max=50, step=0.1, layout = slider_layout),
)

classification_plot_sig.plot

interactive(children=(FloatSlider(value=0.0, description='weight', layout=Layout(width='90%'), max=200.0, min=…

FigureWidget({
    'data': [{'type': 'scatter',
              'uid': 'bce6e84e-f455-4689-bc95-720bff97d1f1',
 …

### Task: Questions

**Question:**
What is an optimal weight and bias combination?


**Answer:** The optimal comibnation is weight = 40.60 and bias = -11.30 with a Mean Squared Loss of 0.00.

**Question:**
What advantage does a classifier have in general that also outputs a probability compared to a classifier that just outputs a binary yes/no value? (a few words)


**Answer:** We not only know the result but also how sure we can be that the given result is right. With that information we can correct wrong classifications of exceptions.

**Question:**
Give one example how we can use the additional probability information to increase the accuracy of our seperation process


**Answer:** In cases where the classifier is uncertain (classification value around 0.5) we can sort out the animal and classify it by ourselves or we can take an additional information about the object into account if the classifier is uncertain.

## Cross Entropy/Logarithmic Loss:
The most common loss function for classification is cross entropy loss, also called logarithmic loss. (In the context of machine learning, they are equal). In the special case of two categories, the loss is called binary cross entropy. The binary cross entropy loss between an actual data value $y$ and a predicted value $p$ is calculated as follows:

\begin{align}
−[y \cdot log(p)+(1−y)\cdot log(1−p)]
\end{align}

In this manner, the average of all data points is calculated.
It turns out that the derivative of a logarithmic loss using one hot encoding (explained below) is just the solution vector subtracted by the network output, which makes it very easy to work with.
**Note** Cross entropy loss can only be used, if the output values are between 0 and 1.

<img src="images/cross_entropy.png" />
<p style="text-align: center;">
        Fig. 4 - Logarithmic / cross entropy loss function

</p>




### Task: Calculate Squared and Cross Entropy Loss

**Question:**
What are the the squared loss and cross entropy loss results for the following predictions? Copy the table and fill out the ??? as an answer below.  Use the cell below for calculations.


| Input         | Llama Probability  |      Squared Loss    | Cross Entropy Loss   |
|---------------|--------------------|----------------------|----------------------|
|    llama(1)   | 0.99               |0.0001                |0.0101                |
|    sheep(0)   | 0.6                |0.3600                |0.9163                |
|    sheep(0)   | 0.95               |0.9025                |2.9957                |
|    sheep(0)   | 0.999999           |1.0000                |13.8155               |


**Answer:**

In [15]:
# do not change
def cross_entropy_loss(predictions,solutions):
    predictions += 1e-15 #in order to prevent log(0)
    total_loss = np.sum(-(solutions*np.log(predictions)+(1-solutions)*np.log(1-predictions)))
    avg_loss = total_loss/len(predictions)
    return avg_loss

In [16]:
predicted = np.array([0.999999]) #insert here
actual = np.array([0]) #insert here


print("mean squared loss: {:0.4f}".format(mean_squared_loss(predicted,actual)))
print("cross entropy loss: {:0.4f}".format(cross_entropy_loss(predicted,actual)))

mean squared loss: 1.0000
cross entropy loss: 13.8155


**Question:**
How do the goals of regression and classification generally differ?

**Answer:** The goal of regression is a continuous output, the goal of classification is a discrete output.

**Question:**
Why do you think cross entropy loss is better suited for classification training algorithms?

**Answer:** The cross entropy loss is better suited because it is more sensitive. This is important for the classification algorithm, because it is the only way to check our results, since the output is a discrete value and doesn't give us any information about the probability of the result being right.

## One-Hot Encoding
To do classification, categories have to be represented in a way that the classifier can process. Neural networks cannot understand categories directly and need a numeric representation.

### Disadvantages of Integer Encoding

In the llama classifier, llamas were assigned the value $1$ and sheep the value $0$. One single output neuron would "fire", if a llama was found, and not fire, if a sheep was found. This type of representing categories is called **integer** or **label encoding**

This works reasonably well for binary classification, but what if we want to distinguish between sheep, llamas and shepherd dogs?
Doing this with just one output neuron would result in complications: 
- Dogs would need a label that is numerically higher or lower (for example $2$), implying an order (Dogs > Llamas) where there actually is none.
- it would be necessary to interpret three different states out of one output neuron value

Another disadvantage can be seen in the next question:

### Task: Question

**Question:**
Suppose the encodings are: 0 for sheep, 1 for llamas and 2 for dogs.
You classified 5 sheep and 5 dogs today. You want your classifier to output the average classification for today. What will the classifier say?

**Answer:** Since the mean of those classified results is 1, the average classification for today would be llama.

### Composition of One-Hot Encoding

The solution looks like this:

| Input         | One Hot Encoding  | 
|---------------|--------------------|
|    sheep   | [1,0,0]                |
|    llama   | [0,1,0]               |
|    dog     | [0,0,1]           |



The length of the representation vector is always equal to the amount of categories. Only one element of the vector is 1 for each category ("one-hot").
Using this encoding, we can conveniently use 3 output neurons for 3 different categories, so that the activation of each output neuron represents the classification score for that category.

###  Limits of One-Hot Encoding
One-hot encoding is not an unimprovable solution to represent categories, but rather another tool in the box that happens to work well for many problems, but not for all.

### Task: Question

**Question:**
Suppose you would like to train a speech recognition neural network that can classify all English words contained in the Oxford English Dictionary. It does not need to classify whole sentences, just single words. What would be a problem using one-hot encoding?

**Answer:** You would need as many elements in the representation vector as there are words in the Oxford English Dictionary.

## Softmax

The sigmoid function works fine for a "yes or no" problem. But more often than not we want to distinguish between more than two categories. For that, we need a function that takes in **multiple** neuron activations from the last layer of a network and outputs a **probability vector** containing the probabilities for each category.
The key: Each input of this function is normalized by the other inputs such that the sum of the output vector is always 1. Figure 3 shows an example network.

We can realize a softmax function by taking each element $x_i$ of the input vector, calculating $exp(x_i)$ and then normalizing this value by dividing it by the sum of the $exp$ results of all single input vector elements.
\begin{align}
\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
\end{align}


<img src="images/softmax_example_network.png" />
<p style="text-align: center;">
    Fig. 3 - Sigmoid function
</p>

### Task: Question

**Question:**
In "logistic regression", we also obtained a probability by applying a sigmoid function on the last layers' output.
Why can't we apply a sigmoid function on each output neuron of this network instead of a softmax and get a probability vector?


**Answer:** If we were to apply a sigmoid function on each neuron we would only recieve separate probabilities for each output but we wouldn't get the probabilities for each output in relation to all other outputs. 

***
# Automated Classification Training

## Introduction

We already have explored automated training using backpropagation in the last chapter. We had one set of points that we had to fit a function as close as possible. The task is similar for classification training. However instead of y-coordinates for points, we now have discrete categories.

In task 1, you were given a set of neck lengths and the correspoding categories (see table 1). In the field of machine learing, this dataset is called __training data__. It specifies the behaviour that the neural net should have. We will use backpropagation to adjust the weights and biases of the network over and over again until the network outputs the same values to a given set of inputs as in the training data. During backpropagation, the network is figuratively "learning" the training data. 

***
## Realizing an XOR Gate with a Neural Network

You find yourself working as an engineer at a major electronic component manufacturing company. Your company wants to produce the first XOR gate chip that runs on artificial intelligence. You are given the training data in the form of a truth table:


| Input 1| Input 2  | Output    |
|--------|----------|-----------|
|    0   | 0        |0          |
|    0   | 1        |1          |
|    1   | 0        |1          |
|    1   | 1        |0          |


<p style="text-align: center;">
    Table. 2 - XOR Truth table
</p>


In this task we will make use of arrays and matrices to ease the handling of the data and the network parameters. We will also utilize a neural network without biases in order to make the algorithm as simple as possible.
The training data consists of a 2D Array of all possible input states and a 1D Array of all corresponding outputs. 

### Task : Create Training Data

A training set consists of an input set and a solution set. During supervised training, the network is adjusted until its predictions to the input set match the corresponding predetermined solutions.
Complete the training data below using the truth table

In [28]:
#xor_input_set = np.array(tablegoeshere)
#xor_solution_set = np.array(tablegoeshere)

# STUDENT CODE HERE

xor_input_set = np.array([[0, 0],
                          [0, 1],
                          [1, 0],
                          [1, 1]])

xor_solution_set = np.array([[0],
                             [1],
                             [1],
                             [0]])

# STUDENT CODE until HERE


### Initializing the Network
Next, the Network has to be defined and initialized. For this task, we use a network with 3 hidden neurons (see Figure 4).

<img src="images/3x2_xor_network.png" />
<p style="text-align: center;">
    Fig. 4 - Neural Network 
</p>

We define $w_{01}, w_{02}, w_{03}, w_{10}, w_{11}, w_{12}$ all at once by just defining a 2x3 weight matrix $w_{l1}$ and do the same for $w_{l2}$. The matrices will be initialized with values between -1 and 1

Run the cell below to define a neural network class that is depicted above.

In [18]:
# do not change
class NeuralNetwork:
    def __init__(self):
        self.hl_sum = [0, 0, 0]
        self.hl_activation = [0, 0, 0]
        self.ol_sum = [0]
        self.prediction = 0
        self.b = 0
        self.w_i = np.zeros((2, 3))
        self.w_o = np.zeros((3, 1))
        
    def set_conf(self, w_i, w_o, b):  # w_i and w_o are matrices here
        self.w_i = w_i
        self.w_o = w_o
        self.b = b

    def get_conf(self):
        configuration = dict();
        configuration['w_i'] = self.w_i
        configuration['w_o'] = self.w_o
        configuration['b'] = self.b
        return configuration

    def get_ex(self):
        excitations = dict();
        excitations['hl_sum'] = self.hl_sum
        excitations['hl_activation'] = self.hl_activation
        excitations['ol_sum'] = self.ol_sum
        return excitations
    
    
    def show_conf(self):
        print("weight matrix w_i:")
        print(self.w_i)
        print("\nweight matrix w_o:")
        print(self.w_o)
        print("Bias")
        print(self.b)

    def compute(self, input_set):
        self.hl_sum = input_set.dot(self.w_i)
        #Student Code 
        self.hl_activation = relu(self.hl_sum) 
        self.ol_sum = relu(self.hl_activation).dot(self.w_o) + self.b
        self.prediction = sigmoid(self.ol_sum)

        return self.prediction

In [19]:
# do not change
logic_gate_net = NeuralNetwork()

In [20]:
# do not change
def initialize_network(net):
    #np.random.seed(3)
    weight_matrix_i = np.random.rand(2,3)  # a 2x3 matrix of weights
    weight_matrix_o = np.random.rand(3,1)  # a 3x1 matrix of weights
    bias = np.random.randn()
    net.set_conf(weight_matrix_i,weight_matrix_o,bias)

In [21]:
# do not change
initialize_network(logic_gate_net) #just a test initialization to illustrate the weight matrices
logic_gate_net.show_conf()

weight matrix w_i:
[[0.38762955 0.66531803 0.66662743]
 [0.21083038 0.56962714 0.43640799]]

weight matrix w_o:
[[0.32496836]
 [0.33371356]
 [0.21101071]]
Bias
1.0301343734962858


### Defining Training Process
Finally, run the cells below to implement a backpropagation algorithm. Try to understand the code. See Fig. 4 for explanation of the variable names.

In [22]:
# do not change
def sigmoid_prime(x): #the derivative of sigmoid
    return sigmoid(x)*(1-sigmoid(x))

In [45]:
# do not change
def train(net, input_set, solution_set, learning_rate, epochs):
    for t in range(epochs):
        # Forward pass: compute predicted solution_set
        predictions = net.compute(input_set)
        # Compute and print loss
        log_loss = cross_entropy_loss(predictions, solution_set)
        
        if (t % 5 == 0):  # only output every 5th epoch
            print("Loss after Epoch {}: {:0.4f}".format(t, log_loss))

        #unravel variables here for readability
        ol_sum = net.get_ex()['ol_sum']
        hl_activation = net.get_ex()['hl_activation']
        hl_sum = net.get_ex()['hl_sum']
        w_i = net.get_conf()['w_i']
        w_o = net.get_conf()['w_o']
        b = net.get_conf()['b']
        
        # Backpropagation to compute gradients of w_i and w_o with respect to loss
        # start from the loss at the end and then work towards the front
        grad_ol_sum = sigmoid_prime(ol_sum) * (predictions - solution_set)
        grad_w_o = hl_activation.T.dot(grad_ol_sum)  # Gradient of Loss with respect to w_o
        grad_hl_activation = grad_ol_sum.dot(w_o.T)  # the second layer's error
        grad_hl_sum = hl_sum.copy()  # create a copy to work with
        grad_hl_sum[hl_sum < 0] = 0  # the derivate of ReLU
        grad_w_i = input_set.T.dot(grad_hl_sum * grad_hl_activation)  #

        updated_weight_matrix_i = w_i - learning_rate * grad_w_i
        updated_weight_matrix_o = w_o - learning_rate * grad_w_o
        updated_bias = b - learning_rate * grad_ol_sum.sum()
        net.set_conf(updated_weight_matrix_i, updated_weight_matrix_o,
                       updated_bias)  # Apply updated weights to network

### Task: Choose Hyperparameters and Train
Choose an optimal learning rate and number of epochs by trying out values and running the cell below.

If your training data was correct, the network should be ready for use after training.
A successfull training should result in a loss < 0.02.

**Hint**:
Press Shift+Enter on the cell below and then the "up" arrow key to repeat the training easily.

In [46]:
#learning_rate = ??
#epochs = ??
# STUDENT CODE HERE

learning_rate = 10
epochs = 100

# STUDENT CODE until HERE

initialize_network(logic_gate_net) #initialize again so you can just run this box and train a new network
train(logic_gate_net, xor_input_set, xor_solution_set,learning_rate,epochs)

Loss after Epoch 0: 0.8401
Loss after Epoch 5: 0.7495
Loss after Epoch 10: 0.7910
Loss after Epoch 15: 0.7785
Loss after Epoch 20: 0.9737
Loss after Epoch 25: 0.4810
Loss after Epoch 30: 0.4804
Loss after Epoch 35: 0.4783
Loss after Epoch 40: 0.4579
Loss after Epoch 45: 0.4138
Loss after Epoch 50: 0.4352
Loss after Epoch 55: 3.1481
Loss after Epoch 60: 3.2381
Loss after Epoch 65: 3.3816
Loss after Epoch 70: 3.4479
Loss after Epoch 75: 3.4913
Loss after Epoch 80: 3.5236
Loss after Epoch 85: 3.5492
Loss after Epoch 90: 3.5705
Loss after Epoch 95: 3.5886


### Task: Questions

**Question:**
Why are the losses different each time you run the cell?

**Answer:** Because the initialization of the weight and bias is random.

**Question:**
What is a good learning rate that reaches a loss < 0.02 in < 100 epochs most of the time?

**Answer:** 10

### Task: Classification Test
Run the cell below and change the sliders and do a validation check on your logic gate

In [47]:
# do not change
def change(input1, input2):
    input_vector = np.array([input1 * 1, input2 * 1])     # converting bool to float
    prediction = logic_gate_net.compute(input_vector)
    print("\t input: {} \t \t output: {:0.9f}".format(input_vector, prediction[0]))

interact(
    change,
    input1=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
    input2=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
);

interactive(children=(FloatSlider(value=0.0, description='input1', layout=Layout(width='22%'), max=1.0, step=1…

### Task: Continuous Input Test

Change the sliders and observe the changes when the input is varied continuously instead of binary.


In [27]:
interact(change, input1=0.0, input2=0.0);

interactive(children=(FloatSlider(value=0.0, description='input1', max=1.0), FloatSlider(value=0.0, descriptio…

### Task: Questions

**Question:**
What can you observe when changing the sliders? How would you describe the general relationship between the two inputs and the output (a few words)

**Answer:** If the inputs are same, the output is zero. If the inputs are different from each other, the output is one.

**Question:**
Change the sliders to the training data values e.g.(1.00, 1.00). Does the output match the training data exactly? Why is that the case?

**Answer:** No, we don't get the discrete values as in the training data, this is the result of weight and bias.

**Question:**
The neural network now can do something more than just predicting the values of the input set that you gave it. What "special ability" has your network gained automatically?

**Answer:** The net can generalize and give continuous outputs to inputs it has never seen before.

**Question:** 
How can this special ability be useful when applying neural networks to self-driving vehicles?

**Answer:** If the model has been given enough training data, it can make safe decisions in situations it has never seen before. This way, the training data doesn't have to include every possible scenario.

**Question:**
Why does this ability make it easier to use a neural network for self-driving vehicles than traditional rule-based programming

**Answer:** In traditional rule-based programming you have to define every possible scenario and tell the model what to do in each case. This would be practicaly infinitly complex when it comes to traffic and therefore pretty much impossible. With this ability we don't have to show the model every possible scenario but just enough so that it can make accurate generalizations about new situations.

### Task: Create an OR Gate 
**Change the code above to train an OR Network and verfy your results with a test.**

| Input 1| Input 2  | Output    |
|--------|----------|-----------|
|    0   | 0        |0          |
|    0   | 1        |1          |
|    1   | 0        |1          |
|    1   | 1        |1          |


<p style="text-align: center;">
    Table. 3 - OR Truth table
</p>


In [48]:
# OR gate

or_input_set = np.array([[0,0],[0,1],[1,0],[1,1]])
or_solution_set = np.array([[0],[1],[1],[1]])

#create model
or_gate_net = NeuralNetwork()
initialize_network(or_gate_net) #just a test initialization to illustrate the weight matrices
or_gate_net.show_conf()

#train
learning_rate_or = 10
epochs_or = 100

train(or_gate_net, or_input_set, or_solution_set, learning_rate_or, epochs_or)

weight matrix w_i:
[[0.60454483 0.56808506 0.95642823]
 [0.79723838 0.33781668 0.88414121]]

weight matrix w_o:
[[0.22216027]
 [0.97069509]
 [0.01045813]]
Bias
-1.0655840409438597
Loss after Epoch 0: 0.7052
Loss after Epoch 5: 0.0542
Loss after Epoch 10: 0.0246
Loss after Epoch 15: 0.0182
Loss after Epoch 20: 0.0150
Loss after Epoch 25: 0.0131
Loss after Epoch 30: 0.0117
Loss after Epoch 35: 0.0107
Loss after Epoch 40: 0.0099
Loss after Epoch 45: 0.0092
Loss after Epoch 50: 0.0087
Loss after Epoch 55: 0.0083
Loss after Epoch 60: 0.0079
Loss after Epoch 65: 0.0075
Loss after Epoch 70: 0.0072
Loss after Epoch 75: 0.0070
Loss after Epoch 80: 0.0067
Loss after Epoch 85: 0.0065
Loss after Epoch 90: 0.0063
Loss after Epoch 95: 0.0061



invalid value encountered in log



In [49]:
def change_or(input1, input2):
    input_vector = np.array([input1 * 1, input2 * 1])     # converting bool to float
    prediction = or_gate_net.compute(input_vector)
    print("\t input: {} \t \t output: {:0.9f}".format(input_vector, prediction[0]))

interact(
    change_or,
    input1=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
    input2=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
);

interactive(children=(FloatSlider(value=0.0, description='input1', layout=Layout(width='22%'), max=1.0, step=1…

## Outlook: Classification Tests in the Real World

A classic application of neural networks is the classification of images. A commonly used data set is CIFAR-10, which consists of:  
 1. Images of  airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks  (10 Categories)
 2. Labels attached to each image that categorize the image
 
<img src="images/cifar10_plot.png" />
<p style="text-align: center;">
    Fig. 3 - CIFAR-10 dataset[4]
</p>

 
The labels (also called annotations) act as the "solution" for the training set. Each item (airplane, car..) is a separate category. 
During training, the weights and biases in the network are adjusted in just the right way, until it performs the right mathematical operations to correctly classify the given training data. After training, the network can recognize whether the image is a cat, an airplane, etc. This even works for pictures that the network has never seen. You will find out how neural networks can perform image classification in the next class.

### Sources:
[1] Wikipedia, Statistical classification https://en.wikipedia.org/wiki/Statistical_classification, retrieved 01.05.2019

[2]  Brownlee, Jason 2018. Machine Learning Algorithms From Scratch. p. 70

[3]  Gibbs, M.N. (Nov 2000). "Variational Gaussian process classifiers". IEEE Transactions on Neural Networks. p. 1458–1464.

[4] Cifar-10, Cifar-100 Dataset Introduction
Corochann - https://corochann.com/cifar-10-cifar-100-dataset-introduction-1258.html, retrieved 02.02.2019


### Further Reading

The Sigmoid Function in Logistic Regression: http://karlrosaen.com/ml/notebooks/logistic-regression-why-sigmoid/