#Perceptron and Neural Network (Day8 7/21)

##Agenda


1. Anatomy of perceptron
2. Perceptron example
3. Anatomy of Neural Network
4. Math behind Neural Network and backpropagation
5. Neural Network Playground
6. Extras

#Anatomy of perceptron

##Definition: 
Perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class (Wikipedia).

##Background:

![neuron_img](https://www.brainfacts.org/-/media/Brainfacts2/For-Educators/For-the-Classroom/Light-Up-Neuron/neuron_labels_c-replacement.png)

Perceptron’s goal is to mimic how a neuron in our brain works.

A neuron receives signals through its dendrites, and its cell body propagates an output signal along the axon. The output signal is “all or nothing” (it can be simplified as either 0 or 1). In ML terms, we would like to compute whether an output is 0 or 1, given the inputs.

##Motivation:
![perceptron_graph](https://miro.medium.com/max/1400/1*xsR57_PO8U7PB_ItLslLmA.png)

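Algebraically, a perceptron classifies data by adjusting its weights and bias so that data on one side of the resulting boundary is assigned to one class and data on the other side to the other class. This is similar to the equation of a line, where the weights play the role of the slope and the bias plays the role of the y-intercept.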

##Perceptron Components: 
A perceptron has:
-	Input values
-	Weight
-	Bias
-	Activation function

![perceptron_img](https://miro.medium.com/max/645/0*LJBO8UbtzK_SKMog)

*Note: the picture does not have activation function and bias*

To compute the output of a single perceptron, we calculate:

$output = \sigma(\sum_{i=1}^n w_i x_i + b)$

where 
- $\sigma$ is an activation function
- $x_i$ is an input
- $w_i$ is the weight for $x_i$
- $b$ is the bias
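
The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming a single bias term added to the weighted sum and ReLU as the default activation; the function names and the example numbers are made up for this note.

```python
def relu(z):
    """ReLU activation: 0 for non-positive input, the input itself otherwise."""
    return max(0.0, z)


def perceptron_output(inputs, weights, bias, activation=relu):
    """Return activation(sum_i w_i * x_i + bias) for one perceptron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical values, chosen only for illustration.
example = perceptron_output(inputs=[1.0, 0.0], weights=[0.5, -2.0], bias=0.25)
print(example)  # 0.5 * 1.0 + (-2.0) * 0.0 + 0.25 = 0.75
```

Passing the activation as an argument makes it easy to try other activation functions without changing the weighted-sum code.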

Notice that we need to specify an activation function. In ML, there is no single right answer on which activation function is ideal; the choice depends on the problem. Here are some examples:
![activation_img](https://miro.medium.com/max/1200/1*ZafDv3VUm60Eh10OeJu1vw.png)

In general, we commonly use ReLU because it is simple and fast.
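
For a concrete feel, here is a small sketch of four common activation functions. The selection (step, sigmoid, tanh, ReLU) is an assumption and may not match the figure exactly.

```python
import math


def step(z):
    """Binary step: the classic "all or nothing" output."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Keeps positive values and replaces the rest with 0."""
    return max(0.0, z)


# Compare the functions on a few sample inputs.
for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  step={step(z):.0f}  sigmoid={sigmoid(z):.3f}  "
          f"tanh={tanh(z):.3f}  relu={relu(z):.1f}")
```

ReLU only needs a comparison, while sigmoid and tanh need an exponential, which is one reason ReLU is considered simple and fast.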

#Perceptron Example

##Example: 
Let’s say you want to have lunch. You consider these four factors:
-	Is there anything left in the fridge? (0 if none, 1 if there is food left)
-	Is there any restaurant within 3 miles? (0 if none, 1 if there are restaurants)
-	Is it sunny now? (0 if raining, 1 if clear sky)
-	Are you using RTX3090? (0 if no, 1 if yes)

![perceptron_example](https://drive.google.com/uc?export=view&id=1kkYCUNg75xvdLFRClRrtcrBgBwXRRfj9)

We want to use these four factors to decide whether you eat leftover food or get lunch from a restaurant. Let’s say that if $output \le 0$, you eat leftover food; otherwise you grab food from outside. For simplicity, let’s set the bias to zero and use ReLU as the activation function.

We let 

$$
    ReLU(x) =
    \begin{cases}
        0 & \text{if $x \le 0$,}\\
        x & \text{if $x>0$}
    \end{cases}
$$

In practice, we initialize the weights and biases with random numbers and use the backpropagation algorithm (discussed later) to adjust them. For this problem, let us set the weights by hand.

Think of a weight as “how important is this feature?” If a feature is very important, we set its weight to a large positive number (or a large negative number if the feature pushes the decision the other way). If a feature is less important, we set its weight to a number close to zero.

![perceptron_calculated](https://drive.google.com/uc?export=view&id=1srvS1VJpLc4XXDogbX2Qj4DVXckEPVUN)

From the earlier section, we use $\sigma(Z) = ReLU(Z)$ as our activation function, where $Z = \sum_{i=1}^4 w_i x_i$ is the weighted sum of the inputs (the bias is zero).

We calculate 

$$
\begin{align}
Z & = \sum_{i=1}^4 w_i x_i \\
& = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 \\
& = (-3\cdot1)+(2\cdot1)+(3\cdot0)+(0.01\cdot1) \\
& = -3 + 2 + 0 + 0.01 \\
& = -0.99
\end{align}
$$

We then compute the output:

$$
\begin{align}
output & = \sigma(Z) \\
& = ReLU(Z) \\
& = ReLU(-0.99) \\ 
& = 0
\end{align}
$$

Suppose you believe that having leftover food in the fridge is important, so you set its weight to -3 (negative, because it pushes you toward eating at home). Having an RTX3090 has little effect on how you decide to eat, so you set its weight to 0.01. After applying the activation function, the output is 0, so you decide not to eat outside and eat the food you have.
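
The same calculation can be checked in Python. This is a small sketch: the inputs and weights are read off the worked calculation above, and the variable names are made up for this note.

```python
def relu(z):
    """ReLU activation used in the lunch example."""
    return max(0.0, z)


# Order: fridge, restaurant, sunny, RTX3090
inputs = [1, 1, 0, 1]
weights = [-3, 2, 3, 0.01]
bias = 0  # the example sets the bias to zero

# Weighted sum Z, then the activation function
z = sum(w * x for w, x in zip(weights, inputs)) + bias
output = relu(z)

# output <= 0 means "eat leftover food", otherwise "eat outside"
decision = "eat leftover food" if output <= 0 else "eat outside"
print(f"Z = {z:.2f}, output = {output}, decision = {decision}")
```

Try flipping the first input to 0 (nothing left in the fridge): the weighted sum becomes positive and the decision changes.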

#Neural Network

##Long Definition (optional): 
Artificial neural networks, usually simply called neural networks (NNs), are based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it.

##Short explanation: 
A neural network is a bunch of perceptrons connected together.

##Motivation:
A single perceptron is not complex enough to classify data that is not linearly separable. A neural network, which consists of many perceptrons, can handle such non-linear classification problems.
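
A classic illustration is XOR, which no single straight line can separate. The sketch below is an assumption-laden toy: it uses a step activation instead of ReLU, searches a coarse grid of weights for a single perceptron, and then uses a hand-built two-layer network whose weights were picked by hand rather than learned.

```python
from itertools import product

# XOR truth table: the output is 1 only when exactly one input is 1
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z):
    return 1 if z > 0 else 0


def single_perceptron(x1, x2, w1, w2, b):
    return step(w1 * x1 + w2 * x2 + b)


def fits_xor(predict):
    """True if predict(x1, x2) matches XOR on all four inputs."""
    return all(predict(x1, x2) == label for (x1, x2), label in XOR.items())


# Try every (w1, w2, b) on a coarse grid from -2 to 2 in steps of 0.25
grid = [i * 0.25 for i in range(-8, 9)]
single_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if fits_xor(lambda x1, x2: single_perceptron(x1, x2, w1, w2, b))
]


def two_layer_network(x1, x2):
    """Two hidden perceptrons (OR and AND) feeding one output perceptron."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)


print("single perceptron solutions found:", len(single_solutions))
print("two-layer network fits XOR:", fits_xor(two_layer_network))
```

The grid search is not a proof, but it matches the known result that XOR is not linearly separable, while three connected perceptrons handle it.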

Link: https://playground.tensorflow.org/

![playground](http://i.imgur.com/rbB43iO.png?1)

You can try decreasing the hidden layer to only 1 neuron, which is equivalent to a perceptron. It classifies the data poorly. But after you introduce more perceptrons by increasing the number of neurons, the model works much better.

##Definition & Terminology: 
A neural network consists of layers, and each layer consists of nodes. Each connection between two nodes carries a weight, and each node has a bias; together these are the parameters of the network. There are three kinds of layers: the input layer, the hidden layers, and the output layer.
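
To make the terminology concrete, here is a minimal sketch of a forward pass through layers of nodes in plain Python. The 2-3-1 architecture, the weight values, and the function names are all hypothetical, and ReLU is assumed for every layer.

```python
def relu(z):
    return max(0.0, z)


def layer_forward(inputs, weights, biases, activation=relu):
    """Compute one layer.

    weights[j][i] is the weight on the connection from input i to node j,
    and biases[j] is the bias of node j.
    """
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node
hidden_layer = ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 1.0])
output_layer = ([[1.0, 1.0, 0.5]], [0.5])
layers = [hidden_layer, output_layer]

result = network_forward([1.0, 2.0], layers)
print(result)                    # [4.0]
print(count_parameters(layers))  # 13
```

The input layer holds no parameters of its own; it is just the list of input values passed to the first hidden layer.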

#Backpropagation (Math example)

##Backpropagation in 24 words:
We are given the cost function. We use the cost function to update each weight and bias back from output layer to input layer.

In the earlier perceptron example, we set the weights and biases by hand. In practice, they are initialized randomly and then updated using gradient descent.

Let's look at an example of how backpropagation works after we first initialize every weight and bias (credit to [3Blue1Brown's video](https://www.youtube.com/watch?v=tIeHLnjs5U8) and [StatQuest's video](https://www.youtube.com/watch?v=IN2XmBhILt4))

Suppose our neural network looks like this:
![nn](https://drive.google.com/uc?export=view&id=14UjQj7wp0QbvPabr-BBwgnItug0KIBhd)

Now, let's focus on only last two nodes:
![nn2](https://drive.google.com/uc?export=view&id=18a2qC6awdBEokR9oid8lpsG0gzLj9Mtb)

We know that our cost function for one data point is

$$
Cost(\theta) = (Predicted - Observed)^2 = (activation^{(L)} - Observed)^2
$$

where $\theta$ consists of the parameters. In this case, our parameters are $w_1$, $b_1$, $w_2$, $b_2$, $w_3$, $b_3$.

Note: in this notation, the superscript $L$ in $activation^{(L)}$ tells us which layer we are at. In this case, $activation^{(L)}$ is the activation of the output layer.

##Let us define more formulas. 

Let
$Z^{(L)} = w^{(L)} \cdot activation^{(L-1)} + b^{(L)}$, which is the weighted input of one perceptron before the activation function is applied.

Let
$activation^{(L)} = \sigma(Z^{(L)})$, which is the output of that perceptron after the activation function is applied.

##Analysis
We can see that each activation depends on a weight, a bias, and the activation from the previous layer.

![nn3](https://drive.google.com/uc?export=view&id=1cIigRioxMgmxFIH32_rfOr_6wX4eM8ie)

##Use Chain Rule to update each variable
We will use the chain rule to calculate how much to change $w_3$, given the value of the cost function. **Note: in simplified terms, the chain rule is a way to see how much one variable changes when we change another variable that it depends on through intermediate variables.** There is a proof note under the "Extra" section.

In this case, let's assume that we have already found the correct value of everything except $w_3$ at the last node. We then update $w_3$ by using the cost function. **Note: you don't need to understand the calculus behind the proof. *Again, tl;dr -> Backpropagation in 24 words: We are given the cost function. We use the cost function to update each weight and bias back from output layer to input layer.***
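
For readers who do want to see the calculus, the chain rule for this step can be written out with the definitions above. This is a sketch that assumes $w_3$ is the weight $w^{(L)}$ of the last node; $\sigma'$ denotes the derivative of the activation function.

$$
\begin{align}
\frac{\partial Cost}{\partial w_3} & = \frac{\partial Cost}{\partial activation^{(L)}} \cdot \frac{\partial activation^{(L)}}{\partial Z^{(L)}} \cdot \frac{\partial Z^{(L)}}{\partial w_3} \\
& = 2(activation^{(L)} - Observed) \cdot \sigma'(Z^{(L)}) \cdot activation^{(L-1)}
\end{align}
$$

Each factor comes from one definition: the first from the cost function, the second from $activation^{(L)} = \sigma(Z^{(L)})$, and the third from $Z^{(L)} = w^{(L)} \cdot activation^{(L-1)} + b^{(L)}$.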

We obtain $\frac{\partial Cost}{\partial w_3}$ by using the chain rule.
After that, we update $w_3$ by gradient descent:
\begin{equation}
w_3^{(new)} = w_3^{(old)} - \text{learning rate} \cdot \frac{\partial Cost}{\partial w_3}
\end{equation}
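
The update rule can be tried on concrete numbers. The sketch below assumes a sigmoid activation for the last node (so that $\sigma'(Z) = \sigma(Z)(1 - \sigma(Z))$), and all values and names are hypothetical. It also compares the chain-rule gradient with a numerical estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w3, b3, prev_activation, observed):
    """Squared error of the last node for one data point."""
    z = w3 * prev_activation + b3
    return (sigmoid(z) - observed) ** 2


def cost_gradient_w3(w3, b3, prev_activation, observed):
    """dCost/dw3 via the chain rule: dCost/da * da/dZ * dZ/dw3."""
    z = w3 * prev_activation + b3
    a = sigmoid(z)
    return 2.0 * (a - observed) * a * (1.0 - a) * prev_activation


# Hypothetical values, chosen only for illustration
w3, b3 = 0.8, 0.1
prev_activation, observed = 0.6, 1.0
learning_rate = 0.5

# Chain-rule gradient versus a central finite-difference estimate
analytic = cost_gradient_w3(w3, b3, prev_activation, observed)
eps = 1e-6
numeric = (cost(w3 + eps, b3, prev_activation, observed)
           - cost(w3 - eps, b3, prev_activation, observed)) / (2 * eps)

# One gradient descent step on w3
w3_new = w3 - learning_rate * analytic

print(f"analytic gradient: {analytic:.6f}")
print(f"numeric gradient:  {numeric:.6f}")
print(f"cost before: {cost(w3, b3, prev_activation, observed):.6f}")
print(f"cost after:  {cost(w3_new, b3, prev_activation, observed):.6f}")
```

The gradient is negative here because the prediction is below the observed value, so the update increases $w_3$ and the cost goes down.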

After computing the update for each parameter in a layer, we reuse the intermediate derivatives to update the parameters of the prior layer, moving back toward the input layer. We then repeatedly train the model, using backpropagation to update the parameters, until the model output converges.
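
Putting the pieces together, here is a minimal sketch of the whole loop. It assumes the network is a chain of three single-node layers (as the six parameters $w_1, b_1, w_2, b_2, w_3, b_3$ suggest), a sigmoid activation everywhere, and a single training example; the starting values, learning rate, and stopping tolerance are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    """Return all activations [a0, a1, a2, a3], where a0 is the input."""
    activations = [x]
    for w, b in params:
        activations.append(sigmoid(w * activations[-1] + b))
    return activations


def backward(activations, params, observed):
    """Return (dCost/dw, dCost/db) per layer, computed from output to input."""
    grads = [None] * len(params)
    # Start at the output: dCost/d(activation) for the squared error
    d_activation = 2.0 * (activations[-1] - observed)
    for layer in reversed(range(len(params))):
        w, _ = params[layer]
        a_out = activations[layer + 1]
        a_in = activations[layer]
        d_z = d_activation * a_out * (1.0 - a_out)  # sigmoid'(Z) = a * (1 - a)
        grads[layer] = (d_z * a_in, d_z)
        d_activation = d_z * w  # pass the derivative back to the prior layer
    return grads


def train(x, observed, params, learning_rate=0.5, max_steps=5000, tolerance=1e-6):
    """Repeat forward pass, backpropagation, and gradient descent updates."""
    costs = []
    for _ in range(max_steps):
        activations = forward(x, params)
        costs.append((activations[-1] - observed) ** 2)
        if costs[-1] < tolerance:
            break
        grads = backward(activations, params, observed)
        params = [(w - learning_rate * dw, b - learning_rate * db)
                  for (w, b), (dw, db) in zip(params, grads)]
    return params, costs


# Hypothetical starting parameters: (w1, b1), (w2, b2), (w3, b3)
initial = [(0.5, 0.0), (-0.3, 0.1), (0.8, -0.2)]
trained, costs = train(x=1.0, observed=0.9, params=initial)
print(f"first cost: {costs[0]:.6f}, last cost: {costs[-1]:.6f}, steps: {len(costs)}")
```

Note that all gradients are computed from the current parameters before any parameter is changed, so the update of one layer does not disturb the gradient of another within the same step.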

#Neural Network Code Example

Credit: adapted from Crash Course AI (https://www.youtube.com/watch?v=6nGCGYWMObE)

In [None]:
from keras.datasets import mnist
import matplotlib.pyplot as plt
(X_train, y_train), (X_test, y_test) = mnist.load_data()


img_index = 1 # <-- change this index to see different number picture

print("Number: {}".format(y_train[img_index]))
plt.imshow(X_train[img_index], cmap=plt.get_cmap('gray'))
# show the figure
plt.show()

In [None]:
# Scale pixel values to the range [0, 1] to make it easier for the model to learn
X_train = X_train / 255
X_test = X_test / 255

# Flatten each 28x28 image into a 1D vector of 784 values
X_train = X_train.reshape(len(X_train), -1)
X_test = X_test.reshape(len(X_test), -1)

In [None]:
from sklearn.neural_network import MLPClassifier # import multi-layer perceptron classifier

hidden_layer_architecture = (100, 100) # <----- try changing the hidden layer to make the model more accurate

mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_architecture, max_iter=10, alpha=1e-4, 
                     solver='sgd', verbose=10, tol=1e-4, random_state=1,
                     learning_rate_init=.1)
mlp.fit(X_train, y_train)
print('\ntraining score: {}'.format(mlp.score(X_train, y_train)))
print('test score: {}'.format(mlp.score(X_test, y_test)))

In [None]:
predicted_index = 1954 # <--- change this index to see different test data

output = mlp.predict([X_test[predicted_index]])
print("Predict below picture as: {} \nwhile the true label is: {}".format(output[0], y_test[predicted_index]))
plt.imshow(X_test[predicted_index].reshape((28,28)), cmap=plt.get_cmap('gray'))
# show the figure
plt.show()

#Extra:

## Book about neural networks
- http://neuralnetworksanddeeplearning.com/index.html


## YouTube explanations of neural networks
- https://www.youtube.com/watch?v=tIeHLnjs5U8
- https://www.youtube.com/watch?v=IN2XmBhILt4

## Chain rule proof for $w_3$
![nn4](https://drive.google.com/uc?export=view&id=1Ni7U0LasOUGR_KuZCqUz6tShDqhufJxq)