# Chapter 10 Notes: Introduction to Artificial Neural Networks with Keras

# From Biological to Artificial Neurons
 - McCulloch and Pitts (1943) published the first computational model of how biological neurons might work together
## Logical Computations with Neurons
 - Uses artificial binary neurons
 - Can perform basic logic functions such as identity (C = A), AND, OR, and A AND NOT B
 - A single neuron of this kind cannot compute XOR
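
The logic functions listed above can be sketched with a weighted threshold neuron. This is only an illustration: the weights and thresholds below are assumed values chosen to make each gate work, and the helper names (`neuron`, `logic_and`, `logic_or`, `logic_and_not`) are hypothetical, not taken from the original 1943 formulation.

```python
def neuron(inputs, weights, threshold):
    """Binary neuron: outputs 1 when the weighted sum of its binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def logic_and(a, b):
    # Both inputs must be active to reach the threshold of 2.
    return neuron([a, b], [1, 1], threshold=2)

def logic_or(a, b):
    # A single active input is enough.
    return neuron([a, b], [1, 1], threshold=1)

def logic_and_not(a, b):
    # b acts as an inhibitory input (negative weight): fires only for a=1, b=0.
    return neuron([a, b], [1, -1], threshold=1)

# Print the truth table for all three gates.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, logic_and(a, b), logic_or(a, b), logic_and_not(a, b))
```

No choice of two weights and one threshold reproduces XOR with a single neuron of this form, because the XOR truth table is not linearly separable.
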
## The Perceptron 
 - Frank Rosenblatt 1957
 - Threshold Logic Units (TLU) make up the perceptron
 - inputs and outputs are numbers rather than binary on/off values
 - each input is associated with a weight
 - each TLU computes a weighted sum of its inputs
    - $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{x}^T \mathbf{w}$
 - Then the TLU applies a step function to that sum and outputs the result
    - $h_{\mathbf{w}}(\mathbf{x}) = \text{step}(z)$, where $z = \mathbf{x}^T \mathbf{w}$
 - Types of step functions for Perceptrons:
    - Heaviside step function: 0 for $z < 0$, 1 for $z \geq 0$
    - Sign step function: -1 for $z < 0$, 0 for $z = 0$, +1 for $z > 0$
 - A single TLU can perform linear binary classification. 
 - A perceptron is merely a single layer of TLUs and an input layer 
 - The input layer also contains a bias neuron which always outputs 1
 - A perceptron with several output TLUs can classify an instance into several binary classes at once (multioutput classification)
 - Computing the outputs of a fully connected perceptron layer:
    - $h_{\mathbf{W},\mathbf{b}}(\mathbf{X}) = \phi(\mathbf{X}\mathbf{W} + \mathbf{b})$
    - $\mathbf{X}$: input matrix, # instances by # features
    - $\mathbf{W}$: weight matrix, # input neurons by # artificial neurons (TLUs)
    - $\mathbf{b}$: bias vector, holding the weights between the bias neuron and each TLU; length = # TLUs
    - $\phi$: activation function (a step function when the neurons are TLUs)
 - Learning Rule: reinforces the connections which help reduce the error
 - $w_{i,j}^{(\text{next step})} = w_{i,j} + \eta (y_j - \hat{y}_j) x_i$
     - $w_{i,j}$: connection weight between the $i$th input neuron and the $j$th output neuron
     - $x_i$: $i$th input value for this instance
     - $\hat{y}_j$: output of the $j$th output neuron for this instance
     - $y_j$: target output of the $j$th output neuron for this instance
     - $\eta$: learning rate
 - Converges only when the training instances are linearly separable, because each output neuron has a linear decision boundary
 - Perceptrons do not output a class probability
 - Perceptrons cannot perform XOR operations
 - MLP - Multi-Layer Perceptrons can solve XOR and other problems that a single-layer perceptron cannot
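
The layer computation and the learning rule above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes the Heaviside step function as the activation, zero-initialized weights, a learning rate of 1, and 50 epochs, and the function names (`heaviside`, `predict`, `train_perceptron`) are hypothetical. Two output TLUs are trained at once, one for AND and one for OR.

```python
import numpy as np

def heaviside(z):
    """Step function: 0 for z < 0, 1 for z >= 0."""
    return (z >= 0).astype(int)

def predict(X, W, b):
    """Outputs of a fully connected TLU layer: step(XW + b)."""
    return heaviside(X @ W + b)

def train_perceptron(X, y, eta=1.0, n_epochs=50):
    """Apply the perceptron learning rule one instance at a time."""
    W = np.zeros((X.shape[1], y.shape[1]))   # one column of weights per output TLU
    b = np.zeros(y.shape[1])                 # weights coming from the bias neuron
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = heaviside(x_i @ W + b)
            error = y_i - y_hat                  # (y_j - y_hat_j) for every output j
            W += eta * np.outer(x_i, error)      # w_ij += eta * (y_j - y_hat_j) * x_i
            b += eta * error                     # the bias neuron always outputs 1
    return W, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Two linearly separable targets: column 0 is AND, column 1 is OR.
y_and_or = np.array([[0, 0], [0, 1], [0, 1], [1, 1]])
W, b = train_perceptron(X, y_and_or)
print(predict(X, W, b))

# XOR is not linearly separable, so at least one instance stays misclassified.
y_xor = np.array([[0], [1], [1], [0]])
W_xor, b_xor = train_perceptron(X, y_xor)
print(predict(X, W_xor, b_xor).ravel())
```

Weights only change when a prediction is wrong, which is why training settles on AND and OR but keeps cycling on XOR.
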

In [11]:
import numpy as np 
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length and petal width
y = (iris.target == 0).astype(int)  # 1 for Iris setosa, 0 otherwise (np.int is deprecated since NumPy 1.20; use the built-in int)

iris.keys()


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [12]:
per_clf = Perceptron()
per_clf.fit(X, y)
# Predict the class of a flower with petal length 2 and petal width 0.5
y_pred = per_clf.predict([[2, 0.5]])
print(y_pred)

[0]
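
Because perceptrons do not output class probabilities, the most a fitted scikit-learn `Perceptron` offers beyond the predicted label is a raw decision score. The sketch below illustrates this on the same iris features; `random_state=42` is an arbitrary choice made here for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]              # petal length and petal width
y = (iris.target == 0).astype(int)    # 1 for Iris setosa, 0 otherwise

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

sample = [[2, 0.5]]
score = per_clf.decision_function(sample)   # signed score; positive means class 1
label = per_clf.predict(sample)             # hard class prediction
has_proba = hasattr(per_clf, "predict_proba")

print(score, label, has_proba)
```

When an estimated probability is needed, a model such as logistic regression is the usual alternative.
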


## The Multilayer Perceptron and Backpropagation
- MLPs are composed of an input layer, one or more hidden layers, and an output layer.
- The input layer consists of passthrough units; the other layers are made of TLUs.
- Rumelhart, Hinton, and Williams introduced backpropagation in 1986.
- **Backpropagation** - computes the gradient of the network's error with respect to every model parameter in every layer, using just two passes through the network (one forward, one backward). The gradients show how to tweak each connection weight and bias term to reduce the error.
    - It handles instances in mini-batches.
    - It cycles through the whole dataset multiple times; each full pass is an epoch.
    - On the *forward pass* the instances are passed through the network and all the intermediate outputs are saved, since they are needed for the backward pass.
    - The error is measured with a loss function that compares the network's actual output with the desired output.
    - The chain rule is used to determine how much each output connection contributed to the error.
    - The algorithm then works backwards, determining how much of these error contributions came from each connection in the next lower layer. It propagates the error gradient backwards through the network.
    - A gradient descent step then tweaks all the connection weights in the network using the error gradients just computed.
- The step function is replaced by the sigmoid (logistic) function so there is a gradient to follow: $\sigma(z) = \frac{1}{1 + e^{-z}}$
    - This is one activation function among others, such as the hyperbolic tangent (tanh) and the Rectified Linear Unit (ReLU)
- non-linear activation functions allow the MLP to approximate non-linear continuous functions. 
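
The forward pass, backward pass, and gradient descent step described above can be sketched for a tiny MLP in NumPy. This is an illustration under stated assumptions, not the original 1986 formulation: one hidden layer of 4 sigmoid units, a sigmoid output, a half mean squared error loss, a learning rate of 0.5, and full-batch updates on the four XOR instances instead of mini-batches. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass: keep the hidden activations, they are needed on the way back."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)
    return a1, y_hat

def loss_fn(y_hat, y):
    """Half mean squared error over the batch."""
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(X, y, params, a1, y_hat):
    """Backward pass: apply the chain rule layer by layer, from output to input."""
    W1, b1, W2, b2 = params
    n = X.shape[0]
    d_z2 = (y_hat - y) * y_hat * (1.0 - y_hat) / n   # gradient at the output pre-activations
    dW2 = a1.T @ d_z2
    db2 = d_z2.sum(axis=0)
    d_a1 = d_z2 @ W2.T                               # error propagated back to the hidden layer
    d_z1 = d_a1 * a1 * (1.0 - a1)                    # sigmoid derivative is a * (1 - a)
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)
    return [dW1, db1, dW2, db2]

# XOR dataset
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(42)
params = [rng.normal(size=(2, 4)), np.zeros(4), rng.normal(size=(4, 1)), np.zeros(1)]

eta = 0.5
losses = []
for epoch in range(5000):
    a1, y_hat = forward(X, params)
    losses.append(loss_fn(y_hat, y))
    grads = backward(X, y, params, a1, y_hat)
    for p, g in zip(params, grads):
        p -= eta * g   # gradient descent step, updating each array in place

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

How close the network gets to the XOR targets depends on the random initialization and the number of epochs, so treat the printed loss as indicative only.
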

## Regression MLPs
- you need one output neuron per value you are trying to predict.
- Usually you do not use an activation function for the output neurons. 
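
As a rough sketch of the two points above, the forward pass below uses one output neuron per predicted value and leaves the output layer without an activation function. The layer sizes, the random weights, and the ReLU hidden layer are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, n_features, n_hidden, n_targets = 5, 3, 4, 2

X = rng.normal(size=(n_instances, n_features))
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_targets))   # one output neuron per predicted value
b2 = np.zeros(n_targets)

hidden = np.maximum(0, X @ W1 + b1)   # ReLU activation in the hidden layer
y_pred = hidden @ W2 + b2             # no output activation: any real value is possible

print(y_pred.shape)
```
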

## Classification MLPs
- A single output neuron can output the estimated probability of the positive class for binary classification
- For multiclass classification, you need one output neuron per class you are predicting
    - The softmax activation function ensures all the outputs sum to one. This is useful for exclusive multiclass classification
    - cross entropy loss function is useful here
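
To make the softmax and cross-entropy points concrete, here is a small NumPy sketch. The logits and labels are made-up values, and subtracting the row maximum before exponentiating is a common numerical-stability trick added here, not something stated in the notes.

```python
import numpy as np

def softmax(logits):
    """Turn each row of scores into probabilities that sum to one."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

# Two instances, three exclusive classes (one output neuron per class).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])   # index of the true class for each instance

probs = softmax(logits)
loss = cross_entropy(probs, labels)

print(probs.sum(axis=1))
print(loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches one.
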