# Artificial Neurons Detecting Hot Dogs
Artificial neurons are based on biological ones (who'da thunk). A biological neuron *receives information* from other neurons via its dendrites, *aggregates this information* through changes in cell voltage in the cell body, and *transmits a signal if the cell voltage crosses a threshold*; that signal can be received by many other neurons in the network.

The first artificial neuron was the perceptron, devised by neurobiologist Frank Rosenblatt.
The perceptron algorithm uses weighted inputs; in the example, these were the presence of ketchup (3 pts), mustard (2 pts), and a bun (6 pts). The weighted sum of the inputs is compared to a threshold (4 in the example), and the perceptron outputs a 1 (it's a hot dog) if the sum exceeds the threshold, or a 0 (it's not a hot dog) otherwise.
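
A minimal sketch of the hot dog perceptron described above. The weights and threshold come from the example; the function and variable names are made up for illustration, and the strict `>` comparison is an assumption (with these particular weights no combination of inputs sums to exactly 4, so the choice does not change any result).

```python
# Weights and threshold from the hot dog example; names are illustrative.
WEIGHTS = {"ketchup": 3, "mustard": 2, "bun": 6}
THRESHOLD = 4


def perceptron(inputs):
    """Return 1 (hot dog) if the weighted sum exceeds the threshold, else 0.

    `inputs` maps each feature name to 1 (present) or 0 (absent).
    """
    weighted_sum = sum(WEIGHTS[name] * present for name, present in inputs.items())
    return 1 if weighted_sum > THRESHOLD else 0


# Ketchup and mustard but no bun: 3 + 2 = 5, which exceeds 4.
print(perceptron({"ketchup": 1, "mustard": 1, "bun": 0}))
# Bun only: 6 exceeds 4.
print(perceptron({"ketchup": 0, "mustard": 0, "bun": 1}))
# Ketchup only: 3 does not exceed 4.
print(perceptron({"ketchup": 1, "mustard": 0, "bun": 0}))
```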

Bias. The bias of an artificial neuron is the negative of its threshold value: b = -threshold, so the neuron fires when w\*x + b > 0.
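
To see that the threshold and bias forms describe the same neuron, the sketch below checks every combination of the three hot dog inputs. The function names are hypothetical, and the strict `>` comparison is an assumption.

```python
from itertools import product

WEIGHTS = [3, 2, 6]   # ketchup, mustard, bun (from the hot dog example)
THRESHOLD = 4
BIAS = -THRESHOLD     # the bias is the negative of the threshold


def fires_with_threshold(x):
    """Threshold form: compare the weighted sum to the threshold."""
    weighted_sum = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1 if weighted_sum > THRESHOLD else 0


def fires_with_bias(x):
    """Bias form: add the bias, then compare to zero."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1 if z > 0 else 0


# Both forms agree on all 8 possible binary inputs.
for x in product([0, 1], repeat=3):
    assert fires_with_threshold(x) == fires_with_bias(x)
    print(x, fires_with_bias(x))
```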

These values (input weights, threshold, bias, etc) are the parameters of the neuron. 

## Modern Neurons and Activation Functions
Modern artificial neurons are not perceptrons. The most obvious restriction of the perceptron is that it only has a binary output. This makes learning very hard for a perceptron-based network, and we also often want to make predictions from inputs that are continuous variables.

## Sigmoid
In place of the abrupt step of the perceptron's binary output, the sigmoid neuron uses a smooth curve. Sigmoid (S-shaped) curves are ubiquitous in statistics (the cumulative distribution functions of the normal distribution and the t-distribution, for example). The sigmoid function is defined by σ(z) = 1/(1 + e^(-z)), where z is equivalent to w\*x + b.
The sigmoid function is the first example of an *activation function*. It is the canonical activation function, so much so that σ (sigma) is conventionally used to denote *any* activation function. The output from any given neuron's activation function is referred to simply as its *activation*; the book uses the symbol **a** for it.
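
As a rough sketch of how z and **a** fit together, the snippet below computes z = w\*x + b as a weighted sum plus bias and passes it through the sigmoid. The function name `neuron_activation` and the sample values (borrowed from the hot dog example, with b = -4) are illustrative assumptions, not something from the book.

```python
from math import exp


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1 / (1 + exp(-z))


def neuron_activation(w, x, b):
    """Compute z = w*x + b (weighted sum plus bias), then a = sigmoid(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Ketchup and mustard present, no bun: z = 3 + 2 + 0 - 4 = 1
a = neuron_activation(w=[3, 2, 6], x=[1, 1, 0], b=-4)
print(a)  # a value between 0.5 and 1, since z is positive
```

Unlike the perceptron, the output here is a continuous value between 0 and 1 rather than a hard 0 or 1.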

Now we work with the Sigmoid Function notebook. 

In [1]:
from math import e 

We have to load the constant *e*.

In [2]:
def sigmoid(z):
    return 1/(1+e**-z)

We define the sigmoid function.

In [3]:
sigmoid(.00001)

0.5000024999999999

As z gets closer to 0, the sigmoid output gets closer to 0.5.

In [4]:
sigmoid(10000)

1.0

For large values of z, the output approaches 1 (here it is indistinguishable from 1.0 in floating point).

In [7]:
sigmoid(-1)
sigmoid(-10)

4.539786870243442e-05

As z becomes more negative, the output moves closer to zero (only the result of the last expression in the cell, sigmoid(-10), is displayed).

Pretty standard stuff in all honesty. Small, gradual changes in the neuron's *w* and *b* parameters cause small, gradual changes in *z*, which in turn cause gradual changes in the neuron's activation **a**. Large negative and large positive values of z are the exception: at these extremes the output is pinned near 0 or 1. There, as with the perceptron, subtle updates to the weights and biases during training have little to no effect on the output, so learning stalls. This is called neuron *saturation*, and it occurs with most activation functions.
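
One way to make saturation concrete is to look at the slope of the sigmoid, which is σ(z)(1 - σ(z)). The sketch below is only an illustration (the helper name `sigmoid_slope` is made up here): the slope is largest at z = 0 and shrinks toward zero as z moves to either extreme, so a small change in z barely moves **a**.

```python
from math import exp


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1 / (1 + exp(-z))


def sigmoid_slope(z):
    """Slope (derivative) of the sigmoid at z: a * (1 - a)."""
    a = sigmoid(z)
    return a * (1 - a)


# The slope peaks at 0.25 for z = 0 and collapses for extreme z.
for z in (-50, -10, -2, 0, 2, 10, 50):
    print(z, sigmoid(z), sigmoid_slope(z))
```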

## Tanh Neuron
(Pronounced "tanch" colloquially)
The tanh activation function is defined as σ(z) = (e^z - e^(-z))/(e^z + e^(-z)). The shape of the tanh curve is similar to that of the sigmoid, but the tanh's output lies in the range \[-1, 1\] rather than between 0 and 1. Negative z inputs correspond to negative **a** activations, z = 0 corresponds to **a** = 0, and positive z inputs correspond to positive **a** activations. The outputs of tanh neurons tend to be centered around zero, which reduces neuron saturation.
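
A small sketch of the tanh formula above, written out directly and checked against Python's built-in `math.tanh`. The hand-written version is for illustration only; in practice the library function is the safer choice, since the direct formula overflows for very large |z|.

```python
import math


def tanh(z):
    """tanh written out from its definition: (e^z - e^(-z)) / (e^z + e^(-z))."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


# Negative z gives negative a, z = 0 gives a = 0, positive z gives positive a.
for z in (-2, -0.5, 0, 0.5, 2):
    print(z, tanh(z), math.tanh(z))
```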

## ReLU: Rectified Linear Units
ReLU functions are characterized by **a** = *max*(0, z). If z is positive, **a** = z; if z is negative or zero, **a** = 0. The output **a** does not vary linearly across all values of z: the ReLU is in essence two distinct linear functions combined to form a straightforward nonlinear function overall. Nonlinearity is essential to activation functions used within deep learning architectures. Nonlinear functions permit deep learning models to approximate any continuous function. This universal ability to approximate some output y given some input x is one of the hallmarks of deep learning, the characteristic that makes the approach so effective across such a breadth of applications.
The simple shape of the ReLU's brand of nonlinearity works to its advantage. Learning appropriate values for *w* and *b* within deep learning networks involves partial derivative calculus, and these calculus operations are more computationally efficient on the linear portions of the ReLU than on the curves of, say, the sigmoid and tanh functions. ReLUs are the most widely used neurons.
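
The ReLU and its slope can be sketched in a few lines, which hints at why the calculus is cheap: the slope is just 0 or 1, with no exponentials to evaluate. The helper name `relu_slope` is made up for illustration, and treating the slope at exactly z = 0 as 0 is a convention assumed here (the derivative is not defined at that point).

```python
def relu(z):
    """ReLU activation: a = max(0, z)."""
    return max(0, z)


def relu_slope(z):
    """Slope of the ReLU: 1 for positive z, 0 otherwise.

    The slope at exactly z == 0 is undefined; returning 0 is an assumed convention.
    """
    return 1 if z > 0 else 0


for z in (-3, -0.5, 0, 0.5, 3):
    print(z, relu(z), relu_slope(z))
```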

## Choosing a Neuron
ReLU is generally the preferred choice. It's very efficient compared to tanh and sigmoid. Don't bother with a perceptron.
Other forms of ReLU include the leaky ReLU, the parametric ReLU, and the exponential linear unit (ELU).
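
For reference, here is a hedged sketch of how these variants are commonly defined. The notes above do not give formulas, so the definitions and the default `alpha` values below are the usual textbook ones, assumed here rather than taken from the chapter. The parametric ReLU has the same form as the leaky ReLU, except that `alpha` is learned during training instead of being fixed.

```python
from math import exp


def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for positive z, a small slope alpha * z otherwise.

    A parametric ReLU uses the same formula with alpha learned during training.
    """
    return z if z > 0 else alpha * z


def elu(z, alpha=1.0):
    """Exponential linear unit: z for positive z, alpha * (e^z - 1) otherwise."""
    return z if z > 0 else alpha * (exp(z) - 1)


# Unlike the plain ReLU, both variants give nonzero output for negative z.
for z in (-10, -1, 0, 1, 10):
    print(z, leaky_relu(z), elu(z))
```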

## Key Concepts of Chapter 6
Parameters:
    Weight *w*
    Bias *b*
Activation **a**
Artificial neurons:
    sigmoid
    tanh
    ReLU