### Activation Functions (Nonlinearities):
An activation function is a function associated with a given neuron which takes in the weighted sum and operates on it, returning a number. The $\texttt{relu}$ function is one such activation function. 

A good activation function should have the following properties:
- Continuous at every $x$ in the real numbers
- Injective &mdash; one unique output for each unique input. Simple tests: check that the function is monotonic (its derivative never changes sign) or use the horizontal line test. Note that relu only satisfies this loosely: it never changes direction, but every negative input maps to 0
- Non-linear &mdash; any function whose graph isn't a single straight line. The relu function is piecewise linear with a bend at 0, so it is not linear overall. Having a non-linear function is necessary for <em>conditional correlation</em>
- Should be efficient to compute, because the activation function could be called a massive number of times. There is usually a trade-off between the expressive power of the function and how fast it is to compute
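
As a rough illustration of the properties above, here is a minimal sketch of relu in plain Python with a numerical check that it never changes direction. The helper `is_non_decreasing` and the sample range are assumptions made for this example, not part of the original notes.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return x if x > 0 else 0.0


def relu_derivative(x):
    """Slope of relu: 1 for positive inputs, 0 otherwise (0 is chosen at x == 0)."""
    return 1.0 if x > 0 else 0.0


def is_non_decreasing(f, xs):
    """Numerically check that f never decreases over the sample points xs."""
    ys = [f(x) for x in sorted(xs)]
    return all(a <= b for a, b in zip(ys, ys[1:]))


# Hypothetical sample points in [-5, 5]
samples = [x / 10 for x in range(-50, 51)]

print(is_non_decreasing(relu, samples))  # does relu ever change direction?
print(relu(-2.0) == relu(-1.0))          # do two different negative inputs collide?
```

The second check shows why relu is only loosely one-to-one: it is monotonic, but flat for negative inputs.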


### Standard Activation Functions &mdash; Hidden Layer:
- $sigmoid$ &mdash; maps weighted sums to a value in $(0, 1)$. This lets you interpret the output of a neuron as a probability measure. $S(x)=\frac{1}{1+e^{-x}}$

<img src="img/sigmoid.png" style="width: 35%">

- $tanh$ &mdash; maps weighted sums to a value in $(-1, 1)$. $tanh$ can give a measure of negative correlation (rather than just positive correlation, as in the case of sigmoid). It generally outperforms sigmoid in hidden layers because of this ability to measure negative correlation, although the right activation function still depends on the application. $\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$

<img src="img/tanh.png" style="width: 35%">
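
The two formulas above translate directly into code. This is a minimal sketch using only the standard library; the function names and the sample inputs are chosen for illustration.

```python
import math


def sigmoid(x):
    """S(x) = 1 / (1 + e^-x); the output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_formula(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x); the output lies in (-1, 1).

    Written out to mirror the formula. For large |x| this overflows;
    math.tanh is the numerically safer choice in practice.
    """
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh_from_formula(x))
```

Negative inputs give negative $tanh$ outputs, while sigmoid outputs stay positive.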



### Standard Activation Functions &mdash; Output Layer:
The choice of activation function in the <em>output layer</em> depends on what prediction is being made.
- $sigmoid$ &mdash; for yes/no probability predictions. Eg. is this a dog?
- $softmax$ &mdash; for classifications, based on highest probabilities (selecting a single label out of many possible labels). 
- No activation function &mdash; for non-probability predictions. Eg. what will the temperature be tomorrow?

For the MNIST digit classifier, for instance, $\texttt{softmax}$ is the best choice. 



### Softmax
Consider the output layer of the MNIST digit classifier.

1. Get the raw outputs (with no activation function applied to the weighted sums)

<img src="img/softmax_1.png" style="width: 75%">

2. For each output, pass it through $e^{x}$

<img src="img/softmax_2.png" style="width: 75%">

3. Get the sum of all the exponentiated outputs, then divide every node's value by that sum. Now we've obtained a probability distribution over the labels:

<img src="img/softmax_3.png" style="width: 75%">
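
The three steps above can be sketched in a few lines of plain Python. The raw output values are made up for illustration. Subtracting the maximum before exponentiating is an addition not mentioned in the notes: it leaves the result unchanged but avoids overflow for large inputs.

```python
import math


def softmax(raw_outputs):
    """Turn a list of raw outputs into a probability distribution."""
    # Assumption: shift by the max for numerical stability (result is unchanged).
    shift = max(raw_outputs)
    # Step 2: pass each raw output through e^x.
    exps = [math.exp(v - shift) for v in raw_outputs]
    # Step 3: divide every value by the sum of all values.
    total = sum(exps)
    return [e / total for e in exps]


# Step 1: hypothetical raw outputs for four labels.
raw = [0.0, 1.0, 0.5, 3.0]
probs = softmax(raw)
print(probs)
print(sum(probs))
```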


The final values sum to one. From this, we can see that $\texttt{softmax}$ tends to attenuate the weaker possibilities and amplify one output. If we had used $\texttt{sigmoid}$ instead, we would have obtained:

<img src="img/softmax_vs_sigmoid.png" style="width: 75%">

With $\texttt{sigmoid}$, backpropagation suffers because the deltas are quite large for the other 9 output nodes, even though the neural network was able to indicate which label it thought was the best answer. The network will then proceed to change its weights unnecessarily. With $\texttt{softmax}$, the deltas for the other 9 output nodes are close to 0, so they cause almost no weight update, which is the desired behaviour.


Note: the "input to a layer" refers to the raw vector of values obtained from the multiplication of the previous layer and the weights to the current layer. 

To compute $\texttt{layer_n_delta}$, we just multiply the backpropagated delta by the layer's slope, that is, the derivative of the activation function evaluated at that layer's own values. Eg. $\texttt{layer_1_delta} = (\texttt{layer_2_delta} \times \texttt{weights_1_2}) \times \texttt{reluDerivative(layer_1)}$
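
Here is a minimal sketch of that delta computation in plain Python. It assumes `weights_1_2[j][k]` connects hidden node `j` to output node `k`, and that the slope is evaluated at layer 1's own values. The function name `backprop_delta` and all numbers are hypothetical.

```python
def relu_derivative(x):
    """Slope of relu: 1 for positive values, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def backprop_delta(layer_2_delta, weights_1_2, layer_1):
    """Compute layer_1_delta from layer_2_delta.

    For each hidden node j: sum the output deltas weighted by the
    connections leaving j, then multiply by the relu slope at node j.
    """
    layer_1_delta = []
    for j, activation in enumerate(layer_1):
        backpropagated = sum(d * w for d, w in zip(layer_2_delta, weights_1_2[j]))
        layer_1_delta.append(backpropagated * relu_derivative(activation))
    return layer_1_delta


# Hypothetical values: 3 hidden nodes, 2 output nodes.
layer_1 = [0.0, 0.7, 1.2]              # hidden values after relu
weights_1_2 = [[0.1, -0.2],
               [0.4, 0.3],
               [-0.5, 0.6]]
layer_2_delta = [0.5, -1.0]

print(backprop_delta(layer_2_delta, weights_1_2, layer_1))
```

Because relu's output is positive exactly when its input is positive, evaluating the slope on the layer's output gives the same result as evaluating it on the input. The first hidden node was switched off by relu, so its delta is 0 and its incoming weights are left alone.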

In the case of $\texttt{sigmoid}$, looking at the slope, we can see that for large positive inputs, the impact of further increases to the input is much weaker. For large negative inputs, the impact is similarly weak. This reduces how severely the weights will be adjusted, meaning that we run a smaller risk of corrupting their usefulness in predicting other labels. In general, nonlinearities like $\texttt{sigmoid}$ tend to prevent this kind of corruption.
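
To make the flattening slope concrete, this sketch evaluates the sigmoid derivative, $S'(x) = S(x)(1 - S(x))$, at a few hypothetical inputs.

```python
import math


def sigmoid(x):
    """S(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Slope of sigmoid at x: S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"x={x:+.1f}  slope={sigmoid_derivative(x):.4f}")
```

The slope peaks at 0.25 when the input is 0 and shrinks toward 0 in both directions, which is what scales down the delta for confident nodes.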

Softmax is best used with the <em>cross-entropy</em> error function.
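
One reason the pairing works well: with cross-entropy as the error function, the delta with respect to each raw output simplifies to the predicted probability minus the target (1 for the true label, 0 otherwise). The sketch below assumes a single true label given as an index. The function names and the raw output values are hypothetical.

```python
import math


def softmax(raw_outputs):
    """Turn raw outputs into probabilities (shifted by the max for stability)."""
    shift = max(raw_outputs)
    exps = [math.exp(v - shift) for v in raw_outputs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Error for one example: negative log of the true label's probability."""
    return -math.log(probs[target_index])


def softmax_cross_entropy_delta(probs, target_index):
    """Delta for each raw output: predicted probability minus one-hot target."""
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]


raw = [0.0, 1.0, 0.5, 3.0]   # hypothetical raw outputs
target_index = 3             # hypothetical true label
probs = softmax(raw)

print(cross_entropy(probs, target_index))
print(softmax_cross_entropy_delta(probs, target_index))
```

Labels that already receive a small probability get a small delta, so the weights feeding them are barely changed.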