# Intro to TFLearn
TFLearn is a high-level library built on top of TensorFlow. In these notes it is used to build neural networks for sentiment analysis.  
The library does a lot of the dirty work, such as *initializing weights, running the forward pass, and taking care of the backpropagation* (looks like **all** the work of building a neural network to me).

## Activation Functions
In the past, the **sigmoid function** was the main activation function. It is not the only activation function available, and it has some drawbacks:
<img src="https://d17h27t6h515a5.cloudfront.net/topher/2017/February/5893d15c_sigmoids/sigmoids.png" alt="Drawing" style="width: 600px;"/>
- Error shrinking: The derivative of the sigmoid function maxes out at 0.25, so during backpropagation the errors going back into the network are shrunk by at least 75% at every sigmoid layer. For models with a lot of layers, the weight updates in the early layers will be tiny and the network will take a long time to train.

As a result, sigmoids are usually a poor choice of activation for hidden units.
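
The 0.25 ceiling can be checked numerically. The snippet below is a minimal sketch in plain Python; the function names and the five-layer depth are illustrative assumptions, and the weights are ignored so that only the effect of the activation's slope is visible.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0, where it equals exactly 0.25.
max_slope = sigmoid_prime(0.0)

# Best case for a hypothetical 5-layer stack: every layer passes the
# error back scaled by the full 0.25 (weights ignored for simplicity).
n_layers = 5
best_case_scale = max_slope ** n_layers

print(max_slope)        # 0.25
print(best_case_scale)  # 0.0009765625
```

Even in this best case, the error reaching the first layer is less than a thousandth of its original size.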

## Rectified Linear Units (ReLUs)

### ReLU Definition
Most recent deep learning networks use **rectified linear units (ReLUs)** for the hidden layers.  
Mathematically:  

$$
f(x) =
\left\{
	\begin{array}{ll}
		x  & \mbox{if } x > 0 \\
		0 & \mbox{if } x \leq 0
	\end{array}
\right.
$$

Graphically, it looks like:  
<img src="https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58915ae8_relu/relu.png" alt="Drawing" style="height: 300px;"/>
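
As a rough illustration of the definition above, here is a plain-Python sketch of the ReLU and its slope. The function names are illustrative, and setting the slope at exactly `x = 0` to 0 is a common convention assumed here, not something stated in these notes.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def relu_prime(x):
    """Slope of the ReLU: 1 for positive inputs, 0 otherwise.

    The slope at exactly x = 0 is undefined; using 0 is an assumed convention.
    """
    return 1.0 if x > 0 else 0.0

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
slopes = [relu_prime(x) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(slopes)   # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Unlike the sigmoid, the slope is 1 for every positive input, so errors passing back through active units are not shrunk by the activation.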

### Drawbacks
A large gradient can update the weights in such a way that a ReLU unit outputs 0 for every input. These "dead" units stay at 0, stop learning, and waste computation during training.

From Andrej Karpathy:
>Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.
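
The effect described in the quote can be imitated with a single unit. This is a toy sketch: the data, the starting weights, and the "oversized update" are all made-up values chosen for illustration, not the result of a real training run.

```python
def relu(x):
    return max(0.0, x)

def relu_prime(x):
    # Slope is 1 for positive inputs, 0 otherwise (0 at x = 0 by convention).
    return 1.0 if x > 0 else 0.0

# Hypothetical one-dimensional training inputs, all positive.
data = [0.5, 1.0, 1.5, 2.0]

def active_fraction(w, b):
    """Fraction of datapoints on which the unit relu(w * x + b) fires."""
    return sum(1 for x in data if relu(w * x + b) > 0) / len(data)

# Before: a healthy unit that fires on every datapoint.
w, b = 1.0, 0.0
before = active_fraction(w, b)

# After: pretend one oversized update (learning rate too high)
# pushed the weight and bias far into negative territory.
w, b = -3.0, -1.0
after = active_fraction(w, b)

# Gradient of the output with respect to w for each datapoint:
# relu'(w * x + b) * x. It is zero everywhere, so gradient descent
# can no longer move w and the unit stays dead.
grads = [relu_prime(w * x + b) * x for x in data]

print(before)  # 1.0
print(after)   # 0.0
print(grads)   # [0.0, 0.0, 0.0, 0.0]
```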

## Softmax
The softmax function **squashes the output of each unit** to be between 0 and 1, just like a sigmoid. It also divides each output such that **the outputs sum to 1**. The output of the softmax function is equivalent to a categorical probability distribution: it tells you the probability that each class is the true one.  
The main practical difference between softmax and sigmoid is that softmax normalizes the outputs so that they sum to 1.  
<img src="https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58950908_softmax-input-output/softmax-input-output.png" alt="Drawing" style="height: 100px;"/>  
Mathematically:  
<img src="https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58938e9e_softmax-math/softmax-math.png" alt="Drawing" style="height: 50px;"/>
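
A minimal sketch of softmax in plain Python, assuming the usual definition (exponentiate each input, then divide by the sum of the exponentials). Subtracting the largest input first is a standard numerical-stability trick added here; it does not change the result. The input values are arbitrary.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest input gets the largest probability, and the three outputs add up to 1.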


## Categorical Cross-Entropy
In the earlier gradient descent examples, the sum of squared errors was used as the cost function, but in those cases the network had only a single (scalar) output value.
When using **softmax**, the output is a vector.  
You can also express your data labels as a vector using what's called **one-hot encoding**: a vector that is 1 at the position of the correct class and 0 everywhere else.  
Cross entropy measures *how far apart the label vector and the predicted vector are*.  
<img src="https://d17h27t6h515a5.cloudfront.net/topher/2017/February/589b18f5_cross-entropy-diagram/cross-entropy-diagram.png" alt="Drawing" style="height: 150px;"/> 
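
For a one-hot label vector $y$ and a predicted vector $\hat{y}$, the usual definition is $D(\hat{y}, y) = -\sum_j y_j \ln \hat{y}_j$. The sketch below implements that formula in plain Python; the helper names, the example predictions, and the small `eps` guard against `log(0)` are assumptions made for illustration.

```python
import math

def one_hot(index, n_classes):
    """Label vector with 1.0 at the correct class and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]

def cross_entropy(predicted, label, eps=1e-12):
    """Categorical cross-entropy: -sum(y_j * ln(p_j)).

    eps keeps log() away from zero; it is an implementation detail.
    """
    return -sum(y * math.log(p + eps) for p, y in zip(predicted, label))

label = one_hot(1, 3)          # the correct class is index 1

good = [0.1, 0.8, 0.1]         # most probability on the correct class
bad = [0.7, 0.2, 0.1]          # most probability on a wrong class

loss_good = cross_entropy(good, label)
loss_bad = cross_entropy(bad, label)

print(label)                # [0.0, 1.0, 0.0]
print(round(loss_good, 3))  # 0.223
print(round(loss_bad, 3))   # 1.609
```

Because the label is one-hot, only the predicted probability of the correct class contributes, so the loss is small when that probability is close to 1 and grows as it approaches 0.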

### Sentiment Analysis with TFLearn