# Deep Sparse Rectifier Neural Networks

#### Introduction

Rectifier function:
$$\max(0, x)$$
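As a minimal sketch, the rectifier can be written element-wise with NumPy. The function name `rectifier` and the sample inputs are illustrative choices, not taken from the paper.

```python
import numpy as np

def rectifier(x):
    """Element-wise rectifier: max(0, x)."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# Negative inputs map to exactly 0; positive inputs pass through unchanged.
print(rectifier(x))
```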

The leaky integrate-and-fire (LIF) model relates the firing rate $f$ to the input current $I$:
- $f(I)$
    - $[ t\ \log( \frac{ E + RI - V_r }{ E + RI - V_{th} } ) + t_{ref} ]^{-1}$ if $E + RI > V_{th}$
    - $0$ if $E + RI \le V_{th}$
    
Where $t_{ref}$ is the refractory period (the minimal time between two action potentials), $I$ is the input current, $V_r$ is the resting potential, $V_{th}$ is the threshold potential (with $V_{th} > V_r$), and $R$, $E$, $t$ are the membrane resistance, potential and time constant.
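The firing-rate formula above can be sketched in a few lines of NumPy. The function name `lif_rate` and all default parameter values are placeholders chosen only for illustration; they are not values from the paper.

```python
import numpy as np

def lif_rate(I, E=0.0, R=1.0, V_r=0.0, V_th=1.0, t=0.02, t_ref=0.002):
    """Firing rate f(I) of a leaky integrate-and-fire neuron.

    Default parameter values are illustrative assumptions only.
    """
    I = np.atleast_1d(np.asarray(I, dtype=float))
    drive = E + R * I
    rate = np.zeros_like(drive)          # f(I) = 0 when E + RI <= V_th
    active = drive > V_th
    d = drive[active]
    # [t * log((E + RI - V_r) / (E + RI - V_th)) + t_ref]^(-1)
    rate[active] = 1.0 / (t * np.log((d - V_r) / (d - V_th)) + t_ref)
    return rate

rates = lif_rate([0.5, 1.0, 2.0, 4.0])
print(rates)  # zero at or below threshold, increasing above it
```

The rate is one-sided like the rectifier: it is exactly zero below the threshold and grows with the input current above it, bounded by $1 / t_{ref}$.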

The most commonly used activation functions in deep learning and neural networks are the logistic sigmoid (output between 0 and 1) and the hyperbolic tangent (output between -1 and 1). The two are equivalent up to a linear transformation. The hyperbolic tangent has a steady state at 0 and is therefore preferred from an optimization standpoint, but it forces an antisymmetry around 0.
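The linear relationship between the two functions is the identity $\tanh(x) = 2\sigma(2x) - 1$. A quick numerical check, with `sigmoid` defined here only for the example:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)
# Largest absolute difference between tanh(x) and 2*sigmoid(2x) - 1
max_gap = np.max(np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)))
print(max_gap)  # should be at floating-point rounding level
```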

###### Advantages of Sparsity

Information Disentangling: a dense representation is highly entangled because almost any change in the input modifies most of the entries in the representation. If a representation is both sparse and robust to small input changes, the set of non-zero features is almost always roughly conserved by small changes in the input.
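A small experiment can illustrate the contrast. This is only a sketch under assumed settings (random weights, Gaussian input, a perturbation scale of 0.01, a fixed seed), not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 50, 256                      # assumed layer sizes
W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in)) / np.sqrt(n_in)

x = rng.standard_normal(n_in)
x_perturbed = x + 0.01 * rng.standard_normal(n_in)   # small input change

# Sparse view: which rectifier units are non-zero for each input
active = (W @ x) > 0
active_perturbed = (W @ x_perturbed) > 0
unchanged_fraction = np.mean(active == active_perturbed)

# Dense view: how many tanh outputs changed value at all
dense_changed_fraction = np.mean(np.tanh(W @ x) != np.tanh(W @ x_perturbed))

print(unchanged_fraction, dense_changed_fraction)
```

In this setup the set of active rectifier units stays almost the same, while nearly every entry of the dense tanh representation changes.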

<br>

Efficient Variable-Size Representation: different inputs may contain different amounts of information and would be more conveniently represented using a variable-size data structure. Varying the number of active neurons allows a model to control the effective dimensionality of the representation for a given input and the required precision.

<br>

Linear Separability: sparse representations are more likely to be linearly separable because the information is represented in a high-dimensional space.

<br>

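Distributed but Sparse: dense distributed representations are the richest, being potentially exponentially more efficient than purely local ones. The efficiency of sparse representations is still exponentially greater, with the power of the exponent being the number of non-zero features.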

#### Deep Rectifier Networks

The rectifier function is one-sided and therefore does not enforce sign symmetry or antisymmetry; an output of 0 represents no response. We can, however, obtain symmetry or antisymmetry by combining two rectifier units that share parameters.
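One way to read "sharing parameters" is that the second unit sees the negated pre-activation of the first. That reading is an assumption made for this sketch; under it, the difference of the two units is antisymmetric and their sum is symmetric.

```python
import numpy as np

def rectifier(x):
    return np.maximum(0.0, x)

a = np.linspace(-3.0, 3.0, 7)                 # shared pre-activation values

antisymmetric = rectifier(a) - rectifier(-a)  # recovers a itself
symmetric = rectifier(a) + rectifier(-a)      # recovers |a|

print(antisymmetric)
print(symmetric)
```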

###### Advantages

The rectifier allows a network to easily obtain sparse representations. After uniform initialization of the weights, around half of the hidden units' continuous output values are real zeros.
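The "around half" figure can be checked empirically. The sketch below assumes zero biases, weights drawn uniformly from a symmetric interval, and Gaussian inputs; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 100, 1000, 200    # assumed sizes

# Symmetric uniform initialization, zero biases (assumptions of this sketch)
W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in)) / np.sqrt(n_in)
X = rng.standard_normal((n_samples, n_in))

H = np.maximum(0.0, X @ W.T)                  # rectifier hidden layer
zero_fraction = np.mean(H == 0.0)             # share of exact zeros

print(zero_fraction)  # close to 0.5
```

Because the weights are symmetric around zero, each pre-activation is negative about half the time, and the rectifier turns those into exact zeros rather than small non-zero values.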

###### Disadvantages

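The hard saturation at 0 could hurt optimization by blocking gradient back-propagation. To evaluate this effect, the rectifier can be compared with the softplus activation:
$$\text{softplus}(x) = \log(1 + e^x)$$

which is a smooth version of the rectifying non-linearity. Hard non-linearities do not hurt so long as the gradient can propagate along some paths (that is, some hidden units in each layer are non-zero).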
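A minimal sketch of softplus next to the rectifier. It uses `np.logaddexp`, which computes $\log(e^a + e^b)$ without overflow; this stable form is an implementation choice, not something the paper prescribes.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)) written as log(exp(0) + exp(x)) for numerical stability
    return np.logaddexp(0.0, x)

def rectifier(x):
    return np.maximum(0.0, x)

x = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
gap = softplus(x) - rectifier(x)

# The gap is largest at x = 0 (log 2) and vanishes for large |x|.
print(gap)
```

Softplus never outputs an exact zero, so it gives up the exact sparsity of the rectifier in exchange for a smooth gradient everywhere.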

Because rectifier activations are unbounded, a regularizer helps prevent numerical problems. An $L_1$ penalty on the activation values is recommended, which also promotes additional sparsity.
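An $L_1$ activation penalty could be sketched as below. The function name, the batch layout (rows are examples), the averaging over the batch, and the coefficient value are all assumptions for illustration.

```python
import numpy as np

def l1_activation_penalty(hidden_activations, coefficient=1e-3):
    """Coefficient times the per-example sum of |h|, averaged over the batch.

    The coefficient value is an illustrative assumption.
    """
    per_example = np.sum(np.abs(hidden_activations), axis=1)
    return coefficient * np.mean(per_example)

# Two examples, three hidden units each
H = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.0, 3.0]])
penalty = l1_activation_penalty(H, coefficient=0.1)
print(penalty)
```

This term would be added to the training cost, so that larger activations cost more and units are pushed toward zero.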

#### Unsupervised Pre-training

Linear reconstruction function:
$$f(x, \theta) = W_{dec}\ \max( W_{enc}x + b_{enc}, 0 ) + b_{dec}$$

with $\tilde x$ denoting the corrupted version of $x$, $\sigma()$ the logistic sigmoid function and $\theta$ the model parameters $(W_{enc}, b_{enc}, W_{dec}, b_{dec})$.
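The reconstruction function can be sketched directly from its definition. The layer sizes, the random initialization, and the masking-noise corruption with probability 0.25 are assumptions made for this example.

```python
import numpy as np

def reconstruct(x, W_enc, b_enc, W_dec, b_dec):
    """f(x, theta) = W_dec max(W_enc x + b_enc, 0) + b_dec"""
    hidden = np.maximum(W_enc @ x + b_enc, 0.0)   # rectifier encoding
    return W_dec @ hidden + b_dec                 # linear decoding

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 4                             # assumed sizes
W_enc = rng.uniform(-0.5, 0.5, size=(n_hidden, n_in))
b_enc = np.zeros(n_hidden)
W_dec = rng.uniform(-0.5, 0.5, size=(n_in, n_hidden))
b_dec = np.zeros(n_in)

x = rng.uniform(0.0, 1.0, size=n_in)
# Assumed corruption process: zero each input with probability 0.25
mask = rng.random(n_in) >= 0.25
x_tilde = x * mask

output = reconstruct(x_tilde, W_enc, b_enc, W_dec, b_dec)
print(output.shape)
```

During denoising training the function is applied to the corrupted input $\tilde x$, while the loss compares the result against the clean $x$.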

###### Strategies

Use softplus activation function for the reconstruction layer, along with a quadratic cost:
$$L(x, \theta) = \| x - \log( 1 + \exp( f( \tilde x, \theta ) ) ) \|^2$$

This strategy has proven to yield better generalization on image data.
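A sketch of the softplus reconstruction cost, taking "quadratic cost" to mean the squared Euclidean norm. The argument `pre_activation` stands for $f(\tilde x, \theta)$ and is a name chosen for this example.

```python
import numpy as np

def softplus_quadratic_loss(x, pre_activation):
    """Squared error between x and softplus(a), with a = f(x_tilde, theta)."""
    reconstruction = np.logaddexp(0.0, pre_activation)   # log(1 + exp(a))
    return np.sum((x - reconstruction) ** 2)

a = np.array([0.0, 0.0])                  # hypothetical decoder output
target = np.array([0.0, 0.0])             # hypothetical clean input
loss = softplus_quadratic_loss(target, a)
print(loss)  # each term contributes (log 2)^2
```

Softplus keeps the reconstruction non-negative, which matches targets that are themselves rectifier activations or pixel intensities.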

Scale the rectifier activation values coming from the previous encoding layer to bound them between 0 and 1, then use a sigmoid activation function for the reconstruction layer along with a cross-entropy reconstruction cost:
$$L( x, \theta ) = -x\ \log( \sigma( f( \tilde x, \theta ) ) ) - ( 1 - x )\ \log( 1 - \sigma( f( \tilde x, \theta ) ) )$$
    
This strategy has proven to provide better generalization on text data.
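The cross-entropy strategy could be sketched as follows. Dividing by the largest activation is one assumed way to bound values in $[0, 1]$; the notes do not say how the scaling is done. The loss uses the identities $\log \sigma(a) = -\text{softplus}(-a)$ and $\log(1 - \sigma(a)) = -\text{softplus}(a)$ to stay numerically stable.

```python
import numpy as np

def scale_to_unit_interval(h):
    # Assumed scaling: divide by the largest activation (method not given in the notes)
    peak = np.max(h)
    return h / peak if peak > 0 else h

def sigmoid_cross_entropy_loss(x, pre_activation):
    """-x log(sigmoid(a)) - (1 - x) log(1 - sigmoid(a)), summed over units."""
    return np.sum(x * np.logaddexp(0.0, -pre_activation)
                  + (1.0 - x) * np.logaddexp(0.0, pre_activation))

h = np.array([0.0, 2.0, 4.0])             # hypothetical rectifier activations
x = scale_to_unit_interval(h)             # targets now lie in [0, 1]
a = np.array([-1.0, 0.0, 2.0])            # hypothetical decoder pre-activations
loss = sigmoid_cross_entropy_loss(x, a)
print(x, loss)
```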

Two further strategies pair a quadratic cost with a different activation function for the reconstruction layer: a linear activation function, or a rectifier activation function.

#### Experimentation

Negative Log-Likelihood as the cost function for training:
$$-\log P( \text{correct class} \mid \text{input} )$$
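A sketch of this cost for a single example, assuming the class probabilities come from a softmax over output scores. The softmax output layer and the name `logits` are assumptions of this example.

```python
import numpy as np

def negative_log_likelihood(logits, correct_class):
    """-log P(correct class | input), with P given by a softmax over logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)                 # avoid overflow in exp
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[correct_class]

uniform_case = negative_log_likelihood([0.0, 0.0, 0.0], correct_class=1)
confident_case = negative_log_likelihood([10.0, 0.0], correct_class=0)
print(uniform_case, confident_case)
```

The cost is $\log 3$ when three classes are equally likely and approaches 0 as the model puts more probability on the correct class.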

Despite the hard threshold at 0, networks trained with the rectifier activation function can find local minima of greater or equal quality than those obtained with its smooth counterpart, softplus. Rectifiers are not only biologically plausible, they are also computationally efficient.

Rectifiers also depend less on unsupervised pre-training when plenty of labeled data is available. Rectifier networks still improve after an unsupervised pre-training phase in a semi-supervised setting. Adding hidden layers actually hurts the overall RMSE.

<br>

The conclusion is that rectifier units help bridge the gap between unsupervised pre-training and no pre-training, which suggests they may be better at finding good minima during training.

Rectifier networks tend to perform best on sentiment analysis, since text-based tasks have a very large degree of sparsity, and they are also good for image classification.

###### Lesson: Rectifier Neural Networks work well with Sparse Data