# Express Deep Learning in Python

## Advanced Layers

The `Dense` layer is only one of the core layers available in Keras. `Dense` is a *forward* layer: a layer that takes an input and applies a transformation to it (in this case, a matrix multiplication).

Other important layers to consider are: activation layers, regularization layers, dropout layers, convolutional layers, pooling layers, recurrent layers, normalization layers, embedding layers, noise layers, etc.

For this tutorial we will focus on some layers to aid in the tuning of the network: activations, regularizers and dropout; as well as the layers needed to design convolutional neural networks: convolutional and pooling layers.

At the end of this tutorial we will point to other tutorials and examples that cover the other kinds of layers.

In [None]:
from keras import backend as K
from keras import regularizers
from keras.layers import Activation, ActivityRegularization, Dense, Dropout
from keras.models import Sequential

## Activation Functions

A neural network classifier with linear activations has no more *representation* power than a logistic regression classifier. To express non-linearity with a neural network model, a non-linear function is needed as the activation function of each neuron.
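To see why, consider a sketch with two stacked layers whose activation is the identity. The symbols $W_1, W_2$ (weight matrices), $b_1, b_2$ (bias vectors) and $x$ (input) are introduced here only for illustration:

$$
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) = W x + b
$$

The composition is again a single affine map with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$, so stacking more linear layers adds no expressive power. A non-linear activation between the layers is what prevents this collapse.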

One simple activation function is the **sigmoid (or logistic) function**, the same one used in logistic regression, which restricts the output value to be between zero and one. It was one of the most common non-linearities used as activation function in some of the *first versions* of neural networks. There are, however, other possibilities (all of the following are available in Keras, and more can be adapted):

* rectified linear unit (ReLU)
* tanh
* hard sigmoid
* softsign
* softplus
* exponential linear unit (ELU)
* scaled exponential linear unit (SELU)
* leaky rectified linear unit (Leaky ReLU)
* parametric rectified linear unit (PReLU)

Of these, the one most used in present state-of-the-art neural network classifiers is the **ReLU**, because it typically learns much faster in networks with many layers [1].
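As a rough illustration of how some of the functions listed above behave, here is a minimal pure-Python sketch. It is not how Keras implements them (Keras operates on tensors), and the `alpha` slope used for the leaky ReLU is an arbitrary illustrative value.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope `alpha` (illustrative value)."""
    return x if x > 0 else alpha * x

def softplus(x):
    """Smooth approximation of ReLU: log(1 + exp(x))."""
    return math.log1p(math.exp(x))

# Compare the functions on a few sample inputs.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), math.tanh(x), relu(x), leaky_relu(x), softplus(x))
```

Note that ReLU is exactly zero for negative inputs, while the sigmoid and tanh saturate smoothly at both ends.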

There is another activation layer which is the **SoftMax** activation. This is generally used as the last activation layer, i.e. as the output of the network. This function, also known as the *normalized exponential function*, is a generalization of the logistic function that "squashes" a $K$-dimensional vector $\mathbf{z}$ of arbitrary real values to a $K$-dimensional vector $\sigma(\mathbf{z})$ of real values in the range [0, 1] that add up to 1.
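The "squashing" can be sketched in a few lines of pure Python. This is an illustrative implementation rather than the Keras one, and the `scores` values are made up for the example.

```python
import math

def softmax(z):
    """Normalized exponential of a list of real values."""
    # Subtracting the maximum avoids overflow and does not change the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical raw network outputs for 3 classes
probs = softmax(scores)
print(probs)       # each value lies between 0 and 1
print(sum(probs))  # the values add up to 1 (up to floating point error)
```

The largest score receives the largest probability, which is why the output can be read as a distribution over classes.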

### Activation Functions in Keras

Keras provides two ways to define an activation function. Both are equally valid.

#### Activation as a parameter of a forward layer

In [None]:
model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(Dense(10, activation='softmax'))

#### Activation as a layer

In [None]:
model = Sequential()
model.add(Dense(64, input_shape=(784,)))
model.add(Activation('tanh'))
model.add(Dense(10))
model.add(Activation('softmax'))

#### Activation from a TensorFlow function

In the previous examples we used some of the available functions in the Keras library.

We can also use an element-wise TensorFlow function as activation.

In [None]:
model = Sequential()
model.add(Dense(64, input_shape=(784,),
                activation=K.sigmoid))
model.add(Dense(10, activation='softmax'))

## Regularizers

Regularizers let you apply penalties on layer parameters or layer activity during optimization. These penalties are incorporated into the loss function that the network optimizes, and they are applied on a per-layer basis.

Regularizers can be applied to three targets:

* Weight/kernel matrix regularization: Applies the regularizer function to the weight matrix (called kernel matrix in the Keras documentation).
* Bias regularization: Applies the regularizer to the bias vector.
* Activity regularization: Applies the regularizer to the output of the layer (i.e. the result of the activation function).

Keras ships with three penalties that can be used as regularizers (and the API permits the definition of a custom regularizer) [2]: L1, L2, and elastic net (a combination of L1 and L2).
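To make the penalties concrete, the following pure-Python sketch computes each one for a small list of weights and adds it to a loss value. The function names, the weights, and the `data_loss` value are all hypothetical; Keras computes the same quantities on tensors.

```python
def l1_penalty(weights, factor):
    """L1 penalty: factor times the sum of absolute values."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """L2 penalty: factor times the sum of squares."""
    return factor * sum(w * w for w in weights)

def l1_l2_penalty(weights, l1, l2):
    """Elastic-net style penalty: the sum of the L1 and L2 penalties."""
    return l1_penalty(weights, l1) + l2_penalty(weights, l2)

weights = [0.5, -1.0, 2.0]  # hypothetical flattened weights of one layer
data_loss = 0.30            # hypothetical loss value before regularization

# The penalty is added to the loss that the optimizer minimizes.
total_loss = data_loss + l1_l2_penalty(weights, l1=0.01, l2=0.01)
print(total_loss)
```

Because the penalty grows with the size of the weights, minimizing the total loss pushes the weights toward smaller values.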

### Regularizers in Keras

As with activation functions, there are two ways to use a regularizer in Keras, although not every parameter supports both.

#### Regularization as parameter of a layer

This is the most practical way and the only one which allows the individual regularization of each available parameter.

The regularizer is given as a parameter of the layers (e.g. `Dense`):

* `kernel_regularizer`
* `bias_regularizer`
* `activity_regularizer`

The available penalties for this case are:

* `keras.regularizers.l1`
* `keras.regularizers.l2`
* `keras.regularizers.l1_l2`

In [None]:
model = Sequential()
model.add(Dense(64, input_shape=(784,),
                activation='relu',
                kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l1(0.01)))
model.add(Dense(10, activation='softmax'))

#### Regularization as a layer

The core layer `ActivityRegularization` is another way to apply regularization. As the name indicates, it applies only to the layer's output (not to the weight matrix or the bias vector).

In [None]:
model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(ActivityRegularization(l1=0.01, l2=0.1))
model.add(Dense(10, activation='softmax'))

## Dropout

Dropout layers are special layers, useful for regularization, that randomly drop (i.e. set to zero) units of the neural network during training. This prevents units from co-adapting too much to the input [3].

Keras provides a `Dropout` layer that can be added to a sequential model. It takes a value `rate` between 0 and 1 and, during training, sets that fraction of its input units to 0.

In [None]:
model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
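The effect of a dropout layer can be sketched in pure Python as follows. This is an illustration and not the Keras implementation; the helper name `dropout` and the sample activations are made up. Like Keras, the sketch scales the surviving units by `1 / (1 - rate)` during training so that the expected sum of the outputs is unchanged, and it does nothing outside of training.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate` (requires 0 <= rate < 1)."""
    if not training or rate == 0.0:
        # Outside of training the layer passes its input through unchanged.
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)    # fixed seed so the run is repeatable
activations = [1.0] * 10  # hypothetical outputs of a hidden layer
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
```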

## Compiling the model: loss functions and optimizers

Once the model's architecture is defined, i.e. all the layers are given with their respective activation functions and regularization parameters, the model needs to be compiled before it can be used. Remember that Keras is an abstraction layer over the backend (TensorFlow in this case), which is itself an abstraction.

When compiling a model there are two important parameters: the loss function and the optimizer algorithm.

### Loss function

Also known as the *objective function*, the loss function is the function we want to optimize when training the model (that is, we look for the minimum of the loss function). Depending on the task (whether it is classification or regression), and some other parameters, the objective function can change. Two of the most popular objective functions are the **mean squared error** for regression and **categorical crossentropy** for classification. Keras provides a number of loss functions [4], but in this course we will use only *categorical crossentropy*.
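As a rough illustration of what categorical crossentropy measures for a single sample, here is a pure-Python sketch. It assumes the target is a vector with a 1 at the true class and 0 elsewhere, and that the prediction is a vector of probabilities. The function and the sample values are illustrative, not the Keras implementation.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Crossentropy between a one-hot target and predicted probabilities."""
    # Clipping with `eps` avoids taking the logarithm of zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]        # the sample belongs to class 1
confident = [0.05, 0.90, 0.05]  # hypothetical good prediction
wrong = [0.70, 0.20, 0.10]      # hypothetical bad prediction

print(categorical_crossentropy(y_true, confident))  # small loss
print(categorical_crossentropy(y_true, wrong))      # larger loss
```

The loss depends only on the probability assigned to the true class, and it grows as that probability shrinks.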

#### Categorical format

When using a classification loss (e.g. the categorical crossentropy) with more than 2 classes, Keras requires the targets to be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all zeros except for a 1 at the index corresponding to the class of the sample). To convert integer targets into categorical targets, you can use the Keras utility `keras.utils.np_utils.to_categorical` (exposed as `keras.utils.to_categorical` in recent Keras versions).
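The categorical format itself is simple to reproduce by hand. The helper below, `to_one_hot`, is a hypothetical pure-Python stand-in written only to show the shape of the result; in practice the Keras utility mentioned above would be used instead.

```python
def to_one_hot(labels, num_classes):
    """Convert integer class labels into one-hot rows (illustration only)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # a single 1 at the index of the class
        rows.append(row)
    return rows

labels = [0, 2, 1]  # hypothetical integer targets for 3 samples
targets = to_one_hot(labels, num_classes=3)
for row in targets:
    print(row)
```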

### Optimizer

The optimizer is the algorithm used to search for a minimum of the loss function. As with loss functions, there are many optimizers already packaged with Keras. One of the most popular is **stochastic gradient descent** (SGD), which is also one of the simplest to understand. However, in this tutorial we will mostly use the **Adam** optimizer, which tends to work well in practice.
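The idea behind stochastic gradient descent can be sketched on a toy problem: fitting a single weight `w` so that `w * x` matches `y`. Everything in this example (the function `sgd_fit`, the data, the learning rate) is made up for illustration and is far simpler than the optimizers shipped with Keras.

```python
import random

def sgd_fit(samples, learning_rate=0.05, epochs=50, seed=0):
    """Fit y = w * x by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    data = list(samples)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order
        for x, y in data:
            # Gradient of (w * x - y) ** 2 with respect to w.
            grad = 2.0 * (w * x - y) * x
            # Step against the gradient, one sample at a time.
            w -= learning_rate * grad
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2 * x
w = sgd_fit(samples)
print(w)  # close to 2.0
```

Each update uses the gradient from a single sample, which is what makes the procedure "stochastic".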

### Compiling a model in Keras

In Keras, a model is compiled by calling its `.compile()` method. The two main parameters are `loss` and `optimizer`. Each can be given either as an object (e.g. the loss function `keras.losses.hinge` or an instance of the optimizer `keras.optimizers.RMSprop`) or as a string naming the loss function or optimizer.

The main difference between using an instance and a string is that in the latter case the loss function or optimizer will be used with the default parameters.
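A minimal sketch of both options follows. It assumes a Keras version in which the SGD optimizer accepts `learning_rate` (older releases call this argument `lr`). The helper `build_model` and the value `0.01` are illustrative choices, not recommendations.

```python
from keras import optimizers
from keras.layers import Dense
from keras.models import Sequential

def build_model():
    """Small network reused from the earlier examples (hypothetical helper)."""
    model = Sequential()
    model.add(Dense(64, input_shape=(784,), activation='relu'))
    model.add(Dense(10, activation='softmax'))
    return model

# Option 1: strings, so the loss and the optimizer use default parameters.
model_default = build_model()
model_default.compile(loss='categorical_crossentropy', optimizer='adam')

# Option 2: an optimizer instance, which lets us set its parameters.
# learning_rate=0.01 is an illustrative value only.
model_custom = build_model()
model_custom.compile(loss='categorical_crossentropy',
                     optimizer=optimizers.SGD(learning_rate=0.01))
```

Passing an instance is the way to go whenever the default parameters need to change; otherwise the string form is shorter.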

## References

- [1] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no. 7553 (2015): 436-444.
- [2] "Developing new regularizers". Keras Documentation. https://keras.io/regularizers/
- [3] Srivastava, Nitish, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." Journal of Machine Learning Research 15, no. 1 (2014): 1929-1958.
- [4] "Available loss functions". Keras documentation. https://keras.io/losses/