
Activation functions play a key role in deep learning. They introduce the nonlinearity that enables a neural network to fit arbitrarily complicated functions.

Without activation functions, a neural network is still just a linear transformation, no matter how complicated its structure is, and it cannot fit nonlinear functions.

At present the most popular activation function is `relu`, but newer functions such as `swish` and `GELU` claim better performance than `relu`.
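The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch in plain Python; the weight matrices `W1`, `W2` and the input `x` are arbitrary values chosen for illustration, and biases are omitted for brevity (with biases the same argument holds for affine maps).

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def relu(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0.0, t) for t in v]

# Illustrative weights and input (not taken from any trained model).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Two linear layers without activation ...
two_layers = matvec(W2, matvec(W1, x))
# ... equal one linear layer whose weight matrix is W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)
# Inserting ReLU between the layers breaks this equivalence.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers, one_layer, with_relu)
```

Here the first two results coincide, while the ReLU version differs because the negative intermediate value is clipped to zero before the second layer.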

Here are two review articles on activation functions (in Chinese).

1. https://zhuanlan.zhihu.com/p/98472075
2. https://zhuanlan.zhihu.com/p/98863801

**1. The most popular activation functions**

`tf.nn.sigmoid`: Compresses real numbers into the range (0, 1); usually used in the output layer for binary classification. Its main drawbacks are vanishing gradients, high computational cost, and outputs that are not zero-centered.

`tf.nn.softmax`: An extension of sigmoid to multiple classes; usually used in the output layer for multi-class classification.

`tf.nn.tanh`: Compresses real numbers into the range (-1, 1); the output is zero-centered. Its main drawbacks are vanishing gradients and high computational cost.

`tf.nn.relu`: Rectified linear unit, the most popular activation function; usually used in hidden layers. Its main drawbacks are outputs that are not zero-centered and a zero gradient for inputs < 0 (the dying ReLU problem).

`tf.nn.leaky_relu`: An improved ReLU that addresses the dying ReLU problem by keeping a small nonzero slope for negative inputs.

`tf.nn.elu`: Exponential linear unit, an improvement on ReLU that alleviates the dying ReLU problem.

`tf.nn.selu`: Scaled exponential linear unit, which makes the network self-normalizing when the weights are initialized with `tf.keras.initializers.lecun_normal`. It mitigates exploding and vanishing gradients, but it needs to be used together with AlphaDropout (a variant of Dropout).

`tf.nn.swish`: Self-gated activation function, a research result from Google. The literature reports a slight improvement compared with ReLU.

`gelu`: Gaussian error linear unit, which performs very well in Transformer models. Older TensorFlow releases did not implement it in `tf.nn`; it is available as `tf.nn.gelu` since TensorFlow 2.4.
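For reference, the functions listed above can be written out for scalar inputs in plain Python. This is a sketch of the standard textbook formulas, not TensorFlow's actual implementation; the `leaky_relu` slope of 0.2 mirrors the default of `tf.nn.leaky_relu`, the SELU constants are the ones from the self-normalizing networks paper, and `gelu` uses the exact erf form rather than the tanh approximation.

```python
import math

def sigmoid(x):
    # Numerically stable form: avoids overflow in exp() for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.2):
    # Small slope alpha for negative inputs keeps the gradient nonzero.
    return x if x > 0 else alpha * x

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

def swish(x):
    # Self-gating: the input is multiplied by its own sigmoid.
    return x * sigmoid(x)

def gelu(x):
    # Exact form: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(xs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("elu", elu), ("selu", selu),
                 ("swish", swish), ("gelu", gelu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
print("softmax", [round(p, 4) for p in softmax([1.0, 2.0, 3.0])])
```

Printing the values at a negative, zero, and positive input makes the differences visible: `relu` returns zero for the negative input, whereas `leaky_relu`, `elu`, `selu`, `swish`, and `gelu` all return a nonzero value there.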

**2. Using activation functions in models**

There are two ways to use activation functions in Keras models: specifying them through the `activation` parameter of certain layers, or adding an explicit activation layer with `layers.Activation`.

In [1]:
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=(None, 16), activation=tf.nn.relu))  # specified through the activation parameter
model.add(tf.keras.layers.Dense(10))
model.add(tf.keras.layers.Activation(tf.nn.softmax))  # specified explicitly as a separate layer
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, None, 32)          544       
_________________________________________________________________
dense_1 (Dense)              (None, None, 10)          330       
_________________________________________________________________
activation (Activation)      (None, None, 10)          0         
Total params: 874
Trainable params: 874
Non-trainable params: 0
_________________________________________________________________
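To check that the two approaches are interchangeable, the following sketch compares a `Dense` layer with a built-in activation against a linear `Dense` layer followed by an explicit `Activation` layer that shares the same weights. The layer width of 8 and the random input batch are assumptions made for illustration only.

```python
import numpy as np
import tensorflow as tf

# Hypothetical random batch: 4 samples with 16 features each.
x = tf.constant(np.random.rand(4, 16).astype("float32"))

# Way 1: activation passed through the `activation` parameter.
fused = tf.keras.layers.Dense(8, activation=tf.nn.relu)
y_fused = fused(x)  # calling the layer also creates its weights

# Way 2: a linear Dense layer followed by an explicit Activation layer.
linear = tf.keras.layers.Dense(8)
linear.build((None, 16))                 # create weights for 16 input features
linear.set_weights(fused.get_weights())  # reuse the same kernel and bias
y_explicit = tf.keras.layers.Activation(tf.nn.relu)(linear(x))

# With shared weights the two outputs should agree up to rounding.
max_diff = float(tf.reduce_max(tf.abs(y_fused - y_explicit)))
print("max abs difference:", max_diff)
```

The explicit form is convenient when another layer, such as batch normalization, has to sit between the linear transformation and the activation.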
