<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width=500 height=450>
<h3 style="text-align: center;"><b>Phystech School of Applied Mathematics and Informatics (PSAMI) MIPT</b></h3>

---

<h2 style="text-align: center;"><b>Homework: neuron with various activation functions</b></h2>

---

### First you need to complete `[seminar]perceptron.ipynb` and `[seminar]neuron.ipynb`!

**A frequently asked question is: which activation function should I choose?** In this notebook we suggest finding out the truth by comparing neurons with various activation functions (their quality on two datasets). Make sure all of the experiments are conducted under the same conditions (otherwise the experiment will not be fair).

In this task you will: 
- implement the **`Neuron()`** class with various activation functions
- train and validate your class on generated and real data (files with real data are in the '/data' folder) 

In this notebook you will implement a neuron with various activation functions: Sigmoid, ReLU, LeakyReLU and ELU.

In [None]:
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap  # colormap helper for coloring plots
import numpy as np
import pandas as pd

In [None]:
RANDOM_SEED = 42  # do not change, results of the test depend on this!
np.random.seed(RANDOM_SEED)

---

In this case we are facing a binary classification problem again. Let's use the same loss function, the **mean squared error**, but instead of the threshold activation we'll use the sigmoid:

$$MSE\_Loss(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n} (\hat{y_i} - y_i)^2 = \frac{1}{n}\sum_{i=1}^{n} (\sigma(w \cdot X_i) - y_i)^2$$ 
 

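Here $w \cdot X_i$ is a dot product, and $\sigma(w \cdot X_i) =\frac{1}{1+e^{-w \cdot X_i}}$ is the sigmoid ($i$ is the object's index in the dataset). Note that the `Loss()` function below multiplies the mean by an extra factor $\frac{1}{2}$, which matches the $\frac{1}{2n}$ in the formulas later in this notebook; the factor only rescales the gradient and does not move the minimum.  

**Note:** The bias $b$ (the free term) is assumed to be a part of the weights vector, namely $w_0$. So, if we add a column of ones to the left side of $X$, we will get $b$ as the free term of the dot product (work it out on a piece of paper, it is easy to see). But in our implementation of `Perceptron()` let's calculate $b$ separately (to make it clearer).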
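The note about the column of ones can be checked numerically. The sketch below is only an illustration: the values of `X`, `w` and `b` are made up, and the names `X_ext` and `w_ext` are introduced here and are not part of the assignment.

```python
import numpy as np

# Made-up data: 3 objects with 2 features each (illustrative values only)
X = np.array([[1.0, 3.0],
              [2.0, 4.0],
              [-1.0, -3.2]])
w = np.array([[1.0], [2.0]])  # weights, shape (2, 1)
b = 2.0                       # bias (free term)

# Variant 1: the bias is kept separately
z_separate = X @ w + b

# Variant 2: prepend a column of ones to X and put b in front of w as w_0
X_ext = np.hstack([np.ones((X.shape[0], 1)), X])  # shape (3, 3)
w_ext = np.vstack([np.array([[b]]), w])           # shape (3, 1)
z_joint = X_ext @ w_ext

print(np.allclose(z_separate, z_joint))  # True: both variants give the same numbers
```

Each row of `X_ext` starts with a 1, so the first term of every dot product is $1 \cdot w_0 = b$.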

In [None]:
def Loss(y_pred, y):
    y_pred = y_pred.reshape(-1, 1)
    y = np.array(y).reshape(-1, 1)
    return 0.5 * np.mean((y_pred - y) ** 2)

Below there are several activation functions, and for each of them you need to implement a `Neuron` class similar to the one from the seminars. The principle is the same, but the formula for updating the weights and the prediction function change.

**The rules are simple**: There are three activation functions. The first one comes with all the formulas, so you only need to code them. For the second one the derivative is written out, but it is not substituted into $Loss$; that is a task for you. The third one comes with the function formula only.

<h2 style="text-align: center;"><b>Neuron with ReLU (Rectified Linear Unit)</b></h2>  

ReLU is the most frequently used activation function in neural networks (at least it was a couple of years ago). It looks very simple:

\begin{equation*}
ReLU(x) =
 \begin{cases}
   0, &\text{$x \le 0$}\\
   x, &\text{$x \gt 0$}
 \end{cases}
\end{equation*}

Or:

$$
ReLU(x) = \max(0, x)
$$

It simply does not let negative numbers through: they are replaced with zero.

The derivative is taken as the derivative of a piecewise-defined function (at zero it is defined to be zero):

\begin{equation*}
ReLU'(x) = 
 \begin{cases}
   0, &\text{$x \le 0$}\\
   1, &\text{$x \gt 0$}
 \end{cases}
\end{equation*}

Graph of this function and graph of its derivative:

<img src="https://upload-images.jianshu.io/upload_images/1828517-0828da0d1164c024.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" width=800 height=400>

Substitute ReLU in Loss:

$$Loss(\hat{y}, y) = \frac{1}{2n}\sum_{i=1}^{n} (\hat{y_i} - y_i)^2 = \frac{1}{2n}\sum_{i=1}^{n} (ReLU(w \cdot X_i) - y_i)^2 =
\frac{1}{2n}\sum_{i=1}^{n}
 \begin{cases}
    y_i^2, &{w \cdot X_i \le 0}\\
   (w \cdot X_i - y_i)^2, &{w \cdot X_i \gt 0}
 \end{cases}
$$  

(Remember that $w \cdot X_i$ is a number in this case: the result of the dot product of two vectors.)

Then the gradient used to update the weights in gradient descent is as follows (we suggest you derive the matrix form yourself; it follows from the formula for a single object):

$$ \frac{\partial Loss}{\partial w} =
\frac{1}{n}\sum_{i=1}^{n}
 \begin{cases}
   0, &{w \cdot X_i \le 0}\\
   X_i^T (w \cdot X_i - y_i), &{w \cdot X_i \gt 0}
 \end{cases}
$$

(Remember that in matrix form $w \cdot X$ is the product of the matrix $X$ and the vector $w$; a vector is a matrix too, isn't it?)

Why is it 0 in the first case? Because $y_i^2$ does not contain the weights, and we differentiate with respect to the weights $w$.
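A gradient formula like this can be sanity-checked against finite differences before it goes into a class. The following is a minimal sketch under stated assumptions: the bias is omitted for brevity, the data are random and generated with a local generator (so the notebook's global seed is not touched), and the helper names `loss`, `analytic_grad` and `numeric_grad` are hypothetical, not part of the assignment.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def loss(w, X, y):
    """Loss = 1/(2n) * sum (ReLU(X w) - y)^2, bias omitted for brevity."""
    return 0.5 * np.mean((relu(X @ w) - y) ** 2)

def analytic_grad(w, X, y):
    """Objects with X_i w <= 0 contribute 0, the others X_i^T (X_i w - y_i) / n."""
    z = X @ w
    active = (z > 0).astype(float)   # 1 where ReLU passes the signal through
    return X.T @ ((z - y) * active) / len(y)

def numeric_grad(w, X, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)       # local generator, global seed untouched
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=(20, 1)).astype(float)
w = rng.normal(size=(3, 1))

max_diff = np.max(np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y)))
print(max_diff)  # should be tiny if the formula and the code agree
```

If the difference is not tiny, the usual suspects are a forgotten $\frac{1}{n}$ or a shape mismatch between `y` and the predictions.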

* Implement ReLU and its derivative:

In [None]:
def relu(x):
    """ReLU"""
    return <Your code here>

In [None]:
def relu_derivative(x):
    """Derivative of ReLU"""
    return <Your code here>

Now you need to code a neuron with ReLU. It is all similar to the Perceptron, but the weights are updated differently and the activation function is different:

In [None]:
class NeuronReLU:
    def __init__(self, w=None, b=0):
        """
        :param: w -- weights vector
        :param: b -- bias scalar
        """
        # Let the user set the weights and the bias directly
        self.w = w
        self.b = b
        
    def activate(self, x):
        # Your code here
        
    def forward_pass(self, X):
        """
        Computes the output of the neuron for a set of objects
        :param: X -- matrix of objects sized (n, m), every row is a separate object
        :return: vector sized (n, 1) containing the neuron outputs
        """
        # Your code here
        
        n = X.shape[0]
        y_pred = np.zeros((n, 1))  # y_pred == y_predicted - predicted values
        # Your code here
    
    def backward_pass(self, X, y, y_pred, learning_rate=0.005):
        """
        Updates the weights given a set of objects
        :param: X -- matrix of objects sized (n, m)
                y -- right answers vector sized (n, 1)
                y_pred -- neuron outputs sized (n, 1)
                learning_rate -- step size of gradient descent
        This method doesn't return anything, it only corrects the weights
        using gradient descent.
        """
        n = len(y)
        y = np.array(y).reshape(-1, 1)
        # Your code here
    
    def fit(self, X, y, num_epochs=300):
        """
        Descends towards a minimum of the loss
        :param: X -- matrix of objects sized (n, m)
                y -- right answers vector sized (n, 1)
                num_epochs -- number of training steps
        :return: Loss_values -- vector of loss values
        """
        self.w = np.zeros((X.shape[1], 1))  # column (m, 1)
        self.b = 0  # bias (number)
        Loss_values = []  # loss values on every step of fitting
        
        for i in range(num_epochs):
            # Your code here
        
        return Loss_values

<h3 style="text-align: center;"><b>Testing neuron with ReLU</b></h3>  

Here your task is to test your neuron **on the same dataset ("Apples and pears")** in the same way as it was done with the perceptron (you can freely copy your code, but be careful: some parts still need to be corrected).
As a result you need to display: 
* a graph showing how the loss function $Loss$ changes with the iteration number
* a graph with the dataset colored by the predictions of the ReLU neuron

***Note***: please check the `.shape` of matrices and vectors more often: `self.w`, `X` and `y` inside the class. A mistake is often fixed by a transposition or by the `.reshape()` method. Don't forget to check which vector (of what size) you want to get as an output; this helps a lot to avoid confusion.
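As an illustration of why the shapes matter, here is a small sketch with made-up arrays (the names `y_column` and `y_flat` are hypothetical). It shows how a flat vector of answers silently turns an elementwise difference into a matrix through broadcasting.

```python
import numpy as np

n, m = 4, 2
X = np.arange(n * m, dtype=float).reshape(n, m)   # (n, m) objects-features
w = np.ones((m, 1))                               # (m, 1) column of weights
y_column = np.array([0, 1, 0, 1]).reshape(-1, 1)  # (n, 1)
y_flat = np.array([0, 1, 0, 1])                   # (n,), a common source of bugs

z = X @ w
print(z.shape)                       # (4, 1)
print((z - y_column).shape)          # (4, 1): elementwise difference, as intended
print((z - y_flat).shape)            # (4, 4): broadcasting silently builds a matrix
print((X.T @ (z - y_column)).shape)  # (2, 1): same shape as w, ready for an update
```

This is why the class templates call `.reshape(-1, 1)` on `y` before using it.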

**(for the test) Check forward_pass()**

In [None]:
w = np.array([1., 2.]).reshape(2, 1)
b = 2.
X = np.array([[1., 3.],
              [2., 4.],
              [-1., -3.2]])

neuron = NeuronReLU(w, b)
y_pred = neuron.forward_pass(X)
print ("y_pred = " + str(y_pred))

**(for the test) Check backward_pass()**

In [None]:
y = np.array([1, 0, 1]).reshape(3, 1)

In [None]:
neuron.backward_pass(X, y, y_pred)

print ("w = " + str(neuron.w))
print ("b = " + str(neuron.b))

"Apples and pears":

In [None]:
data = pd.read_csv("./data/apples_pears.csv")
plt.figure(figsize=(10, 8))
plt.scatter(data.iloc[:, 0], data.iloc[:, 1], c=data['target'], cmap='rainbow')
plt.title('Apples and pears', fontsize=15)
plt.xlabel('symmetry', fontsize=14)
plt.ylabel('yellowness', fontsize=14)
plt.show();

In [None]:
X = data.iloc[:,:2].values  # matrix objects-features
y = data['target'].values.reshape((-1, 1))  # classes (column of zeros and ones)

Display the loss during the training of a neuron with ReLU on this dataset:

In [None]:
%%time

neuron = <Your code here>
Loss_values = <Your code here>

plt.figure(figsize=(10, 8))
plt.plot(Loss_values)
plt.title('Loss function', fontsize=15)
plt.xlabel('iteration number', fontsize=14)
plt.ylabel(r'$Loss(\hat{y}, y)$', fontsize=14)  # raw string: "\h" is not a valid escape
plt.show()

Your loss is probably a straight line now, and you can see that the weights are not updated. But why?!

The answer is simple: if you look closely, you can see that `self.w` and `self.b` are initialized with zeros at the beginning of the `.fit()` method. If you write down on paper how the update proceeds, you will see that with ReLU the weights simply do not change when they are initialized with zeros: then $w \cdot X_i = 0$ for every object, so every object falls into the first case of the gradient formula and contributes 0.

This is one of the reasons why weights in neural networks are initialized with random numbers (for example, uniformly from [0, 1)).
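The effect of zero initialization can be reproduced on a toy example. The sketch below uses made-up data and a hypothetical helper `relu_grad_w`; it is an illustration of the argument, not the class you are asked to implement.

```python
import numpy as np

def relu_grad_w(w, b, X, y):
    """Gradient of 1/(2n) * sum (ReLU(X w + b) - y)^2 with respect to w."""
    z = X @ w + b
    active = (z > 0).astype(float)   # ReLU'(z): 0 for z <= 0, 1 for z > 0
    return X.T @ ((np.maximum(0.0, z) - y) * active) / len(y)

# Made-up toy data (illustrative values only)
X = np.array([[0.5, 1.0],
              [1.5, 0.2],
              [0.3, 0.8]])
y = np.array([[1.0], [0.0], [1.0]])

w_zero = np.zeros((2, 1))
grad_zero = relu_grad_w(w_zero, 0.0, X, y)
print(grad_zero)   # all zeros: z = 0 everywhere, so ReLU'(z) = 0

rng = np.random.default_rng(0)       # local generator, global seed untouched
w_rand = rng.random((2, 1))          # uniform on [0, 1)
grad_rand = relu_grad_w(w_rand, 0.0, X, y)
print(grad_rand)   # generally nonzero, so the weights can move
```

With zero weights every gradient step subtracts zero, so the loss curve stays flat no matter how many epochs are run.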

Let's train the neuron again, but with the weights initialized randomly beforehand (use 10000 iterations). 

**!!! Comment out the initialization with zeros in the `.fit()` method of the `NeuronReLU` class !!!**

In [None]:
%%time

neuron = NeuronReLU(w=np.random.rand(X.shape[1], 1), b=np.random.rand(1))
Loss_values = neuron.fit(X, y, num_epochs=10000)

plt.figure(figsize=(10, 8))
plt.plot(Loss_values)
plt.title('Loss function', fontsize=15)
plt.xlabel('iteration number', fontsize=14)
plt.ylabel(r'$Loss(\hat{y}, y)$', fontsize=14)  # raw string: "\h" is not a valid escape
plt.show()

**(for the test) Check loss:**

Display the sum of the first five and the last five loss values obtained during training with num_epochs=10000, rounded to the 4th decimal place:

IMPORTANT! If you have run the previous code cell several times, then before submitting the result to the test in Canvas please restart the Runtime and run the notebook from scratch, since `random` introduces randomness. Several answers are accepted in case you have run the cells a few times, but it is better not to submit a result obtained after 10 runs of this and the following cells.
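Why a restart matters can be seen in a small sketch. Here separate `np.random.RandomState` objects stand in for the global generator, so that the example does not disturb the notebook's own seed; the variable names are illustrative.

```python
import numpy as np

# Two generators started from the same seed produce the same numbers ...
first_run = np.random.RandomState(42).rand(2, 1)
fresh_run = np.random.RandomState(42).rand(2, 1)
print(np.array_equal(first_run, fresh_run))   # True

# ... but a second draw from an already used generator differs from the first
gen = np.random.RandomState(42)
draw_1 = gen.rand(2, 1)
draw_2 = gen.rand(2, 1)
print(np.array_equal(draw_1, draw_2))         # False
```

Re-running a training cell without restarting is like `draw_2`: the initial weights come from a later position in the random stream, so the loss values change.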

In [None]:
<your code here>

Let's see how this neuron predicts

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(data.iloc[:, 0], data.iloc[:, 1], c=np.array(neuron.forward_pass(X) > 0.5).ravel(), cmap='spring')
plt.title('Apples and pears', fontsize=15)
plt.xlabel('symmetry', fontsize=14)
plt.ylabel('yellowness', fontsize=14)
plt.show();

It should separate the classes reasonably well. But why would we use ReLU, which is used more often, if it predicts worse (and takes quite a long time to converge) than a perceptron with threshold activation, which nobody uses? Generally speaking, no one knows in advance when, where and which activation function will 'fire'. It also depends on the data itself.

<img src="https://alumni.lscollege.ac.uk/files/2015/12/Interview-questions-square-image.jpg" width=400 height=300>

But there is actually a trend: threshold and sigmoid (sigmoid in most cases) are used in the **output layers** of neural networks in classification tasks, where they predict the probability that an object belongs to a certain class, while more "advanced" activation functions (ReLU and those described below) are used inside a neural network, i.e. in the **hidden layers**.

However, nothing prevents you from using ReLU in the output layers and sigmoid inside. Deep Learning is a "quite experimental" field: you could make a discovery with your own hands just by changing something small, e.g. the activation function.

**Advantages of ReLU:**

* differentiable (once the derivative at zero is defined)
* no vanishing gradient problem, unlike the sigmoid

**Possible disadvantages of ReLU:**

* not centered around 0 (can slow down convergence)
* zeroes out all negative inputs, so the weights of zeroed-out neurons may often *not update*; this issue is called *dead neurons*

The last issue can be dealt with:

<h2 style="text-align: center;"><b>Neuron with LeakyReLU (Leaky Rectified Linear Unit)</b></h2>  

LeakyReLU differs only slightly from ReLU, but it can help a network train faster, since there is no "dead neurons" problem:

\begin{equation*}
LeakyReLU(x) =
 \begin{cases}
   \alpha x, &\text{$x \le 0$}\\
   x, &\text{$x \gt 0$}
 \end{cases}
\end{equation*}

where $\alpha$ is a small number between 0 and 1.

The derivative is taken in the same way, but at 0 its value is $\alpha$:

\begin{equation*}
LeakyReLU'(x) = 
 \begin{cases}
   \alpha, &\text{$x \le 0$}\\
   1, &\text{$x \gt 0$}
 \end{cases}
\end{equation*}

Plot of the function:

<img src="https://cdn-images-1.medium.com/max/1600/0*UtLlZJ80TMIM7kXk." width=400 height=300>

Substitute LeakyReLU into Loss:

$$
Loss(\hat{y}, y) = \frac{1}{2n}\sum_{i=1}^{n} (\hat{y_i} - y_i)^2 = \frac{1}{2n}\sum_{i=1}^{n} (LeakyReLU(w \cdot X_i) - y_i)^2 =
\frac{1}{2n}\sum_{i=1}^{n} 
 \begin{cases}
   (\alpha \cdot w \cdot X_i - y_i)^2, &{w \cdot X_i \le 0}\\
   (w \cdot X_i - y_i)^2, &{w \cdot X_i \gt 0}
 \end{cases}
$$  

The gradient used to update the weights in gradient descent:

$$ \frac{\partial Loss}{\partial w} =
\frac{1}{n}\sum_{i=1}^{n} 
 \begin{cases}
   \alpha X_i^T (\alpha \cdot w \cdot X_i - y_i), &{w \cdot X_i \le 0}\\
    X_i^T (w \cdot X_i - y_i), &{w \cdot X_i \gt 0}
 \end{cases}
$$
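To see what the negative branch buys us, consider a single object for which $w \cdot X_i \le 0$. By the chain rule its contribution to the gradient is $(f(z) - y) \, f'(z) \, X_i^T$, where $z = w \cdot X_i$ and $f$ is the activation. The sketch below evaluates this by hand for LeakyReLU and for ReLU; all numbers are made up and the variable names are illustrative.

```python
import numpy as np

alpha = 0.01
x = np.array([[1.0, -2.0]])      # one object, shape (1, 2); made-up values
w = np.array([[0.5], [1.0]])     # shape (2, 1)
y = 1.0

z = (x @ w).item()               # 0.5 - 2.0 = -1.5, the "negative" branch

# Chain rule for one object: (f(z) - y) * f'(z) * x^T
leaky_out, leaky_slope = alpha * z, alpha   # LeakyReLU and its derivative for z <= 0
relu_out, relu_slope = 0.0, 0.0             # ReLU and its derivative for z <= 0

grad_leaky = (leaky_out - y) * leaky_slope * x.T
grad_relu = (relu_out - y) * relu_slope * x.T

print(grad_leaky.ravel())   # small but nonzero: the weights still receive a signal
print(grad_relu.ravel())    # zero: this object cannot move the weights
```

The LeakyReLU gradient is scaled down by $\alpha$, but it never vanishes completely, which is what keeps a neuron from "dying".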

* Implement LeakyReLU and its derivative:

In [None]:
def leaky_relu(x, alpha=0.01):
    """LeakyReLU"""
    <your code here>

In [None]:
def leaky_relu_derivative(x, alpha=0.01):
    """Derivative of LeakyReLU"""
    <your code here>

Now you need to code a neuron with LeakyReLU. It is all similar to the Perceptron, but the weights are updated differently and the activation function is different:

In [None]:
class NeuronLeakyReLU:
    def __init__(self, w=None, b=0):
        """
        :param: w -- weights vector
        :param: b -- bias scalar
        """
        # Let the user set the weights and the bias directly
        self.w = w
        self.b = b
        
    def activate(self, x):
        # Your code here
        
    def forward_pass(self, X):
        """
        Computes the output of the neuron for a set of objects
        :param: X -- matrix of objects sized (n, m), every row is a separate object
        :return: vector sized (n, 1) containing the neuron outputs
        """
        # Your code here
        
        n = X.shape[0]
        y_pred = np.zeros((n, 1))  # y_pred == y_predicted - predicted values
        # Your code here
    
    def backward_pass(self, X, y, y_pred, learning_rate=0.005):
        """
        Updates the weights given a set of objects
        :param: X -- matrix of objects sized (n, m)
                y -- right answers vector sized (n, 1)
                y_pred -- neuron outputs sized (n, 1)
                learning_rate -- step size of gradient descent
                                 (not the LeakyReLU slope alpha)
        This method doesn't return anything, it only corrects the weights
        using gradient descent.
        """
        n = len(y)
        y = np.array(y).reshape(-1, 1)
        # Your code here
    
    def fit(self, X, y, num_epochs=300):
        """
        Descends towards a minimum of the loss
        :param: X -- matrix of objects sized (n, m)
                y -- right answers vector sized (n, 1)
                num_epochs -- number of training steps
        :return: Loss_values -- vector of loss values
        """
        #  self.w = np.zeros((X.shape[1], 1))  # column (m, 1)
        #  self.b = 0  # bias (number)
        Loss_values = []  # loss values on every step of fitting
        
        for i in range(num_epochs):
            # Your code here
        
        return Loss_values

<h3 style="text-align: center;"><b>Testing neuron with LeakyReLU</b></h3>  

***Note***: please check the `.shape` of matrices and vectors more often: `self.w`, `X` and `y` inside the class. A mistake is often fixed by a transposition or by the `.reshape()` method. Don't forget to check which vector (of what size) you want to get as an output; this helps a lot to avoid confusion.

**Everywhere in the tests below, keep $\alpha$=0.01 in `leaky_relu()` and in `leaky_relu_derivative()`**

"Apples and pears":

In [None]:
data = pd.read_csv("./data/apples_pears.csv")
plt.figure(figsize=(10, 8))
plt.scatter(data.iloc[:, 0], data.iloc[:, 1], c=data['target'], cmap='rainbow')
plt.title('Apples and pears', fontsize=15)
plt.xlabel('symmetry', fontsize=14)
plt.ylabel('yellowness', fontsize=14)
plt.show();

In [None]:
X = data.iloc[:,:2].values  # matrix objects-features
y = data['target'].values.reshape((-1, 1))  # classes (column of zeros and ones)

Let's train the neuron with randomly initialized weights (use 10000 iterations).

In [None]:
%%time

neuron = NeuronLeakyReLU(w=np.random.rand(X.shape[1], 1), b=np.random.rand(1))
Loss_values = neuron.fit(X, y, num_epochs=10000)

plt.figure(figsize=(10, 8))
plt.plot(Loss_values)
plt.title('Loss function', fontsize=15)
plt.xlabel('iteration number', fontsize=14)
plt.ylabel(r'$Loss(\hat{y}, y)$', fontsize=14)  # raw string: "\h" is not a valid escape
plt.show()

**(for the test) Check loss:**

Display the sum of the first five and the last five loss values obtained during training with num_epochs=10000, rounded to the 4th decimal place:

IMPORTANT! If you have run the previous code cell several times, then before submitting the result to the test in Canvas please restart the Runtime and run the notebook from scratch, since `random` introduces randomness. Several answers are accepted in case you have run the cells a few times, but it is better not to submit a result obtained after 10 runs of this and the following cells.

In [None]:
<your code here>

Let's see how this neuron predicts:

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(data.iloc[:, 0], data.iloc[:, 1], c=np.array(neuron.forward_pass(X) > 0.5).ravel(), cmap='spring')
plt.title('Apples and pears', fontsize=15)
plt.xlabel('symmetry', fontsize=14)
plt.ylabel('yellowness', fontsize=14)
plt.show();

**Advantages of LeakyReLU:**

* differentiable (once the derivative at zero is defined)
* no vanishing gradient problem, unlike the sigmoid
* no "dead neurons" problem, unlike ReLU

**Possible disadvantages of LeakyReLU:**

* not centered around 0 (can slow down convergence)
* somewhat unstable to "noise" (see the Stanford lecture)

<h2 style="text-align: center;"><b>Neuron with ELU (Exponential Linear Unit)</b></h2>  
<h2 style="text-align: center;"><b>(optional part, will not be checked)</b></h2>

ELU is an activation function proposed not so long ago (in 2015) which, as the authors of the paper say, is better than LeakyReLU. Here is the formula for ELU:

\begin{equation*}
ELU(\alpha, x) =
 \begin{cases}
   \alpha (e^x - 1), &\text{$x \le 0$}\\
   x, &\text{$x \gt 0$}
 \end{cases}
\end{equation*}

where $\alpha$ is a small number between 0 and 1.

The derivative is taken in the same way, but at 0 its value is $\alpha$:

\begin{equation*}
ELU'(x) = 
 \begin{cases}
   ELU(\alpha, x) + \alpha, &\text{$x \le 0$}\\
   1, &\text{$x \gt 0$}
 \end{cases}
\end{equation*}

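A simple trick is used in the derivative: for $x \le 0$ we have $\alpha e^x = \alpha (e^x - 1) + \alpha = ELU(\alpha, x) + \alpha$, i.e. $- \alpha + \alpha$ is added to make the computation easier.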
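The identity behind this trick can be checked numerically. The sketch below is one possible way to write ELU and its derivative with NumPy (an assumption, not the reference solution) and compares the derivative with central finite differences away from the kink at 0.

```python
import numpy as np

def elu(x, alpha=0.01):
    # np.minimum keeps exp() away from large positive arguments (no overflow warnings)
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_derivative(x, alpha=0.01):
    # For x <= 0: alpha * e^x = ELU(alpha, x) + alpha
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-3.0, -1.0, -0.1, 0.0, 0.5, 2.0])
eps = 1e-6
numeric = (elu(x + eps) - elu(x - eps)) / (2 * eps)   # central finite differences

# Away from the kink at 0 the two derivatives should agree closely
mask = x != 0
max_diff = np.max(np.abs(elu_derivative(x)[mask] - numeric[mask]))
print(max_diff)
print(elu_derivative(np.array([0.0])))   # the value at zero equals alpha
```

Reusing `elu(x)` inside the derivative means the exponent is the only expensive operation, and in a real forward/backward pass its value could be cached.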

Plot of this function:

<img src="http://p0.ifengimg.com/pmop/2017/0907/A004001DD141881BFD8AD62E5D31028C3BE3FAD1_size14_w446_h354.png" width=500 height=400>

Substitute ELU into Loss:

$$Loss(\hat{y}, y) = \frac{1}{2n}\sum_{i=1}^{n} (\hat{y_i} - y_i)^2 = \frac{1}{2n}\sum_{i=1}^{n} (ELU(\alpha, w \cdot X_i) - y_i)^2 =
\frac{1}{2n}\sum_{i=1}^{n} 
 \begin{cases}
   (\alpha (e^{w \cdot X_i} - 1) - y_i)^2, &{w \cdot X_i \le 0}\\
   (w \cdot X_i - y_i)^2, &{w \cdot X_i \gt 0}
 \end{cases}
$$  

The gradient for updating the weights in gradient descent: here you need to derive it yourself, and it is a little harder than before. Taking the derivative head-on is inconvenient; you need the **chain rule**, i.e. the **rule for differentiating a composition of functions**:

$$ \frac{\partial Loss}{\partial w} =
\frac{1}{n}\sum_{i=1}^{n} 
 \begin{cases}
   , &{w \cdot X_i \le 0}\\
   , &{w \cdot X_i \gt 0}
 \end{cases}
$$

* Implement ELU and its derivative:

In [None]:
def elu(x, alpha=0.01):
    """ELU"""
    <your code here>

In [None]:
def elu_derivative(x, alpha=0.01):
    """Derivative of ELU"""
    <your code here>

Now you need to code a neuron with the ELU activation function:

In [None]:
class NeuronELU:
    def __init__(self, w=None, b=0):
        """
        :param: w -- weights vector
        :param: b -- bias scalar
        """
        # Let the user set the weights and the bias directly
        self.w = w
        self.b = b
        
    def activate(self, x):
        # Your code here
        
    def forward_pass(self, X):
        """
        Computes the output of the neuron for a set of objects
        :param: X -- matrix of objects sized (n, m), every row is a separate object
        :return: vector sized (n, 1) containing the neuron outputs
        """
        # Your code here
        
        n = X.shape[0]
        y_pred = np.zeros((n, 1))  # y_pred == y_predicted - predicted values
        # Your code here
    
    def backward_pass(self, X, y, y_pred, learning_rate=0.005):
        """
        Updates the weights given a set of objects
        :param: X -- matrix of objects sized (n, m)
                y -- right answers vector sized (n, 1)
                y_pred -- neuron outputs sized (n, 1)
                learning_rate -- step size of gradient descent
                                 (not the ELU parameter alpha)
        This method doesn't return anything, it only corrects the weights
        using gradient descent.
        """
        n = len(y)
        y = np.array(y).reshape(-1, 1)
        # Your code here
    
    def fit(self, X, y, num_epochs=300):
        """
        Descends towards a minimum of the loss
        :param: X -- matrix of objects sized (n, m)
                y -- right answers vector sized (n, 1)
                num_epochs -- number of training steps
        :return: Loss_values -- vector of loss values
        """
        # self.w = np.zeros((X.shape[1], 1))  # column (m, 1)
        # self.b = 0  # bias (number)
        Loss_values = []  # loss values on every step of fitting
        
        for i in range(num_epochs):
            # Your code here
        
        return Loss_values

***Note***: please check the `.shape` of matrices and vectors more often: `self.w`, `X` and `y` inside the class. A mistake is often fixed by a transposition or by the `.reshape()` method. Don't forget to check which vector (of what size) you want to get as an output; this helps a lot to avoid confusion.

"Apples and pears":

In [None]:
data = pd.read_csv("./data/apples_pears.csv")
plt.figure(figsize=(10, 8))
plt.scatter(data.iloc[:, 0], data.iloc[:, 1], c=data['target'], cmap='rainbow')
plt.title('Apples and pears', fontsize=15)
plt.xlabel('symmetry', fontsize=14)
plt.ylabel('yellowness', fontsize=14)
plt.show();

In [None]:
X = data.iloc[:,:2].values  # matrix objects-features
y = data['target'].values.reshape((-1, 1))  # classes (column of zeros and ones)

Let's train the neuron with randomly initialized weights (use 10000 iterations).

In [None]:
%%time

neuron = NeuronELU(w=np.random.rand(X.shape[1], 1), b=np.random.rand(1))
Loss_values = neuron.fit(X, y, num_epochs=10000)

plt.figure(figsize=(10, 8))
plt.plot(Loss_values)
plt.title('Loss function', fontsize=15)
plt.xlabel('iteration number', fontsize=14)
plt.ylabel(r'$Loss(\hat{y}, y)$', fontsize=14)  # raw string: "\h" is not a valid escape
plt.show()

**(for the test) Check loss:**

Display the sum of the first five and the last five loss values obtained during training with num_epochs=10000, rounded to the 4th decimal place:

In [None]:
<your code here>

Let's see how this neuron predicts

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(data.iloc[:, 0], data.iloc[:, 1], c=np.array(neuron.forward_pass(X) > 0.5).ravel(), cmap='spring')
plt.title('Apples and pears', fontsize=15)
plt.xlabel('symmetry', fontsize=14)
plt.ylabel('yellowness', fontsize=14)
plt.show();

**Advantages of ELU:**

* differentiable (once the derivative at zero is defined)
* no vanishing gradient problem, unlike the sigmoid
* no "dead neurons" problem, unlike ReLU
* more robust to "noise" (see the Stanford lectures)

**Possible disadvantages of ELU:**

* not centered around 0 (can slow down convergence)
* computationally more expensive than ReLU and LeakyReLU

---

And finally -- all pokemons (almost):

<img src="http://cdn-images-1.medium.com/max/1600/1*DRKBmIlr7JowhSbqL6wngg.png">

It lacks `SeLU()` and `Swish()`. You can learn more about them: [SeLU](https://arxiv.org/pdf/1706.02515.pdf), [Swish](https://arxiv.org/pdf/1710.05941.pdf).

`Tanh()` (hyperbolic tangent) is rarely used, and we decided not to consider `Maxout()` (again, we observed that it is not commonly used, although there are good opinions about it).

---

Do you think these are all activation functions? No, after all you can use any function you think will help in learning. More activation functions [on Wikipedia](https://en.wikipedia.org/wiki/Activation_function).

<h3 style="text-align: center;"><b>Useful links</b></h3>

0). You must check this article by Stanford: http://cs231n.github.io/neural-networks-1/

1). Great article on activation functions: https://www.jeremyjordan.me/neural-networks-activation-functions/

2). [Video by Siraj Raval](https://www.youtube.com/watch?v=-7scQpJT7uo)

3). Modern paper on activation functions. One of the hyped functions is $swish(x) = x\sigma (\beta x)$: https://arxiv.org/pdf/1710.05941.pdf (by the way, *neural architecture search* was used to find this function)

4). **SeLU** has some interesting properties, proven with probability theory: https://arxiv.org/pdf/1706.02515.pdf (yes, this paper consists of 102 pages)

5). [List of activation functions on Wikipedia](https://en.wikipedia.org/wiki/Activation_function)