# ___Activation Function___

## ___Universal Approximation Theorem___

___Universal approximation theorem___ _says:_

_that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of Rⁿ, under mild assumptions on the activation function (Wikipedia)._

_In simple terms, the UAT says that_ ___you can always come up with a neural network that approximates any continuous relation between input and output as closely as you like, provided it has enough hidden neurons.___

### ___Proof of Universal Approximation Theorem___

_For this illustrative proof, we will consider a simple example with a one-dimensional input x and output y. The graph below shows the true relationship between the input and the output. Let’s assume that I don’t know the equation relating x and y and can’t come up with a representation for it._

![image.png](attachment:image.png)

_To solve this problem, I will break the function into multiple smaller parts so that each part is represented by a simpler function. By combining a series of these simpler functions (rectangular bars/towers), I can approximate the relation between x and y as closely as needed._

![image.png](attachment:image.png)

_The key point is that I don’t have to come up with a complex equation to represent the relationship between input and output. I can pick one simple function and combine many copies of it to approximate the true relationship. The more functions I use, the better the approximation._

![image.png](attachment:image.png)

___How do we come up with these rectangles/towers, and how does this tie back to the sigmoid neuron?___

_Let’s take two sigmoid functions with a very steep slope that rise at different positions: the left sigmoid steps from 0 to 1 just before zero and the right sigmoid steps up just after zero. If I subtract the right sigmoid from the left one, the net effect is a tower (rectangular output) between the two step positions._

![image.png](attachment:image.png)

_If we can get the series of these towers, then we can approximate any true function between input and output._

___Can we come up with a neural network that represents this operation of subtracting one sigmoid function from another?___

![image.png](attachment:image.png)

_If an input x is passed through two sigmoid neurons and their outputs are combined in another neuron with weights +1 and -1 (which is the same as subtracting the two outputs), then we get our tower. Our building block is therefore a connection of three neurons. If we construct many such building blocks and add them all up, we can approximate any continuous relationship between input and output. This is called the __representational power of deep neural networks__: the ability to approximate any kind of relationship between input and output._
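_The tower construction can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the text above: the names `tower` and `approximate`, the steepness value, the number of towers and the sine target are all hypothetical choices._

```python
import math

def sigmoid(z):
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def tower(x, left, right, height=1.0, steepness=5000.0):
    """Difference of two steep sigmoids: about `height` on (left, right), about 0 elsewhere."""
    return height * (sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right)))

def approximate(f, x, lo=0.0, hi=1.0, n_towers=50):
    """Sum of towers on [lo, hi]; each tower is as tall as f at the centre of its bin."""
    width = (hi - lo) / n_towers
    total = 0.0
    for k in range(n_towers):
        left = lo + k * width
        total += tower(x, left, left + width, height=f(left + width / 2))
    return total

def target(x):
    """Hypothetical 'true' relationship used only for this demonstration."""
    return math.sin(2 * math.pi * x)

for x in (0.105, 0.333, 0.777):
    print(x, round(target(x), 3), round(approximate(target, x), 3))
```

_Increasing `n_towers` narrows each tower and brings the sum closer to the target, which is the "more functions, better approximation" argument made above._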

## ___What is Activation Function?___

_Activation functions are the functions that determine the output of each neuron, and therefore of the neural network. An activation function is attached to each neuron in the network and determines whether the neuron should be activated or not, based on whether the neuron’s input is relevant for the model’s prediction._

![image.png](attachment:image.png)

_In simple terms, a neuron calculates the weighted sum of its inputs (each input xi multiplied by its weight Wi), adds a bias, and then decides whether it should be “fired” or not._

_All the inputs Xi are multiplied by the weights Wi assigned to their links and summed together, and the bias b is added._
_Note: X and W are vectors (with components Xi and Wi) and b is a scalar._

_Let Y = Σᵢ (Wi · Xi) + b, where the bias b is added once, outside the sum._

_The value of Y can be anything from -inf to +inf, so the neuron needs a way to distinguish “useful” from “not-so-useful” information. To build this sense into the network we add an activation function f, which maps Y to the neuron’s output and so decides whether, and how strongly, the neuron fires._

_The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold._
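_As a minimal sketch of this "gate", the snippet below computes the weighted sum plus bias and passes it through an activation function. The function names and the example numbers are assumptions made for illustration._

```python
import math

def step(y, threshold=0.0):
    """Binary gate: fire (1.0) when the weighted sum reaches the threshold."""
    return 1.0 if y >= threshold else 0.0

def sigmoid(y):
    """Smooth gate with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def neuron(inputs, weights, bias, activation):
    """Y = sum(Wi * Xi) + b, followed by the activation function."""
    y = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(y)

inputs, weights, bias = [1.0, 2.0], [0.5, -0.25], 0.1   # hypothetical values
print(neuron(inputs, weights, bias, step))
print(neuron(inputs, weights, bias, sigmoid))
```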

_Neural networks use non-linear activation functions, which help the network learn complex data, represent almost any function that maps a question to its answer, and provide accurate predictions. The most important role of an activation function is to introduce non-linearity into the network._

### ___Properties of Activation Function___
* ___Derivative or Differential___ _: The function should be differentiable. The derivative is the change in y with respect to a change in x, also known as the slope, and backprop needs it to update the weights._
* ___Monotonic function___ _: Ideally, the function should be either entirely non-increasing or entirely non-decreasing._

### ___Activation function Types___
* ___Linear function___
* ___Binary Step function___
* ___Non-Linear function___

## ___Linear Function___

_A linear activation function takes the form:_
___y = mx + c (the slope m plays the role of the weight W and the intercept c plays the role of the bias b, so in neural-network notation the equation becomes y = Wx + b)___

_It takes the inputs (Xi), multiplied by the weights (Wi) for each neuron, and creates an output proportional to the input. In simple terms, the output is proportional to the weighted sum of the inputs._

![image.png](attachment:image.png)

_As mentioned above, an activation function should have certain properties, which the linear function fails to provide._

___Problems with the linear function:___

* ___The derivative is constant.___

_The derivative of a linear function is a constant and has no relation to the input. The weights and biases are still updated during backprop, but the activation contributes the same factor to the gradient regardless of the input._

* ___All layers of the neural network collapse into one.___

_With linear activation functions, no matter how many layers the network has, the last layer is still a linear function of the input to the first layer. The whole stack is therefore equivalent to a single linear layer._

* ___It does not satisfy the assumptions of the Universal Approximation Theorem.___

___A neural network with linear activation functions is simply a linear regression model.___
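_The collapse of stacked linear layers can be checked numerically. The sketch below uses hypothetical helper names (`affine`, `compose`) and made-up weights; it shows that two linear layers give the same output as one layer with weights W2·W1 and bias W2·b1 + b2._

```python
def affine(W, b, x):
    """One layer with a linear activation: W x + b (W is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def compose(W2, b2, W1, b1):
    """Return (W, b) of the single layer equivalent to layer 1 followed by layer 2."""
    W = [[sum(W2[i][k] * W1[k][j] for k in range(len(W1)))
          for j in range(len(W1[0]))]
         for i in range(len(W2))]
    b = [sum(W2[i][k] * b1[k] for k in range(len(b1))) + b2[i]
         for i in range(len(W2))]
    return W, b

# Hypothetical 2 -> 3 -> 2 network with linear activations.
W1, b1 = [[1, 2], [3, 4], [5, 6]], [1, 0, -1]
W2, b2 = [[1, 0, 2], [0, 1, 1]], [1, -2]
x = [2, -1]

two_layers = affine(W2, b2, affine(W1, b1, x))
W, b = compose(W2, b2, W1, b1)
one_layer = affine(W, b, x)
print(two_layers, one_layer)
```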

___Pros and Cons___ _: A linear function has limited power and limited ability to handle complexity. It can be useful for simple tasks where interpretability matters._

![image.png](attachment:image.png)

## ___Binary Step Function___

_The binary step function is popularly known as the __“Threshold function”__. It is a very simple function: the output is 1 when the input reaches the threshold and 0 otherwise._

![image.png](attachment:image.png)

___Pros and Cons___

* _The gradient (derivative) of the binary step function is zero everywhere except at the threshold, where it is undefined. This is a very big problem for backprop, because the weights receive no update signal._

* _Another problem with the step function is that on its own it can only handle binary-class problems (though with some tweaks it can be used for multi-class problems)._

![image.png](attachment:image.png)

## ___Non-Linear Function___

_Deep learning took off because of non-linear functions. Most modern neural networks use a non-linear function as the activation function that fires the neuron. The reason is that they allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data such as images, video, audio, and data sets that are non-linear or have high dimensionality._

___Advantage of Non-linear function over the Linear function___

* _The derivative depends on the input, so backpropagation receives a useful, input-dependent gradient._
* _Stacking layers is meaningful, which helps us create deep neural nets._

___Non-linear Activation Function Types___
_The Nonlinear Activation Functions are mainly divided on the basis of their range or curves._

### ___Sigmoid or Logistic Activation Function___

![image.png](attachment:image.png)

<img src='https://miro.medium.com/max/322/1*4JFawbmQVVqO_Ya_9DB2dw.png'/>

* _The output of the sigmoid function always lies between 0 and 1._


* _Sigmoid is an S-shaped, monotonic and differentiable function._


* _The derivative of the sigmoid function, f’(x) = f(x)(1 - f(x)), lies between 0 and 0.25._


* _The derivative of the sigmoid function is not monotonic: it rises to its maximum of 0.25 at x = 0 and then falls again._
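_The properties listed above can be checked with a short sketch. The function names are my own; the formulas are the standard logistic function and its derivative f(x)(1 - f(x))._

```python
import math

def sigmoid(x):
    """Logistic function, written so that large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The output stays in (0, 1); the derivative peaks at x = 0 and shrinks on both sides.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```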

___Cons:___
* _The sigmoid function suffers from the “__Vanishing gradient problem__”: for very high or very low values of X, there is almost no change in the prediction, so the derivative is close to zero. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction._


* _The sigmoid function is not “zero-centric”. Because 0 < output < 1, the gradient updates of all the weights feeding a neuron share the same sign, which makes the updates zig-zag and the optimization harder._


* _Computationally heavier than piecewise-linear functions, because it uses the exponential function, which slows training._

___Sigmoid is very popular in classification problems.___

#### ___Saturation problem___

_A neuron is said to be saturated when its output reaches its extreme value, either the maximum or the minimum._

_When f(x) approaches 0 or 1, f’(x) = f(x)(1 - f(x)) approaches 0._

___Why do we care about Saturation?___

_Refer to the neural network below and assume that weight w211 needs to be updated during backpropagation using the gradient descent update rule._

![image.png](attachment:image.png)

![image.png](attachment:image.png)

_Now if h21 is close to 1, its derivative will be close to 0, so there is almost no update to weight w211. This is known as the __vanishing gradient__ problem: the gradient of the weight vanishes, or goes down to zero._

_So with the logistic activation function, a saturated neuron may cause the gradient to vanish, and the network then stops learning or keeps learning at a very small rate._

#### ___Not a zero-centered function___
_A function having an equal mass on both the sides of zero line (x-axis) is known as a zero-centered function. Or in other words, in a zero-centered function, the output can be either negative or positive._

_In the case of logistic activation function, the output is always positive and the output is always accumulated only towards one side (positive side) so it is not a zero-centered function._

___Why do we care about zero-centered functions?___

_Let us assume weights w311 and w312 need to be updated during backpropagation using the gradient descent update rule._

![image.png](attachment:image.png)

![image.png](attachment:image.png)
_h21 and h22 will always be positive because of the logistic activation function. The gradients of w311 and w312 can therefore be positive or negative only through the sign of the common part._

![image.png](attachment:image.png)

_So the gradients of all the weights connected to the same neuron are either all positive or all negative. Hence, during the update, these weights are only allowed to move in certain directions, not in all possible directions. This makes the optimization harder._
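_A small sketch of this sign argument, assuming the gradient of each weight is the common upstream part multiplied by the corresponding input (h21, h22). The helper name and the numbers are hypothetical._

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_gradients(common, inputs):
    """Gradient of each weight of one neuron: common upstream part times its input."""
    return [common * h for h in inputs]

# Sigmoid outputs are always positive, so both gradients take the sign of `common`.
sigmoid_inputs = [sigmoid(-1.5), sigmoid(0.3)]
print(weight_gradients(0.7, sigmoid_inputs))
print(weight_gradients(-0.7, sigmoid_inputs))

# With a zero-centered activation such as tanh, the inputs, and hence the
# gradients, can have different signs.
tanh_inputs = [math.tanh(-1.5), math.tanh(0.3)]
print(weight_gradients(0.7, tanh_inputs))
```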

___It is similar to a situation where you are only allowed to move left and forward, not allowed to move right and backward then it is very hard to reach the desired destination.___

### ___tanh Function___

![image.png](attachment:image.png)

<img src ='https://miro.medium.com/max/600/1*Pf89wzuV8Md86GoTnpYvjg.png'/>

_Tanh is the hyperbolic tangent function. The curves of the tanh function and the sigmoid function are quite similar, because tanh is a scaled and shifted version of the sigmoid: tanh(x) = 2·sigmoid(2x) - 1. Hence it has properties similar to those of the sigmoid function._

<img src='https://miro.medium.com/max/456/1*fGHYXAb3BNbaRIxqYvMeXg.png'/>

<img src='https://miro.medium.com/max/545/1*Dt5VWESQWaFvRnb9H9OwWA.png'/>

* _The function is monotonic; its derivative is not (it peaks at x = 0, like the derivative of the sigmoid)_


* _Output is zero “centric”_


* _Optimization is easier_


* _The derivative of the tanh function, f’(x) = 1 - tanh²(x), lies between 0 and 1._

___Cons:___
* _Tanh suffers from the “vanishing gradient problem”: like the sigmoid, it saturates for inputs of large magnitude, where its derivative is close to 0._


* _Computationally heavier than piecewise-linear functions, because it uses the exponential function, which slows training._


___Tanh is preferred over the sigmoid function since it is zero centered and the gradients are not restricted to move in a certain direction.___
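_The relation between tanh and the sigmoid, and the range of the tanh derivative, can be verified with a few lines. This is a sketch with my own function names; it relies only on the identities tanh(x) = 2·sigmoid(2x) - 1 and tanh’(x) = 1 - tanh²(x)._

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which lies in (0, 1]."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -0.5, 0.0, 0.7, 3.0):
    # Columns: x, tanh(x), the same value obtained from the sigmoid, the derivative.
    print(x, math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, tanh_grad(x))
```

_The output is negative for negative inputs and positive for positive inputs, which is what "zero-centered" means here; the derivative is largest at 0 and shrinks for large |x|._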

_In general binary classification problems, the tanh function is used for the hidden layers and the sigmoid function is used for the output layer. However, these choices are not fixed: the activation function to use must be analyzed for the specific problem, or determined by experiment._

### ___ReLU (Rectified Linear Units) Function___

![image.png](attachment:image.png)

<img src='https://miro.medium.com/max/640/0*-_cfOPFbhUDZ9acO.png'/>

_ReLU is a non-linear activation function that has gained popularity in AI. The ReLU function is written as f(x) = max(0, x)._

* _The function and its derivative both are monotonic._

* _Main advantage of using the ReLU function: it does not activate all the neurons at the same time. Neurons with a negative input output exactly 0, so the activations are sparse._

* _Computationally efficient_

* _The derivative of the ReLU function, f’(x), is 1 if x > 0 and 0 if x < 0._

* _Converges very fast_

___Cons:___
* _The ReLU function is not “zero-centric”. Its output is never negative (output ≥ 0), which gives it the same one-sided gradient-update problem as the sigmoid and makes optimization harder._

* _Dead neurons are the biggest problem. They arise because the gradient is exactly zero for every negative input._

___Problem of Dying neuron/Dead neuron___ _: Because the ReLU derivative is 1 for positive inputs (f’(x) = 1 for x > 0), ReLU does not saturate on the positive side and the gradient does not vanish there. For negative inputs, however, ReLU outputs 0 and its gradient is 0 as well. A neuron whose input is negative for every training example therefore receives no weight updates and stops learning. This is called the problem of the dying neuron._
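_The dying-neuron effect can be illustrated with a single ReLU neuron. In this sketch the helper `neuron_weight_grads` and all numbers are hypothetical; the derivative at exactly 0 is taken to be 0 by convention._

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """1 for positive inputs, 0 otherwise (the value at x = 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

def neuron_weight_grads(weights, bias, inputs, upstream=1.0):
    """Gradient of each weight of one ReLU neuron for a single example."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return [upstream * relu_grad(z) * x for x in inputs]

weights, inputs = [0.5, -0.3], [1.0, 2.0]
# Healthy neuron: positive weighted sum, so the gradient flows.
print(neuron_weight_grads(weights, 1.0, inputs))
# Large negative bias: the weighted sum is negative and every weight gradient is 0.
print(neuron_weight_grads(weights, -10.0, inputs))
```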

### ___Leaky ReLU Function___

<img src='https://miro.medium.com/max/700/1*zttz83tAhQ8-ljSgNUIPHw.png'/>

![image.png](attachment:image.png)

<img src='https://miro.medium.com/max/582/0*3yKiiHYG3Xqx7-9N.png' width=300/>

_Leaky ReLU is an improved version of the ReLU function that introduces a small constant slope for negative inputs._

* _Leaky ReLU is defined to address the problem of the dying neuron/dead neuron._


* _The problem is addressed by introducing a small slope: negative inputs are scaled by α instead of being set to 0, which enables the corresponding neurons to “stay alive”._


* _The function and its derivative both are monotonic_


* _It allows a non-zero gradient for negative inputs during back propagation_


* _It is efficient and easy to compute._


* _The derivative of Leaky ReLU is 1 when x > 0 and the constant α (a small value between 0 and 1, for example 0.01) when x < 0._

___Cons:___

* _Leaky ReLU does not provide consistent predictions for negative input values._

_In theory, Leaky ReLU has all the advantages of ReLU, plus there will be no problems with Dead ReLU, but in actual operation, it has not been fully proved that Leaky ReLU is always better than ReLU._
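_A minimal sketch of Leaky ReLU and its derivative, assuming the commonly used slope α = 0.01 (the value is a choice, not something fixed by the definition):_

```python
def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """1 for positive inputs, alpha otherwise, so the gradient is never exactly 0."""
    return 1.0 if x > 0 else alpha

for x in (-3.0, -0.5, 0.5, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```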

### ___ELU (Exponential Linear Units) Function___

![image.png](attachment:image.png)

_ELU is also proposed to solve the problems of ReLU. ELU keeps most of the advantages of ReLU, and:_

* _It addresses the problem of the dying neuron, so there are no Dead ReLU issues._


* _Its outputs can be negative, so the mean activation is closer to zero (more “zero-centric”)._

___Cons:___
* _Computationally intensive: the exponential function makes each evaluation slower than ReLU._


* _Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU._


* _f(x) is monotonic only if alpha is greater than or equal to 0._


* _The derivative f’(x) of ELU is monotonic only if alpha lies between 0 and 1._
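_For reference, here is a sketch of ELU that assumes the standard definition, f(x) = x for x > 0 and α(exp(x) - 1) otherwise, with α = 1 as a default chosen for illustration:_

```python
import math

def elu(x, alpha=1.0):
    """x for positive inputs, alpha * (exp(x) - 1) otherwise; saturates at -alpha."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    """1 for positive inputs, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, elu(x), elu_grad(x))
```

_Negative inputs give negative outputs that level off at -α, and the gradient for negative inputs is small but non-zero, which is why ELU avoids dead neurons._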

### ___PRelu (Parametric ReLU) Function___

![image.png](attachment:image.png)

<img src='https://miro.medium.com/max/700/0*8VBU_GxQ07OvZN2q.png'/>

_PReLU is also an improved version of ReLU. In the negative region, PReLU has a small slope, which can also avoid the problem of ReLU death. Compared to ELU, PReLU is a linear operation in the negative region. Although the slope is small, it does not tend to 0, which is a certain advantage._

* _The idea of leaky ReLU can be extended even further._


* _Instead of multiplying x by a fixed constant, we can multiply it by a trainable parameter that is learned together with the weights, which seems to work better than leaky ReLU. This extension of leaky ReLU is known as Parametric ReLU._


* _The parameter α is generally a number between 0 and 1, and it is generally relatively small._


* _Has a slight advantage over Leaky ReLU because the slope is trainable._


* _Handle the problem of dying neuron._

___Cons:___
* _Same as Leaky ReLU._


* _f(x) is monotonic when a ≥ 0, and f’(x), which steps from a to 1, is non-decreasing when a ≤ 1._

![image.png](attachment:image.png)

_Let’s look at the formula of PReLU. The parameter α is generally a number between 0 and 1, and it is usually relatively small, such as 0.01. When α is fixed at a small constant such as 0.01 instead of being learned, PReLU reduces to Leaky ReLU, which can be regarded as a special case of PReLU._

_Above, yᵢ is any input on the ith channel and aᵢ is the negative slope, which is a learnable parameter._
* _if aᵢ = 0, f becomes ReLU_
* _if aᵢ is a small fixed constant greater than 0, f becomes leaky ReLU_
* _if aᵢ is a learnable parameter, f becomes PReLU_
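_What "learnable" means for the slope can be sketched as follows. The gradient of f with respect to a is y for negative inputs and 0 otherwise, so a can be updated by gradient descent like any weight. The helper `sgd_step_on_slope`, the learning rate and the numbers are hypothetical._

```python
def prelu(y, a):
    """y for positive inputs, a * y otherwise."""
    return y if y > 0 else a * y

def prelu_grad_a(y):
    """Derivative of prelu(y, a) with respect to the slope a."""
    return 0.0 if y > 0 else y

def sgd_step_on_slope(a, ys, upstream, lr=0.1):
    """One gradient-descent update of a, given upstream gradients dL/df per input."""
    grad = sum(u * prelu_grad_a(y) for u, y in zip(upstream, ys))
    return a - lr * grad

print(prelu(-2.0, 0.0))    # a = 0: behaves like ReLU
print(prelu(-2.0, 0.01))   # small fixed a: behaves like Leaky ReLU
print(sgd_step_on_slope(0.25, [-2.0, 3.0], [1.0, 1.0]))  # a moves during training
```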

### ___Maxout___

_The Maxout activation function is defined as follows:_

![image.png](attachment:image.png)

_One relatively popular choice is the Maxout neuron (introduced by Goodfellow et al.), which generalizes the ReLU and its leaky version. Notice that both ReLU and Leaky ReLU are special cases of this form (for example, for ReLU we have w1, b1 = 0). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks._

_The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a learnable activation function._

_Maxout can be seen as adding an activation layer to the deep learning network that has a parameter k. Compared with ReLU, sigmoid, etc., this layer is special in that it adds k neurons and then outputs the largest of their activation values._
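_A minimal sketch of a Maxout unit, assuming it returns the largest of k affine functions of the input (one weight row and one bias per piece). The function name and the weights are illustrative only._

```python
def maxout(x, weights, biases):
    """Largest of the k affine pieces w_j . x + b_j."""
    return max(
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    )

# k = 2 pieces on a one-dimensional input.
relu_like = ([[0.0], [1.0]], [0.0, 0.0])    # max(0, x): ReLU as a special case
leaky_like = ([[0.01], [1.0]], [0.0, 0.0])  # max(0.01 * x, x): Leaky ReLU

for x in (-3.0, 2.5):
    print(x, maxout([x], *relu_like), maxout([x], *leaky_like))
```

_In a real network the weights and biases of every piece are learned, which is why Maxout is described as a learnable activation function._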

### ___Swish (A Self-Gated) Function___

<img src= 'https://miro.medium.com/max/600/0*-XUzhLPuV1WIXNSd.png'/>

* _Google Brain Team has proposed a new activation function, named Swish, which is simply f(x) = x · sigmoid(x)._


* _Their experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging data sets._


* _The curve of the Swish function is smooth and the function is differentiable at all points. This is helpful during the model optimization process and is considered to be one of the reasons that swish outperforms ReLU._


* _Swish function is “not monotonic”. This means that the value of the function may decrease even when the input values are increasing._


* _Function is unbounded above and bounded below._

![image.png](attachment:image.png)

___“Swish tends to consistently match or outperform ReLU”___

_Note that the output of the swish function may fall even when the input increases. This is an interesting and swish-specific feature.(Due to non-monotonic character)_

___f(x) = x * sigmoid(beta * x)___

_Swish can be generalized with a parameter beta, which may be fixed or learnable. If beta = 0, the sigmoid part is always 1/2 and f(x) = x/2 is linear. On the other hand, if beta is very large, the sigmoid becomes nearly a 0-1 step function (0 for x < 0, 1 for x > 0), so f(x) converges to the ReLU function. Swish therefore provides a smooth interpolation between a linear function and ReLU, and the standard Swish function is the case beta = 1._

_Swish's design was inspired by the use of sigmoid functions for gating in LSTMs and highway networks. We use the same value for gating to simplify the gating mechanism, which is called **self-gating**._

_The advantage of self-gating is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters._

1) _Being unbounded above helps to prevent the gradient from gradually approaching 0 during training, which would cause saturation and slow training. At the same time, being bounded below has advantages, because bounded activation functions can have a strong regularization effect, and large negative inputs are mapped to values close to zero._

2) _At the same time, smoothness also plays an important role in optimization and generalization._
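_The behaviour described in this section can be checked with a short sketch of f(x) = x · sigmoid(beta · x). The function names are my own and the beta values are chosen only to show the limiting cases._

```python
import math

def sigmoid(z):
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def swish(x, beta=1.0):
    """Self-gated activation: the input x is multiplied by its own sigmoid gate."""
    return x * sigmoid(beta * x)

# Non-monotonic: the output falls while the input rises from -5 to -1.
print([round(swish(x), 4) for x in (-5.0, -3.0, -1.0, 0.0, 1.0)])
# beta = 0 gives the linear function x / 2; a large beta behaves like ReLU.
print(swish(4.0, beta=0.0), swish(4.0, beta=50.0), swish(-4.0, beta=50.0))
```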

### ___Softmax___

![image.png](attachment:image.png)

_The “softmax” function is a generalization of the sigmoid function to more than two classes, which makes it very useful for multi-class classification problems._

_“Softmax can be described as the combination of multiple sigmoidal functions.”_

_“Softmax function returns the probability for a datapoint belonging to each individual class.”_

_For an arbitrary real vector of length K, Softmax compresses it into a real vector of length K in which every value lies in the range (0, 1) and the elements sum to 1._

_Softmax is different from the normal max function: the max function only outputs the largest value, and Softmax ensures that smaller values have a smaller probability and will not be discarded directly. It is a "max" that is "soft"._

_The denominator of the Softmax function combines all factors of the original output value, which means that the different probabilities obtained by the Softmax function are related to each other._

![image.png](attachment:image.png)

![image.png](attachment:image.png)

_While building a network for a multiclass problem, the output layer would have as many neurons as the number of classes in the target._

_For instance, if you have three classes [A, B, C], there would be three neurons in the output layer. Suppose the outputs of these neurons are [2.2, 4.9, 1.75]. Applying the softmax function to these values gives approximately [0.06, 0.90, 0.04]. These represent the probability of the data point belonging to each class. From the result we can see that the input most likely belongs to class B._

_“Note that the sum of all the values is 1.”_
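_The three-class example can be reproduced with a small softmax sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick that I have added; it does not change the result._

```python
import math

def softmax(scores):
    """exp(z_i) / sum_j exp(z_j), computed with the maximum subtracted for stability."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.2, 4.9, 1.75]          # outputs of the three neurons for classes A, B, C
probs = softmax(scores)
print([round(p, 2) for p in probs])
print(sum(probs))
print("ABC"[probs.index(max(probs))])   # class with the highest probability
```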

_In the case of binary classification, for Sigmoid, we have:_
![image.png](attachment:image.png)

_For Softmax with K = 2, we have:_
![image.png](attachment:image.png)

_where:_

![image.png](attachment:image.png)

_It can be seen that in the case of binary classification, Softmax reduces to Sigmoid._
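_This reduction can also be checked numerically. In the sketch below (with my own function names and made-up scores), the softmax probability of the first of two classes equals the sigmoid of the difference between the two scores._

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

for z1, z2 in [(2.0, 0.5), (-1.0, 3.0), (0.0, 0.0)]:
    # Both columns show the probability of class 1.
    print(softmax([z1, z2])[0], sigmoid(z1 - z2))
```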

### ___Softplus___

<img src='https://miro.medium.com/max/700/0*492PBXPSjBYH3nsI.png'/>

_The softplus function is similar to the ReLU function, but it is smoother. The Softplus (or SmoothReLU) function is:_

___f(x) = ln(1 + exp(x))___

_The derivative of the Softplus function is the logistic function: f’(x) = 1/(1 + exp(-x))._

_The function value ranges over (0, +inf). Both f(x) and f’(x) are monotonic._
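_A short sketch of softplus and a numerical check that its slope matches the logistic function. The rewritten form max(x, 0) + ln(1 + exp(-|x|)) is algebraically equal to ln(1 + exp(x)) and is used here only to avoid overflow for large inputs._

```python
import math

def softplus(x):
    """ln(1 + exp(x)), computed in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numerical_slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, softplus(x), numerical_slope(softplus, x), sigmoid(x))
```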

## ___Which one is better to use ? How to choose a right one?___

_We can’t single out one activation function as the best. Each activation function has its own pros and cons, and which one works well is ultimately decided by trial._

_But based on the properties of the problem, we may be able to make a better choice for easier and quicker convergence of the network._

* _Sigmoid functions and their combinations generally work better in the case of classification problems_


* _Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient problem_


* _The ReLU activation function is widely used in modern networks._


* _If ReLU causes dead neurons in our network, the leaky ReLU function is a good choice_


* _ReLU function should only be used in the hidden layers_

___“As a rule of thumb, one can begin with using ReLU function and then move over to other activation functions in case ReLU doesn’t provide with optimum results”.___

## ___Vanishing Gradient Problem___

_Vanishing Gradient Problem is a difficulty found in training certain Artificial Neural Networks with gradient based methods (e.g Back Propagation). In particular, this __problem makes it really hard to learn and tune the parameters of the earlier layers in the network__. This problem becomes __worse__ as the __number of layers__ in the architecture __increases__._

_This is not a fundamental problem with neural networks - it's a problem with gradient based learning methods __caused by certain activation functions__. Let's try to intuitively understand the problem and the cause behind it._

_Recall the sigmoid function, which was almost always used as the activation function for ANNs in a classification context._

_The sigmoid function is useful because it “squeezes” any input value into an output range of (0, 1) (where it asymptotes). This is perfect for representations of probabilities and classification._

_Let’s take a look at the derivative of the sigmoid function:_  ___f(x)*(1-f(x))___

![image.png](attachment:image.png)

<img src='https://miro.medium.com/max/1000/1*GOjCYjqPb5oMcfs0ZT8TQQ.png' width=800/>

_Looks decent, you say. Look closer. The maximum of the function is 1/4, and the function asymptotes horizontally at 0. In other words, the output of the derivative of the sigmoid function is always between 0 and 1/4. In mathematical terms, the range is (0, 1/4]._

_Now, let’s move on to the structure of a neural network and backprop and their implications on the size of gradients._

<img src='https://miro.medium.com/max/700/0*SUsopH_FuWnAhQjD.'/>

_Recall this general structure for a simple, univariate neural network. Each neuron or “activity” is derived from the previous: it is the previous activity multiplied by some weight and then fed through an activation function. The input, of course, is the notable exception. The error box J at the end returns the aggregate error of our system. We then perform backpropagation to modify the weights through gradient descent such that the output of J is minimized._

_To calculate the derivative with respect to the first weight, we use the chain rule to “backpropagate” like so:_

<img src='https://miro.medium.com/max/649/0*d5zULA4KVEsVYlng.'/>

_Let’s focus on these individual derivatives:_

<img src='https://miro.medium.com/max/270/1*GxkJqivU9eIY1FIOF5BgHw.png'/>

_With regard to the first derivative: since the output is the activation of the 2nd hidden unit, and we are using the sigmoid function as our activation function, the derivative of the output is going to contain the derivative of the sigmoid function. Specifically, the resulting expression will be:_

<img src='https://miro.medium.com/max/422/1*c9gelH3uoC54l9h4NeVrBA.png'/>

_The same applies for the second:_

<img src='https://miro.medium.com/max/428/1*VDUbtzZ03InBSSINfS-q8A.png'/>

_In both cases, the derivative contains the derivative of the sigmoid function. Now, let’s put those together:_

<img src='https://miro.medium.com/max/700/1*zejGoPP4Q1C170WsW7K_-w.png'/>

_Recall that the derivative of the sigmoid function outputs values between 0 and 1/4. By multiplying these two derivatives together, we are multiplying two values in the range (0, 1/4]. Any two numbers between 0 and 1 multiplied with each other will simply result in a smaller value. For example, 1/3 × 1/3 is 1/9._

_Now, look at the magnitude of the terms in our expression:_

<img src='https://miro.medium.com/max/700/1*8JJ6sYleUtvUZR7TOyyFVg.png'/>

_At this point, if the weights are initialized with magnitudes below 1, we are multiplying four values that are between 0 and 1. The product becomes small very fast. And even if such a weight initialization is not used, the vanishing gradient problem will most likely still occur: many of these sigmoid derivatives multiplied together are small enough to outweigh larger weights, and the weights themselves may shift into a range below 1 during training._

_This neural network isn’t that deep. But imagine a deeper one used in an industrial application. As we backpropagate further back, we’d have many more small numbers partaking in a product, creating an even tinier gradient! Thus, with deep neural nets, the vanishing gradient problem becomes a major concern._
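_To get a feel for how fast the product shrinks, the sketch below multiplies the per-layer factor w · sigmoid’(z) along a chain of single-neuron layers. The setup is an assumption chosen as the best case for the sigmoid: every weight is 1.0 and every pre-activation is 0, where the derivative takes its maximum of 1/4._

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def chain_factor(weights, pre_activations):
    """Product of w_k * sigmoid'(z_k) over the layers between a weight and the output."""
    factor = 1.0
    for w, z in zip(weights, pre_activations):
        factor *= w * sigmoid_grad(z)
    return factor

for depth in (2, 5, 10, 20):
    print(depth, chain_factor([1.0] * depth, [0.0] * depth))
```

_Even in this best case the factor is (1/4) raised to the depth, so the gradient reaching the earliest layers shrinks exponentially with the number of layers._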

___How do we solve this?___

___Rectified Linear Unit (ReLU)___

<img src='https://miro.medium.com/max/386/1*ZD5kma5J-6UabfEwERv_dQ.png'/>

_Another way of writing the ReLU function is like so:_

<img src='https://miro.medium.com/max/310/1*_Jo2pajYfG1y5ThUFA3PNA.png'/>

_In other words, when the input is smaller than zero, the function will output zero. Else, the function will mimic the identity function. It’s very fast to compute the ReLU function._

_It doesn’t take a genius to calculate the derivative of this function. When the input is smaller than zero, the output is always equal to zero, and so the rate of change of the function is zero. When the input is greater than zero, the output is simply the input, and hence the derivative is equal to one (at exactly zero the derivative is undefined, and by convention one of the two values is used):_
          
<img src='https://miro.medium.com/max/417/1*-1esz_gtjhs2JTmSh4xAUg.png'/>

_If we were to graph this derivative, it would look exactly like a typical step function:_

<img src='https://miro.medium.com/max/700/1*cIsyfhOzTjQrfhUiimxn8g.png'/>

_So, it’s solved! For positive inputs the derivative of the activation function is exactly 1 rather than at most 1/4, so multiplying many of these derivatives together no longer shrinks the gradient._

_ReLUs have one caveat though: they “die” (output zero) when their input is negative. This can, in many cases, completely block backpropagation, because the gradient is zero whenever a negative value is fed to the ReLU function. This is also an issue if a large negative bias term is learned: the weighted sum fed into the neuron may end up negative because the positive weights cannot compensate for the bias term. Negative weights or negative inputs (or any combination that gives a negative weighted sum) have the same effect. A dead ReLU will hence output the same value for almost all of your activities: zero._

_A __“leaky” ReLU__ solves this problem. Leaky Rectified Linear Units are ones that have a very small gradient instead of a zero gradient when the input is negative, giving the chance for the net to continue its learning._
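_The difference between the two variants along a chain of layers can be sketched as follows. The helper `chain_factor`, the slope 0.01 and the pre-activation values are assumptions made for illustration._

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def chain_factor(grad_fn, pre_activations):
    """Product of the activation derivatives along a chain of layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor

all_active = [0.5] * 20           # every unit has a positive input
one_negative = [0.5, -0.2, 0.5]   # one unit has a negative input

print(chain_factor(relu_grad, all_active))          # the gradient passes through unchanged
print(chain_factor(relu_grad, one_negative))        # blocked completely
print(chain_factor(leaky_relu_grad, one_negative))  # small, but not zero
```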

## ___Exploding Gradient Problem___

_Exploding gradients are a problem where __large error gradients accumulate__ and __result in very large updates to neural network model weights during training__. Gradients are used during training to update the network weights, and this process typically works best when the updates are small and controlled. When the magnitudes of the gradients accumulate, the network is likely to become unstable, which can cause poor prediction results or even a model that reports nothing useful whatsoever. There are methods to fix exploding gradients, which include __gradient clipping__ and __weight regularization__, among others. This problem mainly occurs because of large weight values: when large weights are multiplied by the derivatives of the activation function, the products are large, and as they are multiplied together during backpropagation, their impact on the weight updates of the earlier layers becomes very high._
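_Gradient clipping, one of the fixes mentioned above, can be sketched as rescaling the gradient vector whenever its norm exceeds a threshold. This is a minimal illustration of clipping by global norm, not the API of any particular library; the function name and the threshold are assumptions._

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], 1.0))   # norm 5 is scaled down to norm 1
print(clip_by_norm([0.3, 0.4], 1.0))   # norm 0.5 is left as it is
```

_Clipping keeps the direction of the update and only limits its size, so a single very large gradient cannot throw the weights far away from their current values._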

_[Reference](https://www.youtube.com/watch?v=IJ9atfxFjOQ)_