# Activation Functions                               

https://www.jeremyjordan.me/neural-networks-activation-functions/#:~:text=An%20ideal%20activation%20function%20is,training%20to%20optimize%20the%20weights.
                                                                                    

![image.png](attachment:image.png)

In a neural network, the inputs (numeric data points) are fed into the neurons of the input layer.

The activation function is the transformation logic, a mathematical “gate”, between the input feeding into the current neuron and its output going to the next layer.

An ideal activation function is both nonlinear and differentiable. 
- The nonlinear behavior of an activation function allows our neural network to learn nonlinear relationships in the data.
- Differentiability is important because it allows us to backpropagate the model's error when training to optimize the weights.

## 1. Linear Activation Function
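The output is proportional to the input.

Backpropagation gains nothing here: the derivative is a constant that does not depend on the input, and any stack of linear layers collapses into a single linear layer. The network is then just a linear regression.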
![image.png](attachment:image.png)
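To see why a purely linear activation adds no modelling power, the sketch below stacks two one-dimensional linear layers and compares them with a single equivalent layer. The helper `linear_layer` and all weights and biases are illustrative assumptions, not values from these notes.

```python
def linear_layer(w, b):
    """Return a one-input, one-output layer with the identity (linear) activation."""
    return lambda x: w * x + b

# Two stacked linear layers (arbitrary illustrative weights and biases).
layer1 = linear_layer(2.0, 1.0)
layer2 = linear_layer(-3.0, 0.5)

def stacked(x):
    return layer2(layer1(x))

# The same mapping as ONE linear layer: w = w2 * w1, b = w2 * b1 + b2.
collapsed = linear_layer(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

for x in (-2.0, 0.0, 3.5):
    assert abs(stacked(x) - collapsed(x)) < 1e-12
```

However many linear layers are stacked, the result can always be rewritten this way, which is why depth only helps once a nonlinear activation is inserted between layers.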

## 2. Non-Linear Activation Functions


They allow backpropagation because their derivative depends on the input.

They make deep neural networks practical: stacked layers no longer collapse into a single linear map, so a multi-layer network can model complex relationships with high accuracy.

## Sigmoid 
Smooth gradient, with output values bounded between 0 and 1, normalizing the output of each neuron.
![image.png](attachment:image.png)


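#### Disadvantages:

The exp( ) function is computationally expensive.

Vanishing gradients: the curve saturates for inputs of large magnitude, so the gradient there is close to zero.

Not suitable as the output activation for regression tasks, because the output is confined to (0, 1).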
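A minimal sketch of the sigmoid and its derivative, using only the standard library. It assumes the usual definition 1 / (1 + exp(-x)); the function names are illustrative. The printed gradients show how quickly the derivative shrinks as the input grows, which is the vanishing-gradient problem.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

This naive form can overflow inside `math.exp` for very large negative inputs; a production implementation would guard against that.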

## 3. Non Linear - TanH
Smooth gradient, with output values bounded between -1 and 1, normalizing the output of each neuron.
![image.png](attachment:image.png)

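#### Disadvantages:

The exp( ) function is computationally expensive.

Vanishing gradients: like sigmoid, the curve saturates for inputs of large magnitude, so the gradient there is close to zero.

Not suitable as the output activation for regression tasks, because the output is confined to (-1, 1).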
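As a rough illustration, the derivative of tanh is 1 - tanh(x)^2, and tanh itself is a rescaled sigmoid. The sketch below checks that relationship numerically; the helper names are assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1 when x = 0."""
    return 1.0 - math.tanh(x) ** 2

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
for x in (-3.0, 0.0, 3.0):
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
```

Because the output is centred on zero and the peak gradient is 1 rather than 0.25, tanh often trains more easily than sigmoid, but it saturates in the same way.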

## 4. Non Linear - ReLU
Computationally efficient, which allows the network to converge very quickly.
Although it looks like a linear function, ReLU is nonlinear (it is piecewise linear), has a derivative everywhere except at zero, and allows backpropagation.
![image.png](attachment:image.png)

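#### Disadvantages:
Dying ReLU: when the input to a unit is negative, the gradient is zero. A unit whose input stays negative for every training example receives no weight updates and stops learning.

Exploding gradients: the output is unbounded above, so activations and gradients can grow very large.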
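A small sketch of ReLU and its gradient, with a toy illustration of a dead unit. The pre-activation values are made up for the example, and the gradient at exactly zero is set to 0 by convention.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0

# A "dead" unit: the pre-activation is negative for every training input,
# so every gradient is zero and the unit's weights never change.
pre_activations = [-2.3, -0.7, -4.1]
grads = [relu_grad(z) for z in pre_activations]
print(grads)
```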

## 5. Non Linear - ELU
The Exponential Linear Unit (ELU) tends to drive the cost toward zero faster and to produce more accurate results.
ELU is very similar to ReLU except for negative inputs. ELU has an extra constant α, a positive number.

For negative inputs, ELU curves smoothly and saturates at -α, whereas ReLU has a sharp corner at zero.

![image.png](attachment:image.png)
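A minimal sketch of ELU, assuming the common definition: x for positive inputs and α(exp(x) - 1) otherwise. The default `alpha=1.0` is an assumption chosen for illustration.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Gradient of ELU: 1 for x > 0, alpha * exp(x) otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# Large negative inputs saturate towards -alpha instead of being cut to 0.
print(elu(-50.0, alpha=1.5))
```

Unlike ReLU, the gradient for negative inputs is small but nonzero, so a unit can still recover from a run of negative pre-activations.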

## 6. Non Linear - Leaky ReLU
This variation of ReLU has a small positive slope in the negative area, so it does enable backpropagation, even for negative input values.

### For Leaky ReLU: alpha = 0.01


![image.png](attachment:image.png)

#### Disadvantages:
Leaky ReLU does not provide consistent predictions for negative input values.
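Leaky ReLU can be sketched in a couple of lines. The default slope of 0.01 matches the alpha quoted above; the function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 on the positive side and alpha (not 0) on the negative side."""
    return 1.0 if x > 0 else alpha

print(leaky_relu(-10.0), leaky_relu_grad(-10.0))
```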

## 7. Non Linear - Parametric ReLU
This variation of ReLU allows the negative slope to be learned. Unlike Leaky ReLU, where the slope is a fixed constant, Parametric ReLU treats the slope α of the negative part as a learnable parameter.

The value of α is updated by backpropagation along with the weights, so the network learns the most appropriate value.
![image.png](attachment:image.png)

![image.png](attachment:image.png)
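The sketch below shows what "learning α" means in practice: one gradient-descent step on α for a single toy example with a squared-error objective. The starting value, learning rate, input and target are all assumptions chosen for illustration.

```python
def prelu(x, alpha):
    """Parametric ReLU: x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x):
    """Derivative of prelu with respect to alpha: 0 for x > 0, x otherwise."""
    return 0.0 if x > 0 else x

alpha, lr = 0.25, 0.1          # illustrative starting slope and learning rate
x, target = -2.0, -1.0         # one toy training example

y = prelu(x, alpha)                               # current output
grad = 2.0 * (y - target) * prelu_grad_alpha(x)   # d(y - target)^2 / d alpha
alpha -= lr * grad                                # gradient-descent update

print(alpha, prelu(x, alpha))
```

After the update the output for this example sits closer to the target, which is the same mechanism that updates ordinary weights.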

## 8. Non Linear - Swish
Swish is a smooth, self-gated variation of ReLU proposed by researchers at Google.

![image.png](attachment:image.png)
### Properties:
1. It is bounded below. Swish therefore benefits from sparsity similar to ReLU: large negative inputs are driven to approximately zero.

2. Small negative values are not zeroed out (in ReLU, f(x) = 0 for x < 0). These negative values may still be relevant for capturing patterns underlying the data.

3. It is a smooth curve, so its output landscape is smooth. This helps the optimizer converge towards the minimum loss.

4. It is unbounded above. For very large inputs, the outputs do not saturate at a maximum value (as sigmoid does at 1).
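The properties above can be checked numerically. This sketch assumes the standard definition swish(x) = x * sigmoid(βx); the `beta` parameter and its default of 1 are assumptions, since the formula in these notes is only given as an image.

```python
import math

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x), written as x / (1 + exp(-beta * x))."""
    return x / (1.0 + math.exp(-beta * x))

for x in (-20.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  swish={swish(x):.6f}")
```

Small negative inputs give small negative outputs, large negative inputs give outputs near zero, and large positive inputs pass through almost unchanged.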

## 9. NonLinear - MaxOut ReLU
The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a learnable activation function.
It is a piecewise linear function that returns the maximum of its inputs, designed to be used in conjunction with the dropout regularization technique.

Each neuron computes several pre-activations, each with its own weights and bias, so the number of parameters per neuron is multiplied by the number of pieces. In traditional neural nets, a single pre-activation is passed through a squashing activation such as sigmoid (range 0 to 1) or tanh (range -1 to 1). The maxout activation instead outputs only the maximum of its pre-activations.
![image.png](attachment:image.png)

Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron, therefore, enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).
However, with two pieces it doubles the number of parameters for each neuron, so a higher total number of parameters needs to be trained.
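A minimal sketch of a single maxout unit, assuming the usual definition as the maximum over k affine pieces. The weights and inputs are invented for illustration. Fixing the second piece to zero recovers ReLU, and making it a scaled copy of the first recovers leaky ReLU.

```python
def maxout(x, weights, biases):
    """Maxout unit: the maximum over k affine pieces w_i . x + b_i."""
    return max(
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

x = [1.0, -2.0]
w = [0.5, 1.0]          # first piece: w . x = 0.5 - 2.0 = -1.5

# Second piece fixed at zero -> max(w . x, 0), i.e. ReLU of the first piece.
relu_like = maxout(x, [w, [0.0, 0.0]], [0.0, 0.0])

# Second piece = 0.01 * first piece -> leaky ReLU with slope 0.01.
leaky_like = maxout(x, [w, [0.005, 0.01]], [0.0, 0.0])

print(relu_like, leaky_like)
```

In a trained maxout network all pieces are learned, which is where the extra parameters come from.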

## 10. NonLinear - SoftPlus
Softplus, also called SmoothReLU, is a smooth approximation to ReLU: f(x) = ln(1 + exp(x)).


![image.png](attachment:image.png)

The derivative of softplus is f′(x) = 1 / (1 + exp(−x)), which is the sigmoid function.
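The relationship between softplus and sigmoid can be verified with a finite-difference check. The sketch uses a rearranged but mathematically equivalent form of ln(1 + exp(x)) to avoid overflow; the step size `h` and the test points are arbitrary choices.

```python
import math

def softplus(x):
    """ln(1 + exp(x)), rewritten as max(x, 0) + ln(1 + exp(-|x|)) for stability."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Central finite difference of softplus should match sigmoid.
h = 1e-5
for x in (-3.0, 0.0, 2.0):
    numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
    assert abs(numeric - sigmoid(x)) < 1e-6
```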


## 11. SoftMax
The SoftMax function calculates a probability distribution over ‘n’ different events.

![image.png](attachment:image.png)
This function calculates the probability of each target class over all possible target classes. The calculated probabilities are then used to determine the target class for the given inputs.

Typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories.

The softmax function is commonly used as the output activation function for multi-class classification because it scales the preceding inputs to a range between 0 and 1 and normalizes the output layer so that the sum of all output neurons is equal to one. As a result, we can consider the softmax output as a categorical probability distribution. This allows you to communicate a degree of confidence in your class predictions.
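A minimal softmax sketch in plain Python. Subtracting the maximum logit before exponentiating is a common stability trick and does not change the result; the example logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])            # illustrative raw scores for 3 classes
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(probs, predicted_class)
```

The predicted class is simply the index with the highest probability, and the probability itself can be reported as a confidence.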

# Loss Functions

Machines learn by means of a loss function. It is a method of evaluating how well a specific algorithm models the given data. If predictions deviate too much from the actual results, the loss function produces a very large number. Gradually, with the help of an optimization function, the model learns to reduce the error in prediction.




## Regression Loss Functions
![image.png](attachment:image.png)
        


### 1. MAE (L1): Mean Absolute Error (Least Absolute Error)
![image.png](attachment:image.png)

One issue to be aware of is that the L1 is not smooth at the target and this can result in algorithms not converging well.

### 2. MSE (L2): Mean Squared Error (Least Squared Error)
![image.png](attachment:image.png)
L2 squares the error, so any error larger than 1 (an outlier can cause this) grows sharply. The loss is therefore very sensitive to outliers, and an optimizer minimizing it adjusts the model heavily to reduce those large errors.
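The difference in outlier sensitivity between L1 and L2 shows up clearly on a toy example. The data below are invented: the two prediction lists are identical except for one large error in the last position.

```python
def mae(y_true, y_pred):
    """Mean absolute error (L1)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error (L2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true  = [1.0, 2.0, 3.0, 4.0]
clean   = [1.5, 2.5, 2.5, 4.5]    # every error has magnitude 0.5
outlier = [1.5, 2.5, 2.5, 14.0]   # last error has magnitude 10.0

print(mae(y_true, clean), mse(y_true, clean))       # 0.5 0.25
print(mae(y_true, outlier), mse(y_true, outlier))   # 2.875 25.1875
```

One bad prediction raises MAE by a factor of about 6 but raises MSE by a factor of about 100.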

### 3. Huber Loss
The Huber loss combines the best properties of MSE and MAE. 

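It is quadratic for small errors and linear otherwise (its gradient is correspondingly linear for small errors and constant otherwise). It is controlled by its delta parameter, the error magnitude at which it switches from one regime to the other: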

![image.png](attachment:image.png)
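A sketch of the Huber loss for a single error value, assuming the standard piecewise definition: 0.5 * e^2 when |e| <= delta and delta * (|e| - 0.5 * delta) otherwise. The default `delta=1.0` is an illustrative assumption.

```python
def huber(error, delta=1.0):
    """Huber loss: quadratic inside [-delta, delta], linear outside."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2
    return delta * (a - 0.5 * delta)

def huber_grad(error, delta=1.0):
    """Gradient w.r.t. the error: the error itself inside, clipped to +/-delta outside."""
    if abs(error) <= delta:
        return error
    return delta if error > 0 else -delta

for e in (0.5, 1.0, 3.0, -3.0):
    print(e, huber(e), huber_grad(e))
```

The two branches agree at |e| = delta, so the loss and its gradient are both continuous there.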

### Pseudo-Huber Loss Function
It is a smooth approximation to the Huber loss function. This loss function attempts to take the best of the L1 and L2 by behaving like a quadratic near the target and being less steep, close to linear, for extreme values. The form depends on an extra parameter, delta, which dictates how steep it will be.

![image.png](attachment:image.png)

![image.png](attachment:image.png)
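A minimal sketch of the Pseudo-Huber loss, assuming the common form delta^2 * (sqrt(1 + (e / delta)^2) - 1). The default `delta=1.0` is an assumption for illustration.

```python
import math

def pseudo_huber(error, delta=1.0):
    """Pseudo-Huber loss: delta^2 * (sqrt(1 + (error / delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (error / delta) ** 2) - 1.0)

# Near zero it tracks 0.5 * e^2; for large errors it grows roughly like delta * |e|.
for e in (0.01, 1.0, 1000.0):
    print(e, pseudo_huber(e))
```

Unlike the Huber loss, there is no switch between branches, so derivatives of every order are continuous.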

## Binary Classification Loss Functions

### 4. Hinge Loss
Hinge loss for an input-output pair (x, y) is given as:
![image.png](attachment:image.png)

Hinge loss is primarily used with Support Vector Machine (SVM) Classifiers with class labels -1 and 1. So make sure you change the label of the ‘Malignant’ class in the dataset from 0 to -1.

### Hinge Loss not only penalizes the wrong predictions but also the right predictions that are not confident.

Squared loss, whose minimizer is the mean, is more sensitive to outliers: it penalises them intensely. This results in slower convergence compared with hinge loss or cross-entropy.

The hinge loss penalises data points lying on the wrong side of the hyperplane linearly. Hinge loss is not differentiable at the hinge point (where y·f(x) = 1), so methods that rely on gradients, such as stochastic gradient descent (SGD), have to fall back on a subgradient there. Cross-entropy (log loss) can be used instead: it is convex like hinge loss, differentiable everywhere, and can be minimised using SGD.

![image.png](attachment:image.png)
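The sketch below compares hinge loss with the logistic (log) loss for labels in {-1, +1}. It assumes the standard forms max(0, 1 - y * f(x)) and ln(1 + exp(-y * f(x))); the scores are invented to show a confident correct prediction, an unconfident correct one, and a wrong one.

```python
import math

def hinge_loss(y, score):
    """Hinge loss for y in {-1, +1}; score is the raw classifier output f(x)."""
    return max(0.0, 1.0 - y * score)

def log_loss(y, score):
    """Logistic loss for y in {-1, +1}: ln(1 + exp(-y * score))."""
    return math.log1p(math.exp(-y * score))

for y, score in ((1, 2.0), (1, 0.3), (1, -1.0)):
    print(y, score, hinge_loss(y, score), round(log_loss(y, score), 4))
```

The second case is classified correctly but still incurs hinge loss because it falls inside the margin, while log loss is positive in every case and smooth everywhere.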

# Entropy

Generally, we use entropy to indicate disorder or uncertainty. It is measured for a random variable x with probability distribution p(x):

![image.png](attachment:image.png)
A greater value of entropy for a probability distribution indicates a greater uncertainty in the distribution. Likewise, a smaller value indicates a more certain distribution.

The negative sign makes the overall quantity non-negative (log p(x) is never positive for a probability), which keeps comparisons between distributions straightforward.
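A small sketch of entropy for a discrete distribution. It assumes base-2 logarithms (entropy in bits) and treats 0 * log(0) as 0; the base used in the formula image is not stated in these notes, so that choice is an assumption.

```python
import math

def entropy(p):
    """Shannon entropy in bits: -sum p_i * log2(p_i), skipping zero-probability terms."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

print(entropy([0.5, 0.5]))   # most uncertain two-outcome distribution
print(entropy([0.9, 0.1]))   # more certain, lower entropy
print(entropy([1.0, 0.0]))   # fully certain
```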

![image.png](attachment:image.png)

![image.png](attachment:image.png)