# Popular types of Loss Functions

A loss function is a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your loss function will output a higher number. If they're pretty good, it'll output a lower number.

## 1. Least Square Error :
Least Square Error (LSE), also known as L2 loss, is the sum of the squared differences between the actual and predicted values. It is usually used for regression problems.

![title](img/l1.png)


This error penalises outliers heavily because each error is squared, so a single large error can dominate the total. If the data contains a lot of outliers, this loss should not be used.
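To make the outlier sensitivity concrete, here is a minimal Python sketch of the L2 loss as described above (sum of squared differences). The function name `l2_loss` and the sample values are illustrative assumptions, not part of the original notes.

```python
def l2_loss(y_true, y_pred):
    """Sum of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


y_true = [1.0, 2.0, 3.0, 4.0]

# Four small errors of 0.5 each: 4 * 0.25 = 1.0
small_errors = l2_loss(y_true, [1.5, 2.5, 2.5, 3.5])

# Same predictions, except the last one is off by 10: 0.75 + 100 = 100.75
one_outlier = l2_loss(y_true, [1.5, 2.5, 2.5, 14.0])

print(small_errors, one_outlier)
```

A single bad prediction accounts for almost the entire loss, which is the behaviour the paragraph above warns about.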



## 2. Least Absolute Error :

Least Absolute Error (LAE), also known as L1 loss, is the sum of the absolute differences between the actual and predicted values. It is also used for regression problems. It is more robust to outliers than LSE because the penalty grows linearly with the error rather than quadratically.

![title](img/l2.png)
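As a rough illustration, the L1 loss can be sketched in a few lines of Python, assuming the usual definition (sum of absolute differences). The name `l1_loss` and the sample values are hypothetical.

```python
def l1_loss(y_true, y_pred):
    """Sum of absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))


y_true = [1.0, 2.0, 3.0, 4.0]

# Four small errors of 0.5 each: 4 * 0.5 = 2.0
small_errors = l1_loss(y_true, [1.5, 2.5, 2.5, 3.5])

# Same predictions, except the last one is off by 10: 1.5 + 10 = 11.5
one_outlier = l1_loss(y_true, [1.5, 2.5, 2.5, 14.0])

print(small_errors, one_outlier)
```

The outlier contributes in proportion to its size, not its square, so it inflates the loss far less than it would under a squared-error loss.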

## 3. Huber Loss:

Huber loss combines the best of L1 and L2 losses. It is quadratic for small errors and for larger errors it is linear.

$$
L_{\delta}(y, f(x)) =
\begin{cases}
  \frac{1}{2} (y - f(x))^{2}, & \text{if } |y - f(x)| \le \delta \\
  \delta\, |y - f(x)| - \frac{1}{2}\delta^{2}, & \text{if } |y - f(x)| > \delta
\end{cases}
$$
  
It is also more robust to outliers than LSE.
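The piecewise definition above translates directly into code. This is a minimal per-sample sketch; the function name `huber_loss` and the default `delta=1.0` are assumptions made for illustration.

```python
def huber_loss(y, f_x, delta=1.0):
    """Huber loss for one sample: quadratic near zero, linear in the tails."""
    residual = abs(y - f_x)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * residual - 0.5 * delta ** 2


print(huber_loss(0.0, 0.5))   # quadratic branch: 0.5 * 0.5**2 = 0.125
print(huber_loss(0.0, 3.0))   # linear branch: 1.0 * 3.0 - 0.5 = 2.5
```

The two branches agree at `residual == delta` (both give `0.5 * delta**2`), so the loss is continuous where they meet.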

### The graph of Huber loss for various values of $\delta$ is:

![title](img/huber.jpeg)

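Its gradient is proportional to the error for small errors and is capped at magnitude $\delta$ for large errors, so outliers cannot dominate the update. The threshold $\delta$ is a hyperparameter that has to be tuned, which can be computationally expensive, so Log-Cosh loss is often used as an alternative that does not need it.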

## 4. Hinge Loss :

This loss is usually used in SVM classification with class labels -1 and 1.  

![title](img/hinge.jpg)

It penalises not only wrong predictions but also correct predictions that are not confident enough.
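The behaviour described above can be sketched as follows, assuming the standard form $\max(0, 1 - y \cdot f(x))$ with labels $y \in \{-1, 1\}$ and a raw classifier score $f(x)$. The function name `hinge_loss` is illustrative.

```python
def hinge_loss(y, score):
    """Hinge loss for one sample; y is -1 or 1, score is the raw model output."""
    return max(0.0, 1.0 - y * score)


confident_right = hinge_loss(1, 2.0)    # margin >= 1, no penalty
hesitant_right = hinge_loss(1, 0.3)     # correct side, but inside the margin
wrong = hinge_loss(1, -0.5)             # wrong side of the boundary

print(confident_right, hesitant_right, wrong)
```

The second case is classified correctly yet still incurs a loss, because its margin is below 1.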

### The resulting graph is:

![title](img/hinge.svg)

## 5. Binary Cross Entropy :

We use binary cross entropy as the loss for models that output a probability $\hat{y}_{i}$.

![title](img/BinaryCrossEntropy.png)

It is also known as log loss. This is used for binary classification as the name suggests.
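A minimal sketch of log loss, assuming the usual definition: the mean over samples of $-[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)]$. The name `binary_cross_entropy` and the clipping constant `eps` are assumptions added to keep `log(0)` from occurring.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1]
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
bad = binary_cross_entropy(labels, [0.2, 0.7, 0.4])
print(good < bad)  # confident, correct probabilities give a lower loss
```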

### The resulting graph is:

![title](img/logloss.png)

## 6. MultiClass Cross Entropy :

![title](img/mce.jpg)

This is a generalisation of binary cross entropy to more than two classes.
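For a single sample, the generalisation can be sketched as below, assuming the common form $-\sum_{c} y_{c} \log \hat{y}_{c}$ with a one-hot label vector. The function name `cross_entropy` and the `eps` floor are illustrative assumptions.

```python
import math


def cross_entropy(y_onehot, probs, eps=1e-12):
    """Cross entropy for one sample: one-hot label vs. predicted distribution."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))


# Three classes, the true class is index 1.
loss = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

# With a one-hot label only the true class contributes: -log(0.7).
print(abs(loss - (-math.log(0.7))) < 1e-12)
```

With two classes and probabilities `[p, 1 - p]`, the same expression reduces to the binary cross entropy of a single sample.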

## 7. KL-Divergence

The Kullback-Leibler divergence is a measure of how one probability distribution differs from another. A KL-divergence of zero indicates that the distributions are identical.

![title](img/kl.jpg)

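KL-divergence cannot be used as a distance metric because it is not symmetric: in general the divergence of P from Q differs from the divergence of Q from P. It also does not satisfy the triangle inequality.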
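One reason KL-divergence is not a distance metric is that it is asymmetric. A small sketch for discrete distributions, assuming the standard definition $\sum_i p_i \log(p_i / q_i)$; the name `kl_divergence` and the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(forward, reverse)        # two different values
print(kl_divergence(p, p))     # identical distributions give 0.0
```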

## 8. Exponential Loss

The exponential loss is the loss that the AdaBoost algorithm greedily minimises. Its mathematical form is:

![title](img/exp.png) 

### The resulting graph is:

![title](img/exp.jpeg)
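A per-sample sketch of the exponential loss, assuming the commonly used form $e^{-y f(x)}$ with labels $y \in \{-1, 1\}$ (the formula image may average or sum this over samples). The function name `exponential_loss` is illustrative.

```python
import math


def exponential_loss(y, score):
    """Exponential loss for one sample; y is -1 or 1, score is the raw output."""
    return math.exp(-y * score)


print(exponential_loss(1, 2.0))    # correct and confident: small loss
print(exponential_loss(1, 0.0))    # on the decision boundary: exactly 1.0
print(exponential_loss(1, -2.0))   # wrong and confident: loss grows exponentially
```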

## 9. Log-Cosh Loss :

This is another loss function used in regression. It is smooth everywhere (twice differentiable), unlike L1 loss, which has a kink at zero. Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.

![title](img/logcosh.png)

$\log(\cosh(x))$ is approximately equal to $x^{2}/2$ for small $x$ and to $|x| - \log(2)$ for large $x$. This means that log-cosh works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction.
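The two approximations above can be checked numerically. The sketch below uses the identity $\log(\cosh(x)) = |x| + \log(1 + e^{-2|x|}) - \log 2$ so that large errors do not overflow; that rewriting and the name `log_cosh` are choices made here for illustration.

```python
import math


def log_cosh(x):
    """log(cosh(x)) computed without overflow for large |x|."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


small = 0.01
large = 20.0

# Close to x**2 / 2 for small errors ...
print(log_cosh(small), small ** 2 / 2)

# ... and close to |x| - log(2) for large errors.
print(log_cosh(large), large - math.log(2.0))
```

Calling `math.cosh` directly raises `OverflowError` for inputs around 1000, which is why the rewritten form is used in this sketch.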

### The resulting graph of the loss is:

![title](img/logcosh1.png)


## 10. Quantile regression loss function :

This is a regression loss function that is applied to predict quantiles. A quantile is the value below which a given fraction of the values in a distribution fall; for example, 90% of the values lie below the 0.9 quantile.

Given a prediction $y_{i}^{p}$ and outcome $y_{i}$, the mean regression loss for a quantile q is:

![title](img/quantile.png)
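A minimal sketch of this loss, assuming the usual "pinball" form: under-predictions are weighted by $q$ and over-predictions by $1 - q$, averaged over samples. The function name `quantile_loss` and the example values are hypothetical.

```python
def quantile_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for a quantile q in (0, 1)."""
    total = 0.0
    for y, yp in zip(y_true, y_pred):
        diff = y - yp
        # diff >= 0: prediction too low, weight q
        # diff <  0: prediction too high, weight (1 - q)
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# For q = 0.9, predicting too low costs nine times more than predicting too high.
too_low = quantile_loss([2.0], [1.0], q=0.9)
too_high = quantile_loss([1.0], [2.0], q=0.9)
print(too_low, too_high)
```

With `q = 0.5` both sides are weighted equally and the loss is half the mean absolute error.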

# Popular types of activation functions

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated or not, based on whether each neuron's input is relevant for the model's prediction.

## 1. Sigmoid Function:

A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point and exactly one inflection point. A sigmoid “function” and a sigmoid “curve” refer to the same object. Additionally, the output of the logistic sigmoid lies between 0 and 1, so it is not zero-centered.

![title](img/sigmoid_function.svg)
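A small sketch of the logistic sigmoid and its derivative, assuming the standard definition $\sigma(x) = 1 / (1 + e^{-x})$. The two-branch form is a common way to avoid overflow for very negative inputs; the function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), always non-negative."""
    s = sigmoid(x)
    return s * (1.0 - s)


print(sigmoid(0.0))       # 0.5, the inflection point
print(sigmoid_grad(0.0))  # 0.25, the largest slope
```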

### The resulting graph is:

![title](img/Sigmoid.png)

## 2. Tanh

Tanh, or the hyperbolic tangent activation function, is similar to the logistic sigmoid, but its output is zero-centered. The range of the tanh function is from -1 to 1, and tanh is also sigmoidal (S-shaped). The tanh function is mainly used for classification between two classes. It is monotonic, while its derivative is not. It is also differentiable.

![title](img/tanh.png)
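The relationship to the logistic sigmoid can be checked with a short sketch. It relies on the identity $\tanh(x) = 2\sigma(2x) - 1$ and the derivative $1 - \tanh^2(x)$; the helper names are assumptions for illustration.

```python
import math


def logistic(x):
    """Plain logistic sigmoid (fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2


x = 0.7
# tanh is a rescaled, zero-centered sigmoid.
print(math.tanh(x), 2.0 * logistic(2.0 * x) - 1.0)

# The derivative peaks at 0 and falls off on both sides, so it is not monotonic.
print(tanh_grad(-1.0), tanh_grad(0.0), tanh_grad(1.0))
```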

### The resulting graph is:

![title](img/TanH.png)


## 3. ReLU: 

The rectified linear activation function, or ReLU for short, is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise. Because its gradient is 1 for every positive input, ReLU mitigates the vanishing gradient problem, allowing models to learn faster and perform better. Both the function and its derivative are monotonic. ReLU is one of the most widely used activation functions and appears in almost all convolutional neural networks and many other deep learning models.

![title](img/relu.png)
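A minimal scalar sketch of ReLU and its derivative as described above. The function names are illustrative, and setting the derivative at exactly 0 to 0 is a convention assumed here.

```python
def relu(x):
    """Output the input if it is positive, otherwise zero."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


for value in (-2.0, 0.0, 3.5):
    print(value, relu(value), relu_grad(value))
```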

### The resulting graph is: 

![title](img/ReLU.png)

## 4. Leaky ReLU:

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU has a small slope (of 0.01 or so) in the negative region. Both the leaky ReLU function and its derivative are monotonic. The function computes

![title](img/lrelu.png)
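A scalar sketch of the leaky ReLU, assuming the usual form: $x$ for positive inputs and $\alpha x$ otherwise, with the fixed slope $\alpha = 0.01$ mentioned above. The function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, a small fixed slope alpha otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha otherwise (never exactly zero)."""
    return 1.0 if x > 0 else alpha


print(leaky_relu(3.0), leaky_relu(-2.0))
print(leaky_relu_grad(3.0), leaky_relu_grad(-2.0))
```

Because the gradient for negative inputs is `alpha` and not zero, a unit that receives only negative inputs can still be updated.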

### The resulting graph is: 

![title](img/Leaky_ReLU.png)

## 5. Parameterised ReLU:

Parameterised ReLU, or Parametric ReLU (PReLU), is a variant of ReLU. It is similar to Leaky ReLU, with a slight change in how it deals with negative input values. It aims to solve the problem of the gradient becoming zero for the left half of the axis. It introduces a new parameter, learned during training, as the slope of the negative part of the function.

![title](img/prelu.png)
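What distinguishes PReLU from leaky ReLU is that the negative slope is a trainable parameter, so it needs a gradient of its own. A minimal sketch, assuming the usual form ($x$ for positive inputs, $\alpha x$ otherwise); the function names are hypothetical.

```python
def prelu(x, alpha):
    """Identity for positive inputs, learnable slope alpha otherwise."""
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x):
    """Derivative of the output with respect to alpha: x for x <= 0, else 0."""
    return 0.0 if x > 0 else x


print(prelu(-2.0, alpha=0.25))   # -0.5
print(prelu_grad_alpha(-2.0))    # -2.0: negative inputs drive updates to alpha
print(prelu_grad_alpha(3.0))     # 0.0: positive inputs do not affect alpha
```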

### The resulting graph is: 

![title](img/Parametric_ReLU.png)

## 6. Exponential Linear Unit:

Exponential Linear Unit, or ELU for short, is also a variant of the Rectified Linear Unit (ReLU) that modifies the slope of the negative part of the function. Unlike the leaky ReLU and parametric ReLU functions, which use a straight line, ELU uses an exponential curve for negative inputs.

![title](img/elu.svg)
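A scalar sketch of ELU, assuming the standard definition: $x$ for positive inputs and $\alpha (e^{x} - 1)$ otherwise. The name `elu` and the default `alpha=1.0` are assumptions.

```python
import math


def elu(x, alpha=1.0):
    """Identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


print(elu(2.0))      # positive inputs pass through unchanged
print(elu(-1.0))     # exp(-1) - 1, roughly -0.632
print(elu(-50.0))    # saturates towards -alpha for very negative inputs
```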

### The resulting graph is: 

![title](img/Exponential_LU.png)


## 7. Swish:

Swish is a lesser-known activation function proposed by researchers at Google. Swish is reported to be about as computationally efficient as ReLU and to perform better than ReLU on deeper models. Its output is unbounded above and bounded below: it ranges from a minimum of about -0.28 up to infinity. This function is not monotonic.

![title](img/swish.svg)
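A scalar sketch of Swish, assuming the common form $x \cdot \sigma(\beta x)$ with $\beta = 1$ by default. The helper names and the overflow-safe sigmoid are choices made for this illustration.

```python
import math


def _sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * _sigmoid(beta * x)


# Moving right from -5 the output first decreases, then increases again,
# so the function is not monotonic.
print(swish(-5.0), swish(-1.0), swish(0.0), swish(2.0))
```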

### The resulting graph is: 

![title](img/Swish.png)


## 8. Softmax:

The softmax function, also known as softargmax or normalized exponential function, is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression and is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

![title](img/softmaxequation.jpg)
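A minimal sketch of softmax over a list of scores, assuming the standard definition $e^{z_i} / \sum_j e^{z_j}$. Subtracting the maximum before exponentiating is a common stability trick that does not change the result; the function name and sample scores are illustrative.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into a probability distribution."""
    m = max(logits)                      # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))                 # entries are positive and sum to 1
print(softmax([1000.0, 1000.0]))         # large scores do not overflow
```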

### The resulting graph is: 

![title](img/softmax.jpg)


## 9. Softplus:

The softplus function, $\log(1 + e^{x})$, is a smooth approximation of ReLU. Its output is always positive, it approaches zero for very negative inputs and approaches $x$ for large positive inputs, and its derivative is the logistic sigmoid.

![title](img/softplus.svg)
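A scalar sketch of softplus, assuming the standard definition $\log(1 + e^{x})$. It is written as $\max(x, 0) + \log(1 + e^{-|x|})$, an equivalent form that avoids overflow; the function name is illustrative.

```python
import math


def softplus(x):
    """Softplus, log(1 + exp(x)), computed without overflow for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


print(softplus(0.0))     # log(2)
print(softplus(50.0))    # very close to 50, like ReLU
print(softplus(-50.0))   # tiny but still positive, unlike ReLU
```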

### The resulting graph is: 

![title](img/activation-softplus.png)

## 10. SoftSign:

The Softsign function is an activation function that squashes its input into the range between -1 and 1, just like a sigmoid function. Its advantage is that the output is zero-centered, which helps the next neuron during propagation.

![title](img/softsign.svg)
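A scalar sketch of softsign, assuming the standard definition $x / (1 + |x|)$. The function name is illustrative.

```python
def softsign(x):
    """Softsign activation: x / (1 + |x|), with output strictly inside (-1, 1)."""
    return x / (1.0 + abs(x))


print(softsign(0.0))    # 0.0, the output is zero-centered
print(softsign(1.0))    # 0.5
print(softsign(-3.0))   # -0.75
```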

### The resulting graph is: 

![title](img/softsign.png)