## Definition:
* A loss function in machine learning measures how far our model's predictions are from the expected outcome, i.e., the ground truth.
* Most machine learning algorithms rely on minimizing or maximizing a function, which we call the “objective function”. Objective functions that are minimized are called “loss functions”.
* No single loss function works for all kinds of data. The choice depends on a **number of factors, including the presence of outliers, the choice of machine learning algorithm, the time efficiency of gradient descent, the ease of finding the derivatives, and the confidence of predictions**.


* The terms cost function and loss function are closely related and are often used interchangeably. Strictly speaking, the loss function is calculated for a single sample by comparing its predicted output with its actual value, whereas the cost function is the average of the loss values over all samples.
* The loss function is directly tied to the predictions of the model we have built: the lower the loss value, the closer the predictions are to the ground truth. The cost function is therefore the quantity we minimize in order to improve the model's performance.

* Loss functions can be broadly categorized into 2 types:

  1) Classification Loss

  2) Regression Loss
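
To make the loss versus cost distinction above concrete, here is a minimal sketch. It assumes squared error as the per-sample loss purely for illustration, and the function names `squared_error_per_sample` and `cost` are hypothetical, not part of any library.

```python
import numpy as np

def squared_error_per_sample(y, y_hat):
    """Loss: one value per training sample (squared error is an assumed example)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return (y - y_hat) ** 2

def cost(per_sample_losses):
    """Cost: the average of the per-sample loss values."""
    return float(np.mean(per_sample_losses))

y = np.array([3.0, 5.0, 2.0])       # actual values
y_hat = np.array([2.0, 5.0, 4.0])   # predicted values

losses = squared_error_per_sample(y, y_hat)  # array with one loss per sample
total_cost = cost(losses)                    # single number for the whole dataset
```

Here `losses` holds three separate values, while `total_cost` condenses them into the single number that training would try to reduce.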
  
Notations:
* n or m: number of training samples.
* log: the natural logarithm.
* i: index of the ith training sample in a dataset.
* y(i): actual value for the ith training sample.
* ŷ(i): predicted value for the ith training sample.
    
## Classification Loss
Some important types of classification loss:

### 1) Binary Cross-Entropy Loss / Log Loss:
* This is the most common Loss function used in Classification problems.
* The cross-entropy loss decreases as the predicted probability converges to the actual label.
* It measures the performance of a classification model whose predicted output is a probability value between 0 and 1.
* For example, predicting a probability of 0.012 when the actual observation label is 1 would be bad and would result in a high loss value. A perfect model would have a log loss of 0.

![cross_entropy.png](attachment:cross_entropy.png)
* The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases.
* Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.

![log_loss_formula.png](attachment:log_loss_formula.png)
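
As a rough illustration of the points above, the standard binary cross-entropy can be sketched in NumPy as follows. The clipping constant `eps` is an assumption added here to avoid `log(0)`, and the function name is hypothetical.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Average log loss for labels y in {0, 1} and predicted probabilities y_hat."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities away from exactly 0 and 1 so the logarithm stays finite
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    per_sample = -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    return float(np.mean(per_sample))

# A confident wrong prediction is penalized far more than a confident right one
bad = binary_cross_entropy([1], [0.012])   # true label 1, predicted probability 0.012
good = binary_cross_entropy([1], [0.99])   # true label 1, predicted probability 0.99
```

With these inputs `bad` is roughly 4.42 while `good` is close to 0.01, which matches the shape of the curve shown above.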

### 2) Hinge Loss:
* The second most common loss function used for classification problems, and an alternative to the cross-entropy loss function, is Hinge Loss, primarily developed for Support Vector Machine (SVM) models.
![hingeLoss.jpeg](attachment:hingeLoss.jpeg)

* Hinge Loss penalizes not only wrong predictions but also right predictions that are not confident. It is primarily used with SVM classifiers, with class labels of -1 and 1. If your labels are encoded as 0 and 1 (for example, with the malignant class as 0), make sure you change the 0 labels to -1.
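
A minimal sketch of the hinge loss, assuming the common form max(0, 1 - y * score) averaged over samples, where `scores` are raw classifier outputs rather than probabilities. The helper `to_signed_labels` is a hypothetical convenience for relabeling 0/1 classes as -1/1.

```python
import numpy as np

def to_signed_labels(labels01):
    """Map class labels {0, 1} to {-1, 1}."""
    return 2 * np.asarray(labels01) - 1

def hinge_loss(y, scores):
    """Average hinge loss for labels y in {-1, 1} and raw decision scores."""
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(np.maximum(0.0, 1.0 - y * scores)))

y = to_signed_labels([1, 1, 0])          # becomes [1, 1, -1]
confident = hinge_loss(y, [2.0, 3.0, -2.5])   # all correct with margin >= 1
hesitant = hinge_loss(y, [0.5, 0.5, -0.5])    # all correct, but inside the margin
```

Both sets of scores classify every sample correctly, yet only the confident one reaches a loss of 0; the hesitant one is still penalized with a loss of 0.5.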

## Regression Loss

Some important types of regression loss:

### 1) Mean Absolute Error / L1 Loss:
* The MAE loss function is defined as the average of absolute differences between the actual and the predicted value.
* It is the second most commonly used Regression loss function. 
* It measures the average magnitude of errors in a set of predictions, without considering their directions.
![MAE_with_y.png](attachment:MAE_with_y.png)

* The MAE Loss function is more robust to outliers compared to MSE Loss function. Therefore, it should be used if the data is prone to many outliers.

In [1]:
# Code for MAE
import numpy as np

def MAE(yHat, y):
    # Average of the absolute differences between predictions and actual values
    return np.sum(np.absolute(yHat - y)) / y.size

### 2) Mean Square Error / Quadratic Loss / L2 Loss:
* It is the most commonly used Regression loss function.
* MSE loss function is defined as the average of squared differences between the actual and the predicted value.
![MSE.png](attachment:MSE.png)

* The MSE Loss function penalizes the model for making large errors by squaring them and this property makes the MSE cost function less robust to outliers. Therefore, it should not be used if the data is prone to many outliers.
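
The outlier sensitivity described above can be checked with a small sketch. `MSE` is written here in the same style as the MAE code, and the toy arrays are made-up values chosen only for illustration.

```python
import numpy as np

def MSE(yHat, y):
    # Average of the squared differences between predictions and actual values
    return np.sum((yHat - y) ** 2) / y.size

def MAE(yHat, y):
    # Average of the absolute differences, repeated here for comparison
    return np.sum(np.absolute(yHat - y)) / y.size

y = np.array([1.0, 2.0, 3.0, 4.0])
yHat_clean = np.array([1.5, 2.5, 2.5, 3.5])      # every error has magnitude 0.5
yHat_outlier = np.array([1.5, 2.5, 2.5, 14.0])   # one prediction is off by 10

mse_clean, mse_outlier = MSE(yHat_clean, y), MSE(yHat_outlier, y)
mae_clean, mae_outlier = MAE(yHat_clean, y), MAE(yHat_outlier, y)
```

A single bad prediction raises MAE from 0.5 to 2.875, but raises MSE from 0.25 to 25.1875, a much larger relative jump.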

### 3) Huber Loss / Smooth Mean Absolute Error:
* This loss function combines the MSE and MAE loss functions: it approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers).
* It is essentially Mean Absolute Error that becomes quadratic when the error is small. How small the error must be to be treated quadratically is controlled by a hyperparameter, 𝛿 (delta), which can be tuned.
![huberLoss.png](attachment:huberLoss.png)
* Here the choice of the delta value is critical because it determines what you’re willing to consider as an outlier.
* The Huber Loss function can be less sensitive to outliers than the MSE Loss function, depending on the value of the hyperparameter. Therefore, it can be used if the data is prone to outliers, although delta may need to be tuned, which is an iterative process.
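
A minimal sketch of the Huber loss, assuming the usual piecewise definition: 0.5 * error² when |error| <= delta, and delta * (|error| - 0.5 * delta) otherwise. The function name and the default `delta=1.0` are illustrative choices.

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Average Huber loss: quadratic for small errors, linear for large ones."""
    error = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                  # used where |error| <= delta
    linear = delta * (abs_error - 0.5 * delta)    # used where |error| > delta
    return float(np.mean(np.where(abs_error <= delta, quadratic, linear)))

small = huber_loss([0.0], [0.5], delta=1.0)   # inside delta: 0.5 * 0.5**2
large = huber_loss([0.0], [3.0], delta=1.0)   # outside delta: 1.0 * (3.0 - 0.5)
```

With `delta=1.0`, the error of 3.0 contributes 2.5 instead of the 4.5 that the quadratic branch would give, which is how delta decides what gets treated as an outlier.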

### 4) Log-Cosh Loss:
* This function is defined as the logarithm of the hyperbolic cosine of the prediction error. It is another smooth function used in regression tasks, and unlike MSE Loss it grows only linearly for large errors.
* It has all the advantages of Huber loss and, unlike Huber loss, it is twice differentiable everywhere. This matters because some learning algorithms, such as XGBoost, use Newton’s method to find the optimum and therefore need the second derivative (Hessian).
![log_cosh.png](attachment:log_cosh.png)
* log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. This means that ‘logcosh’ works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction.
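
The two approximations above can be checked numerically. This sketch uses the identity log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2), chosen here as an assumption to avoid overflow in `cosh` for large errors; the function name is hypothetical.

```python
import numpy as np

def log_cosh_loss(y, y_hat):
    """Average log-cosh loss, computed in an overflow-safe form."""
    error = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    abs_error = np.abs(error)
    # log(cosh(x)) rewritten as |x| + log1p(exp(-2|x|)) - log(2)
    per_sample = abs_error + np.log1p(np.exp(-2.0 * abs_error)) - np.log(2.0)
    return float(np.mean(per_sample))

tiny = log_cosh_loss([0.0], [0.01])    # close to 0.01**2 / 2
huge = log_cosh_loss([0.0], [20.0])    # close to 20 - log(2)
```

For the small error the result is near the quadratic value, and for the large error it is near the linear value, so a single wild prediction adds roughly its absolute size rather than its square.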

### 5) Quantile Loss:
* A quantile is a value below which a fraction of samples in a group falls. Machine learning models work by minimizing (or maximizing) an objective function.
* As the name suggests, the quantile regression loss function is applied to predict quantiles. For a set of predictions, the overall loss is the average of the per-sample quantile losses.
* Quantile loss function turns out to be useful when we are interested in predicting an interval instead of only point predictions.
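
A minimal sketch of a quantile loss, assuming the common form often called the pinball loss: for a target quantile `q` in (0, 1), under-predictions are weighted by `q` and over-predictions by `1 - q`. The function name and the parameter name `q` are illustrative.

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    """Average pinball loss for target quantile q (0 < q < 1)."""
    error = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    # error > 0 means the prediction was too low; error < 0 means too high
    return float(np.mean(np.maximum(q * error, (q - 1.0) * error)))

under = quantile_loss([10.0], [8.0], q=0.9)   # predicted too low for the 90th percentile
over = quantile_loss([10.0], [12.0], q=0.9)   # predicted too high by the same amount
```

For `q=0.9` the under-prediction costs about 1.8 and the equally sized over-prediction only about 0.2, which pushes the model toward the upper end of the distribution. Fitting one model with a low `q` and another with a high `q` is one way to obtain the prediction interval mentioned above.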