## Loss Function:
The loss function is a mathematical function that computes the error for a single training example. In other words, it quantifies the difference between the predicted output and the actual output (target) for a given input.

<div style="text-align:center">

![image.png](attachment:image-2.png) 

Source: https://www.datarobot.com/wp-content/uploads/2022/02/word-image-5.png</div>

## Cost Function:
The cost function is the average of the loss over the entire training set:

\begin{equation}
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(y^{(i)}, \hat{y}^{(i)})
\end{equation}

Where:
- $w$ and $b$ are the model parameters (weights and bias).
- $m$ is the number of training examples.
- $y^{(i)}$ is the true label for the $i$-th example.
- $\hat{y}^{(i)}$ is the predicted label for the $i$-th example.
- $L(y, \hat{y})$ is the loss function that measures the difference between the true label $y$ and the predicted label $\hat{y}$.
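
The relationship between the per-example loss $L$ and the cost $J$ can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `squared_error` and `cost` are made up for this example, and squared error is used only as one possible choice of $L$.

```python
import numpy as np

def squared_error(y, y_hat):
    """Per-example loss L(y, y_hat); squared error is just one possible choice."""
    return (y - y_hat) ** 2

def cost(y, y_hat, loss=squared_error):
    """Cost J: the mean of the per-example losses over all m examples."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean(loss(y, y_hat)))

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

per_example = squared_error(y_true, y_pred)  # one loss value per example
J = cost(y_true, y_pred)                     # a single number for the whole set

print(per_example)  # losses 1, 0 and 4
print(J)            # their mean, 5/3
```

The loss returns one value per example, while the cost collapses those values into a single number that an optimizer can minimize.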


## Loss Functions for Regression

### Mean Squared Error (MSE)
The Mean Squared Error (MSE) is a widely used loss function in machine learning for regression tasks. It measures the average of the squared differences between the predicted values and the true values.

Mathematically, the MSE is defined as:

\begin{equation}
MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2
\end{equation}

where $y_i$ is the true value for the $i$-th example and $\hat{y}_i$ is the predicted value for the $i$-th example. $m$ is the number of examples in the dataset.

The MSE is a non-negative value, and a smaller MSE indicates better performance of the model. The MSE loss function is used in various regression models, such as linear regression, neural networks, and decision trees. The goal is to minimize the MSE by adjusting the model parameters to improve the predictions.

One limitation of the MSE is that squaring gives large errors disproportionately more weight than small ones, so a few outliers can dominate the loss. This can be problematic if the dataset contains outliers or if the model needs to prioritize reducing errors for certain regions of the input space. In such cases, alternative loss functions, such as the Mean Absolute Error (MAE) or Huber loss, can be used.
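
To make the effect of a single large error concrete, here is a small NumPy sketch of the MSE formula above. The helper name `mse` and the toy arrays are assumptions made for this illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
clean_pred = np.array([1.5, 2.5, 2.5, 3.5])     # every error has magnitude 0.5
outlier_pred = np.array([1.5, 2.5, 2.5, 14.0])  # last error has magnitude 10

mse_clean = mse(y_true, clean_pred)      # 0.25
mse_outlier = mse(y_true, outlier_pred)  # (3 * 0.25 + 100) / 4 = 25.1875

print(mse_clean, mse_outlier)
```

Only one of the four predictions changed, yet the squared term for that prediction accounts for almost all of the resulting loss.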

### Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) is a commonly used loss function in machine learning for regression tasks, similar to the Mean Squared Error (MSE). Instead of measuring the average of the squared differences between the predicted values and the true values (as in MSE), the MAE measures the average of the absolute differences:

\begin{equation}
MAE = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|
\end{equation}

where $y_i$ is the true value for the $i$-th example and $\hat{y}_i$ is the predicted value for the $i$-th example. $m$ is the number of examples in the dataset.

The MAE is more robust to outliers than the MSE because each error contributes in proportion to its magnitude rather than its square. This makes it a better choice when outliers are present in the dataset. In addition, the MAE is easier to interpret than the MSE because it gives the average magnitude of the errors in the same units as the target variable.

However, one disadvantage of the MAE is that it is not differentiable at zero error and its gradient has the same magnitude for small and large errors. As a result, gradient-based optimization can be less stable close to the minimum, and the MAE penalizes a few large errors less heavily than the MSE does.

Overall, the choice between the MAE and the MSE depends on the specific problem and the characteristics of the dataset. In some cases, it might be beneficial to use both loss functions and compare the results to choose the best one.
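
One way to compare the two losses is to evaluate both on the same predictions, with and without a single large error. The sketch below assumes hypothetical helpers `mae` and `mse` and a toy dataset chosen for this illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true, dtype=float)
                                - np.asarray(y_pred, dtype=float))))

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return float(np.mean((np.asarray(y_true, dtype=float)
                          - np.asarray(y_pred, dtype=float)) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
clean_pred = np.array([1.5, 2.5, 2.5, 3.5])     # every error has magnitude 0.5
outlier_pred = np.array([1.5, 2.5, 2.5, 14.0])  # last error has magnitude 10

# How much each loss grows when the single large error is introduced.
mae_ratio = mae(y_true, outlier_pred) / mae(y_true, clean_pred)  # 2.875 / 0.5
mse_ratio = mse(y_true, outlier_pred) / mse(y_true, clean_pred)  # 25.1875 / 0.25

print(mae_ratio, mse_ratio)
```

On this toy data the MAE grows by a factor of about 6 while the MSE grows by a factor of about 100, which is the sense in which the MAE is less affected by outliers.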

### Huber loss

The Huber loss is a loss function that combines the advantages of the Mean Squared Error (MSE) and Mean Absolute Error (MAE) loss functions. It is commonly used in machine learning for regression tasks, especially when the dataset contains outliers or when the model needs to prioritize reducing errors for certain regions of the input space.

The Huber loss is defined as follows:

\begin{equation}
L_{\delta}(y, \hat{y}) =
\begin{cases}
\frac{1}{2}(y - \hat{y})^2, & \text{if } |y - \hat{y}| \leq \delta \\
\delta |y - \hat{y}| - \frac{1}{2}\delta^2, & \text{otherwise}
\end{cases}
\end{equation}

where $y$ is the true value and $\hat{y}$ is the predicted value. $\delta$ is a hyperparameter that determines the point at which the loss function transitions from behaving like the MSE loss to behaving like the MAE loss.

When $|y - \hat{y}| \leq \delta$, the loss is half the squared difference between $y$ and $\hat{y}$, which matches the squared-error loss up to the factor $\frac{1}{2}$. This is the region of small errors.

When $|y - \hat{y}| > \delta$, the loss is the absolute difference between $y$ and $\hat{y}$ multiplied by $\delta$, minus the constant $\frac{1}{2}\delta^2$. This is the region of large errors, where the penalty grows only linearly rather than quadratically. The constant $\frac{1}{2}\delta^2$ makes the two pieces meet at $|y - \hat{y}| = \delta$, so the loss is continuous there.

The Huber loss combines the benefits of the MSE and MAE loss functions: it is robust to outliers like the MAE loss, and it is smooth and differentiable around zero error like the MSE loss. The hyperparameter $\delta$ controls the trade-off: a large $\delta$ makes the loss behave like the MSE over most errors, while a small $\delta$ makes it behave like the MAE.

Overall, the Huber loss is a flexible and effective loss function that can be a good choice for regression tasks when the dataset has outliers or when the model needs to prioritize certain regions of the input space.
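
The piecewise definition of the Huber loss translates directly into NumPy. The following is a minimal sketch; the function name `huber` and the default `delta=1.0` are assumptions made for this illustration.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Per-example Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * err ** 2                  # used where |error| <= delta
    linear = delta * err - 0.5 * delta ** 2     # used where |error| > delta
    return np.where(err <= delta, quadratic, linear)

y_true = np.zeros(3)
y_pred = np.array([0.5, 1.0, 3.0])   # errors below, at, and above delta = 1

losses = huber(y_true, y_pred, delta=1.0)
print(losses)  # 0.125, 0.5 and 2.5
```

At an error of exactly $\delta$ both branches give $\frac{1}{2}\delta^2$, so the two pieces join without a jump. For the error of 3, the quadratic branch alone would give 4.5, whereas the Huber loss gives 2.5.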

## Loss Functions for Classification:

### Cross-entropy loss

Cross-entropy loss (also known as log loss) is a loss function commonly used in machine learning for classification tasks, especially in problems where the output variable is binary or contains multiple classes. It is a measure of how well the predicted probabilities of the classes match the true probabilities.

The cross-entropy loss is defined as follows:

<div style="text-align:center">

$J(w, b) = -\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{c} y_{ij} \log(\hat{y}_{ij})$ </div>

where $m$ is the number of examples in the dataset, $c$ is the number of classes, $y_{ij}$ is a binary indicator (0 or 1) whether the true label of example $i$ is class $j$, and $\hat{y}_{ij}$ is the predicted probability of example $i$ belonging to class $j$.

The outer summation runs over the $m$ examples and the inner summation runs over the $c$ classes. Because $y_{ij}$ is 1 only for the true class, the inner sum reduces to the negative logarithm of the probability the model assigns to the true class. The logarithm makes the penalty grow steeply as that probability approaches 0, so confident wrong predictions are penalized heavily, while predictions close to 1 for the true class incur a loss close to 0.

The cross-entropy loss is widely used for classification tasks because it has several desirable properties:

- It is a continuous, differentiable function that allows for efficient optimization using gradient-based methods.
- It is a surrogate for the 0-1 loss (the misclassification rate), which is what we ultimately want to minimize in classification but is difficult to optimize directly because it is piecewise constant and gives no useful gradient.
- It is a probabilistic measure that captures the uncertainty in the predicted probabilities.

Cross-entropy can also be adapted to problems where the classes are imbalanced or where misclassifying one class is more costly than misclassifying another. In such cases, the loss can be weighted to assign higher penalties to misclassifications of the minority class or the more important class. Binary cross-entropy and categorical cross-entropy are variants of the cross-entropy loss that are used for specific types of classification problems, depending on the number of classes in the output variable.
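
The idea of weighting the loss per class can be sketched as follows. This is only one simple weighting scheme, assumed for illustration: the function name `weighted_cross_entropy`, the `class_weights` values, and the choice to divide by the number of examples $m$ (some libraries normalize by the sum of the weights instead) are all assumptions of this example.

```python
import numpy as np

def weighted_cross_entropy(y_onehot, y_prob, class_weights, eps=1e-12):
    """Cross-entropy where each class term is scaled by a per-class weight."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    per_example = -np.sum(class_weights * y_onehot * np.log(y_prob), axis=1)
    return float(np.mean(per_example))

# Two examples, two classes; class 1 plays the role of the minority class.
y_onehot = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
y_prob = np.array([[0.9, 0.1],
                   [0.4, 0.6]])

unweighted = weighted_cross_entropy(y_onehot, y_prob, np.array([1.0, 1.0]))
weighted = weighted_cross_entropy(y_onehot, y_prob, np.array([1.0, 5.0]))

print(unweighted, weighted)
```

With the weight of class 1 raised to 5, the error on the second example contributes five times as much to the loss, so an optimizer is pushed harder to get that class right.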


#### Binary cross-entropy: 

Binary cross-entropy is used when the output variable is binary (i.e., has two possible values). It is defined as follows:

<div style="text-align:center">

$J(w, b) = -\frac{1}{m}\sum_{i=1}^{m} \left[y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)\right]$ </div>

where $m$ is the number of examples in the dataset, $y_i$ is the true binary label of example $i$ (either 0 or 1), and $\hat{y}_i$ is the predicted probability that example $i$ belongs to the positive class (i.e., has a label of 1).

For a single example, the loss is the negative logarithm of the predicted probability of the true class. If the true class is 1, the loss is $-\log(\hat{y}_i)$, and if the true class is 0, the loss is $-\log(1-\hat{y}_i)$. The penalty grows without bound as the predicted probability of the true class approaches 0, while predictions close to the true label incur a loss close to 0.
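
A minimal NumPy sketch of binary cross-entropy is shown below. The function names and the clipping constant `eps` are assumptions made for this illustration; clipping is a common way to avoid evaluating $\log(0)$, not part of the mathematical definition.

```python
import numpy as np

def bce_per_example(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy for each example, before averaging."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Cost J: mean binary cross-entropy over all examples."""
    return float(np.mean(bce_per_example(y_true, y_prob, eps)))

y_true = np.array([1.0, 0.0, 1.0])
y_prob = np.array([0.9, 0.1, 0.2])   # the last prediction is confidently wrong

per_example = bce_per_example(y_true, y_prob)
J = binary_cross_entropy(y_true, y_prob)

print(per_example)
print(J)
```

The first two examples each contribute $-\log(0.9)$, while the third contributes $-\log(0.2)$, which is roughly fifteen times larger.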


#### Categorical cross-entropy: 

Categorical cross-entropy, on the other hand, is used when the output variable is a categorical variable with more than two possible values. It is defined as follows:

<div style="text-align:center">

$J(w, b) = -\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{c} y_{ij} \log(\hat{y}_{ij})$ </div>

where $m$ is the number of examples in the dataset, $c$ is the number of classes, $y_{ij}$ is a binary indicator (0 or 1) whether the true label of example $i$ is class $j$, and $\hat{y}_{ij}$ is the predicted probability of example $i$ belonging to class $j$.

Because $y_{ij}$ is 1 only for the true class of example $i$, each example contributes the negative logarithm of the predicted probability of its true class. The penalty grows steeply as that probability approaches 0, while a probability close to 1 incurs a loss close to 0.
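
The double sum can be written compactly with one-hot labels. In the sketch below, the function name `categorical_cross_entropy`, the clipping constant `eps`, and the toy probabilities are assumptions made for this illustration.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Mean over examples of -sum_j y_ij * log(y_hat_ij)."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_onehot * np.log(y_prob), axis=1)))

labels = np.array([0, 2, 1])          # true class index of each example
y_onehot = np.eye(3)[labels]          # one row per example, one column per class
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3]])  # each row sums to 1

loss = categorical_cross_entropy(y_onehot, y_prob)

# Equivalent shortcut: pick out the probability assigned to the true class.
true_class_prob = y_prob[np.arange(len(labels)), labels]  # 0.7, 0.6, 0.5
shortcut = float(-np.mean(np.log(true_class_prob)))

print(loss, shortcut)
```

Both computations give the same value, which shows that only the probability assigned to the true class affects the loss when the labels are one-hot.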
