# Loss Functions for Regression and Classification

### 1. Mean Squared Error (MSE)
- **Type:** Regression
- **Mathematical Equation:** 
  $$
  \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  $$
  where $ y_i $ is the actual value, $ \hat{y}_i $ is the predicted value, and $ n $ is the number of observations.
- **How it works:** MSE is the average squared difference between the actual and predicted values. Because each error is squared, a few large errors dominate the total, which makes MSE sensitive to outliers.
- **Use case:** Used in linear regression and other regression problems where the goal is to minimize the prediction error.
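
The MSE formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name `mean_squared_error` and the sample values are made up for the example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative data: the errors are 0.5, 0.0, -2.0 and -1.0
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
print(mse)  # (0.25 + 0 + 4 + 1) / 4 = 1.3125
```

Note how the single error of size 2 contributes 4 of the 5.25 total squared error.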

### 2. Mean Absolute Error (MAE)
- **Type:** Regression
- **Mathematical Equation:** 
  $$
  \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
  $$
  where $ y_i $ is the actual value, $ \hat{y}_i $ is the predicted value, and $ n $ is the number of observations.
- **How it works:** MAE measures the average of the absolute differences between the actual and predicted values. It is more robust to outliers than MSE because each error contributes linearly rather than quadratically.
- **Use case:** Used in regression problems where robustness to outliers is desired.
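
A short NumPy sketch of MAE, with a comparison against the squared error on data containing one outlier. The function name `mean_absolute_error` and the data are assumptions chosen for illustration.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Illustrative data: three perfect predictions and one outlier error of 10
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 14.0])

mae = mean_absolute_error(y_true, y_pred)
mse = np.mean((y_true - y_pred) ** 2)  # squared-error counterpart, for comparison
print(mae)  # 10 / 4 = 2.5
print(mse)  # 100 / 4 = 25.0
```

The outlier raises MAE in proportion to its size, while it raises the squared error in proportion to its square.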

### 3. Huber Loss
- **Type:** Regression
- **Mathematical Equation:** 
  $$
  \text{Huber Loss} = \begin{cases} 
  \frac{1}{2} (y_i - \hat{y}_i)^2 & \text{for } |y_i - \hat{y}_i| \leq \delta \\
  \delta |y_i - \hat{y}_i| - \frac{1}{2} \delta^2 & \text{for } |y_i - \hat{y}_i| > \delta 
  \end{cases}
  $$
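  where $ \delta > 0 $ is a threshold parameter. The expression gives the loss for a single observation $ i $; the overall loss is its average over all $ n $ observations.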
- **How it works:** Huber loss combines MSE and MAE, being quadratic for small errors and linear for large errors. This makes it less sensitive to outliers than MSE while maintaining sensitivity for small errors.
- **Use case:** Used in regression problems where outliers are a concern but a loss that stays smooth and quadratic near zero (unlike MAE, which has a kink at zero) is still desired.
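
The piecewise definition above maps directly onto `np.where`. The sketch below is a minimal illustration that averages the per-observation loss; the function name `huber_loss` and the default `delta=1.0` are assumptions for the example.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                    # branch for small errors
    linear = delta * abs_error - 0.5 * delta ** 2   # branch for large errors
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

# One small error (0.5) and one large error (3.0), with delta = 1
loss = huber_loss([0.0, 0.0], [0.5, 3.0], delta=1.0)
print(loss)  # (0.125 + 2.5) / 2 = 1.3125
```

The two branches meet at $ |y_i - \hat{y}_i| = \delta $, where both evaluate to $ \frac{1}{2}\delta^2 $, so the loss is continuous.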

### 4. Logarithmic Loss (Log Loss)
- **Type:** Classification (Binary)
- **Mathematical Equation:** 
  $$
  \text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]
  $$
  where $ y_i $ is the actual binary label (0 or 1), $ \hat{p}_i $ is the predicted probability of the positive class, and $ n $ is the number of observations.
- **How it works:** Log loss measures the performance of a classification model whose output is a probability between 0 and 1. Confident but wrong predictions are penalized most heavily: the loss grows without bound as the predicted probability of the true class approaches 0.
- **Use case:** Used in binary classification problems, commonly in logistic regression and neural networks.
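
A minimal NumPy sketch of log loss follows. The clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the formula above. The function name `binary_log_loss` is also illustrative.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary log loss; p_pred is the predicted probability of class 1."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# The same true label (1), predicted with different confidence
confident_right = binary_log_loss([1], [0.99])
unsure = binary_log_loss([1], [0.5])
confident_wrong = binary_log_loss([1], [0.01])
print(confident_right, unsure, confident_wrong)  # loss rises sharply left to right
```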

### 5. Cross-Entropy Loss
- **Type:** Classification (Multi-class)
- **Mathematical Equation:** 
  $$
  \text{Cross-Entropy Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{p}_{i,c})
  $$
  where $ y_{i,c} $ is a binary indicator (0 or 1) if class label $ c $ is the correct classification for observation $ i $, $ \hat{p}_{i,c} $ is the predicted probability of class $ c $ for observation $ i $, $ n $ is the number of observations, and $ C $ is the number of classes.
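- **How it works:** Cross-entropy loss generalizes log loss to multiple classes; with $ C = 2 $ it reduces to log loss. It measures the performance of a classification model whose output is a probability distribution over classes. Because $ y_{i,c} $ is 1 only for the correct class, only the log-probability assigned to the correct class contributes to each observation's loss.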
- **Use case:** Used in multi-class classification problems, commonly in softmax regression and neural networks.
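
The double sum above can be sketched with one-hot labels and a matrix of predicted probabilities. This is an illustrative sketch; the function name `cross_entropy_loss`, the `eps` clipping, and the sample values are assumptions, and the rows of `p_pred` are assumed to already sum to 1 (for example, softmax outputs).

```python
import numpy as np

def cross_entropy_loss(y_onehot, p_pred, eps=1e-12):
    """Mean cross-entropy for one-hot labels and per-class probabilities.

    Both arguments have shape (n_observations, n_classes).
    """
    y = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)  # avoid log(0)
    per_observation = -np.sum(y * np.log(p), axis=1)  # inner sum over classes
    return np.mean(per_observation)                    # outer average over observations

# Two observations, three classes; the true classes are 0 and 1
y_onehot = [[1, 0, 0],
            [0, 1, 0]]
p_pred = [[0.7, 0.2, 0.1],
          [0.1, 0.8, 0.1]]
ce = cross_entropy_loss(y_onehot, p_pred)
print(ce)  # equals -(log 0.7 + log 0.8) / 2
```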

### 6. Hinge Loss
- **Type:** Classification (Binary)
- **Mathematical Equation:** 
  $$
  \text{Hinge Loss} = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \hat{y}_i)
  $$
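  where $ y_i $ is the actual label (−1 or 1), $ \hat{y}_i $ is the raw score output by the classifier (for example, the signed distance to the decision boundary, not the thresholded label), and $ n $ is the number of observations.
- **How it works:** Hinge loss is used for training maximum-margin classifiers, particularly support vector machines (SVMs). It penalizes misclassified points and also correctly classified points that fall inside the margin, that is, points with $ y_i \hat{y}_i < 1 $. Points classified correctly with a margin of at least 1 incur zero loss.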
- **Use case:** Used in binary classification problems, especially with support vector machines (SVMs).
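
A minimal NumPy sketch of hinge loss, assuming the predictions are raw real-valued classifier scores and the labels are encoded as −1 or 1. The function name `hinge_loss` and the sample scores are made up for illustration.

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss for labels in {-1, +1} and real-valued scores."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)
    margins = y * s  # positive when the sign of the score matches the label
    return np.mean(np.maximum(0.0, 1.0 - margins))

y_true = [1, -1, 1, -1]
scores = [2.0, -0.5, -1.0, 0.3]
# Per-observation losses:
#   2.0  -> correct, outside the margin: 0
#  -0.5  -> correct, but inside the margin: 0.5
#  -1.0  -> wrong: 2.0
#   0.3  -> wrong: 1.3
loss = hinge_loss(y_true, scores)
print(loss)  # approximately 0.95
```

The second observation shows the margin effect: it is classified correctly but still contributes to the loss.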

### 7. Kullback-Leibler Divergence (KL Divergence)
- **Type:** Distribution matching (Probability Distributions)
- **Mathematical Equation:** 
  $$
  \text{KL Divergence} = \sum_{i=1}^{n} P(x_i) \log \left( \frac{P(x_i)}{Q(x_i)} \right)
  $$
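  where $ P(x) $ is the true probability distribution, $ Q(x) $ is the predicted (approximating) probability distribution, and the sum runs over the $ n $ possible outcomes $ x_i $.
- **How it works:** KL divergence measures how much the distribution $ Q $ diverges from the reference distribution $ P $. It is non-negative, equals zero only when $ P $ and $ Q $ are identical, and is not symmetric: the divergence of $ Q $ from $ P $ generally differs from the divergence of $ P $ from $ Q $.
- **Use case:** Used when a model outputs a probability distribution that should match a target distribution, for example as the regularization term in variational autoencoders.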
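
The sum above can be sketched for two discrete distributions given as arrays of probabilities. This is an illustrative sketch: the function name `kl_divergence` and the `eps` clipping of `q` are assumptions, and terms with $ P(x_i) = 0 $ are skipped, following the usual convention that they contribute zero.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence between discrete distributions p (true) and q (predicted)."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, None)  # avoid division by zero
    mask = p > 0  # outcomes with p = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)   # divergence with p as the reference
reverse = kl_divergence(q, p)   # swapping the roles gives a different value
same = kl_divergence(p, p)      # identical distributions give zero
print(forward, reverse, same)
```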

### 8. Poisson Loss
- **Type:** Regression (Count Data)
- **Mathematical Equation:** 
  $$
  \text{Poisson Loss} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \log(\hat{y}_i) + \log(y_i!) \right)
  $$
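  where $ y_i $ is the actual count, $ \hat{y}_i > 0 $ is the predicted mean count (the Poisson rate), and $ n $ is the number of observations. The term $ \log(y_i!) $ does not depend on the predictions, so it is often dropped during optimization.
- **How it works:** Poisson loss is the average negative log-likelihood of the observed counts under a Poisson distribution with mean $ \hat{y}_i $. Minimizing it is equivalent to maximum-likelihood estimation for a Poisson model.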
- **Use case:** Used in regression problems involving count data, such as the number of events occurring within a fixed period.
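
A minimal sketch of the Poisson loss, using `math.lgamma(y + 1)` from the standard library to evaluate $ \log(y_i!) $ without overflow. The function name `poisson_loss`, the `eps` clipping of the predicted rate, and the sample values are assumptions for illustration.

```python
import math
import numpy as np

def poisson_loss(y_true, rate_pred, eps=1e-12):
    """Mean Poisson negative log-likelihood for 1-D arrays of counts and rates."""
    y = np.asarray(y_true, dtype=float)
    # Predicted rates must be positive for the logarithm to be defined
    rate = np.clip(np.asarray(rate_pred, dtype=float), eps, None)
    # log(y!) computed as lgamma(y + 1); constant with respect to the predictions
    log_factorial = np.array([math.lgamma(v + 1.0) for v in y])
    return np.mean(rate - y * np.log(rate) + log_factorial)

counts = [2, 0, 5]
good_rates = [2.0, 0.5, 5.0]  # close to the observed counts
poor_rates = [6.0, 3.0, 1.0]  # far from the observed counts
print(poisson_loss(counts, good_rates))
print(poisson_loss(counts, poor_rates))  # larger than the value above
```

For a single observation, the loss is smallest when the predicted rate equals the observed count.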

These are some of the most commonly used loss functions in machine learning. The choice of loss function depends on the specific problem (regression or classification), the nature of the data, and the model being used.