# Loss functions

-----
### Dice Loss

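- Dice loss: $1 - \frac{2|X \cap Y|}{|X| + |Y|}$, where $X$ is the set of predicted pixels and $Y$ is the set of ground-truth pixels.
- The Dice coefficient, or Dice-Sørensen coefficient, measures the overlap between a predicted and a ground-truth segmentation mask. Subtracting it from 1 turns the metric into a loss function.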
- Used in segmentation problems.
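
The formula above can be sketched in plain Python. This is a minimal illustration, assuming flattened masks with values in [0, 1]; the function name `dice_loss` and the smoothing constant `eps` are choices made for this example, not part of the definition.

```python
def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss for two flat sequences of values in [0, 1]."""
    # For binary masks this sum equals |X ∩ Y|.
    intersection = sum(p * t for p, t in zip(pred, target))
    # |X| + |Y|
    total = sum(pred) + sum(target)
    # eps keeps the ratio defined when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


pred = [1, 1, 0, 0]    # two pixels predicted as foreground
target = [1, 0, 0, 0]  # one pixel is truly foreground
loss = dice_loss(pred, target)  # overlap 1, sizes 2 and 1 -> about 1 - 2/3
```

With soft (probabilistic) predictions the same expression stays differentiable, which is why this form is commonly used for training.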

### Cross Entropy Loss

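- Cross Entropy Loss: $- \sum_{i=1}^{n}y_i \log(\hat{y}_i)$, where $n$ is the number of classes, $y_i$ is the true label for class $i$ (typically one-hot), and $\hat{y}_i$ is the predicted probability for class $i$.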
- For binary classification: $-y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$.
- Cross-entropy loss (also known as log loss) is a common loss function used in supervised machine learning, particularly in classification problems. The function measures the dissimilarity between the predicted probability distribution and the true distribution.
- Used in classification problems.
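
Both formulas can be written out in a few lines of Python. This is a sketch for a single sample, assuming `y_true` is one-hot and `y_pred` is already a probability distribution; the clamp value `eps` is an assumption added to avoid `log(0)`.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for one sample."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one sample with label y in {0, 1}."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


# The true class is index 1 in both cases.
confident_right = cross_entropy([0, 1, 0], [0.05, 0.90, 0.05])  # -log(0.90)
confident_wrong = cross_entropy([0, 1, 0], [0.90, 0.05, 0.05])  # -log(0.05)
```

The loss is small when the true class receives high probability and grows quickly as that probability approaches zero.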

### MAE Loss (L1 Loss)

- MAE Loss: $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$.
- Mean Absolute Error is the average of the absolute differences between the target values and the predicted values. Because errors are penalized linearly, it is less sensitive to outliers than MSE.
- Used in regression problems.
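
A minimal sketch of the MAE formula in plain Python; the sample values are made up for illustration.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
error = mae(y_true, y_pred)  # (0.5 + 0.0 + 2.0 + 1.0) / 4 = 0.875
```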

### MSE Loss (L2 Loss)

- MSE Loss: $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
- Mean Squared Error, as the name suggests, is the average of the squared differences between the target values and the predicted values. Squaring penalizes large errors much more heavily than small ones.
- Used in regression problems.
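
To see how squaring changes the behaviour, the sketch below compares MSE with the mean absolute error on two made-up sets of predictions: one where every prediction is slightly off, and one where a single prediction is far off.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_abs_error(y_true, y_pred):
    """Mean of the absolute differences, included only for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
small_errors = [1.5, 2.5, 3.5, 4.5]  # every prediction off by 0.5
one_outlier = [1.0, 2.0, 3.0, 8.0]   # a single prediction off by 4.0

mse_small = mse(y_true, small_errors)            # 0.25
mse_outlier = mse(y_true, one_outlier)           # 4.0
abs_small = mean_abs_error(y_true, small_errors)    # 0.5
abs_outlier = mean_abs_error(y_true, one_outlier)   # 1.0
```

Here the single outlier raises the squared error by a factor of 16 but the absolute error only by a factor of 2.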


-----

### Regularization

L1 and L2 regularization are techniques that discourage overly complex or flexible models, so as to **reduce the risk of overfitting**.

Lasso and Ridge regression **add a penalty term to the linear regression loss function** to prevent overfitting: Lasso adds an L1 penalty and Ridge adds an L2 penalty.

1. L1 Regularization (Lasso):
   - L1 regularization adds a penalty term to the loss function that is proportional to the absolute values of the model's weights.
   - The L1 regularization term is calculated as the sum of the absolute values of the weights: $\lambda \sum_{i=1}^{n}|w_i|$, where $\lambda$ is the regularization strength.
   - L1 regularization can be used to perform feature selection, as it encourages the weights of irrelevant features to be set to zero.
   - L1 regularization can be useful when dealing with high-dimensional datasets with many irrelevant or redundant features, as it helps in automatic feature selection by eliminating less important features.
2. L2 Regularization (Ridge):
   - L2 regularization adds a penalty term to the loss function that is proportional to the square of the model's weights.
   - The L2 regularization term is calculated as the sum of the squared values of the weights: $\lambda \sum_{i=1}^{n}w_i^2$, where $\lambda$ is the regularization strength.
   - L2 regularization is useful for reducing the impact of correlated features, as it encourages the weights of correlated features to be similar.
   - L2 regularization can help in preventing overfitting by penalizing large weights and making the model more robust to noise in the training data.
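
The two penalty terms can be computed directly from a weight vector. The sketch below is illustrative only: the weights, the value of `lam`, and the `data_loss` figure are hypothetical numbers chosen for this example.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w_i|), the Lasso penalty."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum(w_i^2), the Ridge penalty."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 1.5]
lam = 0.1
data_loss = 0.8  # hypothetical unregularized loss, e.g. an MSE value

lasso_objective = data_loss + l1_penalty(weights, lam)  # 0.8 + 0.1 * 4.0
ridge_objective = data_loss + l2_penalty(weights, lam)  # 0.8 + 0.1 * 6.5
```

Note that the largest weight (-2.0) contributes half of the L1 penalty but more than 60% of the L2 penalty, which is why L2 mainly pushes down large weights.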


What's the difference between Lasso and Ridge regression?

1. Use L1 regularization (Lasso) when:
   - Feature selection is important, and you want to automatically eliminate less important features.
   - You have high-dimensional data with many irrelevant or redundant features.
   - You are interested in a sparse model with fewer non-zero coefficients.
2. Use L2 regularization (Ridge) when:
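   - You want to reduce overfitting and improve generalization while keeping all features in the model.
   - You don't necessarily need feature selection or sparsity.
   - You want a smooth penalty: the L2 term is differentiable everywhere (the L1 term is not differentiable at zero), which tends to make gradient-based optimization better behaved.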

When a dataset contains highly correlated features, how do the weights end up under L1 and L2?
1. L1 regularization tends to keep one of the correlated features, chosen somewhat arbitrarily, and set the weights of the others to zero.
2. L2 regularization tends to shrink the weights of highly correlated features towards each other, spreading the weight across them and reducing their magnitudes.
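
A small worked case makes the contrast concrete. Assume, as an idealized example, two identical features $x_1 = x_2$, so the prediction depends only on the sum $s = w_1 + w_2$, and take $s > 0$ as fixed by the data. The L2 penalty is then minimized by an even split:

$$\min_{w_1 + w_2 = s} \left(w_1^2 + w_2^2\right) = \frac{s^2}{2} \quad \text{at} \quad w_1 = w_2 = \frac{s}{2}$$

The L1 penalty, by contrast, is the same for every split with non-negative weights:

$$|w_1| + |w_2| = s \quad \text{for all} \quad w_1, w_2 \ge 0, \; w_1 + w_2 = s$$

So Ridge has a unique solution that shares the weight equally, while Lasso is indifferent between $(s, 0)$, $(0, s)$, and anything in between. When the features are only nearly identical, this indifference typically resolves in favor of a single feature.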

In PyTorch, regularization terms can be added to the loss function directly. For L2 regularization (weight decay), you can instead use the `weight_decay` parameter of the optimizer. For example:

`optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-5)`

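L1 and L2 regularization are conceptually part of the loss function, but applying L2 inside the optimizer is convenient in practice: the penalty's gradient is added during the update step, so the training loop does not need to compute the penalty itself. The `weight_decay` option covers only the L2 case, so an L1 penalty is usually added to the loss by hand.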
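
To show what `weight_decay` amounts to, here is a plain-Python sketch of a single update step. It assumes vanilla SGD without momentum, where the optimizer adds `weight_decay * w` to each parameter's gradient; the function names are hypothetical and this is not PyTorch's actual implementation.

```python
def sgd_step_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step where weight_decay * w is added to each gradient."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


def sgd_step_explicit_l2(weights, grads, lr, lam):
    """One SGD step on loss + (lam / 2) * sum(w^2), penalty gradient by hand."""
    penalty_grads = [lam * w for w in weights]  # d/dw of (lam / 2) * w^2
    return [w - lr * (g + pg) for w, g, pg in zip(weights, grads, penalty_grads)]


weights = [1.0, -2.0, 0.5]
grads = [0.1, 0.0, -0.3]  # made-up gradients of the unregularized loss

via_optimizer = sgd_step_weight_decay(weights, grads, lr=0.01, weight_decay=1e-5)
via_loss = sgd_step_explicit_l2(weights, grads, lr=0.01, lam=1e-5)
```

Under these assumptions the two steps coincide. Matching the penalty written as $\lambda \sum_i w_i^2$ would need `weight_decay` set to $2\lambda$, since that term has gradient $2\lambda w_i$. For adaptive optimizers such as Adam, weight decay folded into the gradient and decoupled weight decay (as in AdamW) are not the same update.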

-----

## Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a framework for training reinforcement learning agents using human feedback. The agent interacts with the environment, and feedback from a human teacher is used to update the agent's policy, either directly as a reward signal or through a reward model fitted to that feedback. The feedback can take various forms, such as binary rewards, preference comparisons, or natural language instructions. RLHF is most useful when a reward is hard to specify by hand but easy for people to judge, and it has applications in areas such as interactive machine learning, human-robot interaction, personalized recommendation systems, and the fine-tuning of large language models.
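
As one illustration of how preference comparisons can become a training signal, the sketch below uses a Bradley-Terry style pairwise loss, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$, which is a common choice for fitting a reward model. The setup is an assumption for this example: the text above does not prescribe a specific loss, and the scalar rewards are made-up numbers.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


agrees_with_human = preference_loss(2.0, 0.0)     # preferred answer scored higher
undecided = preference_loss(0.0, 0.0)             # equal scores -> log(2)
disagrees_with_human = preference_loss(0.0, 2.0)  # preferred answer scored lower
```

The loss is lowest when the reward model ranks the human-preferred response above the rejected one, and it grows as the ranking is reversed.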


-----
## Direct Preference Optimization (DPO)

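Direct Preference Optimization (DPO) is a technique for optimizing a model's parameters directly on preference data. Each training example consists of a prompt $x$, a preferred response $y_w$, and a rejected response $y_l$. Rather than first fitting a separate reward model and then running reinforcement learning against it, DPO minimizes a classification-style loss on the preference pairs themselves:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ is a frozen reference model, $\sigma$ is the sigmoid function, and $\beta$ controls how far the model may drift from the reference. The parameters are then updated by gradient descent on this loss:

$$\theta_{t+1} = \theta_t - \alpha \nabla_{\theta} \mathcal{L}_{\text{DPO}}(\theta_t)$$

where $\theta$ is the model's parameters, $\alpha$ is the learning rate, and $\nabla_{\theta} \mathcal{L}_{\text{DPO}}(\theta_t)$ is the gradient of the loss with respect to the model's parameters.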
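
As a hedged sketch, the per-pair DPO loss as usually published is $-\log \sigma(\beta[(\log \pi_\theta(y_w \mid x) - \log \pi_{\text{ref}}(y_w \mid x)) - (\log \pi_\theta(y_l \mid x) - \log \pi_{\text{ref}}(y_l \mid x))])$, where $y_w$ is the preferred response, $y_l$ the rejected one, and $\pi_{\text{ref}}$ a frozen reference model. The code below evaluates it for one pair in plain Python. The log-probabilities are made-up numbers, and the function name and default `beta` are assumptions for this example.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair of sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
at_reference = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# Policy has shifted probability towards the chosen response.
after_shift = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

The loss falls below its starting value of $\log 2$ once the policy favors the preferred response more strongly than the reference does.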
