### [Loss Function:](../04-Optimization/ml-loss-function.ipynb)
A loss function, also known as a cost function or objective function, is a fundamental concept in machine learning and optimization. It is a mathematical function that quantifies the error or discrepancy between the predicted values generated by a machine learning model and the actual target values (ground truth) in the training data.
Loss functions are commonly grouped by the task they are used for:
1. Regression Loss
2. Classification Loss

#### Regression Loss:
1. Mean Squared Error (MSE) or L2 Loss: Measures the average squared difference between the predicted values $\hat y_i$ and the true values $y_i$. It is sensitive to outliers.
    * $MSE=\displaystyle \frac {1}{N}\sum_{i=1}^{N}(y_i-\hat y_i)^2$

2. Mean Absolute Error (MAE) or L1 Loss: Measures the average absolute difference between the actual and the predicted values. MAE is less sensitive to outliers than Mean Squared Error (MSE), which makes it the better choice when the data contains outliers.
    * $MAE=\displaystyle \frac {1}{N}\sum_{i=1}^{N}\mid y_i-\hat y_i\mid$

3. Mean Bias Error (MBE): Takes the signed difference between the target and the predicted value rather than the absolute difference. Positive and negative errors can cancel each other out, which is why it is one of the lesser-used loss functions.
    * $MBE=\displaystyle \frac {1}{N}\sum_{i=1}^{N}( y_i-\hat y_i)$

4. Mean Squared Logarithmic Error (MSLE): The same as Mean Squared Error, except that it is computed on the natural logarithm of one plus the predicted and actual values rather than on the raw values. It is commonly used when the target values or the prediction errors have a wide range.
    * $MSLE=\displaystyle \frac {1}{N}\sum_{i=1}^{N}(\log( 1+\hat y_i)-\log (1+y_i))^2$

5. Huber Loss: Quadratic (like MSE) when the error of a data point is small and linear (like MAE) when the error is large. The main drawback is the extra hyperparameter $\delta$ that has to be tuned.
    * $J_{\delta}=\begin{cases}\frac{1}{2} (y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2} \delta^2 & \text{if } |y - \hat{y}| > \delta \end{cases}$
    * $\delta$ is a positive threshold parameter that controls the point at which the loss transitions from quadratic (like MSE) to linear (like MAE).
    * When $|y_i - \hat y_i|\leq \delta$, it behaves like the Mean Squared Error loss (MSE).
    * When $|y_i - \hat y_i|> \delta$, it behaves like the Mean Absolute Error loss (MAE).
    * The choice of the $\delta$ parameter is crucial in determining how much the Huber loss is influenced by outliers. Smaller values of $\delta$ make the loss function more robust to outliers, resembling MAE, while larger values make it more sensitive to outliers, resembling MSE.

In [38]:
import numpy as np

y = 10
f_x = 7

# Huber loss threshold
delta = 2

# Calculate the absolute error
absolute_error = np.abs(y - f_x)
print(absolute_error)

# Calculate the Huber loss
if absolute_error <= delta:
    huber_loss = 0.5 * (absolute_error**2)
else:
    huber_loss = delta * absolute_error - 0.5 * (delta**2)

print("Huber Loss:", huber_loss)

3
Huber Loss: 4.0
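
The regression losses listed above can be compared side by side on a small example. The following is a minimal sketch, not a reference implementation: the helper name `regression_losses`, the sample arrays, and the default `delta=1.0` are illustrative assumptions.

```python
import numpy as np

def regression_losses(y_true, y_pred, delta=1.0):
    """Return MSE, MAE, MBE, MSLE and mean Huber loss (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    abs_err = np.abs(err)
    # Huber: quadratic for |error| <= delta, linear beyond it
    huber = np.where(abs_err <= delta,
                     0.5 * err ** 2,
                     delta * abs_err - 0.5 * delta ** 2)
    return {
        "MSE": np.mean(err ** 2),
        "MAE": np.mean(abs_err),
        "MBE": np.mean(err),
        # log1p(x) = log(1 + x); values must be greater than -1
        "MSLE": np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2),
        "Huber": np.mean(huber),
    }

# Made-up data: two small errors of opposite sign, two exact predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 6.0, 2.0, 7.0])
clean = regression_losses(y_true, y_pred)

# Same data, but the last prediction is a gross outlier
y_pred_outlier = np.array([2.0, 6.0, 2.0, 27.0])
noisy = regression_losses(y_true, y_pred_outlier)

for name in clean:
    print(f"{name}: clean={clean[name]:.4f}  with outlier={noisy[name]:.4f}")
```

On the clean data the errors +1 and -1 cancel, so MBE is 0 even though the predictions are not perfect. The single outlier raises MSE from 0.5 to 100.5, while MAE only rises from 0.5 to 5.5.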


#### Classification Loss:

1. ***`Hinge Loss or SVM loss:`*** The hinge loss is a loss function used for training "maximum-margin" classifiers, most notably support vector machines (SVMs). Given an input $x_i$ (for example, an image) and a label $y_i$ that specifies the index of the correct class, let $s_j=f(x_i,W)_j$ denote the score of the $j$-th class. For a linear classifier $s_j = w_j^T x_i$, so the loss for example $i$ is $$L_i = \sum_{j\neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$ or, written in terms of the scores:
    * $L_i = \displaystyle{\sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)}$
        * $\Delta$ is a hyperparameter: the multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of $\Delta$.
    * Example: Three classes receive the scores $s=[13,-7,11]$, the first class is the true class (i.e. $y_i=0$), and $\Delta=10$.
        * $L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) = 0 + 8 = 8$
    * Regularization penalty (L2 norm):
        * $L =  \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss}$ and $R(W) = \displaystyle\sum_k\sum_l W_{k,l}^2$
        
        * Or $L = \displaystyle \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k\sum_l W_{k,l}^2$
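
The worked example and the regularized objective above can be checked numerically. This is a minimal sketch assuming a linear classifier with one weight row per class; the function names, the toy `X`, `y`, `W`, and the value `lam=0.1` are hypothetical choices for illustration.

```python
import numpy as np

def multiclass_hinge_loss(scores, true_class, delta=1.0):
    """Multiclass SVM loss L_i for one example, given its class scores."""
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[true_class] + delta)
    margins[true_class] = 0.0  # the sum runs over j != y_i only
    return margins.sum()

def l2_penalty(W):
    """R(W): sum of all squared weights."""
    return np.sum(np.asarray(W, dtype=float) ** 2)

def svm_objective(X, y, W, delta=1.0, lam=0.1):
    """Mean data loss plus lam * R(W), assuming scores f(x_i; W) = W x_i."""
    scores = X @ W.T  # shape (N, number of classes)
    data_loss = np.mean([multiclass_hinge_loss(s, yi, delta)
                         for s, yi in zip(scores, y)])
    return data_loss + lam * l2_penalty(W)

# Worked example from the text: s = [13, -7, 11], y_i = 0, Delta = 10
example_loss = multiclass_hinge_loss([13, -7, 11], true_class=0, delta=10)
print("L_i:", example_loss)

# Toy data (assumed): both examples are classified with a margin of 2 > delta,
# so the data loss is zero and only the regularization term remains
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = np.array([[2.0, 0.0], [0.0, 2.0]])
total = svm_objective(X, y, W, delta=1.0, lam=0.1)
print("Regularized loss:", total)
```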


In [39]:
import numpy as np
x=np.array([1,1,1,1])
w1=np.array([1,0,0,0])
w2=np.array([0.25,0.25,0.25,0.25])
print(f"w1.T*x: {w1.T*x}\t np.sum(w1.T*x):{np.sum(w1.T*x)}")
print(f"w2.T*x: {w2.T*x}\t np.sum(w2.T*x):{np.sum(w2.T*x)}")
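# Both weight vectors give the same score (w.T x = 1), but the L2 penalty
# R(w) = sum(w ** 2) is smaller for the diffuse weights w2
print(f"R(w1): {np.sum(w1 ** 2)}\t R(w2): {np.sum(w2 ** 2)}")


w1.T*x: [1 0 0 0]	 np.sum(w1.T*x):1
w2.T*x: [0.25 0.25 0.25 0.25]	 np.sum(w2.T*x):1.0
R(w1): 1	 R(w2): 0.25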


2. ***`Cross Entropy Loss or Softmax classifier:`***  It measures the dissimilarity between predicted probabilities and true class labels, encouraging the predicted probabilities to be as close as possible to the actual labels. Cross-entropy loss is especially suitable for models that output probability distributions over classes, such as logistic regression and neural networks with softmax activation in the output layer.
    * $$L_i = \displaystyle-\log\left(\displaystyle \frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j}$$
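    * The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is $H(p,q)=-\displaystyle \sum_x p(x)\log q(x)$. Here $q$ is the vector of softmax probabilities, $q_j = e^{f_j} / \sum_k e^{f_k}$, and $p$ is the one-hot distribution that puts all of its mass on the correct class $y_i$, so $H(p,q)$ reduces to $L_i$.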
    * Binary Cross Entropy: $L=\displaystyle -(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i))$
        * If $y_i=1$, then $L=\displaystyle -\log(\hat y_i)$
        * If $y_i=0$, then $L=\displaystyle -\log(1-\hat y_i)$
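
The binary cross-entropy formula can be evaluated directly to see how it rewards confident correct predictions and punishes confident wrong ones. This is a minimal sketch; the function name, the clipping constant `eps`, and the sample values are assumptions made for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Per-sample binary cross-entropy (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the probabilities so that log(0) never occurs; eps is an assumed choice
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# Labels and predicted probabilities of the positive class (made-up values):
# samples 1 and 3 are confidently right, samples 2 and 4 confidently wrong
bce = binary_cross_entropy([1, 1, 0, 0], [0.9, 0.1, 0.1, 0.9])
print(bce)
print("Mean loss:", bce.mean())
```

The confidently wrong predictions contribute $-\log(0.1)$ each, far more than the $-\log(0.9)$ of the confidently right ones.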

In [40]:
import numpy as np
f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
# p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup
# print(p)
# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
print(np.sum(p))

1.0
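
The same shift also makes the second form of the loss, $L_i = -f_{y_i} + \log\sum_j e^{f_j}$, safe to compute, because subtracting a constant from every score leaves $L_i$ unchanged. A minimal sketch, with the function name `softmax_cross_entropy` chosen here for illustration:

```python
import numpy as np

def softmax_cross_entropy(f, true_class):
    """L_i = -f_{y_i} + log(sum_j exp(f_j)), computed on shifted scores."""
    f = np.asarray(f, dtype=float)
    shifted = f - np.max(f)  # highest score becomes 0, so exp() cannot overflow
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return -shifted[true_class] + log_sum_exp

f = [123, 456, 789]
loss_correct = softmax_cross_entropy(f, true_class=2)  # true class has the top score
loss_wrong = softmax_cross_entropy(f, true_class=1)    # true class is 333 below the top
print(loss_correct, loss_wrong)

# For moderate scores the result agrees with the direct form -log(softmax)
g = np.array([1.0, 2.0, 3.0])
p = np.exp(g) / np.sum(np.exp(g))
direct = -np.log(p[0])
stable = softmax_cross_entropy(g, true_class=0)
print(direct, stable)
```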


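3. ***`Sparse Cross Entropy loss:`*** Sparse cross-entropy loss, also known as sparse categorical cross-entropy loss, is a specific form of the cross-entropy loss function used in multiclass classification problems where the true labels are integers representing class indices rather than one-hot encoded vectors. With $p_i$ denoting the predicted probability of the true class for sample $i$, the loss summed over all samples is given below; dividing it by the number of samples gives the average loss.
$$L=-\sum_i \log (p_i)$$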

In [59]:
import numpy as np

# create predicted probabilities
predicted_probs = np.array([[0.1, 0.6, 0.3],
                            [0.4, 0.2, 0.4],
                            [0.7, 0.1, 0.2]])

# Simulated true class labels (as integer class indices) rather than one-hot vectors
true_labels = np.array([1, 2, 0])
selected_probs=[]
for i in range(len(true_labels)):
    selected_probs.append(predicted_probs[i][true_labels[i]])

# Calculate the negative log probabilities for the true classes (loop version)
neg_log_selected_probs=-np.log(selected_probs)

avg_loss=np.mean(neg_log_selected_probs)

# Same calculation, vectorized with NumPy integer-array indexing
negative_log_probs = -np.log(predicted_probs[range(len(true_labels)), true_labels])

# Calculate the average loss
average_loss = np.mean(negative_log_probs)

print("Average Loss:", avg_loss, "\tNegative Log Probabilities:", neg_log_selected_probs)
print("Average Loss:", average_loss, "\tNegative Log Probabilities:", negative_log_probs)

Average Loss: 0.594597099859626 	Negative Log Probabilities: [0.51082562 0.91629073 0.35667494]
Average Loss: 0.594597099859626 	Negative Log Probabilities: [0.51082562 0.91629073 0.35667494]
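
The sparse form gives the same value as ordinary categorical cross-entropy on one-hot labels; it only skips building the one-hot vectors. The sketch below reuses the probabilities and labels from the cell above; the variable names `one_hot`, `categorical_loss`, and `sparse_loss` are illustrative.

```python
import numpy as np

predicted_probs = np.array([[0.1, 0.6, 0.3],
                            [0.4, 0.2, 0.4],
                            [0.7, 0.1, 0.2]])
true_labels = np.array([1, 2, 0])

# One-hot encode the integer labels: row i has a 1 in column true_labels[i]
one_hot = np.eye(predicted_probs.shape[1])[true_labels]

# Categorical cross-entropy H(p, q) = -sum_x p(x) log q(x), averaged over samples
categorical_loss = np.mean(-np.sum(one_hot * np.log(predicted_probs), axis=1))

# Sparse form: pick the true-class probability directly by index
rows = np.arange(len(true_labels))
sparse_loss = np.mean(-np.log(predicted_probs[rows, true_labels]))

print("Categorical:", categorical_loss)
print("Sparse:     ", sparse_loss)
```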
