# Loss Functions
Given an input vector, a loss function measures how badly a particular model performs in predicting a desired output quantity (regression) or in correctly labeling the input vector (classification).


Farhad Kamangar  Sept. 2019

## Mean Squared Error (MSE)

Mean Squared Error is the most commonly used regression loss function. It is defined as the mean of the squared differences between the desired and the predicted output(s).

$$\large MSE=\frac{1}{N}\sum_{i=1}^{N} {(t_i-y_i)}^2$$

where $N$ is the total number of samples, $t_i$ is the desired output for sample $i$, and $y_i$ is the actual (predicted) output for sample $i$.
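
The MSE formula above can be sketched in a few lines of NumPy. The function name `mean_squared_error` and the sample values are illustrative assumptions and do not come from the original notebook.

```python
import numpy as np

def mean_squared_error(t, y):
    """Mean of the squared differences between desired (t) and actual (y) outputs."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((t - y) ** 2)

# Illustrative values (assumed, not from the notebook)
t = np.array([1.0, 2.0, 3.0])   # desired outputs
y = np.array([1.5, 2.0, 2.0])   # actual outputs
mse = mean_squared_error(t, y)  # (0.25 + 0 + 1) / 3
print("MSE:", mse)
```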


## Mean Absolute Error (MAE)

Mean Absolute Error is another common loss function for regression models. It is defined as the mean of the absolute differences between the desired and the predicted output(s).

$$\large MAE=\frac{1}{N}\sum_{i=1}^{N} {|t_i-y_i|}$$

where $N$ is the total number of samples, $t_i$ is the desired output for sample $i$, and $y_i$ is the actual (predicted) output for sample $i$.
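
A minimal NumPy sketch of the MAE formula follows. The function name `mean_absolute_error` and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def mean_absolute_error(t, y):
    """Mean of the absolute differences between desired (t) and actual (y) outputs."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean(np.abs(t - y))

# Illustrative values (assumed, not from the notebook)
t = np.array([1.0, 2.0, 3.0])    # desired outputs
y = np.array([1.5, 2.0, 2.0])    # actual outputs
mae = mean_absolute_error(t, y)  # (0.5 + 0 + 1) / 3
print("MAE:", mae)
```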



## Hinge (Multiclass Support Vector Machine) Loss

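The hinge loss function is used for classification and is based on the concept of the maximum margin: the output for the true class should exceed the output for every other class by at least a fixed margin. The hinge loss for sample number $\large s$ is formulated as: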


$$\large L_s=\sum_{j=1,\, j \neq s_t}^{C} \max(0,y_j-y_{s_t}+\Delta)$$

where $\large s$ is the sample number, $\large C$ is the number of classes, $\large y_j$ is the output for class $\large j$, $\large s_t$ is the index of the true class for sample number $\large s$, and $\Delta$ is a constant that sets the desired margin.


The total loss across all the samples can be calculated as:

$$\large L=\frac {1}{N} \sum_{s=1}^{N} L_s$$


where $N$ is the number of samples, $L$ is the total loss over all the samples, and $s$ is the sample number.
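
The per-sample hinge loss and its average over all samples can be sketched in vectorized form as shown below. This is only one possible arrangement: the function name `hinge_loss_batch`, the score matrix, and the labels are assumptions made for illustration.

```python
import numpy as np

def hinge_loss_batch(scores, true_classes, delta=1.0):
    """Return (per-sample hinge losses, mean loss) for an N x C matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    true_classes = np.asarray(true_classes)
    rows = np.arange(scores.shape[0])
    # Score of the true class for each sample, kept as a column for broadcasting
    true_scores = scores[rows, true_classes][:, np.newaxis]
    margins = np.maximum(0.0, scores - true_scores + delta)
    # The sum excludes j == s_t, so zero out the true-class entries
    margins[rows, true_classes] = 0.0
    per_sample = margins.sum(axis=1)
    return per_sample, per_sample.mean()

# Illustrative scores for 2 samples and 3 classes (assumed values)
scores = np.array([[3.0, 1.0, 2.5],
                   [1.0, 4.0, 0.0]])
true_classes = np.array([0, 1])
per_sample, total = hinge_loss_batch(scores, true_classes)
print("Per-sample losses:", per_sample)
print("Total loss:", total)
```

In the first sample only class 2 violates the margin (2.5 - 3 + 1 = 0.5); in the second sample no class does, so its loss is zero.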




## Numerical Example
Consider a one-layer neural network with 3 nodes and a linear activation function.
This network is supposed to classify its input into one of three possible classes.
Assume that the input to this network is 4-dimensional and the loss function is the hinge (SVM) loss.
For simplicity, assume that this network does not have any bias.

In [1]:
import numpy as np
def calculate_output(input_vector, w):
    y = np.dot(input_vector,w)
    print("Actual output: ",y)
    return y
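def SVM_loss(y,true_class_index,delta=1):
    """
    This function calculates the multiclass hinge (SVM) loss for one sample.
    y: vector of outputs (one per class)
    true_class_index: index of the true class
    delta: margin constant
    Farhad Kamangar Apr. 2020
    
    """
    print("True class: ",true_class_index)
    # Margin violation of every class relative to the true class
    margins = np.maximum(0, y - y[true_class_index] + delta)
    # The true class itself is excluded from the sum
    margins[true_class_index] = 0
    loss_i = np.sum(margins)
    return loss_i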

x = np.array([[1.0, 1.0, 1,3], [1.0, 0,0,-3.5], [0,1,0,2],[5,1,2,3],[3,6,5,1],[3,7,7,1]])
true_class_index=[1,2,0,1,0,1]
w=np.array([[2,4,7], [1,5,6], [6,2,5], [7,2,5]])

print("X= ",x)
print("weights: ",w)
selected_sample_number=5
print("Selected sample number: ",selected_sample_number)
y=calculate_output(x[selected_sample_number], w)

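# y=np.array([30,10,40])
# The true class is hard-coded to 0 in this call; the commented-out call below uses the label list instead
loss=SVM_loss(y,0,delta=1)
# loss=SVM_loss(y,true_class_index[selected_sample_number],delta=1)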
print("SVM Loss: ", loss)


X=  [[ 1.   1.   1.   3. ]
 [ 1.   0.   0.  -3.5]
 [ 0.   1.   0.   2. ]
 [ 5.   1.   2.   3. ]
 [ 3.   6.   5.   1. ]
 [ 3.   7.   7.   1. ]]
weights:  [[2 4 7]
 [1 5 6]
 [6 2 5]
 [7 2 5]]
Selected sample number:  5
Actual output:  [ 62.  63. 103.]
True class:  0
SVM Loss:  44.0
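
The printed loss can be checked by hand. With the outputs $y=[62, 63, 103]$, the true class index $0$ that is passed to `SVM_loss`, and $\Delta=1$, the hinge formula gives:

$$\large L_s=\max(0,63-62+1)+\max(0,103-62+1)=2+42=44$$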


## Cross Entropy Loss

The cross entropy loss uses a softmax function to calculate the loss. 

### Softmax Function

The softmax function gives a probabilistic interpretation to the output values and it is formulated as:

$$\large S(i)=\frac{e^{y_i}}{\sum_{j=1}^{C}e^{y_j}}$$

where $\large S(i)$ is the softmax value corresponding to the output $\large y_i$, and $C$ is the number of classes. This function interprets the outputs as unnormalized log probabilities of the classes. Notice that the denominator of the above equation normalizes the probabilities so that they sum to 1.

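In other words, the softmax function takes a vector of floating point numbers and maps each number to a value between zero and one such that the values add up to 1. The mapping is exponential rather than proportional, but it preserves the ordering of the inputs: a larger output always receives a larger probability.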
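
A minimal sketch of the softmax function is shown below. Subtracting the maximum before exponentiating is an added assumption for numerical stability; it leaves the result unchanged because the common factor cancels in the ratio. The function name `softmax` and the input values are chosen for illustration.

```python
import numpy as np

def softmax(y):
    """Map a vector of real-valued outputs to probabilities that sum to 1."""
    y = np.asarray(y, dtype=float)
    shifted = y - np.max(y)   # assumption: shift to avoid overflow in exp
    exp_y = np.exp(shifted)
    return exp_y / np.sum(exp_y)

s = softmax([1.0, -0.1, 1.3])
print("Softmax:", s)
print("Sum:", s.sum())

# Large inputs would overflow a naive exp(y); the shifted version stays finite
large = softmax([1000.0, 1001.0])
print("Softmax of large inputs:", large)
```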

Using the softmax function, the cross entropy loss for sample $\large s$ can be calculated as:

$$\large L_s=-\log\left(\frac{e^{y_{s_t}}}{\sum_{j=1}^{C} e^{y_j}}\right)$$


where $\large s$ is the sample number, $C$ is the number of classes, and $\large s_t$ is the index of the true class for sample number $\large s$.


The above equation is a simplified version of the discrete cross entropy between two distributions.

Let's imagine that we have a true discrete distribution $p$ and an estimated discrete distribution $q$. The cross entropy between these two distributions is defined as:

$$\large H(p,q)= - \sum_{x}p(x)\log(q(x))$$

Notice that in a multi-class classification problem the true probability distribution has all zeros except for the correct class, $i$, which has the value of 1:

$$\large p=[0,0,\ldots,1,\ldots,0]$$

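If the above one-hot distribution $p$ is substituted into the general cross entropy equation, and the softmax output is used as the estimated distribution $q$, every term of the sum vanishes except the one for the correct class. This results in the simplified cross entropy loss: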

$$\large L_s=-\log\left(\frac{e^{y_{s_t}}}{\sum_{j=1}^{C} e^{y_j}}\right)$$

where $\large s_t$ is the index of the correct class.
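
The reduction from the general cross entropy to the simplified loss can be checked numerically. In the sketch below, the helper names `softmax` and `cross_entropy` and the output vector are assumptions made for illustration; the one-hot vector `p` plays the role of the true distribution and the softmax output plays the role of $q$.

```python
import numpy as np

def softmax(y):
    """Softmax with a max-shift for numerical stability (an added assumption)."""
    y = np.asarray(y, dtype=float)
    exp_y = np.exp(y - np.max(y))
    return exp_y / np.sum(exp_y)

def cross_entropy(p, q):
    """General discrete cross entropy H(p, q) = -sum_x p(x) log q(x)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

y = np.array([1.0, -0.1, 1.3])   # illustrative outputs
true_class = 0                   # illustrative true class index

q = softmax(y)                   # estimated distribution
p = np.zeros_like(q)             # true distribution: one-hot
p[true_class] = 1.0

general = cross_entropy(p, q)            # full sum over all classes
simplified = -np.log(q[true_class])      # only the true-class term
print("General form:   ", general)
print("Simplified form:", simplified)
```

Both forms give the same value because every term with $p(x)=0$ contributes nothing to the sum.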

Notice that to calculate the overall loss we still need to average the loss over all the samples:

$$\large L=\frac {1}{N} \sum_{s=1}^{N} L_s$$


where $N$ is the number of samples and $L_s$ is the cross entropy loss for sample $s$.
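
Averaging the cross entropy loss over several samples can be sketched as follows. The function name `cross_entropy_batch` and the inputs are illustrative assumptions; the log of the softmax is computed directly from shifted outputs, which is a common way to keep the calculation numerically stable.

```python
import numpy as np

def cross_entropy_batch(scores, true_classes):
    """Return (per-sample cross entropy losses, mean loss) for an N x C score matrix."""
    scores = np.asarray(scores, dtype=float)
    true_classes = np.asarray(true_classes)
    # Shift each row by its maximum; this does not change the softmax
    shifted = scores - scores.max(axis=1, keepdims=True)
    # log(softmax) = shifted - log(sum(exp(shifted)))
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_softmax[np.arange(scores.shape[0]), true_classes]
    return per_sample, per_sample.mean()

# Illustrative scores for 2 samples and 3 classes (assumed values)
scores = np.array([[1.0, -0.1, 1.3],
                   [0.0,  0.0, 0.0]])
true_classes = np.array([0, 2])
per_sample, total = cross_entropy_batch(scores, true_classes)
print("Per-sample losses:", per_sample)
print("Total loss:", total)
```

The second sample has equal outputs for all three classes, so every class receives probability 1/3 and its loss is $\log 3$.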

**Note:** There is no "softmax loss". The correct terminology is "cross-entropy loss". The "cross entropy loss" uses the "softmax" function to calculate the loss.  

## Numerical Example
Consider a one-layer neural network with 3 nodes and a linear activation function.
This network is supposed to classify its input into one of three possible classes.
Assume that the input to this network is 4-dimensional and the loss function is the categorical cross entropy.
For simplicity, assume that this network does not have any bias.

In [2]:
import numpy as np


def calculate_output(input_vector, w):
    y = np.dot(input_vector,w)
    print("Actual output: ",y)
    return y

def cross_entropy_categorical_loss(y,true_class_index):
    """
    This function calculates the categorical cross entropy for one sample.
    y: vector of outputs (one per class)
    true_class_index: index of the true class
    Farhad Kamangar Apr. 2020
    
    """
#     y=y - np.max(y) # Optional shift for numerical stability; it does not change the softmax
    softmax=np.exp(y) /np.sum(np.exp(y))
    print("Softmax: ",softmax)
    return -np.log(softmax[true_class_index])

x = np.array([[1.0, 1.0, 1,3], [1.0, 0,0,-3.5], [0,1,0,2],[5,1,2,3],[3,6,5,1],[3,7,7,1]])
true_class_index=[1,2,0,1,0,1]
w=np.array([[2,4,7], [1,5,6], [6,2,5], [7,2,5]])

print("X= ",x)
print("weights: ",w)
selected_sample_number=4
print("Selected sample number: ",selected_sample_number)
y=calculate_output(x[selected_sample_number], w)
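# Note: the computed output is overridden here, so the softmax and loss below
# are for these values and not for the printed "Actual output"
y=np.array([1,-0.1,1.3])
# The true class is hard-coded to 0 in this call; the commented-out call below uses the label list instead
loss=cross_entropy_categorical_loss(y,0)
# loss=cross_entropy_categorical_loss(y,true_class_index[selected_sample_number])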
print("Cross entropy loss: ", loss)

X=  [[ 1.   1.   1.   3. ]
 [ 1.   0.   0.  -3.5]
 [ 0.   1.   0.   2. ]
 [ 5.   1.   2.   3. ]
 [ 3.   6.   5.   1. ]
 [ 3.   7.   7.   1. ]]
weights:  [[2 4 7]
 [1 5 6]
 [6 2 5]
 [7 2 5]]
Selected sample number:  4
Actual output:  [49. 54. 87.]
Softmax:  [0.37275463 0.12407924 0.50316613]
Cross entropy loss:  0.9868348922324128
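
The printed values can be checked by hand. The code replaces the computed output with $y=[1, -0.1, 1.3]$ and passes the true class index $0$, so, with $\log$ denoting the natural logarithm (as in `np.log`) and values rounded to four decimal places:

$$\large S(0)=\frac{e^{1}}{e^{1}+e^{-0.1}+e^{1.3}}\approx\frac{2.7183}{2.7183+0.9048+3.6693}\approx 0.3728$$

$$\large L_s=-\log(S(0))\approx 0.9868$$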
