In supervised machine learning, we want to minimize the error on each training example during learning. This is done with an optimization strategy such as gradient descent, and the error being minimized comes from the loss function.

# What’s the Difference between a Loss Function and a Cost Function?
A loss function is for a single training example. It is also sometimes called an error function. A cost function, on the other hand, is the average loss over the entire training dataset. The optimization strategies aim at minimizing the cost function.
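
To make the distinction concrete, here is a minimal NumPy sketch. The four targets and predictions are made-up values chosen for illustration, and squared error is used only as an example loss.

```python
import numpy as np

# Hypothetical targets and predictions for four training examples.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Loss: one value per training example (squared error is used here).
per_example_loss = (y_true - y_pred) ** 2

# Cost: the average of the per-example losses over the whole dataset.
cost = per_example_loss.mean()

print(per_example_loss)  # one loss per example
print(cost)              # a single number for the dataset
```

The optimizer only ever sees the single cost value, which is why one very bad example can dominate training when the per-example loss grows quickly.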

# Loss Functions

## L1 (Absolute Error Loss) and L2 (Squared Error Loss)

*L1* and *L2* are two common loss functions in machine learning. Both measure the error that training then minimizes.

The **L1 loss function** is also known as **Least Absolute Deviations** (**LAD**).  
The **L2 loss function** is also known as **Least Squares Error** (**LS**).

Let's look briefly at each of them.

## L1 Loss function or Absolute Error loss
It minimizes the sum of the absolute differences between the true values and the predicted values. The absolute error for a single training example is the distance between the predicted and the actual value, irrespective of sign. This absolute error is also known as the L1 loss. **The L1 loss grows linearly with the size of the error.**

![Screenshot%202021-06-09%20142933.png](attachment:Screenshot%202021-06-09%20142933.png)
The **cost is the Mean of these Absolute Errors (MAE).**

![Screenshot%202021-06-09%20143058.png](attachment:Screenshot%202021-06-09%20143058.png)
**Advantage**   
->The MAE cost is more robust to outliers than MSE.

**Disadvantage**  
->The absolute value (modulus) operator is harder to handle mathematically because it is not differentiable at zero.  
->Its minimizer is not always unique: several different solutions can give the same MAE.
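
As a rough illustration of the definitions in this section, the sketch below computes the per-example absolute error and the MAE with NumPy. The function names and the sample values are assumptions made for this example, not part of any library.

```python
import numpy as np

def absolute_error(y_true, y_pred):
    """Per-example L1 loss: |y_true - y_pred|."""
    return np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))

def mae(y_true, y_pred):
    """Cost: mean of the per-example absolute errors."""
    return absolute_error(y_true, y_pred).mean()

# Made-up values for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 6.0])

print(absolute_error(y_true, y_pred))
print(mae(y_true, y_pred))

# Away from zero the derivative with respect to the prediction is just the
# sign of the error, so its magnitude does not shrink as the error shrinks.
# np.sign returns 0 at exactly zero error, where the true derivative is undefined.
grad = np.sign(y_pred - y_true)
print(grad)
```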

In [6]:
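# imports needed by this cell
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# trainX, trainy, testX, testy are assumed to come from the regression dataset cell (make_regression)
# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
# recent Keras versions name this argument learning_rate (lr is deprecated)
opt = SGD(learning_rate=0.01, momentum=0.9)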
model.compile(loss='mean_absolute_error', optimizer=opt, metrics=['mse'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
# evaluate the model
train_mse = model.evaluate(trainX, trainy, verbose=0)
test_mse = model.evaluate(testX, testy, verbose=0)
# plot loss during training
pyplot.title('Mean Absolute Error Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()


## L2 Loss Function or Squared Error loss
It minimizes the sum of the squared differences between the true values and the predicted values. The squared error for a single training example, also known as the L2 loss, is the square of the difference between the actual and the predicted value. **The L2 loss is quadratic in the error.**
![Screenshot%202021-06-09%20141205.png](attachment:Screenshot%202021-06-09%20141205.png)

The corresponding **cost function is the Mean of these Squared Errors (MSE).**
![Screenshot%202021-06-09%20141553.png](attachment:Screenshot%202021-06-09%20141553.png)
Let’s talk a bit more about the MSE loss function. It is a positive quadratic function (of the form ax^2 + bx + c where a > 0). Remember how it looks graphically?

![Screenshot%202021-06-09%20142002.png](attachment:Screenshot%202021-06-09%20142002.png)
**Advantage:**     
->A positive quadratic function has a single global minimum and no local minima, so we can never get stuck in one. For a model that is linear in its parameters (such as linear regression), gradient descent on the MSE therefore converges, if it converges at all, to the global minimum.  
->The MSE loss function penalizes the model for making large errors by squaring them.

**Disadvantage**     
->Squaring a large error makes it even larger. This property makes the MSE cost function less robust to outliers, so it should not be used if our data is prone to many outliers.
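
The effect of a single outlier on the two costs can be checked with a small NumPy sketch. The data below is invented for illustration: the predictions stay the same, and only the last target is replaced by an extreme value.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Made-up predictions and targets.
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
y_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Same targets, but the last one is replaced by an outlier.
y_outlier = y_clean.copy()
y_outlier[-1] = 50.0

# How many times larger does each cost become because of the one outlier?
mae_ratio = mae(y_outlier, y_pred) / mae(y_clean, y_pred)
mse_ratio = mse(y_outlier, y_pred) / mse(y_clean, y_pred)

print("MAE grows by a factor of", mae_ratio)
print("MSE grows by a factor of", mse_ratio)
```

Because the outlier's error is squared, the MSE grows by a far larger factor than the MAE, which is the sense in which MSE is less robust.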

In [7]:
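# imports needed by this cell
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# trainX, trainy, testX, testy are assumed to come from the regression dataset cell (make_regression)
# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
# recent Keras versions name this argument learning_rate (lr is deprecated)
opt = SGD(learning_rate=0.01, momentum=0.9)
# mean squared logarithmic error (MSLE) is a variant of MSE computed on log-transformed values
model.compile(loss='mean_squared_logarithmic_error', optimizer=opt, metrics=['mse'])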
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
# evaluate the model
train_mse = model.evaluate(trainX, trainy, verbose=0)
test_mse = model.evaluate(testX, testy, verbose=0)
# plot loss during training
pyplot.title('Mean Squared Logarithmic Error Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()


In [8]:
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# generate regression dataset
X, y = make_regression(n_samples=5000, n_features=20, noise=0.1, random_state=1)
# standardize dataset
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(y.reshape(len(y),1))[:,0]
# split into train and test
train1 = 2500
trainX, testX = X[:train1, :], X[train1:, :]
trainy, testy = y[:train1], y[train1:]
# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
# recent Keras versions name this argument learning_rate (lr is deprecated)
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='mean_squared_error', optimizer=opt)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
# evaluate the model
train_mse = model.evaluate(trainX, trainy, verbose=0)
test_mse = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
# plot loss during training
pyplot.title('Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()


In [1]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt


In [None]:
x_guess = tf.linspace(-1., 1., 100)
x_actual = tf.constant(0, dtype=tf.float32)

In [None]:
l1_loss = tf.abs(x_guess - x_actual)
l2_loss = tf.square(x_guess - x_actual)

In [None]:
# Assumes TensorFlow 2.x, which runs eagerly: no tf.Session is needed,
# and tensors are converted to arrays with .numpy()
plt.plot(x_guess.numpy(), l1_loss.numpy(), label='l1_loss')
plt.plot(x_guess.numpy(), l2_loss.numpy(), label='l2_loss')
plt.legend()
plt.show()

## Huber Loss

Huber loss is often used in regression problems. Compared with the L2 loss, it is less sensitive to outliers: it is a piecewise function, and when the residual is large the loss becomes a linear function of the residual. The Huber loss combines the best properties of MSE and MAE. It is quadratic for small errors and linear otherwise (its gradient behaves correspondingly: linear in the error for small errors, constant for large ones). It is identified by its delta parameter:
![Screenshot%202021-06-09%20143446.png](attachment:Screenshot%202021-06-09%20143446.png)

Here, $\delta$ is a chosen threshold parameter, $y$ is the true value, and $f(x)$ is the predicted value.

The advantage is that when the residual is small the loss behaves like the L2 loss, and when the residual is large it behaves like the L1 loss and grows only linearly.

**Huber loss is more robust to outliers than MSE. It is used in Robust Regression, M-estimation and Additive Modelling. A variant of Huber Loss is also used in classification.**
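
The piecewise behaviour can be sketched in NumPy as follows. This assumes the standard form of the Huber loss, $\frac{1}{2}r^2$ for $|r| \le \delta$ and $\delta(|r| - \frac{1}{2}\delta)$ otherwise, with residual $r = y - f(x)$. The function name `huber` and the default `delta=1.0` are choices made for this example.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Per-example Huber loss (standard form, assumed here)."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_r = np.abs(r)
    quadratic = 0.5 * r ** 2                  # used when |r| <= delta
    linear = delta * (abs_r - 0.5 * delta)    # used when |r| > delta
    return np.where(abs_r <= delta, quadratic, linear)

# Small residual: quadratic branch. Large residual: linear branch.
print(huber(0.0, 0.5))  # 0.5 * 0.5**2
print(huber(0.0, 3.0))  # 1.0 * (3.0 - 0.5)
```

The two branches agree at $|r| = \delta$, so the loss is continuous there, and the linear branch keeps a single large residual from dominating the cost.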

#### Pseudo-Huber loss function 

A smooth approximation of the Huber loss that is differentiable to all orders.

<img src=".\Images\img2.png">

Here $\delta$ is the chosen parameter: the larger its value, the steeper the linear part on both sides.

<img src=".\Images\img3.png">
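
A minimal NumPy sketch of the Pseudo-Huber loss follows. It assumes the commonly used form $\delta^2\left(\sqrt{1 + (r/\delta)^2} - 1\right)$ with residual $r = y - f(x)$. The function name `pseudo_huber` is hypothetical.

```python
import numpy as np

def pseudo_huber(y_true, y_pred, delta=1.0):
    """Per-example Pseudo-Huber loss (commonly used form, assumed here)."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return delta ** 2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)

# Near zero it is close to 0.5 * r**2, like the L2 loss.
print(pseudo_huber(0.0, 0.01), 0.5 * 0.01 ** 2)

# For large residuals it grows roughly linearly, with slope delta.
print(pseudo_huber(0.0, 100.0), pseudo_huber(0.0, 101.0))
```

Unlike the piecewise Huber loss, there is no branch here, which is what makes the function smooth everywhere.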


# Binary Classification Loss Functions   
  **1.Binary Cross-Entropy**    
  **2.Hinge Loss**


## Hinge Loss

->Hinge loss is often used for binary classification problems with ground truth t = 1 or -1 and predicted value y = wx + b.  
->Hinge loss is primarily used with Support Vector Machine (SVM) classifiers with class labels -1 and 1, so make sure you change the label of the ‘Malignant’ class in the dataset from 0 to -1.  
->Hinge loss penalizes not only wrong predictions but also right predictions that are not confident.  
Hinge loss for an input-output pair (x, y) is given as:
![Screenshot%202021-06-09%20160329.png](attachment:Screenshot%202021-06-09%20160329.png)
Hinge Loss simplifies the mathematics for SVM while maximizing the loss (as compared to Log-Loss). It is used when we want to make real-time decisions without a laser-sharp focus on accuracy.  
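
The points above can be checked with a short NumPy sketch, assuming the usual form of the hinge loss, $\max(0, 1 - t \cdot y)$, with label $t \in \{-1, 1\}$ and raw score $y$. The scores below are made up for illustration.

```python
import numpy as np

def hinge(t, y):
    """Per-example hinge loss max(0, 1 - t*y) for labels t in {-1, 1}."""
    return np.maximum(0.0, 1.0 - np.asarray(t, dtype=float) * np.asarray(y, dtype=float))

t = 1.0  # true label

print(hinge(t, 2.0))   # correct and confident: no penalty
print(hinge(t, 0.5))   # correct but not confident: small penalty
print(hinge(t, -1.0))  # wrong: larger penalty
```

A prediction only reaches zero loss once it is on the correct side with a margin of at least 1.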

In [4]:
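# Hinge Loss
# imports needed by this cell
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# trainX, trainy, testX, testy are assumed to come from the make_circles cell,
# with the class labels changed from 0/1 to -1/1 as described above
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='tanh'))
# recent Keras versions name this argument learning_rate (lr is deprecated)
opt = SGD(learning_rate=0.01, momentum=0.9)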
model.compile(loss='hinge', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
# evaluate the model
train_acc = model.evaluate(trainX, trainy, verbose=0)
test_acc = model.evaluate(testX, testy, verbose=0)
# plot loss during training
pyplot.title('Hinge Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()


In [None]:
x_guess2 = tf.linspace(-3., 5., 500)
x_actual2 = tf.convert_to_tensor([1.]*500)  # true label t = 1 for every point

# Hinge loss: max(0, 1 - t*y)
hinge_loss = tf.maximum(0., 1. - (x_guess2 * x_actual2))
# Assumes TensorFlow 2.x, which runs eagerly: no tf.Session is needed
plt.plot(x_guess2.numpy(), hinge_loss.numpy(), '--', label='hinge_loss')
plt.legend()
plt.show()

## Binary Cross Entropy Loss
Let us start by understanding the term ‘entropy’. Generally, we use entropy to indicate disorder or uncertainty. It is measured for a random variable X with probability distribution p(X): 
![Screenshot%202021-06-09%20161025.png](attachment:Screenshot%202021-06-09%20161025.png)
**The negative sign is used to make the overall quantity positive.**   
->A greater value of entropy for a probability distribution indicates a greater uncertainty in the distribution. Likewise, a smaller value indicates a more certain distribution.     

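->Cross-entropy extends this idea to a pair of distributions: it is small when the predicted probabilities agree with the true labels and large when they do not. This makes binary cross-entropy suitable as a loss function, since you want to minimize its value. We use binary cross-entropy loss for classification models that output a probability p.    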

**Probability that the element belongs to class 1 (or positive class) = p**    
**Then, the probability that the element belongs to class 0 (or negative class) = 1 - p**  

Then, the cross-entropy loss for output label y (can take values 0 and 1) and predicted probability p is defined as:

![Screenshot%202021-06-09%20161342.png](attachment:Screenshot%202021-06-09%20161342.png)

![Screenshot%202021-06-09%20153708.png](attachment:Screenshot%202021-06-09%20153708.png)
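
As a small illustration, the sketch below evaluates the binary cross-entropy for a single example, assuming the usual form $-(y \log p + (1 - y)\log(1 - p))$. The clipping constant `eps` is an assumption added here to avoid taking the log of zero.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Per-example binary cross-entropy for label y in {0, 1} and probability p."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Confident and correct: small loss.
print(binary_cross_entropy(1, 0.9))
# Confident and wrong: large loss.
print(binary_cross_entropy(1, 0.1))
# For a class-0 example only the log(1 - p) term is active.
print(binary_cross_entropy(0, 0.1))
```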

In [5]:
# Cross entropy loss
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=5000, noise=0.1, random_state=1)
# split into train and test
train1 = 2500
trainX, testX = X[:train1, :], X[train1:, :]
trainy, testy = y[:train1], y[train1:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))
# recent Keras versions name this argument learning_rate (lr is deprecated)
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
# evaluate the model
train_acc = model.evaluate(trainX, trainy, verbose=0)
test_acc = model.evaluate(testX, testy, verbose=0)
# plot loss during training
pyplot.title('Binary Cross Entropy Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()


# Multi-class Classification Loss Functions     
**1.Multi-class Cross Entropy Loss**   
**2.Kullback Leibler Divergence Loss**  

## Multi-Class Cross Entropy Loss
->Consider a dataset like Iris, where we need to predict one of three class labels: Setosa, Versicolor and Virginica. When the target variable has more than two classes like this, a multi-class classification loss function is used.  
->The multi-class cross-entropy loss is a generalization of the Binary Cross Entropy loss. The loss for input vector X_i and the corresponding one-hot encoded target vector Y_i is:  
![Screenshot%202021-06-09%20162151.png](attachment:Screenshot%202021-06-09%20162151.png)
**Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.**  


![Screenshot%202021-06-09%20162300.png](attachment:Screenshot%202021-06-09%20162300.png)
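
The following NumPy sketch shows softmax followed by the cross-entropy against a one-hot target, assuming the usual form $-\sum_j Y_{ij} \log p_{ij}$. The raw scores (`logits`) are made-up values for a three-class problem.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(np.asarray(y_onehot, dtype=float) * np.log(probs))

logits = np.array([2.0, 1.0, 0.1])    # made-up scores for three classes
y_onehot = np.array([1.0, 0.0, 0.0])  # the true class is the first one

probs = softmax(logits)
loss = categorical_cross_entropy(y_onehot, probs)

print(probs, probs.sum())
print(loss)
```

With a one-hot target only the term for the true class survives, so the loss is simply the negative log of the probability assigned to that class.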

In [3]:
# Multi-class Cross-entropy loss
from sklearn.datasets import make_blobs
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import to_categorical
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_blobs(n_samples=5000, centers=3, n_features=2, cluster_std=2, random_state=2)
# one hot encode output variable
y = to_categorical(y)
# split into train and test
train1 = 500
trainX, testX = X[:train1, :], X[train1:, :]
trainy, testy = y[:train1], y[train1:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(3, activation='softmax'))
# compile model
# recent Keras versions name this argument learning_rate (lr is deprecated)
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
# evaluate the model
train_acc = model.evaluate(trainX, trainy, verbose=0)
test_acc = model.evaluate(testX, testy, verbose=0)
# plot loss during training
pyplot.title('Categorical Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()


## KL-Divergence  
->The Kullback-Leibler Divergence is a measure of how one probability distribution differs from another. A KL-divergence of zero indicates that the distributions are identical.  
->The Kullback-Leibler Divergence loss calculates how far a given distribution is from the true distribution. It is used in more complex models such as autoencoders, where there is a need to learn a dense feature representation.  
![Screenshot%202021-06-09%20163029.png](attachment:Screenshot%202021-06-09%20163029.png)

![Screenshot%202021-06-09%20163110.png](attachment:Screenshot%202021-06-09%20163110.png)
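
A short NumPy sketch of the KL-divergence between two discrete distributions is given below, assuming the usual definition $\sum_i p_i \log(p_i / q_i)$. The two distributions are invented for illustration, and terms with $p_i = 0$ are skipped since they contribute nothing.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as arrays of probabilities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i == 0 contribute zero
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])  # made-up "true" distribution
q = np.array([0.9, 0.1])  # made-up approximating distribution

print(kl_divergence(p, p))  # identical distributions
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # not the same as KL(p || q)
```

Note that the divergence is not symmetric: swapping the two distributions generally gives a different value.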

In [2]:
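# KL Divergence
# imports needed by this cell
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# trainX, trainy, testX, testy are assumed to come from the make_blobs cell (one-hot encoded targets)
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(3, activation='softmax'))
# compile model
# recent Keras versions name this argument learning_rate (lr is deprecated)
opt = SGD(learning_rate=0.01, momentum=0.9)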
model.compile(loss='kullback_leibler_divergence', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
# evaluate the model
train_acc = model.evaluate(trainX, trainy, verbose=0)
test_acc = model.evaluate(testX, testy, verbose=0)
# plot loss during training
pyplot.title('KL Divergence Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()


In [6]:
arr3 = np.array([
[[11, 12, 8, 19],
[13, 21, 4, 19]],
[[5, 1, 7, 21],
[13, 9, 6, 1]],
[[9, 32, 18, -5],
[15, 25, 11, -6]]])
print(arr3[:2,1])

[[13 21  4 19]
 [13  9  6  1]]


In [4]:
print(arr3[:2, :].sum())

170
